Remote Components

Remote components provide a means of adding a remote physics analysis to a local OpenMDAO problem. One situation in which this may be desirable is when the time to carry out a full optimization exceeds an HPC job time limit. Without remote components, such a situation would normally require manual restarts of the optimization, limiting one to optimizers with that capability. Using remote components, one can keep a serial OpenMDAO optimization running continuously on a login node (e.g., using the nohup or screen Linux commands) while the parallel physics analyses are evaluated across several HPC jobs. Remote components may also be advantageous when the OpenMDAO problem contains components not streamlined for massively parallel environments.

In general, remote components use nested OpenMDAO problems in a server-client arrangement. The outer, client-side OpenMDAO model serves as the overarching analysis/optimization problem while the inner, server-side model serves as the isolated high-fidelity analysis. The server inside the HPC job remains open to evaluate function or gradient calls. Wall times for function and gradient calls are saved, and when the maximum previous time multiplied by a scale factor exceeds the remaining job time, the server will be relaunched.
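In pseudocode terms, the reboot logic amounts to the check below. The option and method names match those documented later on this page, while previous_times and server_manager are illustrative stand-ins, not MPhys internals:

```python
# Illustrative sketch of the server-relaunch check performed before a remote call
estimated_model_time = time_estimate_multiplier * max(previous_times) + time_estimate_buffer
if not server_manager.enough_time_is_remaining(estimated_model_time):
    server_manager.stop_server()   # shut down the expiring HPC job
    server_manager.start_server()  # relaunch the server in a fresh job
```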

Three general base classes are used to achieve this.

  • RemoteComp: Explicit component that wraps communication with the server, replicating inputs/outputs to/from the server-side group and requesting a new server when the estimated analysis time exceeds the remaining job time.

  • ServerManager: Used by RemoteComp to control and communicate with the server.

  • Server: Loads the inner OpenMDAO problem and evaluates function or gradient calls as requested by the ServerManager.

Currently, there is one derived class for each, using pbs4py for HPC job control and ZeroMQ for network communication.

  • RemoteZeroMQComp: Through the use of MPhysZeroMQServerManager, uses encoded JSON dictionaries to send and receive necessary information to and from the server.

  • MPhysZeroMQServerManager: Uses ZeroMQ socket and ssh port forwarding from login to compute node to communicate with server, and pbs4py to start, stop, and check status of HPC jobs.

  • MPhysZeroMQServer: Uses ZeroMQ socket to send and receive encoded JSON dictionaries.

RemoteZeroMQComp Options

| Option | Default | Acceptable Values | Acceptable Types | Description |
|---|---|---|---|---|
| acceptable_port_range | [5081, 6000] | N/A | N/A | port range to search through if 'port' is currently busy |
| additional_remote_inputs | [] | N/A | ['list'] | additional inputs not defined as design vars in the remote component |
| additional_remote_outputs | [] | N/A | ['list'] | additional outputs not defined as objectives/constraints in the remote component |
| additional_server_args | '' | N/A | N/A | optional arguments to give the server, in addition to --port <port number> |
| always_opt | False | [True, False] | ['bool'] | If True, force nonlinear operations on this component to be included in the optimization loop even if this component is not relevant to the design variables and responses. |
| derivs_method | N/A | ['jax', 'cs', 'fd', None] | N/A | the method to use for computing derivatives |
| distributed | False | [True, False] | ['bool'] | If True, set all variables in this component as distributed across multiple processes |
| dump_json | False | N/A | N/A | dump an input/output JSON file in the client |
| dump_separate_json | False | N/A | N/A | dump a separate input/output JSON file for each evaluation |
| pbs | Required | N/A | N/A | pbs4py Launcher object |
| port | 5081 | N/A | N/A | port number for server/client communication |
| reboot_only_on_function_call | True | N/A | N/A | only allow server reboots before function calls, not gradient calls; avoids having to rerun the forward solution on the next job, but shortens the current job time |
| run_root_only | False | [True, False] | ['bool'] | If True, call compute, compute_partials, linearize, apply_linear, apply_nonlinear, and compute_jacvec_product only on rank 0 and broadcast the results to the other ranks. |
| run_server_filename | mphys_server.py | N/A | N/A | Python file that will launch the Server class |
| time_estimate_buffer | 0.0 | N/A | ['float'] | constant time in seconds added to the model evaluation estimate; when using parallel remote components with very different evaluation times, setting this to the slowest component's estimated evaluation time avoids having the faster component's job expire while the slower one is being evaluated |
| time_estimate_multiplier | 2.0 | N/A | N/A | when determining whether to reboot the server, estimate the model run time as this factor times the maximum prior run time |
| use_derivative_coloring | False | [True, False] | ['bool'] | assign derivative coloring to objectives/constraints; only for cases with parallel servers |
| use_jit | True | [True, False] | ['bool'] | If True, attempt to use jit on compute_primal, assuming jax or some other AD package is active. |
| var_naming_dot_replacement | : | N/A | N/A | character used to replace '.' in design variable/response names on the client side |

Usage

When adding a RemoteZeroMQComp component, the two required options are run_server_filename, the server file to be launched on an HPC job, and pbs, the pbs4py Launcher object. The server file should accept the port number as a command-line argument to facilitate communication with the client. Within this file, the MPhysZeroMQServer class's get_om_group_function_pointer option is the pointer to the OpenMDAO Group or Multipoint class to be evaluated. By default, any design variables, objectives, and constraints defined in the group will be added on the client side. Any other desired inputs or outputs must be added through the additional_remote_inputs or additional_remote_outputs options. On the client side, any "." characters in these input and output names will be replaced by var_naming_dot_replacement.
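A minimal client-side sketch follows; the pbs4py queue settings and the extra input/output names here are placeholders for illustration, not part of MPhys:

```python
import openmdao.api as om
from pbs4py import PBS
from mphys.network.zmq_pbs import RemoteZeroMQComp

prob = om.Problem()
prob.model.add_subsystem(
    "remote",
    RemoteZeroMQComp(
        run_server_filename="mphys_server.py",  # file that launches the server on the HPC job
        pbs=PBS.k4(time=12),                    # pbs4py Launcher; queue and time are placeholders
        additional_remote_inputs=["mach"],      # hypothetical input beyond the design variables
        additional_remote_outputs=["aero.C_D"], # '.' becomes var_naming_dot_replacement client-side
    ),
)
prob.setup()
prob.run_model()
prob.model.remote.stop_server()  # stop the HPC job and the ssh port forwarding when done
```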

The screen output from a particular remote component's Nth server will be written to mphys_<component name>_serverN.out, where component name is the subsystem name of the RemoteZeroMQComp instance. Searching for the keyword "SERVER" will show what the server is currently doing; the keyword "CLIENT" will do the same on the client side. The HPC job for the component's server is named MPhys<port number>; the pbs4py-generated job submission script has the same name followed by ".pbs". Note that running the remote component itself on multiple processors is not supported and will trigger a SystemError.

Example

Two examples are provided for the supersonic panel aerostructural case: as_opt_remote_serial.py and as_opt_remote_parallel.py. Both run the optimization problem defined in as_opt_parallel.py, which contains a MultipointParallel class and thus evaluates two aerostructural scenarios in parallel. The serial remote example runs this group on one server. The parallel remote example, on the other hand, contains an OpenMDAO parallel group that runs two servers in parallel. Both examples use the same server file, mphys_server.py, but point to either as_opt_parallel.py or run.py by sending the model's filename through RemoteZeroMQComp's additional_server_args option. As demonstrated in this server file, additional configuration options may be sent to the server-side OpenMDAO group by combining a functor (called GetModel in this case) with additional_server_args. In this particular case, scenario name(s) are sent as additional_server_args from the client side; on the server side, the GetModel functor passes the scenario name(s) as OpenMDAO options to the server-side group. Using the scenario run_directory option, the scenarios can then be evaluated in different directories. In both examples, the remote component(s) use a K4 pbs4py Launcher object, which will launch, monitor, and stop jobs using the K4 queue of the NASA K-cluster.
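A stripped-down server file in the spirit of mphys_server.py might look like the sketch below; the GetModel functor, its arguments, and the inner-model import are illustrative stand-ins for the example's actual code:

```python
import argparse
from mphys.network.zmq_pbs import MPhysZeroMQServer

class GetModel:
    """Functor that builds the server-side OpenMDAO group, forwarding extra options."""
    def __init__(self, scenario_name):
        self.scenario_name = scenario_name

    def __call__(self):
        # hypothetical import of the inner model; the real example selects the
        # module based on a filename passed through additional_server_args
        from as_opt_parallel import Model
        return Model(scenario_name=self.scenario_name)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--port", type=int)            # always sent by the client
    parser.add_argument("--scenario_name", nargs="*")  # sent via additional_server_args
    args = parser.parse_args()

    server = MPhysZeroMQServer(
        args.port,
        get_om_group_function_pointer=GetModel(args.scenario_name),
        ignore_setup_warnings=True,
        ignore_runtime_warnings=True,
        rerun_initial_design=False,
    )
    server.run()
```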

Troubleshooting

The dump_json option for RemoteZeroMQComp will make the component write input and output JSON files, which contain all data sent to and received from the server. One exception is the wall_time entry (given in seconds) in the output JSON file, which is added on the client side after the server has completed the design evaluation. Another entry provided only for informational purposes is design_counter, which keeps track of how many different designs have been evaluated on the current server. If dump_separate_json is set to True, separate files will be written for each design evaluation. On the server side, an n2 file titled n2_inner_analysis_<component name>.html will be written after each evaluation.
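As a quick sanity check, the dumped file can be inspected directly. This is a sketch under assumptions: the filename below is a placeholder, and the two keys shown are the client-side entries described above, assumed here to sit at the top level of the file:

```python
import json

# filename is a placeholder; check the client's run directory for the actual name
with open("remote_comp_outputs.json") as f:
    outputs = json.load(f)

print(outputs["wall_time"])       # seconds, added by the client after the evaluation
print(outputs["design_counter"])  # number of designs evaluated on the current server
```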

Current Limitations

  • A pbs4py Launcher must be implemented for your HPC environment

  • On the client side, RemoteZeroMQComp.stop_server() should be added after your analysis/optimization to stop the HPC job and ssh port forwarding, which the server manager starts as a background process.

  • If stop_server is not called or the server stops unexpectedly, stopping the port forwarding manually is difficult, as it involves finding the ssh process associated with the remote server’s port number. This must be done on the same login node that the server was launched from.

  • Stopping the HPC job is somewhat easier as the job name will be MPhys followed by the port number; however, if runs are launched from multiple login nodes then one may have multiple jobs with the same name.

  • Currently, the of option (as well as wrt) for check_totals or compute_totals is not used by the remote component; on the server side, compute_totals will be evaluated for all responses (objectives, constraints, and additional_remote_outputs). Depending on how many of the responses are desired, this may be more costly than not using remote components.

  • The HPC environment must allow ssh port forwarding from the login node to a compute node.

class mphys.network.remote_component.RemoteComp(**kwargs)

A component used for network communication between the top-level OpenMDAO problem and a remote problem evaluated on an HPC job. Serves as the top-level component on the client side.

To make a particular derived class, implement the _setup_server_manager, _send_inputs_to_server, and _receive_outputs_from_server functions.
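A minimal skeleton of such a derived class is sketched below; the argument names are assumptions for illustration, not the exact internal signatures:

```python
from mphys.network.remote_component import RemoteComp

class MyRemoteComp(RemoteComp):
    def _setup_server_manager(self):
        # create and store the ServerManager that starts/stops the remote job,
        # e.g. self.server_manager = MyServerManager(...)
        ...

    def _send_inputs_to_server(self, remote_dict):
        # serialize the input/design dictionary and send it over the chosen transport
        ...

    def _receive_outputs_from_server(self):
        # block until the server replies, then return the decoded output dictionary
        ...
```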

Store some bound methods so we can detect runtime overrides.

initialize()

Perform any one-time initialization run at instantiation.

setup()

Declare inputs and outputs.

Available attributes: name, pathname, comm, options

compute(inputs, outputs)

Compute outputs given inputs. The model is assumed to be in an unscaled state.

An inherited component may choose to either override this function or to define a compute_primal function.

Parameters:

inputs : Vector
    Unscaled, dimensional input variables read via inputs[key].

outputs : Vector
    Unscaled, dimensional output variables read via outputs[key].

discrete_inputs : dict-like or None
    If not None, dict-like object containing discrete input values.

discrete_outputs : dict-like or None
    If not None, dict-like object containing discrete output values.

compute_partials(inputs, partials)

Compute sub-jacobian parts. The model is assumed to be in an unscaled state.

Parameters:

inputs : Vector
    Unscaled, dimensional input variables read via inputs[key].

partials : Jacobian
    Sub-jac components written to partials[output_name, input_name].

discrete_inputs : dict or None
    If not None, dict containing discrete input values.

class mphys.network.server_manager.ServerManager

A class used by the client-side RemoteComp to facilitate communication with the remote, server-side OpenMDAO problem.

To make a particular derived class, implement the start_server, stop_server, and enough_time_is_remaining functions.
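Sketched out, a derived manager fills in those three hooks; the remaining-walltime helper below is hypothetical:

```python
from mphys.network.server_manager import ServerManager

class MyServerManager(ServerManager):
    def start_server(self):
        # submit the HPC job that runs the server and wait until it accepts connections
        ...

    def stop_server(self):
        # shut down the server process and release any network resources
        ...

    def enough_time_is_remaining(self, estimated_model_time):
        # compare the estimate (seconds) against the job's remaining walltime
        return estimated_model_time < self._remaining_job_walltime()  # hypothetical helper
```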

start_server()

Start the server.

stop_server()

Stop the server.

enough_time_is_remaining(estimated_model_time)

Check if the current HPC job has enough time remaining to run the next analysis.

Parameters:

estimated_model_time : float
    How much time the new analysis is estimated to take

class mphys.network.server.Server(get_om_group_function_pointer, ignore_setup_warnings=False, ignore_runtime_warnings=False, rerun_initial_design=False)

A class that serves as an OpenMDAO model analysis server. It is launched by a server run file via the ServerManager, runs within an HPC job, and awaits design variables to evaluate, sending back the resulting function or derivative information.

To make a particular derived class, implement the _parse_incoming_message and _send_outputs_to_client functions.

Parameters:

get_om_group_function_pointer : function pointer
    Pointer to the OpenMDAO/MPhys group to evaluate on the server

ignore_setup_warnings : bool
    Whether to ignore OpenMDAO setup warnings

ignore_runtime_warnings : bool
    Whether to ignore OpenMDAO runtime warnings

rerun_initial_design : bool
    Whether to evaluate the baseline design upon startup

run()

Run the server.

class mphys.network.zmq_pbs.RemoteZeroMQComp(**kwargs)

A derived RemoteComp class that uses pbs4py for HPC job management and ZeroMQ for network communication.

Store some bound methods so we can detect runtime overrides.

initialize()

Perform any one-time initialization run at instantiation.

class mphys.network.zmq_pbs.MPhysZeroMQServerManager(pbs: PBS, run_server_filename: str, component_name: str, port=5081, acceptable_port_range=[5081, 6000], additional_server_args='')

A derived ServerManager class that uses pbs4py for HPC job management and ZeroMQ for network communication.

Parameters:

pbs : PBS
    pbs4py launcher used for HPC job management

run_server_filename : str
    Python filename that initializes and runs the MPhysZeroMQServer server

component_name : str
    Name of the remote component, used to direct the output of separate remote components to mphys_{component_name}_server{server_number}.out

port : int
    Desired port number for ssh port forwarding

acceptable_port_range : list
    Range of alternative port numbers if the specified port is already in use

additional_server_args : str
    Optional arguments to give the server, in addition to --port <port number>

start_server()

Start the server.

stop_server()

Stop the server.

enough_time_is_remaining(estimated_model_time)

Check if the current HPC job has enough time remaining to run the next analysis.

Parameters:

estimated_model_time : float
    How much time the new analysis is estimated to take

class mphys.network.zmq_pbs.MPhysZeroMQServer(port, get_om_group_function_pointer, ignore_setup_warnings=False, ignore_runtime_warnings=False, rerun_initial_design=False)

A derived Server class that uses ZeroMQ for network communication.