Remote Components¶
The purpose of remote components is to provide a means of adding a remote physics analysis to a local OpenMDAO problem. One situation in which this may be desirable is when the time required for a full optimization exceeds an HPC job time limit. Without remote components, such a situation would normally require manual restarts of the optimization, limiting you to optimizers with restart capability. Using remote components, one can keep a serial OpenMDAO optimization running continuously on a login node (e.g., using the nohup or screen Linux commands) while the parallel physics analyses are evaluated across several HPC jobs. Another situation where these components may be advantageous is when the OpenMDAO problem contains components not streamlined for massively parallel environments.
In general, remote components use nested OpenMDAO problems in a server-client arrangement. The outer, client-side OpenMDAO model serves as the overarching analysis/optimization problem while the inner, server-side model serves as the isolated high-fidelity analysis. The server inside the HPC job remains open to evaluate function or gradient calls. Wall times for function and gradient calls are saved, and when the maximum previous time multiplied by a scale factor exceeds the remaining job time, the server will be relaunched.
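This relaunch criterion can be summarized by the following sketch, where the parameter names mirror the time_estimate_multiplier and time_estimate_buffer options described below; the function itself is illustrative, not the actual implementation:

```python
# Illustrative relaunch check: relaunch the server when the estimated time for
# the next call would exceed the time left in the current HPC job.
def server_should_be_relaunched(max_previous_wall_time: float,
                                job_time_remaining: float,
                                time_estimate_multiplier: float = 2.0,
                                time_estimate_buffer: float = 0.0) -> bool:
    estimated_call_time = (time_estimate_multiplier * max_previous_wall_time
                           + time_estimate_buffer)
    return estimated_call_time > job_time_remaining
```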
Three general base classes are used to achieve this:

RemoteComp
: Explicit component that wraps communication with the server, replicating inputs/outputs to/from the server-side group and requesting a new server when the estimated analysis time exceeds the remaining job time.

ServerManager
: Used by RemoteComp to control and communicate with the server.

Server
: Loads the inner OpenMDAO problem and evaluates function or gradient calls as requested by the ServerManager.
Currently, there is one derived class for each; these use pbs4py for HPC job control and ZeroMQ for network communication:
RemoteZeroMQComp
: Through the use of MPhysZeroMQServerManager, uses encoded JSON dictionaries to send and receive the necessary information to and from the server.

MPhysZeroMQServerManager
: Uses a ZeroMQ socket and ssh port forwarding from the login node to the compute node to communicate with the server, and pbs4py to start, stop, and check the status of HPC jobs.

MPhysZeroMQServer
: Uses a ZeroMQ socket to send and receive encoded JSON dictionaries.
RemoteZeroMQComp Options¶
| Option | Default | Acceptable Values | Acceptable Types | Description |
|---|---|---|---|---|
| acceptable_port_range | [5081, 6000] | N/A | N/A | port range to look through if 'port' is currently busy |
| additional_remote_inputs | [] | N/A | ['list'] | additional inputs not defined as design vars in the remote component |
| additional_remote_outputs | [] | N/A | ['list'] | additional outputs not defined as objectives/constraints in the remote component |
| additional_server_args | N/A | N/A | N/A | Optional arguments to give the server, in addition to --port <port number> |
| always_opt | False | [True, False] | ['bool'] | If True, force nonlinear operations on this component to be included in the optimization loop even if this component is not relevant to the design variables and responses. |
| derivs_method | N/A | ['jax', 'cs', 'fd', None] | N/A | The method to use for computing derivatives |
| distributed | False | [True, False] | ['bool'] | If True, set all variables in this component as distributed across multiple processes |
| dump_json | False | N/A | N/A | dump input/output json file in client |
| dump_separate_json | False | N/A | N/A | dump a separate input/output json file for each evaluation |
| pbs | N/A | N/A | N/A | pbs4py Launcher object |
| port | 5081 | N/A | N/A | port number for server/client communication |
| reboot_only_on_function_call | True | N/A | N/A | only allow server reboot before a function call, not a gradient call; avoids having to rerun the forward solution on the next job, but shortens the current job time |
| run_root_only | False | [True, False] | ['bool'] | If True, call compute, compute_partials, linearize, apply_linear, apply_nonlinear, and compute_jacvec_product only on rank 0 and broadcast the results to the other ranks. |
| run_server_filename | mphys_server.py | N/A | N/A | python file that will launch the Server class |
| time_estimate_buffer | 0.0 | N/A | ['float'] | constant time in seconds added to the model evaluation estimate; when using parallel remote components with very different evaluation times, set to the slowest component's estimated evaluation time to avoid having the faster component's job expire while the slower one is being evaluated |
| time_estimate_multiplier | 2.0 | N/A | N/A | when determining whether to reboot the server, estimate the model run time as this value times the maximum prior run time |
| use_derivative_coloring | False | [True, False] | ['bool'] | assign derivative coloring to objectives/constraints; only for cases with parallel servers |
| use_jit | True | [True, False] | ['bool'] | If True, attempt to use jit on compute_primal, assuming jax or some other AD package is active. |
| var_naming_dot_replacement | : | N/A | N/A | character used to replace '.' in design variable/response names on the client side |
Usage¶
When adding a RemoteZeroMQComp component, the two required options are run_server_filename, which is the server file to be launched on an HPC job, and pbs, which is the pbs4py Launcher object. The server file should accept the port number as a command-line argument to facilitate communication with the client. Within this file, the MPhysZeroMQServer class's get_om_group_function_pointer option is the pointer to the OpenMDAO Group or Multipoint class to be evaluated.
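A minimal server file might look like the following sketch; the model module and group name are placeholders, and the full pattern is shown in mphys_server.py in the examples:

```python
# mphys_server.py (sketch): launches the inner OpenMDAO problem on the HPC job.
import argparse

from mphys.network.zmq_pbs import MPhysZeroMQServer


def get_model():
    # placeholder: return the OpenMDAO Group or Multipoint to be evaluated
    from my_analysis import AnalysisGroup  # hypothetical user module
    return AnalysisGroup()


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--port", type=int, help="port number for the ZeroMQ socket")
    args = parser.parse_args()

    server = MPhysZeroMQServer(
        args.port,
        get_om_group_function_pointer=get_model,
        ignore_setup_warnings=True,
        ignore_runtime_warnings=True,
    )
    server.run()
```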
By default, any design variables, objectives, and constraints defined in the group will be added on the client side. Any other desired inputs or outputs must be added in the additional_remote_inputs or additional_remote_outputs options. On the client side, any "." characters in these input and output names will be replaced by var_naming_dot_replacement.
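On the client side, adding the component might look like the following sketch; the launcher construction and the variable names are illustrative, so consult pbs4py for a Launcher suited to your machine:

```python
# Client-side sketch: runs on a login node and drives the remote analysis.
import openmdao.api as om
from pbs4py import PBS

from mphys.network.zmq_pbs import RemoteZeroMQComp

pbs = PBS.k4(time=12)  # hypothetical launcher setup; configure for your HPC queue

prob = om.Problem()
prob.model.add_subsystem(
    "remote_analysis",
    RemoteZeroMQComp(
        run_server_filename="mphys_server.py",
        pbs=pbs,
        additional_remote_inputs=["mach"],       # hypothetical extra input
        additional_remote_outputs=["aero.C_L"],  # exposed on the client as "aero:C_L"
    ),
)
prob.setup()
prob.run_model()

# stop the HPC job and the ssh port forwarding when finished
prob.model.remote_analysis.stop_server()
```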
The screen output from a particular remote component's Nth server will be sent to mphys_<component name>_serverN.out, where <component name> is the subsystem name of the RemoteZeroMQComp instance. Searching for the keyword "SERVER" will display what the server is currently doing; the keyword "CLIENT" will do the same on the client side. The HPC job for the component's server is named MPhys<port number>; the pbs4py-generated job submission script has the same name followed by ".pbs". Note that running the remote component itself in parallel is not supported; attempting to do so will raise a SystemError.
Example¶
Two examples are provided for the supersonic panel aerostructural case: as_opt_remote_serial.py and as_opt_remote_parallel.py. Both run the optimization problem defined in as_opt_parallel.py, which contains a MultipointParallel class and thus evaluates two aerostructural scenarios in parallel. The serial remote example runs this group on one server. The parallel remote example, on the other hand, contains an OpenMDAO parallel group which runs two servers in parallel. Both examples use the same server file, mphys_server.py, but point to either as_opt_parallel.py or run.py by sending the model's filename through the RemoteZeroMQComp's additional_server_args option. As demonstrated in this server file, additional configuration options may be sent to the server-side OpenMDAO group through the use of a functor (called GetModel in this case) in combination with additional_server_args. In this particular case, scenario name(s) are sent as additional_server_args from the client side; on the server side, the GetModel functor allows the scenario name(s) to be sent as OpenMDAO options to the server-side group. Using the scenario run_directory option, the scenarios can then be evaluated in different directories.
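A hedged sketch of this functor pattern, with illustrative module and class names, is shown below; the actual implementation is in the example's mphys_server.py:

```python
# Sketch of a functor that builds the server-side group. The model module is
# named via additional_server_args; "Model" is a placeholder class name.
import importlib


class GetModel:
    def __init__(self, scenario_name, model_filename):
        self.scenario_name = scenario_name
        self.model_filename = model_filename

    def __call__(self):
        # pass the scenario name through as an OpenMDAO option on the group
        module = importlib.import_module(self.model_filename)
        return module.Model(scenario_name=self.scenario_name)


# On the server side, an instance is handed to MPhysZeroMQServer:
#   get_om_group_function_pointer=GetModel(scenario_name, model_filename)
```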
In both examples, the remote component(s) use a K4 pbs4py Launcher object, which will launch, monitor, and stop jobs using the K4 queue of the NASA K-cluster.
Troubleshooting¶
The dump_json option for RemoteZeroMQComp will make the component write input and output JSON files, which contain all data sent to and received from the server. An exception is the wall_time entry (given in seconds) in the output JSON file, which is added on the client side after the server has completed the design evaluation. Another entry that is only provided for informational purposes is design_counter, which keeps track of how many different designs have been evaluated on the current server. If dump_separate_json is set to True, then separate files will be written for each design evaluation. On the server side, an n2 file titled n2_inner_analysis_<component name>.html will be written after each evaluation.
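As a quick illustration, the dumped output file can be inspected like this sketch; the filename is hypothetical, while wall_time and design_counter are the entries described above:

```python
# Sketch: read a client-side output JSON file produced by dump_json.
import json

with open("remote_analysis_output.json") as f:  # hypothetical filename
    outputs = json.load(f)

print(outputs["wall_time"])       # seconds, added by the client after evaluation
print(outputs["design_counter"])  # designs evaluated on the current server
```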
Current Limitations¶
- A pbs4py Launcher must be implemented for your HPC environment.
- On the client side, RemoteZeroMQComp.stop_server() should be added after your analysis/optimization to stop the HPC job and the ssh port forwarding, which the server manager starts as a background process.
- If stop_server is not called or the server stops unexpectedly, stopping the port forwarding manually is difficult, as it involves finding the ssh process associated with the remote server's port number. This must be done on the same login node that the server was launched from.
- Stopping the HPC job is somewhat easier, as the job name will be MPhys followed by the port number; however, if runs are launched from multiple login nodes, one may have multiple jobs with the same name.
- Currently, the of option (as well as wrt) for check_totals or compute_totals is not used by the remote component; on the server side, compute_totals will be evaluated for all responses (objectives, constraints, and additional_remote_outputs). Depending on how many of responses are desired, this may be more costly than not using remote components.
- The HPC environment must allow ssh port forwarding from the login node to a compute node.
- class mphys.network.remote_component.RemoteComp(**kwargs)[source]¶
A component used for network communication between top-level OpenMDAO problem and remote problem evaluated on an HPC job. Serves as the top-level component on the client side.
To make a particular derived class, implement the _setup_server_manager, _send_inputs_to_server, and _receive_outputs_from_server functions.
Store some bound methods so we can detect runtime overrides.
- compute(inputs, outputs)[source]¶
Compute outputs given inputs. The model is assumed to be in an unscaled state.
An inherited component may choose to either override this function or to define a compute_primal function.
- Parameters:
  - inputs : Vector
    Unscaled, dimensional input variables read via inputs[key].
  - outputs : Vector
    Unscaled, dimensional output variables read via outputs[key].
  - discrete_inputs : dict-like or None
    If not None, dict-like object containing discrete input values.
  - discrete_outputs : dict-like or None
    If not None, dict-like object containing discrete output values.
- compute_partials(inputs, partials)[source]¶
Compute sub-jacobian parts. The model is assumed to be in an unscaled state.
- Parameters:
  - inputs : Vector
    Unscaled, dimensional input variables read via inputs[key].
  - partials : Jacobian
    Sub-jac components written to partials[output_name, input_name].
  - discrete_inputs : dict or None
    If not None, dict containing discrete input values.
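As a hedged illustration of the extension hooks named in the class description above, a derived class skeleton might look like the following; the method signatures are assumptions, so consult the RemoteZeroMQComp source for the actual interfaces:

```python
# Skeleton of a custom RemoteComp subclass (illustrative). The three hooks are
# the ones named in the class description; signatures are assumed, not exact.
from mphys.network.remote_component import RemoteComp


class MyTransportRemoteComp(RemoteComp):
    def _setup_server_manager(self):
        # create the ServerManager that starts, stops, and monitors HPC jobs
        self.server_manager = ...

    def _send_inputs_to_server(self, remote_input_dict):
        # serialize the input dictionary and send it to the server
        raise NotImplementedError

    def _receive_outputs_from_server(self):
        # block until the server replies, then return the output dictionary
        raise NotImplementedError
```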
- class mphys.network.server_manager.ServerManager[source]¶
A class used by the client-side RemoteComp to facilitate communication with the remote, server-side OpenMDAO problem.
To make a particular derived class, implement the start_server, stop_server, and enough_time_is_remaining functions.
- class mphys.network.server.Server(get_om_group_function_pointer, ignore_setup_warnings=False, ignore_runtime_warnings=False, rerun_initial_design=False)[source]¶
A class that serves as an OpenMDAO model analysis server. Launched by a server run file by the ServerManager and runs on an HPC job, awaiting design variables to evaluate and sending back resulting function or derivative information.
To make a particular derived class, implement the _parse_incoming_message and _send_outputs_to_client functions.
- Parameters:
  - get_om_group_function_pointer : function pointer
    Pointer to the OpenMDAO/MPhys group to evaluate on the server
  - ignore_setup_warnings : bool
    Whether to ignore OpenMDAO setup warnings
  - ignore_runtime_warnings : bool
    Whether to ignore OpenMDAO runtime warnings
  - rerun_initial_design : bool
    Whether to evaluate the baseline design upon startup
- class mphys.network.zmq_pbs.RemoteZeroMQComp(**kwargs)[source]¶
A derived RemoteComp class that uses pbs4py for HPC job management and ZeroMQ for network communication.
Store some bound methods so we can detect runtime overrides.
- class mphys.network.zmq_pbs.MPhysZeroMQServerManager(pbs: PBS, run_server_filename: str, component_name: str, port=5081, acceptable_port_range=[5081, 6000], additional_server_args='')[source]¶
A derived ServerManager class that uses pbs4py for HPC job management and ZeroMQ for network communication.
- Parameters:
  - pbs : PBS
    pbs4py launcher used for HPC job management
  - run_server_filename : str
    Python filename that initializes and runs the MPhysZeroMQServer server
  - component_name : str
    Name of the remote component, for capturing output from separate remote components to mphys_{component_name}_server{server_number}.out
  - port : int
    Desired port number for ssh port forwarding
  - acceptable_port_range : list
    Range of alternative port numbers if the specified port is already in use
  - additional_server_args : str
    Optional arguments to give the server, in addition to --port <port number>