Remote Components

Remote components provide a means of adding a remote physics analysis to a local OpenMDAO problem. One situation in which this may be desirable is when the time to carry out a full optimization exceeds an HPC job time limit. Without remote components, such a situation would normally require manual restarts of the optimization, limiting one to optimizers with that capability. Using remote components, one can keep a serial OpenMDAO optimization running continuously on a login node (e.g., using the nohup or screen Linux commands) while the parallel physics analyses are evaluated across several HPC jobs. Remote components may also be advantageous when the OpenMDAO problem contains components not streamlined for massively parallel environments.

In general, remote components use nested OpenMDAO problems in a server-client arrangement. The outer, client-side OpenMDAO model serves as the overarching analysis/optimization problem while the inner, server-side model serves as the isolated high-fidelity analysis. The server inside the HPC job remains open to evaluate function or gradient calls. Wall times for function and gradient calls are saved, and when the maximum previous time multiplied by a scale factor exceeds the remaining job time, the server will be relaunched.
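In pseudocode terms, the reboot logic amounts to the check below. The option and method names match those documented later on this page, while previous_times and server_manager are illustrative stand-ins, not MPhys internals:

```python
# Illustrative sketch of the server-relaunch check performed before a remote call
estimated_model_time = time_estimate_multiplier * max(previous_times) + time_estimate_buffer
if not server_manager.enough_time_is_remaining(estimated_model_time):
    server_manager.stop_server()   # shut down the expiring HPC job
    server_manager.start_server()  # relaunch the server in a fresh job
```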

Three general base classes are used to achieve this.

  • RemoteComp: Explicit component that wraps communication with the server, replicating inputs/outputs to/from the server-side group and requesting a new server when the estimated analysis time exceeds the remaining job time.

  • ServerManager: Used by RemoteComp to control and communicate with the server.

  • Server: Loads the inner OpenMDAO problem and evaluates function or gradient calls as requested by the ServerManager.

Currently, there is one derived class for each, using pbs4py for HPC job control and ZeroMQ for network communication.

  • RemoteZeroMQComp: Through the use of MPhysZeroMQServerManager, uses encoded JSON dictionaries to send and receive necessary information to and from the server.

  • MPhysZeroMQServerManager: Uses ZeroMQ socket and ssh port forwarding from login to compute node to communicate with server, and pbs4py to start, stop, and check status of HPC jobs.

  • MPhysZeroMQServer: Uses ZeroMQ socket to send and receive encoded JSON dictionaries.

RemoteZeroMQComp Options

| Option | Default | Acceptable Values | Acceptable Types | Description |
|---|---|---|---|---|
| acceptable_port_range | [5081, 6000] | N/A | N/A | port range to search through if 'port' is currently busy |
| additional_remote_inputs | [] | N/A | ['list'] | additional inputs not defined as design vars in the remote component |
| additional_remote_outputs | [] | N/A | ['list'] | additional outputs not defined as objectives/constraints in the remote component |
| additional_server_args | '' | N/A | N/A | optional arguments to give the server, in addition to --port <port number> |
| always_opt | False | [True, False] | ['bool'] | If True, force nonlinear operations on this component to be included in the optimization loop even if this component is not relevant to the design variables and responses. |
| derivs_method | N/A | ['jax', 'cs', 'fd', None] | N/A | the method to use for computing derivatives |
| distributed | False | [True, False] | ['bool'] | If True, set all variables in this component as distributed across multiple processes |
| dump_json | False | N/A | N/A | dump an input/output JSON file in the client |
| dump_separate_json | False | N/A | N/A | dump a separate input/output JSON file for each evaluation |
| pbs | Required | N/A | N/A | pbs4py Launcher object |
| port | 5081 | N/A | N/A | port number for server/client communication |
| reboot_only_on_function_call | True | N/A | N/A | only allow server reboots before function calls, not gradient calls; avoids having to rerun the forward solution on the next job, but shortens the current job time |
| run_root_only | False | [True, False] | ['bool'] | If True, call compute, compute_partials, linearize, apply_linear, apply_nonlinear, and compute_jacvec_product only on rank 0 and broadcast the results to the other ranks. |
| run_server_filename | mphys_server.py | N/A | N/A | Python file that will launch the Server class |
| time_estimate_buffer | 0.0 | N/A | ['float'] | constant time in seconds added to the model evaluation estimate; when using parallel remote components with very different evaluation times, setting this to the slowest component's estimated evaluation time avoids having the faster component's job expire while the slower one is being evaluated |
| time_estimate_multiplier | 2.0 | N/A | N/A | when determining whether to reboot the server, estimate the model run time as this factor times the maximum prior run time |
| use_derivative_coloring | False | [True, False] | ['bool'] | assign derivative coloring to objectives/constraints; only for cases with parallel servers |
| use_jit | True | [True, False] | ['bool'] | If True, attempt to use jit on compute_primal, assuming jax or some other AD package is active. |
| var_naming_dot_replacement | : | N/A | N/A | character used to replace '.' in design variable/response names on the client side |

Usage

When adding a RemoteZeroMQComp component, the two required options are run_server_filename, the server file to be launched on an HPC job, and pbs, the pbs4py Launcher object. The server file should accept the port number as a command-line argument to facilitate communication with the client. Within this file, the MPhysZeroMQServer class's get_om_group_function_pointer option is the pointer to the OpenMDAO Group or Multipoint class to be evaluated. By default, any design variables, objectives, and constraints defined in the group will be added on the client side. Any other desired inputs or outputs must be added through the additional_remote_inputs or additional_remote_outputs options. On the client side, any "." characters in these input and output names will be replaced by var_naming_dot_replacement.
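A minimal client-side sketch follows; the pbs4py queue settings and the extra input/output names here are placeholders for illustration, not part of MPhys:

```python
import openmdao.api as om
from pbs4py import PBS
from mphys.network.zmq_pbs import RemoteZeroMQComp

prob = om.Problem()
prob.model.add_subsystem(
    "remote",
    RemoteZeroMQComp(
        run_server_filename="mphys_server.py",  # file that launches the server on the HPC job
        pbs=PBS.k4(time=12),                    # pbs4py Launcher; queue and time are placeholders
        additional_remote_inputs=["mach"],      # hypothetical input beyond the design variables
        additional_remote_outputs=["aero.C_D"], # '.' becomes var_naming_dot_replacement client-side
    ),
)
prob.setup()
prob.run_model()
prob.model.remote.stop_server()  # stop the HPC job and the ssh port forwarding when done
```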

The screen output from a particular remote component's Nth server will be written to mphys_<component name>_serverN.out, where component name is the subsystem name of the RemoteZeroMQComp instance. Searching for the keyword "SERVER" will show what the server is currently doing; the keyword "CLIENT" will do the same on the client side. The HPC job for the component's server is named MPhys<port number>; the pbs4py-generated job submission script has the same name followed by ".pbs". Note that running the remote component itself on multiple processors is not supported and will trigger a SystemError.

Example

Two examples are provided for the supersonic panel aerostructural case: as_opt_remote_serial.py and as_opt_remote_parallel.py. Both run the optimization problem defined in as_opt_parallel.py, which contains a MultipointParallel class and thus evaluates two aerostructural scenarios in parallel. The serial remote example runs this group on one server. The parallel remote example, on the other hand, contains an OpenMDAO parallel group that runs two servers in parallel. Both examples use the same server file, mphys_server.py, but point to either as_opt_parallel.py or run.py by sending the model's filename through RemoteZeroMQComp's additional_server_args option. As demonstrated in this server file, additional configuration options may be sent to the server-side OpenMDAO group by combining a functor (called GetModel in this case) with additional_server_args. In this particular case, scenario name(s) are sent as additional_server_args from the client side; on the server side, the GetModel functor passes the scenario name(s) as OpenMDAO options to the server-side group. Using the scenario run_directory option, the scenarios can then be evaluated in different directories. In both examples, the remote component(s) use a K4 pbs4py Launcher object, which will launch, monitor, and stop jobs using the K4 queue of the NASA K-cluster.
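A stripped-down server file in the spirit of mphys_server.py might look like the sketch below; the GetModel functor, its arguments, and the inner-model import are illustrative stand-ins for the example's actual code:

```python
import argparse
from mphys.network.zmq_pbs import MPhysZeroMQServer

class GetModel:
    """Functor that builds the server-side OpenMDAO group, forwarding extra options."""
    def __init__(self, scenario_name):
        self.scenario_name = scenario_name

    def __call__(self):
        # hypothetical import of the inner model; the real example selects the
        # module based on a filename passed through additional_server_args
        from as_opt_parallel import Model
        return Model(scenario_name=self.scenario_name)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--port", type=int)            # always sent by the client
    parser.add_argument("--scenario_name", nargs="*")  # sent via additional_server_args
    args = parser.parse_args()

    server = MPhysZeroMQServer(
        args.port,
        get_om_group_function_pointer=GetModel(args.scenario_name),
        ignore_setup_warnings=True,
        ignore_runtime_warnings=True,
        rerun_initial_design=False,
    )
    server.run()
```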

Troubleshooting

The dump_json option for RemoteZeroMQComp will make the component write input and output JSON files, which contain all data sent to and received from the server. One exception is the wall_time entry (given in seconds) in the output JSON file, which is added on the client side after the server has completed the design evaluation. Another entry provided only for informational purposes is design_counter, which keeps track of how many different designs have been evaluated on the current server. If dump_separate_json is set to True, separate files will be written for each design evaluation. On the server side, an n2 file titled n2_inner_analysis_<component name>.html will be written after each evaluation.
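As a quick sanity check, the dumped file can be inspected directly. This is a sketch under assumptions: the filename below is a placeholder, and the two keys shown are the client-side entries described above, assumed here to sit at the top level of the file:

```python
import json

# filename is a placeholder; check the client's run directory for the actual name
with open("remote_comp_outputs.json") as f:
    outputs = json.load(f)

print(outputs["wall_time"])       # seconds, added by the client after the evaluation
print(outputs["design_counter"])  # number of designs evaluated on the current server
```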

Current Limitations

  • A pbs4py Launcher must be implemented for your HPC environment

  • On the client side, RemoteZeroMQComp.stop_server() should be added after your analysis/optimization to stop the HPC job and ssh port forwarding, which the server manager starts as a background process.

  • If stop_server is not called or the server stops unexpectedly, stopping the port forwarding manually is difficult, as it involves finding the ssh process associated with the remote server’s port number. This must be done on the same login node that the server was launched from.

  • Stopping the HPC job is somewhat easier as the job name will be MPhys followed by the port number; however, if runs are launched from multiple login nodes then one may have multiple jobs with the same name.

  • Currently, the of option (as well as wrt) for check_totals or compute_totals is not used by the remote component; on the server side, compute_totals will be evaluated for all responses (objectives, constraints, and additional_remote_outputs). Depending on how many of the responses are desired, this may be more costly than not using remote components.

  • The HPC environment must allow ssh port forwarding from the login node to a compute node.

class mphys.network.remote_component.RemoteComp(**kwargs)

A component used for network communication between the top-level OpenMDAO problem and a remote problem evaluated on an HPC job. Serves as the top-level component on the client side.

To make a particular derived class, implement the _setup_server_manager, _send_inputs_to_server, and _receive_outputs_from_server functions.
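A minimal skeleton of such a derived class is sketched below; the argument names are assumptions for illustration, not the exact internal signatures:

```python
from mphys.network.remote_component import RemoteComp

class MyRemoteComp(RemoteComp):
    def _setup_server_manager(self):
        # create and store the ServerManager that starts/stops the remote job,
        # e.g. self.server_manager = MyServerManager(...)
        ...

    def _send_inputs_to_server(self, remote_dict):
        # serialize the input/design dictionary and send it over the chosen transport
        ...

    def _receive_outputs_from_server(self):
        # block until the server replies, then return the decoded output dictionary
        ...
```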

Store some bound methods so we can detect runtime overrides.

initialize()

Perform any one-time initialization run at instantiation.

setup()

Declare inputs and outputs.

Available attributes: name, pathname, comm, options

compute(inputs, outputs)

Compute outputs given inputs. The model is assumed to be in an unscaled state.

An inherited component may choose to either override this function or to define a compute_primal function.

Parameters:

inputs : Vector
    Unscaled, dimensional input variables read via inputs[key].

outputs : Vector
    Unscaled, dimensional output variables read via outputs[key].

discrete_inputs : dict-like or None
    If not None, dict-like object containing discrete input values.

discrete_outputs : dict-like or None
    If not None, dict-like object containing discrete output values.

compute_partials(inputs, partials)

Compute sub-jacobian parts. The model is assumed to be in an unscaled state.

Parameters:

inputs : Vector
    Unscaled, dimensional input variables read via inputs[key].

partials : Jacobian
    Sub-jac components written to partials[output_name, input_name].

discrete_inputs : dict or None
    If not None, dict containing discrete input values.

class mphys.network.server_manager.ServerManager

A class used by the client-side RemoteComp to facilitate communication with the remote, server-side OpenMDAO problem.

To make a particular derived class, implement the start_server, stop_server, and enough_time_is_remaining functions.
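Sketched out, a derived manager fills in those three hooks; the remaining-walltime helper below is hypothetical:

```python
from mphys.network.server_manager import ServerManager

class MyServerManager(ServerManager):
    def start_server(self):
        # submit the HPC job that runs the server and wait until it accepts connections
        ...

    def stop_server(self):
        # shut down the server process and release any network resources
        ...

    def enough_time_is_remaining(self, estimated_model_time):
        # compare the estimate (seconds) against the job's remaining walltime
        return estimated_model_time < self._remaining_job_walltime()  # hypothetical helper
```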

start_server()

Start the server.

stop_server()

Stop the server.

enough_time_is_remaining(estimated_model_time)

Check if the current HPC job has enough time remaining to run the next analysis.

Parameters:

estimated_model_time : float
    How much time the new analysis is estimated to take

class mphys.network.server.Server(get_om_group_function_pointer, ignore_setup_warnings=False, ignore_runtime_warnings=False, rerun_initial_design=False)

A class that serves as an OpenMDAO model analysis server. It is launched by a server run file via the ServerManager, runs within an HPC job, and awaits design variables to evaluate, sending back the resulting function or derivative information.

To make a particular derived class, implement the _parse_incoming_message and _send_outputs_to_client functions.

Parameters:

get_om_group_function_pointer : function pointer
    Pointer to the OpenMDAO/MPhys group to evaluate on the server

ignore_setup_warnings : bool
    Whether to ignore OpenMDAO setup warnings

ignore_runtime_warnings : bool
    Whether to ignore OpenMDAO runtime warnings

rerun_initial_design : bool
    Whether to evaluate the baseline design upon startup

run()

Run the server.

class mphys.network.zmq_pbs.RemoteZeroMQComp(**kwargs)

A derived RemoteComp class that uses pbs4py for HPC job management and ZeroMQ for network communication.

Store some bound methods so we can detect runtime overrides.

initialize()

Perform any one-time initialization run at instantiation.

class mphys.network.zmq_pbs.MPhysZeroMQServerManager(pbs: PBS, run_server_filename: str, component_name: str, port=5081, acceptable_port_range=[5081, 6000], additional_server_args='')

A derived ServerManager class that uses pbs4py for HPC job management and ZeroMQ for network communication.

Parameters:

pbs : PBS
    pbs4py launcher used for HPC job management

run_server_filename : str
    Python filename that initializes and runs the MPhysZeroMQServer server

component_name : str
    Name of the remote component, used to direct the output of separate remote components to mphys_{component_name}_server{server_number}.out

port : int
    Desired port number for ssh port forwarding

acceptable_port_range : list
    Range of alternative port numbers if the specified port is already in use

additional_server_args : str
    Optional arguments to give the server, in addition to --port <port number>

start_server()

Start the server.

stop_server()

Stop the server.

enough_time_is_remaining(estimated_model_time)

Check if the current HPC job has enough time remaining to run the next analysis.

Parameters:

estimated_model_time : float
    How much time the new analysis is estimated to take

class mphys.network.zmq_pbs.MPhysZeroMQServer(port, get_om_group_function_pointer, ignore_setup_warnings=False, ignore_runtime_warnings=False, rerun_initial_design=False)

A derived Server class that uses ZeroMQ for network communication.