6. Extension guide

N.B. The content of this section may vary depending on the version of moller.

6.1. Bulk job execution by moller

Bulk job execution means that a set of small tasks is executed in parallel within a single batch job submitted to a large batch queue. This is shown schematically below: N tasks are launched as background processes and executed in parallel, and a wait statement is invoked to wait for all of the tasks to complete.

task param_1 &
task param_2 &
     ...
task param_N &
wait

To manage such a bulk job, the nodes and cores allocated to the batch job must be distributed over the tasks param_1 … param_N so that the tasks run on distinct nodes and cores. The task execution must also be arranged so that no more tasks run simultaneously than the allocated resources permit.

Hereafter, a job script generated by moller will be denoted as a moller script. In a moller script, the concurrent execution and control of tasks are managed by GNU parallel [1]. It takes a list holding the items param_1 … param_N and runs a command for each item in parallel. An example is given below, where list.dat contains param_1 … param_N, one per line.

cat list.dat | parallel -j N task

The number of concurrent tasks is determined at runtime from the number of nodes and cores obtained from the execution environment and from the degree of parallelism of the task (the number of nodes, processes, and threads specified by the node parameter).
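
As a rough illustration (this is not moller's actual code), the concurrency passed to the -j option could be derived as follows, assuming that each task fits within a single node; all variable names and values are hypothetical.

# Hypothetical sketch of the concurrency calculation; not the actual moller
# implementation. Assumes each task fits within a single node.
nnodes=4        # allocated nodes (e.g. from SLURM_NNODES)
ncores=128      # cores per node (e.g. from SLURM_CPUS_ON_NODE)
nprocs=16       # MPI processes per task
nthreads=2      # OpenMP threads per process
slots_per_node=$(( ncores / (nprocs * nthreads) ))   # 128 / 32 = 4 tasks per node
njobs=$(( nnodes * slots_per_node ))                 # 4 * 4 = 16 concurrent tasks
cat list.dat | parallel -j "$njobs" task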

The way tasks are assigned to nodes and cores varies with the job scheduler. For SLURM job scheduler variants, the concurrent calls of the srun command within the batch job are assigned to distinct nodes and cores by exploiting the option for exclusive resource usage. The exact option may depend on the platform.
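
As a hedged illustration of this pattern (the exact options differ between platforms and SLURM versions, and the program and input names are placeholders):

# Each background job step asks for an exclusive share of the allocation,
# so SLURM places concurrent steps on distinct nodes/cores.
srun --exclusive -N 1 -n 16 -c 2 ./solver input_1 &
srun --exclusive -N 1 -n 16 -c 2 ./solver input_2 &
wait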

On the other hand, for PBS job scheduler variants that do not provide such a feature, the distribution of nodes and cores to tasks has to be handled within the moller script. The nodes and cores allocated to a batch job are divided into slots, and the slots are assigned to the concurrent tasks. The division is determined from the allocated nodes and cores and from the degree of parallelism of the task, and it is kept in the form of table variables. Within a task, the programs are executed on the assigned hosts and cores (optionally pinned to the program) through options to mpirun (or mpiexec) and through environment variables. This feature depends on the MPI implementation.
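
A hedged sketch of what a task may do with its assigned slot is shown below. The slot variables are illustrative names rather than moller's, and the host and binding options follow Open MPI style; other MPI implementations use different options.

slot_host=c01n003            # host taken from the slot table for this task (illustrative)
export OMP_NUM_THREADS=2     # threads per process for this task
# Open MPI style host/binding options; the syntax differs for other MPI implementations.
mpirun -np 16 --host "${slot_host}:16" --bind-to core ./solver input_5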

Reference

[1] O. Tange, GNU Parallel - The Command-Line Power Tool, ;login: The USENIX Magazine, February 2011, pp. 42-47.

6.2. How moller works

Structure of moller script

moller reads the input YAML file and generates a job script for bulk execution. The structure of the generated script is described as follows.

  1. Header

    This part contains the options to the job scheduler. The content of the platform section is formatted according to the type of job scheduler. This part depends on the platform.

  2. Prologue

    This part corresponds to the prologue section of the input file. The content of the code block is written as-is.

  3. Function definitions

    This part contains the definitions of functions and variables used within the moller script. The description of the functions will be given in the next section. This part depends on the platform.

  4. Processing command-line options

    The SLURM variants accept additional arguments to the job submission command (sbatch), which are passed to the job script as command-line options. These can be used to specify the name of the list file and/or options such as the retry feature; an example submission is given after this list.

    For the PBS variants, these command-line arguments are ignored; the name of the list file is therefore fixed to list.dat by default, and the retry feature may be enabled by editing the script and setting retry to 1.

  5. Description of tasks

    This part contains the description of tasks specified in the jobs section of the input file. When more than one task is given, the following procedure is applied to each task.

    When parallel = false, the content of the run block is written as-is.

    When parallel = true (default), a function named task_{task name} is created that contains the pre-processing for concurrent execution and the content of the run block. The keywords for parallel execution (srun, mpiexec, or mpirun) are substituted by the platform-dependent command. The definition of the task function is followed by the command for concurrent execution.

  6. Epilogue

    This part corresponds to the epilogue section of the input file. The content of the code block is written as-is.
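
As an example of item 4 above, on a SLURM variant the arguments placed after the script name in the sbatch command line are forwarded to the moller script. The script name and the exact arguments accepted are determined by the generated script; the invocation below is purely illustrative.

# Illustrative submission: the trailing argument (here the list file name)
# is passed to the moller script as a command-line option.
sbatch job.sh list.dat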

Brief description of moller script functions

The main functions of the moller script are briefly described below.

  • run_parallel

    This function performs the concurrent execution of task functions. It takes the degree of parallelism, the task function, and the status file as arguments. Within the function, it calls _find_multiplicity to find the number of tasks that can be run simultaneously, and invokes GNU parallel to run the tasks concurrently. The task function is wrapped by the _run_parallel_task function to deal with the nested invocation of GNU parallel. (A simplified sketch combining these functions is given after this list.)

    The platform dependence is separated out into the functions _find_multiplicity and _setup_run_parallel.

  • _find_multiplicity

    This function determines the number of tasks that can be executed simultaneously on the allocated resources (nodes and cores), taking into account the degree of parallelism of the task. For the PBS variants, the compute nodes and the cores are divided into slots, and the slots are kept as table variables. The information obtained at batch job execution is summarized as follows.

    • For SLURM variants:

      The number of allocated nodes (_nnodes) is taken from SLURM_NNODES.

      The number of allocated cores (_ncores) is taken from SLURM_CPUS_ON_NODE.

    • For PBS variants:

      The allocated nodes (_nodes[]) are the list of unique compute nodes obtained from the file given by PBS_NODEFILE.

      The number of allocated nodes (_nnodes) is the number of entries of _nodes[].

      The number of allocated cores is searched for in the following order:

      • NCPUS (for PBS Professional)

      • OMP_NUM_THREADS

      • the core parameter of the platform section (written in the script as the variable moller_core)

      • the ncpus or ppn parameter in the header

  • _setup_run_parallel

    This function is called from the run_parallel function to carry out some preparation before running GNU parallel. For the PBS variants, the slot tables are exported so that the task functions can refer to them. For the SLURM variants, there is nothing to do.
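
A much-simplified sketch of how these functions fit together is shown below. It is not moller's actual implementation: the variable _njobs, the argument order of _run_parallel_task, and the use of list.dat are assumptions made for illustration.

run_parallel () {
  local para="$1" taskfunc="$2" statusfile="$3"   # parallelism, task function, status file
  _find_multiplicity "$para"    # determine the number of simultaneous tasks (assumed to set _njobs)
  _setup_run_parallel           # platform-dependent preparation (e.g. export slot tables for PBS)
  # {} is the list entry and {%} is the slot ID assigned by GNU parallel;
  # status handling is omitted here, and in practice the functions must be
  # exported (export -f) so that GNU parallel can call them.
  cat list.dat | parallel -j "$_njobs" _run_parallel_task "$taskfunc" "$para" {} {%}
}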

The structure of the task function is described as follows; a schematic sketch is given after the list.

  • A task function is created with the name task_{task name}.

  • The arguments of the task function are 1) the degree of parallelism (the number of nodes, processes, and threads), 2) the execution directory (which corresponds to an entry of the list file), and 3) the slot ID assigned by GNU parallel.

  • The platform-dependent _setup_taskenv function is called to set up the execution environment. For the PBS variants, the compute node and the cores are obtained from the slot table based on the slot ID. For the SLURM variants, there is nothing to do.

  • The _is_ready function is called to check if the preceding task has been completed successfully. If it is true, the remaining part of the function is executed. Otherwise, the task is terminated with the status -1.

  • The content of the run block is written. The keywords for parallel calculation (srun, mpiexec, or mpirun) are substituted by the command provided for the platform.
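
Putting these pieces together, a generated task function has roughly the following shape. This is a schematic sketch only: the task name job1, the directory handling, and the substituted srun command line are illustrative, and the actual arguments of _setup_taskenv and _is_ready may differ.

task_job1 () {
  local para="$1" dir="$2" slot="$3"   # parallelism, execution directory, GNU parallel slot ID
  _setup_taskenv "$para" "$slot"       # pick the assigned node/cores (PBS); no-op for SLURM
  _is_ready "$dir" || return 255       # terminate if the preceding task did not complete successfully
  cd "$dir"
  # content of the run block; the srun/mpirun keyword has been replaced by the
  # platform-provided command, e.g. on a SLURM system:
  srun --exclusive -N 1 -n 16 -c 2 ./a.out
}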

6.3. How to extend moller for other systems

The latest version of moller provides profiles for the ISSP supercomputer systems ohtaka and kugui. A guide to extending moller to other systems is given in the following.

Class structure

The platform-dependent parts of moller are placed in the directory platform/. Their class structure is depicted below.

digraph class_diagram {
size="5,5"
node[shape=record,style=filled,fillcolor=gray95]
edge[dir=back,arrowtail=empty]

Platform[label="{Platform (base.py)}"]
BaseSlurm[label="{BaseSlurm (base_slurm.py)}"]
BasePBS[label="{BasePBS (base_pbs.py)}"]
BaseDefault[label="{BaseDefault (base_default.py)}"]

Ohtaka[label="{Ohtaka (ohtaka.py)}"]
Kugui[label="{Kugui (kugui.py)}"]
Pbs[label="{Pbs (pbs.py)}"]
Default[label="{DefaultPlatform (default.py)}"]

Platform->BaseSlurm
Platform->BasePBS
Platform->BaseDefault

BaseSlurm->Ohtaka
BasePBS->Kugui
BasePBS->Pbs
BaseDefault->Default
}

A factory is provided to select the system specified in the input file. A class is imported in platform/__init__.py and registered to the factory by register_platform(system_name, class_name); it then becomes available as the system parameter of the platform section in the input YAML file.

SLURM job scheduler variants

For the SLURM job scheduler variants, the system-specific settings should be applied in a class derived from the BaseSlurm class. The string that substitutes the keywords for the parallel execution of programs is given by the return value of the parallel_command() method. It corresponds to the srun command with the options for the exclusive use of resources. See ohtaka.py for an example.

PBS job scheduler variants

For the PBS job scheduler variants (PBS Professional, OpenPBS, Torque, and others), the system-specific settings should be applied in a class derived from the BasePBS class.

There are two ways of specifying the number of nodes for a batch job in the PBS variants. PBS Professional takes the form select=N:ncpus=n, while Torque and its derivatives take the form nodes=N:ppn=n. The BasePBS class has a parameter self.pbs_use_old_format that is set to True for the latter type.

The number of cores per compute node can be specified by the node parameter of the input file, while a default value may be set for a known system. In kugui.py, the number of cores per node is set to 128 by default.
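
For reference, the two header forms look as follows for a job requesting 2 nodes with 128 cores each (the values are illustrative):

# PBS Professional / OpenPBS form:
#PBS -l select=2:ncpus=128
# Torque form (pbs_use_old_format = True):
#PBS -l nodes=2:ppn=128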

Customizing features

When further customization is required, the methods of the base class may be overridden in the derived classes. The list of relevant methods is given below.

  • setup

    This method extracts parameters of the platform section.

  • parallel_command

    This method returns a string that is used to substitute the keywords for parallel execution of programs (srun, mpiexec, mpirun).

  • generate_header

    This method generates the header part of the job script that contains the options to the job scheduler. (An illustrative example of a generated header is given after this list.)

  • generate_function

    This method generates the functions used within the moller script. It calls the following methods to generate the function bodies and the variable definitions.

    • generate_variable

    • generate_function_body

    The definitions of the functions are provided as embedded strings within the class.
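
As an illustration of what generate_header may produce, a SLURM-style header could look like the following. The directives and values are only an example; the actual content depends on the platform section of the input and on the target scheduler.

#!/bin/bash
#SBATCH -p queue_name     # partition/queue (illustrative value)
#SBATCH -N 4              # number of nodes
#SBATCH -n 512            # number of processes
#SBATCH -t 01:00:00       # elapsed time limit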

Porting to a new type of job scheduler

The platform-dependent parts of the moller scripts are the calculation of the task multiplicity, the distribution of resources over tasks, and the command string for parallel calculation. The internal functions need to be developed using the following information about the platform:

  • how to acquire the allocated nodes and cores from the environment at the execution of batch jobs.

  • how to launch parallel calculation (e.g. mpiexec command) and how to assign the nodes and cores to the command.

To find which environment variables are set within the batch jobs, it may be useful to call the printenv command in the job script.
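
For example, a line such as the following placed in a trial job script records the environment for inspection (the output file name is arbitrary):

printenv | sort > environment.txt   # dump the environment variables visible to the batch job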

Troubleshooting

When the variable _debug in the moller script is set to 1, debug output is printed during the execution of the batch job. If the job does not work well, it is recommended to turn on the debug option and examine the output to check whether the internal parameters are defined appropriately.