6. Extension guide
N.B. The content of this section may vary depending on the version of moller.
6.1. Bulk job execution by moller
A bulk job execution means that a set of small tasks is executed in parallel within a single batch job submitted to a large batch queue. This is shown schematically below, where N tasks are launched as background processes and executed in parallel, and a wait statement is invoked to wait for all tasks to complete.
task param_1 &
task param_2 &
...
task param_N &
wait
To manage the bulk job, the nodes and cores allocated to the batch job must be distributed over the tasks param_1 … param_N so that the tasks are executed on distinct nodes and cores. Task execution must also be arranged so that at most N tasks run simultaneously according to the allocated resources.
Hereafter, a job script generated by moller will be denoted as a moller script. In a moller script, the concurrent execution and control of tasks are managed by GNU parallel [1]. It takes a list holding the items param_1 … param_N and runs a command for each item in parallel. An example is given below, where list.dat contains param_1 … param_N, one per line.
cat list.dat | parallel -j N task
The number of concurrent tasks is determined at runtime from the number of nodes and cores obtained from the execution environment and from the degree of parallelism of the task (the number of nodes, processes, and threads specified by the node parameter).
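As a rough illustration, the concurrency is essentially the allocated resources divided by the resources consumed by one task. The following Python sketch shows this bookkeeping (illustrative only; the actual moller script implements the equivalent logic in shell, and all names here are assumptions):

def concurrency(nnodes, cores_per_node, task_nodes, task_procs, task_threads):
    # Sketch (not moller's actual code): how many tasks fit into the
    # allocation, given the per-task degree of parallelism.
    if task_nodes > 1:
        # multi-node tasks occupy whole nodes
        return nnodes // task_nodes
    cores_per_task = task_procs * task_threads
    return nnodes * (cores_per_node // cores_per_task)

# e.g. 4 nodes with 128 cores each, tasks of 1 node x 8 procs x 2 threads
print(concurrency(4, 128, 1, 8, 2))   # -> 32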
The way to assign tasks to nodes and cores varies according to the job scheduler.
For SLURM job scheduler variants, the concurrent calls of the srun command within the batch job are appropriately assigned to the nodes and cores by exploiting the options for exclusive resource usage. The exact options may depend on the platform.
On the other hand, for PBS job scheduler variants that do not have such a feature, the distribution of nodes and cores to tasks has to be handled within the moller script. The nodes and cores allocated to a batch job are divided into slots, and the slots are assigned to the concurrent tasks. The division is determined from the allocated nodes and cores and the degree of parallelism of the task, and is kept in the form of table variables. Within a task, the programs are executed on the assigned hosts and cores (optionally with the programs pinned to the cores) through options to mpirun (or mpiexec) and environment variables. This feature depends on the MPI implementation.
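The idea of the slot table can be rendered in Python as follows (a minimal sketch under assumed names, not moller's actual shell implementation):

def build_slots(nodes, cores_per_node, procs, threads):
    # Sketch: divide the allocated nodes and cores into slots,
    # one slot per concurrently running task.
    cores_per_task = procs * threads
    slots = []
    for node in nodes:
        for start in range(0, cores_per_node - cores_per_task + 1, cores_per_task):
            # a slot records the host and the core IDs reserved on it
            slots.append((node, list(range(start, start + cores_per_task))))
    return slots

# e.g. two 8-core nodes and tasks of 2 procs x 2 threads -> 4 slots
print(build_slots(["node01", "node02"], 8, 2, 2))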
Reference
[1] GNU Parallel, https://www.gnu.org/software/parallel/
6.2. How moller works
Structure of a moller script
moller reads the input YAML file and generates a job script for bulk execution. The structure of the generated script is described as follows.
Header
This part contains the options to the job scheduler. The content of the platform section is formatted according to the type of job scheduler. This part is platform-dependent.
Prologue
This part corresponds to the prologue section of the input file. The content of its code block is written as-is.
Function definitions
This part contains the definitions of the functions and variables used within the moller script. The functions are described in the next section. This part is platform-dependent.
Processing of command-line options
The SLURM variants accept additional arguments to the job submission command (sbatch) that are passed to the job script as command-line options. The name of the list file and/or options such as the retry feature can be processed here.
For the PBS variants, these command-line arguments are ignored; therefore the name of the list file is fixed to list.dat by default, and the retry feature may be enabled by modifying the script with retry set to 1.
Description of tasks
This part contains the description of the tasks specified in the jobs section of the input file. When more than one task is given, the following procedure is applied to each task.
When parallel = false, the content of the run block is written as-is.
When parallel = true (the default), a function named task_{task name} is created that contains the pre-processing for concurrent execution and the content of the run block. The keywords for the parallel execution (srun, mpiexec, or mpirun) are substituted by the platform-dependent command, as illustrated in the sketch after this list. The definition of the task function is followed by the concurrent execution command.
Epilogue
This part corresponds to the epilogue section of the input file. The content of its code block is written as-is.
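The keyword substitution mentioned in the description of tasks can be sketched as follows (illustrative only, with assumed names; moller's actual implementation may differ):

def substitute_keywords(run_block, parallel_cmd):
    # Sketch: replace the parallel-execution keywords in a run block
    # with the platform-dependent launch command.
    for keyword in ("srun", "mpiexec", "mpirun"):
        run_block = run_block.replace(keyword, parallel_cmd)
    return run_block

# e.g. on a SLURM platform the keyword may become srun with extra options
print(substitute_keywords("srun ./a.out", "srun --exclusive"))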
Brief description of moller script functions
The main functions of the moller script are briefly described below.
run_parallel
This function performs the concurrent execution of task functions. It takes the degree of parallelism, the task function, and the status file as arguments. Within the function, it calls _find_multiplicity to find the number of tasks that can be run simultaneously, and invokes GNU parallel to run the tasks concurrently. The task function is actually wrapped by the _run_parallel_task function to deal with the nested call of GNU parallel. The platform dependence is separated out into the functions _find_multiplicity and _setup_run_parallel.
_find_multiplicity
This function determines the number of tasks that can be executed simultaneously on the allocated resources (nodes and cores), taking into account the degree of parallelism of the task. For the PBS variants, the compute nodes and cores are divided into slots, and the slots are kept as table variables. The information obtained at batch job execution is summarized as follows.
For SLURM variants,
The number of allocated nodes (_nnodes)
Obtained from SLURM_NNODES.
The number of allocated cores (_ncores)
Obtained from SLURM_CPUS_ON_NODE.
For PBS variants,
The allocated nodes (_nodes[])
The list of unique compute nodes is obtained from the file given by PBS_NODEFILE.
The number of allocated nodes (_nnodes)
The number of entries of _nodes[].
The number of allocated cores
Searched for in the following order of examination: NCPUS (for PBS Professional), OMP_NUM_THREADS, the core parameter of the platform section (written in the script as the variable moller_core), and the ncpus or ppn parameter in the header. A Python rendering of this lookup is sketched after this list.
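The lookup order for the number of cores can be rendered in Python as follows (a sketch only; the actual moller script implements this in shell, and the argument names here are assumptions):

import os

def allocated_cores(moller_core=None, header_cores=None):
    # Sketch of the search order described above: NCPUS, then
    # OMP_NUM_THREADS, then the core parameter of the platform
    # section, then the ncpus/ppn value from the header.
    for value in (os.environ.get("NCPUS"),
                  os.environ.get("OMP_NUM_THREADS"),
                  moller_core,
                  header_cores):
        if value:
            return int(value)
    raise RuntimeError("cannot determine the number of allocated cores")

# e.g. with the platform section specifying core = 128
print(allocated_cores(moller_core="128"))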
_setup_run_parallel
This function is called from the run_parallel function to supplement some procedures before running GNU parallel. For PBS variants, the slot tables are exported so that the task functions can refer to them. For SLURM variants, there is nothing to do.
The structure of the task function is described as follows.
A task function is created with the name task_{task name}.
The arguments of the task function are 1) the degree of parallelism (the number of nodes, processes, and threads), 2) the execution directory (which corresponds to an entry of the list file), and 3) the slot ID assigned by GNU parallel.
The platform-dependent _setup_taskenv function is called to set up the execution environment. For PBS variants, the compute node and the cores are obtained from the slot table based on the slot ID. For SLURM variants, there is nothing to do.
The _is_ready function is called to check whether the preceding task has completed successfully. If it has, the remaining part of the function is executed; otherwise, the task is terminated with the status -1.
The content of the code block is written. The keywords for parallel calculation (srun, mpiexec, or mpirun) are substituted by the command provided for the platform. A sketch of the overall shape is given after this list.
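Putting the pieces together, the shape of a generated task function might look roughly like this (a hypothetical Python rendering; the argument handling, status reporting, and template text are simplified assumptions, not moller's actual template):

def generate_task_function(name, run_block, parallel_cmd):
    # Hypothetical sketch: emit a shell function task_{name} of the
    # shape described above; the real template is platform-dependent.
    for keyword in ("srun", "mpiexec", "mpirun"):
        run_block = run_block.replace(keyword, parallel_cmd)
    return f'''
task_{name}() {{
  # $1-$3: nodes/processes/threads, $4: execution directory, $5: slot ID
  _setup_taskenv "$1" "$2" "$3" "$5"
  _is_ready "$4" || return 255   # preceding task failed; give up
  cd "$4"
{run_block}
}}
'''

print(generate_task_function("example", "srun ./a.out", "srun --exclusive"))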
6.3. How to extend moller for other systems
The latest version of moller provides profiles for the ISSP supercomputer systems ohtaka and kugui. A guide to extending moller to other systems is given in the following.
Class structure
The platform-dependent parts of moller are placed in the directory platform/. Their class structure is depicted below.
Class diagram of the platform classes:
Platform (base.py)
├── BaseSlurm (base_slurm.py)
│   └── Ohtaka (ohtaka.py)
├── BasePBS (base_pbs.py)
│   ├── Kugui (kugui.py)
│   └── Pbs (pbs.py)
└── BaseDefault (base_default.py)
    └── DefaultPlatform (default.py)
A factory is provided to select a system in the input file. A class is imported in platform/__init__.py and registered to the factory by register_platform(system_name, class_name); it then becomes available through the system parameter of the platform section in the input YAML file.
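For example, a new platform could be registered as follows (a sketch; the module name mysystem.py and the class name MySystem are hypothetical):

# in platform/__init__.py (hypothetical module and class names)
from .mysystem import MySystem
register_platform("mysystem", MySystem)

The new class can then be selected with system: mysystem in the platform section of the input YAML file.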
SLURM job scheduler variants
For the SLURM job scheduler variants, the system-specific settings should be applied to a derived class of the BaseSlurm class.
The string that substitutes the keywords for the parallel execution of programs is given by the return value of the parallel_command() method. It corresponds to the srun command with the options for exclusive use of resources. See ohtaka.py for an example.
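A minimal sketch of such a derived class is shown below (the class name and the exact srun options are assumptions; consult ohtaka.py for the options actually used there):

from .base_slurm import BaseSlurm   # module path as in the class diagram

class MySlurmSystem(BaseSlurm):
    def parallel_command(self):
        # hypothetical options; the right set for exclusive resource
        # usage differs from site to site
        return "srun --exclusive"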
PBS job scheduler variants
For the PBS job scheduler variants (PBS Professional, OpenPBS, Torque, and others), the system-specific settings should be applied to a derived class of the BasePBS class.
There are two ways of specifying the number of nodes for a batch job in the PBS variants: PBS Professional takes the form select=N:ncpus=n, while Torque and its relatives take the form nodes=N:ppn=n. The BasePBS class has a parameter self.pbs_use_old_format that is set to True for the latter type.
The number of cores per compute node can be specified by the core parameter of the platform section of the input file, while a default value may be set for a known system; in kugui.py, the number of cores per node is set to 128 by default.
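A corresponding sketch for a PBS-type system follows (the class name and the constructor are illustrative assumptions; only pbs_use_old_format is taken from the text above):

from .base_pbs import BasePBS   # module path as in the class diagram

class MyTorqueSystem(BasePBS):
    def __init__(self):
        super().__init__()
        # Torque-style resource request: nodes=N:ppn=n
        self.pbs_use_old_format = True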
Customizing features
When further customization is required, the methods of the base class may be overridden in the derived classes. The list of relevant methods is given below.
setup
This method extracts the parameters of the platform section.
parallel_command
This method returns the string that is used to substitute the keywords for the parallel execution of programs (srun, mpiexec, mpirun).
generate_header
This method generates the header part of the job script that contains the options to the job scheduler.
generate_function
This method generates the functions used within the moller script. It calls the following methods to generate the function bodies and variable definitions:
generate_variable
generate_function_body
The definitions of the functions are provided as embedded strings within the class, as in the sketch after this list.
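The embedded-string pattern might look as follows (illustrative only; check base.py and the existing platform files for the actual method signatures and function bodies):

from .base_pbs import BasePBS   # module path as in the class diagram

class MyCustomSystem(BasePBS):
    def generate_function_body(self):
        # illustrative only: return shell text that is pasted into
        # the generated moller script
        return r'''
_setup_taskenv() {
  # look up the host and cores assigned to this slot (placeholder body)
  :
}
'''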
Porting to a new type of job scheduler
The platform-dependent parts of the moller scripts are the calculation of the task multiplicity, the distribution of resources over tasks, and the command string for parallel calculation. The internal functions need to be developed with the following information about the platform:
how to acquire the allocated nodes and cores from the environment when the batch job is executed.
how to launch a parallel calculation (e.g., the mpiexec command) and how to assign the nodes and cores to that command.
To find out which environment variables are set within batch jobs, it may be useful to call the printenv command in the job script.
Troubleshooting
When the variable _debug in the moller script is set to 1, debug output is printed during the execution of the batch job. If the job does not work as expected, it is recommended to turn on this debug option and examine the output to check whether the internal parameters are appropriately defined.