3.1. Basic usage

The procedure to use the batch job script generator moller consists of the following steps: First, a job description file that defines the tasks to be executed is prepared. Next, moller is run with the job description file to generate a batch job script. The script is then transferred to the target supercomputer system, and a batch job is submitted with the script to perform the calculations. In this tutorial, we explain these steps using the sample in docs/tutorial/moller.

Prepare job description file

A job description file describes the content of the calculations that are carried out in a batch job. Here, a batch job refers to a set of instructions submitted to the job scheduler running on a supercomputer system. In the concurrent execution that moller handles, we call a series of program executions performed for one set of parameters a job. A job consists of several steps that we call tasks. moller organizes job execution so that each task is run in parallel among the jobs, and the jobs are synchronized at the start and end of every task.

Tasks and jobs

Fig. 3.1 An example of tasks and jobs: Three jobs #1 … #3 are carried out within a single batch job. Each job corresponds to a different set of parameters. A job consists of four tasks. Each task is run in parallel among the three jobs.

An example of a job description file is presented below. A job description file is written in text-based YAML format. It contains parameters concerning the platform and the batch job, task descriptions, and pre/post-processing.

name: testjob
description: Sample task file

platform:
  system:  ohtaka
  queue:   i8cpu
  node:    1
  elapsed: 00:10:00

prologue:
  code: |
    module purge
    module load oneapi_compiler/2023.0.0 openmpi/4.1.5-oneapi-2023.0.0-classic

    ulimit -s unlimited

    source /home/issp/materiapps/intel/parallel/parallelvars-20210622-1.sh

jobs:
  start:
    parallel: false
    run: |
      echo "start..."

  hello:
    description: hello world
    node: [1,1]
    run: |
      echo "hello world." > result.txt
      sleep 2

  hello_again:
    description: hello world again
    node: [1,1]
    run: |
      echo "hello world again." >> result.txt
      sleep 2

epilogue:
  code: |
    echo "done."
    date

In the platform section, you can specify the type of platform on which the job is executed. In this case, the settings are for ISSP System B (ohtaka).

The prologue section describes the preprocessing of the batch job. It contains the common commands to be executed before running the tasks.

In the jobs section, the content of the tasks is described. The series of tasks to be executed in a job is given as a mapping, with the task name as the key and the processing content as the value.

In this example, a task that first outputs “start...” is defined with the task name “start”. It is set to parallel = false; in this case, the content of the run parameter is executed sequentially.

Next, a task that outputs “hello world.” is defined with the task name “hello”. Since “parallel” is not set, it is treated as parallel = true; in this case, the processing is performed in parallel on a per-job basis. Similarly, a task that outputs “hello world again.” is defined with the task name “hello_again”.

Finally, the epilogue section describes the post-processing of the batch job. It contains the common commands to be executed after running the tasks.

For more details on the specifications, please refer to the chapter File Format.

Generate batch job script

moller is run with the job description file (input.yaml) as input as follows:

$ moller -o job.sh input.yaml

A batch job script is generated and written to the file specified by the parameter in the job description file or by the command line option -o or --output. If both are specified, the command line option takes precedence. If neither is specified, the result is written to the standard output.
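For example, when no output file is specified, the generated script can be captured by redirecting the standard output:

$ moller input.yaml > job.sh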

The obtained batch job script is transferred to the target system as required. Note that the batch job script is written for bash; users may need to set the shell for job execution to bash. (Care should be taken if the login shell is a csh variant.)
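The transfer can be done, for example, with scp. The user name, host name, and remote directory below are placeholders; replace them with those of your own account on the target system:

$ scp job.sh username@remote-host:work/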

Create list file

A list of jobs is to be created. moller is designed so that each job is executed within a directory named after the job. The job list can be created, for example, by the following command:

$ /usr/bin/ls -1d * > list.dat

In this tutorial, a utility script make_inputs.sh is provided that generates the datasets and a list file.

$ bash ./make_inputs.sh

By running the above command, a directory output, a set of subdirectories dataset-0001 through dataset-0020 that correspond to the datasets, and a list file list.dat are created.
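Each line of the list file holds one job directory name. Assuming the list file is placed in the output directory (as used below), its contents can be checked, for example, as follows:

$ head -3 output/list.dat
dataset-0001
dataset-0002
dataset-0003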

Run batch job

The batch job is submitted to the job scheduler with the batch job script. In this example, the job script and the job description file are copied into the output directory, and the current directory is changed to output as follows:

$ cp job.sh input.yaml output/
$ cd output

On ohtaka, SLURM is used as the job scheduling system. To submit a batch job, the sbatch command is invoked with the job script as an argument. Additional arguments are passed to the script as parameters; here, the name of the list file is specified.

$ sbatch job.sh list.dat
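The status of the submitted job can be monitored with SLURM's squeue command, for example:

$ squeue -u $USER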

Files named result.txt will be generated in each directory listed in list.dat. You can confirm that result.txt contains the strings “hello world.” and “hello world again.” as the job results.
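For example, the result for the first dataset can be examined as follows:

$ cat dataset-0001/result.txt
hello world.
hello world again.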

Check status

The status of the execution of the tasks is written to log files. A tool named moller_status is provided to generate a summary of the status of each job from the log files. It is invoked by the following command in the directory where the batch job was executed:

$ moller_status input.yaml list.dat

The command takes the job description file input.yaml and the list file list.dat as arguments. The list file may be omitted; in this case, the information on the jobs is extracted from the log files.
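For example, the following invocation omits the list file and lets moller_status collect the job information from the log files:

$ moller_status input.yaml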

An example of the output is shown below:

| job          | hello   | hello_again   |
|--------------|---------|---------------|
| dataset-0001 | o       | o             |
| dataset-0002 | o       | o             |
| dataset-0003 | o       | o             |
| dataset-0004 | o       | o             |
| dataset-0005 | o       | o             |
| dataset-0006 | o       | o             |
| dataset-0007 | o       | o             |
| dataset-0008 | o       | o             |
| dataset-0009 | o       | o             |
| dataset-0010 | o       | o             |
| dataset-0011 | o       | o             |
| dataset-0012 | o       | o             |
| dataset-0013 | o       | o             |
| dataset-0014 | o       | o             |
| dataset-0015 | o       | o             |
| dataset-0016 | o       | o             |
| dataset-0017 | o       | o             |
| dataset-0018 | o       | o             |
| dataset-0019 | o       | o             |
| dataset-0020 | o       | o             |

where “o” corresponds to a task that has been completed successfully, “x” to a failed task, “-” to a task skipped because a previous task terminated with errors, and “.” to a task not yet executed. In the above example, all the tasks have been completed successfully.

Rerun failed tasks

If a task fails, the subsequent tasks within the job are not executed. The following is an example of the job status in which each task fails with a 10% chance.

| job          | task1   | task2   | task3   |
|--------------|---------|---------|---------|
| dataset_0001 | o       | o       | o       |
| dataset_0002 | o       | x       | -       |
| dataset_0003 | x       | -       | -       |
| dataset_0004 | x       | -       | -       |
| dataset_0005 | o       | o       | o       |
| dataset_0006 | o       | o       | o       |
| dataset_0007 | o       | x       | -       |
| dataset_0008 | o       | o       | o       |
| dataset_0009 | o       | o       | x       |
| dataset_0010 | o       | o       | o       |
| dataset_0011 | o       | o       | o       |
| dataset_0012 | o       | o       | o       |
| dataset_0013 | o       | x       | -       |
| dataset_0014 | o       | o       | o       |
| dataset_0015 | o       | o       | o       |
| dataset_0016 | o       | o       | o       |
| dataset_0017 | o       | o       | o       |
| dataset_0018 | o       | o       | o       |
| dataset_0019 | o       | o       | o       |
| dataset_0020 | o       | o       | o       |

Here, the jobs for dataset_0003 and dataset_0004 failed at task1, and the subsequent task2 and task3 were not executed. The other jobs succeeded at task1 and proceeded to task2. In this way, each job is executed independently of the other jobs.

Users can rerun the failed tasks by submitting the batch job with the retry option. For the SLURM job scheduler (e.g. used in ISSP System B), resubmit the job as follows:

$ sbatch job.sh --retry list.dat

For the PBS job scheduler (e.g. used in ISSP System C), edit the job script so that the line retry=0 is replaced by retry=1, and resubmit the job.
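The edit can be done, for example, with sed, assuming the generated script contains the line retry=0 as described above:

$ sed -i 's/^retry=0/retry=1/' job.sh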

| job          | task1   | task2   | task3   |
|--------------|---------|---------|---------|
| dataset_0001 | o       | o       | o       |
| dataset_0002 | o       | o       | x       |
| dataset_0003 | o       | x       | -       |
| dataset_0004 | o       | o       | o       |
| dataset_0005 | o       | o       | o       |
| dataset_0006 | o       | o       | o       |
| dataset_0007 | o       | o       | o       |
| dataset_0008 | o       | o       | o       |
| dataset_0009 | o       | o       | o       |
| dataset_0010 | o       | o       | o       |
| dataset_0011 | o       | o       | o       |
| dataset_0012 | o       | o       | o       |
| dataset_0013 | o       | o       | o       |
| dataset_0014 | o       | o       | o       |
| dataset_0015 | o       | o       | o       |
| dataset_0016 | o       | o       | o       |
| dataset_0017 | o       | o       | o       |
| dataset_0018 | o       | o       | o       |
| dataset_0019 | o       | o       | o       |
| dataset_0020 | o       | o       | o       |

The tasks that failed are executed in the second run. In the above example, task1 for dataset_0003 succeeded this time, but task2 failed. For dataset_0004, task1, task2, and task3 were all executed successfully. For jobs whose tasks had already finished successfully, the second run does nothing.

N.B. the list file must not be modified for the rerun. The jobs are managed according to the order of the entries in the list file; if the order is changed, the jobs will not be executed properly.