Synopsis

QCG-PilotJob is designed to schedule and execute many small jobs inside a single Slurm allocation.

Introduction

QCG-PilotJob is a system that schedules many small tasks inside a large Slurm allocation. The key features include:

  • Support of heterogeneous tasks inside an allocation (different numbers of threads, varying runtimes)
  • Complex workflows with task dependencies (implemented as a directed acyclic graph)
  • Resuming failed tasks or tasks that were prematurely canceled due to time limit
  • Performance analysis tools

Jobs are defined in a Python script or a JSON file. For static scenarios, where the number of tasks is known in advance, a JSON file is sufficient. The Python API is more flexible: it allows jobs to be generated dynamically, for example by reading input parameters from a file, and it supports dynamic, iterative applications with complex workflows.
Since the Python API also covers simple static cases, we recommend starting with the Python API from the beginning.

In the following, we describe a typical use case (see https://qcg-pilotjob.readthedocs.io/en/stable/overview.html for more details).

Examples

Scenario: Independent serial tasks

Suppose you need to run a serial program my_program for many different parameter combinations. The parameter combinations can be stored in a CSV file, for example job_parameters.csv:

id,inputfile,type
1,/path/to/input/file/for/job.cnf,A
2,/path/to/input/file/for/job.cnf,B
...
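Such a file can also be generated programmatically; a minimal sketch using Python's csv module (the parameter values below are illustrative placeholders, matching the columns above):

```python
import csv

# Hypothetical parameter combinations; in practice these would come
# from your own experiment design.
rows = [
    {'id': '1', 'inputfile': '/path/to/input/file/for/job.cnf', 'type': 'A'},
    {'id': '2', 'inputfile': '/path/to/input/file/for/job.cnf', 'type': 'B'},
]

# Write the header row followed by one row per parameter combination.
with open('job_parameters.csv', 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=['id', 'inputfile', 'type'])
    writer.writeheader()
    writer.writerows(rows)
```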

The following Python script reads all parameter combinations from the CSV file and defines QCG tasks accordingly (dependencies and modules are optional):

qcg_job.py
import csv
from qcg.pilotjob.api.job import Jobs
from qcg.pilotjob.api.manager import LocalManager
manager = LocalManager()

jobs = Jobs()
with open('job_parameters.csv', newline='', encoding='utf-8', errors='ignore') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        # compose a unique job name from the parameter values
        job_name = row['id'] + row['type']
        jobs.add(name=job_name,
                 exec='my_program',
                 args=['--inputfile', row['inputfile'], '--type', row['type']],
                 stdout='job.out.' + job_name,
                 stderr='job.err.' + job_name,
                 model='default',
                 # modules=["2022", "..."],  # optional: load required modules for the application
                 iteration=1)
manager.submit(jobs)
manager.wait4all()
manager.finish()
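The script assumes that my_program accepts the --inputfile and --type flags. my_program is a placeholder for your own application; purely as an illustration, such a command-line interface could be sketched with argparse:

```python
import argparse

def build_parser():
    # Hypothetical CLI matching the flags passed in the QCG job definition;
    # adapt the argument names to your actual program.
    parser = argparse.ArgumentParser(prog='my_program')
    parser.add_argument('--inputfile', required=True, help='path to the input file')
    parser.add_argument('--type', required=True, help='parameter variant, e.g. A or B')
    return parser

if __name__ == '__main__':
    args = build_parser().parse_args()
    print(f'processing {args.inputfile} (type {args.type})')
```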

To submit these tasks, create a simple Slurm batch script:

qcg_job.sh
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=32
#SBATCH --time=08:00:00
#SBATCH --partition=thin
module load 2022
module load QCG-PilotJob/0.13.1-foss-2022a
python3 qcg_job.py

and submit it with sbatch as usual:

sbatch qcg_job.sh

Resuming prematurely interrupted computations

If not all jobs are completed with the provided walltime (`--time` parameter in the Slurm script), the allocation can be resumed without rerunning completed tasks:

  1. Determine the name of the previous temporary directory in your submission directory. The directory name should start with .qcgpjm-service- 
  2. Define a Slurm job for resuming the tasks (see below)
  3. Submit the resume job:  sbatch resume.sh
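Step 1 can be automated; a minimal sketch (assuming the service directories live directly in the submission directory) that picks the most recently modified .qcgpjm-service-* directory:

```python
import glob
import os

def latest_service_dir(workdir):
    """Return the most recently modified .qcgpjm-service-* directory, or None."""
    candidates = [d for d in glob.glob(os.path.join(workdir, '.qcgpjm-service-*'))
                  if os.path.isdir(d)]
    return max(candidates, key=os.path.getmtime) if candidates else None
```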
resume.sh
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=32
#SBATCH --time=00:05:00
#SBATCH --partition=thin
module load 2022
module load QCG-PilotJob/0.13.1-foss-2022a

PREVIOUS_WORKDIR=/path/to/previous/working/dir
RESUME_DIR=$PREVIOUS_WORKDIR/.qcgpjm-service-tcn393.local.snellius.surf.nl.9081/ # define the name of the temporary path here

# resume the job
qcg-pm-service --wd $PREVIOUS_WORKDIR --resume $RESUME_DIR

See https://qcg-pilotjob.readthedocs.io/en/stable/resume.html for more details.
