Slurm Deployment

Usage and options

  • Non-OOD support requires the use of the create_dlg_job.py script.

The script has two configuration approaches:

  • Command line interface (CLI)

  • Configuration files:
    • Facility INI [Experimental]

    • Slurm template [Experimental]

Command-line Interface (CLI)

The CLI allows the user to submit a remote SLURM job from their local machine, which will spin up the requested number of DALiuGE Island and Node Managers and run the graph.

The minimal requirements for submitting a job via the command-line are:

  • The facility (e.g. Setonix, Hyades, Galaxy)

  • The graph (either logical or physical, but not both).

  • Whether the submission is remote or local

  • The remote user account

All other options have defaults provided. Thus the most basic job submission will look like:

python create_dlg_job.py -a 1 -f setonix -L /path/to/graph/ArrayLoop.graph -U user_name

However, the defaults for job submissions make limited use of the available resources (i.e. the number of nodes provisioned) and do not account for specific job durations. DALiuGE Translator options are also available, so it is possible to specify which partitioning algorithm is preferred. A more complete job submission that takes advantage of the SLURM and environment options will look something like:

python create_dlg_job.py -a 1 -n 32 -s 1 -t 60 -A pso -u -f setonix -L /path/to/graph/ArrayLoop.graph -v 4 --remote --submit -U user_name

This performs the following:

  • Submits and runs a remote job (--remote --submit) on Pawsey’s Setonix machine (-f setonix)

  • Uses 1 data island manager (-s 1) and requests 32 nodes (-n 32) for a job duration of 60 minutes (-t 60)

  • Translates the Logical Graph (-L) using the PSO algorithm (-A pso).
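The options described above can also be assembled programmatically before being passed to a shell or to subprocess. The following is a minimal sketch only: build_command is a hypothetical helper that is not part of DALiuGE, and its keyword names and defaults are assumptions made for illustration. It uses only flags listed in the help output and enforces the rule that a graph is either logical or physical, but not both.

```python
import shlex


def build_command(*, facility, username, nodes, islands, minutes,
                  logical_graph=None, physical_graph=None,
                  algorithm=None, remote=False, submit=False):
    """Assemble a create_dlg_job.py submission command as an argv list.

    Hypothetical helper for illustration; not part of DALiuGE.
    """
    # The graph must be logical or physical, but not both (and not neither).
    if (logical_graph is None) == (physical_graph is None):
        raise ValueError("give exactly one of logical_graph or physical_graph")

    argv = ["python", "create_dlg_job.py", "-a", "1",
            "-n", str(nodes), "-s", str(islands), "-t", str(minutes)]
    if algorithm:
        argv += ["-A", algorithm]
    argv += ["-f", facility]
    if logical_graph is not None:
        argv += ["-L", logical_graph]
    else:
        argv += ["-P", physical_graph]
    if remote:
        argv.append("--remote")
    if submit:
        argv.append("--submit")
    argv += ["-U", username]
    return argv


command = build_command(
    facility="setonix", username="user_name",
    nodes=32, islands=1, minutes=60,
    logical_graph="/path/to/graph/ArrayLoop.graph",
    algorithm="pso", remote=True, submit=True,
)
# shlex.join quotes arguments where needed, so the line is safe to paste.
print(shlex.join(command))
```

Returning a list rather than a single string keeps paths with spaces intact if the command is later handed to subprocess.run.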

Facility INI

Currently, deploying onto an HPC facility requires either using a facility that DALiuGE already supports or adding a new class entry to the deploy/config/__init__.py file. To make deployment more flexible and easier to extend to other facilities, we have added (experimental) support for using an INI configuration file for facility deployment parameters.

The following configuration is an example deployment that contains all variables necessary to deploy onto a remote system:

[ENVIRONMENT]
ACCOUNT = pawsey0411
USER = test
LOGIN_NODE = setonix.pawsey.org.au
HOME_DIR = /scratch/${ACCOUNT}
DLG_ROOT = ${HOME_DIR}/${USER}/dlg
LOG_DIR = ${DLG_ROOT}/log
MODULES =
VENV = source /software/projects/${ACCOUNT}/venv/bin/activate
EXEC_PREFIX = srun -l
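The ${NAME} references in this file have the same form as the extended interpolation syntax of Python's configparser module. The sketch below shows how the example values expand under that scheme. Whether create_dlg_job.py reads the file in exactly this way is an assumption; the snippet is only meant as a quick way to check what a custom .ini file resolves to.

```python
from configparser import ConfigParser, ExtendedInterpolation

# The example configuration from above, embedded so the sketch is self-contained.
INI_TEXT = """\
[ENVIRONMENT]
ACCOUNT = pawsey0411
USER = test
LOGIN_NODE = setonix.pawsey.org.au
HOME_DIR = /scratch/${ACCOUNT}
DLG_ROOT = ${HOME_DIR}/${USER}/dlg
LOG_DIR = ${DLG_ROOT}/log
MODULES =
VENV = source /software/projects/${ACCOUNT}/venv/bin/activate
EXEC_PREFIX = srun -l
"""

# ExtendedInterpolation expands ${NAME} using other keys in the same section.
parser = ConfigParser(interpolation=ExtendedInterpolation())
parser.read_string(INI_TEXT)
env = parser["ENVIRONMENT"]

for key in ("HOME_DIR", "DLG_ROOT", "LOG_DIR", "VENV"):
    print(f"{key} = {env[key]}")
```

With the values above, DLG_ROOT resolves to /scratch/pawsey0411/test/dlg and LOG_DIR to /scratch/pawsey0411/test/dlg/log.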

A user can create and reference their own .ini file using these parameters, and run with the --config_file option:

python create_dlg_job.py -a 1 -n 1 -s 1 -u -f setonix -L ~/github/EAGLE_test_repo/eagle_test_graphs/daliuge_tests/dropmake/logical_graphs/ArrayLoop.graph -v 5 --remote --submit -U rbunney --config_file example_config.ini

SLURM Template

There are significantly more SLURM options than are practical as CLI options. The SLURM template is an experimental feature that allows you to specify additional SBATCH options that are not currently supported in the CLI. The template will be prefixed to the final SLURM script that runs the DALiuGE job on the remote system.

A basic example, which replicates the SLURM script currently created by create_dlg_job.py, is available in dlg/deploy/config/default.slurm:

#!/bin/bash --login

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=2
#SBATCH --job-name=DALiuGE-$SESSION_ID # NECESSARY, DO NOT REMOVE
#SBATCH --time=00:45:00
#SBATCH --error=err-%j.log

export DLG_ROOT=$DLG_ROOT # DO NOT CHANGE - use .INI file or CLI
source /software/projects/pawsey0411/venv/bin/activate
# Keep an empty line in the file

Note

Settings defined in the SLURM template will overwrite anything passed via the CLI and the .INI file. For example, the source for a virtualenv declared in the .slurm file will overwrite the VENV environment variable in the .INI file. This may change in the future depending on the extent of the features we add.
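To illustrate the precedence described in this note, the sketch below collects the #SBATCH directives from a template and lets them replace matching job parameters. The helper names (sbatch_options, effective_options) and the dictionary keys are hypothetical and are not part of create_dlg_job.py; the real script may merge settings differently.

```python
import re

# Matches lines such as "#SBATCH --nodes=2" and captures ("nodes", "2").
SBATCH_RE = re.compile(r"^#SBATCH\s+--([\w-]+)=(\S+)", re.MULTILINE)

TEMPLATE = """#!/bin/bash --login

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=2
#SBATCH --job-name=DALiuGE-$SESSION_ID # NECESSARY, DO NOT REMOVE
#SBATCH --time=00:45:00
#SBATCH --error=err-%j.log
"""


def sbatch_options(template_text):
    """Collect '#SBATCH --key=value' directives from a template."""
    return dict(SBATCH_RE.findall(template_text))


def effective_options(other_options, template_text):
    """Template directives replace values that came from the CLI or .INI."""
    merged = dict(other_options)
    merged.update(sbatch_options(template_text))
    return merged


# Hypothetical values standing in for parameters given on the CLI or in the .INI.
other = {"nodes": "32", "account": "pawsey0411"}
print(effective_options(other, TEMPLATE))
```

In this example the template's nodes value of 2 replaces the 32 requested elsewhere, while account, which the template does not set, is left unchanged.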

Running with a SLURM template is similar to the .ini method:

python create_dlg_job.py -a 1 -n 1 -s 1 -u -f setonix -L ~/github/EAGLE_test_repo/eagle_test_graphs/daliuge_tests/dropmake/logical_graphs/ArrayLoop.graph -v 5 --remote --submit -U rbunney --config_file example_config.ini --slurm_template example.slurm
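As a rough illustration of how a template can be prefixed to a job script, the following sketch fills the $SESSION_ID and $DLG_ROOT placeholders with Python's string.Template and prepends the result to a placeholder job body. That create_dlg_job.py substitutes the placeholders this way is an assumption; render_script, the session name and the job body are hypothetical.

```python
from string import Template

TEMPLATE = """#!/bin/bash --login

#SBATCH --nodes=2
#SBATCH --job-name=DALiuGE-$SESSION_ID # NECESSARY, DO NOT REMOVE
#SBATCH --error=err-%j.log

export DLG_ROOT=$DLG_ROOT # DO NOT CHANGE - use .INI file or CLI
"""


def render_script(template_text, job_body, **values):
    """Fill the template placeholders and put the result before the job body."""
    # safe_substitute leaves unknown placeholders untouched instead of raising.
    header = Template(template_text).safe_substitute(values)
    return header.rstrip("\n") + "\n\n" + job_body


# Placeholder body; the real DALiuGE start-up commands are generated by the script.
JOB_BODY = "echo 'DALiuGE job commands go here'\n"

script = render_script(
    TEMPLATE,
    JOB_BODY,
    SESSION_ID="demo_session",               # hypothetical session name
    DLG_ROOT="/scratch/pawsey0411/test/dlg",  # value from the example .INI
)
print(script)
```

Because the SBATCH lines sit at the top of the rendered script, they remain valid directives for the scheduler, which is why the template has to come first.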

Complete command-line options

Help output:

create_dlg_job.py -a [1|2] -f <facility> [options]

create_dlg_job.py -h for further help

Options:
-h, --help            show this help message and exit
-a ACTION, --action=ACTION
                        1 - create/submit job, 2 - analyse log
-l LOG_ROOT, --log-root=LOG_ROOT
                        The root directory of the log file
-d LOG_DIR, --log-dir=LOG_DIR
                        The directory of the log file for parsing
-L LOGICAL_GRAPH, --logical-graph=LOGICAL_GRAPH
                        The filename of the logical graph to deploy
-A ALGORITHM, --algorithm=ALGORITHM
                        The algorithm to be used for the translation
-O ALGORITHM_PARAMS, --algorithm-parameters=ALGORITHM_PARAMS
                        Parameters for the translation algorithm
-P PHYSICAL_GRAPH, --physical-graph=PHYSICAL_GRAPH
                        The filename of the physical graph (template) to
                        deploy
-t JOB_DUR, --job-dur=JOB_DUR
                        job duration in minutes
-n NUM_NODES, --num_nodes=NUM_NODES
                        number of compute nodes requested
-i, --visualise_graph
                        Whether to visualise graph (poll status)
-p, --run_proxy       Whether to attach proxy server for real-time
                        monitoring
-m MON_HOST, --monitor_host=MON_HOST
                        Monitor host IP (optional)
-o MON_PORT, --monitor_port=MON_PORT
                        The port to bind DALiuGE monitor
-v VERBOSE_LEVEL, --verbose-level=VERBOSE_LEVEL
                        Verbosity level (1-3) of the DIM/NM logging
-c CSV_OUTPUT, --csvoutput=CSV_OUTPUT
                        CSV output file to keep the log analysis result
-z, --zerorun         Generate a physical graph that takes no time to run
-y, --sleepncopy      Whether include COPY in the default Component drop
-T MAX_THREADS, --max-threads=MAX_THREADS
                        Max thread pool size used for executing drops. 0
                        (default) means no pool.
-s NUM_ISLANDS, --num_islands=NUM_ISLANDS
                        The number of Data Islands
-u, --all_nics        Listen on all NICs for a node manager
-S, --check_with_session
                        Check for node managers' availability by
                        creating/destroy a session
-f FACILITY, --facility=FACILITY
                        The facility for which to create a submission job
                        Valid options: ['galaxy_mwa', 'galaxy_askap',
                        'magnus', 'galaxy', 'setonix', 'shao', 'hyades',
                        'ood', 'ood_cloud']
--submit              If set to False, the job is not submitted, but the
                        script is generated
--remote              If set to True, the job is submitted/created for a
                        remote submission
-D DLG_ROOT, --dlg_root=DLG_ROOT
                        Overwrite the DLG_ROOT directory provided by the
                        config
-C, --configs         Display the available configurations and exit
-U USERNAME, --username=USERNAME
                        Remote username, if different from local

Experimental Options:
   Caution: These are not properly tested and likely to be rough around
   the edges.

   --config_file=CONFIG_FILE
                        Use INI configuration file.
   --slurm_template=SLURM_TEMPLATE
                        Use SLURM template file for job submission. WARNING:
                        Using this command will over-write other job-
                        parameters passed here.