P-Cluster Library and Job Management#

This tutorial introduces the Spack package manager and the SLURM job manager tools available on the P-Cluster.

The Spack Package Manager#

Spack is a flexible package manager designed for building and managing multiple software versions in high-performance computing environments. It allows users to easily install software with different configurations, dependencies, and compilers without interference between installations. Spack supports reproducibility and portability, making it ideal for complex scientific workflows across different systems. - ChatGPT

Setting up your environment for Spack#

One can use Spack to install software and generate new modules even as a non-root user, although the collection of modules on the P-Cluster is already extensive.

  • For the EMU application we recommend a more specific configuration – an example .bashrc initializes the Spack profile.

  • For tutorials that run MITgcm directly, you can also reset Spack and add modules using Julia_ECCO_and_more/setup_modules.csh.

  • Or for a default initialization of Spack, you can just run the following command.

source /shared/spack/share/spack/setup-env.sh
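
Once Spack is initialized, a few commands cover most day-to-day use. The sketch below shows a typical sequence; the package name netcdf-c is only an illustrative example, and the software you need may already be provided by an existing module on the P-Cluster.

# List packages already installed through Spack
spack find

# Install a package as a non-root user (netcdf-c is a placeholder; pick what you need)
spack install netcdf-c

# Make the installed package available in the current shell
spack load netcdf-c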

Adding software to your environment using the module Command#

The module command is used on the P-Cluster to manage environment modules. Modules allow users to easily load, unload, and switch between different software environments without manually modifying environment variables such as PATH and LD_LIBRARY_PATH. This command is especially useful for managing multiple versions of software or libraries in shared environments.

When you load a module, it configures your environment to use a specific version of software. You can list available modules, load and unload modules, and reset your environment using various module subcommands.

Below is a list of common module commands and their functions:

| Command | Description | Example |
| --- | --- | --- |
| module avail | Lists all available modules that can be loaded. | module avail |
| module list | Shows a list of currently loaded modules in your environment. | module list |
| module load | Loads a specific module into your environment, making the software available for use. | module load gcc/9.3.0 |
| module unload | Unloads a specific module, removing it from your environment. | module unload gcc/9.3.0 |
| module purge | Unloads all currently loaded modules, resetting your environment. | module purge |
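
As a short illustrative session combining the commands above (the gcc/9.3.0 version is taken from the table and may differ from what is currently installed on the P-Cluster):

# Start from a clean environment
module purge
# See what is available, then load a compiler and confirm it is loaded
module avail
module load gcc/9.3.0
module list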

The SLURM Batch System#

SLURM (Simple Linux Utility for Resource Management) is an open-source batch scheduling system widely used in high-performance computing (HPC) environments to manage and allocate computational resources. It enables users to submit, schedule, and manage jobs on clusters, ensuring efficient use of available nodes and resources. SLURM provides flexible scheduling policies, supports parallel and distributed workloads, and includes features for job prioritization and resource accounting. - ChatGPT

Common SLURM commands#

| Command | Description | Example |
| --- | --- | --- |
| sbatch | Submits a job script to the SLURM scheduler. | sbatch job_script.sh |
| scancel | Cancels a pending or running job. | scancel <job_id> |
| squeue | Displays information about jobs in the queue. | squeue |
| sinfo | Displays information about available SLURM nodes and partitions. | sinfo |
| salloc | Allocates resources for a job interactively. | salloc --ntasks=2 --ntasks-per-node=2 --partition=sealevel-c5xl-demand --time=01:00:00 |
| srun | Submits a job or launches parallel tasks (can be used in a script or interactively). | srun --ntasks=4 ./my_program |
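
For reference, a minimal job script that could be submitted with sbatch might look like the sketch below. The job name, partition, task count, time limit, and program name are placeholders to adjust for your own job (my_program follows the srun example in the table above).

#!/bin/bash
#SBATCH --job-name=example_job
#SBATCH --partition=sealevel-c5xl-demand
#SBATCH --ntasks=4
#SBATCH --time=01:00:00
#SBATCH --output=example_job.%j.out

# Launch the (placeholder) program on the allocated resources
srun ./my_program

Such a script would be submitted with sbatch job_script.sh and monitored with squeue.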

There are numerous resources on the web about how to use SLURM. One useful example can be found here.

Note that the head node (the machine where you first log in to 34.210.1.198) has very limited resources and is suitable for editing, but not for intensive data analysis or processing. We recommend using salloc to request an interactive node for heavy data processing.

Partition#

On SLURM systems, a “partition” is a set of compute nodes grouped for specific job submissions. Partitions define a collection of resources with particular attributes or policies, such as job limits or access to specific hardware, to which jobs are submitted. The equivalent of a “partition” on PBS (Portable Batch System) is the “queue.”

The command sinfo (see the table above) can display available SLURM partitions. Below is the output of sinfo showing the available partitions on the P-Cluster.

PARTITION                 AVAIL  TIMELIMIT  NODES  STATE  NODELIST
sealevel-c5n18xl-spot*    up     infinite      50  idle~  sealevel-c5n18xl-spot-dy-c5n18xlarge-[1-50]
sealevel-c5n18xl-demand   up     infinite      49  idle~  sealevel-c5n18xl-demand-dy-c5n18xlarge-[2-50]
sealevel-c5n18xl-demand   up     infinite       1  alloc  sealevel-c5n18xl-demand-dy-c5n18xlarge-1
sealevel-c5xl-spot        up     infinite    1000  idle~  sealevel-c5xl-spot-dy-c5xlarge-[1-1000]
sealevel-c5xl-demand      up     infinite    1000  idle~  sealevel-c5xl-demand-dy-c5xlarge-[1-1000]

There are four kinds of partitions: two use AWS c5n18xl instances, and the other two use c5xl instances. The former are suitable for large jobs, while the latter are for smaller jobs, such as interactive work (see Amazon EC2 C5n instances for more details). Partitions ending with spot use AWS spot instances, which are priced lower than on-demand instances. However, AWS can reclaim a spot instance when the capacity is needed elsewhere, in which case your job will be killed. To avoid having your job terminated by AWS, use the demand partitions unless it is acceptable for your job to be interrupted. An example of targeting a specific partition is given below.
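
The partition is selected with the --partition option of salloc, srun, or sbatch. For instance, a large parallel job that should not be interrupted could be sent to the c5n18xl on-demand partition (the script name below is a placeholder):

sbatch --partition=sealevel-c5n18xl-demand job_script.sh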

Starting an interactive node#

As stated above, the head node (the machine where you first log in to 34.210.1.198) has very limited resources and is suitable for editing, but not for intensive data analysis or processing. We recommend using salloc to request an interactive node for heavy data processing.

For instance, to start an interactive node with machine type c5.xlarge, issue the following command from your home directory:

salloc --ntasks=2 --ntasks-per-node=2 --partition=sealevel-c5xl-demand --time=01:00:00 

This command requests an interactive node on the partition called sealevel-c5xl-demand with two tasks (a term similar to processes) running for one hour.

After issuing the salloc command and waiting a few minutes, one receives a message on the screen with a job identification number (ID), as shown below (using 128 as an example ID):

salloc: Granted job allocation 128
salloc: Waiting for resource configuration

SLURM may take several minutes to allocate and configure the requested resources. Once the resources are ready, the prompt will appear as follows:

salloc: Nodes sealevel-c5xl-demand-dy-c5xlarge-1 are ready for job
USERNAME@ip-10-20-22-69:~$ 

Then, one can run commands or executable scripts. If the interactive node is no longer needed, use scancel JOB_ID to exit the partition and release the requested resources.
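
If you have lost track of the job ID, squeue can list your jobs before canceling; using the allocation number from the example above:

# List your own jobs and their IDs
squeue -u $USER
# Release the interactive allocation (128 is the example job ID from above)
scancel 128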

In addition to salloc, srun can also be used to request an interactive node, though it is often used to run a specific script or job. salloc, on the other hand, allows users to run multiple commands once the resources are allocated.
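
For example, the following command (partition and time limit are illustrative) requests a single task on the sealevel-c5xl-demand partition and opens an interactive shell directly on the allocated node; the --pty option attaches a pseudo-terminal so the shell behaves like a normal login session:

srun --ntasks=1 --partition=sealevel-c5xl-demand --time=01:00:00 --pty /bin/bash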