P-Cluster Library and Job Management#
This tutorial introduces the Spack package manager and the SLURM job scheduler available on the P-Cluster.
The Spack Package Manager#
Spack is a flexible package manager designed for building and managing multiple software versions in high-performance computing environments. It allows users to easily install software with different configurations, dependencies, and compilers without interference between installations. Spack supports reproducibility and portability, making it ideal for complex scientific workflows across different systems. - ChatGPT
Setting up your environment for Spack#
One can use Spack to install software and generate new modules even as a non-root user, although the collection of modules on the P-Cluster is already extensive.
For the EMU application we recommend a more specific configuration; this example .bashrc initializes the Spack profile.
For tutorials that run MITgcm directly, you can also reset Spack and add modules using Julia_ECCO_and_more/setup_modules.csh. Or, for a default initialization of Spack, you can just run the following command.
source /shared/spack/share/spack/setup-env.sh
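Once the Spack environment is initialized, the usual Spack commands become available in your shell. Below is a minimal sketch of a typical workflow; the package name `netcdf-c` is only an illustration and may or may not already be installed on the P-Cluster.

```bash
# Initialize Spack in the current shell (same command as above)
source /shared/spack/share/spack/setup-env.sh

# List packages already installed through Spack
spack find

# Install a package (netcdf-c is only an example name)
spack install netcdf-c

# Make the installed package available in the current environment
spack load netcdf-c
```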
Adding software to your environment using the module command#
The `module` command is used on the P-Cluster to manage environment modules. Modules allow users to easily load, unload, and switch between different software environments without manually modifying environment variables such as `PATH` and `LD_LIBRARY_PATH`. This command is especially useful for managing multiple versions of software or libraries in shared environments.
When you load a module, it configures your environment to use a specific version of software. You can list available modules, load and unload modules, and reset your environment using the various `module` subcommands. Below is a list of common `module` commands and their functions; a short example session follows the table.
| Command | Description | Example |
|---|---|---|
| `module avail` | Lists all available modules that can be loaded. | `module avail` |
| `module list` | Shows a list of currently loaded modules in your environment. | `module list` |
| `module load` | Loads a specific module into your environment, making the software available for use. | `module load <module_name>` |
| `module unload` | Unloads a specific module, removing it from your environment. | `module unload <module_name>` |
| `module purge` | Unloads all currently loaded modules, resetting your environment. | `module purge` |
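A typical session might look like the sketch below; the module name `netcdf-fortran` is hypothetical, so substitute whatever module you actually need from the `module avail` listing.

```bash
# See what modules are available on the system
module avail

# Load a module (the name "netcdf-fortran" is only a placeholder)
module load netcdf-fortran

# Confirm which modules are currently loaded
module list

# Remove a single module, or clear everything and start fresh
module unload netcdf-fortran
module purge
```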
The SLURM Batch System#
SLURM (Simple Linux Utility for Resource Management) is an open-source batch scheduling system widely used in high-performance computing (HPC) environments to manage and allocate computational resources. It enables users to submit, schedule, and manage jobs on clusters, ensuring efficient use of available nodes and resources. SLURM provides flexible scheduling policies, supports parallel and distributed workloads, and includes features for job prioritization and resource accounting. - ChatGPT
Common SLURM commands#
| Command | Description | Example |
|---|---|---|
| `sbatch` | Submits a job script to the SLURM scheduler. | `sbatch <job_script>` |
| `scancel` | Cancels a pending or running job. | `scancel <job_id>` |
| `squeue` | Displays information about jobs in the queue. | `squeue -u $USER` |
| `sinfo` | Displays information about available SLURM nodes and partitions. | `sinfo` |
| `salloc` | Allocates resources for a job interactively. | `salloc --ntasks=1 --time=01:00:00` |
| `srun` | Submits a job or launches parallel tasks (can be used in a script or interactively). | `srun <executable>` |
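As an illustration of `sbatch`, a minimal job script might look like the sketch below. The script name, job name, and executable are hypothetical; the partition is one of those described in the Partition section further down.

```bash
#!/bin/bash
#SBATCH --job-name=my_test_job            # hypothetical job name
#SBATCH --partition=sealevel-c5xl-demand  # partition to run on (see "Partition" below)
#SBATCH --ntasks=2                        # number of tasks (processes)
#SBATCH --time=01:00:00                   # wall-clock time limit
#SBATCH --output=my_test_job.%j.out       # output file; %j expands to the job ID

# Launch the tasks; ./my_program is a placeholder for your executable
srun ./my_program
```

Save the sketch as, say, `my_job.sh`, submit it with `sbatch my_job.sh`, and monitor it with `squeue -u $USER`.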
There are numerous resources on the web about how to use SLURM. One useful example can be found here.
Note that the head node (the machine where you first log in to 34.210.1.198) has very limited resources and is suitable for editing, but not for intensive data analysis or processing. We recommend using salloc to request an interactive node for heavy data processing.
Partition#
On SLURM systems, a “partition” is a set of compute nodes grouped for specific job submissions. Partitions define a collection of resources to which jobs are submitted, with particular attributes or policies such as job limits or access to specific hardware. The equivalent of a “partition” on PBS (Portable Batch System) is the “queue.”
The command `sinfo` (see the table above) displays the available SLURM partitions. Below is the output of `sinfo` showing the available partitions on the P-Cluster.
| PARTITION | AVAIL | TIMELIMIT | NODES | STATE | NODELIST |
|---|---|---|---|---|---|
| sealevel-c5n18xl-spot* | up | infinite | 50 | idle~ | sealevel-c5n18xl-spot-dy-c5n18xlarge-[1-50] |
| sealevel-c5n18xl-demand | up | infinite | 49 | idle~ | sealevel-c5n18xl-demand-dy-c5n18xlarge-[2-50] |
| sealevel-c5n18xl-demand | up | infinite | 1 | alloc | sealevel-c5n18xl-demand-dy-c5n18xlarge-1 |
| sealevel-c5xl-spot | up | infinite | 1000 | idle~ | sealevel-c5xl-spot-dy-c5xlarge-[1-1000] |
| sealevel-c5xl-demand | up | infinite | 1000 | idle~ | sealevel-c5xl-demand-dy-c5xlarge-[1-1000] |
There are four kinds of partitions: two are AWS c5n18xl instances, and the other two are c5xl instances. The former are suitable for large jobs, while the latter are for smaller jobs, such as interactive jobs (see Amazon EC2 C5n instances for more details). Partitions ending with `spot` are AWS spot instances, which are priced lower than `demand` instances. However, AWS can terminate a spot instance if there is demand for it, meaning your job will be killed. To avoid your job being terminated by AWS, use the `demand` partitions unless it is acceptable for your job to be interrupted. An example of selecting a partition at submission time is shown below.
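For instance, the partition can be chosen on the `sbatch` command line; the script name `my_job.sh` is hypothetical.

```bash
# Cheaper, but AWS may reclaim the nodes and kill the job
sbatch --partition=sealevel-c5xl-spot my_job.sh

# More expensive, but the job will not be terminated by AWS
sbatch --partition=sealevel-c5xl-demand my_job.sh
```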
Starting an interactive node#
As stated above, the head node has very limited resources and is not suitable for intensive data analysis or processing; use `salloc` to request an interactive node instead.
For instance, to start an interactive node with machine type c5.xlarge, issue the following command from your home directory:
salloc --ntasks=2 --ntasks-per-node=2 --partition=sealevel-c5xl-demand --time=01:00:00
This command requests an interactive node on the partition called `sealevel-c5xl-demand` with two tasks (a term similar to processes) running for one hour.
After issuing the `salloc` command and waiting a few minutes, one would receive a message on the screen with a job identification number (ID), as shown below (here the example job ID is 128):
salloc: Granted job allocation 128
salloc: Waiting for resource configuration
SLURM may take several minutes to allocate and configure the requested resources. Once the resources are ready, the prompt will appear as follows:
salloc: Nodes sealevel-c5xl-demand-dy-c5xlarge-1 are ready for job
USERNAME@ip-10-20-22-69:~$
Then, one can run commands or executable scripts. When the interactive node is no longer needed, use `scancel JOB_ID` to exit the partition and return the requested resources, as in the sketch below.
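A minimal sketch of checking and releasing the allocation, assuming the example job ID 128 from above:

```bash
# List your jobs; the JOBID column shows the allocation ID
squeue -u $USER

# Release the interactive allocation when it is no longer needed
scancel 128
```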
In addition to `salloc`, `srun` can also be used to request an interactive node, though it is often used to run a specific script or job. `salloc`, on the other hand, allows users to run multiple commands once the resources are allocated.
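For instance, a single `srun` command can open an interactive shell directly on a compute node. This is only a sketch; the resource requests mirror the `salloc` example above.

```bash
# Request one task on the demand partition for one hour and open an
# interactive bash shell on the allocated node (--pty attaches a pseudo-terminal)
srun --ntasks=1 --partition=sealevel-c5xl-demand --time=01:00:00 --pty bash
```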