2. General Cluster theory

2.1. What is a computer cluster?

A computer cluster is a set of loosely or tightly connected computers that work together so that, in many respects, they can be viewed as a single system.

Wikipedia

2.2. Parts of a computing cluster

In order to provide high-performance computation capabilities, clusters can combine hundreds to thousands of computers, called nodes, which are all interconnected with a high-performance communication network. Most nodes are designed for high-performance computations, but clusters can also use specialized nodes to provide parallel file systems, databases, login access and even the cluster scheduling functionality, as pictured in the image below.

[Figure cluster_overview2: overview of the types of nodes in a cluster]

We will go over the different types of nodes that you can encounter on a typical cluster.

2.2.1. The Login Nodes

To execute computing processes on a cluster, you must first connect to it, and this is accomplished through a login node. These so-called login nodes are the entry point to most clusters.

Another entry point to some clusters, such as the Mila cluster, is the JupyterHub web interface, but we’ll read about that later. For now, let’s return to the subject of this section: login nodes. To connect to these, you would typically use a remote shell connection. The most common tool to do so is SSH. You’ll hear and read a lot about this tool. Imagine it as a very long (and somewhat magical) extension cord which connects the computer you are using now, such as your laptop, to a remote computer’s terminal shell. You might already know what a terminal shell is if you have ever used the command line.
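
As a minimal sketch, connecting to a login node looks like the following, where the username and hostname are placeholders to be replaced by the values given in your cluster’s documentation:

    # Open a remote shell on a login node of the cluster.
    ssh myusername@login.examplecluster.org

Once the connection is established, the prompt in your terminal belongs to the login node rather than to your own machine.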

2.2.2. The Compute Nodes

In the field of artificial intelligence, you will usually be on the hunt for GPUs. In most clusters, the compute nodes are the ones with GPU capacity.

While the general trend is towards a homogeneous configuration of nodes, this is not always possible in the field of artificial intelligence, as the hardware evolves rapidly and is continually complemented by newer hardware. Hence, you will often read about computational node classes, some of which might have different GPU models or even no GPU at all. It is important to keep this in mind, as you’ll have to be aware of which nodes you are working on. More on that later.
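
For instance, once you are on a compute node (how you get there is covered in the batch scheduler section below), you can check which node and which GPU model you were given. This is only a sketch, and it assumes the node has NVIDIA GPUs:

    # Print the name of the node you are currently on.
    hostname

    # List the GPUs on the node, including their model and memory.
    nvidia-smi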

2.2.3. The Storage nodes

Some computers on a cluster have only one function, which is to serve files. While the names of these computers might matter to some, as a user you’ll only be concerned with the path to the data. More on that in the Processing data section.

2.2.4. Different nodes for different uses

It is important to note here the difference in intended uses between the compute nodes and the login nodes. While the Compute Nodes are meant for heavy computation, the Login Nodes are not.

The login nodes, however, are used by everyone who uses the cluster, and care must be taken not to overburden them. Consequently, only very short and light processes should be run on them; otherwise the cluster may become inaccessible. In other words, please refrain from executing long or compute-intensive processes on login nodes, because doing so affects all other users. In some cases, you may also find that doing so gets you into trouble.

2.3. UNIX

Clusters typically run GNU/Linux distributions. Hence, a minimal knowledge of GNU/Linux and Bash is usually required to use them. See the following tutorial for a rough guide on getting started with Linux.
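
If you have never used a command line before, a handful of commands already cover most day-to-day needs; this is only a small illustrative sample (the directory and file names are placeholders):

    pwd               # print the directory you are currently in
    ls -l             # list the files in the current directory, with details
    cd my_project     # move into another directory
    cat notes.txt     # print the contents of a file
    man ls            # read the manual page of a command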

2.4. The batch scheduler

Once connected to a login node, presumably with SSH, you can issue a job execution request to what is called the job scheduler. The job scheduler used on the Mila and Compute Canada clusters is called Slurm. The job scheduler’s main role is to find a place to run your program, in what is simply called a job. This “place” is in fact one of the many computers synchronised with the scheduler, which are called compute nodes.

In fact it’s a bit trickier than that, but we’ll stay at this abstraction level for now.
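
Concretely, a job is usually described by a small shell script in which specially formatted comments tell the scheduler which resources are needed, followed by the commands to run. The following is a minimal sketch; the resource values and the script name train.py are placeholders, and the exact options accepted vary between clusters:

    #!/bin/bash
    #SBATCH --job-name=my_experiment   # a label for the job
    #SBATCH --time=01:00:00            # requested wall-clock time (1 hour)
    #SBATCH --cpus-per-task=2          # requested CPU cores
    #SBATCH --mem=8G                   # requested memory
    #SBATCH --gres=gpu:1               # request one GPU

    # Everything below runs on the compute node allocated by the scheduler.
    python train.py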

2.4.1. Slurm

Resource sharing on a supercomputer/cluster is orchestrated by a resource manager/job scheduler. Users submit jobs, which are scheduled and allocated resources (CPU time, memory, GPUs, etc.) by the resource manager. If the requested resources are available, the job can start; otherwise it is placed in a queue.

On a cluster, users don’t have direct access to the compute nodes. Instead, they connect to a login node and pass the commands they would like to execute to the workload manager in the form of a script.

Both Mila and Compute Canada use the Slurm workload manager to schedule and allocate resources on their infrastructure.

Slurm client commands are available on the login nodes for you to submit jobs to the main controller and add them to the queue. Jobs are of two types: batch jobs and interactive jobs.
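
As a rough sketch of the two job types (job.sh refers to a script like the one sketched above, and the resource flags are placeholders):

    # Batch job: the script is queued and runs unattended once resources are free.
    sbatch job.sh

    # Interactive job: request resources and open a shell directly on a compute node.
    srun --time=01:00:00 --gres=gpu:1 --pty bash

    # Check the state of your queued and running jobs.
    squeue -u $USER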

For practical examples of SLURM commands on the Mila cluster, see 1.2   Running your code.

2.5. Processing data

Clusters have different types of file systems to support different data storage use cases. We differentiate them by name. You’ll hear or read about file systems such as “home”, “scratch” or “project” and so on.

Most of these file systems are provided in a way which makes them globally available to all nodes in the cluster: software or data required by jobs can be accessed from any node. (See the Mila or Compute Canada documentation for more information on the available file systems.)

Different file systems have different performance levels. For instance, backed-up file systems (such as $PROJECT) provide more space and can handle large files, but cannot sustain the highly parallel accesses typically required for high-speed model training.

Each compute node also has local file systems (such as $SLURM_TMPDIR) which are usually more efficient, but any data remaining on them is erased at the end of the job execution, before the next job comes along.
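
A common pattern inside a job is therefore to copy the input data onto the node-local file system at the start, work from that copy, and move the results you want to keep back to a network file system before the job ends. This is only a sketch; the dataset, script and output paths are placeholders:

    # Stage the input data from a network file system onto the fast local disk.
    cp -r ~/datasets/my_dataset $SLURM_TMPDIR/

    # Run the workload against the local copy.
    python train.py --data $SLURM_TMPDIR/my_dataset --output $SLURM_TMPDIR/results

    # Copy the results back before the job ends, since $SLURM_TMPDIR is erased.
    cp -r $SLURM_TMPDIR/results ~/my_project/results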

2.6. Software dependency management and associated challenges

This section aims to raise awareness of the problems one can encounter when trying to run software on different computers, and of how these are dealt with on typical computation clusters.

2.6.1. Python Virtual environments

TODO
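
As a minimal sketch of the usual workflow with Python’s built-in venv module (the environment location and package names are placeholders):

    # Create an isolated Python environment in a directory of your choice.
    python -m venv ~/venvs/my_env

    # Activate it so that python and pip now refer to this environment.
    source ~/venvs/my_env/bin/activate

    # Install the packages your project needs.
    pip install numpy

    # Leave the environment when you are done.
    deactivate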

2.6.2. Cluster software modules

Both the Mila and Compute Canada clusters provide various software through the module command. Modules are small files which modify your environment variables to point to the correct location of the software you wish to use. For practical examples of module use, see 1.3.3.1   The module command.
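
As a rough sketch of how modules are typically used (the module name and version shown here are placeholders; run module avail to see what your cluster actually provides):

    # List the software modules available on the cluster.
    module avail

    # Load a module, which adjusts PATH and related environment variables.
    module load python/3.10

    # Show the currently loaded modules, then unload them all.
    module list
    module purge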

2.6.3. Containers

Containers are a special form of isolation of software and its dependencies. A container does not only provide a separate file system, but can also provide a separate network and execution environment. All the software you use for your experiments is packaged inside one file. You simply copy the image of the container you built to every environment, without the need to install anything.
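
On computation clusters, containers are typically run with a runtime such as Singularity/Apptainer rather than Docker, since it does not require root privileges. The following is only a sketch, assuming such a runtime is installed and that my_image.sif is a container image you have already built:

    # Open a shell inside the container; the file system you see is the container's.
    singularity shell my_image.sif

    # Execute a single command inside the container (train.py is a placeholder).
    singularity exec my_image.sif python train.py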