3. Mila research computing infrastructure information and policies

This section provides factual information and policies on the Mila cluster computing environments.

3.1. Roles and authorisations

There are mainly two types of researcher status at Mila:

  1. Core researchers

  2. Affiliated researchers

This is determined by Mila policy. Core researchers have access to the Mila computing cluster. Check your supervisor’s Mila status to determine your own.

3.2. Overview of available computing resources at Mila

The Mila cluster is to be used for regular development and relatively small numbers of jobs (< 5). It is a heterogeneous cluster and uses SLURM to schedule jobs.
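As a minimal illustrative sketch (the resource values here are arbitrary, not recommendations), an interactive session can be requested from SLURM as follows:

# Request an interactive shell with one GPU, 4 CPU cores and 16 GB of RAM:
srun --gres=gpu:1 -c 4 --mem=16G --pty bash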

3.3. Node profile description

Name              GPUs (model x count)  CPUs  Sockets  Cores/Socket  Threads/Core  Memory (GB)  TmpDisk (TB)  Arch    Features

KEPLER
kepler[2-3]       k80 x 8               16    2        4             2             256          3.6           x86_64  tesla,12GB
kepler4           m40 x 4               16    2        4             2             256          3.6           x86_64  maxwell,24GB
kepler5           v100 x 2, m40 x 1     16    2        4             2             256          3.6           x86_64  volta,12GB

MILA
mila01            v100 x 8              80    2        20            2             512          7             x86_64  tesla,16GB
mila02            v100 x 8              80    2        20            2             512          7             x86_64  tesla,32GB
mila03            v100 x 8              80    2        20            2             512          7             x86_64  tesla,32GB

POWER9
power9[1-2]       v100 x 4              128   2        16            4             586          0.88          power9  tesla,nvlink,16gb

TITAN RTX
rtx[6,9]          titanrtx x 2          20    1        10            2             128          3.6           x86_64  turing,24gb
rtx[1-5,7-8]      titanrtx x 2          20    1        10            2             128          0.93          x86_64  turing,24gb

New Compute Nodes
cn-a[01-11]       rtx8000 x 8           80    2        20            2             380          3.6           x86_64  turing,48gb
cn-b[01-05]       v100 x 8              80    2        20            2             380          3.6           x86_64  tesla,nvlink,32gb
cn-c[01-40]       rtx8000 x 8           64    2        32            1             386          3             x86_64  turing,48gb
cn-d[01-02]       A100 x 8              256   8        16            2             1032         1.4           x86_64  ampere,40gb

(kepler5 carries two v100 as its primary GPUs and one m40 as a secondary GPU.)
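Assuming the entries in the Features column map to SLURM node features (a sketch, not a confirmed scheduler configuration), a particular GPU class could be targeted with a constraint:

# Request one GPU on a node advertising the "turing" feature:
srun --gres=gpu:1 --constraint=turing -c 4 --pty bash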

3.3.1. Special Nodes and outliers

3.3.1.1. Power9

Power9 servers use a different processor instruction set than Intel and AMD (x86_64). As such, you need to set up your environment again specifically for those nodes.

  • Power9 machines have 128 threads (2 processors / 16 cores / 4-way SMT)

  • 4 x V100 SXM2 (16 GB) with NVLink

  • In a Power9 machine, GPUs and CPUs communicate with each other using NVLink instead of PCIe.

This allows them to communicate quickly with each other. More information is available in IBM’s Large Model Support (LMS) documentation.

Power9 nodes have the same software stack as the regular nodes, and the software needed to deploy your environment as on a regular node should be included.
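A minimal way to confirm which architecture you are on, and a sketch of recreating an environment there (the environment name is hypothetical):

uname -m    # prints "ppc64le" on Power9 nodes, "x86_64" elsewhere

# Environments built on x86_64 nodes will not run here; recreate them, e.g.:
conda create -n myenv-ppc64le python=3.6
conda activate myenv-ppc64le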

3.3.1.2. AMD

Warning

As of August 20, the GPUs had to be returned to AMD. Mila will get more samples. You can join the AMD Slack channels to get the latest information.

Mila has a few nodes equipped with MI50 GPUs.

srun --gres=gpu -c 8 --reservation=AMD --pty bash

# First-time setup of the AMD ROCm stack
conda create -n rocm python=3.6
conda activate rocm

pip install tensorflow-rocm
pip install /wheels/pytorch/torch-1.1.0a0+d8b9d32-cp36-cp36m-linux_x86_64.whl

3.4. Data Sharing Policies

/miniscratch supports ACLs to allow collaborative work on rapidly changing data, e.g. work-in-progress datasets, model checkpoints, etc.
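As a sketch of how such ACLs can be used (the directory and the collaborator’s username are hypothetical):

# Give a collaborator read/write/traverse access to a shared directory:
setfacl -R -m u:collaborator:rwx /miniscratch/my_shared_dir

# Also set a default ACL so files created later inherit the permissions:
setfacl -d -m u:collaborator:rwx /miniscratch/my_shared_dir

# Check the resulting ACL entries:
getfacl /miniscratch/my_shared_dir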

/network/projects aims to offer a collaborative space for long-term projects. Data that should be kept for longer than 90 days can be stored in that location, but a request must first be made to Mila’s helpdesk.

3.5. Monitoring

Every compute node on the Mila cluster has a monitoring daemon allowing you to check the resource usage of your model and identify bottlenecks. You can access the monitoring web page by typing in your browser: <node>.server.mila.quebec:19999.

For example, if I have a job running on eos1 I can type eos1.server.mila.quebec:19999 and the page below should appear.

monitoring.png
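If you are unsure which node your job landed on, here is one way to find it (the squeue output format string is just one possible choice):

# List your running jobs together with their job IDs, names and node names:
squeue -u $USER -o "%.10i %.20j %.10N"
# Then open http://<node>.server.mila.quebec:19999 in your browser.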

3.5.1. Notable Sections

You should focus your attention on the metrics below:

  • CPU

    • iowait (pink line): high values mean your model is waiting a lot on IO (disk or network)

monitoring_cpu.png
  • RAM

    • Make sure you are only allocating enough memory to make your code run and not more; otherwise you are wasting resources.

monitoring_ram.png
  • NV

    • Usage of each GPU

    • You should make sure you use the GPU to its fullest; a quick way to check utilization from the command line is sketched after this list

      • Select the biggest batch size if possible

      • Spawn multiple experiments

monitoring_gpu.png
  • Users

    • In some cases the machine might seem slow; it may be useful to check whether other people are using it as well

monitoring_users.png
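As a complement to the monitoring page, GPU utilization can be checked directly from a shell on the compute node (for example inside an interactive srun session); this is a generic sketch for NVIDIA nodes:

# Refresh GPU utilization and memory usage every second:
watch -n 1 nvidia-smi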

3.6. Storage

Path                                 Performance  Usage                                            Quota (Space/Files)  Auto-cleanup

$HOME or /home/mila/<u>/<username>/  Low          Personal user space; specific libraries,         200G/1000K
                                                  code, binaries
/network/projects/<groupname>/       Fair         Shared space to facilitate collaboration         200G/1000K
                                                  between researchers; long-term project storage
/network/data1/                      High         Raw datasets (read only)
/network/datasets/                   High         Curated raw datasets (read only)
/miniscratch/                        High         Temporary job results; processed datasets;                            90 days
                                                  optimized for small files; supports ACLs to
                                                  help share data with others
$SLURM_TMPDIR                        Highest      High-speed disk for temporary job results        4T/-                 at job end

  • $HOME is appropriate for code and libraries, which are small and read once, as well as experimental results that will be needed at a later time (e.g. the weights of a network referenced in a paper).

  • projects can be used for collaborative projects. It aims to ease the sharing of data between users working on a long-term project. It’s possible to request a bigger quota if the project requires it.

  • datasets contains curated datasets for the benefit of the Mila community. To request the addition of a dataset or a preprocessed dataset you think could benefit the research of others, you can fill in this form.

  • data1 should only contain compressed datasets. Now deprecated and replaced by the datasets space.

  • miniscratch can be used to store processed datasets, work-in-progress datasets or temporary job results. Its blocksize is optimized for small files, which minimizes the performance hit of working on extracted datasets. It supports ACLs, which can be used to share data between users. This space is cleared weekly, and files older than 90 days will be deleted.

  • $SLURM_TMPDIR points to the local disk of the node on which a job is running. It should be used to copy the data onto the node at the beginning of the job and to write intermediate checkpoints. This folder is cleared after each job. A typical pattern is sketched below.
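A minimal job-script sketch of this pattern; the dataset archive, script name and paths are hypothetical:

#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH -c 4

# Copy the dataset to the fast node-local disk at the start of the job:
cp $HOME/my_dataset.tar $SLURM_TMPDIR/
tar -xf $SLURM_TMPDIR/my_dataset.tar -C $SLURM_TMPDIR

# Train, writing intermediate checkpoints to the local disk:
python train.py --data $SLURM_TMPDIR/my_dataset --checkpoints $SLURM_TMPDIR/ckpt

# $SLURM_TMPDIR is cleared at job end, so copy back anything worth keeping:
cp -r $SLURM_TMPDIR/ckpt $HOME/results/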

Note

Auto-cleanup is applied to files not read or modified during the specified period.
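A sketch for spotting files at risk under the 90-day rule on /miniscratch; the per-user directory layout is an assumption, and access time is only one of the criteria mentioned above:

# List your files not accessed in more than 90 days:
find /miniscratch/$USER -type f -atime +90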

Warning

Currently there is no backup system in the lab. Storage local to personal computers, Google Drive and other related solutions should be used to back up important data.