3. Mila research computing infrastructure information and policies
This section provides factual information and policies on the Mila cluster computing environments.
3.1. Roles and authorisations
There are two main researcher statuses at Mila:
Core researchers
Affiliated researchers
This is determined by Mila policy. Core researchers have access to the Mila computing cluster. Your own status follows from your supervisor's Mila status.
3.2. Overview of available computing resources at Mila
The Mila cluster is to be used for regular development and a relatively small number of jobs (< 5). It is a heterogeneous cluster and uses SLURM to schedule jobs.
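As a sketch of how jobs are submitted through SLURM, a minimal batch script might look like the following (the resource values and `train.py` are illustrative placeholders, not recommendations):

```shell
#!/bin/bash
# Minimal SLURM batch script sketch (all resource values are illustrative).
#SBATCH --job-name=myexp          # hypothetical job name
#SBATCH --gres=gpu:1              # request one GPU
#SBATCH --cpus-per-task=8         # request 8 CPU cores
#SBATCH --mem=32G                 # request 32 GB of RAM
#SBATCH --time=12:00:00           # 12-hour time limit

# train.py is a placeholder for your own entry point.
python train.py
```

Such a script is submitted with `sbatch job.sh`; for interactive work, `srun --gres=gpu:1 --pty bash` opens a shell on a compute node instead.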
3.3. Node profile description
Name | GPU (model × count) | CPUs | Sockets | Cores/Socket | Threads/Core | Memory (GB) | TmpDisk (TB) | Arch | GPU Arch and Memory
---|---|---|---|---|---|---|---|---|---
kepler[2-3] | k80 × 8 | 16 | 2 | 4 | 2 | 256 | 3.6 | x86_64 | tesla,12GB
kepler4 | m40 × 4 | 16 | 2 | 4 | 2 | 256 | 3.6 | x86_64 | maxwell,24GB
kepler5 | v100 × 2, m40 × 1 | 16 | 2 | 4 | 2 | 256 | 3.6 | x86_64 | volta,12GB
mila01 | v100 × 8 | 80 | 2 | 20 | 2 | 512 | 7 | x86_64 | tesla,16GB
mila02 | v100 × 8 | 80 | 2 | 20 | 2 | 512 | 7 | x86_64 | tesla,32GB
mila03 | v100 × 8 | 80 | 2 | 20 | 2 | 512 | 7 | x86_64 | tesla,32GB
power9[1-2] | v100 × 4 | 128 | 2 | 16 | 4 | 586 | 0.88 | power9 | tesla,nvlink,16gb
rtx[6,9] | titanrtx × 2 | 20 | 1 | 10 | 2 | 128 | 3.6 | x86_64 | turing,24gb
rtx[1-5,7-8] | titanrtx × 2 | 20 | 1 | 10 | 2 | 128 | 0.93 | x86_64 | turing,24gb
cn-a[01-11] | rtx8000 × 8 | 80 | 2 | 20 | 2 | 380 | 3.6 | x86_64 | turing,48gb
cn-b[01-05] | v100 × 8 | 80 | 2 | 20 | 2 | 380 | 3.6 | x86_64 | tesla,nvlink,32gb
cn-c[01-40] | rtx8000 × 8 | 64 | 2 | 32 | 1 | 386 | 3 | x86_64 | turing,48gb
cn-d[01-02] | A100 × 8 | 256 | 8 | 16 | 2 | 1032 | 1.4 | x86_64 | ampere,40gb
3.3.1. Special nodes and outliers
3.3.1.1. Power9
Power9 servers use a different processor instruction set than Intel and AMD (x86_64), so you need to set up your environment again specifically for these nodes.
Power9 machines have 128 threads (2 processors × 16 cores × 4-way SMT) and 4 × V100 SXM2 (16 GB) GPUs with NVLink.
In a Power9 machine, GPUs and CPUs communicate with each other over NVLink instead of PCIe, which allows them to exchange data quickly. More on LMS.
Power9 nodes run the same software stack as the regular nodes, and you should be able to deploy your environment on them as on a regular node.
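Because packages built for x86_64 will not run on the Power9 architecture, one way to keep the two environments separate is to key them on the architecture reported by `uname -m`; a hypothetical sketch ("myproject" is a placeholder name):

```shell
# Keep one conda environment per CPU architecture so that x86_64 and
# Power9 (ppc64le) package builds never mix.
ARCH=$(uname -m)   # "x86_64" on regular nodes, "ppc64le" on Power9
conda activate "myproject-${ARCH}" 2>/dev/null \
  || conda create -y -n "myproject-${ARCH}" python=3.8
```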
3.3.1.2. AMD
Warning
As of August 20, the GPUs had to be returned to AMD. Mila will receive more samples. You can join the AMD Slack channels to get the latest information.
Mila has a few nodes equipped with MI50 GPUs.
```shell
srun --gres=gpu -c 8 --reservation=AMD --pty bash
```

First-time setup of the AMD software stack:

```shell
conda create -n rocm python=3.6
conda activate rocm

pip install tensorflow-rocm
pip install /wheels/pytorch/torch-1.1.0a0+d8b9d32-cp36-cp36m-linux_x86_64.whl
```
3.4. Data Sharing Policies
/miniscratch supports ACLs to allow collaborative work on rapidly changing data, e.g. work-in-progress datasets, model checkpoints, etc.
/network/projects aims to offer a collaborative space for long-term projects. Data that should be kept for longer than 90 days can be stored in that location, but a request to Mila's helpdesk has to be made first.
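As a sketch of how ACLs on /miniscratch can be used for sharing (the directory name and the collaborator's username are made up for illustration):

```shell
# Create a shared directory under /miniscratch.
mkdir -p /miniscratch/$USER/shared_dataset

# Give a (hypothetical) collaborator read/write/traverse access...
setfacl -m u:collaborator:rwx /miniscratch/$USER/shared_dataset
# ...and set a default ACL so newly created files inherit the same access.
setfacl -d -m u:collaborator:rwx /miniscratch/$USER/shared_dataset

# Inspect the resulting ACL.
getfacl /miniscratch/$USER/shared_dataset
```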
3.5. Monitoring
Every compute node on the Mila cluster runs a monitoring daemon that lets you check the resource usage of your model and identify bottlenecks.
You can access the monitoring web page by typing <node>.server.mila.quebec:19999 in your browser. For example, if you have a job running on eos1, you can open eos1.server.mila.quebec:19999 and the monitoring page for that node should appear.
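Port 19999 suggests the monitoring daemon is Netdata; assuming that is the case, the same metrics can be queried from the command line through its REST API, for example:

```shell
# Fetch the last 60 seconds of CPU metrics, as JSON, from the (assumed)
# Netdata daemon on node eos1.
curl -s "http://eos1.server.mila.quebec:19999/api/v1/data?chart=system.cpu&after=-60"
```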
3.5.1. Notable Sections
You should focus your attention on the metrics below.
CPU
- iowait (pink line): high values mean your model is waiting a lot on IO (disk or network).
RAM
- Make sure you allocate only as much memory as your code needs to run; allocating more wastes resources.
NV
- Usage of each GPU.
- You should make sure you use the GPU to its fullest:
  - Select the biggest batch size possible.
  - Spawn multiple experiments.
Users
- If the machine seems slow, it may be useful to check whether other people are using it as well.
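When logged into the node itself, standard command-line tools give a quick view of the same metrics as the web dashboard:

```shell
nvidia-smi   # per-GPU utilization, memory use and running processes
htop         # per-process CPU and RAM usage
who          # other users currently logged into the node
```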
3.6. Storage
Path | Performance | Usage | Quota (Space/Files) | Auto-cleanup
---|---|---|---|---
$HOME | Low | Code, libraries and small experimental results | 200G/1000K |
projects | Fair | Long-term collaborative projects | 200G/1000K |
datasets | High | Curated datasets | |
data1 | High | Compressed datasets (deprecated) | |
miniscratch | High | Work-in-progress datasets and temporary job results | | 90 days
$SLURM_TMPDIR | Highest | Per-job node-local disk | 4T/- | at job end
$HOME is appropriate for code and libraries which are small and read once, as well as experimental results that will be needed at a later time (e.g. the weights of a network referenced in a paper).
projects can be used for collaborative projects. It aims to ease the sharing of data between users working on a long-term project. It's possible to request a bigger quota if the project requires it.
datasets contains curated datasets for the benefit of the Mila community. To request the addition of a dataset or a preprocessed dataset you think could benefit the research of others, you can fill this form.
data1 should only contain compressed datasets. It is now deprecated and replaced by the datasets space.
miniscratch can be used to store processed datasets, work-in-progress datasets or temporary job results. Its block size is optimized for small files, which minimizes the performance hit of working on extracted datasets. It supports ACLs, which can be used to share data between users. This space is cleared weekly, and files older than 90 days will be deleted.
$SLURM_TMPDIR points to the local disk of the node on which a job is running. It should be used to copy the data onto the node at the beginning of the job and to write intermediate checkpoints. This folder is cleared after each job.
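The $SLURM_TMPDIR workflow described above can be sketched as a batch script (the dataset path, script name and output locations are hypothetical):

```shell
#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --time=06:00:00

# Copy the dataset to the node-local disk at the start of the job
# ("mydataset" is a placeholder).
cp /network/datasets/mydataset.tar "$SLURM_TMPDIR/"
tar -xf "$SLURM_TMPDIR/mydataset.tar" -C "$SLURM_TMPDIR"

# Read data and write intermediate checkpoints on the fast local disk.
python train.py --data "$SLURM_TMPDIR/mydataset" --checkpoints "$SLURM_TMPDIR/ckpt"

# $SLURM_TMPDIR is cleared at job end, so copy results back first.
cp -r "$SLURM_TMPDIR/ckpt" "$HOME/results/"
```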
Note
Auto-cleanup is applied to files that have not been read or modified during the specified period.
Warning
Currently there is no backup system in the lab. Storage local to personal computers, Google Drive and other similar solutions should be used to back up important data.