3. Mila research computing infrastructure information and policies

This section provides factual information and policies on the Mila cluster computing environments.

3.1. Roles and authorisations

There are mainly two types of researcher status at Mila:

  1. Core researchers

  2. Affiliated researchers

This is determined by Mila policy. Core researchers have access to the Mila computing cluster. Check your supervisor’s Mila status to determine your own.

3.2. Overview of available computing resources at Mila

The Mila cluster is to be used for regular development and relatively small numbers of jobs (< 5). It is a heterogeneous cluster and uses SLURM to schedule jobs.
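As a minimal illustrative sketch (the resource values here are arbitrary, not recommendations), an interactive session can be requested from SLURM as follows:

# Request an interactive shell with one GPU, 4 CPU cores and 16 GB of RAM:
srun --gres=gpu:1 -c 4 --mem=16G --pty bash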

3.3. Node profile description

Name              GPUs (model x count)  CPUs  Sockets  Cores/Socket  Threads/Core  Memory (GB)  TmpDisk (TB)  Arch    Features

KEPLER
kepler[2-3]       k80 x 8               16    2        4             2             256          3.6           x86_64  tesla,12GB
kepler4           m40 x 4               16    2        4             2             256          3.6           x86_64  maxwell,24GB
kepler5           v100 x 2, m40 x 1     16    2        4             2             256          3.6           x86_64  volta,12GB

MILA
mila01            v100 x 8              80    2        20            2             512          7             x86_64  tesla,16GB
mila02            v100 x 8              80    2        20            2             512          7             x86_64  tesla,32GB
mila03            v100 x 8              80    2        20            2             512          7             x86_64  tesla,32GB

POWER9
power9[1-2]       v100 x 4              128   2        16            4             586          0.88          power9  tesla,nvlink,16gb

TITAN RTX
rtx[6,9]          titanrtx x 2          20    1        10            2             128          3.6           x86_64  turing,24gb
rtx[1-5,7-8]      titanrtx x 2          20    1        10            2             128          0.93          x86_64  turing,24gb

New Compute Nodes
cn-a[01-11]       rtx8000 x 8           80    2        20            2             380          3.6           x86_64  turing,48gb
cn-b[01-05]       v100 x 8              80    2        20            2             380          3.6           x86_64  tesla,nvlink,32gb
cn-c[01-40]       rtx8000 x 8           64    2        32            1             386          3             x86_64  turing,48gb
cn-d[01-02]       A100 x 8              256   8        16            2             1032         1.4           x86_64  ampere,40gb

(kepler5 carries two v100 as its primary GPUs and one m40 as a secondary GPU.)
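Assuming the entries in the Features column map to SLURM node features (a sketch, not a confirmed scheduler configuration), a particular GPU class could be targeted with a constraint:

# Request one GPU on a node advertising the "turing" feature:
srun --gres=gpu:1 --constraint=turing -c 4 --pty bash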

3.3.1. Special Nodes and outliers

3.3.1.1. Power9

Power9 servers use a different processor instruction set than Intel and AMD (x86_64). As such, you need to set up your environment again specifically for those nodes.

  • Power9 machines have 128 threads (2 processors / 16 cores / 4-way SMT)

  • 4 x V100 SXM2 (16 GB) with NVLink

  • In a Power9 machine, GPUs and CPUs communicate with each other using NVLink instead of PCIe.

This allows them to communicate quickly with each other. More information is available in IBM’s Large Model Support (LMS) documentation.

Power9 nodes have the same software stack as the regular nodes, and the software needed to deploy your environment as on a regular node should be included.
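A minimal way to confirm which architecture you are on, and a sketch of recreating an environment there (the environment name is hypothetical):

uname -m    # prints "ppc64le" on Power9 nodes, "x86_64" elsewhere

# Environments built on x86_64 nodes will not run here; recreate them, e.g.:
conda create -n myenv-ppc64le python=3.6
conda activate myenv-ppc64le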

3.3.1.2. AMD

Warning

As of August 20, the GPUs had to be returned to AMD. Mila will get more samples. You can join the AMD Slack channels to get the latest information.

Mila has a few nodes equipped with MI50 GPUs.

srun --gres=gpu -c 8 --reservation=AMD --pty bash

# First-time setup of the AMD ROCm stack
conda create -n rocm python=3.6
conda activate rocm

pip install tensorflow-rocm
pip install /wheels/pytorch/torch-1.1.0a0+d8b9d32-cp36-cp36m-linux_x86_64.whl

3.4. Data Sharing Policies

/miniscratch supports ACLs to allow collaborative work on rapidly changing data, e.g. work-in-progress datasets, model checkpoints, etc.
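As a sketch of how such ACLs can be used (the directory and the collaborator’s username are hypothetical):

# Give a collaborator read/write/traverse access to a shared directory:
setfacl -R -m u:collaborator:rwx /miniscratch/my_shared_dir

# Also set a default ACL so files created later inherit the permissions:
setfacl -d -m u:collaborator:rwx /miniscratch/my_shared_dir

# Check the resulting ACL entries:
getfacl /miniscratch/my_shared_dir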

/network/projects aims to offer a collaborative space for long-term projects. Data that should be kept for longer than 90 days can be stored in that location, but a request must first be made to Mila’s helpdesk.

3.5. Monitoring

Every compute node on the Mila cluster has a monitoring daemon allowing you to check the resource usage of your model and identify bottlenecks. You can access the monitoring web page by typing in your browser: <node>.server.mila.quebec:19999.

For example, if I have a job running on eos1 I can type eos1.server.mila.quebec:19999 and the page below should appear.

monitoring.png
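If you are unsure which node your job landed on, here is one way to find it (the squeue output format string is just one possible choice):

# List your running jobs together with their job IDs, names and node names:
squeue -u $USER -o "%.10i %.20j %.10N"
# Then open http://<node>.server.mila.quebec:19999 in your browser.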

3.5.1. Notable Sections

You should focus your attention on the metrics below:

  • CPU

    • iowait (pink line): high values mean your model is waiting a lot on IO (disk or network)

monitoring_cpu.png
  • RAM

    • Make sure you are only allocating enough memory to make your code run and not more; otherwise you are wasting resources.

monitoring_ram.png
  • NV

    • Usage of each GPU

    • You should make sure you use the GPU to its fullest; a quick way to check utilization from the command line is sketched after this list

      • Select the biggest batch size if possible

      • Spawn multiple experiments

monitoring_gpu.png
  • Users

    • In some cases the machine might seem slow; it may be useful to check whether other people are using it as well

monitoring_users.png
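As a complement to the monitoring page, GPU utilization can be checked directly from a shell on the compute node (for example inside an interactive srun session); this is a generic sketch for NVIDIA nodes:

# Refresh GPU utilization and memory usage every second:
watch -n 1 nvidia-smi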

3.6. Storage

Path                                 Performance  Usage                                            Quota (Space/Files)  Auto-cleanup

$HOME or /home/mila/<u>/<username>/  Low          Personal user space; specific libraries,         200G/1000K
                                                  code, binaries
/network/projects/<groupname>/       Fair         Shared space to facilitate collaboration         200G/1000K
                                                  between researchers; long-term project storage
/network/data1/                      High         Raw datasets (read only)
/network/datasets/                   High         Curated raw datasets (read only)
/miniscratch/                        High         Temporary job results; processed datasets;                            90 days
                                                  optimized for small files; supports ACLs to
                                                  help share data with others
$SLURM_TMPDIR                        Highest      High-speed disk for temporary job results        4T/-                 at job end

  • $HOME is appropriate for code and libraries, which are small and read once, as well as experimental results that will be needed at a later time (e.g. the weights of a network referenced in a paper).

  • projects can be used for collaborative projects. It aims to ease the sharing of data between users working on a long-term project. It’s possible to request a bigger quota if the project requires it.

  • datasets contains curated datasets for the benefit of the Mila community. To request the addition of a dataset or a preprocessed dataset you think could benefit the research of others, you can fill in this form.

  • data1 should only contain compressed datasets. Now deprecated and replaced by the datasets space.

  • miniscratch can be used to store processed datasets, work-in-progress datasets or temporary job results. Its blocksize is optimized for small files, which minimizes the performance hit of working on extracted datasets. It supports ACLs, which can be used to share data between users. This space is cleared weekly, and files older than 90 days will be deleted.

  • $SLURM_TMPDIR points to the local disk of the node on which a job is running. It should be used to copy the data onto the node at the beginning of the job and to write intermediate checkpoints. This folder is cleared after each job. A typical pattern is sketched below.
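A minimal job-script sketch of this pattern; the dataset archive, script name and paths are hypothetical:

#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH -c 4

# Copy the dataset to the fast node-local disk at the start of the job:
cp $HOME/my_dataset.tar $SLURM_TMPDIR/
tar -xf $SLURM_TMPDIR/my_dataset.tar -C $SLURM_TMPDIR

# Train, writing intermediate checkpoints to the local disk:
python train.py --data $SLURM_TMPDIR/my_dataset --checkpoints $SLURM_TMPDIR/ckpt

# $SLURM_TMPDIR is cleared at job end, so copy back anything worth keeping:
cp -r $SLURM_TMPDIR/ckpt $HOME/results/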

Note

Auto-cleanup is applied to files not read or modified during the specified period.
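A sketch for spotting files at risk under the 90-day rule on /miniscratch; the per-user directory layout is an assumption, and access time is only one of the criteria mentioned above:

# List your files not accessed in more than 90 days:
find /miniscratch/$USER -type f -atime +90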

Warning

Currently there is no backup system in the lab. Storage local to personal computers, Google Drive and other related solutions should be used to back up important data.