Efficiency
This page is a stub
As of now, this page is incomplete, possibly incorrect, and open for contributions.
There are multiple types of resources you may need. This page is about using HPC resources efficiently, i.e. how to schedule your HPC jobs optimally.
Profiling computer code is not covered here, as profiling can be done on any local computer instead of on the heavier compute resources.
How was this overview created?
This overview was created by going through all HPC cluster centers and merging the material they provided regarding this topic.
This is the material found:
HPC cluster name | Guide on how to improve efficiency |
---|---|
Alvis | None found |
Bianca | UPPMAX jobstats page |
COSMOS | None found |
COSMOS SENS | None found |
Dardel | None found |
Data Science Platform | None found |
Kebnekaise | None found |
LUMI | No guide |
Rackham | UPPMAX jobstats page |
Sigma | None found |
Tetralith | None found |
Trusted research environment | None found |
Vera | None found |
My center's guide is not linked!
If your center's guide is not linked, please contribute or contact us.
Additionally, searching for this topic turned up these sources:
- Southern Methodist University best practices guide
- Stack Overflow post on how to get the CPU and memory usage
- Blog post on using `seff` and `reportseff`
All this material was merged into one overview.
Here is a strategy to use your HPC resources effectively:
```mermaid
flowchart TD
  obtain_data[Obtain CPU and memory usage of a job]
  lower_limit_based_on_memory(Pick the number of cores to have enough memory)
  limited_by_cpu(For that amount of cores, would runtime be limited by CPU?)
  lower_limit_based_on_cpu(Increase the number of cores, so that on average, the right amount of CPUs is booked)
  done(Use that amount of cores)
  add_one(Increase the number of cores by one for safety)
  obtain_data --> lower_limit_based_on_memory
  lower_limit_based_on_memory --> limited_by_cpu
  limited_by_cpu --> |no| add_one
  limited_by_cpu --> |yes| lower_limit_based_on_cpu
  lower_limit_based_on_cpu --> done
  add_one --> done
```
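The strategy above can be sketched as a small Python function. This is a minimal illustration, not a cluster-specific tool: the function name `pick_cores`, the default of 8 GB of memory per booked core, and all numbers are invented assumptions; check your own cluster's memory-per-core value.

```python
# Minimal sketch of the core-picking strategy in the flowchart above.
# All names and numbers are illustrative assumptions, not cluster-specific values.
import math

def pick_cores(peak_mem_gb, avg_cpu_usage, mem_per_core_gb=8.0):
    """Pick a number of cores for a job: memory first, CPU second.

    peak_mem_gb: peak memory use measured for an earlier run of the job
    avg_cpu_usage: average CPU usage in cores (e.g. 3.5 = 3.5 cores busy)
    mem_per_core_gb: memory that booking one core provides (cluster-dependent)
    """
    # 1. Pick the number of cores to have enough memory
    cores = math.ceil(peak_mem_gb / mem_per_core_gb)
    # 2. For that amount of cores, would runtime be limited by CPU?
    if avg_cpu_usage > cores:
        # Yes: increase the cores so that, on average,
        # the right amount of CPUs is booked
        cores = math.ceil(avg_cpu_usage)
    else:
        # No: increase the number of cores by one for safety
        cores += 1
    return cores

# A job that peaked at 20 GB and averaged 1.2 busy cores:
# memory needs ceil(20 / 8) = 3 cores, CPU is not the limit, book 3 + 1 = 4.
print(pick_cores(20.0, 1.2))
```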
Why not look at CPU usage?
Because CPU usage is more flexible than memory usage: a job that runs out of memory fails, while a job short on CPU merely runs longer.
For example, imagine a job with a short CPU spike that could keep 16 CPUs busy. If one core provides enough memory, book one core: the CPU spike is spread out into 100% CPU use of that one core for a longer duration.
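The spike example can be put in numbers (all invented for illustration): the spike's total work in core-seconds stays the same, so fewer cores just means a proportionally longer spike.

```python
# Illustrative numbers only: a CPU spike's work can be spread over fewer cores.
spike_cores = 16      # cores the spike would keep busy, if available
spike_seconds = 1.0   # duration of the spike when using 16 cores
work = spike_cores * spike_seconds  # total work: 16 core-seconds

booked_cores = 1      # one core has enough memory, so book one core
duration_on_one_core = work / booked_cores
print(duration_on_one_core)  # 16.0 seconds at 100% use of that one core
```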
The first step, 'Obtain CPU and memory usage of a job', depends on your HPC cluster:
HPC cluster name | Tool and guide |
---|---|
Alvis | TODO |
Bianca | jobstats |
COSMOS | TODO |
COSMOS SENS | TODO |
Dardel | TODO |
Data Science Platform | TODO |
Kebnekaise | TODO |
LUMI | TODO |
Rackham | jobstats |
Sigma | TODO |
Tetralith | TODO |
Trusted research environment | TODO |
Vera | TODO |
Need a worked-out example?
Worked-out examples can be found on each page specific to the tool used.