PLGrid Documentation

What is HPC?

High-performance computing (HPC) is the use of supercomputers and highly parallel algorithms to solve problems that would take a very long time on a regular computer. Typical examples include drug design, chemical reaction simulations, weather forecasting, training large language models, aerodynamics, protein-folding simulations, and many more.

Why use HPC and supercomputers?

Consider using HPC resources when:

  • the computation would take too long on a normal computer,
  • the task needs a lot of memory or disk space,
  • you need to run a very large number of jobs at once.

What is a cluster/supercomputer?

A cluster is a collection of high-performance servers (called nodes) connected with a fast network. Each node has powerful multi-core processors (CPU) and/or specialised accelerators (GPU), allowing many operations to run in parallel. A supercomputer is simply a prestigious name for a large cluster.

Compared to a personal computer, the main difference is scale. A laptop has a processor with a few cores, while a supercomputer can have millions. By spreading a task across many cores, week-long computations shrink to hours or minutes. Using specialised GPU units can accelerate things even further. Depending on the workload, a single GPU can deliver the performance of hundreds of CPUs, though CPUs remain more universal and handle diverse tasks better.

Supercomputers also provide very large amounts of high-bandwidth RAM. They offer massive storage spaces, typically in petabytes, and use high-performance distributed filesystems (e.g. Lustre). These aspects allow users to work with huge datasets by removing memory and disk-access bottlenecks that can happen on a regular computer.

HPC is not only powerful hardware. An essential part is the software installed by system administrators, including:

  • compilers and interpreters for various programming languages,
  • ready-to-use libraries such as BLAS/LAPACK,
  • debugging and performance-analysis tools,
  • domain-specific applications (e.g. VASP, GROMACS, Ansys).

Using them does not require programming knowledge. This software is usually prebuilt, i.e. compiled and configured to ensure high performance on the given cluster. This allows users to focus on efficiently running these applications rather than troubleshooting environment configuration. HPC centres often hold appropriate site licenses, providing access to commercial software as well.
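On many HPC systems, including PLGrid clusters, this preinstalled software is exposed through environment modules. A minimal sketch of a typical session is shown below; the module names and versions are illustrative placeholders and differ between clusters.

```shell
# List the software available on the cluster (names and versions are site-specific)
module avail

# Load a compiler and a numerical library into the current session
# (gcc/12.2.0 and openblas/0.3.21 are example names, not guaranteed to exist)
module load gcc/12.2.0
module load openblas/0.3.21

# Show which modules are currently loaded
module list
```

Loading a module adjusts environment variables such as `PATH` and `LD_LIBRARY_PATH`, so the loaded tools behave as if they were installed system-wide.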

Main advantages of supercomputers:

  • very large numbers of multi-core CPUs,
  • access to accelerators (GPUs),
  • large and fast RAM,
  • filesystems with huge capacity,
  • an environment for developing your own code,
  • access to ready-to-use domain-specific applications.

How to work on a supercomputer?

(Figure: job submission scheme on a supercomputer)

Working on a supercomputer is similar to using a standard computer, with one key difference: all users’ jobs are managed by a queueing system, to which users submit jobs, i.e. sets of commands to be executed on the cluster. The most commonly used system is Slurm. Unlike on a personal computer, you cannot simply start a program at any time directly from the terminal.

After logging in, the user lands on the login node. This node is shared by all users and is not intended for computations. On the login node, users create batch scripts containing a specification of the required resources and the commands needed to run the calculation. These scripts are then used to submit jobs to the queueing system. The nodes where the actual computations take place are called compute nodes, and their numbers typically reach hundreds or even thousands.
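From the login node you can inspect the cluster's partitions and compute nodes with Slurm's `sinfo` command; the output below is only a hypothetical illustration of its shape, as partition names and node counts are site-specific.

```shell
# Show partitions, their time limits, and the state of their compute nodes
sinfo

# Example (hypothetical) output:
# PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
# plgrid*      up 3-00:00:00    512   idle c[0001-0512]
# plgrid-gpu   up 2-00:00:00     64  alloc g[001-064]
```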

Additional parameters in a batch script define the computational resources the user wants to use for their job. These include the number of reserved nodes, CPU cores or accelerators, the required memory size, and the time limit by which the job must finish. The script also specifies parameters such as the queueing system partition for the job and the grant name. Importantly, once the job starts, it has exclusive access to the requested resources for the entire runtime!
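The resource parameters described above map directly onto `#SBATCH` directives at the top of a batch script. A minimal sketch is shown below; the partition name, grant (account) name, and program are placeholders you would replace with your own values.

```shell
#!/bin/bash
#SBATCH --job-name=my-simulation   # name shown in the queue
#SBATCH --nodes=2                  # number of reserved nodes
#SBATCH --ntasks-per-node=48       # tasks (CPU cores) per node
#SBATCH --mem=180G                 # memory per node
#SBATCH --time=12:00:00            # time limit (HH:MM:SS)
#SBATCH --partition=plgrid         # queueing-system partition (site-specific)
#SBATCH --account=plgmygrant       # grant name (placeholder)

# The commands below run on the first allocated compute node;
# srun launches the program across all reserved resources.
srun ./my_parallel_program input.dat
```

Requesting only as much as the job actually needs (cores, memory, time) usually shortens the time the job spends waiting in the queue.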

Once a job is submitted, it enters the queue with a pending status and waits for execution. The queueing system decides when and on which nodes the job will run. When enough compute nodes become available, the job status changes to running. At that point, the computations in the batch script begin, meaning the commands in the script are executed just as they would be on a regular computer. However, to fully utilize all allocated processors, the program must employ a parallelization method, such as multithreading, MPI, OpenMP, or CUDA.
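The lifecycle above corresponds to a few standard Slurm commands; `job.sh` is a placeholder for your batch script.

```shell
# Submit the batch script; Slurm replies with a job ID
sbatch job.sh

# Check the status of your own jobs (ST column: PD = pending, R = running)
squeue -u $USER

# Cancel a job if needed (replace <jobid> with the ID reported by sbatch)
scancel <jobid>
```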

Summary:

  • the login node is used for submitting jobs and accessing data,
  • compute nodes perform the actual computations,
  • the queueing system (Slurm) allocates resources to jobs based on user requests and current availability.

Last update: December 17, 2025