> See also:
> - [[MIT SuperCloud Server]]
> - [[UNIX Terminals]]
> - [[Computers]]
# Computing Clusters
*Traditional computers* often *lack the speed and storage capacity* needed to perform more computationally intensive tasks.
A supercomputer allows high-performance computing (HPC) to be achieved.
For a problem to benefit from a supercomputer, the data being processed must be able to be split up and worked on in parallel across many compute nodes.
These are some potential issues that might prompt the use of a supercomputer:
1. The solution requires hundreds or thousands of trials, so it takes a long time to complete
2. The application's memory requirements exceed the memory of a single machine, so the data needs to be distributed over many machines
3. The application both takes a long time to run and has high memory requirements
A supercomputer is the end-all-be-all of HPC.
## Computer Cluster Architecture
![[supercomputer architecture example.png|600]]
HPC systems are made up of *the same or similar components as a laptop or desktop, but on a much larger scale*.
> [!abstract] **Main Components**
> - *Compute Nodes*
> - CPUs & GPUs
> - Each compute node runs its own operating system to manage resources
> - Storage
> - Local Storage (on each node)
> - Centralized (Shared) File System
> - Scheduler
> - *Interconnect Network:* What actually connects the different components of the supercomputer; typically composed of physical cable connections that provide higher bandwidth and lower latency.
> - *External Network:* Allows for the connection to external laptops/desktops for user access
When you `ssh` into a supercomputer system, you are on a special purpose node called the *login node*. This node allows users to access the system and:
- Edit code/files
- Install packages/software
- Download and stage (setup) data
- Start jobs for the supercomputer to process
The *scheduler* receives these job requests from the *login node* over the *interconnect network* and allocates them to the *compute nodes*, where the code is run and the actual computation is done.
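As a rough sketch of this flow (the login node address is the one used in the `scp`/`rsync` examples later in this note; the job script name is a placeholder):

```
# From your local machine, connect to the login node
ssh [email protected]

# On the login node: edit code, install packages, and stage data,
# then hand the actual computation to the scheduler rather than running it here
sbatch my_job.sh   # SLURM example; the scheduler assigns the job to compute nodes
```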
## HPC Workflows
Common HPC workflow types:
- High-Throughput
- Loosely Coupled
- Parallel
There are two main categories of programming models used for high-performance computing (HPC) workflows:
**Single Program Multiple Data (SPMD):** To run the same set of instructions on different data
- When we start running thousands of independent programs in this way, it can be referred to as *high-throughput computing*.
- If the independently processed data goes through a *reduction (gathering)* and *intermediate processing* stage to produce a single result, it has gone through *loosely-coupled (single dependency) computing*
- Ex: Determining the overall cost (maintenance, gas, payments) of a car and then outputting the cheapest one
These workflows typically require *minimal modifications to the application* code before they can be integrated into a supercloud environment
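As a hedged sketch of an SPMD/high-throughput workflow, the SLURM job array below runs the same (hypothetical) program on 100 different input files; the program name, file names, and paths are illustrative only:

```
#!/bin/bash
#SBATCH --job-name=spmd-sweep
#SBATCH --array=1-100              # 100 independent copies of the same program
#SBATCH --output=logs/trial_%a.log

# Each array task applies the same instructions to a different piece of data
# (assumes the logs/, data/, and results/ directories already exist)
./analyze_trial data/input_${SLURM_ARRAY_TASK_ID}.csv > results/out_${SLURM_ARRAY_TASK_ID}.csv
```

A short reduction script that gathers `results/out_*.csv` into a single answer would turn this into the loosely-coupled pattern described above.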
---
**Multiple Program Multiple Data (MPMD):** To run multiple different programs on multiple different sets of data before processing
- These types of processes require sharing a lot of intermediate results across many machines (computing cores)
- In a *task-parallel workflow*, different tasks are completed on different sections or nodes of the supercomputing system (see the sketch after this list). In addition to the programs themselves, the system also *heavily relies on scripts* which:
- Distribute tasks (programs + data) to the different nodes
- Coordinate data movement
- Gather the intermediate results
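A minimal sketch of launching such a task-parallel (MPMD) workflow with SLURM's `--multi-prog` option; all program and file names here are hypothetical:

```
# Write the MPMD configuration: <task ids>  <program>  <args>
# (%t in an argument expands to the task number at launch time)
cat > multi.conf <<'EOF'
0    ./generate_mesh    config.json
1-6  ./simulate_region  region_%t.dat
7    ./collect_results  results/
EOF

# Launch 8 coordinated tasks; each rank runs the program it is mapped to above
srun --ntasks=8 --multi-prog multi.conf
```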
## Scaling Computing Workflows
> [[Computer Performance]]
Regardless of how parallelized a workflow is, there will always be some portion of it that *must be run sequentially* on a single node.
According to **Amdahl’s Law**, this serial fraction of your workflow is what *prevents a linear speedup* of performance when adding additional cores/computational resources to a task.
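Written out (with $p$ the fraction of the workflow that can be parallelized and $N$ the number of cores), Amdahl's Law bounds the speedup at:

$$
S(N) = \frac{1}{(1 - p) + \dfrac{p}{N}}
$$

Even as $N \to \infty$ the speedup approaches only $\frac{1}{1 - p}$; a workflow that is 90% parallelizable, for example, can never run more than 10× faster regardless of how many nodes are added.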
> [!hint] Best Practices for Scalable Development
> - Optimize the serial portions (what must be run sequentially on a single node)
> - Minimize the communication overhead necessary between nodes
### Transferring Files
#### Secure Copy (`scp`)
The `scp` (secure copy) command can be used to transfer files between local and external file systems.
**Local to Remote:** `scp -r [local target] [login node address]:[remote target]`
**Remote to Local:** `scp -r [login node address]:[remote target] [local target]`
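For example, copying the homework folder used in the `rsync` example below up to a login node and pulling results back (username, host, and paths are placeholders):

```
# Local to remote: copy the folder and its contents into ~/homework on the login node
scp -r homework/COP3503C [email protected]:homework/

# Remote to local: pull a results folder back down to the current directory
scp -r [email protected]:homework/COP3503C/results ./results
```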
#### Remote Synchronization (`rsync`)
Unlike `scp`, the `rsync` command will only transfer new or updated files of a given directory
- Functions similarly to version control software (such as [[Git Version Control|Git]])
> [!summary] **The `rsync` Command**
>
> ```
> rsync -av --exclude='.git' --exclude='*.pdf' SOURCE/ DESTINATION/
> ```
>
> ---
> **Arguments:** *(note that the `-a` archive flag used above bundles `-r`, `-l`, `-g`, and several other flags)*
> - `-r` to recursively copy files in all sub-directories
> - `-l` to copy and retain symbolic links
> - `-u` *(update)* skips files that are newer on the destination, so a file you have modified there is not overwritten by an older copy from the source
> - `-g` is used to preserve group attributes associated with files in a shared group
> - `-h` for human-readable output (e.g., file sizes)
> - `-v` *(verbose)* so that you see what is transferred plus any error or warning information
> ---
> **Example source/destination:**
> - Remote: `[email protected]:homework/COP3503C`
> - Local: `homework/COP3503C`
- The local destination cannot simply be `/`
- When transferring folders, if a `/` is added to the end of a directory, the program will copy all contents INSIDE the directory to the new destination (not the folder itself), as illustrated below.
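A quick illustration of the trailing-slash behavior (directory names are arbitrary):

```
# Copies the directory itself, producing DESTINATION/SOURCE/...
rsync -av SOURCE DESTINATION/

# Copies only the contents of SOURCE directly into DESTINATION/
rsync -av SOURCE/ DESTINATION/
```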
## Simple Linux Utility for Resource Management (SLURM)
http://slurm.schedmd.com/
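A minimal sketch of day-to-day SLURM usage (the job script name and job ID are placeholders):

```
sbatch job.sh       # submit a batch job script to the scheduler
squeue -u $USER     # list your queued and running jobs
scancel 12345       # cancel a job by its job ID
sinfo               # show the state of the cluster's partitions and nodes
```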