> See also:
> - [[MIT SuperCloud Server]]
> - [[UNIX Terminals]]
> - [[Computers]]

# Computing Clusters

*Traditional computers* often *lack the speed and storage capabilities* needed to perform more computationally intensive tasks. A supercomputer allows high-performance computing (HPC) to be achieved.

For a problem to benefit from the use of a supercomputer, the data being processed must be able to be split up and worked on in parallel.

These are some potential issues that might prompt their usage:
1. The solution requires hundreds or thousands of trials, so it takes a long time to complete
2. The application's memory requirements are greater than the memory of a single machine, so it needs to be distributed over many machines
3. The application both takes a long time to run and has high memory requirements

A supercomputer is the end-all-be-all of high-performance computing.

## Computer Cluster Architecture

![[supercomputer architecture example.png|600]]

HPC systems are made up of the same or *similar components as a laptop or desktop, but on a much larger scale.*

> [!abstract] **Main Components**
> - *Compute Nodes*
>     - CPUs & GPUs
>     - Each compute node runs its own operating system to manage resources
> - Storage
>     - Local Storage (on each node)
>     - Centralized (Shared) File System
> - Scheduler
> - *Interconnect Network:* What actually connects the different components of the supercomputer; typically made up of physical cable connections to provide higher bandwidth and lower latency.
> - *External Network:* Allows external laptops/desktops to connect for user access

When you `ssh` into a supercomputer system, you land on a special-purpose node called the *login node*. This node allows users to access the system and:
- Edit code/files
- Install packages/software
- Download and stage (set up) data
- Start jobs for the supercomputer to process

The *scheduler* uses the *interconnect network* to receive these job requests from the *login node* and allocate them to the *compute nodes*, where the code is run and the actual computation is done.

## HPC Workflows

- High-Throughput
- Loosely Coupled
- Parallel

There are two main categories of programming models used for high-performance computing (HPC) workflows:

**Single Program Multiple Data (SPMD):** Run the same set of instructions on different data
- When we start running thousands of independent programs in this way, it can be referred to as *high-throughput computing*.
- If the independently processed data goes through a *reduction (gathering)* and *intermediate processing* stage to produce a single result, it has gone through *loosely-coupled (single dependency) computing*
    - Ex: Determining the overall cost (maintenance, gas, payments) of each car and then outputting the cheapest one (see the sketch at the end of this section)

These workflows typically require *minimal modifications to the application code* before they can be integrated into a supercloud environment.

---

**Multiple Program Multiple Data (MPMD):** Run multiple different programs on multiple different sets of data before processing
- These types of processes require sharing a lot of intermediate results across many machines (computing cores)
- In a *task-parallel workflow*, different tasks are completed on different sections or nodes of the supercomputing system.

In addition to the programs themselves, the system also *heavily relies on scripts* which:
- Distribute tasks (programs + data) to the different nodes
- Coordinate data movement
- Gather the intermediate results
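Below is a minimal, hypothetical sketch of such a loosely-coupled SPMD workflow written as a shell script: the same (assumed) `compute_cost.sh` program is run independently on every input file, and a small gather step reduces the intermediate outputs to a single answer (the cheapest car from the example above). The `data/` and `results/` paths are placeholders, and on a real cluster the scheduler would distribute the independent tasks across compute nodes rather than the local `&`/`wait` loop used here.

```bash
#!/bin/bash
# Hypothetical loosely-coupled (SPMD) workflow:
# same program, different data, followed by a single reduction step.

mkdir -p results

# 1. Distribute: run the same cost calculation independently on each data file
for datafile in data/car_*.csv; do
    ./compute_cost.sh "$datafile" > "results/$(basename "$datafile" .csv).cost" &
done
wait   # the tasks are fully independent -- no communication until they finish

# 2. Reduce (gather): combine the intermediate results into one answer,
#    assuming each .cost file contains "<total cost> <car name>"
sort -n results/*.cost | head -n 1 > cheapest_car.txt
```

Because each task only reads its own input and writes its own output, the application code itself needs little or no modification; only this driver script changes when the work is handed to a scheduler.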
## Scaling Computing Workflows

> [[Computer Performance]]

Regardless of how parallelized a workflow is, there will always be some portion of it that *must be run sequentially* on a single node. According to **Amdahl's Law**, this serial fraction of your workflow is what *prevents a linear speedup* of performance when adding additional cores/computational resources to a task: if a fraction $p$ of the work can be parallelized across $N$ cores, the speedup is at most $\frac{1}{(1 - p) + p/N}$, which can never exceed $\frac{1}{1 - p}$ no matter how many cores are added.

> [!hint] Best Practices for Scalable Development
> - Optimize the serial portions (what must be run sequentially on a single node)
> - Minimize the communication overhead necessary between nodes

### Transferring Files

#### Secure Copy (`scp`)

The `scp` (secure copy) command can be used to transfer files between local and remote file systems.

**Local to Remote:** `scp -r [local target] [login node address]:[remote target]`

**Remote to Local:** `scp -r [login node address]:[remote target] [local target]`

#### Remote Synchronization (`rsync`)

Unlike `scp`, the `rsync` command will only transfer the new or updated files of a given directory
- It functions similarly to version control software (such as [[Git Version Control|Git]])

> [!summary] **The `rsync` Command**
>
> ```
> rsync -av --exclude='.git' --exclude='*.pdf' SOURCE/ DESTINATION/
> ```
>
> ---
> **Arguments:**
> - `-a` *(archive)* is shorthand that implies `-r`, `-l`, and preservation of permissions, timestamps, owner, and group
> - `-r` to recursively copy files in all sub-directories
> - `-l` to copy and retain symbolic links
> - `-u` *(update)* is needed if you have modified files on the destination and you don't want the old file to overwrite the newer version on the destination
> - `-g` is used to preserve the group attributes associated with files in a shared group
> - `-h` for human-readable output
> - `-v` *(verbose)* so that you get any error or warning information
> ---
> **Example source/destination:**
> - Remote: `[username]@[login node address]:homework/COP3503C`
> - Local: `homework/COP3503C`

- The local destination cannot simply be `/`
- When transferring folders, if a `/` is added to the end of a directory, the program will copy all contents INSIDE the directory to the new destination (not the folder itself).

## Simple Linux Utility for Resource Management (SLURM)

http://slurm.schedmd.com/
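SLURM is a widely used open-source scheduler for HPC clusters: you describe the resources a job needs in a batch script, and the scheduler queues it and runs it on compute nodes. Below is a minimal sketch of such a batch script, assuming a generic SLURM installation; the resource values, the `my_program` executable, and its input path are placeholders, and the directives a given cluster accepts (partitions, accounts, module systems) vary by site.

```bash
#!/bin/bash
# Hypothetical SLURM batch script -- submit from the login node with: sbatch job.sh
#SBATCH --job-name=cost-analysis   # name shown in the job queue
#SBATCH --output=slurm-%j.out      # log file; %j expands to the job ID
#SBATCH --nodes=1                  # number of compute nodes to allocate
#SBATCH --ntasks=1                 # number of tasks (processes) to launch
#SBATCH --cpus-per-task=4          # CPU cores reserved for the task
#SBATCH --time=00:30:00            # wall-clock time limit (HH:MM:SS)

# Everything below runs on the compute node(s) chosen by the scheduler,
# not on the login node.
./my_program data/input.csv
```

Jobs are submitted with `sbatch`, monitored with `squeue`, and cancelled with `scancel`; check the target cluster's documentation for its required partitions and accounts.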