> Not to be confused with the classification algorithm, $K$-Nearest Neighbors (k-NN).
# K-Means Clustering
The algorithm relies on the value of $k$ being defined by the user. Given this pre-specified number of clusters, $k$, the algorithm partitions (aka [[Clustering Algorithms|clusters]]) the dataset into exactly $k$ disjoint subsets.
The general steps of the K-Means algorithm are:
- $k$ centroids are randomly placed throughout the data (or $k$ data points are randomly chosen as the initial centroids). This random placement means different runs can converge to different final clusterings.
- Each data point is assigned to its nearest centroid, typically measured with the Euclidean distance.
- Each centroid is recomputed as the mean of the data points currently assigned to it.
- The assignment and update steps are repeated until the assignments no longer change (or a maximum number of iterations is reached).

The goal of the K-Means algorithm is to minimize the global dissimilarity of the data points to their assigned centroids, i.e. the within-cluster sum of squares:
$$\underset{C}{\arg\min} \; \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2$$
where $C_1, \dots, C_k$ are the clusters and $\mu_i$ is the centroid (mean) of cluster $C_i$.
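To make these steps concrete, here is a minimal from-scratch sketch in NumPy (the `kmeans` function and its parameters are illustrative, not from any library):

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Minimal from-scratch K-Means: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize centroids as k randomly chosen data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2: assign each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points
        # (keeping the old centroid if a cluster ends up empty).
        new_centroids = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        # Step 4: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Example: cluster 200 random 2-D points into 3 groups.
X = np.random.default_rng(1).normal(size=(200, 2))
centroids, labels = kmeans(X, k=3)
```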
## Data Pre-Processing
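Because K-Means relies on Euclidean distances, features measured on larger numeric scales dominate the objective, so it is common practice to standardize the features first. A minimal sketch, assuming scikit-learn and an illustrative raw feature matrix `X`:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative raw data: two features on very different scales.
X = np.array([[1.0, 1000.0], [2.0, 3000.0], [3.0, 2000.0]])

# Standardize to zero mean and unit variance so neither feature
# dominates the Euclidean distance computation.
X_scaled = StandardScaler().fit_transform(X)
```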
## Determining the Value of K (The Elbow Method)
![[Pasted image 20240623062954.png|450]]
**Distortion:** The average of the squared distances from each data point to the center of the cluster it is assigned to. Typically, the Euclidean distance metric is used.
**Inertia:** The sum of the squared distances of samples to their closest cluster center; also known as the *Within-Cluster Sum of Squares (WCSS)*.
The elbow point represents a balance between having a low WCSS (indicating compact clusters) and not having too many clusters, which could lead to overfitting and poor generalization.
- *Too many clusters (overfitting):* captures noise and outliers, resulting in very specific, non-generalizable clusters.
- *Too few clusters (underfitting):* oversimplifies the data and misses important patterns and structure.
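A sketch of the elbow method, assuming scikit-learn and matplotlib and using synthetic data from `make_blobs` purely for illustration: fit K-Means for a range of $k$ values, record each fit's inertia (WCSS), and look for the bend in the curve.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data purely for illustration.
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Fit K-Means for a range of k values and record the inertia (WCSS).
ks = range(1, 11)
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

# The "elbow" where the curve flattens suggests a reasonable k.
plt.plot(ks, wcss, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("WCSS (inertia)")
plt.show()
```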
## Performing Clustering
![[Pasted image 20240623060946.png|400]]
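A minimal sketch of performing the clustering itself, again assuming scikit-learn's `KMeans` and synthetic data (the chosen $k=3$ matches how the data was generated):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 well-separated groups, purely for illustration.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Fit K-Means with the chosen k, then read off labels and centroids.
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
labels = km.labels_               # cluster index for each sample
centroids = km.cluster_centers_   # final centroid coordinates
```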