> Not to be confused with the classification algorithm, $K$-Nearest Neighbors (k-NN).

# K-Means Clustering

The algorithm relies on the value of $k$ being defined by the user. Given this pre-specified number of clusters, $k$, the algorithm partitions (aka [[Clustering Algorithms|clusters]]) the dataset into exactly $k$ disjoint subsets.

The general steps of the K-Means algorithm are:

- $k$ cluster *centroids* are randomly placed throughout the data. This random placement means different runs can converge to different results.
- Each data point is assigned to its nearest centroid, typically measured with the Euclidean distance.
- Each centroid is moved to the mean of the data points assigned to it.
- The assignment and update steps repeat until the centroids stop moving (see the code sketches at the end of this note).

- [ ] Dissimilarity Matrix

The goal of the k-means algorithm is to minimize the global dissimilarity of the data points, i.e. the total squared distance between each point and the centroid of its assigned cluster:

$$\sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2$$

where $C_i$ is the $i$-th cluster and $\mu_i$ is its centroid.

## Data Pre-Processing

Because k-means is distance-based, features on larger scales dominate the assignment step, so the data should typically be standardized before clustering.

## Determining K Values (The Elbow Method)

![[Pasted image 20240623062954.png|450]]

**Distortion:** The average of the squared distances from each data point to the center of its cluster. Typically, the Euclidean distance metric is used.

**Inertia:** The sum of the squared distances of samples to their closest cluster center, also known as the *Within-Cluster Sum of Squares (WCSS)*.

The elbow point represents a balance between having a low WCSS (indicating compact clusters) and not having too many clusters, which could lead to overfitting and poor generalization.

- *Too Many Clusters (Overfitting):* Captures noise and outliers, resulting in very specific, non-generalizable clusters.
- *Too Few Clusters (Underfitting):* Simplifies the data too much and misses important patterns and structure.

## Performing Clustering

![[Pasted image 20240623060946.png|400]]
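## Code Sketches

The steps above map almost directly onto code. Below is a minimal NumPy sketch of the algorithm itself; the function name, the convergence check, and initializing centroids from randomly chosen data points are illustrative assumptions (empty clusters are not handled), so treat this as a sketch rather than a reference implementation.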
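```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal k-means sketch; empty clusters are not handled."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize k centroids by sampling k distinct data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign every point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```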
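For the elbow method, one common approach (assumed here, not prescribed by this note) is to fit scikit-learn's `KMeans` over a range of $k$ values and plot the resulting inertia (WCSS). The synthetic `make_blobs` data and the candidate range $k = 1, \dots, 10$ are placeholders; standardization from the pre-processing section is applied first.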
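```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for a real dataset here.
X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

# K-means is distance-based, so standardize the features first.
X = StandardScaler().fit_transform(X)

# Compute inertia (WCSS) for each candidate k.
k_values = range(1, 11)
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
    for k in k_values
]

# The "elbow" in this curve suggests a good trade-off for k.
plt.plot(list(k_values), inertias, marker="o")
plt.xlabel("Number of clusters $k$")
plt.ylabel("Inertia (WCSS)")
plt.show()
```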
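Once a value of $k$ has been read off the elbow plot (here $k = 4$ is assumed, matching the synthetic data above), performing the clustering itself is a single fit, continuing from the variables defined in the previous sketch: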
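```python
# Fit the final model with the chosen k (continues from X above).
final_km = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)

labels = final_km.labels_              # cluster index for each data point
centroids = final_km.cluster_centers_  # coordinates of the final centroids

# Visualize the result (first two features only).
plt.scatter(X[:, 0], X[:, 1], c=labels, s=10)
plt.scatter(centroids[:, 0], centroids[:, 1], c="red", marker="x", s=100)
plt.title("K-Means result for k = 4")
plt.show()
```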
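As a side note on initialization: scikit-learn's `KMeans` defaults to `init="k-means++"`, which spreads the initial centroids apart instead of placing them uniformly at random, and its `n_init` parameter reruns the initialization several times and keeps the lowest-inertia result. Both are standard mitigations for the sensitivity to random placement described in the steps above.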