Data Analysis Bootcamp

K-Means Clustering

minimin227 2025. 3. 18. 21:41

K-Means is an unsupervised machine learning algorithm used for clustering data into groups (or clusters) based on their similarity. It is one of the most popular clustering algorithms due to its simplicity and efficiency.


How K-Means Works
  1. Initialization:
    • The algorithm starts by selecting k random points as the initial cluster centroids (where k is the number of clusters you want to create).
  2. Assignment:
    • Each data point is assigned to the nearest cluster centroid based on a distance metric (usually Euclidean distance).
  3. Update:
    • The centroids of the clusters are recalculated as the mean of all the data points assigned to that cluster.
  4. Repeat:
    • Steps 2 and 3 are repeated until the centroids no longer change significantly or a maximum number of iterations is reached.
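The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration of Lloyd's algorithm, not the scikit-learn implementation used later in this post:

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-4, seed=0):
    """Plain K-Means (Lloyd's algorithm): initialize, assign, update, repeat."""
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick k random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # 2. Assignment: each point joins the cluster of its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update: each centroid becomes the mean of its assigned points
        #    (an empty cluster keeps its old centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 4. Repeat until the centroids stop moving significantly
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    return labels, centroids
```

On data with two well-separated groups, this converges in a handful of iterations to one centroid per group.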

Key Parameters
  • k (Number of Clusters):
    • The number of clusters to divide the data into. This is a hyperparameter that must be chosen before running the algorithm.
  • Centroids:
    • The central points of each cluster, which are recalculated iteratively.
  • Distance Metric:
    • Typically, Euclidean distance is used to measure the distance between data points and centroids.
      1. Distance Calculation:
        • For each data point in hour_mean_pct, the Euclidean distance to each of the k centroids is calculated.
        • The formula for the Euclidean distance between a data point \(p\) and a centroid \(c\) is:
          \(d(p, c) = \sqrt{\sum_{i=1}^n (p_i - c_i)^2}\)
          where:
          • \(p_i\) : The value of the \(i\)-th feature of the data point.
          • \(c_i\) : The value of the \(i\)-th feature of the centroid.
          • \(n\) : The number of features (in this case, 2: '08시' and '18시').
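With \(n = 2\) features, the formula reduces to the familiar 2-D distance. The column names '08시' (08:00) and '18시' (18:00) come from hour_mean_pct in this post's code; the values below are hypothetical, just to show the arithmetic:

```python
import numpy as np

# Hypothetical row of hour_mean_pct: mean boarding percentages
# at '08시' (08:00) and '18시' (18:00)
p = np.array([30.0, 25.0])   # a data point (p_1, p_2)
c = np.array([27.0, 21.0])   # a centroid  (c_1, c_2)

# d(p, c) = sqrt(sum_i (p_i - c_i)^2), with n = 2 features
d = np.sqrt(np.sum((p - c) ** 2))
print(d)  # 5.0, since sqrt(3^2 + 4^2) = 5
```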

Advantages
  1. Simple and Fast:
    • Easy to implement and computationally efficient for small to medium-sized datasets.
  2. Scalable:
    • Works well with large datasets.
  3. Interpretable:
    • The results are easy to understand and visualize.

Disadvantages
  1. Choosing k:
    • The number of clusters (k) must be specified beforehand, which can be challenging.
  2. Sensitive to Initialization:
    • Poor initialization of centroids can lead to suboptimal clustering.
  3. Assumes Spherical Clusters:
    • Works best when clusters are roughly spherical and equally sized.
  4. Sensitive to Outliers:
    • Outliers can significantly affect the clustering results.
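Disadvantage 2 (sensitivity to initialization) is commonly mitigated by smarter seeding; scikit-learn's KMeans defaults to k-means++, which spreads the initial centroids apart. A simplified sketch of that seeding step:

```python
import numpy as np

def kmeans_pp_init(X, k, rng):
    """k-means++-style seeding: pick each new centroid with probability
    proportional to its squared distance from the centroids chosen so far,
    so the initial centroids tend to land in different regions."""
    centroids = [X[rng.integers(len(X))]]  # first centroid: uniform at random
    for _ in range(k - 1):
        # squared distance from each point to its nearest chosen centroid
        diffs = X[:, None, :] - np.array(centroids)[None, :, :]
        d2 = (diffs ** 2).sum(axis=-1).min(axis=1)
        # sample the next centroid with probability proportional to d2
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)
```

Points already chosen have zero distance (and thus zero probability), so the seeds are always distinct.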

Use Cases
  1. Customer Segmentation:
    • Grouping customers based on purchasing behavior.
  2. Image Compression:
    • Reducing the number of colors in an image by clustering similar colors.
  3. Document Clustering:
    • Grouping similar documents or articles.
  4. Anomaly Detection:
    • Identifying outliers as separate clusters.

K-Means in Your Code

In your code:

  1. KMeans():
    • Initializes the K-Means clustering model.
  2. KElbowVisualizer:
    • A tool from the yellowbrick library that helps determine the optimal number of clusters (k) using the "elbow method."
  3. hour_mean_pct:
    • The dataset being clustered.
  4. Elbow Method:
    • The elbow method evaluates the sum of squared distances (inertia) for different values of k and identifies the "elbow point," where adding more clusters does not significantly reduce the inertia. This helps determine the optimal number of clusters.
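The elbow method can also be illustrated without yellowbrick by computing the inertia directly for a range of k values. The sketch below uses a basic NumPy K-Means with random restarts (an assumption for self-containment; in the post's code, KElbowVisualizer performs this sweep on hour_mean_pct and marks the elbow automatically):

```python
import numpy as np

def inertia(X, k, n_init=5, iters=50, seed=0):
    """Best (lowest) sum of squared distances to assigned centroids over
    several random restarts -- the quantity the elbow plot shows per k."""
    rng = np.random.default_rng(seed)
    best = np.inf
    for _ in range(n_init):
        C = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(iters):
            labels = np.linalg.norm(X[:, None] - C[None], axis=2).argmin(axis=1)
            C = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else C[j]
                          for j in range(k)])
        best = min(best, float(((X - C[labels]) ** 2).sum()))
    return best

rng = np.random.default_rng(1)
# three well-separated blobs, so the elbow should appear at k = 3
X = np.vstack([rng.normal(m, 0.3, size=(30, 2)) for m in (0, 5, 10)])
scores = {k: inertia(X, k) for k in range(1, 7)}
# inertia keeps falling as k grows, but the drop flattens sharply after k = 3
```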

Summary

K-Means is a clustering algorithm that groups data into k clusters based on similarity. It is simple, efficient, and widely used in various applications. In your code, the KElbowVisualizer is used to determine the optimal number of clusters for the hour_mean_pct dataset.