K-Means Clustering

3 Assignment: K-Means Clustering

Let’s apply K-means clustering on the same data set we used for kNN.

You have to determine a number of still unknown clusters of, in this case, makes and models of cars.

There is no outcome criterion available, so we cannot split the data into a training and test set!

The questions and assignments are:

  • Read the file ( cars.csv ).
  • You have seen from the description in the previous assignment that the variable for origin (US versus non-US) is a factor variable. We cannot calculate distances from a factor variable. Because we want to include it anyway, we have to make it a dummy (0/1) variable.
  • Normalize the data.
  • Determine the number of clusters using the (graphical) method described above.
  • Determine the clustering, and add the cluster to the data set.
  • Describe the clusters in terms of all variables used in the clustering.
  • Characterize ( label ) the clusters.
  • Repeat the exercise with more or fewer clusters, and decide if the new solutions are better than the original solution!

3.1 Solution: Some Help

Read the data:

And make a function for normalizing your data:

And normalize the data, after creating a dummy for origin .

The normalized data to use are now in cars2_n .

Decide on the best number of clusters:
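The original solution code is not reproduced above. As an illustration of the same workflow (read the file, dummy-code origin, min-max normalize, draw the elbow plot), here is a rough sketch in Python with pandas and scikit-learn. The file name cars.csv is given above; the value "US" in the origin column and the range of k values are assumptions.

    import matplotlib.pyplot as plt
    import pandas as pd
    from sklearn.cluster import KMeans

    cars = pd.read_csv("cars.csv")

    # Dummy-code the factor variable origin (assumed to contain "US" vs. non-US values)
    cars["origin"] = (cars["origin"] == "US").astype(int)

    # Min-max normalization, applied column by column
    def normalize(column):
        return (column - column.min()) / (column.max() - column.min())

    cars_n = cars.apply(normalize)

    # Elbow plot: within-cluster sum of squares for k = 1..10
    wss = [KMeans(n_clusters=k, n_init=10).fit(cars_n).inertia_ for k in range(1, 11)]
    plt.plot(range(1, 11), wss, marker="o")
    plt.xlabel("number of clusters k")
    plt.ylabel("within-cluster sum of squares")
    plt.show()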


Homework 4: K-Means Clustering

Part I Due: 11:59pm on Tuesday, February 8, 2022. Submit via Gradescope. (REQUIRED Feedback Survey)

Part II Due: 11:59pm on Monday, February 14, 2022. Submit via Gradescope. (REQUIRED Feedback Survey)

  • For Part I you will submit kmeans.py with at least Part 1 completed. You will not necessarily get any feedback on Part I before Part II is due, so do not wait to get started on Part II. What you submit for the Part I deadline will not be graded for style at that time - but of course we recommend that you go ahead and use good style!
  • For Part II you will re-submit kmeans.py and submit analysis.py with everything (Parts 1 and 2) completed and with good style. At that time we will grade the entire homework for correctness and style. If you made a mistake in Part I, please fix those mistakes before submitting Part II so you do not lose points on them twice.
  • Your overall grade for this homework will come approximately 25% from Part I and 75% from Part II. Overall HW4 will count as approximately 1.5 homeworks.
  • For all of our assignments you should NOT use parts of Python not yet discussed in class or the course readings. In addition, do not use the functions: min , max , or sum .

Learning Objectives:

  • Implement a popular clustering algorithm for data analysis
  • Practice using lists and dictionaries for computational problem solving

This time we have a couple videos available to help clarify the assignment. We highly recommend watching these videos before starting, or if you are confused about the content.


The Algorithm

The homework is organized as follows:

  • Part 0: Getting Started
  • Part 1: Implementing the Algorithm (Step 1: Calculating Euclidean Distances; Step 2: Assigning Data Points to Closest Centroids; Step 3: Update Assignment of Points to Centroids; Step 4: Update Centroids; Step 5: Clustering 2D Points; Step 6: Submit)
  • Part 2: Another Dataset & Analyzing the Performance (Step 1: Load the Final Centroids: What Happened?; Step 2: Rewrite update_assignment; Step 3: Calculate the Count of the Majority Label; Step 4: Calculate the Accuracy of the Algorithm; Step 5: Run Analysis and Think About the Results; Step 6: Final Submit)
  • Introduction

In Homework 4, we will implement a commonly-used algorithm: K-means clustering. (Read more about K-means on the K-means Wikipedia page if you're interested). K-means is a popular machine learning and data mining algorithm that discovers potential clusters within a dataset. Finding these clusters in a dataset can often reveal interesting and meaningful structures underlying the distribution of data. K-means clustering has been applied to many problems in science and still remains popular today for its simplicity and effectiveness. Here are a few examples of recently published papers using K-means:

  • (Computer Science) Hand-written digit recognition (We will do this in Part 2!)
  • (Natural Language Processing) Categorizing text based on clusters
  • (Biology) Discovering gene expression networks

Hopefully, these examples motivate you to try this homework out! You will very likely be able to apply this computational tool to some problems relevant to you! We will break down the name and explain the algorithm now.

Background: Data, Distances, Clusters and Centroids

We will assume we have a dataset of m data points where each data point is n -dimensional. For example, we could have 100 points on a two dimensional plane, then m = 100 and n = 2.

Euclidean Distance

The distance between two n-dimensional points x and y is defined as:

$$ d(x, y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_n - y_n)^2} $$

This is commonly known as the Euclidean distance . For example, if we had the three-dimensional points x1 and x2 defined as Python lists:

x1 = [-2, -1, -4]
x2 = [10, -5, 5]

The distance between them is:

$$ d(x1, x2) = \sqrt{(10 - (-2))^2 + (-5 - (-1))^2 + (5 - (-4))^2} = \sqrt{144 + 16 + 81} = \sqrt{241} \approx 15.52 $$

The same formula extends to points of any dimension; with two four-dimensional points, the sum under the square root simply runs over four squared differences.

A cluster is a collection of points that are part of the same group. For k-means, every point has a cluster that it is part of. As the algorithm progresses, points may change what cluster they are currently in, even though the point itself does not move.


For instance, all blue points are part of cluster1, all red points are part of cluster2, and all green points are part of cluster3.

A centroid is the center of a cluster: its coordinates are the per-dimension means (averages) of all the points currently assigned to that cluster.

For example, if a cluster contains three 2D points (displayed as Python lists) as below:

x1 = [1, 1]
x2 = [2, 3]
x3 = [3, 2]

The centroid for this cluster is defined as:

c = [(1 + 2 + 3)/3, (1 + 3 + 2)/3]  # evaluates to [2, 2]


Our ultimate goal is to partition the data points into K clusters. How do we know which data point belongs to which cluster? In K-means, we assign each data point to the cluster whose centroid is closest to that data point.

  • First, we assign all data points to their corresponding centroids by finding the centroid that is closest.
  • Then, we adjust the centroid location to be the average of all points in that cluster.

We iteratively repeat these two steps to achieve better and better cluster centroids and meaningful assignments of data points.

Note that the data points do not change! Only the locations of the centroids change with each iteration (and thus the set of points closest to each centroid changes).

[Animated figure: the two steps alternate - points are recolored to match their closest centroid, then each centroid moves to the mean of its points.]

K-means Convergence

One last question you may have is, when do we stop repeating these two steps? I.e., what does it mean for the algorithm to have converged? We say the algorithm has converged if the locations of all centroids did not change much between two iterations (e.g., within 1 * 10^-5, i.e., 0.00001).

Take this example of two centroids of 2-dimensional points. If at the previous iteration we have the two centroid locations:

c1 = [0.45132, -0.99134]
c2 = [-0.34135, -2.3525]

And at the next iteration the centroids are updated to:

c1 = [1.43332, -1.9334]
c2 = [-1.78782, -2.3523]

Then we say the algorithm has NOT converged, since the locations of the two centroids are very different between iterations. However, if the new locations of the centroids are instead:

c1 = [0.45132, -0.99134]
c2 = [-0.34135, -2.3524]

Then we say the algorithm HAS converged, since the locations of the two centroids are very similar between iterations. (Typically, "very similar" means that for every dimension of two points a and b, the difference |a - b| is smaller than some constant; 1 * 10^-5, i.e., 0.00001, is a common choice.)

In summary, the K-means algorithm (in pseudo code) is described as follows:

  • First, data points are assigned to whichever centroid is closest (and colored accordingly),
  • Then, centroid locations are adjusted based on the mean of all points assigned to them (the red/blue points in the animation),
  • These two steps repeat until the centroids converge (a minimal Python sketch of this loop follows the list).
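To make the loop concrete, here is a minimal sketch of how the pieces might fit together. It is not the official starter code: the helper signatures (update_assignment, update_centroids) follow the descriptions later in this spec, and has_converged is a hypothetical helper added purely for illustration.

    # A minimal sketch of the k-means main loop (not the official starter code).
    # update_assignment and update_centroids are the helpers described later in
    # this spec; has_converged is a hypothetical helper added for illustration.

    def has_converged(old_centroids, new_centroids, tolerance=1e-5):
        """True if every surviving centroid moved less than tolerance in every dimension."""
        for name, new_location in new_centroids.items():
            for old_value, new_value in zip(old_centroids[name], new_location):
                if abs(old_value - new_value) >= tolerance:
                    return False
        return True

    def k_means(data, centroids):
        """Alternate assignment and update steps until the centroids stop moving."""
        while True:
            assignment = update_assignment(data, centroids)   # step 1: assign points
            new_centroids = update_centroids(assignment)      # step 2: move centroids
            if has_converged(centroids, new_centroids):
                return new_centroids
            centroids = new_centroids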

In the following parts of this homework, you will implement this algorithm and apply it to (1) 2D points shown in the gif above; (2) a small subset of the MNIST hand-written digit dataset.

Download homework4.zip and unzip the file to get started. You should see the following file structure in the homework4 folder:

The red highlighted parts are where you will work on your code. We have provided utils.py which contains helper code to help you complete the assignment. Do not change anything in utils.py !

  • data : A list of data points, where each data point is a list of floats. Example: data = [[0.34, -0.2, 1.13, 4.3], [5.1, -12.6, -7.0, 1.9], [-15.7, 0.06, -7.1, 11.2]] Here, data contains three data points, where each data point is four-dimensional. (In this example point 1 is [0.34, -0.2, 1.13, 4.3], point 2 is [5.1, -12.6, -7.0, 1.9] and point 3 is [-15.7, 0.06, -7.1, 11.2])
  • centroids : A dictionary of centroids, where the keys are strings (centroid names) and the values are lists (centroid locations) Example: centroids = {"centroid1": [1.1, 0.2, -3.1, -0.4], "centroid2": [9.3, 6.1, -4.7, 0.18]} Here, centroids contains two key-value pairs, where key is the centroid name, and value is the location of the centroid. You should NOT change the names of the centroids when you later update their locations.

Implement the following function in kmeans.py :

If you want more information on what Euclidean distance is, re-read the Euclidean distance background section in the spec and/or the Euclidean distance Wikipedia page . In the code we have provided you for this (and all functions you need to implement), the body of the function is empty. We have provided testing code for you to verify your implementation. Run python kmeans.py in a terminal window to verify your implementation of euclidean_distance . If you passed our tests, you should see: test_euclidean_distance passed. Note that you may find several assertions further down in the program failing at this point. This is to be expected since you have not implemented those parts of the program yet! In fact this will continue to happen as you progress through the problems. Success usually means you are no longer failing the same assertion, but that instead a different assertion is failing further down in the program.
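For reference, a straightforward implementation consistent with this description (and with the restriction on min, max, and sum) might look like the following sketch; it is not necessarily identical to the official solution:

    def euclidean_distance(point1, point2):
        """Distance between two points given as equal-length lists of numbers."""
        total = 0
        for a, b in zip(point1, point2):
            total += (a - b) ** 2   # squared difference in each dimension
        return total ** 0.5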

If you want more information on centroids, check out the Centroids section in the spec. We have provided testing code for you to verify your implementation. test_closest_centroid() is already set up to run when python kmeans.py is typed in a terminal window. Check the output of this test to verify your implementation.
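A sketch of what this step might look like, assuming the function receives one data point and the centroids dictionary and returns the name of the nearest centroid:

    def get_closest_centroid(point, centroids_dict):
        """Return the name of the centroid whose location is nearest to point."""
        closest_name = None
        closest_distance = None
        for name, location in centroids_dict.items():
            distance = euclidean_distance(point, location)
            if closest_distance is None or distance < closest_distance:
                closest_name = name
                closest_distance = distance
        return closest_name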

Notice that your algorithm should run regardless of whether any centroid becomes empty, i.e. there are no points that are closest to that centroid. In that case you should not include that centroid in the dictionary you return. We have provided testing code for you to verify your implementation. test_update_assignment() is already set up to run when python kmeans.py is typed in a terminal window. Check the output of this test to verify your implementation.
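One possible shape for this function, returning a dictionary that maps each centroid name to the list of data points assigned to it (empty centroids simply never appear as keys); treat it as a sketch rather than the official solution:

    def update_assignment(data, centroids_dict):
        """Group the data points by the name of their closest centroid."""
        assignment = {}
        for point in data:
            name = get_closest_centroid(point, centroids_dict)
            if name not in assignment:
                assignment[name] = []
            assignment[name].append(point)
        return assignment   # centroids with no points are left out automatically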

Notice that you should not hard-code the dimensionality of data - your code needs to run regardless of the dimension of data!

We have provided testing code for you to verify your implementation. test_mean_of_points() is already set up to run when python kmeans.py is typed in a terminal window. Check the output of this test to verify your implementation. A common error students make when implementing this function is to modify the provided list of points. Be sure your implementation does not modify the provided list of points. Now implement the following function in kmeans.py :
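A sketch of a mean_of_points that works for any dimensionality and leaves the input list untouched (again, an illustration rather than the official solution):

    def mean_of_points(points):
        """Per-dimension average of a non-empty list of points (input is not modified)."""
        dimensions = len(points[0])
        totals = [0] * dimensions
        for point in points:
            for i in range(dimensions):
                totals[i] += point[i]
        return [total / len(points) for total in totals]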

Notice that you should not hard-code the dimensionality of data - your code needs to run regardless of the dimension of data! We have provided testing code for you to verify your implementation. test_update_centroids() is already set up to run when python kmeans.py is typed in a terminal window. Check the output of this test to verify your implementation. You have now completed all parts needed to run your k-means algorithm! As a sanity check, try running everything together one more time by typing: python kmeans.py in the terminal window. You should see: All tests passed.
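And a corresponding sketch for the centroid update, assuming it receives the assignment dictionary produced above and returns a new centroids dictionary:

    def update_centroids(assignment_dict):
        """Move each centroid to the mean of the points currently assigned to it."""
        new_centroids = {}
        for name, points in assignment_dict.items():
            new_centroids[name] = mean_of_points(points)
        return new_centroids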

You have now completed all parts needed to run your k-means algorithm! Now, let's verify your implementation by running the algorithm on 2D points. At the bottom of kmeans.py you should see code like this:

This is the code that actually runs when you run this file. Right now your code is only calling the main_test() function. De-comment the last four lines and turn the call to main_test() into a comment. These new commands will load the 2D data points and the initial centroids we provided from .csv files. After that, main_2d will execute the k-means algorithm and write_centroids_tofile will save the final centroids to a file so they are stored permanently. Your code should now look like this:

Now run your program again by typing python kmeans.py in the terminal window. If you are successful, you should see: K-means converged after 7 steps. as the output of the program. In addition, there should be 7 plots generated in the results/2D/ folder. If you examine these 7 plots, they should be identical to the steps shown in the gif in the algorithm section of this spec.

You've completed part 1! Before you take a break, be sure to submit kmeans.py via Gradescope . To ensure your program produces the correct output, you should make the code at the bottom of your kmeans.py look identical to what we described in Step 5.

Answer a REQUIRED survey asking how much time you spent and other reflections on this part of the assignment.

Part 2: Another dataset & Analyzing the Performance

Trying the Algorithm on MNIST

Now that you have successfully run K-means clustering on 2-dimensional data points with 2 centroids, we will use your code to do clustering of 784-dimensional data points with 10 centroids. Don't worry, you won't need to change any of the code that you wrote!

[Figure: nine sample hand-written digit images from the MNIST dataset]

Each of these images is 28 pixels by 28 pixels, for a total of 784 pixels. (As in HW3, pixels have values between 0 and 255 (0=black, 255=white), although we don't need to know this for K-means.) Each image can be thought of as a 784-dimensional data point, represented as a list containing 784 numbers. For the images shown above, many of the numbers would be zero since the image is mostly black. So the nine sample images above are actually nine sample data points!

We will see how applying K-means to these digit images helps us discover clusters of images which contain mostly one kind of digit!

To load the MNIST dataset instead of the 2D points, change the code at the bottom of the file to the following (the red text shows the changes that should be made):

You should leave the rest of your code intact, and it should run with the new data without errors! Run this command again (it may take a minute to run on this data set!):

python kmeans.py

Analyzing the Performance

How do we evaluate the performance of our algorithm? The images in the MNIST data set have been labelled with the digit they actually represent. (You can see this label as the digit in the first column of each row in the file mnist.csv. Each row of that file represents one data point (image).) So, one way to evaluate our algorithm would be to examine the accuracy of the classification of the images from MNIST. For example, how many of the images known to be (labelled as) the digit 4 ended up in the centroid we think of as representing the digit 4?

First, let's define the "classifications" of our data points provided by K-means. We take a majority-takes-all approach: we assign each cluster the majority label of all the data points in that cluster. For example, say we have data points with these labels assigned to cluster1:

2, 2, 2, 2, 2, 2, 8, 6, 2, 2, 6, 8, 8, 2, 2

Since there are 10 2's, 3 8's, and 2 6's, we say the label of cluster1 is 2, since it is the majority label. (I.e., the first six images have a label of 2, the seventh image has a label of 8, the eighth image has a label of 6, and so on.)

The accuracy of a single cluster is then the count of its majority label divided by the number of points in that cluster, and the overall accuracy of the algorithm is the sum of the majority counts over all clusters divided by the total number of data points.

Now we can start the analysis. We have provided the code for you near the bottom of analysis.py :

First, please take a closer look at this centroids dictionary and/or the images that were generated. Feel free to write some print statements to explore the contents (you don't need to show this code when you submit your homework). How many centroids are there in the dictionary now? In Part A, we initialized the algorithm with 10 randomly chosen positions as the initial centroids. Why are there fewer than 10 centroids now? What do you suspect happened?

This question is just for fun (un-graded), but we want you to at least think about what is happening and give an answer. Leave your answers as comments at the bottom of the file (analysis.py). Please do not stress about these questions; there is no need to do anything for this question other than think about it.

When we implemented the update_assignment function, it assigned each data point to the centroid it is closest to. This works great for our algorithm, but now we are not only interested in which centroid the data point is assigned to, but also the true label of that data point in order to perform the accuracy analysis. Therefore, we need to update the function to incorporate more parameters and return different values. Implement the following function in analysis.py :

Notice the differences between the previous version of the function you wrote and what you need to write now. Instead of nested lists of data points as the values of the returned dictionary, it should now return single lists that contain the labels of the data points, not the actual data points. Feel free to copy your previous implementation in kmeans.py over as a starting point. And just like your old implementation, you should use the get_closest_centroid function when implementing this one. We have provided testing code for you to verify your implementation. De-comment the call to main_test() near the bottom of analysis.py and then run python analysis.py, which will call that same main_test(). Similar to part 1, you should see test_update_assignment passed if your update_assignment is working. You will see a test fail after that, since your other functions are not ready yet.
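A sketch of the revised function; the parameter order (data, labels, centroids) is an assumption, but the return value matches the description above:

    def update_assignment(data, labels, centroids_dict):
        """Map each centroid name to the list of true labels of its assigned points."""
        label_assignment = {}
        for point, label in zip(data, labels):
            name = get_closest_centroid(point, centroids_dict)
            if name not in label_assignment:
                label_assignment[name] = []
            label_assignment[name].append(label)   # store the label, not the point
        return label_assignment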

In the next function, you should take in a single list of labels, and return the count of the majority label defined as above. Implement the following function in analysis.py :
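One way this could look, counting label frequencies with a dictionary (a sketch that avoids the forbidden min/max/sum built-ins):

    def majority_count(labels):
        """Return how many times the most common label occurs in the list."""
        counts = {}
        for label in labels:
            counts[label] = counts.get(label, 0) + 1
        best = 0
        for label in counts:
            if counts[label] > best:
                best = counts[label]
        return best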

We have provided testing code for you to verify your implementation. Run python analysis.py again. This time you should see test_majority_count passed if your majority_count is working. You will see a test fail after that, since your last function is not ready yet.

Use the two functions you implemented in the previous parts of Part 2 to calculate the accuracy for every cluster and the whole algorithm, defined as above in the Analyzing the Performance section . Implement the following function in analysis.py :
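A sketch of the accuracy calculation built from the two helpers above; the parameter list mirrors the assumed update_assignment signature:

    def accuracy(data, labels, centroids_dict):
        """Fraction of points whose label matches the majority label of their cluster."""
        label_assignment = update_assignment(data, labels, centroids_dict)
        majority_total = 0
        point_total = 0
        for cluster_labels in label_assignment.values():
            majority_total += majority_count(cluster_labels)
            point_total += len(cluster_labels)
        return majority_total / point_total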

We have provided testing code for you to verify your implementation. Run python analysis.py again. This time you should see test_accuracy passed if your accuracy function is working. Now you should also see all tests passed.

You have now completed all the parts required to run the analysis. Please de-comment the last few commented lines of code at the bottom of analysis.py and comment out the call to main_test() (and remove your exploration code from Step 1) so you have:

And run the program by typing: python analysis.py

Note that the file name returned in the output does not necessarily correspond with the label assigned to the digit (e.g., centroid0.png is not necessarily an image of '0'). Instead, decide on the label of the image by looking at the returned image itself. What is the output? Is this value surprising? It might be helpful to take another look at the final centroid images from MNIST in the results/MNIST/final folder. In your opinion, based on these centroids, which digits are easier for the algorithm to distinguish, and which are harder? If you can't tell from the centroid images, consider printing the accuracy for each label inside of your accuracy function. This question is graded, but don't stress too much about it: reasonable answers (i.e., 2-3 sentences) will receive full credit. Leave your answer as comments at the bottom of the file analysis.py.

Be sure to submit both analysis.py and kmeans.py via Gradescope under part 2.


The text is released under the CC-BY-NC-ND license, and code is released under the MIT license.

In Depth: k-Means Clustering


In the previous few sections, we have explored one category of unsupervised machine learning models: dimensionality reduction. Here we will move on to another class of unsupervised machine learning models: clustering algorithms. Clustering algorithms seek to learn, from the properties of the data, an optimal division or discrete labeling of groups of points.

Many clustering algorithms are available in Scikit-Learn and elsewhere, but perhaps the simplest to understand is an algorithm known as k-means clustering , which is implemented in sklearn.cluster.KMeans .

We begin with the standard imports:
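The import block itself was not preserved in this extract; a typical setup consistent with the plotting and NumPy calls used below would be the following sketch (the notebook magic is shown as a comment, and the seaborn styling line is a stylistic choice):

    # %matplotlib inline   (notebook magic in the original notebook)
    import matplotlib.pyplot as plt
    import seaborn as sns; sns.set()   # plot styling
    import numpy as np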

Introducing k-Means

The k -means algorithm searches for a pre-determined number of clusters within an unlabeled multidimensional dataset. It accomplishes this using a simple conception of what the optimal clustering looks like:

  • The "cluster center" is the arithmetic mean of all the points belonging to the cluster.
  • Each point is closer to its own cluster center than to other cluster centers.

Those two assumptions are the basis of the k -means model. We will soon dive into exactly how the algorithm reaches this solution, but for now let's take a look at a simple dataset and see the k -means result.

First, let's generate a two-dimensional dataset containing four distinct blobs. To emphasize that this is an unsupervised algorithm, we will leave the labels out of the visualization:
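A sketch of that data generation; the sample count, spread, and random seed here are assumptions:

    from sklearn.datasets import make_blobs

    X, y_true = make_blobs(n_samples=300, centers=4,
                           cluster_std=0.60, random_state=0)
    plt.scatter(X[:, 0], X[:, 1], s=50)   # plot the points without their labels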

By eye, it is relatively easy to pick out the four clusters. The k -means algorithm does this automatically, and in Scikit-Learn uses the typical estimator API:
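With the estimator API, fitting and predicting cluster labels takes only a few lines (a sketch; the fixed number of clusters is the only required choice):

    from sklearn.cluster import KMeans

    kmeans = KMeans(n_clusters=4)
    kmeans.fit(X)
    y_kmeans = kmeans.predict(X)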

Let's visualize the results by plotting the data colored by these labels. We will also plot the cluster centers as determined by the k -means estimator:
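For example, coloring points by their predicted cluster and overlaying the learned centers, which are exposed through the cluster_centers_ attribute:

    plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
    centers = kmeans.cluster_centers_
    plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5)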

The good news is that the k -means algorithm (at least in this simple case) assigns the points to clusters very similarly to how we might assign them by eye. But you might wonder how this algorithm finds these clusters so quickly! After all, the number of possible combinations of cluster assignments is exponential in the number of data points—an exhaustive search would be very, very costly. Fortunately for us, such an exhaustive search is not necessary: instead, the typical approach to k -means involves an intuitive iterative approach known as expectation–maximization .

k-Means Algorithm: Expectation–Maximization

Expectation–maximization (E–M) is a powerful algorithm that comes up in a variety of contexts within data science. k -means is a particularly simple and easy-to-understand application of the algorithm, and we will walk through it briefly here. In short, the expectation–maximization approach here consists of the following procedure:

  • Guess some cluster centers
  • E-Step : assign points to the nearest cluster center
  • M-Step : set the cluster centers to the mean

Here the "E-step" or "Expectation step" is so-named because it involves updating our expectation of which cluster each point belongs to. The "M-step" or "Maximization step" is so-named because it involves maximizing some fitness function that defines the location of the cluster centers—in this case, that maximization is accomplished by taking a simple mean of the data in each cluster.

The literature about this algorithm is vast, but can be summarized as follows: under typical circumstances, each repetition of the E-step and M-step will always result in a better estimate of the cluster characteristics.

We can visualize the algorithm as shown in the following figure. For the particular initialization shown here, the clusters converge in just three iterations. For an interactive version of this figure, refer to the code in the Appendix .

(run code in Appendix to generate image)

The k -Means algorithm is simple enough that we can write it in a few lines of code. The following is a very basic implementation:
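A compact sketch in that spirit, using pairwise_distances_argmin to assign each point to its nearest center; the random-seed handling is an assumption:

    from sklearn.metrics import pairwise_distances_argmin
    import numpy as np

    def find_clusters(X, n_clusters, rseed=2):
        # 1. Randomly choose initial centers from the data
        rng = np.random.RandomState(rseed)
        initial = rng.permutation(X.shape[0])[:n_clusters]
        centers = X[initial]
        while True:
            # 2a. E-step: assign labels based on the closest center
            labels = pairwise_distances_argmin(X, centers)
            # 2b. M-step: new centers are the means of the assigned points
            new_centers = np.array([X[labels == j].mean(0)
                                    for j in range(n_clusters)])
            # 2c. Stop once the centers no longer move
            if np.all(centers == new_centers):
                break
            centers = new_centers
        return centers, labels

    centers, labels = find_clusters(X, 4)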

Most well-tested implementations will do a bit more than this under the hood, but the preceding function gives the gist of the expectation–maximization approach.

Caveats of expectation–maximization

There are a few issues to be aware of when using the expectation–maximization algorithm.

The globally optimal result may not be achieved

First, although the E–M procedure is guaranteed to improve the result in each step, there is no assurance that it will lead to the global best solution. For example, if we use a different random seed in our simple procedure, the particular starting guesses lead to poor results:

Here the E–M approach has converged, but has not converged to a globally optimal configuration. For this reason, it is common for the algorithm to be run for multiple starting guesses, as indeed Scikit-Learn does by default (set by the n_init parameter, which defaults to 10).

The number of clusters must be selected beforehand

Another common challenge with k -means is that you must tell it how many clusters you expect: it cannot learn the number of clusters from the data. For example, if we ask the algorithm to identify six clusters, it will happily proceed and find the best six clusters:

Whether the result is meaningful is a question that is difficult to answer definitively; one approach that is rather intuitive, but that we won't discuss further here, is called silhouette analysis .

Alternatively, you might use a more complicated clustering algorithm which has a better quantitative measure of the fitness per number of clusters (e.g., Gaussian mixture models; see In Depth: Gaussian Mixture Models ) or which can choose a suitable number of clusters (e.g., DBSCAN, mean-shift, or affinity propagation, all available in the sklearn.cluster submodule)

k-means is limited to linear cluster boundaries

The fundamental model assumptions of k -means (points will be closer to their own cluster center than to others) mean that the algorithm will often be ineffective if the clusters have complicated geometries.

In particular, the boundaries between k -means clusters will always be linear, which means that it will fail for more complicated boundaries. Consider the following data, along with the cluster labels found by the typical k -means approach:

This situation is reminiscent of the discussion in In-Depth: Support Vector Machines , where we used a kernel transformation to project the data into a higher dimension where a linear separation is possible. We might imagine using the same trick to allow k -means to discover non-linear boundaries.

One version of this kernelized k -means is implemented in Scikit-Learn within the SpectralClustering estimator. It uses the graph of nearest neighbors to compute a higher-dimensional representation of the data, and then assigns labels using a k -means algorithm:
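A sketch of how that might be invoked, assuming X holds the nonlinearly separated two-cluster data described above; the affinity and label-assignment settings are typical choices rather than guaranteed to match the original notebook:

    from sklearn.cluster import SpectralClustering

    model = SpectralClustering(n_clusters=2, affinity='nearest_neighbors',
                               assign_labels='kmeans')
    labels = model.fit_predict(X)
    plt.scatter(X[:, 0], X[:, 1], c=labels, s=50, cmap='viridis')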

We see that with this kernel transform approach, the kernelized k -means is able to find the more complicated nonlinear boundaries between clusters.

k-means can be slow for large numbers of samples

Because each iteration of k -means must access every point in the dataset, the algorithm can be relatively slow as the number of samples grows. You might wonder if this requirement to use all data at each iteration can be relaxed; for example, you might just use a subset of the data to update the cluster centers at each step. This is the idea behind batch-based k -means algorithms, one form of which is implemented in sklearn.cluster.MiniBatchKMeans . The interface for this is the same as for standard KMeans ; we will see an example of its use as we continue our discussion.

Examples

Being careful about these limitations of the algorithm, we can use k -means to our advantage in a wide variety of situations. We'll now take a look at a couple examples.

Example 1: k-means on digits

To start, let's take a look at applying k -means on the same simple digits data that we saw in In-Depth: Decision Trees and Random Forests and In Depth: Principal Component Analysis . Here we will attempt to use k -means to try to identify similar digits without using the original label information ; this might be similar to a first step in extracting meaning from a new dataset about which you don't have any a priori label information.

We will start by loading the digits and then finding the KMeans clusters. Recall that the digits consist of 1,797 samples with 64 features, where each of the 64 features is the brightness of one pixel in an 8×8 image:

The clustering can be performed as we did before:
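A sketch of the loading and clustering steps; the random seed is an assumption:

    from sklearn.datasets import load_digits
    from sklearn.cluster import KMeans

    digits = load_digits()
    kmeans = KMeans(n_clusters=10, random_state=0)
    clusters = kmeans.fit_predict(digits.data)
    print(kmeans.cluster_centers_.shape)   # (10, 64): one 64-pixel "typical digit" per cluster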

The result is 10 clusters in 64 dimensions. Notice that the cluster centers themselves are 64-dimensional points, and can themselves be interpreted as the "typical" digit within the cluster. Let's see what these cluster centers look like:

We see that even without the labels , KMeans is able to find clusters whose centers are recognizable digits, with perhaps the exception of 1 and 8.

Because k -means knows nothing about the identity of the cluster, the 0–9 labels may be permuted. We can fix this by matching each learned cluster label with the true labels found in them:
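One way to do that matching is with scipy.stats.mode, relabeling each cluster with the most common true digit it contains (a sketch; the exact return shape of mode differs slightly across SciPy versions):

    import numpy as np
    from scipy.stats import mode

    labels = np.zeros_like(clusters)
    for i in range(10):
        mask = (clusters == i)
        labels[mask] = mode(digits.target[mask])[0]   # most common true digit in cluster i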

Now we can check how accurate our unsupervised clustering was in finding similar digits within the data:
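For instance, with accuracy_score from Scikit-Learn (a sketch, continuing from the relabeling above):

    from sklearn.metrics import accuracy_score

    print(accuracy_score(digits.target, labels))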

With just a simple k -means algorithm, we discovered the correct grouping for 80% of the input digits! Let's check the confusion matrix for this:

As we might expect from the cluster centers we visualized before, the main point of confusion is between the eights and ones. But this still shows that using k -means, we can essentially build a digit classifier without reference to any known labels !

Just for fun, let's try to push this even farther. We can use the t-distributed stochastic neighbor embedding (t-SNE) algorithm (mentioned in In-Depth: Manifold Learning ) to pre-process the data before performing k -means. t-SNE is a nonlinear embedding algorithm that is particularly adept at preserving points within clusters. Let's see how it does:

That's nearly 92% classification accuracy without using the labels . This is the power of unsupervised learning when used carefully: it can extract information from the dataset that it might be difficult to do by hand or by eye.

Example 2: k -means for color compression

One interesting application of clustering is in color compression within images. For example, imagine you have an image with millions of colors. In most images, a large number of the colors will be unused, and many of the pixels in the image will have similar or even identical colors.

For example, consider the image shown in the following figure, which is from the Scikit-Learn datasets module (for this to work, you'll have to have the pillow Python package installed).

The image itself is stored in a three-dimensional array of size (height, width, RGB) , containing red/green/blue contributions as integers from 0 to 255:

One way we can view this set of pixels is as a cloud of points in a three-dimensional color space. We will reshape the data to [n_samples x n_features] , and rescale the colors so that they lie between 0 and 1:

We can visualize these pixels in this color space, using a subset of 10,000 pixels for efficiency:

Now let's reduce these 16 million colors to just 16 colors, using a k -means clustering across the pixel space. Because we are dealing with a very large dataset, we will use the mini batch k -means, which operates on subsets of the data to compute the result much more quickly than the standard k -means algorithm:
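A sketch of the 16-color reduction, assuming data is the (n_pixels, 3) array of rescaled RGB values described above:

    from sklearn.cluster import MiniBatchKMeans

    kmeans = MiniBatchKMeans(n_clusters=16)
    kmeans.fit(data)
    new_colors = kmeans.cluster_centers_[kmeans.predict(data)]   # each pixel mapped to its cluster color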

The result is a re-coloring of the original pixels, where each pixel is assigned the color of its closest cluster center. Plotting these new colors in the image space rather than the pixel space shows us the effect of this:

Some detail is certainly lost in the rightmost panel, but the overall image is still easily recognizable. This image on the right achieves a compression factor of around 1 million! While this is an interesting application of k -means, there are certainly better ways to compress information in images. But the example shows the power of thinking outside of the box with unsupervised methods like k -means.


K-Means Clustering Algorithm from Scratch

  • April 26, 2020
  • Venmani A D

K-Means Clustering is an unsupervised learning algorithm that aims to group the observations in a given dataset into clusters. The number of clusters is provided as an input. It forms the clusters by minimizing the sum of the distance of points from their respective cluster centroids.

This article covers:

  • Basic Overview of Clustering
  • Introduction to K-Means Clustering
  • Steps Involved
  • Maths Behind K-Means Clustering
  • Implementing K-Means from Scratch
  • Elbow Method to Find the Optimal Number of Clusters
  • Grouping Mall Customers Using K-Means

Basic Overview of Clustering

Clustering is a type of unsupervised learning which is used to split unlabeled data into different groups.

Now, what does unlabeled data mean?

Unlabeled data means we don’t have a dependent variable (response variable) for the algorithm to compare as the ground truth. Clustering is generally used in Data Analysis to get to know about the different groups that may exist in our dataset.

We try to split the dataset into different groups, such that the data points in the same group are more similar to each other than to the data points in other groups.

Now how to find whether the points are similar?

Use a good distance metric to compute the distance between a point and every other point. Points separated by a smaller distance are more similar. Euclidean distance is the most commonly used metric.


Clustering algorithms are generally used in network traffic classification and in customer and market segmentation. They can be used on any tabular dataset where you want to know which rows are similar to each other, in order to form meaningful groups out of the dataset. First, I am going to load the libraries that I will be using.

Let's look into one of the most common clustering algorithms: K-Means in detail.

K-Means follows an iterative process in which it tries to minimize the distance of the data points from the centroid points. It’s used to group the data points into k number of clusters based on their similarity.

Euclidean distance is used to calculate the similarity.

Let's see a simple example of how K-Means clustering can be used to segregate the dataset. In this example, I am going to use the make_blobs command to generate isotropic Gaussian blobs, which can be used for clustering.

I passed in the number of samples to be generated to be 100 and the number of centers to be 5.


If you look at the value of y, you can see that the points are classified based on their clusters, but I won't be using this to compute the clusters; I will be using it only for evaluation purposes. For using K-Means you need to import KMeans from the sklearn.cluster library.

For using KMeans, you need to specify the number of clusters as an argument. In this case, since we can see from the graph that there are 5 clusters, I will be passing 5 as the argument. But in general cases, you should use the Elbow Method to find the optimal number of clusters. I will be discussing this method in detail in the upcoming sections.

After passing the arguments, I have fitted the model and predicted the results. Now let’s visualize our predictions in a scatter plot.
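A sketch of these two steps (data generation and clustering); apart from the 100 samples and 5 centers mentioned above, the exact arguments are assumptions:

    from sklearn.datasets import make_blobs
    from sklearn.cluster import KMeans
    import matplotlib.pyplot as plt

    X, y = make_blobs(n_samples=100, centers=5)   # y is kept only for evaluation

    model = KMeans(n_clusters=5)
    y_pred = model.fit_predict(X)                 # fit the model and predict cluster labels

    plt.scatter(X[:, 0], X[:, 1], c=y_pred)       # visualize the predicted clusters
    plt.show()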


There are 3 important steps in K-Means Clustering.

  • 1. Initialize centroids – This is done by randomly choosing K points; these can be points from the dataset or arbitrary random points.
  • 2. Assign clusters – Each point in the dataset is assigned to a cluster by calculating its distance to every centroid and picking the centroid with the minimum distance.
  • 3. Re-calculate the centroids – Each centroid is updated to the mean (centroid) of the points in the cluster we have created for it.


Maths Behind K-Means Working

K-Means minimizes the total within-cluster sum of squared distances between each point and the centroid of its cluster:

$$ J = \sum_{j=1}^{K} \sum_{x_i \in C_j} (x_i - \mu_j)^2 $$

where $C_j$ is the set of points in cluster $j$ and $\mu_j$ is its centroid.

One of the few things that you need to keep in mind is that, since we are using Euclidean distance as the main measure, it is better to standardize your dataset if the scales of x and y differ greatly (for example, 10 versus 100).

Also, it's recommended to try several different sets of initial centroids and check whether you get the same output, because there is a possibility that you may get stuck at a local minimum rather than the global minimum.

Let’s use the same make_blobs example we used at the beginning. We will try to do the clustering without using the KMeans library.

I have set the K value to be 5 as before and also initialized the centroids randomly at first using the random.randint() function.

Then I am going to find the distance between the points. Euclidean distance is most commonly used for finding the similarity.

I have also stored all the minimum values in a variable minimum . Then I regrouped the dataset based on the minimum values we got and calculated the centroid value.

Then we need to repeat the above 2 steps over and over again until we reach convergence.

Let’s plot it.


Elbow method to find the optimal number of clusters

One of the important steps in K-Means Clustering is to determine the optimal number of clusters to give as an input. This can be done by iterating over a range of values for the number of clusters (n) and then finding the optimal value.

For finding this optimal n, the Elbow Method is used.

You have to plot the loss values against the n values and find the point where the graph starts flattening; this point is considered the optimal n value.

Let’s look at the example we have seen at first, to see the working of the elbow method. I am going to iterate it through a series of n values ranging from 1-20 and then plot their loss values.


I am going to be using the Mall_Customers Dataset. You can download the dataset from the given link


Let’s try to find if there are certain clusters between the customers based on their Age and Spending Score.



Definitive Guide to K-Means Clustering with Scikit-Learn


Introduction

K-Means clustering is one of the most widely used unsupervised machine learning algorithms that form clusters of data based on the similarity between data instances.

In this guide, we will first take a look at a simple example to understand how the K-Means algorithm works before implementing it using Scikit-Learn. Then, we'll discuss how to determine the number of clusters (Ks) in K-Means, and also cover distance metrics, variance, and K-Means pros and cons.

Imagine the following situation. One day, when walking around the neighborhood, you noticed there were 10 convenience stores and started to wonder which stores were similar - closer to each other in proximity. While searching for ways to answer that question, you've come across an interesting approach that divides the stores into groups based on their coordinates on a map.

For instance, if one store was located 5 km West and 3 km North - you'd assign (5, 3) coordinates to it, and represent it in a graph. Let's plot this first point to visualize what's happening:


This is just the first point, so we can get an idea of how we can represent a store. Say we already have 10 coordinates to the 10 stores collected. After organizing them in a numpy array, we can also plot their locations:


How to Manually Implement K-Means Algorithm

Now we can look at the 10 stores on a graph, and the main problem is to find whether there is a way they could be divided into different groups based on proximity. Just by taking a quick look at the graph, we'll probably notice two groups of stores - one is the lower points to the bottom-left, and the other one is the upper-right points. Perhaps, we can even differentiate those two points in the middle as a separate group - therefore creating three different groups .

In this section, we'll go over the process of manually clustering points - dividing them into the given number of groups. That way, we'll essentially carefully go over all steps of the K-Means clustering algorithm . By the end of this section, you'll gain both an intuitive and practical understanding of all steps performed during the K-Means clustering. After that, we'll delegate it to Scikit-Learn.

What would be the best way of determining if there are two or three groups of points? One simple way would be to simply choose one number of groups - for instance, two - and then try to group points based on that choice.


Let's say we have decided there are two groups of our stores (points). Now, we need to find a way to understand which points belong to which group. This could be done by choosing one point to represent group 1 and one to represent group 2 . Those points will be used as a reference when measuring the distance from all other points to each group.

In that manner, say point (5, 3) ends up belonging to group 1, and point (79, 60) to group 2. When trying to assign a new point (6, 3) to a group, we need to measure its distance to those two points. In this case, the point (6, 3) is closer to (5, 3) , therefore it belongs to the group represented by that point - group 1 . This way, we can easily group all points into corresponding groups.

In this example, besides determining the number of groups ( clusters ) - we are also choosing some points to be a reference of distance for new points of each group.

That is the general idea to understand similarities between our stores. Let's put it into practice - we can first choose the two reference points at random . The reference point of group 1 will be (5, 3) and the reference point of group 2 will be (10, 15) . We can select both points of our numpy array by [0] and [1] indexes and store them in g1 (group 1) and g2 (group 2) variables:

After doing this, we need to calculate the distance from all other points to those reference points. This raises an important question - how to measure that distance. We can essentially use any distance measure, but, for the purpose of this guide, let's use the Euclidean distance.

Advice: If you want to learn more about Euclidean distance, you can read our "Calculating Euclidean Distances with NumPy" guide.

It can be useful to know that Euclidean distance measure is based on Pythagoras' theorem:

$$ c^2 = a^2 + b^2 $$

When adapted to points in a plane - (a1, b1) and (a2, b2) - the previous formula becomes:

$$ c^2 = (a_2 - a_1)^2 + (b_2 - b_1)^2 $$

The distance c is the square root of that sum, so we can also write the formula as:

$$ euclidean_{dist} = \sqrt{(a_2 - a_1)^2 + (b_2 - b_1)^2} $$

Note: You can also generalize the Euclidean distance formula for multi-dimensional points. For example, in a three-dimensional space, points have three coordinates - our formula reflects that in the following way: $$ euclidean_{dist} = \sqrt{(a_2 - a_1)^2 + (b_2 - b_1)^2 + (c_2 - c_1)^2} $$ The same principle is followed no matter the number of dimensions of the space we are operating in.

So far, we have picked the points to represent groups, and we know how to calculate distances. Now, let's put the distances and groups together by assigning each of our collected store points to a group.

To better visualize that, we will declare three lists. The first one to store points of the first group - points_in_g1 . The second one to store points from the group 2 - points_in_g2 , and the last one - group , to label the points as either 1 (belongs to group 1) or 2 (belongs to group 2):

We can now iterate through our points and calculate the Euclidean distance between them and each of our group references. Each point will be closer to one of two groups - based on which group is closest, we'll assign each point to the corresponding list, while also adding 1 or 2 to the group list:
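A sketch of that loop; the array name points and the use of NumPy for the square root are assumptions, while g1 and g2 are the two reference points chosen above:

    import numpy as np

    points_in_g1 = []
    points_in_g2 = []
    group = []

    for p in points:
        # Euclidean distance from this point to each group's reference point
        d1 = np.sqrt((p[0] - g1[0]) ** 2 + (p[1] - g1[1]) ** 2)
        d2 = np.sqrt((p[0] - g2[0]) ** 2 + (p[1] - g2[1]) ** 2)
        if d1 < d2:
            points_in_g1.append(p)
            group.append(1)
        else:
            points_in_g2.append(p)
            group.append(2)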

Let's look at the results of this iteration to see what happened:

Which results in:

We can also plot the clustering result, with different colors based on the assigned groups, using Seaborn's scatterplot() with the group as a hue argument:


It's clearly visible that only our first point is assigned to group 1, and all other points were assigned to group 2. That result differs from what we had envisioned in the beginning. Considering the difference between our results and our initial expectations - is there a way we could change that? It seems there is!

One approach is to repeat the process and choose different points to be the references of the groups. This will change our results, hopefully, more in line with what we've envisioned in the beginning. This second time, we could choose them not at random as we previously did, but by getting a mean of all our already grouped points. That way, those new points could be positioned in the middle of corresponding groups.

For instance, if the second group had only points (10, 15) , (30, 45) . The new central point would be (10 + 30)/2 and (15+45)/2 - which is equal to (20, 30) .

Since we have put our results in lists, we can convert them first to numpy arrays, select their xs, ys and then obtain the mean :

Advice: Try to use numpy and NumPy arrays as much as possible. They are optimized for better performance and simplify many linear algebra operations. Whenever you are trying to solve some linear algebra problem, you should definitely take a look at the numpy documentation to check if there is any numpy method designed to solve your problem. The chance is that there is!

To help repeat the process with our new center points, let's transform our previous code into a function, execute it and see if there were any changes in how the points are grouped:

Note: If you notice you keep repeating the same code over and over again, you should wrap that code into a separate function. It is considered a best practice to organize code into functions, especially because they facilitate testing. It is easier to test an isolated piece of code than a full code without any functions.

Let's call the function and store its results in points_in_g1 , points_in_g2 , and group variables:

And also plot the scatter plot with the colored points to visualize the groups division:


It seems the clustering of our points is getting better . But still, there are two points in the middle of the graph that could be assigned to either group when considering their proximity to both groups. The algorithm we've developed so far assigns both of those points to the second group.

This means we can probably repeat the process once more by taking the means of the Xs and Ys, creating two new central points (centroids) to our groups and re-assigning them based on distance.

Let's also create a function to update the centroids. The whole process now can be reduced to multiple calls of that function:


Notice that after this third iteration, each one of the points now belongs to different clusters. It seems the results are getting better - let's do it once again. Now going to the fourth iteration of our method:


This fourth time we got the same result as the previous one. So it seems our points won't change groups anymore; our result has reached some kind of stability - it has got to an unchangeable state, or converged . Besides that, we have exactly the same result as we had envisioned for the 2 groups. We can also check whether the division we have reached makes sense.

Let's just quickly recap what we've done so far. We've divided our 10 stores geographically into two sections - ones in the lower southwest regions and others in the northeast. It can be interesting to gather more data besides what we already have - revenue, the daily number of customers, and many more. That way we can conduct a richer analysis and possibly generate more interesting results.

Clustering studies like this can be conducted when an already established brand wants to pick an area to open a new store. In that case, there are many more variables taken into consideration besides location.

What Does All This Have To Do With K-Means Algorithm?

While following these steps you might have wondered what they have to do with the K-Means algorithm. The process we've conducted so far is the K-Means algorithm . In short, we've determined the number of groups/clusters, randomly chosen initial points, and updated centroids in each iteration until clusters converged. We've basically performed the entire algorithm by hand - carefully conducting each step.

The K in K-Means comes from the number of clusters that need to be set prior to starting the iteration process. In our case K = 2 . This characteristic is sometimes seen as negative considering there are other clustering methods, such as Hierarchical Clustering, which don't need to have a fixed number of clusters beforehand.

Due to its use of means, K-means also becomes sensitive to outliers and extreme values - they increase the variability and make it harder for our centroids to play their part. So, be conscious of the need to perform extreme value and outlier analysis before conducting a clustering using the K-Means algorithm.

Also, notice that our points were separated by straight lines; there are no curves in the boundaries between clusters. That can also be a disadvantage of the K-Means algorithm.

Note: When you need it to be more flexible and adaptable to ellipses and other shapes, try using a generalized K-means Gaussian Mixture model . This model can adapt to elliptical segmentation clusters.

K-Means also has many advantages ! It performs well on large datasets which can become difficult to handle if you are using some types of hierarchical clustering algorithms. It also guarantees convergence , and can easily generalize and adapt . Besides that, it is probably the most used clustering algorithm.

Now that we've gone over all the steps performed in the K-Means algorithm, and understood all its pros and cons, we can finally implement K-Means using the Scikit-Learn library.

How to Implement K-Means Algorithm Using Scikit-Learn

To double check our result, let's do this process again, but now using 3 lines of code with sklearn :
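A sketch of those three lines, assuming the store coordinates are collected in an array called points (the variable name and the random seed are assumptions):

    from sklearn.cluster import KMeans

    kmeans = KMeans(n_clusters=2, random_state=42)
    kmeans.fit(points)
    labels = kmeans.predict(points)   # 0/1 group label for each store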

Here, the labels are the same as our previous groups. Let's just quickly plot the result:


The resulting plot is the same as the one from the previous section.


Note: Just looking at how we've performed the K-Means algorithm using Scikit-Learn might give you the impression that this is a no-brainer and that you don't need to worry too much about it. Just 3 lines of code perform all the steps we've discussed in the previous section when we've gone over the K-Means algorithm step-by-step. But, the devil is in the details in this case! If you don't understand all the steps and limitations of the algorithm, you'll most likely face the situation where the K-Means algorithm gives you results you were not expecting.

With Scikit-Learn, you can also initialize K-Means for faster convergence by setting the init='k-means++' argument. In broader terms, K-Means++ chooses the first cluster center at random, following a uniform distribution. Each subsequent cluster center is then chosen from the remaining data points, not purely at random, but with a probability that grows with its squared distance from the closest center already chosen. This seeding typically speeds up convergence, and it's helpful when dealing with very large datasets.

Advice: You can learn more about K-Means++ details by reading the "K-Means++: The Advantages of Careful Seeding" paper, proposed in 2007 by David Arthur and Sergei Vassilvitskii.

The Elbow Method - Choosing the Best Number of Groups

So far, so good! We've clustered 10 stores based on the Euclidean distance between points and centroids. But what about those two points in the middle of the graph that are a little harder to cluster? Couldn't they form a separate group as well? Did we actually make a mistake by choosing K=2 groups? Maybe we actually had K=3 groups? We could even have more than three groups and not be aware of it.

The question being asked here is how to determine the number of groups (K) in K-Means . To answer that question, we need to understand if there would be a "better" cluster for a different value of K.

The naive way of finding that out is by clustering points with different values of K , so, for K=2, K=3, K=4, and so on :

But, clustering points for different Ks alone won't be enough to understand if we've chosen the ideal value for K . We need a way to evaluate the clustering quality for each K we've chosen.

Manually Calculating the Within Cluster Sum of Squares (WCSS)

Here is the ideal place to introduce a measure of how much our clustered points are close to each other. It essentially describes how much variance we have inside a single cluster. This measure is called Within Cluster Sum of Squares , or WCSS for short. The smaller the WCSS is, the closer our points are, therefore we have a more well-formed cluster. The WCSS formula can be used for any number of clusters:

$$ WCSS = \sum_{P_i \in Cluster_1}(P_i - Centroid_1)^2 + \cdots + \sum_{P_i \in Cluster_n}(P_i - Centroid_n)^2 $$

Note: In this guide, we are using the Euclidean distance to obtain the centroids, but other distance measures, such as Manhattan, could also be used.

Now we can assume we've opted to have two clusters and try to implement the WCSS to understand better what the WCSS is and how to use it. As the formula states, we need to sum up the squared differences between all cluster points and centroids. So, if our first point from the first group is (5, 3) and our last centroid (after convergence) of the first group is (16.8, 17.0) , the WCSS will be:

$$ WCSS = (5 - 16.8)^2 + (3 - 17.0)^2 $$

$$ WCSS = (-11.8)^2 + (-14.0)^2 $$

$$ WCSS = 139.24 + 196.00 $$

$$ WCSS = 335.24 $$

This example illustrates how we calculate the WCSS for the one point from the cluster. But the cluster usually contains more than one point, and we need to take all of them into consideration when calculating the WCSS. We'll do that by defining a function that receives a cluster of points and centroids, and returns the sum of squares:
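A sketch of such a helper, assuming each cluster is passed as a list (or array) of 2D points together with its centroid:

    import numpy as np

    def sum_of_squares(cluster, centroid):
        """Sum of squared Euclidean distances from every point in the cluster to its centroid."""
        cluster = np.array(cluster)
        centroid = np.array(centroid)
        return float(((cluster - centroid) ** 2).sum())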

Now we can get the sum of squares for each cluster:

And sum up the results to obtain the total WCSS :

This results in:

So, in our case, when K is equal to 2, the total WCSS is 2964.39 . Now, we can switch Ks and calculate the WCSS for all of them. That way, we can get an insight into what K we should choose to make our clustering perform the best.

Calculating WCSS Using Scikit-Learn

Fortunately, we don't need to manually calculate the WCSS for each K . After performing the K-Means clustering for the given number of clusters, we can obtain its WCSS by using the inertia_ attribute. Now, we can go back to our K-Means for loop, use it to switch the number of clusters, and list corresponding WCSS values:
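A sketch of that loop, assuming the data lives in an array called points; the wcss list collects the inertia_ value for each candidate number of clusters:

    from sklearn.cluster import KMeans

    wcss = []
    for k in range(1, 11):
        kmeans = KMeans(n_clusters=k, random_state=42)
        kmeans.fit(points)
        wcss.append(kmeans.inertia_)   # within-cluster sum of squares for this k
    print(wcss)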

Notice that the second value in the list is exactly the same one we calculated before for K=2 :

To visualize those results, let's plot our Ks along with the WCSS values:

[Plot: WCSS values plotted against the number of clusters K]

There is a clear bend in the plot at x = 2 , a low point in the line, and an even lower one at x = 3 . Notice that the curve reminds us of the shape of an elbow . By plotting the Ks along with the WCSS values, we are using the Elbow Method to choose the number of clusters. The chosen K is exactly the elbow point , so it would be 3 instead of 2 in our case:

[Plot: the elbow point of the WCSS curve, at K = 3]

We can run the K-Means clustering algorithm again, to see how our data would look with three clusters :

[Plot: the store data grouped into three clusters]

We were already happy with two clusters, but according to the elbow method, three clusters would be a better fit for our data. In this case, we would have three kinds of stores instead of two. Before using the elbow method, we thought about southwest and northeast clusters of stores; now we also have stores in the center. Maybe that could be a good location to open another store since it would have less competition nearby.

  • Alternative Cluster Quality Measures

There are also other measures that can be used when evaluating cluster quality:

  • Silhouette Score - analyzes not only the distance between intra-cluster points but also between clusters themselves
  • Between Clusters Sum of Squares (BCSS) - metric complementary to the WCSS
  • Sum of Squares Error (SSE)
  • Maximum Radius - measures the largest distance from a point to its centroid
  • Average Radius - the sum of the largest distance from a point to its centroid divided by the number of clusters.

It's recommended to experiment and get to know each of them since depending on the problem, some of the alternatives can be more applicable than the most widely used metrics (WCSS and Silhouette Score) .

In the end, as with many data science algorithms, we want to reduce the variance inside each cluster and maximize the variance between different clusters, so that we end up with more defined and separable clusters.

  • Applying K-Means on Another Dataset

Let's use what we have learned on another dataset. This time, we will try to find groups of similar wines.

Note: You can download the dataset here .

We begin by importing pandas to read the wine-clustering CSV (Comma-Separated Values) file into a Dataframe structure:

After loading it, let's take a peek at the first five records of data with the head() method:

We have many measurements of substances present in wines. Here, we also won't need to transform categorical columns because all of them are numerical. Now, let's take a look at the descriptive statistics with the describe() method:

The describe table:

By looking at the table it is clear that there is some variability in the data - for some columns such as Alcohol there is more, and for others, such as Malic_Acid , less. Now we can check if there are any null , or NaN values in our dataset:

There's no need to drop or impute data, considering there aren't empty values in the dataset. We can use a Seaborn pairplot() to see the data distribution and to check if the dataset forms pairs of columns that can be interesting for clustering:

[Plot: Seaborn pairplot of the wine dataset]

By looking at the pair plot, two columns seem promising for clustering purposes - Alcohol and OD280 (which is a method for determining the protein concentration in wines). It seems that there are 3 distinct clusters in the plots that combine the two of them.

There are other columns that seem to be in correlation as well. Most notably Alcohol and Total_Phenols , and Alcohol and Flavonoids . They have great linear relationships that can be observed in the pair plot.

Since our focus is clustering with K-Means, let's choose one pair of columns, say Alcohol and OD280 , and test the elbow method for this dataset.

Note: When using more columns of the dataset, there will be a need for either plotting in 3 dimensions or reducing the data to principal components (use of PCA) . This is a valid, and more common, approach; just make sure to choose the principal components based on how much variance they explain, and keep in mind that when reducing the data dimensions, there is some information loss - so the plot is an approximation of the real data, not how it really is.

Let's plot a scatter plot with those two columns as its axes to take a closer look at the points we want to divide into groups:

[Plot: scatter plot of Alcohol vs. OD280]

Now we can define our columns and use the elbow method to determine the number of clusters. We will also initialize the algorithm with k-means++ just to make sure it converges more quickly:
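A sketch of that step; it assumes the wine data is in a DataFrame named df, which is an assumption carried over from the loading step above:

```python
from sklearn.cluster import KMeans

X = df[['Alcohol', 'OD280']].values

wcss = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42).fit(X)
    wcss.append(kmeans.inertia_)
```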

We have calculated the WCSS, so we can plot the results:

[Plot: elbow curve for the wine data]

According to the elbow method we should have 3 clusters here. For the final step, let's cluster our points into 3 clusters and plot those clusters identified by colors:
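A minimal sketch of that final fit and plot, reusing the X array from the elbow step (the plotting details are illustrative):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

# Color each wine by its cluster and mark the centroids with red crosses
plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            marker='x', color='red')
plt.xlabel('Alcohol')
plt.ylabel('OD280')
plt.show()
```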

[Plot: the wine data divided into three clusters]

We can see clusters 0 , 1 , and 2 in the graph. Based on our analysis, group 0 has wines with higher protein content and lower alcohol, group 1 has wines with higher alcohol content and low protein, and group 2 has both high protein and high alcohol in its wines.

This is a very interesting dataset and I encourage you to go further into the analysis by clustering the data after normalization and PCA - also by interpreting the results and finding new connections.

K-Means clustering is a simple yet very effective unsupervised machine learning algorithm for data clustering. It clusters data based on the Euclidean distance between data points. K-Means clustering algorithm has many uses for grouping text documents, images, videos, and much more.


Written by Chris Piech. Based on a handout by Andrew Ng.

The Basic Idea

Say you are given a data set where each observed example has a set of features, but has no labels. Labels are an essential ingredient to a supervised algorithm like Support Vector Machines, which learns a hypothesis function to predict labels given features. So we can't run supervised learning. What can we do?

One of the most straightforward tasks we can perform on a data set without labels is to find groups of data in our dataset which are similar to one another -- what we call clusters.

K-Means is one of the most popular "clustering" algorithms. K-means stores $k$ centroids that it uses to define clusters. A point is considered to be in a particular cluster if it is closer to that cluster's centroid than any other centroid.

K-Means finds the best centroids by alternating between (1) assigning data points to clusters based on the current centroids and (2) choosing centroids (points which are the center of a cluster) based on the current assignment of data points to clusters.


Figure 1: K-means algorithm. Training examples are shown as dots, and cluster centroids are shown as crosses. (a) Original dataset. (b) Random initial cluster centroids. (c-f) Illustration of running two iterations of k-means. In each iteration, we assign each training example to the closest cluster centroid (shown by "painting" the training examples the same color as the cluster centroid to which it is assigned); then we move each cluster centroid to the mean of the points assigned to it. Images courtesy of Michael Jordan.

The Algorithm

In the clustering problem, we are given a training set ${x^{(1)}, ... , x^{(m)}}$, and want to group the data into a few cohesive "clusters." Here, we are given feature vectors for each data point $x^{(i)} \in \mathbb{R}^n$ as usual; but no labels $y^{(i)}$ (making this an unsupervised learning problem). Our goal is to predict $k$ centroids and a label $c^{(i)}$ for each datapoint. The k-means clustering algorithm is as follows:

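The algorithm figure is not reproduced here; in the standard formulation (which the notation above follows), the algorithm is:

$$ \text{1. Initialize cluster centroids } \mu_1, \ldots, \mu_k \in \mathbb{R}^n \text{ randomly.} $$

$$ \text{2. Repeat until convergence:} \quad c^{(i)} := \arg\min_j \left\| x^{(i)} - \mu_j \right\|^2 \ \text{ for each } i, \qquad \mu_j := \frac{\sum_{i=1}^{m} 1\{c^{(i)} = j\}\, x^{(i)}}{\sum_{i=1}^{m} 1\{c^{(i)} = j\}} \ \text{ for each } j. $$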

Implementation

Here is pseudo-python code which runs k-means on a dataset. It is a short algorithm made longer by verbose commenting.

```python
# Function: K Means
# -------------
# K-Means is an algorithm that takes in a dataset and a constant
# k and returns k centroids (which define clusters of data in the
# dataset which are similar to one another).
def kmeans(dataSet, k):
    # Initialize centroids randomly
    numFeatures = dataSet.getNumFeatures()
    centroids = getRandomCentroids(numFeatures, k)

    # Initialize book keeping vars.
    iterations = 0
    oldCentroids = None

    # Run the main k-means algorithm
    while not shouldStop(oldCentroids, centroids, iterations):
        # Save old centroids for convergence test. Book keeping.
        oldCentroids = centroids
        iterations += 1

        # Assign labels to each datapoint based on centroids
        labels = getLabels(dataSet, centroids)

        # Assign centroids based on datapoint labels
        centroids = getCentroids(dataSet, labels, k)

    # We can get the labels too by calling getLabels(dataSet, centroids)
    return centroids


# Function: Should Stop
# -------------
# Returns True or False if k-means is done. K-means terminates either
# because it has run a maximum number of iterations OR the centroids
# stop changing.
def shouldStop(oldCentroids, centroids, iterations):
    if iterations > MAX_ITERATIONS:
        return True
    return oldCentroids == centroids


# Function: Get Random Centroids
# -------------
# Returns k random centroids, each of dimension n.
def getRandomCentroids(n, k):
    # Return some reasonable randomization.
    ...


# Function: Get Labels
# -------------
# Returns a label for each piece of data in the dataset.
def getLabels(dataSet, centroids):
    # For each element in the dataset, choose the closest centroid.
    # Make that centroid the element's label.
    ...


# Function: Get Centroids
# -------------
# Returns k centroids, one for each cluster, each of dimension n.
def getCentroids(dataSet, labels, k):
    # Each centroid is the geometric mean of the points that have that
    # centroid's label. Important: If a centroid is empty (no points have
    # that centroid's label) you should randomly re-initialize it.
    ...
```

Important note: You might be tempted to calculate the distance between two points manually, by looping over values. This will work, but it will lead to a slow k-means! And a slow k-means will mean that you have to wait longer to test and debug your solution.

Let's define three vectors: two single length-5 vectors, x and y, and a collection of length-5 vectors, z (an example definition is sketched below).

To calculate the distance between x and y we can use: np.sqrt(sum((x - y) ** 2))

To calculate the distance between all the length 5 vectors in z and x we can use: np.sqrt(((z-x)**2).sum(axis=0))
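A small runnable illustration of both expressions (the arrays here are made up, and x is written as a column vector in the second computation so that the subtraction broadcasts across the columns of z; the exact shapes in the original handout may differ):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
z = np.arange(20.0).reshape(5, 4)   # four length-5 vectors, one per column

# Distance between x and y
dist_xy = np.sqrt(sum((x - y) ** 2))

# Distances between x and every column of z
dists_xz = np.sqrt(((z - x.reshape(-1, 1)) ** 2).sum(axis=0))

print(dist_xy)    # a single number
print(dists_xz)   # one distance per column of z
```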

Expectation Maximization

K-Means is really just the EM (Expectation Maximization) algorithm applied to a particular naive bayes model.

To demonstrate this remarkable claim, consider the classic naive bayes model with a class variable which can take on discrete values (with domain size $k$) and a set of feature variables, each of which can take on a continuous value (see figure 2). The conditional probability distribution for $P(f_i = x | C= c)$ is going to be slightly different than usual. Instead of storing this conditional probability as a table, we are going to store it as a single normal (gaussian) distribution, with its own mean and a standard deviation of 1. Specifically, this means that: $P(f_i = x | C= c) \sim \mathcal{N}(\mu_{c,i}, 1)$

Learning the values of $\mu_{c, i}$ given a dataset with assigned values to the features but not the class variables is provably identical to running k-means on that dataset.


Figure 2: The K-Means algorithm is the EM algorithm applied to this Bayes Net.

If we know that this is the structure of our bayes net, but we don't know any of the conditional probability distributions, then we have to run Parameter Learning before we can run Inference.

In the dataset we are given, all the feature variables are observed (for each data point) but the class variable is hidden. Since we are running Parameter Learning on a bayes net where some variables are unobserved, we should use EM.

Let's review EM. In EM, you randomly initialize your model parameters, then you alternate between (E) assigning values to hidden variables, based on parameters and (M) computing parameters based on fully observed data.

E-Step : Coming up with values for hidden variables, based on parameters. If you work out the math of choosing the best values for the class variable based on the features of a given piece of data in your data set, it comes out to "for each data-point, choose the centroid that it is closest to, by Euclidean distance, and assign that centroid's label." The proof of this is within your grasp! See lecture.

M-Step : Coming up with parameters, based on full assignments. If you work out the math of choosing the best parameter values based on the features of a given piece of data in your data set, it comes out to "take the mean of all the data-points that were labeled as c."

So what? Well this gives you an idea of the qualities of k-means. Like EM, it is provably going to find a local optimum. Like EM, it is not necessarily going to find a global optimum. It turns out those random initial values do matter.

Figure 1 shows k-means with a 2-dimensional feature vector (each point has two dimensions, an x and a y). In your applications, you will probably be working with data that has a lot of features. In fact each data-point may be hundreds of dimensions. We can visualize clusters in up to 3 dimensions (see figure 3) but beyond that you have to rely on a more mathematical understanding.


Figure 3: KMeans in other dimensions. (left) K-means in 2d. (right) K-means in 3d. You have to imagine k-means in 4d.



K means Clustering – Introduction

K-Means Clustering is an Unsupervised Machine Learning algorithm which groups an unlabeled dataset into different clusters. This article aims to explore the fundamentals and working of k-means clustering along with its implementation.

Table of Contents

  • What is K-means Clustering?
  • What is the objective of k-means clustering?
  • How k-means clustering works
  • Implementation of K-means Clustering in Python

Unsupervised Machine Learning is the process of teaching a computer to use unlabeled, unclassified data and enabling the algorithm to operate on that data without supervision. Without any previous data training, the machine’s job in this case is to organize unsorted data according to parallels, patterns, and variations. 

K-means clustering assigns data points to one of K clusters depending on their distance from the centers of the clusters. It starts by randomly placing the cluster centroids in the space. Then each data point is assigned to one of the clusters based on its distance from the cluster's centroid. After assigning each point to one of the clusters, new cluster centroids are computed. This process runs iteratively until it finds good clusters. In this analysis we assume that the number of clusters is given in advance and we have to put points into one of the groups.

In some cases, K is not clearly defined, and we have to think about the optimal number of clusters. K-means clustering performs best when the data is well separated; when data points overlap, this clustering is not suitable. K-means is faster compared to other clustering techniques and provides strong coupling between the data points, but it does not provide clear information regarding the quality of the clusters. Different initial assignments of the cluster centroids may lead to different clusters. Also, the K-means algorithm is sensitive to noise and may get stuck in local minima.

The goal of clustering is to divide the population or set of data points into a number of groups so that the data points within each group are more comparable to one another and different from the data points within the other groups. It is essentially a grouping of things based on how similar and different they are to one another. 

We are given a data set of items, with certain features, and values for these features (like a vector). The task is to categorize those items into groups. To achieve this, we will use the K-means algorithm, an unsupervised learning algorithm. ‘K’ in the name of the algorithm represents the number of groups/clusters we want to classify our items into.

(It will help if you think of items as points in an n-dimensional space). The algorithm will categorize the items into k groups or clusters of similarity. To calculate that similarity, we will use the Euclidean distance as a measurement.
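For reference, the Euclidean distance between two items $p$ and $q$ described by $n$ features is:

$$ d(p, q) = \sqrt{\sum_{j=1}^{n} (p_j - q_j)^2} $$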

The algorithm works as follows:  

  • First, we randomly initialize k points, called means or cluster centroids.
  • We categorize each item to its closest mean, and we update the mean’s coordinates, which are the averages of the items categorized in that cluster so far.
  • We repeat the process for a given number of iterations and at the end, we have our clusters.

The “points” mentioned above are called means because they are the mean values of the items categorized in them. To initialize these means, we have a lot of options. An intuitive method is to initialize the means at random items in the data set. Another method is to initialize the means at random values between the boundaries of the data set (if for a feature x, the items have values in [0,3], we will initialize the means with values for x at [0,3]).

The above algorithm in pseudocode is as follows:  
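The original pseudocode block is not reproduced here; a compact, runnable sketch of the same loop (all names, defaults, and the NumPy details are illustrative):

```python
import numpy as np

def k_means(items, k, max_iterations=100, seed=0):
    rng = np.random.default_rng(seed)
    items = np.asarray(items, dtype=float)
    # Initialize the means at k randomly chosen items
    means = items[rng.choice(len(items), size=k, replace=False)]
    for _ in range(max_iterations):
        # Assign every item to its closest mean (Euclidean distance)
        distances = np.linalg.norm(items[:, None, :] - means[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Recompute each mean as the average of the items assigned to it
        new_means = np.array([items[labels == j].mean(axis=0) if np.any(labels == j)
                              else means[j] for j in range(k)])
        if np.allclose(new_means, means):
            break
        means = new_means
    return means, labels
```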

Import the necessary Libraries

We are importing NumPy for numerical computations, Matplotlib to plot the graphs, and make_blobs from sklearn.datasets to generate a synthetic dataset.

Create the custom dataset with make_blobs and plot it
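A sketch of these first two steps (the make_blobs parameters are illustrative, not the article's exact values):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Generate a small 2-D dataset with three blob-shaped clusters
X, y_true = make_blobs(n_samples=500, n_features=2, centers=3, random_state=23)

plt.scatter(X[:, 0], X[:, 1])
plt.grid(True)
plt.show()
```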

[Figure: Clustering dataset]

Initialize the random centroids

The code initializes three clusters for K-means clustering. It sets a random seed and generates random cluster centers within a specified range, and creates an empty list of points for each cluster.

Plot the randomly initialized centers with the data points

[Figure: Data points with random centers]

The plot displays a scatter plot of data points (X[:,0], X[:,1]) with grid lines. It also marks the initial cluster centers (red stars) generated for K-means clustering.

Define Euclidean distance

Create the function to assign and update the cluster center.

The E-step assigns data points to the nearest cluster center, and the M-step updates cluster centers based on the mean of assigned points in K-means clustering.
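A hedged sketch of these helpers; the data structures are simplified relative to the article's notebook (X is an array of points, centers a float NumPy array of k centroids):

```python
import numpy as np

def distance(p1, p2):
    # Euclidean distance between two points
    return np.sqrt(np.sum((p1 - p2) ** 2))

def assign_clusters(X, centers):
    # E-step: label every point with the index of its nearest center
    return np.array([np.argmin([distance(x, c) for c in centers]) for x in X])

def update_centers(X, labels, centers):
    # M-step: move each center to the mean of the points assigned to it
    new_centers = centers.copy()
    for j in range(len(centers)):
        assigned = X[labels == j]
        if len(assigned) > 0:
            new_centers[j] = assigned.mean(axis=0)
    return new_centers
```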

Create the function to predict the cluster for the data points

Assign and update the cluster centers, predict the cluster for each data point, and plot the data points with their predicted cluster centers.

[Figure: K-means clustering result]

The plot shows data points colored by their predicted clusters. The red markers represent the updated cluster centers after the E-M steps in the K-means clustering algorithm.

Import the necessary libraries

Load the dataset

Elbow method

Finding the ideal number of groups to divide the data into is a basic stage in any unsupervised algorithm. One of the most common techniques for figuring out this ideal value of k is the elbow approach.

Plot the elbow graph to find the optimum number of clusters

[Figure: Elbow Method plot]

From the above graph, we can observe an elbow-like bend at k=2 and k=3. So, we are considering K=3.

Build the K-means clustering model

Find the cluster centers

Predict the cluster group

Plot the cluster centers with the data points

[Figure: K-means clustering]

The subplot on the left displays petal length vs. petal width with data points colored by clusters, and red markers indicate the K-means cluster centers. The subplot on the right shows sepal length vs. sepal width similarly.

In conclusion, K-means clustering is a powerful unsupervised machine learning algorithm for grouping unlabeled datasets. Its objective is to divide data into clusters, making similar data points part of the same group. The algorithm initializes cluster centroids and iteratively assigns data points to the nearest centroid, updating centroids based on the mean of points in each cluster.

Frequently Asked Questions (FAQs)

1. What is k-means clustering for data analysis?

K-means is a partitioning method that divides a dataset into ‘k’ distinct, non-overlapping subsets (clusters) based on similarity, aiming to minimize the variance within each cluster.

2. What is an example of k-means in real life?

Customer segmentation in marketing, where k-means groups customers based on purchasing behavior, allowing businesses to tailor marketing strategies for different segments.

3. What type of data does the k-means clustering model work with?

K-means works well with numerical data, where the concept of distance between data points is meaningful. It’s commonly applied to continuous variables.

4. Is K-means used for prediction?

K-means is primarily used for clustering and grouping similar data points. It does not predict labels for new data; it assigns them to existing clusters based on similarity.

5. What is the objective of k-means clustering?

The objective is to partition data into ‘k’ clusters, minimizing the intra-cluster variance. It seeks to form groups where data points within each cluster are more similar to each other than to those in other clusters.



K-means 101: An Introductory Guide to k-Means Clustering in R

Posted on March 13, 2021 by Dr. Shirin Elsinghorst.


Editor’s note: This is a guest post by Nathaniel Schmucker. He is the founder of The Analyst Code , a blog that provides tools to instill a love of data in individuals of all backgrounds and to empower aspiring analysts.

Introduction

In this post, we will look at:

  • What is a k-Means analysis?
  • How does the k-Means algorithm work?
  • How do we implement k-Means in R?
  • Concluding thoughts

Along the way, we will cover 5 principles to help guide you through your own exploration of k-Means and to help answer common questions.


A k-Means analysis is one of many clustering techniques for identifying structural features of a set of datapoints. The k-Means algorithm groups data into a pre-specified number of clusters, k , where the assignment of points to clusters minimizes the total sum-of-squares distance to the cluster’s mean. We can then use the mean value of all the points in a given cluster as a prototypical point characterizing the cluster.

For example, we could have a dataset with the horsepower and fuel efficiency of various car models, and we want to use that data to classify the cars into natural groupings–sports cars, sedans, etc. Essentially, we want to find structure in a scatter plot of horsepower against fuel efficiency.

k-Means is easy to implement. In R, you can use the function kmeans() to quickly deploy an efficient k-Means algorithm. On datasets of reasonable size (thousands of rows), the kmeans function runs in fractions of a second.

k-Means is easy to interpret (in 2 dimensions). If you have two features of your k-Means analysis ( e.g. , you are grouping by length and width), the result of the k-Means algorithm can be plotted on an xy-coordinate system to show the extent of each cluster. It’s easy to visually inspect the assignment to see if the k-Means analysis returned a meaningful insight. In more dimensions ( e.g. , length, width, and height) you will need to either create a 3D plot, summarize your features in a table, or find another alternative to describing your analysis. This loses the intuitive power that a 2D k-Means analysis has in convincing you or your audience that your analysis should be trusted. It’s not to say that your analysis is wrong; it simply takes more mental focus to understand what your analysis says.

The k-Means analysis, however, is not always the best choice. k-Means does well on data that naturally falls into spherical clusters. If your data has a different shape (linear, spiral, etc.), k-Means will force clustering into circles, which can result in outputs that defy human expectations. The algorithm is not wrong; we have fed the algorithm data it was never intended to understand.

Every data analyst should be comfortable using and explaining the k-Means algorithm. It is easy to implement and interpret, and very few real-world datasets violate our spherical-clustering assumption.

  • Pick a number of clusters, k
  • Create k random points and call each of these the center of a cluster
  • For each point in your dataset, find the closest cluster center and assign the point to that cluster
  • Once each point has been assigned to a cluster, calculate the new center of that cluster
  • Repeat steps 3 and 4 until you reach a stage when no points need to be reassigned
  • Stop. You have found your k clusters and their centers!

If you want to learn more about k-Means, I would recommend this post on Medium , though be aware that the example code is all written in Python. If you are brave and want to go very deep in k-Means theory, take a look at the Wikipedia page . Or, if you would like to see one application of k-Means in R, see this blog’s post about using k-Means to help assist in image classification with Keras . For a detailed illustration of how to implement k-Means in R, along with answers to some common questions, keep reading below.

Let’s begin by loading packages and functions, which we’ll use later.

The data we will use for this example is from one of R’s pre-loaded datasets, quakes . It is a data.frame with 1000 rows and five columns describing earthquakes near Fiji since 1964. The columns are latitude (degrees), longitude (degrees), depth (km), magnitude (Richter scale), and the number of stations reporting the quake. The only preprocessing we will do now is to remove stations and convert this to a tibble.

Now for the fun. For our first example, let’s run a k-Means analysis on two variables: depth and magnitude. We reduce our raw data to only these two variables and pass it to the base R function, kmeans() . Before we hit “Run” on this function, though, let’s talk through two principles of k-Means clustering.

Principle 1: Number of iterations

The k-Means algorithm, as mentioned above, iterates through a process of assigning points to a cluster based on the closest cluster center and recalculating cluster centers, not stopping until no more points are assigned to a new cluster during the assignment step. In some cases, the number of iterations can be very large, and the algorithm can consequentially become slow. For speed and convenience, we can cap the number of iterations at 10 (the default value), but for precision, more iterations are better than fewer.

Principle 2: Local vs. global minimum

The k-Means algorithm results in an assignment of points to clusters that minimizes the within-cluster sum of squares: in each iterative step, if we were to add up the total squared distances of points to the mean, we would find that the sum was less than the step before. In layman's terms, our algorithm stopped when it couldn't find a way to assign points into tighter groupings.

But, just because our algorithm couldn’t find a better grouping doesn’t mean that we found the best grouping. Based on our random set of starting points, we found the best solution, but a different set of starting points could have found a better solution.

This is a problem common across machine learning algorithms. Finding the globally best solution is substantially harder than finding the best given particular starting point. The kmeans function can help us ensure our final product is at least a pretty good local minimum by running multiple times and showing only the best answer. By default, the number of tries kmeans takes is 1, but we can easily adjust this to, say, nstart = 5 .

Now let’s hit “Run” on our k-Means analysis and save the output as kclust . This is an object of class “kmeans,” which is not the easiest to use for subsequent analysis. We’ll use augment , tidy , and glance to extract useful summary information as tables. We’ll also include a quick plot to see how our clustering did.

[Plot: k-Means clusters of the quakes data, depth vs. magnitude (unscaled)]

Reflect for a moment on what we just did. First, note that running kmeans is incredibly easy (one function!) and on 1000 data points was very fast. Second, note that our graph looks pretty strange. The algorithm has created clusters that seem only to care about depth. Does this mean magnitude is an irrelevant feature in our data? By no means! Time for Principle 3.

Principle 3: Feature scaling

k-Means calculates distance to the cluster center using Euclidean distance: the length of a line segment connecting the two points. In two dimensions, this is the Pythagorean Theorem. Aha, you say! I see the problem: we are comparing magnitudes (4.0-6.4) to depth (40-680). Depth has significantly more variation (standard deviation 0.4 for magnitude vs. 215 for depth) and therefore gets overweighted when calculating distance to the mean.

We need to employ feature scaling. As a general rule, if we are comparing unlike units (meters and kilograms) or independent measurements (height in meters and circumference in meters), we should normalize values, but if units are related (petal length and petal width), we should leave them as is.

Unfortunately, many cases require judgment both on whether to scale and how to scale. This is where your expert opinion as a data analyst becomes important. For the purposes of this blog post, we will normalize all of our features, including latitude and longitude, by transforming them to standard normal distributions. The geologists might object to this methodology for normalizing (magnitude is a log scale!!), but please forgive some imprecision for the sake of illustration.
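For reference, the standardization applied here turns each value of a feature into a z-score:

$$ z_i = \frac{x_i - \bar{x}}{s_x} $$

where \(\bar{x}\) and \(s_x\) are that feature's sample mean and standard deviation.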

With our fully-preprocessed data, let’s re-run our k-Means analysis.

[Plot: k-Means clusters of the quakes data after feature scaling]

Now for a little more fun. k-Means can be extended beyond 2 feature dimensions. Let’s re-run our k-Means analysis again, but this time include all four variables: latitude, longitude, depth, and magnitude. I’ll forego the 2D ggplot2 graph and show you instead a 3D plotly graph, with magnitude described by each bubble’s size.

If you’ve made it this far, and if you are thinking about analyses that you would like to run on your own data, you may be asking yourself why we have been running these analyses with four clusters. It feels like a good number, but is it the right number?

Principle 4: Choose the right number of clusters

The k-Means algorithm cannot tell us what the ideal number of clusters is. The “right” number of clusters is subjective and depends on both the structure of your data and what your intended purpose is. Frequently, we want to find the number of clusters that most efficiently clusters points; we learned a lot by increasing from k-1 to k clusters, but increasing to k+1 clusters only reduces our sum of squares by a little bit more.

Let’s begin by using purrr::map to run kmeans using 1 through 12 clusters.

We can plot our 12 k-Means analyses using ggplot2::facet_wrap . Since we clustered on four features and our graphs only show two dimensions (latitude and longitude), it’s a little hard to get a full picture of how well each k clustered our data. But, it looks like two clusters is reasonable but not very insightful, four seems pretty good, and by eight everything becomes a mess.

[Plot: cluster assignments for k = 1 through 12, faceted by k]

We’ll build an elbow chart to help us understand how our residual error terms–the sum across all clusters of the within-cluster sum of squares–change with the number of clusters. We plot k against total within sum of squares.

[Plot: total within-cluster sum of squares vs. number of clusters k]

Ideally, we would see a sharp bend in this curve, where a single cluster number is a turning point from the graph having a steep negative slope to having a shallow negative slope. In this example, however, the rate of change of the slope is gradual, with no clear elbow. This is where the art comes in, and my interpretation is that 3-5 clusters balances a small sum of squares while not losing too much interpretability.

Principle 5: The human side of data science

Principles 3 (Feature scaling) and 4 (Choose the right number of clusters) highlight the importance of human judgment on machine learning algorithms, and data science in general. Algorithms are very good at doing exactly as they are told. But if we feed them poor instructions or poor quality data, they will accurately calculate a useless result. The value of the data analyst, data scientist, or statistician is not that they can run complicated analyses, but that they have skill at knowing how best to run those analyses.

In this post we have had a brief introduction to the k-Means clustering algorithm, gained an understanding of how it works and where its weaknesses lie, and seen it demonstrated in R using kmeans() . The algorithm is easy to run and visualize, and it is a helpful tool for anyone looking to cluster and characterize a large set of datapoints.


What Is the K-Means Clustering Algorithm?


K-means is a very simple clustering algorithm used in machine learning. Clustering is an unsupervised learning task. Learning is unsupervised when it requires no labels on its data. Such algorithms can find inherent structure and patterns in unlabeled data. Contrast this with supervised learning , where a model learns to match inputs to corresponding outputs, which requires the data to be labeled.

Clustering is perhaps the simplest unsupervised learning task to understand. In a data set, it’s possible to see that certain data points cluster together and form a natural group.

For example, suppose we have a lot of data on the behavior of YouTube viewers. We might identify one set of users who visit the site frequently, stay on for a long time, and typically reach videos through the subscription feed. Call these “power users.” We might also see a group of “casual users” who visit the site only occasionally, don’t stick around for very long, and typically reach videos from the home page, recommendation bar or directly from links.

Clustering techniques can identify such groups of users automatically, without any human having to label some users as power users and others as casual. Such an algorithm won’t be able to tell us what these clusters mean, but if there are groups of users with different behavior, it can at least detect that they exist and identify their key qualities. We can then decide what to do from there.

Now let’s get into it!

K-Means Algorithm

K-means is a simple clustering algorithm in machine learning. In a data set, it's possible to see that certain data points cluster together and form a natural group. The goal of k-means is to locate the centroids around which data is clustered. They are the "means" in "k-means." If we know where these points are, the intuition behind the algorithm is that we can then classify each point by assigning it to its closest cluster center.


How Does the K-Means Algorithm Work?

Consider the following unlabeled data:

A visual plot of a data set

It was randomly generated to cluster around five central points, called centroids. The goal of k-means is to locate these centroids; they are the “means” in “k-means.” If we know where these points are, the intuition behind the algorithm is that we can then classify each point by assigning it to its closest cluster center.

Okay, but how do we find these centroids? First, we must tell the algorithm how many there are. In this case, there are five. The algorithm then makes a guess, choosing five random points as its initial centroids, and assigns each data point to the closest of them. Next, it looks at all the points assigned to the same cluster. It's likely that the centroid we initially picked is actually not at the center of this cluster (as you can see in the visualization below). So, we update our guess by moving the centroid to the middle of the cluster and do the same for every other centroid. We then repeat this process until the centers stop moving.

As an example, here’s what the first three steps of (one particular run of) k-means looks like on this data set:

K-means clustered data

In the top left, we have our initial choice of five random centroids. In the top right, we assign each point to its closest centroid and color the regions accordingly. In the middle left, we then shift our centroids to the middle of our identified clusters, and so on. Running the full algorithm, we eventually get the following clustering:

K-means clustered data

Looks pretty good! These five centroids now lie right at the centers of the five clusters in our data.

If you’re looking to use this algorithm, the machine learning library Scikit-learn has an easy-to-use, built-in k-means class . It comes with some extra optimizations by default. For example, it biases the initial choice of centroids to be more spread out than would typically occur from a purely random choice. It also repeats the algorithm a number of times from scratch and returns the best clustering it finds.
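A minimal usage sketch of that class (the synthetic data and the choice of five clusters are assumptions for illustration, not the article's code):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the five-cluster data pictured above
X, _ = make_blobs(n_samples=1000, centers=5, random_state=0)

# n_init repeats the algorithm from scratch and keeps the best run;
# the default initialization is the spread-out "k-means++" scheme.
model = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)

print(model.cluster_centers_)   # the five learned centroids
print(model.labels_[:10])       # cluster assignments of the first ten points
print(model.inertia_)           # sum of squared distances to assigned centroids
```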

Without this, it’s easy for k-means to get stuck on a clustering that’s too densely packed in one area. For example, look back at the second figure above. We can see that k-means initially has a lot more centroids in the bottom-left than the top-right. If we get an unlucky run, the algorithm may never realize that the “cluster” in the top-right is actually two clusters.

Advantages of K-Means

The main advantages of k-means are its simplicity, speed and scalability. Understanding and implementing k-means is simple, and it’s always good to understand your tools. In practice, it also runs quite fast and even works well for large data sets.  It’s also guaranteed to terminate (although not necessarily to the optimal answer), so the algorithm won’t keep moving its centers back and forth forever.

Disadvantages of K-Means (and How to Fix Them)

K-means has some disadvantages, however. First, we must choose the number of clusters k in advance; the algorithm does not determine it for us. If you have some prior knowledge about how many there should be, all well and good. Otherwise, you’ll have to tune the choice of k . This can be tricky, but there are a few potential solutions.

What Is Inertia in K-Means Clustering?

The inertia of a k-means model is the mean squared distance from each point to its assigned centroid.

It's not enough to just minimize inertia since making k larger always lowers it. If we set k to 10 in our last example, then points would be much closer to their centers, but it wouldn't be a very natural clustering. Often, however, if you try increasing values of k , you will see an initial sharp dropoff in inertia with diminishing returns as k gets larger. Values of k just before these diminishing returns are good candidates.

Another potential solution is to use the silhouette score , a metric which rewards the model for having points near their centroids but also penalizes it for points being too close to other centroids. This can disincentivize having too many clusters. Looking for values of k that score high on this metric is another good way to find candidates. Additionally, if your k-means model is part of a larger, supervised learning system, as I discuss more later, then you can tune k in a straightforward way in order to minimize loss of the overall system.

K-means is also sensitive to the positions of the initial centroids. It can get stuck at a local optimum and end up with a crummy answer. But we’ve already discussed how to address this: initialize centroids to be spread out, and run the algorithm multiple times. Since k-means is fairly fast, this isn’t too much of a problem.

Next, k-means is sensitive to the scale of the data. Distance in each direction is treated as equally important, so if one of your features is on a larger scale (e.g., a weight of 100 to 500 pounds versus length of five to 10 feet), k-means will treat it as more important. You can address this problem relatively easily by rescaling your data.

Harder to overcome is that k-means is sensitive to the shape of the data. The algorithm only learns spherical clusters, so if your data has clusters of a different shape, it will struggle to learn them. K-means is also sensitive to outliers and struggles with higher-dimensionality data. For example, k-means would have a hard time clustering 1024 by 1024 images since our data points would have over a million dimensions.


Applications of K-Means Algorithm in Machine Learning

K-means has lots of applications. One is customer segmentation, which we discussed in the YouTube example at the start. This means identifying groups of similar users and can be useful to better understand your userbase. What sorts of people are using your product, and what for and in what way? This may inform overall design decisions and business strategies. 

Customer segmentation is also useful from the more computational side. Continuing with the YouTube example, we can recommend videos or channels to users based on what other people in their cluster are watching.

Another use case is anomaly detection. If some new data point is very far from all your existing clusters, then something strange is going on and may warrant investigation. Suppose we find some YouTube account which clicks on a video, immediately likes it, then moves on and repeats. This is probably a bot that you want to deactivate. Fortunately, its behavior likely does not fit into your natural clusters of users, and we can detect this automatically.

Yet another use is dimensionality reduction and feature engineering. Our data may have dozens of dimensions but only, say, five significant clusters. If we transform each point by representing it with its distance to each centroid, we can drastically reduce the dimensionality of our data. This can speed up learning if we then feed our transformed data into a supervised learning algorithm. Alternatively, we could add these distances as new features to our data set to aid learning, rather than throwing out the original features entirely.

Finally, consider semi-supervised learning, when only a small fraction of our data is labeled. Suppose we have a set of handwritten digits, like the classic MNIST problem . If we have 100,000 such pictures, but only 10 percent of them are labeled, supervised learning might not do very well on just that 10 percent. Is there some way we can use the other 90 percent of the data to our advantage?

Yes! If we cluster the data, we can identify groups of pictures that are probably the same digit without any labeling required. If we then find the labeled picture closest to each centroid, we can propagate its label to the other pictures in the cluster or possibly just the pictures closest to the centroid. We then have a much larger amount of labeled data to train on.

K-means can be a particularly good strategy for semi-supervised classification since we know something about the number of clusters in advance: It’s tied to the number of classes. Usually, we’ll want to set k to be some small multiple of the number of classes. Here, we might have 50 clusters since there are multiple ways to write each digit.

And that’s k-means! Remember to go check out the Scikit-learn k-means class to start using the algorithm right away.



Partitional Clustering in R: The Essentials

K-Means Clustering in R: Algorithm and Practical Examples

K-means clustering (MacQueen 1967) is one of the most commonly used unsupervised machine learning algorithm for partitioning a given data set into a set of k groups (i.e. k clusters ), where k represents the number of groups pre-specified by the analyst. It classifies objects in multiple groups (i.e., clusters), such that objects within the same cluster are as similar as possible (i.e., high intra-class similarity ), whereas objects from different clusters are as dissimilar as possible (i.e., low inter-class similarity ). In k-means clustering, each cluster is represented by its center (i.e, centroid ) which corresponds to the mean of points assigned to the cluster.

In this article, you will learn:

  • The basic steps of k-means algorithm
  • How to compute k-means in R software using practical examples
  • Advantages and disadvantages of k-means clustering

  • K-means basic ideas
  • K-means algorithm
  • Required R packages and functions
  • Estimating the optimal number of clusters
  • Computing k-means clustering
  • Accessing the results of the kmeans() function
  • Visualizing k-means clusters
  • K-means clustering advantages and disadvantages
  • Alternative to k-means clustering
  • Related book

The basic idea behind k-means clustering consists of defining clusters so that the total intra-cluster variation (known as total within-cluster variation) is minimized.

There are several k-means algorithms available. The standard algorithm is the Hartigan-Wong algorithm (Hartigan and Wong 1979) , which defines the total within-cluster variation as the sum of squared Euclidean distances between items and the corresponding centroid:

\[ W(C_k) = \sum\limits_{x_i \in C_k} (x_i - \mu_k)^2 \]

  • \(x_i\) designates a data point belonging to the cluster \(C_k\)
  • \(\mu_k\) is the mean value of the points assigned to the cluster \(C_k\)

Each observation ( \(x_i\) ) is assigned to a given cluster such that the sum of squared (SS) distances of the observation to its assigned cluster center \(\mu_k\) is a minimum.

We define the total within-cluster variation as follow:

\[ tot.withinss = \sum\limits_{k=1}^k W(C_k) = \sum\limits_{k=1}^k \sum\limits_{x_i \in C_k} (x_i - \mu_k)^2 \]

The total within-cluster sum of square measures the compactness (i.e goodness ) of the clustering and we want it to be as small as possible.

The first step when using k-means clustering is to indicate the number of clusters (k) that will be generated in the final solution.

The algorithm starts by randomly selecting k objects from the data set to serve as the initial centers for the clusters. The selected objects are also known as cluster means or centroids.

Next, each of the remaining objects is assigned to its closest centroid, where closest is defined using the Euclidean distance between the object and the cluster mean. This step is called the "cluster assignment step". Note that, to use correlation distance, the data are input as z-scores.

After the assignment step, the algorithm computes the new mean value of each cluster. The term cluster "centroid update" is used to designate this step. Now that the centers have been recalculated, every observation is checked again to see if it might be closer to a different cluster. All the objects are reassigned again using the updated cluster means.

The cluster assignment and centroid update steps are iteratively repeated until the cluster assignments stop changing (i.e until convergence is achieved). That is, the clusters formed in the current iteration are the same as those obtained in the previous iteration.

K-means algorithm can be summarized as follow:

  • Specify the number of clusters (K) to be created (by the analyst)
  • Select randomly k objects from the dataset as the initial cluster centers or means
  • Assigns each observation to their closest centroid, based on the Euclidean distance between the object and the centroid
  • For each of the k clusters update the cluster centroid by calculating the new mean values of all the data points in the cluster. The centroid of the k-th cluster is a vector of length p containing the means of all variables for the observations in the k-th cluster; p is the number of variables.
  • Iteratively minimize the total within sum of square. That is, iterate steps 3 and 4 until the cluster assignments stop changing or the maximum number of iterations is reached. By default, the R software uses 10 as the default value for the maximum number of iterations.

Computing k-means clustering in R

We'll use the demo data set "USArrests". The data should be prepared as described in the chapter on data preparation and R packages. The data must contain only continuous variables, as the k-means algorithm uses variable means. As we don't want the k-means algorithm to depend on an arbitrary variable unit, we start by scaling the data using the R function scale() as follows:

The standard R function for k-means clustering is kmeans() [stats package], whose simplified format is as follows:
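For reference, a sketch of the call with its default values (the arguments are described below):

```r
kmeans(x, centers, iter.max = 10, nstart = 1)
```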

  • x : numeric matrix, numeric data frame or a numeric vector
  • centers : Possible values are the number of clusters (k) or a set of initial (distinct) cluster centers. If a number, a random set of (distinct) rows in x is chosen as the initial centers.
  • iter.max : The maximum number of iterations allowed. Default value is 10.
  • nstart : The number of random starting partitions when centers is a number. Trying nstart > 1 is often recommended.

To create a beautiful graph of the clusters generated with the kmeans() function, we will use the factoextra package.
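A sketch of installing the package (only needed once) and loading it for the session:

```r
install.packages("factoextra")  # install once
library(factoextra)             # load for each session
```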

K-means clustering requires the user to specify the number of clusters to be generated.

One fundamental question is: How to choose the right number of expected clusters (k)?

Different methods will be presented in the chapter “cluster evaluation and validation statistics”.

Here, we provide a simple solution. The idea is to compute k-means clustering for different numbers of clusters k. Next, the wss (within sum of squares) is plotted against the number of clusters. The location of a bend (knee) in the plot is generally considered an indicator of the appropriate number of clusters.

The R function fviz_nbclust () [in factoextra package] provides a convenient solution to estimate the optimal number of clusters.
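A sketch of the call, assuming the scaled data are in df as above; the vertical line at k = 4 is purely illustrative:

```r
fviz_nbclust(df, kmeans, method = "wss") +
  geom_vline(xintercept = 4, linetype = 2)  # mark the suspected elbow
```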

[Figure: total within-cluster sum of squares (wss) plotted against the number of clusters k, produced with fviz_nbclust(); a bend is visible at k = 4]

The plot above represents the variance within the clusters. It decreases as k increases, but a bend (or “elbow”) can be seen at k = 4. This bend indicates that additional clusters beyond the fourth have little value. In the next section, we’ll classify the observations into 4 clusters.

As the k-means clustering algorithm starts with k randomly selected centroids, it’s always recommended to use the set.seed() function in order to set a seed for R’s random number generator. The aim is to make the results reproducible, so that the reader of this article will obtain exactly the same results as those shown below.

The R code below performs k-means clustering with k = 4:
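A sketch of that code, assuming the scaled data are in df; the seed value and the object name km.res are arbitrary:

```r
set.seed(123)                         # reproducible random starts
km.res <- kmeans(df, 4, nstart = 25)  # k = 4 clusters, 25 random starts
print(km.res)
```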

As the final result of k-means clustering is sensitive to the random starting assignments, we specify nstart = 25. This means that R will try 25 different random starting assignments and then select the best result, corresponding to the one with the lowest within-cluster variation. The default value of nstart in R is one, but it’s strongly recommended to compute k-means clustering with a large value of nstart such as 25 or 50 in order to obtain a more stable result.

The printed output displays:

  • the cluster means or centers: a matrix whose rows are cluster numbers (1 to 4) and whose columns are variables
  • the clustering vector: A vector of integers (from 1:k) indicating the cluster to which each point is allocated

It’s possible to compute the mean of each variable by cluster using the original data:
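For instance, assuming km.res holds the k-means result from above:

```r
# Cluster-wise means of the original (unscaled) variables
aggregate(USArrests, by = list(cluster = km.res$cluster), mean)
```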

If you want to add the point classifications to the original data, use this:
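A sketch, storing the augmented data in dd (the name is ours):

```r
dd <- cbind(USArrests, cluster = km.res$cluster)
head(dd)
```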

The kmeans() function returns a list of components, including:

  • cluster : A vector of integers (from 1:k) indicating the cluster to which each point is allocated
  • centers : A matrix of cluster centers (cluster means)
  • totss : The total sum of squares (TSS), i.e \(\sum{(x_i - \bar{x})^2}\) . TSS measures the total variance in the data.
  • withinss : Vector of within-cluster sum of squares, one component per cluster
  • tot.withinss : Total within-cluster sum of squares, i.e. sum(withinss)
  • betweenss : The between-cluster sum of squares, i.e. \(totss - tot.withinss\)
  • size : The number of observations in each cluster

These components can be accessed as follows:
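For example, assuming the result object km.res from above:

```r
km.res$size          # number of observations per cluster
km.res$centers       # cluster centers (on the scaled data)
km.res$cluster       # cluster assignment of each observation
km.res$tot.withinss  # total within-cluster sum of squares
```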

It is a good idea to plot the cluster results. These can be used to assess the choice of the number of clusters as well as comparing two different cluster analyses.

Now, we want to visualize the data in a scatter plot with coloring each data point according to its cluster assignment.

The problem is that the data contains more than 2 variables and the question is what variables to choose for the xy scatter plot.

A solution is to reduce the number of dimensions by applying a dimensionality reduction algorithm, such as Principal Component Analysis (PCA) , that operates on the four variables and outputs two new variables (that represent the original variables) that you can use to do the plot.

In other words, if we have a multi-dimensional data set, a solution is to perform Principal Component Analysis (PCA) and to plot data points according to the first two principal components coordinates.

The function fviz_cluster() [factoextra package] can be used to easily visualize k-means clusters. It takes the k-means result and the original data as arguments. In the resulting plot, observations are represented by points, using principal components if the number of variables is greater than 2. It’s also possible to draw a concentration ellipse around each cluster.
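A minimal sketch, again assuming the k-means result km.res and the scaled data df:

```r
fviz_cluster(km.res, data = df)
```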

[Figure: fviz_cluster() plot of the 4 clusters on the first two principal components, with concentration ellipses]

K-means clustering is a very simple and fast algorithm that can efficiently deal with very large data sets. However, there are some weaknesses, including:

  • It assumes prior knowledge of the data and requires the analyst to choose the appropriate number of clusters (k) in advance
  • The final results obtained are sensitive to the initial random selection of cluster centers. Why is this a problem? Because for every different run of the algorithm on the same data set, you may choose a different set of initial centers. This may lead to different clustering results on different runs of the algorithm.
  • It’s sensitive to outliers.
  • If you rearrange your data, it’s very possible that you’ll get a different solution every time you change the ordering of your data.

Possible solutions to these weaknesses include:

  • Solution to issue 1: Compute k-means for a range of k values, for example by varying k between 2 and 10. Then, choose the best k by comparing the clustering results obtained for the different k values.
  • Solution to issue 2: Compute K-means algorithm several times with different initial cluster centers. The run with the lowest total within-cluster sum of square is selected as the final clustering solution.
  • To avoid distortions caused by excessive outliers, it’s possible to use PAM algorithm, which is less sensitive to outliers.

A robust alternative to k-means is PAM, which is based on medoids. As discussed in the next chapter, PAM clustering can be computed using the function pam() [cluster package]. The function pamk() [fpc package] is a wrapper for PAM that also prints the suggested number of clusters based on the optimum average silhouette width.

K-means clustering can be used to classify observations into k groups, based on their similarity. Each group is represented by the mean value of points in the group, known as the cluster centroid.

The k-means algorithm requires users to specify the number of clusters to generate. The R function kmeans() [stats package] can be used to compute the k-means algorithm. The simplified format is kmeans(x, centers), where “x” is the data and centers is the number of clusters to be produced.

After computing k-means clustering, the R function fviz_cluster() [factoextra package] can be used to visualize the results. The format is fviz_cluster(km.res, data), where km.res is the k-means result and data corresponds to the original data set.

Hartigan, JA, and MA Wong. 1979. “Algorithm AS 136: A K-means clustering algorithm.” Applied Statistics . Royal Statistical Society, 100–108.

MacQueen, J. 1967. “Some Methods for Classification and Analysis of Multivariate Observations.” In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics , 281–97. Berkeley, Calif.: University of California Press. http://projecteuclid.org:443/euclid.bsmsp/1200512992 .


UC Business Analytics R Programming Guide

K-means Cluster Analysis


This tutorial serves as an introduction to the k-means clustering method.

  • Replication Requirements : What you’ll need to reproduce the analysis in this tutorial
  • Data Preparation : Preparing our data for cluster analysis
  • Clustering Distance Measures : Understanding how to measure differences in observations
  • K-Means Clustering : Calculations and methods for creating K subgroups of the data
  • Determining Optimal Clusters : Identifying the right number of clusters to group your data

Replication Requirements

To replicate this tutorial’s analysis you will need to load the following packages:
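The sketches below assume the following packages are loaded (gridExtra is only needed for the side-by-side cluster plots shown later):

```r
library(tidyverse)   # data manipulation and piping
library(cluster)     # clustering algorithms (silhouette, clusGap, ...)
library(factoextra)  # clustering visualization helpers
library(gridExtra)   # arranging several ggplots side by side
```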

Data Preparation

To perform a cluster analysis in R, generally, the data should be prepared as follows:

  • Rows are observations (individuals) and columns are variables
  • Any missing value in the data must be removed or estimated.
  • The data must be standardized (i.e., scaled) to make variables comparable. Recall that, standardization consists of transforming the variables such that they have mean zero and standard deviation one. 1

Here, we’ll use the built-in R data set USArrests, which contains statistics on arrests per 100,000 residents for assault, murder, and rape in each of the 50 US states in 1973. It also includes the percent of the population living in urban areas.

To remove any missing value that might be present in the data, type this:
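A sketch, storing the working copy in df (the name is ours):

```r
df <- na.omit(USArrests)  # drop rows with missing values, if any
```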

As we don’t want the clustering algorithm to depend on an arbitrary variable unit, we start by scaling/standardizing the data using the R function scale:
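A sketch of the standardization step:

```r
df <- scale(df)  # each variable now has mean 0 and standard deviation 1
head(df)
```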

Clustering Distance Measures

The classification of observations into groups requires some method for computing the distance or the (dis)similarity between each pair of observations. The result of this computation is known as a dissimilarity or distance matrix. There are many methods to calculate this distance information; the choice of distance measure is a critical step in clustering. It defines how the similarity of two elements (x, y) is calculated, and it will influence the shape of the clusters. The classical methods for distance measures are Euclidean and Manhattan distances, which are defined as follows:

Euclidean distance:

\[ d_{euc}(x, y) = \sqrt{\sum\limits_{i=1}^{n} (x_i - y_i)^2} \]

Manhattan distance:

\[ d_{man}(x, y) = \sum\limits_{i=1}^{n} |x_i - y_i| \]

where x and y are two vectors of length n.

Other dissimilarity measures exist, such as correlation-based distances, which are widely used for gene expression data analyses. Correlation-based distance is defined by subtracting the correlation coefficient from 1. Different types of correlation methods can be used, such as:

Pearson correlation distance:

\[ d_{cor}(x, y) = 1 - \frac{\sum\limits_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum\limits_{i=1}^{n} (x_i - \bar{x})^2 \sum\limits_{i=1}^{n} (y_i - \bar{y})^2}} \]

Spearman correlation distance, which computes the correlation between the rank of x and the rank of y:

\[ d_{spear}(x, y) = 1 - \frac{\sum\limits_{i=1}^{n} (x'_i - \bar{x}')(y'_i - \bar{y}')}{\sqrt{\sum\limits_{i=1}^{n} (x'_i - \bar{x}')^2 \sum\limits_{i=1}^{n} (y'_i - \bar{y}')^2}} \]

where \(x'_i = rank(x_i)\) and \(y'_i = rank(y_i)\).

Kendall correlation distance, defined from the numbers of concordant pairs \(n_c\) and discordant pairs \(n_d\):

\[ d_{kend}(x, y) = 1 - \frac{n_c - n_d}{\frac{1}{2} n (n - 1)} \]

The choice of distance measures is very important, as it has a strong influence on the clustering results. For most common clustering software, the default distance measure is the Euclidean distance. However, depending on the type of the data and the research questions, other dissimilarity measures might be preferred and you should be aware of the options.

Within R it is simple to compute and visualize the distance matrix using the functions get_dist and fviz_dist from the factoextra R package. This starts to illustrate which states have large dissimilarities (red) versus those that appear to be fairly similar (teal).

  • get_dist : for computing a distance matrix between the rows of a data matrix. The default distance computed is the Euclidean; however, get_dist also supports distanced described in equations 2-5 above plus others.
  • fviz_dist : for visualizing a distance matrix
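A sketch of computing and plotting the distance matrix; the gradient colors are illustrative choices, not requirements:

```r
distance <- get_dist(df)  # Euclidean distance by default
fviz_dist(distance,
          gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))
```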

[Figure: distance matrix heatmap of the US states produced with fviz_dist(); red indicates large dissimilarity, teal indicates similarity]

K-Means Clustering

K-means clustering is the most commonly used unsupervised machine learning algorithm for partitioning a given data set into a set of k groups (i.e. k clusters), where k represents the number of groups pre-specified by the analyst. It classifies objects in multiple groups (i.e., clusters), such that objects within the same cluster are as similar as possible (i.e., high intra-class similarity), whereas objects from different clusters are as dissimilar as possible (i.e., low inter-class similarity). In k-means clustering, each cluster is represented by its center (i.e, centroid) which corresponds to the mean of points assigned to the cluster.

The Basic Idea

The basic idea behind k-means clustering consists of defining clusters so that the total intra-cluster variation (known as total within-cluster variation) is minimized. There are several k-means algorithms available. The standard algorithm is the Hartigan-Wong algorithm (1979), which defines the total within-cluster variation as the sum of squared Euclidean distances between items and the corresponding centroid:

\[ W(C_k) = \sum\limits_{x_i \in C_k} (x_i - \mu_k)^2 \]

where \(x_i\) is a data point belonging to the cluster \(C_k\) and \(\mu_k\) is the mean value of the points assigned to the cluster \(C_k\).

We define the total within-cluster variation as follows:

\[ tot.withinss = \sum\limits_{k=1}^{K} W(C_k) = \sum\limits_{k=1}^{K} \sum\limits_{x_i \in C_k} (x_i - \mu_k)^2 \]

The total within-cluster sum of squares measures the compactness (i.e., goodness) of the clustering, and we want it to be as small as possible.

K-means Algorithm

The first step when using k-means clustering is to indicate the number of clusters (k) that will be generated in the final solution. The algorithm starts by randomly selecting k objects from the data set to serve as the initial centers for the clusters. The selected objects are also known as cluster means or centroids. Next, each of the remaining objects is assigned to its closest centroid, where closest is defined using the Euclidean distance (Eq. 1) between the object and the cluster mean. This step is called the “cluster assignment step”. After the assignment step, the algorithm computes the new mean value of each cluster. The term “centroid update” is used to describe this step. Now that the centers have been recalculated, every observation is checked again to see if it might be closer to a different cluster. All the objects are reassigned using the updated cluster means. The cluster assignment and centroid update steps are iteratively repeated until the cluster assignments stop changing (i.e., until convergence is achieved). That is, the clusters formed in the current iteration are the same as those obtained in the previous iteration.

The k-means algorithm can be summarized as follows:

  • Specify the number of clusters (K) to be created (by the analyst)
  • Randomly select k objects from the data set as the initial cluster centers or means
  • Assign each observation to its closest centroid, based on the Euclidean distance between the object and the centroid
  • For each of the k clusters, update the cluster centroid by calculating the new mean values of all the data points in the cluster. The centroid of the kth cluster is a vector of length p containing the means of all variables for the observations in the kth cluster; p is the number of variables.
  • Iteratively minimize the total within sum of squares (Eq. 7). That is, iterate steps 3 and 4 until the cluster assignments stop changing or the maximum number of iterations is reached. R uses 10 as the default maximum number of iterations.

Computing k-means clustering in R

We can compute k-means in R with the kmeans function. Here we will group the data into two clusters ( centers = 2 ). The kmeans function also has an nstart option that attempts multiple initial configurations and reports the best one. For example, adding nstart = 25 will generate 25 initial configurations. This approach is often recommended.
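A sketch, storing the result in k2 (the name is ours):

```r
k2 <- kmeans(df, centers = 2, nstart = 25)
str(k2)  # inspect the components of the result
```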

The output of kmeans is a list with several bits of information. The most important being:

  • cluster : A vector of integers (from 1:k) indicating the cluster to which each point is allocated.
  • centers : A matrix of cluster centers.
  • totss : The total sum of squares.
  • withinss : Vector of within-cluster sum of squares, one component per cluster.
  • tot.withinss : Total within-cluster sum of squares, i.e. sum(withinss).
  • betweenss : The between-cluster sum of squares, i.e. totss - tot.withinss.
  • size : The number of points in each cluster.

If we print the results we’ll see that our groupings resulted in 2 cluster sizes of 30 and 20. We see the cluster centers (means) for the two groups across the four variables ( Murder, Assault, UrbanPop, Rape ). We also get the cluster assignment for each observation (i.e. Alabama was assigned to cluster 2, Arkansas was assigned to cluster 1, etc.).

We can also view our results by using fviz_cluster . This provides a nice illustration of the clusters. If there are more than two dimensions (variables) fviz_cluster will perform principal component analysis (PCA) and plot the data points according to the first two principal components that explain the majority of the variance.
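A minimal sketch, assuming the 2-cluster result k2 and the scaled data df from above:

```r
fviz_cluster(k2, data = df)
```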

[Figure: fviz_cluster() plot of the 2-cluster solution on the first two principal components]

Alternatively, you can use standard pairwise scatter plots to illustrate the clusters compared to the original variables.
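One simple way to do this in base R (a sketch; the original tutorial may use a different plotting approach):

```r
# Pairwise scatter plots of the original variables, colored by cluster
pairs(USArrests, col = k2$cluster, pch = 19)
```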

[Figure: pairwise scatter plots of the original variables, colored by cluster assignment]

Because the number of clusters (k) must be set before we start the algorithm, it is often advantageous to use several different values of k and examine the differences in the results. We can execute the same process for 3, 4, and 5 clusters, and the results are shown in the figure:
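A sketch of that comparison, assuming the objects and packages introduced above; geom = "point" suppresses the state labels so the small panels stay readable:

```r
k3 <- kmeans(df, centers = 3, nstart = 25)
k4 <- kmeans(df, centers = 4, nstart = 25)
k5 <- kmeans(df, centers = 5, nstart = 25)

# one plot per solution
p1 <- fviz_cluster(k2, geom = "point", data = df) + ggtitle("k = 2")
p2 <- fviz_cluster(k3, geom = "point", data = df) + ggtitle("k = 3")
p3 <- fviz_cluster(k4, geom = "point", data = df) + ggtitle("k = 4")
p4 <- fviz_cluster(k5, geom = "point", data = df) + ggtitle("k = 5")

grid.arrange(p1, p2, p3, p4, nrow = 2)
```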

[Figure: fviz_cluster() plots comparing the k = 2, 3, 4, and 5 solutions]

Although this visual assessment tells us where true delineations occur (or do not occur, such as clusters 2 & 4 in the k = 5 graph) between clusters, it does not tell us what the optimal number of clusters is.

Determining Optimal Clusters

As you may recall, the analyst specifies the number of clusters to use; preferably the analyst would like to use the optimal number of clusters. To aid the analyst, the following explains the three most popular methods for determining the optimal clusters, which include:

  • Elbow method
  • Silhouette method
  • Gap statistic

Elbow Method

Recall that the basic idea behind cluster partitioning methods, such as k-means clustering, is to define clusters such that the total intra-cluster variation (known as total within-cluster variation or total within-cluster sum of squares) is minimized. The elbow method uses this quantity as follows:

  • Compute clustering algorithm (e.g., k-means clustering) for different values of k . For instance, by varying k from 1 to 10 clusters
  • For each k , calculate the total within-cluster sum of square (wss)
  • Plot the curve of wss according to the number of clusters k .
  • The location of a bend (knee) in the plot is generally considered as an indicator of the appropriate number of clusters.

We can implement this in R with the following code. The results suggest that 4 is the optimal number of clusters as it appears to be the bend in the knee (or elbow).
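A sketch of that computation, assuming the scaled data df; the helper function wss and the range 1 to 10 are illustrative:

```r
set.seed(123)

# total within-cluster sum of squares for a given k
wss <- function(k) {
  kmeans(df, k, nstart = 10)$tot.withinss
}

k.values <- 1:10
wss_values <- sapply(k.values, wss)

plot(k.values, wss_values,
     type = "b", pch = 19, frame = FALSE,
     xlab = "Number of clusters K",
     ylab = "Total within-clusters sum of squares")
```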

[Figure: elbow plot of total within-cluster sum of squares versus number of clusters k]

Fortunately, this process to compute the “Elbow method” has been wrapped up in a single function ( fviz_nbclust ):
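A sketch of the one-liner:

```r
set.seed(123)
fviz_nbclust(df, kmeans, method = "wss")
```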

[Figure: elbow plot produced with fviz_nbclust(..., method = "wss")]

Average Silhouette Method

In short, the average silhouette approach measures the quality of a clustering. That is, it determines how well each object lies within its cluster. A high average silhouette width indicates a good clustering. The average silhouette method computes the average silhouette of observations for different values of k . The optimal number of clusters k is the one that maximizes the average silhouette over a range of possible values for k . 2

We can use the silhouette function in the cluster package to compute the average silhouette width. The following code computes this approach for 2 to 15 clusters. The results show that 2 clusters maximize the average silhouette value, with 4 clusters coming in as the second optimal number of clusters.
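A sketch of that computation; silhouette() comes from the cluster package, the helper name avg_sil is ours, and the silhouette is only defined for 2 or more clusters:

```r
# average silhouette width for a given number of clusters k
avg_sil <- function(k) {
  km.res <- kmeans(df, centers = k, nstart = 25)
  ss <- silhouette(km.res$cluster, dist(df))
  mean(ss[, 3])  # the third column holds the silhouette widths
}

k.values <- 2:15
avg_sil_values <- sapply(k.values, avg_sil)

plot(k.values, avg_sil_values,
     type = "b", pch = 19, frame = FALSE,
     xlab = "Number of clusters K",
     ylab = "Average silhouette width")
```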

[Figure: average silhouette width versus number of clusters k; the maximum occurs at k = 2]

Similar to the elbow method, this process to compute the “average silhouette method” has been wrapped up in a single function ( fviz_nbclust ):
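A sketch:

```r
fviz_nbclust(df, kmeans, method = "silhouette")
```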

[Figure: average silhouette plot produced with fviz_nbclust(..., method = "silhouette")]

Gap Statistic Method

For the observed data and the reference data, the total intra-cluster variation is computed using different values of k. The gap statistic for a given k is defined as follows:

\[ Gap_{n}(k) = E^{*}_{n}\{\log(W_k)\} - \log(W_k) \]

where \(E^{*}_{n}\) denotes the expectation under a sample of size n from the reference (null) distribution and \(W_k\) is the total within-cluster variation for k clusters.

In short, the algorithm involves the following steps: cluster the observed data for each candidate number of clusters k and compute \(W_k\); generate B reference data sets from a null (e.g., uniform) distribution and compute their within-cluster dispersions; estimate the gap statistic as the difference between the mean log reference dispersion and \(\log(W_k)\), together with its standard error; and choose the smallest k for which the gap statistic is within one standard error of the gap at k + 1.

To compute the gap statistic method we can use the clusGap function which provides the gap statistic and standard error for an output.
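A sketch of the call; the choices of K.max = 10, B = 50 reference data sets, and the seed are illustrative:

```r
set.seed(123)
gap_stat <- clusGap(df, FUN = kmeans, nstart = 25, K.max = 10, B = 50)
print(gap_stat, method = "firstmax")
```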

We can visualize the results with fviz_gap_stat which suggests four clusters as the optimal number of clusters.
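A sketch, assuming the gap_stat object from above:

```r
fviz_gap_stat(gap_stat)
```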

[Figure: gap statistic versus number of clusters, produced with fviz_gap_stat(); the suggested optimum is k = 4]

In addition to these commonly used approaches, the NbClust package, published by Charrad et al., 2014 , provides 30 indices for determining the relevant number of clusters and proposes to users the best clustering scheme from the different results obtained by varying all combinations of number of clusters, distance measures, and clustering methods.

Extracting Results

With most of these approaches suggesting 4 as the number of optimal clusters, we can perform the final analysis and extract the results using 4 clusters.

We can visualize the results using fviz_cluster :
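A sketch of the final fit and its visualization, storing the result in final (the name is ours):

```r
set.seed(123)
final <- kmeans(df, centers = 4, nstart = 25)
fviz_cluster(final, data = df)
```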

And we can extract the clusters and add them to our initial data to do some descriptive statistics at the cluster level:
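A sketch using dplyr (loaded with the tidyverse above); aggregate() from base R would work equally well:

```r
USArrests %>%
  mutate(Cluster = final$cluster) %>%   # attach the cluster assignments
  group_by(Cluster) %>%
  summarise_all("mean")                 # cluster-wise means of the original variables
```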

Additional Comments

K-means clustering is a very simple and fast algorithm. Furthermore, it can efficiently deal with very large data sets. However, there are some weaknesses of the k-means approach.

One potential disadvantage of K-means clustering is that it requires us to pre-specify the number of clusters. Hierarchical clustering is an alternative approach which does not require that we commit to a particular choice of clusters. Hierarchical clustering has an added advantage over K-means clustering in that it results in an attractive tree-based representation of the observations, called a dendrogram. A future tutorial will illustrate the hierarchical clustering approach.

An additional disadvantage of K-means is that it’s sensitive to outliers, and different results can occur if you change the ordering of your data. The Partitioning Around Medoids (PAM) clustering approach is less sensitive to outliers and provides a robust alternative to k-means in these situations. A future tutorial will illustrate the PAM clustering approach.

For now, you can learn more about clustering methods with:

  • An Introduction to Statistical Learning
  • Applied Predictive Modeling
  • Elements of Statistical Learning
  • A Practical Guide to Cluster Analysis in R

Standardization makes the four distance measure methods - Euclidean, Manhattan, Correlation and Eisen - more similar than they would be with non-transformed data.

Kaufman and Rousseeuw, 1990
