Lab 9: Clustering

Objective

To implement the k-means clustering algorithm using basic Python.
To perform k-means clustering with the scikit-learn Python library.

Note

Suggestion: use the numpy library to simplify the array operations.

k-means clustering using basic Python

The following code structure will be used for this section:

# import libraries
...

# functions
## function to take input of data and number of clusters, return centroids and other data
def get_random_centroids(data_points, n_centroids=2):
  pass

## function to group data according to centroids
def group_to_centroids(data_points, centroids):
  pass

## function to calculate centroids from grouped data
def find_centroids(data_points, groups):
  pass

# generate dataset
...

# identify initial centroids
...

# repeat until centroids stabilise
while ...:
  ## group data to centroids
  ...
  ## update centroids
  ...

print('terminated')

Dataset

The code in this subsection will populate the following part of the code structure:
```
# generate dataset
...
```

We will create a dataset of 200 samples with 2 input features and 4 clusters.

from sklearn.datasets import make_blobs
data = make_blobs(n_samples=200, n_features=2, centers=4, cluster_std=1.6, random_state=50)

data is a tuple of two elements. The first element contains the data points, and the second element contains the cluster index of each point.
```
points = data[0]
```
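As a quick check, and assuming the make_blobs call above, the two elements can be unpacked and their shapes inspected. The name labels is introduced here only for illustration and is not used elsewhere in the lab.

```python
from sklearn.datasets import make_blobs

data = make_blobs(n_samples=200, n_features=2, centers=4, cluster_std=1.6, random_state=50)

# make_blobs returns a tuple: (samples, cluster index of each sample)
points, labels = data  # "labels" is an illustrative name

print(points.shape)  # (200, 2): 200 samples, 2 features each
print(labels.shape)  # (200,): one cluster index per sample
```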

Plot the data on a scatter graph with colour representing the cluster of the points.

import matplotlib.pyplot as plt
plt.scatter(data[0][:,0], data[0][:,1], c=data[1])

Initialisation

Initialising the k-means clustering algorithm involves selecting k random points from the dataset as the initial centroids.
The get_random_centroids function takes the dataset and the number of centroids to be selected as its input arguments.
In the body of the function, randomly select the centroids from the dataset.
The function should return two outputs: the randomly selected centroids and the dataset without those points.

Update your code with the following snippet:

# identify initial centroids
centroids, others = get_random_centroids(points, 4)
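One possible way to write get_random_centroids is sketched below, assuming the data points are held in a 2-D numpy array with one row per point. The seed argument is a hypothetical addition for reproducibility and is not part of the lab's function signature; treat this as a starting point rather than the expected solution.

```python
import numpy as np

def get_random_centroids(data_points, n_centroids=2, seed=None):
    """Pick n_centroids distinct rows at random; return them and the remaining rows."""
    data_points = np.asarray(data_points)
    rng = np.random.default_rng(seed)  # seed is a hypothetical extra argument

    # choose distinct row indices so that no point is selected twice
    chosen = rng.choice(len(data_points), size=n_centroids, replace=False)

    # boolean mask that is False for the chosen rows and True for the rest
    mask = np.ones(len(data_points), dtype=bool)
    mask[chosen] = False

    return data_points[chosen], data_points[mask]

# small illustrative dataset: 10 points with 2 features
demo_points = np.arange(20, dtype=float).reshape(10, 2)
demo_centroids, demo_others = get_random_centroids(demo_points, 3, seed=0)
print(demo_centroids.shape, demo_others.shape)  # (3, 2) (7, 2)
```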

Cluster the points

The group_to_centroids function takes two inputs: the data points to be clustered and the centroids to cluster them to.
In the group_to_centroids function, calculate the distance from every point to each centroid.
Then assign each point to its nearest centroid.
The function should return, for each point, the index of the centroid it is clustered to.

Update your code with the following snippet:

## group data to centroids
groups = group_to_centroids(others, centroids)
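The distance calculation and assignment described above could be sketched with numpy broadcasting as follows. Euclidean distance is assumed here, since the lab does not name a distance measure, and the demo_ arrays are made up for illustration.

```python
import numpy as np

def group_to_centroids(data_points, centroids):
    """Return, for each point, the index of its nearest centroid (Euclidean distance)."""
    data_points = np.asarray(data_points, dtype=float)
    centroids = np.asarray(centroids, dtype=float)

    # pairwise differences, shape (n_points, n_centroids, n_features)
    diff = data_points[:, np.newaxis, :] - centroids[np.newaxis, :, :]

    # Euclidean distance from every point to every centroid, shape (n_points, n_centroids)
    distances = np.linalg.norm(diff, axis=2)

    # index of the closest centroid for each point
    return np.argmin(distances, axis=1)

demo_points = np.array([[0.0, 0.0], [0.5, 0.2], [9.0, 9.0], [10.0, 10.5]])
demo_centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
demo_groups = group_to_centroids(demo_points, demo_centroids)
print(demo_groups)  # [0 0 1 1]
```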

Update the centroids

The find_centroids function takes two inputs: the data points and the index of the centroid each point is clustered to (i.e. the output of group_to_centroids).
The find_centroids function calculates and returns the new set of centroids, where each new centroid is the mean of the data points assigned to it.

Update your code with the following snippet:

## update centroids
centroids = find_centroids(others, groups)
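A minimal sketch of find_centroids is given below. It assumes every cluster has at least one point; a cluster that receives no points is simply skipped, so fewer centroids would be returned in that case. The demo_ arrays are illustrative only.

```python
import numpy as np

def find_centroids(data_points, groups):
    """Return the mean of the points in each group as the new centroids."""
    data_points = np.asarray(data_points, dtype=float)
    groups = np.asarray(groups)

    # one centroid per label present in groups (empty clusters are skipped)
    labels = np.unique(groups)
    return np.array([data_points[groups == label].mean(axis=0) for label in labels])

demo_points = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [10.0, 12.0]])
demo_groups = np.array([0, 0, 1, 1])
demo_new = find_centroids(demo_points, demo_groups)
print(demo_new)  # centroids at (1, 0) and (10, 11)
```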

Termination condition

The clustering algorithm should terminate when the centroids stabilise, i.e. when the change in every centroid's position between two successive iterations falls below a small tolerance.
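One way to test for stability is to compare the centroids before and after an update. The helper below is a sketch only: the name centroids_stable and the default tolerance of 1e-4 are assumptions made for illustration, and both arrays are assumed to have the same shape.

```python
import numpy as np

def centroids_stable(old_centroids, new_centroids, tol=1e-4):
    """Return True if no centroid moved by tol or more (hypothetical helper)."""
    old = np.asarray(old_centroids, dtype=float)
    new = np.asarray(new_centroids, dtype=float)

    # Euclidean distance each centroid moved during the update
    shift = np.linalg.norm(new - old, axis=1)
    return bool(np.all(shift < tol))

old = np.array([[0.0, 0.0], [1.0, 1.0]])
print(centroids_stable(old, old + 1e-6))  # True: movement is below the tolerance
print(centroids_stable(old, old + 0.5))   # False: centroids are still moving
```

Keeping a copy of the centroids before each update makes this comparison possible inside the loop.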

Visualisation

Update your code according to the following sample to visualise the centroids and clusters at every iteration.

fig = plt.figure()
ax = fig.add_subplot(111)
# repeat until centroids stabilise
while ...:
  ## group data to centroids
  groups = group_to_centroids(others, centroids)

  ## update centroids
  centroids = find_centroids(others, groups)

  ## visualise current clusters and centroids
  ax.clear()
  ax.scatter(others[:,0], others[:,1], c=groups)
  ax.scatter(centroids[:,0], centroids[:,1], marker='*', c='k')

  ## pause for one second
  plt.pause(1)

Are Figure 1 (originally generated clusters) and Figure 2 (calculated clusters) identical?

k-means clustering using `scikit-learn` Python library

The following code structure will be used for this section:

# import libraries
...

# generate dataset
...

# initialise the k-means model
...

# train the k-means model
...

# identify the cluster of each point
...

# visualise the result
...

Dataset

We will use the exact same data generation as the previous section.

k-means model

The k-means model is provided by sklearn.cluster.KMeans.
```
from sklearn.cluster import KMeans
```
Initialise the k-means model with 4 clusters using KMeans. (Check the documentation for the usage of KMeans.)
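As a hedged example, the model could be created as shown below. Only n_clusters=4 follows from the lab; the n_init and random_state values are illustrative choices that make repeated runs reproducible.

```python
from sklearn.cluster import KMeans

# n_clusters comes from the lab; n_init and random_state are illustrative choices
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
print(kmeans.n_clusters)  # 4
```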

Training

The initialised k-means model needs to be trained with the data.
The training is executed using sklearn.cluster.KMeans.fit. (Identify the input argument(s).)
```
kmeans.fit(...)
```

Cluster the points

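The points can be assigned to their respective clusters using sklearn.cluster.KMeans.fit_predict, which fits the model and returns the cluster index of each point in a single call. To label points with an already trained model without refitting, sklearn.cluster.KMeans.predict can be used instead. (Identify the input argument(s).)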
```
y_km = kmeans.fit_predict(...)
```
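The difference between fit_predict and predict can be seen in the following sketch, which regenerates the dataset so that it runs on its own. The n_init and random_state values and the name y_again are assumptions made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

points, _ = make_blobs(n_samples=200, n_features=2, centers=4, cluster_std=1.6, random_state=50)

# n_init and random_state are illustrative choices, not prescribed by the lab
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)

# fit_predict fits the model and returns one cluster index per point
y_km = kmeans.fit_predict(points)

# predict reuses the fitted centroids without refitting the model
y_again = kmeans.predict(points)

print(y_km.shape)                     # (200,)
print(kmeans.cluster_centers_.shape)  # (4, 2)
```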

Visualisation

The visualisation of the result can be achieved with the following code:

plt.figure()
plt.scatter(points[:,0], points[:,1], c=y_km)
plt.scatter(kmeans.cluster_centers_[:,0], kmeans.cluster_centers_[:,1], c='k')

Task: Compare the results of the two methods. Are they similar?
