Lab 7: k Nearest Neighbours (KNN)
Objective
- To perform the k nearest neighbours algorithm with the scikit-learn Python library for classification and regression.
Datasets
- The iris dataset will be used for classification, and the diabetes dataset for regression.

```python
from sklearn import datasets
import pandas as pd

iris = datasets.load_iris()
iris = {
    'attributes': pd.DataFrame(iris.data, columns=iris.feature_names),
    'target': pd.DataFrame(iris.target, columns=['species']),
    'targetNames': iris.target_names
}

diabetes = datasets.load_diabetes()
diabetes = {
    'attributes': pd.DataFrame(diabetes.data, columns=diabetes.feature_names),
    'target': pd.DataFrame(diabetes.target, columns=['diseaseProgression'])
}
```
- Split each dataset into an 80-20 train-test proportion.

```python
from sklearn.model_selection import train_test_split

for dt in [iris, diabetes]:
    x_train, x_test, y_train, y_test = train_test_split(
        dt['attributes'], dt['target'], test_size=0.2, random_state=1)
    dt['train'] = {'attributes': x_train, 'target': y_train}
    dt['test'] = {'attributes': x_test, 'target': y_test}
```
Note: `random_state` is used to reproduce the same "random" split of the data whenever the function is called. To produce a different random split every time the function is called, remove the `random_state` argument.
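The reproducibility can be checked with a small self-contained sketch on toy data (not the lab datasets): two calls with the same `random_state` produce identical splits.

```python
# Toy demonstration: a fixed random_state reproduces the same split.
from sklearn.model_selection import train_test_split

data = list(range(10))
a_train, a_test = train_test_split(data, test_size=0.2, random_state=1)
b_train, b_test = train_test_split(data, test_size=0.2, random_state=1)
print(a_test == b_test)  # True: the two calls produce identical splits
```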
Task: How do we access the training input data for the iris dataset?
KNN
The KNN algorithms are provided by the scikit-learn Python library as the class sklearn.neighbors.KNeighborsClassifier for classification, and sklearn.neighbors.KNeighborsRegressor for regression.
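Both classes share the same estimator interface: construct with a chosen k, train with `.fit`, and query with `.predict`. A minimal sketch on toy one-dimensional data (assumed values, only to illustrate the shared API):

```python
# Toy data: both estimators are trained and queried the same way.
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

X = [[0], [1], [2], [3]]
clf = KNeighborsClassifier(n_neighbors=3).fit(X, [0, 0, 1, 1])
reg = KNeighborsRegressor(n_neighbors=3).fit(X, [0.0, 0.0, 1.0, 1.0])
print(clf.predict([[2.6]]))  # majority class among the 3 nearest points
print(reg.predict([[2.6]]))  # mean target of the 3 nearest points
```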
Classification
- Import the class for the KNN classifier.

```python
from sklearn.neighbors import KNeighborsClassifier
```
- Instantiate an object of the KNeighborsClassifier class with k = 5.

```python
knc = KNeighborsClassifier(n_neighbors=5)
```
- Train the classifier with the training data. We will use sepal length (cm) and sepal width (cm) (the first and second columns) as the attributes for now.

```python
input_columns = iris['attributes'].columns[:2].tolist()
x_train = iris['train']['attributes'][input_columns]
y_train = iris['train']['target'].species
knc.fit(x_train, y_train)
```
- The .predict function is used to predict the species of the testing data.

```python
x_test = iris['test']['attributes'][input_columns]
y_test = iris['test']['target'].species
y_predict = knc.predict(x_test)
```
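Under the hood, each prediction is a vote among the k closest training points, and the `.kneighbors` method exposes them. A self-contained toy sketch (assumed data, not the iris split):

```python
# Toy data: .kneighbors returns, for each query point, the distances
# to and the row indices of its k nearest training points.
from sklearn.neighbors import KNeighborsClassifier

X = [[0.0], [1.0], [2.0], [10.0]]
y = [0, 0, 1, 1]
demo = KNeighborsClassifier(n_neighbors=2).fit(X, y)
distances, indices = demo.kneighbors([[1.2]])
print(indices[0])    # rows 1 and 2 are closest to the query 1.2
print(distances[0])  # their distances, roughly 0.2 and 0.8
```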
- Compare the predicted values with the target values of the test data.

```python
print(pd.DataFrame(list(zip(y_test, y_predict)), columns=['target', 'predicted']))
```
- Calculate the accuracy of the prediction.

```python
print(f'Accuracy: {knc.score(x_test, y_test):.4f}')
```
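For a classifier, `.score` is simply the fraction of correctly predicted samples. A toy check on assumed data (not the iris split):

```python
# Toy data: .score of a classifier equals the manually computed accuracy.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = [[0], [1], [2], [3], [4], [5]]
y = [0, 0, 0, 1, 1, 1]
demo = KNeighborsClassifier(n_neighbors=1).fit(X, y)
X_eval, y_eval = [[0.4], [2.6], [4.9]], [0, 1, 1]
manual = np.mean(demo.predict(X_eval) == y_eval)
print(manual == demo.score(X_eval, y_eval))  # True
```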
Visualisation
- Import the matplotlib.pyplot library and the colormaps from the matplotlib library.

```python
import matplotlib.pyplot as plt
from matplotlib import cm
from matplotlib.colors import ListedColormap
```
- Prepare the colormaps.

```python
colormap = cm.get_cmap('tab20')
cm_dark = ListedColormap(colormap.colors[::2])    # darker shades, used for the data points
cm_light = ListedColormap(colormap.colors[1::2])  # lighter shades, used for the decision regions
```
- Calculate the decision boundaries by predicting the class over a grid of points covering the attribute space.

```python
import numpy as np

x_min = iris['attributes'][input_columns[0]].min()
x_max = iris['attributes'][input_columns[0]].max()
x_range = x_max - x_min
x_min = x_min - 0.1 * x_range
x_max = x_max + 0.1 * x_range

y_min = iris['attributes'][input_columns[1]].min()
y_max = iris['attributes'][input_columns[1]].max()
y_range = y_max - y_min
y_min = y_min - 0.1 * y_range
y_max = y_max + 0.1 * y_range

xx, yy = np.meshgrid(np.arange(x_min, x_max, .01 * x_range),
                     np.arange(y_min, y_max, .01 * y_range))
z = knc.predict(list(zip(xx.ravel(), yy.ravel())))
z = z.reshape(xx.shape)
```
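The grid construction can be seen on a tiny example: np.meshgrid builds the coordinate matrices, .ravel() flattens them, and zip pairs them into (x, y) points that are fed to the classifier.

```python
# Tiny grid: 2 x-values by 3 y-values gives 6 (x, y) points.
import numpy as np

gx, gy = np.meshgrid(np.arange(0, 2), np.arange(0, 3))
points = [(int(a), int(b)) for a, b in zip(gx.ravel(), gy.ravel())]
print(points)  # [(0, 0), (1, 0), (0, 1), (1, 1), (0, 2), (1, 2)]
```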
- Plot the decision boundary.

```python
plt.figure(figsize=[12, 8])
plt.pcolormesh(xx, yy, z, cmap=cm_light)
```
- Plot the training and testing data.

```python
plt.scatter(x_train[input_columns[0]], x_train[input_columns[1]], c=y_train,
            label='Training data', cmap=cm_dark,
            edgecolor='black', linewidth=1, s=150)
plt.scatter(x_test[input_columns[0]], x_test[input_columns[1]], c=y_test,
            marker='*', label='Testing data', cmap=cm_dark,
            edgecolor='black', linewidth=1, s=150)
plt.xlabel(input_columns[0])
plt.ylabel(input_columns[1])
plt.legend()
```
Task: Create a loop to compare the accuracy of the prediction at different values of k. The comparison should be shown in a graph with k on the horizontal axis and accuracy on the vertical axis.
Regression
- Import the class for the KNN regressor.

```python
from sklearn.neighbors import KNeighborsRegressor
```
- Instantiate an object of the KNeighborsRegressor class with k = 5.

```python
knr = KNeighborsRegressor(n_neighbors=5)
```
- Train the regressor with the training data. We will use age and bmi as the attributes for now.

```python
input_columns = ['age', 'bmi']
x_train = diabetes['train']['attributes'][input_columns]
y_train = diabetes['train']['target'].diseaseProgression
knr.fit(x_train, y_train)
```
- The .predict function is used to predict the disease progression of the testing data.

```python
x_test = diabetes['test']['attributes'][input_columns]
y_test = diabetes['test']['target'].diseaseProgression
y_predict = knr.predict(x_test)
```
- Compare the predicted values with the target values of the test data.

```python
print(pd.DataFrame(list(zip(y_test, y_predict)), columns=['target', 'predicted']))
```
- Calculate the score of the prediction. Note that for a regressor, .score returns the coefficient of determination (R²), not a classification accuracy.

```python
print(f'R² score: {knr.score(x_test, y_test):.4f}')
```
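The value returned by a regressor's `.score` can be reproduced by hand from the R² formula, R² = 1 − SS_res / SS_tot. A toy check on assumed data (not the diabetes split):

```python
# Toy data: a regressor's .score equals the manually computed R².
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X = [[0], [1], [2], [3]]
y = np.array([0.0, 1.0, 2.0, 3.0])
demo = KNeighborsRegressor(n_neighbors=2).fit(X, y)
pred = demo.predict(X)
r2_manual = 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)
print(np.isclose(r2_manual, demo.score(X, y)))  # True
```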
Visualisation
- Import the matplotlib.pyplot library and the colormaps from the matplotlib library.

```python
import matplotlib.pyplot as plt
from matplotlib import cm
```
- Prepare the colormap.

```python
dia_cm = cm.get_cmap('Reds')
```
- Calculate the predicted disease progression over a grid of points. For regression this gives a continuous prediction surface rather than discrete decision regions.

```python
import numpy as np

x_min = diabetes['attributes'][input_columns[0]].min()
x_max = diabetes['attributes'][input_columns[0]].max()
x_range = x_max - x_min
x_min = x_min - 0.1 * x_range
x_max = x_max + 0.1 * x_range

y_min = diabetes['attributes'][input_columns[1]].min()
y_max = diabetes['attributes'][input_columns[1]].max()
y_range = y_max - y_min
y_min = y_min - 0.1 * y_range
y_max = y_max + 0.1 * y_range

xx, yy = np.meshgrid(np.arange(x_min, x_max, .01 * x_range),
                     np.arange(y_min, y_max, .01 * y_range))
z = knr.predict(list(zip(xx.ravel(), yy.ravel())))
z = z.reshape(xx.shape)
```
- Plot the prediction surface.

```python
plt.figure()
plt.pcolormesh(xx, yy, z, cmap=dia_cm)
```
- Plot the training and testing data.

```python
plt.scatter(x_train[input_columns[0]], x_train[input_columns[1]], c=y_train,
            label='Training data', cmap=dia_cm,
            edgecolor='black', linewidth=1, s=150)
plt.scatter(x_test[input_columns[0]], x_test[input_columns[1]], c=y_test,
            marker='*', label='Testing data', cmap=dia_cm,
            edgecolor='black', linewidth=1, s=150)
plt.xlabel(input_columns[0])
plt.ylabel(input_columns[1])
plt.legend()
plt.colorbar()
```
Task: Create a loop to compare the R² score of the prediction at different values of k. The comparison should be shown in a graph with k on the horizontal axis and the R² score on the vertical axis.