Lab 8: Decision Tree
Objective
- To apply the CART decision tree learning algorithm using the scikit-learn Python library.
Datasets
The same iris and diabetes datasets used in Lab 7 are used here. Load the datasets and split each into training and testing sets in an 80:20 proportion.
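If you no longer have the Lab 7 loading code to hand, here is a minimal sketch of one way to build the dictionary structure assumed in the rest of this lab. The key names ('attributes', 'targetNames', 'train', 'test') and the target column names ('species', 'diseaseProgression') follow how they are used below; make_dataset is an illustrative helper, not part of scikit-learn, and parameters such as random_state are illustrative.

    import pandas as pd
    from sklearn.datasets import load_iris, load_diabetes
    from sklearn.model_selection import train_test_split

    def make_dataset(bunch, target_col):
        # Wrap the raw arrays in DataFrames so columns can be accessed by name.
        attributes = pd.DataFrame(bunch.data, columns=bunch.feature_names)
        target = pd.DataFrame(bunch.target, columns=[target_col])
        # 80:20 train-test split; random_state fixes the shuffle for reproducibility.
        train_x, test_x, train_y, test_y = train_test_split(
            attributes, target, test_size=0.2, random_state=0)
        return {'attributes': attributes,
                'targetNames': list(getattr(bunch, 'target_names', [])),
                'train': {'attributes': train_x, 'target': train_y},
                'test': {'attributes': test_x, 'target': test_y}}

    iris = make_dataset(load_iris(), 'species')
    diabetes = make_dataset(load_diabetes(), 'diseaseProgression')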
Decision tree
Decision tree models and algorithms are provided by scikit-learn as the class sklearn.tree.DecisionTreeClassifier for classification and the class sklearn.tree.DecisionTreeRegressor for regression.
Classification
- Import the class for the decision tree classifier.

    from sklearn.tree import DecisionTreeClassifier
- Instantiate an object of the DecisionTreeClassifier class with Gini impurity as the split criterion.

    dtc = DecisionTreeClassifier(criterion='gini')
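Note that criterion='gini' is the default; scikit-learn also accepts criterion='entropy', which splits on information gain instead of Gini impurity.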
- Train the classifier with the training data. We will use all the input attributes.

    dtc.fit(iris['train']['attributes'], iris['train']['target'])
- Use the .predict method to predict the species of the testing data.

    predicts = dtc.predict(iris['test']['attributes'])
- Compare the predicted values with the target values of the test data.

    import pandas as pd
    print(pd.DataFrame(list(zip(iris['test']['target'].species, predicts)),
                       columns=['target', 'predicted']))
- Calculate the accuracy of the predictions.

    accuracy = dtc.score(iris['test']['attributes'], iris['test']['target'].species)
    print(f'Accuracy: {accuracy:.4f}')
Decision tree visualisation
- Import the matplotlib.pyplot library and the function to visualise the tree.

    import matplotlib.pyplot as plt
    from sklearn.tree import plot_tree
- Visualise the decision tree model.

    plt.figure(figsize=[10, 10])
    tree = plot_tree(dtc,
                     feature_names=iris['attributes'].columns.tolist(),
                     class_names=iris['targetNames'],
                     filled=True, rounded=True)
- The maximum depth of a decision tree can be set by passing the max_depth=... argument when instantiating DecisionTreeClassifier(...). To allow unlimited depth, pass max_depth=None (the default).
Task: Write a loop to compare the accuracy of the prediction across different maximum depths. Use 1, 2, 3, 5, 7, and 20 as the maximum depths. In each iteration, calculate both the accuracy on the training data and the accuracy on the testing data. Show the comparison in a graph with max_depth on the horizontal axis and accuracy on the vertical axis, displaying two lines: one for training accuracy and one for testing accuracy.
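One possible skeleton for this task (a sketch, not the only solution; the accumulator names train_accs and test_accs are illustrative):

    depths = [1, 2, 3, 5, 7, 20]
    train_accs, test_accs = [], []
    for depth in depths:
        model = DecisionTreeClassifier(criterion='gini', max_depth=depth)
        model.fit(iris['train']['attributes'], iris['train']['target'].species)
        # .score returns the mean classification accuracy for classifiers
        train_accs.append(model.score(iris['train']['attributes'],
                                      iris['train']['target'].species))
        test_accs.append(model.score(iris['test']['attributes'],
                                     iris['test']['target'].species))
    plt.figure()
    plt.plot(depths, train_accs, label='Training accuracy')
    plt.plot(depths, test_accs, label='Testing accuracy')
    plt.xlabel('max_depth')
    plt.ylabel('accuracy')
    plt.legend()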
Visualisation of decision surface
This section explains how to visualise the decision surface of a decision tree on a graph. To do so, we will focus on two input attributes, sepal length and sepal width, i.e. the first two columns.
- Instantiate the classifier without defining the maximum depth and train the model.

    dtc = DecisionTreeClassifier()
    input_cols = iris['train']['attributes'].columns[:2].tolist()
    dtc.fit(iris['train']['attributes'][input_cols], iris['train']['target'].species)
- Plot the decision tree.

    plt.figure(figsize=[50, 50])
    plot_tree(dtc, feature_names=input_cols, class_names=iris['targetNames'],
              filled=True, rounded=True)
    plt.savefig('classificationDecisionTreeWithNoMaxDepth.png')
- Prepare the colormaps.

    from matplotlib import colormaps   # cm.get_cmap was removed in recent Matplotlib versions
    from matplotlib.colors import ListedColormap
    colormap = colormaps['tab20']                     # 20 colours in dark/light pairs
    cm_dark = ListedColormap(colormap.colors[::2])    # dark shades for the data points
    cm_light = ListedColormap(colormap.colors[1::2])  # light shades for the surface
- Calculate the decision surface.

    import numpy as np
    # Extend the plotting range 10% beyond the data on each axis.
    x_min = iris['attributes'][input_cols[0]].min()
    x_max = iris['attributes'][input_cols[0]].max()
    x_range = x_max - x_min
    x_min = x_min - 0.1 * x_range
    x_max = x_max + 0.1 * x_range
    y_min = iris['attributes'][input_cols[1]].min()
    y_max = iris['attributes'][input_cols[1]].max()
    y_range = y_max - y_min
    y_min = y_min - 0.1 * y_range
    y_max = y_max + 0.1 * y_range
    # Build a grid over the plane and predict the class at every grid point.
    xx, yy = np.meshgrid(np.arange(x_min, x_max, .01 * x_range),
                         np.arange(y_min, y_max, .01 * y_range))
    z = dtc.predict(np.c_[xx.ravel(), yy.ravel()])
    z = z.reshape(xx.shape)
- Plot the decision surface.

    plt.figure()
    plt.pcolormesh(xx, yy, z, cmap=cm_light)
- Plot the training and testing data.

    plt.scatter(iris['train']['attributes'][input_cols[0]],
                iris['train']['attributes'][input_cols[1]],
                c=iris['train']['target'].species, cmap=cm_dark, s=200,
                label='Training data', edgecolor='black', linewidth=1)
    plt.scatter(iris['test']['attributes'][input_cols[0]],
                iris['test']['attributes'][input_cols[1]],
                c=iris['test']['target'].species, cmap=cm_dark, s=200,
                label='Testing data', edgecolor='black', linewidth=1, marker='*')
    train_acc = dtc.score(iris['train']['attributes'][input_cols],
                          iris['train']['target'].species)
    test_acc = dtc.score(iris['test']['attributes'][input_cols],
                         iris['test']['target'].species)
    plt.title(f'training: {train_acc:.3f}, testing: {test_acc:.3f}')
    plt.xlabel(input_cols[0])
    plt.ylabel(input_cols[1])
    plt.legend()
You may not be able to see anything on one of the decision tree graphs because its figure size is set larger than the screen. However, the tree is saved as a PNG file in the same folder as your code.
Overfitting
Now, instantiate a decision tree classifier with max_depth=3 and train it with the two input attributes used in the previous section, sepal length and sepal width.
Plot the decision surface for this classifier after the training. Compare the decision surface, training accuracy, and testing accuracy between this model and the model in the previous section.
The previous model shows overfitting. Although the training accuracy of this depth-limited model is lower than that of the previous one, its testing accuracy is higher, which indicates better generalisation.
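A minimal sketch of this comparison, reusing input_cols from the previous section (the name dtc3 is illustrative):

    dtc3 = DecisionTreeClassifier(max_depth=3)
    dtc3.fit(iris['train']['attributes'][input_cols], iris['train']['target'].species)
    print('training:', dtc3.score(iris['train']['attributes'][input_cols],
                                  iris['train']['target'].species))
    print('testing:', dtc3.score(iris['test']['attributes'][input_cols],
                                 iris['test']['target'].species))

Re-run the decision surface code above with dtc3 in place of dtc to see the simpler surface.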
Regression
- Import the class for the decision tree regressor.

    from sklearn.tree import DecisionTreeRegressor
- Instantiate an object of the DecisionTreeRegressor class.

    dtr = DecisionTreeRegressor()
- Train the regressor with the training data. We will use all the input attributes.

    dtr.fit(diabetes['train']['attributes'], diabetes['train']['target'])
- Use the .predict method to predict the disease progression of the testing data.

    predicts = dtr.predict(diabetes['test']['attributes'])
- Compare the predicted values with the target values of the test data.

    print(pd.DataFrame(list(zip(diabetes['test']['target'].diseaseProgression, predicts)),
                       columns=['target', 'predicted']))
- Calculate the score of the predictions. For a regressor, .score returns the coefficient of determination (R²), not a classification accuracy.

    r2 = dtr.score(diabetes['test']['attributes'],
                   diabetes['test']['target'].diseaseProgression)
    print(f'R^2 score: {r2:.4f}')
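For reference: R² is 1.0 for perfect predictions, 0.0 for a model no better than always predicting the mean target, and it can be negative for a model that does worse than that.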
Decision tree visualisation
- Import the matplotlib.pyplot library and the function to visualise the tree.

    import matplotlib.pyplot as plt
    from sklearn.tree import plot_tree
- Visualise the decision tree model.

    plt.figure(figsize=[10, 10])
    tree = plot_tree(dtr,
                     feature_names=diabetes['attributes'].columns.tolist(),
                     filled=True, rounded=True)
- The maximum depth of a decision tree can be set by passing the max_depth=... argument when instantiating DecisionTreeRegressor(...). To allow unlimited depth, pass max_depth=None (the default).
Task: Write a loop to compare the model's R² score across different maximum depths. Use 1, 2, 3, 5, 7, and 20 as the maximum depths. In each iteration, calculate both the score on the training data and the score on the testing data. Show the comparison in a graph with max_depth on the horizontal axis and R² score on the vertical axis, displaying two lines: one for the training score and one for the testing score.
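The loop skeleton from the classification task above can be reused here: swap DecisionTreeClassifier for DecisionTreeRegressor, use the diseaseProgression column of the diabetes targets, and label the vertical axis as R² score rather than accuracy.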
Visualisation of decision surface
This section explains how to visualise the decision surface of a regression tree on a graph. To do so, we will focus on two input attributes, age and bmi.
- Instantiate the regressor without defining the maximum depth and train the model.

    dtr = DecisionTreeRegressor()
    input_cols = ['age', 'bmi']
    dtr.fit(diabetes['train']['attributes'][input_cols],
            diabetes['train']['target'].diseaseProgression)
- Plot the decision tree.

    plt.figure(figsize=[50, 50])
    plot_tree(dtr, feature_names=input_cols, filled=True, rounded=True)
    plt.savefig('regressionDecisionTreeWithNoMaxDepth.png')
- Prepare the colormap. A single continuous colormap suffices here because the regression target is a continuous value.

    from matplotlib import colormaps   # cm.get_cmap was removed in recent Matplotlib versions
    dia_cm = colormaps['Reds']
- Create the decision surface.

    import numpy as np
    # Extend the plotting range 10% beyond the data on each axis.
    x_min = diabetes['attributes'][input_cols[0]].min()
    x_max = diabetes['attributes'][input_cols[0]].max()
    x_range = x_max - x_min
    x_min = x_min - 0.1 * x_range
    x_max = x_max + 0.1 * x_range
    y_min = diabetes['attributes'][input_cols[1]].min()
    y_max = diabetes['attributes'][input_cols[1]].max()
    y_range = y_max - y_min
    y_min = y_min - 0.1 * y_range
    y_max = y_max + 0.1 * y_range
    # Build a grid over the plane and predict the target at every grid point.
    xx, yy = np.meshgrid(np.arange(x_min, x_max, .01 * x_range),
                         np.arange(y_min, y_max, .01 * y_range))
    z = dtr.predict(np.c_[xx.ravel(), yy.ravel()])
    z = z.reshape(xx.shape)
- Plot the decision surface.

    plt.figure()
    plt.pcolormesh(xx, yy, z, cmap=dia_cm)
- Plot the training and testing data.

    plt.scatter(diabetes['train']['attributes'][input_cols[0]],
                diabetes['train']['attributes'][input_cols[1]],
                c=diabetes['train']['target'].diseaseProgression,
                label='Training data', cmap=dia_cm,
                edgecolor='black', linewidth=1, s=150)
    plt.scatter(diabetes['test']['attributes'][input_cols[0]],
                diabetes['test']['attributes'][input_cols[1]],
                c=diabetes['test']['target'].diseaseProgression,
                marker='*', label='Testing data', cmap=dia_cm,
                edgecolor='black', linewidth=1, s=150)
    plt.xlabel(input_cols[0])
    plt.ylabel(input_cols[1])
    plt.legend()
    plt.colorbar()
Overfitting
Now, instantiate a decision tree regressor with max_depth=3 and train it with the two input attributes used in the previous section, age and bmi.
Plot the decision surface for this regressor after the training. Compare the decision surface, training score, and testing score between this model and the model in the previous section.
The previous model shows overfitting. Although the training score of this depth-limited model is lower than that of the previous one, its testing score is higher, which indicates better generalisation.
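A minimal sketch of this comparison, mirroring the classification case (the name dtr3 is illustrative):

    dtr3 = DecisionTreeRegressor(max_depth=3)
    dtr3.fit(diabetes['train']['attributes'][input_cols],
             diabetes['train']['target'].diseaseProgression)
    print('training:', dtr3.score(diabetes['train']['attributes'][input_cols],
                                  diabetes['train']['target'].diseaseProgression))
    print('testing:', dtr3.score(diabetes['test']['attributes'][input_cols],
                                 diabetes['test']['target'].diseaseProgression))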