
Lab 4: Decision Tree

Lab learning outcomes

After completing this lab, students should be able to

  • construct decision trees using the CART learning algorithm with the scikit-learn Python library.

Datasets

  1. The iris dataset will be used for classification, and the diabetes dataset for regression.

    from sklearn import datasets
    import pandas as pd
    iris = datasets.load_iris()
    iris = {
      'attributes': pd.DataFrame(iris.data, columns=iris.feature_names),
      'target': pd.DataFrame(iris.target, columns=['species']),
      'targetNames': iris.target_names
    }
    diabetes = datasets.load_diabetes()
    diabetes = {
      'attributes': pd.DataFrame(diabetes.data, columns=diabetes.feature_names),
      'target': pd.DataFrame(diabetes.target, columns=['diseaseProgression'])
    }
    
  2. Split each dataset into an 80-20 train-test proportion.

    from sklearn.model_selection import train_test_split
    for dt in [iris, diabetes]:
      x_train, x_test, y_train, y_test = train_test_split(dt['attributes'], dt['target'], test_size=0.2, random_state=1)
      dt['train'] = {
        'attributes': x_train,
        'target': y_train
      }
      dt['test'] = {
        'attributes': x_test,
        'target': y_test
      }
    

    Note: random_state makes the "random" split reproducible: the same split is returned every time the function is called. To produce a different random split on every call, remove the random_state argument.
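The effect of random_state can be demonstrated with a small self-contained check (the array here is arbitrary example data, not part of the lab datasets):

```python
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(10)

# two calls with the same random_state produce identical splits
a_train, a_test = train_test_split(data, test_size=0.2, random_state=1)
b_train, b_test = train_test_split(data, test_size=0.2, random_state=1)
print((a_train == b_train).all() and (a_test == b_test).all())  # True
```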

How do we access the training input data for the iris dataset?
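With the nested dictionaries built in the steps above, the training input data is reached by chained dictionary lookups. A self-contained sketch (repeating the loading and splitting steps):

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
import pandas as pd

raw = datasets.load_iris()
iris = {
    'attributes': pd.DataFrame(raw.data, columns=raw.feature_names),
    'target': pd.DataFrame(raw.target, columns=['species'])
}
x_train, x_test, y_train, y_test = train_test_split(
    iris['attributes'], iris['target'], test_size=0.2, random_state=1)
iris['train'] = {'attributes': x_train, 'target': y_train}
iris['test'] = {'attributes': x_test, 'target': y_test}

# chained lookups return the training input DataFrame
print(iris['train']['attributes'].shape)  # (120, 4)
```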

Decision tree

Decision tree models and algorithms are provided by scikit-learn as the classes sklearn.tree.DecisionTreeClassifier for classification and sklearn.tree.DecisionTreeRegressor for regression.

Classification

  1. Import the class for decision tree classifier.

    from sklearn.tree import DecisionTreeClassifier
    

  2. Instantiate an object of the DecisionTreeClassifier class with Gini impurity as the split criterion.

    dtc = DecisionTreeClassifier(criterion='gini')
    

  3. Train the classifier with the training data. We will use all the input attributes.

    dtc.fit(iris['train']['attributes'], iris['train']['target'].species)
    

  4. The predict method is used to predict the species of the testing data.

    predicts = dtc.predict(iris['test']['attributes'])
    

  5. Compare the predicted values with the target values of the test data.

    print(pd.DataFrame(list(zip(iris['test']['target'].species, predicts)), columns=['target', 'predicted']))
    

  6. Calculate the accuracy of the predicted value.

    accuracy = dtc.score(iris['test']['attributes'], iris['test']['target'].species)
    print(f'Accuracy: {accuracy:.4f}')
    

  7. Decision tree visualisation

    1. Import the matplotlib.pyplot library and the function to visualise the tree.

      import matplotlib.pyplot as plt
      from sklearn.tree import plot_tree
      

    2. Visualise the decision tree model.

      plt.figure(figsize=[10,10])
      tree = plot_tree(dtc, feature_names=iris['attributes'].columns.tolist(), 
                       class_names=list(iris['targetNames']), filled=True, rounded=True)
      

The maximum depth of a decision tree can be defined by adding the max_depth=... argument to the DecisionTreeClassifier(...) object instantiation. To allow unlimited maximum depth, pass max_depth=None (the default).

Create a loop to compare the accuracy of the prediction with different maximum depths. In every iteration, you should calculate both the accuracy on the training data and the accuracy on the testing data. The comparison should be shown in a graph with max_depth as the horizontal axis and accuracy as the vertical axis. Two lines should be displayed on the graph with one line for training accuracy and the other testing accuracy.
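One possible sketch of such a loop, assuming a depth range of 1 to 10 (the range and the random_state values are arbitrary choices):

```python
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

raw = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    raw.data, raw.target, test_size=0.2, random_state=1)

depths = range(1, 11)
train_acc, test_acc = [], []
for d in depths:
    dtc = DecisionTreeClassifier(criterion='gini', max_depth=d, random_state=1)
    dtc.fit(x_train, y_train)
    train_acc.append(dtc.score(x_train, y_train))  # accuracy on training data
    test_acc.append(dtc.score(x_test, y_test))     # accuracy on testing data

plt.plot(depths, train_acc, label='Training accuracy')
plt.plot(depths, test_acc, label='Testing accuracy')
plt.xlabel('max_depth')
plt.ylabel('accuracy')
plt.legend()
```

Training accuracy keeps rising with depth while testing accuracy typically plateaus or drops, which is the pattern the overfitting discussion below asks about.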

Visualisation of decision surface

This section explains how to visualise the decision surface of a decision tree. To do so we will restrict the model to two input attributes, sepal length and sepal width, i.e. the first two columns.

  1. Instantiate the classifier without defining the maximum depth and train the model.

    dtc = DecisionTreeClassifier()
    input_cols = iris['train']['attributes'].columns[:2].tolist()
    dtc.fit(iris['train']['attributes'][input_cols], iris['train']['target'].species)
    

  2. Plot the decision tree.

    plt.figure(figsize=[50,50])
    plot_tree(dtc, feature_names=input_cols, 
              class_names=list(iris['targetNames']), filled=True, rounded=True)
    plt.savefig('classificationDecisionTreeWithNoMaxDepth.png')
    

  3. Prepare the colormaps.

    from matplotlib import cm
    from matplotlib.colors import ListedColormap
    colormap = cm.get_cmap('tab20')
    cm_dark = ListedColormap(colormap.colors[::2])
    cm_light = ListedColormap(colormap.colors[1::2])
    

  4. Calculating the decision surface.

    import numpy as np
    x_min = iris['attributes'][input_cols[0]].min()
    x_max = iris['attributes'][input_cols[0]].max()
    x_range = x_max - x_min
    x_min = x_min - 0.1 * x_range
    x_max = x_max + 0.1 * x_range
    y_min = iris['attributes'][input_cols[1]].min()
    y_max = iris['attributes'][input_cols[1]].max()
    y_range = y_max - y_min
    y_min = y_min - 0.1 * y_range
    y_max = y_max + 0.1 * y_range
    xx, yy = np.meshgrid(np.arange(x_min, x_max, .01*x_range), 
                        np.arange(y_min, y_max, .01*y_range))
    z = dtc.predict(np.c_[xx.ravel(), yy.ravel()])
    z = z.reshape(xx.shape)
    

  5. Plot the decision surface.

    plt.figure()
    plt.pcolormesh(xx, yy, z, cmap=cm_light)
    

  6. Plot the training and testing data.

    plt.scatter(iris['train']['attributes'][input_cols[0]],   
                iris['train']['attributes'][input_cols[1]], 
                c=iris['train']['target'].species, cmap=cm_dark, s=200,
                label='Training data', edgecolor='black', linewidth=1)
    plt.scatter(iris['test']['attributes'][input_cols[0]], 
                iris['test']['attributes'][input_cols[1]], 
                c=iris['test']['target'].species, cmap=cm_dark, s=200,
                label='Testing data', edgecolor='black', linewidth=1,
                marker='*')
    train_acc = dtc.score(iris['train']['attributes'][input_cols], 
                          iris['train']['target'].species)
    test_acc = dtc.score(iris['test']['attributes'][input_cols], 
                        iris['test']['target'].species)
    plt.title(f'training: {train_acc:.3f}, testing: {test_acc:.3f}')
    plt.xlabel(input_cols[0])
    plt.ylabel(input_cols[1])
    plt.legend()
    

You may not be able to see anything in one of the decision tree figures because the figure size is set larger than the screen. However, the tree is saved as a PNG file in the same folder as your code.

Overfitting

Now, train a decision tree classifier with max_depth=3 using the two input attributes from the previous section, sepal length and sepal width.

Plot the decision surface for this classifier after the training. Compare the decision surface, training accuracy, and testing accuracy between this model and the model in the previous section.

Discuss the comparison between the decision tree from the previous section and the max_depth=3 tree from the aspect of overfitting/generalisation.

Regression

Regression with a decision tree can be achieved using the DecisionTreeRegressor class in sklearn.tree. Instantiate a regressor and train it with the training data using all the input attributes.

Predict the disease progression of the testing data, and determine the accuracy of the prediction. (For a regressor, the score method returns the coefficient of determination, R², rather than classification accuracy.)
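A minimal sketch of this step, assuming all ten input attributes, an unrestricted depth, and an arbitrary random_state:

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

raw = datasets.load_diabetes()
x_train, x_test, y_train, y_test = train_test_split(
    raw.data, raw.target, test_size=0.2, random_state=1)

dtr = DecisionTreeRegressor(random_state=1)
dtr.fit(x_train, y_train)

predicts = dtr.predict(x_test)   # predicted disease progression
r2 = dtr.score(x_test, y_test)   # coefficient of determination (R^2)
print(f'R^2 on testing data: {r2:.4f}')
```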

Create a plot of prediction accuracies against maximum depths of the decision tree for both training data and testing data.

Visualisation of decision surface

This section explains how to visualise the prediction surface of a decision tree regressor. To do so we will use two input attributes, age and bmi.

  1. Instantiate the regressor without defining the maximum depth and train the model.

    from sklearn.tree import DecisionTreeRegressor
    dtr = DecisionTreeRegressor()
    input_cols = ['age', 'bmi']
    dtr.fit(diabetes['train']['attributes'][input_cols], 
            diabetes['train']['target'].diseaseProgression)
    

  2. Plot the decision tree.

    plt.figure(figsize=[50,50])
    plot_tree(dtr, feature_names=input_cols, filled=True, rounded=True)
    plt.savefig('regressionDecisionTreeWithNoMaxDepth.png')
    

  3. Prepare the colormaps.

    from matplotlib import cm
    dia_cm = cm.get_cmap('Reds')
    

  4. Create the decision surface.

    import numpy as np
    x_min = diabetes['attributes'][input_cols[0]].min()
    x_max = diabetes['attributes'][input_cols[0]].max()
    x_range = x_max - x_min
    x_min = x_min - 0.1 * x_range
    x_max = x_max + 0.1 * x_range
    y_min = diabetes['attributes'][input_cols[1]].min()
    y_max = diabetes['attributes'][input_cols[1]].max()
    y_range = y_max - y_min
    y_min = y_min - 0.1 * y_range
    y_max = y_max + 0.1 * y_range
    xx, yy = np.meshgrid(np.arange(x_min, x_max, .01*x_range), 
                        np.arange(y_min, y_max, .01*y_range))
    z = dtr.predict(np.c_[xx.ravel(), yy.ravel()])
    z = z.reshape(xx.shape)
    

  5. Plot the decision surface.

    plt.figure()
    plt.pcolormesh(xx, yy, z, cmap=dia_cm)
    

  6. Plot the training and testing data.

    plt.scatter(diabetes['train']['attributes'][input_cols[0]],          
                diabetes['train']['attributes'][input_cols[1]], 
                c=diabetes['train']['target'].diseaseProgression, 
                label='Training data', cmap=dia_cm, 
                edgecolor='black', linewidth=1, s=150)
    plt.scatter(diabetes['test']['attributes'][input_cols[0]],   
                diabetes['test']['attributes'][input_cols[1]], 
                c=diabetes['test']['target'].diseaseProgression, marker='*', 
                label='Testing data', cmap=dia_cm, 
                edgecolor='black', linewidth=1, s=150)
    plt.xlabel(input_cols[0])
    plt.ylabel(input_cols[1])
    plt.legend()
    plt.colorbar()
    

Overfitting

Compare the decision tree regressor in the previous model with a decision tree regressor of a small maximum depth and discuss overfitting using the decision surface, training accuracy, and testing accuracy.

Report

Submit a written report that describes the considerations while writing the scripts, the answers to the above questions, and the problems you encountered in the process. Include your script in the written report as an appendix.