Lab 6: Linear Regression
Objectives
- To develop a script that performs gradient descent on a linear regression model with the least squares method using Python.
- To train a linear regression model with the least squares method using the `scikit-learn` Python library.
Dataset
Throughout this lab we will be using data for 442 diabetes patients. The data can be imported from the `sklearn` library.
```python
from sklearn import datasets

diabetes = datasets.load_diabetes()
```
`diabetes` contains the keys `data`, `target`, `DESCR`, `feature_names`, `data_filename`, and `target_filename`. The description of the diabetes dataset is given by

```python
print(diabetes.DESCR)
```
Task: What are the keys for the input data and the output/target data?
Gradient descent on a linear regression model with the least squares method
- Load the dataset as described in the Dataset section.

  ```python
  from sklearn import datasets

  diabetes = datasets.load_diabetes()
  ```
- Convert the dataset into pandas DataFrames.

  ```python
  import pandas as pd

  dt = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
  y = pd.DataFrame(diabetes.target, columns=['target'])
  ```
- In machine learning, the common practice is to split the dataset into training data and testing data (some also include validation data). The testing data is not used during the training phase, but only after training to evaluate the model. The split is normally done such that the training data has a larger portion than the testing data. In this case, let's use an 80-20 split for the training data and testing data.

  `sklearn` provides a function to split the data into train and test portions given a proportion.

  ```python
  from sklearn.model_selection import train_test_split
  ```

  Task: How do you use `train_test_split` to split the data and target into an 80-20 proportion? (Define the training features as `X_train`, the training target as `y_train`, the testing features as `X_test`, and the testing target as `y_test`.)
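  One possible call, as a sketch (argument choices such as `random_state` are up to you):

  ```python
  # 80-20 split: test_size=0.2 reserves 20% of the samples for testing;
  # random_state is optional and only makes the split reproducible
  X_train, X_test, y_train, y_test = train_test_split(
      dt, y, test_size=0.2, random_state=0)
  ```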
- Before we train the model, a few functions need to be defined. We will start with defining the model of a simple regression line, i.e. \(y = mx + c\). Name the function `model` with 3 inputs `x`, `m`, and `c`. The function should return the value of \(y\).
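  A minimal sketch of one possible implementation:

  ```python
  def model(x, m, c):
      # Straight-line prediction y = m*x + c
      return m * x + c
  ```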
- The next function to define is the cost function, i.e.

  \[J = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2\]

  Name the function `cost` with 2 inputs: `y` as the array of target values and `yhat` as the array of predicted target values. The function returns a scalar value of the cost. Assume the arrays are pandas Series.
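  One possible implementation, assuming `y` and `yhat` are pandas Series with aligned indices:

  ```python
  def cost(y, yhat):
      # Mean squared error between targets and predictions
      return ((y - yhat) ** 2).mean()
  ```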
- We also need to define a function to calculate the derivatives of the cost with respect to each parameter. Name the function `derivatives` with 3 inputs: `x` as the array of input values, `y` as the array of target values, and `yhat` as the array of predicted target values. The function returns a dict object whose key `m` holds the value of \(\frac{\partial J}{\partial m}\) and whose key `c` holds the value of \(\frac{\partial J}{\partial c}\). Assume the arrays are pandas Series.

  \[\frac{\partial J}{\partial m} = -\frac{2}{n} \sum_{i=1}^{n} x_i (y_i - \hat{y}_i)\]

  \[\frac{\partial J}{\partial c} = -\frac{2}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)\]
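  A sketch of one possible implementation, again assuming pandas Series with aligned indices:

  ```python
  def derivatives(x, y, yhat):
      # Gradients of the mean squared error with respect to m and c
      n = len(y)
      error = y - yhat
      return {'m': -2 / n * (x * error).sum(),
              'c': -2 / n * error.sum()}
  ```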
- We will use `bmi` as our input feature.
- Define the initial values for the parameters.

  ```python
  learningrate = 0.1
  m = []
  c = []
  J = []
  m.append(0)
  c.append(0)
  J.append(cost(y_train['target'],
                X_train['bmi'].apply(lambda x: model(x, m[-1], c[-1]))))
  ```
- Define the termination conditions.

  ```python
  J_min = 0.01
  del_J_min = 0.0001
  max_iter = 10000
  ```

  | Variable | Termination condition |
  | --- | --- |
  | `J_min` | The algorithm terminates when the cost is less than this value. |
  | `del_J_min` | The algorithm terminates when the proportion of change in cost to the current cost is less than this value. |
  | `max_iter` | The algorithm terminates when the number of iterations exceeds this value. |
- Now develop the code to perform gradient descent. A sketch of the loop is given after these steps.

  - Check the termination conditions; if any condition is fulfilled, the algorithm terminates.
  - Calculate the derivatives.
  - Update \(m\) and \(c\); append the new values to the arrays `m` and `c`.
  - Calculate \(J\); append the value to the array `J`.
  - Repeat from the beginning.
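  A minimal sketch of the loop, assuming the functions and variables defined above (the feedback and plotting lines from the next two steps go at the end of this loop):

  ```python
  for i in range(max_iter):
      # Terminate on a small cost or a small relative change in cost
      if J[-1] < J_min:
          break
      if len(J) > 1 and abs(J[-2] - J[-1]) / J[-1] < del_J_min:
          break
      # Gradient descent update of m and c
      yhat = X_train['bmi'].apply(lambda x: model(x, m[-1], c[-1]))
      d = derivatives(X_train['bmi'], y_train['target'], yhat)
      m.append(m[-1] - learningrate * d['m'])
      c.append(c[-1] - learningrate * d['c'])
      # Cost with the updated parameters
      J.append(cost(y_train['target'],
                    X_train['bmi'].apply(lambda x: model(x, m[-1], c[-1]))))
  ```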
- To provide feedback during each iteration, `import sys`, and add the following lines as the last lines in the loop.

  ```python
  print('.', end='')
  sys.stdout.flush()
  ```
- To provide a visual display for each iteration, `import matplotlib.pyplot as plt`. Add these lines before the loop.

  ```python
  plt.scatter(X_train['bmi'], y_train['target'], color='red')
  plt.title('Training data')
  plt.xlabel('BMI')
  line = None
  ```

  Add the following lines as the last lines in the loop.

  ```python
  if line:
      line[0].remove()
  line = plt.plot(X_train['bmi'],
                  X_train['bmi'].apply(lambda x: model(x, m[-1], c[-1])),
                  '-', color='green')
  plt.pause(0.001)
  ```
- Outside of the loop, add the following lines to display the results.

  ```python
  y_train_pred = X_train['bmi'].apply(lambda x: model(x, m[-1], c[-1]))
  y_test_pred = X_test['bmi'].apply(lambda x: model(x, m[-1], c[-1]))
  print('\nAlgorithm terminated with')
  print(f'  {len(J)} iterations,')
  print(f'  m {m[-1]}')
  print(f'  c {c[-1]}')
  print(f'  training cost {J[-1]}')
  testcost = cost(y_test['target'], y_test_pred)
  print(f'  testing cost {testcost}')
  ```
- Test the result on the testing data.

  ```python
  plt.figure()
  plt.scatter(X_test['bmi'], y_test['target'], color='red')
  plt.plot(X_test['bmi'],
           X_test['bmi'].apply(lambda x: model(x, m[-1], c[-1])),
           '-', color='green')
  plt.title('Testing data')
  ```
Learning a linear regression model using scikit-learn
- `scikit-learn` provides a model and functions to perform linear regression easily.

  ```python
  from sklearn import linear_model
  ```
- Create a linear regression model.

  ```python
  regrmodel = linear_model.LinearRegression()
  ```
- Train the model with the training data.

  ```python
  regrmodel.fit(X_train[['bmi']], y_train['target'])
  ```
- `regrmodel` is now a trained model. To get the predicted \(y\) given a set of \(x\) values,

  ```python
  y_train_pred = regrmodel.predict(X_train[['bmi']])
  y_test_pred = regrmodel.predict(X_test[['bmi']])
  ```
- As we are using pandas data structures, we need to convert the predicted values to Series and align the indices with the target Series.

  ```python
  y_train_pred = pd.Series(y_train_pred)
  y_train_pred.index = y_train.index
  y_test_pred = pd.Series(y_test_pred)
  y_test_pred.index = y_test.index
  ```
- Display the information of the fitted model.

  ```python
  print('sklearn')
  print(f'  m {regrmodel.coef_}')
  print(f'  c {regrmodel.intercept_}')
  traincost = cost(y_train['target'], y_train_pred)
  testcost = cost(y_test['target'], y_test_pred)
  print(f'  training cost: {traincost}')
  print(f'  testing cost: {testcost}')
  ```
- Display the result of the prediction together with the target values.

  ```python
  plt.figure()
  plt.scatter(X_train['bmi'], y_train['target'], color='red')
  plt.plot(X_train['bmi'], y_train_pred, '-', color='green')
  plt.title('Training data (sklearn)')
  plt.xlabel('BMI')

  plt.figure()
  plt.scatter(X_test['bmi'], y_test['target'], color='red')
  plt.plot(X_test['bmi'], y_test_pred, '-', color='green')
  plt.title('Testing data (sklearn)')
  plt.xlabel('BMI')
  ```
Multivariate linear regression
- Linear regression can also be applied with multiple inputs. With \(k\) inputs, the linear regression equation is given as

  \[\hat{y} = \beta + m_1x_1 + m_2x_2 + \cdots + m_kx_k\]
- With one input, the resultant function is a line; with two inputs, the resultant function is a plane; with more inputs, it is a hyperplane.
- Gradient descent for multivariate linear regression is performed with the same approach.

  - Check the termination conditions; if any condition is fulfilled, the algorithm terminates.
  - Calculate the derivatives.
  - Update \(m_1\), \(m_2\), ..., \(m_k\) and \(\beta\).
  - Calculate \(J\).
  - Repeat from the beginning.
- For multivariate linear regression, the least squares cost is given by

  \[J = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\]

  The derivative with respect to \(\beta\) is

  \[\frac{\partial J}{\partial \beta} = -\frac{2}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)\]

  The derivative with respect to \(m_j\) is

  \[\frac{\partial J}{\partial m_j} = -\frac{2}{n} \sum_{i=1}^{n} x_{ji}(y_i - \hat{y}_i)\]
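  Although we will rely on scikit-learn for the multivariate case below, a vectorised sketch of a single update step may help connect these formulas to code. The names `gd_step`, `mvec`, and `lr` are hypothetical, and NumPy arrays are assumed:

  ```python
  import numpy as np

  def gd_step(X, y, mvec, beta, lr):
      # One gradient descent update for y_hat = X @ mvec + beta.
      # X: (n, k) array of inputs, y: (n,) targets,
      # mvec: (k,) slopes, beta: scalar intercept, lr: learning rate.
      n = len(y)
      error = y - (X @ mvec + beta)
      dm = -2 / n * (X.T @ error)   # dJ/dm_j, one entry per feature
      dbeta = -2 / n * error.sum()  # dJ/dbeta
      return mvec - lr * dm, beta - lr * dbeta
  ```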
- Using `scikit-learn` functions, multivariate linear regression can be achieved by providing multiple input features to the function.
- We will now use the `bmi` and blood pressure `bp` as the input features to predict the target. Copy the code block from the previous section and change `X_...[['bmi']]` to `X_...[['bmi','bp']]`, except for the plotting part. A sketch of the changed lines is given below.
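  For reference, a sketch of the lines that change (same variable names as before; the Series conversion lines stay the same):

  ```python
  regrmodel = linear_model.LinearRegression()
  regrmodel.fit(X_train[['bmi', 'bp']], y_train['target'])
  y_train_pred = regrmodel.predict(X_train[['bmi', 'bp']])
  y_test_pred = regrmodel.predict(X_test[['bmi', 'bp']])
  ```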
- For the graphical visualisation (plotting) part, use the following code to plot 3D graphs:

  ```python
  from mpl_toolkits.mplot3d import Axes3D

  fig = plt.figure()
  ax = fig.add_subplot(111, projection='3d')
  ax.scatter(X_train['bmi'], X_train['bp'], y_train['target'], color='red')
  ax.scatter(X_train['bmi'], X_train['bp'], y_train_pred, color='green')
  ax.set_xlabel('BMI')
  ax.set_ylabel('Blood pressure')
  ax.set_title('Training data (sklearn multivariate)')

  fig = plt.figure()
  ax = fig.add_subplot(111, projection='3d')
  ax.scatter(X_test['bmi'], X_test['bp'], y_test['target'], color='red')
  ax.scatter(X_test['bmi'], X_test['bp'], y_test_pred, color='green')
  ax.set_xlabel('BMI')
  ax.set_ylabel('Blood pressure')
  ax.set_title('Testing data (sklearn multivariate)')
  ```