Lab 6: Linear Regression
Objectives
- To develop a script that performs gradient descent on a linear regression model with the least squares method, using Python.
- To train a linear regression model with the least squares method using the `scikit-learn` Python library.
Dataset
Throughout this lab we will be using the data for 442 diabetes patients. The data can be imported from the `sklearn` library.
```python
from sklearn import datasets
diabetes = datasets.load_diabetes()
```
`diabetes` contains the keys `data`, `target`, `DESCR`, `feature_names`, `data_filename`, and `target_filename`. The description of the diabetes dataset is given by

```python
print(diabetes.DESCR)
```
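As a small aside (not one of the original steps), the object returned by `load_diabetes()` is dict-like, so its keys can also be listed programmatically:

```python
# The Bunch returned by load_diabetes() behaves like a dict
print(list(diabetes.keys()))
```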
Task: What are the keys for the input data and the output/target data?
Gradient descent on a linear regression model with the least squares method
- Load the dataset as described in the Dataset section.

```python
from sklearn import datasets
diabetes = datasets.load_diabetes()
```
- Convert the dataset into pandas DataFrames.

```python
import pandas as pd

dt = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
y = pd.DataFrame(diabetes.target, columns=['target'])
```
- In machine learning, the common practice is to split the dataset into training data and testing data (some also include validation data). The testing data is not used during the training phase, but only after training to evaluate the model. The split is normally done such that the training data has a larger portion than the testing data. In this case, let's use an 80-20 split between training data and testing data. `sklearn` provides a function to split the data into train and test sets given a proportion.

```python
from sklearn.model_selection import train_test_split
```
Task: How do you use `train_test_split` to split the data and target in an 80-20 proportion? (Define the training features as `X_train`, the training target as `y_train`, the testing features as `X_test`, and the testing target as `y_test`.)
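One possible call is sketched below; `test_size=0.2` gives the 80-20 split, and the `random_state` value is an assumption added here only to make the split reproducible.

```python
# Sketch: 80-20 train-test split of the features (dt) and target (y).
# random_state is an assumption, added for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    dt, y, test_size=0.2, random_state=0)
```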
- Before we train the model, a few functions need to be defined. We will start by defining the model of a simple regression line, i.e. \(y = mx + c\). Name the function `model`, with 3 inputs `x`, `m`, and `c`. The function should return the value of \(y\).
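A minimal sketch of this function might look like:

```python
def model(x, m, c):
    # Simple regression line: y = m*x + c
    return m * x + c
```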
- The next function to define is the cost function, i.e.
\[J = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2\]
Name the function `cost`, with 2 inputs: `y`, the array of target values, and `yh`, the array of predicted target values. The function returns a scalar value of the cost. Assume the arrays are pandas Series.
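A minimal sketch, assuming `y` and `yh` are aligned pandas Series:

```python
def cost(y, yh):
    # Mean squared error between targets y and predictions yh
    return ((y - yh) ** 2).mean()
```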
- We also need to define a function to calculate the derivatives of the cost with respect to each parameter. Name the function `derivatives`, with 3 inputs: `x`, the array of input values; `y`, the array of target values; and `yh`, the array of predicted target values. The function returns a dict object whose key `m` holds the value of \(\frac{\partial J}{\partial m}\) and whose key `c` holds the value of \(\frac{\partial J}{\partial c}\). Assume the arrays are pandas Series.
\[\frac{\partial J}{\partial m} = -\frac{2}{n} \sum_{i=1}^{n} x_i (y_i - \hat{y}_i)\]
\[\frac{\partial J}{\partial c} = -\frac{2}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)\]
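A minimal sketch, translating the two formulas above directly into pandas operations:

```python
def derivatives(x, y, yh):
    # Partial derivatives of the least squares cost J
    # with respect to the slope m and the intercept c
    n = len(y)
    error = y - yh
    return {'m': -2 / n * (x * error).sum(),
            'c': -2 / n * error.sum()}
```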
- We will use `bmi` as our input feature.
- Define the initial values for the parameters.

```python
learningrate = 0.1
m = []
c = []
J = []
m.append(0)
c.append(0)
J.append(cost(y_train['target'],
              X_train['bmi'].apply(lambda x: model(x, m[-1], c[-1]))))
```
- Define the termination conditions.

```python
J_min = 0.01
del_J_min = 0.0001
max_iter = 10000
```

| Variable | Termination condition |
| --- | --- |
| `J_min` | The algorithm terminates when the cost is less than this value. |
| `del_J_min` | The algorithm terminates when the proportion of the change in cost to the current cost is less than this value. |
| `max_iter` | The algorithm terminates when the number of iterations exceeds this value. |
- Now develop the code to perform gradient descent (a sketch follows this list).
  - Check the termination conditions; if any condition is fulfilled, the algorithm terminates.
  - Calculate the derivatives.
  - Update \(m\) and \(c\); append the new values to the arrays `m` and `c`.
  - Calculate \(J\); append the value to the array `J`. Repeat from the beginning.
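A minimal sketch of the loop, assuming the functions and variables defined in the previous steps; the termination tests below are one reasonable reading of the conditions in the table above.

```python
while True:
    # Termination conditions
    if J[-1] < J_min:
        break
    if len(J) > 1 and abs(J[-2] - J[-1]) / J[-1] < del_J_min:
        break
    if len(J) > max_iter:
        break
    # Derivatives at the current parameter values
    yh = X_train['bmi'].apply(lambda x: model(x, m[-1], c[-1]))
    d = derivatives(X_train['bmi'], y_train['target'], yh)
    # Update the parameters and record the new cost
    m.append(m[-1] - learningrate * d['m'])
    c.append(c[-1] - learningrate * d['c'])
    J.append(cost(y_train['target'],
                  X_train['bmi'].apply(lambda x: model(x, m[-1], c[-1]))))
```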
- To provide feedback during each iteration, `import sys`, and add the following lines as the last lines in the loop.

```python
print('.', end='')
sys.stdout.flush()
```
- To provide a visual display for each iteration, `import matplotlib.pyplot as plt`. Add these lines before the loop.

```python
plt.scatter(X_train['bmi'], y_train['target'], color='red')
plt.title('Training data')
plt.xlabel('BMI')
line = None
```

Add the following lines as the last lines in the loop.

```python
if line:
    line[0].remove()
line = plt.plot(X_train['bmi'],
                X_train['bmi'].apply(lambda x: model(x, m[-1], c[-1])),
                '-', color='green')
plt.pause(0.001)
```
- Outside of the loop, add the following lines to display the results.

```python
y_train_pred = X_train['bmi'].apply(lambda x: model(x, m[-1], c[-1]))
y_test_pred = X_test['bmi'].apply(lambda x: model(x, m[-1], c[-1]))
print('\nAlgorithm terminated with')
print(f' {len(J)} iterations,')
print(f' m {m[-1]}')
print(f' c {c[-1]}')
print(f' training cost {J[-1]}')
testcost = cost(y_test['target'], y_test_pred)
print(f' testing cost {testcost}')
```
- Test the result on the testing data.

```python
plt.figure()
plt.scatter(X_test['bmi'], y_test['target'], color='red')
plt.plot(X_test['bmi'],
         X_test['bmi'].apply(lambda x: model(x, m[-1], c[-1])),
         '-', color='green')
plt.title('Testing data')
```
Learning a linear regression model using scikit-learn
- `scikit-learn` provides a model and functions to perform linear regression easily.

```python
from sklearn import linear_model
```
- Create a linear regression model.

```python
regrmodel = linear_model.LinearRegression()
```
- Train the model with the training data.

```python
regrmodel.fit(X_train[['bmi']], y_train['target'])
```
- `regrmodel` is now a trained model. To get the predicted \(y\) given a set of \(x\) values,

```python
y_train_pred = regrmodel.predict(X_train[['bmi']])
y_test_pred = regrmodel.predict(X_test[['bmi']])
```
- As we are using pandas data structures, we need to convert the predicted values to Series and align the indices with the target Series.

```python
y_train_pred = pd.Series(y_train_pred)
y_train_pred.index = y_train.index
y_test_pred = pd.Series(y_test_pred)
y_test_pred.index = y_test.index
```
- Display the information of the fitted model.

```python
print('sklearn')
print(f' m {regrmodel.coef_}')
print(f' c {regrmodel.intercept_}')
traincost = cost(y_train['target'], y_train_pred)
testcost = cost(y_test['target'], y_test_pred)
print(f' training cost: {traincost}')
print(f' testing cost: {testcost}')
```
- Display the result of the prediction together with the target values.

```python
plt.figure()
plt.scatter(X_train['bmi'], y_train['target'], color='red')
plt.plot(X_train['bmi'], y_train_pred, '-', color='green')
plt.title('Training data (sklearn)')
plt.xlabel('BMI')

plt.figure()
plt.scatter(X_test['bmi'], y_test['target'], color='red')
plt.plot(X_test['bmi'], y_test_pred, '-', color='green')
plt.title('Testing data (sklearn)')
plt.xlabel('BMI')
```
Multivariate linear regression
- Linear regression can also be applied with multiple inputs. With \(k\) inputs, the linear regression equation is given as
\[\hat{y} = \beta + m_1x_1 + m_2x_2 + \cdots + m_kx_k\]
- With one input, the resultant function is a line; with two inputs, the resultant function is a plane; with more inputs, it's a hyperplane.
- Gradient descent for multivariate linear regression is performed with the same approach.
  - Check the termination conditions; if any condition is fulfilled, the algorithm terminates.
  - Calculate the derivatives.
  - Update \(m_1\), \(m_2\), ..., \(m_k\) and \(\beta\).
  - Calculate \(J\). Repeat from the beginning.
- For multivariate linear regression, the least squares cost is given by
\[J = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\]
The derivative with respect to \(\beta\) is
\[\frac{\partial J}{\partial \beta} = -\frac{2}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)\]
The derivative with respect to \(m_j\) is
\[\frac{\partial J}{\partial m_j} = -\frac{2}{n} \sum_{i=1}^{n} x_{ji}(y_i - \hat{y}_i)\]
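These formulas translate to pandas in the same way as the univariate case. A sketch, under the assumption that the features are held in an \(n \times k\) DataFrame `X`; the function name `derivatives_multi` is introduced here only for illustration.

```python
def derivatives_multi(X, y, yh):
    # Gradients of the least squares cost for yh = beta + sum_j m_j * x_j.
    # X is an (n, k) DataFrame; y and yh are Series of length n.
    n = len(y)
    error = y - yh
    dm = -2 / n * X.mul(error, axis=0).sum()  # Series: dJ/dm_j per feature
    dbeta = -2 / n * error.sum()              # scalar: dJ/dbeta
    return {'m': dm, 'beta': dbeta}
```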
- Using `scikit-learn`, multivariate linear regression can be achieved simply by providing multiple input features to the same functions.
- We will now use the body mass index `bmi` and the blood pressure `bp` as the input features to predict the target. Copy the code block from the previous section and change `X_...[['bmi']]` to `X_...[['bmi','bp']]`, except for the plotting part (the changed lines are sketched below).
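For reference, a sketch of how the fitting and prediction lines look after this change; everything else stays as in the previous section.

```python
# Fit and predict with two input features instead of one
regrmodel = linear_model.LinearRegression()
regrmodel.fit(X_train[['bmi', 'bp']], y_train['target'])
y_train_pred = regrmodel.predict(X_train[['bmi', 'bp']])
y_test_pred = regrmodel.predict(X_test[['bmi', 'bp']])
```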
- For the graphical visualisation (plotting) part, use the following code to plot 3D graphs:

```python
from mpl_toolkits.mplot3d import Axes3D

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X_train['bmi'], X_train['bp'], y_train['target'], color='red')
ax.scatter(X_train['bmi'], X_train['bp'], y_train_pred, color='green')
ax.set_xlabel('BMI')
ax.set_ylabel('Blood pressure')
ax.set_title('Training data (sklearn multivariate)')

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X_test['bmi'], X_test['bp'], y_test['target'], color='red')
ax.scatter(X_test['bmi'], X_test['bp'], y_test_pred, color='green')
ax.set_xlabel('BMI')
ax.set_ylabel('Blood pressure')
ax.set_title('Testing data (sklearn multivariate)')
```