How to check for assumptions in a Linear Regression

Sanchita Paul · Published in Analytics Vidhya · Apr 7, 2021 · 4 min read

Before we move on to checking the assumptions, let us first understand why we need to check for assumptions before fitting a model.

Why do we do this? You do not need the assumptions to obtain a best-fit line, but if they are violated your parameter estimates may be biased or have high variance, interpreting the regression results becomes much harder, and the model’s predictions become inefficient. Now that we understand the why, let us see the how.

I will be using the 50 start-ups dataset to check for the assumptions. You can conduct this experiment with as many variables as you like.
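Throughout the post, dataset refers to this file loaded into a pandas DataFrame. Here is a minimal loading step, assuming the CSV is named 50_Startups.csv (adjust the path to wherever your copy lives):

import pandas as pd
# Assumed file name for the 50 start-ups dataset
dataset = pd.read_csv('50_Startups.csv')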

1. Linearity between independent and dependent variables

The expected (predicted) value of the dependent variable should be a straight-line function of each independent variable.

Linearity can be easily checked with a scatter plot.

# Importing the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#Marketing Spend and Profit
plt.scatter(dataset['Marketing Spend'], dataset['Profit'], alpha=0.5)
plt.title('Scatter plot of Profit with Marketing Spend')
plt.xlabel('Marketing Spend')
plt.ylabel('Profit')
plt.show()
From the graph it is visible that there is mostly a direct relationship between Profit and Marketing Spend with the exception of a few outliers.
#R&D Spend with Profit
plt.scatter(dataset['R&D Spend'], dataset['Profit'], alpha=0.5)
plt.title('Scatter plot of Profit with R&D Spend')
plt.xlabel('R&D Spend')
plt.ylabel('Profit')
plt.show()
The relationship between R&D Spend and Profit is almost perfectly linear.
#Administration with Profit
plt.scatter(dataset['Administration'], dataset['Profit'], alpha=0.5)
plt.title('Scatter plot of Profit with Administration')
plt.xlabel('Administration')
plt.ylabel('Profit')
plt.show()
There is little to no relationship between Profit and Administration.

So what can we infer from these graphs from a business point of view?

R&D Spend affects Profit almost perfectly: increasing R&D Spend results in a higher Profit. There is a good relationship between Marketing Spend and Profit. Changes in Administration spend do not have a direct effect on the Profit margin.

2. Regression model is homoscedastic

Homoscedasticity means that the variance of the errors does not change across the values of the independent variables.

We can easily check this with the help of the White test.

We can find it in Python’s statsmodels library. The White test gives us a direct answer without having to plot graphs.

#Importing Libraries
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_white
from statsmodels.compat import lzip
#The variables
x = dataset[['R&D Spend','Marketing Spend','Administration']].values
y = dataset['Profit'].values
#Fit the model (add_constant adds the intercept, which het_white also expects)
model = sm.OLS(y, sm.add_constant(x)).fit()

The White test is done by passing the residuals and all the independent variables.

#The test
white_test = het_white(model.resid, model.model.exog)
#Zipping the array with labels
names = ['Lagrange multiplier statistic', 'p-value','f-value', 'f p-value']
lzip(names,white_test)

The White test has the null hypothesis that the errors have the same variance, i.e. are homoscedastic. A p-value ≤ 0.05 means the null hypothesis is rejected, indicating heteroscedasticity.
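As a small illustration (re-using white_test from the snippet above), the returned p-value can be checked directly against the 0.05 threshold:

lm_stat, lm_pvalue, f_stat, f_pvalue = white_test
if lm_pvalue <= 0.05:
    print('Reject the null: evidence of heteroscedasticity')
else:
    print('Fail to reject the null: errors look homoscedastic')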

The White test comes with a limitation: its auxiliary regression includes the squares and cross-products of all the regressors, so with many variables it quickly becomes unwieldy and time consuming. In that case it is often easier to go for the Breusch-Pagan test.
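Here is a quick sketch of the Breusch-Pagan alternative, run on the same fitted model from above (het_breuschpagan lives in the same statsmodels diagnostics module):

from statsmodels.stats.diagnostic import het_breuschpagan
bp_test = het_breuschpagan(model.resid, model.model.exog)
lzip(names, bp_test)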

What to do if there is Heteroscedasticity?

  1. Outlier removal
  2. Log transformation of the x variables (see the sketch after this list)
  3. Polynomial regression
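As an illustration of option 2, here is a minimal sketch that log-transforms the independent variables (log1p is used so that zero values in the spend columns do not break the transform), refits the model and re-runs the White test, re-using the objects defined above:

x_log = np.log1p(dataset[['R&D Spend','Marketing Spend','Administration']])
model_log = sm.OLS(y, sm.add_constant(x_log)).fit()
lzip(names, het_white(model_log.resid, model_log.model.exog))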

3. No Multicollinearity

Multicollinearity occurs when the independent variables are correlated to each other. If the degree of multicollinearity is high it can cause problems while interpreting the results.

This can easily be checked using a heat map.

#Import Seaborn
import seaborn as sns
plt.figure(figsize=(10,5))
sns.heatmap(dataset[['R&D Spend','Marketing Spend','Administration']].corr(),vmin=-1,annot= True)
Heat map of the independent variables

We can already see from the heatmap that there is a significant correlation between R&D Spend and Marketing Spend.

We can measure the degree of correlation with the help of the Variance Inflation Factor (VIF). The VIF for a variable is 1/(1−R²), where R² comes from regressing that variable on all the other independent variables.

It can be interpreted as :

1 = Not correlated

1–5 = Moderately correlated

>5 = Highly correlated

Let us look at how to do it

from statsmodels.stats.outliers_influence import variance_inflation_factor
# For each X, calculate VIF and save in dataframe
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(dataset[['R&D Spend','Marketing Spend','Administration']].values, i) for i in range(x.shape[1])]
vif["features"] = dataset[['R&D Spend','Marketing Spend','Administration']].columns
vif.round(1)

As already suspected, there is correlation between the variables.

So how do we take care of multicollinearity?

According to the dataset and its requirements we can do it by the following ways:

  1. Remove some of the highly correlated variables
  2. Perform Principal Component Analysis (PCA) on the highly correlated variables (see the sketch after this list)
  3. Combine them linearly, for example by adding them into a single feature
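As a rough sketch of option 2, assuming scikit-learn is available, the two correlated spend columns could be replaced by their first principal component:

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# Standardise the two correlated columns, then keep their first principal component
spend = dataset[['R&D Spend','Marketing Spend']]
pc1 = PCA(n_components=1).fit_transform(StandardScaler().fit_transform(spend))
dataset['Spend_PC1'] = pc1.ravel()  # use this new column in place of the two originals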

4. Errors are normally distributed

Why check for this? Because non-normal errors can cause problems while calculating confidence intervals. Skewness can be due to the presence of outliers, and this can bias the parameter estimates.

The most powerful way of checking this is with a Q-Q (quantile-quantile) probability plot, which plots the ordered residuals against the theoretical quantiles of a normal distribution. This can be done by:

#Import library
from scipy import stats
stats.probplot(model.resid, dist="norm", plot=plt)
plt.title("MODEL Residuals Q-Q Plot")
plt.legend(['Actual', 'Theoretical'])
plt.show()
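If a numerical check is preferred alongside the plot, here is a quick sketch using SciPy’s Shapiro-Wilk test on the same residuals (null hypothesis: the residuals are normally distributed):

shapiro_stat, shapiro_p = stats.shapiro(model.resid)
print(f'Shapiro-Wilk p-value: {shapiro_p:.4f}')  # p <= 0.05 suggests non-normal residuals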
