Introduction to Linear Regression in Python | Data Interview Questions

Greg Brown
6 min read · May 13, 2021


This post will walk through a practice problem that analyzes NBA players’ season stats. We’ve also provided the practice problem’s Colab Notebook so you can follow along; just copy the notebook and the two Google Sheets (season stats and player data) to your Google Drive.

The data set contains NBA players and their individual season stats. This practice problem will take you through fitting a linear model that uses player stats to estimate win shares per 48 minutes.

Topics covered in this tutorial:

  • Importing data from Google Sheets into Colab
  • Basic data cleaning / analysis
  • Splitting data into test and training sets
  • Training a single variable regression model
  • Training a multivariate regression model
  • Manually calculating Adjusted R-squared and R-squared
  • Calculating Root Mean Square Error
  • Comparing results of 2 regression models

Importing and cleaning data

# Code to connect Colab to Google Drive (save the spreadsheets provided into your drive)
# https://colab.research.google.com/notebooks/io.ipynb#scrollTo=JiJVCmu3dhFa
!pip install --upgrade -q gspread

from google.colab import auth
auth.authenticate_user()

import pandas as pd
import gspread
from oauth2client.client import GoogleCredentials

gc = gspread.authorize(GoogleCredentials.get_application_default())

# Merging dataframes and dropping null values so we avoid errors
# joining
playerAllStats = pd.merge(seasonPlayerStats, playerData, how='left', on='Player')

# dropping rows with null values to avoid errors
playerAllStats = playerAllStats[pd.notnull(playerAllStats['height'])]
playerAllStats = playerAllStats[pd.notnull(playerAllStats['Points'])]
playerAllStats = playerAllStats[pd.notnull(playerAllStats['Games'])]
playerAllStats.head()
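One detail the snippet above glosses over: seasonPlayerStats and playerData must first be read out of the two sheets into DataFrames. Here’s a minimal sketch of that step, assuming each sheet’s first row is a header; the gspread calls are shown commented out since they need the authorized client gc from above.

```python
import pandas as pd

def rows_to_df(rows):
    """Convert gspread's get_all_values() output (first row = header) to a DataFrame."""
    return pd.DataFrame(rows[1:], columns=rows[0])

# With the authorized gspread client `gc` from the auth code above:
# rows = gc.open('season stats').sheet1.get_all_values()
# seasonPlayerStats = rows_to_df(rows)
# rows = gc.open('player data').sheet1.get_all_values()
# playerData = rows_to_df(rows)

# Demo with in-memory rows standing in for a sheet:
rows = [['Player', 'Points'], ['A. Guard', '1500']]
df = rows_to_df(rows)
print(df)
```

Note that gspread returns every cell as a string, which is why the stats need to be converted to numeric values during cleaning below.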

Output: (first five rows of the merged dataframe)

# Viewing all columns in the dataframe. I'll reference this to figure out which fields I need to clean
playerAllStats.columns

Output: (list of column names in the dataframe)

Cleaning data even more

  1. Converting height to inches (so we can use it if we decide to)
  2. Converting some of the stats to a numeric value so we can do analysis
  3. Calculating points per game, total rebounds per game, and assists per game
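The three cleaning steps above can be sketched as follows. This is a minimal illustration on a toy frame standing in for playerAllStats, and the stat column names are assumptions based on the ones used later in the post.

```python
import pandas as pd

# Small sample frame standing in for the merged playerAllStats
playerAllStats = pd.DataFrame({
    'height': ['6-3', '6-11'],          # feet-inches strings
    'Points': ['1500', '900'],          # stats arrive as strings from the sheet
    'TotalRebounds': ['300', '800'],
    'Assists': ['400', '100'],
    'Games': ['75', '60'],
})

# 1. Convert height like "6-3" to total inches
def height_to_inches(h):
    feet, inches = h.split('-')
    return int(feet) * 12 + int(inches)

playerAllStats['heightInches'] = playerAllStats['height'].apply(height_to_inches)

# 2. Convert stat columns from strings to numeric values
for col in ['Points', 'TotalRebounds', 'Assists', 'Games']:
    playerAllStats[col] = pd.to_numeric(playerAllStats[col], errors='coerce')

# 3. Derive the per-game stats used in the models below
playerAllStats['PointsPerGame'] = playerAllStats['Points'] / playerAllStats['Games']
playerAllStats['TotalReboundsPerGame'] = playerAllStats['TotalRebounds'] / playerAllStats['Games']
playerAllStats['AssistsPerGame'] = playerAllStats['Assists'] / playerAllStats['Games']

print(playerAllStats[['heightInches', 'PointsPerGame']])
```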

Question: Can we create a linear regression model using points per game to estimate win shares per 48 minutes?
1. Create a plot of the two statistics
Before we create the model, let’s plot the points per game vs win shares per 48 minutes. This will let us understand the data set and see if we need to remove outliers to improve model accuracy.

import seaborn as sns
sns.regplot(x="WinsSharesPer48Minutes", y="PointsPerGame", data=playerAllStats)

Output: (scatter plot with fitted regression line)

There are obviously some outliers and data points that don’t belong. Let’s see what kind of data we’re working with.

playerAllStats.describe()

Output: (descriptive statistics for each numeric column)

There are a few things I see immediately that would make our data hard to model.

  1. The data set spans many seasons (11,400+ rows). To minimize noise, we’re only going to look at recent player seasons, so I’m going to limit the analysis to 2014 through the present.
  2. We have players who’ve played only 1 game all season. We’re going to require that a player has appeared in at least 70% of a season’s games. A season has 82 games, so that means at least 58 games (0.70 × 82 ≈ 57.4).
  3. This isn’t in the descriptive statistics above, but there are different positions in basketball. For the purpose of this analysis, let’s limit our data to players who play Point Guard (PG), alone or alongside other positions. These are typically the players who score the most points, and grouping players with similar scoring potential reduces skewness.
# filtering on players who have played since 2014 and at least 70% of games in a given season (82-game seasons, so >= 58 games)
playerAllStatsSince2014 = playerAllStats.loc[(playerAllStats['Year'] >= 2014) & (playerAllStats['Games'] >= 58)]

# filtering on PG players
playerAllStatsSince2014PG = playerAllStatsSince2014[playerAllStatsSince2014['Position'].str.contains('PG')].reset_index(drop=True)

sns.regplot(x="WinsSharesPer48Minutes", y="PointsPerGame", data=playerAllStatsSince2014PG)

Output: (scatter plot with fitted regression line for the filtered PG data)

The plot of points per game vs. win shares per 48 minutes looks much better, and our data is in a format we can use. Let’s move on to creating training and test data sets.

2. Splitting data into training and test sets
We’re going to split our data into an independent variable and a dependent variable for both training and test sets using scikit-learn’s train_test_split function.

playerAllStatsSince2014PG_WinShare = playerAllStatsSince2014PG['WinsSharesPer48Minutes']
playerAllStatsSince2014PG_PPG = playerAllStatsSince2014PG['PointsPerGame']

# importing packages and splitting data into training + testing sets
from sklearn.model_selection import train_test_split
train_x, test_x, train_y, test_y = train_test_split(
    playerAllStatsSince2014PG['PointsPerGame'],
    playerAllStatsSince2014PG['WinsSharesPer48Minutes'],
    test_size=0.20, random_state=42)

3. Fitting the data to a linear regression and getting R-squared
Here we’re creating a linear regression model called model_1 and fitting it to the training data. After that we’ll get the R-squared value. R-squared is a value between 0 and 1 that measures how close the data are to the fitted regression line (1 = perfect fit, 0 = no fit).

from sklearn.linear_model import LinearRegression

model_1 = LinearRegression()
model_1.fit(train_x.values.reshape(-1, 1), train_y.values.reshape(-1, 1))
model_1.score(train_x.values.reshape(-1, 1), train_y.values.reshape(-1, 1))

Output:

0.49632072484943907

An R-squared of 0.496 tells us we have a moderately effective fit. Since we’re also going to create a multivariate model, let’s calculate Adjusted R-squared and Root Mean Square Error (RMSE) so we have more statistics to compare. Adjusted R-squared penalizes a model for each additional predictor: Adjusted R² = 1 − (1 − R²)(n − 1) / (n − p − 1), where n is the number of observations and p is the number of predictors. The code below calculates R-squared and Adjusted R-squared.

import numpy as np

X = train_x.values.reshape(-1, 1)
y = train_y.values.reshape(-1, 1)

# compute with formulas from the theory
yhat = model_1.predict(X)
SS_Residual = sum((y - yhat)**2)
SS_Total = sum((y - np.mean(y))**2)
r_squared = 1 - np.divide(SS_Residual, SS_Total)
adjusted_r_squared = 1 - (1 - r_squared) * (len(y) - 1) / (len(y) - X.shape[1] - 1)
print(r_squared, adjusted_r_squared)

Output:

[0.49632072] [0.49326812]

4. Estimate values for the test data
The last step is to estimate Win Shares with our test data. We’ll compare our estimates to the true values and calculate the Root Mean Square Error.

model_1_predict_values = model_1.predict(test_x.values.reshape(-1, 1))
model_1_predict = pd.DataFrame({
    'Points Per Game': test_x.values,
    'Actual Win Share': test_y.values,
    'Predicted Win Share': model_1_predict_values.reshape(1, -1)[0]})

The code below calculates the Root Mean Square Error. The lower the RMSE, the closer our estimates are to the true values.

# Root Mean Square Error on test dataset
np.sqrt(np.mean(np.square(model_1_predict['Actual Win Share'] - model_1_predict['Predicted Win Share'])))

Output:

0.0523034857831081

Question: Can we create a linear regression model using points per game, total rebounds per game, and assists per game to estimate win shares per 48 minutes?
We’ll go through the same process as above (minus the data exploration, which we don’t need to repeat). We’ll create the model and generate the statistics, then compare the single-variable model to the multivariate model.

playerAllStatsSince2014PG_Ys = playerAllStatsSince2014PG['WinsSharesPer48Minutes']
playerAllStatsSince2014PG_Xs = playerAllStatsSince2014PG[['PointsPerGame', 'TotalReboundsPerGame', 'AssistsPerGame']]

# splitting test and training data
train_x, test_x, train_y, test_y = train_test_split(playerAllStatsSince2014PG_Xs, playerAllStatsSince2014PG_Ys, test_size=0.20, random_state=42)

# fitting model
model_1 = LinearRegression()
model_1.fit(train_x, train_y)
model_1.score(train_x, train_y)

Output:

0.5060498809063979

It looks like this model’s R-squared value is already higher than our single-variable model’s. Next we’ll calculate the model’s estimates for Win Shares and the RMSE.

model_1_predict_values = model_1.predict(test_x)
model_1_predict = pd.DataFrame({
    'Actual Win Share': test_y.values,
    'Predicted Win Share': model_1_predict_values.reshape(1, -1)[0]})

# Root Mean Square Error on test dataset
np.sqrt(np.mean(np.square(model_1_predict['Actual Win Share'] - model_1_predict['Predicted Win Share'])))

Output:

0.05174714479057624

X = train_x.values
y = train_y.values

# compute with formulas from the theory
yhat = model_1.predict(X)
SS_Residual = sum((y - yhat)**2)
SS_Total = sum((y - np.mean(y))**2)
r_squared = 1 - np.divide(SS_Residual, SS_Total)
adjusted_r_squared = 1 - (1 - r_squared) * (len(y) - 1) / (len(y) - X.shape[1] - 1)
print(r_squared, adjusted_r_squared)

Output:

0.5060498809063977 0.49695877442001235

Comparing single variable to multivariate linear regression model
Below is a summary of the statistics for the two models we created. Across all three statistics, the multivariate model performs slightly better than the single-variable model.

Statistic | Single variable | Multivariate
R-squared (train) | 0.4963 | 0.5060
Adjusted R-squared (train) | 0.4933 | 0.4970
RMSE (test) | 0.05230 | 0.05175
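The whole comparison workflow can be condensed into a short, self-contained sketch. The synthetic data below is a stand-in for the NBA stats (the coefficients and noise level are invented for illustration, not estimated from the real data set), but the split/fit/score loop mirrors the steps above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the filtered PG data: win shares driven by
# scoring, rebounding, and assists, plus noise (coefficients invented)
rng = np.random.default_rng(42)
n = 300
ppg = rng.uniform(5, 30, n)
rpg = rng.uniform(1, 12, n)
apg = rng.uniform(1, 11, n)
ws48 = 0.002 * ppg + 0.006 * rpg + 0.005 * apg + rng.normal(0, 0.02, n)

df = pd.DataFrame({'PointsPerGame': ppg, 'TotalReboundsPerGame': rpg,
                   'AssistsPerGame': apg, 'WinsSharesPer48Minutes': ws48})

def rmse_for(features):
    """Fit a linear model on the given feature columns, return test RMSE."""
    X = df[features]
    y = df['WinsSharesPer48Minutes']
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.20, random_state=42)
    model = LinearRegression().fit(X_tr, y_tr)
    pred = model.predict(X_te)
    return np.sqrt(np.mean((y_te - pred) ** 2))

rmse_single = rmse_for(['PointsPerGame'])
rmse_multi = rmse_for(['PointsPerGame', 'TotalReboundsPerGame', 'AssistsPerGame'])
print(rmse_single, rmse_multi)
```

Because the synthetic win shares genuinely depend on rebounds and assists, the multivariate model recovers more of the signal and posts the lower test RMSE, which is the same qualitative result the NBA data gave us.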

Originally published at https://www.interviewqs.com.
