Introduction to Linear Regression in Python | Data Interview Questions

Photo by Isaac Smith on Unsplash

Introduction to Linear Regression using Python

# Code to connect Colab to Google Drive (save the provided spreadsheet to your Drive)
# https://colab.research.google.com/notebooks/io.ipynb#scrollTo=JiJVCmu3dhFa
!pip install --upgrade -q gspread

from google.colab import auth
auth.authenticate_user()

import gspread
from oauth2client.client import GoogleCredentials
gc = gspread.authorize(GoogleCredentials.get_application_default())

# Libraries used in the rest of the walkthrough
import numpy as np
import pandas as pd
import seaborn as sns

# Merging dataframes and dropping null values so we avoid errors
# (seasonPlayerStats and playerData are assumed to be already loaded from the spreadsheet)
# joining
playerAllStats = pd.merge(seasonPlayerStats, playerData, how='left', on='Player')

# dropping rows with null values in these columns to avoid errors
playerAllStats = playerAllStats[pd.notnull(playerAllStats['height'])]
playerAllStats = playerAllStats[pd.notnull(playerAllStats['Points'])]
playerAllStats = playerAllStats[pd.notnull(playerAllStats['Games'])]

playerAllStats.head()

Output:

# Viewing all columns that we have in the dataframe. I'll be referencing this to figure out which fields I need to clean
playerAllStats.columns

Output:

sns.regplot(x="WinsSharesPer48Minutes", y = "PointsPerGame", data=playerAllStats)

Output:

playerAllStats.describe()

Output:

# filtering on players who have played since 2014 and have played at least 70% of the games in a given season (assuming 82 games per season)
playerAllStatsSince2014 = playerAllStats.loc[(playerAllStats['Year'] >= 2014) & (playerAllStats['Games'] >= 58)]

# filtering on PG players
playerAllStatsSince2014PG = playerAllStatsSince2014[playerAllStatsSince2014['Position'].str.contains('PG')].reset_index(drop=True)

sns.regplot(x="WinsSharesPer48Minutes", y="PointsPerGame", data=playerAllStatsSince2014PG)

Output:

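playerAllStatsSince2014PG_WinShare = playerAllStatsSince2014PG['WinsSharesPer48Minutes']
playerAllStatsSince2014PG_PPG = playerAllStatsSince2014PG['PointsPerGame']

# importing packages and splitting out data into training + testing sets
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# PointsPerGame is the feature (x) and WinsSharesPer48Minutes is the target (y)
train_x, test_x, train_y, test_y = train_test_split(
    playerAllStatsSince2014PG['PointsPerGame'],
    playerAllStatsSince2014PG['WinsSharesPer48Minutes'],
    test_size=0.20, random_state=42)

model_1 = LinearRegression()
# a pandas Series has no reshape method, so convert to a NumPy array first;
# scikit-learn expects the features as a 2D array of shape (n_samples, n_features)
model_1.fit(train_x.values.reshape(-1, 1), train_y.values.reshape(-1, 1))

# R-squared on the training data
model_1.score(train_x.values.reshape(-1, 1), train_y.values.reshape(-1, 1))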

Output:

0.49632072484943907
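To make it clearer what the fit step is doing with a single feature, here is a minimal sketch of ordinary least squares using only the standard library. The function name `fit_simple_ols` and the data points are made up for illustration; they are not from the NBA spreadsheet, and this is not how scikit-learn solves the problem internally.

```python
# Minimal sketch: one-feature ordinary least squares in plain Python.
# The data below is hypothetical and only illustrates the arithmetic.

def fit_simple_ols(xs, ys):
    """Return (slope, intercept) that minimize the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel out)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical points-per-game (feature) and win-share (target) values
ppg = [5.0, 10.0, 15.0, 20.0, 25.0]
win_share = [0.05, 0.08, 0.10, 0.15, 0.17]

slope, intercept = fit_simple_ols(ppg, win_share)

# Predict the win share for a hypothetical player scoring 18 points per game
predicted = intercept + slope * 18.0
print(slope, intercept, predicted)
```

The slope plays the role of `model_1.coef_` and the intercept the role of `model_1.intercept_` in the one-feature case.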

X = train_x.values.reshape(-1, 1)
y = train_y.values.reshape(-1, 1)

# compute with formulas from the theory
yhat = model_1.predict(X)
SS_Residual = sum((y-yhat)**2)
SS_Total = sum((y-np.mean(y))**2)
r_squared = 1 - np.divide(SS_Residual, SS_Total)
#r_squared = 1 - np.exp((float(SS_Residual))/SS_Total)
adjusted_r_squared = 1 - (1-r_squared)*(len(y)-1)/(len(y)-X.shape[1]-1)
print(r_squared, adjusted_r_squared)

Output:

[0.49632072] [0.49326812]
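The two formulas used above can be tried on a small example without NumPy. This is a rough sketch with made-up actual and predicted values; the helper name `r2_and_adjusted` is hypothetical.

```python
# Sketch: R-squared and adjusted R-squared from sums of squares, standard library only.
# The actual/predicted values are hypothetical, not taken from the NBA data.

def r2_and_adjusted(actual, predicted, n_features):
    n = len(actual)
    mean_actual = sum(actual) / n
    # Residual sum of squares: variation the model fails to explain
    ss_residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total sum of squares: variation around the mean
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1 - ss_residual / ss_total
    # Adjusted R-squared penalizes each extra feature
    adjusted = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    return r2, adjusted


actual = [0.05, 0.08, 0.10, 0.15, 0.17]
predicted = [0.048, 0.079, 0.110, 0.141, 0.172]

r2, adjusted = r2_and_adjusted(actual, predicted, n_features=1)
print(r2, adjusted)
```

Adjusted R-squared is always at or below plain R-squared when there is at least one feature, which is why the second number printed is the smaller one.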

model_1_predict_values = model_1.predict(test_x.values.reshape(-1, 1))
model_1_predict = pd.DataFrame({'Points Per Game': test_x.values,
                                'Actual Win Share': test_y.values,
                                'Predicted Win Share': model_1_predict_values.reshape(1, -1)[0]})

# Root mean square error on test dataset
np.sqrt(np.mean(np.square(model_1_predict['Actual Win Share'] - model_1_predict['Predicted Win Share'])))

Output:

0.0523034857831081
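For reference, root mean square error is the square root of the average squared difference between actual and predicted values, in the same units as the target. A small standard-library sketch with hypothetical numbers (the `rmse` helper is mine, not part of the notebook):

```python
import math

# Sketch: root mean square error in plain Python, with hypothetical values.

def rmse(actual, predicted):
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


actual_win_share = [0.10, 0.20, 0.15, 0.05]
predicted_win_share = [0.12, 0.17, 0.15, 0.09]

error = rmse(actual_win_share, predicted_win_share)
print(error)
```

Because the error is squared before averaging, one large miss moves RMSE more than several small ones.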

playerAllStatsSince2014PG_Ys = playerAllStatsSince2014PG['WinsSharesPer48Minutes']
playerAllStatsSince2014PG_Xs = playerAllStatsSince2014PG[['PointsPerGame', 'TotalReboundsPerGame', 'AssistsPerGame']]

# splitting test and training data
train_x, test_x, train_y, test_y = train_test_split(
    playerAllStatsSince2014PG_Xs, playerAllStatsSince2014PG_Ys, test_size=0.20, random_state=42)

# fitting model
model_1 = LinearRegression()
model_1.fit(train_x, train_y)
model_1.score(train_x, train_y)

Output:

0.5060498809063979

model_1_predict_values = model_1.predict(test_x)
model_1_predict = pd.DataFrame({'Actual Win Share': test_y.values,
                                'Predicted Win Share': model_1_predict_values.reshape(1, -1)[0]})

# Root mean square error on test dataset
np.sqrt(np.mean(np.square(model_1_predict['Actual Win Share'] - model_1_predict['Predicted Win Share'])))

Output:

0.05174714479057624

X = train_x.values
y = train_y.values

# compute with formulas from the theory
yhat = model_1.predict(X)
SS_Residual = sum((y-yhat)**2)
SS_Total = sum((y-np.mean(y))**2)
r_squared = 1 - np.divide(SS_Residual, SS_Total)
#r_squared = 1 - np.exp((float(SS_Residual))/SS_Total)
adjusted_r_squared = 1 - (1-r_squared)*(len(y)-1)/(len(y)-X.shape[1]-1)
print(r_squared, adjusted_r_squared)

Output:

0.5060498809063977 0.49695877442001235
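As a sanity check on the feature penalty, the adjusted values printed in this walkthrough can be reproduced from the plain R-squared values alone. The training-set size of 167 below is an assumption: it is not stated anywhere above, but it is the value that makes both printed pairs agree with the adjusted R-squared formula.

```python
# Sketch: reproduce the printed adjusted R-squared values from the plain R-squared values.
# N_TRAIN = 167 is an assumption inferred from the printed numbers, not given in the article.

def adjusted_r2(r2, n_samples, n_features):
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


N_TRAIN = 167  # assumed number of training rows

# One feature: PointsPerGame
one_feature = adjusted_r2(0.49632072484943907, N_TRAIN, n_features=1)

# Three features: PointsPerGame, TotalReboundsPerGame, AssistsPerGame
three_features = adjusted_r2(0.5060498809063977, N_TRAIN, n_features=3)

print(one_feature, three_features)
```

Under that assumption, adding rebounds and assists lifts R-squared by about 0.010 but adjusted R-squared by only about 0.004, since part of the gain is offset by the penalty for the two extra features.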
