Using Python to improve my Fantasy Basketball team

Front Page

Nikolaos Michas

PyCon Balkan 2018

What is Fantasy Basketball?

Fantasy sport

From Wikipedia, the free encyclopedia

A fantasy sport (also known less commonly as rotisserie or roto) is a type of online game where participants assemble imaginary or virtual teams of real players of a professional sport. These teams compete based on the statistical performance of those players in actual games. This performance is converted into points that are compiled and totaled according to a roster selected by each fantasy team's manager. These point systems can be simple enough to be manually calculated by a "league commissioner" who coordinates and manages the overall league, or points can be compiled and calculated using computers tracking actual results of the professional sport. In fantasy sports, team owners draft, trade and cut (drop) players, analogously to real sports.

Basic Rules

  • Team owner drafts a team of 13 players

  • Each week owners select 10 active players

  • Owners collect points based on their picks' performance


Each player collects points from the following statistics:

  • Field Goals Made (FGM): 1.5
  • Field Goals Attempted (FGA): -0.5
  • Free Throws Made (FTM): 1
  • Free Throws Attempted (FTA): -0.75
  • Three Pointers Made (3PM): 1
  • Three Pointers Attempted (3PA): -0.25
  • Offensive Rebounds (OREB): 0.5
  • Rebounds (REB): 1
  • Assists (AST): 2
  • Steals (STL): 2.5
  • Blocks (BLK): 2.5
  • Turnovers (TO): -1.75
  • Points (PTS): 1
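Applied to a made-up stat line (the numbers below are for illustration only, not a real player's), the weights combine like this:

```python
# Scoring weights from the list above, keyed by stat abbreviation
WEIGHTS = {
    'FGM': 1.5, 'FGA': -0.5, 'FTM': 1.0, 'FTA': -0.75,
    '3PM': 1.0, '3PA': -0.25, 'OREB': 0.5, 'REB': 1.0,
    'AST': 2.0, 'STL': 2.5, 'BLK': 2.5, 'TO': -1.75, 'PTS': 1.0,
}

def fantasy_score(stats):
    """Sum each stat multiplied by its weight."""
    return sum(WEIGHTS[k] * v for k, v in stats.items())

# A hypothetical per-game stat line
line = {'FGM': 8, 'FGA': 16, 'FTM': 4, 'FTA': 5, '3PM': 1, '3PA': 3,
        'OREB': 2, 'REB': 10, 'AST': 5, 'STL': 1, 'BLK': 1, 'TO': 3, 'PTS': 21}
print(fantasy_score(line))  # 46.25
```

Note how the attempt penalties reward efficiency: missed shots cost points, so a high-volume, low-percentage scorer is worth less than the raw PTS column suggests.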

How to improve my team?

  • Improve Draft Process
  • Make smarter moves during the season (based on schedule and form)

Predict player performance

  • Use previous season's statistics to predict the next one
  • Machine Learning
  • Regression
  • Neural Networks

Use Python

  • Pandas

  • Beautiful Soup

  • Jupyter

  • Seaborn / Plotly

  • Scikit Learn

  • Keras

Step 0.

Pandas

https://pandas.pydata.org/pandas-docs/stable/

pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way toward this goal.

(venv) [nmichas@my-pc]$ pip install pandas

pandas.DataFrame

Two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary pandas data structure.

In [1]:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(6,4), columns=list('ABCD'))
In [2]:
df
Out[2]:
A B C D
0 -0.023940 -1.116884 -1.420836 0.026762
1 0.472838 0.537210 -0.174598 -1.972429
2 0.030127 -0.493965 -1.710277 -1.127274
3 -0.838290 -0.340422 0.982786 -0.291325
4 0.942333 0.914386 -1.218660 -2.353766
5 0.326871 -0.797093 -0.446801 -0.366841
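The point about arithmetic aligning on labels deserves a tiny demo (toy data, not from the talk):

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20], index=['y', 'z'])

# Addition aligns on the index: 'x' has no partner in b, so the sum there is NaN
c = a + b
print(c['y'], c['z'])  # 12.0 23.0
print(pd.isna(c['x']))  # True
```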

Step 1.

Collect Statistics from Previous Years

Beautiful Soup

https://www.crummy.com/software/BeautifulSoup/

Install by:

(venv) [nmichas@my-pc]$ pip install beautifulsoup4
(venv) [nmichas@my-pc]$ pip install lxml

Parse and read all player statistics from the Basketball-Reference website

https://www.basketball-reference.com/players/a/antetgi01.html

In [3]:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.basketball-reference.com/players/a/antetgi01.html'

r = requests.get(url)
s = BeautifulSoup(r.text, 'lxml')
In [4]:
player_df = pd.read_html(r.text)[0]
player_df.head()
Out[4]:
Season Age Tm Lg Pos G GS MP FG FGA ... FT% ORB DRB TRB AST STL BLK TOV PF PTS
0 2013-14 19.0 MIL NBA SF 77 23 24.6 2.2 5.4 ... 0.683 1.0 3.4 4.4 1.9 0.8 0.8 1.6 2.2 6.8
1 2014-15 20.0 MIL NBA SG 81 71 31.4 4.7 9.6 ... 0.741 1.2 5.5 6.7 2.6 0.9 1.0 2.1 3.1 12.7
2 2015-16 21.0 MIL NBA PG 80 79 35.3 6.4 12.7 ... 0.724 1.4 6.2 7.7 4.3 1.2 1.4 2.6 3.2 16.9
3 2016-17 22.0 MIL NBA SF 80 80 35.6 8.2 15.7 ... 0.770 1.8 7.0 8.8 5.4 1.6 1.9 2.9 3.1 22.9
4 2017-18 23.0 MIL NBA PF 75 75 36.7 9.9 18.7 ... 0.760 2.1 8.0 10.0 4.8 1.5 1.4 3.0 3.1 26.9

5 rows × 30 columns

In [5]:
COLUMNS = ['Season', 'Age', 'Tm', 'Lg', 'Pos', 'G', 'GS', 
           'MP', 'FG', 'FGA', 'FG%', '3P', '3PA', '3P%', 
           '2P', '2PA', '2P%', 'eFG%', 'FT', 'FTA', 'FT%', 
           'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 
           'PF', 'PTS']
player_df = player_df[COLUMNS]
player_df.head()
Out[5]:
Season Age Tm Lg Pos G GS MP FG FGA ... FT% ORB DRB TRB AST STL BLK TOV PF PTS
0 2013-14 19.0 MIL NBA SF 77 23 24.6 2.2 5.4 ... 0.683 1.0 3.4 4.4 1.9 0.8 0.8 1.6 2.2 6.8
1 2014-15 20.0 MIL NBA SG 81 71 31.4 4.7 9.6 ... 0.741 1.2 5.5 6.7 2.6 0.9 1.0 2.1 3.1 12.7
2 2015-16 21.0 MIL NBA PG 80 79 35.3 6.4 12.7 ... 0.724 1.4 6.2 7.7 4.3 1.2 1.4 2.6 3.2 16.9
3 2016-17 22.0 MIL NBA SF 80 80 35.6 8.2 15.7 ... 0.770 1.8 7.0 8.8 5.4 1.6 1.9 2.9 3.1 22.9
4 2017-18 23.0 MIL NBA PF 75 75 36.7 9.9 18.7 ... 0.760 2.1 8.0 10.0 4.8 1.5 1.4 3.0 3.1 26.9

5 rows × 30 columns

In [6]:
import re

player_df['Height'] = s.find(itemprop='height').get_text()
player_df['Weight'] = s.find(itemprop='weight').get_text()

regex = re.compile(
    '(Guard|Forward|Point Guard|Center|Power Forward|Shooting Guard|Small Forward)')
player_df['Position'] = s.findAll(text=regex)[0].strip().split('\n')[0]
player_df.columns
Out[6]:
Index(['Season', 'Age', 'Tm', 'Lg', 'Pos', 'G', 'GS', 'MP', 'FG', 'FGA', 'FG%',
       '3P', '3PA', '3P%', '2P', '2PA', '2P%', 'eFG%', 'FT', 'FTA', 'FT%',
       'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS', 'Height',
       'Weight', 'Position'],
      dtype='object')

Read All Player Names

(venv) [nmichas@my-pc]$ python get_all_players.py

players.csv should be a CSV file of this form:

name,shortname,href
Alaa Abdelnaby,abdelal01,/players/a/abdelal01.html
Zaid Abdul-Aziz,abdulza01,/players/a/abdulza01.html
Kareem Abdul-Jabbar,abdulka01,/players/a/abdulka01.html
Mahmoud Abdul-Rauf,abdulma02,/players/a/abdulma02.html
Tariq Abdul-Wahad,abdulta01,/players/a/abdulta01.html
Shareef Abdur-Rahim,abdursh01,/players/a/abdursh01.html
Tom Abernethy,abernto01,/players/a/abernto01.html
Forest Able,ablefo01,/players/a/ablefo01.html
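The get_all_players.py script itself is not shown; below is a minimal sketch of how such a scraper could work. The HTML snippet only mimics the structure of a Basketball-Reference player index page, and the helper name `extract_players` is hypothetical:

```python
from bs4 import BeautifulSoup

# A toy snippet imitating a players index page (not the real site markup)
html = '''
<table id="players">
  <tr><th><a href="/players/a/abdelal01.html">Alaa Abdelnaby</a></th></tr>
  <tr><th><a href="/players/a/abdulza01.html">Zaid Abdul-Aziz</a></th></tr>
</table>
'''

def extract_players(page):
    soup = BeautifulSoup(page, 'html.parser')
    rows = []
    for a in soup.select('table#players a'):
        href = a['href']
        # the short name is the file name without the .html extension
        shortname = href.rsplit('/', 1)[-1].replace('.html', '')
        rows.append((a.get_text(), shortname, href))
    return rows

print(extract_players(html))
```

The real script would fetch each letter's index page with requests and write the rows out with the csv module.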

Read all Statistics for every player

(venv) [nmichas@my-pc]$ python get_all_seasons.py

seasons.csv should be a CSV file of this form:

Player,ShortName,Height,Weight,Position,BirthPlace,SeasonURL,Season,Age,Tm,Lg,Pos,G,...
Alaa Abdelnaby,abdelal01,6-10,240lb,Power Forward,Egypt,/players/a/abdelal01/gamelog/1991/,1990-91,22.0,POR,NBA,PF,43,0,6.7,1.3,2.7,0.474,0.0,0.0,,1.3,...
Alaa Abdelnaby,abdelal01,6-10,240lb,Power Forward,Egypt,/players/a/abdelal01/gamelog/1992/,1991-92,23.0,POR,NBA,PF,71,1,13.2,2.5,5.1,0.493,0.0,0.0,,2.5,...
Alaa Abdelnaby,abdelal01,6-10,240lb,Power Forward,Egypt,/players/a/abdelal01/gamelog/1993/,1992-93,24.0,TOT,NBA,PF,75,52,17.5,3.3,6.3,0.518,0.0,0.0,0.0,...

Step 2.

Clean the Data

In order to perform analysis and make predictions, we need entirely numerical values.

In [7]:
df = pd.read_csv('seasons.csv')
df.sample(1)
Out[7]:
Player ShortName Height Weight Position BirthPlace SeasonURL Season Age Tm ... FT% ORB DRB TRB AST STL BLK TOV PF PTS
2165 Benoit Benjamin benjabe01 7-0 250lb Center Louisiana /players/b/benjabe01/gamelog/1996/ 1995-96 31.0 MIL ... 0.732 1.6 4.7 6.2 0.7 0.5 1.0 1.6 2.6 7.8

1 rows × 37 columns

Remove per-team stats for players that changed teams mid-season (keeping the combined TOT row)

In [8]:
df[['Player', 'ShortName', 'Season', 'Lg', 'Tm']].head(5)
Out[8]:
Player ShortName Season Lg Tm
0 Alaa Abdelnaby abdelal01 1990-91 NBA POR
1 Alaa Abdelnaby abdelal01 1991-92 NBA POR
2 Alaa Abdelnaby abdelal01 1992-93 NBA TOT
3 Alaa Abdelnaby abdelal01 1992-93 NBA MIL
4 Alaa Abdelnaby abdelal01 1992-93 NBA BOS
In [9]:
df.drop(
    df[df.duplicated(['ShortName', 'Season'], keep='first')].index, 
    inplace=True)
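On a toy frame, `keep='first'` keeps the TOT row (the combined totals, which Basketball-Reference lists first, as in the output above) and drops the per-team rows:

```python
import pandas as pd

toy = pd.DataFrame({
    'ShortName': ['abdelal01'] * 3,
    'Season': ['1992-93'] * 3,
    'Tm': ['TOT', 'MIL', 'BOS'],
})
# duplicated(..., keep='first') marks every repeat after the first occurrence
mask = toy.duplicated(['ShortName', 'Season'], keep='first')
toy = toy.drop(toy[mask].index)
print(toy['Tm'].tolist())  # ['TOT']
```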

Drop rows for non-NBA leagues and total career statistics

In [10]:
df[['Player', 'ShortName', 'Season', 'Lg', 'Tm']][df.Lg != 'NBA'].head(5)
Out[10]:
Player ShortName Season Lg Tm
92 John Abramovic abramjo01 1946-47 BAA PIT
93 John Abramovic abramjo01 1947-48 BAA TOT
96 John Abramovic abramjo01 Career BAA NaN
151 Don Adams adamsdo01 1974-75 TOT TOT
154 Don Adams adamsdo01 1975-76 TOT TOT
In [11]:
df.drop(df[df.Lg == 'ABA'].index, inplace=True)
df.drop(df[df.Lg == 'BAA'].index, inplace=True)
df.drop(df[df.Lg == 'TOT'].index, inplace=True)
df.drop(df[df.Season == 'Career'].index, inplace=True)
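The four drops above can also be expressed as a single boolean filter; an equivalent sketch on toy data:

```python
import pandas as pd

toy = pd.DataFrame({
    'Lg': ['NBA', 'BAA', 'ABA', 'TOT', 'NBA'],
    'Season': ['1990-91', '1946-47', '1974-75', '1975-76', 'Career'],
})
# Keep only NBA rows that are not career totals
toy = toy[(toy.Lg == 'NBA') & (toy.Season != 'Career')]
print(toy['Season'].tolist())  # ['1990-91']
```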

Remove data from players with missing information

In [12]:
df[['Player', 'ShortName', 'Season', 'Lg', 'Tm', '3P', 'GS']][df['3P'].isnull()].head(5)
Out[12]:
Player ShortName Season Lg Tm 3P GS
10 Zaid Abdul-Aziz abdulza01 1968-69 NBA TOT NaN NaN
13 Zaid Abdul-Aziz abdulza01 1969-70 NBA MIL NaN NaN
14 Zaid Abdul-Aziz abdulza01 1970-71 NBA SEA NaN NaN
15 Zaid Abdul-Aziz abdulza01 1971-72 NBA SEA NaN NaN
16 Zaid Abdul-Aziz abdulza01 1972-73 NBA HOU NaN NaN
In [13]:
# drop seasons from before the 3-point shot was tracked
df.dropna(subset=['3P', '3PA'], inplace=True)
# drop players without info for Games Started
df.dropna(subset=['GS'], inplace=True)
# drop players with no height-weight info
df.dropna(subset=['Height', 'Weight'], inplace=True)

Convert Height and Weight to numeric

In [14]:
df[['Height', 'Weight']].sample(5)
Out[14]:
Height Weight
30612 6-8 199lb
3868 6-9 245lb
27836 6-2 186lb
24749 6-8 210lb
13275 6-5 184lb
In [15]:
def height_to_cm(h):
    ft, inch = h.split('-')
    inch = int(inch) + int(ft) * 12
    return round(inch * 2.54, 1)


def remove_lb(w):
    return int(w.replace('lb', ''))

df['Height'] = df['Height'].map(height_to_cm)
df['Weight'] = df['Weight'].map(remove_lb)
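A quick sanity check of the two converters (redefined here so the snippet stands alone):

```python
def height_to_cm(h):
    # '6-10' means 6 feet 10 inches
    ft, inch = h.split('-')
    inch = int(inch) + int(ft) * 12
    return round(inch * 2.54, 1)

def remove_lb(w):
    return int(w.replace('lb', ''))

print(height_to_cm('6-10'))  # 208.3  (6 ft 10 in = 82 in)
print(remove_lb('240lb'))    # 240
```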

Convert season to numeric

In [16]:
min_season = int(df['Season'].min().split('-')[0])

def get_season(row):
    return int(row['Season'].split('-')[0]) - min_season

df['Season_Numeric'] = df.apply(get_season, axis=1)
In [17]:
df[['Season', 'Season_Numeric']].sample(5)
Out[17]:
Season Season_Numeric
12132 2010-11 31
30148 1982-83 3
19979 1982-83 3
3439 1981-82 2
7663 1990-91 11
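The row-wise apply can also be replaced by a vectorized string slice, which is faster on a large frame; a sketch on toy data:

```python
import pandas as pd

toy = pd.DataFrame({'Season': ['1982-83', '2010-11', '1979-80']})
# The first four characters of 'YYYY-YY' are the starting year
start_year = toy['Season'].str[:4].astype(int)
toy['Season_Numeric'] = start_year - start_year.min()
print(toy['Season_Numeric'].tolist())  # [3, 31, 0]
```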

Convert position to an array of Boolean values

In [18]:
def get_position_matrix(position):
    positions = [0, 0, 0, 0, 0]
    if 'Point Guard' in position:
        positions[0] = 1
    if 'Shooting Guard' in position:
        positions[1] = 1
    if 'Small Forward' in position:
        positions[2] = 1
    if 'Power Forward' in position:
        positions[3] = 1
    if 'Center' in position:
        positions[4] = 1
    if 'Guard' in position and 'Point Guard' not in position and 'Shooting Guard' not in position:
        positions[0] = 1
        positions[1] = 1
    if 'Forward' in position and 'Power Forward' not in position and 'Small Forward' not in position:
        positions[2] = 1
        positions[3] = 1
    return positions

position_matrix = []
for i, season_row in df.iterrows():
    position_matrix.append(get_position_matrix(season_row['Position']))
position_matrix = np.array(position_matrix)
for i, position in enumerate(['PG', 'SG', 'SF', 'PF', 'C']):
    df['plays_' + position] = position_matrix[:, i]
In [19]:
giannis_df = df[df['Player'] == 'Giannis Antetokounmpo']
giannis_df[['Player', 'Position', 'plays_PG', 'plays_SG', 'plays_SF', 'plays_PF', 'plays_C']].head(1)
Out[19]:
Player Position plays_PG plays_SG plays_SF plays_PF plays_C
770 Giannis Antetokounmpo Small Forward and Point Guard and Shooting Gua... 1 1 1 1 0
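The same mapping can be written more compactly; an equivalent sketch (same PG/SG/SF/PF/C order):

```python
def get_position_matrix(position):
    names = ['Point Guard', 'Shooting Guard', 'Small Forward',
             'Power Forward', 'Center']
    positions = [1 if name in position else 0 for name in names]
    # A bare 'Guard' or 'Forward' (no qualifier) maps to both matching slots
    if 'Guard' in position and not (positions[0] or positions[1]):
        positions[0] = positions[1] = 1
    if 'Forward' in position and not (positions[2] or positions[3]):
        positions[2] = positions[3] = 1
    return positions

print(get_position_matrix('Center and Power Forward'))  # [0, 0, 0, 1, 1]
print(get_position_matrix('Guard'))                     # [1, 1, 0, 0, 0]
```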

Calculate score

In [20]:
def get_score(row):
    return row['FG'] * 1.5 + row['FGA'] * (-0.5) + row['FT'] + \
        row['FTA'] * (-0.75) + row['3P'] + row['3PA'] * (-0.25) + \
        row['ORB'] * 0.5 + row['TRB'] + row['AST'] * 2 + \
        row['STL'] * 2.5 + row['BLK'] * 2.5 + \
        row['TOV'] * (-1.75) + row['PTS']

df['Score'] = df.apply(get_score, axis=1)
In [21]:
giannis_df = df[df['Player'] == 'Giannis Antetokounmpo']
giannis_df[['Player', 'Season', 'Score']].head(5)
Out[21]:
Player Season Score
770 Giannis Antetokounmpo 2013-14 17.275
771 Giannis Antetokounmpo 2014-15 28.475
772 Giannis Antetokounmpo 2015-16 39.025
773 Giannis Antetokounmpo 2016-17 51.675
774 Giannis Antetokounmpo 2017-18 55.300

Find next season's score: the target column

In [22]:
df.sort_values(['ShortName', 'Season_Numeric'], inplace=True)
g = df.groupby(['ShortName'])
next_season_score = list()
for i, gr in g:
    next_season_score += list(gr['Score'].shift(-1))
df['Next_Season_Score'] = next_season_score
df.dropna(subset=['Next_Season_Score'], inplace=True)
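The groupby loop above is equivalent to a single grouped shift, assuming the frame is already sorted by player and season; a sketch on toy data:

```python
import pandas as pd

toy = pd.DataFrame({
    'ShortName': ['a', 'a', 'a', 'b', 'b'],
    'Season_Numeric': [0, 1, 2, 0, 1],
    'Score': [10.0, 12.0, 15.0, 7.0, 9.0],
})
# shift(-1) within each group pulls the following season's score up one row
toy['Next_Season_Score'] = toy.groupby('ShortName')['Score'].shift(-1)
# each player's final season has no "next season" and is dropped
toy = toy.dropna(subset=['Next_Season_Score'])
print(toy['Next_Season_Score'].tolist())  # [12.0, 15.0, 9.0]
```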
In [23]:
giannis_df = df[df['Player'] == 'Giannis Antetokounmpo']
giannis_df[['Player', 'Season', 'Season_Numeric', 'Score', 'Next_Season_Score']].head(5)
Out[23]:
Player Season Season_Numeric Score Next_Season_Score
770 Giannis Antetokounmpo 2013-14 34 17.275 28.475
771 Giannis Antetokounmpo 2014-15 35 28.475 39.025
772 Giannis Antetokounmpo 2015-16 36 39.025 51.675
773 Giannis Antetokounmpo 2016-17 37 51.675 55.300
In [24]:
# drop unnecessary columns
df.drop(
    ['BirthPlace', 'Season', 'Position', 'Player', 
     'SeasonURL', 'Lg', 'ShortName', 'Pos', 'Tm', 'Score'], 
    axis=1, inplace=True)
In [25]:
df.fillna(0, inplace=True)
df.reset_index(inplace=True)
df.drop(['index'], axis=1, inplace=True)
In [26]:
df.sample(5)
Out[26]:
Height Weight Age G GS MP FG FGA FG% 3P ... TOV PF PTS Season_Numeric plays_PG plays_SG plays_SF plays_PF plays_C Next_Season_Score
9259 182.9 170 28.0 75.0 74.0 31.7 6.4 13.1 0.484 1.6 ... 2.6 1.4 18.2 13 1 0 0 0 0 39.475
3652 200.7 209 32.0 50.0 5.0 9.2 0.8 2.2 0.369 0.0 ... 0.4 1.0 1.9 18 0 0 1 0 0 1.300
4787 195.6 195 24.0 58.0 26.0 18.7 3.8 8.4 0.458 0.0 ... 1.0 1.5 9.2 8 0 1 0 0 0 2.725
2915 185.4 189 28.0 63.0 17.0 23.0 3.6 9.3 0.384 1.1 ... 0.9 1.6 9.5 22 1 1 0 0 0 21.000
880 205.7 235 28.0 56.0 6.0 16.7 3.9 7.3 0.532 0.3 ... 1.2 1.6 9.4 37 0 0 1 1 0 25.400

5 rows × 35 columns

We can now use the following function to get an entirely numeric dataframe.

from clear_seasons_data import get_clear_final_data
df = get_clear_final_data()

Step 3.

Inspect and Visualize our data.

We will use pandas to get some insights into our data, and Seaborn and Plotly to easily create plots and understand the relationships that may exist between our dataframe's columns.

In [27]:
from clear_seasons_data import get_clear_final_data
df = get_clear_final_data()
df.describe()
Out[27]:
Height Weight Age G GS MP FG FGA FG% 3P ... TOV PF PTS plays_PG plays_SG plays_SF plays_PF plays_C Season_Numeric Next_Season_Score
count 12648.000000 12648.000000 12648.000000 12648.000000 12648.000000 12648.000000 12648.000000 12648.000000 12648.000000 12648.000000 ... 12648.000000 12648.000000 12648.000000 12648.000000 12648.000000 12648.000000 12648.000000 12648.000000 12648.000000 12648.000000
mean 200.705218 216.453036 26.551312 60.043722 31.035421 22.369892 3.533491 7.694790 0.450584 0.438536 ... 1.391034 2.100253 9.305021 0.253716 0.332701 0.327087 0.355946 0.303843 50.780519 19.671316
std 9.413949 27.616132 3.907472 22.078819 30.437214 9.816474 2.275182 4.706923 0.074528 0.609560 ... 0.820865 0.815102 6.079629 0.435154 0.471199 0.469168 0.478818 0.459934 10.129191 12.078043
min 160.000000 133.000000 18.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 30.000000 -4.000000
25% 193.000000 195.000000 24.000000 47.000000 2.000000 14.400000 1.700000 3.900000 0.415000 0.000000 ... 0.800000 1.500000 4.500000 0.000000 0.000000 0.000000 0.000000 0.000000 42.000000 10.050000
50% 200.700000 215.000000 26.000000 68.000000 19.000000 22.300000 3.100000 6.800000 0.452000 0.100000 ... 1.200000 2.100000 8.000000 0.000000 0.000000 0.000000 0.000000 0.000000 51.000000 17.500000
75% 208.300000 235.000000 29.000000 79.000000 62.000000 30.800000 4.900000 10.800000 0.490000 0.700000 ... 1.900000 2.700000 13.000000 1.000000 1.000000 1.000000 1.000000 1.000000 60.000000 27.725000
max 231.100000 330.000000 42.000000 85.000000 83.000000 43.700000 13.400000 27.800000 1.000000 5.100000 ... 5.700000 6.000000 37.100000 1.000000 1.000000 1.000000 1.000000 1.000000 67.000000 68.050000

8 rows × 35 columns

In [28]:
df.groupby("Age").mean()[['FG', '3P%', 'FT%', '2P%', 'TRB', 'AST', 'BLK', 'TOV', 'PF', 'PTS']].head(10)
Out[28]:
FG 3P% FT% 2P% TRB AST BLK TOV PF PTS
Age
18.0 1.416667 0.159917 0.608917 0.432083 1.941667 0.566667 0.383333 0.658333 1.358333 3.708333
19.0 2.652041 0.201745 0.669204 0.459949 3.220408 1.321429 0.524490 1.162245 1.716327 6.943878
20.0 2.969231 0.215640 0.680802 0.457838 3.682996 1.560324 0.522267 1.277733 1.882996 7.827126
21.0 3.371655 0.219900 0.687147 0.463687 3.852608 1.719955 0.511565 1.360544 2.011111 8.843764
22.0 3.039694 0.202374 0.689547 0.461044 3.462755 1.682857 0.432857 1.276735 1.944898 7.945714
23.0 3.113436 0.202794 0.701479 0.462601 3.459618 1.757416 0.435977 1.283700 1.951762 8.130690
24.0 3.395662 0.211654 0.704888 0.467106 3.722500 1.966103 0.452279 1.372794 2.059706 8.895441
25.0 3.665811 0.214601 0.721808 0.472418 3.985634 2.136116 0.482745 1.435714 2.126164 9.680177
26.0 3.877049 0.214056 0.733745 0.475793 4.201639 2.297066 0.503969 1.501639 2.220362 10.239776
27.0 3.954891 0.220972 0.739258 0.476619 4.234188 2.362393 0.506553 1.523172 2.234283 10.449953
In [29]:
df.groupby("Season_Numeric").mean()[['3P%', 'FT%', '2P%', 'TRB', 'AST', 'BLK', 'TOV', 'PF', 'PTS']].head(10)
Out[29]:
3P% FT% 2P% TRB AST BLK TOV PF PTS
Season_Numeric
30 0.217100 0.755000 0.500100 4.660000 2.750000 0.390000 1.820000 2.420000 11.500000
31 0.112182 0.735818 0.502909 4.200000 2.581818 0.681818 1.818182 2.345455 10.709091
32 0.156051 0.722311 0.488097 4.201556 2.401556 0.522957 1.687549 2.549416 10.540078
33 0.124720 0.717420 0.482973 4.295331 2.469261 0.550195 1.826459 2.475875 10.487938
34 0.142391 0.734000 0.484050 4.031034 2.482759 0.494636 1.663985 2.432950 10.306513
35 0.143127 0.738647 0.485985 4.053455 2.476364 0.500000 1.685091 2.366545 10.404000
36 0.141316 0.718680 0.484004 4.189098 2.472932 0.509774 1.694737 2.421805 10.563910
37 0.141714 0.736007 0.476693 4.116429 2.407143 0.511071 1.596429 2.351429 10.327500
38 0.169051 0.749293 0.474420 4.027536 2.407609 0.502174 1.563043 2.288043 10.147464
39 0.182789 0.733786 0.472057 4.065886 2.378595 0.490970 1.571572 2.206020 10.080268

A picture is worth a thousand words

Seaborn

https://seaborn.pydata.org/

Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

(venv) [nmichas@my-pc]$ pip install seaborn
In [30]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')
In [31]:
from clear_seasons_data import get_clear_seasons_data
sns_df = get_clear_seasons_data()
sns_df.sample(5)
Out[31]:
ShortName Height Weight Age Tm Pos G GS MP FG ... PF PTS plays_PG plays_SG plays_SF plays_PF plays_C Score Season_Numeric Next_Season_Score
8429 obannch01 195.6 209 22.0 DET SG 30.0 0.0 7.8 0.9 ... 0.5 2.1 0 1 0 0 0 5.075 48 7.425
413 arenagi01 190.5 191 22.0 WAS PG 55.0 52.0 37.6 6.5 ... 3.2 19.6 1 0 0 0 0 34.950 54 44.800
820 battito01 210.8 230 33.0 NJN C 15.0 0.0 8.9 0.9 ... 1.3 2.4 0 0 0 1 1 5.100 60 7.250
10449 smithke01 190.5 170 24.0 TOT PG 79.0 51.0 30.6 4.8 ... 1.8 11.9 1 0 0 0 0 26.475 40 36.925
7160 marjabo01 221.0 290 27.0 SAS C 54.0 4.0 9.4 1.9 ... 1.0 5.5 0 0 0 0 1 12.500 66 12.450

5 rows × 39 columns

In [32]:
sns.distplot(sns_df['STL'])
Out[32]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fce611a55c0>
In [33]:
sns.jointplot(x='BLK', y='TRB', data=sns_df, kind='reg')
Out[33]:
<seaborn.axisgrid.JointGrid at 0x7fce610651d0>
In [34]:
test_df = sns_df[['Pos', 'TRB','AST','STL']]
sns.pairplot(test_df, hue='Pos', diag_kind='hist', hue_order=['PG', 'SG', 'SF', 'PF', 'C'])
Out[34]:
<seaborn.axisgrid.PairGrid at 0x7fce61dbdeb8>
In [35]:
sns.barplot(x='plays_PG', y='AST', data=sns_df)
Out[35]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fce6259a6a0>
In [36]:
sns.boxplot(x='Pos', y='TRB', data=sns_df, order=['PG', 'SG', 'SF', 'PF', 'C'])
Out[36]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fce6242a240>
In [37]:
sns.heatmap(sns_df[['TRB','AST','STL','BLK','TOV', '2P%', '3P%', 'eFG%', 'PF', 'Score']].corr(), annot=True)
Out[37]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fce623953c8>

Plotly

https://plot.ly/python/

Plotly's Python graphing library makes interactive, publication-quality graphs online.

(venv) [nmichas@my-pc]$ pip install plotly
(venv) [nmichas@my-pc]$ pip install cufflinks
In [38]:
import pandas as pd
import numpy as np
import cufflinks as cf
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
cf.go_offline()
%matplotlib inline
In [39]:
plotly_df = get_clear_seasons_data()
In [40]:
plotly_df.iplot(
    kind='scatter', x='Age', y='Score', text='ShortName', mode='markers', 
    layout={'autosize':False, 'width':800, 'height':600, 'hovermode': 'closest'})

# plotly and cufflinks do not work well with Jupyter's slides

Step 4.

Predict Next Season.

We will use some Python machine learning libraries to predict how every player will perform next season.

Scikit-learn

http://scikit-learn.org/

Scikit-learn is a free software machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.

(venv) [nmichas@my-pc]$ pip install scikit-learn

Regression analysis

From Wikipedia, the free encyclopedia

In statistical modeling, regression analysis is a set of statistical processes for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables, when the focus is on the relationship between a dependent variable and one or more independent variables (or 'predictors'). More specifically, regression analysis helps one understand how the typical value of the dependent variable (or 'criterion variable') changes when any one of the independent variables is varied, while the other independent variables are held fixed.

Predict with Linear Regression

In [41]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
In [42]:
from clear_seasons_data import get_clear_final_data
In [43]:
def get_train_test_datasets(df):
    max_season = df['Season_Numeric'].max()
    df_train = df[df['Season_Numeric'] != max_season]
    df_test = df[df['Season_Numeric'] == max_season]
    X_train = df_train.drop(columns=['Next_Season_Score'])
    y_train = df_train['Next_Season_Score']
    X_test = df_test.drop(columns=['Next_Season_Score'])
    y_test = df_test['Next_Season_Score']
    return X_train, X_test, y_train, y_test
In [44]:
regr_df = get_clear_final_data()
X_train, X_test, y_train, y_test = get_train_test_datasets(regr_df)
In [45]:
X_train.head(5)
Out[45]:
Height Weight Age G GS MP FG FGA FG% 3P ... BLK TOV PF PTS plays_PG plays_SG plays_SF plays_PF plays_C Season_Numeric
0 208.3 240 22.0 43.0 0.0 6.7 1.3 2.7 0.474 0.0 ... 0.3 0.5 0.9 3.1 0 0 0 1 0 41
1 208.3 240 23.0 71.0 1.0 13.2 2.5 5.1 0.493 0.0 ... 0.2 0.9 1.9 6.1 0 0 0 1 0 42
2 208.3 240 24.0 75.0 52.0 17.5 3.3 6.3 0.518 0.0 ... 0.3 1.3 2.5 7.7 0 0 0 1 0 43
3 208.3 240 25.0 13.0 0.0 12.2 1.8 4.2 0.436 0.0 ... 0.2 1.3 1.5 4.9 0 0 0 1 0 44
4 218.4 225 34.0 76.0 76.0 35.2 9.9 17.1 0.579 0.0 ... 2.7 3.0 2.9 23.9 0 0 0 0 1 32

5 rows × 34 columns

In [46]:
y_train.head(5)
Out[46]:
0    12.325
1    14.950
2     8.350
3     8.500
4    44.350
Name: Next_Season_Score, dtype: float64
In [47]:
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train,y_train)
Out[47]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
In [48]:
lm.intercept_
Out[48]:
19.390105326328374
In [49]:
lm.coef_
Out[49]:
array([-1.62823036e-02,  4.15601307e-03, -4.21215753e-01,  2.53078338e-02,
       -9.78700150e-03, -2.22532612e-01,  6.52200483e-01, -2.20696560e-01,
       -5.15913294e+00,  3.70380493e+00, -9.80280912e-01, -1.36486384e-01,
        2.23047050e+00, -7.86904051e-01,  5.18996156e+00, -7.18298303e+00,
        1.43034951e+00, -6.28300398e-01, -5.12766833e-01,  2.47388821e+00,
        2.50854260e+00, -1.25279451e+00,  1.89030934e+00,  3.21056009e+00,
        2.68854892e+00, -1.17614684e+00, -7.65141144e-01,  7.58430663e-01,
        8.48358157e-01,  8.18898330e-01,  7.68414798e-01,  4.24741807e-01,
        9.26033901e-01, -1.82307125e-03])
In [50]:
coeff_df = pd.DataFrame(lm.coef_,X_train.columns,columns=['Coefficient'])
coeff_df.transpose()
Out[50]:
Height Weight Age G GS MP FG FGA FG% 3P ... BLK TOV PF PTS plays_PG plays_SG plays_SF plays_PF plays_C Season_Numeric
Coefficient -0.016282 0.004156 -0.421216 0.025308 -0.009787 -0.222533 0.6522 -0.220697 -5.159133 3.703805 ... 2.688549 -1.176147 -0.765141 0.758431 0.848358 0.818898 0.768415 0.424742 0.926034 -0.001823

1 rows × 34 columns

In [51]:
predictions = lm.predict(X_test)
In [52]:
plt.scatter(y_test,predictions)
Out[52]:
<matplotlib.collections.PathCollection at 0x7fce563cecc0>
In [53]:
sns.distplot((y_test-predictions),bins=50)
Out[53]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fce57422d68>
In [54]:
from sklearn import metrics
print('Mean Absolute Error     :', metrics.mean_absolute_error(y_test, predictions))
print('Mean Squared Error      :', metrics.mean_squared_error(y_test, predictions))
print('Root Mean Squared Error :', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
Mean Absolute Error     : 4.90991732806849
Mean Squared Error      : 39.127676770519244
Root Mean Squared Error : 6.255211968472311
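For reference, the three metrics reduce to simple NumPy expressions (toy arrays below, not the model's output):

```python
import numpy as np

y_true = np.array([10.0, 20.0, 30.0])
y_pred = np.array([12.0, 18.0, 33.0])

err = y_pred - y_true          # per-sample errors: [2, -2, 3]
mae = np.mean(np.abs(err))     # mean absolute error
mse = np.mean(err ** 2)        # mean squared error
rmse = np.sqrt(mse)            # root mean squared error
print(mae, mse, rmse)
```

RMSE is in the same units as the target, so the model's RMSE of about 6.3 means predictions are typically off by roughly six fantasy points per game.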

Compare Other Regression Methods

In [55]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from clear_seasons_data import get_clear_final_data
compare_df = get_clear_final_data()
X = compare_df.drop(columns=['Next_Season_Score'])
y = compare_df['Next_Season_Score']
In [56]:
X.head()
Out[56]:
Height Weight Age G GS MP FG FGA FG% 3P ... BLK TOV PF PTS plays_PG plays_SG plays_SF plays_PF plays_C Season_Numeric
0 208.3 240 22.0 43.0 0.0 6.7 1.3 2.7 0.474 0.0 ... 0.3 0.5 0.9 3.1 0 0 0 1 0 41
1 208.3 240 23.0 71.0 1.0 13.2 2.5 5.1 0.493 0.0 ... 0.2 0.9 1.9 6.1 0 0 0 1 0 42
2 208.3 240 24.0 75.0 52.0 17.5 3.3 6.3 0.518 0.0 ... 0.3 1.3 2.5 7.7 0 0 0 1 0 43
3 208.3 240 25.0 13.0 0.0 12.2 1.8 4.2 0.436 0.0 ... 0.2 1.3 1.5 4.9 0 0 0 1 0 44
4 218.4 225 34.0 76.0 76.0 35.2 9.9 17.1 0.579 0.0 ... 2.7 3.0 2.9 23.9 0 0 0 0 1 32

5 rows × 34 columns

In [57]:
y.head()
Out[57]:
0    12.325
1    14.950
2     8.350
3     8.500
4    44.350
Name: Next_Season_Score, dtype: float64
In [58]:
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
In [59]:
a = 0.9

methods = [
    ('linear regression', LinearRegression()),
    ('lasso', Lasso(fit_intercept=True, alpha=a)),
    ('ridge', Ridge(fit_intercept=True, alpha=a)),
    ('elastic-net', ElasticNet(fit_intercept=True, alpha=a))
]
In [60]:
for name,met in methods:
    met.fit(X,y)
    p = met.predict(X)
    e = p-y
    total_error = np.dot(e,e)
    rmse_train = np.sqrt(total_error/len(p))
    
    print('Method: %s' %name)
    print('RMSE on training: %.4f' %rmse_train)
    print("\n")
Method: linear regression
RMSE on training: 6.0929


Method: lasso
RMSE on training: 6.4137


Method: ridge
RMSE on training: 6.0930


Method: elastic-net
RMSE on training: 6.4085
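KFold is imported above but never used in that cell; a sketch of how cross-validation could compare the same methods follows. Synthetic data stands in for the real dataframe here (which needs seasons.csv), so the numbers are only illustrative:

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic regression data standing in for the real features/target
rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.randn(200) * 0.1

cv = KFold(n_splits=5, shuffle=True, random_state=0)
for name, model in [('linear regression', LinearRegression()),
                    ('ridge', Ridge(alpha=0.9))]:
    # negated MSE is sklearn's convention: higher scores are better
    scores = cross_val_score(model, X, y, cv=cv,
                             scoring='neg_mean_squared_error')
    rmse = np.sqrt(-scores.mean())
    print('%s: CV RMSE %.4f' % (name, rmse))
```

Unlike the training-set RMSE printed above, cross-validated RMSE estimates how the model generalizes to unseen data, which is what actually matters for next-season predictions.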


Neural Networks with Keras and TensorFlow

Keras

https://keras.io/

Keras is a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano. It was developed with a focus on enabling fast experimentation. Being able to go from idea to result with the least possible delay is key to doing good research.

Tensorflow

https://www.tensorflow.org/

TensorFlow™ is an open source software library for high-performance numerical computation, originally developed by researchers and engineers from the Google Brain team within Google's AI organization.

In [62]:
import numpy
import pandas
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
Using TensorFlow backend.
In [63]:
from clear_seasons_data import get_clear_final_data, get_train_test_datasets
keras_df = get_clear_final_data()
X_train, X_test, y_train, y_test = get_train_test_datasets(keras_df)
In [64]:
# define base model
def baseline_model():
    # create model
    model = Sequential()
    model.add(Dense(30, input_dim=34, kernel_initializer='normal', activation='relu'))
    model.add(Dense(10, kernel_initializer='normal', activation='relu'))
    model.add(Dense(1, kernel_initializer='normal'))
    # Compile model
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model
In [65]:
estimator = KerasRegressor(build_fn=baseline_model, epochs=200, batch_size=256, verbose=2)
In [66]:
estimator.fit(X_train, y_train)
Epoch 1/200
 - 0s - loss: 459.5953
Epoch 2/200
 - 0s - loss: 132.8051
Epoch 3/200
 - 0s - loss: 99.5188
Epoch 4/200
 - 0s - loss: 82.6606
Epoch 5/200
 - 0s - loss: 66.9503
...
Epoch 124/200
 - 0s - loss: 36.7832
Epoch 125/200
 - 0s - loss: 36.7161
Epoch 126/200
 - 0s - loss: 36.9202
Epoch 127/200
 - 0s - loss: 36.8115
Epoch 128/200
 - 0s - loss: 36.8146
Epoch 129/200
 - 0s - loss: 37.0024
Epoch 130/200
 - 0s - loss: 36.6736
Epoch 131/200
 - 0s - loss: 36.9156
Epoch 132/200
 - 0s - loss: 36.6609
Epoch 133/200
 - 0s - loss: 36.8168
Epoch 134/200
 - 0s - loss: 37.2711
Epoch 135/200
 - 0s - loss: 36.8177
Epoch 136/200
 - 0s - loss: 36.6591
Epoch 137/200
 - 0s - loss: 36.8011
Epoch 138/200
 - 0s - loss: 37.0466
Epoch 139/200
 - 0s - loss: 36.7148
Epoch 140/200
 - 0s - loss: 36.6854
Epoch 141/200
 - 0s - loss: 36.8929
Epoch 142/200
 - 0s - loss: 36.8034
Epoch 143/200
 - 0s - loss: 37.0319
Epoch 144/200
 - 0s - loss: 36.8104
Epoch 145/200
 - 0s - loss: 36.6326
Epoch 146/200
 - 0s - loss: 36.9849
Epoch 147/200
 - 0s - loss: 37.2135
Epoch 148/200
 - 0s - loss: 36.7456
Epoch 149/200
 - 0s - loss: 36.5984
Epoch 150/200
 - 0s - loss: 36.6494
Epoch 151/200
 - 0s - loss: 36.5613
Epoch 152/200
 - 0s - loss: 37.0005
Epoch 153/200
 - 0s - loss: 36.6890
Epoch 154/200
 - 0s - loss: 36.6270
Epoch 155/200
 - 0s - loss: 36.8675
Epoch 156/200
 - 0s - loss: 36.6778
Epoch 157/200
 - 0s - loss: 36.6405
Epoch 158/200
 - 0s - loss: 36.5097
Epoch 159/200
 - 0s - loss: 36.8047
Epoch 160/200
 - 0s - loss: 36.5839
Epoch 161/200
 - 0s - loss: 36.6382
Epoch 162/200
 - 0s - loss: 36.6244
Epoch 163/200
 - 0s - loss: 36.6021
Epoch 164/200
 - 0s - loss: 36.5452
Epoch 165/200
 - 0s - loss: 36.5041
Epoch 166/200
 - 0s - loss: 36.6022
Epoch 167/200
 - 0s - loss: 36.6299
Epoch 168/200
 - 0s - loss: 36.6023
Epoch 169/200
 - 0s - loss: 36.7352
Epoch 170/200
 - 0s - loss: 36.4985
Epoch 171/200
 - 0s - loss: 36.7347
Epoch 172/200
 - 0s - loss: 36.5795
Epoch 173/200
 - 0s - loss: 36.6202
Epoch 174/200
 - 0s - loss: 36.8924
Epoch 175/200
 - 0s - loss: 36.5441
Epoch 176/200
 - 0s - loss: 36.6195
Epoch 177/200
 - 0s - loss: 36.6718
Epoch 178/200
 - 0s - loss: 36.7881
Epoch 179/200
 - 0s - loss: 36.4715
Epoch 180/200
 - 0s - loss: 36.4860
Epoch 181/200
 - 0s - loss: 36.5994
Epoch 182/200
 - 0s - loss: 36.4601
Epoch 183/200
 - 0s - loss: 36.5082
Epoch 184/200
 - 0s - loss: 36.4787
Epoch 185/200
 - 0s - loss: 36.6551
Epoch 186/200
 - 0s - loss: 36.5175
Epoch 187/200
 - 0s - loss: 36.7392
Epoch 188/200
 - 0s - loss: 36.5299
Epoch 189/200
 - 0s - loss: 36.4365
Epoch 190/200
 - 0s - loss: 36.5772
Epoch 191/200
 - 0s - loss: 36.5714
Epoch 192/200
 - 0s - loss: 36.4685
Epoch 193/200
 - 0s - loss: 36.5741
Epoch 194/200
 - 0s - loss: 36.5005
Epoch 195/200
 - 0s - loss: 36.4790
Epoch 196/200
 - 0s - loss: 36.4878
Epoch 197/200
 - 0s - loss: 36.6152
Epoch 198/200
 - 0s - loss: 36.8434
Epoch 199/200
 - 0s - loss: 36.5095
Epoch 200/200
 - 0s - loss: 36.5246
Out[66]:
<keras.callbacks.History at 0x7fce461f1978>
In [67]:
predictions = estimator.predict(X_test)
In [68]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.scatter(y_test,predictions)
Out[68]:
<matplotlib.collections.PathCollection at 0x7fce45e511d0>
In [69]:
from sklearn import metrics
import numpy as np
print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
MAE: 4.911167843974367
MSE: 39.100205451025126
RMSE: 6.253015708522179
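
To put an RMSE of roughly 6.25 fantasy points in context, it helps to compare against a model that ignores the features entirely and always predicts the training mean. A minimal sketch with synthetic placeholder scores (in the notebook you would pass the real `y_train` / `y_test` instead):

```python
import numpy as np
from sklearn import metrics

# Placeholder fantasy scores standing in for the notebook's y_train / y_test.
rng = np.random.default_rng(0)
y_train = rng.normal(30, 10, size=200)
y_test = rng.normal(30, 10, size=50)

# The baseline always predicts the training mean, whatever the input.
baseline = np.full_like(y_test, y_train.mean())
rmse_baseline = np.sqrt(metrics.mean_squared_error(y_test, baseline))
print('Baseline RMSE:', rmse_baseline)
```

If the trained model's RMSE is well below this baseline, the features are genuinely adding predictive value.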

Step 5.

Use the predictions

In [70]:
from clear_seasons_data import get_clear_final_data, get_train_test_datasets

final_df = get_clear_final_data(with_labels=True)
X_train, X_test, y_train, y_test = get_train_test_datasets(final_df)
labels_train = X_train[['Player']].copy()  # .copy() avoids SettingWithCopyWarning later
labels_test = X_test[['Player']].copy()
X_train.drop(columns=['Player'], inplace=True)
X_test.drop(columns=['Player'], inplace=True)
In [71]:
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train, y_train)
predictions = lm.predict(X_test)
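
Beyond raw predictions, a fitted linear model also exposes which inputs drive the score via `lm.coef_` (paired with `X_train.columns` in the notebook). A self-contained sketch on synthetic data with known weights, to show that the recovered coefficients match the generating ones:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic features with known weights 2.0 and -1.0; the third column
# does not influence y at all, so its coefficient should be near zero.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

lm = LinearRegression().fit(X, y)
print(lm.coef_)  # approximately [2, -1, 0]
```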
In [72]:
labels_test['Prediction'] = predictions
In [73]:
prediction_df_sorted = labels_test.sort_values(by='Prediction', ascending=False)
In [74]:
prediction_df_sorted[:10]
Out[74]:
Player Prediction
11892 Russell Westbrook 64.568737
4664 James Harden 61.799162
2719 Anthony Davis 56.979021
5627 LeBron James 54.898267
369 Giannis Antetokounmpo 54.394127
3285 Kevin Durant 53.026744
2441 DeMarcus Cousins 52.271180
11256 Karl-Anthony Towns 51.269578
11706 John Wall 50.722309
1737 Jimmy Butler 47.564992
In [75]:
prediction_df_sorted[10:20]
Out[75]:
Player Prediction
6670 Kawhi Leonard 47.317304
2612 Stephen Curry 46.954240
11023 Isaiah Thomas 46.403544
6755 Damian Lillard 46.100491
5458 Kyrie Irving 43.599142
5971 Nikola Jokic 43.551978
8811 Chris Paul 43.180978
2959 DeMar DeRozan 42.571234
6902 Kyle Lowry 41.559765
4081 Paul George 41.363313
In [76]:
last_season_df = pd.read_csv('current.csv')
from clear_seasons_data import get_score
last_season_df['Score'] = last_season_df.apply(get_score, axis=1)
last_season_df = last_season_df.sort_values(by='Score', ascending=False)[['Player', 'Score']]
last_season_df.head(10)
Out[76]:
Player Score
84 Anthony Davis 58.750
179 LeBron James 55.575
103 Joel Embiid 54.550
9 Giannis Antetokounmpo 53.900
82 Stephen Curry 53.700
100 Kevin Durant 52.900
90 DeMar DeRozan 52.775
351 Russell Westbrook 52.350
141 James Harden 51.175
139 Blake Griffin 50.075
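
One quick way to judge the predictions against reality is to count how many of the predicted top 10 actually finished in last season's top 10 (in the notebook this could also be done with a merge on `Player`). Using the names from the two tables above:

```python
# Names copied from the prediction table and the last-season score table.
predicted_top10 = ['Russell Westbrook', 'James Harden', 'Anthony Davis',
                   'LeBron James', 'Giannis Antetokounmpo', 'Kevin Durant',
                   'DeMarcus Cousins', 'Karl-Anthony Towns', 'John Wall',
                   'Jimmy Butler']
actual_top10 = ['Anthony Davis', 'LeBron James', 'Joel Embiid',
                'Giannis Antetokounmpo', 'Stephen Curry', 'Kevin Durant',
                'DeMar DeRozan', 'Russell Westbrook', 'James Harden',
                'Blake Griffin']

overlap = set(predicted_top10) & set(actual_top10)
print(f'{len(overlap)} of 10 predicted players appear in the actual top 10')
# → 6 of 10 predicted players appear in the actual top 10
```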

Next Steps

Potential Improvements

  • Use college stats for rookies or international stats for overseas players
  • Use more than one season as input (at the cost of fewer training samples)
  • Use more advanced stats (stats per possession, team averages, ...)
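
The per-possession idea from the last bullet can be sketched as follows. The possession estimate `FGA + 0.44*FTA - OREB + TO` is a commonly used approximation (the 0.44 free-throw weight is a convention, not an exact figure), and the sample numbers below are made up for illustration:

```python
def points_per_possession(fga, fta, oreb, to, pts):
    """Scoring efficiency: points per estimated possession used."""
    possessions = fga + 0.44 * fta - oreb + to
    return pts / possessions

# Hypothetical single-game line: 20 FGA, 8 FTA, 2 OREB, 3 TO, 30 PTS.
print(points_per_possession(fga=20, fta=8, oreb=2, to=3, pts=30))
```

The same normalisation could be applied to any of the counting stats in the scoring table, making players on fast- and slow-paced teams more directly comparable.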

Q & A


Nikolaos Michas

PyCon Balkan 2018