A fantasy sport (less commonly known as rotisserie or roto) is a type of online game in which participants assemble imaginary or virtual teams of real players of a professional sport. These teams compete based on the statistical performance of those players in actual games. This performance is converted into points that are compiled and totaled according to a roster selected by each fantasy team's manager. These point systems can be simple enough to be calculated manually by a "league commissioner" who coordinates and manages the overall league, or points can be compiled and calculated by computers tracking the actual results of the professional sport. In fantasy sports, team owners draft, trade and cut (drop) players, analogously to real sports.
Each team owner drafts a team of 13 players
Each week owners select 10 active players
Owners collect points based on their picks' performance
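The weekly tally described above can be sketched in a couple of lines; the player names and point values below are made up for illustration:

```python
# Hypothetical weekly tally: an owner's team score for the week is the sum
# of the fantasy points earned by the active players they picked.
active_player_points = {
    'Player A': 38.5,   # made-up weekly fantasy points
    'Player B': 24.0,
    'Player C': 41.25,
}

team_total = sum(active_player_points.values())
print(team_total)  # 103.75
```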
Pandas
Beautiful Soup
Jupyter
Seaborn / Plotly
Scikit-learn
Keras
https://pandas.pydata.org/pandas-docs/stable/
pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way toward this goal.
(venv) [nmichas@my-pc]$ pip install pandas
Two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary pandas data structure.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(6,4), columns=list('ABCD'))
df
 | A | B | C | D |
---|---|---|---|---|
0 | -0.023940 | -1.116884 | -1.420836 | 0.026762 |
1 | 0.472838 | 0.537210 | -0.174598 | -1.972429 |
2 | 0.030127 | -0.493965 | -1.710277 | -1.127274 |
3 | -0.838290 | -0.340422 | 0.982786 | -0.291325 |
4 | 0.942333 | 0.914386 | -1.218660 | -2.353766 |
5 | 0.326871 | -0.797093 | -0.446801 | -0.366841 |
https://www.crummy.com/software/BeautifulSoup/
Install with:
(venv) [nmichas@my-pc]$ pip install beautifulsoup4
(venv) [nmichas@my-pc]$ pip install lxml
Parse and read all player statistics from the Basketball-Reference website
https://www.basketball-reference.com/players/a/antetgi01.html
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://www.basketball-reference.com/players/a/antetgi01.html'
r = requests.get(url)
s = BeautifulSoup(r.text, 'lxml')
player_df = pd.read_html(r.text)[0]
player_df.head()
 | Season | Age | Tm | Lg | Pos | G | GS | MP | FG | FGA | ... | FT% | ORB | DRB | TRB | AST | STL | BLK | TOV | PF | PTS |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2013-14 | 19.0 | MIL | NBA | SF | 77 | 23 | 24.6 | 2.2 | 5.4 | ... | 0.683 | 1.0 | 3.4 | 4.4 | 1.9 | 0.8 | 0.8 | 1.6 | 2.2 | 6.8 |
1 | 2014-15 | 20.0 | MIL | NBA | SG | 81 | 71 | 31.4 | 4.7 | 9.6 | ... | 0.741 | 1.2 | 5.5 | 6.7 | 2.6 | 0.9 | 1.0 | 2.1 | 3.1 | 12.7 |
2 | 2015-16 | 21.0 | MIL | NBA | PG | 80 | 79 | 35.3 | 6.4 | 12.7 | ... | 0.724 | 1.4 | 6.2 | 7.7 | 4.3 | 1.2 | 1.4 | 2.6 | 3.2 | 16.9 |
3 | 2016-17 | 22.0 | MIL | NBA | SF | 80 | 80 | 35.6 | 8.2 | 15.7 | ... | 0.770 | 1.8 | 7.0 | 8.8 | 5.4 | 1.6 | 1.9 | 2.9 | 3.1 | 22.9 |
4 | 2017-18 | 23.0 | MIL | NBA | PF | 75 | 75 | 36.7 | 9.9 | 18.7 | ... | 0.760 | 2.1 | 8.0 | 10.0 | 4.8 | 1.5 | 1.4 | 3.0 | 3.1 | 26.9 |
5 rows × 30 columns
COLUMNS = ['Season', 'Age', 'Tm', 'Lg', 'Pos', 'G', 'GS',
'MP', 'FG', 'FGA', 'FG%', '3P', '3PA', '3P%',
'2P', '2PA', '2P%', 'eFG%', 'FT', 'FTA', 'FT%',
'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV',
'PF', 'PTS']
player_df = player_df[COLUMNS]
player_df.head()
 | Season | Age | Tm | Lg | Pos | G | GS | MP | FG | FGA | ... | FT% | ORB | DRB | TRB | AST | STL | BLK | TOV | PF | PTS |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2013-14 | 19.0 | MIL | NBA | SF | 77 | 23 | 24.6 | 2.2 | 5.4 | ... | 0.683 | 1.0 | 3.4 | 4.4 | 1.9 | 0.8 | 0.8 | 1.6 | 2.2 | 6.8 |
1 | 2014-15 | 20.0 | MIL | NBA | SG | 81 | 71 | 31.4 | 4.7 | 9.6 | ... | 0.741 | 1.2 | 5.5 | 6.7 | 2.6 | 0.9 | 1.0 | 2.1 | 3.1 | 12.7 |
2 | 2015-16 | 21.0 | MIL | NBA | PG | 80 | 79 | 35.3 | 6.4 | 12.7 | ... | 0.724 | 1.4 | 6.2 | 7.7 | 4.3 | 1.2 | 1.4 | 2.6 | 3.2 | 16.9 |
3 | 2016-17 | 22.0 | MIL | NBA | SF | 80 | 80 | 35.6 | 8.2 | 15.7 | ... | 0.770 | 1.8 | 7.0 | 8.8 | 5.4 | 1.6 | 1.9 | 2.9 | 3.1 | 22.9 |
4 | 2017-18 | 23.0 | MIL | NBA | PF | 75 | 75 | 36.7 | 9.9 | 18.7 | ... | 0.760 | 2.1 | 8.0 | 10.0 | 4.8 | 1.5 | 1.4 | 3.0 | 3.1 | 26.9 |
5 rows × 30 columns
import re
player_df['Height'] = s.find(itemprop='height').get_text()
player_df['Weight'] = s.find(itemprop='weight').get_text()
regex = re.compile(
'(Guard|Forward|Point Guard|Center|Power Forward|Shooting Guard|Small Forward)')
player_df['Position'] = s.findAll(text=regex)[0].strip().split('\n')[0]
player_df.columns
Index(['Season', 'Age', 'Tm', 'Lg', 'Pos', 'G', 'GS', 'MP', 'FG', 'FGA', 'FG%', '3P', '3PA', '3P%', '2P', '2PA', '2P%', 'eFG%', 'FT', 'FTA', 'FT%', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS', 'Height', 'Weight', 'Position'], dtype='object')
(venv) [nmichas@my-pc]$ python get_all_players.py
players.csv should be a CSV file of this form:
name,shortname,href
Alaa Abdelnaby,abdelal01,/players/a/abdelal01.html
Zaid Abdul-Aziz,abdulza01,/players/a/abdulza01.html
Kareem Abdul-Jabbar,abdulka01,/players/a/abdulka01.html
Mahmoud Abdul-Rauf,abdulma02,/players/a/abdulma02.html
Tariq Abdul-Wahad,abdulta01,/players/a/abdulta01.html
Shareef Abdur-Rahim,abdursh01,/players/a/abdursh01.html
Tom Abernethy,abernto01,/players/a/abernto01.html
Forest Able,ablefo01,/players/a/ablefo01.html
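get_all_players.py itself is not shown; the following is only a hedged sketch of how such a script might produce players.csv by walking the alphabetical index pages. The index URL pattern and the `table#players` selector are assumptions about the site's markup, not something taken from the original script.

```python
import csv
import string

import requests
from bs4 import BeautifulSoup

BASE = 'https://www.basketball-reference.com'


def short_name(href):
    """'/players/a/abdelal01.html' -> 'abdelal01'."""
    return href.split('/')[-1].replace('.html', '')


def main():
    with open('players.csv', 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['name', 'shortname', 'href'])
        for letter in string.ascii_lowercase:
            r = requests.get('{}/players/{}/'.format(BASE, letter))
            if r.status_code != 200:
                continue  # a letter with no index page
            soup = BeautifulSoup(r.text, 'lxml')
            # assumed markup: one th[data-stat="player"] cell per player row
            for cell in soup.select('table#players th[data-stat="player"]'):
                link = cell.find('a')
                if link is not None:
                    writer.writerow(
                        [link.get_text(), short_name(link['href']), link['href']])


if __name__ == '__main__':
    main()
```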
(venv) [nmichas@my-pc]$ python get_all_seasons.py
seasons.csv should be a CSV file of this form:
Player,ShortName,Height,Weight,Position,BirthPlace,SeasonURL,Season,Age,Tm,Lg,Pos,G,...
Alaa Abdelnaby,abdelal01,6-10,240lb,Power Forward,Egypt,/players/a/abdelal01/gamelog/1991/,1990-91,22.0,POR,NBA,PF,43,0,6.7,1.3,2.7,0.474,0.0,0.0,,1.3,...
Alaa Abdelnaby,abdelal01,6-10,240lb,Power Forward,Egypt,/players/a/abdelal01/gamelog/1992/,1991-92,23.0,POR,NBA,PF,71,1,13.2,2.5,5.1,0.493,0.0,0.0,,2.5,...
Alaa Abdelnaby,abdelal01,6-10,240lb,Power Forward,Egypt,/players/a/abdelal01/gamelog/1993/,1992-93,24.0,TOT,NBA,PF,75,52,17.5,3.3,6.3,0.518,0.0,0.0,0.0,...
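get_all_seasons.py is likewise not reproduced here; one plausible shape for it is sketched below: read players.csv, fetch and parse each player page, tag each frame with the player's identity, and concatenate. `fetch_player_seasons` is a hypothetical helper wrapping the requests/BeautifulSoup parsing shown earlier.

```python
import pandas as pd


def tag_player(seasons, name, shortname):
    """Attach identifying columns to one player's per-season table."""
    seasons = seasons.copy()
    seasons.insert(0, 'Player', name)
    seasons.insert(1, 'ShortName', shortname)
    return seasons


def build_seasons(fetch_player_seasons, players_csv='players.csv'):
    # fetch_player_seasons(href) -> DataFrame is a hypothetical helper
    # wrapping the per-player parsing from the previous section.
    players = pd.read_csv(players_csv)
    frames = [
        tag_player(fetch_player_seasons(row['href']), row['name'], row['shortname'])
        for _, row in players.iterrows()
    ]
    return pd.concat(frames, ignore_index=True)
```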
df = pd.read_csv('seasons.csv')
df.sample(1)
 | Player | ShortName | Height | Weight | Position | BirthPlace | SeasonURL | Season | Age | Tm | ... | FT% | ORB | DRB | TRB | AST | STL | BLK | TOV | PF | PTS |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2165 | Benoit Benjamin | benjabe01 | 7-0 | 250lb | Center | Louisiana | /players/b/benjabe01/gamelog/1996/ | 1995-96 | 31.0 | MIL | ... | 0.732 | 1.6 | 4.7 | 6.2 | 0.7 | 0.5 | 1.0 | 1.6 | 2.6 | 7.8 |
1 rows × 37 columns
Remove per-team rows for players that changed teams mid-season (keeping the combined TOT row)
df[['Player', 'ShortName', 'Season', 'Lg', 'Tm']].head(5)
 | Player | ShortName | Season | Lg | Tm |
---|---|---|---|---|---|
0 | Alaa Abdelnaby | abdelal01 | 1990-91 | NBA | POR |
1 | Alaa Abdelnaby | abdelal01 | 1991-92 | NBA | POR |
2 | Alaa Abdelnaby | abdelal01 | 1992-93 | NBA | TOT |
3 | Alaa Abdelnaby | abdelal01 | 1992-93 | NBA | MIL |
4 | Alaa Abdelnaby | abdelal01 | 1992-93 | NBA | BOS |
df.drop(
df[df.duplicated(['ShortName', 'Season'], keep='first')].index,
inplace=True)
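Basketball-Reference lists a split season as a combined TOT row followed by one row per team, so `duplicated(..., keep='first')` flags everything after the TOT row. A tiny made-up example of the mechanics:

```python
import pandas as pd

# Made-up split season: the first row is the combined (TOT) line.
demo = pd.DataFrame({
    'ShortName': ['abdelal01'] * 3,
    'Season': ['1992-93'] * 3,
    'Tm': ['TOT', 'MIL', 'BOS'],
})
mask = demo.duplicated(['ShortName', 'Season'], keep='first')
print(demo.loc[~mask, 'Tm'].tolist())  # ['TOT']
```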
Drop rows for non-NBA leagues and total career statistics
df[['Player', 'ShortName', 'Season', 'Lg', 'Tm']][df.Lg != 'NBA'].head(5)
 | Player | ShortName | Season | Lg | Tm |
---|---|---|---|---|---|
92 | John Abramovic | abramjo01 | 1946-47 | BAA | PIT |
93 | John Abramovic | abramjo01 | 1947-48 | BAA | TOT |
96 | John Abramovic | abramjo01 | Career | BAA | NaN |
151 | Don Adams | adamsdo01 | 1974-75 | TOT | TOT |
154 | Don Adams | adamsdo01 | 1975-76 | TOT | TOT |
df.drop(df[df.Lg == 'ABA'].index, inplace=True)
df.drop(df[df.Lg == 'BAA'].index, inplace=True)
df.drop(df[df.Lg == 'TOT'].index, inplace=True)
df.drop(df[df.Season == 'Career'].index, inplace=True)
Remove data from players with missing information
df[['Player', 'ShortName', 'Season', 'Lg', 'Tm', '3P', 'GS']][df['3P'].isnull()].head(5)
 | Player | ShortName | Season | Lg | Tm | 3P | GS |
---|---|---|---|---|---|---|---|
10 | Zaid Abdul-Aziz | abdulza01 | 1968-69 | NBA | TOT | NaN | NaN |
13 | Zaid Abdul-Aziz | abdulza01 | 1969-70 | NBA | MIL | NaN | NaN |
14 | Zaid Abdul-Aziz | abdulza01 | 1970-71 | NBA | SEA | NaN | NaN |
15 | Zaid Abdul-Aziz | abdulza01 | 1971-72 | NBA | SEA | NaN | NaN |
16 | Zaid Abdul-Aziz | abdulza01 | 1972-73 | NBA | HOU | NaN | NaN |
# drop seasons recorded before 3-point stats were tracked
df.dropna(subset=['3P', '3PA'], inplace=True)
# drop players without info for Games Started
df.dropna(subset=['GS'], inplace=True)
# drop players with no height-weight info
df.dropna(subset=['Height', 'Weight'], inplace=True)
Convert Height and Weight to numeric
df[['Height', 'Weight']].sample(5)
 | Height | Weight |
---|---|---|
30612 | 6-8 | 199lb |
3868 | 6-9 | 245lb |
27836 | 6-2 | 186lb |
24749 | 6-8 | 210lb |
13275 | 6-5 | 184lb |
def height_to_cm(h):
ft, inch = h.split('-')
inch = int(inch) + int(ft) * 12
return round(inch * 2.54, 1)
def remove_lb(w):
return int(w.replace('lb', ''))
df['Height'] = df['Height'].map(height_to_cm)
df['Weight'] = df['Weight'].map(remove_lb)
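A quick sanity check of the two converters (redefined here so the snippet runs on its own):

```python
def height_to_cm(h):
    # '6-11' -> total inches -> centimeters, rounded to one decimal
    ft, inch = h.split('-')
    inch = int(inch) + int(ft) * 12
    return round(inch * 2.54, 1)


def remove_lb(w):
    return int(w.replace('lb', ''))


print(height_to_cm('6-11'))  # 210.8
print(remove_lb('242lb'))    # 242
```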
Convert season to numeric
min_season = int(df['Season'].min().split('-')[0])
def get_season(row):
return int(row['Season'].split('-')[0]) - min_season
df['Season_Numeric'] = df.apply(get_season, axis=1)
df[['Season', 'Season_Numeric']].sample(5)
 | Season | Season_Numeric |
---|---|---|
12132 | 2010-11 | 31 |
30148 | 1982-83 | 3 |
19979 | 1982-83 | 3 |
3439 | 1981-82 | 2 |
7663 | 1990-91 | 11 |
Convert position to an array of 0/1 indicator values
def get_position_matrix(position):
positions = [0, 0, 0, 0, 0]
if 'Point Guard' in position:
positions[0] = 1
if 'Shooting Guard' in position:
positions[1] = 1
if 'Small Forward' in position:
positions[2] = 1
if 'Power Forward' in position:
positions[3] = 1
if 'Center' in position:
positions[4] = 1
if 'Guard' in position and 'Point Guard' not in position and 'Shooting Guard' not in position:
positions[0] = 1
positions[1] = 1
if 'Forward' in position and 'Power Forward' not in position and 'Small Forward' not in position:
positions[2] = 1
positions[3] = 1
return positions
position_matrix = []
for i, season_row in df.iterrows():
position_matrix.append(get_position_matrix(season_row['Position']))
position_matrix = np.array(position_matrix)
for i, position in enumerate(['PG', 'SG', 'SF', 'PF', 'C']):
df['plays_' + position] = position_matrix[:, i]
giannis_df = df[df['Player'] == 'Giannis Antetokounmpo']
giannis_df[['Player', 'Position', 'plays_PG', 'plays_SG', 'plays_SF', 'plays_PF', 'plays_C']].head(1)
 | Player | Position | plays_PG | plays_SG | plays_SF | plays_PF | plays_C |
---|---|---|---|---|---|---|---|
770 | Giannis Antetokounmpo | Small Forward and Point Guard and Shooting Gua... | 1 | 1 | 1 | 1 | 0 |
Calculate score
def get_score(row):
return row['FG'] * 1.5 + row['FGA'] * (-0.5) + row['FT'] + \
row['FTA'] * (-0.75) + row['3P'] + row['3PA'] * (-0.25) + \
row['ORB'] * 0.5 + row['TRB'] + row['AST'] * 2 + \
row['STL'] * 2.5 + row['BLK'] * 2.5 + \
row['TOV'] * (-1.75) + row['PTS']
df['Score'] = df.apply(get_score, axis=1)
giannis_df = df[df['Player'] == 'Giannis Antetokounmpo']
giannis_df[['Player', 'Season', 'Score']].head(5)
 | Player | Season | Score |
---|---|---|---|
770 | Giannis Antetokounmpo | 2013-14 | 17.275 |
771 | Giannis Antetokounmpo | 2014-15 | 28.475 |
772 | Giannis Antetokounmpo | 2015-16 | 39.025 |
773 | Giannis Antetokounmpo | 2016-17 | 51.675 |
774 | Giannis Antetokounmpo | 2017-18 | 55.300 |
Find next season's score (the target column)
df.sort_values(['ShortName', 'Season_Numeric'], inplace=True)
g = df.groupby(['ShortName'])
next_season_score = list()
for i, gr in g:
next_season_score += list(gr['Score'].shift(-1))
df['Next_Season_Score'] = next_season_score
df.dropna(subset=['Next_Season_Score'], inplace=True)
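The loop over groups above can also be written as a single vectorized `groupby(...).shift(-1)`, assuming the frame is already sorted by ShortName and Season_Numeric. A minimal sketch with made-up scores:

```python
import pandas as pd

scores = pd.DataFrame({
    'ShortName': ['a01', 'a01', 'a01', 'b01'],
    'Score': [10.0, 12.0, 15.0, 7.0],
})
# Each season's target is the same player's score one season later;
# every player's final season gets NaN and would be dropped.
scores['Next_Season_Score'] = scores.groupby('ShortName')['Score'].shift(-1)
print(scores['Next_Season_Score'].tolist())  # [12.0, 15.0, nan, nan]
```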
giannis_df = df[df['Player'] == 'Giannis Antetokounmpo']
giannis_df[['Player', 'Season', 'Season_Numeric', 'Score', 'Next_Season_Score']].head(5)
 | Player | Season | Season_Numeric | Score | Next_Season_Score |
---|---|---|---|---|---|
770 | Giannis Antetokounmpo | 2013-14 | 34 | 17.275 | 28.475 |
771 | Giannis Antetokounmpo | 2014-15 | 35 | 28.475 | 39.025 |
772 | Giannis Antetokounmpo | 2015-16 | 36 | 39.025 | 51.675 |
773 | Giannis Antetokounmpo | 2016-17 | 37 | 51.675 | 55.300 |
# drop unnecessary columns
df.drop(
['BirthPlace', 'Season', 'Position', 'Player',
'SeasonURL', 'Lg', 'ShortName', 'Pos', 'Tm', 'Score'],
axis=1, inplace=True)
df.fillna(0, inplace=True)
df.reset_index(drop=True, inplace=True)
df.sample(5)
 | Height | Weight | Age | G | GS | MP | FG | FGA | FG% | 3P | ... | TOV | PF | PTS | Season_Numeric | plays_PG | plays_SG | plays_SF | plays_PF | plays_C | Next_Season_Score |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
9259 | 182.9 | 170 | 28.0 | 75.0 | 74.0 | 31.7 | 6.4 | 13.1 | 0.484 | 1.6 | ... | 2.6 | 1.4 | 18.2 | 13 | 1 | 0 | 0 | 0 | 0 | 39.475 |
3652 | 200.7 | 209 | 32.0 | 50.0 | 5.0 | 9.2 | 0.8 | 2.2 | 0.369 | 0.0 | ... | 0.4 | 1.0 | 1.9 | 18 | 0 | 0 | 1 | 0 | 0 | 1.300 |
4787 | 195.6 | 195 | 24.0 | 58.0 | 26.0 | 18.7 | 3.8 | 8.4 | 0.458 | 0.0 | ... | 1.0 | 1.5 | 9.2 | 8 | 0 | 1 | 0 | 0 | 0 | 2.725 |
2915 | 185.4 | 189 | 28.0 | 63.0 | 17.0 | 23.0 | 3.6 | 9.3 | 0.384 | 1.1 | ... | 0.9 | 1.6 | 9.5 | 22 | 1 | 1 | 0 | 0 | 0 | 21.000 |
880 | 205.7 | 235 | 28.0 | 56.0 | 6.0 | 16.7 | 3.9 | 7.3 | 0.532 | 0.3 | ... | 1.2 | 1.6 | 9.4 | 37 | 0 | 0 | 1 | 1 | 0 | 25.400 |
5 rows × 35 columns
We can now use the following function to get an entirely numeric dataframe.
from clear_seasons_data import get_clear_final_data
df = get_clear_final_data()
df.describe()
 | Height | Weight | Age | G | GS | MP | FG | FGA | FG% | 3P | ... | TOV | PF | PTS | plays_PG | plays_SG | plays_SF | plays_PF | plays_C | Season_Numeric | Next_Season_Score |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 12648.000000 | 12648.000000 | 12648.000000 | 12648.000000 | 12648.000000 | 12648.000000 | 12648.000000 | 12648.000000 | 12648.000000 | 12648.000000 | ... | 12648.000000 | 12648.000000 | 12648.000000 | 12648.000000 | 12648.000000 | 12648.000000 | 12648.000000 | 12648.000000 | 12648.000000 | 12648.000000 |
mean | 200.705218 | 216.453036 | 26.551312 | 60.043722 | 31.035421 | 22.369892 | 3.533491 | 7.694790 | 0.450584 | 0.438536 | ... | 1.391034 | 2.100253 | 9.305021 | 0.253716 | 0.332701 | 0.327087 | 0.355946 | 0.303843 | 50.780519 | 19.671316 |
std | 9.413949 | 27.616132 | 3.907472 | 22.078819 | 30.437214 | 9.816474 | 2.275182 | 4.706923 | 0.074528 | 0.609560 | ... | 0.820865 | 0.815102 | 6.079629 | 0.435154 | 0.471199 | 0.469168 | 0.478818 | 0.459934 | 10.129191 | 12.078043 |
min | 160.000000 | 133.000000 | 18.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 30.000000 | -4.000000 |
25% | 193.000000 | 195.000000 | 24.000000 | 47.000000 | 2.000000 | 14.400000 | 1.700000 | 3.900000 | 0.415000 | 0.000000 | ... | 0.800000 | 1.500000 | 4.500000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 42.000000 | 10.050000 |
50% | 200.700000 | 215.000000 | 26.000000 | 68.000000 | 19.000000 | 22.300000 | 3.100000 | 6.800000 | 0.452000 | 0.100000 | ... | 1.200000 | 2.100000 | 8.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 51.000000 | 17.500000 |
75% | 208.300000 | 235.000000 | 29.000000 | 79.000000 | 62.000000 | 30.800000 | 4.900000 | 10.800000 | 0.490000 | 0.700000 | ... | 1.900000 | 2.700000 | 13.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 60.000000 | 27.725000 |
max | 231.100000 | 330.000000 | 42.000000 | 85.000000 | 83.000000 | 43.700000 | 13.400000 | 27.800000 | 1.000000 | 5.100000 | ... | 5.700000 | 6.000000 | 37.100000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 67.000000 | 68.050000 |
8 rows × 35 columns
df.groupby("Age").mean()[['FG', '3P%', 'FT%', '2P%', 'TRB', 'AST', 'BLK', 'TOV', 'PF', 'PTS']].head(10)
Age | FG | 3P% | FT% | 2P% | TRB | AST | BLK | TOV | PF | PTS
---|---|---|---|---|---|---|---|---|---|---
18.0 | 1.416667 | 0.159917 | 0.608917 | 0.432083 | 1.941667 | 0.566667 | 0.383333 | 0.658333 | 1.358333 | 3.708333 |
19.0 | 2.652041 | 0.201745 | 0.669204 | 0.459949 | 3.220408 | 1.321429 | 0.524490 | 1.162245 | 1.716327 | 6.943878 |
20.0 | 2.969231 | 0.215640 | 0.680802 | 0.457838 | 3.682996 | 1.560324 | 0.522267 | 1.277733 | 1.882996 | 7.827126 |
21.0 | 3.371655 | 0.219900 | 0.687147 | 0.463687 | 3.852608 | 1.719955 | 0.511565 | 1.360544 | 2.011111 | 8.843764 |
22.0 | 3.039694 | 0.202374 | 0.689547 | 0.461044 | 3.462755 | 1.682857 | 0.432857 | 1.276735 | 1.944898 | 7.945714 |
23.0 | 3.113436 | 0.202794 | 0.701479 | 0.462601 | 3.459618 | 1.757416 | 0.435977 | 1.283700 | 1.951762 | 8.130690 |
24.0 | 3.395662 | 0.211654 | 0.704888 | 0.467106 | 3.722500 | 1.966103 | 0.452279 | 1.372794 | 2.059706 | 8.895441 |
25.0 | 3.665811 | 0.214601 | 0.721808 | 0.472418 | 3.985634 | 2.136116 | 0.482745 | 1.435714 | 2.126164 | 9.680177 |
26.0 | 3.877049 | 0.214056 | 0.733745 | 0.475793 | 4.201639 | 2.297066 | 0.503969 | 1.501639 | 2.220362 | 10.239776 |
27.0 | 3.954891 | 0.220972 | 0.739258 | 0.476619 | 4.234188 | 2.362393 | 0.506553 | 1.523172 | 2.234283 | 10.449953 |
df.groupby("Season_Numeric").mean()[['3P%', 'FT%', '2P%', 'TRB', 'AST', 'BLK', 'TOV', 'PF', 'PTS']].head(10)
Season_Numeric | 3P% | FT% | 2P% | TRB | AST | BLK | TOV | PF | PTS
---|---|---|---|---|---|---|---|---
30 | 0.217100 | 0.755000 | 0.500100 | 4.660000 | 2.750000 | 0.390000 | 1.820000 | 2.420000 | 11.500000 |
31 | 0.112182 | 0.735818 | 0.502909 | 4.200000 | 2.581818 | 0.681818 | 1.818182 | 2.345455 | 10.709091 |
32 | 0.156051 | 0.722311 | 0.488097 | 4.201556 | 2.401556 | 0.522957 | 1.687549 | 2.549416 | 10.540078 |
33 | 0.124720 | 0.717420 | 0.482973 | 4.295331 | 2.469261 | 0.550195 | 1.826459 | 2.475875 | 10.487938 |
34 | 0.142391 | 0.734000 | 0.484050 | 4.031034 | 2.482759 | 0.494636 | 1.663985 | 2.432950 | 10.306513 |
35 | 0.143127 | 0.738647 | 0.485985 | 4.053455 | 2.476364 | 0.500000 | 1.685091 | 2.366545 | 10.404000 |
36 | 0.141316 | 0.718680 | 0.484004 | 4.189098 | 2.472932 | 0.509774 | 1.694737 | 2.421805 | 10.563910 |
37 | 0.141714 | 0.736007 | 0.476693 | 4.116429 | 2.407143 | 0.511071 | 1.596429 | 2.351429 | 10.327500 |
38 | 0.169051 | 0.749293 | 0.474420 | 4.027536 | 2.407609 | 0.502174 | 1.563043 | 2.288043 | 10.147464 |
39 | 0.182789 | 0.733786 | 0.472057 | 4.065886 | 2.378595 | 0.490970 | 1.571572 | 2.206020 | 10.080268 |
Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
(venv) [nmichas@my-pc]$ pip install seaborn
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
from clear_seasons_data import get_clear_seasons_data
sns_df = get_clear_seasons_data()
sns_df.sample(5)
 | ShortName | Height | Weight | Age | Tm | Pos | G | GS | MP | FG | ... | PF | PTS | plays_PG | plays_SG | plays_SF | plays_PF | plays_C | Score | Season_Numeric | Next_Season_Score |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
8429 | obannch01 | 195.6 | 209 | 22.0 | DET | SG | 30.0 | 0.0 | 7.8 | 0.9 | ... | 0.5 | 2.1 | 0 | 1 | 0 | 0 | 0 | 5.075 | 48 | 7.425 |
413 | arenagi01 | 190.5 | 191 | 22.0 | WAS | PG | 55.0 | 52.0 | 37.6 | 6.5 | ... | 3.2 | 19.6 | 1 | 0 | 0 | 0 | 0 | 34.950 | 54 | 44.800 |
820 | battito01 | 210.8 | 230 | 33.0 | NJN | C | 15.0 | 0.0 | 8.9 | 0.9 | ... | 1.3 | 2.4 | 0 | 0 | 0 | 1 | 1 | 5.100 | 60 | 7.250 |
10449 | smithke01 | 190.5 | 170 | 24.0 | TOT | PG | 79.0 | 51.0 | 30.6 | 4.8 | ... | 1.8 | 11.9 | 1 | 0 | 0 | 0 | 0 | 26.475 | 40 | 36.925 |
7160 | marjabo01 | 221.0 | 290 | 27.0 | SAS | C | 54.0 | 4.0 | 9.4 | 1.9 | ... | 1.0 | 5.5 | 0 | 0 | 0 | 0 | 1 | 12.500 | 66 | 12.450 |
5 rows × 39 columns
sns.distplot(sns_df['STL'])
<matplotlib.axes._subplots.AxesSubplot at 0x7fce611a55c0>
sns.jointplot(x='BLK', y='TRB', data=sns_df, kind='reg')
<seaborn.axisgrid.JointGrid at 0x7fce610651d0>
test_df = sns_df[['Pos', 'TRB','AST','STL']]
sns.pairplot(test_df, hue='Pos', diag_kind='hist', hue_order=['PG', 'SG', 'SF', 'PF', 'C'])
<seaborn.axisgrid.PairGrid at 0x7fce61dbdeb8>
sns.barplot(x='plays_PG', y='AST', data=sns_df)
<matplotlib.axes._subplots.AxesSubplot at 0x7fce6259a6a0>
sns.boxplot(x='Pos', y='TRB', data=sns_df, order=['PG', 'SG', 'SF', 'PF', 'C'])
<matplotlib.axes._subplots.AxesSubplot at 0x7fce6242a240>
sns.heatmap(sns_df[['TRB','AST','STL','BLK','TOV', '2P%', '3P%', 'eFG%', 'PF', 'Score']].corr(), annot=True)
<matplotlib.axes._subplots.AxesSubplot at 0x7fce623953c8>
Plotly's Python graphing library makes interactive, publication-quality graphs online.
(venv) [nmichas@my-pc]$ pip install plotly
(venv) [nmichas@my-pc]$ pip install cufflinks
import pandas as pd
import numpy as np
import cufflinks as cf
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
cf.go_offline()
%matplotlib inline
plotly_df = get_clear_seasons_data()
plotly_df.iplot(
kind='scatter', x='Age', y='Score', text='ShortName', mode='markers',
layout={'autosize':False, 'width':800, 'height':600, 'hovermode': 'closest'})
# plotly and cufflinks do not work well with Jupyter slides
Scikit-learn is a free software machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.
(venv) [nmichas@my-pc]$ pip install scikit-learn
In statistical modeling, regression analysis is a set of statistical processes for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables, when the focus is on the relationship between a dependent variable and one or more independent variables (or 'predictors'). More specifically, regression analysis helps one understand how the typical value of the dependent variable (or 'criterion variable') changes when any one of the independent variables is varied, while the other independent variables are held fixed.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from clear_seasons_data import get_clear_final_data
def get_train_test_datasets(df):
max_season = df['Season_Numeric'].max()
df_train = df[df['Season_Numeric'] != max_season]
df_test = df[df['Season_Numeric'] == max_season]
X_train = df_train.drop(columns=['Next_Season_Score'])
y_train = df_train['Next_Season_Score']
X_test = df_test.drop(columns=['Next_Season_Score'])
y_test = df_test['Next_Season_Score']
return X_train, X_test, y_train, y_test
regr_df = get_clear_final_data()
X_train, X_test, y_train, y_test = get_train_test_datasets(regr_df)
X_train.head(5)
 | Height | Weight | Age | G | GS | MP | FG | FGA | FG% | 3P | ... | BLK | TOV | PF | PTS | plays_PG | plays_SG | plays_SF | plays_PF | plays_C | Season_Numeric |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 208.3 | 240 | 22.0 | 43.0 | 0.0 | 6.7 | 1.3 | 2.7 | 0.474 | 0.0 | ... | 0.3 | 0.5 | 0.9 | 3.1 | 0 | 0 | 0 | 1 | 0 | 41 |
1 | 208.3 | 240 | 23.0 | 71.0 | 1.0 | 13.2 | 2.5 | 5.1 | 0.493 | 0.0 | ... | 0.2 | 0.9 | 1.9 | 6.1 | 0 | 0 | 0 | 1 | 0 | 42 |
2 | 208.3 | 240 | 24.0 | 75.0 | 52.0 | 17.5 | 3.3 | 6.3 | 0.518 | 0.0 | ... | 0.3 | 1.3 | 2.5 | 7.7 | 0 | 0 | 0 | 1 | 0 | 43 |
3 | 208.3 | 240 | 25.0 | 13.0 | 0.0 | 12.2 | 1.8 | 4.2 | 0.436 | 0.0 | ... | 0.2 | 1.3 | 1.5 | 4.9 | 0 | 0 | 0 | 1 | 0 | 44 |
4 | 218.4 | 225 | 34.0 | 76.0 | 76.0 | 35.2 | 9.9 | 17.1 | 0.579 | 0.0 | ... | 2.7 | 3.0 | 2.9 | 23.9 | 0 | 0 | 0 | 0 | 1 | 32 |
5 rows × 34 columns
y_train.head(5)
0    12.325
1    14.950
2     8.350
3     8.500
4    44.350
Name: Next_Season_Score, dtype: float64
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train,y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
lm.intercept_
19.390105326328374
lm.coef_
array([-1.62823036e-02, 4.15601307e-03, -4.21215753e-01, 2.53078338e-02, -9.78700150e-03, -2.22532612e-01, 6.52200483e-01, -2.20696560e-01, -5.15913294e+00, 3.70380493e+00, -9.80280912e-01, -1.36486384e-01, 2.23047050e+00, -7.86904051e-01, 5.18996156e+00, -7.18298303e+00, 1.43034951e+00, -6.28300398e-01, -5.12766833e-01, 2.47388821e+00, 2.50854260e+00, -1.25279451e+00, 1.89030934e+00, 3.21056009e+00, 2.68854892e+00, -1.17614684e+00, -7.65141144e-01, 7.58430663e-01, 8.48358157e-01, 8.18898330e-01, 7.68414798e-01, 4.24741807e-01, 9.26033901e-01, -1.82307125e-03])
coeff_df = pd.DataFrame(lm.coef_,X_train.columns,columns=['Coefficient'])
coeff_df.transpose()
 | Height | Weight | Age | G | GS | MP | FG | FGA | FG% | 3P | ... | BLK | TOV | PF | PTS | plays_PG | plays_SG | plays_SF | plays_PF | plays_C | Season_Numeric |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Coefficient | -0.016282 | 0.004156 | -0.421216 | 0.025308 | -0.009787 | -0.222533 | 0.6522 | -0.220697 | -5.159133 | 3.703805 | ... | 2.688549 | -1.176147 | -0.765141 | 0.758431 | 0.848358 | 0.818898 | 0.768415 | 0.424742 | 0.926034 | -0.001823 |
1 rows × 34 columns
predictions = lm.predict(X_test)
plt.scatter(y_test,predictions)
<matplotlib.collections.PathCollection at 0x7fce563cecc0>
sns.distplot((y_test-predictions),bins=50)
<matplotlib.axes._subplots.AxesSubplot at 0x7fce57422d68>
from sklearn import metrics
print('Mean Absolute Error :', metrics.mean_absolute_error(y_test, predictions))
print('Mean Squared Error :', metrics.mean_squared_error(y_test, predictions))
print('Root Mean Squared Error :', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
Mean Absolute Error : 4.90991732806849
Mean Squared Error : 39.127676770519244
Root Mean Squared Error : 6.255211968472311
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from clear_seasons_data import get_clear_final_data
compare_df = get_clear_final_data()
X = compare_df.drop(columns=['Next_Season_Score'])
y = compare_df['Next_Season_Score']
X.head()
 | Height | Weight | Age | G | GS | MP | FG | FGA | FG% | 3P | ... | BLK | TOV | PF | PTS | plays_PG | plays_SG | plays_SF | plays_PF | plays_C | Season_Numeric |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 208.3 | 240 | 22.0 | 43.0 | 0.0 | 6.7 | 1.3 | 2.7 | 0.474 | 0.0 | ... | 0.3 | 0.5 | 0.9 | 3.1 | 0 | 0 | 0 | 1 | 0 | 41 |
1 | 208.3 | 240 | 23.0 | 71.0 | 1.0 | 13.2 | 2.5 | 5.1 | 0.493 | 0.0 | ... | 0.2 | 0.9 | 1.9 | 6.1 | 0 | 0 | 0 | 1 | 0 | 42 |
2 | 208.3 | 240 | 24.0 | 75.0 | 52.0 | 17.5 | 3.3 | 6.3 | 0.518 | 0.0 | ... | 0.3 | 1.3 | 2.5 | 7.7 | 0 | 0 | 0 | 1 | 0 | 43 |
3 | 208.3 | 240 | 25.0 | 13.0 | 0.0 | 12.2 | 1.8 | 4.2 | 0.436 | 0.0 | ... | 0.2 | 1.3 | 1.5 | 4.9 | 0 | 0 | 0 | 1 | 0 | 44 |
4 | 218.4 | 225 | 34.0 | 76.0 | 76.0 | 35.2 | 9.9 | 17.1 | 0.579 | 0.0 | ... | 2.7 | 3.0 | 2.9 | 23.9 | 0 | 0 | 0 | 0 | 1 | 32 |
5 rows × 34 columns
y.head()
0    12.325
1    14.950
2     8.350
3     8.500
4    44.350
Name: Next_Season_Score, dtype: float64
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
a = 0.9
methods = [
('linear regression', LinearRegression()),
('lasso', Lasso(fit_intercept=True, alpha=a)),
('ridge', Ridge(fit_intercept=True, alpha=a)),
('elastic-net', ElasticNet(fit_intercept=True, alpha=a))
]
for name,met in methods:
met.fit(X,y)
p = met.predict(X)
e = p-y
total_error = np.dot(e,e)
rmse_train = np.sqrt(total_error/len(p))
print('Method: %s' %name)
print('RMSE on training: %.4f' %rmse_train)
print("\n")
Method: linear regression
RMSE on training: 6.0929

Method: lasso
RMSE on training: 6.4137

Method: ridge
RMSE on training: 6.0930

Method: elastic-net
RMSE on training: 6.4085
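The RMSE values above are computed on the training data, which flatters every model. KFold (imported above but not used) gives an out-of-sample estimate instead; a sketch on synthetic data standing in for X and y:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Synthetic regression problem standing in for X and y.
rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.randn(200) * 0.1

rmses = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    err = model.predict(X[test_idx]) - y[test_idx]
    rmses.append(float(np.sqrt(np.mean(err ** 2))))

print(np.mean(rmses))  # close to the noise level (0.1)
```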
Keras is a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano. It was developed with a focus on enabling fast experimentation. Being able to go from idea to result with the least possible delay is key to doing good research.
TensorFlow™ is an open source software library for high performance numerical computation. It was originally developed by researchers and engineers from the Google Brain team within Google's AI organization.
import numpy
import pandas
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
Using TensorFlow backend.
from clear_seasons_data import get_clear_final_data, get_train_test_datasets
keras_df = get_clear_final_data()
X_train, X_test, y_train, y_test = get_train_test_datasets(keras_df)
# define base model
def baseline_model():
# create model
model = Sequential()
model.add(Dense(30, input_dim=34, kernel_initializer='normal', activation='relu'))
model.add(Dense(10, kernel_initializer='normal', activation='relu'))
model.add(Dense(1, kernel_initializer='normal'))
# Compile model
model.compile(loss='mean_squared_error', optimizer='adam')
return model
estimator = KerasRegressor(build_fn=baseline_model, epochs=200, batch_size=256, verbose=2)
estimator.fit(X_train, y_train)
Epoch 1/200 - 0s - loss: 459.5953
Epoch 2/200 - 0s - loss: 132.8051
Epoch 3/200 - 0s - loss: 99.5188
Epoch 4/200 - 0s - loss: 82.6606
Epoch 5/200 - 0s - loss: 66.9503
...
Epoch 172/200 - 0s - loss: 36.5795
Epoch 173/200 - 0s - loss: 36.6202
Epoch 174/200 - 0s - loss: 36.8924
...
175/200 - 0s - loss: 36.5441 Epoch 176/200 - 0s - loss: 36.6195 Epoch 177/200 - 0s - loss: 36.6718 Epoch 178/200 - 0s - loss: 36.7881 Epoch 179/200 - 0s - loss: 36.4715 Epoch 180/200 - 0s - loss: 36.4860 Epoch 181/200 - 0s - loss: 36.5994 Epoch 182/200 - 0s - loss: 36.4601 Epoch 183/200 - 0s - loss: 36.5082 Epoch 184/200 - 0s - loss: 36.4787 Epoch 185/200 - 0s - loss: 36.6551 Epoch 186/200 - 0s - loss: 36.5175 Epoch 187/200 - 0s - loss: 36.7392 Epoch 188/200 - 0s - loss: 36.5299 Epoch 189/200 - 0s - loss: 36.4365 Epoch 190/200 - 0s - loss: 36.5772 Epoch 191/200 - 0s - loss: 36.5714 Epoch 192/200 - 0s - loss: 36.4685 Epoch 193/200 - 0s - loss: 36.5741 Epoch 194/200 - 0s - loss: 36.5005 Epoch 195/200 - 0s - loss: 36.4790 Epoch 196/200 - 0s - loss: 36.4878 Epoch 197/200 - 0s - loss: 36.6152 Epoch 198/200 - 0s - loss: 36.8434 Epoch 199/200 - 0s - loss: 36.5095 Epoch 200/200 - 0s - loss: 36.5246
<keras.callbacks.History at 0x7fce461f1978>
predictions = estimator.predict(X_test)
import matplotlib.pyplot as plt
%matplotlib inline
plt.scatter(y_test, predictions)
<matplotlib.collections.PathCollection at 0x7fce45e511d0>
from sklearn import metrics
import numpy as np
print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
MAE: 4.911167843974367
MSE: 39.100205451025126
RMSE: 6.253015708522179
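For reference, all three error metrics can be computed directly with NumPy. A minimal sketch on toy data (the arrays below are invented for illustration, not the notebook's test set):

```python
import numpy as np

# Hypothetical actual vs. predicted fantasy scores (illustration only)
y_true = np.array([50.0, 42.0, 38.5, 30.0])
y_pred = np.array([48.0, 45.0, 36.5, 33.0])

errors = y_pred - y_true
mae = np.mean(np.abs(errors))   # mean absolute error
mse = np.mean(errors ** 2)      # mean squared error
rmse = np.sqrt(mse)             # root mean squared error

print('MAE:', mae)    # 2.5
print('MSE:', mse)    # 6.5
print('RMSE:', rmse)
```

Because RMSE squares errors before averaging, it penalizes large misses more heavily than MAE, which is why RMSE (6.25) is noticeably higher than MAE (4.91) in the output above.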
Use the predictions
from clear_seasons_data import get_clear_final_data, get_train_test_datasets
final_df = get_clear_final_data(with_labels=True)
X_train, X_test, y_train, y_test = get_train_test_datasets(final_df)
labels_train = X_train[['Player']].copy()
labels_test = X_test[['Player']].copy()
X_train.drop(columns=['Player'], inplace=True)
X_test.drop(columns=['Player'], inplace=True)
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train, y_train)
predictions = lm.predict(X_test)
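As a sanity check, the same fit/predict pattern recovers a known relationship on synthetic data. The data here is made up for illustration; with noiseless inputs the least-squares solution matches the true coefficients exactly:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data with a known linear relationship: y = 3*x1 + 2*x2 + 1
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3 * X[:, 0] + 2 * X[:, 1] + 1

lm = LinearRegression()
lm.fit(X, y)

print(np.round(lm.coef_, 2))    # close to [3. 2.]
print(round(lm.intercept_, 2))  # close to 1.0
```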
labels_test['Prediction'] = predictions
prediction_df_sorted = labels_test.sort_values(by='Prediction', ascending=False)
prediction_df_sorted[:10]
|  | Player | Prediction |
|---|---|---|
| 11892 | Russell Westbrook | 64.568737 |
| 4664 | James Harden | 61.799162 |
| 2719 | Anthony Davis | 56.979021 |
| 5627 | LeBron James | 54.898267 |
| 369 | Giannis Antetokounmpo | 54.394127 |
| 3285 | Kevin Durant | 53.026744 |
| 2441 | DeMarcus Cousins | 52.271180 |
| 11256 | Karl-Anthony Towns | 51.269578 |
| 11706 | John Wall | 50.722309 |
| 1737 | Jimmy Butler | 47.564992 |
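The sort-and-slice pattern used above (top 10, then the next 10) can be sketched on a toy frame; the player names and scores below are placeholders:

```python
import pandas as pd

df = pd.DataFrame({
    'Player': ['A', 'B', 'C', 'D', 'E'],
    'Prediction': [41.0, 64.5, 47.3, 56.9, 61.7],
})

# Rank players by predicted score, best first
ranked = df.sort_values(by='Prediction', ascending=False)
top_3 = ranked[:3]     # best three predictions
next_2 = ranked[3:5]   # the following two

print(top_3['Player'].tolist())  # ['B', 'E', 'D']
```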
prediction_df_sorted[10:20]
|  | Player | Prediction |
|---|---|---|
| 6670 | Kawhi Leonard | 47.317304 |
| 2612 | Stephen Curry | 46.954240 |
| 11023 | Isaiah Thomas | 46.403544 |
| 6755 | Damian Lillard | 46.100491 |
| 5458 | Kyrie Irving | 43.599142 |
| 5971 | Nikola Jokic | 43.551978 |
| 8811 | Chris Paul | 43.180978 |
| 2959 | DeMar DeRozan | 42.571234 |
| 6902 | Kyle Lowry | 41.559765 |
| 4081 | Paul George | 41.363313 |
last_season_df = pd.read_csv('current.csv')
from clear_seasons_data import get_score
last_season_df['Score'] = last_season_df.apply(get_score, axis=1)
last_season_df = last_season_df.sort_values(by='Score', ascending=False)[['Player', 'Score']]
last_season_df.head(10)
|  | Player | Score |
|---|---|---|
| 84 | Anthony Davis | 58.750 |
| 179 | LeBron James | 55.575 |
| 103 | Joel Embiid | 54.550 |
| 9 | Giannis Antetokounmpo | 53.900 |
| 82 | Stephen Curry | 53.700 |
| 100 | Kevin Durant | 52.900 |
| 90 | DeMar DeRozan | 52.775 |
| 351 | Russell Westbrook | 52.350 |
| 141 | James Harden | 51.175 |
| 139 | Blake Griffin | 50.075 |
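The `get_score` function comes from this project's `clear_seasons_data` module, so its exact weights aren't shown here. The row-wise `apply(..., axis=1)` pattern it relies on can be sketched with a hypothetical scoring rule (the column names and weights below are invented, not the project's actual formula):

```python
import pandas as pd

# Hypothetical scoring rule, standing in for the project's get_score
def get_score(row):
    # Fantasy-style weighting: points + 1.2*rebounds + 1.5*assists
    return row['PTS'] + 1.2 * row['TRB'] + 1.5 * row['AST']

df = pd.DataFrame({
    'Player': ['X', 'Y'],
    'PTS': [25.0, 20.0],
    'TRB': [10.0, 5.0],
    'AST': [5.0, 10.0],
})

# apply with axis=1 passes each row to get_score and collects the results
df['Score'] = df.apply(get_score, axis=1)
df = df.sort_values(by='Score', ascending=False)[['Player', 'Score']]
print(df)  # X scores 44.5, Y scores 41.0
```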