Much of fan discourse around refereeing is based on emotion. Many claims made by fans, however, implicitly assert a relationship between variables. For instance, the oft-repeated claim that "star players get more calls" asserts a positive correlation between a player's stardom and the number of calls they receive. The goal of this project is to analyze some of the claims commonly repeated by NBA fans and determine whether there is any truth to them.
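As a preview of the kind of check this amounts to, here is a toy example (with made-up numbers, not real NBA data) of turning such a claim into a testable correlation:

```python
import numpy as np
from scipy import stats

# Toy example of turning a fan claim into a testable correlation:
# does "stardom" (a made-up rating here) move with foul calls drawn?
stardom = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
calls = np.array([2.1, 2.0, 2.4, 2.2, 2.3])  # made-up per-game call counts

result = stats.linregress(stardom, calls)
r_squared = result.rvalue ** 2  # near 0 would mean little linear relationship
```

This is the same `scipy.stats.linregress` test used throughout the analysis below.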
In this project, I will attempt to see if commonly held beliefs about referees among fans of the NBA are true. In particular, these beliefs include the opinions that star players receive more favorable calls, that some referees are bad or biased, that home teams receive favorable treatment, and that fouls are called differently in the playoffs.
Using this information, I will then attempt to use k-Nearest Neighbors to see if we can predict the total number of foul calls made in a game.
Before we can even determine if there is a relationship between a player's stardom and the number of foul calls they receive, we first need to define what a "star player" is. To do so, I will be using common single-value player metrics. These metrics attempt to provide a single number that quantifies how talented a player is. The single-value player metrics I will be using are Win Shares per 48 (WS/48) and Value Over Replacement Player (VORP). Notably, I am not using Player Efficiency Rating (PER) here, as it seems to be no longer used in basketball analytics.
The higher these values are, the more of a star the corresponding player is. Therefore, we will attempt to see if there is a relationship between these values and the average number of foul calls per game. I imported one data set from Basketball Reference that included the average number of personal foul calls a player received, along with another data set from Basketball Reference that was primarily focused on single-value player metrics.
# Import data for BBall Ref
import pandas as pd
import numpy as np
url_br_2021_per36 = 'https://raw.githubusercontent.com/NVaduthala3/cmps-6160-project/main/bball_ref_2021_per36.csv' # '2021' here means 2021-22 season
df_br_2021_per36 = pd.read_csv(url_br_2021_per36) # DataFrame containing players' average stats
url_br_2021_adv = 'https://raw.githubusercontent.com/NVaduthala3/cmps-6160-project/main/bball_ref_2021_advanced.csv'
df_br_2021_adv = pd.read_csv(url_br_2021_adv) # DataFrame containing single-value metrics
df_br_2021_per36.head()
After importing the data for the 2021-22 NBA season, we now need to remove any data points that cannot provide us value. To account for the fact that different players play different numbers of minutes, and will therefore most likely accumulate different numbers of foul calls, I imported the data set that proportionally normalizes players' average statistics to a per-36-minute basis. However, for players who averaged fewer than 15 minutes per game or played fewer than 35 games, this normalization is not meaningful, as they have not played enough for a realistic extrapolation. Therefore, I want to remove these players from the dataset.
But before I can do that, I first have to create a new column that describes how many minutes per game a player averaged, because the dataset containing the players' average statistics did not contain this information. It did, however, contain the total number of minutes played ('MP') and the total number of games played ('G'), so from these I was able to compute the average number of minutes played per game ('MPG').
But, because the data set I am importing performs some extrapolation, there are some illogical values in it. I will force every value greater than 6 in the personal fouls ('PF') column of the dataset to be equal to 6, as a player cannot receive more than 6 fouls in an NBA game.
# clean up data for BBall Ref
df_br_2021_per36['MPG'] = df_br_2021_per36.apply(lambda row: row['MP'] / row['G'], axis = 1) # calculate minutes per game
df_br_2021_per36 = df_br_2021_per36.drop(df_br_2021_per36[(df_br_2021_per36['MPG'] < 15) | (df_br_2021_per36['G'] < 35)].index) # remove players with insufficient games/minutes played
df_br_2021_per36.loc[df_br_2021_per36['PF'] > 6, 'PF'] = 6 # impossible to have more than 6 fouls
df_br_2021_adv_pf = df_br_2021_adv.merge(df_br_2021_per36, how = 'inner') # add personal fouls column onto advanced stats dataframe
df_br_2021_adv_pf.head()
From here, we can see some results.
# Plot WS/48 vs PF
import matplotlib.pyplot as plt
from scipy import stats
df_br_2021_adv_pf.plot.scatter(x = 'WS/48', y = 'PF')
result = stats.linregress(x = df_br_2021_adv_pf['WS/48'], y = df_br_2021_adv_pf['PF'])
plt.plot(df_br_2021_adv_pf['WS/48'], result.intercept + result.slope*(df_br_2021_adv_pf['WS/48']), 'r')
plt.text(-0.05, 5.5, 'r^2 = %0.2f' % (result.rvalue) ** 2)
plt.title('WS/48 vs (Adjusted) Personal Fouls/Game')
plt.xlabel('Win Shares/48')
plt.ylabel('Adjusted Personal Fouls/Game')
We can see from the graph that there does not appear to be any relationship between an increased Win Shares per 48--and therefore increased stardom--and the average number of personal fouls called on the player. Although there is a positive correlation, the r-squared value is small enough that we cannot make a claim of there being any relation.
df_br_2021_adv_pf.plot.scatter(x = 'VORP', y = 'PF')
result2 = stats.linregress(x = df_br_2021_adv_pf['VORP'], y = df_br_2021_adv_pf['PF'])
plt.plot(df_br_2021_adv_pf['VORP'], result2.intercept + result2.slope*(df_br_2021_adv_pf['VORP']), 'r')
plt.text(2, 5.5, 'r^2 = %0.2f' % (result2.rvalue) ** 2)
plt.title('VORP vs (Adjusted) Personal Fouls/Game')
plt.xlabel('VORP')
plt.ylabel('Adjusted Personal Fouls/Game')
As with the first graph, we cannot conclude any definitive relationship between a player's stardom and the number of foul calls they receive in a game. In this case the correlation is negative, but the r-squared value is once again too small for us to claim a relationship.
One issue with our analysis is that we are extrapolating from averages. Another is that we are comparing only the number of foul calls to the single-value metrics. It may be the case that a player received a lot of calls in their favor and that all of those calls were correct.
In order to remedy these issues, I will be using the NBA's Last 2 Minutes dataset to look at the actual favorable calls made, and see whether there is a relationship between the single-value player metrics and the number of favorable calls. The dataset I will be using was organized by GitHub user atlhawksfanatic. It lists the foul calls made in the last two minutes of every game and states whether each was the correct call to make or not.
In particular, what I will be counting is the net number of favorable calls. In the NBA's Last 2 Minutes dataset, there are columns that state which player was disadvantaged and which player committed the foul. As a result, I can count how many times each player was in either an advantageous or a disadvantageous position in a foul call. By assigning the value 1 to an advantageous position and -1 to a disadvantageous position, I will sum over all the situations that a player was in, and this will be their net number of favorable calls. From there, I will see if there is a relationship between this variable and the single-value player metrics.
Before I can do this, I must first remove all the rows that contain data not in the 2021-22 season.
# Import data from NBA L2M, provided by atlhawksfanatic
url_l2m = 'https://raw.githubusercontent.com/atlhawksfanatic/L2M/master/1-tidy/L2M/L2M_stats_nba.csv'
df_l2m = pd.read_csv(url_l2m)
df_l2m = df_l2m.drop((df_l2m[(df_l2m['season'] != 2022) | (df_l2m['playoff'] == True)].index)) #'2022' here means 2021-22 season, and only want Regular Season
df_l2m.head()
As one can see, this code currently produces a DtypeWarning. This should not be an issue, as it only affects columns that we are not interested in. However, the dataset also contains violations that are not fouls. We are not interested in these, so we will remove all rows describing a referee call that is not a foul. In addition, some rows contain team names where there should be player names, so those also have to be removed.
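As an aside, the warning itself can be avoided by asking pandas to infer dtypes over the whole file at once rather than in chunks. A minimal sketch on a toy CSV (on the real data, `low_memory=False` would be passed to the `read_csv` call above):

```python
import io
import pandas as pd

# Toy CSV standing in for the L2M file: the second column mixes integers and
# text, the kind of column that triggers a DtypeWarning on large files
csv_text = "decision,period\nCC,4\nINC,4\nCC,OT\n"

# low_memory=False reads the whole file before inferring dtypes, so mixed-type
# columns come back as a single object column with no DtypeWarning
df = pd.read_csv(io.StringIO(csv_text), low_memory=False)
```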
df_l2m = df_l2m.dropna(subset=['call_type', 'committing', 'disadvantaged', # Remove NaN values in columns we care about
'decision'])
df_l2m = df_l2m[df_l2m['committing'] != df_l2m['disadvantaged']] # Avoid possible repeats
# remove calls that are not fouls, like delay of game or a turnover
df_l2m = df_l2m[df_l2m['call_type'].str.contains('Foul', case=False)]
# removes all possible team names, except the Trail Blazers
mask_commit = df_l2m['committing'].str.count(' ').ge(1)
df_l2m = df_l2m[mask_commit]
mask_disadv = df_l2m['disadvantaged'].str.count(' ').ge(1)
df_l2m = df_l2m[mask_disadv]
df_l2m = df_l2m[(df_l2m['committing'] != 'Trail Blazers') & (df_l2m['disadvantaged'] != 'Trail Blazers')]
Now, we say that a player received a favorable call if they are called committing a foul and the decision is seen as an incorrect non-call, or if they are the disadvantaged player and received an incorrect call. We say that a player received an unfavorable call if they are called committing a foul and the decision is seen as an incorrect call, or if they are the disadvantaged player and the decision is seen as an incorrect non-call. All other cases can be considered to be a neutral (correct) call and so they will automatically have a value of 0.
However, we must also take into account that different players play different amounts of time in the last two minutes, so the number of foul situations they are involved in will also differ; it may therefore not be fair to directly compare raw net favorable call counts between players. To mitigate this, we will instead look at the net number of favorable calls per 100 possessions: we find the net number of favorable calls, divide it by the number of possessions in which the player was involved in a foul call, and multiply by 100.
df_fav = pd.DataFrame(pd.concat([df_l2m['committing'], df_l2m['disadvantaged']], ignore_index = True).unique(),
                      columns = ['Player'])
# initialize the count columns with 0 so the later arithmetic stays numeric
df_fav['favorable count'] = 0
df_fav['unfavorable count'] = 0
df_fav['possession count'] = 0
df_fav = df_fav.set_index('Player')
for player in df_fav.index:
    # favorable: an uncalled foul they committed (INC), or an incorrect call
    # in their favor when they were the disadvantaged player (IC)
    fav_count = df_l2m.loc[
        (df_l2m['committing'].eq(player) & df_l2m['decision'].eq('INC')) |
        (df_l2m['disadvantaged'].eq(player) & df_l2m['decision'].eq('IC'))
    ]
    # unfavorable: the mirror image of the above
    unfav_count = df_l2m.loc[
        (df_l2m['committing'].eq(player) & df_l2m['decision'].eq('IC')) |
        (df_l2m['disadvantaged'].eq(player) & df_l2m['decision'].eq('INC'))
    ]
    poss_count = df_l2m['committing'].eq(player).sum() + df_l2m['disadvantaged'].eq(player).sum()
    df_fav.at[player, 'possession count'] = poss_count
    df_fav.at[player, 'favorable count'] = len(fav_count)
    df_fav.at[player, 'unfavorable count'] = len(unfav_count)
df_fav['net favorable count'] = df_fav['favorable count'] - df_fav['unfavorable count']
df_fav['net fav count per 100 poss'] = df_fav['net favorable count'] / df_fav['possession count'] * 100
df_fav = df_fav.reset_index()
df_fav.shape
We also need to consider that some players have not been involved in enough possessions for their net number of favorable calls per 100 possessions to be a fair extrapolation. To fix this, we will drop all players who have been involved in fewer than 5 possessions. We use 5 because any lower and we would no longer be fixing the problem, while any higher and we risk cutting off meaningful data. The choice of 5 as the baseline is also justified by the fact that the 25th percentile is 7 and the 50th percentile is 22, so the vast majority of the data is above this benchmark.
pd.to_numeric(df_fav['possession count'], errors = 'coerce').describe()
df_fav = df_fav[df_fav['possession count'] > 4]
From here, we now need to join this DataFrame with the df_br_2021_adv_pf DataFrame in order to see if there is a relationship between Win Shares per 48 and VORP.
df_br_2021_adv_pf_fav = df_br_2021_adv.merge(df_fav, how = 'inner')
df_br_2021_adv_pf_fav.plot.scatter(x = 'WS/48', y = 'net fav count per 100 poss')
result = stats.linregress(x = df_br_2021_adv_pf_fav['WS/48'], y = pd.to_numeric(df_br_2021_adv_pf_fav['net fav count per 100 poss']))
plt.plot(df_br_2021_adv_pf_fav['WS/48'], result.intercept + result.slope*(df_br_2021_adv_pf_fav['WS/48']), 'r')
plt.text(-0.25, 5.5, 'r^2 = %0.2f' % (result.rvalue) ** 2)
plt.title('WS/48 vs Favorable Calls Per 100 Poss')
plt.xlabel('Win Shares/48')
plt.ylabel('Favorable Calls Per 100 Poss')
df_br_2021_adv_pf_fav.plot.scatter(x = 'VORP', y = 'net fav count per 100 poss')
result = stats.linregress(x = df_br_2021_adv_pf_fav['VORP'], y = pd.to_numeric(df_br_2021_adv_pf_fav['net fav count per 100 poss']))
plt.plot(df_br_2021_adv_pf_fav['VORP'], result.intercept + result.slope*(df_br_2021_adv_pf_fav['VORP']), 'r')
plt.text(5, 6.5, 'r^2 = %0.2f' % (result.rvalue) ** 2)
plt.title('VORP vs Favorable Calls Per 100 Poss')
plt.xlabel('VORP')
plt.ylabel('Favorable Calls Per 100 Poss')
We can see that there seems to be no relationship between the net number of favorable calls per 100 possessions and VORP, nor between it and Win Shares per 48.
Based upon our analysis, we can conclude that there seems to be no evidence of a relationship between a player's stardom and whether they receive more beneficial foul calls.
Our next goal is to determine if the calls made by some referees show evidence of statistical bias, which would make them candidates for being considered bad referees. We will first use the Last 2 Minutes dataset to determine the average number of net correct calls made by the referees, defined as the difference between the number of correct calls and the number of incorrect calls, and from there determine whether any referees are statistical outliers. We will then determine the average number of calls made and again look for statistical outliers.
To find the average number of net correct calls made, we first need to find all the individual games in the dataset, and then find the number of correct and incorrect foul calls within each game. Finally, we take the difference between the correct and incorrect calls in each game and divide the total by the number of games.
# Import data from NBA L2M, provided by atlhawksfanatic
url_l2m = 'https://raw.githubusercontent.com/atlhawksfanatic/L2M/master/1-tidy/L2M/L2M_stats_nba.csv'
df_l2m_ref = pd.read_csv(url_l2m)
df_l2m_ref = df_l2m_ref.drop((df_l2m_ref[(df_l2m_ref['season'] != 2022) | (df_l2m_ref['playoff'] == True)].index)) #'2022' here means 2021-22 season, and only want Regular Season
df_l2m_ref['game_id'].head()
# separate DataFrame into individual games
df_games_list = []
for game_id in df_l2m_ref['game_id'].unique():
    df_games_list.append(df_l2m_ref[df_l2m_ref['game_id'] == game_id])
df_game_info = pd.DataFrame()
df_game_info['game id'] = df_l2m_ref['game_id'].unique()
df_game_info['correct calls'] = ' '
df_game_info['incorrect calls'] = ' '
df_game_info['net correct calls'] = ' '
df_game_info['ref 1'] = ' '
df_game_info['ref 2'] = ' '
df_game_info['ref 3'] = ' '
# calculate calls made in each game
call_list = [pd.DataFrame(game['decision'].value_counts()) for game in df_games_list]
call_list = [call.reset_index() for call in call_list]
for call in call_list:
    call.columns = ['type', 'count']
for call in call_list:
    # make sure every decision type is present, even with a count of 0
    call_types = ['IC', 'INC', 'CC', 'CNC']
    for call_t in call_types:
        if call_t not in call.values:
            call.loc[len(call.index)] = [call_t, 0]
for call in call_list:
    call.set_index('type', inplace=True)
# add ref names to dataframe
correct_call_list = [call.loc['CNC', 'count'] + call.loc['CC', 'count'] for call in call_list]
incorrect_call_list = [call.loc['IC', 'count'] + call.loc['INC', 'count'] for call in call_list]
ref_1_list = [game[['OFFICIAL_1']].loc[game.first_valid_index(), 'OFFICIAL_1'] for game in df_games_list]
ref_2_list = [game[['OFFICIAL_2']].loc[game.first_valid_index(), 'OFFICIAL_2'] for game in df_games_list]
ref_3_list = [game[['OFFICIAL_3']].loc[game.first_valid_index(), 'OFFICIAL_3'] for game in df_games_list]
df_game_info['correct calls'] = correct_call_list
df_game_info['incorrect calls'] = incorrect_call_list
df_game_info['ref 1'] = ref_1_list
df_game_info['ref 2'] = ref_2_list
df_game_info['ref 3'] = ref_3_list
df_game_info['net correct calls'] = df_game_info['correct calls'] - df_game_info['incorrect calls']
df_game_info.head()
What we will now do is associate each referee with the number of net correct calls made in the games they oversaw, along with their average net correct calls made per 2 minutes. We find this latter value in an attempt to take into account the fact that some referees oversee more games than others.
However, it should be mentioned that this analysis is flawed. With this method, I will be "sharing the blame." It could be the case that a referee did not actually make many, or any for that matter, incorrect calls, but they will still be labelled as doing so if they were working with a referee that made incorrect calls. To try and mitigate this issue, I will be dividing the number of correct and incorrect calls associated with each referee by 3. Thus, I am making the assumption that each ref is equally likely to make an incorrect call. This is a faulty assumption.
df_ref_info = pd.DataFrame()
df_ref_info['name'] = pd.concat([df_game_info['ref 1'], df_game_info['ref 2'], df_game_info['ref 3']], ignore_index = True).unique()
ref_info_list = [df_game_info.loc[(df_game_info['ref 1'] == name) | (df_game_info['ref 2'] == name) | (df_game_info['ref 3'] == name)] for name in list(df_ref_info['name'])]
cor_call_list = [ref['correct calls'].sum() for ref in ref_info_list]
df_ref_info['correct calls'] = np.divide(cor_call_list, 3)
incor_call_list = [ref['incorrect calls'].sum() for ref in ref_info_list]
df_ref_info['incorrect calls'] = np.divide(incor_call_list, 3)
net_cor_call_list = [ref['net correct calls'].sum() for ref in ref_info_list]
df_ref_info['net correct calls'] = np.divide(net_cor_call_list, 3)
df_ref_info['average net correct calls per 2 min'] = np.divide([ref['net correct calls'].sum() / len(ref) for ref in ref_info_list], 3)
df_ref_info['difference of individual avg NCC/2 min and overall'] = ( df_ref_info['average net correct calls per 2 min']
- (df_ref_info['average net correct calls per 2 min'].sum() / len(df_ref_info)))
df_ref_info.head()
We now remove all referees with fewer than 10 minutes refereed in this dataset, because there is not sufficient data to make any claims about their possible bias.
df_ref_info['minutes reffed'] = [2*len(ref) for ref in ref_info_list]
df_ref_info = df_ref_info[df_ref_info['minutes reffed'] > 9]
df_ref_info.head()
Let us now visualize how this data looks through the use of a scatter plot.
df_ref_info.plot.scatter(x = 'name', y = 'difference of individual avg NCC/2 min and overall')
plt.title('Difference Between Individual Average Net Correct Calls/2 min and Overall')
plt.tick_params(
axis='x',
which='both',
bottom=True,
top=False,
labelbottom=False)
plt.xlabel('Referee')
plt.ylabel('Difference')
As seen in the scatter plot, the vast majority of referees fall within one net call above or below the average. We cannot attribute such a small deviation to any bias, and so for the vast majority of referees there is not sufficient evidence to claim that they are biased, at least within the last two minutes.
With that being said, it is interesting to note that one data point sits significantly higher than the rest, a referee who consistently makes better calls than most, and another sits at the bottom, a referee who consistently makes worse calls than most.
print(df_ref_info.loc[df_ref_info['difference of individual avg NCC/2 min and overall'].idxmin()]) # idxmin/idxmax return index labels, so use .loc
print(df_ref_info.loc[df_ref_info['difference of individual avg NCC/2 min and overall'].idxmax()])
It appears that Bennie Adams, on average, consistently makes worse calls in the last 2 minutes than most referees, while Scott Twardoski makes better calls in the last two minutes than most referees. But once again, since this analysis does not specifically consider which referee made the incorrect call, and instead views the three referees in the game as one unit, it is not fair for me to say that Bennie Adams can be considered a bad referee.
A possible way to determine if a referee such as Bennie Adams is being negatively impacted in this analysis due to the other referees they are working with could be to see how many net correct calls their most frequent co-referees make, and determine if there is any relationship between the net number of correct calls a referee such as Bennie Adams make and the number that their co-referees make.
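As a rough sketch of that idea (using a toy stand-in for the df_game_info DataFrame built earlier; `co_referee_avg` is a hypothetical helper, not part of the analysis above):

```python
import pandas as pd

# Toy stand-in for the df_game_info DataFrame built earlier
df_game_info = pd.DataFrame({
    'ref 1': ['Bennie Adams', 'Bennie Adams', 'Ref C'],
    'ref 2': ['Ref B', 'Ref C', 'Ref D'],
    'ref 3': ['Ref C', 'Ref D', 'Ref B'],
    'net correct calls': [2, 5, 8],
})

def co_referee_avg(df, name):
    """Average net correct calls in games that `name` worked, i.e. the
    record this referee shares with their co-referees."""
    mask = (df['ref 1'] == name) | (df['ref 2'] == name) | (df['ref 3'] == name)
    return df.loc[mask, 'net correct calls'].mean()

# Compare a referee's shared record with that of their frequent partners
adams_avg = co_referee_avg(df_game_info, 'Bennie Adams')
```

Comparing this per-referee average against the averages of frequent co-referees would hint at whether a low number reflects the referee themselves or the crews they happened to work in.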
In order to develop a better understanding on if some referees exhibit some form of bias or not, we will now consider the overall average number of calls made by each individual referee. Let us first import the data.
url_br_2021_ref = 'https://raw.githubusercontent.com/NVaduthala3/cmps-6160-project/main/BBall-Ref_Ref.csv'
df_br_2021_ref = pd.read_csv(url_br_2021_ref)
df_br_2021_ref.drop(labels=0, axis=0, inplace=True)
df_br_2021_ref.reset_index(drop=True, inplace=True)
df_br_2021_ref['Per Game Relative.2'] = pd.to_numeric(df_br_2021_ref['Per Game Relative.2'])
df_br_2021_ref['Unnamed: 2'] = pd.to_numeric(df_br_2021_ref['Unnamed: 2'])
df_br_2021_ref = df_br_2021_ref[df_br_2021_ref['Unnamed: 2'] >= 40] # need sufficient amount of games
df_br_2021_ref.head()
The only columns we care about are "Unnamed: 0", "Per Game Relative.3", "Home Minus Visitor.3", and "Per Game.2". The first states the name of the referee. The second corresponds to the net difference between the number of foul calls the individual referee made per game and the league-average number. We will use this to see whether any referees are outliers in the average number of fouls called.
The third mentioned column corresponds to the average net difference between fouls called against the home team and fouls called against the away team, so a negative number indicates that away teams are called for more fouls on average. However, consider a situation where Referee 1 calls on average 4 fouls on the home team but 8 fouls on the away team, while Referee 2 calls on average 20 fouls on the home team but 25 fouls on the away team. Strictly comparing the differences would imply that Referee 2 has more of a bias toward home teams than Referee 1, but this does not tell the whole story. Therefore, we will normalize this difference using the fourth mentioned column, which gives the average total number of fouls called. To increase readability, we will represent this number as a percentage.
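To make this concrete, here is the computation for the two hypothetical referees above (these foul counts are illustrative, not values from the dataset):

```python
# Hypothetical Referee 1: 4 fouls on home, 8 on away (total 12)
# Hypothetical Referee 2: 20 fouls on home, 25 on away (total 45)
ref1_pct = 100 * (4 - 8) / (4 + 8)      # home minus visitor, as a share of total fouls
ref2_pct = 100 * (20 - 25) / (20 + 25)
# The raw differences (-4 vs -5) suggest Referee 2 leans more toward the home
# team, but after normalizing by total fouls, Referee 1's calls (about -33%)
# are proportionally far more tilted against the away team than
# Referee 2's (about -11%).
```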
df_br_2021_ref = df_br_2021_ref[['Unnamed: 0', 'Per Game.2', 'Home Minus Visitor.3', 'Per Game Relative.3']]
df_br_2021_ref.columns =['Referee', 'Total Fouls Called', 'Home Minus Visitor Foul Calls', 'Difference between Ref and Avg']
df_br_2021_ref = df_br_2021_ref.astype({'Home Minus Visitor Foul Calls' : "float", 'Total Fouls Called' : "float", "Difference between Ref and Avg" : "float"})
df_br_2021_ref['Percent Normalized Home Minus Visitor'] = 100 * df_br_2021_ref['Home Minus Visitor Foul Calls'] / df_br_2021_ref['Total Fouls Called']
df_br_2021_ref.head()
Now, with this data, we will first look to see whether any referees are outliers in regards to the average number of foul calls per game and the number of fouls called on home teams versus away teams.
import matplotlib.pyplot as plt
df_br_2021_ref.plot.scatter(x='Referee', y = 'Difference between Ref and Avg')
plt.title('Per Game Foul Call Difference With League Average')
plt.tick_params(
axis='x',
which='both',
bottom=True,
top=False,
labelbottom=False)
plt.xlabel('Referee')
plt.ylabel('Difference')
df_br_2021_ref.plot.scatter(x='Referee', y = 'Percent Normalized Home Minus Visitor')
plt.title('Percent Difference of Fouls Called Against Home and Visitor')
plt.tick_params(
axis='x',
which='both',
bottom=True,
top=False,
labelbottom=False)
plt.xlabel('Referee')
plt.ylabel('Difference')
As we can see, on average most referees do not show a significant difference in the total number of fouls they call, nor in the fouls they call against home versus away teams. For fouls called against home versus away teams, the maximum percent difference is approximately $\pm 4\%$, which is not enough to claim evidence of bias in this area for any referee.
However, we do see that some referees tend to call significantly more or fewer total fouls per game. According to Basketball-Reference, the average number of fouls called per game in the 2021-22 NBA season was 40.7. So we will now find the referees who called at least 4 more or 4 fewer fouls per game than average.
print(df_br_2021_ref.loc[(df_br_2021_ref['Difference between Ref and Avg'] >= 4) | (df_br_2021_ref['Difference between Ref and Avg'] <= -4), 'Referee'])
These referees call an average number of total fouls that is at least $10\%$ away from the league average, which is significant. It means these referees either tend to be very quick to call a foul or tend to be more conservative in their foul calling. But, as mentioned above, none of them appear to exhibit bias for or against the home or away team.
Based upon our analysis, we cannot conclude that there exist bad referees. When looking at the number of net correct calls made, all the referees fall within a reasonable window. The same holds when comparing the difference in foul calls against the home team versus the away team.
We can say that some referees call significantly more or significantly fewer fouls than the league average, but they do not appear to do so in a biased manner.
We will now look to determine if there is any relationship between the number of foul calls and if the team is an away team or a home team. We will be looking at data from the 2016-17 NBA season to the 2021-22 season. The reason why we are not looking at data before the 2016 season is because the NBA went through the "3 Point Revolution" through that time, resulting in enormous changes to NBA offenses. As a result, through my own judgement, I believe that it is not helpful to use data from before the 2016 season to make any conclusions about the modern NBA.
We will first look at the average number of fouls called on the home team and compare it to the number called on the away team, to determine if there is any difference in how they are refereed. The data we will be using from this point on comes from TeamRank; luckily, it is already in an extremely clean and neat format and is immediately ready to be used.
url_home_away_2016_22_pf = 'https://raw.githubusercontent.com/NVaduthala3/cmps-6160-project/main/teamrank_2016-22_PF_league_avg.csv'
df_br_2016_22_home_away_pf = pd.read_csv(url_home_away_2016_22_pf)
df_br_2016_22_home_away_pf.columns = ['Season', 'Home PF', 'Away PF']
df_br_2016_22_home_away_pf.set_index('Season', inplace=True)
df_br_2016_22_home_away_pf
Let us see how this looks visually.
ax = df_br_2016_22_home_away_pf.plot.bar(rot=0, title='Average Number of Fouls Called on Home vs Away')
ax.set_ylabel('Foul Calls')
As we can see, there is a negligible difference between the number of fouls called on the home team and the number called on the away team. It should be noted, though, that in every season listed here, home teams did receive more foul calls on average than away teams.
Now, we will look at the difference in foul calls that each team received as a home team versus as an away team. If the net difference is positive, then they received more calls on average as the home team.
url_home_away_pf_diff_2021 = 'https://raw.githubusercontent.com/NVaduthala3/cmps-6160-project/main/teamrank_21_team_PF_avg.csv'
df_br_2021_home_away_pf_diff = pd.read_csv(url_home_away_pf_diff_2021)
df_br_2021_home_away_pf_diff.columns = ['Team', 'PF as Home', 'PF as Away']
df_br_2021_home_away_pf_diff['Net Difference'] = df_br_2021_home_away_pf_diff['PF as Home'] - df_br_2021_home_away_pf_diff['PF as Away']
df_br_2021_home_away_pf_diff.head()
Let us see how this looks graphically.
df_br_2021_home_away_pf_diff.plot.scatter(x='Team', y='Net Difference')
plt.title('Difference Between Avg Number of Fouls Committed as Home vs Away Team')
plt.tick_params(
axis='x',
which='both',
bottom=True,
top=False,
labelbottom=False)
plt.xlabel('Team')
plt.ylabel('Difference')
We can see that on average, the net difference in how teams are called is within the $\pm 1$ range. Some teams do receive fewer calls as the away team, but none of them have a difference of at least $\pm 4$. We will use this threshold of 4 later as well.
Based upon our analysis, we cannot conclude that there is statistical evidence of home teams receiving beneficial treatment, at least based upon foul call counts. It is true that across all the seasons listed above, home teams did receive more foul calls, but the difference is too small to argue that there is bias.
Now, we will look to see if there is any significant difference between the average number of fouls called in the regular season and in the playoffs from the 2016-17 season to the 2021-22 season. We will again be using TeamRank as our source of data, so luckily we will again not have to clean up our data.
url_2016_21_reg_pf = 'https://raw.githubusercontent.com/NVaduthala3/cmps-6160-project/main/teamrank_2016_21_reg_seas_league_PF_avg.csv'
url_2016_21_playoff_pf = 'https://raw.githubusercontent.com/NVaduthala3/cmps-6160-project/main/teamrank_2016_21_playoffs_league_PF_avg.csv'
df_2016_21_reg_pf = pd.read_csv(url_2016_21_reg_pf)
df_2016_21_playoff_pf = pd.read_csv(url_2016_21_playoff_pf)
df_2016_21_reg_pf.columns = ['Season', 'PF in Reg Season']
df_2016_21_playoff_pf.columns = ['Season', 'PF in Playoffs']
df_2016_21_reg_pf.set_index('Season', inplace=True)
df_2016_21_playoff_pf.set_index('Season', inplace=True)
df_2016_21_reg_playoff_pf = df_2016_21_reg_pf.join(df_2016_21_playoff_pf, how='outer')
df_2016_21_reg_playoff_pf
Now let us see how this looks visually.
ax2 = df_2016_21_reg_playoff_pf.plot.bar(rot=0)
plt.title('Average Number of Fouls Called in Reg. Season vs Playoffs')
ax2.set_ylabel('Number of Fouls')
As we can see, there seems to be a negligible difference in the average number of foul calls made in the Playoffs compared to the Regular Season. It should be noted that in every season the number of foul calls actually increased in the Playoffs, so in fact slightly more fouls are called in the Playoffs than in the Regular Season.
We will now attempt to create a predictive model that predicts the number of net correct foul calls that a team will receive in a game based upon whether they are the home or away team and whether the game is in the Regular Season or the Playoffs. We will be using k-Nearest Neighbors for this model, because it is fair to assume that if two teams play games with similar values for the input variables mentioned above, then they will receive a similar number of foul calls.
We will not be including any information about the number of star players or who will be a referee into this model, because as we have just seen above, there seems to be no relationship betwen these variables and the number of net correct foul calls.
We will be using information from the NBA's Last Two Minutes Dataset.
From what we have seen throughout this analysis, there is little evidence of any correlation between home/away status and the number of fouls called, or between regular-season/playoff status and the number of calls made. So we should expect our prediction model to be neither very precise nor very accurate.
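To ground the kNN intuition stated above, here is a minimal sketch on made-up teams and targets (the real pipeline below also standardizes the features, which is omitted here for brevity). With `DictVectorizer` one-hot encoding the categorical inputs, two games with identical (home, away, playoff) values sit at distance zero, so a 2-NN prediction for a repeat matchup is simply the mean of its two matching rows.

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.neighbors import KNeighborsRegressor

# Made-up games: identical feature rows should yield each other as nearest neighbors
X_dict = [
    {'Home Team': 'AAA', 'Away Team': 'BBB', 'Playoff or No': 0},
    {'Home Team': 'AAA', 'Away Team': 'BBB', 'Playoff or No': 0},
    {'Home Team': 'CCC', 'Away Team': 'DDD', 'Playoff or No': 1},
    {'Home Team': 'CCC', 'Away Team': 'DDD', 'Playoff or No': 1},
]
y = np.array([2.0, 4.0, -1.0, -3.0])  # hypothetical net correct calls

vec = DictVectorizer(sparse=False)
X = vec.fit_transform(X_dict)  # one-hot encode the team names

model = KNeighborsRegressor(n_neighbors=2)
model.fit(X, y)

# A repeat AAA-vs-BBB regular-season game: its 2 nearest neighbors are the
# two identical rows, so the prediction is their mean, (2.0 + 4.0) / 2 = 3.0
pred = model.predict(vec.transform([{'Home Team': 'AAA', 'Away Team': 'BBB',
                                     'Playoff or No': 0}]))
print(pred[0])
```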
url_l2m2 = 'https://raw.githubusercontent.com/atlhawksfanatic/L2M/master/1-tidy/L2M/L2M_stats_nba.csv'
df_l2m2 = pd.read_csv(url_l2m2)
# Build one row per game: ID, teams, and playoff flag
# ('Net Correct Calls' is filled in below)
games = df_l2m2.drop_duplicates(subset='GameId').reset_index(drop=True)
df_ncc = pd.DataFrame({
    'Game ID': games['GameId'],
    'Home Team': games['home_team'],
    'Away Team': games['away_team'],
    'Playoff or No': games['playoff'],
})
# Score each call: +1 if correct (CC or CNC), -1 if incorrect (IC or INC), 0 otherwise
ncc = []
for call in df_l2m2['decision']:
    if call in ('CC', 'CNC'):
        ncc.append(1)
    elif call in ('IC', 'INC'):
        ncc.append(-1)
    else:
        ncc.append(0)
df_l2m2['Correct Call or No'] = ncc
df_ncc.dropna(subset=['Game ID'], inplace=True)
df_ncc.sort_values(by=['Game ID'], inplace=True)
df_ncc.set_index('Game ID', drop=True, inplace=True)
df_ncc['Net Correct Calls'] = df_l2m2.groupby('GameId')['Correct Call or No'].sum()
df_ncc
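The scoring-and-aggregation step above can be checked on a handful of made-up L2M-style rows (the `GameId` and `decision` values below are hypothetical): summing the per-call scores within each game yields that game's net correct calls.

```python
import pandas as pd

# Toy L2M-style rows (made up) to illustrate the net-correct-calls computation
toy = pd.DataFrame({
    'GameId':   ['g1',  'g1',  'g1',  'g2', 'g2'],
    'decision': ['CC',  'INC', 'CNC', 'IC', None],
})

# Same scoring rule as above, written as a mapping: +1 correct, -1 incorrect, 0 otherwise
score_map = {'CC': 1, 'CNC': 1, 'IC': -1, 'INC': -1}
toy['Correct Call or No'] = toy['decision'].map(score_map).fillna(0).astype(int)

# Net correct calls per game: g1 = 1 - 1 + 1 = 1, g2 = -1 + 0 = -1
net = toy.groupby('GameId')['Correct Call or No'].sum()
print(net)
```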
Now we implement kNN and test the errors.
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction import DictVectorizer
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
def est_test_error(df, features, pred_val, k, iter_count):
    # Average the validation errors over iter_count random train/val splits
    mae_list = []
    rmse_list = []
    for i in range(iter_count):
        error_pair = get_cv_val_error2(df, features, pred_val, k)
        mae_list.append(error_pair[0])
        rmse_list.append(error_pair[1])
    return (np.mean(mae_list), np.mean(rmse_list))
def get_cv_val_error2(df, features, predict_val, k): # 2-fold cross-validation
    # Split the data into random train and validation halves
    train = df.sample(frac=.5)
    val = df.drop(train.index)
    x_train_dict = train[features].to_dict(orient="records")
    x_val_dict = val[features].to_dict(orient="records")
    y_train = train[predict_val]
    y_val = val[predict_val]
    # Convert categorical variables to dummy variables
    vec = DictVectorizer(sparse=False)
    vec.fit(x_train_dict)
    x_train = vec.transform(x_train_dict)
    x_val = vec.transform(x_val_dict)
    # Standardize
    scaler = StandardScaler()
    scaler.fit(x_train)
    x_train_sc = scaler.transform(x_train)
    x_val_sc = scaler.transform(x_val)
    # Fit kNN model
    model = KNeighborsRegressor(n_neighbors=k)
    model.fit(x_train_sc, y_train)
    # Predict on val set
    y_val_pred = model.predict(x_val_sc)
    rmse1 = np.sqrt(((y_val - y_val_pred) ** 2).mean())
    mae1 = ((y_val - y_val_pred).abs()).mean()
    # Swap the folds and repeat, so each half serves as the validation set once
    (mae2, rmse2) = get_val_error_mod(x_val_dict, y_val, x_train_dict, y_train, k)
    # Average the two folds' errors
    rmse_list = [rmse1, rmse2]
    mae_list = [mae1, mae2]
    return (np.mean(mae_list), np.mean(rmse_list))
def get_val_error_mod(X_train_dict, y_train, X_val_dict, y_val, k): # called on the swapped fold
    # convert categorical variables to dummy variables
    vec = DictVectorizer(sparse=False)
    vec.fit(X_train_dict)
    X_train = vec.transform(X_train_dict)
    X_val = vec.transform(X_val_dict)
    # standardize the data
    scaler = StandardScaler()
    scaler.fit(X_train)
    X_train_sc = scaler.transform(X_train)
    X_val_sc = scaler.transform(X_val)
    # Fit a k-nearest neighbors model. (Previously hard-coded to 10 neighbors,
    # which silently ignored the k being swept over elsewhere.)
    model = KNeighborsRegressor(n_neighbors=k)
    model.fit(X_train_sc, y_train)
    # Make predictions on the validation set.
    y_val_pred = model.predict(X_val_sc)
    rmse = np.sqrt(((y_val - y_val_pred) ** 2).mean())
    mae = ((y_val - y_val_pred).abs()).mean()
    return (mae, rmse)
def get_kFoldCV_mae(df, features, predict_val, k_nn, k_cv):
    x_dict = df[features].to_dict(orient="records")
    y = df[predict_val]
    # specify the pipeline
    vec = DictVectorizer(sparse=False)
    scaler = StandardScaler()
    model = KNeighborsRegressor(n_neighbors=k_nn)
    pipeline = Pipeline([("vectorizer", vec), ("scaler", scaler), ("fit", model)])
    # Use neg_mean_absolute_error (not neg_mean_squared_error) so that
    # flipping the sign actually recovers an MAE
    scores = cross_val_score(pipeline, x_dict, y,
                             cv=k_cv, scoring="neg_mean_absolute_error")
    mae = (-scores).mean()
    return mae
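One detail worth a sanity check: scikit-learn's scorer strings like `neg_mean_absolute_error` return *negated* errors (so that higher is always better), and the sign must be flipped to recover the error itself. The toy run below, on made-up team labels where each team's target is perfectly determined by its label, should recover an MAE of zero through the same vectorize/scale/kNN pipeline used above.

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

# Made-up data: team AAA always scores 1.0, team BBB always 3.0
X_dict = [{'Team': 'AAA' if i % 2 == 0 else 'BBB'} for i in range(20)]
y = np.array([1.0 if i % 2 == 0 else 3.0 for i in range(20)])

pipeline = Pipeline([
    ('vectorizer', DictVectorizer(sparse=False)),
    ('scaler', StandardScaler()),
    ('fit', KNeighborsRegressor(n_neighbors=3)),
])

# Scores come back negated; flip the sign to get the MAE
scores = cross_val_score(pipeline, X_dict, y, cv=5,
                         scoring='neg_mean_absolute_error')
mae = (-scores).mean()
print(mae)  # a perfectly predictable target gives MAE 0
```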
def get_kNN_mae(df, features, predict_val, k): # features is list of strings and predict_val is string
x_train_dict = df[features].to_dict(orient="records")
y_train = df[predict_val]
vec = DictVectorizer(sparse=False)
vec.fit(x_train_dict)
x_train = vec.transform(x_train_dict)
scaler = StandardScaler()
scaler.fit(x_train)
x_train_sc = scaler.transform(x_train)
# Fit model
model = KNeighborsRegressor(n_neighbors=k)
model.fit(x_train_sc, y_train)
# Find predictions
y_train_pred = model.predict(x_train_sc)
# Find mae
mae = ((y_train - y_train_pred).abs()).mean()
return mae
features = ['Home Team', 'Away Team', 'Playoff or No']
predict_val = 'Net Correct Calls'
est_test_error(df_ncc, features, predict_val, 100, 10)
k_val = np.arange(1, 50)
train = df_ncc.sample(frac=.5)
knn_train_mae_list = [get_kNN_mae(train, features, predict_val, k) for k in k_val]
knn_val_mae_list = [est_test_error(df_ncc, features, predict_val, k, 50)[0] for k in k_val]
#kFold_train_mae_list = [get_kFoldCV_mae(train, features, predict_val, k, 5) for k in k_val]
#kFold_val_mae_list = [get_kFoldCV_mae(df_ncc, features, predict_val, k, 5) for k in k_val]
plt.plot(k_val, knn_val_mae_list, label = 'kNN val error')
plt.plot(k_val, knn_train_mae_list, label = 'kNN train error')
#plt.plot(k_val, kFold_train_mae_list, label = '5 Fold train error')
#plt.plot(k_val, kFold_val_mae_list, label = '5 Fold val error')
plt.xlabel('k Value')
plt.ylabel('MAE')
plt.legend()
As the graph over varying $k$ values shows, our expectation was borne out: the constructed model does not predict well, with relatively high error across all values of $k$. The fundamental flaw is that, as we have seen throughout our analysis, the variables used as inputs do not appear to be correlated with the number of foul calls at all.
Our analysis has shown that the commonly held beliefs about NBA referees listed above lack the statistical evidence needed to support them. We found no evidence of a relationship between stardom and favorable treatment; no evidence for the existence of consistently bad referees; no evidence of a significant advantage given to home teams; and no evidence that referees call fewer fouls in the Playoffs than in the Regular Season.