Machine Learning the NBA MVP Race Winner

The NBA MVP award is given out every year to the "Most Valuable Player" of the regular season's 82 games. The decision is made through a voting system, in which a panel of sportswriters and broadcasters casts votes to decide who wins the award.

The recipient of the award is typically chosen based on factors such as individual statistics, team success, and impact on the game. As such, by the end of the season there are usually a select few candidates for the title who exhibit these characteristics. But could we train a machine learning model to predict which player will be named MVP for a given season?

Data Collection

Since the voting data is made public, we can look up voting information for a particular season, such as First Place Votes, Points Won, and Share. Ideally, we could use some of this data as output values to train a regression model, such that the model can predict the votes each player in the MVP race receives for a given season and, as a result, who will become that season's league MVP.

This data is available on Basketball Reference for every season since the award's existence (as well as additional player and team data). I scraped the data using hooPY, a simple Python CLI tool I wrote for accessing all types of NBA data.

I retrieved MVP data from the past 30 years as well as advanced stats, which I plan on using as feature data to train the model. I combined the fetched data into a single dataframe per season using an inner join, resulting in the following table structure:

Head of fetched table from the 2023 season

I have added 'Won' as an additional column for visualization later on, since I will be concatenating all seasons' dataframes into one and will need to keep track of the winners.
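As a rough sketch of that join and the extra 'Won' flag, here is what it might look like with made-up mini-frames standing in for one season's scraped data (the column names are assumptions; the real frames come from hooPY and have many more columns):

```python
import pandas as pd

# Hypothetical voting table (players who received MVP votes)
mvp = pd.DataFrame({
    "Player": ["Joel Embiid", "Nikola Jokic", "Giannis Antetokounmpo"],
    "First": [73.0, 15.0, 12.0],
    "Pts Won": [915.0, 674.0, 606.0],
    "Share": [0.915, 0.674, 0.606],
})

# Hypothetical advanced-stats table (all players in the league)
adv = pd.DataFrame({
    "Player": ["Joel Embiid", "Nikola Jokic",
               "Giannis Antetokounmpo", "Jayson Tatum"],
    "PER": [31.4, 31.5, 29.0, 23.2],
    "WS": [12.3, 14.9, 10.4, 10.8],
})

# Inner join on player name keeps only players who both received
# MVP votes and have advanced stats for the season
season = mvp.merge(adv, on="Player", how="inner")

# Flag the winner (highest Share) so the winners stay identifiable
# after all seasons are concatenated into one dataframe
season["Won"] = season["Share"] == season["Share"].max()
print(season)
```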

I will be training my model to predict the sixth column, Share, which is calculated as the number of points a player received (which depends on the number of votes and their respective placements) divided by the maximum number of points available. In short, this metric essentially decides who wins the MVP award in a given season.
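As a quick worked example of that metric (with illustrative numbers: a 100-ballot panel with first place worth 10 points gives 1,000 points maximum):

```python
# Share = points won / maximum points available
pts_won = 915.0   # points a player collected across all ballots
pts_max = 1000.0  # e.g. 100 ballots x 10 points for a first-place vote

share = pts_won / pts_max
print(share)  # 0.915
```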

Feature Selection

Since I am going to perform a regression, I will first take a look at the pairwise correlations using the Pearson correlation coefficient. I have plotted the correlation table as a heatmap for visualization:
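The numbers behind such a heatmap can be pulled straight from the dataframe. A sketch with synthetic stand-in data (the column names and relationships are made up for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in for the combined dataframe
df = pd.DataFrame({
    "WS":  rng.normal(5, 2, n),      # win shares
    "PER": rng.normal(18, 4, n),     # player efficiency rating
    "BLK": rng.normal(0.5, 0.3, n),  # blocks per game
})
# Fake Share column that depends mostly on WS, a little on PER,
# and not at all on BLK
df["Share"] = 0.05 * df["WS"] + 0.01 * df["PER"] + rng.normal(0, 0.05, n)

# Pearson correlations of every feature with Share, strongest first --
# the same numbers a heatmap would visualize
corr = (df.corr(numeric_only=True)["Share"]
          .drop("Share")
          .sort_values(ascending=False))
print(corr)
```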

In this case, we are interested in the correlations of Share with other features. Based on these correlations, I decided to keep the following features to train the model:

Note that defensive stats such as blocks and steals don't seem to have much of an impact on the resulting vote share. Even the more general advanced defensive stats like DBPM, while somewhat more correlated with the share, still correlate less than their offensive counterparts.

We can take a look at the scatter plots for the correlations of each of these features with the Share stat. The MVP winners for each season are coloured in orange.
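One of those scatter plots might be drawn like this; the toy frame below stands in for the combined data, using the 'Won' flag from earlier to colour the winners orange:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import pandas as pd

# Toy data; the real frame holds every vote-getter from 30 seasons
df = pd.DataFrame({
    "WS":    [14.9, 12.3, 10.4, 8.0, 6.1],
    "Share": [0.915, 0.674, 0.606, 0.30, 0.10],
    "Won":   [True, False, False, False, False],
})

fig, ax = plt.subplots()
# Non-winners in the default blue, MVP winners in orange
for won, colour in [(False, "tab:blue"), (True, "tab:orange")]:
    part = df[df["Won"] == won]
    ax.scatter(part["WS"], part["Share"], color=colour)
ax.set_xlabel("WS")
ax.set_ylabel("Share")
fig.savefig("ws_vs_share.png")
```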

Model Selection

Based on the correlation plots above, it looks like we might even be able to use linear regression and get a decent model for this data. First, we have to do some basic data transformation:

from sklearn.preprocessing import scale
X = df[stats].to_numpy()
X = scale(X)
print(X)
array([[ 0.57207554,  1.09736228,  0.97627999, ...,  0.32582557,
         0.89949827,  1.06205372],
       [ 0.66770434,  1.52082981,  0.80668369, ...,  2.02849453,
         0.78155081,  1.41746367],
       [ 1.91087884,  1.94429734,  1.56986704, ...,  1.25455409,
         2.2362361 ,  2.63601206],
       ...,
       [ 0.45732097, -1.01997536, -1.05887559, ..., -1.45423743,
        -1.18424011, -1.17195166],
       [ 0.26606335, -0.62675551, -0.44408901, ..., -1.14466125,
        -0.63381865, -0.76576887],
       [ 0.6868301 , -1.44344288, -1.01647652, ..., -0.37072082,
         0.07386608, -0.61345032]])

For ease of use, I have decided to go with the model implementation from Scikit-Learn. Training the linear regressor on this data produces the following R² score:

0.4792558660094782

Not great. Keep in mind, the R² score, known as the coefficient of determination, measures the goodness of our model's predictions, with values closer to 1.0 indicating better and values closer to 0.0 indicating worse model performance.
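For reference, the fit-and-score step looks something like the following sketch; the synthetic `(X, y)` here stands in for the scaled feature matrix and Share column built earlier:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic linear data with noise, standing in for the real features
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
coef = np.array([1.5, -2.0, 0.5, 1.0, -0.5, 2.0])
y = X @ coef + rng.normal(scale=2.0, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
reg = LinearRegression().fit(X_train, y_train)

# .score() returns the coefficient of determination (R^2) on held-out data
print(reg.score(X_test, y_test))
```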

In this case, we can also use accuracy as an evaluation metric, since the player with the highest predicted share in a season would theoretically be labelled that season's league MVP. We can then compare the predicted MVP to the actual MVP and keep track of how many MVPs our regression predicted correctly, which in this case gives the following fraction:

0.6451612903225806
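The per-season argmax comparison behind that fraction might look like this; the toy numbers and the `Season`/`pred_share` column names are assumptions for illustration:

```python
import pandas as pd

# Toy predictions for two seasons
df = pd.DataFrame({
    "Season":     [2022, 2022, 2022, 2023, 2023, 2023],
    "Share":      [0.875, 0.595, 0.302, 0.915, 0.674, 0.606],
    "pred_share": [0.70,  0.75,  0.30,  0.88,  0.65,  0.50],
})

correct = 0
for _, season in df.groupby("Season"):
    predicted_mvp = season["pred_share"].idxmax()  # argmax of predictions
    actual_mvp = season["Share"].idxmax()          # the real winner
    correct += predicted_mvp == actual_mvp

accuracy = correct / df["Season"].nunique()
print(accuracy)  # one of the two winners predicted correctly -> 0.5
```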

Slightly better than R², but still not great. In this case, I decided to train some other regression models including a Random Forest, Support Vector Machine, Multilayer Perceptron, and two Voting Regressors, which take in multiple regression models and average out their predictions.

The first Voting Regressor consists of a Linear Regressor, Random Forest and K-Neighbours Regressor. The second consists of a Gradient Boosting Regressor, Random Forest and Linear Regressor. The Voting Regressors as well as the hyperparameters for the other models can be found in the notebook of this page's source code.
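As a sketch of those two compositions with Scikit-Learn's `VotingRegressor` (default hyperparameters here; the tuned ones live in the notebook):

```python
from sklearn.ensemble import (GradientBoostingRegressor,
                              RandomForestRegressor, VotingRegressor)
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

# First ensemble: linear regression, random forest, k-neighbours
voting_1 = VotingRegressor([
    ("lin", LinearRegression()),
    ("rf", RandomForestRegressor(random_state=0)),
    ("knn", KNeighborsRegressor()),
])

# Second ensemble: gradient boosting, random forest, linear regression
voting_2 = VotingRegressor([
    ("gb", GradientBoostingRegressor(random_state=0)),
    ("rf", RandomForestRegressor(random_state=0)),
    ("lin", LinearRegression()),
])
```

Each `VotingRegressor` averages its members' predictions, so a single `fit`/`predict` call trains and queries all three underlying models.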

Model Evaluation

After training the models, I performed the same simple evaluation from earlier on the other models, calculating accuracy and R² scores for each model. I decided to plot the results in the following bar chart:

Accuracy and R² score values for each model

Testing the models

Now that we have some trained models, we can take a look at how they approach some recent seasons and whether they perform as expected on some concrete examples.

First, let's take a look at the model predictions on the previous season:

MVP vote share predictions for the 2023 season

Interestingly, the models seem to be very balanced between the two main MVP candidates, Joel Embiid and Nikola Jokić. The first Voting Regressor even prefers Jokić. This is reflected both in their regular season performances and in the conversation around MVP voting at the time, with many people saying Jokić had a solid case for winning the award.

Another interesting season to look at might be the 2016 season, with Stephen Curry being the first and only unanimous MVP. Will the model give him a similarly high vote share? Here are the results:

MVP vote share predictions for the 2016 season

It seems the models do a pretty good job of predicting Steph with a good level of certainty in this case.

Note that some of the models tend to predict negative values in cases where the predicted share would be very low. This may be caused by overfitting; in these cases, I have set zero as the share's lower bound for visualization purposes.
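Clamping the lower bound is a one-liner with NumPy; a sketch with made-up predictions:

```python
import numpy as np

# Hypothetical raw model outputs; the last two dip below zero
preds = np.array([0.82, 0.41, 0.07, -0.03, -0.11])

# A vote share can't be negative, so clamp at zero for plotting
preds = np.clip(preds, 0.0, None)
print(preds)  # negative shares become 0.0
```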

Conclusion

In conclusion, predicting NBA MVPs using machine learning turns out to be a bit of a challenge. For one, there just isn't a very large amount of relevant data to train on, considering how much the NBA has changed since its inception. Using older MVPs might skew the data, since the voting criteria are decided by an ever-changing panel of voters and may shift over the years. In addition, my approach doesn't consider "voter fatigue" and its possible effect on the MVP race.

Some improvements I might add in the future include using additional (older) data from seasons pre-'93 and seeing how that affects the models. Another stat worth considering for feature selection would be the player's team's win-loss record, as well as the team's overall standings. Further optimizing each model's hyperparameters might also be worth looking into.

Anyway, I think this could be interesting to apply to the current season and its possible MVP candidates, and I will most likely be revisiting this project after the 2024/25 season.