Machine Learning Analysis of EA Sports’ FIFA

Brent Claypool and David Lee

Code:

https://github.com/1313davidlee/sports_modeling

Summarizing Video:

https://drive.google.com/file/d/1BaSbfIxqcOZ7Ec8BK-p1ooPVDwPaYkfa/view?usp=sharing

Introduction/Related Work

EA Sports’ FIFA is a soccer video game franchise that releases a new game every year, letting fans play as their favorite football players and clubs. Every player in the game is rated on 30 different attributes corresponding to football skills and physical ability; examples include Strength, Sprint Speed, Long Pass Accuracy, and Ball Control. Each player receives a score in the range [0, 99] for each attribute, plus one overall score aggregated from the individual scores. According to EA, the Ratings Collective is a talent scouting network with a passion for The World’s Game: from sprint speed to finishing, passing accuracy to stamina, its scouts meticulously watch what happens on the pitch to assess, judge, and evaluate players across the more than 30 attributes that define a footballer’s skill level. Their assessments come together to create the FIFA Ratings, the authoritative ranking of over 17,000 players’ footballing ability. The overall rating has become part of the value of each new game; on September 10, 2020, EA Sports’ Instagram account posted a video with the caption “New Season New Ratings…Join the conversation #FIFARatings”. Our project raises the question: what really goes into these ratings? How do famous players, seemingly undeservedly given lackluster seasons, continue to retain the highest ratings season after season? Our project attempts to first replicate the FIFA algorithm, then identify major contributing factors to the overall rating, and finally identify outliers and the factors that cause them.


There has been some research and theorization around this topic. A short article from the UK media outlet Platform Magazine suggested that players’ ratings are inflated by the fame and size of their clubs. Other research has used FIFA data as a proxy for real player data, suggesting accuracy and reliability on EA’s behalf. Overall, however, there is no consensus or statistical proof of specific bias factors in FIFA player ratings. Finding so little related work in this particular area was further inspiration for us to push the boundaries of this project and the questions we attempted to answer.

Overview of Solution

1. Data selection

The criteria for data selection for this project were very specific. We needed real player statistics across multiple high-caliber soccer leagues that shared most, if not all, attributes and features, and the sets needed to be from the same year. We also needed a corresponding EA Sports FIFA dataset from the following year, since we assumed FIFA player ratings are based primarily on a player’s previous season. We met these criteria with the 2018-2019 La Liga and Premier League datasets and the subsequent FIFA 20 player profiles.

2. Data cleaning

Significant data cleaning was required to prepare the datasets for modeling. We first identified common columns between the La Liga and Premier League data, many of which were named differently and had to be manually renamed, and removed any columns that were irrelevant or not shared between the two sets. We also needed to impute several values: for example, a majority of the La Liga rows were missing the number of 50/50s won, which we imputed by averaging the Premier League data by position and scaling by appearance count and other factors. We then prepared the FIFA dataset for joining with the now-clean player data. This required designing a common key between the FIFA dataset and the player statistics; we determined that player name and club would work as a composite key, since no full name repeats within any club in the two datasets. The major hurdle in the joining process was the difference in naming conventions: the FIFA data used long names, whereas the real season statistics used shortened names or nicknames. We resolved this by tokenizing both name fields and declaring a match when the short name’s tokens were a complete subset of the long name’s. We wrote a script that automated this task for all of the several hundred players being analyzed.
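The token-subset matching just described can be sketched as follows (function names and the example names are our illustration, not the actual script):

```python
def tokens(name: str) -> set:
    """Lowercase a name, strip punctuation, and split it into word tokens."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name)
    return set(cleaned.lower().split())

def names_match(short_name: str, long_name: str) -> bool:
    """Declare a match when every token of the short name appears in the long name."""
    return tokens(short_name) <= tokens(long_name)
```

Combined with the club field as the second half of the composite key, this lets a shortened name or nickname join cleanly against the long FIFA name.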

3. Feature selection and model training

Our solution consists of three separate machine learning models. We separated our dataset into Forwards, Midfielders, and Defenders, then trained a Random Forest Regressor on each dataset separately. The statistics used for each model are summarized in Table 1.
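A minimal sketch of this per-position training, assuming a cleaned DataFrame with a `position` column and an `overall` target (column names are ours; the actual per-position feature lists are those in Table 1):

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def train_position_models(df: pd.DataFrame, features_by_position: dict) -> dict:
    """Train one Random Forest Regressor per position group."""
    models = {}
    for position, features in features_by_position.items():
        subset = df[df["position"] == position]
        model = RandomForestRegressor(random_state=0)
        model.fit(subset[features], subset["overall"])
        models[position] = model
    return models
```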

4. Model application

We used Mean Absolute Error, Training R², Out-Of-Bag R², and Test R² as diagnostic tests to evaluate our models; the results are summarized in Table 2. For Defenders, the Out-Of-Bag R² and Test R², which indicate how well the model can predict new data, are extremely low. Given the nature of our data we do not expect high values for these figures, which fall on a scale of 0 to 1, but these values give no indication that our predictors can predict a FIFA overall rating, so we restrict the remaining analysis to Forwards and Midfielders.

For our Forwards model, the most important variables are assists per game and goals per game. For our Midfielders model, the most important variables are assists per game and recoveries per game. Figures A and B plot the relative importance of each predictor on the response for each model. Upon analyzing these graphs, we saw that the results were intuitive: scoring goals makes for a highly-rated forward, while creating goal-scoring opportunities and tracking back on defense make for a highly-rated midfielder.
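Reading the relative importances off a fitted model looks roughly like this (impurity-based importances from scikit-learn; the function name is ours):

```python
from sklearn.ensemble import RandomForestRegressor

def ranked_importances(model: RandomForestRegressor, feature_names: list) -> list:
    """Pair each predictor with its impurity-based importance, most important first."""
    pairs = zip(feature_names, model.feature_importances_)
    return sorted(pairs, key=lambda p: p[1], reverse=True)
```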

We tested our algorithms on brand new data that was not used for training. We trained on data from the 2018-2019 season and FIFA 20 ratings, and we tested on data from the 2019-2020 season and FIFA 21 ratings. Our findings for marquee players are summarized in Table 3.

5. Explanation of Error

We believe there are several impactful sources of error in our model. The first is omitted variable bias. For this project, we were limited to data available online, which mostly encompassed statistics that could be found on a Match Report. EA Sports’ FIFA uses 30 different attributes to determine an overall rating, and some of these attributes are just not captured by our statistics. Examples of such attributes include Attacking Position, Sprint Speed, and Composure. If we had access to more advanced statistics such as “Average Time Spent in X Position on the Field”, “Top Speed”, and “Successful Dribbles under Pressure”, we would have ended up with a more accurate model.

The second source of error is the size of our data. We used only players from Spain’s La Liga and England’s Premier League, and of those leagues only the player data available online. Out of an initial dataset with over 18,000 rows, we ended with 84 forwards, 159 midfielders, and 131 defenders with data suitable for regression. At this size we can capture plenty of effects from the middle of the distribution, but we lose predictive power on players whose overall ratings fall closer to the tails.

The third source of error is the bias introduced by high performance in previous seasons. We assume a player who performs well consistently over the years earns themselves some credit: if their statistical output suddenly drops due to an injury or new teammates, their FIFA rating reflects how good they have shown they can be more than how well they have performed recently. Accounting for this would have meant including each player’s previous FIFA rating as a predictor for the next edition’s rating, but that would have defeated the purpose of trying to discover the effects of the real statistics!

Design Process and Preliminary Evaluation

One of our first steps in creating a machine learning model was defining statistics to use as predictors that were fair across a dataset of players from different clubs, leagues, and play styles. These statistics are summarized in Table 1. Given the distribution of overall ratings in the database, we would have favored a linear model, because it would give us a mathematical formula for easily reproducing predictions for players whose overall ratings do not tend toward the mean. After observing the relationships between features, however, we found that most relationships between the predictors and the response were nonlinear, and that there was substantial variable interaction. We therefore went with a black-box model, trading some interpretability for more predictive power. We also split our dataset by position to reduce training error, since we can assume some statistics, such as goals per game, matter more for the overall rating of certain positions. Since time is not a constraint for our project, we can afford to run three separate models for forwards, midfielders, and defenders.
The RandomForestRegressor class in the scikit-learn package has a built-in variable importance measure, which we used to pick out the most important variables for each model. Table 2 shows the variable importance for the forwards model. Next, we did some parameter tuning to select the best model. The Random Forest Regressor has 13 different arguments that can be changed to fit the model differently; we changed only a few. The first was ‘max_depth’, which limits how deep each tree can grow. We set this parameter to ‘None’ so that trees grow as large as possible. This matters to us because it makes the model more sensitive to outliers: in our forwards model the average overall rating is 79, but we still want the model to capture the uniqueness of a Lionel Messi with a rating of 93. The next parameter we changed was ‘max_features’, which controls the number of variables randomly considered as candidate split criteria at each node of a tree. Andy Liaw and Matthew Wiener, in Classification and Regression by randomForest, recommend setting this parameter to the number of predictors divided by 3. The final parameter we changed was ‘random_state’, which simply sets the random number stream; we set it to 0 for each model so that our diagnostic measures are the same each time we run it.
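Under those choices, each model’s construction reduces to something like the following sketch (the predictor count of 12 is illustrative; we also enable `oob_score` so the Out-Of-Bag R² diagnostic is available):

```python
from sklearn.ensemble import RandomForestRegressor

n_predictors = 12  # illustrative; use the actual predictor count per position

model = RandomForestRegressor(
    max_depth=None,                          # grow trees fully, staying sensitive to outliers
    max_features=max(1, n_predictors // 3),  # Liaw & Wiener's p/3 rule of thumb for regression
    random_state=0,                          # fixed seed so diagnostics are reproducible
    oob_score=True,                          # exposes oob_score_ for the Out-Of-Bag R²
)
```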
The measures we used to evaluate our models are Mean Absolute Error, Training R², Out-Of-Bag R², and Test R². While tuning our parameters, we chose the settings that produced the best values of these statistics. For our test set, we used real player data from the 2019-2020 season and FIFA 21 overall ratings; the big assumption here is that EA Sports used the same criteria for evaluating FIFA 21 overall ratings as they did for FIFA 20. The importance of considering multiple diagnostics lies in the nature of our data. We want a very low Mean Absolute Error and a high Training R² because we want to fit the model closely to the data: in reality, most players’ overall ratings fall in the range 77-83, so the model must distinguish between samples in such a small range of possibilities. We then need measures like OOB R² and Test R² because they are less biased; they give a better sense of how effective the models are at predicting new data. The pitfalls of these statistics are their sensitivity to the size of our dataset and the nature of our data. We are working with a small dataset, so we can expect our OOB R² to be small. Also, overall ratings are integer values while the predicted values are not, so these diagnostic tests would yield different results if all predicted ratings were rounded to the nearest integer.
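A sketch of computing the four diagnostics, assuming train/test splits built from the 2018-19 and 2019-20 seasons (variable and function names are ours):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

def evaluate(model: RandomForestRegressor, X_train, y_train, X_test, y_test) -> dict:
    """Fit the model and report MAE plus training, out-of-bag, and test R²."""
    model.fit(X_train, y_train)
    return {
        "mae": mean_absolute_error(y_train, model.predict(X_train)),
        "train_r2": model.score(X_train, y_train),
        "oob_r2": model.oob_score_,  # requires the model to be built with oob_score=True
        "test_r2": model.score(X_test, y_test),
    }
```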

Finally, this is a tricky project because we hypothesized from the beginning that bias exists in the evaluation of players’ FIFA ratings. We strove to build a great model while simultaneously expecting no clear, consistent relationship between a player’s stats and their FIFA rating. The difficulty in selecting a model was that there is no true rating against which to compare our predictions: FIFA overall ratings are not something EA Sports computes mathematically, so our work in this project is really an attempt to see whether every player is evaluated on the same criteria.

Conclusions and Future Work

We were pleased with the accuracy of our model. We believe it was highly effective at generating accurate player ratings for the players the model was tested on. That said, we were particularly interested in the outliers that our model predicted. We have several theories about our outliers, and these hypotheses inspire much of our future work with this project. 

Future work on this project will revolve around mathematically quantifying the sources of bias in EA FIFA ratings that cause the outliers identified in part three of the project. We hypothesized that age, fame, and club clout could all contribute to the scores of extreme outliers. For example, a world-renowned player like Lionel Messi always seems to receive a high score seemingly independent of his actual performance the previous season; his rating is instead derived from the near god-like status he has achieved over many seasons and from accolades that go beyond pure statistics. We hope to mathematically identify the correlation between fame and deviation from the expected rating value, using a player’s number of Instagram followers as a proxy for that player’s fame. Similar correlation figures may be calculated for data such as player age and club. For instance, certain clubs such as Real Madrid have contracts with EA Sports, and there may be an incentive for EA to inflate the scores of Real Madrid players accordingly. Further research is needed to prepare a statistically sound method of finding the correlation between these variables.
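As a first cut at the proposed fame analysis, the correlation between a fame proxy and each player’s deviation from the model’s expected rating could be sketched like this (follower counts below are hypothetical; a log transform tames their heavy tail):

```python
import numpy as np

def fame_residual_correlation(followers, actual, predicted) -> float:
    """Pearson correlation between log follower count and rating residual."""
    residuals = np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float)
    log_fame = np.log10(np.asarray(followers, dtype=float))
    return float(np.corrcoef(log_fame, residuals)[0, 1])
```

A strongly positive value would be consistent with the hypothesis that famous players are rated above what their on-pitch statistics predict.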


References

  1. Andy Liaw, Matthew Wiener. Classification and Regression by randomForest. Retrieved December 07, 2020 from https://www.researchgate.net/publication/228451484_Classification_and_Regression_by_RandomForest.
  2. Are FIFA’s Player Ratings Guilty of Big Team Bias? Retrieved December 07, 2020 from https://www.platformmagazine.co.uk/sports/are-fifas-player-ratings-guilty-of-big-team-bias/.
  3. Conner Smith, Zach Taylor, Jonathan Tynan. 2018. FIFANet: Deep Learning to Predict Player Value. Retrieved from http://cs230.stanford.edu/projects_spring_2019/reports/18681023.pdf.
  4. Will Koehrsen. 2017. Random Forest in Python. Retrieved December 07, 2020 from https://towardsdatascience.com/random-forest-in-python-24d0893d51c0.
  5. FIFA Ratings Collective. Retrieved October 22, 2020 from https://www.ea.com/games/fifa/fifa-21/ratings.
