
Part 2: ADCC Championship: We Can Predict the Winners, But Only 70% of the Time


In the previous article, we explored the characteristics of the 2019 ADCC tournament. Our initial data had match details, which gave us insight into how matches were won, trends among the different weight classes, the most successful fighters, and the top teams. I then added fighter data to the mix to dive deeper into what goes into a successful run at ADCC.

After that initial analysis, I ran the data through various models to see how accurately machine learning tools could predict a win or loss between fighters.

Main Finding: 

With the available data, I can predict the win/loss result of a match between two fighters 70% of the time, or 7 out of 10 matches. 

If you follow the sport, you know upsets happen. For comparison, seeding in collegiate wrestling predicts match outcomes with 67% accuracy (as we saw in another article on this blog). So why can't we predict 80 or 90% of matches?

The Primary Culprit: 

A lack of extensive match and fighter data in BJJ.

This shouldn't come as a surprise: anyone who has built predictive models knows you need hundreds of data points to feed the program. As a data scientist, most of my job is identifying and cleaning data that leads to a better understanding of the topic and to actionable recommendations. Another legitimate part is spotting where data is lacking and exploring new sources. This article serves as a baseline for what the current state of data offers Brazilian Jiu Jitsu. In articles to come, I will seek to uncover new data sources and show how they add to our understanding of the sport. That said, I'm writing this to help push our industry forward.

Recommendations:  

Increase the record keeping and public sharing of match-level and fighter data in Jiu Jitsu. 

To all grappling events, namely ADCC, IBJJF, UAEJJF World Pro, ACB JJ, Copa Podio, EBI, Polaris, and others – the Jiu Jitsu community needs better data to understand itself. 

Better match-level data, recorded consistently and shared publicly, would serve us all.

To all teams, fighters, and grappling events: our community also needs more information about each fighter and their training regimen. My applause to André Borges for creating and maintaining the BJJHeroes.com site and its wonderful encyclopedia of details on black belts. This is the first step, and collectively our community could take it further.

Nerd Alert: from here on, this article goes deep into the data science details. If you are here just for the Jiu Jitsu takeaways, feel free to stop reading. If you're a data nerd, continue and enjoy.

PROCESS

I took the data from the ADCC tournament as a baseline for testing. The research question I wanted to answer was whether I could train a machine learning model to predict the win/loss outcome of a given matchup between fighters at a tournament. First, I used the ADCC data, which included 98 fighters and 113 matches with actual match outcomes (wins and losses by fighter for each bracket).

Next, I had to remove all remaining data about the match itself, as it would skew the prediction. This included things like how many points were scored, whether the match ended in a submission, and how many minutes it lasted. If you were to put two fighters together in a future competition, you wouldn't know these things ahead of time, so keeping them in the dataset would leak the outcome into the model (target leakage). A minimal sketch of this column pruning appears after the feature list below.

The details that were kept in the dataset were: 
  • Fighter name, 
  • Opponent, 
  • Gym, 
  • ADCC entrance (past champion, invitee, or trials winner), 
  • Age, 
  • Belt rank, 
  • Experience as a black belt, 
  • Previous experience at an ADCC event, 
  • Total ADCC medals won, 
  • Highest medal achieved at an ADCC, 
  • Most recent ADCC year, 
  • Previous ADCC trials experience, 
  • Total number of trials medals, 
  • Highest medal achieved at a trials event, and 
  • Most recent trials competition year.
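
That pruning step looks roughly like the sketch below. The file name and column names (points_scored, submission, match_minutes, win) are illustrative placeholders, not the actual field names in my dataset.

```python
import pandas as pd

# Compiled ADCC match + fighter table (hypothetical file name).
df = pd.read_csv("adcc_2019_matches.csv")

# Anything describing the match itself is only known after the fact,
# so keeping it would leak the outcome into the model.
leakage_cols = ["points_scored", "submission", "match_minutes", "win_method"]
X = df.drop(columns=leakage_cols + ["win"])  # predictors only
y = df["win"]                                # target: 1 = win, 0 = loss
```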

WHAT’S NEW IN THE DATA

For the ADCC trials winners, their past experience was well represented. For the invitees, the dataset did not make clear what experience level led to them receiving an invite. For this project, I added competition results from IBJJF tournaments and other events as qualifying experience.

Details that were added for all fighters include: 
  • Total medals received at all IBJJF World Championships, 
  • Highest place at an IBJJF World Championship, 
  • Year the most recent IBJJF World Championship medal was won, 
  • Total medals received at all IBJJF major tournaments (limited to Brasileiros, Europeans, Pans, and No Gi Worlds), 
  • Highest place at a major IBJJF tournament, 
  • Most recent year a medal was earned at a major IBJJF tournament, 
  • Total medals received at all other tournaments outside of IBJJF (including UAEJJF World Pro, ACB JJ, SPYDER Invitational, EBI, Polaris, Quintet, FIVE, Kasai, Copa Podio, Kinektic, etc.), 
  • Highest place at one of these other tournaments, and 
  • Most recent year for the other tournaments.

Together, the new data added 9 fields for each fighter. Not extensive, but it adds an interesting layer. The original dataset wasn't enough to make a prediction on its own, so I already had to find new data sources just for this project. Compiling the ADCC match data, fighter data, and past experience, then cleaning and wrangling it all into something workable, took up a large part of my initial exploration into machine learning.
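
As a rough illustration of that wrangling, the sketch below joins per-fighter records onto the match table. The file names, column names, and join key are assumptions for the sake of the example, not my actual pipeline.

```python
import pandas as pd

matches = pd.read_csv("adcc_2019_matches.csv")           # one row per fighter per match
fighters = pd.read_csv("fighter_profiles.csv")           # age, belt, ADCC history, etc.
other_results = pd.read_csv("ibjjf_other_results.csv")   # the 9 added fields

# Join fighter-level data onto the match rows by fighter name.
df = (matches
      .merge(fighters, on="fighter", how="left")
      .merge(other_results, on="fighter", how="left"))

# Fighters with no recorded IBJJF/other history get zeros rather than NaNs.
history_cols = [c for c in other_results.columns if c != "fighter"]
df[history_cols] = df[history_cols].fillna(0)
```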

BUILDING A PREDICTIVE MODEL

The next step was to run the data through several machine learning models to compare their accuracy. I used one-hot encoding for the categorical variables and combined the result back with the numerical features.
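
A minimal sketch of that encoding step, assuming the compiled DataFrame df from the earlier sketches; the column lists are placeholders standing in for the features listed above.

```python
import pandas as pd

categorical = ["fighter", "opponent", "gym", "adcc_entrance", "belt_rank"]
numerical = ["age", "black_belt_years", "adcc_appearances", "total_adcc_medals"]

# Expand each categorical column into 0/1 indicator columns,
# then stitch the numeric features back on.
X = pd.concat([pd.get_dummies(df[categorical]), df[numerical]], axis=1)
y = df["win"]
```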

Since I was trying to predict a win or loss (which is a binary classification), my approach was to start with common machine learning classifiers. Here’s a quick chart on the accuracy scores that four models produced.
The Decision Tree Classifier produced the lowest accuracy at 51%, which, for a problem predicting a win or loss, is about as good as a guess would get you. I would need to up my game to increase the accuracy score. For the K Nearest Neighbors model, I tweaked the number of neighbors from 1, which produced 56% accuracy, to 6, which improved the score to 61%. I could have stopped here, but I wanted to explore all the models.
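
For anyone following along, a comparison like this takes only a few lines of scikit-learn. The exact lineup of four models in the chart isn't fully spelled out in the text, so the selection below (and the reuse of X and y from the earlier sketches) is partly an assumption on my part.

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

models = {
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "knn_k1": KNeighborsClassifier(n_neighbors=1),
    "knn_k6": KNeighborsClassifier(n_neighbors=6),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, round(model.score(X_test, y_test), 2))  # accuracy on held-out matches
```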

As a fun side project, I ran the data through the models again, this time looking to classify how the match was won, with four options: score, submission, referee's decision, and forfeit. This was a multiclass classification problem with far fewer data points spread across the four options; forfeits, for instance, occurred in only 6 cases. Here is a list of accuracy scores for the four models.
I wanted to see how the data would play out, and the result wasn't a pretty sight. The models could not predict with any useful accuracy how a match was won. I scrapped the side project, since you or I could guess how a match might end far better than a program trained on such limited data.
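
Mechanically, the only change for this side project is the target column; a quick sketch, again with a placeholder win_method column standing in for how the method of victory was recorded:

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Same feature matrix X, but a four-class target instead of win/loss.
y_method = df["win_method"]  # "score", "submission", "decision", or "forfeit"

Xm_train, Xm_test, ym_train, ym_test = train_test_split(
    X, y_method, test_size=0.25, random_state=42)

clf = DecisionTreeClassifier(random_state=42).fit(Xm_train, ym_train)
print(round(clf.score(Xm_test, ym_test), 2))  # multiclass accuracy
```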

Back to predicting a win or loss! Next, I tried several different encoding mechanisms, since part of my data consisted of categorical features. To feed the data into a machine learning model, I had to convert any words (string categories) into numbers. Simple label encoding would assign the category of gender as male=1 and female=2, but one gender is not twice the value of the other, so standard label encoding would not work here. That's why I started with one-hot encoding. Nonetheless, I tried several other encoding mechanisms, namely Count, Target, and CatBoost encoding.

Count encoding replaces a category with the number of times it appears in the data. Target encoding replaces a categorical value with the mean of the target for that category and should only be fit on the training dataset. CatBoost encoding is also based on the target, but each row's encoding is calculated from the previous rows only. Here are my accuracy results:
You'd think a 100% score would be ideal, but no machine learning model should have an accuracy score of 100%, so something went wrong. Both the Target and CatBoost encodings led to what is known as data leakage: the numerical values assigned to the categories carried information about the target, tipping the model off so that it appeared to predict perfectly. Applied to new data, such a model would be highly inaccurate and would predict poorly in future cases. At this point, I ditched these two encoding mechanisms. The episode also made me skeptical of count encoding, so I decided to stick with the original one-hot encoding.
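
For anyone who wants to try these encoders, the category_encoders package implements all three (an assumption about tooling, not a claim about my exact setup). The sketch below bakes in the lesson from the leakage episode by fitting each encoder on the training split only, reusing the df and column lists from the earlier sketches.

```python
import category_encoders as ce
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df[categorical + numerical], df["win"], test_size=0.25, random_state=42)

encoders = {
    "count": ce.CountEncoder(cols=categorical),
    "target": ce.TargetEncoder(cols=categorical),
    "catboost": ce.CatBoostEncoder(cols=categorical),
}

for name, enc in encoders.items():
    # Fit on the training rows only, then apply the fitted encoder to the
    # test rows; fitting on everything is what produces the fake 100% scores.
    X_train_enc = enc.fit_transform(X_train, y_train)
    X_test_enc = enc.transform(X_test)
    model = LogisticRegression(max_iter=1000).fit(X_train_enc, y_train)
    print(name, round(model.score(X_test_enc, y_test), 2))
```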

LightGBM has become one of the most popular classifiers out there, so I decided to test it on the dataset. As part of the data preparation, I split the data into smaller datasets. In the first instance, I used a 75-25 train/test split, which produced an accuracy score of 54%. Next, I split the original dataset into train, test, and validation sets. Because the three sets carved up an already small dataset, the test set gave me only a 40% accuracy score, which then jumped to 85% on the validation set. The large spread between the two was likely caused by the fact that the train, test, and validation sets were all very small, so the validation set introduced data the model had not seen before and skewed the results. I scrapped the train/test/validation split altogether. And since LightGBM produced an accuracy score of only 54%, no higher than the models we looked at before, I chose not to use it as my final model.
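
A minimal LightGBM sketch under the same assumptions about X and y, mirroring the 75-25 split from the first experiment:

```python
from lightgbm import LGBMClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

lgbm = LGBMClassifier(n_estimators=200, random_state=42)
lgbm.fit(X_train, y_train)
print(round(lgbm.score(X_test, y_test), 2))  # accuracy on the held-out 25%
```
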
That left the original Logistic Regression classifier, which rendered the highest predictive accuracy. Trained on the training dataset and evaluated on the testing dataset, it produced an accuracy score of 60% with an F1 score of 62% (the harmonic mean of precision and recall). I then used k-fold cross validation with a grid search, and finally a randomized search, to find the best parameters (essentially tuning the hyperparameters to enhance the model's performance), and arrived at a final accuracy score of 70% after setting the penalty to L1, C to 0.1, and using 3 folds for cross validation. I also compared the Logistic Regression model to a LinearSVC model, which came up with an accuracy score of 63%, so I felt comfortable selecting the Logistic Regression classifier as my final model.
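
A sketch of that tuning step with scikit-learn. The parameter grid is a plausible reconstruction around the values reported above (L1 penalty, C = 0.1, 3 folds), not the exact grid I searched.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# liblinear supports the L1 penalty for logistic regression.
param_grid = {"penalty": ["l1", "l2"], "C": [0.01, 0.1, 1, 10]}
search = GridSearchCV(LogisticRegression(solver="liblinear"),
                      param_grid, cv=3, scoring="accuracy")
search.fit(X_train, y_train)

preds = search.best_estimator_.predict(X_test)
print(search.best_params_)                      # e.g. {'C': 0.1, 'penalty': 'l1'}
print(round(accuracy_score(y_test, preds), 2))  # final accuracy
print(round(f1_score(y_test, preds), 2))        # F1 on the test set
```
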
FEATURE ENGINEERING & SELECTION

The final step, and a task I could have put a lot more time into had I started with a larger dataset and decent results from the get-go, was to take the existing features and generate new data from the relationships between them, known as feature generation or engineering. Frankly, I did not dive too deep into this step, as I concluded early on that a larger and more varied dataset would be needed. However, I did explore a few options that I will mention here for fun.

I played around with creating new data points by combining two categorical features. For instance, I crossed how a fighter entered the tournament with their gym, gender with weight class, and the fighter with their belt rank. One could easily turn 10 features into 100 this way and test each one's importance and relationship to the win/loss outcome. I also transformed numerical features to get a better sense of the data: taking the square root or the natural logarithm of a feature can make its distribution easier to see and clarify the relationships between features.
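
A couple of those crosses and transforms, sketched in pandas and NumPy with illustrative column names:

```python
import numpy as np

# Cross two categorical columns into a single combined feature.
df["entrance_x_gym"] = df["adcc_entrance"].astype(str) + "_" + df["gym"].astype(str)
df["gender_x_weight"] = df["gender"].astype(str) + "_" + df["weight_class"].astype(str)

# Transform skewed numeric features; log1p handles zero medal counts cleanly.
df["log_total_adcc_medals"] = np.log1p(df["total_adcc_medals"])
df["sqrt_black_belt_years"] = np.sqrt(df["black_belt_years"])
```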

Last, I selected features to reduce the dimensionality of my model, meaning I dropped the many features that did not have a strong predictive relationship with the match outcome. Once the categorical features were one-hot encoded, the dataset for machine learning went from 28 features to 296. With the additional features created during the feature engineering stage in search of a golden relationship, the total kept growing. Combined with an already small number of instances in the data, this may well have decreased the overall model accuracy.
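
One simple way to prune an expanded feature set like that is univariate selection; a sketch with scikit-learn, where the choice of k = 30 is arbitrary and the split variables come from the earlier sketches:

```python
from sklearn.feature_selection import SelectKBest, f_classif

# Keep the 30 features with the strongest univariate relationship to the outcome,
# fitting the selector on the training split only.
selector = SelectKBest(score_func=f_classif, k=30)
X_train_sel = selector.fit_transform(X_train, y_train)
X_test_sel = selector.transform(X_test)

kept = X_train.columns[selector.get_support()]
print(list(kept)[:10])  # a peek at the surviving feature names
```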

Now that I have a baseline for this dataset, I can spend more time on feature engineering and selection in the future.

CONCLUSION

It is also possible that there is little to no relationship between the available data and the research question, meaning the current fighter and match data simply cannot predict a win or loss. Different data might be better suited to answering this question: how many hours a week a fighter trains, the type of training (wrestling vs. submission preparation), drilling and sparring hours, control times, weight cuts and weight lifting, plus a number of other factors.

It is also possible that a machine learning model could generalize and predict well on new data, but that fighters and matches are simply unpredictable. Fighters can be outliers whose winning traits are not seen in other fighters, and matches sometimes end in upsets for no apparent reason. This is what makes sports exciting, but difficult for a machine learning model to generalize on.

If collegiate wrestling seeds predict match results 67% of the time, our model that predicts a correct match outcome 70% of the time is a good place to start. In coming articles, I will explore new data sources to deepen our understanding of Brazilian Jiu Jitsu.