Building a large database of MMA fight results III: summarizing the demographics of 140,000 MMA fighters

13 May 2016 By Sean Hackett

The essential units of analysis in MMA are the sport’s fighters and the bouts between them. In my previous post, I discussed how to standardize match data so that the major categories of finishes could be easily visualized.

In this entry, I will discuss how to clean-up fighter-level data. The main goal of this post will be to determine the factors that affect fighter matchups. For some fighters, our data contains useful demographics, while for other fighters, this information is missing. One goal of this post will be to infer missing demographic data based on fighters’ previous bouts. For example, if we don’t know where in the world a fighter lives, we can probably determine their location with good accuracy if all their fights have been against people from the same region.

The data that we will primarily work with was collected seperately for each fighter. We will first combine fighters’ data into a single table that can be more easily used.

library(dplyr)

load("~/Desktop/MMA/software/data/raw_fight_data/fighter_output.Rdata")

kable(fighter_output[[1]]$vital_stats, row.names=F)

Name	Nickname	Birthday	Height	Weight	nationality	weight_class	camp
Andrei Arlovski	The Pit Bull	1979-02-04	193.04	109.32	Belarus	Heavyweight	Jackson-Wink MMA

fighter_vital_statistics <- bind_rows(lapply(fighter_output, function(x){x$vital_stats})) %>%
  mutate(Query = names(fighter_output)) %>% tbl_df()

kable(head(fighter_vital_statistics), row.names=F)

Name	Nickname	Birthday	Height	Weight	nationality	weight_class	camp	Query
Andrei Arlovski	The Pit Bull	1979-02-04	193.04	109.32	Belarus	Heavyweight	Jackson-Wink MMA	/fighter/Andrei-Arlovski-270
Ikuhisa Minowa	Minowaman	1976-01-12	175.26	83.91	Japan	Middleweight	Kuma Gym	/fighter/Ikuhisa-Minowa-250
Kazushi Sakuraba	The Gracie Hunter	1969-07-14	182.88	75.75	Japan	Welterweight	Laughter7	/fighter/Kazushi-Sakuraba-84
Ronda Rousey	Rowdy	1987-02-01	167.64	61.23	United States	Bantamweight	Team Hayastan	/fighter/Ronda-Rousey-73073
B.J. Penn	The Prodigy	1978-12-13	175.26	65.77	United States	Featherweight	BJ Penn’s MMA	/fighter/BJ-Penn-1307
Thales Trindade	Pezao	NA	NA	NA	Brazil	NA	NA	/fighter/Thales-Trindade-166551

Summarizing fighter demographics: gender, weight and locale

For many fighters, our dataset contains important fighter attributes such as nationality, weight class and age. One attribute that is noticeably absent is gender. If we are interested in the factors affecting who-fights-who in MMA, then gender is definitely important.

One method that we can use to try to predict the genders of fighters is to look at the gender that is generally associated with a fighter’s first name. The gender provides great resources for exactly this task. Using the “ssa” method, which looks at Social Security Administration baby names, for each first name, we can determine the fraction of male babies and the fraction of female babies with the name.

fighter_names <- fighter_vital_statistics %>% select(Name, Query) %>%
  rowwise() %>% mutate(first_name = strsplit(Name, split = " ")[[1]][1])

gender_prediction <- gender(unique(fighter_names$first_name), method = "ssa") %>% tbl_df() %>%
  select(first_name = name, Pr_male = proportion_male)

fighter_names <- fighter_names %>% left_join(gender_prediction, by = "first_name")

kable(head(fighter_names), row.names=F)

Name	Query	first_name	Pr_male
Andrei Arlovski	/fighter/Andrei-Arlovski-270	Andrei	1.0000
Ikuhisa Minowa	/fighter/Ikuhisa-Minowa-250	Ikuhisa	NA
Kazushi Sakuraba	/fighter/Kazushi-Sakuraba-84	Kazushi	NA
Ronda Rousey	/fighter/Ronda-Rousey-73073	Ronda	0.0083
B.J. Penn	/fighter/BJ-Penn-1307	B.J.	NA
Thales Trindade	/fighter/Thales-Trindade-166551	Thales	1.0000

Classifying names works for the majority of fighters, but for 8% of fighters predicted gender is clearly ambiguous (0.1 < Pr(Male) < 0.9), while for 11% of fighters no gender prediction is possible at all.

In addition to gender, we would like to summarize each fighter’s weight class and where he/she is from. To summarize where fighters are from, nationality may be too specific for some purposes (there are 193 countries in the UN). To provide a less specific summary of fighters’ locations, we can use the countrycode package. Countrycode provides mappings between countries and regions. We can then also manually add the continents in which these regions fall.

library(countrycode)

# Define regions
nationality_summary <- fighter_vital_statistics %>% count(nationality) %>%
  mutate(nation_region = countrycode(nationality, "country.name", "region"))
nationality_summary$nation_region[nationality_summary$nationality == "Taiwan"] <- "Eastern Asia"
nationality_summary$nation_region[nationality_summary$nation_region %in% c("Melanesia", "Micronesia", "Polynesia", "Australia and New Zealand")] <- "Oceania"

# Define continents
nationality_summary$nation_continent <- nationality_summary$nation_region
nationality_summary$nation_continent[nationality_summary$nation_region %in% c("Caribbean", "Northern America", "Central America")] <- "N. America"
nationality_summary$nation_continent[nationality_summary$nation_region %in% c("South America")] <- "S. America"
nationality_summary$nation_continent[grepl('Africa', nationality_summary$nation_region)] <- "Africa"
nationality_summary$nation_continent[grepl('Europe', nationality_summary$nation_region)] <- "Europe"
nationality_summary$nation_continent[grepl('Asia', nationality_summary$nation_region)] <- "Asia"

# overwrite misspelled nationalities
fighter_vital_statistics$nationality[fighter_vital_statistics$nationality %in% nationality_summary$nationality[is.na(nationality_summary$nation_region)]] <- NA

kable(head(nationality_summary), row.names = F)

nationality	n	nation_region	nation_continent
165	1	NA	NA
185	1	NA	NA
Afganistan	1	NA	NA
Afghanistan	98	Southern Asia	Asia
Albania	13	Southern Europe	Europe
Algeria	10	Northern Africa	Africa

Based on our starting weight class and nationality data along with the genders that were inferred based on names, we have complete data for 49% of fighters. We can be confident in weight class (although this does change across fighters’ careers). We are less confident, however, in the inferred gender because there are many cases of ambigously predicted genders.

Network-based imputation of fighter demographics

To improve our prediction of fighters’ locale, weight class and gender, we can use an additional important source of information: their matchups with other fighters. Fighters are likely to fight people who are a similar weight, the same gender and who reside in a similar location.

Our strategy is to capitalize on bout information to label fighters’ unknown attributes (and gender, as this may be mispredicted by name), according to a consensus of fighters’ neighbors. This approach is implemented as an iterative algorithm. During one iteration, we start with bout data (fighter-opponent) and fill in the opponent demographics. Then a consensus of opponents’ demographics is taken for each fighter and the fighter’s attributes are updated if they were previously unknown. After each iteration we learn more about opponents’ demographics (if they were missing); this information is then passed onto fighters until we are left with only ambiguous attributes.

# load bout data
all_bouts <- readRDS("~/Desktop/MMA/software/data/processed_fight_data/MMA_all_bouts.Rds")

fighter_vital_network_V <- fighter_vital_statistics %>% select(Query, weight_class, nationality) %>%
  left_join(fighter_names, by = "Query") %>%
  left_join(nationality_summary %>% select(-n), by = "nationality") %>%
  mutate(Pr_male_impute = Pr_male,
         weight_class_impute = weight_class,
         nationality_impute = nationality,
         nation_region_impute = nation_region,
         nation_continent_impute = nation_continent)

kable(head(fighter_vital_network_V %>% select(Name, first_name, Pr_male, Pr_male_impute, weight_class, weight_class_impute, nationality, nationality_impute)), row.names = F)

Name	first_name	Pr_male	Pr_male_impute	weight_class	weight_class_impute	nationality	nationality_impute
Andrei Arlovski	Andrei	1.0000	1.0000	Heavyweight	Heavyweight	Belarus	Belarus
Ikuhisa Minowa	Ikuhisa	NA	NA	Middleweight	Middleweight	Japan	Japan
Kazushi Sakuraba	Kazushi	NA	NA	Welterweight	Welterweight	Japan	Japan
Ronda Rousey	Ronda	0.0083	0.0083	Bantamweight	Bantamweight	United States	United States
B.J. Penn	B.J.	NA	NA	Featherweight	Featherweight	United States	United States
Thales Trindade	Thales	1.0000	1.0000	NA	NA	Brazil	Brazil

# Look at all opponents
fighter_vital_network_E <- all_bouts %>% select(Fighter_link, Opponent_link) %>% unique()
fighter_vital_network_E <- rbind(fighter_vital_network_E, fighter_vital_network_E %>% select(Opponent_link = Fighter_link, Fighter_link = Opponent_link)) %>% unique()

# Add known attributes of query fighter (fighter link)
fighter_vital_network_E <- fighter_vital_network_E %>% left_join(fighter_vital_network_V %>% select(Fighter_link = Query, Pr_male, weight_class, nationality, nation_region, nation_continent), by = "Fighter_link")
# Add consensus estimate of opponent attributes (opponent_link)
fighter_vital_network_E <- fighter_vital_network_E %>% left_join(fighter_vital_network_V %>% select(Opponent_link = Query, Pr_male_impute:nation_continent_impute), by = "Opponent_link")

# determine the top match from neighbors
get_max <- function(x){
  if(all(is.na(x))){
    return(c(NA_character_)) 
  }else{
    x %>% table() %>% sort(decreasing = T) %>% names() %>% first() %>% return()
  }
}

# so that we can track missing values as the algorithm proceeds
count_na <- function(x){
 sum(is.na(x)) 
}

# track number of missing values throughout iteration

iter_count <- 0
track_fill_in <- fighter_vital_network_V %>% ungroup() %>% summarize_each(funs(count_na), Pr_male_impute:nation_continent_impute) %>% mutate(iteration = iter_count)

repeat{
  iter_count <- iter_count+1
  
  # infer demographics based on neighborhood
  # for weight class and locations, find the most frequent neighbor class
  # for gender [0,1] average Pr(male)
  
  update_fighter_vitals <- fighter_vital_network_E %>%
    group_by(Fighter_link) %>%
    summarize(Pr_male = Pr_male[1],
              weight_class = weight_class[1],
              nationality = nationality[1],
              nation_region = nation_region[1],
              nation_continent = nation_continent[1],
              Pr_male_impute = ifelse(all(is.na(Pr_male_impute)), NA, mean(Pr_male_impute, na.rm = T)),
              weight_class_impute = get_max(weight_class_impute),
              nationality_impute = get_max(nationality_impute),
              nation_region_impute = get_max(nation_region_impute),
              nation_continent_impute = get_max(nation_continent_impute))
  
  track_fill_in <- rbind(track_fill_in,
                         update_fighter_vitals %>% summarize_each(funs(count_na), Pr_male_impute:nation_continent_impute) %>% mutate(iteration = iter_count)
  )
  
  if(iter_count > 9){
    break
  }else{
    
    # overwrite known demographics during each iteration
    update_fighter_vitals <- update_fighter_vitals %>%
      mutate(Pr_male_impute = ifelse(is.na(Pr_male), Pr_male_impute, Pr_male),
             weight_class_impute = ifelse(is.na(weight_class), weight_class_impute, weight_class),
             nationality_impute = ifelse(is.na(nationality), nationality_impute, nationality),
             nation_region_impute = ifelse(is.na(nation_region), nation_region_impute, nation_region),
             nation_continent_impute = ifelse(is.na(nation_continent), nation_continent_impute, nation_continent))
    
    fighter_vital_network_E <- fighter_vital_network_E %>% select(Fighter_link:nation_continent) %>%
      left_join(update_fighter_vitals %>% select(Opponent_link = Fighter_link, Pr_male_impute:nation_continent_impute),
                by = "Opponent_link")
    
  }
}

fighter_vital_consensus <- update_fighter_vitals %>% rowwise() %>%
  mutate(Gender = if(is.na(Pr_male_impute)){NA_character_}else if(Pr_male_impute < 0.2){"Female"}else if(Pr_male_impute > 0.8){"Male"}else{NA_character_},
         Weight_class = ifelse(is.na(weight_class), weight_class_impute, weight_class),
         Nationality = ifelse(is.na(nationality), nationality_impute, nationality),
         Nation_region = ifelse(is.na(nation_region), nation_region_impute, nation_region),
         Nation_continent = ifelse(is.na(nation_continent), nation_continent_impute, nation_continent)) %>%
  select(Query = Fighter_link, Gender:Nation_continent)
  
fighter_vital_statistics <- fighter_vital_statistics %>% select(Query, Name:Weight, Camp = camp) %>% left_join(fighter_vital_consensus, by = "Query")

As the imputation algorithm proceeds, we are able to fill in a large amount of missing information. This method should accurately predict unknown genders, weight classes and the general locale of fighters (inferred nationalities are more uncertain in many cases).

kable(track_fill_in)

Pr_male_impute	weight_class_impute	nationality_impute	nation_region_impute	nation_continent_impute	iteration
15890	55810	42317	42317	42317	0
8719	26693	20429	20429	20429	1
820	13393	9595	9595	9595	2
688	11144	7775	7775	7775	3
671	10680	7412	7412	7412	4
670	10576	7338	7338	7338	5
670	10551	7319	7319	7319	6
670	10546	7312	7312	7312	7
670	10544	7308	7308	7308	8
670	10542	7307	7307	7307	9
670	10542	7307	7307	7307	10

Summary of fighter demographics

As a quick summary of cleaned-up fighter-specific data, we can look at summaries of two major types of fighter demographics: fighter weight classes and where fighters live.

First we will look at weight classes. Because weight classes are ordered, with Atomweight the lightest and Super Heavyweight the heaviest, weight categories were sorted according to their logical progression and the frequency of fighters in each weight class was determined

library(ggplot2)

weight_summary <- fighter_vital_statistics %>%
  count(Weight_class) %>%
  filter(!is.na(Weight_class)) %>%
  mutate(Weight_class = ordered(Weight_class,
                                levels = c("Atomweight", "Strawweight", "Flyweight", "Bantamweight",
                                           "Featherweight", "Lightweight", "Welterweight", "Middleweight",
                                           "Light Heavyweight", "Heavyweight", "Super Heavyweight"))) %>%
  arrange(Weight_class)

# ggplot2 plotting theme
hist_theme <- theme(axis.title = element_text(color = "black", size = 25),
                    panel.background = element_rect(fill = "gray93"),
                    panel.border = element_rect(fill = NA, color = "black"),
                    panel.grid.minor = element_blank(),
                    panel.grid.major.x = element_blank(),
                    panel.grid.major.y = element_line(color = "gray70", size = 1),
                    axis.text = element_text(size = 18, hjust = 0.5, vjust = 0.5),
                    axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5),
                    axis.ticks = element_line(size = 1),
                    axis.ticks.length = unit(0.15, "cm"))

ggplot(weight_summary, aes(x = Weight_class, y = n)) +
  geom_bar(stat = "identity", fill = "firebrick") +
  hist_theme +
  scale_y_continuous("Counts") + scale_x_discrete("Weight Class")

plot of chunk unnamed-chunk-3

Looking at weight classes: most fighters compete at intermediate weight classes; the most common weight classes are Lightweight and Welterweight.

The geographical summaries that we generated: nationality, region and continent are a nested hierarchy with nations residing within regions and regions within continents. To make use of this hierarchy for visualization we need to make sure that this nested structure is maintained for all fighters such that nations can only belong to one region and regions can only belong to one continent.

locale_summary <- fighter_vital_statistics %>%
  count(Nation_continent, Nation_region, Nationality)

locale_summary <- locale_summary %>%
  filter(!is.na(Nation_continent) & !is.na(Nation_region) & !is.na(Nationality))

# assigning each nation to unique region and region to a continent
continent_summary <- locale_summary %>%
  count(Nation_continent, Nation_region, wt = n) %>%
  group_by(Nation_region) %>%
  summarize(Continent = Nation_continent[which.max(n)][1])

region_summary <- locale_summary %>%
  count(Nation_region, Nationality, wt = n) %>%
  group_by(Nationality) %>%
  summarize(Region = Nation_region[which.max(n)][1])

locale_summary <- locale_summary %>%
  left_join(region_summary, by = "Nationality") %>%
  count(Nation_continent, Region, Nationality, wt = n) %>%
  left_join(continent_summary, by = c("Region" = "Nation_region")) %>%
  count(Continent, Region, Nationality, wt = n) %>%
  ungroup

kable(head(locale_summary %>% arrange(desc(n)), 8), row.names = F)

Continent	Region	Nationality	n
N. America	Northern America	United States	57170
S. America	South America	Brazil	18553
Europe	Northern Europe	United Kingdom	8190
Asia	Eastern Asia	Japan	5928
N. America	Northern America	Canada	4618
Europe	Eastern Europe	Russia	4478
N. America	Northern America	USA	2436
Europe	Eastern Europe	Poland	2361

Having established a hierarchy of fighter locations, summarized based on the number of fighters residing in each country, nested within geographical regions and continents, we can visualize this hierarchy using a treemap. Essentially we will turn each country into a rectangle with area proportional to the number of fighter’s from that country. These rectangles will be nested according to the continent-region-nationality hierarchy.

To allow for clearer discrimination of regions (which will be especially useful for later analyses) we can shade similar countries with similar colors. Conventionally, we would only color one level of our hierarchy by either choosing unique colors for continents or regions or nations. If we only colored continents, we would only have a very coarse summary of geograophy. If we colored nations then similar country might receive very different shades of color.

Using a hierarchy to guide the choice of colors seems natural when such information is present. Here, I chose a general color for each continent, shades of that color for regions, and more specific colors for each country.

library(treemap)

source("~/Desktop/MMA/manuscripts/blogs/Coloring_hierarchical_data/hierarchical_color_lib.R")

available_colors <- extract_color_space(hmax = 355, lmin = 30, cmin = 30)

country_colors <- identify_color_hierarchy(locale_summary, available_colors, weight_column = "n") %>%
  filter(Tier == "Nationality") %>% select(Nationality = Category, Color)

locale_summary <- locale_summary %>% ungroup %>% left_join(country_colors, by = "Nationality")

treemap(locale_summary, 
index = c("Continent", "Region", "Nationality"),
vSize = "n",
vColor = "Color", type = "color",
title = "Locations of MMA fighters",
fontsize.title = 30,
fontsize.labels = c(0, 0, 15))

plot of chunk unnamed-chunk-5

By coloring and organizing fighter according to their nationality we can see that Mixed Martial Arts is truely a world-wide phenomenon, although its popularity is greatest in the USA, Europe, Brazil and Japan.

In my next post, I will demonstrate how the wide geographical distribution of MMA fighters and other factors such as weight class and gender structure the matchups between fighters. This structure emerges directly from the matchups between fighters and can be readily visualized by treating MMA fights as an undirected graph.

Fight Prior MMA Data Science and Statistical Analysis

Building a large database of MMA fight results III: summarizing the demographics of 140,000 MMA fighters

Summarizing fighter demographics: gender, weight and locale

Network-based imputation of fighter demographics

Summary of fighter demographics

Related Posts

Building a Brazillian Jiu-Jitsu family tree 29 Dec 2016

MMA styles II: ranking the top striking, submission and decision specialists 20 Dec 2016

MMA styles I: grouping finishes using large-scale data 29 Sep 2016