Team India Cricket Data Analysis

Being a fan of both cricket and cricket statistics, I decided to explore a cricket-related data set this time. I went straight to Kaggle and picked up a data set maintained by Jalaz Kumar. If you wish to view or download it for your own use here is the link to do so. In its raw form, the data set contained 3932 One Day International (ODI) matches described by 7 features. Here is a sample view of the raw data:

Let me briefly describe the features available in the data set:

  • Scorecard: Contains an index
  • Team 1: Contains name of the host Team
  • Team 2: Contains name of the visiting Team
  • Winner: Contains name of the winning team
  • Margin: Contains margin by which a team won. It is either in number of wickets or number of runs
  • Ground: Contains name of the ground on which the game was played
  • Match Date: Contains date on which the match was played

I am from India, so I was only interested in exploring data related to the Indian cricket team. I created a new data frame containing only the matches in which India was one of the two teams. Below is the one-line filter that did this.

crik_india <- crik_data[crik_data$`Team 1`=="India" | 
                          crik_data$`Team 2` == "India",]

I removed the Scorecard feature from the data frame as it was just an index. With the 6 remaining features, this was how a sample of the data frame looked.
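Dropping the index column takes a single assignment. Here is a minimal sketch on a made-up two-row data frame; the rows and the name `toy` are invented for illustration, and only the column names come from the feature list above.

```r
# Toy stand-in for crik_india; these rows are invented, not taken from the data set
toy <- data.frame(
  Scorecard = 1:2,
  `Team 1`  = c("England", "India"),
  `Team 2`  = c("India", "Sri Lanka"),
  Winner    = c("England", "India"),
  check.names = FALSE,       # keep the space in "Team 1" / "Team 2"
  stringsAsFactors = FALSE
)

# Assigning NULL to a column removes it from the data frame
toy$Scorecard <- NULL

names(toy)
```

The same assignment applied to `crik_india` would leave the six remaining features.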

In total, the Team India data contained 930 matches, from July 1974 to October 2017. Based on this data, these were the six main questions I focused on answering:

  • What is India’s total Win/Loss/Tie percentage?
  • What is India’s Win/Loss/Tie percentage in away and home matches?
  • How many matches has India played against different ICC teams?
  • How many matches has India won or lost against different teams?
  • Which are the home and away grounds where India has played the most matches?
  • What has been the average Indian win or loss by Runs per year?

The data alone, however, wasn’t enough. I had to do a bit of feature engineering and data wrangling to get the answers I was seeking. Below is the step-by-step process I used to clean the dates and create two new features named Ind Win Loss and Home Away.

# year() comes from the lubridate package
library(lubridate)

# remove commas from the date strings
crik_india$`Match Date` <- gsub(",", "", crik_india$`Match Date`)
# entries that span several days: keep only the last day
crik_india$`Match Date`[crik_india$`Match Date` == "Jul 15-16 1974"] <- "Jul 16 1974"
crik_india$`Match Date`[crik_india$`Match Date` == "Jun 16-18 1979"] <- "Jun 18 1979"
crik_india$`Match Date`[crik_india$`Match Date` == "Jun 9-10 1983"] <- "Jun 10 1983"
# replace spaces with hyphens
crik_india$`Match Date` <- gsub(" ", "-", crik_india$`Match Date`)
# convert the strings to dates and keep only the year
crik_india$`Match Date` <- year(as.Date(crik_india$`Match Date`, 
                                  format = "%B-%d-%Y"))
# create a new categorical field named `Ind Win Loss`:
# "Win" where India is the winner, "Loss" otherwise,
# then "Tie" where the match had no result
crik_india$`Ind Win Loss` <- ifelse(crik_india$Winner == "India", 
                                    "Win", 
                                    "Loss") 
crik_india$`Ind Win Loss`[crik_india$Winner == "no result"] <- "Tie"
# vector with the names of India's home grounds
home_grounds <- c("Kolkata", "Bengaluru", "Delhi", "Mumbai", "Nagpur", 
                  "Ahmedabad", "Cuttack", "Kanpur", "Mohali", "Rajkot", 
                  "Chennai", "Indore", "Jaipur", "Pune", "Guwahati", 
                  "Gwalior", "Hyderabad (Deccan)","Jamshedpur", "Kochi", 
                  "Visakhapatnam", "Faridabad", "Chandigarh", "Ranchi", 
                  "Dharamsala", "Amritsar", "Jalandhar", "Jodhpur", 
                  "New Delhi", "Srinagar", "Thiruvananthapuram", "Mumbai (BS)", 
                  "Vijayawada", "Vadodara")

# create a new feature named `Home Away` containing "Home"
# for home grounds and "Away" for all other grounds
crik_india$`Home Away` <- ifelse(crik_india$Ground %in% home_grounds, 
                                 "Home", 
                                 "Away")

This was how the new data frame looked after cleaning up and adding a couple of new features.

At this point I was ready to look for answers. I started with a subplot of three pie charts to answer the first two questions: What is India’s total Win/Loss/Tie percentage? and What is India’s Win/Loss/Tie percentage in away and home matches?

As we can observe, India has played 930 matches in total, of which it has won 51.2% (476), lost 44.5% (414) and tied 4.3% (40). Note that the Tie label here covers matches recorded as no result. We can also compare India’s performance in home and away matches. India does noticeably better at home, winning 58.8% of its home matches compared with 47.7% of its away matches.
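Percentages like these can be computed with `table()` and `prop.table()`. The sketch below uses two short invented vectors (`result` and `home_away`) in place of the `Ind Win Loss` and `Home Away` columns, so its numbers are not the ones quoted above.

```r
# Invented labels standing in for crik_india$`Ind Win Loss` and crik_india$`Home Away`
result    <- c("Win", "Loss", "Win", "Tie", "Win", "Loss", "Win", "Loss")
home_away <- c("Home", "Away", "Home", "Away", "Away", "Away", "Home", "Home")

# Overall share of each outcome, in percent
overall_pct <- round(100 * prop.table(table(result)), 1)

# Share of each outcome within Home and within Away (each row sums to 100)
split_pct <- round(100 * prop.table(table(home_away, result), margin = 1), 1)

overall_pct
split_pct
```

Using `margin = 1` normalises each row separately, which is what makes the home and away percentages comparable even when the number of home and away matches differs.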

I used a combination of bar charts, grouped bar charts and data tables to answer the next two questions: How many matches has India played against different ICC teams? and How many matches has India won or lost against different teams?

The five teams India has played most often are:
  • Sri Lanka, 155 ODIs
  • Pakistan, 129 ODIs
  • Australia, 128 ODIs
  • West Indies, 121 ODIs
  • New Zealand, 101 ODIs

England, South Africa and Zimbabwe have each played more than 60 ODIs against India.
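Counting matches per opponent first requires an opponent column, since India can appear as either Team 1 or Team 2. A minimal sketch, assuming a derived column named `Opponent` (a hypothetical name) and using invented rows:

```r
# Invented fixtures; only the column names follow the data set
toy <- data.frame(
  `Team 1` = c("India", "Sri Lanka", "India", "Australia", "Pakistan", "India"),
  `Team 2` = c("Sri Lanka", "India", "Pakistan", "India", "India", "Sri Lanka"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# The opponent is whichever side is not India
toy$Opponent <- ifelse(toy$`Team 1` == "India", toy$`Team 2`, toy$`Team 1`)

# Number of matches per opponent, most frequent first
opp_counts <- sort(table(toy$Opponent), decreasing = TRUE)

opp_counts
```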

The grouped bar chart subplot below displays India’s win and loss counts in ODIs against each team, at home and away. Below the plots are two data tables displaying the same information, except that teams with fewer than 10 matches are excluded.

India’s performance against arch-rival Pakistan and against Australia hasn’t been great. Otherwise, India has done pretty well against other teams. In general, the plots and the tables once again show that India performs much better in home conditions than in away conditions.
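The tables described above can be approximated by tallying results per opponent and venue, then dropping opponents with too few matches. This is only a sketch on invented rows: the column names `Opponent`, `Result` and `HomeAway` and the variable `min_matches` are assumptions, and the threshold is set to 2 so the toy data shows the effect (the tables above use 10).

```r
# Invented rows; column names are hypothetical stand-ins for the real features
toy <- data.frame(
  Opponent = c("Pakistan", "Pakistan", "Pakistan", "Australia", "Australia", "Kenya"),
  Result   = c("Win", "Loss", "Loss", "Loss", "Win", "Win"),
  HomeAway = c("Home", "Away", "Away", "Away", "Home", "Away"),
  stringsAsFactors = FALSE
)

# Long-format tally: one row per opponent / venue / result combination
tally <- as.data.frame(
  table(Opponent = toy$Opponent, HomeAway = toy$HomeAway, Result = toy$Result),
  stringsAsFactors = FALSE
)

# Keep only opponents with at least min_matches matches in total
min_matches <- 2  # 10 in the tables described above
played <- table(toy$Opponent)
keep   <- names(played)[played >= min_matches]

# Also drop combinations that never occurred
tally <- tally[tally$Opponent %in% keep & tally$Freq > 0, ]

tally
```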

Which are the home and away grounds where India has played the most matches? was the next question I was looking to answer. India had played on 119 grounds in total. That many data points on a single bar plot would look messy and incoherent, so I decided to subplot two bar charts displaying India’s win/loss counts on the top ten home grounds and the top ten away grounds separately.

Again, we can observe India’s impressive record on home pitches, with the one exception of Ahmedabad, where India has lost more matches than it has won. On away grounds, however, the story is a bit different. In particular, Australian pitches (Brisbane, Melbourne and Sydney) seem to trouble Team India quite a bit.
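Picking the most-used grounds is a matter of counting and sorting. The helper below, `top_n_grounds()`, is a hypothetical function written for this sketch, and the rows are invented; only the ground names are borrowed from the text.

```r
# Invented rows: ground names come from the post, the counts do not
toy <- data.frame(
  Ground   = c("Sydney", "Melbourne", "Sydney", "Kolkata", "Sydney",
               "Kolkata", "Brisbane", "Melbourne", "Mumbai"),
  HomeAway = c("Away", "Away", "Away", "Home", "Away",
               "Home", "Away", "Away", "Home"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the n most-played grounds for one venue type
top_n_grounds <- function(df, where, n) {
  tab <- sort(table(df$Ground[df$HomeAway == where]), decreasing = TRUE)
  tab[seq_len(min(n, length(tab)))]
}

top_away <- top_n_grounds(toy, "Away", 2)
top_home <- top_n_grounds(toy, "Home", 1)

top_away
top_home
```

With the real data, calling the helper with `n = 10` for "Home" and for "Away" would give the two sets of grounds to plot.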

It was time to answer the last question on the list: What has been the average Indian win or loss by Runs per year? Below is a dygraph that shows India’s average win/loss margin by runs per year from 1982 to 2017.
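The yearly averages behind such a graph can be computed with `aggregate()`. This sketch assumes the Margin strings look like "50 runs" or "6 wickets"; the exact format in the data set may differ. The rows and the column names `Year`, `Result` and `Runs` are invented for illustration.

```r
# Invented rows; the "N runs" / "N wickets" format is an assumption
toy <- data.frame(
  Year   = c(2016, 2016, 2016, 2017, 2017),
  Margin = c("50 runs", "6 wickets", "30 runs", "10 runs", "20 runs"),
  Result = c("Win", "Win", "Loss", "Win", "Win"),
  stringsAsFactors = FALSE
)

# Keep only matches decided by runs (matches "1 run" as well as "50 runs")
by_runs <- toy[grepl("run", toy$Margin), ]

# Strip everything except digits to get the numeric margin
by_runs$Runs <- as.numeric(gsub("[^0-9]", "", by_runs$Margin))

# Average margin in runs for each year and result
avg_runs <- aggregate(Runs ~ Year + Result, data = by_runs, FUN = mean)

avg_runs
```

Matches decided by wickets are excluded here because a wickets margin is not comparable with a runs margin.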