T20 Match Prediction Model

Executive Summary

While Cricmetric has a robust intra-game match win prediction model, the ability to forecast the winner of a T20 match prior to the game starting has for the most part been a case of “play and miss”. With the next version of the BBL just around the corner, we decided to conduct our most comprehensive analysis to date in an attempt to come up with a better approach to predicting the winner of a T20 match before the game has started.

The positive news is that our new model, when validated against the IPL 2016 tournament, had an overall win rate of just under 63%.

The following article shares both the methodology used to develop the model and detailed results of the validation. We plan to share our pre-match predictions during the upcoming BBL to hopefully make the tournament not just enjoyable, but profitable as well!

 

Methodology

Step 1: Define the Project

In order to build a model that predicts the winner of a match, we needed to analyse data from past matches where the outcome of the match was already known. Which matches did we include/exclude? How many matches did we analyse?

As the objective of the model was to forecast the winner of matches in the major domestic T20 tournaments (IPL, BBL and CPL), only historical matches from these tournaments were considered for inclusion in the model development. International matches were excluded and as a result, the model we developed should not be expected to predict well for these matches.

In terms of how many matches from these tournaments to include, the simple answer was the more the better. Of course, like everything in life, there were always some constraints:

  1. Availability of data: Not only did we need data from the actual match, but we also needed two years of historical performance data for the players about to participate in each match. This meant we could not include every IPL, CPL and BBL match in our analysis.
  2. Matches with no winners: Matches that were abandoned before a winner could be determined, or matches that were tied (even if a winner was determined by a super over), were excluded.
  3. Stability of data: A requirement for a strong model is that the overall data trends from the matches being analysed are stable both over time and across tournaments. However, based on analysis of the IPL, BBL and CPL tournaments, it was found that CPL matches differed significantly from BBL and IPL matches in terms of the distribution of actual winning margins and relative team strength ratings, which are two key components of the final model. Due to these differences, CPL matches were excluded altogether from the model development. This means the model can only be applied with confidence to IPL and BBL matches.

After considering the above, a total of 824 observations were included in the model development sample taken from:

  • IPL matches from 2010 to 2015. IPL matches in 2016 were excluded because they were used to validate the accuracy of the model
  • BBL matches from 2012 to 2016

While 824 observations is less than what would be ideally required to build a robust and stable model, it is sufficient as a starting point and the model will be refined as more data becomes available.
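For readers who like to see the mechanics, the selection rules above can be expressed as a simple filter. The sketch below assumes a hypothetical pandas DataFrame of historical matches with illustrative column names; it is not our actual data pipeline.

```python
import pandas as pd

def select_development_matches(matches: pd.DataFrame) -> pd.DataFrame:
    """Apply the Step 1 inclusion rules to a table of historical T20 matches.

    Assumed (illustrative) columns:
      tournament          - 'IPL', 'BBL' or 'CPL'
      season              - e.g. 2014
      result              - 'win', 'tie' or 'abandoned'
      has_player_history  - True if two years of prior player data exist
    """
    in_window = (
        ((matches["tournament"] == "IPL") & matches["season"].between(2010, 2015))
        | ((matches["tournament"] == "BBL") & matches["season"].between(2012, 2016))
    )
    keep = (
        in_window                          # CPL and IPL 2016 fall outside the window
        & (matches["result"] == "win")     # drop ties and abandoned matches
        & matches["has_player_history"]    # need two years of player history
    )
    return matches[keep]
```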

 

Step 2: Construct the Dataset for Model Development

A total of 824 observations meant that our dataset had 412 winning teams and 412 losing teams. In order to build a predictive model we needed to identify variables that had a strong ability to separate the winning teams from the losing teams.

In previous attempts to predict the outcome of a match we often made our predictions days or even weeks ahead of the match itself. However, our analysis and, more importantly, common sense suggested that the actual players in each team have a significant bearing on the outcome of the match. Our model was therefore built using data connected to the actual players who played in each match. This also means that when applying the model in practice, we cannot predict the match winner until after the toss has taken place and the playing line-ups have been announced.

Dozens of candidate variables were included in the analysis, covering the following aspects:

  • The relative strength of each team, which is dependent on the player line up for each team taking part in the match
  • The recent form of the team
  • Other elements including home ground advantage, the outcome of the toss and the decision to bat or bowl first

Not all of the variables considered were predictive; however, enough were identified to build a reasonably predictive model.
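To make the shape of the dataset concrete, here is a minimal sketch of how one team-level observation could be assembled. The feature names and the strength/form helpers are hypothetical placeholders for our internal calculations, not the actual scorecard variables.

```python
def team_strength(playing_xi, player_ratings):
    """Aggregate player-level ratings over a named XI (placeholder calculation)."""
    return sum(player_ratings.get(p, 0.0) for p in playing_xi)

def recent_form(recent_results):
    """Share of the team's recent completed matches that were won (placeholder)."""
    return sum(recent_results) / max(len(recent_results), 1)

def build_team_observation(match, team, opposition, player_ratings):
    """One observation = one team in one match, so each match yields two rows."""
    return {
        # relative team strength, driven by the named playing XI of each side
        "relative_strength": team_strength(team["playing_xi"], player_ratings)
        - team_strength(opposition["playing_xi"], player_ratings),
        # recent form of the team coming into the match
        "recent_form": recent_form(team["recent_results"]),
        # other contextual factors: home ground, toss, bat/bowl first
        "home_advantage": int(match["venue"] in team["home_grounds"]),
        "won_toss": int(match["toss_winner"] == team["name"]),
        "batting_first": int(match["batting_first"] == team["name"]),
        # target variable: did this team win the match?
        "won": int(match["winner"] == team["name"]),
    }
```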

 

Step 3: Develop the Model

The model was developed using logistic regression, with the output being a scorecard containing 10 scorebands. Each team is “scored” using the predictive variables in the scorecard, with the final score determining which scoreband the team falls into. The higher a team's score, the higher its probability of winning, and vice versa. If the model is strong, we would expect the win probability to decrease as we move from scoreband 1 (high score) to scoreband 10 (low score), and we would also expect a big difference between the expected winning percentage in scoreband 1 and that in scoreband 10.
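The scorecard itself is not reproduced here, but the general recipe can be sketched: fit a logistic regression on the team-level observations, score every team, and cut the scores into ten equal-population bands fixed on the development sample. The sketch below uses scikit-learn and pandas and assumes a dataset shaped like the one in Step 2; it illustrates the approach rather than our production model.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

def fit_scorecard(dev: pd.DataFrame, feature_cols):
    """Fit the logistic regression and fix the ten scoreband boundaries
    on the development sample (band 1 = highest scores)."""
    model = LogisticRegression(max_iter=1000)
    model.fit(dev[feature_cols], dev["won"])
    scores = model.predict_proba(dev[feature_cols])[:, 1]
    _, edges = pd.qcut(scores, 10, retbins=True)   # equal-population bands
    edges[0], edges[-1] = 0.0, 1.0                  # cover the full score range
    return model, edges

def assign_scorebands(model, edges, df: pd.DataFrame, feature_cols) -> pd.DataFrame:
    """Score each team observation and map it to a scoreband 1..10."""
    out = df.copy()
    out["score"] = model.predict_proba(out[feature_cols])[:, 1]
    band_index = pd.cut(out["score"], edges, labels=False, include_lowest=True)
    out["scoreband"] = 10 - band_index              # highest scores -> band 1
    return out

def win_rate_by_scoreband(scored: pd.DataFrame) -> pd.Series:
    """Observed winning percentage within each scoreband."""
    return scored.groupby("scoreband")["won"].mean()
```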

The chart below shows the output of the model:

[Chart: expected winning percentage by scoreband, development sample]

Except for scoreband 2, the scorecard effectively rank orders the expected winning percentage with teams scoring in scorebands 1-5 all expected to have a winning percentage greater than 50%. Using the distribution of matches by scoreband, this model is expected to produce an overall match prediction success rate of 62.6%.

The model could also be employed more selectively, tipping only those teams that score in the very high scorebands. For example, if you only tipped in matches where one team scored in scoreband 1 (around 10% of matches), your expected success rate would be 72%.

 

Step 4: Model Validation

Two approaches were taken to validate the model.

In-Time Validation: Firstly, although there were 824 observations available for analysis, the model was developed using a randomly selected 75% sample of these observations. The remaining 25% were held aside and used to independently validate the model. The purpose of this was to confirm that the model was not overfitted to the development data and that it can be used with confidence on data from the same time period that was not used for development.
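Reusing the helper functions sketched in Step 3, and assuming the 824 team-level observations sit in a DataFrame called observations with its predictor columns listed in FEATURE_COLS (both hypothetical names), the in-time check might look like this:

```python
from sklearn.model_selection import train_test_split

# hold out a random 25% of the team-level observations for in-time validation
dev, holdout = train_test_split(observations, test_size=0.25, random_state=42)

# develop the scorecard on the 75% sample only, then freeze it
model, edges = fit_scorecard(dev, FEATURE_COLS)

# apply the frozen scorecard to both samples and compare win rates by scoreband
dev_rates = win_rate_by_scoreband(assign_scorebands(model, edges, dev, FEATURE_COLS))
val_rates = win_rate_by_scoreband(assign_scorebands(model, edges, holdout, FEATURE_COLS))
print(dev_rates, val_rates, sep="\n")
```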

The following chart shows a comparison of the win rates by scoreband for both the development and the in-time validation sample, and for both the development and in-time validation combined.

[Chart: win rate by scoreband for the development sample, the in-time validation sample, and both combined]

As would be expected (due to the lower sample size), the win rates by scoreband for the in-time validation sample form a less stable curve; however, there is still a clearly decreasing win rate as the scorebands move from 1 to 10. The in-time validation suggests we should expect an overall win rate of 61.9%, compared to 62.6% for the development sample.

Out-of-Time Validation: The second approach was to retrospectively test the model in a live environment using matches from the 2016 IPL. This validation differs from the in-time validation in that the 2016 IPL matches come from a different time period than the data used to develop the model.

When validated against the 2016 IPL, the match prediction model had an overall success rate of 62.7%, which is almost exactly in line with expectation from the model development (62.6%).
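One natural way to turn the two team scores in a match into a single tip is to back the team with the higher score; that is the assumption in the sketch below, which again reuses the Step 3 helpers (match_id and the other names are hypothetical).

```python
def tip_match(model, edges, match_rows, feature_cols):
    """Tip the higher-scoring of the two team observations for a single match."""
    scored = assign_scorebands(model, edges, match_rows, feature_cols)
    return scored.loc[scored["score"].idxmax()]

def out_of_time_success_rate(model, edges, ipl_2016, feature_cols):
    """Fraction of IPL 2016 matches where the higher-scoring team actually won."""
    tips = [tip_match(model, edges, rows, feature_cols)
            for _, rows in ipl_2016.groupby("match_id")]
    return sum(tip["won"] for tip in tips) / len(tips)
```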

When viewed by scoreband (see chart below), the rank ordering of win rate by scoreband is less pronounced, again due to the lower sample size. However, teams scoring in scorebands 1-4 clearly have a higher win rate.

[Chart: win rate by scoreband, IPL 2016 out-of-time validation]

 

Conclusion

Although the model development was limited by historical data availability, the final model output – an expected match prediction win rate of 62.6% – was validated using both in-time and out-of-time (IPL 2016) matches. For the IPL 2016, the actual win rate was 62.7%, almost exactly in line with the expectations from the model.

The model will continue to be strengthened as more data becomes available and will next be tested in the upcoming BBL.

Three final things to note.

Firstly, although CPL matches were not included in the development sample, the model was also validated on CPL 2016 matches, with an overall win success rate of 64.0%.

Secondly, the model results were compared to a less complex way of predicting the match winner: tipping the team ranked higher on the points table before the game starts. Using that approach on 2016 IPL matches resulted in a win success rate of just 45%, well below the model’s results.

Finally, the model includes tournament-to-date performance as one of its predictive variables, so a certain number of matches need to be played before it can be used. The 62.7% winning percentage for the 2016 IPL was based on 51 of the 60 matches played.

Editor’s Note (19th November, 2016)

Two additional points:

1. The model has important implications for team selection. One of the model’s predictive variables is derived from player-level historical performance. This means that, given a squad of available players, management can use the model to select the playing XI with the highest probability of winning the match, given the likely playing XI for the opposing team.
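As a rough illustration, selecting that XI can be framed as a brute-force search over candidate line-ups: a 15-player squad gives C(15, 11) = 1,365 possible XIs, each of which can be scored against the expected opposition XI. The win_probability helper below is a hypothetical wrapper around the model, not an existing function.

```python
from itertools import combinations

def best_playing_xi(squad, opposition_xi, win_probability):
    """Pick the XI from `squad` with the highest modelled chance of winning.

    win_probability(xi, opposition_xi) is a hypothetical wrapper that builds
    the team-level features for the two line-ups and returns the model score.
    """
    best_xi, best_p = None, -1.0
    for xi in combinations(squad, 11):      # C(15, 11) = 1,365 for a 15-player squad
        p = win_probability(xi, opposition_xi)
        if p > best_p:
            best_xi, best_p = xi, p
    return best_xi, best_p
```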

 

2. Is a predicted win rate of 62.6% any good? We will try to answer this question from a betting point of view.

The “worst odds” currently available from the first four games of the upcoming BBL are for Sydney Thunder (odds of 4/6) to beat the Sydney Sixers in the BBL opener. The “worst odds” are those that pay the lowest amount for a successful bet. If the favourite team had the same odds (4/6) for every match of the BBL, then we would need a 60% win success rate to break even, assuming our model also predicted the favourite to win. Therefore, even in the unrealistic scenario of every match containing a strong favourite at odds of 4/6, the model should on average deliver a small return.
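The arithmetic behind these break-even figures is worth making explicit. For fractional odds a/b, a winning bet returns a units of profit for every b staked, so the break-even win probability is b/(a+b), and the expected return per unit staked at a win rate p is p*(a+b)/b - 1. A minimal worked version of the 4/6 case, using the model's development win rate of 62.6%:

```python
def break_even_win_rate(a, b):
    """Break-even win probability at fractional odds a/b (profit a per b staked)."""
    return b / (a + b)

def expected_return(p, a, b):
    """Expected profit per unit staked at win rate p and fractional odds a/b."""
    return p * (a + b) / b - 1

print(break_even_win_rate(4, 6))      # 0.60  -> need a 60% strike rate at 4/6
print(expected_return(0.626, 4, 6))   # ~0.043 -> a small positive return at 62.6%
```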

The “best odds” currently available from the first four games of the upcoming BBL are for Melbourne Renegades (odds of 19/20) to beat the Sydney Thunder in game 3. If the favourite team had the same odds (19/20) for every match of the BBL, then we would need a 52% win success rate to break even. Therefore our model would deliver a minimum expected return of 17% on money “invested”, and higher depending on how often our model correctly predicted the underdog to win.

The “average odds” for the favourite across the first four games of the upcoming BBL are 27/33. If the favourite team had the same odds for every match of the BBL, then we would need a 56% win success rate to break even. Therefore our model would deliver a minimum expected return of 9% on money “invested”, again higher depending on how often our model correctly predicted the underdog to win.