Editor’s note: We asked Colin to go in cold: zero to starter model for the T20 World Cup running through June. This is a series that starts from nothing and builds toward a basic prediction model that we’ll (hopefully) test against the last games of the tournament later this month.
“We want you to build a cricket model. Don’t know a single thing about cricket? Great, you’re perfect.”
I feel like I’ve been chosen for the Wallfacer program. Am I missing something? Don’t they know I’ve never watched more than 5 minutes of cricket in my life? I’m vaguely aware it kind of looks like baseball if I squint hard enough, and that matches famously last days at a time, and didn’t they make a movie one time about cricket prospects trying to get converted into baseball? Then again, this is how I started off doing tennis modeling a decade ago, never having watched the sport very seriously and trying to predict it anyway. At least back then, I had a decent base modeling approach to apply to the sport I had honed across other places, but this is going to be a new challenge, starting from scratch knowing very little about the rules and no starting clue about how to model or predict the sport.
I do understand why cricket has some things going for it from a betting angle, though. It runs up against a dead spot in the sports betting calendar, where baseball is so picked over at this point that there’s not much left to activate otherwise idle bankrolls. Even if there aren’t huge edges to be found, it’s better than zero edges, as long as the return on time is worth it. I’m always a fan of adding new sports to the toolbox and, inevitably, you pick up on one or two things in each sport you model that carry over in some unexpected way to your other sports, so I can get behind this as an intellectual challenge.
My starting point is the same as any other idiot American trying to understand a predominantly foreign sport: look up the rules, and try to shape my understanding according to the sports I already know (baseball is going to be the obvious comparison here). Mercifully, the ask is to focus on T20 cricket, which has been specifically developed as a variant to induce more action and conclude faster to make it more appealing to younger audiences, so at least I don’t have to worry about multiple days of varying conditions.
After some helpful primers on the rules of cricket specifically through a baseball lens, I can at least start to identify what I think are some key similarities and differences between the two. Baseball has a ton of already applicable modelling techniques that rely on a lot of assumptions, so I want to see what techniques and approaches I can take from baseball, and what differences will require a rethink of how I might go about trying to model and predict any kind of cricket outcomes.
Similarities
Cricket can be well reconstructed from box score data.
There’s no clock in T20 cricket (technically there is, as there are penalties for not finishing batting in your allotted 75 minutes, but apparently this rarely happens, so we can basically overlook this for this iteration). Decision: I won’t account for any kind of clock-dependent situations you find in other sports (pulling the goalie when trailing, playing for the field goal win when down by 3 or less, etc.)
Cricket outcomes are approximately the sum of individual actions on each team.
One simplifying assumption about baseball is there aren’t a lot of interactive effects between team members to account for - baseball is more or less individual team members doing their own pitching, batting, fielding, and baserunning, and outcomes can be predicted fairly well by summing individual performances. The dynamics of cricket lend themselves to the same assumptions.
Pitcher vs. batter stats drive most of the dynamic; things like fielding and running are less important. Yes, cricket balls are harder to catch, which makes things like fielding a little more unpredictable, but it’s still probably a decent assumption that defense holds the same weight in cricket that it does in baseball. Absolutely a skill that some players are better at than others, but not as important as their offensive contributions. Also, since there’s less total distance to cover when running to score a run, we can likely assume we don’t have to estimate a player’s run decision making to start.
Relievers sort of exist in cricket. There are bowlers who specialize in late game situations, similar to relievers and/or closers. From the rules alone, it’s not clear if late-game bowlers have the degree of specialization relievers do (less workload, more situational, etc.), but at a minimum it’s good context to be aware of.
Accounting for weather is important. Temperature, humidity, and field conditions all have a huge impact on how the ball travels. Much like baseball, any good model will account for weather conditions. It remains to be seen which weather conditions matter most, and whether they have the same relative importance as in baseball.
Things that will be notably different:
Scoring differential doesn’t tell the same story. One of the initially confusing things when trying to get a feel from box scores is seeing the winning team’s margin described in terms of either runs or wickets. This is one of the biggest structural changes from baseball: the equivalent would be if, in baseball, the away team batted all 9 of their innings first, the teams then switched sides and the home team started batting, and if the home team scored more runs than the away team, it was an automatic walk-off win every time. (This happens to a very small degree now with the home team not batting the bottom of the 9th if they’re winning, but this is magnified to a much higher degree in cricket.) So many equations around run differential and Pythagorean expectation rely on an implicit assumption of equal opportunity rate for both offensive lineups, and that’s structurally not the case in cricket, so we’re going to have to dive a little deeper than scoring differential alone to gauge team strengths.
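For reference, the baseball-style Pythagorean expectation mentioned above can be sketched in a few lines. The function name and the default exponent of 2 (the classic baseball value) are assumptions for illustration; nothing here suggests that exponent, or the formula itself, carries over to cricket unchanged.

```python
def pythagorean_expectation(runs_scored: float, runs_allowed: float,
                            exponent: float = 2.0) -> float:
    """Estimated win fraction from total runs scored and allowed.

    Assumes both teams get the same number of scoring opportunities,
    which is exactly the assumption that is shaky in cricket.
    """
    if runs_scored < 0 or runs_allowed < 0:
        raise ValueError("run totals must be non-negative")
    scored = runs_scored ** exponent
    allowed = runs_allowed ** exponent
    if scored + allowed == 0:
        raise ValueError("at least one run total must be positive")
    return scored / (scored + allowed)


# Hypothetical baseball season: 800 runs scored, 700 allowed.
print(round(pythagorean_expectation(800, 700), 3))  # 0.566
```

The point of the sketch is the input, not the output: the formula only sees run totals, so a chasing side that stops batting the moment it passes the target feeds it a truncated number.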
“Plate appearances” have a lot more variance than in baseball. A plate appearance for a baseball hitter is restricted to a single outcome (on base or out), but a batter appearance in cricket has a whole lot more variance. You could score 50 runs before you’re out, or you could get out on the first ball (pitch). A cursory look at cricket statistics suggests this is why batters’ offensive production is described in terms of things like runs per over, but rates have their own problem in telling the whole story: a batter that produces 6 runs but only lasts one over is much more efficient, but also much less productive in the aggregate, than a batter that produces 30 runs over 7 overs. This tradeoff between efficiency and volume is something that will have to be navigated very carefully when assessing players’ performances.
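The efficiency versus volume tradeoff from the example above is easy to see with the two hypothetical batters side by side. This is only a sketch: the class and field names are made up for illustration, and it assumes whole overs, as in the example.

```python
from dataclasses import dataclass


@dataclass
class BattingSpell:
    """One batter's time at the crease (hypothetical structure)."""
    runs: int
    overs: int  # whole overs batted, as in the example above

    @property
    def runs_per_over(self) -> float:
        # Efficiency: how fast the runs came.
        return self.runs / self.overs


quick = BattingSpell(runs=6, overs=1)
steady = BattingSpell(runs=30, overs=7)

for name, spell in [("quick", quick), ("steady", steady)]:
    # Volume is spell.runs; efficiency is spell.runs_per_over.
    print(f"{name}: {spell.runs} runs at {spell.runs_per_over:.2f} per over")
# quick: 6 runs at 6.00 per over
# steady: 30 runs at 4.29 per over
```

Ranking by rate alone puts the first batter on top, and ranking by total puts the second on top, which is why any single summary number is going to need care.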
Batters’ offensive contributions have less to do with their teammates’ production. If you’re a great hitter in baseball, your RBIs depend heavily on if your teammates in front of you in the batting order already got on base. That dynamic doesn’t exist in cricket. Your runs are your runs alone, which should produce some more helpful and simplifying modelling assumptions.
Batters face a different bowler (pitcher) every 6 balls (pitches). After every 6 balls, the batter faces a different bowler and hits to the other side of the field (more on that in a bit). The equivalent of a plate appearance is not homogeneous; it can span multiple pitchers in the same appearance, which at a minimum needs to be accounted for in summary stats.
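One way to account for multiple bowlers inside a single batting appearance is to split the appearance by bowler before summarizing it. The sketch below assumes ball-by-ball records are available as (bowler, runs) pairs; the data and the function name are invented for illustration and are not from any real match or data feed.

```python
from collections import defaultdict

# Hypothetical ball-by-ball records for one batter's appearance:
# (bowler faced, runs scored off that ball).
appearance = [
    ("Bowler A", 1), ("Bowler A", 0), ("Bowler A", 4),
    ("Bowler B", 0), ("Bowler B", 6), ("Bowler B", 1), ("Bowler B", 0),
]


def split_by_bowler(balls):
    """Total balls faced and runs scored against each bowler."""
    totals = defaultdict(lambda: {"balls": 0, "runs": 0})
    for bowler, runs in balls:
        totals[bowler]["balls"] += 1
        totals[bowler]["runs"] += runs
    return dict(totals)


for bowler, line in split_by_bowler(appearance).items():
    print(bowler, line)
# Bowler A {'balls': 3, 'runs': 5}
# Bowler B {'balls': 4, 'runs': 7}
```

A single "12 runs off 7 balls" line for this appearance would hide that it came against two different bowlers.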
The field is not symmetrical. When batters switch sides, the dimensions of the field in front of them change as well. Cricket fields have what’s known as both a long boundary and a short boundary, and while it is still possible to hit effectively to the field behind you, most contact will still occur where the ball gets hit away from the batter’s front facing aim. I don’t know if long/short boundary splits are a thing, but it seems like a good thing to assume this matters until proven otherwise.
Umpires have much less influence. The equivalent of balls and strikes is a much less influential part of the game, which means accounting for the equivalent of where in the count the batter is won’t be nearly as important, to say nothing of not having to worry about more advanced concepts like catcher framing. Umpires still probably have their tendencies in the calls they do make, but at first glance, their influence appears much smaller than in baseball.
T20 cricket has a maximum number of pitches. Each side has a maximum of 20 overs, and with 6 balls in an over, that’s a maximum of 120 balls to each batting team (unless there are errors from the bowler that lead to extra balls - to be explained in more detail later). How would baseball look if you weren’t allowed to bat any more after facing 120 pitches? Batters wouldn’t get as much of a luxury of feeling out each pitcher’s style the first couple times through the order, or even within the same at-bat. I’m not quite sure how that dynamic would play out, but it’s at a minimum something probably worth accounting for.
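The ball budget described above is simple arithmetic, but it is the kind of thing a model will want as a feature. A minimal sketch, with hypothetical function names, that ignores the extra balls from bowler errors:

```python
OVERS_PER_INNINGS = 20
BALLS_PER_OVER = 6


def max_balls(overs: int = OVERS_PER_INNINGS,
              balls_per_over: int = BALLS_PER_OVER) -> int:
    """Maximum balls a batting side can face, before any extra balls."""
    return overs * balls_per_over


def balls_remaining(balls_bowled: int) -> int:
    """Balls left in the innings; extra balls are ignored (assumption)."""
    total = max_balls()
    if not 0 <= balls_bowled <= total:
        raise ValueError("balls_bowled must be between 0 and the maximum")
    return total - balls_bowled


print(max_balls())          # 120
print(balls_remaining(50))  # 70
```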
The home team has a heavy influence on the playing surface. Home field advantage is about much more than the home crowd’s influence on umpire calls. The groundskeepers for the home team can curate the playing field, with some degree of influence on how the home team would prefer it to play. It’s not clear how to account for this yet, because that requires understanding how certain players do better or worse under different playing conditions, but if home field advantage ends up being stronger than in other sports after running some numbers, at least we have a ready-made hypothesis to explain this.
You have to throw the ball back if you catch it in the stands. Originally, this point was just supposed to be a throwaway joke about the inherent offensiveness of a sport that doesn’t let you keep a souvenir. But digging into this a little more shows there’s a reason for the rule: cricket balls get deadened over the course of a match, and that deadening effect is something both the bowler and the batter get used to over time, so later batters will be using almost different equipment than the initial batters. This is additional context that’s probably important for interpreting players’ stats: how deep in the game were they? And how much of their change in performance is attributable to an older ball? Just goes to show there’s never any shortage of potential things to account for when building your model.
I’ve found that when starting a sport from scratch, there’s a drinking from the firehose feeling of having to account for every little thing, which can lead to a sort of paralysis by analysis on where to prioritize. Years of practice have taught me the wisdom of being okay with building a model you know will be bad at first, and accepting that it will take a lot of hypothesis testing, data additions, and refinements to get a model that I’ll be comfortable firing on.
Hopefully, we’ll do exactly that in the next couple of articles: build a bad model to start, and get it to a place where additional specific hypotheses can be chased down over time.