Cohort models that measure likelihoods of prospect success by passing certain thresholds of NHL games played have been around for years, and have been demonstrated to be statistically predictive. Yet they face criticism because they appear to determine success only at a binary level – either the player reaches the threshold or they don’t. They don’t tell you how good the player actually was.

However, a deeper understanding of cohort models using thresholds demonstrates that this isn’t actually accurate – higher percentages in threshold models actually correlate with higher rates of NHL production. Still, a single number lacks the nuance of potential NHL roles, and measures using games played are subject to several systemic biases.

The use of production thresholds instead of games played thresholds allows for greater nuance, even though statistically it doesn’t represent a step up in predictivity. It might, however, abate some of the criticisms by providing greater context.

Criticism of the Models Using the Games Played Threshold

Hockey prospect analysis is a tricky business. There’s no shortage of people out there trying to do it, and everyone is coming at it from a slightly different angle. That’s because, to this point, there is no clear-cut “best” way to go about it. Compared to analyzing NHL players, analyzing prospects comes with a host of problems, from limited access to video and reliable data, to the fact that it can take three to five years (or longer) to see if the prediction matches the results.

Not that these difficulties have stopped us from continuing to make attempts to improve the way we analyze prospects. Undoubtedly, the statistical variety of prospect analysis has come a very long way in the past dozen years, and the desire to move forward is always there (even if the time, for some of us, isn’t). Being at the forefront of prospect analysis, we at CanucksArmy have a pretty good idea of where improvements can be – or need to be – made, but we also get plenty of outside advice, whether it’s solicited or not.

One of our most frequently used ways to predict prospect success is the pGPS metric, and it’s probably also the one that elicits the most criticism. Much of that criticism is directed at the fact that it uses (among other things) NHL games played as an output measure of success.

On a summer edition of Sportsnet’s 31 Thoughts Podcast, Elliotte Friedman, Jeff Marek and Sam Cosentino had a conversation that revolved around measuring success by games played, and it’s safe to say that they didn’t quite have the specifics of it nailed down before they began their critique.

“You think about what determines or what defines a career,” Cosentino began. “A lot of the time, the analytics guys are saying 200 games, cause that’s when your pension kicks in.”

“I was talking to a GM about it and they said look, should 200 games be our definition of success?” adds Friedman. “If you draft a guy in the first round and he plays 200 games but he turns out to be a third pairing defenceman or a bottom six forward, should that count as a successful draft pick?”

“And that’s the thing,” Cosentino responds. “I think a determinant for a lot of the analytics guys, they say, okay that’s when your pension kicks in, so that would technically determine a career in a broad brush stroke I guess.”

There is a lot wrong with what was said here, not the least of which being the impression that an NHL pension has anything at all to do with the commonly used 200 game threshold. Before we get to where the 200 game mark came from, we need to understand why games played are used in general, and specifically how they are used.

Why We Still Use Games Played

Friedman’s criticism about using games played demonstrates a serious misunderstanding of the purpose of the threshold, and one that is certainly not unique to him. The implication is that all players who pass the threshold are the same, but that isn’t the case; it is an oversimplification of what’s actually going on.

Garret Hohl of Hockey Data Inc. laid out the logic behind using threshold based cohorts in a series of since-auto-deleted tweets that I managed to copy down (and later verified with him):

“Using probability of 200 GP is not saying that all players playing above 200 games are equivalent. The fact that not all players that play over 200 games are equal is fine, and accounted for in part.
“Prospects that have a higher percentage of their statistical cohorts making the NHL for 200 games, are also players that have a higher percentage being top-end players and a higher ceiling.
“The old thinking was that some players are “safe” picks, while others are “boom or bust”. However, when it comes to statistical cohorts, this relationship does not exist. In general, players that are safer picks (statistically speaking) also have higher ceilings.
“See, people used to think: He’s big, and plays defensive. So if he doesn’t hit his ceiling, he can play a depth role. More spots = more likely to play… That statement is true, except for the last sentence. There are more spots, but there is also a greater supply of players to fill that role. So, they have more roles to play, but are actually less likely to make the NHL.
“FYI: That doesn’t mean there are not better ways or no room for improvement. I’m just saying the way people traditionally labeled “safe” and “boom or bust” seems incorrect and that % chance at 200 GP does account for potential player value (to a degree).”

What Garret has succinctly laid out is that while the games played threshold is important to the formula, it alone is not the determining factor to success. It has more to do with what allowed that player to succeed in the first place, measured indirectly by using games played as a proxy. This is an important note that we’ll come back to soon: that which makes a player more likely to play more NHL games also makes them more likely to be more productive in those NHL games.

Where The 200 Games Played Threshold Came From

The 200 games played threshold has been around for quite a while, and there are differing potential reasons for why it originally came to be. Here you can read Jonathan Willis’ explanation behind the logic of it, which boils down essentially to this: crossing the 200 game mark doesn’t make a player a success, but it does mark a point at which you can be pretty confident in what a player is or isn’t.

That said, when it comes to using cohorts, the number itself isn’t even vitally important. The main idea is using a consistent threshold for both the cohort and the targeted player. That is, using the percentage of a cohort that played more than 200 games as the likelihood that the target player will play 200 games. You could use 100 games, 300 games, or 400 games instead, and as long as you were consistent, the percentages you were left with would actually stay fairly static relative to one another.

In fact, I’ve stopped using the 200 game threshold myself, having moved from a model that relied on career results to one that uses results within certain age ranges. Specifically, I am more interested in measuring how many games a player plays under the age of 25 – the age when most players hit free agency and are no longer under team control. The number of games that I have been targeting in this window is 100. That means that an XLS% is an expected likelihood (weighted by similarity) that the target player will suit up for more than 100 games while under team control. The idea being that a team needs to decide before this point whether or not a player is worth holding on to.

Figure 1 shows how the number of NHL games played by age 25 is related to the number of games played during prime years and the point rates achieved during prime years. Both show very high correlations, which makes logical sense. The more games a player plays while under team control, the more productive the player is likely to be in his prime.

Figure 1

How We Used Games Played

The usage of games played thresholds in cohort systems is quite a bit more complicated than it first appears to be, as it has changed over time since its introduction. Originally, PCS and pGPS assigned a percentage based strictly on thresholds met per cohort (e.g. if five of 20 statistical matches pass the games played threshold, the player is given a 25% likelihood of success).
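To make that original, unweighted calculation concrete, here is a minimal sketch in Python. It is not the actual PCS or pGPS code; the function name `cohort_success_rate` and the sample cohort are invented for illustration, and the only thing taken from the description above is the idea of counting matches that played more than the threshold.

```python
def cohort_success_rate(games_played, threshold=200):
    """Share of cohort members whose NHL games played exceed the threshold.

    games_played: list of career NHL games, one entry per statistical match.
    threshold: the games played cutoff, applied identically to every match.
    """
    if not games_played:
        raise ValueError("cohort is empty")
    passed = sum(1 for gp in games_played if gp > threshold)
    return passed / len(games_played)


# Made-up cohort of 20 matches, five of whom cleared 200 NHL games.
cohort = [450, 310, 820, 205, 233] + [0] * 10 + [12, 45, 3, 150, 199]
print(f"{cohort_success_rate(cohort):.0%}")  # 25%
```

Changing `threshold` to 100, 300, or 400 reuses the same logic, which is the sense in which the specific number matters less than applying it consistently.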

That was years ago though, and in the intervening time, I’ve tried to incorporate similarity more and more into the equation. Similarity scores, derived from multi-factor Euclidean distance equations, are now used to weight a cohort-player’s contribution to the target player’s likelihood of success. For example, if half of a player’s matches passed the games played threshold, but his statistical profile is more similar to the half that passed than the half that didn’t, the player’s weighted likelihood of success is going to be above 50%.
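The weighting idea can be sketched as follows. This is an illustration under assumptions rather than the pGPS formula: the conversion from distance to similarity (`1 / (1 + distance)`), the function names, and the two-feature profiles are all hypothetical, and the features are assumed to already be on comparable scales.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length feature profiles."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same number of features")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def similarity(a, b):
    # ASSUMPTION: this distance-to-similarity conversion is a placeholder.
    # Identical profiles score 1.0, and the score shrinks as distance grows.
    return 1.0 / (1.0 + euclidean_distance(a, b))


def weighted_success_likelihood(target, matches, threshold=100):
    """Similarity-weighted share of matches that passed the threshold.

    matches: list of (profile, nhl_games_played) tuples.
    """
    total_weight = 0.0
    passed_weight = 0.0
    for profile, games in matches:
        weight = similarity(target, profile)
        total_weight += weight
        if games > threshold:
            passed_weight += weight
    if total_weight == 0:
        raise ValueError("no matches to weight")
    return passed_weight / total_weight


# Half the matches passed, but the target looks more like the one that did.
target = (0.0, 0.0)
matches = [((0.0, 0.0), 300), ((3.0, 4.0), 10)]
print(f"{weighted_success_likelihood(target, matches):.1%}")  # 85.7%
```

An unweighted count would give this player 50%; weighting by similarity pushes the figure above that because the closer match is the one that passed.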

Above all though, the biggest reason that we continue to use games played as the outcome variable is that it’s backed by the numbers. While suggestions have been made that NHL points, time on ice, or point shares would be better measuring sticks, the fact is that a model based on a games played threshold predicts not only games played, but point rates and point shares at the NHL level better than a model based on weighted averages of those numbers. It might seem counter-intuitive, but that’s the way it is.

For the following table, I ran the pGPS formulas on CHL players from 2000 to 2010, then measured the correlations between a series of input variables (those available to junior prospects now, listed and defined below) against a series of output variables (those available after a player has played out all or most of his career in the NHL, also listed and defined below).

Table 1

We begin with unadjusted points per game, which already shows a modest positive correlation with games played (specifically under the age of 25, before free agency – essentially the period during which the player is under team control) and, to a lesser extent, with production and point shares at the NHL level (specifically during their window of statistical prime, age 21 to 28, so as not to be adversely affected by good players playing well into their 30s and showing diminishing production rates). The fact that higher scoring junior players are more likely to have NHL success is common knowledge now, but that wasn’t always the case.

From there we see a jump up in correlation and r-squared just by era and age adjusting the input numbers. This also is now common knowledge among the statistically literate: we’ve long known that adjusting for the age of junior players is vital to determining future success.

We see an even larger jump up in correlation and r-squared when using XLS%, pGPS’s Expected Likelihood of Success. However, other input measures like XPR (an average of successful matches’ NHL point production weighted by similarity) and xPS (the same as XPR but using NHL point shares instead of point production, and again weighted by similarity) are less effective at predicting games played and production rates than are weighted proportions of successful matches.
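For contrast with the proportion-based XLS%, an average-based measure like XPR might be sketched like this. The tuple layout, the function name, the use of points per game as the production unit, and the 100-game default are assumptions made for the example; the only part drawn from the description above is that successful matches’ NHL production is averaged with similarity as the weight.

```python
def expected_production_rate(matches, threshold=100):
    """Similarity-weighted average NHL production of the successful matches.

    matches: iterable of (similarity, nhl_games_played, nhl_points_per_game).
    Returns None when no match passed the games played threshold.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for sim, games, rate in matches:
        if games > threshold:  # only successful matches contribute
            weighted_sum += sim * rate
            weight_total += sim
    if weight_total == 0:
        return None
    return weighted_sum / weight_total


# Made-up matches: two successful, one who never stuck in the NHL.
sample = [(1.0, 300, 0.8), (0.5, 150, 0.2), (0.9, 20, 0.0)]
xpr = expected_production_rate(sample)
```

Note that the unsuccessful match has no influence on the result at all, which is one way an average like this carries less information than a weighted proportion of the whole cohort.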

Of note is the xVAL measure, which is a combination of XLS% and point shares, and fares slightly better than XLS% on its own.
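For readers who want to run this kind of comparison on their own data, the correlation and r-squared between one input measure and one output measure can be computed along these lines. This is a generic Pearson correlation written out by hand, not the script used for the table; the variable names and the toy numbers in the usage lines are invented.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 long")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def r_squared(xs, ys):
    """Share of variance explained by a simple linear fit of ys on xs."""
    return pearson_r(xs, ys) ** 2


# Toy numbers only: a junior-level input against an NHL-level output.
input_measure = [0.05, 0.20, 0.35, 0.60, 0.80]
games_before_25 = [0, 40, 15, 210, 180]
r = pearson_r(input_measure, games_before_25)
r2 = r_squared(input_measure, games_before_25)
```

Running the same two functions for each input column against each output column would reproduce the shape of a table like the one above, one cell at a time.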

The Biggest Challenges of Using Games Played

Despite being one of the most reliable measures going right now, models measuring success using games played thresholds are still subject to some pretty big challenges. First and foremost is bias. Relying on NHL games played as your measuring stick means that you are subject to the biases of the decision makers that help get players to the NHL level, and that is no small list. They include everyone from scouts to coaches to general managers.

As a result, certain groups of players may be unfairly disadvantaged by a cohort system with a games played threshold. Small players have only recently been afforded the same opportunities as their taller counterparts. European players are not only less preferred than otherwise equal North American players, they also have highly competitive leagues on their home continent that they can retreat to if they aren’t getting opportunities here, whereas North American players are more likely to stick it out on this side of the Atlantic and accept bouncing back and forth between the NHL and AHL.

Another issue that these models face is players whose outcomes are determined by circumstances unrelated to ability, such as career-ending injuries, debilitating personal problems, or even death.

Injuries, mental issues, and coaching biases are all huge factors in prospects' success. And frankly there's no good way to remove them to make these models work. So I think the best course of action is to say "it'll never really work" and just try and cluster players in some way

— Evan Oppenheimer (@OppenheimerEvan) September 10, 2018

In these cases, it would be preferable to remove the players from the sample and let ability alone determine the results of the players that remain. While not an impossible task, it is a highly arduous one for people like me who are working with databases of hundreds of thousands of players in their limited spare time.

If anyone has the knowledge and capacity to create a more efficient and accurate model for predicting prospect success, they should absolutely be running with it. Currently no such model exists publicly.

How We’re Moving Forward: Production Thresholds

I’ve long struggled with a way to involve production in the cohort model in a way that doesn’t sacrifice what predictivity I’ve been able to manufacture. As you can see in the table above, averages of NHL production (even those weighted by similarity) lag behind the weighted percentages of success.

So I decided to combine the two ideas, applying percentages to the likelihood of a prospect achieving thresholds of production instead of games played. The results so far have been promising. They don’t pose a significant improvement over the standard XLS%, but they do provide a wider variety of information, and they provide something that critics of these types of models say is lacking: they differentiate between players who just stick around in the league and players that actually make an impact.
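Here is a rough sketch of what applying percentages to production thresholds could look like. The tier labels other than “4th line”, the input layout, and the function name are hypothetical, and how each historical match gets assigned a tier is deliberately left outside the example; it simply takes a tier per match as given.

```python
# Ordered best to worst. ASSUMPTION: labels are placeholders for illustration.
TIERS = ["1st line", "2nd line", "3rd line", "4th line"]


def tier_likelihoods(matches):
    """Similarity-weighted chance of reaching at least each production tier.

    matches: list of (similarity, tier) tuples, where tier is one of TIERS
    or None for a match who never reached any tier.
    """
    total_weight = sum(sim for sim, _ in matches)
    if total_weight <= 0:
        raise ValueError("matches must carry positive total similarity")
    likelihoods = {}
    cumulative = 0.0
    for tier in TIERS:
        # A match at a better tier also counts as "at least" every lower one.
        cumulative += sum(sim for sim, t in matches if t == tier)
        likelihoods[f"at least {tier}"] = cumulative / total_weight
    return likelihoods


# Made-up cohort with equal similarity weights.
cohort = [(1.0, "1st line"), (1.0, "3rd line"), (1.0, "4th line"), (1.0, None)]
breakdown = tier_likelihoods(cohort)
```

In this toy cohort, `breakdown["at least 4th line"]` comes out to 0.75 and `breakdown["at least 1st line"]` to 0.25, so one player yields a range of likelihoods instead of a single number.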

Figure 2

Figure 2, like Figure 1 above, also uses the ten year sample of CHL players and their eventual NHL results.

Of note is the fact that the r-squared of the “% of at least 4th Line” measurement is roughly equivalent to that of the standard XLS% that is based on games played alone, meaning that we could essentially ditch the games played-generated metric, which might appease some of the naysayers, even if the change doesn’t come with any real predictive improvement in terms of future NHL games played or production rate.

It’s also important to note that NHL games played are still involved: a certain sample size has to be played by the players in the historical sample for their production levels to be included in the calculations, so we still haven’t reached the point where NHL games are irrelevant and biases are eradicated.

Because I know there will be questions as to how I determined the line thresholds, I’ll note that I’ll dig into these new numbers sometime later on, but for now I’ll say that they involve tiering historical samples by point production, time on ice, and point shares, while era adjusting and accounting for how many NHL teams existed in any given season.

This new addition also affords us some interesting visualization options going forward, including this pGPS breakdown that appeared in Ryan’s article on Casey Mittelstadt earlier today.

Figure 3

Final Thoughts

Prospect evaluation is a developing art. Like all forms of player evaluation, it is important to mix together data and visual analysis, and dig deep for contextual factors that could sway numbers one way or another.

The numbers side of the equation is very much a work in progress. While I can’t speak to the evaluative results of similar models, the correlations found in pGPS, while certainly positive, are in the moderate range. Their greatest achievement so far is being a significant improvement over working with raw points per game alone, but the goal always remains to push them ever higher and higher.

Of course, I’m no statistician; I’m a guy who does this on the side, mostly out of curiosity. What an NHL team could manage with greater resources and education in statistics should be well beyond this, but that doesn’t preclude publicly available data based on thresholds from being useful. It’s just important to remember that it needs to be used in conjunction with other methods, both quantitative and qualitative. In any case, I will be continuing to test and try new things in search of better results and more interesting relationships.