Overfitting, trends, and small samples


The Canucks have not won a game where Kevin Bieksa has scored a goal.

For those of you who followed a certain conversation on Twitter yesterday…

I’m a little annoyed at the focus on “The Canucks are X-0-Z when they’ve scored A goals, but are 0-Y-Z when they’ve scored B goals or fewer.” You know the one. The problem is that people attribute this to a flaw in the way the team plays games, rather than to simple sampling error.

There are a lot of reasons why records are where they are, and when you’re a third of the way through the season, some scenarios haven’t presented themselves. It’s rare that a team will win a game 1-0 or 2-1, and given the rarity of those events, should it really concern us that the Canucks have not yet won a game by either score?

Here’s a comic from XKCD on attempting to forecast elections based on electoral precedent. I’ve taken the liberty of applying this to the Canucks season thus far…

The Vancouver Canucks are undefeated!

Until they lost Game 1 to the San Jose Sharks.

The Vancouver Canucks haven’t won a game!

Until they beat Edmonton 6-2 in Game 2.

The Vancouver Canucks haven’t won consecutive games!

Until they beat Calgary in Game 3.

The Vancouver Canucks have not won without scoring the first goal!

Until New Jersey in Game 4.

The Vancouver Canucks have never lost a game after a win!

Until losing to San Jose in Game 5.

No team that isn’t the Sharks has beaten the Canucks!

But Montreal did in Game 6!

The Canucks have yet to win a road game in regulation!

But in Game 7, they won in Philadelphia.

…and they’ve conceded a goal in every single game!

But not against Buffalo.

The Canucks have yet to pick up a point against a playoff opponent!

But they did in Pittsburgh.

But they have yet to be beaten in regulation by an Eastern Conference opponent on the road!

Oh, but in Columbus they were.

The Canucks have yet to win when Roberto Luongo allows more than three goals!

But they beat New York 5-4.

Every team that has played the Canucks at least once this season has won the rematch.

But New Jersey didn’t.

The only Western teams the Canucks can beat are the ones in Alberta!

That was true, until their overtime win against St. Louis.

But can they win in regulation if they take more than three penalties? They haven’t been able to do that yet…

Just ask Washington.

The Canucks have won every game they’ve held their opponents to two or fewer goals…

But Detroit beat them 2-1 on October 30th.

The Canucks have yet to beat an Original Six team. San Jose has already beaten three of them!

Yeah, but Toronto got whipped by the Canucks on November 2nd.

The Canucks haven’t won a game without stretching a win streak to at least two games.

Then it may come as a shock that the Phoenix Coyotes beat them in a shootout that night.

Can the Canucks beat the Sharks?

They did, quite convincingly, on November 7.

Here’s a fun stat… the Canucks have picked up a point in every game there have been at least six goals scored.

Hope you didn’t try to use that statistic after the Kings beat the Canucks 5-1 that night…

No worries. The Canucks have won every game on the second half of a back-to-back.

They lost in Anaheim on November 10…

Okay, but they’ve won all their home games after coming back from a road trip thus far.

Until they lost on November 14 to the Sharks (again).

They’ve beaten every team below them in the Western Conference standings (based on where they are today [November 17])

But the 10-7-2 Stars dispatched the 11-7-3 Canucks.

The Canucks can’t lose to American goalies born in states that begin with the letter “M”…

…until Tim Thomas from Flint, Michigan, managed it.

Vancouver are so dependent on the Sedins. They can’t win when Henrik takes fewer than 20 shifts.

But they beat Columbus 6-2.

The good news is that the Canucks have picked up a point every time I’m in the building this year.

Alas, my perfect streak ended at two games, as I was in attendance to see Chicago defeat Vancouver 2-1.

The Canucks are 5-0 when Jason Garrison registers a point, provided he scored a point in the previous game as well.

Despite Jason Garrison’s efforts, the Canucks lost to Los Angeles.

The Canucks have yet to win a regulation game where they’ve been out-shot…

But they beat Ottawa 5-2 while being out-shot 39-28.

New York should be worried if the Canucks score twice. They’ve picked up a point in every game they’ve scored at least two.

Yeah, before they lost 5-2 to the Rangers.

If the Canucks don’t score more goals at even strength, they lose.

But against Carolina, the Canucks and Hurricanes both scored two even strength goals, and the Canucks won…

Anyway.

…the point is that goal posts can be set at any arbitrary point to fit a narrative. This is a process called ‘overfitting’. To use the Wikipedia definition, “overfitting occurs when a statistical model describes random error or noise instead of the underlying relationship”.
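To get a feel for how easy it is to find a "perfect" record in noise, here is a rough Python sketch. Everything in it is made up for illustration: the team is a pure coin flip, the 30-game season, the 500 attributes and the 20% rate are arbitrary choices, and the function names are my own. Each "attribute" is a random yes/no flag standing in for things like "Bieksa scored" or "I was in the building".

```python
import random


def is_perfect_split(wins, flags, min_games=3):
    """True if the flagged games are all wins or all losses,
    and there are at least min_games of them."""
    subset = [w for w, f in zip(wins, flags) if f]
    if len(subset) < min_games:
        return False
    return all(subset) or not any(subset)


def count_perfect_splits(n_games=30, n_attributes=500, rate=0.2, seed=1):
    """Simulate a coin-flip team (hypothetical) and count how many
    meaningless attributes come with a perfect record."""
    rng = random.Random(seed)
    # Every game is a 50/50 toss: there is no real pattern to discover.
    wins = [rng.random() < 0.5 for _ in range(n_games)]
    hits = 0
    for _ in range(n_attributes):
        # A random attribute that is "true" in roughly 20% of games.
        flags = [rng.random() < rate for _ in range(n_games)]
        if is_perfect_split(wins, flags):
            hits += 1
    return hits


if __name__ == "__main__":
    print(count_perfect_splits(), "of 500 random attributes look 'perfect'")
```

None of those attributes has anything to do with the results, so any that come up perfect are exactly the noise the Wikipedia definition is talking about.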

This can also be called the Black Swan Effect. The statement “all swans are white” can be immediately disproven by one observation of a swan that is not white. As it happens, NHL teams overwhelmingly win games when they score at least three goals. Since the possibility of a 2-2 tie no longer exists, all of those games artificially become “3-2” games in the Bettman NHL.

“Home Teams” in the NHL are 154-20-20 when scoring 3 or more, and 34-150-24 when scoring two or fewer. When the difference between your team and the NHL average can be met with just one or two observations, it’s not worth bringing up from any sort of analytical perspective.
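The arithmetic behind those league records is quick to check. The sketch below assumes the records are written wins, regulation losses, overtime/shootout losses, and it uses a hypothetical team with eight games of three or more goals (not the Canucks' actual record) to show how little separates "perfect" from "league average".

```python
def win_rate(wins, losses, otl):
    """Share of games won, assuming a W-L-OTL record."""
    return wins / (wins + losses + otl)


# League-wide home records quoted above.
league_3_plus = win_rate(154, 20, 20)    # scoring three or more
league_2_minus = win_rate(34, 150, 24)   # scoring two or fewer


def expected_wins(rate, games):
    """Wins a league-average team would expect over this many games."""
    return rate * games


def chance_of_perfect(rate, games):
    """Chance a league-average team wins every one of these games,
    treating games as independent (a simplifying assumption)."""
    return rate ** games


if __name__ == "__main__":
    games = 8  # hypothetical sample, chosen only for illustration
    print(f"League win rate, 3+ goals: {league_3_plus:.3f}")
    print(f"League win rate, 2 or fewer: {league_2_minus:.3f}")
    print(f"Expected wins in {games} games: {expected_wins(league_3_plus, games):.1f}")
    print(f"Chance of going {games}-0: {chance_of_perfect(league_3_plus, games):.3f}")
```

If the expected win total sits only a game or two short of a perfect record, the "undefeated when scoring three" line is telling you about the league, not about the team.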

Sample size means a lot. “X team has earned 67% of the shots in this game!” means a heck of a lot more when a team is out-shooting the other 40-20 than it does at 2-1. It’s true that over hundreds of observations, a team that picks up two of every three shots will probably wind up being the team that out-shoots its opponents, but with one or two observations there’s way more noise than signal.
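One way to put numbers on the 40-20 versus 2-1 comparison is a confidence interval on the shot share. The sketch below uses the Wilson score interval, which is my choice rather than anything from the original discussion, and it assumes each shot is an independent trial, which real hockey shots are not quite. The function name is my own.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials ** 2)) / denom
    return center - half, center + half


if __name__ == "__main__":
    # Same 67% shot share, very different amounts of evidence.
    for shots_for, shots_against in [(2, 1), (40, 20)]:
        total = shots_for + shots_against
        low, high = wilson_interval(shots_for, total)
        print(f"{shots_for}-{shots_against}: plausible share {low:.2f} to {high:.2f}")
```

At 2-1 the interval spans both sides of 50%, so the data cannot say which team is really driving play. At 40-20 the whole interval sits above 50%.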

Ultimately, you want any statistic you read off to not be perfect, because as soon as you hit that black swan, the analysis becomes flawed. One of the benefits of attempting to prognosticate teams’ future records using things like Corsi is precisely that it’s not perfect all the time, so nobody working closely with the data is ever under the illusion it’s going to work 100% of the time.

A real-world example would be when I was apartment-hunting back in April. The south-facing window only let sunlight about a quarter of the way into the living room in the morning. “That’s good,” I thought. “I like natural lighting but too much sunlight is uncomfortable.” Now, as we approach winter, I’m beginning to realize I hadn’t taken into account that the sun runs a lower trajectory across the sky now that our hemisphere has tilted away from the sun. Now on a sunny day, between 9 and 11 in the morning my couch is drenched in sunlight! The one observation I made in April was not sufficient, since I failed to take into account how that would work out in the future.

Working in absolutes can ruin you because you’re then committed to ignoring the statistic once it is no longer an absolute. There are many, many, many ways to be fearful of the Sharks, but a team’s record in games that end 2-1 is not one of them.