Here at FreeAgent we, like so many other workplaces around the world, have been adjusting to a fully remote setup over the past few weeks. Whilst a significant number of our company’s employees are permanently home-based, only one of our seven Analytics & Data Science team members is usually based away from our Edinburgh office. It has felt strange.
We decided fairly quickly that we needed to create new opportunities for the team to ‘convene’ in lieu of the facetime we’d usually get in the office. We scheduled a short daily catch-up after lunch: its primary purpose was to make sure everybody was doing okay and perhaps to discuss the weather or our snacking levels.
Sometime during the first week, I suggested a quick round of Pointless during one of these catch-ups. I’d been gifted a Pointless quiz book a couple of years before, and I thought it’d tap into the competitive nature of many members of our team. I didn’t know the half of it…
For the uninitiated, the concept of Pointless is pretty simple. Behind the scenes, 100 members of the UK public were quizzed on a range of weird and wonderful questions. For example: in 100 seconds, name as many capital cities of European countries as you can. 98 might have said London, 92 might have said Paris, 6 might have said Zagreb. The contestants on the show are then given the same question (name a capital city of a European country), but they have to give the most pointless answer they can think of – that is, the answer that the fewest of the 100 people surveyed gave. In this case, London or Paris would be bad answers, whereas Zagreb would be a good one. The person who gives the best answer is the winner.
There were four players in our first game. As any good data analyst would do, I tracked players’ responses and scores in a spreadsheet. Ipek, one of our BI analysts, won the first game. We applauded her, discussed the possibility of playing again the following day, and went about our afternoons.
The next day, we had our entire team – six players and me – for game two. Dave, our team lead, won the game with a pointless answer (an answer that none of the 100 people surveyed gave). I tracked the scores again. There was some controversy: apparently, when given a short amount of time to name as many words ending in ‘erry’ as they could, 60 out of 100 members of the UK public responded with loganberry – compared to just 28 for cranberry. Who knew the loganberry was so widely appreciated?
After our third game, two things became clear. Firstly, we were all enjoying spending 10 minutes of our afternoons doing something a bit silly and getting competitive. Secondly – and more importantly – if this were to continue, we’d need some kind of long-term scoring metric.
Measuring Pointless Things
So, who should be crowned our Pointless champion? To date, we’ve played 16 games. I’m going to explore the five metrics that we came up with to answer this question.
Metric #1: Games Won
To begin with, we started referring to the person who had won the most games as ‘the person who was doing best’. It’s a nice, simple way to measure things. After 16 games, Ipek leads the pack on Games Won. Our top 3 looks like this:
| Player | Games Won ↑ |
| --- | --- |
Metric #2: Win Rate
Sadly, not every team member is able to play every day! This means that some players have played more games than others. Games Won doesn’t account for this, and so it doesn’t reward players who have won lots of games despite not having played many. I decided to calculate a Win Rate for each player: the proportion of the games they played that they won. After 16 games, Dave knocks Ipek off the top spot, having won 38% of the games he played:
| Player | Games Played | Games Won | Win Rate ↑ |
| --- | --- | --- | --- |
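The Win Rate arithmetic is simple enough to sketch in a few lines of Python. The played/won counts below are invented for illustration – the post only tells us Dave's rate works out at 38% – so treat them as placeholders:

```python
# Win Rate: the proportion of the games a player has played that they won.
# NOTE: these played/won counts are made up for illustration, not real tallies.
players = {
    "Dave": {"played": 13, "won": 5},   # 5/13 is roughly 38%
    "Ipek": {"played": 16, "won": 5},
}

win_rates = {name: s["won"] / s["played"] for name, s in players.items()}

# Print the leaderboard, best rate first.
for name, rate in sorted(win_rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {rate:.0%}")
```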
Metric #3: Mean Rank Score
The trouble with both win-based measures is that the person coming second receives no recognition. You could come second out of six in every game and you’d still end up at the bottom of the table. To try to capture overall performance, we gave each player a Rank Score for each game. The player with the best answer is given a Rank Score of 0, the player with the worst answer is given a Rank Score of 100, and the players in between are given Rank Scores spaced evenly between 0 and 100, depending on the total number of players in that game. Over a number of games, we can then calculate the average of each player’s Rank Scores to see where they tend to end up in the rankings, from 0 to 100.
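One way to implement that Rank Score assignment – assuming the in-between players are spaced evenly from 0 to 100, and ignoring the question of ties for simplicity – looks like this:

```python
def rank_scores(scores):
    """Assign each player a Rank Score from 0 (best answer) to 100 (worst),
    spaced evenly by rank. `scores` maps player -> raw Pointless score,
    where lower is better. Ties are broken arbitrarily, a simplification."""
    ordered = sorted(scores, key=scores.get)
    n = len(ordered)
    if n == 1:
        return {ordered[0]: 0}
    step = 100 / (n - 1)
    return {player: round(i * step) for i, player in enumerate(ordered)}

# Four hypothetical players: best answer gets 0, worst gets 100,
# and the middle two land evenly in between (33 and 67).
print(rank_scores({"A": 2, "B": 15, "C": 40, "D": 88}))
```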
Let’s take the first 2 games described above as an example. In the first game Dave has the worst answer and is given a Rank Score of 100. However, in the second game he has the best answer, so he’s given a Rank Score of 0. His Mean Rank Score, after 2 games, is 50:
| Player | Game 1 answer (European capitals) | Game 1 score | Game 1 Rank Score | Game 2 answer (_erry words) | Game 2 score | Game 2 Rank Score | Mean Rank Score ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- |
Taking this approach for all 16 games played so far starts to build a nice Mean Rank Score picture in the table below:
| Player | Games Played | Mean Rank Score ↓ |
| --- | --- | --- |
Let’s try to extract some meaning from these numbers. Remembering that the median player in each game would be given a Rank Score of 50, Dave’s Mean Rank Score of 48 indicates that, on average, he just about makes the top half each game. Rob, who leads on this metric, tends to finish around the border of the top third of players each game.
Metric #4: Mean Standardised Score
Our next metric was probably our most over-engineered yet, but also a crucial one. Taking a look at the results for game 2 – the ‘erry’ words – we can see that there were two ‘clusters’ of scores. The best answers scored 0, 2 and 5, while the worst answers scored 60, 84 and 92. However, our Rank Score approach treats the players as evenly distributed – that is, it thinks a score of 5 (David’s cloudberry) sits halfway between a score of 2 (lingonberry) and a score of 60 (loganberry). But, if you take a look at the scores, it’s easy to see that cloudberry’s 5 deserves more credit than that.
Why not just take a player’s mean raw score across all their games? Well, different games have different magnitudes of scores. If somebody missed a game in which most of the possible answers were low scorers, or vice versa, they’d be immediately disadvantaged. For example, Owen and David didn’t take part in game 1, which had loads of good answers available, and therefore didn’t have the opportunity to get as low a mean score as the others. We want an approach that doesn’t consider each player to be evenly distributed, but does account for the magnitude of scores in each game. Enter Mean Standardised Score.
In a given game we want to know, in essence, how well or badly each player performed relative to the other players – we want to standardise each player’s score. This is a two-step process. Firstly, how far above or below the mean score were they? Secondly, how did that compare to how far away everyone else was? If a score was 20 above the mean but on the whole scores were close together then we want to penalise that, whereas if a score was 20 above the mean and on the whole scores were spread out then we don’t want to penalise that as heavily.
In game 1, the mean score was 4.3 and the standard deviation of those scores (how spread out they were) was 2.9. Using Dave’s score of 8 as an example, we can firstly see how far from the mean score he was by subtracting the mean from his score (8 – 4.3 = 3.7). We can then factor in how spread out the scores were on the whole by dividing by the standard deviation (3.7 / 2.9). This gives us Dave’s Standardised Score (for game 1) as 1.28, which indicates that his score was 1.28 standard deviations higher than the mean. Ipek’s Standardised Score, -1.14, indicates that her score was 1.14 standard deviations lower than the mean (‘better than average’).
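The standardisation step above is just a z-score. Here’s a minimal sketch that reproduces Dave’s 1.28 from the quoted game 1 mean and standard deviation (Ipek’s raw score isn’t stated in the post, but a score of 1 reproduces her -1.14):

```python
def standardised_score(score, mu, sigma):
    """How many standard deviations a score sits above (+) or below (-)
    the game's mean score. In Pointless terms, negative is good."""
    return (score - mu) / sigma

# Game 1: mean 4.3, standard deviation 2.9, as quoted in the post.
print(round(standardised_score(8, 4.3, 2.9), 2))  # Dave's score of 8 -> 1.28
print(round(standardised_score(1, 4.3, 2.9), 2))  # a score of 1 -> -1.14
```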
Over a number of games, we can then calculate the mean of each player’s Standardised Scores to see how their answers tend to perform relative to the mean score in each game. A Mean Standardised Score of 0 would indicate that the player tends to come close to the mean score in each game (or does badly as often as they do well), and a negative Mean Standardised Score indicates that the player tends to outperform the average. Let’s take the first 2 games described above as an example. (Note: there has been some deliberate premature rounding to simplify the example here!)
| Player | Game 1 answer (European capitals, μ = 4.3, σ = 2.9) | Game 1 score | Game 1 Standardised Score | Game 2 answer (_erry words, μ = 40.5, σ = 43.1) | Game 2 score | Game 2 Standardised Score | Mean Standardised Score ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- |
We can see that the Standardised Score given to David’s cloudberry is much better than that given to Owen’s loganberry, solving the issue with the Mean Rank Score.
Applying this to all 16 games played so far gives us yet another top 3 in the table below. David tops the board with a Mean Standardised Score of -0.43: on average, he scores half a standard deviation below the mean each game.
| Player | Games Played | Mean Standardised Score ↓ |
| --- | --- | --- |
One thing to note about the Mean Standardised Score is that it really punishes incorrect answers – which score 100 – as typically 100 will be several standard deviations higher than the mean. On the other hand, it rewards particularly good answers: if you win by a comfortable margin, it will reflect that better than the Mean Rank Score would have done.
Metric #5: Points
But, as the founder of the modern Olympics once said, the most important thing is not winning but taking part. We wanted to reward participation too, so a points-based approach was suggested. For each game they play, a player receives a number of points determined by their rank in that game: the best answer earns 5 points, the second best earns 4 points, and so on. We then add up each player’s points from all the games they’ve played, which captures both performance and participation. Our Points approach puts Lana, our web analyst, at the top of the leaderboard.
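A sketch of that points scheme, with made-up raw scores (lower is better, as in Pointless). The 5-for-first, 4-for-second ladder follows the description above; how it continues past a handful of players isn’t spelled out in the post, so this version simply floors it at zero:

```python
def points_for_game(scores, base=5):
    """Award points by rank within one game: the best answer gets `base`
    points, the second best gets base - 1, and so on, floored at zero.
    `scores` maps player -> raw score, where lower is better."""
    ordered = sorted(scores, key=scores.get)
    return {player: max(base - i, 0) for i, player in enumerate(ordered)}

# Totalling over games rewards both performance and participation.
# These raw scores are illustrative only.
game_a = points_for_game({"Lana": 3, "Rob": 20, "Dave": 55})
game_b = points_for_game({"Lana": 10, "Rob": 4})
totals = {p: game_a.get(p, 0) + game_b.get(p, 0) for p in {*game_a, *game_b}}
# Lana: 5 + 4 = 9, Rob: 4 + 5 = 9, Dave: 3 + 0 = 3.
for player, pts in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(player, pts)
```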
Rules of Measurement
So who should be crowned our Pointless champion? Well, (almost) everyone, depending on which way you look at it:
| Metric | Champion |
| --- | --- |
| Games Won | Ipek |
| Win Rate | Dave |
| Mean Rank Score | Rob |
| Mean Standardised Score | David |
| Points | Lana |
There are many other things that we haven’t considered, such as controlling for the advantage (or disadvantage) that the order of responding may have, or calculating each player’s rank amongst all the possible correct answers.
But this isn’t intended to be the definitive guide to long-term scoring in remote games of Pointless. Whilst it has been a lot of silly fun, there are a couple of important takeaways that are immediately apparent:
- Deciding how to measure something – whether that be success, failure, or some scale thereof – is not always a simple task. Some would say that you need to decide how you’ll measure the success of an activity before you undertake it, as you won’t be able to measure objectively if you wait until the activity has started. To some extent, this is true. However, as shown here, the intricacies and complications around measurement may not become apparent until you start. You can try to anticipate these complications but it won’t always be possible. It’s important to have an unbiased eye, somebody with no agenda, involved in defining those measurements.
- With enough data, and enough motives, you can often spin your numbers to tell any story you want to tell (in fact, I decided to start writing this blog when we hit a point at which our five metrics each showed a different ‘champion’ – very meta). With careful selection of the data above, five of the six players could draw the conclusion that they ‘won’ and shout it from the rooftops. If the people receiving this conclusion didn’t do so with a critical eye, it would inevitably (in the majority of cases) lead to suboptimal decision-making somewhere down the line. It is crucial that recipients of data a) understand exactly how conclusions have been drawn and b) challenge those conclusions or methodologies appropriately.
So take care whether you’re measuring things yourself, being given a measurement, or being told about a measurement somebody else has received. And if your team is finding remote work lonely, get yourself a copy of a quiz book!