You can crunch it all you like, but the answer is NOT always in the data

Hear that, 'data journalists'? Our analytics prof holds forth


Evidence-based decision making is so clearly sensible because the alternative — making random decisions based on no evidence — is so clearly ludicrous.

The “evidence” that we often use is in the form of information that we extract from raw data, often by data mining. Sadly, there has been an upsurge in the number of people who move from the perfectly sensible premise of "basing decisions on data" to the erroneous conclusion that "the answer is therefore always in the data". All you have to do is to look hard enough for it.

This strange leap of non-logic seems to apply particularly to big data; clearly the bigger the data set the more information it must contain.

In one sense the problem is not new. Jeff Jonas, in 2006, quoted this definition of data mining from ACM SIGKDD:

Data Mining, noun: "Torturing data until it confesses ... and if you torture it enough, it will confess to anything."

And WS Brown in Introducing Econometrics defines data mining as: "An unethical econometric practice of massaging and manipulating the data to obtain the desired results."

But I think the issue has become much worse recently as data has become more freely available. For example, data journalism has emerged as a field in its own right. Some journalists have taken to applying tortuous analysis to large data sets, and the results are then used to “prove” a particular point. Note the use of the word “some” in that last sentence; there are many excellent journalists who use data properly, but not all do. And it is not just my fellow scribes who are guilty; it is increasingly apparent (although less publicly so) in the commercial world.

I am not (obviously) saying that data analysis is wrong; given my day job that would be an odd stance to take. But I do want to caution against the practices of the data analysis zealots and to make the point that context is vital because (despite what the zealots appear to believe) judgement and context are a major part of any good analysis.

Data analysis has always been there, but what we need right now is a little more sanity, and a few more reality checks when we start exploring the data. So let’s have a look at why the answer is not always in the data.

People often look for information in data in order to be able to predict the future. A good topical example was this summer’s World Cup; many people tried to predict the outcome based on existing data.

Flippin' coins


Suppose I build a perfectly unbiased flipping machine and flip some completely unbiased coins in an absolutely unbiased environment.

Even if I can be sure it all remains unbiased, if I just perform a single flip there is no way I can tell you what the result will be. What I can say is that there is a 50 per cent probability of a head or a tail.

Given, say, 100 flips, I can go slightly further and tell you that the most likely outcome will be 50 heads and 50 tails. And I can also say that (apparently paradoxically) the probability of actually getting 50:50 is very low. This is simply because there are very many other possibilities (such as 49 heads : 51 tails, 51 heads : 49 tails, 48 heads : 52 tails and so on) which are almost as likely as 50:50, and the sum of those probabilities is far greater than the probability of 50:50 alone.
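You can check that paradox with a one-line binomial calculation (Python used here purely for illustration): the probability of exactly 50 heads in 100 fair flips is C(100, 50) / 2^100.

```python
from math import comb

# Probability of exactly k heads in n flips of a fair coin:
# C(n, k) equally likely orderings, each with probability (1/2)^n.
def p_exact_heads(n, k):
    return comb(n, k) / 2 ** n

p50 = p_exact_heads(100, 50)
print(f"P(exactly 50 heads in 100 flips) = {p50:.4f}")  # roughly 0.08

# The "paradox": 50:50 is the single most likely outcome, yet you
# should expect NOT to see it about 92 per cent of the time.
print(f"P(anything other than 50:50)     = {1 - p50:.4f}")
```

So the single most likely outcome is still, in itself, an unlikely event, which is exactly the point being made above.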

My point here is that with a very simple system, even when I know the original starting data very well, I still cannot predict a single result. However, I can give very definite information about multiple outcomes.

Now for something bigger...

Now, let’s go back to the World Cup. Imagine we are back in May 2014 and trying to predict the outcome. If we have no data beyond the names of the teams, we would have to assign the same probability of winning to each of the 32 teams – 1.0/32 = 0.03125. This is not entirely helpful, so we start collecting some data. What do we think might be relevant? Perhaps:

  • performance in previous World Cups
  • performance of particular players
  • weather in host country and normal weather in home countries
  • injuries

And so on.

We can analyse this and use the results to modify our predictions for each team.
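One minimal way to picture that modification step (a sketch only; the teams and relevance scores below are invented for illustration, not real analysis): start from the uniform 1/32 prior, rescale each team's probability by a score derived from the data, then renormalise so the probabilities still sum to 1.

```python
# Toy prediction update: a uniform prior rescaled by (invented)
# relevance scores. Team names and scores are illustrative only.
def update_predictions(scores):
    prior = 1.0 / len(scores)                # 1/32 in the real draw
    weighted = {team: prior * s for team, s in scores.items()}
    total = sum(weighted.values())
    # Renormalise so the adjusted probabilities sum to 1.
    return {team: w / total for team, w in weighted.items()}

scores = {"Brazil": 1.8, "Germany": 1.6, "England": 1.0, "Australia": 0.6}
for team, p in update_predictions(scores).items():
    print(f"{team}: {p:.3f}")
```

The mechanics are trivial; the hard (and judgement-laden) part is deciding which data feeds the scores in the first place.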

Note, however, that we have already started making judgements here. By the simple act of selecting the input data, we are making judgements about the context, such as: “It is obvious that the age of the players will make a difference, so let’s add that to the mix.”

But we could be wrong on two counts. The age may be irrelevant because the team managers select only the best players based on performance. Or there may be an indirect correlation.

We collect the players' dates of birth, and age does seem to be an indicator of team performance. However, it may be that the age at which a player was first signed is actually the important factor; there just happens to be an additional correlation between a player's current age and his age at first signing (perhaps most players first sign between 16 and 18 years of age).
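That indirect-correlation trap is easy to reproduce with synthetic data (a toy simulation; every number here is invented): make performance depend only on signing age, let current age merely track signing age, and current age will still correlate clearly with performance.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient, computed from first principles."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

random.seed(42)
# Most players first sign between 16 and 18.
signing_age = [random.uniform(16, 18) for _ in range(1000)]
# Performance depends ONLY on signing age (earlier is better), plus noise.
performance = [(18.5 - s) * 10 + random.gauss(0, 2) for s in signing_age]
# Current age tracks signing age (similar career stage assumed).
current_age = [s + random.gauss(8, 0.5) for s in signing_age]

print(f"corr(current age, performance) = {pearson(current_age, performance):.2f}")
```

Current age never appears in the performance formula, yet it shows a strong (negative) correlation with performance; a naive analyst would conclude that age itself is what matters.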


