If statistics were a human being, it would have been in deep therapy all of its 350-year life. The sessions might go like this:
Statistics: "Everyone hates me."
Therapist: "I'm sure it's not everyone..."
Statistics: "And they misunderstand me."
Therapist: "Sorry, I didn't quite get what you meant there..."
The problem is that statistics is misunderstood by the majority of the population and most people hate what they don't understand. Think of the well-known expressions: "Lies, damned lies and statistics" and "The government uses statistics as a drunkard uses a lamppost; more for support than for illumination."
But statistics deserves so much better, because not only has it allowed us to make informed, evidence-based decisions for centuries, it is now the bedrock upon which the current revolution in machine learning (ML) and artificial intelligence (AI) is founded. Without statistics, there can be no self-driving cars, no Siri and no Google.
The field is a relatively recent development. In 1654 two mathematicians, Blaise Pascal and Pierre de Fermat, both famous in their own right (the former for a triangle, the latter for his last theorem), worked together on a gambling problem posed by nobleman and writer Antoine Gombaud.
They developed ways of enumerating and classifying all of the possible outcomes of an event without actually having to count them. For example, consider the problem: "Is the chance of rolling two sixes with a pair of ordinary six-sided dice the same as the chance of rolling a six and a five?"
We don't have to list all 36 possible combinations and then count the ones we want – instead we can just say that the chance of two sixes is 1/6 multiplied by 1/6, which equals 1/36, or about 0.0278 (2.78 per cent). The chance of a six and a five is twice that, because either die can be the one showing the six: 2 multiplied by 1/6 multiplied by 1/6 equals 2/36, or about 0.0556 (5.56 per cent).
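For anyone who would rather check the shortcut than trust it, here is a minimal Python sketch that does it the long way, listing all 36 outcomes and counting. The helper name `chance` is my own invention for illustration, and the dice are assumed to be fair.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely ordered outcomes for two fair six-sided dice.
outcomes = list(product(range(1, 7), repeat=2))

def chance(event):
    """Probability of an event, found by counting the outcomes that satisfy it."""
    favourable = sum(1 for roll in outcomes if event(roll))
    return Fraction(favourable, len(outcomes))

# Two sixes: only the outcome (6, 6) qualifies.
two_sixes = chance(lambda roll: roll == (6, 6))

# A six and a five: both (5, 6) and (6, 5) qualify.
six_and_five = chance(lambda roll: sorted(roll) == [5, 6])

print("two sixes:", two_sixes, "=", round(float(two_sixes), 4))
print("six and five:", six_and_five, "=", round(float(six_and_five), 4))
```

The counted answers agree with the multiplication, which is the whole point: once you trust the rule, you never need to write the list out again.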
This concept of calculating probabilities was a game-changer of the first water. Around a century later, this work was extended and applied to science by influential scholar Pierre-Simon Laplace and mathematician Carl Friedrich Gauss. From then on it became increasingly popular for people to use statistical evidence rather than violence to support their arguments.
Incidentally, I have written elsewhere that the whole of graph theory and hence graph databases started with an apparently trivial problem (the Seven Bridges of Königsberg) and we have exactly the same pattern here. A whole new branch of mind-bendingly useful mathematics was started by an apparently innocuous question.
The bad news is that statistics is often perceived as a very challenging discipline, which only the nerdiest can understand. Whether this perception is encouraged by those who practice these black arts is anyone's guess (94.3 per cent of people think so), but the truth is that many of the most useful statistical techniques are very simple.
Just as an example, think about a very common problem that occurs in the commercial world. Suppose we know that the gender ratio of our customers is 1:1. We start selling a new product and, by the end of the first day, 2,262 women and 2,128 men have bought it. Clearly more women bought it but does the difference of 134 mean that there is a real tendency for more women to buy? In other words, tomorrow can we confidently expect more women to buy the product or not? What we need to know is how likely a difference of 134 (3 per cent of the day's sales) would be if nothing but chance were at work.
The statistical test we need here is called chi-squared and it is so simple that it is built into Excel as a function called CHISQ.TEST(). Given the actual sales and the 2,195 apiece we would expect from a 1:1 ratio, it gives us the answer 0.043 (4.3 per cent), which is the probability of seeing a difference at least this large purely by chance if men and women were really equally likely to buy. This is low, so there is good reason to believe that women really are more likely to buy than men.
Now, if you don't use statistics – and most people don't, even for very simple operations like this – you have to look at the numbers and make a wild guess. With statistics you know that this difference will only occur 4.3 per cent of the time by chance, so you make an evidence-based decision. You can still be wrong, but not only will you be wrong less frequently, you will know how likely you are to be wrong.
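If you don't have Excel to hand, the 4.3 per cent figure can be reproduced in a few lines of Python. This is a minimal sketch rather than a general-purpose routine: the function name `chi_squared_p_value` is hypothetical, and the shortcut it uses to turn the statistic into a probability only holds when there are exactly two categories (one degree of freedom), as there are here.

```python
from math import erfc, sqrt

def chi_squared_p_value(observed, expected):
    """Chi-squared test for exactly two categories (one degree of freedom).

    Returns the chi-squared statistic and the probability of a statistic
    at least that large if only chance were at work.
    """
    if len(observed) != 2 or len(expected) != 2:
        raise ValueError("this sketch only handles two categories")
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # With one degree of freedom the upper-tail probability of the
    # chi-squared distribution equals erfc(sqrt(statistic / 2)).
    return statistic, erfc(sqrt(statistic / 2))

women, men = 2262, 2128
total = women + men
expected = [total / 2, total / 2]  # a 1:1 ratio means 2,195 of each

statistic, p_value = chi_squared_p_value([women, men], expected)
print(f"chi-squared = {statistic:.2f}, p = {p_value:.3f}")
```

For more than two categories you would want a proper statistics library rather than this shortcut, but the principle is identical: compare what you saw with what you expected and ask how surprising the gap is.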
From here to driving fraternity
So how do we get from men:women ratios to autonomous cars? Well, all you have to do is to take the basic idea of probabilities and give intelligent human beings 350 years to work out more complicated applications. It is true that the equations become more complex but, as a non-mathematician, you can simply accept that the statisticians know what they are doing and use the tool without having to follow what the equations mean.
Autonomous cars are not intelligent, but they have to make decisions. Their sensors are constantly scanning the road ahead. Just at the edge of sensor range, a stationary blob appears on the kerb. It is approximately 2.8m tall. Humans (even with hats) of this height occur with a very low probability. The system will, for now, decide that this is not a human. As the car rolls on, if the estimate of height decreases, or the blob moves, that decision is re-evaluated. But the crucial point is that the car never "knows" either way for certain; all it can do is to estimate probabilities.
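To make the idea concrete, here is a toy Python sketch of that kind of re-evaluation using Bayes' rule. It is emphatically not how any real car works, and every number in it is an assumption I have made up for illustration: the spread of human heights, the range of heights for other roadside objects, and the 50:50 starting guess.

```python
from math import exp, pi, sqrt

# Every number below is an illustrative assumption, not real sensor or population data.
HUMAN_MEAN_M = 1.7                   # assumed typical human height in metres
HUMAN_SD_M = 0.1                     # assumed spread of human heights
OBJECT_MIN_M, OBJECT_MAX_M = 0.5, 4.0  # assumed height range of posts, signs and so on
PRIOR_HUMAN = 0.5                    # assumed starting belief that a blob is a human

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    return exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * sqrt(2 * pi))

def uniform_pdf(x, low, high):
    """Density of a uniform distribution at x."""
    return 1.0 / (high - low) if low <= x <= high else 0.0

def probability_human(height_m):
    """Bayes' rule: how likely is the blob to be a human, given its estimated height?"""
    human = PRIOR_HUMAN * normal_pdf(height_m, HUMAN_MEAN_M, HUMAN_SD_M)
    other = (1 - PRIOR_HUMAN) * uniform_pdf(height_m, OBJECT_MIN_M, OBJECT_MAX_M)
    if human + other == 0:
        return PRIOR_HUMAN  # the height tells us nothing under these assumptions
    return human / (human + other)

# The height estimate shrinks as the car gets closer and the sensors see more.
for estimate in (2.8, 2.2, 1.8):
    print(f"estimated height {estimate} m -> P(human) = {probability_human(estimate):.6f}")
```

Under these made-up assumptions the blob is almost certainly not a human at 2.8m, but the probability climbs steeply as the estimate drops towards ordinary human height. At no point does the answer become exactly zero or exactly one.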
The Google Cloud Vision API is another example. It uses ML to provide information about the content of an image, and you can test it on one of your own images.
When I tried it, Google was 99 per cent sure my photo is of a car and 88 per cent sure it is vintage – it isn't, but it's built to look vintage.
This whole reliance of ML and AI on statistics goes very deep. Almost all ML is based on data mining and all the classic data mining algorithms (clustering, decision trees and so on) are very heavily based on statistical inference.
So, if statistics is easy (and it is) and simple to use, why does it get abused? I think there are several reasons for this but the main one is that people often abuse statistics on purpose (and it isn't just politicians, but they are an excellent example). In general, they don't do the sums incorrectly. This would be too obvious. What is common, however, is to:
- Ask the question in a particular way so that the available data supports their belief system
- Discount some of the data
- Misuse/misunderstand cause and effect
There is a wonderful example of misuse that eventually passed into the English language as the phrase "eight out of ten cats" – also the name of a long-running Channel 4 quiz show hosted by Jimmy Carr. This goes back to the 1980s when Whiskas cat food was promoted using the advertising slogan: "Eight out of ten owners said their cat prefers it."
You might have thought that was vague enough to escape without challenge but, after complaints to the Advertising Standards Authority, it was rephrased as: "Eight out of ten owners who expressed a preference said their cat prefers it." You can see why the manufacturer preferred the snappy first version but it was simply a misstatement of the statistical evidence.
So, we need to bring one of those well-known sayings up to date. It is more accurate to say: "There is truth, absolute truth and good statistics." Sadly the one about politicians and drunkards is likely to remain current for the foreseeable future. ®