Hello, wrong number
Analytical skills are in big demand, so it is really important not to make the basic, common mistakes that show you up as a newbie.
For example, probability calculations are often performed on binary outcomes such as "What is the probability that a given policy holder will claim?" The result is binary because they will either claim or not. There are two very distinct ways to calculate the answer. Paradoxically, the first is the intuitively obvious one but often hard to do properly. The second is counter-intuitive but easy to do in practice. Most people choose the first and often get the wrong answer.
Just to highlight that this is a relatively common error, the following data appeared recently on the website of a company selling equipment that allows companies to screen employees for alcohol. It assumes 260 working days per year and that an employee is coming in with alcohol in their system (drunk) 12 times in a working year.
|Number of days tested per year (per 260)||Annually||Twice yearly||Monthly||Weekly||Daily|
|Probability of catching employee following a year of testing||0.02%||0.04%||0.2%||0.9%||100.00%|
You can test for alcohol and, as the table shows, the more frequently you test, the higher the chance of catching the miscreant (although you'd need to buy a machine that allows you to test every day).
All these percentages are wrong (apart from the last one) and, if you like numbers (and why are you reading if you don't?), stop reading and try the calculation for weekly testing yourself. To summarise the numbers, there are 260 working days a year and a given employee is coming into work drunk 12 times during the year. If we test 52 times a year what are the chances we will catch them?
OK, you're back with an answer that should be about 93.6 per cent (two orders of magnitude greater than the 0.9 per cent in the table).
We cannot know what logic allowed the producers of the table to arrive at the incorrect value of 0.9 per cent but we can guess and see if our guess produces the same incorrect answer. If it does, then that is probably the logic followed.
The person is drunk 12 times out of 260, which means the probability of coming in drunk on any given day is 12/260 = 0.046 (4.6 per cent). We are testing on 52 days out of 260, therefore the probability of a test on any given day is 52/260 = 0.2 (20 per cent). So, the reasoning goes, the chance of the person being drunk on a test day is 0.046 x 0.2 = 0.0092 (about 0.9 per cent).
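As a quick check on that guess, here is a minimal Python sketch that applies the same flawed multiplication to the first four columns. The number of tests per year for each column (1, 2, 12 and 52) and the function name are assumptions inferred from the column headings, not something the table's authors published.

```python
WORKING_DAYS = 260   # working days per year, as stated in the example
DRUNK_DAYS = 12      # days per year the employee arrives drunk

# Tests per year for each column; assumed from the column headings.
TESTS_PER_YEAR = {"Annually": 1, "Twice yearly": 2, "Monthly": 12, "Weekly": 52}


def flawed_catch_probability(tests: int) -> float:
    """Reproduce the guessed (incorrect) logic: P(drunk on a day) x P(tested on a day)."""
    p_drunk = DRUNK_DAYS / WORKING_DAYS
    p_tested = tests / WORKING_DAYS
    return p_drunk * p_tested


for label, tests in TESTS_PER_YEAR.items():
    print(f"{label:>12}: {flawed_catch_probability(tests):.2%}")
```

Rounded to the precision used in the table, the four results come out as the published figures, which supports the guess about how they were produced.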
If we apply this logic to the first four columns of the table, we get answers that match. The last column, testing 260 times, intuitively guarantees catching the person – hence 100 per cent. This logic sounds quite seductive but is flawed. Below is the right way to solve the problem (I have trimmed the number of decimal places simply to reduce the clutter).
We are testing 52 times a year. What are the chances that, on the first day we perform the test, the person will be sober? Well, they are drunk 12 times out of 260, meaning there are 248 days when they are sober. It may help to think of a bag containing 12 red and 248 green marbles. If you select one from the bag (perform a test), what are the chances you will pick a green marble (that they will be sober)? The answer has to be 248 out of 260, which is 248/260 = 0.95 (95 per cent).
The chance of them being caught on the first test is 1 - 0.95 = 0.05 (5 per cent), which is already five times higher than the figure quoted in the table, and we have only tested on one day; we have 51 more tests to run. For the next test, one green marble has gone from the bag, changing the probability of drawing another one slightly to 247 out of 259 = 0.95. Note that this answer looks the same because we are rounding, but it has actually decreased slightly. The cumulative probability of getting away with it twice is 0.95 x 0.95 = 0.90 (actually 0.91 if you use greater precision), which means that the chance of getting caught has risen to about 9 per cent. If we carry this on, by the time we get to 52 tests, the chance of being caught in one or more of the tests has risen to 93.6 per cent.
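The marble-by-marble calculation can be sketched in a few lines of Python. This is an illustration of the reasoning above rather than anyone's production code; the function name and its default arguments are chosen here for the example.

```python
def probability_caught(tests: int, drunk_days: int = 12, working_days: int = 260) -> float:
    """Chance of at least one positive test when test days are drawn without replacement."""
    sober_days = working_days - drunk_days
    p_never_caught = 1.0
    for already_tested in range(tests):
        # Each earlier clean test has removed one green marble (a sober day) from the bag.
        p_never_caught *= (sober_days - already_tested) / (working_days - already_tested)
    return 1.0 - p_never_caught


for tests in (1, 2, 52):
    print(f"{tests:>2} tests: {probability_caught(tests):.1%}")
```

Run at full precision, the loop gives roughly 4.6 per cent after one test, about 9 per cent after two and 93.6 per cent after 52, in line with the rounded figures in the text.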
What was wrong with the first approach? In that one we calculated the probability of the employee being drunk on any given day (0.046) and then the probability of testing on any given day (0.2). Both of these calculations are correct. By multiplying them it may sound as if we are asking: "What is the probability of being drunk on any test day throughout the year?" But we are actually calculating the answer to the question: "For one given day, what is the probability of being drunk on that day and also being tested?"
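One way to see that the product answers a per-day question is to add it up over the year. Assuming test days are chosen independently of drunk days, the sum is the expected number of positive tests, which is a count rather than a probability. The short sketch below illustrates this; the variable names are chosen here for the example.

```python
WORKING_DAYS = 260
DRUNK_DAYS = 12
TEST_DAYS = 52

# The question the flawed method actually answers: on one given day, is the
# employee both drunk and tested? (Assumes test days are picked independently
# of drunk days.)
p_drunk_and_tested_on_a_day = (DRUNK_DAYS / WORKING_DAYS) * (TEST_DAYS / WORKING_DAYS)

# Adding that per-day figure over the whole year gives the expected NUMBER of
# positive tests. It is a count, not a probability, and here it exceeds 1.
expected_positive_tests = WORKING_DAYS * p_drunk_and_tested_on_a_day

print(f"per-day probability: {p_drunk_and_tested_on_a_day:.4f}")
print(f"expected positive tests per year: {expected_positive_tests:.1f}")
```

Neither figure is the chance of being caught at least once in the year, which is the number the table claims to report.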
This is a totally different question and therefore has a totally different answer.
What was the fundamental difference between these two approaches? The incorrect approach tried to calculate the chance of the person being caught. The problem with this approach is that there are hundreds of different ways of being caught within the space of one year. You can be caught on the first test or you can be caught on the second. Or the third. Or you can be caught on the first AND third. Or the 27th and the 31st and the 46th.
This means that you have to calculate many different probabilities. This is hard to do and, as in the example above, is often done incorrectly. But there is only one way of not being caught, and that is to escape detection on every one of the 52 tests. Calculating the chance of never being caught and subtracting that from one neatly sidesteps the need to calculate all the different ways in which you can be caught.
This kind of problem is so general that you can use the gloriously named hypergeometric distribution to do the calculation for you. Both R (dhyper and phyper) and Python (scipy.stats.hypergeom in SciPy), and certainly other languages, have specific functions for doing this.
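For readers who want to see what such a function does, here is a minimal sketch of the hypergeometric calculation using only Python's standard library (math.comb needs Python 3.8 or later). The helper name hypergeom_pmf and its parameter names are made up for this example; a statistics library would normally be used instead.

```python
from math import comb


def hypergeom_pmf(k: int, population: int, successes: int, draws: int) -> float:
    """P(exactly k successes in `draws` draws without replacement)."""
    ways = comb(successes, k) * comb(population - successes, draws - k)
    return ways / comb(population, draws)


# 260 working days, 12 of them drunk days, 52 test days.
p_zero_catches = hypergeom_pmf(0, population=260, successes=12, draws=52)
p_caught = 1 - p_zero_catches

# Cross-check the hard way: add up every way of being caught (1 to 12 times).
p_caught_summed = sum(hypergeom_pmf(k, 260, 12, 52) for k in range(1, 13))

print(f"complement: {p_caught:.1%}")
print(f"summed:     {p_caught_summed:.1%}")
```

Both routes give the same answer, but the complement needs a single term while the direct sum needs twelve.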
To summarise, suppose that you want to know the probability of an event that can happen in many different ways but can fail to happen in only one way: you should calculate the probability of it not happening and subtract that from one.
Of course, you reverse the strategy if it can only happen one way but not happen in multiple ways. It is a little counter-intuitive at first because you start off by calculating the number you don't want but it is computationally much easier and saves both time and resources. More importantly, the calculation is conceptually much easier to do, meaning that you are less likely to make mistakes. ®
Hello, wrong number is The Register's three-part series looking at common errors made in analytics and machine learning.