Hello, wrong number At the heart of machine learning are patterns, and patterns are all about counting, so it's important to make sure we are counting the correct items in the correct way. Combinatorics is the branch of mathematics concerned with counting things; more specifically, all the wonderful ways you can count, arrange and manipulate finite (limited in size) sets of things. Having a grasp of basic combinatorics is vital because it underlies many wide and varied real-world situations.
Two simple concepts in combinatorics are permutations and combinations. You might have heard the words "combination" or "permutation" used in conversation, such as "how many permutations are there of winning the lottery?" or "what's the combination to the safe?" Pedantically, neither of these phrases are correct. While their misuse in spoken language is merely semantics, their confusion in analytics and machine learning is catastrophic.
Tom, Dick and Harry... or is it Sally?
Simply put, combinations and permutations are two subtly different ways of grouping and arranging things – both describe groups of items, but only one cares about ordering. Imagine you are driving your mates to the pub. Unfortunately, your car only has three passenger seats, yet you have five thirsty friends.
Combinations would describe all potential selections of three friends out of those five if you decided to pick names out of a hat. We're not concerned about the order of the names as they are pulled from the hat – if their name is selected, your lucky chum is being driven to the pub. Tom, Dick and Harry is the same combination as Dick, Tom and Harry, but is a different combination to Tom, Dick and Sally. To draw three names out of a hat of five, there are only ten combinations of names – in this small example, it's easy to write each one down to figure this out.
Permutations build on combinations by also considering the order of the items. This could be all the different ways in which three out of your five friends could enter the car via the passenger seat (you are already seated). The first person sits behind the driver, the next sits beside them in the back and the third gets to sit in the front seat. Some people are very particular about where they sit, so for them the order matters. Tom, Dick and Sally is a different permutation to Sally, Dick and Tom; especially if you're Tom and feel sick travelling in the back. There are 60 permutations altogether, and you'll notice this number is six times bigger than the ten combinations. This sixfold increase is very easy to explain; there are six different ways to line up three people. So, for each of the ten combinations there are six different orderings and so there are 60 permutations.
Yes, yes, but tell me about machine learning
So, why does this matter in analytics? Well, in our pub scenario, we could write down every possible option to find the total possibilities, but as your circle of friends expands this rapidly becomes unfeasible. For any non-trivial numbers we need a mathematical way of calculating the combinations and permutations.
Suppose you are performing clickstream analysis for a company and there is a large number of ways in which a customer can navigate through the website. Assuming your data set is large, and there are many visits to the website, you're likely to apply machine learning (ML) in your investigations. A crucial point to consider very early on is whether you are interested in customers taking specific routes through the site (permutations), or just visiting groups of pages together (combinations) because that can significantly impact the choice of ML algorithm you might use.
Let's look at both of these options but this time we'll start with permutations (where the order matters) and then move to combinations (where it doesn't).
Considering a website with 20 pages, the many different routes our customer can take through a subset of them is described by permutations (because order matters). If we want to look at the five successive pages our customer views prior to performing an action (e.g. creating or closing an account) then we can calculate how many routes there are. Spoiler alert: it's massive. To calculate this, we note there are 20 potential pages that could be visited in the first slot, leaving 19 potential pages for the second slot, 18 pages for the third slot, 17 for the fourth and 16 for the fifth.
We reduce the number of potential pages by one each time, as we want a subset of unique pages – once a page has been visited, we're assuming the customer won't backtrack and visit it again within our five-page group. Multiply the potential pages for each slot to give the total answer: 20 x 19 x 18 x 17 x 16 = 1,860,480 possible orderings. This answer is the number of ways we can pick five web pages from 20 when we care about order (i.e. we are distinguishing between 1, 2, 3, 4, 5 and 5, 4, 3, 2, 1 and so on).
Now let's suppose that we don't care about order. Well, the good news is that we already have the total number of permutations (1,860,480). If we can work out how many ways a group of five different numbers can be arranged (1, 2, 3, 4, 5; or 1, 2, 3, 5, 4; or 1, 2, 4, 5, 3 etc.), then all we have to do is divide the 1,860,480 permutations by the number of arrangements. The answer will then be the total number of possible combinations, having removed the importance of order.
Happily, arranging items is simple. You can order five items in 5! (read: "five factorial") ways, which is 5! = 5 x 4 x 3 x 2 x 1 = 120.
Then our total number of combinations becomes: (20 x 19 x 18 x 17 x 16) ÷ (5 x 4 x 3 x 2 x 1) = 1,860,480 ÷ 120 = 15,504 possible combinations.
I see significant people
So, if we care about order there are 1,860,480 options, and if we don't there are 15,504. It is clearly important to be sure which you are interested in before starting the analysis – it can help our choice of ML approach, and our understanding of the significance of the results, to avoid awkward misinterpretations.
As always, mathematics has a convenient formula for the calculations shown above and you can search the internet for it, but if you understand the principles you won't often have to. You'll frequently see combination problems denoted in the style "20C5" (read: "twenty choose five"), and permutations as "20P5" (read: "twenty permute five"). The general case of these is "nCr" and "nPr", where "n" is the total number you select from (e.g. 20 webpages) and "r" is how many you wish to select (e.g. five webpages).
The bottom line is that these two simple concepts are involved in everyday data situations, and although they may seem relatively minor they have huge significance. Get them wrong and you'll make mistakes.
A parting note for the pedants
If you're interested, you win the lottery with a combination (not a permutation) and technically you unlock a safe with a permutation (not a combination). ®
Hello, wrong number is The Register's three-part series looking at common errors made in analytics and machine learning.