Data's democratisation: Because there's no doh in Type 0
Handle me with care
There has been a slow but steady democratisation of business intelligence (BI) and data science over the years with Excel (and PowerPivot), through introduction of self-service BI and growth of R as the language of choice for statistics.
For those from a traditional programming background, Python has become the analytical language of choice – there is a library for just about any analytical problem.
This democratisation is a good thing – it has allowed those lacking the deep pockets needed to implement a full-on data warehouse (or data lake if you’re of an unstructured state of mind) to get on and analyse data, to extract the meaning and stories the data is trying to tell.
There is no doubt that the power of the desktop machine with huge amounts of memory has allowed this to happen, but it’s also because the industry has realised that it has within its ranks the making of data scientists who haven’t been trained in this sort of analytics.
These are programmers, statisticians, physicists, just about anyone who has been trained to handle numbers and can either do a bit of programming or use a spreadsheet at Ninja level.
I’ve no doubt that a lot of these analysts are doing a fantastic job, pulling data together and creating reports and figures just as management needs them.
But there is a problem. Data doesn’t stay still in a nice safe way. Data changes over time, and it changes because most business data is about people.
To put it simply, people are awkward. Just about any “fact” about a person can change – people move house; they change their name; they get a new phone number; their age changes (sadly there is no way to alter that fact!).
Even “facts” about people that you think are fixed can move – a change of gender is not uncommon; birthdates can be recorded incorrectly accidentally (or, if you’re on social media, deliberately); people’s appearance is very fluid – hair can change colour; eyes can change colour with age; body shape can alter in a few weeks (a diet of hamburgers will do that for you). Some of these attributes about your customers may be of no interest to your analytics (we can call them informational fields if you like), but some can trip you up badly if you are unaware of the consequences of changes.
Almost the canonical example of problems that changing attributes can cause is that of address moving. Suppose we are running a small mail order outfit and we keep our customers' addresses in a small database (even an Excel spreadsheet). When a customer moves we just update the address. That’s great, we can just send the goods to the new address.
Who and where were the users when they first bought our gear?
The problem comes when we want to do an historical analysis – we have no way to track where the users were when they bought goods from us. If we want to know how many widgets we sold in Liverpool in 2013 and 50 of our customers have moved away from Liverpool, then the figure we get is just wrong. It doesn’t matter if the data is stored in a spreadsheet, a PDF data lake analysed by Hadoop or Spark, or a NoSQL database – if you don’t track changes correctly, you will get incorrect answers.
And that can cost your business a lot of money. Incorrect data analysis is as bad or worse than no analysis at all. If you're analysing the data it must be for a good reason, and that reason probably has to do with profit (or loss), so if you’ve got inaccurate answers then you’re just wasting your time.
In the BI world this has been known for a long time, and these sorts of changes are called “slowly changing dimensions,” or SCD for short. I should point out that there is no such thing as a quickly changing dimension – they are all slow, no matter at what speed they change. The BI world knows how to deal with these (or at least it should).
There exist six SCDs – here are just two:
- Type 0 that never change and are very rare
- Type 1 that overwrite any data
There is an array of increasingly complex methods to track the history of changes over time, all of which use columns in a table to track the changes, or add a new history table.
Hopefully, you can spot the fatal flaw here when it comes to analysing big data (or small data in Excel): there may be no tables to add columns to.
There are ways round this problem – taking a leaf out of the BI world, you can add the equivalent of Surrogate Keys to the data as it flows through your system to track changes. These keys are generated by the programmer, most likely a hash code that will stick with that entity, even if attributes of the entity change over time.
Depending on the data you’re tracking, you might need to add a field that identifies the date range that the data is valid for. It can then be used or ignored by your analysis system as need be. These might be extra key value pairs in Hadoop or Spark, or in table storage systems (such as Amazon Redshift or Azure Data Warehouse (ADW)) using your own functions.
This is not a trivial task, but it’s not beyond the skills of most programmers. It’s not that it’s hard to do, it’s just that it’s a very easy thing for programmers to overlook. I have, for example, seen this addressed very nicely by firms like Big Consultancy on ADW using just a few lines of C# that show what can be done – if you’re aware of the problem, that is.
Data scientists – should you hire them?
There is a shortage of data scientists due to sudden demand, and it’s tempting to select people who possess a set of skills that make them look suitable for a position in data analysis. More so if they are from within your organisation – much easier and less time-consuming and expensive than venturing out on the open market.
As we’ve seen, there are hidden problems in data that might not be obvious to such an outsider, and that can have some very real consequences for your business should you avoid, ignore or be unaware of them.
For people who haven’t worked with historical data before, not only might these problems not be obvious but the solutions can be even more obscure. There will be a temptation to hand-roll your own solution, being unaware that most of these problems have been solved before.
There is no magic bullet, no technology that will take care of this for you. Only you and your business can identify an SCD and know what the business case is for handling it: do you never change it, overwrite, keep a simple history or a more complicated one?
Then find the correct solution from the existing toolbox of solutions and with a bit of work implement it in your system. It’s got to be better than getting the wrong answer. ®