Data's democratisation: Because there's no doh in Type 0

Handle me with care


There has been a slow but steady democratisation of business intelligence (BI) and data science over the years with Excel (and PowerPivot), through introduction of self-service BI and growth of R as the language of choice for statistics.

For those from a traditional programming background, Python has become the analytical language of choice – there is a library for just about any analytical problem.

This democratisation is a good thing – it has allowed those lacking the deep pockets needed to implement a full-on data warehouse (or data lake if you’re of an unstructured state of mind) to get on and analyse data, to extract the meaning and stories the data is trying to tell.

There is no doubt that the power of the desktop machine with huge amounts of memory has allowed this to happen, but it’s also because the industry has realised that it has within its ranks the making of data scientists who haven’t been trained in this sort of analytics.

These are programmers, statisticians, physicists, just about anyone who has been trained to handle numbers and can either do a bit of programming or use a spreadsheet at Ninja level.

I’ve no doubt that a lot of these analysts are doing a fantastic job, pulling data together and creating reports and figures just as management needs them.

But there is a problem. Data doesn’t stay still in a nice safe way. Data changes over time, and it changes because most business data is about people.

To put it simply, people are awkward. Just about any “fact” about a person can change – people move house; they change their name; they get a new phone number; their age changes (sadly there is no way to alter that fact!).

Even “facts” about people that you think are fixed can move – a change of gender is not uncommon; birthdates can be recorded incorrectly accidentally (or, if you’re on social media, deliberately); people’s appearance is very fluid – hair can change colour; eyes can change colour with age; body shape can alter in a few weeks (a diet of hamburgers will do that for you). Some of these attributes about your customers may be of no interest to your analytics (we can call them informational fields if you like), but some can trip you up badly if you are unaware of the consequences of changes.

Almost the canonical example of problems that changing attributes can cause is that of address moving. Suppose we are running a small mail order outfit and we keep our customers' addresses in a small database (even an Excel spreadsheet). When a customer moves we just update the address. That’s great, we can just send the goods to the new address.

Who and where were the users when they first bought our gear?

The problem comes when we want to do an historical analysis – we have no way to track where the users were when they bought goods from us. If we want to know how many widgets we sold in Liverpool in 2013 and 50 of our customers have moved away from Liverpool, then the figure we get is just wrong. It doesn’t matter if the data is stored in a spreadsheet, a PDF data lake analysed by Hadoop or Spark, or a NoSQL database – if you don’t track changes correctly, you will get incorrect answers.

And that can cost your business a lot of money. Incorrect data analysis is as bad or worse than no analysis at all. If you're analysing the data it must be for a good reason, and that reason probably has to do with profit (or loss), so if you’ve got inaccurate answers then you’re just wasting your time.

In the BI world this has been known for a long time, and these sorts of changes are called “slowly changing dimensions,” or SCD for short. I should point out that there is no such thing as a quickly changing dimension – they are all slow, no matter at what speed they change. The BI world knows how to deal with these (or at least it should).

There exist six SCDs – here are just two:

  • Type 0 that never change and are very rare
  • Type 1 that overwrite any data

There is an array of increasingly complex methods to track the history of changes over time, all of which use columns in a table to track the changes, or add a new history table.

Hopefully, you can spot the fatal flaw here when it comes to analysing big data (or small data in Excel): there may be no tables to add columns to.

There are ways round this problem – taking a leaf out of the BI world, you can add the equivalent of Surrogate Keys to the data as it flows through your system to track changes. These keys are generated by the programmer, most likely a hash code that will stick with that entity, even if attributes of the entity change over time.

Depending on the data you’re tracking, you might need to add a field that identifies the date range that the data is valid for. It can then be used or ignored by your analysis system as need be. These might be extra key value pairs in Hadoop or Spark, or in table storage systems (such as Amazon Redshift or Azure Data Warehouse (ADW)) using your own functions.

This is not a trivial task, but it’s not beyond the skills of most programmers. It’s not that it’s hard to do, it’s just that it’s a very easy thing for programmers to overlook. I have, for example, seen this addressed very nicely by firms like Big Consultancy on ADW using just a few lines of C# that show what can be done – if you’re aware of the problem, that is.

Data scientists – should you hire them?

There is a shortage of data scientists due to sudden demand, and it’s tempting to select people who possess a set of skills that make them look suitable for a position in data analysis. More so if they are from within your organisation – much easier and less time-consuming and expensive than venturing out on the open market.

As we’ve seen, there are hidden problems in data that might not be obvious to such an outsider, and that can have some very real consequences for your business should you avoid, ignore or be unaware of them.

For people who haven’t worked with historical data before, not only might these problems not be obvious but the solutions can be even more obscure. There will be a temptation to hand-roll your own solution, being unaware that most of these problems have been solved before.

There is no magic bullet, no technology that will take care of this for you. Only you and your business can identify an SCD and know what the business case is for handling it: do you never change it, overwrite, keep a simple history or a more complicated one?

Then find the correct solution from the existing toolbox of solutions and with a bit of work implement it in your system. It’s got to be better than getting the wrong answer. ®

Similar topics

Broader topics

Narrower topics


Other stories you might like

  • Cassandra vendor DataStax secures $115m investment for $1.6b valuation
    Tech stock crash fails to deter Goldman Sachs as it leads funding round in the real-time data specialist

    DataStax, the database company based on the open-source Cassandra system, has secured $115 million in funding for a $1.6 billion valuation.

    Led by the Growth Equity business within Goldman Sachs and backed by RCM Private Markets and EDB Investments, the latest round follows a strong first quarter based on the popularity of DataStax's Cassandra DBaaS Astra DB. Existing investors include Crosslink Capital, Meritech Capital Partners, OnePrime Capital, and others.

    Cassandra is a distributed, wide-column store database suited to real-time use cases such as e-commerce and inventory management, personalization and recommendations, Internet of Things-related applications, and fraud detection. It is freely available on the Apache Version 2 license, although DataStax offers managed service Astra on a subscription model.

    Continue reading
  • NoSQL player Aerospike links up with Starburst for SQL-based access to edge data
    'We’re not necessarily replacing Snowflake' is an interesting choice of words

    Aerospike, the value-key NoSQL database, has launched a collaboration with data connection vendor Starburst to offer SQL access to its datastores.

    Dubbed Aerospike SQL Powered by Starburst, the system hopes to offer data analysts and data scientists a single point of access to federated data in Aerospike using existing SQL analytic tools such as Tableau, Qlik, and Power BI. It is the first time Aerospike has offered an off-the-shelf tool to analyze its database using SQL, the ubiquitous database language.

    Aerospike was purpose-built with a highly parallelized architecture to support real-time, data-driven applications that cost-effectively scale up and out. It claims to offer predictable sub-millisecond performance up to petabyte-scale with five-nines uptime with globally distributed, strongly consistent data.

    Continue reading
  • Aerospike takes swipe at document databases JSON support
    MongoDB, Couchbase in the crosshairs as Aerospike tries to broaden use cases

    Distributed NoSQL database Aerospike has added support for JSON documents to a slew of new features included in its Database 6 release.

    The value-key database has established its niche by touting high throughput, low latency and global scalability. It is adding support for the document format in a bid to broaden use cases and take on document database specialists Couchbase and MongoDB.

    Lenley Hensarling, chief product officer, said some customers were supporting as many as 13 billion transactions per day on Aerospike. Support for JSON document models, the Java programming models, and JSONPath query would help users store, search, and better manage complex data sets and workloads.

    Continue reading
  • Cassandra 4.0 release held back after Apple engineer discovers last-minute bug
    Bid to build the most stable iteration of the columnar database has its price

    Cassandra 4.0 – the open-source distributed NoSQL database used by Apple, Netflix, and Spotify – has been delayed at the 11th hour after a developer spotted a bug in the code.

    Project contributors had committed to making the much-anticipated release the most stable yet and wanted to ensure it shipped with no known issues. But the world will have to wait a little longer for the release, previously slated for 8am BST, 19 July.

    "In preparing the 4.0 GA release, the Apache Cassandra community identified a fix to be made late Friday. Because of this, the release is being held until the fix is complete. We will share the new release time as soon as we know," a community spokesperson said.

    Continue reading
  • The first rule of NoSQL DBaaS club is: You must talk about NoSQL DBaaS club. And Couchbase is in
    Follows its customers into the cloud

    Couchbase's database-as-a-service product has hit general availability – although this is just on Amazon Web Services initially, with Microsoft Azure and Google Cloud Platform soon to follow.

    The document-oriented database will be available as a service in the public cloud providers, but also on virtual private cloud deployments on AWS, which the vendor behind the open-source system said would help customers lower operational costs compared with earlier approaches to deployment.

    Despite taking over management of the database, Couchbase offers customers some control of configuration and node-level performance. Through a single-pane-of-glass control environment, users can manage multi-cloud instances and cross-data-centre replication.

    Continue reading
  • Analyse this: Microsoft promises OLAP-OLTP 'Link' with new CosmosDB features
    Azure SQL Data Warehouse

    Ignite Microsoft has pushed out Azure Synapse Link for Cosmos DB to general availability in an effort to bring its transactional NoSQL database closer to the analytics workhorse data warehouse.

    Teased in May last year, Microsoft said the link would comprise of two main components.

    Firstly, a Cosmos DB would house a column-oriented analytical store within containers in addition to the existing row-oriented transactional store. "The analytical store is fully isolated from the transactional store such that queries over the analytical store have no impact on your transactional workloads," developers Ramnandan Krishnamurthy and Sri Chintala said in a blog post last year.

    Continue reading
  • Aerospike adds set indexing and SQL expressions to make the distributed NoSQL database more ML-friendly
    New Spark 3.0 connector will appeal to users too, analyst says

    Distributed NoSQL database Aerospike is introducing set indexes and SQL operations within expressions in the pursuit of greater machine learning efficiency via its Apache Spark 3.0 connector.

    Speaking to The Register, chief product officer Srini Srinivasan claimed the combined tweaks could help reduce the feedback cycle to improve ML models from days to hours.

    A key-value and multi-modal database, Aerospike can run on the edge to support so-called real-time decisions based on pre-existing ML models in applications such as fraud detection. It is also used to feed data back into the ML model management commonly used by data pipeline platform Apache Spark to ensure models reflect changes to data patterns in the real world.

    Continue reading

Biting the hand that feeds IT © 1998–2022