No supercomputer cash? Time for a systems squeeze
Some advice when you're trading off information for performance
Many companies have, understandably, a burning desire to learn things from their data. There's a cost and this manifests itself in one – or, frequently both – of two forms: money and time. Big data equals big storage and big processing power, and both of those equate to a financial cost.
(And yes, we could go into the idea of cloud services and cranking up processing on demand, but the point is that it still costs money and time to set up.) Now, big storage is something you can't really avoid paying for if you want to hold loads of historical data (though we'll touch on this again later).
So let's look at five ways you can go about increasing the amount of information you can squeeze out of your systems without simply writing a six-figure cheque and buying your county's biggest ever supercomputer.
1. Improving the input
I generally try to avoid cliches like the plague, but the old adage is true: Garbage In, Garbage Out. I also hate rolling out naff, overused quotations but as businessman John Wanamaker is said to have uttered: “Half the money I spend on advertising is wasted; the trouble is – I don't know which half.”
Can you improve the quality of the data you're reporting on? Are you using appropriate cookies to identify repeat visitors to your website? Are you able to hook a collection of web sessions up to a real customer when they phone you or walk into your shop?
Some big companies make a crap job of exploiting this potential for data gathering. My bank, for instance, asks me to type my debit card number on the phone keypad, and then when someone answers the call a minute or two later this information's entirely disregarded and I have to identify myself again.
My favourite example of doing it right is a company I worked with some years ago. While businesses were looking at trying to use the Caller ID of incoming phone calls (not all that reliable if people are calling from work or from their mum's house) these guys instead used the dialled digits.
They rented a thousand phone numbers and put each one on a different advertisement and brochure; the call centre system then kept a record of the number dialled for each call. Stats for advertisement response were bang on.
Interestingly, they did an exercise in which they didn't tell the call centre staff that this number matching was happening, and asked them to question the customer about where they saw the ad then select the answer in the CRM system GUI. On a good day the latter was 28 per cent wrong and the worst was 44 per cent wrong: the cost of renting and administering the phone numbers was well worth it.
2. Employ (or at least rent) a DBA
If you're storing big chunks of data, use the tools your database software gives you to maximise the performance. Remember, you're not the only person whose database has several million rows in loads of its tables – the Oracles and Microsofts of this world have learned over the years how to construct features that make the database more efficient as the volume of data grows like mad.
Partitioning is one of the favourite approaches, because it lets you hold a database with vast amounts of information but avoid crawling over the entire data set when you do a query.
Say you like to do lots of number crunching where you look in different calendar years and then compare the results: partition the database based on the date field you're looking at and it'll only search the partition(s) containing the year(s) you're querying on.
Or if you have super-wide tables but you only want to report on a handful of columns you can partition such that the lesser-used columns are kept separate. Partitioning isn't a miracle cure, because it doesn't give you a magic way of slicing and dicing data in multiple ways at will, but it's useful nonetheless.
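As a concrete sketch, here's roughly what date-based range partitioning looks like in MySQL; the sales table and its columns are hypothetical, and other engines (Oracle, SQL Server, PostgreSQL) have their own but broadly similar syntax:

  -- Hypothetical sales table, range-partitioned by calendar year.
  -- In MySQL every unique key must include the partitioning column,
  -- hence the composite primary key.
  CREATE TABLE sales (
      sale_id     BIGINT        NOT NULL,
      customer_id BIGINT        NOT NULL,
      sale_date   DATE          NOT NULL,
      amount      DECIMAL(10,2),
      PRIMARY KEY (sale_id, sale_date)
  )
  PARTITION BY RANGE (YEAR(sale_date)) (
      PARTITION p2019 VALUES LESS THAN (2020),
      PARTITION p2020 VALUES LESS THAN (2021),
      PARTITION p2021 VALUES LESS THAN (2022),
      PARTITION pmax  VALUES LESS THAN MAXVALUE
  );

  -- A query that filters on the partitioning column only scans the relevant partition(s):
  SELECT SUM(amount)
  FROM   sales
  WHERE  sale_date >= '2020-01-01' AND sale_date < '2021-01-01';

Most engines will also show you which partitions a query actually touched (MySQL's EXPLAIN output includes a partitions column, for example), which is a handy sanity check that the pruning is really happening.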
The other favourite is to actually analyse the queries you're using. In these days of Mr. Google there's no reason why you can't, for instance, run an automated weekly report of the top five long-running queries – which you can then analyse and hopefully modify to make more efficient.
If your database engine's query analyser isn't second nature to you, spend some time with it: it's easy to miss indexes by unwittingly writing table joins sub-optimally, but most query analysers will draw you a comprehensible picture of what's efficient and what isn't.
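By way of illustration, on MySQL a weekly "top five" report can be driven straight from the performance schema – the aliases and the choice of ranking column below are just one way of slicing it:

  -- Top five statement patterns by total execution time (MySQL performance schema).
  -- Timer columns are reported in picoseconds, hence the division by 1e12.
  SELECT digest_text                      AS query_pattern,
         count_star                       AS executions,
         ROUND(sum_timer_wait / 1e12, 2)  AS total_seconds,
         ROUND(avg_timer_wait / 1e12, 4)  AS avg_seconds
  FROM   performance_schema.events_statements_summary_by_digest
  ORDER  BY sum_timer_wait DESC
  LIMIT  5;
  -- Feed each offender through EXPLAIN to see join order, index usage and row estimates.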
And if you can't do it yourself, get someone else to do it for you. I've been successful over the years by engaging super-qualified database administrators for a week or two per month – they're costly, yes, but you don't always need them to be full-timers.
3. Denormalisation
When they teach you relational databases at database school they drum into you the concept of normalisation. Techopedia describes normalisation as: “the process of reorganizing data in a database so that it meets two basic requirements: (1) There is no redundancy of data (all data is stored in only one place), and (2) data dependencies are logical (all related data items are stored together)”. And it's a great way to work, because it means that if you want to change a piece of data you know you only have to change it in one place because everything that needs it references it instead of having its own copy.
The problem is that normalisation sometimes makes things less efficient. If you have a fully normalised database you can be sure that everything's efficiently stored, but you can't be sure that it's easy (or, more accurately, efficient) to query in every case.
I worked with a database a few years ago in a CRM/sales system. It had some socking big tables (fifty million or so rows in some cases) and although it coped very well, it wasn't particularly efficient in a few cases – primarily when looking up customer history information. The reason was pretty obvious: we had queries that were joining million-row tables with ten-million-row tables because to get from the customer ID to the booking details you had to go through a couple of intermediate tables.
The solution was simple: we denormalised – ever so slightly. We just dropped a copy of the customer ID into the booking table. A big four-table join became a small two-table one because we didn't have the inefficient intermediate step. It wasn't a problem because we documented it thoroughly and we ensured that integrity constraints and triggers kept the copy of the customer ID in tune with the master. A small change made a big improvement and everyone was happy.
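The table and column names below are hypothetical rather than those of the actual system, but in MySQL-flavoured SQL the change looked roughly like this – copy the customer ID onto the booking, add a foreign key, and let a trigger populate it on insert:

  -- Assume booking previously reached customer only via an intermediate reservation table.
  ALTER TABLE booking ADD COLUMN customer_id BIGINT;

  -- Referential integrity on the denormalised copy.
  ALTER TABLE booking
    ADD CONSTRAINT fk_booking_customer
    FOREIGN KEY (customer_id) REFERENCES customer (customer_id);

  -- Keep the copy in step with the master as new bookings arrive.
  -- (Existing rows need a one-off backfill UPDATE before the new column is trusted.)
  CREATE TRIGGER booking_set_customer
  BEFORE INSERT ON booking
  FOR EACH ROW
    SET NEW.customer_id = (SELECT r.customer_id
                           FROM   reservation r
                           WHERE  r.reservation_id = NEW.reservation_id);

  -- The old four-table join collapses to a direct lookup:
  SELECT b.* FROM booking b WHERE b.customer_id = 12345;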
I wouldn't advocate going too bonkers with denormalisation, of course, because you'll soon have HSDS*, but used in moderation the benefit can outshine the effort.
4. Read-only copies for reporting
I could have put this one first rather than fourth, because it's something that many companies have been doing for years, but the first three items are relevant to a variety of situations involving information retrieval whereas this one is reporting-specific. The problem with running reports on your production database is simple: there's a performance hit that you can guarantee is impacting production operation (because when a CPU cycle or byte of RAM is being used for reporting, it's not being used for production).
Running a read-only replica of your database is usually pretty straightforward. Back in the day I could pretty much set up MySQL replication in my sleep, and it's not hard with most database systems.
The common mistake is to buy a relatively puny (= cheaper than the production one) server for the reporting engine, which is a daft thing to do for two reasons. First, I thought you wanted to be able to run reports on your big scary database at a sensible speed. Second, many database replication engines send not the data but the INSERT/UPDATE/DELETE queries from master to slave – which means the replica has to process those queries against the same large dataset as the master. A server that's slower than the production one may not be able to keep up.
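For what it's worth, the classic MySQL master/slave setup goes something like this (older, pre-8.0 syntax; host names, user and password are placeholders, and you'd seed the replica with a snapshot of the data first):

  -- On the master (my.cnf needs server-id and log-bin set, then a restart):
  CREATE USER 'repl'@'replica-host' IDENTIFIED BY 'secret';
  GRANT REPLICATION SLAVE ON *.* TO 'repl'@'replica-host';
  SHOW MASTER STATUS;   -- note the binary log file name and position

  -- On the replica (its own, different server-id in my.cnf; read_only = ON is a good idea):
  CHANGE MASTER TO
    MASTER_HOST     = 'master-host',
    MASTER_USER     = 'repl',
    MASTER_PASSWORD = 'secret',
    MASTER_LOG_FILE = 'mysql-bin.000001',   -- use the file and position reported
    MASTER_LOG_POS  = 4;                    -- by SHOW MASTER STATUS on the master
  START SLAVE;
  SHOW SLAVE STATUS\G   -- Slave_IO_Running and Slave_SQL_Running should both say 'Yes'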
With a read-only copy of the data, you can query to your heart's content in the knowledge that you won't be slowing production down by soaking up system power and/or locking tables for minutes on end.
5. Summary data generation
The final thing I'll talk about here is a mechanism that you can potentially use as a slightly lateral-thinking approach to making your information access more speedy. We've talked about four ways to work with large data sets, but an alternative is of course to reduce the size of the data set. Now, simply chopping off old data won't really be a solution if you want to be able to report on changes over the course of many years' history. But ask yourself the question of whether you need to keep the full detail.
Look at the suite of reports you run. Have they changed radically in the last year or so? Do you envisage any significant changes (specifically requirements for more detail or extra fields)? If not, can you:
(a) simply prune the detail that none of the reports actually use, and/or (b) consolidate detail into summary information that's much smaller but just as easy to use in reports?
If you only report on customer counts and spends split by date, country and product category, then you may well be able to summarise at that level and then archive the records of every single web session your customers ever carried out and every phone call they ever made to you.
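As a sketch – the table and column names are made up – the roll-up might be as simple as one GROUP BY that the reports then point at:

  -- Hypothetical detail table rolled up to one row per date/country/category.
  CREATE TABLE daily_sales_summary AS
  SELECT sale_date,
         country,
         product_category,
         COUNT(DISTINCT customer_id) AS customer_count,
         SUM(amount)                 AS total_spend
  FROM   sales
  GROUP  BY sale_date, country, product_category;

  -- In practice you'd schedule a job that appends each day's rows rather than
  -- rebuilding the table, and only then archive and prune the raw detail.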
The problem with summarising data and binning the original comes when some bright spark decides 18 months down the line that they want some more reports that use more than you've made available in your trimmed and/or summarised tables. But you can mitigate this by keeping an offline archive of the full-fat version of the old data, so that if you really need to you can make it accessible somewhere.
Oh, and I mentioned earlier that I'd touch on reductions in storage requirements – as you've probably guessed, this is that touch: if you can reduce and/or summarise 2TB of data down to 200MB and then prune the original, that's going to slow your need for storage expansion.
Summary
There, then, are your five steps to getting the information you need at an acceptable level of performance:
1. Don't start with crap data and ruin your chances of sensible reports: do what you can to maximise the value of what goes in.
2. Use what the database vendor gave you, even if it involves employing a professional to milk the performance out of the system.
3. Have the courage to take a step or two outside what the design textbook tells you – it often makes sense to do so.
4. Use a read-only, non-production copy of the data on a separate server for reporting – you know then that you can do data extracts to your heart's content without impacting production.
5. Prune and summarise the data if you can, to keep unnecessary content to a minimum.
* Hideous Spaghetti Database Syndrome