This article is more than 1 year old
What your database needs is a good thermometer
Data blows hot and cold
In the very early days of databases, when they contained relatively small quantities of information, the distinction between frequencies of access was pretty immaterial.
Now that we collect and store data in massive quantities, however, the distinction is becoming increasingly important. And apposite terms have evolved to describe this facet of data: "Hot" data is accessed frequently, and "cold" is accessed less frequently.
In almost any business system this year's data will be considered hot, because it's accessed frequently and at the transactional level. People, for example, will want to know things like "What did we sell to Customer X last week?" and "Have we fulfilled Customer Y's order placed yesterday?"
Much of the data held in a warehouse is cold - the bulk of its content is likely to be the transactional data from, say, the previous four years - with a much smaller quantity of very hot data from the current year.
Broadly speaking, there are three classes of disk that serve us, which you can see in the table below. The price of one of them - Solid State Drives - is, as I wrote here, plummeting. In that earlier article, I also discussed the use of SSDs in databases and, as may already be apparent, the concept of hot and cold data is particularly applicable to data warehouses.
Disk Type | Power consumption | Capacity | Speed | Cost |
---|---|---|---|---|
Slow hard disks (5K rpm) | 6 – 8 watts | High | Slow | Cheap |
Fast hard disks (15K rpm) | 16 watts | Medium | Fast | Expensive |
SSD (Solid State Disks) | 150mW | Low | Blistering | Painfully expensive |
But surely any DBA with any sense will already be partitioning tables, putting frequently accessed data onto fast media and less frequently accessed data onto slow and so on. In fact, we've been handling data in this way for many years, archiving older material onto tape and so on. So, why the sudden interest in data temperature?
The answer is that these two factors - the increasing diversity in the heat of the data, combined with the changing storage media - mean that the management of the data is becoming more complex. Slightly more accurately, it means that we can gain significant performance advantages if we are prepared to manage the data in more complex ways.
For example, the obvious approach is to put the cold data on slow cheap disks and, ignoring cost implications, the hot on SSDs. What may be less obvious is a counter-intuitive argument for putting some hot data on slow disks. A slow disk holding mainly cold data should, by definition, spend much of its time in an underworked state.
If you add a little hot data to the disk the chances are that when a request for it arrives, the disk won't be busy doing anything else and the hot data will be delivered rapidly and, if not piping hot, at least moderately warm. Neat.
Essentially it is a matter of bottlenecking, and the bottleneck on a slow disk for archived data may sometimes be zero. You are simply balancing probabilities to take advantage of this fact: It won't work every time but in practice it is found to improve performance rather than degrade it.
And once you have decided to put hot data on a slower disk, it makes sense to put that data onto the outer sectors of the disk because data there can often be read off the disk faster. And if you are, by now, thinking "Whoa, this may be theoretically correct but it's way too difficult to manage in practice," you're right.
By understanding that data has temperature and disks vary in performance, we can do a great deal manually but in practice the problem is that the temperature of the data changes over time and the fine-tuning is so complex that it is unlikely to be done in practice.
This is exactly the kind of situation that is far better managed under algorithmic control, which can determine the positioning of data depending on its actual usage. In fact, this way of handling hot and cold data sits extremely well with handling data as Massively Parallel Processing (MPP).
And this is already being done. Teradata, for example, is way ahead of the curve in this area: The data is monitored for temperature and automatically moved to the optimal storage location. Teradata has also been interested in SSDs for years.
Teradata's director or platform and solutions marketing for EMEA Martin Wilcox said: "Not all database engines are created equal when it comes to data access. In contrast to most other engines, Teradata has always been optimized for random I/O patterns, rather than assuming sequential access, and is therefore inherently capable of immediately exploiting all of the advantages that SSDs have to offer."
Teradata has pushed the boundaries of SSD use even further. At its conference a year ago the company demonstrated a data warehouse system running entirely on SSDs and was using it to evaluate performance and cost benefits - and the price of SSDs has done nothing but plummet since then.
Thinking about data temperature can help us to tune our databases better, and badgering your database engine supplier to think about adding it as a feature is even better. ®