Databases

This article is more than 1 year old

What your database needs is a good thermometer

Data blows hot and cold

Mon 14 Sep 2009 // 03:16 UTC

In the very early days of databases, when they contained relatively small quantities of information, the distinction between frequencies of access was pretty immaterial.

Now that we collect and store data in massive quantities, however, the distinction is becoming increasingly important. And apposite terms have evolved to describe this facet of data: "Hot" data is accessed frequently, and "cold" is accessed less frequently.

In almost any business system this year's data will be considered hot, because it's accessed frequently and at the transactional level. People, for example, will want to know things like "What did we sell to Customer X last week?" and "Have we fulfilled Customer Y's order placed yesterday?"

Much of the data held in a warehouse is cold - the bulk of its content is likely to be the transactional data from, say, the previous four years - with a much smaller quantity of very hot data from the current year.

Broadly speaking, there are three classes of disk that serve us, which you can see in the table below. The price of one of them - Solid State Drives - is, as I wrote here, plummeting. In that earlier article, I also discussed the use of SSDs in databases and, as may already be apparent, the concept of hot and cold data is particularly applicable to data warehouses.

Disk Type	Power consumption	Capacity	Speed	Cost
Slow hard disks (5K rpm)	6 – 8 watts	High	Slow	Cheap
Fast hard disks (15K rpm)	16 watts	Medium	Fast	Expensive
SSD (Solid State Disks)	150mW	Low	Blistering	Painfully expensive

But surely any DBA with any sense will already be partitioning tables, putting frequently accessed data onto fast media and less frequently accessed data onto slow and so on. In fact, we've been handling data in this way for many years, archiving older material onto tape and so on. So, why the sudden interest in data temperature?

The answer is that these two factors - the increasing diversity in the heat of the data, combined with the changing storage media - mean that the management of the data is becoming more complex. Slightly more accurately, it means that we can gain significant performance advantages if we are prepared to manage the data in more complex ways.

For example, the obvious approach is to put the cold data on slow cheap disks and, ignoring cost implications, the hot on SSDs. What may be less obvious is a counter-intuitive argument for putting some hot data on slow disks. A slow disk holding mainly cold data should, by definition, spend much of its time in an underworked state.

If you add a little hot data to the disk the chances are that when a request for it arrives, the disk won't be busy doing anything else and the hot data will be delivered rapidly and, if not piping hot, at least moderately warm. Neat.

Essentially it is a matter of bottlenecking, and the bottleneck on a slow disk for archived data may sometimes be zero. You are simply balancing probabilities to take advantage of this fact: It won't work every time but in practice it is found to improve performance rather than degrade it.

And once you have decided to put hot data on a slower disk, it makes sense to put that data onto the outer sectors of the disk because data there can often be read off the disk faster. And if you are, by now, thinking "Whoa, this may be theoretically correct but it's way too difficult to manage in practice," you're right.

By understanding that data has temperature and disks vary in performance, we can do a great deal manually but in practice the problem is that the temperature of the data changes over time and the fine-tuning is so complex that it is unlikely to be done in practice.

This is exactly the kind of situation that is far better managed under algorithmic control, which can determine the positioning of data depending on its actual usage. In fact, this way of handling hot and cold data sits extremely well with handling data as Massively Parallel Processing (MPP).

And this is already being done. Teradata, for example, is way ahead of the curve in this area: The data is monitored for temperature and automatically moved to the optimal storage location. Teradata has also been interested in SSDs for years.

Teradata's director or platform and solutions marketing for EMEA Martin Wilcox said: "Not all database engines are created equal when it comes to data access. In contrast to most other engines, Teradata has always been optimized for random I/O patterns, rather than assuming sequential access, and is therefore inherently capable of immediately exploiting all of the advantages that SSDs have to offer."

Teradata has pushed the boundaries of SSD use even further. At its conference a year ago the company demonstrated a data warehouse system running entirely on SSDs and was using it to evaluate performance and cost benefits - and the price of SSDs has done nothing but plummet since then.

Thinking about data temperature can help us to tune our databases better, and badgering your database engine supplier to think about adding it as a feature is even better. ®

Topics

Special Features

Vendor Voice

Resources

Databases

What your database needs is a good thermometer

Data blows hot and cold

More about

More about

Narrower topics

Broader topics

More about

More about

More about

Narrower topics

Broader topics

TIP US OFF

Other stories you might like

AWS must pay $525M to cloud storage patent holder, says jury

Linux Foundation marshals support for open source alternative to Redis

Progress outbids private equity in offer for MariaDB plc

Protecting distributed branch office environments from ransomware

AI boom is boosting demand even for HDDs, raising prices by up to 20% since Q3

San Francisco's light rail to upgrade from floppy disks

Samsung enterprise SSD prices skyrocket thanks to AI's appetite for storage

PostgreSQL pioneer's latest brainchild promises time travel to dodge ransomware

RISC-V PCIe 5 SSD controller for the rest of us hits 14GB/s

Beijing issues list of approved CPUs – with no Intel or AMD

Whistleblower raises alarm over UK Nursing and Midwifery Council's DB

Nutanix catapults IP theft sueball at DBaaS startup Tessell

About Us

Our Websites

Your Privacy