All-flash storage or will you settle for hybrid? How to decide

Don't let your cache thrash

Remember thrashing? Back in the early days of server virtual memory systems, the amount of RAM was so limited that the operating system spent most of its time paging data to and from its disks, leaving little or no time for processing applications.

It happened because there was a gross mismatch between the number of applications, their working sets – their own code plus the data they have to work on – and the available amount of memory.

These days RAM is more abundant and we rarely hear of this problem. But thrashing is still a concern wherever the amount of fast and expensive computer store is limited and its contents are fetched from a much larger, much cheaper and also much slower store – as may be the case with hybrid flash/disk arrays.

You cannot afford to over-provision the fast store as that costs too much, but nor can you afford to under-provision it, as application runtime can be extended to grotesque levels.

Bearing this in mind, let’s examine disk arrays, all-flash arrays and hybrid flash/disk arrays.

Split seconds

Disk arrays hold vast amounts of data and respond to requests to read it or write new data at millisecond speeds – thousandths of a second.

This data has to be read into or written from a server’s memory, a place in which data access happens at nanosecond – billionths of a second – levels. It is a million-fold (10⁶) disparity in data access time, which has become an increasing concern as servers bring more CPU resources to bear on applications with multi-core, multi-socket processors.

CPU wait time for disk IO is becoming untenable and flash memory is being used as an intermediate faster-than-disk access store between disk arrays and servers.

Flash data access latency is in the microsecond area – millionths of a second – roughly a thousand times faster than disk.

Small amounts of flash are used in servers as a cache or as direct-attached SSD storage. Larger amounts of flash are being used in storage arrays directly.

There is a dependency here on data working set size and the number of such sets. If a few of them fit inside a PCIe flash card or a few SSDs, then things are good. But if they are multiple terabytes in size and there are many of them, then fitting them all into a single server is not feasible.

Count the cost

All-flash arrays are one answer. Offerings from startup suppliers such as Pure Storage, SolidFire and Tegile can have 50TB or more of raw capacity, 1500TB-plus after deduplication and compression. They feed random access data to servers much, much faster than a disk array, albeit at a higher cost.

They can give a per-GB cost of less than $5 after deduplication and compression. This is not an absolute number, as the effective capacity a customer actually gets depends on the amount of repeated data in their stored information.
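
As a rough illustration of how effective capacity and cost per GB fall out of a data-reduction ratio, here is a back-of-envelope sketch in Python. The raw capacity, 4:1 reduction ratio and array price in it are assumed figures for illustration, not vendor numbers.

```python
# Back-of-envelope effective capacity and cost per effective GB for a flash array.
# Raw capacity, data-reduction ratio and array price below are illustrative
# assumptions, not vendor figures.

def effective_cost_per_gb(raw_tb, reduction_ratio, array_price_usd):
    """Return (effective capacity in TB, cost per effective GB in USD)."""
    effective_tb = raw_tb * reduction_ratio
    cost_per_gb = array_price_usd / (effective_tb * 1000)  # using 1TB = 1,000GB
    return effective_tb, cost_per_gb

if __name__ == "__main__":
    # 50TB raw, assumed 4:1 dedupe/compression, assumed $500,000 array price
    eff_tb, dollars_per_gb = effective_cost_per_gb(50, 4.0, 500_000)
    print(f"Effective capacity: {eff_tb:.0f}TB, cost: ${dollars_per_gb:.2f}/GB")
```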

Given that disk arrays are just too slow, hybrid flash/disk arrays look like a natural alternative, with near all-flash performance at near all-disk prices.

High-access rate data is stored in flash, with so-called cooler, or low-access rate, data stored on disk. The flash can be treated as a cache, with needed data moved in dynamically, over-writing data that no longer needs to reside in flash.

This can be a read cache and/or a write cache.

If flash is treated as a tier of storage instead of a temporary cache, then both reads and writes are accelerated.

In general this is all very well, but you need enough flash to avoid a recurrence of thrashing. If a flash miss occurs – the data wanted is not in the cache – it has to be fetched from disk, and that means an IO wait. If this happens a lot for any working set then that set may need to be refreshed and reloaded.

If there is a restricted amount of flash, the flash-miss rate will increase, the working set refresh rate will rise, and the array will spend a lot of time moving working sets in and out of flash as they are needed by applications executing in the attached servers.
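
To see why a too-small flash cache hurts, here is a toy simulation of an LRU (least-recently-used) cache: as the flash capacity shrinks relative to the working set, the miss rate climbs and more requests fall through to disk. The block counts and uniform random access pattern are invented purely for illustration.

```python
# Toy LRU cache simulation: shrinking "flash" relative to the working set
# drives up the miss rate, and every miss is a disk IO wait.
import random
from collections import OrderedDict

def miss_rate(cache_blocks, working_set_blocks, accesses=100_000, seed=1):
    rng = random.Random(seed)
    cache = OrderedDict()          # keys are block IDs, kept in LRU order
    misses = 0
    for _ in range(accesses):
        block = rng.randrange(working_set_blocks)   # random access across the set
        if block in cache:
            cache.move_to_end(block)                # refresh LRU position
        else:
            misses += 1                             # flash miss -> fetch from disk
            cache[block] = None
            if len(cache) > cache_blocks:
                cache.popitem(last=False)           # evict least recently used
    return misses / accesses

if __name__ == "__main__":
    working_set = 10_000
    for flash in (10_000, 5_000, 2_000, 1_000):
        print(f"flash = {flash:>6} blocks -> miss rate {miss_rate(flash, working_set):.0%}")
```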

Working limitations

This has two effects. An individual application in a virtual machine needs longer to execute because it is waiting for IO from disk. An individual server will be able to run fewer virtual machines because each one takes longer to execute, waiting on disk IO while its working set is refreshed and reloaded.

Sizing of RAM, cache and disk is in theory quite feasible. Suppose we have 10 servers, each with four CPUs, each with eight cores and 160GB of RAM. The apps are virtual machines and each needs two cores. Then 16 virtual machines can be run by each server, each having 10GB of memory.

The average working data set of each virtual machine is, say, 100GB, and therefore the networked flash array or hybrid flash/disk array needs 16TB of capacity to hold the 10 servers’ 160 virtual machines' working set data.
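
Spelled out as a quick calculation, using the illustrative figures above (swap in your own server and VM numbers):

```python
# The sizing arithmetic from the example above. All inputs are the article's
# illustrative figures, not measurements.

servers = 10
cpus_per_server = 4
cores_per_cpu = 8
ram_per_server_gb = 160
cores_per_vm = 2
working_set_per_vm_gb = 100        # assumed average working set per VM

vms_per_server = (cpus_per_server * cores_per_cpu) // cores_per_vm   # 16
ram_per_vm_gb = ram_per_server_gb / vms_per_server                   # 10GB
total_vms = servers * vms_per_server                                 # 160
flash_needed_tb = total_vms * working_set_per_vm_gb / 1000           # ~16TB

print(f"{vms_per_server} VMs per server, {ram_per_vm_gb:.0f}GB RAM each")
print(f"{total_vms} VMs in total, needing roughly {flash_needed_tb:.0f}TB for their working sets")
```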

This is overly simple. There is a lot of averaging going on here and an implicit assumption that we know the bounds of an application’s working set.

That is rarely the case. Instead the set can be defined by time since last access: all data that has been accessed at least once since some arbitrary start time can be regarded as being in the working set, meaning the total application data set is larger than the working set.

It also means that if the application accesses data widely and randomly across this set, that data is constantly ageing, while as the process executes more and more data gets accessed and so qualifies to be in the working set – and hence in flash.
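
A minimal sketch of tracking a working set defined this way – every block touched since a chosen start time counts – might look like the following; the block IDs and access stream are synthetic.

```python
# Minimal working-set tracking by time since last access. Any block touched
# since the chosen start time is counted as part of the working set.
import time

class WorkingSetTracker:
    def __init__(self):
        self.last_access = {}                 # block ID -> last access time

    def touch(self, block_id):
        self.last_access[block_id] = time.time()

    def working_set_size(self, since):
        """Count blocks accessed at least once since `since` (a time.time() value)."""
        return sum(1 for t in self.last_access.values() if t >= since)

if __name__ == "__main__":
    tracker = WorkingSetTracker()
    start = time.time()
    for block in [1, 2, 3, 2, 5, 8, 13, 2, 1]:   # synthetic access stream
        tracker.touch(block)
    print("working set size:", tracker.working_set_size(start))   # 6 distinct blocks
```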

The data-moving operation uses up resources too. If it moves small blocks of data, then flash contents are better matched to application needs but more processing is needed to achieve it.

If it moves large blocks of data, fewer operations are needed but flash misses are more likely. Applications in virtual machines can use a range of block sizes, say from 4KB to 1,024KB.

You can work this out from knowing the app in some cases, with Oracle using 8KB IO or MS SQL Server using 64KB. A VMware tool such as vscsiStats can be used to find out the block sizes used by a particular virtual machine.
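
As a rough feel for the granularity trade-off, the sketch below compares how many move operations are needed, and how much extra data gets dragged into flash, at different block sizes. The working-set size and the "useful fraction per block" figures are assumptions made up for illustration.

```python
# Rough arithmetic on data-movement granularity: small blocks track the working
# set tightly but need many move operations; big blocks need fewer moves but
# drag extra, unneeded data into flash. Figures below are assumptions.

def moves_and_waste(working_set_gb, block_kb, useful_fraction_per_block):
    block_gb = block_kb / (1024 * 1024)                        # KB -> GB
    data_pulled_gb = working_set_gb / useful_fraction_per_block  # includes dragged-in data
    moves = int(data_pulled_gb / block_gb)
    return moves, data_pulled_gb

for block_kb, useful in [(4, 1.0), (64, 0.9), (1024, 0.6)]:    # assumed useful fractions
    moves, pulled = moves_and_waste(100, block_kb, useful)
    print(f"{block_kb:>5}KB blocks: ~{moves:,} moves, {pulled:.0f}GB pulled into flash")
```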

Another gotcha is that if the virtual machine population does not have a uniform distribution of working set sizes, then a sizing scheme based on averages may be inappropriate.

Yet another is that your overall application data size and access profile change over time, so the initially appropriate flash cache size may become progressively less ideal.

Ideally your supplier uses a software tool to monitor and analyse application working set size and access rate characteristics. It will produce data that can be used to generate flash cache or flash tier sizing information.

For example, it might report that the total data storage requirement for your applications is 79TB, of which 21TB should be in flash to enable 95 per cent of IO requests to be fulfilled in less than 10 microseconds.
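
The sort of calculation such a tool might perform can be sketched as follows: rank volumes by IO density and fill flash with the hottest data until the target hit rate is reached. The per-volume capacities and access rates below are made-up sample data, chosen to roughly mirror the 79TB/21TB example above; they are not output from any real tool.

```python
# Sketch of percentile-based flash sizing: put the hottest data in flash first
# until the target fraction of IO is covered. Sample volumes are invented.

def flash_size_for_hit_target(volumes, target=0.95):
    """volumes: list of (capacity_tb, io_per_sec). Returns (flash TB, IO fraction covered)."""
    total_io = sum(io for _, io in volumes)
    ranked = sorted(volumes, key=lambda v: v[1] / v[0], reverse=True)  # hottest per TB first
    covered_io, flash_tb = 0.0, 0.0
    for cap_tb, io in ranked:
        if covered_io / total_io >= target:
            break
        flash_tb += cap_tb
        covered_io += io
    return flash_tb, covered_io / total_io

if __name__ == "__main__":
    sample = [(3, 8000), (8, 6000), (10, 5000), (20, 800), (38, 200)]  # 79TB in total
    tb, hit = flash_size_for_hit_target(sample)
    print(f"~{tb:.0f}TB of flash covers {hit:.0%} of IO")
```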

Automated data placement software can monitor access rates to data and promote data from the disk to the flash tier if its access rate rises, or demote it from flash to disk if its access rate falls below some threshold.

Data migration to and from a flash cache is typically much more dynamic than to and from a flash tier.

Tools of the trade

Dell, for example, has a hybrid array load-balancing tool. It monitors page "heat" over periods of a few seconds, with accesses to data increasing its heat. The heat value then decays over time, needing fresh accesses to maintain it.

An algorithm calculates the likelihood of pages of data being accessed based on their heat, and every couple of minutes they are automatically moved to and from flash as their heat rises or falls.
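
A toy version of a heat-based scheme along these lines is sketched below. The decay rate, thresholds and rebalance logic are invented for illustration; they are not Dell's actual parameters.

```python
# Toy heat-based placement: accesses add heat, heat decays each interval, and a
# periodic rebalance promotes hot pages to flash and demotes cold ones.
# All constants are assumptions for illustration.

DECAY = 0.5          # fraction of heat retained per interval (assumed)
PROMOTE_AT = 10.0    # heat at or above which a page moves to flash (assumed)
DEMOTE_AT = 2.0      # heat below which a page moves back to disk (assumed)

class HeatMap:
    def __init__(self):
        self.heat = {}        # page ID -> current heat
        self.in_flash = set()

    def access(self, page):
        self.heat[page] = self.heat.get(page, 0.0) + 1.0

    def rebalance(self):
        """Run every couple of minutes: decay heat, then move pages."""
        for page in list(self.heat):
            self.heat[page] *= DECAY
            if self.heat[page] >= PROMOTE_AT:
                self.in_flash.add(page)
            elif self.heat[page] < DEMOTE_AT:
                self.in_flash.discard(page)

if __name__ == "__main__":
    hm = HeatMap()
    for _ in range(30):
        hm.access("hot-page")       # heavily accessed page
    hm.access("cold-page")          # touched once
    hm.rebalance()
    print("in flash:", hm.in_flash)  # only the hot page qualifies
```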

SanDisk’s ioControl SPX hybrid array has a quality-of-service (QoS) based scheme, with different storage service levels defined by IOPS, minimum latency and bandwidth.

There could be three service levels – mission-critical, business-critical and non-critical – with application data volumes assigned to the appropriate one. The array’s resources are defined by the overall QoS needs and sized accordingly.

An application volume’s guaranteed minimum IOPS, throughput and not-to-exceed latency are monitored in real time. Based on this monitoring the ioControl operating system uses a QoS engine to move data between flash and disk.

This is to ensure that apps with a higher QoS target have a higher percentage of blocks in flash. Ones with lower QoS settings have more blocks on disk.
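
A simplified sketch of QoS-driven placement of this kind might look like the following; the service-level names follow the article, but the flash-share percentages and sample volumes are assumptions, not SanDisk's real policy.

```python
# Sketch of QoS-driven placement: volumes in a higher service level keep a
# larger share of their blocks in flash. Percentages below are assumptions.

FLASH_SHARE = {
    "mission-critical": 0.80,
    "business-critical": 0.40,
    "non-critical": 0.10,
}

def flash_allocation(volumes):
    """volumes: list of (name, size_gb, service_level) -> GB of flash per volume."""
    return {name: size_gb * FLASH_SHARE[level] for name, size_gb, level in volumes}

if __name__ == "__main__":
    vols = [
        ("oltp-db", 2000, "mission-critical"),
        ("mail", 4000, "business-critical"),
        ("archive", 10000, "non-critical"),
    ]
    for name, gb in flash_allocation(vols).items():
        print(f"{name}: keep ~{gb:.0f}GB in flash")
```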

Nimble Storage has its Cache Accelerated Sequential Layout scheme, which caches active data in real time on SSDs as a way of speeding up response to read requests. It claims response speeds are 10 times faster than with alternative flash schemes.

Network array flash cache and tier sizing is a specific problem area. It is an art with its own technical skillset, and we have only scratched the surface here.

The best advice is to use the technical resources of your hybrid-array supplier and give it access to the server application data it needs for the flash-sizing exercise. ®
