Sysadmin Blog
Non-Volatile Memory Express, or NVMe, is a game-changing storage standard for PCIe-connected drives. It is replacing AHCI and, along with the U.2 (SFF-8639) connector, it is replacing both SAS and SATA for high speed, low latency storage. It's the smart way to connect flash and post-flash storage tech to your servers.
The numbers and the theory seem sound, but what is it like in the real world?
"NVMe drives" have become the layman's term for 2.5" SSDs using the U.2 connector. Though not an interconnect in itself, the terminology has been hijacked in common parlance for this specific form factor.
Part of this is due to the U.2 connector not actually getting a name until well after it was in use. For the most part, however, it is because NVMe works for some PCIe card-based SSDs, but by no means all, while virtually every 2.5" PCIe SSD uses NVMe and the U.2 connector.
Card-based PCIe SSDs require research to ensure they'll work how you think they'll work. 2.5" SSDs with an NVMe sticker on them will work in a server with an NVMe slot. Technopedantry nerds, aim your rage cannons now.
My lab has two NVMe servers consisting of three nodes. One node is a Supermicro 1028U-TNR4T+ and the other 2 nodes are a Supermicro 6028TP-DNCR TwinPro. Supermicro has been out in front of most other vendors on NVMe, having a full range of hot-swap capable servers almost before the drives hit the streets.
These servers are equipped with U.2 connectors supporting 2 NVMe SSDs in the 1028U-TNR4T+ and 4 NVMe SSDs per node for each of the two nodes in the 6028TP-DNCR. I have so far only been able to test with 2 NVMe SSDs and 4 SATA SSDs per node. This is nowhere near a full test of the capabilities of these systems – or of the difference NVMe makes over SATA – but it is enough to gather some early impressions.
While this article isn't intended as a review of these units, it's worth noting they've held up well through testing. The servers themselves are typical Supermicro tenth generation servers. I recently reviewed the 2028TP-DC0FR and 2028TP-DC1R Twin servers and what I wrote about those units is a pretty good overview of what you'd expect from the 1028U-TNR4T+ and 6028TP-DNCR. They have 80-Plus Titanium PSUs, good build quality, and exceed the posted thermal compatibility.
Oh, there are differences, of course, but this is Supermicro. If you need a different combination of ports or to add or subtract GPU support, there is going to be a variant of the server in question that does exactly that with a slightly different model number. It's kind of what Supermicro does.
I am sad to say that my lab does not have any NVMe SSDs of its own. I am aware that most enterprise customers use Intel drives almost exclusively, but trying to extract samples from Intel for testing is a dark art that borders on impossibility. Micron is a lot friendlier, and I fully expect to be writing a follow-up article with its help soon that will show what NVMe can really do.
In the meantime, I managed to borrow six 400GB Intel DC P3700 drives from a friend for a week, on penalty of horrific, violent death should they go missing. Given the retail cost of these drives, I understand completely.
Intel's consumer 750 series SSDs are rarer than Micron's 960GB M500s were right after launch, and I suspect for much the same reason. When the M500s launched, they were quickly bought up by web scale operators. They were the lowest cost SSDs available at the time and it was almost a year after initial availability before plebeians like you and me could find one for sale without feeling like we'd won the lottery.
The P3700 SSDs, fortunately, aren't suffering from the same availability issues. Probably because the eye-watering price is causing the web scale operators to focus their NVMe gorging on the consumer versions.
If a discussion about the purchasing habits of companies seems extraneous, bear with me for a second; I assure you it's relevant.
Performance and initial impressions
My only real day-to-day exposure to PCIe SSDs had been a card-based Micron P420m 1.4TB unit that was tortured in my lab for a year before it finally died. I loved that card. It was so fast I honestly didn't have a practical use for it. Given that the P420m units were released in early 2013 and that the P3700s started making the rounds in mid 2014, I fully expected the Intel P3700s to beat the P420m hands down.
My initial tests of the P3700 SSDs, unfortunately, were very mixed.
Both the read latency and read throughput on the P3700 SSDs were awful, and the gap between them and the P420m only got worse as the queue depth rose. This doesn't make a lot of sense to me, as the whole point of NVMe is to allow bigger queues so that we can actually use all that lovely flash.
On the other hand, both the write latency and write throughput were better, surpassing the P420m in some cases. This is unexpected, as the size difference – 1.4TB for the P420m versus 400GB for the P3700s – should have given the Micron card a decided edge. If you have card-based PCIe SSDs it might be worth testing out NVMe drivers versus proprietary drivers for your use cases to see if performance changes.
While I'd love to do a knock-down, drag-out test of various PCIe SSDs against one another, the real point of these tests was to see how NVMe drives fared against my collection of SATA drives. The short version is: the SATA drives got wrecked.
With random I/O at a queue depth of 1 or 2 there was no appreciable difference between drives. Modern SSDs – consumer or datacenter – are so limited by the SATA or SAS interface that they pretty much perform the same.
Move to a queue depth of 4, however, and the NVMe SSDs start pulling away. By a queue depth of 16 the NVMe drives are clearly earning their keep and at a queue depth of 128 the SATA drives might as well have been magnetics.
To give you an idea of what I'm talking about, at QD128 I was seeing between 4x and 5x higher latency off of the SATA drives and at least 4x higher throughput off of the NVMe drives. Where I was getting 1GiB/sec off the P3700 while slamming it with random I/O at QD128, the SATA SSDs were struggling to deliver 250MiB/sec.
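Those two figures hang together, which a bit of back-of-envelope arithmetic shows. The sketch below takes the two throughput numbers quoted above and applies Little's Law (outstanding I/Os = IOPS x latency). The 4KiB block size and the assumption that the queue stays full at 128 are illustrative guesses, not details from the tests, so treat the absolute latencies as rough.

```python
# Back-of-envelope check of the QD128 figures quoted above.
MIB = 1024 ** 2
GIB = 1024 ** 3

# Throughput figures from the article.
NVME_BYTES_PER_S = 1 * GIB
SATA_BYTES_PER_S = 250 * MIB

# Assumptions (not from the article): 4 KiB random I/O, queue kept full at 128.
BLOCK_BYTES = 4 * 1024
QUEUE_DEPTH = 128


def iops(bytes_per_s, block_bytes):
    """I/O operations per second for a given throughput and block size."""
    return bytes_per_s / block_bytes


def mean_latency_ms(queue_depth, io_per_s):
    """Little's Law rearranged: latency = outstanding I/Os / IOPS."""
    return queue_depth / io_per_s * 1000


nvme_iops = iops(NVME_BYTES_PER_S, BLOCK_BYTES)
sata_iops = iops(SATA_BYTES_PER_S, BLOCK_BYTES)
throughput_ratio = NVME_BYTES_PER_S / SATA_BYTES_PER_S
nvme_latency_ms = mean_latency_ms(QUEUE_DEPTH, nvme_iops)
sata_latency_ms = mean_latency_ms(QUEUE_DEPTH, sata_iops)

print(f"throughput ratio: {throughput_ratio:.3f}x")
print(f"NVMe: {nvme_iops:.0f} IOPS, ~{nvme_latency_ms:.2f} ms mean latency")
print(f"SATA: {sata_iops:.0f} IOPS, ~{sata_latency_ms:.2f} ms mean latency")
```

Under those assumptions the throughput gap is about 4.1x, and the implied mean latency is roughly 0.49ms for NVMe versus 2.0ms for SATA. A drive that delivers a quarter of the throughput at the same queue depth has to show about four times the latency, which is why the two numbers move together.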
Maybe now it is clear why the web scale companies are snapping these things up as fast as they can be manufactured.
Server design matters
As a hyperconverged cluster, 2 NVMe and 4 SATA per node is fast. Deliriously, ruins-you-on-other-storage-forever kind of fast. Fine, fair enough, that's pretty much what I was expecting. What I wasn't expecting was how big a difference server design would make.
To start off with, the 2x 1GbE NICs per node in the 6028TP-DNCR were wholly inadequate to the job. The cluster was fast, sure, but when I popped in a 4-up 10GbE card per node the speed went from "good" to "holy crap".
Wait, since when is storage not the bottleneck?
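The arithmetic behind that surprise is simple enough to sketch. The snippet below compares raw link rates against the 1GiB/sec a single P3700 delivered earlier. It ignores protocol overhead and replication traffic entirely, so real-world network numbers will be lower, and the function names are purely illustrative.

```python
# Rough comparison of NIC line rate versus one NVMe drive's measured throughput.
GIB = 1024 ** 3

# Single P3700 at QD128, as measured earlier in the article.
NVME_DRIVE_BYTES_PER_S = 1 * GIB


def link_bytes_per_s(ports, gbit_per_port):
    """Raw line rate in bytes/sec; ignores Ethernet/TCP overhead (assumption)."""
    return ports * gbit_per_port * 1e9 / 8


def bottleneck(storage_bytes_per_s, network_bytes_per_s):
    """Name whichever side runs out of bandwidth first."""
    return "network" if network_bytes_per_s < storage_bytes_per_s else "storage"


onboard = link_bytes_per_s(ports=2, gbit_per_port=1)    # 2x 1GbE
add_in = link_bytes_per_s(ports=4, gbit_per_port=10)    # 4-up 10GbE card

print(f"2x 1GbE : {onboard / 1e6:.0f} MB/s -> {bottleneck(NVME_DRIVE_BYTES_PER_S, onboard)}")
print(f"4x 10GbE: {add_in / 1e6:.0f} MB/s -> {bottleneck(NVME_DRIVE_BYTES_PER_S, add_in)}")
```

At best, 2x 1GbE moves about 250MB/sec, which is less than a quarter of what one NVMe drive can push. The 4-up 10GbE card raises the ceiling to about 5,000MB/sec, which puts the bottleneck back on storage.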
Similarly, running benchmarks against arrays of various configurations in a single node led me to some interesting conclusions. The most important conclusion is this: in the same way that only the performance of the SATA SSDs actually matters in a hybrid array of magnetic disks and SATA SSDs, the difference between NVMe and SATA is so big that you can functionally write off any performance derived from SATA drives.
I built a series of different hybrid setups and slammed them with every workload I could find. It didn't matter if the SATA drives were 7.2k RPM magnetics, 10k RPM magnetics or flash. If I purposefully overloaded the array so much that I ran the NVMe tier out completely then performance dropped off a cliff so dramatically that workloads actually crashed.
Similarly, I tried hybrid setups of 2x NVMe drives with 4x SATA, 6x SATA and 8x SATA. It didn't matter how many SATA drives I had as my second tier, running out the NVMe tier was like hitting a wall. This probably wouldn't be true for big sequential workloads, but for heavily used virtual servers it was very, very noticeable.
What matters is how they hold up under load. I'd rather have 4x NVMe drives than 16x SATA drives for multiple workloads or workloads where latency matters. A hyperconverged virtualization cluster is a great use case. Lots of VMs writing not only to their own node, but replicating to other nodes in the cluster. The I/O blender is in full effect and the SSDs have to deal with lots of different I/Os all of different sizes.
The flip side of this is that NVMe means next to nothing for most single workloads. Exchange isn't going to care. Your web server really isn't going to care. But a properly configured SQL server capable of slamming the storage subsystem with massive amounts of requests absolutely adores a high queue depth.
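One way to find out which camp a workload falls into is to sweep queue depth with a synthetic benchmark and see where the curve flattens. The sketch below only builds command lines for fio. That is an assumption on my part, since the tool used for the tests above isn't named, and the device path is a hypothetical placeholder. It sticks to random reads, which don't write to the target.

```python
# Sketch only: builds fio command lines for a queue-depth sweep; it does not run them.
import shlex

# Hypothetical target: point this at a scratch device or file you can afford to hammer.
TARGET = "/dev/nvme0n1"

# Queue depths discussed in the article.
QUEUE_DEPTHS = [1, 2, 4, 16, 128]


def fio_command(target, queue_depth, runtime_s=60):
    """Return an fio argv list for a 4k random-read test at the given queue depth."""
    return [
        "fio",
        f"--name=randread-qd{queue_depth}",
        f"--filename={target}",
        "--rw=randread",
        "--bs=4k",                # block size is an assumption, not from the article
        "--ioengine=libaio",      # Linux async I/O engine, needed for real queue depth
        "--direct=1",             # bypass the page cache
        f"--iodepth={queue_depth}",
        f"--runtime={runtime_s}",
        "--time_based",
        "--output-format=json",
    ]


commands = [fio_command(TARGET, qd) for qd in QUEUE_DEPTHS]
for cmd in commands:
    print(shlex.join(cmd))
```

If throughput stops climbing and latency starts ballooning by QD4 on the existing storage, and the application really can keep that many requests in flight, it is probably a candidate for NVMe. If the application never gets past a queue depth of 1 or 2, the drive swap won't buy much.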
For these reasons, the difference between 2x NVMe drives and 4x NVMe drives is noticeable. Traditional logic would have the 2 NVMe SSDs and 8x SATA in the 1028U-TNR4T+ make it a clear storage winner over the 4 NVMe SSDs and 2x SATA per node in the 6028TP-DNCR. If your goal is capacity, this still holds true. If your goal is performance? Not so much.
This makes the lack of 10GbE in the 6028TP-DNCR all the more baffling. It's also worth noting that the 1028U-TNR4T+ supports 1.5TB of RAM while the 6028TP-DNCR can only fit 1TB per node. This lets me run more workloads on the system with better networking but fewer NVMe slots.
I realize Supermicro has a SKU for everything, but NVMe changes things. I simply cannot comprehend a server with 2 NVMe bays and no 10GbE, let alone one with 4 per node. Still, I'm looking at this through the lens of someone who runs everything 100% virtualised and is constantly cramming a whole bunch of different kinds of workloads into his clusters. The point of Supermicro's "one SKU for everything" approach is that there is a whole world full of people out there with needs that are different from mine.
At the end of the day, that's really the take-home lesson about NVMe SSDs themselves. Sure, they deliver much better and more consistent performance at high queue depths, but that doesn't mean they'll deliver noticeable change for everyone.
My company's production systems run on a Scale Computing 3-node cluster that uses 10K RPM magnetic hard drives. Not an SSD in sight; not even a SATA one. That cluster is performing absolutely fine powering the workloads currently assigned. NVMe would be ridiculous overkill.
By the same token, I have a client who runs great big fat workloads that eat lots of RAM and CPU. They hit the storage hard, in bursts. A cluster of 1028U-TNR4T+ nodes would do them wonders, but the 6028TP-DNCR isn't the right balance for them.
NVMe was designed to ready the world for XPoint. XPoint is a post-flash technology that will perform 1000 times as fast as today's goods. Given the difference that NVMe's high queue depth makes with today's flash, I believe entirely that it is necessary.
For the same reason I also believe that we need to look at NVMe SSDs as a completely separate tier from traditional SAS and SATA SSDs, just as XPoint will be another tier again. The queue depth makes a big enough difference that it is worth thinking about where these drives fit in your storage mix.
Do you have workloads that need NVMe? If so, which ones? Let us know and we'll see if we can add them to the test suite. ®