Wow, machine learning, what a snoozefest... less so if you strap a bunch of GPUs to your storage
GPU-boosted system market is, like, literally so hot right now
Analysis Machine learning stresses storage because training the models means feeding millions, if not billions, of files to the GPU-equipped training system as quickly as possible.
Suppliers are devising converged, hyperconverged and composable systems to sidestep chokepoints and make it simpler to get ML customers up and running.
Recently we have had the Pure Storage and Nvidia AIRI converged system, which brings four Nvidia DGX-1 GPU-enhanced servers to bear on FlashBlade-stored data.
Now Nvidia has released an updated DGX, the DGX-2. Chinese server firm Inspur and composable infrastructure supplier Liqid have produced a Matrix Rack Composable Platform for machine learning while X-IO has added GPUs and SQream database software to its Axellio combined server+storage box.
The DGX-2 is two DGX-1s plus more CPU, memory, interconnect bandwidth and storage:
| |DGX-1|DGX-2|Notes|
|---|---|---|---|
|GPUs|8x V100|16x V100| |
|Interconnect|NVLink|NVLink2 with 12 NVSwitches|216 ports|
|CPUs|2x 20-core Xeon E5-2698 v4 2.2GHz|2x Xeon Platinum|Faster CPUs|
|GPU Memory|256GB HBM|512GB| |
|System Memory|512GB DDR4|1.5TB|Triple pooled memory space|
|Storage|4x 1.92TB SSD (7.68TB)|30-60TB NVMe SSD|4-8x more capacity|
|Performance|960 TFLOPS|1,920 TFLOPS|Bigger memory pool means larger jobs|
|Weight|134lbs|350lbs|More than 2x|
|Networking|4x EDR InfiniBand & 2x 10GbitE|8x EDR InfiniBand or 100GbitE| |
|Price|$149,000|$399,000|More than 2x|
The much larger system memory means larger jobs can be run in the DGX-2. They should complete more than twice as fast because of this.
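To put the headline numbers side by side, here is a minimal Python sketch that works out the DGX-2 to DGX-1 ratios and a rough list-price cost per TFLOPS. It uses only the figures quoted above; the dictionary keys and function names are our own, and list price per peak TFLOPS is a crude yardstick, not a benchmark.

```python
# Figures quoted in the comparison above; key names are illustrative.
DGX1 = {"gpus": 8, "tflops": 960, "price_usd": 149_000, "weight_lbs": 134}
DGX2 = {"gpus": 16, "tflops": 1920, "price_usd": 399_000, "weight_lbs": 350}


def ratio(key: str) -> float:
    """How many times larger the DGX-2 figure is than the DGX-1 figure."""
    return DGX2[key] / DGX1[key]


def usd_per_tflops(system: dict) -> float:
    """List price divided by quoted peak TFLOPS."""
    return system["price_usd"] / system["tflops"]


if __name__ == "__main__":
    for key in ("gpus", "tflops", "price_usd", "weight_lbs"):
        print(f"{key}: {ratio(key):.2f}x")
    print(f"DGX-1: ${usd_per_tflops(DGX1):.2f} per TFLOPS")
    print(f"DGX-2: ${usd_per_tflops(DGX2):.2f} per TFLOPS")
```

On these numbers the quoted peak performance exactly doubles while the price rises by roughly 2.7x, so any "more than twice as fast" result would have to come from the larger memory pool and faster interconnect rather than from raw TFLOPS.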
With the DGX-2 being announced so close to the Pure-Nvidia AIRI system, it's clear that Pure and Nvidia decided not to have a DGX-2-based AIRI. However, it's possible that a subsequent AIRI system could be DGX-2-based, and have larger flash drives inside to keep the 16 GPUs occupied. This would be, we suppose, a $2m-plus system which would reduce the number of potential customers.
Inspur and Liqid
Inspur and Liqid have co-developed their Matrix Rack Composable Platform which lets users dynamically set up CPU-GPU-storage combinations composed for specific workloads. Inspur provides the i24 servers and GX4 chassis, Nvidia the Tesla V100 and P100 GPUs, and Liqid the Grid PCIe-based fabric hardware and software.
Start with a set of disaggregated pools of compute, GPU, storage and Ethernet networking resources. Elements from these pools can be combined, clustered, orchestrated and shared over the PCIe fabric.
The pool elements are:
- 24x Compute Nodes (Dual Intel Xeon Scalable Processors)
- 144x U.2 Solid-State Drives (SSD), 6.4 TB per SSD (922TB)
- 24x Network Adapters (NIC), Dual 100 Gb/NIC
- 48x NVIDIA GPUs (V100 and P100)
- Liqid Grid (Managed PCIe Gen 3.0 Fabric) and Liqid Command Center (software)
Liqid Grid PCIe fabric switch
A maximally configured system might blow the Pure-Nvidia AIRI system away, and with 48 GPUs it has three times as many as Nvidia's own DGX-2, although the pool mixes V100s and P100s. The cost of such a fully configured Matrix Rack would be astronomical.
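The composable idea is easy to picture as bookkeeping: resources sit in shared pools and a "machine" is just a reservation against them. The following is a toy Python sketch of that idea using the pool sizes listed above. It is emphatically not Liqid's Command Center API; `ResourcePool`, `compose` and `release` are hypothetical names, and the example allocation is arbitrary.

```python
from dataclasses import dataclass, asdict

SSD_CAPACITY_TB = 6.4  # per-SSD capacity quoted in the pool list


@dataclass
class ResourcePool:
    """Free resources in the rack (hypothetical model, sizes from the article)."""
    compute_nodes: int = 24
    ssds: int = 144
    nics: int = 24
    gpus: int = 48

    def compose(self, **request: int) -> dict:
        """Reserve resources for one logical machine, or fail without side effects."""
        available = asdict(self)
        for name, count in request.items():
            if name not in available:
                raise KeyError(f"unknown resource type: {name}")
            if not 0 <= count <= available[name]:
                raise ValueError(f"cannot allocate {count} {name}, {available[name]} free")
        for name, count in request.items():
            setattr(self, name, available[name] - count)
        return dict(request)

    def release(self, machine: dict) -> None:
        """Return a composed machine's resources to the pool."""
        for name, count in machine.items():
            setattr(self, name, getattr(self, name) + count)


pool = ResourcePool()
raw_capacity_tb = pool.ssds * SSD_CAPACITY_TB  # 144 x 6.4TB, roughly 922TB
training_box = pool.compose(compute_nodes=1, gpus=8, ssds=8, nics=2)
```

Releasing `training_box` puts the eight GPUs back in the pool for the next workload, which is the pitch: hardware is assigned per job rather than fixed per server.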
Dolly Wu, GM and VP at Inspur Systems, said: "AI and deep learning applications will determine the direction of next-generation infrastructure design, and we believe dynamically composing GPUs will be central to these emerging platforms."
X-IO, SQream and Nvidia
Back on the more affordable side of planet Earth we have X-IO's Axellio edge compute+storage product receiving an Nvidia GPU implant and SQream database software to deliver a "converged appliance for extremely rapid data analytics of massive datasets".
What SQream has done with its DBMS software is to take repetitive low-level SQL query operations and run them on a server GPU accelerator. The company says complex queries contain multiple filters, type conversions, complex predicates, exotic join semantics, and subqueries. When these are run on 100TB-level datasets, with billions of rows in several tables, query latency can range from several minutes to hours.
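For readers who don't write SQL for a living, here is a small illustration of the kind of query being described: one statement combining a filter, a type conversion, a join and a subquery. It runs on Python's built-in SQLite purely as a stand-in; it is not SQream's engine or dialect, and the tables and data are invented for the example.

```python
import sqlite3

# In-memory SQLite database standing in for a real analytics store.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER,
                     amount TEXT, placed TEXT);
INSERT INTO customers VALUES (1, 'EMEA'), (2, 'APAC'), (3, 'EMEA');
INSERT INTO orders VALUES
  (1, 1, '120.50', '2018-01-03'),
  (2, 1, '80.00',  '2018-02-10'),
  (3, 2, '300.25', '2018-02-11'),
  (4, 3, '15.75',  '2018-03-01');
""")

QUERY = """
SELECT c.region, SUM(CAST(o.amount AS REAL)) AS total       -- type conversion
FROM orders AS o
JOIN customers AS c ON c.id = o.customer_id                 -- join
WHERE o.placed >= '2018-02-01'                              -- filter
  AND CAST(o.amount AS REAL) >
      (SELECT AVG(CAST(amount AS REAL)) FROM orders) / 2    -- subquery
GROUP BY c.region
ORDER BY c.region
"""

rows = conn.execute(QUERY).fetchall()
print(rows)
```

Each of those per-row steps is cheap on four rows; repeated across billions of rows it is the repetitive low-level work that SQream says it hands to the GPU.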
SQream says it can provide a 20x speedup of queries on columnar data sets, and query large and complex data up to 100x faster than other relational databases. Its latency on complex queries against 100TB-level datasets is, it claims, in seconds to minutes territory.
Its ingest speed is up to 2TB/hour.
This enables a large-scale reduction in the number of servers needed to run SQL queries on large data sets; SQream claims a single 2U server plus GPU is equivalent to a 42U rack full of servers. Basically, SQream says: use our relational database to get screaming SQL performance.
Then X-IO says run it on our hardware and go faster still.
The server/storage base is X-IO's Axellio Edge Micro-Datacenter appliance product; a 2U box containing two Xeon server modules with two Xeons apiece, 2x Tesla P100 GPUs, a PCIe fabric, and 1 to 6 FlashPacs, which each hold up to 12x dual-port NVMe SSDs (800, 1,600, 3,200 or 6,400GB) with a maximum capacity of 500TB.
SQream and X-IO claim a two-node example of their combined system can push data from storage to the GPU at up to 3.2GB/sec per GPU. Their combined system can reach 11.5TB/hour in an analytics run.
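The two throughput figures are easier to compare in the same units. This short Python sketch does the conversion, assuming decimal units (1TB = 1,000GB); the function names are our own. The source does not say whether the 11.5TB/hour figure is per GPU or for the whole system, so treat the match as suggestive only.

```python
GB_PER_TB = 1000          # assumption: decimal units
SECONDS_PER_HOUR = 3600


def gb_per_sec_to_tb_per_hour(gb_per_sec: float) -> float:
    """Convert a sustained GB/sec rate into TB/hour."""
    return gb_per_sec * SECONDS_PER_HOUR / GB_PER_TB


def hours_to_move(dataset_tb: float, tb_per_hour: float) -> float:
    """Hours needed to move a dataset of the given size at a steady rate."""
    return dataset_tb / tb_per_hour


per_gpu_tb_per_hour = gb_per_sec_to_tb_per_hour(3.2)   # about 11.52 TB/hour
ingest_hours_100tb = hours_to_move(100, 2)             # at the quoted 2TB/hour ingest

if __name__ == "__main__":
    print(f"3.2GB/sec is about {per_gpu_tb_per_hour:.2f}TB/hour")
    print(f"Loading 100TB at 2TB/hour takes {ingest_hours_100tb:.0f} hours")
```

So 3.2GB/sec works out to roughly the quoted 11.5TB/hour, while loading a 100TB dataset at the quoted 2TB/hour ingest rate would take a little over two days.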
They say users can get real-time answers to queries that took minutes before, or expand their query windows from weeks to years to find trends, query trillions of rows of data and get results faster.
X-IO might also be looking at the machine learning space. In theory it would be easy enough to climb into bed with a machine learning framework software supplier. Just another partnership, right?
Machine learning is seen as a hot growth market. Combine that with on-premises NVMe flash storage and big data analytics applications, and the result is hot boxes galore.
We must surely expect Dell EMC and NetApp to enter the GPU-boosted system market, not to mention Huawei and Lenovo. Other all-flash array vendors, such as Kaminario, Tintri and WDC Tegile, might look at the Pure-Nvidia deal and think "me too".
The performance gains over non-GPU systems are so impressive that profit margins can be set high enough to get on-commission sales reps salivating like crazy. This GPU-accelerated server/storage product development space is going to see frenzied development as suppliers pile in to take advantage of the growth prospects. ®