Zoom in on HDS' Hyper Scale-out Platform kit
Hyper-converged box avoids the NameNode bottleneck
The Hitachi Hyper Scale-Out Platform – announced on Tuesday – is a single system for ingesting, accessing and analyzing big data-type information. Its architecture differs from typical Hadoop systems in that it avoids NameNode resource contention.
The hardware in the 2U rack-enclosure HSP 400 node is this:
- Two Intel Xeon E5-2620 2GHz processors with 15MB of cache.
- 192GB of RAM.
- 12 x 3.5” 4TB SAS 7,200rpm HDD (48TB per node).
- 2 x 2.5” 300GB SAS 10,000rpm HDD for app or OS.
- 40GbitE dual-port QSFP NIC for internode backend network.
- 10GbitE single-port SFP+ NIC for host connectivity.
- Brocade ICX 7750-26Q switch.
Twenty nodes will fit in a rack, and nodes are scaled out by being clustered together, starting with a minimum of five nodes. Each node supports compute, a virtual infrastructure, and storage, and is hyper-converged in that sense.
Like hyper-converged infrastructure appliances, capacity and performance scale as nodes are added. Both Hadoop and other types of analytics can run in HSP nodes and access their data.
HSP does not ship with a Hadoop distribution. Instead it has an HDS filesystem, which is object-based with a file front-end and supports the Hadoop Distributed File System (HDFS) via the HDFS API. This has been certified by Hortonworks.
The HDS filesystem also provides a POSIX-compliant file system interface "that allows other third-party data analytics applications to directly access the data" without it having to be exported to their systems.
There is no CIFS support (it isn't a NAS), but it can support Docker. It can also be bought with a full Hadoop distribution, but it still ships with the HDS filesystem inside it.
It supports "many of the same principles of Hadoop (64MB data chunk size, three-copy policy, data location API)" but is able to avoid the NameNode bottleneck by distributing metadata across the nodes.
There is thus, HDS says, no single point of failure or contention. Data is automatically distributed across nodes in the system, and there is distributed locking.
Every node can serve data without centralized metadata management, and HSP accesses the data via standard POSIX semantics through its distributed file system.
Hadoop analytics jobs work with the HDFS API; others use POSIX, and both have in-place data access.
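HDS has not said how it places chunks or metadata, so the following is only a rough sketch of how a cluster can do without a central NameNode. It assumes rendezvous hashing, a generic technique swapped in for illustration and not HDS's method. The `chunk_count` and `owners` functions, the node names and the file path are all hypothetical; only the 64MB chunk size, the three-copy policy and the five-node minimum come from HDS.

```python
import hashlib

CHUNK_SIZE = 64 * 1024 * 1024  # 64MB chunk size, as quoted by HDS
COPIES = 3                     # three-copy policy, as quoted by HDS


def chunk_count(file_size_bytes):
    """Number of 64MB chunks needed to hold a file (ceiling division)."""
    return -(-file_size_bytes // CHUNK_SIZE)


def owners(path, chunk_index, nodes, copies=COPIES):
    """Pick `copies` distinct nodes for one chunk via rendezvous hashing.

    Assumption: this placement rule is illustrative only. Any node can
    compute the same answer locally, so no central server is consulted.
    """
    def score(node):
        key = f"{node}:{path}:{chunk_index}".encode()
        return hashlib.sha256(key).hexdigest()

    # Highest-scoring nodes win; the ranking is identical on every node.
    return sorted(nodes, key=score, reverse=True)[:copies]


if __name__ == "__main__":
    nodes = [f"node{i}" for i in range(1, 6)]  # minimum five-node cluster
    size = 200 * 1024 * 1024                   # hypothetical 200MB file
    for idx in range(chunk_count(size)):
        print(idx, owners("/data/events.log", idx, nodes))
```

Because the lookup is a pure function of the path, chunk index and node list, every node reaches the same answer without asking a coordinator, which is the property the NameNode-free design relies on.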
The nodes run Linux and can run the KVM hypervisor, so starting up analytics is equivalent to spinning up a virtual machine and, hey presto, analytics compute has been brought to the data. In fact Pentaho software can run inside such VMs.
HDS senior veep Sean Moser says Docker containers will be able to run on the HSP's bare metal. He also says 1,000 nodes have been physically connected in tests, but that was not a final product, so it doesn't mean HSP supports 1,000 x 48TB = 48,000TB, or 48PB.
The nodes can host different VMs for different big data analytics needs. Network traffic is reduced as data transfers across the network to analytics servers, and a separate analytics data silo, are not needed.
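For a sense of scale, the capacity sums can be run as a quick back-of-the-envelope sketch. It assumes decimal terabytes, ignores filesystem overhead and the two 300GB app/OS drives, and assumes the three-copy policy applies to all stored data, so the "after three copies" figures are rough estimates, not HDS numbers. The function names are made up for illustration.

```python
DRIVES_PER_NODE = 12  # 12 x 3.5in data drives per HSP 400 node
TB_PER_DRIVE = 4      # 4TB each, so 48TB raw per node
COPIES = 3            # three-copy policy quoted by HDS


def raw_tb(nodes):
    """Raw data-drive capacity in TB for a given node count."""
    return nodes * DRIVES_PER_NODE * TB_PER_DRIVE


def usable_tb(nodes, copies=COPIES):
    """Rough capacity left once every chunk is stored `copies` times."""
    return raw_tb(nodes) / copies


if __name__ == "__main__":
    for label, n in [("minimum cluster", 5), ("full rack", 20), ("test rig", 1000)]:
        print(f"{label}: {n} nodes, {raw_tb(n):,}TB raw, "
              f"~{usable_tb(n):,.0f}TB after three copies")
```

On those assumptions, a five-node starter cluster holds 240TB raw, a full 20-node rack 960TB, and the 1,000-node test rig 48,000TB, with roughly a third of each left after replication.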
There are connectors to archive systems for archiving of data when it's no longer needed on the analytics platform.
HDS says HSP is self-configuring, self-managing, and self-healing. There is a RESTful API, and Hadoop management tools are supported. It is cloud-ready in that it "provides APIs for OpenStack Glance (image services), Nova (cloud computing fabric controller) and Swift (object storage) projects to enable users to deploy infrastructure as a service (IaaS) functionality."
Customers have pulled HDS along with this; they've been experiencing it for two years. Moser claims the pull on it is phenomenal. ®