NetApp’s effort to feed big data beast through NFS makes no sense
Latency on one side, huge capacities on the other
In this correspondent's opinion, it simply makes no sense at all.
Yes, you could find corner cases – you can always find a corner case for something you love – but in this case, that’s about all you can find.
And I’m not talking about data ingestion here.
Storing (big) data on primary storage
One of the benefits of HDFS is that it is a distributed filesystem with all the embedded availability, replication and protection mechanisms you need to store huge amounts of data safely. Above all, it is very inexpensive.
In fact, you can build your HDFS-based storage layer simply by adding disks to cluster nodes, and all the management tools are integrated. At the end of the day, it's just a file system that you get for free with any Hadoop distribution.
Despite all its defects, HDFS is optimised for that job: it is "local" to the cluster, it is designed to move big chunks of data, and it doesn't need the special attention usually required by primary storage. Both the total cost of acquisition and the total cost of ownership of HDFS are very low.
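To make the "protection for free" point concrete, here is a minimal sketch of the relevant Hadoop configuration. The property names are the standard ones from hdfs-site.xml, and the value of 3 is the stock Hadoop default; the data directory paths are invented for illustration:

```xml
<!-- hdfs-site.xml (fragment): a sketch of HDFS's built-in protection.
     dfs.replication = 3 is the stock Hadoop default: every block is
     copied to three nodes, so availability and protection come with
     the filesystem itself rather than from a storage array. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <!-- growing capacity really is just adding disks to a node:
         list another local directory here (paths are hypothetical) -->
    <name>dfs.datanode.data.dir</name>
    <value>/data/disk1/hdfs,/data/disk2/hdfs</value>
  </property>
</configuration>
```

There is no separate backup product or replication licence in this picture, which is exactly why the acquisition and ownership costs stay so low.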
Primary storage can easily be positioned on the opposite side:
- It definitely has problems managing big data analytics and traditional enterprise workloads at the same time, especially if they need to share the same resources (quality of service is still an optional extra for most storage vendors).
- It also introduces huge management costs when it comes to backup and remote replication: costs that become unsustainable once your environment scales beyond a few hundred terabytes.
Val Bercovici, in his article, talks about a hypothetical use case with HDFS in the role of a cache (or a primary file system) and NetApp as a secondary repository.
Framed this way, HDFS comes out on top compared with what is usually sold as primary storage. And why would you use primary storage for a secondary storage task anyway?
Don’t get me wrong. I totally agree with the caching layer part, I’ve been talking about it for months, but I think secondary storage has to be the slowest, most automated, scalable and cheapest part of this kind of design. And this is where NetApp doesn’t really fit in, does it?
Analysing (big) data in place
Analysing (big) data is something I really like, but doing that on NFS and NetApp FAS is just too costly.
In my opinion, there are many limits and constraints that mean NFS on NetApp FAS is not the ideal solution, not to mention the higher cost of NetApp FAS compared with better-suited alternatives for this particular use case.
In fact, if you look at what is happening all around, enterprises are piling up data. Like it or not, they are starting to build data lakes. The ONTAP file system (WAFL) and its data volume limits, in terms of number of objects and capacity, are just the first examples (as I recall, the maximum size of a volume is still around 100TB).
Yes, you can configure a NetApp system for high capacity (and with large volumes) but then you might not get the performance – and you won’t have any of the advantages usually found in object-based systems.
Various object storage vendors are working on similar capabilities, proposing an HDFS interface on top of their platforms. Working with the same filesystem interface, both inside and outside the cluster, is much better at every level. And, going back to the first use case presented in Val's blog, it also enables seamless use of the object storage system for secondary copies of data.
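As a hedged sketch of how that wiring looks in practice: Hadoop's standard S3A connector already lets a cluster address an S3-compatible object store through an ordinary filesystem scheme, and a vendor-supplied HDFS interface would slot in the same way. The endpoint and path names below are invented for illustration:

```xml
<!-- core-site.xml (fragment): a sketch, assuming an object store that
     exposes an S3-compatible endpoint. The endpoint name here is
     hypothetical; fs.s3a.endpoint is the standard Hadoop property. -->
<configuration>
  <property>
    <name>fs.s3a.endpoint</name>
    <value>objectstore.example.internal</value>
  </property>
</configuration>
<!-- With the s3a:// scheme wired in, a secondary copy of a dataset
     becomes an ordinary distcp job between the two filesystem
     interfaces (paths and hostnames are illustrative):

     hadoop distcp hdfs://namenode:8020/datalake s3a://archive/datalake
-->
```

The design point is that the cluster keeps talking filesystem semantics end to end, while the slow, cheap, scalable tier lives on the object store.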