The data resilience inside of – and outside of – Kubernetes

CSI: Global


Sponsored: Things go wrong, and it is only a matter of time before they do. Backing up and archiving data is a kind of security – more of a blankie than a shield – and is essential to the continued operation of any modern business.

In a monolithic system – as was commonly deployed decades ago – securing data was fairly simple: you milked the data and stored it in a cold place where you could retrieve it when needed. For mission-critical applications, companies deployed high availability clusters in their datacenters, using synchronous replication between mirrored systems to ensure maximum uptime. They then added asynchronous replication and failover to remote datacenters, a belt to go with the suspenders in case something wiped out the primary datacenter.

Synchronous replication (aka mirroring) can only be done over short distances, because the speed of light puts a floor under latency. It’s most popular in Europe where things are “closer”. Asynchronous replication has broader market appeal because it can potentially span the globe, and is not bound by the same latency constraints.
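To put rough numbers on that distance limit, here is a back-of-the-envelope sketch. It assumes light in optical fiber covers about 200 km per millisecond (roughly two-thirds of its speed in a vacuum) and ignores switching, protocol, and storage overheads, so real figures will be worse. The function names are ours, purely for illustration.

```python
# Rough physics only: light in optical fiber travels at about two-thirds of
# its vacuum speed, roughly 200,000 km/s, or about 200 km per millisecond.
FIBER_KM_PER_MS = 200.0  # approximate, assumed for this sketch


def round_trip_ms(distance_km: float) -> float:
    """Minimum round-trip time over a fiber path of the given length."""
    return 2 * distance_km / FIBER_KM_PER_MS


def max_distance_km(rtt_budget_ms: float) -> float:
    """Longest fiber path whose round trip fits inside the latency budget."""
    return rtt_budget_ms * FIBER_KM_PER_MS / 2


if __name__ == "__main__":
    # A synchronous write is not acknowledged until the remote copy confirms,
    # so every write pays at least one round trip.
    for km in (50, 400, 6000):
        print(f"{km:>5} km: at least {round_trip_ms(km):.1f} ms per write")
```

Under these assumptions, a mirror across a metro area costs a fraction of a millisecond per write, while one across an ocean costs tens of milliseconds, which is why long-haul replication tends to be asynchronous.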

These days, the cloud often functions as the remote replicated system and the backup and archive media, but the ideas are the same. In modern Kubernetes container clusters, the same data resilience ideas prevail as during those monolithic days. It’s just that the systems and their applications are chopped up into pieces – distinct servers and their containers and the microservices applets that collectively comprise the applications.

That said, the storage paradigm in Kubernetes is a little different. This allows for different data resiliency techniques, while still allowing established methods of data resiliency to be deployed alongside Kubernetes where appropriate.

In this article, which is the third in a four-part series, we examine the nuances of providing data resilience for Kubernetes platforms. (We have already talked generally about data services for Kubernetes in the first article in the series, and then drilled down into data security and data governance in the second article. In the final installment, we will tackle data discovery.) This might seem like a small problem – until it’s time to sift through mountains of data to unearth the nuggets of information you need to deliver all those good actionable insights.

Something has to manage this

“All of the agility and efficiency benefits that come with Kubernetes is great, but a Kubernetes platform is not like a hypervisor and virtual machines,” says Peter Brey, Director of Data Services Marketing at Red Hat, whose OpenShift distribution of Kubernetes is the most popular commercially supported implementation of a Kubernetes stack.

“Kubernetes is very different from virtual machines. There is a namespace inside of Kubernetes that is key to understanding data services, and this namespace is required because by the very nature of a container platform, containers and pods of containers can be running anywhere and different containers might need to access the same files or objects at the same time. If you want to do data resiliency correctly, something has to manage this, which OpenShift and other extensions to storage collectively do, and you need to understand this – or have a commercial Kubernetes distributor who understands this.”

Like other container platforms, Kubernetes started out with stateless applications and ephemeral storage. This was easy because you knew containers were not going to stick around and therefore requiring resilience was nonsensical. But many applications in the real world are stateful and need persistent data, and so the Kubernetes platform was given persistent storage.

The inherent persistent storage in Kubernetes has a three-pronged model of access modes, called ReadWriteOnce, ReadOnlyMany, and ReadWriteMany, which are abbreviated as RWO, ROX, and RWX respectively. (X means a variable in modern marketing parlance, as it did in middle school algebra.) Persistent volumes declared with the RWO, ROX, and RWX modes overlay the actual storage – be it NFS or Ceph object storage in the case of Red Hat OpenShift – and give containers a place to dump data into and chew on.
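The three modes named above can be modeled in a few lines. This is a simplified sketch of their semantics rather than how Kubernetes enforces them: real behavior depends on the storage driver, and the `mount_allowed` helper is a hypothetical name of our own.

```python
from enum import Enum


class AccessMode(Enum):
    RWO = "ReadWriteOnce"  # read-write, mounted by a single node
    ROX = "ReadOnlyMany"   # read-only, mounted by many nodes
    RWX = "ReadWriteMany"  # read-write, mounted by many nodes


def mount_allowed(mode: AccessMode, nodes: int, writing: bool) -> bool:
    """Return True if `nodes` nodes may mount a volume under `mode`.

    Simplified model for illustration; actual enforcement is up to the
    storage driver behind the volume.
    """
    if nodes < 1:
        raise ValueError("at least one node must mount the volume")
    if mode is AccessMode.RWO:
        return nodes == 1
    if mode is AccessMode.ROX:
        return not writing
    return True  # RWX: any number of nodes, reading or writing


if __name__ == "__main__":
    for mode in AccessMode:
        print(mode.name, mode.value, mount_allowed(mode, nodes=3, writing=True))
```

In this model, only RWX lets three nodes write to the same volume at once, which is the case that matters when different containers need the same files at the same time.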

In the early days of commercial Kubernetes installations, the storage drivers were embedded inside the Kubernetes code; this caused all kinds of grief aligning the drivers to the Kubernetes release schedule, which itself was a bit hectic. But several years ago Google, Mesosphere (now defunct), and other key Kubernetes players created the Container Storage Interface, which pulled out the storage drivers and turned them into plug-ins, thus allowing all manner of block and file storage suppliers to underpin persistent volumes for Kubernetes platforms.

The Kubernetes puzzle is different from the VMware puzzle – very different

The CSI plug-ins are similar to the plug-ins that were available for server virtualization hypervisors like VMware ESXi, Red Hat KVM, Microsoft Hyper-V, and Citrix Systems XenServer. But there is no analogous piece of technology to the namespace, which is a kind of – and we hesitate to use this word – virtual compute and storage cluster within a physical cluster that is controlled by Kubernetes.

“The analogy that I use a lot is this,” says Brey. “The Kubernetes puzzle is different from the VMware puzzle – very different. And some would argue that the Kubernetes puzzle is more complicated because you have all these tiny microservices doing their independent things. How do you orchestrate that? Well, Kubernetes does that. That is literally its lot in life. It watches your microservices and when a microservice fails, it finds a new place to run it and give it access to its data. It’s automated and you don’t need hands-on intervention.”

In other words, application resilience and managing data dependencies are built into Kubernetes, and the beauty is you don’t have to do anything to get them. That was the purpose of Borg and Omega when Google created them for its internal cluster and container management, and that was the purpose of Kubernetes when Google open sourced it and gave it to the world.

You have to have automation

“To be able to run tens of thousands, if not hundreds of thousands, of these microservices, you have to have automation,” Brey continues. “You can't do it any other way. Google could have never hired enough engineers to be able to deliver the scale it needed, and so it automated the heck out of everything, including storage and data. That's why we have this concept of persistent storage.

“And it is very different from the traditional enterprise datacenter, where an application developer has to submit an IT support ticket, which has to go to a storage admin, who has to go set up that storage and then get back to the application developer. Two weeks could pass. Now with OpenShift and persistent storage, all the application developer has to do is write a line of code in their app, and boom, it automatically goes and requests the storage on the fly, which has been pre-setup by this storage admin. And it all happens on the fly without any manual intervention from anybody.”
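The division of labor Brey describes might look something like the sketch below, which builds the two Kubernetes objects involved as plain Python dictionaries. The storage admin defines a StorageClass once, and the developer ships a PersistentVolumeClaim that names it. The class name `fast-block`, the driver name `csi.example.com`, and the claim name are placeholders we made up, not real products.

```python
import json

# Set up once by the storage admin. The class name and the CSI driver name
# ("csi.example.com") are placeholders for illustration only.
storage_class = {
    "apiVersion": "storage.k8s.io/v1",
    "kind": "StorageClass",
    "metadata": {"name": "fast-block"},
    "provisioner": "csi.example.com",
    "reclaimPolicy": "Delete",
}


def make_claim(name: str, size: str, storage_class_name: str) -> dict:
    """Build the claim a developer ships alongside the application."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "storageClassName": storage_class_name,
            "accessModes": ["ReadWriteOnce"],
            "resources": {"requests": {"storage": size}},
        },
    }


# The developer's side: request 20Gi from whatever the admin pre-configured.
claim = make_claim("orders-db-data", "20Gi", storage_class["metadata"]["name"])

if __name__ == "__main__":
    print(json.dumps(claim, indent=2))
```

The claim never mentions arrays, LUNs, or filers. It names a class and a size, and the plug-in behind that class does the provisioning.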

Enterprises also need data to be resilient outside of Kubernetes

However, just because Kubernetes has this level of automation for provisioning storage, and for keeping microservices running in containers linked to the proper persistent storage as they are created, moved, or destroyed, does not mean this is where data resilience ends.

Kubernetes, as implemented in OpenShift Data Foundation built with Red Hat Ceph, is resilient within itself with regard to data, but enterprises also need data to be resilient – meaning replicated or backed up or archived – outside of Kubernetes. You can’t trust any storage system to hold everything forever, and in many industries – healthcare and financial services and government come to mind – it is mandated by law that data must be stored in a retrievable form for years to decades.

It is simply not possible to back up thousands or tens of thousands of containers and their underlying data sources individually. That would be like trying to catch a vast school of minnows with your bare hands. Luckily, the CSI plug-in also allows traditional backup and archiving tools to plug into Kubernetes and pull off the data, storing snapshots – the first line of defense in enterprise data resilience – while preserving the context of the Kubernetes namespace. Preserving that context is key, because that is what makes a backup or archive useful in the event of a disaster that takes down a Kubernetes cluster.
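As a rough illustration of snapshots that keep their namespace context, the sketch below builds one VolumeSnapshot object per claim in a namespace, using the CSI snapshot API group (`snapshot.storage.k8s.io`). The API version available depends on the cluster, and the namespace, claim names, and snapshot class here are hypothetical.

```python
def make_snapshot(namespace: str, pvc_name: str, snapshot_class: str, stamp: str) -> dict:
    """Build a VolumeSnapshot that points at one claim in one namespace."""
    return {
        "apiVersion": "snapshot.storage.k8s.io/v1",
        "kind": "VolumeSnapshot",
        "metadata": {
            "name": f"{pvc_name}-{stamp}",
            "namespace": namespace,  # the context that makes a restore useful
        },
        "spec": {
            "volumeSnapshotClassName": snapshot_class,
            "source": {"persistentVolumeClaimName": pvc_name},
        },
    }


def snapshots_for_namespace(namespace, pvc_names, snapshot_class, stamp):
    """One snapshot per claim, all tagged with the same namespace and stamp."""
    return [make_snapshot(namespace, pvc, snapshot_class, stamp) for pvc in pvc_names]


if __name__ == "__main__":
    # All names below are made up for the example.
    for snap in snapshots_for_namespace(
        "orders", ["orders-db-data", "orders-cache"], "csi-snapclass", "20210601"
    ):
        print(snap["metadata"]["namespace"], snap["metadata"]["name"])
```

A backup tool does the same thing at scale: it walks the namespace, snapshots the claims it finds, and records which namespace each one came from.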

“You have to have extra software outside of CSI to handle the co-ordination of the namespace,” notes Brey, who advises users to check whether their backup vendors have developed this code. The Red Hat example here is OADP – OpenShift API for Data Protection – which provides an API to coordinate backup and restore of the Kubernetes namespace.
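To give a feel for what a namespace-level backup request looks like, here is a hedged sketch. OADP is built on the open source Velero project, so the example constructs a Velero `Backup` object as a Python dictionary. The `openshift-adp` namespace, the 30-day retention, and the application namespace are assumptions for illustration, and an actual OADP setup involves more configuration than this.

```python
def make_backup(name: str, app_namespaces, ttl: str = "720h0m0s",
                backup_namespace: str = "openshift-adp") -> dict:
    """Build a Velero Backup request covering whole namespaces.

    `backup_namespace` is where the backup tooling is assumed to run;
    adjust it to match the actual installation.
    """
    return {
        "apiVersion": "velero.io/v1",
        "kind": "Backup",
        "metadata": {"name": name, "namespace": backup_namespace},
        "spec": {
            "includedNamespaces": list(app_namespaces),  # what to protect
            "snapshotVolumes": True,  # also snapshot the persistent volumes
            "ttl": ttl,               # how long to keep the backup
        },
    }


if __name__ == "__main__":
    backup = make_backup("orders-nightly", ["orders"])
    print(backup["spec"])
```

The unit of protection here is the namespace rather than the individual container, which is the co-ordination job Brey says has to live outside CSI.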

“We don’t care what storage it is as long as it has a CSI plug-in,” Brey says with a laugh. “Some of our competitors are telling their customers that with OpenShift, Red Hat makes them move to new storage, but that’s baloney. We can store snapshots, backups, replicas – we don’t care what the storage is, and you can even put it in a public cloud. We care about the data services that sit on top. So, you can use Veeam, TrilioVault, IBM Spectrum Protect, MicroFocus Data Protector – we don’t care because this is not an area where we want to add value. Backup and recovery, as I have said before, is a holy war, and we are not here to fight that fight. We have better things to do, like make this all transparent to Kubernetes, which we do with the OpenShift API for Data Protection.”

In common with other enterprise applications, microservices running on Kubernetes that create a massively distributed and fully interconnected application need to support different Recovery Time Objectives (RTOs) and Recovery Point Objectives (RPOs). This would seem to be a massive, complex task for enterprises with hundreds to thousands of applications embodied in thousands to hundreds of thousands of containers, but it turns out to be no harder to deal with than with traditional, monolithic applications. You have to do the same things in terms of synchronous and asynchronous replicas or backups of data.
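A simple way to reason about those objectives is to compare each protection method against an application's targets. The sketch below is a deliberately crude model of our own, with made-up timings: it treats the worst-case data loss as one copy interval plus the time a copy takes to land, and treats synchronous mirroring as zero on both counts.

```python
from dataclasses import dataclass


@dataclass
class ProtectionPlan:
    name: str
    copy_interval_min: float  # time between copies; 0 for synchronous mirroring
    copy_duration_min: float  # time for one copy to land at the target
    restore_min: float        # estimated time to bring the application back

    def worst_case_loss_min(self) -> float:
        """Worst-case window of lost writes, to compare against the RPO."""
        return self.copy_interval_min + self.copy_duration_min

    def meets(self, rpo_min: float, rto_min: float) -> bool:
        """True if the plan satisfies both recovery objectives."""
        return self.worst_case_loss_min() <= rpo_min and self.restore_min <= rto_min


# Illustrative numbers only; real values come from measurement.
plans = [
    ProtectionPlan("nightly backup", 1440, 60, 240),
    ProtectionPlan("async replica", 5, 1, 15),
    ProtectionPlan("sync mirror", 0, 0, 5),
]

if __name__ == "__main__":
    for plan in plans:
        verdict = "ok" if plan.meets(rpo_min=15, rto_min=30) else "misses target"
        print(f"{plan.name}: {verdict}")
```

The same check applies whether the application is one monolith or a few hundred microservices, which is the point being made above: the objectives do not change, only the number of pieces they are applied to.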

Brey notes that only a few OpenShift shops, which tend to be the largest companies on Earth at this point in the adoption cycle of the still relatively new Kubernetes platform, are doing either asynchronous or synchronous replication between sites to protect their Kubernetes clusters.

“This is a highly experimental space, and many leading edge companies are trying to figure out how to make this all work. There are some very difficult problems to solve here,” he says, “because you’re dealing with the speed of light and the danger of data loss.”

Also, synchronous replication is still a relatively expensive proposition. More companies are doing asynchronous replication in some form because they don’t need synchronous replication and can’t afford the high cost of driving the sub-4 millisecond replication response time that synchronous replication needs.

Sponsored by Red Hat.


Biting the hand that feeds IT © 1998–2021