Docker and storage – solving the problem of data persistence
Pull on your vendor hoodie - it's a casual affair
In June I was in Seattle, the home of Starbucks, Boeing and Microsoft, for DockerCon 2016. Compared to the events I normally attend, this one promised to be a more “casual” affair, so I sported polo shirts and vendor hoodies as my standard attire.
One of the more interesting problems yet to be fully solved with container technology is that of persistent storage. When containers first appeared, the immediate assumption was that they would be ephemeral in nature and move around the infrastructure at will. If a container/application needs to be moved, simply re-instantiate it elsewhere. Well, it turns out that there are two problems with this: first, people like to have a degree of persistence with their containers (more on that in a moment) and second, application data has to reside somewhere.
Pets, cattle and pets again
The big comparison made between virtual machines and containers is that of pets versus cattle. VMs are pets to be nurtured, maintained and looked after; containers are cattle that can be culled at whim, to be replaced by another in the herd. This is a good analogy at first, but it overlooks an obvious fact: as we provision containers, they have to be configured and mapped to our application. That includes configuring security and network settings, storage, and the permissions needed to access application data or other parts of the application hierarchy.
This means now we go back to maintaining pets again, but this time our pet is a set of configuration files that explain how to orchestrate the application rather than an infrastructure-centric representation of that application.
The result is that people like containers to hang around longer than initially expected because maintaining deployment manifests takes effort.
Data in or accessible by the container
So when it comes to data, should we put data (resiliently) into the container or should we have more persistent data repositories (possibly on VMs) that the containers access? In its purest form, we should be aiming for the former, but that’s a lot more work than moving the stateless part of the app (like the web server) into a container while keeping the data in a more traditional format.
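To make the second option concrete, here is a minimal sketch of a stateless component that keeps nothing inside the container and writes its state to an external location instead. The `DATA_DIR` environment variable, the `visits.json` file and the function names are all hypothetical, chosen purely for illustration; a real deployment would more likely point at a mounted volume or a database.

```python
import json
import os
import tempfile
from pathlib import Path


def data_dir() -> Path:
    # DATA_DIR is a hypothetical variable name; it stands in for wherever
    # the persistent store is mounted outside the container's own filesystem.
    return Path(os.environ.get("DATA_DIR", tempfile.gettempdir()))


def record_visit(store: Path) -> int:
    # All state lives in the external store, so the process itself is disposable.
    counter = store / "visits.json"
    count = json.loads(counter.read_text())["count"] if counter.exists() else 0
    count += 1
    counter.write_text(json.dumps({"count": count}))
    return count
```

Because the function holds no state of its own, a replacement container pointed at the same store simply carries on counting where the previous one stopped. That is the easy half of the problem; it says nothing about what happens when the store itself has to move.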
Both options present us with a problem. Data has inertia, and latency makes it difficult for applications to access data over distance unless the access protocols for that data are specifically latency tolerant. We’re already seeing some solutions come to market to answer these problems, including ClusterHQ with Flocker, Portworx, Hedvig and StorageOS (which launched in beta at DockerCon).
In terms of requirements, we need the ability to move data (the container) from one location to another, with permissions kept correct so as not to expose data to the wrong application. We need to maintain integrity while data is in transit and being accessed. Of course we also have to back data up and ensure we can restore it, wherever the application resides in the future.
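As a rough illustration of the first two requirements, the sketch below moves a single data file while checking integrity and carrying its permission bits across. It is a simplification under stated assumptions: the function names are hypothetical, it handles one local file rather than a volume, and it does nothing about data being written during the move, which is the genuinely hard part.

```python
import hashlib
import os
import shutil
import stat
from pathlib import Path


def sha256_of(path: Path) -> str:
    # Hash the file in chunks so large files need not fit in memory.
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def move_with_checks(source: Path, destination: Path) -> str:
    # Hypothetical helper: copy, verify, and only then delete the original.
    expected = sha256_of(source)
    mode = stat.S_IMODE(source.stat().st_mode)
    shutil.copy2(source, destination)
    if sha256_of(destination) != expected:
        destination.unlink()
        raise IOError("checksum mismatch after copy")
    # Reapply the permission bits explicitly so access is no wider than before.
    os.chmod(destination, mode)
    source.unlink()
    return expected
```

The original is removed only after the copy has been verified, so a failed transfer leaves the source untouched. The returned checksum could also be stored alongside a backup and compared again after a restore.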
The architect’s view
I’m looking forward to getting into some detail on how persistent data is being managed. Storage is probably one of the last (big) container problems to solve and the issues are the same as they’ve always been. For the enterprise to adopt containers and Docker, operational issues around storage need to be fixed. I’m hoping we start seeing some answers.