Just as the National Security Agency concluded in 2005 that it would be easier to store everything, Netflix has decided to store all of its content metadata with its customers rather than serving data from a central repository and caching frequently accessed data at the network edge.
The streaming media service on Monday released an open-source project called Hollow that dispenses with what it characterizes as the conventional wisdom about distributing datasets of a certain size over a network.
In a blog post, Netflix senior software engineer Drew Koszewnik explains that when distributing datasets composed of metadata – about movies and TV shows on Netflix in this instance – engineers generally implement a system that falls somewhere between the two extremes: storing the data in a central datastore or serializing the data in JSON or XML and distributing it to viewers via network edge servers.
Either approach presents challenges. Centralized data stores are constrained by latency and bandwidth and may not be as reliable as a local data store, Koszewnik says. Keeping a serialized local copy of metadata in memory on a client device increases the memory footprint and may affect CPU resources or memory garbage collection behavior.
To balance these tradeoffs, Koszewnik says engineers often opt for a hybrid approach, caching frequently used data locally while storing seldom-used data remotely, to be transmitted when needed.
"Sizing a local cache is often a careful balance between the latency of going remote for many records and the heap requirement of keeping more data local," said Koszewnik. "However, if you can cache everything in a very efficient way, you can often change the game – and get your entire dataset in memory using less heap and CPU than you would otherwise require to keep just a fraction of it."
Koszewnik says that by keeping the entire, read-only dataset in memory at the endpoint, Hollow doesn't have to deal with updating and removing data from a partial cache. "Due to its performance characteristics, Hollow shifts the scale in terms of appropriate dataset sizes for an in-memory solution," he said.
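The "no partial cache" point can be illustrated with a minimal sketch (this is generic Java, not Hollow's actual API): when the whole read-only dataset fits in memory, a client never evicts or invalidates individual records; it simply holds one immutable snapshot and atomically swaps in a complete replacement when the producer publishes an update.

```java
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

// Illustrative sketch only -- class and method names are invented for this
// example. The idea: readers always see a complete, consistent, read-only
// dataset, so there is no per-record eviction or invalidation logic.
public class SnapshotStore {
    private final AtomicReference<Map<String, String>> snapshot =
            new AtomicReference<>(Map.of());

    // Lookups hit local memory; a miss means the record truly doesn't exist,
    // not that it was evicted from a partial cache.
    public String get(String key) {
        return snapshot.get().get(key);
    }

    // A background refresh replaces the entire dataset in one atomic step.
    public void refresh(Map<String, String> newData) {
        snapshot.set(Map.copyOf(newData)); // defensive immutable copy
    }
}
```

A refresh thread periodically pulling a new full snapshot and calling `refresh` is all the update machinery such a client needs.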
Hollow represents Netflix's alternative to a hybrid storage scheme. The successor to its 2013 project, Zeno, Hollow is a Java library and toolkit for handling small to medium-sized in-memory datasets (on the order of 100GB when serialized as JSON or XML) produced by a single organization for read-only access by multiple consumers.
The Hollow documentation suggests datasets measured in terabytes or petabytes would not be suitable.
"The biggest improvement is that with Zeno, the dataset was held in memory as POJOs," a Netflix spokesperson explained in an email to The Register, using the acronym for Plain Old Java Object. "In Hollow, we instead use a compact, fixed-length encoding to represent the data."
Zeno's use of POJOs required additional code to accompany datasets in order for the data to be interpreted, the spokesperson said.
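To see why a fixed-length encoding is so much more compact than POJOs, consider this hypothetical sketch (field names and byte widths are invented for illustration; Hollow's real layout differs): instead of allocating one object per record, with per-object header and reference overhead and millions of objects for the garbage collector to trace, every record is packed into a fixed number of bytes in one flat buffer, so record i lives at offset i * RECORD_SIZE.

```java
import java.nio.ByteBuffer;

// Hypothetical fixed-length record layout for movie metadata:
// 4-byte movie id + 2-byte release year + 2-byte runtime in minutes = 8 bytes.
// All records share one buffer, so the GC sees a single allocation rather
// than one POJO per record.
public class FixedLengthMovies {
    static final int RECORD_SIZE = 8;
    private final ByteBuffer buf;

    public FixedLengthMovies(int capacity) {
        buf = ByteBuffer.allocate(capacity * RECORD_SIZE);
    }

    public void put(int index, int id, short year, short runtime) {
        int base = index * RECORD_SIZE;
        buf.putInt(base, id);
        buf.putShort(base + 4, year);
        buf.putShort(base + 6, runtime);
    }

    // Reading a field is pure offset arithmetic -- no deserialization step
    // and no intermediate objects.
    public short yearOf(int index) {
        return buf.getShort(index * RECORD_SIZE + 4);
    }
}
```

Because the layout is fixed, the data is self-describing given a small schema, which is consistent with the spokesperson's point that Zeno-style POJOs needed extra code shipped alongside the dataset to interpret it.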
The project's toolset provides a dataset change history, diffs of datasets between arbitrary states, heap usage analysis, and usage tracking. It supports indexing and querying dataset records, splitting and combining datasets, and filtering records to manage memory usage.
Koszewnik claims that Hollow has helped Netflix reduce server startup times and heap footprints even as it deals with more and more metadata. He also said it has helped the company realize productivity gains associated with disseminating its data catalog. ®