Video streamer Netflix has deployed a prototype fault-injection platform from the University of California, Berkeley, to find and fix five problems that could otherwise have affected users.
The platform, dubbed MOLLY, is described in a 2015 Berkeley paper, Lineage-driven Fault Injection [pdf], as a "novel approach for discovering bugs in fault-tolerant data management systems".
Berkeley academics Peter Alvaro, Joshua Rosen, and Joseph M. Hellerstein say that if fault-tolerance bugs exist (those which could cause application failure), their prototype will find them rapidly, often using far fewer executions than random fault injection would.
"Failure is always an option; in large-scale data management systems, it is practically a certainty," the trio write.
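Why a lineage-driven search can need fewer executions than random injection can be sketched with a toy example. This is a minimal illustration, not MOLLY's code: it assumes a successful outcome is "supported" by alternative sets of components, and that only fault sets which break every support are worth injecting. The function name `minimal_hitting_sets` and the component names are hypothetical, and the real prototype may well search differently from this exhaustive loop.

```python
from itertools import combinations


def minimal_hitting_sets(supports):
    """Return minimal fault sets that intersect every support, smallest first.

    `supports` is a list of sets; each set holds the components that one
    successful derivation of the outcome relied on (a hypothetical model).
    """
    components = sorted(set().union(*supports))
    found = []
    for size in range(1, len(components) + 1):
        for candidate in combinations(components, size):
            faults = set(candidate)
            # Skip supersets of a smaller hypothesis that was already found.
            if any(previous <= faults for previous in found):
                continue
            # Keep the candidate only if it breaks every known support.
            if all(faults & support for support in supports):
                found.append(faults)
    return found


# Hypothetical lineage: the request succeeded via the cache or the database,
# and both paths went through the same API service.
supports = [{"cache", "api"}, {"database", "api"}]
hypotheses = minimal_hitting_sets(supports)
print(hypotheses)
```

With three components there are seven non-empty fault combinations to try at random, but only two hypotheses here can actually defeat the outcome: failing the API alone, or failing the cache and the database together.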
Netflix already knew of key fault injection points thanks to its existing in-house FIT tool, which was coupled to MOLLY for deeper analysis. Company engineering director Ben Schmaus (@schmaus) and internet scale engineer Kolton Andrus (@koltonandrus) say they found five faults in App Boot, the request that loads a list of videos for users, including one that involved a combination of failure points.
"This (App Boot) is also a very complex request, touching dozens of internal services and hundreds of potential failure points," the engineers say.
"Brute force exploration of this space would take 2^100 iterations, whereas our approach was able to explore it in about 200 experiments.
"We found five potential failures, one of which was a combination of failure points."
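To put the quoted figure in perspective, here is a minimal sketch of the arithmetic. It assumes each failure point is independently either injected or left alone, so n points give 2^n combinations; the count of 100 failure points is inferred from the quoted 2^100, and the function name is hypothetical.

```python
def brute_force_experiments(failure_points: int) -> int:
    """Count every on/off combination of faults across the failure points."""
    return 2 ** failure_points


total = brute_force_experiments(100)
print(total)           # 1267650600228229401496703205376
print(f"{total:.2e}")  # 1.27e+30
```

That is roughly a 1 followed by 30 zeros, against the 200 or so experiments the engineers report running.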
The duo says only a small number of experiments were run so as to affect as few users as possible.
Repairing the faults was a manual affair, with the FIT tool used to verify each fix and help determine the best one.
Netflix may extend the prototype to scan a wider request space to find more user-impacting failures.
"We’re very excited that we were able to build this proof of concept implementation and find real failures using it."
Excited readers can apply to work with Netflix in its open positions for fault finders. ®