If Machine Learning is the question, open source is the answer. Right?
Why Google's gift of TensorFlow is not what it seems
Machine learning (ML) and artificial intelligence (AI) are extraordinarily hard to pull off in the real world, so of course the solution must be open source. From Google’s TensorFlow to Microsoft’s Cognitive Toolkit, the world is awash in open source ML/AI code... none of which seems to be solving the gaping void between AI hype and production deployment reality. By Gartner’s estimates a mere 15 per cent of organisations actually get into production with ML/AI.
A big reason for this gap is talent. Or, rather, a lack thereof. There’s a chance, however, that an influx of open-source code into the ML universe could improve things. How so? By lowering barriers to entry to experiment on and become proficient with high-quality ML software. Perhaps not surprisingly, the cloud giants that stand to gain from an influx of data-heavy ML applications are the same ones open sourcing the ML code in the first place.
Mind the gap
Despite the incessant hype over ML’s promise to change everything forever, the reality is that ML has hardly managed to get out of neutral, much less first gear. As I’ve detailed before, the biggest barrier to ML success is a distinct lack of qualified engineers. Or as Gartner analyst Merv Adrian put it to me: “[I]t’s mostly about skills. Missing skills.”
What skills are missing? I’m glad you asked. There are a number of lists of must-have attributes for ML engineers, including this one: “[B]e aware of the relative advantages and disadvantages of different approaches, and the numerous gotchas that can trip you,” or this: “Be comfortable with failure,” not to mention a slew of algorithms.
Summarizing the brutal difficulty of sourcing the complete ML package, O’Reilly Media chief data scientist Ben Lorica and vice president Mike Loukides tell us all that’s needed is to find unicorns and a pot of gold at the end of the rainbow: “They frequently have doctorates in the sciences, with a lot of practical experience working with data at scale. They are almost always strong programmers, not just specialists in R or some other statistical package.
“They understand data ingestion, data cleaning, prototyping, bringing prototypes to production, product design, setting up and managing data infrastructure, and much more. In practice, they turn out to be the archetypal Silicon Valley ‘unicorns’: rare and very hard to hire.”
With such a deficit of expertise, it’s not clear how open source would help.
In the early days of Linux, for example, the director of IBM’s Linux Technology Center told me that for open source to be successful, you had to have a sufficient body of developers with aptitude and interest in a given area. Every developer needs an operating system, for example, so there tends to be a large body of developers with interest and aptitude in contributing to something like Linux. Ditto databases, app servers (remember them?), and so on.
More recently, Apcera chief executive (and Cloud Foundry architect) Derek Collison told me: “Open source is a natural progression for ecosystems where there’s a lot of innovation and breakthroughs. The market eventually becomes democratized and open source alternatives emerge.”
Where open source doesn’t work, he declares, is when you go open source “from the start in an ecosystem that doesn’t even know what it means.”
Like, say, ML, where your odds of finding a qualified engineer are about the same as Lionel Messi signing for Millwall. It’s not going to happen.
After all, these open source ML frameworks come from the rarefied air of Google, Facebook and other unicorn-esque companies. It’s not clear that anyone else would know enough to be able to contribute to projects like TensorFlow, and not many more know how to use the software.
Which, ironically, may well be the point.
Early on, Google made its intentions clear when it open sourced TensorFlow. “We hope this will let the machine learning community – everyone from academic researchers, to engineers, to hobbyists – exchange ideas much more quickly, through working code rather than just research papers,” Google said.
It’s a nice thought, but Google isn’t a charity, and open sourcing code isn’t done simply To Make The World A Better Place™. No, Apcera’s Collison probably isn’t too far from the mark when he insists, speaking in this case of Kubernetes: “Google is trying to open source the API ecosystem to drive workloads to its cloud.”
With TensorFlow and other open source ML/AI code, Google breeds familiarity with ML and then encourages developers to run their projects on Google Cloud. It’s a smart strategy, and fits into Google’s overriding message of “AI inside everything.”
It does not, however, turn developers into ML geniuses overnight.
That said, open-source projects like TensorFlow are enabling a new generation of ML-savvy engineers. As MuckRock founder Michael Morisy put it to me, open-source software like TensorFlow: “[M]ake it easy to experiment and in some domains [it enables you to get] meaningful results without [a] ton of expertise. Also [it] lowers barriers to bake [ML] into other projects.” In other words, it’s not that developers are geeking out on the source code, though undoubtedly some are, but rather that open source grants broad access and redistribution rights, allowing developers to experiment and learn in the process.
This wouldn’t work so well, however, if the code weren’t exceptional. It matters that data-science heavyweights like Google and Microsoft are the source of much of this code. Scott Clark, co-founder and CEO of SigOpt, picked up on this theme.
Clark told me: “[T]he new swath of new ML and AI open source projects has helped accelerate development and widespread industry acceptance of these techniques. It does so by creating a relatively small number of standard, well-engineered approaches that can be generally applied and trusted. Historically, using a ML technique meant tracking down a book or paper and then writing brittle, ‘academic’ code often from incomplete information and less-than-optimal engineering principles.
“The fact that you can use state-of-the-art research techniques ‘out of the box’ and almost immediately, without the extreme overhead of implementing them from first principles means that many more people from many more backgrounds are able to get value from them.”
In this way, such open-source libraries and access to on-demand infrastructure like AWS or Google Cloud “allows developers to accomplish things in hours or days that would have taken researchers months or longer to do only a decade ago.”
Some assembly required
Of course, understanding how to engineer ML is only the first part of the problem. The second might actually be harder. As Igor Faletski, CEO and co-founder of Mobify, expressed it to me: “With ML getting democratized by services like TensorFlow, having lots of data that can drive action at scale becomes the next big priority.”
Simply put, most enterprises don’t have massive quantities of data available to train their algorithms, either because the data is siloed throughout their organisation or because they simply haven’t been collecting the right kinds of data. For this, open source doesn’t help.
For others, the availability of data isn’t the hurdle. According to Yaron Haviv, founder and CTO at iguaz.io, understanding ML thanks to software like TensorFlow is relatively straightforward, but: “The hard part is figuring out neural networks.” Jared Rosoff, VMware executive, takes a similar tack, arguing that before TensorFlow and its open source ilk: “Matlab and Octave made machine learning ‘easy’ [while] TensorFlow made it ‘easy’ to scale and use [ML] in production systems (instead of desktop).” The harder part, he warns, is that: “All of these tools are useless if you don’t have some knowledge of ML, statistics, and math.”
In short, open-source ML tools aren’t going to magically turn random programmers into ML gods. They can, however, lower the barriers to learning ML. And the fact that the sources of much of this excellent open source ML code (Google, Microsoft, and AWS) also happen to offer ready-made elastic infrastructure to run those newfangled ML applications?
Well, let’s just call that a fortuitous coincidence. ®