Interview Data science as a profession has a lot of maturing to do, with workflows up to a decade behind those of software engineers and tools that make collaboration hard, according to Cloudera's general manager for machine learning.
Hilary Mason, a data scientist by training, joined the firm about 18 months ago when Cloudera snaffled up her analytics and algorithm research biz Fast Forward Labs.
The shift saw her 16-person team, mostly used to working with startups, join a company of 1,500 that soon went on to merge with its chief rival Hortonworks, and is relentlessly eyeing up the enterprise.
"It has, on the surface, felt like an odd combination, to have a bunch of applied machine learning researchers essentially in residence at a data platform vendor," she told The Register at the DataWorks conference in Barcelona last week. "But we've found the niche where our company sits."
The former Fast Forward Labs team makes up one half of the machine learning business unit in Cloudera, focused on research, consulting and building the software, while the other half is focused on the Cloudera Data Science Workbench platform.
But one thing that will have taken some getting used to is working with customers that have significantly different problems than those they traditionally worked with – and there is now a very "wide spectrum of sophistication", Mason said.
"Some are incredibly advanced, doing the really flashy, exciting stuff… but the majority aren't doing that; they're doing things like predicting operational metrics, or business intelligence."
These customers tend to have a data lake investment and want to build a machine learning application on that. They generally start by targeting a large cost saving before trying to argue for cash to tackle the more interesting problems that would create new revenue streams.
They rely on outside expertise to help them address the organisational challenges – often harder than the technology problems – but there is also a cultural element at play. "In many of these companies – the Fortune 500s of the world – they're looking for an outside group to help with R&D because, internally, even though they have the talent and the tech, they can't politically invest in those systems."
Data scientists aren't tooled up to be team players
Cultural challenges face the data science community in other ways, too, Mason argues. At the moment, data scientists' jobs don't look the same from company to company, and the whole profession needs to become "much more of a team sport".
Part of the problem is that the standard tooling is "essentially a data scientist working on their own, in a notebook, and collaboration is a challenge", when the aim should be to create a network effect. "The more people in an organisation that use the tools and platform, the better it should get for everybody," she said.
Related to this is, of course, a business plan for Cloudera's machine learning team. "One opportunity we see is to create one end-to-end workflow for data scientists to be able to go from data engineering, through the experimental process, through to deployment."
Existing tools cover data analysis and managing experiments, she said, so the focus will be on taking a trained model and moving it into production. "We haven't made any specific announcements, but we do have a strong point of view that this is where the market is going to move."
Mason said that, in this sense, data science workflow tooling is about five to 10 years behind that of software engineering – but that the "metadata piece is a bit more complex" when it comes to data science.
"Things get a little complicated," she said, because some of the features are engineered by a machine learning system. And, in many contexts, you need to be able to prove, at any given point in time, that the classification made was done by this exact model trained on this data set.
On top of that, the market for collaborative products is "very immature" – giving people the ability to search other people's feature labels isn't much use if they don't use the same vocabulary to label their work, for instance.
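The provenance requirement Mason describes, proving which exact model and data set produced a given classification, could be approached by storing a fingerprint of each alongside every prediction. What follows is a minimal sketch of that idea under stated assumptions, not a description of any Cloudera product: it assumes the model, training data and input are each available as bytes, and the names `fingerprint`, `PredictionRecord`, `record_prediction` and `matches` are hypothetical.

```python
import hashlib
from dataclasses import dataclass


def fingerprint(data: bytes) -> str:
    """Return a SHA-256 hex digest identifying these exact bytes."""
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class PredictionRecord:
    """Hypothetical audit entry stored alongside each classification."""
    model_hash: str    # fingerprint of the serialised model
    dataset_hash: str  # fingerprint of the training data
    input_hash: str    # fingerprint of the input that was classified
    label: str         # the classification that was returned
    timestamp: str     # when the classification was made (ISO 8601)


def record_prediction(model_bytes: bytes, dataset_bytes: bytes,
                      input_bytes: bytes, label: str,
                      timestamp: str) -> PredictionRecord:
    """Build an audit entry tying a label to a model and data set."""
    return PredictionRecord(
        model_hash=fingerprint(model_bytes),
        dataset_hash=fingerprint(dataset_bytes),
        input_hash=fingerprint(input_bytes),
        label=label,
        timestamp=timestamp,
    )


def matches(record: PredictionRecord, model_bytes: bytes,
            dataset_bytes: bytes) -> bool:
    """Check whether a record was produced by this model and data set."""
    return (record.model_hash == fingerprint(model_bytes)
            and record.dataset_hash == fingerprint(dataset_bytes))
```

An auditor holding the archived model and data set could then call `matches` on a stored record to confirm, or rule out, that they produced the classification in question. Features engineered by another machine learning system, the complication Mason raises, would need their own fingerprints in the record; this sketch does not cover that.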
Data science has a lot to learn from DevOps
A symptom of the problems data science faces is that people working in the field haven't got their own identity, or a place to belong within companies.
"We've been shoved into software engineering workflows that don't really fit," said Mason. "I can't tell you how many teams I've gone into where data scientists are being forced to adhere to an agile methodology that doesn’t work for research."
On the production side, they are thrown in with DevOps styles of workflows – but she said there is a lot that could be learned here, because this was "as much a process and cultural movement as it was a technology one".
This centred on understanding that, when managing hundreds or thousands of servers, you would need an insight into the whole workflow, and that the people designing the code should know what the eventual deployment mechanism will look like. "In data science we don't work that way."
At the same time, though, more cultural challenges rear their heads, especially for those in legacy firms, where there are tensions between the old and the new guard.
"Most people... on the IT, engineering side of things – they're like, 'Oh those data scientists, they're just throwing their horrible shitty code at me over the wall and it doesn't work; I can't deploy it and I don't know what to do'," she said. "You might say they should be woken up at 2am when their code breaks. But [data scientists] might say, well I just do math."
The community has yet to figure out how all of this will work in a mature practice, but Mason said she was confident it would "inevitably" mature. "But it's fascinating because it's one of human process and workflows as much as it is about building the right technology." ®