Cloudera's machine learning head honcho: Collaboration? Data scientists have heard of it

Hilary Mason: Agile 'doesn't work for research', we need to build 'network' workflow

Interview Data science as a profession has a lot of maturing to do, with workflows up to a decade behind those of software engineers and tools that make collaboration hard, according to Cloudera's general manager for machine learning.

Hilary Mason, a data scientist by training, joined the firm about 18 months ago when Cloudera snaffled up her analytics and algorithm research biz Fast Forward Labs.

The shift saw her 16-person team, mostly used to working with startups, join a company of 1,500 that soon went on to merge with its chief rival Hortonworks, and is relentlessly eyeing up the enterprise.

"It has, on the surface, felt like an odd combination, to have a bunch of applied machine learning researchers essentially in residence at a data platform vendor," she told The Register at the DataWorks conference in Barcelona last week. "But we've found the niche where our company sits."

The former Fast Forward Labs team makes up one part of the machine learning business unit in Cloudera, focused on research, consulting and building the software, while the other half is focused on the Cloudera Data Science Workbench platform.

But one thing that will have taken some getting used to is working with customers that have significantly different problems than those they traditionally worked with – and there is now a very "wide spectrum of sophistication", Mason said.

"Some are incredibly advanced, doing the really flashy, exciting stuff… but the majority aren't doing that; they're doing things like predicting operational metrics, or business intelligence."

These customers tend to have a data lake investment and want to build a machine learning application on that. They generally start by targeting a large cost saving before trying to argue for cash to tackle the more interesting problems that would create new revenue streams.

They rely on outside expertise to help them address the organisational challenges – often harder than the technology problems – but there is also a cultural element at play. "In many of these companies – the Fortune 500s of the world – they're looking for an outside group to help with R&D because, internally, even though they have the talent and the tech, they can't politically invest in those systems."

Data scientists aren't tooled up to be team players

Cultural challenges face the data science community in other ways, too, Mason argues. At the moment, data scientists' jobs don't look the same from company to company, and the whole profession needs to become "much more of a team sport".

Part of the problem is that the standard tooling is "essentially a data scientist working on their own, in a notebook, and collaboration is a challenge", when the aim should be to create a network effect. "The more people in an organisation that use the tools and platform, the better it should get for everybody," she said.

Related to this is, of course, a business plan for Cloudera's machine learning team. "One opportunity we see is to create one end-to-end workflow for data scientists to be able to go from data engineering, through the experimental process, through to deployment."

Existing tools cover data analysis and managing experiments, she said, so the focus will be on taking a trained model and moving it into production. "We haven't made any specific announcements, but we do have a strong point of view that this is where the market is going to move."

Mason said that, in this sense, data science workflow tooling is about five to 10 years behind that of software engineering – but that the "metadata piece is a bit more complex" when it comes to data science.

"Things get a little complicated," she said, because some of the features are engineered by a machine learning system. And, in many contexts, you need to be able to prove, at any given point in time, that the classification made was done by this exact model trained on this data set.
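To make that provenance requirement concrete, here is a minimal sketch of one way it could be met: every classification is logged alongside a hash of the model and a hash of the training data. This is an illustration only, not something Mason or Cloudera described; the function names (`fingerprint`, `audit_record`, `matches`) and the record layout are hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone


def fingerprint(data: bytes) -> str:
    """Return a SHA-256 hex digest identifying these exact bytes."""
    return hashlib.sha256(data).hexdigest()


def audit_record(model_bytes: bytes, dataset_bytes: bytes,
                 features: dict, label: str) -> dict:
    """Build a log entry tying one classification to a model and its training data.

    The record layout is an assumption made for this example.
    """
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_sha256": fingerprint(model_bytes),
        "dataset_sha256": fingerprint(dataset_bytes),
        "features": features,
        "label": label,
    }


def matches(record: dict, model_bytes: bytes, dataset_bytes: bytes) -> bool:
    """Check whether a logged classification came from this model and data set."""
    return (record["model_sha256"] == fingerprint(model_bytes)
            and record["dataset_sha256"] == fingerprint(dataset_bytes))


if __name__ == "__main__":
    # Stand-ins for a serialised model file and a training data file.
    model = b"serialised-model-v1"
    dataset = b"id,amount,label\n1,9.99,ok\n"

    record = audit_record(model, dataset, {"amount": 9.99}, "ok")
    print(json.dumps(record, indent=2))
    print(matches(record, model, dataset))
    print(matches(record, b"serialised-model-v2", dataset))
```

Hashing the raw bytes means any retraining or change to the data produces a different fingerprint, so an old record cannot be attributed to a newer model by mistake. It does not cover the machine-engineered features Mason mentions; those would need their own versioning.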

On top of that, the market for collaborative products is "very immature" – giving people the ability to search other people's feature labels isn't much use if they don't use the same vocabulary to label their work, for instance.

Data science has a lot to learn from DevOps

A symptom of the problems data science faces is that people working in the field haven't got their own identity, or a place to belong within companies.

"We've been shoved into software engineering workflows that don't really fit," said Mason. "I can't tell you how many teams I've gone into where data scientists are being forced to adhere to an agile methodology that doesn’t work for research."

On the production side, they are thrown in with DevOps styles of workflows – but she said there is a lot that could be learned here, because this was "as much a process and cultural movement as it was a technology one".

This centred on understanding that, when managing hundreds or thousands of servers, you need insight into the whole workflow, and that the people designing the code should know what the eventual deployment mechanism will look like. "In data science we don't work that way."

At the same time, though, more cultural challenges rear their heads, especially for those in legacy firms, where there are tensions between the old and the new guard.

"Most people... on the IT, engineering side of things – they're like, 'Oh those data scientists, they're just throwing their horrible shitty code at me over the wall and it doesn't work; I can't deploy it and I don't know what to do'," she said. "You might say they should be woken up at 2am when their code breaks. But [data scientists] might say, well I just do math."

The community has yet to figure out how all of this will work in a mature practice, but Mason said she was confident it would "inevitably" mature. "But it's fascinating because it's one of human process and workflows as much as it is about building the right technology." ®

Biting the hand that feeds IT © 1998–2022