Immuta, a data governance startup in Maryland run by former US National Security Agency technicians, has developed a method to govern how data is used by machine learning algorithms.
Dubbed "Projects," the new addition to Immuta's data governance platform embeds what the company considers "key GDPR [EU's General Data Protection Regulation] concepts, such as purpose-based restrictions and audits on data," which will allow data scientists to run complicated algorithms on data without breaching privacy laws.
Immuta closed its $8m Series A funding round back in February. Since then, CEO Matthew Carroll has stressed that governance now requires data controllers to know "who is working on what and what the outcomes of that work are," as well as to "automate complex reporting – which is critical for GDPR compliance – that documents which data sources have been used, for which purposes, and by whom."
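The reporting Carroll describes can be sketched as a minimal audit log that records which user touched which data source, for what declared purpose, and when. All names here are hypothetical; this is an illustration of the concept, not Immuta's implementation.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class AuditRecord:
    user: str
    data_source: str
    purpose: str
    timestamp: str

@dataclass
class AuditLog:
    records: list = field(default_factory=list)

    def log_access(self, user, data_source, purpose):
        # Every access is recorded with a UTC timestamp for later reporting.
        self.records.append(AuditRecord(
            user=user,
            data_source=data_source,
            purpose=purpose,
            timestamp=datetime.now(timezone.utc).isoformat(),
        ))

    def report(self, purpose):
        # Which data sources were used for this purpose, and by whom?
        return sorted({(r.user, r.data_source)
                       for r in self.records if r.purpose == purpose})

log = AuditLog()
log.log_access("alice", "claims_db", "fraud-detection")
log.log_access("bob", "claims_db", "marketing")
print(log.report("fraud-detection"))  # [('alice', 'claims_db')]
```

A per-purpose report like this is what turns "who used what, and why" from a case-by-case email chain into a query.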
Citing work by Nicolas Papernot – a Google PhD Fellow in Security at Pennsylvania State University who has worked on privacy within machine learning, particularly on preventing bias and improving the accuracy of algorithms' output – Immuta noted that the governance issue with non-interpretable CNNs (convolutional neural networks) is that the CNNs are "arbitrarily making decisions in a hidden layer. We don't know how it weights certain values."
Speaking to The Register back in March, University College London's Dr Hannah Fry warned we needed to be wary of algorithms operating behind closed doors. The issue, she noted, is that without being able to see how such algorithms work, "you can't argue against them" when they produce dodgy results.
"If their assumptions and biases aren't made open to scrutiny then you're putting a system in the hands of a few programmers who have no accountability for the decisions that they're making," Fry said.
In Immuta's words, it is the lack of interpretability within these algorithms that increases the risk data controllers face, as they are unable to audit what data was used and how.
"We can always go back into an application or business intelligence tool if we've made a mistake," Carroll told The Register. "We can call the database administrator, we can admit we've screwed up, it's fixable – because it's interpretable. The problem is that goes away with machine learning."
Once the data is inside the black box, data controllers would have to shut down their algorithm and retrain the whole model, with significant revenue impacts. Governance "was always the data," said Carroll, "but no longer. Now it's the model and how you're trying to use it that's equally as important as the data."
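Carroll's trade-off – echoed later when he talks about accepting a model "6 per cent less accurate than another but far more interpretable" – can be sketched as a simple selection policy. The model names, accuracy scores, and numeric interpretability rankings below are invented for illustration; this is one possible policy, not Immuta's method.

```python
def pick_model(candidates, tolerance=0.06):
    """Prefer the most interpretable model whose accuracy is within
    `tolerance` of the best candidate (illustrative policy only)."""
    best = max(m["accuracy"] for m in candidates)
    eligible = [m for m in candidates if best - m["accuracy"] <= tolerance]
    # Among the models close enough to the best, take the most interpretable.
    return max(eligible, key=lambda m: m["interpretability"])

candidates = [
    {"name": "deep_net", "accuracy": 0.91, "interpretability": 1},
    {"name": "gbm",      "accuracy": 0.89, "interpretability": 2},
    {"name": "logistic", "accuracy": 0.86, "interpretability": 3},
]
print(pick_model(candidates)["name"])  # logistic
```

With a tighter tolerance the policy reverts to the most accurate model; the point is that the trade-off is made explicitly, up front, rather than discovered after a black box has misfired.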
Projects attempts to deal with these issues by forcing data controllers to think about "purpose-based" deployments of their data analysis and machine learning models. The idea is that, whatever data science platform is in use, users can tie data sources and scripts to a specific project and assign a purpose to that project. Carroll says:
For example, say I can see A, B, and C rules on data. I may be using it for very different projects. How does it change? When data scientists are running queries and scripts we will know why, know intent. This is a whole new concept: tying code, data, and users together.
We've made it very simple through the UI to add data sources and scripts. Projects is embedded into our platform, made incredibly easy for any tool to leverage our governance layer.
Projects helps you understand INTENT first. You might choose to train a machine learning model that is 6 per cent less accurate than another but far more interpretable. That way if you do have an issue, you have a much better chance of being able to fix it quickly.
You can't just go in and fix a model and everything updates. You need to make highly strategic decisions from the outset. The more precise you can be up front the higher your success rate.
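The purpose-based model Carroll describes above can be sketched as follows (illustrative only, not Immuta's implementation, and all names are hypothetical): each data source carries a set of permitted purposes, each project declares a single purpose, and attaching a source to a project succeeds only when the two match – with every decision, granted or denied, recorded for audit.

```python
class PurposeError(Exception):
    pass

class DataSource:
    def __init__(self, name, allowed_purposes):
        self.name = name
        self.allowed_purposes = set(allowed_purposes)

class Project:
    """Ties users, scripts, and data sources to one declared purpose."""
    def __init__(self, name, purpose):
        self.name = name
        self.purpose = purpose
        self.sources = []
        self.audit_trail = []  # (source, purpose, granted)

    def add_source(self, source):
        granted = self.purpose in source.allowed_purposes
        # Record the decision whether or not access is granted.
        self.audit_trail.append((source.name, self.purpose, granted))
        if not granted:
            raise PurposeError(
                f"{source.name} may not be used for {self.purpose!r}")
        self.sources.append(source)

claims = DataSource("claims_db", {"fraud-detection", "actuarial-modelling"})
project = Project("churn-model", "marketing")
try:
    project.add_source(claims)
except PurposeError as e:
    print(e)  # claims_db may not be used for 'marketing'
```

Because the check runs at the point where code, data, and users meet, the denial itself becomes part of the audit trail – the "tying code, data, and users together" Carroll describes.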
"We're particularly excited about Projects because it opens the door to purpose-based restrictions on data, which has never been done before," said Andrew Burt, Immuta's chief privacy officer and legal engineer, who formerly served in the FBI as special advisor for policy to the assistant director of the Fed's cyber division.
"Many laws and regulations only allow certain data to be used for certain purposes. When dealing with complex machine learning projects that traverse multiple data sets, it's incredibly inefficient – and borderline untenable – to rely on case-by-case determinations from compliance departments. What companies really need are automated purpose-based controls on each and every data set." ®