Prejudiced humans = prejudiced algorithms, and it's not an easy fix

Building bad practices in ML can turn out awkward


Obvious candidates for ML are questions of credit repayment likelihood, insurance risk, and propensity to buy, as well as, on social media, the stories most likely to be of interest.

The problem? Machines learn whatever there is to learn, and sometimes they get it very wrong. This is because, whatever the underlying process, ML typically involves two stages. The first – training – involves presenting cases from which the learning machine seeks to extrapolate a pattern of independent factors predicting the key dependent variable. The second – testing – checks that pattern against cases the machine has not seen before.

Some two decades ago a story, possibly apocryphal, doing the rounds of the neural net community told of a failed attempt by the US military to train a neural net to distinguish friendly from enemy tanks. In training: success. Or so they thought, until they moved from the learning set to the test set and found the software "opening fire" equally on friend and enemy. It turned out that all their machine had learnt was the difference between pictures of tanks taken on sunny versus overcast days.
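The tank story can be reproduced in miniature. What follows is a toy sketch, not a reconstruction of whatever the military actually built: the data, the feature names `bright` and `long_barrel`, and the one-feature learner are all invented for illustration. In the training photos every friendly tank happens to be sunlit, so the learner latches on to brightness. Only the held-back test set shows that it has learnt the weather.

```python
from collections import Counter

# Hypothetical toy data: each photo is (features, label).
# "bright" stands in for a sunny-day photo, "long_barrel" for a genuine
# property of the tank. In TRAIN every friendly tank was shot in sunshine.
TRAIN = [
    ({"bright": 1, "long_barrel": 1}, "friend"),
    ({"bright": 1, "long_barrel": 0}, "friend"),
    ({"bright": 1, "long_barrel": 1}, "friend"),
    ({"bright": 0, "long_barrel": 0}, "enemy"),
    ({"bright": 0, "long_barrel": 1}, "enemy"),
    ({"bright": 0, "long_barrel": 0}, "enemy"),
]
# In TEST the weather no longer lines up with the label.
TEST = [
    ({"bright": 0, "long_barrel": 1}, "friend"),
    ({"bright": 1, "long_barrel": 1}, "friend"),
    ({"bright": 1, "long_barrel": 0}, "enemy"),
    ({"bright": 0, "long_barrel": 0}, "enemy"),
]


def predict(model, features):
    """A model is (feature_name, {feature_value: predicted_label})."""
    feature, rule = model
    return rule[features[feature]]


def accuracy(model, data):
    hits = sum(1 for features, label in data if predict(model, features) == label)
    return hits / len(data)


def fit_stump(data):
    """Pick the single feature whose majority-vote rule best fits the data.

    Assumes binary features and that both values occur in the data.
    """
    best_model, best_acc = None, -1.0
    for feature in sorted(data[0][0]):
        rule = {}
        for value in (0, 1):
            labels = [label for features, label in data if features[feature] == value]
            rule[value] = Counter(labels).most_common(1)[0][0]
        model = (feature, rule)
        acc = accuracy(model, data)
        if acc > best_acc:
            best_model, best_acc = model, acc
    return best_model


model = fit_stump(TRAIN)
print("learned feature:", model[0])
print("training accuracy:", accuracy(model, TRAIN))
print("test accuracy:", accuracy(model, TEST))
```

On this toy data the learner scores perfectly in training and no better than a coin toss on the test set, which is the whole point of holding a test set back.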

More serious, at least commercially, was a Google facial recognition fail that identified black people as gorillas. The problem arose, according to Anu Tewary, chief data officer for Mint – a web-based personal finance site bought by Intuit – and founder of the Technovation challenge for young women, because of underrepresentation of African American faces in the training set.

Similar issues, she suggests, could arise with voice recognition, where underrepresentation of women in a particular training set could result in software less able to interact with women – a vicious circle, reinforcing historic discrimination.
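Underrepresentation of this kind can at least be counted before training starts. The sketch below is a minimal check, assuming each training example carries a group label. The function names and the `min_share` cut-off are invented here, and the right benchmark in practice is the population the system will actually serve, not an arbitrary fraction.

```python
from collections import Counter


def group_shares(groups):
    """Return each group's share of the training set."""
    counts = Counter(groups)
    total = len(groups)
    return {group: count / total for group, count in counts.items()}


def underrepresented(groups, min_share):
    """List groups whose share falls below min_share (an assumed cut-off)."""
    shares = group_shares(groups)
    return sorted(group for group, share in shares.items() if share < min_share)


# Hypothetical voice-recognition training set: one female speaker in ten.
speakers = ["f"] + ["m"] * 9
print(group_shares(speakers))
print(underrepresented(speakers, min_share=0.3))
```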

Insulting – or ignoring – a significant proportion of your customers is bad enough. But difficulties only multiply when it comes to using ML to determine who gets specific offers or services. The law is clear: even where a protected characteristic identifies an actual difference when it comes to risk or propensity, to use it as such is discrimination. Hence the equalisation of rates for insurance as well as the age at which men and women may collect their pensions.

Clearly, therefore, you should avoid inputting protected characteristics into your ML process. But how do you prevent your ML homing in on some secondary characteristic which happens to be closely correlated with a protected characteristic, thereby triggering a suit for indirect discrimination?
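One first-pass answer is to screen candidate inputs for correlation with the protected characteristic before they go anywhere near the model. The sketch below assumes you hold the protected characteristic for a sample of customers purely for auditing. The column names, the data and the 0.7 threshold are all made up for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either column is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def flag_proxies(features, protected, threshold=0.7):
    """Return names of features whose |correlation| with `protected` exceeds threshold."""
    return sorted(
        name for name, column in features.items()
        if abs(pearson(column, protected)) > threshold
    )


# Hypothetical audit sample: 1 marks membership of the protected group.
protected = [1, 1, 1, 1, 0, 0, 0, 0]
features = {
    "buys_product_x": [1, 1, 1, 0, 0, 0, 0, 0],
    "years_as_customer": [2, 5, 3, 6, 4, 3, 5, 4],
}
print(flag_proxies(features, protected))
```

This catches only single inputs with a roughly linear relationship to the protected characteristic. A combination of individually innocent inputs can still act as a proxy, so a clean result here is not a clean bill of health.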

It may be an unintended consequence, but as the House of Lords ruled in 1990, discrimination is discrimination, no matter the intention, motive, or purpose behind a discriminatory act. The temptation – the tendency – for IT professionals to separate human and machine responsibility may be strong, but the law is unlikely to approve.

Or is it? As Andrew Joint, lawyer and managing partner at technology law specialists Kemp Little, notes: "Whilst the law is not yet demanding that IT developers are accountable for all levels of their development, it is clear that legislators are looking to find ways that make sure IT developers are coding ethical responsibilities into their developments."

That means there is a growing need for IT to check what their business ML is doing. Especially as the law seems to be demanding higher standards of ML than it asks of human-originated systems. One approach involves monitoring outputs and putting in place robust systems for detecting bias. A somewhat technical treatment of this issue is to be found in a paper (PDF) published last October.

This uses the Receiver Operating Characteristic (ROC) curve – a plot of true positive rate (TPR) versus false positive rate (FPR) at various threshold settings – to explore whether a particular distribution is biased according to any given (protected) characteristic. Direct marketing has long used this technique in the form of the Gains Chart.
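The arithmetic behind a per-group ROC comparison is simple enough to sketch. The code below is an illustration of the general idea, not the paper's procedure: the records, the groups "A" and "B" and the 0.5 cut-off are invented, and each group is assumed to contain at least one actual positive and one actual negative.

```python
def rates(pairs, threshold):
    """Return (TPR, FPR) when every score >= threshold counts as a 'yes'.

    `pairs` is a list of (score, outcome) with outcome 1 for an actual
    positive (say, a loan that was repaid) and 0 otherwise.
    """
    positives = sum(1 for _, outcome in pairs if outcome == 1)
    negatives = len(pairs) - positives
    true_pos = sum(1 for score, outcome in pairs if outcome == 1 and score >= threshold)
    false_pos = sum(1 for score, outcome in pairs if outcome == 0 and score >= threshold)
    return true_pos / positives, false_pos / negatives


def roc_points(pairs, thresholds):
    """(FPR, TPR) points of the ROC curve, one per threshold."""
    points = []
    for threshold in thresholds:
        tpr, fpr = rates(pairs, threshold)
        points.append((fpr, tpr))
    return points


def rates_by_group(records, threshold):
    """Split (group, score, outcome) records by group and compute rates for each."""
    by_group = {}
    for group, score, outcome in records:
        by_group.setdefault(group, []).append((score, outcome))
    return {group: rates(pairs, threshold) for group, pairs in by_group.items()}


# Hypothetical scored applicants: (group, model score, actual outcome).
RECORDS = [
    ("A", 0.9, 1), ("A", 0.8, 1), ("A", 0.6, 1),
    ("A", 0.7, 0), ("A", 0.3, 0), ("A", 0.2, 0),
    ("B", 0.7, 1), ("B", 0.4, 1), ("B", 0.3, 1),
    ("B", 0.6, 0), ("B", 0.2, 0), ("B", 0.1, 0),
]

for group, (tpr, fpr) in sorted(rates_by_group(RECORDS, 0.5).items()):
    print(group, "TPR", round(tpr, 2), "FPR", round(fpr, 2))
```

In this made-up sample a single cut-off of 0.5 gives both groups the same false positive rate, yet it approves every deserving applicant in group A and only one in three in group B. A gap like that is what this kind of check is meant to surface.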

A demonstration and slightly less technical treatment of this method is also available.

Central to this approach is the fact that it is "oblivious". That is, it considers inputs and outcomes, without digging into the underlying algorithm. That may be a pragmatic approach to avoiding issues of discrimination, but may not satisfy the EU's General Data Protection Regulation, coming into force next year. Under GDPR, where algorithms are involved in decision-making, "fair and transparent processing" requires the provision of "meaningful information about the logic involved".

That may sound straightforward in theory, but it is much harder in practice. While some algorithms – for instance, a simple scoring system built using discriminant analysis – can be dissected in this way, this becomes increasingly difficult, verging on impossible, with other methods such as neural nets. And that's even before your ML invents its own language!
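To see why a simple scoring system is the easy case, consider a linear scorecard, where every input's contribution to the final score can be read straight off. The weights, the base score and the applicant below are invented for illustration. Whether a breakdown like this would count as "meaningful information" under the regulation is a legal question this sketch does not settle.

```python
# Hypothetical scorecard: points per unit of each input, plus a base score.
BASE_SCORE = 50
WEIGHTS = {
    "years_at_address": 3,
    "missed_payments": -10,
    "has_current_account": 15,
}


def contributions(applicant):
    """Points each input adds to (or removes from) the applicant's score."""
    return {name: WEIGHTS[name] * applicant[name] for name in WEIGHTS}


def score(applicant):
    return BASE_SCORE + sum(contributions(applicant).values())


def explain(applicant):
    """Contributions ordered by size of effect, largest first."""
    items = contributions(applicant).items()
    return sorted(items, key=lambda item: abs(item[1]), reverse=True)


applicant = {"years_at_address": 4, "missed_payments": 2, "has_current_account": 1}
print("score:", score(applicant))
for name, points in explain(applicant):
    print(f"{name}: {points:+d}")
```

A neural net offers no such ledger: its output depends on many weights interacting across layers, so there is no single line per input to print.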

Equality, data protection, human rights: ML triggers legal compliance needs in all these areas. And that is even before we get into ethical territory.

Unknown unknowns

How should we respond to the issue of "unknown unknowns", for instance, where ML uncovers something that we didn't know was even out there to be known? In 2012, US retailer Target identified the fact that one of its customers – a teen girl – was pregnant before her father was aware, based on changes to her purchasing habits.

No laws were broken. Yet it is not hard to imagine other circumstances – for instance, where changing habits indicate critical illness – in which society is likely to be uneasy at the power of ML to discover things.

Final word, therefore, to Catherine Flick, senior lecturer in Computing and Social Responsibility at De Montfort University: "Discrimination is not an easy issue to deal with, but machine learning and AI developers should not use this as an excuse to avoid addressing it.

"Developers must always take responsibility for the algorithms they create and seek to ensure that they serve the public good, through audit trails of decision making, thorough testing and training, and inclusion of diverse stakeholders during the development process to ensure that the goals, aims, and potential consequences of the algorithms are thought through and socially responsible." ®

Biting the hand that feeds IT © 1998–2021