Your AI pet project is only as smart as its garbage training set

No one said it'd be easy

AI isn't immune to one of computing's most basic rules – garbage in, garbage out. Train a neural network on flawed data and you'll have one that makes lots of mistakes.

Most neural networks learn to distinguish between things by sampling different groups. This is supervised learning, and it only works if someone labels the data first so that the network knows what it's looking at.

But how can you find the "right" data to train your AI, and confirm its quality? Well, what you feed your machine might surprise you. Not only are there a variety of off-the-shelf choices but we've now entered an era where real-world data can be replaced with machine-created data for AI and ML.

And if that sounds a tad too artificial, don't worry: there is an expanding ecosystem of humans in the machine-learning feedback loop who keep the machines on track.

Let's start at the beginning, though. If you'd like some off-the-shelf data, you're in luck: there are plenty of labelled data sets out there suited to a range of scenarios.

One of the better-known data sets is ImageNet, both because of its broad applicability to image processing – a common task in deep learning – and its associated annual challenge. It is a collection of millions of image links, each tagged with meaningful concepts called "synsets". You can find satellite pictures courtesy of SpaceNet, which released a labelled set of them on AWS.

If you are training your neural net to recognise handwriting, there's the Modified National Institute of Standards and Technology (MNIST) database of handwritten digits. It comprises a set of 60,000 handwritten digits used for training image processing systems. There are many more data sets covering faces, text, speech and music.

Getting computers to make their own data

If you don't want something off the shelf, you can make your own data – so-called synthetic data. This is where computers fabricate data that is so realistic that it looks like it originated in the real world.

This happens particularly in high-end video games, where images have become realistic enough to "train" with.

In the world of self-driving vehicles, engineers can use data sets such as CityScapes, which labels classes of pixels at varying levels of granularity in 25,000 images across 50 different cities. There's also the CamVid database, which provides 10 minutes of labelled video footage from a moving vehicle.

Is this approach safe – or even accurate? Isn't it the AI equivalent of eating Soylent Green? Not according to researchers from Intel Labs and TU Darmstadt, who collected 25,000 images by simply driving around in a photorealistic open-world computer game. They manually labelled objects in one frame and had their software propagate the labels across many video frames (after all, pavements and pedestrians look very similar from one frame to the next). The techies claim to have experienced a greater rate of accuracy using the game data to train their AI than relying entirely on the real-world stuff from CamVid.

Into the feedback loop

Computers train their AI in loops, repeatedly formulating hypotheses based on input data and then testing to find the result.

Synthesis might work in open-world environments, and for faces and modelled objects, but some prefer to gather that data from real people. Among them is the firm Mighty.AI, which gathers and organises data sets for clients.

Mighty.AI told us last year: "Computers are very fast, accurate and stupid, and humans are brilliant, slow, and inaccurate. So how do you get the best of both, given that they're weak where the other is strong?"


The firm's software sends users tasks such as image segmentation, in which users draw around object borders and label them. It is using this model to label data for self-driving car clients. This input feeds human judgements back into the training set to help the computer refine its own model. "It's best to get results out of this learning loop if you put humans in the middle, because then we can insert our judgements," Mighty.AI told us.

Rival Alegion uses a similar approach to build an algorithm to detect damage on car body panels. It's a system that might find use where cars are being transported.

Crowdsourced workers outline the damage on hundreds of thousands of images of car body panels, and then classify them with jobs sent to Amazon's Mechanical Turk. "What they need are examples of graded pictures, so they have a clear classification taxonomy for mild, moderate and severe damage," vice president Chip Ray says.
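When several workers grade the same picture, their answers have to be merged into one training label somehow. The article doesn't say how Alegion does it, so what follows is only a minimal sketch of one common option, a majority vote with an agreement threshold. The panel IDs, the function name `aggregate_votes` and the 0.6 threshold are all made up for illustration.

```python
from collections import Counter

def aggregate_votes(votes, min_agreement=0.6):
    """Return the most common label, or None when too few workers agree."""
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) >= min_agreement else None

# Hypothetical grades from three workers per picture.
panel_votes = {
    "panel_001": ["mild", "mild", "moderate"],
    "panel_002": ["severe", "moderate", "mild"],
}

consensus = {panel: aggregate_votes(votes) for panel, votes in panel_votes.items()}
print(consensus)
```

Pictures that come back as None have no clear consensus and could be sent out for another round of grading rather than being allowed into the training set.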

For some applications, digesting and distilling the training data is far more complex than drawing a box around something.

Dennis Mortensen, CEO and founder of AI virtual assistant company x.ai, spent years training his neural network how to read emails and automatically respond to book meetings with people. It required building a complex view of that problem space, which he calls his "universe".

This universe has three entities and a pool of intents. The entities are date and time, location, and people. The intents are more complex, covering the need to reschedule, tell someone you're running late, decide who is optional and mandatory for the meeting, and so on.

To get these intents, Mortensen had to analyse the language in thousands of emails, producing training data that understood not only misspellings but also linguistic idiosyncrasies. When someone in London mails to ask for a 4pm call but you're in Paris, which time zone are they talking about? If they mail at 11:30 on Tuesday night and ask for a call "tomorrow", do they mean Wednesday or Thursday?
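To see how quickly those ambiguities multiply, here is a small sketch that simply lists every plausible reading. It is not Mortensen's method: the date is hypothetical, the two questions above are combined into one message for illustration, and the fixed winter offsets for London and Paris are an assumption that real code would replace with a proper time zone database.

```python
from datetime import datetime, timedelta, timezone

# Fixed winter offsets, assumed for illustration only (no daylight saving).
LONDON = timezone(timedelta(hours=0), "London")
PARIS = timezone(timedelta(hours=1), "Paris")

def readings_of_tomorrow(sent_at):
    """'Tomorrow' taken literally, or as the day after the sender has slept."""
    return [sent_at.date() + timedelta(days=offset) for offset in (1, 2)]

def readings_of_hour(day, hour, zones):
    """The same wall-clock hour interpreted in each candidate time zone."""
    return [datetime(day.year, day.month, day.day, hour, tzinfo=zone) for zone in zones]

# A hypothetical mail sent at 11:30pm on a Tuesday, asking for "4pm tomorrow".
sent_at = datetime(2018, 1, 16, 23, 30, tzinfo=LONDON)
days = readings_of_tomorrow(sent_at)
slots = [slot for day in days for slot in readings_of_hour(day, 16, [LONDON, PARIS])]

for slot in slots:
    print(slot.astimezone(timezone.utc).isoformat())
```

One short sentence yields four candidate slots, and each of them is a labelled training example that someone has to get right.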

Misunderstand something early in the conversation, and there will be what Mortensen calls a "cascading set of negative consequences".

To prevent this, he went deep, drawing the data from a three-year beta covering tens of thousands of people. "We looked at people scheduling meetings, thousands of times," he says. "There are thousands of edge cases." Each time the team found new parameters, it had to retrain its AI model.

Imbalanced AI

So, you have your data and feedback loops. Next comes massaging your data sets, and how you do this will depend in part on what you're building.

In many cases, you'll want to filter out noise data, such as duplicates and outliers that may be errors. That's natural data cleansing.
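As a rough illustration of that kind of cleansing, the sketch below drops exact duplicate records and then filters numeric outliers using the median absolute deviation. The record layout, the function names and the 3.5 cut-off are assumptions chosen for the example, not something the article's sources describe.

```python
from statistics import median

def drop_duplicates(records):
    """Keep the first occurrence of each record, preserving order."""
    seen = set()
    unique = []
    for record in records:
        if record not in seen:
            seen.add(record)
            unique.append(record)
    return unique

def drop_outliers(records, key, threshold=3.5):
    """Drop records whose modified z-score (median-based) exceeds threshold."""
    values = [key(record) for record in records]
    med = median(values)
    mad = median(abs(value - med) for value in values)
    if mad == 0:
        return list(records)  # no spread to measure against, so keep everything
    return [
        record
        for record, value in zip(records, values)
        if 0.6745 * abs(value - med) / mad <= threshold
    ]

# Hypothetical (sample id, measurement) records.
rows = [("a", 10.1), ("b", 9.8), ("c", 10.3), ("a", 10.1),
        ("d", 250.0), ("e", 9.9), ("f", 10.0)]

cleaned = drop_outliers(drop_duplicates(rows), key=lambda row: row[1])
print(cleaned)
```

Whether a point like 250.0 is an error or a rare but genuine case is a judgement call, so a filter like this is best treated as a way of flagging records for review.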


On the other hand, you may actually want to introduce noise.

The Silicon Valley Not HotDog app was designed to recognise whether something was a hot dog or not – duh. The app's creators produced copies of their hotdog pictures but then distorted, rotated and flipped them. This had two effects. First, it made hotdogs look more like they might when taken from a tilted phone. Second, it helped to reduce an imbalance in their data set.
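The idea of multiplying a few pictures into many can be sketched in a few lines. This is not the app's actual pipeline: it treats an image as a plain grid of pixel values, sticks to mirror flips and 90-degree turns, and leaves out the distortion step entirely. The function names are invented for the example.

```python
def flip_horizontal(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]

def rotate_90_clockwise(image):
    """Turn the image a quarter turn clockwise."""
    return [list(column) for column in zip(*image[::-1])]

def augment(image):
    """Return the original plus one mirrored and three rotated copies."""
    variants = [image, flip_horizontal(image)]
    rotated = image
    for _ in range(3):
        rotated = rotate_90_clockwise(rotated)
        variants.append(rotated)
    return variants

# A toy 2x2 "image"; a real one would be a much larger grid of pixels.
tiny = [[1, 2],
        [3, 4]]

for variant in augment(tiny):
    print(variant)
```

Applied only to the under-represented class, copies like these are one way to pad out the smaller group.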

An imbalanced set with far more pictures in one group than any other can lead a neural network to constantly label new pictures with that popular classification. The network effectively learns that if most pictures in its training set were in one group, then guessing that group for new pictures will be right most of the time.

You can solve this by creating a more balanced set of classifications, and also by changing your neural network model, weighting its results to give more attention to under-represented classes.
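A quick back-of-the-envelope sketch shows both the problem and one possible weighting. The 90/10 split is made up, and the inverse-frequency formula (total samples divided by the number of classes times the class count) is a common heuristic chosen for this example, not necessarily what any firm mentioned here uses.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of a lazy model that always guesses the most common class."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)

def inverse_frequency_weights(labels):
    """Weight each class by total / (number of classes * class count)."""
    counts = Counter(labels)
    total = len(labels)
    return {label: total / (len(counts) * count) for label, count in counts.items()}

# A hypothetical, badly imbalanced training set.
labels = ["not_hotdog"] * 90 + ["hotdog"] * 10

baseline = majority_baseline_accuracy(labels)
weights = inverse_frequency_weights(labels)
print(baseline)
print(weights)
```

Always answering "not hotdog" scores 90 per cent on this set without learning anything, which is why the rarer class gets the heavier weight when errors are counted during training.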

Imbalances in your data may lead to bias. One of the first tasks you see when signing up for Mighty.AI is to rate puppies for cuteness. The firm told us it wanted to see what the aggregated results would look like.

"As it turns out, there's a strong gender bias in how women and men rate the cuteness of puppies," Mighty.AI told us. "It's cute, but demonstrative." Left unchecked, such bias can lead to errors that have unexpected real-world impacts.

Avoiding such dangers and producing a training set that will give you an accurate neural model is harder than it looks. Finding, scrubbing, and interpreting training data for deep learning algorithms is an important part of the AI development process that often takes up most of the time before you even get to have fun with neural network code.

"Most people think that TensorFlow is where the party is," concludes Mortensen. "But that's where the party ends." ®


Biting the hand that feeds IT © 1998–2022