Fake it until you make it: Can synthetic data help train your AI model?
Yes and no. It's complicated.
The saying "data is the new oil" was reportedly coined by British mathematician and marketing whiz Clive Humby in 2006. With the rise of deep learning, Humby's remark rings truer now than ever.
Data is the fuel powering modern AI models; without enough of it the performance of these systems will sputter and fail. And like oil, the resource is scarce and controlled by big businesses. What do you do if you're a small computer vision company? You can turn to fake data to train your models, and if you're lucky it might just work.
The market for synthetic data generation grew to over $110 million in 2021 and is expected to increase to $1.15 billion by the end of 2027, according to a report published by research firm Cognilytica.
Numerous startups have built tools to spin up synthetic images to help companies train their machine learning algorithms.
There are many benefits to using computer-generated data, Gil Elbaz, co-founder and CTO of Datagen, explained to The Register.
The startup, founded in 2018 and based in Israel, has built a software platform that allows customers to easily create mock images at the click of a button. Synthetic data provides a way to scale up datasets and automatically annotate each picture with the necessary metadata without much human labor.
Issues of privacy and bias can be avoided too. "Privacy for human faces is very, very hard, and it's not ideal to even hold [that kind of data] in your servers," Elbaz says.
"With our data, there's no [personally identifiable information]. This is not a real person. This is completely synthetic, so there's no privacy issues. And bias-wise we can generate whatever distribution of ethnicities, ages, genders you want in your data, so we are not biased in any way," he says as he shows us a three-dimensional fake face.
Datagen works with companies to train computer vision models for different tasks. Simulated data is used by the automotive industry to develop AI software that automatically detects driver behavior, such as when they're distracted or falling asleep at the wheel.
Fake data has also been used by surveillance camera companies to flag whenever packages have been delivered outside people's homes. AI applications in augmented and virtual reality also benefit from ingesting copious amounts of synthetic data.
Rendering fake data is a complicated process. Datagen uses multiple methods to create computer-made images, from physics-based ray tracing algorithms to generative adversarial networks (GANs). Even so, making the data is the easy part. Getting a model trained on fake images to work in the real world is the challenge. Ideally, companies should have some real data to hand rather than relying on fake data alone.
"What we see working really well is to train the network on a large amount of synthetic data, and then fine tune it on the small amount of real data. This last step is optional.
"It's really not a must, but it does improve the performance to do a small fine tune on the real world. What this means in practice is that you need much less real world data. So you don't need as much, you can use like 1/20th or 1/50th of the amount of real data and use mostly synthetic data for your training," Elbaz says.
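The recipe Elbaz describes, training on plenty of synthetic data and then fine-tuning on a small real set, can be illustrated with a toy sketch. This is not Datagen's pipeline: a two-parameter linear model stands in for a neural network, and the datasets, the constant offset that plays the role of the sim-to-real gap, and every function name below are assumptions made up for illustration.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (input, target)


def mse(w: float, b: float, data: List[Point]) -> float:
    """Mean squared error of the line y = w*x + b on a dataset."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def gradient_descent(w: float, b: float, data: List[Point],
                     lr: float = 0.5, steps: int = 500) -> Tuple[float, float]:
    """Full-batch gradient descent on the MSE, starting from (w, b)."""
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w, b = w - lr * grad_w, b - lr * grad_b
    return w, b


# Assumption: the simulator gets the slope right but misses a constant offset,
# which stands in for the gap between synthetic and real data.
synthetic = [(i / 199, 2.0 * (i / 199)) for i in range(200)]
# Only 10 real samples for fine-tuning, 1/20th of the synthetic set.
real_small = [(j / 9, 2.0 * (j / 9) + 0.5) for j in range(10)]
# Separate real samples used only for evaluation.
real_holdout = [(0.05 + k / 10, 2.0 * (0.05 + k / 10) + 0.5) for k in range(10)]


def pretrain_then_finetune() -> Tuple[float, float]:
    """Return the real-data error before and after the fine-tuning step."""
    w, b = gradient_descent(0.0, 0.0, synthetic)          # train on synthetic
    before = mse(w, b, real_holdout)
    w, b = gradient_descent(w, b, real_small, steps=50)   # short real fine-tune
    after = mse(w, b, real_holdout)
    return before, after


if __name__ == "__main__":
    before, after = pretrain_then_finetune()
    print(f"real-data MSE before fine-tune: {before:.4f}, after: {after:.6f}")
```

In this toy setup the synthetic-only model is already usable, and the brief pass over the small real set closes most of the remaining gap. Real vision models will not behave this cleanly.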
Models trained on fake images have to be robust enough to work in real-life settings. Synthetic data has been successful in training self-driving cars to recognise things like cars, road signs, and pedestrians in their environment, and to simulate driving the same roads in different weather conditions. It has proven useful in robotics too in limited scenarios, like getting mechanical grippers to rotate or pick up objects.
Simulation to reality
Developers relying on synthetic data have to test and tweak their models rigorously to make sure they'll work.
"If you test your models in a good way, the idea is that your test should validate that the performance will be of high quality or have a quality that you expect. If your testing is not as good or if you don't have enough test data, then you can find a gap in the performance," says Elbaz.
"We can do testing to see where the neural network is weak, pretty much by trying to ask it, for example, what do you think about this guy? And if I make him darker, or if I make him further away, or if I change him to look more angry? What do you think about that? And I can ask the network all of these different things and see where it's weaker, and really map out the weaknesses of the network itself," Elbaz says.
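The kind of probing Elbaz describes, changing one attribute at a time and watching where the model falters, might look like the sketch below. Everything here is hypothetical: `Scene`, `toy_detector`, and the 0.6 confidence threshold are invented stand-ins, and a real test harness would render each variant and run it through the actual network.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Scene:
    """Hypothetical controllable attributes of one synthetic image."""
    brightness: float   # 0.0 = dark, 1.0 = fully lit
    distance_m: float   # subject's distance from the camera
    expression: str     # e.g. "neutral", "angry"


def toy_detector(scene: Scene) -> float:
    """Made-up stand-in for a trained network; returns a confidence score."""
    confidence = 0.95
    if scene.brightness < 0.3:
        confidence -= 0.5   # pretend the model struggles in the dark
    if scene.distance_m > 8.0:
        confidence -= 0.4   # ...and with distant subjects
    return confidence


def map_weaknesses(model: Callable[[Scene], float], base: Scene,
                   sweeps: Dict[str, list],
                   threshold: float = 0.6) -> List[Tuple[str, object, float]]:
    """Vary one attribute at a time and collect settings that score too low."""
    weak = []
    for attr, values in sweeps.items():
        for value in values:
            variant = replace(base, **{attr: value})  # change a single attribute
            score = model(variant)
            if score < threshold:
                weak.append((attr, value, round(score, 2)))
    return weak


if __name__ == "__main__":
    base = Scene(brightness=0.8, distance_m=2.0, expression="neutral")
    sweeps = {
        "brightness": [0.1, 0.5, 0.9],
        "distance_m": [1.0, 5.0, 10.0],
        "expression": ["neutral", "angry"],
    }
    for attr, value, score in map_weaknesses(toy_detector, base, sweeps):
        print(f"weak spot: {attr}={value} (confidence {score})")
```

Sweeping one attribute at a time keeps the report easy to read. It will miss failures that only appear when several attributes change together, which a fuller grid or random sampling would catch at higher cost.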
But in some cases the real world is too difficult to model, and synthesizing data samples won't be worthwhile. "There is a very high effort that's required in order to build [for niche things]. Say, if you're trying to understand where a dog's nose is in an image, we don't do synthetic data for dog noses. Trying to pick something like that out on your own is extremely hard."
These gaps open up opportunities for startups that use synthetic data in a different way. Synthetaic, based in Wisconsin and founded in 2019 by Corey Jaskolski, doesn't sell computer-generated images to customers. Instead, it uses generative models like GANs or transformers to help image detection algorithms automatically label objects.
"We're still building AI that is capable of generating synthetic data. However, the novel piece that we're doing is we're not using it to generate synthetic data to then use to train an AI. We're using this generative capability to create, effectively, a way to look at real world data that allows us to do things like this auto-labeling," Jaskolski tells The Register.
"What's going on behind the scenes here is using a transformer technology that is usually used to generate imagery, but because it's so powerful and good at generating imagery, it's actually also so powerful and good at describing real world imagery in a way that lets you click on a single image and [detect others like it]."
Synthetaic showed El Reg a demo in which its Rapid Automatic Image Categorization (RAIC) technology zeroed in on specific frames in a video feed. Jaskolski fed the system a photograph of a cheetah, and RAIC was able to find instances where a cheetah popped up in the video.
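Synthetaic did not detail how RAIC works internally, so the following is only a generic sketch of one common way to do "click on one image, find others like it": represent every frame as an embedding vector and rank frames by cosine similarity to the query. The three-number embeddings, the frame names, and the 0.9 threshold are all invented for illustration. In practice the vectors would come from a trained model.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def find_similar(query: Sequence[float],
                 frames: Dict[str, Sequence[float]],
                 threshold: float = 0.9) -> List[str]:
    """Return frame ids whose embedding is close to the query, best first."""
    scored = [(cosine_similarity(query, emb), frame_id)
              for frame_id, emb in frames.items()]
    matches = [(score, frame_id) for score, frame_id in scored
               if score >= threshold]
    matches.sort(reverse=True)
    return [frame_id for _, frame_id in matches]


if __name__ == "__main__":
    # Hypothetical embeddings: the query is the single example photo.
    cheetah_photo = [1.0, 0.0, 0.0]
    video_frames = {
        "frame_001": [0.9, 0.1, 0.0],   # looks a lot like the query
        "frame_002": [0.0, 1.0, 0.0],   # unrelated content
        "frame_003": [1.0, 0.0, 0.05],  # near-identical to the query
    }
    print(find_similar(cheetah_photo, video_frames))
```

The matched frames could then be labeled in bulk, which is the auto-labeling idea Jaskolski describes, though how the embeddings are produced is the hard part and is not shown here.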
Real is always better
Real data is still more important for Synthetaic, despite the company's somewhat confusing name. "There are lots of examples in defense and in other industry applications where just adding 3D data or synthetic data doesn't fix the problem. I think because every situation is different, and AI always has trouble transferring from domains. It might not transfer well to the real world."
Generating synthetic data is a great way to create a larger and more diverse dataset, but it's only effective for training machine learning algorithms that perform jobs that are neither too simple nor too complex. Easy computer vision tasks don't always require fake data. Difficult tasks require a high level of detail in simulated images, and expert knowledge is needed to assess their quality.
"I think that medical data is a really good example of a use case that we don't want to work on," says Elbaz.
"In order to model medical diseases, you need real doctors to help you.
"There's a lot of specialized knowledge that you would need in order to create this medical synthetic data. Even though medical data is extremely valuable, it's something that I think requires a separate company. It's just too hard. Anything that requires very, very specialized knowledge is hard," he concluded. ®