AI + ML

Infosec brainiacs release public dataset to classify new malware using AI

Data is the secret sauce to advancing AI research

Mon 16 Apr 2018 // 23:21 UTC

Researchers at Endgame, a cyber-security biz based in Virginia, have published EMBER, which they believe is the first large open-source dataset for machine-learning malware detection.

EMBER contains metadata describing 1.1 million Windows portable executable files: 900,000 training samples, split evenly into malicious, benign, and unlabelled categories (300,000 each), and 200,000 test samples labelled as either malicious or benign.

“We’re trying to push the dark arts of infosec research into an open light. EMBER will make AI research more transparent and reproducible,” Hyrum Anderson, co-author of the study to be presented at the RSA conference this week in San Francisco, told The Register.

Progress in AI is driven by data. Researchers compete with one another by building models and training them on benchmark datasets to reach ever-increasing accuracy.

Computer vision is awash with datasets containing millions of annotated pictures for image recognition tasks, and natural language processing has various text-based datasets to test machine reading and comprehension skills. That abundance of data has done much to advance AI image processing.

Although there is a strong interest in using AI for information security - look at DARPA’s Cyber Grand Challenge where academics developed software capable of hunting for security bugs autonomously - it’s an area that doesn’t really have any public datasets.

Security and legality

It’s difficult to share security files due to legal restrictions on the transmission of malware and the private nature of security research. For that reason, EMBER doesn't contain complete Windows files. Instead, each file is described by several pieces of metadata, such as its format and size.
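To make the idea of describing a file by metadata rather than shipping its bytes concrete, here is a minimal Python sketch. The function name and the field names are hypothetical and are not EMBER's actual schema. The only format fact it relies on is that Windows PE files begin with the two-byte "MZ" magic.

```python
import math
from collections import Counter


def describe_file(data: bytes) -> dict:
    """Summarise raw bytes as metadata, without keeping the bytes themselves.

    Field names are illustrative assumptions, not EMBER's real feature names.
    """
    size = len(data)
    counts = Counter(data)

    # Shannon entropy of the byte distribution, in bits per byte (0.0 to 8.0).
    entropy = 0.0
    for n in counts.values():
        p = n / size
        entropy -= p * math.log2(p)

    return {
        "size_bytes": size,
        # Windows PE files start with the DOS header magic "MZ".
        "has_mz_magic": data[:2] == b"MZ",
        "byte_entropy": entropy,
    }
```

A record like this can be shared freely because the original executable cannot be rebuilt from it, which is the property the paragraph above describes.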

A machine learning model trained on EMBER has to examine all the different features of a file to determine whether it's malicious or benign.
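As a rough illustration of a model weighing several features together, the sketch below trains a perceptron, one of the simplest linear classifiers. This is an assumption made for brevity: the article does not say which models Endgame trained. The two binary features and the four samples are invented toy values, not drawn from EMBER.

```python
# Toy data, invented for illustration. Each sample is
# (high_entropy, few_imports), two hypothetical binary indicators.
SAMPLES = [(0, 0), (0, 1), (1, 0), (1, 1)]
# 1 = malicious, 0 = benign. Only the combination of both flags is malicious.
LABELS = [0, 0, 0, 1]


def score(weights, bias, x):
    """Weighted sum of the features plus a bias term."""
    return bias + sum(w * v for w, v in zip(weights, x))


def predict(weights, bias, x):
    """Classify as malicious (1) when the score is positive."""
    return 1 if score(weights, bias, x) > 0 else 0


def train_perceptron(samples, labels, epochs=20, lr=1):
    """Classic perceptron rule: nudge the weights after each mistake."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            error = y - predict(weights, bias, x)
            if error:
                weights = [w + lr * error * v for w, v in zip(weights, x)]
                bias += lr * error
    return weights, bias


weights, bias = train_perceptron(SAMPLES, LABELS)
```

After training, neither flag alone pushes the score above zero, but the two together do. Real detectors use far more features and richer models, but the principle of combining evidence is the same.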

“There is no evil bit. There is no one thing that says it’s malicious. The way that antivirus typically works is that they will write a signature that goes after a certain type of malware, by identifying byte sequences, or a single set of properties,” Anderson explained.

"Machine learning is different. It’s a top-down approach. Given this dataset the model learns an intricate combination of features that makes a file malicious so that it can learn new forms of malware instead of ones it’s trained on."
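For contrast, the signature approach Anderson describes can be sketched as a plain byte-sequence search. The signature names and byte patterns below are invented for illustration and do not correspond to any real malware or antivirus product.

```python
# Hypothetical signature database: names and patterns are made up.
SIGNATURES = {
    "Example.Dropper.A": bytes.fromhex("deadbeef"),
    "Example.Worm.B": b"\x90\x90\xeb\xfe",
}


def match_signatures(data: bytes, signatures=SIGNATURES) -> list:
    """Return the names of all signatures whose byte pattern occurs in data."""
    return sorted(name for name, pattern in signatures.items() if pattern in data)


known_sample = b"xx\xde\xad\xbe\xefyy"
# Same sample with a single byte of the pattern changed.
variant = b"xx\xde\xad\xbe\xeeyy"

hits_known = match_signatures(known_sample)
hits_variant = match_signatures(variant)
```

Here the known sample matches one signature while the one-byte variant matches none, which is the brittleness that a learned combination of features is meant to reduce.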

He did warn, however, that EMBER was strictly for research purposes. The dataset isn’t rich enough to train models that are good enough to deploy. It’s just meant to be a starting place for hobbyists and researchers to build upon.

“You won’t get a lot of researchers or hobbyists working on a specific problem like malware detection unless there is data available.”

You can play around with EMBER here. ®
