Researchers at Endgame, a cyber-security biz based in Virginia, have published what they believe is the first large open-source dataset for machine learning malware detection, known as EMBER.
EMBER contains metadata describing 1.1 million Windows portable executable files: 900,000 training samples evenly split into malicious, benign, and unlabeled categories, and 200,000 test samples labelled as either malicious or benign.
“We’re trying to push the dark arts of infosec research into an open light. EMBER will make AI research more transparent and reproducible,” Hyrum Anderson, co-author of the study to be presented at the RSA conference this week in San Francisco, told The Register.
Progress in AI is driven by data. Researchers compete with one another by building models and training them on benchmark datasets to reach ever-increasing accuracy.
Computer vision is flooded with datasets containing millions of annotated pictures for image recognition tasks, and natural language processing has various text-based datasets to test machine reading and comprehension skills. That abundance of data has done a lot to advance AI image processing.
Although there is a strong interest in using AI for information security - look at DARPA’s Cyber Grand Challenge where academics developed software capable of hunting for security bugs autonomously - it’s an area that doesn’t really have any public datasets.
Security and legality
It’s difficult to share security files due to legal restrictions on the transmission of malware and the private nature of security research. For that reason, EMBER doesn’t actually contain complete Windows files. Instead, each file is described by several pieces of information, such as its format and size.
A machine learning model trained on EMBER has to examine all the different features of a file to determine if it’s malicious or benign.
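To make the idea concrete, here is a minimal sketch of a classifier that weighs several file features together rather than relying on any single one. Everything in it is an assumption made for illustration: the three features, their values, and the nearest-centroid method are not taken from EMBER or from Endgame's own models.

```python
from statistics import mean

# Hypothetical, made-up feature vectors, each scaled to the range 0..1:
# (file size, number of imported functions, byte entropy).
# These are NOT real EMBER features or values.
TRAIN = [
    ((0.40, 0.60, 0.55), "benign"),
    ((0.70, 0.80, 0.60), "benign"),
    ((0.20, 0.05, 0.95), "malicious"),
    ((0.10, 0.10, 0.90), "malicious"),
]


def fit_centroids(samples):
    """Average the feature vectors of each label into one centroid per label."""
    by_label = {}
    for features, label in samples:
        by_label.setdefault(label, []).append(features)
    return {
        label: tuple(mean(column) for column in zip(*rows))
        for label, rows in by_label.items()
    }


def predict(centroids, features):
    """Return the label whose centroid is closest across all features combined."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))


centroids = fit_centroids(TRAIN)
small_file_few_imports = predict(centroids, (0.15, 0.08, 0.92))
larger_file_many_imports = predict(centroids, (0.50, 0.70, 0.50))
```

The decision comes from the distance across every feature at once, so no single value acts as a verdict on its own. A real model trained on EMBER would use far more features and a far more capable learning algorithm than this toy.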
“There is no evil bit. There is no one thing that says it’s malicious. The way that antivirus typically works is that they will write a signature that goes after a certain type of malware, by identifying byte sequences, or a single set of properties,” Anderson explained.
“Machine learning is different. It’s a top-down approach. Given this dataset the model learns an intricate combination of features that makes a file malicious so that it can learn new forms of malware instead of ones it’s trained on.”
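The signature approach Anderson describes can be sketched in a few lines. This is only an illustration of byte-sequence matching in general: the signature names and byte patterns below are invented placeholders, not real malware signatures or any antivirus vendor's format.

```python
# Hypothetical byte signatures; placeholders only, not real malware patterns.
SIGNATURES = {
    "example-family-a": bytes.fromhex("deadbeef"),
    "example-family-b": b"EVIL_MARKER",
}


def match_signatures(data):
    """Return the names of all signatures whose byte sequence appears in data."""
    return [name for name, pattern in SIGNATURES.items() if pattern in data]


# A file containing the exact byte sequence is flagged.
known_sample = b"\x00\x01" + bytes.fromhex("deadbeef") + b"\x02"
# A variant with a single changed byte no longer matches anything.
variant_sample = b"\x00\x01" + bytes.fromhex("deadbeee") + b"\x02"

known_hits = match_signatures(known_sample)
variant_hits = match_signatures(variant_sample)
```

The variant slips through because an exact-match rule only recognises what it was written for, which is the limitation a model that learns combinations of features is meant to address.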
He did warn, however, that EMBER was strictly for research purposes. The dataset isn’t rich enough to train models that are good enough to deploy. It’s just meant to be a starting place for hobbyists and researchers to build upon.
“You won’t get a lot of researchers or hobbyists working on a specific problem like malware detection unless there is data available.”
You can play around with EMBER here. ®