Google open sources file-identifying Magika AI for malware hunters and others

Cool, but it's 2024 – needs more hype, hand wringing, and flashy staged demos to be proper ML

Google has open sourced Magika, an in-house machine-learning-powered file identifier, as part of its AI Cyber Defense Initiative, which aims to give IT network defenders and others better automated tools.

Working out the true contents of a user-submitted file is perhaps harder than it looks. It's not safe to assume the file type from, say, its extension, and relying on heuristics and human-crafted rules – such as those in the widely used libmagic – to identify the actual nature of a document from its data is, in Google's view, "time consuming and error prone."
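
To make the hand-crafted-rules approach concrete, here is a minimal sketch of signature-based sniffing in Python. It is not libmagic's rule set or Magika's method: the `SIGNATURES` table and the `sniff` function are hypothetical names invented for illustration, and the table covers only a handful of well-known leading byte sequences.

```python
from typing import Optional

# Hypothetical, deliberately tiny rule table: (byte offset, signature, label).
# Real rule collections are far larger and more nuanced than this.
SIGNATURES = [
    (0, b"\xff\xd8\xff", "jpeg"),
    (0, b"\x89PNG\r\n\x1a\n", "png"),
    (0, b"%PDF-", "pdf"),
    (0, b"PK\x03\x04", "zip"),
    (0, b"\x7fELF", "elf"),
    (0, b"MZ", "pe"),
    (0, b"#!", "script"),
]


def sniff(data: bytes) -> Optional[str]:
    """Return the label of the first matching signature, or None if no rule fits."""
    for offset, magic, label in SIGNATURES:
        if data[offset:offset + len(magic)] == magic:
            return label
    return None
```

Even this toy shows where such rules get brittle: the `PK\x03\x04` signature is shared by ZIP-based formats such as DOCX and JAR, so telling those apart takes yet more hand-written rules, and anything without a fixed leading signature falls through to `None`.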

Basically, if someone uploads a .JPG to your online service, you want to be sure it's a JPEG image and not some script masquerading as one, which could later bite you in the ass. Enter Magika, which uses a trained model to rapidly identify file types from file data, and it's an approach the Big G thinks works well enough to use in production. Magika is, we're told, used by Gmail, Google Drive, Chrome's Safe Browsing, and VirusTotal to properly identify and route data for further processing.

Your mileage may vary. Libmagic, for one, might work well enough for you. In any case, Magika is an example of Google internally using artificial intelligence to reinforce its security, and the company hopes others can benefit from that tech, too. Another example would be RETVec, which is a multi-language text-processing model used to detect spam. This comes at a time when we're all being warned that miscreants are apparently making more use of machine-learning software to automate intrusions and vulnerability research.

"AI is at a definitive crossroads — one where policymakers, security professionals and civil society have the chance to finally tilt the cybersecurity balance from attackers to cyber defenders," Phil Venables, chief information security officer at Google Cloud, and Royal Hansen, veep of engineering for privacy, safety, and security, said on Friday. 

"At a moment when malicious actors are experimenting with AI, we need bold and timely action to shape the direction of this technology."

The pair believe Magika can be used by network defenders to identify, fast and at scale, the true content of files, which is a first step in malware analysis and intrusion detection. To be honest, this deep-learning model could be useful for anyone who needs to scan user-provided documents: Videos that are actually executables, for instance, ought to set off some alarm and require closer inspection. Email attachments that aren't what they say they are ought to be quarantined. You get the idea.
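
The routing idea above can be sketched as a small triage function. This is an illustration under stated assumptions, not how Google's pipeline works: `triage`, `EXPECTED`, and `EXECUTABLE` are hypothetical names, the labels are made up for the example, and the content detector is passed in as a plain callable so that any identifier (a rule-based one or a model-backed one) could be plugged in.

```python
from pathlib import PurePath
from typing import Callable, Optional

# Hypothetical mapping from file extension to the content label we expect.
EXPECTED = {
    ".jpg": "jpeg",
    ".jpeg": "jpeg",
    ".png": "png",
    ".pdf": "pdf",
    ".mp4": "mp4",
}

# Labels this sketch treats as executable content.
EXECUTABLE = {"elf", "pe", "script"}


def triage(
    filename: str,
    data: bytes,
    detect: Callable[[bytes], Optional[str]],
) -> str:
    """Return 'accept', 'quarantine', or 'inspect' for a user-supplied file."""
    detected = detect(data)
    claimed = EXPECTED.get(PurePath(filename).suffix.lower())
    if claimed is not None and detected == claimed:
        return "accept"      # content matches what the extension claims
    if detected in EXECUTABLE:
        return "quarantine"  # executable content behind another name
    return "inspect"         # unknown or mismatched: needs a closer look


def stub_detect(data: bytes) -> Optional[str]:
    """Stand-in detector used only to exercise the sketch."""
    if data.startswith(b"MZ"):
        return "pe"
    if data.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    return None
```

With the stand-in detector, `triage("holiday.mp4", b"MZ\x90\x00", stub_detect)` lands in quarantine, while a real JPEG named `cat.JPG` is accepted. Keeping the detector as a parameter means the policy stays the same whichever identifier sits behind it.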

More generally speaking, in the context of cybersecurity, AI models can not only inspect files for suspicious content and source code for vulnerabilities, they can also generate patches to fix bugs, the Googlers asserted. The mega-corp's engineers have been experimenting with Gemini to improve the automated fuzzing of open source projects, too.

Google claims Magika is 50 percent more accurate at identifying file types than the biz's previous system of handcrafted rules, takes milliseconds to identify a file type, and is said to have at least 99 percent accuracy in tests. It isn't perfect, however, and fails to classify file types about three percent of the time. It's licensed under Apache 2.0, the code is available on GitHub, and its model weighs in at 1MB.

Moving away from Magika, the Chocolate Factory will also, as part of this new AI Cyber Defense Initiative, partner up with 17 startups in the UK, US, and Europe, and train them to use these types of automated tools to improve their security. 

It will also expand its $15 million Cybersecurity Seminars Program to help universities train more European students in security. Closer to home, it pledged $2 million in grants to support academics at the University of Chicago, Carnegie Mellon, and Stanford who are researching cyber-offense and large language models.

"The AI revolution is already underway. While people rightly applaud the promise of new medicines and scientific breakthroughs, we're also excited about AI's potential to solve generational security challenges while bringing us close to the safe, secure and trusted digital world we deserve," Venables and Hansen concluded. ®
