Evan Sultanik, principal computer security researcher with Trail of Bits, has unpacked the Python world’s pickle data format and found it distasteful.
He is not the first to do so, and acknowledges as much, noting in a recent blog post that the computer security community developed a disinclination for pickling – a binary protocol for serializing and deserializing Python object structures – several years ago.
Even Python's own documentation on the pickle module admits that security is not included. It begins, "Warning: The pickle module is not secure. Only unpickle data you trust," and goes on from there.
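To see why the documentation's warning is so blunt, consider a minimal Python sketch. The class name `Demo` is hypothetical and the payload is deliberately harmless: it only asks the operating system for the current working directory. The point is that `pickle.loads` calls whatever callable the byte stream names.

```python
import os
import pickle


class Demo:
    """Hypothetical class used only to illustrate the mechanism."""

    def __reduce__(self):
        # __reduce__ tells pickle how to rebuild the object: a callable plus
        # its arguments. Here the "rebuild" step is a harmless os.getcwd() call.
        return (os.getcwd, ())


payload = pickle.dumps(Demo())

# Loading the bytes runs os.getcwd(); no Demo instance is ever created.
result = pickle.loads(payload)
print(type(result).__name__, result)
```

The loaded value is the string returned by `os.getcwd()`, not a `Demo` object. An attacker would name a less benign callable in its place.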
Yet developers still use it, particularly in the Python machine learning (ML) community. Sultanik says it's easy to understand why, because pickling is built into Python and because it saves memory, simplifies model training, and makes trained ML models portable.
In addition to being part of the Python standard library, pickling is supported by the Python libraries NumPy and scikit-learn, both of which are commonly used in AI-oriented data science.
According to Sultanik, ML practitioners prefer to share pre-trained pickled models rather than the data and algorithms used to train them, which can represent valuable intellectual property. Websites like PyTorch Hub have been set up to facilitate model distribution and some ML libraries incorporate APIs to automatically fetch models from GitHub.
Almost a month ago in the PyTorch repo on GitHub, a developer who goes by the name KOLANICH opened an issue that states the problem bluntly: "Pickle is a security issue that can be used to hide backdoors. Unfortunately lots of projects keep using [the pickling methods]."
Other developers participating in the discussion responded that there's already a warning and pondered what's to be done.
Hoping to light a fire under the pickle apologists, Sultanik, with colleagues Sonya Schriner, Sina Pilehchiha, Jim Miller, Suha S. Hussain, Carson Harmon, Josselin Feist, and Trent Brunson, developed a tool called Fickling to assist with reverse engineering, testing, and weaponizing pickle files. He hopes security engineers will use it for examining pickle files and that ML practitioners will use it to understand the risks of pickling.
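Examining a pickle file without loading it is possible with the standard library alone. The sketch below is not Fickling and does not use its API. It relies on Python's own `pickletools` module, and the function name `risky_opcodes` and the opcode list are illustrative assumptions. It reports the opcodes that make the unpickler import or call something, which plain data such as numbers, strings, lists, and dicts never needs.

```python
import pickle
import pickletools

# Opcodes that make the unpickler import a global or call/construct an object.
# This set is an illustrative assumption, not an authoritative safety policy.
RISKY = {"GLOBAL", "STACK_GLOBAL", "REDUCE", "INST", "OBJ",
         "NEWOBJ", "NEWOBJ_EX", "BUILD"}


def risky_opcodes(data: bytes) -> set:
    """Return the risky opcode names found in a pickle, without loading it."""
    # genops only parses the byte stream; it never executes it.
    return {opcode.name for opcode, _arg, _pos in pickletools.genops(data)} & RISKY


plain = pickle.dumps({"weights": [0.1, 0.2], "epochs": 3})
with_call = pickle.dumps(range(3))  # rebuilt by importing and calling range()

print(risky_opcodes(plain))
print(risky_opcodes(with_call))
```

Finding these opcodes does not prove malice, since legitimate objects such as `range` need them too. It does show that loading the file will import and call something, which is the cue to look more closely.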
Sultanik and associates also developed a proof-of-concept exploit based on the official PyTorch tutorial that can inject malicious code into an existing PyTorch model. The PoC, when loaded as a model in PyTorch, will exfiltrate all the files in the current directory to a remote server.
"This is concerning for services like Microsoft’s Azure ML, which supports running user-supplied models in their cloud instances," explains Sultanik. "A malicious, 'Fickled' model could cause a denial of service, and/or achieve remote code execution in an environment that Microsoft likely assumed would be proprietary."
Sultanik said he reported his concerns to the maintainers of PyTorch and PyTorch Hub and apparently was told they'll think about adding further warnings. And though he was informed that models submitted to PyTorch Hub are "vetted for quality and utility," he observed that there's no effort to understand the people publishing models or to audit the code they upload.
Asking users to determine on their own whether code is trustworthy, Sultanik argues, is no longer sufficient given the supply chain attacks that have subverted code packages in PyPI, npm, RubyGems, and other package registries.
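One way to enforce trust mechanically rather than leave it to the user is the allow-list approach described in Python's own pickle documentation: override `Unpickler.find_class` so that only named globals can be loaded. The sketch below follows that pattern. The allow-list contents and the helper name `restricted_loads` are illustrative choices, and this is a stopgap rather than something Sultanik proposes.

```python
import builtins
import io
import os
import pickle

# Illustrative allow-list: only these builtins may be referenced by a pickle.
SAFE_BUILTINS = {"range", "complex", "set", "frozenset", "slice"}


class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called whenever the pickle asks to import a global.
        if module == "builtins" and name in SAFE_BUILTINS:
            return getattr(builtins, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def restricted_loads(data: bytes):
    """Like pickle.loads, but refuses any global not on the allow-list."""
    return RestrictedUnpickler(io.BytesIO(data)).load()


allowed = restricted_loads(pickle.dumps([1, 2, range(3)]))

try:
    restricted_loads(pickle.dumps(os.getcwd))  # references a non-listed global
    blocked = False
except pickle.UnpicklingError:
    blocked = True

print(allowed, blocked)
```

An allow-list narrows the attack surface but has to be kept tight, because any listed callable becomes something a crafted pickle can invoke.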
"Moving away from pickling as a form of data serialization is relatively straightforward for most frameworks and is an easy win for security," he concludes. ®
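As a rough illustration of what moving away from pickling can look like, the Python sketch below stores weights as JSON, a data-only format that cannot name a callable. The function names `dump_weights` and `load_weights` and the flat "layer name to list of numbers" layout are assumptions made for the example. Real frameworks would need a format that handles tensors, shapes, and dtypes.

```python
import json


def dump_weights(weights: dict) -> str:
    """Serialize a mapping of layer names to lists of numbers as JSON."""
    return json.dumps(weights)


def load_weights(text: str) -> dict:
    """Parse and validate weights; JSON parsing never executes code."""
    data = json.loads(text)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object of layer names to weight lists")
    for name, values in data.items():
        is_numbers = isinstance(values, list) and all(
            isinstance(v, (int, float)) and not isinstance(v, bool)
            for v in values
        )
        if not is_numbers:
            raise ValueError(f"layer {name!r} must be a list of numbers")
    return data


weights = {"layer1": [0.5, -1.25], "bias": [0.0]}
restored = load_weights(dump_weights(weights))
print(restored == weights)
```

The loader only ever yields dicts, lists, and numbers, so a tampered file can at worst fail validation rather than run code.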