Why Python's pip search isn't working: We speak to infrastructure director about ongoing traffic overload

'The decision was made to return an error message that gave people an ability to contact us'


Interview Last December, the Python development team overseeing the Python Package Index (PyPI) temporarily disabled the search endpoint on its XML-RPC API because its infrastructure had been overwhelmed by "abusive clients."

The upshot is that searching for Python packages with pip, eg: pip search ascii or pip3 search png, isn't possible because this backend search API is unavailable.

In March, the search endpoint was permanently disabled, depriving developers of one of several ways to programmatically find packages in PyPI. The result has been frustration among those developing Python software because the XML-RPC API is still widely used (its other endpoints remain active).

Current usage figures were not immediately available, though in May 2019 the search endpoint of the XML-RPC API was said to have received 85.5m requests over the span of three months. Over the course of a week in October 2020, the search endpoint received 33.2m requests, or an average of ~54 requests per second.
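As a rough check on those figures, the per-second rates can be recomputed from the quoted totals. This is a back-of-envelope sketch only: the helper name is illustrative, and the 90-day length used for "three months" is an assumption rather than a figure from PyPI.

```python
# Back-of-envelope request rates from the totals quoted above.
SECONDS_PER_DAY = 24 * 60 * 60


def requests_per_second(total_requests: float, days: float) -> float:
    """Average request rate over a period lasting `days` days."""
    return total_requests / (days * SECONDS_PER_DAY)


# One week in October 2020: 33.2m requests.
weekly_rate = requests_per_second(33.2e6, 7)

# Three months to May 2019: 85.5m requests (90 days is an assumption).
quarterly_rate = requests_per_second(85.5e6, 90)

print(f"October 2020: about {weekly_rate:.1f} requests per second")
print(f"May 2019: about {quarterly_rate:.1f} requests per second")
```

Under those assumptions the October 2020 week works out to just under 55 requests per second, in line with the ~54 quoted, and roughly five times the 2019 average.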

One client application that uses the API's search endpoint – and isn't itself a problem – is the aforementioned package management tool pip. Running pip at the command line to search for a popular library, such as requests – ie: pip search requests – currently returns an error:

xmlrpc.client.Fault: <Fault -32500: "RuntimeError: PyPI's XMLRPC API is currently disabled due to unmanageable load and will be deprecated in the near future. See https://status.python.org/ for more information.">
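For anyone scripting against the API, here is a minimal sketch of how a client might catch this fault rather than die with a traceback. The `search_pypi` helper and the stub server are hypothetical and exist only for illustration; `xmlrpc.client.Fault` and its `faultCode` and `faultString` attributes come from Python's standard library.

```python
import xmlrpc.client


def search_pypi(server, term):
    """Call the XML-RPC search method; return (results, error_message)."""
    try:
        return server.search({"name": term}), None
    except xmlrpc.client.Fault as fault:
        # The server answered, but with an XML-RPC fault instead of results.
        message = f"search unavailable (fault {fault.faultCode}): {fault.faultString}"
        return [], message


class DisabledSearchStub:
    """Hypothetical stand-in for the server: always raises a fault."""

    def search(self, spec):
        # Shortened version of the fault text quoted above.
        raise xmlrpc.client.Fault(
            -32500, "RuntimeError: PyPI's XMLRPC API is currently disabled"
        )


results, error = search_pypi(DisabledSearchStub(), "requests")
print(results)
print(error)
```

Against the live index, `server` would presumably be an `xmlrpc.client.ServerProxy` pointed at PyPI's XML-RPC URL; the stub keeps the example runnable without adding to the load described here.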

Ee W. Durbin III, director of infrastructure at the Python Software Foundation, told The Register in an interview on Monday that it's unclear who or what is responsible for overloading the search endpoint, and that the error message represents an attempt to get the attention of anyone responsible for the network traffic overload.

The issue has surfaced before, several times.

"One of the most notable incidents was with very large clusters of computers at a specific corporate entity that was using the Puppet library," Durbin explained. "The Puppet library used the XML-RPC to determine the latest version of a package. And so, every 15 minutes we would get this massive influx of XML-RPC requests that would overwhelm the backends and cause us an availability concern."

In that instance, Python's minders were able to identify the company from the IP addresses of the requests and then contact the offending firm to ask for a fix to its upstream management software, which in time resolved the issue.

But the latest deluge of network traffic is more diffuse and defies easy attribution.

"Late last year, we identified an availability concern again due to what appears to be a scheduled recurring job from a very large and broad set of IP addresses that we're not as easily able to identify," said Durbin. "Given that we weren't able to easily identify the source of these requests, or the reason for the requests, the decision was made to make it return an error message that gave people an ability to contact us."

So far, no one has responded to claim responsibility for the excessive API calls. And shutting down the search endpoint hasn't put an end to the incoming data flood – PyPI continues to see about 100 requests per second trying to reach the shuttered search endpoint.

Durbin said the API doesn't require authentication or impose other restrictions on clients, which was by design, to make sure people could search PyPI. But as a result, there are fewer backend defense options against abuse. PyPI could block by user agent string or IP address, but both of those can be easily changed.

Of its time

The XML-RPC API dates back to the previous iteration of PyPI; a more modern revision called Warehouse was rolled out several years ago to improve the sustainability of the package index.

"At the time, XML RPC was a totally valid solution," said Durbin. "It provided the ease of use and ease of implementation that the maintainer, at the time, was comfortable with."

The problem with the API is that every call is made as an HTTP POST request. PyPI currently tries to cache its operations as much as possible through its CDN layer. "POSTs are in and of themselves difficult to cache at the CDN layer," explained Durbin, adding that the anonymity of XML-RPC requests further complicates efforts to ensure service quality.
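To illustrate why POSTs resist caching, the sketch below builds two XML-RPC search calls with Python's standard `xmlrpc.client.dumps`. The endpoint URL and the URL-only cache key are assumptions made for illustration; this is a simplification, not a description of how PyPI's CDN is configured.

```python
import xmlrpc.client

ENDPOINT = "https://pypi.org/pypi"  # assumed XML-RPC URL, for illustration


def xmlrpc_search_request(term):
    """Return (method, url, body) for an XML-RPC search call."""
    # The query travels in the XML body, not in the URL.
    body = xmlrpc.client.dumps(({"name": term},), methodname="search")
    return "POST", ENDPOINT, body.encode("utf-8")


def url_cache_key(method, url):
    """Hypothetical cache key built from the method and URL only."""
    return f"{method} {url}"


req_a = xmlrpc_search_request("ascii")
req_b = xmlrpc_search_request("png")

# Both searches hit the same URL, so a URL-based key cannot tell them apart.
same_key = url_cache_key(*req_a[:2]) == url_cache_key(*req_b[:2])
different_bodies = req_a[2] != req_b[2]
print(same_key, different_bodies)
```

Because every search shares one URL and differs only in its request body, a cache keyed on the URL would either serve the wrong results or have to pass each request through to the backend.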

The reason the situation has become untenable for PyPI, Durbin said, is that "we don't have the resources to commit to firefighting the XML RPC endpoint on a day to day basis. I'm the only paid staff available to support PyPI."

The other maintainers of PyPI are all volunteers.

Durbin asked for the Python community to be patient because the infrastructure team has limited resources and suggested that those at large organizations who use the language might look for ways to support the community more effectively.

As an example, they pointed to Bloomberg, which is sponsoring a new packaging project manager role through the Python Software Foundation.

"Our goal is definitely going to be to utilize these newly available resources to design and hopefully see implemented a next generation API for PyPI that provides us some new things, specifically along the lines of being able to identify who's using the service, and in what manner," said Durbin. ®

Biting the hand that feeds IT © 1998–2021