While politicians and the public demand Facebook dam its indiscriminate dispensation of data, academics want to open the social network info-spigot wider still.
In a paper popped onto ArXiv this week, boffins from the Instituto Politécnico Nacional's ESIME Culhuacan in Mexico, and the University of Warwick in the UK, describe a technique for getting around Twitter's API rate-of-access limitations to harvest data from the social network more efficiently.
The paper doesn't mince words about its intent to flout Twitter's rules for the sake of science. It's titled "a web scraping methodology for bypassing Twitter API restrictions."
To test and train data science algorithms, eggheads must have something to work with, the researchers – A. Hernandez-Suarez, G. Sanchez-Perez, K. Toscano-Medina, V. Martinez-Hernandez, V. Sanchez and H. Perez-Meana – declare.
"Gathering information from Online Social Networks is a primordial step in many data science fields allowing researchers to work with different and more detailed datasets," they said. "Although an important proportion of the scientific community uses the Twitter streaming API for collecting data, a limitation occurs when queries exceed rating intervals and time ranges."
Twitter, they claim, has become the preferred social network for data collection, because of its usability, reach, and varied types of data. Its real-time and historical data have proven useful for research on rumor propagation, tracking people geographically, spam and botnet detection, and disaster response, they stated.
US presidential historians and prosecutors no doubt can find noteworthy tweets, too.
Twitter, however, imposes limits on the rate and range of data available through the API it provides for free to rubes. There's an enterprise API, but it's priced to make large companies cough up big bucks for premium access.
Academic researchers appear to be less inclined to pony up.
"In this paper, we propose a web scraping methodology for crawling and parsing tweets bypassing Twitter API restrictions taking advantage of public search endpoints, such that, given a query with optional parameters and set of HTTP headers we can request an advanced search going deeper in collecting data," they explained.
Web scraping remains a legally contentious issue. A San Francisco-based startup called hiQ last year sued LinkedIn to be allowed to scrape public LinkedIn profiles after LinkedIn tried to lock the upstart out. The case is ongoing, though hiQ's data harvesting has been allowed to continue while it proceeds.
Where data isn't public, the law is clearer: accessing protected data can be prosecuted under hacking statutes, depending on where you live.
Using a public API in a way that obviates the need to rely on the enterprise API isn't quite the same thing, though.
The researchers have developed what they describe as a new approach to scraping Twitter API endpoints by customizing query fields to extend search capabilities.
The technique relies on Scrapy, an open-source web scraping framework for Python.
"By using Scrapy, an open source and collaborative framework for extracting data from websites written in Python, we enhance the power of scraping engines to obtain an unlimited volume of tweets bypassing date ranges limitations," the researchers explained.
The key to the technique is that where the first Twitter API request returns 20 results, the second can be crafted to return a variable number of results because the system is designed for users scrolling through the Twitter feed (where the number of tweets to be loaded isn't fixed). This behavior can be exploited by passing a maximum position parameter, which can instruct Twitter's backend systems to provide more data than they would normally.
The paper described the query structure thus:
https://twitter.com/search/timeline?f=tweets&vertical=default&q=words + array of dates + parameters &src=typd&minposition=maximum-position
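For the curious, that query structure can be sketched in a few lines of Python. This is a rough illustration rather than the researchers' code: the function names, the since:/until: date operators, and the fetch_page callable are our own assumptions, and only the endpoint and parameter names are lifted from the query above. It uses the standard library rather than Scrapy to stay self-contained.

```python
from urllib.parse import urlencode

# Endpoint as quoted from the paper; nothing here assumes it still responds.
SEARCH_URL = "https://twitter.com/search/timeline"


def build_search_url(words, since=None, until=None, position=None):
    """Assemble a search URL following the structure quoted above.

    The since:/until: operators and this signature are illustrative
    assumptions, not details confirmed by the paper.
    """
    terms = [words]
    if since:
        terms.append(f"since:{since}")
    if until:
        terms.append(f"until:{until}")
    params = {
        "f": "tweets",
        "vertical": "default",
        "q": " ".join(terms),
        "src": "typd",
    }
    if position is not None:
        # Parameter name copied verbatim from the quoted query string.
        params["minposition"] = position
    return SEARCH_URL + "?" + urlencode(params)


def crawl(fetch_page, words, max_pages=5, **dates):
    """Follow the position cursor from one page of results to the next.

    fetch_page is a hypothetical callable that takes a URL and returns
    (items, next_position), with next_position set to None when no
    further results are available.
    """
    collected, position = [], None
    for _ in range(max_pages):
        url = build_search_url(words, position=position, **dates)
        items, position = fetch_page(url)
        collected.extend(items)
        if position is None:
            break
    return collected
```

The first call goes out with no position value, mirroring the initial request; each later call passes along whatever position the previous page handed back, which is the scrolling behavior the paper leans on. The max_pages cap is our own addition to keep the loop bounded.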
The researchers have thoughtfully deployed a proof-of-concept scraper on AWS.
Twitter did not immediately respond to a request for comment. ®