Boffins at MIT's Computer Science & Artificial Intelligence Laboratory (CSAIL) have developed a tool called ScrAPIr to help simplify access to online data available through application programming interfaces, or APIs.
ScrAPIr is not a scraping tool, meaning code that fetches webpages and extracts specific elements from the HTML, an approach that is not exactly robust. Rather, ScrAPIr is a Swiss Army knife for accessing the official search interfaces, or APIs, that websites provide for downloading records. It's the difference between downloading a webpage of The Register's most read articles and extracting the headlines from all the HTML and CSS, and querying an API that returns a simple, easy-to-process list of headlines (no, don't ask for one; our tech team is busy enough as it is).
If you are adept at writing code that can talk to these sorts of interfaces and fetch the information you need, great – this toolkit isn't necessarily for you. It's primarily for those who are new to the idea of sending specially crafted queries to servers and parsing the responses.
In a paper [PDF] presented last month at the CHI '20 conference, Tarfah Alrashed, Jumana Almahmoud, Amy Zhang, and David Karger explain that ScrAPIr represents an effort to make APIs accessible to non-programmers, much easier for programmers, and more consistent in the way they present information.
"APIs are effortful for programmers and nigh-impossible for non-programmers to use," they state in their paper. "In this work, we empower users to access APIs without programming."
Programmers have various tools for scraping information from websites, but the process is inherently brittle: scraping code requires developers to specify the webpage elements to be extracted, and subsequent changes to that page's design can break the code.
APIs allow sites such as GitHub, Reddit, and Yelp, among many others, to make specific data available in a more stable way. But non-programmers tend to have trouble using APIs, and even those who code may find the process of integrating an API onerous.
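To make the brittleness point concrete, here is a minimal Python sketch that pulls the same headlines two ways: by scraping markup and by reading an API-style JSON payload. The page markup, the JSON payload, and the names HeadlineScraper, scrape_headlines and api_headlines are all invented for illustration; nothing here is taken from ScrAPIr or from any real site.

```python
import json
from html.parser import HTMLParser

# Hypothetical page markup and API payload, invented for illustration.
PAGE_HTML = """
<div class="most-read">
  <h4 class="headline">Boffins build API tool</h4>
  <h4 class="headline">Chip maker ships chips</h4>
</div>
"""
API_JSON = '{"headlines": ["Boffins build API tool", "Chip maker ships chips"]}'


class HeadlineScraper(HTMLParser):
    """Collects the text inside <h4 class="headline"> elements."""

    def __init__(self):
        super().__init__()
        self.headlines = []
        self._in_headline = False

    def handle_starttag(self, tag, attrs):
        # The scraper is hard-wired to one tag and one class name.
        if tag == "h4" and ("class", "headline") in attrs:
            self._in_headline = True

    def handle_endtag(self, tag):
        if tag == "h4":
            self._in_headline = False

    def handle_data(self, data):
        if self._in_headline and data.strip():
            self.headlines.append(data.strip())


def scrape_headlines(html):
    """Scraping route: depends on the page's current design."""
    parser = HeadlineScraper()
    parser.feed(html)
    return parser.headlines


def api_headlines(payload):
    """API route: depends only on the documented response shape."""
    return json.loads(payload)["headlines"]
```

Both routes agree on the original page, but a cosmetic redesign such as PAGE_HTML.replace("h4", "h3") leaves scrape_headlines returning an empty list while the JSON route is unaffected.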
ScrAPIr consists of three components that facilitate API access and the distribution of API queries for others to use: HAAPI (Human Accessible API), an OpenAPI extension to declare query parameters, authentication and pagination methods, and how returned data should be structured; SNAPI (Search Normalized APIs), to fetch data using the HAAPI description; and WRAPI (Wrapper for APIs), a tool for describing APIs.
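The division of labor can be sketched in a few lines of Python: a declarative description of an API on one side, and generic code that turns the description into paginated requests and tidy records on the other. This is only an illustration of the idea. The DESCRIPTION layout, its field names, the api.example.com endpoint, and the functions build_urls, dig and extract_records are all assumptions made up for this sketch; they are not the actual HAAPI schema or SNAPI code.

```python
from urllib.parse import urlencode

# Hypothetical description of an imaginary API. The field names are invented
# for this sketch and are NOT the real HAAPI format.
DESCRIPTION = {
    "url": "https://api.example.com/search",
    "auth": {"param": "api_key"},
    "query_params": {"q": "search term", "sort": "sort field"},
    "pagination": {
        "offset_param": "offset",
        "limit_param": "limit",
        "page_size": 25,
    },
    "results_path": ["data", "items"],
    "fields": {"title": ["title"], "price": ["pricing", "amount"]},
}


def build_urls(desc, query, api_key, max_results):
    """Turn a description plus a query into one request URL per page."""
    unknown = set(query) - set(desc["query_params"])
    if unknown:
        raise ValueError(f"undeclared query parameters: {sorted(unknown)}")
    page = desc["pagination"]
    urls = []
    for offset in range(0, max_results, page["page_size"]):
        params = dict(query)
        params[desc["auth"]["param"]] = api_key
        params[page["offset_param"]] = offset
        # The last page may need fewer than page_size results.
        params[page["limit_param"]] = min(page["page_size"], max_results - offset)
        urls.append(desc["url"] + "?" + urlencode(params))
    return urls


def dig(obj, path):
    """Follow a list of keys into a nested JSON-like structure."""
    for key in path:
        obj = obj[key]
    return obj


def extract_records(desc, response):
    """Flatten a decoded JSON response into records with the declared fields."""
    items = dig(response, desc["results_path"])
    return [
        {name: dig(item, path) for name, path in desc["fields"].items()}
        for item in items
    ]
```

Under these assumptions, build_urls(DESCRIPTION, {"q": "leather jacket"}, "KEY", 60) yields three URLs covering offsets 0, 25 and 50. Supporting a different site would mean writing a new description rather than new fetching code, which is the appeal of the declarative approach.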
Sometimes websites offer search capabilities that match their APIs, but not always. In a YouTube video, Tarfah Alrashed, an MIT CSAIL graduate student, demonstrates how ScrAPIr can be used to conduct a search for leather jackets on Etsy, and sort the results by the manufacture date, which isn't an option on the website.
In another example of ScrAPIr's potential utility, the eggheads created a repository of more than 50 COVID-19 virus resources, in the hope that the technically less inclined – journalists, ahem, and data scientists – may have an easier time combing the data for trends.
"An epidemiologist might know a lot about the statistics of infection but not know enough programming to write code to fetch the data they want to analyze," MIT professor David Karger said earlier this week. “Their expertise is effectively being blocked from being put to use."
To test their creation, the academics gave a group of programmers information-gathering tasks involving Google Books, Reddit, and Yelp. ScrAPIr proved to be 3.8x faster on average than writing the equivalent data-fetching code.
ScrAPIr still has some rough spots and limitations. Saved queries that require authentication can't be shared unless the API key can be made public, which isn't always an option. Users may be able to sign up for their own API key where authentication is necessary, but that adds friction to the process. Also, ScrAPIr is only designed to handle basic filtering and sorting; more involved database "join" operations still require a query language like GraphQL.
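For readers unsure where the line between "basic filtering and sorting" and a "join" falls, this small Python sketch shows the difference on invented data. The records and the function names filter_and_sort and join_books_authors are hypothetical and have nothing to do with ScrAPIr's own code.

```python
# Hypothetical records, shaped as two separate API queries might return them.
books = [
    {"title": "A", "author_id": 2, "year": 2001},
    {"title": "B", "author_id": 1, "year": 1995},
    {"title": "C", "author_id": 2, "year": 2010},
]
authors = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]


def filter_and_sort(records, min_year):
    """Single-source operation: keep recent records, newest first."""
    kept = [r for r in records if r["year"] >= min_year]
    return sorted(kept, key=lambda r: r["year"], reverse=True)


def join_books_authors(book_records, author_records):
    """Join: combine two result sets by matching author_id against id."""
    names_by_id = {a["id"]: a["name"] for a in author_records}
    return [
        {"title": b["title"], "author": names_by_id[b["author_id"]]}
        for b in book_records
    ]
```

The first function needs only one result set. The second has to line up records from two different sources, which is the kind of operation the article says still calls for a query language.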
"In a perfect world, the ScrAPIr ecosystem would not be necessary," the paper concludes.
"If one of the API standardization efforts such as the Semantic Web or GraphQL were to succeed, then a tool like SNAPI could be created that would be able to query every web data source with no additional programming. But this kind of perfection requires cooperation from every website developer in the world." ®