Meet ScrAPIr, MIT's Swiss Army knife for non-coders to shake data out of APIs (it's useful for pro devs, too)

A simpler alternative to site slurping


Boffins at MIT's Computer Science & Artificial Intelligence Laboratory (CSAIL) have developed a tool called ScrAPIr to help simplify access to online data available through application programming interfaces, or APIs.

ScrAPIr is not a scraping tool – code designed to fetch webpages and extract specific elements from the HTML, an approach that is not exactly robust. Rather, ScrAPIr is a Swiss Army knife for accessing the official search interfaces, or APIs, that websites provide for downloading records. It's the difference between downloading a webpage of The Register's most-read articles and picking the headlines out of all the HTML and CSS, and querying an API that returns a simple, easy-to-process list of headlines (no, don't ask for one, our tech team is busy enough as it is).
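To make that difference concrete, here is a minimal Python sketch of both approaches; the URLs, CSS selector, and JSON field names are invented for illustration and don't correspond to any real site or to ScrAPIr itself:

```python
import requests
from bs4 import BeautifulSoup

# Scraping: fetch the HTML and dig the headlines out of the markup.
# This breaks as soon as the page's structure or CSS classes change.
html = requests.get("https://example.com/most-read").text
soup = BeautifulSoup(html, "html.parser")
headlines = [a.get_text(strip=True) for a in soup.select("h3.headline a")]

# API: ask the site's search endpoint directly and get structured JSON back.
data = requests.get(
    "https://example.com/api/articles",
    params={"sort": "most_read", "limit": 10},
).json()
headlines = [item["headline"] for item in data["articles"]]
```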

If you are adept at writing code that can talk to these sorts of interfaces and fetch the information you need, great – this toolkit isn't necessarily for you. It's primarily for those who are new to the idea of sending specially crafted queries to servers and parsing the responses.

In a paper [PDF] presented last month at the CHI '20 conference, Tarfah Alrashed, Jumana Almahmoud, Amy Zhang, and David Karger explain that ScrAPIr represents an effort to make APIs accessible to non-programmers, much easier for programmers, and more consistent in the way they present information.

"APIs are effortful for programmers and nigh-impossible for non-programmers to use," they state in their paper. "In this work, we empower users to access APIs without programming."

Programmers have various tools for scraping information from websites, but the process is inherently brittle: scraping code requires developers to specify the webpage elements to be extracted, and subsequent changes to that page's design can break the code.

APIs allow sites such as GitHub, Reddit, and Yelp, among many others, to make specific data available in a more stable way. But non-programmers tend to have trouble using APIs, and even those who code may find the process of integrating an API onerous.

ScrAPIr consists of three components that facilitate API access and the distribution of API queries for others to use: HAAPI (Human Accessible API), an OpenAPI extension to declare query parameters, authentication and pagination methods, and how returned data should be structured; SNAPI (Search Normalized APIs), to fetch data using the HAAPI description; and WRAPI (Wrapper for APIs), a tool for describing APIs.
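The paper positions HAAPI as an extension of the OpenAPI standard. ScrAPIr's actual schema isn't reproduced here, but the information such a description has to capture is fairly small. A hypothetical Python sketch of the kind of record WRAPI helps an author fill in and SNAPI then consumes (the field names are illustrative, not ScrAPIr's real format):

```python
# Hypothetical illustration only -- not ScrAPIr's real HAAPI schema.
# The idea: once these facts about an API are written down, a generic
# client (SNAPI, in ScrAPIr's case) can query it without custom code.
api_description = {
    "endpoint": "https://example.com/api/search",            # where to send the query
    "query_params": {"q": "search terms", "sort": "date"},   # user-facing inputs
    "auth": {"type": "api_key", "param": "key"},              # how to authenticate
    "pagination": {"type": "page_token", "param": "pageToken"},  # how to fetch more results
    "results_path": ["items"],                                 # where records sit in the JSON response
    "fields": ["title", "url", "published_at"],                # attributes to extract per record
}
```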

The ScrAPIr website provides a selection of interfaces to web APIs. For example, it lets users construct sophisticated YouTube queries and export them as JavaScript or Python code that performs the queries when run, or download the returned results as CSV or JSON files.
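The paper doesn't reproduce the exported code, but a script of that sort boils down to calling the public YouTube Data API v3 search endpoint, paging through results, and writing them out. A rough Python sketch of the general shape, assuming the user supplies their own API key (error handling trimmed):

```python
import csv
import requests

API_KEY = "YOUR_API_KEY"  # each user supplies their own YouTube Data API key
URL = "https://www.googleapis.com/youtube/v3/search"

params = {
    "part": "snippet",
    "q": "open data",      # the search terms
    "type": "video",
    "order": "date",       # sort results by publication date
    "maxResults": 50,
    "key": API_KEY,
}

rows = []
while True:
    data = requests.get(URL, params=params).json()
    for item in data.get("items", []):
        snippet = item["snippet"]
        rows.append({
            "videoId": item["id"].get("videoId", ""),
            "title": snippet["title"],
            "publishedAt": snippet["publishedAt"],
        })
    token = data.get("nextPageToken")   # follow pagination until exhausted
    if not token:
        break
    params["pageToken"] = token

# Dump the collected results as CSV, mirroring ScrAPIr's export option.
with open("youtube_results.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["videoId", "title", "publishedAt"])
    writer.writeheader()
    writer.writerows(rows)
```

The point of a tool like ScrAPIr is that a non-programmer never has to see, let alone write, this boilerplate: the query parameters, pagination, and output format are handled through the web interface.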

Sometimes websites offer search capabilities that match their APIs, but not always. In a YouTube video, Tarfah Alrashed, an MIT CSAIL graduate student, demonstrates how ScrAPIr can be used to conduct a search for leather jackets on Etsy, and sort the results by the manufacture date, which isn't an option on the website. You can see how below:

YouTube Video

In another example of ScrAPIr's potential utility, the eggheads created a repository of more than 50 COVID-19 data resources, in the hope that the technically less inclined – journalists, ahem, and data scientists – may have an easier time combing the data for trends.

"An epidemiologist might know a lot about the statistics of infection but not know enough programming to write code to fetch the data they want to analyze," MIT professor David Karger said earlier this week. “Their expertise is effectively being blocked from being put to use."

To test their creation, the academics gave a group of programmers information-gathering tasks involving Google Books, Reddit, and Yelp. ScrAPIr proved to be 3.8x faster on average than writing the equivalent data-fetching code.

ScrAPIr still has some rough spots and limitations. Saved queries that require authentication can't be shared unless the API key can be made public, which isn't always an option. Users may be able to sign up for their own API key where authentication is necessary, but that adds friction to the process. Also, ScrAPIr is only designed to handle basic filtering and sorting; more involved database "join" operations still require a query language like GraphQL.

"In a perfect world, the ScrAPIr ecosystem would not be necessary," the paper concludes.

"If one of the API standardization efforts such as the Semantic Web or GraphQL were to succeed, then a tool like SNAPI could be created that would be able to query every web data source with no additional programming. But this kind of perfection requires cooperation from every website developer in the world." ®
