Internet Archive's way cool Wayback Machine gets way more websites in Cloudflare fail-over pact
And Cloudflare customers get way better availability
The Internet Archive, repository for some 468bn webpages, has become a fail-over service for Cloudflare customers, which could improve website availability for everyone.
On Thursday, Mark Graham, director of the Wayback Machine at the non-profit Internet Archive, said the archive's web-focused warehouse will store snapshots of websites enrolled in Cloudflare's Always Online service to provide access to those sites in the event they go offline.
Graham said in a blog post today that the Wayback Machine has long archived URLs from a variety of sources including its web crawler, its "Save Page Now" URL submission form, and other signals.
Going forward, the Wayback Machine will also include websites enrolled in Cloudflare Always Online, a decade-old site availability service offered at no charge to Cloudflare customers (The Register being one of them).
"What we're trying to do is make sure all of our customers' sites are available and reliable, no matter what happens to them," said Cloudflare CEO Matthew Prince in a phone interview on Thursday.
Large customers, he said, have the resources to run their hosting infrastructure in a reliable way, but smaller ones may have a challenge when their hosting provider goes offline. "If we can't get to that content, then we can't serve it up across the network," said Prince, whose company, among other things, helps web publishers distribute cached web data via endpoints at the network's edge.
Cloudflare has been trying to do this since 2010, shortly after the company was founded.
"One of the things that we wanted to provide, especially for smaller customers, was a service that would allow them to remain online no matter what," said Prince.
Early versions of the service "worked okay," he explained, but faced the challenge of making sure Cloudflare didn't cache internal or private information. And a lot of sites weren't easily cataloged.
It was difficult, Prince said, to determine what Cloudflare could cache and what it could show if a website went offline. Initially, the company relied on watching where Google's crawler went and assuming it could cache those pages.
That worked well enough for a time, when Google's traffic all hit Cloudflare's data center in Ashburn, Virginia, but over the past decade, Google's crawling infrastructure became more complicated. Five years ago, Prince said, Cloudflare built its own crawler to help fill in the gaps, but the project never got the attention it deserved.
"We're not in the business of crawling websites, so it wasn't the smartest crawler out there," he said.
About a year ago, a product manager at Cloudflare pointed out that the Internet Archive had an expansive copy of the web, so the network service biz began looking into whether the two organizations could work together.
"Our hope is this will make the Internet Archive more thorough and better by giving it a more complete picture of the web [while also helping our customers]," said Prince.
The updated Always Online service requires customers to provide the Internet Archive with some website information, such as a hostname and popular URLs, for crawling. Thereafter, if the site fails to respond to a network request, Cloudflare will answer with a status code in the 520 to 527 range.
It will then try to provide a stale or expired version of the content cached from an edge data center that it can serve to the requesting website visitor. If that data can't be found, it will ask the Internet Archive for its most recent site capture and serve it with a banner indicating that the original website is inaccessible.
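The fallback order described above can be sketched in a few lines of Python. This is only an illustration of the sequence as reported here, not Cloudflare's actual implementation: the names `serve`, `fetch_origin`, `edge_cache` and `fetch_archive` are hypothetical, as is the banner text, and returning a 200 status for stale or archived content is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Optional, Tuple

# Status codes the article says signal an unreachable origin (520 to 527).
ORIGIN_ERROR_CODES = range(520, 528)

# Hypothetical banner wording; the article only says a banner is shown.
BANNER = "[Notice: the original website is currently inaccessible]\n"


@dataclass
class Response:
    status: int
    body: str
    source: str  # "origin", "edge-cache", "archive", or "error"


def serve(
    url: str,
    fetch_origin: Callable[[str], Tuple[int, str]],
    edge_cache: Mapping[str, str],
    fetch_archive: Callable[[str], Optional[str]],
) -> Response:
    """Try the origin, then a stale edge copy, then the archive's latest capture."""
    status, body = fetch_origin(url)
    if status not in ORIGIN_ERROR_CODES:
        return Response(status, body, "origin")

    # Origin is unreachable: look for a stale or expired copy at the edge.
    stale = edge_cache.get(url)
    if stale is not None:
        return Response(200, stale, "edge-cache")  # 200 is an assumption

    # No edge copy: ask the archive for its most recent capture.
    snapshot = fetch_archive(url)
    if snapshot is not None:
        return Response(200, BANNER + snapshot, "archive")

    # Nothing to fall back on, so pass the origin error through.
    return Response(status, body, "error")
```

Passing the origin, cache and archive lookups in as arguments keeps the sketch free of network calls, so each stage of the chain can be swapped for a stub and exercised on its own.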
In an email to The Register, Graham said the Internet Archive's arrangement with Cloudflare doesn't entail any financial or infrastructure support.
"But we appreciate the support from the many individuals, organizations and companies that have provided support to date, and those that may support us in the future," he said. "In general terms, we focus on trying to be of service first and foremost."
Graham acknowledged that storing the data from Cloudflare Always Online customers does add to the Internet Archive's infrastructure costs. "We also benefit from learning about Web-based resources (via URLs) that we might not otherwise have known about, so the partnership helps us do a better job of archiving more of the public Web," he said. ®