Internet Archive's way cool Wayback Machine gets way more websites in Cloudflare fail-over pact

And Cloudflare customers get way better availability

The Internet Archive, repository for some 468bn webpages, has become a fail-over service for Cloudflare customers, which could improve website availability for everyone.

On Thursday, Mark Graham, director of the Wayback Machine, the non-profit Internet Archive's web-focused warehouse, said it will store snapshots of websites enrolled in Cloudflare's Always Online service so those sites remain accessible if they go offline.

Graham said in a blog post today that the Wayback Machine has long archived URLs from a variety of sources, including its web crawler, its "Save Page Now" URL submission form, and other signals.

Going forward, the Wayback Machine will also include websites enrolled in Cloudflare Always Online, a decade-old site availability service offered at no charge to Cloudflare customers (The Register being one of them).

"What we're trying to do is make sure all of our customers' sites are available and reliable, no matter what happens to them," said Cloudflare CEO Matthew Prince in a phone interview on Thursday.

Large customers, he said, have the resources to run their hosting infrastructure reliably, but smaller ones may struggle when their hosting provider goes offline. "If we can't get to that content, then we can't serve it up across the network," said Prince, whose company, among other things, helps web publishers distribute cached web data via endpoints at the network's edge.

Cloudflare has been trying to do this since 2010, shortly after the company was founded.

"One of the things that we wanted to provide, especially for smaller customers, was a service that would allow them to remain online no matter what," said Prince.

Early versions of the service "worked okay," he explained, but faced the challenge of making sure Cloudflare didn't cache internal or private information. And a lot of sites weren't easily cataloged.

It was difficult, Prince said, to determine what Cloudflare could cache and what it could show if a website went offline. Initially, the company relied on watching where Google's crawler went and assuming it could cache those pages.

That worked well enough for a time, when Google's traffic all hit Cloudflare's data center in Ashburn, Virginia, but over the past decade Google's crawling infrastructure has become more complicated. Five years ago, Prince said, Cloudflare built its own crawler to help fill in the gaps, but the project never got the attention it deserved.

"We're not in the business of crawling websites, so it wasn't the smartest crawler out there," he said.

About a year ago, a product manager at Cloudflare pointed out that the Internet Archive had an expansive copy of the web, so the network service biz began looking into whether the two organizations could work together.

"Our hope is this will make the Internet Archive more thorough and better by giving it a more complete picture of the web [while also helping our customers]," said Prince.

The updated Always Online service requires customers to provide the Internet Archive with some website information, such as a hostname and popular URLs, for crawling. Thereafter, if the site fails to respond to a network request, Cloudflare will answer with a status code in the 520 to 527 range.

It will then try to serve the requesting visitor a stale or expired copy of the content cached at an edge data center. If no such copy can be found, it will ask the Internet Archive for its most recent capture of the site and serve that with a banner indicating that the original website is inaccessible.
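For the curious, the fallback order described above can be sketched roughly as follows. This is a minimal illustration, not Cloudflare's actual implementation: the `Response` type, the `serve` function, and the three lookup callables are hypothetical names invented for this example, and the 200 status on fallback responses and the banner wording are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Origin-error status codes named in the article (520 to 527 inclusive).
ORIGIN_ERROR_CODES = range(520, 528)


@dataclass
class Response:
    status: int
    body: str
    source: str  # "origin", "edge-cache", or "archive"
    banner: Optional[str] = None


def serve(
    url: str,
    fetch_origin: Callable[[str], Response],
    lookup_stale_cache: Callable[[str], Optional[str]],
    lookup_archive: Callable[[str], Optional[str]],
) -> Response:
    """Return the origin response, or fall back to cache, then the archive."""
    origin = fetch_origin(url)
    if origin.status not in ORIGIN_ERROR_CODES:
        return origin  # origin answered normally, nothing to do

    # Step 1: a stale or expired copy held at the edge, if one exists.
    stale = lookup_stale_cache(url)
    if stale is not None:
        # Assumption: the fallback copy is served as a normal 200.
        return Response(200, stale, "edge-cache")

    # Step 2: the most recent Internet Archive capture, flagged with a banner.
    capture = lookup_archive(url)
    if capture is not None:
        return Response(
            200,
            capture,
            "archive",
            banner="The original website is currently inaccessible.",
        )

    # Nothing to fall back on: pass the origin error through.
    return origin
```

Passing the three lookups in as callables keeps the sketch free of any real network or cache API, so only the ordering of the fallbacks is on display.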

In an email to The Register, Graham said the Internet Archive's arrangement with Cloudflare doesn't entail any financial or infrastructure support.

"But we appreciate the support from the many individuals, organizations and companies that have provided support to date, and those that may support us in the future," he said. "In general terms, we focus on trying to be of service first and foremost."

Graham acknowledged that storing the data from Cloudflare Always Online customers does add to the Internet Archive's infrastructure costs. "We also benefit from learning about Web-based resources (via URLs) that we might not otherwise have known about, so the partnership helps us do a better job of archiving more of the public Web," he said. ®

Biting the hand that feeds IT © 1998–2022