Pew: Quarter of web pages vanished in past decade
Luckily we have the Wayback Machine
The web is melting away like so many glaciers these days.
A report published by the Pew Research Center finds that digital decay is erasing online news, Twitter/X posts, and web links.
The Pew team - Athena Chapekis, Samuel Bestvater, Emma Remy, and Gonzalo Rivero - found that 25 percent of web pages that existed at some point between 2013 and October 2023 no longer function. As for older web pages that existed before 2013, 38 percent can no longer be accessed. As a point of comparison, about 8 percent of web pages from 2023 have disappeared.
The study, based on an analysis of web crawl data from Common Crawl, some 50,000 Wikipedia articles, and Twitter/X posts from last year, confirms longstanding concerns about the ephemeral nature of digital content.
"The internet is an unimaginably vast repository of modern life, with hundreds of billions of indexed webpages," Pew says. "But even as users across the world rely on the web to access books, images, news articles and other resources, this content sometimes disappears from view."
This has particular implications for news and government websites. A sample of half a million news pages from Common Crawl in March/April 2023 found that 23 percent had at least one broken link. For government websites – the same size sample during the same period – 21 percent had at least one broken link.
In a sample of 50,000 English-language Wikipedia pages, 82 percent included at least one reference link – a cited source – and 11 percent of these are no longer accessible. This makes fact checking much more difficult.
For the social media platform X, which was still referred to as Twitter during the March 8 to April 27, 2023 sample period, some 18 percent of 5 million tweets were no longer accessible by June 15 of that year. Some had been suspended, some had been removed, and some had been made private.
Among the portion of tweets that got removed, half vanished within the first six days of posting, and 90 percent were no longer available within 46 days.
Twitter posts, however, also get resurrected – about 6 percent resurface at some point, a phenomenon the researchers attribute to accounts that went private and then returned to public status, or suspended accounts that got reinstated.
- The internet – not as great as we all thought it was going to be, eh?
- How the Internet Archive faces potential destruction at the hands of Big Four publishers
- Internet Archive's 2046 Wayforward Machine says Google will cease to exist
Mark Graham, director of the Wayback Machine at the non-profit Internet Archive, told The Register: "This is why we do what we do."
The Wayback Machine crawls the web and archives copies of websites for posterity.
"We have identified and fixed more than 21 million broken URLs in Wikipedia articles, replacing them with archived versions of those web resources from The Wayback Machine," he explained.
Graham said the consequence of digital decay is the loss of our collective history. "When a society or people lose the ability to cite and compare and contrast, they lose the ability to do any kind of critical analysis or contextualize current events."
"We operate in the forever-now," he said. "We lose access to our memory. The ability to contextualize the events of our times depends upon having a record of our times." ®