How the UK's national memory lives in a ROBOT in Kew
El Reg visits the National Archives
Digital birth pains
For the more technically minded, the early 1980s also corresponds with the time when computing, electronic comms and electronic record keeping started to make their way into government. The assumption is that by 2025 pretty much everything that finds its way into the archive will be “digital born” - though the Archive expects some departments will still be sending paper down to Kew “for many years after that date”.
This might sound like a recipe for a pretty seamless archiving process. You produce the definitive electronic version of a document, it is circulated, then in time finds its way to Kew and is preserved for posterity.
On the other hand, consider this. What word processor were you using 20 or 30 years ago? Where are the files you created with it? Have you still got the floppy disks it came on? Don’t tell me you have files you created in Microsoft Works?
Scale that up across the sprawl that is central government, with its shifting departmental structures, erratic and silo’d procurement strategies, and at times piecemeal upgrade programmes. Throw in the debate on politicians’ and advisers’ use of private PCs.
Suddenly you can see the prospect of a first world country that can no longer access, much less understand, its own historical documents.
So, to head off this nightmare, the archive has developed its own file format ID tool, Pronom. As Alex Green of the Archive’s Digital Records Infrastructure team explains, they “point it at a collection and it IDs the file formats”. Except when it doesn’t. When the archive took delivery of the records from LOCOG after the Olympic torch was snuffed out, it threw up lots of formats that weren’t recognised. Had the long-feared cyberattack on the London Olympics come to pass, albeit after the closing ceremony? The answer was a lot more prosaic. It turns out that LOCOG had tended to work on Macs, which as Green gently puts it, "is not usual in government".
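For a feel of what "pointing it at a collection" involves, here is a minimal sketch of signature-based format identification, the general technique of matching a file's leading bytes against known "magic numbers". It is not the Archive's tool: the signature table, the labels and the function names are all assumptions for illustration, and a real registry holds far more signatures than this.

```python
from typing import Optional

# Illustrative signature table only: (leading bytes, label).
# This is NOT the Archive's registry, just a handful of well-known magic numbers.
SIGNATURES = [
    (b"%PDF-", "PDF document"),
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"GIF87a", "GIF image"),
    (b"GIF89a", "GIF image"),
    (b"PK\x03\x04", "ZIP container"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE2 compound file"),
]


def identify_format(header: bytes) -> Optional[str]:
    """Return a label for the first matching signature, or None if unrecognised."""
    for magic, label in SIGNATURES:
        if header.startswith(magic):
            return label
    return None


def identify_file(path: str) -> Optional[str]:
    """Read the first few bytes of a file and try to identify its format."""
    with open(path, "rb") as fh:
        return identify_format(fh.read(16))
```

Anything that returns None is the "except when it doesn't" case: a format the table has never seen, which is roughly what an unfamiliar batch of Mac files would look like to a tool tuned for typical government desktops.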
At the same time, the team has collaborated with Tessella to develop a tool called Safety Deposit Box, which it describes as "a risk-based system to identify formats in danger of becoming obsolescent".
Green says, “We make sure everything we get is on a hard drive. It’s backed up in the dark archive [more on that later]...you can’t put anything in there where we don’t know what it is. It’s very controlled.” Incoming files are integrity-checked.
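The article doesn't say how the integrity check is done. One common approach, sketched here as an assumption rather than a description of Kew's process, is to compare a cryptographic checksum of each received file against the value supplied by the sender. The function names and the choice of SHA-256 are illustrative.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps reading until read() returns b"" (end of file).
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: str, expected_hex: str) -> bool:
    """True if the file's checksum matches the one recorded by the sender."""
    return sha256_of(path) == expected_hex.strip().lower()
```

Reading in chunks means the check works the same way on a one-page memo and a multi-gigabyte disk image, without loading the whole file into memory.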
That takes care of the country’s internal documents - the minutes, the policy papers, the grand plans, and the grubby excuses that come in their wake.
But this is just the internal information from the government. And the Armed Forces. And assorted state agencies.
All this, and the web too
What about the stuff that business might call the customer or citizen-facing content? Yep, the National Archives has to look after that too, and it is now also the home of the UK Government Web Archive.
Bizarre as it may seem now, when UK.gov first dipped its toe into the internet 20 years ago, it didn’t occur to many people that websites should be archived.
Some of the earliest UK government ventures on to the web are sadly lost to history. The earliest finds, dating back to 1997, actually came from the Internet Archive Project. However, the Web archive’s Suzy Espley and her team are particularly taken with this early Treasury Page.
Now the archive strives to preserve the UK government’s web presence for posterity: U-turns, right-turns and all. It uses a crawler to trawl the UK government’s web estate, aiming to hit sites every six months. With the government looking to shutter many obscure or unloved sites, the pressure is on. The web archive currently stands at around 80TB, with the crawler pulling in 1.6TB a month. At time of writing, there are 3 billion URLs in the archive, with 1 billion captured last year alone.
But does anyone really care? Seems like they do. Espley said the archive gets around 15 to 20 million page views a month. This often maps to current events - the assumption being that visitors are cross-checking current government positions and statements against previous ones. When we dropped in, "badgers" was a top search term - this was the same month the badger cull had kicked off.
As the NHS’s care.data programme grinds on, old NHS pages detailing the NPfIT will no doubt race up the rankings.
And as government continues to sprawl, no matter who is in power, so does the Web Archive’s purview. Thus it will be extending its archiving activities to cover social media in a couple of months.
Between the web archive and the increasing amount of "born digital" internal government documents, it feels inevitable that the amount of data in the electronic archive will swiftly outstrip that from pre-digital days, if it hasn’t already: no one has calculated exactly how much data a fully digitised version of everything in the Archives would require.
Still, storage companies in particular are regularly telling us more data is now created every couple of days than was created between the dawn of humanity and 2003. If that data fits the mission of the National Archives, then it has to be preserved... somewhere.
That somewhere is well above the waterline, just off one of the repositories, and is known as both the Dark Archive and the Robot Room. The Robot in question is a Sun StorageTek SL3000 tape library. One half runs LTO6, says Thomas, the other half is Sun’s own tape format. The tapes themselves are standard 6.25TB cassettes. It’s clearly an archive. The "Dark" bit of "Dark Archive" refers to the fact that when no one is in the room, the lights are off, rather than any suggestion that this is the “real” archive of the illuminati who really run the UK.
Our history, on tape
As we said earlier, newer collections generally arrive in a digital format and go straight into the Dark Archive. If you’re wondering what key “historical” events we’re talking about, Thomas quickly reels off the Hillsborough Inquiry, the Leveson Inquiry, the Olympic Games and the records for the latest Census.
The tape library has a theoretical maximum of 13PB, and the Dark Archive is expected to hit 6PB by 2020. By then around 0.63PB of data will be added to the archive every year.
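A back-of-envelope sketch of what those figures imply for headroom. It assumes, purely for illustration, that growth stays flat at 0.63PB a year after 2020, which the Archive does not claim, and the function name is ours.

```python
# Figures quoted in the article.
MAX_CAPACITY_PB = 13.0    # theoretical maximum of the tape library
EXPECTED_2020_PB = 6.0    # expected size of the Dark Archive by 2020
ANNUAL_GROWTH_PB = 0.63   # data added per year by then


def years_until_full(capacity_pb: float, current_pb: float, growth_pb: float) -> float:
    """Years of headroom left, ASSUMING growth stays constant (an illustrative simplification)."""
    if growth_pb <= 0:
        raise ValueError("growth must be positive")
    return max(capacity_pb - current_pb, 0.0) / growth_pb


headroom = years_until_full(MAX_CAPACITY_PB, EXPECTED_2020_PB, ANNUAL_GROWTH_PB)
print(f"Roughly {headroom:.1f} years of headroom after 2020")
```

On those assumptions the library would have a little over 11 years of room left after 2020. If the rate of intake keeps rising, as the article suggests it will, the real figure would be shorter.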
David Thomas walks with a robot
Tape is not perhaps the sexiest of storage media - no helium, little in the way of nanometer scale technology beyond the media itself. But, as Thomas points out, it is a known technology, with us since the 1930s, and cheap - both in cost and environmentally. The tape vendors say it has a 30-year life cycle. While it remains to be seen whether that pans out, the team at Kew tests its existing tapes regularly, we’re assured.
"This is the future of web archiving," says Thomas, picking up a tape, adding, "at the moment".
As an aside, the 1,000-year-old Domesday book and 800-year-old Magna Carta are both written on sheepskin and are still readable to anyone with a passing familiarity with Latin and the urge to pop down to the Museum at Kew. HP and Sun have plenty to prove.
So, what of those shelves of files, books, parchment and the rest? Shouldn’t the National Archives simply digitise the lot, then leave the originals mouldering in a vast annexe to a small room full of tapes?
Unlikely. Before a collection can be digitised, and therefore served up to the website and committed to tape, it has to be prepared for imaging. This is a conservation job in itself. Capturing the image is just part of the process - you then have to produce the appropriate files, transcribe them and prepare them for publication.