Deep-root database: Kew Gardens' 8 million-specimen collection to find new life through data management

Finally, a digital transformation that really matters


Charles Darwin's legacy lives not just in the idea of evolution by natural selection, but also in the samples he collected.

London's Royal Botanic Gardens in Kew house some 191 samples gathered by Darwin, including an Adiantum henslovianum fern collected from the Galápagos in 1835. It is among 8 million specimens, some of which date back 200 years, in line to be digitised at the internationally important botanical research and education institution owned by the UK government.

Owning one of the world's largest and oldest specimen collections poses a problem for modern science, though.

Kew's Paul Kersey, deputy director of science, told The Register that although scientists could request access to specimens for DNA sequencing, it was difficult for them to know what was available.

A specimen of Adiantum henslovianum (collected from the Galápagos) is one of 191 samples collected by Darwin that are currently digitised. © Copyright of the Board of Trustees of the Royal Botanic Gardens, Kew

To try to crack this particular nut, the institution has gone to tender to find a cataloguing system that will integrate with its current databases and use specimen metadata to help researchers find the specimens useful to their studies.

"In the absence of a comprehensive catalogue they can't see everything we have, nor can they see what sequence data may have already been produced by specimens," Kersey said. "These are two of the key problems we hope to solve with the [content management system]."

As the tender document puts it, Kew holds a "vast and growing collection of plant and fungal data and databases that store information on specimens, names, taxonomy, traits, distributions, phylogenies, phenology and conservation... A key challenge to our collections is the integration of currently disparate systems, to facilitate more efficient curation and management."

For its collections databases, Kew uses Sybase Adaptive Server Enterprise and MariaDB on in-house VMware Linux clusters running CentOS Linux release 7. Google Cloud Platform also hosts other services. The plan is to create an overarching cataloguing system to get information about the whole collection into databases. Only about one in eight of the specimens currently has a digital record associated with it.

"As an organisation with very long roots in botanical and mycological sciences, we have not historically pushed development of data technologies sufficiently far up our corporate agenda," Kersey said. "Not all our collection is digitised in any form, and not all our collection is even in a database. So, for many of our herbarium specimens, the indexing system is that specimens are put into a cupboard according to which species they belong and you can only find what you've got by looking in the appropriate cupboard."

Kew said it expects a full digitisation of its collection, together with the completion of the cataloguing system and a front-end website, to cost £35m. Not all of the cash is in place, as the institution relies on competitive grant funding, private philanthropy, and government funding. "We were pursuing digitisation as a top priority of our scientific future to maximise scientific impact of our institute," Kersey said.

The price may be high but it could be a worthwhile investment. The long-term goal is to create "virtual specimens", including the image, metadata, DNA sequences, and molecular profiles, which are stored with the specimen, and made searchable and publicly available.
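As a rough illustration of what a "virtual specimen" record might look like, here is a minimal Python sketch. It is not Kew's schema: the class name, field names, identifiers, and the `search` and `with_sequence_data` helpers are all assumptions made for this example, and only the four components (image, metadata, DNA sequences, molecular profiles) come from the description above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class VirtualSpecimen:
    """Hypothetical record bundling the four components of a virtual specimen."""
    specimen_id: str                                   # made-up identifier scheme
    image_uri: Optional[str] = None                    # digitised image, if one exists
    metadata: dict = field(default_factory=dict)       # e.g. species, collector, year
    dna_sequences: list = field(default_factory=list)  # sequence data, if any
    molecular_profiles: list = field(default_factory=list)


def search(specimens, **criteria):
    """Return specimens whose metadata matches every given key/value pair."""
    return [
        s for s in specimens
        if all(s.metadata.get(key) == value for key, value in criteria.items())
    ]


def with_sequence_data(specimens):
    """Return only the specimens that already have DNA sequence data attached."""
    return [s for s in specimens if s.dna_sequences]


# Illustrative catalogue: the first entry uses details from the article,
# the second is a placeholder with invented values.
catalogue = [
    VirtualSpecimen(
        specimen_id="example-0001",
        metadata={
            "species": "Adiantum henslovianum",
            "collector": "Charles Darwin",
            "year": 1835,
            "locality": "Galápagos",
        },
    ),
    VirtualSpecimen(
        specimen_id="example-0002",
        metadata={"species": "Placeholder species", "year": 1901},
        dna_sequences=["ACGT"],  # placeholder sequence, not real data
    ),
]

darwin_ferns = search(catalogue, collector="Charles Darwin", year=1835)
print([s.specimen_id for s in darwin_ferns])
print([s.specimen_id for s in with_sequence_data(catalogue)])
```

Keeping the metadata as a free-form dictionary is a simplification. A real catalogue would more likely map these fields onto an agreed biodiversity data standard and run the search inside the database rather than in application code.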

Kersey, who has a PhD in genetics and a background in bioinformatics, said it was a critical time to make such a deep history of scientific work available to researchers worldwide to help understand how plant and fungal biology has changed over the decades.

"It is particularly exciting in combining with DNA sequencing," he said, "because it's going to be possible to identify mutational changes in DNA, and you can identify how populations have shifted in response to climatic changes, in response to the introduction of pesticides, to increased nitrogen in the environment or so on.

"DNA sequencing gives you the ability to look at longitudinal changes in a deep molecular way, and then we can combine that with the metadata today, which allows us to put specimens in a given location on the planet Earth at a given time. We're very excited by the potential to really understand what's happening to biodiversity on our planet in the last couple of hundred years by applying these sorts of approaches."

The competitive tender for the cataloguing and content management system ends in September, with an evaluation and selection of a winner due to take place shortly afterwards.

With a system in place, and wider digitisation of specimens progressing, database and data management technologies offer the possibility that the gifts given by Darwin and other dedicated researchers can keep on giving. ®


Biting the hand that feeds IT © 1998–2020