Geneticists throw hands in the air, change gene naming rules to finally stop Microsoft Excel eating their data

Spreadsheet woes spanning 16+ years force official update

Geneticists have issued new guidelines for naming human genes – after spending years wrestling with Microsoft Excel and similar software that automatically converts gene names to dates.

The Gene Nomenclature Committee of the Human Genome Organisation (HUGO), which sets the standard for the titles and shorthand labels of human genes, updated its rules this week to curb any further damage to gene databases stored in spreadsheets.

For example, the gene known as “deleted in esophageal cancer 1” was previously designated DEC1 for short. Type that into Excel, however, and it’ll be quietly auto-corrected to 1-Dec. Same goes for other genes with symbols like SEPT2 and MARCH1: they’re automatically formatted to 2-Sep or 1-Mar. These genes have now been changed to DELEC1, SEPTIN2, and MARCHF1 respectively to avoid being mangled by spreadsheets. This is apparently easier than changing the format of cells in Excel.
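
To make the pattern concrete, here is a rough Python sketch that flags symbols shaped like a month followed by a day number. The list of month prefixes and the matching rule are our own assumptions for illustration; they are not Excel's actual parsing logic, and the function name is hypothetical.

```python
import re

# Month names and abbreviations assumed to trigger date conversion.
# This list is an illustrative guess, not Excel's actual parsing rules.
MONTH_PREFIXES = [
    "JAN", "FEB", "MAR", "MARCH", "APR", "APRIL", "MAY", "JUN", "JUNE",
    "JUL", "JULY", "AUG", "SEP", "SEPT", "OCT", "NOV", "DEC",
]

# A month prefix, an optional hyphen, then one or two digits, e.g. DEC1.
DATE_LIKE = re.compile(
    r"(?:" + "|".join(MONTH_PREFIXES) + r")-?\d{1,2}", re.IGNORECASE
)

# Old-to-new symbols quoted in the article.
RENAMED = {"DEC1": "DELEC1", "SEPT2": "SEPTIN2", "MARCH1": "MARCHF1"}


def looks_like_date(symbol):
    """Return True if the symbol resembles a month followed by a day number."""
    return DATE_LIKE.fullmatch(symbol.strip()) is not None


# Each old symbol is flagged; each replacement is not.
for old, new in RENAMED.items():
    print(old, looks_like_date(old), new, looks_like_date(new))
```

Under this heuristic the three old symbols look like dates and their replacements do not, which is the point of the renaming.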

Why is this all such a headache? The automatic date corrections can cause software parsing the spreadsheets to skip over or misinterpret genes, ruining analysis work. There are other annoyances: scientists searching for particular genes by name may not see entries that have been corrupted. Genetic datasets are often shared as text or CSV files, and bioinformatics scientists have relied on Excel for years to organize their materials. Bring the two together and disaster silently strikes.
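
A simple check can at least reveal that the damage has happened. The Python sketch below scans a list of values for the day-month form quoted above, such as 1-Dec. The pattern covers only that one display format and the sample column is made up, so treat it as a starting point rather than a complete detector.

```python
import re

MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"

# Matches the day-month form quoted in the article, e.g. "1-Dec" or "2-Sep".
CONVERTED = re.compile(rf"^\d{{1,2}}-(?:{MONTHS})$")


def find_converted(symbols):
    """Return (row_index, value) pairs that look like date-converted symbols."""
    return [(i, s) for i, s in enumerate(symbols) if CONVERTED.match(s.strip())]


# Hypothetical column of gene symbols, two of them already mangled.
column = ["DEC1", "2-Sep", "SEPTIN2", "1-Mar"]
print(find_converted(column))  # [(1, '2-Sep'), (3, '1-Mar')]
```

Flagging is the easy part. Restoring the original symbol may not be possible from the date alone, since more than one symbol may collapse to the same date.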

"There are lots of better alternatives," Neil Saunders, a data scientist who sounded the alarm about genetic mishaps with Excel back in 2012, told The Register today. "But Excel is on their computers and they feel familiar with it, even if they can't actually use it properly. Biologists in particular are reluctant to invest time in learning programming skills."

The auto-correct issue may affect only a small subset of genes, those with names that resemble dates, but it has a wider impact on scientific research and clinical trials. The Verge first reported the change in policy earlier today.

Fed up and at their wits' end, academics have for years written scientific papers lamenting the issue and griped about it on internet forums. A study published in the journal BMC Bioinformatics 16 years ago found auto-correction affected at least 30 gene names.

“Users of Excel for analyses involving gene names should be aware of this problem, which can cause genes, including medically important ones, to be lost from view and which has contaminated even carefully curated public databases,” the paper’s authors warned.

Another paper, published in 2016 in Genome Biology, found that approximately 20 per cent of more than 3,500 papers published in 18 journals contained Excel files riddled with gene name errors. The auto-correct snag is so common that the HUGO Gene Nomenclature Committee even made a YouTube video with step-by-step instructions on how to avoid the problem when opening gene datasets in Excel.

"It's often pointed out that the problem is entirely avoidable, by setting Excel column type when importing CSV files," Saunders told us. "But no one does this – they just click on a file name, it opens in Excel – boom, the damage is done." He blames Microsoft for the blunders. "Really I think the issue is that non-explicit auto-conversion of data types is a bad default software behavior."

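Saunders' point about explicit types applies outside Excel too. As a minimal sketch of the alternative, Python's standard csv module returns every field as a plain string by default, so nothing is guessed or converted. The sample data and the column names here are hypothetical.

```python
import csv
import io

# Hypothetical CSV content; the column names are made up for illustration.
RAW = "symbol,score\nSEPT2,0.42\nMARCH1,1.70\nDEC1,0.05\n"


def read_symbols(text):
    """Parse CSV text and return the symbol column as untouched strings."""
    reader = csv.DictReader(io.StringIO(text))
    return [row["symbol"] for row in reader]


print(read_symbols(RAW))  # ['SEPT2', 'MARCH1', 'DEC1']
```

The symbols come back exactly as written, because any conversion to numbers or dates has to be requested explicitly by the person writing the code.
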
Now, the scientific world has taken firm steps to eradicate the problem. The new guidelines state that each shortened gene name, known as a symbol, must not be the same as “commonly used abbreviations.” Symbols must also contain only uppercase Latin letters and Arabic numerals, and cannot be offensive or derogatory.

“The HGNC considers the naming of each and every gene on a case-by-case basis, and deviations from these guidelines may be made given sufficient evidence that the nomenclature will ultimately aid in communication and data retrieval,” the committee said.
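
Of the rules described above, only the character rule lends itself to an automated check. The Python sketch below tests just that one rule as this article states it; the function name is hypothetical, and the full guidelines may contain details not covered here.

```python
import re

# Uppercase Latin letters and Arabic numerals only, as described in the article.
SYMBOL_RULE = re.compile(r"[A-Z0-9]+")


def follows_character_rule(symbol):
    """Return True if the symbol uses only uppercase letters A-Z and digits 0-9."""
    return SYMBOL_RULE.fullmatch(symbol) is not None


for candidate in ["SEPTIN2", "septin2", "1-Dec"]:
    print(candidate, follows_character_rule(candidate))
```

Whether a symbol clashes with a common abbreviation or causes offense still needs human judgment, and the committee's case-by-case policy means a failed check is not necessarily a rejected name.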

"Personally I think that changing the gene symbols is not a great solution," Saunders told us. "But given that Microsoft won't change its default Excel behavior and 16-plus years of attempts to educate biologists on the issue have failed, I suppose it is a practical solution." ®

PS: We warned you all about this in 2004...
