Excel ate my DNA
Autoformatting black hole
Genetic research is being hampered by a smart formatting function in Excel, according to US researchers.
The problem, which can cause medically important genes to be hidden from view, is widespread, and has affected some public databases, including the gene expression data on the NCBI LocusLink database in the US, the researchers say.
Excel is widely used in genetic research to process microarray data. A microarray chip detects the amount of RNA transcribed from thousands of different genes, enabling researchers to see which genes are being expressed in a sample of diseased tissue, for example.
The errors are introduced because some genetic identifiers look very much like dates to Excel. If the spreadsheet is not properly set up, it will convert an identifier such as SEPT2 to a date: 2-Sep. The conversion, the researchers say, is irreversible: once the error has been introduced, the original data is gone.
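As a rough illustration of how data received from elsewhere might be screened for this kind of damage, the Python sketch below flags values that look like Excel's day-month date display, such as 2-Sep. The function name `find_date_like` and the sample column are hypothetical, and the pattern assumes English month abbreviations; it is not the researchers' own method.

```python
import re

# English month abbreviations as they appear in a day-month display like "2-Sep".
MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

# Matches one or two digits, a hyphen, then a month abbreviation.
DATE_LIKE = re.compile(r"^\d{1,2}-(%s)$" % "|".join(MONTHS))

def find_date_like(values):
    """Return the values that look like date-formatted cells (hypothetical helper)."""
    return [v for v in values if DATE_LIKE.match(v.strip())]

# Hypothetical column of gene identifiers, two of them already mangled.
column = ["TP53", "2-Sep", "BRCA1", "1-Dec", "SEPT2"]
print(find_date_like(column))  # ['2-Sep', '1-Dec']
```

A check like this can only report that a value has probably been converted. It cannot say which identifier was lost, which is why the conversion is described as irreversible.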
In a paper published on BioMedCentral, Zeeberg et al. explain that they noticed some identifiers were being converted to non-gene names.
"A little detective work traced the problem to default date format conversions and floating-point format conversions in the very useful Excel program package," they write. "The date conversions affect at least 30 gene names; the floating-point conversions affect at least 2,000 if Riken identifiers are included."
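The floating-point case can be screened for before a file ever reaches a spreadsheet. The sketch below assumes the at-risk identifiers have the shape digits, the letter E, then more digits, which a spreadsheet can read as a number in scientific notation. The helper name `risky_identifiers` and the sample strings are illustrative assumptions, not values taken from the paper.

```python
import re

# Assumed shape of an at-risk identifier: digits, "E" or "e", digits.
# A string like this is also a valid number in scientific notation.
SCI_NOTATION_LIKE = re.compile(r"^\d+[Ee]\d+$")

def risky_identifiers(identifiers):
    """Return identifiers that could be parsed as floating-point numbers (hypothetical helper)."""
    return [i for i in identifiers if SCI_NOTATION_LIKE.match(i)]

# Illustrative identifiers only.
sample = ["2310009E13", "SEPT2", "1200014E20", "AB012345"]
print(risky_identifiers(sample))  # ['2310009E13', '1200014E20']
```

Identifiers flagged this way are candidates for importing as text rather than letting the spreadsheet guess their type.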
The researchers suggest several workarounds for the problem in their paper, but caution that despite these "even the most vigilant investigator can inadvertently introduce conversion errors, and it is often necessary to screen data received from other sources". ®