
What's in a (gene) name? That which we call a gene by any other name would confuse a researcher

If you had told me that I would spend my PhD years studying a gene called Falafel, I probably would not have believed you. Yet that is exactly what happened (I also briefly studied a gene called Bazooka). Researchers who work with fruit flies often come up with entertaining names for newly discovered genes; however, the same genes can go by quite different names in mammals. For instance, Falafel is called PP4r3 in humans. This discrepancy in gene names (also called gene symbols) can be confusing, and part of the Monarch mission is to ease cross-talk between genotype data from different species. As a researcher, it can be hard to remember what a gene is called in different species, and the problem gets worse when a gene name is changed. Thankfully, gene names change infrequently, and there are groups committed to keeping them systematic and regulated. Recently, however, I was prompted to think of alternative names for MARCH7, a gene discovered by Monarch Principal Investigator Melissa Haendel in the 1990s.

Why does the name of a gene change? There are several reasons why a gene name might be changed or updated. For instance, if a newly discovered gene has no known function but is later found to belong to a family of genes, it can be renamed to match that family. This is the case for MARCH7, which Haendel discovered during her PhD work. She originally named the gene Axotrophin, but Axotrophin was later found to be a member of the MARCH (membrane-associated RING-CH-type finger) family of genes and was renamed. Now MARCH7 is about to be renamed yet again. The HUGO Gene Nomenclature Committee has recently determined that MARCH7, along with several other genes, will be renamed because the gene symbol gets corrupted when it is entered into Microsoft Excel, a tool popular among researchers.

The Excel corruption issue occurs when a gene symbol is recognized as a date and the original text string is irrevocably overwritten. For example, MARCH7 is converted to the date serial number 42801, which is then displayed as 7-Mar. Because downstream software no longer recognizes 42801 as a gene name, later analyses come out wrong. This formatting error befalls other gene families as well: SEP, SEPT, APR, MAR, DEC, NOV, and OCT. While HUGO recognizes that this is not a traditional reason to change the name of a gene, the change has been deemed necessary.
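To make the number 42801 less mysterious, here is a minimal Python sketch of the date arithmetic behind it. It assumes Excel's default 1900 date system, and the function names are made up for illustration. Under that assumption, 42801 is the serial number for 7 March 2017, so the exact number a spreadsheet stores for MARCH7 depends on the year in which the symbol was typed.

```python
from datetime import date, timedelta

# Assumption: Excel's default "1900 date system". Using 30 December 1899 as
# day zero reproduces Excel's serial numbers for any date after
# 28 February 1900 (the offset absorbs Excel's phantom 29 February 1900).
EXCEL_DAY_ZERO = date(1899, 12, 30)


def date_to_serial(d: date) -> int:
    """Return the serial number a spreadsheet would store for this date."""
    return (d - EXCEL_DAY_ZERO).days


def serial_to_date(serial: int) -> date:
    """Return the calendar date that a stored serial number stands for."""
    return EXCEL_DAY_ZERO + timedelta(days=serial)


print(date_to_serial(date(2017, 3, 7)))   # 42801
print(serial_to_date(42801).isoformat())  # 2017-03-07
print(date_to_serial(date(2016, 3, 7)))   # 42436: a different year, a different number
```

Nothing in the stored number points back to the text "MARCH7", which is why the conversion cannot be undone from the spreadsheet alone.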

There is another formatting issue in Excel that affects a subset of genes: those named with RIKEN identifiers. These identifiers take the form “nnnnnnnenn”, where n is a digit, for example 3400000e12. Excel reads such an identifier as a number in scientific notation and converts it into a floating-point number; 3400000e12, for instance, becomes 3.4E+18. These conversions are irreversible: once the value has changed, the user can no longer get the original gene name back.
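The same thing can be reproduced outside of a spreadsheet. The sketch below uses Python's own number parsing as a stand-in for what Excel does; the pattern and function name are assumptions chosen for illustration, not an official definition of a RIKEN identifier. Read as scientific notation, 3400000e12 means 3400000 × 10^12.

```python
import re

# Assumption: the "nnnnnnnenn" shape described above, i.e. seven digits,
# the letter e (either case), then two digits.
RIKEN_LIKE = re.compile(r"\d{7}[Ee]\d{2}")


def looks_like_riken_id(text: str) -> bool:
    """True if the text has the seven-digits-e-two-digits shape."""
    return RIKEN_LIKE.fullmatch(text) is not None


gene_id = "3400000e12"
as_number = float(gene_id)          # parsed as 3400000 x 10^12
shown = format(as_number, ".2E")    # scientific notation with two decimals

print(looks_like_riken_id(gene_id))  # True
print(as_number == 3.4e18)           # True
print(shown)                         # 3.40E+18
print(looks_like_riken_id(shown))    # False: the identifier shape is gone
```

Once only the number is stored, the original spelling (how many digits came before the "e", and which exponent was written) is no longer available.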

Blaming Excel for these errors might be the easy thing to do, but researchers are responsible for ensuring that their data are accurate. There are several workarounds that researchers can use to limit these identifier errors. In 2004, Zeeberg and colleagues published steps to stop the automatic reformatting of gene names and also shared a script that can detect whether a gene name has accidentally been converted into a date or a floating-point number. But it seems that researchers are not taking advantage of these resources. A recent article by Ziemann et al. examined gene lists in papers published over the last 10 years in 18 journals and found that almost 20% of papers with gene lists had erroneous gene names in those lists. Ultimately, HUGO has decided that the best solution for this gene symbol debacle is to change the names of the problematic genes.
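To give a flavor of what such a check might look like, here is a small Python sketch. It is not the script from Zeeberg and colleagues; the patterns and the function name are my own assumptions, and they only catch the two display forms discussed above (day-month strings such as 7-Mar and scientific notation such as 3.40E+18).

```python
import re

# Hypothetical patterns, chosen for illustration only.
MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"
DATE_LIKE = re.compile(rf"\d{{1,2}}-(?:{MONTHS})", re.IGNORECASE)
SCI_LIKE = re.compile(r"\d(?:\.\d+)?E\+?\d+", re.IGNORECASE)


def flag_suspect_symbols(symbols):
    """Return the entries that look like dates or scientific notation."""
    flagged = []
    for symbol in symbols:
        text = symbol.strip()
        if DATE_LIKE.fullmatch(text) or SCI_LIKE.fullmatch(text):
            flagged.append(symbol)
    return flagged


gene_list = ["TP53", "7-Mar", "BRCA1", "3.40E+18", "1-Sep", "3400000e12"]
print(flag_suspect_symbols(gene_list))  # ['7-Mar', '3.40E+18', '1-Sep']
```

An intact RIKEN-style identifier such as 3400000e12 is left alone, because the check looks for the converted forms rather than the originals. A real check would also need to handle other date layouts and bare serial numbers such as 42801, which this sketch does not attempt.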

So now the researchers who are most familiar with the MARCH family of genes have been tasked with renaming these gene symbols. What should the new symbol for MARCH7 be? One suggestion is MAUL; our own Melissa Haendel supports this name because, as she said, “Axotrophin killed everything I put it in!” While the semantic future of MARCH7 is yet to be determined, we do know that these gene symbol changes will have far-reaching effects. In my blog post next week, I will discuss some of these ramifications and delve deeper into the problems caused by divergent gene symbols.
