What's in a (gene) name? That which we call a gene by any other name would confuse a researcher

If you had told me that I would spend my PhD years studying a gene called Falafel, I probably would not have believed you. Yet that is exactly what happened (I also briefly studied a gene called Bazooka). When working with fruit flies, researchers often come up with entertaining names for newly discovered genes; however, the names of these same genes in mammals can be quite different. For instance, Falafel is called PP4r3 in humans. This discrepancy in gene names (also called gene symbols) can be confusing, and part of the Monarch mission is to make it easier to connect genotype data across species. As a researcher, it can be hard to remember what a gene is called in different species, and the problem becomes harder still if a gene name is changed. Thankfully, gene names are changed infrequently, and there are groups committed to ensuring that gene names are systematic and regulated. Recently, however, I was prompted to think of alternative names for MARCH7, a gene discovered by Monarch Principal Investigator Melissa Haendel in the 1990s.

Why does the name of a gene change? There are several reasons why a gene name might be changed or updated. For instance, if a newly discovered gene has no known function but is later found to be part of a family of genes, it could be renamed to match the family it now belongs to. This was the case for MARCH7, which Haendel discovered during her PhD work. Haendel originally named the gene Axotrophin, but Axotrophin was later found to be a member of the MARCH (membrane associated ring-CH-type finger) family of genes and was renamed. Now MARCH7 is about to be renamed yet again. The HUGO Gene Nomenclature Committee has recently determined that MARCH7, along with several other genes, will be renamed because the gene symbol MARCH7 gets corrupted when used within Microsoft Excel (a tool popular among researchers).

The Excel corruption issue occurs when a gene symbol is recognized as a date and the original text string is irrevocably overwritten. For example, in the MARCH family of genes, MARCH7 is converted to the number 42801, which is then displayed as 7-March. Because downstream software no longer recognizes 42801 as a gene name at all, later analyses come out wrong. This formatting error befalls other gene families as well: SEP, SEPT, APR, MAR, DEC, NOV, and OCT. While HUGO recognizes that this is not a traditional reason to change the name of a gene, the change has been deemed necessary.
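To see where a number like 42801 comes from, here is a minimal Python sketch of Excel's default (1900) date system, in which a date is stored as a count of days. The function names are my own, and the 1899-12-30 origin is the usual trick for sidestepping Excel's treatment of 1900 as a leap year, so it only holds for dates from March 1900 onward.

```python
from datetime import date, timedelta

# Assumption: the workbook uses Excel's default 1900 date system.
# Counting from 1899-12-30 absorbs Excel's fictitious 1900-02-29,
# so the mapping is correct for dates on or after 1900-03-01.
EXCEL_ORIGIN = date(1899, 12, 30)


def serial_to_date(serial: int) -> date:
    """Convert an Excel day count back into a calendar date."""
    return EXCEL_ORIGIN + timedelta(days=serial)


def date_to_serial(d: date) -> int:
    """Convert a calendar date into the day count Excel stores."""
    return (d - EXCEL_ORIGIN).days


print(serial_to_date(42801))             # 2017-03-07
print(date_to_serial(date(2017, 3, 7)))  # 42801
```

The serial 42801 corresponds to 7 March 2017. Excel fills in a year when none is typed, so the same symbol can turn into a different number depending on when it is entered, and nothing in the stored number points back to the text MARCH7.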

There is another formatting issue in Excel that affects a subset of genes: those named with RIKEN identifiers. These identifiers take the form “nnnnnnnenn”, where n is a digit, for example 3400000e12. Excel reads such identifiers as numbers in scientific notation and converts them to floating-point numbers; 3400000e12, for instance, becomes 3.4E+18. These conversions are irreversible: once the value has changed, the user can no longer get the original gene name back.
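The same ambiguity can be reproduced outside of Excel. The sketch below is my own illustration (the pattern and function names are not from any published tool): it flags identifiers of the seven-digits, "e", two-digits shape and shows that a general-purpose numeric parser will happily read them as scientific notation.

```python
import re

# Seven digits, the letter e (either case), then two digits,
# e.g. 3400000e12. This pattern is an illustrative assumption.
RIKEN_LIKE = re.compile(r"^\d{7}[eE]\d{2}$")


def looks_like_riken_id(symbol: str) -> bool:
    """True if the symbol has the digits-e-digits shape described above."""
    return bool(RIKEN_LIKE.match(symbol))


def as_number(symbol: str):
    """Return the float a numeric parser would produce, or None."""
    try:
        return float(symbol)
    except ValueError:
        return None


symbol = "3400000e12"
print(looks_like_riken_id(symbol))   # True
print(as_number(symbol) == 3.4e18)   # True: 3400000 x 10^12
print(as_number("MARCH7"))           # None: not numeric, but still date-like
```

Once only the number is stored, the original spelling is gone: any other string that denotes the same value, such as the made-up 340000e13, parses to exactly the same float.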

Blaming Excel for these errors might be the easy thing to do, but researchers have the responsibility to ensure that their data is accurate. There are several workarounds that researchers can take advantage of to limit these identifier errors. In 2004, Zeeberg and colleagues published steps to stop the automatic reformatting of gene names and also shared a programming script that can detect if a gene name has accidentally been converted into a date or into a floating number format. But it seems that researchers are not taking advantage of these resources. A recent article by Ziemann et al. examined lists of gene names from 18 journals published in the last 10 years and found that almost 20% of papers with gene lists had erroneous gene names in those lists. Ultimately, HUGO has decided that the best solution for this gene symbol debacle is to change the names of these problematic genes.
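Checking a gene list for this kind of damage does not take much code. What follows is not the Zeeberg script, just a rough sketch of the idea under my own assumptions: the three patterns, the labels, and the function name are all hypothetical, and a real check would need tuning for the symbols in your own data.

```python
import re

MONTHS = "jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec"

# Entries such as 7-Mar or Sep-2, as Excel displays converted dates.
DATE_LIKE = re.compile(
    rf"^(\d{{1,2}}[-/ ]({MONTHS})|({MONTHS})[-/ ]\d{{1,2}})$", re.IGNORECASE
)
# Entries such as 3.4E+18, as Excel displays large numbers.
FLOAT_LIKE = re.compile(r"^\d(\.\d+)?E\+?\d+$", re.IGNORECASE)
# Bare five-digit numbers such as 42801, which may be date serials.
SERIAL_LIKE = re.compile(r"^\d{5}$")

CHECKS = [("date", DATE_LIKE), ("float", FLOAT_LIKE), ("serial", SERIAL_LIKE)]


def find_suspect_symbols(symbols):
    """Return (symbol, reason) pairs for entries that look converted."""
    suspects = []
    for symbol in symbols:
        text = symbol.strip()
        for reason, pattern in CHECKS:
            if pattern.match(text):
                suspects.append((symbol, reason))
                break
    return suspects


gene_list = ["MARCH7", "7-Mar", "3.4E+18", "42801", "SEPT2", "2-Sep"]
for symbol, reason in find_suspect_symbols(gene_list):
    print(f"{symbol}: looks like a {reason}")
```

A check like this can only flag suspicious entries. It cannot restore the original symbols, which is why it is worth running before a gene list is shared or published.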

So now the researchers who are most familiar with the MARCH family of genes have been tasked with renaming these gene symbols. What should the new symbol for MARCH7 be? One suggestion is MAUL; our own Melissa Haendel supports this name because, as she said, “Axotrophin killed everything I put it in!” While the semantic future of MARCH7 is yet to be determined, we do know that these gene symbol changes will have far-reaching effects. In my blog post next week, I will discuss some of these ramifications and delve deeper into the problems caused by divergent gene symbols.
