
How Monarch Integrates and Curates Biological Data


As with most biomedical databases, the first step is to identify relevant data from the research community. The Monarch Initiative is focused primarily on phenotype-related resources. We bring in data associated with those phenotypes so that our users can begin to make connections among other biological entities of interest, such as:
  • genes
  • genotypes
  • gene variants (including SNPs, SNVs, QTLs, CNVs, and other rearrangements big and small)
  • models (including cell lines, animal strains, species, breeds, as well as targeted mutants)
  • pathways
  • orthologs
  • phenotypes
  • publications

We import data on a monthly schedule from a variety of sources, in formats including databases, spreadsheets, delimited text files, XML, JSON, and Web APIs, and load it into a Postgres database (hosted by the NIF). Our curation team then semantically maps each resource into our data model, primarily using ontologies. This involves three kinds of work: typing the relevant columns, mapping between columns (such as between identifiers and labels, but also more complex associations, such as between a genotype-phenotype association and the publication in which it was reported), and value-level mapping. Because our focus is on genotype-phenotype data, we concentrate on ensuring that each resource's variants, genes, genotypes, strains, and phenotypes are well-typed using ontologies and standardized identifiers. Internally, we map all genes to NCBI Gene identifiers, diseases to the Disease Ontology, and phenotypes into our unified phenotype ontology, Uberpheno.
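To make the three mapping steps concrete, here is a minimal sketch in Python. It is not our actual curation code: the column names, the `map_row` helper, the `NCBIGene:` and `PMID:` prefixes, and the `EX:` term identifiers are all assumptions chosen for illustration.

```python
# Hypothetical column typing for one imported source: source column -> semantic type.
COLUMN_TYPES = {
    "gene_symbol": "gene_label",
    "entrez_id": "gene_id",
    "phenotype": "phenotype_label",
    "pmid": "publication_id",
}

# Hypothetical value-level mapping: free-text phenotype labels -> ontology term IDs.
# The IDs below are placeholders, not real ontology terms.
PHENOTYPE_TERMS = {
    "small ears": "EX:0000001",
    "hearing loss": "EX:0000002",
}


def map_row(row):
    """Turn one raw source row into a typed, identifier-normalized association."""
    # Column typing: keep only the columns we have typed, under their semantic names.
    typed = {
        COLUMN_TYPES[col]: value.strip()
        for col, value in row.items()
        if col in COLUMN_TYPES
    }
    # Value-level mapping: look up the phenotype label, ignoring case.
    label = typed["phenotype_label"].lower()
    # Column-to-column mapping: pair identifiers with labels and attach the publication.
    return {
        "gene": {"id": f"NCBIGene:{typed['gene_id']}", "label": typed["gene_label"]},
        "phenotype": {"id": PHENOTYPE_TERMS.get(label), "label": typed["phenotype_label"]},
        "publication": f"PMID:{typed['publication_id']}",
    }


raw = {
    "gene_symbol": "GENE1",
    "entrez_id": " 12345 ",
    "phenotype": "Hearing loss",
    "pmid": "999",
    "notes": "untyped columns are ignored",
}
association = map_row(raw)
print(association)
```

A label with no entry in the lookup table comes back with an `id` of `None`, which is a simple way to flag values that still need a curator's attention.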


The Monarch Initiative data workflow.

With many resources integrated into a single database, we can join across the various data sources to produce integrated views. We have started with the big players including ClinVar and OMIM, but are equally interested in boutique databases (which you will see more of in the coming months). You can learn more about the sources of data that populate our system from our sources page.
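As a rough illustration of joining across sources, the sketch below uses Python's built-in `sqlite3` module as a stand-in for the Postgres database. The table names, columns, and rows are invented for the example and do not reflect our actual schema or data.

```python
import sqlite3

# In-memory SQLite database standing in for the Postgres instance.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source_a (gene_id TEXT, disease TEXT);
    CREATE TABLE source_b (gene_id TEXT, phenotype TEXT);
    INSERT INTO source_a VALUES ('NCBIGene:1', 'disease X'), ('NCBIGene:2', 'disease Y');
    INSERT INTO source_b VALUES ('NCBIGene:1', 'phenotype P'), ('NCBIGene:3', 'phenotype Q');
""")

# Because both sources use the same normalized gene identifier,
# a plain join yields an integrated view.
integrated = conn.execute("""
    SELECT a.gene_id, a.disease, b.phenotype
    FROM source_a AS a
    JOIN source_b AS b ON a.gene_id = b.gene_id
    ORDER BY a.gene_id
""").fetchall()
print(integrated)
conn.close()
```

The join only works because both tables refer to genes by the same identifier scheme, which is why the mapping to standardized identifiers comes first.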

Once the data is curated, we generate views and semantically index them into a Solr instance. The data is then served to our Monarch application via REST services through NIF, which our web application wraps. That way, when a user is interested in exploring abnormalities of the ear, a single query can retrieve all relevant data from the system.
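One way to picture semantic indexing is to index each record under its annotated term and under every ancestor of that term, so that a query for a general term also finds records annotated with more specific ones. The sketch below assumes that approach and uses a toy in-memory index in place of Solr. The hierarchy, labels, and record IDs are illustrative only.

```python
from collections import defaultdict

# Toy is-a hierarchy (child -> parents); labels are illustrative, not real ontology terms.
PARENTS = {
    "small ears": ["abnormality of the outer ear"],
    "abnormality of the outer ear": ["abnormality of the ear"],
    "hearing loss": ["abnormality of the ear"],
    "abnormality of the ear": [],
    "short stature": [],
}


def closure(term):
    """Return the term plus all of its ancestors."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, []))
    return seen


def build_index(records):
    """Index each record under its annotated term and every ancestor of that term."""
    index = defaultdict(set)
    for record_id, term in records:
        for ancestor in closure(term):
            index[ancestor].add(record_id)
    return index


records = [("rec1", "small ears"), ("rec2", "hearing loss"), ("rec3", "short stature")]
index = build_index(records)

# A single lookup on the general term returns records annotated with its descendants.
ear_hits = sorted(index["abnormality of the ear"])
print(ear_hits)
```

Expanding the ancestors at index time keeps the query itself simple: the lookup is a single key, with no ontology traversal needed when the user searches.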

Since all of our data is curated using ontologies, we are currently exploring the use of a graph database (based on Neo4j) to serve up all our data and ontologies. This will have the side benefit of providing the community our semantically mapped data in RDF.
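To give a sense of what semantically mapped data in RDF might look like, here is a small sketch that serializes associations as N-Triples, a standard line-based RDF format. The `http://example.org/` namespace, the predicate names, and the `to_ntriples` helper are placeholders, not our actual vocabulary or export code.

```python
def to_ntriples(triples):
    """Serialize (subject, predicate, object) IRI triples as N-Triples lines."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


BASE = "http://example.org/"  # placeholder namespace, not a real Monarch IRI

# Each association becomes one or more subject-predicate-object statements.
triples = [
    (BASE + "genotype/1", BASE + "has_phenotype", BASE + "phenotype/2"),
    (BASE + "genotype/1", BASE + "has_affected_gene", BASE + "gene/3"),
]
rdf = to_ntriples(triples)
print(rdf)
```

This sketch handles IRIs only. Literal values such as labels would need proper escaping, which an RDF library would normally take care of.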
