Saturday, November 29, 2014

IMPC mouse knockout model phenotypes added

We have added phenotype data from the International Mouse Phenotyping Consortium (IMPC), whose goal is to discover functional insight for every mouse gene by generating and systematically phenotyping knockout mouse strains. The initial release includes 890 mice affecting 763 genes with 222 unique phenotypes. IMPC data will be updated approximately monthly.

IMPC data is presently accessible in the Monarch portal via Mouse gene pages (for example, Stk16, Gpr107, or Gpr22), or via phenotypic similarity comparison on disease pages (such as Sebastian Syndrome or Susceptibility to Malignant Hyperthermia 3).

You can read more about our data sources here.

ClinVar variant-disease associations added

We have added ClinVar variant-disease associations to our database; they were first released in the Monarch Initiative portal in November 2014. This new data accompanies the previously incorporated ClinVar gene-disease associations (which lack the specificity of the individual variations). The initial release includes 113,543 variants (SNPs, SNVs, CNVs, and other major rearrangements) linked to 13,591 genes and 11,154 diseases and phenotypes. The associations are also linked to the original submitters and to the publications in which the variations are reported. The data will be updated approximately monthly.

You can read more about our data sources here.

Wednesday, October 8, 2014

How Monarch Integrates and Curates Biological Data

As with most biomedical databases, the first step is to identify relevant data from the research community. The Monarch Initiative is focused primarily on phenotype-related resources. We bring in data associated with those phenotypes so that our users can begin to make connections among other biological entities of interest, such as:
  • genes
  • genotypes
  • gene variants (including SNPs, SNVs, QTLs, CNVs, and other rearrangements big and small)
  • models (including cell lines, animal strains, species, breeds, as well as targeted mutants)
  • pathways
  • orthologs
  • phenotypes
  • publications

We import data on a monthly schedule from a variety of sources, in formats including databases, spreadsheets, delimited text files, XML, JSON, and Web APIs, and place it into a Postgres database (hosted by the NIF). Our curation team semantically maps each resource into our data model, primarily using ontologies. This involves typing relevant columns, mapping between columns (such as between identifiers and labels, but also more complex associations, such as between a genotype-phenotype association and the publication in which it was mentioned), and value-level mapping. Because our focus is on genotype-phenotype data, we concentrate our efforts on ensuring that each resource's variants, genes, genotypes, strains, and phenotypes are well-typed using ontologies and standardized identifiers. Internally, we map all genes to NCBI gene identifiers, diseases to the Disease Ontology, and phenotypes into our unified phenotype ontology, Uberpheno.
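As a rough illustration of value-level mapping, the sketch below normalizes one row from a source file into standardized identifiers. It is not our actual pipeline code: the lookup tables, the column names (`gene_symbol`, `phenotype`, `pmid`), the `Association` class, and every identifier shown are placeholders invented for this example.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical lookup tables. The identifiers are placeholders,
# not real NCBI Gene or phenotype ontology accessions.
GENE_SYMBOL_TO_NCBI = {
    "geneA": "NCBIGene:0000001",
    "geneB": "NCBIGene:0000002",
}
PHENOTYPE_LABEL_TO_TERM = {
    "abnormal ear morphology": "PHENO:0000001",
}


@dataclass(frozen=True)
class Association:
    """A gene-phenotype association with an optional supporting publication."""
    gene_id: str
    phenotype_id: str
    publication: Optional[str]


def map_row(row: dict) -> Optional[Association]:
    """Map one source row to standardized identifiers.

    Returns None when the gene or phenotype cannot be mapped, so that
    unmapped rows can be set aside for a curator instead of being guessed.
    """
    gene_id = GENE_SYMBOL_TO_NCBI.get(row["gene_symbol"].strip())
    phenotype_id = PHENOTYPE_LABEL_TO_TERM.get(row["phenotype"].strip().lower())
    if gene_id is None or phenotype_id is None:
        return None
    # An empty or missing publication field becomes None.
    return Association(gene_id, phenotype_id, row.get("pmid") or None)
```

Returning `None` for unmapped rows, rather than passing the raw label through, is a design choice for this sketch: it keeps free text out of the typed columns.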


The Monarch Initiative data workflow.

With many resources integrated into a single database, we can join across the various data sources to produce integrated views. We have started with the big players including ClinVar and OMIM, but are equally interested in boutique databases (which you will see more of in the coming months). You can learn more about the sources of data that populate our system from our sources page.

Once curated, we generate views and semantically index them into a Solr instance, and the data is served to our Monarch application via REST services through NIF. That way when a user is interested in exploring abnormalities of the ear, a single query can retrieve all relevant data from the system. Our web application wraps NIF’s REST services.
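To give a feel for what such a query might look like, here is a minimal sketch that builds a request URL for Solr's standard `select` handler. The host, the core name `monarch_views`, and the `category` field are assumptions made up for this example; they do not describe the actual NIF services or our index schema.

```python
from urllib.parse import urlencode


def build_solr_query(base_url, core, text, category=None, rows=25):
    """Build a URL for Solr's select handler.

    `core` and the `category` filter field are hypothetical names
    used only for illustration.
    """
    params = [("q", text), ("rows", str(rows)), ("wt", "json")]
    if category is not None:
        # fq restricts the result set to documents matching the filter.
        params.append(("fq", f'category:"{category}"'))
    return f"{base_url.rstrip('/')}/solr/{core}/select?{urlencode(params)}"


url = build_solr_query(
    "http://localhost:8983",      # assumed local Solr instance
    "monarch_views",              # hypothetical core name
    "abnormality of the ear",
    category="phenotype",
)
print(url)
```

The function only constructs the URL; sending it with an HTTP client and parsing the JSON response is left out.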

Since all of our data is curated using ontologies, we are currently exploring the use of a graph database (based on Neo4j) to serve up all our data and ontologies. This will have the side benefit of providing the community our semantically mapped data in RDF.

Monday, September 22, 2014

Monarch teaches at the International Summer School for Rare Disease Registries


Last week, I had the pleasure of teaching at the National Centre for Rare Diseases, hosted by the Istituto Superiore di Sanità and Dr. Domenica Taruscio. This rare disease registry course is in its second year, and is focused on exposing the maintainers of rare disease registries to various aspects of registry planning and management. I was very impressed with the way in which this course was run. The week started with a discussion of the different types of registries (aims, study design, data sources), management sustainability, and clinical outcomes analysis. This was followed by an innovative collaborative learning exercise in the afternoon, where the participants were broken up into three groups. The collaborative learning focused on positive interdependence, individual accountability, face-to-face interaction, group processing, and the exercise of small-group interpersonal skills. These are all skills needed to realize a quality registry resource, in addition to being a quality pedagogical approach. Each group had a different rare disease scenario for which they had to develop methods and strategies using what they had learned in the morning session. On each of the following mornings, they learned new content such as reference standards and catalogues, coding of rare diseases, omics links with biobanks, epidemiologic analyses and confounders, sample stratification, patient unique identifiers, quality assurance methods, data reporting and dissemination, and informed consent. Each afternoon, they applied these themes to their ongoing scenarios, so that the scenarios developed into robust, full-fledged registry plans by the end of the week. The teamwork was amazing, as was the instructor engagement throughout the process.

We capped the week off with a Monarch presentation on "The application of the Human Phenotype Ontology" (HPO), where we discussed why rare disease phenotyping needs something more than standard clinical coding systems can provide. Many rare disease phenotypes are sprinkled throughout the literature and clinical notes in completely non-computable ways. The HPO was designed to address this problem and provide a structure on which to perform bioinformatics analyses. Phenotype comparisons can be made between patients and known diseases, as shown in our recent paper where we used the HPO to help diagnose undiagnosed patients. Phenotype comparisons can also be made across species, to aid candidate prioritization in tools such as Exomiser. We also discussed the Global Alliance for Genomics and Health Matchmaker Exchange, and how the HPO is being used to identify cohorts in tools such as PhenomeCentral. Finally, we ended with a summary of tools being developed by Monarch to support quality assurance of phenotype data and to aid clinicians during the course of their phenotyping. We believe that the efforts Monarch is making to define an exchange standard for rare disease phenotyping will be of great value to the rare disease registry communities, and we look forward to working with them further on their data publication.



Friday, September 19, 2014

Monarch presenting at ASHG 2014, Oct 18-22, San Diego

We'll be heading to the American Society of Human Genetics (ASHG) 2014 conference in San Diego, October 18-22. Please check out our work in the following sessions:
  • 170. PhenomeCentral: An integrated portal for sharing patient phenotype and genotype data for rare genetic disorders. Mon Oct 20 5:30p. Concurrent Platform Session C: From Bytes To Phenotypes. Hall B1, Ground Level, Convention Center
    Michael Brudno will present the new data sharing portal PhenomeCentral, which facilitates the identification of phenotypically similar patients, utilizing the Human Phenotype Ontology (HPO) for linking patient phenotypes. Monarch contributes the API for the Annotation Sufficiency metric, actively contributes to the development of the HPO, and has provided user testing and documentation. Cases from our work with the NIH Intramural Undiagnosed Disease Program (UDP) have been deposited into PhenomeCentral.
  • 1499T. Standardized phenotyping enables rapid and accurate prioritization of disease-associated and previously unreported sequence variants. Tue Oct 21 2-3pm.
    William Bone will present our work with the NIH UDP, particularly about the use of Exomiser 2.0 as a rapid and effective method to screen for variants. The updated algorithm uses a combination of disease-gene and model organism phenotypes, together with protein-protein associations for candidate prioritization.
  • 1643T. Phenotype terminologies in use for genotype-phenotype databases: A common core for standardisation and interoperability. Tue Oct 21 2-3pm.
    Peter Robinson will present the efforts to develop a core terminology of phenotypes that is interoperable with all terminologies in current use including PhenoDB, London Dysmorphology Database, Orphanet, Human Phenotype Ontology, Elements of Morphology, ICD10, UMLS, SNOMED CT, MeSH, and MedDRA.
We will also be spending time at the Global Alliance for Genomics and Health pre-meeting, where we will participate in the Data and Clinical working group breakout sessions on metadata and ontologies.

Wednesday, September 17, 2014

NIEHS workshop on defining language standards for environmental health

This week Monarch team members co-chaired and attended a National Institute of Environmental Health Sciences (NIEHS) workshop on Development of a Framework for an Environmental Health Science Language (agenda & report). From Love Canal to Chernobyl, from the Clean Water Act to pending regulation of dietary supplements, what we breathe and what we eat is known to contribute to human health outcomes. Consistent capture, transmission, and analysis of these data for comprehensive use in multiple research and clinical environments depends upon standardization and integration of the data across multiple disciplines.

Because we need to compare phenotypes based upon both genotypes and environmental variables over time, Monarch is very interested in understanding ways to represent and integrate these data. We currently have a great diversity of model and human environmental data: reagents targeting specific gene products, physiological perturbations such as exposure to light, drug treatments, and environmental exposures to complex toxicological mixtures.

The goal of the workshop was to initiate a new working group that will focus on requirements and implementation of environmental vocabulary standards for describing these environments. We had an amazing keynote from Elaine Faustman, where she discussed metagenomic profiling of antibiotic resistance determinants in Puget Sound to assess impacts on both human health and the oceans. Now that is large-scale (global) data integration! We also had the pleasure of hearing Alexa McCray discuss her group's work on combining a large number of autism clinical instruments using an ontological approach. The aim is to better support analysis and reuse of clinical autism diagnostic data in combination with genomic data, in order to uncover elusive genetic and environmental correlations in autism patients.

And then there was the amusing example of how hard it is to simply find relevant specimens in NCBI BioSample Database due to lack of standardized language:
Query               # records
Feces               22,592
Faeces              1,750
Ordure              2
Dung                19
Manure              154
Excreta             153
Stool               22,756
Stool NOT faeces    21,798
Stool NOT feces     18,314
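The counts above also show how much a single-term search leaves behind. The short calculation below works this out, assuming that NOT behaves as a plain set difference and that all counts were taken from the same snapshot of the database; both are assumptions, not something the table itself states.

```python
# Record counts copied from the table above.
counts = {
    "feces": 22_592,
    "faeces": 1_750,
    "stool": 22_756,
    "stool NOT faeces": 21_798,
    "stool NOT feces": 18_314,
}

# Records that mention both terms: |A and B| = |A| - |A NOT B|.
stool_and_faeces = counts["stool"] - counts["stool NOT faeces"]
stool_and_feces = counts["stool"] - counts["stool NOT feces"]

# Records matching either term: |A or B| = |B| + |A NOT B|.
feces_or_stool = counts["feces"] + counts["stool NOT feces"]

# Share of those records that a search for "feces" alone would miss.
missed_fraction = counts["stool NOT feces"] / feces_or_stool

print(f"stool AND faeces: {stool_and_faeces}")
print(f"stool AND feces:  {stool_and_feces}")
print(f"feces OR stool:   {feces_or_stool}")
print(f"missed by 'feces' alone: {missed_fraction:.1%}")
```

Under these assumptions, a search for "feces" alone would miss roughly 45% of the records that match either "feces" or "stool".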

The outcome of the workshop was a new team with expertise in many disciplines, from biodiversity to ontologies, computer science, model organism biology, and the human exposome. We predict that the group will have a long and interesting history of solving what may be one of the hardest, yet most interesting, data integration problems facing biological science today.

If you are interested in following this work, you can subscribe to the new working group list.

Friday, July 11, 2014

Monarch Initiative website is Live!

Today is the first official release of the Monarch Initiative website. After years of research to develop the methods to computationally compare phenotypes across species and facilitate the interpretation of disease-gene associations, we are proud to finally see our labor bear fruit. We now have a portal and widget to search, explore, and compare the phenotypic links between diseases, genes, phenotypes, and animal models.

Our modest start includes phenotype data linked to genes and diseases from the following sources: HPO, OMIM, MGI, ZFIN, NCBI Gene, Panther (orthologs), BioGrid (interactions), and KEGG (pathways). Ensuring the integrity of the data, while time-consuming and laborious, is of utmost importance. We will continue to add more sources over time, targeting both large databases, as well as boutiques that cater to very specific data types. Stay tuned for announcements of new data when they are added.

Thursday, April 10, 2014

Monarch at AMIA TBI, Apr 7-9 2014, San Francisco

Ongoing research from the Monarch Initiative was presented by Nicole Washington at AMIA Joint Summits on Translational Bioinformatics 2014.
  • Podium presentation in session TBI19: Phenomic Analysis and Interpretation: Improving the Translation of Model Organism Research into Disease Diagnostics.

    Summary: In order to determine the underlying mechanism of a disease, animal models can often elucidate the biological underpinnings of the phenotype. We present our findings on the distribution, significance, and information characteristics necessary to enable translation of model organism research into disease diagnostic clinical applications using an ontological approach.

  • TBI Poster presentation on Visualizing clinically similar phenotypes

    Summary: Numerous tools for exploring diagnoses rely on the ability to compare clinical phenotypes across patients. These inquiries can be further enhanced with comparative phenotypes from animal models. Here we present novel semantic visualization methods to aid clinical phenotyping through the incorporation of cross-species data.