Thursday, April 23, 2015

What NLM should think about

Monarch replied to the 2015 Request for Information  “Soliciting Input into the Deliberations of the Advisory Committee to the NIH Director (ACD) Working Group on the National Library of Medicine (NLM)”. The RFI sought input regarding the strategic vision for the NLM to ensure that it remains an international leader in biomedical data and health information. 

Below are the Monarch consortium's thoughts. Our comments are primarily informed by our work on the development of information resources in support of translational biomedical informatics.

Dr. Melissa Haendel
Dr. Peter Robinson
Dr. Chris Mungall
Dr. Harry Hochheiser
Dr. David Eichmann
Dr. Michel Dumontier


The Biomedical Informatics Research Training Program is perhaps the single most valuable contribution to the research community, providing considerable value to all of the NLM’s constituencies. At a time when informatics positions are going unfilled and demand is expected to continue to grow, the NLM-funded training programs educate students to become practitioners and researchers who will develop solutions to challenging medical informatics problems at all levels. Students from these programs go on to work for health care systems, insurers, industry, and academic institutions, where they develop and evaluate information systems ranging from personal health records to nationwide translational big-data warehouses. Given the importance of a well-educated workforce that understands both data science and health care, the Biomedical Informatics training programs should be supported and expanded, particularly in directions that will encourage promising young students to enter the field.

Massive Open Online Courses (MOOCs) in bioinformatics and medical informatics should also be funded, as is already happening within the BD2K educational award program. These MOOCs will give many clinicians and researchers on-the-job training that is relevant to their current and evolving informatics learning needs.

Meeting anticipated needs for informatics professionals will also require training efforts that extend beyond graduate-level fellowships. The NLM should actively support and participate in programs at the undergraduate level (and earlier) that expose young students to potential opportunities in the field. This might include reaching out to undergraduate programs in related areas (biology, information science, computer science, etc.) and supporting programs like the AMIA’s high-school development program.

Ideally, the NLM programs would be integrated with the newly emerging BD2K educational resources and coursework that is being developed in the BD2K program. NLM has an opportunity to coordinate these types of training across all of NIH and beyond, and we would see this as a key role for NLM.

Standards and tools

The pioneering resources developed and maintained by the NLM through the National Center for Biotechnology Information (NCBI) and related efforts are invaluable to the research community. Conducting modern biomedical research without tools like PubMed, NCBI taxonomy, MeSH, UMLS and many others is almost unthinkable. Charting a course that makes effective use of limited resources to ensure the future utility and viability of these tools should be a top priority for the NLM. Specifically, NLM leadership should initiate a review of the coverage and compatibility of available resources, with an eye towards both improving existing tools and identifying unmet needs. For example, PubMed and all of the Entrez databases have great value, but the UCSC, IGB, and JBrowse genome browsers have become de facto standards, and NLM’s genome browser efforts may need to be evaluated in light of this development. Another example is vocabulary interoperability, which NLM could facilitate by developing, promoting, and funding better tools to support technical development in this area. The current tools are very poor, and it is no wonder that a myriad of data integration challenges stem from this problem alone.

The time is also ripe for a re-envisioning of PubMed. Some immensely valuable PubMed resources are difficult to find or to use effectively. For example, the LinkOuts within PubMed are so well hidden that they are of little use to the community, yet they have the potential to be of enormous value. Specifically, one should be able to see how a LinkOut is attached to a publication directly on the abstract. The community should have a more sophisticated mechanism for contributing LinkOuts, and users should be able to filter/facet on the ones of interest to them. In the end, browsing PubMed could and should involve reviewing the most salient metadata associated with each paper; the literature in PubMed is simply too voluminous to rely on text-based searches alone. The addition of affiliation to the author construct in MEDLINE 2015 was a significant step forward in the disambiguation of researchers, but it was only partially realized, as there is no further decomposition of the unstructured affiliation string. Further structuring in this area would allow for retrieval by institution and department, opening new avenues for understanding the relationships inherent in the science enterprise. Binding these entities to authority records would lead to clear identification of related work in cognate disciplines.
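To make the affiliation point concrete, here is a minimal sketch of what decomposing an unstructured affiliation string might look like. It is not how MEDLINE or any NLM tool works; the `Affiliation` class, the `parse_affiliation` function, the keyword lists, and the sample string are all hypothetical and chosen only for illustration. A real solution would bind the parts to authority records rather than rely on keyword matching.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical keyword hints; real affiliation strings are far messier.
DEPARTMENT_HINTS = ("department", "dept", "division")
INSTITUTION_HINTS = ("university", "institute", "hospital", "laboratory")


@dataclass
class Affiliation:
    department: Optional[str] = None
    institution: Optional[str] = None
    remainder: List[str] = field(default_factory=list)


def parse_affiliation(text: str) -> Affiliation:
    """Naively split a comma-separated affiliation string into parts."""
    result = Affiliation()
    # Drop a trailing period, split on commas, and discard empty pieces.
    parts = [p.strip() for p in text.strip().rstrip(".").split(",")]
    for part in (p for p in parts if p):
        lowered = part.lower()
        if result.department is None and any(h in lowered for h in DEPARTMENT_HINTS):
            result.department = part
        elif result.institution is None and any(h in lowered for h in INSTITUTION_HINTS):
            result.institution = part
        else:
            # City, state, country, e-mail, etc. are left unclassified.
            result.remainder.append(part)
    return result


# Made-up example string, not taken from any real record.
parsed = parse_affiliation(
    "Department of Example Studies, Example University, Springfield, USA."
)
print(parsed.department, "|", parsed.institution, "|", parsed.remainder)
```

Even this toy version shows why heuristics are fragile: anything that does not match a keyword falls into `remainder`, which is exactly the gap that structured affiliation fields and authority records would close.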

We are pleased that the NLM has invested significant time and effort into releasing MeSH as Linked Data, thereby demonstrating a forward-thinking agenda that aligns with emerging standards for data publication and interoperability. NLM now joins organizations such as the European Bioinformatics Institute (EBI), the Swiss Institute of Bioinformatics (SIB), the Database Center for Life Science (DBCLS), and grassroots efforts such as Bio2RDF in creating an ever greater federation of data. However, much more must be done to make the vast array of NLM resources available as Linked Data. The graph of data must be fully connected not only within NLM but also with these other stakeholders, so as to reduce the barrier to discovery and reuse. The NLM could lead by fostering conversations, coordinating efforts, and providing funding towards data interoperability on a massive scale. It must be responsive to social and technical issues relating to knowledge representation, data publication, data interlinking, and data reuse.

NLM should also work with community efforts such as Force11 that are exploring new visions of the scholarly publication process. NLM support, via inclusion in PubMed and via NLM tools such as E-utilities, could be invaluable for these efforts. The NLM should also work with publishers to leverage community-driven annotation standards so that authors and publishers can tag parts of the text in scientific publications as clinically relevant. PubMed and other interfaces could then use these tags to summarize the text and present the end user with more relevant results for evidence-based medicine. For example, a review paper or a report of a randomized controlled trial may contain fewer than 10 sentences with very high impact on clinical practice and many more that are generic. Identifying (tagging or annotating) these sentences at the publisher level, much as is done now for abstracts and keywords, would greatly benefit clinicians looking for clinically relevant information in the literature.

NLM should work with industry partners like Google, Amazon, and Microsoft to create cloud computing standards (e.g. API standards) that can be used across these platforms and will enable researchers to use a combination of these platforms for big data research. Currently, such efforts are being performed by numerous third parties and are not well coordinated. How many external resources take MEDLINE content, transform it in some way, and make it available back to the community in some enhanced form? No one really knows, but a survey of this landscape would find many common requirements being met in slightly different ways. Such work could a) greatly inform requirements for future NLM development, and b) if coordinated with the third parties, reduce downstream labor. However, a word of caution: if the NLM does not coordinate well with these third parties, the result will only be more downstream labor.

NLM should have a transparent mechanism for evaluation. At the recent BD2K standards workshop, one of the key issues that the community unanimously agreed upon was the need to understand when a standard has outlived its utility or has become outmoded. An example of this is MeSH. Despite the widespread use of MeSH, every bioinformatician currently has hacks for making MeSH and the content annotated with it more usable. Perhaps it is time to evolve MeSH into a more modern semantic standard? Although MeSH has clearly been of tremendous value to the community, it is now failing to reach its full potential because of its limited interoperability and semantic foundation. NLM could greatly increase its impact by adding enough semantics to allow, say, computers to distinguish between entries that represent human or animal diseases, phenotypic findings or other complications of diseases, and items such as "Cadaver" (which is currently described as a pathologic process). MeSH could be much more valuable if it evolved into something more computable and interoperable.
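As an illustration of the kind of hack we mean, the sketch below buckets MeSH headings by the top-level branch of their tree number. It assumes only the publicly documented layout of the Diseases category (C22 for Animal Diseases, C23 for Pathological Conditions, Signs and Symptoms); the function name, the bucket labels, and the sample tree numbers are hypothetical and have not been checked against a particular MeSH release.

```python
def bucket_for(tree_number: str) -> str:
    """Guess what kind of thing a MeSH heading is from its tree number alone."""
    # The first dot-separated segment identifies the top-level branch, e.g. "C23".
    branch = tree_number.split(".")[0]
    if branch == "C22":
        return "animal disease"
    if branch == "C23":
        return "pathological condition, sign or symptom"
    if branch.startswith("C"):
        return "disease (other)"
    return "not a disease heading"


# Illustrative tree numbers only; treat them as placeholders.
for tn in ("C22.021", "C23.550.260.224", "C04.557", "D02.033"):
    print(tn, "->", bucket_for(tn))
```

A prefix rule like this can only report where a heading sits in the hierarchy, not what it means, so anything filed under a pathologic-process branch is treated as a pathologic process. Explicit, machine-readable semantics would remove the need for such guesswork.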

Cross-NIH Collaboration

NLM has a great opportunity to aid coordination within NIH, across the US agencies, and internationally. One example is the coordination of standards development, such as for Common Data Elements, which are currently buried within different ICs. Coordination between the ICs and the BD2K program is also a must if we are to realize the goals of the BD2K program and of related efforts such as the National Cancer Informatics Program of the National Cancer Institute, whose goals are not dissimilar to what one might want for the future of NLM.

Further, the collection, storage, and use of biomedical data by the research community should be supported by a linked and navigable landscape of data, papers, software, and other resources. The proposed NIH commons, the data discovery index, the software discovery index, the Resource Identification Initiative, and the many other related intra- and extramural efforts, both nationally and internationally, must interoperate to support maximal finding and use of content. While this must extend beyond any NLM silos, NLM is in a great position to help support the creation of such a landscape. This goal can be realized via the promotion of open access models for biomedical data and scientific literature creation, annotation, and tools; via collaborations amongst the community on computational methods for content indexing and knowledge derivation; and via manual indexing efforts happening in a distributed and crowd-sourced manner.

Community engagement

Meaningful community engagement must be a key component of the NLM of the future. As the primary consumers of NLM services, biomedical researchers are well acquainted with the strengths, weaknesses, and opportunities associated with the NLM tools. A better understanding of user information needs and of requirements for specific content, integration between data types, and search and discovery will help inform the redesign of tools and data models. Consideration of the needs of diverse populations, including clinicians, educators, researchers, and patients, can drive improvements that will benefit all classes of users, while also potentially identifying new opportunities.

Listening to and learning from researchers as consumers of NLM services is an important first step, but it is not sufficient. Informatics researchers - particularly those funded by the NLM - have extensive research experience and have developed numerous artifacts that are directly relevant to the information services provided by the NLM and NCBI. Community expertise in requirements analysis, evaluation, ontology development, standards processes, and many other emerging areas of informatics has much to offer NCBI efforts. Specific relevant efforts include visionary attempts to redefine scholarly publishing, resource identification efforts aimed at increasing research reproducibility, dataset archiving and identification tools, and annotation infrastructure for extracting key passages from papers, drug labels, and other text resources that are currently inaccessible to computational approaches. Researchers working in these areas have much to offer NLM efforts. Resources that are currently allocated to overlapping or potentially redundant efforts might be repurposed to support the inclusion of community efforts. This could be done via workshops or, ideally, even through shared staff.

Unfortunately, the possibility of close collaboration between extramural researchers and the groups developing and maintaining intramural NLM tools is all too often a missed opportunity. The NLM is often perceived as less than transparent in terms of priorities, contact points, goals, and needs. For example, we have been puzzled by the appearance of NLM tools that seem out of step with current practices in the research community, using data in ways that have diverged from standards and leading to poor interoperability. Limitations on the accessibility of tools, in the form of cumbersome licenses for UMLS components and a lack of available source code for NCBI tools, contribute to the perception that the NLM is not supportive of active collaboration within the biomedical research community. Attempts to collaborate or provide feedback to NLM regarding some resources generally go through a help-desk contact email. Others’ requests, which may be similar, are opaque to the community. Conversely, new NLM standards sometimes appear to the surprise of the community, because the need for their development has not been communicated and they overlap with existing community standards. This further complicates the data integration landscape and increases the siloing of NLM.

Why doesn’t NLM use an issue tracker like other open source projects? Why doesn’t NLM distribute its files through modern version control systems? We have in the past had to scrape HTML pages to get content from NCBI. This does not reflect well upon NLM, a purported leader in information science.

The biomedical research community and the NLM have the potential to significantly increase joint impact on medicine and public health. Realizing this potential will require concrete commitments to increased interaction, collaboration, and transparency, specifically involving:

      Greater transparency through publication of plans and goals for infrastructure development efforts. As the biomedical community has little insight into the development agenda for NLM tools, contributing to that development, either through direct participation or through identification of relevant technologies, vocabularies, etc., is very difficult. Early discussion and engagement with the community, through mechanisms ranging from formal NIH requests for information to blog posts and other less formal methods, will invite feedback, increase engagement with developers both intramurally and extramurally, and facilitate development plans that will best meet researcher needs.

      Enhanced opportunities for feedback and community engagement. Modern web technologies have introduced numerous successful models for online community engagement, including synchronous chat sessions, audio/video meetings, focused community expertise-sharing sites such as Stack Overflow, and code-sharing tools such as GitHub. The NLM should embrace these tools as a means of helping users, soliciting feedback, engaging software developers, and leveraging extramural efforts.

      Targeted outreach activities: Contests, hackathons, and other “challenge” events have become a popular tool for encouraging focused efforts, particularly from students, on interesting problems. Taking inspiration from established efforts like the DREAM challenges and newer programs like the AMIA student design challenge, NLM should invite students and others to jump into biomedical informatics work. These efforts might be integrated with the NLM’s training mission, perhaps including events at the NLM training program annual meeting.

      More integrated science landscape and attribution. With the new biosketch and SciENcv system, NLM has the opportunity to create a much more richly linked landscape of research activity. This can provide for better attribution of non-traditional contributions, better research profiling, better-informed decision making by funding bodies, and simply a deeper understanding of the science being done and the outcomes of funding and programs. This should necessarily include an improved value system, whereby all contributions can be considered and non-traditional scientists have a more prominent role in review processes and decision-making activities. NLM can uniquely support a cross-disciplinary team science approach and improved collaboration.