Monarch replied to the 2015 Request for Information “Soliciting Input into the Deliberations of the Advisory Committee to the NIH Director (ACD) Working Group on the National Library of Medicine (NLM)”. The RFI sought input regarding the strategic vision for the NLM to ensure that it remains an international leader in biomedical data and health information.
Below are the Monarch consortium's thoughts. Our comments are primarily informed by our work on the development of information resources in support of translational biomedical informatics.
Dr. Melissa Haendel
Dr. Peter Robinson
Dr. Chris Mungall
Dr. Harry Hochheiser
Dr. David Eichmann
Dr. Michel Dumontier
Training
The Biomedical Informatics Research Training Program is perhaps the NLM’s single most valuable contribution to the research community, providing considerable value to all of the NLM’s constituencies. At a time when informatics positions are going unfilled and demand is expected to continue growing, NLM-funded training programs educate students to become practitioners and researchers who will develop solutions to challenging medical informatics problems at all levels.
Students from these programs go on to work for health care systems, insurers, industry, and academic institutions, where they develop and evaluate information systems ranging from personal health records to nationwide translational big-data warehouses. Given the importance of a well-educated workforce that understands both data science and health care, the Biomedical Informatics training programs should be supported and expanded, particularly in directions that will encourage promising young students to enter the field.
Massive Open Online Courses (MOOCs) in bioinformatics and medical informatics should also be funded, as is already occurring within the BD2K educational award program. These MOOCs will provide many clinicians and researchers with on-the-job training relevant to their current and evolving informatics learning needs.
Meeting anticipated needs for informatics professionals will
also require training efforts that extend beyond graduate level fellowships.
The NLM should actively support and participate in programs at the
undergraduate level (and earlier) that expose young students to potential
opportunities in the field. This might include reaching out to undergraduate programs in related areas (biology, information science, computer science, etc.) and supporting programs like AMIA’s high-school development program (https://www.amia.org/news-and-publications/press-release/high-school-students-present-national-informatics-symposium).
Ideally, the NLM programs would be integrated with the newly emerging educational resources and coursework being developed within the BD2K program. The NLM has an opportunity to coordinate these types of training across all of NIH and beyond, and we see this as a key role for the NLM.
Standards and tools
The pioneering resources developed and maintained by the NLM
through the National Center for Biotechnology Information (NCBI) and related
efforts are invaluable to the research community. Conducting modern biomedical
research without tools like PubMed, NCBI taxonomy, MeSH, UMLS and many others
is almost unthinkable. Charting a course
that makes effective use of limited resources to ensure the future utility and
viability of these tools should be a top priority for the NLM. Specifically, NLM leadership should initiate
a review of the coverage and compatibility of available resources, with an eye
towards both improving existing tools and identifying unmet needs. For example, PubMed and the Entrez databases have great value, but the UCSC, IGB, and JBrowse genome browsers have become de facto standards, and NLM’s genome browser efforts may need to be re-evaluated in light of this development. Another example is vocabulary interoperability, which the NLM could facilitate by developing, promoting, and funding better tools to support technical development in this area. The tools currently available are very poor, and it is no wonder that a myriad of data integration challenges arise from this problem alone.
The time is also ripe for a re-envisioning of PubMed. Some immensely valuable PubMed resources are difficult to find or to use effectively. For example, the LinkOuts within PubMed are so well hidden that they are of little use to the community, even though they have the potential to be of enormous value. Specifically, one should be able to see how a LinkOut is attached to a publication directly on the abstract. The community should have a more sophisticated mechanism for contributing LinkOuts, and users should be able to filter and facet on the ones of interest to them (a sketch of programmatic LinkOut retrieval appears below). Ultimately, browsing PubMed could and should involve reviewing the most salient metadata associated with each paper; the literature contained in PubMed is simply too voluminous to rely on text-based searching alone.

The addition of affiliation to the author construct in MEDLINE 2015 was a significant step forward in the disambiguation of researchers, but its benefit is only partially realized, as there is no further decomposition of the unstructured affiliation string. Further structuring in this area would allow retrieval by institution and department, opening new avenues for understanding the relationships inherent in the scientific enterprise. Binding these entities to authority records would lead to clear identification of related work in cognate disciplines.
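As a concrete illustration, LinkOut data is already retrievable programmatically through the E-utilities ELink service; the minimal sketch below pulls the LinkOut providers for a single record. The PMID is an arbitrary example, and the JSON field names reflect our reading of the ELink llinks response format, so they should be checked against current NCBI documentation rather than taken as authoritative.

```python
# Minimal sketch: retrieve LinkOut providers for one PubMed record via the
# E-utilities ELink service (cmd=llinks). The PMID below is arbitrary.
import json
import urllib.parse
import urllib.request

EUTILS_ELINK = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"

def linkouts_for(pmid):
    """Return LinkOut provider entries attached to the given PubMed ID."""
    params = urllib.parse.urlencode({
        "dbfrom": "pubmed",
        "id": pmid,
        "cmd": "llinks",    # 'llinks' asks ELink for LinkOut provider URLs
        "retmode": "json",
    })
    with urllib.request.urlopen(EUTILS_ELINK + "?" + params) as resp:
        payload = json.load(resp)
    providers = []
    # Field names below follow our reading of the llinks JSON response and
    # may need adjustment against the current E-utilities documentation.
    for linkset in payload.get("linksets", []):
        for idurl in linkset.get("idurllist", []):
            for objurl in idurl.get("objurls", []):
                providers.append({
                    "provider": objurl.get("provider", {}).get("name"),
                    "url": objurl.get("url", {}).get("value"),
                    "categories": objurl.get("categories", []),
                })
    return providers

if __name__ == "__main__":
    for entry in linkouts_for("23842577"):   # arbitrary example PMID
        print(entry["provider"], "->", entry["url"])
```

Surfacing this information directly on the abstract view, with provider and category facets, would require little more than the data that already exists behind this service.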
We are pleased that the NLM has invested significant time and effort into releasing MeSH as Linked Data, thereby demonstrating a forward-thinking agenda that aligns with emerging standards for data publication and interoperability. The NLM now joins organizations such as the European Bioinformatics Institute (EBI), the Swiss Institute of Bioinformatics (SIB), and the Database Center for Life Science (DBCLS), as well as grassroots efforts such as Bio2RDF, in creating an ever greater federation of data. However, much more must be done to make the vast array of NLM resources available as Linked Data. The graph of data must be fully connected not only within the NLM but also with these other stakeholders, so as to reduce the barriers to discovery and reuse. The NLM could lead by fostering conversations, coordinating efforts, and providing funding for data interoperability on a massive scale. It must be responsive to the social and technical issues surrounding knowledge representation, data publication, data interlinking, and data reuse.
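One immediate benefit of the Linked Data release is that MeSH can now be queried with standard SPARQL tooling. The minimal sketch below assumes the MeSH RDF endpoint at https://id.nlm.nih.gov/mesh/sparql and its meshv vocabulary, both per NLM’s MeSH RDF documentation; the endpoint URL and the format parameter should be verified against that documentation before use.

```python
# Minimal sketch: query MeSH Linked Data via its public SPARQL endpoint.
# Endpoint URL and parameter names follow NLM's MeSH RDF documentation;
# verify both before relying on this.
import json
import urllib.parse
import urllib.request

MESH_SPARQL = "https://id.nlm.nih.gov/mesh/sparql"

QUERY = """
PREFIX meshv: <http://id.nlm.nih.gov/mesh/vocab#>
PREFIX rdfs:  <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?descriptor ?label WHERE {
  ?descriptor a meshv:TopicalDescriptor ;
              rdfs:label ?label .
  FILTER (CONTAINS(LCASE(STR(?label)), "neoplasm"))
}
LIMIT 10
"""

params = urllib.parse.urlencode({"query": QUERY, "format": "JSON"})
with urllib.request.urlopen(MESH_SPARQL + "?" + params) as resp:
    results = json.load(resp)

# Standard SPARQL 1.1 JSON results format: results -> bindings -> value.
for row in results["results"]["bindings"]:
    print(row["descriptor"]["value"], row["label"]["value"])
```

The same pattern extends to federated queries that join MeSH with EBI, SIB, or Bio2RDF resources, which is precisely the interoperability we argue for above.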
The NLM should also work to exploit community efforts such as FORCE11 that are exploring new visions of the scholarly publication process. NLM support, via inclusion in PubMed and via tools such as the E-utilities, could be invaluable for these efforts. The NLM should also work with publishers to leverage community-driven annotation standards so that authors and publishers can tag portions of the text of scientific publications as clinically relevant. These tags could then be used by PubMed and other interfaces to summarize the text and present end users with more relevant results for evidence-based medicine. For example, a review paper or a report of a randomized controlled trial may contain fewer than ten sentences of very high impact for clinical practice amid many more that are generic; identifying (tagging/annotating) these sentences at the publisher level, much as abstracts and keywords are handled now, would be of great benefit to clinicians searching the literature for clinically relevant information.
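To make the proposal concrete, the sketch below shows one possible shape for a publisher-supplied annotation of a clinically relevant sentence, modeled loosely on the emerging W3C Web Annotation data model; the tag value, DOI, and quoted sentence are all hypothetical placeholders, not an established standard.

```python
# Illustrative sketch only: one way a publisher could tag a clinically
# relevant sentence, modeled loosely on the W3C Web Annotation data model.
# The tag vocabulary, DOI, and quoted sentence are hypothetical.
import json

annotation = {
    "@context": "http://www.w3.org/ns/anno.jsonld",
    "type": "Annotation",
    "motivation": "tagging",
    "body": {
        "type": "TextualBody",
        "value": "clinically-relevant",   # hypothetical controlled tag
    },
    "target": {
        "source": "https://doi.org/10.xxxx/example-article",  # placeholder DOI
        "selector": {
            "type": "TextQuoteSelector",
            "exact": "Drug X reduced 30-day mortality relative to placebo.",
        },
    },
}

print(json.dumps(annotation, indent=2))
```

Because such annotations are machine-readable, PubMed or any third-party interface could filter and surface exactly these high-impact sentences.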
The NLM should work with industry partners like Google, Amazon, and Microsoft to create cloud computing standards (e.g., API standards) that can be used across these platforms and will enable researchers to utilize a combination of platforms for big data research. Currently, such efforts are being performed by numerous third parties and are not well coordinated. How many external resources take MEDLINE content, transform it in some way, and make it available back to the community in some enhanced form? No one really knows, but a survey of this landscape would find many common requirements being met in slightly different ways. Such work could a) greatly inform requirements for future NLM development, and b) if coordinated with the third parties, reduce downstream labor. However, a word of caution: if the NLM does not coordinate well with these third parties, such efforts will only increase the downstream labor.
The NLM should have a transparent mechanism for evaluation. At the recent BD2K standards workshop, one of the key issues on which the community unanimously agreed was the need to understand when a standard has outlived its utility or has become outmoded. An example of this is MeSH. Despite the widespread use of MeSH, every bioinformatician currently maintains hacks for making MeSH, and the content annotated with it, more usable. Perhaps it is time to evolve MeSH into a more modern semantic standard? Although MeSH has clearly been of tremendous value to the community, it is now failing to reach its full potential because of its limited interoperability and semantic foundation. The NLM could greatly increase its impact by adding enough semantics to allow, say, computers to distinguish between entries that represent human or animal diseases, phenotypic findings or other complications of diseases, and items such as "Cadaver" (which is currently described as a pathologic process). MeSH could be much more valuable if evolved into something more computable and interoperable.
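The hacks alluded to above often reduce to heuristics over MeSH tree numbers. Below is a minimal sketch of that kind of workaround, using hypothetical descriptor data rather than a real parse of the MeSH distribution; the top-level tree letters (A for Anatomy, C for Diseases, and so on) only approximate semantic type, which is exactly why such workarounds are brittle.

```python
# A sketch of the ad hoc workaround described above: inferring a coarse
# semantic type for a MeSH descriptor from its tree-number prefixes.
# Descriptor data here is hypothetical; real code would parse the MeSH
# XML/RDF distribution. Tree-letter semantics are only an approximation.

# Top-level MeSH tree categories (letter -> branch name), abridged.
TREE_BRANCHES = {
    "A": "Anatomy",
    "B": "Organisms",
    "C": "Diseases",
    "D": "Chemicals and Drugs",
    "E": "Analytical, Diagnostic and Therapeutic Techniques and Equipment",
}

def guess_semantic_type(tree_numbers):
    """Heuristically classify a descriptor by its tree-number prefixes."""
    branches = {TREE_BRANCHES.get(tn[0], "Other") for tn in tree_numbers}
    if branches == {"Diseases"}:
        return "disease"
    if "Diseases" in branches:
        return "ambiguous"   # the descriptor lives in multiple branches
    return "non-disease"

# Hypothetical tree numbers, for illustration only.
print(guess_semantic_type(["C04"]))             # -> disease
print(guess_semantic_type(["A01.236"]))         # -> non-disease
print(guess_semantic_type(["C23.550", "E01"]))  # -> ambiguous
```

A MeSH with explicit, formally defined semantic types would make this guesswork unnecessary.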
Cross-NIH Collaboration
The NLM has a great opportunity to aid coordination within NIH, across US agencies, and internationally. One example is the coordination of standards development, such as for Common Data Elements, which are currently buried within different ICs. Coordination between the ICs and the BD2K program is also a must if we are to realize the goals of the BD2K program and of related efforts such as the National Cancer Institute’s National Cancer Informatics Program, efforts that are not dissimilar to what one might want for the future of the NLM.
Further, the collection, storage, and use of biomedical data by
the research community should be supported by a linked and navigable landscape
of data, papers, software, and other resources. The proposed NIH commons, the
data discovery index, the software discovery index, the Resource Identification
Initiative, and the many other related intra- and extramural efforts, both
nationally and internationally, must interoperate to support maximal discovery and use of content. While this must extend beyond any NLM silos, the NLM is in a great position to help support the creation of such a landscape. This goal can be realized via the promotion of open access models for the creation and annotation of biomedical data, scientific literature, and tools; via community collaboration on computational methods for content indexing and knowledge derivation; and via manual indexing efforts conducted in a distributed, crowd-sourced manner.
Community engagement
Meaningful community engagement must be a key component of the
NLM of the future. As the primary consumers of NLM services, biomedical
researchers are well-acquainted with the strengths, weaknesses, and
opportunities associated with the NLM tools.
Better understanding of user information needs and requirements for
specific content, integration between data types, and search and discovery will
help inform the redesign of tools and data models. Consideration of the needs of diverse populations, including clinicians, educators, researchers, and patients, can drive improvements that will benefit all classes of users, while also potentially identifying new opportunities.
Listening to and learning from researchers as consumers of NLM
services is an important first step, but it is not sufficient. Informatics
researchers - particularly those funded
by the NLM -
have extensive research experience and have developed numerous
artifacts that are directly relevant to the information services provided by
the NLM and NCBI. Community expertise in requirements analysis, evaluation,
ontology development, standards processes, and many other emerging areas of
informatics has much to offer NCBI efforts. Specific relevant efforts include
visionary attempts to redefine scholarly publishing, resource identification
efforts aimed at increasing research reproducibility, dataset archiving and
identification tools, and annotation infrastructure for extracting key passages
from papers, drug labels, and other text resources that are currently
inaccessible to computational approaches. Researchers working in these areas
have much to offer NLM efforts. Resources currently allocated to overlapping or potentially redundant efforts might be repurposed to support the inclusion of community contributions, whether through workshops or, ideally, even through shared staff.
Unfortunately, the possibility of close collaboration between
extramural researchers and the groups developing and maintaining intramural NLM
tools is all too often a missed opportunity. The NLM is often perceived as less than transparent in terms of priorities, contact points, goals, and needs. For example, we have been puzzled by the appearance of NLM tools that seem out of step with current practices in the research community, using data in ways that have diverged from standards and that lead to poor interoperability.
Limitations on the accessibility of tools, in the form of cumbersome licenses for UMLS components and a lack of available source code for NCBI tools, contribute to the perception that the NLM is not supportive of active collaboration within the biomedical research community. Attempts to collaborate with or provide feedback to the NLM regarding some resources generally go through a help-desk contact email, and others’ requests, which may be similar, remain opaque to the community. Conversely, new NLM standards sometimes appear to the surprise of the community because the need for their development was never communicated and they overlap with existing community standards. This further complicates the data integration landscape and increases the siloing of the NLM.
Why doesn’t the NLM use an issue tracker like other open source projects? Why doesn’t the NLM distribute its files through modern version control systems? We have in the past had to scrape HTML pages to get content from NCBI. This does not reflect well upon the NLM, a purported leader in information science.
The biomedical research community and the NLM have the potential to significantly increase their joint impact on medicine and public health. Realizing this potential will require concrete commitments to increased interaction, collaboration, and transparency, specifically involving:
● Greater transparency through publication of plans and goals for infrastructure development efforts. Because the biomedical community has little insight into the development agenda for NLM tools, contributing to that development, whether through direct participation or through identification of relevant technologies, vocabularies, etc., is very difficult. Early discussion and engagement with the community, through mechanisms ranging from formal NIH requests for information to blog posts and other less formal methods, will invite feedback, increase engagement with developers both intramurally and extramurally, and facilitate development plans that best meet researcher needs.
● Enhanced opportunities for feedback and community engagement. Modern web technologies have introduced numerous successful models for online community engagement, including synchronous chat sessions, audio/video meetings, focused expertise-sharing sites such as Stack Overflow, and code-sharing tools such as GitHub. The NLM should embrace these tools as means of helping users, soliciting feedback, engaging software developers, and leveraging extramural efforts.
● Targeted outreach activities. Contests, hackathons, and other “challenge” events have become a popular tool for encouraging focused efforts, particularly from students, on interesting problems. Taking inspiration from established efforts like the DREAM challenges (http://dreamchallenges.org/) and newer programs like the AMIA student design challenge (https://www.amia.org/amia2015/student-design-challenge), the NLM should invite students and others to jump into biomedical informatics work. These efforts might be integrated with the NLM’s training mission, perhaps including events at the NLM training program annual meeting.
● A more integrated science landscape and attribution. With the new biosketch and SciENcv system, the NLM has the opportunity to create a much more richly linked research activity landscape. This can provide better attribution for non-traditional contributions, better research profiling, better-informed funding-body decision making, and simply a deeper understanding of the science being done and of the outcomes of funding and programs. It should necessarily include an improved value system whereby all contributions can be considered and non-traditional scientists have a more prominent role in review processes and decision-making activities. The NLM can uniquely support a cross-disciplinary team science approach and improved collaboration.