I. ETHICS, RESOURCES AND APPROACHES
CHAPTER 2
Access to Resources
A Model Organism Database for Humans
Judith Axler Turner
TCG, Washington, DC, USA
Animal Models for the Study of Human Disease
http://dx.doi.org/10.1016/B978-0-12-415894-8.00002-6
OUTLINE
The Problem
The LAMHDI Solution
LAMHDI Database Search
Some Sources are Curatorial
Some Sources Provide the Model Information
Performing the Search
The Advantages and Limitations of LAMHDI
The Ideal Solution
Acknowledgments
References
THE PROBLEM
Arguably the single most essential element in animal-based research, identifying and selecting the most appropriate animal model is also the most challenging. Wood and Hart, 2008,¹ p. 303
While animal models are central to the effective study and discovery of treatments for human diseases, finding appropriate models for every stage of research is still a challenge. Scientists interested in the “three Rs” of animal model research (refining, reducing, and replacing models where possible) are seeking models at the lowest level that can provide insight into knotty problems. But cross-species information is hard to find and hard to compare: each model database, where databases exist at all, is different, with different structures, different data, and different search strategies. Learning one database is rarely helpful in searching another.
The scientific literature typically does not provide answers to the question of whether another model might be a better choice at various points in disease research. Journal articles focus on the experiment and its results, and not the model. While some articles fully identify the model used, rarely are alternatives discussed.
This chapter looks at one initiative to provide information about and access to appropriate disease models across species: LAMHDI, the project to Link Animal Models to Human DIsease. LAMHDI was created to gather and organize information about extant disease models so that researchers could identify and compare options across species and strains. LAMHDI (http://www.lamhdi.org) is available and searchable through a Web interface.
“When I go to a primate center meeting, there is always a question from the audience, ‘how do I find out what models are available?’ and I do a song and dance and the final answer is ‘give me a call,’” an official at the National Institutes of Health (NIH) said in a private conversation before LAMHDI was built. “Access to and use of disease model information is currently an inefficient process owing at least in part to the large number and complex nature of existing disease models and associated data,” the National Center for Research Resources (NCRR) wrote in the request for proposals that was issued in 2008.
Copyright © 2013 Elsevier Inc. All rights reserved.
Even those who built the most important disease model databases recognized the problem. Monte Westerfield (Institute of Neuroscience, University of Oregon), a founder of ZFIN, the Zebrafish Information Network, said in a presentation at the LAMHDI organizational meeting that heterogeneous information, multiple sites, inconsistent search mechanisms, inconsistent use of descriptive terms, lack of support for synonyms, and inconsistent definitions of disease models impede finding animal models for human disease. Even the definition of a disease model was a problem: is a disease represented by a single gene, is it multigenic, or is it an infectious disease and therefore non-genetic? “Mutations in different genes can produce the same phenotypes,” he said, and “mutations in the same gene in different organisms can result in different phenotypes.”
Nevertheless, the NCRR was so confident that a cross-species database of disease models could be developed that it limited respondents to its request for proposals to small businesses, not the research enterprise. NCRR was looking for a product, not an experiment. The statement of work said:
More rigorous methods for describing animal models will allow researchers to identify and examine commonalities and differences across multiple animal models. As more animal models are generated, it is important to develop better mechanisms by which these models are mapped onto human conditions as well as better ways to capture and then access this newly generated information. NIH NCRR Statement of Work (2008)
The NIH believed it could be done.
Moreover, the NIH expected the database to do more than just connect researchers with information about disease models in various species. At an NIH meeting a few weeks before the contract was awarded, scientists, many of whom would become part of the LAMHDI effort, sought to find a way to maximize productivity in biomedical research. They wanted to harness the massive amounts of data that were being produced through new technological advances, and analyze, share, and perform computational studies on them to translate them into beneficial medical advances.
“The problem is not specifically a lack of resources; the NIH alone funds billions of dollars each year for biomedical research. Rather, the issue is the lack of ability to effectively utilize data across experimental groups, institutions, and domains,” one speaker at the invitation-only conference said.
Their solution was to create knowledge environments, information infrastructures that would provide a framework for effective computation on disease-model data. The first goal was to bring together available data on animal models in a single environment.
Such a system would not be breaking entirely new ground. The University of California, San Diego (UCSD), had already built the successful Neuroscience Information Framework, or NIF. This resource, established in 2004, is a dynamic inventory of Web-based neuroscience resources: data, materials, and tools accessible on the Internet. NIF combines ontology, standards, and support for “identifying, locating, relating, accessing, integrating, and analyzing information from the neuroscience research enterprise” (http://www.neuinfo.org/about/index/shtm).
The LAMHDI contract was awarded to a team that combined a small business successful in federal information-technology contracting and noted disease researchers and disease-model providers. TCG, a Washington, DC-based information-technology company, brought together a team of scientists from prominent disease research centers, led by the UCSD Center for Research in Biological Systems and NIF, the University of Washington, and the University of Wisconsin. Their proposal was for LAMHDI.
THE LAMHDI SOLUTION
The NIH intention was clear. The government’s statement of work said:
The initial component of this project will address the development of the front end of the resource: specifically, a directory of available disease models. The envisioned directory would enhance information access and retrieval by assisting researchers to find information about animal models, and by supporting experts seeking to assist researchers to find such information. This resource directory would provide reference materials, information about the resources, and information about the animal models themselves. Initially this component would focus on a few model animal species (e.g., mouse, zebrafish), and eventually expand to other species and along other dimensions (e.g., microbes, tissues). Resource database development would focus on the needs of both the animal model and human disease research communities and the characteristics of existing resources, and address issues such as data description standards, user interface, user services, etc. NIH NCRR Statement of Work (2008)
LAMHDI was designed to be a tool to help researchers fulfill the NIH mission of improving health and saving lives. LAMHDI allows researchers to find the most appropriate models of human disease by comparing disease models across species, and to access those models.
The scientific community wanted to be able to search model organism databases, animal resources, and OMIM (Online Mendelian Inheritance in Man, http://ncbi.nlm.nih.gov/omim, a database of human disease and genetic disorders) by disease, gene, pathway, organ, tissue, cell type, and GO terms (Gene Ontology, http://geneontology.org/, a controlled vocabulary of terms for describing gene product characteristics and gene product annotation data).
NIF, which let researchers access literature and information through a human-curated database, was a possible model. But establishing and maintaining a curated catalog is challenging. Maryann Martone, the principal investigator at NIF, warned the LAMHDI team in a conversation that curation is hard: “In one database a mouse is a model, in another it’s a number in a table. Terminology is a big problem.” NIF attacks the problem by using teams of scientific curators to verify every entry, and slowly build a standard lexicon of terms. If LAMHDI had followed that model, it would have had to develop its own set of terms (likely building on NIF, but going further into all human diseases) for its disease-model databases.
Heeding Martone’s warnings, the LAMHDI team looked for a quicker and more automated approach, to give researchers the first rough draft of the data they would need to make informed decisions about disease models. The LAMHDI team chose to build on existing curated databases, and automate the linkages. The current version of LAMHDI provides a database and website that allow researchers to search for and access appropriate models of human disease in mice, zebrafish, rats, yeast, and flies. LAMHDI also allows users to search the PrimateLit database of articles about nonhuman primate models of human disease, and to do Google-like searches of select, topic-relevant Web sites. All data included in LAMHDI was chosen and curated by the experts in the areas covered by the existing databases. LAMHDI’s aim was to save time for scientists who once had to go to the source databases and learn multiple different search strategies and deal with multiple vocabularies. On LAMHDI, a single search brings results from many scientific databases. Developers devised a smart search engine that could present multiple species from disparate databases, and provided added value by presenting information about where to access individual models.
LAMHDI does not create the data it offers to users; that is collected and curated by scientists worldwide. LAMHDI’s role is to build the software that translates from one data system or data structure to another, so scientists can search across existing databases and find relationships that will help inform research. This is no trivial task. LAMHDI users can enter a disease name, and find related models in five species, yet source databases may not have disease names, or may not link particular models to those diseases, even though a link may exist.
LAMHDI’s strength is its system architecture. It brings in a database of disease models, including genetic and annotation data when available, and runs each model against the National Library of Medicine’s Medical Subject Headings (MeSH) controlled vocabulary to identify all related terms. After scanning the files for the information to present (species; name or identifier for the model; diseases to which it relates; access information, typically the name of the laboratory where the model may be purchased; and “other” information that may be useful, such as descriptions of the model, relevant literature, the record of the model, the genes involved, etc.), it applies keyword tags to the data to allow it to come up quickly in searches. These tags represent the results of the pre-searching that the system does. See Figs 2.1 and 2.2 for examples of LAMHDI search results.
The LAMHDI site’s “about” page (http://lamhdi.org/about), reproduced below, offers the best description of how it works:
LAMHDI Database Search
LAMHDI brings together scientifically validated information from various sources to create a composite multi-species database of animal models of human disease. To do this, the LAMHDI database is prepared from a variety of sources.
Some Sources are Curatorial
The LAMHDI team takes publicly available data from OMIM, NCBI’s Entrez Gene database, Homologene, and WikiPathways, and builds a mathematical graph (think of it as a map or a web) that links these data together to help discover connections. We use OMIM to link human diseases with specific human genes. We use Entrez for universal identifiers for each of those genes. We link human genes to their counterpart genes in other species using Homologene, and we link those genes to other genes that are tentatively or authoritatively linked with them in some way using the data in WikiPathways.
This preparatory work gives LAMHDI a web of human diseases linked to specific human genes, orthologous human genes, homologous genes in other species, and both human and nonhuman genes involved in specific metabolic pathways associated with those diseases.
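The linking steps described above can be pictured with a small sketch. It is not LAMHDI's code (the site is PHP-based); it is a minimal Python illustration that assumes each source has already been reduced to a simple lookup table. The table contents, the identifiers such as `G1` and `C1`, and the gene `GENE_X` are made-up placeholders, and `build_graph` is a hypothetical helper name.

```python
from collections import defaultdict

# Toy stand-ins for the four public sources. Every identifier below is a
# made-up placeholder, not a real OMIM, Entrez Gene, Homologene, or
# WikiPathways record.
OMIM = {"polycystic kidney disease": ["PKD2"]}   # disease -> human gene symbols
ENTREZ = {                                        # (species, symbol) -> universal gene ID
    ("human", "PKD2"): "G1",
    ("zebrafish", "pkd2"): "G2",
    ("human", "GENE_X"): "G3",
}
HOMOLOGENE = {"C1": ["G1", "G2"]}                 # homology cluster -> member gene IDs
WIKIPATHWAYS = {"P1": ["G1", "G3"]}               # pathway -> member gene IDs


def build_graph():
    """Link diseases, genes, homology clusters, and pathways into one graph."""
    graph = defaultdict(set)

    def link(a, b):
        # Edges are undirected, so store each one in both directions.
        graph[a].add(b)
        graph[b].add(a)

    for disease, symbols in OMIM.items():
        for symbol in symbols:
            link(("disease", disease), ("gene", ENTREZ[("human", symbol)]))
    for cluster, gene_ids in HOMOLOGENE.items():
        for gene_id in gene_ids:
            link(("cluster", cluster), ("gene", gene_id))
    for pathway, gene_ids in WIKIPATHWAYS.items():
        for gene_id in gene_ids:
            link(("pathway", pathway), ("gene", gene_id))
    return graph


graph = build_graph()
print(sorted(graph[("gene", "G1")]))
```

Keeping every node as a `(kind, label)` pair makes it easy to tell diseases, genes, and linkages apart when paths through the graph are examined later.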
Some Sources Provide the Model Information
LAMHDI includes model data that partners share with us, which is plugged into our data structure. For instance, MGI (Mouse Genome Informatics, http://www.informatics.jax.org/, an “international database resource for the laboratory mouse, providing integrated genetic, genomic, and biological data to facilitate the study of human health and disease,” according to the website) provides information about mouse models, including a disease for each model, as well as a bit of genetic information (the ID of the model, in fact, identifies one or more genes); while ZFIN provides genetic information for each zebrafish model, but no diseases. We plug each zebrafish model into our database using the genes as the glue. For instance, a hypothetical zebrafish model that involves the zebrafish pkd2 gene would plug into the larger disease-gene map at the node representing the zebrafish pkd2 gene, which is connected to the node representing the human PKD2 gene, which in turn is connected to the node representing the human disease known as polycystic kidney disease. (Some of the partner data we receive can even extend our base map. MGI provides a disease for every model, and in some cases this allows us to create a disease-to-gene relationship in our database that might not already be documented in the OMIM dataset.)
FIGURE 2.1 A recent search for neurofibromatosis-1 optic glioma yielded the screen pictured here. A total of 83 records matched the search terms. That example search brought back 40 mouse models, 21 fly models, 19 zebrafish models, and 3 yeast models. The first mouse model, with a score of 99, is the most likely hit; the next closest match had a score of 68. Scores are determined by the number of jumps (logical inferences) in the search and the strength of the evidence for those jumps.
FIGURE 2.2 In the LAMHDI search depicted here, the jumps are identified for the first fly model. The record for Drosophila melanogaster, PBac{RB}Nf1e00084, does not contain the terms neurofibromatosis-1 optic glioma. To get to that disease model, LAMHDI first applied the search terms to OMIM, the “comprehensive, authoritative, and timely compendium of human genes and genetic phenotypes” maintained by the National Center for Biotechnology Information (NCBI). The OMIM record (http://omim.org/entry/162200) for neurofibromatosis-1 optic glioma included nearly 300 references and a pointer to the human gene NF1. LAMHDI ran “NF1” through Homologene (http://www.ncbi.nlm.nih.gov/sites/entrez?db=homologene&cmd=search&term=226), NCBI’s homologue database, and found the fly gene Nf1. A search for Nf1 in FlyBase (http://flybase.org) brought up this particular model.
With our curatorial and model information in hand, we run a lengthy automated process that exhaustively searches for every possible path between each model and each disease in the data, up to some arbitrary number of hops (producing for each disease-to-model pair a set of links from the disease to the model). The algorithm avoids circular paths and paths that include more than one disease anywhere in the middle of the path. At the end of this phase, LAMHDI has a comprehensive set of paths representing all the disease-to-model relationships in our data, varying in length from one hop to many hops. Each disease-to-model path is essentially a string of nodes in our data, where each node represents a disease, a gene, a linkage between genes (an ortholog, a homolog, or a pathway connection, referred to as a gene cluster or association), or a model. Each node has a human-friendly label, a set of terms and keywords, and, in most cases, a URL linking the node to the data source where it originated. We’re now ready for our users.
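The exhaustive path search can be sketched as a depth-limited walk over the graph. The Python below is an illustration under stated assumptions, not the LAMHDI implementation: the toy graph, the node labels, and the function name `find_paths` are invented for the example, and the rule about diseases in the middle of a path is read here as "at most one intermediate disease", which is only one possible interpretation of the description.

```python
# Hypothetical toy graph: nodes are (kind, label) pairs, edges are undirected.
DISEASE = ("disease", "polycystic kidney disease")
EDGES = [
    (DISEASE, ("gene", "human PKD2")),
    (("gene", "human PKD2"), ("cluster", "PKD2 homologs")),
    (("cluster", "PKD2 homologs"), ("gene", "zebrafish pkd2")),
    (("gene", "zebrafish pkd2"), ("model", "hypothetical zebrafish model")),
    (DISEASE, ("model", "hypothetical mouse model")),
]
GRAPH = {}
for a, b in EDGES:
    GRAPH.setdefault(a, set()).add(b)
    GRAPH.setdefault(b, set()).add(a)


def find_paths(graph, disease, max_hops, max_middle_diseases=1):
    """Enumerate every simple path from a disease node to a model node."""
    paths = []

    def walk(path, middle_diseases):
        node = path[-1]
        if node[0] == "model":
            paths.append(tuple(path))   # a model always ends the path
            return
        if len(path) - 1 >= max_hops:   # hop budget exhausted
            return
        for nxt in sorted(graph.get(node, ())):
            if nxt in path:             # avoid circular paths
                continue
            extra = 1 if nxt[0] == "disease" else 0
            if middle_diseases + extra > max_middle_diseases:
                continue                # too many diseases inside the path
            path.append(nxt)
            walk(path, middle_diseases + extra)
            path.pop()

    walk([disease], 0)
    return paths


for path in find_paths(GRAPH, DISEASE, max_hops=6):
    print(len(path) - 1, "hops:", " -> ".join(label for _, label in path))
```

In this toy data the mouse model is one hop from the disease, while the zebrafish model is four hops away (gene, gene cluster, gene, model), which matches the hop counts the about page describes for MGI and ZFIN data.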
Performing the Search
When a researcher submits a search on the LAMHDI website, LAMHDI searches for the user’s search terms in its precomputed list of all known disease-to-model paths (looking for the terms not only in the disease and model nodes, but also every node along each path). The complete set of hits may include multiple paths between any given disease-to-model pair of endpoints. Each of these disease-to-model pair sets is ordered by the number of hops it involves, and the one involving the fewest hops is chosen to represent its respective disease-to-model pair in the search results presented to the user (which are sorted overall by their search term matching scores).
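The selection step can be sketched as follows. This is a hedged Python illustration, not LAMHDI's search engine: the precomputed paths are invented, and the scoring rule (the number of query terms found anywhere along a path) is an assumption made for the example, since the about page does not spell out how matching scores are computed.

```python
# Hypothetical precomputed paths; each node is a (kind, label) pair.
D = ("disease", "polycystic kidney disease")
PATHS = [
    # A six-hop route to zebrafish model B through a made-up pathway.
    (D, ("gene", "human PKD2"), ("pathway", "example pathway"),
     ("gene", "human GENE_X"), ("cluster", "GENE_X homologs"),
     ("gene", "zebrafish gene_x"), ("model", "zebrafish model B")),
    # A four-hop route to the same zebrafish model.
    (D, ("gene", "human PKD2"), ("cluster", "PKD2 homologs"),
     ("gene", "zebrafish pkd2"), ("model", "zebrafish model B")),
    # A one-hop route to a mouse model.
    (D, ("model", "mouse model A")),
]


def search(paths, query):
    """Return (score, path) hits: one shortest path per disease-model pair."""
    terms = query.lower().split()
    best = {}
    for path in paths:
        # Look for the terms in every node along the path, not only the ends.
        text = " ".join(label.lower() for _, label in path)
        score = sum(term in text for term in terms)
        if score == 0:
            continue
        key = (path[0], path[-1])       # the disease-to-model endpoints
        if key not in best or len(path) < len(best[key][1]):
            best[key] = (score, path)   # keep the path with the fewest hops
    # Sort overall by matching score, breaking ties with shorter paths first.
    return sorted(best.values(), key=lambda hit: (-hit[0], len(hit[1])))


for score, path in search(PATHS, "kidney pkd2"):
    print(score, len(path) - 1, path[-1][1])
```

Because the paths are computed ahead of time, a search only has to scan and rank them, which is consistent with the fast response times the chapter describes.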
The number of hops is one barometer of the strength of the evidence linking the model and the disease; fewer hops indicate the relationship is stronger, more hops indicate it may be weaker. Note that this indicator works best for comparing models from a single partner dataset: MGI explicitly identifies a disease for each mouse model, so there can be disease-to-model hits for mice that involve just one hop. Because ZFIN does not explicitly identify a disease for each model, no zebrafish model will involve fewer than four hops to the nearest disease: from the zebrafish model to a zebrafish gene, to a gene cluster, to a human gene, to a human disease.
The Advantages and Limitations of LAMHDI
A researcher might do the same searches without LAMHDI, looking for relationships and pointers from article to database to database to article to model. But a researcher could take an hour or more to do each search; LAMHDI returns its results in fractions of a second. When searches are that fast, researchers can do dozens, even scores of them to find exactly the right result. Moreover, LAMHDI’s software does not just search, it infers relationships that could lead researchers to new ways of thinking about the data. LAMHDI does not claim to find the right result. It takes a trained scientist to evaluate the results and decide whether Drosophila melanogaster, PBac{RB}Nf1e00084, is a crummy model or just right for what a researcher needs to do at a particular point in her work.
LAMHDI’s main strength today is its speed: it lets a scientist try out ideas and follow thoughts without being distracted by the process. The NIH expected that:
an important motivation for developing better access to resources for animal models of human disease [would be] the enhanced ability to search across multiple animal models as well as to capture specific model information that may be relevant to multiple diseases. NIH NCRR Statement of Work (2008)
But LAMHDI could go only so far. The five disease model databases it includes are the only extensive, public, well-organized, and well-curated disease-model databases: MGI, the Mouse Genome Informatics website (http://www.informatics.jax.org/); ZFIN, the Zebrafish Model Organism Database (http://www.zfin.org/); RGD, the Rat Genome Database (http://rgd.mcw.edu/); SGD, the Saccharomyces Genome Database (http://www.yeastgenome.org/); and FlyBase (http://www.flybase.org/). (Several well-known databases, such as the Knockout Rat Consortium (http://www.knockoutrat.org/) and KOMP, the Knockout Mouse Project (http://www.nih.gov/science/models/mouse/knockout/index.html), are subsets of the disease-model databases LAMHDI already uses.) Information about all other disease models is stored in private, or small, or out-of-date, or non-standard databases. For instance, zoos are eager to make their animals available for research (through non-invasive procedures), but do not have comparable data for structured searches. If you need a hippopotamus (for research into obesity and longevity, for instance), you have to phone up the local zoo, and if they do not have the capability to provide cheek swabs or blood, keep calling zoos. The national primate research centers have reams, even petabytes (10^15 bytes, or one million gigabytes) of information on their charges, but it is locked in individual researchers’ files and not standardized in useful ways. Even commercial providers of disease models do not create the kind of data that a computer can parse; they get the data about their models from the same public databases that LAMHDI already uses.
The NIH was aware of the problem; the request for proposals asked for “more rigorous methods for describing animal models.” To help meet that goal, LAMHDI scientists met to consider the problem. One issue was that searches of the scientific literature did not easily yield information about disease models. A relative handful of articles mentioned the exact strain or source of the disease model, and certainly not in a standardized format that could be parsed by a computer. It seemed to the LAMHDI team that their efforts could not succeed without help from the journals themselves. The LAMHDI team, supported by scientists from a range of disciplines, drafted a letter for journal publishers urging them to establish standards for information about animal models (see Box 2.1).
The letter asks that journals enforce three “minimal” practices for authors:
1. Provide gene accession numbers for all genes
2. Identify the species for the subject of a study based on the NCBI taxonomy, and the strains from the model organism databases
3. Provide catalog numbers and vendor information for reagents and animals.
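To show how mechanical these three practices are, the sketch below checks a structured description of a paper's methods section for the missing pieces. It is a hypothetical Python illustration: the record layout, the field names, the placeholder values, and the function `missing_items` are all assumptions made for the example and are not part of the letter or of LAMHDI.

```python
def missing_items(record):
    """List which of the three minimal practices a methods record fails."""
    problems = []
    # Practice 1: every gene needs an accession number.
    for gene in record.get("genes", []):
        if not gene.get("accession"):
            problems.append(f"gene {gene.get('symbol', '?')}: no accession number")
    # Practice 2: species from the NCBI taxonomy, strain from a model organism database.
    organism = record.get("organism", {})
    if not organism.get("ncbi_taxonomy_id"):
        problems.append("organism: no NCBI taxonomy ID")
    if not organism.get("strain_id"):
        problems.append("organism: no strain ID from a model organism database")
    # Practice 3: vendor and catalog number for every reagent and animal.
    for item in record.get("reagents", []) + record.get("animals", []):
        if not (item.get("vendor") and item.get("catalog_number")):
            problems.append(f"{item.get('name', '?')}: vendor or catalog number missing")
    return problems


# Placeholder values only; none of these is a real identifier.
COMPLETE = {
    "genes": [{"symbol": "Nf1", "accession": "ACCESSION-PLACEHOLDER"}],
    "organism": {"ncbi_taxonomy_id": "TAXON-PLACEHOLDER",
                 "strain_id": "STRAIN-PLACEHOLDER"},
    "reagents": [{"name": "anti-GAD mouse monoclonal", "vendor": "Sigma",
                  "catalog_number": "CATALOG-PLACEHOLDER"}],
    "animals": [],
}
INCOMPLETE = {
    "genes": [{"symbol": "Nf1", "accession": None}],
    "organism": {"ncbi_taxonomy_id": "TAXON-PLACEHOLDER",
                 "strain_id": "STRAIN-PLACEHOLDER"},
    "reagents": [{"name": "anti-GAD mouse monoclonal", "vendor": "Sigma",
                  "catalog_number": None}],
    "animals": [],
}

print(missing_items(COMPLETE))
print(missing_items(INCOMPLETE))
```

The incomplete record mirrors the letter's own example of an antibody identified only by its vendor, which is flagged because no catalog number accompanies it.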
Scientific literature is intended not only to report advances, but to ensure repeatability. Without the three minimal pieces of information that the letter outlined, it is impossible to replicate experiments, scientists say. Too many variables are unclear, and any one of them could take an experiment off in an entirely different direction. For instance, a reagent from one vendor could be just different enough from a reagent from another that the entire procedure would be affected.
With that minimal information, the scientists add, the value of animal models could be increased dramatically, as research reporting would also build the record on disease modeling. Findings could be correlated across models, allowing researchers to home in on models that address their particular interests. The NIH had foreseen that benefit:
As more animal models are generated, it is important to develop better mechanisms by which these models are mapped onto human conditions as well as better ways to capture and then access this newly generated information. Such methods must eventually use automated ways of linking various types of representations to identify equivalent, comparable, or related concepts. NIH NCRR Statement of Work (2008)
Despite that effort, which has thus far had only limited success, LAMHDI has not been able to have the effect on disease model research that the NIH had hoped. The original request for proposals asked for:
better mechanisms by which ... models are mapped onto human conditions as well as better ways to capture and then access this newly generated information. Such methods must eventually use automated ways of linking various types of representations to identify equivalent, comparable, or related concepts. NIH NCRR Statement of Work (2008)
Other efforts are under way to refine automated curation to allow initiatives like LAMHDI to take better advantage of existing material. LAMHDI’s advances will be linked to their success. Also, other LAMHDI-like efforts can be built on LAMHDI, as the software technology* is all open source (that is, freely available to anyone who wants to use it). LAMHDI in turn is based in part on another open-source project funded by NIH and built by TCG, NITRC (http://www.nitrc.org), the Neuroimaging Informatics Tools and Resources Clearinghouse. NITRC facilitates finding and comparing neuroimaging resources for functional and structural neuroimaging analyses. NITRC collects and points to standardized information about the tools it includes. LAMHDI was intended to be more like NITRC, as evidenced from the slide shown in Fig. 2.3, which the LAMHDI team presented at the kickoff meeting for the project.
*LAMHDI is based on PHP, and uses the CakePHP framework, LAMP stack, Postgres, MnogoSearch, and Sphinx, among other tools.
To date, neither the “participate” nor the “curate” effort has been a focus of LAMHDI. NITRC’s strength is its community (represented by “participate”), and LAMHDI has yet to incorporate either the discussion forums or the automated processes to accept and validate individual model listings from researchers. Accepting disease model listings from researchers means establishing standards for data that go beyond the three minimal data elements urged for journals, and engaging the “curate” function in the slide by creating a formal mechanism for evaluating suggestions by allowing scientists to rate models and their appropriateness. Such crowd-sourcing may be in LAMHDI’s future.
Another important model for LAMHDI is NIF. Good practices and technological advances pioneered by NIF may be applicable to LAMHDI as well. For instance, NIF is exploring the use of a tool that uses a semantic
BOX 2.1
LETTER FROM SCIENTISTS TO JOURNAL EDITORS AND PUBLISHERS
PUBLISHING IN THE 21ST CENTURY: MINIMAL (REALLY) DATA STANDARDS
Although researchers write papers for other researchers, the primary consumers of data and information these days are not humans but computers. Computers find data, display and analyze them, provided that the data are structured in a way that allows these functions to happen. The form of the scientific paper has been honed over many hundreds of years for humans to use. But scientific papers are difficult for a computer to understand. The beauty and frustration of human language is that the same word or phrase can mean many things and the same thing can be described many different ways. While poetry is enriched by this mystery, scientific literature is hamstrung by it. Sometimes even expert scientific curators have a difficult time extracting accurate information from an article when the information is incomplete, ambiguous, or (in more cases than scientists perhaps would care to admit) missing altogether. It is no surprise that computers have a problem.
We are on the cusp of an evolution in scientific publishing, and that evolution may involve new ways of reporting information, e.g., the structured digital abstract, ontology. However, during this evolution, while tools and strategies are being developed and tested, those of us who are charged with providing access to information in the scientific literature (database curators) have identified three simple practices that could make extracting relevant information from the literature more efficient and thorough. We recommend that scientific journals require that authors do the following in order to meet the publishing needs of the 21st century:
1. Provide gene accession numbers for all genes referenced in the methods section of a paper, per http://www.ncbi.nlm.nih.gov/gene
2. Identify the species for the subject of a study from the NCBI taxonomy and the strains from the model organism databases for mice, rats, worms, zebrafish, and Drosophila, employing any existing unique identifiers and correct species-specific nomenclature:
   a. Mice: http://www.informatics.jax.org/
   b. Zebrafish: http://zebrafish.org/zirc/home/guide.php
   c. Rats: http://rgd.mcw.edu/
   d. Flies: http://flybase.org
   e. Worms: http://wormbase.org
3. Provide catalog numbers and vendor information for all reagents and animals described in the methods section of a paper.
These requirements are minimal, really; authors will not find them onerous, and publishers will not find them difficult to implement.
Our impetus was a recent invitation-only meeting at NIH on the initiative to Link Animal Models to Human DIsease (LAMHDI), a database of animal models that allows cross-species searching. We explored best practices that had been covered by others (e.g., http://zfin.org/zf_info/author_guidelines.html; http://wiki.geneontology.org/index.php/Letter_to_Editors).
Our goal is to have a permanent unique identifier, a “social security number,” if you will, for specific concepts. Even if a concept has many different names, all those names need to map to the same identifier. Similarly, different concepts cannot share the same identifier. Thus, this identifier serves to disambiguate shared names, and link the same concepts across different papers and databases.
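The two rules stated above (many names per identifier, never two concepts per identifier) can be sketched as a small registry. This Python example is purely illustrative: the class `ConceptRegistry` and the identifiers `X:1` and `X:2` are invented for the sketch and do not correspond to any real registry.

```python
class ConceptRegistry:
    """Map many names to permanent identifiers, one concept per identifier."""

    def __init__(self):
        self.labels = {}   # identifier -> preferred label
        self.names = {}    # lowercased name -> set of identifiers using it

    def add(self, identifier, label, aliases=()):
        # Different concepts cannot share the same identifier.
        if identifier in self.labels:
            raise ValueError(
                f"{identifier} already identifies {self.labels[identifier]!r}")
        self.labels[identifier] = label
        # Every name of a concept maps to the concept's single identifier.
        for name in (label, *aliases):
            self.names.setdefault(name.lower(), set()).add(identifier)

    def lookup(self, name):
        """Return all identifiers a name could refer to, in sorted order."""
        return sorted(self.names.get(name.lower(), ()))


registry = ConceptRegistry()
registry.add("X:1", "glutamate decarboxylase", aliases=["GAD"])
registry.add("X:2", "generalized anxiety disorder", aliases=["GAD"])

print(registry.lookup("GAD"))                      # a shared name: two candidates
print(registry.lookup("glutamate decarboxylase"))  # an unambiguous name
```

A shared abbreviation returns more than one candidate, which is exactly the ambiguity that citing the identifier itself removes.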
Below we give details on our three recommendations.
Recommendation 1: Gene Accession Numbers
NCBI’s Entrez Gene is one of the resources that uses permanent unique identifiers for gene symbols and names. Data in Entrez Gene result from a mixture of curation and automated analyses. Annotation of sequences is integrated with information from collaborating model-organism databases, literature review, and public users (with curation by RefSeq staff as required). Once a gene is assigned a unique accession number, that number can never be reused to identify another gene, even if the gene assignment is later determined to be in error. Using this accession number in papers alongside the gene name and abbreviation lets both computers and humans identify the gene unambiguously. Because the accession number is used universally, it also links that gene to the multiplicity of databases and knowledge sources that contain the same identifier. Even if a gene has many different names or if many different genes or other types of entities are called by the same name, curators and automated agents can easily identify them by their unique accession number.
Recommendation 2: Organism Identification
NCBI provides a taxonomy ID for each major species. Within a species, many unique strains and stocks may be developed, each carrying many sequence variants, mutations, and genetically modified genes. Each of the major model organism databases (MGI, ZIRC, WormBase, FlyBase, RGD) provides strain IDs for these genetically unique strains and stocks. If authors use one of these IDs in their papers, a search agent (or a human) can confidently relate the research results to the right organism. These registries also offer standardized, detailed characterizations of organisms’ genotypic information, so authors do not have to provide it. Finally, the registries help researchers obtain identifiers for new organisms, ensuring accessibility and comparability with all other records.
Recommendation 3dReagent Identification
Experimental results in bioscience are fundamentally
reliant on reagents. For gene expression studies,
antibodies are a critical tool for providing detailed spatial
information on the localization of gene products.
Antibodies are generally made to specific sequences that
may be further modified through phosphorylation or
some other event. As any experienced anatomist or cell
biologist knows, different antibodies to the same protein
can give very different results, even within the same
laboratory. Identifying the reagents used is thus critical,
not only for data mining, but also for experimental
troubleshooting. Yet many papers bury or leave out this
information. Curators may have to track reagents through
several papers when they find sections like “we used the
reagents described in study [X].” Worse, sometimes the
information provided may not be sufficient to identify
a specific reagent. For example, a paper may say that the
researchers used a mouse monoclonal antibody against
GAD from Sigma, and Sigma has multiple mouse
monoclonal antibodies against GAD.
Ideally, each reagent should have a unique
identifier; organizations such as NIF are working on
tools to provide that (http://antibodyregistry.org). In
the meantime, authors should identify reagents by both
the vendor and the catalog number. We recognize that
catalog numbers are not foolproof: different vendors
may sell the same reagent under different catalog
numbers and catalog numbers may be reused. (Indeed,
some manufacturers of gene chips use identifiers over
and over again for different probe sets.) However, the
vendor/catalog number identifier will go a long way in
providing more accurate information. Perhaps as
important, providing that information is not a big
burden on authors, and may be easier than the current
practice at some journals of requiring the location of
a vendor. Some journals, e.g., the Journal of
Comparative Neurology, are already requiring a more
complete accounting of antibodies used.
Technology is changing rapidly, and automated search
and analysis agents are getting better at extracting meaning
from unstructured information. But they are far from
perfect. The simple steps outlined here to better identify the
genes under investigation and the reagents used will
accelerate the development and effectiveness of the algorithms.
We can then turn our curators to more important
work, extracting meaning and mining knowledge for new
insights. Our curators can undertake the subtle distillation
of meaning, rather than spend time emailing authors for
basic information. The three key recommendations
described here are neither onerous nor controversial
because they utilize existing technologies and informatics
resources. Thus far, isolated efforts by individual communities
have not been successful. We believe that it is time for
those charged with providing access to the literature, NCBI
and the journals, to take a firm stand on adapting scientific
publishing for the 21st century.
We urge you to adopt our recommendations.
Appendix
Examples*
Organism: “We compared the horizontal optokinetic
reaction (OKR) and response properties of retinal slip
neurons in the nucleus of the optic tract and dorsal
terminal nucleus (NOT-DTN) of albino and wild-type
ferrets (Mustela putorius furo; NCBI Taxonomy ID: 9669).”
(Hoffman et al. (2004) J. Neurosci. 24 (16), 4061–4069.)
Strain: “Wild-type zebrafish strains AB (ZFIN ID:
ZDB-GENO-960809-7), ... and Tubingen (ZFIN ID:
ZDB-GENO-010924-10) were kept and bred as
described.” (Yang et al. (2007) Genome Biol. 8 (10),
R227.)
Antibodies: “immunolabeling of the GABAAR _1, _2,
_3, and _5 subunit, in each respective mice ... The
monoclonal mouse antibody bd-17 (US Biological,
bovine, cat # G1016; 1:400) directed against both _2 and
_3 subunits of GABAARs, recognize the major
GABAAR ...” (Sadlauod et al. (2010) J. Neurosci. 30 (9),
3358–3369.)
*Note: These sentences were extracted from the referenced articles. However, the identifiers for organism and strain were
inserted by the authors of this letter for demonstration purposes; they were not supplied by the original author. The antibody
information was, however, included in the paper and is a good example of the recommended best practice.
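The identifier conventions illustrated in the Appendix examples above lend themselves to automated extraction. The following is a minimal Python sketch, not part of the original letter or of LAMHDI: the regular expressions and the function name extract_identifiers are assumptions made for illustration, and the sample sentences are shortened excerpts of the examples.

```python
import re

# Illustrative patterns for the identifier styles shown in the Appendix.
# They are assumptions for this sketch, not an official grammar for any registry.
PATTERNS = {
    "ncbi_taxonomy_id": re.compile(r"NCBI Taxonomy ID:\s*(\d+)"),
    "zfin_genotype_id": re.compile(r"ZFIN ID:\s*(ZDB-GENO-\d+-\d+)"),
    "catalog_number": re.compile(r"cat\s*#\s*([A-Za-z0-9-]+)"),
}


def extract_identifiers(text):
    """Return every identifier found in text, grouped by identifier type."""
    return {name: pattern.findall(text) for name, pattern in PATTERNS.items()}


# Shortened excerpts of the three example sentences.
organism = "wild-type ferrets (Mustela putorius furo; NCBI Taxonomy ID: 9669)."
strain = (
    "Wild-type zebrafish strains AB (ZFIN ID: ZDB-GENO-960809-7) and "
    "Tubingen (ZFIN ID: ZDB-GENO-010924-10) were kept and bred as described."
)
antibody = "antibody bd-17 (US Biological, bovine, cat # G1016; 1:400)"

found = extract_identifiers(" ".join([organism, strain, antibody]))
for kind, values in found.items():
    print(kind, values)
```

A sentence that names only a vendor, such as the Sigma case described in Recommendation 3, would yield an empty catalog_number list here, which is the kind of gap curators currently resolve by hand.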
FIGURE 2.3 Slide presented at the initial LAMHDI project meeting, illustrating how LAMHDI was intended to function.
hierarchy to filter results and display them in context. This tool could be ideal for pathway searches, the next big challenge for LAMHDI. In addition, LAMHDI will follow NIF’s lead in creating ways to move between databases using the semantics or ontologies developed by others, as it now uses OMIM and Homologene. LAMHDI will continue to link to repositories elsewhere rather than bringing those repositories into LAMHDI. Yet LAMHDI has far to go to fully meet the NIH’s expectations.
THE IDEAL SOLUTION
Ideally, scientists want to use disease models to improve the validity of their research and to understand better how disease processes work. Even human models are not perfect: humans vary too much in their genetic structure, the effect disease has on them, and their reactions to drugs to allow us to be sure that data collected from one subject is representative of all human subjects. For some processes, scientists can apply what they learn about animals to what they know of humans. The aim is to understand the system well enough to have the tools to intelligently design drugs and vaccines.
As the NIH wrote in the statement of work:
The proper modeling of human disease requires an understanding of how conditions in nonhuman species relate to human conditions. In very few cases are the conditions produced in animals equivalent to the human condition; it is more common for animal models to present one or more features that have relevance to a particular aspect of the human disease. In some cases, this relevance to a human condition is relatively straightforward; in others, it is quite complex. And it is not surprising, but is currently relatively uncommon, to find that model data generated during studies of one disease are also relevant to understanding another seemingly unrelated disease. A system that enables the efficient capture of these cross-over bits of information would be valuable to any therapeutic development process.
NIH NCRR Statement of Work (2008)
Pathways are one way of looking at the complex data that come from working with disease models. Scientists have linked phenotypes of model organisms together using similarity algorithms to fill in when genetic information is not available, but the final linking has to be through ontology, the science of describing concepts. The LAMHDI team cannot get away from ontology, because searchers use synonyms and databases need options for finding data. Moreover, LAMHDI needs to standardize, or at least link, the various ontological structures related to human disease into a semantic web that will help build common vocabularies and make data resources interoperable. LAMHDI currently brings together model organism genotypes and genes, but not the data about them. Scientists cannot search data about infectious disease, for instance, without also exploring immunology, innate immunity, and inflammation, and the information universe explodes when they get to metabolic pathways. Without a set of standard concepts, this landscape cannot be navigated.
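The role of a shared vocabulary in handling searchers' synonyms can be made concrete with a small example. This is a minimal Python sketch under stated assumptions: the concept IDs, labels, and synonym lists are placeholders invented for illustration and are not drawn from LAMHDI or from any real ontology.

```python
# Hypothetical concept table: IDs and synonym lists are illustrative placeholders,
# not entries from a real ontology.
CONCEPTS = {
    "CONCEPT:0001": {"label": "diabetes mellitus", "synonyms": ["diabetes", "DM"]},
    "CONCEPT:0002": {"label": "neoplasm", "synonyms": ["cancer", "tumor", "tumour"]},
}


def build_index(concepts):
    """Map every label and synonym (case-insensitively) to its concept ID."""
    index = {}
    for concept_id, entry in concepts.items():
        for term in [entry["label"], *entry["synonyms"]]:
            index[term.casefold()] = concept_id
    return index


def normalize(term, index):
    """Return the concept ID for a search term, or None if it is unknown."""
    return index.get(term.strip().casefold())


INDEX = build_index(CONCEPTS)
print(normalize("Tumour", INDEX))
print(normalize("inflammation", INDEX))
```

Once two databases agree on the concept ID, a query phrased with either synonym can be routed to both, which is the interoperability the paragraph above calls for. A term missing from the table returns None, a reminder that coverage depends entirely on the vocabulary behind it.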
The LAMHDI team needs new services to allow LAMHDI to link to NIF, for instance, to access key genomic and phenotypic data. LAMHDI also needs to incorporate spatial data, to allow researchers to visualize physical structures and pathways, and access the terminology to get to more data about animal models. For instance, cancer and diabetes tie together through the PI3K/AKT/mTOR pathway. While pathway conservation is partial in yeast, it elaborates as it moves up the evolutionary scale. By seeing a crosswalk on simple systems, where part of a pathway might shed light on, say, the functioning of an antagonist, researchers can more easily figure out how that pathway might work in more complex models. The result can be insights for scientists looking for linkages in humans, and links to the human condition. The flood of human genome sequence data that will become available over the next few years can be mapped onto these networks through systems like LAMHDI, which will incorporate the capacity to move around the evolutionary hierarchy and its pathways. A spatial view, for instance, might start with a human disease, jump to human genes, then to their homologues in other species, then to a spatial viewer showing those genes expressed. Or it could start with a gene expression map in a spatial viewer; if there is substantial expression in a particular zone in a mouse, LAMHDI could allow a researcher to jump to the equivalent human gene (or other ortholog) and see its expression, and then view the relevant disease models. This kind of functionality can help science reduce, reuse, and replace animal models by building on existing research data.
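The navigation described above, from a human disease to human genes, to homologues in other species, and on to disease models, amounts to a chain of lookups across linked tables. The following Python sketch illustrates that chain only. Every identifier in it is a made-up placeholder, and the table and function names are assumptions rather than LAMHDI's actual data model.

```python
# Toy linked tables; every identifier below is a made-up placeholder.
DISEASE_TO_HUMAN_GENES = {"disease_X": ["HUMAN_GENE_A", "HUMAN_GENE_B"]}

HOMOLOGS = {
    "HUMAN_GENE_A": {"mouse": "MOUSE_GENE_A", "zebrafish": "FISH_GENE_A"},
    "HUMAN_GENE_B": {"mouse": "MOUSE_GENE_B"},
}

MODELS_BY_GENE = {
    "MOUSE_GENE_A": ["mouse_strain_1"],
    "FISH_GENE_A": ["fish_line_7"],
    "MOUSE_GENE_B": [],  # homolog known, but no model recorded
}


def models_for_disease(disease):
    """Follow disease -> human genes -> homologs -> models, grouped by species."""
    results = {}
    for human_gene in DISEASE_TO_HUMAN_GENES.get(disease, []):
        for species, gene in HOMOLOGS.get(human_gene, {}).items():
            for model in MODELS_BY_GENE.get(gene, []):
                results.setdefault(species, []).append(model)
    return results


print(models_for_disease("disease_X"))
```

The reverse route mentioned in the text, starting from an expression map in a model organism and jumping to the human gene, would need the same tables indexed in the opposite direction.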
The LAMHDI team plans to extend LAMHDI’s coverage in drug discovery in the realm of infectious disease and in spatial and temporal searching, starting with neurodegenerative diseases involving drug interactions and pathways. It also plans to work with the scientific community to better identify data through metadata standards. Finally, it will reach more broadly and deeply to engage audiences, from the general public to school children to regulatory organizations worldwide, to better explain the use of models for human disease, and to engage scientists to use resources like LAMHDI to expand their knowledge and spark their imaginations.
The full import of LAMHDI and similar resources is that they support scientists in the most important work they do, to understand the connections among facts, the relationships of different data elements. This is little understood in the abstract (although scientists get it when they see it). Francis S. Collins, the Director of the National Institutes of Health, wrote in a 2011 issue of Science Translational Medicine, “The use of animal models for therapeutic development and target validation is time consuming, costly, and may not accurately predict efficacy in humans.”2 But modeling is not just for direct testing; it is used for gaining insight into disease processes and therefore shedding light on treatment and cures. As James DeLeo, a researcher at the NIH, wrote,3 “modeling is meaningful even when the models may be imperfect.”†
†In the same article he characterized modelers as “renegade individualists who are fuzzy members of the fuzzy subsets of different modeling disciplines such as computer science, statistics, bioinformatics, analytics and others.”
Scientists are seeking insight, not perfection. They are looking for data points on which to hang a theory. Linked data about disease models can help design experiments as well. Looking at pathways in nematodes or flies can tell scientists about pathways in mice, nonhuman primates, and humans. The bigger the information store researchers create, the more linkages they will find.
If someone wants just lists of models, PubMed might suffice. But to learn the value of a model, LAMHDI is showing the way. LAMHDI and similar systems allow researchers to make value decisions about disease models. In some ways, LAMHDI is positioning itself to become the model organism database for humans.
Acknowledgments
I would like to acknowledge the NIH, as the originator of the program that led to LAMHDI. I’d also like to acknowledge the contributions of the participants at the NIH Expert Panel Meeting in Seattle, 19–20 August 2008, who were looking for ways to make best use of the massive amounts of biomedical research data now
being produced, and whose input led to the creation of the LAMHDI project: Dave Anderson, University of Washington, School of Medicine, National Primate Research Center; Kevin Dawson, University of California, Davis Center of Excellence in Nutritional Genomics; John F. Elder, Elder Research, Inc., Knowledge Discovery and Data Mining; Mark Ellisman, Neurosciences and Bioengineering, University of California, San Diego, School of Medicine; Janan T. Eppig, The Jackson Lab/Mouse Genome Informatics Database; Trey Ideker, Bioengineering, University of California, San Diego; Michael Katze, Microbiology, University of Washington, National Primate Research Center; Joseph Kemnitz, University of Wisconsin-Madison Medical School, National Primate Research Center; Bret Peterson, Google, Bioengineering, Neuroscience, Computer science; John Quakenbush, Dana-Farber Cancer Institute/Harvard School of Public Health; Joel Stiles, Computational Physiology, Pittsburgh Supercomputing Center, Carnegie Mellon University; Eric Von Schweber, PharmaSurveyor, Neological Corp., Synsyta LLC, Infomaniacs; Linda Von Schweber, PharmaSurveyor, Neological Corp., Synsyta LLC, Infomaniacs; Monte Westerfield, Institute of Neuroscience, University of Oregon; and Stuart Zola, Emory School of Medicine, National Primate Research Center.
References
1. Wood MW, Hart LA. Selecting appropriate animal models and strains: Making the best use of research, information and outreach. In: Proceedings of the 6th World Congress on Alternatives and Animal Use in the Life Sciences, August 21–25, 2007; 2008. Tokyo, Japan. AATEX 14, Special Issue, 303–306.
2. Collins FS. Reengineering Translational Science: The time is right. Sci Transl Med 2011;3(90): 90cm17.
3. DeLeo J. Guest Editorial. Identifying and Overcoming Skepticism about biomedical Computing: Modelers should take the lead. Biomed Comp Rev Summer 2012;2012:1–2.