CHAPTER 2

Access to Resources
A Model Organism Database for Humans

Judith Axler Turner
TCG, Washington, DC, USA

OUTLINE

The Problem
The LAMHDI Solution
LAMHDI Database Search
Some Sources are Curatorial
Some Sources Provide the Model Information
Performing the Search
The Advantages and Limitations of LAMHDI
The Ideal Solution
Acknowledgments
References

THE PROBLEM

Arguably the single most essential element in animal-based research, identifying and selecting the most appropriate animal model is also the most challenging. Wood and Hart, 2008 (reference 1), p. 303

While animal models are central to the effective study and discovery of treatments for human diseases, finding appropriate models for every stage of research is still a challenge. Scientists interested in the “three Rs” of animal model research (refining, reducing, and replacing models where possible) are seeking models at the lowest level that can provide insight into knotty problems. But cross-species information is hard to find and hard to compare: each model database (where databases exist at all) is different, with different structures, different data, and different search strategies. Learning one database is rarely helpful in searching another.

The scientific literature typically does not provide answers to the question of whether another model might be a better choice at various points in disease research. Journal articles focus on the experiment and its results, and not the model. While some articles fully identify the model used, rarely are alternatives discussed.

This chapter looks at one initiative to provide information about and access to appropriate disease models across species: LAMHDI, the project to Link Animal Models to Human DIsease. LAMHDI was created to gather and organize information about extant disease models so that researchers could identify and compare options across species and strains. LAMHDI (http://www.lamhdi.org) is available and searchable through a Web interface.

“When I go to a primate center meeting, there is always a question from the audience, ‘how do I find out what models are available?’ and I do a song and dance and the final answer is ‘give me a call,’” an official at the National Institutes of Health (NIH) said in a private conversation before LAMHDI was built. “Access to and use of disease model information is currently an inefficient process owing at least in part to the large number and complex nature of existing disease models and associated data,” the National Center for Research Resources (NCRR) wrote in the request for proposals that was issued in 2008.

Animal Models for the Study of Human Disease. http://dx.doi.org/10.1016/B978-0-12-415894-8.00002-6. Copyright © 2013 Elsevier Inc. All rights reserved.



Even those who built the most important disease model databases recognized the problem. Monte Westerfield (Institute of Neuroscience, University of Oregon), a founder of ZFIN, the Zebrafish Information Network, said in a presentation at the LAMHDI organizational meeting that heterogeneous information, multiple sites, inconsistent search mechanisms, inconsistent use of descriptive terms, lack of support for synonyms, and inconsistent definitions of disease models impede finding animal models for human disease. Even the definition of a disease model was a problem: is a disease represented by a single gene, is it multigenic, or is it an infectious disease and therefore non-genetic? “Mutations in different genes can produce the same phenotypes,” he said, and “mutations in the same gene in different organisms can result in different phenotypes.”

Nevertheless, the NCRR was so confident that a cross-species database of disease models could be developed that it limited respondents to its request for proposals to small businesses, not the research enterprise. NCRR was looking for a product, not an experiment. The statement of work said:

More rigorous methods for describing animal models will allow researchers to identify and examine commonalities and differences across multiple animal models. As more animal models are generated, it is important to develop better mechanisms by which these models are mapped onto human conditions as well as better ways to capture and then access this newly generated information.
NIH NCRR Statement of Work (2008)

The NIH believed it could be done.

Moreover, the NIH expected the database to do more than just connect researchers with information about disease models in various species. At an NIH meeting a few weeks before the contract was awarded, scientists, many of whom would become part of the LAMHDI effort, sought to find a way to maximize productivity in biomedical research. They wanted to harness the massive amounts of data that were being produced through new technological advances, and analyze, share, and perform computational studies on them to translate them into beneficial medical advances.

“The problem is not specifically a lack of resources: the NIH alone funds billions of dollars each year for biomedical research. Rather, the issue is the lack of ability to effectively utilize data across experimental groups, institutions, and domains,” one speaker at the invitation-only conference said.

Their solution was to create knowledge environments, information infrastructures that would provide a framework for effective computation on disease-model data. The first goal was to bring together available data on animal models in a single environment.

Such a system would not be breaking entirely new ground. The University of California, San Diego (UCSD), had already built the successful Neuroscience Information Framework, or NIF. This resource, established in 2004, is a dynamic inventory of Web-based neuroscience resources: data, materials, and tools accessible on the Internet. NIF combines ontology, standards, and support for “identifying, locating, relating, accessing, integrating, and analyzing information from the neuroscience research enterprise” (http://www.neuinfo.org/about/index/shtm).

The LAMHDI contract was awarded to a team that combined a small business successful in federal information-technology contracting and noted disease researchers and disease-model providers. TCG, a Washington, DC-based information-technology company, brought together a team of scientists from prominent disease research centers, led by the UCSD Center for Research in Biological Systems and NIF, the University of Washington, and the University of Wisconsin. Their proposal was for LAMHDI.

THE LAMHDI SOLUTION

The NIH intention was clear. The government’s statement of work said:

The initial component of this project will address the development of the front end of the resource: specifically, a directory of available disease models. The envisioned directory would enhance information access and retrieval by assisting researchers to find information about animal models, and by supporting experts seeking to assist researchers to find such information. This resource directory would provide reference materials, information about the resources, and information about the animal models themselves. Initially this component would focus on a few model animal species (e.g., mouse, zebrafish), and eventually expand to other species and along other dimensions (e.g., microbes, tissues). Resource database development would focus on the needs of both the animal model and human disease research communities and the characteristics of existing resources, and address issues such as data description standards, user interface, user services, etc.
NIH NCRR Statement of Work (2008)

LAMHDI was designed to be a tool to help researchers fulfill the NIH mission of improving health and saving lives. LAMHDI allows researchers to find the most appropriate models of human disease by comparing disease models across species, and to access those models.

The scientific community wanted to be able to search model organism databases, animal resources, and OMIM (Online Mendelian Inheritance in Man, http://ncbi.nlm.nih.gov/omim, a database of human disease and genetic disorders) by disease, gene, pathway, organ, tissue, cell type, and GO terms (Gene Ontology, http://geneontology.org/, a controlled vocabulary of terms for describing gene product characteristics and gene product annotation data).

NIF, which let researchers access literature and information through a human-curated database, was a possible model. But establishing and maintaining a curated catalog is challenging. Maryann Martone, the principal investigator at NIF, warned the LAMHDI team in a conversation that curation is hard: “In one database a mouse is a model, in another it’s a number in a table. Terminology is a big problem.” NIF attacks the problem by using teams of scientific curators to verify every entry, and slowly build a standard lexicon of terms. If LAMHDI had followed that model, it would have had to develop its own set of terms for its disease-model databases, likely building on NIF but going further into all human diseases.

Heeding Martone’s warnings, the LAMHDI team looked for a quicker and more automated approach, to give researchers the first rough draft of the data they would need to make informed decisions about disease models. The LAMHDI team chose to build on existing curated databases, and automate the linkages. The current version of LAMHDI provides a database and website that allows researchers to search for and access appropriate models of human disease in mice, zebrafish, rats, yeast, and flies. LAMHDI also allows users to search the PrimateLit database of articles about nonhuman primate models of human disease, and to do Google-like searches of select, topic-relevant Web sites. All data included in LAMHDI was chosen and curated by the experts in the areas covered by the existing databases. LAMHDI’s aim was to save time for scientists who once had to go to the source databases and learn multiple different search strategies and deal with multiple vocabularies. On LAMHDI, a single search brings results from many scientific databases. Developers devised a smart search engine that could present multiple species from disparate databases, and provided added value by presenting information about where to access individual models.

LAMHDI does not create the data it offers to users; that is collected and curated by scientists worldwide. LAMHDI’s role is to build the software that translates from one data system or data structure to another, so scientists can search across existing databases and find relationships that will help inform research. This is no trivial task. LAMHDI users can enter a disease name, and find related models in five species, yet source databases may not have disease names, or may not link particular models to those diseases, even though a link may exist.

LAMHDI’s strength is its system architecture. It brings in a database of disease models, including genetic and annotation data when available, and runs each model against the National Library of Medicine’s Medical Subject Headings (MeSH) controlled vocabulary to identify all related terms. After scanning the files for the information to present, namely species; name or identifier for the model; diseases to which it relates; access information (typically the name of the laboratory where the model may be purchased); and “other” information that may be useful, such as descriptions of the model, relevant literature, the record of the model, and the genes involved, it applies keyword tags to the data to allow it to come up quickly in searches. These tags represent the results of the pre-searching that the system does. See Figs 2.1 and 2.2 for examples of LAMHDI search results.

The LAMHDI site’s “about” page (http://lamhdi.org/about), shown below between the two rules, offers the best description of how it works:

LAMHDI Database Search

LAMHDI brings together scientifically validated information from various sources to create a composite multi-species database of animal models of human disease. To do this, the LAMHDI database is prepared from a variety of sources.

Some Sources are Curatorial

The LAMHDI team takes publicly available data from OMIM, NCBI’s Entrez Gene database, Homologene, and WikiPathways, and builds a mathematical graph (think of it as a map or a web) that links these data together to help discover connections. We use OMIM to link human diseases with specific human genes. We use Entrez for universal identifiers for each of those genes. We link human genes to their counterpart genes in other species using Homologene, and we link those genes to other genes that are tentatively or authoritatively linked with them in some way using the data in WikiPathways.

This preparatory work gives LAMHDI a web of human diseases linked to specific human genes, orthologous human genes, homologous genes in other species, and both human and nonhuman genes involved in specific metabolic pathways associated with those diseases.
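The linking step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not LAMHDI's actual code (which the chapter says is built on PHP): the record layouts, the node-naming scheme, and the `build_graph` function are all hypothetical, and the three small inputs merely stand in for extracts from OMIM, Homologene, and WikiPathways.

```python
from collections import defaultdict

# Illustrative records only: these stand in for extracts from OMIM,
# Homologene, and WikiPathways. Every identifier here is hypothetical.
DISEASE_TO_GENES = {"disease:D1": ["gene:human:G1"]}
HOMOLOG_GROUPS = [["gene:human:G1", "gene:zebrafish:g1", "gene:fly:G1"]]
PATHWAYS = {"pathway:P1": ["gene:human:G1", "gene:human:G2"]}


def build_graph(disease_to_genes, homolog_groups, pathways):
    """Return an undirected adjacency map of diseases, genes, and gene linkages."""
    graph = defaultdict(set)

    def link(a, b):
        # Store each edge in both directions so paths can be walked either way.
        graph[a].add(b)
        graph[b].add(a)

    # Disease-to-gene links (the role the text assigns to OMIM).
    for disease, genes in disease_to_genes.items():
        for gene in genes:
            link(disease, gene)

    # Cross-species gene links, modeled as one cluster node per homolog group.
    for index, group in enumerate(homolog_groups):
        cluster = f"cluster:homolog:{index}"
        for gene in group:
            link(cluster, gene)

    # Pathway links, modeled as one node per pathway.
    for pathway, genes in pathways.items():
        for gene in genes:
            link(pathway, gene)

    return graph


graph = build_graph(DISEASE_TO_GENES, HOMOLOG_GROUPS, PATHWAYS)
print(sorted(graph["gene:human:G1"]))
```

Representing each homolog group or pathway as its own node, rather than as direct gene-to-gene edges, is a design choice made here for illustration; it keeps the linkage itself available as a step in any path that crosses it.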

Some Sources Provide the Model Information

LAMHDI includes model data that partners share with us, which is plugged into our data structure. For instance, MGI (Mouse Genome Informatics, http://www.informatics.jax.org/, an “international database resource for the laboratory mouse, providing integrated genetic, genomic, and biological data to facilitate the study of human health and disease,” according to the website) provides information about mouse models, including a disease for each model, as well as a bit of genetic information (the ID of the model, in fact, identifies one or more genes); while ZFIN provides genetic information for each zebrafish model, but no diseases. We plug each zebrafish model into our database using the genes as the glue. For instance, a hypothetical zebrafish that involves the zebrafish pkd2 gene would plug into the larger disease-gene map at the node representing the zebrafish pkd2 gene, which is connected to the node representing the human PKD2 gene, which in turn is connected to the node representing the human disease known as polycystic kidney disease. (Some of the partner data we receive can even extend our base map. MGI provides a disease for every model, and in some cases this allows us to create a disease-to-gene relationship in our database that might not already be documented in the OMIM dataset.)

FIGURE 2.1 A recent search for neurofibromatosis-1 optic glioma yielded the screen pictured here. A total of 83 records matched the search terms. That example search brought back 40 mouse models, 21 fly models, 19 zebrafish models, and 3 yeast models. The first mouse model, with a score of 99, is the most likely hit; the next closest match had a score of 68. Scores are determined by the number of jumps (logical inferences) in the search and the strength of the evidence for those jumps.

FIGURE 2.2 In the LAMHDI search depicted here, the jumps are identified for the first fly model. The record for Drosophila melanogaster, PBac{RB}Nf1e00084, does not contain the terms neurofibromatosis-1 optic glioma. To get to that disease model, LAMHDI first applied the search terms to OMIM, the “comprehensive, authoritative, and timely compendium of human genes and genetic phenotypes” maintained by the National Center for Biotechnology Information (NCBI). The OMIM record (http://omim.org/entry/162200) for neurofibromatosis-1 optic glioma included nearly 300 references and a pointer to the human gene NF1. LAMHDI ran “NF1” through Homologene (http://www.ncbi.nlm.nih.gov/sites/entrez?db=homologene&cmd=search&term=226), NCBI’s homologue database, and found the fly gene Nf1. A search for Nf1 in FlyBase (http://flybase.org) brought up this particular model.

With our curatorial and model information in hand, we run a lengthy automated process that exhaustively searches for every possible path between each model and each disease in the data, up to some arbitrary number of hops (producing for each disease-to-model pair a set of links from the disease to the model). The algorithm avoids circular paths and paths that include more than one disease anywhere in the middle of the path. At the end of this phase, LAMHDI has a comprehensive set of paths representing all the disease-to-model relationships in our data, varying in length from one hop to many hops. Each disease-to-model path is essentially a string of nodes in our data, where each node represents a disease, a gene, a linkage between genes (an ortholog, a homolog, or a pathway connection, referred to as a gene cluster or association), or a model. Each node has a human-friendly label, a set of terms and keywords, and, in most cases, a URL linking the node to the data source where it originated. We’re now ready for our users.
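A bounded path enumeration of the kind just described might look like the following Python sketch. It is a reading of the prose, not the LAMHDI implementation: the function `find_paths`, the node-naming prefixes, and the toy graph are assumptions, and the rule about diseases "in the middle of the path" is interpreted here as allowing at most one intermediate disease node.

```python
# Toy undirected graph (hypothetical node names). One mouse model is linked
# directly to the disease; one zebrafish model is linked only through genes.
GRAPH = {
    "disease:PKD": {"gene:human:PKD2", "model:mouse:m1"},
    "gene:human:PKD2": {"disease:PKD", "cluster:c1"},
    "cluster:c1": {"gene:human:PKD2", "gene:zebrafish:pkd2"},
    "gene:zebrafish:pkd2": {"cluster:c1", "model:zebrafish:z1"},
    "model:zebrafish:z1": {"gene:zebrafish:pkd2"},
    "model:mouse:m1": {"disease:PKD"},
}


def is_model(node):
    return node.startswith("model:")


def is_disease(node):
    return node.startswith("disease:")


def find_paths(graph, start, is_model, is_disease, max_hops):
    """Yield every cycle-free path from `start` to a model within `max_hops` edges."""
    # Each stack entry: (current node, path so far, intermediate diseases seen).
    stack = [(start, [start], 0)]
    while stack:
        node, path, mid_diseases = stack.pop()
        if len(path) - 1 >= max_hops:
            continue  # no hops left to extend this path
        for nxt in sorted(graph.get(node, ())):
            if nxt in path:
                continue  # avoid circular paths
            if is_model(nxt):
                yield path + [nxt]  # a model always ends the path
                continue
            extra = 1 if is_disease(nxt) else 0
            if mid_diseases + extra > 1:
                continue  # more than one disease in the middle of the path
            stack.append((nxt, path + [nxt], mid_diseases + extra))


for path in find_paths(GRAPH, "disease:PKD", is_model, is_disease, max_hops=4):
    print(len(path) - 1, "hops:", " -> ".join(path))
```

With `max_hops=4` the toy graph yields a one-hop path to the mouse model and a four-hop path to the zebrafish model; lowering the limit to 3 drops the zebrafish path.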

Performing the Search

When a researcher submits a search on the LAMHDI website, LAMHDI searches for the user’s search terms in its precomputed list of all known disease-to-model paths (looking for the terms not only in the disease and model nodes, but also every node along each path). The complete set of hits may include multiple paths between any given disease-to-model pair of endpoints. Each of these disease-to-model pair sets is ordered by the number of hops it involves, and the one involving the fewest hops is chosen to represent its respective disease-to-model pair in the search results presented to the user (which are sorted overall by their search term matching scores).

The number of hops is one barometer of the strength of the evidence linking the model and the disease; fewer hops indicates the relationship is stronger, more hops indicates it may be weaker. Note that this indicator works best for comparing models from a single partner dataset: MGI explicitly identifies a disease for each mouse model, so there can be disease-to-model hits for mice that involve just one hop. Because ZFIN does not explicitly identify a disease for each model, no zebrafish model will involve fewer than four hops to the nearest disease: from the zebrafish model to a zebrafish gene, to a gene cluster, to a human gene, to a human disease.
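The selection step described in this section (keep the fewest-hop path for each disease-to-model pair, then sort the survivors by matching score) can be sketched as follows. The `rank_hits` function and the sample hits are hypothetical, and the scores are supplied as plain numbers because the text does not spell out how LAMHDI computes them.

```python
def rank_hits(hits):
    """Keep the shortest path per (disease, model) pair, sorted by score.

    `hits` is a list of (path, score) tuples, where `path` is a list of node
    labels running from a disease to a model and `score` is an assumed
    search-term matching score (higher is better).
    """
    best = {}
    for path, score in hits:
        key = (path[0], path[-1])  # the disease-to-model endpoints
        if key not in best or len(path) < len(best[key][0]):
            best[key] = (path, score)
    # Present one representative per pair, best matching score first.
    return sorted(best.values(), key=lambda hit: hit[1], reverse=True)


# Hypothetical hits: two paths reach the zebrafish model, one reaches the mouse.
HITS = [
    (["disease", "human gene", "cluster", "fish gene", "fish model"], 60),
    (["disease", "human gene", "pathway", "human gene 2", "cluster 2",
      "fish gene", "fish model"], 40),
    (["disease", "mouse model"], 99),
]

for path, score in rank_hits(HITS):
    print(score, len(path) - 1, "hops:", " -> ".join(path))
```

In this sketch the surviving path simply carries its own score forward; whether LAMHDI scores a pair by its shortest path or by some combination of paths is not stated in the text.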

The Advantages and Limitations of LAMHDI

A researcher might do the same searches without LAMHDI, looking for relationships and pointers from article to database to database to article to model. But a researcher could take an hour or more to do each search; LAMHDI returns its results in fractions of a second. When searches are that fast, researchers can do dozens, even scores of them to find exactly the right result. Moreover, LAMHDI’s software does not just search, it infers relationships that could lead researchers to new ways of thinking about the data. LAMHDI does not claim to find the right result. It takes a trained scientist to evaluate the results and decide whether Drosophila melanogaster, PBac{RB}Nf1e00084, is a crummy model or just right for what a researcher needs to do at a particular point in her work.

LAMHDI’s main strength today is its speed: it lets a scientist try out ideas and follow thoughts without being distracted by the process. The NIH expected that:

an important motivation for developing better access to resources for animal models of human disease [would be] the enhanced ability to search across multiple animal models as well as to capture specific model information that may be relevant to multiple diseases.
NIH NCRR Statement of Work (2008)

But LAMHDI could go only so far. The five disease model databases it includes are the only extensive, public, well-organized, and well-curated disease-model databases: MGI, the Mouse Genome Informatics website (http://www.informatics.jax.org/); ZFIN, the Zebrafish Model Organism Database (http://www.zfin.org/); RGD, the Rat Genome Database (http://rgd.mcw.edu/); SGD, the Saccharomyces Genome Database (http://www.yeastgenome.org/); and FlyBase (http://www.flybase.org/). (Several well-known databases, such as the Knockout Rat Consortium (http://www.knockoutrat.org/) and KOMP, the Knockout Mouse Project (http://www.nih.gov/science/models/mouse/knockout/index.html), are subsets of the disease-model databases LAMHDI already uses.) Information about all other disease models is stored in private, or small, or out-of-date, or non-standard databases. For instance, zoos are eager to make their animals available for research (through non-invasive procedures), but do not have comparable data for structured searches. If you need a hippopotamus (for research into obesity and longevity, for instance), you have to phone up the local zoo, and if they do not have the capability to provide cheek swabs or blood, keep calling zoos. The national primate research centers have reams, even petabytes (10^15 bytes, or one million gigabytes) of information on their charges, but it is locked in individual researchers’ files and not standardized in useful ways. Even commercial providers of disease models do not create the kind of data that a computer can parse; they get the data about their models from the same public databases that LAMHDI already uses.

The NIH was aware of the problem; the request for proposals asked for “more rigorous methods for describing animal models.” To help meet that goal, LAMHDI scientists met to consider the problem. One issue was that searches of the scientific literature did not easily yield information about disease models. A relative handful of articles mentioned the exact strain or source of the disease model, and certainly not in a standardized format that could be parsed by a computer. It seemed to the LAMHDI team that their efforts could not succeed without help from the journals themselves. The LAMHDI team, supported by scientists from a range of disciplines, drafted a letter for journal publishers urging them to establish standards for information about animal models (see Box 2.1).

The letter asks that journals enforce three “minimal” practices for authors:

1. Provide gene accession numbers for all genes
2. Identify the species for the subject of a study based on the NCBI taxonomy, and the strains from the model organism databases
3. Provide catalog numbers and vendor information for reagents and animals.

Scientific literature is intended not only to report advances, but to ensure repeatability. Without the three minimal pieces of information that the letter outlined, it is impossible to replicate experiments, scientists say. Too many variables are unclear, and any one of them could take an experiment off in an entirely different direction. For instance, a reagent from one vendor could be just different enough from a reagent from another that the entire procedure would be affected.

With that minimal information, the scientists add, the value of animal models could be increased dramatically, as research reporting would also build the record on disease modeling. Findings could be correlated across models, allowing researchers to home in on models that address their particular interests. The NIH had foreseen that benefit:

As more animal models are generated, it is important to develop better mechanisms by which these models are mapped onto human conditions as well as better ways to capture and then access this newly generated information. Such methods must eventually use automated ways of linking various types of representations to identify equivalent, comparable, or related concepts.
NIH NCRR Statement of Work (2008)

Despite that effort, which has thus far had only limited success, LAMHDI has not been able to have the effect on disease model research that the NIH had hoped. The original request for proposals asked for:

better mechanisms by which ... models are mapped onto human conditions as well as better ways to capture and then access this newly generated information. Such methods must eventually use automated ways of linking various types of representations to identify equivalent, comparable, or related concepts.
NIH NCRR Statement of Work (2008)

Other efforts are under way to refine automated curation to allow initiatives like LAMHDI to take better advantage of existing material. LAMHDI’s advances will be linked to their success. Also, other LAMHDI-like efforts can be built on LAMHDI, as the software technology* is all open source (that is, freely available to anyone who wants to use it). LAMHDI in turn is based in part on another open-source project funded by NIH and built by TCG, NITRC (http://www.nitrc.org), the Neuroimaging Informatics Tools and Resources Clearinghouse. NITRC facilitates finding and comparing neuroimaging resources for functional and structural neuroimaging analyses. NITRC collects and points to standardized information about the tools it includes. LAMHDI was intended to be more like NITRC, as evidenced from the slide shown in Fig. 2.3, which the LAMHDI team presented at the kickoff meeting for the project.

*LAMHDI is based on PHP, and uses the CakePHP framework, LAMP stack, Postgres, MnogoSearch, and Sphinx, among other tools.

To date, neither the “participate” nor the “curate” efforts has been a focus of LAMHDI. NITRC’s strength is its community (represented by “participate”), and LAMHDI has yet to incorporate either the discussion forums or the automated processes to accept and validate individual model listings from researchers. Accepting disease model listings from researchers means establishing standards for data that go beyond the three minimal data elements urged for journals, and engaging the “curate” function in the slide by creating a formal mechanism for evaluating suggestions by allowing scientists to rate models and their appropriateness. Such crowd-sourcing may be in LAMHDI’s future.

Another important model for LAMHDI is NIF. Good practices and technological advances pioneered by NIF may be applicable to LAMHDI as well. For instance, NIF is exploring the use of a tool that uses a semantic


BOX 2.1

LETTER FROM SCIENTISTS TO JOURNAL EDITORS AND PUBLISHERS


PUBLISHING IN THE 21ST CENTURY: MINIMAL (REALLY) DATA STANDARDS

Although researchers write papers for other researchers, the primary consumers of data and information these days are not humans but computers. Computers find data, display and analyze them, provided that the data are structured in a way that allows these functions to happen. The form of the scientific paper has been honed over many hundreds of years for humans to use. But scientific papers are difficult for a computer to understand. The beauty and frustration of human language is that the same word or phrase can mean many things and the same thing can be described many different ways. While poetry is enriched by this mystery, scientific literature is hamstrung by it. Sometimes even expert scientific curators have a difficult time extracting accurate information from an article when the information is incomplete, ambiguous, or (in more cases than scientists perhaps would care to admit) missing altogether. It is no surprise that computers have a problem.

We are on the cusp of an evolution in scientific publishing, and that evolution may involve new ways of reporting information, e.g., the structured digital abstract, ontology. However, during this evolution, while tools and strategies are being developed and tested, those of us who are charged with providing access to information in the scientific literature (database curators) have identified three simple practices that could make extracting relevant information from the literature more efficient and thorough. We recommend that scientific journals require that authors do the following in order to meet the publishing needs of the 21st century:

1. Provide gene accession numbers for all genes referenced in the methods section of a paper, per http://www.ncbi.nlm.nih.gov/gene
2. Identify the species for the subject of a study from the NCBI taxonomy and the strains from the model organism databases for mice, rats, worms, zebra fish, and Drosophila, employing any existing unique identifiers and correct species-specific nomenclature:
   a. Mice: http://www.informatics.jax.org/
   b. Zebrafish: http://zebrafish.org/zirc/home/guide.php
   c. Rats: http://rgd.mcw.edu/
   d. Flies: http://flybase.org
   e. Worms: http://wormbase.org
3. Provide catalog numbers and vendor information for all reagents and animals described in the methods section of a paper.

These requirements are minimal, really; authors will not find them onerous, and publishers will not find them difficult to implement.

Our impetus was a recent invitation-only meeting at NIH on the initiative to Link Animal Models to Human DIsease (LAMHDI), a database of animal models that allows cross-species searching. We explored best practices that had been covered by others (e.g., http://zfin.org/zf_info/author_guidelines.html; http://wiki.geneontology.org/index.php/Letter_to_Editors).

Our goal is to have a permanent unique identifier, a “social security number,” if you will, for specific concepts. Even if a concept has many different names, all those names need to map to the same identifier. Similarly, different concepts cannot share the same identifier. Thus, this identifier serves to disambiguate shared names, and link the same concepts across different papers and databases.
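The two rules above (many names may map to one identifier, but one identifier never stands for two concepts) can be sketched as a small registry. This is a minimal illustration in Python, not part of any existing database's software; the class name, the method names, and identifiers such as `ID:0001` are all hypothetical.

```python
from collections import defaultdict


class IdentifierRegistry:
    """Toy registry: each identifier names exactly one concept (hypothetical API)."""

    def __init__(self):
        self._concept_by_id = {}               # identifier -> canonical concept label
        self._ids_by_name = defaultdict(set)   # lower-cased name -> identifiers using it

    def register(self, identifier, concept, names):
        # Rule: different concepts cannot share the same identifier.
        if identifier in self._concept_by_id:
            raise ValueError(f"{identifier} is already assigned and cannot be reused")
        self._concept_by_id[identifier] = concept
        # Rule: every name a concept goes by maps to that concept's identifier.
        for name in (concept, *names):
            self._ids_by_name[name.lower()].add(identifier)

    def resolve(self, name):
        """Return every identifier a name could refer to (more than one = ambiguous)."""
        return sorted(self._ids_by_name.get(name.lower(), set()))


registry = IdentifierRegistry()
# Hypothetical identifiers, for illustration only.
registry.register("ID:0001", "glutamate decarboxylase 1", ["GAD", "GAD67"])
registry.register("ID:0002", "glutamate decarboxylase 2", ["GAD", "GAD65"])

print(registry.resolve("GAD67"))  # one identifier: unambiguous
print(registry.resolve("gad"))    # two identifiers: the shared name is ambiguous
```

A name that resolves to more than one identifier is exactly the case in which a paper needs to state the identifier itself rather than the name alone.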

Below we give details on our three recommendations.

Recommendation 1: Gene Accession Numbers

NCBI’s Entrez Gene is one of the resources that uses permanent unique identifiers for gene symbols and names. Data in Entrez Gene result from a mixture of curation and automated analyses. Annotation of sequences is integrated with information from collaborating model-organism databases, literature review, and public users (with curation by RefSeq staff as required). Once a gene is assigned a unique accession number, that number can never be reused to identify another gene, even if the gene assignment is later determined to be in error. Using this accession number in papers alongside the gene name and abbreviation lets both computers and humans identify the gene unambiguously. Because the accession number is used universally, it also links that gene to the multiplicity of databases and knowledge sources that contain the same identifier. Even if a gene has many different names or if many different genes or other types of entities are called by the same name, curators and automated agents can easily identify them by their unique accession number.

Recommendation 2: Organism Identification

NCBI provides a taxonomy ID for each major species. Within a species, many unique strains and stocks may be developed, each carrying many sequence variants, mutations, and genetically modified genes. Each of the major model organism databases (MGI, ZIRC, WormBase, FlyBase, RGD) provides strain IDs for these genetically unique strains and stocks. If authors use one of these IDs in their papers, a search agent (or a human) can confidently relate the research results to the right organism. These registries also offer standardized, detailed characterizations of organisms’ genotypic information, so authors do not have to provide it. Finally, the registries help researchers obtain identifiers for new organisms, ensuring accessibility and comparability with all other records.

Recommendation 3: Reagent Identification

Experimental results in bioscience are fundamentally reliant on reagents. For gene expression studies, antibodies are a critical tool for providing detailed spatial information on the localization of gene products. Antibodies are generally made to specific sequences that may be further modified through phosphorylation or some other event. As any experienced anatomist or cell biologist knows, different antibodies to the same protein can give very different results, even within the same laboratory. Identifying the reagents used is thus critical, not only for data mining, but also for experimental troubleshooting. Yet many papers bury or leave out this information. Curators may have to track reagents through several papers when they find sections like “we used the reagents described in study [X].” Worse, sometimes the information provided may not be sufficient to identify a specific reagent. For example, a paper may say that the researchers used a mouse monoclonal antibody against GAD from Sigma, but Sigma has multiple mouse monoclonal antibodies against GAD.

Ideally, each reagent should have a unique identifier; organizations such as NIF are working on tools to provide that (http://antibodyregistry.org). In the meantime, authors should identify reagents by both the vendor and the catalog number. We recognize that catalog numbers are not foolproof: different vendors may sell the same reagent under different catalog numbers and catalog numbers may be reused. (Indeed, some manufacturers of gene chips use identifiers over and over again for different probe sets.) However, the vendor/catalog number identifier will go a long way in providing more accurate information. Perhaps as important, providing that information is not a big burden on authors, and may be easier than the current practice at some journals of requiring the location of a vendor. Some journals, e.g., the Journal of Comparative Neurology, are already requiring a more complete accounting of antibodies used.
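The Sigma example can be made concrete with a toy lookup. The Python sketch below uses an invented catalog (the vendor name, catalog numbers, and clone names are hypothetical) to show why a free-text description can match several reagents while a vendor and catalog number pair picks out one. It does not model the caveat that catalog numbers may be reused.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reagent:
    vendor: str
    catalog_number: str
    description: str


# Invented catalog entries for illustration; not real products.
CATALOG = [
    Reagent("ExampleVendor", "AB-001", "mouse monoclonal anti-GAD, clone A"),
    Reagent("ExampleVendor", "AB-002", "mouse monoclonal anti-GAD, clone B"),
    Reagent("ExampleVendor", "AB-003", "rabbit polyclonal anti-GAD"),
]


def find_by_description(vendor, *keywords):
    """Match the way a curator must when a paper gives only vendor and description."""
    return [
        r for r in CATALOG
        if r.vendor == vendor
        and all(k.lower() in r.description.lower() for k in keywords)
    ]


def find_by_catalog_number(vendor, catalog_number):
    """Match on the vendor/catalog number pair."""
    return [
        r for r in CATALOG
        if (r.vendor, r.catalog_number) == (vendor, catalog_number)
    ]


vague = find_by_description("ExampleVendor", "mouse", "monoclonal", "GAD")
exact = find_by_catalog_number("ExampleVendor", "AB-002")
print(len(vague))  # more than one candidate
print(len(exact))  # a single reagent
```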

Technology is changing rapidly, and automated search and analysis agents are getting better at extracting meaning from unstructured information. But they are far from perfect. The simple steps outlined here to better identify the genes under investigation and the reagents used will accelerate the development and effectiveness of the algorithms. We can then turn our curators to more important work, extracting meaning and mining knowledge for new insights. Our curators can undertake the subtle distillation of meaning, rather than spend time emailing authors for basic information. The three key recommendations described here are neither onerous nor controversial because they utilize existing technologies and informatics resources. Thus far, isolated efforts by individual communities have not been successful. We believe that it is time for those charged with providing access to the literature, NCBI and the journals, to take a firm stand on adapting scientific publishing for the 21st century.

We urge you to adopt our recommendations.

Appendix

Examples*

Organism: “We compared the horizontal optokinetic reaction (OKR) and response properties of retinal slip neurons in the nucleus of the optic tract and dorsal terminal nucleus (NOT-DTN) of albino and wild-type ferrets (Mustela putorius furo; NCBI Taxonomy ID: 9669).” (Hoffman et al. (2004) J. Neurosci. 24 (16), 4061–4069.)

Strain: “Wild-type zebrafish strains AB (ZFIN ID: ZDB-GENO-960809-7), ... and Tubingen (ZFIN ID: ZDB-GENO-010924-10) were kept and bred as described.” (Yang et al. (2007) Genome Biol. 8 (10), R227.)

Antibodies: “immunolabeling of the GABAAR _1, _2, _3, and _5 subunit, in each respective mice ... The monoclonal mouse antibody bd-17 (US Biological, bovine, cat # G1016; 1:400) directed against both _2 and _3 subunits of GABAARs, recognize the major GABAAR ...” (Sadlauod et al. (2010) J. Neurosci. 30 (9), 3358–3369.)

*Note: These sentences were extracted from the referenced articles. However, the identifiers for organism and strain were inserted by the authors of this letter for demonstration purposes; they were not supplied by the original author. The antibody information was, however, included in the paper and is a good example of the recommended best practice.
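To illustrate why explicit identifiers help automated agents, the following Python sketch pulls the identifiers out of the organism and strain examples above using regular expressions. The patterns are assumptions fitted to these two sentences (the `NCBI Taxonomy ID:` and `ZFIN ID:` labels and the `ZDB-GENO-` prefix as they appear here); they are not an official grammar for either resource.

```python
import re

# Patterns fitted to the example sentences above; assumptions, not official formats.
TAXON_PATTERN = re.compile(r"NCBI Taxonomy ID:\s*(\d+)")
ZFIN_PATTERN = re.compile(r"ZFIN ID:\s*(ZDB-GENO-\d+-\d+)")

organism_sentence = (
    "... of albino and wild-type ferrets "
    "(Mustela putorius furo; NCBI Taxonomy ID: 9669)."
)
strain_sentence = (
    "Wild-type zebrafish strains AB (ZFIN ID: ZDB-GENO-960809-7), ... "
    "and Tubingen (ZFIN ID: ZDB-GENO-010924-10) were kept and bred as described."
)


def extract_identifiers(text):
    """Return the taxonomy and ZFIN identifiers found in a piece of text."""
    return {
        "taxonomy_ids": TAXON_PATTERN.findall(text),
        "zfin_ids": ZFIN_PATTERN.findall(text),
    }


print(extract_identifiers(organism_sentence))
print(extract_identifiers(strain_sentence))
```

Without the labeled identifiers, the same agent would have to guess the organism from the words "ferrets" or "AB", which is the ambiguity the recommendations aim to remove.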


FIGURE 2.3 Slide presented at the initial LAMHDI project meeting, illustrating how LAMHDI was intended to function.


hierarchy to filter results and display them in context. This tool could be ideal for pathway searches, the next big challenge for LAMHDI. In addition, LAMHDI will follow NIF’s lead in creating ways to move between databases using the semantics or ontologies developed by others, as it now uses OMIM and Homologene. LAMHDI will continue to link to repositories elsewhere rather than bringing those repositories into LAMHDI. Yet LAMHDI has far to go to fully meet the NIH’s expectations.

THE IDEAL SOLUTION

Ideally, scientists want to use disease models to improve the validity of their research and to understand better how disease processes work. Even human models are not perfect: humans vary too much in their genetic structure, the effect disease has on them, and their reactions to drugs to allow us to be sure that data collected from one subject are representative of all human subjects. For some processes, scientists can apply what they learn about animals to what they know of humans. The aim is to understand the system well enough to have the tools to intelligently design drugs and vaccines.

As the NIH wrote in the statement of work:

The proper modeling of human disease requires an understanding of how conditions in nonhuman species relate to human conditions. In very few cases are the conditions produced in animals equivalent to the human condition; it is more common for animal models to present one or more features that have relevance to a particular aspect of the human disease. In some cases, this relevance to a human condition is relatively straightforward; in others, it is quite complex. And it is not surprising, but is currently relatively uncommon, to find that model data generated during studies of one disease are also relevant to understanding another seemingly unrelated disease. A system that enables the efficient capture of these cross-over bits of information would be valuable to any therapeutic development process.

NIH NCRR Statement of Work (2008)


Pathways are one way of looking at the complex data that come from working with disease models. Scientists have linked phenotypes of model organisms together using similarity algorithms to fill in when genetic information is not available, but the final linking has to be through ontology, the science of describing concepts. The LAMHDI team cannot get away from ontology, because searchers use synonyms and databases need options for finding data. Moreover, LAMHDI needs to standardize, or at least link, the various ontological structures related to human disease into a semantic web that will help build common vocabularies and make data resources interoperable. LAMHDI currently brings together model organism genotypes and genes, but not the data about them. Scientists cannot search data about infectious disease, for instance, without also exploring immunology, innate immunity, and inflammation, and the information universe explodes when they get to metabolic pathways. Without a set of standard concepts, this landscape cannot be navigated.

The LAMHDI team needs new services to allow LAMHDI to link to NIF, for instance, to access key genomic and phenotypic data. LAMHDI also needs to incorporate spatial data, to allow researchers to visualize physical structures and pathways, and access the terminology to get to more data about animal models. For instance, cancer and diabetes tie together through the PI3K/AKT/mTOR pathway. While pathway conservation is partial in yeast, it elaborates as it moves up the evolutionary scale. By seeing a crosswalk on simple systems, where part of a pathway might shed light on, say, the functioning of an antagonist, researchers can more easily figure out how that pathway might work in more complex models. The result can be insights for scientists looking for linkages in humans, and links to the human condition. The flood of human genome sequence data that will become available over the next few years can be mapped onto these networks through systems like LAMHDI, which will incorporate the capacity to move around the evolutionary hierarchy and its pathways. A spatial view, for instance, might start with a human disease, jump to human genes, then to their homologues in other species, then to a spatial viewer showing those genes expressed. Or it could start with a gene expression map in a spatial viewer; if there is substantial expression in a particular zone in a mouse, LAMHDI could allow a researcher to jump to the equivalent human gene (or other ortholog) and see its expression, and then view the relevant disease models. This kind of functionality can help science reduce, reuse, and replace animal models by building on existing research data.
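The navigation described above (from a disease to human genes, to homologs in other species, to models) amounts to following links across separately maintained tables. A minimal Python sketch follows; the table contents, identifiers, and function name are invented for illustration and do not reflect LAMHDI's actual schema or data.

```python
# Invented link tables standing in for separately maintained resources.
DISEASE_TO_HUMAN_GENES = {"disease:D1": ["human:G1", "human:G2"]}
HOMOLOGS = {  # human gene -> homologous genes in other species
    "human:G1": ["mouse:g1", "fly:g1"],
    "human:G2": ["mouse:g2"],
}
MODELS_BY_GENE = {  # gene -> models carrying an alteration in that gene
    "mouse:g1": ["mouse-model-A"],
    "fly:g1": ["fly-model-B"],
    "mouse:g2": [],
}


def models_for_disease(disease_id):
    """Follow disease -> human genes -> homologs -> models, grouped by species."""
    by_species = {}
    for human_gene in DISEASE_TO_HUMAN_GENES.get(disease_id, []):
        for homolog in HOMOLOGS.get(human_gene, []):
            species = homolog.split(":", 1)[0]
            for model in MODELS_BY_GENE.get(homolog, []):
                by_species.setdefault(species, []).append(model)
    return by_species


print(models_for_disease("disease:D1"))
```

Each hop depends on both tables using the same identifier for the same gene, which is why shared identifiers and vocabularies matter for this kind of cross-species search.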

The LAMHDI team plans to extend LAMHDI’s coverage in drug discovery in the realm of infectious disease and in spatial and temporal searching, starting with neurodegenerative diseases involving drug interactions and pathways. It also plans to work with the scientific community to better identify data through metadata standards. Finally, it will reach more broadly and deeply to engage audiences, from the general public to school children to regulatory organizations worldwide, to better explain the use of models for human disease, and to engage scientists to use resources like LAMHDI to expand their knowledge and spark their imaginations.

The full import of LAMHDI and similar resources is that they support scientists in the most important work they do, to understand the connections among facts, the relationships of different data elements. This is little understood in the abstract (although scientists get it when they see it). Francis S. Collins, the Director of the National Institutes of Health, wrote in a 2011 issue of Science Translational Medicine, “The use of animal models for therapeutic development and target validation is time consuming, costly, and may not accurately predict efficacy in humans.”2 But modeling is not just for direct testing; it is also used for gaining insight into disease processes and therefore shedding light on treatment and cures. As James DeLeo, a researcher at the NIH, wrote,3 “modeling is meaningful even when the models may be imperfect.”†

Scientists are seeking insight, not perfection. They are looking for data points on which to hang a theory. Linked data about disease models can help design experiments as well. Looking at pathways in nematodes or flies can tell scientists about pathways in mice, nonhuman primates, and humans. The bigger the information store researchers create, the more linkages they will find.

If someone wants just lists of models, PubMed might suffice. But to learn the value of a model, LAMHDI is showing the way. LAMHDI and similar systems allow researchers to make value decisions about disease models. In some ways, LAMHDI is positioning itself to become the model organism database for humans.

†In the same article he characterized modelers as “renegade individualists who are fuzzy members of the fuzzy subsets of different modeling disciplines such as computer science, statistics, bioinformatics, analytics and others.”

Acknowledgments

I would like to acknowledge the NIH, as the originator of the program that led to LAMHDI. I’d also like to acknowledge the contributions of the participants at the NIH Expert Panel Meeting in Seattle, 19–20 August 2008, who were looking for ways to make best use of the massive amounts of biomedical research data now being produced, and whose input led to the creation of the LAMHDI project: Dave Anderson, University of Washington, School of Medicine, National Primate Research Center; Kevin Dawson, University of California, Davis Center of Excellence in Nutritional Genomics; John F. Elder, Elder Research, Inc., Knowledge Discovery and Data Mining; Mark Ellisman, Neurosciences and Bioengineering, University of California, San Diego, School of Medicine; Janan T. Eppig, The Jackson Lab/Mouse Genome Informatics Database; Trey Ideker, Bioengineering, University of California, San Diego; Michael Katze, Microbiology, University of Washington, National Primate Research Center; Joseph Kemnitz, University of Wisconsin-Madison Medical School, National Primate Research Center; Bret Peterson, Google, Bioengineering, Neuroscience, Computer science; John Quakenbush, Dana-Farber Cancer Institute/Harvard School of Public Health; Joel Stiles, Computational Physiology, Pittsburgh Supercomputing Center, Carnegie Mellon University; Eric Von Schweber, PharmaSurveyor, Neological Corp., Synsyta LLC, Infomaniacs; Linda Von Schweber, PharmaSurveyor, Neological Corp., Synsyta LLC, Infomaniacs; Monte Westerfield, Institute of Neuroscience, University of Oregon; and Stuart Zola, Emory School of Medicine, National Primate Research Center.

References

1. Wood MW, Hart LA. Selecting appropriate animal models and strains: Making the best use of research, information and outreach. In: Proceedings of the 6th World Congress on Alternatives and Animal Use in the Life Sciences, August 21–25, 2007; 2008. Tokyo, Japan. AATEX 14, Special Issue, 303–306.

2. Collins FS. Reengineering Translational Science: The time is right. Sci Transl Med 2011;3(90): 90cm17.

3. DeLeo J. Guest Editorial. Identifying and Overcoming Skepticism about biomedical Computing: Modelers should take the lead. Biomed Comp Rev Summer 2012;2012:1–2.