Text mining and expert curation to develop a database on psychiatric diseases and their genes

Alba Gutiérrez-Sacristán, GRIB, IMIM - UPF
Àlex Bravo, GRIB, IMIM - UPF
Marta Portero, GReNeC, IMIM - UPF
Olga Valverde, GReNeC, IMIM - UPF
Antonio Armario, Universitat Autònoma de Barcelona
M. Carmen Blanco-Gandía, Universitat de Valencia
Adriana Farré, Parc de Salut Mar, UAB
Lierni Fernández-Ibarrondo, Program in Cancer Research, IMIM
Francina Fonseca, Parc de Salut Mar, UAB
Jesús Giraldo, Universitat Autònoma de Barcelona
Angela Leis, GRIB, IMIM - UPF
Anna Mané, Parc de Salut Mar, UAB
Miguel A. Mayer, GRIB, IMIM - UPF
Sandra Montagud-Romero, Universitat de Valencia
Roser Nadal, Institut de Neurociències, UAB
Jordi Ortiz, School of Medicine, UAB
Francisco Javier Pavón, Instituto de Investigación Biomédica de Málaga
Ezequiel Perez, Parc de Salut Mar, UAB
Marta Rodríguez-Arias, Universitat de Valencia
Antonia Serrano, Instituto de Investigación Biomédica de Málaga
Marta Torrens, Parc de Salut Mar, UAB
Vincent Warnault, GReNeC, IMIM - UPF
Ferran Sanz, GRIB, IMIM - UPF
Laura I. Furlong, GRIB, IMIM - UPF

Abstract

In recent years, research on the genetics of psychiatric diseases has grown steadily. However, understanding of the cellular and molecular mechanisms leading to these diseases remains limited, which has hampered the translation of this wealth of knowledge into clinical practice to improve the diagnosis and treatment of psychiatric patients. PsyGeNET (http://www.psygenet.org/) has been developed to improve the understanding of psychiatric diseases by facilitating access to the vast amount of genetic information about them in a structured manner and by providing a set of analysis and visualization tools. In this communication we describe the protocol we put in place for the sustainable update of this knowledge resource. It includes the recruitment of a team of experts to curate the data previously extracted by text mining. Annotation guidelines and a web-based annotation tool were developed to support the curators' tasks. A curation workflow was designed that includes a pilot phase and two rounds of curation and analysis phases. We report the results of applying this workflow to the curation of gene-disease associations for PsyGeNET, including the analysis of the inter-annotator agreement, and suggest that this model is a suitable approach for the sustainable development and update of knowledge resources.

1 Introduction

Psychiatric disorders have a great impact on morbidity and mortality (Murray and Lopez, 2013; Whiteford et al., 2013). According to the World Health Organization (WHO), one in every four people will suffer from mental or neurological disorders (Kessler et al., 2005; Baldacchino et al., 2009). It has been suggested that most psychiatric disorders display a strong genetic component (Sullivan et al., 2012). In recent years, research on the genetics of psychiatric disorders has grown, and its findings have been reported in hundreds of thousands of publications. This literature constitutes a rich and diverse source of information essential for any line of psychiatric research. However, the huge number of publications and its continuous growth prevent scientists from efficiently exploring such a large volume of data.

PsyGeNET (Psychiatric disorders Gene association NETwork) (Gutiérrez-Sacristán et al., 2015) has been developed to establish a curated resource on psychiatric diseases and their associated genes. PsyGeNET integrates knowledge that has been extracted from the scientific literature by text mining and then curated by experts in psychiatry and neuroscience.

In this communication we describe the process put in place for the update of the PsyGeNET database. This involved i) the recruitment of a team of experts to curate the information extracted by text mining; ii) the extraction of information on gene-disease associations (GDAs) from the literature using the text mining system BeFree (Bravo et al., 2015); iii) the development of a curation workflow (Figure 1); iv) the development of a web-based annotation tool to facilitate the curation task; and v) the definition of detailed guidelines to assist the curation task.

In particular, we present the results of the Pilot phase and Curation I phase of the workflow, including the analysis of the inter-annotator agreement, and suggest that this protocol is a suitable approach for the sustainable development and update of knowledge resources.

Figure 1: PsyGeNET curation workflow

2 Methods

2.1 Curation team

A team of 22 experts from different domains (such as psychiatry, neuroscience, medicine, psychology and biology) was recruited from the Spanish Network of Addiction and from other collaborators of the coordination team (Research Group on Integrative Biomedical Informatics, GRIB) to participate in the curation process. The incentives for participation were to be part of the PsyGeNET team and to be co-authors of the publication(s) originating from the project. The curators were trained during an initial session, in which the PsyGeNET annotation guidelines were presented, and then during the Pilot phase. Communication with the coordination team was established by e-mail to resolve questions during the curation process. In addition, online and face-to-face meetings were organized after key points of the curation process (analysis phases) to share experiences among all curators and solve curation issues.

2.2 Defining the Psychiatric Diseases in terms of UMLS concepts

In PsyGeNET, the psychiatric diseases are identified by UMLS Metathesaurus concepts. Three experts reviewed the terminology included in more than 2,000 UMLS concepts related to the psychiatric disorders of interest, and assigned them to the following psychiatric disease categories (DCs): 1) Depressive disorders, 2) Bipolar disorders and related disorders, 3) Substance/drug induced depressive disorder, 4) Schizophrenia spectrum and other psychotic disorders, 5) Drug-induced psychosis, 6) Alcohol use disorders, 7) Cannabis use disorders and 8) Cocaine use disorders. This information was used both for the text mining of gene-disease associations by BeFree (see below) and for the identification of disease classes during the curation.

2.3 Text mining of gene-disease associations

BeFree, a text-mining tool that exploits morphosyntactic information from the text to identify relationships between biomedical concepts, was used to identify associations between genes and the psychiatric diseases of interest in a corpus of approximately 1 million MEDLINE abstracts focused on human genetic diseases. The diseases were identified using the UMLS concepts that define each disorder, whereas an in-house gene dictionary was used to identify the genes, as described in (Bravo et al., 2015). The identified disorders were grouped according to the eight psychiatric disease categories (described in section 2.2). As a result, BeFree identified 6,349 associations between genes and DCs (gene-disease category associations, or GDCAs) supported by 4,065 publications.

A subset of the associations was initially evaluated by our group to identify the most frequent text mining errors. For instance, the word depression is often used in contexts other than psychiatry. This initial evaluation was performed to identify this kind of error and improve the text mining system before the identification of GDCAs.

We then applied a number of filters to reduce the size of the curation task and make it feasible with the resources at hand. Specifically, we removed associations already present in curated resources (DisGeNET (Piñero et al., 2015) and the previous release of PsyGeNET), kept only those associations published recently (after the year 2000) in journals with an Impact Factor greater than 1, and did not take reviews into account. After this process we obtained 2,507 GDCAs, which were submitted to expert curation.
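The filtering step described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the pipeline actually used: the `Candidate` record, its field names, the `filter_candidates` function and the toy data are all hypothetical, and the sketch assumes that the filters are applied to each supporting publication of a candidate association.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record: one text-mined GDCA and one supporting publication."""
    gene: str
    disease_category: str
    pmid: str
    year: int
    journal_impact_factor: float
    is_review: bool


def filter_candidates(candidates, already_curated, min_year=2000, min_impact_factor=1.0):
    """Keep candidates that pass the filters described in section 2.3.

    already_curated is a set of (gene, disease_category) pairs that are
    already present in curated resources and therefore skipped.
    """
    kept = []
    for c in candidates:
        if (c.gene, c.disease_category) in already_curated:
            continue  # already present in a curated resource
        if c.year <= min_year:
            continue  # keep only publications after the year 2000
        if c.journal_impact_factor <= min_impact_factor:
            continue  # keep only journals with Impact Factor greater than 1
        if c.is_review:
            continue  # reviews are not taken into account
        kept.append(c)
    return kept


# Toy input: gene symbols, PMIDs and journal metrics are placeholders.
candidates = [
    Candidate("GENE_A", "Schizophrenia", "pmid-1", 2012, 3.2, False),
    Candidate("GENE_B", "Depression", "pmid-2", 1998, 4.0, False),
    Candidate("GENE_C", "Depression", "pmid-3", 2010, 0.8, False),
    Candidate("GENE_D", "Alcohol use disorders", "pmid-4", 2014, 2.5, True),
    Candidate("GENE_E", "Bipolar disorders", "pmid-5", 2011, 2.0, False),
]
already_curated = {("GENE_E", "Bipolar disorders")}

kept = filter_candidates(candidates, already_curated)
print([c.gene for c in kept])
```

In this toy input only the first candidate passes all four filters; each of the others is dropped by a different filter.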

2.4 Annotation Guidelines

The PsyGeNET annotation guidelines were developed to guide the manual curation process. They include the definition of a gene-disease association, how it should be classified according to the level of evidence, and what information should be considered for the annotation, and they provide real examples of the association types. Finally, they also include a tutorial on how to use the PsyGeNET annotation tool.

The goal of the curation was to validate the association of a gene with a particular disease. We consider a gene to be associated with a disease if the gene or its product plays a role in the disease pathogenesis, or is a marker for the disease. The PsyGeNET annotation tool was used to help in this curation task. For each gene-disease association identified by text mining, the annotation tool displayed the evidence that supports the association, specifically the abstracts and the sentences in which the gene-disease association is stated. Then, by inspecting the evidence, the curator had to determine the type of association (Association, No Association, False, Error or Not Clear). The types of association are described as follows:

i) Association: the publication clearly states that there is an association between the gene and the disease. It can be a causative association (e.g. a mutation in the gene causes the disease) or a marker association (e.g. a SNP in the gene identified in a GWAS).
ii) No Association: the publication clearly states that there is no association between the gene and the disease (e.g. a publication that reports a negative finding on the association between the gene and the disease).
iii) False Association: the gene and the disease co-occur in a sentence, but there is no clear evidence from the publication that the gene plays a role in the disease or is a marker of it.
iv) Error: there is a text mining error in the identification of the gene and/or the disease.

Table 1 shows some examples of the association types considered in PsyGeNET. In the example for False Association, the study is on children who do not meet the criteria for the disease (FASD); therefore the association between the gene and the disease has to be classified as false. In the example of Error, OCT is not a gene but the acronym of optical coherence tomography. The document describing the guidelines is available on the PsyGeNET web page (http://www.psygenet.org/Psytool_manual_v5.0.pdf). Here we provide the general instructions for the curation of the gene-disease associations in PsyGeNET:

1. The curation has to be performed at abstract level. For those cases in which the abstract is not clear enough, the full text article should be reviewed.

2. Annotate only relationships between the gene and the disease. Other types of relationships should not be annotated.

3. Annotate relationships according to the provided categories: association, no association, error, and false.

| Association type | PMID | Sentence |
|---|---|---|
| Association | 267012 | The D-amino acid oxidase activator gene (G72) has been found associated with several psychiatric disorders such as schizophrenia, major depression, and bipolar disorder. |
| No Association | 17692928 | There was no association between TPH-2 gene variants and MD in the same population that had shown a strong association with TPH-1. |
| False | 25225167 | Two children referred for suspicion of FASD (neither of which were exposed to alcohol or met the criteria for FASD) had a pathogenic microstructural chromosomal rearrangement (del16p11.2 of 542 KB and dup1q44 of 915 KB). |
| Error | 21174530 | OCT demonstrated loss of foveal depression with distortion of the foveal architecture in the macula in all patients |

Table 1: Examples of Association types. Disease and genes that have to be evaluated are highlighted in the sentence in green and orange, respectively.

2.5 Annotation tool

A user-friendly web-based tool was developed to assist both the definition of the psychiatric disorders of interest and the curation of gene-disease associations. The tool was designed to support a multi-user environment through the assignment of user names and passwords. Figure 2 shows a screenshot of the tool for the curation of GDCAs. The tool shows the GDCA to be evaluated (in this example, the association between the ETNPPL gene and the Bipolar disorders class), one publication at a time. The curator has to review the publication, decide whether the association between the gene and the disease class holds, and select the association type using the drop-down menu. To aid the curator's task, the tool displays the terminology for the gene according to standard resources (NCBI Gene, UniProt and HGNC), and highlights the sentences in which BeFree identified an association between the gene and the disease under consideration. If required, the curator can access the full text article using the PubMed hyperlink. The curator is also asked to select a sentence that best represents their validation decision, if available. This was implemented in order to collect example sentences to improve the performance of the BeFree system. In addition, the tool provides a progress bar indicating the number of validations and associations completed by the expert, and allows previous annotations to be reviewed. We use the term validation for each publication supporting a particular GDCA. Note that each publication can support more than one GDCA.

Figure 2: Annotation web-based tool

2.6 Curation workflow

We put in place a curation workflow that includes a pilot phase and two curation and analysis phases (see Figure 1). During the pilot phase, the initial training of the curators was carried out, including how to use the curation tool. A set of 100 abstracts was validated and analyzed during the pilot phase. After this process, both the curation tool and the annotation guidelines were improved and the first curation phase (Curation Phase I) was launched, to evaluate the 2,507 GDCAs identified by text mining and supported by 4,065 publications. The results of the curation were analyzed to estimate the inter-annotator agreement at the level of the abstract. The validations for which no agreement was found in Curation Phase I are then reviewed by a third expert during Curation Phase II (results not reported here). Four experts are participating in this phase. Only the validations for which at least 2 experts agree will be included in the database.
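The inclusion rule of the workflow (two experts in Curation Phase I, a third expert in Curation Phase II when they disagree, and inclusion only when at least 2 experts agree) can be expressed as a small Python sketch. The function names `consensus` and `resolve` are hypothetical and the code illustrates the rule as described; it is not the implementation used by the annotation tool.

```python
from collections import Counter

# Association types available to the curators (section 2.4).
LABELS = {"Association", "No Association", "False", "Error", "Not Clear"}


def consensus(votes, min_agree=2):
    """Return the label chosen by at least min_agree experts, or None.

    With at most three votes, as in this workflow, no more than one
    label can reach two votes, so the result is unambiguous.
    """
    for vote in votes:
        if vote not in LABELS:
            raise ValueError(f"unknown association type: {vote!r}")
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= min_agree else None


def resolve(first, second, third=None):
    """Apply the rule to one validation.

    first and second are the Curation Phase I annotations; third is the
    optional Curation Phase II annotation. A result of None means the
    validation is not included (or, after Phase I, that it still needs
    a third expert).
    """
    votes = [first, second]
    if third is not None:
        votes.append(third)
    return consensus(votes)


print(resolve("Association", "Association"))
print(resolve("Association", "False"))
print(resolve("Association", "False", "False"))
```

The second call returns `None` because the two Phase I annotations differ; once a third annotation is supplied, the label shared by two experts is returned.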

3 Results and discussion

First, three experts reviewed the terminology of 2,523 UMLS concepts related to the psychiatric disorders of interest. As a result, 1,942 UMLS concepts were assigned to one of the 8 disease categories; alcohol use disorders, depression and schizophrenia were each defined by more than 300 concepts (321, 368 and 488, respectively). The remaining 581 UMLS concepts were excluded at this stage. Then, BeFree was used to identify gene-disease associations from the literature based on the above disease definitions, and a subset of the associations focused on the disorders of interest was selected (see Methods, section 2.3). The resulting 2,507 gene-DC associations (GDCAs) were submitted to expert curation. The associated genes were unevenly distributed across the disease categories: schizophrenia was the category with the most associations, followed by depression and alcohol use disorders (see Figure 3).

Figure 3: Psychiatric disease categories and the number of associated genes.

Of note, most of the GDCAs (70.6%) were supported by only one publication. We included up to the 5 most recent publications for each GDCA in the validation process. This led to 242-284 GDCAs to be validated by each curator, depending on the disease category. Since most GDCAs are supported by only one publication, the number of publications to be reviewed by each curator ranged between 322 and 491.

Before starting the curation of the 2,507 GDCAs, a Pilot curation phase was designed with the purpose of training the curators, testing the PsyGeNET annotation tool, and reviewing the PsyGeNET annotation guidelines. One hundred publications were reviewed during the Pilot phase, distributed as 10 publications per pair of experts. The average agreement between the expert pairs in the Pilot phase was 60%. The main sources of discrepancies were the handling of speculations, the proper identification of text mining errors (in particular for genes), and the distinction between the False and Error association types. The annotation tool was modified to show the terminology of the genes, in order to help the curators find potential errors in the identification of genes, and its Review function was improved.

Then the curation proper (Curation Phase I in the workflow in Figure 1) was launched, and it was completed in 33 days. During Curation Phase I, 2,507 GDCAs supported by 4,065 publications were reviewed by the curators. Each expert was assigned a set of approximately 275 GDCAs (corresponding to 450 publications) according to their field of expertise (e.g. major depression vs schizophrenia). Some curators evaluated associations from all the disease categories, while others focused on a single category. The results of Curation Phase I were analyzed to identify agreements and disagreements between the experts. Table 2 shows the number of abstracts validated by each curator team (composed of two experts) and the agreement achieved. The average agreement between all the experts was 68.95%, higher than that obtained in the Pilot phase. For one curator team the agreement was higher (89%) than for the rest of the teams. We attribute this higher agreement to the fact that the two experts communicated to discuss the curation criteria during Curation Phase I.

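| Team | Validations | Agreements | Disagreements | % Agreement |
|---|---|---|---|---|
| Team 1 | 494 | 325 | 169 | 65.79 |
| Team 2 | 319 | 194 | 125 | 60.82 |
| Team 3 | 489 | 342 | 147 | 69.94 |
| Team 4 | 450 | 402 | 48 | 89.33 |
| Team 5 | 492 | 308 | 184 | 62.60 |
| Team 6 | 508 | 341 | 167 | 67.12 |
| Team 7 | 463 | 317 | 146 | 68.46 |
| Team 8 | 516 | 363 | 153 | 70.35 |
| Team 9 | 334 | 221 | 113 | 66.17 |

Table 2: Agreement for each expert pair.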
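The agreement figures can be recomputed from the counts in Table 2 with a few lines of Python. The sketch below uses only the validation and agreement counts from the table; the variable names are illustrative. It computes the percentage agreement per team, the unweighted mean across teams (the average quoted in the text) and, for comparison only, a pooled figure over all validations that is derived here rather than reported by the authors.

```python
# (validations, agreements) per curator team, taken from Table 2.
TABLE_2 = {
    "Team 1": (494, 325),
    "Team 2": (319, 194),
    "Team 3": (489, 342),
    "Team 4": (450, 402),
    "Team 5": (492, 308),
    "Team 6": (508, 341),
    "Team 7": (463, 317),
    "Team 8": (516, 363),
    "Team 9": (334, 221),
}


def percent_agreement(validations, agreements):
    """Share of validations on which both experts chose the same type."""
    return 100.0 * agreements / validations


per_team = {team: percent_agreement(v, a) for team, (v, a) in TABLE_2.items()}

# Unweighted mean of the per-team percentages.
mean_agreement = sum(per_team.values()) / len(per_team)

# Totals over all teams and the pooled (validation-weighted) agreement.
total_validations = sum(v for v, _ in TABLE_2.values())
total_agreements = sum(a for _, a in TABLE_2.values())
total_disagreements = total_validations - total_agreements
pooled_agreement = percent_agreement(total_validations, total_agreements)

for team, pct in per_team.items():
    print(f"{team}: {pct:.2f}%")
print(f"mean of team percentages: {mean_agreement:.2f}%")
print(f"pooled over {total_validations} validations: {pooled_agreement:.2f}%")
```

The totals obtained this way (4,065 validations, 2,813 agreements and 1,252 disagreements) match the figures given in the text, and the unweighted mean reproduces the reported 68.95%.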

Of the validations in which agreement was found (2,813 validations), 1,880 were classified as Association or No Association and 901 were classified as False or Error. In only 32 of them was the evidence extracted from the publication insufficient to assign any of the previous categories, so they fell into the Not clear category (Figure 4). The set of 1,880 validations will be part of the next release of PsyGeNET. Notably, an important fraction of these associations (24.7%) are classified as No association, meaning that there is evidence reporting negative findings on the association between the gene and the disease. This highlights the importance of recording negative findings from the literature in knowledge resources. In addition, collecting this information is relevant for the development of corpora for training text mining systems able to identify negative findings regarding gene-disease associations in the literature.

Figure 4: Summary of the agreement results. Each bar in the barplot represents the number of validations annotated as: Association, No association, False, Error and Not clear, respectively.

We observe that for 30% of the total GDCAs validated, agreement between curators was not found. A substantial fraction of the disagreements involved the annotation of an association as False by one of the experts (53.28%, see Figure 5). The results of Curation Phase I were discussed with the experts in order to identify the main difficulties during the annotation. The main sources of discrepancies between curators were the following: i) the difficulty of assessing whether studies using animal models capture the disease pathophysiology well, ii) studies focused on pharmacogenomics or response to drug treatments, iii) studies assessing disease phenotypes (e.g. low mood) in otherwise normal populations, and iv) the assessment of the validity of the statistical analysis in some studies (e.g. GWAS). In the first case, the decision on the association type depends on the curator's expertise in animal model research in psychiatry, which was not the same across the team of experts. In the other three cases the experts expressed difficulties in correctly identifying whether an association has to be annotated or not. Overall, although the curation task was tightly focused on the domain of genetics of psychiatric diseases, the wide variety of studies covered by the publications (GWAS, sequencing studies, animal models, etc.) requires an equivalent diversity of expertise among the experts. We think that this complexity of the task is one of the main reasons for the inter-annotator agreement achieved. Ongoing work includes revisiting the annotation guidelines to further clarify the curation issues raised, in order to improve the agreement in the annotations.

In recent years, many efforts have been made to develop and contribute novel corpora in the biomedical domain. Nevertheless, the number of corpora annotated with information on gene-disease associations is particularly low (Neves, 2014). For example, the Craven corpus (Craven et al., 1999) contains annotations of gene-disease associations, but the original publication provides no information on data quality such as inter-annotator agreement. The EU-ADR corpus (Van Mulligen et al., 2012) includes associations between genes and diseases from 100 MEDLINE abstracts, with an inter-annotator agreement of 86%. Wiegers et al. presented the manual curation of chemical-gene-disease networks for the Comparative Toxicogenomics Database (CTD) (Wiegers et al., 2009). For this study 112 articles were distributed between three curators (each one reviewed fewer than 60 articles), achieving an inter-annotator agreement of 77%. The CoMAGC corpus (Lee et al., 2013), focused on genes associated with prostate, breast and ovarian cancer, is based on 821 sentences. The authors report an agreement of 72%. In another study, agreement over 70% was reported in the development of a sentence-based corpus on prostate cancer-gene associations (Chun et al., 2006). In summary, our inter-annotator agreement results are lower than those of other corpus annotation initiatives. As described in the paragraphs above, we think that the agreement obtained is due to the complexity of the annotation task. In addition, the large number of experts (for instance, 22 in our case vs 5 in the case of the EU-ADR corpus) and the large size of our corpus (4,065 publications vs approximately 100 in the EU-ADR and CTD corpora) could also explain the lower agreement obtained compared to other curation initiatives.

Figure 5: Summary of the disagreement results at the abstract level. Each cell in the heatmap represents the number of abstracts in which disagreement was found for each pair of association types chosen by the two experts. The darker the blue, the higher the disagreement. For example, there were 100 abstracts that one expert annotated as Association while the paired expert annotated as No association.
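A disagreement summary of the kind shown in Figure 5 can be derived from the paired annotations with a short Python sketch. The function names and the toy annotation pairs below are hypothetical and only illustrate the counting; they are not the curation data. The sketch assumes that the order of the two experts within a pair does not matter.

```python
from collections import Counter

LABELS = ["Association", "No Association", "False", "Error", "Not Clear"]


def disagreement_counts(pairs):
    """Count abstracts per unordered pair of differing association types."""
    counts = Counter()
    for first, second in pairs:
        if first != second:
            counts[frozenset((first, second))] += 1
    return counts


def as_matrix(counts):
    """Symmetric matrix of disagreement counts, rows and columns in LABELS order."""
    return [
        [0 if row == col else counts[frozenset((row, col))] for col in LABELS]
        for row in LABELS
    ]


def share_involving(counts, label):
    """Fraction of all disagreements in which one expert chose the given type."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(n for key, n in counts.items() if label in key) / total


# Toy annotation pairs (first expert, second expert); placeholders only.
pairs = [
    ("Association", "False"),
    ("False", "Association"),
    ("Association", "Association"),
    ("Error", "False"),
    ("Association", "No Association"),
]

counts = disagreement_counts(pairs)
matrix = as_matrix(counts)
print(matrix)
print(share_involving(counts, "False"))
```

Applied to the real Phase I annotations, `share_involving(counts, "False")` would correspond to the fraction of disagreements in which one expert chose False, which the text reports as 53.28%.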

Curation Phase II is aimed at reviewing the associations for which the two experts did not agree in the first phase of curation. Currently, this involves 1,252 validations, which are being reviewed by a third expert (ongoing work at the time of writing). Finally, the information that will be included in PsyGeNET consists of the associations for which at least two experts agreed on the annotation.

4 Conclusions

In this communication we report the development of a protocol for the sustainable update of a knowledge resource on the genetics of psychiatric diseases, PsyGeNET. We combined state-of-the-art text mining, data filtering and curation by a community of domain experts for the release of a new version of the database. We designed a protocol that includes the curators' training and the iterative improvement of both the tools and the annotation guidelines. The proposed approach allows the database to be updated in a timely manner with expert-validated information. Importantly, our curation protocol included the identification of negative findings from the literature. Note that 24.7% of the GDCAs were classified as No association, indicating the importance of properly annotating this information in a knowledge resource. This information will be taken into account for the ranking of the gene-disease associations in the next release of PsyGeNET. In addition, the corpus of annotated sentences and abstracts developed during the curation constitutes a valuable resource for the development and evaluation of relation extraction systems. In this era of biomedical big data, we present this approach, which involves the expert community in the curation of the information, as a suitable approach for the development and maintenance of knowledge resources.

5 Funding

We received support from ISCIII-FEDER (PI13/00082, CP10/00524), IMI-JU under grant agreements no. 115002 (eTOX), no. 115191 (Open PHACTS), no. 115372 (EMIF) and no. 115735 (iPiE), resources of which are composed of financial contribution from the EU-FP7 (FP7/2007-2013) and EFPIA companies' in-kind contribution, and the EU H2020 Programme 2014-2020 under grant agreements no. 634143 (MedBioinformatics) and no. 676559 (Elixir-Excelerate). The Research Programme on Biomedical Informatics (GRIB) is a node of the Spanish National Institute of Bioinformatics (INB).

References

[Baldacchino et al. 2009] A Baldacchino, N Groussard-Escaffre, C Clancy, C Lack, K Sieroslavrska, C-L Hodges, L-B Merinder, T Greacen, M Sorsa, H Laijarvi, et al. 2009. Epidemiological issues in comorbidity: lessons learnt from a pan-european isadora project. Mental Health and Substance Use: Dual Diagnosis, 2(2):88–100.

[Bravo et al. 2015] Àlex Bravo, Janet Piñero, Núria Queralt-Rosinach, Michael Rautschka, and Laura I Furlong. 2015. Extraction of relations between genes and diseases from text and large-scale data analysis: implications for translational research. BMC bioinformatics, 16(1):1.

[Chun et al. 2006] Hong-Woo Chun, Yoshimasa Tsuruoka, Jin-Dong Kim, Rie Shiba, Naoki Nagata, Teruyoshi Hishiki, and Jun'ichi Tsujii. 2006. Automatic recognition of topic-classified relations between prostate cancer and genes using medline abstracts. BMC bioinformatics, 7(3):1.

[Craven et al. 1999] Mark Craven, Johan Kumlien, et al. 1999. Constructing biological knowledge bases by extracting information from text sources. In ISMB, volume 1999, pages 77–86.

[Gutiérrez-Sacristán et al. 2015] Alba Gutiérrez-Sacristán, Solene Grosdidier, Olga Valverde, Marta Torrens, Àlex Bravo, Janet Piñero, Ferran Sanz, and Laura I Furlong. 2015. Psygenet: a knowledge platform on psychiatric disorders and their genes. Bioinformatics, page btv301.

[Kessler et al. 2005] Ronald C Kessler, Patricia Berglund, Olga Demler, Robert Jin, Kathleen R Merikangas, and Ellen E Walters. 2005. Lifetime prevalence and age-of-onset distributions of dsm-iv disorders in the national comorbidity survey replication. Archives of general psychiatry, 62(6):593–602.

[Lee et al. 2013] Hee-Jin Lee, Sang-Hyung Shim, Mi-Ryoung Song, Hyunju Lee, and Jong C Park. 2013. Comagc: a corpus with multi-faceted annotations of gene-cancer relations. BMC bioinformatics, 14(1):1.

[Murray and Lopez 2013] Christopher JL Murray and Alan D Lopez. 2013. Measuring the global burden of disease. New England Journal of Medicine, 369(5):448–457.

[Neves 2014] Mariana Neves. 2014. An analysis on the entity annotations in biological corpora. F1000Research, 3.

[Piñero et al. 2015] Janet Piñero, Núria Queralt-Rosinach, Àlex Bravo, Jordi Deu-Pons, Anna Bauer-Mehren, Martin Baron, Ferran Sanz, and Laura I Furlong. 2015. Disgenet: a discovery platform for the dynamical exploration of human diseases and their genes. Database, 2015:bav028.

[Sullivan et al. 2012] Patrick F Sullivan, Mark J Daly, and Michael O'Donovan. 2012. Genetic architectures of psychiatric disorders: the emerging picture and its implications. Nature Reviews Genetics, 13(8):537–551.

[Van Mulligen et al. 2012] Erik M Van Mulligen, Annie Fourrier-Reglat, David Gurwitz, Mariam Molokhia, Ainhoa Nieto, Gianluca Trifiro, Jan A Kors, and Laura I Furlong. 2012. The eu-adr corpus: annotated drugs, diseases, targets, and their relationships. Journal of biomedical informatics, 45(5):879–884.

[Whiteford et al. 2013] Harvey A Whiteford, Louisa Degenhardt, Jurgen Rehm, Amanda J Baxter, Alize J Ferrari, Holly E Erskine, Fiona J Charlson, Rosana E Norman, Abraham D Flaxman, Nicole Johns, et al. 2013. Global burden of disease attributable to mental and substance use disorders: findings from the global burden of disease study 2010. The Lancet, 382(9904):1575–1586.

[Wiegers et al. 2009] Thomas C Wiegers, Allan P Davis, K Bretonnel Cohen, Lynette Hirschman, and Carolyn J Mattingly. 2009. Text mining and manual curation of chemical-gene-disease networks for the comparative toxicogenomics database (ctd). BMC bioinformatics, 10(1):326.