8
Nucleic Acids Research, 2020 1 doi: 10.1093/nar/gkaa1087 GENCODE 2021 Adam Frankish 1 , Mark Diekhans 2 , Irwin Jungreis 3,4 , Julien Lagarde 5 , Jane E. Loveland 1 , Jonathan M. Mudge 1 , Cristina Sisu 6,7 , James C. Wright 8 , Joel Armstrong 2 , If Barnes 1 , Andrew Berry 1 , Alexandra Bignell 1 , Carles Boix 3,4,9 , Silvia Carbonell Sala 5 , Fiona Cunningham 1 , Tom´ as Di Domenico 10 , Sarah Donaldson 1 , Ian T. Fiddes 2 , Carlos Garc´ ıa Gir ´ on 1 , Jose Manuel Gonzalez 1 , Tiago Grego 1 , Matthew Hardy 1 , Thibaut Hourlier 1 , Kevin L. Howe 1 , Toby Hunt 1 , Osagie G. Izuogu 1 , Rory Johnson 11,12 , Fergal J. Martin 1 , Laura Mart´ ınez 10 , Shamika Mohanan 1 , Paul Muir 13,14 , Fabio C. P. Navarro 6 , Anne Parker 1 , Baikang Pei 6 , Fernando Pozo 10 , Ferriol Calvet Riera 1 , Magali Ruffier 1 , Bianca M. Schmitt 1 , Eloise Stapleton 1 , Marie-Marthe Suner 1 , Irina Sycheva 1 , Barbara Uszczynska-Ratajczak 15 , Maxim Y. Wolf 16 , Jinuri Xu 6 , Yucheng T. Yang 6,17 , Andrew Yates 1 , Daniel Zerbino 1 , Yan Zhang 6,18 , Jyoti S. Choudhary 8 , Mark Gerstein 6,17,19 , Roderic Guig ´ o 5,20 , Tim J. P. Hubbard 21 , Manolis Kellis 3,4 , Benedict Paten 2 , Michael L. Tress 10 and Paul Flicek 1,* 1 European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK, 2 UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA 95064, USA, 3 MIT Computer Science and Artificial Intelligence Laboratory, 32 Vassar St, Cambridge, MA 02139, USA, 4 Broad Institute of MIT and Harvard, 415 Main Street, Cambridge, MA 02142, USA, 5 Centre for Genomic Regulation (CRG), The Barcelona Institute for Science and Technology, Dr. Aiguader 88, Barcelona, E-08003 Catalonia, Spain, 6 Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520, USA, 7 Department of Bioscience, Brunel University London, Uxbridge UB8 3PH, UK, 8 Functional Proteomics, Division of Cancer Biology, Institute of Cancer Research, 237 Fulham Road, London SW3 6JB, UK, 9 Computational and Systems Biology Program, Massachusetts Institute of Technology, Cambridge, MA, USA, 10 Bioinformatics Unit, Spanish National Cancer Research Centre (CNIO), Madrid, Spain, 11 Department of Medical Oncology, Inselspital, University Hospital, University of Bern, Bern, Switzerland, 12 Department of Biomedical Research (DBMR), University of Bern, Bern, Switzerland, 13 Department of Molecular, Cellular &Developmental Biology, Yale University, New Haven, CT 06520, USA, 14 Systems Biology Institute, Yale University, West Haven, CT 06516, USA, 15 Centre of New Technologies, University of Warsaw, Warsaw, Poland, 16 Department of Biomedical Informatics at Harvard Medical School, 10 Shattuck Street, Suite 514, Boston, MA 02115, USA, 17 Program in Computational Biology & Bioinformatics, Yale University, Bass 432, 266 Whitney Avenue, New Haven, CT 06520, USA, 18 Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH 43210, USA, 19 Department of Computer Science, Yale University, Bass 432, 266 Whitney Avenue, New Haven, CT06520, USA, 20 Universitat Pompeu Fabra (UPF), Barcelona, E-08003 Catalonia, Spain and 21 Department of Medical and Molecular Genetics, King’sCollege London, Guys Hospital, Great Maze Pond, London SE1 9RT, UK Received September 21, 2020; Revised October 21, 2020; Editorial Decision October 22, 2020; Accepted October 24, 2020 ABSTRACT The GENCODE project annotates human and mouse genes and transcripts supported by experimental data with high accuracy, providing a foundational re- source that supports genome biology and clinical ge- nomics. GENCODE annotation processes make use of primary data and bioinformatic tools and analy- sis generated both within the consortium and ex- ternally to support the creation of transcript struc- tures and the determination of their function. Here, we present improvements to our annotation infras- tructure, bioinformatics tools, and analysis, and the * To whom correspondence should be addressed. Tel: +44 1223 492581; Fax: +44 1223 494494; Email: fl[email protected] C The Author(s) 2020. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. Downloaded from https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkaa1087/6018430 by MIT Libraries user on 25 December 2020

GENCODE 2021compbio.mit.edu/publications/227_Frankish_NucleicAcids... · 2020. 12. 25. · 2 NucleicAcidsResearch,2020 advances they support in the annotation of the hu-man and mouse

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

  • Nucleic Acids Research, 2020 1doi: 10.1093/nar/gkaa1087

    GENCODE 2021Adam Frankish 1, Mark Diekhans 2, Irwin Jungreis 3,4, Julien Lagarde5,Jane E. Loveland 1, Jonathan M. Mudge1, Cristina Sisu6,7, James C. Wright8,Joel Armstrong2, If Barnes1, Andrew Berry1, Alexandra Bignell1, Carles Boix3,4,9,Silvia Carbonell Sala5, Fiona Cunningham 1, Tomás Di Domenico10, Sarah Donaldson1,Ian T. Fiddes2, Carlos Garcı́a Girón 1, Jose Manuel Gonzalez1, Tiago Grego1,Matthew Hardy1, Thibaut Hourlier 1, Kevin L. Howe 1, Toby Hunt1, Osagie G. Izuogu1,Rory Johnson 11,12, Fergal J. Martin 1, Laura Martı́nez10, Shamika Mohanan1,Paul Muir13,14, Fabio C. P. Navarro6, Anne Parker1, Baikang Pei6, Fernando Pozo10, FerriolCalvet Riera1, Magali Ruffier 1, Bianca M. Schmitt1, Eloise Stapleton1,Marie-Marthe Suner 1, Irina Sycheva1, Barbara Uszczynska-Ratajczak15, Maxim Y. Wolf16,Jinuri Xu6, Yucheng T. Yang6,17, Andrew Yates 1, Daniel Zerbino 1, Yan Zhang 6,18,Jyoti S. Choudhary8, Mark Gerstein6,17,19, Roderic Guigó5,20, Tim J. P. Hubbard21,Manolis Kellis3,4, Benedict Paten2, Michael L. Tress 10 and Paul Flicek 1,*

    1European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton,Cambridge CB10 1SD, UK, 2UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA95064, USA, 3MIT Computer Science and Artificial Intelligence Laboratory, 32 Vassar St, Cambridge, MA 02139,USA, 4Broad Institute of MIT and Harvard, 415 Main Street, Cambridge, MA 02142, USA, 5Centre for GenomicRegulation (CRG), The Barcelona Institute for Science and Technology, Dr. Aiguader 88, Barcelona, E-08003Catalonia, Spain, 6Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520,USA, 7Department of Bioscience, Brunel University London, Uxbridge UB8 3PH, UK, 8Functional Proteomics,Division of Cancer Biology, Institute of Cancer Research, 237 Fulham Road, London SW3 6JB, UK, 9Computationaland Systems Biology Program, Massachusetts Institute of Technology, Cambridge, MA, USA, 10Bioinformatics Unit,Spanish National Cancer Research Centre (CNIO), Madrid, Spain, 11Department of Medical Oncology, Inselspital,University Hospital, University of Bern, Bern, Switzerland, 12Department of Biomedical Research (DBMR), Universityof Bern, Bern, Switzerland, 13Department of Molecular, Cellular & Developmental Biology, Yale University, NewHaven, CT 06520, USA, 14Systems Biology Institute, Yale University, West Haven, CT 06516, USA, 15Centre of NewTechnologies, University of Warsaw, Warsaw, Poland, 16Department of Biomedical Informatics at Harvard MedicalSchool, 10 Shattuck Street, Suite 514, Boston, MA 02115, USA, 17Program in Computational Biology &Bioinformatics, Yale University, Bass 432, 266 Whitney Avenue, New Haven, CT 06520, USA, 18Department ofBiomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH 43210, USA, 19Department ofComputer Science, Yale University, Bass 432, 266 Whitney Avenue, New Haven, CT 06520, USA, 20UniversitatPompeu Fabra (UPF), Barcelona, E-08003 Catalonia, Spain and 21Department of Medical and Molecular Genetics,King’s College London, Guys Hospital, Great Maze Pond, London SE1 9RT, UK

    Received September 21, 2020; Revised October 21, 2020; Editorial Decision October 22, 2020; Accepted October 24, 2020

    ABSTRACT

    The GENCODE project annotates human and mousegenes and transcripts supported by experimentaldata with high accuracy, providing a foundational re-source that supports genome biology and clinical ge-nomics. GENCODE annotation processes make use

    of primary data and bioinformatic tools and analy-sis generated both within the consortium and ex-ternally to support the creation of transcript struc-tures and the determination of their function. Here,we present improvements to our annotation infras-tructure, bioinformatics tools, and analysis, and the

    *To whom correspondence should be addressed. Tel: +44 1223 492581; Fax: +44 1223 494494; Email: [email protected]

    C© The Author(s) 2020. Published by Oxford University Press on behalf of Nucleic Acids Research.This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), whichpermits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

    Dow

    nloaded from https://academ

    ic.oup.com/nar/advance-article/doi/10.1093/nar/gkaa1087/6018430 by M

    IT Libraries user on 25 Decem

    ber 2020

    http://orcid.org/0000-0002-4333-628Xhttp://orcid.org/0000-0002-0430-0989http://orcid.org/0000-0002-3197-5367http://orcid.org/0000-0002-7669-2934http://orcid.org/0000-0002-7445-2419http://orcid.org/0000-0002-0935-7271http://orcid.org/0000-0003-4894-7773http://orcid.org/0000-0002-1751-9226http://orcid.org/0000-0003-4607-2782http://orcid.org/0000-0002-1672-050Xhttp://orcid.org/0000-0002-8386-1580http://orcid.org/0000-0002-0380-7171http://orcid.org/0000-0002-8886-4772http://orcid.org/0000-0001-5350-3056http://orcid.org/0000-0002-3357-5121http://orcid.org/0000-0001-9046-6370http://orcid.org/0000-0002-3897-7955

  • 2 Nucleic Acids Research, 2020

    advances they support in the annotation of the hu-man and mouse genomes including: the completionof first pass manual annotation for the mouse refer-ence genome; targeted improvements to the anno-tation of genes associated with SARS-CoV-2 infec-tion; collaborative projects to achieve convergenceacross reference annotation databases for the anno-tation of human and mouse protein-coding genes;and the first GENCODE manually supervised auto-mated annotation of lncRNAs. Our annotation is ac-cessible via Ensembl, the UCSC Genome Browserand https://www.gencodegenes.org.

    INTRODUCTION

    GENCODE produces widely-used reference genome an-notation of protein-coding and non-coding loci includingalternatively spliced transcripts and pseudogenes for thehuman and mouse genomes and makes these annotationsfreely available for the benefit of biomedical research andgenome interpretation. The GENCODE consortium devel-ops, maintains and improves targeted tools, analysis andprimary transcriptomic and proteomic data in support ofgene and transcript annotation. These resources supportupdates to genes in all functional classes or biotypes, includ-ing (i) the discovery of new features such as novel protein-coding genes and long non-coding RNA (lncRNA) genes;(ii) the extension of existing annotation including the identi-fication of novel alternatively spliced transcripts at protein-coding and lncRNA loci and (iii) the continuous criticalreappraisal of existing annotation that may result in re-moval or reclassification of protein-coding genes that lackevidence of protein-coding potential given all data nowavailable. GENCODE defines genes in terms of their tran-scriptional and functional overlap. The functional informa-tion implicit in the CDS of protein-coding gene supportsdecision making and provides high confidence in the inter-pretation of protein-coding genes. For lncRNAs, the lackof analogous knowledge makes representation of complexlncRNA loci difficult and we are working with lncRNAcommunity and other reference annotation databases to im-prove their annotation.

    Among other achievements, over the last two years wehave developed a manually supervised automated annota-tion pipeline and an annotation triage tool to leverage thevolume of data generated by current transcriptomics experi-ments while ensuring that the resulting annotated transcriptmodels maintain the quality of expert human annotation.We have completed the first pass manual annotation of themouse reference genome based on experiences on complet-ing the human annotation in 2013 and have used wholegenome PhyloCSF (1) analysis to generate ranked lists ofcandidate coding regions for investigation by expert humanannotators. To support research responding to the COVID-19 pandemic, we have reviewed and improved the annota-tion for a set of protein-coding genes associated with SARS-CoV-2 infection and immediately released the results usinga trackhub (2). We worked with the RefSeq (3) and UniProt(4) reference annotation databases toward achieving anno-tation convergence by ensuring that when a protein-coding

    gene or protein is present in one resource, it will be repre-sented in the others or there will be an explanation why not.We are part of the Matched Annotation from NCBI andEMBL-EBI (MANE) project to define a single representa-tive ‘MANE Select’ transcript for all protein-coding genesand ensure its structure and sequence is identical in boththe Ensembl/GENCODE and RefSeq genesets. We anno-tated new human protein-coding genes based on improvedanalyses and experimental validation using mass spectrom-etry. We have also improved the annotation of lncRNAs viathe discovery of novel loci and novel transcripts at existingloci primarily based on incorporating long transcriptomicsequence data generated using the CLS protocol (5).

    GENE ANNOTATION INFRASTRUCTURE

    We have made several key improvements to our processesand tools used for manual gene annotation.

    The Ensembl/GENCODE geneset is a merge of the man-ual gene annotation created by the Ensembl-HAVANAteam (methods and validation described in 6–8) and theautomated annotation produced by the Ensembl Geneb-uild team (9,10). Historically, these data were produced sep-arately and stored in independent and structurally differ-ent databases before being merged into a single set for re-lease. To speed data release and reduce complexity, we havenow moved all manual annotation and computational an-notation into a single database for human (and another formouse). In addition to continuing the support of manualannotation, this transition allows manual annotators to di-rectly ‘bless’, update or remove computationally annotatedmodels. Most significantly, new genes and transcripts re-leased early via the GENCODE update trackhub will beassigned their Ensembl (ENSX) formatted stable IDs attheir creation, having previously been given an interim ID(OTTX format).

    Long-read transcriptomic sequencing methods includ-ing those from Pacific Biosciences (PacBio) and OxfordNanopore Technologies (ONT) produce data volumes thatrequire change to our manual annotation process. In re-sponse, we developed the TAGENE pipeline to supportgreater automation of transcript model creation based onlong-read datasets generated both within GENCODE andby other groups. TAGENE implements filtering and merg-ing of long transcriptomic datasets before clustering puta-tive transcripts into loci (both existing and novel) and ap-plying further filters based on other transcriptomic datasets,including RNA-seq supported introns and existing GEN-CODE annotation (Figure 1). The clustering and final filter-ing steps are applied following multiple iterations of manualreview until a point is reached where the false positive ratefor the addition of spurious models is

  • Nucleic Acids Research, 2020 3

    Figure 1. Schematic of the TAGENE workflow to add long transcriptomic data to GENCODE annotation. Points in the workflow where manual reviewis applied are indicated.

    ated by the TAGENE pipeline. Kestrel is complementaryto our set of high quality annotation tools in Zmap, Blixemand Dotter, which were initially developed for the clone-by-clone annotation approach used for the first pass annota-tion of the human and mouse reference genomes. Kestrel’sstreamlined functionality is often all that is required to an-swer emerging manual annotation questions and thus fasterthan our traditional workflow.

    GENE ANNOTATION UPDATES

    The GENCODE consortium has improved and extendedthe annotation of the human and mouse reference genomesand makes the annotation publicly available (see Table 1 forannotation statistics from the most recent GENCODE re-leases).

    Since June 2018, ∼37 000 genes (∼32 000 human and5000 mouse) and ∼63 000 transcripts (∼55 000 human and∼8000 mouse) have either been created or updated in theGENCODE geneset (see Table 2 for a breakdown of newand updated genes and transcripts by functional biotype).During this period we have completed the first pass annota-tion of the mouse reference genome and conducted a num-ber of tightly focussed annotation projects including the hu-man and mouse olfactory receptor repertoire (12) and a re-annotation of developmental and epileptic encephalopathy-associated genes (13).

    Although a number of protein-coding genes in both hu-man and mouse have been added, removed or had theirbiotype changed over the past two years, the total numberof genes is stable. Similarly, the number of pseudogenes ofprotein-coding genes is broadly stable for human, althoughour ability to better identify unitary pseudogenes has ledto an increase in this specific class. In mouse, an increasein pseudogene count reflects the completion of manual an-

    notation for all chromosomes. LncRNAs continue to showthe largest increases in number, particularly in human whereour efforts have been concentrated.

    PROTEIN-CODING GENES

    In response to the SARS-CoV-2 pandemic, we have ap-plied our annotation resources to human genes with poten-tial links to viral infection and COVID-19 disease primarilyby investigating whether existing annotation for these genescan be improved. Our list of genes for reannotation comesfrom several sources including recently published drug re-purposing studies identifying host proteins associated withother related coronaviruses (14) and human proteins foundto physically associate with SARS-CoV-2 viral proteins inthe cell (15). We also included genes curated by UniProt (4)and the Human Cell Atlas project (16) as well as interferon-stimulated genes with known antiviral activity (17). Theseefforts added previously unannotated alternatively-splicedtranscript models and updated existing GENCODE tran-script models, in particular ‘partial’ models that were in-complete at their 5′ and/or 3′ ends that could be extendedto full length. All annotation takes advantage of long tran-scriptomic datasets and RNAseq data that was unavailableat the time of initial annotation. To date we have updatedthe annotation for 280 genes, adding ∼3700 novel tran-scripts and updating a further ∼850.

    GENCODE has been actively collaborating with otherreference annotation databases to try to achieve conver-gence on the annotation of protein-coding genes in hu-man and mouse. The MANE project aims to create a sin-gle agreed transcript for every human protein-coding genethat has a 100% match for sequence and structure (splic-ing, UTR and CDS) in both the Ensembl/GENCODE andRefSeq (3) annotation sets. The project is driven by two in-

    Dow

    nloaded from https://academ

    ic.oup.com/nar/advance-article/doi/10.1093/nar/gkaa1087/6018430 by M

    IT Libraries user on 25 Decem

    ber 2020

  • 4 Nucleic Acids Research, 2020

    Table 1. Total numbers of genes and transcripts in the GENCODE 35 (Human) and GENCODE M25 (Mouse) releases by gene functional biotype

    Protein-coding LncRNA Pseudogene sRNA IG/TR

    Human GENCODE 35 Genes 19954 17957 14767 7569 645Transcripts 154580 48684 18664 7569 666

    Mouse GENCODE M25 Genes 21859 13197 13741 6108 700Transcripts 102241 18856 14522 6108 864

    Table 2. Numbers of genes and transcripts that have been added to or updated in GENCODE Human and Mouse annotation since June 2018

    Human Mouse

    New Updated New and updated New Updated New and updated

    Protein-coding 131 17995 18126 845 1584 2429Genes LncRNA 1965 7678 9643 670 282 952

    Pseudogene 75 4152 4227 676 266 942Total 2171 29825 31996 2191 2132 4323

    Protein-coding 11334 21406 32740 4323 968 5291Transcripts LncRNA 19042 2807 21849 1171 73 1244

    Pseudogene 247 259 506 794 137 931Total 30623 24472 55095 6288 1178 7466

    dependent pipelines, one from each centre, followed by ex-tensive investigation and discussion by expert human anno-tators where the pipelines do not agree. The latest releaseof MANE v0.91, gives an overall coverage of 84% of allprotein-coding genes.

    We have been working extensively to improve the interop-erability of the existing annotations with UniProt. GenomeIntegration with FuncTion and Sequence (GIFTS) is a jointproject between Ensembl and the EMBL-EBI componentof the UniProt project and is currently available for humanand mouse proteins https://www.ebi.ac.uk/gifts/. GIFTScalculates mappings and pairwise alignments between En-sembl transcripts that have a protein translation with theircorresponding UniProt protein entries. Unmapped UniProtproteins are investigated by annotators from both teamsand edited where necessary. We have investigated 1044unmapped human (716) and mouse (328) proteins fromUniProt and identified cases where the GENCODE anno-tation needs to be updated (2 human, 49 mouse), and pro-teins that appear invalid in their putative genomic context(640 human, 54 mouse).

    We continue to analyse publications external to the GEN-CODE consortium reporting additional protein-codinggenes in the light of GENCODE criteria. For example, weexamined the novel protein-coding genes reported in theCHESS gene annotation set (18), adding five protein-codinggenes, 16 pseudogenes and 37 lncRNAs. A recent survey ofheart ORFs (19), has so far resulted in the annotation of 12additional human protein-coding genes.

    GENCODE annotation makes substantial use of com-parative genomics to help identify regions on the genomewith protein-coding potential. For example, we have usedCactus to create a 600-way vertebrate whole genome align-ment incorporating data from the 200 Mammals and Bird10K projects as the basis of a single base-pair resolutionmap of evolutionary selection (20). We will directly usethese alignments within the PhyloCSF phylogenetic analy-sis tool (1). The PhyloCSF pipeline has also been run on theeach new release of the human and mouse genome annota-tions to facilitate the discovery of additional novel coding

    genes, novel pseudogenes, and novel coding sequence (21).We have automated our process to generate updated lists ofPhyloCSF Candidate Coding Regions (PCCRs), which arethen examined by manual annotators. In human, PCCRsare part of the standard annotation workflow. In mouse, atargeted review of unannotated PCCRs analogous to thatpreviously undertaken in human has led to the identifica-tion of 64 novel protein-coding genes, 376 novel codingexons in preexisting protein-coding genes, and 202 pseu-dogenes including 56 unitary pseudogenes. PhyloCSF hasalso been used to identify candidate ribosomal stop codonreadthrough events in human and mouse (22,23). Follow-ing manual review of these and several others identified ex-perimentally, 14 and 11 genes with stop codon readthroughevents have been annotated in human and mouse, respec-tively (Figure 2).

    GENCODE annotation utilises proteomics data tosupplement transcriptomic and evolutionary evidence ofprotein-coding functionality and we have continued to bothgenerate experimental MS data and use publicly availabledata sets to aid the identification and annotation of protein-coding genes. Our data generation focus is on elements ofthe proteome that are missed by standard proteomics ap-proaches including the use of 155 novel synthetic peptidestargeting distinct and unique peptides mapping to putativecoding genes, newly discovered protein coding genes that re-quire validation, and pseudogenes that have shown strongpeptide evidence in previous experiments. These peptidesare compiled into a reference spectral library, which is usedto validate their existence in our experimental proteomicsdata and large public MS datasets. For example transcrip-tomic, conservation, and ribosome profiling data combinedwith experimental peptide evidence supported the discoveryand validation of an alternate protein isoform originatingfrom a non-ATG start site in the gene POLG (24), and high-lighted a novel class of unannotated protein-coding featuresthat are now under active investigation.

    To support the automated analysis of proteomics data forgenome annotation we collaborated with the PRIDE (25)proteomics repository at EMBL-EBI to build a reprocess-

    Dow

    nloaded from https://academ

    ic.oup.com/nar/advance-article/doi/10.1093/nar/gkaa1087/6018430 by M

    IT Libraries user on 25 Decem

    ber 2020

    https://www.ebi.ac.uk/gifts/

  • Nucleic Acids Research, 2020 5

    Figure 2. Screenshot from the Ensembl genome browser of the transcript view page for the gene LDHB, which contains a transcript (ENST00000673047,LDHB-211) with an annotated stop-codon readthrough event. The location of the annotation attribute flagging the stop-codon readthrough is highlightedby the red box.

    ing and peptide-to-genome mapping pipeline for public pro-teomics.

    Finally, we developed a pipeline based on UniProt (4),APPRIS (26), PhyloCSF (1), Ensembl gene trees (10),RNA-seq, MS and variation data to identify annotatedprotein-coding genes with weak or no support. This methodenables us to scrutinise currently annotated protein-codinggenes in the human and mouse gene set for misclassifiedgene models. To date we have flagged as potential non-coding genes more than 2475 human and 1807 mouse genesthat were annotated as protein-coding. These are then re-viewed in an iterative and ongoing process by expert manualannotators and retained, removed or reclassified based ontheir current supporting evidence. To date, ∼1000 humanprotein-coding genes have been reviewed and 119 removedor reclassified. A complementary approach has also beendeveloped to identify missing and partially complete genemodels in the human genome and submit to manual review.

    LncRNAS

    We have made improvements to the Capture Long Se-quencing (CLS) lab protocol (5), including a 5′ cap se-lection step (‘CapTrap’) (27), which increases the propor-tion of sequenced full-length transcripts and the use ofSpiked-in RNA Variant Control Mixes (SIRVs). ApplyingCLS, we have generated long transcriptomic data target-ing a variety of suspected lncRNA-producing genomic loci

    in both human and mouse. Focusing primarily on unan-notated regions such as GWAS sites, putative enhancers,and non-GENCODE lncRNA catalogs (e.g. miTranscrip-tome (28), NONCODE (29), FANTOM CAT (30)). In to-tal we have produced more than 36 million ONT readsand 2 million PacBio Sequel (PBS) reads identifying thou-sands of potential novel loci (∼1600 in Human, ∼4500 inmouse) in currently unannotated genomic regions for re-view and inclusion in the Ensembl/GENCODE geneset.Long transcriptomic sequence data produced within GEN-CODE and from public data archives has been run throughour TAGENE workflow and the results of this first setof analysis released to the public in GENCODE 31 (June2019). These initial results have already made a signifi-cant difference to the coverage of lncRNAs in GENCODE,with the addition of 1711 novel loci and 17 858 transcripts,an 11% and 60% increase compared to the previous releaserespectively.

    PSEUDOGENES

    Our pseudogene annotation has benefited from the analy-sis of new datasets. For example, using RNA-seq datasetsfrom ENTEx-pseudogene expression in various human tis-sues we have developed a computational framework to ac-curately quantify the expression level of pseudogenes, andidentify actively transcribed pseudogenes in each tissue. Wehave also used our pseudogene annotation in 16 closely re-

    Dow

    nloaded from https://academ

    ic.oup.com/nar/advance-article/doi/10.1093/nar/gkaa1087/6018430 by M

    IT Libraries user on 25 Decem

    ber 2020

  • 6 Nucleic Acids Research, 2020

    Figure 3. A screenshot from the Ensembl genome browser of the location view for the CTSS gene. The Comprehensive annotation from GENCODE 35 isshown in the upper panel and the updated annotation in the COVID-19 genes trackhub is shown in the lower panel. Transcript models that are unchangedwith respect to release Ensembl 101 are coloured blue, whereas new models or pre-existing models that have been modified are shown in orange.

    lated mouse strains from the Mouse Genomes Project (31)to create orthology relationships for the conserved annota-tions and the identification of patterns of pseudogene gainand loss between strains (32) and give a prototype for workannotating human pseudogenes leveraging variation acrossthe human population.

    DATA ACCESS

    GENCODE gene sets are currently updated up to fourtimes each year for both human and mouse. Each releaseis versioned and made available immediately upon releasefrom Ensembl (6) and https//www.gencodegenes.org withrelease on the UCSC Genome Browser (33) normally fol-lowing shortly thereafter. The current human release isGENCODE 35 (August 2020) and the current mouse re-lease is GENCODE M25 (April 2020). Additional infor-mation and previous releases can be found at https//www.gencodegenes.org.

    GENCODE is the now the standardised default humanand mouse annotation for both the Ensembl and UCSCgenome browsers following a transition of UCSC’s mouseannotation in April 2019. Data is presented through all ofthe standard interfaces from both resources.

    To expedite public access to updated annotation be-tween releases, all annotation changes are made freelyavailable within 24 h via the ‘GENCODE update’ TrackHub, which can be accessed at both the Ensembl and

    UCSC genome browsers. In the Ensembl browser, the hubhas been added to the Track Hub Registry (accessed viathe ‘Custom tracks’ section), and can be connected toby searching for ‘GENCODE update’. Alternatively, thedata can be added as a custom track in both Ensembland UCSC browsers (http://ftp.ebi.ac.uk/pub/databases/gencode/update trackhub/hub.txt). Additionally, a track-hub of updates to genes associated with COVID-19can be accessed in the same way (http://ftp.ebi.ac.uk/pub/databases/gencode/covid19 trackhub/hub.txt). In the‘COVID-19 genes’ track data view, transcript modelsthat are unchanged with respect to release GENCODE35/Ensembl 101 are coloured blue, whereas new models orpre-existing models that have been modified are shown inorange (Figure 3). We also offer BED and gtf files for theseannotations.

    We have made available the public ‘SynonymousConstraint’ track hub in the UCSC Genome Browserthat shows protein-coding regions under synonymousconstraint, indicating an overlapping function, andsynonymous accelerated regions, indicating a high mu-tation rate (https://data.broadinstitute.org/compbio1/SynonymousConstraintTracks/trackHub/).

    Supported GENCODE annotation is available on theGRCh38 human reference assembly and the GRCm38mouse reference assembly. Selected human releases aremapped back to the GRCh37 assembly and made availablefrom UCSC and https://www.gencodegenes.org as a service

    Dow

    nloaded from https://academ

    ic.oup.com/nar/advance-article/doi/10.1093/nar/gkaa1087/6018430 by M

    IT Libraries user on 25 Decem

    ber 2020

    http://www.gencodegenes.orghttp://www.gencodegenes.orghttp://ftp.ebi.ac.uk/pub/databases/gencode/update_trackhub/hub.txthttp://ftp.ebi.ac.uk/pub/databases/gencode/covid19_trackhub/hub.txthttps://data.broadinstitute.org/compbio1/SynonymousConstraintTracks/trackHub/https://www.gencodegenes.org

  • Nucleic Acids Research, 2020 7

    to the community. The resulting mapping are not manu-ally checked and may have errors especially in complicatedregions of the human genome. We recommend use of theGRCh38 annotations if possible.

    Training about the GENCODE annotation and its useis available from the Ensembl and UCSC training teamand user support is available from the Ensembl and UCSChelpdesks.

    CONCLUSION

    The GENCODE consortium leverages the best availabledata, analysis and tools to continually improve the gene an-notation of the human and mouse reference genomes. Wehave developed new methods and workflows to take ad-vantage of the increasing quality and volume of data, andin particular long transcriptomic data, while maintainingthe specificity afforded by expert human oversight. We ex-pect our ability to use new data to improve our coverageof novel genes and alternatively spliced transcripts will al-low us to move towards a more complete representation ofall gene features of known functional classes as we monitorthe emergence of new functional features that may requireannotation such as alternative translations of known codinggenes, non-canonical translations in, for example, lncRNAsand mRNA with multiple functions.

    FUNDING

    National Human Genome Research Institute of the Na-tional Institutes of Health [U41HG007234]; the contentis solely the responsibility of the authors and does notnecessarily represent the official views of the National In-stitutes of Health; Wellcome Trust [WT108749/Z/15/Z,WT200990/Z/16/Z]; European Molecular Biology Labo-ratory; Swiss National Science Foundation through the Na-tional Center of Competence in Research ‘RNA & Disease’(to R.J.); Medical Faculty of the University of Bern (toR.J). Funding for open access charge: National Institutesof Health.Conflict of interest statement. Paul Flicek is a member of theScientific Advisory Boards of Fabric Genomics, Inc., andEagle Genomics, Ltd.

    REFERENCES1. Lin,M.F., Jungreis,I. and Kellis,M. (2011) PhyloCSF: a comparative

    genomics method to distinguish protein coding and non-codingregions. Bioinformatics, 27, i275–82.

    2. Raney,B.J., Dreszer,T.R., Barber,G.P., Clawson,H., Fujita,P.A.,Wang,T., Nguyen,N., Paten,B., Zweig,A.S., Karolchik,D. et al. (2014)Track data hubs enable visualization of user-defined genome-wideannotations on the UCSC Genome Browser. Bioinformatics, 30,1003–1005.

    3. O’Leary,N.A., Wright,M.W., Brister,J.R., Ciufo,S., Haddad,D.,McVeigh,R., Rajput,B., Robbertse,B., Smith-White,B., Ako-Adjei,D.et al. (2016) Reference sequence (RefSeq) database at NCBI: currentstatus, taxonomic expansion, and functional annotation. NucleicAcids Res., 44, D733–D745

    4. The UniProt Consortium (2019) UniProt: a worldwide hub of proteinknowledge. Nucleic Acids Res., 47, D506–D515

    5. Lagarde,J., Uszczynska-Ratajczak,B., Carbonell,S., Pérez-Lluch,S.,Abad,A., Davis,C., Gingeras,T.R., Frankish,A., Harrow,J., Guigo,R.et al. (2017) High-throughput annotation of full-length long

    noncoding RNAs with capture long-read sequencing. Nat Genet., 49,1731–1740.

    6. Harrow,J., Denoeud,F., Frankish,A., Reymond,A., Chen,C.K.,Chrast,J., Lagarde,J., Gilbert,J.G., Storey,R., Swarbreck,D. et al.(2006) GENCODE: producing a reference annotation for ENCODE.Genome Biol., 7, S4.

    7. Harrow,J., Frankish,A., Gonzalez,J.M., Tapanari,E., Diekhans,M.,Kokocinski,F., Aken,B.L., Barrell,D., Zadissa,A., Searle,S. et al.(2012) GENCODE: the reference human genome annotation for TheENCODE Project. Genome Res., 22, 1760–1774.

    8. Howald,C., Tanzer,A., Chrast,J., Kokocinski,F., Derrien,T.,Walters,N., Gonzalez,J.M., Frankish,A., Aken,B.L., Hourlier,T. et al.(2012) Combining RT-PCR-seq and RNA-seq to catalog all genicelements encoded in the human genome. Genome Res., 22, 1698–1710.

    9. Aken,B.L., Ayling,S., Barrell,D., Clarke,L., Curwen,V., Fairley,S.,Fernandez Banet,J., Billis,K., Garcı́a Girón,C., Hourlier,T. et al.(2016) The Ensembl gene annotation system. Database (Oxford),2016, baw093.

    10. Yates,A.D., Achuthan,P., Akanni,W., Allen,J., Allen,J.,Alvarez-Jarreta,J., Amode,M.R., Armean,I.M., Azov,A.G.,Bennett,R. et al. (2020) Ensembl 2020. Nucleic Acids Res., 48,D682–D688.

    11. Kokocinski,F., Harrow,J. and Hubbard,T. (2010) AnnoTrack–atracking system for genome annotation. BMC Genomics, 11, 538.

    12. Barnes,I.H.A., Ibarra-Soria,X., Fitzgerald,S., Gonzalez,J.M.,Davidson,C., Hardy,M.P., Manthravadi,D., Van Gerven,L.,Jorissen,M., Zeng,Z. et al. (2020) Expert curation of the human andmouse olfactory receptor gene repertoires identifies conserved codingregions split across two exons. BMC Genomics, 21, 196.

    13. Steward,C.A., Roovers,J., Suner,M.M., Gonzalez,J.M.,Uszczynska-Ratajczak,B., Pervouchine,D., Fitzgerald,S., Viola,M.,Stamberger,H., Hamdan,F.F. et al. (2019) Re-annotation of 191developmental and epileptic encephalopathy-associated genesunmasks de novo variants in SCN1A. NPJ Genom. Med., 4, 31.

    14. Zhou,Y., Hou,Y., Shen,J., Huang,Y., Martin,W. and Cheng,F. (2020)Network-based drug repurposing for novel coronavirus2019-nCoV/SARS-CoV-2. Cell Discov., 6, 14.

    15. Gordon,D.E., Jang,G.M., Bouhaddou,M., Xu,J., Obernier,K.,White,K.M., O’Meara,M.J., Rezelj,V.V., Guo,J.Z., Swaney,D.L. et al.(2020) A SARS-CoV-2 protein interaction map reveals targets fordrug repurposing. Nature, 583, 459–468.

    16. Rozenblatt-Rosen,O., Stubbington,M.J.T., Regev,A. andTeichmann,S.A. (2017) The Human Cell Atlas: from vision to reality.Nature, 550, 451–453

    17. Schoggins,J.W. and Rice,C.M. (2011) Interferon-stimulated genes andtheir antiviral effector functions. Curr. Opin. Virol., 1, 519–525.

    18. Pertea,M., Shumate,A., Pertea,G., Varabyou,A., Breitwieser,F.P.,Chang,Y.C., Madugundu,A.K., Pandey,A. and Salzberg,S.L. (2018)CHESS: a new human gene catalog curated from thousands oflarge-scale RNA sequencing experiments reveals extensivetranscriptional noise. Genome Biol., 28, 208.

    19. an Heesch,S., Witte,F., Schneider-Lunitz,V., Schulz,J.F., Adami,E.,Faber,A.B., Kirchner,M., Maatz,H., Blachut,S., Sandmann,C.L.et al. (2019) The translational landscape of the human heart. Cell,178, 242–260.

    20. Armstrong,J., Hickey,G., Diekhans,M., Fiddes,I.T., Novak,A.M.,Deran,A., Fang,Q., Xie,D., Feng,S., Stiller,J. et al. (2020) ProgressiveCactus is a multiple-genome aligner for the thousand-genome era.Nature, 587, 246–251.

    21. Mudge,J.M., Jungreis,I., Hunt,T., Gonzalez,J.M., Wright,J.C.,Kay,M., Davidson,C., Fitzgerald,S., Seal,R., Tweedie,S. et al. (2019)Discovery of high-confidence human protein-coding genes and exonsby whole-genome PhyloCSF helps elucidate 118 GWAS loci. GenomeRes., 29, 2073–2087.

    22. Jungreis,I., Chan,C.S., Waterhouse,R.M., Fields,G., Lin,M.F. andKellis,M. (2016) Evolutionary dynamics of abundant stop codonreadthrough. Mol. Biol. Evol., 33, 3108–3132.

    23. Loughran,G., Jungreis,I., Tzani,I., Power,M., Dmitriev,R.I.,Ivanov,I.P., Kellis,M. and Atkins,J.F. (2018) Stop codon readthroughgenerates a C-terminally extended variant of the human vitamin Dreceptor with reduced calcitriol response. J. Biol. Chem., 293,4434–4444.

    24. Khan,Y.A., Jungreis,I., Wright,J.C., Mudge,J.M., Choudhary,J.S.,Firth,A.E. and Kellis,M. (2020) Evidence for a novel overlapping

    Dow

    nloaded from https://academ

    ic.oup.com/nar/advance-article/doi/10.1093/nar/gkaa1087/6018430 by M

    IT Libraries user on 25 Decem

    ber 2020

  • 8 Nucleic Acids Research, 2020

    coding sequence in POLG initiated at a CUG start codon. BMCGenet., 21, 25.

    25. Perez-Riverol,Y., Bai,J., Bernal-Llinares,M., Hewapathirana,S.,Kundu,D.J., Inuganti,A., Griss,J., Mayer,G., Eisenacher,M., Pérez,E.et al. (2019) The PRIDE database and related tools and resources in2019: improving support for quantification data. Nucleic Acids Res.,47, D442–D450.

    26. Rodriguez,J.M., Rodriguez-Rivas,J., Di Domenico,T., Vázquez,J.,Valencia,A. and Tress,M.L. (2018) APPRIS 2017: principal isoformsfor multiple gene sets. Nucleic Acids Res., 46, D213–D217.

    27. Carninci,P., Kvam,C., Kitamura,A., Ohsumi,T., Okazaki,Y., Itoh,M.,Kamiya,M., Shibata,K., Sasaki,N., Izawa,M. et al. (1996)High-efficiency full-length cDNA cloning by biotinylated CAPtrapper. Genomics, 37, 327–336.

    28. Iyer,M.K., Niknafs,Y.S., Malik,R., Singhal,U., Sahu,A., Hosono,Y.,Barrette,T.R., Prensner,J.R., Evans,J.R., Zhao,S. et al. (2015) Thelandscape of long noncoding RNAs in the human transcriptome.Nat. Genet., 47, 199–208.

    29. Fang,S., Zhang,L., Guo,J., Niu,Y., Wu,Y., Li,H., Zhao,L., Li,X.,Teng,X., Sun,X. et al. (2018) NONCODEV5: a comprehensive

    annotation database for long non-coding RNAs. Nucleic Acids Res.,46, D308–D314.

    30. Hon,C.C., Ramilowski,J.A., Harshbarger,J., Bertin,N.,Rackham,O.J., Gough,J., Denisenko,E., Schmeier,S., Poulsen,T.M.,Severin,J. et al. (2017) An atlas of human long non-coding RNAswith accurate 5′ ends. Nature, 543, 199–204.

    31. Lilue,J., Doran,A.G., Fiddes,I.T., Abrudan,M., Armstrong,J.,Bennett,R., Chow,W., Collins,J., Collins,S., Czechanski,A. et al.(2018) Sixteen diverse laboratory mouse reference genomes definestrain-specific haplotypes and novel functional loci. Nat. Genet., 50,1574–1583.

    32. Sisu,C., Muir,P., Frankish,A., Fiddes,I., Diekhans,M., Thybert,D.,Odom,D.T., Flicek,P., Keane,T.M., Hubbard,T. et al. (2020)Transcriptional activity and strain-specific history of mousepseudogenes. Nat. Commun., 11, 3695.

    33. Kent,W.J., Sugnet,C.W., Furey,T.S., Roskin,K.M., Pringle,T.H.,Zahler,A.M. and Haussler,D. (2002) The human genome browser atUCSC. Genome Res., 12, 996–1006.

    Dow

    nloaded from https://academ

    ic.oup.com/nar/advance-article/doi/10.1093/nar/gkaa1087/6018430 by M

    IT Libraries user on 25 Decem

    ber 2020