6
Informatics Issues in Large-Scale Sequence Analysis: Elucidating the Protein Kinases of C. elegans Jonathan Bingham, Greg D. Plowman, and Sucha Sudarsanam * SUGEN, Inc., South San Francisco, California 94080 Abstract With the availability of the nearly complete genomic sequence of C. elegans, the first multicellular organism to be sequenced, molecular biology has definitely entered the postgenomic era. Annotation of the genomic sequence, which refers to identifying the genes and other biologically relevant sections of the genome, is an important and nontrivial next step. A first-pass annotation will be necessarily incomplete but will drive further biological experiments, which in turn will help to annotate the genome better. Given the scale of the genome sequence analysis, it is clear that the annotation should be automated as much as possible without sacrificing the quality of analysis. In this work, we outline our approach to identifying the protein kinases of C. elegans from the genomic sequence. We describe new tools we have developed for analysis, management and visualization of genomic data. By developing modular and scalable solutions, this study has provided a framework for future analysis of the Drosophila and human genomes. J. Cell. Biochem. 80:181–186, 2000. © 2000 Wiley-Liss, Inc. Recent genomic sequencing efforts are yield- ing unprecedented volumes of genomic data at a pace previously undreamed of, promising to soon provide several complete genomes, includ- ing Drosophila and human. The challenge now is to manage, analyze, annotate, and visualize genomic data. The recently completed C. el- egans genome [The C. elegans Sequencing Con- sortium, 1998] provides a useful glimpse into the prospects and perils of large-scale sequence analysis. At 100 million bases in length, the C. elegans genome is still small compared to those of higher eukaryotes, but already large enough to present significant challenges in scalability of analysis. Also, as the first multicellular or- ganism to be completely sequenced, it presents a reasonable model system for the analysis of higher eukaryotes. For the present purposes, we focused our anal- ysis on identifying protein kinases, which are involved in signal transduction and specifically protein phosphorylation. In eukaryotes, protein phosphorylation plays a critical role in cell cycle regulation, DNA replication, gene transcription, protein translation, and energy metabolism. In multicellular eukaryotes, such as worms, it also affects more complex functionality (such as cel- lular differentiation, intercellular communica- tion, cell survival and senescence, synaptic trans- mission, and environmental interaction). Protein kinases constitute 2.4% of the worm genome and are the second most prevalent protein family in worms. A comprehensive analysis of the C. el- egans protein kinases appears elsewhere [Plow- man et al., 1999]; our emphasis here is on de- scribing the computational framework used for the analysis. Our analysis strategy can be ap- plied to other protein families as well as other genomes to identify biologically interesting fea- tures. Data Analysis Pipeline In order to cope with the huge amount of genomic data, sequence analysis clearly must be automated to the greatest extent possible, while preserving the quality, accuracy, and precision of the results. Relational database management systems greatly simplify the task of data warehousing and retrieval, but creating a central, ordered repository of sequence data is only the beginning. The larger questions are what data to store, how to analyze and anno- tate it, and how to ultimately visualize the results. Scripting and automation languages make it possible to create a large-scale data flow pipeline with minimal human interven- tion and human error. It should be emphasized *Correspondence to: Sucha Sudarsanam, SUGEN, Inc., 230 East Grand Avenue, South San Francisco, CA 94080. E-mail: [email protected] Received 19 July 1999; Accepted 19 July 1999 Journal of Cellular Biochemistry 80:181–186 (2000) © 2000 Wiley-Liss, Inc.

Informatics issues in large-scale sequence analysis: Elucidating the protein kinases of C. elegans

Embed Size (px)

Citation preview

Page 1: Informatics issues in large-scale sequence analysis: Elucidating the protein kinases of C. elegans

Informatics Issues in Large-Scale Sequence Analysis:Elucidating the Protein Kinases of C. elegansJonathan Bingham, Greg D. Plowman, and Sucha Sudarsanam*

SUGEN, Inc., South San Francisco, California 94080

Abstract With the availability of the nearly complete genomic sequence of C. elegans, the first multicellularorganism to be sequenced, molecular biology has definitely entered the postgenomic era. Annotation of the genomicsequence, which refers to identifying the genes and other biologically relevant sections of the genome, is an importantand nontrivial next step. A first-pass annotation will be necessarily incomplete but will drive further biologicalexperiments, which in turn will help to annotate the genome better. Given the scale of the genome sequence analysis,it is clear that the annotation should be automated as much as possible without sacrificing the quality of analysis. Inthis work, we outline our approach to identifying the protein kinases of C. elegans from the genomic sequence. Wedescribe new tools we have developed for analysis, management and visualization of genomic data. By developingmodular and scalable solutions, this study has provided a framework for future analysis of the Drosophila and humangenomes. J. Cell. Biochem. 80:181–186, 2000. © 2000 Wiley-Liss, Inc.

Recent genomic sequencing efforts are yield-ing unprecedented volumes of genomic data ata pace previously undreamed of, promising tosoon provide several complete genomes, includ-ing Drosophila and human. The challenge nowis to manage, analyze, annotate, and visualizegenomic data. The recently completed C. el-egans genome [The C. elegans Sequencing Con-sortium, 1998] provides a useful glimpse intothe prospects and perils of large-scale sequenceanalysis. At 100 million bases in length, the C.elegans genome is still small compared to thoseof higher eukaryotes, but already large enoughto present significant challenges in scalabilityof analysis. Also, as the first multicellular or-ganism to be completely sequenced, it presentsa reasonable model system for the analysis ofhigher eukaryotes.

For the present purposes, we focused our anal-ysis on identifying protein kinases, which areinvolved in signal transduction and specificallyprotein phosphorylation. In eukaryotes, proteinphosphorylation plays a critical role in cell cycleregulation, DNA replication, gene transcription,protein translation, and energy metabolism. Inmulticellular eukaryotes, such as worms, it also

affects more complex functionality (such as cel-lular differentiation, intercellular communica-tion, cell survival and senescence, synaptic trans-mission, and environmental interaction). Proteinkinases constitute 2.4% of the worm genome andare the second most prevalent protein family inworms. A comprehensive analysis of the C. el-egans protein kinases appears elsewhere [Plow-man et al., 1999]; our emphasis here is on de-scribing the computational framework used forthe analysis. Our analysis strategy can be ap-plied to other protein families as well as othergenomes to identify biologically interesting fea-tures.

Data Analysis Pipeline

In order to cope with the huge amount ofgenomic data, sequence analysis clearly mustbe automated to the greatest extent possible,while preserving the quality, accuracy, andprecision of the results. Relational databasemanagement systems greatly simplify the taskof data warehousing and retrieval, but creatinga central, ordered repository of sequence datais only the beginning. The larger questions arewhat data to store, how to analyze and anno-tate it, and how to ultimately visualize theresults. Scripting and automation languagesmake it possible to create a large-scale dataflow pipeline with minimal human interven-tion and human error. It should be emphasized

*Correspondence to: Sucha Sudarsanam, SUGEN, Inc.,230 East Grand Avenue, South San Francisco, CA 94080.E-mail: [email protected] 19 July 1999; Accepted 19 July 1999

Journal of Cellular Biochemistry 80:181–186 (2000)

© 2000 Wiley-Liss, Inc.

Page 2: Informatics issues in large-scale sequence analysis: Elucidating the protein kinases of C. elegans

that these are only enabling technologies thatleave open the question of what to automateand how to ensure the quality of the end prod-uct. A variety of algorithms exist for homologyand profile searching of large data sets, includ-ing BLAST [Altschul et al., 1997], Smith-Waterman [Smith and Waterman, 1981], andHidden Markov Models [Krogh et al., 1994].These tools form an essential part of a se-quence analysis pipeline. But their integrationand the interpretation of their results pose asignificant challenge. Finally, after construct-ing a data analysis pipeline, complete withdata warehousing, scripting, and algorithms,some new set of annotation data emerges. Ul-timately a human must inspect or sift throughthe data in order to confirm its validity andutility. Existing tools are already of use in datavisualization, including genome browsers andtree viewers. But the sheer volume of dataalready strains many existing visualizationtools to their practical limits, while introducingnew problems of navigation. In the course ofour analysis of the C. elegans protein kinases,we found it necessary to create our own datavisualization tools. In what follows, we outlineour data analysis pipeline.

Data Management

The complete C. elegans genome is availablefor download from the Sanger Center (http://www/sanger.ac.uk), as cosmids, as six chromo-somal assemblies, and as 19,099 putativeworm proteins. After downloading each dataset, we imported the raw sequence data into arelational database, storing the accession num-bers, sequence text, time stamp, and so forth.At each step of the analysis, results—eitherintermediate or final—were imported into newtables, each linked back to the raw sequencedata. The Sugen C. elegans database, calledSuCeDb, contains tables for raw sequencedata, for open reading frame predictions, fordomain predictions including protein kinases,and for additional sequence annotation (Fig. 1).A typical step in the analysis pipeline requiresquerying the database, running some form ofanalysis on the data, and storing the results ofthe analysis back in SuCeDb. Depending onthe complexity of a particular analysis algo-rithm, database interaction may itself be therate-limiting step. This is the case, for exam-ple, with sequence translation, a trivial enoughoperation taking on the order of a few minutes

for the complete worm genome. In the case ofmore complex analysis, such as homology andpattern-based searching, database access occu-pies a less significant portion of the overalltime. Better hardware and relational databaseperformance tuning can alleviate some of thestress on computer resources. But the biologi-cally interesting portion of the data analysispipeline is not the storage mechanism, but ratherthe data and the analysis that produces it.

Analysis

For the nucleic acid databases, the next stepafter downloading and data warehousing wasto translate the sequences into all six readingframes, identifying potential coding regions.All open reading frames (ORFs) longer than 30amino acids were retained and stored in a sep-arate database table. A variety of algorithmsexist for searching and analyzing the openingreading frames, including gene recognition al-gorithms (for a comprehensive collection ofgene prediction algorithms, see http://linkage.rockefeller.edu/wli/gene/), homology searchmethods [Altschul et al., 1997; Smith and Wa-terman, 1981], profile search methods [Kroghet al., 1994; Eddy, 1998], and pattern searchtools [Henikoff et al., 1999]. As always, thechoice of algorithms must depend on the spe-cific nature of the problem. Clearly to identifyprotein kinases it was not necessary to assem-ble all genes or identify all intron/exon bound-aries in the entire genome.

One search strategy, and by far the mostprevalent today, is homology searching, using

Fig. 1. Schema for sequence analysis dataflow. Database con-tains tables for raw sequences (Worm chromosomes, Wormcosmids); predicted proteins (WormPep); known kinases andboundaries for catalytic domains (Known kinases); and tablesfor storing results of the analysis (Analysis).

182 Bingham et al.

Page 3: Informatics issues in large-scale sequence analysis: Elucidating the protein kinases of C. elegans

either the fast, heuristic BLAST2 algorithm orthe slower but more sensitive Smith-Waterman algorithm. These general purposealgorithms accept a query sequence as an ar-gument and search against a database for ho-mologous sequence segments, assigning anE-value to the results. By using a straighthomology-based method, it is possible to searchthe C. elegans open reading frames against abroad set of protein sequences. Then any matchagainst a known kinase, meeting some thresh-old criterion, would positively identify a poten-tial C. elegans kinase. This turns out to be aviable strategy, although as a general-purposealgorithm it does not benefit in any way fromthe family-specific nature of the problem offinding a particular known protein domain.

Enter the pattern search strategies, includ-ing PSI-BLAST [Altschul et al., 1997] and Hid-den Markov Models (HMM) [Eddy, 1998]. PSI-BLAST is a hybrid strategy that begins with ageneral-purpose search, but iteratively con-structs a problem-specific pattern. HMM ismore directly family specific, requiring clear apriori knowledge of a family or of some of itsmembers. In the case of protein kinases, such apriori knowledge is readily available. Concep-tually, the most natural choice for a patternsearch algorithm was a Hidden Markov Model.Such a model “describes a family of proteins byassigning large probabilities to sequences inthat family” [Krogh et al., 1994]. The descrip-tion consists, essentially, of a vector of proba-bilities for each amino acid position, with theprobability specifying the likelihood of a givenamino acid occurring at that position. In orderto construct such a model, considerable priorknowledge of a protein family is required, typ-ically in the form of a set of representativesequences. Therefore, HMMs are not suitablefor general database searching. Nor do they

take into account higher-level correlations;such as those arising from structural features.These caveats aside, HMMs capture thefamily-specific nature of the problem of identi-fying protein kinases.

To construct a model of the kinase family, westarted with sequences of kinase catalytic do-mains from yeast to human. To avoid biasingthe model toward certain subfamilies, thus re-ducing sensitivity to other subfamilies of ki-nases, we effectively eliminated all kinaseswith homology in their catalytic domainsgreater than 50%. This left 70 broadly repre-sentative kinases. This general, ‘representa-tive’ kinase profile served to identify all poten-tial kinases in the C. elegans genome.

An HMM search may be used to identifycomplete or partial matches against the profile.It was crucial to search for partial matches,since short open reading frames could not beexpected to contain a complete kinase. Rather,kinases might be spread across multiple exons,with frame-shifts in between. In fact, such wasfrequently the case. See Table I in which acosmid denoted by its accession number ZC404,has five distinct exons in all three reversestrand reading frames. Some of these frag-ments have relatively low scores—low enoughthat they might easily be overlooked on theirown. But combined, the five fragments yield acomplete kinase domain with a highly signifi-cant score.

Using a custom gene assembly algorithm, allsuch fragments were assembled to form asmany complete kinase domains as possible.The criteria were as follows: fragments must beon the same strand; they must consecutivelyspan the kinase profile so that each additionalfragment adds to the complete kinase domain;they must not overlap by ‘too much’, definedarbitrarily as a maximum of a 10 amino acid

TABLE I.

Frame Begin base End base Begin model End model E value Score

21 21,512 20,169 122 261 4.60E-37 137.222 21,532 21,398 94 122 3.60E-02 14.823 21,648 21,529 71 90 9.40E-02 13.323 21,798 21,652 28 67 1.00E-01 13.222 22,223 21,765 1 21 3.90E-04 21.7

Alignment of ZC404 kinase on the hidden Markov model (HMM). Coordinates for open reading frames FRAME, BEGIN-_BASE, END_BASE) that contain the putative kinase exons are shown in the first three columns. Alignments of each exonon the HMM (BEGIN_MODEL, END_MODEL) are shown in the next two columns. Finally, for each kinase exon, expectationvalues and scores against the model are shown.

183Issues in Large Scale Sequence Analysis

Page 4: Informatics issues in large-scale sequence analysis: Elucidating the protein kinases of C. elegans

overlap (results varied little in response to theprecise cutoff value). No restrictions wereplaced on the length of introns, as intronlengths varied tremendously. After assemblingkinase fragments, the assembled kinases werescored against the complete kinase HMM, thistime forcing a global match. All fragments andassembled kinases were stored in the database,along with precise location and strand informa-tion, HMM scores and E-values, and thematching region in the kinase profile.

All analysis up through this point—fromdownloading the raw genomic sequence, stor-ing it in a database, translating it an sixframes, searching against an HMM profile,parsing the results of that search, storing theresults in a database, assembling fragmentarykinases, and storing these in the database—was entirely automated using simple shellscripts and programs written in the Java lan-guage. It executes relatively quickly (on theorder of a few hours using a standard single-processor desktop computer with average spec-ifications), with the overwhelming majority ofthe time consumed in database access andHMM searching. With hardware-acceleratedversions of the HMM algorithm just now be-coming available, this portion of the analysiscan be vastly sped up. By bypassing some of theintermediate database storage, or by accessingthe database using a separate processor, thefinal bottleneck can be partly alleviatedthrough a combination of hardware and carefuloptimization. Significantly, the entire auto-mated pipeline scales linearly with respect tothe length of the genomic sequence. By anno-tating sequence data incrementally, as it be-comes available, the volume of data should re-main manageable even as the largerDrosophila and human genomes become avail-able.

With a scalable, fully automated pipeline inplace for identifying C. elegans kinases, theautomated portion of the annotation quicklyreached its end. Human inspection and analy-sis were required to refine the initial output setdown to a final set of 408 kinases, 21 kinasefragments and 82 kinase-like domains [Plow-man et al., 1999]. Some redundant entries ap-peared, which were eliminated using pairwiseSmith-Waterman analysis. Also, a few kinasesspanned multiple cosmid clones and by exam-ining the N- or C- termini of adjacent cosmids,it was possible to assemble the complete do-

mains. Notably, the analysis of translatedgenomic data turned up approximately 40 ki-nase domains that were absent in the 19,099protein data set separately available. This dis-parity suggests limitations in using gene pre-diction algorithms as the sole means of identi-fying coding regions. Our experience suggeststhat gene prediction algorithms should be sup-plemented by homology or profile search algo-rithms.

Finally, the putative kinases were clusteredusing hierarchical ‘phylogenetic’ methods. Thisrequired the creation of a multiple sequencealignment of all of the kinase catalytic domainswhich was created by aligning against theHMM. Then it was possible, using the Phylippackage [Felsenstein, 1989], to create parsimo-nious tree structures approximating the rela-tions among the kinases. At this point, theproblem of data analysis became a problem ofdata visualization.

Vizualization

There were two primary types of data thatneeded to be visualized in the course of identi-fying potential protein kinases in C. elegans.As mentioned already, it was helpful to repre-sent the various kinase families as a tree. Oncea tree structure had been generated, we graph-ically viewed that tree in a variety of forms: anunrooted tree, a dendogram, and a less familiarformat called a hyperbolic tree [Bingham andSudarsanam, 1999a]. In addition to tree struc-tures, other types of data needed visualization,specifically the HMM alignments and the posi-tion of putative kinases relative to their openreading frames raw sequences. A combinationgenome browser and sequence viewer servedthese purposes [Bingham and Sudarsanam,unpublished].

Visualizing a tree of over 400 nodes, one foreach putative kinase, posed a variety of prob-lems. It was difficult to search by eye throughall the labels in order to find a particular ki-nase. It was hard to see in a glance whether theclustering met prior expectations, and alsowhere exactly the clusters fell. Also, it wasdifficult to navigate the tree, zooming in onparticular portions without losing track of therelationship between that portion of the treeand the rest of it. To address these needs, cus-tom software was written to project large treeson to hyperbolic space [Bingham and Sudarsa-

184 Bingham et al.

Page 5: Informatics issues in large-scale sequence analysis: Elucidating the protein kinases of C. elegans

nam, 2000]. A projection of C. elegans kinasetree in hyperbolic space is shown in Figure 2.

In addition to the tree viewer, a genomebrowser and sequence viewer was created. Itcan access SuCeDb or the local file system, andprovides three distinct views of sequence data.The first view is a simple tabular view of ac-cession number, locations, HMM scores, and soon. The second view shows pairwise or multiplesequence alignments, with color-coded aminoacids and nucleic acids. Mismatches can behighlighted by color inversion. The third viewrepresents a sequence, its open reading frames,and any feature predictions such as kinases, ashorizontal bars. Base locations span the x-axis,while the y-axis supports features and se-quences of various categories (see Fig. 3). Aswith the tree viewer, data can be color coded forintuitive, at-a-glance classification, and indi-vidual sequences or features can be selectedfrom a list. Also, the three views are closelyintegrated, with selections in one view high-lighting the corresponding data in the otherviews. With both of these software projects,software design posed its own set of challengesfor sequence analysis, automation, and scal-ability.

Software Design

If not planned properly, software develop-ment time can easily exceed data analysistime. To minimize the time spent in writingsoftware, all code was written in the Java lan-

guage, proven to have development and debug-ging time less than half that of C or C11.Secondly, the programs were written usingcomponent-based design for maximal code re-use between applications and for the highestlevel of software maintainability. Individualcomponents can be updated or interchangedwithout affecting the integrity of the completesystem. Also, components can be quicklyadapted to new purposes; for example, thealignment viewer can be used without modifi-cation to view HMM alignments, BLAST align-ments, or multiple sequence alignments. Byusing current software development para-digms, tremendous time and energy has beensaved, as great as any in the remainder of theanalysis pipeline. Finally, it is worth notingthat the performance-critical aspects of the vi-sualization tools all scale linearly with respectto the number of sequences. A standard per-sonal computer with typical RAM and clockspeeds can support all visualization needs forthe C. elegans kinases.

CONCLUSION

Many of the prerequisites for large-scalegenomic sequence analysis are readily available.With the complete Drosophila and human ge-nomes on the way, there can be no doubt aboutthe ‘large scale’—the pure magnitude—of thechallenge that confronts genomics and bioinfor-matics. Fortunately, a range of algorithms andtechnologies already exist for dealing with thisdata, including relational databases, homologyand profile search algorithms, scripting lan-guages, and data visualization tools. While cer-tainly partial and imperfect, they nonethelessprovide a starting point. The challenge of large-scale sequence analysis is to carefully integratethe existing technologies, to build upon themwhere necessary, and to preserve data integrityand quality of analysis in the process.

With the completion of the C. elegans ge-nome, the first fully sequenced multicellularorganism, all of the problems and pitfalls oflarge-scale sequence analysis are becoming ap-parent. From our study, the following guidingprinciples have emerged: write scalable soft-ware, since more hardware may not be thesolution; encapsulate domain-specific knowl-edge in the database design; use existing soft-ware when available, and write reusablecomponent-based software when it is unavail-

Fig. 2. Hyperbolic view of protein kinases of C. elegans.

185Issues in Large Scale Sequence Analysis

Page 6: Informatics issues in large-scale sequence analysis: Elucidating the protein kinases of C. elegans

able; automate analysis to the greatest extentpossible. Finally, since there is no real substi-tute for a human expert in assessing the qual-ity of sequence alignments, all tools should ca-ter to human experts. By following theseprinciples, the analysis pipeline outlined forprotein kinases can scale up as larger completegenomes become available.

REFERENCES

Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z,Miller W, Lipman DJ. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database searchprograms. Nucleic Acids Res 25:3389–3402.

Bingham J, Sudarsanam S. 2000. Visualizing large hierar-chical clusters in hyperbolic space. Bioinformatics. Inpress.

Eddy SR. 1996. Hidden Markov models. Curr Opin StructBiol 6:361–365.

Felsenstein J. 1989. PHYLIP—Phylogeny Inference Pack-age (Version 3.2). Cladistics 5:164–166

Henikoff JG, Henikoff S, Pietrokovski S. 1999. New fea-tures of the Blocks Database servers. Nucl Acids Res27:226–228.

Krogh A, Brown M, Mian IS, Sjolander K, Haussler D.1994. Hidden Markov models in computational biology:applications to protein modeling. J Mol Biol 235:1501–1531.

Plowman GD, Sudarsanam S, Bingham J, Whyte D,Hunter T. 1999. The protein kinases of C. elegans: amodel for signal transduction in multicellular organ-isms. PNAS. 96:13603–13610.

Smith T, Waterman M. 1981. Identification of commonmolecular subsequences. J Mol Biol 147:195–197.

The C. elegans Sequencing Consortium. 1998. Genome se-quence of the nematode C. elegans: a platform for inves-tigating biology. Science 282:2012–2018.

Fig. 3. Genome browser. Table in top left shows kinase exon predictions along with coordinates of these exons ona cosmid. Top right panel shows a color-coded view of the assembled kinase domain. Bottom panel shows thealignment of kinase exons (top row) on the cosmid (middle row). ORFs in six frames are shown above and belowthe cosmid.

186 Bingham et al.