18
Review Paper 83 Yearbook of Medical Informatics 2001 Review What is bioinformatics? An introduction and overview N.M. Luscombe, D. Greenbaum, M. Gerstein Department of Molecular Biophysics and Biochemistry Yale University New Haven, USA Introduction Biological data are flooding in at an unprecedented rate (1). For example as of August 2000, the GenBank repository of nucleic acid sequences contained 8,214,000 entries (2) and the SWISS-PROT database of protein sequences contained 88,166 (3). On average, the amount of information stored in these databases is doubling every 15 months (2). In addition, since the publication of the H. influenzae genome (4), complete sequences for over 40 organisms have been released, ranging from 450 genes to over 100,000. Add to this the data from the myriad of related projects that study gene expression, determine the protein structures encoded by the genes, and detail how these products interact with one another, and we can begin to imagine the enormous quantity and variety of information that is being produced. As a result of this surge in data, many of the challenges in biology have actually become challenges in computing. Such an approach is ideal because of the ease with which computers can handle large quantities of data and probe the complex dynamics observed in nature. Bioinformatics, the subject of the current review, is often defined as the application of computational techniques to understand and organise the information associated with biological macromolecules. This shotgun marriage between the two subjects is largely attributed to the Abstract: A flood of data means that many of the challenges in biology are now challenges in computing. Bioinformatics, the application of computational techniques to analyse the information associated with biomolecules on a large-scale, has now firmly established itself as a discipline in molecular biology, and encompasses a wide range of subject areas from structural biology, genomics to gene expression studies. In this review we provide an introduction and overview of the current state of the field. We discuss the main principles that underpin bioinformatics analyses, look at the types of biological information and databases that are commonly used, and finally examine some of the studies that are being conducted, particularly with reference to transcription regulatory systems. (Molecular) bio informatics: bioinformatics is conceptualising biology in terms of molecules (in the sense of physical chemistry) and applying informatics techniques” (derived from disciplines such as applied maths, computer science and statistics) to understand and organise the information associated with these molecules, on a large scale. In short, bioinformatics is a management information system for molecular biology and has many Bioinformatics - a definition 1 1 As submitted to the Oxford English Dictionary fact that biology itself is an information technology; an organism’s physiology and behaviour are largely dictated by its genes, which at the basic level can be viewed as digital repositories of information. At the same time, there have been major advances in the technologies that supply the raw data; according to Anthony Kerlavage of Celera, an experimental laboratory can easily produce over 100 gigabytes of data a day (5). This incredible processing power has been matched by developments in computer technology; the most important areas of improvements have been in the Downloaded from imia.schattauer.de on 2018-02-23 | IP: 115.113.232.2

What is bioinformatics? An introduction and overview€¦ · Department of Molecular Biophysics and Biochemistry Yale University New Haven, USA Introduction Biological data are flooding

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: What is bioinformatics? An introduction and overview€¦ · Department of Molecular Biophysics and Biochemistry Yale University New Haven, USA Introduction Biological data are flooding

Review Paper

83Yearbook of Medical Informatics 2001

Review

What is bioinformatics? Anintroduction and overview

N.M. Luscombe,D. Greenbaum,M. GersteinDepartment of Molecular Biophysicsand BiochemistryYale UniversityNew Haven, USA

Introduction

Biological data are flooding in atan unprecedented rate (1). Forexample as of August 2000, theGenBank repository of nucleic acidsequences contained 8,214,000entries (2) and the SWISS-PROTdatabase of protein sequencescontained 88,166 (3). On average,the amount of information stored inthese databases is doubling every15 months (2). In addition, since thepublication of the H. influenzaegenome (4), complete sequencesfor over 40 organisms have beenreleased, ranging from 450 genesto over 100,000. Add to this thedata from the myriad of relatedprojects that study gene expression,determine the protein structuresencoded by the genes, and detailhow these products interact withone another, and we can begin toimagine the enormous quantity andvariety of information that is beingproduced.

As a result of this surge in data,many of the challenges in biologyhave actually become challenges incomputing. Such an approach is idealbecause of the ease with whichcomputers can handle large quantitiesof data and probe the complexdynamics observed in nature.Bioinformatics, the subject of thecurrent review, is often defined asthe application of computationaltechniques to understand and organisethe information associated withbiological macromolecules. Thisshotgun marriage between the twosubjects is largely attributed to the

Abstract: A flood of data means that many of the challenges in biology are now challengesin computing. Bioinformatics, the application of computational techniques to analyse theinformation associated with biomolecules on a large-scale, has now firmly establisheditself as a discipline in molecular biology, and encompasses a wide range of subject areasfrom structural biology, genomics to gene expression studies.

In this review we provide an introduction and overview of the current state of the field.We discuss the main principles that underpin bioinformatics analyses, look at the typesof biological information and databases that are commonly used, and finally examinesome of the studies that are being conducted, particularly with reference to transcriptionregulatory systems.

(Molecular) bio – informatics: bioinformatics is conceptualising biology interms of molecules (in the sense of physical chemistry) and applying”informatics techniques” (derived from disciplines such as applied maths,computer science and statistics) to understand and organise the informationassociated with these molecules, on a large scale. In short, bioinformaticsis a management information system for molecular biology and has many

Bioinformatics - a definition1

1 As submitted to the Oxford English Dictionary

fact that biology itself is an informationtechnology; an organism’s physiologyand behaviour are largely dictated byits genes, which at the basic level canbe viewed as digital repositories ofinformation. At the same time, therehave been major advances in thetechnologies that supply the raw data;according to Anthony Kerlavage ofCelera, an experimental laboratory caneasily produce over 100 gigabytes ofdata a day (5). This incredibleprocessing power has been matchedby developments in computertechnology; the most important areasof improvements have been in the

For personal or educational use only. No other uses without permission. All rights reserved.Downloaded from imia.schattauer.de on 2018-02-23 | IP: 115.113.232.2

Page 2: What is bioinformatics? An introduction and overview€¦ · Department of Molecular Biophysics and Biochemistry Yale University New Haven, USA Introduction Biological data are flooding

84

Review Paper

Yearbook of Medical Informatics 2001

CPU, disk storage and Internet,allowing faster computations, betterdata storage and revolutionalised themethods for accessing and exchangingof data.

Aims of bioinformaticsThe aims of bioinformatics are three-

fold. First, at its simplest bioinformaticsorganises data in a way that allowsresearchers to access existing infor-mation and to submit new entries asthey are produced, eg the Protein DataBank for 3D macromolecular struc-tures (6, 7). While data-curation is anessential task, the information storedin these databases is essentially use-less until analysed. Thus the purposeof bioinformatics extends far beyondmere volume control. The second aimis to develop tools and resources thataid in the analysis of data. For example,having sequenced a particular protein, itis of interest to compare it with previouslycharacterised sequences. This requiresmore than just a straightforward databasesearch. As such, programs such asFASTA (8) and PSI-BLAST (9) mustconsider what constitutes a biologicallysignificant resemblance. Developmentof such resources requires extensiveknowledge of computational theory, aswell as a thorough understanding ofbiology. The third aim is to use these toolsto analyse the data and interpret theresults in a biologically meaningfulmanner. Traditionally, biological studiesexamined individual systems in detail,and frequently compared them with afew that are related. In bioinformatics,we can also conduct global analyses ofall the available data with the aim ofuncovering common principles that applyacross many systems and highlightfeatures that are unique to some.

In this review, we provide an intro-duction to bioinformatics. We focus onthe first and third aims just described,with particular reference to the key-words underlined in the definition: infor-mation,informatics, organisation,

understanding, large-scale andpractical applications. Specifically, wediscuss the range of data that arecurrently being examined, the databasesinto which they are organised, the typesof analyses that are being conductedusing transcription regulatory systemsas an example, and finally discuss someof the major practical applications ofbioinformatics.

“…the INFORMATIONassociated with thesemolecules…”

Table 1 lists the types of data that areanalysed in bioinformatics and the rangeof topics that we consider to fall withinthe field. Here we take a broad view andinclude subjects that may not normally

be listed. We also give approximatevalues describing the sizes of data beingdiscussed.

We start with an overview of thesources of information: these maybe divided into raw DNA sequences,protein sequences, macromolecularstructures, genome sequences, andother whole genome data. Raw DNAsequences are strings of the four base-letters comprising genes, each typically1,000 bases long. The GenBankrepository of nucleic acid sequencescurrently holds a total of 9.5 billionbases in 8.2 million entries (all databasefigures as of August 2000). At the nextlevel are protein sequences comprisingstrings of 20 amino acid-letters. Atpresent there are about 300,000 knownprotein sequences, with a typical

Data source Data size Bioinformatics topicsRaw DNA sequence

Protein sequence

Macromolecularstructure

Genomes

Gene expression

8.2 million sequences(9.5 billion bases)

300,000 sequences(~300 amino acidseach)

13,000 structures(~1,000 atomiccoordinates each)

40 complete genomes(1.6 million –3 billion bases each)

largest: ~20 timepoint measurementsfor ~6,000 genes

Separating coding and non-coding regionsIdentification of introns and exonsGene product predictionForensic analysis

Sequence comparison algorithmsMultiple sequence alignments algorithmsIdentification of conserved sequence motifs

Secondary, tertiary structure prediction3D structural alignment algorithmsProtein geometry measurementsSurface and volume shape calculationsIntermolecular interactions

Molecular simulations(force-field calculations,molecular movements,docking predictions)

Characterisation of repeatsStructural assignments to genesPhylogenetic analysisGenomic-scale censuses(characterisation of protein content, metabolic pathways)Linkage analysis relating specific genes to diseases

Correlating expression patternsMapping expression data to sequence, structural andbiochemical data

Other data

Literature

Metabolic pathways

11 million citations Digital libraries for automated bibliographical searchesKnowledge databases of data from literature

Pathway simulations

Table 1. Sources of data used in bioinformatics, the quantity of each type of data that is currently(August 2000) available, and bioinformatics subject areas that utilise this data.

For personal or educational use only. No other uses without permission. All rights reserved.Downloaded from imia.schattauer.de on 2018-02-23 | IP: 115.113.232.2

Page 3: What is bioinformatics? An introduction and overview€¦ · Department of Molecular Biophysics and Biochemistry Yale University New Haven, USA Introduction Biological data are flooding

Review Paper

85Yearbook of Medical Informatics 2001

bacterial protein containing approxi-mately 300 amino acids. Macromo-lecular structural data represents amore complex form of information.There are currently 13,000 entries inthe Protein Data Bank, PDB, mostof which are protein structures. Atypical PDB file for a medium-sizedprotein contains the xyz coordinatesof approximately 2,000 atoms.

Scientific euphoria has recentlycentred on whole genome sequencing.As with the raw DNA sequences,genomes consist of strings of base-letters, ranging from 1.6 million basesin Haemophilus influenzae to 3 billionin humans. An important aspect ofcomplete genomes is the distinctionbetween coding regions and non-coding regions –‘junk’ repetitivesequences making up the bulk of basesequences especially in eukaryotes.We can now measure expression levelsof almost every gene in a given cell ona whole-genome level although publicavailability of such data is still limited.Expression level measurements aremade under different environmentalconditions, different stages of the cellcycle and different cell types in multi-cellular organisms. Currently the largestdataset for yeast has made approxi-mately 20 time-point measurementsfor 6,000 genes (10). Other genomic-scale data include biochemicalinformation on metabolic pathways,regulatory networks, protein-proteininteraction data from two-hybridexperiments, and systematic knockoutsof individual genes to test the viabilityof an organism.

What is apparent from this list is thediversity in the size and complexity ofdifferent datasets. There are invariablymore sequence-based data thanstructural data because of the relativeease with which they can be produced.This is partly related to the greatercomplexity and information-content ofindividual structures compared to

individual sequences. While morebiological information can be derivedfrom a single structure than a proteinsequence, the problem is overcome inthe latter by analysing larger quantitiesof data.

“… ORGANISE the informa-tion on a LARGE SCALE …”

Redundancy and multiplicity of dataA concept that underpins most

research methods in bioinformatics isthat much of this data can be groupedtogether based on biologically meaning-ful similarities. For example, sequencesegments are often repeated atdifferent positions of genomic DNA(11). Genes can be clustered into thosewith particular functions (eg enzymaticactions) or according to the metabolicpathway to which they belong (12),although here, single genes may actuallypossess several functions (13). Goingfurther, distinct proteins frequentlyhave comparable sequences – orga-nisms often have multiple copies of aparticular gene through duplication anddifferent species have equivalent orsimilar proteins that were inheritedwhen they diverged from each other inevolution. At a structural level, wepredict there to be a finite number ofdifferent tertiary structures – estimatesrange between 1,000 and 10,000 folds(14, 15) – and proteins adopt equivalentstructures even when they differgreatly in sequence (16). As a result,although the number of structures inthe PDB has increased exponentially,the rate of discovery of novel folds hasactually decreased.

There are common terms to describethe relationship between pairs ofproteins or the genes from which theyare derived: analogous proteins haverelated folds, but unrelated sequences,while homologous proteins are bothsequentially and structurally similar.The two categories can sometimes be

difficult to distinguish especially if therelationship between the two proteinsis remote (17, 18). Among homologues,it is useful to distinguish betweenorthologues, proteins in differentspecies that have evolved from acommon ancestral gene, andparalogues, proteins that are related bygene duplication within a genome (19).Normally, orthologues retain the samefunction while paralogues evolvedistinct, but related functions (20).

An important concept that arisesfrom these observations is that of afinite “parts list” for different organisms(21, 22): an inventory of proteinscontained within an organism, arrangedaccording to different properties suchas gene sequence, protein fold orfunction. Taking protein folds as anexample, we mentioned that with afew exceptions, the tertiary structuresof proteins adopt one of a limitedrepertoire of folds. As the number ofdifferent fold families is considerablysmaller than the number of genefamilies, categorising the proteins byfold provides a substantial simplificationof the contents of a genome. Similarsimplifications can be provided by otherattributes such as protein function. Assuch, we expect this notion of a finiteparts list to become increasinglycommon in the future genomic analyses.

Clearly, an essential aspect of mana-ging this large volume of data lies indeveloping methods for assessingsimilarities between different biomole-cules and identifying those that arerelated. Below, we discuss the majordatabases that provide access to theprimary sources of information, andalso introduce some secondary data-bases that systematically group thedata (Table 2). These classificationsease comparisons between genomesand their products, allowing the identi-fication of common themes betweenthose that are related and highlightingfeatures that are unique to some.

For personal or educational use only. No other uses without permission. All rights reserved.Downloaded from imia.schattauer.de on 2018-02-23 | IP: 115.113.232.2

Page 4: What is bioinformatics? An introduction and overview€¦ · Department of Molecular Biophysics and Biochemistry Yale University New Haven, USA Introduction Biological data are flooding

86

Review Paper

Yearbook of Medical Informatics 2001

Protein sequence databasesProtein sequence databases are

categorised as primary, composite orsecondary. Primary databases containover 300,000 protein sequences andfunction as a repository for the rawdata. Some more common repositories,such as SWISS-PROT (3) and PIR-International (23), annotate thesequences as well as describe theproteins’ functions, its domain structureand post-translational modifications.Composite databases such as OWL(24) and the NRDB (25) compile andfilter sequence data from differentprimary databases to producecombined non-redundant sets that aremore complete than the individual

databases and also include proteinsequence data from the translatedcoding regions in DNA sequencedatabases (see below). Secondarydatabases contain information derivedfrom protein sequences and help theuser determine whether a newsequence belongs to a known proteinfamily. One of the most popular isPROSITE (26), a database of shortsequence patterns and profiles thatcharacterise biologically significantsites in proteins. PRINTS (27) expandson this concept and provides acompendium of protein fingerprints –groups of conserved motifs thatcharacterise a protein family. Motifsare usually separated along a protein

sequence, but may be contiguous in3D-space when the protein is folded.By using multiple motifs, fingerprintscan encode protein folds andfunctionalities more flexibly thanPROSITE. Finally, Pfam (28) containsa large collection of multiple sequencealignments and profile Hidden MarkovModels covering many common proteindomains. Pfam-A comprises accuratemanually compiled alignments whilePfam-B is an automated clustering ofthe whole SWISS-PROT database.These different secondary databaseshave recently been incorporated into asingle resource named InterPro (29).

Structural databasesNext we look at databases of macro-

molecular structures. The Protein DataBank, PDB (6, 7), provides a primaryarchive of all 3D structures formacromolecules such as proteins,RNA, DNA and various complexes.Most of the ~13,000 structures (August2000) are solved by x-ray crystallo-graphy and NMR, but some theoreticalmodels are also included. As the infor-mation provided in individual PDBentries can be difficult to extract,PDBsum (30) provides a separate Webpage for every structure in the PDBdisplaying detailed structural analyses,schematic diagrams and data on inter-actions between different molecules ina given entry. Three major databasesclassify proteins by structure in orderto identify structural and evolutionaryrelationships: CATH (31), SCOP (32),and FSSP databases (33). All comprisehierarchical structural taxonomy wheregroups of proteins increase in similarityat lower levels of the classificationtree. In addition, numerous databasesfocus on particular types of macromo-lecules. These include the NucleicAcids Database, NDB (34), forstructures related to nucleic acids, theHIV protease database (35) for HIV-1, HIV-2 and SIV protease structuresand their complexes, and ReLiBase(36) for receptor-ligand complexes.

Database URL

Protein sequence(primary)SWISS-PROTPIR-International

Protein sequence (composite)OWLNRDB

Protein sequence (secondary)PROSITEPRINTSPfam

MacromolecularstructuresProtein Data Bank (PDB)Nucleic Acids Database (NDB)HIV Protease DatabaseReLiBasePDBsumCATHSCOPFSSP

Nucleotide sequencesGenBankEMBLDDBJ

Genome sequencesEntrez genomesGeneCensusCOGs

Integrated databasesInterProSequence retrieval system (SRS)Entrez

www.expasy.ch/sprot/sprot-top.htmlwww.mips.biochem.mpg.de/proj/protseqdb

www.bioinf.man.ac.uk/dbbrowser/OWLwww.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein

www.expasy.ch/prositewww.bioinf.man.ac.uk/dbbrowser/PRINTS/PRINTS.htmlwww.sanger.ac.uk/Pfam/

www.rcsb.org/pdbndbserver.rutgers.edu/www.ncifcrf.gov/CRYS/HIVdb/NEW_DATABASEwww2.ebi.ac.uk:8081/home.htmlwww.biochem.ucl.ac.uk/bsm/pdbsumwww.biochem.ucl.ac.uk/bsm/cathscop.mrc-lmb.cam.ac.uk/scopwww2.embl-ebi.ac.uk/dali/fssp

www.ncbi.nlm.nih.gov/Genbankwww.ebi.ac.uk/emblwww.ddbj.nig.ac.jp

www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Genomebioinfo.mbb.yale.edu/genomewww.ncbi.nlm.nih.gov/COG

www.ebi.ac.uk/interprowww.expasy.ch/srs5www.ncbi.nlm.nih.gov/Entrez

Table 2. List of URLs for the databases that are cited in the review.

For personal or educational use only. No other uses without permission. All rights reserved.Downloaded from imia.schattauer.de on 2018-02-23 | IP: 115.113.232.2

Page 5: What is bioinformatics? An introduction and overview€¦ · Department of Molecular Biophysics and Biochemistry Yale University New Haven, USA Introduction Biological data are flooding

Review Paper

87Yearbook of Medical Informatics 2001

Nucleotide and Genomesequences

As described previously, the biggestexcitement currently lies with theavailability of complete genomesequences for different organisms. TheGenBank (2), EMBL (37) and DDBJ(38) databases contain DNA sequen-ces for individual genes that encodeprotein and RNA products. Much likethe composite protein sequencedatabase, the Entrez nucleotidedatabase (39) compiles sequence datafrom these primary databases.

As whole-genome sequencing isoften conducted through internationalcollaborations, individual genomes arepublished at different sites. The Entrezgenome database (40) brings togetherall complete and partial genomes in asingle location and currently representsover 1,000 organisms (August 2000).In addition to providing the rawnucleotide sequence, information ispresented at several levels of detailincluding: a list of completed genomes,all chromosomes in an organism,detailed views of single chromosomesmarking coding and non-coding regions,and single genes. At each level thereare graphical presentations, pre-computed analyses and links to othersections of Entrez. For example,annotations for single genes includethe translated protein sequence,sequence alignments with similar genesin other genomes and summaries ofthe experimentally characterised orpredicted function. GeneCensus (41)also provides an entry point for genomeanalysis with an interactive whole-genome comparison from anevolutionary perspective. The databaseallows building of phylogenetic treesbased on different criteria such asribosomal RNA or protein fold occur-rence. The site also enables multiplegenome comparisons, analysis of singlegenomes and retrieval of informationfor individual genes. The COGs data-base (20) classifies proteins encoded

in 21 completed genomes on the basisof sequence similarity. Members ofthe same Cluster of Orthologous Group,COG, are expected to have the same3D domain architecture and often, simi-lar functions. The most straightforwardapplication of the database is to predictthe function of uncharacterised proteinsthrough their homology to characterisedproteins, and also to identifyphylogenetic patterns of proteinoccurrence – for example, whether agiven COG is represented across mostor all organisms or in just a few closelyrelated species.

Gene expression dataThe most recent sources of genomic-

scale data have been from expressionexperiments, which quantify theexpression levels of individual genes.These experiments measure theamount of mRNA or protein productsthat are produced by the cell. For theformer, there are three maintechnologies: the cDNA microarray(42-44), Affymatrix GeneChip (45) andSAGE methods (46). The first methodmeasures relative levels of mRNAabundance between different samples,while the last two measure absolutelevels. Most of the effort in geneexpression analysis has concentratedon the yeast and human genomes andas yet, there is no central repository forthis data. For yeast, the Young (10),Church (47) and Samson datasets (48)use the GeneChip method, while theStanford cell cycle (49), diauxic shift(50) and deletion mutant datasets (51)use the microarray. Most measuremRNA levels throughout the wholeyeast cell cycle, although some focuson a particular stage in the cycle. Forhumans, the main application has beento understand expression in tumourand cancer cells. The MolecularPortraits of Breast Tumours (52),Lymphoma and Leukaemia MolecularProfiling (53) projects provide datafrom microarray experiments onhuman cancer cells.

The technologies for measuringprotein abundance are currently limitedto 2D gel electrophoresis followed bymass spectrometry (54). As gels canonly routinely resolve about 1,000proteins (55), only the most abundantcan be visualised. At present, datafrom these experiments are onlyavailable from the literature (56, 57).

Data integrationThe most profitable research in

bioinformatics often results fromintegrating multiple sources of data(58). For instance, the 3D coordinatesof a protein are more useful if combinedwith data about the protein’s function,occurrence in different genomes, andinteractions with other molecules. Inthis way, individual pieces of infor-mation are put in context with respectto other data. Unfortunately, it is notalways straightforward to access andcross-reference these sources of infor-mation because of differences innomenclature and file formats.

At a basic level, this problem isfrequently addressed by providingexternal links to other databases, forexample in PDBsum, web-pages forindividual structures direct the usertowards corresponding entries in thePDB, NDB, CATH, SCOP andSWISS-PROT. At a more advancedlevel, there have been efforts tointegrate access across several datasources. One is the Sequence RetrievalSystem, SRS (59), which allows anyflat-file databases to be indexed toeach other; this allows the user toretrieve, link and access entries fromnucleic acid, protein sequence, proteinmotif, protein structure and bibliogra-phic databases. Another is the Entrezfacility (39), which provides similargateways to DNA and protein sequen-ces, genome mapping data, 3D macro-molecular structures and the PubMedbibliographic database (60). A searchfor a particular gene in either databasewill allow smooth transitions to the

For personal or educational use only. No other uses without permission. All rights reserved.Downloaded from imia.schattauer.de on 2018-02-23 | IP: 115.113.232.2

Page 6: What is bioinformatics? An introduction and overview€¦ · Department of Molecular Biophysics and Biochemistry Yale University New Haven, USA Introduction Biological data are flooding

88

Review Paper

Yearbook of Medical Informatics 2001

genome it comes from, the proteinsequence it encodes, its structure,bibliographic reference and equivalententries for all related genes.

“…UNDERSTAND andorganise the information…”

Having examined the data, we candiscuss the types of analyses that areconducted. As shown in Table 1, thebroad subject areas in bioinformaticscan be separated according to the sourcesof information that are used in the studies.For raw DNA sequences, investigationsinvolve separating coding and non-codingregions, and identification of introns,exons and promoter regions for annotatinggenomic DNA (61) (62). For proteinsequences, analyses include developingalgorithms for sequence comparisons(63), methods for producing multiplesequence alignments (64), and searchingfor functional domains from conservedsequence motifs in such alignments.Investigations of structural data includeprediction of secondary and tertiary pro-tein structures, producing methods for3D structural alignments (65, 66), exa-mining protein geometries using distanceand angular measurements, calculationsof surface and volume shapes and ana-lysis of protein interactions with othersubunits, DNA, RNA and smaller mole-cules. These studies have lead to molecu-lar simulation topics in which structuraldata are used to calculate the energeticsinvolved in stabilising macromolecularstructures, simulating movements withinmacromolecules, and computing theenergies involved in molecular docking.The increasing availability of annotatedgenomic sequences has resulted in theintroduction of computational genomicsand proteomics – large-scale analysesof complete genomes and the proteinsthat they encode. Research includescharacterisation of protein content andmetabolic pathways between differentgenomes, identification of interactingproteins, assignment and prediction of

gene products, and large-scale analysesof gene expression levels. Some of theseresearch topics will be demonstrated inour example analysis of transcriptionregulatory systems.

Other subject areas we have includedin Table 1 are development of digitallibraries for automated bibliographicalsearches, knowledge bases of biologicalinformation from the literature, DNAanalysis methods in forensics, predictionof nucleic acid structures, metabolicpathway simulations, and linkage analysis– linking specific genes to differentdisease traits.

In addition to finding relationshipsbetween different proteins, much ofbioinformatics involves the analysis ofone type of data to infer and understandthe observations for another type ofdata. An example is the use of sequenceand structural data to predict thesecondary and tertiary structures of newprotein sequences (67). These methods,especially the former, are often based onstatistical rules derived from structures,such as the propensity for certain aminoacid sequences to produce differentsecondary structural elements. Anotherexample is the use of structural data tounderstand a protein’s function; herestudies have investigated the relationshipdifferent protein folds and their functions(68, 69) and analysed similarities betweendifferent binding sites in the absence ofhomology (70). Combined with similaritymeasurements, these studies provide uswith an understanding of how muchbiological information can be accuratelytransferred between homologousproteins (71).

The bioinformatics spectrumFigure 1 summarises the main points

we raised in our discussions oforganising and understandingbiological data – the development ofbioinformatics techniques has allowedan expansion of biological analysis intwo dimension, depth and breadth. The

first is represented by the vertical axis inthe figure and outlines a possible approachto the rational drug design process. Theaim is to take a single protein and followthrough an analysis that maximises ourunderstanding of the protein it encodes.Starting with a gene sequence, we candetermine the protein sequence withstrong certainty. From there, predictionalgorithms can be used to calculate thestructure adopted by the protein.Geometry calculations can define theshape of the protein’s surface andmolecular simulations can determine theforce fields surrounding the molecule.Finally, using docking algorithms, onecould identify or design ligands that maybind the protein, paving the way fordesigning a drug that specifically altersthe protein’s function. In practise, theintermediate steps are still difficult toachieve accurately, and they are bestcombined with experimental methods toobtain some of the data, for examplecharacterising the structure of the proteinof interest.

The aims of the second dimension, thebreadth in biological analysis, is tocompare a gene with others. Initially,simple algorithms can be used to com-pare the sequences and structures of apair of related proteins. With a largernumber of proteins, improved algorithmscan be used to produce multiple align-ments, and extract sequence patterns orstructural templates that define a familyof proteins. Using this data, it is alsopossible to construct phylogenetic treesto trace the evolutionary path of proteins.Finally, with even more data, the infor-mation must be stored in large-scaledatabases. Comparisons become morecomplex, requiring multiple scoringschemes, and we are able to conductgenomic scale censuses that providecomprehensive statistical accounts ofprotein features, such as the abundanceof particular structures or functions indifferent genomes. It also allows us tobuild phylogenetic trees that trace theevolution of whole organisms.

For personal or educational use only. No other uses without permission. All rights reserved.Downloaded from imia.schattauer.de on 2018-02-23 | IP: 115.113.232.2

Page 7: What is bioinformatics? An introduction and overview€¦ · Department of Molecular Biophysics and Biochemistry Yale University New Haven, USA Introduction Biological data are flooding

Review Paper

89Yearbook of Medical Informatics 2001

Fig. 1. Paradigm shifts during the past couple of decades have taken much of biology away from the laboratory bench and have allowed theintegration of other scientific disciplines, specifically computing. The result is an expansion of biological research in breadth and depth. Thevertical axis demonstrates how bioinformatics can aid rational drug design with minimal work in the wet lab. Starting with a single gene sequence,we can determine with strong certainty, the protein sequence. From there, we can determine the structure using structure prediction techniques.With geometry calculations, we can further resolve the protein’s surface and through molecular simulation determine the force fields surroundingthe molecule. Finally docking algorithms can provide predictions of the ligands that will bind on the protein surface, thus paving the way forthe design of a drug specific to that molecule. The horizontal axis shows how the influx of biological data and advances in computer technologyhave broadened the scope of biology. Initially with a pair of proteins, we can make comparisons between the between sequences and structuresof evolutionary related proteins. With more data, algorithms for multiple alignments of several proteins become necessary. Using multiplesequences, we can also create phylogenetic trees to trace the evolutionary development of the proteins in question. Finally, with the delugeof data we currently face, we need to construct large databases to store, view and deconstruct the information. Alignments now become morecomplex, requiring sophisticated scoring schemes and there is enough data to compile a genome census – a genomic equivalent of a populationcensus – providing comprehensive statistical accounting of protein features in genomes.

For personal or educational use only. No other uses without permission. All rights reserved.Downloaded from imia.schattauer.de on 2018-02-23 | IP: 115.113.232.2

Page 8: What is bioinformatics? An introduction and overview€¦ · Department of Molecular Biophysics and Biochemistry Yale University New Haven, USA Introduction Biological data are flooding

90

Review Paper

Yearbook of Medical Informatics 2001

“… applying INFORMATICSTECHNIQUES…”

The distinct subject areas wemention require different types ofinformatics techniques. Briefly, for dataorganisation, the first biologicaldatabases were simple flat files.However with the increasing amountof information, relational databasemethods with Web-page interfaceshave become increasingly popular. Insequence analysis, techniques includestring comparison methods such astext search and 1D alignmentalgorithms. Motif and patternidentification for multiple sequencesdepend on machine learning, clusteringand data-mining techniques. 3Dstructural analysis techniques includeEuclidean geometry calculationscombined with basic application ofphysical chemistry, graphical repre-sentations of surfaces and volumes,and structural comparison and 3Dmatching methods. For molecularsimulations, Newtonian mechanics,quantum mechanics, molecularmechanics and electrostatic calcula-tions are applied. In many of theseareas, the computational methods mustbe combined with good statisticalanalyses in order to provide an objectivemeasure for the significance of theresults.

Transcription regulation – a casestudy in bioinformatics

DNA-binding proteins have a centralrole in all aspects of genetic activitywithin an organism, participating inprocesses such as transcription, packa-ging, rearrangement, replication andrepair. In this section, we focus on thestudies that have contributed to ourunderstanding of transcription regula-tion in different organisms. Throughthis example, we demonstrate howbioinformatics has been used to increaseour knowledge of biological systemsand also illustrate the practicalapplications of the different subject

areas that were briefly outlined earlier.We start by considering structuralanalyses of how DNA-binding proteinsrecognise particular base sequences.Later, we review several genomicstudies that have characterised thenature of transcription factors indifferent organisms, and the methodsthat have been used to identify regula-tory binding sites in the upstreamregions. Finally, we provide an overviewof gene expression analyses that havebeen recently conducted and suggestfuture uses of transcription regulatoryanalyses to rationalise the observationsmade in gene expression experiments.All the results that we describe havebeen found through computationalstudies.

Structural studiesAs of August 2000, there were 379

structures of protein-DNA complexesin the PDB. Analyses of thesestructures have provided valuableinsight into the stereochemicalprinciples of binding, including howparticular base sequences arerecognized and how the DNA structureis quite often modified on binding.

A structural taxonomy of DNA-binding proteins, similar to thatpresented in SCOP and CATH, wasfirst proposed by Harrison (72) andperiodically updated to accommodatenew structures as they are solved (73).The classification consists of a two-tier system: the first level collectsproteins into eight groups that sharegross structural features for DNA-binding, and the second comprises 54families of proteins that are structurallyhomologous to each other. Assemblyof such a system simplifies thecomparison of different bindingmethods; it highlights the diversity ofprotein-DNA complex geometriesfound in nature, but also underlines theimportance of interactions between a-helices and the DNA major groove,the main mode of binding in over half

the protein families. While the numberof structures represented in the PDBdoes not necessarily reflect the relativeimportance of the different proteins inthe cell, it is clear that helix-turn-helix,zinc-coordinating and leucine zippermotifs are used repeatedly. Thisprovides compact frameworks thatpresent the a-helix on the surfaces ofstructurally diverse proteins. At a grosslevel, it is possible to highlight thedifferences between transcriptionfactor domains that “just” bind DNAand those involved in catalysis (74).Although there are exceptions, theformer typically approach the DNAfrom a single face and slot into thegrooves to interact with base edges.The latter commonly envelope thesubstrate, using complex networks ofsecondary structures and loops.

Focusing on proteins with a-helices,the structures show many variations,both in amino acid sequences anddetailed geometry. They have clearlyevolved independently in accordancewith the requirements of the context inwhich they are found. While achievinga close fit between the a-helix andmajor groove, there is enough flexibilityto allow both the protein and DNA toadopt distinct conformations. However,several studies that analysed the bindinggeometries of a-helices demonstratedthat most adopt fairly uniformconformations regardless of proteinfamily. They are commonly inserted inthe major groove sideways, with theirlengthwise axis roughly parallel to theslope outlined by the DNA backbone.Most start with the N-terminus in thegroove and extend out, completing twoto three turns within contacting distanceof the nucleic acid (75, 76).

Given the similar binding orientations,it is surprising to find that the interactionsbetween each amino acid position alongthe a-helices and nucleotides on theDNA vary considerably betweendifferent protein families. However,

For personal or educational use only. No other uses without permission. All rights reserved.Downloaded from imia.schattauer.de on 2018-02-23 | IP: 115.113.232.2

Page 9: What is bioinformatics? An introduction and overview€¦ · Department of Molecular Biophysics and Biochemistry Yale University New Haven, USA Introduction Biological data are flooding

Review Paper

91Yearbook of Medical Informatics 2001

by classifying the amino acids accordingto the sizes of their side chains, we areable to rationalise the differentinteractions patterns. The rules ofinteractions are based on the simplepremise that for a given residue positionon a-helices in similar conformations,small amino acids interact withnucleotides that are close in distanceand large amino acids with those thatare further (76, 77). Equivalent studiesfor binding by other structural motifs,like b-hairpins, have also beenconducted (78). When consideringthese interactions, it is important toremember that different regions of theprotein surface also provide interfaceswith the DNA.

This brings us to look at the atomiclevel interactions between individualamino acid-base pairs. Such analysesare based on the premise that asignificant proportion of specific DNA-binding could be rationalised by auniversal code of recognition betweenamino acids and bases, ie whethercertain protein residues preferablyinteract with particular nucleotidesregardless of the type of protein-DNAcomplex (79). Studies have consideredhydrogen bonds, van der Waals contactsand water-mediated bonds (80-82).Results showed that about 2/3 of allinteractions are with the DNAbackbone and that their main role isone of sequence-independentstabilisation. In contrast, interactionswith bases display some strongpreferences, including the interactionsof arginine or lysine with guanine,asparagine or glutamine with adenineand threonine with thymine. Suchpreferences were explained throughexamination of the stereochemistry ofthe amino acid side chains and baseedges. Also highlighted were morecomplex types of interactions wheresingle amino acids contact more thanone base-step simultaneously, thusrecognising a short DNA sequence.These results suggested that universal

specificity, one that is observed acrossall protein-DNA complexes, indeedexists. However, many interactionsthat are normally considered to be non-specific, such as those with the DNAbackbone, can also provide specificitydepending on the context in which theyare made.

Armed with an understanding ofprotein structure, DNA-binding motifsand side chain stereochemistry, a majorapplication has been the prediction ofbinding either by proteins known tocontain a particular motif, or those withstructures solved in the uncomplexedform. Most common are predictionsfor a-helix-major groove interactions –given the amino acid sequence, whatDNA sequence would it recognise(77, 83). In a different approach,molecular simulation techniques havebeen used to dock whole proteins andDNAs on the basis of force-fieldcalculations around the two molecules(84, 85).

The reason that both methods haveonly been met with limited success isbecause even for apparently simplecases like a-helix-binding, there aremany other factors that must beconsidered. Comparisons betweenbound and unbound nucleic acidstructures show that DNA-bending isa common feature of complexes formedwith transcription factors (74, 86). Thisand other factors such as electrostaticand cation-mediated interactions assistindirect recognition of the nucleotidesequence, although they are not wellunderstood yet. Therefore, it is nowclear that detailed rules for specificDNA-binding will be family specific,but with underlying trends such as thearginine-guanine interactions.

Genomic studiesDue to the wealth of biochemical

data that are available, genomic studiesin bioinformatics have concentratedon model organisms, and the analysis

of regulatory systems has been noexception. Identification of transcriptionfactors in genomes invariably dependson similarity search strategies, whichassume a functional and evolutionaryrelationship between homologousproteins. In E. coli, studies have so farestimated a total of 300 to 500transcription regulators (87) andPEDANT (88), a database ofautomatically assigned gene functions,shows that typically 2-3% ofprokaryotic and 6-7% of eukaryoticgenomes comprise DNA-bindingproteins. As assignments were onlycomplete for 40-60% of genomes as ofAugust 2000, these figures most likelyunderestimate the actual number.Nonetheless, they already represent alarge quantity of proteins and it is clearthat there are more transcriptionregulators in eukaryotes than otherspecies. This is unsurprising,considering the organisms havedeveloped a relatively sophisticatedtranscription mechanism.

From the conclusions of the structuralstudies, the best strategy forcharacterising DNA-binding of theputative transcription factors in eachgenome is to group them by homologyand analyse the individual families.Such classifications are provided in thesecondary sequence databasesdescribed earlier and also those thatspecialise in regulatory proteins suchas RegulonDB (89) and TRANSFAC(90). Of even greater use is the provisionof structural assignments to theproteins; given a transcription factor, itis helpful to know the structural motifthat it uses for binding, thereforeproviding us with a better understandingof how it recognises the targetsequence. Structural genomics throughbioinformatics assigns structures to theprotein products of genomes bydemonstrating similarity to proteins ofknown structure (91). These studieshave shown that prokaryotictranscription factors most frequently

For personal or educational use only. No other uses without permission. All rights reserved.Downloaded from imia.schattauer.de on 2018-02-23 | IP: 115.113.232.2

Page 10: What is bioinformatics? An introduction and overview€¦ · Department of Molecular Biophysics and Biochemistry Yale University New Haven, USA Introduction Biological data are flooding

92

Review Paper

Yearbook of Medical Informatics 2001

contain helix-turn-helix motifs (87, 92)and eukaryotic factors containhomeodomain type helix-turn-helix, zincfinger or leucine zipper motifs. Fromthe protein classifications in eachgenome, it is clear that different typesof regulatory proteins differ inabundance and families significantlydiffer in size. A study by Huynen andvan Nimwegen (93) has shown thatmembers of a single family have similarfunctions, but as the requirements ofthis function vary over time, so doesthe presence of each gene family in thegenome.

Most recently, using a combinationof sequence and structural data, weexamined the conservation of aminoacid sequences between related DNA-binding proteins, and the effect thatmutations have on DNA sequencerecognition. The structural familiesdescribed above were expanded toinclude proteins that are related bysequence similarity, but whosestructures remain unsolved. Again,members of the same family arehomologous, and probably derive froma common ancestor.

Amino acid conservations werecalculated for the multiple sequencealignments of each family (94).Generally, alignment positions thatinteract with the DNA are betterconserved than the rest of the proteinsurface, although the detailed patternsof conservation are quite complex.Residues that contact the DNAbackbone are highly conserved in allprotein families, providing a set ofstabilising interactions that are commonto all homologous proteins. Theconservation of alignment positions thatcontact bases, and recognise the DNAsequence, are more complex and couldbe rationalised by defining a 3-classmodel for DNA-binding. First, proteinfamilies that bind non-specificallyusually contain several conserved base-contacting residues; without exception,

interactions are made in the minorgroove where there is littlediscrimination between base types. Thecontacts are commonly used to stabilisedeformations in the nucleic acidstructure, particularly in widening theDNA minor groove. The second classcomprise families whose members alltarget the same nucleotide sequence;here, base-contacting positions areabsolutely or highly conserved allowingrelated proteins to target the samesequence.

The third, and most interesting, classcomprises families in which bindingis also specific but different membersbind distinct base sequences. Hereprotein residues undergo frequentmutations, and family members canbe divided into subfamilies accordingto the amino acid sequences at base-contacting positions; those in thesame subfamily are predicted to bindthe same DNA sequence and thoseof different subfamilies to binddistinct sequences. On the whole,the subfamilies corresponded wellwith the proteins’ functions andmembers of the same subfamilies werefound to regulate similar transcriptionpathways. The combined analysis ofsequence and structural data describedby this study provided an insight intohow homologous DNA-bindingscaffolds achieve different specificitiesby altering their amino acid sequences.In doing so, proteins evolved distinctfunctions, therefore allowingstructurally related transcriptionfactors to regulate expression ofdifferent genes. Therefore, therelative abundance of transcriptionregulatory families in a genomedepends, not only on the importanceof a particular protein function, butalso in the adaptability of the DNA-binding motifs to recognise distinctnucleotide sequences. This, in turn,appears to be best accommodatedby simple binding motifs, such as thezinc fingers.

Given the knowledge of thetranscription regulators that arecontained in each organism, and anunderstanding of how they recogniseDNA sequences, it is of interest tosearch for their potential binding siteswithin genome sequences (95). Forprokaryotes, most analyses haveinvolved compiling data onexperimentally known binding sites forparticular proteins and building aconsensus sequence that incorporatesany variations in nucleotides. Additionalsites are found by conducting word-matching searches over the entiregenome and scoring candidate sites bysimilarity (96-99). Unsurprisingly, mostof the predicted sites are found in non-coding regions of the DNA (96) andthe results of the studies are oftenpresented in databases such asRegulonDB (89). The consensussearch approach is often complementedby comparative genomic studiessearching upstream regions oforthologous genes in closely relatedorganisms. Through such an approach,it was found that at least 27% ofknown E. coli DNA-regulatory motifsare conserved in one or more distantlyrelated bacteria (100).

The detection of regulatory sites ineukaryotes poses a more difficultproblem because consensus sequencestend to be much shorter, variable, anddispersed over very large distances.However, initial studies in S. cerevisiaeprovided an interesting observation forthe GATA protein in nitrogenmetabolism regulation. While the 5base-pair GATA consensus sequenceis found almost everywhere in thegenome, a single isolated binding site isinsufficient to exert the regulatoryfunction (101). Therefore specificityof GATA activity comes from therepetition of the consensus sequencewithin the upstream regions ofcontrolled genes in multiple copies. Aninitial study has used this observationto predict new regulatory sites by

For personal or educational use only. No other uses without permission. All rights reserved.Downloaded from imia.schattauer.de on 2018-02-23 | IP: 115.113.232.2

Page 11: What is bioinformatics? An introduction and overview€¦ · Department of Molecular Biophysics and Biochemistry Yale University New Haven, USA Introduction Biological data are flooding

Review Paper

93Yearbook of Medical Informatics 2001

searching for over-representedoligonucleotides in non-coding regions ofyeast and worm genomes (102, 103).

Having detected the regulatorybinding sites, there is the problem ofdefining the genes that are actuallyregulated, commonly termed regulons.Generally, binding sites are assumed tobe located directly upstream of theregulons; however there are differentproblems associated with thisassumption depending on the organism.For prokaryotes, it is complicated bythe presence of operons; it is difficultto locate the regulated gene within anoperon since it can lie several genesdownstream of the regulatorysequence. It is often difficult to predictthe organisation of operons (104),especially to define the gene that isfound at the head, and there is often alack of long-range conservation in geneorder between related organisms (105).The problem in eukaryotes is evenmore severe; regulatory sites often actin both directions, binding sites areusually distant from regulons becauseof large intergenic regions, andtranscription regulation is usually aresult of combined action by multipletranscription factors in a combinatorialmanner.

Despite these problems, thesestudies have succeeded in confirmingthe transcription regulatory pathwaysof well-characterised systems such asthe heat shock response system (99).In addition, it is feasible to experi-mentally verify any predictions, mostnotably using gene expression data.

Gene expression studiesMany expression studies have so

far focused on devising methods tocluster genes by similarities inexpression profiles. This is in order todetermine the proteins that areexpressed together under differentcellular conditions. Briefly, the mostcommon methods are hierarchical

clustering, self-organising maps, andK-means clustering. Hierarchicalmethods originally derived fromalgorithms to construct phylogenetictrees, and group genes in a “bottom-up” fashion; genes with the most similarexpression profiles are clustered first,and those with more diverse profilesare included iteratively (106-108). Incontrast, the self-organising map (109,110) and K-means methods (111)employ a “top-down” approach in whichthe user pre-defines the number ofclusters for the dataset. The clustersare initially assigned randomly, and thegenes are regrouped iteratively untilthey are optimally clustered.

Given these methods, it is of interestto relate the expression data to otherattributes such as structure, functionand subcellular localisation of eachgene product. Mapping these propertiesprovide an insight into thecharacteristics of proteins that areexpressed together, and also suggestsome interesting conclusions about theoverall biochemistry of the cell. Inyeast, shorter proteins tend to be morehighly expressed than longer proteins,probably because of the relative easewith which they are produced (112).Looking at the amino acid content,highly expressed genes are generallyenriched in alanine and glycine, anddepleted in asparagine; these arethought to reflect the requirements ofamino acid usage in the organism, wheresynthesis of alanine and glycine areenergetically less expensive thanasparagine. Turning to proteinstructure, expression levels of the TIMbarrel and NTP hydrolase folds arehighest, while those for the leucinezipper, zinc finger and transmembranehelix-containing folds are lowest. Thisrelates to the functions associated withthese folds; the former are commonlyinvolved in metabolic pathways andthe latter in signalling or transportprocesses (113). This is also reflectedin the relationship with subcellular

localisations of proteins, whereexpression of cytoplasmic proteins ishigh, but nuclear and membraneproteins tend to be low (114, 115).

More complex relationships havealso been assessed. Conventionalwisdom is that gene products thatinteract with each other are more likelyto have similar expression profiles thanif they do not (116, 117). However, arecent study showed that thisrelationship is not so simple (118).While expression profiles are similarfor gene products that are permanentlyassociated, for example in the largeribosomal subunit, profiles differsignificantly for products that are onlyassociated transiently, including thosebelonging to the same metabolicpathway.

As described below, one of the maindriving forces behind expressionanalysis has been to analyse cancerouscell lines (119). In general, it has beenshown that different cell lines (egepithelial and ovarian cells) can bedistinguished on the basis of theirexpression profiles, and that theseprofiles are maintained when cells aretransferred from an in vivo to an invitro environment (120). The basis fortheir physiological differences wereapparent in the expression of specificgenes; for example, expression levelsof gene products necessary forprogression through the cell cycle,especially ribosomal genes, correlatedwell with variations in cell proliferationrate. Comparative analysis can beextended to tumour cells, in which theunderlying causes of cancer can beuncovered by pinpointing areas ofbiological variations compared tonormal cells. For example in breastcancer, genes related to cellproliferation and the IFN-regulatedsignal transduction pathway werefound to be upregulated (52, 121). Oneof the difficulties in cancer treatmenthas been to target specific therapies to

For personal or educational use only. No other uses without permission. All rights reserved.Downloaded from imia.schattauer.de on 2018-02-23 | IP: 115.113.232.2

Page 12: What is bioinformatics? An introduction and overview€¦ · Department of Molecular Biophysics and Biochemistry Yale University New Haven, USA Introduction Biological data are flooding

94

Review Paper

Yearbook of Medical Informatics 2001

pathogenetically distinct tumour types,in order to maximise efficacy andminimise toxicity. Therefore,improvements in cancer classificationshave been central to advances incancer treatment. Although thedistinction between different forms ofcancer – for example subclasses ofacute leukaemia – has been wellestablished, it is still not possible toestablish a clinical diagnosis on thebasis of a single test. In a recent study,acute myeloid leukaemia and acutelymphoblastic leukaemia weresuccessfully distinguished based on theexpression profiles of these cells (53).As the approach does not require priorbiological knowledge of the diseases, itmay provide a generic strategy forclassifying all types of cancer.

Clearly, an essential aspect ofunderstanding expression data lies inunderstanding the basis of transcriptionregulation. However, analysis in this areais still limited to preliminary analyses ofexpression levels in yeast mutants lackingkey components of the transcriptioninitiation complex (10, 122).

“… many PRACTICALAPPLICATIONS…”

Here, we describe some of the majoruses of bioinformatics.

Finding HomologuesAs described earlier, one of the

driving forces behind bioinformatics isthe search for similarities betweendifferent biomolecules. Apart fromenabling systematic organisation ofdata, identification of proteinhomologues has some direct practicaluses. The most obvious is transferringinformation between related proteins.For example, given a poorly character-ised protein, it is possible to search forhomologues that are better understoodand with caution, apply some of theknowledge of the latter to the former.

Specifically with structural data,theoretical models of proteins areusually based on experimentally solvedstructures of close homologues (123).Similar techniques are used in foldrecognition in which tertiary structurepredictions depend on finding structuresof remote homologues and checkingwhether the prediction is energeticallyviable (124). Where biochemical orstructural data are lacking, studies couldbe made in low-level organisms likeyeast and the results applied tohomologues in higher-level organismssuch as humans, where experimentsare more demanding.

An equivalent approach is alsoemployed in genomics. Homologue-finding is extensively used to confirmcoding regions in newly sequencedgenomes and functional data isfrequently transferred to annotateindividual genes. On a larger scale, italso simplifies the problem ofunderstanding complex genomes byanalysing simple organisms first andthen applying the same principles tomore complicated ones – this is onereason why early structural genomicsprojects focused on Mycoplasmagenitalium (91).

Ironically, the same idea can beapplied in reverse. Potential drugtargets are quickly discovered bychecking whether homologues ofessential microbial proteins are missingin humans. On a smaller scale, structuraldifferences between similar proteinsmay be harnessed to design drugmolecules that specifically bind to onestructure but not another.

Rational Drug DesignOne of the earliest medical

applications of bioinformatics has beenin aiding rational drug design. Figure 2outlines the commonly cited approach,taking the MLH1 gene product as anexample drug target. MLH1 is a humangene encoding a mismatch repair

protein (mmr) situated on the shortarm of chromosome 3 (125). Throughlinkage analysis and its similarity tommr genes in mice, the gene hasbeen implicated in nonpolyposiscolorectal cancer (126). Given thenucleotide sequence, the probableamino acid sequence of the encodedprotein can be determined usingtranslation software. Sequencesearch techniques can then be usedto find homologues in modelorganisms, and based on sequencesimilarity, it is possible to model thestructure of the human protein onexperimentally characterised struc-tures. Finally, docking algorithmscould design molecules that couldbind the model structure, leading theway for biochemical assays to testtheir biological activity on the actualprotein.

Large-scale censusesAlthough databases can efficiently

store all the information related togenomes, structures and expressiondatasets, it is useful to condense all thisinformation into understandable trendsand facts that users can readilyunderstand. Broad generalisations helpidentify interesting subject areas forfurther detailed analysis, and placenew observations in a proper context.This enables one to see whether theyare unusual in any way.

Through these large-scalecensuses, one can address a numberof evolutionary, biochemical andbiophysical questions. For example,are specific protein folds associatedwith certain phylogenetic groups?How common are different foldswithin particular organisms? And towhat degree are folds sharedbetween related organisms? Doesthis extent of sharing parallelmeasures of relatedness derived fromtraditional evolutionary trees? Initialstudies show that the frequency offolds differs greatly between

For personal or educational use only. No other uses without permission. All rights reserved.Downloaded from imia.schattauer.de on 2018-02-23 | IP: 115.113.232.2

Page 13: What is bioinformatics? An introduction and overview€¦ · Department of Molecular Biophysics and Biochemistry Yale University New Haven, USA Introduction Biological data are flooding

Review Paper

95Yearbook of Medical Informatics 2001

Fig.2. Above is a schematic outlining how scientists can use bioinformatics to aid rational drug discovery. MLH1 is a human gene encoding a mismatchrepair protein (mmr) situated on the short arm of chromosome 3. Through linkage analysis and its similarity to mmr genes in mice, the gene has beenimplicated in nonpolyposis colorectal cancer. Given the nucleotide sequence, the probable amino acid sequence of the encoded protein can bedetermined using translation software. Sequence search techniques can be used to find homologues in model organisms, and based on sequencesimilarity, it is possible to model the structure of the human protein on experimentally characterised structures. Finally, docking algorithms coulddesign molecules that could bind the model structure, leading the way for biochemical assays to test their biological activity on the actual protein.

organisms and that the sharing offolds between organisms does in factfollow traditional phylogeneticclassifications (21, 41). We can alsointegrate data on protein functions;given that the particular protein foldsare often related to specific bio-chemical functions (68, 69), thesefindings highlight the diversity ofmetabolic pathways in differentorganisms (20, 105).

As we discussed earlier, one of themost exciting new sources of genomicinformation is the expression data.Combining expression information withstructural and functional classificationsof proteins we can ask whether thehigh occurrence of a protein fold in a

genome is indicative of high expressionlevels (112). Further genomic scaledata that we can consider in large-scale surveys include the subcellularlocalisations of proteins and their inter-actions with each other (127-129). Inconjunction with structural data, wecan then begin to compile a map of allprotein-protein interactions in anorganism.

Further applications in medicalsciences

Most recent applications in themedical sciences have centred ongene expression analysis (130). Thisusually involves compiling expressiondata for cells affected by differentdiseases (131), eg cancer (53, 132,

133) and ateriosclerosis (134), andcomparing the measurements againstnormal expression levels. Identifi-cation of genes that are expresseddifferently in affected cells providesa basis for explaining the causes ofillnesses and highlights potential drugtargets. Using the process describedin Figure 2, one would designcompounds that bind the expressedprotein, or perhaps more importantly,the transcription regulator has causedthe change in expression levels. Givena lead compound, microarray experi-ments can then be used to evaluateresponses to pharmacologicalintervention, (135, 136) and alsoprovide early tests to detect or predictthe toxicity of trial drugs.

For personal or educational use only. No other uses without permission. All rights reserved.Downloaded from imia.schattauer.de on 2018-02-23 | IP: 115.113.232.2

Page 14: What is bioinformatics? An introduction and overview€¦ · Department of Molecular Biophysics and Biochemistry Yale University New Haven, USA Introduction Biological data are flooding

96

Review Paper

Yearbook of Medical Informatics 2001

Further advances in bioinformaticscombined with experimental genomicsfor individuals are predicted torevolutionalise the future of healthcare.A typical scenario for a patient maystart with post-natal genotyping toassess susceptibility or immunity fromspecific diseases and pathogens. Withthis information, a unique combinationof vaccines could be prescribed,minimising the healthcare costs ofunnecessary treatments andanticipating the onslaught of diseaseslater in life. Regular lifetime screeningscould lead to guidance for nutritionintake and early detections of anyillnesses (137). In addition, drug-basedtreatments could be tailored specificallyto the patient and disease, thusproviding the most effective course ofmedication with minimal side-effects(138). Given the present rate ofdevelopment, such a scenario inhealthcare appears to be possible inthe not too distant future.

ConclusionsWith the current deluge of data,

computational methods have becomeindispensable to biological investiga-tions. Originally developed for theanalysis of biological sequences, bioin-formatics now encompasses a widerange of subject areas including struc-tural biology, genomics and gene ex-pression studies. In this review, weprovided an introduction and overviewof the current state of field. Inparticular, we discussed the types ofbiological information and databasesthat are commonly used, examinedsome of the studies that are beingconducted – with reference to trans-cription regulatory systems – and finallylooked at several practical applicationsof the field.

Two principal approaches underpinall studies in bioinformatics. First isthat of comparing and grouping thedata according to biologically meaning-ful similarities and second, that of

analysing one type of data to infer andunderstand the observations for anothertype of data. These approaches arereflected in the main aims of the field,which are to understand and organisethe information associated with biolo-gical molecules on a large scale. As aresult, bioinformatics has not onlyprovided greater depth to biologicalinvestigations, but added the dimensionof breadth as well. In this way, we areable to examine individual systems indetail and also compare them withthose that are related in order touncover common principles that applyacross many systems and highlightunusual features that are unique tosome.

Acknowledgements

We thank Patrick McGarvey for commentson the manuscript.

References

1. Reichhardt T. It’s sink or swim as a tidalwave of data approaches [news] [seecomments]. Nature 1999;399(6736):517-20.

2. Benson DA, Karsch-Mizrachi I, LipmanDJ, Ostell J, Rapp BA, Wheeler DL.GenBank. Nucleic Acids Res 2000;28(1):15-8.

3. Bairoch A, Apweiler R. The SWISS-PROTprotein sequence database and itssupplement TrEMBL in 2000. NucleicAcids Res 2000;28(1):45-8.

4. Fleischmann RD, Adams MD, White O,Clayton RA, Kirkness EF, Kerlavage AR,et al. Whole-genome random sequencingand assembly of Haemophilus influenzaeRd [see comments]. Science 1995;269(5223):496-512.

5. Drowning in data. The Economist 26 June1999.

6. Bernstein FC, Koetzle TF, Williams GJ,Meyer EF, Jr., Brice MD, Rodgers JR, et al.The Protein Data Bank. A computer-basedarchival file for macromolecular structures.Eur J Biochem 1977;80(2):319-24.

7. Berman HM, Westbrook J, Feng Z, GillilandG, Bhat TN, Weissig H, et al. The ProteinData Bank. Nucleic Acids Res2000;28(1):235-42.

8. Pearson WR, Lipman DJ. Improved toolsfor biological sequence comparison. ProcNatl Acad Sci U S A 1988;85(8):2444-2448.

9. Altschul SF, Madden TL, Schaffer AA,

Zhang J, Zhang Z, Miller W, et al. GappedBLAST and PSI-BLAST: a new generationof protein database search programs. NucleicAcids Res. 1997;25(17):3389-3402.

10. Holstege FC JE, Wyrick JJ, Lee TI,Hengartner CJ, Green MR, Golub TR,Lander ES, Young RA. Dissecting theregulatory circuitry of a eukaryotic genome.Cell 1998;95(5):717-728.

11. Pedersendagger AG, Jensendagger LJ,Brunak S, Staerfeldt HH, Ussery DW. ADNA structural atlas for Escherichia coli. JMol Biol 2000;299(4):907-930.

12. Kanehisa M, Goto S. KEGG: kyotoencyclopedia of genes and genomes. NucleicAcids Res 2000;28(1):27-30.

13. Jeffery CJ. Moonlighting proteins. TIBS1999;24(1):8-11.

14. Chothia C. Proteins. One thousand familiesfor the molecular biologist [news]. Nature1992;357(6379):543-4.

15. Orengo CA, Jones DT, Thornton JM.Protein superfamilies and domainsuperfolds. Nature 1994;372(6507):631-4.

16. Lesk AM, Chothia C. How different aminoacid sequences determine similar proteinstructures: the structure and evolutionarydynamics of the globins. J Mol Biol1980;136(3):225-70.

17. Russell RB, Saqi MA, Sayle RA, Bates PA,Sternberg MJ. Recognition of analogousand homologous protein folds: analysis ofsequence and structure conservation. J MolBiol 1997;269(3):423-39.

18. Russell RB, Saqi MA, Bates PA, Sayle RA,Sternberg MJ. Recognition of analogousand homologous protein folds—assessmentof prediction success and associatedalignment accuracy using empiricalsubstitution matrices. Protein Eng1998;11(1):1-9.

19. Fitch WM. Distinguishing homologousfrom analogous proteins. Syst Zool1970;19:99-110.

20. Tatusov RL, Koonin EV, Lipman DJ. Agenomic perspective on protein families.Science 1997;278(5338):631-7.

21. Gerstein M, Hegyi H. Comparing genomesin terms of protein structure: surveys of afinite parts list. FEMS Microbiol Rev1998;22(4):277-304.

22. Skolnick J, Fetrow JS. From genes to proteinstructure and function: novel applicationsof computational approaches in the genomicera. TIBtech 2000;18:34-39.

23. McGarvey PB, Huang H, Barker WC,Orcutt BC, Garavelli JS, Srinivasarao GY,et al. PIR: a new resource for bioinformatics.Bioinformatics 2000;16(3):290-291.

24. Bleasby AJ, Akrigg D, Attwood TK.OWL—a non-redundant composite proteinsequence database. Nucleic Acids Res1994;22(17):3574-3577.

For personal or educational use only. No other uses without permission. All rights reserved.Downloaded from imia.schattauer.de on 2018-02-23 | IP: 115.113.232.2

Page 15: What is bioinformatics? An introduction and overview€¦ · Department of Molecular Biophysics and Biochemistry Yale University New Haven, USA Introduction Biological data are flooding

Review Paper

97Yearbook of Medical Informatics 2001

25. Bleasby AJ, Wootton JC. Construction ofvalidated, non-redundant composite proteinsequence databases. Protein Eng1990;3(3):153-159.

26. Hofmann K, Bucher P, Falquet L, Bairoch A.The PROSITE database, its status in 1999.Nucleic Acids Res 1999;27(1):215-219.

27. Attwood TK, Croning MD, Flower DR,Lewis AP, Mabey JE, Scordis P, et al.PRINTS-S: the database formerly knownas PRINTS. Nucleic Acids Res2000;28(1):225-227.

28. Bateman A, Birney E, Durbin R, Eddy SR,Howe KL, Sonnhammer EL. The Pfamprotein families database. Nucleic AcidsRes 2000;28(1):263-266.

29. Attwood TK, Flower DR, Lewis AP,Mabey JE, Morgan SR, Scordis P, et al.PRINTS prepares for the new millennium.Nucleic Acids Res 1999;27(1):220-225.

30. Laskowski RA, Hutchinson EG, MichieAD, Wallace AC, Jones ML, ThorntonJM. PDBsum: a Web-based database ofsummaries and analyses of all PDBstructures. TIBS 1997;22(12):488-490.

31. Pearl FM, Lee D, Bray JE, Sillitoe I, ToddAE, Harrison AP, et al. Assigning genomicsequences to CATH. Nucleic Acids Res2000;28(1):277-282.

32. Lo Conte L, Ailey B, Hubbard TJ, BrennerSE, Murzin AG, Chothia C. SCOP: astructural classification of proteins database.Nucleic Acids Res 2000;28(1):257-259.

33. Holm L, Sander C. Touring protein foldspace with Dali/FSSP. Nucleic Acids Res1998;26(1):316-319.

34. Berman HM, Olson WK, Beveridge DL,Westbrook J, Gelbin A, Demeny T, et al.The Nucleic Acid Database. Acomprehensive relational database of three-dimensional structures of nucleic acids.Biophys J 1992;63(3):751-759.

35. Vondrasek J, Wlodawer A. Database ofHIV proteinase structures. TIBS1997;22(5):183.

36. Hendlich M. Databases for protein-ligandcomplexes. Acta Cryst D 1998;54(1):1178-1182.

37. Baker W, van den Broek A, Camon E,Hingamp P, Sterk P, Stoesser G, et al. TheEMBL nucleotide sequence database.Nucleic Acids Res 2000;28(1):19-23.

38. Okayama T, Tamura T, Gojobori T, TatenoY, Ikeo K, Miyazaki S, et al. Formal designand implementation of an improved DDBJDNA database with a new schema andobject-oriented library. Bioinformatics1998;14(6):472-8.

39. Schuler GD, Epstein JA, Ohkawa H, KansJA. Entrez: molecular biology database andretrieval system. Methods Enzymol1996;266:141-62.

40. Tatusova TA, Karsch-Mizrachi I, Ostell

JA. Complete genomes in WWW Entrez:data representation and analysis.Bioinformatics 1999;15(7-8):536-43.

41. Lin J, Gerstein M. Whole-genome treesbased on the occurrence of folds andorthologs: implications for comparinggenomes on different levels [In ProcessCitation]. Genome Res 2000;10(6):808-18.

42. Eisen MB, Brown PO. DNA arrays foranalysis of gene expression. MethodsEnzymol 1999;303:179-205.

43. Cheung VG, Morley M, Aguilar F, MassimiA, Kucherlapati R, Childs G. Making andreading microarrays. Nat Genet 1999;21(1Suppl):15-9.

44. Duggan DJ, Bittner M, Chen Y, Meltzer P,Trent JM. Expression profiling using cDNAmicroarrays. Nat Genet 1999;21(1Suppl):10-4.

45. Lipshutz RJ FS, Gingeras TR, LockhartDJ. High density synthetic oligonucleotidearrays. Nat Gen 1999;21(1):20-24.

46. Velculescu VE ZL, Zhou, W Traverso, G StCroix, B Vogelstein B, Kinzler KW. SerialAnalysis of Gene Expression DetailedProtocol. 1999.

47. Roth FP HJ, Estep PW, Church GM.Finding DNA regulatory motifs withinunaligned noncoding sequences clusteredby whole-genome mRNA quantitation. NatBiotechnol 1998;16(10):939-45.

48. Jelinsky SA, Samson LD. Global responseof Saccharomyces cerevisiae to an alkylatingagent. Proc Natl Acad Sci U S A1999;96(4):1486-91.

49. Cho RJ, Campbell MJ, Winzeler EA,Steinmetz L, Conway A, Wodicka L, et al.A genome-wide transcriptional analysis ofthe mitotic cell cycle. Mol Cell1998;2(1):65-73.

50. DeRisi JL, Iyer VR, Brown PO. Exploringthe metabolic and genetic control of geneexpression on a genomic scale. Science1997;278(5338):680-6.

51. Winzeler EA, Shoemaker DD, Astromoff A,Liang H, Anderson K, Andre B, et al.Functional characterization of the S. cerevisiaegenome by gene deletion and parallel analysis.Science 1999;285(5429):901-6.

52. Perou CM, Sorlie T, Eisen MB, van de RijnM, Jeffrey SS, Rees CA, et al. Molecularportraits of human breast tumours. Nature2000;406(6797):747-52.

53. Golub TR, Slonim DK, Tamayo P, HuardC, Gaasenbeek M, Mesirov JP, et al.Molecular classification of cancer: classdiscovery and class prediction by geneexpression monitoring. Science 1999;286(5439):531-7.

54. Celis JE, Gromov P. 2D proteinelectrophoresis: can it be perfected? CurrOpin Biotechnol 1999;10(1):16-21.

55. Pandey A, Mann M. Proteomics to study

genes and genomes [In Process Citation].Nature 2000;405(6788):837-46.

56. Futcher B, Latter GI, Monardo P,McLaughlin CS, Garrels JI. A sampling ofthe yeast proteome. Mol Cell Biol1999;19(11):7357-68.

57. Gygi SP, Rist B, Gerber SA, Turecek F,Gelb MH, Aebersold R. Quantitativeanalysis of complex protein mixtures usingisotope-coded affinity tags. Nat Biotechnol1999;17(10):994-9.

58. Gerstein M. Integrative database analysisin structural genomics. Nature Struct BiolIn Press.

59. Etzold T, Ulyanov A, Argos P. SRS:information retrieval system for molecularbiology data banks. Methods Enzymol1996;266:114-28.

60. Wade K. Searching Entrez PubMed anduncover on the internet [news]. Aviat SpaceEnviron Med 2000;71(5):559.

61. Zhang MQ. Promoter analysis of co-regulated genes in the yeast genome.Comput Chem 1999;23(3-4):233-50.

62. Boguski MS. Biosequence exegesis. Science1999;286(5439):453-5.

63. Miller C, Gurd J, Brass A. A RAPIDalgorithm for sequence databasecomparisons: application to theidentification of vector contamination inthe EMBL databases. Bioinformatics1999;15(2):111-21.

64. Gonnet GH, Korostensky C, Benner S.Evaluation measures of multiple sequencealignments [In Process Citation]. J ComputBiol 2000;7(1-2):261-76.

65. Orengo CA, Taylor WR. SSAP: sequentialstructure alignment program for proteinstructure comparison. Methods Enzymol1996;266:617-35.

66. Orengo CA. CORA—topologicalfingerprints for protein structural families.Protein Sci 1999;8(4):699-715.

67. Russell RB, Sternberg MJ. Structureprediction. How good are we? Curr Biol1995;5(5):488-90.

68. Martin AC, Orengo CA, Hutchinson EG,Jones S, Karmirantzou M, Laskowski RA,et al. Protein folds and functions. Structure1998;6(7):875-84.

69. Hegyi H, Gerstein M. The relationshipbetween protein structure and function: acomprehensive survey with application tothe yeast genome. J Mol Biol1999;288(1):147-64.

70. Russell RB, Sasieni PD, Sternberg MJE.Supersites within superfolds. Binding sitesimilarity in the absence of homology. JMol Biol 1998;282(4):903-18.

71. Wilson CA, Kreychman J, Gerstein M.Assessing annotation transfer for genomics:quantifying the relations between proteinsequence, structure and function through

For personal or educational use only. No other uses without permission. All rights reserved.Downloaded from imia.schattauer.de on 2018-02-23 | IP: 115.113.232.2

Page 16: What is bioinformatics? An introduction and overview€¦ · Department of Molecular Biophysics and Biochemistry Yale University New Haven, USA Introduction Biological data are flooding

98

Review Paper

Yearbook of Medical Informatics 2001

traditional and probabilistic scores. J MolBiol 2000;297(1):233-49.

72. Harrison SC. A structural taxonomy ofDNA-binding domains. Nature1991;353(6346):715-9.

73. Luscombe NM, Austin SE, Berman HM,Thornton JM. An overview of the structuresof protein-DNA complexes. GenomeBiology 2000;1(1):1-37.

74. Jones S, van Heyningen P, Berman HM,Thornton JM. Protein-DNA interactions:A structural analysis. J Mol Biol1999;287(5):877-96.

75. Suzuki M, Gerstein M. Binding geometryof alpha-helices that recognize DNA.Proteins 1995;23(4):525-35.

76. Luscombe NM, Thornton JM. Protein-DNA interactions: a 3D analysis of alpha-helix-binding in the major groove.Manuscript in preparation.

77. Suzuki M, Brenner SE, Gerstein M, Yagi N.DNA recognition code of transcriptionfactors. Protein Eng 1995;8(4):319-28.

78. Suzuki M. DNA recognition by a beta-sheet. Protein Eng 1995;8(1):1-4.

79. Seeman NC, Rosenberg JM, Rich A.Sequence specific recognition of doublehelical nucleic acids by proteins. Proc NatlAcad Sci U S A 1976;73:804-808.

80. Suzuki M. A framework for the DNA-protein recognition code of the probe helixin transcription factors: the chemical andstereochemical rules [see comments].Structure 1994;2(4):317-26.

81. Mandel-Gutfreund Y, Schueler O, MargalitH. Comprehensive analysis of hydrogenbonds in regulatory protein DNA-complexes: in search of common principles.J Mol Biol 1995;253(2):370-82.

82. Luscombe NM, Laskowski RA, ThorntonJM. Protein-DNA interactions: a 3Danalysis of amino acid-base interactions.Manuscript in preparation.

83. Mandel-Gutfreund Y, Margalit H, JerniganRL, Zhurkin VB. A role for CH...Ointeractions in protein-DNA recognition. JMol Biol 1998;277(5):1129-40.

84. Sternberg MJ, Gabb HA, Jackson RM.Predictive docking of protein-protein andprotein-DNA complexes. Curr Opin StructBiol 1998;8(2):250-6.

85. Aloy P, Moont G, Gabb HA, Querol E,Aviles FX, Sternberg MJ. Modellingrepressor proteins docking to DNA.Proteins 1998;33(4):535-49.

86. Dickerson RE. DNA bending: the prevalenceof kinkiness and the virtues of normality.Nucleic Acids Res 1998;26(8):1906-26.

87. Perez-Rueda E, Collado-Vides J. Therepertoire of DNA-binding transcriptionalregulators in Escherichia coli K-12. NucleicAcids Res 2000;28(8):1838-47.

88. Mewes HW, Frishman D, Gruber C, Geier

B, Haase D, Kaps A, et al. MIPS: a databasefor genomes and protein sequences. NucleicAcids Res 2000;28(1):37-40.

89. Salgado H, Santos-Zavaleta A, Gama-Castro S, Millan-Zarate D, Blattner FR,Collado-Vides J. RegulonDB (version 3.0):transcriptional regulation and operonorganization in Escherichia coli K-12.Nucleic Acids Res 2000;28(1):65-7.

90. Wingender E, Chen X, Hehl R, Karas H,Liebich I, Matys V, et al. TRANSFAC: anintegrated system for gene expressionregulation. Nucleic Acids Res2000;28(1):316-9.

91. Teichmann SA, Chothia C, Gerstein M.Advances in structural genomics. Curr OpinStruct Biol 1999;9(3):390-9.

92. Aravind L, Koonin EV. DNA-bindingproteins and evolution of transcriptionregulation in the archaea. Nucleic Acids Res1999;27(23):4658-70.

93. Huynen MA, van Nimwegen E. Thefrequency distribution of gene family sizesin complete genomes. Mol Biol Evol1998;15(5):583-9.

94. Luscombe NM, Thornton JM. Protein-DNA interactions: an analysis of aminoacid conservation and the effect on bindingspecificity. Manuscript in preparation.

95. Gelfand MS. Prediction of function in DNAsequence analysis. J Comp Biol 1995;1:87-115.

96. Robison K, McGuire AM, Church GM. Acomprehensive library of DNA-binding sitematrices for 55 proteins applied to thecomplete Escherichia coli K-12 genome. JMol Biol 1998;284(2):241-54.

97. Thieffry D, Salgado H, Huerta AM, Collado-Vides J. Prediction of transcriptionalregulatory sites in the complete genomesequence of Escherichia coli K-12.Bioinformatics 1998;14(5):391-400.

98. Mironov AA, Koonin EV, Roytberg MA,Gelfand MS. Computer analysis oftranscription regulatory patterns incompletely sequenced bacterial genomes.Nucleic Acids Res 1999;27(14):2981-9.

99. Gelfand MS, Koonin EV, Mironov AA.Prediction of transcription regulatory sitesin Archaea by a comparative genomicapproach. Nucleic Acids Res2000;28(3):695-705.

100. McGuire AM, Hughes JD, Church GM.Conservation of DNA regulatory motifsand discovery of new motifs in microbialgenomes [In Process Citation]. GenomeRes 2000;10(6):744-57.

101. Bysani N, Daugherty JR, Cooper TG.Saturation mutagenesis of the UASNTR(GATAA) responsible for nitrogencatabolite repression-sensitivetranscriptional activation of the allantoinpathway genes in Saccharomyces

cerevisiae. J Bacteriol 1991;173(16):4977-82.

102. Clarke ND, Berg JM. Zinc fingers inCaenorhabditis elegans: finding familiesand probing pathways [see comments].Science 1998;282(5396):2018-22.

103. van Helden J, Andre B, Collado-Vides J.Extracting regulatory sites from theupstream region of yeast genes bycomputational analysis of oligonucleotidefrequencies. J Mol Biol 1998;281(5):827-42.

104. Salgado H, Moreno-Hagelsieb G, SmithTF, Collado-Vides J. Operons inEscherichia coli: genomic analyses andpredictions. Proc Natl Acad Sci U S A2000;97(12):6652-7.

105. Tatusov RL, Mushegian AR, Bork P,Brown NP, Hayes WS, Borodovsky M, etal. Metabolism and evolution of Haemophilusinfluenzae deduced from a whole- genomecomparison with Escherichia coli. CurrBiol 1996;6(3):279-91.

106. Eisen MB, Spellman PT, Brown PO, BotsteinD. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad SciU S A 1998;95(25):14863-8.

107. Wen X, Fuhrman S, Michaels GS, CarrDB, Smith S, Barker JL, et al. Large-scaletemporal gene expression mapping ofcentral nervous system development. ProcNatl Acad Sci U S A 1998;95(1):334-9.

108. Alon U, Barkai N, Notterman DA, Gish K,Ybarra S, Mack D, et al. Broad patterns ofgene expression revealed by clusteringanalysis of tumor and normal colon tissuesprobed by oligonucleotide arrays. ProcNatl Acad Sci U S A 1999;96(12):6745-50.

109. Tamayo P, Slonim D, Mesirov J, Zhu Q,Kitareewan S, Dmitrovsky E, et al.Interpreting patterns of gene expressionwith self-organizing maps: methods andapplication to hematopoietic differentia-tion. Proc Natl Acad Sci U S A1999;96(6):2907-12.

110. Toronen P, Kolehmainen M, Wong G,Castren E. Analysis of gene expressiondata using self-organizing maps. FEBSLett 1999;451(2):142-6.

111. Tavazoie S, Hughes JD, Campbell MJ,Cho RJ, Church GM. Systematicdetermination of genetic networkarchitecture [see comments]. Nat Genet1999;22(3):281-5.

112. Jansen R, Gerstein M. Analysis of theyeast transcriptome with structural andfunctional categories: characterizinghighly expressed proteins. Nucleic AcidsRes 2000;28(6):1481-8.

113. Gerstein M, Jansen R. The current excitmentin bioinformatics, analysis of whole-genome expression data: how does it relateto protein structure and function. CurrentOpinion in Structural Biology In Press.

For personal or educational use only. No other uses without permission. All rights reserved.Downloaded from imia.schattauer.de on 2018-02-23 | IP: 115.113.232.2

Page 17: What is bioinformatics? An introduction and overview€¦ · Department of Molecular Biophysics and Biochemistry Yale University New Haven, USA Introduction Biological data are flooding

Review Paper

99Yearbook of Medical Informatics 2001

114. Drawid A, Gerstein M. A BayesianSystem Integrating Expression Data withSequence Patterns for Localizing Proteins:Comprehensive Application to the YeastGenome. J Mol Biol In press.

115. Drawid A, Jansen R, Gerstein M. Genom-wide analysis relating expression levelwith protein subcellular localisation.TIGS In press.

116. Marcotte EM, Pellegrini M, Ng HL, RiceDW, Yeates TO, Eisenberg D. Detectingprotein function and protein-proteininteractions from genome sequences.Science 1999;285(5428):751-3.

117. Eisenberg D, Marcotte EM, Xenarios I,Yeates TO. Protein function in the post-genomic era. Nature 2000;405(6788):823-6.

118. Jansen R, Greenbaum D, Gerstein M.Relating whole-genome expression datawith protein-protein interactions.Manuscript in preparation.

119. Marx J. Medicine. DNA arrays revealcancer in its many forms [news]. Science2000;289(5485):1670-2.

120. Ross DT, Scherf U, Eisen MB, PerouCM, Rees C, Spellman P, et al. Systematicvariation in gene expression patterns inhuman cancer cell lines [see comments].Nat Genet 2000;24(3):227-35.

121. Perou CM, Jeffrey SS, van de Rijn M,Rees CA, Eisen MB, Ross DT, et al.Distinctive gene expression patterns inhuman mammary epithelial cells andbreast cancers. Proc Natl Acad Sci U S A1999;96(16):9212-7.

122. Livesey FJ, Furukawa T, Steffen MA,Church GM, Cepko CL. Microarray

analysis of the transcriptional networkcontrolled by the photoreceptorhomeobox gene Crx. Curr Biol2000;10(6):301-10.

123. Sali A, Blundell TL. Comparative proteinmodelling by satisfaction of spatialrestraints. J Mol Biol 1993;234(3):779-815.

124. Jones DT, Taylor WR, Thornton JM. Anew approach to protein fold recognition.Nature 1992;358(6381):86-9.

125. Kok K, Naylor SL, Buys CH. Deletionsof the short arm of chromosome 3 in solidtumors and the search for suppressorgenes. Adv Cancer Res 1997;71:27-92.

126. Syngal S, Fox EA, Eng C, Kolodner RD,Garber JE. Sensitivity and specificity ofclinical criteria for hereditary non-polyposis colorectal cancer associatedmutations in MSH2 and MLH1. J MedGen 2000;37(9):641-645.

127. Uetz P, Giot L, Cagney G, MansfieldTA, Judson RS, Knight JR, et al. Acomprehensive analysis of protein-protein interactions in Saccharomycescerevisiae [see comments]. Nature2000;403(6770):623-7.

128. Ross-Macdonald P, Sheehan A, FriddleC, Roeder GS, Snyder M. Transposonmutagenesis for the analysis of proteinproduction, function, and localization.Methods Enzymol 1999;303:512-32.

129. Mewes HW, Heumann K, Kaps A, MayerK, Pfeiffer F, Stocker S, et al. MIPS: adatabase for genomes and proteinsequences. Nucleic Acids Res1999;27(1):44-8.

130. Murray-Rust P. Bioinformatics and drug

discovery. Curr Opin Biotechnol1994;5(6):648-53.

131. Friend SH. How DNA microarrays andexpression profiling will affect clinicalpractice. BMJ 1999;319(7220):1306-7.

132. Tamayo P SD, Mesirov J, Zhu Q,Kitareewan S, Dmitrovsky E, Lander ES,Golub TR. Interpreting patterns of geneexpression with self-organizing maps:methods and application tohematopoietic differentiation. Proc NatlAcad Sci U S A 1999;96(6):2907-12.

133. Perou CM JS, van de Rijn M, Rees CA,Eisen MB, Ross DT, PergamenschikovA, Williams CF, Zhu SX, Lee JC, LashkariD, Shalon D, Brown PO, Botstein D.Distinctive gene expression patterns inhuman mammary epithelial cells andbreast cancers. Proc Natl Acad Sci1999;96(16):9212-7.

134. Hiltunen MO, Niemi M, Yla-Herttuala S.Functional genomics and DNA arraytechniques in atherosclerosis research.Curr Opin Lipidol 1999;10(6):515-9.

135. Colantuoni C, Purcell AE, Bouton CM,Pevsner J. High throughput analysis ofgene expression in the human brain. JNeurosci Res 2000;59(1):1-10.

136. Debouck C, Metcalf B. The impact ofgenomics on drug discovery. Annu RevPharmacol Toxicol 2000;40:193-207.

137. Sander C. Genomic medicine and thefuture of health care. Science 2000;287(5460):1977-8.

138. Ohlstein EH, Ruffolo RR, Jr., Elliott JD.Drug discovery in the next millennium.Annu Rev Pharmacol Toxicol2000;40:177-91.

For personal or educational use only. No other uses without permission. All rights reserved.Downloaded from imia.schattauer.de on 2018-02-23 | IP: 115.113.232.2

Page 18: What is bioinformatics? An introduction and overview€¦ · Department of Molecular Biophysics and Biochemistry Yale University New Haven, USA Introduction Biological data are flooding

100

Review Paper

Yearbook of Medical Informatics 2001

For personal or educational use only. No other uses without permission. All rights reserved.Downloaded from imia.schattauer.de on 2018-02-23 | IP: 115.113.232.2