Computational Biology, Part 1
Introduction
Robert F. Murphy
Copyright 1996, 2000, 2001. All rights reserved.


Course Introduction

- What these courses are about
- What I expect
- What you can expect

What these courses are about

- overview of ways in which computers are used to solve problems in biology
- supervised learning of illustrative or frequently-used programs
- (03-510) supervised learning of programming techniques and algorithms selected from these uses

I expect

- students will have basic knowledge of biology and chemistry (at the level of Modern Biology/Chemistry) and willingness to learn more
- students will have basic familiarity with use of computers (e.g., at the level of Computing Skills Workshop) and eagerness to gain new skills
- (03-510) students have some programming experience and willingness to work to improve
- heterogeneous class: I plan to include refreshers on each new topic
- students will ask questions in class and via email

You can expect

- Three major course sections
  - Sequence Analysis (13 classes)
  - Biological Modeling (11 classes)
  - Biological Imaging (4 classes)
- Class sessions: lectures/demonstrations/exercises/quizzes
- Homework assignments
  - 4 homework assignments for 03-311 (80% of grade)
  - 8 homework assignments for 03-310 (70% of grade)
  - 10 homework assignments for 03-510 (70% of grade)
- Test March 1 (20% for 03-311, 10% for others)
- Final (20% of grade for 03-310, 03-510)
- Communication on class matters via email list

Textbooks for first half of course

- For 03-310/311 students
  - “Required textbook” is Baxevanis & Ouellette
- For 03-510 students
  - “Recommended” textbook is Durbin et al.
- Additional suggested book
  - Computational Molecular Biology, Peter Clote & Rolf Backofen (ISBN 0-471-87252-0)
  - Chap. 1 is an excellent introduction to Molec. Biol. for non-Biology majors

Specific sources for CMU computational biology classes

- Web page (http://www.bio.cmu.edu/Courses/03310 or 03311 or 03510)
  - Lecture Notes (as PowerPoint files)
  - Homework Assignments (as Word files)
  - Additional materials as needed
- FTP server (www.bio.cmu.edu)
  - Files needed for homework assignments
- CompBiol project volume on AFS
  - /afs/andrew.cmu.edu/usr/murphy/CompBiol

Additional classes for 03-510

- We will have one additional class meeting per week for 03-510 for the first half of the semester only
- Purpose is to cover some more advanced material and programming assignments

Other relevant courses

- Second half mini-course “47-863: Topics in Operations Research: Computational Biology” will be taught by Dr. R. Ravi
  - Tuesday-Thursday 1:30-2:50 starting 3/13
  - Recommended for 03-510 students
- Fall 2001 course on advanced topics in computational molecular biology will be taught by Dr. Dannie Durand
  - Prerequisite: 03-310/311/510

Information flow

- A major task in computational molecular biology is to “decipher” information contained in biological sequences
- Since the nucleotide sequence of a genome contains all information necessary to produce a functional organism, we should in theory be able to duplicate this decoding using computers

Review of basic biochemistry

- Central Dogma: DNA makes RNA makes protein
- Sequence determines structure determines function

Structure

- macromolecular structure divided into
  - primary structure (1D sequence)
  - secondary structure (local 2D & 3D)
  - tertiary structure (global 3D)
- DNA is composed of four nucleotides or "bases": A, C, G, T
- RNA is also composed of four bases: A, C, G, U (T transcribed as U)
- proteins are composed of amino acids

DNA properties - base composition

Some properties of long, naturally occurring DNA molecules can be predicted accurately given only the base composition, usually expressed as either
- %GC (the percent of all base pairs that are G:C), or
- $\chi_{GC}$ (the mole fraction of all bases that are either G or C)

$\%GC = 100\,\chi_{GC}$

DNA properties - melting temperature and buoyant density

Two such properties are
- $T_m$, the melting temperature, defined as the temperature at which half of the DNA is single-stranded and half is double-stranded

  $T_m\ (^\circ\mathrm{C}) = 69.3 + 41\,\chi_{GC}$ (for 0.15 M NaCl)

- $\rho_0$, the buoyant density, defined as the density of a solution in which a DNA molecule will feel no net force when centrifuged (the density at the point in a density gradient at which the DNA stops moving, or “bands”)

  $\rho_0\ (\mathrm{g\,cm^{-3}}) = 1.660 + 0.098\,\chi_{GC}$ (for CsCl)
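The two linear relations above are easy to evaluate once the GC mole fraction is known. The following is a minimal Python sketch of that arithmetic; the function names and the example sequence are hypothetical and not part of the course materials. The relations are stated for long, naturally occurring DNA, so the short sequence here only illustrates the calculation.

```python
def gc_fraction(seq: str) -> float:
    """Mole fraction of bases that are G or C (between 0.0 and 1.0)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return (seq.count("G") + seq.count("C")) / len(seq)


def melting_temperature(gc: float) -> float:
    """Tm in degrees C for 0.15 M NaCl: 69.3 + 41 * (GC mole fraction)."""
    return 69.3 + 41.0 * gc


def buoyant_density(gc: float) -> float:
    """Buoyant density in g/cm^3 in CsCl: 1.660 + 0.098 * (GC mole fraction)."""
    return 1.660 + 0.098 * gc


# Hypothetical example sequence, used only to show the calculation.
example = "ATGCGCGTATTAGC"
gc = gc_fraction(example)
print(f"GC fraction = {gc:.3f}, %GC = {100 * gc:.1f}")
print(f"Tm = {melting_temperature(gc):.1f} C, density = {buoyant_density(gc):.3f} g/cm^3")
```

Taking the mole fraction as the input to both property functions keeps them usable when only a measured %GC is available: divide it by 100 first.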

DNA structure - restriction maps

- Restriction enzymes cut DNA at specific sequences.
- A restriction map is a graphical description of the order and lengths of fragments that would be produced by the digestion of a DNA molecule with one or more restriction enzymes
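The fragment lengths that a restriction map describes can be computed directly from a sequence. The sketch below is a hedged illustration in Python for a circular molecule and a single enzyme. It uses EcoRI (recognition site GAATTC, cut after the first G) as the example enzyme; the function names, the `cut_offset` parameter, and the test sequence are hypothetical.

```python
def find_sites(seq: str, site: str) -> list[int]:
    """0-based start positions of `site` in a circular sequence."""
    seq, site = seq.upper(), site.upper()
    if not site:
        raise ValueError("empty recognition site")
    # Append the first len(site) - 1 bases so sites spanning the origin are found.
    extended = seq + seq[: len(site) - 1]
    return [i for i in range(len(seq)) if extended.startswith(site, i)]


def circular_fragments(seq: str, site: str, cut_offset: int) -> list[int]:
    """Fragment lengths after cutting a circular sequence at every site.

    `cut_offset` is the number of bases from the start of the site to the cut.
    """
    n = len(seq)
    cuts = sorted({(p + cut_offset) % n for p in find_sites(seq, site)})
    if not cuts:
        return [n]  # the circle is not cut at all
    lengths = []
    for i, cut in enumerate(cuts):
        next_cut = cuts[(i + 1) % len(cuts)]
        # Distance to the next cut, wrapping around the origin; a single
        # cut linearizes the circle and yields one full-length fragment.
        lengths.append((next_cut - cut) % n or n)
    return lengths


# Hypothetical 18-base "plasmid" with two EcoRI sites.
plasmid = "GAATTCAAAAGAATTCCC"
print(find_sites(plasmid, "GAATTC"))
print(circular_fragments(plasmid, "GAATTC", cut_offset=1))
```

Only one strand is searched, which is sufficient for palindromic sites such as GAATTC; a non-palindromic site would also need a search of the reverse complement.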

Restriction map of a circular plasmid with one enzyme

[Figure: circular restriction map of pGEM4 with the AccII cut sites marked]

Restriction map of all enzymes that cut only once

[Figure: circular restriction map of pGEM4. Labeled single-cut enzymes: AcsI, ApoI, EcoRI, Ecl136II, EcoICRI, SacI, SstI, Acc65I, Asp718I, AvaI, BcoI, AflIII, AlwNI, AhdI, AspEI, Eam1105I, EclHKI, BpmI, GsuI, BglI, AviII, FspI, BspCI, PvuI, XorII, Eco255I, ScaI, Asp700I, XmnI, SspI, AatII, EcoNI, BsmFI, DsaI, Aor51HI, Eco47III, SgrAI, NgoAIV, NgoMI, NaeI, NheI, Bsp1407I, BsrGI, SspBI]

Transcription

- transcription is accomplished by RNA polymerase
- RNA polymerase binds to promoters
- promoters have distinct regions "-35" and "-10"
- efficiency of transcription controlled by binding and progression rates
- transcription start and stop affected by tertiary structure
- regulatory sequences can be positive or negative

RNA processing

- eukaryotic genes are interrupted by introns
- these are "spliced" out to yield mRNA
- splicing done by spliceosome
- splicing sites are quite degenerate but not all are used

Translation

- conversion from RNA to protein is by codon: 3 bases = 1 amino acid
- translation done by ribosome
- translation efficiency controlled by mRNA copy number (turnover) and ribosome binding efficiency
- translation affected by mRNA tertiary structure
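The codon rule (3 bases = 1 amino acid) can be sketched in a few lines of Python. This is an illustrative example, not course code: it assumes the standard genetic code, builds the table from the conventional TCAG ordering, and uses hypothetical function names. It reads a single frame from the first base and stops at the first stop codon.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G order
# (first base varies slowest); "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna: str) -> str:
    """Coding-strand DNA to mRNA: every T is written as U."""
    return dna.upper().replace("T", "U")


def translate(rna: str) -> str:
    """Translate mRNA codon by codon until a stop codon or the end.

    Raises KeyError for codons containing letters other than A, C, G, U/T.
    """
    dna = rna.upper().replace("U", "T")  # the table is keyed on DNA letters
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i : i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


mrna = transcribe("ATGGCCTAA")  # hypothetical coding sequence
print(mrna, "->", translate(mrna))
```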

Protein localization

- leader sequences can specify cellular location (e.g., insert across membranes)
- leader sequences usually removed by proteolytic cleavage

Posttranslational processing

- peptides fold after translation - may be assisted or unassisted
- processing enzymes recognize specific sites (amino acid sequences)
- protein signals can involve secondary and tertiary structure, not just primary structure

Goals of Sequence Analysis

Assigned Reading: Baxevanis & Ouellette, Chapter 10

Goals of Sequence Analysis

- Management of sequence information
- Assembly of sequence fragments into complete units (proteins, genes, chromosomes)

Goals of Sequence Analysis

- Confirmation and prediction of restriction enzyme sites (for nucleic acids)
  - can aid sequence determination in areas of uncertainty by permitting testing of specific bases
  - can permit selection of appropriate enzymes for sequence checking
  - can permit selection of appropriate enzymes for subcloning or generation of probes

Goals of Sequence Analysis

- Finding open reading frames (ORFs) for cDNAs or genomic DNA from organisms without introns
- Finding protein coding regions in DNAs using codon usage tables
  - not all ORFs are made into proteins
  - redundancy in genetic code is not fully reflected in the tRNAs made by a particular organism (codon preference)
  - can use to identify "real" coding regions (pseudo-genes "drift" in their codon usage)
  - can use expressed sequence tags (ESTs)
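A simple ORF scan and a codon count, the raw material for a codon usage table, might look like the Python sketch below. It is a simplified illustration under stated assumptions: only the three forward reading frames are scanned, an ORF is taken to run from the first ATG to the next in-frame stop codon, and the function names and `min_codons` threshold are hypothetical.

```python
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 2) -> list[tuple[int, int]]:
    """ORFs in the three forward frames as (start, end) index pairs.

    `start` is the index of the A in ATG; `end` is one past the stop codon.
    ORFs with fewer than `min_codons` codons before the stop are skipped.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i : i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


def codon_usage(dna: str, start: int, end: int) -> Counter:
    """Count the codons of one ORF, excluding its stop codon."""
    dna = dna.upper()
    return Counter(dna[i : i + 3] for i in range(start, end - 3, 3))


dna = "CCATGAAATTTTAAGG"  # hypothetical sequence with one short ORF
for start, end in find_orfs(dna):
    print(start, end, dict(codon_usage(dna, start, end)))
```

To cover the opposite strand, the same scan can be run on the reverse complement. Comparing the resulting counts against an organism's known codon preferences is the step that separates likely coding ORFs from chance ones, and it is not implemented here.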

Goals of Sequence Analysis

- Finding and using consensus sequences
  - Examples
    - promoters
    - transcription initiation sites
    - transcription termination sites
    - polyadenylation sites
    - ribosome binding sites
    - protein features
  - use sets of sequences identified (by other means) as related
  - use sets of sequences identified by sequence comparison
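Given a set of related sequences that are already aligned and of equal length, one simple way to derive a consensus is to take the most frequent base in each column. The Python sketch below shows that idea only; the function name and the example sites are hypothetical, and real consensus methods usually also report how strongly each position is conserved.

```python
from collections import Counter


def consensus(aligned: list[str]) -> str:
    """Most frequent character in each column of equal-length sequences.

    Ties are resolved in favor of the character seen first in the column.
    """
    if not aligned:
        raise ValueError("no sequences given")
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must all have the same length")
    result = []
    for column in zip(*aligned):
        counts = Counter(base.upper() for base in column)
        result.append(counts.most_common(1)[0][0])
    return "".join(result)


# Hypothetical aligned six-base sites, made up for illustration.
sites = ["TATAAT", "TATGAT", "TACAAT", "GATAAT"]
print(consensus(sites))
```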

Goals of Sequence Analysis

- Comparison and alignment of sequences
  - compare sequence to database - goal: find related sequences (SIMILARITY)
  - compare sequence to sequence - goal: find matching domains (ALIGNMENT)
  - compare database to database - goal: estimate genetic distance (EVOLUTION)
  - either: determine consensus sequences
  - comparisons can be pairwise or multiple-strand
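As one concrete instance of a pairwise, sequence-to-sequence comparison, the Python sketch below computes a global alignment score by dynamic programming in the style of the Needleman-Wunsch algorithm. The slides do not name a specific method, so the algorithm choice, the function name, and the scoring values (match +1, mismatch -1, gap -1) are all assumptions made for illustration.

```python
def global_alignment_score(
    a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1
) -> int:
    """Best global alignment score of `a` against `b` with linear gap costs.

    Only the previous row of the dynamic programming table is kept.
    """
    # Row 0: aligning the empty prefix of `a` with b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against the empty prefix of `b`
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in `b`
            left = curr[j - 1] + gap  # gap in `a`
            curr.append(max(diagonal, up, left))
        prev = curr
    return prev[-1]


print(global_alignment_score("ACGT", "AGT"))
```

Returning only the score keeps the sketch short; recovering the alignment itself would require storing the full table and tracing back from the final cell.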

Goals of Sequence Analysis

- Translation to protein sequence and prediction of protein properties - use measured propensities of particular amino acids or amino acid stretches
  - Predict molecular weight
  - Predict isoelectric point (pI)
  - Predict extinction coefficient
- Prediction of secondary and tertiary structure
  - RNA - use base pairing energies
  - protein - use propensities
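Two of the protein property predictions listed above reduce to sums over the amino acid composition. The Python sketch below estimates molecular weight from average residue masses and the 280 nm extinction coefficient from a widely used approximation (5500 per Trp, 1490 per Tyr, 125 per disulfide, in M^-1 cm^-1). The approximation, the mass table, and the function names are not taken from the slides and should be read as assumptions.

```python
# Average residue masses in daltons (free amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.01528


def molecular_weight(protein: str) -> float:
    """Average molecular weight in daltons of an unmodified peptide chain."""
    protein = protein.upper()
    # Sum the residues, then add one water for the free N- and C-termini.
    return sum(RESIDUE_MASS[aa] for aa in protein) + WATER_MASS


def extinction_coefficient(protein: str, disulfides: int = 0) -> int:
    """Approximate molar extinction coefficient at 280 nm (M^-1 cm^-1)."""
    protein = protein.upper()
    return 5500 * protein.count("W") + 1490 * protein.count("Y") + 125 * disulfides


peptide = "MKWVTFY"  # hypothetical peptide
print(f"{molecular_weight(peptide):.1f} Da, {extinction_coefficient(peptide)} M^-1 cm^-1")
```

The isoelectric point is omitted because it requires a choice of pKa values and an iterative search for the pH of zero net charge.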