Welcome to CS374
Algorithms in Biology
Overview
• Administrivia
• Molecular Biology and Computation
DNA, proteins, cells, evolution
Some examples of CS in biology
• Computer Scientists vs Biologists
CS374: Algorithms in Biology (cs374.stanford.edu)
1. Attendance
• At most 2 classes can be missed without affecting your grade
2. Lectures
• Most important requirement
• Select available topic & day, send email to Serafim and George
• Read papers, meet with Serafim 1-2 weeks before lecture
• Ask George any questions on papers while preparing presentation
• Schedule long (2 hr) meeting with Serafim the day before lecture
• Slides due at noon before lecture
3. Scribing
• Please sign up on a first-come, first-served basis
• Due 1 week after lecture; edited & distributed 2 weeks after lecture
• George will help you edit
4. Summaries
• Select 1 lecture among the first 10, 1 lecture among the rest
• Find one relevant paper
• Write a 1-page summary of the paper
» Paper reference
» Abstract
» Discussion
• Ask George for questions/feedback
5. Have fun!
Structure of DNA double helix
[Figure: DNA double helix. Each nucleotide consists of a phosphate group, a sugar, and a nitrogenous base (A, C, G, or T).]
Physicist Ornithologist
DNA to RNA, and genes
DNA: ~3×10^9 nucleotides long in humans; contains ~22,000 genes
RNA: carries the “message” for “translating”, or “expressing”, one gene
[Figure: DNA → (transcription) → RNA, e.g. GAGUCAGC → (translation) → protein → (folding) → folded protein]
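The transcription and translation steps above can be sketched in a few lines of Python. This is a minimal illustration, not course material: the function names `transcribe` and `translate` are hypothetical, the input is assumed to be the coding strand (so transcription is just T → U), and the standard genetic code is assumed.

```python
# Standard genetic code, written in the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(coding_strand):
    """Return the mRNA for a DNA coding strand (assumption: T is simply replaced by U)."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna):
    """Translate mRNA codon by codon, stopping at the first stop codon.

    A trailing partial codon is ignored.
    """
    dna = mrna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


print(transcribe("GAGTCAGC"))   # the RNA shown on the slide: GAGUCAGC
print(translate("AUGGCUUGA"))   # start codon, one alanine codon, stop codon
```

Folding, the third step, has no comparably short sketch; it is the subject of example 3 later in the lecture.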
Structure of proteins
Composed of a chain of amino acids.
     R
     |
H2N--C--COOH
     |
     H
20 possible side groups (R).
The sequence of amino acids folds to form a complex 3-D structure.
The structure of a protein is intimately connected to its function.
All living organisms are composed of cells
Genetics in the 20th Century
21st Century
AGTAGCACAGACTACGACGAGACGATCGTGCGAGCGACGGCGTAGTGTGCTGTACTGTCGTGTGTGTGTACTCTCCTCTCTCTAGTCTACGTGCTGTATGCGTTAGTGTCGTCGTCTAGTAGTCGCGATGCTCTGATGTTAGAGGATGCACGATGCTGCTGCTACTAGCGTGCTGCTGCGATGTAGCTGTCGTACGTGTAGTGTGCTGTAAGTCGAGTGTAGCTGGCGATGTATCGTGGT
Computational Biology
• Organize & analyze massive amounts of biological data
Enable biologists to use data
Form testable hypotheses
Discover new biology
[Slide repeated: "DNA to RNA, and genes", now annotated with marker 1]
Some examples of the central role of CS
1. Sequencing
AGTAGCACAGACTACGACGAGACGATCGTGCGAGCGACGGCGTAGTGTGCTGTACTGTCGTGTGTGTGTACTCTCCT
Genome: ~3×10^9 nucleotides; sequenced fragments: ~500 nucleotides each
Computational Fragment Assembly
• Introduced ~1980
• 1995: assemble DNA pieces up to 1,000,000 nucleotides long
• 2000: assemble whole human genome
A big puzzle: ~60 million pieces
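To make the "puzzle" concrete, here is a toy greedy assembler: it repeatedly merges the two fragments with the longest suffix/prefix overlap. This is only a sketch under strong assumptions (error-free reads, a single strand, no repeats, no fragment contained in another); real assemblers are far more involved. The names `overlap` and `greedy_assemble` and the `min_len` parameter are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b` (0 if shorter than min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments, min_len=3):
    """Merge the best-overlapping pair until no pair overlaps by at least min_len."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(frags) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # nothing left to merge; return the remaining contigs
        merged = frags[best_i] + frags[best_j][best_len:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags


# Three overlapping reads cut from the first 28 nucleotides of the slide's sequence.
reads = ["ACGAGACGATCG", "AGTAGCACAGAC", "AGACTACGACGA"]
print(greedy_assemble(reads))
```

The all-pairs overlap search is quadratic in the number of fragments, which is one reason the ~60 million pieces of a human genome call for smarter indexing than this sketch uses.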
Complete genomes today
More than 300 complete genomes have been sequenced
[Slide repeated: "DNA to RNA, and genes", now annotated with markers 1 and 2]
Where are the genes?
2. Gene Finding
In humans:
• ~22,000 genes
• ~1.5% of human DNA
[Figure: gene structure, 5’ to 3’: Exon 1, Intron 1, Exon 2, Intron 2, Exon 3]
• Start codon: ATG
• Stop codon: TAG / TGA / TAA
• Splice sites (example sequences shown in the figure: ggtgag, caggtg, cagatg, cagttg, caggccggtgag)
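The start and stop codon signals above are enough for a first, naive gene finder: scan each reading frame for an ATG followed in frame by a stop codon. The sketch below is an illustration only. It looks at the forward strand, ignores introns and splice sites entirely, and the name `find_orfs` and the `min_codons` threshold are hypothetical.

```python
STOP_CODONS = {"TAG", "TGA", "TAA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end, sequence) for open reading frames on the forward strand.

    An ORF here runs from the first ATG in a frame to the next in-frame
    stop codon (inclusive); `end` is exclusive, as in Python slicing.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, dna[start:i + 3]))
                start = None
    return orfs


print(find_orfs("CCATGGCTTGACC"))
```

Because human genes are interrupted by introns, a scan like this works far better on bacterial genomes or on spliced mRNA than on raw human DNA.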
2. Gene Finding
Topics in CS374:
• Finding noncoding RNA genes
• Finding short words that regulate the expression of genes
[Slide repeated: "DNA to RNA, and genes", now annotated with markers 1, 2 and 3; one step is labelled "easy"]
3. Protein Folding
• The amino-acid sequence of a protein determines the 3D fold
• The 3D fold of a protein determines its function
• Can we predict the 3D fold of a protein given its amino-acid sequence?
Holy grail of computational biology: a 35-year-old problem
Molecular dynamics, robotics, machine learning, computational geometry
Topics on Proteins in CS374
1. Protein Structure
• Protein Structure Comparison
• Evolution of Protein Domains
• Molecular Dynamics & Drug Targets
• Protein Classification
• Protein Folding Dynamics
• Protein Kinetics
2. Protein Comparison
• Latest multiple alignment tools
• Selecting parameters for alignment
• Phylogenetic trees
Complete Genomes
More than 200 complete genomes have been sequenced
Evolution
Evolution at the DNA level
[Figure: sequence variants passed on to the next generation, labelled "OK", "X", and "Still OK?"]
4. Sequence Comparison
Sequence conservation implies function
Sequence comparison is key to
• Finding genes
• Determining function
• Uncovering the evolutionary processes
Sequence Comparison—Alignment
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
 || |||||||  ||||x|  || |||  |||||
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
Sequence Alignment
• Introduced ~1970
• BLAST: 1990, one of the most cited papers in history
• Still a very active area of research
[Figure: BLAST compares a query sequence against a database (DB)]
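An alignment like the one shown above can be computed by dynamic programming. The sketch below uses the classic global alignment recurrence (Needleman-Wunsch) with a linear gap penalty. The scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration, so the output need not reproduce the slide's alignment exactly. The function name is hypothetical.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
            match if a[i - 1] == b[j - 1] else mismatch
        ):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


s, x, y = needleman_wunsch("AGGCTATCACCTGACCTCCAGGCCGATGCCC",
                           "TAGCTATCACGACCGCGGTCGATTTGCCCGAC")
print(s)
print(x)
print(y)
```

The table has (n+1)×(m+1) cells, so time and memory grow with the product of the sequence lengths. That cost is what motivates heuristic tools such as BLAST for searching large databases.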
Comparison of Human, Mouse, and Rat
Topics on Genomics in CS374
• Indexing Large Databases: newest BLAST techniques
• Repeat Detection
• Genomic Rearrangements: finding the order of shuffles between two genomes
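The idea behind indexing a large database can be sketched with a simple k-mer table: record where every length-k word occurs, then look up the query's words to find candidate match positions ("seeds"). This is an illustration of exact k-mer seeding only, not BLAST's actual algorithm; the function names and the toy database below are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(database, k):
    """Map each k-mer to the list of (sequence_name, position) where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index


def find_seeds(query, index, k):
    """Return (query_pos, sequence_name, db_pos) for every exact k-mer hit."""
    hits = []
    for qpos in range(len(query) - k + 1):
        for name, dpos in index.get(query[qpos:qpos + k], []):
            hits.append((qpos, name, dpos))
    return hits


# Toy database with one made-up sequence, for illustration only.
db = {"s1": "ACGTAC"}
index = build_kmer_index(db, k=3)
print(find_seeds("GTACG", index, k=3))
```

Building the index costs one pass over the database, after which each query word is a constant-time lookup; a full search tool would then try to extend each seed into a longer alignment.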
5. Clustering of Microarrays
Clinical prediction of leukemia type
• 2 types: acute lymphoid (ALL) and acute myeloid (AML)
• Different treatment & outcomes
• Predict type before treatment?
Bone marrow samples: ALL vs AML
Measure the expression level of each gene
6. Protein networks
Newer research area
• Construct networks from multiple data sources
• Navigate networks
• Compare networks across organisms
Statistics, machine learning, graph algorithms, databases
Topics on Protein Networks in CS374
1. Integration: build networks from multiple sources
2. Alignment: compare networks across species
3. Mathematical properties: modular, scale free
7. Human evolution
Topics on Human Population Genetics in CS374
1. Evolution: finding fast-evolving genes in human populations
2. Migration: tracing the migration of humans out of Africa by genetic studies
8. Building circuits from cells
The abstract submission deadline is 11:59 pm, Sunday, October 1, 2006.
Computer Scientists vs Biologists
• (almost) Nothing is ever true or false in Biology
• Everything is true or false in computer science
• Biologists strive to understand the complicated, messy natural world
• Computer scientists seek to build their own clean and organized virtual worlds
• Biologists are obsessed with being the first to discover something
• Computer scientists are obsessed with being the first to invent or prove something
• Biologists are comfortable with the idea that all data have errors
• Computer scientists are not
• Computer scientists get high-paid jobs after graduation
• Biologists typically have to complete one or more 5-year post-docs...
Computer Science is to Biology what Mathematics is to Physics
“Antedisciplinary” Science
What is computational biology?
http://compbiol.plosjournals.org/perlserv/?request=get-document&doi=10.1371/journal.pcbi.0010006