A rapid tour of
Bioinformatics
Saurabh Sinha, Lenny Pitt
Bioinformatics, or Computational Biology ?
• sometimes used interchangeably
• latter sometimes includes former
• often, latter means molecular modeling to investigate properties and behaviors of molecules via computer simulation
• often, former refers to application of databases, algorithms, computational and statistical techniques to solve problems arising from the management and analysis of biological data.
Computational Biology
• Example: protein folding
• http://www.youtube.com/watch?v=lijQ3a8yUYQ
• http://fold.it/portal/
Molecular Biology 101
Cells
• Cells are the fundamental units of living organisms
• Cells are born, do their jobs, and die
• Study of life = study of cells
Proteins
• Many of the processes (chemical reactions) inside cells are carried out by proteins
iwrwww1.fzk.de/biostruct/ Assets/1a00x500.jpg
DNA
• DNA carries the information on which proteins to produce in a cell, and how
SOURCE: http://www.microbe.org/espanol/news/human_genome.asp
Chromosome
DNA
• DNA is a string written in the alphabet {A,C,G,T}
• Human DNA is a string with 3 billion characters !
adenine, cytosine, guanine (DNA and RNA); thymine (DNA); uracil (RNA)
DNA and Proteins
www.ornl.gov/.../slides/ images/01-0037low.jpg
Genes
• Genes are “substrings” (~1000 bp) of DNA
• A gene is used as a template for producing a protein
• Each protein comes from a different gene
• ~25,000 genes in the human DNA
• The process of making a protein from a gene can be regulated in the cell: GENE REGULATION
The initial successes of bioinformatics
Some problems & successes
1. Sequence alignment
2. Comparative genomics
3. Sequencing the genome
4. Gene search
5. Evolutionary biology & phylogenetic trees
1. Sequence Alignment (fundamental question)
• Is this string equal to that one?
• Does this string contain a copy of that one?
• Is this string “like” that one? How much alike?
• Is this string “like” a portion of that one?
Sequence alignment
• Could you have done this task, for two strings of length 1 million characters, by hand ?
• Sequence analysis algorithms are the bread and butter of bioinformaticians.
CS has already studied these!
• Is this string equal to that one?
– compare two files
• Is this string equal to a portion of that one?
– find a word in a document
• Is this string “like” a portion of that one?
– find suggested spelling corrections
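The three questions above map onto standard string operations. A minimal sketch in Python, assuming the standard-library `difflib` module as a stand-in for "likeness" (the slides do not name a tool, and the example strings and word list are made up for illustration):

```python
import difflib

a = "GATTACA"
b = "GATTACA"

# Is this string equal to that one? (like comparing two files)
same = (a == b)

# Is this string equal to a portion of that one? (like finding a word in a document)
contains = "TTAC" in a

# Is this string "like" that one? difflib gives a similarity score between 0 and 1.
ratio = difflib.SequenceMatcher(None, "CATTGAGCT", "CTTAGCCTA").ratio()

# Spelling-correction style lookup: which known words are closest to a typo?
suggestions = difflib.get_close_matches("appel", ["ape", "apple", "peach", "puppy"])

print(same, contains, round(ratio, 2), suggestions)
```

`difflib` scores similarity with its own heuristic, which is not the same measure as the edit distance used for alignment.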
Edit Distance
• how much alike are two strings?
• CATTGAGCT
• CTTAGCCTA
CATTGAGCT–
C–TTAGCCTA
• Is this the best possible?
CATTGAGC–T–
C–TT–AGCCTA
• Charge one for each mismatch, each insertion, each deletion.
• Problem: find the least cost alignment
• Extensions: charge different amounts for A/C mismatch, for insertion, etc., reflecting (un)likelihood of certain genetic mutations.
• There are reasonably efficient algorithms for all of these problems (dynamic programming, in time proportional to the product of the two string lengths)
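The least-cost alignment problem can be solved with the classic dynamic-programming recurrence. The sketch below is one common formulation, not necessarily the one used in the lecture; the function name and the `mismatch` and `gap` parameters are illustrative, and changing them gives the "charge different amounts" extension.

```python
def edit_distance(s, t, mismatch=1, gap=1):
    """Least total cost of aligning s with t.

    prev[j] holds the cost of aligning the first i-1 characters of s
    with the first j characters of t; cur is the row for i characters.
    """
    prev = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        cur = [i * gap]
        for j in range(1, len(t) + 1):
            substitute = prev[j - 1] + (0 if s[i - 1] == t[j - 1] else mismatch)
            delete = prev[j] + gap       # s[i-1] aligned against a gap
            insert = cur[j - 1] + gap    # t[j-1] aligned against a gap
            cur.append(min(substitute, delete, insert))
        prev = cur
    return prev[-1]


print(edit_distance("CATTGAGCT", "CTTAGCCTA"))
```

For the two strings on the slide this gives 4: the second alignment shown (four gaps, no mismatches) is optimal, while the first one costs 5.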
2. Comparative genomics
• Human and mouse share the genetic “toolkit” for development
• Compare the two genomes and find the conserved features
• These are likely to be of functional importance
• How to compare two genomes ?
– Sequence alignment
2. Comparative genomics
http://genome.ucsc.edu/cgi-bin/hgGateway
3. Sequencing the Genome
The Human Genome Project
• Human genome: a “string” of length 3,000,000,000 characters !
• Starting with a human cell, how can we obtain this sequence ?
– The problem of sequencing
– Draft human genome sequence published: 2001
Shotgun Sequencing
• Lab technology: can sequence a snippet of 1000-2000 nucleotides.
• Idea: “shotgun” apart multiple copies of the whole genome, sequence all snippets, reconstruct.
http://en.wikipedia.org/wiki/Shotgun_sequencing
• 3 billion / 1000 = 3 million snippets.
• Want multiple copies divided in different spots, so many snippets overlap
• From overlap, we can tell how things go together.
• Need about 7-fold replication to make complete coverage likely
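The arithmetic behind these numbers can be checked quickly. The sketch below assumes snippets land uniformly at random on the genome, so that the chance a given base is missed at c-fold coverage is roughly e^(-c). That Poisson approximation is a standard textbook model, not something stated on the slides, and the function names are illustrative.

```python
import math


def snippets_needed(genome_length, snippet_length, coverage):
    """Number of snippets needed to read the genome `coverage` times over."""
    return coverage * genome_length // snippet_length


def expected_uncovered_fraction(coverage):
    """Poisson approximation: fraction of bases that no snippet covers."""
    return math.exp(-coverage)


genome_length = 3_000_000_000   # human genome, in characters
snippet_length = 1_000

one_pass = snippets_needed(genome_length, snippet_length, coverage=1)
seven_fold = snippets_needed(genome_length, snippet_length, coverage=7)
missed = expected_uncovered_fraction(7)

print(one_pass, seven_fold, missed)
```

Under this model 7-fold replication leaves on the order of 0.1% of bases uncovered, so higher replication reduces gaps but never strictly guarantees complete coverage.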
How is the genome sequenced ?
http://www.wiley.com/legacy/college/boyer/0470003790/cutting_edge/shotgun_seq/computer.gif
Assembly Methods
• Greedy approach
• Graph approaches:
– Hamiltonian path
– Traveling Salesman (TSP) in k-mer graph
– Eulerian path in k-mer graph
READ: http://www.cbcb.umd.edu/research/assembly_primer.shtml
Greedy Approach
• Merge two snippets with greatest overlap
• Repeat
http://www.cbcb.umd.edu/research/assembly_primer.shtml
Problem: may merge repeated segments (more than 50% of the human genome consists of repeats)
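The greedy idea fits in a few lines. This is a toy sketch assuming error-free snippets from one strand; the function names and the example snippets are made up for illustration, and real assemblers handle sequencing errors, reverse complements and repeats.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(snippets):
    """Repeatedly merge the two snippets with the greatest overlap."""
    snippets = list(dict.fromkeys(snippets))  # drop exact duplicates, keep order
    while len(snippets) > 1:
        best_k, best_i, best_j = -1, None, None
        for i, a in enumerate(snippets):
            for j, b in enumerate(snippets):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        merged = snippets[best_i] + snippets[best_j][best_k:]
        snippets = [s for idx, s in enumerate(snippets)
                    if idx not in (best_i, best_j)] + [merged]
    return snippets[0]


print(greedy_assemble(["ATCCA", "CCAGT", "AGTCA"]))
```

Because each step only looks at the single best overlap, two snippets from different copies of a repeat can be merged as if they were adjacent, which is the problem noted on the slide.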
Hamiltonian Path
• Create graph
– vertices = snippets
– edges = overlap
http://www.cbcb.umd.edu/research/assembly_primer.shtml
red edges correspond to repeated segments
Find a path that visits each vertex exactly once
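A small sketch of this formulation: build the overlap graph, then search for a path that visits every snippet exactly once. The `min_overlap` threshold and the function names are assumptions made for this example, snippets are assumed to be distinct, and the backtracking search is only practical for tiny inputs because the Hamiltonian path problem is NP-hard in general.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def build_overlap_graph(snippets, min_overlap=2):
    """vertices = snippets; edge a -> b when a's suffix overlaps b's prefix."""
    return {a: [b for b in snippets if b != a and overlap(a, b) >= min_overlap]
            for a in snippets}


def hamiltonian_path(graph):
    """Backtracking search for a path visiting each vertex exactly once."""
    def extend(path, visited):
        if len(path) == len(graph):
            return path
        for nxt in graph[path[-1]]:
            if nxt not in visited:
                found = extend(path + [nxt], visited | {nxt})
                if found:
                    return found
        return None

    for start in graph:
        found = extend([start], {start})
        if found:
            return found
    return None


graph = build_overlap_graph(["CCAGT", "AGTCA", "ATCCA"])
print(hamiltonian_path(graph))
```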
Other graph approaches
• Unknown sequence.
• Challenge: here are the “3-mers”: CAG, ATC, GTC, CCA, CAT, AGT, TCC, TCA
• Max TSP approach
– 3-mers are vertices
• Eulerian Path approach
– 3-mers are edges
Solution
CATCCAGTCA
Max TSP approach
3-mers sequenced: { ATC, CCA, CAG, TCC, AGT }
[Figure: graph with the five 3-mers as vertices and edges weighted by overlap length; the max-weight tour spells ATCCAGT]
3-mers extracted from unknown sequence
Find max-weight tour visiting all vertices
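For five 3-mers the max-weight ordering can be found by brute force. The sketch below tries every ordering and keeps the one with the largest total overlap. It computes an open path rather than a closed tour, which is an assumption made here because the reconstructed sequence is linear. Trying all orderings takes factorial time, so this only works for toy inputs.

```python
from itertools import permutations


def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def max_overlap_path(kmers):
    """Return (sequence, weight) for the ordering with maximum total overlap."""
    best_order, best_weight = None, -1
    for order in permutations(kmers):
        weight = sum(overlap(a, b) for a, b in zip(order, order[1:]))
        if weight > best_weight:
            best_order, best_weight = order, weight
    sequence = best_order[0]
    for prev, nxt in zip(best_order, best_order[1:]):
        sequence += nxt[overlap(prev, nxt):]
    return sequence, best_weight


print(max_overlap_path(["ATC", "CCA", "CAG", "TCC", "AGT"]))
```

On the slide's five 3-mers the best ordering has total overlap 8 (four overlaps of length 2) and spells ATCCAGT.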
Eulerian paths and k-mers
• get sequence of all k-mers (including multiplicities)
• edges are k-mers
• vertices are the (k-1)-bp prefixes and suffixes of the k-mers
• find Eulerian path (traverses each edge exactly once)
3-mers sequenced: { ATC, CCA, CAG, TCC, AGT }
[Figure: graph with vertices AT, TC, CC, CA, AG, GT and one edge per 3-mer; the path through all edges spells ATCCAGT]
Find tour using all edges
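The Eulerian formulation can be sketched with Hierholzer's algorithm, a standard method for Eulerian paths that the slides do not name. The function name is illustrative, and the code assumes error-free k-mers listed with their multiplicities.

```python
from collections import Counter, defaultdict


def eulerian_reconstruct(kmers):
    """Rebuild a sequence by walking every k-mer edge exactly once."""
    graph = defaultdict(list)
    out_deg, in_deg = Counter(), Counter()
    for kmer in kmers:
        prefix, suffix = kmer[:-1], kmer[1:]   # the two (k-1)-mer vertices
        graph[prefix].append(suffix)           # one directed edge per k-mer
        out_deg[prefix] += 1
        in_deg[suffix] += 1

    # Start where out-degree exceeds in-degree by one; if every vertex is
    # balanced the graph has an Eulerian circuit and any vertex will do.
    nodes = set(out_deg) | set(in_deg)
    unbalanced = [n for n in nodes if out_deg[n] - in_deg[n] == 1]
    start = unbalanced[0] if unbalanced else kmers[0][:-1]

    # Hierholzer's algorithm with an explicit stack.
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if graph[node]:
            stack.append(graph[node].pop())
        else:
            path.append(stack.pop())
    path.reverse()

    if len(path) != len(kmers) + 1:
        raise ValueError("no Eulerian path uses every k-mer")
    return path[0] + "".join(node[-1] for node in path[1:])


print(eulerian_reconstruct(["ATC", "CCA", "CAG", "TCC", "AGT"]))
```

When several Eulerian paths exist, different sequences are consistent with the same k-mers. The eight 3-mers of the earlier challenge form a closed circuit, so CATCCAGTCA is one valid reconstruction among several.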
Exercise
• Length-9 DNA sequence was deconstructed
• 3-mers = {GTT, TCG, CGT, TTA, ACG, TTC, TAC}
• Draw graph with directed edges labeled by these 3-mers, and vertices labeled with the corresponding 2-mers
• Find a directed path through this graph that crosses each edge exactly once, and write down the possible original length-9 sequence that can be reconstructed from the path
4. Gene Search
• Find out where the genes are located in this long string
• Genes cover ~2% of human genome
• Finding them using computer algorithms and statistics
http://www.broad.mit.edu/annotation/argo/help/usecase/index_files/image012.jpg
4. Gene Search
• Comparative genomics - similar regions to known genes for other organisms likely indicate similar function
• Similarity to gene-like patterns
• Reverse engineering from expressed proteins
• (http://en.wikipedia.org/wiki/Gene_prediction)
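One very simple "gene-like pattern" is an open reading frame: a start codon (ATG) followed in the same reading frame by a stop codon (TAA, TAG or TGA). The sketch below scans the three forward reading frames only. It is a toy illustration with made-up function and parameter names, not a real gene finder: actual tools also handle the reverse strand, introns and statistical models.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end) index pairs of ATG...stop stretches, end exclusive.

    min_codons counts the codons before the stop codon, including ATG.
    """
    orfs = []
    for frame in range(3):
        i = frame
        while i + 3 <= len(dna):
            if dna[i:i + 3] == "ATG":
                # Walk codon by codon until the first in-frame stop codon.
                for j in range(i + 3, len(dna) - 2, 3):
                    if dna[j:j + 3] in STOP_CODONS:
                        if (j - i) // 3 >= min_codons:
                            orfs.append((i, j + 3))
                        i = j  # resume scanning after this stop codon
                        break
            i += 3
    return orfs


dna = "CCATGAAATTTTAGGG"
for start, end in find_orfs(dna):
    print(start, end, dna[start:end])
```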
5. Evolutionary Biology and Phylogenetic Trees
• See presentation by Jana Sperschneider
21st century biology: bioinformatics drives the revolution
Special issue of journal Science, July 1, 2005.
• What Is the Universe Made Of?
• What is the Biological Basis of Consciousness?
• Why Do Humans Have So Few Genes?
• To What Extent Are Genetic Variation and Personal Health Linked?
• Can the Laws of Physics Be Unified?
• How Much Can Human Life Span Be Extended?
• What Controls Organ Regeneration?
• How Can a Skin Cell Become a Nerve Cell?
• How Does a Single Somatic Cell Become a Whole Plant?
• How Does Earth's Interior Work?
• Are We Alone in the Universe?
• How and Where Did Life on Earth Arise?
• What Determines Species Diversity?
• What Genetic Changes Made Us Uniquely Human?
• How Are Memories Stored and Retrieved?
• How Did Cooperative Behavior Evolve?
• How Will Big Pictures Emerge from a Sea of Biological Data?
• How Far Can We Push Chemical Self-Assembly?
• What Are the Limits of Conventional Computing?
• Can We Selectively Shut Off Immune Responses?
• Do Deeper Principles Underlie Quantum Uncertainty and Nonlocality?
• Is an Effective HIV Vaccine Feasible?
• How Hot Will the Greenhouse World Be?
• What Can Replace Cheap Oil -- and When?
• Will Malthus Continue to Be Wrong?
A simple organism
[Figure: a GENE, with labels: raw materials, environmental signal, response (protein)]
A simple organism
[Figure: GENE1, GENE2 and GENE3, with labels: environmental signal, raw materials]
A simple organism
[Figure: GENE1 through GENE10]
A complex organism
[Figure: GENE1 through GENE10]
Complex circuit of interactions
Do not need more genes; additional complexity comes from more interconnections among genes
Regulatory networks
• Genes are switches, transcription factors are input signals, proteins are outputs
• Proteins (outputs) are the signals for other genes (switches)
• This may be the reason why humans have so few genes (the circuit, not the number of switches, carries the complexity)
• Bioinformatics can unravel such networks, given the genome (DNA sequence) and gene activity information
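The switch analogy can be made concrete with a toy Boolean network. Everything in the sketch below is hypothetical: the three genes, the wiring and the synchronous update rule are invented for illustration and do not describe any real regulatory circuit.

```python
def step(state, rules):
    """Synchronously recompute every switch from the current state."""
    return {name: bool(rule(state)) for name, rule in rules.items()}


def simulate(state, rules, steps):
    """Return the list of states, starting with the initial one."""
    history = [state]
    for _ in range(steps):
        state = step(state, rules)
        history.append(state)
    return history


# Hypothetical wiring: the signal switches gene1 on, gene1's protein
# switches gene2 on, and gene3 needs gene1's protein but is shut off
# by gene2's protein.
RULES = {
    "signal": lambda s: s["signal"],
    "gene1": lambda s: s["signal"],
    "gene2": lambda s: s["gene1"],
    "gene3": lambda s: s["gene1"] and not s["gene2"],
}

start = {"signal": True, "gene1": False, "gene2": False, "gene3": False}
history = simulate(start, RULES, steps=3)
gene3_trace = [s["gene3"] for s in history]
print(gene3_trace)
```

With only three switches the wiring already produces a timed pulse of gene3, which is the sense in which the circuit, not the number of switches, carries the complexity.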
Decoding the regulatory network
• Find patterns (“binding sites”) in DNA sequence
• Analyze high throughput measurements of gene activity levels (“microarrays”)
• Analyze measurements of protein-DNA interaction (“ChIP-on-chip”)
• Integration of heterogeneous sources of data
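The first step, finding binding-site patterns, can be illustrated with a naive scan for near-matches to a short consensus string. The function name, the example sequence and the mismatch threshold are assumptions made for this sketch. Practical motif finders usually score windows with position weight matrices rather than counting mismatches.

```python
def find_sites(dna, motif, max_mismatches=1):
    """Return (position, window, mismatches) for every near-match to motif."""
    hits = []
    for i in range(len(dna) - len(motif) + 1):
        window = dna[i:i + len(motif)]
        mismatches = sum(a != b for a, b in zip(window, motif))
        if mismatches <= max_mismatches:
            hits.append((i, window, mismatches))
    return hits


# TATAAA is used here only as an example consensus; the DNA string is made up.
for hit in find_sites("GGTATAAAGGCTATATACC", "TATAAA"):
    print(hit)
```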
REGULATORY NETWORK DISCOVERY
http://www.chiponchip.org/Images/scheme_800x600_crop.jpg
Microarrays
ChIP-on-chip
Patterns in DNA sequence
“How does a single somatic cell become a whole plant ?”
Developmental biology
• The timeline from a single cell (with genetic material from mother and father) to a multicellular embryo, and to an adult
• A paradox: if all cells in the adult body have the same DNA, how can different cells be so different ?
How does a single cell lead to this ? …
… and to this ?
Drosophila (fruit fly)
Answer: Regulatory networks (Again !)
• Bioinformatics used to scan entire genome for regions that participate in “segmenting” the embryo
• Hidden Markov models, a popular technique in signal processing, used to detect such regions
• Multiple species comparison aids discovery
“How did cooperative behavior evolve?”
Cooperative social behavior
• What is the genetic (molecular) basis of social behavior ?
• Social behavior in honey bees
• Young worker bees are nurses in the hive; older ones go out to forage
• This behavioral pattern is determined by needs of colony
– How do the bees know ?
Bioinformatics of social behavior
• UIUC team scanned the honeybee genome to understand this
• Regulatory network of social behavior
• Statistical tools, machine learning, sequence analysis used for this project
“How will big pictures emerge from a sea of biological data?”
The sea
• Genomes: 3 x 10^9 bp of human genome
• Similar numbers for other genomes: mouse, rat, dog, chicken, chimp etc.
• Microarray: snapshots of 1000s of genes’ activities at one time and condition. Thousands of microarrays.
• ChIP-on-chip data: measurements of a transcription factor’s binding affinity for 1000s of genes (promoters).
Segal et al. Nature Genetics 2005.
Big pictures
A compendium of cancer genes and their regulation
The sea of biological data
• Biological literature, capturing decades of painstaking experimental work on genetics and molecular biology
• Can we glean useful information from this vast body of knowledge ?
• Biological literature mining.
– Natural language processing
– Text Information Retrieval (statistical approaches)
Some other challenges
• Protein structure prediction
• Can we predict the 3-D structure of a protein from its sequence ?
– Why ?
– One good reason: structure gives clues about function. If we can tell the structure, we can perhaps tell the function
– We can design amino acid sequences that will fold into proteins that do what we want them to do. Drug design !!
• Neural networks, a popular technique in computer science, applied to this problem
Some other challenges
• “Metagenomics”
• Most studies to date are on genomes of one species
• A sample from the soil contains hundreds of bacteria, thousands of viruses. Can we study all of these ?
• Bioinformatics is indispensable !!
• New type of data, new types of algorithms
Many more challenges
• New types of data arise from technological breakthroughs in biology
• High throughput data carry an unprecedented amount of information
• Too much noise
• Bioinformatics helps separate the signal from the noise
Bioinformatics
• Is not about one problem (e.g., designing better computer chips, better compilers, better graphics, better networks, better operating systems, etc.)
• Is about a family of very different problems, all related to biology, all related to each other
• How can computers help solve any of this family of problems ?
Bioinformatics and You
• You can learn the tools of bioinformatics
• These tools owe their origin to computer science, information theory, probability theory, statistics, etc.
• You can learn the language of biology, enough to understand what the problems are
• You can apply the tools to these problems and contribute to science