CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha

Page 1: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha

CS 598SS: Probabilistic Methods in Biological Sequence Analysis

Saurabh Sinha

Page 2: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha

What is the course about?

• Bioinformatics / Computational Biology

• Tools for analyzing genomes

• Probabilistic methods

Page 3: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha

What is the course format?

• Research course

• Lectures by instructor

• Student presentations of research papers
– 1 or 2 paper(s) per student

• Research project & presentation
– Typically, 2 students per project
– 30-minute presentation at end of course

Page 4: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha

Grading

• Project: 40%

• Paper presentation: 25%

• Assignments and/or tests: 25%

• Participation: 10%

• Grade distribution

Page 5: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha

Expectations

• Programming skills (for the project)

• Basic exposure to probability theory

• Basic exposure to algorithms

Page 6: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha

What you can do at the end of the course

• Start working on research projects in bioinformatics: biological sequence analysis

• Use principled approaches, supported by probability theory, instead of ad hoc methods

• Join me as a graduate advisee?

Page 7: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha

Administrative Details

• Instructor:
– Saurabh Sinha
– Room 2122, Siebel Center
– Email: [email protected]

• Class hrs: Tue & Thurs, 2:00pm - 3:15pm, 1131SC

• CRN: 43781

• Credits: 4 graduate hrs

• Welcome to sit in, if not taking for credit

Page 8: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha

Books

• Not required

1. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids -- Durbin, Eddy, Krogh, Mitchison
2. Bioinformatics: The Machine Learning Approach -- Baldi, Brunak
3. Statistical Methods in Bioinformatics -- Ewens and Grant
4. Bioinformatics -- Polanski and Kimmel

Page 9: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha

Why study bioinformatics?

• Molecular biology is the new frontier of 21st century science

• Computer science is the crown prince of 20th century engineering

• Bioinformatics is the application and development of computer science with the goal of supporting molecular biology

Page 10: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha

Why study bioinformatics?

• Flood of data: several gigabytes (terabytes?) of sequence and gene expression data

• Noise in the data
– Biological
– Experimental

• Algorithms needed to make discoveries
– Probabilistic methods
– Need for efficiency

Page 11: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha

Why study bioinformatics?

• The big picture:
– Human health and quality of life
– Fundamental science

• Billions of dollars being spent
– Health research gets the major chunk of the US Govt’s funds
– Fundamental health research is at the molecular level
– Molecular biology research increasingly a quantitative science

Page 12: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha

Why study bioinformatics?

• Recent issue of Science: top 25 questions
– What Is the Universe Made Of?
– What Is the Biological Basis of Consciousness?
– Why Do Humans Have So Few Genes?
– To What Extent Are Genetic Variation and Personal Health Linked?
– Can the Laws of Physics Be Unified?
– How Much Can Human Life Span Be Extended?
– What Controls Organ Regeneration?
– How Can a Skin Cell Become a Nerve Cell?
– How Does a Single Somatic Cell Become a Whole Plant?
– How Does Earth's Interior Work?
– Are We Alone in the Universe?
– How and Where Did Life on Earth Arise?
– What Determines Species Diversity?
– What Genetic Changes Made Us Uniquely Human?
– How Are Memories Stored and Retrieved?
– How Did Cooperative Behavior Evolve?
– How Will Big Pictures Emerge from a Sea of Biological Data?
– How Far Can We Push Chemical Self-Assembly?
– What Are the Limits of Conventional Computing?
– Can We Selectively Shut Off Immune Responses?
– Do Deeper Principles Underlie Quantum Uncertainty and Nonlocality?
– Is an Effective HIV Vaccine Feasible?
– How Hot Will the Greenhouse World Be?
– What Can Replace Cheap Oil -- and When?
– Will Malthus Continue to Be Wrong?

Page 13: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha

Basic Molecular Biology

Page 14: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha

Life, Cells, Proteins

• The study of life --> the study of cells

• Cells are born, do their job, duplicate, die
– What is “their job”?
– Break down nutrients, produce energy, produce required molecules

• All these processes controlled by proteins

Page 15: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha

Protein functions

• “Enzymes” (catalysts)
– Control chemical reactions in cell

• Transfer of signals/molecules between and inside cells
– E.g., sensing of environment

• Regulate production of other proteins

Page 16: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha

Protein molecule

• Protein is a sequence of amino-acids

• 20 possible amino acids

• The amino-acid sequence “folds” into a 3-D structure called protein

Page 17: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha

Protein Structure

[Figure: the DNA repair protein MutY (blue) bound to DNA (purple); labels: Protein, DNA. PNAS cover, courtesy Amie Boal.]

Page 18: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha

DNA

• Deoxyribonucleic acid: a molecule that is involved in production of proteins

• Double helical structure (discovered by Watson, Crick, Wilkins & Franklin)

• Chromosomes are densely coiled and packed DNA

Page 19: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha

[Figure labels: Chromosome, DNA. SOURCE: http://www.microbe.org/espanol/news/human_genome.asp]

Page 20: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha

The DNA Molecule

[Figure: double-stranded DNA drawn as a ladder of base pairs; strand ends labeled 5’ and 3’]

G -- C
A -- T
T -- A
G -- C
C -- G
G -- C
T -- A
G -- C
T -- A
T -- A
A -- T
A -- T
C -- G
T -- A

Base = Nucleotide
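Because A always pairs with T and C with G, one strand determines the other. The following is a minimal Python sketch of that rule, assuming a strand is a plain string read 5’ to 3’ and that the left-hand bases of the pairs above are listed in that direction (the figure residue does not say which end is which). The name `reverse_complement` is illustrative, not from the course.

```python
# Watson-Crick pairing: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(strand: str) -> str:
    """Return the partner strand, read 5' to 3'.

    The two strands run in opposite directions, so the partner is the
    complement of the input read backwards.
    """
    return "".join(COMPLEMENT[base] for base in reversed(strand))

# Left-hand bases of the pairs above, in the order listed
# (assumed here to be 5' to 3').
left = "GATGCGTGTTAACT"
print(reverse_complement(left))  # AGTTAACACGCATC
```

Reading the right-hand bases from the bottom pair upwards gives the same string, which is a quick check that the pairs are complementary.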

Page 21: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha

From DNA to Amino-acid sequence

[Figure label: Cell. SRC: http://www.biologycorner.com/resources/DNA-RNA.gif]

Page 22: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha

From DNA to Protein: In words

1. DNA = nucleotide sequence
• Alphabet size = 4 (A,C,G,T)

2. DNA --> mRNA (single stranded)
• Alphabet size = 4 (A,C,G,U)

3. mRNA --> amino acid sequence
• Alphabet size = 20

4. Amino acid sequence “folds” into 3-dimensional molecule called protein

Page 23: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha

Central Dogma

• “Information” flows from DNA to RNA to Protein

• Why “information”?

• The DNA in a cell has complete information of which proteins will be present in the cell

Page 24: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha

DNA and genes

• DNA is a very “long” molecule

• Human DNA has about 3 billion base-pairs
– String of 3 billion characters!

• DNA harbors “genes”
– A gene is a substring of the DNA string
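The "string" view can be taken literally. Below is a small Python sketch, assuming a made-up toy sequence and hypothetical gene coordinates; none of the values come from a real genome.

```python
# Toy "genome": a DNA string over the alphabet {A, C, G, T}.
# The sequence is invented for illustration.
genome = "TTGACATGGCCAAGTAAACGT"

# A gene is a contiguous substring, identified here by hypothetical
# start (inclusive) and end (exclusive) coordinates.
gene_start, gene_end = 5, 17
gene = genome[gene_start:gene_end]

print(gene)            # ATGGCCAAGTAA
print(gene in genome)  # True

# Rough storage estimate for 3 billion characters at one byte each.
human_genome_bp = 3_000_000_000
print(human_genome_bp / 1e9, "GB as plain text")
```

At one byte per character the human genome is roughly 3 GB of text, which is why efficient algorithms matter for whole-genome analysis.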

Page 25: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha

Genes code for proteins

• DNA --> mRNA --> protein can actually be written as Gene --> mRNA --> protein

• A gene is typically a few hundred base-pairs (bp) long

Page 26: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha

Transcription

• Process of making a single stranded mRNA using double stranded DNA as template
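As a string operation, transcription is simple. A minimal Python sketch follows, under the simplifying assumption that we are handed the gene's coding strand (the strand whose sequence matches the mRNA); the function name `transcribe` is illustrative.

```python
def transcribe(coding_strand: str) -> str:
    """Return the mRNA for a gene given its coding strand (5' to 3').

    The mRNA is built against the other (template) strand, so it carries
    the same sequence as the coding strand with U in place of T.
    """
    return coding_strand.replace("T", "U")

print(transcribe("ATGGCCAAGTAA"))  # AUGGCCAAGUAA
```

This ignores everything that happens to the transcript afterwards; it only shows the change of alphabet from (A,C,G,T) to (A,C,G,U).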

Page 29: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha

Translation

• Process of making an amino acid sequence from (single stranded) mRNA

• Each triplet of bases translates into one amino acid: each such triplet is called “codon”

• The translation is basically a table lookup
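The table lookup can be sketched in a few lines of Python. This sketch assumes the standard genetic code with one-letter amino acid symbols, starts reading at the first base of the input, and stops at the first stop codon; real translation begins at a start codon, which is omitted here for brevity. The names `CODON_TABLE` and `translate` are illustrative.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon, in the order
# UUU, UUC, UUA, UUG, UCU, ... (third base varies fastest).
# '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(mrna: str) -> str:
    """Translate mRNA codon by codon until a stop codon or the end."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]  # the table lookup
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

print(translate("AUGGCCAAGUAA"))  # MAK
```

There are 64 codons but only 20 amino acids, so several codons map to the same amino acid.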

Page 30: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha
Page 31: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha

[Figure: The Genetic Code. SOURCE: http://www.bioscience.org/atlases/genecode/genecode.htm]

Page 32: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha

Step 2: mRNA to Amino acid sequence

Translation

Page 33: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha

Review so far

• Proteins: important molecules, amino acid sequences

• DNA: structure, base-pairing.

• Genes: substrings of DNA

• Gene --> mRNA (transcription)

• mRNA --> amino acid sequence (translation), genetic code.

Page 34: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha

Gene expression

• Process of making a protein from a gene as template

• Transcription, then translation

• Can be regulated

Page 35: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha

Transcriptional regulation

[Figure labels: GENE; DNA site ACAGTGA; TRANSCRIPTION FACTOR; PROTEIN]
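The short DNA string ACAGTGA in the figure suggests a first computational question: where does a given site occur in a sequence? Here is a minimal Python sketch using exact string matching, with a made-up upstream sequence and an illustrative function name `find_sites`. Exact matching is a simplification; real binding sites vary from instance to instance, which is one motivation for probabilistic models.

```python
def find_sites(dna: str, motif: str) -> list[int]:
    """Return 0-based start positions of every exact occurrence of
    motif in dna, including overlapping occurrences."""
    positions = []
    start = dna.find(motif)
    while start != -1:
        positions.append(start)
        start = dna.find(motif, start + 1)
    return positions

# Hypothetical region next to a gene, invented for illustration.
upstream = "TTACAGTGACCGTACAGTGAGG"
print(find_sites(upstream, "ACAGTGA"))  # [2, 13]
```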

Page 37: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha

The importance of gene regulation

Page 38: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha

Genetic regulatory network controlling the development of the body plan of the sea urchin embryo
Davidson et al., Science, 295(5560):1669-1678.

Page 39: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha

• That was the “circuit” responsible for development of the sea urchin embryo

• Nodes = genes

• Switches = gene regulation

• Change the switches and the circuit changes

• Gene regulation significance:
– Development of an organism
– Functioning of the organism
– Evolution of organisms
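The circuit analogy can be made concrete with a toy model. The Python sketch below treats each gene as on or off and each "switch" as a Boolean rule. The three gene names, the wiring, and the Boolean simplification are all assumptions made for illustration; they are not taken from the sea urchin network.

```python
def step(state, rules):
    """Compute every gene's next on/off value from the current state."""
    return {gene: bool(rule(state)) for gene, rule in rules.items()}

def run(state, rules, n_steps):
    """Apply the update rules n_steps times and return the final state."""
    for _ in range(n_steps):
        state = step(state, rules)
    return state

# Nodes = genes; each rule is a "switch" saying when that gene is on.
rules = {
    "geneA": lambda s: s["geneA"],                     # input, held fixed
    "geneB": lambda s: s["geneA"],                     # A switches B on
    "geneC": lambda s: s["geneB"] and not s["geneA"],  # B activates C, A represses C
}

start = {"geneA": True, "geneB": False, "geneC": False}
print(run(start, rules, 2))    # geneC stays off

# Change one switch (A no longer represses C): same genes, different outcome.
rewired = dict(rules, geneC=lambda s: s["geneB"])
print(run(start, rewired, 2))  # geneC turns on
```

The genes themselves are identical in both runs; only the regulation differs, which is the point of the "change the switches" bullet.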

Page 40: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha

Genome

• The entire sequence of DNA in a cell

• All cells of an organism have the same genome
– All cells came from repeated duplications starting from initial cell (zygote)

• Human genome is 99.9% identical among individuals

• Human genome is 3 billion base-pairs (bp) long
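Taken together, the last two figures give a sense of scale. A back-of-envelope Python sketch, treating both numbers as the round values quoted on the slide:

```python
genome_length_bp = 3_000_000_000   # 3 billion base-pairs
differing_per_thousand = 1         # 99.9% identical means 0.1% differs

# Integer arithmetic avoids floating-point rounding in the estimate.
differing_bp = genome_length_bp * differing_per_thousand // 1000
print(differing_bp)  # 3000000
```

So even at 99.9% identity, two individuals would be expected to differ at roughly 3 million positions.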

Page 41: CS 598SS Probabilistic Methods in Biological Sequence Analysis Saurabh Sinha

Genome features

• Genes

• Regulatory sequences

• The above two make up 5% of human genome

• What’s the rest doing?
– We don’t know for sure

• “Annotating” the genome
– Task of bioinformatics