CS 466 Introduction to Bioinformatics Saurabh Sinha

Page 1: CS 466 Introduction to Bioinformatics Saurabh Sinha

CS 466 Introduction to Bioinformatics

Saurabh Sinha

Page 2: CS 466 Introduction to Bioinformatics Saurabh Sinha

What is the course about?

• Algorithmic concepts, applied to sample (toy) problems in “bioinformatics”
– Follows the textbook

• “Real” bioinformatics research
– Follows the best journals

• Not about practical training in the use of popular bioinformatics software

Page 3: CS 466 Introduction to Bioinformatics Saurabh Sinha

Grading

• Assignments: 40%
– About one every two weeks

• Mid Term: 30%

• Final: 30%

Page 4: CS 466 Introduction to Bioinformatics Saurabh Sinha

Expectations

• Some programming skills
– Any programming language is fine

Page 5: CS 466 Introduction to Bioinformatics Saurabh Sinha

Administrative Details

• Instructor:
– Saurabh Sinha
– Room 2122, Siebel Center
– Email: [email protected]

• Class hrs: Tue & Thu, 3:30pm-4:45pm, 1131SC
• Office hrs: Tue, before class (2:30 - 3:30 pm) 2122SC
• Web site: http://veda.cs.uiuc.edu/courses/fa08/cs466/

• Welcome to sit in, if not taking for credit

Page 6: CS 466 Introduction to Bioinformatics Saurabh Sinha

Textbooks

• Jones and Pevzner: required

• Durbin et al.: recommended

Page 7: CS 466 Introduction to Bioinformatics Saurabh Sinha

Other course

• CS 591 BIO: weekly seminar on biomedical informatics

• Fridays at 10:30 am.

Page 8: CS 466 Introduction to Bioinformatics Saurabh Sinha

Motivating bioinformatics

Page 9: CS 466 Introduction to Bioinformatics Saurabh Sinha

Special issue of journal Science, July 1, 2005.

Page 10: CS 466 Introduction to Bioinformatics Saurabh Sinha

>What Is the Universe Made Of?
>What Is the Biological Basis of Consciousness?
>Why Do Humans Have So Few Genes?
>To What Extent Are Genetic Variation and Personal Health Linked?
>Can the Laws of Physics Be Unified?
>How Much Can Human Life Span Be Extended?
>What Controls Organ Regeneration?
>How Can a Skin Cell Become a Nerve Cell?
>How Does a Single Somatic Cell Become a Whole Plant?
>How Does Earth's Interior Work?
>Are We Alone in the Universe?
>How and Where Did Life on Earth Arise?
>What Determines Species Diversity?
>What Genetic Changes Made Us Uniquely Human?
>How Are Memories Stored and Retrieved?
>How Did Cooperative Behavior Evolve?
>How Will Big Pictures Emerge from a Sea of Biological Data?
>How Far Can We Push Chemical Self-Assembly?
>What Are the Limits of Conventional Computing?
>Can We Selectively Shut Off Immune Responses?
>Do Deeper Principles Underlie Quantum Uncertainty and Nonlocality?
>Is an Effective HIV Vaccine Feasible?
>How Hot Will the Greenhouse World Be?
>What Can Replace Cheap Oil -- and When?
>Will Malthus Continue to Be Wrong?

Page 11: CS 466 Introduction to Bioinformatics Saurabh Sinha

Many of the most profound scientific questions of today are within the realm of bioinformatics research

Page 12: CS 466 Introduction to Bioinformatics Saurabh Sinha

“Why do humans have so few genes?”

Page 13: CS 466 Introduction to Bioinformatics Saurabh Sinha

A simple organism

[Figure: a GENE, with labels “Raw materials”, “Environmental signal”, and “Response (protein)”]

Page 14: CS 466 Introduction to Bioinformatics Saurabh Sinha

A simple organism

[Figure: GENE1, GENE2, GENE3]

Page 15: CS 466 Introduction to Bioinformatics Saurabh Sinha

A simple organism

[Figure: GENE1 through GENE10]

Page 16: CS 466 Introduction to Bioinformatics Saurabh Sinha

A complex organism

[Figure: GENE1 through GENE10, labeled “Complex circuit of interactions”]

Page 17: CS 466 Introduction to Bioinformatics Saurabh Sinha

Regulatory networks

• This may be the reason why humans have so few genes (the circuit, not the number of switches, carries the complexity)

• Bioinformatics can unravel such networks, given the genome (DNA sequence) and gene activity information

Page 18: CS 466 Introduction to Bioinformatics Saurabh Sinha

Decoding the regulatory network

• Find patterns (“motifs”) in DNA sequence that occur more often than expected by chance

• Statistics on DNA sequences and words

• Knowing these can tell us about edges in the regulatory network

Page 19: CS 466 Introduction to Bioinformatics Saurabh Sinha

Decoding the regulatory network

• An example computational problem:
• Given a string of length 10,000 over the alphabet {A,C,G,T}
• Count the number of occurrences N_w of every 6-letter word w

• Are there specific words that occur more frequently than expected by chance?
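
The counting step of this example problem can be sketched in a few lines of Python. This is only an illustration: the function name `count_words` and the randomly generated input string are assumptions, not part of the course material, and overlapping occurrences are counted.

```python
from collections import Counter
import random


def count_words(seq, k=6):
    """Count occurrences N_w of every length-k word w in seq (overlapping windows)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


# Hypothetical input: a random string of length 10,000 over {A,C,G,T}.
random.seed(0)
sequence = "".join(random.choice("ACGT") for _ in range(10_000))

counts = count_words(sequence)
# The five most frequent 6-letter words and their counts N_w.
top_words = counts.most_common(5)
```

A string of length 10,000 has 9,995 windows of length 6, so the counts always sum to 9,995.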

Page 20: CS 466 Introduction to Bioinformatics Saurabh Sinha

Decoding the regulatory network

• What is expected by chance?
• What is “more frequently”?
• Interesting mathematical questions

• The Moby Dick example

Page 21: CS 466 Introduction to Bioinformatics Saurabh Sinha
Page 22: CS 466 Introduction to Bioinformatics Saurabh Sinha

Decoding the regulatory network

• What is expected by chance?
• What is “more frequently”?
• Interesting mathematical questions

• The Moby Dick example
• This is called “motif finding”, and helps decode the regulatory network!
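
One simple way to make “expected by chance” concrete is sketched below. It assumes that every letter is drawn independently and uniformly from {A,C,G,T}, and it approximates the count of a word by a Poisson distribution. Both are modeling assumptions made here for illustration (the Poisson approximation ignores overlaps between occurrences), and the function names are hypothetical.

```python
import math


def expected_count(seq_len, k, alphabet_size=4):
    """Expected occurrences of one fixed k-letter word under a uniform i.i.d. model."""
    positions = seq_len - k + 1
    return positions / alphabet_size ** k


def poisson_tail(n, lam):
    """P(N >= n) for N ~ Poisson(lam): one possible meaning of "more frequently"."""
    below = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(n))
    return 1.0 - below


# String of length 10,000 and 6-letter words: 9995 / 4096 expected occurrences.
lam = expected_count(10_000, 6)
# Chance of seeing one particular word at least 10 times under this model.
p_value = poisson_tail(10, lam)
```

Because there are 4,096 different 6-letter words, a small p-value for a single word would still need a correction for testing many words at once.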

Page 23: CS 466 Introduction to Bioinformatics Saurabh Sinha

Comparing DNA

• Humans are about 99.9% identical to each other, DNA-wise.

• How do we know that?

• Compare the genomes of two individuals.

• The computational problem: Are two sequences similar?

Page 24: CS 466 Introduction to Bioinformatics Saurabh Sinha

Sequence alignment

• Why is this a problem?

• The two sequences will differ by “substitutions”, “insertions” and “deletions” accumulated during evolution

• The comparison algorithm has to be robust to such possibilities.
– A special technique called “dynamic programming” does all this, and is “efficient”
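
As a minimal sketch of the dynamic programming idea, the function below computes the edit distance between two sequences: the smallest number of substitutions, insertions and deletions that turns one into the other. Unit cost for every operation is an assumption made here for simplicity; real alignment methods typically use more refined scoring.

```python
def edit_distance(a, b):
    """Minimum number of substitutions, insertions and deletions turning a into b."""
    # prev[j] holds the distance between the first i-1 letters of a and the first j of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # match (cost 0) or substitute (cost 1)
            ))
        prev = curr
    return prev[-1]


distance = edit_distance("ACGTTGCA", "ACTTGGCA")
```

The table has one cell per pair of prefixes, so the running time grows with the product of the two sequence lengths rather than with the number of possible alignments.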

Page 25: CS 466 Introduction to Bioinformatics Saurabh Sinha

Sequence alignment

• Why should we care?

• Compare human genome with fish. You’ll see some portions that are highly similar.

• These “conserved” portions are often genes …

• … or regulatory sequences! The regulatory network again.

Page 26: CS 466 Introduction to Bioinformatics Saurabh Sinha

On counting genes

• The original question was “Why do humans have so few genes?”

• How do we know how many genes there are in the human genome? (And where they are in the genome)

• Experiments can be designed, but bioinformatics plays a major role

Page 27: CS 466 Introduction to Bioinformatics Saurabh Sinha

Gene prediction

• The task of predicting the locations of genes in a new genome (“annotation”)

• Gene prediction software

• The more sophisticated ones use “Hidden Markov models” (HMM) and multiple species comparison

Page 28: CS 466 Introduction to Bioinformatics Saurabh Sinha

HMM for Gene Prediction

http://researchweb.watson.ibm.com/journal/rd/453/birney.html

What is this graph?

It captures the “architecture” of a gene.

It translates into a “probabilistic model”.

It leads directly to a gene finding algorithm
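
To illustrate how a probabilistic model leads to an algorithm, here is a toy two-state hidden Markov model decoded with the Viterbi algorithm. It is a sketch under loud assumptions: the two states (“noncoding”, “coding”) and every probability below are made up for illustration and are far simpler than the gene architecture shown in the graph.

```python
import math

STATES = ("noncoding", "coding")
# All probabilities are made-up illustrative values, not trained parameters.
START = {"noncoding": 0.9, "coding": 0.1}
TRANS = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT = {
    "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    "coding": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}


def viterbi(seq):
    """Return the most probable state path for a non-empty seq (log-space Viterbi)."""
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[t][s] = best predecessor of state s at position t + 1
    for ch in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            pointers[s] = best
            new_score[s] = score[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][ch])
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Hypothetical input: a GC-rich stretch flanked by AT-rich sequence.
path = viterbi("ATATAT" + "GC" * 5 + "ATATAT")
```

With these toy parameters the GC-rich stretch is labeled “coding” and the flanks “noncoding”, which is the sense in which the model “finds” a gene.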

Page 29: CS 466 Introduction to Bioinformatics Saurabh Sinha

“What controls organ regeneration?”

“How does a single somatic cell become a whole plant?”

Page 30: CS 466 Introduction to Bioinformatics Saurabh Sinha

Development and Regeneration

• Developmental biology

• The timeline from a single cell (with genetic material from mother and father) to a multicellular embryo, and to an adult

• A paradox: all cells in the adult body have the same DNA, so how come different cells are different?

Page 31: CS 466 Introduction to Bioinformatics Saurabh Sinha

Regulatory networks again

• Bioinformatics used to scan entire genome for regions that participate in “segmenting” the embryo

• Hidden Markov models used to detect such regions

• Multiple species comparison aids discovery

Page 32: CS 466 Introduction to Bioinformatics Saurabh Sinha

“How did cooperative behavior evolve?”

Page 33: CS 466 Introduction to Bioinformatics Saurabh Sinha

Social behavior and bioinformatics?

• Social behavior in honey bees

• Young worker bees are nurses in the hive; older ones go out to forage

• This behavioral maturation is determined by the needs of the colony. What is the genetic basis of this?

Page 34: CS 466 Introduction to Bioinformatics Saurabh Sinha

Social behavior and bioinformatics

• Illinois team scanned the genome to understand this (2006)

• Regulatory network of social behavior

• Statistical tools, such as the hypergeometric test

• Machine learning tools such as “support vector machine classification”
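
A hypergeometric test asks whether two gene sets overlap more than expected by chance. The sketch below computes its upper-tail p-value directly from binomial coefficients. The function name and all the numbers in the example are hypothetical and are not taken from the honey bee study.

```python
from math import comb


def hypergeom_pvalue(population, successes, draws, observed):
    """P(X >= observed) when `draws` items are drawn without replacement
    from `population` items, of which `successes` are marked."""
    total = comb(population, draws)
    upper = min(successes, draws)
    favorable = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(observed, upper + 1)
    )
    return favorable / total


# Hypothetical numbers: 1000 genes in total, 50 of them carry a motif,
# 40 genes are active in foragers, and 8 of those 40 carry the motif.
p_value = hypergeom_pvalue(1000, 50, 40, 8)
```

Here only about 2 of the 40 genes would be expected to carry the motif by chance, so an overlap of 8 gives a small p-value.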

Page 35: CS 466 Introduction to Bioinformatics Saurabh Sinha

Other challenges

Page 36: CS 466 Introduction to Bioinformatics Saurabh Sinha

Protein structure prediction

http://www.denizyuret.com/students/vkurt/thesis-main_dosyalar/image006.gif

Page 37: CS 466 Introduction to Bioinformatics Saurabh Sinha

Protein structure prediction

• Can we predict the 3-D structure of a protein from its amino acid sequence?

• Why?
– One good reason: structure gives clues about function. If we can tell the structure, we can perhaps tell the function
– We can design amino acid sequences that will fold into proteins that do what we want them to do. Drug design!

• Neural networks, a popular technique in computer science, applied to this problem

Page 38: CS 466 Introduction to Bioinformatics Saurabh Sinha

Metagenomics

• Most studies to date are on genomes of one species

• A sample from the soil contains hundreds of bacteria, thousands of viruses. Can we study all of these?

• The Sorcerer II expedition
• http://www.sorcerer2expedition.org/version1/HTML/main.htm

Page 39: CS 466 Introduction to Bioinformatics Saurabh Sinha

Many more challenges

• New types of data arise from technological breakthroughs in biology

• High-throughput data carry an unprecedented amount of information

• Too much noise

• Bioinformatics removes the noise and reveals the truth

Page 40: CS 466 Introduction to Bioinformatics Saurabh Sinha

Bioinformatics

• Is not about one problem (e.g., designing better computer chips, better compilers, better graphics, better networks, better operating systems, etc.)

• Is about a family of very different problems, all related to biology, all related to each other

• How can computers help solve any of this family of problems?

Page 41: CS 466 Introduction to Bioinformatics Saurabh Sinha

Bioinformatics and You

• You can learn the tools of bioinformatics
• These tools owe their origin to computer science, information theory, probability theory, statistics, etc.

• You can learn the language of biology, enough to understand what the problems are

• You can apply the tools to these problems and contribute to science