Introduction to bioinformatics (I617) Haixu Tang School of Informatics Email: [email protected] Office: EIG 1008 Tel: 812-856-1859

Page 1: Introduction to bioinformatics (I617)

Introduction to bioinformatics (I617)

Haixu Tang

School of Informatics

Email: [email protected]

Office: EIG 1008

Tel: 812-856-1859

Page 2: Introduction to bioinformatics (I617)

Textbook

• A Primer of Genome Science (2nd Edition) by Greg Gibson, Spencer V. Muse, Sinauer Associates, 2004

• Suggested reading materials will be posted on the class wiki page: http://cheminfo.informatics.indiana.edu/djwild/I617_2006_wiki/index.php/Main_Page

• Office hours: MW 11:00-12:00, EIG 1008, or by appointment

Page 3: Introduction to bioinformatics (I617)

Grading

• Class project: selected from one of the four covered areas (bioinformatics, chemical informatics, laboratory informatics and health informatics): 25%
– Suggested bioinformatics topics will be posted on the class wiki page

• Homework: 25% in bioinformatics
– 4 assignments, each 6.25%

Page 4: Introduction to bioinformatics (I617)

Bioinformatics = BIOlogy + informatics?

• Not really: it is a term (somewhat arbitrarily chosen) for a multi-disciplinary area that combines life sciences, physical sciences and computer science / informatics;

• It addresses biological problems using theoretical informatics approaches, not vice versa;

• It is transforming classical biology into an information science.

Page 5: Introduction to bioinformatics (I617)

The birth of bioinformatics

• A revolution in biology research: the emergence of Genome Science

• Technology advancement in both biology and information science

Page 6: Introduction to bioinformatics (I617)

Genome science: a revolution of biology

• Classical Biology: hypothesis driven approach (diagram linking Hypothesis, Data and Knowledge)

• Genome Science: data driven approach (diagram linking Hypothesis, Knowledge and Data)

Page 7: Introduction to bioinformatics (I617)

Bioinformatics: from data analysis to data mining

• Classical Biology: low throughput data, used for hypothesis confirmation / rejection

• Genome Science: high throughput data, used for hypothesis generation (hypotheses 1, 2, 3, …)

Page 8: Introduction to bioinformatics (I617)

Bioinformatics: in the driver’s seat

• Classical Biology: data analysis (diagram linking Hypothesis, Data and Knowledge)

• Genome Science: data mining (diagram linking Hypothesis, Knowledge and Data)

Page 9: Introduction to bioinformatics (I617)

Key technology advancements

• High throughput biotechnologies
– Genome sequencing techniques
– DNA microarray
– Mass spectrometry

• Large-scale experiments
– HGP, HapMap
– Omics / Systems Biology

• Massive data generation, storage, exchange and analysis
– CPU, storage, etc.
– High speed network (Internet)
– Bioinformatics

Page 10: Introduction to bioinformatics (I617)

Bioinformatics: mutually beneficial

• For biologists
– Fragment assembly in genome sequencing
– Genome comparison
– Gene clustering in DNA microarray analysis
– Protein identification in proteomics

• For computer scientists
– String algorithms / Tree algorithms
– Alternative Eulerian path (BEST theorem)
– Reversal distances
– Probabilistic graphical models (HMMs, BNs, etc.)

Page 11: Introduction to bioinformatics (I617)

Two origins of bioinformatics

• Combinatorial pattern matching in theoretical computer science
– DNA and protein sequence analysis

• Physical and analytical chemistry of biomolecules
– Protein structure analysis → Structural bioinformatics
– Bio-analytical chemistry → Proteomics

Page 12: Introduction to bioinformatics (I617)

Bioinformatics addresses computational challenges in life and medical sciences

• New computational problems for automatic data analysis

• Reformulation of old problems using new high throughput data

• Formulating new problems using high throughput data

Page 13: Introduction to bioinformatics (I617)

Bioinformatics addresses computational challenges in life and medical sciences

• New computational problems for automatic data analysis
– Genome sequencing
– Proteomics
– Transcriptomics

• Data representation and visualization
– Genome Browser

• Solving biological problems by in silico approaches
– Reformulation of old problems using new high throughput data: gene finding; protein structure and function
– Formulating new problems using high throughput data: comparative genomics; polymorphisms / population genetics; systems biology

Page 14: Introduction to bioinformatics (I617)

Bioinformatics resources

• Databases
– Nucleic Acids Research (NAR) annual database issue

• Organization
– ISCB (International Society for Computational Biology)

• Conferences
– ISMB
– RECOMB
– Many other smaller or regional conferences, e.g. ECCB, CSB, PSB, etc., including the local Indiana Bioinformatics Conference

Page 15: Introduction to bioinformatics (I617)

A case study

• How does bioinformatics help and transform classical biological topics?

• Molecular evolutionary studies: from anatomical features to molecular evidence

• Genome evolution: comparison of gene orders

Page 16: Introduction to bioinformatics (I617)

Early Evolutionary Studies

• Anatomical features were the dominant criteria used to derive evolutionary relationships between species from Darwin's time until the early 1960s

Page 17: Introduction to bioinformatics (I617)

Early Evolutionary Studies

• Anatomical features were the dominant criteria used to derive evolutionary relationships between species from Darwin's time until the early 1960s

• The evolutionary relationships derived from these relatively subjective observations were often inconclusive, and some were later proven incorrect

Page 18: Introduction to bioinformatics (I617)

Evolution and DNA Analysis: the Giant Panda Riddle

• For roughly 100 years scientists were unable to figure out which family the giant panda belongs to

• Giant pandas look like bears but have features that are unusual for bears and typical for raccoons, e.g., they do not hibernate

Page 19: Introduction to bioinformatics (I617)

Evolution and DNA Analysis: the Giant Panda Riddle

• In 1985, Stephen O’Brien and colleagues solved the giant panda classification problem using DNA sequences and bioinformatics algorithms

Page 20: Introduction to bioinformatics (I617)

Evolutionary Tree of Bears and Raccoons

Page 21: Introduction to bioinformatics (I617)

Evolutionary Trees: DNA-based Approach

• 40 years ago: Emile Zuckerkandl and Linus Pauling brought reconstructing evolutionary relationships with DNA into the spotlight

• In the first few years after Zuckerkandl and Pauling proposed using DNA for evolutionary studies, the possibility of reconstructing evolutionary trees by DNA analysis was hotly debated

• Now it is a dominant approach to studying evolution.

Page 22: Introduction to bioinformatics (I617)

Evolutionary Trees

How are these trees built from DNA sequences?

Page 23: Introduction to bioinformatics (I617)

Evolutionary Trees

How are these trees built from DNA sequences?

– leaves represent existing species

– internal vertices represent ancestors

– root represents the common evolutionary ancestor

Page 24: Introduction to bioinformatics (I617)

Rooted and Unrooted Trees

In an unrooted tree, the position of the root (“common ancestor”) is unknown. Otherwise, unrooted trees are like rooted trees.

Page 25: Introduction to bioinformatics (I617)

Distances in Trees

• Edges may have weights reflecting:
– Number of mutations on the evolutionary path from one species to another
– Time estimate for the evolution of one species into another

• In a tree T, we often compute dij(T), the length of the path between leaves i and j

dij(T) – tree distance between i and j

Page 26: Introduction to bioinformatics (I617)

Distance in Trees: an Example

d1,4 = 12 + 13 + 14 + 17 + 12 = 68
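The tree distance above is just the sum of edge weights along the unique path between two leaves. The sketch below illustrates this in Python. The tree itself is an assumption: the slide's figure is not recoverable, so the topology, the internal vertex names ("a" to "d") and all side-branch weights are made up, with only the path weights 12, 13, 14, 17, 12 taken from the example. The helper names `build_tree` and `tree_distance` are hypothetical.

```python
from collections import defaultdict


def build_tree(edges):
    """Build an undirected weighted adjacency map from (u, v, weight) triples."""
    adj = defaultdict(dict)
    for u, v, w in edges:
        adj[u][v] = w
        adj[v][u] = w
    return adj


def tree_distance(adj, i, j):
    """Return the length of the unique path between nodes i and j in a tree."""
    stack = [(i, None, 0)]  # (node, parent, distance from i)
    while stack:
        node, parent, dist = stack.pop()
        if node == j:
            return dist
        for nxt, w in adj[node].items():
            if nxt != parent:  # a tree has no cycles, so skipping the parent suffices
                stack.append((nxt, node, dist + w))
    raise ValueError("no path: nodes are not in the same tree")


# Hypothetical tree: leaves 1..6, internal vertices "a".."d".
# Only the weights on the path from leaf 1 to leaf 4 come from the example.
edges = [
    (1, "a", 12), ("a", "b", 13), ("b", "c", 14), ("c", "d", 17), ("d", 4, 12),
    (2, "a", 11), (3, "b", 16), (5, "c", 9), (6, "d", 20),
]
tree = build_tree(edges)
print(tree_distance(tree, 1, 4))  # 12 + 13 + 14 + 17 + 12 = 68
```

Because a tree has exactly one path between any two vertices, a plain depth-first walk is enough; no shortest-path algorithm is needed.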

Page 27: Introduction to bioinformatics (I617)

Distance Matrix

• Given n species, we can compute the n x n distance matrix Dij

• Dij may be defined as the edit distance between a gene in species i and species j, where the gene of interest is sequenced for all n species.

Dij – edit distance between i and j
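As a rough illustration of how such a matrix could be filled in, the sketch below computes pairwise edit (Levenshtein) distances in Python. The sequences are made-up toy strings, not real genes, and the function names `edit_distance` and `distance_matrix` are hypothetical. Unit costs for insertion, deletion and substitution are an assumption; the slides do not specify a scoring scheme.

```python
def edit_distance(s, t):
    """Levenshtein distance: minimum number of unit-cost insertions,
    deletions and substitutions needed to turn s into t."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for a_idx, a in enumerate(s, start=1):
        curr = [a_idx]
        for b_idx, b in enumerate(t, start=1):
            cost = 0 if a == b else 1
            curr.append(min(
                prev[b_idx] + 1,         # delete a from s
                curr[b_idx - 1] + 1,     # insert b into s
                prev[b_idx - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def distance_matrix(seqs):
    """Return the n x n matrix D with D[i][j] = edit distance of seqs[i], seqs[j]."""
    n = len(seqs)
    return [[edit_distance(seqs[i], seqs[j]) for j in range(n)] for i in range(n)]


# Toy sequences standing in for the same gene sequenced in four species.
genes = ["ACGTAC", "ACGTTC", "AGGTTC", "TCGTA"]
D = distance_matrix(genes)
for row in D:
    print(row)
```

The resulting matrix is symmetric with zeros on the diagonal, which is the form expected by distance-based tree reconstruction.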

Page 28: Introduction to bioinformatics (I617)

Fitting Distance Matrix

• Given n species, we can compute the n x n distance matrix Dij

• Evolution of these genes is described by a tree that we don’t know.

• We need an algorithm to construct a tree that best fits the distance matrix Dij

Page 29: Introduction to bioinformatics (I617)

Reconstructing a 3-Leaved Tree

• Tree reconstruction for any 3x3 matrix is straightforward

• We have 3 leaves i, j, k and a center vertex c

Observe:

dic + djc = Dij

dic + dkc = Dik

djc + dkc = Djk
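Solving the three equations above for the edge lengths gives dic = (Dij + Dik - Djk) / 2, djc = (Dij + Djk - Dik) / 2 and dkc = (Dik + Djk - Dij) / 2. The Python sketch below applies these formulas; the function name `three_leaf_tree` and the input distances are illustrative assumptions, not values from the slides.

```python
def three_leaf_tree(D_ij, D_ik, D_jk):
    """Return edge lengths (d_ic, d_jc, d_kc) of the star tree with
    leaves i, j, k and center c that fits the three pairwise distances."""
    d_ic = (D_ij + D_ik - D_jk) / 2
    d_jc = (D_ij + D_jk - D_ik) / 2
    d_kc = (D_ik + D_jk - D_ij) / 2
    if min(d_ic, d_jc, d_kc) < 0:
        # A negative edge means the distances violate the triangle inequality.
        raise ValueError("distances cannot be fitted by a tree with non-negative edges")
    return d_ic, d_jc, d_kc


# Illustrative distances (assumed for this example).
print(three_leaf_tree(7, 9, 10))  # (3.0, 4.0, 6.0)
```

Checking the result against the original equations: 3 + 4 = 7, 3 + 6 = 9 and 4 + 6 = 10.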

Page 30: Introduction to bioinformatics (I617)

Turnip vs Cabbage: Look and Taste Different

• Although cabbages and turnips share a recent common ancestor, they look and taste different

Page 31: Introduction to bioinformatics (I617)

Turnip vs Cabbage: Comparing Gene Sequences Yields No Evolutionary Information

Page 32: Introduction to bioinformatics (I617)

Turnip vs Cabbage: Almost Identical mtDNA gene sequences

• In the 1980s Jeffrey Palmer studied the evolution of plant organelles by comparing the mitochondrial genomes of the cabbage and turnip

• 99% similarity between genes

• These almost identical gene sequences differed in gene order

• This study helped pave the way to analyzing genome rearrangements in molecular evolution

Page 33: Introduction to bioinformatics (I617)

Turnip vs Cabbage: Different mtDNA Gene Order

• Gene order comparison:

(Figure: gene order, before and after)

Evolution is manifested as the divergence in gene order

Page 38: Introduction to bioinformatics (I617)

Transforming Cabbage into Turnip

Reversal distance
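The reversal distance between two gene orders is the minimum number of reversals (flipping a contiguous block of genes, which also flips each gene's orientation) needed to turn one order into the other. For very short gene orders it can be found by brute force, as in the breadth-first search sketched below in Python. The signed permutation used here is an illustrative assumption, since the gene orders in the slide figures are not recoverable, and the function names are hypothetical. Exhaustive search only scales to a handful of genes; it is not the efficient algorithm used in practice.

```python
from collections import deque


def reversals(perm):
    """Yield every signed permutation reachable from perm by one reversal."""
    n = len(perm)
    for i in range(n):
        for j in range(i, n):
            # Reverse the block perm[i..j] and flip the sign of each gene in it.
            segment = tuple(-g for g in reversed(perm[i:j + 1]))
            yield perm[:i] + segment + perm[j + 1:]


def reversal_distance(start, target):
    """Minimum number of reversals turning start into target (brute-force BFS)."""
    start, target = tuple(start), tuple(target)
    if start == target:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        perm, dist = queue.popleft()
        for nxt in reversals(perm):
            if nxt == target:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    raise ValueError("target is not a signed rearrangement of start")


# Illustrative gene orders: signs encode gene orientation.
order_a = (1, -5, 4, -3, 2)
order_b = (1, 2, 3, 4, 5)
print(reversal_distance(order_a, order_b))  # 3
```

One shortest sequence for this example is (1, -5, 4, -3, 2) → (1, -2, 3, -4, 5) → (1, 2, 3, -4, 5) → (1, 2, 3, 4, 5).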

Page 39: Introduction to bioinformatics (I617)

History of Chromosome X

Rat Consortium, Nature, 2004