BINF6201/8201: Molecular Sequence Analysis Dr. Zhengchang Su Office: 351 Bioinformatics Building Email: [email protected] Office hours: Tuesday and Thursday: 2:00~3:00pm 08-23-2010


BINF6201/8201: Molecular Sequence Analysis

Dr. Zhengchang Su

Office: 351 Bioinformatics Building

Email: [email protected]

Office hours: Tuesday and Thursday: 2:00~3:00pm

08-23-2010


Textbook and reading materials

Textbook: Bioinformatics and Molecular Evolution by Paul G. Higgs and Teresa K. Attwood, Blackwell Publishing, 2005.

Additional readings from the current literature may be assigned as appropriate.

All lecture slides will be available online at http://bioinfo.uncc.edu/zhx/binf8201/binf8201.html


Student Evaluation

Weekly or bi-weekly homework assignments (30%); Ph.D. students may have additional assignments.

Two midterm exams (60%): 10/5 (Tuesday) and 12/14 (Tuesday).

Classroom participation will count for 10% of the grade.


Sequence data explosions

Three almost equivalent biological sequence databases make up the International Nucleotide Sequence Database Collaboration:

1. GenBank at NCBI

2. European Molecular Biology Laboratory (EMBL) Nucleotide Sequence Database at the European Bioinformatics Institute (EBI)

3. DNA Data Bank of Japan (DDBJ)

Features

1. All published biological sequences are requested to be deposited in one of these three databases;

2. Data are exchanged among the three databases on a daily basis.


Data explosions

Both the number/length of sequences and the number of transistors in a CPU increase exponentially with time. However, the number/length of sequences increases even faster than the number of transistors in a CPU.

$$N(t) = N_0 e^{rt}; \qquad \ln N(t) = \ln N_0 + rt.$$

[Figure: N(t) and ln N(t) plotted against time t]
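Because the exponential model N(t) = N_0 e^{rt} makes ln N(t) a straight line in t, the growth rate r can be estimated with an ordinary least-squares line fit on the logarithms. The following is a minimal sketch, not part of the original slides: the function names are hypothetical and the data are synthetic, chosen only so that the count doubles every two time units.

```python
import math

def fit_exponential(times, counts):
    """Least-squares fit of ln N(t) = ln N0 + r*t; returns (N0, r)."""
    logs = [math.log(n) for n in counts]
    n = len(times)
    mean_t = sum(times) / n
    mean_y = sum(logs) / n
    # Slope and intercept of the best-fit line through (t, ln N).
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, logs))
    r = sxy / sxx
    n0 = math.exp(mean_y - r * mean_t)
    return n0, r

def doubling_time(r):
    """Time needed for N(t) to double when it grows at rate r."""
    return math.log(2) / r

# Synthetic data (NOT real database statistics): N doubles every 2 time units.
times = [0, 2, 4, 6, 8]
counts = [100 * 2 ** (t / 2) for t in times]

n0, r = fit_exponential(times, counts)
print(f"N0 = {n0:.1f}, r = {r:.4f}, doubling time = {doubling_time(r):.2f}")
```

Comparing the fitted r (or the doubling time) for sequence data with that for transistor counts is one way to quantify the statement that one curve grows faster than the other.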


Sequence data explosions are the result of the continuous development of new sequencing technologies:

Chain termination (Sanger) method (1977)

Automation of sequence determination (late 1980s)

Shotgun sequencing strategy (1995)

Next-generation sequencing technologies (2004)

1. 454 pyrosequencing: 454 Life Sciences/Roche Diagnostics

2. Solexa sequencing: Illumina

3. SOLiD sequencing: Applied Biosystems

4. Helicos BioSciences

5. Pacific Biosciences

6. Polonator: open source


Data explosions

Since 1995, the number of sequenced genomes has also increased exponentially.

               Archaea   Bacteria   Eukaryota   Total
Complete            92       1135         133    1360
In pipeline        186       4804        1548    6538

As of 8-19-2010, http://www.genomesonline.org


Data explosions

Since 2006, the number of metagenome sequences has increased exponentially thanks to the advent of next-generation sequencing technologies. As of September 2009, about 200 metagenomes had been sequenced or were being sequenced.

http://www.genomesonline.org


Data explosions

The speed of computers also increases exponentially with time. However, how to use these ever more powerful computers to solve biological problems remains a very challenging task for the computer science and biology research communities.


Data explosions

More and more biological research relies on computational analysis.


What is genomics?

The availability of whole-genome sequences of organisms has led to the birth of genomics, which studies organisms based on the genetic information encoded in their genomes.

According to the subject of study, genomics can be divided into:

1. Functional genomics, which is coupled with the development of relevant high-throughput technologies, such as:

• Microarray/RNA-Seq: transcriptomics

• Mass spectrometry: proteomics

• Nuclear magnetic resonance (NMR) and mass spectrometry: metabolomics

2. Comparative/evolutionary genomics


What is Bioinformatics?

For a short answer: “Bioinformatics is the use of computational methods to study biological data and problems”.

For a more detailed answer, bioinformatics is:

1. “The development and use of computational methods for studying the structure, function, and evolution of genes, proteins and whole genomes;”

2. “The development and use of methods for the management and analysis of biological information arising from genomics and high-throughput experiments.”


Population genetics, molecular evolution and sequence analysis

According to evolutionary theory, biological sequences are related to one another through heredity and variation.

Sequence analysis methods are thus based on the principles of sequence evolution.

Therefore, to analyze sequences, we must understand:

1. how genes (loci) change dynamically within a population of the same species (population genetics); and

2. how gene sequences change during the course of evolution among different species (molecular evolution).


Sequence Similarity

The similarity of two sequences can be assessed by aligning them with an alignment method/algorithm, such as BLAST or the Smith-Waterman algorithm.

Two parameters describe the similarity of two aligned sequences:

1. Identity: the fraction of aligned positions with identical residues

2. Similarity: the fraction of aligned positions with identical or similar residues (marked "+" in the alignment below)

Identities = 38/139 (27%), Similarity = 66/139 (47%), Gaps = 9/139 (6.5%)

    LELTYIVNFGSELAVVSMLPTFFETTFDLPKATAGILASCFAFVNLVARPAGGLISDSVG
    +   Y + FG  +A  + LPT+  T +      AG   + FA   ++ARP GG +SD +
    MSFLYAIVFGGFVAFSNYLPTYITTIYGFSTVDAGARTAGFALAAVLARPVGGWLSDRIA

    SRKNTMGFLTAGLGVGYLVMSMIKPGTFTGTTGIAVAVVITMLASFFVQSGEGATFALVP
     R   +  L     + +       P  ++  T I +AV + +        G G  FA V
    PRHVVLASLAGTALLAFAAALQPPPEVWSAATFITLAVCLGV--------GTGGVFAWVA

    -LVKRRVTGQVAGLVGAYGNVG
            G V G+V A G +G
    RRAPAASVGSVTGIVAAAGGLG
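The identity, similarity, and gap counts reported for an alignment can be recomputed from its three rows (first sequence, match line, second sequence). The sketch below is illustrative only: `alignment_stats` is a hypothetical helper, and the input is just the final segment of the alignment above, with the spacing of its match line restored by hand, which is an assumption about the original layout. It counts a "+" in the match line as a similar (but not identical) pair.

```python
def alignment_stats(query, midline, subject):
    """Count identities, positives (identical or '+'), and gap columns."""
    if not (len(query) == len(midline) == len(subject)):
        raise ValueError("all three alignment rows must have the same length")
    length = len(query)
    # A column is an identity when both rows carry the same residue.
    identities = sum(1 for q, s in zip(query, subject) if q == s and q != "-")
    # '+' in the match line marks a similar but non-identical pair.
    positives = identities + midline.count("+")
    # A gap column has '-' in either row.
    gaps = sum(1 for q, s in zip(query, subject) if "-" in (q, s))
    return {"length": length, "identities": identities,
            "positives": positives, "gaps": gaps}

# Final segment of the alignment shown above (match-line spacing is assumed).
query   = "-LVKRRVTGQVAGLVGAYGNVG"
midline = "        G V G+V A G +G"
subject = "RRAPAASVGSVTGIVAAAGGLG"

stats = alignment_stats(query, midline, subject)
for name in ("identities", "positives", "gaps"):
    print(f"{name}: {stats[name]}/{stats['length']} "
          f"({100 * stats[name] / stats['length']:.1f}%)")
```

Dividing each count by the alignment length gives percentages of the kind reported in the header line of the alignment.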


Homologous Sequence

Homology: If the similarity of two sequences is high enough, it is highly likely that they have evolved from a common ancestor, and we say that they are homologous to each other.

For example, if two sequences of 100 amino acids have 80% identical residues, the probability that 80 given positions match by chance is (1/20)^80, or about 10^-104, assuming the 20 amino acids are equally frequent and positions are independent.

Homology of two sequences can usually only be inferred computationally; it is difficult to test experimentally.
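The chance-similarity argument can be checked numerically. The sketch below is an illustration under simplifying assumptions that are not from the slides beyond the 1/20 figure: each position matches independently with probability 1/20, and the function name is hypothetical. It computes both the probability that 80 specified positions all match and the binomial probability that at least 80 of 100 positions match, which also accounts for the many ways of choosing the matching positions.

```python
from math import comb

def prob_at_least_k_matches(n, k, p):
    """Binomial tail P(X >= k) for n independent positions, match probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

p_match = 1 / 20        # assumes 20 equally frequent amino acids
naive = p_match ** 80   # probability that 80 specified positions all match
tail = prob_at_least_k_matches(100, 80, p_match)

print(f"(1/20)^80                   = {naive:.2e}")
print(f"P(at least 80 of 100 match) = {tail:.2e}")
```

Under these assumptions the binomial tail is many orders of magnitude larger than (1/20)^80, yet both are vanishingly small, so the conclusion that such similarity is not due to chance is unchanged.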


Orthologs and Paralogs

There are two distinct types of homologous relationships, which differ in their evolutionary history and functional implications.

• Orthologs: evolutionary counterparts derived from a single ancestral gene in the last common ancestor of the two given species. Orthologous genes are therefore related through vertical evolution (speciation), and they typically have the same function.

• Paralogs: homologous genes that evolved through duplication within the same or an ancestral genome. Paralogous genes are therefore related through duplication events, and they do not necessarily have the same function.

[Figure: gene tree showing duplication and speciation events]
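The ortholog/paralog distinction can be expressed as a rule on a gene tree: two genes are orthologs if their last common ancestor node is a speciation event, and paralogs if it is a duplication event. The following sketch applies that rule to a toy tree. The tree encoding, the gene names, and `classify_pairs` are all hypothetical illustrations; in practice the event labels would have to be inferred, for instance by reconciling a gene tree with a species tree.

```python
def leaves(node):
    """Return the leaf gene names below a node."""
    if isinstance(node, str):
        return [node]
    _, left, right = node
    return leaves(left) + leaves(right)

def classify_pairs(node, out=None):
    """Label each gene pair by the event at its last common ancestor node."""
    if out is None:
        out = {}
    if isinstance(node, str):
        return out
    event, left, right = node
    label = "orthologs" if event == "speciation" else "paralogs"
    # Genes on opposite sides of this node have this node as their LCA.
    for a in leaves(left):
        for b in leaves(right):
            out[frozenset((a, b))] = label
    classify_pairs(left, out)
    classify_pairs(right, out)
    return out

# Hypothetical example: a duplication produced genes A and B before the
# human and mouse lineages split. Nodes are (event, left, right) tuples.
tree = ("duplication",
        ("speciation", "A_human", "A_mouse"),
        ("speciation", "B_human", "B_mouse"))

pairs = classify_pairs(tree)
print("A_human vs A_mouse:", pairs[frozenset(("A_human", "A_mouse"))])
print("A_human vs B_mouse:", pairs[frozenset(("A_human", "B_mouse"))])
print("A_human vs B_human:", pairs[frozenset(("A_human", "B_human"))])
```

Note that A_human and B_mouse are paralogs even though they come from different species, because the event separating them is the duplication.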


When the similarity between two sequences is very low, say, 8% identity, they could still be homologous as a result of divergent evolution.

Divergently evolved genes usually have similar biochemical functions.

[Figure: Divergent evolution. Homologues arise from a common ancestral sequence through speciation or duplication.]


When the similarity between two sequences is very low, say, 8%, they could also be of different origins, with the observed sequence similarity resulting from convergent evolution under functional selection. Such sequences are called analogues.

Analogues may have similar biochemical functions, and they usually share only a few amino acids, called motifs, for example in the active sites of enzymes.

[Figure: Convergent evolution. Analogues arise from different ancestral sequences.]


Horizontal gene transfer (HGT)

During evolution, an organism inherits its genes from its ancestors (vertical gene transfer). However, it can also obtain genes from other species, genera, or even higher taxa. This phenomenon is called horizontal gene transfer or lateral gene transfer.

HGT is pervasive, particularly in prokaryotes, and is believed to be a major driving force of evolution.

[Figure: tree of Archaea, Bacteria, and Eukaryota descending from the LCA (last common ancestor), showing vertical gene transfer along branches and horizontal gene transfer between branches]