Bioinformatics Basics Cyrus Chan, Peter Lo, David Lam Courtesy from LO Leung Yau’s original presentation


Biological Background

Outline

Biological Background
  Cell
  Protein
  DNA & RNA
  Central Dogma
  Gene Expression

Bioinformatics
  Sequence Analysis
  Phylogenetic Trees
  Data Mining

Biological Background – Cell

Basic unit of organisms
  Prokaryotic (lacks a cell nucleus)
  Eukaryotic
A bag of chemicals
Metabolism controlled by various enzymes
Correct working needs suitable amounts of various proteins

Picture taken from http://en.wikipedia.org/wiki/Cell_(biology)

Biological Background – Protein

Polymer of 20 types of amino acids
Folds into 3D structure
Shape determines the function
Many types
  Transcription Factors
  Enzymes
  Structural Proteins
  …

Pictures taken from http://en.wikipedia.org/wiki/Protein and http://en.wikipedia.org/wiki/Amino_acid

Biological Background – DNA & RNA

DNA
  Double stranded
  Adenine, Cytosine, Guanine, Thymine
  A-T, G-C
  Those parts coding for proteins are called genes
RNA
  Single stranded
  Adenine, Cytosine, Guanine, Uracil
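The A-T, G-C pairing rule and the T-to-U difference between DNA and RNA can be illustrated with a minimal Python sketch. The function names `reverse_complement` and `transcribe` are hypothetical, and the sketch assumes the input contains only the four DNA letters.

```python
# Watson-Crick pairing for DNA: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(dna):
    """Return the opposite strand, read in its own 5'->3' direction."""
    return "".join(COMPLEMENT[base] for base in reversed(dna.upper()))


def transcribe(coding_strand):
    """Return the RNA version of a coding strand: same letters, T replaced by U."""
    return coding_strand.upper().replace("T", "U")


if __name__ == "__main__":
    print(reverse_complement("ATGC"))  # GCAT
    print(transcribe("ATGC"))          # AUGC
```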

Picture taken from http://en.wikipedia.org/wiki/Gene

Chromosome

Chromatin Structure

Super compact packaging

euchromatin heterochromatin

Biological Background – Genes

Genes – protein coding regions

3 nucleotides code for one amino acid

There are also start and stop codons
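As a rough illustration of reading a gene three nucleotides at a time, the sketch below translates DNA from the first start codon until a stop codon. The table is deliberately partial (an assumption made to keep the example short; the full standard genetic code has 64 codons), and `translate` is a hypothetical helper name.

```python
# Partial codon table taken from the standard genetic code.
# A complete table would list all 64 codons.
CODON_TABLE = {
    "ATG": "M",                          # methionine, also the usual start codon
    "GCC": "A", "GAT": "D", "CAG": "Q",
    "GGC": "G", "TTT": "F", "AAA": "K",
    "TAA": "*", "TAG": "*", "TGA": "*",  # stop codons
}


def translate(dna):
    """Translate from the first ATG up to (not including) the first stop codon."""
    dna = dna.upper()
    start = dna.find("ATG")
    if start == -1:
        return ""                        # no start codon found
    protein = []
    for i in range(start, len(dna) - 2, 3):
        amino = CODON_TABLE.get(dna[i:i + 3], "X")  # X: codon missing from this partial table
        if amino == "*":
            break
        protein.append(amino)
    return "".join(protein)


if __name__ == "__main__":
    print(translate("ccATGGCCGATCAGTAAgg"))  # MADQ
```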

Biological Background—in a nutshell Abstractions—the Central Dogma

Functional Units: Proteins

Templates: RNAs

Blueprints: DNAs


DNA carries not only the information (data), but also the control signals about what and how much data is to be sent.

Proteins (TFs) also help.


…acatggccgatca…tcaccctgaacatgtcgctttaacctactggtgatgcacct…atgatcaggg…atactggatacagggcata….

[Figure: two genes separated by an intergenic ("non-coding") region; each gene gives RNA, which gives protein]

…acatgggcgatca…tcaccctgaacatgtcgctttaacctactggtgatgcacct…atgatcaggg…atactggatacagggcata….

[Figure: the same region with one base changed (…ggccg… becomes …gggcg…); one protein is now malfunctioning, the other is unchanged]

Genetic Disease caused by a single mutation
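The two sequence fragments shown on these slides differ at exactly one base. A small sketch (the helper name `point_differences` is hypothetical) that locates such single-base differences between two equal-length sequences:

```python
def point_differences(reference, sample):
    """List (position, reference_base, sample_base) where two sequences differ.

    Positions are 0-based. Only substitutions are handled; insertions and
    deletions would need an alignment first.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must have the same length")
    return [
        (i, ref_base, sample_base)
        for i, (ref_base, sample_base) in enumerate(zip(reference, sample))
        if ref_base != sample_base
    ]


if __name__ == "__main__":
    normal = "acatggccgatca"   # first fragment of the normal sequence
    mutated = "acatgggcgatca"  # first fragment of the mutated sequence
    print(point_differences(normal, mutated))  # [(6, 'c', 'g')]
```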


There can be multiple mutations that cause diseases (increase risks of diseases)

[Figure: DNA from different people, normal vs. disease, differing at single base positions]

SNP (single nucleotide polymorphism)

Biological Background – Sequences Abstractions

Sequences

…acatggccgatcaggctgtttttgtgtgcctgtttttctattttacgtaaatcaccctgaacatgtTTGCATCAacctactggtgatgcacctttgatcaatacattttagacaaacgtggtttttgagtccaaagatcagggctgggttgacctgaatactggatacagggcatataaaacaggggcaaggcacagactc…

FT   intron          <1..28
FT                   /gene="CREB"
FT                   /number=3
FT                   /experiment="experimental evidence…
FT                   recorded"
FT   exon            29..174
FT                   /gene="CREB"
FT                   /number=4
FT                   /experiment="experimental evidence…
FT                   recorded"
FT   intron          175..>189
FT                   /gene="CREB"
FT                   /number=4

Annotations

Visualizations

Biological Background – DNA RNA Protein

Picture taken from http://en.wikipedia.org/wiki/Gene

gene


A transcriptional regulatory network is the set of complex interactions between genes, transcription factors (TFs) and transcription factor binding sites (TFBSs).

[Figure labels: Transcription Factors, Binding sites, Genes, Promoter regions, Other functions]

Complex Interactions between Genes, TFs and TFBSs



Gene Expression Microarray Data

High throughput
Measures RNA level
Relies on A-T, G-C pairing
Can monitor expression of many genes

Picture taken from http://en.wikipedia.org/wiki/DNA_microarray_experiment

Gene Expression Microarray Data

Picture taken from http://en.wikipedia.org/wiki/DNA_microarray

Genes

Time points/Conditions

Colors: Expression (RNA) Levels

Bioinformatics

Bioinformatics—Sequence Analysis Alignments

a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences

http://en.wikipedia.org/wiki/Sequence_alignment

Bioinformatics—Sequence Analysis Pair-wise alignments

Method: dynamic programming!

No penalty for the consecutive ‘-’s before and after the sequence to be aligned
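A minimal dynamic-programming sketch of this idea follows: the whole query is aligned somewhere inside a longer target, and the '-'s before and after the query cost nothing. The scoring values (match +1, mismatch -1, gap -1) and the function name `fitting_alignment_score` are assumptions chosen for illustration, not the scheme used in the lecture notes. Only the score is computed; a traceback would be needed to recover the alignment itself.

```python
def fitting_alignment_score(query, target, match=1, mismatch=-1, gap=-1):
    """Best score for aligning all of `query` against some part of `target`.

    Unaligned ends of the target (the runs of '-' before and after the query)
    are free; gaps inside the aligned region cost `gap` each.
    """
    rows, cols = len(query) + 1, len(target) + 1
    # Row 0 stays at 0: skipping a prefix of the target is free.
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap  # query letters paired with nothing
        for j in range(1, cols):
            pair = match if query[i - 1] == target[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align the two letters
                score[i - 1][j] + gap,       # gap in the target
                score[i][j - 1] + gap,       # gap in the query
            )
    # Skipping a suffix of the target is free: take the best cell in the last row.
    return max(score[-1])


if __name__ == "__main__":
    print(fitting_alignment_score("acgt", "ttacgtcc"))  # 4 (exact occurrence)
    print(fitting_alignment_score("acgt", "ttaggtcc"))  # 2 (one mismatch)
```

The table has (len(query)+1) x (len(target)+1) cells, so time and memory grow with the product of the two lengths.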

\\Pc91106\Old_FYP\Bioinformatics for FYPs\CSC3220 Lectures

Bioinformatics—Sequence Analysis Multiple (global) sequence alignment

Also dynamic programming (but can’t scale up!)

Bioinformatics—Sequence Analysis Multiple local sequence alignment

i.e. Motif (pattern) discovery

>seq1
acatggccgatcagctggtttttgtgtgcctgtttctgaatc
>seq2
ttctattttacgtaaatcagcttgaacatgtacctactggtg
>seq3
atgcacctttgatcaataccagctagacaaacgtgtgttg
>seq4
agtccaaagatcagggctggctgaatactggatcagct
>seq5
cagctacagggcatataaaggggcaaggcacagactc

Such overrepresented patterns are often important components (e.g. TFBSs if the sequences are promoters of similar genes).

TFBSs are the controlling key holes in gene regulation!
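As a very simplified illustration of finding overrepresented patterns, the sketch below lists the exact k-mers that occur in every one of the five example sequences. This is an assumption-laden toy: real motif discovery tools allow mismatches and score patterns statistically, and the names `kmers` and `shared_kmers` are hypothetical.

```python
# The five example sequences from the slide.
SEQUENCES = [
    "acatggccgatcagctggtttttgtgtgcctgtttctgaatc",
    "ttctattttacgtaaatcagcttgaacatgtacctactggtg",
    "atgcacctttgatcaataccagctagacaaacgtgtgttg",
    "agtccaaagatcagggctggctgaatactggatcagct",
    "cagctacagggcatataaaggggcaaggcacagactc",
]


def kmers(seq, k):
    """Set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmers(seqs, k):
    """k-mers that appear (exactly) in every sequence."""
    common = kmers(seqs[0], k)
    for seq in seqs[1:]:
        common &= kmers(seq, k)
    return common


if __name__ == "__main__":
    # "cagct" occurs in all five sequences.
    print(sorted(shared_kmers(SEQUENCES, 5)))
```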

DNA motifs

Similar DNA fragments across individuals and/or species
TFBS Motifs: DNA fragments similar to “TATAA” are common in order to recruit the polymerase to initiate transcription in eukaryotes
Expensive and time-consuming to try a large set of candidates in biological experiments

[Figure: a transcription factor (TF) binds a TFBS (controlling), e.g. TATAA, next to a gene (functioning) on the DNA; transcription gives RNA and translation gives protein]

Motif discovery

CGATTGAf

Similar controlled functions, e.g. cancer gene activities

Maximized

TFBS Motif Discovery

Motif discovery usually refers to TFBS motifs

But motif is a general term meaning “pattern”: sequence motifs, structural motifs, network motifs…

ChIP-Seq motif discovery

The same as traditional TFBS motif discovery in principle

Data input precision and scale are different
  Genome-wide: tens of thousands of sequences
  Short: 50-100bp
  Each sequence measured by some enrichment score (a peak)

Introduction

ChIP-Seq technology Peak-calling

…

High-resolution sequences from more direct binding evidence; The enriched regions are likely to contain motifs coupled with peak signals; genome-wide sequences; in vivo

Too many sequences for old-day methods

Enrichment

Introduction

ChIP-Seq technology Motifs?

…Old-day methods reapplied

Phylogenetic Trees (Phylogenies)

Preliminaries
Distance-based methods
Parsimony Methods

Adapted from: Fundamental Concepts of Bioinformatics, Michael L. Raymer, Computer Science, Biomedical Sciences, Wright State University, birg.cs.wright.edu/text/Tutorial.ppt

Phylogenetic Trees

Hypothesis about the relationship between organisms

Can be rooted or unrooted

[Figure: a rooted tree (with Root and a Time axis) and an unrooted tree over the taxa A, B, C, D, E]

birg.cs.wright.edu/text/Tutorial.ppt

Tree proliferation

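Number of rooted trees for n species:

$$N_R = \frac{(2n-3)!}{2^{n-2}\,(n-2)!}$$

Number of unrooted trees for n species:

$$N_U = \frac{(2n-5)!}{2^{n-3}\,(n-3)!}$$

Species   Number of Rooted Trees              Number of Unrooted Trees
2         1                                   1
3         3                                   1
4         15                                  3
5         105                                 15
10        34,459,425                          2,027,025
15        213,458,046,676,875                 7,905,853,580,625
20        8,200,794,532,637,891,559,375       221,643,095,476,699,771,875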
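The counts of rooted trees, (2n-3)! / (2^(n-2) (n-2)!), and unrooted trees, (2n-5)! / (2^(n-3) (n-3)!), can be checked numerically with a short sketch. The function names are hypothetical, and the case of two species for unrooted trees is handled separately because the formula is only defined for n of at least 3.

```python
from math import factorial


def rooted_trees(n):
    """Number of rooted bifurcating trees for n >= 2 species."""
    if n < 2:
        raise ValueError("need at least 2 species")
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


def unrooted_trees(n):
    """Number of unrooted bifurcating trees for n >= 2 species."""
    if n < 2:
        raise ValueError("need at least 2 species")
    if n == 2:
        return 1  # the formula below needs n >= 3
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


if __name__ == "__main__":
    for n in (2, 3, 4, 5):
        print(n, rooted_trees(n), unrooted_trees(n))
    # 5 species already give 105 rooted and 15 unrooted trees.
```

Python integers have arbitrary precision, so the same functions work unchanged for larger n.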


An ongoing didactic

Pheneticists tend to prefer distance-based metrics, as they emphasize relationships among data sets rather than the paths they have taken to arrive at their current states.

Cladists are generally more interested in evolutionary pathways, and tend to prefer more evolutionarily based approaches such as maximum parsimony.


Parsimony methods

Belong to the broader class of character-based methods of phylogenetics

Emphasize simpler, and thus more likely, evolutionary pathways

Enumerate all possible trees
Note the number of substitution events invoked by each possible tree
  Can be weighted by transition/transversion probabilities, etc.
Select the most parsimonious
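One common way to count the substitution events a given tree needs is Fitch's algorithm, which is named here as an illustration and is not taken from the slides. The sketch below is unweighted (every substitution costs 1), represents a rooted binary tree as nested tuples of leaf names, and uses hypothetical function names and toy sequences.

```python
def fitch_site(tree, bases):
    """Return (candidate_states, substitutions) for one alignment column.

    tree  : a leaf name (str) or a pair (left_subtree, right_subtree)
    bases : dict mapping leaf name -> base observed at this column
    """
    if isinstance(tree, str):               # leaf: observed base, no cost
        return {bases[tree]}, 0
    left, right = tree
    left_states, left_cost = fitch_site(left, bases)
    right_states, right_cost = fitch_site(right, bases)
    common = left_states & right_states
    if common:                              # children can agree: no new event
        return common, left_cost + right_cost
    # children disagree: one substitution is needed on this part of the tree
    return left_states | right_states, left_cost + right_cost + 1


def parsimony_score(tree, seqs):
    """Total substitutions over all columns of equal-length aligned sequences."""
    length = len(next(iter(seqs.values())))
    return sum(
        fitch_site(tree, {name: seq[i] for name, seq in seqs.items()})[1]
        for i in range(length)
    )


if __name__ == "__main__":
    seqs = {"A": "AAG", "B": "AAG", "C": "ACT", "D": "ACT"}  # toy data
    print(parsimony_score((("A", "B"), ("C", "D")), seqs))   # 2
    print(parsimony_score((("A", "C"), ("B", "D")), seqs))   # 4
```

Of the two candidate trees, the first needs fewer substitutions and would be the more parsimonious choice.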


Branch and Bound methods

Key problem – the number of possible trees grows enormously as the number of species gets large

Branch and bound – a technique that allows large numbers of candidate trees to be rapidly disregarded

Requires a “good guess” at the cost of the best tree


Parsimony – Branch and Bound

Use the UPGMA tree for an initial best estimate of the minimum cost (most parsimonious) tree

Use branch and bound to explore all feasible trees

Replace the best estimate as better trees are found

Choose the most parsimonious


Bioinformatics—Data mining

Clustering (Unsupervised learning)
  Similar things go together
  Similarity measure is critical
  Types:
    Hierarchical clustering (UPGMA)
    Partitional clustering (K-means)
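A minimal sketch of UPGMA, the hierarchical method named above: repeatedly merge the two closest clusters and set the distance from the merged cluster to every other cluster to the size-weighted average of the old distances. The input format (a dict keyed by `frozenset` pairs), the function name `upgma`, and the toy distances are assumptions for illustration; branch lengths are not computed.

```python
from itertools import combinations


def upgma(names, dist):
    """Build a UPGMA tree as nested tuples.

    names : list of item labels
    dist  : dict mapping frozenset({a, b}) -> distance between a and b
    """
    sizes = {name: 1 for name in names}  # cluster -> number of items in it
    d = dict(dist)                       # working copy; grows as clusters merge
    while len(sizes) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(sizes, 2), key=lambda pair: d[frozenset(pair)])
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        merged = (a, b)
        # Distance to every remaining cluster: average weighted by cluster size.
        for other in sizes:
            d[frozenset((merged, other))] = (
                size_a * d[frozenset((a, other))]
                + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        sizes[merged] = size_a + size_b
    return next(iter(sizes))


if __name__ == "__main__":
    toy = {
        frozenset(("A", "B")): 2, frozenset(("C", "D")): 3,
        frozenset(("A", "C")): 8, frozenset(("A", "D")): 8,
        frozenset(("B", "C")): 8, frozenset(("B", "D")): 8,
    }
    print(upgma(["A", "B", "C", "D"], toy))  # (('A', 'B'), ('C', 'D'))
```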


Classification (Supervised Learning)
  To predict!
  Pre-processing: tidy up your materials!
  Feature selection: the key points to go over
  Classifier: the thinking style/manner of how to combine the key points and get some answer
  Training: your practice of your thinking manner with answers known
  Validation: mock quiz to evaluate what you’ve learnt from the training
  Testing: your examination!
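The practice / mock quiz / examination analogy corresponds to keeping three disjoint subsets of the labelled data. A small sketch of such a split follows; the 60/20/20 proportions, the fixed seed and the function name `split_dataset` are arbitrary assumptions.

```python
import random


def split_dataset(samples, train=0.6, validation=0.2, seed=0):
    """Shuffle samples and cut them into train / validation / test lists.

    Whatever is left after the train and validation parts becomes the test set.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],                 # practice with answers known
        shuffled[n_train:n_train + n_val],  # mock quiz
        shuffled[n_train + n_val:],         # examination: touched only once
    )


if __name__ == "__main__":
    train_set, val_set, test_set = split_dataset(range(10))
    print(len(train_set), len(val_set), len(test_set))  # 6 2 2
```

Keeping the test set untouched until the very end is what makes its score an honest estimate.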

\\Pc91106\Old_FYP\Bioinformatics for FYPs\CSC5180 Data Mining Notes\c3class1.pdf

Underfitting & Overfitting


Evaluation (scores!)
  Confusion Matrix
  Binary Classification Performance Evaluation Metrics
    Accuracy
    Sensitivity/Recall/TP Rate
    Specificity/TN Rate
    Precision/PPV
    …

\\Pc91106\Old_FYP\Bioinformatics for FYPs\CSC5180 Data Mining Notes\c3class3.pdf

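$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\text{Sensitivity} = \frac{TP}{TP + FN}$$

$$\text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Specificity} = \frac{TN}{TN + FP}$$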
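The four metrics follow directly from the confusion-matrix counts: accuracy is (TP+TN)/(TP+TN+FP+FN), sensitivity is TP/(TP+FN), specificity is TN/(TN+FP) and precision is TP/(TP+FP). A small sketch, with hypothetical function names and made-up counts, assuming no denominator is zero:

```python
def confusion_counts(actual, predicted):
    """Count TP, FP, TN, FN from two equal-length lists of 0/1 labels."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return tp, fp, tn, fn


def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),  # recall, TP rate
        "specificity": tn / (tn + fp),  # TN rate
        "precision": tp / (tp + fp),    # PPV
    }


if __name__ == "__main__":
    # Made-up counts for illustration only.
    print(binary_metrics(tp=40, fp=10, tn=45, fn=5))
```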


Evaluation
  ROC (Receiver Operating Characteristics)
  Trade-off between positive hits (TP) and false alarms (FP)
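An ROC curve can be traced by lowering the decision threshold one prediction at a time and recording the false-positive rate and true-positive rate at each step. The sketch below assumes both classes are present and does not treat tied scores specially; `roc_points` and `area_under_curve` are hypothetical names.

```python
def roc_points(scores, labels):
    """Return the ROC curve as a list of (false_positive_rate, true_positive_rate).

    scores : classifier scores, higher means "more likely positive"
    labels : true classes, 1 for positive and 0 for negative
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Visit predictions from the most to the least confident.
    for _, label in sorted(zip(scores, labels), key=lambda pair: -pair[0]):
        if label == 1:
            tp += 1  # one more positive hit
        else:
            fp += 1  # one more false alarm
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    return sum(
        (x2 - x1) * (y1 + y2) / 2
        for (x1, y1), (x2, y2) in zip(points, points[1:])
    )


if __name__ == "__main__":
    curve = roc_points([0.9, 0.8, 0.7, 0.6], [1, 1, 0, 0])
    print(curve)
    print(area_under_curve(curve))  # 1.0 for this perfectly ranked toy example
```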

Statistical Tests

Many different kinds of tests
You should choose the appropriate ones

Where to get data

Databases
  Transfac: TF and TFBS sequence data
  Protein Data Bank: protein and protein-DNA, protein-ligand complexes 3D structures (sequences and atoms included as well)
  There are thousands more… find the ones that fit your topic

Where to get data

Typical format: tags + descriptions in plain text
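A sketch of reading such a "tag + description" flat file is shown below. The record layout (a short tag, spaces, then free text, with `//` ending a record) and the sample entry are simplified assumptions made for illustration; each real database documents its own tags and rules, and `parse_tagged_record` is a hypothetical helper.

```python
def parse_tagged_record(text):
    """Group the description lines of one record by their leading tag.

    Returns a dict mapping tag -> list of description strings, in file order.
    """
    record = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("//"):
            continue  # skip blank lines and the end-of-record marker
        tag, _, description = line.partition(" ")
        record.setdefault(tag, []).append(description.strip())
    return record


if __name__ == "__main__":
    # Hypothetical entry, not taken from any real database.
    sample = (
        "ID   EXAMPLE1\n"
        "DE   hypothetical entry\n"
        "DE   second description line\n"
        "//\n"
    )
    print(parse_tagged_record(sample))
```

Repeated tags are kept as a list because descriptions often continue over several lines.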

Where to get data

We have to parse and pre-process data before using
  Tedious and time-consuming process
  Some packages can help accelerate this: BioPerl, BioJava, BioPython…
Besides data, sometimes evaluation has to be done with literature evidence (manual!)

Where to get papers (published)

A difficult question…
  Your research quality, your writing and organization, plus some luck…
  Know yourself and know your competition: learn from the published papers and compare your research topic and level to them

Where to find papers to read

Play on the CS side:
  IEEE Transactions, ACM Transactions
  IEEE and ACM top conferences
Play on the Bioinformatics side:
  Bioinformatics, BMC Bioinformatics, Nucleic Acids Research
  PLoS Computational Biology…
Aim high:
  Nature (series), Science
  PNAS, Cell, …

Roadmap

Not The End

Your corresponding tutor will have more project-specific stuff to tell you

Thanks Q & A
