Symbol  Meaning            Description
R       A or G             puRine
Y       C or T             pYrimidine
W       A or T             Weak hydrogen bonds
S       G or C             Strong hydrogen bonds
M       A or C             aMino groups
K       G or T             Keto groups
H       A, C, or T (U)     not G (H follows G)
B       G, C, or T (U)     not A (B follows A)
V       G, A, or C         not T (U) (V follows U)
D       G, A, or T (U)     not C (D follows C)
N       G, A, C or T (U)   aNy nucleotide
Nomenclature of nucleic acids
Base      Symbol  Occurrence
Adenine   A       DNA, RNA
Guanine   G       DNA, RNA
Cytosine  C       DNA, RNA
Thymine   T       DNA
Uracil    U       RNA
+ strand 5´-ACGGTCGCTGTCGGTAGC-3´
- strand 3´-TGCCAGCGACAGCCATCG-5´
E.g. in FASTA format:
>gene sequence|gi12345|chr17|-
GCTACCGACAGCGACCGT
DNA sequences are always written from 5´ to 3´
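The relation between the two strands shown above can be checked with a few lines of Python. This is a minimal sketch; the function name `reverse_complement` is chosen here for illustration and is not part of the slides.

```python
# Complement table for the four DNA bases plus the ambiguity code N.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the opposite strand, written 5' to 3'."""
    # Complement every base, then reverse so the result reads 5' -> 3'.
    return seq.translate(COMPLEMENT)[::-1]


plus_strand = "ACGGTCGCTGTCGGTAGC"
minus_strand = reverse_complement(plus_strand)
print(minus_strand)  # the sequence stored in the FASTA record for the - strand
```

Applying the function twice returns the original + strand, which is a quick sanity check for any strand conversion.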
Positions in the genome (genome assembly) are given per chromosome
e.g. human GRCh37/hg19
chr11:1‐100 chr11:49,686,777‐49,689,777
Positions on the chromosome are counted from position 1 for both strands (coordinates always refer to the + strand)
+ strand 5´-ACGGTCGCTG…………TCGGTAGC-3´
- strand 3´-TGCCAGCGAC…………AGCCATCG-5´
chr11:1 2523 2529
Nomenclature
A  Ala  alanine
B  Asx  aspartic acid or asparagine
C  Cys  cysteine
D  Asp  aspartic acid
E  Glu  glutamic acid
F  Phe  phenylalanine
G  Gly  glycine
H  His  histidine
I  Ile  isoleucine
K  Lys  lysine
L  Leu  leucine
M  Met  methionine
N  Asn  asparagine
P  Pro  proline
Q  Gln  glutamine
R  Arg  arginine
S  Ser  serine
T  Thr  threonine
V  Val  valine
W  Trp  tryptophan
X  Xaa  unknown or 'other' amino acid
Y  Tyr  tyrosine
Z  Glx  glutamic acid or glutamine
Amino Acids
Translation, genetic code and reading frames
Peptide chain, amino acid sequence, proteins
Protein sequences are always written from the N-terminal end to the C-terminal end
backbone
sidechains
E.g. SCD sequence in FASTA format
Regulation of eukaryotic transcription
Different levels of regulation
Transcriptional regulation has largest effect on phenotype!
Chromatin states
Ernst et al. Nature 2011.
DNA methylation
Cytosine 5-Methylcytosine
microRNA and siRNA
Organization of the human genome
E.g. > 1 million copies of Alu-repeats
Sequence alignment
Exact string matching
• Naive
• Z-box algorithm (Boyer-Moore, Knuth-Morris-Pratt, …)
• Suffix arrays, suffix tries
• Burrows-Wheeler transform (BWT), FM-index
• Hash tables (spaced seeds)
Aligning 2 sequences
• Dot matrix
• Gaps (gap open and extension, linear and convex penalty function)
• Distances between sequences (Hamming distance, Levenshtein distance, edit operations)
• Substitution matrices
  – Odds ratio (random model [independent = q_xi * q_yj] vs. match model [joint = p_xi,yj])
  – Log odds ratio (scores are additive)
  – PAM, BLOSUM
• Dynamic programming
  – Idea: new best alignment = previous best alignment + local best alignment
  – Global alignment (Needleman-Wunsch): construct matrix F with F(0,0)=0; backtrace from bottom right (= best score) to top left (= start of sequences)
  – Local alignment (Smith-Waterman): construct matrix F with F(0,j) = F(i,0) = 0 and include 0 as an option; backtrace from the maximum score to 0
  – Used for read mapping in NGS applications
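The two dynamic programming schemes can be sketched as follows. This is an illustration under simplifying assumptions, not a reference implementation: the scoring values (match +1, mismatch -1, linear gap penalty -2) are chosen here for the example, whereas real aligners use substitution matrices and separate gap open/extension penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # F[i][j] = best score for aligning a[:i] with b[:j]; F[0][0] = 0.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] + gap, F[i][j - 1] + gap)
    # Backtrace from the bottom-right cell to the top-left cell.
    i, j, top, bottom = n, m, [], []
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return F[n][m], "".join(reversed(top)), "".join(reversed(bottom))


def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score; 0 is always an option in every cell."""
    n, m = len(a), len(b)
    F = [[0] * (m + 1) for _ in range(n + 1)]  # first row and column stay 0
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(0, F[i - 1][j - 1] + s, F[i - 1][j] + gap, F[i][j - 1] + gap)
            best = max(best, F[i][j])
    return best


print(needleman_wunsch("ACGT", "AGT"))
print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

The only differences between the two functions are the initialization of the first row and column and the extra 0 in the maximum, which is what lets a local alignment start anywhere.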
Sequence alignment
Search similar sequences in database (db)
• W-mer indexing (hash tables)
• FASTA (evaluate position differences of small words (2-6 characters) in query and db; hash tables)
• BLAST (Basic Local Alignment Search Tool)
  – Query words (W=3), find neighbourhood words with score > threshold T (e.g. T=13) = seeds
  – Extend seeds until the score drops off under X = high-scoring segment pairs (HSP)
  – Evaluate significance (E(S) = K*m*n*e^(-λS))
  – BLAST variants (blastn, blastp, blastx, tblastx, tblastn, MegaBlast)
• BLAT (BLAST-like alignment tool; mostly for highly similar DNA sequences)
Multiple sequence alignment
• Dynamic programming in n dimensions (computationally very intensive)
  – Compute pairwise alignments to find an upper bound
  – Create a heuristic multiple alignment to find a lower bound
  – Search in the n-dimensional scoring matrix
• Progressive tree alignment (ClustalW)
  – Perform hierarchical clustering using distances between sequences (e.g. edit distance): merge sequences to find ancestor sequences by finding sequences with minimum edit distance to the two child sequences
  – Assign weights to each branch of the tree, based on the distance between sequences
  – Align sequences (starting from the closest, using a version of dynamic programming) using the weights in the score function
Profile Hidden Markov Model
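ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC
Emission probabilities (match states M1-M6, insertion state I between M3 and M4):
M1: P(A)=0.8 P(C)=0.0 P(G)=0.0 P(T)=0.2
M2: P(A)=0.0 P(C)=0.8 P(G)=0.2 P(T)=0.0
M3: P(A)=0.8 P(C)=0.2 P(G)=0.0 P(T)=0.0
I:  P(A)=0.2 P(C)=0.4 P(G)=0.2 P(T)=0.2
M4: P(A)=1.0 P(C)=0.0 P(G)=0.0 P(T)=0.0
M5: P(A)=0.0 P(C)=0.0 P(G)=0.2 P(T)=0.8
M6: P(A)=0.0 P(C)=0.8 P(G)=0.2 P(T)=0.0
Transition probabilities: M1→M2 1.0, M2→M3 1.0, M3→I 0.6, M3→M4 0.4, I→I 0.4, I→M4 0.6, M4→M5 1.0, M5→M6 1.0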
[AT][CG][AC][ACGT]*A[TG][GC]
Regular Expressions
p(ACACATC) = 0.8*1*0.8*1*0.8*0.6*0.4*0.6*1*1*0.8*1*0.8 = 0.047
log-odds = log(p(S)/0.25^L) = log(0.047/0.25^7)
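The arithmetic above can be reproduced in a few lines. This sketch multiplies the emission and transition probabilities along the path that ACACATC takes through the model (three match states, one insertion, three match states); the use of the natural logarithm is an assumption, since the slide writes only "log".

```python
import math

# Emission probabilities of A, C, A, (inserted) C, A, T, C along the path.
emissions = [0.8, 0.8, 0.8, 0.4, 1.0, 0.8, 0.8]
# Transition probabilities between consecutive states on that path.
transitions = [1.0, 1.0, 0.6, 0.6, 1.0, 1.0]

p = math.prod(emissions) * math.prod(transitions)

# Null model: every base has probability 0.25, sequence length L = 7.
L = len(emissions)
log_odds = math.log(p / 0.25 ** L)  # natural log (assumption)

print(round(p, 3), round(log_odds, 2))
```

A positive log-odds score means the sequence is more probable under the profile than under the random model.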
insertion state
- For multiple alignments (e.g. DNA sequences)
Protein Sequence Analysis
Sequence alignment
• BLAST
• FASTA
Uses collective characteristics of a family of proteins
• Position specific score matrix (PSSM)
• Profile HMM
• ProfileScan, Pfam, CDD, Prosite, BLOCKS
• PSI-BLAST
Amino acid composition
• Hydrophobicity
• Charge
• Theoretical pI, molecular weight
Secondary structure (alpha helix, beta strand, beta sheet)
Specialized structures
Tertiary structure
Neural network for secondary structure prediction
PredictProtein
• Multi‐step predictive algorithm (Rost et al., 1994)
– Protein sequence queried against SWISS-PROT
– MaxHom used to generate iterative, profile-based multiple sequence alignment (Sander and Schneider, 1991)
– Multiple alignment fed into neural network (PHDsec)
• Accuracy: Average > 70%, Best‐case > 90%
• http://www.predictprotein.org/
SignalP
• Neural network trained based on phylogeny
– Gram-negative prokaryotic
– Gram-positive prokaryotic
– Eukaryotic
• Predicts secretory signal peptides
• http://www.cbs.dtu.dk/services/SignalP/
Signal peptide score (S)
Cleavage site score (C)
Combined Score (Y)
Two‐color microarrays
– Oligonucleotides, 60-80 mers in length
– cDNA fragments from a library (varying lengths)
Two color microarray analysis
Experimental design (Biological replicates, Technical replicates, Dye swap, Reference design)
Image analysis (align grid, identify spots and background)
Preprocessing (subtract background, filter saturated or bad spots)
Normalization (idea: expression of majority of genes is not changing across conditions)
Normalization factor N = sum[Ri]/sum[Gi] => Gi' = N*Gi, Ri' = Ri
MA‐plot [M=log2(R/G); A=log2(R*G)/2]
=> M (=log ratios) are dependent on A (average intensities)
=> LOWESS normalization
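The global normalization factor and the MA-plot coordinates defined above can be computed as in the following sketch. The function name and the toy intensities are made up for illustration; the intensity-dependent LOWESS step is not implemented here.

```python
import math


def normalize_and_ma(red, green):
    """Global normalization followed by the M and A values of an MA-plot."""
    # N = sum[Ri] / sum[Gi]; green is rescaled, red is left unchanged.
    n_factor = sum(red) / sum(green)
    green_norm = [n_factor * g for g in green]
    # M = log2(R/G) is the log ratio, A = log2(R*G)/2 the average log intensity.
    m_values = [math.log2(r / g) for r, g in zip(red, green_norm)]
    a_values = [math.log2(r * g) / 2 for r, g in zip(red, green_norm)]
    return n_factor, m_values, a_values


# Hypothetical spot intensities: the green channel is uniformly half as bright.
red = [200.0, 400.0, 800.0]
green = [100.0, 200.0, 400.0]
n_factor, m_values, a_values = normalize_and_ma(red, green)
print(n_factor, m_values, a_values)
```

In this toy case a single factor removes the dye bias completely; LOWESS is needed when the bias changes with A, which a single factor cannot capture.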
Identification of differentially expressed genes
Moderated t‐test (t=mean(M)/[(a+s)/sqrt(n)]; a is estimated from all genes)
R‐package limma
As a result, for each gene the log2 fold change (= M), the p-value, and the adjusted p-value (Benjamini-Hochberg corrected p-value, controlling the false discovery rate, FDR) are calculated
All genes with adjusted p-value < 0.1 are considered significantly differentially expressed
Affymetrix microarrays
Affymetrix chips
Analysis of one‐color arrays (Affymetrix)
Preprocessing (apply a model of perfect match (PM) and mismatch (MM) probes to estimate the background; simply computing PM-MM is not correct)
Normalization has to be done between arrays (not within an array)
Quantile normalization
Variance stabilizing normalization (VSN)
Probe summarization
Median Polish summarization
RMA (Robust Multiarray Average), available in R/Bioconductor, is the
method of choice; it includes all 3 steps, and the
resulting intensity values are log2-transformed
Identification of differentially expressed genes
(as for two-color arrays) using the R/Bioconductor
package limma
Methods to correct p‐values for multiple testing
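With 1000 tests at a significance level of 0.05, about 50 false positives are expected to be declared significant (if all null hypotheses are true).
To account for multiple testing, the following error rates are controlled (V = number of false positives, R = number of rejected hypotheses):
Family-wise error rate (FWER): P(V>0)
False discovery rate (FDR): E(V/R)
Benjamini-Hochberg: sort the p-values p(1) ≤ … ≤ p(n); the adjusted value is p(i)*n/i; if p(i)*n/i > p(i+1)*n/(i+1), set the adjusted value of p(i) to p(i+1)*n/(i+1)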
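The Benjamini-Hochberg adjustment, including the monotonicity rule in the last line above, can be sketched in plain Python. The function name and the example p-values are illustrative; results are additionally capped at 1.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the original order of the input."""
    n = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(n), key=lambda k: pvalues[k])
    adjusted = [0.0] * n
    running_min = 1.0  # also caps adjusted values at 1
    # Walk from the largest p-value down so that each adjusted value
    # can never exceed the adjusted value of the next larger p-value.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


example = [0.01, 0.02, 0.021, 0.5]
print(benjamini_hochberg(example))
```

In the example the raw products are 0.04, 0.04, 0.028 and 0.5, so the first two values are pulled down to 0.028 by the monotonicity step.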
• Potential for surveying the entire transcriptome, including novel, un‐annotated regions.
• Helps to identify expression and function of regulatory non-coding RNAs (e.g. lincRNA)
• Potential for determining gene structure and isoform level expression using reads mapping to splice junctions.
• Potential for making better presence/absence calls on regions.
• More expensive than microarrays
• No need to design probes
Transcriptome sequencing (RNAseq)
Wang et al., Nature Rev Gen, 2009
Base calling (Phred score)
Phred quality score Q and base‐calling error probabilities P
Q_Phred = -10 log10(P)
Q_Solexa = -10 log10(P / (1 - P))
For P=0.05 the quality score Q=13
Base calling (FastQ format)
@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+
;;3;;;;;;;;;;;;7;;;;;;;88
Quality scores are encoded in ASCII
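A minimal sketch of how a FASTQ record and its ASCII-encoded qualities can be decoded. The offset is an assumption: 33 (Sanger / Phred+33) is used as the default here, while older Solexa/Illumina files use 64, so the numeric values depend on which encoding the file really uses. The function names are chosen for illustration.

```python
import math


def phred_from_error(p_error):
    """Q = -10 * log10(P)."""
    return -10 * math.log10(p_error)


def error_from_phred(q):
    """Inverse of the Phred formula: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def parse_fastq_record(text, offset=33):
    """Split one 4-line FASTQ record into name, sequence and quality scores."""
    header, sequence, _plus, quality = text.strip().split("\n")
    # Each quality character encodes one score as ASCII code minus the offset.
    scores = [ord(ch) - offset for ch in quality]
    return header[1:], sequence, scores


record = """@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+
;;3;;;;;;;;;;;;7;;;;;;;88"""

name, sequence, scores = parse_fastq_record(record)
print(name, len(sequence), scores[:3])
print(round(phred_from_error(0.05)))  # the P=0.05 example from the slide
```

Sequence and quality string always have the same length, which is a cheap consistency check when reading FASTQ files.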
1. Read mapping
2. Transcriptome reconstruction
3. Expression quantification
4. Differential expression analysis
Analysis steps
Read mapping
Unspliced aligners: Bowtie
Spliced aligners (exon first vs. seed extend): TopHat (uses Bowtie, which maps exons first), SpliceMap
Tools: Bowtie, BWA, Eland, MAQ, SOAP2, GSNAP, STAR
Result is a Sequence Alignment/Map (SAM/BAM) format file
GFF/GTF files (General Feature Format, Gene Transfer Format) keep information about exon, gene and transcript positions in the genome assembly
FastQ files => Trim adaptors and bad quality reads (FASTQC)
GTF files
FPKM normalization; estimate uncertainty of mapped read to isoform
Reference genome
Advanced transformation and applying t-test
SAM/BAM file
Differentially expressed genes and isoforms between conditions
Expression quantification
Garber et al., Nature Methods, 2011
RNAseq normalization
Reads per kilobase per million (RPKM) (divide by library size and transcript length)
Quantile normalization
TMM (trimmed mean of M values).
Fragments per kilobase per million reads (FPKM) (for paired‐end sequencing)
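As a worked example of the RPKM definition above: the read count is divided by the transcript length in kilobases and by the library size in millions of mapped reads. The function name and the numbers are illustrative only.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    kilobases = transcript_length_bp / 1e3
    millions = total_mapped_reads / 1e6
    return read_count / (kilobases * millions)


# Hypothetical gene: 500 reads on a 2000 bp transcript, 20 million mapped reads.
value = rpkm(500, 2000, 20_000_000)
print(value)
```

FPKM uses the same formula with fragments (read pairs) counted instead of single reads.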
Differential expression analysis for sequencing count data
Expect Poisson distribution (Mean=Variance) as it is typical for count data
        A   B   C   D
Gene1   1  23   2   6
Gene2   0  74   8   7
Gene3  33   4  14   8
But counts for the same gene from different biological replicates have a variance exceeding the mean (overdispersion); this can be modelled by a negative binomial model.
The dispersion is estimated from the raw count distribution of all genes
Differentially expressed genes are tested by the negative binomial test, Wald test, or likelihood ratio test
Using R packages DESeq, DESeq2, edgeR
HTSeq can be used to generate the gene matrix of raw count data
How many reads are needed (depth)?
two mouse libraries (ES,EB) yeast
E.g. 20-40 million reads should be sufficient for human
Clustering
• Unsupervised or supervised (classification)
• Agglomerative: bottom-up approach, whereby single expression profiles are successively joined to form nodes.
• Divisive: top-down approach, each cluster is successively split in the same fashion, until each cluster consists of one single profile.
Similarity and distance measures
• Pearson correlation: -1 ≤ r ≤ 1
• Euclidean distance
• Manhattan distance: d_M = ∑_{i=1..n} |x_i - y_i|
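The three measures listed above can be written out in plain Python for two expression profiles x and y of equal length. This is a sketch with standard textbook definitions; the function names and the toy profiles are chosen for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient r, between -1 and 1."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def euclidean(x, y):
    """Square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """Sum of the absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]
print(pearson(x, y), euclidean(x, y), manhattan(x, y))
```

The two toy profiles have the same shape but different magnitude, so the correlation is perfect while both distances are large; this is why correlation is often preferred for clustering expression profiles.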
Hierarchical clustering
• Agglomerative (bottom up), unsupervised
• Cluster genes or samples (or both = biclustering)
• Distances are encoded in a dendrogram (tree)
• Cut tree to get clusters
• Pearson correlation (usually used)
• Computationally intensive (correlation matrix)
1. Identify clusters (items) with closest distance
2. Join to new clusters
3. Compute distance between clusters (items) (see linkage)
4. Return to step 1
6 cluster
15 cluster
Linkage
Single-linkage clustering: minimal distance
Complete-linkage clustering: maximal distance
Average-linkage clustering: calculated using average distance (UPGMA); average of the distances, not of the expression values
Weighted pair-group average: like UPGMA but weighted according to cluster size
Within-groups clustering: average of merged cluster is used instead of cluster elements
Ward's method: smallest possible increase in the sum of squared errors
• partition n genes into k clusters, where k has to be predetermined
• k-means clustering minimizes the variability within clusters and maximizes the variability between clusters
• Moderate memory and time consumption
K‐means
1. Generate random points ("cluster centers") in n dimensions (results depend on these seeds).
2. Compute distance of each data point to each of the cluster centers.
3. Assign each data point to the closest cluster center.
4. Compute new cluster center position as average of points assigned.
5. Loop to (2), stop when cluster centers do not move very much.
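The five steps above can be sketched in plain Python. Some choices here are assumptions made for the example and not part of the slides: the random seeds are drawn from the data points themselves, Euclidean distance is used, an optional `initial_centers` argument allows reproducible runs, and an empty cluster simply keeps its old center.

```python
import math
import random


def kmeans(points, k, initial_centers=None, max_iter=100, tol=1e-9, seed=0):
    """Return (labels, centers) for a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Step 1: seeds, either supplied or drawn at random from the data.
    centers = list(initial_centers) if initial_centers else rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Steps 2 and 3: assign every point to its closest center.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Step 4: move each center to the mean of its assigned points.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(tuple(sum(col) / len(members)
                                         for col in zip(*members)))
            else:
                new_centers.append(centers[c])  # empty cluster: keep old center
        # Step 5: stop when the centers no longer move noticeably.
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return labels, centers


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(points, 2, initial_centers=[(0, 0), (10, 10)])
print(labels, centers)
```

Because the result depends on the seeds, it is common practice to run the algorithm several times with different seeds and keep the solution with the smallest within-cluster variability.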
Principal component analysis (PCA)
PCA is a data reduction technique that simplifies multidimensional data sets into a smaller number of dimensions (r<n).
Variables are summarized by linear combinations, the principal components. The origin of the coordinate system is moved to the center of the data (mean centering). The coordinate system is then rotated so that the first axis captures the maximum of the variance.
Each subsequent principal component is orthogonal to the previous ones. With the first 2 PCs, often a large share (e.g. 80-90%) of the variance can already be explained. This analysis can be done by a special matrix decomposition (singular value decomposition, SVD).
Classification (e.g. support vector machines)
Cross validation
K-fold cross validation, leave-k-out cross validation (LKOCV)
If k=1, leave-k-out is called leave-one-out cross-validation (LOOCV)
Bias-variance trade-off
Receiver operating characteristics
Sensitivity = TP/(TP+FN)
Specificity = TN/(TN+FP)
Area under curve (AUC): measure for classifier performance
A: ideal classifier, AUC = 1
B: good classifier, AUC ~ 0.8
C: random, AUC = 0.5
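Sensitivity, specificity and the AUC can be computed as in the sketch below. For the AUC it uses the rank-based (Mann-Whitney) formulation: the AUC equals the probability that a randomly chosen positive sample receives a higher classifier score than a randomly chosen negative one, with ties counted as one half. Function names and example scores are illustrative.

```python
def sensitivity(tp, fn):
    """True positive rate: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """True negative rate: TN / (TN + FP)."""
    return tn / (tn + fp)


def roc_auc(scores, labels):
    """Area under the ROC curve for binary labels (1 = positive, 0 = negative)."""
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    # Count positive/negative pairs that are ranked correctly (ties count 0.5).
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


print(sensitivity(8, 2), specificity(9, 1))
print(roc_auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))  # perfectly separated scores
print(roc_auc([0.9, 0.3, 0.8, 0.1], [1, 1, 0, 0]))  # one positive ranked too low
```

The pairwise formulation is quadratic in the number of samples, which is fine for a sketch; for large data sets a rank-based computation is the usual choice.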
Sensitivity
1‐Specificity
Biological meaning of the gene sets?
• Gene ontology terms
• Pathway mapping
• Linking to Pubmed abstracts or associated MESH terms
• Regulation by the same transcription factor (module)
• Protein families and domains
• Gene set enrichment analysis
• Over representation analysis
Gene Ontology
The three organizing principles of GO are
• cellular component (e.g. mitochondrion)
• biological process (e.g. lipid metabolism)
• molecular function (e.g. hydrolase activity)
Each entry in GO has a unique numerical identifier of the form GO:nnnnnnn, and a term name (e.g. fibroblast growth factor receptor binding).
URL: http://www.geneontology.org/
Different evidence codes (e.g. IDA, inferred from direct assay)
Directed acyclic graph (2 relations: "part of" and "is a")
Different levels (specific terms, e.g. sphingolipid metabolism, vs. general terms, e.g. metabolism)
GO terms can be occur
The Gene Ontology project provides a controlled vocabulary to describe gene and gene product attributes in any organism.
Overrepresentation analysis
m: gene universe (whole microarray)
g: all genes with GO term
c: genes in cluster (gene list)
i: genes in cluster with GO term
Fisher exact test for contingency table

                  in cluster   not in cluster   total
with GO term      i            g-i              g
without GO term   c-i          m-g-c+i          m-g
total             c            m-c              m
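For overrepresentation, the one-sided Fisher exact test on this table is the upper tail of the hypergeometric distribution: the probability of drawing at least i genes with the GO term when c genes are drawn from a universe of m genes of which g carry the term. The sketch below uses only the standard library; the function name and the example counts are made up.

```python
from math import comb


def overrepresentation_pvalue(m, g, c, i):
    """P(X >= i) for X ~ Hypergeometric(universe m, annotated g, drawn c)."""
    upper = min(g, c)  # X cannot exceed either margin
    # Number of ways to pick x annotated and c - x non-annotated genes.
    favourable = sum(comb(g, x) * comb(m - g, c - x) for x in range(i, upper + 1))
    return favourable / comb(m, c)


# Hypothetical numbers: 10000 genes on the array, 200 with the GO term,
# a cluster of 100 genes of which 10 carry the term (expected by chance: 2).
print(overrepresentation_pvalue(10000, 200, 100, 10))
```

When many GO terms are tested for the same gene list, the resulting p-values still need a multiple testing correction.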
Regulatory sequences
Experimental methods
• Electrophoretic mobility shift assays (EMSA)
• DNase I and exonuclease footprinting
• Chromatin immunoprecipitation (ChIP)
  – ChIP-chip
  – ChIP-seq
• Systematic evolution of ligands by exponential enrichment (SELEX)
• Reporter gene assays (luciferase)
Computational methods
• Matrix based (the transcription factor must be known in advance)
  – Alignment of experimentally verified transcription factor binding sites (TFBS)
  – Position frequency matrix (PFM)
  – Position weight matrix (PWM), position specific scoring matrix (PSSM): W(b,i) = log2(p(b,i)/p(b)); p(b,i) = f(b,i)/N (b = base, i = position, f = frequency)
  – Evaluation of a sequence: S = ∑ W(n,i) (n = nucleotide observed at position i)
  – Information content D_i = 2 + ∑ p(b,i) log2 p(b,i)
  – Sequence logo
  – TF/PWM databases: Transfac, JASPAR, Genomatix
• MatInspector (based on information content)
  – SIM = ∑ Ci(j)*score(b,j) / ∑ Ci(j)*max_score(j); Ci = K*(∑ p(b,i)*ln p(b,i) + ln 4)
  – Threshold for similarity (e.g. allow max. 1 match in 10000 bp of coding sequences)
• Background sequences (Markov chains)
• Phylogenetic footprinting (predicted binding site is conserved, helps to reduce false positives)
• Profile hidden Markov model (HMM)
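The PFM, PWM, sequence score and information content defined above can be sketched as follows. The binding sites are invented for the example, and the pseudocount is an assumption added here so that bases never observed at a position do not produce log2(0); the slide formula itself has no pseudocount.

```python
import math

BASES = "ACGT"


def position_frequencies(sites):
    """PFM: count of each base at each position of the aligned sites."""
    length = len(sites[0])
    return [{b: sum(s[i] == b for s in sites) for b in BASES}
            for i in range(length)]


def position_weights(pfm, n_sites, background=0.25, pseudocount=0.5):
    """PWM: W(b,i) = log2(p(b,i) / p(b)), with a pseudocount (assumption)."""
    total = n_sites + 4 * pseudocount
    return [{b: math.log2(((counts[b] + pseudocount) / total) / background)
             for b in BASES} for counts in pfm]


def score(pwm, seq):
    """S = sum over positions of W(observed base, position)."""
    return sum(column[base] for column, base in zip(pwm, seq))


def information_content(counts, n_sites):
    """D_i = 2 + sum_b p(b,i) * log2 p(b,i), with 0 * log2(0) taken as 0."""
    ic = 2.0
    for b in BASES:
        p = counts[b] / n_sites
        if p > 0:
            ic += p * math.log2(p)
    return ic


# Hypothetical aligned binding sites.
sites = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]
pfm = position_frequencies(sites)
pwm = position_weights(pfm, len(sites))
print(score(pwm, "TATAAT"), score(pwm, "GGGGGG"))
print([round(information_content(c, len(sites)), 3) for c in pfm])
```

Fully conserved positions reach the maximum of 2 bits, and these are the positions drawn tallest in a sequence logo.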
Motif discovery
• Word-based counting
• Expectation maximization (MEME, ChIP-MEME)
• Gibbs sampling
Associate regulatory sequences with expression
• Linear regression
MicroRNA target prediction
• Sequence complementarity (seed matches)
• Conservation
• Thermodynamics
• Site accessibility
• UTR context
• Correlation of expression profiles (GenMir++)
Databases at the NCBI
• Pubmed • Protein • Nucleotides • Structure • Genome • Books • CancerChromosomes • Conserved Domains • 3D Domains • Gene • Genome Project • dbGAP • GEO Profiles • GEO Datasets • GeneSat
• HomoloGene • Journals • MeSH • NLM Catalogs • OMIA • OMIM • PMC • PopSet • Probe • Protein Cluster • SNP • Taxonomy • UniGene • UniSTS
GenBank (see also NCBI Nucleotide, Protein)
GenBank is the NIH genetic sequence database of all publicly available DNA and derived protein sequences, with annotations describing the biological information these records contain (ASN.1 format, GenBank flat file)
Other databases
Gene (One record represents a single gene from an organism)
Gene ID: 5091
Official Symbol: PC
Official Full Name: pyruvate carboxylase
For human, provided by the HUGO Gene Nomenclature Committee (HGNC)
Refseq (Curated database, one per transcript per organism)
NT_ genomic contig
NM_ mRNA
NP_ protein
NR_ non-coding RNA
XM_ mRNA (automatic annotation)
XP_ protein (automatic annotation)
SwissProt/UniProt (protein sequences)
PDB (protein structures)
HPRD (protein-protein interaction, ppi)
ENSEMBL/Biomart
UCSC genome browser/table browser
Orthologs
Homologs: A – B – C
Orthologs: B1 – C1
Paralogs: C1 – C2 – C3
Inparalogs: C2 – C3
Outparalogs: B2 – C1
Xenologs: A1 – AB1
Ortholog prediction
Best reciprocal hits (blastp)
Databases:
HomoloGene (NCBI)
InParanoid (Stockholm)
YOGY (eukarYotic OrtholoGY) (Sanger)
Gene set enrichment analysis
1. Given an a priori defined set of genes S.
2. Rank genes (e.g. by t-value between 2 groups of microarray samples) to obtain a ranked gene list L.
3. Calculation of an enrichment score (ES) that reflects the degree to which a set S is overrepresented at the extremes (top or bottom) of the entire ranked list L.
4. Estimation of the statistical significance (nominal P value) of the ES by using an empirical phenotype-based permutation test procedure.
5. Adjustment for multiple hypothesis testing by controlling the false discovery rate (FDR).
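Step 3 can be illustrated with a simplified running-sum statistic. Note the assumptions: this sketch uses the unweighted (Kolmogorov-Smirnov-like) form, in which every gene of S counts equally, whereas the published GSEA method can weight genes by their ranking metric; the permutation test and FDR steps are not implemented. The function name and the toy gene names are invented.

```python
def enrichment_score(ranked_genes, gene_set):
    """Maximum deviation from zero of the unweighted running sum."""
    n = len(ranked_genes)
    n_hits = sum(1 for gene in ranked_genes if gene in gene_set)
    hit_step = 1.0 / n_hits          # increase when a gene belongs to S
    miss_step = 1.0 / (n - n_hits)   # decrease otherwise
    running, best = 0.0, 0.0
    for gene in ranked_genes:
        running += hit_step if gene in gene_set else -miss_step
        if abs(running) > abs(best):
            best = running
    return best


ranked = list("abcdefghij")  # hypothetical ranked list L, best-ranked gene first
print(enrichment_score(ranked, {"a", "b"}))  # set at the top of the list
print(enrichment_score(ranked, {"i", "j"}))  # set at the bottom of the list
print(enrichment_score(ranked, {"a", "j"}))  # set spread over both ends
```

A set concentrated at the top gives a score near +1, at the bottom near -1, and a set spread over the list stays closer to 0.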
Subramanian A et al. Proc Natl Acad Sci (2005)
Biochemical, Metabolic, Signaling Pathways
Boehringer Mannheim map
NCI curated pathway maps (http://pid.nci.nih.gov)
Signal Transduction Knowledge Environment (http://stke.sciencemag.org/cm)
Kyoto Encyclopedia of Genes and Genomes, KEGG (http://www.genome.jp/kegg/)
BioCyc (EcoCyc)
Biocarta (http://www.biocarta.com/genes/index.asp)
Transpath
Reactome (http://www.reactome.org/)
PathwayCommons (http://www.pathwaycommons.org/)
Pathway Commons
• Aim: convenient access to pathway information
• Facilitate creation and communication of pathway data
• Aggregate pathway data in the public domain
• Provide easy access for pathway analysis
Cytoscape
• Access Pathway Commons from Cytoscape
• http://www.cytoscape.org
• Open source software for network visualization
• Active community
• >40 plugins extend functionality, e.g. Bingo, ClueGO (for gene ontology)
• Easy to use and good documentation
VizMapper Various layout
Cline MS et al Nat Protoc. 2007
Map gene expression to pathways
• GenMAPP, Cytoscape, Pathway Explorer
• Pairwise similarity measures (Pearson correlation, Spearman rank correlation, partial correlation, mutual information)
• Connection strength (adjacency functions, weighted vs. unweighted)
• Network modules (measures of node dissimilarity) (hierarchical clustering with 1-TOM)
• Reverse engineering (Boolean, differential equation, Bayesian network)
• Different network representations (metabolic, transcriptional, signaling, ppi)
• Network measures (connectivity, clustering coefficients)
• Connectivity and scale-free network topology
• Network motifs
Concepts for network analysis
Gene association network
MICO
• Discretizing expression profiles → groups of genes with identical profile
• REVEAL algorithm based on mutual information
M(x,y) = I(x) + I(y) - I(x,y)
M(x,y) = I(x) → directed (x is fully determined by y)
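The mutual information of two discretized expression profiles can be computed from their entropies, as in the formula above. In this sketch the function `entropy` plays the role of I(x) on the slide (Shannon entropy in bits); function names and the toy profiles are invented for illustration.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy in bits of a sequence of discrete states."""
    n = len(values)
    return -sum((count / n) * math.log2(count / n)
                for count in Counter(values).values())


def mutual_information(x, y):
    """M(x,y) = I(x) + I(y) - I(x,y), with I the entropy."""
    joint = list(zip(x, y))  # joint states of the two profiles
    return entropy(x) + entropy(y) - entropy(joint)


# Hypothetical binary (on/off) profiles over four conditions.
gene_x = [0, 0, 1, 1]
gene_y = [0, 0, 1, 1]  # identical to gene_x
gene_z = [0, 1, 0, 1]  # independent of gene_x

print(mutual_information(gene_x, gene_y), entropy(gene_x))
print(mutual_information(gene_x, gene_z))
```

For the identical profiles the mutual information equals the entropy of gene_x, which is the condition used above to draw a directed edge; for the independent profiles it is zero.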
• Correlation
Pparg
Apmap
Bogner‐Strauss et al. Cell Mol Life Sci. 2010
Adjacency function
weighted
unweighted
Weighted gene coexpression network analysis (WGCNA)
Hierarchical clustering of genes into modules (subnetworks)
Topological overlap measure TOM (common neighbors), defined from:
– connectivity of gene i
– connectivity of gene j
– adjacency function between gene i and gene j
– number of common neighbors
Reverse Engineering
[Figure: reverse engineering vs. system modeling; both panels are labelled with input and temporal series of data]
• Predictive power vs. inferential power
• Instantaneous model, synchronous model; constraint: the system has to be stable
• Boolean networks (advantage: a lot of knowledge from information theory; not quantitative, but the topology is correct, which allows sensitivity/robustness analyses)
• Differential equations (problem: number of samples << number of genes => underdetermined; re-sampling, simulated annealing, genetic algorithm)
• Bayesian network (acyclic graph, conditional probability, conditional independence, Bayesian score to select the best model, causal relations, can introduce some a priori knowledge)
• Perturbation (add a hormone cocktail to the cell culture to start differentiation)
Different network representation
Clustering coefficient
Connectivity (degree)
Topological overlap (TOM)
Network measures
• Every node can be reached from every other by a small number of hops or steps
• High clustering coefficient and low mean shortest path length (random graphs don't necessarily have high clustering coefficients)
• Social networks, the Internet, and biological networks all exhibit small‐world network characteristics
• Six degrees of separation (Kevin Bacon Game)
Small‐world network
Complex network models
Scale‐free network
Modular networks
Hierarchical networks (metabolic networks)
Power law: many genes with few neighbors, few genes with many neighbors (hubs)
Clustering coefficients
Connectivity (degree)
constant
Scale‐free networks are robust
• Complex systems (cell, internet, social networks), are resilient to component failure
• Network topology plays an important role in this robustness (even if ~80% of nodes fail, the remaining ~20% still maintain network connectivity)
• Attack vulnerability if hubs are selectively targeted
• In yeast, only ~20% of proteins are lethal when deleted, and are 5 times more likely to have degree k>15 than k<5.
• Cellular networks are disassortative: hubs tend not to interact directly with other hubs.
• Hubs tend to be “older” proteins (so far claimed for protein‐protein interaction networks only)
• Hubs also seem to have more evolutionary pressure—their protein sequences are more conserved than average between species (shown in yeast vs. worm)
• Experimentally determined protein complexes tend to contain solely essential or non‐essential proteins—further evidence for modularity.
Network motifs
Negative and positive autoregulatory loop
• Negative autoregulation (NAR) speeds up the response time of gene circuits
• NAR can reduce cell-cell variation in protein levels
• Positive autoregulation (PAR) works in the opposite way
Feedforward loop
• Suppresses short signals