Nomenclature - i-med.ac.at · 2018-06-02


Symbol  Meaning           Description
R       A or G            puRine
Y       C or T            pYrimidine
W       A or T            Weak hydrogen bonds
S       G or C            Strong hydrogen bonds
M       A or C            aMino groups
K       G or T            Keto groups
H       A, C, or T (U)    not G (H follows G)
B       G, C, or T (U)    not A (B follows A)
V       G, A, or C        not T (U) (V follows U)
D       G, A, or T (U)    not C (D follows C)
N       G, A, C or T (U)  aNy nucleotide

Nomenclature of nucleic acids

Base      Symbol  Occurrence
Adenine   A       DNA, RNA
Guanine   G       DNA, RNA
Cytosine  C       DNA, RNA
Thymine   T       DNA
Uracil    U       RNA

+ strand 5´-ACGGTCGCTGTCGGTAGC-3´
- strand 3´-TGCCAGCGACAGCCATCG-5´

e.g. in FASTA format:
>gene sequence|gi12345|chr17|-
GCTACCGACAGCGACCGT

DNA sequences are always written from 5´ to 3´
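The relation between the + strand and the FASTA entry for the - strand can be checked with a short sketch. The function name `reverse_complement` is illustrative and not part of the slides; the sequences are the ones shown in this section.

```python
# Minimal sketch: write the opposite strand of a DNA sequence in 5' -> 3' direction.
# COMPLEMENT and reverse_complement are illustrative names, not from the slides.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse so the result reads 5' -> 3'."""
    return seq.translate(COMPLEMENT)[::-1]


plus_strand = "ACGGTCGCTGTCGGTAGC"
print(reverse_complement(plus_strand))  # GCTACCGACAGCGACCGT
```

The output equals the sequence stored in the FASTA record for the - strand.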

Positions in the genome (genome assembly) are given per chromosome

e.g. human GRCh37/hg19

chr11:1‐100 chr11:49,686,777‐49,689,777

Positions on a chromosome are counted from position 1 for both strands

+ strand 5´-ACGGTCGCTG…………TCGGTAGC-3´
- strand 3´-TGCCAGCGAC…………AGCCATCG-5´

chr11:1 2523 2529 (same position labels for both strands)

Nomenclature


A  Ala  alanine
B  Asx  aspartic acid or asparagine
C  Cys  cysteine
D  Asp  aspartic acid
E  Glu  glutamic acid
F  Phe  phenylalanine
G  Gly  glycine
H  His  histidine
I  Ile  isoleucine
K  Lys  lysine
L  Leu  leucine
M  Met  methionine
N  Asn  asparagine
P  Pro  proline
Q  Gln  glutamine
R  Arg  arginine
S  Ser  serine
T  Thr  threonine
V  Val  valine
W  Trp  tryptophan
X  Xaa  unknown or 'other' amino acid
Y  Tyr  tyrosine
Z  Glx  glutamic acid or glutamine

Amino Acids

Translation, genetic code and reading frames


Peptide chain, amino acid sequence, proteins

Protein sequences are always written from the N‐terminal end to the C‐terminal end

backbone

sidechains

E.g. SCD sequence in FASTA format

Regulation of eukaryotic transcription


Different levels of regulation

Transcriptional regulation has largest effect on phenotype!

Chromatin states

Ernst et al. Nature 2011.


DNA methylation

Cytosine 5-Methylcytosine

microRNA and siRNA


Organization of the human genome

E.g. > 1 million copies of Alu-repeats

Sequence alignment

Exact string matching
  Naive
  Z‐box algorithm (Boyer‐Moore, Knuth‐Morris‐Pratt, …)
  Suffix arrays, suffix tries
  Burrows‐Wheeler transformation (BWT), FM‐index
  Hash tables (spaced seeds)

Aligning 2 sequences

Dot matrix
Gaps (gap open and extension, linear and convex penalty function)
Distances between sequences (Hamming distance, Levenshtein distance, edit operations)
Substitution matrices
  Odds ratio (random model [independent = q_xi q_yj] vs match model [joint = p_xiyi])
  Log odds ratio (scores are additive)
  PAM, BLOSUM

Dynamic programming
  Idea: new best alignment = previous best alignment + local best alignment

Global alignment (Needleman‐Wunsch)
  Construct matrix F, F(0,0)=0
  Traceback from bottom right (= best score) to top left (= start of sequences)

Local alignment (Smith‐Waterman)
  Construct matrix F, F(0,j) = F(i,0) = 0, include 0 as option
  Traceback from max score to 0

Used for read mapping in NGS applications
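The local alignment recurrence (first row and column 0, 0 as an option, traceback from the maximum score to 0) can be sketched as follows. The scoring values (match=2, mismatch=-1, linear gap=-2) and the function name are assumptions for illustration only; real tools use substitution matrices and affine gap penalties.

```python
# Minimal Smith-Waterman sketch with a linear gap penalty.
# Scoring parameters are illustrative assumptions, not values from the slides.
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    rows, cols = len(a) + 1, len(b) + 1
    F = [[0] * cols for _ in range(rows)]          # F(0,j) = F(i,0) = 0
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(0,                       # 0 as option: start a new local alignment
                          F[i - 1][j - 1] + s,     # (mis)match
                          F[i - 1][j] + gap,       # gap in b
                          F[i][j - 1] + gap)       # gap in a
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)
    # Traceback from the maximum score until a cell with score 0 is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and F[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if F[i][j] == F[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("TTTACGTTT", "GGACGGG"))
```

For global alignment (Needleman‐Wunsch) the same recurrence is used without the 0 option, with the first row and column initialized to multiples of the gap penalty, and the traceback starts at the bottom right cell.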


Sequence alignment

Search similar sequences in database (db)

W‐mer indexing (hash tables)
FASTA (evaluate position differences of small words (2‐6 characters) in query and db; hash tables)
BLAST (Basic Local Alignment Search Tool)
  Query words (W=3), find neighbourhood words with score > threshold T (e.g. T=13) = seeds
  Extend seeds until score drops off under X = high‐scoring segment pairs (HSP)
  Evaluate significance (E(S) = K m n e^{-λS})
  BLAST variants (blastn, blastp, blastx, tblastx, tblastn, MegaBlast)
BLAT (BLAST‐like alignment tool; mostly for highly similar DNA sequences)

Multiple sequence alignment

Dynamic programming in n dimensions (computationally very intensive)
  Compute pairwise alignments to find an upper bound
  Create a heuristic multiple alignment to find a lower bound
  Search in the n‐dimensional scoring matrix

Progressive tree alignment (ClustalW)
  Perform hierarchical clustering using distances between sequences (e.g. edit distance): merge sequences to find ancestor sequences by finding sequences with minimum edit distance to the two child sequences
  Assign weights to each branch of the tree, based on the distance between sequences
  Align sequences (starting from the closest, using a version of dynamic programming) using the weights in the score function

Profile Hidden Markov Model

ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC

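Emission probabilities of the match states (alignment columns 1, 2, 3, 7, 8, 9):

State  P(A)  P(C)  P(G)  P(T)
M1     0.8   0.0   0.0   0.2
M2     0.0   0.8   0.2   0.0
M3     0.8   0.2   0.0   0.0
M4     1.0   0.0   0.0   0.0
M5     0.0   0.0   0.2   0.8
M6     0.0   0.8   0.2   0.0

Emission probabilities of the insertion state (alignment columns 4‐6): P(A)=0.2 P(C)=0.4 P(G)=0.2 P(T)=0.2

Transition probabilities: 1.0 between consecutive match states; 0.6 from M3 into the insertion state and 0.4 directly to M4; 0.4 for staying in the insertion state and 0.6 for leaving it to M4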

[AT][CG][AC][ACGT]*A[TG][GC]

Regular Expressions

p(ACACATC) = 0.8*1*0.8*1*0.8*0.6*0.4*0.6*1*1*0.8*1*0.8 = 0.047
log‐odds = log(p(S)/0.25^L) = log(0.047/0.25^7)

insertion state

- For multiple alignments (e.g. DNA sequences)
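The worked product for ACACATC can be reproduced numerically. This is a sketch: the emission and transition values are copied from the product in this section, the helper name is illustrative, and the natural logarithm is an assumption because the slides do not state the base.

```python
import math

# Emissions and transitions along the path of ACACATC, in the order they
# appear in the worked product (A, C, A, inserted C, A, T, C).
emissions = [0.8, 0.8, 0.8, 0.4, 1.0, 0.8, 0.8]
transitions = [1.0, 1.0, 0.6, 0.6, 1.0, 1.0]


def path_probability(emissions, transitions):
    """Probability of a sequence along one fixed path through the profile HMM."""
    return math.prod(emissions) * math.prod(transitions)


p = path_probability(emissions, transitions)
# Log-odds against a uniform background of 0.25 per base (natural log assumed).
log_odds = math.log(p / 0.25 ** len(emissions))
print(round(p, 3), round(log_odds, 2))
```

The log‐odds score corrects for sequence length, so scores of sequences with different numbers of inserted bases become comparable.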


Protein Sequence Analysis

Sequence alignment
BLAST
FASTA

Uses collective characteristics of a family of proteins
Position specific score matrix (PSSM)
Profile HMM
ProfileScan, Pfam, CDD, Prosite, BLOCKS
PSI‐BLAST

Amino acid composition
Hydrophobicity
Charge
Theoretical pI, molecular weight

Secondary structure (alpha helix, beta strand, beta sheet)
Specialized structures
Tertiary structure

Neural network for secondary structure prediction


PredictProtein

• Multi‐step predictive algorithm (Rost et al., 1994)

– Protein sequence queried against SWISS‐PROT
– MaxHom used to generate iterative, profile‐based multiple sequence alignment (Sander and Schneider, 1991)

– Multiple alignment fed into neural network (PHDsec)

• Accuracy: Average > 70%, Best‐case > 90%

• http://www.predictprotein.org/

SignalP

• Neural network trained based on phylogeny
– Gram‐negative prokaryotic
– Gram‐positive prokaryotic
– Eukaryotic

• Predicts secretory signal peptides
• http://www.cbs.dtu.dk/services/SignalP/

Signal peptide score (S)

Cleavage site score (C)

Combined Score (Y)


Two‐color microarrays

– Oligonucleotides (60‐80 mers)
– cDNA fragments from a library (varying lengths)

Two color microarray analysis

Experimental design (Biological replicates, Technical replicates, Dye swap, Reference design)

Image analysis (align grid, identify spots and background)

Preprocessing (subtract background, filter saturated or bad spots)

Normalization (idea: expression of majority of genes is not changing across conditions)

Normalization factor N = sum[Ri]/sum[Gi] => Gi´ = N*Gi, Ri´ = Ri

MA‐plot [M=log2(R/G); A=log2(R*G)/2]

=> M (=log ratios) are dependent on A (average intensities)

=> LOWESS normalization
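The global normalization factor and the MA transformation can be sketched in a few lines. Function names and the example intensities are illustrative assumptions; LOWESS normalization itself is not shown because it needs a local regression fit of M on A.

```python
import math


def global_normalize(red, green):
    """Scale the green channel by N = sum(R)/sum(G); the red channel is unchanged."""
    n_factor = sum(red) / sum(green)
    return list(red), [n_factor * g for g in green]


def ma_values(red, green):
    """M = log2(R/G) (log ratio), A = log2(R*G)/2 (average log intensity)."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [math.log2(r * g) / 2 for r, g in zip(red, green)]
    return m, a


# Hypothetical spot intensities, for illustration only.
red = [400.0, 100.0]
green = [100.0, 100.0]
print(ma_values(red, green))
print(global_normalize(red, green))
```

After global normalization both channels have the same total intensity; an intensity‐dependent trend of M over A still remains, which is what LOWESS removes.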

Identification of differentially expressed genes

Moderated t‐test (t=mean(M)/[(a+s)/sqrt(n)]; a is estimated from all genes)

R‐package limma

As a result, for each gene the log2‐fold change (=M), p‐value, and adjusted p‐value (Benjamini‐Hochberg corrected p‐value based on the false discovery rate, FDR) are calculated

All genes with adjusted p‐value<0.1 are considered significantly differentially expressed


Affymetrix microarrays


Affymetrix chips

Analysis of one‐color arrays (Affymetrix)

Preprocessing (apply model of perfect match (PM) and mismatch (MM) = background; PM‐MM is not correct)

Normalization has to be done between arrays (not within an array)

Quantile normalization

Variance stabilizing normalization (VSN)

Probe summarization

Median Polish summarization

RMA (Robust Multiarray Average), available in R/Bioconductor, is the method of choice; it includes all 3 steps, and the resulting intensity values are log2‐transformed

Identification of differentially expressed genes (as for two‐color arrays) using the R/Bioconductor package limma


Methods to correct p‐values for multiple testing

With 1000 tests at a significance level of 0.05, about 50 false positives are expected to be declared significant.

To account for multiple testing, the following error measures are used:

Family‐wise error rate (FWER): P(V>0)

False discovery rate (FDR): E(V/R)

For sorted p‐values p(1) ≤ … ≤ p(n): if p(i)*n/i > p(i+1)*n/(i+1), then set p(i)*n/i = p(i+1)*n/(i+1)
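The adjustment rule above (scale the i‐th smallest p‐value by n/i, then make sure the adjusted values never decrease with rank) is the Benjamini‐Hochberg procedure mentioned for limma. A minimal sketch, with an illustrative function name:

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the original order of the input."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda k: pvalues[k])  # indices by ascending p-value
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest to the smallest p-value so that
    # p(i)*n/i is replaced by p(i+1)*n/(i+1) whenever it is larger.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


print(benjamini_hochberg([0.01, 0.011, 0.5]))
```

In the example the smallest p‐value would get 0.01*3/1 = 0.03, which exceeds the next value 0.011*3/2 = 0.0165, so it is set to 0.0165.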

• Potential for surveying the entire transcriptome, including novel, un‐annotated regions.

• Helps to identify expression and function of regulatory non‐coding RNAs (e.g. lincRNA)

• Potential for determining gene structure and isoform level expression using reads mapping to splice junctions.

• Potential for making better presence/absence calls on regions.

• More expensive than microarrays

• No need to design probes

Transcriptome sequencing (RNAseq)


Wang et al., Nature Rev Gen, 2009

Base calling (Phred score)

Phred quality score Q and base‐calling error probabilities P

$Q_{\mathrm{Phred}} = -10 \log_{10} P$

$Q_{\mathrm{Solexa}} = -10 \log_{10} \frac{P}{1-P}$

For P=0.05 the quality score Q=13
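The two quality definitions and the example value can be checked with a short sketch (function names are illustrative):

```python
import math


def phred_quality(p):
    """Q = -10 log10(P), with P the base-calling error probability."""
    return -10 * math.log10(p)


def solexa_quality(p):
    """Q = -10 log10(P / (1 - P)), the odds-based Solexa variant."""
    return -10 * math.log10(p / (1 - p))


def error_probability(q):
    """Invert the Phred definition: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


print(round(phred_quality(0.05)))   # 13
print(error_probability(20))        # Q=20 corresponds to 1 error in 100 bases
```

Both scales agree closely for small P and differ mainly for low‐quality bases.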


Base calling (FastQ format)

@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+
;;3;;;;;;;;;;;;7;;;;;;;88

Quality scores are encoded in ASCII
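Decoding the ASCII quality line can be sketched as below. The offset of 33 (Sanger convention) is an assumption; older Illumina/Solexa pipelines used an offset of 64, and the slides do not state which one applies to the example read. The function names are illustrative.

```python
def decode_qualities(quality_string, offset=33):
    """Convert each ASCII character to a numeric quality score (offset is an assumption)."""
    return [ord(ch) - offset for ch in quality_string]


def parse_fastq_record(lines):
    """Split one four-line FASTQ record into (read id, sequence, qualities)."""
    header, sequence, _separator, quality = [line.strip() for line in lines]
    return header[1:], sequence, decode_qualities(quality)


record = ["@read1", "ACGT", "+", "I5!;"]   # hypothetical read, for illustration only
print(parse_fastq_record(record))
```

With offset 33 the character `;` used in the example read corresponds to a quality score of 26.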

1. Read mapping

2. Transcriptome reconstruction

3. Expression quantification

4. Differential expression analysis

Analysis steps


Read mapping

Unspliced aligners: Bowtie

Spliced aligners (exon first vs. seed extend): TopHat (uses Bowtie, which maps exons first), SpliceMap

Tools: Bowtie, BWA, Eland, MAQ, SOAP2, GSNAP, STAR

Result is a Sequence Alignment/Map (SAM/BAM) format file

GFF/GTF files (General Feature Format, Gene Transfer Format) keep information about exon, gene and transcript positions in the genome assembly

FastQ files => Trim adaptors and bad quality reads (FASTQC)

GTF files

FPKM normalization
Estimate uncertainty of mapping reads to isoforms

Reference genome

Advanced transformation and applying t-test

SAM/BAM file

Differentially expressed genes and isoforms between conditions


Expression quantification

Garber et al., Nature Methods, 2011

RNAseq normalization

Reads per kilobase per million (RPKM) (divide by library size and transcript length)

Quantile normalization

TMM (trimmed mean of M values).

Fragments per kilobase per million reads (FPKM) (for paired‐end sequencing)
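A minimal sketch of the RPKM calculation: the read count is divided by the transcript length in kilobases and by the library size in millions of mapped reads. The function name and the numbers are illustrative; for FPKM the same formula is applied to fragments (read pairs) instead of single reads.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    kilobases = transcript_length_bp / 1e3
    millions = total_mapped_reads / 1e6
    return read_count / (kilobases * millions)


# Hypothetical example: 500 reads on a 2 kb transcript in a library of 10 million reads.
print(rpkm(500, 2000, 10_000_000))  # 25.0
```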


Differential expression analysis for sequencing count data

Expect Poisson distribution (Mean=Variance) as it is typical for count data

A B C D

Gene1 1 23 2 6

Gene2 0 74 8 7

Gene3 33 4 14 8

But counts for the same gene from different biological replicates have a variance exceeding the mean (overdispersion), which can be modeled by a negative‐binomial model.

The dispersion is estimated from the raw count distribution of all genes

Differentially expressed genes are tested by negative binomial test, Wald test, or likelihood ratio test

Using R packages DESeq, DESeq2, edgeR

HTSeq can be used to generate the gene matrix of raw count data

How many reads are needed (depth)?

two mouse libraries (ES,EB) yeast

E.g. 20-40 million reads should be sufficient for human


Clustering

• Unsupervised or supervised (classification)

• Agglomerative
Bottom up approach, whereby single expression profiles are successively joined to form nodes.

• Divisive
Top down approach, each cluster is successively split in the same fashion, until each cluster consists of one single profile.

• Pearson correlation

• Euclidean distance

• Manhattan distance

Similarity distance measures

Manhattan distance: $d_M = \sum_{i=1}^{n} |x_i - y_i|$

Pearson correlation: $-1 \le r \le 1$
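The three measures listed above can be written out for two expression profiles x and y of equal length. This is a plain‐Python sketch with illustrative function names; the example profiles are made up.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def pearson(x, y):
    """Pearson correlation r, always between -1 and 1."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


x, y = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]   # hypothetical profiles
print(euclidean(x, y), manhattan(x, y), pearson(x, y))
```

The example shows why correlation is popular for expression profiles: y has the same shape as x (r = 1) although the Euclidean and Manhattan distances are not zero. A correlation is typically turned into a distance as 1 - r.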


Hierarchical clustering

• Agglomerative (bottom up), unsupervised
• Cluster genes or samples (or both = biclustering)
• Distances are encoded in a dendrogram (tree)
• Cut tree to get clusters
• Pearson correlation (usually used)
• Computationally intensive (correlation matrix)

1. Identify clusters (items) with closest distance
2. Join to new clusters
3. Compute distance between clusters (items) (see linkage)
4. Return to step 1

6 clusters

15 clusters

Linkage

Single‐linkage clustering
Minimal distance

Complete‐linkage clustering
Maximal distance

Average‐linkage clustering
Calculated using average distance (UPGMA)
Average of distances, not of expression values

Weighted pair‐group average
Like UPGMA but weighted according to cluster size

Within‐groups clustering
Average of merged cluster is used instead of cluster elements

Ward’s method
Smallest possible increase in the sum of squared errors


• partition n genes into k clusters, where k has to be predetermined

• k‐means clustering minimizes the variability within clusters and maximizes it between clusters

• Moderate memory and time consumption

K‐means

1. Generate random points (“cluster centers”) in n dimensions (results depend on these seeds).

2. Compute distance of each data point to each of the cluster centers.

3. Assign each data point to the closest cluster center.

4. Compute new cluster center position as average of points assigned.

5. Return to step 2; stop when the cluster centers no longer move appreciably.
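The five steps can be sketched in plain Python. Two choices here are assumptions and not from the slides: the initial centers are drawn from the data points themselves (instead of arbitrary random points), and squared Euclidean distance is used. Function names and the toy data are illustrative.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=100, tol=1e-9, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # step 1 (assumption: seeds are data points)
    labels = []
    for _ in range(max_iter):
        # steps 2 and 3: assign every point to its closest center
        labels = [min(range(k), key=lambda c: squared_distance(p, centers[c]))
                  for p in points]
        # step 4: move each center to the mean of its assigned points
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                new_centers.append(centers[c])   # keep an empty cluster's center
        # step 5: stop when no center moved appreciably
        shift = max(squared_distance(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:
            break
    return labels, centers


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(points, k=2)
print(labels, centers)
```

Because the result depends on the seeds, k‐means is usually run several times with different starting points and the solution with the smallest within‐cluster variability is kept.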

Principal component analysis (PCA)

PCA is a data reduction technique that simplifies multidimensional data sets into a smaller number of dimensions (r<n).

Variables are summarized by linear combinations, the principal components. The origin of the coordinate system is moved to the center of the data (mean centering). The coordinate system is then rotated so that the first axis captures the maximum of the variance.

Each subsequent principal component is orthogonal to the previous ones. The first 2 PCs often already explain 80‐90% of the variance. This analysis can be done by a special matrix decomposition (singular value decomposition, SVD).


Classification (e.g. support vector machines)

Cross validation

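K‐fold cross validation and leave‐k‐out cross validation (LKOCV)

Leaving out k=1 sample per round is called leave‐one‐out cross‐validation (LOOCV); this equals K‐fold cross validation with K = number of samples
Variance bias trade‐off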

Receiver operating characteristics

Sensitivity = TP/(TP+FN)
Specificity = TN/(TN+FP)

Area under curve (AUC): measure for classifier performance
A ideal classifier AUC=1
B good classifier AUC~0.8
C random AUC=0.5
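Sensitivity, specificity and AUC can be computed directly from counts and classifier scores. The sketch below uses the fact that the AUC equals the probability that a randomly chosen positive sample gets a higher score than a randomly chosen negative one (ties count one half). Function names and example scores are illustrative.

```python
def sensitivity(tp, fn):
    return tp / (tp + fn)


def specificity(tn, fp):
    return tn / (tn + fp)


def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for s, lab in zip(scores, labels) if lab == 1]
    negatives = [s for s, lab in zip(scores, labels) if lab == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


print(auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))  # perfect separation
print(auc([0.9, 0.3, 0.8, 0.1], [1, 1, 0, 0]))  # one positive ranked below a negative
```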

Sensitivity

1‐Specificity


Biological meaning of the gene sets?

• Gene ontology terms

• Pathway mapping

• Linking to Pubmed abstracts or associated MESH terms

• Regulation by the same transcription factor (module)

• Protein families and domains

• Gene set enrichment analysis

• Over representation analysis

Gene Ontology

The three organizing principles of GO are

• cellular component (e.g. mitochondrion)
• biological process (e.g. lipid metabolism)
• molecular function (e.g. hydrolase activity)

Each entry in GO has a unique numerical identifier of the form GO:nnnnnnn, and a term name (e.g. fibroblast growth factor receptor binding).

URL: http://www.geneontology.org/

Different evidence codes (e.g. IDA, inferred from direct assay)
Directed acyclic graph (2 relations: part of and is a)
Different levels (specific terms, e.g. sphingolipid metabolism, vs general terms, e.g. metabolism)
GO terms can occur

The Gene Ontology project provides a controlled vocabulary to describe gene and gene product attributes in any organism.


Overrepresentation analysis

m: gene universe (whole microarray)
g: all genes with GO term
c: genes in cluster (gene list)
i: genes in cluster with GO term

Fisher exact test for contingency table (entries shown in the figure: m-g, c-i, g, i)
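A one‐sided Fisher exact test for overrepresentation is equivalent to the upper tail of the hypergeometric distribution. The sketch below assumes the reading m = genes in the universe, g = genes with the GO term, c = genes in the cluster, i = genes in the cluster with the GO term; the function name is illustrative.

```python
from math import comb


def overrepresentation_p_value(m, g, c, i):
    """P(X >= i) when c genes are drawn without replacement from m genes, g of them annotated."""
    total = comb(m, c)
    tail = sum(comb(g, x) * comb(m - g, c - x) for x in range(i, min(g, c) + 1))
    return tail / total


# Hypothetical example: 10 genes, 5 annotated, cluster of 5 containing all 5 annotated genes.
print(overrepresentation_p_value(10, 5, 5, 5))  # 1/252
```

The test is repeated for every GO term, so the resulting p‐values need a multiple testing correction.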

Regulatory sequences

Experimental methods

Electrophoretic mobility shift assays (EMSA)
DNase I and exonuclease footprinting
Chromatin immunoprecipitation (ChIP)
‐ ChIP‐chip
‐ ChIP‐seq
Systematic evolution of ligands by exponential enrichment (SELEX)
Reporter gene assays (luciferase)

Computational methods

Matrix based (know in advance which transcription factor)
Alignment of experimentally verified transcription factor binding sites (TFBS)
Position frequency matrix (PFM)
Position weight matrix (PWM), position specific scoring matrix (PSSM)
  W(b,i) = log2(p(b,i)/p(b)); p(b,i) = f(b,i)/N; b..base, i..position, f..frequency
Evaluation of sequence
  S = ∑ W(b,i)
Information content Di = 2 + ∑ p(b,i) log2 p(b,i)
Sequence logo
TF/PWM databases: Transfac, JASPAR, Genomatix
MatInspector (based on information content)
  SIM = ∑ Ci(j)*score(b,j) / ∑ Ci(j)*max_score(j); Ci = K*(∑ p(b,i)*ln p(b,i) + ln 4)
Threshold for similarity (e.g. allow max 1 match in 10000 bp of coding sequences)
Background sequences (Markov chains)
Phylogenetic footprinting (predicted binding site is conserved, helps to reduce false positives)
Profile Hidden Markov Model (HMM)
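The PWM formulas can be sketched from a set of aligned binding sites. Two points are assumptions and not from the slides: a pseudocount is added so that bases never observed at a position do not give log2(0), and the background p(b) is taken as uniform (0.25). Function names and the example sites are illustrative.

```python
import math

BASES = "ACGT"


def position_probabilities(sites, pseudocount=1.0):
    """p(b,i) = f(b,i)/N for every alignment column (pseudocount is an added assumption)."""
    n = len(sites) + 4 * pseudocount
    return [{b: (column.count(b) + pseudocount) / n for b in BASES}
            for column in zip(*sites)]


def weight_matrix(probs, background=0.25):
    """W(b,i) = log2(p(b,i)/p(b)), with a uniform background assumed."""
    return [{b: math.log2(p[b] / background) for b in BASES} for p in probs]


def score(sequence, weights):
    """S = sum of W(b,i) over the bases of the evaluated sequence."""
    return sum(w[base] for base, w in zip(sequence, weights))


def information_content(p):
    """D_i = 2 + sum_b p(b,i) log2 p(b,i); terms with p = 0 contribute 0."""
    return 2 + sum(v * math.log2(v) for v in p.values() if v > 0)


sites = ["ACGT", "ACGT", "ACGT", "ACGT"]      # hypothetical aligned binding sites
weights = weight_matrix(position_probabilities(sites))
print(score("ACGT", weights), score("TGCA", weights))
```

A fully conserved position has an information content of 2 bits, a position with all four bases equally frequent has 0 bits; these values determine the letter heights in a sequence logo.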


Motif discovery

Word‐based counting
Expectation maximization (MEME, MEME‐ChIP)
Gibbs sampling

Associate regulatory sequences with expression
Linear regression

MicroRNA target prediction
Sequence complementarity (seed matches)
Conservation
Thermodynamics
Site accessibility
UTR context
Correlation of expression profiles (GenMir++)

Databases at the NCBI

• Pubmed
• Protein
• Nucleotides
• Structure
• Genome
• Books
• Cancer Chromosomes
• Conserved Domains
• 3D Domains
• Gene
• Genome Project
• dbGaP
• GEO Profiles
• GEO Datasets
• GENSAT
• HomoloGene
• Journals
• MeSH
• NLM Catalog
• OMIA
• OMIM
• PMC
• PopSet
• Probe
• Protein Cluster
• SNP
• Taxonomy
• UniGene
• UniSTS


GenBank (see also NCBI Nucleotide, Protein)
GenBank is the NIH genetic sequence database of all publicly available DNA and derived protein sequences, with annotations describing the biological information these records contain (ASN.1 format, GenBank flat file)

Other databases

Gene (One record represents a single gene from an organism)

Gene ID 5091
Official Symbol PC
Official Full Name pyruvate carboxylase

For human, provided by the HUGO Gene Nomenclature Committee (HGNC)

RefSeq (curated database, one per transcript per organism)

NT_ Genomic contig
NM_ mRNA
NP_ protein
NR_ Non‐coding RNA
XM_ mRNA (automatic annotation)
XP_ protein (automatic annotation)

SwissProt/UniProt (protein sequences)
PDB (protein structures)
HPRD (protein‐protein interaction, ppi)
ENSEMBL/Biomart
UCSC genome browser/table browser


Orthologs

Homologs: A – B – C
Orthologs: B1 – C1
Paralogs: C1 – C2 – C3
Inparalogs: C2 – C3
Outparalogs: B2 – C1
Xenologs: A1 – AB1

Ortholog prediction
Best reciprocal hits (blastp)

Databases:

HomoloGene (NCBI)
InParanoid (Stockholm)
YOGY (eukarYotic OrtholoGY) (Sanger)

Gene set enrichment analysis

1. Given an a priori defined set of genes S.

2. Rank genes (e.g. by t‐value between 2 groups of microarray samples) to obtain the ranked gene list L.

3. Calculation of an enrichment score (ES) that reflects the degree to which a set S is overrepresented at the extremes (top or bottom) of the entire ranked list L.

4. Estimation of the statistical significance (nominal P value) of the ES by using an empirical phenotype‐based permutation test procedure.

5. Adjustment for multiple hypothesis testing by controlling the false discovery rate (FDR).


Subramanian A et al. Proc Natl Acad Sci (2005)

Biochemical, Metabolic, Signaling Pathways

Boehringer Mannheim map
NCI curated pathway maps (http://pid.nci.nih.gov)
Signal Transduction Knowledge Environment (http://stke.sciencemag.org/cm)
Kyoto Encyclopedia of Genes and Genomes, KEGG (http://www.genome.jp/kegg/)
BioCyc (EcoCyc)
Biocarta (http://www.biocarta.com/genes/index.asp)
Transpath
Reactome (http://www.reactome.org/)
PathwayCommons (http://www.pathwaycommons.org/)


Pathway Commons

• Aim: convenient access to pathway information
• Facilitate creation and communication of pathway data
• Aggregate pathway data in the public domain
• Provide easy access for pathway analysis

Cytoscape

• Access pathway commons from cytoscape

• http://www.cytoscape.org
• Open source software for network visualization
• Active community
• >40 plugins extend functionality, e.g. Bingo, ClueGO (for gene ontology)
• Easy to use and good documentation

VizMapper Various layout

Cline MS et al Nat Protoc. 2007


Map gene expression to pathways

• GenMAPP, Cytoscape, Pathway Explorer

• Pairwise similarity measures (Pearson correlation, Spearman rank correlation, partial correlation, mutual information)

• Connection strength (adjacency functions, weighted vs unweighted)
• Network modules (measures of node dissimilarity) (hierarchical clustering with 1‐TOM)
• Reverse engineering (Boolean, differential equation, Bayesian network)
• Different network representations (metabolic, transcriptional, signaling, ppi)
• Network measures (connectivity, clustering coefficients)
• Connectivity and scale‐free network topology

• Network motifs

Concepts for network analysis


Gene association network

MICO

• Discretizing expression profiles → groups of genes with identical profile

• REVEAL algorithm based on mutual information

M(x,y) = I(x) + I(y) ‐ I(x,y)
M(x,y) = I(x) → directed

• Correlation
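The mutual information used by REVEAL, M(x,y) = I(x) + I(y) ‐ I(x,y), can be computed for discretized profiles as sketched below. Here I denotes the Shannon entropy in bits; function names and the binary example profiles are illustrative assumptions.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy I(x) in bits of a discretized profile."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def mutual_information(x, y):
    """M(x,y) = I(x) + I(y) - I(x,y), with I(x,y) the entropy of the paired values."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


x = [0, 0, 1, 1]
print(mutual_information(x, [0, 0, 1, 1]))  # y determines x: M(x,y) = I(x)
print(mutual_information(x, [0, 1, 0, 1]))  # independent profiles: M(x,y) = 0
```

When M(x,y) equals I(x), the state of y fully determines the state of x, which is how a direction can be assigned to the edge.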

Pparg

Apmap

Bogner‐Strauss et al. Cell Mol Life Sci. 2010

Adjacency function

weighted

unweighted


Weighted gene coexpression network analysis (WGCNA)

modules (subnetworks)

Hierarchical clustering of

Topological overlap measure TOM (common neighbors)

Connectivity of gene i Connectivity of gene j Adjacency function between gene i and gene j

Number of common neighbors
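The labels above (connectivity of gene i and gene j, adjacency between them, number of common neighbors) match the commonly used definition of the topological overlap, TOM(i,j) = (l_ij + a_ij) / (min(k_i, k_j) + 1 - a_ij), where l_ij = sum over u of a_iu*a_uj and k_i = sum over u of a_iu. The formula itself did not survive in the slide text, so treating it as the one intended is an assumption. A sketch for a symmetric adjacency matrix with zero diagonal, with an illustrative function name:

```python
def topological_overlap(adj, i, j):
    """TOM between nodes i and j of a symmetric adjacency matrix (weighted or unweighted)."""
    n = len(adj)
    others = [u for u in range(n) if u not in (i, j)]
    common = sum(adj[i][u] * adj[u][j] for u in others)   # shared neighbors l_ij
    k_i = sum(adj[i][u] for u in range(n) if u != i)      # connectivity of gene i
    k_j = sum(adj[j][u] for u in range(n) if u != j)      # connectivity of gene j
    return (common + adj[i][j]) / (min(k_i, k_j) + 1 - adj[i][j])


triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
print(topological_overlap(triangle, 0, 1))  # 1.0
print(topological_overlap(path, 0, 2))      # 0.5
```

Modules are then obtained by hierarchical clustering with the dissimilarity 1 - TOM.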

Reverse Engineering

[Figure: Reverse Engineering vs. System Modeling; labels: Input, Temporal series of data]

Predictive power vs. inferential power
Instantaneous model ‐ synchronous model; constraint: the system has to be stable
Boolean networks (advantage: a lot of knowledge from information theory; not quantitative, but the topology is correct; allows sensitivity/robustness analyses)
Differential equations (problem: number of samples << number of genes => underdetermined; re‐sampling, simulated annealing, genetic algorithm)
Bayesian network (acyclic graph, conditional probability, conditional independence, Bayesian score to select the best model, causal relations, can introduce some a priori knowledge)

Perturbation (e.g. adding a hormone cocktail to the cell culture to start differentiation)


Different network representation

Clustering coefficient

Connectivity (degree)

Topological overlap (TOM)

Network measures


• Every node can be reached from every other by a small number of hops or steps

• High clustering coefficient and low mean shortest path length (random graphs don’t necessarily have high clustering coefficients)

• Social networks, the Internet, and biological networks all exhibit small‐world network characteristics

• Six degrees of separation (Kevin Bacon Game)

Small‐world network

Complex network models

Scale‐free network

Modular networks

Hierarchical networks (metabolic networks)

Power law: many genes with few neighbors, few genes with many neighbors (hubs)

[Figure axes: clustering coefficient vs. connectivity (degree); label: constant]


Scale‐free networks are robust

• Complex systems (cell, internet, social networks), are resilient to component failure

• Network topology plays an important role in this robustness (even if ~80% of nodes fail, the remaining ~20% still maintain network connectivity)

• Attack vulnerability if hubs are selectively targeted

• In yeast, deletion is lethal for only ~20% of proteins, and these are 5 times more likely to have degree k>15 than k<5.

• Cellular networks are disassortative: hubs tend not to interact directly with other hubs.

• Hubs tend to be “older” proteins (so far claimed for protein‐protein interaction networks only)

• Hubs also seem to have more evolutionary pressure—their protein sequences are more conserved than average between species (shown in yeast vs. worm)

• Experimentally determined protein complexes tend to contain solely essential or non‐essential proteins—further evidence for modularity.

Network motifs

Negative and positive autoregulatory loop
NAR (negative autoregulation) speeds up the response time of gene circuits
NAR can reduce cell‐cell variation in protein levels
PAR (positive autoregulation) works in the opposite way

Feedforward loop
Suppresses short signals