Bioinformatics
Master's Degree Programme in Computer Science (Corso di Laurea Specialistica in Informatica)
Microarrays and Biomarkers
06/05/2011


Classification of microarray samples

• We are given a set (called the learning set) of microarray expression data coming from several classes of samples (patients).

• To simplify the problem, we consider only two classes: case and control. So we have a set of case/control pairs.

• For example: cancer/normal, metastatic/non-metastatic, etc.

• Goal: build a classifier able to decide which class a new, unclassified sample belongs to.

Expression profiling data analysis

• A supervised approach to classification:

– Identify genes (or microRNAs) that are differentially expressed in the two classes of samples.

– Discretize the set of discriminant genes.

– Use these genes to build a classifier able to classify new (unknown) samples.

Two classes/1

• Rank Product

– Rank Product is a non-parametric statistical method based on the ranks of fold changes. Given n genes and k replicates, let $e_{g,i}$ be the fold change (case/control ratio) and $r_{g,i}$ the rank of gene g in the i-th replicate.

– The rank product is computed as the geometric mean of the ranks:

$$RP(g) = \Big(\prod_{i=1}^{k} r_{g,i}\Big)^{1/k}$$

– A simple permutation-based estimation is used to determine how likely it is that a given RP value was obtained by chance.

R. Breitling, P. Armengaud, A. Amtmann, P. Herzyk. Rank products: a simple, yet powerful, new method to detect differentially regulated genes in replicated microarray experiments. FEBS Letters, Volume 573, Issue 1, Pages 83-92.
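The rank product and its permutation check can be sketched in a few lines of plain Python. This is a minimal illustration, not the RankProd implementation: the function names are hypothetical, rank 1 is assumed to go to the largest fold change (the up-regulated case), and the p-value is a simple pooled permutation estimate rather than the exact procedure of the cited paper.

```python
import random
from math import prod

def ranks_desc(values):
    """Rank values so the largest fold change gets rank 1 (ties broken by index)."""
    order = sorted(range(len(values)), key=lambda g: -values[g])
    ranks = [0] * len(values)
    for position, g in enumerate(order, start=1):
        ranks[g] = position
    return ranks

def rank_product(fold_changes):
    """fold_changes[i][g] = fold change of gene g in replicate i; returns RP per gene."""
    k = len(fold_changes)
    per_replicate = [ranks_desc(rep) for rep in fold_changes]
    n = len(fold_changes[0])
    # Geometric mean of the k ranks of each gene.
    return [prod(r[g] for r in per_replicate) ** (1.0 / k) for g in range(n)]

def permutation_pvalues(fold_changes, n_perm=1000, seed=0):
    """Per gene: fraction of RP values from shuffled data that are <= the observed RP."""
    rng = random.Random(seed)
    observed = rank_product(fold_changes)
    n = len(observed)
    counts = [0] * n
    for _ in range(n_perm):
        # Shuffle gene labels independently within each replicate.
        shuffled = [rng.sample(rep, len(rep)) for rep in fold_changes]
        null_rp = rank_product(shuffled)
        for g in range(n):
            counts[g] += sum(1 for v in null_rp if v <= observed[g])
    return [c / (n_perm * n) for c in counts]

# Toy data: 3 replicates, 4 genes (values are made up for illustration).
fold_changes = [[4.0, 1.0, 0.5, 2.0],
                [3.0, 1.2, 0.4, 2.5],
                [5.0, 0.9, 0.6, 1.5]]
rp = rank_product(fold_changes)
```

A gene that is consistently ranked first across replicates gets a small RP and a small permutation p-value; down-regulated genes can be handled the same way by ranking in ascending order.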

Two classes/2

• Identification of differentially expressed genes between two classes. The identification consists of two parts: finding the genes that are up-regulated and those that are down-regulated in class a compared to class b.

• These results have been obtained using the Rank Product package (v. 2.16.0) of the Bioconductor library in R.

More than two classes

• Many statistical tests are available:

– Kruskal-Wallis

– ANOVA (for Gaussian data only)

– SAM (?)

– Linear model (R limma package)
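As an illustration of the first test in the list, the sketch below computes the Kruskal-Wallis H statistic for one gene across several classes. It is a minimal plain-Python version under simplifying assumptions: function names are hypothetical, tied values get average ranks, and no tie correction or p-value is computed (in practice a library routine such as `scipy.stats.kruskal` would normally be used).

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks

def kruskal_wallis_h(groups):
    """H statistic (no tie correction) for a list of per-class expression values."""
    pooled = [v for group in groups for v in group]
    ranks = average_ranks(pooled)
    n_total = len(pooled)
    weighted, offset = 0.0, 0
    for group in groups:
        rank_sum = sum(ranks[offset:offset + len(group)])
        weighted += rank_sum ** 2 / len(group)
        offset += len(group)
    return 12.0 / (n_total * (n_total + 1)) * weighted - 3.0 * (n_total + 1)

# One gene measured in three classes (made-up values).
h = kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
```

A large H indicates that the gene's expression ranks differ between classes; under the null hypothesis H is approximately chi-square distributed with (number of classes − 1) degrees of freedom.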

Discretization

• Discretization algorithms play an important role in data mining and knowledge discovery.

• They not only produce a concise summary of continuous attributes that helps experts understand the data more easily, but also make learning more accurate and faster.

• Discretization algorithms can be classified along five different axes:

– supervised versus unsupervised;

– static versus dynamic;

– global versus local;

– top-down (splitting) versus bottom-up (merging);

– direct versus incremental.

Class-Attribute Contingency Coefficient

• Given the quanta matrix, the contingency coefficient is commonly used to measure the strength of dependence between the class variable and the discretized attribute:

$$C = \sqrt{\frac{y}{y + M}}, \qquad y = M\left[\sum_{i=1}^{S}\sum_{r=1}^{n}\frac{q_{ir}^{2}}{M_{i+}\,M_{+r}} - 1\right]$$

– M is the total number of examples;

– $q_{ir}$ (i = 1, 2, ..., S; r = 1, 2, ..., n) denotes the total number of examples belonging to the i-th class that are within the interval $(d_{r-1}, d_r]$;

– $M_{i+}$ is the total number of examples belonging to the i-th class;

– $M_{+r}$ is the total number of examples that are within the interval $(d_{r-1}, d_r]$;

– n is the number of intervals.

C.J. Tsai, C.-I. Lee, W.-P. Yang. A discretization algorithm based on Class-Attribute Contingency Coefficient. Information Sciences 178:3 (2008) 714-731.
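To make the quanta matrix concrete, the sketch below builds it for one gene and a given set of interval boundaries, then computes Pearson's contingency coefficient C = sqrt(y / (y + M)) with y = M[Σ q_ir² / (M_i+ · M_+r) − 1]. The function names are hypothetical. The second returned value is a cacc-style score that divides y by log(n); that normalization is an assumption about the form used in the cited paper, not something stated on the slide.

```python
from bisect import bisect_left
from math import log, sqrt

def quanta_matrix(values, labels, boundaries, classes):
    """q[i][r] = number of examples of class i falling in interval (d_{r-1}, d_r]."""
    inner = boundaries[1:-1]              # cut points strictly inside [d_0, d_n]
    q = [[0] * (len(boundaries) - 1) for _ in classes]
    for v, y in zip(values, labels):
        r = bisect_left(inner, v)         # right-closed intervals: v == d_r stays in r
        q[classes.index(y)][r] += 1
    return q

def contingency_scores(q):
    """Return (C, cacc_like) for a quanta matrix with S rows and n columns."""
    row = [sum(r) for r in q]             # M_{i+}
    col = [sum(c) for c in zip(*q)]       # M_{+r}
    m = sum(row)                          # M
    s = sum(q[i][r] ** 2 / (row[i] * col[r])
            for i in range(len(q)) for r in range(len(col)) if q[i][r])
    y = max(m * (s - 1.0), 0.0)           # guard against tiny negative rounding errors
    c = sqrt(y / (y + m))
    n = len(col)
    # ASSUMPTION: cacc-style normalization of y by log(n), natural logarithm.
    y_norm = y / log(n) if n > 1 else 0.0
    return c, sqrt(y_norm / (y_norm + m))

# Toy gene: low values in controls, high values in cases, one cut point at 6.5.
values = [1, 2, 3, 10, 11, 12]
labels = ["control"] * 3 + ["case"] * 3
q = quanta_matrix(values, labels, [1, 6.5, 12], ["control", "case"])
c, cacc_like = contingency_scores(q)
```

A cut point that separates the classes perfectly gives the largest C attainable for a 2×2 matrix (about 0.707), while a cut point that splits both classes evenly gives 0.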

CACC pseudo-code

[Figure: CACC pseudo-code, not preserved in the text]

Associative classification

• Associative classification mining is a successful approach that uses association rule discovery techniques to build classification systems.

Maximal Frequent Itemset (MAFIA algorithm)

– Given the set of discretized discriminant genes, consider all [gene, interval] pairs as the items of the data mining analysis. For each class k, we compute a set of maximal frequent itemsets (MFI). A frequent itemset for class k is a set of items that appear together in a fraction of the samples of the class greater than a given percentage threshold t. It is maximal if no proper superset of it is frequent.

– For each class k = 0, …, K−1, the set of all MFIs, $MFI(k) = \{mfi_1(k), \dots, mfi_{h_k}(k)\}$, is computed. Class k is then assigned the set of rules

$$R_k:\quad mfi_1(k) \rightarrow \text{class } k,\ \ \dots,\ \ mfi_{h_k}(k) \rightarrow \text{class } k$$

Burdick D, Calimlim M, Flannick J, Gehrke J, Yiu T: MAFIA: A Maximal Frequent Itemset Algorithm. IEEE Transactions on Knowledge and Data Engineering 2005, 17:1490–1504.
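The definition of a maximal frequent itemset can be illustrated with a brute-force search on a tiny example. This is deliberately not the MAFIA algorithm, which uses a far more efficient search; it only shows what the output looks like. Function names are hypothetical, each sample is assumed to be a set of (gene, interval) pairs, and the threshold is a fraction between 0 and 1.

```python
from itertools import combinations

def frequent_itemsets(samples, threshold):
    """All itemsets contained in more than `threshold` (a fraction) of the samples."""
    items = sorted(set().union(*samples))
    frequent = []
    for size in range(1, len(items) + 1):
        found = False
        for candidate in combinations(items, size):
            cset = frozenset(candidate)
            support = sum(1 for s in samples if cset <= s) / len(samples)
            if support > threshold:
                frequent.append(cset)
                found = True
        if not found:
            break  # no frequent set of this size, so no larger one can be frequent
    return frequent

def maximal_frequent_itemsets(samples, threshold):
    """Keep only frequent itemsets that have no frequent proper superset."""
    frequent = frequent_itemsets(samples, threshold)
    return [f for f in frequent if not any(f < other for other in frequent)]

# Discretized samples of one class: items are (gene, interval index) pairs.
class_k_samples = [
    {("g1", 0), ("g2", 1), ("g3", 0)},
    {("g1", 0), ("g2", 1), ("g3", 1)},
    {("g1", 0), ("g2", 1), ("g3", 0)},
    {("g1", 1), ("g2", 1), ("g3", 0)},
]
mfi_k = maximal_frequent_itemsets(class_k_samples, threshold=0.5)
rules_k = [(mfi, "k") for mfi in mfi_k]   # each MFI becomes a rule "mfi -> class k"
```

With a threshold of 0.5 this toy class yields two maximal frequent itemsets, {g1 in interval 0, g2 in interval 1} and {g2 in interval 1, g3 in interval 0}, each of which becomes one rule for class k.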

Evaluation

– Unknown phenotypes are properly discretized and then assigned to a class k with a score, using the association rules. The assignment that yields the highest score determines the class.

– Let $x = \{I_1, \dots, I_m\}$ be an unknown discretized phenotype. We evaluate how many rules are satisfied, even partially, in each $R_k$. The sample is assigned to the class with the most satisfied rules. For a fixed class k, x is evaluated under a generic rule $r_v^k = \{I_i, \dots, I_j\}$ by assigning a score as follows:

[Score formula: not recoverable from the slide text]
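As a stand-in for the scoring step, here is a minimal sketch under an assumed scoring, which is not the formula from the slide: a rule's score is taken to be the fraction of its items present in the sample (so partially satisfied rules still count), and a class's score is the mean over its rules. All names are hypothetical.

```python
def rule_score(sample, rule):
    """ASSUMED score: fraction of the rule's items found in the sample (1.0 = full match)."""
    return len(sample & rule) / len(rule)

def classify(sample, rules_by_class):
    """Return (best_class, scores), using the mean rule score per class."""
    scores = {
        k: sum(rule_score(sample, r) for r in rules) / len(rules)
        for k, rules in rules_by_class.items() if rules
    }
    return max(scores, key=scores.get), scores

# Rules are itemsets of (gene, interval) pairs, one list of rules per class.
rules_by_class = {
    "case": [frozenset({("g1", 0), ("g2", 1)}), frozenset({("g2", 1), ("g3", 0)})],
    "control": [frozenset({("g1", 1), ("g3", 1)})],
}
x = {("g1", 0), ("g2", 1), ("g3", 1)}
predicted, scores = classify(x, rules_by_class)
```

Here the sample fully satisfies one "case" rule and half of the other, so "case" scores 0.75 against 0.5 for "control". Averaging over rules is a design choice that keeps a class with many rules from winning by count alone.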

General schema

[Diagram with the following blocks: Profiling data; Filtering (i.e. discriminant genes); Discretization; Gene patterns (data mining: maximal frequent itemsets); Model validation (KFCV); Filtering based on permutation test; Superset of robust biomarkers; Bayesian network construction (reverse engineering); Pathway perturbation; microRNA analysis. Part of the diagram is labeled "Binary strategy".]

Bayesian networks

• Two components:

– G: a directed acyclic graph whose nodes are the random variables $X_1, \dots, X_n$

– For each variable, a conditional probability distribution given its parents in G.

• Together, these two components specify a unique joint distribution over $X_1, \dots, X_n$.

• Markov assumption

• The joint distribution satisfies the assumption that each variable $X_i$ is directly influenced only by its parents in G: given its parents, $X_i$ is independent of its non-descendants. The joint distribution therefore factorizes as

$$P(X_1, \dots, X_n) = \prod_{i=1}^{n} P\big(X_i \mid \mathrm{parents}(X_i)\big)$$

where parents($X_i$) is the set of parents (direct predecessors) of $X_i$ in G.
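The factorization over parents can be checked on a toy network. The sketch below is a minimal illustration: the chain Raf → Mek → Erk, the binary states, and every probability value are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from itertools import product

def joint_probability(assignment, parents, cpt):
    """P(X1..Xn) as the product over i of P(Xi | parents(Xi)) for a full assignment."""
    p = 1.0
    for var, value in assignment.items():
        parent_values = tuple(assignment[u] for u in parents[var])
        p *= cpt[var][parent_values][value]
    return p

# Hypothetical structure: Raf -> Mek -> Erk, each variable is 0 (low) or 1 (high).
parents = {"Raf": (), "Mek": ("Raf",), "Erk": ("Mek",)}
cpt = {
    "Raf": {(): {0: 0.7, 1: 0.3}},
    "Mek": {(0,): {0: 0.9, 1: 0.1}, (1,): {0: 0.2, 1: 0.8}},
    "Erk": {(0,): {0: 0.8, 1: 0.2}, (1,): {0: 0.1, 1: 0.9}},
}

p_all_high = joint_probability({"Raf": 1, "Mek": 1, "Erk": 1}, parents, cpt)

# The factorized probabilities of all 8 assignments sum to 1.
total = sum(
    joint_probability(dict(zip(parents, states)), parents, cpt)
    for states in product([0, 1], repeat=len(parents))
)
```

For the all-high assignment the product is 0.3 × 0.8 × 0.9 = 0.216. Only three small tables are needed instead of one table with 2³ entries, which is the practical benefit of the Markov assumption.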

Tools for Bayesian network construction

• Banjo

• Biolearn (Dana Pe’er Lab)

– http://www.c2b2.columbia.edu/danapeerlab/html/biolearn.html

Hartemink, A., Gifford, D., Jaakkola, T., & Young, R. (2001) “Using Graphical Models and Genomic Expression Data to Statistically Validate Models of Genetic Regulatory Networks.” In Pacific Symposium on Biocomputing 2001 (PSB01), Altman, R., Dunker, A.K., Hunter, L., Lauderdale, K., & Klein, T., eds. World Scientific: New Jersey. pp. 422–433.

Build a Bayesian network

[Figure: the MFI(K) set and an example network with nodes PKC, PKA, Raf, Mek, Erk, Akt, Jnk, P38]

Pathway Perturbation

• Our goal is to apply an analysis model that uses both:

– a statistically significant number of differentially expressed genes (or miRNAs)

– biologically meaningful changes on a given pathway.

• The model works on a set of pathways describing sub-systems of the given organism that involve the given variables (genes).

S. Draghici, P. Khatri, A.L. Tarca, K. Amin, A. Done, C. Voichita, C. Georgescu, and R. Romero. A systems biology approach for pathway level analysis. Genome Research, 17:1537-1545, 2007.

• Output

– Rank the sub-systems in decreasing order of the amount of disruption suffered.

– If possible, identify those sub-systems for which the disruption is “significant”.

• Gene perturbation factor

$$PF(g) = \alpha(g)\,\Delta E(g) + \sum_{u \in US_g} \beta_{ug}\,\frac{PF(u)}{N_{ds}(u)}$$

– PF(g) = perturbation factor of gene g

– α(g) = a priori type of impact expected from gene g

– ΔE(g) = change in expression level of g (fold change)

– $US_g$ = set of genes directly upstream of g in the pathway

– $N_{ds}(u)$ = number of genes directly downstream of u in the pathway

– $\beta_{ug}$ = efficiency of the connection between u and g
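The quantities listed above combine recursively: a gene's perturbation is its own weighted expression change plus the perturbation it inherits from its upstream genes, PF(g) = α(g)·ΔE(g) + Σ β_ug · PF(u) / N_ds(u) over u in US_g. The sketch below evaluates this on a small hypothetical pathway. It assumes the pathway graph is acyclic (real pathways may contain loops, which this simple recursion does not handle) and defaults α to 1 when none is given; all names and numbers are illustrative.

```python
from functools import lru_cache

def perturbation_factors(delta_e, edges, alpha=None):
    """edges maps (u, g) -> beta_ug for a directed edge u -> g in an ACYCLIC pathway."""
    genes = set(delta_e) | {x for edge in edges for x in edge}
    upstream = {g: [u for (u, v) in edges if v == g] for g in genes}      # US_g
    n_ds = {u: sum(1 for (a, _) in edges if a == u) for u in genes}       # N_ds(u)
    alpha = alpha or {}

    @lru_cache(maxsize=None)
    def pf(g):
        own = alpha.get(g, 1.0) * delta_e.get(g, 0.0)
        inherited = sum(edges[(u, g)] * pf(u) / n_ds[u] for u in upstream[g])
        return own + inherited

    return {g: pf(g) for g in genes}

# Hypothetical pathway: A activates B and C; B activates D; C inhibits D (beta = -1).
edges = {("A", "B"): 1.0, ("A", "C"): 1.0, ("B", "D"): 1.0, ("C", "D"): -1.0}
delta_e = {"A": 2.0, "B": 0.0, "C": 1.0, "D": 0.5}
pf = perturbation_factors(delta_e, edges)
```

In this example gene B has no expression change of its own but still receives a perturbation of 1.0 from A, and D ends up at −0.5 because the inhibitory input from C outweighs the activating input from B.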

• Pathway perturbation factor

$$PF(P_i) = \frac{\sum_{g \in P_i} |PF(g)|}{\overline{|\Delta E|}\cdot N_{de}(P_i)}$$

– $N_{de}(P_i)$ = number of differentially expressed genes on the given pathway $P_i$

– PF(g) = perturbation factor of gene g

– $\overline{|\Delta E|}$ = mean fold change of the differentially expressed genes
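A pathway-level score can be sketched from these three quantities. The code below assumes the normalized form PF(P_i) = Σ |PF(g)| / (mean|ΔE| · N_de(P_i)), with the sum over the genes of the pathway; this form is an assumption inferred from the listed definitions, and the function name and input values are hypothetical.

```python
def pathway_perturbation(pf, delta_e_de):
    """ASSUMED form: sum of |PF(g)| over the pathway / (mean |dE| of DE genes * N_de).

    pf          -- dict gene -> PF(g) for the genes on the pathway
    delta_e_de  -- dict gene -> fold change, only for the differentially expressed genes
    """
    n_de = len(delta_e_de)                                        # N_de(P_i)
    mean_abs_de = sum(abs(v) for v in delta_e_de.values()) / n_de
    return sum(abs(v) for v in pf.values()) / (mean_abs_de * n_de)

# Hypothetical pathway with four genes, two of which are differentially expressed.
pf_values = {"A": 2.0, "B": 1.0, "C": 2.0, "D": -0.5}
de_fold_changes = {"A": 2.0, "C": 1.0}
score = pathway_perturbation(pf_values, de_fold_changes)
```

Dividing by the mean fold change and by the number of differentially expressed genes makes the score comparable across pathways of different sizes and across experiments with different overall expression changes.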

• In this model, the significance (p-value) of the impact factor IF of a set of genes on a pathway $P_i$ (for example, the genes of an MFI that belong to $P_i$) can be estimated by replacing that set with random sets of genes in $P_i$ of the same cardinality.

• The perturbation factor of $P_i$ together with this p-value measures the relevance of the MFI to that pathway.
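The random-replacement test can be sketched as follows. This is a minimal illustration under stated assumptions: the statistic for a gene set is taken to be the sum of |PF(g)| over its genes, random sets are drawn without replacement from the genes of the same pathway, and the p-value uses the common (hits + 1) / (permutations + 1) estimate. Function names and data are hypothetical.

```python
import random

def set_score(genes, pf):
    """ASSUMED statistic: total absolute perturbation of the genes in the set."""
    return sum(abs(pf[g]) for g in genes)

def permutation_pvalue(gene_set, pf, n_perm=2000, seed=0):
    """Fraction of random same-size gene sets from the pathway scoring >= the observed set."""
    rng = random.Random(seed)
    pathway_genes = sorted(pf)
    observed = set_score(gene_set, pf)
    hits = sum(
        1 for _ in range(n_perm)
        if set_score(rng.sample(pathway_genes, len(gene_set)), pf) >= observed
    )
    return (hits + 1) / (n_perm + 1)

# Hypothetical pathway of 10 genes with perturbation factors 0..9.
pf_pathway = {f"g{i}": float(i) for i in range(10)}
p_strong = permutation_pvalue({"g8", "g9"}, pf_pathway)   # the two most perturbed genes
p_weak = permutation_pvalue({"g0", "g1"}, pf_pathway)     # the two least perturbed genes
```

A set made of the most perturbed genes is rarely matched by a random draw and gets a small p-value, while a set of weakly perturbed genes is matched by every draw and gets a p-value of 1.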