Bioinformatics. Master's Degree Course in Computer Science (Corso di Laurea Specialistica in Informatica). Microarrays and Biomarkers. 06/05/2011


Classification of microarray samples

• We are given a set (called the learning set) of microarray expression data coming from several classes of samples (patients).

• To simplify the problem we consider only two classes, case and control, so we have a set of case/control pairs.

• Examples: cancer/normal, metastatic/non-metastatic, etc.

• Goal: build a classifier able to decide to which class a new, unclassified sample belongs.

Expression profiling data analysis

• A supervised approach to classification:

• Identify genes (or microRNAs) that are differentially expressed in the two classes of samples.

• Discretize the expression values of the discriminant genes.

• Use these genes to build a classifier able to classify new (unknown) samples.

Two classes/1

• Rank Product

– Rank Product is a non-parametric statistical method based on ranks of fold changes. Given n genes and k replicates, let $e_{g,i}$ be the fold change (ratio case/control) of gene g in the i-th replicate and $r_{g,i}$ the rank of gene g in the i-th replicate.

– The rank product is computed through the geometric mean:

$$RP_g = \left( \prod_{i=1}^{k} r_{g,i} \right)^{1/k}$$

– A simple permutation-based estimation is used to determine how likely it is that a given RP value was obtained by chance.

R. Breitling, P. Armengaud, A. Amtmann, P. Herzyk. Rank products: a simple, yet powerful, new method to detect differentially regulated genes in replicated microarray experiments. FEBS Letters, Volume 573, Issue 1, Pages 83-92.
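The rank product and its permutation-based significance can be sketched in a few lines. This is a minimal illustration, not the implementation used by the Bioconductor package: the function names, the convention that rank 1 is the largest fold change (up-regulation), and the way the null distribution is pooled over all genes are assumptions made for the example.

```python
import math
import random

def ranks_descending(values):
    """Rank 1 = largest fold change; ties are broken by position."""
    order = sorted(range(len(values)), key=lambda g: -values[g])
    ranks = [0] * len(values)
    for position, g in enumerate(order, start=1):
        ranks[g] = position
    return ranks

def rank_products(fold_changes):
    """fold_changes[i][g] = fold change of gene g in replicate i."""
    k = len(fold_changes)
    n = len(fold_changes[0])
    per_replicate = [ranks_descending(rep) for rep in fold_changes]
    # Geometric mean of the k ranks, computed in log space for stability.
    return [
        math.exp(sum(math.log(per_replicate[i][g]) for i in range(k)) / k)
        for g in range(n)
    ]

def permutation_p_values(fold_changes, n_perm=1000, seed=0):
    """Estimate how often a rank product this small arises by chance.

    Each replicate is shuffled independently, which breaks the link between
    gene identity and fold change; the null rank products of all genes are
    pooled (an assumption of this sketch).
    """
    rng = random.Random(seed)
    observed = rank_products(fold_changes)
    n = len(observed)
    counts = [0] * n
    for _ in range(n_perm):
        shuffled = [rng.sample(rep, len(rep)) for rep in fold_changes]
        null_rp = rank_products(shuffled)
        for g in range(n):
            counts[g] += sum(1 for v in null_rp if v <= observed[g])
    return [c / (n_perm * n) for c in counts]
```

A gene that is consistently at the top of every replicate gets a rank product close to 1 and a small estimated p-value; down-regulated genes can be tested the same way after ranking in ascending order.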

Two classes/2

• Identification of differentially expressed genes between two classes. The identification consists of two parts: the genes that are up-regulated and the genes that are down-regulated in class a compared to class b.

• These results have been obtained using the Rank Product package (v. 2.16.0) of the BioConductor library under the R system.

More than two classes

• Many statistical tests are available:

– Kruskal-Wallis
– ANOVA (for Gaussian data only)
– SAM (?)
– Linear model (R limma package)
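As an illustration of the first option, the Kruskal-Wallis H statistic for one gene measured in several classes can be computed directly from the pooled ranks. This is a minimal sketch: the function name is hypothetical, tied values receive average ranks but no tie correction is applied, and the p-value (normally taken from a chi-square distribution with K−1 degrees of freedom) is not computed here.

```python
def kruskal_wallis_h(groups):
    """H statistic for a list of groups of expression values (one group per class)."""
    pooled = sorted((v, gi) for gi, grp in enumerate(groups) for v in grp)
    n = len(pooled)
    rank_sums = [0.0] * len(groups)
    i = 0
    while i < n:
        # Find the run of tied values starting at position i.
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the ranks i+1 .. j
        for _, gi in pooled[i:j]:
            rank_sums[gi] += avg_rank
        i = j
    weighted = sum(r * r / len(g) for r, g in zip(rank_sums, groups))
    return 12.0 / (n * (n + 1)) * weighted - 3 * (n + 1)
```

Being rank-based, the statistic makes no Gaussian assumption, which is the practical difference from ANOVA noted in the list.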

Discretization

• Discretization algorithms play an important role in data mining and knowledge discovery.

• They produce a concise summary of continuous attributes, which helps experts understand the data more easily, and they can also make learning faster and more accurate.

• Discretization algorithms can be classified along five axes:

– supervised versus unsupervised;
– static versus dynamic;
– global versus local;
– top-down (splitting) versus bottom-up (merging);
– direct versus incremental.

Class-Attribute Contingency Coefficient

• Given the quanta matrix, the contingency coefficient is commonly used to measure the strength of dependence between the class variable and the discretized attribute:

– $q_{ir}$ (i = 1,2,...,S; r = 1,2,...,n) denotes the total number of examples belonging to the i-th class that are within the interval $(d_{r-1}, d_r]$;
– $M_{i+}$ is the total number of examples belonging to the i-th class;
– $M_{+r}$ is the total number of examples that are within the interval $(d_{r-1}, d_r]$;
– n is the number of intervals.

C.J. Tsai, C.-I. Lee, W.-P. Yang. A discretization algorithm based on Class-Attribute Contingency Coefficient. Information Sciences 178:3 (2008) 714-731.
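For reference, the coefficient built from these quantities can be written as below. This is a sketch of the definition as given in Tsai et al. (2008) and should be checked against the paper; M (the total number of examples) and the auxiliary quantity y are symbols that the slide does not define, and the base of the logarithm is an assumption.

$$y = \frac{M\left[\left(\sum_{i=1}^{S}\sum_{r=1}^{n} \dfrac{q_{ir}^{2}}{M_{i+}\,M_{+r}}\right) - 1\right]}{\log n}, \qquad cacc = \sqrt{\frac{y}{y + M}}$$

When class and interval are independent the double sum equals 1, so y and cacc are 0; the stronger the dependence, the larger the value.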

CACC Pseudo-code
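The pseudo-code itself is not present in this text. As a rough stand-in, the following is a simplified greedy top-down sketch in the spirit of CACC, not a reproduction of the published algorithm: candidate cut points are the midpoints between adjacent distinct values, the cut that maximizes the coefficient is added at each step, and the search stops as soon as no cut improves it. The stopping rule, the function names and the use of the natural logarithm are all assumptions.

```python
import math

def cacc_value(values, labels, cuts):
    """Coefficient of the discretization of `values` induced by the sorted `cuts`."""
    classes = sorted(set(labels))
    n = len(cuts) + 1  # number of intervals (always >= 2 here)
    quanta = [[0] * n for _ in classes]
    for v, c in zip(values, labels):
        r = sum(1 for d in cuts if v > d)  # index of the interval (d_{r-1}, d_r]
        quanta[classes.index(c)][r] += 1
    rows = [sum(row) for row in quanta]        # M_{i+}
    cols = [sum(col) for col in zip(*quanta)]  # M_{+r}
    total = len(values)                        # M
    s = sum(
        q * q / (rows[i] * cols[r])
        for i, row in enumerate(quanta)
        for r, q in enumerate(row)
        if q
    )
    y = max(0.0, total * (s - 1.0) / math.log(n))
    return math.sqrt(y / (y + total))

def greedy_cacc_cuts(values, labels):
    """Add cut points one at a time while the coefficient keeps improving."""
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    cuts, best = [], 0.0
    while candidates:
        scored = [(cacc_value(values, labels, sorted(cuts + [c])), c) for c in candidates]
        score, cut = max(scored)
        if score <= best:
            break  # assumed stopping rule: no candidate improves the coefficient
        cuts = sorted(cuts + [cut])
        candidates.remove(cut)
        best = score
    return cuts, best
```

On a gene whose expression cleanly separates two classes, the sketch returns a single cut between the two groups of values.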

Associative classification

• Associative classification mining is an approach that uses association rule discovery techniques to build classification systems.

Maximal Frequent Itemset (e.g. the MAFIA algorithm)

– Given the set of discretized discriminant genes, consider all the pairs [gene, interval] as the items of our data mining analysis. For each class k we compute a set of maximal frequent itemsets (MFI). A frequent itemset for a class k is a set of items that appear together in a fraction of the samples of that class greater than a given percentage threshold t. It is maximal if no proper superset of it is frequent.

– For each class k = 0,…,K−1, the set of all MFIs, $MFI(k) = \{mfi_1(k), ..., mfi_{h_k}(k)\}$, is computed.

Then assign to k the set of rules $R_k$:

$mfi_1(k)$ → class k
...
$mfi_{h_k}(k)$ → class k

Burdick D, Calimlim M, Flannick J, Gehrke J, Yiu T: MAFIA: A Maximal Frequent Itemset Algorithm. IEEE Transactions on Knowledge and Data Engineering 2005, 17:1490–1504.
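The definition of a maximal frequent itemset can be made concrete with a brute-force enumeration. This is not MAFIA, which uses a far more efficient search; it is a minimal sketch for small examples only, and the function name and data layout (each sample as a set of (gene, interval) pairs) are assumptions.

```python
from itertools import combinations

def maximal_frequent_itemsets(samples, threshold):
    """samples: list of sets of (gene, interval) items, all from ONE class.

    An itemset is frequent if it is contained in more than
    threshold * len(samples) samples; it is maximal if no proper
    superset of it is frequent.
    """
    items = sorted(set().union(*samples))
    min_count = threshold * len(samples)
    frequent = []
    for size in range(1, len(items) + 1):
        level = [
            frozenset(c)
            for c in combinations(items, size)
            if sum(1 for s in samples if s.issuperset(c)) > min_count
        ]
        if not level:
            break  # no frequent itemset of this size, hence none of a larger size
        frequent.extend(level)
    # Keep only itemsets that have no frequent proper superset.
    return [f for f in frequent if not any(f < g for g in frequent)]
```

Running the function once per class on that class's samples gives the sets MFI(k); each returned itemset then becomes the left-hand side of one rule pointing to class k.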

Evaluation

– Unknown phenotypes are properly discretized and then assigned to a class k with a score, by using the association rules. The assignment that yields the highest score establishes the class.

– Let $x = \{I_1, ..., I_m\}$ be an unknown discretized phenotype. We evaluate how many rules are satisfied, even partially, in each $R_k$, and the sample is assigned to the class with the largest number of satisfied rules. For a fixed class k, we evaluate x under a generic rule $r_v^k = \{I_i, ..., I_j\}$, assigning a score in the following way:
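The scoring formula itself is not present in this text. Purely as an illustration of rule-based assignment with partial matches, the sketch below uses a hypothetical score: each rule contributes the fraction of its items found in x, and the class with the highest total wins. Both the score and the function names are assumptions, not the authors' definition.

```python
def rule_score(x, rule):
    """Hypothetical partial-match score: fraction of the rule's items present in x."""
    return len(x & rule) / len(rule)

def classify(x, rules_by_class):
    """rules_by_class maps each class label k to its list of rules (the set R_k).

    Returns the class with the highest total score together with all totals.
    """
    totals = {
        k: sum(rule_score(x, rule) for rule in rules)
        for k, rules in rules_by_class.items()
    }
    best = max(totals, key=totals.get)
    return best, totals
```

With this choice a fully satisfied rule scores 1 and a rule with no matching item scores 0, so partially satisfied rules still contribute to the decision.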

General schema

[Figure: analysis pipeline with the boxes "Profiling data", "Filtering (i.e. discriminant genes)", "Discretization", "Gene patterns (data mining: maximal frequent itemsets)", "Model validation (KFCV)", "Filtering based on permutation test", "Superset of robust biomarkers", "Bayesian network construction (reverse engineering)", "Pathway perturbation" and "microRNA analysis"; the diagram is labeled "Binary strategy".]

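Bayesian networks

• Two components:

– G, a directed acyclic graph whose nodes are the random variables $X_1, …, X_n$;
– for each variable $X_i$, a conditional probability distribution given its parents (precursors) in G.

• Together, these two components specify a unique joint distribution over $X_1, …, X_n$.

• Markov assumption: each variable $X_i$ is conditionally independent of its non-descendants given its parents in G, so it is directly influenced only by the variables that immediately precede it.

• Every joint distribution that satisfies this assumption factorizes as

$$P(X_1, \dots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{parents}(X_i))$$

where $\mathrm{parents}(X_i)$ is the set of precursors of $X_i$ in G.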
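The factorization over parents can be checked on a toy network. The sketch below assumes a hypothetical three-node chain Raf → Mek → Erk with binary states; the structure and every probability in it are made up for illustration.

```python
# Hypothetical chain Raf -> Mek -> Erk with binary states (0 = low, 1 = high).
parents = {"Raf": (), "Mek": ("Raf",), "Erk": ("Mek",)}

# cpt[node][parent_values] = P(node = 1 | parents = parent_values); made-up numbers.
cpt = {
    "Raf": {(): 0.3},
    "Mek": {(0,): 0.1, (1,): 0.8},
    "Erk": {(0,): 0.2, (1,): 0.9},
}

def joint_probability(assignment):
    """P(X_1, ..., X_n) as the product of P(X_i | parents(X_i))."""
    prob = 1.0
    for node, pars in parents.items():
        p_high = cpt[node][tuple(assignment[p] for p in pars)]
        prob *= p_high if assignment[node] == 1 else 1.0 - p_high
    return prob
```

For example, the probability that all three nodes are high is 0.3 × 0.8 × 0.9, and the probabilities of the eight possible assignments sum to 1, as required of a joint distribution.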

Tools for Bayesian networks construction

• Banjo

• Biolearn (Dana Pe’er Lab)

– http://www.c2b2.columbia.edu/danapeerlab/html/biolearn.html

Hartemink, A., Gifford, D., Jaakkola, T., & Young, R. (2001) “Using Graphical Models and Genomic Expression Data to Statistically Validate Models of Genetic Regulatory Networks.” In Pacific Symposium on Biocomputing 2001 (PSB01), Altman, R., Dunker, A.K., Hunter, L., Lauderdale, K., & Klein, T., eds. World Scientific: New Jersey. pp. 422–433.

Build a Bayesian network

[Figure: the MFI(K) set and a Bayesian network with nodes PKC, PKA, Raf, Mek, Erk, Akt, Jnk and P38.]

Pathway Perturbation

• Our goal is to apply an analysis model that uses both:

– a statistically significant number of differentially expressed genes (or miRNAs);
– biologically meaningful changes on a given pathway.

• Input: a set of pathways describing sub-systems of the given organism involving the given variables (genes).

S. Draghici, P. Khatri, A.L. Tarca, K. Amin, A. Done, C. Voichita, C. Georgescu, and R. Romero. A systems biology approach for pathway level analysis. Genome Research, 17:1537-1545, 2007.

• Output

– Rank the sub-systems in decreasing order of the amount of disruption suffered.
– If possible, identify those sub-systems for which the disruption is “significant”.

• Gene perturbation factor

– PF(g) = perturbation factor of gene g;
– α = a priori type of impact expected from that gene;
– ΔE(g) = change in expression level of g (fold change);
– $US_g$ = set of genes directly upstream of g in the pathway;
– $N_{ds}(u)$ = number of genes directly downstream of u in the pathway;
– $\beta_{ug}$ = efficiency of the connection between u and g.
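The formula that combines these quantities is not present in this text. The sketch below assumes the recursive form used by Draghici et al. (2007), PF(g) = α(g)·ΔE(g) + Σ over u in US_g of β_ug·PF(u)/N_ds(u), with α acting as a multiplicative weight; that reading should be checked against the paper. The toy pathway, the fold changes and the unit weights are all made up, and the pathway is assumed to be acyclic.

```python
from functools import lru_cache

# Hypothetical toy pathway: an edge (u, g) means "u is directly upstream of g".
edges = [("PKA", "Raf"), ("Raf", "Mek"), ("Mek", "Erk")]
delta_e = {"PKA": 0.0, "Raf": 2.0, "Mek": 1.0, "Erk": 0.0}  # made-up fold changes
alpha = {g: 1.0 for g in delta_e}     # assumed a priori impact weights
beta = {edge: 1.0 for edge in edges}  # assumed connection efficiencies

upstream = {g: [u for u, v in edges if v == g] for g in delta_e}         # US_g
n_downstream = {u: sum(1 for a, _ in edges if a == u) for u in delta_e}  # N_ds(u)

@lru_cache(maxsize=None)
def perturbation_factor(g):
    """PF(g): own expression change plus the perturbation inherited from upstream genes."""
    propagated = sum(
        beta[(u, g)] * perturbation_factor(u) / n_downstream[u]
        for u in upstream[g]
    )
    return alpha[g] * delta_e[g] + propagated
```

In this toy chain Erk has no expression change of its own, yet it inherits the full perturbation accumulated along Raf and Mek, which is the point of propagating along the pathway topology.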

• Pathway perturbation factor

– $N_{de}(P_i)$ = number of differentially expressed genes on the given pathway $P_i$;
– PF(g) = perturbation factor of the gene g;
– mean fold change of the differentially expressed genes.

• In this model, the significance (p-value) of the impact factor IF of a set of genes (for example those of an MFI belonging to $P_i$) on a pathway $P_i$ can be estimated by replacing that set with random sets of genes in $P_i$ of the same cardinality.

• The perturbation factor of $P_i$ and this p-value together measure the relevance of the MFI on that pathway.
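The random-replacement test can be sketched as follows. The statistic used here, the sum of |PF(g)| over the gene set, is only a hypothetical stand-in for the impact factor, and the function name, the number of permutations and the +1 correction are choices made for this example.

```python
import random

def permutation_p_value(gene_set, pathway_genes, pf, n_perm=2000, seed=0):
    """Fraction of random same-size gene sets drawn from the pathway whose
    statistic is at least as large as that of gene_set (with a +1 correction)."""
    def statistic(genes):
        # Hypothetical stand-in for the impact factor of a gene set.
        return sum(abs(pf[g]) for g in genes)

    rng = random.Random(seed)
    observed = statistic(gene_set)
    pool = sorted(pathway_genes)
    hits = sum(
        statistic(rng.sample(pool, len(gene_set))) >= observed
        for _ in range(n_perm)
    )
    return (hits + 1) / (n_perm + 1)
```

A small p-value means that few random gene sets of the same size perturb the pathway as much as the MFI does, which supports the relevance of that MFI on the pathway.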
