

Topological analysis of coexpression networks in neoplastic tissues

R Anglani (1), TM Creanza (2), VC Liuzzi (1), PF Stifanelli (1), R Maglietta (1), A Piepoli (3), S Mukherjee (4), PF Schena (2), N Ancona (1)

(1) Bioinformatics & Systems Biology Lab, CNR-ISSIA, Bari, Italy
(2) Dipartimento di Emergenza e Trapianti Organi, Università di Bari, Italy
(3) Unità Operativa di Gastroenterologia, IRCCS, Casa Sollievo della Sofferenza, San Giovanni Rotondo, Italy
(4) Institute for Genome Sciences and Policy, Duke University, Durham, USA

BITS 2012 Ninth Annual Meeting of the Bioinformatics Italian Society May 2-4, 2012, Catania, Italy

Gene co-expression networks are useful models for illuminating the coordinated expression of groups of genes that are functionally co-regulated to provide an adaptive response to a modification of the system. In this framework, topology-based approaches to network analysis can yield unexpected insights into global properties of biological systems that single-gene approaches cannot reveal. We show that topological differences emerge from the comparison between normal and cancer networks and can identify non-differentially expressed genes that may play a role in the evolution of the specific disease. To this end, we introduce a novel method for characterizing disease genes, based on a new observable called “differential connection”, i.e. a statistically significant difference in the degree of a gene between two phenotype conditions. Moreover, we find that preferential removal of “differentially connected” genes alters the average path length of the normal network differently from random failure and hub removal. Finally, we suggest a possible association between differential connectivity and the presence of mutations within protein interaction domains.

Fisher’s exact test, LUNG (19520 genes)
  DC (FDR < 13%) = 7520
  DE (FDR < 13%) = 12258
  DC & DE = 3293
  p(DC & DE ≥ 3293) ~ 1

Fisher’s exact test, DC & Cancer Census (colon)
  DC (FDR < 13%) = 1072
  CENSUS (CRC) = 18
  DC & CENSUS = 4
  p(DC & CENSUS ≥ 4) ~ 0.02

Fisher’s exact test, DE & Cancer Census (colon)
  DE (FDR < 13%) = 6606
  CENSUS (CRC) = 18
  DE & CENSUS = 9
  p(DE & CENSUS ≥ 9) ~ 0.21

Fisher’s exact test, COLON (17400 genes)
  DC (FDR < 13%) = 1072
  DE (FDR < 13%) = 6606
  DC & DE = 260
  p(DC & DE ≥ 260) ~ 1

Dataset
(EMTAB829) colorectal cancer, Affymetrix GeneChip Human Exon 1.0 ST, 28 samples (14 cancer and 14 paired normal)
(GSE15852) breast cancer, Affymetrix GeneChip Human Genome U133A, 86 samples (43 cancer and 43 paired normal)
(GSE18842) non-small cell lung cancer, Affymetrix GeneChip Human Genome U133Plus2, 91 samples (46 cancer and 45 normal, all paired except 3)

Coexpression network generation
Coexpression networks are built by evaluating the Pearson correlation coefficient (PCC) r between each pair of genes in a dataset X and testing the hypothesis of no correlation. Under the null hypothesis, for approximately Gaussian data, the statistic t computed from the PCC follows Student’s t-distribution with n-2 degrees of freedom, where n is the sample size.

$$ t = \frac{r\,\sqrt{n-2}}{\sqrt{1-r^{2}}} $$
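As a minimal sketch of this construction, the snippet below turns an expression matrix into a boolean adjacency matrix. It assumes a genes-by-samples NumPy array `X` and a user-chosen critical value `t_crit` for |t|; both names are hypothetical, and the poster does not state which significance level was used. The |r| cutoff is obtained by inverting the t-statistic for r.

```python
import numpy as np

def t_statistic(r, n):
    """Student's t statistic for a Pearson correlation r from n samples."""
    return r * np.sqrt(n - 2) / np.sqrt(1 - r**2)

def coexpression_adjacency(X, t_crit):
    """Boolean gene-by-gene adjacency matrix.

    X      : array of shape (genes, samples)            (assumed layout)
    t_crit : critical |t| chosen by the user             (assumed parameter)
    """
    n = X.shape[1]
    r = np.corrcoef(X)  # rows are treated as variables, i.e. genes
    # Inverting t = r*sqrt(n-2)/sqrt(1-r^2) gives the equivalent |r| cutoff.
    r_crit = t_crit / np.sqrt(n - 2 + t_crit**2)
    adj = np.abs(r) > r_crit
    np.fill_diagonal(adj, False)  # no self-loops
    return adj
```

Thresholding |r| at `r_crit` is equivalent to thresholding |t| at `t_crit`, because t is a monotonic function of r for fixed n.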

Degree distribution
The degree of a node in a graph is the number of edges connected to the node. The degree distribution P(k) is the probability that a randomly selected node has k links. A random network is characterized by a Poisson degree distribution, while a ‘scale-free’ network [1] has a power-law distribution P(k) ~ k^(-γ). We find that the degree distributions of the examined coexpression networks are scale-free, in agreement with Ref. [2].
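The definitions above can be sketched in a few lines of Python. The function names are hypothetical, and the log-log least-squares slope is only a crude estimate of γ used here for illustration; the poster does not say how the power law was fitted.

```python
import math
from collections import Counter

def degree_distribution(adjacency):
    """Empirical P(k) from a symmetric 0/1 adjacency matrix (list of rows)."""
    degrees = [sum(1 for x in row if x) for row in adjacency]
    counts = Counter(degrees)
    n = len(degrees)
    return {k: c / n for k, c in sorted(counts.items())}

def powerlaw_exponent(pk):
    """Rough estimate of gamma in P(k) ~ k^(-gamma) via a log-log line fit."""
    pts = [(math.log(k), math.log(p)) for k, p in pk.items() if k > 0 and p > 0]
    mx = sum(x for x, _ in pts) / len(pts)
    my = sum(y for _, y in pts) / len(pts)
    sxy = sum((x - mx) * (y - my) for x, y in pts)
    sxx = sum((x - mx) ** 2 for x, _ in pts)
    return -sxy / sxx  # the slope of log P versus log k is -gamma
```

For a star graph with one hub and three leaves, `degree_distribution` returns `{1: 0.75, 3: 0.25}`.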

Fisher’s exact test, BREAST (12157 genes)
  DC (FDR < 13%) = 582
  DE (FDR < 13%) = 3127
  DC & DE = 337
  p(DC & DE ≥ 337) ~ 1E-64

Differential connectivity
Given Δi, the degree difference of the i-th gene between the normal and cancer conditions, a gene is said to be “differentially connected” when Δi is statistically significant. To assess significance, we randomly assign patients to one of two groups and evaluate Δi* for each permutation. We repeat the shuffle 1000 times to obtain the null distribution. The differential connection p-value is evaluated by comparing the observed Δi with this distribution. To control the expected proportion of incorrectly rejected null hypotheses, we compute the Benjamini-Hochberg False Discovery Rate and set the significance threshold at 13%.

Fisher’s exact tests for the colon and lung cases suggest that “differentially connected” genes can represent a population distinct from differentially expressed genes.
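The permutation scheme and the FDR step can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: edges are defined by a fixed |r| cutoff `r_cut` (hypothetical), sample labels are shuffled across all samples, the test is two-sided, and the p-value uses the common "+1" correction. The function and parameter names are invented for the example.

```python
import numpy as np

def degrees(X, r_cut):
    """Node degrees of the network obtained by thresholding |PCC| at r_cut."""
    adj = np.abs(np.corrcoef(X)) > r_cut
    np.fill_diagonal(adj, False)
    return adj.sum(axis=1)

def differential_connection_pvalues(X, is_cancer, r_cut=0.8, n_perm=1000, seed=0):
    """Permutation p-value of the degree difference for every gene.

    X         : array (genes, samples)
    is_cancer : boolean label per sample
    """
    rng = np.random.default_rng(seed)
    is_cancer = np.asarray(is_cancer, dtype=bool)

    def delta(labels):
        # degree in the "normal" group minus degree in the "cancer" group
        return degrees(X[:, ~labels], r_cut) - degrees(X[:, labels], r_cut)

    observed = np.abs(delta(is_cancer))
    exceed = np.zeros(X.shape[0])
    for _ in range(n_perm):
        shuffled = rng.permutation(is_cancer)  # random reassignment of labels
        exceed += np.abs(delta(shuffled)) >= observed
    return (exceed + 1) / (n_perm + 1)

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (FDR), in the input order."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, starting from the largest p-value.
    q_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    q = np.empty(m)
    q[order] = np.minimum(q_sorted, 1.0)
    return q
```

With these helpers, genes whose adjusted value is below 0.13 would be called differentially connected, e.g. `benjamini_hochberg(p) < 0.13`.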

Figure: Average path length vs. fraction of removed nodes, panels Breast and Lung (one panel with an inset over removed fractions 0 to 0.4). Curves: random, hubs, diff conn, diff expr.

Figure: BH False Discovery Rate [%] vs. differential connection p-value (logarithmic axis, 0.0001 to 1). Curves: Lung, Breast, Colon.

System attack tolerance
Average path length (APL) is the mean of the geodesic lengths over all pairs of vertices, and it measures the efficiency of information transport in the network. Removing differentially connected genes alters the APL in a way that is clearly different from random failure, hub removal [3] and removal of differentially expressed genes.
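A small pure-Python sketch of such a removal experiment is given below. It assumes an undirected network stored as a dictionary mapping each node to a set of neighbours, and it averages only over pairs that are still connected; the poster does not say how disconnected pairs were handled, so that choice is an assumption. All function names are hypothetical.

```python
from collections import deque

def average_path_length(adj):
    """Mean geodesic length over connected node pairs (BFS from every node)."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else float("nan")

def remove_nodes(adj, nodes):
    """Copy of the network without the given nodes and their edges."""
    gone = set(nodes)
    return {u: nbrs - gone for u, nbrs in adj.items() if u not in gone}

def hubs_first(adj):
    """Nodes ranked by decreasing degree (the 'hub removal' order)."""
    return sorted(adj, key=lambda u: len(adj[u]), reverse=True)

def removal_curve(adj, ranked_nodes, fractions):
    """APL after removing the top fraction of ranked_nodes, for each fraction."""
    n = len(adj)
    return [
        average_path_length(remove_nodes(adj, ranked_nodes[: int(round(f * n))]))
        for f in fractions
    ]
```

Passing a random ordering, `hubs_first(adj)`, or a list of genes sorted by differential connection p-value as `ranked_nodes` gives the different attack strategies compared in the text.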

Literature validation: Cancer-related mutations
During cancer progression, mutations can occur in either regulatory or coding sites of genes. It is reasonable to expect that alterations of regulatory sites lead to modified gene expression, whereas missense and nonsense mutations in the binding regions of a protein could disrupt some of its interactions with other proteins. In our study, we find a significant intersection (Fisher’s exact test p-value ~ 0.02) between differentially connected genes and colon-cancer-related Census genes. Differentially expressed genes do not reach the same significance (p-value ~ 0.21).

Our results suggest that differentially connected genes can correspond to the genes frequently mutated in colorectal cancer according to the Cancer Gene Census database (Wellcome Trust Sanger Institute).
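The one-sided Fisher’s exact test used for these overlaps is the upper tail of a hypergeometric distribution, which can be sketched with the standard library alone. The function name and argument names below are hypothetical; the gene counts in the usage lines are the ones reported on the poster for the colon dataset.

```python
from math import comb

def overlap_pvalue(n_genes, n_a, n_b, n_overlap):
    """P(overlap >= n_overlap) when set B (size n_b) is drawn at random
    from n_genes genes, of which n_a belong to set A."""
    upper = min(n_a, n_b)
    tail = sum(
        comb(n_a, x) * comb(n_genes - n_a, n_b - x)
        for x in range(n_overlap, upper + 1)
    )
    return tail / comb(n_genes, n_b)

# Colon dataset: 17400 genes, 18 colorectal Census genes.
p_dc = overlap_pvalue(17400, 1072, 18, 4)  # differentially connected genes
p_de = overlap_pvalue(17400, 6606, 18, 9)  # differentially expressed genes
```

With these counts, `p_dc` and `p_de` come out close to the reported values of about 0.02 and 0.21, respectively.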

[1] Barabási & Oltvai, Nature Reviews Genetics 5, 101 (2004)
[2] Carter et al., Bioinformatics 20, 2242 (2004)
[3] Albert & Barabási, Rev. Mod. Phys. 74, 47 (2002)

Figure: Average path length vs. fraction of removed nodes, panel Colon (with an inset over removed fractions 0 to 0.4). Curves: random, hubs, diff conn, diff expr.

Gene names shown on the poster: FBXW7, KRAS, MAP2K4, VTI1A.

Figure: Number of genes with degree k vs. degree k, panels Breast, Colon and Lung, each with a log-log inset. Curves: normal, cancer.