Improving PPI Networks with Correlated Gene Expression Data Jesse Walsh

Background

• PPI networks are currently derived either computationally or experimentally

• It is well known that there are a great number of false positives and false negatives in computationally derived networks

• High-throughput
  – Two major published yeast PPI experiments showed only ≈150 similar interactions out of thousands [1]

Goal: Improve the Data Quality

• The goal of this study is to improve the quality of computationally predicted protein-protein interactions

• Hypothesis:
  – Proteins that interact may also have similar expression patterns
  – Gene coexpression is correlated to PPIs

Previous Work

• Deane et al. (2001) [2]
  – Proposed EPR metric: use gene expression profiles to assess the quality of computationally predicted protein-protein interactions

Figure adapted from Deane et al. [2]

Glimpse at the Data

Interaction Data: DIP dataset statistics

Organism         Genome Size       Interactions   Proteins
E. coli (2008)   4.6 million bp    7447           1863
Yeast (2001)     --                8063           4150
Yeast (2008)     12.5 million bp   18440          4943

DIP (Database of Interacting Proteins) http://dip.doe-mbi.ucla.edu/dip/Main.cgi [3]

Affymetrix Expression Data: M3D (Many Microbe Microarrays Database)

http://m3d.bu.edu/cgi-bin/web/array/index.pl?section=home [4]

• Number of Experiments: 466
• Number of Chips: 907
• Genes: 4298

Expression Data Selection

• Concerned about complications from adding too many expression conditions
  – Knockouts, over-expression, foreign genes

• Selected a group of 20 conditions that were published as part of the same experiment
  – Hope for more homogeneous data
  – Allen et al. [5]

Method: Gene Expression Distance

• Expression Distance given by:

• Summed over 20 conditions
  – Treated the first condition (wild-type anaerobic) as the reference condition for lack of a true control

Expression Distance equation from Deane et al. [2]
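
The equation itself did not survive on this slide. The following is a minimal sketch, assuming the distance is the squared Euclidean distance between log-ratio expression profiles, with each condition divided by the reference condition. That form is an assumption chosen to match the "squared distances" plotted in the Results, not a transcription of the original equation. The function name and the toy values are hypothetical.

```python
import math

def squared_expression_distance(expr_a, expr_b, ref_index=0):
    """Squared distance between two expression profiles.

    Assumption: each profile is a list of positive expression values, one per
    condition, and the value at ref_index is the reference condition.
    """
    if len(expr_a) != len(expr_b):
        raise ValueError("profiles must cover the same conditions")
    ref_a, ref_b = expr_a[ref_index], expr_b[ref_index]
    total = 0.0
    for i, (a, b) in enumerate(zip(expr_a, expr_b)):
        if i == ref_index:
            continue  # reference vs. itself contributes log(1) = 0 for both genes
        total += (math.log(a / ref_a) - math.log(b / ref_b)) ** 2
    return total

# Toy profiles over four conditions (hypothetical values)
gene_a = [1.0, 2.0, 4.0, 8.0]
gene_b = [2.0, 4.0, 8.0, 16.0]  # same fold changes as gene_a
gene_c = [1.0, 1.0, 1.0, 1.0]   # flat profile

print(squared_expression_distance(gene_a, gene_b))
print(squared_expression_distance(gene_a, gene_c))
```

Under this form, two genes with identical fold changes relative to the reference have distance 0 even if their absolute expression levels differ, and a gene paired with itself always has distance 0.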

Method: DIP Data

• DIP labels interactions as ‘core’ or ‘non-core’
  – Corresponds roughly to small-scale experiments and high-throughput experiments

• 3 Interaction Datasets
  – Core: core interactions
  – Non: non-core interactions
  – Rand: 100,000 random interactions were created
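
The slides do not say how the random set was generated. As a rough sketch, assuming pairs are drawn uniformly with replacement from the list of genes that have expression data (so a gene can occasionally be paired with itself), the Rand set could be built as follows. The function name, the seed, and the placeholder gene IDs are hypothetical.

```python
import random

def random_interactions(gene_ids, n_pairs, seed=0):
    """Draw n_pairs random gene pairs, uniformly and with replacement.

    Assumption: self-pairs are allowed; the source does not state this.
    """
    rng = random.Random(seed)  # fixed seed so the set is reproducible
    return [(rng.choice(gene_ids), rng.choice(gene_ids)) for _ in range(n_pairs)]

# Placeholder IDs standing in for the 4298 genes in the expression dataset
genes = [f"gene{i:04d}" for i in range(4298)]
rand = random_interactions(genes, 100_000)
print(len(rand))
```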

Method: Mapping

• DIP interaction set used UniProt protein identifiers, while M3D used gene IDs

• Ran a BLAST search of protein sequences against the translated E. coli genome to map the datasets together

• Lost most of my data on this step

Dataset   Number of Interactions   Mapping Available
CORE      991                      220
NON       5999                     903
RAND      100,000                  100,000
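
The filtering effect of the mapping step can be sketched as below, assuming the BLAST results have already been reduced to a dictionary from protein identifier to gene ID. An interaction survives only if both partners have a mapping. The function name and all identifiers in the example are hypothetical.

```python
def map_interactions(interactions, protein_to_gene):
    """Keep only interactions whose two proteins both map to a gene ID."""
    mapped = []
    for prot_a, prot_b in interactions:
        gene_a = protein_to_gene.get(prot_a)
        gene_b = protein_to_gene.get(prot_b)
        if gene_a is not None and gene_b is not None:
            mapped.append((gene_a, gene_b))
    return mapped

# Hypothetical identifiers: PROT_C has no mapping, so its interaction is dropped
protein_to_gene = {"PROT_A": "gene_1", "PROT_B": "gene_2"}
interactions = [("PROT_A", "PROT_B"), ("PROT_A", "PROT_C")]

kept = map_interactions(interactions, protein_to_gene)
print(kept)  # [('gene_1', 'gene_2')]
```

Because one unmapped protein removes every interaction it takes part in, a modest number of missing identifiers can discard a large share of the dataset.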

Results

Figure: Density distribution of squared distances (bin size = 0.05), shown with results from Deane et al. [2]

Figure: Bin size = 1.25, shown with results from Deane et al. [2]
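
A density distribution with a fixed bin size can be computed as in the sketch below. It assumes each bin's density is its count divided by (total number of values × bin size), so that the bars integrate to at most 1. The function name and the sample values are hypothetical.

```python
import math

def density_histogram(values, bin_size, n_bins):
    """Fixed-width density histogram over [0, n_bins * bin_size).

    Values outside the range are ignored in the counts but still count
    toward the total, so the bars integrate to at most 1.
    """
    counts = [0] * n_bins
    for v in values:
        idx = math.floor(v / bin_size)
        if 0 <= idx < n_bins:
            counts[idx] += 1
    total = len(values)
    if total == 0:
        return [0.0] * n_bins
    return [c / (total * bin_size) for c in counts]

# Hypothetical squared distances, binned at the 0.05 bin size used on the slide
sample = [0.0, 0.01, 0.06, 0.12]
density = density_histogram(sample, bin_size=0.05, n_bins=3)
print(density)
```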


Least Squares Factorization
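
The slide gives only this title. One plausible reading, stated here as an assumption, is that a dataset's density is modeled as a mixture α·core + (1 − α)·random over the same bins, and α is chosen by least squares. The sketch below uses the closed-form solution for a single coefficient. The function name and the toy densities are hypothetical.

```python
def fit_mixture_fraction(p_test, p_core, p_rand):
    """Least-squares alpha for p_test ≈ alpha * p_core + (1 - alpha) * p_rand.

    All three inputs are densities over the same bins. The result is
    clipped to [0, 1] so it can be read as a fraction.
    """
    diff = [c - r for c, r in zip(p_core, p_rand)]
    resid = [t - r for t, r in zip(p_test, p_rand)]
    denom = sum(d * d for d in diff)
    if denom == 0:
        raise ValueError("core and random densities are identical")
    alpha = sum(x * d for x, d in zip(resid, diff)) / denom
    return min(1.0, max(0.0, alpha))

# Toy densities: the test set is built as 30% core + 70% random
p_core = [4.0, 1.0, 0.0]
p_rand = [1.0, 2.0, 2.0]
p_test = [0.3 * c + 0.7 * r for c, r in zip(p_core, p_rand)]

alpha = fit_mixture_fraction(p_test, p_core, p_rand)
print(round(alpha, 3))
```

A single α summarizes the whole dataset, which is why this kind of fit speaks to overall dataset quality rather than to individual interactions.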

Discussion

• Mapping/Demerging problem
  – Kept 1044 of my 6991 interactions (15%)

• Case study P22885
  – Obsolete since 2005
  – Demerged to P0A8P6 and P0A8P7
  – Tyrosine recombinase xerC
  – All three have a perfect ClustalW match

Discussion

• Shape of curve
  – Multimers and proteins that link to themselves in the PPI network (the zeros problem)
  – 66.6% of Core, 43.6% of Non, <0.1% of Rand
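
The size of the zeros problem can be measured directly. The sketch below assumes a "zero" is an interaction whose two partners map to the same gene, which forces an expression distance of exactly 0 regardless of the expression data. The function name and the example pairs are hypothetical.

```python
def self_interaction_fraction(interactions):
    """Fraction of interactions in which both partners are the same gene."""
    if not interactions:
        return 0.0
    zeros = sum(1 for a, b in interactions if a == b)
    return zeros / len(interactions)

# Hypothetical mapped interactions: two of the four are self-pairs
pairs = [("g1", "g1"), ("g1", "g2"), ("g3", "g3"), ("g2", "g4")]
print(self_interaction_fraction(pairs))  # 0.5
```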

Conclusion

• Cannot predict novel interactions

• Cannot assign confidence values to individual interactions

• Can provide some measure of the overall quality of a PPI dataset

Thank You

• References:

• [1] Deeds EJ, Ashenberg O, Shakhnovich EI. “A simple physical model for scaling in protein-protein interaction networks.” Proc. Natl Acad. Sci. USA (2006) 103:311–316

• [2] Charlotte M. Deane et al. “Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations.” Molecular & Cellular Proteomics 1.5 349-356

• [3] Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D (2004) “The Database of Interacting Proteins: 2004 update.” NAR 32 Database issue:D449-51

• [4] Faith JJ, Driscoll ME, Fusaro VA, Cosgrove EJ, Hayete B, Juhn FS, Schneider SJ, Gardner TS. “Many Microbe Microarrays Database: uniformly normalized Affymetrix compendia with structured experimental metadata.” Nucleic Acids Research

• [5] Timothy E. Allen et al. “Genome-Scale Analysis of the Uses of the Escherichia coli Genome: Model-Driven Analysis of Heterogeneous Data Sets.” J Bacteriol. 2003 November, 185(21): 6392-6399
