Improving PPI Networks with Correlated Gene Expression Data
Jesse Walsh
Background
• PPI networks are currently derived either computationally or experimentally
• It is well known that there are a great number of false positives and false negatives in computationally derived networks
• High-throughput
– Two major published yeast PPI experiments showed only ≈150 similar interactions out of thousands [1]
Goal: Improve the Data Quality
• The goal of this study is to improve the quality of computationally predicted protein-protein interactions
• Hypothesis:
– Proteins that interact may also have similar expression patterns
– Gene coexpression is correlated with PPIs
Previous Work
• Deane et al. (2001) [2]
– Proposed EPR metric: use gene expression profiles to assess the quality of computationally predicted protein-protein interactions
Figure adapted from Deane et al. [2]
Glimpse at the Data
Interaction Data: DIP dataset statistics
Organism          Genome Size       Interactions   Proteins
E. coli (2008)    4.6 million bp    7447           1863
Yeast (2001)      --                8063           4150
Yeast (2008)      12.5 million bp   18440          4943
DIP (Database of Interacting Proteins) http://dip.doe-mbi.ucla.edu/dip/Main.cgi [3]
Affymetrix Expression Data: M3D (Many Microbe Microarrays Database)
• Number of Experiments: 466
• Number of Chips: 907
• Genes: 4298
M3D (Many Microbe Microarrays Database) http://m3d.bu.edu/cgi-bin/web/array/index.pl?section=home [4]
Expression Data Selection
• Concerned about complications from adding too many expression conditions
– Knockouts, over-expression, foreign genes
• Selected a group of 20 conditions that were published as part of the same experiment
– Hope for more homogeneous data
– Allen et al. [5]
Method: Gene Expression Distance
• Expression Distance given by:
• Summed over 20 conditions
– Treated the first condition (wild-type anaerobic) as the reference condition for lack of a true control
Expression Distance equation from Deane et al. [2]
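The equation itself did not survive on this slide. As a minimal sketch, assuming the distance is the sum over conditions of the squared difference between the two genes' log expression ratios (each ratio taken against the reference condition), it could be computed as below. The function names, the use of the natural log, and the assumption of raw positive intensities are illustrative choices, not details taken from the slides.

```python
import math


def log_ratios(expression, ref_index=0):
    """Log ratio of every condition relative to the reference condition.

    Assumes raw, strictly positive intensities (an assumption of this sketch).
    """
    ref = expression[ref_index]
    return [math.log(value / ref) for value in expression]


def squared_expression_distance(expr_a, expr_b, ref_index=0):
    """Sum over conditions of the squared difference in log ratios."""
    if len(expr_a) != len(expr_b):
        raise ValueError("profiles must cover the same conditions")
    ratios_a = log_ratios(expr_a, ref_index)
    ratios_b = log_ratios(expr_b, ref_index)
    return sum((a - b) ** 2 for a, b in zip(ratios_a, ratios_b))
```

With the first condition as the reference (ref_index=0), two genes whose profiles differ only by a constant scale factor get a distance of zero, because the reference cancels the scale.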
Method: DIP Data
• DIP labels interactions as ‘core’ or ‘non-core’
– Corresponds roughly to small-scale experiments and high-throughput experiments, respectively
• 3 Interaction Datasets
– Core: core interactions
– Non: non-core interactions
– Rand: 100,000 random interactions were created
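The slides do not say how the random set was drawn. One plausible sketch, assuming each random interaction pairs two distinct proteins chosen uniformly from the protein list, is shown here. The function name, the seed parameter, and the choice to exclude self-pairs are assumptions of this example.

```python
import random


def random_interactions(proteins, n_pairs, seed=0):
    """Draw n_pairs random protein pairs (two distinct proteins per pair).

    Duplicate pairs are allowed; a fixed seed keeps the set reproducible.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        a, b = rng.sample(proteins, 2)  # two different proteins
        pairs.append((a, b))
    return pairs
```

Calling random_interactions(protein_ids, 100000) on a list of identifiers would produce a set the size of the Rand dataset.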
Method: Mapping
• DIP interaction set used UniProt protein identifiers, while M3D used gene IDs
• Ran a BLAST search of protein sequences against the translated E. coli genome to map the datasets together
• Lost most of my data on this step
Dataset   Number of Interactions   Mapping Available
CORE      991                      220
NON       5999                     903
RAND      100,000                  100,000
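The mapping step can be sketched as follows, assuming the BLAST results are available in the standard 12-column tab-separated format (query, subject, percent identity, ..., bit score). The identity threshold, the function names, and the identifiers in the usage note are hypothetical. An interaction is kept only when both partners map, which is why so many are lost.

```python
def best_hits(blast_lines, min_identity=95.0):
    """Map each query protein to its highest-scoring subject gene.

    Expects 12-column tabular BLAST lines; min_identity is an assumed cutoff.
    """
    best = {}
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        identity, bitscore = float(fields[2]), float(fields[11])
        if identity < min_identity:
            continue
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: hit[0] for query, hit in best.items()}


def map_interactions(interactions, protein_to_gene):
    """Keep only interactions whose two proteins both map to a gene ID."""
    return [
        (protein_to_gene[a], protein_to_gene[b])
        for a, b in interactions
        if a in protein_to_gene and b in protein_to_gene
    ]
```

For example, with hypothetical identifiers, a mapping {"P1": "g1", "P2": "g2"} applied to the interactions [("P1", "P2"), ("P1", "P3")] keeps only ("g1", "g2"), since P3 has no mapped gene.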
Results
Figure: Density distribution of squared distances (bin size = 0.05)
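A density distribution of this kind can be built by binning the squared distances and dividing each count by the total number of pairs times the bin width, so that the bars integrate to 1. The sketch below assumes a fixed upper limit (max_value, a hypothetical parameter) and places any larger distance in the last bin.

```python
def density_histogram(values, bin_size, max_value):
    """Return per-bin densities for values in [0, max_value].

    Values above max_value are counted in the last bin (a choice made for
    this sketch). Densities sum to 1 when multiplied by bin_size.
    """
    n_bins = int(round(max_value / bin_size))
    counts = [0] * n_bins
    for value in values:
        index = min(int(value / bin_size), n_bins - 1)
        counts[index] += 1
    total = len(values)
    if total == 0:
        return [0.0] * n_bins
    return [count / (total * bin_size) for count in counts]
```

Normalizing this way lets datasets of very different sizes, such as a few hundred mapped core interactions and 100,000 random pairs, be compared on the same axes.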
Results from Deane et al. [2]
Figure adapted from Deane et al. [2]
Bin size = 1.25
Results from Deane et al. [2]
Figure adapted from Deane et al. [2]
Least Squares Factorization
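The slide gives only the title for this step. Assuming it follows the EPR idea of Deane et al. [2], in which the observed distance distribution is modeled as a mixture of an interacting distribution and a non-interacting distribution, the mixing fraction has a closed-form least-squares solution. The sketch below is an illustration under that assumption, with hypothetical names throughout; it is not a record of what was actually run.

```python
def fit_mixture_fraction(observed, interacting, noninteracting):
    """Least-squares estimate of alpha in
    observed ~ alpha * interacting + (1 - alpha) * noninteracting,
    where each argument is a list of per-bin densities of equal length.
    """
    numerator = 0.0
    denominator = 0.0
    for obs, rho_i, rho_n in zip(observed, interacting, noninteracting):
        diff = rho_i - rho_n
        numerator += (obs - rho_n) * diff
        denominator += diff * diff
    if denominator == 0.0:
        raise ValueError("reference distributions are identical")
    # Clamp to the valid range for a mixing fraction.
    return min(1.0, max(0.0, numerator / denominator))
```

Under this reading, the core set would serve as the interacting reference and the random set as the non-interacting reference, and the fitted fraction summarizes how much of a test dataset resembles true interactions.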
Discussion
• Mapping/Demerging problem
– Kept 1044 of my 6991 interactions (15%)
• Case study P22885
– Obsolete since 2005
– Demerged to P0A8P6 and P0A8P7
– Tyrosine recombinase xerC
– All three have a perfect ClustalW match
Discussion
• Shape of curve
– Multimers and proteins that link to themselves in the PPI network (the zeros problem)
– 66.6% of Core, 43.6% of Non, <0.1% of Rand
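The size of the zeros problem can be checked directly: a pair that maps to the same gene on both sides has identical expression profiles and therefore a distance of exactly zero. A minimal sketch, with hypothetical function names, for measuring both quantities:

```python
def self_pair_fraction(gene_pairs):
    """Fraction of mapped interactions that link a gene to itself."""
    if not gene_pairs:
        return 0.0
    return sum(1 for a, b in gene_pairs if a == b) / len(gene_pairs)


def zero_fraction(squared_distances, tolerance=0.0):
    """Fraction of pairs whose squared distance is at most tolerance."""
    if not squared_distances:
        return 0.0
    zeros = sum(1 for d in squared_distances if d <= tolerance)
    return zeros / len(squared_distances)
```

Comparing the two numbers for a dataset shows how much of the spike in the first bin comes from self-pairs rather than from distinct genes with matching profiles.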
Conclusion
• Cannot predict novel interactions
• Cannot assign confidence values to individual interactions
• Can provide some measure of the overall quality of a PPI dataset
Thank You
• References:
• [1] Deeds EJ, Ashenberg O, Shakhnovich EI. “A simple physical model for scaling in protein-protein interaction networks.” Proc. Natl Acad. Sci. USA (2006) 103:311–316
• [2] Charlotte M. Deane et al. “Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations.” Molecular & Cellular Proteomics 1.5 349-356
• [3] Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D (2004) The Database of Interacting Proteins: 2004 update. NAR 32 Database issue:D449-51
• [4] Faith JJ, Driscoll ME, Fusaro VA, Cosgrove EJ, Hayete B, Juhn FS, Schneider SJ, and Gardner TS. Many Microbe Microarrays Database: uniformly normalized Affymetrix compendia with structured experimental metadata. Nucleic Acids Research
• [5] Timothy E. Allen et al. “Genome-Scale Analysis of the Uses of the Escherichia coli Genome: Model-Driven Analysis of Heterogeneous Data Sets.” J Bacteriol. 2003 November, 185(21): 6392-6399