Comparative and Functional Genomics
Comp Funct Genom 2004; 5: 196–200.
Published online in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/cfg.378
Conference Review
Looking for cancer clues in publicly accessible databases†
Djamel Medjahed1*, Peter F. Lemkin1, Gary W. Smythers1,2 and David J. Munroe1,2
1National Cancer Institute at Frederick, PO Box B, Frederick, MD 21702-1201, USA
2Science Applications International Corporation, Frederick, MD, USA
*Correspondence to: Djamel Medjahed, Laboratory of Molecular Technology, SAIC–Frederick, National Cancer Institute at Frederick, 915 Tollhouse Avenue, Suite 211, Frederick, MD 21701, USA. E-mail: [email protected]
†This article is a US Government work and is in the public domain in the USA.
Received: 10 November 2003
Revised: 12 December 2003
Accepted: 18 December 2003
Abstract
What started out as a mere attempt to tentatively identify proteins in experimental cancer-related 2D-PAGE maps developed into VIRTUAL2D, a web-accessible repository for theoretical pI/MW charts for 92 organisms. Using publicly available expression data, we developed a collection of tissue-specific plots based on differential gene expression between normal and diseased states. We use this comparative cancer proteomics knowledge base, known as the tissue molecular anatomy project (TMAP), to uncover threads of cancer markers common to several types of cancer and to relate this information to established biological pathways. Published in 2004 by John Wiley & Sons, Ltd.
Keywords: cancer pathways; TMAP; VIRTUAL2D
Introduction
The recent sequencing and subsequent analysis of the human genome have enabled a paradigm shift: increasingly, efforts are being directed away from a microscopy-based histopathology approach and towards molecular profiling for diagnosis and management of diseases such as cancer. This approach promises to improve the likelihood of positive outcome by early detection and selection of the appropriate therapeutic intervention. One way to achieve this is to use computer-aided pattern recognition algorithms to look for signatures of markers on the basis of significant differential expression between normal and altered states. This formidable challenge relies on high-resolution separation and analysis methods.
Despite being gradually complemented and sometimes replaced by liquid chromatography (LC) techniques, two-dimensional polyacrylamide gel electrophoresis (2D-PAGE) [9] has enjoyed remarkable staying power. In the span of three decades, it has evolved from a multi-process, labour-intensive separation technique to an automated, highly reproducible and sensitive tool that has often been at the core of efforts aimed at the detection of disease-specific biomarkers. However, large-scale identification of proteins remains a costly proposition, as it is typically carried out in conjunction with tandem mass spectrometry (MS), driving the cost per spot above several hundred Euros. SWISS2D is the largest publicly accessible repository of such data, yet the number of identified spots represents less than a few per cent of the full complement of proteins predicted by the genome sequencing projects. We constructed two databases that address some of these issues, more specifically intended to:
• Facilitate the protein-to-spot assignment in experimental 2D-PAGE maps.
• Optimize experimental conditions by estimating the range of pI and MW attributes in advance.
Materials and methods
VIRTUAL2D
VIRTUAL2D [7] is a collection of web-accessible, fully interactive, isoelectric focusing point/molecular mass (pI/MW) charts. These maps have been assembled from primary sequence contained within the combined SWISS-PROT/TrEMBL curated proteome databases [11]. Starting with the datasets in FASTA format released by the European Bioinformatics Institute, electrophoretic and mass attributes are computed for unmodified proteins (save for signal peptides if they are present) using the following approach:
• Scan the primary sequence of the peptide.
• Assign the pK of each contributing amino acid and average over the entire peptide.
• Sum up all the mass contributions.
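The resulting pI is then given by the ratio:

$$\mathrm{pI}_{\mathrm{tot}} = \frac{\mathrm{p}K_{\mathrm{Cterm}} + \sum_{\mathrm{int}} \mathrm{p}K_{\mathrm{int}} + \mathrm{p}K_{\mathrm{Nterm}}}{n + 2} \qquad (1)$$

and the molecular mass by the sum:

$$\mathrm{MW}_{\mathrm{tot}} = \sum_{i} \mathrm{MW}_{i}$$

where $n$ is the number of contributing internal amino acids and the pK and mass values used are the same as in [1].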
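The averaging scheme of equation (1) and the mass sum can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the side-chain and terminal pK values below are rough textbook-style placeholders rather than the table used by the authors, the masses are standard average residue masses, and adding one water molecule per chain is an assumption about how the individual mass contributions are defined. The function names are hypothetical.

```python
# Illustrative pK values (ASSUMED placeholders, not the values from the cited table).
SIDE_CHAIN_PK = {"D": 3.9, "E": 4.1, "H": 6.0, "C": 8.3,
                 "Y": 10.1, "K": 10.5, "R": 12.5}
PK_NTERM = 8.0  # assumed N-terminal pK
PK_CTERM = 3.1  # assumed C-terminal pK

# Standard average residue masses in Da (assumed to be the mass contributions).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528  # one water per chain for the free termini (assumption)


def estimate_pi(sequence):
    """Average the pK of the termini and of every contributing internal residue."""
    internal = [SIDE_CHAIN_PK[aa] for aa in sequence if aa in SIDE_CHAIN_PK]
    n = len(internal)  # number of contributing internal residues
    return (PK_CTERM + sum(internal) + PK_NTERM) / (n + 2)


def estimate_mw(sequence):
    """Sum the mass contribution of every residue in the chain."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER
```

With these placeholder values, a sequence without ionizable side chains is assigned the mean of the two terminal pK values, and every additional acidic or basic residue pulls the average towards its own pK.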
These attributes, together with a database Accession No. (GenBank, SWISS-PROT), the protein name and CGI requests, are assembled into tab-delimited (ASCII-format) files, which are then processed by a JAVA-based graphical user interface adapted from PtPlot [10].
In the course of building these plots, a bimodal distribution, centred on either side of a relatively ‘depleted’ region around pH 7.4, was seen to be conserved for all organisms. Randomly generated sequences varying in length from 50 to 600 amino acids yielded a similar distribution, consistent with a limited pK-segregated proteomic alphabet: roughly half the internal contributing amino acids are acidic, while the other half is basic. Just as important is the fact that none of them has a pK value near the depleted region around pH 7.2.
When launching or accessing VIRTUAL2D, a left panel is presented that contains a list of available organisms. Selecting one produces an initial pI/MW map containing all the entries found in the data file. One can zoom in on an area of interest and click on any spot to be transported by hyperlink to a database of choice (default is SWISS-PROT).
To date, the pI/MW charts for 92 organisms have been assembled from data extracted from the published datasets. The central repository can be accessed at http://ncisgi.ncifcrf.gov/medjahed/ or can be requested from the author and run on a JAVA-enabled web browser.
Comparisons of predicted and experimental charts have yielded mixed results. For very high-resolution gels of relatively simple organisms such as Escherichia coli, a subset of proteins for which the theoretical values are close to their measured counterparts can be identified and, in principle, be used as landmarks to align both datasets. The large number of pre- and post-translational modifications characteristic of more complex, multicellular organisms makes it nearly impossible to assign reliably the protein identity of most spots.
TMAP
As an extension to the two-dimensional information, we have explored using the frequency of detection or abundance of each transcript in cDNA libraries published in the Cancer Genome Anatomy Project (CGAP) [2] database to develop a set of tissue/histology-specific protein expression maps: the Tissue Molecular Anatomy Project (TMAP) [6].
CGAP was launched in 1996 to standardize sample handling and procurement and to track the molecular changes occurring in cells throughout their progression from the normal to the cancerous state. This effort was further enhanced by the development of laser microdissection technology, leading, in principle, to purer cell populations.
The correlation between mRNA abundance and protein expression level is known to be complex and non-linear. The aim here was not to address this issue but simply to provide a representation that could be used to carry out a comparative analysis between the different histological states.
Figure 1. A screenshot of TMAP rendering an example pI/MW plot of the differential expression of gene products. The isoelectric focusing point (pI) is along the x axis, the molecular mass (MW) is along the y axis. 2148 entries are commonly detected in pooled CGAP libraries from prostate cancer and normal (which can also be displayed separately). By dragging the appropriate sliding controls, one can filter the output by selecting a range of differential expression. Here, opting to look only at those entries that are at least two-fold altered reduces the number of displayed proteins to 1188. The expression scale is colour-coded from light green (most downregulated) to bright red (most upregulated). Controls along the axes allow one to zoom in on any part of the chart and, upon enabling access to a pre-selected web-based database of choice, one can click on the hyperlinked spots to open a new window containing all the information relevant to the associated protein. Additional filters are accessible under the filter menu and from the graphical user interface. The plot and a tab-delimited report can be output for further examination and/or collaborations.
The starting point of our data-mining was the list of entries in the CGAP library, from which we carried out a cross-referencing of the Expressed Sequence Tags (EST) in UNIGENE [12] to extract a tab-delimited list of gene products, including their frequency of detection within that library. Those transcripts, clustered with genes of known function, have their symbol cross-referenced against the SWISS-PROT database. For small sets of proteins, the pI/MW server [3] can be interrogated to produce isoelectric focusing point and mass values. For large proteome datasets containing tens of thousands of entries, a perl script was developed, which is run locally to overcome any restrictions and inherent Internet speed bottlenecks. It extracts from the database a flat file with Accession No., primary amino acid sequence and other attributes such as putative function, pathways, etc. It then computes the mass and isoelectric focusing point, as outlined earlier.
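The local extraction step described above can be illustrated with a small Python sketch (the authors used a perl script, which is not reproduced here). It reads FASTA-formatted records and emits one tab-delimited row per entry. The header layout, with the accession as the first whitespace-delimited token, the column choice and the function names are all assumptions made for illustration.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def to_tab_delimited(handle):
    """Build 'accession<TAB>length<TAB>sequence' rows (hypothetical layout)."""
    rows = []
    for header, sequence in read_fasta(handle):
        accession = header.split()[0]  # assumes accession is the first token
        rows.append(f"{accession}\t{len(sequence)}\t{sequence}")
    return "\n".join(rows)


# Made-up records, used only to exercise the functions.
example = io.StringIO(">ACC0001 hypothetical protein\nMKT\nAYI\n>ACC0002 another entry\nGG\n")
table = to_tab_delimited(example)
```

In a fuller pipeline the sequence column would be passed to the pI and mass calculation before the row is written out.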
Once again, the data file is a simple tab-delimited format with the associated expression information. The frequency of detection of each gene product is used to derive normalized expression levels for each library, so that the most abundant always has a relative expression level of 1.0. The user can select a grey-scale or colour-coded display of this information.
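The normalization described here amounts to dividing every detection count in a library by the largest count in that library. A minimal sketch follows; the gene names and counts are made up, and the function name is hypothetical.

```python
def normalize_counts(counts):
    """Scale detection counts so the most abundant gene product equals 1.0."""
    if not counts:
        return {}
    top = max(counts.values())
    return {gene: count / top for gene, count in counts.items()}


# Made-up detection frequencies for one library.
library = {"GENE_A": 40, "GENE_B": 10, "GENE_C": 5}
levels = normalize_counts(library)
```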
Protplot is the software used to display these expression maps. It has been adapted from MAExplorer, an open-source JAVA-based microarray data analysis suite [8]. It checks and loads all the corresponding files having a .prp extension present within the start directory. Any one of these or a combination thereof can be selected for display.
In some cases, a query of the CGAP database yields more than one library satisfying the search criteria. This is useful in checking the variability of a gene product across similar libraries. One can then pool the results of some or all of the libraries corresponding to the same tissue type and increase the signal-to-noise ratio. As in any counting experiment, noise increases as $n$ while true signal increases as $n^2$.
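Pooling can be pictured as summing, gene by gene, the detection counts of the libraries selected for the same tissue type. The sketch below assumes each library is a simple mapping from gene symbol to count; the data and the function name are hypothetical.

```python
from collections import Counter


def pool_libraries(libraries):
    """Sum per-gene detection counts across several libraries."""
    pooled = Counter()
    for library in libraries:
        pooled.update(library)  # Counter.update adds counts for mappings
    return dict(pooled)


# Two made-up libraries from the same tissue type.
pooled = pool_libraries([{"GENE_A": 3, "GENE_B": 1},
                         {"GENE_A": 4, "GENE_C": 2}])
```

A gene product seen in only one library is retained with its single-library count, so pooling never discards entries.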
Several filters can be applied to these inferred protein expression maps to restrict the number of displayed gene products and monitor their expression profile across several tissues.
A word of caution to potential users: the significance of these comparisons hinges on the quality of the information in the underlying databases, a theme of growing importance in light of the proliferation of biological databases. Although the entire dataset can be displayed, a transcript has to have been detected five or more times (p < 0.05) in order to be deemed reliable enough to be included in any comparative analysis. Not all libraries are equally rich in their content and not all the expression data contained within them are of the same quality.
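The reliability cut-off mentioned above can be expressed as a simple count filter. The sketch applies only the stated threshold of five detections; it does not reproduce how the associated p value was derived, and the names are hypothetical.

```python
MIN_DETECTIONS = 5  # threshold stated in the text


def reliable_transcripts(counts, min_detections=MIN_DETECTIONS):
    """Keep only transcripts detected at least `min_detections` times."""
    return {gene: n for gene, n in counts.items() if n >= min_detections}


# Made-up detection counts.
kept = reliable_transcripts({"GENE_A": 12, "GENE_B": 5, "GENE_C": 4})
```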
Analysis of differential expression can be carried out between similar libraries, different disease/histological states and indeed different tissues by dividing the normalized expression levels commonly found in both (Figure 1).
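As a rough sketch of this comparison, the ratio of normalized expression levels can be computed for the gene products found in both states and then filtered by fold change, mirroring the two-fold slider setting described for Figure 1. The input values and function names below are hypothetical.

```python
def differential_expression(disease, normal):
    """Ratio disease/normal for gene products present in both states."""
    common = disease.keys() & normal.keys()
    return {gene: disease[gene] / normal[gene]
            for gene in common if normal[gene] > 0}


def at_least_fold(ratios, fold=2.0):
    """Keep entries up- or down-regulated by at least `fold`."""
    return {gene: r for gene, r in ratios.items()
            if r >= fold or r <= 1.0 / fold}


# Made-up normalized expression levels.
disease = {"GENE_A": 1.0, "GENE_B": 0.2, "GENE_C": 0.5}
normal = {"GENE_A": 0.25, "GENE_B": 0.8, "GENE_C": 0.5, "GENE_D": 1.0}
ratios = differential_expression(disease, normal)
altered = at_least_fold(ratios)
```

Gene products detected in only one of the two states are left out of the ratio, since the division is defined only for commonly found entries.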
cDNA libraries obtained from microdissected prostate samples contain some of the better quality CGAP datasets. Several of them originate from the same patient and span the normal, pre-cancerous and cancerous state. In order to establish a transcript-based model of the progression of the disease, it helps to classify the plausible scenarios. Given three possible histology states, the expression level of any gene product has to adhere to one of the following nine cases in the progression of the disease:
1. Remain constant from normal to pre-cancer to cancer.
2. Remain constant from normal to pre-cancer and increase from pre-cancer to cancer.
3. Remain constant from normal to pre-cancer and decrease from pre-cancer to cancer.
4. Increase from normal to pre-cancer and remain constant from pre-cancer to cancer.
5. Decrease from normal to pre-cancer and remain constant from pre-cancer to cancer.
6. Increase from normal to pre-cancer and decrease from pre-cancer to cancer.
7. Decrease from normal to pre-cancer and increase from pre-cancer to cancer.
8. Increase from normal to pre-cancer and increase from pre-cancer to cancer.
9. Decrease from normal to pre-cancer and decrease from pre-cancer to cancer.
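The nine cases listed above follow from combining three possible trends (constant, increase, decrease) over two transitions. A minimal sketch of this bookkeeping is given below. The relative tolerance used to decide that a level has remained constant is an assumption introduced for the example, as are the function names; the case numbers follow the list.

```python
def trend(before, after, tol=0.1):
    """Label a transition; `tol` is an assumed relative tolerance for 'constant'."""
    if abs(after - before) <= tol * max(before, after):
        return "constant"
    return "increase" if after > before else "decrease"


# (normal -> pre-cancer, pre-cancer -> cancer) mapped to the case number in the list.
CASES = {
    ("constant", "constant"): 1,
    ("constant", "increase"): 2,
    ("constant", "decrease"): 3,
    ("increase", "constant"): 4,
    ("decrease", "constant"): 5,
    ("increase", "decrease"): 6,
    ("decrease", "increase"): 7,
    ("increase", "increase"): 8,
    ("decrease", "decrease"): 9,
}


def classify(normal, precancer, cancer, tol=0.1):
    """Return the case number for one gene product across the three states."""
    return CASES[(trend(normal, precancer, tol), trend(precancer, cancer, tol))]
```

Grouping gene products by case number makes it straightforward to pair, for example, entries in case 8 with entries in case 9 when looking for inverse correlations.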
This bookkeeping allows one to go beyond the simple grouping of co-regulated proteins and investigate inverse correlations as well. Relations such as these are numerous and well documented in the literature. In the context of cancer, one such example is p27, a putative tumour-suppressor, which is downregulated in most human prostate cancers. In parallel, Skp2, a component of the Skp1–Cul1–F-box protein (SCF) ubiquitin ligase complex, was observed to be overexpressed in the same samples, leading to the hypothesis, and subsequent experimental confirmation, that degradation of the former is at least partially due to the latter [5].
To date, this comparative cancer proteomics approach has been applied to more than 14 tissues representing the normal, precancer and cancer histological states. This database contains more than 18 000 gene products.
Discussion
We have presented two proteomic databases: VIRTUAL2D and its extension, TMAP, which attempt to go beyond mere transcript counting by adding functional enhancements to the analysis tools, such as p value filters, library-pooling, etc.
It is critical when interpreting differential gene expression datasets to use statistically sound analysis tools that take into account their reproducibility and validity. In addition to exploring ways to model de novo functional biological relationships, we are in the early stages of exploring ways to map the expression data onto established pathways, an example of which is displayed in Figure 2. It is the hope of the present authors that as the amount and quality of information in databases improves, tools such as VIRTUAL2D and TMAP can facilitate the formulation of biological hypotheses.
[Figure 2 image: cell cycle pathway diagram (G1, S, G2 and M phases) with gene products colour-coded by expression. Legend: more than two-fold down-regulated; more than two-fold up-regulated; no criteria met; not found. Expression dataset name: colon; gene colour set: colon_expression; gene value: SP_ID.]
Figure 2. For the commonly shared gene products, differential expression can be examined by displaying the colour-coded ratios between the diseased state and the normal state. Using the GenMAPP package [4], histone deacetylases found in prostate expression data are colour-coded in the pathway, adapted from KEGG and describing the cell cycle, according to their level of overexpression (HDAC6, yellow) or downregulation (HDAC1, 3, purple) in cancer vs. normal.
Acknowledgements
The content of this publication does not necessarily reflect the views or policies of the Department of Health and Human Services, nor does mention of trade names, commercial products or organizations imply endorsement by the US Government. This project has been funded with Federal funds from the National Cancer Institute, National Institutes of Health, under Contract No. NO1-CO-12400.
References
1. Bjellqvist B, Sanchez JC, Pasquali C, et al. 1993. Micropreparative two-dimensional electrophoresis allowing the separation of samples containing milligram amounts of proteins. Electrophoresis 14: 1375–1378.
2. Cancer Genome Anatomy Project: http://CGAP.nci.nih.gov.
3. Compute pI/Mw tool: http://us.expasy.org/tools/pi_tool.html.
4. GenMAPP: http://www.genmapp.org.
5. Lim MS, Adamson A, Lin Z, et al. 2002. Expression of Skp2, a p27(Kip1) ubiquitin ligase, in malignant lymphoma: correlation with p27(Kip1) and proliferation index. Blood 100: 2950–2956.
6. Medjahed D, Luke BT, Tontesh TS, et al. 2003. Tissue Molecular Anatomy Project (TMAP): an expression database for comparative cancer proteomics. Proteomics 3: 1445–1453.
7. Medjahed D, Smythers GW, Powell AD, et al. 2003. VIRTUAL2D: a web-accessible predictive database for proteomics analysis. Proteomics 3: 129–138.
8. MicroArray Explorer: https://sourceforge.net/projects/maexplorer.
9. O’Farrell PH. 1975. High resolution two-dimensional electrophoresis of proteins. J Biol Chem 250: 4007–4021.
10. PtPlot: http://ptolemy.eecs.berkeley.edu/java/ptplot/.
11. SWISSPROT can be accessed at http://www.expasy.ch.
12. UniGene database: http://www.ncbi.nlm.nih.gov.