Comparative and Functional GenomicsComp Funct Genom 2004; 5: 196200.Published online in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/cfg.378
Looking for cancer clues in publicly accessibledatabases
Djamel Medjahed1*, Peter F. Lemkin1, Gary W. Smythers1,2 and David J. Munroe1,21National Cancer Institute at Frederick, PO Box B, Frederick, MD 21702-1201, USA2Science Applications International Corporation, Frederick, MD, USA
*Correspondence to:Djamel Medjahed, Laboratory ofMolecular Technology,SAICFrederick, National CancerInstitute at Frederick, 915Tollhouse Avenue, Suite 211,Frederick, MD 21701, USA.E-mail: firstname.lastname@example.org
This article is a US Governmentwork and is in the public domainin the USA.
Received: 10 November 2003Revised: 12 December 2003Accepted: 18 December 2003
AbstractWhat started out as a mere attempt to tentatively identify proteins in experimen-tal cancer-related 2D-PAGE maps developed into VIRTUAL2D, a web-accessiblerepository for theoretical pI/MW charts for 92 organisms. Using publicly availableexpression data, we developed a collection of tissue-specific plots based on differentialgene expression between normal and diseased states. We use this comparative cancerproteomics knowledge base, known as the tissue molecular anatomy project (TMAP),to uncover threads of cancer markers common to several types of cancer and to relatethis information to established biological pathways. Published in 2004 by John Wiley& Sons, Ltd.
Keywords: cancer pathways; TMAP; VIRTUAL2D
The recent sequencing and subsequent analysis ofthe human genome have enabled a paradigm shift:increasingly, efforts are being directed away froma microscopy-based histopathology approach andtowards molecular profiling for diagnosis and man-agement of diseases such as cancer. This approachpromises to improve the likelihood of positive out-come by early detection and selection of the appro-priate therapeutic intervention. One way to achievethis is to use computer-aided pattern recognitionalgorithms to look for signatures of markers on thebasis of significant differential expression betweennormal and altered states. This formidable chal-lenge relies on high-resolution separation and anal-ysis methods.
Despite being gradually complemented andsometimes replaced by liquid chromatography(LC) techniques, two-dimensional polyacrylamidegel electrophoresis (2D-PAGE)  has enjoyedremarkable staying power. In the span of
three decades, it has evolved from a multi-process, labour-intensive separation technique to anautomated, highly reproducible and sensitive toolthat has often been at the core of efforts aimedat the detection of disease-specific biomarkers.However, large-scale identification of proteinsremains a pricey proposition, as it is typicallycarried out in conjunction with tandem massspectrometry (MS), driving the cost per spot toexceed several hundred Euros. SWISS2D is thelargest publicly accessible repository of such data,yet the number of identified spots represents lessthan a few per cent of the full complementof proteins predicted by the genome sequencingprojects. We constructed two databases that addresssome of these issues, more specifically intended to:
Facilitate the protein-to-spot assignment inexperimental 2D-PAGE maps.
Optimize experimental conditions by estimatingthe range of pI and MW attributes in advance.
Published in 2004 by John Wiley & Sons, Ltd.
Comparative cancer proteomics 197
Materials and methods
VIRTUAL2D  is the collection of web-accessible, fully interactive, isoelectric focusingpoint/molecular mass (pI/MW) charts. These mapshave been assembled from primary sequencecontained within the combined SWISS-PROT/TrEMBL curated proteome databases . Startingwith the datasets in FASTA format released by theEuropean Bioinformatics Institute, electrophoreticand mass attributes are computed for unmodifiedproteins (save for signal peptides if they arepresent) using the following approach: Scan the primary sequence of the peptide. Assign the pK of each contributing amino acid
and average over the entire peptide. Sum up all the mass contributions.The resulting pI/Mw is then given by the ratio of:
int pKint + pKNterm)
(n + 2) (1)
where the pK and mass values used are the sameas in (1).
These attributes; a database Accession No.(GenBank, SWISS-PROT), protein name and cgirequests, are assembled into tab-delimited (ASCII-format) files, which are then processed by a JAVA-based graphical user interface adapted from PtPlot.
In the course of building these plots, a bi-modal distribution, centred on either side of arelatively depleted region around pH 7.4, wasseen to be conserved for all organisms. Randomlygenerated sequences varying in length from 50to 600 amino acids yielded a similar distribution,consistent with a limited pK-segregated proteomicalphabet: roughly half the internal contributingamino acids are acidic, while the other half is basic.Just as important is the fact that none of them havea pK value near the depleted region around pH 7.2.
When launching or accessing VIRTUAL2D, aleft panel is presented that contains a list ofavailable organisms, which, when selected, will
produce an initial pI/MW map containing all theentries found in the data file. One can zoom inon an area of interest and click on any spot to betransported by hyperlink to a database of choice(default is SWISS-PROT).
To date, the pI/MW charts for 92 organismshave been assembled from data extracted from thepublished datasets. The central repository can beaccessed at http://ncisgi.ncifcrf.gov/medjahed/ orcan be requested from the author and run on aJAVA-enabled web browser.
Comparisons of predicted and experimentalcharts have yielded mixed results. For very high-resolution gels of relatively simple organismssuch as Escherichia coli, a subset of proteins forwhich the theoretical values are close to theirmeasured counterparts can be identified and, inprinciple, be used as landmarks to align bothdatasets. The large number of pre- and post-translational modifications characteristic of morecomplex, multicellular organisms makes it nearlyimpossible to assign reliably the protein identity ofmost spots.
As an extension to the two-dimensional information,we have explored using the frequency of detectionor abundance of each transcript in cDNA librariespublished in the Cancer Anatomy Genome Project(CGAP)  database to develop a set of tissue/his-tology-specific protein expression maps: the TissueMolecular Anatomy Project (TMAP) .
CGAP was launched in 1996 to standardizesample handling and procurement and to track themolecular changes occurring in cells throughouttheir progression from the normal to the cancerousstate. This effort was further enhanced by thedevelopment of laser microdissection technology,leading in principle, to purer cell populations.
The correlation between mRNA abundance andprotein expression level is known to be complexand non-linear. The aim here was not to addressthis issue but to simply provide a representationthat could be used to carry out a comparativeanalysis between the different histological states.
The starting point of our data-mining was the listof entries in the CGAP library from which we carrya cross-referencing of the Expressed Sequence Tags(EST) in UNIGENE  to extract a tab delimitedlist of gene products, including their frequency
Published in 2004 by John Wiley & Sons, Ltd. Comp Funct Genom 2004; 5: 196200.
198 D. Medjahed et al.
Figure 1. A screenshot of TMAP rendering an example pI/MW plot of the differential expression of gene products. Theisoelectric focusing point (pI) is along the x axis, the molecular mass (MW) is along the y axis. 2148 entries are commonlydetected in pooled CGAP libraries from prostate cancer and normal (which can also be displayed separately). By draggingthe appropriate sliding controls, one can filter the output by selecting a range of differential expression. Here, optingto look only at those entries that are at least two-fold altered reduces the number of displayed proteins to 1188. Theexpression scale is colour-coded from light green (most downregulated) to bright red (most upregulated). Controls alongthe axes allow one to zoom in on any part of the chart and upon enabling access to a pre-selected web-based databaseof choice, one can click on the hyperlinked spots to open a new window containing all the information relevant to theassociated protein. Additional filters are accessible under the filter menu and from the graphical user interface. The plotand a tab-delimited report can be output for further examination and/or collaborations
of detection within that library. Those transcripts,clustered with genes of known function, have theirsymbol cross-referenced against the SWISS-PROTdatabase. For small sets of proteins, the pI/MWserver  can be interrogated to produce isoelectricfocusing point and mass values. For large proteomedatasets containing tens of thousands of entries, aperl script was developed, which is run locally,to overcome any restrictions and inherent Internetspeed bottlenecks. It will extract from the databasea flat file with Accession No., primary aminoacid sequence and other attributes such as putativefunction, pathways, etc. It then computes the massand isoelectric focusing point, as outlined earlier.
Once again, the data file is a simple tab-delimitedformat with the associated expression information.The frequency of detection of each gene productis used to derive normalized expression levels foreach library, so that the most abundant always hasa relative expression level of 1.0. The user canselect a grey-scale or colour-coded display of thisinformation.
Protplot is the software used to display theseexpression maps. It has been adapted from MAEx-plorer, an open-source JAVA-based microarraydata analysis suite . It checks and loads allthe corresponding files having a .prp extensionpresent within the start directory. Any one of
Published in 2004 by John Wiley & Sons, Ltd. Comp Funct Genom 2004; 5: 196200.
Comparative cancer proteomics 199
these or a combination thereof can be selected fordisplay.
In some cases, a query of the CGAP databaseyields more than one library satisfying the searchcriteria. This is useful in checking the variabilityof a gene product across similar libraries. Onecan then pool the results of some or all of thelibraries corresponding to the same tissue type andincrease the signal : noise ratio. As in any countingexperiment, noise increases as n while true signalincreases as n2.
Several filters can be applied to these inferredprotein expression maps to restrict the number ofdisplayed gene products and monitor their expres-sion profile across several tissues.
A word of caution to potential users: the signif-icance of these comparisons hinges on the qual-ity of the information in the underlying databases,a theme of growing importance in light of theproliferation of biological databases. Although theentire dataset can be displayed, a transcript hasto have been detected five or more times (p