Looking for cancer clues in publicly accessible databases

Comparative and Functional GenomicsComp Funct Genom 2004; 5: 196–200.Published online in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/cfg.378

Conference Review

Looking for cancer clues in publicly accessibledatabases††

Djamel Medjahed1*, Peter F. Lemkin1, Gary W. Smythers1,2 and David J. Munroe1,2

1National Cancer Institute at Frederick, PO Box B, Frederick, MD 21702-1201, USA2Science Applications International Corporation, Frederick, MD, USA

*Correspondence to:Djamel Medjahed, Laboratory ofMolecular Technology,SAIC–Frederick, National CancerInstitute at Frederick, 915Tollhouse Avenue, Suite 211,Frederick, MD 21701, USA.E-mail: [email protected]

†This article is a US Governmentwork and is in the public domainin the USA.

Received: 10 November 2003Revised: 12 December 2003Accepted: 18 December 2003

AbstractWhat started out as a mere attempt to tentatively identify proteins in experimen-tal cancer-related 2D-PAGE maps developed into VIRTUAL2D, a web-accessiblerepository for theoretical pI/MW charts for 92 organisms. Using publicly availableexpression data, we developed a collection of tissue-specific plots based on differentialgene expression between normal and diseased states. We use this comparative cancerproteomics knowledge base, known as the tissue molecular anatomy project (TMAP),to uncover threads of cancer markers common to several types of cancer and to relatethis information to established biological pathways. Published in 2004 by John Wiley& Sons, Ltd.

Keywords: cancer pathways; TMAP; VIRTUAL2D

Introduction

The recent sequencing and subsequent analysis ofthe human genome have enabled a paradigm shift:increasingly, efforts are being directed away froma microscopy-based histopathology approach andtowards molecular profiling for diagnosis and man-agement of diseases such as cancer. This approachpromises to improve the likelihood of positive out-come by early detection and selection of the appro-priate therapeutic intervention. One way to achievethis is to use computer-aided pattern recognitionalgorithms to look for signatures of markers on thebasis of significant differential expression betweennormal and altered states. This formidable chal-lenge relies on high-resolution separation and anal-ysis methods.

Despite being gradually complemented andsometimes replaced by liquid chromatography(LC) techniques, two-dimensional polyacrylamidegel electrophoresis (2D-PAGE) [9] has enjoyedremarkable staying power. In the span of

three decades, it has evolved from a multi-process, labour-intensive separation technique to anautomated, highly reproducible and sensitive toolthat has often been at the core of efforts aimedat the detection of disease-specific biomarkers.However, large-scale identification of proteinsremains a pricey proposition, as it is typicallycarried out in conjunction with tandem massspectrometry (MS), driving the cost per spot toexceed several hundred Euros. SWISS2D is thelargest publicly accessible repository of such data,yet the number of identified spots represents lessthan a few per cent of the full complementof proteins predicted by the genome sequencingprojects. We constructed two databases that addresssome of these issues, more specifically intended to:

• Facilitate the protein-to-spot assignment inexperimental 2D-PAGE maps.

• Optimize experimental conditions by estimatingthe range of pI and MW attributes in advance.

Published in 2004 by John Wiley & Sons, Ltd.

Comparative cancer proteomics 197

Materials and methods

VIRTUAL2D

VIRTUAL2D [7] is the collection of web-accessible, fully interactive, isoelectric focusingpoint/molecular mass (pI/MW) charts. These mapshave been assembled from primary sequencecontained within the combined SWISS-PROT/TrEMBL curated proteome databases [11]. Startingwith the datasets in FASTA format released by theEuropean Bioinformatics Institute, electrophoreticand mass attributes are computed for unmodifiedproteins (save for signal peptides if they arepresent) using the following approach:

• Scan the primary sequence of the peptide.• Assign the pK of each contributing amino acid

and average over the entire peptide.• Sum up all the mass contributions.

The resulting pI/Mw is then given by the ratio of:

pItot =(

pKCterm +∑

int pKint + pKNterm

)

(n + 2)(1)

andMWtot =

∑i

MWi

where the pK and mass values used are the sameas in (1).

These attributes; a database Accession No.(GenBank, SWISS-PROT), protein name and cgirequests, are assembled into tab-delimited (ASCII-format) files, which are then processed by a JAVA-based graphical user interface adapted from PtPlot[10].

In the course of building these plots, a bi-modal distribution, centred on either side of arelatively ‘depleted’ region around pH 7.4, wasseen to be conserved for all organisms. Randomlygenerated sequences varying in length from 50to 600 amino acids yielded a similar distribution,consistent with a limited pK-segregated proteomicalphabet: roughly half the internal contributingamino acids are acidic, while the other half is basic.Just as important is the fact that none of them havea pK value near the depleted region around pH 7.2.

When launching or accessing VIRTUAL2D, aleft panel is presented that contains a list ofavailable organisms, which, when selected, will

produce an initial pI/MW map containing all theentries found in the data file. One can zoom inon an area of interest and click on any spot to betransported by hyperlink to a database of choice(default is SWISS-PROT).

To date, the pI/MW charts for 92 organismshave been assembled from data extracted from thepublished datasets. The central repository can beaccessed at http://ncisgi.ncifcrf.gov/medjahed/ orcan be requested from the author and run on aJAVA-enabled web browser.

Comparisons of predicted and experimentalcharts have yielded mixed results. For very high-resolution gels of relatively simple organismssuch as Escherichia coli, a subset of proteins forwhich the theoretical values are close to theirmeasured counterparts can be identified and, inprinciple, be used as landmarks to align bothdatasets. The large number of pre- and post-translational modifications characteristic of morecomplex, multicellular organisms makes it nearlyimpossible to assign reliably the protein identity ofmost spots.

TMAP

As an extension to the two-dimensional information,we have explored using the frequency of detectionor abundance of each transcript in cDNA librariespublished in the Cancer Anatomy Genome Project(CGAP) [2] database to develop a set of tissue/his-tology-specific protein expression maps: the TissueMolecular Anatomy Project (TMAP) [6].

CGAP was launched in 1996 to standardizesample handling and procurement and to track themolecular changes occurring in cells throughouttheir progression from the normal to the cancerousstate. This effort was further enhanced by thedevelopment of laser microdissection technology,leading in principle, to purer cell populations.

The correlation between mRNA abundance andprotein expression level is known to be complexand non-linear. The aim here was not to addressthis issue but to simply provide a representationthat could be used to carry out a comparativeanalysis between the different histological states.

The starting point of our data-mining was the listof entries in the CGAP library from which we carrya cross-referencing of the Expressed Sequence Tags(EST) in UNIGENE [12] to extract a tab delimitedlist of gene products, including their frequency

Published in 2004 by John Wiley & Sons, Ltd. Comp Funct Genom 2004; 5: 196–200.

http://ncisgi.ncifcrf.gov/medjahed/

198 D. Medjahed et al.

Figure 1. A screenshot of TMAP rendering an example pI/MW plot of the differential expression of gene products. Theisoelectric focusing point (pI) is along the x axis, the molecular mass (MW) is along the y axis. 2148 entries are commonlydetected in pooled CGAP libraries from prostate cancer and normal (which can also be displayed separately). By draggingthe appropriate sliding controls, one can filter the output by selecting a range of differential expression. Here, optingto look only at those entries that are at least two-fold altered reduces the number of displayed proteins to 1188. Theexpression scale is colour-coded from light green (most downregulated) to bright red (most upregulated). Controls alongthe axes allow one to zoom in on any part of the chart and upon enabling access to a pre-selected web-based databaseof choice, one can click on the hyperlinked spots to open a new window containing all the information relevant to theassociated protein. Additional filters are accessible under the filter menu and from the graphical user interface. The plotand a tab-delimited report can be output for further examination and/or collaborations

of detection within that library. Those transcripts,clustered with genes of known function, have theirsymbol cross-referenced against the SWISS-PROTdatabase. For small sets of proteins, the pI/MWserver [3] can be interrogated to produce isoelectricfocusing point and mass values. For large proteomedatasets containing tens of thousands of entries, aperl script was developed, which is run locally,to overcome any restrictions and inherent Internetspeed bottlenecks. It will extract from the databasea flat file with Accession No., primary aminoacid sequence and other attributes such as putativefunction, pathways, etc. It then computes the massand isoelectric focusing point, as outlined earlier.

Once again, the data file is a simple tab-delimitedformat with the associated expression information.The frequency of detection of each gene productis used to derive normalized expression levels foreach library, so that the most abundant always hasa relative expression level of 1.0. The user canselect a grey-scale or colour-coded display of thisinformation.

Protplot is the software used to display theseexpression maps. It has been adapted from MAEx-plorer, an open-source JAVA-based microarraydata analysis suite [8]. It checks and loads allthe corresponding files having a .prp extensionpresent within the start directory. Any one of


Comparative cancer proteomics 199

these or a combination thereof can be selected fordisplay.

In some cases, a query of the CGAP databaseyields more than one library satisfying the searchcriteria. This is useful in checking the variabilityof a gene product across similar libraries. Onecan then pool the results of some or all of thelibraries corresponding to the same tissue type andincrease the signal : noise ratio. As in any countingexperiment, noise increases as n while true signalincreases as n2.

Several filters can be applied to these inferredprotein expression maps to restrict the number ofdisplayed gene products and monitor their expres-sion profile across several tissues.

A word of caution to potential users: the signif-icance of these comparisons hinges on the qual-ity of the information in the underlying databases,a theme of growing importance in light of theproliferation of biological databases. Although theentire dataset can be displayed, a transcript hasto have been detected five or more times (p <0.05) in order to be deemed reliable enough tobe included in any comparative analysis. Not alllibraries are equally rich in their content and notall the expression data contained within them is ofthe same quality.

Analysis of differential expression can be car-ried out between similar libraries, different dis-ease/histological states and indeed different tissuesby dividing the normalized expression levels com-monly found in both (Figure 1).

cDNA libraries obtained from microdissectedprostate samples contain some of the better qual-ity CGAP datasets. Several of them originatefrom the same patient and span the normal, pre-cancerous and cancerous state. In order to establisha transcript-based model of the progression of thedisease, it helps to classify the plausible scenarios.Given three possible histology states, the expres-sion level of any gene product has to adhere to oneof the nine following cases in the progression ofthe disease:

1. Remain constant from normal to pre-cancerto cancer.

2. Remain constant from normal to pre-cancer andincrease from pre-cancer to cancer.

3. Remain constant from normal to pre-cancer anddecrease from pre-cancer to cancer.

4. Increase from normal to pre-cancer and remainconstant from pre-cancer to cancer.

5. Decrease from normal to pre-cancer and remainconstant from pre-cancer to cancer.

6. Increase from normal to pre-cancer and decreasefrom pre-cancer to cancer.

7. Decrease from normal to pre-cancer andincrease from pre-cancer to cancer.

8. Increase from normal to pre-cancer and increasefrom pre-cancer to cancer.

9. Decrease from normal to pre-cancer anddecrease from pre-cancer to cancer.

This bookkeeping allows one to go beyond thesimple grouping of co-regulated proteins and inves-tigate inverse correlations as well. Relations suchas these are numerous and well documented in theliterature. In the context of cancer, one such exam-ple is p27, a putative tumour-suppressor, which isdownregulated in most human prostate cancers. Inparallel, Skp2, a component of the Skp1–Cul1–F-box protein (SCF) ubiquitin ligase complex, wasobserved to be overexpressed in the same samplesleading to the hypothesis, and subsequent experi-mental confirmation, that degradation of the formeris at least partially due to the latter [5].

To date, this comparative cancer proteomicsapproach has been applied to more than 14 tissuesrepresenting the normal, precancer and cancerhistological states. This database contains morethan 18 000 gene products.

Discussion

We have presented two proteomic databases: VIR-TUAL2D and its extension, TMAP, which attemptto go beyond mere transcript counting by addingfunctional enhancements to the analysis tools, suchas p value filters, library-pooling, etc.

It is critical when interpreting differential geneexpression datasets to use statistically sound anal-ysis tools that take into account their reproducibil-ity and validity. In addition to exploring ways tomodel de novo functional biological relationships,we are in the early stages of exploring ways to mapthe expression data onto established pathways, anexample of which is displayed in Figure 2. It is thehope of the present authors that as the amount andquality of information in databases improves, toolssuch as VIRTUAL2D and TMAP can facilitate theformulation of biological hypotheses.


200 D. Medjahed et al.

HDAC1

Growth Factor Growth Factorwithdrawal

MAPKsignalingpathway

GSK3B

TGF beta

SCF

SKP2SMAD3SMAD4

SCF

RB1 p107SKP2 ABL1

HDAC

ARF

EP300

Mdm2

RB1

DN-PK

p53P06400

e e e

+p +p

+p+p

+p

-p

+u

-p

+u

+p

DNA damage checkpoint

p16,15,18,19 p27, 57 p21+u +u

P38936 P24522Kip1Ink4a-d

e e

R-point(Start)

+p

+p

+p

+p+p

+p+p

+p+p

014+p

+p

+uCDK4CDK6

CycD2CycD3 CycE2 CycA1

CycA2

CDK2

CycE1

CDK2

E2F

DP1

P0640

P005

CCNHCDK2

Cdc61

ORC

Cdc45

MCMRB1

Cdc7Dbt4DNA

DNADNA biosynthesis

O00311

P51946

P24941

P0640 WEE1 Myt1

P21941P21941

CDC25A

CycA1CycB1CycB2CycB3

CDK1

CycA2

CDK1

Apoptosis

ATRATM

Chk1Chk2 096017

060566

O90195

043684

BUB1

Mps1

MAD1L1

MAD2L1MAD2L2

BUB1B

Smc1L2

Esp1

PTTG3

PTTG2PTTG1

APC/C

Cdc20

14-3-3

CDC25CCDC25B P30305

PLK

APC/C

Cdh1

Ubiquitinmediated

proteolysis

Cdc14A

Bub2 MEN

043171

043183

GADD45

PCNA P12004

YWHAG 070457

Condensin

Separin

Securin

S-phase proteins

S G2 MExpression DatasetName: colonGene Color Set: colon_expressionGene Value: SP_IDCOLON EXPRESSION MAPLegendMORE THAN 2-FOLD DOWN-REGULATEDMORE THAN TWO-FOLD UP_REGULATEDNo criteria metNot found

ORC2LORC4LORC6L

ORC1LORC3LORC5L

ORC OriginRecognition Complex

04392

MCM Mini-ChromosomeMaintenance complex

P25205

P33993

P4973

O145

P3399

G1

HDAC2HDAC3HDAC4

Histone Deacetylases Transcription Factor E2F

E2F1E2F2E2F3E2F4E2F5

RBL1E2F6

UBE2FLOC51270

O135

O927

O153

P565

O9UOL

O9UBH

O9HP7

O0071

O1299

O15329

060544

BUB3

Cdc14B

MCM2

MCM6MCM4

HDAC6HDAC7AHDAC8

HDAC5

p46527

MCM7

MCM3MCM5

C1p1

Cell cycle

Figure 2. For the commonly shared gene products, differential expression can be examined by displaying the colour-codedratios between the diseased state and the normal state. Using the GenMAPP package [4], Histone deacetylases found inprostate expression data are colour-coded in the pathway adapted from KEGG and describing the cell cycle according totheir level of overexpression (HDAC6, yellow) or downregulation (HDAC1, 3, purple) in cancer vs. normal

Acknowledgements

The content of this publication does not necessarily reflectthe views or policies of the Department of Health andHuman Services, neither does mention of trade names,commercial products or organization imply endorsementby the US Government. This project has been funded withFederal funds from the National Cancer Institute, NationalInstitutes of Health, under Contract No. NO1-CO-12400.

References

1. Bjellqvist B, Sanchez JC, Pasquali C, et al. 1993. Micro-preparative two-dimensional electrophoresis allowing the sep-aration of samples containing milligram amounts of proteins.Electrophoresis 14: 1375–1378.

2. Cancer Genome Anatomy Project: http://CGAP.nci.nih.gov.

3. Compute pI/Mw tool: http://us.expasy.org/tools/pi tool.html.4. GenMAPP: http://www.genmapp.org.5. Lim MS, Adamson A, Lin Z, et al. 2002. Expression of

Skp2, a p27(Kip1) ubiquitin ligase, in malignant lymphoma:correlation with p27(Kip1) and proliferation index. Blood 100:2950–2956.

6. Medjahed D, Luke BT, Tontesh TS, et al. 2003. TissueMolecular Anatomy Project (TMAP): an expression databasefor comparative cancer proteomics. Proteomics 3: 1445–1453.

7. Medjahed D, Smythers GW, Powell AD, et al. 2003. VIR-TUAL2D: a web-accessible predictive database for proteomicsanalysis. Proteomics 3: 129–138.

8. MicroArray Explorer: https://sourceforge.net/projects/maexplorer.

9. O’Farrell PH. 1975. High resolution two-dimensional elec-trophoresis of proteins. J Biol Chem 250: 4007–4021.

10. PtPlot: http://ptolemy.eecs.berkeley.edu/java/ptplot/.11. SWISSPROT can be accessed at http://www.expasy.ch.12. UniGene database: http://www.ncbi.nlm.nih.gov.


http://CGAP.nci.nih.gov

http://us.expasy.org/tools/pi_tool.html

http://www.genmapp.org

https://sourceforge.net/projects/

http://ptolemy.eecs.berkeley.edu/java/ptplot/

http://www.expasy.ch

http://www.ncbi.nlm.nih.gov

Documents

Looking for cancer clues in publicly accessible databases