

Automated analysis of immunohistochemistry images identifies candidate location biomarkers for cancers

Aparna Kumar a, Arvind Rao a,1, Santosh Bhavani a, Justin Y. Newberg b,c,2, and Robert F. Murphy a,b,c,d,e,3

a Lane Center for Computational Biology, b Center for Bioimage Informatics, c Department of Biomedical Engineering, d Department of Biological Sciences, and e Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA 15213

Edited by Joe W. Gray, Oregon Health and Science University, Portland, OR, and accepted by the Editorial Board November 17, 2014 (received for review August 8, 2014)

Molecular biomarkers are changes measured in biological samples that reflect disease states. Such markers can help clinicians identify types of cancer or stages of progression, and they can guide in tailoring specific therapies. Many efforts to identify biomarkers consider genes that mutate between normal and cancerous tissues or changes in protein or RNA expression levels. Here we define location biomarkers, proteins that undergo changes in subcellular location that are indicative of disease. To discover such biomarkers, we have developed an automated pipeline to compare the subcellular location of proteins between two sets of immunohistochemistry images. We used the pipeline to compare images of healthy and tumor tissue from the Human Protein Atlas, ranking hundreds of proteins in breast, liver, prostate, and bladder based on how much their location was estimated to have changed. The performance of the system was evaluated by determining whether proteins previously known to change location in tumors were ranked highly. We present a number of candidate location biomarkers for each tissue, and identify biochemical pathways that are enriched in proteins that change location. The analysis technology is anticipated to be useful not only for discovering new location biomarkers but also for enabling automated analysis of biomarker distributions as an aid to determining diagnosis.

biomarkers | protein subcellular location | image analysis | digital pathology | cancer

Our understanding of the number and types of changes that occur in various cancers is continuously growing. Previous work to discover proteins that vary significantly between normal and cancer cells has used techniques such as microarray profiling, next-generation sequencing, antibody arrays, and proteomic profiling (1–4). These studies have led to the discovery of proteins (termed expression biomarkers) whose expression levels mark different disease states. However, for some proteins, the extent of localization in the nucleus can be used to predict patient prognosis; β-catenin (5) and NF-κB (6) are examples. The discovery of more proteins that undergo oncogenesis-associated changes in subcellular location (which we term location biomarkers) could potentially improve disease diagnosis in conjunction with traditional protein expression markers. Further, discovering proteins that relocate in the disease state may give new insight into changes driving disease, and such changes would go undetected by measuring only expression.

Immunohistochemistry (IHC) studies are a major source of data on protein expression and location. Most such studies use visual examination to assess changes, a difficult and time-consuming task. With the advent of high-throughput acquisition technologies like tissue microarrays and automated slide scanners, computerized analysis of tissue images is highly desirable, and studies have shown that quantitative software can detect changes in disease states that are missed by visual inspection (7). Methods for analyzing changes in expression and pattern are well established in cultured cells (8), but histological images are typically more difficult to analyze because cellular heterogeneity and the closely packed organization of cells lead to significant cell segmentation challenges. Several projects have been initiated to build workflows that process IHC images (9, 10). Most of this work has been focused on quantifying differences in protein abundance between normal and cancer tissue. However, as discussed earlier, differences in subcellular protein locations could also be critical for understanding and diagnosing disease. Thus, there is a strong need for systems that can analyze protein subcellular location in IHC images.

We have previously described an automated system for recognizing major subcellular patterns in IHC images (11), and presented preliminary results from the use of that system to identify proteins that change location in various cancers (12). These studies used a subset of the extensive collection of IHC images in the Human Protein Atlas (HPA) (13). However, we have found that the performance on a larger collection of proteins with more pattern variation was significantly lower compared with the 16 marker proteins used in our previous study. We therefore sought to develop a system that can identify potential location biomarkers by using new approaches without explicit classification. By using images from the HPA, we show that our system can identify proteins with altered subcellular location directly from tissue images and anticipate that approaches such as this may significantly contribute to diagnosis, treatment, and monitoring of cancers.

Significance

Changes in the expression of proteins are often associated with oncogenesis, and are frequently used as cancer biomarkers. Changes in the subcellular location of proteins have been less frequently investigated. In this paper, we describe a robust pipeline for identifying those proteins whose subcellular location undergoes statistically significant changes in cancers of four tissues, and also for identifying biochemical pathways that are enriched for proteins that translocate. Future investigation of these proteins and pathways may provide new insight into oncogenesis. Further, the analysis pipeline is expected to be useful for assessing disease type and severity in a clinical setting.

Author contributions: A.K., A.R., and R.F.M. designed research; A.K., A.R., S.B., J.Y.N., and R.F.M. performed research; A.K., A.R., and R.F.M. analyzed data; and A.K., A.R., and R.F.M. wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission. J.W.G. is a guest editor invited by the Editorial Board.

1 Present address: Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, Houston, TX 77030.

2 Present address: Cancer Research Program, Houston Methodist Research Institute, Houston, TX 77030.

3 To whom correspondence should be addressed. Email: [email protected].

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1415120112/-/DCSupplemental.

www.pnas.org/cgi/doi/10.1073/pnas.1415120112 PNAS | December 23, 2014 | vol. 111 | no. 51 | 18249–18254

BIOPHYSICS AND COMPUTATIONAL BIOLOGY

Results

Our analysis pipeline (Fig. 1) consists of five steps.

i) Selecting a set of proteins for analysis guided by staining levels. For a given tissue, we selected antibodies from the HPA whose staining intensity was annotated as moderate or strong, and whose staining quantity was annotated as greater than 75%. As a result of tissue-specific expression and variations in staining, the proteins identified (referred to as the analysis set) were different for each tissue.

ii) Separating the DNA and protein components of each image by unmixing the hematoxylin and diaminobenzidine stains. The HPA images were collected as red, green, and blue (RGB) images in which the two stains appear as purple and brown, respectively. The intensity derived from each stain is therefore a combination of the intensities from the three RGB channels. We unmixed the spectra to give separate images reflecting mainly DNA and protein content (11).

iii) Selecting regions of each image with the highest protein expression, under the assumption that the highest stained regions would be less likely to contain connective tissue, stroma, and other noncellular regions.

iv) Calculating features to describe the location patterns in each region (11).

v) First, estimating the probability that a given protein’s location pattern differs between the two conditions. The nonparametric Friedman–Rafsky (FR) test was used to calculate a P value for the null hypothesis that the sets of regions from normal images and from cancer images show the same pattern. Second, estimating the probability that a given protein’s level of expression differs between the two conditions. Expression P values were calculated by using the Wald–Wolfowitz method to test the null hypothesis that the level of expression in the regions from the normal and cancer images came from the same distribution. Calculations for 35 random samplings of images were averaged, giving very high repeatability of the results (Materials and Methods). Finally, calculating a classification accuracy for separating normal and cancer images by using protein location information.
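Step ii can be illustrated with a minimal per-pixel sketch of linear stain unmixing. It assumes a Beer–Lambert optical density model and fits the two stain concentrations by least squares. The stain vectors below are illustrative values of the kind commonly used for hematoxylin and diaminobenzidine color deconvolution; they are an assumption of this sketch, not the spectra used in ref. 11, and the function name `unmix_pixel` is hypothetical.

```python
import math

# Assumed optical-density vectors (R, G, B) for each stain; illustrative only.
HEMATOXYLIN = (0.65, 0.70, 0.29)
DAB = (0.27, 0.57, 0.78)


def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def unmix_pixel(rgb, stain_a=HEMATOXYLIN, stain_b=DAB, white=255.0):
    """Return least-squares concentrations (c_a, c_b) of two stains in one pixel.

    Model: optical density OD = -log10(I / white) = c_a * stain_a + c_b * stain_b.
    """
    # Clamp intensities to at least 1 to avoid log10(0) on saturated pixels.
    od = [-math.log10(max(c, 1.0) / white) for c in rgb]
    # Solve the 2x2 normal equations of the least-squares fit in closed form.
    aa, ab, bb = dot(stain_a, stain_a), dot(stain_a, stain_b), dot(stain_b, stain_b)
    ya, yb = dot(stain_a, od), dot(stain_b, od)
    det = aa * bb - ab * ab
    c_a = (ya * bb - yb * ab) / det
    c_b = (yb * aa - ya * ab) / det
    return c_a, c_b
```

Applying `unmix_pixel` to every pixel would give one image dominated by DNA (hematoxylin) and one by protein (diaminobenzidine) signal; a production pipeline would vectorize this and estimate the stain vectors from the data.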

We applied this pipeline to images from the HPA for four tissues: breast, liver, prostate, and bladder (the results are contained in Dataset S1). After running the pipeline for each tissue, the proteins were sorted by their location P values to obtain a ranking by extent of subcellular location change. Representative images of the top three hits for each tissue are shown in Fig. S1.
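The location P values used for this ranking come from the FR test of step v. The idea behind that test can be sketched as follows: pool the feature vectors of both conditions, build a minimum spanning tree over them, and count the tree edges that join a normal region to a cancer region. Few such edges suggest that the two sets occupy different parts of feature space. The sketch below estimates the P value by label permutation, which is an assumption made here for simplicity; it does not reproduce the authors' implementation or their averaging over 35 random samplings.

```python
import math
import random


def mst_edges(points):
    """Return the edges (i, j) of a Euclidean minimum spanning tree (Prim's algorithm)."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known connection cost to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return edges


def fr_test(sample_a, sample_b, n_perm=999, seed=0):
    """Return (cross_edges, p_value) for a permutation version of the FR test."""
    pooled = list(sample_a) + list(sample_b)
    labels = [0] * len(sample_a) + [1] * len(sample_b)
    edges = mst_edges(pooled)

    def cross(lab):
        # Number of MST edges whose endpoints carry different labels.
        return sum(lab[i] != lab[j] for i, j in edges)

    observed = cross(labels)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        perm = labels[:]
        rng.shuffle(perm)
        if cross(perm) <= observed:
            hits += 1
    # Add-one correction keeps the estimated P value strictly positive.
    return observed, (hits + 1) / (n_perm + 1)
```

Here each element of `sample_a` and `sample_b` would be the feature vector of one image region; a small P value indicates that normal and cancer regions rarely neighbor each other in the tree.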

Testing Using Known Location Biomarkers. We expected that proteins known to change location in cancer would be ranked high on this list. To test this, we constructed validation sets by using pathologists’ annotations of the gross subcellular location provided in HPA: (i) nuclear, (ii) cytoplasm and plasma membrane, (iii) nuclear, cytoplasm, and plasma membrane, and (iv) none. The validation set for a given tissue consisted of those proteins from the analysis set for that tissue that had different location annotations between the normal and cancer images (Materials and Methods). Treating these as true positives, we constructed receiver operating characteristic (ROC) curves in which a threshold on the P value at which a protein was considered positive was varied (Fig. S2). In this case, the area under the curve (AUC) is a measure of how well our test finds the true positives. If the validation markers were the only proteins expected to change location, and if the system performed perfectly, the AUC values should be 1. However, we expect some of the proteins ranked highly by P value may be actual location biomarkers even if they are not in the validation set. For example, proteins may undergo a change in location that was not captured by the gross location annotations used to define true positives. Thus, we do not expect even a very good discovery system to give values near 1. The AUC values for breast, liver, prostate, and bladder were 0.67, 0.59, 0.67, and 0.68, respectively. These are all significantly greater than 0.5, the AUC expected for random performance.
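The AUC obtained by sweeping a threshold over the P values can be computed without drawing the curve: it equals the probability that a randomly chosen validation protein has a smaller P value than a randomly chosen non-validation protein, with ties counted as one half. A minimal sketch follows; the function name `roc_auc` and the input layout are illustrative assumptions.

```python
def roc_auc(p_values, is_positive):
    """AUC for ranking by ascending P value (smaller P value = more likely positive)."""
    pos = [p for p, y in zip(p_values, is_positive) if y]
    neg = [p for p, y in zip(p_values, is_positive) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative protein")
    wins = 0.0
    for pp in pos:
        for pn in neg:
            if pp < pn:
                wins += 1.0      # positive ranked ahead of negative
            elif pp == pn:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))
```

A value of 0.5 corresponds to random ranking and 1 to a ranking in which every validation protein precedes every other protein.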

Fig. 1. Overview of the location biomarker discovery pipeline. Images with strong or moderate antibody staining were selected. Linear unmixing was used to separate each image into two composite images representing the DNA and protein stains as previously described (11). Regions were selected by convolving the protein image with a low-pass filter and selecting the highest points as region centers. Fifty-seven numerical features were calculated to describe the pattern in each region. The nonparametric FR test was used to calculate a P value and determine whether the null hypothesis, that the features from the normal and cancer image come from the same distribution, should be rejected. The nonparametric Wald–Wolfowitz test was used to calculate a P value to measure how likely the two sets of images are to come from the same expression distribution. A nearest neighbor classifier was also used to determine the ability of each antibody to distinguish normal and cancer images.

Distinguishing Location and Expression Changes. The features we used are designed to minimize the effect of differences in protein staining level. Even so, a major change in expression may cause a change in image texture that would be detected by our features even if subcellular location remains the same. This may cause proteins that do not change their location significantly but do change their expression dramatically to rank highly on our lists. We therefore used the expression P values and location P values together to analyze each protein’s change.

Fig. 2 shows the relationship between the expression change and location change for proteins in various tissues. The first conclusion we can draw is that the two values are not correlated, suggesting that proteins that change location do not always change expression, and vice versa. Second, the points in the upper left corner of each scatter plot in Fig. 2 represent proteins that have significantly changed location (low P values) but have not changed expression (high P values). The color of each point indicates how well that protein can be used to train a classifier to distinguish images from normal and cancerous tissue (Materials and Methods; the accuracy values are listed in Dataset S1). Thus, we expect proteins whose symbols are dark red and in the upper left corner to be potential biomarkers useful in a clinical setting for recognizing cancerous tissue by measuring differences in subcellular location. These proteins would not have been identified as potential markers by measuring expression changes alone. Dataset S1 is ranked for each tissue using the Euclidean distance from the upper left corner, that is, proteins that change location and do not change expression. The five top-ranked proteins for each tissue using this criterion are shown in Table 1, and images of the top three from each tissue are shown in Fig. 3.

Of course, we expected that classic biomarkers that are known to translocate in cancer, such as E-cadherin, β-catenin, and NF-κB, would be ranked highly in this list. These proteins were not part of our analysis sets because the HPA did not contain a sufficient number of images to meet the threshold of our pipeline. We therefore separately calculated location P values for those proteins by using the images that were available for breast and prostate cancers. The P values for two E-cadherin antibodies with high reliability, CAB000087 and HPA004812, were higher than 0.20. The P values for three β-catenin antibodies, CAB000108, HPA029159, and HPA029160, were higher than 0.32. The two antibodies against NF-κB in prostate cancer are CAB004031 and HPA027305, with P values greater than 0.22. Thus, our tests indicate that none of these are strong location biomarkers in these tissues, contrary to expectation based on previous literature reports. In addition, on visual inspection of the HPA images, we did not observe a pattern change; the pathologist annotations also did not indicate a location change. All the antibodies for these three proteins had identical location annotations in the two disease states with the exception of HPA004812, which moved from nuclear and cytoplasmic and membranous to mostly cytoplasmic membranous in the cancer state. The basis for this difference in this dataset from previous results is unclear.

Given that we were evaluating a large number of antibodies, we were also interested in estimating the generalizability of the relative performance of the antibodies. In other words, how likely is it that proteins with low P values or high classification accuracies would show similar values in future experiments? To do this, we calculated P values and accuracies for a smaller number of images and compared them with the values with the larger number (SI Text; note that this is different from the repeatability of the rankings using the same number of images). The results shown in Fig. S3 indicate a high correlation in the two estimates, indicating that the generalizability of our results to future experiments should be high.
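The ranking criterion used for Dataset S1 and Table 1 (Euclidean distance from the upper left corner of Fig. 2) can be written down directly. In the sketch below the corner is taken to be location P value 0 and expression P value 1, which follows from the axes of Fig. 2; the three example rows are values copied from Table 1 and are used only to illustrate the ordering.

```python
import math


def corner_distance(location_p, expression_p):
    """Distance from the ideal point (location P = 0, expression P = 1)."""
    return math.hypot(location_p, 1.0 - expression_p)


# (gene, location P value, expression P value), taken from Table 1 for illustration.
rows = [
    ("TARS2", 0.25, 0.96),
    ("RASGRF2", 0.14, 0.97),
    ("SLC30A9", 0.15, 0.96),
]

# Smaller distance = stronger location change with little expression change.
ranked = sorted(rows, key=lambda r: corner_distance(r[1], r[2]))
```

Because the distance weighs both axes equally, a protein with a slightly larger location P value can still rank ahead of another if its expression P value is closer to 1.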

Features Reveal Visually Distinguishable Changes Useful for Distinguishing Tumor From Healthy Tissue. To determine whether the differences in location being identified by our pipeline were visually distinguishable changes, we performed hierarchical clustering and optimal leaf ordering to order the regions for a given antibody using our features (Materials and Methods). Fig. 4 shows 10 representative regions ordered for two example proteins from the top-ranked proteins. In breast tissue stained for SLC36A4, the normal regions clustered near each other on the left. Further, the regions appear ordered by increasing nuclear localization, suggesting our features can detect incremental and possibly continuous changes in this location pattern. In liver, GSTZ1 showed a decrease in nuclear localization from left to right, and also an increase in cytoplasmic graininess. The clustering grouped the normal and cancer regions separately.

Location Biomarkers Can Distinguish Between Cancer Grades. Each cancer in the HPA has a specified grade or subtype. We partitioned the images by grade and ran the pipeline to compare the two grades to each other for the prostate and bladder cancer set (Dataset S2). We also asked how well each protein could be used as a potential biomarker in a classifier trained to distinguish three disease states: normal tissue and low-grade and high-grade tumors. Fig. S4 shows the location P value and the expression P value for each protein when comparing the two subtypes versus each other. Points that fall in the upper left corner (Fig. S4) have different subcellular locations between the two grades but similar expression levels. The color of each point represents that protein’s three-class classification accuracy.

Example images for the proteins with the greatest three-class classification accuracies are shown in Fig. S5. The best classification accuracy was obtained for S100A6 in bladder: it has a classification accuracy of 83% (compared with 33% expected at random), a location P value of 0.081, and an expression P value of 0.34. This protein is the best example of a potential location biomarker (one that changes location but not expression) in bladder. These results provide further support for the utility of our system for identifying important location changes between disease states.
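The classification accuracies reported here and in Table 1 come from a nearest neighbor classifier (Fig. 1). A minimal sketch of how such an accuracy could be estimated is given below. It assumes leave-one-out evaluation over regions, which is a choice made for this illustration; the evaluation protocol actually used is described in Materials and Methods. It works for two classes (normal, cancer) or three (normal, low grade, high grade).

```python
import math


def loo_nearest_neighbor_accuracy(features, labels):
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier.

    features: list of equal-length numeric tuples, one per region.
    labels:   list of class labels, one per region.
    """
    n = len(features)
    if n < 2:
        raise ValueError("need at least two regions")
    correct = 0
    for i in range(n):
        # Nearest other region in Euclidean feature space.
        nearest = min(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(features[i], features[j]),
        )
        if labels[nearest] == labels[i]:
            correct += 1
    return correct / n
```

With three equally sized classes, labels assigned at random would give an accuracy near 33%, the baseline quoted above.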

[Fig. 2 image: four scatter plots, one per tissue (Breast, Liver, Prostate, Bladder). x axis: Location Similarity (p-value), 0 to 1; y axis: Expression Similarity (p-value), 0 to 1.]

Fig. 2. Distinguishing between intensity and location changes. Each dot shows the P values for the hypotheses that location or expression are different between normal and tumor tissue for a given protein. The correlation between location and expression P values is weak, suggesting that proteins that change location in the cancer state do not necessarily change expression, as seen in the top left corner. The color indicates the classification accuracy for separating normal and cancer images of that protein by using subcellular location information alone. Proteins with high classification accuracy for distinguishing normal and cancer images are represented by a red dot. Red proteins closest to the top left corner are potential location biomarkers, and their discovery would have been missed by traditional experiments that measure changes in protein expression.

Table 1. Potential location biomarkers

Tissue     Gene name   HPA Ab      Location P value   Expression P value   Accuracy
Breast     SCRN2       HPA023434   0.24               0.95                 0.69
Breast     BTNL2       HPA039844   0.24               0.95                 0.59
Breast     PSTPIP2     HPA040944   0.28               0.95                 0.58
Breast     USP10       HPA006749   0.27               0.90                 0.58
Breast     NT5DC3      HPA041634   0.27               0.89                 0.62
Liver      SLC30A9     HPA004014   0.15               0.96                 0.81
Liver      METTL21A    HPA034712   0.15               0.87                 0.65
Liver      C4orf22     HPA043383   0.19               0.92                 0.65
Liver      WDR24       HPA039506   0.16               0.85                 0.67
Liver      PARP12      HPA003584   0.22               0.94                 0.78
Prostate   RASGRF2     HPA018679   0.14               0.97                 0.90
Prostate   ECE1        HPA001490   0.17               0.99                 0.77
Prostate   FAM120A     HPA019734   0.18               0.94                 0.73
Prostate   PLA2G4C     HPA043083   0.19               0.95                 0.69
Prostate   TMEM194A    HPA014394   0.13               0.85                 0.79
Bladder    TTC27       HPA031246   0.19               0.89                 0.84
Bladder    FGFR1OP2    HPA038696   0.14               0.83                 0.89
Bladder    TARS2       HPA028626   0.25               0.96                 0.54
Bladder    STAC        HPA035143   0.19               0.83                 0.71
Bladder    -           CAB009119   0.20               0.82                 0.72

The five proteins with the greatest location change and the smallest expression change are shown (the full ranked list is in Dataset S1). Classification accuracies for distinguishing normal and cancer are also shown.


Identification of Biochemical Pathways Enriched in Translocated Proteins. Finally, we were interested to find out whether our analysis could suggest entire pathways, or major portions thereof, that might undergo translocation together in cancer (the simplest example would be proteins that are part of a translocating complex). To answer this, for each Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway, we calculated the probability that all of the proteins in it changed location or expression compared with a randomly sampled background distribution (SI Text and Fig. S6). (Note that this represents an underestimate of the change in a pathway if it contains subcomponents that do not translocate.) We calculated pathway changes by using our image processing pipeline or pathologist annotations. Pathways with the largest change in either location or expression are listed in Table 2. As discussed later, some of these pathways have been previously implicated in cancer and some are novel predictions.
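The comparison of a pathway against a randomly sampled background can be sketched as an empirical test. The exact pathway statistic is defined in SI Text and is not reproduced here; the sketch below assumes, purely for illustration, that a pathway is scored by the mean P value of its member proteins and that the background consists of random protein sets of the same size drawn from the analysis set. The function name and arguments are hypothetical.

```python
import random
import statistics


def pathway_p_value(protein_p, pathway, n_samples=2000, seed=0):
    """Empirical P value that a pathway's mean P value is as low as observed.

    protein_p: dict mapping protein name -> location (or expression) P value.
    pathway:   iterable of protein names; names absent from protein_p are ignored.
    """
    members = [g for g in pathway if g in protein_p]
    if not members:
        raise ValueError("pathway has no proteins in the analysis set")
    observed = statistics.mean(protein_p[g] for g in members)
    background = list(protein_p.values())
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        # Random protein set of the same size as the pathway.
        sample = rng.sample(background, len(members))
        if statistics.mean(sample) <= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_samples + 1)
```

Under this assumed statistic, a pathway containing non-translocating subcomponents would have its mean pulled upward, consistent with the remark above that the pathway-level change is then underestimated.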

Discussion

We have described a workflow to identify proteins that change their subcellular location between normal and cancerous tissue without requiring classification. We confirmed our ability to detect changes in location classes by using annotations provided by the HPA database.

Upon visual inspection of top hits (Fig. 3), we noted that our system was able to find texture changes between the two disease states; however, these did not always represent changes between distinct subcellular location classes. In some cases, our texture features detected changes in tissue structures and morphology.

In Fig. 3, BTNL2 in breast showed a decrease in nuclear localization. SLC30A9 in liver decreased in cytoplasmic graininess, and C4orf22 decreased in plasma membrane and vesicle localization in the cancer state. In prostate cancer, ECE1 decreased in plasma membrane accumulation. In bladder cancer, FGFR1OP2 changed from cytoplasmic and nuclear localization to mostly a grainy cytoplasmic localization in cancer. TARS2 in bladder increased in nuclear membrane localization in cancer. In some cases, our texture features picked up changes in tissue structure but not necessarily subcellular location, as was seen in SCRN2 in breast and TTC27 in bladder.

It was of interest to consider whether our top predictions had previously been implicated as being altered in cancers, and which ones were new discoveries. In breast cancer, very few studies have reported on BTNL2; however, there is strong evidence that variants of this gene play a role in susceptibility to sporadic and familial prostate cancer (14). PSTPIP2 has not been reported in breast cancer; however, it is implicated in the expansion of macrophage progenitors leading to autoinflammatory disease (15). USP10 is translocated to the nucleus upon DNA damage and regulates p53 (16).

In liver cancer, PARP12 has been reported to play a role in genome surveillance and DNA repair pathways, and it is being recognized as a new potential therapeutic target (17).

In prostate cancer, three of our top findings have been linked to prostate cancer development. Methylation of the RASGRF2 gene was found to be associated with prostate cancer (18). ECE1 has been implicated in prostate cancer cell invasion, in which different isoforms of the protein were found to play different roles (19). PLA2G4C is regulated by EGR, a gene that is rearranged in approximately 50% of prostate cancer (20). In bladder cancer, very few of the top findings have been published in association with disease.

Pathway Changes. Some of the top-ranking pathways had location P values that were approximately two orders of magnitude smaller than the expression P values based on the pipeline results (Table 2 and Fig. S6). In the HTLV-1 infection pathway, the HTLV-I Tax oncoprotein initiates malignancy development in leukemia by creating an environment to facilitate DNA damage (21). To our knowledge, the molecular mechanism of this pathway has not been studied in the contexts of breast or urothelial

[Figure 3 image: paired Normal and Cancer regions for Breast, Liver, Prostate, and Bladder. Panel labels: BTNL2, METTL21A, ECE1, FGFR1OP2, PSTPIP2, C4orf22, FAM120A, TARS2, SCRN2, SLC30A9, RASGRF2, TTC27.]

Fig. 3. Example regions from top location biomarker predictions with very small mean intensity changes. For every protein, the features from each disease state were clustered by using k-means (k = 2), and the region closest to each centroid is displayed.
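The selection of representative regions described in the Fig. 3 legend can be illustrated with a minimal sketch. It assumes plain Lloyd-style k-means with random initialization on a list of feature vectors; the function name, the initialization, and the iteration count are illustrative assumptions and are not taken from the original pipeline.

```python
import math
import random

def kmeans_representatives(features, k=2, n_iter=50, seed=0):
    """Cluster feature vectors with plain k-means and return, for each
    cluster, the index of the region closest to that cluster's centroid.
    (Hypothetical helper; initialization and n_iter are assumptions.)"""
    rng = random.Random(seed)
    centroids = [list(features[i]) for i in rng.sample(range(len(features)), k)]
    for _ in range(n_iter):
        # Assignment step: attach each region to its nearest centroid.
        groups = [[] for _ in range(k)]
        for f in features:
            nearest = min(range(k), key=lambda c: math.dist(f, centroids[c]))
            groups[nearest].append(f)
        # Update step: move each centroid to the mean of its members.
        for c, members in enumerate(groups):
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    # Representative region = the one nearest to each final centroid.
    return [
        min(range(len(features)), key=lambda i: math.dist(features[i], centroids[c]))
        for c in range(k)
    ]
```

In this sketch the returned indices would be used to look up the image regions to display; the order of the two indices depends on the random initialization.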

[Figure 4 image: hierarchical trees with ordered region tiles. Panel labels: SLC36A4 (Breast), GSTZ1 (Liver).]

Fig. 4. Ordering regions by location change progression. We selected one top-ranking protein from breast and liver: antibodies HPA017887 and HPA004701, respectively. The Euclidean distances between every pair of regions were calculated by using the features and clustered into a binary hierarchical tree. The leaves were ordered to maximize the sum of similarities between adjacent leaves across the tree. The tree was cut at 10 clusters, and leaves contained in each cluster are indicated by color. The region closest to the mean of each cluster is displayed below the tree from left to right. Normal tiles are outlined in blue; cancer tiles are outlined in red.

18252 | www.pnas.org/cgi/doi/10.1073/pnas.1415120112 Kumar et al.


cancer. Our results, together with other literature reports, indicate that subcellular location changes in components of the HTLV-I infection pathway could play a role in driving cancer. Further, the high rank of this pathway across the four tissues indicates that these changes may be important in identifying and understanding other cancers as well.

The “one carbon pool by folate” pathway was also found to change location in breast cancer. It is known to play an important role in DNA global hypomethylation, which can lead to DNA strand breaks (22). As expected, when changes in expression are used to rank pathways, the ErbB pathway ranks near the top for breast cancer (23).

In liver cancer, the axon guidance pathway was found to change location more than it did expression. One of the genes contributing to the axon guidance pathway, ROBO1, was found to be overexpressed in hepatocellular carcinoma (24). The axon guidance pathway has not been implicated as a whole in liver cancers, but it is known to be altered in pancreatic cancers (25). Our results, with previous reports of ROBO1, suggest this pathway may play a role in liver cancer. The importance of the top pathways seen to change expression in liver is unclear.

In prostate cancer, HIF-1 signaling is known to play an important role in hypoxia adaptation of tumors, and HIF-1α is known to be overexpressed in early tumors (26). Our findings suggest that the proteins in this pathway undergo location changes, possibly contributing to the pathway’s dysregulation in cancer. PPAR is a known prostate cancer marker (27), and its pathway is identified as changing expression.

Finally, in urothelial cancer, the hippo signaling pathway was the top-ranking pathway to change location. Hippo signaling is responsible for controlling tissue size and is known to lead to uncontrolled cellular proliferation and blocking of apoptosis when misregulated (28). When we ranked the pathways in urothelial cancer by expression changes, a number of signaling pathways known to be involved in cancers ranked at the top.

We also calculated the product of the P values for each pathway across all four tissues to find those pathways changing in all four cancers (Dataset S3). Three of the top-ranking pathways for location changes were already identified in individual tissues. In addition, the p53 signaling pathway (which is known to involve location changes) was also identified. For expression changes, five pathways previously associated with cancers were highly ranked (which is encouraging with respect to the accuracy of our automated methods).

Our analyses suggest that location changes of these pathways may be important for understanding their role in disease. In addition, our results link previously implicated pathways to new cancers for further investigation.
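The cross-tissue ranking by the product of pathway P values can be sketched as follows. The dictionary layout and the function name are assumptions made for illustration, and the numbers in the usage note are invented placeholders, not values from Dataset S3.

```python
import math

def rank_pathways_by_product(pvalues_by_tissue):
    """Rank pathways by the product of their per-tissue P values.

    pvalues_by_tissue maps tissue name -> {pathway name: P value}
    (hypothetical layout). Only pathways scored in every tissue are ranked.
    """
    tissues = list(pvalues_by_tissue.values())
    shared = set(tissues[0]).intersection(*tissues[1:])
    # The sum of logs equals the log of the product and avoids underflow
    # when many small P values are multiplied.
    log_product = {p: sum(math.log10(t[p]) for t in tissues) for p in shared}
    return sorted(log_product, key=log_product.get)
```

For example, with placeholder inputs in which pathway "A" has P values 0.001, 0.01, 0.02, and 0.005 in the four tissues and pathway "B" has 0.5, 0.2, 0.3, and 0.4, the function places "A" ahead of "B".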

Conclusion

By using staining patterns of proteins in four tissues, we have identified proteins that show altered subcellular location in cancer and/or whose patterns can be used to distinguish normal and cancerous tissue or different cancer subtypes. Many of these proteins do not have significant expression level changes and would not have been found as biomarkers if we had considered expression level changes alone. Further, some proteins have high classification accuracies but visually similar location patterns. The subtle changes that are being detected may nonetheless be useful for distinguishing disease states. Extended analysis with more images of the potential markers we have identified will be necessary to assess their utility or significance. We note that the analysis pipeline we have described is not only useful for identifying cancer biomarkers, but should also be valuable for automating the process of analyzing IHC images to assess disease state. We are currently carrying out collaborative translational studies to determine whether our technology combined with any of the potential biomarkers is useful for distinguishing lesions with various diagnoses or prognoses.

Materials and Methods

Data. We used images from the HPA (www.proteinatlas.org) that appeared online on September 24, 2013 (see Fig. S7 for evaluation of the effects of image compression). Proteins were placed in the analysis set for each tissue if they met three criteria: (i) the staining annotation was strong or moderate, (ii) the quantity field was annotated as greater than 75%, and (iii) at least three images of that protein were available for the normal tissue. Approximately 500 proteins per tissue passed this filtering procedure (Dataset S1 provides a list of proteins in the full analysis set for each tissue).
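The three inclusion criteria can be expressed as a simple predicate. The sketch below assumes a hypothetical record layout; in particular, it treats the quantity annotation as a numeric percentage, which may not match how the HPA encodes that field.

```python
from dataclasses import dataclass

@dataclass
class ProteinRecord:
    """Hypothetical per-tissue summary of one protein (not the HPA schema)."""
    name: str
    staining: str            # e.g. "strong", "moderate", "weak", "negative"
    quantity_percent: float  # assumed numeric encoding of the quantity field
    n_normal_images: int

def passes_filter(rec):
    """Apply criteria (i)-(iii) from the Data section."""
    return (
        rec.staining in {"strong", "moderate"}   # (i) staining annotation
        and rec.quantity_percent > 75            # (ii) quantity above 75%
        and rec.n_normal_images >= 3             # (iii) enough normal images
    )

def analysis_set(records):
    """Names of the proteins that pass all three criteria."""
    return [r.name for r in records if passes_filter(r)]
```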

Identification of Validation Sets. A validation set of proteins whose location was known to change (which we define as true positives) was created for each tissue. These were found by using HPA annotations. We identified the set of true positives for a given tissue by finding those proteins for which the set of location annotations for all normal images did not intersect the set of location annotations for all cancer images. In the data set for breast, liver, prostate, and bladder there were 5, 3, 7, and 13 true positives, respectively.
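The disjointness criterion for true positives amounts to a set operation. The following sketch assumes that each protein maps to a list of per-image annotation sets; the input layout, the function name, and the location labels in the usage note are hypothetical.

```python
def location_true_positives(normal_annotations, cancer_annotations):
    """Return proteins whose pooled normal and pooled cancer location
    annotations share no label.

    Both arguments map protein name -> list of per-image label sets
    (hypothetical layout). Proteins lacking annotations in either state
    are skipped, which is an assumption of this sketch.
    """
    hits = []
    for protein, normal_sets in normal_annotations.items():
        normal_locs = set().union(*normal_sets)
        cancer_locs = set().union(*cancer_annotations.get(protein, []))
        if normal_locs and cancer_locs and normal_locs.isdisjoint(cancer_locs):
            hits.append(protein)
    return hits
```

For instance, a protein annotated only as "nucleus" in all normal images and only as "cytoplasm" in all cancer images would be returned, whereas a protein with "cytoplasm" in both states would not.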

Selecting Regions. For each image, we selected regions that showed significant staining. A low-pass filter was applied to each protein image and we selected regions centered on the peaks of the filtered image. This was done under the assumption that the cellular regions of the tissue would have the highest staining levels, as opposed to the connective tissue, stroma, and other noncellular regions, which would have much lower levels of staining primarily as a result of nonspecific antibody binding.
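A minimal sketch of peak-based region selection is given below. The text does not specify the type of low-pass filter or how nearby peaks were handled, so the box (mean) filter, the greedy peak picking, and the non-overlap rule used here are all assumptions for illustration.

```python
def box_filter(image, radius):
    """Low-pass filter (assumed box filter): replace each pixel by the mean
    of its (2*radius+1)^2 neighborhood, clipping the window at the border."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ys = range(max(0, y - radius), min(h, y + radius + 1))
            xs = range(max(0, x - radius), min(w, x + radius + 1))
            vals = [image[j][i] for j in ys for i in xs]
            out[y][x] = sum(vals) / len(vals)
    return out

def select_region_centers(image, n_regions, radius):
    """Greedily pick up to n_regions centers at the highest values of the
    smoothed image, keeping centers more than 2*radius pixels apart
    (Chebyshev distance) so that the square regions do not overlap."""
    smooth = box_filter(image, radius)
    h, w = len(image), len(image[0])
    candidates = sorted(
        ((smooth[y][x], y, x) for y in range(h) for x in range(w)), reverse=True
    )
    centers = []
    for _, y, x in candidates:
        if all(max(abs(y - cy), abs(x - cx)) > 2 * radius for cy, cx in centers):
            centers.append((y, x))
            if len(centers) == n_regions:
                break
    return centers
```

Here `image` is a list of rows of stain intensities; each returned (row, column) pair would serve as the center of one square region.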

Removing Outlier Images. Next, we removed outlier regions and images based on DNA and protein intensity. For each tissue, we calculated the mean and SD of the protein and DNA stains for all images. This same process was repeated for all regions from each tissue. We removed images and regions from the dataset that were more than 4 SDs from the mean.
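The 4-SD rule can be sketched as below. The dictionary keys and the use of the population SD are assumptions; the same function would be applied once to the images and once to the regions of a tissue.

```python
import statistics

def remove_outliers(items, n_sd=4.0):
    """Keep items whose protein and DNA intensities both lie within n_sd
    standard deviations of the respective mean over all items.

    Each item is a dict with hypothetical keys "protein" and "dna"
    holding mean stain intensities.
    """
    kept = list(items)
    for key in ("protein", "dna"):
        values = [it[key] for it in items]
        mean = statistics.fmean(values)
        sd = statistics.pstdev(values)  # population SD (an assumption)
        # If sd is zero, all values are identical and nothing is removed.
        kept = [it for it in kept if sd == 0 or abs(it[key] - mean) <= n_sd * sd]
    return kept
```

Note that with a threshold of 4 SDs a single extreme value can only be flagged when the set is reasonably large (roughly 18 items or more when the population SD is used), because the extreme value itself inflates the SD.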

Pipeline for Testing Changes in Location or Expression. Our pipeline calculates P values for the hypotheses that the location or expression of each protein is the same between normal and cancer images. The pipeline requires inputs for the number of images to use, the number of regions to select per image, the region size, and the number of estimates to average when reporting P values and accuracies (choice of these parameters is discussed later).

Table 2. Pathways with the largest location or expression changes

Tissue    L:P  E:P  L:A  E:A  Pathway
Breast    +    -    -    -    HTLV I infection
Breast    +    -    -    -    One carbon pool by folate
Breast    -    +    -    -    Proteasome
Breast    -    +    -    -    ErbB signaling
Liver     +    -    -    -    Axon guidance
Liver     -    +    -    -    GPI anchor biosynthesis
Liver     -    +    -    -    Hypertrophic cardiomyopathy
Liver     -    +    -    -    Dilated cardiomyopathy
Prostate  +    -    -    -    Fatty acid elongation
Prostate  +    -    -    -    HIF-1 signaling
Prostate  -    +    -    -    Oxidative phosphorylation
Prostate  -    +    -    -    PPAR signaling
Prostate  -    +    -    -    Viral myocarditis
Bladder   +    -    +    -    Hippo signaling
Bladder   -    +    -    -    NF-κB signaling
Bladder   -    +    -    -    p53 signaling
Bladder   -    +    -    -    Transcription misregulation in cancer
Bladder   -    +    -    -    Apoptosis
Bladder   -    +    -    -    Cell cycle
Bladder   -    +    -    -    mRNA surveillance
Bladder   -    +    -    -    Ribosome biogenesis

Pathway P values were calculated by using individual protein location (L) or expression (E) P values from the pipeline (L:P, E:P) and using P values from pathologist annotations (L:A, E:A). Plus symbol indicates pathway P < 0.01. Values for all pathways are in Dataset S3.

Kumar et al. PNAS | December 23, 2014 | vol. 111 | no. 51 | 18253

BIOPHYSICS AND COMPUTATIONAL BIOLOGY


For location testing, a set of features that do not require segmentation of the image into individual cell regions was extracted from each region as described previously (11), with the modification that horizontal and vertical features were combined to produce a set of 592 rotation invariant features. These include texture features at many different levels of resolution and nuclear overlap features. We used the 57-feature subset that was previously selected to be able to distinguish eight subcellular location classes with high accuracy (ref. 11; see also Fig. S8). The equivalence of the distributions of these features for normal and cancer regions was evaluated by the FR test. Because the test is nonparametric and does not make assumptions about the distributions from which the samples are drawn, it is suitable for small numbers of regions and large numbers of features.
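For readers unfamiliar with this kind of test, the sketch below shows one way a multivariate runs test can be computed, assuming that FR denotes the Friedman-Rafsky test: the pooled points are joined by a Euclidean minimum spanning tree, and the number of tree edges linking points from different samples is counted. The permutation-based P value is an assumption of this sketch; the text does not state how the pipeline derived P values from the statistic.

```python
import math
import random

def minimum_spanning_tree(points):
    """Prim's algorithm on the complete Euclidean graph.
    Returns the tree as a list of (i, j) index pairs."""
    n = len(points)
    # best[i] = (distance to the tree, index of the nearest tree vertex)
    best = {i: (math.dist(points[0], points[i]), 0) for i in range(1, n)}
    edges = []
    while best:
        j = min(best, key=lambda i: best[i][0])
        _, parent = best.pop(j)
        edges.append((parent, j))
        for i in best:
            d = math.dist(points[j], points[i])
            if d < best[i][0]:
                best[i] = (d, j)
    return edges

def fr_test(sample_a, sample_b, n_perm=2000, seed=0):
    """Friedman-Rafsky-style statistic with a permutation P value.

    The statistic R is one plus the number of MST edges joining points
    from different samples; few such edges suggest different distributions.
    """
    points = list(sample_a) + list(sample_b)
    labels = [0] * len(sample_a) + [1] * len(sample_b)
    edges = minimum_spanning_tree(points)

    def runs(lab):
        return 1 + sum(lab[i] != lab[j] for i, j in edges)

    observed = runs(labels)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # the tree depends only on the pooled points
        hits += runs(labels) <= observed
    return observed, (hits + 1) / (n_perm + 1)
```

In this sketch each sample is a list of feature vectors (for example, the 57 features of each region), and a small returned P value indicates that the two sets of regions are unlikely to share a distribution.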

Expression P values were calculated by normalizing the mean protein intensity level across each region used in the location analysis by the respective mean nuclear intensity level. This results in a one-dimensional set of points corresponding to the regions for normal protein expression and cancer protein expression. We calculated a P value that the points in the two sets were drawn from the same distribution by using the Wald–Wolfowitz test, the one-dimensional version of the FR test.
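The expression comparison can be sketched as a two-sample Wald-Wolfowitz runs test on the normalized intensities. The normal approximation and the one-sided alternative (too few runs) used below are standard textbook choices and are assumptions here; the function names are illustrative.

```python
import math

def normalized_expression(protein_means, nuclear_means):
    """Per-region mean protein intensity divided by mean nuclear intensity."""
    return [p / d for p, d in zip(protein_means, nuclear_means)]

def wald_wolfowitz(x, y):
    """Two-sample runs test with a normal approximation.

    Returns (number of runs, one-sided P value for observing this few runs).
    Tied values are ordered by sample label, which is a simplification.
    """
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    labels = [lab for _, lab in pooled]
    runs = 1 + sum(a != b for a, b in zip(labels, labels[1:]))
    m, n = len(x), len(y)
    mean = 2 * m * n / (m + n) + 1
    var = 2 * m * n * (2 * m * n - m - n) / ((m + n) ** 2 * (m + n - 1))
    z = (runs - mean) / math.sqrt(var)
    p = 0.5 * (1 + math.erf(z / math.sqrt(2)))  # standard normal CDF at z
    return runs, p
```

When the two samples are completely separated on the line, the sorted label sequence contains only two runs, which yields a small P value even for a handful of regions per state.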

The reported P values and accuracies for each protein were calculated by averaging 35 estimates, a number that we found produced consistent ranked lists (SI Text and Fig. S9).

Selecting Image Sets, Number of Regions, and Region Size. The database has as many as 3 images for each normal tissue and as many as 30 images for each cancer tissue. To have the same null distribution for the nonparametric P values, we needed to use the same numbers of normal images and of cancer images for every antibody. To identify the optimal number of each, we randomly selected 200 antibodies for each tissue, selected 2 regions from each image, and assessed the extent to which the validation markers were ranked highly by location P values (using the AUC). The number of normal images was varied from 1 to 3 and the number of cancer images from 3 to 24. We found that the best AUC average over all 200 antibodies resulted from using 2 normal images and 17 cancer images.
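The AUC used to score a ranking can be computed directly from the P values, as in the sketch below. It uses the rank-based (Mann-Whitney) form of the AUC, which is an assumption about the exact computation; the protein names in the usage note are placeholders.

```python
def auc_from_pvalues(pvalues, true_positives):
    """AUC for ranking proteins by ascending P value.

    pvalues maps protein name -> location P value; true_positives is the
    set of validation markers. The AUC equals the fraction of
    (marker, non-marker) pairs in which the marker has the smaller
    P value, with ties counted as one half.
    """
    pos = [p for name, p in pvalues.items() if name in true_positives]
    neg = [p for name, p in pvalues.items() if name not in true_positives]
    wins = sum((a < b) + 0.5 * (a == b) for a in pos for b in neg)
    return wins / (len(pos) * len(neg))
```

For example, if the two validation markers among four proteins have the two smallest P values the AUC is 1.0, and a value near 0.5 would indicate a ranking no better than chance.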

Next, the optimal number of regions was found by using the same 200-antibody training set and 2 normal and 17 cancer images. We varied the region count from 2 to 5 from each of the images. The best performance as measured by AUC resulted from using 5, 3, 3, and 4 regions per image for breast, liver, prostate, and bladder tissues, respectively, and we therefore used these values for the full analysis sets. Differences in the optimal number of regions for different tissues presumably reflect tissue-specific variations across the normal and cancer states. Limiting the number of regions per image prevents the sampling of noncellular regions in each tissue.

The optimal region size was chosen by assessing the performance of 100 randomly chosen proteins in ROC curves for each tissue. Ideally the region should be small enough so as to only capture cellular areas from a tissue image, as capturing noncellular regions introduces new textures to the analysis that would affect the subcellular location features. The optimal size was selected to be 75 pixels, with an average AUC of 0.67 for the four tissues.

Classification. We used nearest neighbor classifiers with the 57 z-scored features and used cross-validation to estimate the ability of a given protein to distinguish normal from cancer images, or to distinguish low-grade from high-grade cancer images. Images were assigned the majority class of their regions.
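The classification scheme can be illustrated with the following sketch, which assumes a 1-nearest-neighbor classifier and leave-one-image-out cross-validation. For brevity the z-scoring is computed over all regions at once rather than within each training fold, which is a simplification; the input layout and function names are hypothetical.

```python
import math
import statistics
from collections import Counter

def zscore_columns(rows):
    """Z-score each feature column; constant columns are left centered."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    sds = [statistics.pstdev(c) or 1.0 for c in cols]
    return [[(v - m) / s for v, m, s in zip(r, means, sds)] for r in rows]

def image_level_accuracy(regions):
    """regions: list of (image_id, label, feature_vector) tuples.

    Each image is held out in turn; its regions are classified by their
    nearest training region, and the image takes the majority class of
    its regions. Returns the fraction of images classified correctly.
    """
    feats = zscore_columns([f for _, _, f in regions])
    images = sorted({img for img, _, _ in regions})
    correct = 0
    for held_out in images:
        train = [
            (feats[i], lab)
            for i, (img, lab, _) in enumerate(regions)
            if img != held_out
        ]
        votes = Counter()
        true_label = None
        for i, (img, lab, _) in enumerate(regions):
            if img == held_out:
                true_label = lab
                nearest = min(train, key=lambda t: math.dist(feats[i], t[0]))
                votes[nearest[1]] += 1
        correct += votes.most_common(1)[0][0] == true_label
    return correct / len(images)
```

Holding out whole images rather than single regions keeps regions from the same image out of the training set, which avoids an optimistic bias when regions within an image resemble each other.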

Distinguishing Cancer Grades. The prostate and bladder cancers are identified as high- or low-grade in the HPA, with approximately equal numbers of each. We partitioned the cancer images by grade and ran the pipeline to compare them. We also randomly selected three images from each set and calculated the three-class classification accuracy for a nearest neighbor classifier (using leave-one-out cross-validation). This was repeated 35 times (producing a consistent ranking, as explained earlier) for different sets of randomly selected images, and the average of the 35 accuracies is reported. Proteins that did not have at least three images from each disease state were excluded. The results are contained in Dataset S2.

ACKNOWLEDGMENTS. We thank the HPA project team for making the valuable collection of IHC images publicly available; E. Lundgren and M. Uhlén for helpful discussions; members of the laboratory of R.F.M. for valuable suggestions; and the anonymous reviewers of the manuscript for comments resulting in significant improvements in the analysis. This work was supported by NIH Grant GM075205, NIH Training Grant EB009403 (to A.K.), Commonwealth of Pennsylvania Commonwealth Universal Research Enhancement Program Grant 4100059192, and a postdoctoral fellowship from the Lane Fellows Program (to A.R.).

1. Kononen J, et al. (1998) Tissue microarrays for high-throughput molecular profiling of tumor specimens. Nat Med 4(7):844–847.

2. Khan J, et al. (1999) Expression profiling in cancer using cDNA microarrays. Electrophoresis 20(2):223–229.

3. Mardis ER, Wilson RK (2009) Cancer genome sequencing: A review. Hum Mol Genet 18(R2):R163–R168.

4. Leung F, Diamandis EP, Kulasingam V (2012) From bench to bedside: Discovery of ovarian cancer biomarkers using high-throughput technologies in the past decade. Biomarkers Med 6(5):613–625.

5. Chung GG, et al. (2001) Tissue microarray analysis of beta-catenin in colorectal cancer shows nuclear phospho-beta-catenin is associated with a better prognosis. Clin Cancer Res 7(12):4013–4020.

6. Lessard L, et al. (2006) Nuclear localization of nuclear factor-kappaB p65 in primary prostate tumors is highly predictive of pelvic lymph node metastases. Clin Cancer Res 12(19):5741–5745.

7. Guillaud M, et al. (2005) Subvisual chromatin changes in cervical epithelium measured by texture image analysis and correlated with HPV. Gynecol Oncol 99(3, suppl 1):S16–S23.

8. Shariff A, Kangas J, Coelho LP, Quinn S, Murphy RF (2010) Automated image analysis for high-content screening and analysis. J Biomol Screen 15(7):726–734.

9. Lejeune M, et al. (2008) Quantification of diverse subcellular immunohistochemical markers with clinicobiological relevancies: Validation of a new computer-assisted image analysis procedure. J Anat 212(6):868–878.

10. Matos LL, Trufelli DC, de Matos MG, da Silva Pinhal MA (2010) Immunohistochemistry as an important tool in biomarkers detection and clinical practice. Biomark Insights 5:9–20.

11. Newberg J, Murphy RF (2008) A framework for the automated analysis of subcellular patterns in Human Protein Atlas images. J Proteome Res 7(6):2300–2308.

12. Glory E, Newberg J, Murphy RF (2008) Automated comparison of protein subcellular location patterns between images of normal and cancerous tissues. Proc IEEE Int Symp Biomed Imaging 2008:304–307.

13. Uhlén M, et al. (2005) A human protein atlas for normal and cancer tissues based on antibody proteomics. Mol Cell Proteomics 4(12):1920–1932.

14. Fitzgerald LM, et al. (2013) Germline missense variants in the BTNL2 gene are associated with prostate cancer susceptibility. Cancer Epidemiol Biomarkers Prev 22(9):1520–1528.

15. Chitu V, et al. (2009) Primed innate immunity leads to autoinflammatory disease in PSTPIP2-deficient cmo mice. Blood 114(12):2497–2505.

16. Yuan J, Luo K, Zhang L, Cheville JC, Lou Z (2010) USP10 regulates p53 localization and stability by deubiquitinating p53. Cell 140(3):384–396.

17. Yelamos J, Farres J, Llacuna L, Ampurdanes C, Martin-Caballero J (2011) PARP-1 and PARP-2: New players in tumour development. Am J Cancer Res 1(3):328–346.

18. Mahapatra S, et al. (2012) Global methylation profiling for risk prediction of prostate cancer. Clin Cancer Res 18(10):2882–2895.

19. Lambert LA, Whyteside AR, Turner AJ, Usmani BA (2008) Isoforms of endothelin-converting enzyme-1 (ECE-1) have opposing effects on prostate cancer cell invasion. Br J Cancer 99(7):1114–1120.

20. Massoner P, et al. (2013) Characterization of transcriptional changes in ERG rearrangement-positive prostate cancer identifies the regulation of metabolic sensors such as neuropeptide Y. PLoS ONE 8(2):e55207.

21. Matsuoka M, Jeang KT (2007) Human T-cell leukaemia virus type 1 (HTLV-1) infectivity and cellular transformation. Nat Rev Cancer 7(4):270–280.

22. Xu X, Chen J (2009) One-carbon metabolism and breast cancer: An epidemiological perspective. J Genet Genomics 36(4):203–214.

23. Howe LR, Brown PH (2011) Targeting the HER/EGFR/ErbB family to prevent breast cancer. Cancer Prev Res (Phila) 4(8):1149–1157.

24. Ito H, et al. (2006) Identification of ROBO1 as a novel hepatocellular carcinoma antigen and a potential therapeutic and diagnostic target. Clin Cancer Res 12(11 pt 1):3257–3264.

25. Biankin AV, et al.; Australian Pancreatic Cancer Genome Initiative (2012) Pancreatic cancer genomes reveal aberrations in axon guidance pathway genes. Nature 491(7424):399–405.

26. Kimbro KS, Simons JW (2006) Hypoxia-inducible factor-1 in human breast and prostate cancer. Endocr Relat Cancer 13(3):739–749.

27. Collett GP, et al. (2000) Peroxisome proliferator-activated receptor alpha is an androgen-responsive gene in human prostate and is highly expressed in prostatic adenocarcinoma. Clin Cancer Res 6(8):3241–3248.

28. Barron DA, Kagey JD (2014) The role of the Hippo pathway in human disease and tumorigenesis. Clin Transl Med 3:25.
