14
Embracing an integromic approach to tissue biomarker research in cancer: Perspectives and lessons learned Li, G., Bankhead, P., Dunne, P. D., O'Reilly, P. G., James, J. A., Salto-Tellez, M., ... McArt, D. G. (2016). Embracing an integromic approach to tissue biomarker research in cancer: Perspectives and lessons learned. Briefings in Bioinformatics. DOI: 10.1093/bib/bbw044 Published in: Briefings in Bioinformatics Document Version: Publisher's PDF, also known as Version of record Queen's University Belfast - Research Portal: Link to publication record in Queen's University Belfast Research Portal Publisher rights ©2016 The Authors This is an open access article published under a Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium, provided the author and source are cited. General rights Copyright for the publications made accessible via the Queen's University Belfast Research Portal is retained by the author(s) and / or other copyright owners and it is a condition of accessing these publications that users recognise and abide by the legal requirements associated with these rights. Take down policy The Research Portal is Queen's institutional repository that provides access to Queen's research output. Every effort has been made to ensure that content in the Research Portal does not infringe any person's rights, or applicable UK laws. If you discover content in the Research Portal that you believe breaches copyright or violates any law, please contact [email protected]. Download date:15. Feb. 2017

Embracing an integromic approach to tissue biomarker ... › download › pdf › 33595853.pdf · Her research uses integrative molecular approaches for biomarker discovery and validation

  • Upload
    others

  • View
    5

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Embracing an integromic approach to tissue biomarker ... › download › pdf › 33595853.pdf · Her research uses integrative molecular approaches for biomarker discovery and validation

Embracing an integromic approach to tissue biomarker researchin cancer Perspectives and lessons learned

Li G Bankhead P Dunne P D OReilly P G James J A Salto-Tellez M McArt D G (2016)Embracing an integromic approach to tissue biomarker research in cancer Perspectives and lessons learnedBriefings in Bioinformatics DOI 101093bibbbw044

Published inBriefings in Bioinformatics

Document VersionPublishers PDF also known as Version of record

Queens University Belfast - Research PortalLink to publication record in Queens University Belfast Research Portal

Publisher rightscopy2016 The AuthorsThis is an open access article published under a Creative Commons Attribution License (httpscreativecommonsorglicensesby40)which permits unrestricted use distribution and reproduction in any medium provided the author and source are cited

General rightsCopyright for the publications made accessible via the Queens University Belfast Research Portal is retained by the author(s) and or othercopyright owners and it is a condition of accessing these publications that users recognise and abide by the legal requirements associatedwith these rights

Take down policyThe Research Portal is Queens institutional repository that provides access to Queens research output Every effort has been made toensure that content in the Research Portal does not infringe any persons rights or applicable UK laws If you discover content in theResearch Portal that you believe breaches copyright or violates any law please contact openaccessqubacuk

Download date15 Feb 2017

Embracing an integromic approach to tissue biomarker

research in cancer Perspectives and lessons learnedGerald Li Peter Bankhead Philip D Dunne Paul G OrsquoReillyJacqueline A James Manuel Salto-Tellez Peter W Hamilton andDarragh G McArtCo-senior authorsCorresponding author Darragh G McArt Centre for Cancer Research and Cell Biology (CCRCB) Queenrsquos University Belfast Belfast United KingdomE-mail dmcartqubacuk

Abstract

Modern approaches to biomedical research and diagnostics targeted towards precision medicine are generating lsquobig datarsquoacross a range of high-throughput experimental and analytical platforms Integrative analysis of this rich clinical pathologicalmolecular and imaging data represents one of the greatest bottlenecks in biomarker discovery research in cancer and otherdiseases Following on from the publication of our successful framework for multimodal data amalgamation and integrativeanalysis Pathology Integromics in Cancer (PICan) this article will explore the essential elements of assembling an integromicsframework from a more detailed perspective PICan built around a relational database storing curated multimodal data is theresearch tool sitting at the heart of our interdisciplinary efforts to streamline biomarker discovery and validation While recog-nizing that every institution has a unique set of priorities and challenges we will use our experiences with PICan as a casestudy and starting point rationalizing the design choices we made within the context of our local infrastructure and specificneeds but also highlighting alternative approaches that may better suit other programmes of research and discovery Alongthe way we stress that integromics is not just a set of tools but rather a cohesive paradigm for how modern bioinformaticscan be enhanced Successful implementation of an integromics framework is a collaborative team effort that is built with aneye to the future and greatly accelerates the processes of biomarker discovery validation and translation into clinical practice

Key words integromics big data biomarkers omics digital pathology interdisciplinary teamwork

Gerald Li is a Research Fellow at Queenrsquos University Belfast His research focuses on integrative multimodal data analysis for cancer biomarker discoveryand validationPeter Bankhead is a Research Fellow at Queenrsquos University Belfast His research focuses on developing novel image analysis algorithms and softwarefocussing on digital pathologyPhilip Dunne is a Senior Research Fellow at Queenrsquos University Belfast His research has evolved from basic biology to using translational bioinformaticsto define the molecular pathology epidemiology of early-invasive cancerPaul OrsquoReilly is a Research Fellow at Queenrsquos University Belfast His research focuses on the development implementation and acceleration of bioinfor-matics algorithms and the architecture of bioinformatics software systemsJacqueline James is a Clinical Senior LecturerConsultant Pathologist at Queenrsquos University BelfastBelfast Health and Social Care Trust Her research usesintegrative molecular approaches for biomarker discovery and validationManuel Salto-Tellez is the Chair of Pathology at Queenrsquos University Belfast and a diagnostic morpho-molecular pathologist He has championed the inte-gration of phenotype and genotype in diagnostics and researchPeter Hamilton is Professor of Bioimaging and Informatics at Queenrsquos University Belfast He has been working for the past 25 years in digital pathologyquantitative imaging biomarker discovery and digital diagnosticsDarragh McArt is a Lecturer in Translational Bioinformatics at Queenrsquos University Belfast His research focuses on developing accelerative and integrativesolutions in modern biomarker researchSubmitted 26 February 2016 Received (in revised form) 8 April 2016

VC The Author 2016 Published by Oxford University PressThis is an Open Access article distributed under the terms of the Creative Commons Attribution License (httpcreativecommonsorglicensesby40)which permits unrestricted reuse distribution and reproduction in any medium provided the original work is properly cited

1

Briefings in Bioinformatics 2016 1ndash13

doi 101093bibbbw044Paper

Briefings in Bioinformatics Advance Access published June 2 2016 at Q

ueens University of B

elfast on June 3 2016httpbiboxfordjournalsorg

Dow

nloaded from

Introduction

Modern basic scientific research approaches and clinical studiesof disease are now demanding the requirement to store man-age and interrogate enormous amounts of information Hiddenamong these large complex multivariate data are the patternsand clues needed to understand disease processes and drivenew methods for diagnosis prognosis therapeutics and preci-sion or personalized medicine Commercial and in-house solu-tions [1ndash14] have been developed in an attempt to make themost of this data deluge Describing this age of big data-drivenbiomarker research Weinstein defined integromics as thelsquomelding of diverse types of data from different experimentalplatformsrsquo [15 16] while the American Society for ClinicalOncology uses the term lsquopanomicsrsquo to describe similar needs[17]

Given the centrality of pathology to solid tumour analysiswe previously published on our own framework for multimodaldata amalgamation and integrative analysis called PathologyIntegromics in Cancer (PICan) [18] Since then we have been ap-proached by researchers who want to establish similar systemsat their own institutions This led us to reflect on the process ofbuilding PICan and the appropriateness of packaging and dis-tributing PICan as source code However the diversity of clin-ical experimental and research setups means that an optimalintegromics strategy for each institution might be vastly differ-ent from the next Hence we present here a more detailed dis-cussion on the technical considerations underpinning theformation of an integromics environment to accelerate tissuebiomarker research with a focus on cancer Key framework de-cisions were shaped by our local environment but this articlewill also discuss some alternative solutions that may better suitother research groups organizations and multi-institutionalcollaborations This perspective is structured around severalthemes with reference back to PICan as a case study integro-mics overview database structure data exchange multimodalanalysis interdisciplinary collaboration and data control Weconclude with a discussion on the central role integromics willplay in the modern era of large-scale multicentre trials and glo-bal research consortia [19ndash21] We hope that the concepts raisedwill provide valuable insights from a more detailed technicalperspective for institutions and research groups wishing to de-velop their own integromics framework

Precision medicine and integromics for biomarkerdiscovery in cancer

The modern precision medicine approach aims to stratify pa-tients based on specific biomarkers that will inform the most ef-fective treatment for each patient cutting costs and reducingharm from ineffective treatments [22] Novel therapeutics targetspecific cancer-causing mechanisms inducing fewer side ef-fects but requiring companion biomarker diagnostics withquick turnaround times to identify the patients who will bene-fit most As multiple patient and tumour characteristics cancontribute to a treatmentrsquos effectiveness this drive towards pre-cision medicine is inevitably spurring an increasing demand forunbiased automated data collection and analysis across amultitude of high-throughput analytical platforms The currentpace of data generation is far outstripping the ability to analysethese data Efforts by journals and grant-funding agencies to in-crease data transparency and encourage data sharing have re-sulted in large repositories of publicly available data such asArrayExpress [23] and the European Nucleotide Archive [24]

The ability to access these databases has increased the interestin integromics or panomics the integrative analysis of multiplebig data types This is especially true in molecular analyses [7ndash911 19 25ndash28] where tools such as Partek Genomics Suite(Partek Inc St Louis MO USA) GeneSpring (AgilentTechnologies Santa Clara CA USA) tools that harness TheCancer Genome Atlas (TCGA) [25 26] data portal (eg cBioPortalfor Cancer Genomics [developed and maintained by the Centerfor Molecular Oncology and the Computational Biology Centerat Memorial Sloan-Kettering Cancer Center] [29 30] COSMIC[31] and the Broad Institute TCGA GDAC Firehose [32]) andothers [33 34] can facilitate statistical analysis and biologicalcontextualization of omics data However in the push towardsdevelopment of precision medicine there remains an urgentneed for an integromics platform that takes into account non-molecular data Currently most of these data types are still heldin separate or loosely bound silos with the onus resting with re-search groups around the world to seek out these disparate silosand attempt to harmonize the data Even then different datatypes derived from different patient cohorts make it challengingto integrate the data at the patient level This limits analysis be-tween data types to comparisons of cohort-level summariespotentially obscuring correlations within the data that mayhave been apparent if studied at the patient level

Integration of tissue phenotype through digitalpathology and image analytics

Most molecular analyses on their own seldom consider thehistological distribution of biomarkers This leads to blind spotsin our ability to interpret and understand molecular analysessuch as the potential implications of tumour heterogeneity andwhether it might correlate with any specific genetic or epigen-etic changes Computer-aided image analysis applied to cancerdiagnostics and investigative biomarker research has been inuse for several decades [35] However the combination of digitalpathology with whole slide imaging has truly catapulted thefield into the era of big data Entire stained tissue and tissuemicroarray (TMA) sections each potentially comprising hun-dreds of samples across hundreds of patients can now be digi-tized archived and disseminated without the need to store anddistribute the original glass slides [36ndash38] Digital image acquisi-tion is now commonplace with multiple established companiessuch as Aperio (Leica Biosystems Nussloch Germany)Hamamatsu Photonics (Hamamatsu City Japan) Carl Zeiss(Oberkochen Germany) and 3D Histech (Budapest Hungary)offering scanning hardware capable of producing high-resolution digital slide images Aside from greater ease of col-laboration whole slide imaging unlocks the analytical potentialof digital image analysis Algorithms can be trained to identifytumour regions delineate margins classify cellular stainingand quantify the results for a wide range of histological pheno-types and cellular markers [39] Many software packagesincluding commercial (eg PathXL TissueMark Belfast UKDefiniens TissueStudio Munich Germany Aperio Genie andVisiopharm Visiomorph Hoersholm Denmark) in-house(QuPath Queenrsquos University Belfast) and community-driven(eg ImageJ [40] Fiji [41] Icy [42] and Ilastik [43]) offerings nowoffer access to such algorithms Computerized image interpret-ation reduces the subjectivity of manual interpretation [44] andonce properly trained can batch process a large number of slideimages at once Applied to TMAs high-throughput tissue inter-pretation can greatly accelerate tissue biomarker discovery

2 | Li et al

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

pipelines especially when coupled with the wealth of moleculardata currently becoming increasingly accessible

In light of this true integromics today ought to embraceboth molecular and digital pathology to uncover and validate amore diverse and robust set of biomarkers that will be key to de-livering on the promises of precision medicine [45] Tissue path-ology and histology are essential to the application of morerecent molecular methods in solid tumours required for deter-mining the presence and proportion of tumour cells to ensuresample sufficiency for downstream analysis of nucleic acidsand for directing microdissection of tissue samples for molecu-lar analysis [46] Moreover tissue context and cellular pheno-type are central to understanding tumour heterogeneity andthe underlying genomic map of solid tumours TheImmunoscore biomarker [47] for example applies image ana-lysis to contextualize the interpretation of immunohistochemi-cal staining The understanding of both how genotype drivesphenotype and how phenotypic information influences inter-pretation of genotype are required for a holistic view of cancerand are therefore essential elements of a modern integromicsapproach

Building large-scale digital resources PICanoverview

Reflecting this broader view of integromics we created PICan tobe a comprehensive framework for the collection integrationand analysis of clinical pathological molecular and digitalimage analysis data sets for the purposes of cancer biomarkerdiscovery and validation [18] PICan sits at the heart of our bio-marker discovery program acting as both a central hub andcatalyst (Figure 1) PICan streamlines the process of bringing to-gether diverse streams of multimodal data from multiple cancertypes to generate clinically and biologically relevant analysesand reports via a bespoke access-controlled web-based inter-face written in ASPNETC with a connection to the R statis-tical environment [48] through statconn [49] At its core is arestricted-access MySQL database to house this collated patientdata Currently there are 3660 patients representing eight can-cer types recorded in the system together with fully curateddata on pathological diagnosis and clinical outcome Fromthese nearly 12 000 TMA cores have been stained immunohis-tochemically and scored for a broad range of well-establishedand novel cellular biomarkers that are of interest to our basicscience translational and clinical research collaborators Withthe ongoing rapid pace of research activity at our institutionthese numbers are set to double over the course of the nextyear Data de-identification using study-specific unique identi-fiers protects patient privacy while authentication and author-ization of users enables a balance to be struck between enablingaccess to users and safeguarding the rights of data collectors(clinicians wet lab scientists and other collaborators) to publi-cation priority and control of their own data Only administra-tors have direct access to the PICan database while end useraccess is provided exclusively through the access-controlledanalysis and report generation web interface Data quality is akey priority to ensure the integrity of downstream analyses asany analysis is only as good as the weakest data supporting itso data are curated with biological and clinical input before up-load Meanwhile bioinformaticians tailor their pipelines ac-cording to the needs of the research or analysis Consequentlyin addition to the central database a complete integromicspipeline must necessarily encompass data collection data

curation and biologically driven analysis and interpretation ofthis data activities which we consider integral parts of any inte-gromics framework

Database design data pre-processing curationand structure

Traditional single-mode data analysis pipelines take raw data(eg sequencing reads microarray spot intensities whole slideimage files) and process it through a series of steps into a finalform that informs a particular biological narrative (eg list ofmutations and frequencies gene expression values H-scoresfor immunohistochemistry (IHC) scoring [50 51]) However rawdata such as raw fluorescent spot intensities from an expres-sion microarray experiment might not be suitable for multi-modal and dynamic analysis without corrections forbackground and batch effects [52 53] while data aggregationmight leave fully processed data such as a list of differentiallyexpressed genes too refined to be of further use in patient-levelresearch analysis A compromise is to capture data in a statethat is suitable and informative for subsequent analysis whiletracking metadata associated with acquisition and pre-processing Metadata or lsquodata about datarsquo captures informationabout the methods and conditions under which the data itselfwere obtained Such information can be used for example toidentify cases where an observed effect was the result of differ-ences in the laboratory or researcher conducting the experi-ments rather than the biological condition Going back toexpression microarray data as an example this could beachieved by recording background-corrected and normalizedprobe signal intensities with metadata detailing the nature andparameters for data pre-processing steps such as normaliza-

tion assay conditions instrumentation used including its set-tings and pre-analytical steps eg how the physical sampleswere obtained handled and processed and by whom For com-pleteness and traceability raw data can be stored alongside thisanalysis-ready data Additionally multiple lsquosnapshotsrsquo fromsingle-mode analysis pipelines can be stored as each could po-tentially serve as a starting point for multimodal analysisHowever care must be taken to ensure internal consistency sothat results are repeatable and reproducible whether workingup raw data or the stored analysis-ready data Consequentlyfor PICan we opted for quality and consistency by recordingonly the analysis-ready data most suitable for further integro-mic analysis A number of quality assurance and quality controlmeasures can be taken to ensure data integrity with the appro-priate level of adoption into standard operating proceduresbeing dependent on the overall level of accreditation of the la-boratory in question Metadata is stored alongside the analysis-ready data but raw data and most downstream single-modeanalysis results are presently excluded from PICan Future plansto increase data completeness and traceability include adding acentral independent repository for raw data files These repre-sent files not already in existence elsewhere (eg public reposi-tories) These raw data files will be read-only and bear uniqueidentifiers that link them to their analysis-ready counterpartson the main database A file path stored in the main databasewill allow the raw data files to be easily retrieved when neededAs these files will not be used for on-demand statistical ana-lysis they can be stored in their native formats

Integromics in cancer tissue biomarker research | 3

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

Relational database systems

Integromics software infrastructure can be built around one ofmany database management systems available each with itsown strengths and weaknesses As we envisioned PICan to in-corporate a semi-automated analysis engine [18] we prioritizeddata quality which maximizes the validity and reproducibilityof our results and preservation of data structure which stream-lines the automated retrieval and processing of big data sets ra-ther than attempting to blindly maximize data throughputReflecting these design priorities PICan was built around a rela-tional database (MySQL) consisting of interrelated data tablesthat reflect the process of constructing and harnessing TMAswhich are central to much of the biomarker discovery work con-ducted at our institution the Northern Ireland MolecularPathology Laboratory and through the Northern Ireland Biobank(NIB) (Figure 2) Every row of every table has a system-generatedprimary key and these are used as foreign keys to form relation-ships between tables The database centres on the Patient andBlock tables (table names will be italicized) reflecting our core vi-sion for PICan to drive holistic multimodal integromic analysisat a patient level One-to-many relationships link the Patienttable to tables containing the clinicopathological and treatmentdata capturing cancer-type-specific fields from templates de-veloped in collaboration with our clinical expert partners Thesetemplates also define permissible data inputs for each field (egage is a number sex is lsquoMrsquo or lsquoFrsquo) which are then encoded intothe database schema A patient may have had multiple tissuespecimens collected as tissue blocks each represented by a

separate linked entry in the Block table Linked via the Block tableare tables representing molecular assays (SangerSequenceDNAMicroarrayChip ExpressionLevel) whole face tissue sections(TissueSlide) TMA construction (Cylinder RecipientBlock TMASlideTMACore) IHC scoring (TissueScore BiomarkerScore) and digitalpathology (TissueVirtualSlide TMAVirtualSlide VirtualCoreDIAScore) all of which reflect primary activities within our insti-tutionrsquos biomarker discovery program Metadata is stored in thetable corresponding to the relevant physical entity and soimaging parameters for a whole face tissue slide for examplewould go in the TissueVirtualSlide table While this databasestructure is well suited to our needs laboratories collectingnon-tissue specimens such as blood sputum or urine may findit inappropriate for their needs A Blood table might sit at theheart of a haematological laboratoryrsquos database for examplewhile other institutions may need a cluster of tables to repre-sent different specimen types each linked to tables for variousbiochemical assays

PICan uses a modular internal structure to support variousdata types Each table represents a physical entity and conse-quently a data type This allows the addition of new tables torepresent data types from emerging technologies as they ma-ture As maximizing utility of new tables will require updates tothe end-user web interface and setting the appropriate foreignkeys will require knowledge of the internal structure of thePICan database only the system administrator can add newtables

Another consideration for relational database design is nor-malization This procedure simplifies the data structure by

Figure 1 PICan sits at the heart of an integrated multidisciplinary biomarker discovery and validation pipeline Multiple curated data sources feed into PICan which

accelerates downstream analysis and report generation

4 | Li et al

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

removing duplication of data ensuring that the table is resilientagainst inconsistencies arising from incomplete addition dele-tion or modification of data As described in [54] this isachieved by identifying groups of related or dependent datafields and splitting them off into a separate table using foreignkeys to record relationships between tables For most practicalapplications the third normal form balances a robust architec-ture without overly cluttering the schema with an untenablenumber of tables Normalization approaches can also be used tooptimize database speed and performance

A consequence of our decision to use a relational database isthat free-form data needs to be harmonized with the data struc-ture presented by PICan before data entry The requisite cur-ation steps prioritize data integrity and prevent poor-qualitydata that might impede proper and meaningful analysis fromentering the database (garbage-in-garbage-out) Clinical andpathological data collection and storage are guided by tem-plates collections of well-defined data fields that have been de-veloped through our close working relationships with the NIBNorthern Ireland Cancer Registry Northern Ireland Health andSocial Care Trusts and clinicians These collaborators share thedesire to drive data integration and provide expertise contextand data access to harmonize data templates around commonvocabulary and permissible data values ensuring maximal clin-ical relevance of PICan In addition to the harmonized common

fields like age and sex PICan also supports cancer-type-specificfields for maximum flexibility

Non-relational database structures

While a structured relational database has served PICan wellthere are alternative models that might better suit other re-search integromics environments MongoDB for example is anon-relational (or lsquoNoSQLrsquo) database architecture that uses adocument model where each entity (a tissue block for ex-ample) is represented by a lsquodocumentrsquo that might contain hun-dreds of attributes as key-value pairs This model works with aflexible schema allowing data fields to be added or omitted ona case-by-case basis while harmonization of terminology anddata fields occurs at data retrieval In a research environmentespecially at the beginning of a study when pipelines and meta-data are likely to change it is easier to update the data modelthan to redesign the fixed schema of a relational databaseHowever with less urgency to curate data before upload dataintegrity may be variable with errors remaining unnoticed andconfounding analyses performed on the data Fuzzy matchingalgorithms [55ndash57] can mitigate the impact of typographical andsome clerical errors but these approaches must still be rigor-ously validated In some cases it might be impossible even for ahuman expert to deduce the original intention Data must then

Figure 2 Schematic of the internal table structure of PICanrsquos relational database reflecting the core activities and workflows that comprise the biomarker discovery

and validation program at our institution Most tables represent physical or digital entities within these workflows Where table names may be unclear a brief descrip-

tion has been included along with some example data fields (in parentheses)

Integromics in cancer tissue biomarker research | 5

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

be discarded from analysis potentially reducing the statisticalpower Manual curation at the outset might identify such caseswhen there is still an opportunity to seek clarification from theoriginal data collector Ideally ambiguous annotations and bor-derline cases should be resolved in consultation with a suitablyqualified medical professional before they enter the databaserather than leaving them open to interpretation by downstreaminformaticians Likewise missing or contradictory data pointsare preferably identified and where possible corrected beforeentry into the database However a major advantage of an un-structured database model is the ability to more faithfully cap-ture the complexities of the patientrsquos clinical pathological andother data parameters including free-form text fields such asclinical notes The ability to capture patient data in its nativeform also improves the traceability and scalability of data entryinto a non-structured database solution by eliminating the la-bour and documentation associated with data curation andenabling automated bulk data upload Despite the tools wehave developed to semi-automate data upload this will remaina relative drawback of manual curation in the short term espe-cially for clinical and pathological data fields In the longerterm scalability may be improved with increased adoption ofelectronic medical records wherein many of these data fieldscould be completed by the attending physician when encoun-tering the patient

Balancing the data curation afforded by the structured rela-tional database with the high-throughput flexibility of unstruc-tured models are various hybrid options For example PICanoffers flexibility to the structured database model by includingfreeform text fields to capture unstructured notes (in our experi-ence these are typically for clinical notes) in several of the datatables As these fields are not referenced directly by the auto-mated online analysis component which requires well-definedparameters they are hidden from analysis but can be retrievedby users wishing to conduct further data refinement or offlineanalysis In practice however it is more likely that any fieldsnot used in analysis by the system will simply be ignored andneglected Another hybrid solution we are actively exploring toincrease data throughput is to allow the importation of non-curated data sets for end-user analysis In this model curatedlsquopublicrsquo data sets would be available to all users (with necessaryapprovals and access restrictions in place) while each end userwill be able to upload lsquoprivatersquo data sets to be analysed along-side or integrated with the curated data The user then is re-sponsible for ensuring the quality of hisher own dataExtending the hybrid model concept further are heterogeneousdatabases in which only the most important unchanging andsensitive data are typically stored in a relational databaseMeanwhile the bulk of the data in terms of size will be storedin non-relational databases or as raw data with links stored inthe main database to ensure ease of access and retrieval

Image analysis data storage and management

Current work is focussing on improving the support of digitalimage analysis data an emerging area of research in which ourgroup has significant expertise Imaging data presents someunique challenges for database storage design The imagesthemselves can be large One of our typical tissue slidesscanned at 40objective magnification (025 lm per pixel) canhave 2ndash10 billion pixels Imaging in RGB at 8 bits per channelgives uncompressed file sizes of up to 30 GB Additional colourchannels (such as for fluorescence imaging) z-stacks and dataredundancy in the form of image pyramids for improved image

loading performance would all greatly increase file sizesCompression typically JPEG or JPEG2000 for whole slide imagescan significantly reduce file size without drastically impairingvisual assessment although suitability for image analysis ismore variable [39] Hence it is vital to store the original slideimage to allow a pathologist to go back and verify that imageanalysis results are consistent with hisher expectations This iscurrently achieved by hosting the slide images on remote ser-vers maintained by PathXL A link is stored in the PICan data-base that allows the web-based interface to retrieve the imagedata on user request via an application programming interface(API) supplied by PathXL Pan zoom and overlay functionalitiesin the PICan web interface are powered by OpenSeadragon [58]As the samples are drawn from NIB collections NIB maintainscontrol of access and security using the PathXL software

In addition to the raw image pixels image analysis datamight include markups annotations and calculated metrics(such as integrated optical densities nuclear areas or immuno-histochemical H-scores [50 51]) Metrics can be at a cellularlevel or summarized across a region of interest (ROI) a slide oreven a set of slides Much of these data may be hierarchicallystructured with each slide potentially containing multiple ROIseach ROI containing many cells and each cell being associatedwith values for numerous metrics However the efficiency ofstoring image analysis data in a structured database is uncer-tain when a single image can contain thousands of annotationsFor now PICan stores only summary scores and metrics (eg H-score [50 51] Allred Score [59] percentage positive cells) whileusers wishing to interrogate the image data on a more granularlevel will need to rely on external specialist tools The challengeof storing and transferring image analysis data between sys-tems is an ongoing one with several potential approaches hav-ing been proposed including DICOM [60] OME-TIFF [61] variousHDF5-based formats [62 63] and proprietary vendor formatsbut without a clear front-runner DICOM and OME-TIFF are pri-marily image formats with flexible metadata support optimizedfor image viewing rather than storage of image analysis resultswith large numbers of image objects In contrast HDF5 capturesobject hierarchy relationships well but on its own it is not a na-tive image format Meanwhile proprietary formats are poten-tially encumbered by patenting and intellectual propertyconcerns

Integromics data management is patient-centric

Regardless of the database management system being used acritical characteristic of integromics analysis is patient-centricity The ultimate objective in bringing together data frommultiple experimental platforms is to build a more holistic pic-ture of the interplay between genotype and phenotype in thepatients under study As interest in amalgamating multipledata types has grown many platforms originally designed tohost one data type have been expanded to accommodate add-itional types of data Many biobank management software sys-tems allow addition of clinicopathological notes andexperimental assay results on top of sample tracking dataMeanwhile digital slide hosting platforms like OMERO [64]PathXL Xplore Cytomine [65] and DigitalScope (Aptia SystemsHouston TX used by the College of American Pathologists) [66]are beginning to support the addition of associated clinicopa-thological data as metadata Others like TCGA [25 26] origi-nated as molecular data repositories but have begun addingwhole slide imaging There is a clear trend towards convergenceof database models encompassing more and more data types

6 | Li et al

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

but their impact will be limited if these extra data are simplystored and forgotten Integromics is an active process not just ofbringing data together but more importantly of creative com-parisons and contrasts to expose underlying interactions under-pinning the cancer phenotype Many public repositories and bigdatabases such as TCGA draw from an assortment of patientcohorts so associations between different data types measuredon different cohorts are limited to cohort summaries poten-tially obscuring trends between data types that are only appar-ent at the individual patient level

Interacting with external resourcesmdashdataexchange between systems

With the current rapid expansion of big data analytics in bio-medicine integromics frameworks must remain flexibleenough to incorporate emerging and future technologiesStructured architecture and design of the overall integromicsdatabase framework can facilitate this However large data setslike next-generation sequencing (and upcoming third-gener-ation sequencing [67ndash70]) can be unwieldy in a MySQL databasewith typical whole exome or RNA sequencing runs comprising10ndash100 billion base pairs for a single sample so future integro-mics systems may house these types of data in databases thatare structured specifically for this purpose Instead of a singledatabase housing multiple data types a future integromicsdatabase will undoubtedly be structured as a fully distributedfederated database [71ndash74] with a central database storing iden-tifiers that interface with a cluster of highly specialized data-bases each housing a specific type of data in its native formatArguably PICan already uses a version of this model by out-sourcing digital slide image storage to remote servers Futuredevelopments may include enabling PICan to tap into publiclyavailable data sources and allowing users to analyse these pub-lic data sets in tandem with PICan data such as accessing GeneExpression Omnibus data through GEO2R [75]

One of the challenges when working with a federated data-base model is how to exchange data faithfully and efficientlybetween the various components Each constituent databasewill need an interface to interact seamlessly with the centraldatabase and control system Some of these will be availableoff-the-shelf or rely on published APIs while others such asthose interacting with in-house software may need significantdeveloper time and effort to implement Ultimately there areno shortcuts to establishing these interfaces and so institutionsshould aim to reduce the number of distinct interfaces bystandardizing data around a small set of formats For examplewhen working with whole slide images a single institutioncould strive to use only a single file format The various avail-able formats each have their own strengths and weaknessesbut the natural option will likely be the native format of thescanner used at the institution or a vendor-neutral format likeDICOM [60] or OME-TIFF [61] Fortunately many tools for work-ing with whole slide images support multiple file formats suchas DigitalScope [66] for remote web-based viewing of variousimaging formats Meanwhile projects like Bio-Formats (OpenMicroscopy Environment consortium) [61] and OpenSlide [76]further facilitate the viewing and exploitation of various imag-ing formats in other programming and analysis languages likeC Cthornthorn Python and Java Additionally it is important to stand-ardize the data transfer protocol whether this is throughInternet standards like hypertext transfer protocol physicalmeans like hard disks or some other agreed protocol Perhaps

establishing an institutional integromics strategy will encour-age adoption of standardized data formats by creating an ex-pectation among data generators that data should be portableand compatible with other data types for analysis

Data flow within an integromics environment can often bemultidirectional Even with only a single central data repositorythere will be instances when it is beneficial to export some ofthese data to a separate third-party platform whether for speci-alized analysis report generation or some other purpose AsPICanrsquos web-based user interface is designed to be user friendlyit supports only a limited number of types of analyses and re-port generation options Analyses not hard-coded into the inter-face for example would need to be performed lsquoofflinersquo In factthis is how new analyses are tested before being incorporatedinto PICan Similar to data import and storage data export froman integromics system is complicated by the wide range of datatypes that demand different formats to accommodate their in-herently different natures Efforts have been made in somefields to standardize and integrate data models such as thoseof the Global Alliance for Genomics and Health for genomicdata [20] Hence where widely recognized data exchange for-mats exist such as FASTAFASTQ files for nucleotide or aminoacid sequences [77] or the TMA data exchange standard [78]they should be prioritized for integromics support For otherdata types like digital image analysis however there remains aneed for a well-defined and widely accepted standard for dataexchange Whether it is possible or feasible to define a singleunifying integromic data exchange format that would encapsu-late multiple data types attached to individuals within a cohortof patient subjects remains an open question

Standardization of data vocabulary

Any data exchange standard needs to address both how dataare encoded (structure) as well as a common vocabulary forcommonly encountered fields CSV JSON XML and derivativesof these are popular formats that lend themselves well to struc-tured data Certainly CSV export is our default as although itdoes not fully represent the relational structure underpinningPICanrsquos database it mimics the basic tabular structure and canbe easily viewed by other researchers in common spreadsheetor statistical packages Non-tabular data like digital slide images(which are stored as links in the database) are exported as sep-arate files The greater challenge might be agreeing on a com-mon vocabulary Data field names need to be mapped fromPICanrsquos internal vocabulary to that of the target external plat-form This is not a trivial task as even professionals within thesame institution may use different names for the same metricSome clinicians may be more precise in how certain metrics aremeasured and recorded such as when some distinguish be-tween lymphatic and vascular invasion while others groupthem together as lymphovascular invasion Harmonizing be-tween cancer types can also be challenging hence the use ofcancer-type-specific tables in PICan for certain clinicopathologi-cal metrics Additionally even where a data exchange specifica-tion exists like the one for TMA reporting [78] there may beallowance for custom fields posing an extra challenge for align-ing under a common vocabulary Attempts have been made tostandardize terminology in some disciplines The US NationalLibrary of Medicine maintains a collection of controlled vocabu-laries known as the Unified Medical Language System [79]which incorporates SNOMED CT MeSH and OMIM amongmany other ontologies Another compendium is caCOREincluding the Enterprise Vocabulary Services and Cancer Data

Integromics in cancer tissue biomarker research | 7

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

Standards Repository of common data elements [80 81]Additionally some groups have devised cancer-type-specificlists such as for breast [82] mesothelioma [83] and prostate[84] In the absence of clear community-supported data stand-ards institutions are advised to agree on and use a minimal setof data export formats

Unique challenges of multimodal integrativeanalysis

Collecting and storing diverse data types is only half the battlein integromics unlocking the value of these data will comefrom analysis and interpretation Through a connection withthe R statistical environment via statconn [49] PICan is capableof on-demand statistical analysis graphical output and reportgeneration Besides R and statconn other similar frameworksand interfaces include Shiny [85] RNET [86] Python (eg mat-plotlib [87]) and D3js [88] While PICan is strictly a research toolan integromic approach will undoubtedly be essential to futureclinical decision support systems placing integromics at thefront line of patient care delivery

Multimodal analysis is particularly important when con-sidering the future of digital pathology and tissue imagingConventional light and fluorescence imaging provides the basisfor most studies in this field However the future may providenew imaging technologies using light outside of the normal vis-ible spectrum that may be more informative towards illuminat-ing the underlying cancer biology and delivering sensitive andspecific biomarkers of clinical outcome Numerous studies areemerging that exploit technologies such as ultraviolet [89] in-frared [90] and Raman spectroscopy [91 92] together with abroad range of associated multispectral technologies to gain abetter insight into tissue and cell characterization and to visual-ize compounds and chemistry not accessible to conventionalmicroscopy Integrating these imaging modalities into a singleframework to compare compound and deduce clinical utilitynot in isolation but together with conventional imaging meth-ods is essential Similarly there is capacity to integrate otherforms of medical imaging relevant to cancer discovery includ-ing positron emission tomography magnetic resonance imag-ing computed tomography and X-ray This can be used tounderstand the relationship between in vivo clinical imagingand ex vivo tissue imaging thereby translating markers that arecurrently only visible on microscopy to early in vivo detectionwith clinical imaging modalities In both settings the real valueis likely to come from the overlay and integration of imagingmodalities extending the capacity of tissue imaging on its ownand providing new visualization tools for both pathologists andradiologists that are more informative and consequently im-prove clinical decision making The integromics initiative dis-cussed in this article provides such a model for imaging as wellas molecular and clinical data blurring the edges of medicalspecialties and providing a unified framework for discovery

The basic tools of integromic data analysis are generallysimilar to other disciplines of bioinformatics Mean compari-sons clustering survival analyses machine learning and linearmodels for example can all be used as can the various pipe-lines that have been developed for next-generation sequencinggene expression microarrays and other molecular big dataomics techniques Integration of digital image analysis data inintegromic analysis can present its own distinct risks and re-wards Images and their associated markups and annotationsdo not lend themselves on their own to t-tests or clustering for

example Nevertheless these may be useful to contextualizeother pieces of data like IHC scores Calculated scores (egH-score) and metrics (eg integrated optical density nucleararea) can often be expressed as nameattribute pairs so theylend themselves more immediately to integrative analysis al-though data structure and hierarchy can still complicate ana-lysis Some metrics can be measured at a cellular level andsummarized at an ROI or whole slide level Others howevermight not be meaningful at a cellular level (eg H-score) so onemust be mindful of the meaning of each measure when incor-porating it into a multivariate model of the data Tumour het-erogeneity is also an important complication Summary scorescomputed across different regions of the same tumour may dif-fer markedly Heterogeneity at least in solid tumours can bemore easily mapped in tissue sections where tissue structureand cellular context is maintained and intact Spatial changesin tumour cell arrangement can be visualized microscopicallyand used to sample regions for molecular analysis and themeasurement of intratumoural heterogeneity Whole slideimaging and digital image analysis represent powerful tools toquantitatively map tissue phenotype and provide a sound basisfor biomarker assay development while opening up new av-enues of inquiry For example one can begin to explore whethercertain morphometric phenotypes correlate with specific gen-etic or epigenetic profiles and whether these have any signifi-cant prognostic or therapeutic impact

A specific challenge for integromics is the application ofstatistical techniques to data from disparate sources Care mustbe taken to ensure differences observed between experimentalgroups reflect the underlying biology and not simply batch ef-fects Differences in experimental assay platforms such as dif-ferent probes targeting the same gene or different antibodies forIHC may produce discrepancies in the data based on differentlevels of sensitivity and specificity In digital pathology differ-ences in imaging hardware analysis algorithms and parameterdefinitions may lead to various imaging artefacts and interpret-ation difficulties Even within the same software cell countsdetermined from different threshold settings are not compar-able Less visible confounders include study and patient vari-ables like recruitment methodology study admission criteriapatient stratificationdiagnostic criteria pre-collection treat-ments and standard of care (for prognostic biomarker studies)that can affect the makeup of the study patient populationMeanwhile specimen handling variables like cold ischemictime fixation method staining protocols and formulations canhave a profound effect on the data often dominating other con-founders regardless of the subsequent pipeline Ideally thendata harmonization begins well before data are generated withstandardization and thorough documentation of all pre-analytical steps including patient selection tissue collectionand specimen handling Where potential batch effects or con-founding variables have been identified a number of strategiescan be used such as subset analysis (ie stratifying the subjectpopulation on the confounding variable) or regression model-ling [93 94]

Mandated public release of large data sets has been helpful inadvancing transparency in science but tracking the associatedmetadata and assessing its impact on results derived from thesebig data sets remains a major unresolved challenge Even largerepositories like TCGA are composed of smaller collections froma wide range of geographies and institutional environments Inmany cases pooling such diverse data sets becomes an apples-to-oranges comparison Batch effects in TCGA and other largehigh-throughput data sets are well-documented [52 53] Without

8 | Li et al

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

control over data origins and consistency it is essential that theentire process be rigorously validated before interpretation of re-sults What local data sets lack in scale they make up for in trace-ability For PICan we require that our collaborators are able tostand over every data point that enters our system therebyensuring a higher standard of quality and confidence in our dataMetadata recording technical and experimental details are storedin PICan alongside the data Regardless of how it is done meta-data tracking should be rigorous and thorough as differences inexperimental conditions and design must factor into the inter-pretation of any analysis Where appropriate data should be cor-rected and normalized to ensure comparisons are meaningfullike-versus-like comparisons and observed contrasts are notmerely batch or scalar effects

Interdisciplinary teamwork

Apart from robust infrastructure and software support the heartof any successful integromics framework is a cohesive and inter-disciplinary team to support and champion the entire effortDespite the technical complexities of a system like PICan team-work is arguably the most important and yet most often over-looked piece of the integromics puzzle This is fundamentally acollaborative endeavour bringing with it all the challenges ofworking with an interdisciplinary team of individuals and organ-izations each of whom have their own diverse set of goals andpriorities Clinical team members for example may have respon-sibilities to their patients that would not apply to academic re-searchers Nevertheless the specialist insights afforded by aninterdisciplinary team have been indispensable to the success ofPICan Clinicopathological data tables are built on templates cre-ated by clinical experts specializing in each cancer type Clinicaland translational researchers inform the choice of analysis toolsto be added which are then developed and tested by transla-tional bioinformaticians Establishing and incorporating an inte-gromics framework into the mindset of an institution will requireadditional effort and workload on certain individuals Hence it isimperative that team discussions include delegation and sharedresponsibility over matters such as data collection and curationdata formats and standards system maintenance data exchangeand access control Differing priorities and opinions betweenteam members are to some extent unavoidable but teams canadopt measures to manage conflicts such as having all protocolsclearly documented or maintaining a regular meeting scheduleRegardless of how these challenges are met we stress the im-portance of maintaining open lines of communication and agree-ing to clear roles and responsibilities of all team membersEffective teamwork is a non-trivial yet integral contributor to thesuccess of any integromics framework

With an ever-increasing emphasis on interdisciplinaryresearch projects like PICan underscore the need for cross-disciplinary education to instil young researchers with thenecessary mindset understanding and skills to navigate thechallenges and opportunities in the research environment ofthe future Queenrsquos University Belfast for example offers aMasters-level course in Bioinformatics and ComputationalGenomics [95] The course attracts students from a diverserange of backgrounds from medical doctors to engineers andcomputer scientists to biologists from a range of life sciencesubject areas and aims to equip them with a solid foundationacross a broad range of topics such as advanced bioinformaticsdatabase design tissue imaging and integromics

Interdisciplinary teamwork is vital to another pillar of anyintegromics framework access to quality data Any analysis is

only as robust as the data that underpins it The role of the NIBis crucial to PICanrsquos success not only through access to its bio-specimens but also its expertise and infrastructure support forsample preparation processing accessioning tracking andquality control The active participation of an interdisciplinaryteam also distributes the workload of data curation One of thegreat struggles of data collection is maximizing quality andquantity with a limited resource of labour With an increasingfocus among funders and institutions on large-scale multi-centre studies scaling up an integromics framework while pre-serving data quality will rely heavily on greater delegation ofdata curation to the data collectors themselves The onus willbe on the individual data collectors to flag data discrepanciesso that the database administrator serves as the last line of de-fence rather than the first

Ethics and data governance

Like any other system handling clinical and pathological infor-mation an integromics framework also requires appropriateapproval for the use of patient data Additionally data controlsin integromics need to protect both patients and data collectorsde-identification protects patient privacy while access controlsand terms of use policies protect the rights of data collectors

All studies undertaken in PICan need to be approved follow-ing an application to NIB where project-specific ethics and gov-ernance are granted under the Office for Research EthicsCommittees Northern Ireland approved NIB guidelines As a re-search tool PICan stores no personally identifiable informationin its database ensuring patient anonymization and minimiz-ing the potential impact of any data security breach Each pa-tient sample and associated analytical data are labelled using aunique independent study-specific identifier within the systemClinically qualified and authorized patient data collectors retainthe key that links these identifiers to hospital systems wherepersonal data are securely stored Researchers using PICan arerestricted to using anonymized research data onlymdashrespectingdata protection laws and regulations However as importantfollow-up data on patients accrues over time these can still beaccessed by having appropriately authorized clinical staff anddata collectors use the key to identify patients on hospital sys-tems and retrieve more up-to-date patient informationAuthorized clinical personnel can then review and retrieve per-tinent research-relevant information from the hospital systemsand update PICan again with only anonymized research datausing the PICan identifiers This is equivalent to an honest bro-ker system used elsewhere [96 97]

Data curation of de-identified information in PICan producesa robust privacy-controlled snapshot of analytical data but theunderlying data sources are themselves continually beingupdated as patients are followed up and new specimens are col-lected While data update is possible it is necessarily a manualprocess as PICan has been walled off from the original datasources by design allowing it to operate with fewer privacy con-trols than a typical clinical system as no identifiable patientdata are stored To maintain quality and consistency with exist-ing data data updates to PICan are done only periodically withentire cohorts of patients updated at once Alternatively insti-tutions that prioritize constantly updated data will requiredeeper integration between their research and clinical data sys-tems In such cases additional data protection measures will berequired such as encryption or safe havens for data storage andtransfer Essentially such an integrated system would need tomeet local rules and regulations for clinical data systems

Integromics in cancer tissue biomarker research | 9

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

Scaling up for large-scale epidemiologicalstudies

The last few decades have seen tremendous investment into allaspects of cancer biomarker discovery research but as of 2013only 18 protein cancer biomarkers had received US Food andDrug Administration (FDA) approval [98] despite decades of re-search and thousands of publications on novel candidate bio-markers Part of the reason may be a focus on early-stagediscovery activities and the relative lack of large-scale multi-centre validation and clinical trials Studies of such scale pre-sent a number of data management challenges not unlikethose we designed PICan to address First and foremost integro-mics is inherently a team activity and the same principles ofteamwork apply whether it is conducted within a single centreor across multiple research centres on different continents Inthis way integromics research extends readily to large-scalemulticentre epidemiological studies Data harmonization of fileformats and common vocabulary remain vitally important andbest addressed through open communication between collabor-ators As in the case of collaborators within an institution teammembers from different institutions should ideally agree on asingle data format for each data type and a common agreed setof vocabulary to describe data fields collected depending on theneeds of the study and its participants Analysis of multicentredata also requires standardization of pre-analytical parameterssuch as inclusion criteria sample acquisition sample process-ing assay conditions and measurements recorded When estab-lishing project standards and protocols teams should considerwhat community standards are already available to maximizeinteroperability and avoid redundancies Across Europe for ex-ample projects like ELIXIR [99] and Euro-BioImaging [100] are al-ready seeking to establish hubs for biomedical and bioimagingresource sharing Globally similar efforts in genomics are alsounderway [19ndash21]

The process of building digital infrastructure to support amulticentre integromics framework is similar to the case of asingle institution All team members should meet and agree onthe structure of the framework Depending on the extent of theproject and range of requirements the technical scientific andmedical team will need to grow as well Among the team thereis a need for at least one full-time staff member dedicated totechnical development and management of the digital infra-structure As the project grows the staffing requirement willundoubtedly increase in tandem For this reason modular de-sign becomes even more crucial as it allows development tasksto be carved up and delegated out to different team membersThere will also be additional requirements for hardware andsoftware licences Software costs can be mitigated by using free(eg R) or low-cost volume licenced (eg Microsoft Visual Studio(Redmond WA USA) for ASPNET in an academic setting) op-tions Nevertheless despite minor challenges digital solutionsgenerally scale well both in size and geographic distributionParallels can be drawn between incorporating external data re-sources into an integromics environment as previously dis-cussed and the challenges of connecting information systemsacross multiple sites However multicentre teams also facesome specific challenges Teams may need to comply with vari-ous sets of local laws and regulations surrounding researchethics data security and patient privacy Network access andsecurity may also be further complicated in a multicentre set-ting PICan for example was built for use within a single insti-tution so access was limited to within the institutionrsquos networkfirewall with all external access blocked A system spanning

multiple institutions would need greater attention to securitywhile maintaining access from outside the host institutionTeams may also need to contend with different software andhardware infrastructures at the various centres and implementthe necessary interfaces to ensure compatibility However theultimate key to success of multicentre collaborations is a funda-mental commitment to team science and on that front institu-tions that embrace integromics will have the experience to leadthe coming age of large-scale multicentre studies

Furthermore platforms like PICan and the integration ofimage analysis data with clinicopathological and therapeuticdata will help accelerate the screening of candidate tissue bio-markers for prognostic and predictive capacity While this rep-resents a powerful research and discovery tool this is likely toidentify new tissue biomarkers (based on IHC chromogenic insitu hybridization (CISH) fluorescence in situ hybridization(FISH) RNAscope etc) that have clear value in routine diagnos-tics and the stratification of patients for precision therapeuticsFurther validation of these biomarkers would be necessary incarefully controlled and standardized biomarker trials takinginto consideration sample preparation scanner configurationlab-to-lab variation comparative performance against patholo-gist scoring and clinical outcome data Regulatory approval ofdigital diagnostic tests would require FDA clearance (USA) orCE-IVD marking (Europe) if they were to be marketed as clinicaltests by industry but may also be implemented as laboratory-developed tests with documented internal validation and clin-ical evidence if done within an academic medical centre suchas our own setting Integromics platforms provide the frame-work for discovery and if designed appropriately can provide ro-bust clinical evidence of the utility of new biomarkers forsubsequent commercial licensing regulatory approval and clin-ical translation into routine diagnostictherapeutic practice

Conclusion

Modern biomedical research generates scientific and clinicaldata faster than the ability to fully interpret it while data fromnew and emerging technologies need to be integrated andharmonized with existing analysis workflows Efforts to resolvethe resulting bottleneck in interpretation to deliver on thepromises of precision medicine in this era of fast-track bio-marker discovery and evaluation have led to the birth of a newfield in bioinformatics integromics After introducing our ownframework for streamlining integromic analyses it becameclear that there is widespread interest in implementing integro-mic strategies in other research institutions In light of the pre-sent discussion we feel that an optimal integromics frameworkmust be well-supported by a dynamic and interdisciplinaryteam reflect the local needs and priorities of the institution andbe designed with an eye to the future The team is the gatewayto high-quality data and specialist insights that inform analysisand ultimately catalyse translation to clinical application insupport of precision medicine Meanwhile biomarker researchat our institution is currently centred on TMAs and we haveemphasized data quality and curation over raw throughputThese traits are imprinted in the design and operation of PICanbut other institutions may have other needs that demand a dif-ferent infrastructure Ongoing development of PICan includesincorporating raw data storage and access for completenessand traceability and strengthening support for digital imageanalysis As our institution adopts newer technologies PICanlikewise will continue to evolve facilitated by its modular de-sign to meet the changing needs of a busy cancer research

10 | Li et al

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

environment spanning numerous programmes technologiesand scientific objectives Hence integromics should be interwo-ven into the fabric of the institution and not simply be a down-loadable add-on

Key Points

bull Bold approaches are needed to maximize the potentialof big data in modern biomedical research and tobreak down the silos in which they are currently held

bull Collating storing transferring and integratively analy-sing vast amounts of heterogeneous multimodal datarequires a comprehensive and cohesive integromicsstrategy that streamlines biomarker discovery byexposing fresh insights into the underlying factorscontributing to disease phenotypes

bull We explore the practical considerations behind imple-menting an integromics framework using our experi-ences and design choices with PICan as a case studywhile recognizing that the global diversity of bio-marker discovery research programmes will requiretailoring solutions to each research environment

bull Successful integromics frameworks are supported bydynamic interdisciplinary teams and are inherentlyadaptable to new technologies and the host institu-tionrsquos evolving priorities

bull Integromics must be a central part of the institutionalculture and not just a downloadable add-on

Conflict of interest

Professor Peter Hamilton is the founder of and non-executivedirector with PathXL Ltd Professor Manuel Salto-Tellez isSenior Advisor to PathXL Ltd

Funding

The research leading to these results has received funding fromthe People Programme (Marie Curie Actions) of the EuropeanUnionrsquos Seventh Framework Programme FP72007ndash2013underREA grant agreement (285910 to PWH) Invest NorthernIreland (RDO0712612 to PWH) Cancer Research UK Accelerator(C11512A20256 to PWHMS-T) CRUK (to PB) The NorthernIreland Molecular Pathology Laboratory is supported by CancerResearch UK Experimental Cancer Medicine Centre Networkthe NI Health and Social Care Research and DevelopmentDivision the Sean Crummey Memorial Fund the Tom SimmsMemorial Fund and the Friends of the Cancer Centre (to MS-T) The Northern Ireland Biobank is funded by the Health andSocial Care Research and Development Division of the PublicHealth Agency in Northern Ireland and Cancer Research UKthrough the Belfast CRUK Centre and Northern IrelandExperimental Cancer Medicine Centre additional support wasreceived from the Friends of the Cancer Centre (to JAJ)

References1Canuel V Rance B Avillach P et al Translational research plat-

forms integrating clinical and omics data a review of publiclyavailable solutions Brief Bioinform 201516280ndash90

2 Noor AM Holmberg L Gillett C et al Big Data the challengefor small research groups in the era of cancer genomics Br JCancer 20151131405ndash12

3 PathXL Xplore httpwwwpathxlcompathxl-researchpathxl-xplore (20 November 2015 date last accessed)

4 Biomarker Interpretation Workflow - OncoMark httpwwwoncomarkcomgoproductsbiomarker-interpretation-workflow (23 November 2015 date last accessed)

5 Denaxas SC George J Herrett E et al Data resource profilecardiovascular disease research using linked bespoke studiesand electronic health records (CALIBER) Int J Epidemiol2012411625ndash38

6 Lubbock ALR Katz E Harrison DJ et al TMA Navigatornetwork inference patient stratification and survival ana-lysis with tissue microarray data Nucleic Acids Res 201341W562ndash8

7 Giardine B Riemer C Hardison RC et al Galaxy a platform forinteractive large-scale genome analysis Genome Res 2005151451ndash5

8 Goecks J Nekrutenko A Taylor J Galaxy a comprehensiveapproach for supporting accessible reproducible and trans-parent computational research in the life sciences GenomeBiol 201011R86

9 Madhavan S Gusev Y Harris M et al G-DOC a systems medi-cine platform for personalized oncology Neoplasia 201113771ndash83

10Dander A Baldauf M Sperk M et al Personalized OncologySuite integrating next-generation sequencing data andwhole-slide bioimages BMC Bioinformatics 201415306

11Kartashov AV Barski A BioWardrobe an integrated platformfor analysis of epigenomics and transcriptomics data GenomeBiol 201516158

12Harris PA Taylor R Thielke R et al Research electronic datacapture (REDCap)ndasha metadata-driven methodology andworkflow process for providing translational research in-formatics support J Biomed Inform 200942377ndash81

13Szalma S Koka V Khasanova T et al Effective knowledgemanagement in translational medicine J Transl Med201081ndash9

14Manley S Mucci NR De Marzo AM et al Relational databasestructure to manage high-density tissue microarray data andimages for pathology studies focusing on clinical outcomethe prostate specialized program of research excellencemodel Am J Pathol 2001159837ndash43

15Weinstein JN Integromic analysis of the NCI-60 cancer celllines Breast Dis 20041911ndash22

16Weinstein JN Pommier Y Transcriptomic analysis of theNCI-60 cancer cell lines C R Biol 2003326909ndash20

17Yu P Artz D Warner J Electronic health records (EHRs) sup-porting ASCOrsquos vision of cancer care Am Soc Clin Oncol EducBook 2014225ndash31

18McArt DG Blayney JK Boyle DP et al PICan an integromicsframework for dynamic cancer biomarker discovery MolOncol 201591234ndash40

19Hudson TJ Anderson W The International Cancer GenomeConsortium et al International network of cancer genomeprojects Nature 2010464993ndash8

20Home j Global Alliance for Genomics and Health httpsgenomicsandhealthorg (15 March 2016 date last accessed)

21The 1000 Genomes Project Consortium A global reference forhuman genetic variation Nature 201552668ndash74

22Collins FS Varmus H A New Initiative on Precision MedicineN Engl J Med 2015372793ndash5

23Kolesnikov N Hastings E Keays M et al ArrayExpress up-datendashsimplifying data submissions Nucleic Acids Res 201543D1113ndash6

Integromics in cancer tissue biomarker research | 11

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

24Leinonen R Akhtar R Birney E et al The European NucleotideArchive Nucleic Acids Res 201139D28ndash31

25The Cancer Genome Atlas Research NetworkComprehensive genomic characterization defines humanglioblastoma genes and core pathways Nature 20084551061ndash8

26Home - The Cancer Genome Atlas - Cancer Genome -TCGAhttpcancergenomenihgov (1 December 2015 datelast accessed)

27Kristensen VN Lingjaeligrde OC Russnes HG et al Principlesand methods of integrative genomic analyses in cancer NatRev Cancer 201414299ndash313

28Guinney J Dienstmann R Wang X et al The consensusmolecular subtypes of colorectal cancer Nat Med 2015211350ndash6

29Gao J Aksoy BA Dogrusoz U et al Integrative analysis ofcomplex cancer genomics and clinical profiles using thecBioPortal Sci Signal 20136pl1

30Cerami E Gao J Dogrusoz U et al The cBio cancer genomicsportal an open platform for exploring multidimensional can-cer genomics data Cancer Discov 20122401ndash4

31Forbes SA Beare D Gunasekaran P et al COSMIC exploringthe worldrsquos knowledge of somatic mutations in human can-cer Nucleic Acids Res 201543D805ndash11

32Broad GDAC Firehose httpgdacbroadinstituteorg (22March 2016 date last accessed)

33Le Cao K-A Gonzalez I Dejean S integrOmics an R packageto unravel relationships between two omics datasetsBioinformatics 2009252855ndash6

34Waller T Gubała T Sarapata K et al DNA microarray integro-mics analysis platform BioData Min 2015818

35Cecic IK Li G MacAulay C Technologies supporting analyt-ical cytology clinical research and drug discovery applica-tions J Biophotonics 20125313ndash26

36Rocha R Vassallo J Soares F et al Digital slides present sta-tus of a tool for consultation teaching and quality control inpathology Pathol Res Pract 2009205735ndash41

37Pantanowitz L Valenstein P Evans A et al Review of the cur-rent state of whole slide imaging in pathology J Pathol Inform2011236

38Brachtel E Yagi Y Digital imaging in pathologyndashcurrent ap-plications and challenges J Biophotonics 20125327ndash35

39Hamilton PW Bankhead P Wang Y et al Digital pathologyand image analysis in tissue biomarker research Methods20147059ndash73

40Schneider CA Rasband WS Eliceiri KW NIH Image to ImageJ25 years of image analysis Nat Methods 20129671ndash5

41Schindelin J Arganda-Carreras I Frise E et al Fiji - an OpenSource platform for biological image analysis Nat Methods20129676ndash82 101038nmeth2019

42de Chaumont F Dallongeville S Chenouard N et al Icy anopen bioimage informatics platform for extended reprodu-cible research Nat Methods 20129690ndash6

43Sommer C Straehle C Kothe U et al Ilastik interactive learn-ing and segmentation toolkit In Biomedical Imaging fromNano to Macro 2011 IEEE International Symposium on 2011pp 230ndash233 IEEE

44Hamilton PW van Diest PJ Williams R et al Do we see whatwe think we see The complexities of morphological assess-ment J Pathol 2009218285ndash91

45Salto-Tellez M James JA Hamilton PW Molecular path-ologymdashthe value of an integrative approach Mol Oncol201481163ndash8

46Li G van Niekerk D Miller D et al Molecular fixative enablesexpression microarray analysis of microdissected clinicalcervical specimens Exp Mol Pathol 201496168ndash77

47Galon J Mlecnik B Bindea G et al Towards the introductionof the lsquoImmunoscorersquo in the classification of malignant tu-mours J Pathol 2014232199ndash209

48R Core Team R A Language and Environment for StatisticalComputing Vienna Austria R Foundation for StatisticalComputing 2015

49Baier T Neuwirth E Excel COM R Comput Stat 20072291ndash10850McCarty KS Miller LS Cox EB et al Estrogen receptor ana-

lyses Correlation of biochemical and immunohistochemicalmethods using monoclonal antireceptor antibodies ArchPathol Lab Med 1985109716ndash21

51Goulding H Pinder S Cannon P et al A new immunohisto-chemical antibody for the assessment of estrogen receptorstatus on routine formalin-fixed tissue samples Hum Pathol199526291ndash4

52Leek JT Scharpf RB Bravo HC et al Tackling the widespreadand critical impact of batch effects in high-throughput dataNat Rev Genet 201011733ndash9

53Lauss M Visne I Kriegner A et al Monitoring of TechnicalVariation in Quantitative High-Throughput Datasets CancerInform 201312193ndash201

54Codd EF Normalized Data Base Structure a Brief Tutorial InProceedings of the 1971 ACM SIGFIDET (Now SIGMOD)Workshop on Data Description Access and Control SanDiego CA USA 1971 pp 1ndash17 Association for ComputingMachinery New York NY

55Kacprzyk J Zadro _zny S De Tre G Fuzziness in database man-agement systems half a century of developments and futureprospects Fuzzy Sets Syst 2015281300ndash7

56Bosc P Pivert O Farquhar K Integrating fuzzy queries intoan existing database management system an example Int JIntell Syst 19949475ndash92

57Duggal R Khatri SK Shukla B Improving patient matchingsingle patient view for Clinical Decision Support using BigData analytics In 2015 4th International Conference onReliability Infocom Technologies and Optimization (ICRITO)(Trends and Future Directions) Noida India 2015 pp 1ndash6 IEEENew York NY

58OpenSeadragon httpsopenseadragongithubio (7 April2016 date last accessed)

59Allred DC Bustamante MA Daniel CO et alImmunocytochemical analysis of estrogen receptors inhuman breast carcinomas evaluation of 130 cases and re-view of the literature regarding concordance with biochem-ical assay and clinical relevance Arch Surg 1990125107ndash13

60Singh R Chubb L Pantanowitz L et al Standardization in digi-tal pathology supplement 145 of the DICOM standardsJ Pathol Inform 2011223

61Linkert M Rueden CT Allan C et al Metadata matters accessto image data in the real world J Cell Biol 2010189777ndash82

62The HDF Group HDF Group - HDF5 httpswwwhdfgrouporgHDF5 (3 January 2016 date last accessed)

63Sommer C Held M Fischer B et al CellH5 a format for dataexchange in high-content screening Bioinformatics 2013291580ndash2

64Allan C Burel J-M Moore J et al OMERO flexible model-driven data management for experimental biology NatMethods 20129245ndash53

65Maree R Rollus L Stevens B et al Collaborative analysis ofmulti-gigapixel imaging data using Cytomine Bioinformatics2016321395ndash1401

12 | Li et al

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

66DigitalScope V5 httpwwwdigitalscopeorg (26 November2015 date last accessed)

67Astier Y Braha O Bayley H Toward single molecule DNAsequencing direct identification of ribonucleoside and deox-yribonucleoside 5lsquo-monophosphates by using an engineeredprotein nanopore equipped with a molecular adapter J AmChem Soc 20061281705ndash10

68Stoddart D Heron AJ Mikhailova E et al Single-nucleotidediscrimination in immobilized DNA oligonucleotides with abiological nanopore Proc Natl Acad Sci USA 20091067702ndash7

69Flusberg BA Webster D Lee J et al Direct detection of DNAmethylation during single-molecule real-time sequencingNat Methods 20107461ndash5

70Roberts RJ Carneiro MO Schatz MC The advantages of SMRTsequencing Genome Biol 201314405

71Sheth AP Larson JA Federated database systems for manag-ing distributed heterogeneous and autonomous databasesACM Comput Surv 199022183ndash236

72Baker EJ Biological databases for behavioral neurobiology IntRev Neurobiol 201210319ndash38

73Peuroaeuroakkonen P Pakkala D Reference architecture and classifi-cation of technologies products and services for big data sys-tems Big Data Res 20152166ndash86

74Haider S Ballester B Smedley D et al BioMart Central Portalndashunified access to biological data Nucleic Acids Res 200937W23ndash7

75Barrett T Wilhite SE Ledoux P et al NCBI GEO archive forfunctional genomics data setsndashupdate Nucleic Acids Res201341D991ndash5

76Goode A Gilbert B Harkes J et al OpenSlide a vendor-neutralsoftware foundation for digital pathology J Pathol Inform2013427

77Pearson WR Lipman DJ Improved tools for biological se-quence comparison Proc Natl Acad Sci USA 1988852444ndash8

78Berman JJ Edgerton ME Friedman BA The tissue microarraydata exchange specification a community-based opensource tool for sharing tissue microarray data BMC MedInform Decis Mak 200335

79Unified Medical Language System (UMLS) - Home httpswwwnlmnihgovresearchumls (20 November 2015 datelast accessed)

80Komatsoulis GA Warzel DB Hartel FW et al caCORE version3 implementation of a model driven service-oriented archi-tecture for semantic interoperability J Biomed Inform 200841106ndash23

81Covitz PA Hartel F Schaefer C et al caCORE a common infra-structure for cancer informatics Bioinformatics 200319 2404ndash12

82Papatheodorou I Crichton C Morris L et al A metadata ap-proach for clinical data management in translational gen-omics studies in breast cancer BMC Med Genomics 2009266

83Mohanty SK Mistry AT Amin W et al The development anddeployment of Common Data Elements for tissue banks fortranslational research in cancer - an emerging standardbased approach for the Mesothelioma Virtual Tissue BankBMC Cancer 2008891

84 Patel AA Kajdacsy-Balla A Berman JJ et al The developmentof common data elements for a multi-institute prostate can-cer tissue bank the Cooperative Prostate Cancer TissueResource (CPCTR) experience BMC Cancer 20055108

85 Shiny httpshinyrstudiocom (6 April 2016 date lastaccessed)

86 RNET documentation ndash user version j RNET ndash user versionhttpjmp75githubiordotnet (6 April 2016 date lastaccessed)

87 matplotlib python plotting mdash Matplotlib 151 documenta-tion httpmatplotliborg (6 April 2016 date last accessed)

88 D3js - Data-Driven Documents httpsd3jsorg (6 April2016 date last accessed)

89 Fereidouni F Datta-Mitra A Demos S et al Microscopy withUV Surface Excitation (MUSE) for slide-free histology andpathology imaging In Progress in Biomedical Optics andImaging - Proceedings of SPIE San Francisco CA USA 2015p 93180F SPIE Bellingham WA USA

90 Bird B Miljkovic M Romeo MJ et al Infrared micro-spectralimaging distinction of tissue types in axillary lymph nodehistology BMC Clin Pathol 200888

91 Taleb A Diamond J McGarvey JJ et al Raman Microscopy forthe Chemometric Analysis of Tumor Cells J Phys Chem B200611019625ndash31

92 Guze K Pawluk HC Short M et al Pilot study Raman spec-troscopy in differentiating premalignant and malignant orallesions from normal mucosa and benign lesions in humansHead Neck 201537511ndash7

93 McNamee R Regression modelling and other methods tocontrol confounding Occup Environ Med 200562500ndash6

94 dos Santos Silva I Dealing with confounding in the analysisIn Cancer Epidemiology Principles and Methods LyonFrance International Agency for Research on Cancer 1999305ndash31

95 Queenrsquos University Belfast j Postgraduate and ProfessionalDevelopment j MSc Bioinformatics and ComputationalGenomics httpwwwqubacukschoolsmdbspgdPTMScBioinformaticsandComputationalGenomics (24 March2016 date last accessed)

96 Dhir R Patel AA Winters S et al A multidisciplinary ap-proach to honest broker services for tissue banks and clin-ical data a pragmatic and practical model Cancer 20081131705ndash15

97 Boyd AD Hosner C Hunscher DA et al An lsquoHonest Brokerrsquomechanism to maintain privacy for patient care and aca-demic medical research Int J Med Inform 200776407ndash11

98 Pavlou MP Diamandis EP Blasutig IM The long journey ofcancer biomarkers from the bench to the clinic Clin Chem201359147ndash57

99 ELIXIR Data for life httpswwwelixir-europeorg (24March 2016 date last accessed)

100 Research Infrastructure for Imaging Technologies inBiological and Biomedical Sciences j Euro-BioImaginghttpwwweurobioimagingeu (24 March 2016 date lastaccessed)

Integromics in cancer tissue biomarker research | 13

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

Page 2: Embracing an integromic approach to tissue biomarker ... › download › pdf › 33595853.pdf · Her research uses integrative molecular approaches for biomarker discovery and validation

Embracing an integromic approach to tissue biomarker

research in cancer Perspectives and lessons learnedGerald Li Peter Bankhead Philip D Dunne Paul G OrsquoReillyJacqueline A James Manuel Salto-Tellez Peter W Hamilton andDarragh G McArtCo-senior authorsCorresponding author Darragh G McArt Centre for Cancer Research and Cell Biology (CCRCB) Queenrsquos University Belfast Belfast United KingdomE-mail dmcartqubacuk

Abstract

Modern approaches to biomedical research and diagnostics targeted towards precision medicine are generating lsquobig datarsquoacross a range of high-throughput experimental and analytical platforms Integrative analysis of this rich clinical pathologicalmolecular and imaging data represents one of the greatest bottlenecks in biomarker discovery research in cancer and otherdiseases Following on from the publication of our successful framework for multimodal data amalgamation and integrativeanalysis Pathology Integromics in Cancer (PICan) this article will explore the essential elements of assembling an integromicsframework from a more detailed perspective PICan built around a relational database storing curated multimodal data is theresearch tool sitting at the heart of our interdisciplinary efforts to streamline biomarker discovery and validation While recog-nizing that every institution has a unique set of priorities and challenges we will use our experiences with PICan as a casestudy and starting point rationalizing the design choices we made within the context of our local infrastructure and specificneeds but also highlighting alternative approaches that may better suit other programmes of research and discovery Alongthe way we stress that integromics is not just a set of tools but rather a cohesive paradigm for how modern bioinformaticscan be enhanced Successful implementation of an integromics framework is a collaborative team effort that is built with aneye to the future and greatly accelerates the processes of biomarker discovery validation and translation into clinical practice

Key words integromics big data biomarkers omics digital pathology interdisciplinary teamwork

Gerald Li is a Research Fellow at Queenrsquos University Belfast His research focuses on integrative multimodal data analysis for cancer biomarker discoveryand validationPeter Bankhead is a Research Fellow at Queenrsquos University Belfast His research focuses on developing novel image analysis algorithms and softwarefocussing on digital pathologyPhilip Dunne is a Senior Research Fellow at Queenrsquos University Belfast His research has evolved from basic biology to using translational bioinformaticsto define the molecular pathology epidemiology of early-invasive cancerPaul OrsquoReilly is a Research Fellow at Queenrsquos University Belfast His research focuses on the development implementation and acceleration of bioinfor-matics algorithms and the architecture of bioinformatics software systemsJacqueline James is a Clinical Senior LecturerConsultant Pathologist at Queenrsquos University BelfastBelfast Health and Social Care Trust Her research usesintegrative molecular approaches for biomarker discovery and validationManuel Salto-Tellez is the Chair of Pathology at Queenrsquos University Belfast and a diagnostic morpho-molecular pathologist He has championed the inte-gration of phenotype and genotype in diagnostics and researchPeter Hamilton is Professor of Bioimaging and Informatics at Queenrsquos University Belfast He has been working for the past 25 years in digital pathologyquantitative imaging biomarker discovery and digital diagnosticsDarragh McArt is a Lecturer in Translational Bioinformatics at Queenrsquos University Belfast His research focuses on developing accelerative and integrativesolutions in modern biomarker researchSubmitted 26 February 2016 Received (in revised form) 8 April 2016

VC The Author 2016 Published by Oxford University PressThis is an Open Access article distributed under the terms of the Creative Commons Attribution License (httpcreativecommonsorglicensesby40)which permits unrestricted reuse distribution and reproduction in any medium provided the original work is properly cited

1

Briefings in Bioinformatics 2016 1ndash13

doi 101093bibbbw044Paper

Briefings in Bioinformatics Advance Access published June 2 2016 at Q

ueens University of B

elfast on June 3 2016httpbiboxfordjournalsorg

Dow

nloaded from

Introduction

Modern basic scientific research approaches and clinical studiesof disease are now demanding the requirement to store man-age and interrogate enormous amounts of information Hiddenamong these large complex multivariate data are the patternsand clues needed to understand disease processes and drivenew methods for diagnosis prognosis therapeutics and preci-sion or personalized medicine Commercial and in-house solu-tions [1ndash14] have been developed in an attempt to make themost of this data deluge Describing this age of big data-drivenbiomarker research Weinstein defined integromics as thelsquomelding of diverse types of data from different experimentalplatformsrsquo [15 16] while the American Society for ClinicalOncology uses the term lsquopanomicsrsquo to describe similar needs[17]

Given the centrality of pathology to solid tumour analysiswe previously published on our own framework for multimodaldata amalgamation and integrative analysis called PathologyIntegromics in Cancer (PICan) [18] Since then we have been ap-proached by researchers who want to establish similar systemsat their own institutions This led us to reflect on the process ofbuilding PICan and the appropriateness of packaging and dis-tributing PICan as source code However the diversity of clin-ical experimental and research setups means that an optimalintegromics strategy for each institution might be vastly differ-ent from the next Hence we present here a more detailed dis-cussion on the technical considerations underpinning theformation of an integromics environment to accelerate tissuebiomarker research with a focus on cancer Key framework de-cisions were shaped by our local environment but this articlewill also discuss some alternative solutions that may better suitother research groups organizations and multi-institutionalcollaborations This perspective is structured around severalthemes with reference back to PICan as a case study integro-mics overview database structure data exchange multimodalanalysis interdisciplinary collaboration and data control Weconclude with a discussion on the central role integromics willplay in the modern era of large-scale multicentre trials and glo-bal research consortia [19ndash21] We hope that the concepts raisedwill provide valuable insights from a more detailed technicalperspective for institutions and research groups wishing to de-velop their own integromics framework

Precision medicine and integromics for biomarkerdiscovery in cancer

The modern precision medicine approach aims to stratify pa-tients based on specific biomarkers that will inform the most ef-fective treatment for each patient cutting costs and reducingharm from ineffective treatments [22] Novel therapeutics targetspecific cancer-causing mechanisms inducing fewer side ef-fects but requiring companion biomarker diagnostics withquick turnaround times to identify the patients who will bene-fit most As multiple patient and tumour characteristics cancontribute to a treatmentrsquos effectiveness this drive towards pre-cision medicine is inevitably spurring an increasing demand forunbiased automated data collection and analysis across amultitude of high-throughput analytical platforms The currentpace of data generation is far outstripping the ability to analysethese data Efforts by journals and grant-funding agencies to in-crease data transparency and encourage data sharing have re-sulted in large repositories of publicly available data such asArrayExpress [23] and the European Nucleotide Archive [24]

The ability to access these databases has increased the interestin integromics or panomics the integrative analysis of multiplebig data types This is especially true in molecular analyses [7ndash911 19 25ndash28] where tools such as Partek Genomics Suite(Partek Inc St Louis MO USA) GeneSpring (AgilentTechnologies Santa Clara CA USA) tools that harness TheCancer Genome Atlas (TCGA) [25 26] data portal (eg cBioPortalfor Cancer Genomics [developed and maintained by the Centerfor Molecular Oncology and the Computational Biology Centerat Memorial Sloan-Kettering Cancer Center] [29 30] COSMIC[31] and the Broad Institute TCGA GDAC Firehose [32]) andothers [33 34] can facilitate statistical analysis and biologicalcontextualization of omics data However in the push towardsdevelopment of precision medicine there remains an urgentneed for an integromics platform that takes into account non-molecular data Currently most of these data types are still heldin separate or loosely bound silos with the onus resting with re-search groups around the world to seek out these disparate silosand attempt to harmonize the data Even then different datatypes derived from different patient cohorts make it challengingto integrate the data at the patient level This limits analysis be-tween data types to comparisons of cohort-level summariespotentially obscuring correlations within the data that mayhave been apparent if studied at the patient level

Integration of tissue phenotype through digitalpathology and image analytics

Most molecular analyses on their own seldom consider thehistological distribution of biomarkers This leads to blind spotsin our ability to interpret and understand molecular analysessuch as the potential implications of tumour heterogeneity andwhether it might correlate with any specific genetic or epigen-etic changes Computer-aided image analysis applied to cancerdiagnostics and investigative biomarker research has been inuse for several decades [35] However the combination of digitalpathology with whole slide imaging has truly catapulted thefield into the era of big data Entire stained tissue and tissuemicroarray (TMA) sections each potentially comprising hun-dreds of samples across hundreds of patients can now be digi-tized archived and disseminated without the need to store anddistribute the original glass slides [36ndash38] Digital image acquisi-tion is now commonplace with multiple established companiessuch as Aperio (Leica Biosystems Nussloch Germany)Hamamatsu Photonics (Hamamatsu City Japan) Carl Zeiss(Oberkochen Germany) and 3D Histech (Budapest Hungary)offering scanning hardware capable of producing high-resolution digital slide images Aside from greater ease of col-laboration whole slide imaging unlocks the analytical potentialof digital image analysis Algorithms can be trained to identifytumour regions delineate margins classify cellular stainingand quantify the results for a wide range of histological pheno-types and cellular markers [39] Many software packagesincluding commercial (eg PathXL TissueMark Belfast UKDefiniens TissueStudio Munich Germany Aperio Genie andVisiopharm Visiomorph Hoersholm Denmark) in-house(QuPath Queenrsquos University Belfast) and community-driven(eg ImageJ [40] Fiji [41] Icy [42] and Ilastik [43]) offerings nowoffer access to such algorithms Computerized image interpret-ation reduces the subjectivity of manual interpretation [44] andonce properly trained can batch process a large number of slideimages at once Applied to TMAs high-throughput tissue inter-pretation can greatly accelerate tissue biomarker discovery

2 | Li et al

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

pipelines especially when coupled with the wealth of moleculardata currently becoming increasingly accessible

In light of this true integromics today ought to embraceboth molecular and digital pathology to uncover and validate amore diverse and robust set of biomarkers that will be key to de-livering on the promises of precision medicine [45] Tissue path-ology and histology are essential to the application of morerecent molecular methods in solid tumours required for deter-mining the presence and proportion of tumour cells to ensuresample sufficiency for downstream analysis of nucleic acidsand for directing microdissection of tissue samples for molecu-lar analysis [46] Moreover tissue context and cellular pheno-type are central to understanding tumour heterogeneity andthe underlying genomic map of solid tumours TheImmunoscore biomarker [47] for example applies image ana-lysis to contextualize the interpretation of immunohistochemi-cal staining The understanding of both how genotype drivesphenotype and how phenotypic information influences inter-pretation of genotype are required for a holistic view of cancerand are therefore essential elements of a modern integromicsapproach

Building large-scale digital resources PICanoverview

Reflecting this broader view of integromics we created PICan tobe a comprehensive framework for the collection integrationand analysis of clinical pathological molecular and digitalimage analysis data sets for the purposes of cancer biomarkerdiscovery and validation [18] PICan sits at the heart of our bio-marker discovery program acting as both a central hub andcatalyst (Figure 1) PICan streamlines the process of bringing to-gether diverse streams of multimodal data from multiple cancertypes to generate clinically and biologically relevant analysesand reports via a bespoke access-controlled web-based inter-face written in ASPNETC with a connection to the R statis-tical environment [48] through statconn [49] At its core is arestricted-access MySQL database to house this collated patientdata Currently there are 3660 patients representing eight can-cer types recorded in the system together with fully curateddata on pathological diagnosis and clinical outcome Fromthese nearly 12 000 TMA cores have been stained immunohis-tochemically and scored for a broad range of well-establishedand novel cellular biomarkers that are of interest to our basicscience translational and clinical research collaborators Withthe ongoing rapid pace of research activity at our institutionthese numbers are set to double over the course of the nextyear Data de-identification using study-specific unique identi-fiers protects patient privacy while authentication and author-ization of users enables a balance to be struck between enablingaccess to users and safeguarding the rights of data collectors(clinicians wet lab scientists and other collaborators) to publi-cation priority and control of their own data Only administra-tors have direct access to the PICan database while end useraccess is provided exclusively through the access-controlledanalysis and report generation web interface Data quality is akey priority to ensure the integrity of downstream analyses asany analysis is only as good as the weakest data supporting itso data are curated with biological and clinical input before up-load Meanwhile bioinformaticians tailor their pipelines ac-cording to the needs of the research or analysis Consequentlyin addition to the central database a complete integromicspipeline must necessarily encompass data collection data

curation and biologically driven analysis and interpretation ofthis data activities which we consider integral parts of any inte-gromics framework

Database design data pre-processing curationand structure

Traditional single-mode data analysis pipelines take raw data(eg sequencing reads microarray spot intensities whole slideimage files) and process it through a series of steps into a finalform that informs a particular biological narrative (eg list ofmutations and frequencies gene expression values H-scoresfor immunohistochemistry (IHC) scoring [50 51]) However rawdata such as raw fluorescent spot intensities from an expres-sion microarray experiment might not be suitable for multi-modal and dynamic analysis without corrections forbackground and batch effects [52 53] while data aggregationmight leave fully processed data such as a list of differentiallyexpressed genes too refined to be of further use in patient-levelresearch analysis A compromise is to capture data in a statethat is suitable and informative for subsequent analysis whiletracking metadata associated with acquisition and pre-processing Metadata or lsquodata about datarsquo captures informationabout the methods and conditions under which the data itselfwere obtained Such information can be used for example toidentify cases where an observed effect was the result of differ-ences in the laboratory or researcher conducting the experi-ments rather than the biological condition Going back toexpression microarray data as an example this could beachieved by recording background-corrected and normalizedprobe signal intensities with metadata detailing the nature andparameters for data pre-processing steps such as normaliza-

tion assay conditions instrumentation used including its set-tings and pre-analytical steps eg how the physical sampleswere obtained handled and processed and by whom For com-pleteness and traceability raw data can be stored alongside thisanalysis-ready data Additionally multiple lsquosnapshotsrsquo fromsingle-mode analysis pipelines can be stored as each could po-tentially serve as a starting point for multimodal analysisHowever care must be taken to ensure internal consistency sothat results are repeatable and reproducible whether workingup raw data or the stored analysis-ready data Consequentlyfor PICan we opted for quality and consistency by recordingonly the analysis-ready data most suitable for further integro-mic analysis A number of quality assurance and quality controlmeasures can be taken to ensure data integrity with the appro-priate level of adoption into standard operating proceduresbeing dependent on the overall level of accreditation of the la-boratory in question Metadata is stored alongside the analysis-ready data but raw data and most downstream single-modeanalysis results are presently excluded from PICan Future plansto increase data completeness and traceability include adding acentral independent repository for raw data files These repre-sent files not already in existence elsewhere (eg public reposi-tories) These raw data files will be read-only and bear uniqueidentifiers that link them to their analysis-ready counterpartson the main database A file path stored in the main databasewill allow the raw data files to be easily retrieved when neededAs these files will not be used for on-demand statistical ana-lysis they can be stored in their native formats

Integromics in cancer tissue biomarker research | 3

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

Relational database systems

Integromics software infrastructure can be built around one ofmany database management systems available each with itsown strengths and weaknesses As we envisioned PICan to in-corporate a semi-automated analysis engine [18] we prioritizeddata quality which maximizes the validity and reproducibilityof our results and preservation of data structure which stream-lines the automated retrieval and processing of big data sets ra-ther than attempting to blindly maximize data throughputReflecting these design priorities PICan was built around a rela-tional database (MySQL) consisting of interrelated data tablesthat reflect the process of constructing and harnessing TMAswhich are central to much of the biomarker discovery work con-ducted at our institution the Northern Ireland MolecularPathology Laboratory and through the Northern Ireland Biobank(NIB) (Figure 2) Every row of every table has a system-generatedprimary key and these are used as foreign keys to form relation-ships between tables The database centres on the Patient andBlock tables (table names will be italicized) reflecting our core vi-sion for PICan to drive holistic multimodal integromic analysisat a patient level One-to-many relationships link the Patienttable to tables containing the clinicopathological and treatmentdata capturing cancer-type-specific fields from templates de-veloped in collaboration with our clinical expert partners Thesetemplates also define permissible data inputs for each field (egage is a number sex is lsquoMrsquo or lsquoFrsquo) which are then encoded intothe database schema A patient may have had multiple tissuespecimens collected as tissue blocks each represented by a

separate linked entry in the Block table Linked via the Block tableare tables representing molecular assays (SangerSequenceDNAMicroarrayChip ExpressionLevel) whole face tissue sections(TissueSlide) TMA construction (Cylinder RecipientBlock TMASlideTMACore) IHC scoring (TissueScore BiomarkerScore) and digitalpathology (TissueVirtualSlide TMAVirtualSlide VirtualCoreDIAScore) all of which reflect primary activities within our insti-tutionrsquos biomarker discovery program Metadata is stored in thetable corresponding to the relevant physical entity and soimaging parameters for a whole face tissue slide for examplewould go in the TissueVirtualSlide table While this databasestructure is well suited to our needs laboratories collectingnon-tissue specimens such as blood sputum or urine may findit inappropriate for their needs A Blood table might sit at theheart of a haematological laboratoryrsquos database for examplewhile other institutions may need a cluster of tables to repre-sent different specimen types each linked to tables for variousbiochemical assays

PICan uses a modular internal structure to support variousdata types Each table represents a physical entity and conse-quently a data type This allows the addition of new tables torepresent data types from emerging technologies as they ma-ture As maximizing utility of new tables will require updates tothe end-user web interface and setting the appropriate foreignkeys will require knowledge of the internal structure of thePICan database only the system administrator can add newtables

Another consideration for relational database design is nor-malization This procedure simplifies the data structure by

Figure 1 PICan sits at the heart of an integrated multidisciplinary biomarker discovery and validation pipeline Multiple curated data sources feed into PICan which

accelerates downstream analysis and report generation

4 | Li et al

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

removing duplication of data ensuring that the table is resilientagainst inconsistencies arising from incomplete addition dele-tion or modification of data As described in [54] this isachieved by identifying groups of related or dependent datafields and splitting them off into a separate table using foreignkeys to record relationships between tables For most practicalapplications the third normal form balances a robust architec-ture without overly cluttering the schema with an untenablenumber of tables Normalization approaches can also be used tooptimize database speed and performance

A consequence of our decision to use a relational database isthat free-form data needs to be harmonized with the data struc-ture presented by PICan before data entry The requisite cur-ation steps prioritize data integrity and prevent poor-qualitydata that might impede proper and meaningful analysis fromentering the database (garbage-in-garbage-out) Clinical andpathological data collection and storage are guided by tem-plates collections of well-defined data fields that have been de-veloped through our close working relationships with the NIBNorthern Ireland Cancer Registry Northern Ireland Health andSocial Care Trusts and clinicians These collaborators share thedesire to drive data integration and provide expertise contextand data access to harmonize data templates around commonvocabulary and permissible data values ensuring maximal clin-ical relevance of PICan In addition to the harmonized common

fields like age and sex PICan also supports cancer-type-specificfields for maximum flexibility

Non-relational database structures

While a structured relational database has served PICan wellthere are alternative models that might better suit other re-search integromics environments MongoDB for example is anon-relational (or lsquoNoSQLrsquo) database architecture that uses adocument model where each entity (a tissue block for ex-ample) is represented by a lsquodocumentrsquo that might contain hun-dreds of attributes as key-value pairs This model works with aflexible schema allowing data fields to be added or omitted ona case-by-case basis while harmonization of terminology anddata fields occurs at data retrieval In a research environmentespecially at the beginning of a study when pipelines and meta-data are likely to change it is easier to update the data modelthan to redesign the fixed schema of a relational databaseHowever with less urgency to curate data before upload dataintegrity may be variable with errors remaining unnoticed andconfounding analyses performed on the data Fuzzy matchingalgorithms [55ndash57] can mitigate the impact of typographical andsome clerical errors but these approaches must still be rigor-ously validated In some cases it might be impossible even for ahuman expert to deduce the original intention Data must then

Figure 2 Schematic of the internal table structure of PICanrsquos relational database reflecting the core activities and workflows that comprise the biomarker discovery

and validation program at our institution Most tables represent physical or digital entities within these workflows Where table names may be unclear a brief descrip-

tion has been included along with some example data fields (in parentheses)

Integromics in cancer tissue biomarker research | 5

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

be discarded from analysis potentially reducing the statisticalpower Manual curation at the outset might identify such caseswhen there is still an opportunity to seek clarification from theoriginal data collector Ideally ambiguous annotations and bor-derline cases should be resolved in consultation with a suitablyqualified medical professional before they enter the databaserather than leaving them open to interpretation by downstreaminformaticians Likewise missing or contradictory data pointsare preferably identified and where possible corrected beforeentry into the database However a major advantage of an un-structured database model is the ability to more faithfully cap-ture the complexities of the patientrsquos clinical pathological andother data parameters including free-form text fields such asclinical notes The ability to capture patient data in its nativeform also improves the traceability and scalability of data entryinto a non-structured database solution by eliminating the la-bour and documentation associated with data curation andenabling automated bulk data upload Despite the tools wehave developed to semi-automate data upload this will remaina relative drawback of manual curation in the short term espe-cially for clinical and pathological data fields In the longerterm scalability may be improved with increased adoption ofelectronic medical records wherein many of these data fieldscould be completed by the attending physician when encoun-tering the patient

Balancing the data curation afforded by the structured rela-tional database with the high-throughput flexibility of unstruc-tured models are various hybrid options For example PICanoffers flexibility to the structured database model by includingfreeform text fields to capture unstructured notes (in our experi-ence these are typically for clinical notes) in several of the datatables As these fields are not referenced directly by the auto-mated online analysis component which requires well-definedparameters they are hidden from analysis but can be retrievedby users wishing to conduct further data refinement or offlineanalysis In practice however it is more likely that any fieldsnot used in analysis by the system will simply be ignored andneglected Another hybrid solution we are actively exploring toincrease data throughput is to allow the importation of non-curated data sets for end-user analysis In this model curatedlsquopublicrsquo data sets would be available to all users (with necessaryapprovals and access restrictions in place) while each end userwill be able to upload lsquoprivatersquo data sets to be analysed along-side or integrated with the curated data The user then is re-sponsible for ensuring the quality of hisher own dataExtending the hybrid model concept further are heterogeneousdatabases in which only the most important unchanging andsensitive data are typically stored in a relational databaseMeanwhile the bulk of the data in terms of size will be storedin non-relational databases or as raw data with links stored inthe main database to ensure ease of access and retrieval

Image analysis data storage and management

Current work is focussing on improving the support of digitalimage analysis data an emerging area of research in which ourgroup has significant expertise Imaging data presents someunique challenges for database storage design The imagesthemselves can be large One of our typical tissue slidesscanned at 40objective magnification (025 lm per pixel) canhave 2ndash10 billion pixels Imaging in RGB at 8 bits per channelgives uncompressed file sizes of up to 30 GB Additional colourchannels (such as for fluorescence imaging) z-stacks and dataredundancy in the form of image pyramids for improved image

loading performance would all greatly increase file sizesCompression typically JPEG or JPEG2000 for whole slide imagescan significantly reduce file size without drastically impairingvisual assessment although suitability for image analysis ismore variable [39] Hence it is vital to store the original slideimage to allow a pathologist to go back and verify that imageanalysis results are consistent with hisher expectations This iscurrently achieved by hosting the slide images on remote ser-vers maintained by PathXL A link is stored in the PICan data-base that allows the web-based interface to retrieve the imagedata on user request via an application programming interface(API) supplied by PathXL Pan zoom and overlay functionalitiesin the PICan web interface are powered by OpenSeadragon [58]As the samples are drawn from NIB collections NIB maintainscontrol of access and security using the PathXL software

In addition to the raw image pixels image analysis datamight include markups annotations and calculated metrics(such as integrated optical densities nuclear areas or immuno-histochemical H-scores [50 51]) Metrics can be at a cellularlevel or summarized across a region of interest (ROI) a slide oreven a set of slides Much of these data may be hierarchicallystructured with each slide potentially containing multiple ROIseach ROI containing many cells and each cell being associatedwith values for numerous metrics However the efficiency ofstoring image analysis data in a structured database is uncer-tain when a single image can contain thousands of annotationsFor now PICan stores only summary scores and metrics (eg H-score [50 51] Allred Score [59] percentage positive cells) whileusers wishing to interrogate the image data on a more granularlevel will need to rely on external specialist tools The challengeof storing and transferring image analysis data between sys-tems is an ongoing one with several potential approaches hav-ing been proposed including DICOM [60] OME-TIFF [61] variousHDF5-based formats [62 63] and proprietary vendor formatsbut without a clear front-runner DICOM and OME-TIFF are pri-marily image formats with flexible metadata support optimizedfor image viewing rather than storage of image analysis resultswith large numbers of image objects In contrast HDF5 capturesobject hierarchy relationships well but on its own it is not a na-tive image format Meanwhile proprietary formats are poten-tially encumbered by patenting and intellectual propertyconcerns

Integromics data management is patient-centric

Regardless of the database management system being used acritical characteristic of integromics analysis is patient-centricity The ultimate objective in bringing together data frommultiple experimental platforms is to build a more holistic pic-ture of the interplay between genotype and phenotype in thepatients under study As interest in amalgamating multipledata types has grown many platforms originally designed tohost one data type have been expanded to accommodate add-itional types of data Many biobank management software sys-tems allow addition of clinicopathological notes andexperimental assay results on top of sample tracking dataMeanwhile digital slide hosting platforms like OMERO [64]PathXL Xplore Cytomine [65] and DigitalScope (Aptia SystemsHouston TX used by the College of American Pathologists) [66]are beginning to support the addition of associated clinicopa-thological data as metadata Others like TCGA [25 26] origi-nated as molecular data repositories but have begun addingwhole slide imaging There is a clear trend towards convergenceof database models encompassing more and more data types

6 | Li et al

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

but their impact will be limited if these extra data are simplystored and forgotten Integromics is an active process not just ofbringing data together but more importantly of creative com-parisons and contrasts to expose underlying interactions under-pinning the cancer phenotype Many public repositories and bigdatabases such as TCGA draw from an assortment of patientcohorts so associations between different data types measuredon different cohorts are limited to cohort summaries poten-tially obscuring trends between data types that are only appar-ent at the individual patient level

Interacting with external resourcesmdashdataexchange between systems

With the current rapid expansion of big data analytics in bio-medicine integromics frameworks must remain flexibleenough to incorporate emerging and future technologiesStructured architecture and design of the overall integromicsdatabase framework can facilitate this However large data setslike next-generation sequencing (and upcoming third-gener-ation sequencing [67ndash70]) can be unwieldy in a MySQL databasewith typical whole exome or RNA sequencing runs comprising10ndash100 billion base pairs for a single sample so future integro-mics systems may house these types of data in databases thatare structured specifically for this purpose Instead of a singledatabase housing multiple data types a future integromicsdatabase will undoubtedly be structured as a fully distributedfederated database [71ndash74] with a central database storing iden-tifiers that interface with a cluster of highly specialized data-bases each housing a specific type of data in its native formatArguably PICan already uses a version of this model by out-sourcing digital slide image storage to remote servers Futuredevelopments may include enabling PICan to tap into publiclyavailable data sources and allowing users to analyse these pub-lic data sets in tandem with PICan data such as accessing GeneExpression Omnibus data through GEO2R [75]

One of the challenges when working with a federated data-base model is how to exchange data faithfully and efficientlybetween the various components Each constituent databasewill need an interface to interact seamlessly with the centraldatabase and control system Some of these will be availableoff-the-shelf or rely on published APIs while others such asthose interacting with in-house software may need significantdeveloper time and effort to implement Ultimately there areno shortcuts to establishing these interfaces and so institutionsshould aim to reduce the number of distinct interfaces bystandardizing data around a small set of formats For examplewhen working with whole slide images a single institutioncould strive to use only a single file format The various avail-able formats each have their own strengths and weaknessesbut the natural option will likely be the native format of thescanner used at the institution or a vendor-neutral format likeDICOM [60] or OME-TIFF [61] Fortunately many tools for work-ing with whole slide images support multiple file formats suchas DigitalScope [66] for remote web-based viewing of variousimaging formats Meanwhile projects like Bio-Formats (OpenMicroscopy Environment consortium) [61] and OpenSlide [76]further facilitate the viewing and exploitation of various imag-ing formats in other programming and analysis languages likeC Cthornthorn Python and Java Additionally it is important to stand-ardize the data transfer protocol whether this is throughInternet standards like hypertext transfer protocol physicalmeans like hard disks or some other agreed protocol Perhaps

establishing an institutional integromics strategy will encour-age adoption of standardized data formats by creating an ex-pectation among data generators that data should be portableand compatible with other data types for analysis

Data flow within an integromics environment can often bemultidirectional Even with only a single central data repositorythere will be instances when it is beneficial to export some ofthese data to a separate third-party platform whether for speci-alized analysis report generation or some other purpose AsPICanrsquos web-based user interface is designed to be user friendlyit supports only a limited number of types of analyses and re-port generation options Analyses not hard-coded into the inter-face for example would need to be performed lsquoofflinersquo In factthis is how new analyses are tested before being incorporatedinto PICan Similar to data import and storage data export froman integromics system is complicated by the wide range of datatypes that demand different formats to accommodate their in-herently different natures Efforts have been made in somefields to standardize and integrate data models such as thoseof the Global Alliance for Genomics and Health for genomicdata [20] Hence where widely recognized data exchange for-mats exist such as FASTAFASTQ files for nucleotide or aminoacid sequences [77] or the TMA data exchange standard [78]they should be prioritized for integromics support For otherdata types like digital image analysis however there remains aneed for a well-defined and widely accepted standard for dataexchange Whether it is possible or feasible to define a singleunifying integromic data exchange format that would encapsu-late multiple data types attached to individuals within a cohortof patient subjects remains an open question

Standardization of data vocabulary

Any data exchange standard needs to address both how dataare encoded (structure) as well as a common vocabulary forcommonly encountered fields CSV JSON XML and derivativesof these are popular formats that lend themselves well to struc-tured data Certainly CSV export is our default as although itdoes not fully represent the relational structure underpinningPICanrsquos database it mimics the basic tabular structure and canbe easily viewed by other researchers in common spreadsheetor statistical packages Non-tabular data like digital slide images(which are stored as links in the database) are exported as sep-arate files The greater challenge might be agreeing on a com-mon vocabulary Data field names need to be mapped fromPICanrsquos internal vocabulary to that of the target external plat-form This is not a trivial task as even professionals within thesame institution may use different names for the same metricSome clinicians may be more precise in how certain metrics aremeasured and recorded such as when some distinguish be-tween lymphatic and vascular invasion while others groupthem together as lymphovascular invasion Harmonizing be-tween cancer types can also be challenging hence the use ofcancer-type-specific tables in PICan for certain clinicopathologi-cal metrics Additionally even where a data exchange specifica-tion exists like the one for TMA reporting [78] there may beallowance for custom fields posing an extra challenge for align-ing under a common vocabulary Attempts have been made tostandardize terminology in some disciplines The US NationalLibrary of Medicine maintains a collection of controlled vocabu-laries known as the Unified Medical Language System [79]which incorporates SNOMED CT MeSH and OMIM amongmany other ontologies Another compendium is caCOREincluding the Enterprise Vocabulary Services and Cancer Data

Integromics in cancer tissue biomarker research | 7

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

Standards Repository of common data elements [80 81]Additionally some groups have devised cancer-type-specificlists such as for breast [82] mesothelioma [83] and prostate[84] In the absence of clear community-supported data stand-ards institutions are advised to agree on and use a minimal setof data export formats

Unique challenges of multimodal integrativeanalysis

Collecting and storing diverse data types is only half the battlein integromics unlocking the value of these data will comefrom analysis and interpretation Through a connection withthe R statistical environment via statconn [49] PICan is capableof on-demand statistical analysis graphical output and reportgeneration Besides R and statconn other similar frameworksand interfaces include Shiny [85] RNET [86] Python (eg mat-plotlib [87]) and D3js [88] While PICan is strictly a research toolan integromic approach will undoubtedly be essential to futureclinical decision support systems placing integromics at thefront line of patient care delivery

Multimodal analysis is particularly important when con-sidering the future of digital pathology and tissue imagingConventional light and fluorescence imaging provides the basisfor most studies in this field However the future may providenew imaging technologies using light outside of the normal vis-ible spectrum that may be more informative towards illuminat-ing the underlying cancer biology and delivering sensitive andspecific biomarkers of clinical outcome Numerous studies areemerging that exploit technologies such as ultraviolet [89] in-frared [90] and Raman spectroscopy [91 92] together with abroad range of associated multispectral technologies to gain abetter insight into tissue and cell characterization and to visual-ize compounds and chemistry not accessible to conventionalmicroscopy Integrating these imaging modalities into a singleframework to compare compound and deduce clinical utilitynot in isolation but together with conventional imaging meth-ods is essential Similarly there is capacity to integrate otherforms of medical imaging relevant to cancer discovery includ-ing positron emission tomography magnetic resonance imag-ing computed tomography and X-ray This can be used tounderstand the relationship between in vivo clinical imagingand ex vivo tissue imaging thereby translating markers that arecurrently only visible on microscopy to early in vivo detectionwith clinical imaging modalities In both settings the real valueis likely to come from the overlay and integration of imagingmodalities extending the capacity of tissue imaging on its ownand providing new visualization tools for both pathologists andradiologists that are more informative and consequently im-prove clinical decision making The integromics initiative dis-cussed in this article provides such a model for imaging as wellas molecular and clinical data blurring the edges of medicalspecialties and providing a unified framework for discovery

The basic tools of integromic data analysis are generallysimilar to other disciplines of bioinformatics Mean compari-sons clustering survival analyses machine learning and linearmodels for example can all be used as can the various pipe-lines that have been developed for next-generation sequencinggene expression microarrays and other molecular big dataomics techniques Integration of digital image analysis data inintegromic analysis can present its own distinct risks and re-wards Images and their associated markups and annotationsdo not lend themselves on their own to t-tests or clustering for

example Nevertheless these may be useful to contextualizeother pieces of data like IHC scores Calculated scores (egH-score) and metrics (eg integrated optical density nucleararea) can often be expressed as nameattribute pairs so theylend themselves more immediately to integrative analysis al-though data structure and hierarchy can still complicate ana-lysis Some metrics can be measured at a cellular level andsummarized at an ROI or whole slide level Others howevermight not be meaningful at a cellular level (eg H-score) so onemust be mindful of the meaning of each measure when incor-porating it into a multivariate model of the data Tumour het-erogeneity is also an important complication Summary scorescomputed across different regions of the same tumour may dif-fer markedly Heterogeneity at least in solid tumours can bemore easily mapped in tissue sections where tissue structureand cellular context is maintained and intact Spatial changesin tumour cell arrangement can be visualized microscopicallyand used to sample regions for molecular analysis and themeasurement of intratumoural heterogeneity Whole slideimaging and digital image analysis represent powerful tools toquantitatively map tissue phenotype and provide a sound basisfor biomarker assay development while opening up new av-enues of inquiry For example one can begin to explore whethercertain morphometric phenotypes correlate with specific gen-etic or epigenetic profiles and whether these have any signifi-cant prognostic or therapeutic impact

A specific challenge for integromics is the application ofstatistical techniques to data from disparate sources Care mustbe taken to ensure differences observed between experimentalgroups reflect the underlying biology and not simply batch ef-fects Differences in experimental assay platforms such as dif-ferent probes targeting the same gene or different antibodies forIHC may produce discrepancies in the data based on differentlevels of sensitivity and specificity In digital pathology differ-ences in imaging hardware analysis algorithms and parameterdefinitions may lead to various imaging artefacts and interpret-ation difficulties Even within the same software cell countsdetermined from different threshold settings are not compar-able Less visible confounders include study and patient vari-ables like recruitment methodology study admission criteriapatient stratificationdiagnostic criteria pre-collection treat-ments and standard of care (for prognostic biomarker studies)that can affect the makeup of the study patient populationMeanwhile specimen handling variables like cold ischemictime fixation method staining protocols and formulations canhave a profound effect on the data often dominating other con-founders regardless of the subsequent pipeline Ideally thendata harmonization begins well before data are generated withstandardization and thorough documentation of all pre-analytical steps including patient selection tissue collectionand specimen handling Where potential batch effects or con-founding variables have been identified a number of strategiescan be used such as subset analysis (ie stratifying the subjectpopulation on the confounding variable) or regression model-ling [93 94]

Mandated public release of large data sets has been helpful inadvancing transparency in science but tracking the associatedmetadata and assessing its impact on results derived from thesebig data sets remains a major unresolved challenge Even largerepositories like TCGA are composed of smaller collections froma wide range of geographies and institutional environments Inmany cases pooling such diverse data sets becomes an apples-to-oranges comparison Batch effects in TCGA and other largehigh-throughput data sets are well-documented [52 53] Without

8 | Li et al

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

control over data origins and consistency it is essential that theentire process be rigorously validated before interpretation of re-sults What local data sets lack in scale they make up for in trace-ability For PICan we require that our collaborators are able tostand over every data point that enters our system therebyensuring a higher standard of quality and confidence in our dataMetadata recording technical and experimental details are storedin PICan alongside the data Regardless of how it is done meta-data tracking should be rigorous and thorough as differences inexperimental conditions and design must factor into the inter-pretation of any analysis Where appropriate data should be cor-rected and normalized to ensure comparisons are meaningfullike-versus-like comparisons and observed contrasts are notmerely batch or scalar effects

Interdisciplinary teamwork

Apart from robust infrastructure and software support the heartof any successful integromics framework is a cohesive and inter-disciplinary team to support and champion the entire effortDespite the technical complexities of a system like PICan team-work is arguably the most important and yet most often over-looked piece of the integromics puzzle This is fundamentally acollaborative endeavour bringing with it all the challenges ofworking with an interdisciplinary team of individuals and organ-izations each of whom have their own diverse set of goals andpriorities Clinical team members for example may have respon-sibilities to their patients that would not apply to academic re-searchers Nevertheless the specialist insights afforded by aninterdisciplinary team have been indispensable to the success ofPICan Clinicopathological data tables are built on templates cre-ated by clinical experts specializing in each cancer type Clinicaland translational researchers inform the choice of analysis toolsto be added which are then developed and tested by transla-tional bioinformaticians Establishing and incorporating an inte-gromics framework into the mindset of an institution will requireadditional effort and workload on certain individuals Hence it isimperative that team discussions include delegation and sharedresponsibility over matters such as data collection and curationdata formats and standards system maintenance data exchangeand access control Differing priorities and opinions betweenteam members are to some extent unavoidable but teams canadopt measures to manage conflicts such as having all protocolsclearly documented or maintaining a regular meeting scheduleRegardless of how these challenges are met we stress the im-portance of maintaining open lines of communication and agree-ing to clear roles and responsibilities of all team membersEffective teamwork is a non-trivial yet integral contributor to thesuccess of any integromics framework

With an ever-increasing emphasis on interdisciplinaryresearch projects like PICan underscore the need for cross-disciplinary education to instil young researchers with thenecessary mindset understanding and skills to navigate thechallenges and opportunities in the research environment ofthe future Queenrsquos University Belfast for example offers aMasters-level course in Bioinformatics and ComputationalGenomics [95] The course attracts students from a diverserange of backgrounds from medical doctors to engineers andcomputer scientists to biologists from a range of life sciencesubject areas and aims to equip them with a solid foundationacross a broad range of topics such as advanced bioinformaticsdatabase design tissue imaging and integromics

Interdisciplinary teamwork is vital to another pillar of anyintegromics framework access to quality data Any analysis is

only as robust as the data that underpins it The role of the NIBis crucial to PICanrsquos success not only through access to its bio-specimens but also its expertise and infrastructure support forsample preparation processing accessioning tracking andquality control The active participation of an interdisciplinaryteam also distributes the workload of data curation One of thegreat struggles of data collection is maximizing quality andquantity with a limited resource of labour With an increasingfocus among funders and institutions on large-scale multi-centre studies scaling up an integromics framework while pre-serving data quality will rely heavily on greater delegation ofdata curation to the data collectors themselves The onus willbe on the individual data collectors to flag data discrepanciesso that the database administrator serves as the last line of de-fence rather than the first

Ethics and data governance

Like any other system handling clinical and pathological infor-mation an integromics framework also requires appropriateapproval for the use of patient data Additionally data controlsin integromics need to protect both patients and data collectorsde-identification protects patient privacy while access controlsand terms of use policies protect the rights of data collectors

All studies undertaken in PICan need to be approved follow-ing an application to NIB where project-specific ethics and gov-ernance are granted under the Office for Research EthicsCommittees Northern Ireland approved NIB guidelines As a re-search tool PICan stores no personally identifiable informationin its database ensuring patient anonymization and minimiz-ing the potential impact of any data security breach Each pa-tient sample and associated analytical data are labelled using aunique independent study-specific identifier within the systemClinically qualified and authorized patient data collectors retainthe key that links these identifiers to hospital systems wherepersonal data are securely stored Researchers using PICan arerestricted to using anonymized research data onlymdashrespectingdata protection laws and regulations However as importantfollow-up data on patients accrues over time these can still beaccessed by having appropriately authorized clinical staff anddata collectors use the key to identify patients on hospital sys-tems and retrieve more up-to-date patient informationAuthorized clinical personnel can then review and retrieve per-tinent research-relevant information from the hospital systemsand update PICan again with only anonymized research datausing the PICan identifiers This is equivalent to an honest bro-ker system used elsewhere [96 97]

Data curation of de-identified information in PICan producesa robust privacy-controlled snapshot of analytical data but theunderlying data sources are themselves continually beingupdated as patients are followed up and new specimens are col-lected While data update is possible it is necessarily a manualprocess as PICan has been walled off from the original datasources by design allowing it to operate with fewer privacy con-trols than a typical clinical system as no identifiable patientdata are stored To maintain quality and consistency with exist-ing data data updates to PICan are done only periodically withentire cohorts of patients updated at once Alternatively insti-tutions that prioritize constantly updated data will requiredeeper integration between their research and clinical data sys-tems In such cases additional data protection measures will berequired such as encryption or safe havens for data storage andtransfer Essentially such an integrated system would need tomeet local rules and regulations for clinical data systems

Integromics in cancer tissue biomarker research | 9

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

Scaling up for large-scale epidemiologicalstudies

The last few decades have seen tremendous investment into allaspects of cancer biomarker discovery research but as of 2013only 18 protein cancer biomarkers had received US Food andDrug Administration (FDA) approval [98] despite decades of re-search and thousands of publications on novel candidate bio-markers Part of the reason may be a focus on early-stagediscovery activities and the relative lack of large-scale multi-centre validation and clinical trials Studies of such scale pre-sent a number of data management challenges not unlikethose we designed PICan to address First and foremost integro-mics is inherently a team activity and the same principles ofteamwork apply whether it is conducted within a single centreor across multiple research centres on different continents Inthis way integromics research extends readily to large-scalemulticentre epidemiological studies Data harmonization of fileformats and common vocabulary remain vitally important andbest addressed through open communication between collabor-ators As in the case of collaborators within an institution teammembers from different institutions should ideally agree on asingle data format for each data type and a common agreed setof vocabulary to describe data fields collected depending on theneeds of the study and its participants Analysis of multicentredata also requires standardization of pre-analytical parameterssuch as inclusion criteria sample acquisition sample process-ing assay conditions and measurements recorded When estab-lishing project standards and protocols teams should considerwhat community standards are already available to maximizeinteroperability and avoid redundancies Across Europe for ex-ample projects like ELIXIR [99] and Euro-BioImaging [100] are al-ready seeking to establish hubs for biomedical and bioimagingresource sharing Globally similar efforts in genomics are alsounderway [19ndash21]

The process of building digital infrastructure to support amulticentre integromics framework is similar to the case of asingle institution All team members should meet and agree onthe structure of the framework Depending on the extent of theproject and range of requirements the technical scientific andmedical team will need to grow as well Among the team thereis a need for at least one full-time staff member dedicated totechnical development and management of the digital infra-structure As the project grows the staffing requirement willundoubtedly increase in tandem For this reason modular de-sign becomes even more crucial as it allows development tasksto be carved up and delegated out to different team membersThere will also be additional requirements for hardware andsoftware licences Software costs can be mitigated by using free(eg R) or low-cost volume licenced (eg Microsoft Visual Studio(Redmond WA USA) for ASPNET in an academic setting) op-tions Nevertheless despite minor challenges digital solutionsgenerally scale well both in size and geographic distributionParallels can be drawn between incorporating external data re-sources into an integromics environment as previously dis-cussed and the challenges of connecting information systemsacross multiple sites However multicentre teams also facesome specific challenges Teams may need to comply with vari-ous sets of local laws and regulations surrounding researchethics data security and patient privacy Network access andsecurity may also be further complicated in a multicentre set-ting PICan for example was built for use within a single insti-tution so access was limited to within the institutionrsquos networkfirewall with all external access blocked A system spanning

multiple institutions would need greater attention to securitywhile maintaining access from outside the host institutionTeams may also need to contend with different software andhardware infrastructures at the various centres and implementthe necessary interfaces to ensure compatibility However theultimate key to success of multicentre collaborations is a funda-mental commitment to team science and on that front institu-tions that embrace integromics will have the experience to leadthe coming age of large-scale multicentre studies

Furthermore platforms like PICan and the integration ofimage analysis data with clinicopathological and therapeuticdata will help accelerate the screening of candidate tissue bio-markers for prognostic and predictive capacity While this rep-resents a powerful research and discovery tool this is likely toidentify new tissue biomarkers (based on IHC chromogenic insitu hybridization (CISH) fluorescence in situ hybridization(FISH) RNAscope etc) that have clear value in routine diagnos-tics and the stratification of patients for precision therapeuticsFurther validation of these biomarkers would be necessary incarefully controlled and standardized biomarker trials takinginto consideration sample preparation scanner configurationlab-to-lab variation comparative performance against patholo-gist scoring and clinical outcome data Regulatory approval ofdigital diagnostic tests would require FDA clearance (USA) orCE-IVD marking (Europe) if they were to be marketed as clinicaltests by industry but may also be implemented as laboratory-developed tests with documented internal validation and clin-ical evidence if done within an academic medical centre suchas our own setting Integromics platforms provide the frame-work for discovery and if designed appropriately can provide ro-bust clinical evidence of the utility of new biomarkers forsubsequent commercial licensing regulatory approval and clin-ical translation into routine diagnostictherapeutic practice

Conclusion

Modern biomedical research generates scientific and clinicaldata faster than the ability to fully interpret it while data fromnew and emerging technologies need to be integrated andharmonized with existing analysis workflows Efforts to resolvethe resulting bottleneck in interpretation to deliver on thepromises of precision medicine in this era of fast-track bio-marker discovery and evaluation have led to the birth of a newfield in bioinformatics integromics After introducing our ownframework for streamlining integromic analyses it becameclear that there is widespread interest in implementing integro-mic strategies in other research institutions In light of the pre-sent discussion we feel that an optimal integromics frameworkmust be well-supported by a dynamic and interdisciplinaryteam reflect the local needs and priorities of the institution andbe designed with an eye to the future The team is the gatewayto high-quality data and specialist insights that inform analysisand ultimately catalyse translation to clinical application insupport of precision medicine Meanwhile biomarker researchat our institution is currently centred on TMAs and we haveemphasized data quality and curation over raw throughputThese traits are imprinted in the design and operation of PICanbut other institutions may have other needs that demand a dif-ferent infrastructure Ongoing development of PICan includesincorporating raw data storage and access for completenessand traceability and strengthening support for digital imageanalysis As our institution adopts newer technologies PICanlikewise will continue to evolve facilitated by its modular de-sign to meet the changing needs of a busy cancer research

10 | Li et al

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

environment spanning numerous programmes technologiesand scientific objectives Hence integromics should be interwo-ven into the fabric of the institution and not simply be a down-loadable add-on

Key Points

bull Bold approaches are needed to maximize the potentialof big data in modern biomedical research and tobreak down the silos in which they are currently held

bull Collating storing transferring and integratively analy-sing vast amounts of heterogeneous multimodal datarequires a comprehensive and cohesive integromicsstrategy that streamlines biomarker discovery byexposing fresh insights into the underlying factorscontributing to disease phenotypes

bull We explore the practical considerations behind imple-menting an integromics framework using our experi-ences and design choices with PICan as a case studywhile recognizing that the global diversity of bio-marker discovery research programmes will requiretailoring solutions to each research environment

bull Successful integromics frameworks are supported bydynamic interdisciplinary teams and are inherentlyadaptable to new technologies and the host institu-tionrsquos evolving priorities

bull Integromics must be a central part of the institutionalculture and not just a downloadable add-on

Conflict of interest

Professor Peter Hamilton is the founder of and non-executivedirector with PathXL Ltd Professor Manuel Salto-Tellez isSenior Advisor to PathXL Ltd

Funding

The research leading to these results has received funding fromthe People Programme (Marie Curie Actions) of the EuropeanUnionrsquos Seventh Framework Programme FP72007ndash2013underREA grant agreement (285910 to PWH) Invest NorthernIreland (RDO0712612 to PWH) Cancer Research UK Accelerator(C11512A20256 to PWHMS-T) CRUK (to PB) The NorthernIreland Molecular Pathology Laboratory is supported by CancerResearch UK Experimental Cancer Medicine Centre Networkthe NI Health and Social Care Research and DevelopmentDivision the Sean Crummey Memorial Fund the Tom SimmsMemorial Fund and the Friends of the Cancer Centre (to MS-T) The Northern Ireland Biobank is funded by the Health andSocial Care Research and Development Division of the PublicHealth Agency in Northern Ireland and Cancer Research UKthrough the Belfast CRUK Centre and Northern IrelandExperimental Cancer Medicine Centre additional support wasreceived from the Friends of the Cancer Centre (to JAJ)

References1Canuel V Rance B Avillach P et al Translational research plat-

forms integrating clinical and omics data a review of publiclyavailable solutions Brief Bioinform 201516280ndash90

2 Noor AM Holmberg L Gillett C et al Big Data the challengefor small research groups in the era of cancer genomics Br JCancer 20151131405ndash12

3 PathXL Xplore httpwwwpathxlcompathxl-researchpathxl-xplore (20 November 2015 date last accessed)

4 Biomarker Interpretation Workflow - OncoMark httpwwwoncomarkcomgoproductsbiomarker-interpretation-workflow (23 November 2015 date last accessed)

5 Denaxas SC George J Herrett E et al Data resource profilecardiovascular disease research using linked bespoke studiesand electronic health records (CALIBER) Int J Epidemiol2012411625ndash38

6 Lubbock ALR Katz E Harrison DJ et al TMA Navigatornetwork inference patient stratification and survival ana-lysis with tissue microarray data Nucleic Acids Res 201341W562ndash8

7 Giardine B Riemer C Hardison RC et al Galaxy a platform forinteractive large-scale genome analysis Genome Res 2005151451ndash5

8 Goecks J Nekrutenko A Taylor J Galaxy a comprehensiveapproach for supporting accessible reproducible and trans-parent computational research in the life sciences GenomeBiol 201011R86

9 Madhavan S Gusev Y Harris M et al G-DOC a systems medi-cine platform for personalized oncology Neoplasia 201113771ndash83

10Dander A Baldauf M Sperk M et al Personalized OncologySuite integrating next-generation sequencing data andwhole-slide bioimages BMC Bioinformatics 201415306

11Kartashov AV Barski A BioWardrobe an integrated platformfor analysis of epigenomics and transcriptomics data GenomeBiol 201516158

12Harris PA Taylor R Thielke R et al Research electronic datacapture (REDCap)ndasha metadata-driven methodology andworkflow process for providing translational research in-formatics support J Biomed Inform 200942377ndash81

13Szalma S Koka V Khasanova T et al Effective knowledgemanagement in translational medicine J Transl Med201081ndash9

14Manley S Mucci NR De Marzo AM et al Relational databasestructure to manage high-density tissue microarray data andimages for pathology studies focusing on clinical outcomethe prostate specialized program of research excellencemodel Am J Pathol 2001159837ndash43

15Weinstein JN Integromic analysis of the NCI-60 cancer celllines Breast Dis 20041911ndash22

16Weinstein JN Pommier Y Transcriptomic analysis of theNCI-60 cancer cell lines C R Biol 2003326909ndash20

17Yu P Artz D Warner J Electronic health records (EHRs) sup-porting ASCOrsquos vision of cancer care Am Soc Clin Oncol EducBook 2014225ndash31

18McArt DG Blayney JK Boyle DP et al PICan an integromicsframework for dynamic cancer biomarker discovery MolOncol 201591234ndash40

19Hudson TJ Anderson W The International Cancer GenomeConsortium et al International network of cancer genomeprojects Nature 2010464993ndash8

20Home j Global Alliance for Genomics and Health httpsgenomicsandhealthorg (15 March 2016 date last accessed)

21The 1000 Genomes Project Consortium A global reference forhuman genetic variation Nature 201552668ndash74

22Collins FS Varmus H A New Initiative on Precision MedicineN Engl J Med 2015372793ndash5

23Kolesnikov N Hastings E Keays M et al ArrayExpress up-datendashsimplifying data submissions Nucleic Acids Res 201543D1113ndash6

Integromics in cancer tissue biomarker research | 11

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

24Leinonen R Akhtar R Birney E et al The European NucleotideArchive Nucleic Acids Res 201139D28ndash31

25The Cancer Genome Atlas Research NetworkComprehensive genomic characterization defines humanglioblastoma genes and core pathways Nature 20084551061ndash8

26Home - The Cancer Genome Atlas - Cancer Genome -TCGAhttpcancergenomenihgov (1 December 2015 datelast accessed)

27Kristensen VN Lingjaeligrde OC Russnes HG et al Principlesand methods of integrative genomic analyses in cancer NatRev Cancer 201414299ndash313

28Guinney J Dienstmann R Wang X et al The consensusmolecular subtypes of colorectal cancer Nat Med 2015211350ndash6

29Gao J Aksoy BA Dogrusoz U et al Integrative analysis ofcomplex cancer genomics and clinical profiles using thecBioPortal Sci Signal 20136pl1

30Cerami E Gao J Dogrusoz U et al The cBio cancer genomicsportal an open platform for exploring multidimensional can-cer genomics data Cancer Discov 20122401ndash4

31Forbes SA Beare D Gunasekaran P et al COSMIC exploringthe worldrsquos knowledge of somatic mutations in human can-cer Nucleic Acids Res 201543D805ndash11

32Broad GDAC Firehose httpgdacbroadinstituteorg (22March 2016 date last accessed)

33Le Cao K-A Gonzalez I Dejean S integrOmics an R packageto unravel relationships between two omics datasetsBioinformatics 2009252855ndash6

34Waller T Gubała T Sarapata K et al DNA microarray integro-mics analysis platform BioData Min 2015818

35Cecic IK Li G MacAulay C Technologies supporting analyt-ical cytology clinical research and drug discovery applica-tions J Biophotonics 20125313ndash26

36Rocha R Vassallo J Soares F et al Digital slides present sta-tus of a tool for consultation teaching and quality control inpathology Pathol Res Pract 2009205735ndash41

37Pantanowitz L Valenstein P Evans A et al Review of the cur-rent state of whole slide imaging in pathology J Pathol Inform2011236

38Brachtel E Yagi Y Digital imaging in pathologyndashcurrent ap-plications and challenges J Biophotonics 20125327ndash35

39Hamilton PW Bankhead P Wang Y et al Digital pathologyand image analysis in tissue biomarker research Methods20147059ndash73

40Schneider CA Rasband WS Eliceiri KW NIH Image to ImageJ25 years of image analysis Nat Methods 20129671ndash5

41Schindelin J Arganda-Carreras I Frise E et al Fiji - an OpenSource platform for biological image analysis Nat Methods20129676ndash82 101038nmeth2019

42de Chaumont F Dallongeville S Chenouard N et al Icy anopen bioimage informatics platform for extended reprodu-cible research Nat Methods 20129690ndash6

43Sommer C Straehle C Kothe U et al Ilastik interactive learn-ing and segmentation toolkit In Biomedical Imaging fromNano to Macro 2011 IEEE International Symposium on 2011pp 230ndash233 IEEE

44Hamilton PW van Diest PJ Williams R et al Do we see whatwe think we see The complexities of morphological assess-ment J Pathol 2009218285ndash91

45Salto-Tellez M James JA Hamilton PW Molecular path-ologymdashthe value of an integrative approach Mol Oncol201481163ndash8

46Li G van Niekerk D Miller D et al Molecular fixative enablesexpression microarray analysis of microdissected clinicalcervical specimens Exp Mol Pathol 201496168ndash77

47Galon J Mlecnik B Bindea G et al Towards the introductionof the lsquoImmunoscorersquo in the classification of malignant tu-mours J Pathol 2014232199ndash209

48R Core Team R A Language and Environment for StatisticalComputing Vienna Austria R Foundation for StatisticalComputing 2015

49Baier T Neuwirth E Excel COM R Comput Stat 20072291ndash10850McCarty KS Miller LS Cox EB et al Estrogen receptor ana-

lyses Correlation of biochemical and immunohistochemicalmethods using monoclonal antireceptor antibodies ArchPathol Lab Med 1985109716ndash21

51Goulding H Pinder S Cannon P et al A new immunohisto-chemical antibody for the assessment of estrogen receptorstatus on routine formalin-fixed tissue samples Hum Pathol199526291ndash4

52Leek JT Scharpf RB Bravo HC et al Tackling the widespreadand critical impact of batch effects in high-throughput dataNat Rev Genet 201011733ndash9

53Lauss M Visne I Kriegner A et al Monitoring of TechnicalVariation in Quantitative High-Throughput Datasets CancerInform 201312193ndash201

54Codd EF Normalized Data Base Structure a Brief Tutorial InProceedings of the 1971 ACM SIGFIDET (Now SIGMOD)Workshop on Data Description Access and Control SanDiego CA USA 1971 pp 1ndash17 Association for ComputingMachinery New York NY

55Kacprzyk J Zadro _zny S De Tre G Fuzziness in database man-agement systems half a century of developments and futureprospects Fuzzy Sets Syst 2015281300ndash7

56Bosc P Pivert O Farquhar K Integrating fuzzy queries intoan existing database management system an example Int JIntell Syst 19949475ndash92

57Duggal R Khatri SK Shukla B Improving patient matchingsingle patient view for Clinical Decision Support using BigData analytics In 2015 4th International Conference onReliability Infocom Technologies and Optimization (ICRITO)(Trends and Future Directions) Noida India 2015 pp 1ndash6 IEEENew York NY

58OpenSeadragon httpsopenseadragongithubio (7 April2016 date last accessed)

59Allred DC Bustamante MA Daniel CO et alImmunocytochemical analysis of estrogen receptors inhuman breast carcinomas evaluation of 130 cases and re-view of the literature regarding concordance with biochem-ical assay and clinical relevance Arch Surg 1990125107ndash13

60Singh R Chubb L Pantanowitz L et al Standardization in digi-tal pathology supplement 145 of the DICOM standardsJ Pathol Inform 2011223

61Linkert M Rueden CT Allan C et al Metadata matters accessto image data in the real world J Cell Biol 2010189777ndash82

62The HDF Group HDF Group - HDF5 httpswwwhdfgrouporgHDF5 (3 January 2016 date last accessed)

63Sommer C Held M Fischer B et al CellH5 a format for dataexchange in high-content screening Bioinformatics 2013291580ndash2

64Allan C Burel J-M Moore J et al OMERO flexible model-driven data management for experimental biology NatMethods 20129245ndash53

65Maree R Rollus L Stevens B et al Collaborative analysis ofmulti-gigapixel imaging data using Cytomine Bioinformatics2016321395ndash1401

12 | Li et al

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

66DigitalScope V5 httpwwwdigitalscopeorg (26 November2015 date last accessed)

67Astier Y Braha O Bayley H Toward single molecule DNAsequencing direct identification of ribonucleoside and deox-yribonucleoside 5lsquo-monophosphates by using an engineeredprotein nanopore equipped with a molecular adapter J AmChem Soc 20061281705ndash10

68Stoddart D Heron AJ Mikhailova E et al Single-nucleotidediscrimination in immobilized DNA oligonucleotides with abiological nanopore Proc Natl Acad Sci USA 20091067702ndash7

69Flusberg BA Webster D Lee J et al Direct detection of DNAmethylation during single-molecule real-time sequencingNat Methods 20107461ndash5

70Roberts RJ Carneiro MO Schatz MC The advantages of SMRTsequencing Genome Biol 201314405

71Sheth AP Larson JA Federated database systems for manag-ing distributed heterogeneous and autonomous databasesACM Comput Surv 199022183ndash236

72Baker EJ Biological databases for behavioral neurobiology IntRev Neurobiol 201210319ndash38

73Peuroaeuroakkonen P Pakkala D Reference architecture and classifi-cation of technologies products and services for big data sys-tems Big Data Res 20152166ndash86

74Haider S Ballester B Smedley D et al BioMart Central Portalndashunified access to biological data Nucleic Acids Res 200937W23ndash7

75Barrett T Wilhite SE Ledoux P et al NCBI GEO archive forfunctional genomics data setsndashupdate Nucleic Acids Res201341D991ndash5

76Goode A Gilbert B Harkes J et al OpenSlide a vendor-neutralsoftware foundation for digital pathology J Pathol Inform2013427

77Pearson WR Lipman DJ Improved tools for biological se-quence comparison Proc Natl Acad Sci USA 1988852444ndash8

78Berman JJ Edgerton ME Friedman BA The tissue microarraydata exchange specification a community-based opensource tool for sharing tissue microarray data BMC MedInform Decis Mak 200335

79Unified Medical Language System (UMLS) - Home httpswwwnlmnihgovresearchumls (20 November 2015 datelast accessed)

80Komatsoulis GA Warzel DB Hartel FW et al caCORE version3 implementation of a model driven service-oriented archi-tecture for semantic interoperability J Biomed Inform 200841106ndash23

81Covitz PA Hartel F Schaefer C et al caCORE a common infra-structure for cancer informatics Bioinformatics 200319 2404ndash12

82Papatheodorou I Crichton C Morris L et al A metadata ap-proach for clinical data management in translational gen-omics studies in breast cancer BMC Med Genomics 2009266

83Mohanty SK Mistry AT Amin W et al The development anddeployment of Common Data Elements for tissue banks fortranslational research in cancer - an emerging standardbased approach for the Mesothelioma Virtual Tissue BankBMC Cancer 2008891

84 Patel AA Kajdacsy-Balla A Berman JJ et al The developmentof common data elements for a multi-institute prostate can-cer tissue bank the Cooperative Prostate Cancer TissueResource (CPCTR) experience BMC Cancer 20055108

85 Shiny httpshinyrstudiocom (6 April 2016 date lastaccessed)

86 RNET documentation ndash user version j RNET ndash user versionhttpjmp75githubiordotnet (6 April 2016 date lastaccessed)

87 matplotlib python plotting mdash Matplotlib 151 documenta-tion httpmatplotliborg (6 April 2016 date last accessed)

88 D3js - Data-Driven Documents httpsd3jsorg (6 April2016 date last accessed)

89 Fereidouni F Datta-Mitra A Demos S et al Microscopy withUV Surface Excitation (MUSE) for slide-free histology andpathology imaging In Progress in Biomedical Optics andImaging - Proceedings of SPIE San Francisco CA USA 2015p 93180F SPIE Bellingham WA USA

90 Bird B Miljkovic M Romeo MJ et al Infrared micro-spectralimaging distinction of tissue types in axillary lymph nodehistology BMC Clin Pathol 200888

91 Taleb A Diamond J McGarvey JJ et al Raman Microscopy forthe Chemometric Analysis of Tumor Cells J Phys Chem B200611019625ndash31

92 Guze K Pawluk HC Short M et al Pilot study Raman spec-troscopy in differentiating premalignant and malignant orallesions from normal mucosa and benign lesions in humansHead Neck 201537511ndash7

93 McNamee R Regression modelling and other methods tocontrol confounding Occup Environ Med 200562500ndash6

94 dos Santos Silva I Dealing with confounding in the analysisIn Cancer Epidemiology Principles and Methods LyonFrance International Agency for Research on Cancer 1999305ndash31

95 Queenrsquos University Belfast j Postgraduate and ProfessionalDevelopment j MSc Bioinformatics and ComputationalGenomics httpwwwqubacukschoolsmdbspgdPTMScBioinformaticsandComputationalGenomics (24 March2016 date last accessed)

96 Dhir R Patel AA Winters S et al A multidisciplinary ap-proach to honest broker services for tissue banks and clin-ical data a pragmatic and practical model Cancer 20081131705ndash15

97 Boyd AD Hosner C Hunscher DA et al An lsquoHonest Brokerrsquomechanism to maintain privacy for patient care and aca-demic medical research Int J Med Inform 200776407ndash11

98 Pavlou MP Diamandis EP Blasutig IM The long journey ofcancer biomarkers from the bench to the clinic Clin Chem201359147ndash57

99 ELIXIR Data for life httpswwwelixir-europeorg (24March 2016 date last accessed)

100 Research Infrastructure for Imaging Technologies inBiological and Biomedical Sciences j Euro-BioImaginghttpwwweurobioimagingeu (24 March 2016 date lastaccessed)

Integromics in cancer tissue biomarker research | 13

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

Page 3: Embracing an integromic approach to tissue biomarker ... › download › pdf › 33595853.pdf · Her research uses integrative molecular approaches for biomarker discovery and validation

Introduction

Modern basic scientific research approaches and clinical studiesof disease are now demanding the requirement to store man-age and interrogate enormous amounts of information Hiddenamong these large complex multivariate data are the patternsand clues needed to understand disease processes and drivenew methods for diagnosis prognosis therapeutics and preci-sion or personalized medicine Commercial and in-house solu-tions [1ndash14] have been developed in an attempt to make themost of this data deluge Describing this age of big data-drivenbiomarker research Weinstein defined integromics as thelsquomelding of diverse types of data from different experimentalplatformsrsquo [15 16] while the American Society for ClinicalOncology uses the term lsquopanomicsrsquo to describe similar needs[17]

Given the centrality of pathology to solid tumour analysiswe previously published on our own framework for multimodaldata amalgamation and integrative analysis called PathologyIntegromics in Cancer (PICan) [18] Since then we have been ap-proached by researchers who want to establish similar systemsat their own institutions This led us to reflect on the process ofbuilding PICan and the appropriateness of packaging and dis-tributing PICan as source code However the diversity of clin-ical experimental and research setups means that an optimalintegromics strategy for each institution might be vastly differ-ent from the next Hence we present here a more detailed dis-cussion on the technical considerations underpinning theformation of an integromics environment to accelerate tissuebiomarker research with a focus on cancer Key framework de-cisions were shaped by our local environment but this articlewill also discuss some alternative solutions that may better suitother research groups organizations and multi-institutionalcollaborations This perspective is structured around severalthemes with reference back to PICan as a case study integro-mics overview database structure data exchange multimodalanalysis interdisciplinary collaboration and data control Weconclude with a discussion on the central role integromics willplay in the modern era of large-scale multicentre trials and glo-bal research consortia [19ndash21] We hope that the concepts raisedwill provide valuable insights from a more detailed technicalperspective for institutions and research groups wishing to de-velop their own integromics framework

Precision medicine and integromics for biomarkerdiscovery in cancer

The modern precision medicine approach aims to stratify pa-tients based on specific biomarkers that will inform the most ef-fective treatment for each patient cutting costs and reducingharm from ineffective treatments [22] Novel therapeutics targetspecific cancer-causing mechanisms inducing fewer side ef-fects but requiring companion biomarker diagnostics withquick turnaround times to identify the patients who will bene-fit most As multiple patient and tumour characteristics cancontribute to a treatmentrsquos effectiveness this drive towards pre-cision medicine is inevitably spurring an increasing demand forunbiased automated data collection and analysis across amultitude of high-throughput analytical platforms The currentpace of data generation is far outstripping the ability to analysethese data Efforts by journals and grant-funding agencies to in-crease data transparency and encourage data sharing have re-sulted in large repositories of publicly available data such asArrayExpress [23] and the European Nucleotide Archive [24]

The ability to access these databases has increased the interestin integromics or panomics the integrative analysis of multiplebig data types This is especially true in molecular analyses [7ndash911 19 25ndash28] where tools such as Partek Genomics Suite(Partek Inc St Louis MO USA) GeneSpring (AgilentTechnologies Santa Clara CA USA) tools that harness TheCancer Genome Atlas (TCGA) [25 26] data portal (eg cBioPortalfor Cancer Genomics [developed and maintained by the Centerfor Molecular Oncology and the Computational Biology Centerat Memorial Sloan-Kettering Cancer Center] [29 30] COSMIC[31] and the Broad Institute TCGA GDAC Firehose [32]) andothers [33 34] can facilitate statistical analysis and biologicalcontextualization of omics data However in the push towardsdevelopment of precision medicine there remains an urgentneed for an integromics platform that takes into account non-molecular data Currently most of these data types are still heldin separate or loosely bound silos with the onus resting with re-search groups around the world to seek out these disparate silosand attempt to harmonize the data Even then different datatypes derived from different patient cohorts make it challengingto integrate the data at the patient level This limits analysis be-tween data types to comparisons of cohort-level summariespotentially obscuring correlations within the data that mayhave been apparent if studied at the patient level

Integration of tissue phenotype through digitalpathology and image analytics

Most molecular analyses on their own seldom consider thehistological distribution of biomarkers This leads to blind spotsin our ability to interpret and understand molecular analysessuch as the potential implications of tumour heterogeneity andwhether it might correlate with any specific genetic or epigen-etic changes Computer-aided image analysis applied to cancerdiagnostics and investigative biomarker research has been inuse for several decades [35] However the combination of digitalpathology with whole slide imaging has truly catapulted thefield into the era of big data Entire stained tissue and tissuemicroarray (TMA) sections each potentially comprising hun-dreds of samples across hundreds of patients can now be digi-tized archived and disseminated without the need to store anddistribute the original glass slides [36ndash38] Digital image acquisi-tion is now commonplace with multiple established companiessuch as Aperio (Leica Biosystems Nussloch Germany)Hamamatsu Photonics (Hamamatsu City Japan) Carl Zeiss(Oberkochen Germany) and 3D Histech (Budapest Hungary)offering scanning hardware capable of producing high-resolution digital slide images Aside from greater ease of col-laboration whole slide imaging unlocks the analytical potentialof digital image analysis Algorithms can be trained to identifytumour regions delineate margins classify cellular stainingand quantify the results for a wide range of histological pheno-types and cellular markers [39] Many software packagesincluding commercial (eg PathXL TissueMark Belfast UKDefiniens TissueStudio Munich Germany Aperio Genie andVisiopharm Visiomorph Hoersholm Denmark) in-house(QuPath Queenrsquos University Belfast) and community-driven(eg ImageJ [40] Fiji [41] Icy [42] and Ilastik [43]) offerings nowoffer access to such algorithms Computerized image interpret-ation reduces the subjectivity of manual interpretation [44] andonce properly trained can batch process a large number of slideimages at once Applied to TMAs high-throughput tissue inter-pretation can greatly accelerate tissue biomarker discovery

2 | Li et al

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

pipelines especially when coupled with the wealth of moleculardata currently becoming increasingly accessible

In light of this true integromics today ought to embraceboth molecular and digital pathology to uncover and validate amore diverse and robust set of biomarkers that will be key to de-livering on the promises of precision medicine [45] Tissue path-ology and histology are essential to the application of morerecent molecular methods in solid tumours required for deter-mining the presence and proportion of tumour cells to ensuresample sufficiency for downstream analysis of nucleic acidsand for directing microdissection of tissue samples for molecu-lar analysis [46] Moreover tissue context and cellular pheno-type are central to understanding tumour heterogeneity andthe underlying genomic map of solid tumours TheImmunoscore biomarker [47] for example applies image ana-lysis to contextualize the interpretation of immunohistochemi-cal staining The understanding of both how genotype drivesphenotype and how phenotypic information influences inter-pretation of genotype are required for a holistic view of cancerand are therefore essential elements of a modern integromicsapproach

Building large-scale digital resources PICanoverview

Reflecting this broader view of integromics we created PICan tobe a comprehensive framework for the collection integrationand analysis of clinical pathological molecular and digitalimage analysis data sets for the purposes of cancer biomarkerdiscovery and validation [18] PICan sits at the heart of our bio-marker discovery program acting as both a central hub andcatalyst (Figure 1) PICan streamlines the process of bringing to-gether diverse streams of multimodal data from multiple cancertypes to generate clinically and biologically relevant analysesand reports via a bespoke access-controlled web-based inter-face written in ASPNETC with a connection to the R statis-tical environment [48] through statconn [49] At its core is arestricted-access MySQL database to house this collated patientdata Currently there are 3660 patients representing eight can-cer types recorded in the system together with fully curateddata on pathological diagnosis and clinical outcome Fromthese nearly 12 000 TMA cores have been stained immunohis-tochemically and scored for a broad range of well-establishedand novel cellular biomarkers that are of interest to our basicscience translational and clinical research collaborators Withthe ongoing rapid pace of research activity at our institutionthese numbers are set to double over the course of the nextyear Data de-identification using study-specific unique identi-fiers protects patient privacy while authentication and author-ization of users enables a balance to be struck between enablingaccess to users and safeguarding the rights of data collectors(clinicians wet lab scientists and other collaborators) to publi-cation priority and control of their own data Only administra-tors have direct access to the PICan database while end useraccess is provided exclusively through the access-controlledanalysis and report generation web interface Data quality is akey priority to ensure the integrity of downstream analyses asany analysis is only as good as the weakest data supporting itso data are curated with biological and clinical input before up-load Meanwhile bioinformaticians tailor their pipelines ac-cording to the needs of the research or analysis Consequentlyin addition to the central database a complete integromicspipeline must necessarily encompass data collection data

curation and biologically driven analysis and interpretation ofthis data activities which we consider integral parts of any inte-gromics framework

Database design data pre-processing curationand structure

Traditional single-mode data analysis pipelines take raw data(eg sequencing reads microarray spot intensities whole slideimage files) and process it through a series of steps into a finalform that informs a particular biological narrative (eg list ofmutations and frequencies gene expression values H-scoresfor immunohistochemistry (IHC) scoring [50 51]) However rawdata such as raw fluorescent spot intensities from an expres-sion microarray experiment might not be suitable for multi-modal and dynamic analysis without corrections forbackground and batch effects [52 53] while data aggregationmight leave fully processed data such as a list of differentiallyexpressed genes too refined to be of further use in patient-levelresearch analysis A compromise is to capture data in a statethat is suitable and informative for subsequent analysis whiletracking metadata associated with acquisition and pre-processing Metadata or lsquodata about datarsquo captures informationabout the methods and conditions under which the data itselfwere obtained Such information can be used for example toidentify cases where an observed effect was the result of differ-ences in the laboratory or researcher conducting the experi-ments rather than the biological condition Going back toexpression microarray data as an example this could beachieved by recording background-corrected and normalizedprobe signal intensities with metadata detailing the nature andparameters for data pre-processing steps such as normaliza-

tion assay conditions instrumentation used including its set-tings and pre-analytical steps eg how the physical sampleswere obtained handled and processed and by whom For com-pleteness and traceability raw data can be stored alongside thisanalysis-ready data Additionally multiple lsquosnapshotsrsquo fromsingle-mode analysis pipelines can be stored as each could po-tentially serve as a starting point for multimodal analysisHowever care must be taken to ensure internal consistency sothat results are repeatable and reproducible whether workingup raw data or the stored analysis-ready data Consequentlyfor PICan we opted for quality and consistency by recordingonly the analysis-ready data most suitable for further integro-mic analysis A number of quality assurance and quality controlmeasures can be taken to ensure data integrity with the appro-priate level of adoption into standard operating proceduresbeing dependent on the overall level of accreditation of the la-boratory in question Metadata is stored alongside the analysis-ready data but raw data and most downstream single-modeanalysis results are presently excluded from PICan Future plansto increase data completeness and traceability include adding acentral independent repository for raw data files These repre-sent files not already in existence elsewhere (eg public reposi-tories) These raw data files will be read-only and bear uniqueidentifiers that link them to their analysis-ready counterpartson the main database A file path stored in the main databasewill allow the raw data files to be easily retrieved when neededAs these files will not be used for on-demand statistical ana-lysis they can be stored in their native formats

Integromics in cancer tissue biomarker research | 3

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

Relational database systems

Integromics software infrastructure can be built around one ofmany database management systems available each with itsown strengths and weaknesses As we envisioned PICan to in-corporate a semi-automated analysis engine [18] we prioritizeddata quality which maximizes the validity and reproducibilityof our results and preservation of data structure which stream-lines the automated retrieval and processing of big data sets ra-ther than attempting to blindly maximize data throughputReflecting these design priorities PICan was built around a rela-tional database (MySQL) consisting of interrelated data tablesthat reflect the process of constructing and harnessing TMAswhich are central to much of the biomarker discovery work con-ducted at our institution the Northern Ireland MolecularPathology Laboratory and through the Northern Ireland Biobank(NIB) (Figure 2) Every row of every table has a system-generatedprimary key and these are used as foreign keys to form relation-ships between tables The database centres on the Patient andBlock tables (table names will be italicized) reflecting our core vi-sion for PICan to drive holistic multimodal integromic analysisat a patient level One-to-many relationships link the Patienttable to tables containing the clinicopathological and treatmentdata capturing cancer-type-specific fields from templates de-veloped in collaboration with our clinical expert partners Thesetemplates also define permissible data inputs for each field (egage is a number sex is lsquoMrsquo or lsquoFrsquo) which are then encoded intothe database schema A patient may have had multiple tissuespecimens collected as tissue blocks each represented by a

separate linked entry in the Block table Linked via the Block tableare tables representing molecular assays (SangerSequenceDNAMicroarrayChip ExpressionLevel) whole face tissue sections(TissueSlide) TMA construction (Cylinder RecipientBlock TMASlideTMACore) IHC scoring (TissueScore BiomarkerScore) and digitalpathology (TissueVirtualSlide TMAVirtualSlide VirtualCoreDIAScore) all of which reflect primary activities within our insti-tutionrsquos biomarker discovery program Metadata is stored in thetable corresponding to the relevant physical entity and soimaging parameters for a whole face tissue slide for examplewould go in the TissueVirtualSlide table While this databasestructure is well suited to our needs laboratories collectingnon-tissue specimens such as blood sputum or urine may findit inappropriate for their needs A Blood table might sit at theheart of a haematological laboratoryrsquos database for examplewhile other institutions may need a cluster of tables to repre-sent different specimen types each linked to tables for variousbiochemical assays

PICan uses a modular internal structure to support variousdata types Each table represents a physical entity and conse-quently a data type This allows the addition of new tables torepresent data types from emerging technologies as they ma-ture As maximizing utility of new tables will require updates tothe end-user web interface and setting the appropriate foreignkeys will require knowledge of the internal structure of thePICan database only the system administrator can add newtables

Another consideration for relational database design is nor-malization This procedure simplifies the data structure by

Figure 1 PICan sits at the heart of an integrated multidisciplinary biomarker discovery and validation pipeline Multiple curated data sources feed into PICan which

accelerates downstream analysis and report generation

4 | Li et al

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

removing duplication of data ensuring that the table is resilientagainst inconsistencies arising from incomplete addition dele-tion or modification of data As described in [54] this isachieved by identifying groups of related or dependent datafields and splitting them off into a separate table using foreignkeys to record relationships between tables For most practicalapplications the third normal form balances a robust architec-ture without overly cluttering the schema with an untenablenumber of tables Normalization approaches can also be used tooptimize database speed and performance

A consequence of our decision to use a relational database isthat free-form data needs to be harmonized with the data struc-ture presented by PICan before data entry The requisite cur-ation steps prioritize data integrity and prevent poor-qualitydata that might impede proper and meaningful analysis fromentering the database (garbage-in-garbage-out) Clinical andpathological data collection and storage are guided by tem-plates collections of well-defined data fields that have been de-veloped through our close working relationships with the NIBNorthern Ireland Cancer Registry Northern Ireland Health andSocial Care Trusts and clinicians These collaborators share thedesire to drive data integration and provide expertise contextand data access to harmonize data templates around commonvocabulary and permissible data values ensuring maximal clin-ical relevance of PICan In addition to the harmonized common

fields like age and sex PICan also supports cancer-type-specificfields for maximum flexibility

Non-relational database structures

While a structured relational database has served PICan wellthere are alternative models that might better suit other re-search integromics environments MongoDB for example is anon-relational (or lsquoNoSQLrsquo) database architecture that uses adocument model where each entity (a tissue block for ex-ample) is represented by a lsquodocumentrsquo that might contain hun-dreds of attributes as key-value pairs This model works with aflexible schema allowing data fields to be added or omitted ona case-by-case basis while harmonization of terminology anddata fields occurs at data retrieval In a research environmentespecially at the beginning of a study when pipelines and meta-data are likely to change it is easier to update the data modelthan to redesign the fixed schema of a relational databaseHowever with less urgency to curate data before upload dataintegrity may be variable with errors remaining unnoticed andconfounding analyses performed on the data Fuzzy matchingalgorithms [55ndash57] can mitigate the impact of typographical andsome clerical errors but these approaches must still be rigor-ously validated In some cases it might be impossible even for ahuman expert to deduce the original intention Data must then

Figure 2 Schematic of the internal table structure of PICanrsquos relational database reflecting the core activities and workflows that comprise the biomarker discovery

and validation program at our institution Most tables represent physical or digital entities within these workflows Where table names may be unclear a brief descrip-

tion has been included along with some example data fields (in parentheses)

Integromics in cancer tissue biomarker research | 5

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

be discarded from analysis potentially reducing the statisticalpower Manual curation at the outset might identify such caseswhen there is still an opportunity to seek clarification from theoriginal data collector Ideally ambiguous annotations and bor-derline cases should be resolved in consultation with a suitablyqualified medical professional before they enter the databaserather than leaving them open to interpretation by downstreaminformaticians Likewise missing or contradictory data pointsare preferably identified and where possible corrected beforeentry into the database However a major advantage of an un-structured database model is the ability to more faithfully cap-ture the complexities of the patientrsquos clinical pathological andother data parameters including free-form text fields such asclinical notes The ability to capture patient data in its nativeform also improves the traceability and scalability of data entryinto a non-structured database solution by eliminating the la-bour and documentation associated with data curation andenabling automated bulk data upload Despite the tools wehave developed to semi-automate data upload this will remaina relative drawback of manual curation in the short term espe-cially for clinical and pathological data fields In the longerterm scalability may be improved with increased adoption ofelectronic medical records wherein many of these data fieldscould be completed by the attending physician when encoun-tering the patient

Balancing the data curation afforded by the structured rela-tional database with the high-throughput flexibility of unstruc-tured models are various hybrid options For example PICanoffers flexibility to the structured database model by includingfreeform text fields to capture unstructured notes (in our experi-ence these are typically for clinical notes) in several of the datatables As these fields are not referenced directly by the auto-mated online analysis component which requires well-definedparameters they are hidden from analysis but can be retrievedby users wishing to conduct further data refinement or offlineanalysis In practice however it is more likely that any fieldsnot used in analysis by the system will simply be ignored andneglected Another hybrid solution we are actively exploring toincrease data throughput is to allow the importation of non-curated data sets for end-user analysis In this model curatedlsquopublicrsquo data sets would be available to all users (with necessaryapprovals and access restrictions in place) while each end userwill be able to upload lsquoprivatersquo data sets to be analysed along-side or integrated with the curated data The user then is re-sponsible for ensuring the quality of hisher own dataExtending the hybrid model concept further are heterogeneousdatabases in which only the most important unchanging andsensitive data are typically stored in a relational databaseMeanwhile the bulk of the data in terms of size will be storedin non-relational databases or as raw data with links stored inthe main database to ensure ease of access and retrieval

Image analysis data storage and management

Current work is focussing on improving the support of digitalimage analysis data an emerging area of research in which ourgroup has significant expertise Imaging data presents someunique challenges for database storage design The imagesthemselves can be large One of our typical tissue slidesscanned at 40objective magnification (025 lm per pixel) canhave 2ndash10 billion pixels Imaging in RGB at 8 bits per channelgives uncompressed file sizes of up to 30 GB Additional colourchannels (such as for fluorescence imaging) z-stacks and dataredundancy in the form of image pyramids for improved image

loading performance would all greatly increase file sizesCompression typically JPEG or JPEG2000 for whole slide imagescan significantly reduce file size without drastically impairingvisual assessment although suitability for image analysis ismore variable [39] Hence it is vital to store the original slideimage to allow a pathologist to go back and verify that imageanalysis results are consistent with hisher expectations This iscurrently achieved by hosting the slide images on remote ser-vers maintained by PathXL A link is stored in the PICan data-base that allows the web-based interface to retrieve the imagedata on user request via an application programming interface(API) supplied by PathXL Pan zoom and overlay functionalitiesin the PICan web interface are powered by OpenSeadragon [58]As the samples are drawn from NIB collections NIB maintainscontrol of access and security using the PathXL software

In addition to the raw image pixels image analysis datamight include markups annotations and calculated metrics(such as integrated optical densities nuclear areas or immuno-histochemical H-scores [50 51]) Metrics can be at a cellularlevel or summarized across a region of interest (ROI) a slide oreven a set of slides Much of these data may be hierarchicallystructured with each slide potentially containing multiple ROIseach ROI containing many cells and each cell being associatedwith values for numerous metrics However the efficiency ofstoring image analysis data in a structured database is uncer-tain when a single image can contain thousands of annotationsFor now PICan stores only summary scores and metrics (eg H-score [50 51] Allred Score [59] percentage positive cells) whileusers wishing to interrogate the image data on a more granularlevel will need to rely on external specialist tools The challengeof storing and transferring image analysis data between sys-tems is an ongoing one with several potential approaches hav-ing been proposed including DICOM [60] OME-TIFF [61] variousHDF5-based formats [62 63] and proprietary vendor formatsbut without a clear front-runner DICOM and OME-TIFF are pri-marily image formats with flexible metadata support optimizedfor image viewing rather than storage of image analysis resultswith large numbers of image objects In contrast HDF5 capturesobject hierarchy relationships well but on its own it is not a na-tive image format Meanwhile proprietary formats are poten-tially encumbered by patenting and intellectual propertyconcerns

Integromics data management is patient-centric

Regardless of the database management system being used acritical characteristic of integromics analysis is patient-centricity The ultimate objective in bringing together data frommultiple experimental platforms is to build a more holistic pic-ture of the interplay between genotype and phenotype in thepatients under study As interest in amalgamating multipledata types has grown many platforms originally designed tohost one data type have been expanded to accommodate add-itional types of data Many biobank management software sys-tems allow addition of clinicopathological notes andexperimental assay results on top of sample tracking dataMeanwhile digital slide hosting platforms like OMERO [64]PathXL Xplore Cytomine [65] and DigitalScope (Aptia SystemsHouston TX used by the College of American Pathologists) [66]are beginning to support the addition of associated clinicopa-thological data as metadata Others like TCGA [25 26] origi-nated as molecular data repositories but have begun addingwhole slide imaging There is a clear trend towards convergenceof database models encompassing more and more data types

6 | Li et al

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

but their impact will be limited if these extra data are simplystored and forgotten Integromics is an active process not just ofbringing data together but more importantly of creative com-parisons and contrasts to expose underlying interactions under-pinning the cancer phenotype Many public repositories and bigdatabases such as TCGA draw from an assortment of patientcohorts so associations between different data types measuredon different cohorts are limited to cohort summaries poten-tially obscuring trends between data types that are only appar-ent at the individual patient level

Interacting with external resourcesmdashdataexchange between systems

With the current rapid expansion of big data analytics in bio-medicine integromics frameworks must remain flexibleenough to incorporate emerging and future technologiesStructured architecture and design of the overall integromicsdatabase framework can facilitate this However large data setslike next-generation sequencing (and upcoming third-gener-ation sequencing [67ndash70]) can be unwieldy in a MySQL databasewith typical whole exome or RNA sequencing runs comprising10ndash100 billion base pairs for a single sample so future integro-mics systems may house these types of data in databases thatare structured specifically for this purpose Instead of a singledatabase housing multiple data types a future integromicsdatabase will undoubtedly be structured as a fully distributedfederated database [71ndash74] with a central database storing iden-tifiers that interface with a cluster of highly specialized data-bases each housing a specific type of data in its native formatArguably PICan already uses a version of this model by out-sourcing digital slide image storage to remote servers Futuredevelopments may include enabling PICan to tap into publiclyavailable data sources and allowing users to analyse these pub-lic data sets in tandem with PICan data such as accessing GeneExpression Omnibus data through GEO2R [75]

One of the challenges when working with a federated data-base model is how to exchange data faithfully and efficientlybetween the various components Each constituent databasewill need an interface to interact seamlessly with the centraldatabase and control system Some of these will be availableoff-the-shelf or rely on published APIs while others such asthose interacting with in-house software may need significantdeveloper time and effort to implement Ultimately there areno shortcuts to establishing these interfaces and so institutionsshould aim to reduce the number of distinct interfaces bystandardizing data around a small set of formats For examplewhen working with whole slide images a single institutioncould strive to use only a single file format The various avail-able formats each have their own strengths and weaknessesbut the natural option will likely be the native format of thescanner used at the institution or a vendor-neutral format likeDICOM [60] or OME-TIFF [61] Fortunately many tools for work-ing with whole slide images support multiple file formats suchas DigitalScope [66] for remote web-based viewing of variousimaging formats Meanwhile projects like Bio-Formats (OpenMicroscopy Environment consortium) [61] and OpenSlide [76]further facilitate the viewing and exploitation of various imag-ing formats in other programming and analysis languages likeC Cthornthorn Python and Java Additionally it is important to stand-ardize the data transfer protocol whether this is throughInternet standards like hypertext transfer protocol physicalmeans like hard disks or some other agreed protocol Perhaps

establishing an institutional integromics strategy will encour-age adoption of standardized data formats by creating an ex-pectation among data generators that data should be portableand compatible with other data types for analysis

Data flow within an integromics environment can often bemultidirectional Even with only a single central data repositorythere will be instances when it is beneficial to export some ofthese data to a separate third-party platform whether for speci-alized analysis report generation or some other purpose AsPICanrsquos web-based user interface is designed to be user friendlyit supports only a limited number of types of analyses and re-port generation options Analyses not hard-coded into the inter-face for example would need to be performed lsquoofflinersquo In factthis is how new analyses are tested before being incorporatedinto PICan Similar to data import and storage data export froman integromics system is complicated by the wide range of datatypes that demand different formats to accommodate their in-herently different natures Efforts have been made in somefields to standardize and integrate data models such as thoseof the Global Alliance for Genomics and Health for genomicdata [20] Hence where widely recognized data exchange for-mats exist such as FASTAFASTQ files for nucleotide or aminoacid sequences [77] or the TMA data exchange standard [78]they should be prioritized for integromics support For otherdata types like digital image analysis however there remains aneed for a well-defined and widely accepted standard for dataexchange Whether it is possible or feasible to define a singleunifying integromic data exchange format that would encapsu-late multiple data types attached to individuals within a cohortof patient subjects remains an open question

Standardization of data vocabulary

Any data exchange standard needs to address both how dataare encoded (structure) as well as a common vocabulary forcommonly encountered fields CSV JSON XML and derivativesof these are popular formats that lend themselves well to struc-tured data Certainly CSV export is our default as although itdoes not fully represent the relational structure underpinningPICanrsquos database it mimics the basic tabular structure and canbe easily viewed by other researchers in common spreadsheetor statistical packages Non-tabular data like digital slide images(which are stored as links in the database) are exported as sep-arate files The greater challenge might be agreeing on a com-mon vocabulary Data field names need to be mapped fromPICanrsquos internal vocabulary to that of the target external plat-form This is not a trivial task as even professionals within thesame institution may use different names for the same metricSome clinicians may be more precise in how certain metrics aremeasured and recorded such as when some distinguish be-tween lymphatic and vascular invasion while others groupthem together as lymphovascular invasion Harmonizing be-tween cancer types can also be challenging hence the use ofcancer-type-specific tables in PICan for certain clinicopathologi-cal metrics Additionally even where a data exchange specifica-tion exists like the one for TMA reporting [78] there may beallowance for custom fields posing an extra challenge for align-ing under a common vocabulary Attempts have been made tostandardize terminology in some disciplines The US NationalLibrary of Medicine maintains a collection of controlled vocabu-laries known as the Unified Medical Language System [79]which incorporates SNOMED CT MeSH and OMIM amongmany other ontologies Another compendium is caCOREincluding the Enterprise Vocabulary Services and Cancer Data

Integromics in cancer tissue biomarker research | 7

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

Standards Repository of common data elements [80 81]Additionally some groups have devised cancer-type-specificlists such as for breast [82] mesothelioma [83] and prostate[84] In the absence of clear community-supported data stand-ards institutions are advised to agree on and use a minimal setof data export formats

Unique challenges of multimodal integrativeanalysis

Collecting and storing diverse data types is only half the battlein integromics unlocking the value of these data will comefrom analysis and interpretation Through a connection withthe R statistical environment via statconn [49] PICan is capableof on-demand statistical analysis graphical output and reportgeneration Besides R and statconn other similar frameworksand interfaces include Shiny [85] RNET [86] Python (eg mat-plotlib [87]) and D3js [88] While PICan is strictly a research toolan integromic approach will undoubtedly be essential to futureclinical decision support systems placing integromics at thefront line of patient care delivery

Multimodal analysis is particularly important when con-sidering the future of digital pathology and tissue imagingConventional light and fluorescence imaging provides the basisfor most studies in this field However the future may providenew imaging technologies using light outside of the normal vis-ible spectrum that may be more informative towards illuminat-ing the underlying cancer biology and delivering sensitive andspecific biomarkers of clinical outcome Numerous studies areemerging that exploit technologies such as ultraviolet [89] in-frared [90] and Raman spectroscopy [91 92] together with abroad range of associated multispectral technologies to gain abetter insight into tissue and cell characterization and to visual-ize compounds and chemistry not accessible to conventionalmicroscopy Integrating these imaging modalities into a singleframework to compare compound and deduce clinical utilitynot in isolation but together with conventional imaging meth-ods is essential Similarly there is capacity to integrate otherforms of medical imaging relevant to cancer discovery includ-ing positron emission tomography magnetic resonance imag-ing computed tomography and X-ray This can be used tounderstand the relationship between in vivo clinical imagingand ex vivo tissue imaging thereby translating markers that arecurrently only visible on microscopy to early in vivo detectionwith clinical imaging modalities In both settings the real valueis likely to come from the overlay and integration of imagingmodalities extending the capacity of tissue imaging on its ownand providing new visualization tools for both pathologists andradiologists that are more informative and consequently im-prove clinical decision making The integromics initiative dis-cussed in this article provides such a model for imaging as wellas molecular and clinical data blurring the edges of medicalspecialties and providing a unified framework for discovery

The basic tools of integromic data analysis are generallysimilar to other disciplines of bioinformatics Mean compari-sons clustering survival analyses machine learning and linearmodels for example can all be used as can the various pipe-lines that have been developed for next-generation sequencinggene expression microarrays and other molecular big dataomics techniques Integration of digital image analysis data inintegromic analysis can present its own distinct risks and re-wards Images and their associated markups and annotationsdo not lend themselves on their own to t-tests or clustering for

example Nevertheless these may be useful to contextualizeother pieces of data like IHC scores Calculated scores (egH-score) and metrics (eg integrated optical density nucleararea) can often be expressed as nameattribute pairs so theylend themselves more immediately to integrative analysis al-though data structure and hierarchy can still complicate ana-lysis Some metrics can be measured at a cellular level andsummarized at an ROI or whole slide level Others howevermight not be meaningful at a cellular level (eg H-score) so onemust be mindful of the meaning of each measure when incor-porating it into a multivariate model of the data Tumour het-erogeneity is also an important complication Summary scorescomputed across different regions of the same tumour may dif-fer markedly Heterogeneity at least in solid tumours can bemore easily mapped in tissue sections where tissue structureand cellular context is maintained and intact Spatial changesin tumour cell arrangement can be visualized microscopicallyand used to sample regions for molecular analysis and themeasurement of intratumoural heterogeneity Whole slideimaging and digital image analysis represent powerful tools toquantitatively map tissue phenotype and provide a sound basisfor biomarker assay development while opening up new av-enues of inquiry For example one can begin to explore whethercertain morphometric phenotypes correlate with specific gen-etic or epigenetic profiles and whether these have any signifi-cant prognostic or therapeutic impact

A specific challenge for integromics is the application ofstatistical techniques to data from disparate sources Care mustbe taken to ensure differences observed between experimentalgroups reflect the underlying biology and not simply batch ef-fects Differences in experimental assay platforms such as dif-ferent probes targeting the same gene or different antibodies forIHC may produce discrepancies in the data based on differentlevels of sensitivity and specificity In digital pathology differ-ences in imaging hardware analysis algorithms and parameterdefinitions may lead to various imaging artefacts and interpret-ation difficulties Even within the same software cell countsdetermined from different threshold settings are not compar-able Less visible confounders include study and patient vari-ables like recruitment methodology study admission criteriapatient stratificationdiagnostic criteria pre-collection treat-ments and standard of care (for prognostic biomarker studies)that can affect the makeup of the study patient populationMeanwhile specimen handling variables like cold ischemictime fixation method staining protocols and formulations canhave a profound effect on the data often dominating other con-founders regardless of the subsequent pipeline Ideally thendata harmonization begins well before data are generated withstandardization and thorough documentation of all pre-analytical steps including patient selection tissue collectionand specimen handling Where potential batch effects or con-founding variables have been identified a number of strategiescan be used such as subset analysis (ie stratifying the subjectpopulation on the confounding variable) or regression model-ling [93 94]

Mandated public release of large data sets has been helpful inadvancing transparency in science but tracking the associatedmetadata and assessing its impact on results derived from thesebig data sets remains a major unresolved challenge Even largerepositories like TCGA are composed of smaller collections froma wide range of geographies and institutional environments Inmany cases pooling such diverse data sets becomes an apples-to-oranges comparison Batch effects in TCGA and other largehigh-throughput data sets are well-documented [52 53] Without

8 | Li et al

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

control over data origins and consistency it is essential that theentire process be rigorously validated before interpretation of re-sults What local data sets lack in scale they make up for in trace-ability For PICan we require that our collaborators are able tostand over every data point that enters our system therebyensuring a higher standard of quality and confidence in our dataMetadata recording technical and experimental details are storedin PICan alongside the data Regardless of how it is done meta-data tracking should be rigorous and thorough as differences inexperimental conditions and design must factor into the inter-pretation of any analysis Where appropriate data should be cor-rected and normalized to ensure comparisons are meaningfullike-versus-like comparisons and observed contrasts are notmerely batch or scalar effects

Interdisciplinary teamwork

Apart from robust infrastructure and software support the heartof any successful integromics framework is a cohesive and inter-disciplinary team to support and champion the entire effortDespite the technical complexities of a system like PICan team-work is arguably the most important and yet most often over-looked piece of the integromics puzzle This is fundamentally acollaborative endeavour bringing with it all the challenges ofworking with an interdisciplinary team of individuals and organ-izations each of whom have their own diverse set of goals andpriorities Clinical team members for example may have respon-sibilities to their patients that would not apply to academic re-searchers Nevertheless the specialist insights afforded by aninterdisciplinary team have been indispensable to the success ofPICan Clinicopathological data tables are built on templates cre-ated by clinical experts specializing in each cancer type Clinicaland translational researchers inform the choice of analysis toolsto be added which are then developed and tested by transla-tional bioinformaticians Establishing and incorporating an inte-gromics framework into the mindset of an institution will requireadditional effort and workload on certain individuals Hence it isimperative that team discussions include delegation and sharedresponsibility over matters such as data collection and curationdata formats and standards system maintenance data exchangeand access control Differing priorities and opinions betweenteam members are to some extent unavoidable but teams canadopt measures to manage conflicts such as having all protocolsclearly documented or maintaining a regular meeting scheduleRegardless of how these challenges are met we stress the im-portance of maintaining open lines of communication and agree-ing to clear roles and responsibilities of all team membersEffective teamwork is a non-trivial yet integral contributor to thesuccess of any integromics framework

With an ever-increasing emphasis on interdisciplinaryresearch projects like PICan underscore the need for cross-disciplinary education to instil young researchers with thenecessary mindset understanding and skills to navigate thechallenges and opportunities in the research environment ofthe future Queenrsquos University Belfast for example offers aMasters-level course in Bioinformatics and ComputationalGenomics [95] The course attracts students from a diverserange of backgrounds from medical doctors to engineers andcomputer scientists to biologists from a range of life sciencesubject areas and aims to equip them with a solid foundationacross a broad range of topics such as advanced bioinformaticsdatabase design tissue imaging and integromics

Interdisciplinary teamwork is vital to another pillar of anyintegromics framework access to quality data Any analysis is

only as robust as the data that underpins it The role of the NIBis crucial to PICanrsquos success not only through access to its bio-specimens but also its expertise and infrastructure support forsample preparation processing accessioning tracking andquality control The active participation of an interdisciplinaryteam also distributes the workload of data curation One of thegreat struggles of data collection is maximizing quality andquantity with a limited resource of labour With an increasingfocus among funders and institutions on large-scale multi-centre studies scaling up an integromics framework while pre-serving data quality will rely heavily on greater delegation ofdata curation to the data collectors themselves The onus willbe on the individual data collectors to flag data discrepanciesso that the database administrator serves as the last line of de-fence rather than the first

Ethics and data governance

Like any other system handling clinical and pathological infor-mation an integromics framework also requires appropriateapproval for the use of patient data Additionally data controlsin integromics need to protect both patients and data collectorsde-identification protects patient privacy while access controlsand terms of use policies protect the rights of data collectors

All studies undertaken in PICan need to be approved follow-ing an application to NIB where project-specific ethics and gov-ernance are granted under the Office for Research EthicsCommittees Northern Ireland approved NIB guidelines As a re-search tool PICan stores no personally identifiable informationin its database ensuring patient anonymization and minimiz-ing the potential impact of any data security breach Each pa-tient sample and associated analytical data are labelled using aunique independent study-specific identifier within the systemClinically qualified and authorized patient data collectors retainthe key that links these identifiers to hospital systems wherepersonal data are securely stored Researchers using PICan arerestricted to using anonymized research data onlymdashrespectingdata protection laws and regulations However as importantfollow-up data on patients accrues over time these can still beaccessed by having appropriately authorized clinical staff anddata collectors use the key to identify patients on hospital sys-tems and retrieve more up-to-date patient informationAuthorized clinical personnel can then review and retrieve per-tinent research-relevant information from the hospital systemsand update PICan again with only anonymized research datausing the PICan identifiers This is equivalent to an honest bro-ker system used elsewhere [96 97]

Data curation of de-identified information in PICan producesa robust privacy-controlled snapshot of analytical data but theunderlying data sources are themselves continually beingupdated as patients are followed up and new specimens are col-lected While data update is possible it is necessarily a manualprocess as PICan has been walled off from the original datasources by design allowing it to operate with fewer privacy con-trols than a typical clinical system as no identifiable patientdata are stored To maintain quality and consistency with exist-ing data data updates to PICan are done only periodically withentire cohorts of patients updated at once Alternatively insti-tutions that prioritize constantly updated data will requiredeeper integration between their research and clinical data sys-tems In such cases additional data protection measures will berequired such as encryption or safe havens for data storage andtransfer Essentially such an integrated system would need tomeet local rules and regulations for clinical data systems

Integromics in cancer tissue biomarker research | 9

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

Scaling up for large-scale epidemiologicalstudies

The last few decades have seen tremendous investment into allaspects of cancer biomarker discovery research but as of 2013only 18 protein cancer biomarkers had received US Food andDrug Administration (FDA) approval [98] despite decades of re-search and thousands of publications on novel candidate bio-markers Part of the reason may be a focus on early-stagediscovery activities and the relative lack of large-scale multi-centre validation and clinical trials Studies of such scale pre-sent a number of data management challenges not unlikethose we designed PICan to address First and foremost integro-mics is inherently a team activity and the same principles ofteamwork apply whether it is conducted within a single centreor across multiple research centres on different continents Inthis way integromics research extends readily to large-scalemulticentre epidemiological studies Data harmonization of fileformats and common vocabulary remain vitally important andbest addressed through open communication between collabor-ators As in the case of collaborators within an institution teammembers from different institutions should ideally agree on asingle data format for each data type and a common agreed setof vocabulary to describe data fields collected depending on theneeds of the study and its participants Analysis of multicentredata also requires standardization of pre-analytical parameterssuch as inclusion criteria sample acquisition sample process-ing assay conditions and measurements recorded When estab-lishing project standards and protocols teams should considerwhat community standards are already available to maximizeinteroperability and avoid redundancies Across Europe for ex-ample projects like ELIXIR [99] and Euro-BioImaging [100] are al-ready seeking to establish hubs for biomedical and bioimagingresource sharing Globally similar efforts in genomics are alsounderway [19ndash21]

The process of building digital infrastructure to support amulticentre integromics framework is similar to the case of asingle institution All team members should meet and agree onthe structure of the framework Depending on the extent of theproject and range of requirements the technical scientific andmedical team will need to grow as well Among the team thereis a need for at least one full-time staff member dedicated totechnical development and management of the digital infra-structure As the project grows the staffing requirement willundoubtedly increase in tandem For this reason modular de-sign becomes even more crucial as it allows development tasksto be carved up and delegated out to different team membersThere will also be additional requirements for hardware andsoftware licences Software costs can be mitigated by using free(eg R) or low-cost volume licenced (eg Microsoft Visual Studio(Redmond WA USA) for ASPNET in an academic setting) op-tions Nevertheless despite minor challenges digital solutionsgenerally scale well both in size and geographic distributionParallels can be drawn between incorporating external data re-sources into an integromics environment as previously dis-cussed and the challenges of connecting information systemsacross multiple sites However multicentre teams also facesome specific challenges Teams may need to comply with vari-ous sets of local laws and regulations surrounding researchethics data security and patient privacy Network access andsecurity may also be further complicated in a multicentre set-ting PICan for example was built for use within a single insti-tution so access was limited to within the institutionrsquos networkfirewall with all external access blocked A system spanning

multiple institutions would need greater attention to securitywhile maintaining access from outside the host institutionTeams may also need to contend with different software andhardware infrastructures at the various centres and implementthe necessary interfaces to ensure compatibility However theultimate key to success of multicentre collaborations is a funda-mental commitment to team science and on that front institu-tions that embrace integromics will have the experience to leadthe coming age of large-scale multicentre studies

Furthermore platforms like PICan and the integration ofimage analysis data with clinicopathological and therapeuticdata will help accelerate the screening of candidate tissue bio-markers for prognostic and predictive capacity While this rep-resents a powerful research and discovery tool this is likely toidentify new tissue biomarkers (based on IHC chromogenic insitu hybridization (CISH) fluorescence in situ hybridization(FISH) RNAscope etc) that have clear value in routine diagnos-tics and the stratification of patients for precision therapeuticsFurther validation of these biomarkers would be necessary incarefully controlled and standardized biomarker trials takinginto consideration sample preparation scanner configurationlab-to-lab variation comparative performance against patholo-gist scoring and clinical outcome data Regulatory approval ofdigital diagnostic tests would require FDA clearance (USA) orCE-IVD marking (Europe) if they were to be marketed as clinicaltests by industry but may also be implemented as laboratory-developed tests with documented internal validation and clin-ical evidence if done within an academic medical centre suchas our own setting Integromics platforms provide the frame-work for discovery and if designed appropriately can provide ro-bust clinical evidence of the utility of new biomarkers forsubsequent commercial licensing regulatory approval and clin-ical translation into routine diagnostictherapeutic practice

Conclusion

Modern biomedical research generates scientific and clinicaldata faster than the ability to fully interpret it while data fromnew and emerging technologies need to be integrated andharmonized with existing analysis workflows Efforts to resolvethe resulting bottleneck in interpretation to deliver on thepromises of precision medicine in this era of fast-track bio-marker discovery and evaluation have led to the birth of a newfield in bioinformatics integromics After introducing our ownframework for streamlining integromic analyses it becameclear that there is widespread interest in implementing integro-mic strategies in other research institutions In light of the pre-sent discussion we feel that an optimal integromics frameworkmust be well-supported by a dynamic and interdisciplinaryteam reflect the local needs and priorities of the institution andbe designed with an eye to the future The team is the gatewayto high-quality data and specialist insights that inform analysisand ultimately catalyse translation to clinical application insupport of precision medicine Meanwhile biomarker researchat our institution is currently centred on TMAs and we haveemphasized data quality and curation over raw throughputThese traits are imprinted in the design and operation of PICanbut other institutions may have other needs that demand a dif-ferent infrastructure Ongoing development of PICan includesincorporating raw data storage and access for completenessand traceability and strengthening support for digital imageanalysis As our institution adopts newer technologies PICanlikewise will continue to evolve facilitated by its modular de-sign to meet the changing needs of a busy cancer research

10 | Li et al

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

environment spanning numerous programmes technologiesand scientific objectives Hence integromics should be interwo-ven into the fabric of the institution and not simply be a down-loadable add-on

Key Points

bull Bold approaches are needed to maximize the potentialof big data in modern biomedical research and tobreak down the silos in which they are currently held

bull Collating storing transferring and integratively analy-sing vast amounts of heterogeneous multimodal datarequires a comprehensive and cohesive integromicsstrategy that streamlines biomarker discovery byexposing fresh insights into the underlying factorscontributing to disease phenotypes

bull We explore the practical considerations behind imple-menting an integromics framework using our experi-ences and design choices with PICan as a case studywhile recognizing that the global diversity of bio-marker discovery research programmes will requiretailoring solutions to each research environment

bull Successful integromics frameworks are supported bydynamic interdisciplinary teams and are inherentlyadaptable to new technologies and the host institu-tionrsquos evolving priorities

bull Integromics must be a central part of the institutionalculture and not just a downloadable add-on

Conflict of interest

Professor Peter Hamilton is the founder of and non-executivedirector with PathXL Ltd Professor Manuel Salto-Tellez isSenior Advisor to PathXL Ltd

Funding

The research leading to these results has received funding fromthe People Programme (Marie Curie Actions) of the EuropeanUnionrsquos Seventh Framework Programme FP72007ndash2013underREA grant agreement (285910 to PWH) Invest NorthernIreland (RDO0712612 to PWH) Cancer Research UK Accelerator(C11512A20256 to PWHMS-T) CRUK (to PB) The NorthernIreland Molecular Pathology Laboratory is supported by CancerResearch UK Experimental Cancer Medicine Centre Networkthe NI Health and Social Care Research and DevelopmentDivision the Sean Crummey Memorial Fund the Tom SimmsMemorial Fund and the Friends of the Cancer Centre (to MS-T) The Northern Ireland Biobank is funded by the Health andSocial Care Research and Development Division of the PublicHealth Agency in Northern Ireland and Cancer Research UKthrough the Belfast CRUK Centre and Northern IrelandExperimental Cancer Medicine Centre additional support wasreceived from the Friends of the Cancer Centre (to JAJ)

References1Canuel V Rance B Avillach P et al Translational research plat-

forms integrating clinical and omics data a review of publiclyavailable solutions Brief Bioinform 201516280ndash90

2 Noor AM Holmberg L Gillett C et al Big Data the challengefor small research groups in the era of cancer genomics Br JCancer 20151131405ndash12

3 PathXL Xplore httpwwwpathxlcompathxl-researchpathxl-xplore (20 November 2015 date last accessed)

4 Biomarker Interpretation Workflow - OncoMark httpwwwoncomarkcomgoproductsbiomarker-interpretation-workflow (23 November 2015 date last accessed)

5 Denaxas SC George J Herrett E et al Data resource profilecardiovascular disease research using linked bespoke studiesand electronic health records (CALIBER) Int J Epidemiol2012411625ndash38

6 Lubbock ALR Katz E Harrison DJ et al TMA Navigatornetwork inference patient stratification and survival ana-lysis with tissue microarray data Nucleic Acids Res 201341W562ndash8

7 Giardine B Riemer C Hardison RC et al Galaxy a platform forinteractive large-scale genome analysis Genome Res 2005151451ndash5

8 Goecks J Nekrutenko A Taylor J Galaxy a comprehensiveapproach for supporting accessible reproducible and trans-parent computational research in the life sciences GenomeBiol 201011R86

9 Madhavan S Gusev Y Harris M et al G-DOC a systems medi-cine platform for personalized oncology Neoplasia 201113771ndash83

10Dander A Baldauf M Sperk M et al Personalized OncologySuite integrating next-generation sequencing data andwhole-slide bioimages BMC Bioinformatics 201415306

11Kartashov AV Barski A BioWardrobe an integrated platformfor analysis of epigenomics and transcriptomics data GenomeBiol 201516158

12Harris PA Taylor R Thielke R et al Research electronic datacapture (REDCap)ndasha metadata-driven methodology andworkflow process for providing translational research in-formatics support J Biomed Inform 200942377ndash81

13Szalma S Koka V Khasanova T et al Effective knowledgemanagement in translational medicine J Transl Med201081ndash9

14Manley S Mucci NR De Marzo AM et al Relational databasestructure to manage high-density tissue microarray data andimages for pathology studies focusing on clinical outcomethe prostate specialized program of research excellencemodel Am J Pathol 2001159837ndash43

15Weinstein JN Integromic analysis of the NCI-60 cancer celllines Breast Dis 20041911ndash22

16Weinstein JN Pommier Y Transcriptomic analysis of theNCI-60 cancer cell lines C R Biol 2003326909ndash20

17Yu P Artz D Warner J Electronic health records (EHRs) sup-porting ASCOrsquos vision of cancer care Am Soc Clin Oncol EducBook 2014225ndash31

18McArt DG Blayney JK Boyle DP et al PICan an integromicsframework for dynamic cancer biomarker discovery MolOncol 201591234ndash40

19Hudson TJ Anderson W The International Cancer GenomeConsortium et al International network of cancer genomeprojects Nature 2010464993ndash8

20Home j Global Alliance for Genomics and Health httpsgenomicsandhealthorg (15 March 2016 date last accessed)

21The 1000 Genomes Project Consortium A global reference forhuman genetic variation Nature 201552668ndash74

22Collins FS Varmus H A New Initiative on Precision MedicineN Engl J Med 2015372793ndash5

23Kolesnikov N Hastings E Keays M et al ArrayExpress up-datendashsimplifying data submissions Nucleic Acids Res 201543D1113ndash6

Integromics in cancer tissue biomarker research | 11

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

24Leinonen R Akhtar R Birney E et al The European NucleotideArchive Nucleic Acids Res 201139D28ndash31

25The Cancer Genome Atlas Research NetworkComprehensive genomic characterization defines humanglioblastoma genes and core pathways Nature 20084551061ndash8

26Home - The Cancer Genome Atlas - Cancer Genome -TCGAhttpcancergenomenihgov (1 December 2015 datelast accessed)

27Kristensen VN Lingjaeligrde OC Russnes HG et al Principlesand methods of integrative genomic analyses in cancer NatRev Cancer 201414299ndash313

28Guinney J Dienstmann R Wang X et al The consensusmolecular subtypes of colorectal cancer Nat Med 2015211350ndash6

29Gao J Aksoy BA Dogrusoz U et al Integrative analysis ofcomplex cancer genomics and clinical profiles using thecBioPortal Sci Signal 20136pl1

30Cerami E Gao J Dogrusoz U et al The cBio cancer genomicsportal an open platform for exploring multidimensional can-cer genomics data Cancer Discov 20122401ndash4

31Forbes SA Beare D Gunasekaran P et al COSMIC exploringthe worldrsquos knowledge of somatic mutations in human can-cer Nucleic Acids Res 201543D805ndash11

32Broad GDAC Firehose httpgdacbroadinstituteorg (22March 2016 date last accessed)

33Le Cao K-A Gonzalez I Dejean S integrOmics an R packageto unravel relationships between two omics datasetsBioinformatics 2009252855ndash6

34Waller T Gubała T Sarapata K et al DNA microarray integro-mics analysis platform BioData Min 2015818

35Cecic IK Li G MacAulay C Technologies supporting analyt-ical cytology clinical research and drug discovery applica-tions J Biophotonics 20125313ndash26

36Rocha R Vassallo J Soares F et al Digital slides present sta-tus of a tool for consultation teaching and quality control inpathology Pathol Res Pract 2009205735ndash41

37Pantanowitz L Valenstein P Evans A et al Review of the cur-rent state of whole slide imaging in pathology J Pathol Inform2011236

38Brachtel E Yagi Y Digital imaging in pathologyndashcurrent ap-plications and challenges J Biophotonics 20125327ndash35

39Hamilton PW Bankhead P Wang Y et al Digital pathologyand image analysis in tissue biomarker research Methods20147059ndash73

40Schneider CA Rasband WS Eliceiri KW NIH Image to ImageJ25 years of image analysis Nat Methods 20129671ndash5

41Schindelin J Arganda-Carreras I Frise E et al Fiji - an OpenSource platform for biological image analysis Nat Methods20129676ndash82 101038nmeth2019

42de Chaumont F Dallongeville S Chenouard N et al Icy anopen bioimage informatics platform for extended reprodu-cible research Nat Methods 20129690ndash6

43Sommer C Straehle C Kothe U et al Ilastik interactive learn-ing and segmentation toolkit In Biomedical Imaging fromNano to Macro 2011 IEEE International Symposium on 2011pp 230ndash233 IEEE

44Hamilton PW van Diest PJ Williams R et al Do we see whatwe think we see The complexities of morphological assess-ment J Pathol 2009218285ndash91

45Salto-Tellez M James JA Hamilton PW Molecular path-ologymdashthe value of an integrative approach Mol Oncol201481163ndash8

46Li G van Niekerk D Miller D et al Molecular fixative enablesexpression microarray analysis of microdissected clinicalcervical specimens Exp Mol Pathol 201496168ndash77

47Galon J Mlecnik B Bindea G et al Towards the introductionof the lsquoImmunoscorersquo in the classification of malignant tu-mours J Pathol 2014232199ndash209

48R Core Team R A Language and Environment for StatisticalComputing Vienna Austria R Foundation for StatisticalComputing 2015

49Baier T Neuwirth E Excel COM R Comput Stat 20072291ndash10850McCarty KS Miller LS Cox EB et al Estrogen receptor ana-

lyses Correlation of biochemical and immunohistochemicalmethods using monoclonal antireceptor antibodies ArchPathol Lab Med 1985109716ndash21

51Goulding H Pinder S Cannon P et al A new immunohisto-chemical antibody for the assessment of estrogen receptorstatus on routine formalin-fixed tissue samples Hum Pathol199526291ndash4

52Leek JT Scharpf RB Bravo HC et al Tackling the widespreadand critical impact of batch effects in high-throughput dataNat Rev Genet 201011733ndash9

53Lauss M Visne I Kriegner A et al Monitoring of TechnicalVariation in Quantitative High-Throughput Datasets CancerInform 201312193ndash201

54Codd EF Normalized Data Base Structure a Brief Tutorial InProceedings of the 1971 ACM SIGFIDET (Now SIGMOD)Workshop on Data Description Access and Control SanDiego CA USA 1971 pp 1ndash17 Association for ComputingMachinery New York NY

55Kacprzyk J Zadro _zny S De Tre G Fuzziness in database man-agement systems half a century of developments and futureprospects Fuzzy Sets Syst 2015281300ndash7

56Bosc P Pivert O Farquhar K Integrating fuzzy queries intoan existing database management system an example Int JIntell Syst 19949475ndash92

57Duggal R Khatri SK Shukla B Improving patient matchingsingle patient view for Clinical Decision Support using BigData analytics In 2015 4th International Conference onReliability Infocom Technologies and Optimization (ICRITO)(Trends and Future Directions) Noida India 2015 pp 1ndash6 IEEENew York NY

58OpenSeadragon httpsopenseadragongithubio (7 April2016 date last accessed)

59Allred DC Bustamante MA Daniel CO et alImmunocytochemical analysis of estrogen receptors inhuman breast carcinomas evaluation of 130 cases and re-view of the literature regarding concordance with biochem-ical assay and clinical relevance Arch Surg 1990125107ndash13

60Singh R Chubb L Pantanowitz L et al Standardization in digi-tal pathology supplement 145 of the DICOM standardsJ Pathol Inform 2011223

61Linkert M Rueden CT Allan C et al Metadata matters accessto image data in the real world J Cell Biol 2010189777ndash82

62The HDF Group HDF Group - HDF5 httpswwwhdfgrouporgHDF5 (3 January 2016 date last accessed)

63Sommer C Held M Fischer B et al CellH5 a format for dataexchange in high-content screening Bioinformatics 2013291580ndash2

64Allan C Burel J-M Moore J et al OMERO flexible model-driven data management for experimental biology NatMethods 20129245ndash53

65Maree R Rollus L Stevens B et al Collaborative analysis ofmulti-gigapixel imaging data using Cytomine Bioinformatics2016321395ndash1401

12 | Li et al

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

66DigitalScope V5 httpwwwdigitalscopeorg (26 November2015 date last accessed)

67Astier Y Braha O Bayley H Toward single molecule DNAsequencing direct identification of ribonucleoside and deox-yribonucleoside 5lsquo-monophosphates by using an engineeredprotein nanopore equipped with a molecular adapter J AmChem Soc 20061281705ndash10

68Stoddart D Heron AJ Mikhailova E et al Single-nucleotidediscrimination in immobilized DNA oligonucleotides with abiological nanopore Proc Natl Acad Sci USA 20091067702ndash7

69Flusberg BA Webster D Lee J et al Direct detection of DNAmethylation during single-molecule real-time sequencingNat Methods 20107461ndash5

70Roberts RJ Carneiro MO Schatz MC The advantages of SMRTsequencing Genome Biol 201314405

71Sheth AP Larson JA Federated database systems for manag-ing distributed heterogeneous and autonomous databasesACM Comput Surv 199022183ndash236

72Baker EJ Biological databases for behavioral neurobiology IntRev Neurobiol 201210319ndash38

73Peuroaeuroakkonen P Pakkala D Reference architecture and classifi-cation of technologies products and services for big data sys-tems Big Data Res 20152166ndash86

74Haider S Ballester B Smedley D et al BioMart Central Portalndashunified access to biological data Nucleic Acids Res 200937W23ndash7

75Barrett T Wilhite SE Ledoux P et al NCBI GEO archive forfunctional genomics data setsndashupdate Nucleic Acids Res201341D991ndash5

76Goode A Gilbert B Harkes J et al OpenSlide a vendor-neutralsoftware foundation for digital pathology J Pathol Inform2013427

77Pearson WR Lipman DJ Improved tools for biological se-quence comparison Proc Natl Acad Sci USA 1988852444ndash8

78Berman JJ Edgerton ME Friedman BA The tissue microarraydata exchange specification a community-based opensource tool for sharing tissue microarray data BMC MedInform Decis Mak 200335

79Unified Medical Language System (UMLS) - Home httpswwwnlmnihgovresearchumls (20 November 2015 datelast accessed)

80Komatsoulis GA Warzel DB Hartel FW et al caCORE version3 implementation of a model driven service-oriented archi-tecture for semantic interoperability J Biomed Inform 200841106ndash23

81Covitz PA Hartel F Schaefer C et al caCORE a common infra-structure for cancer informatics Bioinformatics 200319 2404ndash12

82Papatheodorou I Crichton C Morris L et al A metadata ap-proach for clinical data management in translational gen-omics studies in breast cancer BMC Med Genomics 2009266

83Mohanty SK Mistry AT Amin W et al The development anddeployment of Common Data Elements for tissue banks fortranslational research in cancer - an emerging standardbased approach for the Mesothelioma Virtual Tissue BankBMC Cancer 2008891

84 Patel AA Kajdacsy-Balla A Berman JJ et al The developmentof common data elements for a multi-institute prostate can-cer tissue bank the Cooperative Prostate Cancer TissueResource (CPCTR) experience BMC Cancer 20055108

85 Shiny httpshinyrstudiocom (6 April 2016 date lastaccessed)

86 RNET documentation ndash user version j RNET ndash user versionhttpjmp75githubiordotnet (6 April 2016 date lastaccessed)

87 matplotlib python plotting mdash Matplotlib 151 documenta-tion httpmatplotliborg (6 April 2016 date last accessed)

88 D3js - Data-Driven Documents httpsd3jsorg (6 April2016 date last accessed)

89 Fereidouni F Datta-Mitra A Demos S et al Microscopy withUV Surface Excitation (MUSE) for slide-free histology andpathology imaging In Progress in Biomedical Optics andImaging - Proceedings of SPIE San Francisco CA USA 2015p 93180F SPIE Bellingham WA USA

90 Bird B Miljkovic M Romeo MJ et al Infrared micro-spectralimaging distinction of tissue types in axillary lymph nodehistology BMC Clin Pathol 200888

91 Taleb A Diamond J McGarvey JJ et al Raman Microscopy forthe Chemometric Analysis of Tumor Cells J Phys Chem B200611019625ndash31

92 Guze K Pawluk HC Short M et al Pilot study Raman spec-troscopy in differentiating premalignant and malignant orallesions from normal mucosa and benign lesions in humansHead Neck 201537511ndash7

93 McNamee R Regression modelling and other methods tocontrol confounding Occup Environ Med 200562500ndash6

94 dos Santos Silva I Dealing with confounding in the analysisIn Cancer Epidemiology Principles and Methods LyonFrance International Agency for Research on Cancer 1999305ndash31

95 Queenrsquos University Belfast j Postgraduate and ProfessionalDevelopment j MSc Bioinformatics and ComputationalGenomics httpwwwqubacukschoolsmdbspgdPTMScBioinformaticsandComputationalGenomics (24 March2016 date last accessed)

96 Dhir R Patel AA Winters S et al A multidisciplinary ap-proach to honest broker services for tissue banks and clin-ical data a pragmatic and practical model Cancer 20081131705ndash15

97 Boyd AD Hosner C Hunscher DA et al An lsquoHonest Brokerrsquomechanism to maintain privacy for patient care and aca-demic medical research Int J Med Inform 200776407ndash11

98 Pavlou MP Diamandis EP Blasutig IM The long journey ofcancer biomarkers from the bench to the clinic Clin Chem201359147ndash57

99 ELIXIR Data for life httpswwwelixir-europeorg (24March 2016 date last accessed)

100 Research Infrastructure for Imaging Technologies inBiological and Biomedical Sciences j Euro-BioImaginghttpwwweurobioimagingeu (24 March 2016 date lastaccessed)

Integromics in cancer tissue biomarker research | 13

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

Page 4: Embracing an integromic approach to tissue biomarker ... › download › pdf › 33595853.pdf · Her research uses integrative molecular approaches for biomarker discovery and validation

pipelines especially when coupled with the wealth of moleculardata currently becoming increasingly accessible

In light of this true integromics today ought to embraceboth molecular and digital pathology to uncover and validate amore diverse and robust set of biomarkers that will be key to de-livering on the promises of precision medicine [45] Tissue path-ology and histology are essential to the application of morerecent molecular methods in solid tumours required for deter-mining the presence and proportion of tumour cells to ensuresample sufficiency for downstream analysis of nucleic acidsand for directing microdissection of tissue samples for molecu-lar analysis [46] Moreover tissue context and cellular pheno-type are central to understanding tumour heterogeneity andthe underlying genomic map of solid tumours TheImmunoscore biomarker [47] for example applies image ana-lysis to contextualize the interpretation of immunohistochemi-cal staining The understanding of both how genotype drivesphenotype and how phenotypic information influences inter-pretation of genotype are required for a holistic view of cancerand are therefore essential elements of a modern integromicsapproach

Building large-scale digital resources PICanoverview

Reflecting this broader view of integromics we created PICan tobe a comprehensive framework for the collection integrationand analysis of clinical pathological molecular and digitalimage analysis data sets for the purposes of cancer biomarkerdiscovery and validation [18] PICan sits at the heart of our bio-marker discovery program acting as both a central hub andcatalyst (Figure 1) PICan streamlines the process of bringing to-gether diverse streams of multimodal data from multiple cancertypes to generate clinically and biologically relevant analysesand reports via a bespoke access-controlled web-based inter-face written in ASPNETC with a connection to the R statis-tical environment [48] through statconn [49] At its core is arestricted-access MySQL database to house this collated patientdata Currently there are 3660 patients representing eight can-cer types recorded in the system together with fully curateddata on pathological diagnosis and clinical outcome Fromthese nearly 12 000 TMA cores have been stained immunohis-tochemically and scored for a broad range of well-establishedand novel cellular biomarkers that are of interest to our basicscience translational and clinical research collaborators Withthe ongoing rapid pace of research activity at our institutionthese numbers are set to double over the course of the nextyear Data de-identification using study-specific unique identi-fiers protects patient privacy while authentication and author-ization of users enables a balance to be struck between enablingaccess to users and safeguarding the rights of data collectors(clinicians wet lab scientists and other collaborators) to publi-cation priority and control of their own data Only administra-tors have direct access to the PICan database while end useraccess is provided exclusively through the access-controlledanalysis and report generation web interface Data quality is akey priority to ensure the integrity of downstream analyses asany analysis is only as good as the weakest data supporting itso data are curated with biological and clinical input before up-load Meanwhile bioinformaticians tailor their pipelines ac-cording to the needs of the research or analysis Consequentlyin addition to the central database a complete integromicspipeline must necessarily encompass data collection data

curation and biologically driven analysis and interpretation ofthis data activities which we consider integral parts of any inte-gromics framework

Database design data pre-processing curationand structure

Traditional single-mode data analysis pipelines take raw data(eg sequencing reads microarray spot intensities whole slideimage files) and process it through a series of steps into a finalform that informs a particular biological narrative (eg list ofmutations and frequencies gene expression values H-scoresfor immunohistochemistry (IHC) scoring [50 51]) However rawdata such as raw fluorescent spot intensities from an expres-sion microarray experiment might not be suitable for multi-modal and dynamic analysis without corrections forbackground and batch effects [52 53] while data aggregationmight leave fully processed data such as a list of differentiallyexpressed genes too refined to be of further use in patient-levelresearch analysis A compromise is to capture data in a statethat is suitable and informative for subsequent analysis whiletracking metadata associated with acquisition and pre-processing Metadata or lsquodata about datarsquo captures informationabout the methods and conditions under which the data itselfwere obtained Such information can be used for example toidentify cases where an observed effect was the result of differ-ences in the laboratory or researcher conducting the experi-ments rather than the biological condition Going back toexpression microarray data as an example this could beachieved by recording background-corrected and normalizedprobe signal intensities with metadata detailing the nature andparameters for data pre-processing steps such as normaliza-

tion assay conditions instrumentation used including its set-tings and pre-analytical steps eg how the physical sampleswere obtained handled and processed and by whom For com-pleteness and traceability raw data can be stored alongside thisanalysis-ready data Additionally multiple lsquosnapshotsrsquo fromsingle-mode analysis pipelines can be stored as each could po-tentially serve as a starting point for multimodal analysisHowever care must be taken to ensure internal consistency sothat results are repeatable and reproducible whether workingup raw data or the stored analysis-ready data Consequentlyfor PICan we opted for quality and consistency by recordingonly the analysis-ready data most suitable for further integro-mic analysis A number of quality assurance and quality controlmeasures can be taken to ensure data integrity with the appro-priate level of adoption into standard operating proceduresbeing dependent on the overall level of accreditation of the la-boratory in question Metadata is stored alongside the analysis-ready data but raw data and most downstream single-modeanalysis results are presently excluded from PICan Future plansto increase data completeness and traceability include adding acentral independent repository for raw data files These repre-sent files not already in existence elsewhere (eg public reposi-tories) These raw data files will be read-only and bear uniqueidentifiers that link them to their analysis-ready counterpartson the main database A file path stored in the main databasewill allow the raw data files to be easily retrieved when neededAs these files will not be used for on-demand statistical ana-lysis they can be stored in their native formats

Integromics in cancer tissue biomarker research | 3

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

Relational database systems

Integromics software infrastructure can be built around one ofmany database management systems available each with itsown strengths and weaknesses As we envisioned PICan to in-corporate a semi-automated analysis engine [18] we prioritizeddata quality which maximizes the validity and reproducibilityof our results and preservation of data structure which stream-lines the automated retrieval and processing of big data sets ra-ther than attempting to blindly maximize data throughputReflecting these design priorities PICan was built around a rela-tional database (MySQL) consisting of interrelated data tablesthat reflect the process of constructing and harnessing TMAswhich are central to much of the biomarker discovery work con-ducted at our institution the Northern Ireland MolecularPathology Laboratory and through the Northern Ireland Biobank(NIB) (Figure 2) Every row of every table has a system-generatedprimary key and these are used as foreign keys to form relation-ships between tables The database centres on the Patient andBlock tables (table names will be italicized) reflecting our core vi-sion for PICan to drive holistic multimodal integromic analysisat a patient level One-to-many relationships link the Patienttable to tables containing the clinicopathological and treatmentdata capturing cancer-type-specific fields from templates de-veloped in collaboration with our clinical expert partners Thesetemplates also define permissible data inputs for each field (egage is a number sex is lsquoMrsquo or lsquoFrsquo) which are then encoded intothe database schema A patient may have had multiple tissuespecimens collected as tissue blocks each represented by a

separate linked entry in the Block table Linked via the Block tableare tables representing molecular assays (SangerSequenceDNAMicroarrayChip ExpressionLevel) whole face tissue sections(TissueSlide) TMA construction (Cylinder RecipientBlock TMASlideTMACore) IHC scoring (TissueScore BiomarkerScore) and digitalpathology (TissueVirtualSlide TMAVirtualSlide VirtualCoreDIAScore) all of which reflect primary activities within our insti-tutionrsquos biomarker discovery program Metadata is stored in thetable corresponding to the relevant physical entity and soimaging parameters for a whole face tissue slide for examplewould go in the TissueVirtualSlide table While this databasestructure is well suited to our needs laboratories collectingnon-tissue specimens such as blood sputum or urine may findit inappropriate for their needs A Blood table might sit at theheart of a haematological laboratoryrsquos database for examplewhile other institutions may need a cluster of tables to repre-sent different specimen types each linked to tables for variousbiochemical assays

PICan uses a modular internal structure to support variousdata types Each table represents a physical entity and conse-quently a data type This allows the addition of new tables torepresent data types from emerging technologies as they ma-ture As maximizing utility of new tables will require updates tothe end-user web interface and setting the appropriate foreignkeys will require knowledge of the internal structure of thePICan database only the system administrator can add newtables

Another consideration for relational database design is nor-malization This procedure simplifies the data structure by

Figure 1 PICan sits at the heart of an integrated multidisciplinary biomarker discovery and validation pipeline Multiple curated data sources feed into PICan which

accelerates downstream analysis and report generation

4 | Li et al

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

removing duplication of data ensuring that the table is resilientagainst inconsistencies arising from incomplete addition dele-tion or modification of data As described in [54] this isachieved by identifying groups of related or dependent datafields and splitting them off into a separate table using foreignkeys to record relationships between tables For most practicalapplications the third normal form balances a robust architec-ture without overly cluttering the schema with an untenablenumber of tables Normalization approaches can also be used tooptimize database speed and performance

A consequence of our decision to use a relational database isthat free-form data needs to be harmonized with the data struc-ture presented by PICan before data entry The requisite cur-ation steps prioritize data integrity and prevent poor-qualitydata that might impede proper and meaningful analysis fromentering the database (garbage-in-garbage-out) Clinical andpathological data collection and storage are guided by tem-plates collections of well-defined data fields that have been de-veloped through our close working relationships with the NIBNorthern Ireland Cancer Registry Northern Ireland Health andSocial Care Trusts and clinicians These collaborators share thedesire to drive data integration and provide expertise contextand data access to harmonize data templates around commonvocabulary and permissible data values ensuring maximal clin-ical relevance of PICan In addition to the harmonized common

fields like age and sex PICan also supports cancer-type-specificfields for maximum flexibility

Non-relational database structures

While a structured relational database has served PICan wellthere are alternative models that might better suit other re-search integromics environments MongoDB for example is anon-relational (or lsquoNoSQLrsquo) database architecture that uses adocument model where each entity (a tissue block for ex-ample) is represented by a lsquodocumentrsquo that might contain hun-dreds of attributes as key-value pairs This model works with aflexible schema allowing data fields to be added or omitted ona case-by-case basis while harmonization of terminology anddata fields occurs at data retrieval In a research environmentespecially at the beginning of a study when pipelines and meta-data are likely to change it is easier to update the data modelthan to redesign the fixed schema of a relational databaseHowever with less urgency to curate data before upload dataintegrity may be variable with errors remaining unnoticed andconfounding analyses performed on the data Fuzzy matchingalgorithms [55ndash57] can mitigate the impact of typographical andsome clerical errors but these approaches must still be rigor-ously validated In some cases it might be impossible even for ahuman expert to deduce the original intention Data must then

Figure 2 Schematic of the internal table structure of PICanrsquos relational database reflecting the core activities and workflows that comprise the biomarker discovery

and validation program at our institution Most tables represent physical or digital entities within these workflows Where table names may be unclear a brief descrip-

tion has been included along with some example data fields (in parentheses)

Integromics in cancer tissue biomarker research | 5

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

be discarded from analysis potentially reducing the statisticalpower Manual curation at the outset might identify such caseswhen there is still an opportunity to seek clarification from theoriginal data collector Ideally ambiguous annotations and bor-derline cases should be resolved in consultation with a suitablyqualified medical professional before they enter the databaserather than leaving them open to interpretation by downstreaminformaticians Likewise missing or contradictory data pointsare preferably identified and where possible corrected beforeentry into the database However a major advantage of an un-structured database model is the ability to more faithfully cap-ture the complexities of the patientrsquos clinical pathological andother data parameters including free-form text fields such asclinical notes The ability to capture patient data in its nativeform also improves the traceability and scalability of data entryinto a non-structured database solution by eliminating the la-bour and documentation associated with data curation andenabling automated bulk data upload Despite the tools wehave developed to semi-automate data upload this will remaina relative drawback of manual curation in the short term espe-cially for clinical and pathological data fields In the longerterm scalability may be improved with increased adoption ofelectronic medical records wherein many of these data fieldscould be completed by the attending physician when encoun-tering the patient

Balancing the data curation afforded by the structured rela-tional database with the high-throughput flexibility of unstruc-tured models are various hybrid options For example PICanoffers flexibility to the structured database model by includingfreeform text fields to capture unstructured notes (in our experi-ence these are typically for clinical notes) in several of the datatables As these fields are not referenced directly by the auto-mated online analysis component which requires well-definedparameters they are hidden from analysis but can be retrievedby users wishing to conduct further data refinement or offlineanalysis In practice however it is more likely that any fieldsnot used in analysis by the system will simply be ignored andneglected Another hybrid solution we are actively exploring toincrease data throughput is to allow the importation of non-curated data sets for end-user analysis In this model curatedlsquopublicrsquo data sets would be available to all users (with necessaryapprovals and access restrictions in place) while each end userwill be able to upload lsquoprivatersquo data sets to be analysed along-side or integrated with the curated data The user then is re-sponsible for ensuring the quality of hisher own dataExtending the hybrid model concept further are heterogeneousdatabases in which only the most important unchanging andsensitive data are typically stored in a relational databaseMeanwhile the bulk of the data in terms of size will be storedin non-relational databases or as raw data with links stored inthe main database to ensure ease of access and retrieval

Image analysis data storage and management

Current work is focussing on improving the support of digitalimage analysis data an emerging area of research in which ourgroup has significant expertise Imaging data presents someunique challenges for database storage design The imagesthemselves can be large One of our typical tissue slidesscanned at 40objective magnification (025 lm per pixel) canhave 2ndash10 billion pixels Imaging in RGB at 8 bits per channelgives uncompressed file sizes of up to 30 GB Additional colourchannels (such as for fluorescence imaging) z-stacks and dataredundancy in the form of image pyramids for improved image

loading performance would all greatly increase file sizesCompression typically JPEG or JPEG2000 for whole slide imagescan significantly reduce file size without drastically impairingvisual assessment although suitability for image analysis ismore variable [39] Hence it is vital to store the original slideimage to allow a pathologist to go back and verify that imageanalysis results are consistent with hisher expectations This iscurrently achieved by hosting the slide images on remote ser-vers maintained by PathXL A link is stored in the PICan data-base that allows the web-based interface to retrieve the imagedata on user request via an application programming interface(API) supplied by PathXL Pan zoom and overlay functionalitiesin the PICan web interface are powered by OpenSeadragon [58]As the samples are drawn from NIB collections NIB maintainscontrol of access and security using the PathXL software

In addition to the raw image pixels image analysis datamight include markups annotations and calculated metrics(such as integrated optical densities nuclear areas or immuno-histochemical H-scores [50 51]) Metrics can be at a cellularlevel or summarized across a region of interest (ROI) a slide oreven a set of slides Much of these data may be hierarchicallystructured with each slide potentially containing multiple ROIseach ROI containing many cells and each cell being associatedwith values for numerous metrics However the efficiency ofstoring image analysis data in a structured database is uncer-tain when a single image can contain thousands of annotationsFor now PICan stores only summary scores and metrics (eg H-score [50 51] Allred Score [59] percentage positive cells) whileusers wishing to interrogate the image data on a more granularlevel will need to rely on external specialist tools The challengeof storing and transferring image analysis data between sys-tems is an ongoing one with several potential approaches hav-ing been proposed including DICOM [60] OME-TIFF [61] variousHDF5-based formats [62 63] and proprietary vendor formatsbut without a clear front-runner DICOM and OME-TIFF are pri-marily image formats with flexible metadata support optimizedfor image viewing rather than storage of image analysis resultswith large numbers of image objects In contrast HDF5 capturesobject hierarchy relationships well but on its own it is not a na-tive image format Meanwhile proprietary formats are poten-tially encumbered by patenting and intellectual propertyconcerns

Integromics data management is patient-centric

Regardless of the database management system being used acritical characteristic of integromics analysis is patient-centricity The ultimate objective in bringing together data frommultiple experimental platforms is to build a more holistic pic-ture of the interplay between genotype and phenotype in thepatients under study As interest in amalgamating multipledata types has grown many platforms originally designed tohost one data type have been expanded to accommodate add-itional types of data Many biobank management software sys-tems allow addition of clinicopathological notes andexperimental assay results on top of sample tracking dataMeanwhile digital slide hosting platforms like OMERO [64]PathXL Xplore Cytomine [65] and DigitalScope (Aptia SystemsHouston TX used by the College of American Pathologists) [66]are beginning to support the addition of associated clinicopa-thological data as metadata Others like TCGA [25 26] origi-nated as molecular data repositories but have begun addingwhole slide imaging There is a clear trend towards convergenceof database models encompassing more and more data types

6 | Li et al

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

but their impact will be limited if these extra data are simplystored and forgotten Integromics is an active process not just ofbringing data together but more importantly of creative com-parisons and contrasts to expose underlying interactions under-pinning the cancer phenotype Many public repositories and bigdatabases such as TCGA draw from an assortment of patientcohorts so associations between different data types measuredon different cohorts are limited to cohort summaries poten-tially obscuring trends between data types that are only appar-ent at the individual patient level

Interacting with external resourcesmdashdataexchange between systems

With the current rapid expansion of big data analytics in bio-medicine integromics frameworks must remain flexibleenough to incorporate emerging and future technologiesStructured architecture and design of the overall integromicsdatabase framework can facilitate this However large data setslike next-generation sequencing (and upcoming third-gener-ation sequencing [67ndash70]) can be unwieldy in a MySQL databasewith typical whole exome or RNA sequencing runs comprising10ndash100 billion base pairs for a single sample so future integro-mics systems may house these types of data in databases thatare structured specifically for this purpose Instead of a singledatabase housing multiple data types a future integromicsdatabase will undoubtedly be structured as a fully distributedfederated database [71ndash74] with a central database storing iden-tifiers that interface with a cluster of highly specialized data-bases each housing a specific type of data in its native formatArguably PICan already uses a version of this model by out-sourcing digital slide image storage to remote servers Futuredevelopments may include enabling PICan to tap into publiclyavailable data sources and allowing users to analyse these pub-lic data sets in tandem with PICan data such as accessing GeneExpression Omnibus data through GEO2R [75]

One of the challenges when working with a federated data-base model is how to exchange data faithfully and efficientlybetween the various components Each constituent databasewill need an interface to interact seamlessly with the centraldatabase and control system Some of these will be availableoff-the-shelf or rely on published APIs while others such asthose interacting with in-house software may need significantdeveloper time and effort to implement Ultimately there areno shortcuts to establishing these interfaces and so institutionsshould aim to reduce the number of distinct interfaces bystandardizing data around a small set of formats For examplewhen working with whole slide images a single institutioncould strive to use only a single file format The various avail-able formats each have their own strengths and weaknessesbut the natural option will likely be the native format of thescanner used at the institution or a vendor-neutral format likeDICOM [60] or OME-TIFF [61] Fortunately many tools for work-ing with whole slide images support multiple file formats suchas DigitalScope [66] for remote web-based viewing of variousimaging formats Meanwhile projects like Bio-Formats (OpenMicroscopy Environment consortium) [61] and OpenSlide [76]further facilitate the viewing and exploitation of various imag-ing formats in other programming and analysis languages likeC Cthornthorn Python and Java Additionally it is important to stand-ardize the data transfer protocol whether this is throughInternet standards like hypertext transfer protocol physicalmeans like hard disks or some other agreed protocol Perhaps

establishing an institutional integromics strategy will encour-age adoption of standardized data formats by creating an ex-pectation among data generators that data should be portableand compatible with other data types for analysis

Data flow within an integromics environment can often bemultidirectional Even with only a single central data repositorythere will be instances when it is beneficial to export some ofthese data to a separate third-party platform whether for speci-alized analysis report generation or some other purpose AsPICanrsquos web-based user interface is designed to be user friendlyit supports only a limited number of types of analyses and re-port generation options Analyses not hard-coded into the inter-face for example would need to be performed lsquoofflinersquo In factthis is how new analyses are tested before being incorporatedinto PICan Similar to data import and storage data export froman integromics system is complicated by the wide range of datatypes that demand different formats to accommodate their in-herently different natures Efforts have been made in somefields to standardize and integrate data models such as thoseof the Global Alliance for Genomics and Health for genomicdata [20] Hence where widely recognized data exchange for-mats exist such as FASTAFASTQ files for nucleotide or aminoacid sequences [77] or the TMA data exchange standard [78]they should be prioritized for integromics support For otherdata types like digital image analysis however there remains aneed for a well-defined and widely accepted standard for dataexchange Whether it is possible or feasible to define a singleunifying integromic data exchange format that would encapsu-late multiple data types attached to individuals within a cohortof patient subjects remains an open question

Standardization of data vocabulary

Any data exchange standard needs to address both how dataare encoded (structure) as well as a common vocabulary forcommonly encountered fields CSV JSON XML and derivativesof these are popular formats that lend themselves well to struc-tured data Certainly CSV export is our default as although itdoes not fully represent the relational structure underpinningPICanrsquos database it mimics the basic tabular structure and canbe easily viewed by other researchers in common spreadsheetor statistical packages Non-tabular data like digital slide images(which are stored as links in the database) are exported as sep-arate files The greater challenge might be agreeing on a com-mon vocabulary Data field names need to be mapped fromPICanrsquos internal vocabulary to that of the target external plat-form This is not a trivial task as even professionals within thesame institution may use different names for the same metricSome clinicians may be more precise in how certain metrics aremeasured and recorded such as when some distinguish be-tween lymphatic and vascular invasion while others groupthem together as lymphovascular invasion Harmonizing be-tween cancer types can also be challenging hence the use ofcancer-type-specific tables in PICan for certain clinicopathologi-cal metrics Additionally even where a data exchange specifica-tion exists like the one for TMA reporting [78] there may beallowance for custom fields posing an extra challenge for align-ing under a common vocabulary Attempts have been made tostandardize terminology in some disciplines The US NationalLibrary of Medicine maintains a collection of controlled vocabu-laries known as the Unified Medical Language System [79]which incorporates SNOMED CT MeSH and OMIM amongmany other ontologies Another compendium is caCOREincluding the Enterprise Vocabulary Services and Cancer Data

Integromics in cancer tissue biomarker research | 7

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

Standards Repository of common data elements [80 81]Additionally some groups have devised cancer-type-specificlists such as for breast [82] mesothelioma [83] and prostate[84] In the absence of clear community-supported data stand-ards institutions are advised to agree on and use a minimal setof data export formats

Unique challenges of multimodal integrativeanalysis

Collecting and storing diverse data types is only half the battlein integromics unlocking the value of these data will comefrom analysis and interpretation Through a connection withthe R statistical environment via statconn [49] PICan is capableof on-demand statistical analysis graphical output and reportgeneration Besides R and statconn other similar frameworksand interfaces include Shiny [85] RNET [86] Python (eg mat-plotlib [87]) and D3js [88] While PICan is strictly a research toolan integromic approach will undoubtedly be essential to futureclinical decision support systems placing integromics at thefront line of patient care delivery

Multimodal analysis is particularly important when con-sidering the future of digital pathology and tissue imagingConventional light and fluorescence imaging provides the basisfor most studies in this field However the future may providenew imaging technologies using light outside of the normal vis-ible spectrum that may be more informative towards illuminat-ing the underlying cancer biology and delivering sensitive andspecific biomarkers of clinical outcome Numerous studies areemerging that exploit technologies such as ultraviolet [89] in-frared [90] and Raman spectroscopy [91 92] together with abroad range of associated multispectral technologies to gain abetter insight into tissue and cell characterization and to visual-ize compounds and chemistry not accessible to conventionalmicroscopy Integrating these imaging modalities into a singleframework to compare compound and deduce clinical utilitynot in isolation but together with conventional imaging meth-ods is essential Similarly there is capacity to integrate otherforms of medical imaging relevant to cancer discovery includ-ing positron emission tomography magnetic resonance imag-ing computed tomography and X-ray This can be used tounderstand the relationship between in vivo clinical imagingand ex vivo tissue imaging thereby translating markers that arecurrently only visible on microscopy to early in vivo detectionwith clinical imaging modalities In both settings the real valueis likely to come from the overlay and integration of imagingmodalities extending the capacity of tissue imaging on its ownand providing new visualization tools for both pathologists andradiologists that are more informative and consequently im-prove clinical decision making The integromics initiative dis-cussed in this article provides such a model for imaging as wellas molecular and clinical data blurring the edges of medicalspecialties and providing a unified framework for discovery

The basic tools of integromic data analysis are generallysimilar to other disciplines of bioinformatics Mean compari-sons clustering survival analyses machine learning and linearmodels for example can all be used as can the various pipe-lines that have been developed for next-generation sequencinggene expression microarrays and other molecular big dataomics techniques Integration of digital image analysis data inintegromic analysis can present its own distinct risks and re-wards Images and their associated markups and annotationsdo not lend themselves on their own to t-tests or clustering for

example Nevertheless these may be useful to contextualizeother pieces of data like IHC scores Calculated scores (egH-score) and metrics (eg integrated optical density nucleararea) can often be expressed as nameattribute pairs so theylend themselves more immediately to integrative analysis al-though data structure and hierarchy can still complicate ana-lysis Some metrics can be measured at a cellular level andsummarized at an ROI or whole slide level Others howevermight not be meaningful at a cellular level (eg H-score) so onemust be mindful of the meaning of each measure when incor-porating it into a multivariate model of the data Tumour het-erogeneity is also an important complication Summary scorescomputed across different regions of the same tumour may dif-fer markedly Heterogeneity at least in solid tumours can bemore easily mapped in tissue sections where tissue structureand cellular context is maintained and intact Spatial changesin tumour cell arrangement can be visualized microscopicallyand used to sample regions for molecular analysis and themeasurement of intratumoural heterogeneity Whole slideimaging and digital image analysis represent powerful tools toquantitatively map tissue phenotype and provide a sound basisfor biomarker assay development while opening up new av-enues of inquiry For example one can begin to explore whethercertain morphometric phenotypes correlate with specific gen-etic or epigenetic profiles and whether these have any signifi-cant prognostic or therapeutic impact

A specific challenge for integromics is the application ofstatistical techniques to data from disparate sources Care mustbe taken to ensure differences observed between experimentalgroups reflect the underlying biology and not simply batch ef-fects Differences in experimental assay platforms such as dif-ferent probes targeting the same gene or different antibodies forIHC may produce discrepancies in the data based on differentlevels of sensitivity and specificity In digital pathology differ-ences in imaging hardware analysis algorithms and parameterdefinitions may lead to various imaging artefacts and interpret-ation difficulties Even within the same software cell countsdetermined from different threshold settings are not compar-able Less visible confounders include study and patient vari-ables like recruitment methodology study admission criteriapatient stratificationdiagnostic criteria pre-collection treat-ments and standard of care (for prognostic biomarker studies)that can affect the makeup of the study patient populationMeanwhile specimen handling variables like cold ischemictime fixation method staining protocols and formulations canhave a profound effect on the data often dominating other con-founders regardless of the subsequent pipeline Ideally thendata harmonization begins well before data are generated withstandardization and thorough documentation of all pre-analytical steps including patient selection tissue collectionand specimen handling Where potential batch effects or con-founding variables have been identified a number of strategiescan be used such as subset analysis (ie stratifying the subjectpopulation on the confounding variable) or regression model-ling [93 94]

Mandated public release of large data sets has been helpful inadvancing transparency in science but tracking the associatedmetadata and assessing its impact on results derived from thesebig data sets remains a major unresolved challenge Even largerepositories like TCGA are composed of smaller collections froma wide range of geographies and institutional environments Inmany cases pooling such diverse data sets becomes an apples-to-oranges comparison Batch effects in TCGA and other largehigh-throughput data sets are well-documented [52 53] Without

8 | Li et al

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

control over data origins and consistency it is essential that theentire process be rigorously validated before interpretation of re-sults What local data sets lack in scale they make up for in trace-ability For PICan we require that our collaborators are able tostand over every data point that enters our system therebyensuring a higher standard of quality and confidence in our dataMetadata recording technical and experimental details are storedin PICan alongside the data Regardless of how it is done meta-data tracking should be rigorous and thorough as differences inexperimental conditions and design must factor into the inter-pretation of any analysis Where appropriate data should be cor-rected and normalized to ensure comparisons are meaningfullike-versus-like comparisons and observed contrasts are notmerely batch or scalar effects

Interdisciplinary teamwork

Apart from robust infrastructure and software support the heartof any successful integromics framework is a cohesive and inter-disciplinary team to support and champion the entire effortDespite the technical complexities of a system like PICan team-work is arguably the most important and yet most often over-looked piece of the integromics puzzle This is fundamentally acollaborative endeavour bringing with it all the challenges ofworking with an interdisciplinary team of individuals and organ-izations each of whom have their own diverse set of goals andpriorities Clinical team members for example may have respon-sibilities to their patients that would not apply to academic re-searchers Nevertheless the specialist insights afforded by aninterdisciplinary team have been indispensable to the success ofPICan Clinicopathological data tables are built on templates cre-ated by clinical experts specializing in each cancer type Clinicaland translational researchers inform the choice of analysis toolsto be added which are then developed and tested by transla-tional bioinformaticians Establishing and incorporating an inte-gromics framework into the mindset of an institution will requireadditional effort and workload on certain individuals Hence it isimperative that team discussions include delegation and sharedresponsibility over matters such as data collection and curationdata formats and standards system maintenance data exchangeand access control Differing priorities and opinions betweenteam members are to some extent unavoidable but teams canadopt measures to manage conflicts such as having all protocolsclearly documented or maintaining a regular meeting scheduleRegardless of how these challenges are met we stress the im-portance of maintaining open lines of communication and agree-ing to clear roles and responsibilities of all team membersEffective teamwork is a non-trivial yet integral contributor to thesuccess of any integromics framework

With an ever-increasing emphasis on interdisciplinaryresearch projects like PICan underscore the need for cross-disciplinary education to instil young researchers with thenecessary mindset understanding and skills to navigate thechallenges and opportunities in the research environment ofthe future Queenrsquos University Belfast for example offers aMasters-level course in Bioinformatics and ComputationalGenomics [95] The course attracts students from a diverserange of backgrounds from medical doctors to engineers andcomputer scientists to biologists from a range of life sciencesubject areas and aims to equip them with a solid foundationacross a broad range of topics such as advanced bioinformaticsdatabase design tissue imaging and integromics

Interdisciplinary teamwork is vital to another pillar of anyintegromics framework access to quality data Any analysis is

only as robust as the data that underpins it The role of the NIBis crucial to PICanrsquos success not only through access to its bio-specimens but also its expertise and infrastructure support forsample preparation processing accessioning tracking andquality control The active participation of an interdisciplinaryteam also distributes the workload of data curation One of thegreat struggles of data collection is maximizing quality andquantity with a limited resource of labour With an increasingfocus among funders and institutions on large-scale multi-centre studies scaling up an integromics framework while pre-serving data quality will rely heavily on greater delegation ofdata curation to the data collectors themselves The onus willbe on the individual data collectors to flag data discrepanciesso that the database administrator serves as the last line of de-fence rather than the first

Ethics and data governance

Like any other system handling clinical and pathological infor-mation an integromics framework also requires appropriateapproval for the use of patient data Additionally data controlsin integromics need to protect both patients and data collectorsde-identification protects patient privacy while access controlsand terms of use policies protect the rights of data collectors

All studies undertaken in PICan need to be approved follow-ing an application to NIB where project-specific ethics and gov-ernance are granted under the Office for Research EthicsCommittees Northern Ireland approved NIB guidelines As a re-search tool PICan stores no personally identifiable informationin its database ensuring patient anonymization and minimiz-ing the potential impact of any data security breach Each pa-tient sample and associated analytical data are labelled using aunique independent study-specific identifier within the systemClinically qualified and authorized patient data collectors retainthe key that links these identifiers to hospital systems wherepersonal data are securely stored Researchers using PICan arerestricted to using anonymized research data onlymdashrespectingdata protection laws and regulations However as importantfollow-up data on patients accrues over time these can still beaccessed by having appropriately authorized clinical staff anddata collectors use the key to identify patients on hospital sys-tems and retrieve more up-to-date patient informationAuthorized clinical personnel can then review and retrieve per-tinent research-relevant information from the hospital systemsand update PICan again with only anonymized research datausing the PICan identifiers This is equivalent to an honest bro-ker system used elsewhere [96 97]

Data curation of de-identified information in PICan producesa robust privacy-controlled snapshot of analytical data but theunderlying data sources are themselves continually beingupdated as patients are followed up and new specimens are col-lected While data update is possible it is necessarily a manualprocess as PICan has been walled off from the original datasources by design allowing it to operate with fewer privacy con-trols than a typical clinical system as no identifiable patientdata are stored To maintain quality and consistency with exist-ing data data updates to PICan are done only periodically withentire cohorts of patients updated at once Alternatively insti-tutions that prioritize constantly updated data will requiredeeper integration between their research and clinical data sys-tems In such cases additional data protection measures will berequired such as encryption or safe havens for data storage andtransfer Essentially such an integrated system would need tomeet local rules and regulations for clinical data systems

Integromics in cancer tissue biomarker research | 9

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

Scaling up for large-scale epidemiologicalstudies

The last few decades have seen tremendous investment into allaspects of cancer biomarker discovery research but as of 2013only 18 protein cancer biomarkers had received US Food andDrug Administration (FDA) approval [98] despite decades of re-search and thousands of publications on novel candidate bio-markers Part of the reason may be a focus on early-stagediscovery activities and the relative lack of large-scale multi-centre validation and clinical trials Studies of such scale pre-sent a number of data management challenges not unlikethose we designed PICan to address First and foremost integro-mics is inherently a team activity and the same principles ofteamwork apply whether it is conducted within a single centreor across multiple research centres on different continents Inthis way integromics research extends readily to large-scalemulticentre epidemiological studies Data harmonization of fileformats and common vocabulary remain vitally important andbest addressed through open communication between collabor-ators As in the case of collaborators within an institution teammembers from different institutions should ideally agree on asingle data format for each data type and a common agreed setof vocabulary to describe data fields collected depending on theneeds of the study and its participants Analysis of multicentredata also requires standardization of pre-analytical parameterssuch as inclusion criteria sample acquisition sample process-ing assay conditions and measurements recorded When estab-lishing project standards and protocols teams should considerwhat community standards are already available to maximizeinteroperability and avoid redundancies Across Europe for ex-ample projects like ELIXIR [99] and Euro-BioImaging [100] are al-ready seeking to establish hubs for biomedical and bioimagingresource sharing Globally similar efforts in genomics are alsounderway [19ndash21]

The process of building digital infrastructure to support amulticentre integromics framework is similar to the case of asingle institution All team members should meet and agree onthe structure of the framework Depending on the extent of theproject and range of requirements the technical scientific andmedical team will need to grow as well Among the team thereis a need for at least one full-time staff member dedicated totechnical development and management of the digital infra-structure As the project grows the staffing requirement willundoubtedly increase in tandem For this reason modular de-sign becomes even more crucial as it allows development tasksto be carved up and delegated out to different team membersThere will also be additional requirements for hardware andsoftware licences Software costs can be mitigated by using free(eg R) or low-cost volume licenced (eg Microsoft Visual Studio(Redmond WA USA) for ASPNET in an academic setting) op-tions Nevertheless despite minor challenges digital solutionsgenerally scale well both in size and geographic distributionParallels can be drawn between incorporating external data re-sources into an integromics environment as previously dis-cussed and the challenges of connecting information systemsacross multiple sites However multicentre teams also facesome specific challenges Teams may need to comply with vari-ous sets of local laws and regulations surrounding researchethics data security and patient privacy Network access andsecurity may also be further complicated in a multicentre set-ting PICan for example was built for use within a single insti-tution so access was limited to within the institutionrsquos networkfirewall with all external access blocked A system spanning

multiple institutions would need greater attention to securitywhile maintaining access from outside the host institutionTeams may also need to contend with different software andhardware infrastructures at the various centres and implementthe necessary interfaces to ensure compatibility However theultimate key to success of multicentre collaborations is a funda-mental commitment to team science and on that front institu-tions that embrace integromics will have the experience to leadthe coming age of large-scale multicentre studies

Furthermore platforms like PICan and the integration ofimage analysis data with clinicopathological and therapeuticdata will help accelerate the screening of candidate tissue bio-markers for prognostic and predictive capacity While this rep-resents a powerful research and discovery tool this is likely toidentify new tissue biomarkers (based on IHC chromogenic insitu hybridization (CISH) fluorescence in situ hybridization(FISH) RNAscope etc) that have clear value in routine diagnos-tics and the stratification of patients for precision therapeuticsFurther validation of these biomarkers would be necessary incarefully controlled and standardized biomarker trials takinginto consideration sample preparation scanner configurationlab-to-lab variation comparative performance against patholo-gist scoring and clinical outcome data Regulatory approval ofdigital diagnostic tests would require FDA clearance (USA) orCE-IVD marking (Europe) if they were to be marketed as clinicaltests by industry but may also be implemented as laboratory-developed tests with documented internal validation and clin-ical evidence if done within an academic medical centre suchas our own setting Integromics platforms provide the frame-work for discovery and if designed appropriately can provide ro-bust clinical evidence of the utility of new biomarkers forsubsequent commercial licensing regulatory approval and clin-ical translation into routine diagnostictherapeutic practice

Conclusion

Modern biomedical research generates scientific and clinicaldata faster than the ability to fully interpret it while data fromnew and emerging technologies need to be integrated andharmonized with existing analysis workflows Efforts to resolvethe resulting bottleneck in interpretation to deliver on thepromises of precision medicine in this era of fast-track bio-marker discovery and evaluation have led to the birth of a newfield in bioinformatics integromics After introducing our ownframework for streamlining integromic analyses it becameclear that there is widespread interest in implementing integro-mic strategies in other research institutions In light of the pre-sent discussion we feel that an optimal integromics frameworkmust be well-supported by a dynamic and interdisciplinaryteam reflect the local needs and priorities of the institution andbe designed with an eye to the future The team is the gatewayto high-quality data and specialist insights that inform analysisand ultimately catalyse translation to clinical application insupport of precision medicine Meanwhile biomarker researchat our institution is currently centred on TMAs and we haveemphasized data quality and curation over raw throughputThese traits are imprinted in the design and operation of PICanbut other institutions may have other needs that demand a dif-ferent infrastructure Ongoing development of PICan includesincorporating raw data storage and access for completenessand traceability and strengthening support for digital imageanalysis As our institution adopts newer technologies PICanlikewise will continue to evolve facilitated by its modular de-sign to meet the changing needs of a busy cancer research

10 | Li et al

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

environment spanning numerous programmes technologiesand scientific objectives Hence integromics should be interwo-ven into the fabric of the institution and not simply be a down-loadable add-on

Key Points

bull Bold approaches are needed to maximize the potentialof big data in modern biomedical research and tobreak down the silos in which they are currently held

bull Collating storing transferring and integratively analy-sing vast amounts of heterogeneous multimodal datarequires a comprehensive and cohesive integromicsstrategy that streamlines biomarker discovery byexposing fresh insights into the underlying factorscontributing to disease phenotypes

bull We explore the practical considerations behind imple-menting an integromics framework using our experi-ences and design choices with PICan as a case studywhile recognizing that the global diversity of bio-marker discovery research programmes will requiretailoring solutions to each research environment

bull Successful integromics frameworks are supported bydynamic interdisciplinary teams and are inherentlyadaptable to new technologies and the host institu-tionrsquos evolving priorities

bull Integromics must be a central part of the institutionalculture and not just a downloadable add-on

Conflict of interest

Professor Peter Hamilton is the founder of and non-executivedirector with PathXL Ltd Professor Manuel Salto-Tellez isSenior Advisor to PathXL Ltd

Funding

The research leading to these results has received funding fromthe People Programme (Marie Curie Actions) of the EuropeanUnionrsquos Seventh Framework Programme FP72007ndash2013underREA grant agreement (285910 to PWH) Invest NorthernIreland (RDO0712612 to PWH) Cancer Research UK Accelerator(C11512A20256 to PWHMS-T) CRUK (to PB) The NorthernIreland Molecular Pathology Laboratory is supported by CancerResearch UK Experimental Cancer Medicine Centre Networkthe NI Health and Social Care Research and DevelopmentDivision the Sean Crummey Memorial Fund the Tom SimmsMemorial Fund and the Friends of the Cancer Centre (to MS-T) The Northern Ireland Biobank is funded by the Health andSocial Care Research and Development Division of the PublicHealth Agency in Northern Ireland and Cancer Research UKthrough the Belfast CRUK Centre and Northern IrelandExperimental Cancer Medicine Centre additional support wasreceived from the Friends of the Cancer Centre (to JAJ)

References1Canuel V Rance B Avillach P et al Translational research plat-

forms integrating clinical and omics data a review of publiclyavailable solutions Brief Bioinform 201516280ndash90

2 Noor AM Holmberg L Gillett C et al Big Data the challengefor small research groups in the era of cancer genomics Br JCancer 20151131405ndash12

3 PathXL Xplore httpwwwpathxlcompathxl-researchpathxl-xplore (20 November 2015 date last accessed)

4 Biomarker Interpretation Workflow - OncoMark httpwwwoncomarkcomgoproductsbiomarker-interpretation-workflow (23 November 2015 date last accessed)

5 Denaxas SC George J Herrett E et al Data resource profilecardiovascular disease research using linked bespoke studiesand electronic health records (CALIBER) Int J Epidemiol2012411625ndash38

6 Lubbock ALR Katz E Harrison DJ et al TMA Navigatornetwork inference patient stratification and survival ana-lysis with tissue microarray data Nucleic Acids Res 201341W562ndash8

7 Giardine B Riemer C Hardison RC et al Galaxy a platform forinteractive large-scale genome analysis Genome Res 2005151451ndash5

8 Goecks J Nekrutenko A Taylor J Galaxy a comprehensiveapproach for supporting accessible reproducible and trans-parent computational research in the life sciences GenomeBiol 201011R86

9 Madhavan S Gusev Y Harris M et al G-DOC a systems medi-cine platform for personalized oncology Neoplasia 201113771ndash83

10Dander A Baldauf M Sperk M et al Personalized OncologySuite integrating next-generation sequencing data andwhole-slide bioimages BMC Bioinformatics 201415306

11Kartashov AV Barski A BioWardrobe an integrated platformfor analysis of epigenomics and transcriptomics data GenomeBiol 201516158

12Harris PA Taylor R Thielke R et al Research electronic datacapture (REDCap)ndasha metadata-driven methodology andworkflow process for providing translational research in-formatics support J Biomed Inform 200942377ndash81

13Szalma S Koka V Khasanova T et al Effective knowledgemanagement in translational medicine J Transl Med201081ndash9

14Manley S Mucci NR De Marzo AM et al Relational databasestructure to manage high-density tissue microarray data andimages for pathology studies focusing on clinical outcomethe prostate specialized program of research excellencemodel Am J Pathol 2001159837ndash43

15Weinstein JN Integromic analysis of the NCI-60 cancer celllines Breast Dis 20041911ndash22

16Weinstein JN Pommier Y Transcriptomic analysis of theNCI-60 cancer cell lines C R Biol 2003326909ndash20

17Yu P Artz D Warner J Electronic health records (EHRs) sup-porting ASCOrsquos vision of cancer care Am Soc Clin Oncol EducBook 2014225ndash31

18McArt DG Blayney JK Boyle DP et al PICan an integromicsframework for dynamic cancer biomarker discovery MolOncol 201591234ndash40

19Hudson TJ Anderson W The International Cancer GenomeConsortium et al International network of cancer genomeprojects Nature 2010464993ndash8

20Home j Global Alliance for Genomics and Health httpsgenomicsandhealthorg (15 March 2016 date last accessed)

21The 1000 Genomes Project Consortium A global reference forhuman genetic variation Nature 201552668ndash74

22Collins FS Varmus H A New Initiative on Precision MedicineN Engl J Med 2015372793ndash5

23Kolesnikov N Hastings E Keays M et al ArrayExpress up-datendashsimplifying data submissions Nucleic Acids Res 201543D1113ndash6

Integromics in cancer tissue biomarker research | 11

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

24Leinonen R Akhtar R Birney E et al The European NucleotideArchive Nucleic Acids Res 201139D28ndash31

25The Cancer Genome Atlas Research NetworkComprehensive genomic characterization defines humanglioblastoma genes and core pathways Nature 20084551061ndash8

26Home - The Cancer Genome Atlas - Cancer Genome -TCGAhttpcancergenomenihgov (1 December 2015 datelast accessed)

27Kristensen VN Lingjaeligrde OC Russnes HG et al Principlesand methods of integrative genomic analyses in cancer NatRev Cancer 201414299ndash313

28Guinney J Dienstmann R Wang X et al The consensusmolecular subtypes of colorectal cancer Nat Med 2015211350ndash6

29Gao J Aksoy BA Dogrusoz U et al Integrative analysis ofcomplex cancer genomics and clinical profiles using thecBioPortal Sci Signal 20136pl1

30Cerami E Gao J Dogrusoz U et al The cBio cancer genomicsportal an open platform for exploring multidimensional can-cer genomics data Cancer Discov 20122401ndash4

31Forbes SA Beare D Gunasekaran P et al COSMIC exploringthe worldrsquos knowledge of somatic mutations in human can-cer Nucleic Acids Res 201543D805ndash11

32Broad GDAC Firehose httpgdacbroadinstituteorg (22March 2016 date last accessed)

33Le Cao K-A Gonzalez I Dejean S integrOmics an R packageto unravel relationships between two omics datasetsBioinformatics 2009252855ndash6

34Waller T Gubała T Sarapata K et al DNA microarray integro-mics analysis platform BioData Min 2015818

35Cecic IK Li G MacAulay C Technologies supporting analyt-ical cytology clinical research and drug discovery applica-tions J Biophotonics 20125313ndash26

36Rocha R Vassallo J Soares F et al Digital slides present sta-tus of a tool for consultation teaching and quality control inpathology Pathol Res Pract 2009205735ndash41

37Pantanowitz L Valenstein P Evans A et al Review of the cur-rent state of whole slide imaging in pathology J Pathol Inform2011236

38Brachtel E Yagi Y Digital imaging in pathologyndashcurrent ap-plications and challenges J Biophotonics 20125327ndash35

39Hamilton PW Bankhead P Wang Y et al Digital pathologyand image analysis in tissue biomarker research Methods20147059ndash73

40Schneider CA Rasband WS Eliceiri KW NIH Image to ImageJ25 years of image analysis Nat Methods 20129671ndash5

41Schindelin J Arganda-Carreras I Frise E et al Fiji - an OpenSource platform for biological image analysis Nat Methods20129676ndash82 101038nmeth2019

42de Chaumont F Dallongeville S Chenouard N et al Icy anopen bioimage informatics platform for extended reprodu-cible research Nat Methods 20129690ndash6

43Sommer C Straehle C Kothe U et al Ilastik interactive learn-ing and segmentation toolkit In Biomedical Imaging fromNano to Macro 2011 IEEE International Symposium on 2011pp 230ndash233 IEEE

44Hamilton PW van Diest PJ Williams R et al Do we see whatwe think we see The complexities of morphological assess-ment J Pathol 2009218285ndash91

45Salto-Tellez M James JA Hamilton PW Molecular path-ologymdashthe value of an integrative approach Mol Oncol201481163ndash8

46Li G van Niekerk D Miller D et al Molecular fixative enablesexpression microarray analysis of microdissected clinicalcervical specimens Exp Mol Pathol 201496168ndash77

47Galon J Mlecnik B Bindea G et al Towards the introductionof the lsquoImmunoscorersquo in the classification of malignant tu-mours J Pathol 2014232199ndash209

48R Core Team R A Language and Environment for StatisticalComputing Vienna Austria R Foundation for StatisticalComputing 2015

49Baier T Neuwirth E Excel COM R Comput Stat 20072291ndash10850McCarty KS Miller LS Cox EB et al Estrogen receptor ana-

lyses Correlation of biochemical and immunohistochemicalmethods using monoclonal antireceptor antibodies ArchPathol Lab Med 1985109716ndash21

51Goulding H Pinder S Cannon P et al A new immunohisto-chemical antibody for the assessment of estrogen receptorstatus on routine formalin-fixed tissue samples Hum Pathol199526291ndash4

52Leek JT Scharpf RB Bravo HC et al Tackling the widespreadand critical impact of batch effects in high-throughput dataNat Rev Genet 201011733ndash9

53Lauss M Visne I Kriegner A et al Monitoring of TechnicalVariation in Quantitative High-Throughput Datasets CancerInform 201312193ndash201

54Codd EF Normalized Data Base Structure a Brief Tutorial InProceedings of the 1971 ACM SIGFIDET (Now SIGMOD)Workshop on Data Description Access and Control SanDiego CA USA 1971 pp 1ndash17 Association for ComputingMachinery New York NY

55Kacprzyk J Zadro _zny S De Tre G Fuzziness in database man-agement systems half a century of developments and futureprospects Fuzzy Sets Syst 2015281300ndash7

56Bosc P Pivert O Farquhar K Integrating fuzzy queries intoan existing database management system an example Int JIntell Syst 19949475ndash92

57Duggal R Khatri SK Shukla B Improving patient matchingsingle patient view for Clinical Decision Support using BigData analytics In 2015 4th International Conference onReliability Infocom Technologies and Optimization (ICRITO)(Trends and Future Directions) Noida India 2015 pp 1ndash6 IEEENew York NY

58OpenSeadragon httpsopenseadragongithubio (7 April2016 date last accessed)

59Allred DC Bustamante MA Daniel CO et alImmunocytochemical analysis of estrogen receptors inhuman breast carcinomas evaluation of 130 cases and re-view of the literature regarding concordance with biochem-ical assay and clinical relevance Arch Surg 1990125107ndash13

60Singh R Chubb L Pantanowitz L et al Standardization in digi-tal pathology supplement 145 of the DICOM standardsJ Pathol Inform 2011223

61Linkert M Rueden CT Allan C et al Metadata matters accessto image data in the real world J Cell Biol 2010189777ndash82

62The HDF Group HDF Group - HDF5 httpswwwhdfgrouporgHDF5 (3 January 2016 date last accessed)

63Sommer C Held M Fischer B et al CellH5 a format for dataexchange in high-content screening Bioinformatics 2013291580ndash2

64Allan C Burel J-M Moore J et al OMERO flexible model-driven data management for experimental biology NatMethods 20129245ndash53

65Maree R Rollus L Stevens B et al Collaborative analysis ofmulti-gigapixel imaging data using Cytomine Bioinformatics2016321395ndash1401

12 | Li et al

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

66DigitalScope V5 httpwwwdigitalscopeorg (26 November2015 date last accessed)

67Astier Y Braha O Bayley H Toward single molecule DNAsequencing direct identification of ribonucleoside and deox-yribonucleoside 5lsquo-monophosphates by using an engineeredprotein nanopore equipped with a molecular adapter J AmChem Soc 20061281705ndash10

68Stoddart D Heron AJ Mikhailova E et al Single-nucleotidediscrimination in immobilized DNA oligonucleotides with abiological nanopore Proc Natl Acad Sci USA 20091067702ndash7

69Flusberg BA Webster D Lee J et al Direct detection of DNAmethylation during single-molecule real-time sequencingNat Methods 20107461ndash5

70Roberts RJ Carneiro MO Schatz MC The advantages of SMRTsequencing Genome Biol 201314405

71Sheth AP Larson JA Federated database systems for manag-ing distributed heterogeneous and autonomous databasesACM Comput Surv 199022183ndash236

72Baker EJ Biological databases for behavioral neurobiology IntRev Neurobiol 201210319ndash38

73Peuroaeuroakkonen P Pakkala D Reference architecture and classifi-cation of technologies products and services for big data sys-tems Big Data Res 20152166ndash86

74Haider S Ballester B Smedley D et al BioMart Central Portalndashunified access to biological data Nucleic Acids Res 200937W23ndash7

75Barrett T Wilhite SE Ledoux P et al NCBI GEO archive forfunctional genomics data setsndashupdate Nucleic Acids Res201341D991ndash5

76Goode A Gilbert B Harkes J et al OpenSlide a vendor-neutralsoftware foundation for digital pathology J Pathol Inform2013427

77Pearson WR Lipman DJ Improved tools for biological se-quence comparison Proc Natl Acad Sci USA 1988852444ndash8

78Berman JJ Edgerton ME Friedman BA The tissue microarraydata exchange specification a community-based opensource tool for sharing tissue microarray data BMC MedInform Decis Mak 200335

79Unified Medical Language System (UMLS) - Home httpswwwnlmnihgovresearchumls (20 November 2015 datelast accessed)

80Komatsoulis GA Warzel DB Hartel FW et al caCORE version3 implementation of a model driven service-oriented archi-tecture for semantic interoperability J Biomed Inform 200841106ndash23

81Covitz PA Hartel F Schaefer C et al caCORE a common infra-structure for cancer informatics Bioinformatics 200319 2404ndash12

82Papatheodorou I Crichton C Morris L et al A metadata ap-proach for clinical data management in translational gen-omics studies in breast cancer BMC Med Genomics 2009266

83Mohanty SK Mistry AT Amin W et al The development anddeployment of Common Data Elements for tissue banks fortranslational research in cancer - an emerging standardbased approach for the Mesothelioma Virtual Tissue BankBMC Cancer 2008891

84 Patel AA Kajdacsy-Balla A Berman JJ et al The developmentof common data elements for a multi-institute prostate can-cer tissue bank the Cooperative Prostate Cancer TissueResource (CPCTR) experience BMC Cancer 20055108

85 Shiny httpshinyrstudiocom (6 April 2016 date lastaccessed)

86 RNET documentation ndash user version j RNET ndash user versionhttpjmp75githubiordotnet (6 April 2016 date lastaccessed)

87 matplotlib python plotting mdash Matplotlib 151 documenta-tion httpmatplotliborg (6 April 2016 date last accessed)

88 D3js - Data-Driven Documents httpsd3jsorg (6 April2016 date last accessed)

89 Fereidouni F Datta-Mitra A Demos S et al Microscopy withUV Surface Excitation (MUSE) for slide-free histology andpathology imaging In Progress in Biomedical Optics andImaging - Proceedings of SPIE San Francisco CA USA 2015p 93180F SPIE Bellingham WA USA

90 Bird B Miljkovic M Romeo MJ et al Infrared micro-spectralimaging distinction of tissue types in axillary lymph nodehistology BMC Clin Pathol 200888

91 Taleb A Diamond J McGarvey JJ et al Raman Microscopy forthe Chemometric Analysis of Tumor Cells J Phys Chem B200611019625ndash31

92 Guze K Pawluk HC Short M et al Pilot study Raman spec-troscopy in differentiating premalignant and malignant orallesions from normal mucosa and benign lesions in humansHead Neck 201537511ndash7

93 McNamee R Regression modelling and other methods tocontrol confounding Occup Environ Med 200562500ndash6

94 dos Santos Silva I Dealing with confounding in the analysisIn Cancer Epidemiology Principles and Methods LyonFrance International Agency for Research on Cancer 1999305ndash31

95 Queenrsquos University Belfast j Postgraduate and ProfessionalDevelopment j MSc Bioinformatics and ComputationalGenomics httpwwwqubacukschoolsmdbspgdPTMScBioinformaticsandComputationalGenomics (24 March2016 date last accessed)

96 Dhir R Patel AA Winters S et al A multidisciplinary ap-proach to honest broker services for tissue banks and clin-ical data a pragmatic and practical model Cancer 20081131705ndash15

97 Boyd AD Hosner C Hunscher DA et al An lsquoHonest Brokerrsquomechanism to maintain privacy for patient care and aca-demic medical research Int J Med Inform 200776407ndash11

98 Pavlou MP Diamandis EP Blasutig IM The long journey ofcancer biomarkers from the bench to the clinic Clin Chem201359147ndash57

99 ELIXIR Data for life httpswwwelixir-europeorg (24March 2016 date last accessed)

100 Research Infrastructure for Imaging Technologies inBiological and Biomedical Sciences j Euro-BioImaginghttpwwweurobioimagingeu (24 March 2016 date lastaccessed)

Integromics in cancer tissue biomarker research | 13

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

Page 5: Embracing an integromic approach to tissue biomarker ... › download › pdf › 33595853.pdf · Her research uses integrative molecular approaches for biomarker discovery and validation

Relational database systems

Integromics software infrastructure can be built around one ofmany database management systems available each with itsown strengths and weaknesses As we envisioned PICan to in-corporate a semi-automated analysis engine [18] we prioritizeddata quality which maximizes the validity and reproducibilityof our results and preservation of data structure which stream-lines the automated retrieval and processing of big data sets ra-ther than attempting to blindly maximize data throughputReflecting these design priorities PICan was built around a rela-tional database (MySQL) consisting of interrelated data tablesthat reflect the process of constructing and harnessing TMAswhich are central to much of the biomarker discovery work con-ducted at our institution the Northern Ireland MolecularPathology Laboratory and through the Northern Ireland Biobank(NIB) (Figure 2) Every row of every table has a system-generatedprimary key and these are used as foreign keys to form relation-ships between tables The database centres on the Patient andBlock tables (table names will be italicized) reflecting our core vi-sion for PICan to drive holistic multimodal integromic analysisat a patient level One-to-many relationships link the Patienttable to tables containing the clinicopathological and treatmentdata capturing cancer-type-specific fields from templates de-veloped in collaboration with our clinical expert partners Thesetemplates also define permissible data inputs for each field (egage is a number sex is lsquoMrsquo or lsquoFrsquo) which are then encoded intothe database schema A patient may have had multiple tissuespecimens collected as tissue blocks each represented by a

separate linked entry in the Block table Linked via the Block tableare tables representing molecular assays (SangerSequenceDNAMicroarrayChip ExpressionLevel) whole face tissue sections(TissueSlide) TMA construction (Cylinder RecipientBlock TMASlideTMACore) IHC scoring (TissueScore BiomarkerScore) and digitalpathology (TissueVirtualSlide TMAVirtualSlide VirtualCoreDIAScore) all of which reflect primary activities within our insti-tutionrsquos biomarker discovery program Metadata is stored in thetable corresponding to the relevant physical entity and soimaging parameters for a whole face tissue slide for examplewould go in the TissueVirtualSlide table While this databasestructure is well suited to our needs laboratories collectingnon-tissue specimens such as blood sputum or urine may findit inappropriate for their needs A Blood table might sit at theheart of a haematological laboratoryrsquos database for examplewhile other institutions may need a cluster of tables to repre-sent different specimen types each linked to tables for variousbiochemical assays

PICan uses a modular internal structure to support variousdata types Each table represents a physical entity and conse-quently a data type This allows the addition of new tables torepresent data types from emerging technologies as they ma-ture As maximizing utility of new tables will require updates tothe end-user web interface and setting the appropriate foreignkeys will require knowledge of the internal structure of thePICan database only the system administrator can add newtables

Another consideration for relational database design is nor-malization This procedure simplifies the data structure by

Figure 1 PICan sits at the heart of an integrated multidisciplinary biomarker discovery and validation pipeline Multiple curated data sources feed into PICan which

accelerates downstream analysis and report generation

4 | Li et al

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

removing duplication of data ensuring that the table is resilientagainst inconsistencies arising from incomplete addition dele-tion or modification of data As described in [54] this isachieved by identifying groups of related or dependent datafields and splitting them off into a separate table using foreignkeys to record relationships between tables For most practicalapplications the third normal form balances a robust architec-ture without overly cluttering the schema with an untenablenumber of tables Normalization approaches can also be used tooptimize database speed and performance

A consequence of our decision to use a relational database isthat free-form data needs to be harmonized with the data struc-ture presented by PICan before data entry The requisite cur-ation steps prioritize data integrity and prevent poor-qualitydata that might impede proper and meaningful analysis fromentering the database (garbage-in-garbage-out) Clinical andpathological data collection and storage are guided by tem-plates collections of well-defined data fields that have been de-veloped through our close working relationships with the NIBNorthern Ireland Cancer Registry Northern Ireland Health andSocial Care Trusts and clinicians These collaborators share thedesire to drive data integration and provide expertise contextand data access to harmonize data templates around commonvocabulary and permissible data values ensuring maximal clin-ical relevance of PICan In addition to the harmonized common

fields like age and sex PICan also supports cancer-type-specificfields for maximum flexibility

Non-relational database structures

While a structured relational database has served PICan wellthere are alternative models that might better suit other re-search integromics environments MongoDB for example is anon-relational (or lsquoNoSQLrsquo) database architecture that uses adocument model where each entity (a tissue block for ex-ample) is represented by a lsquodocumentrsquo that might contain hun-dreds of attributes as key-value pairs This model works with aflexible schema allowing data fields to be added or omitted ona case-by-case basis while harmonization of terminology anddata fields occurs at data retrieval In a research environmentespecially at the beginning of a study when pipelines and meta-data are likely to change it is easier to update the data modelthan to redesign the fixed schema of a relational databaseHowever with less urgency to curate data before upload dataintegrity may be variable with errors remaining unnoticed andconfounding analyses performed on the data Fuzzy matchingalgorithms [55ndash57] can mitigate the impact of typographical andsome clerical errors but these approaches must still be rigor-ously validated In some cases it might be impossible even for ahuman expert to deduce the original intention Data must then

Figure 2 Schematic of the internal table structure of PICanrsquos relational database reflecting the core activities and workflows that comprise the biomarker discovery

and validation program at our institution Most tables represent physical or digital entities within these workflows Where table names may be unclear a brief descrip-

tion has been included along with some example data fields (in parentheses)

Integromics in cancer tissue biomarker research | 5

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

be discarded from analysis potentially reducing the statisticalpower Manual curation at the outset might identify such caseswhen there is still an opportunity to seek clarification from theoriginal data collector Ideally ambiguous annotations and bor-derline cases should be resolved in consultation with a suitablyqualified medical professional before they enter the databaserather than leaving them open to interpretation by downstreaminformaticians Likewise missing or contradictory data pointsare preferably identified and where possible corrected beforeentry into the database However a major advantage of an un-structured database model is the ability to more faithfully cap-ture the complexities of the patientrsquos clinical pathological andother data parameters including free-form text fields such asclinical notes The ability to capture patient data in its nativeform also improves the traceability and scalability of data entryinto a non-structured database solution by eliminating the la-bour and documentation associated with data curation andenabling automated bulk data upload Despite the tools wehave developed to semi-automate data upload this will remaina relative drawback of manual curation in the short term espe-cially for clinical and pathological data fields In the longerterm scalability may be improved with increased adoption ofelectronic medical records wherein many of these data fieldscould be completed by the attending physician when encoun-tering the patient

Balancing the data curation afforded by the structured rela-tional database with the high-throughput flexibility of unstruc-tured models are various hybrid options For example PICanoffers flexibility to the structured database model by includingfreeform text fields to capture unstructured notes (in our experi-ence these are typically for clinical notes) in several of the datatables As these fields are not referenced directly by the auto-mated online analysis component which requires well-definedparameters they are hidden from analysis but can be retrievedby users wishing to conduct further data refinement or offlineanalysis In practice however it is more likely that any fieldsnot used in analysis by the system will simply be ignored andneglected Another hybrid solution we are actively exploring toincrease data throughput is to allow the importation of non-curated data sets for end-user analysis In this model curatedlsquopublicrsquo data sets would be available to all users (with necessaryapprovals and access restrictions in place) while each end userwill be able to upload lsquoprivatersquo data sets to be analysed along-side or integrated with the curated data The user then is re-sponsible for ensuring the quality of hisher own dataExtending the hybrid model concept further are heterogeneousdatabases in which only the most important unchanging andsensitive data are typically stored in a relational databaseMeanwhile the bulk of the data in terms of size will be storedin non-relational databases or as raw data with links stored inthe main database to ensure ease of access and retrieval

Image analysis data storage and management

Current work is focussing on improving the support of digitalimage analysis data an emerging area of research in which ourgroup has significant expertise Imaging data presents someunique challenges for database storage design The imagesthemselves can be large One of our typical tissue slidesscanned at 40objective magnification (025 lm per pixel) canhave 2ndash10 billion pixels Imaging in RGB at 8 bits per channelgives uncompressed file sizes of up to 30 GB Additional colourchannels (such as for fluorescence imaging) z-stacks and dataredundancy in the form of image pyramids for improved image

loading performance would all greatly increase file sizesCompression typically JPEG or JPEG2000 for whole slide imagescan significantly reduce file size without drastically impairingvisual assessment although suitability for image analysis ismore variable [39] Hence it is vital to store the original slideimage to allow a pathologist to go back and verify that imageanalysis results are consistent with hisher expectations This iscurrently achieved by hosting the slide images on remote ser-vers maintained by PathXL A link is stored in the PICan data-base that allows the web-based interface to retrieve the imagedata on user request via an application programming interface(API) supplied by PathXL Pan zoom and overlay functionalitiesin the PICan web interface are powered by OpenSeadragon [58]As the samples are drawn from NIB collections NIB maintainscontrol of access and security using the PathXL software

In addition to the raw image pixels image analysis datamight include markups annotations and calculated metrics(such as integrated optical densities nuclear areas or immuno-histochemical H-scores [50 51]) Metrics can be at a cellularlevel or summarized across a region of interest (ROI) a slide oreven a set of slides Much of these data may be hierarchicallystructured with each slide potentially containing multiple ROIseach ROI containing many cells and each cell being associatedwith values for numerous metrics However the efficiency ofstoring image analysis data in a structured database is uncer-tain when a single image can contain thousands of annotationsFor now PICan stores only summary scores and metrics (eg H-score [50 51] Allred Score [59] percentage positive cells) whileusers wishing to interrogate the image data on a more granularlevel will need to rely on external specialist tools The challengeof storing and transferring image analysis data between sys-tems is an ongoing one with several potential approaches hav-ing been proposed including DICOM [60] OME-TIFF [61] variousHDF5-based formats [62 63] and proprietary vendor formatsbut without a clear front-runner DICOM and OME-TIFF are pri-marily image formats with flexible metadata support optimizedfor image viewing rather than storage of image analysis resultswith large numbers of image objects In contrast HDF5 capturesobject hierarchy relationships well but on its own it is not a na-tive image format Meanwhile proprietary formats are poten-tially encumbered by patenting and intellectual propertyconcerns

Integromics data management is patient-centric

Regardless of the database management system being used acritical characteristic of integromics analysis is patient-centricity The ultimate objective in bringing together data frommultiple experimental platforms is to build a more holistic pic-ture of the interplay between genotype and phenotype in thepatients under study As interest in amalgamating multipledata types has grown many platforms originally designed tohost one data type have been expanded to accommodate add-itional types of data Many biobank management software sys-tems allow addition of clinicopathological notes andexperimental assay results on top of sample tracking dataMeanwhile digital slide hosting platforms like OMERO [64]PathXL Xplore Cytomine [65] and DigitalScope (Aptia SystemsHouston TX used by the College of American Pathologists) [66]are beginning to support the addition of associated clinicopa-thological data as metadata Others like TCGA [25 26] origi-nated as molecular data repositories but have begun addingwhole slide imaging There is a clear trend towards convergenceof database models encompassing more and more data types

6 | Li et al

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

but their impact will be limited if these extra data are simplystored and forgotten Integromics is an active process not just ofbringing data together but more importantly of creative com-parisons and contrasts to expose underlying interactions under-pinning the cancer phenotype Many public repositories and bigdatabases such as TCGA draw from an assortment of patientcohorts so associations between different data types measuredon different cohorts are limited to cohort summaries poten-tially obscuring trends between data types that are only appar-ent at the individual patient level

Interacting with external resourcesmdashdataexchange between systems

With the current rapid expansion of big data analytics in bio-medicine integromics frameworks must remain flexibleenough to incorporate emerging and future technologiesStructured architecture and design of the overall integromicsdatabase framework can facilitate this However large data setslike next-generation sequencing (and upcoming third-gener-ation sequencing [67ndash70]) can be unwieldy in a MySQL databasewith typical whole exome or RNA sequencing runs comprising10ndash100 billion base pairs for a single sample so future integro-mics systems may house these types of data in databases thatare structured specifically for this purpose Instead of a singledatabase housing multiple data types a future integromicsdatabase will undoubtedly be structured as a fully distributedfederated database [71ndash74] with a central database storing iden-tifiers that interface with a cluster of highly specialized data-bases each housing a specific type of data in its native formatArguably PICan already uses a version of this model by out-sourcing digital slide image storage to remote servers Futuredevelopments may include enabling PICan to tap into publiclyavailable data sources and allowing users to analyse these pub-lic data sets in tandem with PICan data such as accessing GeneExpression Omnibus data through GEO2R [75]

One of the challenges when working with a federated data-base model is how to exchange data faithfully and efficientlybetween the various components Each constituent databasewill need an interface to interact seamlessly with the centraldatabase and control system Some of these will be availableoff-the-shelf or rely on published APIs while others such asthose interacting with in-house software may need significantdeveloper time and effort to implement Ultimately there areno shortcuts to establishing these interfaces and so institutionsshould aim to reduce the number of distinct interfaces bystandardizing data around a small set of formats For examplewhen working with whole slide images a single institutioncould strive to use only a single file format The various avail-able formats each have their own strengths and weaknessesbut the natural option will likely be the native format of thescanner used at the institution or a vendor-neutral format likeDICOM [60] or OME-TIFF [61] Fortunately many tools for work-ing with whole slide images support multiple file formats suchas DigitalScope [66] for remote web-based viewing of variousimaging formats Meanwhile projects like Bio-Formats (OpenMicroscopy Environment consortium) [61] and OpenSlide [76]further facilitate the viewing and exploitation of various imag-ing formats in other programming and analysis languages likeC Cthornthorn Python and Java Additionally it is important to stand-ardize the data transfer protocol whether this is throughInternet standards like hypertext transfer protocol physicalmeans like hard disks or some other agreed protocol Perhaps

establishing an institutional integromics strategy will encour-age adoption of standardized data formats by creating an ex-pectation among data generators that data should be portableand compatible with other data types for analysis

Data flow within an integromics environment can often bemultidirectional Even with only a single central data repositorythere will be instances when it is beneficial to export some ofthese data to a separate third-party platform whether for speci-alized analysis report generation or some other purpose AsPICanrsquos web-based user interface is designed to be user friendlyit supports only a limited number of types of analyses and re-port generation options Analyses not hard-coded into the inter-face for example would need to be performed lsquoofflinersquo In factthis is how new analyses are tested before being incorporatedinto PICan Similar to data import and storage data export froman integromics system is complicated by the wide range of datatypes that demand different formats to accommodate their in-herently different natures Efforts have been made in somefields to standardize and integrate data models such as thoseof the Global Alliance for Genomics and Health for genomicdata [20] Hence where widely recognized data exchange for-mats exist such as FASTAFASTQ files for nucleotide or aminoacid sequences [77] or the TMA data exchange standard [78]they should be prioritized for integromics support For otherdata types like digital image analysis however there remains aneed for a well-defined and widely accepted standard for dataexchange Whether it is possible or feasible to define a singleunifying integromic data exchange format that would encapsu-late multiple data types attached to individuals within a cohortof patient subjects remains an open question

Standardization of data vocabulary

Any data exchange standard needs to address both how dataare encoded (structure) as well as a common vocabulary forcommonly encountered fields CSV JSON XML and derivativesof these are popular formats that lend themselves well to struc-tured data Certainly CSV export is our default as although itdoes not fully represent the relational structure underpinningPICanrsquos database it mimics the basic tabular structure and canbe easily viewed by other researchers in common spreadsheetor statistical packages Non-tabular data like digital slide images(which are stored as links in the database) are exported as sep-arate files The greater challenge might be agreeing on a com-mon vocabulary Data field names need to be mapped fromPICanrsquos internal vocabulary to that of the target external plat-form This is not a trivial task as even professionals within thesame institution may use different names for the same metricSome clinicians may be more precise in how certain metrics aremeasured and recorded such as when some distinguish be-tween lymphatic and vascular invasion while others groupthem together as lymphovascular invasion Harmonizing be-tween cancer types can also be challenging hence the use ofcancer-type-specific tables in PICan for certain clinicopathologi-cal metrics Additionally even where a data exchange specifica-tion exists like the one for TMA reporting [78] there may beallowance for custom fields posing an extra challenge for align-ing under a common vocabulary Attempts have been made tostandardize terminology in some disciplines The US NationalLibrary of Medicine maintains a collection of controlled vocabu-laries known as the Unified Medical Language System [79]which incorporates SNOMED CT MeSH and OMIM amongmany other ontologies Another compendium is caCOREincluding the Enterprise Vocabulary Services and Cancer Data

Integromics in cancer tissue biomarker research | 7

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

Standards Repository of common data elements [80 81]Additionally some groups have devised cancer-type-specificlists such as for breast [82] mesothelioma [83] and prostate[84] In the absence of clear community-supported data stand-ards institutions are advised to agree on and use a minimal setof data export formats

Unique challenges of multimodal integrativeanalysis

Collecting and storing diverse data types is only half the battlein integromics unlocking the value of these data will comefrom analysis and interpretation Through a connection withthe R statistical environment via statconn [49] PICan is capableof on-demand statistical analysis graphical output and reportgeneration Besides R and statconn other similar frameworksand interfaces include Shiny [85] RNET [86] Python (eg mat-plotlib [87]) and D3js [88] While PICan is strictly a research toolan integromic approach will undoubtedly be essential to futureclinical decision support systems placing integromics at thefront line of patient care delivery

Multimodal analysis is particularly important when con-sidering the future of digital pathology and tissue imagingConventional light and fluorescence imaging provides the basisfor most studies in this field However the future may providenew imaging technologies using light outside of the normal vis-ible spectrum that may be more informative towards illuminat-ing the underlying cancer biology and delivering sensitive andspecific biomarkers of clinical outcome Numerous studies areemerging that exploit technologies such as ultraviolet [89] in-frared [90] and Raman spectroscopy [91 92] together with abroad range of associated multispectral technologies to gain abetter insight into tissue and cell characterization and to visual-ize compounds and chemistry not accessible to conventionalmicroscopy Integrating these imaging modalities into a singleframework to compare compound and deduce clinical utilitynot in isolation but together with conventional imaging meth-ods is essential Similarly there is capacity to integrate otherforms of medical imaging relevant to cancer discovery includ-ing positron emission tomography magnetic resonance imag-ing computed tomography and X-ray This can be used tounderstand the relationship between in vivo clinical imagingand ex vivo tissue imaging thereby translating markers that arecurrently only visible on microscopy to early in vivo detectionwith clinical imaging modalities In both settings the real valueis likely to come from the overlay and integration of imagingmodalities extending the capacity of tissue imaging on its ownand providing new visualization tools for both pathologists andradiologists that are more informative and consequently im-prove clinical decision making The integromics initiative dis-cussed in this article provides such a model for imaging as wellas molecular and clinical data blurring the edges of medicalspecialties and providing a unified framework for discovery

The basic tools of integromic data analysis are generallysimilar to other disciplines of bioinformatics Mean compari-sons clustering survival analyses machine learning and linearmodels for example can all be used as can the various pipe-lines that have been developed for next-generation sequencinggene expression microarrays and other molecular big dataomics techniques Integration of digital image analysis data inintegromic analysis can present its own distinct risks and re-wards Images and their associated markups and annotationsdo not lend themselves on their own to t-tests or clustering for

example Nevertheless these may be useful to contextualizeother pieces of data like IHC scores Calculated scores (egH-score) and metrics (eg integrated optical density nucleararea) can often be expressed as nameattribute pairs so theylend themselves more immediately to integrative analysis al-though data structure and hierarchy can still complicate ana-lysis Some metrics can be measured at a cellular level andsummarized at an ROI or whole slide level Others howevermight not be meaningful at a cellular level (eg H-score) so onemust be mindful of the meaning of each measure when incor-porating it into a multivariate model of the data Tumour het-erogeneity is also an important complication Summary scorescomputed across different regions of the same tumour may dif-fer markedly Heterogeneity at least in solid tumours can bemore easily mapped in tissue sections where tissue structureand cellular context is maintained and intact Spatial changesin tumour cell arrangement can be visualized microscopicallyand used to sample regions for molecular analysis and themeasurement of intratumoural heterogeneity Whole slideimaging and digital image analysis represent powerful tools toquantitatively map tissue phenotype and provide a sound basisfor biomarker assay development while opening up new av-enues of inquiry For example one can begin to explore whethercertain morphometric phenotypes correlate with specific gen-etic or epigenetic profiles and whether these have any signifi-cant prognostic or therapeutic impact

A specific challenge for integromics is the application ofstatistical techniques to data from disparate sources Care mustbe taken to ensure differences observed between experimentalgroups reflect the underlying biology and not simply batch ef-fects Differences in experimental assay platforms such as dif-ferent probes targeting the same gene or different antibodies forIHC may produce discrepancies in the data based on differentlevels of sensitivity and specificity In digital pathology differ-ences in imaging hardware analysis algorithms and parameterdefinitions may lead to various imaging artefacts and interpret-ation difficulties Even within the same software cell countsdetermined from different threshold settings are not compar-able Less visible confounders include study and patient vari-ables like recruitment methodology study admission criteriapatient stratificationdiagnostic criteria pre-collection treat-ments and standard of care (for prognostic biomarker studies)that can affect the makeup of the study patient populationMeanwhile specimen handling variables like cold ischemictime fixation method staining protocols and formulations canhave a profound effect on the data often dominating other con-founders regardless of the subsequent pipeline Ideally thendata harmonization begins well before data are generated withstandardization and thorough documentation of all pre-analytical steps including patient selection tissue collectionand specimen handling Where potential batch effects or con-founding variables have been identified a number of strategiescan be used such as subset analysis (ie stratifying the subjectpopulation on the confounding variable) or regression model-ling [93 94]

Mandated public release of large data sets has been helpful inadvancing transparency in science but tracking the associatedmetadata and assessing its impact on results derived from thesebig data sets remains a major unresolved challenge Even largerepositories like TCGA are composed of smaller collections froma wide range of geographies and institutional environments Inmany cases pooling such diverse data sets becomes an apples-to-oranges comparison Batch effects in TCGA and other largehigh-throughput data sets are well-documented [52 53] Without

8 | Li et al

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

control over data origins and consistency it is essential that theentire process be rigorously validated before interpretation of re-sults What local data sets lack in scale they make up for in trace-ability For PICan we require that our collaborators are able tostand over every data point that enters our system therebyensuring a higher standard of quality and confidence in our dataMetadata recording technical and experimental details are storedin PICan alongside the data Regardless of how it is done meta-data tracking should be rigorous and thorough as differences inexperimental conditions and design must factor into the inter-pretation of any analysis Where appropriate data should be cor-rected and normalized to ensure comparisons are meaningfullike-versus-like comparisons and observed contrasts are notmerely batch or scalar effects

Interdisciplinary teamwork

Apart from robust infrastructure and software support the heartof any successful integromics framework is a cohesive and inter-disciplinary team to support and champion the entire effortDespite the technical complexities of a system like PICan team-work is arguably the most important and yet most often over-looked piece of the integromics puzzle This is fundamentally acollaborative endeavour bringing with it all the challenges ofworking with an interdisciplinary team of individuals and organ-izations each of whom have their own diverse set of goals andpriorities Clinical team members for example may have respon-sibilities to their patients that would not apply to academic re-searchers Nevertheless the specialist insights afforded by aninterdisciplinary team have been indispensable to the success ofPICan Clinicopathological data tables are built on templates cre-ated by clinical experts specializing in each cancer type Clinicaland translational researchers inform the choice of analysis toolsto be added which are then developed and tested by transla-tional bioinformaticians Establishing and incorporating an inte-gromics framework into the mindset of an institution will requireadditional effort and workload on certain individuals Hence it isimperative that team discussions include delegation and sharedresponsibility over matters such as data collection and curationdata formats and standards system maintenance data exchangeand access control Differing priorities and opinions betweenteam members are to some extent unavoidable but teams canadopt measures to manage conflicts such as having all protocolsclearly documented or maintaining a regular meeting scheduleRegardless of how these challenges are met we stress the im-portance of maintaining open lines of communication and agree-ing to clear roles and responsibilities of all team membersEffective teamwork is a non-trivial yet integral contributor to thesuccess of any integromics framework

With an ever-increasing emphasis on interdisciplinaryresearch projects like PICan underscore the need for cross-disciplinary education to instil young researchers with thenecessary mindset understanding and skills to navigate thechallenges and opportunities in the research environment ofthe future Queenrsquos University Belfast for example offers aMasters-level course in Bioinformatics and ComputationalGenomics [95] The course attracts students from a diverserange of backgrounds from medical doctors to engineers andcomputer scientists to biologists from a range of life sciencesubject areas and aims to equip them with a solid foundationacross a broad range of topics such as advanced bioinformaticsdatabase design tissue imaging and integromics

Interdisciplinary teamwork is vital to another pillar of anyintegromics framework access to quality data Any analysis is

only as robust as the data that underpins it The role of the NIBis crucial to PICanrsquos success not only through access to its bio-specimens but also its expertise and infrastructure support forsample preparation processing accessioning tracking andquality control The active participation of an interdisciplinaryteam also distributes the workload of data curation One of thegreat struggles of data collection is maximizing quality andquantity with a limited resource of labour With an increasingfocus among funders and institutions on large-scale multi-centre studies scaling up an integromics framework while pre-serving data quality will rely heavily on greater delegation ofdata curation to the data collectors themselves The onus willbe on the individual data collectors to flag data discrepanciesso that the database administrator serves as the last line of de-fence rather than the first

Ethics and data governance

Like any other system handling clinical and pathological infor-mation an integromics framework also requires appropriateapproval for the use of patient data Additionally data controlsin integromics need to protect both patients and data collectorsde-identification protects patient privacy while access controlsand terms of use policies protect the rights of data collectors

All studies undertaken in PICan need to be approved follow-ing an application to NIB where project-specific ethics and gov-ernance are granted under the Office for Research EthicsCommittees Northern Ireland approved NIB guidelines As a re-search tool PICan stores no personally identifiable informationin its database ensuring patient anonymization and minimiz-ing the potential impact of any data security breach Each pa-tient sample and associated analytical data are labelled using aunique independent study-specific identifier within the systemClinically qualified and authorized patient data collectors retainthe key that links these identifiers to hospital systems wherepersonal data are securely stored Researchers using PICan arerestricted to using anonymized research data onlymdashrespectingdata protection laws and regulations However as importantfollow-up data on patients accrues over time these can still beaccessed by having appropriately authorized clinical staff anddata collectors use the key to identify patients on hospital sys-tems and retrieve more up-to-date patient informationAuthorized clinical personnel can then review and retrieve per-tinent research-relevant information from the hospital systemsand update PICan again with only anonymized research datausing the PICan identifiers This is equivalent to an honest bro-ker system used elsewhere [96 97]

Data curation of de-identified information in PICan producesa robust privacy-controlled snapshot of analytical data but theunderlying data sources are themselves continually beingupdated as patients are followed up and new specimens are col-lected While data update is possible it is necessarily a manualprocess as PICan has been walled off from the original datasources by design allowing it to operate with fewer privacy con-trols than a typical clinical system as no identifiable patientdata are stored To maintain quality and consistency with exist-ing data data updates to PICan are done only periodically withentire cohorts of patients updated at once Alternatively insti-tutions that prioritize constantly updated data will requiredeeper integration between their research and clinical data sys-tems In such cases additional data protection measures will berequired such as encryption or safe havens for data storage andtransfer Essentially such an integrated system would need tomeet local rules and regulations for clinical data systems

Integromics in cancer tissue biomarker research | 9

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

Scaling up for large-scale epidemiologicalstudies

The last few decades have seen tremendous investment into allaspects of cancer biomarker discovery research but as of 2013only 18 protein cancer biomarkers had received US Food andDrug Administration (FDA) approval [98] despite decades of re-search and thousands of publications on novel candidate bio-markers Part of the reason may be a focus on early-stagediscovery activities and the relative lack of large-scale multi-centre validation and clinical trials Studies of such scale pre-sent a number of data management challenges not unlikethose we designed PICan to address First and foremost integro-mics is inherently a team activity and the same principles ofteamwork apply whether it is conducted within a single centreor across multiple research centres on different continents Inthis way integromics research extends readily to large-scalemulticentre epidemiological studies Data harmonization of fileformats and common vocabulary remain vitally important andbest addressed through open communication between collabor-ators As in the case of collaborators within an institution teammembers from different institutions should ideally agree on asingle data format for each data type and a common agreed setof vocabulary to describe data fields collected depending on theneeds of the study and its participants Analysis of multicentredata also requires standardization of pre-analytical parameterssuch as inclusion criteria sample acquisition sample process-ing assay conditions and measurements recorded When estab-lishing project standards and protocols teams should considerwhat community standards are already available to maximizeinteroperability and avoid redundancies Across Europe for ex-ample projects like ELIXIR [99] and Euro-BioImaging [100] are al-ready seeking to establish hubs for biomedical and bioimagingresource sharing Globally similar efforts in genomics are alsounderway [19ndash21]

The process of building digital infrastructure to support amulticentre integromics framework is similar to the case of asingle institution All team members should meet and agree onthe structure of the framework Depending on the extent of theproject and range of requirements the technical scientific andmedical team will need to grow as well Among the team thereis a need for at least one full-time staff member dedicated totechnical development and management of the digital infra-structure As the project grows the staffing requirement willundoubtedly increase in tandem For this reason modular de-sign becomes even more crucial as it allows development tasksto be carved up and delegated out to different team membersThere will also be additional requirements for hardware andsoftware licences Software costs can be mitigated by using free(eg R) or low-cost volume licenced (eg Microsoft Visual Studio(Redmond WA USA) for ASPNET in an academic setting) op-tions Nevertheless despite minor challenges digital solutionsgenerally scale well both in size and geographic distributionParallels can be drawn between incorporating external data re-sources into an integromics environment as previously dis-cussed and the challenges of connecting information systemsacross multiple sites However multicentre teams also facesome specific challenges Teams may need to comply with vari-ous sets of local laws and regulations surrounding researchethics data security and patient privacy Network access andsecurity may also be further complicated in a multicentre set-ting PICan for example was built for use within a single insti-tution so access was limited to within the institutionrsquos networkfirewall with all external access blocked A system spanning

multiple institutions would need greater attention to securitywhile maintaining access from outside the host institutionTeams may also need to contend with different software andhardware infrastructures at the various centres and implementthe necessary interfaces to ensure compatibility However theultimate key to success of multicentre collaborations is a funda-mental commitment to team science and on that front institu-tions that embrace integromics will have the experience to leadthe coming age of large-scale multicentre studies

Furthermore platforms like PICan and the integration ofimage analysis data with clinicopathological and therapeuticdata will help accelerate the screening of candidate tissue bio-markers for prognostic and predictive capacity While this rep-resents a powerful research and discovery tool this is likely toidentify new tissue biomarkers (based on IHC chromogenic insitu hybridization (CISH) fluorescence in situ hybridization(FISH) RNAscope etc) that have clear value in routine diagnos-tics and the stratification of patients for precision therapeuticsFurther validation of these biomarkers would be necessary incarefully controlled and standardized biomarker trials takinginto consideration sample preparation scanner configurationlab-to-lab variation comparative performance against patholo-gist scoring and clinical outcome data Regulatory approval ofdigital diagnostic tests would require FDA clearance (USA) orCE-IVD marking (Europe) if they were to be marketed as clinicaltests by industry but may also be implemented as laboratory-developed tests with documented internal validation and clin-ical evidence if done within an academic medical centre suchas our own setting Integromics platforms provide the frame-work for discovery and if designed appropriately can provide ro-bust clinical evidence of the utility of new biomarkers forsubsequent commercial licensing regulatory approval and clin-ical translation into routine diagnostictherapeutic practice

Conclusion

Modern biomedical research generates scientific and clinicaldata faster than the ability to fully interpret it while data fromnew and emerging technologies need to be integrated andharmonized with existing analysis workflows Efforts to resolvethe resulting bottleneck in interpretation to deliver on thepromises of precision medicine in this era of fast-track bio-marker discovery and evaluation have led to the birth of a newfield in bioinformatics integromics After introducing our ownframework for streamlining integromic analyses it becameclear that there is widespread interest in implementing integro-mic strategies in other research institutions In light of the pre-sent discussion we feel that an optimal integromics frameworkmust be well-supported by a dynamic and interdisciplinaryteam reflect the local needs and priorities of the institution andbe designed with an eye to the future The team is the gatewayto high-quality data and specialist insights that inform analysisand ultimately catalyse translation to clinical application insupport of precision medicine Meanwhile biomarker researchat our institution is currently centred on TMAs and we haveemphasized data quality and curation over raw throughputThese traits are imprinted in the design and operation of PICanbut other institutions may have other needs that demand a dif-ferent infrastructure Ongoing development of PICan includesincorporating raw data storage and access for completenessand traceability and strengthening support for digital imageanalysis As our institution adopts newer technologies PICanlikewise will continue to evolve facilitated by its modular de-sign to meet the changing needs of a busy cancer research

10 | Li et al

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

environment spanning numerous programmes technologiesand scientific objectives Hence integromics should be interwo-ven into the fabric of the institution and not simply be a down-loadable add-on

Key Points

bull Bold approaches are needed to maximize the potentialof big data in modern biomedical research and tobreak down the silos in which they are currently held

bull Collating storing transferring and integratively analy-sing vast amounts of heterogeneous multimodal datarequires a comprehensive and cohesive integromicsstrategy that streamlines biomarker discovery byexposing fresh insights into the underlying factorscontributing to disease phenotypes

bull We explore the practical considerations behind imple-menting an integromics framework using our experi-ences and design choices with PICan as a case studywhile recognizing that the global diversity of bio-marker discovery research programmes will requiretailoring solutions to each research environment

bull Successful integromics frameworks are supported bydynamic interdisciplinary teams and are inherentlyadaptable to new technologies and the host institu-tionrsquos evolving priorities

bull Integromics must be a central part of the institutionalculture and not just a downloadable add-on

Conflict of interest

Professor Peter Hamilton is the founder of and non-executivedirector with PathXL Ltd Professor Manuel Salto-Tellez isSenior Advisor to PathXL Ltd

Funding

The research leading to these results has received funding fromthe People Programme (Marie Curie Actions) of the EuropeanUnionrsquos Seventh Framework Programme FP72007ndash2013underREA grant agreement (285910 to PWH) Invest NorthernIreland (RDO0712612 to PWH) Cancer Research UK Accelerator(C11512A20256 to PWHMS-T) CRUK (to PB) The NorthernIreland Molecular Pathology Laboratory is supported by CancerResearch UK Experimental Cancer Medicine Centre Networkthe NI Health and Social Care Research and DevelopmentDivision the Sean Crummey Memorial Fund the Tom SimmsMemorial Fund and the Friends of the Cancer Centre (to MS-T) The Northern Ireland Biobank is funded by the Health andSocial Care Research and Development Division of the PublicHealth Agency in Northern Ireland and Cancer Research UKthrough the Belfast CRUK Centre and Northern IrelandExperimental Cancer Medicine Centre additional support wasreceived from the Friends of the Cancer Centre (to JAJ)

References1Canuel V Rance B Avillach P et al Translational research plat-

forms integrating clinical and omics data a review of publiclyavailable solutions Brief Bioinform 201516280ndash90

2 Noor AM Holmberg L Gillett C et al Big Data the challengefor small research groups in the era of cancer genomics Br JCancer 20151131405ndash12

3 PathXL Xplore httpwwwpathxlcompathxl-researchpathxl-xplore (20 November 2015 date last accessed)

4 Biomarker Interpretation Workflow - OncoMark httpwwwoncomarkcomgoproductsbiomarker-interpretation-workflow (23 November 2015 date last accessed)

5 Denaxas SC George J Herrett E et al Data resource profilecardiovascular disease research using linked bespoke studiesand electronic health records (CALIBER) Int J Epidemiol2012411625ndash38

6 Lubbock ALR Katz E Harrison DJ et al TMA Navigatornetwork inference patient stratification and survival ana-lysis with tissue microarray data Nucleic Acids Res 201341W562ndash8

7 Giardine B Riemer C Hardison RC et al Galaxy a platform forinteractive large-scale genome analysis Genome Res 2005151451ndash5

8 Goecks J Nekrutenko A Taylor J Galaxy a comprehensiveapproach for supporting accessible reproducible and trans-parent computational research in the life sciences GenomeBiol 201011R86

9 Madhavan S Gusev Y Harris M et al G-DOC a systems medi-cine platform for personalized oncology Neoplasia 201113771ndash83

10Dander A Baldauf M Sperk M et al Personalized OncologySuite integrating next-generation sequencing data andwhole-slide bioimages BMC Bioinformatics 201415306

11Kartashov AV Barski A BioWardrobe an integrated platformfor analysis of epigenomics and transcriptomics data GenomeBiol 201516158

12Harris PA Taylor R Thielke R et al Research electronic datacapture (REDCap)ndasha metadata-driven methodology andworkflow process for providing translational research in-formatics support J Biomed Inform 200942377ndash81

13Szalma S Koka V Khasanova T et al Effective knowledgemanagement in translational medicine J Transl Med201081ndash9

14Manley S Mucci NR De Marzo AM et al Relational databasestructure to manage high-density tissue microarray data andimages for pathology studies focusing on clinical outcomethe prostate specialized program of research excellencemodel Am J Pathol 2001159837ndash43

15Weinstein JN Integromic analysis of the NCI-60 cancer celllines Breast Dis 20041911ndash22

16Weinstein JN Pommier Y Transcriptomic analysis of theNCI-60 cancer cell lines C R Biol 2003326909ndash20

17Yu P Artz D Warner J Electronic health records (EHRs) sup-porting ASCOrsquos vision of cancer care Am Soc Clin Oncol EducBook 2014225ndash31

18McArt DG Blayney JK Boyle DP et al PICan an integromicsframework for dynamic cancer biomarker discovery MolOncol 201591234ndash40

19Hudson TJ Anderson W The International Cancer GenomeConsortium et al International network of cancer genomeprojects Nature 2010464993ndash8

20Home j Global Alliance for Genomics and Health httpsgenomicsandhealthorg (15 March 2016 date last accessed)

21The 1000 Genomes Project Consortium A global reference forhuman genetic variation Nature 201552668ndash74

22Collins FS Varmus H A New Initiative on Precision MedicineN Engl J Med 2015372793ndash5

23Kolesnikov N Hastings E Keays M et al ArrayExpress up-datendashsimplifying data submissions Nucleic Acids Res 201543D1113ndash6

Integromics in cancer tissue biomarker research | 11

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

24Leinonen R Akhtar R Birney E et al The European NucleotideArchive Nucleic Acids Res 201139D28ndash31

25The Cancer Genome Atlas Research NetworkComprehensive genomic characterization defines humanglioblastoma genes and core pathways Nature 20084551061ndash8

26Home - The Cancer Genome Atlas - Cancer Genome -TCGAhttpcancergenomenihgov (1 December 2015 datelast accessed)

27Kristensen VN Lingjaeligrde OC Russnes HG et al Principlesand methods of integrative genomic analyses in cancer NatRev Cancer 201414299ndash313

28Guinney J Dienstmann R Wang X et al The consensusmolecular subtypes of colorectal cancer Nat Med 2015211350ndash6

29Gao J Aksoy BA Dogrusoz U et al Integrative analysis ofcomplex cancer genomics and clinical profiles using thecBioPortal Sci Signal 20136pl1

30Cerami E Gao J Dogrusoz U et al The cBio cancer genomicsportal an open platform for exploring multidimensional can-cer genomics data Cancer Discov 20122401ndash4

31Forbes SA Beare D Gunasekaran P et al COSMIC exploringthe worldrsquos knowledge of somatic mutations in human can-cer Nucleic Acids Res 201543D805ndash11

32Broad GDAC Firehose httpgdacbroadinstituteorg (22March 2016 date last accessed)

33Le Cao K-A Gonzalez I Dejean S integrOmics an R packageto unravel relationships between two omics datasetsBioinformatics 2009252855ndash6

34Waller T Gubała T Sarapata K et al DNA microarray integro-mics analysis platform BioData Min 2015818

35Cecic IK Li G MacAulay C Technologies supporting analyt-ical cytology clinical research and drug discovery applica-tions J Biophotonics 20125313ndash26

36Rocha R Vassallo J Soares F et al Digital slides present sta-tus of a tool for consultation teaching and quality control inpathology Pathol Res Pract 2009205735ndash41

37Pantanowitz L Valenstein P Evans A et al Review of the cur-rent state of whole slide imaging in pathology J Pathol Inform2011236

38Brachtel E Yagi Y Digital imaging in pathologyndashcurrent ap-plications and challenges J Biophotonics 20125327ndash35

39Hamilton PW Bankhead P Wang Y et al Digital pathologyand image analysis in tissue biomarker research Methods20147059ndash73

40Schneider CA Rasband WS Eliceiri KW NIH Image to ImageJ25 years of image analysis Nat Methods 20129671ndash5

41Schindelin J Arganda-Carreras I Frise E et al Fiji - an OpenSource platform for biological image analysis Nat Methods20129676ndash82 101038nmeth2019

42de Chaumont F Dallongeville S Chenouard N et al Icy anopen bioimage informatics platform for extended reprodu-cible research Nat Methods 20129690ndash6

43Sommer C Straehle C Kothe U et al Ilastik interactive learn-ing and segmentation toolkit In Biomedical Imaging fromNano to Macro 2011 IEEE International Symposium on 2011pp 230ndash233 IEEE

44Hamilton PW van Diest PJ Williams R et al Do we see whatwe think we see The complexities of morphological assess-ment J Pathol 2009218285ndash91

45Salto-Tellez M James JA Hamilton PW Molecular path-ologymdashthe value of an integrative approach Mol Oncol201481163ndash8

46Li G van Niekerk D Miller D et al Molecular fixative enablesexpression microarray analysis of microdissected clinicalcervical specimens Exp Mol Pathol 201496168ndash77

47Galon J Mlecnik B Bindea G et al Towards the introductionof the lsquoImmunoscorersquo in the classification of malignant tu-mours J Pathol 2014232199ndash209

48R Core Team R A Language and Environment for StatisticalComputing Vienna Austria R Foundation for StatisticalComputing 2015

49Baier T Neuwirth E Excel COM R Comput Stat 20072291ndash10850McCarty KS Miller LS Cox EB et al Estrogen receptor ana-

lyses Correlation of biochemical and immunohistochemicalmethods using monoclonal antireceptor antibodies ArchPathol Lab Med 1985109716ndash21

51Goulding H Pinder S Cannon P et al A new immunohisto-chemical antibody for the assessment of estrogen receptorstatus on routine formalin-fixed tissue samples Hum Pathol199526291ndash4

52Leek JT Scharpf RB Bravo HC et al Tackling the widespreadand critical impact of batch effects in high-throughput dataNat Rev Genet 201011733ndash9

53Lauss M Visne I Kriegner A et al Monitoring of TechnicalVariation in Quantitative High-Throughput Datasets CancerInform 201312193ndash201

54Codd EF Normalized Data Base Structure a Brief Tutorial InProceedings of the 1971 ACM SIGFIDET (Now SIGMOD)Workshop on Data Description Access and Control SanDiego CA USA 1971 pp 1ndash17 Association for ComputingMachinery New York NY

55Kacprzyk J Zadro _zny S De Tre G Fuzziness in database man-agement systems half a century of developments and futureprospects Fuzzy Sets Syst 2015281300ndash7

56Bosc P Pivert O Farquhar K Integrating fuzzy queries intoan existing database management system an example Int JIntell Syst 19949475ndash92

57Duggal R Khatri SK Shukla B Improving patient matchingsingle patient view for Clinical Decision Support using BigData analytics In 2015 4th International Conference onReliability Infocom Technologies and Optimization (ICRITO)(Trends and Future Directions) Noida India 2015 pp 1ndash6 IEEENew York NY

58OpenSeadragon httpsopenseadragongithubio (7 April2016 date last accessed)

59Allred DC Bustamante MA Daniel CO et alImmunocytochemical analysis of estrogen receptors inhuman breast carcinomas evaluation of 130 cases and re-view of the literature regarding concordance with biochem-ical assay and clinical relevance Arch Surg 1990125107ndash13

60Singh R Chubb L Pantanowitz L et al Standardization in digi-tal pathology supplement 145 of the DICOM standardsJ Pathol Inform 2011223

61Linkert M Rueden CT Allan C et al Metadata matters accessto image data in the real world J Cell Biol 2010189777ndash82

62The HDF Group HDF Group - HDF5 httpswwwhdfgrouporgHDF5 (3 January 2016 date last accessed)

63Sommer C Held M Fischer B et al CellH5 a format for dataexchange in high-content screening Bioinformatics 2013291580ndash2

64Allan C Burel J-M Moore J et al OMERO flexible model-driven data management for experimental biology NatMethods 20129245ndash53

65Maree R Rollus L Stevens B et al Collaborative analysis ofmulti-gigapixel imaging data using Cytomine Bioinformatics2016321395ndash1401

12 | Li et al

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

66DigitalScope V5 httpwwwdigitalscopeorg (26 November2015 date last accessed)

67Astier Y Braha O Bayley H Toward single molecule DNAsequencing direct identification of ribonucleoside and deox-yribonucleoside 5lsquo-monophosphates by using an engineeredprotein nanopore equipped with a molecular adapter J AmChem Soc 20061281705ndash10

68Stoddart D Heron AJ Mikhailova E et al Single-nucleotidediscrimination in immobilized DNA oligonucleotides with abiological nanopore Proc Natl Acad Sci USA 20091067702ndash7

69Flusberg BA Webster D Lee J et al Direct detection of DNAmethylation during single-molecule real-time sequencingNat Methods 20107461ndash5

70Roberts RJ Carneiro MO Schatz MC The advantages of SMRTsequencing Genome Biol 201314405

71Sheth AP Larson JA Federated database systems for manag-ing distributed heterogeneous and autonomous databasesACM Comput Surv 199022183ndash236

72Baker EJ Biological databases for behavioral neurobiology IntRev Neurobiol 201210319ndash38

73Peuroaeuroakkonen P Pakkala D Reference architecture and classifi-cation of technologies products and services for big data sys-tems Big Data Res 20152166ndash86

74Haider S Ballester B Smedley D et al BioMart Central Portalndashunified access to biological data Nucleic Acids Res 200937W23ndash7

75Barrett T Wilhite SE Ledoux P et al NCBI GEO archive forfunctional genomics data setsndashupdate Nucleic Acids Res201341D991ndash5

76Goode A Gilbert B Harkes J et al OpenSlide a vendor-neutralsoftware foundation for digital pathology J Pathol Inform2013427

77Pearson WR Lipman DJ Improved tools for biological se-quence comparison Proc Natl Acad Sci USA 1988852444ndash8

78Berman JJ Edgerton ME Friedman BA The tissue microarraydata exchange specification a community-based opensource tool for sharing tissue microarray data BMC MedInform Decis Mak 200335

79Unified Medical Language System (UMLS) - Home httpswwwnlmnihgovresearchumls (20 November 2015 datelast accessed)

80Komatsoulis GA Warzel DB Hartel FW et al caCORE version3 implementation of a model driven service-oriented archi-tecture for semantic interoperability J Biomed Inform 200841106ndash23

81Covitz PA Hartel F Schaefer C et al caCORE a common infra-structure for cancer informatics Bioinformatics 200319 2404ndash12

82Papatheodorou I Crichton C Morris L et al A metadata ap-proach for clinical data management in translational gen-omics studies in breast cancer BMC Med Genomics 2009266

83Mohanty SK Mistry AT Amin W et al The development anddeployment of Common Data Elements for tissue banks fortranslational research in cancer - an emerging standardbased approach for the Mesothelioma Virtual Tissue BankBMC Cancer 2008891

84 Patel AA Kajdacsy-Balla A Berman JJ et al The developmentof common data elements for a multi-institute prostate can-cer tissue bank the Cooperative Prostate Cancer TissueResource (CPCTR) experience BMC Cancer 20055108

85 Shiny httpshinyrstudiocom (6 April 2016 date lastaccessed)

86 RNET documentation ndash user version j RNET ndash user versionhttpjmp75githubiordotnet (6 April 2016 date lastaccessed)

87 matplotlib python plotting mdash Matplotlib 151 documenta-tion httpmatplotliborg (6 April 2016 date last accessed)

88 D3js - Data-Driven Documents httpsd3jsorg (6 April2016 date last accessed)

89 Fereidouni F Datta-Mitra A Demos S et al Microscopy withUV Surface Excitation (MUSE) for slide-free histology andpathology imaging In Progress in Biomedical Optics andImaging - Proceedings of SPIE San Francisco CA USA 2015p 93180F SPIE Bellingham WA USA

90 Bird B Miljkovic M Romeo MJ et al Infrared micro-spectralimaging distinction of tissue types in axillary lymph nodehistology BMC Clin Pathol 200888

91 Taleb A Diamond J McGarvey JJ et al Raman Microscopy forthe Chemometric Analysis of Tumor Cells J Phys Chem B200611019625ndash31

92 Guze K Pawluk HC Short M et al Pilot study Raman spec-troscopy in differentiating premalignant and malignant orallesions from normal mucosa and benign lesions in humansHead Neck 201537511ndash7

93 McNamee R Regression modelling and other methods tocontrol confounding Occup Environ Med 200562500ndash6

94 dos Santos Silva I Dealing with confounding in the analysisIn Cancer Epidemiology Principles and Methods LyonFrance International Agency for Research on Cancer 1999305ndash31

95 Queenrsquos University Belfast j Postgraduate and ProfessionalDevelopment j MSc Bioinformatics and ComputationalGenomics httpwwwqubacukschoolsmdbspgdPTMScBioinformaticsandComputationalGenomics (24 March2016 date last accessed)

96 Dhir R Patel AA Winters S et al A multidisciplinary ap-proach to honest broker services for tissue banks and clin-ical data a pragmatic and practical model Cancer 20081131705ndash15

97 Boyd AD Hosner C Hunscher DA et al An lsquoHonest Brokerrsquomechanism to maintain privacy for patient care and aca-demic medical research Int J Med Inform 200776407ndash11

98 Pavlou MP Diamandis EP Blasutig IM The long journey ofcancer biomarkers from the bench to the clinic Clin Chem201359147ndash57

99 ELIXIR Data for life httpswwwelixir-europeorg (24March 2016 date last accessed)

100 Research Infrastructure for Imaging Technologies inBiological and Biomedical Sciences j Euro-BioImaginghttpwwweurobioimagingeu (24 March 2016 date lastaccessed)

Integromics in cancer tissue biomarker research | 13

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

Page 6: Embracing an integromic approach to tissue biomarker ... › download › pdf › 33595853.pdf · Her research uses integrative molecular approaches for biomarker discovery and validation

removing duplication of data ensuring that the table is resilientagainst inconsistencies arising from incomplete addition dele-tion or modification of data As described in [54] this isachieved by identifying groups of related or dependent datafields and splitting them off into a separate table using foreignkeys to record relationships between tables For most practicalapplications the third normal form balances a robust architec-ture without overly cluttering the schema with an untenablenumber of tables Normalization approaches can also be used tooptimize database speed and performance

A consequence of our decision to use a relational database isthat free-form data needs to be harmonized with the data struc-ture presented by PICan before data entry The requisite cur-ation steps prioritize data integrity and prevent poor-qualitydata that might impede proper and meaningful analysis fromentering the database (garbage-in-garbage-out) Clinical andpathological data collection and storage are guided by tem-plates collections of well-defined data fields that have been de-veloped through our close working relationships with the NIBNorthern Ireland Cancer Registry Northern Ireland Health andSocial Care Trusts and clinicians These collaborators share thedesire to drive data integration and provide expertise contextand data access to harmonize data templates around commonvocabulary and permissible data values ensuring maximal clin-ical relevance of PICan In addition to the harmonized common

fields like age and sex PICan also supports cancer-type-specificfields for maximum flexibility

Non-relational database structures

While a structured relational database has served PICan wellthere are alternative models that might better suit other re-search integromics environments MongoDB for example is anon-relational (or lsquoNoSQLrsquo) database architecture that uses adocument model where each entity (a tissue block for ex-ample) is represented by a lsquodocumentrsquo that might contain hun-dreds of attributes as key-value pairs This model works with aflexible schema allowing data fields to be added or omitted ona case-by-case basis while harmonization of terminology anddata fields occurs at data retrieval In a research environmentespecially at the beginning of a study when pipelines and meta-data are likely to change it is easier to update the data modelthan to redesign the fixed schema of a relational databaseHowever with less urgency to curate data before upload dataintegrity may be variable with errors remaining unnoticed andconfounding analyses performed on the data Fuzzy matchingalgorithms [55ndash57] can mitigate the impact of typographical andsome clerical errors but these approaches must still be rigor-ously validated In some cases it might be impossible even for ahuman expert to deduce the original intention Data must then

Figure 2 Schematic of the internal table structure of PICanrsquos relational database reflecting the core activities and workflows that comprise the biomarker discovery

and validation program at our institution Most tables represent physical or digital entities within these workflows Where table names may be unclear a brief descrip-

tion has been included along with some example data fields (in parentheses)

Integromics in cancer tissue biomarker research | 5

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

be discarded from analysis potentially reducing the statisticalpower Manual curation at the outset might identify such caseswhen there is still an opportunity to seek clarification from theoriginal data collector Ideally ambiguous annotations and bor-derline cases should be resolved in consultation with a suitablyqualified medical professional before they enter the databaserather than leaving them open to interpretation by downstreaminformaticians Likewise missing or contradictory data pointsare preferably identified and where possible corrected beforeentry into the database However a major advantage of an un-structured database model is the ability to more faithfully cap-ture the complexities of the patientrsquos clinical pathological andother data parameters including free-form text fields such asclinical notes The ability to capture patient data in its nativeform also improves the traceability and scalability of data entryinto a non-structured database solution by eliminating the la-bour and documentation associated with data curation andenabling automated bulk data upload Despite the tools wehave developed to semi-automate data upload this will remaina relative drawback of manual curation in the short term espe-cially for clinical and pathological data fields In the longerterm scalability may be improved with increased adoption ofelectronic medical records wherein many of these data fieldscould be completed by the attending physician when encoun-tering the patient

Balancing the data curation afforded by the structured rela-tional database with the high-throughput flexibility of unstruc-tured models are various hybrid options For example PICanoffers flexibility to the structured database model by includingfreeform text fields to capture unstructured notes (in our experi-ence these are typically for clinical notes) in several of the datatables As these fields are not referenced directly by the auto-mated online analysis component which requires well-definedparameters they are hidden from analysis but can be retrievedby users wishing to conduct further data refinement or offlineanalysis In practice however it is more likely that any fieldsnot used in analysis by the system will simply be ignored andneglected Another hybrid solution we are actively exploring toincrease data throughput is to allow the importation of non-curated data sets for end-user analysis In this model curatedlsquopublicrsquo data sets would be available to all users (with necessaryapprovals and access restrictions in place) while each end userwill be able to upload lsquoprivatersquo data sets to be analysed along-side or integrated with the curated data The user then is re-sponsible for ensuring the quality of hisher own dataExtending the hybrid model concept further are heterogeneousdatabases in which only the most important unchanging andsensitive data are typically stored in a relational databaseMeanwhile the bulk of the data in terms of size will be storedin non-relational databases or as raw data with links stored inthe main database to ensure ease of access and retrieval

Image analysis data storage and management

Current work is focussing on improving the support of digitalimage analysis data an emerging area of research in which ourgroup has significant expertise Imaging data presents someunique challenges for database storage design The imagesthemselves can be large One of our typical tissue slidesscanned at 40objective magnification (025 lm per pixel) canhave 2ndash10 billion pixels Imaging in RGB at 8 bits per channelgives uncompressed file sizes of up to 30 GB Additional colourchannels (such as for fluorescence imaging) z-stacks and dataredundancy in the form of image pyramids for improved image

loading performance would all greatly increase file sizesCompression typically JPEG or JPEG2000 for whole slide imagescan significantly reduce file size without drastically impairingvisual assessment although suitability for image analysis ismore variable [39] Hence it is vital to store the original slideimage to allow a pathologist to go back and verify that imageanalysis results are consistent with hisher expectations This iscurrently achieved by hosting the slide images on remote ser-vers maintained by PathXL A link is stored in the PICan data-base that allows the web-based interface to retrieve the imagedata on user request via an application programming interface(API) supplied by PathXL Pan zoom and overlay functionalitiesin the PICan web interface are powered by OpenSeadragon [58]As the samples are drawn from NIB collections NIB maintainscontrol of access and security using the PathXL software

In addition to the raw image pixels image analysis datamight include markups annotations and calculated metrics(such as integrated optical densities nuclear areas or immuno-histochemical H-scores [50 51]) Metrics can be at a cellularlevel or summarized across a region of interest (ROI) a slide oreven a set of slides Much of these data may be hierarchicallystructured with each slide potentially containing multiple ROIseach ROI containing many cells and each cell being associatedwith values for numerous metrics However the efficiency ofstoring image analysis data in a structured database is uncer-tain when a single image can contain thousands of annotationsFor now PICan stores only summary scores and metrics (eg H-score [50 51] Allred Score [59] percentage positive cells) whileusers wishing to interrogate the image data on a more granularlevel will need to rely on external specialist tools The challengeof storing and transferring image analysis data between sys-tems is an ongoing one with several potential approaches hav-ing been proposed including DICOM [60] OME-TIFF [61] variousHDF5-based formats [62 63] and proprietary vendor formatsbut without a clear front-runner DICOM and OME-TIFF are pri-marily image formats with flexible metadata support optimizedfor image viewing rather than storage of image analysis resultswith large numbers of image objects In contrast HDF5 capturesobject hierarchy relationships well but on its own it is not a na-tive image format Meanwhile proprietary formats are poten-tially encumbered by patenting and intellectual propertyconcerns

Integromics data management is patient-centric

Regardless of the database management system being used acritical characteristic of integromics analysis is patient-centricity The ultimate objective in bringing together data frommultiple experimental platforms is to build a more holistic pic-ture of the interplay between genotype and phenotype in thepatients under study As interest in amalgamating multipledata types has grown many platforms originally designed tohost one data type have been expanded to accommodate add-itional types of data Many biobank management software sys-tems allow addition of clinicopathological notes andexperimental assay results on top of sample tracking dataMeanwhile digital slide hosting platforms like OMERO [64]PathXL Xplore Cytomine [65] and DigitalScope (Aptia SystemsHouston TX used by the College of American Pathologists) [66]are beginning to support the addition of associated clinicopa-thological data as metadata Others like TCGA [25 26] origi-nated as molecular data repositories but have begun addingwhole slide imaging There is a clear trend towards convergenceof database models encompassing more and more data types

6 | Li et al

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

but their impact will be limited if these extra data are simplystored and forgotten Integromics is an active process not just ofbringing data together but more importantly of creative com-parisons and contrasts to expose underlying interactions under-pinning the cancer phenotype Many public repositories and bigdatabases such as TCGA draw from an assortment of patientcohorts so associations between different data types measuredon different cohorts are limited to cohort summaries poten-tially obscuring trends between data types that are only appar-ent at the individual patient level

Interacting with external resourcesmdashdataexchange between systems

With the current rapid expansion of big data analytics in bio-medicine integromics frameworks must remain flexibleenough to incorporate emerging and future technologiesStructured architecture and design of the overall integromicsdatabase framework can facilitate this However large data setslike next-generation sequencing (and upcoming third-gener-ation sequencing [67ndash70]) can be unwieldy in a MySQL databasewith typical whole exome or RNA sequencing runs comprising10ndash100 billion base pairs for a single sample so future integro-mics systems may house these types of data in databases thatare structured specifically for this purpose Instead of a singledatabase housing multiple data types a future integromicsdatabase will undoubtedly be structured as a fully distributedfederated database [71ndash74] with a central database storing iden-tifiers that interface with a cluster of highly specialized data-bases each housing a specific type of data in its native formatArguably PICan already uses a version of this model by out-sourcing digital slide image storage to remote servers Futuredevelopments may include enabling PICan to tap into publiclyavailable data sources and allowing users to analyse these pub-lic data sets in tandem with PICan data such as accessing GeneExpression Omnibus data through GEO2R [75]

One of the challenges when working with a federated data-base model is how to exchange data faithfully and efficientlybetween the various components Each constituent databasewill need an interface to interact seamlessly with the centraldatabase and control system Some of these will be availableoff-the-shelf or rely on published APIs while others such asthose interacting with in-house software may need significantdeveloper time and effort to implement Ultimately there areno shortcuts to establishing these interfaces and so institutionsshould aim to reduce the number of distinct interfaces bystandardizing data around a small set of formats For examplewhen working with whole slide images a single institutioncould strive to use only a single file format The various avail-able formats each have their own strengths and weaknessesbut the natural option will likely be the native format of thescanner used at the institution or a vendor-neutral format likeDICOM [60] or OME-TIFF [61] Fortunately many tools for work-ing with whole slide images support multiple file formats suchas DigitalScope [66] for remote web-based viewing of variousimaging formats Meanwhile projects like Bio-Formats (OpenMicroscopy Environment consortium) [61] and OpenSlide [76]further facilitate the viewing and exploitation of various imag-ing formats in other programming and analysis languages likeC Cthornthorn Python and Java Additionally it is important to stand-ardize the data transfer protocol whether this is throughInternet standards like hypertext transfer protocol physicalmeans like hard disks or some other agreed protocol Perhaps

establishing an institutional integromics strategy will encour-age adoption of standardized data formats by creating an ex-pectation among data generators that data should be portableand compatible with other data types for analysis

Data flow within an integromics environment can often bemultidirectional Even with only a single central data repositorythere will be instances when it is beneficial to export some ofthese data to a separate third-party platform whether for speci-alized analysis report generation or some other purpose AsPICanrsquos web-based user interface is designed to be user friendlyit supports only a limited number of types of analyses and re-port generation options Analyses not hard-coded into the inter-face for example would need to be performed lsquoofflinersquo In factthis is how new analyses are tested before being incorporatedinto PICan Similar to data import and storage data export froman integromics system is complicated by the wide range of datatypes that demand different formats to accommodate their in-herently different natures Efforts have been made in somefields to standardize and integrate data models such as thoseof the Global Alliance for Genomics and Health for genomicdata [20] Hence where widely recognized data exchange for-mats exist such as FASTAFASTQ files for nucleotide or aminoacid sequences [77] or the TMA data exchange standard [78]they should be prioritized for integromics support For otherdata types like digital image analysis however there remains aneed for a well-defined and widely accepted standard for dataexchange Whether it is possible or feasible to define a singleunifying integromic data exchange format that would encapsu-late multiple data types attached to individuals within a cohortof patient subjects remains an open question

Standardization of data vocabulary

Any data exchange standard needs to address both how dataare encoded (structure) as well as a common vocabulary forcommonly encountered fields CSV JSON XML and derivativesof these are popular formats that lend themselves well to struc-tured data Certainly CSV export is our default as although itdoes not fully represent the relational structure underpinningPICanrsquos database it mimics the basic tabular structure and canbe easily viewed by other researchers in common spreadsheetor statistical packages Non-tabular data like digital slide images(which are stored as links in the database) are exported as sep-arate files The greater challenge might be agreeing on a com-mon vocabulary Data field names need to be mapped fromPICanrsquos internal vocabulary to that of the target external plat-form This is not a trivial task as even professionals within thesame institution may use different names for the same metricSome clinicians may be more precise in how certain metrics aremeasured and recorded such as when some distinguish be-tween lymphatic and vascular invasion while others groupthem together as lymphovascular invasion Harmonizing be-tween cancer types can also be challenging hence the use ofcancer-type-specific tables in PICan for certain clinicopathologi-cal metrics Additionally even where a data exchange specifica-tion exists like the one for TMA reporting [78] there may beallowance for custom fields posing an extra challenge for align-ing under a common vocabulary Attempts have been made tostandardize terminology in some disciplines The US NationalLibrary of Medicine maintains a collection of controlled vocabu-laries known as the Unified Medical Language System [79]which incorporates SNOMED CT MeSH and OMIM amongmany other ontologies Another compendium is caCOREincluding the Enterprise Vocabulary Services and Cancer Data

Integromics in cancer tissue biomarker research | 7

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

Standards Repository of common data elements [80 81]Additionally some groups have devised cancer-type-specificlists such as for breast [82] mesothelioma [83] and prostate[84] In the absence of clear community-supported data stand-ards institutions are advised to agree on and use a minimal setof data export formats

Unique challenges of multimodal integrativeanalysis

Collecting and storing diverse data types is only half the battlein integromics unlocking the value of these data will comefrom analysis and interpretation Through a connection withthe R statistical environment via statconn [49] PICan is capableof on-demand statistical analysis graphical output and reportgeneration Besides R and statconn other similar frameworksand interfaces include Shiny [85] RNET [86] Python (eg mat-plotlib [87]) and D3js [88] While PICan is strictly a research toolan integromic approach will undoubtedly be essential to futureclinical decision support systems placing integromics at thefront line of patient care delivery

Multimodal analysis is particularly important when con-sidering the future of digital pathology and tissue imagingConventional light and fluorescence imaging provides the basisfor most studies in this field However the future may providenew imaging technologies using light outside of the normal vis-ible spectrum that may be more informative towards illuminat-ing the underlying cancer biology and delivering sensitive andspecific biomarkers of clinical outcome Numerous studies areemerging that exploit technologies such as ultraviolet [89] in-frared [90] and Raman spectroscopy [91 92] together with abroad range of associated multispectral technologies to gain abetter insight into tissue and cell characterization and to visual-ize compounds and chemistry not accessible to conventionalmicroscopy Integrating these imaging modalities into a singleframework to compare compound and deduce clinical utilitynot in isolation but together with conventional imaging meth-ods is essential Similarly there is capacity to integrate otherforms of medical imaging relevant to cancer discovery includ-ing positron emission tomography magnetic resonance imag-ing computed tomography and X-ray This can be used tounderstand the relationship between in vivo clinical imagingand ex vivo tissue imaging thereby translating markers that arecurrently only visible on microscopy to early in vivo detectionwith clinical imaging modalities In both settings the real valueis likely to come from the overlay and integration of imagingmodalities extending the capacity of tissue imaging on its ownand providing new visualization tools for both pathologists andradiologists that are more informative and consequently im-prove clinical decision making The integromics initiative dis-cussed in this article provides such a model for imaging as wellas molecular and clinical data blurring the edges of medicalspecialties and providing a unified framework for discovery

The basic tools of integromic data analysis are generallysimilar to other disciplines of bioinformatics Mean compari-sons clustering survival analyses machine learning and linearmodels for example can all be used as can the various pipe-lines that have been developed for next-generation sequencinggene expression microarrays and other molecular big dataomics techniques Integration of digital image analysis data inintegromic analysis can present its own distinct risks and re-wards Images and their associated markups and annotationsdo not lend themselves on their own to t-tests or clustering for

example Nevertheless these may be useful to contextualizeother pieces of data like IHC scores Calculated scores (egH-score) and metrics (eg integrated optical density nucleararea) can often be expressed as nameattribute pairs so theylend themselves more immediately to integrative analysis al-though data structure and hierarchy can still complicate ana-lysis Some metrics can be measured at a cellular level andsummarized at an ROI or whole slide level Others howevermight not be meaningful at a cellular level (eg H-score) so onemust be mindful of the meaning of each measure when incor-porating it into a multivariate model of the data Tumour het-erogeneity is also an important complication Summary scorescomputed across different regions of the same tumour may dif-fer markedly Heterogeneity at least in solid tumours can bemore easily mapped in tissue sections where tissue structureand cellular context is maintained and intact Spatial changesin tumour cell arrangement can be visualized microscopicallyand used to sample regions for molecular analysis and themeasurement of intratumoural heterogeneity Whole slideimaging and digital image analysis represent powerful tools toquantitatively map tissue phenotype and provide a sound basisfor biomarker assay development while opening up new av-enues of inquiry For example one can begin to explore whethercertain morphometric phenotypes correlate with specific gen-etic or epigenetic profiles and whether these have any signifi-cant prognostic or therapeutic impact

A specific challenge for integromics is the application ofstatistical techniques to data from disparate sources Care mustbe taken to ensure differences observed between experimentalgroups reflect the underlying biology and not simply batch ef-fects Differences in experimental assay platforms such as dif-ferent probes targeting the same gene or different antibodies forIHC may produce discrepancies in the data based on differentlevels of sensitivity and specificity In digital pathology differ-ences in imaging hardware analysis algorithms and parameterdefinitions may lead to various imaging artefacts and interpret-ation difficulties Even within the same software cell countsdetermined from different threshold settings are not compar-able Less visible confounders include study and patient vari-ables like recruitment methodology study admission criteriapatient stratificationdiagnostic criteria pre-collection treat-ments and standard of care (for prognostic biomarker studies)that can affect the makeup of the study patient populationMeanwhile specimen handling variables like cold ischemictime fixation method staining protocols and formulations canhave a profound effect on the data often dominating other con-founders regardless of the subsequent pipeline Ideally thendata harmonization begins well before data are generated withstandardization and thorough documentation of all pre-analytical steps including patient selection tissue collectionand specimen handling Where potential batch effects or con-founding variables have been identified a number of strategiescan be used such as subset analysis (ie stratifying the subjectpopulation on the confounding variable) or regression model-ling [93 94]

Mandated public release of large data sets has been helpful inadvancing transparency in science but tracking the associatedmetadata and assessing its impact on results derived from thesebig data sets remains a major unresolved challenge Even largerepositories like TCGA are composed of smaller collections froma wide range of geographies and institutional environments Inmany cases pooling such diverse data sets becomes an apples-to-oranges comparison Batch effects in TCGA and other largehigh-throughput data sets are well-documented [52 53] Without

8 | Li et al

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

control over data origins and consistency it is essential that theentire process be rigorously validated before interpretation of re-sults What local data sets lack in scale they make up for in trace-ability For PICan we require that our collaborators are able tostand over every data point that enters our system therebyensuring a higher standard of quality and confidence in our dataMetadata recording technical and experimental details are storedin PICan alongside the data Regardless of how it is done meta-data tracking should be rigorous and thorough as differences inexperimental conditions and design must factor into the inter-pretation of any analysis Where appropriate data should be cor-rected and normalized to ensure comparisons are meaningfullike-versus-like comparisons and observed contrasts are notmerely batch or scalar effects

Interdisciplinary teamwork

Apart from robust infrastructure and software support the heartof any successful integromics framework is a cohesive and inter-disciplinary team to support and champion the entire effortDespite the technical complexities of a system like PICan team-work is arguably the most important and yet most often over-looked piece of the integromics puzzle This is fundamentally acollaborative endeavour bringing with it all the challenges ofworking with an interdisciplinary team of individuals and organ-izations each of whom have their own diverse set of goals andpriorities Clinical team members for example may have respon-sibilities to their patients that would not apply to academic re-searchers Nevertheless the specialist insights afforded by aninterdisciplinary team have been indispensable to the success ofPICan Clinicopathological data tables are built on templates cre-ated by clinical experts specializing in each cancer type Clinicaland translational researchers inform the choice of analysis toolsto be added which are then developed and tested by transla-tional bioinformaticians Establishing and incorporating an inte-gromics framework into the mindset of an institution will requireadditional effort and workload on certain individuals Hence it isimperative that team discussions include delegation and sharedresponsibility over matters such as data collection and curationdata formats and standards system maintenance data exchangeand access control Differing priorities and opinions betweenteam members are to some extent unavoidable but teams canadopt measures to manage conflicts such as having all protocolsclearly documented or maintaining a regular meeting scheduleRegardless of how these challenges are met we stress the im-portance of maintaining open lines of communication and agree-ing to clear roles and responsibilities of all team membersEffective teamwork is a non-trivial yet integral contributor to thesuccess of any integromics framework

With an ever-increasing emphasis on interdisciplinaryresearch projects like PICan underscore the need for cross-disciplinary education to instil young researchers with thenecessary mindset understanding and skills to navigate thechallenges and opportunities in the research environment ofthe future Queenrsquos University Belfast for example offers aMasters-level course in Bioinformatics and ComputationalGenomics [95] The course attracts students from a diverserange of backgrounds from medical doctors to engineers andcomputer scientists to biologists from a range of life sciencesubject areas and aims to equip them with a solid foundationacross a broad range of topics such as advanced bioinformaticsdatabase design tissue imaging and integromics

Interdisciplinary teamwork is vital to another pillar of anyintegromics framework access to quality data Any analysis is

only as robust as the data that underpins it The role of the NIBis crucial to PICanrsquos success not only through access to its bio-specimens but also its expertise and infrastructure support forsample preparation processing accessioning tracking andquality control The active participation of an interdisciplinaryteam also distributes the workload of data curation One of thegreat struggles of data collection is maximizing quality andquantity with a limited resource of labour With an increasingfocus among funders and institutions on large-scale multi-centre studies scaling up an integromics framework while pre-serving data quality will rely heavily on greater delegation ofdata curation to the data collectors themselves The onus willbe on the individual data collectors to flag data discrepanciesso that the database administrator serves as the last line of de-fence rather than the first

Ethics and data governance

Like any other system handling clinical and pathological infor-mation an integromics framework also requires appropriateapproval for the use of patient data Additionally data controlsin integromics need to protect both patients and data collectorsde-identification protects patient privacy while access controlsand terms of use policies protect the rights of data collectors

All studies undertaken in PICan need to be approved follow-ing an application to NIB where project-specific ethics and gov-ernance are granted under the Office for Research EthicsCommittees Northern Ireland approved NIB guidelines As a re-search tool PICan stores no personally identifiable informationin its database ensuring patient anonymization and minimiz-ing the potential impact of any data security breach Each pa-tient sample and associated analytical data are labelled using aunique independent study-specific identifier within the systemClinically qualified and authorized patient data collectors retainthe key that links these identifiers to hospital systems wherepersonal data are securely stored Researchers using PICan arerestricted to using anonymized research data onlymdashrespectingdata protection laws and regulations However as importantfollow-up data on patients accrues over time these can still beaccessed by having appropriately authorized clinical staff anddata collectors use the key to identify patients on hospital sys-tems and retrieve more up-to-date patient informationAuthorized clinical personnel can then review and retrieve per-tinent research-relevant information from the hospital systemsand update PICan again with only anonymized research datausing the PICan identifiers This is equivalent to an honest bro-ker system used elsewhere [96 97]

Data curation of de-identified information in PICan producesa robust privacy-controlled snapshot of analytical data but theunderlying data sources are themselves continually beingupdated as patients are followed up and new specimens are col-lected While data update is possible it is necessarily a manualprocess as PICan has been walled off from the original datasources by design allowing it to operate with fewer privacy con-trols than a typical clinical system as no identifiable patientdata are stored To maintain quality and consistency with exist-ing data data updates to PICan are done only periodically withentire cohorts of patients updated at once Alternatively insti-tutions that prioritize constantly updated data will requiredeeper integration between their research and clinical data sys-tems In such cases additional data protection measures will berequired such as encryption or safe havens for data storage andtransfer Essentially such an integrated system would need tomeet local rules and regulations for clinical data systems

Integromics in cancer tissue biomarker research | 9

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

Scaling up for large-scale epidemiologicalstudies

The last few decades have seen tremendous investment into allaspects of cancer biomarker discovery research but as of 2013only 18 protein cancer biomarkers had received US Food andDrug Administration (FDA) approval [98] despite decades of re-search and thousands of publications on novel candidate bio-markers Part of the reason may be a focus on early-stagediscovery activities and the relative lack of large-scale multi-centre validation and clinical trials Studies of such scale pre-sent a number of data management challenges not unlikethose we designed PICan to address First and foremost integro-mics is inherently a team activity and the same principles ofteamwork apply whether it is conducted within a single centreor across multiple research centres on different continents Inthis way integromics research extends readily to large-scalemulticentre epidemiological studies Data harmonization of fileformats and common vocabulary remain vitally important andbest addressed through open communication between collabor-ators As in the case of collaborators within an institution teammembers from different institutions should ideally agree on asingle data format for each data type and a common agreed setof vocabulary to describe data fields collected depending on theneeds of the study and its participants Analysis of multicentredata also requires standardization of pre-analytical parameterssuch as inclusion criteria sample acquisition sample process-ing assay conditions and measurements recorded When estab-lishing project standards and protocols teams should considerwhat community standards are already available to maximizeinteroperability and avoid redundancies Across Europe for ex-ample projects like ELIXIR [99] and Euro-BioImaging [100] are al-ready seeking to establish hubs for biomedical and bioimagingresource sharing Globally similar efforts in genomics are alsounderway [19ndash21]

The process of building digital infrastructure to support amulticentre integromics framework is similar to the case of asingle institution All team members should meet and agree onthe structure of the framework Depending on the extent of theproject and range of requirements the technical scientific andmedical team will need to grow as well Among the team thereis a need for at least one full-time staff member dedicated totechnical development and management of the digital infra-structure As the project grows the staffing requirement willundoubtedly increase in tandem For this reason modular de-sign becomes even more crucial as it allows development tasksto be carved up and delegated out to different team membersThere will also be additional requirements for hardware andsoftware licences Software costs can be mitigated by using free(eg R) or low-cost volume licenced (eg Microsoft Visual Studio(Redmond WA USA) for ASPNET in an academic setting) op-tions Nevertheless despite minor challenges digital solutionsgenerally scale well both in size and geographic distributionParallels can be drawn between incorporating external data re-sources into an integromics environment as previously dis-cussed and the challenges of connecting information systemsacross multiple sites However multicentre teams also facesome specific challenges Teams may need to comply with vari-ous sets of local laws and regulations surrounding researchethics data security and patient privacy Network access andsecurity may also be further complicated in a multicentre set-ting PICan for example was built for use within a single insti-tution so access was limited to within the institutionrsquos networkfirewall with all external access blocked A system spanning

multiple institutions would need greater attention to securitywhile maintaining access from outside the host institutionTeams may also need to contend with different software andhardware infrastructures at the various centres and implementthe necessary interfaces to ensure compatibility However theultimate key to success of multicentre collaborations is a funda-mental commitment to team science and on that front institu-tions that embrace integromics will have the experience to leadthe coming age of large-scale multicentre studies

Furthermore platforms like PICan and the integration ofimage analysis data with clinicopathological and therapeuticdata will help accelerate the screening of candidate tissue bio-markers for prognostic and predictive capacity While this rep-resents a powerful research and discovery tool this is likely toidentify new tissue biomarkers (based on IHC chromogenic insitu hybridization (CISH) fluorescence in situ hybridization(FISH) RNAscope etc) that have clear value in routine diagnos-tics and the stratification of patients for precision therapeuticsFurther validation of these biomarkers would be necessary incarefully controlled and standardized biomarker trials takinginto consideration sample preparation scanner configurationlab-to-lab variation comparative performance against patholo-gist scoring and clinical outcome data Regulatory approval ofdigital diagnostic tests would require FDA clearance (USA) orCE-IVD marking (Europe) if they were to be marketed as clinicaltests by industry but may also be implemented as laboratory-developed tests with documented internal validation and clin-ical evidence if done within an academic medical centre suchas our own setting Integromics platforms provide the frame-work for discovery and if designed appropriately can provide ro-bust clinical evidence of the utility of new biomarkers forsubsequent commercial licensing regulatory approval and clin-ical translation into routine diagnostictherapeutic practice

Conclusion

Modern biomedical research generates scientific and clinicaldata faster than the ability to fully interpret it while data fromnew and emerging technologies need to be integrated andharmonized with existing analysis workflows Efforts to resolvethe resulting bottleneck in interpretation to deliver on thepromises of precision medicine in this era of fast-track bio-marker discovery and evaluation have led to the birth of a newfield in bioinformatics integromics After introducing our ownframework for streamlining integromic analyses it becameclear that there is widespread interest in implementing integro-mic strategies in other research institutions In light of the pre-sent discussion we feel that an optimal integromics frameworkmust be well-supported by a dynamic and interdisciplinaryteam reflect the local needs and priorities of the institution andbe designed with an eye to the future The team is the gatewayto high-quality data and specialist insights that inform analysisand ultimately catalyse translation to clinical application insupport of precision medicine Meanwhile biomarker researchat our institution is currently centred on TMAs and we haveemphasized data quality and curation over raw throughputThese traits are imprinted in the design and operation of PICanbut other institutions may have other needs that demand a dif-ferent infrastructure Ongoing development of PICan includesincorporating raw data storage and access for completenessand traceability and strengthening support for digital imageanalysis As our institution adopts newer technologies PICanlikewise will continue to evolve facilitated by its modular de-sign to meet the changing needs of a busy cancer research

10 | Li et al

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

environment spanning numerous programmes technologiesand scientific objectives Hence integromics should be interwo-ven into the fabric of the institution and not simply be a down-loadable add-on

Key Points

bull Bold approaches are needed to maximize the potentialof big data in modern biomedical research and tobreak down the silos in which they are currently held

bull Collating storing transferring and integratively analy-sing vast amounts of heterogeneous multimodal datarequires a comprehensive and cohesive integromicsstrategy that streamlines biomarker discovery byexposing fresh insights into the underlying factorscontributing to disease phenotypes

bull We explore the practical considerations behind imple-menting an integromics framework using our experi-ences and design choices with PICan as a case studywhile recognizing that the global diversity of bio-marker discovery research programmes will requiretailoring solutions to each research environment

bull Successful integromics frameworks are supported bydynamic interdisciplinary teams and are inherentlyadaptable to new technologies and the host institu-tionrsquos evolving priorities

bull Integromics must be a central part of the institutionalculture and not just a downloadable add-on

Conflict of interest

Professor Peter Hamilton is the founder of and non-executivedirector with PathXL Ltd Professor Manuel Salto-Tellez isSenior Advisor to PathXL Ltd

Funding

The research leading to these results has received funding fromthe People Programme (Marie Curie Actions) of the EuropeanUnionrsquos Seventh Framework Programme FP72007ndash2013underREA grant agreement (285910 to PWH) Invest NorthernIreland (RDO0712612 to PWH) Cancer Research UK Accelerator(C11512A20256 to PWHMS-T) CRUK (to PB) The NorthernIreland Molecular Pathology Laboratory is supported by CancerResearch UK Experimental Cancer Medicine Centre Networkthe NI Health and Social Care Research and DevelopmentDivision the Sean Crummey Memorial Fund the Tom SimmsMemorial Fund and the Friends of the Cancer Centre (to MS-T) The Northern Ireland Biobank is funded by the Health andSocial Care Research and Development Division of the PublicHealth Agency in Northern Ireland and Cancer Research UKthrough the Belfast CRUK Centre and Northern IrelandExperimental Cancer Medicine Centre additional support wasreceived from the Friends of the Cancer Centre (to JAJ)

References1Canuel V Rance B Avillach P et al Translational research plat-

forms integrating clinical and omics data a review of publiclyavailable solutions Brief Bioinform 201516280ndash90

2 Noor AM Holmberg L Gillett C et al Big Data the challengefor small research groups in the era of cancer genomics Br JCancer 20151131405ndash12

3 PathXL Xplore httpwwwpathxlcompathxl-researchpathxl-xplore (20 November 2015 date last accessed)

4 Biomarker Interpretation Workflow - OncoMark httpwwwoncomarkcomgoproductsbiomarker-interpretation-workflow (23 November 2015 date last accessed)

5 Denaxas SC George J Herrett E et al Data resource profilecardiovascular disease research using linked bespoke studiesand electronic health records (CALIBER) Int J Epidemiol2012411625ndash38

6 Lubbock ALR Katz E Harrison DJ et al TMA Navigatornetwork inference patient stratification and survival ana-lysis with tissue microarray data Nucleic Acids Res 201341W562ndash8

7 Giardine B Riemer C Hardison RC et al Galaxy a platform forinteractive large-scale genome analysis Genome Res 2005151451ndash5

8 Goecks J Nekrutenko A Taylor J Galaxy a comprehensiveapproach for supporting accessible reproducible and trans-parent computational research in the life sciences GenomeBiol 201011R86

9 Madhavan S Gusev Y Harris M et al G-DOC a systems medi-cine platform for personalized oncology Neoplasia 201113771ndash83

10Dander A Baldauf M Sperk M et al Personalized OncologySuite integrating next-generation sequencing data andwhole-slide bioimages BMC Bioinformatics 201415306

11Kartashov AV Barski A BioWardrobe an integrated platformfor analysis of epigenomics and transcriptomics data GenomeBiol 201516158

12Harris PA Taylor R Thielke R et al Research electronic datacapture (REDCap)ndasha metadata-driven methodology andworkflow process for providing translational research in-formatics support J Biomed Inform 200942377ndash81

13Szalma S Koka V Khasanova T et al Effective knowledgemanagement in translational medicine J Transl Med201081ndash9

14Manley S Mucci NR De Marzo AM et al Relational databasestructure to manage high-density tissue microarray data andimages for pathology studies focusing on clinical outcomethe prostate specialized program of research excellencemodel Am J Pathol 2001159837ndash43

15Weinstein JN Integromic analysis of the NCI-60 cancer celllines Breast Dis 20041911ndash22

16Weinstein JN Pommier Y Transcriptomic analysis of theNCI-60 cancer cell lines C R Biol 2003326909ndash20

17Yu P Artz D Warner J Electronic health records (EHRs) sup-porting ASCOrsquos vision of cancer care Am Soc Clin Oncol EducBook 2014225ndash31

18McArt DG Blayney JK Boyle DP et al PICan an integromicsframework for dynamic cancer biomarker discovery MolOncol 201591234ndash40

19Hudson TJ Anderson W The International Cancer GenomeConsortium et al International network of cancer genomeprojects Nature 2010464993ndash8

20Home j Global Alliance for Genomics and Health httpsgenomicsandhealthorg (15 March 2016 date last accessed)

21The 1000 Genomes Project Consortium A global reference forhuman genetic variation Nature 201552668ndash74

22Collins FS Varmus H A New Initiative on Precision MedicineN Engl J Med 2015372793ndash5

23Kolesnikov N Hastings E Keays M et al ArrayExpress up-datendashsimplifying data submissions Nucleic Acids Res 201543D1113ndash6

Integromics in cancer tissue biomarker research | 11

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

24Leinonen R Akhtar R Birney E et al The European NucleotideArchive Nucleic Acids Res 201139D28ndash31

25The Cancer Genome Atlas Research NetworkComprehensive genomic characterization defines humanglioblastoma genes and core pathways Nature 20084551061ndash8

26Home - The Cancer Genome Atlas - Cancer Genome -TCGAhttpcancergenomenihgov (1 December 2015 datelast accessed)

27Kristensen VN Lingjaeligrde OC Russnes HG et al Principlesand methods of integrative genomic analyses in cancer NatRev Cancer 201414299ndash313

28Guinney J Dienstmann R Wang X et al The consensusmolecular subtypes of colorectal cancer Nat Med 2015211350ndash6

29Gao J Aksoy BA Dogrusoz U et al Integrative analysis ofcomplex cancer genomics and clinical profiles using thecBioPortal Sci Signal 20136pl1

30Cerami E Gao J Dogrusoz U et al The cBio cancer genomicsportal an open platform for exploring multidimensional can-cer genomics data Cancer Discov 20122401ndash4

31Forbes SA Beare D Gunasekaran P et al COSMIC exploringthe worldrsquos knowledge of somatic mutations in human can-cer Nucleic Acids Res 201543D805ndash11

32Broad GDAC Firehose httpgdacbroadinstituteorg (22March 2016 date last accessed)

33Le Cao K-A Gonzalez I Dejean S integrOmics an R packageto unravel relationships between two omics datasetsBioinformatics 2009252855ndash6

34Waller T Gubała T Sarapata K et al DNA microarray integro-mics analysis platform BioData Min 2015818

35Cecic IK Li G MacAulay C Technologies supporting analyt-ical cytology clinical research and drug discovery applica-tions J Biophotonics 20125313ndash26

36Rocha R Vassallo J Soares F et al Digital slides present sta-tus of a tool for consultation teaching and quality control inpathology Pathol Res Pract 2009205735ndash41

37Pantanowitz L Valenstein P Evans A et al Review of the cur-rent state of whole slide imaging in pathology J Pathol Inform2011236

38Brachtel E Yagi Y Digital imaging in pathologyndashcurrent ap-plications and challenges J Biophotonics 20125327ndash35

39Hamilton PW Bankhead P Wang Y et al Digital pathologyand image analysis in tissue biomarker research Methods20147059ndash73

40Schneider CA Rasband WS Eliceiri KW NIH Image to ImageJ25 years of image analysis Nat Methods 20129671ndash5

41Schindelin J Arganda-Carreras I Frise E et al Fiji - an OpenSource platform for biological image analysis Nat Methods20129676ndash82 101038nmeth2019

42de Chaumont F Dallongeville S Chenouard N et al Icy anopen bioimage informatics platform for extended reprodu-cible research Nat Methods 20129690ndash6

43Sommer C Straehle C Kothe U et al Ilastik interactive learn-ing and segmentation toolkit In Biomedical Imaging fromNano to Macro 2011 IEEE International Symposium on 2011pp 230ndash233 IEEE

44Hamilton PW van Diest PJ Williams R et al Do we see whatwe think we see The complexities of morphological assess-ment J Pathol 2009218285ndash91

45Salto-Tellez M James JA Hamilton PW Molecular path-ologymdashthe value of an integrative approach Mol Oncol201481163ndash8

46Li G van Niekerk D Miller D et al Molecular fixative enablesexpression microarray analysis of microdissected clinicalcervical specimens Exp Mol Pathol 201496168ndash77

47Galon J Mlecnik B Bindea G et al Towards the introductionof the lsquoImmunoscorersquo in the classification of malignant tu-mours J Pathol 2014232199ndash209

48R Core Team R A Language and Environment for StatisticalComputing Vienna Austria R Foundation for StatisticalComputing 2015

49Baier T Neuwirth E Excel COM R Comput Stat 20072291ndash10850McCarty KS Miller LS Cox EB et al Estrogen receptor ana-

lyses Correlation of biochemical and immunohistochemicalmethods using monoclonal antireceptor antibodies ArchPathol Lab Med 1985109716ndash21

51Goulding H Pinder S Cannon P et al A new immunohisto-chemical antibody for the assessment of estrogen receptorstatus on routine formalin-fixed tissue samples Hum Pathol199526291ndash4

52Leek JT Scharpf RB Bravo HC et al Tackling the widespreadand critical impact of batch effects in high-throughput dataNat Rev Genet 201011733ndash9

53Lauss M Visne I Kriegner A et al Monitoring of TechnicalVariation in Quantitative High-Throughput Datasets CancerInform 201312193ndash201

54Codd EF Normalized Data Base Structure a Brief Tutorial InProceedings of the 1971 ACM SIGFIDET (Now SIGMOD)Workshop on Data Description Access and Control SanDiego CA USA 1971 pp 1ndash17 Association for ComputingMachinery New York NY

55Kacprzyk J Zadro _zny S De Tre G Fuzziness in database man-agement systems half a century of developments and futureprospects Fuzzy Sets Syst 2015281300ndash7

56Bosc P Pivert O Farquhar K Integrating fuzzy queries intoan existing database management system an example Int JIntell Syst 19949475ndash92

57Duggal R Khatri SK Shukla B Improving patient matchingsingle patient view for Clinical Decision Support using BigData analytics In 2015 4th International Conference onReliability Infocom Technologies and Optimization (ICRITO)(Trends and Future Directions) Noida India 2015 pp 1ndash6 IEEENew York NY

58OpenSeadragon httpsopenseadragongithubio (7 April2016 date last accessed)

59Allred DC Bustamante MA Daniel CO et alImmunocytochemical analysis of estrogen receptors inhuman breast carcinomas evaluation of 130 cases and re-view of the literature regarding concordance with biochem-ical assay and clinical relevance Arch Surg 1990125107ndash13

60Singh R Chubb L Pantanowitz L et al Standardization in digi-tal pathology supplement 145 of the DICOM standardsJ Pathol Inform 2011223

61Linkert M Rueden CT Allan C et al Metadata matters accessto image data in the real world J Cell Biol 2010189777ndash82

62The HDF Group HDF Group - HDF5 httpswwwhdfgrouporgHDF5 (3 January 2016 date last accessed)

63Sommer C Held M Fischer B et al CellH5 a format for dataexchange in high-content screening Bioinformatics 2013291580ndash2

64Allan C Burel J-M Moore J et al OMERO flexible model-driven data management for experimental biology NatMethods 20129245ndash53

65Maree R Rollus L Stevens B et al Collaborative analysis ofmulti-gigapixel imaging data using Cytomine Bioinformatics2016321395ndash1401

12 | Li et al

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

66DigitalScope V5 httpwwwdigitalscopeorg (26 November2015 date last accessed)

67Astier Y Braha O Bayley H Toward single molecule DNAsequencing direct identification of ribonucleoside and deox-yribonucleoside 5lsquo-monophosphates by using an engineeredprotein nanopore equipped with a molecular adapter J AmChem Soc 20061281705ndash10

68Stoddart D Heron AJ Mikhailova E et al Single-nucleotidediscrimination in immobilized DNA oligonucleotides with abiological nanopore Proc Natl Acad Sci USA 20091067702ndash7

69Flusberg BA Webster D Lee J et al Direct detection of DNAmethylation during single-molecule real-time sequencingNat Methods 20107461ndash5

70Roberts RJ Carneiro MO Schatz MC The advantages of SMRTsequencing Genome Biol 201314405

71Sheth AP Larson JA Federated database systems for manag-ing distributed heterogeneous and autonomous databasesACM Comput Surv 199022183ndash236

72Baker EJ Biological databases for behavioral neurobiology IntRev Neurobiol 201210319ndash38

73Peuroaeuroakkonen P Pakkala D Reference architecture and classifi-cation of technologies products and services for big data sys-tems Big Data Res 20152166ndash86

74Haider S Ballester B Smedley D et al BioMart Central Portalndashunified access to biological data Nucleic Acids Res 200937W23ndash7

75Barrett T Wilhite SE Ledoux P et al NCBI GEO archive forfunctional genomics data setsndashupdate Nucleic Acids Res201341D991ndash5

76Goode A Gilbert B Harkes J et al OpenSlide a vendor-neutralsoftware foundation for digital pathology J Pathol Inform2013427

77Pearson WR Lipman DJ Improved tools for biological se-quence comparison Proc Natl Acad Sci USA 1988852444ndash8

78Berman JJ Edgerton ME Friedman BA The tissue microarraydata exchange specification a community-based opensource tool for sharing tissue microarray data BMC MedInform Decis Mak 200335

79Unified Medical Language System (UMLS) - Home httpswwwnlmnihgovresearchumls (20 November 2015 datelast accessed)

80Komatsoulis GA Warzel DB Hartel FW et al caCORE version3 implementation of a model driven service-oriented archi-tecture for semantic interoperability J Biomed Inform 200841106ndash23

81Covitz PA Hartel F Schaefer C et al caCORE a common infra-structure for cancer informatics Bioinformatics 200319 2404ndash12

82Papatheodorou I Crichton C Morris L et al A metadata ap-proach for clinical data management in translational gen-omics studies in breast cancer BMC Med Genomics 2009266

83Mohanty SK Mistry AT Amin W et al The development anddeployment of Common Data Elements for tissue banks fortranslational research in cancer - an emerging standardbased approach for the Mesothelioma Virtual Tissue BankBMC Cancer 2008891

84 Patel AA Kajdacsy-Balla A Berman JJ et al The developmentof common data elements for a multi-institute prostate can-cer tissue bank the Cooperative Prostate Cancer TissueResource (CPCTR) experience BMC Cancer 20055108

85 Shiny httpshinyrstudiocom (6 April 2016 date lastaccessed)

86 RNET documentation ndash user version j RNET ndash user versionhttpjmp75githubiordotnet (6 April 2016 date lastaccessed)

87 matplotlib python plotting mdash Matplotlib 151 documenta-tion httpmatplotliborg (6 April 2016 date last accessed)

88 D3js - Data-Driven Documents httpsd3jsorg (6 April2016 date last accessed)

89 Fereidouni F Datta-Mitra A Demos S et al Microscopy withUV Surface Excitation (MUSE) for slide-free histology andpathology imaging In Progress in Biomedical Optics andImaging - Proceedings of SPIE San Francisco CA USA 2015p 93180F SPIE Bellingham WA USA

90 Bird B Miljkovic M Romeo MJ et al Infrared micro-spectralimaging distinction of tissue types in axillary lymph nodehistology BMC Clin Pathol 200888

91 Taleb A Diamond J McGarvey JJ et al Raman Microscopy forthe Chemometric Analysis of Tumor Cells J Phys Chem B200611019625ndash31

92 Guze K Pawluk HC Short M et al Pilot study Raman spec-troscopy in differentiating premalignant and malignant orallesions from normal mucosa and benign lesions in humansHead Neck 201537511ndash7

93 McNamee R Regression modelling and other methods tocontrol confounding Occup Environ Med 200562500ndash6

94 dos Santos Silva I Dealing with confounding in the analysisIn Cancer Epidemiology Principles and Methods LyonFrance International Agency for Research on Cancer 1999305ndash31

95 Queenrsquos University Belfast j Postgraduate and ProfessionalDevelopment j MSc Bioinformatics and ComputationalGenomics httpwwwqubacukschoolsmdbspgdPTMScBioinformaticsandComputationalGenomics (24 March2016 date last accessed)

96 Dhir R Patel AA Winters S et al A multidisciplinary ap-proach to honest broker services for tissue banks and clin-ical data a pragmatic and practical model Cancer 20081131705ndash15

97 Boyd AD Hosner C Hunscher DA et al An lsquoHonest Brokerrsquomechanism to maintain privacy for patient care and aca-demic medical research Int J Med Inform 200776407ndash11

98 Pavlou MP Diamandis EP Blasutig IM The long journey ofcancer biomarkers from the bench to the clinic Clin Chem201359147ndash57

99 ELIXIR Data for life httpswwwelixir-europeorg (24March 2016 date last accessed)

100 Research Infrastructure for Imaging Technologies inBiological and Biomedical Sciences j Euro-BioImaginghttpwwweurobioimagingeu (24 March 2016 date lastaccessed)

Integromics in cancer tissue biomarker research | 13

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

Page 7: Embracing an integromic approach to tissue biomarker ... › download › pdf › 33595853.pdf · Her research uses integrative molecular approaches for biomarker discovery and validation

be discarded from analysis potentially reducing the statisticalpower Manual curation at the outset might identify such caseswhen there is still an opportunity to seek clarification from theoriginal data collector Ideally ambiguous annotations and bor-derline cases should be resolved in consultation with a suitablyqualified medical professional before they enter the databaserather than leaving them open to interpretation by downstreaminformaticians Likewise missing or contradictory data pointsare preferably identified and where possible corrected beforeentry into the database However a major advantage of an un-structured database model is the ability to more faithfully cap-ture the complexities of the patientrsquos clinical pathological andother data parameters including free-form text fields such asclinical notes The ability to capture patient data in its nativeform also improves the traceability and scalability of data entryinto a non-structured database solution by eliminating the la-bour and documentation associated with data curation andenabling automated bulk data upload Despite the tools wehave developed to semi-automate data upload this will remaina relative drawback of manual curation in the short term espe-cially for clinical and pathological data fields In the longerterm scalability may be improved with increased adoption ofelectronic medical records wherein many of these data fieldscould be completed by the attending physician when encoun-tering the patient

Balancing the data curation afforded by the structured rela-tional database with the high-throughput flexibility of unstruc-tured models are various hybrid options For example PICanoffers flexibility to the structured database model by includingfreeform text fields to capture unstructured notes (in our experi-ence these are typically for clinical notes) in several of the datatables As these fields are not referenced directly by the auto-mated online analysis component which requires well-definedparameters they are hidden from analysis but can be retrievedby users wishing to conduct further data refinement or offlineanalysis In practice however it is more likely that any fieldsnot used in analysis by the system will simply be ignored andneglected Another hybrid solution we are actively exploring toincrease data throughput is to allow the importation of non-curated data sets for end-user analysis In this model curatedlsquopublicrsquo data sets would be available to all users (with necessaryapprovals and access restrictions in place) while each end userwill be able to upload lsquoprivatersquo data sets to be analysed along-side or integrated with the curated data The user then is re-sponsible for ensuring the quality of hisher own dataExtending the hybrid model concept further are heterogeneousdatabases in which only the most important unchanging andsensitive data are typically stored in a relational databaseMeanwhile the bulk of the data in terms of size will be storedin non-relational databases or as raw data with links stored inthe main database to ensure ease of access and retrieval

Image analysis data storage and management

Current work is focussing on improving the support of digitalimage analysis data an emerging area of research in which ourgroup has significant expertise Imaging data presents someunique challenges for database storage design The imagesthemselves can be large One of our typical tissue slidesscanned at 40objective magnification (025 lm per pixel) canhave 2ndash10 billion pixels Imaging in RGB at 8 bits per channelgives uncompressed file sizes of up to 30 GB Additional colourchannels (such as for fluorescence imaging) z-stacks and dataredundancy in the form of image pyramids for improved image

loading performance would all greatly increase file sizesCompression typically JPEG or JPEG2000 for whole slide imagescan significantly reduce file size without drastically impairingvisual assessment although suitability for image analysis ismore variable [39] Hence it is vital to store the original slideimage to allow a pathologist to go back and verify that imageanalysis results are consistent with hisher expectations This iscurrently achieved by hosting the slide images on remote ser-vers maintained by PathXL A link is stored in the PICan data-base that allows the web-based interface to retrieve the imagedata on user request via an application programming interface(API) supplied by PathXL Pan zoom and overlay functionalitiesin the PICan web interface are powered by OpenSeadragon [58]As the samples are drawn from NIB collections NIB maintainscontrol of access and security using the PathXL software

In addition to the raw image pixels image analysis datamight include markups annotations and calculated metrics(such as integrated optical densities nuclear areas or immuno-histochemical H-scores [50 51]) Metrics can be at a cellularlevel or summarized across a region of interest (ROI) a slide oreven a set of slides Much of these data may be hierarchicallystructured with each slide potentially containing multiple ROIseach ROI containing many cells and each cell being associatedwith values for numerous metrics However the efficiency ofstoring image analysis data in a structured database is uncer-tain when a single image can contain thousands of annotationsFor now PICan stores only summary scores and metrics (eg H-score [50 51] Allred Score [59] percentage positive cells) whileusers wishing to interrogate the image data on a more granularlevel will need to rely on external specialist tools The challengeof storing and transferring image analysis data between sys-tems is an ongoing one with several potential approaches hav-ing been proposed including DICOM [60] OME-TIFF [61] variousHDF5-based formats [62 63] and proprietary vendor formatsbut without a clear front-runner DICOM and OME-TIFF are pri-marily image formats with flexible metadata support optimizedfor image viewing rather than storage of image analysis resultswith large numbers of image objects In contrast HDF5 capturesobject hierarchy relationships well but on its own it is not a na-tive image format Meanwhile proprietary formats are poten-tially encumbered by patenting and intellectual propertyconcerns

Integromics data management is patient-centric

Regardless of the database management system being used acritical characteristic of integromics analysis is patient-centricity The ultimate objective in bringing together data frommultiple experimental platforms is to build a more holistic pic-ture of the interplay between genotype and phenotype in thepatients under study As interest in amalgamating multipledata types has grown many platforms originally designed tohost one data type have been expanded to accommodate add-itional types of data Many biobank management software sys-tems allow addition of clinicopathological notes andexperimental assay results on top of sample tracking dataMeanwhile digital slide hosting platforms like OMERO [64]PathXL Xplore Cytomine [65] and DigitalScope (Aptia SystemsHouston TX used by the College of American Pathologists) [66]are beginning to support the addition of associated clinicopa-thological data as metadata Others like TCGA [25 26] origi-nated as molecular data repositories but have begun addingwhole slide imaging There is a clear trend towards convergenceof database models encompassing more and more data types

6 | Li et al

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

but their impact will be limited if these extra data are simplystored and forgotten Integromics is an active process not just ofbringing data together but more importantly of creative com-parisons and contrasts to expose underlying interactions under-pinning the cancer phenotype Many public repositories and bigdatabases such as TCGA draw from an assortment of patientcohorts so associations between different data types measuredon different cohorts are limited to cohort summaries poten-tially obscuring trends between data types that are only appar-ent at the individual patient level

Interacting with external resourcesmdashdataexchange between systems

With the current rapid expansion of big data analytics in bio-medicine integromics frameworks must remain flexibleenough to incorporate emerging and future technologiesStructured architecture and design of the overall integromicsdatabase framework can facilitate this However large data setslike next-generation sequencing (and upcoming third-gener-ation sequencing [67ndash70]) can be unwieldy in a MySQL databasewith typical whole exome or RNA sequencing runs comprising10ndash100 billion base pairs for a single sample so future integro-mics systems may house these types of data in databases thatare structured specifically for this purpose Instead of a singledatabase housing multiple data types a future integromicsdatabase will undoubtedly be structured as a fully distributedfederated database [71ndash74] with a central database storing iden-tifiers that interface with a cluster of highly specialized data-bases each housing a specific type of data in its native formatArguably PICan already uses a version of this model by out-sourcing digital slide image storage to remote servers Futuredevelopments may include enabling PICan to tap into publiclyavailable data sources and allowing users to analyse these pub-lic data sets in tandem with PICan data such as accessing GeneExpression Omnibus data through GEO2R [75]

One of the challenges when working with a federated data-base model is how to exchange data faithfully and efficientlybetween the various components Each constituent databasewill need an interface to interact seamlessly with the centraldatabase and control system Some of these will be availableoff-the-shelf or rely on published APIs while others such asthose interacting with in-house software may need significantdeveloper time and effort to implement Ultimately there areno shortcuts to establishing these interfaces and so institutionsshould aim to reduce the number of distinct interfaces bystandardizing data around a small set of formats For examplewhen working with whole slide images a single institutioncould strive to use only a single file format The various avail-able formats each have their own strengths and weaknessesbut the natural option will likely be the native format of thescanner used at the institution or a vendor-neutral format likeDICOM [60] or OME-TIFF [61] Fortunately many tools for work-ing with whole slide images support multiple file formats suchas DigitalScope [66] for remote web-based viewing of variousimaging formats Meanwhile projects like Bio-Formats (OpenMicroscopy Environment consortium) [61] and OpenSlide [76]further facilitate the viewing and exploitation of various imag-ing formats in other programming and analysis languages likeC Cthornthorn Python and Java Additionally it is important to stand-ardize the data transfer protocol whether this is throughInternet standards like hypertext transfer protocol physicalmeans like hard disks or some other agreed protocol Perhaps

establishing an institutional integromics strategy will encour-age adoption of standardized data formats by creating an ex-pectation among data generators that data should be portableand compatible with other data types for analysis

Data flow within an integromics environment can often bemultidirectional Even with only a single central data repositorythere will be instances when it is beneficial to export some ofthese data to a separate third-party platform whether for speci-alized analysis report generation or some other purpose AsPICanrsquos web-based user interface is designed to be user friendlyit supports only a limited number of types of analyses and re-port generation options Analyses not hard-coded into the inter-face for example would need to be performed lsquoofflinersquo In factthis is how new analyses are tested before being incorporatedinto PICan Similar to data import and storage data export froman integromics system is complicated by the wide range of datatypes that demand different formats to accommodate their in-herently different natures Efforts have been made in somefields to standardize and integrate data models such as thoseof the Global Alliance for Genomics and Health for genomicdata [20] Hence where widely recognized data exchange for-mats exist such as FASTAFASTQ files for nucleotide or aminoacid sequences [77] or the TMA data exchange standard [78]they should be prioritized for integromics support For otherdata types like digital image analysis however there remains aneed for a well-defined and widely accepted standard for dataexchange Whether it is possible or feasible to define a singleunifying integromic data exchange format that would encapsu-late multiple data types attached to individuals within a cohortof patient subjects remains an open question

Standardization of data vocabulary

Any data exchange standard needs to address both how dataare encoded (structure) as well as a common vocabulary forcommonly encountered fields CSV JSON XML and derivativesof these are popular formats that lend themselves well to struc-tured data Certainly CSV export is our default as although itdoes not fully represent the relational structure underpinningPICanrsquos database it mimics the basic tabular structure and canbe easily viewed by other researchers in common spreadsheetor statistical packages Non-tabular data like digital slide images(which are stored as links in the database) are exported as sep-arate files The greater challenge might be agreeing on a com-mon vocabulary Data field names need to be mapped fromPICanrsquos internal vocabulary to that of the target external plat-form This is not a trivial task as even professionals within thesame institution may use different names for the same metricSome clinicians may be more precise in how certain metrics aremeasured and recorded such as when some distinguish be-tween lymphatic and vascular invasion while others groupthem together as lymphovascular invasion Harmonizing be-tween cancer types can also be challenging hence the use ofcancer-type-specific tables in PICan for certain clinicopathologi-cal metrics Additionally even where a data exchange specifica-tion exists like the one for TMA reporting [78] there may beallowance for custom fields posing an extra challenge for align-ing under a common vocabulary Attempts have been made tostandardize terminology in some disciplines The US NationalLibrary of Medicine maintains a collection of controlled vocabu-laries known as the Unified Medical Language System [79]which incorporates SNOMED CT MeSH and OMIM amongmany other ontologies Another compendium is caCOREincluding the Enterprise Vocabulary Services and Cancer Data

Integromics in cancer tissue biomarker research | 7

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

Standards Repository of common data elements [80 81]Additionally some groups have devised cancer-type-specificlists such as for breast [82] mesothelioma [83] and prostate[84] In the absence of clear community-supported data stand-ards institutions are advised to agree on and use a minimal setof data export formats

Unique challenges of multimodal integrativeanalysis

Collecting and storing diverse data types is only half the battlein integromics unlocking the value of these data will comefrom analysis and interpretation Through a connection withthe R statistical environment via statconn [49] PICan is capableof on-demand statistical analysis graphical output and reportgeneration Besides R and statconn other similar frameworksand interfaces include Shiny [85] RNET [86] Python (eg mat-plotlib [87]) and D3js [88] While PICan is strictly a research toolan integromic approach will undoubtedly be essential to futureclinical decision support systems placing integromics at thefront line of patient care delivery

Multimodal analysis is particularly important when con-sidering the future of digital pathology and tissue imagingConventional light and fluorescence imaging provides the basisfor most studies in this field However the future may providenew imaging technologies using light outside of the normal vis-ible spectrum that may be more informative towards illuminat-ing the underlying cancer biology and delivering sensitive andspecific biomarkers of clinical outcome Numerous studies areemerging that exploit technologies such as ultraviolet [89] in-frared [90] and Raman spectroscopy [91 92] together with abroad range of associated multispectral technologies to gain abetter insight into tissue and cell characterization and to visual-ize compounds and chemistry not accessible to conventionalmicroscopy Integrating these imaging modalities into a singleframework to compare compound and deduce clinical utilitynot in isolation but together with conventional imaging meth-ods is essential Similarly there is capacity to integrate otherforms of medical imaging relevant to cancer discovery includ-ing positron emission tomography magnetic resonance imag-ing computed tomography and X-ray This can be used tounderstand the relationship between in vivo clinical imagingand ex vivo tissue imaging thereby translating markers that arecurrently only visible on microscopy to early in vivo detectionwith clinical imaging modalities In both settings the real valueis likely to come from the overlay and integration of imagingmodalities extending the capacity of tissue imaging on its ownand providing new visualization tools for both pathologists andradiologists that are more informative and consequently im-prove clinical decision making The integromics initiative dis-cussed in this article provides such a model for imaging as wellas molecular and clinical data blurring the edges of medicalspecialties and providing a unified framework for discovery

The basic tools of integromic data analysis are generallysimilar to other disciplines of bioinformatics Mean compari-sons clustering survival analyses machine learning and linearmodels for example can all be used as can the various pipe-lines that have been developed for next-generation sequencinggene expression microarrays and other molecular big dataomics techniques Integration of digital image analysis data inintegromic analysis can present its own distinct risks and re-wards Images and their associated markups and annotationsdo not lend themselves on their own to t-tests or clustering for

example Nevertheless these may be useful to contextualizeother pieces of data like IHC scores Calculated scores (egH-score) and metrics (eg integrated optical density nucleararea) can often be expressed as nameattribute pairs so theylend themselves more immediately to integrative analysis al-though data structure and hierarchy can still complicate ana-lysis Some metrics can be measured at a cellular level andsummarized at an ROI or whole slide level Others howevermight not be meaningful at a cellular level (eg H-score) so onemust be mindful of the meaning of each measure when incor-porating it into a multivariate model of the data Tumour het-erogeneity is also an important complication Summary scorescomputed across different regions of the same tumour may dif-fer markedly Heterogeneity at least in solid tumours can bemore easily mapped in tissue sections where tissue structureand cellular context is maintained and intact Spatial changesin tumour cell arrangement can be visualized microscopicallyand used to sample regions for molecular analysis and themeasurement of intratumoural heterogeneity Whole slideimaging and digital image analysis represent powerful tools toquantitatively map tissue phenotype and provide a sound basisfor biomarker assay development while opening up new av-enues of inquiry For example one can begin to explore whethercertain morphometric phenotypes correlate with specific gen-etic or epigenetic profiles and whether these have any signifi-cant prognostic or therapeutic impact

A specific challenge for integromics is the application ofstatistical techniques to data from disparate sources Care mustbe taken to ensure differences observed between experimentalgroups reflect the underlying biology and not simply batch ef-fects Differences in experimental assay platforms such as dif-ferent probes targeting the same gene or different antibodies forIHC may produce discrepancies in the data based on differentlevels of sensitivity and specificity In digital pathology differ-ences in imaging hardware analysis algorithms and parameterdefinitions may lead to various imaging artefacts and interpret-ation difficulties Even within the same software cell countsdetermined from different threshold settings are not compar-able Less visible confounders include study and patient vari-ables like recruitment methodology study admission criteriapatient stratificationdiagnostic criteria pre-collection treat-ments and standard of care (for prognostic biomarker studies)that can affect the makeup of the study patient populationMeanwhile specimen handling variables like cold ischemictime fixation method staining protocols and formulations canhave a profound effect on the data often dominating other con-founders regardless of the subsequent pipeline Ideally thendata harmonization begins well before data are generated withstandardization and thorough documentation of all pre-analytical steps including patient selection tissue collectionand specimen handling Where potential batch effects or con-founding variables have been identified a number of strategiescan be used such as subset analysis (ie stratifying the subjectpopulation on the confounding variable) or regression model-ling [93 94]

Mandated public release of large data sets has been helpful inadvancing transparency in science but tracking the associatedmetadata and assessing its impact on results derived from thesebig data sets remains a major unresolved challenge Even largerepositories like TCGA are composed of smaller collections froma wide range of geographies and institutional environments Inmany cases pooling such diverse data sets becomes an apples-to-oranges comparison Batch effects in TCGA and other largehigh-throughput data sets are well-documented [52 53] Without

8 | Li et al

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

control over data origins and consistency it is essential that theentire process be rigorously validated before interpretation of re-sults What local data sets lack in scale they make up for in trace-ability For PICan we require that our collaborators are able tostand over every data point that enters our system therebyensuring a higher standard of quality and confidence in our dataMetadata recording technical and experimental details are storedin PICan alongside the data Regardless of how it is done meta-data tracking should be rigorous and thorough as differences inexperimental conditions and design must factor into the inter-pretation of any analysis Where appropriate data should be cor-rected and normalized to ensure comparisons are meaningfullike-versus-like comparisons and observed contrasts are notmerely batch or scalar effects

Interdisciplinary teamwork

Apart from robust infrastructure and software support the heartof any successful integromics framework is a cohesive and inter-disciplinary team to support and champion the entire effortDespite the technical complexities of a system like PICan team-work is arguably the most important and yet most often over-looked piece of the integromics puzzle This is fundamentally acollaborative endeavour bringing with it all the challenges ofworking with an interdisciplinary team of individuals and organ-izations each of whom have their own diverse set of goals andpriorities Clinical team members for example may have respon-sibilities to their patients that would not apply to academic re-searchers Nevertheless the specialist insights afforded by aninterdisciplinary team have been indispensable to the success ofPICan Clinicopathological data tables are built on templates cre-ated by clinical experts specializing in each cancer type Clinicaland translational researchers inform the choice of analysis toolsto be added which are then developed and tested by transla-tional bioinformaticians Establishing and incorporating an inte-gromics framework into the mindset of an institution will requireadditional effort and workload on certain individuals Hence it isimperative that team discussions include delegation and sharedresponsibility over matters such as data collection and curationdata formats and standards system maintenance data exchangeand access control Differing priorities and opinions betweenteam members are to some extent unavoidable but teams canadopt measures to manage conflicts such as having all protocolsclearly documented or maintaining a regular meeting scheduleRegardless of how these challenges are met we stress the im-portance of maintaining open lines of communication and agree-ing to clear roles and responsibilities of all team membersEffective teamwork is a non-trivial yet integral contributor to thesuccess of any integromics framework

With an ever-increasing emphasis on interdisciplinaryresearch projects like PICan underscore the need for cross-disciplinary education to instil young researchers with thenecessary mindset understanding and skills to navigate thechallenges and opportunities in the research environment ofthe future Queenrsquos University Belfast for example offers aMasters-level course in Bioinformatics and ComputationalGenomics [95] The course attracts students from a diverserange of backgrounds from medical doctors to engineers andcomputer scientists to biologists from a range of life sciencesubject areas and aims to equip them with a solid foundationacross a broad range of topics such as advanced bioinformaticsdatabase design tissue imaging and integromics

Interdisciplinary teamwork is vital to another pillar of anyintegromics framework access to quality data Any analysis is

only as robust as the data that underpins it The role of the NIBis crucial to PICanrsquos success not only through access to its bio-specimens but also its expertise and infrastructure support forsample preparation processing accessioning tracking andquality control The active participation of an interdisciplinaryteam also distributes the workload of data curation One of thegreat struggles of data collection is maximizing quality andquantity with a limited resource of labour With an increasingfocus among funders and institutions on large-scale multi-centre studies scaling up an integromics framework while pre-serving data quality will rely heavily on greater delegation ofdata curation to the data collectors themselves The onus willbe on the individual data collectors to flag data discrepanciesso that the database administrator serves as the last line of de-fence rather than the first

Ethics and data governance

Like any other system handling clinical and pathological infor-mation an integromics framework also requires appropriateapproval for the use of patient data Additionally data controlsin integromics need to protect both patients and data collectorsde-identification protects patient privacy while access controlsand terms of use policies protect the rights of data collectors

All studies undertaken in PICan need to be approved follow-ing an application to NIB where project-specific ethics and gov-ernance are granted under the Office for Research EthicsCommittees Northern Ireland approved NIB guidelines As a re-search tool PICan stores no personally identifiable informationin its database ensuring patient anonymization and minimiz-ing the potential impact of any data security breach Each pa-tient sample and associated analytical data are labelled using aunique independent study-specific identifier within the systemClinically qualified and authorized patient data collectors retainthe key that links these identifiers to hospital systems wherepersonal data are securely stored Researchers using PICan arerestricted to using anonymized research data onlymdashrespectingdata protection laws and regulations However as importantfollow-up data on patients accrues over time these can still beaccessed by having appropriately authorized clinical staff anddata collectors use the key to identify patients on hospital sys-tems and retrieve more up-to-date patient informationAuthorized clinical personnel can then review and retrieve per-tinent research-relevant information from the hospital systemsand update PICan again with only anonymized research datausing the PICan identifiers This is equivalent to an honest bro-ker system used elsewhere [96 97]

Data curation of de-identified information in PICan producesa robust privacy-controlled snapshot of analytical data but theunderlying data sources are themselves continually beingupdated as patients are followed up and new specimens are col-lected While data update is possible it is necessarily a manualprocess as PICan has been walled off from the original datasources by design allowing it to operate with fewer privacy con-trols than a typical clinical system as no identifiable patientdata are stored To maintain quality and consistency with exist-ing data data updates to PICan are done only periodically withentire cohorts of patients updated at once Alternatively insti-tutions that prioritize constantly updated data will requiredeeper integration between their research and clinical data sys-tems In such cases additional data protection measures will berequired such as encryption or safe havens for data storage andtransfer Essentially such an integrated system would need tomeet local rules and regulations for clinical data systems

Integromics in cancer tissue biomarker research | 9

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

Scaling up for large-scale epidemiologicalstudies

The last few decades have seen tremendous investment into allaspects of cancer biomarker discovery research but as of 2013only 18 protein cancer biomarkers had received US Food andDrug Administration (FDA) approval [98] despite decades of re-search and thousands of publications on novel candidate bio-markers Part of the reason may be a focus on early-stagediscovery activities and the relative lack of large-scale multi-centre validation and clinical trials Studies of such scale pre-sent a number of data management challenges not unlikethose we designed PICan to address First and foremost integro-mics is inherently a team activity and the same principles ofteamwork apply whether it is conducted within a single centreor across multiple research centres on different continents Inthis way integromics research extends readily to large-scalemulticentre epidemiological studies Data harmonization of fileformats and common vocabulary remain vitally important andbest addressed through open communication between collabor-ators As in the case of collaborators within an institution teammembers from different institutions should ideally agree on asingle data format for each data type and a common agreed setof vocabulary to describe data fields collected depending on theneeds of the study and its participants Analysis of multicentredata also requires standardization of pre-analytical parameterssuch as inclusion criteria sample acquisition sample process-ing assay conditions and measurements recorded When estab-lishing project standards and protocols teams should considerwhat community standards are already available to maximizeinteroperability and avoid redundancies Across Europe for ex-ample projects like ELIXIR [99] and Euro-BioImaging [100] are al-ready seeking to establish hubs for biomedical and bioimagingresource sharing Globally similar efforts in genomics are alsounderway [19ndash21]

The process of building digital infrastructure to support amulticentre integromics framework is similar to the case of asingle institution All team members should meet and agree onthe structure of the framework Depending on the extent of theproject and range of requirements the technical scientific andmedical team will need to grow as well Among the team thereis a need for at least one full-time staff member dedicated totechnical development and management of the digital infra-structure As the project grows the staffing requirement willundoubtedly increase in tandem For this reason modular de-sign becomes even more crucial as it allows development tasksto be carved up and delegated out to different team membersThere will also be additional requirements for hardware andsoftware licences Software costs can be mitigated by using free(eg R) or low-cost volume licenced (eg Microsoft Visual Studio(Redmond WA USA) for ASPNET in an academic setting) op-tions Nevertheless despite minor challenges digital solutionsgenerally scale well both in size and geographic distributionParallels can be drawn between incorporating external data re-sources into an integromics environment as previously dis-cussed and the challenges of connecting information systemsacross multiple sites However multicentre teams also facesome specific challenges Teams may need to comply with vari-ous sets of local laws and regulations surrounding researchethics data security and patient privacy Network access andsecurity may also be further complicated in a multicentre set-ting PICan for example was built for use within a single insti-tution so access was limited to within the institutionrsquos networkfirewall with all external access blocked A system spanning

multiple institutions would need greater attention to securitywhile maintaining access from outside the host institutionTeams may also need to contend with different software andhardware infrastructures at the various centres and implementthe necessary interfaces to ensure compatibility However theultimate key to success of multicentre collaborations is a funda-mental commitment to team science and on that front institu-tions that embrace integromics will have the experience to leadthe coming age of large-scale multicentre studies

Furthermore platforms like PICan and the integration ofimage analysis data with clinicopathological and therapeuticdata will help accelerate the screening of candidate tissue bio-markers for prognostic and predictive capacity While this rep-resents a powerful research and discovery tool this is likely toidentify new tissue biomarkers (based on IHC chromogenic insitu hybridization (CISH) fluorescence in situ hybridization(FISH) RNAscope etc) that have clear value in routine diagnos-tics and the stratification of patients for precision therapeuticsFurther validation of these biomarkers would be necessary incarefully controlled and standardized biomarker trials takinginto consideration sample preparation scanner configurationlab-to-lab variation comparative performance against patholo-gist scoring and clinical outcome data Regulatory approval ofdigital diagnostic tests would require FDA clearance (USA) orCE-IVD marking (Europe) if they were to be marketed as clinicaltests by industry but may also be implemented as laboratory-developed tests with documented internal validation and clin-ical evidence if done within an academic medical centre suchas our own setting Integromics platforms provide the frame-work for discovery and if designed appropriately can provide ro-bust clinical evidence of the utility of new biomarkers forsubsequent commercial licensing regulatory approval and clin-ical translation into routine diagnostictherapeutic practice

Conclusion

Modern biomedical research generates scientific and clinicaldata faster than the ability to fully interpret it while data fromnew and emerging technologies need to be integrated andharmonized with existing analysis workflows Efforts to resolvethe resulting bottleneck in interpretation to deliver on thepromises of precision medicine in this era of fast-track bio-marker discovery and evaluation have led to the birth of a newfield in bioinformatics integromics After introducing our ownframework for streamlining integromic analyses it becameclear that there is widespread interest in implementing integro-mic strategies in other research institutions In light of the pre-sent discussion we feel that an optimal integromics frameworkmust be well-supported by a dynamic and interdisciplinaryteam reflect the local needs and priorities of the institution andbe designed with an eye to the future The team is the gatewayto high-quality data and specialist insights that inform analysisand ultimately catalyse translation to clinical application insupport of precision medicine Meanwhile biomarker researchat our institution is currently centred on TMAs and we haveemphasized data quality and curation over raw throughputThese traits are imprinted in the design and operation of PICanbut other institutions may have other needs that demand a dif-ferent infrastructure Ongoing development of PICan includesincorporating raw data storage and access for completenessand traceability and strengthening support for digital imageanalysis As our institution adopts newer technologies PICanlikewise will continue to evolve facilitated by its modular de-sign to meet the changing needs of a busy cancer research

10 | Li et al

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

environment spanning numerous programmes technologiesand scientific objectives Hence integromics should be interwo-ven into the fabric of the institution and not simply be a down-loadable add-on

Key Points

bull Bold approaches are needed to maximize the potentialof big data in modern biomedical research and tobreak down the silos in which they are currently held

bull Collating storing transferring and integratively analy-sing vast amounts of heterogeneous multimodal datarequires a comprehensive and cohesive integromicsstrategy that streamlines biomarker discovery byexposing fresh insights into the underlying factorscontributing to disease phenotypes

bull We explore the practical considerations behind imple-menting an integromics framework using our experi-ences and design choices with PICan as a case studywhile recognizing that the global diversity of bio-marker discovery research programmes will requiretailoring solutions to each research environment

bull Successful integromics frameworks are supported bydynamic interdisciplinary teams and are inherentlyadaptable to new technologies and the host institu-tionrsquos evolving priorities

bull Integromics must be a central part of the institutionalculture and not just a downloadable add-on

Conflict of interest

Professor Peter Hamilton is the founder of and non-executivedirector with PathXL Ltd Professor Manuel Salto-Tellez isSenior Advisor to PathXL Ltd

Funding

The research leading to these results has received funding fromthe People Programme (Marie Curie Actions) of the EuropeanUnionrsquos Seventh Framework Programme FP72007ndash2013underREA grant agreement (285910 to PWH) Invest NorthernIreland (RDO0712612 to PWH) Cancer Research UK Accelerator(C11512A20256 to PWHMS-T) CRUK (to PB) The NorthernIreland Molecular Pathology Laboratory is supported by CancerResearch UK Experimental Cancer Medicine Centre Networkthe NI Health and Social Care Research and DevelopmentDivision the Sean Crummey Memorial Fund the Tom SimmsMemorial Fund and the Friends of the Cancer Centre (to MS-T) The Northern Ireland Biobank is funded by the Health andSocial Care Research and Development Division of the PublicHealth Agency in Northern Ireland and Cancer Research UKthrough the Belfast CRUK Centre and Northern IrelandExperimental Cancer Medicine Centre additional support wasreceived from the Friends of the Cancer Centre (to JAJ)

References1Canuel V Rance B Avillach P et al Translational research plat-

forms integrating clinical and omics data a review of publiclyavailable solutions Brief Bioinform 201516280ndash90

2 Noor AM Holmberg L Gillett C et al Big Data the challengefor small research groups in the era of cancer genomics Br JCancer 20151131405ndash12

3 PathXL Xplore httpwwwpathxlcompathxl-researchpathxl-xplore (20 November 2015 date last accessed)

4 Biomarker Interpretation Workflow - OncoMark httpwwwoncomarkcomgoproductsbiomarker-interpretation-workflow (23 November 2015 date last accessed)

5 Denaxas SC George J Herrett E et al Data resource profilecardiovascular disease research using linked bespoke studiesand electronic health records (CALIBER) Int J Epidemiol2012411625ndash38

6 Lubbock ALR Katz E Harrison DJ et al TMA Navigatornetwork inference patient stratification and survival ana-lysis with tissue microarray data Nucleic Acids Res 201341W562ndash8

7 Giardine B Riemer C Hardison RC et al Galaxy a platform forinteractive large-scale genome analysis Genome Res 2005151451ndash5

8 Goecks J Nekrutenko A Taylor J Galaxy a comprehensiveapproach for supporting accessible reproducible and trans-parent computational research in the life sciences GenomeBiol 201011R86

9 Madhavan S Gusev Y Harris M et al G-DOC a systems medi-cine platform for personalized oncology Neoplasia 201113771ndash83

10Dander A Baldauf M Sperk M et al Personalized OncologySuite integrating next-generation sequencing data andwhole-slide bioimages BMC Bioinformatics 201415306

11Kartashov AV Barski A BioWardrobe an integrated platformfor analysis of epigenomics and transcriptomics data GenomeBiol 201516158

12Harris PA Taylor R Thielke R et al Research electronic datacapture (REDCap)ndasha metadata-driven methodology andworkflow process for providing translational research in-formatics support J Biomed Inform 200942377ndash81

13Szalma S Koka V Khasanova T et al Effective knowledgemanagement in translational medicine J Transl Med201081ndash9

14Manley S Mucci NR De Marzo AM et al Relational databasestructure to manage high-density tissue microarray data andimages for pathology studies focusing on clinical outcomethe prostate specialized program of research excellencemodel Am J Pathol 2001159837ndash43

15Weinstein JN Integromic analysis of the NCI-60 cancer celllines Breast Dis 20041911ndash22

16Weinstein JN Pommier Y Transcriptomic analysis of theNCI-60 cancer cell lines C R Biol 2003326909ndash20

17Yu P Artz D Warner J Electronic health records (EHRs) sup-porting ASCOrsquos vision of cancer care Am Soc Clin Oncol EducBook 2014225ndash31

18McArt DG Blayney JK Boyle DP et al PICan an integromicsframework for dynamic cancer biomarker discovery MolOncol 201591234ndash40

19Hudson TJ Anderson W The International Cancer GenomeConsortium et al International network of cancer genomeprojects Nature 2010464993ndash8

20Home j Global Alliance for Genomics and Health httpsgenomicsandhealthorg (15 March 2016 date last accessed)

21The 1000 Genomes Project Consortium A global reference forhuman genetic variation Nature 201552668ndash74

22Collins FS Varmus H A New Initiative on Precision MedicineN Engl J Med 2015372793ndash5

23Kolesnikov N Hastings E Keays M et al ArrayExpress up-datendashsimplifying data submissions Nucleic Acids Res 201543D1113ndash6

Integromics in cancer tissue biomarker research | 11

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

24Leinonen R Akhtar R Birney E et al The European NucleotideArchive Nucleic Acids Res 201139D28ndash31

25The Cancer Genome Atlas Research NetworkComprehensive genomic characterization defines humanglioblastoma genes and core pathways Nature 20084551061ndash8

26Home - The Cancer Genome Atlas - Cancer Genome -TCGAhttpcancergenomenihgov (1 December 2015 datelast accessed)

27Kristensen VN Lingjaeligrde OC Russnes HG et al Principlesand methods of integrative genomic analyses in cancer NatRev Cancer 201414299ndash313

28Guinney J Dienstmann R Wang X et al The consensusmolecular subtypes of colorectal cancer Nat Med 2015211350ndash6

29Gao J Aksoy BA Dogrusoz U et al Integrative analysis ofcomplex cancer genomics and clinical profiles using thecBioPortal Sci Signal 20136pl1

30Cerami E Gao J Dogrusoz U et al The cBio cancer genomicsportal an open platform for exploring multidimensional can-cer genomics data Cancer Discov 20122401ndash4

31Forbes SA Beare D Gunasekaran P et al COSMIC exploringthe worldrsquos knowledge of somatic mutations in human can-cer Nucleic Acids Res 201543D805ndash11

32Broad GDAC Firehose httpgdacbroadinstituteorg (22March 2016 date last accessed)

33Le Cao K-A Gonzalez I Dejean S integrOmics an R packageto unravel relationships between two omics datasetsBioinformatics 2009252855ndash6

34Waller T Gubała T Sarapata K et al DNA microarray integro-mics analysis platform BioData Min 2015818

35Cecic IK Li G MacAulay C Technologies supporting analyt-ical cytology clinical research and drug discovery applica-tions J Biophotonics 20125313ndash26

36Rocha R Vassallo J Soares F et al Digital slides present sta-tus of a tool for consultation teaching and quality control inpathology Pathol Res Pract 2009205735ndash41

37Pantanowitz L Valenstein P Evans A et al Review of the cur-rent state of whole slide imaging in pathology J Pathol Inform2011236

38Brachtel E Yagi Y Digital imaging in pathologyndashcurrent ap-plications and challenges J Biophotonics 20125327ndash35

39Hamilton PW Bankhead P Wang Y et al Digital pathologyand image analysis in tissue biomarker research Methods20147059ndash73

40Schneider CA Rasband WS Eliceiri KW NIH Image to ImageJ25 years of image analysis Nat Methods 20129671ndash5

41Schindelin J Arganda-Carreras I Frise E et al Fiji - an OpenSource platform for biological image analysis Nat Methods20129676ndash82 101038nmeth2019

42de Chaumont F Dallongeville S Chenouard N et al Icy anopen bioimage informatics platform for extended reprodu-cible research Nat Methods 20129690ndash6

43Sommer C Straehle C Kothe U et al Ilastik interactive learn-ing and segmentation toolkit In Biomedical Imaging fromNano to Macro 2011 IEEE International Symposium on 2011pp 230ndash233 IEEE

44Hamilton PW van Diest PJ Williams R et al Do we see whatwe think we see The complexities of morphological assess-ment J Pathol 2009218285ndash91

45Salto-Tellez M James JA Hamilton PW Molecular path-ologymdashthe value of an integrative approach Mol Oncol201481163ndash8

46Li G van Niekerk D Miller D et al Molecular fixative enablesexpression microarray analysis of microdissected clinicalcervical specimens Exp Mol Pathol 201496168ndash77

47Galon J Mlecnik B Bindea G et al Towards the introductionof the lsquoImmunoscorersquo in the classification of malignant tu-mours J Pathol 2014232199ndash209

48R Core Team R A Language and Environment for StatisticalComputing Vienna Austria R Foundation for StatisticalComputing 2015

49Baier T Neuwirth E Excel COM R Comput Stat 20072291ndash10850McCarty KS Miller LS Cox EB et al Estrogen receptor ana-

lyses Correlation of biochemical and immunohistochemicalmethods using monoclonal antireceptor antibodies ArchPathol Lab Med 1985109716ndash21

51Goulding H Pinder S Cannon P et al A new immunohisto-chemical antibody for the assessment of estrogen receptorstatus on routine formalin-fixed tissue samples Hum Pathol199526291ndash4

52Leek JT Scharpf RB Bravo HC et al Tackling the widespreadand critical impact of batch effects in high-throughput dataNat Rev Genet 201011733ndash9

53Lauss M Visne I Kriegner A et al Monitoring of TechnicalVariation in Quantitative High-Throughput Datasets CancerInform 201312193ndash201

54Codd EF Normalized Data Base Structure a Brief Tutorial InProceedings of the 1971 ACM SIGFIDET (Now SIGMOD)Workshop on Data Description Access and Control SanDiego CA USA 1971 pp 1ndash17 Association for ComputingMachinery New York NY

55Kacprzyk J Zadro _zny S De Tre G Fuzziness in database man-agement systems half a century of developments and futureprospects Fuzzy Sets Syst 2015281300ndash7

56Bosc P Pivert O Farquhar K Integrating fuzzy queries intoan existing database management system an example Int JIntell Syst 19949475ndash92

57Duggal R Khatri SK Shukla B Improving patient matchingsingle patient view for Clinical Decision Support using BigData analytics In 2015 4th International Conference onReliability Infocom Technologies and Optimization (ICRITO)(Trends and Future Directions) Noida India 2015 pp 1ndash6 IEEENew York NY

58OpenSeadragon httpsopenseadragongithubio (7 April2016 date last accessed)

59Allred DC Bustamante MA Daniel CO et alImmunocytochemical analysis of estrogen receptors inhuman breast carcinomas evaluation of 130 cases and re-view of the literature regarding concordance with biochem-ical assay and clinical relevance Arch Surg 1990125107ndash13

60Singh R Chubb L Pantanowitz L et al Standardization in digi-tal pathology supplement 145 of the DICOM standardsJ Pathol Inform 2011223

61Linkert M Rueden CT Allan C et al Metadata matters accessto image data in the real world J Cell Biol 2010189777ndash82

62The HDF Group HDF Group - HDF5 httpswwwhdfgrouporgHDF5 (3 January 2016 date last accessed)

63Sommer C Held M Fischer B et al CellH5 a format for dataexchange in high-content screening Bioinformatics 2013291580ndash2

64Allan C Burel J-M Moore J et al OMERO flexible model-driven data management for experimental biology NatMethods 20129245ndash53

65Maree R Rollus L Stevens B et al Collaborative analysis ofmulti-gigapixel imaging data using Cytomine Bioinformatics2016321395ndash1401

12 | Li et al

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

66DigitalScope V5 httpwwwdigitalscopeorg (26 November2015 date last accessed)

67Astier Y Braha O Bayley H Toward single molecule DNAsequencing direct identification of ribonucleoside and deox-yribonucleoside 5lsquo-monophosphates by using an engineeredprotein nanopore equipped with a molecular adapter J AmChem Soc 20061281705ndash10

68Stoddart D Heron AJ Mikhailova E et al Single-nucleotidediscrimination in immobilized DNA oligonucleotides with abiological nanopore Proc Natl Acad Sci USA 20091067702ndash7

69Flusberg BA Webster D Lee J et al Direct detection of DNAmethylation during single-molecule real-time sequencingNat Methods 20107461ndash5

70Roberts RJ Carneiro MO Schatz MC The advantages of SMRTsequencing Genome Biol 201314405

71Sheth AP Larson JA Federated database systems for manag-ing distributed heterogeneous and autonomous databasesACM Comput Surv 199022183ndash236

72Baker EJ Biological databases for behavioral neurobiology IntRev Neurobiol 201210319ndash38

73Peuroaeuroakkonen P Pakkala D Reference architecture and classifi-cation of technologies products and services for big data sys-tems Big Data Res 20152166ndash86

74Haider S Ballester B Smedley D et al BioMart Central Portalndashunified access to biological data Nucleic Acids Res 200937W23ndash7

75Barrett T Wilhite SE Ledoux P et al NCBI GEO archive forfunctional genomics data setsndashupdate Nucleic Acids Res201341D991ndash5

76Goode A Gilbert B Harkes J et al OpenSlide a vendor-neutralsoftware foundation for digital pathology J Pathol Inform2013427

77Pearson WR Lipman DJ Improved tools for biological se-quence comparison Proc Natl Acad Sci USA 1988852444ndash8

78Berman JJ Edgerton ME Friedman BA The tissue microarraydata exchange specification a community-based opensource tool for sharing tissue microarray data BMC MedInform Decis Mak 200335

79Unified Medical Language System (UMLS) - Home httpswwwnlmnihgovresearchumls (20 November 2015 datelast accessed)

80Komatsoulis GA Warzel DB Hartel FW et al caCORE version3 implementation of a model driven service-oriented archi-tecture for semantic interoperability J Biomed Inform 200841106ndash23

81Covitz PA Hartel F Schaefer C et al caCORE a common infra-structure for cancer informatics Bioinformatics 200319 2404ndash12

82Papatheodorou I Crichton C Morris L et al A metadata ap-proach for clinical data management in translational gen-omics studies in breast cancer BMC Med Genomics 2009266

83Mohanty SK Mistry AT Amin W et al The development anddeployment of Common Data Elements for tissue banks fortranslational research in cancer - an emerging standardbased approach for the Mesothelioma Virtual Tissue BankBMC Cancer 2008891

84 Patel AA Kajdacsy-Balla A Berman JJ et al The developmentof common data elements for a multi-institute prostate can-cer tissue bank the Cooperative Prostate Cancer TissueResource (CPCTR) experience BMC Cancer 20055108

85 Shiny httpshinyrstudiocom (6 April 2016 date lastaccessed)

86 RNET documentation ndash user version j RNET ndash user versionhttpjmp75githubiordotnet (6 April 2016 date lastaccessed)

87 matplotlib python plotting mdash Matplotlib 151 documenta-tion httpmatplotliborg (6 April 2016 date last accessed)

88 D3js - Data-Driven Documents httpsd3jsorg (6 April2016 date last accessed)

89 Fereidouni F Datta-Mitra A Demos S et al Microscopy withUV Surface Excitation (MUSE) for slide-free histology andpathology imaging In Progress in Biomedical Optics andImaging - Proceedings of SPIE San Francisco CA USA 2015p 93180F SPIE Bellingham WA USA

90 Bird B Miljkovic M Romeo MJ et al Infrared micro-spectralimaging distinction of tissue types in axillary lymph nodehistology BMC Clin Pathol 200888

91 Taleb A Diamond J McGarvey JJ et al Raman Microscopy forthe Chemometric Analysis of Tumor Cells J Phys Chem B200611019625ndash31

92 Guze K Pawluk HC Short M et al Pilot study Raman spec-troscopy in differentiating premalignant and malignant orallesions from normal mucosa and benign lesions in humansHead Neck 201537511ndash7

93 McNamee R Regression modelling and other methods tocontrol confounding Occup Environ Med 200562500ndash6

94 dos Santos Silva I Dealing with confounding in the analysisIn Cancer Epidemiology Principles and Methods LyonFrance International Agency for Research on Cancer 1999305ndash31

95 Queenrsquos University Belfast j Postgraduate and ProfessionalDevelopment j MSc Bioinformatics and ComputationalGenomics httpwwwqubacukschoolsmdbspgdPTMScBioinformaticsandComputationalGenomics (24 March2016 date last accessed)

96 Dhir R Patel AA Winters S et al A multidisciplinary ap-proach to honest broker services for tissue banks and clin-ical data a pragmatic and practical model Cancer 20081131705ndash15

97 Boyd AD Hosner C Hunscher DA et al An lsquoHonest Brokerrsquomechanism to maintain privacy for patient care and aca-demic medical research Int J Med Inform 200776407ndash11

98 Pavlou MP Diamandis EP Blasutig IM The long journey ofcancer biomarkers from the bench to the clinic Clin Chem201359147ndash57

99 ELIXIR Data for life httpswwwelixir-europeorg (24March 2016 date last accessed)

100 Research Infrastructure for Imaging Technologies inBiological and Biomedical Sciences j Euro-BioImaginghttpwwweurobioimagingeu (24 March 2016 date lastaccessed)

Integromics in cancer tissue biomarker research | 13

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

Page 8: Embracing an integromic approach to tissue biomarker ... › download › pdf › 33595853.pdf · Her research uses integrative molecular approaches for biomarker discovery and validation

but their impact will be limited if these extra data are simplystored and forgotten Integromics is an active process not just ofbringing data together but more importantly of creative com-parisons and contrasts to expose underlying interactions under-pinning the cancer phenotype Many public repositories and bigdatabases such as TCGA draw from an assortment of patientcohorts so associations between different data types measuredon different cohorts are limited to cohort summaries poten-tially obscuring trends between data types that are only appar-ent at the individual patient level

Interacting with external resourcesmdashdataexchange between systems

With the current rapid expansion of big data analytics in bio-medicine integromics frameworks must remain flexibleenough to incorporate emerging and future technologiesStructured architecture and design of the overall integromicsdatabase framework can facilitate this However large data setslike next-generation sequencing (and upcoming third-gener-ation sequencing [67ndash70]) can be unwieldy in a MySQL databasewith typical whole exome or RNA sequencing runs comprising10ndash100 billion base pairs for a single sample so future integro-mics systems may house these types of data in databases thatare structured specifically for this purpose Instead of a singledatabase housing multiple data types a future integromicsdatabase will undoubtedly be structured as a fully distributedfederated database [71ndash74] with a central database storing iden-tifiers that interface with a cluster of highly specialized data-bases each housing a specific type of data in its native formatArguably PICan already uses a version of this model by out-sourcing digital slide image storage to remote servers Futuredevelopments may include enabling PICan to tap into publiclyavailable data sources and allowing users to analyse these pub-lic data sets in tandem with PICan data such as accessing GeneExpression Omnibus data through GEO2R [75]

One of the challenges when working with a federated data-base model is how to exchange data faithfully and efficientlybetween the various components Each constituent databasewill need an interface to interact seamlessly with the centraldatabase and control system Some of these will be availableoff-the-shelf or rely on published APIs while others such asthose interacting with in-house software may need significantdeveloper time and effort to implement Ultimately there areno shortcuts to establishing these interfaces and so institutionsshould aim to reduce the number of distinct interfaces bystandardizing data around a small set of formats For examplewhen working with whole slide images a single institutioncould strive to use only a single file format The various avail-able formats each have their own strengths and weaknessesbut the natural option will likely be the native format of thescanner used at the institution or a vendor-neutral format likeDICOM [60] or OME-TIFF [61] Fortunately many tools for work-ing with whole slide images support multiple file formats suchas DigitalScope [66] for remote web-based viewing of variousimaging formats Meanwhile projects like Bio-Formats (OpenMicroscopy Environment consortium) [61] and OpenSlide [76]further facilitate the viewing and exploitation of various imag-ing formats in other programming and analysis languages likeC Cthornthorn Python and Java Additionally it is important to stand-ardize the data transfer protocol whether this is throughInternet standards like hypertext transfer protocol physicalmeans like hard disks or some other agreed protocol Perhaps

establishing an institutional integromics strategy will encour-age adoption of standardized data formats by creating an ex-pectation among data generators that data should be portableand compatible with other data types for analysis

Data flow within an integromics environment can often bemultidirectional Even with only a single central data repositorythere will be instances when it is beneficial to export some ofthese data to a separate third-party platform whether for speci-alized analysis report generation or some other purpose AsPICanrsquos web-based user interface is designed to be user friendlyit supports only a limited number of types of analyses and re-port generation options Analyses not hard-coded into the inter-face for example would need to be performed lsquoofflinersquo In factthis is how new analyses are tested before being incorporatedinto PICan Similar to data import and storage data export froman integromics system is complicated by the wide range of datatypes that demand different formats to accommodate their in-herently different natures Efforts have been made in somefields to standardize and integrate data models such as thoseof the Global Alliance for Genomics and Health for genomicdata [20] Hence where widely recognized data exchange for-mats exist such as FASTAFASTQ files for nucleotide or aminoacid sequences [77] or the TMA data exchange standard [78]they should be prioritized for integromics support For otherdata types like digital image analysis however there remains aneed for a well-defined and widely accepted standard for dataexchange Whether it is possible or feasible to define a singleunifying integromic data exchange format that would encapsu-late multiple data types attached to individuals within a cohortof patient subjects remains an open question

Standardization of data vocabulary

Any data exchange standard needs to address both how dataare encoded (structure) as well as a common vocabulary forcommonly encountered fields CSV JSON XML and derivativesof these are popular formats that lend themselves well to struc-tured data Certainly CSV export is our default as although itdoes not fully represent the relational structure underpinningPICanrsquos database it mimics the basic tabular structure and canbe easily viewed by other researchers in common spreadsheetor statistical packages Non-tabular data like digital slide images(which are stored as links in the database) are exported as sep-arate files The greater challenge might be agreeing on a com-mon vocabulary Data field names need to be mapped fromPICanrsquos internal vocabulary to that of the target external plat-form This is not a trivial task as even professionals within thesame institution may use different names for the same metricSome clinicians may be more precise in how certain metrics aremeasured and recorded such as when some distinguish be-tween lymphatic and vascular invasion while others groupthem together as lymphovascular invasion Harmonizing be-tween cancer types can also be challenging hence the use ofcancer-type-specific tables in PICan for certain clinicopathologi-cal metrics Additionally even where a data exchange specifica-tion exists like the one for TMA reporting [78] there may beallowance for custom fields posing an extra challenge for align-ing under a common vocabulary Attempts have been made tostandardize terminology in some disciplines The US NationalLibrary of Medicine maintains a collection of controlled vocabu-laries known as the Unified Medical Language System [79]which incorporates SNOMED CT MeSH and OMIM amongmany other ontologies Another compendium is caCOREincluding the Enterprise Vocabulary Services and Cancer Data

Integromics in cancer tissue biomarker research | 7

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

Standards Repository of common data elements [80 81]Additionally some groups have devised cancer-type-specificlists such as for breast [82] mesothelioma [83] and prostate[84] In the absence of clear community-supported data stand-ards institutions are advised to agree on and use a minimal setof data export formats

Unique challenges of multimodal integrativeanalysis

Collecting and storing diverse data types is only half the battlein integromics unlocking the value of these data will comefrom analysis and interpretation Through a connection withthe R statistical environment via statconn [49] PICan is capableof on-demand statistical analysis graphical output and reportgeneration Besides R and statconn other similar frameworksand interfaces include Shiny [85] RNET [86] Python (eg mat-plotlib [87]) and D3js [88] While PICan is strictly a research toolan integromic approach will undoubtedly be essential to futureclinical decision support systems placing integromics at thefront line of patient care delivery

Multimodal analysis is particularly important when con-sidering the future of digital pathology and tissue imagingConventional light and fluorescence imaging provides the basisfor most studies in this field However the future may providenew imaging technologies using light outside of the normal vis-ible spectrum that may be more informative towards illuminat-ing the underlying cancer biology and delivering sensitive andspecific biomarkers of clinical outcome Numerous studies areemerging that exploit technologies such as ultraviolet [89] in-frared [90] and Raman spectroscopy [91 92] together with abroad range of associated multispectral technologies to gain abetter insight into tissue and cell characterization and to visual-ize compounds and chemistry not accessible to conventionalmicroscopy Integrating these imaging modalities into a singleframework to compare compound and deduce clinical utilitynot in isolation but together with conventional imaging meth-ods is essential Similarly there is capacity to integrate otherforms of medical imaging relevant to cancer discovery includ-ing positron emission tomography magnetic resonance imag-ing computed tomography and X-ray This can be used tounderstand the relationship between in vivo clinical imagingand ex vivo tissue imaging thereby translating markers that arecurrently only visible on microscopy to early in vivo detectionwith clinical imaging modalities In both settings the real valueis likely to come from the overlay and integration of imagingmodalities extending the capacity of tissue imaging on its ownand providing new visualization tools for both pathologists andradiologists that are more informative and consequently im-prove clinical decision making The integromics initiative dis-cussed in this article provides such a model for imaging as wellas molecular and clinical data blurring the edges of medicalspecialties and providing a unified framework for discovery

The basic tools of integromic data analysis are generallysimilar to other disciplines of bioinformatics Mean compari-sons clustering survival analyses machine learning and linearmodels for example can all be used as can the various pipe-lines that have been developed for next-generation sequencinggene expression microarrays and other molecular big dataomics techniques Integration of digital image analysis data inintegromic analysis can present its own distinct risks and re-wards Images and their associated markups and annotationsdo not lend themselves on their own to t-tests or clustering for

example Nevertheless these may be useful to contextualizeother pieces of data like IHC scores Calculated scores (egH-score) and metrics (eg integrated optical density nucleararea) can often be expressed as nameattribute pairs so theylend themselves more immediately to integrative analysis al-though data structure and hierarchy can still complicate ana-lysis Some metrics can be measured at a cellular level andsummarized at an ROI or whole slide level Others howevermight not be meaningful at a cellular level (eg H-score) so onemust be mindful of the meaning of each measure when incor-porating it into a multivariate model of the data Tumour het-erogeneity is also an important complication Summary scorescomputed across different regions of the same tumour may dif-fer markedly Heterogeneity at least in solid tumours can bemore easily mapped in tissue sections where tissue structureand cellular context is maintained and intact Spatial changesin tumour cell arrangement can be visualized microscopicallyand used to sample regions for molecular analysis and themeasurement of intratumoural heterogeneity Whole slideimaging and digital image analysis represent powerful tools toquantitatively map tissue phenotype and provide a sound basisfor biomarker assay development while opening up new av-enues of inquiry For example one can begin to explore whethercertain morphometric phenotypes correlate with specific gen-etic or epigenetic profiles and whether these have any signifi-cant prognostic or therapeutic impact

A specific challenge for integromics is the application ofstatistical techniques to data from disparate sources Care mustbe taken to ensure differences observed between experimentalgroups reflect the underlying biology and not simply batch ef-fects Differences in experimental assay platforms such as dif-ferent probes targeting the same gene or different antibodies forIHC may produce discrepancies in the data based on differentlevels of sensitivity and specificity In digital pathology differ-ences in imaging hardware analysis algorithms and parameterdefinitions may lead to various imaging artefacts and interpret-ation difficulties Even within the same software cell countsdetermined from different threshold settings are not compar-able Less visible confounders include study and patient vari-ables like recruitment methodology study admission criteriapatient stratificationdiagnostic criteria pre-collection treat-ments and standard of care (for prognostic biomarker studies)that can affect the makeup of the study patient populationMeanwhile specimen handling variables like cold ischemictime fixation method staining protocols and formulations canhave a profound effect on the data often dominating other con-founders regardless of the subsequent pipeline Ideally thendata harmonization begins well before data are generated withstandardization and thorough documentation of all pre-analytical steps including patient selection tissue collectionand specimen handling Where potential batch effects or con-founding variables have been identified a number of strategiescan be used such as subset analysis (ie stratifying the subjectpopulation on the confounding variable) or regression model-ling [93 94]

Mandated public release of large data sets has been helpful inadvancing transparency in science but tracking the associatedmetadata and assessing its impact on results derived from thesebig data sets remains a major unresolved challenge Even largerepositories like TCGA are composed of smaller collections froma wide range of geographies and institutional environments Inmany cases pooling such diverse data sets becomes an apples-to-oranges comparison Batch effects in TCGA and other largehigh-throughput data sets are well-documented [52 53] Without

8 | Li et al

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

control over data origins and consistency it is essential that theentire process be rigorously validated before interpretation of re-sults What local data sets lack in scale they make up for in trace-ability For PICan we require that our collaborators are able tostand over every data point that enters our system therebyensuring a higher standard of quality and confidence in our dataMetadata recording technical and experimental details are storedin PICan alongside the data Regardless of how it is done meta-data tracking should be rigorous and thorough as differences inexperimental conditions and design must factor into the inter-pretation of any analysis Where appropriate data should be cor-rected and normalized to ensure comparisons are meaningfullike-versus-like comparisons and observed contrasts are notmerely batch or scalar effects

Interdisciplinary teamwork

Apart from robust infrastructure and software support the heartof any successful integromics framework is a cohesive and inter-disciplinary team to support and champion the entire effortDespite the technical complexities of a system like PICan team-work is arguably the most important and yet most often over-looked piece of the integromics puzzle This is fundamentally acollaborative endeavour bringing with it all the challenges ofworking with an interdisciplinary team of individuals and organ-izations each of whom have their own diverse set of goals andpriorities Clinical team members for example may have respon-sibilities to their patients that would not apply to academic re-searchers Nevertheless the specialist insights afforded by aninterdisciplinary team have been indispensable to the success ofPICan Clinicopathological data tables are built on templates cre-ated by clinical experts specializing in each cancer type Clinicaland translational researchers inform the choice of analysis toolsto be added which are then developed and tested by transla-tional bioinformaticians Establishing and incorporating an inte-gromics framework into the mindset of an institution will requireadditional effort and workload on certain individuals Hence it isimperative that team discussions include delegation and sharedresponsibility over matters such as data collection and curationdata formats and standards system maintenance data exchangeand access control Differing priorities and opinions betweenteam members are to some extent unavoidable but teams canadopt measures to manage conflicts such as having all protocolsclearly documented or maintaining a regular meeting scheduleRegardless of how these challenges are met we stress the im-portance of maintaining open lines of communication and agree-ing to clear roles and responsibilities of all team membersEffective teamwork is a non-trivial yet integral contributor to thesuccess of any integromics framework

With an ever-increasing emphasis on interdisciplinaryresearch projects like PICan underscore the need for cross-disciplinary education to instil young researchers with thenecessary mindset understanding and skills to navigate thechallenges and opportunities in the research environment ofthe future Queenrsquos University Belfast for example offers aMasters-level course in Bioinformatics and ComputationalGenomics [95] The course attracts students from a diverserange of backgrounds from medical doctors to engineers andcomputer scientists to biologists from a range of life sciencesubject areas and aims to equip them with a solid foundationacross a broad range of topics such as advanced bioinformaticsdatabase design tissue imaging and integromics

Interdisciplinary teamwork is vital to another pillar of anyintegromics framework access to quality data Any analysis is

only as robust as the data that underpins it The role of the NIBis crucial to PICanrsquos success not only through access to its bio-specimens but also its expertise and infrastructure support forsample preparation processing accessioning tracking andquality control The active participation of an interdisciplinaryteam also distributes the workload of data curation One of thegreat struggles of data collection is maximizing quality andquantity with a limited resource of labour With an increasingfocus among funders and institutions on large-scale multi-centre studies scaling up an integromics framework while pre-serving data quality will rely heavily on greater delegation ofdata curation to the data collectors themselves The onus willbe on the individual data collectors to flag data discrepanciesso that the database administrator serves as the last line of de-fence rather than the first

Ethics and data governance

Like any other system handling clinical and pathological infor-mation an integromics framework also requires appropriateapproval for the use of patient data Additionally data controlsin integromics need to protect both patients and data collectorsde-identification protects patient privacy while access controlsand terms of use policies protect the rights of data collectors

All studies undertaken in PICan need to be approved follow-ing an application to NIB where project-specific ethics and gov-ernance are granted under the Office for Research EthicsCommittees Northern Ireland approved NIB guidelines As a re-search tool PICan stores no personally identifiable informationin its database ensuring patient anonymization and minimiz-ing the potential impact of any data security breach Each pa-tient sample and associated analytical data are labelled using aunique independent study-specific identifier within the systemClinically qualified and authorized patient data collectors retainthe key that links these identifiers to hospital systems wherepersonal data are securely stored Researchers using PICan arerestricted to using anonymized research data onlymdashrespectingdata protection laws and regulations However as importantfollow-up data on patients accrues over time these can still beaccessed by having appropriately authorized clinical staff anddata collectors use the key to identify patients on hospital sys-tems and retrieve more up-to-date patient informationAuthorized clinical personnel can then review and retrieve per-tinent research-relevant information from the hospital systemsand update PICan again with only anonymized research datausing the PICan identifiers This is equivalent to an honest bro-ker system used elsewhere [96 97]

Data curation of de-identified information in PICan producesa robust privacy-controlled snapshot of analytical data but theunderlying data sources are themselves continually beingupdated as patients are followed up and new specimens are col-lected While data update is possible it is necessarily a manualprocess as PICan has been walled off from the original datasources by design allowing it to operate with fewer privacy con-trols than a typical clinical system as no identifiable patientdata are stored To maintain quality and consistency with exist-ing data data updates to PICan are done only periodically withentire cohorts of patients updated at once Alternatively insti-tutions that prioritize constantly updated data will requiredeeper integration between their research and clinical data sys-tems In such cases additional data protection measures will berequired such as encryption or safe havens for data storage andtransfer Essentially such an integrated system would need tomeet local rules and regulations for clinical data systems

Integromics in cancer tissue biomarker research | 9

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

Scaling up for large-scale epidemiologicalstudies

The last few decades have seen tremendous investment into allaspects of cancer biomarker discovery research but as of 2013only 18 protein cancer biomarkers had received US Food andDrug Administration (FDA) approval [98] despite decades of re-search and thousands of publications on novel candidate bio-markers Part of the reason may be a focus on early-stagediscovery activities and the relative lack of large-scale multi-centre validation and clinical trials Studies of such scale pre-sent a number of data management challenges not unlikethose we designed PICan to address First and foremost integro-mics is inherently a team activity and the same principles ofteamwork apply whether it is conducted within a single centreor across multiple research centres on different continents Inthis way integromics research extends readily to large-scalemulticentre epidemiological studies Data harmonization of fileformats and common vocabulary remain vitally important andbest addressed through open communication between collabor-ators As in the case of collaborators within an institution teammembers from different institutions should ideally agree on asingle data format for each data type and a common agreed setof vocabulary to describe data fields collected depending on theneeds of the study and its participants Analysis of multicentredata also requires standardization of pre-analytical parameterssuch as inclusion criteria sample acquisition sample process-ing assay conditions and measurements recorded When estab-lishing project standards and protocols teams should considerwhat community standards are already available to maximizeinteroperability and avoid redundancies Across Europe for ex-ample projects like ELIXIR [99] and Euro-BioImaging [100] are al-ready seeking to establish hubs for biomedical and bioimagingresource sharing Globally similar efforts in genomics are alsounderway [19ndash21]

The process of building digital infrastructure to support amulticentre integromics framework is similar to the case of asingle institution All team members should meet and agree onthe structure of the framework Depending on the extent of theproject and range of requirements the technical scientific andmedical team will need to grow as well Among the team thereis a need for at least one full-time staff member dedicated totechnical development and management of the digital infra-structure As the project grows the staffing requirement willundoubtedly increase in tandem For this reason modular de-sign becomes even more crucial as it allows development tasksto be carved up and delegated out to different team membersThere will also be additional requirements for hardware andsoftware licences Software costs can be mitigated by using free(eg R) or low-cost volume licenced (eg Microsoft Visual Studio(Redmond WA USA) for ASPNET in an academic setting) op-tions Nevertheless despite minor challenges digital solutionsgenerally scale well both in size and geographic distributionParallels can be drawn between incorporating external data re-sources into an integromics environment as previously dis-cussed and the challenges of connecting information systemsacross multiple sites However multicentre teams also facesome specific challenges Teams may need to comply with vari-ous sets of local laws and regulations surrounding researchethics data security and patient privacy Network access andsecurity may also be further complicated in a multicentre set-ting PICan for example was built for use within a single insti-tution so access was limited to within the institutionrsquos networkfirewall with all external access blocked A system spanning

multiple institutions would need greater attention to securitywhile maintaining access from outside the host institutionTeams may also need to contend with different software andhardware infrastructures at the various centres and implementthe necessary interfaces to ensure compatibility However theultimate key to success of multicentre collaborations is a funda-mental commitment to team science and on that front institu-tions that embrace integromics will have the experience to leadthe coming age of large-scale multicentre studies

Furthermore platforms like PICan and the integration ofimage analysis data with clinicopathological and therapeuticdata will help accelerate the screening of candidate tissue bio-markers for prognostic and predictive capacity While this rep-resents a powerful research and discovery tool this is likely toidentify new tissue biomarkers (based on IHC chromogenic insitu hybridization (CISH) fluorescence in situ hybridization(FISH) RNAscope etc) that have clear value in routine diagnos-tics and the stratification of patients for precision therapeuticsFurther validation of these biomarkers would be necessary incarefully controlled and standardized biomarker trials takinginto consideration sample preparation scanner configurationlab-to-lab variation comparative performance against patholo-gist scoring and clinical outcome data Regulatory approval ofdigital diagnostic tests would require FDA clearance (USA) orCE-IVD marking (Europe) if they were to be marketed as clinicaltests by industry but may also be implemented as laboratory-developed tests with documented internal validation and clin-ical evidence if done within an academic medical centre suchas our own setting Integromics platforms provide the frame-work for discovery and if designed appropriately can provide ro-bust clinical evidence of the utility of new biomarkers forsubsequent commercial licensing regulatory approval and clin-ical translation into routine diagnostictherapeutic practice

Conclusion

Modern biomedical research generates scientific and clinicaldata faster than the ability to fully interpret it while data fromnew and emerging technologies need to be integrated andharmonized with existing analysis workflows Efforts to resolvethe resulting bottleneck in interpretation to deliver on thepromises of precision medicine in this era of fast-track bio-marker discovery and evaluation have led to the birth of a newfield in bioinformatics integromics After introducing our ownframework for streamlining integromic analyses it becameclear that there is widespread interest in implementing integro-mic strategies in other research institutions In light of the pre-sent discussion we feel that an optimal integromics frameworkmust be well-supported by a dynamic and interdisciplinaryteam reflect the local needs and priorities of the institution andbe designed with an eye to the future The team is the gatewayto high-quality data and specialist insights that inform analysisand ultimately catalyse translation to clinical application insupport of precision medicine Meanwhile biomarker researchat our institution is currently centred on TMAs and we haveemphasized data quality and curation over raw throughputThese traits are imprinted in the design and operation of PICanbut other institutions may have other needs that demand a dif-ferent infrastructure Ongoing development of PICan includesincorporating raw data storage and access for completenessand traceability and strengthening support for digital imageanalysis As our institution adopts newer technologies PICanlikewise will continue to evolve facilitated by its modular de-sign to meet the changing needs of a busy cancer research

10 | Li et al

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

environment spanning numerous programmes technologiesand scientific objectives Hence integromics should be interwo-ven into the fabric of the institution and not simply be a down-loadable add-on

Key Points

bull Bold approaches are needed to maximize the potentialof big data in modern biomedical research and tobreak down the silos in which they are currently held

bull Collating storing transferring and integratively analy-sing vast amounts of heterogeneous multimodal datarequires a comprehensive and cohesive integromicsstrategy that streamlines biomarker discovery byexposing fresh insights into the underlying factorscontributing to disease phenotypes

bull We explore the practical considerations behind imple-menting an integromics framework using our experi-ences and design choices with PICan as a case studywhile recognizing that the global diversity of bio-marker discovery research programmes will requiretailoring solutions to each research environment

bull Successful integromics frameworks are supported bydynamic interdisciplinary teams and are inherentlyadaptable to new technologies and the host institu-tionrsquos evolving priorities

bull Integromics must be a central part of the institutionalculture and not just a downloadable add-on

Conflict of interest

Professor Peter Hamilton is the founder of and non-executivedirector with PathXL Ltd Professor Manuel Salto-Tellez isSenior Advisor to PathXL Ltd

Funding

The research leading to these results has received funding fromthe People Programme (Marie Curie Actions) of the EuropeanUnionrsquos Seventh Framework Programme FP72007ndash2013underREA grant agreement (285910 to PWH) Invest NorthernIreland (RDO0712612 to PWH) Cancer Research UK Accelerator(C11512A20256 to PWHMS-T) CRUK (to PB) The NorthernIreland Molecular Pathology Laboratory is supported by CancerResearch UK Experimental Cancer Medicine Centre Networkthe NI Health and Social Care Research and DevelopmentDivision the Sean Crummey Memorial Fund the Tom SimmsMemorial Fund and the Friends of the Cancer Centre (to MS-T) The Northern Ireland Biobank is funded by the Health andSocial Care Research and Development Division of the PublicHealth Agency in Northern Ireland and Cancer Research UKthrough the Belfast CRUK Centre and Northern IrelandExperimental Cancer Medicine Centre additional support wasreceived from the Friends of the Cancer Centre (to JAJ)

References1Canuel V Rance B Avillach P et al Translational research plat-

forms integrating clinical and omics data a review of publiclyavailable solutions Brief Bioinform 201516280ndash90

2 Noor AM Holmberg L Gillett C et al Big Data the challengefor small research groups in the era of cancer genomics Br JCancer 20151131405ndash12

3 PathXL Xplore httpwwwpathxlcompathxl-researchpathxl-xplore (20 November 2015 date last accessed)

4 Biomarker Interpretation Workflow - OncoMark httpwwwoncomarkcomgoproductsbiomarker-interpretation-workflow (23 November 2015 date last accessed)

5 Denaxas SC George J Herrett E et al Data resource profilecardiovascular disease research using linked bespoke studiesand electronic health records (CALIBER) Int J Epidemiol2012411625ndash38

6 Lubbock ALR Katz E Harrison DJ et al TMA Navigatornetwork inference patient stratification and survival ana-lysis with tissue microarray data Nucleic Acids Res 201341W562ndash8

7 Giardine B Riemer C Hardison RC et al Galaxy a platform forinteractive large-scale genome analysis Genome Res 2005151451ndash5

8 Goecks J Nekrutenko A Taylor J Galaxy a comprehensiveapproach for supporting accessible reproducible and trans-parent computational research in the life sciences GenomeBiol 201011R86

9 Madhavan S Gusev Y Harris M et al G-DOC a systems medi-cine platform for personalized oncology Neoplasia 201113771ndash83

10Dander A Baldauf M Sperk M et al Personalized OncologySuite integrating next-generation sequencing data andwhole-slide bioimages BMC Bioinformatics 201415306

11Kartashov AV Barski A BioWardrobe an integrated platformfor analysis of epigenomics and transcriptomics data GenomeBiol 201516158

12Harris PA Taylor R Thielke R et al Research electronic datacapture (REDCap)ndasha metadata-driven methodology andworkflow process for providing translational research in-formatics support J Biomed Inform 200942377ndash81

13Szalma S Koka V Khasanova T et al Effective knowledgemanagement in translational medicine J Transl Med201081ndash9

14Manley S Mucci NR De Marzo AM et al Relational databasestructure to manage high-density tissue microarray data andimages for pathology studies focusing on clinical outcomethe prostate specialized program of research excellencemodel Am J Pathol 2001159837ndash43

15Weinstein JN Integromic analysis of the NCI-60 cancer celllines Breast Dis 20041911ndash22

16Weinstein JN Pommier Y Transcriptomic analysis of theNCI-60 cancer cell lines C R Biol 2003326909ndash20

17Yu P Artz D Warner J Electronic health records (EHRs) sup-porting ASCOrsquos vision of cancer care Am Soc Clin Oncol EducBook 2014225ndash31

18McArt DG Blayney JK Boyle DP et al PICan an integromicsframework for dynamic cancer biomarker discovery MolOncol 201591234ndash40

19Hudson TJ Anderson W The International Cancer GenomeConsortium et al International network of cancer genomeprojects Nature 2010464993ndash8

20Home j Global Alliance for Genomics and Health httpsgenomicsandhealthorg (15 March 2016 date last accessed)

21The 1000 Genomes Project Consortium A global reference forhuman genetic variation Nature 201552668ndash74

22Collins FS Varmus H A New Initiative on Precision MedicineN Engl J Med 2015372793ndash5

23Kolesnikov N Hastings E Keays M et al ArrayExpress up-datendashsimplifying data submissions Nucleic Acids Res 201543D1113ndash6

Integromics in cancer tissue biomarker research | 11

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

24Leinonen R Akhtar R Birney E et al The European NucleotideArchive Nucleic Acids Res 201139D28ndash31

25The Cancer Genome Atlas Research NetworkComprehensive genomic characterization defines humanglioblastoma genes and core pathways Nature 20084551061ndash8

26Home - The Cancer Genome Atlas - Cancer Genome -TCGAhttpcancergenomenihgov (1 December 2015 datelast accessed)

27Kristensen VN Lingjaeligrde OC Russnes HG et al Principlesand methods of integrative genomic analyses in cancer NatRev Cancer 201414299ndash313

28Guinney J Dienstmann R Wang X et al The consensusmolecular subtypes of colorectal cancer Nat Med 2015211350ndash6

29Gao J Aksoy BA Dogrusoz U et al Integrative analysis ofcomplex cancer genomics and clinical profiles using thecBioPortal Sci Signal 20136pl1

30Cerami E Gao J Dogrusoz U et al The cBio cancer genomicsportal an open platform for exploring multidimensional can-cer genomics data Cancer Discov 20122401ndash4

31Forbes SA Beare D Gunasekaran P et al COSMIC exploringthe worldrsquos knowledge of somatic mutations in human can-cer Nucleic Acids Res 201543D805ndash11

32Broad GDAC Firehose httpgdacbroadinstituteorg (22March 2016 date last accessed)

33Le Cao K-A Gonzalez I Dejean S integrOmics an R packageto unravel relationships between two omics datasetsBioinformatics 2009252855ndash6

34Waller T Gubała T Sarapata K et al DNA microarray integro-mics analysis platform BioData Min 2015818

35Cecic IK Li G MacAulay C Technologies supporting analyt-ical cytology clinical research and drug discovery applica-tions J Biophotonics 20125313ndash26

36Rocha R Vassallo J Soares F et al Digital slides present sta-tus of a tool for consultation teaching and quality control inpathology Pathol Res Pract 2009205735ndash41

37Pantanowitz L Valenstein P Evans A et al Review of the cur-rent state of whole slide imaging in pathology J Pathol Inform2011236

38Brachtel E Yagi Y Digital imaging in pathologyndashcurrent ap-plications and challenges J Biophotonics 20125327ndash35

39Hamilton PW Bankhead P Wang Y et al Digital pathologyand image analysis in tissue biomarker research Methods20147059ndash73

40Schneider CA Rasband WS Eliceiri KW NIH Image to ImageJ25 years of image analysis Nat Methods 20129671ndash5

41Schindelin J Arganda-Carreras I Frise E et al Fiji - an OpenSource platform for biological image analysis Nat Methods20129676ndash82 101038nmeth2019

42de Chaumont F Dallongeville S Chenouard N et al Icy anopen bioimage informatics platform for extended reprodu-cible research Nat Methods 20129690ndash6

43Sommer C Straehle C Kothe U et al Ilastik interactive learn-ing and segmentation toolkit In Biomedical Imaging fromNano to Macro 2011 IEEE International Symposium on 2011pp 230ndash233 IEEE

44Hamilton PW van Diest PJ Williams R et al Do we see whatwe think we see The complexities of morphological assess-ment J Pathol 2009218285ndash91

45Salto-Tellez M James JA Hamilton PW Molecular path-ologymdashthe value of an integrative approach Mol Oncol201481163ndash8

46Li G van Niekerk D Miller D et al Molecular fixative enablesexpression microarray analysis of microdissected clinicalcervical specimens Exp Mol Pathol 201496168ndash77

47Galon J Mlecnik B Bindea G et al Towards the introductionof the lsquoImmunoscorersquo in the classification of malignant tu-mours J Pathol 2014232199ndash209

48R Core Team R A Language and Environment for StatisticalComputing Vienna Austria R Foundation for StatisticalComputing 2015

49Baier T Neuwirth E Excel COM R Comput Stat 20072291ndash10850McCarty KS Miller LS Cox EB et al Estrogen receptor ana-

lyses Correlation of biochemical and immunohistochemicalmethods using monoclonal antireceptor antibodies ArchPathol Lab Med 1985109716ndash21

51Goulding H Pinder S Cannon P et al A new immunohisto-chemical antibody for the assessment of estrogen receptorstatus on routine formalin-fixed tissue samples Hum Pathol199526291ndash4

52Leek JT Scharpf RB Bravo HC et al Tackling the widespreadand critical impact of batch effects in high-throughput dataNat Rev Genet 201011733ndash9

53Lauss M Visne I Kriegner A et al Monitoring of TechnicalVariation in Quantitative High-Throughput Datasets CancerInform 201312193ndash201

54Codd EF Normalized Data Base Structure a Brief Tutorial InProceedings of the 1971 ACM SIGFIDET (Now SIGMOD)Workshop on Data Description Access and Control SanDiego CA USA 1971 pp 1ndash17 Association for ComputingMachinery New York NY

55Kacprzyk J Zadro _zny S De Tre G Fuzziness in database man-agement systems half a century of developments and futureprospects Fuzzy Sets Syst 2015281300ndash7

56Bosc P Pivert O Farquhar K Integrating fuzzy queries intoan existing database management system an example Int JIntell Syst 19949475ndash92

57Duggal R Khatri SK Shukla B Improving patient matchingsingle patient view for Clinical Decision Support using BigData analytics In 2015 4th International Conference onReliability Infocom Technologies and Optimization (ICRITO)(Trends and Future Directions) Noida India 2015 pp 1ndash6 IEEENew York NY

58OpenSeadragon httpsopenseadragongithubio (7 April2016 date last accessed)

59Allred DC Bustamante MA Daniel CO et alImmunocytochemical analysis of estrogen receptors inhuman breast carcinomas evaluation of 130 cases and re-view of the literature regarding concordance with biochem-ical assay and clinical relevance Arch Surg 1990125107ndash13

60Singh R Chubb L Pantanowitz L et al Standardization in digi-tal pathology supplement 145 of the DICOM standardsJ Pathol Inform 2011223

61Linkert M Rueden CT Allan C et al Metadata matters accessto image data in the real world J Cell Biol 2010189777ndash82

62The HDF Group HDF Group - HDF5 httpswwwhdfgrouporgHDF5 (3 January 2016 date last accessed)

63Sommer C Held M Fischer B et al CellH5 a format for dataexchange in high-content screening Bioinformatics 2013291580ndash2

64Allan C Burel J-M Moore J et al OMERO flexible model-driven data management for experimental biology NatMethods 20129245ndash53

65Maree R Rollus L Stevens B et al Collaborative analysis ofmulti-gigapixel imaging data using Cytomine Bioinformatics2016321395ndash1401

12 | Li et al

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

66DigitalScope V5 httpwwwdigitalscopeorg (26 November2015 date last accessed)

67Astier Y Braha O Bayley H Toward single molecule DNAsequencing direct identification of ribonucleoside and deox-yribonucleoside 5lsquo-monophosphates by using an engineeredprotein nanopore equipped with a molecular adapter J AmChem Soc 20061281705ndash10

68Stoddart D Heron AJ Mikhailova E et al Single-nucleotidediscrimination in immobilized DNA oligonucleotides with abiological nanopore Proc Natl Acad Sci USA 20091067702ndash7

69Flusberg BA Webster D Lee J et al Direct detection of DNAmethylation during single-molecule real-time sequencingNat Methods 20107461ndash5

70Roberts RJ Carneiro MO Schatz MC The advantages of SMRTsequencing Genome Biol 201314405

71Sheth AP Larson JA Federated database systems for manag-ing distributed heterogeneous and autonomous databasesACM Comput Surv 199022183ndash236

72Baker EJ Biological databases for behavioral neurobiology IntRev Neurobiol 201210319ndash38

73Peuroaeuroakkonen P Pakkala D Reference architecture and classifi-cation of technologies products and services for big data sys-tems Big Data Res 20152166ndash86

74Haider S Ballester B Smedley D et al BioMart Central Portalndashunified access to biological data Nucleic Acids Res 200937W23ndash7

75Barrett T Wilhite SE Ledoux P et al NCBI GEO archive forfunctional genomics data setsndashupdate Nucleic Acids Res201341D991ndash5

76Goode A Gilbert B Harkes J et al OpenSlide a vendor-neutralsoftware foundation for digital pathology J Pathol Inform2013427

77Pearson WR Lipman DJ Improved tools for biological se-quence comparison Proc Natl Acad Sci USA 1988852444ndash8

78Berman JJ Edgerton ME Friedman BA The tissue microarraydata exchange specification a community-based opensource tool for sharing tissue microarray data BMC MedInform Decis Mak 200335

79Unified Medical Language System (UMLS) - Home httpswwwnlmnihgovresearchumls (20 November 2015 datelast accessed)

80Komatsoulis GA Warzel DB Hartel FW et al caCORE version3 implementation of a model driven service-oriented archi-tecture for semantic interoperability J Biomed Inform 200841106ndash23

81Covitz PA Hartel F Schaefer C et al caCORE a common infra-structure for cancer informatics Bioinformatics 200319 2404ndash12

82Papatheodorou I Crichton C Morris L et al A metadata ap-proach for clinical data management in translational gen-omics studies in breast cancer BMC Med Genomics 2009266

83Mohanty SK Mistry AT Amin W et al The development anddeployment of Common Data Elements for tissue banks fortranslational research in cancer - an emerging standardbased approach for the Mesothelioma Virtual Tissue BankBMC Cancer 2008891

84 Patel AA Kajdacsy-Balla A Berman JJ et al The developmentof common data elements for a multi-institute prostate can-cer tissue bank the Cooperative Prostate Cancer TissueResource (CPCTR) experience BMC Cancer 20055108

85 Shiny httpshinyrstudiocom (6 April 2016 date lastaccessed)

86 RNET documentation ndash user version j RNET ndash user versionhttpjmp75githubiordotnet (6 April 2016 date lastaccessed)

87 matplotlib python plotting mdash Matplotlib 151 documenta-tion httpmatplotliborg (6 April 2016 date last accessed)

88 D3js - Data-Driven Documents httpsd3jsorg (6 April2016 date last accessed)

89 Fereidouni F Datta-Mitra A Demos S et al Microscopy withUV Surface Excitation (MUSE) for slide-free histology andpathology imaging In Progress in Biomedical Optics andImaging - Proceedings of SPIE San Francisco CA USA 2015p 93180F SPIE Bellingham WA USA

90 Bird B Miljkovic M Romeo MJ et al Infrared micro-spectralimaging distinction of tissue types in axillary lymph nodehistology BMC Clin Pathol 200888

91 Taleb A Diamond J McGarvey JJ et al Raman Microscopy forthe Chemometric Analysis of Tumor Cells J Phys Chem B200611019625ndash31

92 Guze K Pawluk HC Short M et al Pilot study Raman spec-troscopy in differentiating premalignant and malignant orallesions from normal mucosa and benign lesions in humansHead Neck 201537511ndash7

93 McNamee R Regression modelling and other methods tocontrol confounding Occup Environ Med 200562500ndash6

94 dos Santos Silva I Dealing with confounding in the analysisIn Cancer Epidemiology Principles and Methods LyonFrance International Agency for Research on Cancer 1999305ndash31

95 Queenrsquos University Belfast j Postgraduate and ProfessionalDevelopment j MSc Bioinformatics and ComputationalGenomics httpwwwqubacukschoolsmdbspgdPTMScBioinformaticsandComputationalGenomics (24 March2016 date last accessed)

96 Dhir R Patel AA Winters S et al A multidisciplinary ap-proach to honest broker services for tissue banks and clin-ical data a pragmatic and practical model Cancer 20081131705ndash15

97 Boyd AD Hosner C Hunscher DA et al An lsquoHonest Brokerrsquomechanism to maintain privacy for patient care and aca-demic medical research Int J Med Inform 200776407ndash11

98 Pavlou MP Diamandis EP Blasutig IM The long journey ofcancer biomarkers from the bench to the clinic Clin Chem201359147ndash57

99 ELIXIR Data for life httpswwwelixir-europeorg (24March 2016 date last accessed)

100 Research Infrastructure for Imaging Technologies inBiological and Biomedical Sciences j Euro-BioImaginghttpwwweurobioimagingeu (24 March 2016 date lastaccessed)

Integromics in cancer tissue biomarker research | 13

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

Page 9: Embracing an integromic approach to tissue biomarker ... › download › pdf › 33595853.pdf · Her research uses integrative molecular approaches for biomarker discovery and validation

Standards Repository of common data elements [80 81]Additionally some groups have devised cancer-type-specificlists such as for breast [82] mesothelioma [83] and prostate[84] In the absence of clear community-supported data stand-ards institutions are advised to agree on and use a minimal setof data export formats

Unique challenges of multimodal integrativeanalysis

Collecting and storing diverse data types is only half the battlein integromics unlocking the value of these data will comefrom analysis and interpretation Through a connection withthe R statistical environment via statconn [49] PICan is capableof on-demand statistical analysis graphical output and reportgeneration Besides R and statconn other similar frameworksand interfaces include Shiny [85] RNET [86] Python (eg mat-plotlib [87]) and D3js [88] While PICan is strictly a research toolan integromic approach will undoubtedly be essential to futureclinical decision support systems placing integromics at thefront line of patient care delivery

Multimodal analysis is particularly important when con-sidering the future of digital pathology and tissue imagingConventional light and fluorescence imaging provides the basisfor most studies in this field However the future may providenew imaging technologies using light outside of the normal vis-ible spectrum that may be more informative towards illuminat-ing the underlying cancer biology and delivering sensitive andspecific biomarkers of clinical outcome Numerous studies areemerging that exploit technologies such as ultraviolet [89] in-frared [90] and Raman spectroscopy [91 92] together with abroad range of associated multispectral technologies to gain abetter insight into tissue and cell characterization and to visual-ize compounds and chemistry not accessible to conventionalmicroscopy Integrating these imaging modalities into a singleframework to compare compound and deduce clinical utilitynot in isolation but together with conventional imaging meth-ods is essential Similarly there is capacity to integrate otherforms of medical imaging relevant to cancer discovery includ-ing positron emission tomography magnetic resonance imag-ing computed tomography and X-ray This can be used tounderstand the relationship between in vivo clinical imagingand ex vivo tissue imaging thereby translating markers that arecurrently only visible on microscopy to early in vivo detectionwith clinical imaging modalities In both settings the real valueis likely to come from the overlay and integration of imagingmodalities extending the capacity of tissue imaging on its ownand providing new visualization tools for both pathologists andradiologists that are more informative and consequently im-prove clinical decision making The integromics initiative dis-cussed in this article provides such a model for imaging as wellas molecular and clinical data blurring the edges of medicalspecialties and providing a unified framework for discovery

The basic tools of integromic data analysis are generallysimilar to other disciplines of bioinformatics Mean compari-sons clustering survival analyses machine learning and linearmodels for example can all be used as can the various pipe-lines that have been developed for next-generation sequencinggene expression microarrays and other molecular big dataomics techniques Integration of digital image analysis data inintegromic analysis can present its own distinct risks and re-wards Images and their associated markups and annotationsdo not lend themselves on their own to t-tests or clustering for

example Nevertheless these may be useful to contextualizeother pieces of data like IHC scores Calculated scores (egH-score) and metrics (eg integrated optical density nucleararea) can often be expressed as nameattribute pairs so theylend themselves more immediately to integrative analysis al-though data structure and hierarchy can still complicate ana-lysis Some metrics can be measured at a cellular level andsummarized at an ROI or whole slide level Others howevermight not be meaningful at a cellular level (eg H-score) so onemust be mindful of the meaning of each measure when incor-porating it into a multivariate model of the data Tumour het-erogeneity is also an important complication Summary scorescomputed across different regions of the same tumour may dif-fer markedly Heterogeneity at least in solid tumours can bemore easily mapped in tissue sections where tissue structureand cellular context is maintained and intact Spatial changesin tumour cell arrangement can be visualized microscopicallyand used to sample regions for molecular analysis and themeasurement of intratumoural heterogeneity Whole slideimaging and digital image analysis represent powerful tools toquantitatively map tissue phenotype and provide a sound basisfor biomarker assay development while opening up new av-enues of inquiry For example one can begin to explore whethercertain morphometric phenotypes correlate with specific gen-etic or epigenetic profiles and whether these have any signifi-cant prognostic or therapeutic impact

A specific challenge for integromics is the application ofstatistical techniques to data from disparate sources Care mustbe taken to ensure differences observed between experimentalgroups reflect the underlying biology and not simply batch ef-fects Differences in experimental assay platforms such as dif-ferent probes targeting the same gene or different antibodies forIHC may produce discrepancies in the data based on differentlevels of sensitivity and specificity In digital pathology differ-ences in imaging hardware analysis algorithms and parameterdefinitions may lead to various imaging artefacts and interpret-ation difficulties Even within the same software cell countsdetermined from different threshold settings are not compar-able Less visible confounders include study and patient vari-ables like recruitment methodology study admission criteriapatient stratificationdiagnostic criteria pre-collection treat-ments and standard of care (for prognostic biomarker studies)that can affect the makeup of the study patient populationMeanwhile specimen handling variables like cold ischemictime fixation method staining protocols and formulations canhave a profound effect on the data often dominating other con-founders regardless of the subsequent pipeline Ideally thendata harmonization begins well before data are generated withstandardization and thorough documentation of all pre-analytical steps including patient selection tissue collectionand specimen handling Where potential batch effects or con-founding variables have been identified a number of strategiescan be used such as subset analysis (ie stratifying the subjectpopulation on the confounding variable) or regression model-ling [93 94]

Mandated public release of large data sets has been helpful inadvancing transparency in science but tracking the associatedmetadata and assessing its impact on results derived from thesebig data sets remains a major unresolved challenge Even largerepositories like TCGA are composed of smaller collections froma wide range of geographies and institutional environments Inmany cases pooling such diverse data sets becomes an apples-to-oranges comparison Batch effects in TCGA and other largehigh-throughput data sets are well-documented [52 53] Without

8 | Li et al

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

control over data origins and consistency it is essential that theentire process be rigorously validated before interpretation of re-sults What local data sets lack in scale they make up for in trace-ability For PICan we require that our collaborators are able tostand over every data point that enters our system therebyensuring a higher standard of quality and confidence in our dataMetadata recording technical and experimental details are storedin PICan alongside the data Regardless of how it is done meta-data tracking should be rigorous and thorough as differences inexperimental conditions and design must factor into the inter-pretation of any analysis Where appropriate data should be cor-rected and normalized to ensure comparisons are meaningfullike-versus-like comparisons and observed contrasts are notmerely batch or scalar effects

Interdisciplinary teamwork

Apart from robust infrastructure and software support the heartof any successful integromics framework is a cohesive and inter-disciplinary team to support and champion the entire effortDespite the technical complexities of a system like PICan team-work is arguably the most important and yet most often over-looked piece of the integromics puzzle This is fundamentally acollaborative endeavour bringing with it all the challenges ofworking with an interdisciplinary team of individuals and organ-izations each of whom have their own diverse set of goals andpriorities Clinical team members for example may have respon-sibilities to their patients that would not apply to academic re-searchers Nevertheless the specialist insights afforded by aninterdisciplinary team have been indispensable to the success ofPICan Clinicopathological data tables are built on templates cre-ated by clinical experts specializing in each cancer type Clinicaland translational researchers inform the choice of analysis toolsto be added which are then developed and tested by transla-tional bioinformaticians Establishing and incorporating an inte-gromics framework into the mindset of an institution will requireadditional effort and workload on certain individuals Hence it isimperative that team discussions include delegation and sharedresponsibility over matters such as data collection and curationdata formats and standards system maintenance data exchangeand access control Differing priorities and opinions betweenteam members are to some extent unavoidable but teams canadopt measures to manage conflicts such as having all protocolsclearly documented or maintaining a regular meeting scheduleRegardless of how these challenges are met we stress the im-portance of maintaining open lines of communication and agree-ing to clear roles and responsibilities of all team membersEffective teamwork is a non-trivial yet integral contributor to thesuccess of any integromics framework

With an ever-increasing emphasis on interdisciplinaryresearch projects like PICan underscore the need for cross-disciplinary education to instil young researchers with thenecessary mindset understanding and skills to navigate thechallenges and opportunities in the research environment ofthe future Queenrsquos University Belfast for example offers aMasters-level course in Bioinformatics and ComputationalGenomics [95] The course attracts students from a diverserange of backgrounds from medical doctors to engineers andcomputer scientists to biologists from a range of life sciencesubject areas and aims to equip them with a solid foundationacross a broad range of topics such as advanced bioinformaticsdatabase design tissue imaging and integromics

Interdisciplinary teamwork is vital to another pillar of anyintegromics framework access to quality data Any analysis is

only as robust as the data that underpins it The role of the NIBis crucial to PICanrsquos success not only through access to its bio-specimens but also its expertise and infrastructure support forsample preparation processing accessioning tracking andquality control The active participation of an interdisciplinaryteam also distributes the workload of data curation One of thegreat struggles of data collection is maximizing quality andquantity with a limited resource of labour With an increasingfocus among funders and institutions on large-scale multi-centre studies scaling up an integromics framework while pre-serving data quality will rely heavily on greater delegation ofdata curation to the data collectors themselves The onus willbe on the individual data collectors to flag data discrepanciesso that the database administrator serves as the last line of de-fence rather than the first

Ethics and data governance

Like any other system handling clinical and pathological infor-mation an integromics framework also requires appropriateapproval for the use of patient data Additionally data controlsin integromics need to protect both patients and data collectorsde-identification protects patient privacy while access controlsand terms of use policies protect the rights of data collectors

All studies undertaken in PICan need to be approved follow-ing an application to NIB where project-specific ethics and gov-ernance are granted under the Office for Research EthicsCommittees Northern Ireland approved NIB guidelines As a re-search tool PICan stores no personally identifiable informationin its database ensuring patient anonymization and minimiz-ing the potential impact of any data security breach Each pa-tient sample and associated analytical data are labelled using aunique independent study-specific identifier within the systemClinically qualified and authorized patient data collectors retainthe key that links these identifiers to hospital systems wherepersonal data are securely stored Researchers using PICan arerestricted to using anonymized research data onlymdashrespectingdata protection laws and regulations However as importantfollow-up data on patients accrues over time these can still beaccessed by having appropriately authorized clinical staff anddata collectors use the key to identify patients on hospital sys-tems and retrieve more up-to-date patient informationAuthorized clinical personnel can then review and retrieve per-tinent research-relevant information from the hospital systemsand update PICan again with only anonymized research datausing the PICan identifiers This is equivalent to an honest bro-ker system used elsewhere [96 97]

Data curation of de-identified information in PICan producesa robust privacy-controlled snapshot of analytical data but theunderlying data sources are themselves continually beingupdated as patients are followed up and new specimens are col-lected While data update is possible it is necessarily a manualprocess as PICan has been walled off from the original datasources by design allowing it to operate with fewer privacy con-trols than a typical clinical system as no identifiable patientdata are stored To maintain quality and consistency with exist-ing data data updates to PICan are done only periodically withentire cohorts of patients updated at once Alternatively insti-tutions that prioritize constantly updated data will requiredeeper integration between their research and clinical data sys-tems In such cases additional data protection measures will berequired such as encryption or safe havens for data storage andtransfer Essentially such an integrated system would need tomeet local rules and regulations for clinical data systems

Integromics in cancer tissue biomarker research | 9

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

Scaling up for large-scale epidemiologicalstudies

The last few decades have seen tremendous investment into allaspects of cancer biomarker discovery research but as of 2013only 18 protein cancer biomarkers had received US Food andDrug Administration (FDA) approval [98] despite decades of re-search and thousands of publications on novel candidate bio-markers Part of the reason may be a focus on early-stagediscovery activities and the relative lack of large-scale multi-centre validation and clinical trials Studies of such scale pre-sent a number of data management challenges not unlikethose we designed PICan to address First and foremost integro-mics is inherently a team activity and the same principles ofteamwork apply whether it is conducted within a single centreor across multiple research centres on different continents Inthis way integromics research extends readily to large-scalemulticentre epidemiological studies Data harmonization of fileformats and common vocabulary remain vitally important andbest addressed through open communication between collabor-ators As in the case of collaborators within an institution teammembers from different institutions should ideally agree on asingle data format for each data type and a common agreed setof vocabulary to describe data fields collected depending on theneeds of the study and its participants Analysis of multicentredata also requires standardization of pre-analytical parameterssuch as inclusion criteria sample acquisition sample process-ing assay conditions and measurements recorded When estab-lishing project standards and protocols teams should considerwhat community standards are already available to maximizeinteroperability and avoid redundancies Across Europe for ex-ample projects like ELIXIR [99] and Euro-BioImaging [100] are al-ready seeking to establish hubs for biomedical and bioimagingresource sharing Globally similar efforts in genomics are alsounderway [19ndash21]

The process of building digital infrastructure to support amulticentre integromics framework is similar to the case of asingle institution All team members should meet and agree onthe structure of the framework Depending on the extent of theproject and range of requirements the technical scientific andmedical team will need to grow as well Among the team thereis a need for at least one full-time staff member dedicated totechnical development and management of the digital infra-structure As the project grows the staffing requirement willundoubtedly increase in tandem For this reason modular de-sign becomes even more crucial as it allows development tasksto be carved up and delegated out to different team membersThere will also be additional requirements for hardware andsoftware licences Software costs can be mitigated by using free(eg R) or low-cost volume licenced (eg Microsoft Visual Studio(Redmond WA USA) for ASPNET in an academic setting) op-tions Nevertheless despite minor challenges digital solutionsgenerally scale well both in size and geographic distributionParallels can be drawn between incorporating external data re-sources into an integromics environment as previously dis-cussed and the challenges of connecting information systemsacross multiple sites However multicentre teams also facesome specific challenges Teams may need to comply with vari-ous sets of local laws and regulations surrounding researchethics data security and patient privacy Network access andsecurity may also be further complicated in a multicentre set-ting PICan for example was built for use within a single insti-tution so access was limited to within the institutionrsquos networkfirewall with all external access blocked A system spanning

multiple institutions would need greater attention to securitywhile maintaining access from outside the host institutionTeams may also need to contend with different software andhardware infrastructures at the various centres and implementthe necessary interfaces to ensure compatibility However theultimate key to success of multicentre collaborations is a funda-mental commitment to team science and on that front institu-tions that embrace integromics will have the experience to leadthe coming age of large-scale multicentre studies

Furthermore platforms like PICan and the integration ofimage analysis data with clinicopathological and therapeuticdata will help accelerate the screening of candidate tissue bio-markers for prognostic and predictive capacity While this rep-resents a powerful research and discovery tool this is likely toidentify new tissue biomarkers (based on IHC chromogenic insitu hybridization (CISH) fluorescence in situ hybridization(FISH) RNAscope etc) that have clear value in routine diagnos-tics and the stratification of patients for precision therapeuticsFurther validation of these biomarkers would be necessary incarefully controlled and standardized biomarker trials takinginto consideration sample preparation scanner configurationlab-to-lab variation comparative performance against patholo-gist scoring and clinical outcome data Regulatory approval ofdigital diagnostic tests would require FDA clearance (USA) orCE-IVD marking (Europe) if they were to be marketed as clinicaltests by industry but may also be implemented as laboratory-developed tests with documented internal validation and clin-ical evidence if done within an academic medical centre suchas our own setting Integromics platforms provide the frame-work for discovery and if designed appropriately can provide ro-bust clinical evidence of the utility of new biomarkers forsubsequent commercial licensing regulatory approval and clin-ical translation into routine diagnostictherapeutic practice

Conclusion

Modern biomedical research generates scientific and clinicaldata faster than the ability to fully interpret it while data fromnew and emerging technologies need to be integrated andharmonized with existing analysis workflows Efforts to resolvethe resulting bottleneck in interpretation to deliver on thepromises of precision medicine in this era of fast-track bio-marker discovery and evaluation have led to the birth of a newfield in bioinformatics integromics After introducing our ownframework for streamlining integromic analyses it becameclear that there is widespread interest in implementing integro-mic strategies in other research institutions In light of the pre-sent discussion we feel that an optimal integromics frameworkmust be well-supported by a dynamic and interdisciplinaryteam reflect the local needs and priorities of the institution andbe designed with an eye to the future The team is the gatewayto high-quality data and specialist insights that inform analysisand ultimately catalyse translation to clinical application insupport of precision medicine Meanwhile biomarker researchat our institution is currently centred on TMAs and we haveemphasized data quality and curation over raw throughputThese traits are imprinted in the design and operation of PICanbut other institutions may have other needs that demand a dif-ferent infrastructure Ongoing development of PICan includesincorporating raw data storage and access for completenessand traceability and strengthening support for digital imageanalysis As our institution adopts newer technologies PICanlikewise will continue to evolve facilitated by its modular de-sign to meet the changing needs of a busy cancer research

10 | Li et al

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

environment spanning numerous programmes technologiesand scientific objectives Hence integromics should be interwo-ven into the fabric of the institution and not simply be a down-loadable add-on

Key Points

bull Bold approaches are needed to maximize the potentialof big data in modern biomedical research and tobreak down the silos in which they are currently held

bull Collating storing transferring and integratively analy-sing vast amounts of heterogeneous multimodal datarequires a comprehensive and cohesive integromicsstrategy that streamlines biomarker discovery byexposing fresh insights into the underlying factorscontributing to disease phenotypes

bull We explore the practical considerations behind imple-menting an integromics framework using our experi-ences and design choices with PICan as a case studywhile recognizing that the global diversity of bio-marker discovery research programmes will requiretailoring solutions to each research environment

bull Successful integromics frameworks are supported bydynamic interdisciplinary teams and are inherentlyadaptable to new technologies and the host institu-tionrsquos evolving priorities

bull Integromics must be a central part of the institutionalculture and not just a downloadable add-on

Conflict of interest

Professor Peter Hamilton is the founder of and non-executivedirector with PathXL Ltd Professor Manuel Salto-Tellez isSenior Advisor to PathXL Ltd

Funding

The research leading to these results has received funding fromthe People Programme (Marie Curie Actions) of the EuropeanUnionrsquos Seventh Framework Programme FP72007ndash2013underREA grant agreement (285910 to PWH) Invest NorthernIreland (RDO0712612 to PWH) Cancer Research UK Accelerator(C11512A20256 to PWHMS-T) CRUK (to PB) The NorthernIreland Molecular Pathology Laboratory is supported by CancerResearch UK Experimental Cancer Medicine Centre Networkthe NI Health and Social Care Research and DevelopmentDivision the Sean Crummey Memorial Fund the Tom SimmsMemorial Fund and the Friends of the Cancer Centre (to MS-T) The Northern Ireland Biobank is funded by the Health andSocial Care Research and Development Division of the PublicHealth Agency in Northern Ireland and Cancer Research UKthrough the Belfast CRUK Centre and Northern IrelandExperimental Cancer Medicine Centre additional support wasreceived from the Friends of the Cancer Centre (to JAJ)

References1Canuel V Rance B Avillach P et al Translational research plat-

forms integrating clinical and omics data a review of publiclyavailable solutions Brief Bioinform 201516280ndash90

2 Noor AM Holmberg L Gillett C et al Big Data the challengefor small research groups in the era of cancer genomics Br JCancer 20151131405ndash12

3 PathXL Xplore httpwwwpathxlcompathxl-researchpathxl-xplore (20 November 2015 date last accessed)

4 Biomarker Interpretation Workflow - OncoMark httpwwwoncomarkcomgoproductsbiomarker-interpretation-workflow (23 November 2015 date last accessed)

5 Denaxas SC George J Herrett E et al Data resource profilecardiovascular disease research using linked bespoke studiesand electronic health records (CALIBER) Int J Epidemiol2012411625ndash38

6 Lubbock ALR Katz E Harrison DJ et al TMA Navigatornetwork inference patient stratification and survival ana-lysis with tissue microarray data Nucleic Acids Res 201341W562ndash8

7 Giardine B Riemer C Hardison RC et al Galaxy a platform forinteractive large-scale genome analysis Genome Res 2005151451ndash5

8 Goecks J Nekrutenko A Taylor J Galaxy a comprehensiveapproach for supporting accessible reproducible and trans-parent computational research in the life sciences GenomeBiol 201011R86

9 Madhavan S Gusev Y Harris M et al G-DOC a systems medi-cine platform for personalized oncology Neoplasia 201113771ndash83

10Dander A Baldauf M Sperk M et al Personalized OncologySuite integrating next-generation sequencing data andwhole-slide bioimages BMC Bioinformatics 201415306

11Kartashov AV Barski A BioWardrobe an integrated platformfor analysis of epigenomics and transcriptomics data GenomeBiol 201516158

12Harris PA Taylor R Thielke R et al Research electronic datacapture (REDCap)ndasha metadata-driven methodology andworkflow process for providing translational research in-formatics support J Biomed Inform 200942377ndash81

13Szalma S Koka V Khasanova T et al Effective knowledgemanagement in translational medicine J Transl Med201081ndash9

14Manley S Mucci NR De Marzo AM et al Relational databasestructure to manage high-density tissue microarray data andimages for pathology studies focusing on clinical outcomethe prostate specialized program of research excellencemodel Am J Pathol 2001159837ndash43

15Weinstein JN Integromic analysis of the NCI-60 cancer celllines Breast Dis 20041911ndash22

16Weinstein JN Pommier Y Transcriptomic analysis of theNCI-60 cancer cell lines C R Biol 2003326909ndash20

17Yu P Artz D Warner J Electronic health records (EHRs) sup-porting ASCOrsquos vision of cancer care Am Soc Clin Oncol EducBook 2014225ndash31

18McArt DG Blayney JK Boyle DP et al PICan an integromicsframework for dynamic cancer biomarker discovery MolOncol 201591234ndash40

19Hudson TJ Anderson W The International Cancer GenomeConsortium et al International network of cancer genomeprojects Nature 2010464993ndash8

20Home j Global Alliance for Genomics and Health httpsgenomicsandhealthorg (15 March 2016 date last accessed)

21The 1000 Genomes Project Consortium A global reference forhuman genetic variation Nature 201552668ndash74

22Collins FS Varmus H A New Initiative on Precision MedicineN Engl J Med 2015372793ndash5

23Kolesnikov N Hastings E Keays M et al ArrayExpress up-datendashsimplifying data submissions Nucleic Acids Res 201543D1113ndash6

Integromics in cancer tissue biomarker research | 11

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

24Leinonen R Akhtar R Birney E et al The European NucleotideArchive Nucleic Acids Res 201139D28ndash31

25The Cancer Genome Atlas Research NetworkComprehensive genomic characterization defines humanglioblastoma genes and core pathways Nature 20084551061ndash8

26Home - The Cancer Genome Atlas - Cancer Genome -TCGAhttpcancergenomenihgov (1 December 2015 datelast accessed)

27Kristensen VN Lingjaeligrde OC Russnes HG et al Principlesand methods of integrative genomic analyses in cancer NatRev Cancer 201414299ndash313

28Guinney J Dienstmann R Wang X et al The consensusmolecular subtypes of colorectal cancer Nat Med 2015211350ndash6

29Gao J Aksoy BA Dogrusoz U et al Integrative analysis ofcomplex cancer genomics and clinical profiles using thecBioPortal Sci Signal 20136pl1

30Cerami E Gao J Dogrusoz U et al The cBio cancer genomicsportal an open platform for exploring multidimensional can-cer genomics data Cancer Discov 20122401ndash4

31Forbes SA Beare D Gunasekaran P et al COSMIC exploringthe worldrsquos knowledge of somatic mutations in human can-cer Nucleic Acids Res 201543D805ndash11

32Broad GDAC Firehose httpgdacbroadinstituteorg (22March 2016 date last accessed)

33Le Cao K-A Gonzalez I Dejean S integrOmics an R packageto unravel relationships between two omics datasetsBioinformatics 2009252855ndash6

34Waller T Gubała T Sarapata K et al DNA microarray integro-mics analysis platform BioData Min 2015818

35Cecic IK Li G MacAulay C Technologies supporting analyt-ical cytology clinical research and drug discovery applica-tions J Biophotonics 20125313ndash26

36Rocha R Vassallo J Soares F et al Digital slides present sta-tus of a tool for consultation teaching and quality control inpathology Pathol Res Pract 2009205735ndash41

37Pantanowitz L Valenstein P Evans A et al Review of the cur-rent state of whole slide imaging in pathology J Pathol Inform2011236

38Brachtel E Yagi Y Digital imaging in pathologyndashcurrent ap-plications and challenges J Biophotonics 20125327ndash35

39Hamilton PW Bankhead P Wang Y et al Digital pathologyand image analysis in tissue biomarker research Methods20147059ndash73

40Schneider CA Rasband WS Eliceiri KW NIH Image to ImageJ25 years of image analysis Nat Methods 20129671ndash5

41Schindelin J Arganda-Carreras I Frise E et al Fiji - an OpenSource platform for biological image analysis Nat Methods20129676ndash82 101038nmeth2019

42de Chaumont F Dallongeville S Chenouard N et al Icy anopen bioimage informatics platform for extended reprodu-cible research Nat Methods 20129690ndash6

43Sommer C Straehle C Kothe U et al Ilastik interactive learn-ing and segmentation toolkit In Biomedical Imaging fromNano to Macro 2011 IEEE International Symposium on 2011pp 230ndash233 IEEE

44Hamilton PW van Diest PJ Williams R et al Do we see whatwe think we see The complexities of morphological assess-ment J Pathol 2009218285ndash91

45Salto-Tellez M James JA Hamilton PW Molecular path-ologymdashthe value of an integrative approach Mol Oncol201481163ndash8

46Li G van Niekerk D Miller D et al Molecular fixative enablesexpression microarray analysis of microdissected clinicalcervical specimens Exp Mol Pathol 201496168ndash77

47Galon J Mlecnik B Bindea G et al Towards the introductionof the lsquoImmunoscorersquo in the classification of malignant tu-mours J Pathol 2014232199ndash209

48R Core Team R A Language and Environment for StatisticalComputing Vienna Austria R Foundation for StatisticalComputing 2015

49Baier T Neuwirth E Excel COM R Comput Stat 20072291ndash10850McCarty KS Miller LS Cox EB et al Estrogen receptor ana-

lyses Correlation of biochemical and immunohistochemicalmethods using monoclonal antireceptor antibodies ArchPathol Lab Med 1985109716ndash21

51Goulding H Pinder S Cannon P et al A new immunohisto-chemical antibody for the assessment of estrogen receptorstatus on routine formalin-fixed tissue samples Hum Pathol199526291ndash4

52Leek JT Scharpf RB Bravo HC et al Tackling the widespreadand critical impact of batch effects in high-throughput dataNat Rev Genet 201011733ndash9

53Lauss M Visne I Kriegner A et al Monitoring of TechnicalVariation in Quantitative High-Throughput Datasets CancerInform 201312193ndash201

54Codd EF Normalized Data Base Structure a Brief Tutorial InProceedings of the 1971 ACM SIGFIDET (Now SIGMOD)Workshop on Data Description Access and Control SanDiego CA USA 1971 pp 1ndash17 Association for ComputingMachinery New York NY

55Kacprzyk J Zadro _zny S De Tre G Fuzziness in database man-agement systems half a century of developments and futureprospects Fuzzy Sets Syst 2015281300ndash7

56Bosc P Pivert O Farquhar K Integrating fuzzy queries intoan existing database management system an example Int JIntell Syst 19949475ndash92

57Duggal R Khatri SK Shukla B Improving patient matchingsingle patient view for Clinical Decision Support using BigData analytics In 2015 4th International Conference onReliability Infocom Technologies and Optimization (ICRITO)(Trends and Future Directions) Noida India 2015 pp 1ndash6 IEEENew York NY

58OpenSeadragon httpsopenseadragongithubio (7 April2016 date last accessed)

59Allred DC Bustamante MA Daniel CO et alImmunocytochemical analysis of estrogen receptors inhuman breast carcinomas evaluation of 130 cases and re-view of the literature regarding concordance with biochem-ical assay and clinical relevance Arch Surg 1990125107ndash13

60Singh R Chubb L Pantanowitz L et al Standardization in digi-tal pathology supplement 145 of the DICOM standardsJ Pathol Inform 2011223

61Linkert M Rueden CT Allan C et al Metadata matters accessto image data in the real world J Cell Biol 2010189777ndash82

62The HDF Group HDF Group - HDF5 httpswwwhdfgrouporgHDF5 (3 January 2016 date last accessed)

63Sommer C Held M Fischer B et al CellH5 a format for dataexchange in high-content screening Bioinformatics 2013291580ndash2

64Allan C Burel J-M Moore J et al OMERO flexible model-driven data management for experimental biology NatMethods 20129245ndash53

65Maree R Rollus L Stevens B et al Collaborative analysis ofmulti-gigapixel imaging data using Cytomine Bioinformatics2016321395ndash1401

12 | Li et al

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

66DigitalScope V5 httpwwwdigitalscopeorg (26 November2015 date last accessed)

67Astier Y Braha O Bayley H Toward single molecule DNAsequencing direct identification of ribonucleoside and deox-yribonucleoside 5lsquo-monophosphates by using an engineeredprotein nanopore equipped with a molecular adapter J AmChem Soc 20061281705ndash10

68Stoddart D Heron AJ Mikhailova E et al Single-nucleotidediscrimination in immobilized DNA oligonucleotides with abiological nanopore Proc Natl Acad Sci USA 20091067702ndash7

69Flusberg BA Webster D Lee J et al Direct detection of DNAmethylation during single-molecule real-time sequencingNat Methods 20107461ndash5

70Roberts RJ Carneiro MO Schatz MC The advantages of SMRTsequencing Genome Biol 201314405

71Sheth AP Larson JA Federated database systems for manag-ing distributed heterogeneous and autonomous databasesACM Comput Surv 199022183ndash236

72Baker EJ Biological databases for behavioral neurobiology IntRev Neurobiol 201210319ndash38

73Peuroaeuroakkonen P Pakkala D Reference architecture and classifi-cation of technologies products and services for big data sys-tems Big Data Res 20152166ndash86

74Haider S Ballester B Smedley D et al BioMart Central Portalndashunified access to biological data Nucleic Acids Res 200937W23ndash7

75Barrett T Wilhite SE Ledoux P et al NCBI GEO archive forfunctional genomics data setsndashupdate Nucleic Acids Res201341D991ndash5

76Goode A Gilbert B Harkes J et al OpenSlide a vendor-neutralsoftware foundation for digital pathology J Pathol Inform2013427

77Pearson WR Lipman DJ Improved tools for biological se-quence comparison Proc Natl Acad Sci USA 1988852444ndash8

78Berman JJ Edgerton ME Friedman BA The tissue microarraydata exchange specification a community-based opensource tool for sharing tissue microarray data BMC MedInform Decis Mak 200335

79Unified Medical Language System (UMLS) - Home httpswwwnlmnihgovresearchumls (20 November 2015 datelast accessed)

80Komatsoulis GA Warzel DB Hartel FW et al caCORE version3 implementation of a model driven service-oriented archi-tecture for semantic interoperability J Biomed Inform 200841106ndash23

81Covitz PA Hartel F Schaefer C et al caCORE a common infra-structure for cancer informatics Bioinformatics 200319 2404ndash12

82Papatheodorou I Crichton C Morris L et al A metadata ap-proach for clinical data management in translational gen-omics studies in breast cancer BMC Med Genomics 2009266

83Mohanty SK Mistry AT Amin W et al The development anddeployment of Common Data Elements for tissue banks fortranslational research in cancer - an emerging standardbased approach for the Mesothelioma Virtual Tissue BankBMC Cancer 2008891

84 Patel AA Kajdacsy-Balla A Berman JJ et al The developmentof common data elements for a multi-institute prostate can-cer tissue bank the Cooperative Prostate Cancer TissueResource (CPCTR) experience BMC Cancer 20055108

85 Shiny httpshinyrstudiocom (6 April 2016 date lastaccessed)

86 RNET documentation ndash user version j RNET ndash user versionhttpjmp75githubiordotnet (6 April 2016 date lastaccessed)

87 matplotlib python plotting mdash Matplotlib 151 documenta-tion httpmatplotliborg (6 April 2016 date last accessed)

88 D3js - Data-Driven Documents httpsd3jsorg (6 April2016 date last accessed)

89 Fereidouni F Datta-Mitra A Demos S et al Microscopy withUV Surface Excitation (MUSE) for slide-free histology andpathology imaging In Progress in Biomedical Optics andImaging - Proceedings of SPIE San Francisco CA USA 2015p 93180F SPIE Bellingham WA USA

90 Bird B Miljkovic M Romeo MJ et al Infrared micro-spectralimaging distinction of tissue types in axillary lymph nodehistology BMC Clin Pathol 200888

91 Taleb A Diamond J McGarvey JJ et al Raman Microscopy forthe Chemometric Analysis of Tumor Cells J Phys Chem B200611019625ndash31

92 Guze K Pawluk HC Short M et al Pilot study Raman spec-troscopy in differentiating premalignant and malignant orallesions from normal mucosa and benign lesions in humansHead Neck 201537511ndash7

93 McNamee R Regression modelling and other methods tocontrol confounding Occup Environ Med 200562500ndash6

94 dos Santos Silva I Dealing with confounding in the analysisIn Cancer Epidemiology Principles and Methods LyonFrance International Agency for Research on Cancer 1999305ndash31

95 Queenrsquos University Belfast j Postgraduate and ProfessionalDevelopment j MSc Bioinformatics and ComputationalGenomics httpwwwqubacukschoolsmdbspgdPTMScBioinformaticsandComputationalGenomics (24 March2016 date last accessed)

96 Dhir R Patel AA Winters S et al A multidisciplinary ap-proach to honest broker services for tissue banks and clin-ical data a pragmatic and practical model Cancer 20081131705ndash15

97 Boyd AD Hosner C Hunscher DA et al An lsquoHonest Brokerrsquomechanism to maintain privacy for patient care and aca-demic medical research Int J Med Inform 200776407ndash11

98 Pavlou MP Diamandis EP Blasutig IM The long journey ofcancer biomarkers from the bench to the clinic Clin Chem201359147ndash57

99 ELIXIR Data for life httpswwwelixir-europeorg (24March 2016 date last accessed)

100 Research Infrastructure for Imaging Technologies inBiological and Biomedical Sciences j Euro-BioImaginghttpwwweurobioimagingeu (24 March 2016 date lastaccessed)

Integromics in cancer tissue biomarker research | 13

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

Page 10: Embracing an integromic approach to tissue biomarker ... › download › pdf › 33595853.pdf · Her research uses integrative molecular approaches for biomarker discovery and validation

control over data origins and consistency it is essential that theentire process be rigorously validated before interpretation of re-sults What local data sets lack in scale they make up for in trace-ability For PICan we require that our collaborators are able tostand over every data point that enters our system therebyensuring a higher standard of quality and confidence in our dataMetadata recording technical and experimental details are storedin PICan alongside the data Regardless of how it is done meta-data tracking should be rigorous and thorough as differences inexperimental conditions and design must factor into the inter-pretation of any analysis Where appropriate data should be cor-rected and normalized to ensure comparisons are meaningfullike-versus-like comparisons and observed contrasts are notmerely batch or scalar effects

Interdisciplinary teamwork

Apart from robust infrastructure and software support the heartof any successful integromics framework is a cohesive and inter-disciplinary team to support and champion the entire effortDespite the technical complexities of a system like PICan team-work is arguably the most important and yet most often over-looked piece of the integromics puzzle This is fundamentally acollaborative endeavour bringing with it all the challenges ofworking with an interdisciplinary team of individuals and organ-izations each of whom have their own diverse set of goals andpriorities Clinical team members for example may have respon-sibilities to their patients that would not apply to academic re-searchers Nevertheless the specialist insights afforded by aninterdisciplinary team have been indispensable to the success ofPICan Clinicopathological data tables are built on templates cre-ated by clinical experts specializing in each cancer type Clinicaland translational researchers inform the choice of analysis toolsto be added which are then developed and tested by transla-tional bioinformaticians Establishing and incorporating an inte-gromics framework into the mindset of an institution will requireadditional effort and workload on certain individuals Hence it isimperative that team discussions include delegation and sharedresponsibility over matters such as data collection and curationdata formats and standards system maintenance data exchangeand access control Differing priorities and opinions betweenteam members are to some extent unavoidable but teams canadopt measures to manage conflicts such as having all protocolsclearly documented or maintaining a regular meeting scheduleRegardless of how these challenges are met we stress the im-portance of maintaining open lines of communication and agree-ing to clear roles and responsibilities of all team membersEffective teamwork is a non-trivial yet integral contributor to thesuccess of any integromics framework

With an ever-increasing emphasis on interdisciplinaryresearch projects like PICan underscore the need for cross-disciplinary education to instil young researchers with thenecessary mindset understanding and skills to navigate thechallenges and opportunities in the research environment ofthe future Queenrsquos University Belfast for example offers aMasters-level course in Bioinformatics and ComputationalGenomics [95] The course attracts students from a diverserange of backgrounds from medical doctors to engineers andcomputer scientists to biologists from a range of life sciencesubject areas and aims to equip them with a solid foundationacross a broad range of topics such as advanced bioinformaticsdatabase design tissue imaging and integromics

Interdisciplinary teamwork is vital to another pillar of anyintegromics framework access to quality data Any analysis is

only as robust as the data that underpins it The role of the NIBis crucial to PICanrsquos success not only through access to its bio-specimens but also its expertise and infrastructure support forsample preparation processing accessioning tracking andquality control The active participation of an interdisciplinaryteam also distributes the workload of data curation One of thegreat struggles of data collection is maximizing quality andquantity with a limited resource of labour With an increasingfocus among funders and institutions on large-scale multi-centre studies scaling up an integromics framework while pre-serving data quality will rely heavily on greater delegation ofdata curation to the data collectors themselves The onus willbe on the individual data collectors to flag data discrepanciesso that the database administrator serves as the last line of de-fence rather than the first

Ethics and data governance

Like any other system handling clinical and pathological infor-mation an integromics framework also requires appropriateapproval for the use of patient data Additionally data controlsin integromics need to protect both patients and data collectorsde-identification protects patient privacy while access controlsand terms of use policies protect the rights of data collectors

All studies undertaken in PICan need to be approved follow-ing an application to NIB where project-specific ethics and gov-ernance are granted under the Office for Research EthicsCommittees Northern Ireland approved NIB guidelines As a re-search tool PICan stores no personally identifiable informationin its database ensuring patient anonymization and minimiz-ing the potential impact of any data security breach Each pa-tient sample and associated analytical data are labelled using aunique independent study-specific identifier within the systemClinically qualified and authorized patient data collectors retainthe key that links these identifiers to hospital systems wherepersonal data are securely stored Researchers using PICan arerestricted to using anonymized research data onlymdashrespectingdata protection laws and regulations However as importantfollow-up data on patients accrues over time these can still beaccessed by having appropriately authorized clinical staff anddata collectors use the key to identify patients on hospital sys-tems and retrieve more up-to-date patient informationAuthorized clinical personnel can then review and retrieve per-tinent research-relevant information from the hospital systemsand update PICan again with only anonymized research datausing the PICan identifiers This is equivalent to an honest bro-ker system used elsewhere [96 97]

Data curation of de-identified information in PICan producesa robust privacy-controlled snapshot of analytical data but theunderlying data sources are themselves continually beingupdated as patients are followed up and new specimens are col-lected While data update is possible it is necessarily a manualprocess as PICan has been walled off from the original datasources by design allowing it to operate with fewer privacy con-trols than a typical clinical system as no identifiable patientdata are stored To maintain quality and consistency with exist-ing data data updates to PICan are done only periodically withentire cohorts of patients updated at once Alternatively insti-tutions that prioritize constantly updated data will requiredeeper integration between their research and clinical data sys-tems In such cases additional data protection measures will berequired such as encryption or safe havens for data storage andtransfer Essentially such an integrated system would need tomeet local rules and regulations for clinical data systems

Integromics in cancer tissue biomarker research | 9

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

Scaling up for large-scale epidemiologicalstudies

The last few decades have seen tremendous investment into allaspects of cancer biomarker discovery research but as of 2013only 18 protein cancer biomarkers had received US Food andDrug Administration (FDA) approval [98] despite decades of re-search and thousands of publications on novel candidate bio-markers Part of the reason may be a focus on early-stagediscovery activities and the relative lack of large-scale multi-centre validation and clinical trials Studies of such scale pre-sent a number of data management challenges not unlikethose we designed PICan to address First and foremost integro-mics is inherently a team activity and the same principles ofteamwork apply whether it is conducted within a single centreor across multiple research centres on different continents Inthis way integromics research extends readily to large-scalemulticentre epidemiological studies Data harmonization of fileformats and common vocabulary remain vitally important andbest addressed through open communication between collabor-ators As in the case of collaborators within an institution teammembers from different institutions should ideally agree on asingle data format for each data type and a common agreed setof vocabulary to describe data fields collected depending on theneeds of the study and its participants Analysis of multicentredata also requires standardization of pre-analytical parameterssuch as inclusion criteria sample acquisition sample process-ing assay conditions and measurements recorded When estab-lishing project standards and protocols teams should considerwhat community standards are already available to maximizeinteroperability and avoid redundancies Across Europe for ex-ample projects like ELIXIR [99] and Euro-BioImaging [100] are al-ready seeking to establish hubs for biomedical and bioimagingresource sharing Globally similar efforts in genomics are alsounderway [19ndash21]

The process of building digital infrastructure to support amulticentre integromics framework is similar to the case of asingle institution All team members should meet and agree onthe structure of the framework Depending on the extent of theproject and range of requirements the technical scientific andmedical team will need to grow as well Among the team thereis a need for at least one full-time staff member dedicated totechnical development and management of the digital infra-structure As the project grows the staffing requirement willundoubtedly increase in tandem For this reason modular de-sign becomes even more crucial as it allows development tasksto be carved up and delegated out to different team membersThere will also be additional requirements for hardware andsoftware licences Software costs can be mitigated by using free(eg R) or low-cost volume licenced (eg Microsoft Visual Studio(Redmond WA USA) for ASPNET in an academic setting) op-tions Nevertheless despite minor challenges digital solutionsgenerally scale well both in size and geographic distributionParallels can be drawn between incorporating external data re-sources into an integromics environment as previously dis-cussed and the challenges of connecting information systemsacross multiple sites However multicentre teams also facesome specific challenges Teams may need to comply with vari-ous sets of local laws and regulations surrounding researchethics data security and patient privacy Network access andsecurity may also be further complicated in a multicentre set-ting PICan for example was built for use within a single insti-tution so access was limited to within the institutionrsquos networkfirewall with all external access blocked A system spanning

multiple institutions would need greater attention to securitywhile maintaining access from outside the host institutionTeams may also need to contend with different software andhardware infrastructures at the various centres and implementthe necessary interfaces to ensure compatibility However theultimate key to success of multicentre collaborations is a funda-mental commitment to team science and on that front institu-tions that embrace integromics will have the experience to leadthe coming age of large-scale multicentre studies

Furthermore platforms like PICan and the integration ofimage analysis data with clinicopathological and therapeuticdata will help accelerate the screening of candidate tissue bio-markers for prognostic and predictive capacity While this rep-resents a powerful research and discovery tool this is likely toidentify new tissue biomarkers (based on IHC chromogenic insitu hybridization (CISH) fluorescence in situ hybridization(FISH) RNAscope etc) that have clear value in routine diagnos-tics and the stratification of patients for precision therapeuticsFurther validation of these biomarkers would be necessary incarefully controlled and standardized biomarker trials takinginto consideration sample preparation scanner configurationlab-to-lab variation comparative performance against patholo-gist scoring and clinical outcome data Regulatory approval ofdigital diagnostic tests would require FDA clearance (USA) orCE-IVD marking (Europe) if they were to be marketed as clinicaltests by industry but may also be implemented as laboratory-developed tests with documented internal validation and clin-ical evidence if done within an academic medical centre suchas our own setting Integromics platforms provide the frame-work for discovery and if designed appropriately can provide ro-bust clinical evidence of the utility of new biomarkers forsubsequent commercial licensing regulatory approval and clin-ical translation into routine diagnostictherapeutic practice

Conclusion

Modern biomedical research generates scientific and clinicaldata faster than the ability to fully interpret it while data fromnew and emerging technologies need to be integrated andharmonized with existing analysis workflows Efforts to resolvethe resulting bottleneck in interpretation to deliver on thepromises of precision medicine in this era of fast-track bio-marker discovery and evaluation have led to the birth of a newfield in bioinformatics integromics After introducing our ownframework for streamlining integromic analyses it becameclear that there is widespread interest in implementing integro-mic strategies in other research institutions In light of the pre-sent discussion we feel that an optimal integromics frameworkmust be well-supported by a dynamic and interdisciplinaryteam reflect the local needs and priorities of the institution andbe designed with an eye to the future The team is the gatewayto high-quality data and specialist insights that inform analysisand ultimately catalyse translation to clinical application insupport of precision medicine Meanwhile biomarker researchat our institution is currently centred on TMAs and we haveemphasized data quality and curation over raw throughputThese traits are imprinted in the design and operation of PICanbut other institutions may have other needs that demand a dif-ferent infrastructure Ongoing development of PICan includesincorporating raw data storage and access for completenessand traceability and strengthening support for digital imageanalysis As our institution adopts newer technologies PICanlikewise will continue to evolve facilitated by its modular de-sign to meet the changing needs of a busy cancer research

10 | Li et al

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

environment spanning numerous programmes technologiesand scientific objectives Hence integromics should be interwo-ven into the fabric of the institution and not simply be a down-loadable add-on

Key Points

bull Bold approaches are needed to maximize the potentialof big data in modern biomedical research and tobreak down the silos in which they are currently held

bull Collating storing transferring and integratively analy-sing vast amounts of heterogeneous multimodal datarequires a comprehensive and cohesive integromicsstrategy that streamlines biomarker discovery byexposing fresh insights into the underlying factorscontributing to disease phenotypes

bull We explore the practical considerations behind imple-menting an integromics framework using our experi-ences and design choices with PICan as a case studywhile recognizing that the global diversity of bio-marker discovery research programmes will requiretailoring solutions to each research environment

bull Successful integromics frameworks are supported bydynamic interdisciplinary teams and are inherentlyadaptable to new technologies and the host institu-tionrsquos evolving priorities

bull Integromics must be a central part of the institutionalculture and not just a downloadable add-on

Conflict of interest

Professor Peter Hamilton is the founder of and non-executivedirector with PathXL Ltd Professor Manuel Salto-Tellez isSenior Advisor to PathXL Ltd

Funding

The research leading to these results has received funding fromthe People Programme (Marie Curie Actions) of the EuropeanUnionrsquos Seventh Framework Programme FP72007ndash2013underREA grant agreement (285910 to PWH) Invest NorthernIreland (RDO0712612 to PWH) Cancer Research UK Accelerator(C11512A20256 to PWHMS-T) CRUK (to PB) The NorthernIreland Molecular Pathology Laboratory is supported by CancerResearch UK Experimental Cancer Medicine Centre Networkthe NI Health and Social Care Research and DevelopmentDivision the Sean Crummey Memorial Fund the Tom SimmsMemorial Fund and the Friends of the Cancer Centre (to MS-T) The Northern Ireland Biobank is funded by the Health andSocial Care Research and Development Division of the PublicHealth Agency in Northern Ireland and Cancer Research UKthrough the Belfast CRUK Centre and Northern IrelandExperimental Cancer Medicine Centre additional support wasreceived from the Friends of the Cancer Centre (to JAJ)

References1Canuel V Rance B Avillach P et al Translational research plat-

forms integrating clinical and omics data a review of publiclyavailable solutions Brief Bioinform 201516280ndash90

2 Noor AM Holmberg L Gillett C et al Big Data the challengefor small research groups in the era of cancer genomics Br JCancer 20151131405ndash12

3 PathXL Xplore httpwwwpathxlcompathxl-researchpathxl-xplore (20 November 2015 date last accessed)

4 Biomarker Interpretation Workflow - OncoMark httpwwwoncomarkcomgoproductsbiomarker-interpretation-workflow (23 November 2015 date last accessed)

5 Denaxas SC George J Herrett E et al Data resource profilecardiovascular disease research using linked bespoke studiesand electronic health records (CALIBER) Int J Epidemiol2012411625ndash38

6 Lubbock ALR Katz E Harrison DJ et al TMA Navigatornetwork inference patient stratification and survival ana-lysis with tissue microarray data Nucleic Acids Res 201341W562ndash8

7 Giardine B Riemer C Hardison RC et al Galaxy a platform forinteractive large-scale genome analysis Genome Res 2005151451ndash5

8 Goecks J Nekrutenko A Taylor J Galaxy a comprehensiveapproach for supporting accessible reproducible and trans-parent computational research in the life sciences GenomeBiol 201011R86

9 Madhavan S Gusev Y Harris M et al G-DOC a systems medi-cine platform for personalized oncology Neoplasia 201113771ndash83

10Dander A Baldauf M Sperk M et al Personalized OncologySuite integrating next-generation sequencing data andwhole-slide bioimages BMC Bioinformatics 201415306

11Kartashov AV Barski A BioWardrobe an integrated platformfor analysis of epigenomics and transcriptomics data GenomeBiol 201516158

12Harris PA Taylor R Thielke R et al Research electronic datacapture (REDCap)ndasha metadata-driven methodology andworkflow process for providing translational research in-formatics support J Biomed Inform 200942377ndash81

13Szalma S Koka V Khasanova T et al Effective knowledgemanagement in translational medicine J Transl Med201081ndash9

14Manley S Mucci NR De Marzo AM et al Relational databasestructure to manage high-density tissue microarray data andimages for pathology studies focusing on clinical outcomethe prostate specialized program of research excellencemodel Am J Pathol 2001159837ndash43

15Weinstein JN Integromic analysis of the NCI-60 cancer celllines Breast Dis 20041911ndash22

16Weinstein JN Pommier Y Transcriptomic analysis of theNCI-60 cancer cell lines C R Biol 2003326909ndash20

17Yu P Artz D Warner J Electronic health records (EHRs) sup-porting ASCOrsquos vision of cancer care Am Soc Clin Oncol EducBook 2014225ndash31

18McArt DG Blayney JK Boyle DP et al PICan an integromicsframework for dynamic cancer biomarker discovery MolOncol 201591234ndash40

19Hudson TJ Anderson W The International Cancer GenomeConsortium et al International network of cancer genomeprojects Nature 2010464993ndash8

20Home j Global Alliance for Genomics and Health httpsgenomicsandhealthorg (15 March 2016 date last accessed)

21The 1000 Genomes Project Consortium A global reference forhuman genetic variation Nature 201552668ndash74

22Collins FS Varmus H A New Initiative on Precision MedicineN Engl J Med 2015372793ndash5

23Kolesnikov N Hastings E Keays M et al ArrayExpress up-datendashsimplifying data submissions Nucleic Acids Res 201543D1113ndash6

Integromics in cancer tissue biomarker research | 11

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

24Leinonen R Akhtar R Birney E et al The European NucleotideArchive Nucleic Acids Res 201139D28ndash31

25The Cancer Genome Atlas Research NetworkComprehensive genomic characterization defines humanglioblastoma genes and core pathways Nature 20084551061ndash8

26Home - The Cancer Genome Atlas - Cancer Genome -TCGAhttpcancergenomenihgov (1 December 2015 datelast accessed)

27Kristensen VN Lingjaeligrde OC Russnes HG et al Principlesand methods of integrative genomic analyses in cancer NatRev Cancer 201414299ndash313

28Guinney J Dienstmann R Wang X et al The consensusmolecular subtypes of colorectal cancer Nat Med 2015211350ndash6

29Gao J Aksoy BA Dogrusoz U et al Integrative analysis ofcomplex cancer genomics and clinical profiles using thecBioPortal Sci Signal 20136pl1

30Cerami E Gao J Dogrusoz U et al The cBio cancer genomicsportal an open platform for exploring multidimensional can-cer genomics data Cancer Discov 20122401ndash4

31Forbes SA Beare D Gunasekaran P et al COSMIC exploringthe worldrsquos knowledge of somatic mutations in human can-cer Nucleic Acids Res 201543D805ndash11

32Broad GDAC Firehose httpgdacbroadinstituteorg (22March 2016 date last accessed)

33Le Cao K-A Gonzalez I Dejean S integrOmics an R packageto unravel relationships between two omics datasetsBioinformatics 2009252855ndash6

34Waller T Gubała T Sarapata K et al DNA microarray integro-mics analysis platform BioData Min 2015818

35Cecic IK Li G MacAulay C Technologies supporting analyt-ical cytology clinical research and drug discovery applica-tions J Biophotonics 20125313ndash26

36Rocha R Vassallo J Soares F et al Digital slides present sta-tus of a tool for consultation teaching and quality control inpathology Pathol Res Pract 2009205735ndash41

37Pantanowitz L Valenstein P Evans A et al Review of the cur-rent state of whole slide imaging in pathology J Pathol Inform2011236

38Brachtel E Yagi Y Digital imaging in pathologyndashcurrent ap-plications and challenges J Biophotonics 20125327ndash35

39Hamilton PW Bankhead P Wang Y et al Digital pathologyand image analysis in tissue biomarker research Methods20147059ndash73

40Schneider CA Rasband WS Eliceiri KW NIH Image to ImageJ25 years of image analysis Nat Methods 20129671ndash5

41Schindelin J Arganda-Carreras I Frise E et al Fiji - an OpenSource platform for biological image analysis Nat Methods20129676ndash82 101038nmeth2019

42de Chaumont F Dallongeville S Chenouard N et al Icy anopen bioimage informatics platform for extended reprodu-cible research Nat Methods 20129690ndash6

43Sommer C Straehle C Kothe U et al Ilastik interactive learn-ing and segmentation toolkit In Biomedical Imaging fromNano to Macro 2011 IEEE International Symposium on 2011pp 230ndash233 IEEE

44Hamilton PW van Diest PJ Williams R et al Do we see whatwe think we see The complexities of morphological assess-ment J Pathol 2009218285ndash91

45Salto-Tellez M James JA Hamilton PW Molecular path-ologymdashthe value of an integrative approach Mol Oncol201481163ndash8

46Li G van Niekerk D Miller D et al Molecular fixative enablesexpression microarray analysis of microdissected clinicalcervical specimens Exp Mol Pathol 201496168ndash77

47Galon J Mlecnik B Bindea G et al Towards the introductionof the lsquoImmunoscorersquo in the classification of malignant tu-mours J Pathol 2014232199ndash209

48R Core Team R A Language and Environment for StatisticalComputing Vienna Austria R Foundation for StatisticalComputing 2015

49Baier T Neuwirth E Excel COM R Comput Stat 20072291ndash10850McCarty KS Miller LS Cox EB et al Estrogen receptor ana-

lyses Correlation of biochemical and immunohistochemicalmethods using monoclonal antireceptor antibodies ArchPathol Lab Med 1985109716ndash21

51Goulding H Pinder S Cannon P et al A new immunohisto-chemical antibody for the assessment of estrogen receptorstatus on routine formalin-fixed tissue samples Hum Pathol199526291ndash4

52Leek JT Scharpf RB Bravo HC et al Tackling the widespreadand critical impact of batch effects in high-throughput dataNat Rev Genet 201011733ndash9

53Lauss M Visne I Kriegner A et al Monitoring of TechnicalVariation in Quantitative High-Throughput Datasets CancerInform 201312193ndash201

54Codd EF Normalized Data Base Structure a Brief Tutorial InProceedings of the 1971 ACM SIGFIDET (Now SIGMOD)Workshop on Data Description Access and Control SanDiego CA USA 1971 pp 1ndash17 Association for ComputingMachinery New York NY

55Kacprzyk J Zadro _zny S De Tre G Fuzziness in database man-agement systems half a century of developments and futureprospects Fuzzy Sets Syst 2015281300ndash7

56Bosc P Pivert O Farquhar K Integrating fuzzy queries intoan existing database management system an example Int JIntell Syst 19949475ndash92

57Duggal R Khatri SK Shukla B Improving patient matchingsingle patient view for Clinical Decision Support using BigData analytics In 2015 4th International Conference onReliability Infocom Technologies and Optimization (ICRITO)(Trends and Future Directions) Noida India 2015 pp 1ndash6 IEEENew York NY

58OpenSeadragon httpsopenseadragongithubio (7 April2016 date last accessed)

59Allred DC Bustamante MA Daniel CO et alImmunocytochemical analysis of estrogen receptors inhuman breast carcinomas evaluation of 130 cases and re-view of the literature regarding concordance with biochem-ical assay and clinical relevance Arch Surg 1990125107ndash13

60Singh R Chubb L Pantanowitz L et al Standardization in digi-tal pathology supplement 145 of the DICOM standardsJ Pathol Inform 2011223

61Linkert M Rueden CT Allan C et al Metadata matters accessto image data in the real world J Cell Biol 2010189777ndash82

62The HDF Group HDF Group - HDF5 httpswwwhdfgrouporgHDF5 (3 January 2016 date last accessed)

63Sommer C Held M Fischer B et al CellH5 a format for dataexchange in high-content screening Bioinformatics 2013291580ndash2

64Allan C Burel J-M Moore J et al OMERO flexible model-driven data management for experimental biology NatMethods 20129245ndash53

65Maree R Rollus L Stevens B et al Collaborative analysis ofmulti-gigapixel imaging data using Cytomine Bioinformatics2016321395ndash1401

12 | Li et al

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

66DigitalScope V5 httpwwwdigitalscopeorg (26 November2015 date last accessed)

67Astier Y Braha O Bayley H Toward single molecule DNAsequencing direct identification of ribonucleoside and deox-yribonucleoside 5lsquo-monophosphates by using an engineeredprotein nanopore equipped with a molecular adapter J AmChem Soc 20061281705ndash10

68Stoddart D Heron AJ Mikhailova E et al Single-nucleotidediscrimination in immobilized DNA oligonucleotides with abiological nanopore Proc Natl Acad Sci USA 20091067702ndash7

69Flusberg BA Webster D Lee J et al Direct detection of DNAmethylation during single-molecule real-time sequencingNat Methods 20107461ndash5

70Roberts RJ Carneiro MO Schatz MC The advantages of SMRTsequencing Genome Biol 201314405

71Sheth AP Larson JA Federated database systems for manag-ing distributed heterogeneous and autonomous databasesACM Comput Surv 199022183ndash236

72Baker EJ Biological databases for behavioral neurobiology IntRev Neurobiol 201210319ndash38

73Peuroaeuroakkonen P Pakkala D Reference architecture and classifi-cation of technologies products and services for big data sys-tems Big Data Res 20152166ndash86

74Haider S Ballester B Smedley D et al BioMart Central Portalndashunified access to biological data Nucleic Acids Res 200937W23ndash7

75Barrett T Wilhite SE Ledoux P et al NCBI GEO archive forfunctional genomics data setsndashupdate Nucleic Acids Res201341D991ndash5

76Goode A Gilbert B Harkes J et al OpenSlide a vendor-neutralsoftware foundation for digital pathology J Pathol Inform2013427

77Pearson WR Lipman DJ Improved tools for biological se-quence comparison Proc Natl Acad Sci USA 1988852444ndash8

78Berman JJ Edgerton ME Friedman BA The tissue microarraydata exchange specification a community-based opensource tool for sharing tissue microarray data BMC MedInform Decis Mak 200335

79Unified Medical Language System (UMLS) - Home httpswwwnlmnihgovresearchumls (20 November 2015 datelast accessed)

80Komatsoulis GA Warzel DB Hartel FW et al caCORE version3 implementation of a model driven service-oriented archi-tecture for semantic interoperability J Biomed Inform 200841106ndash23

81Covitz PA Hartel F Schaefer C et al caCORE a common infra-structure for cancer informatics Bioinformatics 200319 2404ndash12

82Papatheodorou I Crichton C Morris L et al A metadata ap-proach for clinical data management in translational gen-omics studies in breast cancer BMC Med Genomics 2009266

83Mohanty SK Mistry AT Amin W et al The development anddeployment of Common Data Elements for tissue banks fortranslational research in cancer - an emerging standardbased approach for the Mesothelioma Virtual Tissue BankBMC Cancer 2008891

84 Patel AA Kajdacsy-Balla A Berman JJ et al The developmentof common data elements for a multi-institute prostate can-cer tissue bank the Cooperative Prostate Cancer TissueResource (CPCTR) experience BMC Cancer 20055108

85 Shiny httpshinyrstudiocom (6 April 2016 date lastaccessed)

86 RNET documentation ndash user version j RNET ndash user versionhttpjmp75githubiordotnet (6 April 2016 date lastaccessed)

87 matplotlib python plotting mdash Matplotlib 151 documenta-tion httpmatplotliborg (6 April 2016 date last accessed)

88 D3js - Data-Driven Documents httpsd3jsorg (6 April2016 date last accessed)

89 Fereidouni F Datta-Mitra A Demos S et al Microscopy withUV Surface Excitation (MUSE) for slide-free histology andpathology imaging In Progress in Biomedical Optics andImaging - Proceedings of SPIE San Francisco CA USA 2015p 93180F SPIE Bellingham WA USA

90 Bird B Miljkovic M Romeo MJ et al Infrared micro-spectralimaging distinction of tissue types in axillary lymph nodehistology BMC Clin Pathol 200888

91 Taleb A Diamond J McGarvey JJ et al Raman Microscopy forthe Chemometric Analysis of Tumor Cells J Phys Chem B200611019625ndash31

92 Guze K Pawluk HC Short M et al Pilot study Raman spec-troscopy in differentiating premalignant and malignant orallesions from normal mucosa and benign lesions in humansHead Neck 201537511ndash7

93 McNamee R Regression modelling and other methods tocontrol confounding Occup Environ Med 200562500ndash6

94 dos Santos Silva I Dealing with confounding in the analysisIn Cancer Epidemiology Principles and Methods LyonFrance International Agency for Research on Cancer 1999305ndash31

95 Queenrsquos University Belfast j Postgraduate and ProfessionalDevelopment j MSc Bioinformatics and ComputationalGenomics httpwwwqubacukschoolsmdbspgdPTMScBioinformaticsandComputationalGenomics (24 March2016 date last accessed)

96 Dhir R Patel AA Winters S et al A multidisciplinary ap-proach to honest broker services for tissue banks and clin-ical data a pragmatic and practical model Cancer 20081131705ndash15

97 Boyd AD Hosner C Hunscher DA et al An lsquoHonest Brokerrsquomechanism to maintain privacy for patient care and aca-demic medical research Int J Med Inform 200776407ndash11

98 Pavlou MP Diamandis EP Blasutig IM The long journey ofcancer biomarkers from the bench to the clinic Clin Chem201359147ndash57

99 ELIXIR Data for life httpswwwelixir-europeorg (24March 2016 date last accessed)

100 Research Infrastructure for Imaging Technologies inBiological and Biomedical Sciences j Euro-BioImaginghttpwwweurobioimagingeu (24 March 2016 date lastaccessed)

Integromics in cancer tissue biomarker research | 13

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

Page 11: Embracing an integromic approach to tissue biomarker ... › download › pdf › 33595853.pdf · Her research uses integrative molecular approaches for biomarker discovery and validation

Scaling up for large-scale epidemiologicalstudies

The last few decades have seen tremendous investment into allaspects of cancer biomarker discovery research but as of 2013only 18 protein cancer biomarkers had received US Food andDrug Administration (FDA) approval [98] despite decades of re-search and thousands of publications on novel candidate bio-markers Part of the reason may be a focus on early-stagediscovery activities and the relative lack of large-scale multi-centre validation and clinical trials Studies of such scale pre-sent a number of data management challenges not unlikethose we designed PICan to address First and foremost integro-mics is inherently a team activity and the same principles ofteamwork apply whether it is conducted within a single centreor across multiple research centres on different continents Inthis way integromics research extends readily to large-scalemulticentre epidemiological studies Data harmonization of fileformats and common vocabulary remain vitally important andbest addressed through open communication between collabor-ators As in the case of collaborators within an institution teammembers from different institutions should ideally agree on asingle data format for each data type and a common agreed setof vocabulary to describe data fields collected depending on theneeds of the study and its participants Analysis of multicentredata also requires standardization of pre-analytical parameterssuch as inclusion criteria sample acquisition sample process-ing assay conditions and measurements recorded When estab-lishing project standards and protocols teams should considerwhat community standards are already available to maximizeinteroperability and avoid redundancies Across Europe for ex-ample projects like ELIXIR [99] and Euro-BioImaging [100] are al-ready seeking to establish hubs for biomedical and bioimagingresource sharing Globally similar efforts in genomics are alsounderway [19ndash21]

The process of building digital infrastructure to support amulticentre integromics framework is similar to the case of asingle institution All team members should meet and agree onthe structure of the framework Depending on the extent of theproject and range of requirements the technical scientific andmedical team will need to grow as well Among the team thereis a need for at least one full-time staff member dedicated totechnical development and management of the digital infra-structure As the project grows the staffing requirement willundoubtedly increase in tandem For this reason modular de-sign becomes even more crucial as it allows development tasksto be carved up and delegated out to different team membersThere will also be additional requirements for hardware andsoftware licences Software costs can be mitigated by using free(eg R) or low-cost volume licenced (eg Microsoft Visual Studio(Redmond WA USA) for ASPNET in an academic setting) op-tions Nevertheless despite minor challenges digital solutionsgenerally scale well both in size and geographic distributionParallels can be drawn between incorporating external data re-sources into an integromics environment as previously dis-cussed and the challenges of connecting information systemsacross multiple sites However multicentre teams also facesome specific challenges Teams may need to comply with vari-ous sets of local laws and regulations surrounding researchethics data security and patient privacy Network access andsecurity may also be further complicated in a multicentre set-ting PICan for example was built for use within a single insti-tution so access was limited to within the institutionrsquos networkfirewall with all external access blocked A system spanning

multiple institutions would need greater attention to securitywhile maintaining access from outside the host institutionTeams may also need to contend with different software andhardware infrastructures at the various centres and implementthe necessary interfaces to ensure compatibility However theultimate key to success of multicentre collaborations is a funda-mental commitment to team science and on that front institu-tions that embrace integromics will have the experience to leadthe coming age of large-scale multicentre studies

Furthermore platforms like PICan and the integration ofimage analysis data with clinicopathological and therapeuticdata will help accelerate the screening of candidate tissue bio-markers for prognostic and predictive capacity While this rep-resents a powerful research and discovery tool this is likely toidentify new tissue biomarkers (based on IHC chromogenic insitu hybridization (CISH) fluorescence in situ hybridization(FISH) RNAscope etc) that have clear value in routine diagnos-tics and the stratification of patients for precision therapeuticsFurther validation of these biomarkers would be necessary incarefully controlled and standardized biomarker trials takinginto consideration sample preparation scanner configurationlab-to-lab variation comparative performance against patholo-gist scoring and clinical outcome data Regulatory approval ofdigital diagnostic tests would require FDA clearance (USA) orCE-IVD marking (Europe) if they were to be marketed as clinicaltests by industry but may also be implemented as laboratory-developed tests with documented internal validation and clin-ical evidence if done within an academic medical centre suchas our own setting Integromics platforms provide the frame-work for discovery and if designed appropriately can provide ro-bust clinical evidence of the utility of new biomarkers forsubsequent commercial licensing regulatory approval and clin-ical translation into routine diagnostictherapeutic practice

Conclusion

Modern biomedical research generates scientific and clinicaldata faster than the ability to fully interpret it while data fromnew and emerging technologies need to be integrated andharmonized with existing analysis workflows Efforts to resolvethe resulting bottleneck in interpretation to deliver on thepromises of precision medicine in this era of fast-track bio-marker discovery and evaluation have led to the birth of a newfield in bioinformatics integromics After introducing our ownframework for streamlining integromic analyses it becameclear that there is widespread interest in implementing integro-mic strategies in other research institutions In light of the pre-sent discussion we feel that an optimal integromics frameworkmust be well-supported by a dynamic and interdisciplinaryteam reflect the local needs and priorities of the institution andbe designed with an eye to the future The team is the gatewayto high-quality data and specialist insights that inform analysisand ultimately catalyse translation to clinical application insupport of precision medicine Meanwhile biomarker researchat our institution is currently centred on TMAs and we haveemphasized data quality and curation over raw throughputThese traits are imprinted in the design and operation of PICanbut other institutions may have other needs that demand a dif-ferent infrastructure Ongoing development of PICan includesincorporating raw data storage and access for completenessand traceability and strengthening support for digital imageanalysis As our institution adopts newer technologies PICanlikewise will continue to evolve facilitated by its modular de-sign to meet the changing needs of a busy cancer research

10 | Li et al

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

environment spanning numerous programmes technologiesand scientific objectives Hence integromics should be interwo-ven into the fabric of the institution and not simply be a down-loadable add-on

Key Points

bull Bold approaches are needed to maximize the potentialof big data in modern biomedical research and tobreak down the silos in which they are currently held

bull Collating storing transferring and integratively analy-sing vast amounts of heterogeneous multimodal datarequires a comprehensive and cohesive integromicsstrategy that streamlines biomarker discovery byexposing fresh insights into the underlying factorscontributing to disease phenotypes

bull We explore the practical considerations behind imple-menting an integromics framework using our experi-ences and design choices with PICan as a case studywhile recognizing that the global diversity of bio-marker discovery research programmes will requiretailoring solutions to each research environment

bull Successful integromics frameworks are supported bydynamic interdisciplinary teams and are inherentlyadaptable to new technologies and the host institu-tionrsquos evolving priorities

bull Integromics must be a central part of the institutionalculture and not just a downloadable add-on

Conflict of interest

Professor Peter Hamilton is the founder of and non-executivedirector with PathXL Ltd Professor Manuel Salto-Tellez isSenior Advisor to PathXL Ltd

Funding

The research leading to these results has received funding fromthe People Programme (Marie Curie Actions) of the EuropeanUnionrsquos Seventh Framework Programme FP72007ndash2013underREA grant agreement (285910 to PWH) Invest NorthernIreland (RDO0712612 to PWH) Cancer Research UK Accelerator(C11512A20256 to PWHMS-T) CRUK (to PB) The NorthernIreland Molecular Pathology Laboratory is supported by CancerResearch UK Experimental Cancer Medicine Centre Networkthe NI Health and Social Care Research and DevelopmentDivision the Sean Crummey Memorial Fund the Tom SimmsMemorial Fund and the Friends of the Cancer Centre (to MS-T) The Northern Ireland Biobank is funded by the Health andSocial Care Research and Development Division of the PublicHealth Agency in Northern Ireland and Cancer Research UKthrough the Belfast CRUK Centre and Northern IrelandExperimental Cancer Medicine Centre additional support wasreceived from the Friends of the Cancer Centre (to JAJ)

References1Canuel V Rance B Avillach P et al Translational research plat-

forms integrating clinical and omics data a review of publiclyavailable solutions Brief Bioinform 201516280ndash90

2 Noor AM Holmberg L Gillett C et al Big Data the challengefor small research groups in the era of cancer genomics Br JCancer 20151131405ndash12

3 PathXL Xplore httpwwwpathxlcompathxl-researchpathxl-xplore (20 November 2015 date last accessed)

4 Biomarker Interpretation Workflow - OncoMark httpwwwoncomarkcomgoproductsbiomarker-interpretation-workflow (23 November 2015 date last accessed)

5 Denaxas SC George J Herrett E et al Data resource profilecardiovascular disease research using linked bespoke studiesand electronic health records (CALIBER) Int J Epidemiol2012411625ndash38

6 Lubbock ALR Katz E Harrison DJ et al TMA Navigatornetwork inference patient stratification and survival ana-lysis with tissue microarray data Nucleic Acids Res 201341W562ndash8

7 Giardine B Riemer C Hardison RC et al Galaxy a platform forinteractive large-scale genome analysis Genome Res 2005151451ndash5

8 Goecks J Nekrutenko A Taylor J Galaxy a comprehensiveapproach for supporting accessible reproducible and trans-parent computational research in the life sciences GenomeBiol 201011R86

9 Madhavan S Gusev Y Harris M et al G-DOC a systems medi-cine platform for personalized oncology Neoplasia 201113771ndash83

10Dander A Baldauf M Sperk M et al Personalized OncologySuite integrating next-generation sequencing data andwhole-slide bioimages BMC Bioinformatics 201415306

11Kartashov AV Barski A BioWardrobe an integrated platformfor analysis of epigenomics and transcriptomics data GenomeBiol 201516158

12Harris PA Taylor R Thielke R et al Research electronic datacapture (REDCap)ndasha metadata-driven methodology andworkflow process for providing translational research in-formatics support J Biomed Inform 200942377ndash81

13Szalma S Koka V Khasanova T et al Effective knowledgemanagement in translational medicine J Transl Med201081ndash9

14Manley S Mucci NR De Marzo AM et al Relational databasestructure to manage high-density tissue microarray data andimages for pathology studies focusing on clinical outcomethe prostate specialized program of research excellencemodel Am J Pathol 2001159837ndash43

15Weinstein JN Integromic analysis of the NCI-60 cancer celllines Breast Dis 20041911ndash22

16Weinstein JN Pommier Y Transcriptomic analysis of theNCI-60 cancer cell lines C R Biol 2003326909ndash20

17Yu P Artz D Warner J Electronic health records (EHRs) sup-porting ASCOrsquos vision of cancer care Am Soc Clin Oncol EducBook 2014225ndash31

18McArt DG Blayney JK Boyle DP et al PICan an integromicsframework for dynamic cancer biomarker discovery MolOncol 201591234ndash40

19Hudson TJ Anderson W The International Cancer GenomeConsortium et al International network of cancer genomeprojects Nature 2010464993ndash8

20Home j Global Alliance for Genomics and Health httpsgenomicsandhealthorg (15 March 2016 date last accessed)

21The 1000 Genomes Project Consortium A global reference forhuman genetic variation Nature 201552668ndash74

22Collins FS Varmus H A New Initiative on Precision MedicineN Engl J Med 2015372793ndash5

23Kolesnikov N Hastings E Keays M et al ArrayExpress up-datendashsimplifying data submissions Nucleic Acids Res 201543D1113ndash6

Integromics in cancer tissue biomarker research | 11

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

24Leinonen R Akhtar R Birney E et al The European NucleotideArchive Nucleic Acids Res 201139D28ndash31

25The Cancer Genome Atlas Research NetworkComprehensive genomic characterization defines humanglioblastoma genes and core pathways Nature 20084551061ndash8

26Home - The Cancer Genome Atlas - Cancer Genome -TCGAhttpcancergenomenihgov (1 December 2015 datelast accessed)

27Kristensen VN Lingjaeligrde OC Russnes HG et al Principlesand methods of integrative genomic analyses in cancer NatRev Cancer 201414299ndash313

28Guinney J Dienstmann R Wang X et al The consensusmolecular subtypes of colorectal cancer Nat Med 2015211350ndash6

29Gao J Aksoy BA Dogrusoz U et al Integrative analysis ofcomplex cancer genomics and clinical profiles using thecBioPortal Sci Signal 20136pl1

30Cerami E Gao J Dogrusoz U et al The cBio cancer genomicsportal an open platform for exploring multidimensional can-cer genomics data Cancer Discov 20122401ndash4

31Forbes SA Beare D Gunasekaran P et al COSMIC exploringthe worldrsquos knowledge of somatic mutations in human can-cer Nucleic Acids Res 201543D805ndash11

32Broad GDAC Firehose httpgdacbroadinstituteorg (22March 2016 date last accessed)

33Le Cao K-A Gonzalez I Dejean S integrOmics an R packageto unravel relationships between two omics datasetsBioinformatics 2009252855ndash6

34Waller T Gubała T Sarapata K et al DNA microarray integro-mics analysis platform BioData Min 2015818

35Cecic IK Li G MacAulay C Technologies supporting analyt-ical cytology clinical research and drug discovery applica-tions J Biophotonics 20125313ndash26

36Rocha R Vassallo J Soares F et al Digital slides present sta-tus of a tool for consultation teaching and quality control inpathology Pathol Res Pract 2009205735ndash41

37Pantanowitz L Valenstein P Evans A et al Review of the cur-rent state of whole slide imaging in pathology J Pathol Inform2011236

38Brachtel E Yagi Y Digital imaging in pathologyndashcurrent ap-plications and challenges J Biophotonics 20125327ndash35

39Hamilton PW Bankhead P Wang Y et al Digital pathologyand image analysis in tissue biomarker research Methods20147059ndash73

40Schneider CA Rasband WS Eliceiri KW NIH Image to ImageJ25 years of image analysis Nat Methods 20129671ndash5

41Schindelin J Arganda-Carreras I Frise E et al Fiji - an OpenSource platform for biological image analysis Nat Methods20129676ndash82 101038nmeth2019

42de Chaumont F Dallongeville S Chenouard N et al Icy anopen bioimage informatics platform for extended reprodu-cible research Nat Methods 20129690ndash6

43Sommer C Straehle C Kothe U et al Ilastik interactive learn-ing and segmentation toolkit In Biomedical Imaging fromNano to Macro 2011 IEEE International Symposium on 2011pp 230ndash233 IEEE

44Hamilton PW van Diest PJ Williams R et al Do we see whatwe think we see The complexities of morphological assess-ment J Pathol 2009218285ndash91

45Salto-Tellez M James JA Hamilton PW Molecular path-ologymdashthe value of an integrative approach Mol Oncol201481163ndash8

46Li G van Niekerk D Miller D et al Molecular fixative enablesexpression microarray analysis of microdissected clinicalcervical specimens Exp Mol Pathol 201496168ndash77

47Galon J Mlecnik B Bindea G et al Towards the introductionof the lsquoImmunoscorersquo in the classification of malignant tu-mours J Pathol 2014232199ndash209

48R Core Team R A Language and Environment for StatisticalComputing Vienna Austria R Foundation for StatisticalComputing 2015

49Baier T Neuwirth E Excel COM R Comput Stat 20072291ndash10850McCarty KS Miller LS Cox EB et al Estrogen receptor ana-

lyses Correlation of biochemical and immunohistochemicalmethods using monoclonal antireceptor antibodies ArchPathol Lab Med 1985109716ndash21

51Goulding H Pinder S Cannon P et al A new immunohisto-chemical antibody for the assessment of estrogen receptorstatus on routine formalin-fixed tissue samples Hum Pathol199526291ndash4

52Leek JT Scharpf RB Bravo HC et al Tackling the widespreadand critical impact of batch effects in high-throughput dataNat Rev Genet 201011733ndash9

53Lauss M Visne I Kriegner A et al Monitoring of TechnicalVariation in Quantitative High-Throughput Datasets CancerInform 201312193ndash201

54Codd EF Normalized Data Base Structure a Brief Tutorial InProceedings of the 1971 ACM SIGFIDET (Now SIGMOD)Workshop on Data Description Access and Control SanDiego CA USA 1971 pp 1ndash17 Association for ComputingMachinery New York NY

55Kacprzyk J Zadro _zny S De Tre G Fuzziness in database man-agement systems half a century of developments and futureprospects Fuzzy Sets Syst 2015281300ndash7

56Bosc P Pivert O Farquhar K Integrating fuzzy queries intoan existing database management system an example Int JIntell Syst 19949475ndash92

57Duggal R Khatri SK Shukla B Improving patient matchingsingle patient view for Clinical Decision Support using BigData analytics In 2015 4th International Conference onReliability Infocom Technologies and Optimization (ICRITO)(Trends and Future Directions) Noida India 2015 pp 1ndash6 IEEENew York NY

58OpenSeadragon httpsopenseadragongithubio (7 April2016 date last accessed)

59Allred DC Bustamante MA Daniel CO et alImmunocytochemical analysis of estrogen receptors inhuman breast carcinomas evaluation of 130 cases and re-view of the literature regarding concordance with biochem-ical assay and clinical relevance Arch Surg 1990125107ndash13

60Singh R Chubb L Pantanowitz L et al Standardization in digi-tal pathology supplement 145 of the DICOM standardsJ Pathol Inform 2011223

61Linkert M Rueden CT Allan C et al Metadata matters accessto image data in the real world J Cell Biol 2010189777ndash82

62The HDF Group HDF Group - HDF5 httpswwwhdfgrouporgHDF5 (3 January 2016 date last accessed)

63Sommer C Held M Fischer B et al CellH5 a format for dataexchange in high-content screening Bioinformatics 2013291580ndash2

64Allan C Burel J-M Moore J et al OMERO flexible model-driven data management for experimental biology NatMethods 20129245ndash53

65Maree R Rollus L Stevens B et al Collaborative analysis ofmulti-gigapixel imaging data using Cytomine Bioinformatics2016321395ndash1401

12 | Li et al

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

66DigitalScope V5 httpwwwdigitalscopeorg (26 November2015 date last accessed)

67Astier Y Braha O Bayley H Toward single molecule DNAsequencing direct identification of ribonucleoside and deox-yribonucleoside 5lsquo-monophosphates by using an engineeredprotein nanopore equipped with a molecular adapter J AmChem Soc 20061281705ndash10

68Stoddart D Heron AJ Mikhailova E et al Single-nucleotidediscrimination in immobilized DNA oligonucleotides with abiological nanopore Proc Natl Acad Sci USA 20091067702ndash7

69Flusberg BA Webster D Lee J et al Direct detection of DNAmethylation during single-molecule real-time sequencingNat Methods 20107461ndash5

70Roberts RJ Carneiro MO Schatz MC The advantages of SMRTsequencing Genome Biol 201314405

71Sheth AP Larson JA Federated database systems for manag-ing distributed heterogeneous and autonomous databasesACM Comput Surv 199022183ndash236

72Baker EJ Biological databases for behavioral neurobiology IntRev Neurobiol 201210319ndash38

73Peuroaeuroakkonen P Pakkala D Reference architecture and classifi-cation of technologies products and services for big data sys-tems Big Data Res 20152166ndash86

74Haider S Ballester B Smedley D et al BioMart Central Portalndashunified access to biological data Nucleic Acids Res 200937W23ndash7

75Barrett T Wilhite SE Ledoux P et al NCBI GEO archive forfunctional genomics data setsndashupdate Nucleic Acids Res201341D991ndash5

76Goode A Gilbert B Harkes J et al OpenSlide a vendor-neutralsoftware foundation for digital pathology J Pathol Inform2013427

77Pearson WR Lipman DJ Improved tools for biological se-quence comparison Proc Natl Acad Sci USA 1988852444ndash8

78Berman JJ Edgerton ME Friedman BA The tissue microarraydata exchange specification a community-based opensource tool for sharing tissue microarray data BMC MedInform Decis Mak 200335

79Unified Medical Language System (UMLS) - Home httpswwwnlmnihgovresearchumls (20 November 2015 datelast accessed)

80Komatsoulis GA Warzel DB Hartel FW et al caCORE version3 implementation of a model driven service-oriented archi-tecture for semantic interoperability J Biomed Inform 200841106ndash23

81Covitz PA Hartel F Schaefer C et al caCORE a common infra-structure for cancer informatics Bioinformatics 200319 2404ndash12

82Papatheodorou I Crichton C Morris L et al A metadata ap-proach for clinical data management in translational gen-omics studies in breast cancer BMC Med Genomics 2009266

83Mohanty SK Mistry AT Amin W et al The development anddeployment of Common Data Elements for tissue banks fortranslational research in cancer - an emerging standardbased approach for the Mesothelioma Virtual Tissue BankBMC Cancer 2008891

84 Patel AA Kajdacsy-Balla A Berman JJ et al The developmentof common data elements for a multi-institute prostate can-cer tissue bank the Cooperative Prostate Cancer TissueResource (CPCTR) experience BMC Cancer 20055108

85 Shiny httpshinyrstudiocom (6 April 2016 date lastaccessed)

86 RNET documentation ndash user version j RNET ndash user versionhttpjmp75githubiordotnet (6 April 2016 date lastaccessed)

87 matplotlib python plotting mdash Matplotlib 151 documenta-tion httpmatplotliborg (6 April 2016 date last accessed)

88 D3js - Data-Driven Documents httpsd3jsorg (6 April2016 date last accessed)

89 Fereidouni F Datta-Mitra A Demos S et al Microscopy withUV Surface Excitation (MUSE) for slide-free histology andpathology imaging In Progress in Biomedical Optics andImaging - Proceedings of SPIE San Francisco CA USA 2015p 93180F SPIE Bellingham WA USA

90 Bird B Miljkovic M Romeo MJ et al Infrared micro-spectralimaging distinction of tissue types in axillary lymph nodehistology BMC Clin Pathol 200888

91 Taleb A Diamond J McGarvey JJ et al Raman Microscopy forthe Chemometric Analysis of Tumor Cells J Phys Chem B200611019625ndash31

92 Guze K Pawluk HC Short M et al Pilot study Raman spec-troscopy in differentiating premalignant and malignant orallesions from normal mucosa and benign lesions in humansHead Neck 201537511ndash7

93 McNamee R Regression modelling and other methods tocontrol confounding Occup Environ Med 200562500ndash6

94 dos Santos Silva I Dealing with confounding in the analysisIn Cancer Epidemiology Principles and Methods LyonFrance International Agency for Research on Cancer 1999305ndash31

95 Queenrsquos University Belfast j Postgraduate and ProfessionalDevelopment j MSc Bioinformatics and ComputationalGenomics httpwwwqubacukschoolsmdbspgdPTMScBioinformaticsandComputationalGenomics (24 March2016 date last accessed)

96 Dhir R Patel AA Winters S et al A multidisciplinary ap-proach to honest broker services for tissue banks and clin-ical data a pragmatic and practical model Cancer 20081131705ndash15

97 Boyd AD Hosner C Hunscher DA et al An lsquoHonest Brokerrsquomechanism to maintain privacy for patient care and aca-demic medical research Int J Med Inform 200776407ndash11

98 Pavlou MP Diamandis EP Blasutig IM The long journey ofcancer biomarkers from the bench to the clinic Clin Chem201359147ndash57

99 ELIXIR Data for life httpswwwelixir-europeorg (24March 2016 date last accessed)

100 Research Infrastructure for Imaging Technologies inBiological and Biomedical Sciences j Euro-BioImaginghttpwwweurobioimagingeu (24 March 2016 date lastaccessed)

Integromics in cancer tissue biomarker research | 13

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

Page 12: Embracing an integromic approach to tissue biomarker ... › download › pdf › 33595853.pdf · Her research uses integrative molecular approaches for biomarker discovery and validation

environment spanning numerous programmes technologiesand scientific objectives Hence integromics should be interwo-ven into the fabric of the institution and not simply be a down-loadable add-on

Key Points

bull Bold approaches are needed to maximize the potentialof big data in modern biomedical research and tobreak down the silos in which they are currently held

bull Collating storing transferring and integratively analy-sing vast amounts of heterogeneous multimodal datarequires a comprehensive and cohesive integromicsstrategy that streamlines biomarker discovery byexposing fresh insights into the underlying factorscontributing to disease phenotypes

bull We explore the practical considerations behind imple-menting an integromics framework using our experi-ences and design choices with PICan as a case studywhile recognizing that the global diversity of bio-marker discovery research programmes will requiretailoring solutions to each research environment

bull Successful integromics frameworks are supported bydynamic interdisciplinary teams and are inherentlyadaptable to new technologies and the host institu-tionrsquos evolving priorities

bull Integromics must be a central part of the institutionalculture and not just a downloadable add-on

Conflict of interest

Professor Peter Hamilton is the founder of and non-executivedirector with PathXL Ltd Professor Manuel Salto-Tellez isSenior Advisor to PathXL Ltd

Funding

The research leading to these results has received funding fromthe People Programme (Marie Curie Actions) of the EuropeanUnionrsquos Seventh Framework Programme FP72007ndash2013underREA grant agreement (285910 to PWH) Invest NorthernIreland (RDO0712612 to PWH) Cancer Research UK Accelerator(C11512A20256 to PWHMS-T) CRUK (to PB) The NorthernIreland Molecular Pathology Laboratory is supported by CancerResearch UK Experimental Cancer Medicine Centre Networkthe NI Health and Social Care Research and DevelopmentDivision the Sean Crummey Memorial Fund the Tom SimmsMemorial Fund and the Friends of the Cancer Centre (to MS-T) The Northern Ireland Biobank is funded by the Health andSocial Care Research and Development Division of the PublicHealth Agency in Northern Ireland and Cancer Research UKthrough the Belfast CRUK Centre and Northern IrelandExperimental Cancer Medicine Centre additional support wasreceived from the Friends of the Cancer Centre (to JAJ)

References1Canuel V Rance B Avillach P et al Translational research plat-

forms integrating clinical and omics data a review of publiclyavailable solutions Brief Bioinform 201516280ndash90

2 Noor AM Holmberg L Gillett C et al Big Data the challengefor small research groups in the era of cancer genomics Br JCancer 20151131405ndash12

3 PathXL Xplore httpwwwpathxlcompathxl-researchpathxl-xplore (20 November 2015 date last accessed)

4 Biomarker Interpretation Workflow - OncoMark httpwwwoncomarkcomgoproductsbiomarker-interpretation-workflow (23 November 2015 date last accessed)

5 Denaxas SC George J Herrett E et al Data resource profilecardiovascular disease research using linked bespoke studiesand electronic health records (CALIBER) Int J Epidemiol2012411625ndash38

6 Lubbock ALR Katz E Harrison DJ et al TMA Navigatornetwork inference patient stratification and survival ana-lysis with tissue microarray data Nucleic Acids Res 201341W562ndash8

7 Giardine B Riemer C Hardison RC et al Galaxy a platform forinteractive large-scale genome analysis Genome Res 2005151451ndash5

8 Goecks J Nekrutenko A Taylor J Galaxy a comprehensiveapproach for supporting accessible reproducible and trans-parent computational research in the life sciences GenomeBiol 201011R86

9 Madhavan S Gusev Y Harris M et al G-DOC a systems medi-cine platform for personalized oncology Neoplasia 201113771ndash83

10Dander A Baldauf M Sperk M et al Personalized OncologySuite integrating next-generation sequencing data andwhole-slide bioimages BMC Bioinformatics 201415306

11Kartashov AV Barski A BioWardrobe an integrated platformfor analysis of epigenomics and transcriptomics data GenomeBiol 201516158

12Harris PA Taylor R Thielke R et al Research electronic datacapture (REDCap)ndasha metadata-driven methodology andworkflow process for providing translational research in-formatics support J Biomed Inform 200942377ndash81

13Szalma S Koka V Khasanova T et al Effective knowledgemanagement in translational medicine J Transl Med201081ndash9

14Manley S Mucci NR De Marzo AM et al Relational databasestructure to manage high-density tissue microarray data andimages for pathology studies focusing on clinical outcomethe prostate specialized program of research excellencemodel Am J Pathol 2001159837ndash43

15Weinstein JN Integromic analysis of the NCI-60 cancer celllines Breast Dis 20041911ndash22

16Weinstein JN Pommier Y Transcriptomic analysis of theNCI-60 cancer cell lines C R Biol 2003326909ndash20

17Yu P Artz D Warner J Electronic health records (EHRs) sup-porting ASCOrsquos vision of cancer care Am Soc Clin Oncol EducBook 2014225ndash31

18McArt DG Blayney JK Boyle DP et al PICan an integromicsframework for dynamic cancer biomarker discovery MolOncol 201591234ndash40

19Hudson TJ Anderson W The International Cancer GenomeConsortium et al International network of cancer genomeprojects Nature 2010464993ndash8

20Home j Global Alliance for Genomics and Health httpsgenomicsandhealthorg (15 March 2016 date last accessed)

21The 1000 Genomes Project Consortium A global reference forhuman genetic variation Nature 201552668ndash74

22Collins FS Varmus H A New Initiative on Precision MedicineN Engl J Med 2015372793ndash5

23Kolesnikov N Hastings E Keays M et al ArrayExpress up-datendashsimplifying data submissions Nucleic Acids Res 201543D1113ndash6

Integromics in cancer tissue biomarker research | 11

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

24Leinonen R Akhtar R Birney E et al The European NucleotideArchive Nucleic Acids Res 201139D28ndash31

25The Cancer Genome Atlas Research NetworkComprehensive genomic characterization defines humanglioblastoma genes and core pathways Nature 20084551061ndash8

26Home - The Cancer Genome Atlas - Cancer Genome -TCGAhttpcancergenomenihgov (1 December 2015 datelast accessed)

27Kristensen VN Lingjaeligrde OC Russnes HG et al Principlesand methods of integrative genomic analyses in cancer NatRev Cancer 201414299ndash313

28Guinney J Dienstmann R Wang X et al The consensusmolecular subtypes of colorectal cancer Nat Med 2015211350ndash6

29Gao J Aksoy BA Dogrusoz U et al Integrative analysis ofcomplex cancer genomics and clinical profiles using thecBioPortal Sci Signal 20136pl1

30Cerami E Gao J Dogrusoz U et al The cBio cancer genomicsportal an open platform for exploring multidimensional can-cer genomics data Cancer Discov 20122401ndash4

31Forbes SA Beare D Gunasekaran P et al COSMIC exploringthe worldrsquos knowledge of somatic mutations in human can-cer Nucleic Acids Res 201543D805ndash11

32Broad GDAC Firehose httpgdacbroadinstituteorg (22March 2016 date last accessed)

33Le Cao K-A Gonzalez I Dejean S integrOmics an R packageto unravel relationships between two omics datasetsBioinformatics 2009252855ndash6

34Waller T Gubała T Sarapata K et al DNA microarray integro-mics analysis platform BioData Min 2015818

35Cecic IK Li G MacAulay C Technologies supporting analyt-ical cytology clinical research and drug discovery applica-tions J Biophotonics 20125313ndash26

36Rocha R Vassallo J Soares F et al Digital slides present sta-tus of a tool for consultation teaching and quality control inpathology Pathol Res Pract 2009205735ndash41

37Pantanowitz L Valenstein P Evans A et al Review of the cur-rent state of whole slide imaging in pathology J Pathol Inform2011236

38Brachtel E Yagi Y Digital imaging in pathologyndashcurrent ap-plications and challenges J Biophotonics 20125327ndash35

39Hamilton PW Bankhead P Wang Y et al Digital pathologyand image analysis in tissue biomarker research Methods20147059ndash73

40Schneider CA Rasband WS Eliceiri KW NIH Image to ImageJ25 years of image analysis Nat Methods 20129671ndash5

41Schindelin J Arganda-Carreras I Frise E et al Fiji - an OpenSource platform for biological image analysis Nat Methods20129676ndash82 101038nmeth2019

42de Chaumont F Dallongeville S Chenouard N et al Icy anopen bioimage informatics platform for extended reprodu-cible research Nat Methods 20129690ndash6

43Sommer C Straehle C Kothe U et al Ilastik interactive learn-ing and segmentation toolkit In Biomedical Imaging fromNano to Macro 2011 IEEE International Symposium on 2011pp 230ndash233 IEEE

44Hamilton PW van Diest PJ Williams R et al Do we see whatwe think we see The complexities of morphological assess-ment J Pathol 2009218285ndash91

45Salto-Tellez M James JA Hamilton PW Molecular path-ologymdashthe value of an integrative approach Mol Oncol201481163ndash8

46Li G van Niekerk D Miller D et al Molecular fixative enablesexpression microarray analysis of microdissected clinicalcervical specimens Exp Mol Pathol 201496168ndash77

47Galon J Mlecnik B Bindea G et al Towards the introductionof the lsquoImmunoscorersquo in the classification of malignant tu-mours J Pathol 2014232199ndash209

48R Core Team R A Language and Environment for StatisticalComputing Vienna Austria R Foundation for StatisticalComputing 2015

49Baier T Neuwirth E Excel COM R Comput Stat 20072291ndash10850McCarty KS Miller LS Cox EB et al Estrogen receptor ana-

lyses Correlation of biochemical and immunohistochemicalmethods using monoclonal antireceptor antibodies ArchPathol Lab Med 1985109716ndash21

51Goulding H Pinder S Cannon P et al A new immunohisto-chemical antibody for the assessment of estrogen receptorstatus on routine formalin-fixed tissue samples Hum Pathol199526291ndash4

52Leek JT Scharpf RB Bravo HC et al Tackling the widespreadand critical impact of batch effects in high-throughput dataNat Rev Genet 201011733ndash9

53Lauss M Visne I Kriegner A et al Monitoring of TechnicalVariation in Quantitative High-Throughput Datasets CancerInform 201312193ndash201

54Codd EF Normalized Data Base Structure a Brief Tutorial InProceedings of the 1971 ACM SIGFIDET (Now SIGMOD)Workshop on Data Description Access and Control SanDiego CA USA 1971 pp 1ndash17 Association for ComputingMachinery New York NY

55Kacprzyk J Zadro _zny S De Tre G Fuzziness in database man-agement systems half a century of developments and futureprospects Fuzzy Sets Syst 2015281300ndash7

56Bosc P Pivert O Farquhar K Integrating fuzzy queries intoan existing database management system an example Int JIntell Syst 19949475ndash92

57Duggal R Khatri SK Shukla B Improving patient matchingsingle patient view for Clinical Decision Support using BigData analytics In 2015 4th International Conference onReliability Infocom Technologies and Optimization (ICRITO)(Trends and Future Directions) Noida India 2015 pp 1ndash6 IEEENew York NY

58OpenSeadragon httpsopenseadragongithubio (7 April2016 date last accessed)

59Allred DC Bustamante MA Daniel CO et alImmunocytochemical analysis of estrogen receptors inhuman breast carcinomas evaluation of 130 cases and re-view of the literature regarding concordance with biochem-ical assay and clinical relevance Arch Surg 1990125107ndash13

60Singh R Chubb L Pantanowitz L et al Standardization in digi-tal pathology supplement 145 of the DICOM standardsJ Pathol Inform 2011223

61Linkert M Rueden CT Allan C et al Metadata matters accessto image data in the real world J Cell Biol 2010189777ndash82

62The HDF Group HDF Group - HDF5 httpswwwhdfgrouporgHDF5 (3 January 2016 date last accessed)

63Sommer C Held M Fischer B et al CellH5 a format for dataexchange in high-content screening Bioinformatics 2013291580ndash2

64Allan C Burel J-M Moore J et al OMERO flexible model-driven data management for experimental biology NatMethods 20129245ndash53

65Maree R Rollus L Stevens B et al Collaborative analysis ofmulti-gigapixel imaging data using Cytomine Bioinformatics2016321395ndash1401

12 | Li et al

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

66DigitalScope V5 httpwwwdigitalscopeorg (26 November2015 date last accessed)

67Astier Y Braha O Bayley H Toward single molecule DNAsequencing direct identification of ribonucleoside and deox-yribonucleoside 5lsquo-monophosphates by using an engineeredprotein nanopore equipped with a molecular adapter J AmChem Soc 20061281705ndash10

68Stoddart D Heron AJ Mikhailova E et al Single-nucleotidediscrimination in immobilized DNA oligonucleotides with abiological nanopore Proc Natl Acad Sci USA 20091067702ndash7

69Flusberg BA Webster D Lee J et al Direct detection of DNAmethylation during single-molecule real-time sequencingNat Methods 20107461ndash5

70Roberts RJ Carneiro MO Schatz MC The advantages of SMRTsequencing Genome Biol 201314405

71Sheth AP Larson JA Federated database systems for manag-ing distributed heterogeneous and autonomous databasesACM Comput Surv 199022183ndash236

72Baker EJ Biological databases for behavioral neurobiology IntRev Neurobiol 201210319ndash38

73Peuroaeuroakkonen P Pakkala D Reference architecture and classifi-cation of technologies products and services for big data sys-tems Big Data Res 20152166ndash86

74Haider S Ballester B Smedley D et al BioMart Central Portalndashunified access to biological data Nucleic Acids Res 200937W23ndash7

75Barrett T Wilhite SE Ledoux P et al NCBI GEO archive forfunctional genomics data setsndashupdate Nucleic Acids Res201341D991ndash5

76Goode A Gilbert B Harkes J et al OpenSlide a vendor-neutralsoftware foundation for digital pathology J Pathol Inform2013427

77Pearson WR Lipman DJ Improved tools for biological se-quence comparison Proc Natl Acad Sci USA 1988852444ndash8

78Berman JJ Edgerton ME Friedman BA The tissue microarraydata exchange specification a community-based opensource tool for sharing tissue microarray data BMC MedInform Decis Mak 200335

79Unified Medical Language System (UMLS) - Home httpswwwnlmnihgovresearchumls (20 November 2015 datelast accessed)

80Komatsoulis GA Warzel DB Hartel FW et al caCORE version3 implementation of a model driven service-oriented archi-tecture for semantic interoperability J Biomed Inform 200841106ndash23

81Covitz PA Hartel F Schaefer C et al caCORE a common infra-structure for cancer informatics Bioinformatics 200319 2404ndash12

82Papatheodorou I Crichton C Morris L et al A metadata ap-proach for clinical data management in translational gen-omics studies in breast cancer BMC Med Genomics 2009266

83Mohanty SK Mistry AT Amin W et al The development anddeployment of Common Data Elements for tissue banks fortranslational research in cancer - an emerging standardbased approach for the Mesothelioma Virtual Tissue BankBMC Cancer 2008891

84 Patel AA Kajdacsy-Balla A Berman JJ et al The developmentof common data elements for a multi-institute prostate can-cer tissue bank the Cooperative Prostate Cancer TissueResource (CPCTR) experience BMC Cancer 20055108

85 Shiny httpshinyrstudiocom (6 April 2016 date lastaccessed)

86 RNET documentation ndash user version j RNET ndash user versionhttpjmp75githubiordotnet (6 April 2016 date lastaccessed)

87 matplotlib python plotting mdash Matplotlib 151 documenta-tion httpmatplotliborg (6 April 2016 date last accessed)

88 D3js - Data-Driven Documents httpsd3jsorg (6 April2016 date last accessed)

89 Fereidouni F Datta-Mitra A Demos S et al Microscopy withUV Surface Excitation (MUSE) for slide-free histology andpathology imaging In Progress in Biomedical Optics andImaging - Proceedings of SPIE San Francisco CA USA 2015p 93180F SPIE Bellingham WA USA

90 Bird B Miljkovic M Romeo MJ et al Infrared micro-spectralimaging distinction of tissue types in axillary lymph nodehistology BMC Clin Pathol 200888

91 Taleb A Diamond J McGarvey JJ et al Raman Microscopy forthe Chemometric Analysis of Tumor Cells J Phys Chem B200611019625ndash31

92 Guze K Pawluk HC Short M et al Pilot study Raman spec-troscopy in differentiating premalignant and malignant orallesions from normal mucosa and benign lesions in humansHead Neck 201537511ndash7

93 McNamee R Regression modelling and other methods tocontrol confounding Occup Environ Med 200562500ndash6

94 dos Santos Silva I Dealing with confounding in the analysisIn Cancer Epidemiology Principles and Methods LyonFrance International Agency for Research on Cancer 1999305ndash31

95 Queenrsquos University Belfast j Postgraduate and ProfessionalDevelopment j MSc Bioinformatics and ComputationalGenomics httpwwwqubacukschoolsmdbspgdPTMScBioinformaticsandComputationalGenomics (24 March2016 date last accessed)

96 Dhir R Patel AA Winters S et al A multidisciplinary ap-proach to honest broker services for tissue banks and clin-ical data a pragmatic and practical model Cancer 20081131705ndash15

97 Boyd AD Hosner C Hunscher DA et al An lsquoHonest Brokerrsquomechanism to maintain privacy for patient care and aca-demic medical research Int J Med Inform 200776407ndash11

98 Pavlou MP Diamandis EP Blasutig IM The long journey ofcancer biomarkers from the bench to the clinic Clin Chem201359147ndash57

99 ELIXIR Data for life httpswwwelixir-europeorg (24March 2016 date last accessed)

100 Research Infrastructure for Imaging Technologies inBiological and Biomedical Sciences j Euro-BioImaginghttpwwweurobioimagingeu (24 March 2016 date lastaccessed)

Integromics in cancer tissue biomarker research | 13

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

Page 13: Embracing an integromic approach to tissue biomarker ... › download › pdf › 33595853.pdf · Her research uses integrative molecular approaches for biomarker discovery and validation

24Leinonen R Akhtar R Birney E et al The European NucleotideArchive Nucleic Acids Res 201139D28ndash31

25The Cancer Genome Atlas Research NetworkComprehensive genomic characterization defines humanglioblastoma genes and core pathways Nature 20084551061ndash8

26Home - The Cancer Genome Atlas - Cancer Genome -TCGAhttpcancergenomenihgov (1 December 2015 datelast accessed)

27Kristensen VN Lingjaeligrde OC Russnes HG et al Principlesand methods of integrative genomic analyses in cancer NatRev Cancer 201414299ndash313

28Guinney J Dienstmann R Wang X et al The consensusmolecular subtypes of colorectal cancer Nat Med 2015211350ndash6

29Gao J Aksoy BA Dogrusoz U et al Integrative analysis ofcomplex cancer genomics and clinical profiles using thecBioPortal Sci Signal 20136pl1

30Cerami E Gao J Dogrusoz U et al The cBio cancer genomicsportal an open platform for exploring multidimensional can-cer genomics data Cancer Discov 20122401ndash4

31Forbes SA Beare D Gunasekaran P et al COSMIC exploringthe worldrsquos knowledge of somatic mutations in human can-cer Nucleic Acids Res 201543D805ndash11

32Broad GDAC Firehose httpgdacbroadinstituteorg (22March 2016 date last accessed)

33Le Cao K-A Gonzalez I Dejean S integrOmics an R packageto unravel relationships between two omics datasetsBioinformatics 2009252855ndash6

34Waller T Gubała T Sarapata K et al DNA microarray integro-mics analysis platform BioData Min 2015818

35Cecic IK Li G MacAulay C Technologies supporting analyt-ical cytology clinical research and drug discovery applica-tions J Biophotonics 20125313ndash26

36Rocha R Vassallo J Soares F et al Digital slides present sta-tus of a tool for consultation teaching and quality control inpathology Pathol Res Pract 2009205735ndash41

37Pantanowitz L Valenstein P Evans A et al Review of the cur-rent state of whole slide imaging in pathology J Pathol Inform2011236

38Brachtel E Yagi Y Digital imaging in pathologyndashcurrent ap-plications and challenges J Biophotonics 20125327ndash35

39Hamilton PW Bankhead P Wang Y et al Digital pathologyand image analysis in tissue biomarker research Methods20147059ndash73

40Schneider CA Rasband WS Eliceiri KW NIH Image to ImageJ25 years of image analysis Nat Methods 20129671ndash5

41Schindelin J Arganda-Carreras I Frise E et al Fiji - an OpenSource platform for biological image analysis Nat Methods20129676ndash82 101038nmeth2019

42de Chaumont F Dallongeville S Chenouard N et al Icy anopen bioimage informatics platform for extended reprodu-cible research Nat Methods 20129690ndash6

43Sommer C Straehle C Kothe U et al Ilastik interactive learn-ing and segmentation toolkit In Biomedical Imaging fromNano to Macro 2011 IEEE International Symposium on 2011pp 230ndash233 IEEE

44Hamilton PW van Diest PJ Williams R et al Do we see whatwe think we see The complexities of morphological assess-ment J Pathol 2009218285ndash91

45Salto-Tellez M James JA Hamilton PW Molecular path-ologymdashthe value of an integrative approach Mol Oncol201481163ndash8

46Li G van Niekerk D Miller D et al Molecular fixative enablesexpression microarray analysis of microdissected clinicalcervical specimens Exp Mol Pathol 201496168ndash77

47Galon J Mlecnik B Bindea G et al Towards the introductionof the lsquoImmunoscorersquo in the classification of malignant tu-mours J Pathol 2014232199ndash209

48R Core Team R A Language and Environment for StatisticalComputing Vienna Austria R Foundation for StatisticalComputing 2015

49Baier T Neuwirth E Excel COM R Comput Stat 20072291ndash10850McCarty KS Miller LS Cox EB et al Estrogen receptor ana-

lyses Correlation of biochemical and immunohistochemicalmethods using monoclonal antireceptor antibodies ArchPathol Lab Med 1985109716ndash21

51Goulding H Pinder S Cannon P et al A new immunohisto-chemical antibody for the assessment of estrogen receptorstatus on routine formalin-fixed tissue samples Hum Pathol199526291ndash4

52Leek JT Scharpf RB Bravo HC et al Tackling the widespreadand critical impact of batch effects in high-throughput dataNat Rev Genet 201011733ndash9

53Lauss M Visne I Kriegner A et al Monitoring of TechnicalVariation in Quantitative High-Throughput Datasets CancerInform 201312193ndash201

54Codd EF Normalized Data Base Structure a Brief Tutorial InProceedings of the 1971 ACM SIGFIDET (Now SIGMOD)Workshop on Data Description Access and Control SanDiego CA USA 1971 pp 1ndash17 Association for ComputingMachinery New York NY

55Kacprzyk J Zadro _zny S De Tre G Fuzziness in database man-agement systems half a century of developments and futureprospects Fuzzy Sets Syst 2015281300ndash7

56Bosc P Pivert O Farquhar K Integrating fuzzy queries intoan existing database management system an example Int JIntell Syst 19949475ndash92

57Duggal R Khatri SK Shukla B Improving patient matchingsingle patient view for Clinical Decision Support using BigData analytics In 2015 4th International Conference onReliability Infocom Technologies and Optimization (ICRITO)(Trends and Future Directions) Noida India 2015 pp 1ndash6 IEEENew York NY

58OpenSeadragon httpsopenseadragongithubio (7 April2016 date last accessed)

59Allred DC Bustamante MA Daniel CO et alImmunocytochemical analysis of estrogen receptors inhuman breast carcinomas evaluation of 130 cases and re-view of the literature regarding concordance with biochem-ical assay and clinical relevance Arch Surg 1990125107ndash13

60Singh R Chubb L Pantanowitz L et al Standardization in digi-tal pathology supplement 145 of the DICOM standardsJ Pathol Inform 2011223

61Linkert M Rueden CT Allan C et al Metadata matters accessto image data in the real world J Cell Biol 2010189777ndash82

62The HDF Group HDF Group - HDF5 httpswwwhdfgrouporgHDF5 (3 January 2016 date last accessed)

63Sommer C Held M Fischer B et al CellH5 a format for dataexchange in high-content screening Bioinformatics 2013291580ndash2

64Allan C Burel J-M Moore J et al OMERO flexible model-driven data management for experimental biology NatMethods 20129245ndash53

65Maree R Rollus L Stevens B et al Collaborative analysis ofmulti-gigapixel imaging data using Cytomine Bioinformatics2016321395ndash1401

12 | Li et al

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

66DigitalScope V5 httpwwwdigitalscopeorg (26 November2015 date last accessed)

67Astier Y Braha O Bayley H Toward single molecule DNAsequencing direct identification of ribonucleoside and deox-yribonucleoside 5lsquo-monophosphates by using an engineeredprotein nanopore equipped with a molecular adapter J AmChem Soc 20061281705ndash10

68Stoddart D Heron AJ Mikhailova E et al Single-nucleotidediscrimination in immobilized DNA oligonucleotides with abiological nanopore Proc Natl Acad Sci USA 20091067702ndash7

69Flusberg BA Webster D Lee J et al Direct detection of DNAmethylation during single-molecule real-time sequencingNat Methods 20107461ndash5

70Roberts RJ Carneiro MO Schatz MC The advantages of SMRTsequencing Genome Biol 201314405

71Sheth AP Larson JA Federated database systems for manag-ing distributed heterogeneous and autonomous databasesACM Comput Surv 199022183ndash236

72Baker EJ Biological databases for behavioral neurobiology IntRev Neurobiol 201210319ndash38

73Peuroaeuroakkonen P Pakkala D Reference architecture and classifi-cation of technologies products and services for big data sys-tems Big Data Res 20152166ndash86

74Haider S Ballester B Smedley D et al BioMart Central Portalndashunified access to biological data Nucleic Acids Res 200937W23ndash7

75Barrett T Wilhite SE Ledoux P et al NCBI GEO archive forfunctional genomics data setsndashupdate Nucleic Acids Res201341D991ndash5

76Goode A Gilbert B Harkes J et al OpenSlide a vendor-neutralsoftware foundation for digital pathology J Pathol Inform2013427

77Pearson WR Lipman DJ Improved tools for biological se-quence comparison Proc Natl Acad Sci USA 1988852444ndash8

78Berman JJ Edgerton ME Friedman BA The tissue microarraydata exchange specification a community-based opensource tool for sharing tissue microarray data BMC MedInform Decis Mak 200335

79Unified Medical Language System (UMLS) - Home httpswwwnlmnihgovresearchumls (20 November 2015 datelast accessed)

80Komatsoulis GA Warzel DB Hartel FW et al caCORE version3 implementation of a model driven service-oriented archi-tecture for semantic interoperability J Biomed Inform 200841106ndash23

81Covitz PA Hartel F Schaefer C et al caCORE a common infra-structure for cancer informatics Bioinformatics 200319 2404ndash12

82Papatheodorou I Crichton C Morris L et al A metadata ap-proach for clinical data management in translational gen-omics studies in breast cancer BMC Med Genomics 2009266

83Mohanty SK Mistry AT Amin W et al The development anddeployment of Common Data Elements for tissue banks fortranslational research in cancer - an emerging standardbased approach for the Mesothelioma Virtual Tissue BankBMC Cancer 2008891

84 Patel AA Kajdacsy-Balla A Berman JJ et al The developmentof common data elements for a multi-institute prostate can-cer tissue bank the Cooperative Prostate Cancer TissueResource (CPCTR) experience BMC Cancer 20055108

85 Shiny httpshinyrstudiocom (6 April 2016 date lastaccessed)

86 RNET documentation ndash user version j RNET ndash user versionhttpjmp75githubiordotnet (6 April 2016 date lastaccessed)

87 matplotlib python plotting mdash Matplotlib 151 documenta-tion httpmatplotliborg (6 April 2016 date last accessed)

88 D3js - Data-Driven Documents httpsd3jsorg (6 April2016 date last accessed)

89 Fereidouni F Datta-Mitra A Demos S et al Microscopy withUV Surface Excitation (MUSE) for slide-free histology andpathology imaging In Progress in Biomedical Optics andImaging - Proceedings of SPIE San Francisco CA USA 2015p 93180F SPIE Bellingham WA USA

90 Bird B Miljkovic M Romeo MJ et al Infrared micro-spectralimaging distinction of tissue types in axillary lymph nodehistology BMC Clin Pathol 200888

91 Taleb A Diamond J McGarvey JJ et al Raman Microscopy forthe Chemometric Analysis of Tumor Cells J Phys Chem B200611019625ndash31

92 Guze K Pawluk HC Short M et al Pilot study Raman spec-troscopy in differentiating premalignant and malignant orallesions from normal mucosa and benign lesions in humansHead Neck 201537511ndash7

93 McNamee R Regression modelling and other methods tocontrol confounding Occup Environ Med 200562500ndash6

94 dos Santos Silva I Dealing with confounding in the analysisIn Cancer Epidemiology Principles and Methods LyonFrance International Agency for Research on Cancer 1999305ndash31

95 Queenrsquos University Belfast j Postgraduate and ProfessionalDevelopment j MSc Bioinformatics and ComputationalGenomics httpwwwqubacukschoolsmdbspgdPTMScBioinformaticsandComputationalGenomics (24 March2016 date last accessed)

96 Dhir R Patel AA Winters S et al A multidisciplinary ap-proach to honest broker services for tissue banks and clin-ical data a pragmatic and practical model Cancer 20081131705ndash15

97 Boyd AD Hosner C Hunscher DA et al An lsquoHonest Brokerrsquomechanism to maintain privacy for patient care and aca-demic medical research Int J Med Inform 200776407ndash11

98 Pavlou MP Diamandis EP Blasutig IM The long journey ofcancer biomarkers from the bench to the clinic Clin Chem201359147ndash57

99 ELIXIR Data for life httpswwwelixir-europeorg (24March 2016 date last accessed)

100 Research Infrastructure for Imaging Technologies inBiological and Biomedical Sciences j Euro-BioImaginghttpwwweurobioimagingeu (24 March 2016 date lastaccessed)

Integromics in cancer tissue biomarker research | 13

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from

Page 14: Embracing an integromic approach to tissue biomarker ... › download › pdf › 33595853.pdf · Her research uses integrative molecular approaches for biomarker discovery and validation

66DigitalScope V5 httpwwwdigitalscopeorg (26 November2015 date last accessed)

67Astier Y Braha O Bayley H Toward single molecule DNAsequencing direct identification of ribonucleoside and deox-yribonucleoside 5lsquo-monophosphates by using an engineeredprotein nanopore equipped with a molecular adapter J AmChem Soc 20061281705ndash10

68Stoddart D Heron AJ Mikhailova E et al Single-nucleotidediscrimination in immobilized DNA oligonucleotides with abiological nanopore Proc Natl Acad Sci USA 20091067702ndash7

69Flusberg BA Webster D Lee J et al Direct detection of DNAmethylation during single-molecule real-time sequencingNat Methods 20107461ndash5

70Roberts RJ Carneiro MO Schatz MC The advantages of SMRTsequencing Genome Biol 201314405

71Sheth AP Larson JA Federated database systems for manag-ing distributed heterogeneous and autonomous databasesACM Comput Surv 199022183ndash236

72Baker EJ Biological databases for behavioral neurobiology IntRev Neurobiol 201210319ndash38

73Peuroaeuroakkonen P Pakkala D Reference architecture and classifi-cation of technologies products and services for big data sys-tems Big Data Res 20152166ndash86

74Haider S Ballester B Smedley D et al BioMart Central Portalndashunified access to biological data Nucleic Acids Res 200937W23ndash7

75Barrett T Wilhite SE Ledoux P et al NCBI GEO archive forfunctional genomics data setsndashupdate Nucleic Acids Res201341D991ndash5

76Goode A Gilbert B Harkes J et al OpenSlide a vendor-neutralsoftware foundation for digital pathology J Pathol Inform2013427

77Pearson WR Lipman DJ Improved tools for biological se-quence comparison Proc Natl Acad Sci USA 1988852444ndash8

78Berman JJ Edgerton ME Friedman BA The tissue microarraydata exchange specification a community-based opensource tool for sharing tissue microarray data BMC MedInform Decis Mak 200335

79Unified Medical Language System (UMLS) - Home httpswwwnlmnihgovresearchumls (20 November 2015 datelast accessed)

80Komatsoulis GA Warzel DB Hartel FW et al caCORE version3 implementation of a model driven service-oriented archi-tecture for semantic interoperability J Biomed Inform 200841106ndash23

81Covitz PA Hartel F Schaefer C et al caCORE a common infra-structure for cancer informatics Bioinformatics 200319 2404ndash12

82Papatheodorou I Crichton C Morris L et al A metadata ap-proach for clinical data management in translational gen-omics studies in breast cancer BMC Med Genomics 2009266

83Mohanty SK Mistry AT Amin W et al The development anddeployment of Common Data Elements for tissue banks fortranslational research in cancer - an emerging standardbased approach for the Mesothelioma Virtual Tissue BankBMC Cancer 2008891

84 Patel AA Kajdacsy-Balla A Berman JJ et al The developmentof common data elements for a multi-institute prostate can-cer tissue bank the Cooperative Prostate Cancer TissueResource (CPCTR) experience BMC Cancer 20055108

85 Shiny httpshinyrstudiocom (6 April 2016 date lastaccessed)

86 RNET documentation ndash user version j RNET ndash user versionhttpjmp75githubiordotnet (6 April 2016 date lastaccessed)

87 matplotlib python plotting mdash Matplotlib 151 documenta-tion httpmatplotliborg (6 April 2016 date last accessed)

88 D3js - Data-Driven Documents httpsd3jsorg (6 April2016 date last accessed)

89 Fereidouni F Datta-Mitra A Demos S et al Microscopy withUV Surface Excitation (MUSE) for slide-free histology andpathology imaging In Progress in Biomedical Optics andImaging - Proceedings of SPIE San Francisco CA USA 2015p 93180F SPIE Bellingham WA USA

90 Bird B Miljkovic M Romeo MJ et al Infrared micro-spectralimaging distinction of tissue types in axillary lymph nodehistology BMC Clin Pathol 200888

91 Taleb A Diamond J McGarvey JJ et al Raman Microscopy forthe Chemometric Analysis of Tumor Cells J Phys Chem B200611019625ndash31

92 Guze K Pawluk HC Short M et al Pilot study Raman spec-troscopy in differentiating premalignant and malignant orallesions from normal mucosa and benign lesions in humansHead Neck 201537511ndash7

93 McNamee R Regression modelling and other methods tocontrol confounding Occup Environ Med 200562500ndash6

94 dos Santos Silva I Dealing with confounding in the analysisIn Cancer Epidemiology Principles and Methods LyonFrance International Agency for Research on Cancer 1999305ndash31

95 Queenrsquos University Belfast j Postgraduate and ProfessionalDevelopment j MSc Bioinformatics and ComputationalGenomics httpwwwqubacukschoolsmdbspgdPTMScBioinformaticsandComputationalGenomics (24 March2016 date last accessed)

96 Dhir R Patel AA Winters S et al A multidisciplinary ap-proach to honest broker services for tissue banks and clin-ical data a pragmatic and practical model Cancer 20081131705ndash15

97 Boyd AD Hosner C Hunscher DA et al An lsquoHonest Brokerrsquomechanism to maintain privacy for patient care and aca-demic medical research Int J Med Inform 200776407ndash11

98 Pavlou MP Diamandis EP Blasutig IM The long journey ofcancer biomarkers from the bench to the clinic Clin Chem201359147ndash57

99 ELIXIR Data for life httpswwwelixir-europeorg (24March 2016 date last accessed)

100 Research Infrastructure for Imaging Technologies inBiological and Biomedical Sciences j Euro-BioImaginghttpwwweurobioimagingeu (24 March 2016 date lastaccessed)

Integromics in cancer tissue biomarker research | 13

at Queens U

niversity of Belfast on June 3 2016

httpbiboxfordjournalsorgD

ownloaded from