Curve groups and breast cancer

Exploring carcinogenesis by gene expression in blood before diagnosis of breast cancer by curve group analysis - the prospective NOWAC postgenome cohort

Note no. SAMBA/18/15

Authors Eiliv Lund, Lars Holden, Hege Bøvelstad, Sandra Plancade, Nicolle Mode, Clara-Cecilie Günther, Gregory Nuel, Jean-Christophe Thalabard, Marit Holden

Date

26 May 2015

Authors Eiliv Lund (1) Lars Holden (2) Hege Bøvelstad (1) Sandra Plancade (3) Nicolle Mode (1, 4) Clara-Cecilie Günther (2) Gregory Nuel (5) Jean-Christophe Thalabard (6) Marit Holden (2)

1) UiT The Arctic University of Norway, Tromsø, Norway

2) Norsk Regnesentral, Oslo, Norway

3) INRA, UR1404 Unité Mathématiques et Informatique Appliquées du Génome à l'Environnement, F78352 Jouy-en-Josas, France

4) National Institute on Aging, National Institutes of Health, Baltimore, MD, USA

5) DR CNRS, INSMI Stochastics and Biology Group (PSB) LPMA, UPMC, Sorbonne University, Paris, France

6) MAP 5, Université Paris Descartes, Sorbonne Paris Cité, France

Norsk Regnesentral

Norsk Regnesentral (Norwegian Computing Center, NR) is a private, independent, non-profit foundation established in 1952. NR carries out contract research and development projects in information and communication technology and applied statistical-mathematical modelling. The clients include a broad range of industrial, commercial and public service organisations in the national as well as the international market. Our scientific and technical capabilities are further developed in co-operation with The Research Council of Norway and key customers. The results of our projects may take the form of reports, software, prototypes, and short courses. A proof of the confidence and appreciation our clients have in us is given by the fact that most of our new contracts are signed with previous customers.

Title Curve groups and breast cancer

Authors Eiliv Lund, Lars Holden, Hege Bøvelstad, Sandra Plancade, Nicolle Mode, Clara-Cecilie Günther, Gregory Nuel, Jean-Christophe Thalabard, Marit Holden

Date 26 May 2015

Year 2015

Publication number SAMBA/18/15

Abstract The understanding of temporal mutational processes in cancer is limited. This analysis aimed at exploring the trajectories for the genes, i.e. the changes in gene expression in blood between breast cancer cases and controls as a function of time to cancer diagnosis. Between 2003 and 2006 almost 50 000 women entered the Norwegian Women and Cancer (NOWAC) postgenome biobank by donating a blood sample preserved for transcriptomic analyses (PAX tube). A total of 637 invasive breast cancer cases were identified through 2009 by linkages to the Cancer Registry of Norway. For each case, a random control matched on birth year and time of blood sampling was selected. After exclusions 441 case-control pairs were available for analyses. The trajectories consist of the differences over time in gene expression between each case and control pair. We present novel non-parametric statistical methods based on hypothesis testing that show whether there is development over time or not, and whether this development varies among the different strata. We introduced the concept of curve groups, where each curve group consists of genes that have a similar development through time. The gene expressions varied with time in the last years before diagnosis, and this development differs among clinical stages for women participating in the National Breast Cancer Screening Program. The differences among the strata appeared larger the last year before diagnosis, compared to earlier years. The curve group analysis revealed significant gene expression differences in blood before diagnosis among strata of clinical stage and mode of detection.

Keywords transcriptomics, gene expression, cohort, breast cancer, carcinogenesis, metastasis, mammographic screening, blood, systems epidemiology

Availability Open

Project number 220 641 and 220633

Research field Bioinformatics

Number of pages 32

© Copyright Norsk Regnesentral


Table of Contents

1 Introduction

2 Material and methods

2.1 Follow-up and register information

3 Laboratory procedures

3.1 Microarray data

3.2 Preprocessing of array data

3.3 Statistical methods

3.4 Hypothesis tests for development in time for each stratum

3.5 Hypothesis test for comparing two strata

3.6 An alternative statistic for comparing two strata

3.7 Computing p-values – permutation tests

4 Results

4.1 Hypothesis tests for development in time for each stratum

4.2 Hypothesis tests for comparing two strata

5 Discussion

6 Conclusion

References

7 Supplementary

Supplementary figure for Table 2 in the paper

Supplementary table for Table 2 in the paper

7.2 Methods for separation of the strata

Supplementary figure for Figure 3 in the paper


1 Introduction

The assumption of systems epidemiology [1] is that functional aspects of human

carcinogenesis might be communicated through blood as gene expression patterns

before diagnosis, either as active signals or as passive information. Recently an editorial

in Nature Medicine [2] advocated the need to change from mouse models to a “human

model” for the understanding of the carcinogenic processes. In observational studies of

humans, the prospective design would be best to incorporate the time aspect of the

carcinogenesis and the changing exposures. On the other hand, analyses of somatic

mutations in cancer genome studies have revealed a huge diversity of the mutational

processes as part of carcinogenesis [3]. One explanation for this observation could be

that multiple mutational processes operate dependently on different biological processes

in subgroups of cancers, thus giving a jumbled composite signature. Because of this

problem of jumbled composite signatures, the functional analyses in observational

studies should be stratified based on important clinical knowledge like node status,

mode of detection and potential exposures.

One approach for prospective functional genomic studies is to compile

trajectories from many independent case-control pair measurements in order to study

the process of carcinogenesis [4]. The trajectory for a gene is a curve that shows the

changes in gene expression in blood as a function of time to cancer diagnosis, and

consists of the differences in gene expression between cases and controls in a nested

case-control design. The controls establish the average (mean) level of gene expression

in women not affected by cancer and provide exposure adjusted analyses. The level of

expression for a gene not involved in the carcinogenic process should be constant, on

average, during years before diagnosis. Genes related to the different stages of the

carcinogenesis could be differentially expressed over time. There is no prior knowledge

about the form of the trajectories for any of the thousands of genes. This lack of a priori

information demands an agnostic approach [5] putting all genes on an equal basis and

adjusting for multiple testing using a false discovery rate [6].

We present a prospective analysis based on the Norwegian Women and Cancer

postgenome study (NOWAC) [7]. The aim was to describe the time-dependent

carcinogenesis process in blood through an agnostic approach and epidemiological

design. The trajectories were analyzed stratified on important clinical factors like lymph


node status at time of diagnosis and the mode of detection, but without identifying

single genes or conducting pathway analyses.

2 Material and methods

The Norwegian Women and Cancer (NOWAC) cohort study is a nation-wide

population-based cancer study initiated in 1991; for detailed information see [8].

Random samples of women based on unique national birth number were drawn from the

central person register by Statistics Norway. Name and address were printed on the

letter of information and the birth number was replaced by a serial number on the

questionnaires. The linkage file for the birth number and the serial number was kept at

Statistics Norway. The questionnaires were returned to the Institute of Community

Medicine, University of Tromsø. Non-responders were mailed one or two reminders.

Between 2003 and 2006 the postgenome biobank collected approximately

50 000 samples nested in the NOWAC cohort, for more details see [7]. Women in the

NOWAC study who completed an eight-page questionnaire with an information letter

introducing blood sampling and who agreed to participate in blood sampling (97.2%)

were eligible for blood donation. Each woman received equipment for blood collection

and a two-page questionnaire. Blood sampling equipment was mailed in batches of 500

to randomly chosen women with one reminder after 4-6 weeks. The blood sampling

consisted of one PAXgene tube (PreAnalytiX GmbH, Hombrechtikon, Switzerland)

with a buffer or stabilization agent for mRNA in order to improve the quality of the

gene expression for genome wide microarray analyses. Blood was primarily drawn at

the family doctor’s office and sent as biological material overnight to Tromsø. Upon

arrival the PAX tubes were immediately frozen.

Altogether 66 072 women were invited through 141 groups and 47 763 (72.3%)

of them returned a blood sample and questionnaire during May 2003 - August 2006. In

addition, 2569 women donated blood at the Mammographic Screening Unit, the

University Hospital of Tromsø from 2004 until 2006. After removing duplicates,

missing blood samples and excluding women who later would leave the study (n=4) a

total of 48 692 blood samples were available for follow-up.


2.1 Follow-up and register information

For the set of unique women belonging to the postgenome biobank, breast cancer cases

through the end of 2009 were identified through linkages to the Cancer Registry of

Norway providing information on incident cases of breast cancer and stage. Altogether

637 cases of invasive breast cancer were reported. For each case a control matched on

time of blood sampling and year of birth was analyzed together with the case.

After removing 16 cases with other cancer, 8 cases with previous breast cancer, 18 pairs where the control was diagnosed with cancer within two years after blood sampling, and 44 pairs defined as outliers (of which 18 case-control pairs were marked as technical outliers), a total of 551 pairs remained. Of these, 83 had missing, incomplete or uncertain clinical information. The 27 cases with blood samples taken more than five years before diagnosis were not included. The eligible cohort consisted of 441 pairs.
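As a check on the exclusion counts above, the arithmetic can be written out. This assumes, as the totals imply, that the 83 pairs with missing, incomplete or uncertain clinical information were excluded along with the 27 pairs sampled more than five years before diagnosis:

$$637 - 16 - 8 - 18 - 44 = 551, \qquad 551 - 83 - 27 = 441.$$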

Information on method of diagnosis, at screening unit or outside, was obtained

from the Cancer Registry of Norway through linkage to the screening database kept by

the National Breast Cancer Screening Program [9]. The cases were reclassified into

node negative or positive (without spread or with spread) based on the pTNM

information from the Cancer Registry of Norway.

Women participating in the screening program diagnosed with breast cancer

consisted of two groups: “Screen detected cancer” and “Interval cancer” detected within

two years after their last mammogram screening. The stratum “Clinical” consists of

women that did not attend screening prior to diagnosis. Cases with a negative

mammogram and with a clinically detected breast cancer more than two years after

screening were included in the Clinical group. Based on this information we could

classify our cases into six strata: «Screening with spread», «Screening without spread»,

«Interval with spread», «Interval without spread», «Clinical with spread», and «Clinical

without spread». The distribution of the selected 441 case-control pairs across the six strata

is shown in Table 1.

3 Laboratory procedures

3.1 Microarray data

To control for technical variability such as different batches of reagents and kits, day-to-day variations, microarray production batches and effects related to different laboratory operators, each case and its random control matched on birth year and month of blood collection were kept together through all procedures, such as extraction, amplification and hybridization. RNA was extracted with the PAXgene Blood miRNA Isolation kit according to the manufacturer’s manual at the NTNU Genomics Core Facility in Trondheim, Norway. RNA quality and purity were assessed using the NanoDrop ND 8000 spectrophotometer (ThermoFisher Scientific; Delaware, USA) and Agilent bioanalyzer (Palo Alto; CA, USA), respectively. RNA amplification was performed on 96-well plates using 300 ng of total RNA and the Illumina TotalPrep-96 RNA Amplification Kit (Ambion Inc; Austin, Texas, USA). The amplification procedure consisted of reverse transcription with a T7 promoter and ArrayScript, followed by a second-strand synthesis. In vitro transcription with T7 RNA polymerase using a biotin-NTP mix produced biotinylated cRNA copies of each mRNA in the sample. All cases and controls were run on either the Illumina HumanWG-6 version 3 expression BeadChip or the HumanHT-12 version 4. The microarray service was provided by the Genomics Core Facility, Norwegian University of Science and Technology. Outliers were excluded after visual examination of dendrograms, principal component analysis plots and density plots. Individuals that were considered borderline outliers were excluded if their laboratory quality measures were below given thresholds (RIN value < 7, 260/280 ratio < 2, 260/230 ratio < 1.7, and 50 < RNA < 500).

3.2 Preprocessing of array data

The dataset was preprocessed as previously described [10]. The dataset consisting of

441 case-control pairs and 30 046 probes was background corrected using negative

control probes and normalized on the original scale using quantile normalization. Data

from the two Illumina chip types (HumanWG-6 v3 and HumanHT-12 v4) were

combined on identical nucleotide universal identifiers (nuID) [11]. We retained probes

present in at least 1 % of the individuals, i.e. in at least 9 of the 882 individuals. If a

gene was represented by more than one probe, only one was selected, resulting in a

dataset with 11 431 probes. The probes were translated to genes using the

IlluminaHumanAll.db database [12]. Finally, the log2-differences of the expression

values for each case-control pair were computed and used in the statistical analyses.

Additional adjustments for possible batch effects were unnecessary due to the matched-

pair processing and focus on differences between pairs.
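The normalization and pairing steps above can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the function names, the matrix layout (probes in rows, samples in columns) and the naive handling of ties are assumptions, and background correction and probe filtering are left out.

```python
import numpy as np


def quantile_normalize(expr):
    """Quantile-normalize the columns (samples) of a probes x samples matrix.

    Every sample is forced onto the same distribution: the mean of the
    sorted values across samples. Ties are handled naively (assumption).
    """
    order = np.argsort(expr, axis=0)
    ranks = np.argsort(order, axis=0)            # rank of each value within its column
    mean_quantiles = np.sort(expr, axis=0).mean(axis=1)
    return mean_quantiles[ranks]


def paired_log2_difference(case_expr, control_expr):
    """log2(case) - log2(control) for matched columns (one column per pair)."""
    return np.log2(case_expr) - np.log2(control_expr)


# Hypothetical toy data: 20 probes, 3 cases followed by their 3 matched controls.
rng = np.random.default_rng(0)
expr = rng.uniform(50.0, 500.0, size=(20, 6))
normalized = quantile_normalize(expr)
X = paired_log2_difference(normalized[:, :3], normalized[:, 3:])
```

Here `X` plays the role of the log2-differences per gene and case-control pair that the statistical analyses operate on.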


3.3 Statistical methods

An original statistical method based on hypothesis testing was developed in order to

detect a functional dependence in gene expression over time, and whether this

functional dependency differed among the different strata. We have developed methods

that are able to identify small changes that are varying slowly in time and/or among

strata, by using a large number of genes in each test. For defining test statistics that

measure development in time and differences among strata, we have introduced the

concept of curve groups, where each curve group consists of genes that have a similar

development in time, i.e., similar trajectories. Below we will describe the methods in

detail.

Let $X_{g,p}$ be the log2-expression difference for case-control pair $p$ and gene $g$. Each case-control pair belongs to a stratum $s$ and a time period $t$, $t = 1, 2, 3$, where $t=1$ is 0-1 year before diagnosis, $t=2$ is 1-2 years before diagnosis and $t=3$ is 3-5 years before diagnosis. We want to test whether $X_{g,p}$ is independent of the time period, and whether there is no difference among the strata, i.e., $X_{g,p}$ is independent of stratum.

3.4 Hypothesis tests for development in time for each stratum

For each stratum we will test whether $X_{g,p}$ is independent of the time period. To define a statistic that measures development in time we first introduce the concept of curve groups:

• For a given stratum $s$, a gene $g$ can belong to zero or one of six curve groups based on the order of the averages of the data over all case-control pairs in the stratum in the three time periods. These averages are denoted $\bar{X}_{g,3,s}$, $\bar{X}_{g,2,s}$ and $\bar{X}_{g,1,s}$, respectively. Six curve groups, called «123», «132», «213», «231», «312» and «321», were defined. The three numbers in the name of a curve group represent the order of time period 3 (left number), the order of time period 2 (middle number) and the order of time period 1 (right number). If, e.g., $\bar{X}_{g,3,s} < \bar{X}_{g,2,s} < \bar{X}_{g,1,s}$, gene $g$ may belong to curve group ‘123’, indicating an increasing gene expression in time when approaching the time of diagnosis. See Figure 1 for an illustration of the concept of curve groups.

• For each curve group we will only include genes with a significant change in gene expression over time. This is done by testing whether the smallest and largest values of $\bar{X}_{g,3,s}$, $\bar{X}_{g,2,s}$ and $\bar{X}_{g,1,s}$ are different using a two-sample t-test (assuming unequal variances). Let $p_{g,c}$ be the p-value of this test. Depending on the statistical question at hand, we define two alternative criteria for concluding that a gene $g$ belongs to the curve group $c$:

o Inclusion criterion 1: Gene $g$ belongs to curve group $c$ if $p_{g,c}$ is below a predefined limit $\alpha$.

o Inclusion criterion 2: Gene $g$ belongs to curve group $c$ if gene $g$ is among the $M$ genes with the lowest $p_{g,c}$-values; see Section 3.5 for its use.

To make a test for development in time, we count for each stratum the number of genes

that belong to the curve group using inclusion criterion 1 defined above. For each

stratum, we then perform seven hypothesis tests, one global test and one for each of the

six curve groups. In the global test the test statistic is the total number of genes which

belong to one of the six curve groups, while in the test for a curve group the test statistic

is the number of genes that belong to this curve group. If the conclusion of the

hypothesis test is that there are more genes in the curve groups than what is expected by

chance, we conclude that there is a significant development in time for some of these

genes.
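The curve group assignment and the counting statistics under inclusion criterion 1 can be sketched as below. This is an illustrative reading of the description above rather than the authors' code: the function names and the data layout (`X` as a genes x pairs matrix for one stratum, `period` as a vector of time period labels) are assumptions, and ties between period averages are not treated specially.

```python
import numpy as np
from scipy import stats

CURVE_GROUPS = ["123", "132", "213", "231", "312", "321"]


def assign_curve_groups(X, period):
    """Return (labels, pvals) for every gene (row) of X within one stratum.

    labels[g] encodes the order of the period averages, written for
    period 3 (left), period 2 (middle) and period 1 (right).
    pvals[g] is the Welch t-test p-value comparing the pairs in the
    period with the smallest average against those with the largest.
    """
    periods = np.array([3, 2, 1])
    means = np.column_stack([X[:, period == t].mean(axis=1) for t in periods])
    ranks = means.argsort(axis=1).argsort(axis=1) + 1   # 1 = smallest average
    labels = np.array(["".join(map(str, r)) for r in ranks])
    lo, hi = means.argmin(axis=1), means.argmax(axis=1)
    pvals = np.empty(X.shape[0])
    for g in range(X.shape[0]):
        a = X[g, period == periods[lo[g]]]
        b = X[g, period == periods[hi[g]]]
        pvals[g] = stats.ttest_ind(a, b, equal_var=False).pvalue
    return labels, pvals


def count_curve_group_genes(labels, pvals, alpha=0.01):
    """Test statistics under inclusion criterion 1: one count per curve group plus a global count."""
    keep = pvals < alpha
    counts = {c: int(np.sum(keep & (labels == c))) for c in CURVE_GROUPS}
    counts["global"] = int(keep.sum())
    return counts


# Hypothetical toy stratum: 2 genes, 5 pairs in each of the 3 time periods.
noise = np.array([-0.1, -0.05, 0.0, 0.05, 0.1])
period = np.repeat([1, 2, 3], 5)
X = np.vstack([
    np.concatenate([2 + noise, 1 + noise, 0 + noise]),  # rises towards diagnosis
    np.concatenate([0 + noise, 1 + noise, 2 + noise]),  # falls towards diagnosis
])
labels, pvals = assign_curve_groups(X, period)
counts = count_curve_group_genes(labels, pvals, alpha=0.01)
```

The seven counts returned by `count_curve_group_genes` correspond to the six curve group tests and the global test; whether a count is larger than expected by chance is judged against a permutation null distribution.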

3.5 Hypothesis test for comparing two strata

We want to test whether there are differences in gene expression between the two strata ‘with spread’ and ‘without spread’ using information from several genes. For each curve group $c$, stratum $s$ and case-control pair $p$, we define a curve group variable $Z_{c,s,p}$ as follows: We select the genes that belong to the curve group $c$ for stratum $s$ using inclusion criterion 2 defined above with $M=100$. Let $G_{c,s}$ denote this set of genes. The curve group variable $Z_{c,s,p}$ for case-control pair $p$ is then computed as the average value of the data $X_{g,p}$ over genes in $G_{c,s}$:

$$Z_{c,s,p} = \frac{1}{100} \sum_{g \in G_{c,s}} X_{g,p}.$$

We can test whether the variables $Z_{c,s,p}$ are different for case-control pairs $p$ between the two strata either for all time periods combined or for each time period separately. Note that the genes are selected based on data from stratum $s$, but the variable may be calculated for case-control pairs $p$ in any stratum. More specifically, assume that we want to test if there is a difference in gene expression between case-control pairs in stratum "with spread" versus stratum "without spread" for curve group ‘123’. Assume that the set of 100 genes $G_{123,\text{spread}}$ is selected using criterion 2 in the "spread" stratum. We then calculate $Z_{123,\text{spread},p}$ for all case-control pairs $p$ in stratum "spread" and $Z_{123,\text{without spread},p'}$ for $p'$ in stratum "without spread", and test if the difference is larger than expected by chance. Note that testing the "with spread" versus "without spread" strata may also be performed with the gene set $G_{123,\text{without spread}}$ selected from the "without spread" stratum or from any of the other defined strata.
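A minimal sketch of the curve group variable follows. The function names are hypothetical, and the comparison statistic (absolute difference in mean $Z$ between the strata) is an assumption made for illustration, since the text only states that the difference is tested against chance. The inputs `labels` and `pvals` are taken to be per-gene curve group labels and t-test p-values computed in the stratum used for gene selection.

```python
import numpy as np


def select_curve_group_genes(labels, pvals, group, M=100):
    """Inclusion criterion 2: indices of the (at most) M genes in `group` with the lowest p-values."""
    idx = np.flatnonzero(labels == group)
    return idx[np.argsort(pvals[idx])[:M]]


def curve_group_variable(X, genes):
    """Z for every case-control pair (column of X): average of X[g, p] over the selected genes."""
    return X[genes].mean(axis=0)


def mean_difference(Z_a, Z_b):
    """Hypothetical comparison statistic: absolute difference in mean Z between two strata."""
    return abs(Z_a.mean() - Z_b.mean())


# Toy example with 4 genes and M=2; the genes are selected in one stratum and
# the same gene set is then applied to pairs from any stratum.
labels = np.array(["123", "123", "321", "123"])
pvals = np.array([0.03, 0.001, 0.0001, 0.02])
genes = select_curve_group_genes(labels, pvals, "123", M=2)
X_spread = np.arange(12.0).reshape(4, 3)        # 4 genes x 3 pairs
Z_spread = curve_group_variable(X_spread, genes)
```

With $M=100$ and a curve group containing at least 100 genes, `curve_group_variable` reduces to the formula for $Z_{c,s,p}$ given above.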

3.6 An alternative statistic for comparing two strata

The test described above focuses on genes that belong to the same curve group. We have also constructed a hypothesis test to compare the difference in time development between two strata that does not depend on curve groups. The test statistic is constructed by first computing the two-sample t-statistic $T_{g,t}$, comparing the difference in gene expression between the two strata for each gene $g$ and time period $t$. We define $F_g = \sum_t w_t |T_{g,t}|$ as the weighted sum of the absolute values of the t-statistics for gene $g$ with weights $w_t$. Further, the test statistic is defined as $L_k = \sum_{g \in G_k} F_g$, where $G_k$ is the set of genes with the $k$ largest $F_g$ values, i.e. $L_k$ is the sum of the $k$ largest $F_g$ values. We observe that $L_k$ is a weighted sum of absolute t-statistics. We used equal weights $w_t = 1/3$ for each time period. Alternatively, the weights could be selected either as proportional to the number of case-control pairs in each time period or with larger values for the pairs with time period closer to the time of diagnosis. In addition to the global test including all three time periods, separate tests for each time period were also performed, in which only data corresponding to each time period were included. This test performed very well on several simulated datasets with a different time development or different gene expression level for some genes in the two strata; for details see [13].

In the Supplementary we use these t-statistics to construct a variable that separates the case-control pairs in the two strata.
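The statistic $L_k$ can be sketched as follows. The text does not say which two-sample t-statistic is used here, so the unequal-variance (Welch) form is an assumption, as are the function names and the genes x pairs data layout.

```python
import numpy as np


def welch_t(a, b):
    """Row-wise two-sample t-statistic with unequal variances (assumed form)."""
    va = a.var(axis=1, ddof=1) / a.shape[1]
    vb = b.var(axis=1, ddof=1) / b.shape[1]
    return (a.mean(axis=1) - b.mean(axis=1)) / np.sqrt(va + vb)


def top_k_statistic(X_a, period_a, X_b, period_b, k, weights=(1 / 3, 1 / 3, 1 / 3)):
    """L_k: sum of the k largest F_g, where F_g = sum_t w_t * |T_{g,t}|."""
    F = np.zeros(X_a.shape[0])
    for w, t in zip(weights, (1, 2, 3)):
        T = welch_t(X_a[:, period_a == t], X_b[:, period_b == t])
        F += w * np.abs(T)
    return np.sort(F)[::-1][:k].sum()


# Toy data: 2 genes, 2 pairs per time period in each stratum.
period = np.array([1, 1, 2, 2, 3, 3])
X_a = np.array([[1.0, 3.0, 1.0, 3.0, 1.0, 3.0],
                [0.0, 2.0, 0.0, 2.0, 0.0, 2.0]])
X_b = np.array([[0.0, 2.0, 0.0, 2.0, 0.0, 2.0],
                [0.0, 2.0, 0.0, 2.0, 0.0, 2.0]])
L1 = top_k_statistic(X_a, period, X_b, period, k=1)
L2 = top_k_statistic(X_a, period, X_b, period, k=2)
```

A test for a single time period can be obtained in this sketch by setting the weights of the other two periods to zero.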

3.7 Computing p-values – permutation tests

In all tests described above, we compute p-values by estimating the null distribution for

the statistic of the hypothesis test by randomizing the data. In the hypothesis test for a

given stratum where we test for development in time, the null model is estimated by

randomizing case-control pairs for that stratum between time periods, while in the

hypothesis tests where two strata are compared, the null model is estimated by

randomizing case-control pairs between the two strata for each time period. Note that


these randomization algorithms maintain the correlation structure between the genes for

each case-control pair. Also note that the curve groups are redefined before a sample of

the null model is computed from a randomized dataset. The p-value of the test is set to $(K+1)/(N+1)$, where $N$ is the total number of randomizations and $K$ is the number of randomizations out of $N$ with a more extreme statistic than the statistic for the real data [14]. In the results presented we have used $N = 1000$.
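The randomization scheme for the test of development in time can be sketched as below. The function name and signature are hypothetical; `statistic` stands for any function of the data and the time period labels, for example a curve group count, and is expected to redefine the curve groups internally on each randomized dataset.

```python
import numpy as np


def permutation_p_value(statistic, X, period, n_perm=1000, seed=0):
    """Permutation p-value (K + 1) / (N + 1) for a statistic of (X, period).

    Time period labels are shuffled across the case-control pairs of one
    stratum. Columns of X are never altered, so the correlation structure
    between genes within each pair is kept.
    """
    rng = np.random.default_rng(seed)
    observed = statistic(X, period)
    k = 0
    for _ in range(n_perm):
        shuffled = rng.permutation(period)
        if statistic(X, shuffled) > observed:   # strictly more extreme
            k += 1
    return (k + 1) / (n_perm + 1)


def period_mean_gap(X, period):
    """Toy statistic: mean in period 1 minus mean in period 3."""
    return float(X[:, period == 1].mean() - X[:, period == 3].mean())


# Toy data in which the observed labelling gives the largest possible gap,
# so no randomization can be strictly more extreme.
X = np.array([[5.0, 5.0, 1.0, 1.0, 0.0, 0.0]])
period = np.array([1, 1, 2, 2, 3, 3])
p = permutation_p_value(period_mean_gap, X, period, n_perm=199)
```

For comparing two strata, the same pattern would apply with stratum labels shuffled within each time period instead of time period labels within a stratum.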

4 Results

4.1 Hypothesis tests for development in time for each stratum

A time trend was considered present if there were more genes in the curve groups than

expected by chance. Results for the different strata are presented in Table 2. In the first

panel we compared all pairs with spread to all pairs without spread. The results were

non-significant indicating no changes in gene expression over time when not stratifying

on mode of detection. Stratifying cases on participation in the screening or not revealed

significant time trends in cases with spread either found at screening or as interval

cancers, as more p-values are less than 0.05 than we would expect by chance. Further

stratification on all modes of detection showed that the effect mainly was restricted to

interval cancers with spread. In the tests we have used inclusion criterion 1 with

$\alpha = 0.01$. In Figure 4 in the Supplementary we show how the results depend on the $\alpha$-values. We conclude that the results are not very sensitive to the choice of $\alpha$-values and that $\alpha = 0.01$ is a reasonable choice.

4.2 Hypothesis tests for comparing two strata

Based on the results from the previous section, we restrict our analysis to comparing the gene expression in the two strata «Screening or interval with spread» and «Screening or interval without spread» using the curve group variable $Z_{c,s,p}$ described in the methods section. P-values obtained by testing whether the curve group variables $Z_{c,s,p}$ are different in the two strata are shown in Table 3. Note that many of the p-values are below 0.05 and that some are smaller than 0.01. In Figure 2 we illustrate how to use the gene expression data to separate the two strata by showing the curve group variable $Z_{c,s,p}$ for each case-control pair $p$ in the different strata. The plot shows that the difference between the two strata changes over time for the two most significant $Z_{c,s,p}$ variables. Black and red points are separated, which indicates that the differences between the strata (with spread and without spread) were larger the last year before diagnosis than in earlier years, and that this difference holds for both screening and interval cases (red/black circles and red/black triangles, respectively). Nevertheless, the differences were not large and the ability to predict the clinical stage for individual cases remains limited. However, it may be possible to develop a procedure that can separate a subgroup of the case-control pairs without spread from the remaining case-control pairs, i.e. it should be possible to predict the clinical stage for some of the cases without spread, but not for all.

In the methods section we introduced the statistic $L_k$, a weighted sum of t-statistics, as an alternative to the curve group variables $Z_{c,s,p}$ for comparing the gene expression levels of two strata. In Figure 3 we plot the p-value in a hypothesis test with $L_k$ as test statistic against the number of genes $k$. The plot shows that the gene expression levels are different in the two strata. The p-values decrease with increasing number of genes used in the calculation of $L_k$. When we used 50 genes, the p-value was about 0.05, and the p-value decreased to below 0.02 when we used the 1000 most significant genes. This indicates that the difference between the strata is present in a large number of genes, but is so weak that the strongest result was obtained when including a large number of genes. Also, notice that time period 1, the last year before diagnosis, contributed most to the low p-values. This is in accordance with the results shown in Figure 2 and Table 3. In Figure 5 in the Supplementary we illustrate how to separate the two strata for each case-control pair using a variable that corresponds to the test statistic $L_k$ used in Figure 3.

5 Discussion

This explorative analysis has shown that it is possible to significantly discriminate the

time trend of gene expression patterns observed before diagnosis. The findings are

based on an original approach for the statistical analysis of time dependent curves of

gene expression in the NOWAC postgenome cohort. The methods could also be used

for other aspects of functional genomics like methylation. These findings deserve to be

further interpreted in relation to the biology of both single genes and gene pathways.

The prospective analyses of gene expression in the years preceding diagnosis as


assessed by the log-fold change between cases and controls showed significant

differences in the curve groups according to stratification as defined by mode of

detection and node status of the cases at time of diagnosis.

Studies of gene expression in peripheral blood are challenging as they are

exposed to many difficulties and pitfalls. The ubiquitous degradation by RNase reduces

the quality of mRNA for whole genome analyses in most biobanks except for those with

a buffer or directly frozen in liquid nitrogen. The signals related to carcinogenesis are

expected to be much weaker than in tumor tissue and can be confounded by signals

from exposures to carcinogens or other lifestyle factors. The problem of noise due to the

complicated study object of carcinogenesis, the need for adequate epidemiological

design including exposure information and blood sampling, complicated technology and

development of robust statistics could make the approach unsuccessful. The prospective

design made it difficult to increase the statistical power of the study, so interpretation of

the results should be made carefully.

To the best of our knowledge, the NOWAC postgenome cohort is the largest

population based prospective cancer study designed for transcriptomic studies based on

buffered RNA. All parts of the analyses are done within the same cohort framework of

NOWAC. In the NOWAC postgenome cohort a single laboratory processed all samples

using the same technology, thus reducing analytical bias and batch effects. The cohort

design reduced selection bias. A weakness of a prospective study could be the change of

case-control status as controls became cases over time, thus reducing the differences in

gene expression within a pair. We removed all pairs where controls were diagnosed

with breast cancer or another cancer in a period of at least two years after blood

sampling. Unfortunately no repeated sampling of blood and questionnaires was

conducted. Repeated measurements would allow better analyses by making it possible to

use intra-individual comparisons over time.

One stratification factor was based on the mode of detection. In Norway, the

National Mammographic Screening Program for breast cancer started in 1996 with

complete coverage of the population from 2005 [9]. It has been estimated that the

introduction of population based mammographic screening in Norway gave a mean

sojourn time for invasive cancer of 4.0 years in women aged 50-59 years and 6.6 years

for those 60-69 years [15]. Analyses of breast cancer carcinogenesis as a time dependent

process should therefore take into consideration that cases diagnosed at the


mammographic screening program are diagnosed at an earlier phase of carcinogenesis

and thus not directly comparable to clinically detected cancers.

Secondly, node status has for a hundred years been the most important prognostic factor in breast cancer treatment. In one of the earliest publications, from Yale in 1920, the five-year survival of metastatic cancer was 15% [16]. Even in Norway after the Second World War, the observed five-year survival rate in node-positive cases was 25% [17]. What can be observed before diagnosis or treatment are thus the signals from a deadly disease. The starting time of the metastatic growth is unknown. The time from initiation of metastases to their diagnosis has been estimated at 5.8 years [18]. At time of diagnosis we had a censored distribution of tumors in which the mode of detection determines the time of diagnosis irrespective of the underlying carcinogenic process. Differences in gene expression in blood at diagnosis between node-positive and node-negative tumors have been described in a small clinical study without controls [19].

The finding of pervasive but small changes in gene expression in blood before diagnosis of breast cancer could have several explanations, depending on the view taken of carcinogenesis. One conclusion of the cancer genome project was that remarkably little is known about the process of carcinogenesis [1]. The interpretation of the gene expression trajectories should therefore be exploratory. Human observational findings should not necessarily be related to existing models of carcinogenesis, since these are based mainly on animal experiments. Among the models currently debated are the driver-passenger model [20], the mathematical multistage model or the two-stage clonal model [21], the oncogene addiction hypothesis [22], the hallmarks of cancer with a current focus on the immune system [23] and the exposure-driven model [24].

The findings of a strong effect of stratification by stage and mode of detection could have implications for the construction of predictive or prognostic clinical tests. Most studies with tumor tissues have not taken into account the different biological signals from cancers at different stages. Stratification could improve the sensitivity and specificity of such tests.

From a statistical point of view, the Cox proportional hazards model and its extensions have been widely used by epidemiologists since the seminal work by Cox [25] for analyzing cohort studies with time-varying covariates. It has also been adapted for case-control designs [26], and extensions have been proposed for covariates measured with noise [27] and for time-changing coefficients [28]. More recently, the addition of high-dimensional covariates such as gene expression data has raised challenging statistical issues [29]. While the characteristics and the basic assumptions of the Cox model are adapted to the dimensionality and the very specific paired design of the NOWAC study, the Cox model is not fully adapted to the estimation of changes in the gene expression curves or to the biological interpretation of gene pathways.

An agnostic search for time trends depends on a sensitive statistical approach. We have presented two novel statistical methods which demonstrate that gene expression varies with time in the last years before diagnosis and that this development in time differs between clinical stages for participants inside a screening program. One of the methods focuses on identifying genes with specific functional dependencies in time within a given clinical stage. The other method focuses on differences in gene expression between clinical stages in the different time periods. Hence, the two methods focus on different aspects of the functional dependency of gene expression on time relative to diagnosis. Both methods give significant results when we use many genes, and the data from the last year before diagnosis contribute the most to this result. As the gene expression data are very noisy, both methods use information from several genes simultaneously to increase the power of the hypothesis tests. We found that the differences between the strata (with spread and without spread) are larger in the last year before diagnosis than in earlier years, but that the differences are small and the ability to predict the clinical stage for individual cases is limited. However, it is possible to separate a subgroup of the case-control pairs without spread from the remaining case-control pairs (Figure 2) and to predict the clinical stage for some, but not all, cases without spread.

A potential weakness of the curve group approach could be the increasing number of curve groups as the number of observed time periods increases. With four time periods we will need 24 curve groups (4!), and with five time periods 120 (5!).
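The growth in the number of curve groups can be illustrated with a short Python sketch. It assumes that a curve group is simply one ordering of the time-period labels, as the six labels 123, 132, 312, 321, 231 and 213 in Table 2 suggest. The function names and the convention that a label lists the periods from lowest to highest mean are illustrative assumptions, not the definition used in the paper.

```python
from itertools import permutations


def curve_groups(n_periods):
    """Return every ordering of the period labels 1..n_periods as a string."""
    labels = range(1, n_periods + 1)
    return ["".join(str(x) for x in perm) for perm in permutations(labels)]


def assign_curve_group(period_means):
    """Label a gene by listing its periods from lowest to highest mean.

    period_means maps a period label to the mean case-control difference
    in that period. This ordering convention is an assumption made for
    illustration only.
    """
    ordered = sorted(period_means, key=period_means.get)
    return "".join(str(p) for p in ordered)


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(n, "periods:", len(curve_groups(n)), "curve groups")
    # Hypothetical period means for a single gene
    print(assign_curve_group({1: 0.30, 2: -0.10, 3: 0.05}))
```

With three periods the sketch lists the six groups that appear in the tables; the count grows factorially with the number of periods, which is the weakness noted above.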

6 Conclusion

The findings indicate that gene expression in blood before diagnosis might be used as a biomarker of disease extent. These findings could be viewed as a proof of concept of systems epidemiology indicating the potential of including gene expression for functional analysis in prospective studies of cancer.


References

1. Lund E, Dumeaux V. Systems Epidemiology in Cancer. Cancer Epidemiology Biomarkers & Prevention. 2008;17(11):2954-7. doi:10.1158/1055-9965.epi-08-0519.
2. Of men, not mice. Nat Med. 2013;19(4):379-.
3. Alexandrov LB, Nik-Zainal S, Wedge DC, Aparicio SAJR, Behjati S, Biankin AV et al. Signatures of mutational processes in human cancer. Nature. 2013;500(7463):415-21. doi:10.1038/nature12477.
4. Lund E, Plancade S. Transcriptional output in a prospective design conditionally on follow-up and exposure: the multistage model of cancer. International Journal of Molecular Epidemiology and Genetics. 2012;3(2):107-14.
5. Spitz MR, Bondy ML. The evolving discipline of molecular epidemiology of cancer. Carcinogenesis. 2010;31(1):127-34. doi:10.1093/carcin/bgp246.
6. Reiner A, Yekutieli D, Benjamini Y. Identifying differentially expressed genes using false discovery rate controlling procedures. Bioinformatics. 2003;19(3):368-75. doi:10.1093/bioinformatics/btf877.
7. Dumeaux V, Borresen-Dale A-L, Frantzen J-O, Kumle M, Kristensen V, Lund E. Gene expression analyses in breast cancer epidemiology: the Norwegian Women and Cancer postgenome cohort study. Breast Cancer Research. 2008;10(1):R13.
8. Lund E, Dumeaux V, Braaten T, Hjartåker A, Engeset D, Skeie G et al. Cohort Profile: The Norwegian Women and Cancer Study (NOWAC) Kvinner og kreft. Int J Epidemiol. 2008;37(1):36-41. doi:10.1093/ije/dym137.
9. Hofvind S, Geller B, Vacek PM, Thoresen S, Skaane P. Using the European guidelines to evaluate the Norwegian Breast Cancer Screening Program. European Journal of Epidemiology. 2007;22(7):447-55. doi:10.2307/27822793.
10. Günther C, Holden M, Holden L. Preprocessing of gene-expression data related to breast cancer diagnosis: SAMBA/35/14.
11. Du P, Kibbe W, Lin S. nuID: a universal naming scheme of oligonucleotides for Illumina, Affymetrix, and other microarrays. Biology Direct. 2007;2(1):16.
12. Carlson M. lumiHumanAll.db: Illumina Human Illumina expression annotation data (chip lumiHumanAll). R Package version 1.22.0.
13. Holden L. Classify strata. NR note SAMBA/11/15.
14. Phipson B, Smyth GK. Permutation P-values Should Never Be Zero: Calculating Exact P-values When Permutations Are Randomly Drawn. Stat Appl Genet Mol Biol. 2010;31(9). doi:10.2202/1544-6115.1585.
15. Weedon-Fekjær H, Lindqvist BH, Vatten LJ, Aalen OO, Tretli S. Estimating mean sojourn time and screening sensitivity using questionnaire data on time since previous screening. Journal of Medical Screening. 2008;15(2):83-90. doi:10.1258/jms.2008.007071.
16. Todd M, Shoag M, Cadman E. Survival of women with metastatic breast cancer at Yale from 1920 to 1980. Journal of Clinical Oncology. 1983;1(6):406-8.
17. Survival of cancer patients: cases diagnosed in Norway 1953-1967. The Norwegian Cancer Society and The Cancer Registry in Norway. Oslo; 1975.
18. Engel J, Eckel R, Kerr J, Schmidt M, Fürstenberger G, Richter R et al. The process of metastasisation for breast cancer. European Journal of Cancer. 2003;39(12):1794-806.
19. Zuckerman NS, Yu H, Simons DL, Bhattacharya N, Carcamo-Cavazos V, Yan N et al. Altered local and systemic immune profiles underlie lymph node metastasis in breast cancer patients. International Journal of Cancer. 2013;132(11):2537-47. doi:10.1002/ijc.27933.
20. Stratton MR, Campbell PJ, Futreal PA. The cancer genome. Nature. 2009;458(7239):719-24.
21. Vineis P, Schatzkin A, Potter JD. Models of carcinogenesis: an overview. Carcinogenesis. 2010;31(10):1703-9. doi:10.1093/carcin/bgq087.
22. Felsher DW. Oncogene Addiction versus Oncogene Amnesia: Perhaps More than Just a Bad Habit? Cancer Research. 2008;68(9):3081-6. doi:10.1158/0008-5472.can-07-5832.
23. Hanahan D, Weinberg RA. The Hallmarks of Cancer. Cell. 2000;100(1):57-70. doi:10.1016/S0092-8674(00)81683-9.
24. Lund E. An exposure driven functional model of carcinogenesis. Medical Hypotheses. 2011;77(2):195-8. doi:10.1016/j.mehy.2011.04.009.
25. Cox DR. Regression Models and Life-Tables. Journal of the Royal Statistical Society Series B (Methodological). 1972;34(2):187-220. doi:10.2307/2985181.
26. Aalen OO, Borgan Ø, Gjessing HK. Survival and Event History Analysis. A Process Point of View, Statistics for Biology and Health. New York. Springer; 2008. 504p.
27. Hu P, Tsiatis AA, Davidian M. Estimating the Parameters in the Cox Model When Covariate Variables are Measured with Error. Biometrics. 1998;54(4):1407-19. doi:10.2307/2533667.
28. O'Quigley J. Proportional Hazards Regression. Statistics for Biology and Health. Springer; 2008.
29. Benner A, Zucknick M, Hielscher T, Ittrich C, Mansmann U. High-Dimensional Cox Models: The Choice of Penalty as Part of the Model Building Process. Biometrical Journal. 2010;52(1):50-69. doi:10.1002/bimj.200900064.


Table 1 Number of case-control pairs in each stratum and time period for the dataset. The stratum “Clinical” consists of i) women who attended screening, but more than two years before diagnosis; and ii) women who did not attend screening prior to diagnosis.

Columns give the year before diagnosis, with the time period in parentheses.

| Stratum | | 5-3 (3) | 2 (2) | 1 (1) |
|---|---|---|---|---|
| Screening: Diagnosed at a screening visit | Spread | 41 | 11 | 6 |
| | Not spread | 118 | 42 | 43 |
| Interval: Diagnosed within two years of a screening visit | | | | |
| Clinical: Outside the screening program | | | | |



Table 2 P-values obtained when testing whether there are more genes in the curve groups than what is expected by chance. We have used inclusion criterion 1 with $\alpha = 0.01$. P-values below 0.05 are highlighted in yellow. The observed and expected number of genes in each curve group are shown in Table 4 in the Supplementary.

p-value

| Curve group | Screening, interval or clinical: with spread | Screening, interval or clinical: without spread | Screening or interval: with spread | Screening or interval: without spread |
|---|---|---|---|---|
| Global | 0.78 | 0.27 | 0.01 | 0.20 |
| 123 | 0.61 | 0.23 | 0.02 | 0.39 |
| 132 | 0.49 | 0.13 | 0.008 | 0.11 |
| 312 | 0.88 | 0.18 | 0.13 | 0.11 |
| 321 | 0.41 | 0.74 | 0.02 | 0.66 |
| 231 | 0.74 | 0.68 | 0.50 | 0.57 |
| 213 | 0.58 | 0.17 | 0.48 | 0.13 |

p-value

| Curve group | Screening with spread | Screening without spread | Interval with spread | Interval without spread | Clinical with spread | Clinical without spread |
|---|---|---|---|---|---|---|
| Global | 0.36 | 0.43 | 0.02 | 0.46 | 0.40 | 0.81 |
| 123 | 0.10 | 0.33 | 0.21 | 0.89 | 0.06 | 0.34 |
| 132 | 0.38 | 0.19 | 0.009 | 0.32 | 0.51 | 0.63 |
| 312 | 0.83 | 0.30 | 0.07 | 0.21 | 0.98 | 0.81 |
| 321 | 0.18 | 0.90 | 0.05 | 0.40 | 0.22 | 0.66 |
| 231 | 0.33 | 0.63 | 0.21 | 0.83 | 0.94 | 0.93 |
| 213 | 0.70 | 0.27 | 0.29 | 0.16 | 0.90 | 0.59 |


Table 3 P-values obtained when testing whether the curve group variables $Z_{c,s,p}$ are different in the two strata «Screening or interval with spread» and «Screening or interval without spread». P-values below 0.05 are highlighted in yellow.

Columns headed $Z_{c,s_1,p}$ use genes selected based on stratum $s_1$ = «Screening or interval with spread»; columns headed $Z_{c,s_2,p}$ use genes selected based on stratum $s_2$ = «Screening or interval without spread». Entries in the curve group rows are p-values.

| | $Z_{c,s_1,p}$, period $t$ = 3 | $Z_{c,s_1,p}$, period $t$ = 2 | $Z_{c,s_1,p}$, period $t$ = 1 | $Z_{c,s_2,p}$, period $t$ = 3 | $Z_{c,s_2,p}$, period $t$ = 2 | $Z_{c,s_2,p}$, period $t$ = 1 |
|---|---|---|---|---|---|---|
| N1 | 69 | 20 | 12 | 69 | 20 | 12 |
| N2 | 148 | 57 | 53 | 148 | 57 | 53 |
| Curve group $c$ = 123 | 0.22 | 0.59 | 0.02 | 0.53 | 0.11 | 0.08 |
| 132 | 0.90 | 0.005 | 0.004 | 0.71 | 0.11 | 0.009 |
| 312 | 0.80 | 0.27 | 0.15 | 0.04 | 0.009 | 0.001 |
| 321 | 0.12 | 0.98 | 0.24 | 0.35 | 0.72 | 0.15 |
| 231 | 0.26 | 0.45 | 0.78 | 0.34 | 0.38 | 0.23 |
| 213 | 0.53 | 0.45 | 0.65 | 0.36 | 0.04 | 0.08 |

‘N1’ is the number of case-control pairs in the stratum «Screening or interval with spread» in the time period $t$, while ‘N2’ is the number of case-control pairs in the stratum «Screening or interval without spread» in the time period $t$.


Figure 1 Example of two different curve groups: ‘123’ (upper) and ‘132’ (lower). In the left panels, curves for 20 genes from the given curve group are plotted. For illustrative purposes, the curves have been estimated from the data using splines. In the middle panels, the data for one of the 20 genes are shown with the corresponding spline-estimated curve. The points represent the differences in gene expression $X_{g,p}$ for each case-control pair. The mean values in each time period, $\bar{X}_{g,3,s}$, $\bar{X}_{g,2,s}$ and $\bar{X}_{g,1,s}$, are shown in red. The right panels are similar to the middle panels except that the plotted data are the mean values computed over the 20 genes in the left panel.


Figure 2 Plot of two of the most significant curve group variables $Z_{c,s,p}$ for the screening and interval strata. «With spread 132» on the x-axis denotes that $s$ in $Z_{c,s,p}$ is the stratum «Screening or interval with spread» and $c$ is curve group 132, while «Without spread 312» on the y-axis denotes that $s$ in $Z_{c,s,p}$ is the stratum «Screening or interval without spread» and $c$ is curve group 312.


Figure 3 The p-value in a hypothesis test with test statistic $L_k$, a weighted sum of t-statistics, plotted against the number of genes $k$ used in the calculation of $L_k$. The two strata that are compared using $L_k$ are «Screening or interval with spread» and «Screening or interval without spread».


7 Supplementary

7.1 Hypothesis tests for development in time for each stratum

Supplementary figure for Table 2 in the paper

In Table 2 in the paper we presented p-values obtained when testing whether there are more genes in the curve groups than what is expected by chance. In these tests we used inclusion criterion 1 with $\alpha = 0.01$. A small $\alpha$-value implies that we only include genes with a strong trend. Figure 4 shows how the p-values depend on the value of $\alpha$. We observe that the p-values are not sensitive to the choice of $\alpha$ except that $\alpha$ should be closer to 0 than to 1.
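This section does not spell out how the p-values are computed. As a minimal sketch, suppose the null distribution of the number of genes in a curve group were obtained from random permutations of the data, so that a p-value that can never be zero, in the style of reference [14], would apply. The function name `permutation_p_value` and the input `null_counts` are hypothetical, and the null counts in the usage example are made up for illustration.

```python
def permutation_p_value(observed, null_counts):
    """Upper-tailed permutation p-value with the +1 correction.

    observed    : gene count seen in the real data
    null_counts : gene counts from randomly permuted data (assumed input)
    The +1 in numerator and denominator keeps the p-value above zero.
    """
    exceed = sum(1 for value in null_counts if value >= observed)
    return (exceed + 1) / (len(null_counts) + 1)


if __name__ == "__main__":
    # Made-up null counts, for illustration only
    hypothetical_null = [480, 495, 470, 510, 1400]
    print(permutation_p_value(1360, hypothetical_null))
```

The smallest p-value such a test can return is 1/(m + 1) for m permutations, so the number of permutations limits how small a reported p-value can be.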



Figure 4 P-values obtained when testing whether there are more genes in a curve group than expected, plotted against $\alpha$, the parameter of inclusion criterion 1. The horizontal dotted line indicates a p-value equal to 0.05. The data for the stratum «With spread» consist of the data for «Screening with spread», «Interval with spread» and «Clinical with spread», and similarly for the stratum «Without spread».

Supplementary table for Table 2 in the paper

In Table 2 in the paper we presented p-values obtained when testing whether there are more genes in the curve groups than what is expected by chance. In these tests we used inclusion criterion 1 with $\alpha = 0.01$. Table 4 shows the observed number and the expected number of genes in each curve group. It is important that the number of genes in each curve group is not too small: very small numbers would indicate that the chosen $\alpha$-value was too small, weakening the power of the test. The table shows that this is not the case.

Table 4 The observed number of genes in each curve group with expected number of genes in parenthesis. The cases with a p-value below 0.05 in Table 2 are highlighted in yellow.

Observed number of genes (expected number of genes)

| Curve group | Screening, interval or clinical: with spread | Screening, interval or clinical: without spread | Screening or interval: with spread | Screening or interval: without spread |
|---|---|---|---|---|
| Global | 305 (513) | 609 (535) | 1360 (482) | 708 (547) |
| 123 | 47 (76) | 97 (82) | 259 (70) | 69 (86) |
| 132 | 69 (100) | 171 (103) | 518 (99) | 205 (107) |
| 312 | 37 (102) | 145 (105) | 171 (105) | 203 (108) |
| 321 | 66 (82) | 40 (82) | 314 (77) | 46 (82) |
| 231 | 38 (77) | 44 (81) | 48 (66) | 51 (82) |
| 213 | 48 (76) | 112 (82) | 50 (65) | 134 (83) |

Observed number of genes (expected number of genes)

| Curve group | Screening with spread | Screening without spread | Interval with spread | Interval without spread | Clinical with spread | Clinical without spread |
|---|---|---|---|---|---|---|
| Global | 475 (464) | 490 (547) | 1233 (485) | 471 (525) | 448 (491) | 302 (502) |
| 123 | 139 (75) | 78 (85) | 101 (81) | 33 (90) | 233 (84) | 83 (83) |
| 132 | 81 (91) | 141 (106) | 515 (92) | 96 (97) | 52 (84) | 54 (90) |
| 312 | 43 (96) | 107 (109) | 237 (89) | 123 (96) | 18 (82) | 40 (92) |
| 321 | 115 (82) | 29 (82) | 213 (81) | 71 (83) | 101 (83) | 45 (77) |
| 231 | 63 (63) | 46 (82) | 92 (70) | 31 (78) | 21 (77) | 27 (77) |
| 213 | 34 (58) | 89 (83) | 75 (73) | 117 (81) | 23 (81) | 53 (83) |


7.2 Methods for separation of the strata

Supplementary figure for Figure 3 in the paper

In Figure 3 in the paper we illustrated our ability to separate two strata based on the t-statistics $T_{g,t}$ for each gene $g$ and time period $t$. We can also illustrate the separation of the two strata by calculating a variable $Y_{p,k}$ for each case-control pair $p$, using information from several genes simultaneously to show that there are differences in gene expression between the two strata. We define $Y_{p,k}$ such that it is low for case-control pairs from one stratum and high for case-control pairs from the other stratum:

$$Y_{p,k} = \frac{1}{k} \sum_{g \in G_k} X_{g,p} \cdot \operatorname{sign}(T_{g,t}),$$

where $t$ is the time period for case-control pair $p$, and $\operatorname{sign}(T_{g,t})$ is 1 if $T_{g,t}$ is positive, -1 otherwise. Here, $G_k$ is the set of genes with the $k$ largest $F_g$-values (defined in the paper). Figure 5 shows the $Y_{p,1000}$ values for the case-control pairs $p$ in the different strata. Notice that there is a separation between spread and not spread in period 1 and between interval with spread and interval without spread in period 2. The separation is such that some, but not all, of the pairs without spread have smaller values that seem to lie outside the range of the values for pairs with spread.
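The definition of the separation variable can be sketched in a few lines of NumPy. The array layout (genes in rows, case-control pairs in columns), the zero-based period index and the function name `separation_score` are assumptions made for this illustration; the gene differences, t-statistics and F-values are taken as given inputs rather than computed here.

```python
import numpy as np


def separation_score(X, T, F, periods, k):
    """Mean signed case-control difference over the k top-ranked genes.

    X       : (n_genes, n_pairs) differences in gene expression per pair
    T       : (n_genes, n_periods) t-statistic per gene and time period
    F       : (n_genes,) ranking statistic; the k largest define the gene set
    periods : (n_pairs,) column index into T giving each pair's time period
    """
    top = np.argsort(F)[::-1][:k]        # indices of the k largest F-values
    t_for_pair = T[top][:, periods]      # shape (k, n_pairs)
    signs = np.where(t_for_pair > 0, 1.0, -1.0)  # 1 if positive, -1 otherwise
    return (X[top] * signs).mean(axis=0)


if __name__ == "__main__":
    # Tiny made-up example: 3 genes, 2 case-control pairs, 2 periods
    X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
    T = np.array([[1.0, -1.0], [-1.0, 1.0], [2.0, 2.0]])
    F = np.array([0.5, 3.0, 2.0])
    periods = np.array([0, 1])
    print(separation_score(X, T, F, periods, k=2))
```

Flipping each gene by the sign of its t-statistic makes all selected genes pull in the same direction, so averaging over many noisy genes accumulates the stratum difference instead of cancelling it.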


Figure 5 Plot of the variable $Y_{p,k}$, where $k = 1000$ and where genes have been selected based on data in the period of case-control pair $p$ (upper) or on data in all three periods (lower).
