Curve groups and breast cancer
Exploring carcinogenesis by gene expression in blood before diagnosis of breast cancer by curve group analysis - the prospective NOWAC postgenome cohort
Note no. SAMBA/18/15
Authors Eiliv Lund, Lars Holden, Hege Bøvelstad, Sandra Plancade, Nicolle Mode, Clara-Cecilie Günther, Gregory Nuel, Jean-Christophe Thalabard, Marit Holden
Date
26 May 2015
Authors Eiliv Lund (1) Lars Holden (2) Hege Bøvelstad (1) Sandra Plancade (3) Nicolle Mode (1, 4) Clara-Cecilie Günther (2) Gregory Nuel (5) Jean-Christophe Thalabard (6) Marit Holden (2)
1) UiT The Arctic University of Norway, Tromsø, Norway
2) Norsk Regnesentral, Oslo, Norway
3) INRA, UR1404 Unité Mathématiques et Informatique Appliquées du Génome à l'Environnement, F78352 Jouy-en-Josas, France
4) National Institute on Aging, National Institutes of Health, Baltimore, MD, USA
5) DR CNRS, INSMI Stochastics and Biology Group (PSB) LPMA, UPMC, Sorbonne University, Paris, France
6) MAP 5, Universite Paris Descartes, Sorbonne Paris Cite, France
Norsk Regnesentral
Norsk Regnesentral (Norwegian Computing Center, NR) is a private, independent, non-profit foundation established in 1952. NR carries out contract research and development projects in information and communication technology and applied statistical-mathematical modelling. The clients include a broad range of industrial, commercial and public service organisations in the national as well as the international market. Our scientific and technical capabilities are further developed in co-operation with The Research Council of Norway and key customers. The results of our projects may take the form of reports, software, prototypes, and short courses. A proof of the confidence and appreciation our clients have in us is given by the fact that most of our new contracts are signed with previous customers.
Title Curve groups and breast cancer
Authors Eiliv Lund, Lars Holden, Hege Bøvelstad, Sandra Plancade, Nicolle Mode, Clara-Cecilie Günther, Gregory Nuel, Jean-Christophe Thalabard, Marit Holden
Date 26 May 2015
Year 2015
Publication number SAMBA/18/15
Abstract The understanding of temporal mutational processes in cancer is limited. This analysis aimed at exploring the trajectories for the genes, i.e. the changes in gene expression in blood between breast cancer cases and controls as a function of time to cancer diagnosis. Between 2003 and 2006 almost 50 000 women entered the Norwegian Women and Cancer (NOWAC) postgenome biobank by donating a blood sample preserved for transcriptomic analyses (PAX tube). A total of 637 invasive breast cancer cases were identified through 2009 by linkages to the Cancer Registry of Norway. For each case, a random control matched on birth year and time of blood sampling was selected. After exclusions 441 case-control pairs were available for analyses. The trajectories consist of the differences over time in gene expression between each case and control pair. We present novel non-parametric statistical methods based on hypothesis testing that show whether there is development over time or not, and whether this development varies among the different strata. We introduced the concept of curve groups, where each curve group consists of genes that have a similar development through time. The gene expressions varied with time in the last years before diagnosis, and this development differs among clinical stages for women participating in the National Breast Cancer Screening Program. The differences among the strata appeared larger the last year before diagnosis, compared to earlier years. The curve group analysis revealed significant gene expression differences in blood before diagnosis among strata of clinical stage and mode of detection.
Keywords transcriptomics, gene expression, cohort, breast cancer, carcinogenesis, metastasis, mammographic screening, blood, systems epidemiology
Availability Open
Project number 220 641 and 220633
Research field Bioinformatics
Number of pages 32
© Copyright Norsk Regnesentral
Table of Contents
1 Introduction
2 Material and methods
2.1 Follow-up and register information
3 Laboratory procedures
3.1 Microarray data
3.2 Preprocessing of array data
3.3 Statistical methods
3.4 Hypothesis tests for development in time for each stratum
3.5 Hypothesis test for comparing two strata
3.6 An alternative statistic for comparing two strata
3.7 Computing p-values – permutation tests
4 Results
4.1 Hypothesis tests for development in time for each stratum
4.2 Hypothesis tests for comparing two strata
5 Discussion
6 Conclusion
References
7 Supplementary
7.1 Hypothesis tests for development in time for each stratum
Supplementary figure for Table 2 in the paper
Supplementary table for Table 2 in the paper
7.2 Methods for separation of the strata
Supplementary figure for Figure 3 in the paper
1 Introduction
The assumption of systems epidemiology [1] is that functional aspects of human
carcinogenesis might be communicated through blood as gene expression patterns
before diagnosis, either as active signals or as passive information. Recently an editorial
in Nature Medicine [2] advocated the need to change from mouse models to a “human
model” for the understanding of the carcinogenic processes. In observational studies of
humans, the prospective design would be best to incorporate the time aspect of the
carcinogenesis and the changing exposures. On the other hand, analyses of somatic
mutations in cancer genome studies have revealed a huge diversity of the mutational
processes as part of carcinogenesis [3]. One explanation for this observation could be
that multiple mutational processes operate dependently on different biological processes
in subgroups of cancers, thus giving a jumbled composite signature. Due to the
problems of jumbled composite signature, the functional analyses in observational
studies should be stratified based on important clinical knowledge like node status,
mode of detection and potential exposures.
One approach for prospective functional genomic studies is to compile
trajectories from many independent case-control pair measurements in order to study
the process of carcinogenesis [4]. The trajectory for a gene is a curve that shows the
changes in gene expression in blood as a function of time to cancer diagnosis, and
consists of the differences in gene expression between cases and controls in a nested
case-control design. The controls establish the average (mean) level of gene expression
in women not affected by cancer and provide exposure adjusted analyses. The level of
expression for a gene not involved in the carcinogenic process should be constant, on
average, during years before diagnosis. Genes related to the different stages of the
carcinogenesis could be differentially expressed over time. There is no prior knowledge
about the form of the trajectories for any of the thousands of genes. This lack of a priori
information demands an agnostic approach [5] putting all genes on an equal basis and
adjusting for multiple testing using a false discovery rate [6].
We present a prospective analysis based on the Norwegian Women and Cancer
postgenome study (NOWAC) [7]. The aim was to describe the time-dependent
carcinogenesis process in blood through an agnostic approach and epidemiological
design. The trajectories were analyzed stratified on important clinical factors like lymph
node status at time of diagnosis and the mode of detection, but without identifying
single genes or conducting pathway analyses.
2 Material and methods
The Norwegian Women and Cancer (NOWAC) cohort study is a nation-wide
population-based cancer study initiated in 1991; for detailed information see [8].
Random samples of women based on unique national birth number were drawn from the
central person register by Statistics Norway. Name and address were printed on the
letter of information and the birth number was replaced by a serial number on the
questionnaires. The linkage file for the birth number and the serial number was kept at
Statistics Norway. The questionnaires were returned to the Institute of Community
Medicine, University of Tromsø. Non-responders were mailed one or two reminders.
Between 2003 and 2006 the postgenome biobank collected approximately
50 000 samples nested in the NOWAC cohort, for more details see [7]. Women in the
NOWAC study who completed an eight-page questionnaire with an information letter
introducing blood sampling and who agreed to participate in blood sampling (97.2%)
were eligible for blood donation. Each woman received equipment for blood collection
and a two-page questionnaire. Blood sampling equipment was mailed in batches of 500
to randomly chosen women with one reminder after 4-6 weeks. The blood sampling
consisted of one PAXgene tube (PreAnalytiX GmbH, Hombrechtikon, Switzerland)
with a buffer or stabilization agent for mRNA in order to improve the quality of the
gene expression for genome wide microarray analyses. Blood was primarily drawn at
the family doctor’s office and sent as biological material overnight to Tromsø. Upon
arrival the PAX tubes were immediately frozen.
Altogether 66 072 women were invited through 141 groups and 47 763 (72.3%)
of them returned a blood sample and questionnaire during May 2003 - August 2006. In
addition, 2569 women donated blood at the Mammographic Screening Unit, the
University Hospital of Tromsø from 2004 until 2006. After removing duplicates,
missing blood samples and excluding women who later would leave the study (n=4) a
total of 48 692 blood samples were available for follow-up.
2.1 Follow-up and register information
For the set of unique women belonging to the postgenome biobank, breast cancer cases
through the end of 2009 were identified through linkages to the Cancer Registry of
Norway providing information on incident cases of breast cancer and stage. Altogether
637 cases of invasive breast cancer were reported. For each case a control matched on
time of blood sampling and year of birth was analyzed together with the case.
After removing 16 cases with other cancer, 8 cases with previous breast cancer, 18 pairs
where the control was diagnosed with cancer within two years after blood sampling and
44 pairs defined as outliers of which 18 case-control pairs were marked technical
outliers, a total of 551 pairs remained. Of these, 83 had missing, incomplete or uncertain
clinical information. The 27 cases with blood samples taken more than five years before
diagnosis were also excluded. The eligible cohort consisted of 441 pairs.
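As a consistency check, the exclusion counts reported in this paragraph add up as follows:

$$637 - 16 - 8 - 18 - 44 = 551, \qquad 551 - 83 - 27 = 441.$$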
Information on method of diagnosis, at screening unit or outside, was obtained
from the Cancer Registry of Norway through linkage to the screening database kept by
the National Breast Cancer Screening Program [9]. The cases were reclassified into
node negative or positive (without spread or with spread) based on the pTNM
information from the Cancer Registry of Norway.
Women participating in the screening program diagnosed with breast cancer
consisted of two groups: “Screen detected cancer” and “Interval cancer” detected within
two years after their last mammogram screening. The stratum “Clinical” consists of
women that did not attend screening prior to diagnosis. Cases with a negative
mammogram and with a clinically detected breast cancer more than two years after
screening were included in the Clinical group. Based on this information we could
classify our cases into six strata: «Screening with spread», «Screening without spread»,
«Interval with spread», «Interval without spread», «Clinical with spread», and «Clinical
without spread». The distribution of the selected 441 case-control pairs across the six strata
is shown in Table 1.
3 Laboratory procedures
3.1 Microarray data
To control for technical variability such as different batches of reagents and kits, day to
day variations, microarray production batches and effects related to different laboratory
operators, each case and its random control matched on birth year and month of blood
collection were kept together through all procedures like extraction, amplification and
hybridization. RNA extraction used the PAXgene Blood miRNA Isolation kit according
to the manufacturer’s manual at the NTNU Genomic Core Facility in Trondheim,
Norway. RNA quality and purity were assessed using the NanoDrop ND 8000
spectrophotometer (ThermoFisher Scientific; Delaware, USA) and Agilent bioanalyzer
(Palo Alto, CA, USA), respectively. RNA amplification was performed on 96-well plates
using 300 ng of total RNA and the Illumina TotalPrep-96 RNA Amplification Kit
(Ambion Inc., Austin, Texas, USA). The amplification procedure consisted of reverse
transcription with a T7 promoter and ArrayScript, followed by a second-strand
synthesis. In vitro transcription with T7 RNA polymerase using a biotin-NTP mix
produced biotinylated cRNA copies of each mRNA in the sample. All cases and
controls were run on either the Illumina HumanWG-6 version 3 expression bead chip or the
HumanHT-12 version 4. The microarray service was provided by the Genomics Core
Facility, Norwegian University of Science and Technology. Outliers were excluded
after visual examination of dendrograms, principal component analysis plots and
density plots. Individuals that were considered as borderline outliers were excluded if
their laboratory quality measures were below given thresholds (RIN value < 7,
260/280 ratio < 2, 260/230 ratio < 1.7, and 50 < RNA < 500).
3.2 Preprocessing of array data
The dataset was preprocessed as previously described [10]. The dataset, consisting of
441 case-control pairs and 30 046 probes, was background corrected using negative
control probes and normalized on the original scale using quantile normalization. Data
from the two Illumina chip types (HumanWG-6 v3 and HumanHT-12 v4) were
combined on identical nucleotide universal identifiers (nuID) [11]. We retained probes
present in at least 1 % of the individuals, i.e. in at least 9 of the 882 individuals. If a
gene was represented with more than one probe only one was selected resulting in a
dataset with 11 431 probes. The probes were translated to genes using the
IlluminaHumanAll.db database [12]. Finally, the log2-differences of the expression
values for each case-control pair were computed and used in the statistical analyses.
Additional adjustments for possible batch effects were unnecessary due to the matched-
pair processing and focus on differences between pairs.
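The last preprocessing step can be illustrated with a minimal Python sketch that computes the log2-differences for one case-control pair. The function name and the intensity values are hypothetical, and the inputs are assumed to be already background corrected and quantile normalized on the original scale.

```python
import math


def log2_pair_differences(case_expr, control_expr):
    """Return log2(case) - log2(control) for each probe of one matched pair."""
    if len(case_expr) != len(control_expr):
        raise ValueError("case and control must cover the same probes")
    return [math.log2(c) - math.log2(k) for c, k in zip(case_expr, control_expr)]


# Hypothetical normalized intensities for three probes of one pair.
case = [200.0, 50.0, 80.0]
control = [100.0, 100.0, 80.0]
diffs = log2_pair_differences(case, control)
# A positive value means higher expression in the case than in its control.
```

Because each difference is taken within a pair that was processed together, effects shared by both samples of the pair cancel in the subtraction.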
3.3 Statistical methods
An original statistical method based on hypothesis testing was developed in order to
detect a functional dependence in gene expression over time, and whether this
functional dependency differed among the different strata. We have developed methods
that are able to identify small changes that are varying slowly in time and/or among
strata, by using a large number of genes in each test. For defining test statistics that
measure development in time and differences among strata, we have introduced the
concept of curve groups, where each curve group consists of genes that have a similar
development in time, i.e., similar trajectories. Below we will describe the methods in
detail.
Let $X_{g,p}$ be the log2-expression difference for case-control pair $p$ and gene $g$. Each case-
control pair belongs to a stratum $s$ and a time period $t$, $t = 1, 2, 3$, where $t=1$ is 0-1 year
before diagnosis, $t=2$ is 1-2 years before diagnosis and $t=3$ is 3-5 years before diagnosis.
We want to test whether $X_{g,p}$ is independent of the time period, and whether there is no
difference among the strata, i.e., whether $X_{g,p}$ is independent of stratum.
3.4 Hypothesis tests for development in time for each stratum
For each stratum we will test whether $X_{g,p}$ is independent of the time period. To define
a statistic that measures development in time we first introduce the concept of curve
groups:
• For a given stratum $s$, a gene $g$ can belong to zero or one of six curve groups
based on the order of the average of the data over all case-control pairs in the
stratum in the three time periods. These averages are denoted $\bar{X}_{g,3,s}$, $\bar{X}_{g,2,s}$ and
$\bar{X}_{g,1,s}$, respectively. Six curve groups, called «123», «132», «213», «231», «312» and
«321», were defined. The three numbers in each name of a curve
group represent the order of time period 3 (left number), the order of time
period 2 (middle number) and the order of time period 1 (right number). If, e.g.,
$\bar{X}_{g,3,s} < \bar{X}_{g,2,s} < \bar{X}_{g,1,s}$, gene $g$ may belong to curve group ‘123’, indicating an
increasing gene expression in time when approaching the time of diagnosis. See
Figure 1 for an illustration of the concept of curve groups.
• For each curve group we will only include genes with a significant change in
gene expression over time. This is done by testing whether the smallest and
largest values of $\bar{X}_{g,3,s}$, $\bar{X}_{g,2,s}$ and $\bar{X}_{g,1,s}$ are different using a two-sample t-test
(assuming unequal variances). Let $p_{g,c}$ be the p-value of this test. Depending
on the statistical question at hand, we define two alternative criteria for
concluding that a gene $g$ belongs to the curve group $c$:
o Inclusion criterion 1: Gene $g$ belongs to curve group $c$ if $p_{g,c}$ is below
a predefined limit $\alpha$.
o Inclusion criterion 2: Gene $g$ belongs to curve group $c$ if gene $g$ is
among the $M$ genes with the lowest $p_{g,c}$-values; see the next section.
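The assignment of one gene to a candidate curve group can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names and the data are hypothetical, ties between period means are treated as "no curve group", and only the unequal-variance (Welch) t statistic is computed. Turning it into the p-value $p_{g,c}$ additionally needs a t distribution, for example via `scipy.stats.ttest_ind(a, b, equal_var=False)`.

```python
from math import sqrt
from statistics import mean, variance


def curve_group_label(mean_t3, mean_t2, mean_t1):
    """Name the curve group from the order of the three period means.

    The label lists the rank (1 = smallest) of period 3, then period 2,
    then period 1. Returns None when two means tie (an assumption here).
    """
    means = [mean_t3, mean_t2, mean_t1]
    if len(set(means)) < 3:
        return None
    ordered = sorted(means)
    return "".join(str(ordered.index(m) + 1) for m in means)


def welch_t(sample_a, sample_b):
    """Two-sample t statistic assuming unequal variances."""
    se = sqrt(variance(sample_a) / len(sample_a)
              + variance(sample_b) / len(sample_b))
    return (mean(sample_a) - mean(sample_b)) / se


# Hypothetical log2 differences for one gene in one stratum, keyed by time period.
x_by_period = {
    3: [-0.2, -0.1, -0.3],
    2: [0.0, 0.1, -0.1],
    1: [0.3, 0.2, 0.4],
}
period_means = {t: mean(v) for t, v in x_by_period.items()}
label = curve_group_label(period_means[3], period_means[2], period_means[1])

# Compare the periods with the smallest and the largest mean.
low = min(period_means, key=period_means.get)
high = max(period_means, key=period_means.get)
t_stat = welch_t(x_by_period[high], x_by_period[low])
```

In this toy example the means increase from period 3 to period 1, so the candidate curve group is ‘123’. Whether the gene is actually included then depends on the chosen inclusion criterion.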
To make a test for development in time, we count for each stratum the number of genes
that belong to the curve group using inclusion criterion 1 defined above. For each
stratum, we then perform seven hypothesis tests, one global test and one for each of the
six curve groups. In the global test the test statistic is the total number of genes which
belong to one of the six curve groups, while in the test for a curve group the test statistic
is the number of genes that belong to this curve group. If the conclusion of the
hypothesis test is that there are more genes in the curve groups than what is expected by
chance, we conclude that there is a significant development in time for some of these
genes.
3.5 Hypothesis test for comparing two strata
We want to test whether there are differences in gene expression between the two strata
‘with spread’ and ‘without spread’ using information from several genes. For each
curve group $c$, stratum $s$ and case-control pair $p$, we define a curve group variable $Z_{c,s,p}$
as follows: We select the genes that belong to the curve group $c$ for stratum $s$ using
inclusion criterion 2 defined above with $M=100$. Let $G_{c,s}$ denote this set of genes. The
curve group variable $Z_{c,s,p}$ for case-control pair $p$ is then computed as the average value
of the data $X_{g,p}$ over genes in $G_{c,s}$:
$$Z_{c,s,p} = \frac{1}{100} \sum_{g \in G_{c,s}} X_{g,p}.$$
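The computation of the curve group variable can be sketched in Python. The gene names, p-values and expression differences below are hypothetical, and the toy example uses $M=2$ instead of $M=100$ to stay readable. Dividing by the size of the gene set equals the factor $1/100$ whenever the set contains exactly 100 genes.

```python
def select_top_genes(p_values, m):
    """Inclusion criterion 2: the m genes with the lowest p-values."""
    return sorted(p_values, key=p_values.get)[:m]


def curve_group_variable(x_pair, gene_set):
    """Average the log2 differences of one case-control pair over gene_set."""
    return sum(x_pair[g] for g in gene_set) / len(gene_set)


# Hypothetical p-values p_{g,c} for one curve group in the selecting stratum.
p_values = {"geneA": 0.001, "geneB": 0.2, "geneC": 0.03}
gene_set = select_top_genes(p_values, m=2)

# Hypothetical log2 differences X_{g,p} for one pair, possibly from another stratum.
x_pair = {"geneA": 0.4, "geneB": -1.0, "geneC": 0.2}
z = curve_group_variable(x_pair, gene_set)
```

The gene set is fixed once from the selecting stratum and then applied unchanged to any pair.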
We can test whether the variables $Z_{c,s,p}$ are different for case-control pairs $p$ between
the two strata either for all time periods combined or for each time period separately.
Note that the genes are selected based on data from stratum $s$, but the variable may be
calculated for case-control pairs $p$ in any stratum. More specifically, assume that we want
to test if there is a difference in gene expression between case-control pairs in stratum
"with spread" versus stratum "without spread" for curve group ‘123’. Assume that the
set of 100 genes $G_{123,\text{spread}}$ is selected using criterion 2 in the "spread" stratum. We then
calculate $Z_{123,\text{spread},p}$ for all case-control pairs $p$ in stratum "spread" and $Z_{123,\text{without spread},p'}$
for $p'$ in stratum "without spread", and test if the difference is larger than expected by
chance. Note that testing the "with spread" versus "without spread" strata may also be
performed with the set of genes $G_{123,\text{without spread}}$ selected from the "without
spread" stratum or from any of the other defined strata.
3.6 An alternative statistic for comparing two strata
The test described above focuses on genes that belong to the same curve group. We
have also constructed a hypothesis test to compare the difference in time development
between two strata that does not depend on curve groups. The test statistic is
constructed by first computing the two-sample t-statistic $T_{g,t}$, comparing the difference
in gene expression between the two strata for each gene $g$ and time period $t$. We define
$F_g = \sum_t w_t |T_{g,t}|$ as the weighted sum of the absolute values of the t-statistics for gene $g$
with weight $w_t$. Further, the test statistic is defined as $L_k = \sum_{g \in G_k} F_g$, where $G_k$ is the
set of genes with the $k$ largest $F_g$ values, i.e. $L_k$ is the sum of the $k$ largest $F_g$ values.
We observe that $L_k$ is a weighted sum of absolute t-statistics. We used equal weights $w_t = 1/3$
for each time period. Alternatively, the weights could be selected either as proportional
to the number of case-control pairs in each time period or with larger values for the
pairs with time period closer to the time of diagnosis. In addition to the global test
including all three time periods, separate tests for each time period were also performed,
in which only data corresponding to each time period were included. This test
performed very well on several simulated datasets with a different time development or
different gene expression level for some genes for two strata, for details see [13].
In the Supplementary we use these t-statistics to construct a variable that separates the
case-control pairs in the two strata.
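The statistic $L_k$ can be sketched in Python. The per-gene t-statistics below are hypothetical placeholders, assumed to have been computed beforehand for the three time periods, and the function names are illustrative only.

```python
def f_statistic(t_by_period, weights):
    """F_g: weighted sum of the absolute t-statistics of one gene."""
    return sum(w * abs(t) for w, t in zip(weights, t_by_period))


def l_statistic(t_stats, k, weights=(1 / 3, 1 / 3, 1 / 3)):
    """L_k: sum of the k largest F_g values over all genes."""
    f_values = sorted(
        (f_statistic(t, weights) for t in t_stats.values()), reverse=True
    )
    return sum(f_values[:k])


# Hypothetical t-statistics T_{g,t} for t = 1, 2, 3.
t_stats = {
    "geneA": (3.0, -1.5, 0.0),   # F = 1.5
    "geneB": (0.3, 0.3, -0.3),   # F = 0.3
    "geneC": (-2.4, 0.6, 0.0),   # F = 1.0
}
l_2 = l_statistic(t_stats, k=2)  # sums the F values of geneA and geneC
```

Because absolute values are used, genes that differ between the strata in opposite directions both add to $L_k$ rather than cancelling.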
3.7 Computing p-values – permutation tests
In all tests described above, we compute p-values by estimating the null distribution for
the statistic of the hypothesis test by randomizing the data. In the hypothesis test for a
given stratum where we test for development in time, the null model is estimated by
randomizing case-control pairs for that stratum between time periods, while in the
hypothesis tests where two strata are compared, the null model is estimated by
randomizing case-control pairs between the two strata for each time period. Note that
these randomization algorithms maintain the correlation structure between the genes for
each case-control pair. Also note that the curve groups are redefined before a sample of
the null model is computed from a randomized dataset. The p-value of the test is set to
$(K+1)/(N+1)$, where $N$ is the total number of randomizations and $K$ is the number of
randomizations out of $N$ with a more extreme statistic than the statistic for the real data
[14]. In the results presented we have used $N = 1000$.
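The p-value formula and the label randomization can be sketched in Python. This assumes that "more extreme" means strictly larger than the observed statistic, which is an interpretation rather than something stated in the text. The function names and numbers are hypothetical.

```python
import random


def permutation_p_value(observed, null_stats):
    """Return (K + 1) / (N + 1) with K = count of null statistics > observed."""
    k = sum(1 for s in null_stats if s > observed)
    return (k + 1) / (len(null_stats) + 1)


def randomize_periods(periods, rng):
    """Shuffle time-period labels across the case-control pairs of a stratum.

    Only the labels move; each pair keeps its full vector of gene values,
    so the correlation structure between genes is preserved.
    """
    shuffled = list(periods)
    rng.shuffle(shuffled)
    return shuffled


rng = random.Random(0)
periods = [1, 1, 2, 2, 3, 3]             # hypothetical labels for six pairs
shuffled = randomize_periods(periods, rng)

# Hypothetical null statistics from four randomizations; one exceeds 5.0.
p = permutation_p_value(5.0, [1.0, 2.0, 6.0, 3.0])   # (1 + 1) / (4 + 1)
```

With $N = 1000$ the smallest attainable p-value under this formula is $1/1001$, reached when no randomized statistic exceeds the observed one.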
4 Results
4.1 Hypothesis tests for development in time for each stratum
A time trend was considered present if there were more genes in the curve groups than
expected by chance. Results for the different strata are presented in Table 2. In the first
panel we compared all pairs with spread to all pairs without spread. The results were
non-significant indicating no changes in gene expression over time when not stratifying
on mode of detection. Stratifying cases on participation in the screening or not revealed
significant time trends in cases with spread either found at screening or as interval
cancers, as more p-values are less than 0.05 than we would expect by chance. Further
stratification on all modes of detection showed that the effect mainly was restricted to
interval cancers with spread. In the tests we have used inclusion criterion 1 with
$\alpha = 0.01$. In Figure 4 in the Supplementary we show how the results depend on the $\alpha$-values.
We conclude that the results are not very sensitive to the choice of $\alpha$-values and
that $\alpha = 0.01$ is a reasonable choice.
4.2 Hypothesis tests for comparing two strata
Based on the results from the previous section, we restrict our analysis to comparing the
gene expression in the two strata «Screening or interval with spread» and «Screening or
interval without spread» using the curve group variable $Z_{c,s,p}$ described in the methods
section. P-values obtained by testing whether the curve group variables $Z_{c,s,p}$ are
different in the two strata are shown in Table 3. Note that many of the p-values are
below 0.05 and that some are smaller than 0.01. In Figure 2 we illustrate how to use the
gene expression data to separate the two strata by showing the curve group variable
$Z_{c,s,p}$ for each case-control pair $p$ in the different strata. The plot showed that the
difference between the two strata changes over time for the two most significant $Z_{c,s,p}$
variables. Black and red points are separated, which indicates that the differences
between the strata (with spread and without spread) were larger in the last year before
diagnosis than in earlier years, and that this difference holds for both screening and
interval cases (red/black circles and red/black triangles, respectively). Nevertheless, the
differences were not large and the ability to predict the clinical stage for individual
cases remains limited. However, it may be possible to develop a procedure that can
separate a subgroup of the case-control pairs without spread from the remaining case-
control pairs, i.e. it should be possible to predict the clinical stage for some of the cases
without spread, but not for all.
In the methods section we introduced the statistic $L_k$, a weighted sum of absolute
t-statistics, as an alternative to the curve group variables $Z_{c,s,p}$ for comparing the gene
expression levels of two strata. In Figure 3 we plot the p-value in a hypothesis test with
$L_k$ as test statistic against the number of genes $k$. The plot shows that the gene
expression levels are different in the two strata. The p-values decrease with increasing
number of genes used in the calculation of $L_k$. When we used 50 genes, the p-value was about
0.05, and the p-value decreased to below 0.02 when we used the 1000 most significant
genes. This indicates that the difference between the strata is present in a large number of
genes, but is so weak that the strongest result was obtained when including a large number
of genes. Also, notice that time period 1, the last year before diagnosis, contributed
most to the low p-values. This is in accordance with the results shown in Figure 2 and
Table 3. In Figure 5 in the Supplementary we illustrate how to separate the two strata
for each case-control pair using a variable that corresponds to the test statistic $L_k$ used in
Figure 3.
5 Discussion
This explorative analysis has shown that it is possible to significantly discriminate the
time trend of gene expression patterns observed before diagnosis. The findings are
based on an original approach for the statistical analysis of time dependent curves of
gene expression in the NOWAC postgenome cohort. The methods could also be used
for other aspects of functional genomics like methylation. These findings deserve to be
further interpreted in relation to the biology of both single genes and gene pathways.
The prospective analyses of gene expression in the years preceding diagnosis as
assessed by the log-fold change between cases and controls showed significant
differences in the curve groups according to stratification as defined by mode of
detection and node status of the cases at time of diagnosis.
Studies of gene expression in peripheral blood are challenging as they are
exposed to many difficulties and pitfalls. The ubiquitous degradation by RNase reduces
the quality of mRNA for whole genome analyses in most biobanks except for those with
a buffer or directly frozen in liquid nitrogen. The signals related to carcinogenesis are
expected to be much weaker than in tumor tissue and can be confounded by signals
from exposures to carcinogens or other lifestyle factors. The problem of noise due to the
complicated study object of carcinogenesis, the need for adequate epidemiological
design including exposure information and blood sampling, complicated technology and
development of robust statistics could make the approach unsuccessful. The prospective
design made it difficult to increase the statistical power of the study, so interpretation of
the results should be made carefully.
To the best of our knowledge, the NOWAC postgenome cohort is the largest
population based prospective cancer study designed for transcriptomic studies based on
buffered RNA. All parts of the analyses are done within the same cohort framework of
NOWAC. In the NOWAC postgenome cohort a single laboratory processed all samples
using the same technology, thus reducing analytical bias and batch effects. The cohort
design reduced selection bias. A weakness of a prospective study could be the change of
case-control status as controls became cases over time, thus reducing the differences in
gene expression within a pair. We removed all pairs where controls were diagnosed
with breast cancer or another cancer in a period of at least two years after blood
sampling. Unfortunately no repeated sampling of blood and questionnaires was
conducted. Repeated measurements would secure better analyses making it possible to
use intra individual comparisons over time.
One stratification factor was based on the mode of detection. In Norway, the
National Mammographic Screening Program for breast cancer started in 1996 with
complete coverage of the population from 2005 [9]. It has been estimated that the
introduction of population based mammographic screening in Norway gave a mean
sojourn time for invasive cancer of 4.0 years in women aged 50-59 years and 6.6 years
for those 60-69 years [15]. Analyses of breast cancer carcinogenesis as a time dependent
process should therefore take into consideration that cases diagnosed at the
mammographic screening program are diagnosed at an earlier phase of carcinogenesis
and thus not directly comparable to clinically detected cancers.
Secondly, node status has for a hundred years been the most important
prognostic factor in breast cancer treatment. In one of the earliest publications from
Yale 1920 the five-year survival of metastatic cancer was 15% [16]. Even in Norway
after the Second World War, the observed five-year survival rate was 25% [17] in node
positive cases. What can be observed before diagnosis or treatment are thus the signals
from a deadly disease. The starting time of the metastatic growth is unknown. The time
from initiation of metastases to their diagnosis has been estimated at 5.8 years [18]. At
time of diagnosis we had a censored distribution of tumors where mode of detection
determines the time of diagnosis irrespective of the underlying carcinogenic process.
Differences in gene expression in blood at diagnosis between node positive and node
negative tumors have been described in a small clinical study without controls [19].
The findings of pervasive, but small changes in gene expression present in blood
before diagnosis of breast cancer could have several explanations depending on the
different views on carcinogenesis. One conclusion of the cancer genome project was
that remarkably little is known about the process of carcinogenesis [1]. The
interpretations of the gene expression trajectories should therefore be exploratory.
Human observational findings should not necessarily be related to existing models of
carcinogenesis since these are based mainly on animal experiments. Among models
currently debated, there is the driver-passenger model [20], the mathematical multistage
model or the two stage clonal model [21], the oncogene addiction hypothesis [22],
hallmarks of cancer with a current focus on the immune system [23] and the exposure
driven model [24].
The findings of a strong effect of stratification by stage and mode of detection
could have implications for the construction of predictive or prognostic clinical tests.
Most studies with tumor tissues have not taken into account the different biological
signals from cancers at different stages. Stratification could improve the sensitivity and
specificity of such tests.
From a statistical point of view, the Cox proportional hazards model and its
extensions have been widely used by epidemiologists since the seminal work by Cox [25]
for analyzing cohort studies with time-varying covariates. It has also been adapted
for case-control designs [26], and extensions have been proposed for covariates
measured with noise [27] and for time-changing coefficients [28]. More recently, the
inclusion of high-dimensional covariates such as gene expression data has raised
challenging statistical issues [29]. While the characteristics and basic assumptions
of the Cox model suit the dimensionality and the very specific paired design
of the NOWAC study, the Cox model is not fully suited to estimating changes
in the gene expression curves or to the biological interpretation of gene pathways.
An agnostic search for time trends depends on a sensitive statistical approach.
We have presented two novel statistical methods which demonstrated that gene
expression varies with time during the last years before diagnosis and that this
development in time differs between clinical stages for participants inside a
screening program. One of the methods focuses on identifying genes with specific
functional dependencies in time within a given clinical stage. The other method
focuses on differences in gene expression between clinical stages in the different
time periods. Hence, the two methods focus on different aspects of how gene
expression depends on time relative to diagnosis. Both methods give significant
results when we use many genes, and the data from the last year before diagnosis
contribute the most to this result. As the gene expression data are very noisy, all
methods use information from several genes simultaneously to increase the power of
the hypothesis tests. We found that the differences between the strata (with spread
and without spread) are larger in the last year before diagnosis than in earlier
years, but that the differences are small and the ability to predict the clinical
stage for individual cases is limited. However, it is possible to separate a
subgroup of the case-control pairs without spread from the remaining case-control
pairs (Figure 2), and thus to predict the clinical stage for some cases without
spread, but not for all.
A potential weakness of the curve group approach is that the number of curve
groups grows rapidly with the number of time periods. With four time periods we
would need 24 curve groups (4!), and with five time periods 120 (5!).
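The growth in the number of curve groups can be illustrated with a short sketch. It assumes that each curve group corresponds to one ordering (permutation) of the time period labels, as suggested by the six labels 123, 132, 312, 321, 231 and 213 used in Table 2. The function name `curve_group_labels` is hypothetical and introduced only for this illustration.

```python
from itertools import permutations
from math import factorial


def curve_group_labels(n_periods):
    """Return one label per ordering of the period indices 1..n_periods.

    Assumption: a curve group is identified by a permutation of the
    period labels, e.g. '132' for three time periods.
    """
    periods = [str(i) for i in range(1, n_periods + 1)]
    return ["".join(p) for p in permutations(periods)]


for n in (3, 4, 5):
    labels = curve_group_labels(n)
    # The number of orderings of n periods is n factorial.
    assert len(labels) == factorial(n)
    print(n, "time periods:", len(labels), "curve groups")
```

Under this assumption the count is 6, 24 and 120 for three, four and five time periods, so each additional period multiplies both the number of groups and the number of hypothesis tests to be controlled.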
6 Conclusion
The findings indicate that gene expression in blood before diagnosis might be used as a
biomarker of disease extent. These findings could be viewed as a proof of concept of
systems epidemiology indicating the potential of including gene expression for
functional analysis in prospective studies of cancer.
References
1. Lund E, Dumeaux V. Systems Epidemiology in Cancer. Cancer Epidemiology Biomarkers & Prevention. 2008;17(11):2954-7. doi:10.1158/1055-9965.epi-08-0519.
2. Of men, not mice. Nat Med. 2013;19(4):379-.
3. Alexandrov LB, Nik-Zainal S, Wedge DC, Aparicio SAJR, Behjati S, Biankin AV et al. Signatures of mutational processes in human cancer. Nature. 2013;500(7463):415-21. doi:10.1038/nature12477.
4. Lund E, Plancade S. Transcriptional output in a prospective design conditionally on follow-up and exposure: the multistage model of cancer. International Journal of Molecular Epidemiology and Genetics. 2012;3(2):107-14.
5. Spitz MR, Bondy ML. The evolving discipline of molecular epidemiology of cancer. Carcinogenesis. 2010;31(1):127-34. doi:10.1093/carcin/bgp246.
6. Reiner A, Yekutieli D, Benjamini Y. Identifying differentially expressed genes using false discovery rate controlling procedures. Bioinformatics. 2003;19(3):368-75. doi:10.1093/bioinformatics/btf877.
7. Dumeaux V, Borresen-Dale A-L, Frantzen J-O, Kumle M, Kristensen V, Lund E. Gene expression analyses in breast cancer epidemiology: the Norwegian Women and Cancer postgenome cohort study. Breast Cancer Research. 2008;10(1):R13.
8. Lund E, Dumeaux V, Braaten T, Hjartåker A, Engeset D, Skeie G et al. Cohort Profile: The Norwegian Women and Cancer Study (NOWAC) Kvinner og kreft. Int J Epidemiol. 2008;37(1):36-41. doi:10.1093/ije/dym137.
9. Hofvind S, Geller B, Vacek PM, Thoresen S, Skaane P. Using the European guidelines to evaluate the Norwegian Breast Cancer Screening Program. European Journal of Epidemiology. 2007;22(7):447-55. doi:10.2307/27822793.
10. Günther C, Holden M, Holden L. Preprocessing of gene-expression data related to breast cancer diagnosis: SAMBA/35/14.
11. Du P, Kibbe W, Lin S. nuID: a universal naming scheme of oligonucleotides for Illumina, Affymetrix, and other microarrays. Biology Direct. 2007;2(1):16.
12. Carlson M. lumiHumanAll.db: Illumina Human Illumina expression annotation data (chip lumiHumanAll). R Package version 1.22.0.
13. Holden L. Classify strata. NR note SAMBA/11/15.
14. Phipson B, Smyth GK. Permutation P-values Should Never Be Zero: Calculating Exact P-values When Permutations Are Randomly Drawn. Stat Appl Genet Mol Biol. 2010;31(9). doi:10.2202/1544-6115.1585.
15. Weedon-Fekjær H, Lindqvist BH, Vatten LJ, Aalen OO, Tretli S. Estimating mean sojourn time and screening sensitivity using questionnaire data on time since previous screening. Journal of Medical Screening. 2008;15(2):83-90. doi:10.1258/jms.2008.007071.
16. Todd M, Shoag M, Cadman E. Survival of women with metastatic breast cancer at Yale from 1920 to 1980. Journal of Clinical Oncology. 1983;1(6):406-8.
17. Survival of cancer patients: cases diagnosed in Norway 1953-1967. The Norwegian Cancer Society and The Cancer Registry in Norway. Oslo; 1975.
18. Engel J, Eckel R, Kerr J, Schmidt M, Fürstenberger G, Richter R et al. The process of metastasisation for breast cancer. European Journal of Cancer. 2003;39(12):1794-806.
19. Zuckerman NS, Yu H, Simons DL, Bhattacharya N, Carcamo-Cavazos V, Yan N et al. Altered local and systemic immune profiles underlie lymph node metastasis in breast cancer patients. International Journal of Cancer. 2013;132(11):2537-47. doi:10.1002/ijc.27933.
20. Stratton MR, Campbell PJ, Futreal PA. The cancer genome. Nature. 2009;458(7239):719-24.
21. Vineis P, Schatzkin A, Potter JD. Models of carcinogenesis: an overview. Carcinogenesis. 2010;31(10):1703-9. doi:10.1093/carcin/bgq087.
22. Felsher DW. Oncogene Addiction versus Oncogene Amnesia: Perhaps More than Just a Bad Habit? Cancer Research. 2008;68(9):3081-6. doi:10.1158/0008-5472.can-07-5832.
23. Hanahan D, Weinberg RA. The Hallmarks of Cancer. Cell. 2000;100(1):57-70. doi:10.1016/S0092-8674(00)81683-9.
24. Lund E. An exposure driven functional model of carcinogenesis. Medical Hypotheses. 2011;77(2):195-8. doi:10.1016/j.mehy.2011.04.009.
25. Cox DR. Regression Models and Life-Tables. Journal of the Royal Statistical Society Series B (Methodological). 1972;34(2):187-220. doi:10.2307/2985181.
26. Aalen OO, Borgan Ø, Gjessing HK. Survival and Event History Analysis. A Process Point of View, Statistics for Biology and Health. New York. Springer; 2008. 504p.
27. Hu P, Tsiatis AA, Davidian M. Estimating the Parameters in the Cox Model When Covariate Variables are Measured with Error. Biometrics. 1998;54(4):1407-19. doi:10.2307/2533667.
28. O'Quigley J. Proportional Hazards Regression. Statistics for Biology and Health. Springer; 2008.
29. Benner A, Zucknick M, Hielscher T, Ittrich C, Mansmann U. High-Dimensional Cox Models: The Choice of Penalty as Part of the Model Building Process. Biometrical Journal. 2010;52(1):50-69. doi:10.1002/bimj.200900064.
Table 1 Number of case-control pairs in each stratum and time period for the dataset. The stratum “Clinical” consists of i) Women that attended screening, but this was more than two years before diagnosis; and ii) Women that did not attend screening prior to diagnosis.

Year before diagnosis (time period)                          5-3 (3)    2 (2)    1 (1)
Stratum
Screening: Diagnosed at a screening visit
    Spread                                                      41        11        6
    Not spread                                                 118        42       43
Interval: Diagnosed within two years of a screening visit
    Spread                                                      28         9        6
    Not spread                                                  30        15       10
Clinical: Outside the screening program
    Spread                                                      11         8       10
    Not spread                                                  28        12       13
Table 2 P-values obtained when testing whether there are more genes in the curve groups than what is expected by chance. We have used inclusion criterion 1 with $\alpha = 0.01$. P-values below 0.05 are highlighted in yellow. The observed and expected number of genes in each curve group are shown in Table 4 in the Supplementary.

p-value
Curve group   Screening, interval   Screening, interval   Screening or    Screening or
              or clinical           or clinical           interval        interval
              with spread           without spread        with spread     without spread
Global        0.78                  0.27                  0.01            0.20
123           0.61                  0.23                  0.02            0.39
132           0.49                  0.13                  0.008           0.11
312           0.88                  0.18                  0.13            0.11
321           0.41                  0.74                  0.02            0.66
231           0.74                  0.68                  0.50            0.57
213           0.58                  0.17                  0.48            0.13

p-value
Curve group   Screening     Screening        Interval      Interval         Clinical      Clinical
              with spread   without spread   with spread   without spread   with spread   without spread
Global        0.36          0.43             0.02          0.46             0.40          0.81
123           0.10          0.33             0.21          0.89             0.06          0.34
132           0.38          0.19             0.009         0.32             0.51          0.63
312           0.83          0.30             0.07          0.21             0.98          0.81
321           0.18          0.90             0.05          0.40             0.22          0.66
231           0.33          0.63             0.21          0.83             0.94          0.93
213           0.70          0.27             0.29          0.16             0.90          0.59
Table 3 P-values obtained when testing whether the curve group variables $Z_{c,s,p}$ are different in the two strata «Screening or interval with spread» and «Screening or interval without spread». P-values below 0.05 are highlighted in yellow.

p-value
                  Genes selected based on stratum      Genes selected based on stratum
                  $s_1$ = «Screening or interval       $s_2$ = «Screening or interval
                  with spread», $Z_{c,s_1,p}$          without spread», $Z_{c,s_2,p}$
Period $t$        3         2         1                3         2         1
N1                69        20        12               69        20        12
N2                148       57        53               148       57        53
Curve group $c$
123               0.22      0.59      0.02             0.53      0.11      0.08
132               0.90      0.005     0.004            0.71      0.11      0.009
312               0.80      0.27      0.15             0.04      0.009     0.001
321               0.12      0.98      0.24             0.35      0.72      0.15
231               0.26      0.45      0.78             0.34      0.38      0.23
213               0.53      0.45      0.65             0.36      0.04      0.08

‘N1’ is the number of case-control pairs in the stratum «Screening or interval with spread» in the time period $t$, while ‘N2’ is the number of case-control pairs in the stratum «Screening or interval without spread» in the time period $t$.
Figure 1 Example of two different curve groups: ‘123’ (upper) and ‘132’ (lower). In the left panels curves for 20 genes from the given curve group are plotted. For illustrational purposes, the curves have been estimated from the data using splines. In the middle panels the data for one of the 20 genes are shown with the corresponding spline-estimated curve. The points represent the differences in gene expression $X_{g,p}$ for each case-control pair. The mean value in each time period, $\bar{X}_{g,3,s}$, $\bar{X}_{g,2,s}$ and $\bar{X}_{g,1,s}$, is shown in red. The right panels are similar to the middle panels except that the data that are plotted are the mean values computed over the 20 genes in the left panel.
Figure 2 Plot of two of the most significant curve group variables $Z_{c,s,p}$ for the screening and interval strata. «With spread 132» on the x-axis denotes that $s$ in $Z_{c,s,p}$ is the stratum «Screening or interval with spread» and $c$ is curve group 132, while «Without spread 312» on the y-axis denotes that $s$ in $Z_{c,s,p}$ is the stratum «Screening or interval without spread» and $c$ is curve group 312.
Figure 3 The p-value in a hypothesis test with test statistic $L_k$, a weighted sum of t-statistics, plotted against the number of genes $k$ used in the calculation of $L_k$. The two strata that are compared using $L_k$ are «Screening or interval with spread» and «Screening or interval without spread».
7 Supplementary
7.1 Hypothesis tests for development in time for each stratum
Supplementary figure for Table 2 in the paper
In Table 2 in the paper we presented p-values obtained when testing whether there are more genes in the curve groups than what is expected by chance. In these tests we used inclusion criterion 1 with $\alpha = 0.01$. A small $\alpha$-value implies that we only include genes with a strong trend. Figure 4 shows how the p-values depend on the value of $\alpha$. We observe that the p-values are not sensitive to the choice of $\alpha$ except that $\alpha$ should be closer to 0 than to 1.
Figure 4 P-values obtained when testing whether there are more genes in a curve group than expected, plotted against $\alpha$, the parameter of inclusion criterion 1. The horizontal dotted line indicates a p-value equal to 0.05. The data for the stratum «With spread» consist of the data for «Screening with spread», «Interval with spread» and «Clinical with spread», and similarly for the stratum «Without spread».
Supplementary table for Table 2 in the paper
In Table 2 in the paper we presented p-values obtained when testing whether there are more genes in the curve groups than what is expected by chance. In these tests we used inclusion criterion 1 with $\alpha = 0.01$. Table 4 shows the observed and the expected number of genes in each curve group. It is important that the number of genes in each curve group is not too small, since that would indicate that too small an $\alpha$-value had been chosen, weakening the power of the test. The table shows that this is not the case.
Table 4 The observed number of genes in each curve group with expected number of genes in parenthesis. The cases with a p-value below 0.05 in Table 2 are highlighted in yellow.

Observed number of genes (expected number of genes)
Curve group   Screening, interval   Screening, interval   Screening or    Screening or
              or clinical           or clinical           interval        interval
              with spread           without spread        with spread     without spread
Global        305 (513)             609 (535)             1360 (482)      708 (547)
123           47 (76)               97 (82)               259 (70)        69 (86)
132           69 (100)              171 (103)             518 (99)        205 (107)
312           37 (102)              145 (105)             171 (105)       203 (108)
321           66 (82)               40 (82)               314 (77)        46 (82)
231           38 (77)               44 (81)               48 (66)         51 (82)
213           48 (76)               112 (82)              50 (65)         134 (83)

Observed number of genes (expected number of genes)
Curve group   Screening     Screening        Interval      Interval         Clinical      Clinical
              with spread   without spread   with spread   without spread   with spread   without spread
Global        475 (464)     490 (547)        1233 (485)    471 (525)        448 (491)     302 (502)
123           139 (75)      78 (85)          101 (81)      33 (90)          233 (84)      83 (83)
132           81 (91)       141 (106)        515 (92)      96 (97)          52 (84)       54 (90)
312           43 (96)       107 (109)        237 (89)      123 (96)         18 (82)       40 (92)
321           115 (82)      29 (82)          213 (81)      71 (83)          101 (83)      45 (77)
231           63 (63)       46 (82)          92 (70)       31 (78)          21 (77)       27 (77)
213           34 (58)       89 (83)          75 (73)       117 (81)         23 (81)       53 (83)
7.2 Methods for separation of the strata
Supplementary figure for Figure 3 in the paper
In Figure 3 in the paper we illustrated our ability to separate two strata based on the t-statistics $T_{g,t}$ for each gene $g$ and time period $t$. We can also illustrate the separation of the two strata by calculating a variable $Y_{p,k}$ for each case-control pair $p$, using information from several genes simultaneously to show that there are differences in gene expression between the two strata. We define $Y_{p,k}$ such that it is low for case-control pairs from one stratum and high for case-control pairs from the other stratum:

$$Y_{p,k} = \frac{1}{k} \sum_{g \in G_k} X_{g,p} \cdot \operatorname{sign}(T_{g,t}),$$

where $t$ is the time period for case-control pair $p$, and $\operatorname{sign}(T_{g,t})$ is 1 if $T_{g,t}$ is positive, -1 otherwise. Here, $G_k$ is the set of genes with the $k$ largest $F_g$-values (defined in the paper). Figure 5 shows the $Y_{p,1000}$ values for the different case-control pairs $p$ for the different strata. Notice that there is a separation between spread and not spread in period 1 and between interval with spread and interval without spread in period 2. The separation is such that some of the pairs without spread, but not all, have smaller values that seem to lie outside the range of the values for pairs with spread.
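The definition of the pair-level variable above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the array layout, the function name `pair_score` and its argument names are hypothetical, and the gene ranking statistic $F_g$ (defined in the paper, not in this section) is simply taken as an input.

```python
import numpy as np


def pair_score(X, T, F, period_of_pair, k):
    """Compute Y[p] = (1/k) * sum over g in G_k of X[g, p] * sign(T[g, t_p]).

    Assumed (hypothetical) layout:
      X: (n_genes, n_pairs) case-control differences in gene expression
      T: (n_genes, n_periods) t-statistics per gene and time period
      F: (n_genes,) ranking statistic; G_k holds the k largest values
      period_of_pair: (n_pairs,) column index into T for each pair
    """
    X = np.asarray(X, dtype=float)
    T = np.asarray(T, dtype=float)
    period_of_pair = np.asarray(period_of_pair, dtype=int)

    # G_k: indices of the genes with the k largest F-values.
    top = np.argsort(F)[::-1][:k]

    # sign is +1 if the t-statistic is positive, -1 otherwise (as in the text).
    signs = np.where(T[top] > 0, 1.0, -1.0)      # shape (k, n_periods)
    signs_per_pair = signs[:, period_of_pair]    # shape (k, n_pairs)

    # Average the signed differences over the k selected genes.
    return (X[top] * signs_per_pair).mean(axis=0)


# Tiny made-up example: 3 genes, 2 pairs, 2 time periods.
X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
T = [[1.0, -1.0], [-1.0, 1.0], [2.0, 2.0]]
F = [0.5, 3.0, 2.0]
print(pair_score(X, T, F, period_of_pair=[0, 1], k=2))  # -> [1. 5.]
```

In the toy example the two genes with the largest F-values are selected, and each pair uses the signs from its own time period, mirroring the role of $t$ in the formula. The numbers are invented purely to make the arithmetic easy to follow.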
Figure 5 Plot of the variable $Y_{p,k}$, where $k = 1000$ and where genes have been selected based on data in the period of case-control pair $p$ (upper) or on data in all three periods (lower).