Ridge regression for risk prediction with applications to genetic data
Erika Cule and Maria De Iorio
Imperial College London
Department of Epidemiology and Biostatistics
School of Public Health
May 2012
Outline
1 Risk Prediction using Genetic Data
2 Methods and challenges
3 Ridge Regression
   Shrinkage parameter
   Significance testing
4 Conclusions
Risk Prediction using Genetic Data
...genome-wide association studies have identified thousands of genetic variants associated with hundreds of diseases and traits.
In the decade following the publication of the first draft of the Human Genome Sequence...
Risk Prediction using Genetic Data
However, clinicians are getting impatient about the utility of these identified variants for risk prediction in complex diseases:
Risk prediction using genetic data
• Recently, questions have been raised about the potential utility of genetic risk prediction for complex diseases (Clayton, 2009).
• The aim here is to make ridge regression possible for genetic data in a semi-automatic way.
• The framework that we propose allows for the simultaneous inclusion of all predictors genome-wide in a regression model.
• Our approach is appropriate where there are many predictors of small effect size, which is thought to be the case in genetic data.
Univariate tests of association
from the analyses described above, and consideration of an expanded reference group, described below.
Bipolar disorder (BD). Bipolar disorder (BD; manic depressive illness [26]) refers to an episodic recurrent pathological disturbance in mood (affect) ranging from extreme elation or mania to severe depression and usually accompanied by disturbances in thinking and behaviour: psychotic features (delusions and hallucinations) often occur. Pathogenesis is poorly understood but there is robust evidence for a substantial genetic contribution to risk [27,28]. The estimated sibling recurrence risk (λs) is 7–10 and heritability 80–90% [27,28]. The definition of BD phenotype is based solely on clinical features because, as yet, psychiatry lacks validating diagnostic tests such as those available for many physical illnesses. Indeed, a major goal of molecular genetics approaches to psychiatric illness is an improvement in diagnostic classification that will follow identification of the biological systems that underpin the clinical syndromes. The phenotype definition that we have used includes individuals that have suffered one or more episodes of pathologically elevated mood (see Methods), a criterion that captures the clinical spectrum of bipolar mood variation that shows familial aggregation [29].
Several genomic regions have been implicated in linkage studies [30] and, recently, replicated evidence implicating specific genes has been reported. Increasing evidence suggests an overlap in genetic susceptibility with schizophrenia, a psychotic disorder with many similarities to BD. In particular association findings have been reported with both disorders at DAOA (D-amino acid oxidase activator), DISC1 (disrupted in schizophrenia 1), NRG1 (neuregulin1) and DTNBP1 (dystrobrevin binding protein 1) [31].
The strongest signal in BD was with rs420259 at chromosome 16p12 (genotypic test P = 6.3 × 10^-8; Table 3) and the best-fitting genetic model was recessive (Supplementary Table 8). Although recognizing that this signal was not additionally supported by the expanded reference group analysis (see below and Supplementary Table 9) and that independent replication is essential, we note that several genes at this locus could have pathological relevance to BD (Fig. 5). These include PALB2 (partner and localizer of BRCA2), which is involved in stability of key nuclear structures including chromatin and the nuclear matrix; NDUFAB1 (NADH dehydrogenase (ubiquinone) 1, alpha/beta subcomplex, 1), which encodes a subunit of complex I of the mitochondrial respiratory chain; and DCTN5 (dynactin 5), which encodes a protein involved in intracellular transport that is known to interact with the gene ‘disrupted in schizophrenia 1’ (DISC1) [32], the latter having been implicated in susceptibility to bipolar disorder as well as schizophrenia [33].
Of the four regions showing association at P < 5 × 10^-7 in the expanded reference group analysis (Supplementary Table 9), it is of interest that the closest gene to the signal at rs1526805 (P = 2.2 × 10^-7) is KCNC2 which encodes the Shaw-related voltage-gated potassium channel. Ion channelopathies are well-recognized as causes of episodic central nervous system disease, including seizures, ataxias
[Figure: seven Manhattan-plot panels, one per disease (bipolar disorder, coronary artery disease, Crohn’s disease, hypertension, rheumatoid arthritis, type 1 diabetes, type 2 diabetes). x-axis: chromosome (1 to 22, X); y-axis: −log10(P), from 0 to 15.]
Figure 4 | Genome-wide scan for seven diseases. For each of seven diseases, −log10 of the trend test P value for quality-control-positive SNPs, excluding those in each disease that were excluded for having poor clustering after visual inspection, are plotted against position on each chromosome. Chromosomes are shown in alternating colours for clarity, with P values < 1 × 10^-5 highlighted in green. All panels are truncated at −log10(P value) = 15, although some markers (for example, in the MHC in T1D and RA) exceed this significance threshold.
ARTICLES, NATURE | Vol 447 | 7 June 2007, p. 666. Nature ©2007 Publishing Group
WTCCC (2007)
Multivariate methods
• Consider all SNPs jointly.
• Standard multivariate methods cannot be used with modern genetic data sets, which have p >> n.
• Typically, additional (non-genetic) covariates are included in the analysis, further increasing the dimensionality of the data.
• Penalized regression methods constrain the size of the maximum likelihood estimates of regression coefficients.
• Known as shrinkage methods: they “shrink” regression coefficients towards zero.
• A number of penalized regression approaches have been proposed in the literature: Lasso regression, HyperLasso, Elastic Net...
Ridge Regression
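To make the p >> n point concrete, here is a minimal sketch on simulated data (the dimensions, seed and penalty value are arbitrary assumptions, not taken from the talk). It shows that the matrix X has rank at most n, so X'X cannot be inverted for ordinary least squares, while adding a penalty k to the diagonal gives a solvable system.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 100                      # fewer individuals than predictors (p > n)
X = rng.standard_normal((n, p))     # stand-in for a genotype matrix
y = rng.standard_normal(n)          # stand-in for a phenotype

# X has at most n non-zero singular values, and X'X has the same rank as X,
# so the p x p matrix X'X is singular and the OLS normal equations fail.
rank = np.linalg.matrix_rank(X)

# The ridge penalty adds k to the diagonal, which makes the system solvable.
k = 1.0                             # arbitrary illustrative penalty
beta_ridge = np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)

print(rank, beta_ridge.shape)
```

Any positive k works here; how to choose it is a separate question.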
Prior distributions in Lasso and Ridge Regression
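The figure for this slide did not survive extraction. As a hedged reminder of the standard correspondence it presumably illustrates, the penalized estimates can be read as posterior modes under independent priors on the coefficients. The symbols $\tau^2$ and $b$ below are notation introduced here for the prior scales, not taken from the slides.

```latex
% Ridge: Gaussian prior on each coefficient
\beta_j \sim N(0, \tau^2)
\;\Longrightarrow\;
\hat{\beta}_{\mathrm{MAP}} = \arg\min_{\beta}
\Bigl\{ \lVert Y - X\beta \rVert^2 + \frac{\sigma^2}{\tau^2} \sum_{j=1}^{p} \beta_j^2 \Bigr\}

% Lasso: Laplace (double-exponential) prior on each coefficient
p(\beta_j) \propto \exp\bigl(-\lvert \beta_j \rvert / b\bigr)
\;\Longrightarrow\;
\hat{\beta}_{\mathrm{MAP}} = \arg\min_{\beta}
\Bigl\{ \lVert Y - X\beta \rVert^2 + \frac{2\sigma^2}{b} \sum_{j=1}^{p} \lvert \beta_j \rvert \Bigr\}
```

Under this reading the ridge shrinkage parameter is the ratio of the noise variance to the prior variance, $k = \sigma^2 / \tau^2$.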
Ridge regression
• Ridge regression (Hoerl & Kennard, 1970) is a penalized regression approach proposed to overcome the problems associated with multicollinearity among predictors in multiple regression.
• Among penalized regression approaches, ridge regression has been shown to offer very good predictive performance (Frank & Friedman, 1993).
• We applied ridge regression to the problem of risk prediction using genetic data obtained from genome-wide association studies.
• Ridge regression shrinks the squared length of the regression coefficient vector, which corresponds to a quadratic penalty on the coefficients.
Shrinkage parameter
• Controls the degree of shrinkage of the regression coefficients.
• A larger shrinkage parameter shrinks the coefficients further towards zero.
• Data-driven methods proposed in the literature cannot be applied when p >> n, because they depend on the ordinary least squares estimates.
• Ridge trace (graphical method)
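The effect of the shrinkage parameter can be sketched numerically. The example below is illustrative only: it uses simulated data, an arbitrary grid of k values and a hypothetical helper named `ridge`. It tabulates the squared length of the ridge coefficient vector, which is the quantity a ridge trace lets you inspect by eye.

```python
import numpy as np

def ridge(X, y, k):
    """Closed-form ridge estimate (X'X + kI)^{-1} X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 10))
y = X @ rng.standard_normal(10) + rng.standard_normal(50)

# Squared length of the coefficient vector for increasing k (k = 0 is OLS).
ks = [0.0, 1.0, 10.0, 100.0]
sq_lengths = []
for k in ks:
    b = ridge(X, y, k)
    sq_lengths.append(float(b @ b))
    print(f"k = {k:6.1f}   squared length = {sq_lengths[-1]:.4f}")
```

The squared length decreases as k grows, which matches the second bullet above.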
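Our starting point
Linear model:
$$ Y = X\beta + \varepsilon, \qquad \varepsilon \overset{\text{iid}}{\sim} N(0, \sigma^2) $$
Ridge regression:
$$ \hat{\beta}_k = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} \Bigl( y_i - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^2 + k \sum_{j=1}^{p} \beta_j^2 \right\} $$
Proposed by Hoerl, Kennard & Baldwin (1975):
$$ k_{HKB} = \frac{p \hat{\sigma}^2}{\hat{\beta}'\hat{\beta}} $$
$\hat{\sigma}^2$, $\hat{\beta}$ estimated from ordinary least squares (OLS).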
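A minimal sketch of the Hoerl, Kennard & Baldwin parameter follows. It assumes n > p so that OLS exists, centred data with no intercept, and the usual unbiased residual variance RSS / (n - p); the slides do not state which variance estimator was used, and the function name `k_hkb` is hypothetical.

```python
import numpy as np

def k_hkb(X, y):
    """HKB shrinkage parameter p * sigma^2 / (beta'beta) from OLS estimates."""
    n, p = X.shape
    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_ols
    sigma2 = resid @ resid / (n - p)      # assumed estimator: RSS / (n - p)
    return p * sigma2 / (beta_ols @ beta_ols)

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 5))
y = X @ np.array([1.0, -0.5, 0.0, 0.25, 2.0]) + rng.standard_normal(100)

k = k_hkb(X, y)
print(k)
```

Because the formula needs the OLS estimates, it cannot be evaluated directly once p exceeds n.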
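We observed
Linear model:
$$ Y = X\beta + \varepsilon = Z\alpha + \varepsilon, \qquad \varepsilon \overset{\text{iid}}{\sim} N(0, \sigma^2) $$
Proposed by Hoerl, Kennard & Baldwin (1975):
$$ k_{HKB} = \frac{p \hat{\sigma}^2}{\hat{\beta}'\hat{\beta}} = \frac{p \hat{\sigma}^2}{\hat{\alpha}'\hat{\alpha}} $$
α are principal components regression coefficients.
PCR coefficients are available when p >> n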
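The equality of the two denominators can be checked numerically. This sketch assumes the principal components are obtained from the singular value decomposition of the centred predictor matrix, so that Z = XV; the slides do not spell out the construction, and the data are simulated with n > p so that both sets of coefficients exist.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 60, 8
X = rng.standard_normal((n, p))
X -= X.mean(axis=0)                 # centre the predictors
y = X @ rng.standard_normal(p) + rng.standard_normal(n)
y -= y.mean()

# Thin SVD: X = U diag(d) V'. The component scores are Z = XV = U diag(d).
U, d, Vt = np.linalg.svd(X, full_matrices=False)
Z = U * d

# PCR coefficients: regress y on the orthogonal columns of Z.
alpha = (U.T @ y) / d

# OLS coefficients on the original predictors, for comparison.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# beta = V alpha and V is orthogonal, so the squared lengths agree.
same_length = bool(np.isclose(beta @ beta, alpha @ alpha))
print(same_length)
```

The PCR coefficients only need the non-zero singular values, which is why they remain computable when p >> n.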
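We propose
$$ k_{HKB} = \frac{p \hat{\sigma}^2}{\hat{\alpha}'\hat{\alpha}} $$
$$ k_r = \frac{r \hat{\sigma}_r^2}{\hat{\alpha}_r'\hat{\alpha}_r} $$
Harmonic mean of the “ideal” shrinkage parameters of the PCR coefficients, with coefficients replaced by their ordinary least squares estimates.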
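One way to compute k_r is sketched below, under stated assumptions: the first r principal components come from the SVD of the centred predictors, and the residual variance is taken as RSS / (n - r) from the regression on those r components. The slides do not give the variance estimator, and the function name `k_r` is hypothetical.

```python
import numpy as np

def k_r(X, y, r):
    """Shrinkage parameter r * sigma_r^2 / (alpha_r' alpha_r) from the first r PCs."""
    n = X.shape[0]
    U, d, _ = np.linalg.svd(X, full_matrices=False)
    Ur = U[:, :r]
    alpha_r = (Ur.T @ y) / d[:r]           # PCR coefficients on r components
    resid = y - Ur @ (Ur.T @ y)            # residuals of the r-component fit
    sigma2_r = resid @ resid / (n - r)     # assumed estimator: RSS / (n - r)
    return r * sigma2_r / (alpha_r @ alpha_r)

rng = np.random.default_rng(4)
n, p = 30, 200                             # p >> n, where k_HKB is unavailable
X = rng.standard_normal((n, p))
X -= X.mean(axis=0)
y = rng.standard_normal(n)
y -= y.mean()

k5 = k_r(X, y, r=5)
print(k5)
```

The value depends on r, which is why the choice of the number of components matters.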
How many components?
[Figure: % of replicates with larger MSE using kHKB than using kr. Axes: number of PCs (r) and signal to noise ratio; colour scale: percent, 0 to 100.]
How many components?
Most of the variance in genetic data can be explained by the first few principal components.
How many components?
$$ \mathrm{PSE} = \left\{ 1 + \frac{\mathrm{tr}(HH')}{n} \right\} \sigma^2 + \frac{b'b}{n} = \text{variance} + \frac{\text{bias}^2}{n} $$
• H is the “hat matrix”: $\hat{Y} = HY$
• Degrees of freedom for variance = tr(HH′) (Hastie & Tibshirani, 1990).
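For ridge regression the hat matrix is H = X(X'X + kI)^{-1}X', and tr(HH') reduces to a sum over the singular values d_j of X, namely the sum of (d_j^2 / (d_j^2 + k))^2. The sketch below checks that identity on simulated data; the function name `df_variance` and the value of k are illustrative assumptions.

```python
import numpy as np

def df_variance(X, k):
    """Degrees of freedom for variance, tr(HH'), of a ridge fit with penalty k."""
    d = np.linalg.svd(X, compute_uv=False)
    return float(np.sum((d**2 / (d**2 + k)) ** 2))

rng = np.random.default_rng(5)
X = rng.standard_normal((40, 6))
k = 2.0

# Direct computation from the explicit ridge hat matrix, for comparison.
H = X @ np.linalg.solve(X.T @ X + k * np.eye(6), X.T)
df_direct = float(np.trace(H @ H.T))
df_svd = df_variance(X, k)
print(df_direct, df_svd)
```

The singular-value form avoids building the n x n hat matrix, which helps when n is large.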
How many components?
• For given r, RR estimates have less bias than PCR estimates.
• PCR using r components has r degrees of freedom for variance.
• We fixed r such that degrees of freedom of the ridge model using r components equals r.
$$ \mathrm{tr}(HH') = r $$
• For given r , RR estimates haveless bias than PCR estimates.
• PCR using r components has rdegrees of freedom for variance.
• We fixed r such that degrees offreedom of the ridge model usingr components equals r .
tr(HH ′
)= r
How many components?
• For given r , RR estimates haveless bias than PCR estimates.
• PCR using r components has rdegrees of freedom for variance.
• We fixed r such that degrees offreedom of the ridge model usingr components equals r .
tr(HH ′
)= r
How many components?
• For given r , RR estimates haveless bias than PCR estimates.
• PCR using r components has rdegrees of freedom for variance.
• We fixed r such that degrees offreedom of the ridge model usingr components equals r .
tr(HH ′
)= r
Simulation Study
Mean prediction squared error:
p-value trace:
Simulation study
• Performance comparison
   SNP ranking followed by multivariate regression
   HyperLasso
• Continuous and binary outcomes
                                  Univariate (% of SNPs ranked by univariate p-value)   HLasso   RR
                                  0.1%    0.5%    1%      3%      4%
Continuous outcomes (mean PSE)    1.51    1.55    1.54    2.21    3.93                  2.41     1.23
Binary outcomes (mean CE)         0.49    0.48    0.48    0.49    0.50                  0.50     0.46
Bipolar Disorder Data
• Two GWAS of Bipolar Disorder: WTCCC and GAIN.
• Case-control studies: model extended to logistic ridge regression.
• SNPs typed on different platforms. Impute2 to obtain common SNPs.
• When determining shrinkage parameter, training data were thinned (1 SNP every 100kb).
• Univariate model: which significance threshold?
• HyperLasso: cross-validation to choose the parameters is computationally intensive.
                            Univariate (p-value threshold)    HyperLasso   Ridge Regression
                            10^-5     10^-7     10^-10
Mean classification error   0.489     0.491     0.490         0.492        0.465
Significance testing in ridge regression
• Ridge regression is not a variable selection method: the shrinkage penalty does not shrink any coefficient estimates to zero.
• A test of significance of ridge regression coefficients had been proposed (Halawa & El Bassiouni, 2000) and applied (Malo et al, 2008) but not evaluated.
• We extended the test to be applicable when p >> n and to be applied in logistic ridge regression, and evaluated its performance on simulated and real data sets.
Significance test
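Based on a Wald test:
$$T_k = \frac{\hat\beta_k}{\mathrm{se}(\hat\beta_k)}, \qquad \text{under } H_0 : T_k \sim N(0, 1)$$
$\mathrm{se}(\hat\beta_k)$ from covariance matrix
$$\mathrm{Var}(\hat\beta_k) = \sigma^2 (X'X + kI)^{-1} X'X (X'X + kI)^{-1}$$
taking into account both correlation in predictors and amount of shrinkage.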
Simulation study: Causal SNP
Ridge Regression with Applications to Genetic Data
Erika Cule¹, Paolo Vineis¹ and Maria De Iorio²
¹Imperial College London and ²University College London
Risk Prediction using Genetic Data
Recent technological developments have increased the availability of genetic data. These data can be used to investigate the association between genetic variants and disease risk. Genetic data present statistical and computational challenges due to their high dimensionality and correlation structure.
Ridge Regression
Ridge regression [2] is a penalized regression method that places a penalty on the squared length of the regression coefficients. Ordinary least squares (OLS) estimates of regression coefficients are replaced by their ridge counterparts. The amount of penalization is determined by a penalty parameter, $k$. In the standard regression model
$$Y = X\beta + \epsilon \qquad (1)$$
$Y = (Y_1, \ldots, Y_n)$ are the observed phenotypes in $n$ individuals and $X$ is an $n \times p$ matrix comprising rows $x_i = (x_{i1}, \ldots, x_{ip})$ of genotypes at $p$ loci. $\beta = (\beta_1, \ldots, \beta_p)$ is a vector of $p$ regression coefficients to be estimated. $\epsilon \overset{iid}{\sim} N(0, \sigma^2)$ is the measurement error. Ridge regression estimates are obtained as
$$\hat\beta_k = (X'X + kI_p)^{-1} X'Y$$
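As an illustration of the estimator above, here is a minimal NumPy sketch on simulated data. The function name `ridge_coefficients` and the toy data are assumptions made for this example; this is not the code of the `ridge` R package.

```python
import numpy as np


def ridge_coefficients(X, y, k):
    """Ridge estimate (X'X + k I_p)^{-1} X'y for a penalty k >= 0."""
    p = X.shape[1]
    # Solve the linear system instead of forming the inverse explicitly.
    return np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)


# Toy data (hypothetical): 50 individuals, 5 predictors, 2 true effects.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
y = X @ np.array([1.0, 0.0, -0.5, 0.0, 0.0]) + 0.1 * rng.standard_normal(50)

beta_ols = ridge_coefficients(X, y, 0.0)     # k = 0 reduces to OLS here (n > p)
beta_ridge = ridge_coefficients(X, y, 10.0)  # k > 0 shrinks the estimates

print(np.linalg.norm(beta_ols), np.linalg.norm(beta_ridge))
```

With k > 0 the coefficient vector is shorter than the OLS vector, but no coefficient is set exactly to zero.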
Significance Testing
The Wald statistic is obtained by dividing the estimated coefficient by its standard error:
$$T_k = \frac{\hat\beta_{jk}}{\mathrm{se}(\hat\beta_{jk})}$$
The standard error is obtained as the square root of the $j$th element of the diagonal of the variance matrix:
$$\mathrm{Var}(\hat\beta_k) = \sigma^2 (X'X + kI_p)^{-1} X'X (X'X + kI_p)^{-1}$$
$\sigma^2$ is replaced by its estimate, $\hat\sigma^2$:
$$\hat\sigma^2 = \frac{(Y - X\hat\beta_k)'(Y - X\hat\beta_k)}{n - \mathrm{tr}(2H - HH')}$$
$H$ is the hat matrix:
$$H = X (X'X + kI_p)^{-1} X'$$
In the large sample sizes typical of GWAS, under $H_0 : \beta_j = 0$, $T_k \sim N(0, 1)$.
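The formulas above can be put together in a short NumPy sketch. The function name `ridge_wald_test`, the two-sided p-value convention and the simulated data are assumptions for illustration; the example uses a small n so that the n × n hat matrix is cheap to form.

```python
import math

import numpy as np


def ridge_wald_test(X, y, k):
    """Wald-type test for each ridge coefficient (linear model).

    Returns the ridge estimates, their standard errors, the test
    statistics and two-sided p-values from the N(0, 1) reference.
    """
    n, p = X.shape
    A_inv = np.linalg.inv(X.T @ X + k * np.eye(p))
    beta = A_inv @ X.T @ y
    H = X @ A_inv @ X.T                       # hat matrix
    resid = y - X @ beta
    df_resid = n - np.trace(2 * H - H @ H.T)  # residual degrees of freedom
    sigma2 = resid @ resid / df_resid         # estimate of sigma^2
    cov = sigma2 * (A_inv @ X.T @ X @ A_inv)  # Var(beta_hat_k)
    se = np.sqrt(np.diag(cov))
    t = beta / se
    # Two-sided normal p-value: 2 * (1 - Phi(|t|)) = erfc(|t| / sqrt(2)).
    pvals = np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in t])
    return beta, se, t, pvals


# Toy data (hypothetical): only the first predictor has an effect.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 3))
y = 1.0 * X[:, 0] + rng.standard_normal(200)

beta, se, t, pvals = ridge_wald_test(X, y, k=1.0)
print(pvals)
```

Because the variance matrix involves both X'X and k, the standard errors reflect the correlation among predictors as well as the amount of shrinkage.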
Logistic Ridge Regression
The logistic model is commonly used to model biomedical data, where the $Y_i$ represent, for example, cases ($Y_i = 1$) and controls ($Y_i = 0$). The above test was extended to coefficients estimated using ridge logistic regression.
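For orientation, logistic ridge coefficients can be fitted by Newton-Raphson on a penalized log-likelihood. The sketch below is not the authors' implementation: it assumes the objective is the log-likelihood minus k times the squared length of the coefficients, omits the intercept, and uses the hypothetical name `logistic_ridge`.

```python
import numpy as np


def logistic_ridge(X, y, k, n_iter=50, tol=1e-10):
    """Maximise loglik(beta) - k * beta'beta by Newton-Raphson.

    Assumptions of this sketch: no intercept, penalty k * ||beta||^2.
    """
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-(X @ beta)))   # fitted probabilities
        w = prob * (1.0 - prob)                    # logistic weights
        grad = X.T @ (y - prob) - 2.0 * k * beta   # penalised score
        hess = (X.T * w) @ X + 2.0 * k * np.eye(p) # X'WX + 2kI
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta


# Toy case-control data (hypothetical): two of three predictors matter.
rng = np.random.default_rng(2)
X = rng.standard_normal((200, 3))
true_beta = np.array([1.0, -1.0, 0.0])
y = (rng.random(200) < 1.0 / (1.0 + np.exp(-(X @ true_beta)))).astype(float)

beta_light = logistic_ridge(X, y, k=0.1)
beta_heavy = logistic_ridge(X, y, k=50.0)
print(beta_light, beta_heavy)
```

A larger k pulls the fitted coefficients closer to zero, as in the linear case.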
Simulation study
[Figure: two rows of panels. Left column: histogram of coefficient estimates (x-axis: coefficient estimate; y-axis: frequency), with p = 0 (top) and p = 0.496 (bottom). Right column: distribution of the test statistic T (y-axis: probability), with p = 1.07e-08 (top) and p = 0.322 (bottom).]
Simulated genetic data were used to compare the performance of the proposed test (right column) with a permutation test (left column), which we view as a benchmark. Top, a SNP that was associated with phenotype; bottom, a SNP not associated with phenotype.
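A permutation benchmark of this kind can be sketched as follows. The details are assumptions for illustration: phenotypes are permuted, the ridge model is refitted, and the p-value is the proportion of permuted |coefficient| values at least as large as the observed one, with the common add-one correction (so it cannot be exactly 0, unlike the value reported in the figure). The names `ridge_fit` and `permutation_pvalue` are hypothetical.

```python
import numpy as np


def ridge_fit(X, y, k):
    """Ridge estimate (X'X + k I)^{-1} X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)


def permutation_pvalue(X, y, k, j, n_perm=200, seed=0):
    """Permutation p-value for the j-th ridge coefficient."""
    rng = np.random.default_rng(seed)
    observed = abs(ridge_fit(X, y, k)[j])
    exceed = 0
    for _ in range(n_perm):
        y_perm = rng.permutation(y)  # break the X-y association
        if abs(ridge_fit(X, y_perm, k)[j]) >= observed:
            exceed += 1
    # Add-one correction keeps the estimate strictly positive.
    return (exceed + 1) / (n_perm + 1)


# Toy data (hypothetical): predictor 0 is causal, predictor 1 is not.
rng = np.random.default_rng(3)
X = rng.standard_normal((100, 4))
y = 1.0 * X[:, 0] + rng.standard_normal(100)

p_causal = permutation_pvalue(X, y, k=1.0, j=0)
p_null = permutation_pvalue(X, y, k=1.0, j=1)
print(p_causal, p_null)
```

Each permutation requires a full refit, which is why an approximate test is attractive at genome-wide scale.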
Lung Cancer Data
3 SNPs that have previously been found to be associated with lung cancer disease status were found as such by our test (right), which performs well in comparison to a permutation test (left).
[Figure: -log p-value (y-axis) against shrinkage parameter (x-axis, 0 to 300) for rs8034191, rs16969968, rs402710 and other SNPs. Panel a: approximate test. Panel b: permutation test.]
Choice of Penalty Parameter
Several methods for choosing the penalty parameter $k$ in ridge regression have been proposed, but no single method provides a universally optimum choice. Further, existing methods are not applicable when the number of predictors is greater than the number of observations, as is commonly the case for genetic data.
Proposed estimator
Taking the eigendecomposition $X'X = Q \Lambda Q'$, the model in (1) is written in canonical form:
$$Y = Z\alpha + \epsilon$$
with $Z = XQ$ and $\alpha = Q'\beta$. Columns of $Z$ are the principal components of $X$. The OLS estimator for $\alpha$ is
$$\hat\alpha = \Lambda^{-1} Z'Y$$
and the first $r$ elements of $\hat\alpha$ are the first $r$ coefficients in a principal components regression (PCR). We propose
$$k_r = \frac{r \hat\sigma_r^2}{\hat\alpha_r' \hat\alpha_r}$$
where $\hat\sigma_r^2$ is estimated using the first $r$ columns of $Z$ and $r$ is chosen such that the degrees of freedom for variance is the same as that of a PCR using $r$ components.
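The proposed estimator can be sketched in NumPy for a given number of components r. Two points are assumptions of this sketch rather than statements from the poster: r is supplied by the user (the degrees-of-freedom rule for choosing r is not implemented), and the variance estimate divides the residual sum of squares of the r-component PCR by n - r. The name `ridge_penalty_kr` is hypothetical.

```python
import numpy as np


def ridge_penalty_kr(X, y, r):
    """Compute k_r = r * sigma2_r / (alpha_r' alpha_r) for a given r.

    r must not exceed the rank of X, so that the first r
    eigenvalues of X'X are strictly positive.
    """
    n = X.shape[0]
    eigvals, Q = np.linalg.eigh(X.T @ X)
    order = np.argsort(eigvals)[::-1]   # largest eigenvalues first
    eigvals, Q = eigvals[order], Q[:, order]
    Z_r = X @ Q[:, :r]                  # first r principal components
    alpha_r = (Z_r.T @ y) / eigvals[:r] # first r elements of Lambda^{-1} Z'y
    resid = y - Z_r @ alpha_r
    sigma2_r = resid @ resid / (n - r)  # assumed variance estimate
    return r * sigma2_r / (alpha_r @ alpha_r)


# Toy data (hypothetical): 60 individuals, 10 predictors.
rng = np.random.default_rng(4)
X = rng.standard_normal((60, 10))
y = X[:, 0] - 0.5 * X[:, 1] + rng.standard_normal(60)

k_3 = ridge_penalty_kr(X, y, r=3)
print(k_3)
```

Only the leading r components enter the calculation, so the computation remains defined when there are more predictors than observations.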
Prediction results
The estimator was evaluated by comparison to HyperLasso (HL) regression [3]. Data are from the WTCCC, Bipolar Disorder phenotype, and were split into training and test data sets. The HL shape parameter was fixed to 3.5; the penalty parameter was chosen by cross-validation.
References
[1] Erika Cule, Paolo Vineis, and Maria De Iorio, Significance testing in ridge regression for genetic data, BMC Bioinformatics 12 (2011), no. 1, 372.
[2] Arthur E Hoerl and RW Kennard, Ridge regression: Biased estimation for nonorthogonal problems, Technometrics 12 (1970), no. 1, 55–67.
[3] Clive J Hoggart, JC Whittaker, M De Iorio, and David J Balding, Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies, PLoS Genet 4 (2008), no. 7, e1000130.
Simulation study: Non-causal SNP
Simulation study: True-positive and false-positive rates
Columns give the shrinkage parameter (0.1, 1, 10, 100) for the approximate test and for the permutation test.

                          Approximate test              Permutation test
Individuals  SNPs  Rate   0.1    1      10     100      0.1    1      10     100
500          20    TPR    1.000  1.000  1.000  1.000    1.000  1.000  1.000  1.000
                   FPR    0.045  0.045  0.061  0.133    0.015  0.015  0.017  0.095
500          100   TPR    1.000  1.000  1.000  1.000    1.000  1.000  1.000  1.000
                   FPR    0.056  0.054  0.071  0.141    0.015  0.018  0.024  0.074
500          1000  TPR    0.100  0.500  0.900  1.000    0.000  0.200  0.800  1.000
                   FPR    0.038  0.045  0.049  0.080    0.007  0.006  0.010  0.029
500          ALL   TPR    1.000  1.000  1.000  1.000    1.000  1.000  1.000  1.000
                   FPR    0.318  0.071  0.068  0.069    0.019  0.019  0.020  0.020
5000         20    TPR    1.000  1.000  1.000  1.000    1.000  1.000  1.000  1.000
                   FPR    0.048  0.048  0.048  0.113    0.006  0.006  0.006  0.053
5000         100   TPR    0.900  0.900  1.000  1.000    0.800  0.900  1.000  1.000
                   FPR    0.055  0.052  0.062  0.100    0.003  0.001  0.007  0.055
5000         1000  TPR    0.700  0.700  1.000  1.000    0.700  0.700  0.900  1.000
                   FPR    0.046  0.046  0.045  0.060    0.006  0.007  0.008  0.014
5000         ALL   TPR    0.400  0.500  0.900  1.000    0.300  0.900  0.900  1.000
                   FPR    0.026  0.027  0.029  0.042    0.007  0.007  0.007  0.009
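Rates like those in the table can be computed from per-SNP p-values and the known causal status in a simulation. The sketch below assumes a significance level of 0.05, which the slides do not state; the function name `tpr_fpr` and the toy values are hypothetical.

```python
def tpr_fpr(pvalues, is_causal, alpha=0.05):
    """True-positive and false-positive rates at significance level alpha.

    pvalues:   one p-value per SNP
    is_causal: True where the SNP was simulated as causal
    """
    called = [p < alpha for p in pvalues]
    true_pos = sum(c and k for c, k in zip(called, is_causal))
    false_pos = sum(c and not k for c, k in zip(called, is_causal))
    n_causal = sum(is_causal)
    n_null = len(is_causal) - n_causal
    return true_pos / n_causal, false_pos / n_null


# Toy example (hypothetical): 2 causal SNPs, 3 non-causal SNPs.
pvalues = [1e-8, 0.2, 0.03, 0.6, 0.9]
is_causal = [True, True, False, False, False]
tpr, fpr = tpr_fpr(pvalues, is_causal)
print(tpr, fpr)
```

Here one of the two causal SNPs and one of the three non-causal SNPs fall below the threshold.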
Lung Cancer Data
Summary
• Prediction is a challenging problem!
• Ridge regression is a popular penalized regression approach that has been shown to perform well for prediction.
• We propose a semi-automatic method for choosing the shrinkage parameter in ridge regression, which can be applied when p >> n.
• We introduced a method for testing the significance of regression coefficients estimated using ridge regression.
• We have enabled ridge regression to be a feasible tool for genetic risk prediction on a genome-wide scale.
R package ridge
• Fitting ridge regression models to data comprising hundreds of thousands of predictors presents computational challenges.
• We have written an R package, ridge, for fitting such models.
• For large data sets, C code is used (with a user-friendly R interface).
• Where available, multi-core or GPU computation speeds up matrix operations.
• Flexibility to include non-genetic covariates - penalized or not.
• Significance test is implemented.
• Graphical outputs - ridge and p-value traces.
• Option for user-specified shrinkage parameter, with our semi-automatic method as the default.
Acknowledgements
• Maria De Iorio
• Colleagues in the Department of Epidemiology and Biostatistics, Imperial College London
• Colleagues in the Department of Statistical Science, University College London
• ILCO study nested within EPIC
• WTCCC and GAIN studies
References
[1] D. G. Clayton. Prediction and interaction in complex disease genetics: experience in type 1 diabetes. PLoS Genetics, 2009.
[2] Erika Cule and Maria De Iorio. A semi-automatic method to guide the choice of ridge parameter in ridge regression. arXiv, stat.AP, May 2012.
[3] Erika Cule, Paolo Vineis, and Maria De Iorio. Significance testing in ridge regression for genetic data. BMC Bioinformatics, 12(1):372, 2011.
[4] Ildiko Frank and Jerome Friedman. A statistical view of some chemometrics regression tools. Technometrics, 35(2):109–135, May 1993.
[5] A. M. Halawa and M. Y. El Bassiouni. Tests of regression coefficients under ridge regression models. Journal of Statistical Computation and Simulation, 65(1):341–356, 2000.
[6] T. Hastie and R. Tibshirani. Generalized Additive Models. Chapman & Hall, 1990.
[7] Arthur E. Hoerl and R. W. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.
[8] Clive J. Hoggart, J. C. Whittaker, M. De Iorio, and David J. Balding. Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies. PLoS Genet, 4(7):e1000130, 2008.
[9] Nathalie Malo, Ondrej Libiger, and Nicholas J. Schork. Accommodating linkage disequilibrium in genetic-association analyses via ridge regression. Am J Hum Genet, 82(2):375–385, Feb 2008.
[10] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society B, 58:267–288, 1996.
[11] Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society B, 67(2):301–320, Jan 2005.