STATISTICAL ANALYSIS OF MEDICAL IMAGES WITH
APPLICATIONS TO NEUROIMAGING
BY
Rafal Kustra
THESIS SUBMITTED IN CONFORMITY WITH THE
REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
GRADUATE DEPARTMENT OF
PUBLIC HEALTH SCIENCES
IN THE
UNIVERSITY OF TORONTO
TORONTO, ONTARIO
AUGUST 2000
© Copyright by Rafal Kustra, 2000
National Library of Canada
Bibliothèque nationale du Canada
Acquisitions and Bibliographic Services / Acquisitions et services bibliographiques
395 Wellington Street, Ottawa ON K1A 0N4, Canada
The author has granted a non-exclusive licence allowing the National Library of Canada to reproduce, loan, distribute or sell copies of this thesis in microform, paper or electronic formats.
The author retains ownership of the copyright in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission.
UNIVERSITY OF TORONTO
DEPARTMENT OF
PUBLIC HEALTH SCIENCES

The undersigned hereby certify that they have read and recommend to the School of Graduate Studies for acceptance a thesis entitled "Statistical Analysis of Medical Images with Applications to Neuroimaging" by Rafal Kustra in partial fulfillment of the requirements for the degree of Doctor of Philosophy.

Dated: August 2000
External Examiner: Keith Worsley, PhD
Research Supervisor: Robert Tibshirani
Examining Committee: Steven Strother, PhD
James Stafford, PhD
Randy McIntosh, PhD
UNIVERSITY OF TORONTO
Date: August 2000
Author: Rafal Kustra
Title: Statistical Analysis of Medical Images with Applications to Neuroimaging
Department: Public Health Sciences
Degree: Ph.D. Convocation: October Year: 2000

Permission is herewith granted to the University of Toronto to circulate and to have copied for non-commercial purposes, at its discretion, the above title upon the request of individuals or institutions.

Signature of Author

THE AUTHOR RESERVES OTHER PUBLICATION RIGHTS, AND NEITHER THE THESIS NOR EXTENSIVE EXTRACTS FROM IT MAY BE PRINTED OR OTHERWISE REPRODUCED WITHOUT THE AUTHOR'S WRITTEN PERMISSION.

THE AUTHOR ATTESTS THAT PERMISSION HAS BEEN OBTAINED FOR THE USE OF ANY COPYRIGHTED MATERIAL APPEARING IN THIS THESIS (OTHER THAN BRIEF EXCERPTS REQUIRING ONLY PROPER ACKNOWLEDGEMENT IN SCHOLARLY WRITING) AND THAT ALL SUCH USE IS CLEARLY ACKNOWLEDGED.
Contents

List of Tables
List of Figures
Abstract
Acknowledgments

1 Introduction
1.1 Images as Data
1.2 Motivation and the Setup for Statistical Image Analysis
1.3 Functional Data
1.4 Notation and Conventions

2 Neuroimaging Data and Methods
2.1 Goals and Study Design in Neuroimaging
2.2 PET and fMRI Modalities
2.2.1 Positron Emission Tomography
2.2.2 Functional Magnetic Resonance Imaging
2.3 Literature Overview
2.3.1 Single Voxel Analysis: Statistical Parametric Mapping and Gaussian Random Fields
2.3.2 Scaled Subprofile Model: State-Driven Variance Decomposition with Global and Subject Effect Removal
2.3.3 Partial Least Squares
2.4 Datasets Studied
2.4.1 Finger Opposition Task
2.4.2 Static Force fMRI Data

3 Penalized Linear Discriminant Analysis with Basis Expansion
3.1 Classical Linear Discriminant Analysis
3.1.1 Discriminant Functions and MANOVA View of LDA
3.1.2 The Geometry of LDA in a Two-Class, 2D Setting
3.2 LDA and Random Subject Effects
3.2.1 Simulation Study
3.3 Dimension Reduction in LDA using Smoothness Constraints and Penalization
3.3.1 Basis Expansion of Canonical Variates
3.3.2 Penalized Linear Discriminant Analysis
3.3.3 Penalized Discriminant Analysis and Statistical Parametric Mapping
3.3.4 PDA via Regression
3.3.5 Expressing the PDA Algorithm in the n-Dimensional Space
3.3.6 Effective Degrees of Freedom
3.3.7 Prediction Error and its Estimates
3.4 A Note on the Gaussian Assumption
3.5 Is a Ridge Penalty Enough for the B-spline Basis?

4 Results with B-Spline and 3-Dimensional Wavelets
4.1 Wavelet Basis
4.1.1 Wavelets: Introduction
4.1.2 Orthogonal Wavelet Basis and Multiresolution Analysis
4.1.3 Discrete Wavelet Transform
4.1.4 3D Wavelet Basis
4.1.5 Wavelet Thresholding
4.2 Finger Opposition Data: Methods
4.2.1 Data and the Standard t-Test Analysis
4.2.2 Two-way Classification with TPS: Internal Optimization and Scan Normalization
4.2.3 Two-way Classification: External Optimization
4.2.4 Deriving Time and Spatial Projections
4.3 Finger Opposition Data: Results
4.3.1 Two-way Classification and Internal Optimization
4.3.2 External Optimization with Different Bases
4.3.3 Applying PDA to an Eight-Class Problem: State and Temporal Changes
4.4 Results with Wavelet Expansions

5 Static Force fMRI Analysis
5.1 Modeling the Time Series Effects: Time-Smoothed PDA
5.2 Introducing Between-Scan Smoothness within the Discriminant Framework
5.3 The O(N) Algorithm for Time-Smoothed Penalized Discriminant
5.3.1 Constructing the Second-Order B-spline Penalty Matrix
5.4 Connections with Canonical Correlation Analysis and MANOVA
5.5 Penalized Discriminant Analysis of StaticForce Data in B-spline and Wavelet Domains
5.6 Applying the Time-Smoothed PDA Model to the StaticForce Data

6 Conclusions and Extensions
6.1 Extending the Predictive Analysis
6.2 Comparing the Results Across Non-Predictive Paradigms
6.3 Wavelets and Basis Selection Techniques
6.4 Inference and Other Issues

A Tensor Product B-Spline Basis
B Basis Expansion of Canonical Variates
C CCA via Regression
D Correspondence Between CCA and LDA Variates
E Deriving Predictions in the n-Dimensional Space
F Ridge Regression With the Outer Product Matrix
G Centering the Design Matrix
List of Tables

3.1 Estimate of Effects in the four ANOVA models of the simulation results. The terms are: input dimensionality, P = {5, 30} as compared to P = 10; training set size, N = 50 compared to N = 100; number of subjects, Subjects = {5, 10, "N"} compared to 1 subject; and ratio of variances of subject effect to error effect, VarRatio = {10, 100} compared to VarRatio = 1. There is also an interaction term between P and N.
List of Figures

3.1 Demonstration of 2-class LDA in 2 dimensions. The light points (class 1) and darker points (class 2) show the 300 bivariate Gaussian observations generated from each class. The solid line is the true canonical variate (CV). The circles are class means, and the diamonds are the means projected onto the CV. The points marked with the cross represent the test point and its projections onto the mean-difference (broken) and CV lines.

3.2 The orthonormal basis functions (columns of U) from the 2-D tensor-product B-spline expansion. There are 5 x 3 B-spline bases, which were diagonalized with SVD and displayed here in the order of decreasing eigenvalues. The penalties shown for each basis were calculated with λ = 1.

4.1 Haar (left) and Daubechies Symmlet wavelet functions. The detail level grows from bottom up, and only some integer translates are drawn at each level.
4.2 Internal optimization of the ridge tuning parameter (Sec. 4.3.1), expressed as a number of Effective Degrees of Freedom, for Penalized Discriminant Analysis in a two-class problem. This example uses the data projected on the B25 tensor product B-spline basis set. The curves show the change of Prediction Error (both Misclassification (MC) rate and Squared Prediction Error (SPE)) as a function of Effective Degrees of Freedom (EDF). Both cross-validation (CV) and .632+Bootstrap estimates are exhibited. The top left panel shows changes for un-normalized data, while the bottom panel deals with mean-normalized data. The top right panel shows CV and .632+Bootstrap SPE curves for mean-normalized data which were obtained with a different random seed for every EDF; these portray the greater variability of CV estimates. Thin lines: CV estimates. Thick lines: .632+Bootstrap estimates. Solid lines: estimates of SPE. Dashed lines: estimates of MC rate.

4.3 Prediction Error (PE) curves as a function of the effective degrees of freedom (EDF) for all tensor-product B-spline (B15-B35) bases and unprojected raw (Braw) and smoothed (Bsmooth3,5) scans. The upper panel shows Squared Prediction Error (SPE = (1 − p̂_c)², where p̂_c is the estimated posterior probability for the true class); and the lower panel depicts Misclassification rate (MC rate) as a percentage of the total number of scans misclassified. The images from un-projected data are shown with larger-width curves. The markers show the minima for each curve. (The minimum for the B30 curve is not shown as it occurred beyond the figure frame.)
4.4 Functionally activated [15O]water PET voxels above the 93.4 percentile (white overlay) interleaved with registered grayscale MRI brain slices for Penalized Discriminant Analysis of: (A16-A40) unprojected raw data presmoothed with a 3 x 3 x 3 voxel boxcar kernel (BSmooth3); (B16-B40) unprojected raw data without presmoothing (Braw); (C16-C40) tensor product spline basis with 15 spline bases in each spatial dimension (B15); (D16-D40) tensor product spline basis with 35 spline bases in each spatial dimension (B35) (activation images A to D have decreasing squared prediction error (SPE) values as illustrated in Figure 4.3); (E16-E40) is a pooled standard deviation t-test image of scans presmoothed with a 3 x 3 x 3 voxel boxcar kernel; the Bonferroni t-value (t=4.6) at the 93.4 percentile was used to define a conservative activation threshold with which to compare activation image peaks (white overlay) for a fixed number of voxels. PET and MRI slices are 128 x 128 with 3.4 mm pixels with center-to-center slice spacing of 3.4 mm (i.e., slices A23 and A26 are separated by (26 − 23) · 3.4 = 10.2 mm) and are parallel to the AC-PC plane, which coincides with slice 24. Image left = brain left.

4.5 Scatter plot of pairs of activation image values for all Talairach brain voxels (1 point/voxel) for a single-voxel t-test image using a pooled standard deviation estimate, compared to penalized discriminant analysis (PDA) of a tensor product spline representation with 35 B-spline bases along each spatial dimension (B35). The dashed line depicts the principal axis from a principal component analysis of the scatter plot distribution. The circle highlights a group of voxels in the primary visual region that have moved from the 20th percentile in the t-test image to the 90th percentile in the PDA image. The solid vertical line depicts the Bonferroni t-value (t=4.65) at the 93.4 percentile of the t-test distribution of voxel values (white overlay, row E of Figure 4.4) and the solid horizontal line reflects the 93.4 percentile (value=0.0065) for the PDA distribution of voxel values (white overlay, row D of Figure 4.4).
4.6 Squared Prediction Error (SPE) curves, in an 8-class problem, as a function of Equivalent Degrees of Freedom for 8 penalized discriminant models with different representations: 5 tensor-product B-spline projected datasets with varying numbers of basis functions (B15 to B35) and 3 unprojected raw (Braw) and smoothed (BSmooth3 and BSmooth5) datasets (thicker lines). The markers show the minima for each curve.

4.7 Projecting the data on the first two canonical images obtained by Penalized Discriminant Analysis of the 8-way classification problem. The points are labeled according to the class (1: first baseline, 2: first active, 3: second baseline, etc.), and the class means are shown in circles. This figure was obtained from tensor-product B-spline projected data using 35 B-spline bases in each dimension (B35) model with λ corresponding to the minimum Squared Prediction Error (SPE), but is similar across bases and λ's.

4.8 Top panel compares wavelet and B-spline results in the 2-class problem. Shown are Daubechies order 2 thresholded wavelet basis compared to raw and B-spline representations using the .632+Bootstrap estimate of squared prediction error. The bottom panel shows SPE curves for the two-class problem for various wavelet families. We compare Daubechies order 2 and 3 families and the order 2 Coiflet system. For each family we investigate two thresholding strategies: simply 'peeling off' one finest detail level in each dimension (32 x 32 x 16), and the Donoho and Johnstone VisuShrink hard thresholding rule.

4.9 Visual comparison of wavelet (top row) and B30 representations in the 2-class problem. First three slices show portions of the cerebellum, the next two display the midbrain portions, and the last three slices depict the activation of the cortex. The grayscale image is the anatomical MRI scan in the Talairach space and the CV is overlaid on top of it using the hot-metal color coding. Both images were created using the EDF that minimized the SPE: 74.6 for wavelet vs 53.6 for B30.

4.10 Comparing the B-splines (B30) and wavelets (Daub2Thresh) Canonical Images using a corner-cube environment of Rehm et al. (1998) in the two-class setting. Except for the three major overlapping regions, the foci have been fit inside a ball of the same volume as that of the corresponding focus. Blue foci correspond to the B30 Canonical Image.

4.11 Squared Prediction Error for various wavelet families in the 8-class problem. Three wavelet functions are investigated: Daubechies order 2 and 4, and Coiflets order 2. For each family, we either remove the top-scale wavelet coefficient level, resulting in 32 x 32 x 16 wavelet coefficients, or we apply the D&J VisuShrink hard thresholding (Thresh). As a further dimension reduction technique, we investigate using all 7, or the first 2, CVs to perform classification (C7 vs C2).

5.1 Projections on the first 2 CVs from the PDA model applied to …

5.2 Projections of time-points (first row) and force level means onto the first four Canonical Images using the time-smoothed PDA model with B25 Tensor Product B-spline basis and B-splines for the time axis. Force levels were: 1: baseline, 2: 200g, 3: 400g, 4: 600g, 5: 800g, 6: 1000g. The time-structure penalty hyperparameter was set at λ_T = 10.

5.3 Selected slices of the third canonical image resulting from applying the time-smoothed PDA model to the StaticForce data with B25 basis, EDF = 80 and λ_T = 10.

A.1 1D and 2D B-spline basis.
Abstract

We extend a classical multivariate technique, Linear Discriminant Analysis (LDA), and apply it in the analysis of PET and fMRI images of human brain function to discover regions of activation driven by the experimental stimuli. We re-examine and specialize some equivalences between LDA and Canonical Correlation Analysis (CCA) and Multivariate ANOVA (MANOVA). Furthermore, efficient algorithms are derived to facilitate applying these multivariate models to extremely large image data. We deal with the ill-posed nature of the problem using spatial basis expansion and penalization (with the Penalized Discriminant Analysis (PDA) of Hastie et al. (1995)), and utilize efficient measures of predictive performance to optimize hyperparameters and validate the models in a robust fashion. We examine expanding the images into a 3D tensor-product B-spline and wavelet basis and compare to the results obtained without expansion. Some parallels between our proposal and some of those currently popular in the neuroimaging community are discussed. Another extension to PDA is derived and applied that allows one to model time series effects that exist in fMRI images. We conclude with many possible enhancements to the proposed paradigm.
Acknowledgments

This work has been a truly interdisciplinary effort and therefore owes immensely to a large number of people. I would like to thank the whole VA PET team in Minneapolis for helping me in every way, fielding my questions and massaging the image data in some unorthodox and inconvenient ways. I would particularly like to thank Mr. Jon Anderson and Mr. Kirt Schaper for their data and system help.

I would also like to express my gratitude to Dr. Nick Lange from Harvard Medical School. Nick has provided me with many insights, data and some financial backing that have all been very important in completing this thesis. I truly hope that we will be able to work together more in the future.

No thesis work is possible without the guidance and encouragement of a supervisor. I have been most lucky with two such persons: Prof. Rob Tibshirani, to whom I owe a huge debt of gratitude for introducing me to modern statistics, inspiring me to pursue excellence worthy of such a supervisor, and instilling in me a belief in experiment and an enthusiasm for unorthodox approaches to data. On the other hand, I have enjoyed his unconditional trust and support in choosing the path I felt appropriate. There was so little hand-holding, even though I yearned for it at times; only recently have I been able to appreciate his restraint in leading me and his desire to let me discover my own way in the exciting and ever-growing, but definitely not simple, world of statistics.

This thesis also owes its existence to Prof. Steven Strother, PI of the PET group in the VA medical center in Minneapolis. Steve has given me enormous support and guidance throughout our cooperation. He has taught me how to work with complicated data, how to interpret the results of analysis, how to write papers and fight for them later on, and has given me moral support all the way along. His enthusiasm was crucial at times. I have also learned from him how to look beyond the beauty of mathematical models, how to accept the sometimes brutal truths spoken by the data, which more often than not force us to abandon complication and retain simplicity.

I would also like to offer special thanks to all faculty in the Departments of Statistics and Biostatistics at the University of Toronto, and in the Department of Statistics at Stanford University. I would especially like to thank Prof. Paul Corey in Toronto for his support and encouragement, Prof. James Stafford in Toronto for his support and enthusiasm, and Prof. Trevor Hastie in Stanford for his suggestions and opinions. In particular, the material in sections 3.3 and 5.2 was suggested by Trevor. I also thank him for providing me with an excellent display of wavelet functions (figure 4.1). I am greatly indebted to my friends and colleagues, both in Toronto and in Stanford: Ilana Belitskaya, Annie Dupuis, Celia Greenwood, Carmen Mak, Mike Manno, Jorge Picazo, Bogdan Popescu and Jiaming Sun: thank you!

Last, but certainly not least, my love and deep appreciation goes to my family. I want to thank my wife, Kasia, who has been so patient with me, my brother and his family for love and support, and my parents who have always believed in me in the most difficult of times.
Chapter 1
Introduction
1.1 Images as Data
Data comes to us in various forms; increasingly, it is collected, stored and analyzed in the form of images. To me this is not very surprising: humans receive most of their information about the world surrounding us as visual input. It is only natural that we upgrade the position of this medium in the statistical analysis domain.
There are various reasons why it is only in recent years that image data has become increasingly common. Most important are technological advances on many fronts. To store and process any data, be it sound, image, or categorical attributes, we need, as far as our current information processing techniques go, a numerical representation. The methodology for acquiring images and converting them to numbers, a process known as scanning or image digitizing, has become very accessible. Moreover, many medical instruments have been digitized and already store and process images in digital form. Other technological advances had to be made in the storage domain: image data is very byte-consuming. And here, again, much has been done in recent years, mainly to meet the demand of multimedia-savvy consumers. Hard disks of ever-increasing capacity and speed, and other forms of computer storage (most recently DVD, which stores over an hour of good-quality video on a laser disk) have become very accessible.
As important, although less well known, have been the advances in image compression and processing. To appreciate their importance, one needs to understand the sheer size of image data after it has been converted to a numerical representation, or digitized. After digitizing, the image is viewed as a large number of indivisible components called pixels (picture elements), or voxels (volume elements) when the image is three-dimensional. The more pixels there are, the better the resolution of the digitized image. To represent colour information in each pixel, one usually needs three numbers, one for each of the Red, Green and Blue colour channels. The number of available colours is governed by the maximum allowed value of each such number, which is in turn connected to the number of bits allowed for each number. Today 24 bits per pixel (or 8 bits per number), so-called "TrueColor", giving 2^24 or almost seventeen million possible colours, is an industry standard. To give some idea about the storage requirement, a typical 640 by 480 TrueColor image needs 921,600 bytes, almost a megabyte, to store its 307,200 pixels.
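The storage arithmetic above can be checked with a short sketch (plain Python, no external libraries; the function name is only illustrative):

```python
def raw_image_bytes(width, height, bits_per_pixel=24):
    """Uncompressed storage needed for a width x height image."""
    return width * height * bits_per_pixel // 8

pixels = 640 * 480                 # 307,200 pixels
size = raw_image_bytes(640, 480)   # 3 bytes (R, G, B) per pixel -> 921,600 bytes
colours = 2 ** 24                  # 16,777,216 distinct 24-bit colours

print(pixels, size, colours)       # 307200 921600 16777216
```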
It is obvious that storing images requires compression. General compression algorithms are not optimal since they do not take into account the spatial structure of images. Better are methods that use the information about the special meaning of the bytes and incorporate smoothness assumptions. Two such methods have become very popular: GIF and JPEG. Both are capable of reducing the size requirement many-fold. Increasingly, wavelets are being tried for this task (Antonini et al., 1992).
The old saying "a picture speaks a thousand words" has a special meaning to statisticians and biostatisticians. After all, one of the main goals of statistics is to extract and summarize the information in the data, separating it from noise and error. If images indeed carry so much more information, then we should be particularly interested in adapting, extending and developing statistical methodology for analyzing them. Statistics, however, has been slow in joining other fields that have become heavily involved in image analysis, such as the Machine Learning community. There are some notable exceptions, such as Duda and Hart (1973), who introduced some statistical methodology, mainly in the classification domain, to the pattern recognition community; Ripley (1981), who has summarized existing spatial statistics techniques; or Cressie (1993), who has provided a theoretical framework for mathematical image processing. These are just a few examples, of course, and many more could be found. The point is, however, that the statistical analysis of images is still not a well-established division in our field.
The Machine Learning community, which has already provided statisticians with many new, exciting models and methods, such as Neural Networks (Hertz et al., 1991), Support Vector Machines (Vapnik, 1995), and the very promising Boosting learning (Freund (1995), Freund and Schapire (1996), but see also Friedman et al. (In Press)), has been much quicker to approach image data. Most of the examples there, however, deal with building predictive models that use images as inputs: for instance, character and zip-code recognition models. Work in the direction of statistical understanding and analysis of image data has been much less prominent.
There are various possible reasons why statisticians have been reluctant to consider image data. One is the difficulty in working with images: it requires considerable computing sophistication. Images are stored in many formats, and easy-to-use libraries for reading, writing and transferring between them are not readily available. The problem of huge image sizes, where one image may be many times larger than a typical whole dataset used in statistics, requires much larger computational resources and special programming approaches. Another reason may be the total inadequacy of many common statistical tools. Let us take a simple example where the n x p input data matrix X consists of n images, each with tens or hundreds of thousands (p) of pixels, and where each image is associated with one or more numerical responses, y. Now, if we think of running a regression of y onto X, where n is many times smaller than p, we immediately see the difficulty with such data. Most of the asymptotic results we use, increasingly questioned even with "regular" data, are totally inadequate here. Working with images requires a new way of statistical thinking that questions and examines all the assumptions we have come to rely on in statistics.
1.2 Motivation and the Setup for Statistical Image Analysis

This work deals mostly with methods of analysis of medical images that have been acquired under various experimental conditions. Imaging techniques in medicine have been increasing in importance ever since the introduction of X-ray imaging. There are now many modalities that are routinely used to gather visual information in vivo about the workings of our organism. In this dissertation I center on the neuroimage data: scans of the living brain that represent patterns of neuronal activation. The techniques introduced, however, have, in my opinion, a much greater application. In this section I present a general experimental situation which I view as a basis for the methodology presented in the later chapters.
Let us imagine that we have S subjects and that several images have been acquired from each of them. Suppose further that the images have been obtained under various conditions, either induced experimentally or sampled from the population. For instance, we may have "normal" and "sick" patients, or the patients may be asked to perform certain tasks. In the first case each patient is uniquely assigned to one of the conditions; in the second case we have a blocked design with subjects as blocks. There could also be other variables collected on the patients, either for each image separately or once per subject, but such variables are of secondary interest and would be used in the analysis to control for confounding. Each observation could be presented as follows:

(i_{skr}, z_{skr})

Here, and throughout the document, i will denote an image. The indexes s and r denote subjects and repetitive scans acquired from a given subject, respectively, while k refers to the condition or state under which the image was acquired. The image data may be supplemented by additional measurements, z, as mentioned above.
It is assumed that the goal of the analysis is to estimate the "differences" among the images that were acquired under various conditions. I use the quotes on "differences" to stress the fact that I do not necessarily mean algebraic differences, but any measure of disparity. We are thus required to summarize that part of the variability in the images that was induced by the conditions. The important goal of any descriptive analysis of experimental data is to provide some decomposition of that part of the variability that is associated with the experimental setup. Furthermore, it is desired that the results be in the form of one or more images, with important measures (such as percentage of explained variability, assessment of the type of variability explained by each summary image, etc.) attached. This is quite different from the predictive goal of many AI methods: we would not be content with a black box that takes the images as an input and predicts their condition, k, for example.
One of the difficulties alluded to in previous sections is an atypical data setup. In the general framework introduced above, we have independent observations, {i, z}, that have a huge apparent dimensionality equal to the number of pixels (plus whatever extra variables z we measured), while the number of such observations is many times smaller. This is referred to as an extremely ill-posed problem (e.g., Lautrup et al., 1995). This is one of the reasons that the usual inferential statistics based on asymptotic results cannot be applied. The motivation of this work was to develop a framework that would be testable with as few distributional assumptions as possible. To this end, I utilize measures of predictive performance as a goodness-of-fit assessment that are robust across distributional assumptions.
1.3 Functional Data
Image data may be thought of as a two- and three-dimensional extension of functional data: data that is realized by observing smooth functional processes discretized on a common lattice. Ramsay and Silverman (1997) have surveyed and extended common linear statistical methods for such data. They develop functional alternatives to Principal Component Regression, General Linear Models (including MANOVA), Canonical Correlation and Linear Discriminant Analysis, among others.
The starting point for all the models described in the book is the definition of the functional inner product. Since inner products are the workhorse of all the linear methodology in statistics, providing the right definition for functional data yields a functional equivalent of the model. The authors settle on the usual L2(R) Hilbert space inner product:

⟨x, y⟩ = ∫_T x(t) y(t) dt,    (1.2)

where T is the domain of the data.
Of interest for this thesis is the functional approach to Canonical Correlation Analysis (CCA). The classical CCA may be stated as a maximization problem (see also Eq. 3.38):

max_{a,b}  a'Σ_xy b / [(a'Σ_xx a)(b'Σ_yy b)]^(1/2),

where the Σ's are the appropriate covariance and cross-covariance matrices for x_i, y_i, the N observed pairs for which we seek to fit the CCA. If we now assume that what we observed are pairs of functions, x_i(t), y_i(t), we may define sample variance operators, e.g.,

V_x ξ = (1/N) Σ_{i=1}^N ⟨x_i, ξ⟩ x_i

for centered functions, with V_y and the cross-covariance operator V_xy defined analogously.
With these definitions the functional CCA criterion is posed:

max_{ξ,η}  ⟨ξ, V_xy η⟩ / [⟨ξ, V_x ξ⟩ ⟨η, V_y η⟩]^(1/2),    (1.6)

where ξ, η are functions belonging to the Hilbert space defined by the inner product (L2(R) for the one defined in (1.2)). To obtain unique and interpretable solutions, the inner products in the denominator are modified via penalization. Typically a second-derivative penalty would be used. If one denotes the second-derivative differential operator by D², we can modify the criterion (1.6):

max_{ξ,η}  ⟨ξ, V_xy η⟩ / [(⟨ξ, V_x ξ⟩ + λ⟨D²ξ, D²ξ⟩)(⟨η, V_y η⟩ + λ⟨D²η, D²η⟩)]^(1/2),    (1.7)

which, under mild regularity and boundary conditions, is equivalent to replacing λ⟨D²ξ, D²ξ⟩ by λ⟨ξ, D⁴ξ⟩ (and similarly for η), using the fourth-derivative operator D⁴.
Now that the criterion is posed, what remains is to develop an algorithm to optimize it, given the observed data. One possibility, pursued in Ramsay and Silverman (1997), is to use a basis expansion step which permits the use of standard tools for functional data. That is, given a system {φ_k(t)}, k = 1, ..., K, we assume:

x_i(t) = Σ_k c_{ik} φ_k(t),    y_i(t) = Σ_k d_{ik} φ_k(t).

(There is nothing in the following algebra that requires the use of the same system for x and y; in fact, for the linear discrimination only the x's are expanded.)
In place of the operators V_x, etc., one has matrices expressing the covariances in the smooth basis domain, built from the coefficient matrices C = (c_{ij}) and D = (d_{ij}). The covariance matrix for the basis is simply J_{jk} = ⟨φ_j(t), φ_k(t)⟩. Similarly, the penalty matrix is K_{jk} = ⟨(D²φ_j)(t), (D²φ_k)(t)⟩. With these definitions we can show that the penalized criterion (1.7) becomes:

max_{a,b}  a'J(C'D/N)Jb / [(a'(J(C'C/N)J + λK)a)(b'(J(D'D/N)J + λK)b)]^(1/2),    (1.11)

where the left and right canonical variates now are:

ξ(t) = Σ_j a_j φ_j(t),    η(t) = Σ_j b_j φ_j(t).
To see that, let us look at one component of the denominator, ⟨η, V_y η⟩, without penalty. The covariance kernel is:

v_y(s, t) = (1/N) Σ_i y_i(s) y_i(t) = Σ_{j,k} [D'D/N]_{jk} φ_j(s) φ_k(t),

so that ⟨η, V_y η⟩ = b'J(D'D/N)Jb. Thus, with the penalty term added, we get the two components of the denominator of (1.11). The numerator can be derived in an almost identical manner.
The discriminant version of this is to use the class indicator matrix Y without any regularization, but to regularize the observations x_i(t). Our approach has been simpler: as we show in Sec. 3.3.4 we only expand the canonical variate (here, ξ) in the smooth B-spline and wavelet bases. Compared with the approach presented here, this does not necessitate fitting the basis to the data to obtain the coefficients c_{ik}, but merely projecting the data onto the basis. With least-squares fitting, the difference is that we would be using a non-orthogonal projection onto the space spanned by the basis functions; instead, we orthogonally project the observations x_i(t) onto each basis function separately. This means that it is not necessary to compute the basis covariance matrix, J, in our case. For an orthogonal basis, like wavelets, the two approaches are identical. For a non-orthogonal basis, like the B-splines that we use, it would be a worthwhile experiment to compare our (simpler) model with that proposed by Ramsay and Silverman (1997).
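To give a sense of the computations involved, the following is a minimal penalized-CCA sketch in Python (purely illustrative code, not part of the thesis software). It operates on data or basis-coefficient matrices; the identity default for the penalty matrix K is an assumption standing in for the roughness penalty λK above, and the tiny ridge on the y-side is added only for numerical stability.

```python
import numpy as np

def penalized_cca(X, Y, lam=0.0, K=None):
    """First penalized canonical pair for data matrices X (N x p), Y (N x q).

    Maximizes corr(X a, Y b) with the x-side norm penalized by lam * a'Ka,
    in the spirit of the penalized criterion discussed in the text."""
    N = X.shape[0]
    Xc = X - X.mean(0)
    Yc = Y - Y.mean(0)
    Sxx = Xc.T @ Xc / N
    Syy = Yc.T @ Yc / N
    Sxy = Xc.T @ Yc / N
    if K is None:
        K = np.eye(X.shape[1])            # ridge penalty as an assumed default
    # Whiten each block (Cholesky), then take the leading SVD of the
    # whitened cross-covariance: the classical reduction of CCA.
    Rx = np.linalg.cholesky(Sxx + lam * K)
    Ry = np.linalg.cholesky(Syy + 1e-8 * np.eye(Y.shape[1]))
    M = np.linalg.solve(Rx, Sxy) @ np.linalg.inv(Ry).T
    U, s, Vt = np.linalg.svd(M)
    a = np.linalg.solve(Rx.T, U[:, 0])    # left canonical weight vector
    b = np.linalg.solve(Ry.T, Vt[0])      # right canonical weight vector
    return a, b, s[0]                     # s[0] = first (penalized) correlation
```

With lam = 0 this reduces to classical CCA; increasing lam shrinks the x-side solution in the metric defined by K.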
1.4 Notation and Conventions
We will adhere to typical statistical notation, with a few exceptions. Thus x will denote a dependent observation (column) vector, with one or more subscripts as required. For classes (or conditions) we will use superscripts in brackets. Thus x_ij^(k) will denote the (ij)th observation, with the meaning of the subscripts explained at appropriate places, obtained under the kth class or condition. In place of x, we will use i if we are referring to the scan data. We intend to call the input images scans, and usually reserve the word image for the result of some analysis that lies in the same space as the input scans, and may thus be visualized in the same way. However, when the context makes the difference clear, we may sometimes use "scan" and "image" interchangeably for stylistic reasons.
We use boldface in formulas to distinguish vectors from scalars if there is a danger of confusion, such as when both scalars and vectors appear. Upper-case Roman letters from the late alphabet (U, V, W, X, Y, Z) and Greek letters are used for matrices, without boldface. In a few places boldface upper-case X, Y are used to denote random vectors, but, in general, we do not make notational distinctions between random variables (including vectors) and their realizations, unless it is necessary.
We have intended to limit and localize the use of acronyms. Some well-known ones (e.g., MANOVA) are used freely. Other acronyms used globally are explained in Table 1.1.
LDA (Linear Discriminant Analysis): statistical technique, due initially to Fisher, for classifying multivariate observations into one of a few populations by means of linear discriminant functions.

PDA (Penalized Discriminant Analysis): extension of LDA due to Hastie et al. (1995) that introduces general penalization of the covariance matrix and provides an appealing algorithm with a penalized regression as its main component.

CCA (Canonical Correlation Analysis): another classical multivariate technique that, given observations with variables divided into two sets, finds "left" and "right" linear combinations that exhibit maximum correlation. LDA can be seen as a special case of CCA.

PET (Positron Emission Tomography): one of the few tomographic imaging techniques, especially useful for imaging of the brain. The PET camera picks up gamma rays emitted within the imaged organ by a previously injected radiotracer.

fMRI (functional Magnetic Resonance Imaging): another imaging technique used by neuroscientists to study brain function. fMRI is a specialization of MRI that is able to measure relative concentrations of oxygenated blood.

CV (Canonical Variate): the linear combination(s) that result from CCA and LDA. If the CV lies in the image space, or has been reconstructed using the B-spline or wavelet basis, we sometimes refer to it as a Canonical Image.

SPE (Squared Prediction Error): the main measure of predictive performance that we use, defined as SPE = (1 - p̂)^2, where p̂ is the posterior probability of the true class.

EDF (Equivalent Degrees of Freedom): one way to normalize the ridge penalty hyperparameter, by calculating the trace of the "hat" (projection) matrix in ridge regression.

Table 1.1: Some common acronyms used throughout the thesis
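As a concrete illustration of the SPE entry in Table 1.1 (illustrative code, not part of the thesis software), the per-scan error can be computed from a matrix of class posterior probabilities as follows; the posterior values below are made up.

```python
import numpy as np

def squared_prediction_error(posteriors, true_class):
    """Per-scan SPE = (1 - p_hat)^2, where p_hat is the posterior
    probability assigned to the true class of each scan (Table 1.1)."""
    p_true = posteriors[np.arange(len(true_class)), true_class]
    return (1.0 - p_true) ** 2

# hypothetical posteriors for 3 scans over 2 classes
post = np.array([[0.9, 0.1],
                 [0.4, 0.6],
                 [0.5, 0.5]])
labels = np.array([0, 1, 0])
spe = squared_prediction_error(post, labels)   # per-scan errors
```

Averaging `spe` over held-out scans gives the goodness-of-fit measure used for model comparison.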
Chapter 2
Neuroimaging Data and Methods
2.1 Goals and Study Design in Neuroimaging
Neuroimaging is a relatively young discipline that attempts to study the workings of the brain and the central nervous system through imaging techniques (Frackowiak et al., 1997). The most important goal is to discover the functional organization of the brain: the networks of functionally connected structures in the brain specific to a given task or group of tasks. It is postulated that the brain is organized in various, likely overlapping, networks that are connected by function rather than anatomically (Strother et al., 1995a; McIntosh et al., 1997). These networks, which can be observed as patterns of activation, come to life when the brain faces a specific challenge, and work together to deliver a response.
There are two opposing views of brain organization. One depicts the brain as a monolithic black box of neurons. This view of massive parallelism has led to the development of Artificial Neural Networks, which have become a very successful computational and modeling device, rather than a true model of the brain. The other view, of precise localization of areas within the brain, has been supported throughout the century by a series of first anatomical, and then functional, discoveries of specific regions in the brain, starting with the discoveries of language components in the brain by Broca, Wernicke and Lichtheim at the end of the nineteenth century. The updated view lies, as often happens in science, somewhere in between. One way to describe it (Strother et al., 1995a; McIntosh et al., 1997) is to consider functional networks of areas, where the networks are specific for the task. Therefore, on some level, we do have homogeneous areas in the brain, but it takes a system of these, not one, for the brain to process a task. The same regions are likely used in quite different situations, when they will be connected in different networks.
A similar view is espoused by the notions of functional segregation and functional integration (p. 5, Frackowiak et al., 1997). The functional segregation concept refers to a large number of spatially localized areas in the brain that work more or less independently. The integration idea refers to the global integration of these specialized areas in the face of a task. These two views are not exactly the same: if we assumed that some orthogonality needs to be imposed, the functional-networks concept would more easily correspond to orthogonal networks of areas, while the functional integration/segregation description would be better served by orthogonality among the specialized areas. It is not clear, however, that any orthogonality assumption is correct in describing brain function.
To clarify some ideas I now provide an example using parts of the cortex responsible, to some degree, for our motor abilities. Most of the description here is taken from Frackowiak et al. (1997, Ch. 11) and Noback et al. (1991). There are many parts of what is known as the motor cortex: primary motor cortex (M1), supplementary motor area (SMA), premotor cortex, also known as pre-SMA, secondary motor cortex and cingulate motor area (CMA). All of these correspond to distinct Brodmann areas, which are a meticulous division of the human brain made on the basis of the local properties of the brain cells (cytoarchitecture). It has been established that M1, pre-SMA and SMA contain multiple representations of the body, that is, they are organized somatotopically. In other words, one can find mostly contiguous areas that correspond to all parts of our voluntary motor system, from the toes to the tongue and facial muscles. The situation is quite complicated: the cingulate parts of the motor cortex are still controversial, the pre-SMA and SMA areas seem to be composed of even smaller parts with some autonomy, and the function of the secondary motor cortex remains largely unknown. There are also subcortical areas in the cerebellum and the ventral part of the thalamus which contribute significantly to the functioning of the motor cortex. There are also parts of the somatosensory systems necessary for motor control.
Major research questions center on the functional significance and connectivity of all these, and other related, systems. We would like to understand how the brain controls our musculature, how it plans and executes movements, and how the working of the motor cortex of a fine pianist differs from that of an average person. Can movements be divided into groups that correspond to distinct activity patterns? We are only beginning to tackle these questions, mostly with neuroimaging techniques. The second half of the last, and all of the present, century have provided us with a huge body of knowledge related to the anatomical arrangement of the brain. We thus know a fair amount about the major connections in the brain, but to enhance our understanding of this most important organ of ours, we must concentrate on its functional arrangement.
A common type of study design used in neuroimaging attempts to delineate the activation signal using two contrasting states. The states are chosen in such a fashion that the difference in activation patterns will provide maximum information about a specific function of the brain. For example, in our finger opposition (FOPP) PET data (Sec. 2.4.1) the baseline and activation states differ only in the presence or absence of paced finger movement; in particular, the eyes are patched in both states and there is no auditory input except for the pacing signal in the active state. Similarly, in the StaticForce fMRI experiment (Sec. 2.4.2) the subjects observe control lines during the baseline state to compensate for the visual stimulation due to the force level display in the active states.
In another kind of study design one gradually varies a single parameter that relates to the strength of a supposed neural signal; an example is the StaticForce dataset. This situation is similar to the dose-response relationship that is commonly observed in pharmacological studies; here one would look for patterns of activity that change with the parameter. The change may be monotone, linear or not, or may be abrupt at first (in a transition from baseline to active state) and not change much thereafter. Indeed, both kinds of activity may be related to two different neuronal patterns at the same time. For instance, Sadato et al. (1996) report how the bilateral primary motor area and contralateral ventral premotor cortex, among others, were equally activated during an active phase of finger movements of increasing complexity. On the other hand, the ipsilateral premotor area, also among others, has shown a linear increase with movement complexity. We therefore observe at least two important networks of activation in this study: one associated with the movement itself, an executive network, and one responsible for the processing and planning of movement, which therefore has had an increased activity for complex tasks. Similar results have also been reported by Catalan et al. (1998).
2.2 PET and fMRI Modalities
The object of any neuroimaging modality is to reveal the neuronal activity throughout the brain volume or part thereof. The two most commonly used whole-brain modalities are Positron Emission Tomography (PET) and functional Magnetic Resonance Imaging (fMRI). Neither of these actually records the neuronal spiking patterns; rather, they go after a "proxy" measure whose correlation with the neuronal activity has been established.
2.2.1 Positron Emission Tomography
Positron Emission Tomography (PET) is a general imaging technique that is used for many purposes in medicine where an image of physiological function is required. The PET modality is an improvement over other radioisotope-based modalities, such as single-photon emission computed tomography (SPECT). The description in this section is based mainly on Ollinger and Fessler (1997).
PET works by counting the number of high-energy (511 keV) photons emitted from the imaged organ. In summary, positrons are created when the injected radiotracer steadily decays; each decay produces a single positron which very shortly annihilates with an electron. The annihilation produces two 511 keV photons propagating in nearly opposite directions. The PET camera is able to detect single photons and synchronize two hits to establish that the two photons originated from the same annihilation. In this way discrete approximations to the line integrals of radiotracer density along many lines are computed, and the 2D or 3D image of the activity is obtained by the inverse Radon transform.
A PET camera has detectors made of crystals (usually bismuth germanate) which convert a single high-energy 511 keV photon into about 2,500 light photons. These are then fed into photomultiplier tubes (PMTs) which then change the light activity into electrical signals. Most of the scanners connect each block of small crystals (e.g., a 7 x 8 array of crystals) to a block of fewer PMTs (e.g., a 2 x 2 array of them). The crystals in the block differ slightly among each other, which allows the camera to determine the one crystal in the block hit by a photon. The camera counts the number of events: a pair of photons hitting two crystals on opposite sides of the camera within a very short period of time, called the coincidence timing window, usually about 10 ns. These counts, after they have been processed by the inverse Radon transform, create the 3D image of organ function.
There are many simplifying assumptions and problems that decrease the signal-to-noise ratio of the data. First, it is assumed that the positron will annihilate with an electron immediately after being emitted. It has been shown that the positron range is usually smaller than 1 mm, which is much smaller than the resolution of the scanner, and is therefore ignored. Another assumption is that the annihilation will produce photons flying out in exactly opposite directions. It has been established that the divergence from collinearity is on the order of one degree or less, and can also be ignored. The other problems are more serious and usually cannot be ignored. The first of these is attenuation: a decrease of the photon's energy due to its interactions with body tissue and with outer-shell electrons. The interaction with body tissue, a photoelectric interaction, while a big problem for SPECT, is negligible for PET due to the type of radiotracers used. The other type of interaction, Compton scatter, can be statistically corrected for in the image reconstruction process.
This correction is possible because the attenuation experienced by the pair of photons is independent of the position of the annihilation event. To correct for attenuation one approximates the probability of a single photon pair experiencing Compton scatter, which depends on the total distance traveled, and then includes this correction in the image reconstruction step. The probability of a photon traveling along the line l without experiencing Compton scatter is modeled by the following equation:

p(l) = exp( - ∫_l μ(x) dx ),    (2.1)

where μ(x) is the linear attenuation coefficient at position x. This probability is usually approximated by obtaining two extra scans: transmission and blank. These are obtained with a line source of radiation rotating around the field of view of the camera with (transmission scan) and without (blank scan) the subject. The ratio of counted events for each possible line l approximates the probability (2.1). The number of detected events in the regular emission scan is then corrected for the probability of scatter.
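In practice this correction amounts to a per-line division of the emission counts by the estimated survival probability. A minimal sketch (illustrative code with made-up count values, not scanner software):

```python
import numpy as np

def attenuation_correct(emission, blank, transmission):
    """Correct emission counts using the blank and transmission scans.

    For each line of response, transmission/blank estimates the survival
    probability p(l) of Eq. 2.1; dividing the emission counts by it
    undoes the attenuation."""
    p_survive = transmission / blank
    return emission / p_survive
```

Noise in the transmission scan propagates directly into the corrected counts, which is one reason long transmission acquisitions are preferred.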
An additional way to correct for Compton scatter, so-called scatter correction, comes from the fact that the scattered photons have smaller energy. The energy can be measured, to some degree (to about 10% on most scanners with bismuth germanate crystals), at the individual detectors, and then a threshold is established below which the events are not counted.
Compton scatter produces another undesired effect: it causes a deflection in the path of the affected photon. Most of the time such a photon will not hit any detector; the unscattered complementary photon which will hit the detector is called a single. It is possible, given the large number of scattered photons, that two singles will hit the camera within the coincidence timing window, and therefore be erroneously counted as an event that occurred on the line joining them. Such undesired events are called randoms or accidental coincidences.
The last problem which needs to be corrected for is detector deadtime. This is due to the finite amount of time that the detector needs to process a hit; during this time the detector is not able to sense any other hits. The detector deadtime limits the maximum dose of the radiotracer that can be placed in the patient: the researcher will try to use the maximum dosage that will still not saturate the camera, but which provides enough events to make viable the discrete approximations, used by algorithms such as filtered backprojection, to the line integrals implicit in the Radon transform.
The PET data can be collected in 2D or 3D acquisition modes. In 2D mode, the events are counted in slices physically determined by collimators: thin annular rings of tungsten called septa. This mode results in greater accuracy by decreasing the probability of scattered events and randoms, since many more of the scattered photons originating in the 2D field of view will never hit the collimated detectors. However, 3D mode, in which the septa are retracted and all possible events are counted, has up to eight times increased sensitivity, which leads to decreased image variance and/or lowered doses of radiotracer required. Until recently most of the PET data was collected in 2D mode, mostly due to the lack of 3D reconstruction algorithms. However, 3D mode is now gaining wide acceptance in the neuroimaging community.
The data collected by the PET camera are in the form of counts for each possible line in the field of view. To reconstruct the image of the radiotracer density inside the imaged organ one uses a computational approach to solving the inverse Radon transform, called filtered backprojection (FBP). After correcting for some or all artifacts, such as attenuation and scatter, one has, in the 2D scanning mode, the data in the form of counts of photons emitted along a given line, indexed by depth and angle, and denoted by N_{θd}. These counts constitute the input to the FBP. The goal is to estimate the 3D distribution function of the radioisotope, λ(x, y, z), with only the values of its line integrals available. FBP is a deterministic method that assumes we have observed perfect data, that is:

N_{θd} = ∫_{l(θ,d)} λ(x, y, z) dl,

over a discrete set of angles θ and depths d. This is an instance of an inverse problem, and one specialized solution to this problem is the algorithm called filtered backprojection. The algorithm has been extended to full 3D reconstruction. There are also approaches that use distributional assumptions, mostly Poisson, but they have yet to gain widespread support.
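For intuition, a bare-bones 2D version of the algorithm fits in a few dozen lines. The sketch below (an illustration only, not the reconstruction code used for the data in this thesis) ramp-filters each projection in the Fourier domain and then smears it back across the image grid with nearest-neighbour interpolation:

```python
import numpy as np

def filtered_backprojection(sinogram, angles):
    """Minimal 2D filtered backprojection.

    sinogram: (n_angles, n_detectors) array of line-integral counts N_{theta,d}
    angles:   projection angles in radians
    Returns an (n, n) image, with n = n_detectors."""
    n_angles, n = sinogram.shape
    # 1. Ramp-filter each projection in the Fourier domain.
    ramp = np.abs(np.fft.fftfreq(n))
    filtered = np.real(np.fft.ifft(np.fft.fft(sinogram, axis=1) * ramp, axis=1))
    # 2. Backproject: smear each filtered projection across the image.
    mid = (n - 1) / 2.0
    ys, xs = np.mgrid[0:n, 0:n] - mid
    image = np.zeros((n, n))
    for proj, theta in zip(filtered, angles):
        # detector coordinate of each pixel under this viewing angle
        d = xs * np.cos(theta) + ys * np.sin(theta) + mid
        d = np.clip(np.round(d).astype(int), 0, n - 1)
        image += proj[d]
    return image * np.pi / n_angles
```

Production implementations add apodized filters, interpolation better than nearest-neighbour, and the corrections for attenuation, scatter, randoms and deadtime described above.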
Functional Neuroimaging via PET

PET may be used for imaging many organs. In functional neuroimaging, where we want to obtain information about the neuronal activity, two radiotracers have emerged: [18F]fluoro-2-deoxy-D-glucose (FDG) and [15O]water. FDG is a radiolabeled glucose and allows PET to show the local glucose concentration in the brain. This in turn is able to show us the neuronal activity, since increased neuronal activity is very quickly followed by a surge of glucose-rich blood (Barinaga, 1997). Similarly, the [15O]water tracer allows PET to image the blood flow in the brain. Blood is also thought to surge into active neural areas to meet the demand for oxygen needed for the metabolism of glucose (there is, however, some controversy regarding the nature of the metabolism of neurally active areas; see for example Buxton and Frank (1997) and Barinaga (1997)).
2.2.2 Functional Magnetic Resonance Imaging

We will first describe (non-functional) MRI, which is sometimes called anatomical MRI in the neuroimaging community, as it describes the static anatomy of an organ (the brain, for example) rather than its dynamic function. Most of the description here is based on the overview article by Wright (1997).
General Physics of MR Imaging
MR techniques are based on magnetic properties of atoms, called nuclear magnetic resonance, first observed in the forties. In medical imaging one uses, almost exclusively, the simplest atom: the single-proton hydrogen nucleus. A hydrogen atom may be thought of as a minuscule magnet with its two poles producing a local magnetization vector in a certain orientation. In the absence of any external magnetic field of significant strength, thermodynamic movement causes a random distribution of the local magnetization vector directions, which results in a net magnetization, M, equal to zero.
MR machines used for human diagnostics apply a static field, B0, of a strength that is about five orders of magnitude higher than the earth's field (a typical MR machine has a 1.5 Tesla (1.5T) static field, which is about 20,000 times larger than the earth's field). The most visible effect of such a large static field is that it causes a small portion of the hydrogen nuclei to align themselves in the direction of B0, which we presume to be along the vertical axis, Z, in a 3D reference frame. The process of alignment is not immediate; the net magnetization in the direction of B0 has an exponential delay:

M_z(t) = M_0 (1 - e^(-t/T1)),    (2.2)

where M_z is the vertical magnetization at time t after the static field has been turned on, M_0 is its asymptote, and T1 is the longitudinal relaxation time, which is a property of the material studied. For example, at 1.5T, T1 for grey matter in the brain is about 1000 ms, while for white matter it is only about 650 ms, and even less for fat (260 ms).
Assuming that there is a non-zero net magnetization component in the plane perpendicular to the B0 field, i.e. in the XY plane (which is not the case in a tissue without any magnetic influence; such a component is introduced by the MR machine as described later), the strong static field B0 causes the XY magnetization component to rotate around the Z axis as it tries to align itself with the static field. In the literature this rotation is called precession, and its angular frequency is directly proportional to the strength of B0. This frequency is called the Larmor frequency. Any rotating magnetic dipole, such as a hydrogen proton, generates an electrical current in a coil positioned perpendicular to the plane of rotation. This is the signal detected by the MR machine; the coils are made to resonate at the Larmor frequency of a proton to maximize the signal detected.
In general, any volume of a tissue that contains many protons will have no net magnetization in the XY, or transverse, plane. This magnetization is induced in the MR machine with an RF pulse that rotates with the Larmor frequency in the transverse plane. Figuratively speaking, one may imagine the RF pulse as tipping over these dipoles that have aligned themselves with the static field. The RF pulse is applied long enough to tip the dipoles to the horizontal direction; with time the XY magnetization component, while rotating around the Z axis, will return to the thermal equilibrium condition where the only significant component is in the Z direction. But an even larger component contributing to the decay of the XY signal comes from the gradual loss of phase coherence among the precessing dipoles. After the RF pulse, the precessing dipoles will be in phase. Due to their heterogeneous physical environment they will precess at slightly different rates, and as a consequence the phase coherency will be lost, resulting in a diminishing signal. The associated exponential decay has a characteristic time constant, denoted by T2*. To restore the phase coherency, MR machines apply another magnetic pulse that induces a spin echo. Specifically, let the dipoles evolve for time τ, when some phase discrepancy will be evident. Apply a short (as compared with τ) magnetic pulse in a single direction (say y) in the transverse plane (in practice one applies the pulse in the single direction of the transverse plane rotating with the Larmor frequency). This pulse effectively "flips" the dipoles about the y axis: those that were precessing faster and were "ahead" of y now lag behind by the same amount, and vice versa. The result of the refocusing pulse is that after time 2τ the phase coherency will be restored, assuming that the rate differences among the dipoles do not change with time.
In practice, the precession frequencies of different dipoles do change with time, and, despite the spin echo, the XY signal will decline. The dynamics of this decline, taking the refocusing efforts into account, are modeled as an exponential decay with constant T2:

M_xy(t) = M_0 e^(-t/T2).    (2.3)

As was the case for T1, the transverse relaxation time, T2, is tissue specific. For grey matter, white matter, and fat the T2's are 106 ms, 69 ms and 60 ms (all at 1.5T), respectively. All the above considerations are captured in one equation, discovered by Bloch in 1946, which describes the full dynamics of the magnetization field M(t) = (M_x(t), M_y(t), M_z(t)):

dM/dt = γ M x B - (M_x, M_y, 0)/T2 - (0, 0, M_z - M_0)/T1,

where B is the total magnetic field applied. The first term describes the general precession dynamics (γ is the gyromagnetic constant; for protons, γ = 2π · 42.6 MHz/T) and the remaining terms deal with the transverse decay (Eq. 2.3) and the gradual alignment of the dipoles with the static field (Eq. 2.2). After the signal declines to limiting levels, a short period is required for the system to return to thermal equilibrium within the static field before the next RF pulse is applied and the measurement process repeated. The total time between RF pulses is denoted by TR and is usually on the order of a few seconds. The time between refocusing pulses is denoted by TE = 2τ.
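The two relaxation laws, together with the tissue constants quoted above, are easy to tabulate (an illustrative sketch; times are in milliseconds and M_0 is normalized to 1):

```python
import numpy as np

# T1 and T2 at 1.5 T as quoted in the text, in milliseconds
T1 = {"grey": 1000.0, "white": 650.0, "fat": 260.0}
T2 = {"grey": 106.0, "white": 69.0, "fat": 60.0}

def m_z(t, tissue, m0=1.0):
    """Longitudinal regrowth after the static field is applied (Eq. 2.2)."""
    return m0 * (1.0 - np.exp(-t / T1[tissue]))

def m_xy(t, tissue, m0=1.0):
    """Transverse (spin-echo) signal decay (Eq. 2.3)."""
    return m0 * np.exp(-t / T2[tissue])
```

At t = T1 the longitudinal magnetization has recovered 1 - 1/e (about 63%) of M_0, and the grey/white differences in both constants are what make tissue contrast possible at typical TR and TE settings.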
Contrasts and Spatial Imaging in MRI
The measured MR signal, which comes from the precession dynamics, is eventually used to
produce the tissue images. Depending on the tissue under study, different contrasts may
be used; these are combinations of the T1 and T2 relaxation times. Depending on the time
window during which the data is acquired, one weighs either of these more heavily in the
resulting contrast. What is needed at this point is a way to spatially select regions for data
acquisition, to produce images. This is achieved in a few steps. Firstly, the "static" field,
B_0 in the Z direction, has a linear gradient. Since the precession frequency of protons in the
transverse plane depends on the strength of the field, one may acquire the data in slices by
applying RF pulses with different frequencies, matched to the precession frequencies in thin
slices along the Z axis. This ensures that the recorded signal comes mostly from the dipoles in
a specific horizontal slice.

To locate the signal in the XY plane, similar ideas are used. One introduces gradients
in the X and Y directions that also vary with time. The signal acquired up to time t, say,
will be a 2D spatial Fourier transform of the total slice magnetization, sampled at the
spatial frequency (k_x(t), k_y(t)), the so-called k-space. The k-functions are integrals of the
respective dynamic gradients over time, up to t. The image of the slice magnetization is
reconstructed using the inverse 2D Fourier transform.
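This acquisition-and-reconstruction loop can be sketched numerically: with full Cartesian sampling, k-space is just the 2D Fourier transform of the slice magnetization, and the image is recovered by the inverse transform (a minimal sketch; the phantom is illustrative):

```python
import numpy as np

# A toy "slice magnetization": a bright square on a dark background.
slice_mag = np.zeros((64, 64))
slice_mag[24:40, 24:40] = 1.0

# Fully sampled Cartesian k-space is the 2D Fourier transform of the slice;
# each acquired sample corresponds to one (k_x, k_y) location.
kspace = np.fft.fft2(slice_mag)

# Image reconstruction: inverse 2D Fourier transform of the k-space samples.
recon = np.fft.ifft2(kspace).real

assert np.allclose(recon, slice_mag, atol=1e-10)
```

In practice the gradients trace out a trajectory through k-space and only a finite set of frequencies is sampled, so the real reconstruction is an approximation rather than the exact inverse shown here.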
Functional MRI
Functional MRI was developed by Ogawa et al. (1990a,b). The "functional" adjective refers
to the novel utilization of MR technology for imaging physiological functions, as opposed
to the static structures for which it was originally proposed.
Functional MRI takes advantage of the differing magnetic properties of hemoglobin that
depend on whether it carries oxygen or not. Oxyhemoglobin is diamagnetic, as are other
tissues in the brain. Deoxyhemoglobin is paramagnetic and causes changes to the proton
molecules in the water within the blood and surrounding the blood vessel. This is called
a Blood Oxygenation Level-Dependent (BOLD) contrast. The paramagnetic nature of
deoxyhemoglobin is somehow "felt" by the water molecules "close by", which amplifies the
BOLD signal significantly. The change in magnetic susceptibility affects the distribution
of Larmor frequencies of nearby protons, causing a much greater phase spread. This in
turn causes a significant decrease in MR signal in the affected areas and results in a magnetic
contrast mostly dependent on the T2* time constant. Functional MRI produces images
that show local concentrations of oxygenated hemoglobin in veins and capillaries. Since
we believe that there is a strong correlation between levels of oxygenated blood and neuronal
activation, BOLD images may be interpreted as images of the brain function.
2.3 Literature Overview
PET and, most recently, fMRI data have been analyzed by a plethora of methods of
ever-increasing complexity. The most important challenges seem to be:

Huge input dimensionality: each "observation" is an image composed of 30-500 thousand
numbers. This leads to the "extremely ill-posed" situation (e.g., Morch et al.,
1997) where the number of variables is much larger than the number of observations.

Time series effects: even if all we need is a pattern of neuronal activity which corresponds
to the activity that we study, we know that we cannot obtain a truly repetitive
experiment within a subject: the brain state changes throughout the experiment,
as there is some learning, change in the environment, adaptation, and many other
transient effects.

Subject effects: with multiple-subject studies, needed to obtain results of some generality,
we observe (Strother et al., 1995a) that the differences among brains produce
effects which are much larger than the effect due to the stimulus under study.

Spatial correlation: the activity in nearby brain locations is correlated. This must be
acknowledged in either the modeling stage or in hypothesis-testing paradigms (e.g.,
Worsley et al., 1992), or both.
Many initial methods analyzed Regions of Interest (ROIs), which were manually
delineated using anatomical or other known regions in the brain (e.g., Clark et al., 1985).
The average activity within each region was used as the input to any subsequent analysis.
This resulted in a great reduction in input dimensionality, as one would typically have a
few dozen regions at most. This method has been mostly abandoned with the introduction
of methodology to deal with extremely ill-posed problems and the corresponding software
packages. The most fundamental criticism of the ROI methodology is directed at the fact
that manual and ad hoc ROI definitions impose a strong, mostly anatomical, prior on the
analysis.
Currently there are several groups of models used to estimate activation maps for PET
and the closely related functional magnetic resonance imaging (fMRI) technique. Our
categorization follows that recently proposed by Lange et al. (1999) in fMRI. First, consider
techniques that explicitly incorporate the experimental state of the subject for each
scan (e.g., baseline or activation), with possible additional explanatory variables such as
neuropsychological performance measures, which correspond to the "interesting indicator/categorical
variables" and "covariates" of the general linear model (GLM) approach of
Friston et al. (1995). Simple subtraction of (possibly standardized) average images from
two experimental states is the most widely used example of this approach (e.g., Fox and
Mintun, 1989, Worsley et al., 1992), and ANOVA-related methods for more than two
states have been generalized by the GLM. These methods attempt to find the activation pattern
which is driven by (or which drives) the experimental or observed conditions: an
imposed stimulus, motor task or abnormality.
The second category includes techniques such as principal component analysis (e.g.,
Friston et al., 1993, Strother et al., 1995b), which require no experimental brain-state
information. In these methods, one attempts to explain the general variability with a
set of independent or orthogonal components and then post hoc link some of these to
the experimental conditions. The problem with these methods is that the variability is
partitioned without any reference to the stimulus or experimental conditions, which are
then "sought after" among the resulting components. The third wide category includes all
the non-linear models, such as neural networks (e.g., Kippenham et al., 1994, Lautrup et al.,
1995, Morch et al., 1997), and, very recently, Volterra kernels within the GLM framework
(Friston, 1998).
All three categories may be applied to two different spatial data representations that
have evolved to deal with the highly ill-posed nature of the functional neuroimaging domain:
namely, the huge dimensionality of the input space, equal to the number of voxels in
the image (e.g., 20 to 30 thousand in PET), compared to the available number of independent
scans, which is typically only a few hundred. The most common spatial representation
initially ignores this issue by analyzing individual voxels, or volumes of interest (VOI), as
independent samples, generating a test statistic for each voxel (or VOI), and then post
hoc allowing for simple local spatial correlations by using inferential tests based on random
field theory to threshold the resulting statistical parametric maps (e.g., Worsley et al., 1992,
Friston et al., 1995, Worsley et al., 1996). The second representation uses a data-driven
basis, such as the one obtained from Singular Value Decomposition (SVD) of the input
data matrix, to reduce the effective dimensionality of the modeling problem. This was
introduced to PET for VOI measurements (Clark et al., 1985, Moeller et al., 1987, Moeller
and Strother, 1991) and was then extended to voxel-based [15O]water studies (Lautrup
et al., 1995, Strother et al., 1995a,b, Friston et al., 1996, Worsley et al., 1997).
2.3.1 Single Voxel Analysis: Statistical Parametric Mapping and
Gaussian Random Fields
In this section I will summarize one popular method: Statistical Parametric Mapping
(SPM) with Gaussian Random Field theory testing (Worsley et al., 1993, Friston et al.,
1995, Worsley et al., 1996), a method that works separately with each voxel and then
uses estimated spatial correlations to control the type I error in the multivoxel hypothesis
testing.
With the simplest SPM setup, one assumes that the scans come from two different
conditions, say Baseline and Active (Worsley et al., 1993). A more general framework
was provided in Worsley et al. (1996), where any (one) contrast could be used. Also, in
the newest incarnations the SPM may come from any method that generates a 'Z', 't', 'F'
or 'χ²' statistic at each voxel. We will keep to the simpler two-condition situation, but
the extension to general contrasts is immediate. Let x_ijk(x, y, z) denote the k-th scan from
subject i (i = 1, 2, ..., n) under condition j (j ∈ {A, B}). The normalized subject-specific
contrast images are formed:

    d_i(x, y, z) = [ x̄_iA(x, y, z)/x̄_iA(·,·,·) − x̄_iB(x, y, z)/x̄_iB(·,·,·) ] / √2.    (2.4)
Thus the scans within each subject are averaged for each condition (A, B) and divided
by the subject- and condition-specific constant x̄_ij(·,·,·), which estimates the global blood
flow. (Sometimes the scan-specific normalization by x_ijk(·,·,·) is used before averaging
over k (Strother et al., 1995a, Appendix), to try to remove the scan global blood flow.
Some normalization is necessary, especially for the PET measurements, as these are relative.
The proper normalization methods are subject to some debate (Strother et al., 1995a,
Appendix).) The √2 constant is used to keep the standard deviation of the difference the
same as that of the original scans (Worsley et al., 1996), but it does not appear in the earlier
versions of the method (Worsley et al., 1992). The contrast images are then averaged over
subjects to produce the mean difference image:

    Δ(x, y, z) = (1/√n) Σ_{i=1}^{n} d_i(x, y, z),    (2.5)
where n is the number of subjects. Again, Worsley et al. (1992) use n in place of √n. Some
estimate of the standard deviation of Δ is then used to normalize the mean difference across
voxels. There are many choices; the simplest, proposed in Worsley et al. (1992), is to
calculate the subject-specific estimate for each voxel and then average over voxels:

    S² = (1/V) Σ_{x,y,z} s²(x, y, z),    (2.6)

where V is the number of voxels and:

    s²(x, y, z) = (1/(n−1)) Σ_{i=1}^{n} [ d_i(x, y, z) − Δ(x, y, z)/√n ]².    (2.7)
This assumes that the variance across voxels is the same. The other choices are not to
pool across voxels, to pool over conditions using ANOVA or ANCOVA estimators (Friston
et al., 1991, Worsley et al., 1996), or to combine the latter with the pooled estimator 2.6
(Worsley et al., 1996). The inherent dilemma is the low degrees of freedom available if no
pooling across voxels is done (due to the small number of subjects) versus a strong assumption
of homoscedasticity across voxels if such pooling is done.

The statistical t-map is formed by dividing the mean difference image 2.5 by an estimate
of noise, which can either be an image itself (i.e., voxel-specific) or a scalar, as described
above. Using the estimate 2.6, for example, the t-map is:

    T(x, y, z) = Δ(x, y, z) / S.    (2.8)
This gives a t-statistic for every voxel. The problem is now to determine which of the
many thousand t-statistics are significant, designating neuronal regions with significant
change between the conditions. Typically, in PET images there would be about 30,000-
40,000 intracranial voxels which are used to form the t-map, and therefore that many
t-statistics. By using an unadjusted significance level, α, we will seriously overestimate
the overall significance, or inflate the type I error, because of the multiple testing problem.
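The severity of this multiple-testing problem is easy to demonstrate on simulated null data (a minimal sketch following the pooled-variance t-map construction above; the subject and voxel counts are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, V = 10, 30000                     # subjects, intracranial voxels

# Per-subject contrast images d_i under the null: no activation anywhere.
d = rng.standard_normal((n, V))

# Mean difference image scaled by sqrt(n); variance pooled across voxels.
delta = d.sum(axis=0) / np.sqrt(n)
S = np.sqrt(d.var(axis=0, ddof=1).mean())
t_map = delta / S

# Testing every voxel at an unadjusted alpha = 0.05 "detects" activation
# in roughly alpha * V voxels even though none exists.
false_hits = (t_map > 1.645).sum()   # 1.645 = N(0,1) upper 5% point
assert false_hits > 1000             # ~1500 expected under the null
```

About five percent of a purely null t-map exceeds the unadjusted threshold, which is why a joint threshold over all voxels is needed.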
Since there are large spatial correlations existing in the images, and therefore in the t-map,
the simplest Bonferroni adjustment method, which is most effective when the tests are
independent, would be very conservative, decreasing the power to very small levels. One
solution, proposed in Worsley et al. (1992), generalizing and making more rigorous the
ideas in Friston et al. (1991), is based on the theory of maxima of Gaussian Random
Fields (Adler and Hasofer, 1976, Hasofer and Adler, 1978). A three-dimensional Gaussian
Random Field with mean μ(x, y, z) and covariance C((x_1, y_1, z_1), (x_2, y_2, z_2)) is a continuous
stochastic process, G(x, y, z), such that, for any finite n and for any selection of points
(x_1, y_1, z_1), ..., (x_n, y_n, z_n), the joint distribution of {G(x_1, y_1, z_1), ..., G(x_n, y_n, z_n)} is
n-variate Gaussian with mean {μ(x_1, y_1, z_1), ..., μ(x_n, y_n, z_n)} and the covariance matrix
obtained by evaluating the covariance function C at the n² pairs of points.
The main idea is to derive a single threshold, t_α, such that under the null hypothesis:

    P(T_max > t_α) = α,    (2.9)

where T_max is the maximum t value. Using the Gaussian Random Field theory, the convenient
null hypothesis is that if there is no difference between conditions, the t-map will constitute
zero-mean Gaussian noise with the (scalar multiple of the) identity covariance function,
C. Remarkably, using the notion of the Euler characteristic number, one can approximately
evaluate probability 2.9 for any Gaussian Random Field, G (Adler and Hasofer, 1976,
Worsley et al., 1993, Eq. 1):

    P(T_max > t) ≈ V |Λ|^(1/2) (2π)^(−2) (t² − 1) exp(−t²/2).    (2.10)

Here, V is the volume of the image, in some units, and Λ is a 3 × 3 variance matrix of the
partial derivatives of the field in each dimension x, y, z, in the same units as V:

    Λ = Var( ∂G/∂x, ∂G/∂y, ∂G/∂z ).    (2.11)
The matrix of partial derivatives (2.11) constitutes a way to specify the covariance
structure for the continuous and homogeneous random field. The diagonal entries tell us
how the field varies in the three axial directions, and the off-diagonal entries give the variability
in the three diagonal directions.
If Λ were known, Eq. 2.10 could be (numerically) inverted to find the desired
threshold, t_α. The covariance matrix Λ may be approximated using numerical differences,
which is, however, a poor and unstable estimate. Another solution is proposed in Worsley
et al. (1992), which uses the known properties of a smoother which is applied to the scans.
Using the assumption that, under the null hypothesis, the t-map is a white Gaussian noise
field, one can derive an expression for the covariance matrix of the white noise convolved
with a kernel smoother. This is then used in Eq. 2.10 to calculate t_α, which is then used to
threshold the t-map and select the "significant" voxels.
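Given Λ, this numerical inversion is straightforward; the sketch below solves for the threshold by bisection (assuming the three-dimensional expected-Euler-characteristic form P(T_max > t) ≈ V|Λ|^(1/2)(2π)^(−2)(t²−1)exp(−t²/2) quoted above; the volume and |Λ| values are illustrative):

```python
import numpy as np

def p_max_exceeds(t, volume, lam_det):
    """Approximate P(T_max > t) for a 3D Gaussian field (expected Euler char.)."""
    return volume * np.sqrt(lam_det) * (2 * np.pi) ** -2 * (t**2 - 1) * np.exp(-t**2 / 2)

def grf_threshold(alpha, volume, lam_det, lo=1.0, hi=10.0):
    """Bisection solve of p_max_exceeds(t) = alpha for the threshold t."""
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if p_max_exceeds(mid, volume, lam_det) > alpha:
            lo = mid        # probability still too high: move threshold up
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Illustrative numbers: ~30,000-voxel volume, Lambda with unit determinant.
t_alpha = grf_threshold(alpha=0.05, volume=30000.0, lam_det=1.0)
assert abs(p_max_exceeds(t_alpha, 30000.0, 1.0) - 0.05) < 1e-6
```

For these illustrative values the resulting threshold (around 5) sits well below the Bonferroni one would be for a comparable voxel count, because the smoothness encoded in Λ reduces the effective number of independent tests.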
SPM with Gaussian Random Field theory for determining the threshold has been a
great step forward in the analysis of neuroimages. It has also been successfully applied
in other fields, such as astrophysics. The theory is as remarkable and practical as it is
beautiful. There are, however, a number of assumptions going into the SPM method, some
of which have been addressed in later papers, which could cause problems in interpreting
the results. The normality assumption may only be viable if there is a large number
of degrees of freedom going into estimating the t-map. Typically, this is only the case
when the voxel-wise variance estimators are pooled across voxels. This, however, leads to
the possibly over-simplistic assumption of homoscedasticity across the brain volume. The
other possible problem with the SPM method is the specification of the null distribution:
that of the white noise Gaussian field convolved with the smoothing filter applied to the
t-map. This seems to disregard any possibility of spatial smoothness present in the t-map
before pre-smoothing is applied, when we know that the hemodynamic response, which is
actually measured by PET and fMRI, has an extent of 3-5mm (Malonek and Grinvald,
1996), and that the reconstruction techniques themselves impose spatial smoothness. When
the null hypothesis is rejected, we therefore still do not know, even upholding the normality
assumption, whether the breach came because of the non-zero mean, which is the desired
result, or because of the misspecified covariance matrix, Λ. Most likely it is both, which, at
best, leads to an inconclusive answer, and at worst may point to totally wrong regions.
With the lack of realistic simulation studies that would examine the robustness of the SPM
method under Λ misspecification, it seems probable to me that errors in estimating Λ may
easily lead to the rejection of the null without any support in the mean.
2.3.2 Scaled Subprofile Model: State-Driven Variance Decomposition
with Global and Subject Effect Removal
The Scaled Subprofile Model (SSM) of Moeller et al. (1987) and Moeller and Strother (1991)
has been developed to identify regional variation produced by a treatment or a stimulus, allowing
for heterogeneous covariance patterns and subject effects. It has been specially formulated
to deal with high-dimensional PET datasets obtained using a small number of subjects,
and to work with a minimal set of assumptions regarding the subject, treatment and residual
covariance patterns. It strives to partition the variability (similarly to an ANOVA model) to
dissociate the subject and treatment covariance patterns.
The two main equations of SSM are:

    x_i = s_i (μ + a_i),    (2.12)

    a_i = Σ_{k=1}^{K} y_ik Φ_k.    (2.13)

Initially (Moeller et al., 1987, Moeller and Strother, 1991), the index i was meant to
denote subjects in studies of one scan per subject, and the method was implemented for
pre-designated regions in the brain. SSM was later (e.g., Strother et al., 1995a) successfully
applied in the situation where index i denotes a combination of subject, treatment
(stimulus) and repetition effects, and is therefore unique for each scan, and the voxels are
used in place of regions. In the above equations, each scan is decomposed into global and
residual images (μ and a_i), which are called the Group Mean Profile, GMP, and Subject
Residual Profiles, SRP, respectively, in Moeller and Strother (1991). Each scan has a scaling
factor, s_i, associated with it. The SRPs are further decomposed into a set of orthogonal
Group Invariant Subprofiles (GISs), here denoted by Φ_k. This is related to a previously
mentioned approach, where each voxel is separately modeled with ANOVA or ANCOVA,
and the residuals are then grouped back together to form residual images, which are then
decomposed using SVD or similar techniques. Strother et al. (1995a) has listed the similarities
and differences between these approaches.
The main part of SSM was the development of a procedure for estimating the various parts
of the model. Moeller and Strother (1991) provide a detailed description, which we will
briefly summarize here. The Scaled Subprofile Model can be approximately expressed as a
voxel-wise, two-way ANOVA on a log scale:

    ln x_ij ≈ ln s_i + ln μ_j + a_ij/μ_j,    (2.14)

where the index j refers to voxels, and the division in the residual term is made voxel-by-voxel.
The small-signal approximation ln(1 + x) ≈ x for x ≪ 1 is used to derive the ANOVA
correspondence, where x = a_ij/μ_j for each voxel j. The estimation procedure assumes
model (2.14). It starts by removing the two main effects from the log-transformed scans by
double-centering the log-scan matrix, l_ij = ln x_ij. The resulting matrix is then decomposed
using Singular Value Decomposition; the left- and right-hand eigenvectors may be shown
to estimate the scan weights y_ik and the subprofiles Φ_k, respectively. One can then use
K-variate regression of the N average log-scans, ln x̄_i, onto ŷ_k to estimate the K offsets
in (2.15) and ln s_i, which come up as the regression coefficients and residuals, respectively.
In fact, the regression only identifies the part of s which lies in an orthogonal complement
of the subspace spanned by y_k; without an assumption of orthogonality between the two,
the SSM model is not identifiable. From the regression results one can estimate the
remaining terms.
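These estimation steps can be sketched end-to-end (a minimal sketch on simulated scans obeying the SSM model; the sizes, noise scale and K are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
N, p, K = 24, 500, 2              # scans, voxels, retained subprofiles

# Simulate x_ij = s_i * mu_j * (1 + a_ij/mu_j) with a rank-K residual term.
s = np.exp(rng.normal(0.0, 0.2, size=N))          # global scan factors
mu = np.exp(rng.normal(3.0, 0.3, size=p))         # group mean profile
y_true = rng.standard_normal((N, K))              # scan weights
Phi_true = rng.standard_normal((K, p))            # subprofiles
resid = 0.05 * (y_true @ Phi_true)                # a_ij / mu_j, small signal
x = s[:, None] * mu[None, :] * (1.0 + resid)

# Log-transform and double-center: removes the two ANOVA main effects
# (scan effect ln s_i and voxel effect ln mu_j) of model (2.14).
l = np.log(x)
l_dc = l - l.mean(axis=1, keepdims=True) - l.mean(axis=0, keepdims=True) + l.mean()

# SVD: left vectors carry the scan weights y_ik, right vectors the
# orthogonal Group Invariant Subprofiles Phi_k.
U, S, Vt = np.linalg.svd(l_dc, full_matrices=False)
y = U[:, :K] * S[:K]              # estimated scan-specific weights
Phi = Vt[:K]                      # estimated subprofiles

frac = (S[:K] ** 2).sum() / (S ** 2).sum()
assert frac > 0.95                # first K components recover the residual term
```

The double-centering step is what makes the subsequent SVD act only on the residual (subprofile) part of the log model, up to the small-signal approximation error.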
SSM provides an intuitively appealing, log-linear model for the PET scans collected from
various subjects. The parameters have the following physical interpretations: s_i are scan-specific
multiplicative factors which are related to global scan effects, both physiological and
methodological, e.g., the subject dose. The global radioactivity levels are very hard to control
at the experimental stage: they are a result of a complicated interrelationship between
the dose of the radiolabeled agent, the weight and other physical characteristics of the patient,
and unknown physiological effects that affect the distribution of the agent within the brain.
The mean pattern μ represents a hypothetical brain state that is common to all scans.
This may include a coarse description of regional differences that is invariable across scans
and subjects. The scan-specific variations, a_i, represent patterns superimposed onto this
mean brain-state pattern. The resulting image of the sum of the mean and scan-specific
patterns is normalized via the global scaling factors, s_i.
The main result of SSM consists of a set of residual patterns, Φ_k, together with their
weights, y_ik, which are scan-specific. One may show (Eq. 4, Moeller and Strother, 1991)
that the total variance of the log-transformed scans may be approximately decomposed into a
global term, an error term and the residual profile terms, which are independent. One may
therefore represent the i-th scan's contribution to the total variance by the sum of squared
weights, Σ_k y²_ik, for this scan. Also, if the overall index i is broken into i_s for subjects, i_c
for conditions and i_r for repetitions, the sum of squared weights for each k decomposes into
three orthogonal pieces, since ȳ_···k = 0 for each k. The three pieces represent the between-condition,
repeat-trial and intersubject variance contributions for each Subject Residual Profile, Φ_k,
which are uncorrelated. This allows us to study the particular contribution of each SRP, and thus
determine whether it is mostly associated with the subject variances or the study design.
2.3.3 Partial Least Squares
McIntosh et al. (1996) propose another interesting multivariate method for the analysis
of neuroimages, called Partial Least Squares (PLS) (their method is not related to the
well-known regression model under the same name, described in, e.g., Wold et al., 1984).
The authors motivate PLS as a unique approach that results in the spatial patterns which
optimally explain the covariance between a set of scans and the "exogenous blocks". The
latter can be formed by contrasts of interest, or may include external measures, such as
behavioural or performance ones. PLS is related to both simple t-maps and, conceptually, to
the Scaled Subprofile Model described in the previous section. It is also related to LDA
and hence to our proposal.
Let, as before, X denote an N × p matrix with N scans, each with p voxels. Let Y
be the N × K "exogenous block" matrix, with K blocks: contrasts or external measures.
For instance, to follow an example in the paper, we may have multisubject PET data
obtained under 3 conditions: 1 baseline and 2 active. Y may contain two columns: the
first comparing the baseline to the average of the other two, with [2; −1; −1] for a scan in
condition 1, 2, 3, respectively; and the second comparing the two active conditions with a
[0; 1; −1] contrast.
The basis for the PLS method is an SVD of the cross-correlation matrix Z = XᵀY, the
product of the column-centered and column-normalized X and Y¹. That is:

    Z = A D Bᵀ,

with A a p × K matrix of orthogonal singular images, B a matrix of orthogonal
profiles, and D the diagonal matrix of singular values. The pair a_1, b_1 of first column
vectors of A, B gives the best linear approximation to explaining the cross-correlation
matrix Z. The first singular value d_1 gives the strength of this association; when squared
and divided by the sum of all squared singular values, it is a proportion of the total variance
explained. One may examine the image a_1, together with the first profile, to deduce
the features of the experiment, and the associated spatial map, which contribute most to
the overall variability. McIntosh et al. (1996) also introduce a third measure: subject scores,
obtained by projecting individual scans onto the singular images. Thus for each singular
image one obtains N subject scores (which could more aptly be called scan or image scores)
which can be plotted against the conditions or external measures to gain further insight into
the feature represented by a particular singular image.
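The core PLS computation can be sketched directly, reusing the three-condition contrast coding above (a minimal sketch; the scan matrix is simulated noise and the sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
N, p = 30, 200                         # scans, voxels (10 scans per condition)
cond = np.repeat([1, 2, 3], 10)

# Exogenous block Y: the two contrasts from the three-condition example.
Y = np.column_stack([
    np.where(cond == 1, 2, -1),                      # baseline vs. actives
    np.select([cond == 2, cond == 3], [1, -1], 0),   # active 1 vs. active 2
]).astype(float)

X = rng.standard_normal((N, p))        # scan matrix (illustrative noise)

def standardize(M):
    """Column-center and column-normalize a matrix."""
    M = M - M.mean(axis=0)
    return M / np.linalg.norm(M, axis=0)

# SVD of the p x K cross-correlation matrix Z = X'Y.
Z = standardize(X).T @ standardize(Y)
A, d, Bt = np.linalg.svd(Z, full_matrices=False)

scores = X @ A[:, 0]                   # N subject scores, first singular image
explained = d[0] ** 2 / (d ** 2).sum() # share of cross-block variance
assert A.shape == (p, Y.shape[1]) and scores.shape == (N,)
```

Plotting `scores` against `cond` is the kind of post hoc inspection the authors suggest for interpreting each singular image.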
The PLS approach includes a calibration method to validate the model and determine the
number of significant singular image/profile pairs. McIntosh et al. (1996) show that the
covariance between the j-th subject score, X a_j, and the similar j-th left-hand score,
Y b_j, is d_j, the j-th singular value. Thus PLS is finding the singular pairs which
successively maximize the covariances between the left- and right-hand scores. McIntosh
et al. (1996) propose computing the regression of the subject scores on Y, the contrast
or external measure matrix. As a measure of validity they proposed R², the proportion of
variance explained by the regression. To determine a significant cut-off point, PLS uses
a permutation test: the rows of X are permuted, and for each permutation an SVD is applied
and R² computed. The R² computed on the given, unpermuted data, is compared to the
¹The paper does not make it clear whether the cross-covariance or the cross-correlation matrix between X and Y is used. The example in the appendix clearly works with cross-correlations, but the text refers to the cross-covariance. Also, the paper uses notation for X and Y which is the reverse of ours.
distribution determined from the permutation test and, for a given significance value, its
significance asserted or rejected. This way the whole PLS model and the number of significant
singular pairs may be estimated.
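The permutation calibration can be sketched as follows (a minimal sketch; for brevity it uses the first singular value, rather than the paper's regression R², as the statistic being calibrated, and the data and permutation count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
N, p, K = 30, 200, 2
X = rng.standard_normal((N, p))        # scans (pure noise here)
Y = rng.standard_normal((N, K))        # exogenous block

def first_singular_value(X, Y):
    """Strength of the first singular image/profile pair of X'Y."""
    Xc = (X - X.mean(0)) / np.linalg.norm(X - X.mean(0), axis=0)
    Yc = (Y - Y.mean(0)) / np.linalg.norm(Y - Y.mean(0), axis=0)
    return np.linalg.svd(Xc.T @ Yc, compute_uv=False)[0]

observed = first_singular_value(X, Y)

# Null distribution: permute the rows of X, breaking the scan-to-design link.
n_perm = 200
null = np.array([first_singular_value(X[rng.permutation(N)], Y)
                 for _ in range(n_perm)])
p_value = (null >= observed).mean()
assert 0.0 <= p_value <= 1.0   # with pure-noise X, typically non-significant
```

Each successive singular pair can be calibrated the same way, stopping when the observed statistic no longer exceeds its permutation distribution.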
PLS constitutes a complete descriptive technique for neuroimage analysis, in much the
same way as the paradigm proposed for the PET data in this thesis (PLS does not include
any extensions to deal with temporal sequences such as the ones found in fMRI data).
It is multivariate in nature but does not attempt to model the spatial properties, such
as spatial smoothness, in the data. These could be introduced in a way similar to that
proposed in this thesis. The permutation test to determine the number of significant
singular pairs represents a big step forward from other descriptive techniques, but it is not
without problems. The scan data normalization, for example normalizing voxels to unit
variance, introduces potentially big variability into the procedure, but it is not tested in
the permutation stage. In general, we can think of LDA, and hence of our procedure, as
an extension of PLS, where the per-voxel normalization is replaced by a full-scale covariance
normalization: one that involves rotation as well as scaling. Since the full-rank covariance
matrix cannot be estimated, we use penalization, as shown in the next chapter.
2.4 Datasets Studied
The methodology described in this thesis was applied to two datasets. Here, we give their
descriptions, together with the study designs, and explain the pre-processing steps applied.
2.4.1 Finger Opposition Task
The finger opposition dataset, which we will call FOPP, is the result of the [15O]water PET
study on 45 volunteers who were scanned between April of 94 and December of 94 in the
PET research center of the Veterans Affairs Medical Center in Minneapolis, MN. Due to the
limited axial size of the VA PET camera (10.8cm), it was not possible to cover the whole
brain. Consequently, scans from 18 subjects were discarded, as they did not adequately
cover two particularly important areas: the motor area in the cortex (top of the head) and
the cerebellum (at the bottom of the head).

Additionally, scans from 7 subjects had unacceptable between-scan movement. Unacceptable
head movement was defined as that which causes a misalignment of more than
one voxel (3.125 × 3.125 × 3.325mm³) between any [15O]water scan and the attenuation
scan. The exact amount of movement was determined based on the results from the 6-parameter
rigid body transformation (Woods et al., 1992). The idea of tracking movement
based on the alignment transformation was described in Strother et al. (1994).

The final dataset used in this thesis consisted of the scans from 20 subjects: 7 males
(ages: 42, 39, 30, 25, 41, 33, 37) and 13 females (ages: 45, 30, 25, 33, 53, 53, 56, 33, 41, 47,
34, 35, 27).
Each subject was scanned eight or ten times. Odd scans were taken under the baseline
condition, with each study starting from scan 1, while the even scans were obtained while
the subject performed a simple motor task. In both states, the subjects had their eyes
covered with a patch and were lying relaxed. During the baseline state the ears were
plugged with insert earphones, while during the active state an auditory pacing signal
was delivered through the earphones. Each subject received one practice lesson before the
study. The motor task consisted of sequentially touching, using the left-hand thumb, each
of the remaining four fingers successively forth and back. The task started with the i.v.
injection of the radioactive [15O]water bolus, and a 90s image acquisition started when
the radioactive bolus reached the brain (typically after 10-20s), as assessed by the total
number of counts detected by the PET camera. The scans were acquired in the 3D mode
and reconstructed using 3D filtered backprojection. The data was corrected for randoms,
dead time and attenuation, but not for scatter.
Scans for each subject were separately aligned to the first scan using the intramodality
image-ratio technique described in Woods et al. (1992). This process uses a linear transformation
to correct for translation and rotation of the head between scans. The eight or ten
subject scans were then averaged, and the averages used to calculate the twelve subject-specific
parameters for the inter-subject alignment algorithm (Woods et al., 1993). This
algorithm transforms the subject scans to the common anatomical space of the brain in
Talairach coordinates by applying rotations and translations, plus non-rigid body transformations
such as shears, to the subject average volumes, comparing them with a simulated
reference PET volume in Talairach coordinate space. The intrasubject-aligned volumes
were smoothed with 3 × 3 × 3 and 5 × 5 × 5 boxcar smoothers, with simple boundary
correction, and these, together with the unsmoothed volumes, were transformed into the
common Talairach space using the subject-specific twelve parameters derived before. The
3 × 3 × 3 smoothed volumes were used to derive the intracranial mask volume, consisting
of 1's for voxels inside the brain and 0's outside. The mask was derived by thresholding
each volume at the 45th percentile.
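The masking step can be sketched as follows (a minimal sketch using SciPy's uniform filter as the 3 × 3 × 3 boxcar smoother; the volume is simulated and the 45th-percentile cutoff follows the text):

```python
import numpy as np
from scipy.ndimage import uniform_filter

rng = np.random.default_rng(5)
vol = rng.random((32, 32, 16))           # a toy reconstructed PET volume

# 3x3x3 boxcar smoother (uniform_filter handles the boundary correction).
smoothed = uniform_filter(vol, size=3, mode="nearest")

# Intracranial mask: 1 inside the brain, 0 outside, obtained by thresholding
# the smoothed volume at its 45th percentile.
mask = (smoothed > np.percentile(smoothed, 45)).astype(np.uint8)
assert mask.shape == vol.shape
```

Smoothing before thresholding suppresses isolated noisy voxels, so the resulting mask is spatially contiguous rather than speckled.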
2.4.2 Static Force fMRI data
Seventeen volunteer subjects had been scanned with a static force paradigm. Each run begins
with the baseline condition and alternates between active and baseline, as before. The
active condition required the subject to apply a constant force to a small force transducer
held between his/her thumb and index fingers. The actual force applied was displayed on a
screen together with the "tolerance bars" according to the expected force level. There were
5 force levels: 200g, 400g, 600g, 800g, and 1000g, and the order of these was randomized for
each subject and run. Each subject performed two runs. In each run there are 11 instances
of alternating baseline and active conditions with the 5 force levels. Each instance ran for 44
seconds. During the baseline conditions the subjects were resting and viewing control lines
on the screen.
The data was collected using a 1.5T GE Signa scanner with a whole-brain echo-planar
sequence (TR=4s, TE=70ms, tau offset=Ems). Each image volume consisted of thirty 5mm
oblique axial slices with 64 × 64 voxels (3.125 × 3.125mm²) per slice. The postprocessing
of volumes includes:
1. visual inspection and exclusion of images with obvious motion, artifact, poor positioning, and where the performance and neurophysiological measures indicate a failure to perform the task

2. semi-automated generation of brain masks for anatomical MRI and fMRI volumes

3. calculation of the 6-parameter within-subject rigid-body alignment matrices of masked fMRI scans to the first scan using AIR (Woods et al., 1998); discard runs showing more than sub-voxel movement based on the maximum voxel movement in each volume after application of the 6-parameter matrices to the brain masks (Strother et al., 1994)

4. aligning the within-subject fMRI scans and calculation of the subject-average aligned fMRI scan

5. calculation of rigid-body alignment parameters for the average fMRI to the high-resolution anatomical MRI scan using AIR

6. visual inspection of the alignment between the average fMRI and MRI with and without the transformation and choosing the best

7. calculation of the between-subject 12-parameter affine alignment parameters of the high-resolution anatomical MRI to a high-resolution MRI template in Talairach space

8. formation and application of a single transformation matrix taking each masked fMRI scan to the Talairach space

9. detrending the time series using 4 cosine functions (voxel-by-voxel)
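Step 9, voxel-by-voxel detrending with a small set of cosine regressors, can be sketched as below. This is a hypothetical minimal version; the exact basis convention used in the pipeline (e.g., whether a constant term is counted among the 4 functions) may differ:

```python
import numpy as np

def cosine_detrend(ts, n_basis=4):
    """Remove slow drifts from a (time x voxels) array by projecting out
    a constant plus the first n_basis discrete-cosine regressors."""
    T = ts.shape[0]
    t = np.arange(T)
    # DCT-II-style low-frequency regressors; the constant captures the mean.
    X = np.column_stack(
        [np.ones(T)] +
        [np.cos(np.pi * k * (t + 0.5) / T) for k in range(1, n_basis + 1)])
    beta, *_ = np.linalg.lstsq(X, ts, rcond=None)
    return ts - X @ beta
```

Each voxel's time series is regressed on the same design matrix, so the whole volume is detrended with a single least-squares solve.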
The data from three volunteers were eliminated during the first step of visual inspection, and six more were eliminated during further processing. In two of the remaining eight subjects the first run had poor neurophysiological performance measures with one missing force level. As a result the Static Force dataset contains only a single (second) run from 8 subjects. Furthermore, the first three scans and some transition scans (those occurring right before or after a condition change) were dropped because of known artifacts and hemodynamic transition effects. The eight remaining subjects are 3 males and 5 females with an average age of 31 ± 6 years.
Chapter 3
Penalized Linear Discriminant
Analysis with Basis Expansion
In this chapter we present the general framework for the analysis of neuroimages. We present our motivation, algebraic derivation and important details of the computational aspect which we developed to deal with the extremely ill-posed nature of the neuroimaging data.

The basis for our analysis has been Linear Discriminant Analysis (LDA). LDA and its almost-equivalent sister, Canonical Variate Analysis, have been used in neuroimaging before (e.g., Azari et al., 1993, Rottenberg et al., 1996, Ardekani et al., 1998) with the experimental states defining the classes. These studies deal with the ill-posed nature of the problem by either defining a small number of Volumes of Interest (VOI), anatomically homogeneous regions of the brain a priori considered important to the stimulus studied, or by defining the SVD-derived basis. The resulting Canonical Variates, one less than the number of classes or states, are then interpreted as the neural activation patterns. With two classes the resulting single Canonical Variate can be viewed as an alternative to the methods that rely on images formed by subtracting the average of scans in each class, perhaps normalized and preprocessed by ANCOVA (Friston et al., 1991, Worsley et al., 1992, 1996, Friston et al., 1995). With more than two states, the LDA approach results in several Canonical Variates ordered in their contribution to explaining the between-state variance. Finally, the time-series effects may be studied by defining the classes to correspond to the temporal order of each scan.
Section 2.3 mentions a broad categorization of the methodology developed for neuroimage analysis. There are methods that deal with stimulus-induced changes, general variance-decomposition methods that do not take into account the state indicators, and methods that introduce non-linearity in various ways. There are also, existing quite independently of the three categories, two data representations used for analysis: single-voxel and SVD-derived basis.

In our approach, we acknowledge the multivariate, spatially correlated nature of the data and introduce a third representation by expanding the desired canonical activation image in a smooth basis. Then a penalized version of LDA, called PDA and developed in Hastie et al. (1995), is applied with smoothness constraints on the canonical variates. This is seen (Appendix B) as equivalent to projecting the input scans on each basis separately, carrying out the PDA analysis in the projected domain and reconstructing the canonical image from the resulting coefficients. A further ridge penalty on the within-class covariance matrix allows for data-dependent choice of the exact amount of smoothness required. Our method may also be seen as bridging the two categories: analysis of stimulus-induced changes and general variance-partitioning methods.
Even with a preliminary dimensionality reduction using SVD, or a smooth basis, the ill-posed nature of the functional neuroimaging problem precludes naive application of multivariate statistical methods such as a standard GLM or LDA, even though the data is clearly multivariate in nature. The problem of overfitting the data, here tied strongly to the "curse of dimensionality" (Bellman, 1961, Hastie and Tibshirani, 1990, pp. 83-84), is especially acute in these data sets, and leads, in many cases, to singularities or saturated models at best. With input dimensionality (number of voxels or basis elements) so high, even simple linear models become very flexible and powerful, with high overfitting potential. There are simply too many degrees of freedom available even with linear models, seemingly as many as the number of voxels, although we cannot obviously use more than the number of scans. The proper assessment of validity and generalizability of the modeling results is then of paramount importance. The classical goodness-of-fit techniques, based mostly on asymptotic results of some global measure of the residuals, are totally inadequate here since the asymptotic assumption of large N, the number of observations, as compared to p, the dimension of the space, is evidently not met. The need for optimal model selection techniques based on measures of model generalizability has been advocated by some (Kippenham et al., 1994, Lautrup et al., 1995, Strother et al., 1995b, 1997, 1998a, Mørch et al., 1997, Mørch, 1998, Hansen et al., 1999) but has been largely ignored in favour of typically asymptotic inferential tests of unknown generality (e.g., Friston, 1998).
Operationally there are two problems that we try to address with this approach: strategic dimensionality reduction of the input space, and proper assessment of model generalizability. The first problem is attacked in two ways: (1) we induce a smooth prior on the space of the resulting canonical image(s) in the form of a non-adaptive, smooth basis in which we expand the image(s); in this thesis, we use tensor products of cubic B-splines and wavelets; (2) we regularize the model further with a simple ridge penalty that is a compromise between the model and the estimated spatial covariance and acts as an additional smoothness constraint by reducing the effective degrees of freedom.

By posing the estimation of activation maps as a classification problem we operate within the probabilistic framework of decision theory where we can address the issue of model generalizability with predictive performance measures. If we impose the need to operate within the predictive framework and require that a method must result in images that are interpretable in a given experimental context, we are led naturally to LDA-like approaches. Our smoothness-constrained, penalized LDA is an extension and specialization of the general Penalized Discriminant Analysis (PDA) model proposed by Hastie et al. (1995) and also investigated by Nielsen et al. (1998) in the neuroimage domain. In section 3.3.4 we derive an efficient algorithm for fitting PDA suitable for this extremely ill-posed data.

We address the important problem of model generalizability and validity of the resulting activation maps using prediction error. In section 3.3.7 we propose two predictive performance measures: misclassification rate (MC rate) and Squared Prediction Error (SPE) based on posterior probability estimates of class membership. To estimate these we use a state-of-the-art specialization of the Bootstrap, the .632+ Bootstrap (Efron and Tibshirani, 1997) (the name comes from the fact that the probability that any observation is included in the bootstrap sample is, in the limit, .632), which we compare with the more traditional cross-validation (CV). These are resampling techniques which give nearly unbiased estimates of prediction error and are largely free from distributional assumptions. We use these measures to: (1) compare results built with different numbers of B-spline basis functions, and results built without basis expansion but with different amounts of voxel presmoothing, (2) optimize the ridge parameter which fine-tunes the amount of smoothness in the result, and (3) compare simple preprocessing with and without mean scan normalization.
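The .632 constant quoted above is easy to verify: the chance that a fixed observation appears at least once in a bootstrap sample of size n is 1 - (1 - 1/n)^n, which tends to 1 - 1/e ≈ .632 as n grows. A one-line check:

```python
import math

def inclusion_probability(n):
    """P(a fixed observation is drawn at least once when sampling n
    times with replacement from n observations)."""
    return 1.0 - (1.0 - 1.0 / n) ** n

# For large n this approaches 1 - 1/e, the constant behind the .632+ bootstrap.
```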
3.1 Classical Linear Discriminant Analysis
The linear discriminant function was introduced by Fisher (1936) for two classes as a sensible approach to discriminate between two sets of observations. Fisher sought to project the data on the line in a way that would maximize the separation between classes as measured by the between-class variance. The maximization problem must be normalized and it is intuitively appealing to carry it out with respect to some measure of data variability. With multivariate data, the pooled within-class covariance matrix is a good candidate. The problem then becomes one of finding a linear projection that maximizes the ratio of between-class to within-class variance.

The LDA method was later extended to multiple classes (Rao, 1948). LDA can also be derived, if all canonical variates are used for classification, as a maximum likelihood classification rule if the data in all classes is assumed to follow a multivariate Gaussian distribution with a common covariance matrix (Mardia et al., 1979, Hastie et al., 1995). With a simple modification to account for prior class probabilities (usually estimated with the proportion of the observations in each class) the LDA method may also be seen as a plug-in Bayes estimator with the same assumptions.
Let us start by establishing some notation used in this section to introduce the classical LDA. Let $x_i^{(k)}$ be the $i$th $p$-variate observation in class $k = 1, \ldots, K$ in a $K$-class classification problem. Let $n_k$ be the number of observations in class $k$, and $N = \sum_k n_k$. Define the between-class and within-class covariance matrices by the usual MANOVA quantities:

$$B = \frac{1}{K-1} \sum_k n_k (\bar{x}^{(k)} - \bar{x})(\bar{x}^{(k)} - \bar{x})^T, \qquad W = \frac{1}{N-K} \sum_k \sum_i (x_i^{(k)} - \bar{x}^{(k)})(x_i^{(k)} - \bar{x}^{(k)})^T.$$

Algebraically, LDA is the following optimization problem: find $a_h$ such that $a_h^T B a_h$ is maximized subject to $a_h^T W a_h = 1$ and subject to $a_h^T W a_j = 0$ for $j = 1, \ldots, h-1$. This can be posed as a generalized eigenvalue problem:

$$B a_h = \lambda_h W a_h,$$

subject to the aforementioned orthogonality constraints on $a_h$ with respect to $W$. The solution is the eigendecomposition of $W^{-1}B$, with the first eigenvector defining the first canonical variate which explains most of the variability between classes. Geometrically, with two classes, LDA seeks to separate the classes in the $p$-dimensional space by the straight line that is orthogonal to the line joining the two centroids of the data that has first been sphered using $W^{-1}$.
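The generalized eigenproblem above can be solved directly as the eigendecomposition of $W^{-1}B$; the sketch below uses unnormalized scatter matrices, which does not affect the eigenvectors:

```python
import numpy as np

def lda_canonical_variates(X, y):
    """Canonical variates from the generalized eigenproblem B a = lambda W a,
    solved via the eigendecomposition of W^{-1} B (a minimal sketch)."""
    classes = np.unique(y)
    p = X.shape[1]
    xbar = X.mean(axis=0)
    B = np.zeros((p, p))
    W = np.zeros((p, p))
    for k in classes:
        Xk = X[y == k]
        mk = Xk.mean(axis=0)
        B += len(Xk) * np.outer(mk - xbar, mk - xbar)   # between-class scatter
        W += (Xk - mk).T @ (Xk - mk)                    # within-class scatter
    evals, evecs = np.linalg.eig(np.linalg.solve(W, B))
    order = np.argsort(-evals.real)
    return evals.real[order], evecs[:, order].real
```

With two classes B has rank one, so only the first eigenvalue is nonzero and the first column is the single canonical variate.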
3.1.1 Discriminant Functions and MANOVA View of LDA
With two classes, $k \in \{1, 2\}$, Anderson (1984) defines a discriminant function $W(x) = x^T \delta$ which, with appropriate plug-in estimators, leads to Eq. 3.3. From there we see that the two-class LDA seeks a univariate random variable, $a^T x$, that maximizes the ratio of the expected squared between-class difference to its variance.

A generalization of that result may be viewed in the MANOVA context. In section 12.5, Mardia et al. (1979) develops a test of dimensionality that leads directly to the LDA's canonical variates. Briefly, with all LDA assumptions, let $r \le \min(p, K-1)$ be the proposed dimension of the hyperplane within which all $K$ class means lie. This test is one of the possibilities to explore should the general (one-way) MANOVA test of all means equal be rejected. Note that this test has no analogue in the univariate case: there one can only go after specific contrasts.

In general $K$ means span a $K-1$ dimensional hyperplane, given that $p$ is at least $K$. One may wish to see whether the actual dimensionality of the problem is smaller, hence the test. The Likelihood Ratio version of this test entails a set of vectors, proportional to the canonical variates, that may be used to test for successively larger $r$: the first vector is used to test $r = 1$, the first and second to test whether $r \le 2$ and so on. These vectors span the successively higher dimensional hyperplanes such that, for a given dimension $r$, each hyperplane is a Maximum Likelihood estimate of the hyperplane that contains the $K$ means under the null hypothesis. Therefore, we expect that the first canonical image will exhibit the features that most distinguish the classes, normalized for the covariance structure. Successive canonical images show the further features of the data that are uncorrelated with the previous ones.
3.1.2 The Geometry of LDA in Two Class, 2D setting
Figure 3.1: Demonstration of 2-class LDA in 2 dimensions. The light points (class 1) and darker points (class 2) show the 200 bivariate Gaussian observations generated from each class. The solid line is the true canonical variate (CV). The circles are class means, and the diamonds are the means projected onto the CV. The points marked with the cross represent the test point and its projections onto the mean-difference (broken) and CV lines.
Figure 3.1 shows a demonstration of the LDA with two classes and in two dimensions. The data has been generated using a 2D Gaussian distribution with means (0.55, 0.45) for the first class and (0.25, 0.65) for the second class. The covariance matrix was chosen to obtain non-circular shapes with an oblique angle. The test point (a cross at (0.38, 0.38)) that, given the shape of the two Gaussians, quite clearly belongs to class 2, is actually closer to the mean of class 1 when using the regular Euclidean distance. Using this distance is equivalent to projecting onto the mean-difference line, shown in broken style in Figure 3.1, and carrying out the Euclidean distance classification in one dimension. This line represents the t-test image with the pooled estimate of standard deviation (equation 3.8). The canonical variate line (solid) is corrected to reflect the non-circular shape of the data, and is a compromise between the first principal component and the mean-difference line. One projects the data and class means onto this line and then uses Euclidean distance to classify.
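The effect in Figure 3.1 can be reproduced numerically. With the class means and test point quoted above, and an oblique covariance of our own choosing (elongated along the direction (1, -2); the thesis figure's exact covariance is not given), Euclidean distance picks class 1 while the Mahalanobis distance implicit in LDA picks class 2:

```python
import numpy as np

def classify_euclidean(x, means):
    return int(np.argmin([np.sum((x - m) ** 2) for m in means]))

def classify_mahalanobis(x, means, W):
    Winv = np.linalg.inv(W)
    return int(np.argmin([(x - m) @ Winv @ (x - m) for m in means]))

means = [np.array([0.55, 0.45]), np.array([0.25, 0.65])]  # class 1, class 2
x = np.array([0.38, 0.38])                                # the test point
u = np.array([1.0, -2.0]) / np.sqrt(5)   # high-variance direction (assumed)
v = np.array([2.0, 1.0]) / np.sqrt(5)    # low-variance direction
W = 1.0 * np.outer(u, u) + 0.01 * np.outer(v, v)  # oblique covariance
# Euclidean distance assigns x to class 1; Mahalanobis assigns it to class 2.
```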
3.2 LDA and Random Subject Effects
In all of the analysis in Chapter 4 we do not explicitly consider subject effects. In this section we investigate how dangerous this is, and how much accuracy we lose to gain some computational advantage offered by the scan-space version of PDA. We will only look at the classical LDA with two classes and in the "non-ill-posed" setting, i.e., with p < n.

It turns out that LDA is doing almost "the right thing" if the data are assumed to come from a two-way replicated design with random subject and fixed class effects. This illustrates the clear advantage of LDA over methods based on average difference images, like t-maps. These, and other methods, usually subtract subject effects, which is equivalent to assuming a fixed-effect model. It seems much more plausible to regard subject effects as random rather than fixed. Using random effects also lets us assess the predictive performance of the model, something that is not possible with a fixed-effect structure. With fixed effects, such as models that subtract the subject averages from the data, we cannot extend the results beyond the population studied, and thus cannot use prediction error for validation. Thus we will see that in addition to the fact that LDA is able to correct for a nondiagonal covariance matrix (which corresponds to the non-iid noise structure) it also "automatically" deals with random subject effects.
To show the workings of LDA under a random subject effect model, assume that we have $S$ subjects indexed by $s$, and each subject had her/his observation obtained $r = 1, \ldots, R$ times in each class. The model for the observation $x_{rs}^{(k)}$ (in class $k$) is:

$$x_{rs}^{(k)} = \mu_k + v_s + \epsilon_{rs}, \qquad v_s \sim N(0, \Sigma_S) \text{ and } \epsilon_{rs} \sim N(0, \Sigma_E),$$

where $\Sigma_S$ and $\Sigma_E$ are covariance matrices for both effects. As usual, we assume that the subject terms $v_s$ are independent between subjects and independent of the iid sequence $\epsilon_{rs}$. Also note that both covariance matrices are the same in each class $k$, which is in the spirit of LDA.
The Gaussian view of LDA assumes that we have a multinormal Gaussian distribution in each class with the same covariance matrix, and iid observations in each class. One then classifies by either the maximum likelihood rule (Mardia et al., 1979, p. 301) or by the Bayes rule. Ignoring prior probabilities, both rules assign observation $x$ to the class with maximum likelihood for $x$. (The Fisher LDA is equivalent when one uses all canonical variates and when a pooled within covariance matrix estimate is used for the common class covariance matrix.)

In our case, although the observations from the same subject are no longer independent, the decision rule is similar to the LDA. We have:

$$x_{rs}^{(k)} \sim N(\mu_k,\ \Sigma_S + \Sigma_E).$$

Thus the classification is based, as usual, on the Mahalanobis distance:

$$d_k(x) = (x - \mu_k)^T (\Sigma_S + \Sigma_E)^{-1} (x - \mu_k).$$

As compared to LDA, the only thing that changes is the covariance matrix. To examine how Fisher's LDA is doing in this case, we need to look at whether the pooled within-class covariance matrix, used implicitly in LDA, is a good estimate of $\Sigma_S + \Sigma_E$.

We will start by working with the observations in a single class. Let $X$ be the $N \times p$ matrix of observations.
Let us look at the $(t_1, t_2)$ element of $W$:

$$E\,W(t_1, t_2) = \frac{1}{N-1}\Big(\sum_{r,s} C(r,s;\,r,s) - \frac{1}{N}\sum_{r,s}\sum_{r',s'} C(r,s;\,r',s')\Big),$$

where $C(r,s;\,r',s')$ denotes the covariance between $x_{rs}(t_1)$ and $x_{r's'}(t_2)$. Now,

$$C(r,s;\,r',s') = \begin{cases} \Sigma_S(t_1,t_2) + \Sigma_E(t_1,t_2) & \text{when } (r,s) = (r',s') \\ \Sigma_S(t_1,t_2) & \text{when } s = s',\ r \neq r' \\ 0 & \text{when } s \neq s'. \end{cases} \tag{3.13}$$

There are $N$ pairs for the first case, $SR(R-1) = N(R-1)$ for the second case and $N^2 - N - N(R-1)$ for the third. And thus we conclude that:

$$E\,W(t_1, t_2) = \frac{N-R}{N-1}\,\Sigma_S(t_1, t_2) + \Sigma_E(t_1, t_2). \tag{3.16}$$
We can therefore see that the usual pooled within-covariance estimator that the LDA uses will underestimate the combined covariance $\Sigma_S + \Sigma_E$ that ML requires. The amount of underestimation will depend on the relation between the subject effect and error term, and the number of replications per subject relative to the number of subjects. In our case, $N = 20 \cdot 4 = 80$ and $R = 4$ so the underestimate does not appear to be significant. Asymptotically, if the number of scans for each subject is held fixed, the above estimate is unbiased. The result does not change when $K$ classes are considered and the $K$ estimates of the common covariance matrix in each class are pooled, since each class has the same covariance structure by assumption.
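The expectation in Eq. 3.16 is easy to confirm by Monte Carlo in one dimension ($p = 1$, a single class), with variances $\sigma_S^2$ and $\sigma_E^2$ chosen arbitrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
S, R = 4, 5                    # subjects and replications; N = S * R
N = S * R
sig_s2, sig_e2 = 2.0, 1.0      # subject and error variances (assumed)
w_hat = []
for _ in range(4000):
    v = rng.normal(0.0, np.sqrt(sig_s2), size=S)             # subject effects
    x = np.repeat(v, R) + rng.normal(0.0, np.sqrt(sig_e2), size=N)
    w_hat.append(x.var(ddof=1))                              # within-class estimate
expected = (N - R) / (N - 1) * sig_s2 + sig_e2               # Eq. 3.16
# np.mean(w_hat) is close to `expected`, which is below sig_s2 + sig_e2 = 3.0,
# illustrating the underestimation discussed above.
```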
The amount of bias caused by the factor in front of $\Sigma_S$ will also depend on how different both covariance matrices are. For instance, if they are not different, then LDA results in an estimate which is correct up to a scalar multiple, and the same goes for the canonical variates. Thus if the action of the two matrices is concentrated on the first few eigenvectors which are similar for both $\Sigma$'s, then the bias would be minimal. On the other hand the bias will have a stronger effect on the canonical variates if the leading eigenvalues of the subject covariance are large as compared to their $\Sigma_E$ counterparts, and are associated with very different eigenvectors.
3.2.1 Simulation Study
To assess the effects of the biased covariance matrix estimate that LDA is implicitly using with subject effects, we have performed a simulation study. In the study we compare LDA using three estimators of the covariance matrix:

1. The usual pooled within-class covariance matrix, $W$

2. The sum of the common within-subject and error covariance matrices, $\hat{\Sigma}_S + \hat{\Sigma}_E$

3. The corrected within-class covariance matrix: $W + \frac{R-1}{N-1}\hat{\Sigma}_S$

We used the estimators of $\Sigma_S$ and $\Sigma_E$ proposed in the MANOVA context by Anderson (1985), Anderson et al. (1986). These correct the usual within and between sum-of-squares and products estimators to make them positive semidefinite. The authors concern themselves with one-way random effects MANOVA, but the results hold in our case of mixed-effects 2-way MANOVA. One starts by forming the usual within-subject and error sum-of-squares and products matrices, and the corresponding mean squares, e.g.

$$MSS_S = (S-1)^{-1} SS_S,$$

and similarly for $MSS_E$. The expectations are as follows:

$$E\,MSS_S = \Sigma_E + RK\,\Sigma_S, \qquad E\,MSS_E = \Sigma_E,$$

which suggests an estimator for $\Sigma_S$:

$$\hat{\Sigma}_S = (RK)^{-1}(MSS_S - MSS_E) \tag{3.23}$$
which is not guaranteed to be positive definite. Anderson (1985), Anderson et al. (1986) have obtained modified estimators. These move that part of the variance that would make (3.23) negative to the error variance. This is done by first simultaneously decomposing $MSS_S$ and $MSS_E$:

$$MSS_S = Z D Z^T, \qquad MSS_E = Z Z^T.$$

It may be achieved, for example, by eigendecomposing $MSS_E^{-1/2} MSS_S MSS_E^{-1/2}$ and left-multiplying the resulting eigenvectors by $MSS_E^{1/2}$. The estimator (3.23) now becomes:

$$\hat{\Sigma}_S = (RK)^{-1} Z (D - I_p) Z^T,$$

where, as before, $p$ is the dimension of the covariance matrices. The idea is to remove this part of the variance that would make the estimator negative. This is achieved by excluding those columns of $Z$ with corresponding eigenvalues $\nu < 1$. Let $p^*$ denote the number of eigenvalues $\nu > 1$. Let $D^*$ be as $D$ but with only the first $p^*$ elements. Let $Z^*$ be the corresponding eigenvector matrix with only the first $p^*$ columns of $Z$ included. Then we get the estimator of $\Sigma_S$ which is guaranteed to be positive semidefinite:

$$\hat{\Sigma}_S = (RK)^{-1} Z^* (D^* - I_{p^*}) Z^{*T}. \tag{3.26}$$

The variability removed from $\hat{\Sigma}_S$ is attributed to the error variance, which gives a modified estimator for $\Sigma_E$ built from those parts of $D$ and $Z$ that were left out in Eq. 3.26. If there were none, then $\hat{\Sigma}_E = MSS_E$, as usual. Anderson (1985), Anderson et al. (1986) show that these are maximum likelihood estimators under normality assumptions.
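The truncated estimator of Eq. 3.26 can be sketched as follows. This is our own implementation of the decomposition described above; the companion modified error estimator is omitted:

```python
import numpy as np

def modified_sigma_s(MSS_S, MSS_E, R, K):
    """PSD estimator of the subject covariance: simultaneously decompose
    MSS_S = Z D Z^T, MSS_E = Z Z^T, then keep only the columns of Z whose
    eigenvalues exceed 1 (a sketch of Eq. 3.26)."""
    w, V = np.linalg.eigh(MSS_E)
    E_half = V @ np.diag(np.sqrt(w)) @ V.T            # symmetric square root
    E_half_inv = V @ np.diag(1.0 / np.sqrt(w)) @ V.T
    M = E_half_inv @ MSS_S @ E_half_inv
    d, Zt = np.linalg.eigh((M + M.T) / 2.0)           # symmetrize, then decompose
    Z = E_half @ Zt                                    # so that MSS_E = Z Z^T
    keep = d > 1.0                                     # drop directions with nu < 1
    Zs, ds = Z[:, keep], d[keep]
    return (Zs * (ds - 1.0)) @ Zs.T / (R * K)
```

When all eigenvalues exceed 1 this reduces exactly to Eq. 3.23; when none do, it returns the zero matrix.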
Design of the Simulation Study
We performed a study by simulating the data from a multivariate Gaussian distribution with random subject and error effects. We only considered two classes with the same covariance structure in each. Thus $n$ observations would be generated according to the model in Eq. 3.3. Each observation is a realization of a 1D Gaussian process observed at $p$ points on a line. The mean in class I and II was a discretized sinusoid and cosinusoid, respectively. Regardless of the input dimensionality, $p$, we set the frequency of the sinusoid so that 4 periods would be covered. We chose this mean to reflect our interest in the functional data.
The crucial issue is the specification of covariance matrices for the subject and error terms. Again, to stay within the functional data framework we used covariance matrices of an isometric process: the covariance between two points only depends on the distance between them. Specifically, the covariance structure was:

$$C(x_1, x_2) = \exp(-\alpha \cdot \mathrm{dist}(x_1, x_2)),$$

where "dist" was measured in the number of voxels separating $x_1, x_2$. Increasing $\alpha$ leads to fast-dying correlations and thus to rough processes. We used $\alpha = 5.0$ for the error term and $\alpha = 0.05$ for the subject effect. This conforms to our intuition that subject effects have some large smoothness properties, while error should be mostly noise with only some spatial smoothness remaining.
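The two covariance matrices can be built as below (the exponential decay form is an assumption consistent with the description above): $\alpha = 5$ gives near-zero correlation already at lag 1 (rough error process), while $\alpha = 0.05$ gives slowly decaying correlation (smooth subject process):

```python
import numpy as np

def isometric_cov(p, alpha):
    """Covariance of a stationary process on p voxels: entries depend
    only on the voxel distance, C(x1, x2) = exp(-alpha * |x1 - x2|)."""
    idx = np.arange(p)
    return np.exp(-alpha * np.abs(idx[:, None] - idx[None, :]))

C_err = isometric_cov(30, 5.0)    # error term: fast-dying correlations
C_sub = isometric_cov(30, 0.05)   # subject effect: smooth process
```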
These specify the correlation matrices. One of the parameters of the simulation was VarRatio. The error term always had variance equal to 1.0 and the subject term variance was VarRatio = {1, 10, 100}. Another parameter was the input dimensionality, $p$, with three choices: {5, 10, 30}. We also varied the number of subjects, $S$ = {1, 5, 10, "N"}, where "N" meant the number of observations in each class, $N$ = {50, 200}. Together, $N$ and $S$ determined the number of observations for a subject in each class, $R = N/S$.

For each combination of {VarRatio, $N$, $p$, $S$}, 50 training sets, each of size $2N$ (two classes), and one test set of size 2 x 5,000 (2 x 3,000 for $S = N$ due to computational limitations) were generated. Three LDA models, with the three estimates of the common covariance matrix described at the beginning of this section (Sec. 3.2.1), were estimated on the training sets and applied to the test sets to obtain estimated posterior probabilities for each observation in the test set. We considered two estimates of prediction error: Dev and SPE, defined as follows. If a test observation $x_0$ came from class $C \in \{I, II\}$, and $\{p_I(x_0), p_{II}(x_0)\}$ were the two estimated posterior probabilities, then:

$$\mathrm{Dev}(x_0) = -2 \log p_C(x_0) \qquad \text{and} \qquad \mathrm{SPE}(x_0) = (1 - p_C(x_0))^2.$$

The test-set estimated posterior probabilities were then used to calculate both prediction error estimators and these were averaged over all observations in the test set.
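The two error measures above can be computed directly from each test observation's estimated posterior probability of its true class:

```python
import numpy as np

def dev_and_spe(p_true):
    """Mean Deviance and Squared Prediction Error over a test set, given
    each observation's estimated posterior probability of its true class."""
    p_true = np.asarray(p_true, dtype=float)
    return (-2.0 * np.log(p_true)).mean(), ((1.0 - p_true) ** 2).mean()
```

Both measures are zero when the model assigns probability 1 to the correct class, and Dev diverges as the assigned probability approaches 0, which is why it is more sensitive to outliers.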
We analyzed the results of the simulation study using an ANOVA model for a 4-way factorial design with replications. The four factors were {VarRatio, $N$, $p$, $S$} and there were 50 replications. The tests and p-values are used mostly as guides since the normality assumption is likely questionable, especially for the Deviance response which exhibits many outliers. We analyze four responses, one per combination of error scale and modified estimator, for example

$$\mathrm{SPE\text{-}ES} = \mathrm{SPE}(W) - \mathrm{SPE}(\hat{\Sigma}_E + \hat{\Sigma}_S), \tag{3.32}$$

with Dev-ES, Dev-WS and SPE-WS defined analogously. These four responses analyze the differences between the standard within-class and the two modified estimators of the covariance matrix, on the Deviance and SPE scales. We used a simple additive ANOVA model of all four factors with one interaction between the P and N terms. In many models that we tried, all terms were always significant, but as we mention above we were not overly concerned with p-values. More interesting are the effects' estimates presented in Table 3.1. There are some clearly visible trends. As expected from the result in Eq. 3.16, increasing VarRatio leads to a better performance with the modified estimators regardless of the PE metric used. The improvement is much more pronounced using the "E+S" estimator. Increasing the dimensionality of the data, P, also tends to favour the modified estimators, although there is a surprising twist in the SPE-WS case. Similarly, the modified estimators work better with smaller training set sizes (N50) especially in high dimensions as the interaction term N50:P30 indicates. This suggests that the modified estimators benefit from lower signal-to-noise ratios. It is an apparently surprising finding, since in higher dimensions and with fewer observations one might have expected poor estimates of the variance matrices. Since we need two such estimates for the modified
Term | Dev-ES | Dev-WS | SPE-ES | SPE-WS
Table 3.1: Estimates of effects in the four ANOVA models of the simulation results. The terms are: input dimensionality, P = {5, 30} as compared to P = 10; training set size N = 50 compared to N = 200; number of subjects, Subjects = {5, 10, "N"} compared to 1 subject; and the ratio of variances of the subject effect to the error effect, VarRatio = {10, 100} compared to VarRatio = 1. There is also an interaction term between P and N.
LDA as opposed to one pooled within-class matrix, the modified method could be expected to suffer more under lower signal-to-noise ratios and in higher dimensions. One possible explanation is that the modified estimators use the available signals more efficiently. This hypothesis is partially supported by comparing the "E+S" and "W+S" estimators: "E+S" could be expected to show larger improvement as it does not use the within matrix at all, and this is indeed the case. Another possibility is that the true covariance matrices had a very simple structure dependent only on a single parameter $\alpha$. It may be possible that in that case more dimensions actually help in estimating these matrices.

A good case for modified LDA comes, not surprisingly, from the number-of-subjects effect: Subjects. The baseline was Subjects = 1, when all methods are numerically equivalent. With N subjects, there is understandably little effect of modified LDA, and in fact W+S does worse than the baseline. But for 5 and 10 subjects both methods perform significantly better than the baseline, except that again there is a huge reversal in the SPE-WS column. For the "E+S" estimator the effect for 5 subjects is itself enough to make this method better, regardless of the state of the other parameters, for both measures of Prediction Error. Together with higher VarRatio values, the subject effect makes the modified estimators, especially "E+S", very clear winners over classical LDA.
In general the "E+S" modified estimator of the covariance matrix performs somewhat better than the pooled within covariance matrix in the context of classification, if there are strong subject effects present. The improvement seems to be more pronounced with lower signal-to-noise ratios, here indicated by either smaller training set sizes, higher dimensionality or both. The improvements are not large, however, and they are probably smaller when penalization is included. It may be worthwhile to develop a modified PDA method for PET/fMRI images that would take both subject and error covariance matrices into account.
3.3 Dimension Reduction in LDA using Smoothness
Constraints and Penalization
As described in the preceding section, Linear Discriminant Analysis results in a set of orthogonal vectors in the data space, called canonical or discriminant variates, that best separate the class means with respect to the within-class covariance. The total number of discriminant variates is one less than the number of classes if the problem is of full rank. If all of the variates are used, then LDA can be derived as the Maximum Likelihood estimate of the optimal classification rule under the multivariate normality assumption with a common within-class covariance matrix (Hastie et al., 1995; Ripley, 1996, pp. 96). Linear discriminant analysis is essentially equivalent to canonical correlation analysis, canonical variate analysis and optimal scoring, in that any one is sufficient to derive the others. In the context of images, where the classes are experimental states, LDA can be used to obtain the variates (e.g., Azari et al., 1993; Rottenberg et al., 1996; Friston et al., 1996; Strother et al., 1996; Ardekani et al., 1998) in the voxel (or VOI) space, which can be interpreted as activation images (or profiles). When LDA is used with scans as inputs, the canonical variates are usually called canonical images. With two classes, as in our baseline-activation analysis, the single canonical image may be interpreted as a pattern of the signal driving (or driven by) the activation state. Algebraically, the canonical image is just a mean-difference pattern rescaled by the inverse of the estimated covariance matrix. With more than two classes, the principal canonical image gives the direction (in the scan space) that most separates the classes with respect to the within-class covariance structure. In that sense it is the image that carries the largest amount of information about the classes. Successive canonical images are chosen to extract most of the information about the classes in the orthogonal complement of the subspace spanned by the previous canonical images.
A naive application of LDA to the images will not work. Due to the ill-posed nature of the problem, one will not be able to estimate the inverse of the within-class covariance matrix. Therefore, we need to constrain the problem and bring its dimensionality down. We achieve this in two ways: by constraining the roughness of the canonical images and by penalizing the within-class covariance matrix.
3.3.1 Basis Expansion of Canonical Variates
By constraining the problem through imposing spatial smoothness on the resulting canonical images we not only reduce the effective dimension of the problem, and thus potentially the variance of the result, but we also model some spatial smoothness which is known to exist in the scans. This is done by expressing the unknown canonical image(s) as a linear combination of known basis functions from some smooth space.

If β(v1, v2, v3) is a canonical image, indexed by the location vector (v1, v2, v3), we require that:

    β(v1, v2, v3) = Σ_j γ_j B_j(v1, v2, v3)    (3.34)

Here, B_j(v1, v2, v3) is a basis function in the voxel space. Many choices exist for a basis set. We have experimented with tensor product B-splines (TPS) and wavelet bases.
Having constrained the spatial "roughness" of our canonical images, we need to estimate the coefficients γ_j. LDA works with the scores, which are (discrete) inner products between an observed scan, i_n, and the canonical image. Using Eq. 3.34, we have that:

    ⟨i_n, β⟩ = Σ_j γ_j ⟨i_n, B_j⟩

Thus, to find the coefficients γ in the LDA framework, we need to project the scans i onto the basis set and treat those projections as the input to the LDA. The resulting canonical variate will be a vector of coefficients γ, which lets us reconstruct the canonical image via Eq. 3.34. Appendix B has more details, proving that smoothness-constrained LDA (or PDA) leads to unconstrained LDA (PDA) with the projected data.
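As a concrete sketch of this projection step, the following NumPy toy example uses a one-dimensional Gaussian-bump "basis" and made-up sizes as stand-ins for a real tensor-product B-spline basis over the voxel grid:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: N scans of p voxels each, q smooth basis functions.
N, p, q = 20, 500, 25

X = rng.standard_normal((N, p))          # scan matrix (rows are scans)

# Stand-in for a tensor-product B-spline basis evaluated on the voxel grid:
# column j holds basis function B_j (here a Gaussian bump).
grid = np.linspace(0.0, 1.0, p)
centers = np.linspace(0.0, 1.0, q)
B = np.exp(-0.5 * ((grid[:, None] - centers[None, :]) / 0.05) ** 2)

# Scores: inner products between each scan and each basis function;
# these (N x q) projections are the input to (P)DA.
Z = X @ B

# A canonical variate gamma lives in coefficient space; Eq. 3.34
# reconstructs the canonical image as beta = sum_j gamma_j B_j.
gamma = rng.standard_normal(q)
beta = B @ gamma
```

The same two matrix products are all that is needed in three dimensions; only the construction of B changes.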
3.3.2 Penalized Linear Discriminant Analysis
By imposing smoothness on the canonical images, we already reduce the dimensionality of the problem: there will typically be fewer basis functions (B_j's; see Eq. 3.34) than voxels (we have a few thousand basis functions and about 30 thousand voxels remaining in the i's after masking is applied). The problem is still ill-posed, however, and further regularization is needed. We have worked with the PDA model developed by Hastie et al. (1995). They impose a penalty on the within-class covariance matrix (of the projected image data), which directly affects the canonical variates, here γ, that result. Since we have already expanded the canonical image in a smooth basis, we use a simple ridge penalty, which adds a small value to the diagonal entries of the estimated within covariance matrix. This is equivalent to imposing a penalty on the sum of the squared variate coefficients, ||γ||². The PDA model with a ridge penalty is equivalent to the Canonical Ridge model of Vinod (1976), which is also being explored in functional neuroimaging by Nielsen et al. (1998).
The intuition behind penalization is as follows. While we constrain the image to lie in the smooth space, the smoothness constraint can still be overcome by large coefficients. If one specifies a large positive γ_{j1} in (3.34), for a basis function j1 that is centered at some location, and a large negative coefficient γ_{j2} for a basis function j2 that is centered at a nearby location, then the resulting canonical image will have a steep dip between the two locations, despite our efforts to impose smoothness. How steep a dip will depend on the type and number of basis functions and on the size of the coefficients. Together these three control the effective smoothness of the canonical image. By penalizing the size of the coefficients, we control the amount of smoothness given the choice and number of the basis functions. There is a free tuning parameter, λ, that controls the importance of the penalty term relative to the criterion being minimized by LDA. This is yet another expression of the ubiquitous bias-variance tradeoff (Friedman, 1994).
There are some advantages to basis expansion followed by penalization. If the basis uses smooth functions, like B-splines, projecting the scans onto the basis is similar to smoothing them with a kernel of the shape of the basis function. The whole method, however, is very different from pre-smoothing the scans prior to analyzing them. With the basis-expansion idea we have the power to impose regionally different amounts of smoothness or bandwidth; moreover, this spatially varying bandwidth is utilized to maximize the discriminatory power of PDA. The regional smoothness is again determined by the individual basis functions, their placement, and the size of the coefficients, γ_j. Thus, with γ being small in some regions and large and variable in other regions, we are able to model a wide variety of possible canonical images that exhibit smoothness in some parts and roughness in others. The ability to control the overall size of γ through a ridge penalty gives us the ability to globally fine-tune the smoothness of the canonical image. Smoothing at some level is necessary: the images are reconstructed by a tomographic process which imposes spatial correlation (Pajevic et al., 1998), the registration techniques are not perfect (Kjems et al., 1999), and the actual hemodynamic response of the brain, which PET and fMRI methods use as a proxy for neuronal activation, has a spatial extent on the order of 3-5 mm (Malonek and Grinvald, 1996).
The basis expansion idea is similar in spirit to Ruttimann et al. (1998). There the authors also use a specialized (wavelet) basis to induce a prior and reduce the dimensionality of the fMRI data. Their approach is geared more towards deriving inferential statistical tests on the obtained activation images, a task made easier by the orthogonality of the basis (which leads to the near-orthogonality of the coefficients), while our emphasis is on testing generalizability via prediction error, in a non-parametric way. A second important difference is in the input space to which the basis expansion is applied: we concentrate on the LDA approach, which works with the whole-brain spatial covariance, while Ruttimann et al. (1998) apply the expansion to the voxel-based pooled difference image, basically working with the first-moment statistic.
To summarize our method: the data matrix is obtained by projecting each image, as in Eq. 3.33, using a chosen basis, B_j. Then PDA is applied to the projected data for some value of the tuning parameter λ, which results in the canonical variates matrix Γ. Finally, the canonical images are reconstructed via Eq. 3.34.
3.3.3 Penalized Discriminant Analysis and Statistical Parametric Mapping
Some interesting analogies to the voxel-based methods that rely on (possibly scaled) images derived from the difference of class-averaged images, like SPM, can be established by considering the two-class problem. We have already indicated the basic difference in the geometry of both approaches in Section 3.1, on page 47. Disregarding for a moment the basis projection step, the single canonical image from PDA is:

    β = c (S_W + λI)^{-1} (μ1 - μ0)

where S_W is the pooled within-class covariance matrix, μ1, μ0 are the class-mean images, and c normalizes the image to length one. Therefore, the canonical image is a rotated, rescaled version of the simple class-mean difference image (MDI), where the rotation and rescaling attempt to equalize the variances and to decorrelate voxels, while the penalty term works in the opposite direction.
For very large values of λ, when the penalized within-class covariance matrix becomes essentially (a constant multiple of) the identity, the canonical image is a scaled MDI. This assumes that the variances across voxels are equal and that the voxels are uncorrelated, and thus resembles the voxel-wise t-map with a pooled variance estimate. For moderate λ, we can expect S_W + λI to be diagonally dominant with possibly different diagonal elements. The resulting image will be similar to a diagonally scaled MDI where each voxel in the MDI is compared to its variance (now resembling the voxel-wise t-map with individual voxel variance estimates). For small λ, we get close to fully estimating the within covariance matrix and rotating/scaling the MDI to account for both local variances and covariances. With the internal optimization, we let the Prediction Error decide how much information we have to move away from the unrealistic assumptions of homoscedasticity and independence across voxels.
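The role of λ in this family of images can be illustrated numerically. In the following sketch a small random positive-definite matrix stands in for S_W; the large-λ limit collapses to the normalized mean-difference image, while the small-λ limit applies the full Mahalanobis rotation/scaling:

```python
import numpy as np

rng = np.random.default_rng(2)
p = 6
A = rng.standard_normal((p, p))
S_W = A @ A.T + p * np.eye(p)           # stand-in pooled within-class covariance
mdi = rng.standard_normal(p)            # class-mean difference image, mu1 - mu0

def canonical_image(lam):
    """Two-class penalized canonical image, normalized to length one."""
    v = np.linalg.solve(S_W + lam * np.eye(p), mdi)
    return v / np.linalg.norm(v)

# lam -> infinity: the penalized covariance is ~ a multiple of I and the
# canonical image collapses to a scaled mean-difference image (MDI).
beta_big = canonical_image(1e8)

# lam -> 0: full Mahalanobis rotation/scaling of the MDI.
beta_small = canonical_image(1e-8)
```

Intermediate values of λ trace a path between these two images, which is exactly the tradeoff the internal optimization over prediction error resolves.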
The tensor-product basis projection helps the estimation by somewhat decorrelating the variables, because it models part of the spatial covariance structure. We can expect the covariance matrix of the projected data to be more diagonally dominant than in the unprojected space. This in turn results in better estimates of the covariance matrix. We therefore expect PDA to be a more flexible method than SPM, with one hyperparameter λ that is able to control the tradeoff between the increased flexibility of full covariance normalization and the necessity of the simplifying assumptions of homoscedasticity or a diagonal covariance matrix.
PDA also exhibits some similarity to another well-known method in neuroimaging called Partial Least Squares (PLS), described in McIntosh et al. (1996) and in Section 2.3.3. (There is another well-known algorithm, also called Partial Least Squares, widely used in the Chemometrics community; for a statistical description see, e.g., Wold et al. (1984), and note that the PLS described here is a completely different algorithm.) PLS starts with decomposing Y^T X using the SVD, where X is, as in our case, the scan matrix, and Y is an arbitrary design matrix. Furthermore, PLS provides an interesting paradigm for choosing the number of significant components that result from the SVD. In light of the correspondence between CCA and LDA, which we prove in the next section, PLS may be seen as an unnormalized version of LDA: in LDA one looks at the singular value decomposition of the same cross-product matrix, normalized on the left by the (penalized) covariance and on the right by the class sizes (see Eq. C.4), which is followed by rotating back the left-hand singular vectors. Another way to look at it is via the orthogonality constraint (Eq. 3.41): PLS uses a Euclidean metric while LDA normalizes to unit variance using the within-class covariance matrix estimator. Assuming that the variance can be estimated effectively, the LDA normalization is preferable as it puts all voxels on an equal footing. Of course, we cannot estimate the full covariance matrix of all voxels (or basis functions), and our recourse is to use penalization. One can again establish some analogies for different values of the ridge hyperparameter. Similarly to the previous paragraph, we observe that for a small number of degrees of freedom we may expect our results to be similar to the PLS ones, as the left-hand normalizing matrix will be close to a constant multiple of the identity. The right-hand normalization simply reweights the observations by their class sizes.
3.3.4 PDA via Regression
In this section we will show how to obtain the canonical variates of the PDA model using two steps: a penalized regression followed by an eigendecomposition of the regression results. Our proof is different from that given in Hastie et al. (1995) and relies only on matrix algebra. We also feel it is more appropriate for the neuroimaging community as it hinges more closely on the current approaches widely used in this domain.

In the next section we will show how to "train" the PDA model, given N scans (possibly projected) as input, and how to predict from this model with O(N) computational effort, that is, without dealing with large p x p matrices. This is of vital importance as we use resampling techniques to estimate the prediction error. Without this extension, obtaining the hundreds of model estimates needed by the bootstrap and cross-validation would be computationally prohibitive.

We start with the unpenalized version (LDA) and first show that the closely related method, Canonical Correlation Analysis (CCA), can be expressed as a multiresponse regression followed by an eigendecomposition. We then prove the relationship between CCA and LDA and, finally, introduce penalization and describe an extension to deal with the image data. Our proof differs from the one in Hastie et al. (1995) in that we apply it directly to the CCA formula (Eqs. 3.40 and C.1).
CCA is a symmetric method that, given two sets of variables measured for each observation, x, y, seeks two linear combinations that exhibit maximal correlation. That is, each observation is composed of {x_i, y_i}, where x and y are in general of different dimensions. One attempts to summarize the data by finding two linear combinations, a, b of x and y, respectively, such that:

    corr(a^T x, b^T y)

is maximized.

One extends the method by finding all such possible directions, a_k, b_k, that successively maximize the correlation and are orthogonal to the previously found pairs. Since:

    var(a^T x) = a^T Var(x) a   and   cov(a^T x, b^T y) = a^T Cov(x, y) b    (3.39)

the problem is to find matrices A, B, with the linear combinations in their columns, such that:

    trace(A^T S_xy B)    (3.40)

is maximized subject to:

    A^T S_xx A = I,   B^T S_yy B = I    (3.41)

where S_xx, S_yy and S_xy are the respective covariance and cross-covariance matrices. This is a generalized SVD problem (Mardia et al., 1979, pp. 282), and one can show (Appendix C) that it may be solved, after suitable normalizations, via multiresponse regression of Y onto X, followed by the eigenanalysis of Ŷ^T Ŷ (where Ŷ are the fitted values from the regression step).
In our case, X denotes the data matrix (whose rows are scans or projected scans), and Y is the N x J class-indicator matrix, with 1's denoting the class of each scan. The classes are experimental conditions: here either two classes denoting the Active/Baseline states, or eight classes denoting the temporal order of tasks.
It is known (and we rederive it in Appendix D) that the canonical variates (CVs) associated with x are, up to a scaling factor, the same as the canonical variates that result from LDA (Hastie et al., 1995; Mardia et al., 1979, Ex. 11.3.4). In Appendix D we prove that:

    B_LDA = B D    (3.42)

where D is a diagonal matrix. Thus we show how one obtains the canonical variates of LDA by rescaling B.
As mentioned, the unpenalized version is unsuitable as it requires the inversion of a singular within-class covariance matrix, Σ_W. To remedy that we apply penalization to Σ_W, or, equivalently, to the total covariance matrix S_xx, which results in a penalized regression step:

    Ŷ = X (X^T X + λΩ)^{-1} X^T Y    (3.43)

For any positive definite Ω, this makes X^T X + λΩ invertible. In this paper we use a ridge penalty, Ω = I, and then Eq. 3.43 defines a ridge regression solution.
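A minimal NumPy sketch of the two-step route: ridge regression of the class indicators onto the data (Eq. 3.43 with Ω = I), then an eigendecomposition of the fitted values. The rotation of the regression coefficients shown here recovers the discriminant directions only up to the diagonal rescaling of Eq. 3.42, and all sizes are toy values:

```python
import numpy as np

rng = np.random.default_rng(1)
N, p, K = 30, 10, 3                      # scans, basis coefficients, classes

X = rng.standard_normal((N, p))
labels = rng.integers(0, K, size=N)
Y = np.eye(K)[labels]                    # N x K class-indicator matrix
lam = 1.0

# Step 1: ridge regression of Y onto the centred X (Eq. 3.43, Omega = I).
Xc = X - X.mean(axis=0)
coef = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ Y)
Yhat = Xc @ coef                         # fitted values

# Step 2: eigendecomposition of Yhat^T Yhat; rotating the regression
# coefficients by the leading eigenvectors gives, up to the diagonal
# rescaling of Eq. 3.42, the penalized canonical variates.
evals, A = np.linalg.eigh(Yhat.T @ Yhat)
order = np.argsort(evals)[::-1]
B_cv = coef @ A[:, order[:K - 1]]        # at most K - 1 discriminant directions
```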
3.3.5 Expressing the PDA algorithm in the N-dimensional space
Major effort has been spent on deriving the efficient computational algorithms presented in this thesis. The importance of computational issues has increased greatly in statistics, often due to resampling methods that apply a given algorithm many times, due to the popularity of simulations where computer-generated data of large size are used to test the model, or simply due to the ever increasing amounts of data that statisticians have to deal with. As mentioned in the introduction and throughout this chapter, images constitute an especially challenging form of data due to their sizes. With a 128 x 128 x 48 PET scan, we are dealing with 786,432 voxels, and if each is stored as a single-precision floating point number, each scan occupies over six megabytes of disk space. With dozens of images available in one data set, computationally efficient methods are a must.
The PDA algorithm presented in Section 3.3.4 works in the p-dimensional space, where p is the number of voxels or basis functions. The only places where p-dimensional quantities are needed are in the ridge regression step (Eqn. 3.43) and when the canonical variate (Γ) or image (Eqn. 3.34) is constructed. In particular, in the ridge regression step, it appears that we need to form, and invert, a huge p x p matrix X^T X + λI. In Appendix F we show that the fitted values Ŷ of the ridge regression step may be computed using only N-dimensional quantities, where N is the number of scans. In order to do that, one needs to precompute the outer-product matrix, XX^T, an expensive step which needs O(N²p) operations but is performed only once.
Since we use resampling methods to search for the optimal λ, we need to run the above algorithm, with a given set of training inputs X, Y, many times. Additionally, since we then only need to compute the posterior probabilities (and hence, the predicted class memberships), it pays to precompute G = XX^T once and then apply cross-validation or the bootstrap. Each bootstrap sample may then be obtained by selecting only those rows/columns of G that correspond to the sample observations, thus forming G*, the bootstrap version of G.

This, after the full-data G is computed, lets us operate in the N-dimensional scan space for as long as we do not need to compute the canonical variate B_PDA. Since the posterior probability estimates can be obtained using only Y, the fitted values Ŷ, the right-hand eigenvectors A, and the eigenvalues D (Appendix E), we can perform the optimization of the ridge parameter λ in the lower-dimensional space of the observations. Appendix F shows how the ridge regression may be computed using only the N x N matrix G and the matrix Y of class indicators. Appendix G shows the algebraic trick of computing the centered version of G, G̃ = X̃X̃^T, where X̃ has its column means subtracted, using the uncentered G. In fact, since in the resampling methods we use a subset of the data as a training set, this Appendix shows how to center the partial G* corresponding to the subset chosen by resampling, and how to center the rows of the remaining observations with the column means of the training set used in fitting the PDA, all using the once-computed uncentered G.
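The N-dimensional computation rests on the push-through identity X(X^T X + λI)^{-1} X^T = G(G + λI)^{-1} with G = XX^T. A small numerical check (toy sizes, with N much smaller than p as for projected scans):

```python
import numpy as np

rng = np.random.default_rng(3)
N, p = 15, 200                           # N << p, as with projected scans
X = rng.standard_normal((N, p))
X = X - X.mean(axis=0)                   # centred data
Y = np.eye(2)[rng.integers(0, 2, size=N)]
lam = 0.5

# Primal ridge fit: requires a p x p solve, infeasible for a realistic p.
primal = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

# Dual fit via the N x N outer-product (Gram) matrix G = X X^T, using the
# push-through identity X (X^T X + lam I)^{-1} X^T = G (G + lam I)^{-1}.
G = X @ X.T
dual = G @ np.linalg.solve(G + lam * np.eye(N), Y)
```

The bootstrap version G* is then just a row/column subset of the precomputed G, so each resample costs only an N x N solve.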
3.3.6 Effective Degrees of Freedom
Linear statistical analysis defines the notion of degrees of freedom (d.f.). These specify the dimensionality of the space onto which we project the data and, in the case of iid Gaussian errors, the expected drop in the Residual Sums of Squares if only noise variables are included in the model (e.g., Sec. 3.5, Hastie and Tibshirani, 1990).

Since we have expressed LDA (and PDA) with a regression as a building block, we can carry over the notion of d.f. In the univariate, full-rank (N > p), linear regression case, d.f. = p, the number of variables. Then also:

    p = trace(X (X^T X)^{-1} X^T) = trace(H)    (3.44)
where H is the projection ("hat") matrix (i.e., Ŷ = HY). By analogy, in the ridge regression case we can define the effective degrees of freedom (EDF):

    EDF = trace(S(λ))    (3.45)

(Craven and Wahba, 1979; Hastie and Tibshirani, 1990), with S(λ) = X (X^T X + λI)^{-1} X^T, the penalized "projection" matrix obtained in the regression step of PDA. Please note that (3.45) is not the only possible definition for EDF; see, e.g., Hastie and Tibshirani (1990) for other possibilities.
In our case, with N < p, we can compute the EDF using only the matrix G of outer products. A bit of algebra shows that:

    EDF = trace(S(λ)) = Σ_{j=1}^{N} d_j / (d_j + λ)    (3.46)

where d_j are the eigenvalues of G = XX^T, the same as the non-zero eigenvalues of X^T X. This shows that the EDF combines the tuning parameter with the smoothness inherent in the basis representation, and is more informative than the unscaled λ. The EDF vary from 1 (for λ = ∞, since the penalization is applied to the centered G) to N, for λ near zero.
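The equivalence between trace(S(λ)) and the eigenvalue sum in Eq. 3.46 is easy to verify numerically (toy sizes):

```python
import numpy as np

rng = np.random.default_rng(4)
N, p = 12, 80
X = rng.standard_normal((N, p))
X = X - X.mean(axis=0)
lam = 2.0

# Direct definition: EDF = trace(S(lambda)) using the p x p solve.
S = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
edf_direct = np.trace(S)

# N-dimensional formula: sum of d_j / (d_j + lambda) over the
# eigenvalues d_j of the outer-product matrix G = X X^T.
d = np.linalg.eigvalsh(X @ X.T)
edf_fast = float(np.sum(d / (d + lam)))
```

The second form only ever touches N x N quantities, so the EDF can be reported for every candidate λ at essentially no cost once G is available.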
3.3.7 Prediction Error and its Estimates
It should be realized that the need to impose constraints is more than just a numerical necessity. It affects the generalization ability of our model: i.e., whether the resulting activation map will be interpretable and significant, or whether it will be overshadowed by noise and peculiarities of the data at hand. That is directly related to the predictive performance of the model: if the model has not been constrained enough (and in the "right" way), then it will not be able to classify a new scan that was not used in training the model.

The generalizability of functional neuroimaging models has been addressed before (Kippenhan et al., 1994; Lautrup et al., 1995; Mørch et al., 1997; Mørch, 1998; Strother et al., 1997, 1998a; Hansen et al., 1999). In particular, Mørch (1998) contains a good introduction to generalization error, predictive performance and bias-variance trade-off issues in the context of neuroimaging. Even though prediction is not the main goal in analyzing PET images, one needs to be concerned about the generalizability of the derived activation maps (here, canonical images). Prediction Error (PE) is a way to measure the generalizability of our modeling process. We use PE (or, rather, its estimate) as a function of EDF in three ways: to choose the amount of smoothness, to assess the final usability of the derived patterns, and to compare different data representations.
A Probabilistic Framework.
Linear Discriminant Analysis, while first established by Fisher as a sensible procedure regardless of distributional assumptions (Mardia et al., 1979), can be rederived within a probabilistic framework. If one assumes that the scans come from a multivariate Gaussian distribution, and that these Gaussians have the same covariance structure among classes, then LDA can be derived as a plug-in Bayes classifier for the data with the usual estimates of the class-mean images and covariance matrix.

Specifically, for an image i^(l) from class k, k ∈ {1, ..., K}, let i^(l) ~ N(μ_k, Σ). In general the Bayes classifier would assign a new image i0 to the class k0 which maximizes the posterior probability:

    P(k | i0) = P(i0 | k) P(k) / P(i0)    (3.47)

where P(i0 | k) is the class-specific likelihood, here Gaussian, P(k) is the prior probability of observing an image in class k, and P(i0) is a normalization constant.
Since the covariance matrix is assumed to be the same for all classes, the only class-dependent component of the multivariate Gaussian likelihood is the argument of the exponential function, the Mahalanobis distance between i0 and the mean of class k, μ_k:

    D(i0, μ_k) = (i0 - μ_k)^T Σ^{-1} (i0 - μ_k)    (3.48)

It is an established fact (Hastie et al., 1995; Ripley, 1996, pp. 96) that the Mahalanobis distance is a Euclidean distance when the image and class means are projected onto all canonical variates. Therefore the LDA results can be used to obtain both posterior probabilities and classifications by:

    P(k | i0) = C π_k exp(-½ ||(i0 - μ_k)^T B_LDA||²)    (3.49)

where C normalizes the probabilities to add up to one, B_LDA is the matrix of canonical variates (in columns), which was derived via the route convenient for us in Section 3.3.4, Eq. 3.42, and π_k are the estimated prior probabilities for each class.
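A sketch of this distance-to-posterior step in code. The canonical variate matrix here is a random stand-in (so the particular distances are illustrative only); the point is the Euclidean distance in canonical space and the normalization of the class posteriors:

```python
import numpy as np

rng = np.random.default_rng(5)
p_dim, q, K = 6, 4, 2                    # voxel dim, canonical dims, classes

B_lda = rng.standard_normal((p_dim, q))  # stand-in canonical variate matrix
means = rng.standard_normal((K, p_dim))  # class-mean images
priors = np.array([0.5, 0.5])            # estimated class priors pi_k
i0 = rng.standard_normal(p_dim)          # a new scan to classify

# Mahalanobis distance becomes Euclidean distance after projecting the
# scan and the class means onto the canonical variates.
z0 = i0 @ B_lda
zm = means @ B_lda
d2 = np.sum((z0 - zm) ** 2, axis=1)

# Posterior probabilities, normalized to sum to one.
log_post = np.log(priors) - 0.5 * d2
post = np.exp(log_post - log_post.max())
post /= post.sum()
```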
Prediction Error Measures.
We need to define a suitable measure of Prediction Error in the population. We have used two such measures: the misclassification rate (MC rate) and the Squared Prediction Error (SPE).

The MC rate is the probability of misclassifying a new scan by the model fitted on the training data. It is a rough measure, with a discontinuous 0-1 penalty for misclassification. Thus a model which gives a posterior probability of 49% to the correct class will score the same error on this scan as a model which gives a 1% posterior probability, assuming a 50% threshold is used as in the 2-way classification problem.

Another measure of PE with a more reasonable metric is the Squared Prediction Error, SPE = (1 - p_c)², where p_c is the posterior probability estimated by the model for the correct class. One could also use the deviance (minus twice the likelihood ratio), well known from the theory of Generalized Linear Models (McCullagh and Nelder, 1989), here simply -2 log p_c. We experienced erratic behaviour of this measure, because the posterior probabilities were often close to zero or one: the deviance puts a very large penalty on cases where p_c ≈ 0. This issue has also been addressed by Hintz-Madsen et al. (1998).
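The three measures, applied to some hypothetical posterior probabilities for the correct class, make the difference in behaviour concrete:

```python
import numpy as np

# Posterior probability assigned to the *correct* class of each scan
# (hypothetical values).
p_correct = np.array([0.99, 0.60, 0.49, 0.05])

# Misclassification rate: 0-1 loss with a 50% threshold (2-class case);
# the 0.49 scan costs as much as the 0.05 scan.
mc_rate = np.mean(p_correct < 0.5)

# Squared Prediction Error: a smoother metric, SPE = (1 - p_c)^2.
spe = np.mean((1.0 - p_correct) ** 2)

# Deviance, -2 log p_c, blows up as p_c -> 0, which is why it behaved
# erratically on these data.
deviance = np.mean(-2.0 * np.log(p_correct))
```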
Resampling Estimates.
The above are population parameters, conditional on the model and the training data. We need to derive their estimates. We use (5-fold) cross-validation (CV) and the bootstrap resampling techniques (Efron and Tibshirani, 1993). The 5-fold CV estimate is derived by first randomly dividing the data into 5 equal-sized parts. Then the model is trained on 4/5ths of the data and used to obtain predictions for the remaining 1/5th. This is repeated five times, once for each of the five distinct training/validation set divisions. The prediction error is an average of the errors accumulated over the five validation sets. The CV process mimics the situation where we have a set of independent observations on which to estimate the prediction error. However, using a five-fold CV results in an estimate which is biased due to the diminished size (80%) of the CV-training set. Also, there is a variability associated with the many possible ways to divide the data into five parts. The .632+ bootstrap estimate was designed to remedy that, and has been shown to outperform CV in simulation studies (Efron and Tibshirani, 1993, 1997).
The .632+ bootstrap procedure is a refinement of the "leave-one-out" bootstrap approach, which we now describe. One obtains B bootstrap samples, with replacement, from the original data (in our case, B = 50). Then a model is built on each sample and tested on the observations that were (by chance) not included in the sample. The resulting prediction errors are averaged to give PE^(1), the leave-one-out bootstrap PE estimate.

It should be noted that we have used subjects, each with all his/her scans, as the sampling unit for both the CV and bootstrap resampling techniques. Otherwise large negative biases in the per-scan PE estimates will result, due to the large between-subject variability in these data sets (Strother et al., 1995a,b).
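A sketch of this subject-level resampling: whole subjects, with all their scans, are assigned to CV folds, so no subject contributes scans to both the training and validation sets (subject counts and fold sizes are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(6)
n_subjects, scans_per_subject, n_folds = 10, 8, 5

# Which subject each scan belongs to (each subject contributes several scans).
subject_of_scan = np.repeat(np.arange(n_subjects), scans_per_subject)

# Assign whole subjects (not individual scans) to CV folds.
perm = rng.permutation(n_subjects)
fold_of_subject = np.empty(n_subjects, dtype=int)
fold_of_subject[perm] = np.arange(n_subjects) % n_folds

folds = []
for fold in range(n_folds):
    val = np.flatnonzero(fold_of_subject[subject_of_scan] == fold)
    train = np.flatnonzero(fold_of_subject[subject_of_scan] != fold)
    # ... fit PDA on `train`, accumulate prediction errors on `val` ...
    folds.append((train, val))
```

The bootstrap analogue draws subjects with replacement and carries all of each drawn subject's scans into the sample.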
The .632 correction (Efron and Tibshirani, 1993, Ch. 17) was derived to correct for the (positive) bias that results since each bootstrap sample contains, on average, only 63.2% of the original sample. The .632 estimate is:

    PE^(.632) = .368 err + .632 PE^(1)    (3.50)

or a weighted average of the leave-one-out bootstrap estimate and the training error, err, on all of the data. This will now underestimate the PE for models which highly overfit and have training errors close to zero. The '+' correction attempts to deal with that, by first estimating the no-information error rate, γ, which is defined in the population as:

    γ = E_{F_ind} [ Q(y0, r_x(t0)) ]    (3.51)
This means the following: assume a distribution, F_ind, of data points consisting of predictors and responses, {t, y}, such that the marginal distributions of the predictors and responses are the same as for the observed data, but the two are independent; that is, there is no information in t about y. Let r_x(t0) denote the prediction made by our model at the point {t0, y0} from F_ind, trained on the available data x. The point {t0, y0} is also independent of the training set x. The function Q(·) is the prediction error measure, SPE or misclassification rate in our case. A possible empirical estimate of γ, suggested by Efron and Tibshirani (1997), is:

    γ̂ = (1/N²) Σ_i Σ_{i'} Q(y_i, r_x(t_{i'}))    (3.52)

which is an error rate computed for our data using all N² pairs of predictors and responses; this effectively mixes up the two and destroys their relationship.
For the misclassification rate, γ̂ = π̂₁(1 - q̂₁) + (1 - π̂₁) q̂₁, where π̂₁ is the proportion of class-1 observations, and q̂₁ is the proportion of observations predicted by r_x to belong to class 1. The multiclass extension is γ̂ = Σ_j π̂_j (1 - q̂_j). For the SPE, Eq. 3.52 becomes:

    γ̂ = (1/N²) Σ_i Σ_{i'} (1 - P̂(j(i') | i))²    (3.53)

where P̂(j(i') | i) is the estimated posterior probability (as in (3.49)) of the class that scan i' belongs to, evaluated for scan i. One may calculate (3.53) in the following way. Class j will be the "correct" class n_j times for each observation (where n_j is the number of observations in class j). Let P denote the N x J matrix of estimated posterior probabilities. Then γ̂ of equation 3.53 will be equal to the average of all row-sums of the matrix with entries (1 - P_ij)², where each column j is multiplied by n_j/N.
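For the misclassification rate, the closed form γ̂ = Σ_j π̂_j (1 - q̂_j) agrees with the brute-force average over all N² predictor/response pairings, as a small example (with made-up labels and predictions) confirms:

```python
import numpy as np

y_true = np.array([0, 0, 0, 1, 1, 1, 1, 1])   # observed classes
y_pred = np.array([0, 1, 0, 1, 1, 0, 1, 1])   # model predictions

K = 2
pi = np.bincount(y_true, minlength=K) / len(y_true)   # class proportions
q = np.bincount(y_pred, minlength=K) / len(y_pred)    # predicted proportions

# Closed-form no-information misclassification rate.
gamma = float(np.sum(pi * (1.0 - q)))

# Brute force over all N^2 pairs: fraction of pairings where the
# prediction for scan i' disagrees with the true class of scan i.
pairs = float(np.mean(y_pred[None, :] != y_true[:, None]))
```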
The no-information error rate is used to form the weight, ŵ, for the convex combination:

    PE^(.632+) = (1 - ŵ) err + ŵ PE^(1)    (3.54)

as a replacement for (3.50). The weight ŵ is formed in the following way: first define the relative overfitting rate, R̂:

    R̂ = (PE^(1) - err) / (γ̂ - err)

The relative overfitting rate measures the overfitting, the difference between the leave-one-out bootstrap estimate and the training error, relative to the "pure" overfitting, as measured by the difference between the no-information rate and the training error. R̂ varies between 0, if there is no bias in the training error, and 1, if there is "full" overfitting, that is, when PE^(1) equals the no-information error rate, γ̂. The weight ŵ, defined as:

    ŵ = .632 / (1 - .368 R̂)

varies from .632, when R̂ = 0, to 1. With ŵ so defined, Eq. 3.54 is seen as providing some method-based adaptivity to Eq. 3.50.
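Putting the pieces together, a small sketch of the .632+ combination (the input error rates are hypothetical):

```python
import numpy as np

def err632plus(err_train, err_boot, gamma):
    """.632+ combination of the training error and the leave-one-out
    bootstrap error (after Efron and Tibshirani, 1993, Ch. 17)."""
    # Relative overfitting rate R-hat, clipped into [0, 1].
    R = (err_boot - err_train) / (gamma - err_train)
    R = float(np.clip(R, 0.0, 1.0))
    w = 0.632 / (1.0 - 0.368 * R)        # weight grows from .632 towards 1
    return (1.0 - w) * err_train + w * err_boot

# Hypothetical inputs: low training error, moderate bootstrap error,
# and a no-information rate of 0.5 (balanced two-class problem).
est = err632plus(err_train=0.05, err_boot=0.30, gamma=0.50)
```

With no apparent overfitting (err_boot equal to err_train) the weight stays at .632; as the bootstrap error approaches the no-information rate, the weight moves towards 1 and the optimistic training error is discounted entirely.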
3.4 A Note on Gaussian Assumption
The LDA procedure, as derived by Fisher, does not need to rely on the Gaussian distributional assumption, which is also true for its penalized and smoothness-constrained version described here. The only place where normality is used is in estimating the posterior probabilities, and thus in estimating the Prediction Error measures developed in Section 3.3.7. The question of the validity of the Gaussian assumption then becomes a question of the validity of the PE estimates. One may conjecture that departures from Normality would have a detrimental effect on the predictive performance of our method, which would lead to larger prediction errors than one would obtain with similar, but Gaussian distributed, data. Perhaps more important, however, is the value of the ridge parameter λ where the minimum PE occurs, as this determines the final image we obtain from the analysis with a given basis set. It is quite possible that a departure from Normality changes the shape of the PE curves (like those in Fig. 4.3). Again, we do not feel that the location of the minima would drastically change with a departure from normality (as this location is clearly invariant under monotonic transformations of the PE), but we acknowledge a need for some robustness studies in this matter.
Finally, we have some consolation in that at least the PET data may not be very far from Gaussian. Each voxel is based on a linear combination of a large number of random photon counts (see Section 2.2.1), and we thus hope that the Gaussian approximation to the Poisson will work. With the smooth basis expansion these (reconstructed) counts are further smoothed with a large number of neighbouring ones, which hopefully gives us a possibility that the Central Limit Theorem may be applied.
3.5 Is Ridge Penalty Enough for the B-spline Basis?
The ridge penalty is very convenient for us to use computationally, but the question is whether
it penalizes "the right thing". By that we usually mean high-frequency components or
higher derivatives. In one dimension, B-splines are usually used with a second-order derivative
penalty (e.g., Hastie and Tibshirani, 1990), which results in the natural cubic smoothing
spline fit; a similar penalty could be composed using 3-dimensional tensor-product B-splines.
O'Sullivan (1991) gives a remarkable algorithm for computing an eigendecomposition
of a discrete Laplacian penalty matrix, which penalizes the square of the sum of the second
derivatives of the data. Also, there exists a wide literature on thin-plate splines (e.g.,
Green and Silverman, 1994), which are the most popular way to extend cubic smoothing
splines to higher dimensions. All these methods require that one handle p x p matrices,
which is computationally prohibitive in our case.
In this section we show, in a semi-formal way, that even though the ridge penalty is not
optimal in the sense of penalizing second derivatives, it still behaves reasonably, in the sense
that higher-frequency components are penalized more. Intuitive support was given in
Section 3.3.2.
Let us start by looking at the simpler regression problem. Let:

y_i = f(x_i) + ε_i,   i = 1, ..., n,

where ε_i ~ N(0, σ²) and f(x) is a regression function to be estimated. If we want to
constrain f(x) to be smooth, we can expand it in some basis of smooth functions. Let
{B_j(·)}_{j=1}^p be such a basis. Then, if we denote the evaluated basis (n x p) matrix by B,
we have:

β̂ = argmin_β ||y − Bβ||² + λ β^T β,   (3.58)

by fitting the ridge regression onto the smooth basis. In the context of B-splines, we would
have the matrix B as the B-spline matrix, i.e., p B-spline bases, each evaluated at the n design
points x_i. If we wanted f to be a natural cubic spline, the simple ridge penalty is not enough:
in one dimension, we have to use the p x p penalty matrix:

Ω_{jk} = ∫ B_j''(t) B_k''(t) dt.
We would like to avoid using the more complicated Ω since:

• We operate in 3-D. Although there are extensions to higher dimensions (like thin-plate
splines), we would like to use the simpler tensor-product basis, for which a "proper"
Ω is not easy to calculate.

• Simple ridge is much more feasible computationally in our case, as it lets us calculate
fits in the n-dimensional scan space, as shown in Sec. 3.3.5.
A more formal, functional setting for the above problem is as follows: find the function f(·)
that minimizes the penalized regression problem:

min_f Σ_{i=1}^n (y_i − f(t_i))² + λ ∫ (f''(t))² dt.

Remarkably, one can show (e.g., Wahba, 1990; Green and Silverman, 1994) that the minimizing
function is a cubic smoothing spline with knots at the distinct values t_i.
Let us diagonalize the regression equation (3.58) by decomposing B = UDV^T using the
Singular Value Decomposition. Here U and V contain the left and right orthonormal singular vectors,
respectively, and D has the corresponding singular values, d_j, on its diagonal. The problem now
becomes:

θ̂ = argmin_θ ||y − UDθ||² + λ θ^T θ,   with β = Vθ.

We have expressed the simple ridge regression problem (3.58) in the orthonormal basis
U. The penalty associated with basis function j is (λ + d_j²)/d_j², showing that the bases
associated with larger singular values are penalized less.
The question now becomes: when arranged by decreasing singular values, are the orthonormal
basis functions increasing in "complexity", thereby warranting higher penalties? A
partial answer may be obtained by looking at Figure 3.2. Here we have obtained the
orthonormal basis for the B-spline problem in 2 dimensions. It is quite visible that the
"wiggly" bases are penalized more.
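The shrinkage interpretation above can be verified numerically. The sketch below (Python with NumPy; the Gaussian-bump basis is a hypothetical stand-in for an actual B-spline basis) checks that the ridge solution equals componentwise shrinkage in the SVD basis, with shrinkage factor d_j²/(d_j² + λ), so components with small singular values are shrunk hardest:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 100, 20, 1.0

# A smooth "bump" basis standing in for B-splines (illustrative stand-in).
x = np.linspace(0.0, 1.0, n)
centers = np.linspace(0.0, 1.0, p)
B = np.exp(-0.5 * ((x[:, None] - centers[None, :]) / 0.08) ** 2)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(n)

# Ridge solution in the original basis ...
beta = np.linalg.solve(B.T @ B + lam * np.eye(p), B.T @ y)

# ... equals componentwise shrinkage in the SVD basis B = U D V^T:
U, d, Vt = np.linalg.svd(B, full_matrices=False)
beta_svd = Vt.T @ (d / (d ** 2 + lam) * (U.T @ y))
assert np.allclose(beta, beta_svd, atol=1e-6)

# Shrinkage factor d_j^2 / (d_j^2 + lam): monotone in the singular values,
# so "wiggly" components (small d_j) are penalized more.
shrink = d ** 2 / (d ** 2 + lam)
print(np.round(shrink[:3], 3), np.round(shrink[-3:], 3))
```

Since NumPy returns the singular values in decreasing order, the printed shrinkage factors decrease from near 1 toward 0, mirroring the (λ + d_j²)/d_j² penalty ordering in the text.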
To go back to PDA with basis expansion, we look at the following problem: find the function
β(t) such that:

Σ_{i=1}^n ( y_i − ∫ x_i(t) β(t) dt )² + λ ∫ (β''(t))² dt

is minimized. This is a penalized regression problem, and the solution for β(t) is again a cubic
smoothing spline with knots at the distinct values of t_i. This problem is covered by a special
case of Theorem 1.3.1 in Wahba (1990), the so-called generalized smoothing spline problem.
The details are explored in, for example, Hastie and Tibshirani (1993). The caveat for
us is that one possible way to solve the problem is to expand the coefficient β(·) in a cubic
B-spline basis and apply the second-order penalty matrix Ω (Hastie and Tibshirani, 1993).
Our approach of expanding the Canonical Image in a B-spline basis has the same flavour
and (in the 1-D case) would result in the cubic smoothing spline if the right penalty matrix
Ω were used. We can use the heuristic argument of the previous paragraph to justify the use
of the ridge penalty instead.
Chapter 4
Results with B-Spline and 3-dimensional Wavelets
4.1 Wavelet Basis
As will be apparent from the results presented below, wavelets have shown themselves to
be a possibly more efficient representation of the Canonical Variates in the FOPP neuroimaging
problem than B-splines. Fewer basis components are required to represent the signal, and
their predictive properties are superior to those of B-splines in the two-class setting, although
the results are surprisingly different in the eight-class problem. In this section we will
introduce some properties of wavelets and provide partial justification for the choices we
made when using wavelets as Canonical Variate bases.
Below we introduce wavelet bases, multiresolution analysis and the wavelet
transform. This general discussion was adapted from three excellent books on wavelets:
Ogden (1997) This is the first, and a very readable, book on statistical analysis using
wavelets.
Vidakovic (1999) This is a more comprehensive book on wavelets, also designed for statisticians.
It offers a more complete theoretical framework and discusses a wider spectrum of
auxiliary wavelet topics.
Burrus et al. (1998) This is a book written for engineers in signal-processing fields. It
offers an excellent discussion of wavelet filters and their implementation, together
with a good introduction to signal processing and filter-bank theory.
4.1.1 Wavelets: Introduction
By a wavelet one usually means any family of functions that is composed from a single
mother wavelet function, ψ(x). (Please note that we will use customary wavelet notation
here, which may conflict with previously introduced symbols.) It is assumed that the
mother wavelet satisfies an admissibility condition:

∫ |Ψ(ω)|² / |ω| dω < ∞,   (4.1)

where Ψ(ω) is the Fourier transform of the mother wavelet. Loosely speaking, condition (4.1)
says that the wavelet's power must be concentrated in higher frequencies. Since the wavelet
is in L², this effectively means that the wavelet must be a band-limited function. One easy
consequence of the admissibility condition is that Ψ(0) = 0, which in turn implies that:

∫ ψ(x) dx = 0;   (4.2)

that is, the mother wavelet must average to zero. It is also customary to normalize the
mother wavelet to unit norm:

∫ ψ(x)² dx = 1.

Given a mother wavelet, one constructs the wavelet basis by dyadic dilations and integer
translations:

ψ_{j,k}(x) = 2^{j/2} ψ(2^j x − k),   j, k ∈ {0, ±1, ±2, ...}.

The scaling factor 2^{j/2} keeps the unit norm. The translation index k is easily understood
for j = 0: it generates a sequence of mother wavelet translates, each moved to the right or
left by an integer. The dilation index j rescales the x-axis, compressing or expanding the
mother wavelet; it does so in units of powers of 2.
Under mild conditions, the wavelet system is orthogonal:

⟨ψ_{j,k}, ψ_{j',k'}⟩ = δ_{jj'} δ_{kk'},
using the Kronecker δ symbol. There exist non-orthogonal wavelet systems; they are then
usually bi-orthogonal. Bi-orthogonal systems have two sets of wavelets: one to project a
function onto, to obtain the wavelet coefficients (the analysis wavelets), and one to reconstruct
a function from, using the wavelet coefficients (the synthesis wavelets). Bi-orthogonal
systems maintain cross-orthogonality between the analysis and synthesis wavelets. The
usual orthogonal system is a special case where the analysis and synthesis systems are the
same. We will only concern ourselves with the orthogonal wavelet systems, as they have
statistical properties which are better understood.
4.1.2 Orthogonal Wavelet Basis and Multiresolution Analysis
We hinted above at the fact that wavelets constitute an orthonormal basis for some
functional spaces. Of these, the most important is L²(R), the space of all functions f(·)
with a finite L² norm:

∫ f(x)² dx < ∞.

Another well-known basis for L²(R) is the Fourier basis.
Given a wavelet basis for L², one can decompose any function f(x) ∈ L² into its wavelet
coefficients:

d_{j,k} = ⟨f, ψ_{j,k}⟩ = ∫ f(x) ψ_{j,k}(x) dx,   (4.7)

and this mapping is one-to-one, i.e. the decomposition is reversible:

f(x) = Σ_{j,k} d_{j,k} ψ_{j,k}(x).   (4.8)

Equation 4.7 is called an analysis equation and 4.8, a synthesis equation.
One special property that distinguishes wavelets from other bases is the Multi-Resolution
Analysis (MRA) property, which we will now discuss. Let us imagine that the (infinite-dimensional)
space L²(R) has the following decomposition:

⋯ ⊂ V_{−1} ⊂ V_0 ⊂ V_1 ⊂ ⋯,

such that the closure of ∪_j V_j is L²(R), ∩_j V_j = {0}, and f(x) ∈ V_j ⟺ f(2x) ∈ V_{j+1}.
Here, the V_j are subspaces of L² which contain functions of increasing detail, as we will see.
The closure of their union is the whole L², but their intersection
is null. The last condition says that for each function f(x) that is in V_j there is a unique
function f₂(x) = f(2x) in V_{j+1} that changes twice as fast, or with twice as much detail.
We suppose that there is an orthonormal basis for V_0 consisting of integer translations
of a scaling function, or a father wavelet, φ(x):

f(x) ∈ V_0 ⟺ f(x) = Σ_k ⟨f(x), φ(x − k)⟩ φ(x − k).   (4.11)

The dyadic dilations, {2^{j/2} φ(2^j x − k)}_{k ∈ Z}, of the basis for V_0 become the basis for V_j. Further,
since the subspaces contain each other, we decompose the subspace V_{j+1}:

V_{j+1} = V_j ⊕ W_j,   (4.12)

i.e., into the direct sum of the previous-level subspace, V_j, and the detail space, W_j. There
exists a canonical ortho-basis for W_j, composed of integer translations of the dilated mother
wavelet function, ψ(·). Since:

L²(R) = V_{j₀} ⊕ W_{j₀} ⊕ W_{j₀+1} ⊕ ⋯,

it is not surprising that the wavelet system associated with a particular MRA constitutes
an ortho-basis for L²(R).
The most famous, the simplest, and the least practically usable wavelet system is the Haar
basis. The Haar scaling function is:

φ(x) = 1 for x ∈ [0, 1), and 0 otherwise;

that is, a unity-constant function between 0 and 1. It is intuitively clear that by scaling
this function down to cover smaller and smaller intervals, and translating it, one can, in
the limit, represent any reasonable (e.g., one in L²(R)) function.
The mother wavelet associated with the Haar scaling function is:

ψ(x) = 1 for x ∈ [0, 1/2), −1 for x ∈ [1/2, 1), and 0 otherwise.

This is a Haar basis for W_0, the zero-level detail subspace. Projecting f(x) onto the integer
translates of ψ(·), one obtains the local difference between f(·) represented with φ_{1,k}(·)
and with φ_{0,k}(·), i.e., between f(·) represented with first-order detail and f(·) represented
with zero-order detail. It is easy to check that different translates and dilates of ψ(·) are
orthonormal; moreover, ψ_{j₀,k} are orthogonal to V_j for any k and j ≤ j₀, as we would
expect from relation (4.12).
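These properties of the Haar system can be checked with a small numerical sketch (Python with NumPy; grid-based Riemann sums are an approximation to the L² inner products):

```python
import numpy as np

def haar_psi(j, k, x):
    """Haar wavelet psi_{j,k}(x) = 2^{j/2} psi(2^j x - k)."""
    t = (2.0 ** j) * x - k
    return 2.0 ** (j / 2.0) * (((t >= 0) & (t < 0.5)).astype(float)
                               - ((t >= 0.5) & (t < 1.0)).astype(float))

# Approximate L2 inner products by a Riemann sum on a fine grid.
x = np.linspace(-2.0, 3.0, 500_001)
dx = x[1] - x[0]
ip = lambda f, g: float(np.sum(f * g) * dx)

psi = haar_psi(0, 0, x)
print(round(ip(psi, psi), 3))                  # unit norm: 1.0
print(round(ip(psi, haar_psi(0, 1, x)), 3))    # orthogonal translate: 0.0
print(round(ip(psi, haar_psi(1, 0, x)), 3))    # orthogonal dilate: 0.0
print(round(float(np.sum(psi) * dx), 3))       # zero mean (Eq. 4.2): 0.0
```

The same check applied to any pair (j, k), (j', k') with (j, k) ≠ (j', k') returns approximately zero, in line with the orthogonality condition stated above.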
Figure 4.1 shows some examples of wavelet functions. At each level there are twice
as many wavelets as on the previous level, but each of them gets more "squeezed", which
enables it to uncover more detail in the signal. Another feature of many wavelets is their
"spikiness", which is a result of requiring compact support and orthogonality. Higher-order
wavelets get visibly smoother at the expense of longer filter lengths (next section).
Figure 4.1: Haar (left) and Daubechies Symmlet wavelet functions. The detail level grows from bottom up, and only some integer translates are drawn at each level.
4.1.3 Discrete Wavelet Transform
In practice, one is interested in obtaining the wavelet representation of a function, just as
we are interested in Fourier (or frequency-domain) representations computed with Fourier
Transforms. Given a function f(x), one wants to calculate the lowest-level scaling coefficients,
c_{j₀,k}, and the detail coefficients, d_{j,k}, where:

c_{j₀,k} = ⟨f, φ_{j₀,k}⟩   and   d_{j,k} = ⟨f, ψ_{j,k}⟩,   j ≥ j₀.

The arbitrary level j₀, for which we calculate the scaling coefficients, represents the coarsest
scale we are interested in for the function f(·) under study. In practice, one does not have a
function but a sample of it obtained with a given sampling rate (we usually assume that
the function f(·) has been sampled uniformly over the x-axis). We then assume that the
function f(·) is piecewise constant over the sampling intervals: f(x) = f_i for x ∈ Δ_i. A
given sampling rate determines the highest detail level, J, we can possibly calculate. We
can then approximately assume that what we have is the projection of the function f(·) onto
V_J, or that:

c_{J,k} ≈ 2^{−J/2} f_k.   (4.17)
This is the starting point for the Discrete Wavelet Transform (DWT), which is used
extensively in practice. What remains to be shown is how to obtain the detail and coarser-level
scaling coefficients c_{j,k}, j = j₀, ..., J. Since V_0 ⊂ V_1, we can represent the zero-level
scaling function using the first-level ones:

φ(x) = Σ_{k ∈ Z} h_k √2 φ(2x − k).   (4.18)

This is the so-called scaling equation and is fundamental in constructing wavelets. The filter
{h_k} is of finite length when the support of φ(x) is finite, and is then an example of a
Finite Impulse Response filter. Similarly, since W_0 ⊂ V_1, we have:

ψ(x) = Σ_{k ∈ Z} g_k √2 φ(2x − k).   (4.19)

An important theorem, implied by the orthogonality of wavelet and scaling functions at
the same level, states that:

g_k = (−1)^k h_{1−k}.

Now, the scaled and translated version of the scaling equation is:

φ_{j,k}(x) = Σ_m h_{m−2k} φ_{j+1,m}(x).

A similar relationship holds for the wavelet coefficients' equation (4.19). By writing down
the definition of the wavelet and scaling coefficients, one can use the above results to obtain
the two fundamental equations of the DWT:

c_{j,k} = Σ_m h_{m−2k} c_{j+1,m},   (4.23)
d_{j,k} = Σ_m g_{m−2k} c_{j+1,m}.   (4.24)

These equations show how to calculate the lower-level scaling and wavelet coefficients from
the higher-level scaling ones. Given the starting values c_{J,k} from (4.17), one proceeds to calculate
the wavelet coefficients for levels J, J−1, ..., j₀ and the final scaling coefficients for level j₀.
There are similar equations for going up the scale: calculating c_{j+1,m} from pairs of coefficients
c_{j,k} and d_{j,k}. These equations are used in the synthesis stage, while equations (4.23,
4.24) are used in the analysis stage.
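One level of the analysis and synthesis recursions can be sketched for the Haar filter pair, whose two taps make the pairwise structure of the sums explicit (a Python/NumPy sketch; function names are illustrative):

```python
import numpy as np

# Haar filter pair; the high-pass follows from the low-pass via g_k = (-1)^k h_{1-k}.
h = np.array([1.0, 1.0]) / np.sqrt(2.0)   # scaling (low-pass) filter
g = np.array([1.0, -1.0]) / np.sqrt(2.0)  # wavelet (high-pass) filter

def analysis_step(c):
    """One analysis level: c_{j,k} = sum_m h_{m-2k} c_{j+1,m} and
    d_{j,k} = sum_m g_{m-2k} c_{j+1,m}; with Haar filters each output
    coefficient depends on one non-overlapping input pair (m = 2k, 2k+1)."""
    pairs = np.asarray(c, dtype=float).reshape(-1, 2)
    return pairs @ h, pairs @ g

def synthesis_step(c_lo, d_hi):
    """Inverse step: rebuild c_{j+1,m} from coarse and detail coefficients."""
    return (np.outer(c_lo, h) + np.outer(d_hi, g)).reshape(-1)

signal = np.array([4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0])
c, d = analysis_step(signal)
print(np.round(c, 3))   # coarse (scaling) coefficients: pairwise sums / sqrt(2)
print(np.round(d, 3))   # detail (wavelet) coefficients: pairwise diffs / sqrt(2)
assert np.allclose(synthesis_step(c, d), signal)   # perfect reconstruction
```

Iterating `analysis_step` on the coarse output walks down the levels J, J−1, ..., j₀ exactly as described in the text; longer filters would additionally require a boundary convention at the edges.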
4.1.4 3D Wavelet Basis
The discussion so far was centered on the one-dimensional wavelet basis. In order to
use wavelets for analyzing PET and fMRI scans we need to construct a 3D basis. The
method of choice is, as in the B-spline case, the tensor product of one-dimensional wavelet functions.
One has to be careful, however, to obtain an orthogonal basis with an appropriate MRA
decomposition. To this end one first generalizes the MRA. Let D denote the dimension of the
domain, R^D. Then we have D univariate MRA decompositions:

⋯ ⊂ V_j^d ⊂ V_{j+1}^d ⊂ ⋯,

for d = 1, 2, ..., D. We are interested in the D-dimensional MRA:

V_j = V_j^1 ⊗ V_j^2 ⊗ ⋯ ⊗ V_j^D,

i.e., each D-variate MRA subspace is a tensor product of the corresponding univariate ones.
By a tensor-product space we mean that its basis consists of tensor products of the univariate
scaling functions that form the bases of the V_j^d:

φ_{j,k}(x) = Π_{d=1}^D φ_{j,k_d}(x_d),

for any k ∈ Z^D. To obtain the multidimensional wavelets we start with expressing V_{j+1}
as a direct sum of the previous level V_j and a detail space W_j. To be concrete, let us
take D = 3; then:

V_{j+1} = (V_j^1 ⊕ W_j^1) ⊗ (V_j^2 ⊕ W_j^2) ⊗ (V_j^3 ⊕ W_j^3) = V_j ⊕ ⊕_{σ=1}^7 W_j^σ.   (4.29)

The detail spaces W_j^σ will emphasize local features in various canonical directions. If we
imagine the directions in the space ordered as running horizontally, vertically and "into the
page", then the spaces W_j^1, W_j^2, W_j^4 will be "turned on" by features in the "depth", vertical
and horizontal directions, and the remaining 4 will pick up various diagonal directions. The
space W_j^σ is spanned by 3-D integer translations of:

ψ^σ(x) = Π_{d=1}^3 ξ^{δ_d(σ)}(x_d),   where ξ^0 = φ, ξ^1 = ψ,

and δ_d(σ) denotes the d-th digit in the binary expansion of σ.
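A one-level separable 3-D transform can be sketched by applying a 1-D analysis step along each axis in turn; the sub-band index σ then encodes, bit by bit, which axes received the wavelet filter (a Python/NumPy sketch with Haar filters; function names are illustrative):

```python
import numpy as np

def haar_step_axis(a, axis):
    """One Haar analysis step along a single axis: returns (low, high) halves."""
    a = np.moveaxis(a, axis, 0)
    lo = (a[0::2] + a[1::2]) / np.sqrt(2.0)
    hi = (a[0::2] - a[1::2]) / np.sqrt(2.0)
    return np.moveaxis(lo, 0, axis), np.moveaxis(hi, 0, axis)

def haar3d_level(vol):
    """One level of the separable 3-D transform: 8 sub-bands indexed by
    sigma in {0,...,7}; the d-th bit of sigma records whether the wavelet (1)
    or scaling (0) filter was applied along axis d."""
    bands = {0: np.asarray(vol, dtype=float)}
    for axis in range(3):
        new = {}
        for sigma, b in bands.items():
            lo, hi = haar_step_axis(b, axis)
            new[sigma] = lo                     # scaling filter on this axis
            new[sigma | (1 << axis)] = hi       # wavelet filter on this axis
        bands = new
    return bands   # bands[0] ~ V_j; bands[1..7] ~ the detail spaces W_j^sigma

vol = np.random.default_rng(1).standard_normal((8, 8, 8))
bands = haar3d_level(vol)
# Orthogonal transform: total energy is preserved across the 8 sub-bands.
energy = sum(np.sum(b ** 2) for b in bands.values())
assert np.isclose(energy, np.sum(vol ** 2))
print(sorted(bands.keys()), bands[0].shape)
```

Recursing on `bands[0]` produces the coarser levels, mirroring the 1-D pyramid; the seven detail blocks at each level correspond to the W_j^σ spaces described above.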
4.1.5 Wavelet Thresholding
Donoho and Johnstone (1994, 1995) have developed thresholding rules for denoising signals
using wavelets. These results are optimal in the minimax, or "worst-case scenario", sense.
Specifically, one assumes a typical sequence of noisy observations:

y_i = f(t_i) + ε_i,

where f(·) belongs to some functional class and the ε_i are an i.i.d. sequence of standard Gaussian
variables. Donoho and Johnstone develop a series of minimax estimators of f(·) using
hard and soft wavelet-coefficient thresholding. We will only describe here the hard universal
thresholding rule, which they termed VisuShrink.
The DWT is an orthogonal transform which may be represented by a matrix multiplication.
If one denotes the sequence of wavelet coefficients of f(·) by θ, then the DWT of the noisy
signal y(t_i) is:

w = Wy = θ + ε̃,

where the transformed noise coefficients ε̃ are still i.i.d. standard normal because of the
orthogonality of W. The idea of thresholding is to keep only those coefficients that carry
the signal, i.e. that are "big enough". The idea hinges on the fact that for wide classes
of signals wavelets provide a sparse representation; that is, the wavelet expansions of these
signals have many zero or near-zero coefficients. The question is how to determine which
wavelet coefficients of a noisy observation do not carry any signal. Many solutions have
been proposed, but Donoho and Johnstone (1994) proved that a particularly simple rule
has near-optimal MSE in the minimax sense. This rule is: replace w_{j,k} by:

ŵ_{j,k} = w_{j,k} · 1(|w_{j,k}| > √(2 log n)).

In practice the threshold becomes σ̂ √(2 log n), where σ̂ is an estimate of the homoscedastic
noise level. Donoho and Johnstone (1995) propose to use the median of the finest-level
coefficients divided by 0.6745 as an estimate of σ; the constant is derived from the Gaussian
case. Others have proposed a similar Median Absolute Deviation of all wavelet coefficients,
in place of the simple median at the finest level. Everyone agrees that because of the sparsity
properties of wavelets a robust estimator of variance should be used.
The above results work for the i.i.d. case. Remarkably, Johnstone and Silverman (1997)
show that if the noise is correlated, all results of Donoho and Johnstone are still valid,
provided the thresholding is done separately on each level. That is, both the threshold and
the variance are estimated separately for each detail level. In the case of images, we do
that separately for each combination of level and direction, i.e. separately for each W_j^σ in
Eq. 4.29.
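The hard universal rule with the MAD-based noise estimate can be sketched as follows (Python with NumPy; the coefficient layout, and the use of the second half of the vector as stand-in "finest-level" coefficients, are illustrative assumptions):

```python
import numpy as np

def visushrink_hard(w, n, finest):
    """Hard universal thresholding: keep w_{j,k} only where
    |w_{j,k}| > sigma_hat * sqrt(2 log n), with sigma_hat the median
    absolute finest-level coefficient divided by 0.6745."""
    sigma_hat = np.median(np.abs(finest)) / 0.6745
    thr = sigma_hat * np.sqrt(2.0 * np.log(n))
    return np.where(np.abs(w) > thr, w, 0.0), thr

rng = np.random.default_rng(42)
n = 1024
theta = np.zeros(n)
theta[[10, 50, 300]] = [8.0, -9.0, 7.5]        # a sparse "signal" spectrum
w = theta + rng.standard_normal(n)             # noisy wavelet coefficients
w_hat, thr = visushrink_hard(w, n, finest=w[n // 2:])
print(np.flatnonzero(w_hat), round(float(thr), 2))
```

The three planted spikes survive while almost all pure-noise coefficients are zeroed out. For correlated noise, per the Johnstone and Silverman result above, the same function would simply be applied separately to each level (and, for images, to each level-direction block).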
4.2 Finger Opposition Data: Methods
4.2.1 Data and the Standard t-Test Analysis
We apply Penalized Discriminant Analysis using a simple ridge penalty and a tensor-product
B-spline (TPS) basis to the FOPP data described in Section 2.4.1.
After processing by the 3 x 3 x 3 box-car smoother and scan-mean normalization
(i.e., dividing each voxel by the mean of all voxels within the brain mask for that scan), a
pooled standard-deviation estimate was calculated (Worsley et al., 1992) and an activation
t-test value obtained for each voxel, as described in Strother et al. (1995a). Note that
such a pooled t-test activation image has been shown to outperform single-voxel t-test
images with alternate preprocessing schemes (Strother et al., 1998a), making the pooled
activation image a good reference pattern for the various PDA canonical images presented
in this paper.
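The two preprocessing steps described above can be sketched directly (Python with NumPy; the plain triple loop over offsets is an illustrative stand-in for whatever smoothing routine was actually used):

```python
import numpy as np

def boxcar3(vol):
    """3 x 3 x 3 box-car (moving-average) smoother with edge replication."""
    p = np.pad(np.asarray(vol, dtype=float), 1, mode='edge')
    nx, ny, nz = vol.shape
    out = np.zeros(vol.shape)
    for dx in range(3):
        for dy in range(3):
            for dz in range(3):
                out += p[dx:dx + nx, dy:dy + ny, dz:dz + nz]
    return out / 27.0

def mean_normalize(scan, brain_mask):
    """Scan-mean normalization: divide every voxel by the mean of the
    voxels inside the brain mask for that scan."""
    return scan / scan[brain_mask].mean()

rng = np.random.default_rng(0)
scan = rng.uniform(100.0, 200.0, size=(16, 16, 16))   # toy scan volume
mask = np.ones(scan.shape, dtype=bool)                # toy brain mask

smoothed = boxcar3(scan)
normalized = mean_normalize(smoothed, mask)
print(round(float(normalized[mask].mean()), 6))       # 1.0 by construction
```

After normalization the within-mask mean of every scan is exactly 1, which is what removes the global-intensity source of variance discussed later in Section 4.3.1.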
4.2.2 Two-way Classification with TPS: Internal Optimization and Scan Normalization
Each scan was assigned a label from the {Active, Baseline} set, according to its experimental
condition. The tensor-product (cubic) B-spline (TPS) basis, composed of 25 B-spline bases
in each dimension (defined as B25 in the next section), was set up in the smallest 3D box
that circumscribed the logical AND of all subject masks, for raw (i.e. unsmoothed) scans
with and without scan-mean normalization.
Five-fold cross-validation and the leave-one-out bootstrap were used to estimate PE measures
for a grid of values of the ridge tuning parameter, λ. The sampling units for both
resampling methods were the subjects; thus all scans from a subject were included if a
subject was sampled. Fifty bootstrap samples were used, for each value of λ, to derive a
.632+ bootstrap estimate of both the MC rate and SPE. We call this process of searching for
the optimal value of the ridge parameter an Internal Optimization.
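The subject-level resampling described above can be sketched as follows (Python with NumPy; the names and the 10-subjects-by-8-scans layout are illustrative):

```python
import numpy as np

def subject_bootstrap_splits(subject_ids, n_boot=50, seed=0):
    """Leave-one-out bootstrap with subjects as the sampling unit: each
    replicate trains on all scans of subjects drawn with replacement and
    tests on all scans of the subjects left out of the draw."""
    rng = np.random.default_rng(seed)
    subject_ids = np.asarray(subject_ids)
    subjects = np.unique(subject_ids)
    for _ in range(n_boot):
        drawn = rng.choice(subjects, size=len(subjects), replace=True)
        in_bag = np.isin(subject_ids, drawn)
        if not in_bag.all():                    # skip rare all-in-bag draws
            yield np.flatnonzero(in_bag), np.flatnonzero(~in_bag)

ids = np.repeat(np.arange(10), 8)               # e.g. 10 subjects x 8 scans
for train, test in subject_bootstrap_splits(ids, n_boot=3):
    assert not set(ids[train]) & set(ids[test])  # no subject on both sides
    print(len(train), len(test))
```

Keeping whole subjects together on one side of each split is what prevents within-subject scan correlation from leaking between training and test sets; the out-of-bag error from these replicates is the Err^(1) ingredient of the .632+ estimate.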
4.2.3 Two-Way Classification: External Optimization
The scans were assigned class labels, as before. Different parameterizations of the data
were used, which we denote by B15, B20, B25, B30, B35, Braw, BSmooth3, BSmooth5.
We will use the word "basis" interchangeably with "representation" in reference to any of
these. B?? are tensor-product spline projected images; the two digits denote the number
of B-spline bases in each dimension. That results in a wide range of input dimensionality
(i.e., total number of basis functions with support within the AND mask): between 2,500
and 25,000. Braw denotes the un-projected data, with voxels within the AND of all masks.
This results in 28,500 voxels used. BSmoothk is similar, but from data that has first
been smoothed with a k x k x k box-car (i.e. 3D square kernel) smoother; typically, k = 3
would be used (e.g., Strother et al., 1992).
PDA was applied to the mean-normalized data for different basis choices. For each
basis, the bootstrap analysis over the same grid of λ was done, with B = 50 bootstrap
samples, as described above. A number of bootstrap samples of 50 is admittedly low, but
seems (cf. Fig. 4.2) a good compromise given the huge computational burden associated with
resampling. We compare the error curves (for both the MC rate and SPE) and the position of
the minima on the SPE/EDF plane.
4.2.4 Deriving Time and Spatial Projections
We can use the same model to obtain the activation maps of the within-subject temporal
changes throughout the experiment. This is done by defining an 8-way classification
problem, where the classes denote the order in which the scan was taken (for subjects with 10 scans,
the 9th scan was pooled with the 7th, and the 10th with the 8th). No information about
the experimental state or about the temporal ordering is available to the model. Out of the 7
canonical variates that result, we look for those that represent time and state changes. We
project the labeled data and class means onto the canonical variates and choose the two that
are most appropriate.
4.3 Finger Opposition Data: Results
4.3.1 Two-way Classification and Internal Optimization
Figure 4.2 shows the effect of the ridge tuning parameter (expressed as EDF) on the prediction
error. Both CV and .632+ bootstrap estimates of the PE measures are shown. We
note that the MC rate has an erratic behaviour due to its discontinuous definition. Examining
the SPE curves on the top-left and the bottom panel, we note that the minima
for CV estimates (thin lines) occur earlier and rise faster with increasing EDF than those
for the .632+ bootstrap (thick lines). This is likely because CV estimates are not corrected for
the smaller training-set size. This would cause the increased variability of the canonical
variates that comes with higher EDF to have a more pronounced and quicker effect than
with the full training set.
In the top-right panel we see that CV estimates exhibit higher variance as compared
to their .632+ bootstrap counterparts. To obtain this plot we did not set the random seed
to the same value for every λ in the resampling exercise, as was done for every other plot
in the paper. This allows us to exhibit the between-sample variability of both resampling
methods that contributes to the variance of the estimate of the prediction error. Another
problem with CV is the open question of how many folds to use. At one extreme we have
n-fold cross-validation, which would have negligible training-set bias but high variance of
the PE estimates in each fold. At the other there is two-fold cross-validation, which
has lower variance for each fold but high bias due to the much smaller training-set size, and a
potentially large variance contribution that comes from the many ways to divide the training
Figure 4.2: Internal optimization of the ridge tuning parameter (Sec. 4.3.1), expressed as a number of Effective Degrees of Freedom, for Penalized Discriminant Analysis in the two-class problem. This example uses the data projected on the B25 tensor-product B-spline basis set. The curves show the change of Prediction Error (both Misclassification (MC) rate and Squared Prediction Error (SPE)) as a function of Effective Degrees of Freedom (EDF). Both cross-validation (CV) and .632 bootstrap estimates are exhibited. The top-left panel shows changes for un-normalized data, while the bottom panel deals with mean-normalized data. The top-right panel shows CV and .632 bootstrap SPE curves for mean-normalized data which were obtained with a different random seed for every EDF; these portray the greater variability of CV estimates. Thin lines: CV estimates; thick lines: .632+ bootstrap estimates; solid lines: estimates of SPE; dashed lines: estimates of MC rate.
set in two. The leave-one-out bootstrap may be seen (Efron, 1983) as approximating two-fold
cross-validation but with many two-fold splits, and the .632+ correction deals with the bias.
We have thus chosen the bootstrap estimator for reporting the rest of the results in this
thesis.
The bottom panel shows the curves for the mean-normalized data, where each scan has
been divided by its brain-voxel mean. The errors are lower, indicating that a large source
of variance has been removed. The MC rate (as estimated by the bootstrap) drops to
just below 13%, which is a large improvement over the no-information rate of 50%. More
noticeably, the minimum SPE is around .104, as compared to 0.25 for the no-information value.
The SPE minimum is at around 50 EDF for the normalized data, and around 40 EDF for
un-normalized. This suggests that, after optimizing for the degree of smoothness, the model
is able to extract more information from normalized data. This observation is consistent
with our intuition: mean-normalization removes a large source of variance, increasing the
signal-to-noise ratio and allowing more generalizable structure to be found in the data.
4.3.2 External Optimization with Different Bases
We have investigated the influence of image representation on the resulting canonical images
and the prediction error. Figure 4.3 shows PE curves (SPE: top panel, and MC rate:
bottom panel) across EDF for all representations. We will concentrate on the SPE curves
first. All TPS-projected data behave very similarly, except for the B15 basis. The B15
representation is likely too coarse to capture all but the major, spatially extensive components
of the underlying activation pattern. Its minimum occurs quite early on the EDF
scale (EDF = 36.6), and its SPE rises sharply as EDF increases, indicating that there is
less useful structure to be extracted in this representation. The Braw SPE curve shows
the worst performance for small EDF, and for larger EDF increases faster than all but the
B15-projected basis. This representation seems most sensitive to the choice of the tuning
Figure 4.3: Prediction Error (PE) curves as a function of the effective degrees of freedom (EDF) for all tensor-product B-spline (B15-B35) bases and unprojected raw (Braw) and smoothed (BSmooth3,5) scans. The upper panel shows Squared Prediction Error (SPE = (1 − p_c)², where p_c is the estimated posterior probability for the true class), and the lower panel depicts Misclassification rate (MC rate) as a percentage of the total number of scans misclassified. The curves from un-projected data are shown with larger-width lines. The markers show the minima for each curve. (The minimum for the B30 curve is not shown as it occurred beyond the figure frame.)
parameter. The other two unprojected representations behave similarly to the projected
ones, except that the two minima are farther apart and their SPEs are larger than those
of the projected bases for most of the EDF range.
Examining the SPE plots at around 20 EDF, we note that the SPE curves of the projected bases
and Braw arrange themselves in order of increasing smoothness and decreasing SPE. This
places BSmooth3 between B30 and B35, and BSmooth5 between B20 and B25. At these
low degrees of freedom, the canonical image is very flat, except for some extended bumps
occurring at the most predictive spatial locations. The canonical images are then driven
by large features, rather than by small regional changes, and smoother representations are
understandably better in those circumstances. Therefore, it makes sense that the curves
order themselves according to the degree of smoothness of the representation.
We also examined the placement of the minima for each representation in the SPE-EDF
plane. The unprojected bases exhibit a lowering of the prediction error with rising EDF
as one moves from the smoothest (BSmooth5) to the roughest (Braw) representation. The
projected bases exhibit a similar pattern, but the trend seems to be heading for larger SPE
with larger EDF for "rough" representations, while all along maintaining a lower SPE
than the unprojected bases. One explanation is that the projected bases are somewhat more
efficient for this problem, allowing the canonical images to contain more structure (higher
EDF) while controlling the SPE level. Their utility is, however, limited, and the SPE starts
to rise with very large numbers of basis functions and the associated higher EDF. Also note
that the EDF-at-min-SPE is again ordered with respect to the smoothness of the representation,
for Braw (28,500 voxels) and the projected bases (≤ 25,000 basis functions), and that again the
smoothed, unprojected representations fall somewhere between B15 and B35.
The 1IC rate curves in the bottorn panel of Fig. 4.3 again demonstrate the markedly
diffcrent behaviour of the B15 representation. Even t hough this time it achieves seconct-
lowest crror a t its minimum, it performs significantly worse for the higher EDF values.
as bcforc. Escept for B15. al1 curves from un-projected data have worse 11C rates than
TPS projected data. for niost of the EDF spectrum. Braw eshibits performance that is
visibly worse than the other unprojected cun-es. reversing the ranking of the minima of the
SPE plots. -411 curves. besides B15. flatten-out beyorid EDF= 70. lié also note. that the
EDF-a-min-SIC are much higher. but less well determined than their SPE counterparts.
In both cases, the SPE measure of PE seems much more informative than the MC rate
curves. The minima in the SPE curves are more pronounced, and the curves themselves
smoother. We also believe that a measure which takes into account how much the
model's predictions are right or wrong is more informative for our goal of assessing the
model generalizability. The MC rate plots are highly variable, due to the discontinuous
(0/1) error structure. That they flatten out for higher EDF, with their minima occurring
much further to the right than in the SPE plots, is likely due to the fact that the two-way
classification problem is driven mostly by a few strong regions of activation which are quite
clear indicators of the active state: mostly the right sensory motor area and left cerebellum
(see row C, Fig. 4.4). Once enough weight is put on these regions (achieved with high
EDF), the classifier can perform well, on average, regardless of the rest of the image. There
will be some scans, however, where the activation maps needed to perform classification
are different. In those cases canonical images composed using high EDF will perform very
poorly. While such cases will be penalized relatively mildly in terms of misclassification
(they will score a penalty of 1 regardless of how "badly" they misclassify), their posterior
probabilities will be very much off, and their SPE penalties much stiffer.

We prefer the SPE measure of the prediction error. The MC rate is, however, better
established in testing pattern recognition models and has a somewhat more direct inter-
pretation. We report both while concentrating our interpretation on SPE-based results.
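The contrast between the two error measures can be made concrete with a toy calculation. This is a hypothetical sketch, not the thesis code; here SPE is taken as the mean squared difference between the 0/1 class-indicator vector and the estimated posterior probabilities, which matches the behaviour described above:

```python
import numpy as np

def squared_prediction_error(post, labels):
    """Mean squared difference between the 0/1 class-indicator vector
    and the model's posterior class probabilities."""
    n, J = post.shape
    ind = np.zeros((n, J))
    ind[np.arange(n), labels] = 1.0
    return float(np.mean(np.sum((ind - post) ** 2, axis=1)))

def misclassification_rate(post, labels):
    """Discontinuous 0/1 error: a flat penalty of 1 for any wrong argmax."""
    return float(np.mean(np.argmax(post, axis=1) != labels))

# Two scans, both misclassified, but with very different confidence:
post = np.array([[0.45, 0.55],    # mildly wrong posterior
                 [0.01, 0.99]])   # confidently wrong posterior
labels = np.array([0, 0])
print(misclassification_rate(post, labels))    # 1.0: both errors look alike
print(squared_prediction_error(post, labels))  # about 1.2826: the confident
                                               # error is punished much harder
```

The 0/1 rate cannot distinguish a near-miss from a confidently wrong prediction, while the squared penalty on the posteriors can, which is why the SPE curves are smoother and their minima better determined.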
In all that follows we will use mean-normalized data and .632+ Bootstrap estimates.

However, as evidenced by Fig. 4.4, the SPE differences at the minima are only part
Figure 4.4: Functionally activated [15O]water PET voxels above the 93.4 percentile (white overlay) interleaved with registered grayscale MRI brain slices for Penalized Discriminant Analysis of: (A16-A40) unprojected raw data presmoothed with a 3 x 3 x 3 voxel boxcar kernel (BSmooth3); (B16-B40) unprojected raw data without presmoothing (Braw); (C16-C40) tensor product spline basis with 15 spline bases in each spatial dimension (B15); (D16-D40) tensor product spline basis with 35 spline bases in each spatial dimension (B35) (activation images A to D have decreasing squared prediction error (SPE) values as illustrated in Figure 4.3); (E16-E40) is a pooled standard deviation t-test image of scans presmoothed with a 3 x 3 x 3 voxel boxcar kernel; the Bonferroni t-value (t=4.65) at the 93.4 percentile was used to define a conservative activation threshold with which to compare activation image peaks (white overlay) for a fixed number of voxels. PET and MRI slices are 128 x 128 with 3.4 mm pixels with center-to-center slice spacing of 3.4 mm (i.e., slices A23 and A26 are separated by (26 - 23) x 3.4 = 10.2 mm) and are parallel to the AC-PC plane, which coincides with slice 24. Image left = brain left.
of the story. This figure shows the top 6.6 percent activation in nine chosen slices from
the canonical image obtained using: A - BSmooth3, B - Braw, C - B15, D - B35
representations, and E - a t-test image. The 6.6% threshold is defined by the Bonferroni
t-value of image E and was selected to compare an equal number of voxels across the five
activation images. The ridge parameter, lambda, was set to optimize the SPE for each basis.
The first four PDA images are arranged in order of decreasing SPE. Of particular interest
are the Braw and B35 images. These two representations are somewhat analogous to each
other: Braw and B35 contain the most structure in their unprojected and projected
groups, respectively; their respective minimal SPEs are not far from the group minima;
and they both represent the roughest representations in each group. The figure shows
that the B35 representation results in a less noisy and fragmented image which is more
visually appealing. BSmooth3 regains more smoothness as compared to Braw, making it
more appealing, but it does so at the expense of predictive performance. We also note
significant, interesting differences on the t-test image, which seems to be missing potentially
important structures: the contralateral midbrain tegmentum in slice E23, which contains key
parts of the motor system such as the substantia nigra; the ipsilateral auditory area in slice
E30; and, while ipsilateral parietal and premotor regions are seen on slices A36 and B36,
D36 shows only the parietal area and E36 the premotor area.
Figure 4.5 shows the scatter plot (one point per Talairach voxel) comparing the t-test
and the B35 images. There is a non-linear trend upwards for the significant voxels in the
upper left part of the plot, which shows that the mutually significant activation regions are
more pronounced in the B35 canonical image. The circle shows a small cluster of voxels in
the primary visual cortex that have been elevated by the PDA from the 20th percentile in
the t-test image to the 90th percentile in the B35 image; these voxels are visible in the
primary visual cortex in slices A26, B26 and D26 in Fig. 4.4.

In this finger-opposition data set a single-voxel t-test using pooled standard deviation
Figure 4.5: Scatter plot of pairs of activation image values for all Talairach brain voxels (1 point/voxel) for a single-voxel t-test image using a pooled standard deviation estimate, compared to penalized discriminant analysis (PDA) of a tensor product spline representation with 35 B-spline bases along each spatial dimension (B35). The dashed line depicts the principal axis from a principal component analysis of the scatter plot distribution. The circle highlights a group of voxels in the primary visual region that have moved from the 20th percentile in the t-test image to the 90th percentile in the PDA image. The solid vertical line depicts the Bonferroni t-value (t=4.65) at the 93.4 percentile of the t-test distribution of voxel values (white overlay, row E of Figure 4.4) and the solid horizontal line reflects the 93.4 percentile (value=0.0065) for the PDA distribution of voxel values (white overlay, row D of Figure 4.4).
(SD) estimates has been shown to predict population-based activation image patterns
significantly better than single-voxel t-tests using individual voxel SD estimates, and to
perform at the same level as a canonical variate analysis built with the SVD basis (Strother
et al., 1998a). Therefore, it is not surprising that the BSmooth3 and t-test activation
overlays in Figure 4.4 are quite similar, probably reflecting the fact that for this simple
two-state analysis the ridge penalty is relatively large (see Section 3.3.3). However, there
are several important differences between the PDA solutions and the pooled t-test result.
Figure 4.5 demonstrates that the PDA result has nonlinearly enhanced the most significant
voxels relative to the corresponding t-test values and "noise" values around zero. At least
one area (the primary visual voxels within the circle in Figure 4.5) has been completely
reordered relative to other activated regions so that it is now potentially active, while in the
t-test result it was negative and not distinguishable from noise. In addition, in Figure 4.4
there are "activated areas", plausible given this motor task paced by auditory
cues, that appear in slices 19, 33, 30 and 36 of the PDA results in rows A, B and D, but
not in the t-test result. The key point here is not that the PDA results are right and
the t-test wrong, but that the distribution of potentially activated peaks agrees for many
expected areas, and there is a hint that the tunable PDA results may be more sensitive
and able to identify areas that could change the neuroscientific interpretation of the brain
response to the task. In addition, the PDA framework is much more flexible, as shown by
the eight-class results, and internally optimized through prediction error estimates, so that
we do not need to put as much faith in the validity of distributional assumptions within
an inferential testing framework.
The Effective Degrees of Freedom (EDF) provides us with a way of calibrating the
amount of information extracted from the data, analogous to the dimension of the beta space
in linear regression. It seems, from Fig. 4.3, that it is both the SPE and the EDF-at-
min-SPE that are of interest. Ideally, we would like to extract as much structure from the
data as possible (since we believe that the brain function is anything but simple) while
maintaining low levels of SPE and thus high generality of the canonical images. In that sense,
the most appealing results are the Braw and B35 representations, which have the highest EDF-
at-min-SPE and a small difference in their min SPE. Figure 4.4 shows that of those two,
the projected B35 representation is much more appealing visually. By examining the trend
of the minima in Fig. 4.3 we note that likely nothing more can be gained in the unprojected
representations (the smoothing tends to worsen the results and with Braw we are at the
end of the roughness scale), while we can hope to achieve better results by other choices of
bases to project the data onto.
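For a ridge-type smoother, the EDF can be read directly off the singular values of the data matrix. The sketch below uses the generic ridge formula, not the thesis's PDA-specific code, but illustrates the same calibration: lambda = 0 gives the full model dimension, and large lambda shrinks the EDF toward zero:

```python
import numpy as np

def ridge_edf(X, lam):
    """EDF of a ridge smoother: trace of the hat matrix
    X (X^T X + lam I)^{-1} X^T = sum_i d_i^2 / (d_i^2 + lam),
    where the d_i are the singular values of X."""
    d = np.linalg.svd(X, compute_uv=False)
    return float(np.sum(d**2 / (d**2 + lam)))

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 20))
print(ridge_edf(X, 0.0))     # 20.0: no shrinkage, EDF equals rank(X)
print(ridge_edf(X, 1e6))     # near 0: heavy shrinkage
```

Sweeping lambda traces out exactly the EDF axis used in the SPE curves of Fig. 4.3.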
4.3.3 Applying PDA to an Eight-Class Problem: State and Temporal Changes
The SPE curves (Fig. 4.6) show some differences when compared to the 2-way classification
problem. There is an increase in the EDF-at-min-SPE, hinting at more extracted structure
in this more complicated, multi-class setting. The Braw representation, which again has the
highest EDF, also has higher SPE (0.585) than BSmooth3, the winner in the unprojected
group (SPE=0.550). B35 has the lowest SPE (0.543) and the second lowest EDF-at-min-SPE
(62.1). By chance, one would expect 0.757 for the no-information SPE.
More importantly, Figure 4.7 shows that the model is able to extract two components
which we a priori consider important: state and temporal effects. The class centroids
projected on the first canonical image arrange themselves largely in the temporal order
in each of the two states. There is also a large jump between the first active/baseline scan
(classes 1 and 2) and the second of these scans. This is intuitively appealing, as the subjects
are probably still learning (in the case of the first active scan) or reflecting on the tasks
to be performed and generally adjusting to the situation (in the case of the first baseline
scan). The first canonical image clearly separates the baseline (odd class numbers) and
Figure 4.6: Squared Prediction Error (SPE) curves in an 8-class problem, as a function of Equivalent Degrees of Freedom for 8 penalized discriminant models with different representations: 5 tensor-product B-spline projected datasets with varying numbers of basis functions (B15 to B35) and 3 unprojected raw (Braw) and smoothed (BSmooth3 and BSmooth5) datasets (thicker lines). The markers show the minima for each curve.
Figure 4.7: Projecting the data on the first two canonical images obtained by Penalized Discriminant Analysis of the 8-way classification problem. The points are labeled according to the class (1: first baseline, 2: first active, 3: second baseline, etc.), and the class means are shown in circles. This figure was obtained from the tensor-product B-spline projected data using 35 B-spline bases in each dimension (B35) with lambda corresponding to the minimum Squared Prediction Error (SPE), but is similar across bases and lambdas.
active scans. It is worth repeating that the model had no knowledge about either the states
or the temporal ordering: its task was simply to differentiate among 8 unordered classes.
These two components come up as the first two canonical variates and thus account for
the majority of the between-class variance (62%). The figure also shows: (1) a potential
interaction between the two experimental states and the temporal process, as the means in
the two groups arrange themselves on lines with different slopes, and (2) a first-scan effect
in both states (scans 1 and 2).
By examining the SPE curves we note that the EDF-at-min-SPE are somewhat higher
(62.1 vs 56.8 for B35, and 75.0 vs 68.8 for Braw), suggesting that more information is
extracted from the data when the temporal structure of the problem is not included in
the within-class covariance. This improvement is consistent with our observation that the
temporal structures seem different for the two experimental states, potentially violating the
common within-state covariance assumption for the two-class analysis. This setting also
demonstrates that the projected representations are potentially even more useful in more
sophisticated situations: the lowest SPE for unprojected data is achieved with BSmooth3,
and for projected data with B35, somewhat reversing the trend found in the two-class
problem. This shows, even more clearly than in the two-way problem, that to extract more
of the generalizable structure we need to impose some constraints. We have attempted to
compare the first canonical image from this PDA (corresponding to the experimental state
classification) to the canonical image obtained in the two-class paradigm, but one problem
is the arbitrary rotation allowed in the space of the first two canonical images. As possible
future work we may develop a PDA model where the first Canonical Variate is specified
from the two-way analysis, and apply PDA to the eight-class problem to discover secondary
structures.
4.4 Results with Wavelet Expansions
The 182 scans have been preprocessed with a Discrete Wavelet Transform, which is equiva-
lent to projecting them on a wavelet basis. Two families were used: Daubechies orthogonal
wavelets and Coiflets, which is also an orthogonal family. We have investigated order 2 and
3 Daubechies wavelets (Daub2 and Daub3) and order 2 Coiflets with 6 coefficients (CoifN6).
Daubechies wavelets are perhaps the most famous wavelet family. They were constructed
to be as symmetric as possible (the only fully symmetric, orthogonal wavelet with compact
support is the Haar wavelet). The order number refers to the highest moment of the wavelet
function that is equal to zero: a Daubechies wavelet of order 3 has mean and second moment
equal to zero. This is directly related to the smoothness of the wavelet. Coiflets have
additional zero-moment requirements on the scaling function (Burrus et al., 1998).
The Donoho & Johnstone thresholding brings about a tremendous dimension reduction.
The thresholded Daub2, Daub3 and CoifN6 representations result in 1884, 4187 and 3850
wavelet functions, respectively. This has to be contrasted with 9670 for B25 and 35,071 for
B35. Given the much better prediction results achieved by wavelets, this great reduction
of dimensionality seems to have successfully decreased the variance of projections.
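The VisuShrink rule referred to above can be sketched in one dimension. The thesis thresholds 3-D wavelet coefficients of real scans; this simplified illustration uses a single-level Haar transform and hypothetical data, but shows the same mechanism: a universal threshold of sigma * sqrt(2 log n) zeroes nearly all noise-driven coefficients while the sharp features survive:

```python
import numpy as np

def haar_dwt(x):
    """One level of the orthogonal Haar transform (len(x) must be even)."""
    s = (x[0::2] + x[1::2]) / np.sqrt(2.0)   # smooth (scaling) coefficients
    d = (x[0::2] - x[1::2]) / np.sqrt(2.0)   # detail (wavelet) coefficients
    return s, d

def visushrink_hard(d, n):
    """Donoho & Johnstone VisuShrink: hard-threshold detail coefficients
    at sigma * sqrt(2 log n), with sigma estimated from the median absolute
    deviation of the finest-level details."""
    sigma = np.median(np.abs(d)) / 0.6745
    lam = sigma * np.sqrt(2.0 * np.log(n))
    return np.where(np.abs(d) > lam, d, 0.0)

rng = np.random.default_rng(1)
n = 256
signal = np.zeros(n)
signal[101:111] = 5.0                      # one sharp, localized "activation"
x = signal + 0.5 * rng.standard_normal(n)
s, d = haar_dwt(x)
d_thr = visushrink_hard(d, n)
# Most noise-driven details are zeroed; the activation edges survive.
print(np.count_nonzero(d), "->", np.count_nonzero(d_thr))
```

This is exactly the source of the dimension reduction quoted above: the surviving coefficient set is a small fraction of the original basis.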
We first look at the 2-class problem that extracts the single baseline-activation image.
The top panel in Figure 4.8 compares one particular wavelet representation with B-spline
and unprojected results. The improvement is dramatic and is much larger than the differ-
ence between B-splines and unprojected representations. We offer one possible explanation
for this improvement. Wavelets combined with thresholding offer a great reduction in di-
mensionality without lowering the discriminatory capability of the PDA. Reduction of
dimensionality has the effect of lowering the variability of the results: that is, of the esti-
mated canonical variates, hence the projections and hence the posterior probability estimates. It
seems that this reduction in variance is much greater than the associated increase in bias of
these estimates. In fact, due to the scale-space tiling property of wavelets, even the thresh-
olded wavelet basis may have better resolution than the much higher dimensional B35
spline basis. The thresholding attempts to keep the high-resolution wavelets only where
they seem to be needed to estimate the brain function well, and reduces the resolution
elsewhere.
In the bottom panel we compare the three wavelet families with and without thresh-
olding. The thresholding helps the classification problem somewhat, particularly for the
two Daubechies families. Daub2 is the winner with SPE=0.075 at EDF=74.6. As this
is the coarsest wavelet family, it indicates that the discrimination problem hinges on a well-
defined, sharp structure which is best picked up by the low-order Daub2 wavelet. The D&J
thresholding gives a small but consistent improvement.
Figure 4.9 shows a few slices of the Canonical Eigenimage that results from applying
PDA with ridge hyperparameter optimization using two representations: Daubechies of
order 2, and B30. The wavelet representation has a much sharper focus on the activated
areas than the B-spline, with much smaller "bleed-over" from neighbouring pixels or slices. On
Figure 4.8: The top panel compares wavelet and B-spline results in the 2-class problem. Shown is a Daubechies order 2 thresholded wavelet basis compared to raw and B-spline representations using the .632+ Bootstrap estimate of squared prediction error. The bottom panel shows SPE curves for the two-class problem for various wavelet families. We compare the Daubechies order 2 and 3 families and the order 2 Coiflet system. For each family we investigate two thresholding strategies: simply "peeling off" one finest detail level in each dimension (32 x 32 x 16), and the Donoho and Johnstone VisuShrink hard thresholding rule.
Figure 4.9: Visual comparison of the Wavelet (top row) and B30 representations in the 2-class problem. The first three slices show portions of the cerebellum, the next two display the midbrain portions, and the last three slices depict the activation of the cortex. The grayscale image is the anatomical MRI scan in the Talairach space and the CV is overlaid on top of it using the hot-metal color coding. Both images were created using the EDF that minimized the SPE: 74.6 for Wavelet vs 53.6 for B30.
the other hand, it still has fewer spikes and speckles than the unprojected representations
such as Braw or the t-map (not shown), which improves interpretability.

Figure 4.10 compares the Wavelet and B-spline results using a corner-cube environment
(CCE) (Rehm et al., 1998). CCE finds several connected areas with high average activation
(here: above the 99th percentile) and with a pre-set minimal volume, for the images being compared.
These areas, called CCE foci, are then displayed using stems and projections onto the walls
of the 3D volume. The figure shows clearly that the B-spline results are smoother and more
spread out than the wavelets. Also, the wavelet PDA shows some smaller regions of activation that
are either absent or have much smaller activation levels in the B-spline volume. This is
due to the CCE algorithm: except for the major centers of activation (motor and auditory
cortices), the relative levels will be lower in the B-spline CV due to the imposed smoothness,
which causes the regions to be larger but suppresses the peaks, and thus prevents them from
being picked up by CCE.
Figure 4.10: Comparing the B-spline (B30) and wavelet (Daub2Thresh) Canonical Images using the corner-cube environment of Rehm et al. (1998) in the two-class setting. Except for the three major overlapping regions, the foci have been fit inside a ball of the same volume as that of the corresponding focus. Blue foci correspond to the B30 Canonical Image.
Figure 4.11: Squared Prediction Error for various wavelet families in the 8-class problem. Three wavelet functions are investigated: Daubechies order 2 and 4, and Coiflets order 2. For each family, we either remove the top-scale wavelet coefficient level, resulting in 32 x 32 x 16 wavelet coefficients, or we apply the D&J VisuShrink hard thresholding (Thresh). As a further dimension reduction technique, we investigate using all 7, or the first 2, CVs to perform classification (D7 vs D2).
In the 8-class problem that decomposes a full covariance structure associated with both
time and experimental design, wavelets perform surprisingly badly. Figure 4.11 shows the
SPE curves for the same wavelet families and thresholding rules that we used in the 2-
class case. We have also investigated restricting the dimensionality of the PDA model
from the full 7 to 2, since there is an a priori belief that only the baseline-activation and time
structures are important for this problem. That is, we only predict using the first two
canonical variates. While this rank reduction helps somewhat, the results are still signifi-
cantly worse than those of the B-spline and raw representations: the lowest SPE for the wavelet
families is achieved by the Daub2 family, D&J thresholded and restricted to two Canonical
Variates (Wave64Daub4ThreshD2). It achieves SPE=0.672 at EDF=61.7, which we compare
with 0.543 at EDF=62.1 for the B35 representation.
Figure 4.11 shows that dimensionality reduction greatly improves the prediction for
the wavelet families, while, as was the case in the 2-class setting, the thresholding strategy
seems less important. This suggests that the errors may be driven by variance, which is
reduced when only 2 CVs are used. We suspect that the common covariance assumption
may be grossly violated in the wavelet domain when all eight classes are used. Some support
for this assertion comes from observing the curves generated by using all 7 CVs. They
achieve their minima at very low EDF, as compared to the B-spline representations, and rise
sharply afterwards. Since a large ridge penalty (and hence small EDF) works to counteract
the effect of unequal covariance matrices, this would indicate that the common within-
class covariance matrix assumption may be badly violated.
Chapter 5
Static Force fMRI Analysis
In Section 2.4.2 we described the static force data. In this chapter we will extend the
methodology developed for the FOPP task. Our goal is to remain in the same paradigm as
before: develop a descriptive tool to offer several views of the data, as driven by the
experimental setup, but taking into consideration the residual covariance structure.
That is, we would like to look at the data through the canonical variates, which describe the
experimental "gradient": where do the experimental conditions really make the difference.
We feel that even though the task in front of us is not a classification task, the Discriminant
approach to the data is still suitable, as it disassembles the between-conditions covariance
structure into orthogonal pieces of decreasing influence.
5.1 Modeling the Time Series Effects: Time-Smoothed PDA
The main challenge of this data, as contrasted with the PET FOPP data, is the existence
of time series effects on a much finer scale than before. In the PET data, we dealt with the
time series by extending the class structure (our eight-way PDA analysis), and therefore
allowing the time effects to be arbitrary. The static force data contains the 8 subjects'
time series, each of length 91, where each image is taken at 4 s intervals. The force levels,
which are the experimental conditions here, are superimposed on this time series, and
there are about 8 scans during each instance of a force condition (about 11 for baseline). It
is reasonable to suspect that some part of the variability in the scans is due to time-
dependent changes independent of the conditions. These could be related to time drifts in
the MRI machine (the simple linear drift that almost always accompanies the fMRI series
has been removed in the preprocessing stage, but there may be "higher-order" changes), to
the hemodynamic processes in the brain that occur during a given instance of a condition
and may possess a systematic structure, and, as before, to the long-term brain processes
like adaptation, over-learning and fatigue.
It seems intuitively clear that there should be some continuity in the time series of
scans, since they are taken every 4 s. This means that consecutive scans within the
same condition instance should change in a smooth way, apart from noise. We may force
smoothness onto the result similarly to how we forced spatial smoothness onto the canonical
variates. The difference here is that we are forcing smoothness between the scans that
constitute the observations, rather than within the result.
5.2 Introducing Between-Scan Smoothness within the Discriminant Framework
The LDA algorithm may be cast in terms of the orthogonal projection operator, P_Y, that
projects the scans onto the structure of Y. In the typical LDA, Y is just an N x J indicator
matrix, and then:

P_Y = Y (Y^T Y)^{-1} Y^T    (5.1)

projects any data onto the class structure. For example, if, as before, S denotes the N x p
scan matrix, P_Y S is the N x p matrix that has scan (row) i replaced by the class average
of all scans in the class that observation i belongs to. The between- and within-class
covariance matrices are (see also Eqn. D.1) S^T P_Y S and S^T (I_N - P_Y) S, respectively. The
idea here is to work with the P_Y operator, forcing smoothness between scans in the time
series.
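As a sanity check on this construction, a toy numpy sketch (hypothetical sizes; S stands in for the scan matrix):

```python
import numpy as np

labels = np.array([0, 0, 0, 1, 1, 1])     # N = 6 scans in J = 2 classes
N, J, p = len(labels), 2, 3
Y = np.zeros((N, J)); Y[np.arange(N), labels] = 1.0   # N x J indicator

P = Y @ np.linalg.inv(Y.T @ Y) @ Y.T                  # Eq. 5.1

rng = np.random.default_rng(2)
S = rng.standard_normal((N, p))                        # toy scan matrix
PS = P @ S
# Row i of P S is the mean of all scans in scan i's class:
print(np.allclose(PS[0], S[:3].mean(axis=0)))          # True
# The between- and within-class pieces add back to the total cross-product:
B = S.T @ P @ S
W = S.T @ (np.eye(N) - P) @ S
print(np.allclose(B + W, S.T @ S))                     # True
```

The decomposition B + W = S^T S is the familiar between/within split that the smoothed operator below will generalize.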
We can develop the idea intuitively, as follows. Initially, we could partition the time axis
into non-overlapping bins of, say, three scans each (about 12 s). We could designate each
bin as a separate class. We would then have about 30 classes, just from the time structure
alone (we will introduce the combination of the time and force level effects later). If we
assume that the scans are in temporal order, then the first four rows of the Y matrix would
look as follows:

    1 0 0 ...
    1 0 0 ...
    1 0 0 ...
    0 1 0 ...
Our proposal is to replace the rigid 0/1 design above, which corresponds to square kernels
on the time axis, with smoother kernels. If we pick a smooth basis like B-splines, we can
achieve the desired effect by setting up the response matrix Y to be an N x J2 matrix of J2
B-spline bases evaluated at the N time points. In fact, any smooth kernel-shaped function
could be used, and our intuition suggests that very similar results would then be obtained.
B-splines have the advantage of compact support which, together with the banded penalty
matrix, leads to efficient numerical implementations.
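A minimal construction of such a smooth response matrix. This is a hypothetical sketch using equally spaced knots and the Cox-de Boor recursion, rather than the thesis's Splus code:

```python
import numpy as np

def bspline_design(x, n_bases, degree=3):
    """N x J2 matrix of J2 B-spline bases evaluated at the time points x,
    with equally spaced interior knots (Cox-de Boor recursion)."""
    lo, hi = x.min(), x.max()
    n_interior = n_bases - degree - 1
    knots = np.concatenate([
        np.repeat(lo, degree + 1),
        np.linspace(lo, hi, n_interior + 2)[1:-1],
        np.repeat(hi, degree + 1)])
    # degree-0 bases: indicators of the half-open knot intervals
    B = ((x[:, None] >= knots[None, :-1]) &
         (x[:, None] < knots[None, 1:])).astype(float)
    B[x == hi, np.nonzero(knots[:-1] < hi)[0][-1]] = 1.0  # close right end
    for k in range(1, degree + 1):            # Cox-de Boor recursion
        new = np.zeros((len(x), B.shape[1] - 1))
        for i in range(new.shape[1]):
            d1 = knots[i + k] - knots[i]
            d2 = knots[i + k + 1] - knots[i + 1]
            if d1 > 0:
                new[:, i] += (x - knots[i]) / d1 * B[:, i]
            if d2 > 0:
                new[:, i] += (knots[i + k + 1] - x) / d2 * B[:, i + 1]
        B = new
    return B

t = np.arange(0.0, 364.0, 4.0)            # the 91 fMRI time points, 4 s apart
Yb = bspline_design(t, n_bases=10)
print(Yb.shape)                            # (91, 10)
print(np.allclose(Yb.sum(axis=1), 1.0))    # the bases sum to one everywhere
```

Each row of Yb is a smooth, compactly supported "soft membership" over the time axis, replacing the hard 0/1 bins above.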
We still need to include the force levels, which is the main experimental design effect.
Our proposal is to create the Y matrix that combines the force level and the time structures
in a natural way. We could also impose smoothness onto the force structure, but we chose
not to, and allow arbitrary force level effects. This is feasible, since there are only 6 different
force levels, and lets us assess the relationship between the force levels and the brain response
visually, which could then be followed by a more formal investigation in a hypothesis testing
framework.

To complete the story we propose to penalize the time-axis parameterization. It is
natural to regularize the B-spline basis by the second-order penalty matrix, which penal-
izes the second derivative of the resulting function to control its "wiggliness" (e.g., Hastie and
Tibshirani, 1990; Green and Silverman, 1994). Therefore the complete proposal is to
set up the response matrix Y as:
set-up the response matris 1- as:
1- = [ I l . Ib] = [fl. . . . . fJi . Bi ( t ) . . . . . BJ,. ( t ) ]
wliere f, arc indicator columns for force Icl-els. and B,(t) are .J2 B-splinc hasis functions
cvaluatccl on tlic tinie points. Thcn the projection matris is constructeci:
Here R is a penalty matris for the B-splines. with rows and columiis. t h correspond the
force lcvcl basis. zcroed out. cind A,- is an another free Liyperparameter that controls the
csact amount of smootliness in the timc domain. \Ve cal1 the resulting mode1 a timc-
smoothed PD.4.
5.3 The O(N) Algorithm for Time-Smoothed Penalized Discriminant
As presented above, the algebra associated with the time-smoothed PDA would entail
computing p x p matrices and p-vectors, where p is the number of voxels (or image basis
functions) and is much larger than N. We need to modify the algorithm presented in
Section 3.3.4, which was expressed in terms of the N x N matrix of inner products. Our
approach is to construct and disassemble the P_Y projection operator, which then lets us
use the usual PDA algorithm.

Specifically, we start by creating Y as in Eqn. 5.2. Then we compute Sigma_Y = Y^T Y + lambda_Y Omega
and the singular value decomposition of it:

Sigma_Y = U_Y D_Y U_Y^T

We then compute the normalized response matrix:

Y~ = Y U_Y D_Y^{-1/2}    (5.4)

Since D_Y is a diagonal matrix, the inversion above is trivial. One can now easily show
that running our algorithm from Section 3.3.4 with the response matrix Y~ from Eq. 5.4 is
equivalent to eigen-analyzing Sigma_W^{-1} Sigma_B (with both covariance matrices defined through the
projection operator P_Y from Eq. 5.3), which is the holy grail of LDA.
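The key identity behind this construction, Y~ Y~^T = P_Y, can be checked numerically (a hypothetical sketch, with an identity penalty standing in for Omega):

```python
import numpy as np

rng = np.random.default_rng(4)
N, J = 40, 6
Y = rng.standard_normal((N, J))
Omega = np.eye(J)                   # identity penalty, for the check only
lam_Y = 2.0

Sig = Y.T @ Y + lam_Y * Omega       # symmetric positive definite
U, dvals, _ = np.linalg.svd(Sig)    # Sig = U diag(dvals) U^T
Ytil = Y @ U @ np.diag(1.0 / np.sqrt(dvals))   # Eq. 5.4

# Ytil Ytil^T equals the penalized projector Y (Y^T Y + lam Omega)^{-1} Y^T,
# so the ordinary PDA algorithm run with Ytil reproduces the smoothed P_Y.
P_direct = Y @ np.linalg.solve(Sig, Y.T)
print(np.allclose(Ytil @ Ytil.T, P_direct))    # True
```

Since all factorized matrices are J x J (or N x J), the expensive p x p algebra never appears, which is the point of the O(N) formulation.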
There are many ways to display the results. Obviously, we will want to look at the
Canonical Variates, but it is also important to understand what the CVs represent.
The easiest way to assess that in the usual LDA is to project the class means on a pair of
CVs and display the projections. In our case this corresponds to projecting the force level
and time means. If Y_f and Y_t represent the indicator matrices for the force levels and the
time points, respectively, we need (in the notation of Eqn. (2.8)) the projections of M_z,
where M_z is a matrix of time/force level means, and the subscript z stands for either the time t or
force level f structure. This also shows that we can examine the projected means in the
N-space without computing the expensive Canonical Variates.
5.3.1 Constructing the Second-Order B-spline Penalty Matrix
We point out in Section 3.5 that it is desirable to use a "proper" second-derivative penalty
matrix, Omega, to penalize the B-splines. Using such a penalty was computationally incon-
venient for us in the case of the B-spline expansion of the Canonical Variates, but is completely
feasible in the current case. To obtain the cubic smoothing spline representation of the
time structure, we need a 91 B-spline basis with knots at the unique time points and second
derivative 0 at the boundaries. Here we describe a computational trick that lets us avoid explicit construction
of the B-spline basis matrix and Omega by using an existing Splus smoothing spline function,
smooth.spline. For a given lambda, smooth.spline delivers, among other things, the predicted
values, y^:

y^ = B (B^T B + lambda Omega)^{-1} B^T y
where B is an n x n matrix of n B-splines evaluated at the n design points (assuming all
design points are unique). We evaluate smooth.spline n times, with the n canonical basis
vectors for y (i.e., at the kth evaluation y is a vector of all zeros and a single unity at the kth
place), and with x which is a sequence of the n design points. In our case, x is a vector with the 91
time points for the fMRI sequence, x = [0, 4, 8, ..., 360]. For each call to smooth.spline
we get y^, which is a row of the hat matrix, and thus after n evaluations
we can reconstruct:

H = B (B^T B + lambda Omega)^{-1} B^T
Now, B-splines are just one possible (and numerically efficient) basis for obtaining a solution to the smoothing spline problem (Eq. 3.60), but any other full-rank basis system will give the same fitted values. In particular, we can change B to the identity matrix and obtain the solution in the natural cubic splines basis (Eq. (2.10), Hastie and Tibshirani, 1990). That is, we would obtain:
(5.10)
where K is a penalty matrix for the natural cubic spline basis. For λ = 1, we can compute K from H by eigendecomposing it, then inverting and subtracting 1 from the eigenvalues γ, and reconstructing K from these and the eigenvectors. We can use K with the time-structure response Y from Eq. 5.2 being just an indicator matrix.
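Both computational devices above (recovering the hat matrix by smoothing the canonical basis vectors, and recovering the penalty matrix from the eigendecomposition of H) can be checked numerically. The following is a hedged numpy sketch rather than the Splus code: smooth.spline is replaced by a generic black-box linear smoother whose hat matrix is H = (I + K0)^{-1}, i.e., the λ = 1 case.

```python
import numpy as np

# Hedged sketch: a generic symmetric linear smoother y -> H y with
# H = (I + K0)^{-1} for a symmetric PSD penalty K0, standing in for
# smooth.spline at lambda = 1.
rng = np.random.default_rng(0)
n = 25
L = rng.standard_normal((n, n))
K0 = L.T @ L                       # symmetric, positive semi-definite penalty
H = np.linalg.inv(np.eye(n) + K0)  # the smoother ("hat") matrix

def smoother(y):
    """Black-box linear smoother; we may only call it, not read H."""
    return H @ y

# Trick 1: recover the hat matrix by smoothing the n canonical basis
# vectors (H is symmetric, so columns and rows coincide).
H_rec = np.column_stack([smoother(e) for e in np.eye(n)])
assert np.allclose(H_rec, H)

# Trick 2: recover the penalty from H alone. Eigendecompose H, invert
# the eigenvalues, subtract 1, and reconstruct with the eigenvectors:
gamma, V = np.linalg.eigh(H_rec)
K_rec = V @ np.diag(1.0 / gamma - 1.0) @ V.T
assert np.allclose(K_rec, K0)
```

The second assertion works because H and K0 share eigenvectors, with eigenvalues related by γ = 1/(1 + κ).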
5.4 Connections with Canonical Correlation Analysis
and MANOVA
As we have shown in Section 3.3.4, LDA is basically equivalent to Canonical Correlation Analysis (CCA) when the class-indicator matrix is used for Y. This connection is even more appealing in the presently proposed model. CCA does not put any requirements on the right-hand variables. Thus we may choose any representation for Y, for instance the structure shown in Eqn. 5.2. It is up to a researcher to make sure that the representation is sensible from the interpretation point of view. In the present context, we are seeking the canonical correlations of scans with both the force level and the time structure. In addition, it makes sense to parameterize the time axis using smooth basis functions to model part of the inter-scan correlations that exist due to proximity in time.
The penalization scheme proposed here is also appealing in the CCA context. The PDA regularization penalizes the left-hand side of the CCA equation, or modifies the norm for the left-hand side Canonical Variates that correspond to the scan data. The penalization of the time-axis B-spline basis does the same to the right-hand side of the symmetric CCA equation. It is, again, up to a researcher to make sure that the penalization scheme is reasonable from the analytic point of view. The model proposed here is similar to the one described in Chap. 12 of Ramsay and Silverman (1997), which we summarized in Sec. 1.3; now both parts of the criterion (1.6) are treated as functional and regularized. One difference is that we have a mixed response (or Y) structure: fixed force levels and smooth time, which we deal with using an additive model.
In Section 3.1.1 we hinted at the connection between classical LDA and MANOVA. Here we will show that the proposed time-smoothed PDA model has a similar connection with an appropriately parameterized MANOVA.
Section 12.5 of Mardia et al. (1979) shows that the test of dimensionality in one-way MANOVA leads to similar results as LDA. Specifically, if we assume the model:
for the scans x_i, i = 1, ..., N that are in J classes, and with the usual assumption of i.i.d. Gaussian errors ε ~ N(0, Σ_W), we can first test the null hypothesis of equality between the class means μ_j. If we reject the null, then we have at least two options. We can test for specific contrasts, as in the usual ANOVA, but we can also perform a more general test of dimensionality. That is, we can test whether the J class means (which lie in the p-dimensional space of the x_i's) span an r-dimensional hyperplane, with r < J − 1. The GLR test results in the sum of the first r eigenvalues of Σ_W^{-1}Σ_B, which is the decomposition that also gives the LDA results. Also, one can show (Mardia et al., 1979, Sec. 5.4) that the estimated hyperplane for the class means can be parameterized in terms of the eigenvectors of Σ_W^{-1}Σ_B, which are (up to a scale factor) the same as the CVs from LDA. A similar connection is proved by Hastie and Tibshirani (1996), who use these results to derive the EM algorithm for reduced-rank mixture discrimination.
We will now show that similar results hold for the model proposed in this chapter. If j(i) denotes the class (force level) of scan x_i and t_i its time, then we can propose a 2-way MANOVA model:
The Residual Sum of Squares (RSS) for this model has the form:
Let us perform a change of basis to orthogonalize the RSS. This involves left-multiplying the scans x_i and the factors α, β with Σ_W^{-1/2}. We will retain the same symbols for all of these to avoid trivial notational changes. Let us now assume that the effects span an r-dimensional hyperplane with an orthogonal basis that will turn out to be the Canonical Variates; that is, we assert that:
We now parameterize the time effects to achieve the smoothness. We choose a basis for the time axis with J2 components B_l(t), l = 1, ..., J2, and use it to parameterize the time axis, and then to get the time effects in this basis:
The RSS now becomes:
(5.19)
Let us write the RSS in matrix form. In addition to the CV matrix Φ that we defined in Eqn. 5.16, and the response vectors y_i that are the rows of the matrix Y from Eqn. 5.2, we define the (J1 + J2) × r matrix of effects' coefficients, η:
The RSS can be written as:
RSS = Σ_i ||x_i − Φ η^T y_i||²
Let Φ_C = [Φ Φ⊥] be a p × p matrix with columns forming an orthonormal basis for ℝ^p. The first r columns are the canonical variates φ_k, as defined above, and the remaining p − r columns are an orthonormal basis for the orthogonal complement of the CV space. For any choice of orthonormal CVs, Φ, we can minimize the RSS with respect to the coefficients η. Since the second term above does not depend on η, the result is just a regression of the CV-projected scans Φ^T x_i onto Y, and thus the minimizing solution is:
where X is an N × p scan matrix.
To find the canonical variates, we just consider the case of a single CV, i.e., r = 1. Since the CVs are orthogonal, we can do the minimization separately, which simplifies the notation that would otherwise require traces of matrices. With just a single φ, the partially minimized RSS becomes:
The minimizing φ can now be easily seen to be the first eigenvector of R = X^T Y (Y^T Y)^{-1} Y^T X, and in light of the orthogonality of the φ_k's, the full solution to the minimization problem is {η̂, Φ}, where Φ has the first r eigenvectors of R in its columns. To finish the presentation we have to remember that the minimization was carried out in the rotated system. Therefore, to project the unrotated scan x_i onto the hyperplane, we need to change its basis before projecting it onto the φ_k's. Thus the final estimates of the hyperplane's basis are the first r rotated eigenvectors of R, or Σ_W^{-1/2} φ_k, k = 1, ..., r.
What we have computed are the MLE estimates of the successively higher-dimensional hyperplanes that are hypothesized to contain the force level and time effects. The proposed model was the 2-way MANOVA with a B-spline parameterization of the time effect. This result also forms the basis for the GLR test of dimensionality (as in Mardia et al., 1979, Sec. 12.5) for the 2-way MANOVA model with the proposed parameterization of the time effects. To see that the φ_k are essentially the unrotated CVs, we note that they are the eigenvectors of R:
By Theorem A.9.2 of Mardia et al. (1979) we know that the Σ_W^{-1/2} φ_k are then the eigenvectors of Σ_W^{-1}Σ_B, which gives the LDA decomposition.
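The eigenvector correspondence invoked here can be verified numerically on random stand-ins for the within- and between-class covariances (a hedged toy check, not tied to the thesis data):

```python
import numpy as np

# Toy check: eigenvectors of the symmetric, whitened between-covariance,
# rotated back by Sw^{-1/2}, are eigenvectors of Sw^{-1} Sb.
rng = np.random.default_rng(1)
p = 6
A = rng.standard_normal((p, p))
Sw = A @ A.T + p * np.eye(p)          # SPD stand-in for the within-covariance
C = rng.standard_normal((p, 3))
Sb = C @ C.T                          # low-rank stand-in for the between-covariance

# Compute Sw^{-1/2} via the eigendecomposition of Sw.
w, U = np.linalg.eigh(Sw)
Sw_inv_half = U @ np.diag(w ** -0.5) @ U.T

# Leading eigenvector phi of the whitened problem ...
M = Sw_inv_half @ Sb @ Sw_inv_half
vals, Phi = np.linalg.eigh(M)         # ascending eigenvalues
phi = Phi[:, -1]

# ... rotated by Sw^{-1/2}, is an eigenvector of Sw^{-1} Sb with the
# same eigenvalue:
v = Sw_inv_half @ phi
lhs = np.linalg.solve(Sw, Sb @ v)     # Sw^{-1} Sb v
assert np.allclose(lhs, vals[-1] * v)
```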
This connection between an appropriately parameterized 2-way MANOVA model and our proposal may be used to obtain further insights. It is now clearer that the penalization of the time-axis B-spline basis reduces the effective dimensionality of the coefficients η and thus regularizes the time effects β(t). This is on top of the crude regularization provided by limiting the canonical dimensionality. The B-spline penalization prohibits excessive variation of the time effects and thus forces the estimation procedure to explain the variability in other, hopefully more suitable, ways.
5.5 Penalized Discriminant Analysis of StaticForce Data
in B-spline and Wavelet Domains
We have applied the PDA model with ridge regression to the StaticForce fMRI data described in Sec. 2.4.2. We use a 6-class structure: baseline and 5 force levels. There are 91 scans for each subject: 46 of them are in 6 baseline instances, and 45 in 5 active classes. With 8 subjects we have a total of 728 scans.
Figure 5.1: Projections on the first 2 CVs from the PDA model applied to the StaticForce data.
Figure 5.1 shows the projections on the first two CVs for the StaticForce data using second-order Daubechies wavelets with thresholding. The right panel shows the results of the model that was fitted with EDF=200. The projections indicate that the PDA model extracts a reasonable structure: the first CV divides baseline and active scans, and the second CV corresponds to force levels; apart from class 6 (1000g), the scores on this CV increase with the force level. The static force experiment with 1000g is seen as somewhat different from the other ones: it is apparently quite hard to maintain a force of this level for 45s through the experiment. It is reasonable to expect that different brain structures will be involved.
We have performed an extensive predictive analysis study using 3 thresholded wavelet families, as described in Section 4.4, and the B25 B-spline basis. We also used reduced-rank discrimination with 2, instead of 5, canonical variates used for prediction. In all these cases, both the SPE and the misclassification rate achieve their minima at very low degrees of freedom. Reduced rank helps keep the errors from increasing rapidly for higher EDF, but is otherwise not better than the full-rank model. The minima obtained are invariably around the base rates for these data: rates that would be achieved by the model that predicts based on the prior probabilities. The base misclassification rate is 368/728 or 50.55%. Similarly, the base SPE is 0.5055(1 − 0.5055)² + 5 · (72/728)(1 − 72/728)² = 0.525, since for each of the five classes there are 72 scans (out of 728). The left panel of Figure 5.1 shows the prediction for the PDA model fitted with EDF=8. It shows that the model is doing exactly what we suspected: predicting the a priori most probable class regardless of the scan characteristics. That the minimum error occurs at these low degrees of freedom suggests that PDA is not able to effectively predict the class of each particular scan. The predictive failure of PDA does not completely disqualify it from analyzing the data. As we saw in Fig. 5.1, PDA extracts two reasonable components; it is their generalizability over subjects that is in question. Also, there is an important time axis here which is completely ignored in the current analysis and which may constitute a much stronger effect than the condition under which the scan was taken.
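The base rates quoted above can be verified with a few lines of arithmetic (a sketch using only the class counts stated in the text):

```python
# Arithmetic check of the base rates for the 728-scan StaticForce data:
# 368 baseline scans (46 per subject, 8 subjects) and 72 scans in each
# of the five active classes.
p_base = 368 / 728
p_force = 72 / 728

# Base misclassification rate of the prior-probability predictor, in %:
assert round(100 * p_base, 2) == 50.55

# Base SPE, as given in the text:
spe = p_base * (1 - p_base) ** 2 + 5 * p_force * (1 - p_force) ** 2
assert round(spe, 3) == 0.525
```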
5.6 Applying the Time-Smoothed PDA Model to the StaticForce Data
We apply the time-smoothed PDA model to the StaticForce data. We use B25, a tensor product of 25 B-spline basis functions in each dimension, for the image representation, and 91 B-spline basis functions for the time axis with knots at the unique data points. We use the "proper" B-spline penalization with the second-order penalty matrix.
Figure 5.2: Projections of time-point (first row) and force-level means onto the first four Canonical Images using the time-smoothed PDA model with the B25 Tensor Product B-spline basis and B-splines for the time axis. Force levels were: 1 - baseline, 2 - 200g, 3 - 400g, 4 - 600g, 5 - 800g, 6 - 1000g. The time-structure penalty hyperparameter was set at λ_Y = 10.
Figure 5.2 shows the projections of the average of all scans at a given time point (first row) and a given force level (second row) onto the first three Canonical Images. These account for about 85% of the total variance. The hyperparameters were set at an EDF of about 50, and λ_Y = 10. These were not optimized.
The first CV accounts for almost 68% of the variance. It clearly separates the baseline state from all the others. The corresponding time projection shows a possible quadratic time relationship for the baseline states: it starts higher for the first baseline state, then decreases, and increases back for the last baseline. It may correspond to some kind of "anticipation"; however, the effect is much weaker than the baseline-activation effect and thus hard to interpret. In addition to this activation effect, the force levels are ordered on this CV, which may provide some insight into how the force level is modeled within the brain. The second CV (about 10% of the variance explained) shows a curious time trend which is quite linear for most of the time interval. This may be related to a number of things (including MRI machine trend) and requires further scrutiny. The corresponding force level effect is also strong and is mostly geared towards distinguishing the third force level. The third CV (about 7% of variance explained) has a rather noisy time structure, with some periodic behaviour, mostly visible in the early time points and related to the baseline-active changes. The corresponding force level display hints at structure in the brain which is associated with the strength of the force exerted. The baseline condition is an exception, having the same score as force level 4. This may indicate that the baseline is quite distinct from the zero-level force that it is supposed to model and should not be considered together with the other force levels. The force level ordering on this CV suggests that it may be the most interesting one to look at when searching for answers on the relationship between the amount of force applied and the controlling brain structures.
We also note some relationships between the CVs discovered here and in the PDA modeling in the previous section. The first CVs of both models are clearly quite similar.
Figure 5.3: Selected slices of the third canonical image resulting from applying the time-smoothed PDA model to the StaticForce data with the B25 basis, EDF=50 and λ_Y = 10.
The time-smoothed PDA CV described here has a stronger association with the force levels in addition to modeling the baseline-activation changes. The second CV of the time-smoothed PDA model seems to be a novel discovery, as it is strongly related to the time axis. The third time-smoothed CV is similar to the second CV of the PDA model. The difference is that it does put the 1000g force level in the right order with respect to the other forces. This may be occurring because of the explicit modeling of the time axis: this force level does not occur as a first active state in any of the 8 subjects. Thus its unexpected score on the second CV in the PDA model may be a result of a confounded time-order effect.
In general, we believe that the time-smoothed PDA model is potentially very useful for modeling fMRI data. It provides a decomposition of the covariance matrix along the variance components induced by the experimental setup, but it also takes into account the strong time series effects. Currently we lack a criterion for optimizing the hyperparameters and assessing goodness-of-fit, since the classification performance is no longer useful in this paradigm, but we mention a possibility in the next chapter.
Chapter 6
Conclusions and Extensions
The presented paradigm provides a flexible option for constructing summary images from both PET and fMRI studies. It takes into account different experimental setups and is flexible enough to accommodate two smoothness sources known to exist in the data: spatial and temporal (fMRI). For PET studies, the predictive analysis constitutes a validating technique that gives a researcher a degree of confidence in the resulting images and allows him/her to make choices (e.g., among different bases or in the number of degrees of freedom).
Our method has some advantages over others proposed in the literature:

- It deals with the full 3D (4D, with the temporal extension) data in a cohesive way, without a need to delineate regions of interest or perform voxel-based analysis
- It acknowledges the existing spatial and temporal smoothness in a simple way via basis expansion, which has the added benefit of reducing dimensionality and thus possibly variance
- Using a fixed basis and regularization, it avoids the SVD bases, which are wholly data and variance driven and thus do not take into account the spatial nature of scans. When using an SVD basis one also faces the task of choosing a subset of them, which is a task of exponential complexity; we avoid it by regularization with a single hyperparameter
- It provides a simple predictive framework for assessing the goodness-of-fit. While prediction is not a goal per se in neuroimaging studies (although it may become one as the diagnostic value of PET/fMRI brain scans increases), the Prediction Error provides a simple one-number summary of the effectiveness of the resulting image(s)
- We develop an associated, computationally appealing algorithm that avoids constructing huge covariance matrices
6.1 Extending the predictive analysis
We use prediction error estimates as a way both to choose a basis and hyperparameters and to validate a resulting image. We would like to extend this paradigm to the Two-Way PDA model (Section 5.2). One possibility is to use the MANOVA connection: if we think of Canonical Variates (CVs) as a basis for the scan space, we can use the bootstrap to estimate the Mean Square Error (MSE). Specifically, MANOVA tells us that if our model is correct, each scan is composed of a linear combination of Canonical Variates and an error term (Eqs. 5.14 and 5.17). We propose to validate the process by estimating the true MSE:
Here, the double expectation is taken over the distribution of the training sets, X, and then over the distribution of independent test scans, x₀. We use the within-covariance rather than the Euclidean norm to orthogonalize the Canonical Variates, as in Section 3.4. This way, the MSE is (up to an additive constant) a log-likelihood of a new test scan, x₀.
To proceed, let us first orthogonalize the system, as before (Sec. 5.4). What we have now are the estimated orthogonal canonical variates, φ_k, and we can write the norm in the new basis system as:
for some canonical coefficients, which combine both the time and force level structure. The starred quantities refer to the rotated quantities, for example x₀* = Σ_W^{-1/2} x₀.
We propose to estimate the MSE using the best linear coefficients for a given test scan, x₀, and the kth CV, φ_k. Our motivation for this is that we are interested in how well the estimated CVs represent the data, and the γ_{0k} are nuisance parameters in this context. By "best" we understand those resulting in the smallest MSE. It is trivial to show, since the Canonical Variates are orthogonal in the rotated basis, that the minimizing coefficients are just the projections of the test scan onto each CV. Since:
we can project the test scan onto the unrotated Canonical Variates to calculate the coefficients. Using Eqn. E.2 we see that γ̂_{0k} may be calculated without actually resorting to the projection, which is an expensive (O(p)) operation.
The MSE then is a double expectation of:
It should be possible to compute the first summand using only the outer-product matrix G and the model fitted to the training set, without resorting to the expensive operations on the actual scans.

To estimate the MSE (Eq. 6.1) we can use the .632+ bootstrap or cross-validation. Given the estimates obtained using the bootstrap set, we would apply them to compute the MSE for the scans in the validation set, and average as before. Even in the "regular" PDA models, this could be a more appealing alternative to the prediction error that we use.
6.2 Comparing the Results Across Non-Predictive Paradigms
It would be of interest to compare the results of our model (Smooth Canonical Images) with those of other methods currently in use, such as t-maps, ANOVA/ANCOVA-preprocessed PCA and the Scaled Subprofile Model. For two classes, one possible way to compare these would be ROC analysis. ROC curves are a measure of a classification model's predictive power when we do not want to assume any thresholds for determining the class. The area under the ROC curve represents the total amount of information about the class in the result, under a linear model. One powerful feature of ROC analysis is that it is invariant under monotonic transformations of an image.
The preferred paradigm would be to perform a bootstrap study: for each bootstrap sample, compute the summary image (Canonical Image, t-map, first Eigen-Image or Group Invariant Subprofile) and project each test scan onto it, saving the score and the true class. At the end, compute the ROC curves and the areas under them for each model. A similar paradigm was proposed, and tested on a set of simulated data, by Lange et al. (1999), and it has been warmly received by the community.
A different approach, which works for more than two classes, was developed by Strother et al. (1998b), and called NPAIRS. It involves assessing the variability of the resulting image using pairwise permutation studies. Briefly, one performs a large number of experiments in which the data are split randomly into two halves. One obtains the summary image for each half and computes the correlation coefficient between them. The coefficients are averaged over many random samples to give a total variability measure of the image. One problem with NPAIRS is that it does not take into account the "bias" in the result; by bias I mean some measure of the relevancy of the resulting image: a useless method that always returns the same image would score perfectly in this system. However, for any "reasonable" method, especially one that has been internally optimized using, for example, the prediction error, NPAIRS gives a useful indication of the total variability.
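The split-half procedure just described can be sketched in a few lines. This is a hedged toy illustration, not the published implementation: summary_image is a hypothetical stand-in (a simple two-class mean-difference map), and the data are simulated.

```python
import numpy as np

# Hedged sketch of split-half reproducibility: repeatedly split the
# scans into two random halves, compute a summary image from each half,
# and record the correlation between the two images.
rng = np.random.default_rng(2)

def summary_image(scans, labels):
    """Toy summary image: active-minus-baseline mean-difference map."""
    return scans[labels == 1].mean(axis=0) - scans[labels == 0].mean(axis=0)

# Simulated data: 80 scans, 500 "voxels", a small activated region.
n, p = 80, 500
signal = np.zeros(p)
signal[:20] = 2.0
labels = np.repeat([0, 1], n // 2)
scans = rng.standard_normal((n, p)) + np.outer(labels, signal)

corrs = []
for _ in range(50):
    perm = rng.permutation(n)
    h1, h2 = perm[: n // 2], perm[n // 2:]
    img1 = summary_image(scans[h1], labels[h1])
    img2 = summary_image(scans[h2], labels[h2])
    corrs.append(np.corrcoef(img1, img2)[0, 1])

print(f"mean split-half correlation: {np.mean(corrs):.2f}")
```

A reproducible signal yields a clearly positive average correlation; pure noise would average near zero, which is the sense in which the statistic measures image variability.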
6.3 Wavelets and Basis Selection Techniques
We have experimented with two kinds of bases: Tensor-Product B-splines and wavelets. There are many other possibilities, of course, and even within these two meta-families a great many more things may be explored.
Our current approach with the B-spline basis is to delete those basis functions that fall outside of the overall mask. We have not conducted systematic experiments to check whether the overall mask should be an AND or an OR of all masks, or perhaps something in between. More generally, some basis selection (à la wavelet denoising, perhaps) may be useful. Our approach has been to shift the burden of basis selection onto the ridge penalty. However, this may be an over-simplistic strategy and some combination may be desirable. One possibility would be to use a sum-of-absolute-values penalty, like the LASSO strategy of Tibshirani (1996). This offers a compromise between shrinking and basis selection and has been successful in many situations when compared to the classical shrinkage of ridge regression. On the other hand, we have mentioned before that it would be desirable to replace the ridge penalty with the second-derivative one, to perhaps obtain a thin-plate spline solution. It may be possible to combine both strategies: a LASSO-like penalty for shrinkage and basis selection with a thin-plate spline second-derivative penalty. One major obstacle is to implement this in a computationally appealing way that would, similarly to the algorithms presented in this thesis, avoid constructing covariance matrices in the voxel space.
There exists a more systematic approach to selecting bases from many families. Wavelet packets and the associated Best-Basis Pursuit algorithms (e.g., Vidakovic, 1999) start with overcomplete dictionaries which contain redundant bases from one family or multiple families. Best-Basis pursuit was developed for signal and image denoising, but the idea has been extended to multiple images and LDA by Coifman and Saito (1994, 1996), who describe the Local Discriminant Bases (LDB). LDB searches a large redundant basis dictionary in a rapid way, picking those bases that have high discriminatory power. Crucial to a fast implementation is the additivity of the discriminatory measure; examples are the Kullback-Leibler divergence and the Hellinger distance. We definitely feel that the area of basis selection, especially with wavelet bases, warrants more exploration.
We feel strongly that working in the wavelet domain has great potential in neuroimaging. It has a potential for great dimensionality reduction without affecting the results in a major way. Indeed, if done correctly, one may obtain better results, as we saw in the two-class PDA analysis for the FOPP data, due to the decrease in variance. Our approach to basis selection based on image-wise thresholding is simple to implement but has large potential drawbacks. First, it does not pool information across scans. One possibility would be to perform a robust version of an ANOVA analysis, using medians and absolute distances instead of means and the square metric, to estimate the level/channel-dependent noise across scans. This would be in direct analogy to the current MAD estimator but would take variability across scans, as well as potentially between subjects and conditions, into account. It is quite possible that with the current strategy we may be deleting bases important for signal discrimination. It is also possible that in some parts of a scan the variability is high, but that part still has some discriminatory power. Possibly more likely is the reverse scenario: there are low-variability regions in the scans with low discriminatory power, which currently survive the thresholding, perhaps at the expense of other regions, only because we do not take the discrimination problem into account when constructing the wavelet basis. One problem when extending the thresholding strategies to account for subject and condition effects is that it will require a different approach for validating the image: using fixed subject effects and conditions to select the wavelet basis would currently require that this step be performed for every bootstrap sample, which would be computationally prohibitive.
Wavelets also offer a possibility for more intelligent penalization schemes. Each wavelet has a position and a scale associated with it, and thus we may use any prior information to differentiate penalties for different spots in the brain on different scales. For example, we may penalize places with white matter or ventricles more, as they are not likely to participate in the brain function. The smoother representation is already penalized less: since there are twice as many wavelet coefficients at the next higher level, and since each coefficient receives the same penalty, collectively the ridge penalty "favours" smoother results; this could be reinforced with the location-based penalties mentioned above. Since all these schemes result in a diagonal penalty matrix, they are easy to implement in the current algorithmic paradigm.
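As an illustration of how such schemes stay computationally trivial, here is a hedged sketch of a level-dependent diagonal penalty for a one-dimensional dyadic coefficient vector and its use in a ridge-type solve. The geometric weighting rule is invented purely for illustration; the 3-D, location-based variant discussed above would only change how the weights are filled in.

```python
import numpy as np

# Build a diagonal penalty where coefficients at finer levels get a
# larger ridge weight, discouraging high-frequency structure. The
# 4**level rule is an illustrative assumption, not a thesis choice.
J = 5                                    # number of detail levels
base = 1.0
weights = [base]                         # scaling (coarsest) coefficient
for level in range(J):
    weights.extend([base * 4.0 ** level] * 2 ** level)  # 2^level coeffs per level
Omega = np.diag(weights)                 # diagonal penalty matrix, 32 x 32

# Ridge-type estimate with this penalty, for a toy design:
rng = np.random.default_rng(3)
n, m = 40, len(weights)
W = rng.standard_normal((n, m))          # stand-in for a wavelet design matrix
y = W[:, :3] @ np.array([1.0, -0.5, 0.25]) + 0.1 * rng.standard_normal(n)
gamma_hat = np.linalg.solve(W.T @ W + Omega, W.T @ y)
print(gamma_hat.shape)
```

Because the penalty is diagonal, it adds only O(m) work to the normal equations, which is the computational point made above.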
6.4 Inference and other issues
We have not done much work on region-specific inference. LDA and other multivariate techniques, as applied to images, are spatially global in nature and have mainly a descriptive appeal. We use the prediction error to validate the procedure, but have not made any attempts to designate specific regions as significantly activated.
A simple approach would be to assume normality, construct a T-map from a Canonical Variate and threshold using the Bonferroni correction. This may be more appealing here than it was in the case of t-maps constructed with Statistical Parametric Mapping (Section 2.3.1), since the basis coefficients, γ, that result from PDA and the projected images are potentially a lot less correlated than the original scans. For one, some spatial correlation has been removed via the basis expansion step, and PDA decorrelates the Canonical Variates further by working in the rotated space. Similarly, it may be possible to utilize the Gaussian Random Field thresholding of Worsley et al. (1992) on the reconstructed Canonical Image: one would assume that under the null hypothesis the canonical variate resulting from PDA applied to the projected data is a zero-mean field, as before. Then the covariance matrix (2.21) of the canonical image, which results from applying Eqn. 3.34, would be possible to calculate using the properties of the basis used. This may be more appealing than the SPM approach, since the homoscedasticity is more tenable in the CV space and the Prediction Error (or MSE) would give us some non-parametric confidence.
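A minimal sketch of the Bonferroni step, assuming the canonical-variate map can be treated as approximately standard normal under the null (a normal quantile stands in for the t quantile here):

```python
import numpy as np
from statistics import NormalDist

# Simulated pure-noise "T-map": at the Bonferroni-corrected threshold,
# (almost) no voxel should survive.
rng = np.random.default_rng(4)
p = 50_000                               # number of voxels
z_map = rng.standard_normal(p)

alpha = 0.05
# Two-sided Bonferroni threshold: per-voxel level alpha / p.
z_crit = NormalDist().inv_cdf(1 - alpha / (2 * p))
n_sig = int(np.sum(np.abs(z_map) > z_crit))
print(z_crit, n_sig)                     # threshold around 4.9
```

The expected number of false positives at this threshold is alpha (here 0.05), so a null map should essentially never show surviving voxels.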
Another issue is that of canonical dimension reduction: choosing a number of significant canonical variates. For example, in the 8-class FOPP problem, we felt that the first 2 canonical variates retain most of the structure associated with the problem. The easiest approach would be to extend the prediction error selection to choose the canonical dimension. Some asymptotic results may be tenable for this problem, however, since we are working with the summary data. A related issue is that of allowable rotation of canonical variates. For example, if only the first two CVs are designated as significant, what we really obtain is a two-dimensional view of the between-class covariance. It is quite possible that some rotation of the CVs would result in more appealing structures. Similar issues are present in the principal component analysis literature, and a number of automatic rotation procedures (such as VARIMAX) have been developed.
LDA and PDA depend heavily on the ability to estimate the covariance matrix. Since a full-rank covariance matrix cannot be estimated, we "cheat" a little by penalizing it, which effectively adds some volume in each direction. More importantly, LDA/PDA use pooled estimates of covariance over all classes, assuming the same shape. The alternative of estimating a separate covariance matrix for each class, called Quadratic Discriminant Analysis, is clearly untenable in the present case. Friedman (1989) offers one intermediate solution, termed Regularized Discriminant Analysis: first shrink the covariance matrix for each class towards a circular one (via a ridge penalty), and separately ridge-penalize the average covariance matrix. Two hyperparameters result, which may be estimated with cross-validation or bootstrap, as in our case. Another possibility is the Mixed Discriminant Analysis (MDA) proposal of Hastie and Tibshirani (1996). There each class (or all of the data) is modeled by a mixture of Gaussians, with the resulting mixture of covariance matrices modeling the common covariance structure. Each mixture-covariance matrix is penalized with a global hyperparameter, which helps keep the degrees of freedom low. This proposal has the potential for modeling different shapes for each class using different means and covariances for the class-specific Gaussian components. The MDA algorithm involves Expectation-Maximization (EM) iterations of the basic PDA algorithm, and is thus computationally appealing in our case, since we can run the analysis in O(N) time.
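As an illustration of the Regularized Discriminant Analysis shrinkage described above, the two-hyperparameter covariance estimate can be sketched in NumPy. This is only a sketch, not code from the thesis: the function name rda_covariances and the synthetic data are our own, and in practice the hyperparameters lam and gamma would be chosen by cross-validation or the bootstrap, as discussed above.

```python
import numpy as np

def rda_covariances(X, y, lam, gamma):
    """Friedman-style regularized class covariance estimates (sketch).

    lam   in [0, 1]: shrink each class covariance toward the pooled one.
    gamma in [0, 1]: shrink further toward a spherical (identity-scaled) matrix.
    """
    classes = np.unique(y)
    n, p = X.shape
    pooled = np.zeros((p, p))
    class_covs = {}
    for k in classes:
        Xk = X[y == k]
        Sk = np.cov(Xk, rowvar=False)          # per-class covariance estimate
        class_covs[k] = Sk
        pooled += (len(Xk) - 1) * Sk
    pooled /= (n - len(classes))               # pooled within-class covariance
    out = {}
    for k in classes:
        Sk = (1 - lam) * class_covs[k] + lam * pooled
        out[k] = (1 - gamma) * Sk + gamma * (np.trace(Sk) / p) * np.eye(p)
    return out

# Example: three classes in four dimensions (synthetic data)
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = np.repeat(np.arange(3), 20)
covs = rda_covariances(X, y, lam=0.5, gamma=0.3)
```

At lam = gamma = 0 this reduces to ordinary per-class (QDA) covariances; at lam = 1, gamma = 1 it gives a single spherical estimate.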
Appendix A
Tensor Product B-Spline Basis
Tensor products provide a general way to extend a one-dimensional basis to more dimensions. See, for instance, Green and Silverman (1994), which deals with cubic spline bases, or Ogden (1997) for an example of a tensor product basis in the wavelet domain. On the modeling side, Friedman (1991) develops a powerful and adaptive model using a first-order tensor product B-spline basis with backward elimination.

B-splines, discussed at length by de Boor (1978), were developed as a numerically efficient basis for polynomial splines. If B_j denotes the j-th B-spline basis function, we compose a 3-D basis by multiplying the unidimensional ones:

B_jkl(x, y, z) = B_j(x) B_k(y) B_l(z).

Thus the basis in 3D involves all possible products of the unidimensional bases. Figure A.1 shows the B-spline basis in one and two dimensions. One notable feature of B-splines is their compact support, which results in banded design and penalty matrices, leading to efficient algorithms.

Figure A.1: 1D and 2D B-spline basis.
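A one-dimensional B-spline design matrix, and its tensor-product extension, can be sketched as follows. This is an illustrative sketch, not thesis code; it assumes SciPy's BSpline class and uses identity coefficients to evaluate every basis function at once, then forms the 2-D grid basis as a Kronecker product.

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_basis(x, knots, degree=3):
    """Evaluate all 1-D B-spline basis functions at the points x.

    Returns an (len(x), n_basis) design matrix: with identity coefficients,
    the BSpline object evaluates every basis function in one call.
    """
    n_basis = len(knots) - degree - 1
    return BSpline(knots, np.eye(n_basis), degree)(x)

# Cubic basis on [0, 1] with a clamped knot vector (repeated boundary knots)
k = 3
t = np.r_[np.zeros(k), np.linspace(0, 1, 8), np.ones(k)]
x = np.linspace(0, 1, 50)
Bx = bspline_basis(x, t, k)      # 50 x 10 one-dimensional design matrix

# Tensor-product basis on the 50 x 50 grid of (x, y) pairs: every product
# of unidimensional bases, i.e. the Kronecker product of the 1-D matrices
By = bspline_basis(x, t, k)
B2 = np.kron(Bx, By)             # (50*50) x (10*10) grid design matrix
```

The compact support noted above shows up as bands of zeros in each row of Bx and B2.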
Appendix B
Basis Expansion of Canonical
Variates
In this appendix we show that the basis expansion of canonical variates leads to LDA or PDA with projected data.

As in Section 3.3, let a resulting canonical image be constrained as

where B is a basis matrix with one basis in each column, evaluated over the rosels (rows). PDA can be expressed as an optimization problem: for two classes it finds β that maximizes βᵀΣ_B β subject to βᵀΣ_W β = 1. For more than two classes one successively maximizes the criterion subject to orthogonality in the metric Σ_W, which does not affect the following result.

If we add the constraint on β, the criterion and condition become γᵀBᵀΣ_B Bγ and γᵀBᵀΣ_W Bγ, respectively. The between- and within-class covariance matrices are:

P_Y is the orthogonal projection operator onto the column space of Y (P_Y = Y(YᵀY)⁻¹Yᵀ); for LDA, λ = 0, and for PDA Ω is a chosen penalty matrix in the original space. It is now clear that the PDA problem with the smoothness basis constraint is an unconstrained PDA problem in the projected data matrix XB with a modified penalty, Ω*. Our choice, partly for computational expediency, and partly due to the limited knowledge of the true nature of the data in relation to the TPS basis, has been to set Ω* = I.
Appendix C
CCA via Regression
Here we derive the (unpenalized) CCA algorithm via regression (see also Hastie et al., 1995). Let N be the number of observations (i.e., scans), with p variables as inputs (here, voxels or basis functions). We assume here, for the unpenalized version, that N > p. Let X be the N × p data matrix, and let Y be an N × J class-indicator matrix, with J = number of classes. We can obtain the solution to the CCA problem from the regular SVD of:

where c_i are the singular values, and D_c = diag(c). Anticipating the LDA problem (Appendix D), we are interested in the left canonical variates B, which we will refer to as CVs. These are the rescaled left eigenvectors of K (Eq. 3.40), or:

because of the normalization requirement of CCA. If X and Y have been centered, then we can use the sample estimates:

Thus the sample version of Eq. C.1 becomes:

Now YᵀŶ, whose eigenvectors are the right eigenvectors of K, is:

If we choose orthogonal contrasts for the classes (i.e., normalize Y so that YᵀY = I), we obtain:

where Ŷ = X(XᵀX)⁻¹XᵀY. The two steps mentioned above are now clearly visible: run a multi-response regression of the (centered and orthonormalized) group-indicator matrix Y on the (centered) data matrix X (i.e., Ŷ = Xβ̂), and derive the right eigenvectors of K (Eq. C.1) by eigenanalysis of YᵀŶ:

Then obtain the left eigenvectors using Eqs. C.1, C.2 and C.4:

where β̂ is the matrix of coefficients from the regression step.
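The two-step procedure above can be sketched numerically as follows. This is an illustrative NumPy sketch with synthetic data, not thesis code; it checks that the leading eigenvalue of YᵀŶ is the squared canonical correlation between the corresponding variates Xβ̂θ and Yθ.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, J = 90, 5, 3                          # scans, predictors, classes (N > p)
labels = np.repeat(np.arange(J), N // J)
class_means = rng.normal(size=(J, p))
X = rng.normal(size=(N, p)) + class_means[labels]

# Center the data; build a centered, orthonormalized class-indicator basis Y
Xc = X - X.mean(axis=0)
Yind = np.eye(J)[labels]
U0, _, _ = np.linalg.svd(Yind - Yind.mean(axis=0), full_matrices=False)
Y = U0[:, :J - 1]                           # Y^T Y = I on the J-1 contrasts

# Step 1: multi-response regression of Y on Xc, giving Yhat = Xc @ beta
beta, *_ = np.linalg.lstsq(Xc, Y, rcond=None)
Yhat = Xc @ beta

# Step 2: eigenanalysis of Y^T Yhat; eigenvalues = squared canonical correlations
evals, Theta = np.linalg.eigh(Y.T @ Yhat)
c2, theta = evals[-1], Theta[:, -1]         # leading eigenpair

# Left canonical variate (up to rescaling): image-space direction beta @ theta
u = Xc @ (beta @ theta)
v = Y @ theta
corr = (u @ v) / np.sqrt((u @ u) * (v @ v))
```

Since the columns of Y are centered and orthonormal, corr equals sqrt(c2) exactly, which the check below confirms.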
Appendix D
Correspondence Between CCA and
LDA variates
In this section we will derive the exact amount of rescaling needed to convert the CCA variates, B, in the notation of Eqs. C.1, C.2, into LDA canonical variates, B_LDA, proving Eq. 3.42.

LDA is a generalized eigenvalue problem: find B_LDA that successively maximizes BᵀΣ_B B subject to BᵀΣ_W B = I, where Σ_B, Σ_W are the between- and within-class covariance matrices, as in the previous appendix. If X has been centered, then, for LDA:

Since B are left eigenvectors of K, we have that:

and thus we need to rescale: B_LDA = B D_c(I − D_c²)^(−1/2) to meet the LDA constraint.
Appendix E
Deriving Predictions in the
n-Dimensional Space
In this section we show how to derive the posterior probability estimates using the fitted values and the results of the eigendecomposition step.

From Equations 3.47–3.49 we note that we need x₀B_LDA and the class-centroid projections μ̄_k B_LDA to obtain the estimates. Now, using Eqs. C.8 and 3.42, we have that:

where ŷ₀ is the vector of fitted values for the predictor x₀.

The K × p matrix of class centroids (the μ̄_k's) may be obtained as (YᵀY)⁻¹YᵀX, where Y is an N × K class-indicator matrix. Therefore the required K quantities, μ̄_k B_LDA, are (YᵀY)⁻¹YᵀXB_LDA and may be calculated, similarly as in Eq. E.2, using the rescaled fitted values and A*. By using Eq. E.2, all K posterior probabilities are obtained for x₀.
Appendix F
Ridge Regression With the Outer
Product Matrix
Any penalized regression can be expressed using only the dot products of the observations in the p (= number of columns)-dimensional space, i.e., using an outer-product matrix G = XXᵀ. For our purposes, X is an image matrix, one image per row. We will work with ridge regression, but any penalized regression can be brought into ridge form by a suitable change of basis.

Ridge regression is the solution of the following problem:

argmin_β (y − Xβ)ᵀ(y − Xβ) + λβᵀβ

By taking derivatives with respect to β we have that:

Thus the fitted values are

and the predicted values at the new design points, X*, are

where G* is an N* × N matrix of dot products between the N* new images and the N training images.
Another derivation looks at the projection matrix S_λ = X(XᵀX + λI)⁻¹Xᵀ. Start with the Singular Value Decomposition of X:

In our case, the images will usually span an N-dimensional subspace of the p-dimensional voxel space. To keep things general, let us assume that the images span a k ≤ N dimensional space, i.e., U is N × k, D is diagonal with k strictly positive entries, and V is p × k. The matrices U, V are column-orthonormal, i.e., I_k = UᵀU = VᵀV.

Now,

Lines F.7 and F.10 follow from the fact that {V, (D² + λI_k)} and {U, (D² + λI_N)} are eigensolutions of XᵀX + λI_p and XXᵀ + λI_N, respectively, and that both of these matrices are invertible.

The PDA algorithm is composed of a penalized multi-response regression of the group-indicator matrix Y on an image matrix, followed by the eigenanalysis of YᵀŶ. Thus, if all we need is posterior probabilities at any (new) image x₀, we can operate entirely in the space of observations (much smaller than the space of predictors), once G and G* are precomputed. We need one more step to deal with the centering of the X matrix using only the outer-product matrix G.
Appendix G
Centering the Design Matrix
For PDA, we need to center the matrix X first, before computing G. However, to run the resampling validation studies, the training set will change for each bootstrap (or CV) iteration, and we need to center the validation examples by the training-set means. This would require precomputing the outer-product matrix for each bootstrap iteration separately, defying the computational advantage of this operation. We therefore need to find a way to compute the centered version of G, and a way to center the validation-set matrix, for any selection of training-set examples, given an uncentered G computed using all N uncentered observations.

Let G_All = XXᵀ be the outer-product matrix of all un-centered data. Any given bootstrap/CV sample specifies a subset of rows of X as a training set, with the rest being a validation set. From there, one obtains G and G* to get predictions (Eq. F.4). If G_All is rearranged so that the first N1 columns/rows correspond to the training images, and the last N0 to the validation images, then G and G* are the N1 × N1 upper-left and the N0 × N1 lower-left submatrices of G_All, respectively.

The centering operator associated with any N × p matrix X is:

where 1_N denotes a column N-vector of ones. We want G̃ = X̃X̃ᵀ in terms of G. We have, explicitly:

G̃ = G − ΔG

where (ΔG)_ik = ḡ_i + ḡ_k − ḡ, and ḡ_i, ḡ denote the column (row) means and the overall mean of G, respectively. For G*, with validation points, we proceed similarly, using the column means of G:

where, similarly as before, (ΔG*)_ij = ḡ_j + ḡ*_i − ḡ, and ḡ*_i is the mean of the i-th row of G*.
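The centering identities above can be checked numerically as follows; this is an illustrative NumPy sketch with synthetic data, not thesis code.

```python
import numpy as np

rng = np.random.default_rng(3)
N1, N0, p = 15, 6, 100                       # training / validation images, voxels
X_tr = rng.normal(size=(N1, p))
X_va = rng.normal(size=(N0, p))

G = X_tr @ X_tr.T                            # training outer-product (Gram) matrix
G_star = X_va @ X_tr.T                       # validation-vs-training block

# Center the training Gram matrix: g_ik - rowmean_i - colmean_k + grandmean
rm = G.mean(axis=1, keepdims=True)
cm = G.mean(axis=0, keepdims=True)
gm = G.mean()
G_c = G - rm - cm + gm

# Center the validation block using the TRAINING column means and grand mean
G_star_c = G_star - G_star.mean(axis=1, keepdims=True) - cm + gm
```

Both identities reproduce exactly the Gram matrices of data centered by the training-set means, without ever touching the p-dimensional voxel space.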
Bibliography
R.J. Adler and A.M. Hasofer. Level crossings for random fields. Annals of Probability, 4:1-12, 1976.

B.M. Anderson, T.W. Anderson, and I. Olkin. Maximum likelihood estimators and likelihood ratio criteria in multivariate components of variance. The Annals of Statistics, 14(2):405-417, 1986.

T.W. Anderson. An Introduction to Multivariate Statistical Analysis. John Wiley & Sons, second edition, 1984.

T.W. Anderson. Components of variance in MANOVA. In P.R. Krishnaiah, editor, Multivariate Analysis - VI, pages 1-8. Elsevier Science Publishers, 1985.

M. Antonini, M. Barlaud, P. Mathieu, and I. Daubechies. Image coding using wavelet transform. IEEE Trans Image Process, 1:205-220, 1992.

B.A. Ardekani, S.C. Strother, J.R. Anderson, I. Law, O.B. Paulson, I. Kanno, and D.A. Rottenberg. On the detection of activation patterns using principal components analysis. In R.E. Carson, M.E. Daube-Witherspoon, and P. Herscovitch, editors, Quantitative Functional Brain Imaging with Positron Emission Tomography, pages 253-257. Academic Press, San Diego, CA, USA, 1998.

N.P. Azari, P. Pietrini, B. Horwitz, K.D. Pettigrew, H.L. Leonard, J.L. Rapoport, M.B. Schapiro, and S.E. Swedo. Individual differences in cerebral metabolic patterns during pharmacotherapy in obsessive-compulsive disorder - a multiple regression discriminant analysis of positron emission tomographic data. Biological Psychiatry, 34(11):795-809, 1993.

M. Barinaga. What makes brain neurons run? Science, 276:196-198, 1997.

R.E. Bellman. Adaptive Control Processes. Princeton University Press, 1961.

C.S. Burrus, R.A. Gopinath, and H. Guo. Introduction to Wavelets and Wavelet Transforms: A Primer. Prentice Hall, 1998.

R.B. Buxton and L.R. Frank. A model for the coupling between cerebral blood flow and oxygen metabolism during neural stimulation. J Cereb Blood Flow Metabol, 17:64-72, 1997.

M.J. Catalan, M. Honda, R.A. Weeks, L.G. Cohen, and M. Hallett. The functional neuroanatomy of simple and complex sequential finger movements: a PET study. Brain, 121:253-264, 1998.

C. Clark, R. Carson, R. Kessler, R. Margolin, M. Buchsbaum, L. DeLisi, C. King, and R. Cohen. Alternative statistical models for the examination of clinical positron emission tomography/fluorodeoxyglucose data. J Cereb Blood Flow Metabol, 5:142-150, 1985.

R.R. Coifman and N. Saito. Constructions of local orthonormal bases for classification and regression. Comptes Rendus Acad. Sci. Paris, Serie I, 319(2):191-196, 1994.

R.R. Coifman and N. Saito. Improved discriminant bases using empirical probability density estimation. In Proceedings, Computing Section of Amer. Statist. Assoc., pages 312-321, 1996.
P. Craven and G. Wahba. Smoothing noisy data with spline functions. Numerische Mathematik, 31:377-403, 1979.

N.A.C. Cressie. Statistics for Spatial Data. John Wiley & Sons, New York, revised edition, 1993.

C. de Boor. A Practical Guide to Splines. Springer-Verlag, New York, 1978.

D.L. Donoho and I.M. Johnstone. Ideal spatial adaptation by wavelet shrinkage. Biometrika, 81(3):425-455, 1994.

D.L. Donoho and I.M. Johnstone. Adapting to unknown smoothness via wavelet shrinkage. Journal of the American Statistical Association, 90(432):1200-1224, 1995.

R.O. Duda and P.E. Hart. Pattern Classification and Scene Analysis. Wiley, New York, 1973.

B. Efron and R.J. Tibshirani. An Introduction to the Bootstrap. Chapman and Hall, 1993.

B. Efron and R.J. Tibshirani. Improvements on cross-validation: The .632+ bootstrap method. Journal of the American Statistical Association, 92:548-560, 1997.

Bradley Efron. Estimating the error rate of a prediction rule: improvement on cross-validation. Journal of the American Statistical Association, 78:316-331, 1983.

R.A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7:179-188, 1936.

P.T. Fox and M.A. Mintun. Noninvasive functional brain mapping by change-distribution analysis of averaged PET images of H2 15O tissue activity. J Nucl Med, 30:141-149, 1989.

R.S.J. Frackowiak, Karl J. Friston, C.D. Frith, R.J. Dolan, and J.C. Mazziotta. Human Brain Function. Academic Press, San Diego, CA, USA, 1997.

Y. Freund. Boosting a weak learning algorithm by majority. Information and Computation, 121(2):256-285, 1995.

Y. Freund and R. Schapire. Experiments with a new boosting algorithm. In Machine Learning: Proceedings of the Thirteenth International Conference, pages 148-156, 1996.

J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, in press.

J.H. Friedman. Regularized discriminant analysis. Journal of the American Statistical Association, 84(405):165-175, 1989.

J.H. Friedman. Multivariate adaptive regression splines (with discussion). Annals of Statistics, 19:1-141, 1991.

J.H. Friedman. An overview of predictive learning and function approximation. In V. Cherkassky, J.H. Friedman, and H. Wechsler, editors, From Statistics to Neural Networks: Theory and Pattern Recognition Applications, NATO ASI Series, pages 1-61. Springer-Verlag, Berlin, 1994.

K.J. Friston. Imaging neuroscience: Principles or maps? Proc Natl Acad Sci, 95:796-802, 1998.

K.J. Friston, C.D. Frith, P.F. Liddle, and R.S.J. Frackowiak. Functional connectivity: The principal component analysis of large (PET) data sets. J Cereb Blood Flow Metabol, 13:5-14, 1993.

K.J. Friston, C.D. Frith, P.F. Liddle, and R.S.J. Frackowiak. Comparing functional (PET) images: The assessment of significant change. J Cereb Blood Flow Metabol, 10:458-466, 1991.
K.J. Friston, A.P. Holmes, K.J. Worsley, J-P. Poline, C.D. Frith, and R.S.J. Frackowiak. Statistical parametric maps in functional imaging: A general linear approach. Human Brain Mapping, 2:189-210, 1995.

K.J. Friston, J.B. Poline, A.P. Holmes, C.D. Frith, and R.S.J. Frackowiak. A multivariate analysis of PET activation studies. Human Brain Mapping, 4:140-151, 1996.

P.J. Green and B.W. Silverman. Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach. Chapman and Hall, London, 1994.

L.K. Hansen, J. Larsen, F.A. Nielsen, S.C. Strother, E. Rostrup, R. Savoy, N. Lange, J. Sidtis, C. Svarer, and O.B. Paulson. Generalizable patterns in neuroimaging: How many principal components? NeuroImage, 9:534-544, 1999.

A.M. Hasofer and R.J. Adler. Upcrossings of random fields. Advances in Applied Probability (Suppl.), 10:14-21, 1978.

T. Hastie and R. Tibshirani. Varying-coefficient models. Journal of the Royal Statistical Society Series B, 55(4):757-796, 1993.

T. Hastie and R. Tibshirani. Discriminant analysis by Gaussian mixtures. Journal of the Royal Statistical Society Series B, 58:155-176, 1996.

T.J. Hastie, A. Buja, and R.J. Tibshirani. Penalized discriminant analysis. Annals of Statistics, 23:73-102, 1995.

T.J. Hastie and R.J. Tibshirani. Generalized Additive Models. Chapman and Hall, 1990.

J. Hertz, A. Krogh, and R.G. Palmer. Introduction to the Theory of Neural Computation. Addison-Wesley, Redwood City, CA, 1991.

M. Hintz-Madsen, L.K. Hansen, J. Larsen, M.W. Pedersen, and M. Larsen. Neural classifier construction using regularization, pruning and test error estimation. Neural Networks, 11:1659-1670, 1998.

I.M. Johnstone and B.W. Silverman. Wavelet threshold estimators for data with correlated noise. Journal of the Royal Statistical Society Series B, 59:319-351, 1997.

J.S. Kippenhan, W.W. Barker, J. Nagel, C. Grady, and R. Duara. Neural network classification of normal and Alzheimer's disease subjects using high-resolution and low-resolution PET cameras. J Nucl Med, 35:7-15, 1994.

U. Kjems, S.C. Strother, J.A. Anderson, I. Law, and L.K. Hansen. Enhancing the multivariate signal of [15O]water PET studies with a new nonlinear neuroanatomical registration algorithm. IEEE Trans Med Img, 18:306-319, 1999.

N. Lange, S.C. Strother, J.R. Anderson, F.A. Nielsen, A. Holmes, T. Kolenda, R. Savoy, and L.K. Hansen. Plurality and resemblance in fMRI data analysis. NeuroImage, 10:282-303, 1999.

B. Lautrup, L.K. Hansen, I. Law, N. Mørch, C. Svarer, and S.C. Strother. Massive weight sharing: a cure for extremely ill-posed problems. In H.J. Herrmann, D.E. Wolf, and E. Poppel, editors, Supercomputing in Brain Research: From Tomography to Neural Networks, pages 137-148. World Scientific, 1995.
D. Malonek and A. Grinvald. Interactions between electrical activity and cortical microcirculation revealed by imaging spectroscopy - implications for functional brain mapping. Science, 272(5261):551-554, 1996.

K.V. Mardia, J.T. Kent, and J.M. Bibby. Multivariate Analysis. Academic Press, London, Great Britain, 1979.

P. McCullagh and J.A. Nelder. Generalized Linear Models. Chapman and Hall, London, UK, 2nd edition, 1989.

A.R. McIntosh, F.L. Bookstein, J.V. Haxby, and C.L. Grady. Spatial pattern analysis of functional brain images using partial least squares. NeuroImage, 3:143-157, 1996.

A.R. McIntosh, L. Nyberg, F.L. Bookstein, and E. Tulving. Differential functional connectivity of prefrontal and medial temporal cortices during episodic memory retrieval. Human Brain Mapping, 5:323-327, 1997.

J.R. Moeller and S.C. Strother. A regional covariance approach to the analysis of functional patterns in positron emission tomographic data. J Cereb Blood Flow Metabol, 11:A121-A135, 1991.

J.R. Moeller, S.C. Strother, J.J. Sidtis, and D.A. Rottenberg. Scaled Subprofile Model: A statistical approach to the analysis of functional patterns in positron emission tomography data. J Cereb Blood Flow Metabol, 7:649-658, 1987.

N. Mørch. A Multivariate Approach to Functional Neuromodeling. PhD thesis, Danish Technical University, Lyngby, Denmark, 1998. http://eivind.imm.dtu.dk/publications/phdthesis.html

N. Mørch, L.K. Hansen, S.C. Strother, C. Svarer, D.A. Rottenberg, B. Lautrup, R. Savoy, and O.B. Paulson. Nonlinear versus linear models in functional neuroimaging: Learning curves and generalization crossover. In J. Duncan and G. Gindi, editors, Information Processing in Medical Imaging, volume 1230 of Lecture Notes in Computer Science, pages 259-270. Springer-Verlag, 1997.

F.A. Nielsen, L.K. Hansen, and S.C. Strother. Canonical ridge analysis with ridge parameter optimization. NeuroImage, 7(Part 2 of 3):S758, 1998.

C.R. Noback, N.L. Strominger, and R.J. Demarest. The Human Nervous System: Introduction and Review. Lea & Febiger, 1991.

S. Ogawa, T.M. Lee, A.R. Kay, and D.W. Tank. Brain magnetic resonance imaging with contrast dependent on blood oxygenation. Proc. Natl. Acad. Sci. USA, 87:9868-9872, 1990a.

S. Ogawa, T.M. Lee, A.S. Nayak, and P. Glynn. Oxygenation-sensitive contrast in magnetic resonance image of rodent brain at high fields. Magn. Reson. Med., 14:68-78, 1990b.

R.T. Ogden. Essential Wavelets for Statistical Applications and Data Analysis. Birkhäuser, Boston, 1997.

J.M. Ollinger and J.A. Fessler. Positron-emission tomography. IEEE Signal Processing Magazine, pages 43-55, 1997.

F. O'Sullivan. Discretized Laplacian smoothing by Fourier methods. Journal of the American Statistical Association, 86(415):634-642, 1991.
S. Pajevic, M.E. Daube-Witherspoon, S.L. Bacharach, and R.E. Carson. Noise characteristics of 3-D and 2-D PET images. IEEE Trans Med Img, 17:9-23, 1998.

J.O. Ramsay and B.W. Silverman. Functional Data Analysis. Springer-Verlag, New York, 1997.

C.R. Rao. The utilization of multiple measurements in problems of biological classification (with discussion). Journal of the Royal Statistical Society Series B, 10:159-203, 1948.

K. Rehm, K. Lakshminarayan, S. Frutiger, K.A. Schaper, D.W. Sumners, S.C. Strother, J.R. Anderson, and D.A. Rottenberg. A symbolic environment for visualizing activated foci in functional neuroimaging datasets. Medical Image Analysis, 2:215-226, 1998.

B.D. Ripley. Spatial Statistics. Wiley, New York, 1981.

B.D. Ripley. Pattern Recognition and Neural Networks. Cambridge University Press, Cambridge, GB, 1996.

D.A. Rottenberg, J.J. Sidtis, S.C. Strother, K.A. Schaper, J.R. Anderson, M.J. Nelson, and R.W. Price. Abnormal cerebral glucose metabolism in HIV-1 seropositives with and without dementia. J Nucl Med, 37:1133-1141, 1996.

U.E. Ruttimann, M. Unser, R.R. Rawlings, D. Rio, N.F. Ramsey, D.W. Hommer, J.A. Frank, and D.R. Weinberger. Statistical analysis of functional MRI data in the wavelet domain. IEEE Trans Med Img, 17(2):142-154, 1998.

N. Sadato, G. Campbell, V. Ibañez, M-P. Deiber, and M. Hallett. Complexity affects regional cerebral blood flow change during sequential finger movements. Journal of Neuroscience, 16(8):2693-2700, 1996.

S.C. Strother, J.R. Anderson, K.A. Schaper, J.J. Sidtis, J-S. Liow, R.P. Woods, and D.A. Rottenberg. Principal component analysis and the scaled subprofile model compared to intersubject averaging and statistical parametric mapping: I. "Functional connectivity" with [15O]water PET. J Cereb Blood Flow Metabol, 15:738-753, 1995a.

S.C. Strother, J.R. Anderson, X-L. Xu, J-S. Liow, D.C. Bonar, and D.A. Rottenberg. Quantitative comparisons of image registration techniques based on high-resolution MRI of the brain. J Comput Assist Tomogr, 18:954-962, 1994.

S.C. Strother, I. Kanno, and D.A. Rottenberg. Principal component analysis, variance partitioning and "functional connectivity". J Cereb Blood Flow Metabol, 15:353-360, 1995b.

S.C. Strother, N. Lange, J.R. Anderson, K.A. Schaper, K. Rehm, L.K. Hansen, and D.A. Rottenberg. Activation pattern reproducibility: measuring the effects of group size and data analysis models. Human Brain Mapping, 5:312-316, 1997.

S.C. Strother, N. Lange, R.L. Savoy, J.R. Anderson, J.J. Sidtis, L.K. Hansen, P.A. Bandettini, K. O'Craven, M. Rezza, B.R. Rosen, and D.A. Rottenberg. Multidimensional state-spaces for fMRI and PET activation studies. NeuroImage, 3(2):S98, 1996.

S.C. Strother, K. Rehm, N. Lange, J.R. Anderson, K.A. Schaper, L.K. Hansen, and D.A. Rottenberg. Measuring activation pattern reproducibility using resampling techniques. In R.E. Carson, M.E. Daube-Witherspoon, and P. Herscovitch, editors, Quantitative Functional Brain Imaging with Positron Emission Tomography, pages 241-246. Academic Press, San Diego, 1998a.
S.C. Strother, K. Rehm, N. Lange, J.R. Anderson, K.A. Schaper, L.K. Hansen, and D.A. Rottenberg. Measuring activation pattern reproducibility using resampling techniques. In R.E. Carson, M.E. Daube-Witherspoon, and P. Herscovitch, editors, Quantitative Functional Brain Imaging with Positron Emission Tomography, pages 253-257. Academic Press, San Diego, CA, 1998b.

R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B, 58(1):267-288, 1996.

V.N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1995.

B. Vidakovic. Statistical Modeling by Wavelets. Wiley Series in Probability and Statistics. John Wiley & Sons, Inc., 1999.

H.D. Vinod. Canonical ridge and econometrics of joint production. Journal of Econometrics, 4:147-166, 1976.

G. Wahba. Spline Models for Observational Data. SIAM, Philadelphia, PA, 1990.

S. Wold, A. Ruhe, H. Wold, and W.J. Dunn. The collinearity problem in linear regression: The partial least squares (PLS) approach to generalized inverses. SIAM Journal on Scientific and Statistical Computing, 5(3):735-743, 1984.

R.P. Woods, S.R. Cherry, and J.C. Mazziotta. A rapid automated algorithm for accurately aligning and reslicing positron emission tomography images. J Comput Assist Tomogr, 16:620-633, 1992.

R.P. Woods, S.T. Grafton, J.D. Watson, N.L. Sicotte, and J.C. Mazziotta. Automated image registration: II. Intersubject validation of linear and nonlinear models. J Comput Assist Tomogr, 22:153-165, 1998.

R.P. Woods, J.C. Mazziotta, and S.R. Cherry. Automated image registration. In K. Uemura, N.A. Lassen, T. Jones, and I. Kanno, editors, Proceedings Brain PET '93 AKITA: Quantification of Brain Function, pages 391-400. Excerpta Medica, Amsterdam, 1993.

K.J. Worsley, J-B. Poline, K.J. Friston, and A.C. Evans. Characterizing the response of PET and fMRI data using multivariate linear models. NeuroImage, 6:305-319, 1997.

K.J. Worsley, A.C. Evans, S. Marrett, and P. Neelin. A three-dimensional statistical analysis for CBF activation studies in human brain. J Cereb Blood Flow Metabol, 12:900-918, 1992.

K.J. Worsley, S. Marrett, P. Neelin, A.C. Vandal, K.J. Friston, and A.C. Evans. A unified statistical approach for determining significant signals in images of cerebral activation. Human Brain Mapping, 4:58-73, 1996.

G.A. Wright. Magnetic resonance imaging. IEEE Signal Processing Magazine, pages 56-66, 1997.