Predictive Gaussian Classification of Functional MRI Data

by

Grigori Yourganov

A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy

Institute of Medical Science
University of Toronto

© Copyright by Grigori Yourganov 2013
Predictive Gaussian Classification of Functional MRI Data
Grigori Yourganov
Doctor of Philosophy
Institute of Medical Science, University of Toronto
2013
Abstract
This thesis presents an evaluation of algorithms for classification of functional MRI data. We
evaluated the performance of probabilistic classifiers that use a Gaussian model against a popular
non-probabilistic classifier (support vector machine, SVM). A pool of classifiers consisting of
linear and quadratic discriminants, linear and non-linear Gaussian Naive Bayes (GNB)
classifiers, and linear SVM, was evaluated on several sets of real and simulated fMRI data.
Performance was measured using two complementary metrics: accuracy of classification of fMRI
volumes within a subject, and reproducibility of within-subject spatial maps; both metrics were
computed using split-half resampling. Regularization parameters of multivariate methods were
tuned to optimize the out-of-sample classification and/or within-subject map reproducibility.
SVM showed no advantage in classification accuracy over Gaussian classifiers. The performance of
SVM was matched by the linear discriminant, and at times surpassed by the quadratic discriminant
or nonlinear GNB. Among all tested methods, linear and quadratic discriminants regularized
with principal components analysis (PCA) produced spatial maps with highest within-subject
reproducibility. We also demonstrated that the number of principal components that optimizes
the performance of linear / quadratic discriminants is sensitive to the mean magnitude, variability
and connectivity of the simulated active signal. In real fMRI data, this number is correlated with
behavioural measures of post-stroke recovery and, in a separate study, with behavioural
measures of self-control. Using the data from a study of cognitive aspects of aging, we accurately
predicted the age group of the subject from within-subject spatial maps created by our pool of
classifiers. We examined the cortical areas that showed differences in recruitment in young versus
older subjects; this difference was demonstrated to be primarily driven by more prominent
recruitment of the task-positive network in older subjects. We conclude that linear and quadratic
discriminants with PCA regularization are well-suited for fMRI data classification, particularly
for within-subject analysis.
Acknowledgments
Most importantly, I would like to express my gratitude to my academic supervisors, Dr. Stephen
Strother and Dr. Randy McIntosh, who have helped me enormously during my graduate studies.
This thesis would not exist without their mentorship, without their financial and moral support.
I am also deeply grateful to Dr. Stephen Small from the University of California, Irvine, and to Dr.
Cheryl Grady from the University of Toronto, for letting me use the data from their beautifully
designed experiments. I would also like to thank Natasa Kovacevic for doing a fantastic job at
careful preprocessing of the data. I am heavily indebted to Dr. Ana Lukic for developing the
simulation framework that my thesis rests on, and to Dr. Xu Chen for teaching me how to use
this framework.
I have been very lucky to closely collaborate with Dr. Tanya Schmah, Dr. Marc Berman and Dr.
Nathan Churchill. I have learned a great many things from them, and certainly hope that our
collaborations continue into the future!
Many thanks to Dr. Rafal Kustra and Dr. Richard Zemel, members of my Program Advisory
Committee, for being patient with me during the long period of my doctoral studies and for
providing invaluable feedback!
Table of Contents
Chapter 1 Introduction .................................................................................................................... 1
1.1 The nature of signal and noise in fMRI .................................................................................... 1
1.1.1 Magnetic Resonance Imaging and the BOLD signal ........................................................ 1
1.1.2 Hemodynamic response and temporal resolution of fMRI ............................................... 3
1.1.3 Spatial resolution of fMRI ................................................................................................ 4
1.1.4 Sources of noise in fMRI .................................................................................................. 5
1.2 Experimental design .................................................................................................................. 7
1.3 Statistical analysis ..................................................................................................................... 9
1.3.1 General Linear Model ....................................................................................................... 9
1.3.2 Multivariate classification ............................................................................................... 10
1.4 Selecting the method of analysis ............................................................................................. 12
1.5 Data sets for evaluation of algorithms .................................................................................... 13
1.6 Structure of the thesis .............................................................................................................. 14
Chapter 2 Evaluating algorithms for fMRI data analysis ............................................................. 16
2.1 Metrics of performance ........................................................................................................... 16
2.1.1 Receiver Operating Characteristics (ROC) methodology ............................................... 16
2.1.2 Reproducibility; NPAIRS framework ............................................................................. 20
2.1.3 Accuracy of Classification .............................................................................................. 23
2.1.4 Prediction-Reproducibility Plots ..................................................................................... 24
2.2 Data sets .................................................................................................................................. 26
2.2.1 Simulated data ................................................................................................................. 27
2.2.2 Stroke recovery study ...................................................................................................... 30
2.2.3 Aging study ..................................................................................................................... 31
Chapter 3 Probabilistic classification of fMRI data ...................................................................... 33
3.1 General considerations ............................................................................................................ 33
3.2 Constructing spatial maps for classifiers ................................................................................ 36
3.3 Quadratic discriminant ............................................................................................................ 38
3.4 Linear discriminant ................................................................................................................. 39
3.5 Univariate methods: Gaussian Naive Bayes classifier, General Linear Model ...................... 41
3.6 Regularization of the covariance matrix ................................................................................. 45
3.7 Non-probabilistic classification: Support Vector Machines ................................................... 46
Chapter 4 Methods for estimating the intrinsic dimensionality of the data .................................. 48
4.1 Principal Component Analysis and dimensionality reduction ................................................ 48
4.2 Probabilistic Principal Component Analysis .......................................................................... 50
4.3 Methods of dimensionality estimation .................................................................................... 52
4.3.1 Akaike Information Criterion .......................................................................................... 53
4.3.2 Minimum Description Length ......................................................................................... 55
4.3.3 Bayesian evidence ........................................................................................................... 56
4.3.4 Stein's Unbiased Risk Estimator ..................................................................................... 57
4.3.5 Predicted Residual Sum of Squares ................................................................................ 59
4.3.6 Generalization error ........................................................................................................ 60
4.3.7 Reproducibility and Classification Accuracy .................................................................. 61
4.3.8 The Area under a ROC curve .......................................................................................... 62
Chapter 5 Intrinsic dimensionality estimation: results .................................................................. 64
5.1 Simulated data ......................................................................................................................... 64
5.1.1 Analytic methods ............................................................................................................ 64
5.1.2 Empirical methods: PRESS and Generalization error .................................................... 67
5.1.3 Empirical methods: Reproducibility and Classification Accuracy ................................. 68
5.1.4 Summary of performance on simulated data .................................................................. 71
5.1.5 Effect on data analysis .................................................................................................... 75
5.2 Intrinsic dimensionality estimation in real data ...................................................................... 77
5.3 Lessons learned ....................................................................................................................... 80
Chapter 6 Intrinsic dimensionality and complexity of fMRI data ................................................ 84
6.1 Intrinsic dimensionality in a study of self-control .................................................................. 84
6.2 Complexity of cortical networks in fMRI study of stroke recovery ....................................... 86
Chapter 7 Evaluation of classifiers: simulated fMRI data ............................................................ 94
7.1 Pool of classifiers .................................................................................................................... 94
7.2 Performance of classifiers on simulated data .......................................................................... 96
7.2.1 Classification accuracy .................................................................................................... 98
7.2.2 Reproducibility ................................................................................................................ 99
7.2.3 Partial area under the ROC curve .................................................................................. 100
7.2.4 ROC evaluation of GLM, PCA, ICA and PLS ............................................................. 101
7.3 Summary of evaluation of classifiers on simulated data ...................................................... 104
Chapter 8 Evaluation of classifiers: real fMRI data .................................................................... 106
8.1 Data sets ................................................................................................................................ 106
8.2 Evaluation on real data: Stroke study ................................................................................... 108
8.3 Evaluation on real data: Aging study .................................................................................... 113
8.4 Spatial maps for the aging study ........................................................................................... 117
8.4.1 Reproducibility of spatial maps across subjects and across methods ........................... 118
8.4.2 Group-level classification of spatial maps .................................................................... 125
Chapter 9 Conclusions and Future Research .............................................................................. 133
9.1 Evaluation of classifiers for fMRI data ................................................................................. 133
9.2 Directions for future research ............................................................................................... 135
9.2.1 Simulations of multiple networks ................................................................................. 135
9.2.2 Dimensionality in the aging study set ........................................................................... 136
9.2.3 Modifications to LD and QD; additional performance metrics .................................... 137
References ................................................................................................................................... 139
Appendix Works with significant contribution from the author ................................................. 150
1 Peer-reviewed publications .................................................................................................... 150
2 Conference presentations ....................................................................................................... 151
List of Tables
Table 6.1. Correlations between fMRI-based measures and behavioural measures.. ................... 92
Table 8.1. Cortical areas that show sensitivity to age ................................................................. 128
List of Figures
Figure 1.1. The temporal profile of a model hemodynamic response function .............................. 3
Figure 2.1. Examples of different distributions of H1 (solid green) and H0 (dashed red), and the
corresponding ROC curves. A: perfect detector; the ROC curve is a step function. B: Chance
distributions overlap completely; the ROC curve is the identity line. C: the usual situation,
distributions overlap somewhat and the area under the ROC curve is between 0.5 and 1. D:
degenerate case, the H1 distribution is completely contained inside H0 distribution; the ROC
curve shows the characteristic "hook" at the bottom. ................................................................... 18
Figure 2.2. Scatter plots of 2 simulated spatial maps, corresponding to reproducibility of r=0.5
(A) and r=0 (B). Major and minor axes are displayed for each scatter plot. The major axis
contains signal and noise, and variance along this axis is 1+r, while the minor axis contains only
noise, with variance 1-r. ................................................................................................................ 23
Figure 2.3. Prediction-Reproducibility (P-R) plots in the presence and absence of a simulated
cortical network. The black line shows the P-R trajectory when the activation loci are linearly
coupled into a single covarying network. The grey line shows the trajectory when the activation
loci are independent. The size of the markers corresponds to the number of principal components
used in the analysis (varied from 1 to 40). We show the average trajectory for 100 simulated data
sets described in Section 2.2.1. ..................................................................................................... 26
Figure 2.4. The phantom in baseline (left) and activation (right) states. Noise is not displayed. ... 26
Figure 5.1. Normalized cost function of several methods of dimensionality estimation. ............ 65
Figure 5.2. Reproducibility and classification accuracy for linear and quadratic discriminants, as
a function of number of principal components, for two simulated data sets. Left and right plots
correspond to a weak (V=0.1) and strong (V=1.6) variance of the signal, respectively, for
moderate connectivity of ρ=0.5. The dotted line shows the classification accuracy corresponding
to random guessing. ...................................................................................................................... 68
Figure 5.3. Plot of the first 10 eigenvalues of the covariance matrix of a single data set, for M =
0.01 (top) and M = 0.03 (bottom); ρ is set to 0.5, and V varies from 0.1 to 1.6 in increments of
0.5. Eigenvalues are averaged across 100 simulated data sets. .................................................... 69
Figure 5.4. Median dimensionality estimates in simulations, as calculated by various methods
(see legend and text), shown as a function of the relative signal variance, V, defined as the
variance of the amplitude of the Gaussian activation blobs relative to the variance of the
independent background Gaussian noise added to each voxel. M is set to 0.01 for the top row and
to 0.03 for the bottom row. The three panels from left to right in A and B show three increasing
levels of correlation, ρ, between Gaussian activation blob amplitudes. Range bars on the first
(V=0.1) and last (V=1.6) data points reflect the 25%–75% interquartile distribution range across
100 simulation estimates. .............................................................................................................. 72
Figure 5.5. Asymptotic relationship between global signal-to-noise ratio (gSNR) and
dimensionality that optimizes reproducibility, for linear (A) and quadratic (B) discriminant maps.
Marker size indicates relative signal variance, V, from 0.1 (small) to 1.6 (large). Five colours
encode five different levels of M, and the spatial correlation is encoded by different symbols. .. 74
Figure 5.6. Partial ROC area (corresponding to false positive frequency range of [0…0.1]) as a
function of the relative signal variance, V, calculated for linear discriminant (LD, on the principal
component subspace, with subspace size selected by various methods), and for univariate general
linear model (GLM). M is set to 0.01 for the top row (A) and to 0.03 for the bottom row (B). The
three panels from left to right in A and B show three levels of correlation, ρ, between Gaussian
activation blob amplitudes. Error bars show standard deviation across 16 active loci (centers of
Gaussian activation blobs). ........................................................................................................... 76
Figure 5.7. Asymptotic relationship between global signal-to-noise ratio (gSNR) and optimal
dimensionality in real fMRI data: stroke study (A) and aging study (B). Each marker indicates a
subject. .......................................................................................................................................... 76
Figure 5.8. Optimal dimensionality and global SNR in a group study. Each marker indicates an
age group; each solid line indicates a task that is contrasted with fixation.. ................................ 76
Figure 6.1. Intrinsic dimensionality in the self-control study, for the subjects in the high-delaying
and the low-delaying groups. Dimensionality is estimated by optimization of LD classification
accuracy. Error bars represent standard errors across the subject group. ..................................... 85
Figure 6.2. Scatter plots for four combinations of fMRI-based measures and behavioural
measures: QD dimensionality versus final peg test performance (A); generalization-error
dimensionality versus final pinch test performance (B); sphericity versus pinch test improvement
(C); spectral distance versus pinch test improvement (D). Each subject is represented with a
specific symbol.. ........................................................................................................................... 85
Figure 6.3. Scatter plot for the first vs. the second principal component produced by PLS analysis
of the correlation matrix given in Table 6.1. Squares and circles denote fMRI-based and
behavioural measures, respectively... ........................................................................................... 91
Figure 7.1. Performance of the pool of six classifiers on simulated data sets. Top, middle and
bottom row show three metrics of performance. The three columns correspond to three levels of
mean signal magnitude M, and the three sub-columns to three levels of spatial correlation ρ... .. 97
Figure 7.2. Performance of the pool of six classifiers on simulated data sets, measured by partial
area under the ROC curve. The three columns correspond to three levels of mean signal
magnitude M, and the three sub-columns to three levels of spatial correlation ρ.... ................... 100
Figure 7.3. Partial area under ROC curve measured for maps that are produced by GLM, ICA,
PLS and PCA... ........................................................................................................................... 103
Figure 8.1. Performance of the pool of classifiers on the stroke recovery dataset for three
contrasts (healthy/impaired, early/late, and finger/wrist). The top figure shows the accuracy of
classification, and the bottom figure shows the reproducibility of spatial maps for six algorithms
of classification... ........................................................................................................................ 108
Figure 8.2. Ranking of six classifiers for two performance metrics: classification accuracy (top)
and map reproducibility (bottom). Ranks of 1 and 6 correspond to the best and the worst
performer, respectively. If the ranking of classifiers is not significantly different, they are linked
with a thick horizontal bar. Significance is established with Friedman and post-hoc
nonparametric testing, as described in the text. For the contrasts with significant difference in
ranking, critical distances (CD) are also specified ...................................................................... 111
Figure 8.3. Performance of a larger group of classifiers, used in a study by Schmah et al. (2010)
to classify the data in the stroke recovery study... ...................................................................... 113
Figure 8.4. Performance of the pool of classifiers on the dataset from the aging study. Left and
right columns correspond to subjects in the young and the older age groups, respectively... .... 114
Figure 8.5. Ranking of classifiers in the young age group. Classifiers linked with a horizontal bar
are not significantly different in their ranking... ......................................................................... 115
Figure 8.6. Ranking of classifiers in the older age group... ........................................................ 115
Figure 8.7. Across-subject reproducibility of within-subject spatial maps created by different
classifiers. The left and right panels correspond to the young and the older groups of subjects,
respectively... .............................................................................................................................. 119
Figure 8.8. Jaccard overlap of within-subject spatial maps across subjects. The left and right
panels correspond to the young and the older groups of subjects, respectively ......................... 120
Figure 8.9. Correlation of average spatial maps across classifiers. Individual subject maps created
by each of the 6 classifiers have been averaged across all subjects from our study... ................ 121
Figure 8.10. Jaccard overlap of average spatial maps across classifiers, for 2 strong contrasts
(RT/FIX and DM/FIX) .............................................................................................................. 122
Figure 8.11. DISTATIS plots of similarity of within-subject maps created with different
classifiers... .................................................................................................................................. 124
Figure 8.12. Accuracy of group-level classification of individual maps that have been created by
six different classifiers. Within-subject maps have been classified according to the age group of
the subject ("young" versus "older") ........................................................................................... 126
Figure 8.13. Cortical areas affected by aging, as revealed by the RT/FIX contrast. The top row
shows the group-difference map, thresholded at p<0.05. The middle and bottom rows show the
unthresholded group-average maps for the young and the older group, respectively ................ 130
Figure 8.14. Cortical areas affected by aging, as revealed by the DM/FIX contrast. The top row
shows the group- difference map, thresholded at p<0.05. The middle and bottom rows show the
unthresholded group-average maps for the young and the older group, respectively ................ 131
Chapter 1 Introduction

1.1 The nature of signal and noise in fMRI
In the past two decades, functional Magnetic Resonance Imaging (fMRI) has become one of the
most popular methods of studying brain activity (Bandettini, 2007). Broadly speaking, it is based
on measuring magnetic properties of the blood in the brain that are associated with cortical
activity (strictly speaking, these magnetic properties are measured at the level of the nuclei of
hydrogen atoms that are part of the water molecules of the blood). This technique is made
possible by a series of advances in neuroscience, physics and medical engineering.
1.1.1 Magnetic Resonance Imaging and the BOLD signal
Magnetic Resonance (MR) imaging is well established in medicine. It is commonly used for
diagnostics, as a safer (although more expensive) alternative to X-rays. A comprehensive
treatment of the principles of fMRI can be found in, e.g., Buxton (2009); here we give a
simplified picture. When a strong constant magnetic field is applied to the body, a portion of the
hydrogen nuclei that constitute the water molecules in the tissue align along the direction of the
magnetic field. This creates a net magnetization of the tissue; the amount of magnetization
depends on the magnetic susceptibility of the tissue. Deoxygenated hemoglobin is paramagnetic,
while oxygenated hemoglobin is not; deoxygenated blood therefore perturbs the local magnetic field more strongly.
While the constant magnetic field is on, a second magnetic field is briefly turned on; this second
field is of much smaller magnitude and of perpendicular direction to the first field. If it oscillates
with a specific frequency (which depends on the magnitude of the constant field and on the type
of the tissue nuclei, which is, in our case, hydrogen), the phenomenon of magnetic resonance
occurs: the protons constituting the hydrogen nuclei enter a higher-energy state. When the
oscillating field is turned off, these protons return to their previous energy state, emitting a radio
wave in the process. This wave forms the basis for the MR signal received by the MR scanner.
Because of the difference in magnetic susceptibility, this signal is higher in oxygenated blood
than it is in deoxygenated blood. In 1936, Pauling and Coryell studied magnetic properties of
hemoglobin molecules, and discovered that oxygenated and deoxygenated hemoglobin had
different magnetic susceptibility. Thulborn et al. (1982) applied this idea to MR imaging and
found that the transverse relaxation time of water molecules in blood depended on the level of
oxygenation. The relationship between blood oxygenation and neuronal activity was studied with
Positron Emission Tomography (PET). Synaptic activity requires oxygen and glucose, which are
supplied via blood circulation. Using PET imaging, Fox and colleagues (1988) observed a large
increase in blood flow and metabolic rate of glucose after tactile stimulation, but the measured
increase in metabolic consumption of oxygen was much lower in comparison. The change in
oxygenated blood flow driven by neuronal activity can be measured by an MR scanner; unlike PET,
this measurement is not an invasive procedure.
Application of MR technology to study brain activity was suggested in a series of papers by
Ogawa and colleagues. In a study conducted in 1990 (see Ogawa, Lee, Nayak, & Glynn, 1990),
they acquired high-resolution images of rat brains at high magnetic fields (7 and 8.5 Tesla). They
saw dark lines that corresponded to anatomical divisions of the cortex. The prominence of these
lines depended on the level of oxygenation and on dilation of blood vessels, and the authors
proposed MRI as an alternative to PET in studying brain function. Indeed, the follow-up studies
(see Kwong et al., 1992, and Ogawa et al., 1992) showed that changes in MR images were
related to neuronal activity.
The mechanism of coupling between neurons and blood vessels is a subject of intense
investigation. The current understanding can be found in the review by Attwell and colleagues
(Attwell et al., 2010). When glutamate, which mediates the most common type of neural
activity, is absorbed by the neurons and by nearby astrocyte cells, a series of signaling chemicals
are released. This chemical signal is received by vascular smooth muscle, which responds by dilating the
blood vessels. This causes a marked increase in oxygenated blood flow to active cortical regions.
However, only a small fraction of oxygen is metabolized. The resulting over-supply of
oxygenated hemoglobin is used in fMRI as a marker of neuronal activity. The term “blood-
oxygenation-level dependent signal” (“BOLD signal”) was proposed by Ogawa et al. (1990) and
used to refer to the MR signal that is sensitive to deoxygenated hemoglobin. Later studies found that
the BOLD signal matches electrophysiological measurements of neuronal activity, such as local
field potential (Logothetis et al., 2001) as well as firing rate (Mukamel et al., 2005).
1.1.2 Hemodynamic response and temporal resolution of fMRI
The temporal profile of a BOLD response to a short neuronal event is called a “hemodynamic
response function” (HRF). The HRF is quite slow relative to the
time course of the underlying neuronal activity. It takes about 5 seconds for the BOLD signal to
reach its peak, and then it slowly decays. Often, an “undershoot” is observed: the BOLD signal,
after reaching the peak, decays below the baseline level, and then (up to 20 seconds after the
neuronal event) it returns to the equilibrium level (Glover, 1999). The profile of the BOLD
response varies across subjects (Aguirre, Zarahn, & D'esposito, 1998). Also, the magnitude of
the BOLD signal is influenced by the vascular density and therefore varies across spatial
locations in the brain (Harrison et al., 2002). In many cases, researchers use a simple model of
the hemodynamic response, which does not account for spatial and inter-subject variations: the HRF
is modeled as a difference of two gamma functions (Glover, 1999; Friston et al., 1998). We have
used this model to create hemodynamic effects in our simulations; this was done by convolving
the simulated signal with the impulse response function given by
h(t) = (t/d_1)^{a_1} exp(-(t - d_1)/b_1) - c (t/d_2)^{a_2} exp(-(t - d_2)/b_2).  (1.1)
Figure 1.1. The temporal profile of a model hemodynamic response function.
The parameters of the function were set according to Worsley (2001): a_1 = 6, a_2 = 12, b_1 = b_2 =
0.9 seconds, c = 0.35, and d_k = a_k b_k. Figure 1.1 illustrates the impulse response function h(t)
graphically.
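As a concrete illustration, the double-gamma impulse response of Equation 1.1 can be computed numerically and convolved with a stimulus time course, as done in our simulations. The sketch below (in Python/NumPy; the code is illustrative and not part of any cited toolbox) uses the parameter values quoted above from Worsley (2001):

```python
import numpy as np

def hrf(t, a1=6.0, a2=12.0, b1=0.9, b2=0.9, c=0.35):
    """Double-gamma hemodynamic impulse response (Glover, 1999; Worsley, 2001)."""
    d1, d2 = a1 * b1, a2 * b2          # times-to-peak of the two gamma terms
    t = np.asarray(t, dtype=float)
    g1 = (t / d1) ** a1 * np.exp(-(t - d1) / b1)
    g2 = (t / d2) ** a2 * np.exp(-(t - d2) / b2)
    return g1 - c * g2

# Sample the HRF over 25 s at 0.1-s steps and convolve with a 5-s "task" block.
t = np.arange(0, 25, 0.1)
h = hrf(t)
stimulus = np.zeros(400)
stimulus[50:100] = 1.0                 # neuronal activity from 5 s to 10 s
bold = np.convolve(stimulus, h)[:400] * 0.1   # scale by the sampling step
```

The resulting `bold` time series shows the delayed rise, the peak at roughly 5 seconds after stimulus onset, and the post-stimulus undershoot described above.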
Because of the slow time course of HRF, it is hard to resolve neuronal responses to a quick
succession of events with fMRI. Discrete neuronal events should be separated by at least 4
seconds in order to be resolved (Zarahn, Aguirre, & D'Esposito, 1997A). However, provided that
the stimuli are presented repeatedly and their timing is randomized, studies have shown
significant differential functional responses between two events (e.g. flashing visual stimuli)
spaced as closely as 500 ms apart (Burock et al., 1998).
1.1.3 Spatial resolution of fMRI
The spatial resolution of the BOLD signal is tied to the density of the spatial structure of the
vascular system (Harrison et al., 2002). Blood is supplied to the active neurons by capillaries,
which form a dense net around neurons. The diameter of a capillary is comparable to the size of a
red blood cell (Huettel et al., 2004). Blood circulates through a system of arteries and veins,
which branch out into arterioles and venules. It has been suggested (Harel et al., 2006) that the
blood flow mechanism is controlled on the level of arterioles on the pial surface of the brain.
Arterioles are thin (10 to 50 micrometers in diameter), and feed the cortical tissue in about 1-mm
thick cylinders around them. In comparison, the thickness of the grey matter in neocortex is
about 3 to 5 mm.
Spatial resolution of an fMRI study is usually chosen by the researcher with several
considerations in mind. The limit of this resolution is usually defined using the notion of a voxel,
a three-dimensional cell which is our sampling unit in space. Researchers in current fMRI studies
usually use voxels of size 1×1×1 to 5×5×5 mm³. The finer the spatial resolution, the more time
needed to scan the cortical surface of interest. The researcher needs to make sure that the choice
of spatial resolution translates to reasonable volume acquisition time, especially if the whole
brain needs to be covered.
Field strength influences the spatial structure of fMRI images. To achieve reliable acquisition at
reasonably high resolution, the majority of modern MRI scanners use 1.5 to 3 Tesla fields.
Above 3 Tesla, we can image the brain at the resolution of voxels of sub-millimeter volume and
approach the spatial resolution of cortical columns (Kriegeskorte & Bandettini, 2007). Such high
fields have been found safe for human subject exposure, and are slowly being accepted in the
neuroimaging community as more 7 Tesla scanners are purchased and installed. However, the
usefulness of high-field fMRI is a somewhat controversial issue, because, at higher field
strength, the MR signal is more severely affected by physiological processes as well as by
physical artifacts (see Kleinschmidt, 2007, and also Huettel et al., 2004, p. 236-241).
1.1.4 Sources of noise in fMRI
The amplitude of BOLD signal changes caused by task-related neural activity is quite small
relative to the intensity of the background MR signal. More importantly, BOLD signal is
corrupted by noise from various sources. A good summary of noise in fMRI can be found in
Huettel et al. (2004). The fMRI literature distinguishes several kinds of noise, characterized by
their origin and spatio-temporal properties. The noise that does not originate from neuronal
activity needs to be removed from the data prior to statistical analysis; this step is called pre-
processing. The most important kinds of BOLD noise are listed below.
Thermal noise (also called Johnson noise) is due to thermal motion of the electrons within the
subject and the scanner hardware. This noise is always present at temperatures above absolute
zero, and has been shown to be proportional to the field strength (Edelstein et al., 1986) and
independent of the magnitude of the MR signal (Kruger & Glover, 2001). This noise is
temporally and spatially independent and has an additive effect.
Scanner noise is caused by imperfections in the imaging hardware, most importantly by the
inhomogeneities of the magnetic field and by gradient coil nonlinearities (Jezzard & Clare,
1999). It is proportional to the field strength (Kruger & Glover, 2001). Special shimming coils
are usually used to correct for the magnetic field inhomogeneities, but it should not be assumed
that shimming removes them completely. The effect of these inhomogeneities is the strongest in
the parts of the brain where different tissues are located next to each other, for example, when
the brain tissue is situated close to bone or air-containing sinuses.
Because of subjects’ head motion, the signal recorded over time from a specific volume of the
magnetic field may contain, at different time points, contributions from different spatial areas of
a subject’s head; this severely degrades the BOLD signal. Head motion can also cause drifts in
fMRI signal. The severity of motion effects is at its worst at the edges of the brain, as well as at
the brain-CSF and brain-air interfaces. To alleviate the impact of head motion, the subject's
head is secured in an MR head holder with padded cushions or vacuum packs.
The effect of head motion is usually corrected during pre-processing of the data, typically by
aligning the fMRI volumes to a reference volume using rigid-body transformation (see Jenkinson
et al., 2002; Woods et al., 1998; Cox & Jesmanowicz, 1999). However, residual motion effects
remain even after this procedure (Lund et al., 2006). An alternative approach is to discard
volumes severely impacted by head motion (Power et al., 2011). Head motion is a significant
contributor to noise, especially in older subjects as well as clinical patients (e.g. patients with
Parkinson’s disease). It should also be noted that the motion of the head could be coupled to the
task (many fMRI experiments require some kind of motor or verbal response from the subject).
Another possible source of head motion is the vibration of the scanner, which may cause drifts in
BOLD signal (Foerster, Tomasi, & Caparelli, 2005).
The contribution of physiological processes to BOLD fluctuations is often referred to as
physiological noise (Huettel et al., 2004). In contrast to thermal and scanner noise, which can be
reduced by decreasing spatial resolution, physiological noise is independent of the size of the
voxel (Bodurka et al., 2007; Kriegeskorte & Bandettini, 2007). The variability due to
physiological noise increases with the amplitude of the BOLD signal and with field strength
(Kruger & Glover, 2001). Cardiac, respiratory and neural activity are the main sources of
physiological noise.
The effect of cardiac activity is especially prominent in the vicinity of major blood vessels
(Dagli, Ingeholm, & Haxby, 1999). Heartbeat induces subtle tissue motion. It also creates
fluctuations in blood volume, which in turn create fluctuations in BOLD signal. In the power
spectrum of fMRI time series, cardiac activity creates peaks at the frequency of the heartbeat
(around 1 Hz) and its harmonics.
The displacement of brain tissue due to respiration is considerably larger than displacement due
to cardiac activity. In addition, changing lung volume creates susceptibility variations in the
magnetic field (Raj, Anderson, & Gore, 2001). Respiration, like cardiac activity, introduces
peaks in the power spectrum at the frequency of respiration (Raj et al., 2001), which lies at the
range 0.1-0.3 Hz (Wise et al., 2004). The spatial effect of respiration is rather more global than
the effect of cardiac activity (Glover, Li, & Ress, 2000), but it is most prominent in large CSF
pools such as ventricles (Perlbarg et al., 2007). A common approach to correction for cardiac and
respiratory artifacts is to externally measure the pulse and the lung volume (pulse-oximeter and
respiratory belt are often used for these purposes; see Glover et al., 2000; Lund et al., 2006).
Another important source of physiological noise is spontaneous neuronal activity. The brain is
always active, and the neuronal activity related to the response to experimental tasks is a fraction
of overall neuronal activity (when analyzing experimental fMRI data, it does not seem possible
to tell whether BOLD signal corresponds to spontaneous or to salient neuronal activity). It has
been suggested that spontaneous brain activity makes the brain more efficient in switching from
one state to another (Deco, Jirsa, & McIntosh, 2011; McIntosh et al., 2010). The power spectrum
of this source of noise obeys the 1/f law, which suggests a significant amount of autocorrelation
(Zarahn, Aguirre, & D'Esposito, 1997B).
Kruger and Glover (2001) have studied the relative contribution of the mentioned sources to
overall BOLD noise. The picture is somewhat different across brain tissues: overall variability
(measured with standard deviation of BOLD signal) is about twice as high in grey matter as it is
in white matter. Variability due to thermal and scanner noise is homogeneous across the brain and
across subjects, and it explains about 10% of variance in the grey matter and about 35% in white
matter. Variability due to spontaneous activity explains 70% of grey matter variance and 45% of
white matter variance; the actual percentage of explained variance varies between subjects, but it
was uniformly found to be the largest source of variance. Variability due to head motion and to
cardiac and respiratory activity explained 10% of grey matter and 8% of white matter BOLD
signal variance; the contribution of these sources of noise was very different in different subjects.
1.2 Experimental design
The traditional approach in designing fMRI experiments is contrasting the neural response in two
(or more) experimental conditions. This approach uses subtraction logic, where two
experimental conditions or brain states are assumed to be different in only one factor (Culham,
2006), and differential BOLD response measures the influence of this factor. The condition
where this factor is present is called the task condition, and the other condition is called a
baseline condition.
Ideally, the two conditions should be different either in their presented stimuli or in behavioural
tasks. For example, when we want to study attention to visual stimuli, it makes sense to compare
passive viewing of the stimuli to the engaged viewing of the same stimuli (when the subject is
instructed to pay attention). When we are interested in response to visual stimuli, we may
contrast passive viewing with looking at a blank screen. If we contrast engaged viewing of
stimuli with looking at a blank screen, the differences in BOLD signal may be attributed to two
different factors, visual perception and attention, and the influence of these two factors cannot be
resolved. Careful selection of a baseline task is therefore critical in task-driven experiments.
A common way to design the presentation of the conditions is to use block design. A block is the
duration of time during which the subject repeatedly performs the task. “Task blocks” alternate
with “baseline blocks”. This design is statistically powerful, because trials are
repeated many times within a block. However, sometimes we are interested in the response to a
single trial. In this case, the researchers use event-related design, where the task is performed
once, followed by some time of inactivity (Buckner et al., 1996).
Subtraction logic naturally leads towards hypothesis-driven research, where the researcher uses
the experimental data to test a certain hypothesis about the neural effect of the studied factor.
The research hypothesis might state that the effect is observed in certain cortical areas. Also, the
researcher might want to quantify the relationship between the stimulus and its effect. The research
hypothesis is contrasted with the null hypothesis, which states that the effect is not observed.
Specifically, the null hypothesis implies that the data sampled in different experimental
conditions comes from the same population.
There is an alternative to hypothesis-driven research, where the researcher is not looking for a
confirmation or rejection of a certain hypothesis, but rather lets the data “speak for themselves”
without formulating a hypothesis prior to analysis. This alternative is called data-driven
research. A popular data-driven approach is to use principal component analysis (Sychra et al.,
1994; Friston et al., 2000) or independent component analysis (Beckmann & Smith, 2004); both
of these methods represent the data as a sum of spatio-temporal components which are ordered
by the amount of data variance they explain. Another data-driven approach is to use clustering,
for example, Cordes et al. (2002) propose to group voxels into clusters according to the
similarity of time courses.
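To make the clustering idea concrete, a rudimentary version of time-course grouping might merge voxels whose time series correlate above a threshold. The sketch below is a toy greedy procedure of our own construction, not the actual algorithm of Cordes et al. (2002):

```python
import numpy as np

def correlation_clusters(ts, threshold=0.7):
    """Greedily group voxels (columns of ts) by time-course correlation."""
    corr = np.corrcoef(ts.T)
    unassigned = list(range(ts.shape[1]))
    clusters = []
    while unassigned:
        seed = unassigned.pop(0)
        members = [seed] + [v for v in unassigned if corr[seed, v] > threshold]
        unassigned = [v for v in unassigned if v not in members]
        clusters.append(members)
    return clusters

# Toy data: voxels 0-2 share a common time course; voxels 3-5 are pure noise.
rng = np.random.default_rng(5)
base = rng.normal(0.0, 1.0, 200)
ts = rng.normal(0.0, 1.0, (200, 6)) * 0.3
ts[:, :3] += base[:, None]
clusters = correlation_clusters(ts)
```

On these data the three voxels sharing the common signal fall into one cluster, while the independent voxels remain singletons.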
1.3 Statistical analysis
After the fMRI data have been collected and pre-processed, statistical analysis helps the
researcher to answer the most important questions: does the data support the research
hypothesis? What is the relationship between reported behavioural states and the observed data?
Where in the brain is the effect of interest observed? The methodology of answering these
questions has been an active field of research since the beginning of the 20th century, when the
pioneering work of Ronald Fisher, Karl Pearson, William Gosset and others developed the
framework for analyzing experimental results. This framework has been adopted by all
experimental natural sciences, including neuroscience.
1.3.1 General Linear Model
In the fMRI community, the first complete framework for statistical analysis was developed by
Worsley and Friston (2005). This framework is univariate; that is, the observed signal is assumed
to be independent and identically distributed across all spatial locations. This framework is based
on multiple regression analysis commonly called the General Linear Model (GLM), which we
will describe in Section 3.4. The common use of GLM is to produce a spatial map, where the
significance of each voxel’s expression of the effect of interest is evaluated with a t test.
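The core of this mass-univariate procedure can be sketched in a few lines: each voxel's time series y is regressed on a design matrix X, and a t statistic is computed for the contrast of interest. The code below is a simplified illustration (ordinary least squares only; real GLM packages also model serial correlation in the noise):

```python
import numpy as np

def glm_tstat(y, X, contrast):
    """Ordinary least-squares GLM for one voxel; t statistic for a contrast."""
    beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = X.shape[0] - np.linalg.matrix_rank(X)
    sigma2 = resid @ resid / dof                      # noise variance estimate
    c = np.asarray(contrast, dtype=float)
    var_c = sigma2 * c @ np.linalg.pinv(X.T @ X) @ c  # variance of the contrast
    return (c @ beta) / np.sqrt(var_c)

# Toy example: a boxcar task regressor plus an intercept, one "active" voxel.
rng = np.random.default_rng(0)
task = np.tile([1.0] * 10 + [0.0] * 10, 5)            # 100 time points
X = np.column_stack([task, np.ones_like(task)])
y = 2.0 * task + rng.normal(0.0, 1.0, task.size)
t_active = glm_tstat(y, X, [1, 0])
```

Applying `glm_tstat` voxel by voxel yields exactly the kind of spatial t map described above.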
The univariate approach to fMRI data analysis assumes independence of the BOLD signal across
voxels, and, therefore, ignores all the interactions between the voxels. The advantage of this
approach is the simplicity of the model, with a relatively small number of degrees of freedom.
However, the brain is a connected system of neurons, organized into several functional networks
(Toro et al., 2008); therefore, the univariate model of the brain is hardly realistic. This suggests
an expansion of the GLM analysis where the interactions between brain areas are taken into
account. This type of analysis is multivariate: the fMRI volume is conceptualized as a vector in
multi-dimensional data space, where each voxel corresponds to a dimension. A subtype of
multivariate analysis is multivariate classification, where the volumes are classified into groups
based on the cognitive state of the subject at the time of the volume's acquisition. Multivariate
classification is discussed below; some other approaches to multivariate analysis are briefly
discussed in Section 7.2.4.
1.3.2 Multivariate classification
An alternative to univariate GLM is to use classification algorithms for decoding the brain state.
An important example of this approach is classification of fMRI volumes according to the type
of the behavioural task that the subject was performing when the volume was acquired. If the
subjects come from different groups, we can also try to classify the subject’s fMRI data
according to the group to which the subject belongs. There are many algorithms developed for
pattern classification (see, for example, Bishop, 2006), and several have been successfully
applied to fMRI data (see e.g. Schmah et al., 2010, Misaki et al., 2010; Ku et al., 2008; Mitchell
et al., 2004). Many of these algorithms are multivariate1, that is, they assume that the brain state
information is encoded in the interactions between the voxels.
The advantages of multivariate classification over univariate analysis are described in the review
article by Haynes & Rees (2006). Multivariate analysis is a natural choice when the brain state is
encoded by the neuronal activity of a group of brain areas, rather than a single area (this is called
distributed representation; see Haxby et al., 2001). In any isolated voxel, the difference in
BOLD signal across the brain states can be too small and/or inconsistent for successful decoding
of a brain state. However, the brain state can be decoded when the BOLD signal measured for an
ensemble of voxels is analyzed together. For example, in a multivariate classification study by
Kamitani and Tong (2005), the subject was presented with simple visual stimuli: bars rotated by
a specific angle. In the visual cortex, this angle is encoded in orientation-selective cortical
columns, which are much smaller in size than a typical fMRI voxel (such as 3×3×3 mm voxels
used in the study). Individual voxels in that study showed poor selectivity to stimulus
orientation; however, the output of a linear combination of visual-cortex voxels contained
enough information for successful decoding of the stimulus orientation. This study, along with
several others (e.g., Haxby et al., 2001; Mitchell et al., 2004;
see also reviews of Norman et al., 2006, and Haynes & Rees, 2006) has popularized multivariate
classification as a method of fMRI data analysis. However, it should be noted that multivariate
methods had been applied to fMRI and PET data analysis well before the mid-2000s
"multivariate" boom (for example, see Lautrup et al., 1994; Friston et al., 1995; Morch et
al., 1997; Strother et al., 1997).

1 An important exception is Gaussian Naïve Bayes (GNB) classifiers, described below.
Algorithms of classification can be grouped into two categories: probabilistic and non-
probabilistic. Classifiers from the first group assume a probabilistic model for the data. Model
parameters are estimated during the training of the classifier. For example, the probabilistic
classifiers discussed in this thesis use a multivariate Gaussian model for each class ("brain state")
of the data. Classification of a volume is done by assigning the volume to the most probable
class. On the other hand, non-probabilistic classifiers do not use a probabilistic model. A simple
example of a non-probabilistic classifier is nearest-neighbour classification (see e.g. Schmah et
al., 2010): a volume is assigned to the same class as its nearest neighbour. Another important
example is Support Vector Machine (SVM) method (see e.g. LaConte et al., 2005; Mourao-
Miranda et al., 2005; Cox & Savoy, 2003; this method was also used by Kamitani and Tong in
the study described above). SVM methodology constructs the surface that separates the classes
by maximizing the margin between the two classes while simultaneously minimizing the
misclassification rate (see Section 3.6 for details); this process does not involve estimation of the
parameters of the probabilistic model. In general, non-probabilistic approaches view construction
of probabilistic models as a more general problem that can be bypassed when solving a more
specific problem of classification (Ng & Jordan, 2002). However, the "extra step" of
probabilistic model estimation can be useful in the context of a neurobiological study. For
example, the estimated Gaussian model captures a substantial share of the information about the
connectivity between brain areas (Hlinka et al., 2011).
In Chapter 3, we describe several methods of probabilistic classification, all of which use a
multivariate Gaussian distribution to model the fMRI data. Each class is sampled from a Gaussian
distribution defined by a mean vector and a covariance matrix. The most restrictive model,
Gaussian Naïve Bayes classifier, assumes that the covariance matrices are diagonal; in essence,
this is a univariate model of classification. Linear and quadratic discriminants are methods that
do not assume diagonality of covariance matrices. Of these two, linear discriminant (LD) is the
more constrained method because it assumes that the population covariance matrix is the same
for all classes; quadratic discriminant (QD) makes no such assumption. Both linear and quadratic
discriminants can classify the data just as accurately as SVMs, if the probabilistic model is
carefully regularized to prevent overfitting. To our knowledge, quadratic discriminant has not
been used previously to classify fMRI data. We demonstrate that in some situations it is the most
accurate among the classifiers we have tested (see Sections 7.2.1 and 8.2).
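The decision rules of these Gaussian classifiers are straightforward to sketch: each class is modeled as a Gaussian, and a volume is assigned to the class with the highest log-likelihood. The code below is illustrative only; LD pools one covariance matrix across classes, QD fits one per class, and in practice the covariance estimates would be PCA-regularized as discussed later in the thesis:

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gaussians(X_by_class, pooled=False):
    """Fit a mean (and covariance) per class; pooled=True gives the linear discriminant."""
    means = [X.mean(axis=0) for X in X_by_class]
    if pooled:
        centered = np.vstack([X - m for X, m in zip(X_by_class, means)])
        covs = [np.cov(centered.T)] * len(X_by_class)   # one shared covariance
    else:
        covs = [np.cov(X.T) for X in X_by_class]        # quadratic discriminant
    return means, covs

def classify(x, means, covs):
    """Assign x to the class with the highest Gaussian log-likelihood."""
    scores = [multivariate_normal.logpdf(x, m, S) for m, S in zip(means, covs)]
    return int(np.argmax(scores))

# Toy example: two well-separated 2-D classes, linear-discriminant rule.
rng = np.random.default_rng(1)
X0 = rng.normal(0.0, 1.0, (200, 2))
X1 = rng.normal(3.0, 1.0, (200, 2))
means, covs = fit_gaussians([X0, X1], pooled=True)
label = classify([2.5, 2.8], means, covs)
```

With pooled covariances the decision boundary is linear in x; allowing class-specific covariances (QD) makes it quadratic.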
If we are able to classify the out-of-sample fMRI volumes accurately, we can conclude that there
is a difference in the brain’s response to different behavioural tasks. The next question is how to
identify the spatial locations where this difference is manifest. Kjems et al. (2002) have proposed
a method of constructing spatial maps for a given classification algorithm, where each voxel is
weighted according to its contribution to classification. They have also derived the general
equation of spatial maps for probabilistic canonical variate analysis, which is an extension of
linear discriminant to three or more classes. Rasmussen and colleagues (2011) have further
developed it for SVMs, as well as for kernel logistic regression and kernel linear discriminant.
We propose a modification to Kjems’s method (discussed at the end of Section 3.1) and show
how to construct spatial maps for quadratic discriminant and for the univariate classifier known as
Gaussian Naive Bayes (GNB).
1.4 Selecting the method of analysis
The current thesis examines the question of selecting the method of classification for fMRI data.
We use a framework proposed by Strother and colleagues (2002; 2010). Previously, this
framework was applied to evaluate pre-processing techniques (Strother et al., 2004; LaConte et
al., 2005; Churchill et al., 2012A; Churchill et al., 2012B). The framework tests the algorithm’s
ability to (a) accurately decode the task, and to (b) construct a reproducible spatial map.
Comparative evaluation of classifiers has been carried out on real fMRI data in a series of studies
(e.g., Misaki et al., 2010; Ku et al., 2008; Mitchell et al., 2004; see also our paper, Schmah et al.,
2010). However, these studies have evaluated only the accuracy of out-of-sample classification,
without assessing the quality of spatial maps. Reproducibility of spatial maps is a metric that is
complementary to classification accuracy (LaConte et al., 2003); both of these metrics should be
taken into account when evaluating a classifier. An algorithm that accurately predicts the brain
state but is unable to create a reproducible spatial map is of limited use to the researcher: it
indicates that the brain states are indeed different, but is not able to say which areas of the brain
are reliably implicated in this difference. Also, spatial maps are useful for identifying task-
coupled artifacts. Consider the case when a subject is performing a motor task and a baseline
(rest) task. The subject’s motion could be stronger during the motor task than during the
baseline, and (if the motion is not regressed out carefully) a classifier can capitalize on this
difference to accurately predict which task the subject was performing. Inspection of spatial
maps in this case will reveal that the edges of the brain (i.e. the voxels where the motion is the
strongest) are the biggest contributors to this classification. Here, the classification is accurate
not because of the difference in neuronal response to the task, but rather in some interacting
external factor.
1.5 Data sets for evaluation of algorithms
We have evaluated a pool of classifiers on several fMRI data sets, simulated as well as real.
Simulated environments have several attractive features: it is possible to create a large number of
artificial “subjects”, and we can carefully explore the influence of different aspects of simulated
fMRI signal on the classifiers’ performance. The key advantage is the knowledge of “ground
truth” (that is, we always know the location of the active signal in a given volume). This
knowledge can be utilized in order to perform ROC (receiver operating characteristic) analysis.
Partial area under the ROC curve serves as the third metric of performance, together with
classification accuracy and reproducibility of spatial maps.
We have used a simulation framework proposed by Lukic et al. (2002) and further developed in
our paper (Yourganov et al., 2011). The data are generated to model a block-design fMRI study
with two conditions, “active” and “baseline”. The task-related signal is absent from the
“baseline” volumes; in the “active” volumes, it is distributed across a spatial network of “active
areas”. We model a wide range of situations by changing three parameters: the mean magnitude
of task-related signal, its temporal variance, and the correlation across the active areas. This
allows us to identify the situations when certain algorithms are better performers, and situations
when all classifiers perform equally well.
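A minimal version of such a two-condition simulation might look like the sketch below. This is our own simplified illustration, inspired by but not reproducing the framework of Lukic et al. (2002); the parameter names are ours:

```python
import numpy as np

def simulate_volumes(n_per_class=50, n_voxels=100, active_idx=(3, 17, 42),
                     mean_signal=1.0, signal_sd=0.3, signal_corr=0.8, seed=0):
    """Generate 'baseline' and 'active' volumes; active voxels share a correlated signal."""
    rng = np.random.default_rng(seed)
    baseline = rng.normal(0.0, 1.0, (n_per_class, n_voxels))
    active = rng.normal(0.0, 1.0, (n_per_class, n_voxels))
    # Correlated task signal across the active areas: common + private components.
    common = rng.normal(0.0, 1.0, (n_per_class, 1))
    private = rng.normal(0.0, 1.0, (n_per_class, len(active_idx)))
    signal = mean_signal + signal_sd * (np.sqrt(signal_corr) * common
                                        + np.sqrt(1.0 - signal_corr) * private)
    active[:, list(active_idx)] += signal
    return baseline, active
```

The three parameters discussed above map directly onto `mean_signal` (mean magnitude), `signal_sd` (temporal variance), and `signal_corr` (correlation across active areas), so sweeping them reproduces the range of simulated conditions in spirit.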
The real data for the evaluation come from two studies: a longitudinal study of recovery from a
motor stroke, and a study of cognitive aspects of aging. The first study was conducted on
nine patients recovering from stroke, who were scanned in four separate fMRI sessions
spanning half a year after the stroke (Small et al., 2002). At each session, the subjects’ BOLD
activity was measured during performance of simple motor tasks; in addition, a series of
behavioural tests was performed outside of the scanner. Pooling across the four sessions
gives us an opportunity to evaluate the classifiers on a large amount of data within a subject. In
addition to this evaluation, we can also pose a question: does the recovery from a stroke,
measured with behavioural tests, correspond to a change in cortical activity, measured by fMRI?
The second study involved subjects from two age groups: young and old (Grady et al., 2010).
Each subject performed a series of tasks of varying difficulty (the difficulty has been matched
across the age groups so the behavioural accuracies of the young and the old subjects are
approximately equal). The pool of classifiers was applied to classify the fMRI volumes within a
subject to each of the tasks. After that, we performed a group-level analysis, where we classified
the within-subject spatial maps into two age groups. This classification helped identify the task-
related brain areas recruited differently by the young and the old subjects.
1.6 Structure of the thesis
The initial goal of the work presented in the thesis was to evaluate linear and quadratic
discriminants on simulated as well as real fMRI data sets, and to compare them with univariate
methods and SVM. LD and QD use multivariate Gaussian distributions to model the data; this
distribution is perhaps the best-studied multivariate distribution in statistics. In addition, Hlinka
et al. (2011) have shown that the multivariate Gaussian model provides a reasonably good
approximation of fMRI data: it captures about 95% of mutual information of the connectivity
between brain regions. The theoretical properties of linear discriminant have also been well-
researched; see, for example, books by Mardia et al. (1979) and by Seber (2004). Quadratic
discriminant is somewhat more obscure; to our knowledge, we are the first to apply it to
classification of fMRI data (Schmah et al., 2010). These methods are perhaps better suited for
fMRI analysis than univariate methods, because the brain areas are known to be highly
correlated (Toro et al., 2008, among others), and, presumably, better described by a multivariate
model. The MATLAB code for our implementations of linear and quadratic discriminant, as well
as for the evaluation framework, will be available from Dr. Strother's lab website.
The structure of this thesis is as follows. Chapter 2 gives a detailed description of the framework
of evaluation: the metrics of performance and the split-half resampling framework. Also, it
describes the data sets used for evaluation (simulated and real). Chapter 3 describes the
methodology of probabilistic classification of fMRI data. The three algorithms based on
Gaussian distribution are described: multivariate linear and quadratic discriminants, and
univariate GNB. Then we give a brief description of regularization of covariance matrices (a
necessary step in both linear and quadratic discriminants). We focus on one approach to
regularization, where the covariance matrix is approximated by a subset of its principal
components.
Chapter 4 is dedicated to the problem of estimating the number of components that give an
efficient approximation to the covariance matrix. This number is referred to as intrinsic
dimensionality of the data: the number of dimensions required to capture the signal of interest.
The problem of its estimation is rather difficult, and we give a survey of methods developed to
address this problem. In Chapter 5, these methods are tested on the simulated data (where the
structure of an underlying active spatial network is known). We conclude that intrinsic
dimensionality is best estimated with cross-validation methods, rather than with information
theory. We also show how the estimated dimensionality in real data is related to signal-to-noise
ratio. Chapter 6 presents a further description of intrinsic dimensionality in two real data sets. In
the first data set, we demonstrate that intrinsic dimensionality is linked to self-control ability in
healthy individuals. In the second set, we show how dimensionality (as well as other measures of
complexity of fMRI signal) reflects the process of cortical re-organization that accompanies
stroke recovery.
Chapters 7 and 8 contain an overview of the evaluation of a pool of classifiers. First, they are tested
on simulated data (Chapter 7): we describe how the performance of classifiers is influenced by
changes in magnitude, variance and connectivity of the task-related signal. Then, we test the
algorithms on two real fMRI data sets (Chapter 8): the stroke recovery set, and the set from an
aging study. We describe the spatial maps created by the classifiers: their within-subject and
across-subject reproducibility, and the correlation of spatial maps across different classifier
algorithms. We also classify the individual maps according to the age group of the participant, and
show how to use this classification to identify cortical areas that are affected by age. Finally,
Chapter 9 is the conclusion of this thesis with discussion of several future research directions.
Chapter 2 Evaluating algorithms for fMRI data analysis
2.1 Metrics of performance
The performance of an algorithm can be measured in multiple ways. We have selected three
performance metrics. The first one, area under a Receiver Operating Characteristics (ROC)
curve, has been widely used in the medical and machine learning literature as a measure of
accuracy and susceptibility to errors. This metric uses error rates that are measured using
knowledge of the "ground truth". We apply this metric only to simulated data. The other two
metrics, predictive accuracy and reproducibility of spatial maps, were proposed by Strother and
colleagues (2002; see also Kjems et al., 2002) as they can be obtained from the same resampling
procedure.
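Schematically, split-half resampling yields both metrics from the same split: train a classifier on each half, predict the held-out half to obtain accuracy, and correlate the two halves' spatial maps to obtain reproducibility. The sketch below uses hypothetical helper names (`fit_and_map`, `predict`) and a deliberately trivial mean-difference classifier; any classifier that produces a spatial map can be substituted:

```python
import numpy as np

def split_half_metrics(X, y, fit_and_map, predict, seed=0):
    """One split-half step: (out-of-sample accuracy, spatial-map reproducibility)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    h1, h2 = idx[: len(y) // 2], idx[len(y) // 2 :]
    model1, map1 = fit_and_map(X[h1], y[h1])
    model2, map2 = fit_and_map(X[h2], y[h2])
    acc = 0.5 * (np.mean(predict(model1, X[h2]) == y[h2])
                 + np.mean(predict(model2, X[h1]) == y[h1]))
    return acc, np.corrcoef(map1, map2)[0, 1]

# Demo "classifier": class means as the model, their difference as the spatial map.
def fit_and_map(Xtr, ytr):
    m0, m1 = Xtr[ytr == 0].mean(axis=0), Xtr[ytr == 1].mean(axis=0)
    return (m0, m1), m1 - m0

def predict(model, Xte):
    m0, m1 = model
    return (((Xte - m1) ** 2).sum(axis=1) < ((Xte - m0) ** 2).sum(axis=1)).astype(int)

rng = np.random.default_rng(3)
X = rng.normal(0.0, 1.0, (200, 20))
X[100:, :10] += 2.0                      # "active" signal in the first 10 voxels
y = np.array([0] * 100 + [1] * 100)
acc, rep = split_half_metrics(X, y, fit_and_map, predict, seed=4)
```

In practice the split is repeated many times and the metrics are averaged, but one split already shows how accuracy and reproducibility arise from the same resampling step.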
2.1.1 Receiver Operating Characteristics (ROC) methodology
The ROC methodology is used to measure the quality of signal detection. It was first applied to
evaluate detection of objects with radar, and has been used in psychophysics as a way to describe
how well humans can detect stimuli (see Metz, Herman, & Shen, 1998). It has become a standard
tool in medical diagnostics and imaging, after the work of Lusted (1968) and Swets (1988). This
methodology, generally speaking, summarizes how well a detection algorithm detects a signal
when it is present in the data, and reports its absence when it is not. To make this evaluation, we need to know the "ground truth", i.e.
whether the signal is really present in the data or not. It was developed for binary detectors
(which detect the presence or absence of a signal), but it can be extended to a detector that rates
the magnitude of the signal using a discrete scale (Metz, 1986). Here we will consider the binary
case only.
A binary detector analyzes the input data and decides whether the signal is absent (such a
decision is called a "negative") or present (a "positive"). We can utilize the knowledge of
"ground truth" and say whether the detector was right or wrong. Therefore, a "positive" is a "true
positive" if the detector was right and the signal was in the data, or a "false positive" otherwise.
Analogously, we can talk about a "true negative" and a "false negative". In the case of detecting
task-related activation in fMRI data, the detecting algorithm produces a spatial map where each
voxel value indicates the magnitude of task-related effect in the corresponding spatial location.
To decide whether a voxel is active or inactive, we see whether the voxel value passes a
predefined threshold or not. There are two kinds of errors a detection algorithm can make given a
specific threshold. Type I error happens when the voxel known to be inactive surpasses the
threshold of activation and is therefore a false positive. Type II error is, conversely, an
occurrence of a false negative. ROC methodology evaluates the detector in terms of the
frequency of Type I and Type II errors. It computes the "false positive frequency" (FPF) and the
"true positive frequency" (TPF) for all possible thresholds. The plot of TPF versus FPF is known
as the ROC curve.
To compute this frequency, we need two types of data sets, which are labeled H0 and H1. Sets of
H0 type ("null" sets) contain data where the effect of interest is not present, and sets of H1 type
("alternative" sets) contain data where the effect is present. If the voxel from H1 that is known to
be active passes the threshold, it is a true positive; if it is below the threshold, it is a false
negative. All voxels in H0 that pass the threshold are false positives. We need many example sets
of both H1 and H0 type for robust estimation of the error frequencies.
A ROC curve is constructed for a single voxel that is known to be active in H1. We move the
threshold from most conservative (when no voxels pass the threshold) to most liberal (when all
voxels pass it). For each threshold, we look at this voxel in all the H1 sets, and count the number
of occurrences when the voxel value passes the threshold. This gives us the number of true
positives. Then we count the number of H0 sets where this voxel has a value that passes the
threshold; this gives us the number of false positives. Dividing by the total number of H1 and H0
sets gives us estimates of true positive frequency and false positive frequency, respectively.
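For a single voxel, this counting procedure can be sketched as follows (an illustrative Python fragment, assuming the voxel's values across the H1 and H0 sets have been collected into arrays; this is not the evaluation code used in this thesis):

```python
import numpy as np

def empirical_roc(h1_values, h0_values):
    """Empirical (FPF, TPF) pairs for one voxel, given its values
    across the H1 ("alternative") and H0 ("null") sets."""
    # Sweep the threshold from most conservative to most liberal.
    thresholds = np.sort(np.concatenate([h1_values, h0_values]))[::-1]
    tpf = np.array([(h1_values > t).mean() for t in thresholds])
    fpf = np.array([(h0_values > t).mean() for t in thresholds])
    return fpf, tpf
```

Plotting `tpf` against `fpf` traces the empirical ROC curve from (0, 0) to (1, 1).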
We can also define TPF and FPF in continuous, rather than discrete, terms; this would
correspond to an ideal situation with infinitely many examples of H0 and H1 sets. For the
threshold t, we can define FPF and TPF in terms of continuous probability distributions of the
voxel value x:
FPF(t) = ∫_t^∞ p0(x) dx
TPF(t) = ∫_t^∞ p1(x) dx    (2.1)
Here, p0(x) and p1(x) are probability distributions of voxel values in the H0 and H1 sets,
respectively. We use the LABROC software (Metz et al., 1998) to generate smooth ROC curves
from a set of discrete (FPF, TPF) pairs.
Figure 2.1. Examples of different distributions of H1 (solid green) and H0 (dashed red),
and the corresponding ROC curves. A: perfect detector; the ROC curve is a step function.
B: chance performance; the distributions overlap completely and the ROC curve is the identity line. C: the
usual situation, distributions overlap somewhat and the area under the ROC curve is
between 0.5 and 1. D: degenerate case, the H1 distribution is completely contained inside H0
distribution; the ROC curve shows the characteristic "hook" at the bottom.
The shape of the ROC curve depends on the amount of overlap between p0(x) and p1(x). Figure
2.1 shows some characteristic examples of ROC curves. In the best case, the detector is so good
that the distributions p0(x) and p1(x) do not overlap. If we set our threshold above all the values
of p0(x) but below all the values of p1(x), FPF will be zero and TPF will be one. If the threshold
is higher than that, TPF is less than one but FPF is still zero. For a smaller threshold value, FPF
is greater than zero but TPF is one. The corresponding ROC curve is a step function, where TPF
instantaneously rises from 0 to 1 at FPF = 0. If the separation between p0(x) and p1(x) is not ideal
and there is some overlap, the ROC curve rises smoothly rather than instantaneously, because
there exist thresholds for which both FPF and TPF are less than 1. In the worst case, p0(x) and
p1(x) completely overlap, so for any possible threshold TPF and FPF are equal and the ROC
curve is an identity line.
There is one more case that we should consider: when p0(x) is wider than p1(x) so p1(x) is
completely contained within p0(x). In this case there is no threshold for which TPF>FPF. ROC
curves in this case have a characteristic "hook" at the bottom, which corresponds to thresholds
that give FPF>TPF. Pan and Metz (1997) call such curves "degenerate" because they indicate
performance which is, for a certain range of thresholds, worse than chance. This kind of curve
indicates that the signal and noise have very different distributions. Examples of such curves
were reported in the fMRI literature; see Figure 2 in Constable et al. (1995), and Figures 4 and 5
in Lange et al. (1999).
To compare the performance of several detectors, we can inspect their ROC curves. Several
papers (for example, Constable et al., 1995; Lange et al., 1999; Skudlarski et al., 1999; Lukic et
al., 2002; Beckmann & Smith, 2004) have used this method to evaluate the performance of
algorithms for fMRI data analysis. As a quantitative metric of performance, we can use the area
under the curve. This area corresponds to the probability that the detector will assign a higher
value to a voxel randomly chosen from H1 than to a voxel that is randomly chosen from H0. It is
proportional to a Mann-Whitney U statistic, which is used in a non-parametric test (Conover,
1999) to determine whether two samples (in our case, the voxel values from H1 and H0) come
from the same distribution (Mason & Graham, 2002). We can also use the partial, rather than the
full, area under a ROC curve, for example the area for FPF between 0 and 0.1; this is equivalent
to setting the critical significance level α to 0.1 (Skudlarski et al., 1999). We use this metric to
evaluate signal detection of a series of classifiers (see Chapter 7). Also, we use it to evaluate
different methods of intrinsic dimensionality estimation for discriminant analysis (see Chapter
5).
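The equivalence between the area under the curve and this pairwise probability can be illustrated directly (a Python sketch; the function name is ours, and ties are counted as 1/2, as in the Mann-Whitney U statistic):

```python
import numpy as np

def auc_pairwise(h1_values, h0_values):
    """Area under the ROC curve, computed as the probability that a
    randomly chosen H1 value exceeds a randomly chosen H0 value.
    Equal to the Mann-Whitney U statistic divided by n1 * n0."""
    h1 = np.asarray(h1_values, dtype=float)[:, None]
    h0 = np.asarray(h0_values, dtype=float)[None, :]
    # Ties contribute 1/2, as in the U statistic.
    return (h1 > h0).mean() + 0.5 * (h1 == h0).mean()
```

A perfect detector (no overlap between H1 and H0 values) gives an area of 1; complete overlap gives 0.5.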
The knowledge of "ground truth" is essential to ROC methodology. Therefore, simulated data
are commonly used to construct ROC curves. There are two common approaches. The first is to
generate H0 and H1 sets using a Gaussian distribution (Lukic et al., 2002; Beckmann & Smith,
2004; Yourganov et al., 2011). The second approach uses real resting-state fMRI data for H0
sets, and adds artificial activation signal at specific locations in resting-state data to generate H1
sets (Constable et al., 1995; Lange et al., 1999; Skudlarski et al., 1999; Beckmann & Smith,
2004). With the first approach, we can easily generate a large number of sets from both H0 and
H1, and test our detectors in a variety of simple, easily-controlled situations, which, however,
might be a poor approximation to real fMRI data. The second approach uses realistic H0 data, but
the H1 sets are a mixture of very complex real fMRI "noise" and typically simplistic artificial
"signal" so again the extent to which the results reflect real data performance is largely unknown.
Also, it has been proposed (Garret et al., 2012) that the variability of BOLD signal in resting
state is different from the variability observed when the subject is performing a task; therefore, it
seems questionable to use resting-state "noise" to generate H1 sets.
It should also be added that a number of studies (Nandy & Cordes, 2004; Le & Hu, 1997) have
proposed methods of applying ROC methodology to real fMRI data. The drawback of these
methods is their reliance on assumptions that are impossible to test. For example, it is assumed that the
regions found using a t test with a high degree of confidence are the "true" activation regions and
contain no false positives. However, there is a way to apply ROC analysis to probabilistic
classification of real data: the "ground truth" information can be provided by the class labels. A
real data set, preferably with reasonably large separation between the classes, can serve as the H1
set; the H0 set can be obtained from it by permuting the class labels. We use a probabilistic classifier
(such as linear or quadratic discriminant) to compute the probability of each voxel belonging to
the class indicated by its class label. After this, we apply the typical ROC analysis: we vary the
threshold and compute the frequency of true and false positives for each threshold (a volume
from the H1 set gives a true positive when the corresponding probability is higher than the
threshold; for the H0 set, the class labels are random, therefore the volume is a false positive
when the corresponding probability surpasses the threshold). The set of (FPF, TPF) pairs is
visualized as an ROC curve, and the area under the curve can serve as a performance metric.
This type of ROC analysis was not carried out in the work described in this thesis; it presents an
interesting possibility for future research.
2.1.2 Reproducibility; NPAIRS framework
Like all experimental sciences, neuroscience aims to discover reproducible results. A
measure of reproducibility is particularly useful for fMRI studies, because the data are often
corrupted by noise of strong magnitude and complicated structure. Reproducibility of spatial
maps evaluates the stability of spatial locations where the neural effect of interest is expressed.
For the unthresholded maps, reproducibility is monotonically related to global signal-to-noise
ratio (LaConte et al., 2003; Yourganov et al., 2011).
Several reproducibility metrics were proposed in the literature for both PET and fMRI
experiments. For example, a paper by Grabowski and colleagues (1996) obtained repeated
measurements on a relatively large number of subjects (eighteen). Subjects were randomly
separated into two cohorts, and a set of univariate algorithms was applied to analyze each cohort.
Voxels that passed the threshold of activation were grouped into clusters, and neuroanatomical
interpretation was given to each active cluster. The researchers studied whether the active
regions obtained on one cohort would also be found in (a) a different cohort, (b) the same cohort
but different session, and (c) different methods of univariate analysis. The frequency of obtaining
the same active region was used as a per-region measure of reproducibility.
The problem with such an approach is its reliance on neuroanatomical interpretation, which can
be subjective, ambiguous and variable across subjects. An fMRI study by Rombouts et al. (1998)
proposed a different measure of reproducibility between two repeated sessions: the proportion of
voxels that are found to be active in both sessions. Unlike the metric proposed by Grabowski et
al., this measure is computed on voxels, not on the anatomical regions; however, both measures
are very sensitive to the choice of activation threshold. Maximum reproducibility across two
sessions (averaged across subjects) was 0.75 for Bonferroni-corrected critical level of α=0.05,
but with the increase of threshold the number of activated voxels dropped and the proportion of
overlapping voxels dropped accordingly. To complicate the issue, maximum reproducibility
could be even higher (0.78) when different thresholds could be applied to the two sessions.
Maitra (2010) gives an overview of various metrics of similarity of thresholded maps, and
advocates the use of Jaccard's overlap metric. For two thresholded maps, Jaccard overlap is
determined by the ratio of the number of voxels in the intersection of the two maps to the number
of voxels in their union. Maitra argues that this metric is more intuitive than Dice overlap (used
by Rombouts et al., 1998), to which it is monotonically related.
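In code form (an illustrative Python sketch; the function name and the convention for empty maps are ours):

```python
import numpy as np

def jaccard_overlap(map_a, map_b, threshold):
    """Jaccard overlap of two thresholded maps: the number of voxels
    active in both, divided by the number active in either."""
    a = np.asarray(map_a) > threshold
    b = np.asarray(map_b) > threshold
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0  # neither map has active voxels
    return np.logical_and(a, b).sum() / union
```

Note that the result depends directly on `threshold`, which is the sensitivity discussed above.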
A PET study by Strother et al. (1997) proposed a metric of reproducibility that is voxel-based
and at the same time threshold-independent. Given a pair of unthresholded spatial maps,
reproducibility is defined as Pearson's product-moment correlation coefficient r. The two maps
are computed using split-half resampling scheme: the data are split into two independent sets of
approximately equal size, and both sets are analyzed independently. Later fMRI studies (e.g.,
Tegeler et al., 1999; Raemaekers et al., 2007) used Pearson’s correlation to estimate
reproducibility of maps that came from two repetitions of the experiment on the same subject.
In a follow-up paper by Strother et al. (2002), split-half resampling became a basis of the
NPAIRS (Nonparametric Prediction, Activation, Influence and Reproducibility reSampling), a
data-driven pseudo-ROC framework for evaluation of the analysis chain. In NPAIRS,
reproducibility is the median correlation of two split-half maps (the median is taken across the
splits). The splitting procedure is repeated as many times as possible to stabilize the estimation of
reproducibility. The advantage of splitting the data into two halves (rather than a larger number
of subsets as in k-fold cross-validation) is that split-half resampling maximizes the amount of
data used to compute the spatial maps; it has been shown to stabilize variable estimates in
analyses of high-dimensional, ill-posed data sets (Meinshausen & Buhlmann, 2010). When
resampling strategies for evaluation are used, the half-splits must be mutually independent. The
presence of correlation between samples leads to over-estimation of performance metrics, such
as reproducibility. In group studies, this independence can be achieved by using different
subjects for each split (but keeping the number of volumes in each split approximately equal). In
within-subject splits, the half-splits can be composed from different experimental runs. If there is
only one run for each subject, the split must be done so that the temporal separation between the
volumes in two splits is as large as possible, preferably more than 20 seconds (the timecourse of
the hemodynamic response; see Glover, 1998). This may often be achieved in block designs by
making sure that images from the same scanning block end up in the same half-split.
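The core of the reproducibility computation is simple (a minimal Python sketch, assuming each half-split analysis has already produced an unthresholded spatial map as a vector of voxel values):

```python
import numpy as np

def reproducibility(maps_a, maps_b):
    """NPAIRS-style reproducibility: median Pearson correlation between
    the two spatial maps of each split-half pair.
    maps_a, maps_b: arrays of shape (n_splits, n_voxels)."""
    rs = [np.corrcoef(a, b)[0, 1] for a, b in zip(maps_a, maps_b)]
    return np.median(rs)
```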
Examining the scatter plot of the two spatial maps can provide some insights about
reproducibility of our results. Given that the two maps are matched in their scale and variance,
the scatter plot of perfectly matching spatial maps (r=1) is a line of identity. In the worst case of
zero reproducibility (r=0), the plot is a roughly circular cloud (see Figure 2.2 B). For
intermediate values of r, the cloud is elongated along the line of identity (see Figure 2.2 A). The
amount of deviation from this line can serve as an indicator of non-reproducible effects, such as
random noise. Our scatter plot can be analyzed using two axes, the "major" axis along the line of
identity and the "minor" axis orthogonal to it. The major axis contains a mixture of signal and
noise, and the minor axis contains only the noise that is uncorrelated with the signal. If the two
Figure 2.2. Scatter plots of 2 simulated spatial maps, corresponding to reproducibility of
r=0.5 (A) and r=0 (B). Major and minor axes are displayed for each scatter plot. The major
axis contains signal and noise, and variance along this axis is 1+r, while the minor axis
contains only noise, with variance 1-r.
maps are normalized to have unit variance, the variance along the major and minor axes is
e1 = 1 + r and e2 = 1 − r, respectively (Strother et al., 2002). If the variance of the noise along the
major and the minor axes is equal, we can estimate the signal variance as e1 − e2, and define a
measure of global signal-to-noise as a ratio of the signal variance to the noise variance
(Yourganov et al., 2011):
gSNR = (e1 − e2) / e2 = 2r / (1 − r)    (2.2)
This measure indicates the strength of reproducible signal that is contained in the spatial map
relative to non-reproducible noise.
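Equation 2.2 in code form (trivial, but it makes the behaviour explicit: gSNR is zero at r = 0 and grows without bound as r approaches 1):

```python
def gsnr(r):
    """Global signal-to-noise ratio implied by map reproducibility r
    (equation 2.2): signal variance (e1 - e2) over noise variance e2,
    with e1 = 1 + r and e2 = 1 - r."""
    e1, e2 = 1.0 + r, 1.0 - r
    return (e1 - e2) / e2   # = 2r / (1 - r)
```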
2.1.3 Accuracy of Classification
Accuracy in predicting brain states is a natural metric to evaluate the fMRI data classifier. In the
NPAIRS framework, this metric is computed simultaneously with reproducibility of spatial
maps. Taken together, these two independent and complementary metrics can serve as an
alternative to evaluation with ROC methodology (LaConte et al., 2003).
There is a subtle difference between the "prediction accuracy" and "classification accuracy"
metrics. The task of a classifier is to assign a label to a data point (i.e. fMRI volume).
Probabilistic classifiers do it by computing the Bayesian posterior probability of the data point
belonging to class 1, class 2, etc., and assign it to the class that corresponds to the highest
probability. Non-probabilistic classifiers, notably Support Vector Machines (Vapnik, 1995),
assign the data point to a class without computing these probabilities explicitly. By
“classification accuracy”, we will refer to the accuracy of class
assignments (i.e. the proportion of correct assignments). For probabilistic classifiers, we can also
compute the “prediction accuracy” which is the posterior probability of the data point belonging
to the correct class. Consider, for example, applying a probabilistic classifier to a data point in a
two-class problem. The probabilities of this data point belonging to class 1 and class 2 have been
estimated as 0.6 and 0.4, respectively. If class 1 is indeed the correct class, the classification
accuracy is 1 (because 0.6>0.4, class participation has been estimated correctly) and prediction
accuracy is 0.6. The original NPAIRS framework (Strother et al., 2002) uses prediction
accuracy; however, we use classification accuracy instead, in order to be able to evaluate both
probabilistic and non-probabilistic classifiers.
To compute unbiased and robust estimates of classification/prediction accuracy, cross-validation
and resampling methods are normally used (see Efron & Tibshirani, 1993, particularly Chapter
17). The data set is split into a “training set” and a “test set”. The training set is used to train the
classifier, i.e. to estimate the parameters in the model that is used for classification. Then the data
in the test set are classified according to that model. Training and test sets should be as close to
independent as possible, so the estimate of classification/prediction accuracy computed on the
test set data is realistic (otherwise it will be biased upwards). We perform many such splits and
use the mean (or median) accuracy as our metric of classifier’s performance. In the NPAIRS
framework, we use one half of the data to train the classifier, and the other half as a test set; then,
we reverse it and use the second half for training and the first half for testing.
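The distinction between the two metrics can be made concrete (a Python sketch; the array layout is our assumption):

```python
import numpy as np

def classification_and_prediction_accuracy(posteriors, labels):
    """posteriors: (n_points, n_classes) posterior class probabilities;
    labels: index of the correct class for each point.
    Returns (classification accuracy, prediction accuracy)."""
    posteriors = np.asarray(posteriors)
    labels = np.asarray(labels)
    # Classification accuracy: proportion of correct class assignments.
    classification = (posteriors.argmax(axis=1) == labels).mean()
    # Prediction accuracy: mean posterior probability of the correct class.
    prediction = posteriors[np.arange(len(labels)), labels].mean()
    return classification, prediction
```

For the example above (posteriors 0.6 and 0.4, class 1 correct), this returns a classification accuracy of 1 and a prediction accuracy of 0.6.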
2.1.4 Prediction-Reproducibility Plots
The use of prediction/classification accuracy as a performance metric has a long history in
statistics and machine learning (see Efron & Tibshirani, 1993; also Demsar, 2006). The novel
contribution of the NPAIRS framework was to combine this metric with the measure of
reproducibility of spatial maps. Excellent prediction accuracy does not guarantee that the
algorithm also produces reproducible spatial maps. This can be demonstrated on an example
when the signal that differs between the two classes is localized in a large and strongly correlated
cortical network, and a classifier randomly picks one voxel from that network. When a spatial
map is constructed according to each voxel's contribution to classification (see Section 3.2), the
voxels in the network will have low weights in the spatial map, with the exception of the one
voxel used in classification. Since the location of that voxel is random, the reproducibility of the
maps will be poor. Also, a highly reproducible map does not imply accurate classification. As a
somewhat pathological example, consider an algorithm that assigns constant values to all voxels.
Spatial maps are therefore perfectly reproducible, but have no predictive value whatsoever. In
general, we can view prediction and reproducibility as two semi-independent and complementary
measures that serve as indicators of bias and variance of our analytic model.
NPAIRS uses Prediction-Reproducibility plots (P-R plots) as a way to display these
complementary metrics (see Strother et al., 2002; LaConte et al., 2003). We use a split-half
resampling framework to compute these metrics for different splits, and plot median value
(across splits) of prediction accuracy P versus median value of reproducibility R. In the ideal
case (with infinitely large contrast-to-noise ratio and gSNR), the analytic model produces
perfectly reproducible maps and is able to predict mental states perfectly, corresponding to the
point (P=1, R=1) on the plot. We can judge the quality of our model by its proximity to this
point. This is analogous to the ROC methodology, where the ROC curve for a perfect detector
passes through the point (FPF=0, TPF=1).
P-R plots are particularly useful when our analytical model contains a hyperparameter. A
hyperparameter is a parameter that cannot be estimated automatically from the training data
alone; it is typically estimated by optimizing the model’s performance on the independent test set
(Lemm et al., 2010). NPAIRS provides a resampling framework for such estimation, and we can
tune the hyperparameter by optimizing prediction accuracy and/or map reproducibility
(Rasmussen et al., 2012B). The evaluation of our model for a particular value of the
hyperparameter gives us a point on the P-R plot, and by changing this parameter we can observe
a P-R trajectory. An important example of a hyperparameter is the number of principal
components that is used to regularize the linear discriminant. In Figure 2.3, we show two P-R
trajectories for this hyperparameter using the simulated data sets described in the upcoming
Section 2.2.1. In one case, simulated data
contained a correlated active network (black line with circles); in the other case, the active areas
were not correlated (grey line with squares). The size of the marker indicates the number of
principal components: smallest markers correspond to using just the first principal component in
our analysis, and largest markers indicate that the first 30 components were used. We can see
that the P-R trajectory is different in the presence and absence of underlying network
connections. If it is present, our prediction accuracy does not achieve the level that we can have
in the absence of the network (using a sufficiently large number of components). On the other
hand, spatial maps are much more reproducible if the network is present (and a small number of
components is used).
Figure 2.3. Prediction-Reproducibility (P-R) plots in the presence and absence of a
simulated cortical network. The black line shows the P-R trajectory when the activation
loci are linearly coupled into a single covarying network. The grey line shows the trajectory
when the activation loci are independent. The size of the markers corresponds to the
number of principal components used in the analysis (varied from 1 to 40). We show the
average trajectory for 100 simulated data sets described in Section 2.2.1.
2.2 Data sets
Our proposed framework of evaluation uses both simulated and real data to evaluate the
performance of an algorithm. When we generate artificial fMRI data, we have control over the
structure of signal and noise in the data, and therefore we can study the behaviour of our
algorithm in a variety of controlled situations. It is possible to create artificial data using
complicated methods such as neural mass models (Deco et al., 2008), but we have chosen to use
simple Gaussian simulations with many fewer parameters. This allowed us to perform a thorough
examination of how a certain parameter influences the performance of a classifier. Another
advantage of simple simulations is an opportunity to generate very large data sets and therefore
decrease the variance of performance metrics in our evaluations.
However, simulated data sets are not enough for evaluation. Real data have to be included as
well to provide us with a reality check, although comparisons between real and simulated data
are difficult due to lack of control over the real data and simplicity of the simulated data. We
have used real data from two studies: a longitudinal study of stroke recovery, and a study of
aging that involved a large number of people from different age groups performing cognitive
tasks of varying difficulty. The stroke recovery study, because of its longitudinal nature, provides
a large number of fMRI volumes per subject, which is advantageous for complicated machine
learning algorithms that require a large amount of training data (Schmah et al., 2010). The aging
study allows us to evaluate our analytical methods on different age groups and on different
cognitive tasks of incremental behavioural difficulty that use the same visual stimuli; in addition,
it allows us to compare our findings with previously published results (Grady et al., 2010; Garret
et al., 2012) that have been obtained with different analytical methods.
2.2.1 Simulated data
We have used computer-generated data to simulate a block-design experiment with two
conditions: activation and baseline. Data have been generated using the algorithm described by
Lukic and colleagues (2002), with some modifications, and results are reported in (Yourganov et
al., 2011). All images contain the same simplified single-slice "brain-like" background structure
with additive Gaussian noise. An elliptical background structure contained in a 60×60 pixel
image consists of “grey matter” in the center and on the rim of the phantom, with “white matter”
in between. The amplitude of the background signal in the “grey matter” is 4 times higher than in
the “white matter”; informally, this reflects the fact that the grey
matter in the human brain consumes 4 times more energy than the white matter (Logothetis and
Wandell, 2004). Parameters of the phantom have been deduced from a PET study, and are also
representative of spatially smoothed fMRI data (Lukic et al., 2002). Gaussian noise is spatially
smoothed by convolving the image with a Gaussian filter that has a full-width-at-half-maximum
(FWHM) 2 of 2 pixels. After smoothing, the standard deviation of the noise is 5% of the
background signal. Images in the “activation” condition contain 16 Gaussian-shaped signal
“blobs” distributed over the image (12 in the “grey matter” and 4 in the “white matter”) and
added to the smoothed noisy background image. Figure 2.4 shows examples of baseline and
activation images (noise is not displayed; although activation signal could be negative as well as
positive, we only show positive signal for the sake of clarity). The FWHM of the activation blobs
vary between 2 and 4 pixels. Simulated experimental data sets are composed of N baseline and N
activation images per set (N = 100), so the total number of observations in a set is 2N=200. We
use a mask with J=2072 pixels covering the "brain" to exclude locations outside of the phantom
from analysis.
Images are arranged into 10 “epochs” of 20 images each to simulate a block design with epochs
of 10 “baseline” images followed by 10 “activation” images. To simulate the hemodynamic
response, each pixel’s time course is convolved with a hemodynamic response function (HRF)
defined by the sum of two Gamma functions (Glover, 1999). Parameters of the HRF model have
been taken from Worsley (2001): a1 = 6, a2 = 12, b1 = b2 = 0.9 seconds, c = 0.35, TR (time to
acquire the full brain volume) = 2 seconds.
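With these parameters, the HRF can be written as the difference of two gamma-shaped functions (a Python sketch following Glover's (1999) form as given by Worsley (2001); the exact expression used to generate the simulations may differ in normalization):

```python
import numpy as np

A1, A2 = 6.0, 12.0   # shape parameters
B1, B2 = 0.9, 0.9    # time scales, seconds
C = 0.35             # relative size of the undershoot

def hrf(t):
    """Double-gamma hemodynamic response, t in seconds."""
    d1, d2 = A1 * B1, A2 * B2   # peak times of the two components
    peak = (t / d1) ** A1 * np.exp(-(t - d1) / B1)
    undershoot = (t / d2) ** A2 * np.exp(-(t - d2) / B2)
    return peak - C * undershoot

# Each pixel's time course would then be convolved with the HRF
# sampled at TR = 2 s, e.g.:
#   t = np.arange(0.0, 30.0, 2.0)
#   convolved = np.convolve(timecourse, hrf(t))[:len(timecourse)]
```

The response peaks at roughly 5 seconds after stimulus onset and shows the characteristic post-stimulus undershoot.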
Figure 2.4. The phantom in baseline (left) and activation (right) states. Noise is not
displayed.
2 FWHM defines the spread of the filter. For a Gaussian filter, it is proportional to the standard deviation σ: FWHM = 2σ√(2 ln 2) ≈ 2.355 σ.
Amplitudes of the Gaussian activation signal blobs are sampled from a multivariate Gaussian
distribution. The mean amplitude of each activation is specified proportionally to the local value
of the background signal:
E[ak] = Mbk, (2.3)
where ak is the amplitude of kth activation, E[ak] is its expected value, bk is the value of noise-
free baseline image at the center of the kth area, and M is the proportionality constant. To study
the effect of M on performance of the algorithms, M has been set to different levels (0, 0.01,
0.02, 0.03 and 0.05) in different realizations of our simulated experiment. These levels of M
correspond to contrast-to-noise ratios (CNRs) of 0, 0.2, 0.4, 0.6 and 1.0. HRF convolution
changes these values to empirical measurements of 0, 0.3, 0.6, 1.0, and 1.6, respectively3.
The variance of the amplitude of the Gaussian activation signal in our multivariate Gaussian
distribution, denoted by σk², is defined proportionally to the variance of the independent
background Gaussian noise added to each voxel, vk²:
σk² = V vk²,    (2.4)
where the proportionality constant V has been varied from 0.1 to 1.6 in different realizations of
the experiment. In this dissertation, we refer to V as the relative signal variance, which may be
thought of as a form of physiological variation of the activation signal. The third parameter of
our multivariate Gaussian model is the correlation coefficient, ρ, which defines the covariance
between Gaussian activation signal amplitudes at the kth and lth locations (k ≠ l):
cov(ak, al) = ρ σk σl.    (2.5)
The value of ρ has been set to 0, 0.5 and 0.99 to define a simple distributed spatial network
(Lukic et al., 2002). This value is the same for all regions in the network.
3 Empirical measures of CNR are computed as follows. For a given active locus, we compute the difference in mean
signal of the active and baseline images (discarding the first 2 volumes in each block) in the H1 set, and divide it by the standard deviation of the timecourse at this locus in the corresponding H0 set. This is repeated for all 16 loci, and the result is averaged across the loci and then across the 100 sets.
The amplitudes of the multivariate Gaussian signal in the “active” state are defined by the three
parameters: CNR (or M), V and ρ. These values are the same for all volumes in a simulated
experimental set, but may differ across sets. All CNR values are those measured empirically after
convolution by the HRF. This Gaussian signal simulation incorporates the three ideas of (1) a
mean signal level, (2) physiological variation of signal levels about their mean across successive
scans, and (3) a partition of this signal variation between physiological noise and network
variation defined by the chosen value of the correlation coefficient coupling between distributed
Gaussian blobs.
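The activation-amplitude model above can be sketched in a few lines of NumPy. This is a minimal illustration, not the simulation code used in the thesis: the number of volumes, the mean amplitude, the background-noise standard deviations, and the values of V and ρ are placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

n_loci, n_vols = 16, 100      # 16 active loci; the volume count is arbitrary here
V, rho = 0.6, 0.5             # relative signal variance and coupling (placeholders)
v = np.ones(n_loci)           # background-noise s.d. at each locus (assumed 1)

# Covariance of the activation amplitudes: sigma_k^2 = V * v_k^2 on the
# diagonal (Eq. 2.4), rho * sigma_k * sigma_l off the diagonal (Eq. 2.5).
sigma = np.sqrt(V) * v
cov = rho * np.outer(sigma, sigma)
np.fill_diagonal(cov, sigma ** 2)

mean = np.full(n_loci, 1.0)   # mean amplitude (placeholder for M)
amplitudes = rng.multivariate_normal(mean, cov, size=n_vols)

# Empirical check: off-diagonal correlations should be close to rho.
emp_corr = np.corrcoef(amplitudes.T)
print(round(float(emp_corr[0, 1]), 2))
```

Convolving these amplitudes with an HRF and adding independent voxel noise would complete the simulated time courses.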
To compute ROC curves, we have created 100 examples of H0 and H1 type sets for each setting
of (M, V, ρ). Each of the H1 sets consists of N activation images and N baseline images. Each of
the H0 sets consists of baseline images only. We have computed spatial maps for each set using
the algorithm under evaluation. At each of the 16 activation loci, we have built the ROC curve
using the voxel values at this particular locus. True positive frequency has been computed using
100 maps from the H1 sets, and false positive frequency has been computed from 100 H0 sets.
We have used LABROC1 software (Metz et al., 1998) to generate smooth ROC curves from
discrete values of (TPF, FPF).
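The single-locus ROC construction can be illustrated with a small sketch. The thesis used LABROC1 to fit smooth curves; here the discrete (FPF, TPF) pairs are computed directly, and the map values at one locus are simulated rather than taken from real H0/H1 sets.

```python
import numpy as np

def empirical_roc(h1_vals, h0_vals):
    """TPF/FPF pairs for one locus, sweeping a threshold over the
    pooled map values from the H1 (signal) and H0 (noise-only) sets."""
    thresholds = np.sort(np.concatenate([h1_vals, h0_vals]))[::-1]
    tpf = np.array([(h1_vals >= t).mean() for t in thresholds])
    fpf = np.array([(h0_vals >= t).mean() for t in thresholds])
    return fpf, tpf

rng = np.random.default_rng(1)
h1 = rng.normal(1.0, 1.0, 100)   # map values at an active locus, 100 H1 sets
h0 = rng.normal(0.0, 1.0, 100)   # same locus, 100 H0 (baseline-only) sets
fpf, tpf = empirical_roc(h1, h0)

# Trapezoidal area under the discrete (FPF, TPF) curve.
auc = float(np.sum(np.diff(fpf) * (tpf[1:] + tpf[:-1]) / 2))
print(round(auc, 2))
```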
2.2.2 Stroke recovery study
We also analyzed data collected by Small et al. (2002), in a longitudinal block-design study of
stroke recovery, in which 9 stroke patients were scanned in 4 different sessions at 1, 2, 3 and 6
months after the stroke. During each session, subjects were instructed to perform a motor task
(finger tapping alternating with wrist flexion) alternating with blocks of rest. Two runs were
recorded for subjects performing the task with their healthy hand, and two with the hand
impaired by the stroke. Whole-brain fMRI data were acquired on a 1.5 T MRI scanner; 24
horizontal 6-mm-thick slices were obtained. Scanning parameters were: volume acquisition time
(TR) = 4 seconds, echo time (TE) = 35 milliseconds, flip angle = 60º. Out of 24 slices, we
selected a subset of 7 slices that corresponded to parts of the brain involved in finger and wrist
motion (4 slices containing the cerebellum and 3 slices containing the motor areas). The number
of voxels selected for analysis was 10,499.
The data for all 9 subjects were spatially aligned to each other and corrected for motion. The
volumes within each motor-task block were divided by the average volume of the last two scans
of the preceding rest block, for intensity normalization and to filter out low temporal frequencies
(McIntosh & Lobaugh, 2004). Rest blocks were discarded after that. By pooling the data across
runs and sessions, we obtained 1280 volumes per individual subject; this large number of
volumes per subject allowed us to carry out within-subject analysis easily. The data set is used to
examine the utility of dimensionality estimates and related covariance measures as biomarkers of
stroke recovery (Yourganov et al., 2010; see also Chapter 6). An analysis of the performance of
various within-subject classifiers is also described in detail in Schmah et al. (2010) and
summarized in Chapter 8.
2.2.3 Aging study
We analyzed another set of real fMRI data that was collected in a study by Grady and colleagues
(2010). To study the impact of aging on cognitive abilities, a number of subjects from three age
groups (“young”, 20-31 years, 19 subjects; “middle aged", 56-65 years, 14 subjects; “old”, 66-85
years, 14 subjects)4 were scanned during performance of five behavioural tasks. The tasks were:
1. fixation to a dot presented in the middle of the screen ( "FIX");
2. reaction task: detection of a visual stimulus and reporting its position on the screen ("RT");
3. perceptual matching, where the participant had to match the "target" sample presented in the
upper portion of the screen to one of the three stimuli presented in the lower portion ("PM");
4. delayed matching test of working memory, where the target stimulus was presented and then
removed from the screen, followed by a 2.5-second blank-screen delay. After this, three stimuli
were presented and the participant had to match them to the target ("DM").
During each experimental run, the fixation condition was presented in eight 20-second blocks. The
other four conditions were presented in 2 blocks for each condition, each block lasting
approximately 40 seconds (the duration varied slightly because the stimuli were generated at the
time of scanning runs). Four scanning runs were acquired on each participant, with 300 volumes
in each run. Scanning was done on a 3T scanner with the following parameters: TR = 2 seconds,
4 In our group-level analysis, we have pooled the “middle aged” and the “old” groups together to form one group of
older subjects.
TE = 30 milliseconds, flip angle = 70º. Functional images of the whole brain were acquired in 28
axial slices, 5-mm thick.
Preprocessing of the data in this study was somewhat more extensive than in the stroke study; it
is described in detail by Grady et al. (2010). First, the transformation aligning the functional
images to a common atlas was computed. Then the images underwent slice-timing correction
(with the AFNI package; Cox, 1996) and motion correction (with the AIR package; Woods et al.,
1998). This was applied to the original images,
which were afterwards transformed into a common anatomical space. Then the images were
smoothed with a Gaussian kernel (FWHM = 7 mm), and artifact-carrying components were
removed by using Independent Component Analysis (with the MELODIC package; Beckmann
& Smith, 2004). Using a standard white matter mask, mean white-matter signal was obtained by
averaging white-matter voxels; the mean signal was then regressed from the time course of each
voxel. The same was done for the mean CSF signal. Finally, linear trends were removed.
Chapter 3 Probabilistic classification of fMRI data
3.1 General considerations
The problem of predicting a mental state for an fMRI volume is a problem of classification: a
specific volume is classified into one of the groups that represent various mental states.
Typically, a mental state is associated with performing a certain behavioural task. For example,
in the stroke recovery study described in Section 2.2.2, mental states are associated with left
finger tapping, right finger tapping, left and right wrist flexion, and rest. During training, a
classifier learns the association between training volumes and the corresponding mental states.
Afterwards, the classifier applies this association to the test data in order to assign each
test volume to one of the mental states.
A researcher who wants to perform classification on fMRI data faces the problem of selecting a
classification algorithm from a large group of methods created by the statistical and machine
learning community. Some of these methods are probabilistic: they construct a probabilistic
model for each class, and compute the probability of the fMRI volume belonging to each class,
so the volume is assigned to the most probable class. Another group of classifiers is non-
probabilistic: the mapping of fMRI volumes to class labels is done without constructing
probabilistic models. A popular example of non-probabilistic classifiers is Support Vector
Machines (SVMs; see Vapnik, 1995).
Classification algorithms can be univariate or multivariate: univariate classifiers assume that
fMRI signal is independent across voxels, and multivariate classifiers take interactions between
voxels into account. The brain is a network of interacting cortical areas, so the assumption of
voxel independence does not hold for fMRI data (also, artifacts in the data could be another
source of dependency between voxels); nevertheless, univariate classifiers usually have fewer
degrees of freedom than multivariate classifiers, and therefore require a smaller amount of
training data. Another distinction is between linear and nonlinear classifiers: linear classifiers
separate the two classes with a linear hyperplane in feature space, and nonlinear classifiers use a
nonlinear surface.
Generative classifiers operate in a Bayesian framework. Consider the classification problem
where the data vector x needs to be assigned to a class. The class memberships are unique, and
the number of classes is Nclass. Bayes rule can be adapted to the classification problem as follows
(Bishop, 2006):
\[ P(\mathrm{class}=c \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid \mathrm{class}=c)\,P(\mathrm{class}=c)}{P(\mathbf{x})} , \tag{3.1} \]
where,
P(class = c | x) is the probability that a given data vector x belongs to class c. It is also
called posterior probability of class c. Data vector x is assigned to the class with the
largest posterior probability.
P(x | class = c) is the probability of x, if we know that it belongs to a class c. Generative
classifiers assume a probabilistic model for each class, so we can compute this
probability easily. This probability is called the likelihood of x for a class c.
P(class = c) is the probability that an arbitrary vector, without considering its numerical
value, belongs to class c; in other terms, it is the likelihood of occurrence of vectors
belonging to class c. For example, if all class memberships are equally likely, this
probability is equal to 1/ Nclass for all classes. This is called the prior probability of class
c, because it is not influenced by x and can be computed before obtaining x. In contrast,
the posterior probability of class c is computed (using Bayes rule) after obtaining x.
P(x) is the probability of x, i.e. the likelihood of observing this data vector. It is also
called the marginal probability of x. It can be written as a sum of likelihoods of x for all
our classes:
\[ P(\mathbf{x}) = \sum_{k=1}^{N_{\mathrm{class}}} P(\mathbf{x} \mid \mathrm{class}=k)\,P(\mathrm{class}=k) . \tag{3.2} \]
For simplicity, consider a problem of two-class classification (Nclass = 2). If the two classes are
equally likely to occur, their prior probabilities are equal5, and a vector x is assigned to class 1 or
class 2 depending on which likelihood is greater, P(x | class = 1) or P(x | class = 2).
Equivalently, the class membership of x is given by the sign of the decision function, defined as
\[ D(\mathbf{x}) = \log \frac{P(\mathbf{x} \mid \mathrm{class}=1)}{P(\mathbf{x} \mid \mathrm{class}=2)} . \tag{3.3} \]
If this function is positive, x is assigned to class 1; if it is negative, to class 2. If this function is
zero, the assignment cannot be made because the membership in either class is equally likely.
The set of points where D(x) = 0 is called the decision boundary, which can be seen as a hyper-
surface that separates the two classes.
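The decision rule of Equations 3.1–3.3 can be sketched generically; the one-dimensional Gaussian likelihoods and their parameters below are illustrative assumptions, not models from the thesis.

```python
import numpy as np
from math import sqrt, pi, exp

def posterior(x, likelihoods, priors):
    """Posterior class probabilities via Bayes' rule (Eqs. 3.1-3.2).
    `likelihoods` holds one function p(x | class = c) per class."""
    joint = np.array([lik(x) * pr for lik, pr in zip(likelihoods, priors)])
    return joint / joint.sum()          # dividing by the marginal P(x)

def decision(x, lik1, lik2):
    """Two-class decision function of Eq. 3.3: the log-likelihood ratio."""
    return np.log(lik1(x) / lik2(x))

# Illustrative 1-D Gaussian classes (toy parameters).
gauss = lambda mu, s: (lambda x: exp(-(x - mu) ** 2 / (2 * s * s)) / (s * sqrt(2 * pi)))
lik1, lik2 = gauss(-1.0, 1.0), gauss(1.0, 1.0)

print(decision(0.0, lik1, lik2))        # the midpoint lies on the decision boundary
post = posterior(-2.0, [lik1, lik2], [0.5, 0.5])
print(post.argmax())                    # x = -2 is assigned to the first class
```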
Here, we will describe three probabilistic classifiers that use a Gaussian model to compute
likelihood functions. However, the assumptions behind the Gaussian model are different:
1. Quadratic discriminant is the most general method. Each class is modeled with a
separate multivariate Gaussian distribution with a specific mean vector and covariance
matrix.
2. Linear discriminant is a constrained version of the above: the covariance matrix is
assumed to be the same for all classes. The only difference between classes is in their
mean vectors.
3. Gaussian Naive Bayes classifier is still more constrained: the covariance matrix is
assumed to be diagonal. This is equivalent to using a univariate Gaussian distribution for
each dimension of the data. If we make an additional assumption that the covariance
matrix is the same for all classes, the decision function becomes linear; if this assumption
is not made, it is a nonlinear function. We will call these two variants linear and nonlinear
Gaussian Naïve Bayes, respectively.
5 If we do not want to assume that the class memberships have equal prior probabilities, we can estimate the prior
probability of a class using training data: as a ratio of the training vectors belonging to the class to the total number of training vectors.
3.2 Constructing spatial maps for classifiers
A spatial map, computed for a given classifier, indicates the relative importance of different
spatial locations in classification of fMRI volumes. We propose to construct spatial maps by
taking a voxel-wise derivative of the decision function. This is similar to the technique of
“sensitivity maps” proposed by Kjems and colleagues (2002). The value of the ith voxel of the
spatial map is computed as
\[ y_i = \frac{1}{N} \sum_{j=1}^{N} \frac{\partial}{\partial x_i} D(\mathbf{x}^{(j)}) , \tag{3.4} \]
where x(j) is the jth volume, and N is the number of volumes. Essentially, we take the decision
function for each volume, compute its partial derivative at a specific voxel location, and average
it across all volumes. Therefore, yi indicates the average impact of the ith voxel on the decision
function, and reflects the importance of this voxel in classification.
In the original method proposed by Kjems et al., the maps are made by taking the square of
voxel-wise derivative of P(class = c | x), posterior probability of class membership. This
approach has two disadvantages. The first disadvantage is numerical instability; Formula 3.3 can
be rewritten as
\[ P(\mathrm{class}=1 \mid \mathbf{x}) = \frac{1}{1+\exp(-D(\mathbf{x}))} = \sigma(D(\mathbf{x})) , \tag{3.5} \]
where σ denotes the sigmoid function: σ(z) = 1 / (1 + exp(-z)). Thus
\[ \frac{\partial}{\partial x_i} P(\mathrm{class}=1 \mid \mathbf{x}) = \sigma'(D(\mathbf{x}))\,\frac{\partial}{\partial x_i} D(\mathbf{x}) \tag{3.6} \]
where σ´(z) = exp(-z) / (1 + exp(-z))2 = σ(z) (1 - σ(z)). The term σ´(D(x)) is close to zero
whenever x is far from the decision boundary. This may be considered a desirable property
theoretically, but in our experiments, only a small number of volumes in the training set were
close enough to the decision boundary for this term to be numerically nonzero, leading to a non-
robust dependence on a small number of training volumes. Therefore, we advocate using the
derivative of decision function D(x) instead of P(class = c | x) when making a sensitivity map
(Yourganov et al., 2010).
The second disadvantage is loss of sign information due to squaring of ∂P(class = c | x)/∂x_i. The
sign of ∂P(class = c | x)/∂x_i encodes the class preference of the ith voxel: it indicates whether the
signal in that voxel should be increased or decreased in order to increase P(class = c | x)
(Rasmussen et al., 2012A). This also applies when we take the derivative of D(x) instead of
P(class = c | x); for a two-class problem, we can say that a positive value of ∂D(x)/∂x_i
corresponds to a preference of the ith voxel for class 1, and a negative value to a
preference for class 2. If the sign information is preserved (as in Formula 3.4), sensitivity maps
can be interpreted analogously to statistical parametric maps (Worsley, 2001), where the sign of
the voxel indicates whether the contrast is expressed positively or negatively in that voxel.
In this thesis, we construct the spatial maps using Formula 3.4; this corrects the drawback of
our earlier paper (Yourganov et al., 2010), where we used the square of ∂D(x)/∂x_i. For a
multi-class problem, sensitivity maps can also be constructed according to a method proposed in
a recent paper by Rasmussen et al. (2012A). This method computes one sensitivity map for each
observation by taking the voxel-wise partial derivative of log P(class = c | x), and then groups
the maps into homogeneous clusters, producing one map per cluster (the number of clusters is
determined by optimizing generalization error in a cross-validation framework).
Let us consider the case when the derivative of the decision function D(x) with respect to x can
be expressed analytically as a function of x: d(x) = ∂D(x)/∂x. Then the sensitivity map can be
expressed as a function of the derivative:
\[ \mathbf{y} = \frac{1}{N_{\mathrm{train}}} \sum_{k=1}^{N_{\mathrm{train}}} \mathbf{d}(\mathbf{x}^{(k)}) . \tag{3.7} \]
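When d(x) is not available in closed form, the average in Formula 3.4 can still be approximated numerically. The sketch below is an illustration, not the thesis implementation: it uses central finite differences and checks the result on a decision function whose gradient is known.

```python
import numpy as np

def sensitivity_map(D, volumes, eps=1e-5):
    """Eq. 3.4: average across volumes of the voxel-wise partial
    derivatives of the decision function D, taken here by central
    finite differences."""
    n_vols, n_vox = volumes.shape
    y = np.zeros(n_vox)
    for x in volumes:
        for i in range(n_vox):
            step = np.zeros(n_vox)
            step[i] = eps
            y[i] += (D(x + step) - D(x - step)) / (2 * eps)
    return y / n_vols

# Sanity check on a decision function with a known gradient:
# D(x) = w.x has constant derivative w, so the map should recover w.
w = np.array([1.0, -2.0, 0.5])
vols = np.random.default_rng(2).normal(size=(10, 3))
map_est = sensitivity_map(lambda x: w @ x, vols)
print(np.allclose(map_est, w))
```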
3.3 Quadratic discriminant
The quadratic discriminant (QD) method was first proposed by Smith (1947) to classify data that
come from multivariate Gaussian distributions with class-specific means and
covariance matrices. Cooper (1963) developed this method further and advocated its use for a
wide range of multivariate distributions. The description of this method and comparison to other
classification methods can be found in books by Seber (2004) and by Hastie et al. (2009). We
have demonstrated the efficacy of QD in classifying fMRI volumes (Schmah et al., 2010). Each
class is modeled with a multivariate Gaussian distribution with mean vector μc and covariance
matrix Σc:
\[ P(\mathbf{x} \mid \mathrm{class}=c) = (2\pi)^{-K/2}\,|\boldsymbol{\Sigma}_c|^{-1/2} \exp\!\Big(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_c)^T \boldsymbol{\Sigma}_c^{-1} (\mathbf{x}-\boldsymbol{\mu}_c)\Big) . \tag{3.8} \]
Here, K is the number of dimensions in our data. In real-world situations, the class-specific
population parameters μc and Σc are unknown. Instead, we can use unbiased estimates of mean
and covariance, which are computed using the training set:
\[ \mathbf{m}_c = \frac{1}{N_c} \sum_{j=1}^{N_c} \mathbf{x}^{(j)}, \qquad \mathbf{S}_c = \frac{1}{N_c - 1} \sum_{j=1}^{N_c} (\mathbf{x}^{(j)}-\mathbf{m}_c)(\mathbf{x}^{(j)}-\mathbf{m}_c)^T . \tag{3.9} \]
Here, Nc is the number of training examples in class c. When computing likelihood functions, we
substitute the sample mean mc and sample covariance matrix Sc into Equation 3.8 in place of the
population mean μc and population covariance matrix Σc, respectively.
For two equally likely classes, the decision function is
\[ D_{QD}(\mathbf{x}) = \frac{1}{2}\log\frac{|\mathbf{S}_2|}{|\mathbf{S}_1|} - \frac{1}{2}(\mathbf{x}-\mathbf{m}_1)^T \mathbf{S}_1^{-1} (\mathbf{x}-\mathbf{m}_1) + \frac{1}{2}(\mathbf{x}-\mathbf{m}_2)^T \mathbf{S}_2^{-1} (\mathbf{x}-\mathbf{m}_2) . \tag{3.10} \]
We can see that this function is quadratic in x. The decision boundary is a quadric surface in K
dimensions.
Let us now derive the formula for the sensitivity map. First, we can compute the derivative of the
decision function DQD (x). We can re-arrange the terms in Formula 3.10 and express DQD (x) as a
sum of a quadratic term, a linear term and a constant term:
\[ D_{QD}(\mathbf{x}) = -\frac{1}{2}\mathbf{x}^T(\mathbf{S}_1^{-1}-\mathbf{S}_2^{-1})\mathbf{x} + (\mathbf{m}_1^T \mathbf{S}_1^{-1} - \mathbf{m}_2^T \mathbf{S}_2^{-1})\mathbf{x} + \mathrm{const} . \tag{3.11} \]
The derivatives of the quadratic and linear terms can be computed easily since we know that, for
any symmetric matrix A, we can write (see e.g. the appendix in Mardia et al., 1979)
\[ \frac{\partial}{\partial \mathbf{x}}\,\mathbf{x}^T \mathbf{A} \mathbf{x} = 2\mathbf{A}\mathbf{x}, \qquad \frac{\partial}{\partial \mathbf{x}}\,\mathbf{x}^T \mathbf{A} \mathbf{y} = \mathbf{A}\mathbf{y} . \tag{3.12} \]
Therefore, the derivative of DQD (x) is a vector
\[ \mathbf{d}_{QD}(\mathbf{x}) = -(\mathbf{S}_1^{-1}-\mathbf{S}_2^{-1})\mathbf{x} + (\mathbf{S}_1^{-1}\mathbf{m}_1 - \mathbf{S}_2^{-1}\mathbf{m}_2) = -\big(\mathbf{S}_1^{-1}(\mathbf{x}-\mathbf{m}_1) - \mathbf{S}_2^{-1}(\mathbf{x}-\mathbf{m}_2)\big) . \tag{3.13} \]
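Formulas 3.10 and 3.13 translate directly into code. The following sketch uses illustrative two-dimensional class parameters; a real application would substitute the (regularized) sample means and covariances.

```python
import numpy as np

def qd_decision(x, m1, S1, m2, S2):
    """Eq. 3.10: QD decision function for two equally likely classes."""
    d1, d2 = x - m1, x - m2
    return (0.5 * np.log(np.linalg.det(S2) / np.linalg.det(S1))
            - 0.5 * d1 @ np.linalg.solve(S1, d1)
            + 0.5 * d2 @ np.linalg.solve(S2, d2))

def qd_derivative(x, m1, S1, m2, S2):
    """Eq. 3.13: analytic derivative of the QD decision function."""
    return -(np.linalg.solve(S1, x - m1) - np.linalg.solve(S2, x - m2))

# Illustrative 2-D class parameters (toy values).
m1, m2 = np.array([1.0, 0.0]), np.array([-1.0, 0.0])
S1 = np.array([[1.0, 0.3], [0.3, 1.0]])
S2 = np.array([[2.0, 0.0], [0.0, 0.5]])
x = np.array([0.8, -0.2])
print(qd_decision(x, m1, S1, m2, S2) > 0)   # x lies closer to class 1
```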
3.4 Linear discriminant
When the classes are sampled from multivariate Gaussian populations that all share the same
covariance matrix, the Bayes-optimal method of classification is called linear discriminant (LD).
It can be seen as a special case of QD, but this method was developed before QD and is overall a
more popular method of classification. The algorithm was outlined in a 1936 paper by Fisher,
although Hotelling and Mahalanobis were developing similar ideas around the same time (see
Hodges, 1955, for a historical review). The earliest neural network prototype, Rosenblatt's
perceptron (1958), was mathematically equivalent to a linear discriminant, and therefore limited
to solving linearly separable classification problems, as shown by Minsky and Papert (1969).
In neuroimaging, LD has a long history as a method of analysis of fMRI and PET data (see e.g.
Friston et al., 1996; Tegeler et al., 1999; Strother et al., 1997, 2002). The treatment of the
mathematical theory behind LD is described in several books (e.g., Mardia et al., 1979; Seber,
2004). For binary classification, linear discriminant analysis is equivalent to canonical variate
analysis (Strother et al., 2002; Mardia et al., 1979).
LD assumes that data from all classes are sampled from multivariate Gaussian distributions,
where the mean vectors are specific to a class but the covariance matrix is the same across
classes. The likelihood function of a vector x which belongs to a class c is
\[ P(\mathbf{x} \mid \mathrm{class}=c) = (2\pi)^{-K/2}\,|\boldsymbol{\Sigma}|^{-1/2} \exp\!\Big(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_c)^T \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu}_c)\Big) . \tag{3.14} \]
This expression is identical to Equation 3.8, except that the common covariance matrix Σ is now
used in place of the class-specific covariance matrix Σc. This sharing of covariance matrix across
classes leads to several significant simplifications. The decision rule for two equally likely
classes is now a linear, rather than quadratic, function of x:
\[ D_{LD}(\mathbf{x}) = \Big(\mathbf{x} - \tfrac{1}{2}(\mathbf{m}_1+\mathbf{m}_2)\Big)^T \mathbf{S}^{-1} (\mathbf{m}_1-\mathbf{m}_2) . \tag{3.15} \]
Here, sample means mc are estimates of class means μc (see Formula 3.9). Matrix S is an
unbiased estimate of population covariance Σ, and is computed by pooling sample covariance
matrices Sc:
\[ \mathbf{S} = \frac{1}{N - N_{\mathrm{class}}} \sum_{c=1}^{N_{\mathrm{class}}} (N_c - 1)\,\mathbf{S}_c . \tag{3.16} \]
Here, Nc is the size of the cth class, N is the total number of data points, Nclass is the number of
classes, and class-specific sample covariance matrices Sc are computed using Formula 3.9.
Since the decision function is linear, the decision boundary (the locus where DLD (x) is zero) is a
hyperplane in K dimensions. The derivative of the decision function can be easily computed
applying Formula 3.12:
\[ \mathbf{d}_{LD} = \mathbf{S}^{-1}(\mathbf{m}_1-\mathbf{m}_2) . \tag{3.17} \]
The derivative does not depend on x, so spatial maps can be computed without averaging across
training volumes: the vector dLD itself serves as the spatial map.
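Putting Formulas 3.9 and 3.15–3.17 together, a minimal LD sketch might look as follows; the data are toy draws with illustrative dimensions and class means, not fMRI volumes.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy data: two classes sharing unit covariance, means at +1 and -1.
n1, n2, K = 40, 60, 5
X1 = rng.normal(0.0, 1.0, (n1, K)) + 1.0
X2 = rng.normal(0.0, 1.0, (n2, K)) - 1.0

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S1 = np.cov(X1, rowvar=False)        # unbiased class covariances (Eq. 3.9)
S2 = np.cov(X2, rowvar=False)

# Pooled covariance estimate (Eq. 3.16) with N_class = 2.
S = ((n1 - 1) * S1 + (n2 - 1) * S2) / (n1 + n2 - 2)

# Eq. 3.17: the LD derivative/spatial map does not depend on x.
d_ld = np.linalg.solve(S, m1 - m2)

def ld_decision(x):
    """Eq. 3.15: positive for class 1, negative for class 2."""
    return (x - 0.5 * (m1 + m2)) @ d_ld

acc = 0.5 * (np.mean([ld_decision(x) > 0 for x in X1])
             + np.mean([ld_decision(x) < 0 for x in X2]))
print(round(float(acc), 2))
```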
It is interesting that Fisher (1936) derived Formula 3.15 without using a probabilistic (i.e.,
multivariate Gaussian) model. Instead, he used an intuitive criterion of "good classification" that
maximizes differences between classes while minimizing differences within a class.
Mathematically, the problem is to find a vector d that maximizes the ratio
\[ \frac{\mathbf{d}^T \mathbf{B} \mathbf{d}}{\mathbf{d}^T \mathbf{S} \mathbf{d}} . \tag{3.18} \]
Here, B is the "between-class covariance matrix"; for a two-class problem, it is equal to
\[ \mathbf{B} = \frac{N_1 N_2}{N} (\mathbf{m}_1-\mathbf{m}_2)(\mathbf{m}_1-\mathbf{m}_2)^T , \tag{3.19} \]
which is a singular matrix of rank one. The vector d that maximizes the ratio 3.18 is the first
eigenvector of the matrix S^{-1}B, and it exactly corresponds to d_LD given by Formula 3.17. To
classify a test vector x, Fisher proposed to look at the sign of the product d^T x, and to assign x
to class 1 or 2 depending on whether d^T x is positive or negative. It should be mentioned that,
while Fisher had not assumed a Gaussian distribution for the data, he still used the assumption
that the covariance matrices were equal across classes. This assumption is called
homoscedasticity; conversely, the situation where covariance matrices differ is called
heteroscedasticity.
3.5 Univariate methods: Gaussian Naive Bayes classifier, General Linear Model
The Gaussian Naive Bayes (GNB) classifier is designed to handle situations where our data are
represented in a coordinate system with independent dimensions. We still make the assumption
that data are sampled from a multivariate Gaussian population. There are two versions of GNB,
linear and nonlinear; we will now describe the nonlinear scenario as the more general case. Here,
we don’t make the assumption of homoscedasticity (that is, the assumption that the population
covariance matrix is the same for all classes). Since the dimensions are independent, the
covariance matrices are diagonal. The parameters for each class are:
\[ \boldsymbol{\mu}_c = \begin{pmatrix} \mu_{c1} \\ \mu_{c2} \\ \vdots \\ \mu_{cK} \end{pmatrix} , \qquad \boldsymbol{\Sigma}_c = \mathrm{diag}\big(\sigma_{c1}^2, \sigma_{c2}^2, \dots, \sigma_{cK}^2\big) . \tag{3.20} \]
Substituting this diagonal form of Σc into the formula for the multivariate Gaussian distribution,
we get the expression for the likelihood function:
\[ P(\mathbf{x} \mid \mathrm{class}=c) = (2\pi)^{-K/2} \prod_{i=1}^{K} \sigma_{ci}^{-1} \exp\!\Big(-\frac{(x_i-\mu_{ci})^2}{2\sigma_{ci}^2}\Big) = \prod_{i=1}^{K} P(x_i \mid \mathrm{class}=c) . \tag{3.21} \]
Here, xi is the ith element of vector x, and it is sampled from a univariate Gaussian distribution
with mean μci and standard deviation σci.
The decision function for two equally likely classes is:
\[ D_{NGNB}(\mathbf{x}) = \sum_{i=1}^{K} \log\frac{s_{2i}}{s_{1i}} - \sum_{i=1}^{K} \frac{(x_i-m_{1i})^2}{2 s_{1i}^2} + \sum_{i=1}^{K} \frac{(x_i-m_{2i})^2}{2 s_{2i}^2} . \tag{3.22} \]
Here, mci is the ith element of the class-specific sample mean mc, and
\[ s_{ci} = \Big( \frac{1}{N_c - 1} \sum_{j=1}^{N_c} \big(x_i^{(j)} - m_{ci}\big)^2 \Big)^{1/2} \tag{3.23} \]
is the sample standard deviation along the ith dimension for class c.
Let us now consider the linear case, where all the classes are sampled from the distribution with
the same population covariance matrix. Therefore, the standard deviation along each dimension
is the same for all classes. We can estimate it using the mean of the standard deviations across
classes. The decision function simplifies to
\[ D_{LGNB}(\mathbf{x}) = \sum_{i=1}^{K} \frac{(x_i-m_{2i})^2 - (x_i-m_{1i})^2}{2 s_i^2} \tag{3.24} \]
and the ith element of its derivative is simply
\[ d_i = \frac{m_{1i}-m_{2i}}{s_i^2} . \tag{3.25} \]
Note that this expression is very similar to a two-sided t test (Freund, 1992) along the ith
dimension:
\[ t_i = \frac{m_{1i}-m_{2i}}{s_i \sqrt{2/N}} . \tag{3.26} \]
The t test has been widely used in neuroimaging data analysis since the early 1990s. Following the
work of Fox et al. (1988), Friston and colleagues (1991) have proposed t tests to measure task-
related activation in PET images. Later, they adopted this approach for fMRI data (Friston et al.,
1995A; Friston et al., 1995B; Worsley & Friston, 1995), by extending it and making it more
flexible. The extended approach is called the General Linear Model (GLM), or, more properly,
univariate GLM; see Kiebel & Holmes (2003) for a comprehensive introduction to GLM. Like
GNB, it is based on the assumption that the signal is independent across dimensions (voxels).
However, the analysis is more elaborate than performing the simple t test as in Formula 3.26.
The signal from each voxel is represented as a linear sum of predictors and an error term, so that
for the ith voxel of the jth volume we can write:
\[ x_i^{(j)} = \sum_{k} g_{jk}\,\beta_{ki} + e_{ij} . \tag{3.27} \]
Here, gjk is the value of the kth predictor at time j. The predictors are pre-defined before the
analysis, and the first part of the analysis is estimation of the predictor weights βki by linear
regression. A predictor is defined to encode a particular effect. For example, to encode task
effects, the predictor is defined so that gjk = 1 if the jth volume was acquired during that task, and gjk =
0 otherwise (to model the hemodynamic effect, the predictor is convolved with a hemodynamic
response function). It is also common to include predictors that model linear drifts (in the form
gjk = j), as well as low-order polynomial drifts.
Equation 3.27 can be written in vector form, where x_i is the time course of the ith voxel:
\[ \mathbf{x}_i = \sum_{k} \beta_{ki}\,\mathbf{g}_k + \mathbf{e}_i , \tag{3.28} \]
and also in matrix form, where X is the data matrix:
\[ \mathbf{X} = \mathbf{G}\boldsymbol{\beta} + \mathbf{e} . \tag{3.29} \]
The least-squares unbiased estimate of β is
\[ \hat{\boldsymbol{\beta}} = (\mathbf{G}^T\mathbf{G})^{-1} \mathbf{G}^T \mathbf{X} . \tag{3.30} \]
To study the differential response of a voxel to a set of tasks, we define a contrast vector c. We
select the predictors that we want to contrast with each other and set the corresponding elements
of c to 1 or -1. For example, to contrast the task-related response to a baseline, we can select
predictors that encode "task" and "baseline" conditions; the "task" predictor is set to 1, and the
baseline predictor to -1. The remaining predictors that we do not want to include into our
contrast have ck set to zero. Then, the magnitude of the estimated differential response of the ith
voxel is simply c^T β̂_i, where β̂_i is the vector of estimated predictor weights for the ith voxel. The
significance of this effect can be estimated with a t statistic, computed as follows:
\[ t_i = \frac{\mathbf{c}^T \hat{\boldsymbol{\beta}}_i}{\sqrt{\mathbf{c}^T\, \mathrm{Var}\{\hat{\boldsymbol{\beta}}_i\}\, \mathbf{c}}} . \tag{3.31} \]
Estimation of the variance of β̂_i is difficult. The simplest approach assumes that the residual errors
(eij in Formula 3.27) are independent and normally distributed. In this case,
\[ \mathrm{Var}\{\hat{\boldsymbol{\beta}}_i\} = \mathbf{e}_i^T \mathbf{e}_i\, (\mathbf{G}^T\mathbf{G})^{-1} . \tag{3.32} \]
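The GLM steps of Formulas 3.27–3.32 can be sketched on a simulated voxel time course. The design matrix, effect sizes, and noise level below are illustrative choices, not the thesis settings; note also that the residual variance here is normalized by the residual degrees of freedom, a common convention not spelled out in Formula 3.32.

```python
import numpy as np

rng = np.random.default_rng(4)
n_vols = 80

# Hypothetical design matrix G: task boxcar, linear drift, intercept.
task = np.tile(np.r_[np.zeros(10), np.ones(10)], 4)
drift = np.linspace(-1, 1, n_vols)
G = np.column_stack([task, drift, np.ones(n_vols)])

# Simulated voxel time course: true task effect 2.0, drift 0.5, unit noise.
x = 2.0 * task + 0.5 * drift + rng.normal(0, 1, n_vols)

# Eq. 3.30: least-squares estimate of the predictor weights.
beta = np.linalg.solve(G.T @ G, G.T @ x)

# Contrast selecting the task effect; Eqs. 3.31-3.32 under i.i.d. errors,
# with the residual variance normalized by the degrees of freedom.
c = np.array([1.0, 0.0, 0.0])
e = x - G @ beta
var_beta = (e @ e / (n_vols - G.shape[1])) * np.linalg.inv(G.T @ G)
t = (c @ beta) / np.sqrt(c @ var_beta @ c)
print(round(float(beta[0]), 2), round(float(t), 1))
```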
However, the assumption of temporal independence of residual errors is violated in typical fMRI
data. Because the hemodynamic response is so slow, there is strong temporal dependence in the error
terms. Worsley and Friston (1995) have tried to account for this by estimating temporal
autocorrelation and including it in Formula 3.32. The use of t tests to make spatial maps has
been very popular in the neuroscience community. Several of the most frequently used software
packages for fMRI data analysis are based on GLM, or at least include it as an option. Some
examples of such software are: SPM6, AFNI7, FSL8 and BrainVoyager9.
6 http://www.fil.ion.ucl.ac.uk/spm/
7 http://afni.nimh.nih.gov/afni/
8 http://www.fmrib.ox.ac.uk/fsl/
Temporal autocorrelation in the BOLD signal creates a problem for GLM analysis. Ordinary
least squares estimation assumes that the error terms are temporally independent and identically
distributed. If the voxelwise residual errors are correlated across time, the test statistics (and their
corresponding p values) are biased. A naïve application of a t test uses (number of volumes
minus number of predictors) as number of degrees of freedom; but the number of effective
degrees of freedom is smaller because of temporal autocorrelation, hence the bias in the t
statistic. Purdon and Weisskoff (1998) proposed an approach to deal with this bias: remove
long-range temporal correlation by high-pass filtering, and use a first-order autoregressive
model to account for short-range correlation. It is not necessary to perform high-pass filtering at
the pre-processing stage; it can be done within the univariate GLM framework by including low-
frequency oscillations into the design matrix.
3.6 Regularization of the covariance matrix
The expression of the probability density for a multivariate Gaussian distribution includes an
inverse of its covariance matrix. This matrix could be class-specific, as in the quadratic
discriminant method (see Formula 3.9), or common to all classes, as in the linear discriminant
method (see Formula 3.16). This matrix is of size K×K, where K is the number of dimensions in
the data (in fMRI, it is the number of voxels). This number could be quite large, especially if the
whole brain is acquired; typically, this number is on the order of tens of thousands. On the other
hand, the number of observations (i.e. fMRI volumes) is usually much less than that; it is on the
order of hundreds for single-subject studies, or it could be several thousand when we pool across
subjects or sessions.
High spatial resolution of fMRI comes at a price: the number of observations is less than the
number of dimensions. In this case, the covariance matrix is rank-deficient because we don't
have enough data. The rank of our matrix Sc (see Formula 3.9) is Nc-1 (here, Nc is the number of
observations, and we need to subtract 1 because of taking out the mean when computing the
covariance matrix), and it is much less than the number of voxels. The problem of inverting a
9 http://www.BrainVoyager.com/
46
rank-deficient covariance matrix is ill-posed (Friedman, 1989), and the inverse of Sc does not
exist. This is problematic for our classification (the expressions for the decision functions, in
Formulas 3.10 and 3.15, involve S_c^{-1} and S^{-1}). The GNB method does not have this problem in
either heteroscedastic or homoscedastic case, because the covariance matrix is diagonal and
therefore invertible.
The problem of inverting a rank-deficient covariance matrix is solved by regularizing the matrix,
that is, by approximating it with a full-rank matrix. One approach to regularization is to use
Principal Components Analysis (PCA) to project the data into low-dimensional space. Chapters
4 and 5 are dedicated to a detailed description of this approach. Another method of making the
covariance matrix invertible was described in a paper by Kustra and Strother (2001). In the
homoscedastic case, the covariance matrix S can be replaced by Sλ = S + λI, which is full-rank.
The heteroscedastic equivalent is class-specific matrix Sλc = Sc + λcI. The parameter λ needs to be
selected carefully; some metrics of optimization are proposed in the next chapter.
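The effect of ridge regularization is easy to verify numerically; the dimensions and the value of λ below are placeholders (in practice λ must be tuned, as discussed above).

```python
import numpy as np

rng = np.random.default_rng(5)

# Rank-deficient case: fewer observations (N) than dimensions (K).
N, K = 20, 50
X = rng.normal(size=(N, K))
S = np.cov(X, rowvar=False)            # rank is at most N - 1

print(np.linalg.matrix_rank(S))        # less than K, so S is singular

lam = 0.1                              # placeholder; must be tuned in practice
S_reg = S + lam * np.eye(K)            # ridge of Kustra & Strother (2001)
print(np.linalg.matrix_rank(S_reg))    # now K: the inverse exists
S_inv = np.linalg.inv(S_reg)           # inversion is no longer ill-posed
```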
Finally, we can reduce the data dimensionality by feature selection, that is, selecting a subset of
voxels for our analysis. If the number of selected voxels does not exceed the number of
observations in any class, the covariance matrices are full-rank. This method is described by
Mitchell et al. (2004). Although GNB does not have a problem with inverting the covariance
matrix, its efficiency is improved with this kind of voxel selection, as found by Mitchell and
colleagues.
3.7 Non-probabilistic classification: Support Vector Machines
In our evaluation of probabilistic classifiers, we have compared them with a non-probabilistic
classifier, the method of Support Vector Machines (SVM), which is popular in the field of fMRI
analysis (see e.g. Cox & Savoy, 2003; LaConte et al., 2005; Mourao-Miranda et al., 2005;
Misaki et al., 2010). This method does not build a probabilistic model for the classes, but creates
the decision function in a way that simultaneously maximizes the margin between the two
classes and minimizes the misclassification rate (Cortes & Vapnik, 1995). We have tested the
simplest version of SVM, which uses a linear kernel. The decision function for a linear-kernel SVM is
linear in x:
\[ D_{SVM}(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + b . \tag{3.34} \]
The vector w and the scalar b are found by minimizing the expression
\[ \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{n=1}^{N_{\mathrm{train}}} \xi_n , \tag{3.35} \]
subject to the constraints:
\[ t_n(\mathbf{w}^T\mathbf{x}_n + b) \ge 1 - \xi_n , \qquad \xi_n \ge 0 , \qquad n = 1, \dots, N , \]
where x1, …, xN are the training volumes and tn is 1 for volumes in class 1 and –1 for volumes in
class 2. The problem of finding the optimal set of (w, b, ξ1, …, ξN) has a unique solution, which can
be found by quadratic programming. The variables ξ1, …, ξN are called slack variables; ξi
measures the degree of misclassification for vector xi. The quantity 2/||w|| is called the margin.
The tradeoff hyperparameter C specifies the importance of accuracy of classification relative to
maximizing the margin; higher values of C force the slack variables to be smaller.
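To illustrate the objective in 3.35, the following Python/NumPy sketch minimizes the equivalent hinge-loss formulation by subgradient descent; this is a toy substitute for the quadratic-programming solver used by LIBSVM, and the learning rate and iteration count are arbitrary choices of ours:

```python
import numpy as np

def train_linear_svm(X, t, C=1.0, lr=0.01, n_iter=2000):
    """Soft-margin linear SVM trained by subgradient descent on
    (1/2)||w||^2 + C * sum_n max(0, 1 - t_n (w.x_n + b)),
    the hinge-loss form of objective 3.35.
    X: N x J matrix of training volumes; t: labels in {-1, +1}."""
    N, J = X.shape
    w = np.zeros(J)
    b = 0.0
    for _ in range(n_iter):
        margins = t * (X @ w + b)
        viol = margins < 1                      # volumes with nonzero slack
        grad_w = w - C * (t[viol, None] * X[viol]).sum(axis=0)
        grad_b = -C * t[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def decision(X, w, b):
    """D_SVM(x) = w^T x + b (Formula 3.34), applied to each row of X."""
    return X @ w + b
```

Larger values of C penalize slack more heavily, pushing the solution toward fewer margin violations, exactly as described above.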
We have used the MATLAB library LIBSVM10 (Chang & Lin, 2011) to compute the weights w and
offset b. The spatial map for this decision function is defined by the vector w (LaConte et al.,
2005). Because of time limitations, we did not evaluate SVMs with nonlinear kernels, although
they sometimes outperform linear-kernel SVMs in classifying fMRI volumes (Schmah et al.,
2010). Construction of spatial maps for nonlinear-kernel SVMs is possible, although
computationally expensive (it is described in Rasmussen et al., 2011 and 2012A).
10. http://www.csie.ntu.edu.tw/~cjlin/libsvm/
Chapter 4 Methods for estimating the intrinsic dimensionality of the data
4.1 Principal Component Analysis and dimensionality reduction
A typical fMRI data set consists of a relatively small number of volumes, each volume
containing measurements of BOLD signal for a large number of voxels. In within-subject
analysis of fMRI volumes, each volume is an observation, and each voxel defines a dimension of
the data space. Because of the complex spatio-temporal structure of BOLD signal, there is a
large amount of correlation across dimensions, making this representation highly redundant.
Another problem with high dimensionality of fMRI data is difficulty of multivariate modeling: if
we aspire to describe interactions between voxels, the number of degrees of freedom in our
model will exceed the sample size. This problem is encountered in applying Linear and
Quadratic Discriminants, as outlined in Section 3.5. Also, high data dimensionality results in a
heavy computational burden.
Principal Component Analysis can greatly improve the situation by projecting the data into a
low-dimensional space. The vector basis of this space is defined by a relatively small number of
"eigenimages". Eigenimages are mutually orthogonal vectors in voxel space. Each volume can
be represented as a linear combination of eigenimages. This orthogonal basis is given by
factoring of the data matrix X according to Eckart-Young Theorem (Reyment & Joreskog,
1996):
X = UΓVᵀ = UZᵀ .  (4.1)

Here, U and V are matrices with orthonormal columns, Γ is a diagonal matrix, and Z = VΓᵀ. All
these matrices are real-valued. The diagonal entries in the Γ matrix are called the singular values
of X. If J is the number of voxels and N is the number of volumes, then the size of X and U is
J×N, and the size of Γ, V and Z is N×N. The columns of U form the orthonormal basis of
eigenimages, and the rows of Z contain the coordinates of each data point in the eigenimage
space. The number of eigenimages is N, so this representation reduces the dimensionality of the
data from J (typically tens of thousands) to N (typically hundreds). This decomposition of X into
U, Γ and V is called a singular value decomposition (SVD) of X.
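A Python/NumPy sketch of this decomposition (economy-size SVD, so that U has N columns rather than J; the function name is ours):

```python
import numpy as np

def eigenimage_basis(X):
    """Economy-size SVD of the J x N data matrix: X = U Gamma V^T = U Z^T.
    Columns of U are the eigenimages; row i of Z = V Gamma^T holds the
    coordinates of volume i in eigenimage space."""
    U, gamma, Vt = np.linalg.svd(X, full_matrices=False)
    Z = Vt.T @ np.diag(gamma)
    return U, gamma, Z
```

NumPy returns the singular values already sorted in the decreasing order γ₁ ≥ γ₂ ≥ … ≥ γ_N assumed in the text.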
We can rewrite the Formula 4.1 as a sum:
X = Σ_{i=1}^{N} γ_i u_i v_iᵀ ,  (4.2)
where ui and vi are the ith columns of U and V, and γi is the ith singular value. The product
γiuiviT is the ith principal component (PC) of the data (Reyment & Joreskog, 1996). Principal
components are usually ordered according to the amount of variance they explain. The first
component is the best one-dimensional approximation to our data, in the least-squares sense. The
linear combination of the first two components is the best two-dimensional approximation, and
so on. Alternatively, we can say that the first eigenimage defines the direction along which the
data is most variable. If we remove the variation along this direction from our data, the new
direction of the largest variance is the direction of the second eigenimage. The third eigenimage
defines the direction of largest variance once the variations along the directions of the first and
second eigenimages have been removed, and so on. With this ordering, a declining order is imposed on the singular
values: γ1 ≥ γ2 ≥ … ≥ γN.
For mean-centered data, eigenimages are also the eigenvectors of the covariance matrix. This
matrix is computed as S = (1/(N−1)) XXᵀ. If X is factored according to the Formula 4.1, then, taking
the orthonormality of V into account, S can be written as

S = (1/(N−1)) U ΓΓᵀ Uᵀ = U Λ Uᵀ ,  (4.3)

where Λ is a diagonal matrix with entries λ_i = γ_i²/(N−1). This formula can be rewritten as

SU = UΛ ,  (4.4)

so for any i between 1 and N we can write Su_i = λ_i u_i. Therefore, the eigenvectors of the
covariance matrix are the eigenimages of X, and the corresponding eigenvalues of S are the squared
singular values of X, scaled by 1/(N−1). The proportion of variance of S explained by the first K
principal components is (Σ_{i=1}^{K} λ_i) / (Σ_{i=1}^{N} λ_i). The rank of S is determined by the number of its non-zero
eigenvalues.
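These relations are easy to verify numerically; the following Python/NumPy sketch (with arbitrary dimensions of ours) checks the eigenvalue identity and the variance decomposition:

```python
import numpy as np

rng = np.random.default_rng(3)
J, N = 300, 30
X = rng.standard_normal((J, N))
X = X - X.mean(axis=1, keepdims=True)     # mean-center each voxel's time course

S = X @ X.T / (N - 1)                     # sample covariance matrix
U, gamma, Vt = np.linalg.svd(X, full_matrices=False)
lam = gamma ** 2 / (N - 1)                # eigenvalues of S (Formula 4.3)

# S u_i = lambda_i u_i: the eigenimages are eigenvectors of S
for i in range(5):
    assert np.linalg.norm(S @ U[:, i] - lam[i] * U[:, i]) < 1e-8

# proportion of variance explained by the first K principal components
K = 5
prop = lam[:K].sum() / lam.sum()
```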
We can think of fMRI data as a mixture of signal of interest and irrelevant noise. One goal of
dimensionality reduction is to get rid of components that contain mostly noise. In a cross-
validation framework, these noisy components are specific to the training set, and do not
generalize to the test set (Hansen et al., 1999). Principal components dominated by random noise
tend to have eigenvalues at the tail end of the spectrum (it should be noted that components
containing correlated structured noise can have relatively high eigenvalues). The number of
components that contain signal of interest is the intrinsic dimensionality of the data (Cordes &
Nandy, 2006).
Overall, the data dimensionality is reduced with two steps. First, we project the data onto a set of
eigenimages given by the Eckart-Young theorem, reducing the dimensionality from J to N. Then,
we further reduce it by identifying the K principal components that contain the signal (K≤N). The
second step is more difficult than the first, and a number of methods of estimating K have been
proposed; we review and test them in this chapter.
4.2 Probabilistic Principal Component Analysis
Using the ideas of Principal Component Analysis, Tipping and Bishop (1999) have created a
probabilistic generative model called Probabilistic Principal Component Analysis (PPCA). We
will briefly discuss this model because of its importance in estimation of intrinsic dimensionality
(see, for example, Minka, 2000; Beckmann & Smith, 2004; Ulfarsson & Solo, 2008; Hansen et
al., 1999); for a comprehensive treatment of PPCA, see also Bishop (2006). J-dimensional vector
x is represented as a weighted sum of basis vectors hj, the mean vector m and an error term e:
x = Σ_{j=1}^{K} w_j h_j + m + e = Hw + m + e .  (4.5)
Here, K is the number of basis vectors, and H is a J × K matrix whose jth column is the basis
vector h_j. PPCA assumes the following distributions of w and e:

w ~ N(0, I) ,   e ~ N(0, σ²I) .  (4.6)
Therefore, the distribution of x is also a multivariate Gaussian, with mean m and covariance
matrix given by Σ = HHᵀ + σ²I. The parameters that define the distribution of x can be
combined into a parameter vector θ = (H, m, σ²).
Tipping and Bishop (1999) have provided the solution for θ that maximizes p(X|θ), the
likelihood of the observed data. The logarithm of the likelihood function is
log p(X|θ) = −(N/2) ( J log(2π) + log|Σ| + trace(Σ⁻¹S) ) ,  (4.7)
where S is the sample covariance matrix. We maximize the expression 4.7 using the eigenvalue
decomposition S = UΛUT. Since the number of basis vectors is K, the first K eigenvectors of S
define the signal subspace, and the remaining J − K eigenvectors define the noise subspace. The
variance of the noise is estimated as the mean of the last J − K eigenvalues of S:
σ̂² = (1/(J−K)) Σ_{i=K+1}^{N} λ_i ,  (4.8)
and H is estimated from the eigendecomposition of S (UK contains the first K eigenvectors of S,
and diagonal matrix ΛK contains the corresponding eigenvalues):
Ĥ = U_K (Λ_K − σ̂²I)^{1/2} R ,  (4.9)
where R is an arbitrary orthogonal matrix. Finally, the maximum-likelihood estimate of m is the
sample mean:
m̂ = (1/N) Σ_i x_i .  (4.10)
The maximum-likelihood estimate of the model parameters is θ̂ = (Ĥ, m̂, σ̂²). The coordinates of
data vector x_i in principal-component space are given by

ŵ_i = (ĤᵀĤ + σ̂²I)⁻¹ Ĥᵀ (x_i − m̂) .  (4.11)
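The estimates in Formulas 4.8–4.11 can be sketched in Python/NumPy as follows (R is taken to be the identity; the function name and test dimensions are ours):

```python
import numpy as np

def ppca_fit(X, K):
    """Maximum-likelihood PPCA estimates for a J x N data matrix X.
    Returns H_hat, m_hat, sigma2_hat and the K-dim coordinates W."""
    J, N = X.shape
    m = X.mean(axis=1, keepdims=True)                  # Formula 4.10
    Xc = X - m
    U, gamma, _ = np.linalg.svd(Xc, full_matrices=False)
    lam = gamma ** 2 / (N - 1)                         # eigenvalues of S
    sigma2 = lam[K:].sum() / (J - K)                   # Formula 4.8
    H = U[:, :K] @ np.diag(np.sqrt(lam[:K] - sigma2))  # Formula 4.9, R = I
    M = H.T @ H + sigma2 * np.eye(K)
    W = np.linalg.solve(M, H.T @ Xc)                   # Formula 4.11
    return H, m, sigma2, W
```

Note that ŵ_i is a posterior mean, so the reconstruction Ĥŵ_i + m̂ is slightly shrunk relative to the plain PCA projection.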
4.3 Methods of dimensionality estimation
Estimation of intrinsic data dimensionality has been identified as "perhaps the most important
problem in using PCA" (Ulfarsson & Solo, 2008). Many methods of solving this problem have
been proposed in the statistical, machine-learning, and signal-detection literature. Not all of these
methods work in the situation when the number of variables (J) greatly exceeds the number of
observations (N), a situation typical for fMRI data (J >> N). A review of methods for the J < N
situation is given by Peres-Neto and colleagues (2005).
Some early methods (or, rather, rules-of-thumb) of intrinsic dimensionality estimation are
described by Mardia et al. (1979). For example, one can retain the PCs that, taken together,
explain 90% of the variance in the data; or one can look for the “knee” in the eigenvalue plot (the
point at which the eigenvalue spectrum of the covariance matrix flattens out, which in a white-
noise model indicates noise-dominated components); this is called the "scree-plot method". Both
of these methods are subjective, because the threshold of 90% is an arbitrary choice, and the
scree-plot method involves visual inspection. The estimates thus obtained are poor estimates of
intrinsic dimensionality (Beckmann & Smith, 2004).
In our publication (Yourganov et al., 2011), we have surveyed a set of methods of dimensionality
estimation. They can be classified into two broad categories, which we call analytic and
empirical. Methods in the first category estimate intrinsic dimensionality K by maximizing some
criterion that is computed on the whole data set X. The criterion is an analytic expression on X,
and it is formulated so the resulting PPCA model has some desirable properties from the point of
view of Bayesian prediction and/or information theory. Empirical methods do not use an analytic
criterion, but instead optimize some metric of performance on an independent test set. Cross-
validation is used to repeatedly split the data into training and test sets. This is computationally
expensive, compared with analytic methods. However, there is evidence that empirical methods are
more sensitive to the true structure of the data; analytic methods rely on asymptotic properties
and tend to over-estimate the intrinsic dimensionality (see Hansen et al., 1999; Yourganov et al.,
2011). Also, from the pragmatic point of view, metrics of performance on an independent test set
are often easier to interpret than information-theoretic criteria.
In our survey, we have tested a variety of methods that will be described below. The analytic
approach is represented by optimization of Akaike information criterion, minimum description
length, Bayesian evidence and Stein's unbiased risk estimator. Examples of metrics to be
optimized with an empirical approach are predicted residual sum of squares, generalization error,
spatial map reproducibility, and prediction/classification accuracy. These methods have been
tested on our synthetic data described in Section 2.2.1. In addition, optimization of the area under
the ROC curve has served as the "gold standard" method of dimensionality estimation.
4.3.1 Akaike Information Criterion
Akaike (1974) proposed an information-theoretic criterion for selecting the number of free
parameters in a model. This criterion is known as the Akaike
Information Criterion (AIC), and it has been applied to estimate the number of Gaussian signal
sources (Wax & Kailath, 1985) and, subsequently, the number of independent components in
fMRI data (Calhoun et al., 2001; Li et al., 2007). Here we follow the formulation of AIC given
by Stoica & Selen (2004), who give a useful review of several other analytical methods.
The aim of AIC is to make the data likelihood computed with maximum-likelihood estimates of
the model parameters, p(X|θ̂), asymptotically approach the true data likelihood p(X|θ) as the
sample size grows to infinity. The similarity between p(X|θ̂) and p(X|θ) is measured with
cross-entropy:
I = E_θ[ log p(X|θ̂) ] .  (4.12)
Here, Eθ is the expected value under true model parameters θ. In practice, they are unknown.
Akaike approximates the expression 4.12 with the expected value under the maximum-likelihood
parameter estimates (rather than true model parameters):
I ≈ E_θ̂[ log p(X|θ̂) ] − D ,  (4.13)
where D is the number of degrees of freedom in the model. The unbiased estimate of this
approximation to I is
Î = log p(X|θ̂) − D .  (4.14)
The model is chosen to maximize Î, or, alternatively, to minimize the expression
AIC = −2 log p(X|θ̂) + 2D ,  (4.15)
which is known as the Akaike Information Criterion.
Wax and Kailath (1985) applied AIC to select the number of Gaussian sources for a model that
represents the data as a linear combination of sources. This model is similar to PPCA, except that
the sources are not assumed to be orthogonal. Also, the expected mean m of the data vectors is
assumed to be zero. The method was developed for complex sources, but we will assume that the
sources are real-valued. The model can be written as
x = Σ_{j=1}^{K} w_j h_j + e = Hw + e .  (4.16)
The noise vector e is independent from the data, and has a multivariate Gaussian distribution
with zero mean and a covariance matrix given by σ2I. The log-likelihood term in 4.15 can be
computed using the maximum-likelihood estimators of H and σ2 given by 4.9 and 4.8. This term
is given by
log p(X|θ̂) = N(J−K) log [ ( Π_{i=K+1}^{J} λ_i )^{1/(J−K)} / ( (1/(J−K)) Σ_{i=K+1}^{J} λ_i ) ] ,  (4.17)

that is, the log-likelihood is governed by the ratio of the geometric mean to the arithmetic mean of the eigenvalues outside the signal subspace.
Let us now compute the number of degrees of freedom in the model. Using maximum-likelihood
estimates, the model is fully specified by σ2 and the first K eigenvalues and eigenvectors of the
sample covariance matrix S. The eigenvectors are J-dimensional, and constrained to be mutually
orthonormal. The number of degrees of freedom in U_K is therefore JK − K − (1/2)K(K−1)
(normalization removes K degrees of freedom, and mutual orthogonalization removes
(1/2)K(K−1) degrees of freedom). The remaining parameters (K eigenvalues and σ²) contribute
K+1 degrees of freedom to the model. Overall, the number of degrees of freedom is

D = JK − (1/2)K(K−1) + 1 .  (4.18)
By substituting expressions 4.18 and 4.17 into 4.15, we get an expression for AIC, which can be
evaluated for a range of values of K. The value of K that minimizes AIC can serve as an estimate
of intrinsic data dimensionality (Calhoun et al., 2001; Li et al., 2007).
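A Python/NumPy sketch of this procedure, computing AIC from a (sorted, descending) eigenvalue spectrum via Formulas 4.15, 4.17 and 4.18 (the function names are ours):

```python
import numpy as np

def loglik_wax_kailath(lam, K, N):
    """Log-likelihood term of Formula 4.17: N(J-K) times the log of the
    ratio of the geometric to the arithmetic mean of the eigenvalues
    outside the K-dimensional signal subspace."""
    J = len(lam)
    tail = lam[K:]
    geom = np.exp(np.mean(np.log(tail)))
    arith = np.mean(tail)
    return N * (J - K) * np.log(geom / arith)

def aic(lam, K, N):
    """Akaike Information Criterion, Formula 4.15, with degrees of
    freedom D from Formula 4.18."""
    J = len(lam)
    D = J * K - K * (K - 1) / 2 + 1
    return -2.0 * loglik_wax_kailath(lam, K, N) + 2.0 * D
```

When the tail eigenvalues are equal (pure white noise beyond K components), the log-likelihood term vanishes and AIC reduces to the 2D penalty, so the criterion trades goodness of fit against model order.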
4.3.2 Minimum Description Length
Rissanen (1978) proposed another information-theoretic criterion based on encoding theory.
Imagine that we encode our data X with binary digits. To minimize the length of encoded data,
we assign shorter codes to data vectors that occur more often. Minimum Description Length
(MDL) criterion is used to select the model that assigns the shortest binary sequence to the data.
Here we are considering the encoding of any possible data, not just the observed data that were
used to estimate the model parameters. Strictly speaking, the problem of finding such a model is
equivalent to computing Kolmogorov complexity and is not computable. Rissanen's criterion is
an approximation developed for large data sets.
For an estimate θ̂ of the parameter vector, the probability of the data is given by the
likelihood p(X|Ĥ, m̂, σ̂²). For the observed data, the model that maximizes the likelihood is
also the model that produces the shortest description length. However, maximum-likelihood
estimates are biased (Bishop, 2006) and need to be corrected so the model produces the shortest
description of the previously unseen data. Rissanen arrived at the following criterion:
MDL = −log p(X|θ̂) + (1/2) D log N ,  (4.19)
which is similar to AIC (formula 4.15) but includes the logarithm of N (sample size) into the
term that corrects the bias of the maximum-likelihood estimate. It is worth noting that the
Bayesian Information Criterion (BIC) introduced by Schwarz (1978) has the same formulation
(see Wax & Kailath, 1985).
Wax and Kailath (1985) have applied MDL to the problem of selection of the number of
Gaussian sources. The resulting formulation of MDL is similar to their formulation of AIC, with
the log-likelihood term computed according to 4.17 and the number of degrees of freedom
computed according to 4.18. This formulation was adapted to the problem of selecting the
number of independent components by Calhoun et al. (2001) and further by Li et al. (2007).
MDL was implemented as a part of the GIFT software package11. The efficiency of BIC as a
dimensionality estimation method was tested by Cordes and Nandy (2002), by Ulfarsson and
Solo (2008) and by Beckmann and Smith (2004).
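A Python/NumPy sketch of MDL evaluated on an eigenvalue spectrum, reusing the Wax-Kailath log-likelihood (Formula 4.17) and degrees of freedom (Formula 4.18); the function name is ours:

```python
import numpy as np

def mdl(lam, K, N):
    """Minimum Description Length, Formula 4.19, computed from a sorted
    (descending) eigenvalue spectrum lam of the sample covariance matrix."""
    J = len(lam)
    tail = lam[K:]
    geom = np.exp(np.mean(np.log(tail)))        # geometric mean of tail
    arith = np.mean(tail)                       # arithmetic mean of tail
    loglik = N * (J - K) * np.log(geom / arith) # Formula 4.17
    D = J * K - K * (K - 1) / 2 + 1             # Formula 4.18
    return -loglik + 0.5 * D * np.log(N)
```

Compared with AIC, the penalty grows with log N, so MDL tends to select smaller dimensionalities for large samples.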
4.3.3 Bayesian evidence
Minka (2000) put the problem of selecting the number of principal components into a fully
Bayesian framework. PPCA is used as the probabilistic model. For a given data set and its
sample covariance matrix, the maximum-likelihood estimates of the model parameters H and σ² are
determined according to the value of K. Minka proposed to select the K that produces Ĥ and σ̂² in
such a way that the likelihood p(X|Ĥ, m̂, σ̂²) is maximized. In other words, we need to select the K
that maximizes the Bayesian evidence p(X|K).
Bayesian evidence can be computed by integrating over the parameters of the model:
p(X|K) = ∫ p(X|θ) p(θ|K) dθ .  (4.20)
Here, θ denotes the parameter vector (H, m, σ2). The factor p(θ|K) is the prior distribution of
model parameters for a given value of K. Minka uses a non-informative (flat) prior for m, and
the prior distributions for H and σ² have a hyper-parameter α which controls the sharpness of
the prior (α is small for non-informative priors). H is factorized as H = U(Λ_K − σ̂²I)^{1/2}R, where
R is an arbitrary orthogonal matrix. The prior for U is uniform over the set of orthonormal
K-frames in R^J (the Stiefel manifold):
11. http://icatb.sourceforge.net
p(U|K) = 2^{−K} Π_{i=1}^{K} Γ((J−i+1)/2) π^{−(J−i+1)/2} ,  (4.21)

while the priors for Λ_K and σ² are broad distributions whose sharpness is controlled by the
hyper-parameter α (see Minka, 2000, for their exact form).
Approximate solutions for the integral in 4.20 can be found using the Laplace approximation, in
which the integrand is approximated with a Gaussian centered at its mode (see Bishop (2006) for a
treatment of the Laplace approximation). We arrive at the following approximation of the Bayesian
evidence:

p(X|K) ≈ p(U|K) ( Π_{i=1}^{K} λ_i )^{−N/2} (σ̂²)^{−N(J−K)/2} (2π)^{(m+K)/2} |A_Z|^{−1/2} N^{−K/2} .  (4.22)
Here, m = JK − K(K+1)/2, and the determinant of the Hessian matrix A_Z is given by

|A_Z| = Π_{i=1}^{K} Π_{j=i+1}^{J} ( λ̃_j⁻¹ − λ̃_i⁻¹ )( λ_i − λ_j ) N ,  (4.23)

where λ_k is the kth eigenvalue of the sample covariance matrix S, and λ̃_k takes the value of λ_k if
k ≤ K and the value of σ̂² otherwise.
Beckmann and Smith (2004) use this Bayesian approach to select the number of principal
components in fMRI data. The components are then rotated so the mutual information shared
across components becomes zero. This method is called Probabilistic Independent Component
Analysis and is implemented in the popular MELODIC software package12.
4.3.4 Stein's Unbiased Risk Estimator
Ulfarsson and Solo (2008) have developed a method of selecting the number of principal
components, which is based on minimization of risk (defined as the expected mean squared
error). The data are represented with the PPCA model given in equation 4.5. In this model, the
12. http://www.fmrib.ox.ac.uk/analysis/research/melodic/
risk is the expected value of the squared difference between Hw + m and its estimate Ĥŵ + m̂.
The authors use the unbiased estimate of risk proposed by Stein:

SURE = (1/N) Σ_{i=1}^{N} ‖x_i − (Ĥŵ_i + m̂)‖² + (2σ̂²/N) Σ_{i=1}^{N} trace( ∂(Ĥŵ_i + m̂)/∂x_iᵀ ) − Jσ̂² .  (4.24)
Here, Ĥ and m̂ are the maximum-likelihood estimates given by Formulas 4.9 and 4.10, and ŵ_i is
given by 4.11. We could use the maximum-likelihood estimate of σ² given by 4.8, but Ulfarsson and
Solo state that this estimate is too strongly influenced by our choice of dimensionality K. Using
random matrix theory, they develop a different estimate of σ² which does not require a good
estimate of K. Note, however, that the maximum-likelihood estimate of σ² is still used to compute Ĥ
and ŵ_i.
The noise variance σ² can be estimated from the eigenvalues of the sample covariance matrix S.
If the columns of the J × N data matrix X come from a multivariate Gaussian distribution, S has a
Wishart distribution. Let γ = J/N. As J → ∞ and N → ∞, the distribution of the eigenvalues of S
converges to a Marcenko-Pastur distribution:

f(λ) = √( (b − λ)(λ − a) ) / ( 2πγσ²λ )  if a ≤ λ ≤ b,  and 0 otherwise,  (4.25)

where a = σ²(1 − γ^{1/2})² and b = σ²(1 + γ^{1/2})². The values of a and b define the asymptotic
boundaries on the range of eigenvalues: as the size of the matrix grows, λ₁ → b and λ_J → a.
Ulfarsson and Solo also provide the asymptotic formula for the median eigenvalue for given γ
and σ2. Let F½ be the asymptotic median eigenvalue for the situation when σ2 = 1.
The random-matrix estimate of noise variance is computed in two steps. First, we get a rough
estimate by dividing the median eigenvalue of S by F½. Then, b is obtained using this rough
estimate. A rough estimate of K is obtained as the index of the largest eigenvalue of S which is
greater than b. The final estimate of σ2 is computed as the ratio of the median of λK+1,…, λN to
F½.
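A Python/NumPy sketch of this two-step estimate, with the median F½ obtained by numerically integrating the density 4.25 rather than from the asymptotic formula of Ulfarsson and Solo; it is illustrated in the γ = J/N < 1 regime, where the density integrates to one (the thesis data have γ > 1, where a point mass at zero must also be handled):

```python
import numpy as np

def mp_bounds(gamma_ratio, sigma2):
    """Asymptotic eigenvalue range [a, b] of the Marcenko-Pastur law (4.25)."""
    a = sigma2 * (1.0 - np.sqrt(gamma_ratio)) ** 2
    b = sigma2 * (1.0 + np.sqrt(gamma_ratio)) ** 2
    return a, b

def mp_median(gamma_ratio):
    """Median F_1/2 of the Marcenko-Pastur density for sigma^2 = 1,
    found by numerically integrating f(lambda) over [a, b]."""
    a, b = mp_bounds(gamma_ratio, 1.0)
    lam = np.linspace(a, b, 20001)[1:-1]        # open interval, avoid endpoints
    f = np.sqrt((b - lam) * (lam - a)) / (2 * np.pi * gamma_ratio * lam)
    cdf = np.cumsum(f)
    cdf = cdf / cdf[-1]
    return lam[np.searchsorted(cdf, 0.5)]

def noise_variance(lam_sorted, gamma_ratio):
    """Two-step random-matrix estimate of sigma^2: a rough estimate from
    the median eigenvalue, then a refined estimate using only the
    eigenvalues below the rough upper edge b (lam_sorted: descending)."""
    F_half = mp_median(gamma_ratio)
    rough = np.median(lam_sorted) / F_half
    _, b = mp_bounds(gamma_ratio, rough)
    K_rough = int(np.sum(lam_sorted > b))       # eigenvalues above the noise edge
    return np.median(lam_sorted[K_rough:]) / F_half
```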
With this estimate of noise variance, we can compute SURE (Stein's Unbiased Risk Estimator)
for a given K according to 4.24. To estimate intrinsic dimensionality, we select K that minimizes
SURE.
4.3.5 Predicted Residual Sum of Squares
Cross-validatory procedures of dimensionality estimation were introduced by Wold (1978) and
by Eastment and Krzanowski (1982; see also Krzanowski & Kline, 1995). The number of
principal components is selected to optimize the Predicted Residual Sum of Squares (PRESS)
statistic. Let K be the number of components that are used to approximate the data matrix. With
respect to a specific voxel, this approximation can be written as
x̂_ij = Σ_{k=1}^{K} u_jk γ_k v_ik .  (4.26)
The PRESS statistic is computed as the mean squared residual, obtained by squaring the difference
between the true voxel signals and their corresponding approximations:

PRESS = (1/(NJ)) Σ_{j=1}^{J} Σ_{i=1}^{N} ( x_ij − x̂_ij )² .  (4.27)
To estimate the intrinsic dimensionality, we select the K that produces the approximation that minimizes
PRESS.
However, there is a flaw in this simple scheme of computing PRESS. All voxels of the data
matrix are used to compute the U, Γ and V matrices. Therefore, the value of x_ij is used to
compute its own approximation x̂_ij given by 4.26. This introduces a bias into the computed PRESS
statistic. To avoid it, we need to make sure that the computation of x̂_ij is performed without
utilizing x_ij. Eastment and Krzanowski (1982) solve this issue by removing, in two separate
steps, the jth row and ith column from X. Let X(i) be the data matrix with ith column removed,
and X(j) be the matrix with the jth row removed. We compute singular value decomposition for
these matrices:
X(i) = U(i) Γ(i) V(i)ᵀ ,
X(j) = U(j) Γ(j) V(j)ᵀ .  (4.28)
Then we compute the approximation for x_ij using U(i) (which retains the row for voxel j), V(j)
(which retains the entry for volume i), and both Γ(i) and Γ(j):

x̂_ij = Σ_{k=1}^{K} u_jk(i) √(γ_k(i)) √(γ_k(j)) v_ik(j) .  (4.29)
This way, PRESS is computed by leave-one-out cross-validation: the true value of x_ij serves as
test data that is independent of the training data that was used to compute x̂_ij.
4.3.6 Generalization error
Hansen and colleagues (1999) have developed a cross-validatory approach to dimensionality
estimation which is more efficient than PRESS minimization. It utilizes repeated splits into
training and test data, which is faster than the somewhat cumbersome process of removing rows
and columns from the data matrix as described above. The metric of performance is
generalization error, which tells us how well the model computed from the training set
generalizes to the independent test set. Hansen et al. have also produced a valuable comparison
of the empirical estimate of generalization error to its analytic estimate.
Generalization error is computed using the log-likelihood function. The theoretical formula for
generalization error is
G(θ) = −∫ log p(x|θ) p(x) dx ,  (4.30)
which tells us how well the model specified by θ generalizes to all possible data vectors. In
practice, we can estimate G(θ) from a finite set of observations. In an earlier paper, Hansen and
Larsen (1996) have derived an analytical estimate for G(θ):
Ĝ(θ) = ( −log p(X|θ) + D ) / N ,  (4.31)
where D is the number of degrees of freedom in θ. Note the similarity to other analytical criteria,
such as AIC (Formula 4.15) and MDL (Formula 4.19).
Hansen et al. use the PPCA model, where the log-likelihood function is given by Formula 4.7.
This function is used to compute the analytical estimate of G(θ) in Formula 4.31. An empirical
estimate of G(θ) can be computed using a split into training and test sets. The training set is used
to compute a maximum-likelihood estimate θ̂ = (Ĥ, m̂, σ̂²) for a specific value of intrinsic
dimensionality K, as described in Section 4.2. Then, the generalization error is the negative log-
likelihood of the test set. Omitting the proportionality coefficient N/2 from the log-likelihood
function given by expression 4.7, we get
Ĝ(θ̂) = J log(2π) + log|Σ̂| + trace( Σ̂⁻¹ S_test ) .  (4.32)

Here, S_test is the sample covariance matrix of the test data, and Σ̂ is the estimate of the population
covariance matrix according to the model θ̂: Σ̂ = ĤĤᵀ + σ̂²I. We compute Ĝ(θ̂) for a series of
training-test splits of the data over a range of K, and select the K that minimizes the mean value
(across splits) of Ĝ(θ̂).
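A Python/NumPy sketch of this empirical estimate: fit PPCA (Formulas 4.8-4.10) on a training half and evaluate Formula 4.32 on the covariance of the test half (the function names and data dimensions are ours):

```python
import numpy as np

def ppca_cov(X_train, K):
    """Model covariance Sigma_hat = H H^T + sigma^2 I from a PPCA fit
    (Formulas 4.8-4.9) on the training data."""
    J, N = X_train.shape
    Xc = X_train - X_train.mean(axis=1, keepdims=True)
    U, gamma, _ = np.linalg.svd(Xc, full_matrices=False)
    lam = gamma ** 2 / (N - 1)
    sigma2 = lam[K:].sum() / (J - K)
    H = U[:, :K] * np.sqrt(lam[:K] - sigma2)
    return H @ H.T + sigma2 * np.eye(J)

def gen_error(X_train, X_test, K):
    """Empirical generalization error, Formula 4.32:
    J log(2 pi) + log|Sigma_hat| + trace(Sigma_hat^{-1} S_test)."""
    J, Nt = X_test.shape
    Sigma = ppca_cov(X_train, K)
    Xc = X_test - X_test.mean(axis=1, keepdims=True)
    S_test = Xc @ Xc.T / (Nt - 1)
    _, logdet = np.linalg.slogdet(Sigma)
    return J * np.log(2 * np.pi) + logdet + np.trace(np.linalg.solve(Sigma, S_test))
```

In practice this is evaluated over many split-half resamplings and a range of K, and the K with the smallest mean error is retained.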
4.3.7 Reproducibility and Classification Accuracy
Another method of intrinsic dimensionality estimate comes from the NPAIRS literature (e.g.,
Strother et al., 2002; Strother et al., 2004; LaConte et al., 2003; Shaw et al., 2003). Here, the
dimensionality of the data is reduced with SVD, and a subset of principal components forms the
low-dimensional input to an analysis model. Performance of the model is greatly influenced by
the number of principal components selected for analysis. Selecting too few components leads to
a simplistic model with an inherent bias; selecting too many is analogous to overfitting (trying to
model the noise in the training set). The result of overfitting is increased variance of the model
performance on the independent test set (LaConte et al., 2003). This is an example of bias-
variance trade-off, where a complex model has high variance and low bias, and a simplistic
model has high bias and low variance (Hastie et al., 2009). In our case, the number of PCs that
we retain for our analysis defines the complexity of the model.
The NPAIRS framework uses two metrics of performance: reproducibility of spatial maps and
accuracy in predicting mental states. These metrics were described in Sections 2.1.2 and 2.1.3,
respectively. The approach suggested by the NPAIRS literature is to retain the number of
principal components such that one or both of the two metrics are optimized. It is possible to
optimize a combination of the two metrics, namely, the Euclidean distance from the "perfect
predictor". On prediction-reproducibility plots, "perfect predictor" corresponds to the point (R=1,
P=1), and the number of components that corresponds to the point closest to (R=1, P=1) can
serve as an estimate of intrinsic dimensionality. The prediction-reproducibility plots are
described in Section 2.1.4.
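A minimal sketch of this selection rule (the prediction and reproducibility arrays in the test are hypothetical values, one per candidate dimensionality):

```python
import numpy as np

def best_dimensionality(prediction, reproducibility):
    """Select the number of PCs whose (P, R) point lies closest, in
    Euclidean distance, to the perfect predictor (P=1, R=1)."""
    P = np.asarray(prediction, dtype=float)
    R = np.asarray(reproducibility, dtype=float)
    d = np.sqrt((1.0 - P) ** 2 + (1.0 - R) ** 2)
    return int(np.argmin(d)) + 1      # dimensionalities counted from 1
```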
Instead of optimizing accuracy of prediction, we can select the number of components that
optimizes accuracy of classification (the difference between prediction accuracy and
classification accuracy is described in Section 2.1.3). Classification accuracy can be used if our
algorithm classifies the data without explicitly computing posterior probability of data belonging
to a particular class. In a recent paper (Schmah et al., 2010) we have selected the number of
principal components to optimize classification accuracy. In a follow-up paper (Yourganov et al.,
2010), we have shown that these estimates of dimensionality are relevant for prediction of
recovery from a motor stroke. Post-stroke recovery of function is a process of re-building the
cortical networks disrupted by the stroke, and the number of principal components that best
describes our data is an indicator of this recovery.
4.3.8 The Area under a ROC curve
In simulated data where the "ground truth" is known, we can use ROC metrics to select the
number of principal components. We use a subset of principal components as a low-dimensional
approximation to our data, and process it with our analytical model. Then we compute the
frequency of false positives and true positives, and construct the ROC curve as described in
Section 2.1.1. The partial area under the ROC curve that lies between FPF = 0 and FPF = 0.1 is
our optimization metric for selecting the number of principal components.
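A Python/NumPy sketch of the partial ROC area by trapezoidal integration (the linear interpolation at the cutoff is our implementation choice):

```python
import numpy as np

def partial_auc(fpf, tpf, fpf_max=0.1):
    """Partial area under the ROC curve between FPF = 0 and FPF = fpf_max.
    fpf and tpf must be sorted by non-decreasing FPF and start at (0, 0);
    the curve is linearly interpolated at the cutoff."""
    fpf = np.asarray(fpf, dtype=float)
    tpf = np.asarray(tpf, dtype=float)
    t_cut = np.interp(fpf_max, fpf, tpf)      # TPF at the FPF cutoff
    keep = fpf < fpf_max
    x = np.concatenate([fpf[keep], [fpf_max]])
    y = np.concatenate([tpf[keep], [t_cut]])
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)
```

A perfect classifier attains the maximum possible value fpf_max (0.1 here), while the chance diagonal yields fpf_max²/2.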
In our survey of intrinsic dimensionality estimation (Yourganov et al., 2011), we have used this
method as our "gold standard" method of estimating the dimensionality in simulated data. In our
simulated data, it is not always possible to say what is the "true" intrinsic dimensionality. When
ρ=0.99, the active areas are strongly coupled and can be described with a single PC, therefore it
could be argued that the true dimensionality in this case is 1. In the other extreme case, when
ρ=0, the active areas are mutually independent and the true dimensionality is the number of
active areas (sixteen). However, it is unclear what the true dimensionality is when ρ=0.5.
Therefore, we do not use the idea of "true dimensionality" when evaluating the methods of
dimensionality estimation; instead, we see how closely their results match the estimates obtained
with ROC area optimization.
Chapter 5 Intrinsic dimensionality estimation: results
5.1 Simulated data
Dimensionality estimation methods have been tested on a large number of artificial fMRI data
sets. Simulations are described in Section 2.2.1. Each data set contains 200 volumes organized
into alternating "activation" and "baseline" epochs, with 10 volumes per epoch. Volumes in
“baseline” epochs are constructed by adding spatially smoothed Gaussian noise to a background
structure. Volumes of the “activation” epochs are constructed in the same way, with an addition
of Gaussian signal at 16 specific “active” areas. We call this signal “task-related”, because it is
present in the “activation” epochs and absent in “baseline” epochs. Time courses from every
pixel have been convolved with a model HRF. Because of the slow time-course of the
hemodynamic response, we have discarded the first 2 volumes of each epoch, leaving us with
160 volumes in total.
We have used these simulated data to study the impact of various parameters of task-related
signal on the estimates of dimensionality. These parameters are: mean magnitude of the active
signal (M), ratio of variance of this signal to the variance of background noise (V), and
correlation of timecourses across the 16 nodes of our simulated signal network (ρ). We have
varied M from 0 to 0.05 (in increments of 0.01), V from 0.1 to 1.6 (in increments of 0.25); ρ was
set to the values of 0, 0.5, and 0.99. For each setting of (M, V, ρ), we have constructed 100
artificial data sets as described above, and an additional group of 100 “false positive” data sets
consisting of 200 “baseline” epochs.
We have discovered that, in fact, many methods give the same estimates of dimensionality
irrespective of M, V and ρ. Other methods, however, are more sensitive: when the task-related
signal is stronger and/or the network coupling is more pronounced, this is reflected in smaller
estimates. Our results have been reported in a publication (Yourganov et al., 2011).
5.1.1 Analytic methods
We have tested a number of analytic methods of dimensionality estimation: optimization of
Akaike Information Criterion (AIC), Minimum Description Length (MDL), Stein's Unbiased
Risk Estimator (SURE), and Bayesian Evidence. The first three methods have been implemented
in MATLAB, and the corresponding cost functions have been evaluated for the range of PC
dimensionalities from 1 to 160. Bayesian evidence has been computed by the MELODIC
software (a part of the FSL package for fMRI data analysis); this software has computed
Bayesian evidence for a range of K from 1 to 103. In the results presented below, there is a
difference between Bayesian evidence (as a part of MELODIC software) and other cost
functions (implemented by us in Matlab). This difference might be due to the additional
processing steps that MELODIC takes, such as eigenspectrum adjustment. It would be
interesting to test our own implementation of Bayesian evidence, rather than test MELODIC as a
black box; however, this testing was not done because of time limitations.
Figure 5.1. Normalized cost function of several methods of dimensionality estimation.
The results are displayed in Figure 5.1, where we plot the cost function for each PC
dimensionality. Cost functions are normalized to the [0...1] range. The number of PCs that
minimizes the cost function is the estimate of intrinsic dimensionality. Dashed lines correspond
to the analytic methods of estimation; AIC and MDL cost functions are plotted together, because
they are virtually identical except for the scale factor. Solid lines show the cost functions for two
empirical methods (PRESS and generalization error), which are discussed in the next section.
The figure plots the results for the data set created using the following parameters: M = 0.01, V =
0.1, and ρ = 0. The plot of cost functions for other values of M, V and ρ looks identical, which
suggests that the four analytic methods give the same dimensionality estimates irrespective of the
underlying network structure and of signal-to-noise ratio in the data.
With the exception of Bayesian evidence optimization, these methods tend to produce inflated
estimates of intrinsic dimensionality. Cost functions for AIC/MDL and SURE tend to decrease
with increasing K until they reach the point of saturation (AIC/MDL is the first method to reach
this point, at K = 140). It is interesting to observe the similarity between SURE (an analytic
method) and PRESS (an empirical method). Optimization of PRESS is equivalent to
minimization of squared difference between the data matrix and its approximation computed
using a subset of principal components. Optimization of SURE is similar conceptually, but,
instead of trying to approximate the data matrix as closely as possible, we try to minimize the
squared difference between the "true" PPCA model and its maximum-likelihood estimate. The
similarity between these two cost functions indicates that PPCA provides a fairly good
representation of our simulated data (except for the fact that our simulations involve HRF
convolution).
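The quantity underlying PRESS-style criteria can be illustrated with a short NumPy sketch (an illustration, not the thesis's MATLAB code): the squared error of a rank-K PCA reconstruction of the data matrix. Without a leave-one-voxel-out safeguard, this training-set error can only decrease as K grows, mirroring the monotone AIC/MDL and SURE curves in Figure 5.1:

```python
import numpy as np

def lowrank_sse(X, K):
    """Squared error between data matrix X (time x voxels) and its
    reconstruction from the first K principal components."""
    Xc = X - X.mean(axis=0)                        # center each voxel
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    Xk = (U[:, :K] * s[:K]) @ Vt[:K]               # rank-K reconstruction
    return float(np.sum((Xc - Xk) ** 2))

rng = np.random.default_rng(0)
X = rng.standard_normal((160, 300))                # pure-noise "data set"
errors = [lowrank_sse(X, K) for K in (1, 10, 50, 160)]
# On the training set the error only decreases as K grows,
# vanishing once all components are retained.
print(errors)
```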
Analytical methods do not involve evaluation on an independent test set and can be thought of as
"optimization on the training set". Indeed, in order to get the "best" approximation of the training
set, we should use all of the principal components; this is reflected in Figure 5.1, where the
cost function of AIC/MDL and SURE steadily decreases as we increase the number of
components. The "saturation" of the cost function when K>140 is perhaps due to the temporal
autocorrelation in the data imposed by HRF convolution.
The cost function for Bayesian evidence behaves quite differently from AIC/MDL and SURE. It
must be noted that we used the FSL MELODIC software to compute Bayesian evidence,
whereas the other analytical methods were implemented by us in Matlab; therefore, the
difference might be due to implementation as well as due to conceptual difference between the
cost functions. For instance, prior to dimensionality estimation, MELODIC performs high-pass
temporal filtering and adjustment of eigenspectrum in order to account for limited sample size
(Beckmann & Smith, 2004). It is interesting to note that the dimensionality estimate obtained
with this method is 16, which is the number of active regions in our simulations. The goal of
MELODIC is to represent the data matrix as a linear combination of sparse sources of signal
(Beckmann et al., 2005); we can say that MELODIC identifies the sparse sources (active areas)
correctly. It fails to detect the fact that these sources are correlated into a single spatial network;
this is consistent with reports that MELODIC splits functional cortical networks into different
"sources" (e.g. Abou-Elseoud et al., 2010). Therefore, it is a bad choice of analysis if we are
interested in analyzing functional networks.
5.1.2 Empirical methods: PRESS and Generalization error
Figure 5.1 shows results of some empirical methods, in addition to analytic methods. The
empirical methods shown here are optimization of PRESS and of generalization error (GE in
short), computed for the same data set. The corresponding cost functions are normalized to
[0...1] range. As described in the previous section, the PRESS method behaves in much the same
way as the cost functions for SURE, AIC and MDL methods. Optimization of GE gives us a cost
function with very different behaviour. It has a minimum at K=1, and then rises sharply.
The difference between these two empirical methods could be due to the difference in their
"training/test" framework. GE optimization utilizes the resampling framework common in
machine learning, when a subset of observations is held out during the training stage, to be used
later for testing the trained model. PRESS optimization does not use a test set in the traditional
sense; rather, it tries to build a voxel-wise approximation of the training set (making sure that the
voxel which is to be approximated is not used to compute the approximation; see Section 4.3.5).
Figure 5.1 shows that this approach is very similar to optimization of SURE.
GE optimization tries to identify the PC subspace that is common to the training and test sets.
Our results indicate that this subspace is limited to the first principal component, that is, the
direction of maximum variance in the data. This variance could be driven by the difference
between the "active" and "baseline" volumes (GE optimization is an unsupervised method, that
is, it does not account for the fact that the data come from two classes). It could be argued that
our active network is indeed one-dimensional when ρ>0, making GE optimization suitable for
this situation. However, when ρ=0, the signal is independent across loci and therefore cannot be
considered truly one-dimensional. Overall, the understanding of GE optimization requires some
further analysis (including simulations of multi-dimensional networks).
5.1.3 Empirical methods: Reproducibility and Classification Accuracy
The methods discussed so far in this chapter give very consistent estimates of intrinsic
dimensionality of simulated data, irrespective of the signal parameters. The cost functions
displayed in Figure 5.1 are virtually identical for all levels of M, V and ρ (with one exception:
Bayesian evidence optimization shows a slight tendency to increase with growing V, as shown
below). No matter whether loci of activation form a spatial network or are mutually independent,
the cost function is optimized with the same number of components.
Figure 5.2. Reproducibility and classification accuracy for linear and quadratic
discriminants, as a function of number of principal components, for two simulated data
sets. Left and right plots correspond to a weak (V=0.1) and strong (V=1.6) variance of the
signal, respectively, for moderate connectivity of ρ=0.5.
However, other empirical methods are much more sensitive to the spatio-temporal structure of
the signal. This sensitivity has been observed when we have used cost functions that directly
measure the performance of our data analysis, using such metrics as reproducibility of spatial
maps, accuracy of predicting the task, and area under ROC curve. Figure 5.2 shows how
reproducibility and classification accuracy vary with K, for two methods of analysis (linear and
quadratic discriminant). Left and right plots correspond to two artificial data sets. Both sets
contain relatively strong task-related signal (M =0.03), and a moderate level of spatial correlation
across active loci (ρ =0.5). What is different in the two plots is the temporal variance of the task-
related signal: it is very small (V=0.1) in the data set displayed on the left plot, and quite large
(V=1.6) in the data set displayed on the right plot.
This difference in V has a marked effect on the performance of both linear and quadratic
discriminants. For small V (left plot), both performance metrics are optimized when we use the
number of components that lies somewhere between 7 and 10. For large V, we get much better
performance when just the first component is used. Reproducibility of LD and QD has a clear
maximum at K=1, and classification accuracy is optimized when K is no more than 3.
The difference in cost functions for different values of V can be explained by the effect of V on
the shape of the eigenspectrum of the sample covariance matrix. This is demonstrated in Figure
5.3, which shows the first 10 eigenvalues of the sample covariance matrix. The network
correlation ρ is 0.5, M is set to 0.01 (upper panel) and 0.03 (lower panel), V varies from 0.1 to
1.6; the plot shows the eigenvalues averaged across 100 simulated data sets. We can see that, for
high values of V, the first eigenvalue is clearly separated from the rest. If V is not sufficiently
high, the first eigenvalue is not distinct from the rest of the spectrum.
Figure 5.3. Plot of the first 10 eigenvalues of the covariance matrix of a single data set, for
M = 0.01 (top) and M = 0.03 (bottom); ρ is set to 0.5, and V varies from 0.1 to 1.6 in
increments of 0.5. Eigenvalues are averaged across 100 simulated data sets.
The effect of V is most evident in the first eigenvalue, and close to negligible in the remaining
spectrum. This happens because the active areas form a single correlated network, which, in the
absence of noise, would be fully captured by the first principal component. In our simulations,
modification of V changes the variance of the active signal, but the variance of the noise is kept
constant. The increase of V leads to the increase of the total variance that is due to the active
signal, and this, in turn, increases the proportion of variance explained by the first principal
component (that is, the magnitude of the first eigenvalue).
This shift in eigenspectrum shape is analyzed in a paper by Hoyle and Rattray (2004), where it is
described as a phase transition of the eigenspectrum. Their simulated data contained a
multivariate-Gaussian latent variable embedded in white Gaussian noise, which is similar to our
simulations. The latent variable can be thought of as a symmetry-breaking direction: white
background noise is symmetrical in data space, and the latent variable defines the direction of
asymmetry. The authors studied the effect of the variance along this symmetry-breaking
direction (which is similar to our definition of relative signal variance V) on the eigenspectrum.
For a given number of N-dimensional data samples, there is a critical value of variance along the
symmetry-breaking direction; if the variance is above it, the first eigenvalue is separated from
the remaining eigenspectrum.
This phase transition is described in some earlier papers in statistical physics (Biehl & Mietzner,
1994; Watkin & Nadal, 1994; Reimann et al., 1996; see also a later paper by Hoyle & Rattray,
2007). It is usually formulated using α, the ratio of sample size to the number of variables (which
is the number of voxels for fMRI data). If α is above a certain critical threshold, it is possible to
learn the symmetry-breaking direction from the data: this direction is given by the first principal
component. If α is below the critical level, such learning is impossible. Earlier results by Biehl
and Mitzner (1994) show that this critical value depends on contrast-to-noise ratio as well as on
the variance along the symmetry-breaking direction. The order parameter of the phase transition
is the angle that the first eigenvector of the covariance matrix forms with the symmetry-breaking
direction. Below the critical value of α, the expected value of this angle is zero, i.e. we don't have
enough training examples to learn this direction. Above the critical value, the angle is non-zero
and eventually approaches 1.
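The transition can be illustrated numerically. In this NumPy sketch (an illustration of the phenomenon, not a reproduction of the cited analyses), the overlap between the first sample eigenvector and a known symmetry-breaking direction grows from near zero towards 1 as the number of samples n increases past the critical ratio:

```python
import numpy as np

def overlap(n, p=400, extra_var=2.0, seed=0):
    """|cosine| between the first sample eigenvector (from n samples in
    p dimensions) and the true symmetry-breaking direction."""
    rng = np.random.default_rng(seed)
    w = np.zeros(p)
    w[0] = 1.0                                   # symmetry-breaking direction
    X = rng.standard_normal((n, p))              # isotropic background noise
    X += np.sqrt(extra_var) * rng.standard_normal((n, 1)) * w
    _, _, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
    return abs(float(Vt[0] @ w))

overlaps = [overlap(n) for n in (50, 400, 4000)]
print([round(o, 2) for o in overlaps])   # grows towards 1 with sample size
```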
The effect of V on our performance metrics can be explained with this phase transition. If the
network correlation ρ is above zero, our activation signal is essentially one-dimensional, because
most of the task-related variance can be explained by the first principal component. For low V,
the number of training examples is not enough to learn this one-dimensional network of
activation signal. If V is high enough, learning is possible because we have passed through the
phase transition and the number of examples is sufficiently high to capture the active network in
the first principal component. In this situation, adding more components is equivalent to
overfitting, and performance degrades when the number of components is more than one (see
Figure 5.2). This phase transition will be further explained in the next section.
5.1.4 Summary of performance on simulated data
Figure 5.4 shows the dimensionality of simulated data estimated by the methods described
above. The value of M is fixed at 0.01 (Figure 5.4 A) and at 0.03 (Figure 5.4 B). The three panels
from left to right in A and B correspond to levels of long-range spatial correlation in the network
(ρ=0, ρ=0.5, ρ=0.99), and the horizontal axis shows the relative signal variance, V, for each
panel. The plots record the median estimate of dimensionality, measured across 100 data sets,
and the error bars show the 25%-75% percentile range of the estimates. Error bars are displayed
only for the smallest (0.1) and the largest (1.6) levels of V, and there are no error bars for ROC
optimization curves because all 100 data sets are used to generate a single ROC curve for each
plot value.
This figure shows the estimates of dimensionality that are obtained by optimization of
performance of linear and quadratic discriminants. Three metrics of performance are used:
reproducibility of spatial maps, accuracy of classification, and partial area under the ROC curve.
For comparison, we also show the estimates obtained by optimization of Bayes evidence (as
computed by MELODIC software). We do not include other methods of estimation described in
chapter 4 (optimization of AIC, MDL, SURE, PRESS and GE criteria). As described in Sections
5.1.1 and 5.1.2, these methods give consistent estimates that are not influenced by M, V and ρ.
Plots of dimensionality estimates clearly show the transition of dimensionality that happens with
the emergence of a spatial active network in the background noise. This transition happens
when ρ is above zero, and when V reaches a certain threshold. At this point the network is
effectively captured in the first principal component, and the performance metrics are optimized
Figure 5.4. Median dimensionality estimates in simulations, as calculated by various
methods (see legend and text), shown as a function of the relative signal variance, V,
defined as the variance of the amplitude of the Gaussian activation blobs relative to the
variance of the independent background Gaussian noise added to each voxel. M is set to
0.01 for the top row and to 0.03 for the bottom row. The three panels from left to right in A
and B show three increasing levels of correlation, ρ, between Gaussian activation blob
amplitudes. Range bars on the first (V=0.1) and last (V=1.6) data points reflect the 25%–
75% interquartile distribution range across 100 simulation estimates.
at K=1. This is more evident when M = 0.03 (bottom row of Figure 5.4 B), where all
performance metrics converge and the signal is estimated as one-dimensional. When M = 0.01,
LD cannot robustly classify the simulated data, and the optimization of classification produces
highly variable dimensionality estimates (the variability in estimates increases as V grows). QD
is a much better classifier at such low levels of M, getting better with increasing V, due to the
fact that QD (unlike LD) is sensitive to the difference in covariance matrices of “active” and
“baseline” epochs, and this difference increases with V. Therefore, at low levels of M, estimates
of dimensionality obtained with QD are more robust compared to LD.
For both LD and QD, optimization of reproducibility of spatial maps is a method that is quite
sensitive to the emergence of the spatial activation network. For non-zero ρ, this method robustly
estimates the one-dimensional signal subspace if V is above the critical level. This critical level
depends on both M and ρ, and is somewhere between V=0.5 and V = 1. There is a good
correspondence in dimensionality estimates given by optimization of reproducibility and by
maximization of ROC area. Therefore, when we select K to optimize reproducibility of maps, we
also optimize signal detection of LD and QD. In the case of LD, optimization of classification
accuracy does not necessarily optimize signal detection when M is low. Using analytical methods
such as optimization of Bayesian evidence does not optimize signal detection because this
method fails to discover the one-dimensional spatial network in the data.
When ρ=0, the spatial network of activation cannot be captured by the first principal component.
All performance metrics are optimized with K>1. Performance of QD is optimized using a lower
number of components relative to LD, reflecting the difference in the degrees of freedom in the
LD and QD models (QD has roughly twice as many degrees of freedom as LD). In the
situation when ρ=0, using Bayesian evidence will not be as damaging to signal detection as it is
when ρ>0.
The transition in intrinsic dimensionality can be also demonstrated using the notion of global
signal-to-noise ratio (gSNR), introduced in Section 2.1.2. Figure 5.5 shows a plot of intrinsic
dimensionality (estimated by optimizing reproducibility) versus gSNR. Spatial maps are
computed with linear discriminant (A) and with quadratic discriminant (B). On this plot, V is
varied between 0.1 and 1.6, ρ is at 0, 0.5 and 0.99, and M takes the values of 0, 0.01, 0.02, 0.03
and 0.05. Each marker displays the average dimensionality and gSNR across 100 simulated sets.
Size of the marker indicates V, shape of the marker encodes ρ, and colour encodes M. The plot
demonstrates the asymptotic relationship between gSNR and intrinsic dimensionality. If gSNR
(which measures the strength of reproducible signal) is high, the intrinsic dimensionality is
estimated as K=1. The structure of task-related signal is apparent in the data, because the first
eigenvalue clearly stands out in the eigenspectrum. As we lower gSNR by manipulating the
covariance matrix (i.e. by lowering V and/or ρ, keeping M constant), there comes a point of
phase transition. The first principal component loses its privileged position, and single-
dimensional representation of the data as one correlated network is broken up into apparently-
independent subnetworks. A large number of principal components is required for this multi-
dimensional representation. The phase transition can be seen as a loss of structure in the
representation of the data when gSNR falls below the critical level.
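For concreteness, the split-half computation behind this plot can be sketched as follows. The NumPy fragment below (illustrative, not the thesis code) assumes the NPAIRS definition gSNR = sqrt(2r/(1−r)), where r is the Pearson correlation between the two split-half spatial maps (see Section 2.1.2):

```python
import numpy as np

def reproducibility_and_gsnr(map1, map2):
    """Split-half reproducibility r (correlation of two independently
    derived spatial maps) and gSNR under the assumed NPAIRS definition."""
    r = float(np.corrcoef(map1, map2)[0, 1])
    return r, float(np.sqrt(2 * r / (1 - r)))

rng = np.random.default_rng(2)
true_map = rng.standard_normal(500)              # common "signal" map
m1 = true_map + 0.7 * rng.standard_normal(500)   # split-half map 1
m2 = true_map + 0.7 * rng.standard_normal(500)   # split-half map 2
r, gsnr = reproducibility_and_gsnr(m1, m2)
print(round(r, 2), round(gsnr, 2))
```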
Figure 5.5. Asymptotic relationship between global signal-to-noise ratio (gSNR) and
dimensionality that optimizes reproducibility, for linear (A) and quadratic (B) discriminant
maps. Marker size indicates relative signal variance, V, from 0.1 (small) to 1.6 (large). Five
colours encode five different levels of M, and the spatial correlation is encoded by different
symbols.
The critical level of gSNR (where the phase transition occurs) depends on the value of M.¹³ For
LD, it is around 0.5 for M =0.01 and around 1 for M =0.03. For QD, the transition is smoother
and critical levels are harder to determine, although they seem to be close to the corresponding
critical levels in LD. For very strong M (0.05), it seems that all of our examples have gSNR
¹³ It is important to keep in mind that our definition of M uses our knowledge of the "ground truth" in simulated data, whereas gSNR requires no such knowledge. M is specific to the simulation procedure, and gSNR is computed from the reproducibility of spatial maps and is therefore specific to the method of data analysis, such as LD.
greater than the critical level, because the dimensionality is estimated as K=1 or K=2 for all
levels of ρ and V; therefore, the phase transition does not occur when M=0.05.
5.1.5 Effect on data analysis
Linear and quadratic discriminants, operating on a subspace of principal components of the data,
require careful selection of the PC subspace size. A non-optimal choice of PC dimensionality
leads to poor performance of LD and QD. This is especially evident when we retain all principal
components for our analysis, without discarding the noisy PCs. The paper by Mourao-Miranda et
al. (2005) compares the performance of LD on a full PC basis to support vector machines, and,
indeed, LD performs very poorly. However, when noisy PCs are discarded, the performance of LD
is comparable to SVM, as shown by LaConte et al. (2005).
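The LD-PC pipeline itself is compact. The following NumPy sketch (illustrative; regularization details differ from our actual implementation) projects the data onto the first K principal components and fits a two-class linear discriminant with pooled covariance and equal priors there:

```python
import numpy as np

def ld_pc_train(X, y, K):
    """Two-class linear discriminant on the first K principal components."""
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    V = Vt[:K]
    Z = (X - mu) @ V.T                           # K-dimensional PC coordinates
    m0, m1 = Z[y == 0].mean(axis=0), Z[y == 1].mean(axis=0)
    S = (np.cov(Z[y == 0].T) + np.cov(Z[y == 1].T)) / 2   # pooled covariance
    w = np.linalg.solve(np.atleast_2d(S), m1 - m0)
    b = -0.5 * float(w @ (m0 + m1))
    return mu, V, w, b

def ld_pc_predict(model, X):
    mu, V, w, b = model
    return (((X - mu) @ V.T) @ w + b > 0).astype(int)

rng = np.random.default_rng(3)
X0 = rng.standard_normal((80, 50))
X1 = rng.standard_normal((80, 50)) + 0.8         # class 1 shifted in all voxels
X, y = np.vstack([X0, X1]), np.repeat([0, 1], 80)
model = ld_pc_train(X, y, K=5)
print((ld_pc_predict(model, X) == y).mean())     # training accuracy
```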
Figure 5.6 shows how the performance of a linear discriminant is affected by dimensionality
estimation. Here, the performance is measured with the partial area under a ROC curve, for false
positive frequency between 0 and 0.1. Separate ROC curves have been constructed for each
center of activation using LABROC software (Metz et al., 1998). The plot shows the average
partial area across 16 active loci, with error bars indicating standard deviation across loci.
Several methods of dimensionality estimation have been used to determine the number of
components to be used in linear discriminant analysis. We use the MELODIC and GIFT
software packages to estimate dimensionality with optimization of Bayesian evidence and MDL,
respectively. The performance of univariate General Linear Model (GLM) is also shown for
reference. The dashed line indicates the performance of a randomly guessing detector, where true
and false positives are equally likely.
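Partial ROC area can be computed directly from the two score distributions. The sketch below (illustrative NumPy; the thesis used the LABROC software) integrates the empirical true-positive rate over false-positive rates in [0, 0.1], so a perfect detector scores 0.1 and a randomly guessing one 0.005:

```python
import numpy as np

def partial_auc(scores_null, scores_active, fp_max=0.1, grid=1000):
    """Area under the empirical ROC curve restricted to false-positive
    rates in [0, fp_max]; a perfect detector attains fp_max."""
    fprs = np.linspace(0.0, fp_max, grid)
    # threshold achieving each FPR = upper quantile of the null scores
    thr = np.quantile(scores_null, 1.0 - fprs)
    tprs = np.array([np.mean(scores_active > t) for t in thr])
    return float(np.sum((tprs[1:] + tprs[:-1]) / 2 * np.diff(fprs)))

rng = np.random.default_rng(4)
null = rng.standard_normal(10000)                 # inactive-voxel scores
active = rng.standard_normal(10000) + 2.0         # active-voxel scores
pa = partial_auc(null, active)
print(round(pa, 3))   # well above the chance level of fp_max**2 / 2 = 0.005
```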
Among the methods of dimensionality estimation used in Figure 5.6, reproducibility
optimization and classification optimization are the two methods that are influenced by global
signal-to-noise ratio. Cost functions for the other three methods (MDL, generalization error and
Bayes evidence) have the same shape irrespective of gSNR. These functions are plotted on
Figure 5.1; generalization error is optimized when K=1, MDL is optimized with a very high number of
PCs (K=140), and Bayesian evidence is optimized for values of K between 15 and 17 (which
roughly corresponds to our number of active loci, 16). As we can see in Figure 5.6, optimization
of MDL is the most disadvantageous method to estimate K: when LD is performed using 140
principal components, it performs approximately at the level of random guessing. MDL
optimization severely overestimates the dimensionality; 140-dimensional representation of data
is dominated by noise.
Figure 5.6. Partial ROC area (corresponding to false positive frequency range of [0…0.1])
as a function of the relative signal variance, V, calculated for linear discriminant (LD, on
the principal component subspace, with subspace size selected by various methods), and for
univariate general linear model (GLM). M is set to 0.01 for the top row (A) and to 0.03 for
the bottom row (B). The three panels from left to right in A and B show three levels of
correlation, ρ, between Gaussian activation blob amplitudes. Error bars show standard
deviation across 16 active loci (centers of Gaussian activation blobs).
Estimates of K obtained with optimizing Bayesian evidence are much more reasonable, and LD
always performs well above chance when these estimates are used. However, this method can be
recommended only when the activation signal is independent across the 16 active loci (that is,
when ρ=0). In this situation, the data are best described by the model that uses 16 principal
components, which is close to the estimates given by Bayesian-evidence optimization. However,
when ρ>0, this method of estimation is clearly sub-optimal (except for the lowest levels of signal
variance V). The obtained estimates cause LD to perform below the level of univariate GLM,
whereas optimization of reproducibility, classification, or generalization results in LD
performing better than GLM and, in some cases, near the theoretical maximum of signal
detection (corresponding to partial ROC area of 0.1).
Optimization of generalization is a good strategy when the activation signal is correlated across
loci (ρ>0). In this situation, the active loci form a single spatial network; therefore, the data can
be adequately modeled with K=1. This is especially true for strong mean signal (M=0.03);
performance of LD with K=1 (which is the value that optimizes generalization) is close to
perfect. For weaker mean signal (M=0.01), performance is more modest but nevertheless the
highest of all dimensionality-estimation methods we have used.
There is some variability in the estimates of K that optimize reproducibility; this variability is
greater for weak mean signal (M=0.01). For both levels of M, as we increase V and ρ, K is
estimated to be 1 in a larger proportion of simulated data sets. This trend results in better ROC
performance of LD for growing V and ρ. The same trend is observed in LD when it is combined
with optimization of classification. The variability of estimates here is even higher than in
reproducibility optimization; nevertheless, for M=0.03 performance is comparable to the level
obtained with optimization of reproducibility. When M=0.01, ρ=0.99, and V>1, LD performs at a
high level for three methods of dimensionality estimation (generalization, reproducibility and
classification).
5.2 Intrinsic dimensionality estimation in real data
Intrinsic dimensionality has also been estimated on the real fMRI data sets described in Sections
2.2.1 and 2.2.2. ROC-type analysis cannot be performed on real data, so we cannot use it to
answer which method of estimation is better for signal detection. However, we have seen the
same asymptotic relationship between gSNR and optimal PC dimensionality as we have seen in
simulated data (Figure 5.5). For the data from the stroke study, this relationship is demonstrated
in Figure 5.7 A.
The vertical axis shows the number of PCs that maximizes reproducibility of LD maps in each
individual subject, and the horizontal axis shows the gSNR that corresponds to this maximum
reproducibility. Each marker corresponds to an individual subject. LD analysis has been
performed to classify the fMRI volumes according to the behavioural task ("finger tapping"
versus "wrist clenching"). Although the number of subjects is too small to make strong
statements, the pattern here is similar to the one observed in the simulated data (Figure 5.5):
subjects with greater gSNR require, in general, a smaller number of principal components to
optimize map reproducibility. This does not hold across all subjects; for example, subject S054
uses more principal components than subject S059 and yet achieves greater reproducibility, and,
therefore, greater gSNR. It is possible that we observe two asymptotic curves here, one with
subjects S059, S103 and S145, and the other with subjects S054 and S090; the remaining
subjects could belong to either asymptotic curve. LD classification accuracy for subjects S059,
S103 and S145 is lower (90%) than on subjects S054 and S090 (96%). This difference could be
driven by different levels of magnitude of underlying task-related BOLD signal. In simulated
data, we have shown that varying M can shift the asymptotic curve along the horizontal axis (see
Figure 5.5).
Figure 5.7. Asymptotic relationship between global signal-to-noise ratio (gSNR) and
optimal dimensionality in real fMRI data: stroke study (A) and aging study (B). Each
marker indicates a subject.
This asymptotic relationship has been observed in all of our data sets, for all classification tasks.
For example, Figure 5.7 B demonstrates this relationship for the aging study data. Here, each
marker represents one subject, with age groups encoded with marker types. LD has been used to
classify each subject's volumes according to the task (delayed matching versus fixation), and to
build the corresponding spatial maps. Reproducibility of these maps has been used to estimate
intrinsic dimensionality and to compute gSNR. Compared with the stroke set displayed in
Figure 5.7 A, the gSNR values are larger and the estimated dimensionality is overall on a smaller
scale. This can be explained by the stronger contrast in the aging-study set, perhaps produced by
the stronger magnetic field (3T here versus 1.5T in the stroke study set). Within each age group,
the asymptotic relationship tends to hold. The overlap across the age groups is large, but it can
nevertheless be observed that subjects from the youngest age group exhibit smaller gSNR values
and larger dimensionality, relative to older subjects.
Figure 5.8. Optimal dimensionality and global SNR in a group study. Each marker
indicates an age group; each solid line indicates a task that is contrasted with fixation.
This asymptotic relationship is not specific to within-subject analysis; we have also observed it
in a group analysis (Yourganov et al., 2011). The data comes from a study of cognitive aspects of
aging (described in Grady et al., 2006); its design is similar to the aging study that we have used
for our evaluation. In the 2006 study, the participants come from 3 age groups (young: 20-30
years, 12 subjects; middle-aged: 40-60 years, 12 subjects; old: 65-78 years, 12 subjects). They
are presented with black line drawings of nameable objects, and words corresponding to names
of objects. The experiment consists of two “shallow” memory-encoding tasks, two “deep”
encoding tasks, and two recognition tasks. During the two shallow-encoding tasks, the subjects
are asked to report whether the pictures are large or small, and whether the words are printed in
upper or lower case. During the two deep-encoding tasks, the subjects are asked to determine
whether the pictures (or words) corresponded to living or non-living entities. During the two
recognition tasks, the subjects are instructed to report whether or not they had seen the presented
stimuli (pictures or words) previously. The tasks are performed during 24-second epochs, which
are interspersed with fixation blocks of equal length.
fMRI data has been acquired on a 1.5 T scanner. The analysis has consisted of LDA classification
of fMRI volumes according to the task; there are 6 contrasts, each active task being contrasted
with fixation. The volumes have been pooled across the subjects. Figure 5.8 plots the
dimensionality that optimizes reproducibility of maps versus global SNR, for each of the six
contrasts and 3 age groups. The same asymptotic relationship between gSNR and dimensionality
can be observed here. In the young group, dimensionality is lower and gSNR is higher than in
the two older groups. In within-subject analysis, we observe the opposite trend, with younger
subjects having higher dimensionality and lower gSNR (see Figure 5.7B); this could be
explained by the hypothesis that younger subjects form a more homogeneous group, although the
variability in within-subject BOLD data is higher than in the older subjects (for discussion of
BOLD variability in different age groups, see Garrett et al., 2012).
5.3 Lessons learned
In this chapter, we have evaluated several methods of estimating the intrinsic dimensionality of
the fMRI data, that is, of determining which principal components contain relevant signal and
should be retained for further analysis, and which PCs contain noise and should be discarded.
This step is important for multivariate Gaussian classifiers that use PCA regularization; for
example, the performance of linear discriminant can be greatly influenced by the number of
retained PCs (see Figure 5.6, as well as LaConte et al., 2003, and Yourganov et al., 2011).
Methods for estimating intrinsic dimensionality of the data can be categorized into two groups:
analytic and empirical methods. Methods in the first group determine the number of principal
components that produce an approximation of the data, which is optimal for some information-
theoretical criterion. We have tested optimization of the following criteria: Akaike information
criterion, minimum description length, Stein's unbiased risk estimator, and Bayesian evidence
(computed using Laplace approximation). All of these methods, except for Bayesian evidence,
are optimized when an unreasonably large number of components are used (in our simulations,
the optimum was reached when 87.5% of PCs were retained). This estimate of dimensionality is
virtually useless: when LD-PC operates on such a high number of PCs, its performance is
barely above chance (see Figure 5.6). This over-estimation is consistent with results reported by
Cordes & Nandy (2006); they have demonstrated that temporal autocorrelation in the noise
inflates the estimates given by analytical methods such as AIC and MDL. Optimization of
Bayesian evidence, as implemented in the MELODIC FSL package, produces much more
reasonable estimates of dimensionality (retaining 10% of PCs in our simulations). This is
perhaps not an inherent advantage of Bayesian evidence optimization per se, but is due to the
way it is implemented in FSL MELODIC: before the dimensionality estimation is carried out,
the eigenspectrum of the data is adjusted to account for limited sample size (see Beckmann &
Smith, 2004). When LD-PC is operating on the PC subspace defined with this method, its signal
detection is well above chance; in the situations when the simulated sources of signal are
independent of each other, this method is the best performer (Figure 5.6). However, when the
sources are spatially correlated, PC subspace size is estimated better with empirical methods.
The empirical methods select the number of PCs that optimize the actual performance on the
independent test set. We have evaluated the following performance metrics: classification
accuracy of LD-PC, reproducibility of LD-PC maps, area under the ROC curve computed for
LD-PC maps, predicted sum of squares (PRESS), and generalization error of the probabilistic
PCA model. Of these methods, optimization of PRESS gives highly inflated dimensionality
estimates, comparable to some analytic methods (AIC, MDL, SURE). This method is also very
computationally intensive: for a J×N data matrix, it needs to compute J×N singular value
decompositions. This peculiar behaviour of PRESS is perhaps due to the fact that PRESS
optimization is theoretically different from the other empirical methods we have tested: it does
not attempt to separate the PCs into the "signal-containing" and "noise-containing" subspaces,
and does not use a held-out test set to optimize its cost function.
In our simulations, generalization error is optimized when 1 PC is used (in rare and seemingly
random cases, it is optimized with 2 PCs). This is a reasonable estimate when the simulated
sources are spatially correlated and form a single network, which is best described with a single
PC. This estimate is more questionable when the sources are uncorrelated (theoretically, the best
estimate of dimensionality in this case is the number of sources); however, this presumed
under-estimation of dimensionality does not impair signal detection when the mean signal is
strong. In this condition, LD-PC is not very sensitive to the number of PCs as long as they are in the
range from 1 to (approximately) the number of sources. For weak mean signal and uncorrelated
sources, optimization of generalization error is a suboptimal choice of dimensionality estimation
for LD-PC.
In our simulations, analytic dimensionality-estimation methods, as well as a subset of empirical
methods (optimization of PRESS and of generalization error), are not sensitive to the parameters
of simulated signal of interest (that is, reproducible signal which is different between classes).
Such sensitivity is observed in empirical methods that optimize performance of LD-PC.
Estimates of PC dimensionality are small when the signal is correlated spatially and its variance
is relatively large. In this situation, the signal of interest is captured by a small number of
components. This behaviour is observed for all three measures of LD-PC performance: area
under the ROC curve, classification accuracy and reproducibility. However, if the performance
itself is poor and highly variable, the dimensionality estimates that optimize it are unstable (for
example, this is observed when we optimize classification accuracy when M=0.01).
Reproducibility is a more robust measure of performance than classification accuracy; its
optimization gives more stable estimates.
Overall, we recommend selecting the PC subspace size that optimizes a combination of the two
performance metrics (reproducibility and classification accuracy). As proposed by Zhang et al.
(2008), the number of components is selected in order to minimize
\Delta = \sqrt{(1 - \text{med. reproducibility})^2 + (1 - \text{med. class. accuracy})^2} . \qquad (5.1)
The geometrical interpretation of this cost function can be demonstrated using the idea of
prediction-reproducibility plots (Section 2.1.4). A perfect classifier would have both
reproducibility and classification accuracy equal to 1; therefore, it is represented as the point that
has coordinates (1,1) in the prediction-reproducibility space. The Δ metric is the Euclidean
distance from the perfect classifier. By minimizing it, we attempt to find the PC subspace that is
both reproducible and predictive; it is more stable than optimization of either of the two metrics
by themselves (classification accuracy is particularly unreliable in the noisy data sets that are
inherently hard to classify). A study by Rasmussen et al. (2012B) also suggested optimization of
the Δ metric to select regularization parameters of multivariate classifiers that do not use PCA
regularization. That study demonstrated an advantage of the Δ metric
over prediction accuracy: when the regularization of classifiers was optimized for prediction
accuracy, the resulting spatial maps were sparser and less reproducible compared to Δ-optimized
regularization. This can serve as a reminder that in fMRI classification the quality of spatial
maps is just as important as classification accuracy.
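For concreteness, the selection rule defined by the Δ metric can be sketched in a few lines of Python (an illustrative sketch, not the code used in this thesis; the function and variable names are ours):

```python
import numpy as np

def delta_metric(reproducibility, accuracy):
    """Euclidean distance from the perfect classifier, which sits at
    (1, 1) in prediction-reproducibility space (Zhang et al., 2008)."""
    return np.sqrt((1.0 - reproducibility) ** 2 + (1.0 - accuracy) ** 2)

def select_subspace_size(med_reproducibility, med_accuracy):
    """Pick the number of retained PCs that minimizes the Delta metric.

    Both arguments are arrays of medians (taken across split-half
    resamples), one entry per candidate subspace size K = 1, 2, ...;
    returns the selected K."""
    delta = delta_metric(np.asarray(med_reproducibility),
                         np.asarray(med_accuracy))
    return int(np.argmin(delta)) + 1
```

A perfectly reproducible and perfectly accurate classifier yields Δ = 0; in practice, the minimum of Δ trades the two metrics off against each other.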
Chapter 6 Intrinsic dimensionality and complexity of fMRI data
In the previous chapter, we have used simulated data to show how estimation of intrinsic PC
dimensionality is influenced by the structure of the signal of interest (that is, the signal that is
reliably different between classes in a classification problem). Our measure of gSNR indicates
how easy it is to detect this signal in the data. If gSNR is high, the algorithms (such as LD and
QD) are better able to classify the data accurately and to produce reproducible maps. We have
demonstrated that higher gSNR is linked to lower intrinsic dimensionality (estimated by
optimizing map reproducibility or classification accuracy of linear and quadratic discriminants).
In this chapter, we present some behavioural correlates of PC dimensionality. We were motivated
by the result of McIntosh et al. (2008), where the intrinsic dimensionality of EEG signal (for a
specific channel) was estimated as the number of PCs that explain 90% of the trial-to-trial
variability. This dimensionality was measured for a group of children 8 to 15 years old, as well
as for adults, and has been shown to positively correlate (1) with age and (2) with behavioural
accuracy in a face memory task. These positive correlations were also observed for per-channel
multi-scale entropy, which is a nonlinear measure of signal complexity. These results suggest PC
dimensionality as a linear (and perhaps indirect) measure of complexity of EEG signal, which
increases with maturation. We have extended this idea to fMRI data, obtained from two studies:
a study of self-control and a longitudinal stroke recovery study; the results are described in this
chapter. In the first study, we have found significant difference in PC dimensionality between the
groups that exhibit different tendencies in self-control. In the second study, we have found that
PC dimensionality estimates (as well as some other measures of fMRI eigenspectrum) correlate
with behavioural measures of post-stroke recovery of motor function.
6.1 Intrinsic dimensionality in a study of self-control
In a paper recently published in Nature Communications (Berman et al., 2013), we examined PC
dimensionality in healthy adults with varying levels of self-control abilities. The participants
have previously been involved in a seminal study of self-control across the life span (Mischel et
al., 2011). At the age of 4, they were tested for their ability to control appetitive impulses:
they were presented with a choice, either to eat one marshmallow immediately, or to wait
for a period of time and be rewarded with two marshmallows. In adolescence, these participants
were assessed for their self-control ability using parental ratings; assessments were
repeated (with self-reported ratings) when the participants were in their 20s and 30s.
Participants who demonstrated better self-control at age 4 (that is, who could resist
immediate temptation on the marshmallow test) also showed, as adolescents, greater ability
to plan, to concentrate, and to cope with stress, as well as higher SAT scores. As adults, these
participants demonstrate higher educational achievement, higher subjective sense of self-worth,
and have lower rates of drug abuse. We separated the participants into two groups, "high
delayers" and "low delayers", according to whether the participant was above or below average
on the self-control measures.
Figure 6.1. Intrinsic dimensionality in the self-control study, for the subjects in the high-
delaying and the low-delaying groups. Dimensionality is estimated by optimization of LD
classification accuracy. Error bars represent standard errors across the subject group.
A subset of participants (12 from each group), now in their 40s, was recruited for an fMRI
study. During scanning, the participants were instructed to perform the directed-forgetting
task that tests one's ability to control the contents of working memory. There are two types of
trials: "lure" and "control". "Lure" trials require suppressing information in working memory
whereas "control" trials do not. The difference in behavioural performance between "lure" and
"control" trials can be used as an assay of one's ability to control the contents of working
memory. Behavioural performance does not differ significantly across the two groups as both are
equally impaired on "lure" trials relative to "control" trials in accuracy and reaction time. LD
analysis was performed to classify the fMRI volumes according to trial type; the accuracy of
this classification is not significantly different for the two groups. However, there is a significant
difference in the intrinsic dimensionality, that is, the number of principal components required to
optimize LD classification. Figure 6.1 shows the optimal number of PCs for each subject, as well
as the group averages. The high delayers, as a group, need a smaller number of PCs to optimize
LD classification; the low delayers show larger mean dimensionality as well as larger variability
in dimensionality. Univariate analysis demonstrated that subjects in both groups recruit the same
cortical areas in the directed-forgetting task, i.e., there is no significant group difference in the
magnitude of BOLD signal. However, recalling the results of our simulations, smaller values of
dimensionality in the high-delaying group might suggest that the signal of interest in the active
areas had more variability (V) and/or stronger network coupling (ρ). In addition, we have
constructed within-subject LD maps using the optimum-classification dimensionality estimates.
We can classify the within-subject maps according to the group with 71% accuracy using QD;
LD classification is lower (58%), because the group difference in homogeneity of the covariance
matrix requires a highly non-linear boundary to separate the two groups.
6.2 Complexity of cortical networks in fMRI study of stroke recovery
Using data from the longitudinal study of stroke recovery (Small et al., 2002), we studied the
relationship between PC dimensionality, computed on fMRI data, and improvement of motor
function, measured with standard behavioural tests. We were able to find the correlation of post-
stroke recovery with intrinsic dimensionality, although not all of our methods of dimensionality
estimation have demonstrated such correlation. Our results were published (Yourganov et al.,
2011).
Nine subjects recovering from a stroke in the motor area of the brain were tested in their ability
to perform hand movements, both with the healthy hand and with the hand impaired by the
lesion. All subjects were scanned at 4 sessions (taking place at 1, 2, 3 and 6 months post stroke).
Each session included behavioural assessment of recovery of motor function. Of the several
behavioural tests in the original study, we use the results of three tests (each performed
with both the healthy and the impaired hand):
1. Strength of the hand grip, measured with a dynamometer.
2. Strength of the pinch between thumb and index finger, measured the same way.
3. Performance on the nine-hole peg test, defined as 1/(time to complete).
Each test produces two behavioural measures of recovery: improvement of performance and final
performance. Improvement was calculated as the difference between the performance of the
impaired hand on the first and last sessions, divided by the mean (across all 4 sessions)
performance of the healthy hand. Final performance was computed as the performance of the
impaired hand on the fourth session, again divided by the mean performance of the healthy
hand. During the fMRI scanning sessions, subjects were
instructed to perform two tasks: wrist flexion/extension and tapping index finger and thumb
together. These movements are performed with the healthy as well as with the impaired hand, on
alternating runs. Each subject has been tested during four sessions, each session consisting of
behavioural evaluation and fMRI recording.
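The two behavioural measures can be computed directly from the session scores; a minimal sketch under the definitions above (the function name is ours):

```python
import numpy as np

def recovery_measures(impaired, healthy):
    """Improvement and final performance for one behavioural test.

    impaired, healthy: length-4 sequences of scores for the impaired
    and the healthy hand across the four sessions.  Both measures are
    normalized by the mean healthy-hand performance."""
    reference = np.mean(healthy)                 # mean across all 4 sessions
    improvement = (impaired[-1] - impaired[0]) / reference
    final = impaired[-1] / reference
    return improvement, final
```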
In an earlier paper (Schmah et al., 2010), we used these data to evaluate a number of
classification algorithms. fMRI volumes were classified into several categories: according to the
task (finger or wrist movement), to the side of the body (movement of the healthy hand or of the
impaired hand), and to the time elapsed since the stroke ("2 earlier sessions" versus "2 later
sessions"). Both linear and quadratic discriminants performed well relative to other methods of
classification. Quadratic discriminant is particularly good at differentiating between early and
late sessions; the accuracy of classification is at least 99% in all subjects. The PC dimensionality
that provided maximum classification accuracy of QD is highly variable across subjects, ranging
from 41 to 191 principal components. Encouraged by the high accuracy of classification (which
suggests that QD is the right model for early/late session classification), we correlated this
dimensionality with behavioural recovery measures (improvement as well as final performance
of all three tests).
Our results show a negative correlation between dimensionality estimates and performance on
the final session: Pearson's r is -0.56 for the pinch test, -0.69 for the grip test, and -0.74 for the
peg test, with p-values (uncorrected for multiple comparisons) of 0.11, 0.04 and 0.02,
respectively.
Figure 6.2. Scatter plots for four combinations of fMRI-based measures and behavioural
measures: QD dimensionality versus final peg test performance (A); generalization-error
dimensionality versus final pinch test performance (B); sphericity versus pinch test
improvement (C); spectral distance versus pinch test improvement (D). Each subject is
represented with a specific symbol.
We can speculate that the stroke recoverers with the highest behavioural performance on the
final session are the subjects that show the greatest recovery, therefore, the difference between
the early and the late sessions is more pronounced and can be captured by a small number of
PCs. However, correlation with behavioural improvement of performance in all three tests is not
significant (p>0.5), perhaps due to the small and anatomically heterogeneous sample. The scatter plot
of the final performance on the 9-hole peg test versus QD dimensionality is shown on Figure 6.2
A.
We have observed another interesting correlation: final performance is positively correlated with
dimensionality estimates obtained by minimizing generalization error ("GE dimensionality"). In
our simulations, this method consistently estimates the dimensionality as 1 principal component
(and, very rarely, as 2 PCs), irrespective of the M, V and ρ of the task-related signal; this could be
due to the fact that the variation in our simulated data is driven by the difference between the
"active" and "baseline" classes and therefore could be captured with a single PC However, when
this method is applied to the stroke set, estimates of intrinsic dimensionality differ across
subjects (varying between 20 and 80 principal components), because the variance in the BOLD
signal in stroke recovery data has a much more complicated structure than what is observed in
our simulations . These estimates show strong correlation with all three behavioural measures of
final performance: Pearson's r is 0.73 for the pinch test, 0.84 for the grip test, and 0.57 for the
peg test (p=0.03, 0.004, and 0.1). Correlation with improvement of performance is, again,
insignificant (p>0.5). Scatter plot is displayed on Figure 6.2 B.
It is interesting to observe the difference between optimization of generalization and
optimization of classification: for the first method, estimates correlate positively with final
recovery, and for the second method, this correlation is negative (but also strong). There is a
fundamental difference between these two methods of dimensionality estimation: QD
dimensionality is modeling the difference between the early and the late sessions, and
generalization-error dimensionality captures the overall complexity of the data, with early and
late sessions pooled together14. It could be assumed that good recoverers have more complex
cortical networks that require larger number of components to describe them (with a probabilistic
PCA model). At the same time, the difference between the early and late stages of recovery is
more pronounced, requiring a QD model with fewer components.
14 We have also tried to estimate generalization-error dimensionality for early and late sessions separately, to see whether there is a change that reflects behavioural improvement. The difference between early and late sessions was not significant (p=0.86 for the paired Wilcoxon test).
We found a correlation between improvement of behavioural performance and some measures
computed on the eigenspectrum of the covariance matrix (where the eigenvalues are sorted in
descending order). The first of such measures is the index of sphericity. This measure is
calculated on the covariance matrix with the data pooled across all sessions. The sphericity index
reflects the curvature of the plot of this matrix's eigenvalues. An index of 1 corresponds to a
flat eigenspectrum, in which all eigenvalues are identical; this would be observed for an
infinitely large sample of white Gaussian noise. The smaller the index, the sharper the drop in
the eigenspectrum. In a limited sample of white Gaussian noise, the eigenspectrum of the
covariance matrix drops off slowly. If we add spatially
correlated signal that has sufficiently high magnitude and variance, the spectrum becomes less
spherical. Figure 5.3 shows how the presence of the network makes the first eigenvalue stand
out, making the spectrum less spherical. We define the sphericity index using the Greenhouse-Geisser
correction to the Box criterion (Schmah et al., 2010; Abdi, 2010):

\epsilon = \frac{\left(\sum_{i=1}^{N} \lambda_i\right)^2}{N \sum_{i=1}^{N} \lambda_i^2} . \qquad (6.1)
Here, N is the number of eigenvalues. This measure shows strong negative correlation with
improvement on pinch test and peg test (Pearson's r = -0.79 and -0.63; p = 0.01 and 0.07).
Correlation with improvement on the grip test is insignificant (Pearson's r = -0.38, p = 0.3), as are
correlations with final measures of recovery (p>0.35 for all three behavioural tests). Overall, the
subjects showing a highly non-spherical covariance matrix (with a sharp drop in the
eigenspectrum) show the greatest improvement in two out of three behavioural measures of
recovery (pinch test and peg test). The scatter plot of sphericity versus improvement in pinch
test is shown on Figure 6.2 C. In addition, we computed the sphericity for the covariance
matrices of early and late sessions separately. A paired Wilcoxon test fails to reveal any
significant difference between "early" and "late" sphericity (p=0.91).
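Formula 6.1 is straightforward to compute from an eigenspectrum; a minimal sketch (the function name is ours):

```python
import numpy as np

def sphericity_index(eigenvalues):
    """Greenhouse-Geisser sphericity of a covariance eigenspectrum:
    (sum of eigenvalues)^2 / (N * sum of squared eigenvalues).
    Equals 1 when all eigenvalues are identical (flat spectrum) and
    approaches 1/N when a single eigenvalue dominates."""
    lam = np.asarray(eigenvalues, dtype=float)
    return lam.sum() ** 2 / (lam.size * (lam ** 2).sum())
```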
To get a measure of difference in covariance matrices between the early and late sessions, we use
spectral distance, which we define as the Euclidean distance between the two ordered
eigenspectra:
d = \sqrt{\sum_{i=1}^{N} \left(\lambda_i^{(E)} - \lambda_i^{(L)}\right)^2} . \qquad (6.2)
Here, λ_i^{(E)} denotes the eigenvalues of the "early" covariance matrix (sessions 1 and 2 combined),
and λ_i^{(L)} the eigenvalues of the "late" covariance matrix (sessions 3 and 4 combined). This measure
shows strong positive correlation with improvement on the pinch test (Pearson's r=0.79, p=0.01)
and a weaker one with the improvement on the grip test (Pearson's r=0.56, p=0.11). Correlation
with improvement on the peg test is insignificant (Pearson's r=0.38, p=0.31). Correlations with
behavioural measures of final recovery are also not significant (p>0.57 for all three behavioural
tests). The scatter plot of spectral distance versus improvement on pinch test is displayed on
Figure 6.2 D. A large spectral distance between the early and late sessions indicates a prominent
longitudinal change in the brain networks. It is therefore not surprising that it positively
correlates with behavioural improvement of motor performance.
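Formula 6.2 can be computed directly from the two covariance matrices; a sketch (note that `eigvalsh` assumes symmetric matrices, which covariance matrices are):

```python
import numpy as np

def spectral_distance(cov_early, cov_late):
    """Euclidean distance between the ordered (descending) eigenspectra
    of the 'early' and 'late' covariance matrices (Formula 6.2)."""
    lam_early = np.sort(np.linalg.eigvalsh(cov_early))[::-1]
    lam_late = np.sort(np.linalg.eigvalsh(cov_late))[::-1]
    return float(np.sqrt(np.sum((lam_early - lam_late) ** 2)))
```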
Figure 6.3. Scatter plot for the first vs. the second principal component produced by PLS
analysis of the correlation matrix given in Table 6.1. Squares and circles denote fMRI-
based and behavioural measures, respectively.
To further analyze the correlations between fMRI-based measures and behavioral test results, we
performed a Partial Least Squares (PLS) analysis (McIntosh & Lobaugh, 2004; Krishnan et al.,
2011). The matrix of correlations R (displayed in Table 6.1) was decomposed using singular-
value decomposition: R=UΓVT. The first two principal components explain 43% and 39.5%,
respectively, of the total correlation (this is computed by dividing each singular value in Γ by the
total sum of singular values); taken together, they explain 82.5% of the total correlation. These
two principal components can be seen as the two orthogonal directions that capture (in the
optimal least-squares sense) the associations between fMRI-based and behavioural measures.
Figure 6.3 shows the scatter plot of the weights on the first two principal components (which
correspond to columns of U for fMRI-based measures and to columns of V for behavioural
measures). We see that the first (horizontal) direction captures the fMRI-based measures that are
computed on the eigenvalues (sphericity and spectral distance) and the behavioural measures
based on the improvement in performance. The second (vertical) direction captures the measures
based on the PC dimensionality that optimizes prediction (unsupervised or supervised) and the
behavioural measures based on the performance at the final session.
                       peg test                     pinch strength               grip strength
                       improvement    final         improvement     final        improvement   final
sphericity index       -0.68 (0.05)   -0.30 (0.44)  -0.87 (0.0045)  -0.07 (0.88) -0.43 (0.25)   0.17 (0.68)
spectral distance       0.48 (0.19)    0.18 (0.64)   0.73 (0.03)    -0.18 (0.64)  0.50 (0.18)  -0.03 (0.95)
QD dimensionality      -0.28 (0.46)   -0.83 (0.008)  0.15 (0.71)    -0.38 (0.31) -0.02 (0.98)  -0.42 (0.27)
GE dimensionality      -0.01 (0.99)    0.45 (0.22)  -0.21 (0.6)      0.67 (0.05)  0.16 (0.68)   0.78 (0.015)
Table 6.1. Correlations between fMRI-based measures and behavioural measures. Each
table entry shows Spearman's correlation coefficient, with the corresponding p-value in
parentheses. When corrected for multiple comparisons, none of the correlations are
significant at FDR≤0.05 level.
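The PLS decomposition of the correlation matrix described above can be sketched as follows (an illustrative sketch; the function name is ours):

```python
import numpy as np

def pls_correlation_svd(R):
    """Singular-value decomposition R = U diag(s) V^T of the matrix of
    correlations between fMRI-based measures (rows) and behavioural
    measures (columns).  Returns the saliences U and V, plus the
    fraction of total correlation explained by each component,
    computed as each singular value divided by the sum of all
    singular values."""
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    explained = s / s.sum()
    return U, Vt.T, explained
```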
We have demonstrated that the process of post-stroke recovery of motor function is reflected in
the fMRI data, and, in particular, in the eigendecomposition of the covariance matrix. The PC
dimensionality is correlated with the performance on the last session, whereas the measures
computed on the eigenvalues of the data matrix (sphericity and spectral distance) correlate with
the improvement of performance in the 6-month period. These two directions of correlation are
mutually orthogonal (Figure 6.3), indicating that they might capture two relatively separate
behavioral recovery processes: the absolute level of performance reached over 6 months
recovery, and the change from the initial damaged brain required to reach this final performance.
This might also explain the opposing trends observed in sphericity and GE dimensionality (good
behavioural recovery is correlated with low sphericity and high GE dimensionality). Overall, our
results suggest that significant changes in the eigenspectra across sessions reflect underlying
changes in the BOLD functional connectivity and the re-organization of the brain, which leads to
effective recovery (Yourganov et al., 2010).
Chapter 7 Evaluation of classifiers: simulated fMRI data
7.1 Pool of classifiers
In Chapter 5, we have evaluated different methods of estimating the intrinsic PC dimensionality
of the data. In this chapter, we describe the evaluation of classification algorithms that have been
used to classify fMRI volumes according to the task performed during the acquisition of the
volume. We have evaluated a group of classifiers well-established in the field, such as linear
discriminant, support vector machines, and Gaussian Naïve Bayes classifier; in addition, we have
tested quadratic discriminant, which is novel in fMRI classification. We have also constructed
spatial maps, which display the contribution of each spatial location to classification of the
volumes. Both simulated and real fMRI data have been used in this evaluation. This chapter
describes the results of the evaluation on the simulated data; this study will be submitted for
publication in summer of 2013.
Our pool of classifiers consisted of quadratic and linear discriminants (noted as QD and LD),
Gaussian Naïve Bayes classifiers (GNB-L and GNB-N for linear and nonlinear variants of GNB,
respectively), and a support vector machine (SVM) with a linear kernel. QD, LD and GNB are
probabilistic classifiers that use Gaussian distributions to model the classes; Chapter 3 describes
these classifiers and corresponding spatial maps. To evaluate SVM, we use the implementation
provided by the LIBSVM library (Chang & Lin, 2011). For linear SVMs, the training weight vector
serves as a spatial map (this is proposed, among other approaches to visualizing the SVM model,
in a paper by LaConte and colleagues (2005); see also Rasmussen et al., 2012B). We did not test
SVMs with non-linear kernels, despite the evidence that they provide an advantage over linear
kernels in some situations (Schmah et al., 2010). The decision to exclude nonlinear SVMs from
our pool of classifiers was motivated by the fact that we wanted to evaluate the spatial maps
created by the classifiers (unlike our previous study described in Schmah et al., 2010, where we
evaluated classification accuracy of classifiers without considering the corresponding spatial
maps). It is possible to construct spatial maps for nonlinear-kernel SVMs (Rasmussen et al.,
2011 and 2012A); however, this is computationally expensive and was not performed due to
time limitations.
For LD, we try two approaches to regularizing the pooled covariance matrix. The first approach
approximates it with a subset of its principal components; this is also the approach we used for
QD. The question of selecting the size of this subset is addressed in detail in Chapters 4 and 5,
where we show the usefulness of resampling-based approaches. In particular, reproducibility of
spatial maps and, to a lesser extent, accuracy of classification are two optimization metrics that
are sensitive to connectivity of spatial networks and yield reasonably good ROC performance of
LD. In the current evaluation, we have used a combination of reproducibility and classification
accuracy to determine K, the size of the optimal PC subset. Following Zhang et al. (2008), we
use split-half resampling to compute classification accuracy and map reproducibility for a range
of values of K, and compute the median values of these two metrics across splits. Then we select
the value of K that minimizes
\Delta = \sqrt{(1 - \text{med. reproducibility})^2 + (1 - \text{med. class. accuracy})^2} . \qquad (7.1)
A perfect classifier would have both reproducibility and classification accuracy equal to 1. We
can think of these two metrics as axes that define our performance space; the perfect classifier
has coordinates (1, 1) in this space, and the Δ metric is the Euclidean distance from the perfect
classifier.
The second approach to regularizing the covariance matrix is ridge regularization, described in
Section 3.5; here, a diagonal matrix is added to the covariance matrix, making it full-rank. The
diagonal matrix is λI, and the value of λ is selected with a cross-validation procedure similar to
selecting K. Using split-half resampling, we compute the reproducibility of maps and accuracy of
classification for different values of λ. The value of λ that optimizes the Δ metric (given by
Formula 7.1) is used to regularize the within-class covariance in the linear discriminant.
The SVM model contains the hyper-parameter C, which regulates the trade-off between the
complexity of the model and the accuracy of classification (LaConte et al., 2005). We determine
the value of C in the same way as we select K and λ: by running split-half resampling, computing
median (across splits) classification accuracy and map reproducibility, and optimizing the Δ
metric. Our implementation of GNB (linear and nonlinear) has no regularization hyper-
parameters that need to be tuned with cross-validation.
Overall, our pool of classifiers consists of 6 models:
1) QD: quadratic discriminant with PC regularization of covariance matrices (the same
value of K is used for both classes);
2) LD-PC: linear discriminant with PC regularization of the pooled covariance matrix;
3) LD-RR: linear discriminant with ridge regularization of the pooled covariance matrix;
4) SVM: support vector machine with a linear kernel;
5) GNB-N: nonlinear Gaussian Naïve Bayes classifier, where the (diagonal) covariance
matrices are allowed to differ across classes and the decision function is nonlinear;
6) GNB-L: linear GNB classifier, where the covariance matrix is pooled across classes and
the decision function is linear.
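As a rough modern approximation (an assumption on our part: the evaluation itself used custom implementations plus the LIBSVM library), five of the six models can be mimicked with scikit-learn estimators. There is no off-the-shelf pooled-covariance GNB-L, and shrinkage stands in for ridge regularization in LD-RR:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def build_pool(n_components=10, shrinkage=0.5, svm_C=1.0):
    """Approximate versions of five of the six classifiers; in practice
    the hyper-parameters (K, lambda, C) would be tuned with split-half
    resampling as described in the text."""
    return {
        "QD":    make_pipeline(PCA(n_components), QuadraticDiscriminantAnalysis()),
        "LD-PC": make_pipeline(PCA(n_components), LinearDiscriminantAnalysis()),
        "LD-RR": LinearDiscriminantAnalysis(solver="lsqr", shrinkage=shrinkage),
        "SVM":   LinearSVC(C=svm_C),
        "GNB-N": GaussianNB(),   # per-class diagonal covariances
    }
```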
We will now proceed to the results of evaluation of these classifiers on simulated data.
7.2 Performance of classifiers on simulated data
The simulation framework described in Section 2.2.1 has been used to simulate a block-design
experiment with two conditions: active and baseline. A simulated experimental run consists of
100 active and 100 baseline volumes, arranged in alternating 10-volume blocks. Task-related
signal (which is present in the active blocks and absent from baseline blocks) is determined by
three parameters: mean expected magnitude (M), variance (V, defined as a ratio between the
variances of task-related signal and of the background noise), and correlation across the spatial
network of distributed activations (ρ). To evaluate the effect of each of these parameters on the
performance of classifiers, we have tested various levels of M (0, 0.02, and 0.03), V (0.1 to 1.6,
in increments of 0.25), and ρ (0, 0.5 and 0.99). For each setting of (M, V, ρ), we have generated
100 data sets; an additional 100 sets consisting of baseline volumes are created for ROC analysis.
When analyzing the simulated data, the first 2 volumes of each block have been discarded because of the slow temporal response of the HRF. This reduces the size of the data sets to 160 volumes.
Voxels outside the “simulated brain” have been discarded, leaving 2072 voxels for further
analysis. Each simulated run has been split into two sets, both containing 5 active and 5 baseline
blocks. We have computed the median values of map reproducibility and classification accuracy
across 20 such splits.
The results of the evaluation are displayed on Figure 7.1. The rows of this figure correspond to
the metric of performance: classification accuracy (top), and reproducibility of maps (bottom).
The lines show median performance, taken across 100 data sets generated for each setting of (M,
V, ρ); the error bars show standard deviation of performance. The columns of Figure 7.1
correspond to the levels of M (0, 0.02 and 0.03). The three levels of ρ (0, 0.5 and 0.99) are shown
as three sub-panels of each level of M. Each of these sub-panels consists of a performance plot,
where the horizontal axis represents V, going from 0.1 to 1.6. The vertical axis of the plot is the
mean magnitude of the performance metric.
Figure 7.1. Performance of the pool of six classifiers on simulated data sets. Performance is
measured with classification accuracy (top row) and reproducibility of spatial maps
(bottom row). The three columns correspond to three levels of mean signal magnitude M,
and the three sub-columns to three levels of spatial correlation ρ.
98
7.2.1 Classification accuracy
In terms of classification accuracy (top row of Figure 7.1), there are strong similarities between
LD-PC, LD-RR, SVM and GNB-L. These methods can all be referred to as “linear classifiers”,
because they all use a linear decision function to compute class membership, and the decision
boundary that separates the two classes is a hyperplane. Their accuracy is lowest when M = 0 and increases with mean signal strength M. For M = 0.03, we see a negative effect of increasing V, which is modulated by ρ (negligible when ρ = 0, and strongest when ρ = 0.99). Changing V influences the spread of the "active" volumes; as V grows, the two classes become harder to separate and their class covariances become less similar.
Separation between the two classes can be estimated with the Mahalanobis distance between the
class centroids m1 and m2:
d_{Mah} = \sqrt{(m_1 - m_2)^T S^{-1} (m_1 - m_2)}. \qquad (6.2)
Here, S is the pooled within-class covariance matrix, that is, the average of the sample covariance matrices of the two classes. To invert it, we use PC regularization. LD-PC is
used to select the number of principal components for this approximation, by determining K that
optimizes the Δ metric in each data set.
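A minimal sketch of this PC-regularized Mahalanobis distance, assuming the two classes are stored as volumes-by-voxels arrays (the function name is ours, not from the thesis):

```python
import numpy as np

def mahalanobis_pc(X1, X2, K):
    """Mahalanobis distance between class centroids, with the pooled
    within-class covariance inverted on its top-K principal components.
    X1, X2: (n_samples, n_voxels) arrays for the two classes."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # pooled within-class covariance: average of the two sample covariances
    S = 0.5 * (np.cov(X1, rowvar=False) + np.cov(X2, rowvar=False))
    evals, evecs = np.linalg.eigh(S)            # symmetric eigendecomposition
    order = np.argsort(evals)[::-1][:K]         # keep the top-K components
    evals, evecs = evals[order], evecs[:, order]
    diff = m1 - m2
    proj = evecs.T @ diff                       # project onto the PC subspace
    return np.sqrt(np.sum(proj ** 2 / evals))   # whiten and take the norm
```

Restricting the inversion to K components is what makes the distance computable when the number of voxels exceeds the number of volumes.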
When M = 0.03, mean classification accuracy is to a large extent predicted by median d_Mah. For a given M, we have computed 21 mean values (across 100 data sets) of d_Mah, one for each of 3 levels of ρ and 7 levels of V. Pearson's correlation coefficient between the 21 values of mean d_Mah and the
corresponding mean classification accuracies is 0.954 when M = 0.03. For lower levels of M, this
correlation is not observed, perhaps due to instability in estimating K.
The remaining classifiers in our pool, QD and GNB-N, show quite different trends in
performance (different from the linear classifiers as well as from each other). The positive
influence of M is also observed, but the effect of V and ρ is more complex. These two methods
use a non-linear decision function to classify the volumes, and the covariance matrix is estimated
separately for each class. Therefore, these methods can use the difference in the covariance
matrices to their advantage: if the separation between the class centroids is small (that is, M is
low), prediction of class membership can be enhanced using the different sample covariance
matrices. This is evident when M = 0: here, both nonlinear methods get better as V increases. For
QD, this beneficial effect of V is modulated by ρ; when M = 0, ρ > 0, and V is sufficiently large,
QD is the most accurate classifier, peaking at a mean accuracy of 66% when ρ = 0.99 and V = 1.6.
When ρ = 0, the functional nodes of the active network are independent, and GNB-N is the best
model for our data, achieving 60% mean accuracy at the largest setting of V.
When M>0, the performance of these two nonlinear classifiers is influenced by V in a more
complex way. When M is 0.03, performance is high (greater than 70%), but the influence of V is
detrimental to performance, due to growing overlap between the classes. In this aspect, the
nonlinear classifiers resemble the linear ones; indeed, the performance of the methods starts to converge when M = 0.03. At M = 0.02, there seems to be a transition point where the effect of V changes from beneficial to detrimental, and V seems to have no effect on mean classification accuracy.
7.2.2 Reproducibility
The bottom row of Figure 7.1 shows the reproducibility of spatial maps produced by our pool of
classifiers. If we compare it with the classification accuracy plot, we see that the classifiers here
can be grouped in a different way:
1. univariate methods: two versions of GNB
2. multivariate methods that use PC regularization: LD-PC and QD
3. other multivariate methods: SVM and LD-RR
Inside each group, reproducibility is quite similar, but the groups are clearly distinct in most
cases. Let us inspect the situation for a given level of mean signal, M. We see the same pattern
across all values of M:
• As expected, network coupling (ρ) has no effect on univariate methods. There is a slight detrimental effect of V, which is noticeable at high levels of M. In most cases, univariate maps are less reproducible than multivariate maps. It is interesting to note here that pooling of variance across classes has no effect on reproducibility, because performance of GNB-N and GNB-L is the same.
• PC-based methods (LD-PC and QD) get a tremendous boost from increasing V when the active areas are coupled (ρ > 0). For sufficiently large levels of V, the reproducibility of these two methods greatly surpasses that of all other methods in our pool.
• Other multivariate methods (SVM and LD-RR) are not influenced by V and ρ, and have effectively identical performance (reflecting the real-data findings of Rasmussen et al., 2012). However, the relative ranking of SVM and LD-RR among the pool of classifiers depends on ρ: when ρ = 0, they are sometimes the best methods in terms of reproducibility, along with LD-PC. The same holds when ρ > 0 and V is very small. In other situations (ρ > 0 and V > 0.1) they perform worse than PC-based methods.
This ranking of the methods is largely consistent across M. Overall, M has a positive effect on
reproducibility: for all methods, spatial maps become more reproducible as M grows.
Figure 7.2. Performance of the pool of six classifiers on simulated data sets, measured by
partial area under the ROC curve. The three columns correspond to three levels of mean
signal magnitude M, and the three sub-columns to three levels of spatial correlation ρ.
7.2.3 Partial area under the ROC curve
In addition to the NPAIRS metrics (classification accuracy and reproducibility), we have also
measured the performance using partial area under the ROC curve. The results are displayed on
Figure 7.2, which is organized similarly to Figure 7.1: the three panels correspond to three levels
of M (0, 0.02, 0.03), and the three sub-panels inside each panel correspond to three levels of ρ (0,
0.5, 0.99). We have computed a ROC curve for each of the 16 active areas, and estimated the
partial area corresponding to false positive frequency range from 0 to 0.1 (see Section 2.1.1). The
lines show the median value of partial ROC area across the 16 areas, and error bars show its
standard deviation. The dashed line shows the partial ROC area of 0.05, which corresponds to
random guessing.
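The partial ROC area can be computed by sweeping a threshold over the spatial map values. The sketch below follows the standard thresholding construction and assumes a boolean ground-truth labeling of the simulated active voxels; it is illustrative, not the thesis implementation.

```python
import numpy as np

def partial_roc_area(map_values, is_active, fpr_max=0.1):
    """Partial area under the ROC curve obtained by thresholding a
    spatial map, integrated over false positive rates in [0, fpr_max].
    The theoretical maximum is fpr_max (perfect detection)."""
    order = np.argsort(map_values)[::-1]          # sweep threshold downward
    truth = np.asarray(is_active, bool)[order]
    tpr = np.cumsum(truth) / truth.sum()          # true positive rate
    fpr = np.cumsum(~truth) / (~truth).sum()      # false positive rate
    fpr = np.concatenate(([0.0], fpr))            # start the curve at (0, 0)
    tpr = np.concatenate(([0.0], tpr))
    keep = fpr <= fpr_max
    fpr_c = np.concatenate((fpr[keep], [fpr_max]))
    tpr_c = np.concatenate((tpr[keep], [np.interp(fpr_max, fpr, tpr)]))
    # trapezoidal integration over the clipped curve
    return float(np.sum(np.diff(fpr_c) * (tpr_c[1:] + tpr_c[:-1]) / 2.0))
```

A map that ranks every active voxel above every inactive one attains the maximum value of fpr_max.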
We see that the classifiers group in the same fashion with respect to ROC area as they do with respect to spatial map reproducibility, reflecting the fact that both metrics estimate the same spatial signal detection performance:
1) PC-based multivariate methods;
2) other multivariate methods (SVM and LD-RR);
3) univariate methods.
The first group is the best performer in terms of ROC area. QD is better than LD-PC when M is
zero, and LD-PC is better than QD when M = 0.03 and ρ = 0. In other situations, their
performance is identical. Both methods are sensitive to V, and their performance increases as V
grows from 0 to 1. When mean signal is relatively strong (M = 0.03), both LD-PC and QD are
near-perfect in their signal detection (partial ROC area approaches the theoretical maximum
value of 0.1), for all levels of V and ρ.
Univariate methods are uniformly the worst performers. In the absence of mean signal, they
never rise significantly above chance. When M>0, they are much better than chance, but their
performance drops as V grows. This decline is less severe when M is large. As expected, ρ has no
effect on performance of univariate detectors. Pooling of variance across classes is beneficial for
signal detection: GNB-L is slightly, but consistently, better than GNB-N. The second group of
algorithms (SVM and LD-RR) is intermediate in terms of performance: better than univariate
methods, but never as good as PC-based multivariate methods.
7.2.4 ROC evaluation of GLM, PCA, ICA and PLS
In addition to the pool of classifiers described above, we have evaluated a different group of
popular methods of fMRI data analysis, using the ROC methodology. These methods do not
classify the fMRI volumes; their goal is to produce spatial maps that summarize brain activity.
Some of these methods are unsupervised (i.e. they produce maps that show important sources of
temporal variance in the fMRI signal, irrespective of the task), and some are supervised (they
characterize the response of the brain to a particular task or set of tasks). We have tested the
following unsupervised methods:
• Principal Component Analysis (PCA): signal detection of the first eigenimage of the mean-centered data matrix has been evaluated with ROC methodology;
• Independent Component Analysis (ICA): MELODIC software has been used to decompose the data matrix into independent components (ICs). Each component consists of a spatial map and the associated timecourse. We have correlated these timecourses with the predictor function to identify the component with the strongest correlation. For the predictor function, we have convolved the boxcar function (alternation between task and rest blocks) with the hemodynamic response function (HRF, described in Section 1.1.2). The spatial map of the most-correlated independent component has been evaluated with ROC methodology.
The supervised methods that we have tested are:
• Mean-centered Partial Least Squares (PLS; see McIntosh et al., 1996; McIntosh & Lobaugh, 2004; Krishnan et al., 2011). This method examines the principal components of the product of the data matrix and the design matrix (the latter encodes the task for each volume). For our situation of two-class, one-subject analysis, this product has only one eigenvector, which is proportional to the difference in class means. We do not perform the bootstrapping procedure that is standard in PLS, because of the nature of our simulations (one session per subject, and one subject per experiment).
• General Linear Model (GLM; see Section 3.4). We have used only one predictor in our analysis: the boxcar function convolved with the HRF.
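The predictor shared by the ICA component selection and the GLM can be sketched as below. The double-gamma HRF parameters here are common defaults and are our assumptions, not values taken from Section 1.1.2; the function names are also ours.

```python
import numpy as np
from math import gamma as gamma_fn

def _gamma_pdf(t, shape):
    """Gamma density with unit scale (used to build a double-gamma HRF)."""
    t = np.asarray(t, float)
    return np.where(t > 0, t ** (shape - 1) * np.exp(-t) / gamma_fn(shape), 0.0)

def boxcar_predictor(n_pairs=10, block_len=10, tr=2.0):
    """Task predictor: a baseline/active boxcar (n_pairs alternating pairs
    of block_len volumes each) convolved with a canonical double-gamma
    HRF. The HRF shape parameters (6 and 16) are common defaults."""
    boxcar = np.tile(np.concatenate([np.zeros(block_len),
                                     np.ones(block_len)]), n_pairs)
    t = np.arange(0.0, 32.0, tr)                 # HRF sampled at the TR
    hrf = _gamma_pdf(t, 6.0) - _gamma_pdf(t, 16.0) / 6.0
    hrf /= hrf.sum()
    return np.convolve(boxcar, hrf)[: boxcar.size]

def most_correlated_component(timecourses, predictor):
    """Index of the ICA component whose timecourse has the strongest
    absolute correlation with the predictor."""
    r = [abs(np.corrcoef(tc, predictor)[0, 1]) for tc in timecourses]
    return int(np.argmax(r))
```

The same convolved boxcar serves as the single GLM regressor; for ICA it is compared against each component's timecourse.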
The results are displayed on Figure 7.3. As in the previous section, for our metric of signal detection we use the partial area under the ROC curve for a false positive rate between 0 and 0.1.
Figure 7.3. Partial area under ROC curve measured for maps that are produced by GLM,
ICA, PLS and PCA.
When M = 0 we see that PCA is better (sometimes dramatically so) at signal detection than other
methods. Signal detection of PCA improves as V grows; this is expected, because V is the ratio
of task-coupled variance to noise variance. When V is large, the variance in the active loci is
higher than in the inactive loci, helping the active loci to “stand out” in the eigenimages. For
example, when V = 1, the variance is twice as much in the active loci (which contain the sum of
signal and noise) as in the inactive loci (which contain only the noise). When active loci form a
spatial network (i.e. when ρ > 0), the first eigenimage accounts for a relatively large share of
variance. Therefore, at large levels of V and ρ, the first eigenimage captures most of the task-
driven variance. Because this variance is expressed only in the active loci, they get high weights
in the eigenimages. This explains the success of the PCA algorithm at M = 0, when other
methods are either at chance or only slightly better than chance. PCA also outperforms all the
methods in our classifier pool, including LD and QD (compare Figures 7.3 and 7.2).
When M > 0, performance of PCA gets better because the mean difference is another source of
variance that is captured by the first eigenimage. This variance is manifest only in the active loci,
so PCA is doing a good job at detecting them. The performance of GLM and (to a lesser extent)
PLS improves dramatically when M > 0. The PLS map is directly proportional to the difference in class means and is not affected by ρ or V. GLM, as a univariate method, is not affected by ρ, and V has a
detrimental effect; the GLM map consists of the t values at each voxel location; these values
decrease in magnitude as V grows. Overall, GLM is the best method of the four when either ρ or V is small, but it is still worse than LD or QD regularised on a PC subspace (see Figure 7.2). It
outperforms univariate GNB (both linear and nonlinear). Remember that in our simulations the
signal is convolved with HRF, which has a somewhat slow response; consequently, the first
couple of volumes in the "active" blocks do not reflect the mean difference. In our implementation of GNB, we discard the transition scans at the beginning of each block, reducing the sample size (2 images are discarded at the beginning of each block, so only 160 volumes out of the original 200 are retained). GLM does not discard the transition scans; rather, this slow response is modeled in the predictor. In our simulations, GLM has the additional benefit of using the "true" HRF in the predictor, i.e. the same function that has been used in the construction of the data.
7.3 Summary of evaluation of classifiers on simulated data
Figure 7.2 shows that the multivariate classifiers based on PCA regularization (that is, QD and
LD-PC) are the most efficient methods of detecting the simulated signal. This advantage over
other classifiers from our pool is explained by the performance of PCA as shown in Figure 7.3.
For V > 1 in Figure 7.3, PCA by itself performs similarly to or better than SVM and LD-RR.
Therefore, when an adaptive PCA subspace is used to provide discriminant features for QD and
LD, they do not need to further improve performance by much to easily outperform the other
approaches as signal detectors for variance-driven covariance structures.
As we can see in the top row of Figure 7.1, PCA regularization does not provide any advantage
to LD with respect to the accuracy of classification: LD performs at the same level of accuracy
regardless of the method of regularization (PCA or ridge regularization). At M=0.03, the
accuracy of all classifiers from our pool is more or less similar and displays the same declining
trend with increasing V. At lower levels of M, the difference in accuracy is driven by linearity of
classifiers rather than by regularization method (there is an additional difference between
nonlinear univariate and nonlinear multivariate classifiers, that is, between GNB-N and QD).
Comparing Figures 7.1 and 7.2, we can say that signal detection has more similarities to
reproducibility of spatial maps than to classification accuracy. We observe the same trends in
signal detection (Figure 7.2) as in reproducibility (Figure 7.1, bottom row): the PCA-regularized
classifiers are the best performers, and their performance improves with the growing variance of
correlated active signal; the univariate methods perform the worst, and are negatively influenced
by growing V; finally, SVM and LD-RR perform at the intermediate level, and the influence of
increasing V is minimal (a slight negative trend can be observed with respect to signal detection).
Principal components capture the most important sources of variance in the data. When these sources are correlated, they are to some degree captured by the first principal component. With growing V, the portion of total variance that is due to the active signal increases; this is reflected in the improved signal detection of the first principal component (Figure 7.3). This, in turn, helps the methods that use PCA for regularization to improve their signal detection (Figure 7.2) and the reproducibility of their maps (Figure 7.1, bottom row). To the extent that variance-driven covariance structures provide a good description of brain networks as reflected in real BOLD fMRI data sets, we can expect QD and LD on a regularised subspace to also outperform other approaches as signal detectors.
Chapter 8 Evaluation of classifiers: real fMRI data
8.1 Data sets
This chapter presents the results of evaluation of our pool of classifiers on two real fMRI data
sets. In the previous chapter, we have used simulations to investigate the influence of parameters
of simulated signal (namely, M, V and ρ) on the performance of our classifier pool. In this
chapter, we apply this classifier pool to real fMRI data; the ranking of classifiers according to
their performance is largely similar to the one observed in simulations. We concentrate on
within-subject analysis, so the analysis of real fMRI data is comparable to our analysis of
simulated data, where we have not attempted to simulate the across-subject heterogeneity.
However, one of our real data sets was also subject to group-level analysis, described in Section
8.4.2.
The two real fMRI sets come from the stroke recovery study (see Section 2.2.2) and from an
aging study (see Section 2.2.3). These two sets represent two scenarios with different advantages
and disadvantages. The first set is composed of a small number of highly heterogeneous subjects,
with a large number of fMRI volumes per subject. In the second set, the number of subjects is
much larger, and they are more homogeneous (within their age group) because they all come
from a healthy population. However, the number of fMRI volumes per subject is much smaller in
the second set. Also, the first set is a longitudinal study with 4 scanning sessions per subject; this
gives us an interesting opportunity to classify the volumes based on the time after stroke ("early"
versus "late" sessions) in addition to classification of volumes based on the task ("finger tapping"
versus "wrist flexion" and "healthy hand movement" versus "impaired hand movement"). The
second study is not longitudinal, but the subjects come from two age groups, which allows us to
study the effects of age on classifier performance.
Another important advantage of the second set is the range of behavioural tasks performed by
subjects during scanning. In the first set, the subjects performed simple motor tasks (finger
tapping and wrist flexion, with alternating hands). In the second set, the subjects performed a set
of visuomotor tasks with varying cognitive load: simple reaction task (RT), matching of a target
stimulus to one of the three simultaneously presented stimuli (PM, for "perceptual matching"),
and a short-term memory task, where the target stimulus was removed from the screen before
presentation of the three stimuli (DM, for "delayed matching"). The fourth "task" was passive
fixation on a dot in the middle of the screen (FIX). We have used four binary contrasts for our
within-subject analysis: DM/FIX, RT/FIX, DM/RT and DM/PM. We can assume that the
RT/FIX and DM/FIX are relatively strong contrasts, because different brain networks are
recruited during active visuomotor tasks and during passive fixation (see Grady et al., 2010). The
DM/PM contrast can be assumed to be the weakest, because of the similarity between the DM
and PM tasks (because of this similarity, we did not study PM/FIX and PM/RT contrasts).
Finally, the DM/RT contrast is intermediate. These four contrasts represent a range of "contrast
strengths", which is somewhat analogous to varying M, V and ρ in our simulations.
Both data sets were analyzed using a split-half resampling framework. The volumes in the stroke
set were pooled across 4 scanning sessions, and then split into two half-sets; the temporal
separation between the two half-sets was at least 16 seconds. 20 such splits were created for each
subject. In the aging study, 4 scanning runs were acquired for each subject; these 4 runs were
evenly split into half-sets, providing 3 splits per subject. For each binary contrast (in both stroke
and aging data), we have balanced the number of volumes between the two classes within half-sets by subsampling from the larger class.
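The subsampling step can be sketched as follows (the function name is ours; any NumPy random generator will do):

```python
import numpy as np

def balance_classes(labels, rng):
    """Indices that balance a binary half-set by randomly subsampling
    the larger class down to the size of the smaller one."""
    labels = np.asarray(labels)
    idx0 = np.flatnonzero(labels == 0)
    idx1 = np.flatnonzero(labels == 1)
    n = min(idx0.size, idx1.size)
    keep0 = rng.choice(idx0, size=n, replace=False)
    keep1 = rng.choice(idx1, size=n, replace=False)
    return np.sort(np.concatenate([keep0, keep1]))
```

Balancing the classes keeps chance-level classification accuracy at 50% within each half-set.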
The first part of this chapter (Sections 8.2 and 8.3) evaluates the classifiers in within-subject
analysis framework. The results obtained on the two real datasets are compared with the results
of within-subject analysis of simulated data sets. The second part of the chapter (Section 8.4)
uses the aging study data to investigate some questions that go beyond within-subject analysis.
First, the within-subject spatial maps are evaluated on their reproducibility across subjects.
Second, we inspect the agreement between the maps created by different classifiers on the same
subject. Finally, we classify the within-subject maps based on the age group of the participant;
we create the group-level spatial map for this classification and use it to identify the spatial
locations where the effect of age is expressed.
The results of evaluation on the stroke recovery set have been published previously (Schmah et
al., 2010); for that paper, we have used only one metric of evaluation: the accuracy of
classification. The remaining results will be submitted for publication in early 2013.
8.2 Evaluation on real data: Stroke study
Data collected for the longitudinal study of stroke recovery has been used to evaluate the
performance of our pool of classifiers defined in Section 7.1. The results are displayed in Figure
8.1. Classifiers have been applied to three binary classification problems: “healthy hand versus
impaired hand”, “early session versus late session”, and “finger tapping versus wrist flexion”.
For each of the 9 participants, we have created 20 training-test splits of equal size; classification
accuracy and map reproducibility have been averaged across 20 splits for each subject
separately. The figure shows box-and-whisker plots of the mean values of classification accuracy
(left panel) and map reproducibility (right panel) across the 9 subjects. Multivariate classifiers
have been regularized by optimization of the Δ metric in the split-half framework (defined in
Formula 7.1).
Figure 8.1. Performance of the pool of classifiers on the stroke recovery dataset for three
contrasts (healthy/impaired, early/late, and finger/wrist). The top figure shows the
accuracy of classification, and the bottom figure shows the reproducibility of spatial maps
for six algorithms of classification.
The plot shows that there is a large amount of overlap in the performance of our classifiers.
Accuracy of classification is above chance for all three classification contrasts. For the
"healthy/impaired" contrast, the distribution of accuracies across the nine subjects is roughly the
same for all classifiers, with QD being slightly better than others on average. For the "early/late"
contrast, the across-subject accuracy is the greatest for QD, followed by SVM and nonlinear
GNB. The difference in classifiers' accuracy in the third contrast, "finger/wrist", is more
pronounced than in the first two contrasts. Multivariate linear classifiers (LD-PC, LD-RR and
SVM) are most accurate. In this contrast, the across-subjects variability of classification accuracy
is smaller than in the other two; this also applies to across-subjects variability of reproducibility.
For the other two contrasts, reproducibility is more heterogeneous across subjects (in the extreme
case, reproducibility of nonlinear GNB for the "early/late" contrast ranges from 0 to 1). Of the
three classification contrasts, performance in the "finger/wrist" contrast is the most stable across
subjects. Anatomically, the subjects in this data set showed a large heterogeneity in the location
and severity of the lesion, as well as in their behavioural recovery of motor function (Small et al.,
2002). It is possible that this drives the heterogeneity of classification accuracy and
reproducibility that we observe for the "healthy/impaired" and the "early/late" contrasts.
The classification performance is more stable in the "finger/wrist" contrast, because the two
tasks, "finger tapping" and "wrist flexion", recruit different areas of the motor cortex. The
function of motor cortices has been disrupted in stroke patients, but the data from the healthy and
impaired hands are pooled together for this classification task, which can reduce the
heterogeneity due to stroke.
With such a heterogeneous data set, it is difficult to compare classifiers against each other using
just the box-and-whisker plot (such as the one shown on Figure 8.1). We perform additional
statistical testing to answer the question whether the ranking of classifiers is consistent across
subjects. The framework for testing this question with non-parametric statistical tests is
described in Conover (1999; pages 369-373), and also in Demsar (2006). For each contrast and
subject, we rank the methods from 1 to 6 according to their median performance (either accuracy
or reproducibility), with rank 1 being the best performer. Whenever two methods are tied, say for
ranks r and r+1, they are assigned a common rank of r+0.5. First, we test the null hypothesis that
all methods perform equally well (and the difference in their ranking is therefore not significant),
using the Friedman test.¹⁵ We compute the statistic

\chi^2 = \frac{(k-1) \sum_{j=1}^{k} \left( \sum_{i=1}^{b} r_{ij} - \frac{b(k+1)}{2} \right)^2}{\sum_{j=1}^{k} \sum_{i=1}^{b} r_{ij}^2 - \frac{bk(k+1)^2}{4}}, \qquad (8.1)

where rij is the rank of the jth classifier in the ith subject, b is the number of subjects and k is the number of classifiers. The approximate distribution of this statistic is χ2 with k-1 degrees of freedom. The Friedman test is the non-parametric equivalent of repeated-measures ANOVA.
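The tie-adjusted statistic can be sketched on a subjects-by-classifiers performance table as below; the helper names are ours, and rows are negated before ranking so that rank 1 goes to the best performer, matching the text.

```python
import numpy as np

def average_ranks(values):
    """Rank a 1-D array, assigning tied entries their average rank
    (rank 1 = smallest value)."""
    values = np.asarray(values)
    order = np.argsort(values)
    sorted_vals = values[order]
    ranks = np.empty(len(values))
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and sorted_vals[j + 1] == sorted_vals[i]:
            j += 1                                  # extend the tie group
        ranks[order[i:j + 1]] = (i + j) / 2 + 1     # average rank for ties
        i = j + 1
    return ranks

def friedman_statistic(perf):
    """Tie-adjusted Friedman statistic (after Conover, 1999) for a
    (b subjects) x (k classifiers) performance table."""
    b, k = perf.shape
    r = np.vstack([average_ranks(-row) for row in perf])  # rank 1 = best
    col_sums = r.sum(axis=0)
    num = (k - 1) * np.sum((col_sums - b * (k + 1) / 2) ** 2)
    den = np.sum(r ** 2) - b * k * (k + 1) ** 2 / 4
    return num / den   # approximately chi-square with k-1 df
```

With no ties, this reduces to the classic Friedman chi-square statistic.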
If the Friedman test results in the rejection of the null hypothesis, we can proceed to post-hoc testing,
where we test the significance of difference in ranking between a pair of classifiers. For each
contrast separately, we compute the average rank of each classifier across subjects. Then, the
ranking of a pair of classifiers is significantly different if the difference in their average ranks is
no less than the critical distance, defined as

CD = t_{1-\alpha/2} \left[ \frac{2 \left( b \sum_{j=1}^{k} \sum_{i=1}^{b} r_{ij}^2 - \sum_{j=1}^{k} \left( \sum_{i=1}^{b} r_{ij} \right)^2 \right)}{b^2 (b-1)(k-1)} \right]^{1/2}, \qquad (8.2)
where t_{1-α/2} is the 1-α/2 quantile of the t distribution with (b-1)(k-1) degrees of freedom, and α is the significance level, which we set to 0.05. This threshold of significance uses a Bonferroni correction for multiple comparisons, and is therefore somewhat conservative.
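The post-hoc comparison on average ranks can be sketched as follows, after Conover (1999). The t quantile is passed in rather than computed, to keep the sketch dependency-free; the function name is ours.

```python
import numpy as np

def critical_distance(ranks, t_quantile):
    """Critical distance for post-hoc comparison of *average* ranks:
    two classifiers rank significantly differently when their average
    ranks differ by at least this much.
    ranks: (b subjects) x (k classifiers) within-subject ranks;
    t_quantile: the 1 - alpha/2 quantile of the t distribution with
    (b-1)(k-1) degrees of freedom, looked up separately."""
    b, k = ranks.shape
    col_sums = ranks.sum(axis=0)                  # rank sums per classifier
    a1 = np.sum(ranks ** 2)
    var_term = 2.0 * (b * a1 - np.sum(col_sums ** 2)) / ((b - 1) * (k - 1))
    return t_quantile * np.sqrt(var_term) / b     # divide by b: average ranks
```

When all subjects produce identical rankings, the residual variance (and hence the critical distance) is zero, so any difference in average ranks is significant.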
¹⁵ We give the formulas according to Conover (1999), where the tests are adjusted for the situation when ties are present. Demsar (2006) does not make this adjustment.
Figure 8.2. Ranking of six classifiers for two performance metrics: classification accuracy
(top) and map reproducibility (bottom). Ranks of 1 and 6 correspond to the best and the
worst performer, respectively. If the ranking of classifiers is not significantly different, they
are linked with a thick horizontal bar. Significance is established with Friedman and post-hoc nonparametric testing, as described in the text. For the contrasts with significant
difference in ranking, critical distances (CD) are also specified.
Figure 8.2 displays the results of this test graphically. Classification accuracy is evaluated in the
top row of Figure 8.2, and the reproducibility of spatial maps is evaluated in the bottom row.
Each bar represents a contrast; if there is a significant difference between the ranks (as estimated
by Friedman test), critical distances (CDs) are computed. The average ranks of classifiers are
marked on each bar, with 1 being the best performer and 6 being the worst. If the ranking of
classifiers is not significantly different (that is, if the distance between the corresponding markers
is smaller than the critical distance), they are linked together with a horizontal line under the bar.
Figure 8.2 shows that, in terms of classification accuracy, QD is the best-ranking method for two
out of three contrasts, but its ranking is not significantly different from SVM and LD-PC (for
"healthy/impaired" contrast) and from SVM and GNB-N (for "early/late" contrast). For that
second contrast, the methods fall into two groups with significantly different ranks: (QD, SVM,
GNB-N) are consistently better than (LD-PC, GNB-L, and LD-RR). For the "finger/wrist"
contrast, SVM has the highest rank, and two versions of LD are not significantly worse. QD and
GNB-L are ranked consistently lower than these three methods, and GNB-N is uniformly ranked
as the worst method in all subjects. In terms of reproducibility, we see that PC-based methods
(LD and QD) tend to have the highest ranking (with the exception of QD in "early/late" contrast).
This is consistent with our simulations, where maps for QD and LD-PC are much more
reproducible compared with other classifiers (see Figure 7.1, when M=0.02 or 0.03, ρ=0.5 or
0.99, and V>1). However, this is not significant in any of the three contrasts.
In our publication (Schmah et al., 2010), we used the stroke dataset to evaluate a large group of
classifiers, which included QD, LD-PC, linear SVM and GNB, and also K-nearest neighbours,
logistic regression, restricted Boltzmann machines (RBMs), and two nonlinear kernels of SVM.
In that publication, we used classification accuracy as our only metric of performance, and did not study the spatial maps (which are difficult to construct for some of the classifiers). Regularization was performed by optimizing classification accuracy, rather than the Δ metric we used above. The box-and-whisker plots of within-subject classification accuracies are
shown in Figure 8.3. Classification accuracies are higher than the numbers shown on the top row
of Figure 8.1, for two reasons. First, the classifiers in Schmah et al. publication were tuned to
optimize classification, without accounting for reproducibility of maps (as in Figure 8.1).
Second, the size of the training set was larger (75% in the Schmah et al. publication, compared to
50% in split-half resampling procedure used to compute Figure 8.1). The most dramatic
difference is demonstrated by QD in the "early/late" contrast: with accuracy-driven
regularization, it is at least 99% accurate in all the subjects, with a median accuracy of 99.81%.
The only method that outperforms QD for this contrast is RBM with generative training, a much
more complex method with a much larger computation time (it requires 7.4 hours to classify one
subject's data set, whereas QD requires only 49 seconds). Overall, the performance of LD and
QD is comparable to (and, in many cases, surpasses) the performance of more complex methods,
such as RBMs, logistic regression, and kernel SVM. This suggests that multivariate Gaussian
distribution is a useful probabilistic model for fMRI data; this is supported by a study by Hlinka
et al. (2011) that showed that linear correlation provides a good description of connectivity
between brain regions as measured by fMRI.
Figure 8.3. Performance of a larger group of classifiers, used in a study by Schmah et al.
(2010) to classify the data in the stroke recovery study.
8.3 Evaluation on real data: Aging study
The data recorded in the aging study (see Section 2.2.3) have been used as another real data set
for the evaluation of our pool of classifiers. This data set is larger than the stroke set: it consists
of 19 young and 28 older subjects, whereas the stroke set consists of 9 subjects. It is also
presumably less heterogeneous: all the subjects have been screened with a questionnaire to
exclude health problems, and the anatomical MRI scans have been inspected to rule out severe
abnormalities (Grady et al., 2010). We have used the set of six classifiers (QD, LD-PC, LD-RR,
linear SVM, linear and nonlinear GNB) to classify the fMRI volumes based on behavioural task.
We examine interactions between performance of classifiers, strength of the contrast, and age
group (our dimensionality studies show the difference in intrinsic dimensionality between the
two age groups; see Figure 5.7).
We have tested 4 contrasts: RT/FIX, DM/FIX, DM/RT, and DM/PM (these tasks are described
in detail in Section 2.2.3). All analysis has been performed separately for each subject and
contrast. Multivariate classifiers have been regularized with the optimization of the Δ metric.
Figure 8.4 displays the within-subject classification accuracy and reproducibility on a box-and-
whisker plot; two age groups are plotted separately (left and right columns correspond to young
and older subjects).
Figure 8.4. Performance of the pool of classifiers on the dataset from the aging study. Left
and right columns correspond to subjects in the young and the older age groups,
respectively.
Examination of performance in Figure 8.4 shows that the grouping of classifiers with respect to
their performance is similar to the grouping we see in the simulations (see bottom row of Figure
7.1; also, Figure 7.2). In simulated data, we observed the following grouping of methods with
respect to reproducibility: (1) QD and LD-PC, (2) SVM and LD-RR, (3) GNB-L and GNB-N.
This trend can be observed in the aging data as well. To study the ranking of classifiers, we
performed post-hoc Friedman tests; results are shown on Figures 8.5 (for the young group) and
8.6 (for the older group).
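The ranking procedure behind these figures can be sketched in Python. This is an illustrative reconstruction, not the thesis code: it assumes a Nemenyi-style critical distance (Demšar, 2006) as the post-hoc step, and the function name and array layout are hypothetical.

```python
import numpy as np
from scipy import stats

# Critical values of the studentized range statistic (already divided by
# sqrt(2)) at alpha = 0.05, indexed by the number of compared classifiers.
# Values from Demsar (2006); the Nemenyi post-hoc test is an assumption
# about how the "critical distance" bars were obtained.
Q_ALPHA_05 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850}

def rank_classifiers(accuracy):
    """Friedman test across subjects plus the Nemenyi critical distance.

    accuracy: (n_subjects, n_classifiers) array of per-subject accuracies.
    Returns each classifier's mean rank (higher = more accurate), the
    Friedman p-value, and the critical distance: two classifiers whose
    mean ranks differ by less than this are linked as "not significantly
    different".
    """
    n_subj, n_clf = accuracy.shape
    ranks = stats.rankdata(accuracy, axis=1)  # rank within each subject
    mean_ranks = ranks.mean(axis=0)
    _, p_value = stats.friedmanchisquare(*accuracy.T)
    cd = Q_ALPHA_05[n_clf] * np.sqrt(n_clf * (n_clf + 1) / (6.0 * n_subj))
    return mean_ranks, p_value, cd
```

Classifiers whose mean ranks fall within one critical distance of each other would then be linked by a horizontal bar, as in Figure 8.2.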
Figure 8.5. Ranking of classifiers in the young age group. Classifiers linked with a
horizontal bar are not significantly different in their ranking.
Figure 8.6. Ranking of classifiers in the older age group.
Figure 8.4 shows that in both age groups the RT/FIX and DM/FIX contrasts are the easiest to
classify. This is expected, because the RT and DM behavioural tasks require visual attention and
motor action, whereas FIX is a passive condition. Therefore, the cortical recruitment in the two
classes of the binary classification problem is expected to be quite different. In the DM/PM
contrast, this is not the case: both of these tasks required matching a target to one of the three
stimuli. In the PM task, the target is presented simultaneously with the three stimuli, whereas in
the DM task the target is presented first and then removed, and there is a 2.5 second interval of
blank-screen display before the three stimuli are presented. Therefore, the difference between the
two tasks is in the 2.5-second interval (with a TR of 2 seconds) when the subject had to
retain the target in short-term memory. Our pool of classifiers largely ignores this difference:
when distinguishing between DM and PM, they are only slightly better than random guessing.
The plot of reproducibility indicates that the PC-based methods (LD-PC and QD) are able to find
reproducible spatial networks within the DM/PM contrast data; however, the activity of these
networks is not predictive of a mean difference between the stimuli16. The DM/RT contrast is
intermediate: the RT task requires attention and motor action, but, unlike the DM task, it requires
neither perceptual matching nor short-term memory. Due to this difference between the classes,
the accuracy is better than chance on most subjects, but, on average, worse than the accuracy in
the two strongest contrasts (RT/FIX and DM/FIX).
Comparing the performance of our classifier pool across the age groups, we can see that the
classifiers frequently perform better on the older subjects. We know that the older subjects have
performed the behavioural tasks as accurately as younger subjects; however, the reaction time is
higher in the older group (Grady et al., 2010). The older subjects’ data may be easier to classify
because they spent more time carrying out the behavioural task. Another contributing factor
might be compensation in older adults: to achieve the same level of behavioural accuracy as the
younger subjects, the older subjects recruit larger areas of the cortex (Mattay et al., 2006),
making the signal easier to detect.
We have evaluated the significance of the age-related difference in accuracy with a Mann-
Whitney U test (Conover, 1999; this test is a non-parametric equivalent to Student's unpaired t
test). To account for multiple comparisons, we have used false discovery rate (FDR) correction
with FDR=0.05 (Genovese et al., 2002). In two contrasts (DM/FIX and RT/FIX), the accuracy of
some classifiers has been found to be significantly different between the young and the older
groups. In the DM/FIX contrast, these classifiers are LD-PC, LD-RR and SVM; in the RT/FIX
contrast, the difference is observed in LD-PC, LD-RR and QD. This significant difference in
accuracy implies the difference in cortical recruitment, which was explored in our group-level
classification presented in Section 8.4.2.
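The significance test above can be sketched as follows. The function name and input layout are assumptions; the Benjamini-Hochberg procedure is one standard way to implement the FDR correction of Genovese et al. (2002).

```python
import numpy as np
from scipy import stats

def group_difference_fdr(young_accs, older_accs, q=0.05):
    """Mann-Whitney U tests with Benjamini-Hochberg FDR control.

    young_accs, older_accs: lists of 1-D arrays of per-subject accuracies,
    one pair of arrays per comparison (e.g. per classifier-contrast
    combination; this layout is an assumption for illustration).
    Returns the p-values and a boolean mask of comparisons that survive
    FDR correction at level q.
    """
    pvals = np.array([
        stats.mannwhitneyu(y, o, alternative='two-sided').pvalue
        for y, o in zip(young_accs, older_accs)
    ])
    m = len(pvals)
    order = np.argsort(pvals)
    # Benjamini-Hochberg: find the largest i with p_(i) <= (i/m) * q
    passed = pvals[order] <= (np.arange(1, m + 1) / m) * q
    significant = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.nonzero(passed)[0].max()
        significant[order[:k + 1]] = True
    return pvals, significant
```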
16. The BOLD activity in these networks might not be predictive of the task, but the correlation within these networks might be very high (relative to the rest of the brain), making them easy to detect with PC analysis and, therefore, highly reproducible. Visual inspection of PC-based classifier maps for the DM/PM contrast showed strong sensitivity in the default-mode regions. These regions are not expected to be "recruited" per se by either DM or PM tasks, but they exhibit high correlation of BOLD signal (Spreng & Grady, 2010).
8.4 Spatial maps for the aging study
In the previous section, we have focused on within-subject analysis, and (among other things) on
reproducibility of spatial maps within a subject. The next step of evaluation is to address the
following questions:
1) for a given classifier, how reproducible are the maps across subjects?
2) for a given subject, how well do the maps computed with different classifiers agree with
each other?
In order to compare the maps across subjects and methods, they have to be normalized so the
distribution of noise is matched across subjects/methods. This normalization is described in
NPAIRS literature (Strother et al., 2002; LaConte et al., 2003), where the resulting normalized
maps are called "reproducible statistical parametric maps, Z-scored" (rSPM{Z}). The
distribution of signal and noise is computed from the scatter plots of two split-half maps (which
are divided by their respective standard deviations in order to bring them to the same scale). As
described in Section 2.1.2, the scatter plot has a major axis along the line of identity and a minor
axis perpendicular to it. Variation along the minor axis is due to the noise uncorrelated with
signal of interest, and variation along the major axis contains a mixture of signal and noise. For
two split-half maps z1 and z2, the projection of scatter-plot points onto the major and minor axes
is (z1+ z2)/2 and (z1- z2)/2, respectively (Strother et al., 2002). To obtain rSPM{Z}, we divide the
projection onto the major axis by the standard deviation of the projection onto the minor axis.
We repeat this procedure for all splits; each split gives us a scatter plot, and rSPM{Z} patterns
are averaged across splits.
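The normalization just described can be sketched in Python. This is an illustrative reconstruction of the NPAIRS rSPM{Z} procedure, not a reference implementation; the array layout and function name are assumptions.

```python
import numpy as np

def rspm_z(maps_half1, maps_half2):
    """Average rSPM{Z} from pairs of split-half spatial maps.

    maps_half1, maps_half2: (n_splits, n_voxels) arrays; row i holds the
    two half-maps from split i. Each half-map is scaled to unit variance,
    the voxelwise scatter plot is projected onto the major (signal) and
    minor (noise) axes, and the signal projection is divided by the
    noise standard deviation pooled across voxels.
    """
    z = np.zeros(maps_half1.shape[1])
    for m1, m2 in zip(maps_half1, maps_half2):
        m1 = m1 / m1.std()
        m2 = m2 / m2.std()
        signal = (m1 + m2) / 2.0   # projection onto the line of identity
        noise = (m1 - m2) / 2.0    # projection onto the minor axis
        z += signal / noise.std()  # Z-score against the pooled noise SD
    return z / maps_half1.shape[0]  # average rSPM{Z} across splits
```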
Visual inspection of normalized rSPM{Z} maps, averaged across subjects for each contrast and
classifier, has shown that the brain areas with the strongest expression of contrast (i.e., largest Z-
scores) are neurobiologically meaningful. For the RT/FIX and DM/FIX contrasts, the brain areas
with preference17 to FIX condition are members of the default mode network18 (Fox et al., 2003;
17. Preference is defined analogously to Rasmussen et al. (2012A). Let's say that a volume is assigned to class 1 when the decision function, computed on that volume, is positive; if it is negative, the volume is assigned to class 2. A voxel has a preference for class 1 if increasing the signal in that voxel leads to an increase in the decision function.
Greicius et al., 2005; Toro et al., 2008). These areas are: posterior cingulate cortex, anterior
cingulate cortex with adjacent ventro-medial prefrontal area, bilateral angular gyri, and bilateral
superior frontal gyri. The areas that show preference for the active task (RT in the RT/FIX
contrast, DM in the DM/FIX contrast) are the motor areas (primary motor cortex and middle
cingulate/supplementary motor area), bilateral insulae/frontal opercula and dorso-lateral
prefrontal cortices, and large areas of bilateral intraparietal lobules. These regions are often
recruited when performing an externally driven task (Grady et al., 2010); they have been shown
to be correlated amongst themselves and anti-correlated with default-mode regions (Toro et al.,
2008). Grady et al. (2010) refer to the network formed by these areas as the "task-positive
network".
For the DM/RT contrast, the preference for RT is found in bilateral insulae, anterior and
posterior cingulate gyri, middle cingulate/supplementary motor area, and left primary motor
cortex. Preference for DM is found in posterior area of intraparietal lobule (bilaterally). The
same area shows preference for DM in the DM/PM contrast, whereas preference for PM is found
in visual and middle cingulate areas. Neurobiological interpretation of the results of DM/RT and
DM/PM contrasts is difficult because of high variability across subjects (see below), and requires
further analysis.
8.4.1 Reproducibility of spatial maps across subjects and across methods
Across-subject reproducibility has been evaluated by correlating the individual spatial maps
between all possible pairs of subjects within the age group. We have computed spatial maps for
all subjects and contrasts, using 6 classifiers from our pool; the regularization hyperparameters
have been tuned to optimize the Δ metric. The spatial maps have been normalized as described
above. After that, for each classifier separately, we have computed Pearson's correlation
coefficient across all possible pairings of subjects within each group. The young group consists
of 19 subjects, giving us 171 possible pairings; older group has 28 subjects and 378 possible
18. Greicius et al. (2003) define the default-mode areas as the areas that show "relative decreases in neural activity during task performance compared with a baseline state".
pairings. The box-and-whisker plot on Figure 8.7 presents the distribution of these pairwise
correlations in the young and the older groups. This figure demonstrates that across-subject
reproducibility is very similar for all classifiers in our pool. The advantage that PC-based
methods have in within-subject reproducibility does not generalize across subjects. LD-PC and
QD are operating on a subset of highest-ranked principal components; these components capture
the primary sources of variance within a subject, which boosts the reproducibility of QD and LD-
PC individual maps. However, across-subject reproducibility of PC-based classifier maps is
much smaller than their within-subject reproducibility, which suggests that these primary sources
of variance are heterogeneous across subjects. Figure 8.7 also shows that across-subjects
reproducibility of univariate maps (GNB-L and GNB-N) is comparable to reproducibility of
multivariate maps. It should be noted that our procedure of normalization is essentially
multivariate, because the standard deviation along the noise (minor) axis is pooled across voxels;
therefore, the rSPM{Z} patterns created for GNB-L and GNB-N are not truly univariate.
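The pairwise computation described above can be sketched as follows (the function name and array layout are assumptions for illustration):

```python
import numpy as np
from itertools import combinations

def pairwise_map_correlations(maps):
    """Across-subject reproducibility of spatial maps.

    maps: (n_subjects, n_voxels) array of normalized rSPM{Z} maps for one
    classifier and contrast. Returns Pearson's correlation for every pair
    of subjects: n*(n-1)/2 values (19 subjects give 171 pairings,
    28 subjects give 378, as in the text).
    """
    corrs = []
    for i, j in combinations(range(maps.shape[0]), 2):
        corrs.append(np.corrcoef(maps[i], maps[j])[0, 1])
    return np.array(corrs)
```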
Figure 8.7. Across-subject reproducibility of within-subject spatial maps created by
different classifiers. The left and right panels correspond to the young and the older groups
of subjects, respectively.
The contrast has a marked effect on across-subject reproducibility, which tends to be larger for
the strong contrasts (RT/FIX and DM/FIX). Reproducibility of maps for the weakest contrast,
DM/PM, is near zero on average. There is also a difference between the age groups: in the older
group, the maps are more reproducible across subjects. This difference is significant for all
classifiers in the two strong contrasts (p<0.05 on a Mann-Whitney U test, corrected for multiple
comparisons with FDR at α=0.05). In the DM/RT and DM/PM contrasts, it is significant in LD-
RR, GNB-L and GNB-N maps. In the DM/PM contrast, it is also significant in SVM maps.
Recall that within-subject reproducibility is not significantly different between the age groups
(Figure 8.4, bottom row); there seems to be more group heterogeneity in the older group, but the
amount of within-subject heterogeneity is roughly equivalent for the two age groups.
Figure 8.8. Jaccard overlap of within-subject spatial maps across subjects. The left and
right panels correspond to the young and the older groups of subjects, respectively.
Similarity of within-subject spatial maps across subjects was also evaluated using Jaccard
overlap. We have thresholded the rSPM{Z} maps, created for each subject, classifier, and
contrast, to match the false discovery rate of 0.1 (see Genovese et al., 2002, for a description of
this method of correcting spatial maps for multiple comparisons). The Jaccard overlap between
the two thresholded spatial maps X and Y is
J(X, Y) = |X ∩ Y| / |X ∪ Y| (8.3)
This overlap was computed for all possible pairings of subjects within an age group, for each
classifier and contrast. Results are displayed on Figure 8.8; the similarity of Jaccard overlap to
reproducibility (Figure 8.7) is evident. Similarity across subjects is more pronounced in the two
strong contrasts; in this case, there is more similarity in the older group compared with the young
group. In the weakest contrast (DM/PM), across-subject similarity is close to zero.
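Formula 8.3 applied to thresholded maps amounts to the following sketch (the function name is hypothetical, and the FDR-derived cutoff is assumed to be computed beforehand):

```python
import numpy as np

def jaccard_overlap(map_x, map_y, threshold):
    """Jaccard overlap (Formula 8.3) between two thresholded maps.

    map_x, map_y: 1-D arrays of Z-scores; threshold: the cutoff derived
    from the FDR correction (e.g. at FDR = 0.1). Voxels whose absolute
    Z-score reaches the threshold form the sets X and Y.
    """
    x = np.abs(map_x) >= threshold
    y = np.abs(map_y) >= threshold
    union = np.logical_or(x, y).sum()
    if union == 0:
        return 0.0
    return np.logical_and(x, y).sum() / union
```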
Figure 8.9. Correlation of average spatial maps across classifiers. Individual subject maps
created by each of the 6 classifiers have been averaged across all subjects from our study.
The next question is, how similar are spatial maps across methods? When we average the
normalized maps across subjects, the maps created by different classifiers tend to be quite
similar. Figure 8.9 plots Pearson's correlation between pairs of average maps for each pairing of
6 classifiers. The average has been taken across all subjects, young and older pooled together
(the same trends have been observed for group averages). For three contrasts (RT/FIX, DM/FIX
and DM/RT), we observe a large amount of consensus in our classifiers: the correlation values
are 0.84 and higher. For these contrasts, the weakest correlation is observed between pairings of
one multivariate and one univariate method. Consensus between pairs of multivariate methods is
higher: the smallest observed value is 0.95 for RT/FIX, 0.99 for DM/FIX, and 0.93 for DM/RT.
Between the two univariate methods, correlation of average maps is 1.0 in all observed contrasts.
Across-methods correlation in the weakest contrast (DM/PM) is smaller than in the other three
contrasts; it is at least 0.63. In all contrasts, highest correlation is observed in pairs (QD, LD-PC),
(LD-RR, SVM), (GNB-N, GNB-L).
In addition, we computed the Jaccard overlap of average maps created by different classifiers for
the same contrast. The within-subject rSPM{Z} maps were averaged across all subjects, and
thresholded at FDR≤0.1. Using Formula 8.3, we computed the Jaccard overlap between averaged
maps created by all possible pairs of classifiers for a given contrast. For the two weakest
contrasts (DM/RT and DM/PM), the Jaccard overlap was consistently zero (even with the
somewhat liberal choice of threshold of 0.1). The results for the two strongest contrasts are
presented in Figure 8.10. The strongest similarity is observed in classifier pairs (QD, LD-PC);
(LD-RR, SVM); (GNB-L, GNB-N). The similarity between the univariate and multivariate
classifiers is relatively low. Finally, the similarity between different multivariate classifiers is
stronger in the DM/FIX contrast than in the RT/FIX contrast. These results are consistent with
across-method reproducibility results shown in Figure 8.9.
Figure 8.10. Jaccard overlap of average spatial maps across classifiers, for 2 strong
contrasts (RT/FIX and DM/FIX).
Correlation and overlap of averaged maps do not consider the variability of maps across
subjects. To further investigate across-methods similarity, we have used DISTATIS, a variant of
multidimensional scaling which takes the across-subject variability into account. Full treatment
of DISTATIS can be found in publications by Abdi and colleagues (2005, 2009). The goal of
multidimensional scaling is to find a low-dimensional representation of high-dimensional data in
such a way that the distances between low-dimensional representations of any two data points
are good approximations to the distances between these points in the original high-dimensional
space (see Mardia et al., 1979, pages 394-409). For our purpose, the high-dimensional data are
the spatial maps, and we define the distance between ith and jth map as 1-ρij, where ρij is
Pearson's correlation coefficient of ith and jth maps. Distance matrix contains the distances
between all possible pairings of the data points. Multidimensional scaling finds a low-
dimensional representation from the eigendecomposition of the distance matrix. DISTATIS is a
generalization of this method to a set of distance matrices. This method combines the distance
matrices into a single compromise matrix, and projects the original distance matrices onto the
compromise matrix.
In order to apply DISTATIS, we compute within-subject distance matrices. Our pool of
classifiers consists of six methods, so we compute a 6×6 distance matrix for each of the 47
subjects. We double-center these matrices; let Si denote the doubly-centered distance matrix for
the ith subject. Then we compute the similarities between distance matrices for each pair of
subjects. The similarities are computed with an RV coefficient, which indicates how much
information is shared between two matrices (Abdi et al., 2009). We form the 47×47 matrix of RV
coefficients, and compute its first eigenvector p1. The ith coordinate of this eigenvector indicates
how similar the ith subject's distance matrix is to all other subjects' distance matrices. Then, the
compromise matrix S+ is formed as a weighted sum of doubly-centered distance matrices:
S+ = Σ_{i=1}^{#subjects} α_i S_i (8.4)
where the weight αi is the ith coordinate of p1, divided by the sum of all coordinates of p1. The
compromise matrix S+ is the best way (in the least-squares sense) to represent all 47 distance
matrices with a single matrix.
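The construction of the compromise matrix (Formula 8.4) can be sketched as follows; this is an illustrative reconstruction of the DISTATIS steps described above, with an assumed function name and array layout.

```python
import numpy as np

def distatis_compromise(dist_matrices):
    """Compromise matrix for DISTATIS (after Abdi and colleagues).

    dist_matrices: (n_subjects, k, k) array of within-subject distance
    matrices (here k = 6 classifiers; distance = 1 - Pearson correlation).
    """
    n, k, _ = dist_matrices.shape
    # Double-center each distance matrix: S_i = -0.5 * J D_i J, J = I - 1/k
    J = np.eye(k) - np.ones((k, k)) / k
    S = np.array([-0.5 * J @ D @ J for D in dist_matrices])
    # RV coefficients between every pair of centered matrices
    flat = S.reshape(n, -1)
    inner = flat @ flat.T
    norms = np.sqrt(np.diag(inner))
    rv = inner / np.outer(norms, norms)
    # First eigenvector of the RV matrix gives the subject weights
    vals, vecs = np.linalg.eigh(rv)
    p1 = np.abs(vecs[:, -1])  # eigh sorts ascending; take the largest
    alpha = p1 / p1.sum()
    # Weighted sum of the centered distance matrices (Formula 8.4)
    return np.tensordot(alpha, S, axes=1)
```

With identical distance matrices for all subjects, the weights are uniform and the compromise reduces to the doubly-centered version of that shared matrix, which is a convenient sanity check.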
A low-dimensional representation of S+ is computed from its eigendecomposition. For easy
visualization, it is common to use 2-dimensional representation (using the first two principal
components of S+). Figure 8.11 plots this 2-dimensional representation for each of the four
contrasts. The first and second principal components are represented by the horizontal and
vertical axes, respectively; on each axis, we specify the amount of variance explained by each
component. We project the centroids of the spatial maps created by each of the six methods onto
this coordinate space; the projections of centroids are marked with +. In order to compute the
confidence intervals around the centroids, we have drawn 1000 bootstrap samples from our set of
47 subjects. The confidence intervals are shown as ellipses around the centroids.
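The bootstrap behind the confidence ellipses can be sketched as follows. Only the resampling of subjects and recomputation of per-method centroids is shown, not the ellipse fitting; the array layout is an assumption.

```python
import numpy as np

def bootstrap_centroids(coords, n_boot=1000, rng=None):
    """Bootstrap distribution of per-method centroids in compromise space.

    coords: (n_subjects, n_methods, 2) array holding each subject's
    projection of each method onto the first two principal components of
    the compromise matrix (assumed layout). Subjects are resampled with
    replacement; returns a (n_boot, n_methods, 2) array of centroids.
    """
    rng = rng or np.random.default_rng()
    n = coords.shape[0]
    out = np.empty((n_boot,) + coords.shape[1:])
    for b in range(n_boot):
        sample = rng.integers(0, n, size=n)    # resample subjects
        out[b] = coords[sample].mean(axis=0)   # centroid per method
    return out
```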
Figure 8.11. DISTATIS plots of similarity of within-subject maps created with different
classifiers.
Figure 8.11 displays the familiar grouping of methods: (QD and LD-PC), (SVM and LD-RR),
and (GNB-L and GNB-N). This pairing is also observed in simulated data when we evaluate the
reproducibility and ROC properties of the algorithms (Figure 7.1, bottom row; Figure 7.2); it is
also observed in evaluation of within-subject reproducibility in the aging study (bottom rows of
Figures 8.5 and 8.6). Within each pair of methods, there is a similarity in computational models:
both QD and LD-PC use PCA-based regularization, both GNB-L and GNB-N are univariate
Gaussian Naive Bayes classifiers, and both SVM and LD-RR use an L2 penalty for regularization.
This pairwise similarity between methods is especially strong in the DM/FIX and RT/FIX
contrasts, where the corresponding ellipses overlap almost completely. In two weak contrasts
(DM/RT and DM/PM), the overlap in the (QD and LD-PC) and in the (SVM and LD-RR) pairs
of ellipses is reduced, although the two GNB ellipses fully overlap. Therefore, the strength of the
contrast influences the consensus between the multivariate methods that use the same
regularization scheme.
8.4.2 Group-level classification of spatial maps
In addition to reproducibility, we have evaluated another aspect of spatial maps: their ability to
capture age-related information. Grady et al. (2010), using the same data set, have reported the
differences in cortical recruitment between the young and the older groups, along with
behavioural differences in reaction time. These differences have been found with Partial Least
Squares (PLS) analysis pooling all the subjects within the age group. In this section, we have
studied how well the age-related difference is preserved in within-subject spatial maps created by
different classifiers. We frame this problem in terms of classification: given a within-subject
spatial map, can we reliably predict the age group of that subject? We have already shown that
classification accuracy, as well as across-subjects reproducibility, is different between the age
groups, at least for some classifiers and contrasts. We can therefore expect, at least in some
situations, a better-than-chance accuracy when we classify the individual maps according to the
age group.
We have used LD-PC and QD for this group-level classification. Within-subject maps have been
created for 6 classifiers and 4 contrasts, as before; multivariate classifiers have been regularized
with Δ-metric optimization. The maps have been converted to rSPM{Z} as described in the
previous section. To match the sample size from the two age groups, we have used all of our 19
young subjects, and have randomly selected 19 subjects from the older group. This procedure has
been performed 10 times. During each iteration, we have performed 1000 splits of the within-
subject maps into training, validation and test sets. The test set has been formed by randomly
selecting one young and one older subject; the same procedure has been used to form the
validation set (assuring that different subjects are selected for test and validation). The remaining
2*17=34 subjects have formed the training set. This splitting scheme maximizes the size of the
training set, which is advantageous when data are heterogeneous. For each subject, the accuracy
of classification has been computed as the frequency of correct prediction of age group for that
subject.
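One split of this scheme can be sketched as follows; the index bookkeeping (young subjects first, then the matched older subjects) is an assumption for illustration.

```python
import numpy as np

def split_subjects(n_per_group=19, rng=None):
    """One training/validation/test split of the group-level scheme.

    From 19 young and 19 matched older subjects, pick one subject per
    group for the test set, one per group for the validation set, and
    keep the remaining 2*17 = 34 subjects for training. Indices
    0..18 denote young subjects, 19..37 denote older subjects.
    """
    rng = rng or np.random.default_rng()
    young = np.arange(n_per_group)
    older = np.arange(n_per_group, 2 * n_per_group)
    test = np.array([rng.choice(young), rng.choice(older)])
    # Validation must not reuse the test subjects
    val = np.array([
        rng.choice(np.setdiff1d(young, test)),
        rng.choice(np.setdiff1d(older, test)),
    ])
    train = np.setdiff1d(np.arange(2 * n_per_group),
                         np.concatenate([test, val]))
    return train, val, test
```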
Figure 8.12. Accuracy of group-level classification of individual maps that have been
created by six different classifiers. Within-subject maps have been classified according to
the age group of the subject ("young" versus "older").
The accuracy of LD-PC is displayed on a box-and-whisker plot in Figure 8.12 (QD produces
very similar results). We can see that the age group could be predicted from individual maps
with better-than-chance accuracy, irrespective of which classifier has been used to create the map
(the only exceptions are QD and LD-RR in DM/PM contrast). In the two strongest contrasts,
RT/FIX and DM/FIX, the maps are easily classifiable; median classification accuracy is between
74% (for LD-PC maps) and 85% (for GNB-N maps). This suggests that the individual spatial
maps contain information that encodes age; all the classifiers from our pool are capable of
capturing this information. The maps for the DM/RT contrast are harder to classify; median
accuracy ranges from 57% for SVM maps to 70% for the GNB-N maps. Classification for the
weakest contrast, DM/PM, is less accurate but still above chance (except for QD and LD-RR
maps in some subjects).
This indicates that the individual maps indeed contain age-relevant information. To locate the
voxels that encode this information, we have created spatial maps for the LD-PC classification of
individual maps. For each contrast, six of these group-level maps have been made, one per
classifier from our pool. Maps have been computed according to Formula 3.17 from Chapter 3
(with our data set consisting of individual maps rather than fMRI volumes), and converted to
normalized rSPM{Z} patterns. We refer to these maps as "group-difference" maps, to clearly
distinguish them from the "group-average maps", that is, group-specific averages of individual
maps. On the group-difference map, voxel weights reflect sensitivity to age, that is, the
importance of this voxel in classifying the subjects’ maps according to the age of participants.
Positive values of a voxel indicate that it tends to have higher Z-scores in the older subjects’
individual maps; a negative value indicates the opposite (the voxel tends to be higher in the young
subjects’ individual maps).
Interpretation of the group-difference maps is not trivial. In our situation, the individual maps are
created for a binary classification problem (say, task A versus task B); a positive voxel value in
individual maps reflects preference for task A, and a negative value reflects preference for task
B. A positive voxel value in a group map indicates that, in this spatial location, the difference (A-
B) is higher in the older than in the young subjects. This could happen in a number of situations:
- older subjects are sensitive to A in this spatial location, and young subjects to B;
- in both groups, this spatial location is sensitive to A, but more so in the older subjects;
- in both groups, it is sensitive to B, but more so in the young subjects.
To distinguish between these three situations, individual maps need to be inspected along with
group-difference maps.
Using LD-PC, we have constructed 6 group-difference maps for each of the 4 contrasts (making
one group-level map for each bar in Figure 8.12). Group-difference maps have been normalized
to rSPM{Z} and thresholded at α=0.05 using false discovery rate correction for multiple
comparisons. Thresholded maps have been visually inspected to find brain areas that might
encode the age effect. Table 8.1 lists the areas that survived the correction for multiple
comparisons in a majority of group-difference maps for a given contrast19. No such areas have
been found for the DM/PM contrast because of lack of consensus among group-difference maps.
19. Since the maps for GNB-L and GNB-N classifiers were very similar, they counted as one vote. The areas listed in Table 8.1 were the areas found significant by at least 3 votes out of 5 (4 multivariate methods and one univariate).
This table also lists the preferred condition for each area and each age group. The preference is
determined as follows: if the "Task A versus Task B" contrast is expressed positively in this area
of the normalized group-average map (that is, the z-scores are positive), the preference is for
Task B; if the expression of the contrast is negative, the preference is for task A. The table also
lists the z-scores for each area and age group; first, we have identified location of most active
voxel in the group-difference map, and then we have obtained the z-score from this location in
the group-average LD-PC maps.
area preference in young preference in older
DM/FIX
precuneus FIX (2.2) DM (‐2.3)
L intraparietal sulcus DM (‐2.5) DM (‐5.4)
L dorsolateral PFC FIX (0.4) DM (‐3.3)
R intraparietal sulcus DM (‐2) FIX (0.1)
posterior cingulate FIX (3.5) FIX (1.6)
R sup. cerebellum DM (‐3.7) DM (‐6.6)
L sup. cerebellum DM (‐3.3) DM (‐5.6)
L primary motor DM (‐3.2) DM (‐5.6)
RT/FIX
precuneus FIX (1.5) RT (‐2.6)
L intraparietal sulcus RT (‐3.2) RT (‐6.5)
R intraparietal sulcus RT (‐2.8) RT (‐5.2)
L primary motor RT (‐2.4) RT (‐5.1)
R primary motor RT (‐0.8) RT (‐3.3)
L temporal pole RT (‐3.2) RT (‐2.6)
DM/RT
R intraparietal sulcus DM (‐0.7) RT (2.1)
L intraparietal sulcus RT (0.8) RT (2.2)
R primary motor RT (0.4) RT (2.4)
L primary motor RT (0.9) RT (3)
middle cingulate / SMA RT (0.7) DM (‐0.7)
L superior temporal RT (0.4) RT (2.4)
Table 8.1. Cortical areas that show sensitivity to age. Abbreviations: PFC, prefrontal
cortex; SMA, supplementary motor area.
The majority of the areas listed in Table 8.1 for the RT/FIX and DM/FIX contrasts are the
task-positive areas with stronger preference (that is, higher absolute z-scores) for the active task
(RT or DM) in the older group. In addition, a subregion of right intraparietal sulcus (a task-
positive area) has been found to have higher preference for DM in the younger group, and an
area located at the left temporal pole had a higher RT preference in the younger group. Posterior
cingulate cortex, the default-mode region, has a stronger preference for FIX in the young group
(however, it is below the FDR-corrected threshold in the RT/FIX contrast). Precuneus is
particularly interesting, because it has a strong FIX preference in the young subjects and strong
preference for the active task in the older subjects. This is the only area where we have found the
strong difference in preferred condition between the young and the older groups.
Overall, the difference between the age groups seems to be driven by the stronger preference for
the active task in the older subjects, and, perhaps, a stronger preference for FIX in the young
subjects. This is consistent with the hypothesis of compensation in the older individuals: to reach
the same level of behavioural performance as the young subjects, they recruit the task-positive
regions more prominently (see Grady, 2008, for the discussion of compensatory recruitment in
aging). The areas listed for the DM/RT contrast follow this trend: most of them are the task-
positive areas with stronger RT preference in the older group. The only exception is the region in
the middle cingulate / supplementary motor area, which shows RT preference in the young group
and DM preference in the older group; however, this preference is somewhat weak (|z|<1).
Figures 8.13 and 8.14 show some examples of group-difference maps with corresponding group-
average maps. The maps are constructed with LD-PC; most areas listed in Table 8.1 for the
DM/FIX and RT/FIX contrasts are displayed in Figures 8.13 and 8.14, respectively. In the group-
average maps, the cortical recruitment for the active tasks (RT and DM) is more spatially
extensive in the older group, which conforms with the hypothesis of compensatory recruitment.
In addition, the younger group shows more extensive recruitment of default-mode regions for
FIX (this is not universal: in Figure 8.13, the older group shows more extensive recruitment of the
ventromedial prefrontal region, a default-mode area). This is perhaps because the
active task is behaviourally harder for the older subjects, who showed longer response times
than the young subjects. The activity of the default-mode network is known to be
modulated by the behavioural difficulty of the active task (Grady et al., 2010); it can be higher in
young subjects because they find the active task easier to perform than the older
subjects do.
It is interesting to compare our findings with the results obtained by Grady and
colleagues (2010) on the same data set with a different method of multivariate analysis, partial
least squares (McIntosh et al., 1996). This method finds the latent variables (LVs) that best
explain the covariance between the fMRI data and the experimental design variables; for this
study, the experimental variables are the age group and the behavioural task. The most
significant LV (accounting for 24% of the covariance) captures the difference between the
fixation and the active tasks; the corresponding brain areas are members of the default mode and
the task-positive networks. The second most significant LV (accounting for 12.6% of the
covariance) captures the age effect, that is, the difference between the young and the older
groups. The significance of the latent variables was estimated with a permutation test, and the
remaining LVs were found to be insignificant.

Figure 8.13. Cortical areas affected by aging, as revealed by the RT/FIX contrast. The top
row shows the group-difference map, thresholded at p<0.05. The middle and bottom rows
show the unthresholded group-average maps for the young and the older group,
respectively.

Figure 8.14. Cortical areas affected by aging, as revealed by the DM/FIX contrast. The top
row shows the group-difference map, thresholded at p<0.05. The middle and bottom rows
show the unthresholded group-average maps for the young and the older group,
respectively.
The second latent variable reveals a list of areas that encode the age effect. It is important to
point out that these areas are strong sources of covariance only after the variance due to the first
LV has been factored out. Therefore, the areas where the fixation-versus-active-task effect is
stronger than the young-versus-older effect will generally not be captured by the second LV.
This is different from our group-level analysis, where we can identify the areas that show a
strong age effect regardless of the strength of the fixation/active task effect in these areas. In our
group-difference maps, the areas identified by the second LV encode the age effect to some
degree, but most of them do not survive the FDR correction for multiple comparisons. The only
area found to be significant in both studies is the precuneus; consistent with our
results, Grady et al. observed a difference in preferred condition between the age groups.
The majority of the areas where we found the age effect (see Table 8.1) are not captured by
the second LV, but they are captured by the first LV, potentially because the task effect in these
areas dominates the age effect. The conclusion of Grady and colleagues, that the older group
shows more extensive recruitment of the task-positive regions while the younger group shows more
extensive recruitment of the default-mode regions, is supported by our results.
Chapter 9 Conclusions and Future Research
9.1 Evaluation of classifiers for fMRI data
The evaluation of classifiers presented in the thesis shows that predictive multivariate Gaussian
methods (linear and quadratic discriminants) are useful and efficient tools for fMRI
classification. However, optimal performance of LD/QD requires careful regularization, because the
number of dimensions (voxels) in a typical fMRI data set far exceeds the number of observations
(volumes). If PCA is used to reduce the data dimensionality, performance of LD/QD is strongly
influenced by the number of PCs retained for analysis. As pointed out by LaConte et al. (2003),
selecting too few or too many components produces a model that does not describe the data
adequately. In some previous studies that evaluated LD-PC against SVM (Cox & Savoy, 2003;
Mourao-Miranda et al., 2005), all principal components were retained for analysis; as a result,
SVM greatly outperformed LD-PC, which sometimes performed at the level of random guessing.
However, LaConte et al. (2006) showed that, with careful selection of PC basis, the
performances of LD-PC and linear SVM are comparable (in that study, LD-PC used the number
of components that optimized classification accuracy, and the complexity of SVM was
controlled in the same way). Another example of poor performance of LD-PC as a result of
suboptimal choice of dimensionality is a study by Lukic et al. (2002), which introduced the
simple simulation algorithm for fMRI/PET data that we have adapted (we have modified it by
simulating the effect of hemodynamic response). In that study, 1/3 of the total number of PCs is
retained; this is a fixed number that depends only on the sample size, and is not influenced by
signal-to-noise ratio in the data. The resulting performance of LD-PC is never better than a
univariate t-test, and sometimes much worse.
In our evaluation, LD and QD were regularized by optimizing the Δ metric, which is a
combination of two performance metrics: accuracy of classification and reproducibility of spatial
maps. As a result, the classification accuracy of LD-PC was comparable to SVM, in simulated as
well as real data, with two exceptions: early/late contrast in the stroke data (Section 8.2), and, to
a lesser extent, RT/FIX contrast in the aging data (Section 8.3). This replicated the result
obtained by LaConte et al. (2006) on a different data set. In our simulations (Section 7.2.1), QD
was shown to be the most accurate classifier in the situation of strong heteroscedasticity, when
the following three conditions were satisfied:
a) the expected values of the class means were identical (M=0);
b) the within-class covariance matrices were sufficiently different (V>1);
c) the correlation between the nodes of the active network was large (ρ=0.99).
If the correlation between the active areas was zero, nonlinear GNB was the most accurate of the
classifiers we have tested. However, the usefulness of GNB-N classifier in real fMRI data is
questionable, because of the strong spatial correlation between (at least a subset of) voxels in the
brain. We never observed GNB-N significantly outperforming other classifiers in real data;
situations when it was the worst-ranking classifier (in terms of classification accuracy) were
observed in both stroke (finger/wrist contrast) and aging data sets (DM/RT and DM/PM
contrasts). On the other hand, QD was the most accurate classifier for the early/late and
healthy/impaired contrasts in the stroke set. Also, in our study of dimensionality in self-control,
QD was more accurate than LD in classifying spatial maps (71% versus 58%); one class was
much more heterogeneous than the other, and the classes were better discriminated with a
nonlinear decision surface. Our conclusion is that, in terms of classification accuracy, a
combination of LD-PC and QD is no worse than, and sometimes better than, linear SVM. Running a
combination of QD and LD-PC takes no longer than running either method alone,
because the computation is dominated by the PCA decomposition of the data; once that is done, LD-
PC and QD take very little time to compute. When the difference between classes is driven by
the difference in means, LD-PC is likely to be the better of the two methods; when it is driven by a
difference in connectivity, QD is likely to be the better model.
However, the real advantage of LD-PC and QD over LD-RR, SVM and GNB is the high
reproducibility of their within-subject spatial maps. This point is often ignored in comparative
evaluation studies of classifiers for fMRI, which frequently use classification accuracy as the
sole metric of performance (e.g., Cox & Savoy, 2003; Mourao-Miranda et al., 2005; Ku et al.,
2008; Misaki et al., 2010; Schmah et al., 2010). Our simulations (Section 7.2.2) show that LD-PC
and QD tend to produce within-subject maps that are (sometimes dramatically) more
reproducible than the maps produced by methods that do not use PCA regularization; this is
replicated in the aging data set (Section 8.3). This is not altogether surprising, because the highest-
ranking principal components tend to extract strongly reproducible spatial patterns. Therefore,
LD-PC and QD are particularly useful tools for within-subject analysis in neuroscience, where
obtaining a reproducible spatial map is no less important than accurate classification.
Another advantage of PC-based classifiers is the additional information that is obtained by
determining the optimal PC subspace. The size of this subspace is the number of orthogonal
dimensions in a model that gives a "good" description of our data20. Chapters 4 and 5 present a
survey of optimization metrics that can be used as quantitative measures of how "good" the
model is at describing the data. We conclude that the most useful metrics are: reproducibility of
LD/QD spatial maps, accuracy of LD/QD classification (which is, however, a less robust metric
compared to reproducibility), and generalization of PPCA model. All three metrics are measured
on an independent test set using NPAIRS resampling scheme.
In Chapter 6, we show that intrinsic PC dimensionality can have a neurobiological significance:
we demonstrate that it relates to behavioural measures of post-stroke recovery of motor function
(Yourganov et al., 2010; see Section 6.1), as well as the strength of self control (Berman et al.,
2013; see Section 6.2). Also, preliminary results of Schmah and colleagues show that intrinsic
dimensionality is a predictor of post-stroke recovery of speech after the intensive course of
speech therapy (the abstract has been submitted for presentation at Rotman annual neuroscience
conference, March 2013).
9.2 Directions for future research
9.2.1 Simulations of multiple networks
The simulation framework that we use for evaluation is somewhat simplistic: the active areas are
either organized into a single network, or uncorrelated. This is convenient for the purpose of
evaluation, because the active network is controlled by only three parameters (M, V and ρ); the
influence of these parameters on the classifier performance and on dimensionality estimation can
20 The regularization parameter in LD-RR can be transformed into effective degrees of freedom (Kustra & Strother, 2001). We have not investigated the correspondence between effective d.o.f. and estimated dimensionality (which defines the number of degrees of freedom in the LD/QD model).
be explored thoroughly. However, it is somewhat unrealistic: the brain is organized into a set of
networks, not into one network (Toro et al., 2008). An obvious direction for future research is to
modify the simulation framework so the active areas are coupled into two or more networks.
This would increase the number of parameters of simulated signal; however, the gain in
neurobiological relevance would be more important than the growth of parameter space.
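A two-network version of the simulated signal could look like the following sketch (numpy; the node counts, the common mean M=1, and the two within-network correlations are illustrative assumptions): each network gets a compound-symmetry covariance block, and the blocks are left mutually uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(2)

def network_cov(n_nodes, rho, var=1.0):
    """Compound-symmetry covariance: every pair of nodes within the
    network shares the same correlation rho."""
    C = np.full((n_nodes, n_nodes), rho * var)
    np.fill_diagonal(C, var)
    return C

# Two independent active networks with different internal correlations.
n_a, n_b = 4, 4
cov = np.zeros((n_a + n_b, n_a + n_b))
cov[:n_a, :n_a] = network_cov(n_a, rho=0.9)
cov[n_a:, n_a:] = network_cov(n_b, rho=0.4)

# Mean activation of 1.0 on every active node (the M parameter).
signal = rng.multivariate_normal(np.full(n_a + n_b, 1.0), cov, size=200)

# Check the simulated structure: strong within-network correlation,
# near-zero correlation between the two networks.
emp = np.corrcoef(signal, rowvar=False)
within_a = emp[:n_a, :n_a][np.triu_indices(n_a, 1)].mean()
within_b = emp[n_a:, n_a:][np.triu_indices(n_b, 1)].mean()
between = np.abs(emp[:n_a, n_a:]).mean()
```

Adding hemodynamic convolution and background noise, as in the single-network framework, would then complete the simulated fMRI volumes.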
In particular, these simulations would be useful for studying generalization-error optimization.
For single-network simulation, generalization error is optimized when 1 principal component is
used (regardless of the correlation of the network). In real data from the stroke study, we have
observed that the optimum number of PCs is always greater than one, and, more importantly, this
number correlates with behavioural post-stroke recovery. This suggests that this measure could
be sensitive to the post-stroke reorganization of cortical networks; to test this hypothesis,
multiple-network simulations are needed.
A more complex (and considerably more realistic) framework for simulating fMRI data is
provided by the Virtual Brain project21 (see Jirsa et al., 2010). This framework models the brain
as a dynamical system of interconnected nodes, each node being an ensemble of neurons. The
information about the connectivity between the nodes is taken from the database of anatomical
connections in the macaque cortex. This framework can simulate neuronal activity and the
corresponding BOLD signal; in addition, it can be used to study the impact of brain lesions.
9.2.2 Dimensionality in the aging study set
The data set from the aging study is a natural candidate for studying the relationship between
dimensionality and behaviour. We have shown that PC dimensionality is coupled with post-stroke
recovery (Section 6.1) and with self-control ability (Section 6.2). The aging study data
can be used to answer another question: is intrinsic dimensionality affected by age of the
participant? The original study had three age groups (20-31 years; 56-65 years; 66-85 years); for
the work presented in this thesis, the two older groups were pooled together. It would be
interesting to look at potential difference in dimensionality between these two older groups, to
21 http://thevirtualbrain.org/
see whether it reflects the accelerating differences in brain structure after the age of 65 (Grady et
al., 2010).
This study can also be used to look at the intrinsic dimensionality for different cognitive tasks.
The behavioural tasks range from simple (reaction) to more complex (perceptual and delayed
matching); however, they all require a motor response and use the same visual stimuli. This can
be used to study the link between dimensionality and cognitive load.
9.2.3 Modifications to LD and QD; additional performance metrics
We have not investigated some potentially interesting modifications to our classifiers. In
particular, our implementation of QD uses the same number of PCs for both classes. QD models
the two within-class covariance matrices separately, rather than pooling them together the way it
is done in LD. This gives QD the advantage over LD when the two covariance matrices are
indeed significantly different. However, our current implementation of QD does not consider the
possibility that the intrinsic dimensionality of the two classes might be different, and, therefore,
the two within-class covariance matrices might be better approximated using different numbers of
PCs. Fitting the number of PCs separately for each class is an optimization on a 2D grid; it
should be no more computationally intensive than our current implementation because
computation of QD is dominated by singular value decomposition of the data matrix. After SVD
is computed, the grid optimization should be almost as fast as our current one-dimensional
optimization. For a data matrix composed of J voxels and N volumes, one-dimensional
optimization has the computational complexity of O(N), and grid optimization has the
complexity of O(N²); since J>>N, both optimization methods are much faster than SVD, which
has computational complexity of O(JN²).
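The per-class dimensionality proposed here can be prototyped directly. The sketch below uses toy data with equal class means but different covariances; the PPCA-style isotropic floor over the discarded eigenvalues and the candidate k values are assumptions, not the thesis's implementation. It fits a separate Gaussian per class and grid-searches the pair (k0, k1) on a held-out half:

```python
import numpy as np

rng = np.random.default_rng(3)

# Equal means, different covariances: class 1 has a correlated,
# higher-variance block of 10 "voxels" (a setting that favours QD).
n, p = 200, 30
block = np.full((10, 10), 0.9 * 3.0)
np.fill_diagonal(block, 3.0)
cov1 = np.eye(p)
cov1[:10, :10] = block
X = np.vstack([rng.standard_normal((n, p)),
               rng.multivariate_normal(np.zeros(p), cov1, size=n)])
y = np.repeat([0, 1], n)

def fit_class(Xc, k):
    """Gaussian with a rank-k covariance model: keep the top-k eigenpairs,
    average the remaining eigenvalues into an isotropic floor."""
    mu = Xc.mean(axis=0)
    lam, V = np.linalg.eigh(np.cov(Xc, rowvar=False))
    lam, V = lam[::-1], V[:, ::-1]               # descending order
    sigma2 = max(lam[k:].mean(), 1e-6)
    return mu, V[:, :k], lam[:k], sigma2

def log_density(Xt, model, p):
    mu, Vk, lam_k, sigma2 = model
    D = Xt - mu
    Z = D @ Vk                                    # retained-subspace coordinates
    resid = (D ** 2).sum(axis=1) - (Z ** 2).sum(axis=1)
    logdet = np.log(lam_k).sum() + (p - len(lam_k)) * np.log(sigma2)
    return -0.5 * (logdet + (Z ** 2 / lam_k).sum(axis=1) + resid / sigma2)

# Train on one half, grid-search (k0, k1) on the other half.
tr = np.r_[0:100, 200:300]
te = np.r_[100:200, 300:400]
best_acc, best_k = 0.0, None
for k0 in (1, 5, 10, 15):
    for k1 in (1, 5, 10, 15):
        m0 = fit_class(X[tr][y[tr] == 0], k0)
        m1 = fit_class(X[tr][y[tr] == 1], k1)
        pred = (log_density(X[te], m1, p) > log_density(X[te], m0, p))
        acc = (pred.astype(int) == y[te]).mean()
        if acc > best_acc:
            best_acc, best_k = acc, (k0, k1)
```

As in the complexity argument above, once the per-class eigendecompositions are computed, scanning the (k0, k1) grid only re-weights eigenpairs, so the decomposition dominates the cost.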
It is also possible to combine LD and QD into a single classifier, as proposed by Friedman
(1989), who calls this approach "regularized discriminant analysis". QD uses class-specific
covariance matrices, and LD pools them together; regularized discriminant analysis uses a linear
combination of class-specific and pooled covariance matrices. The linear weights must be tuned
for each class separately, which involves a grid optimization. For this approach, we do not expect
a large increase in computational intensity, again because computation is dominated by SVD.
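The covariance interpolation at the heart of this approach is simple to write down. The sketch below shows a simplified form of Friedman's blend (his per-class tuning of the mixing weight and the additional shrinkage toward a scaled identity are omitted; the toy data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
X0 = rng.standard_normal((60, 5))            # class 0: unit variance
X1 = 2.0 * rng.standard_normal((60, 5))      # class 1: inflated variance

S0, S1 = np.cov(X0, rowvar=False), np.cov(X1, rowvar=False)
n0, n1 = len(X0), len(X1)
S_pooled = ((n0 - 1) * S0 + (n1 - 1) * S1) / (n0 + n1 - 2)

def rda_cov(S_class, S_pooled, alpha):
    """Regularized discriminant analysis covariance: alpha = 0 recovers the
    pooled covariance (LD), alpha = 1 the class-specific one (QD)."""
    return alpha * S_class + (1.0 - alpha) * S_pooled
```

Intermediate values of alpha interpolate between the two models, which is what the grid optimization over per-class weights would tune.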
In our evaluation, we use two metrics of performance: classification accuracy and
reproducibility. It would be interesting to see whether our results are replicated for some
additional metrics of spatial pattern similarity, such as mutual information (Bell & Sejnowski,
1995; Afshinpour et al., 2011). For probabilistic classifiers, we could look at prediction accuracy
(which is subtly different from classification accuracy; see Section 2.1.3) as a metric of
performance as well as a cost function for regularization tuning. Also, in the aging study data, we
have looked at across-subject reproducibility, but not at across-subject classification; that is, how
well we can classify one subject's volumes using a model trained on another
subject, which would be the natural implementation of the more common group analysis of
combined subjects.
References
Abdi, H., O'Toole, A. J., Valentin, D., & Edelman, B. (2005). DISTATIS: The analysis of multiple distance matrices. In Proceedings of the IEEE Computer Society: International Conference on Computer Vision and Pattern Recognition (42-47).
Abdi, H., Dunlop, J. P., & Williams, L. J. (2009). How to compute reliability estimates and display confidence and tolerance intervals for pattern classifiers using the Bootstrap and 3-way multidimensional scaling (DISTATIS). NeuroImage, 45(1), 89-95.
Abdi, H. (2010). The Greenhouse-Geisser Correction. Encyclopedia of Research Design. Thousand Oaks: Sage.
Abou-Elseoud, A., Starck, T., Remes, J., Nikkinen, J., Tervonen, O., & Kiviniemi, V. (2010). The effect of model order selection in group PICA. Human Brain Mapping, 31(8), 1207-1216.
Afshin-Pour, B., Soltanian-Zadeh, H., Hossein-Zadeh, G. A., Grady, C. L., & Strother, S. C. (2010). A mutual information-based metric for evaluation of fMRI data-processing approaches. Human Brain Mapping, 32(5), 699-715.
Aguirre, G. K., Zarahn, E., & D'esposito, M. (1998). The variability of human, BOLD hemodynamic responses. NeuroImage, 8(4), 360-369.
Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6), 716-723.
Attwell, D., Buchan, A. M., Charpak, S., Lauritzen, M., Macvicar, B. A., & Newman, E. A. (2010). Glial and neuronal control of brain blood flow. Nature, 468, 232-243.
Bandettini, P. (2007). Functional MRI today. International Journal of Psychophysiology, 63(2), 138-145.
Beckmann, C. F., & Smith, S. M. (2004). Probabilistic independent component analysis for functional magnetic resonance imaging. IEEE Transactions on Medical Imaging, 23, 137-152.
Beckmann, C. F., DeLuca, M., Devlin, J. T., & Smith, S. M. (2005). Investigations into resting-state connectivity using independent component analysis. Philosophical Transactions of the Royal Society B: Biological Sciences, 360(1457), 1001-1013.
Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural computation, 7(6), 1129-1159.
Berman, M., Yourganov, G., Askren, M. K., Ayduk, O., Casey, B. J., Gotlib, I. H., Kross, E., McIntosh, A.R., Strother, S.C., Wilson, N.L., Zayas, V., Mischel, W., Shoda, Y., & Jonides, J. (2013). Dimensionality of brain networks linked to life-long individual differences in self-control. Nature Communications, Article # 1373 (doi:10.1038/ncomms2374)
Biehl, M., & Mietzner, A. (1994). Statistical mechanics of unsupervised structure recognition. Journal of Physics A: Mathematical and General, 27, 1885-1897.
Bishop, C. M. (2006). Pattern Recognition and Machine Learning (Information Science and Statistics). Secaucus, NJ, USA: Springer-Verlag New York, Inc.
Bodurka, J., Ye, F., Petridou, N., Murphy, K., & Bandettini, P. A. (2007). Mapping the MRI voxel volume in which thermal noise matches physiological noise-implications for fMRI. NeuroImage, 34(2), 542-549.
Buckner, R. L., Bandettini, P. A., O'Craven, K. M., Savoy, R. L., Petersen, S. E., Raichle, M. E., et al. (1996). Detection of cortical activation during averaged single trials of a cognitive task using functional magnetic resonance imaging. Proceedings of National Academy of Science U.S.A., 93, 14878-14883.
Burock, M. A., Buckner, R. L., Woldorff, M. G., Rosen, B. R., & Dale, A. M. (1998). Randomized event-related experimental designs allow for extremely rapid presentation rates using functional MRI. Neuroreport, 9, 3735-3739.
Buxton, R. B. (2009). Introduction to functional magnetic resonance imaging: principles and techniques. Cambridge University Press.
Calhoun, V. D., Adali, T., Pearlson, G. D., & Pekar, J. J. (2001). A method for making group inferences from functional MRI data using independent component analysis. Human Brain Mapping, 14(3), 140-151.
Chang, C.-C., & Lin, C.-J. (2011). LIBSVM : a library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3), Article # 27.
Churchill, N. W., Oder, A., Abdi, H., Tam, F., Lee, W., Thomas, C., et al. (2012A). Optimizing preprocessing and analysis pipelines for single-subject fMRI. 1. Standard temporal motion and physiological noise correction methods. Human Brain Mapping, 33(3), 609-627.
Churchill, N. W., Yourganov, G., Oder, A., Tam, F., Graham, S. J., & Strother, S. C. (2012B). Optimizing preprocessing and analysis pipelines for single-subject fMRI. 2. Interactions with ICA, PCA, Task Contrast and Inter-Subject Heterogeneity. PLoS One, 7(2).
Conover, W. J. (1999). Practical nonparametric statistics (3rd ed.). New York, NY: Wiley.
Constable, R. T., Skudlarski, P., & Gore, J. C. (1995). An ROC approach for evaluating functional brain MR imaging and postprocessing protocols. Magnetic Resonance in Medicine, 34, 57-64.
Cooper, P. W. (1963). Statistical classification with quadratic forms. Biometrika, 50(3-4), 439-448.
Cordes, D., Haughton, V., Carew, J. D., Arfanakis, K., & Maravilla, K. (2002). Hierarchical clustering to measure connectivity in fMRI resting-state data. Magnetic Resonance Imaging, 20(4), 305-317.
Cordes, D., & Nandy, R. R. (2006). Estimation of the intrinsic dimensionality of fMRI data. NeuroImage, 29, 145-154.
Cortes, C., & Vapnik, V. N. (1995). Support-Vector Networks. Machine Learning, 20(3), 273-297.
Cox, D. D., & Savoy, R. L. (2003). Functional magnetic resonance imaging (fMRI) "brain reading": detecting and classifying distributed patterns of fMRI activity in human visual cortex. NeuroImage, 19, 261-270.
Cox, R. W. (1996). AFNI: software for analysis and visualization of functional magnetic resonance neuroimages. Computers and Biomedical Research, 29, 162-173.
Cox, R. W., & Jesmanowicz, A. (1999). Real-time 3D image registration for functional MRI. Magnetic Resonance in Medicine, 42(4), 1014-1018.
Culham, J. C. (2006). Functional Neuroimaging: Experimental Design and Analysis. Handbook of Functional Neuroimaging of Cognition (pp. 53-82). Cambridge MA: MIT Press.
Dagli, M. S., Ingeholm, J. E., & Haxby, J. V. (1999). Localization of cardiac-induced signal change in fMRI. NeuroImage, 9, 407-415.
Deco, G., Jirsa, V. K., Robinson, P. A., Breakspear, M., & Friston, K. (2008). The Dynamic Brain: From Spiking Neurons to Neural Masses and Cortical Fields. PLoS Computational Biology, 4(8).
Deco, G., Jirsa, V. K., & McIntosh, A. R. (2011). Emerging concepts for the dynamical organization of resting-state activity in the brain. Nature Reviews Neuroscience 12, 43-56.
Demsar, J. (2006). Statistical Comparisons of Classifiers over Multiple Data Sets. Journal of machine learning research, 7, 1-30.
Eastment, H. T., & Krzanowski, W. J. (1982). Cross-Validatory Choice of the Number of Components from a Principal Component Analysis. Technometrics, 24(1).
Edelstein, W. A., Glover, G. H., Hardy, C. J., & Redington, R. W. (1986). The intrinsic signal-to-noise ratio in NMR imaging. Magnetic Resonance in Medicine, 3(4), 604-618.
Efron, B., & Tibshirani, R. (1993). Introduction to the Bootstrap: Academic Press, San Diego.
Fox, P. T., Mintun, M. A., Reiman, E. M., & Raichle, M. E. (1988). Enhanced detection of focal brain responses using intersubject averaging and change-distribution analysis of subtracted PET images. Journal of Cerebral Blood Flow & Metabolism, 8(5), 642-653.
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, 179-188.
Foerster, B. U., Tomasi, D., & Caparelli, E. C. (2005). Magnetic field shift due to mechanical vibration in functional magnetic resonance imaging. Magnetic Resonance in Medicine, 54(5), 1261-1267.
Fox, P. T., Raichle, M. E., Mintun, M. A., & Dence, C. (1988). Nonoxidative glucose consumption during focal physiologic neural activity. Science, 241, 462-464.
Fox, M. D., Snyder, A. Z., Vincent, J. L., Corbetta, M., Van Essen, D. C., & Raichle, M. E. (2005). The human brain is intrinsically organized into dynamic, anticorrelated functional networks. Proceedings of the National Academy of Sciences, 102(27), 9673-9678.
Freund, J. E. (1992). Mathematical Statistics: Prentice Hall, New Jersey
Friedman, J. H. (1989). Regularized Discriminant Analysis. Journal of the American Statistical Association, 84(405), 165-175.
Friston, K. J., Frith, C. D., Liddle, P. F., & Frackowiak, R. S. (1991). Comparing functional (PET) images: the assessment of significant change. Journal of Cerebral Blood Flow and Metabolism, 11, 690-699.
Friston, K. J., Frith, C., Turner, R., & Frackowiak, R. S. J. (1995A). Characterizing evoked hemodynamics with fMRI. NeuroImage, 2, 157-165.
Friston, K. J., Holmes, A. P., Worsley, K. J., Poline, J. B., Frith, C., & Frackowiak, R. S. J. (1995B). Statistical Parametric Maps in Functional Imaging: A General Linear Approach. Human Brain Mapping, 2, 189-210.
Friston, K. J., Poline, J. B., Holmes, A. P., Frith, C. D., & Frackowiak, R. S. (1996). A multivariate analysis of PET activation studies. Human Brain Mapping, 4(2), 140-151.
Friston, K. J., Fletcher, P., Josephs, O., Holmes, A., Rugg, M. D., & Turner, R. (1998). Event-Related fMRI: Characterizing Differential Responses. NeuroImage, 7, 30-40.
Friston, K., Phillips, J., Chawla, D., & Buchel, C. (2000). Nonlinear PCA: characterizing interactions between modes of brain activity. Philosophical Transactions of the Royal Society of London, B., Biological Sciences, 355, 135-146.
Garrett, D. D., Kovacevic, N., McIntosh, A. R., & Grady, C. L. (2012). The modulation of BOLD variability between cognitive states varies by age and processing speed. Cerebral Cortex [doi: 10.1093/cercor/bhs055].
Genovese, C. R., Lazar, N. A., & Nichols, T. (2002). Thresholding of statistical maps in functional neuroimaging using the false discovery rate. NeuroImage, 15(4), 870-878.
Glover, G. H. (1999). Deconvolution of impulse response in event-related BOLD fMRI. NeuroImage, 9, 416-429.
Glover, G. H., Li, T. Q., & Ress, D. (2000). Image-based method for retrospective correction of physiological motion effects in fMRI: RETROICOR. Magnetic Resonance in Medicine, 44, 162-167.
Grabowski, T. J., Frank, R. J., Brown, C. K., Damasio, H., Ponto, L. L., Watkins, G. L., et al. (1996). Reliability of PET activation across statistical methods, subject groups, and sample sizes. Human Brain Mapping, 4, 23-46.
Grady, C. L., Springer, M. V., Hongwanishkul, D., Mcintosh, A. R., & Winocur, G. (2006). Age-related changes in brain activity across the adult lifespan. Journal of Cognitive Neuroscience, 18(2), 227-241.
Grady, C. L. (2008). Compensatory reorganization of brain networks in older adults. In W. Jagust & M. D'Esposito (Eds.), Imaging the Aging Brain (pp. 105-114). New York, NY: Oxford University Press.
Grady, C. L., Protzner, A. B., Kovacevic, N., Strother, S. C., Afshin-Pour, B., Wojtowicz, M., Anderson, J. A. E., Churchill, N., & McIntosh, A. R. (2010). A multivariate analysis of age-related differences in default mode and task-positive networks across multiple cognitive domains. Cerebral Cortex, 20, 1432-1447.
Greicius, M. D., Krasnow, B., Reiss, A. L., & Menon, V. (2003). Functional connectivity in the resting brain: a network analysis of the default mode hypothesis. Proceedings of the National Academy of Sciences, 100(1), 253-258.
Hansen, L. K., & Larsen, J. (1996). Unsupervised Learning and Generalization, Proceedings of the IEEE International Conference on Neural Networks 1996, Washington DC (pp. 25-30).
Hansen, L. K., Larsen, J., Nielsen, F. A., Strother, S. C., Rostrup, E., Savoy, R., Lange, N., Sidtis, J., Svarer, C., & Paulson, O. B. (1999). Generalizable Patterns in Neuroimaging: How Many Principal Components? NeuroImage, 9, 534-544.
Harel, N., Ugurbil, K., Uludag, K., & Yacoub, E. (2006). Frontiers of brain mapping using MRI. Journal of Magnetic Resonance Imaging, 23(6), 945-957.
Harrison, R. V., Harel, N., Panesar, J., & Mount, R. J. (2002). Blood capillary distribution correlates with hemodynamic-based functional imaging in cerebral cortex. Cerebral Cortex, 12, 225-233.
Hastie, T., Tibshirani, R., & Friedman, J. H. (2009). The elements of statistical learning: data mining, inference, and prediction: Springer.
Haxby, J. V., Gobbini, M. I., Furey, M. L., Ishai, A., Schouten, J. L., & Pietrini, P. (2001). Distributed and Overlapping Representations of Faces and Objects in Ventral Temporal Cortex. Science, 293, 2425-2430.
Haynes, J. D., & Rees, G. (2006). Decoding mental states from brain activity in humans. Nature Revues Neuroscience, 7, 523-534.
Hlinka, J., Palus, M., Vejmelka, M., Mantini, D., & Corbetta, M. (2011). Functional connectivity in resting-state fMRI: is linear correlation sufficient? NeuroImage, 54(3), 2218-2225.
Hodges, J. L. (1955). Discriminatory Analysis I. Survey: USAF School of Aviation Medicine, Randolph AFB, Texas.
Hoyle, D. C., & Rattray, M. (2004). Principal-component-analysis eigenvalue spectra from data with symmetry-breaking structure. Physical Review E, 69, 026124.
Hoyle, D. C., & Rattray, M. (2007). Statistical mechanics of learning multiple orthogonal signals: asymptotic theory and fluctuation effects. Physical Review E, 75, 016101.
Huettel, S. A., Song, A. W., & McCarthy, G. (2004). Functional Magnetic Resonance Imaging: Sinauer Associates.
Jenkinson, M., Bannister, P. R., Brady, J. M., & Smith, S. M. (2002). Improved optimisation for the robust and accurate linear registration and motion correction of brain images. NeuroImage, 17(2), 825-841.
Jezzard, P., & Clare, S. (1999). Sources of distortion in functional MRI data. Human Brain Mapping, 8, 80-85.
Jirsa, V., Sporns, O., Breakspear, M., Deco, G., & McIntosh, A. R. (2010). Towards the virtual brain: network modeling of the intact and the damaged brain. Archives italiennes de biologie, 148(3), 189-205.
Kamitani, Y., & Tong, F. (2005). Decoding the visual and subjective contents of the human brain. Nature Neuroscience, 8, 679-685.
Kiebel, S. & Holmes, A. P. (2003). The general linear model. Human Brain Function, 2, 725-760.
Kjems, U., Hansen, L. K., Anderson, J., Frutiger, S., Muley, S., Sidtis, J., Rottenberg, D., & Strother, S. C. (2002). The quantitative evaluation of functional neuroimaging experiments: mutual information learning curves. NeuroImage, 15, 772-786.
Kleinschmidt, A. (2007) Different analysis solutions for different spatial resolutions? Moving towards a mesoscopic mapping of functional architecture in the human brain. NeuroImage, 38, 663–665
Kriegeskorte, N., & Bandettini, P. (2007). Analyzing for information, not activation, to exploit high-resolution fMRI. NeuroImage, 38, 649-662.
Krishnan, A., Williams, L. J., McIntosh, A. R., & Abdi, H. (2011). Partial Least Squares (PLS) Methods for Neuroimaging: a Tutorial and Review. NeuroImage, 56, 455-475.
Kruger, G., & Glover, G. H. (2001). Physiological noise in oxygenation-sensitive magnetic resonance imaging. Magnetic Resonance in Medicine, 46, 631-637.
Krzanowski, W. J., & Kline, P. (1995). Cross-validation for choosing the number of important components in principal component analysis. Multivariate Behavioral Research, 30(2), 149-165.
Ku, S.-P., Gretton, A., Macke, J., & Logothetis, N. K. (2008). Comparison of pattern recognition methods in classifying high-resolution BOLD signals obtained at high magnetic field in monkeys. Magnetic Resonance Imaging, 26, 1007-1014.
Kustra, R., & Strother, S. (2001). Penalized discriminant analysis of [15O]-water PET brain images with prediction error selection of smoothness and regularization hyperparameters. IEEE Transactions on Medical Imaging, 20, 376-387.
Kwong, K. K., Belliveau, J. W., Chesler, D. A., Goldberg, I. E., Weisskoff, R. M., Poncelet, B. P., et al. (1992). Dynamic magnetic resonance imaging of human brain activity during primary sensory stimulation. Proceedings of the National Academy of Sciences, 89, 5675-5679.
LaConte, S., Anderson, J., Muley, S., Ashe, J., Frutiger, S., Rehm, K., et al. (2003). The evaluation of preprocessing choices in single-subject BOLD fMRI using NPAIRS performance metrics. NeuroImage, 18, 10-27.
LaConte, S., Strother, S., Cherkassky, V., Anderson, J., & Hu, X. (2005). Support vector machines for temporal classification of block design fMRI data. NeuroImage, 26, 317-329.
Lange, N., Strother, S. C., Anderson, J. R., Nielsen, F. Å., Holmes, A. P., Kolenda, T., Savoy, R., & Hansen, L. K. (1999). Plurality and resemblance in fMRI data analysis. NeuroImage, 10, 282-303.
Le, T. H., & Hu, X. (1997). Methods for assessing accuracy and reliability in functional MRI. NMR in Biomedicine, 10(4-5), 160-164.
Li, Y.-O., Adali, T., & Calhoun, V. D. (2007). Estimating the number of independent components for functional magnetic resonance imaging data. Human Brain Mapping, 28(11), 1251-1266.
Logothetis, N. K., Pauls, J., Augath, M., Trinath, T., & Oeltermann, A. (2001). Neurophysiological investigation of the basis of the fMRI signal. Nature, 412, 150-157.
Lukic, A. S., Wernick, M. N., & Strother, S. C. (2002). An evaluation of methods for detecting brain activations from functional neuroimages. Artificial Intelligence in Medicine, 25, 69-88.
Lund, T. E., Madsen, K. H., Sidaros, K., Luo, W. L., & Nichols, T. E. (2006). Non-white noise in fMRI: does modelling have an impact? NeuroImage, 29, 54-66.
Lusted, L. B. (1968). Introduction to Medical Decision Making: Charles C. Thomas, Springfield, IL.
Maitra, R. (2010). A re-defined and generalized percent-overlap-of-activation measure for studies of fMRI reproducibility and its use in identifying outlier activation maps. NeuroImage, 50(1), 124-135.
Mardia, K. V., Kent, J. T., & Bibby, J. M. (1979). Multivariate analysis: Academic Press.
Mason, S. J., & Graham, N. E. (2002). Areas beneath the relative operating characteristics (ROC) and relative operating levels (ROL) curves: Statistical significance and interpretation. Quarterly Journal of the Royal Meteorological Society, 128, 2145-2166.
Mattay, V. S., Fera, F., Tessitore, A., Hariri, A. R., Berman, K. F., Das, S., Meyer-Lindenberg, A., Goldberg, T. E., Callicott, J. H., & Weinberger, D. R. (2006). Neurophysiological correlates of age-related changes in working memory capacity. Neuroscience letters, 392(1-2), 32-37.
McIntosh, A. R., Bookstein, F. L., Haxby, J. V., & Grady, C. L. (1996). Spatial pattern analysis of functional brain images using partial least squares. NeuroImage, 3, 143-157.
McIntosh, A. R., & Lobaugh, N. J. (2004). Partial least squares analysis of neuroimaging data: applications and advances. NeuroImage, 23 Suppl 1, S250-263.
McIntosh, A. R., Kovacevic, N., & Itier, R. J. (2008). Increased Brain Signal Variability Accompanies Lower Behavioral Variability in Development. PLoS Computational Biology, 4(7), e1000106.
McIntosh, A. R., Kovacevic, N., Lippe, S., Garrett, D., Grady, C., & Jirsa, V. (2010). The development of a noisy brain. Archives Italiennes de Biologie, 148(3), 323-337.
Meinshausen, N., & Bühlmann, P. (2010). Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4), 417-473.
Metz, C. E. (1986). ROC Methodology in Radiologic Imaging. Investigative Radiology, 21, 720-733.
Metz, C. E., Herman, B. A., & Shen, J.-H. (1998). Maximum Likelihood Estimation of Receiver Operating Characteristic (ROC) Curves from Continuously-Distributed Data. Statistics in Medicine, 17, 1033-1053.
Minka, T. P. (2000). Automatic choice of dimensionality for PCA: MIT Media Laboratory Perceptual Computing Section.
Minsky, M., & Papert, S. (1969). Perceptrons. Oxford, England: M.I.T. Press.
Misaki, M., Kim, Y., Bandettini, P. A., & Kriegeskorte, N. (2010). Comparison of multivariate classifiers and response normalizations for pattern-information fMRI. NeuroImage, 53(1), 103-118.
Mischel, W., Ayduk, O., Berman, M. G., Casey, B. J., Gotlib, I. H., Jonides, J., Kross, E., Teslovich, T., Wilson, N. L., Zayas, V., & Shoda, Y. (2011). 'Willpower' over the life span: decomposing self-regulation. Social Cognitive and Affective Neuroscience, 6(2), 252-256.
Mitchell, T. M., Hutchinson, R., Niculescu, R. S., Pereira, F., Wang, X., Just, M., & Newman, S. (2004). Learning to Decode Cognitive States from Brain Images. Machine Learning, 57(1-2), 145-175.
Morch, N., Hansen, L. K., Strother, S. C., Svarer, C., Rottenberg, D. A., Lautrup, B., Savoy, R., & Paulson, O. (1997). Nonlinear versus linear models in functional neuroimaging: Learning curves and generalization crossover, Information Processing in Medical Imaging (Vol. 1230, pp. 259-270). New York: Springer-Verlag.
Mourao-Miranda, J., Bokde, A. L., Born, C., Hampel, H., & Stetter, M. (2005). Classifying brain states and determining the discriminating activation patterns: Support Vector Machine on functional MRI data. NeuroImage, 28, 980-995.
Mukamel, R., Gelbard, H., Arieli, A., Hasson, U., Fried, I., & Malach, R. (2005). Coupling between neuronal firing, field potentials, and FMRI in human auditory cortex. Science, 309(5736), 951-954.
Nandy, R. R., & Cordes, D. (2004). New approaches to receiver operating characteristic methods in functional magnetic resonance imaging with real data using repeated trials. Magnetic Resonance in Medicine, 52(6), 1424-1431.
Ng, A. Y., & Jordan, M. I. (2002). On Discriminative vs. Generative classifiers: A comparison of logistic regression and naive Bayes. Neural Information Processing Systems, 14.
Norman, K. A., Polyn, S. M., Detre, G. J., & Haxby, J. V. (2006). Beyond mind-reading: multi-voxel pattern analysis of fMRI data. TRENDS in Cognitive Sciences, 10(9).
Ogawa, S., Lee, T. M., Nayak, A. S., & Glynn, P. (1990). Oxygenation-sensitive contrast in magnetic resonance image of rodent brain at high magnetic fields. Magnetic Resonance in Medicine, 14, 68-78.
Ogawa, S., Tank, D. W., Menon, R., Ellermann, J. M., Kim, S. G., Merkle, H., & Ugurbil, K. (1992). Intrinsic signal changes accompanying sensory stimulation: functional brain mapping with magnetic resonance imaging. Proceedings of the National Academy of Sciences, 89(13), 5951-5955.
Pan, X., & Metz, C. E. (1997). The "Proper" Binormal Model: Parametric Receiver Operating Characteristic Curve Estimation with Degenerate Data. Academic Radiology, 4(5), 380-389.
Pereira, F., Mitchell, T., & Botvinick, M. (2009). Machine learning classifiers and fMRI: a tutorial overview. NeuroImage, 45, S199-S209.
Peres-Neto, P. R., Jackson, D. A., & Somers, K. M. (2005). How many principal components? stopping rules for determining the number of non-trivial axes revisited. Computational Statistics and Data Analysis, 49(4), 974-997.
Perlbarg, V., Bellec, P., Anton, J. L., Pelegrini-Issac, M., Doyon, J., & Benali, H. (2007). CORSICA: correction of structured noise in fMRI by automatic identification of ICA components. Magnetic Resonance Imaging, 25(1), 35-46.
Power, J. D., Barnes, K. A., Snyder, A. Z., Schlaggar, B. L., & Petersen, S. E. (2012). Spurious but systematic correlations in functional connectivity MRI networks arise from subject motion. Neuroimage, 59(3), 2142-2154.
Purdon, P. L., & Weisskoff, R. M. (1998). Effect of temporal autocorrelation due to physiological noise and stimulus paradigm on voxel-level false-positive rates in fMRI. Human Brain Mapping, 6(4), 239-249.
Raemaekers, M., Vink, M., Zandbelt, B., van Wezel, R. J., Kahn, R. S., & Ramsey, N. F. (2007). Test-retest reliability of fMRI activation during prosaccades and antisaccades. NeuroImage, 36, 532-542.
Raj, D., Anderson, A. W., & Gore, J. C. (2001). Respiratory effects in human functional magnetic resonance imaging due to bulk susceptibility changes. Physics in Medicine and Biology, 46, 3331-3340.
Rasmussen, P. M., Madsen, K. H., Lund, T. E., & Hansen, L. K. (2011). Visualization of nonlinear kernel models in neuroimaging by sensitivity maps. NeuroImage, 55(3), 1120-1131.
Rasmussen, P. M., Schmah, T., Madsen, K. H., Lund, T. E., Yourganov, G., Strother, S., & Hansen, L. K. (2012A). Visualization of nonlinear classification models in neuroimaging - signed sensitivity maps. Paper presented at the Biosignals 2012, International Conference on Bio-inspired Systems and Signal Processing.
Rasmussen, P. M., Hansen, L. K., Madsen, K. H., Churchill, N. W., & Strother, S. C. (2012B). Model sparsity and brain pattern interpretation of classification models in neuroimaging. Pattern Recognition, 45(6), 2085-2100.
Reimann, P., Van Den Broeck, C., & Bex, G. J. (1996). A Gaussian scenario for unsupervised learning. Journal of Physics A: Mathematical and General, 29, 3521-3535.
Reyment, R. A., & Joreskog, K. G. (1996). Applied Factor Analysis in the Natural Sciences: Cambridge University Press.
Rissanen, J. (1978). Modeling By Shortest Data Description. Automatica, 14, 465-471.
Rombouts, S. A., Barkhof, F., Hoogenraad, F. G., Sprenger, M., & Scheltens, P. (1998). Within-subject reproducibility of visual activation patterns with functional magnetic resonance imaging using multislice echo planar imaging. Magnetic Resonance Imaging, 16, 105-113.
Rosenblatt, F. (1958). The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65(6), 386-408.
Schmah, T., Yourganov, G., Zemel, R. S., Hinton, G. E., Small, S. L., & Strother, S. C. (2010). Comparing classification methods for longitudinal fMRI studies. Neural Computation, 22(11), 2729-2762.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461-464.
Seber, G. A. F. (2004). Multivariate observations: Wiley-Interscience.
Shaw, M. E., Strother, S. C., Gavrilescu, M., Podzebenko, K., Waites, A., Watson, J., et al. (2003). Evaluating subject specific preprocessing choices in multisubject fMRI data sets using data-driven performance metrics. NeuroImage, 19, 988-1001.
Skudlarski, P., Constable, T. R., & Gore, J. C. (1999). ROC Analysis of Statistical Methods Used in Functional MRI: Individual Subjects. NeuroImage, 9(3), 311-329.
Small, S. L., Hlustik, P., Noll, D. C., Genovese, C., & Solodkin, A. (2002). Cerebellar hemispheric activation ipsilateral to the paretic hand correlates with functional recovery after stroke. Brain, 125(7), 1544-1557.
Smith, C. A. B. (1947). Some Examples of Discrimination. Annals of Eugenics, 13, 272-282.
Spreng, R. N., & Grady, C. L. (2010). Patterns of brain activity supporting autobiographical memory, prospection, and theory of mind, and their relationship to the default mode network. Journal of Cognitive Neuroscience, 22(6), 1112-1123.
Stoica, P., & Selen, Y. (2004). Model-order selection: a review of information criterion rules. IEEE Signal Processing Magazine, 21(4), 36-47.
Strother, S. C., Lange, N., Anderson, J. R., Schaper, K. A., Rehm, K., Hansen, L. K., et al. (1997). Activation pattern reproducibility: measuring the effects of group size and data analysis models. Human Brain Mapping, 5, 312-316.
Strother, S. C., Anderson, J., Hansen, L. K., Kjems, U., Kustra, R., Sidtis, J., et al. (2002). The quantitative evaluation of functional neuroimaging experiments: the NPAIRS data analysis framework. NeuroImage, 15, 747-771.
Strother, S., LaConte, S., Hansen, L. K., Anderson, J., Zhang, J., Pulapura, S., et al. (2004). Optimizing the fMRI data-processing pipeline using prediction and reproducibility performance metrics: I. A preliminary group analysis. NeuroImage, 23 Suppl 1, 196-207.
Strother, S. C., Oder, A., Spring, R., & Grady, C. (2010). The NPAIRS Computational Statistics Framework for Data Analysis in Neuroimaging. Paper presented at the 19th International Conference on Computational Statistics: Keynote, Invited and Contributed Papers.
Swets, J. A. (1988). Measuring the accuracy of diagnostic systems. Science, 240(4857), 1285-1293.
Sychra, J. J., Bandettini, P. A., Bhattacharya, N., & Lin, Q. (1994). Synthetic images by subspace transforms I. Principal components images and related filters. Medical Physics, 21, 193.
Tegeler, C., Strother, S. C., Anderson, J. R., & Kim, S. G. (1999). Reproducibility of BOLD-based functional MRI obtained at 4 T. Human Brain Mapping, 7, 267-283.
Thulborn, K. R., Waterton, J. C., Matthews, P. M., & Radda, G. K. (1982). Oxygenation dependence of the transverse relaxation time of water protons in whole blood at high field. Biochim. Biophys. Acta, 714, 265-270.
Tipping, M. E., & Bishop, C. M. (1999). Probabilistic Principal Component Analysis. Journal of the Royal Statistical Society, Series B, 61, 611-622.
Toro, R., Fox, P. T., & Paus, T. (2008). Functional coactivation map of the human brain. Cerebral Cortex, 18(11), 2553-2559.
Ulfarsson, M. O., & Solo, V. (2008). Dimension Estimation in Noisy PCA With SURE and Random Matrix Theory. IEEE Transactions on Signal Processing, 56(12), 5804-5816.
Vapnik, V. N. (1995). The nature of statistical learning theory. New York, NY, USA: Springer-Verlag New York, Inc.
Watkin, T. L. H., & Nadal, J.-P. (1994). Optimal Unsupervised Learning. Journal of Physics A: Mathematical and General, 27, 1899-1915.
Wax, M., & Kailath, T. (1985). Detection of signals by information theoretic criteria. IEEE Transactions on Acoustics, Speech and Signal Processing, 33(2), 387-392.
Wise, R. G., Ide, K., Poulin, M. J., & Tracey, I. (2004). Resting fluctuations in arterial carbon dioxide induce significant low frequency variations in BOLD signal. NeuroImage, 21(4), 1652-1664.
Wold, S. (1978). Cross-validatory estimation of the number of components in factor and principal component models. Technometrics, 20(4), 397-405.
Woods, R. P., Grafton, S. T., Holmes, C. J., Cherry, S. R., & Mazziotta, J. C. (1998). Automated image registration: I. General methods and intrasubject, intramodality validation. Journal of computer assisted tomography, 22, 139-152.
Worsley, K. J. (2001). Statistical Analysis of Activation Images. Functional MRI: Introduction to Methods, pp. 251-270.
Worsley, K. J., & Friston, K. J. (1995). Analysis of fMRI time-series revisited-again. NeuroImage, 2, 173-181.
Yamamoto, Y., Ihara, M., Tham, C., Low, R. W. C., Slade, J. Y., Moss, T., et al. (2009). Neuropathological Correlates of Temporal Pole White Matter Hyperintensities in CADASIL. Stroke, 40(6), 2004-2011.
Yourganov, G., Schmah, T., Small, S. L., Rasmussen, P. M., & Strother, S. C. (2010). Functional connectivity metrics during stroke recovery. Archives Italiennes de Biologie, 148(3), 259-270.
Yourganov, G., Chen, X., Lukic, A., Grady, C., Small, S., Wernick, M., et al. (2011). Dimensionality Estimation for Optimal Detection of Functional Networks in BOLD fMRI Data. NeuroImage, 56(2), 531-543.
Zarahn, E., Aguirre, G., & D'Esposito, M. (1997A). A trial-based experimental design for fMRI. NeuroImage, 6, 122-138.
Zarahn, E., Aguirre, G. K., & D'Esposito, M. (1997B). Empirical analyses of BOLD fMRI statistics. I. Spatially unsmoothed data collected under null-hypothesis conditions. NeuroImage, 5, 179-197.
Zhang, J., Liang, L., Anderson, J. R., Gatewood, L., Rottenberg, D. A., & Strother, S. C. (2008). A Java-based fMRI Processing Pipeline Evaluation System for Assessment of Univariate General Linear Model and Multivariate Canonical Variate Analysis-based Pipelines. Neuroinformatics, 6, 123-134.
Appendix
Works with significant contribution from the author

1 Peer-reviewed publications

Grigori Yourganov, Tanya Schmah, Nathan W. Churchill, Marc G. Berman, Cheryl L. Grady, Stephen C. Strother, "Evaluation of classifiers and their spatial maps with simulated and experimental fMRI data". In preparation.

Nathan W. Churchill, Grigori Yourganov, Stephen C. Strother, "Comparing Classification and Regularization Methods in fMRI for Large and Small Sample Sizes". In preparation.

Tanya Schmah, Stephen C. Strother, Grigori Yourganov, Nathan W. Churchill, Richard S. Zemel, Steven L. Small, "Complexity of functional connectivity predicts recovery from aphasia". In preparation.

Marc G. Berman, Grigori Yourganov, Mary K. Askren, Ozlem Ayduk, B. J. Casey, Ian H. Gotlib, Ethan Kross, Anthony R. McIntosh, Stephen Strother, Nicole L. Wilson, Vivian Zayas, Walter Mischel, Yuichi Shoda, John Jonides, "Dimensionality of brain networks linked to life-long individual differences in self-control". Nature Communications, Article 1373 (doi:10.1038/ncomms2374).

Nathan W. Churchill, Grigori Yourganov, Anita Oder, Fred Tam, Simon J. Graham, Stephen C. Strother, "Optimizing preprocessing and analysis pipelines for single-subject fMRI: 2. Interactions with ICA, PCA, task contrast and inter-subject heterogeneity". PLOS One, 7(2):e31147, 2012.

Nathan W. Churchill, Grigori Yourganov, Robyn Spring, Peter M. Rasmussen, Wayne Lee, Jon E. Ween, Stephen C. Strother, "PHYCAA: Data-Driven Measurement and Removal of Physiological Noise in BOLD fMRI". NeuroImage, 59(2), 2011.

Grigori Yourganov, Xu Chen, Ana S. Lukic, Cheryl L. Grady, Steven L. Small, Miles N. Wernick, Stephen C. Strother, "Dimensionality Estimation for Optimal Detection of Functional Networks in BOLD fMRI Data". NeuroImage, 56(2), 2011.

Grigori Yourganov, Tanya Schmah, Steven L. Small, Peter M. Rasmussen, Stephen C. Strother, "Functional connectivity metrics during stroke recovery". Archives Italiennes de Biologie, 148(3), 2010.

Tanya Schmah, Grigori Yourganov, Richard S. Zemel, Geoffrey E. Hinton, Steven L. Small, Stephen C. Strother, "Comparing Classification Methods for Longitudinal fMRI Studies". Neural Computation, 22, 2010.

2 Conference presentations

Tanya Schmah, E. Susan Duncan, Grigori Yourganov, Richard S. Zemel, Steven L. Small, Stephen C. Strother, "Predicting Language Recovery after Stroke Using Variability of Performance and Complexity of Functional Connectivity". Poster presented at the 23rd Annual Rotman Neuroscience Conference, "Brain Plasticity & Neurorehabilitation", March 2013.

Grigori Yourganov, Xu Chen, Stephen C. Strother, "Detection of functional networks in fMRI data: evaluation of univariate and multivariate approaches". Poster presented at the Annual Meeting of the Organization for Human Brain Mapping, June 2010.

Grigori Yourganov, Ana S. Lukic, Cheryl L. Grady, Miles N. Wernick, Stephen C. Strother, "Optimizing Activation Detection with Better Dimensionality Estimation in BOLD fMRI". Poster presented at the Annual Meeting of the Organization for Human Brain Mapping, June 2009.

Grigori Yourganov, Xu Chen, Ana S. Lukic, Miles N. Wernick, Stephen C. Strother, "The Impact of Dimensionality Estimation on Spatial Signal Detection in Multivariate Gaussian Image Data". Poster presented at the Annual Meeting of the Organization for Human Brain Mapping, June 2008.