Predictive Gaussian Classification of Functional MRI Data

by

Grigori Yourganov

A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy

Institute of Medical Science
University of Toronto

© Copyright by Grigori Yourganov 2013
Predictive Gaussian Classification of Functional MRI Data
Grigori Yourganov
Doctor of Philosophy
Institute of Medical Science, University of Toronto
2013
Abstract
This thesis presents an evaluation of algorithms for classification of functional MRI data. We
evaluated the performance of probabilistic classifiers that use a Gaussian model against a popular
non-probabilistic classifier (support vector machine, SVM). A pool of classifiers consisting of
linear and quadratic discriminants, linear and non-linear Gaussian Naive Bayes (GNB)
classifiers, and linear SVM, was evaluated on several sets of real and simulated fMRI data.
Performance was measured using two complementary metrics: accuracy of classification of fMRI
volumes within a subject, and reproducibility of within-subject spatial maps; both metrics were
computed using split-half resampling. Regularization parameters of multivariate methods were
tuned to optimize the out-of-sample classification and/or within-subject map reproducibility.
SVM showed no advantage in classification accuracy over Gaussian classifiers. The performance of
SVM was matched by the linear discriminant, and at times surpassed by the quadratic discriminant
or nonlinear GNB. Among all tested methods, linear and quadratic discriminants regularized
with principal components analysis (PCA) produced spatial maps with highest within-subject
reproducibility. We also demonstrated that the number of principal components that optimizes
the performance of linear / quadratic discriminants is sensitive to the mean magnitude, variability
and connectivity of the simulated active signal. In real fMRI data, this number is correlated with
behavioural measures of post-stroke recovery and, in a separate study, with behavioural
measures of self-control. Using the data from a study of cognitive aspects of aging, we accurately
predicted the age group of the subject from within-subject spatial maps created by our pool of
classifiers. We examined the cortical areas that showed differences in recruitment in young versus
older subjects; this difference was demonstrated to be primarily driven by more prominent
recruitment of the task-positive network in older subjects. We conclude that linear and quadratic
discriminants with PCA regularization are well-suited for fMRI data classification, particularly
for within-subject analysis.
Acknowledgments
Most importantly, I would like to express my gratitude to my academic supervisors, Dr. Stephen
Strother and Dr. Randy McIntosh, who have helped me enormously during my graduate studies.
This thesis would not exist without their mentorship, without their financial and moral support.
I am also deeply grateful to Dr. Stephen Small from the University of California, Irvine, and to Dr.
Cheryl Grady from the University of Toronto, for letting me use the data from their beautifully
designed experiments. I would also like to thank Natasa Kovacevic for doing a fantastic job at
careful preprocessing of the data. I am heavily indebted to Dr. Ana Lukic for developing the
simulation framework that my thesis rests on, and to Dr. Xu Chen for teaching me how to use
this framework.
I have been very lucky to closely collaborate with Dr. Tanya Schmah, Dr. Marc Berman and Dr.
Nathan Churchill. I have learned a great many things from them, and certainly hope that our
collaborations continue into the future!
Many thanks to Dr. Rafal Kustra and Dr. Richard Zemel, members of my Program Advisory
Committee, for being patient with me during the long period of my doctoral studies and for
providing invaluable feedback!
Table of Contents
Chapter 1 Introduction .................................................................................................................... 1
1.1 The nature of signal and noise in fMRI .................................................................................... 1
1.1.1 Magnetic Resonance Imaging and the BOLD signal ........................................................ 1
1.1.2 Hemodynamic response and temporal resolution of fMRI ............................................... 3
1.1.3 Spatial resolution of fMRI ................................................................................................ 4
1.1.4 Sources of noise in fMRI .................................................................................................. 5
1.2 Experimental design .................................................................................................................. 7
1.3 Statistical analysis ..................................................................................................................... 9
1.3.1 General Linear Model ....................................................................................................... 9
1.3.2 Multivariate classification ............................................................................................... 10
1.4 Selecting the method of analysis ............................................................................................. 12
1.5 Data sets for evaluation of algorithms .................................................................................... 13
1.6 Structure of the thesis .............................................................................................................. 14
Chapter 2 Evaluating algorithms for fMRI data analysis ............................................................. 16
2.1 Metrics of performance ........................................................................................................... 16
2.1.1 Receiver Operating Characteristics (ROC) methodology ............................................... 16
2.1.2 Reproducibility; NPAIRS framework ............................................................................. 20
2.1.3 Accuracy of Classification .............................................................................................. 23
2.1.4 Prediction-Reproducibility Plots ..................................................................................... 24
2.2 Data sets .................................................................................................................................. 26
2.2.1 Simulated data ................................................................................................................. 27
2.2.2 Stroke recovery study ...................................................................................................... 30
2.2.3 Aging study ..................................................................................................................... 31
Chapter 3 Probabilistic classification of fMRI data ...................................................................... 33
3.1 General considerations ............................................................................................................ 33
3.2 Constructing spatial maps for classifiers ................................................................................ 36
3.3 Quadratic discriminant ............................................................................................................ 38
3.4 Linear discriminant ................................................................................................................. 39
3.5 Univariate methods: Gaussian Naive Bayes classifier, General Linear Model ...................... 41
3.6 Regularization of the covariance matrix ................................................................................. 45
3.7 Non-probabilistic classification: Support Vector Machines ................................................... 46
Chapter 4 Methods for estimating the intrinsic dimensionality of the data .................................. 48
4.1 Principal Component Analysis and dimensionality reduction ................................................ 48
4.2 Probabilistic Principal Component Analysis .......................................................................... 50
4.3 Methods of dimensionality estimation .................................................................................... 52
4.3.1 Akaike Information Criterion .......................................................................................... 53
4.3.2 Minimum Description Length ......................................................................................... 55
4.3.3 Bayesian evidence ........................................................................................................... 56
4.3.4 Stein's Unbiased Risk Estimator ..................................................................................... 57
4.3.5 Predicted Residual Sum of Squares ................................................................................ 59
4.3.6 Generalization error ........................................................................................................ 60
4.3.7 Reproducibility and Classification Accuracy .................................................................. 61
4.3.8 The Area under a ROC curve .......................................................................................... 62
Chapter 5 Intrinsic dimensionality estimation: results .................................................................. 64
5.1 Simulated data ......................................................................................................................... 64
5.1.1 Analytic methods ............................................................................................................ 64
5.1.2 Empirical methods: PRESS and Generalization error .................................................... 67
5.1.3 Empirical methods: Reproducibility and Classification Accuracy ................................. 68
5.1.4 Summary of performance on simulated data .................................................................. 71
5.1.5 Effect on data analysis .................................................................................................... 75
5.2 Intrinsic dimensionality estimation in real data ...................................................................... 77
5.3 Lessons learned ....................................................................................................................... 80
Chapter 6 Intrinsic dimensionality and complexity of fMRI data ................................................ 84
6.1 Intrinsic dimensionality in a study of self-control .................................................................. 84
6.2 Complexity of cortical networks in fMRI study of stroke recovery ....................................... 86
Chapter 7 Evaluation of classifiers: simulated fMRI data ............................................................ 94
7.1 Pool of classifiers .................................................................................................................... 94
7.2 Performance of classifiers on simulated data .......................................................................... 96
7.2.1 Classification accuracy .................................................................................................... 98
7.2.2 Reproducibility ................................................................................................................ 99
7.2.3 Partial area under the ROC curve .................................................................................. 100
7.2.4 ROC evaluation of GLM, PCA, ICA and PLS ............................................................. 101
7.3 Summary of evaluation of classifiers on simulated data ...................................................... 104
Chapter 8 Evaluation of classifiers: real fMRI data .................................................................... 106
8.1 Data sets ................................................................................................................................ 106
8.2 Evaluation on real data: Stroke study ................................................................................... 108
8.3 Evaluation on real data: Aging study .................................................................................... 113
8.4 Spatial maps for the aging study ........................................................................................... 117
8.4.1 Reproducibility of spatial maps across subjects and across methods ........................... 118
8.4.2 Group-level classification of spatial maps .................................................................... 125
Chapter 9 Conclusions and Future Research .............................................................................. 133
9.1 Evaluation of classifiers for fMRI data ................................................................................. 133
9.2 Directions for future research ............................................................................................... 135
9.2.1 Simulations of multiple networks ................................................................................. 135
9.2.2 Dimensionality in the aging study set ........................................................................... 136
9.2.3 Modifications to LD and QD; additional performance metrics .................................... 137
References ................................................................................................................................... 139
Appendix Works with significant contribution from the author ................................................. 150
1 Peer-reviewed publications .................................................................................................... 150
2 Conference presentations ....................................................................................................... 151
List of Tables
Table 6.1. Correlations between fMRI-based measures and behavioural measures.. ................... 92
Table 8.1. Cortical areas that show sensitivity to age ................................................................. 128
List of Figures
Figure 1.1. The temporal profile of a model hemodynamic response function .............................. 3
Figure 2.1. Examples of different distributions of H1 (solid green) and H0 (dashed red), and the
corresponding ROC curves. A: perfect detector; the ROC curve is a step function. B: Chance
distributions overlap completely; the ROC curve is the identity line. C: the usual situation,
distributions overlap somewhat and the area under the ROC curve is between 0.5 and 1. D:
degenerate case, the H1 distribution is completely contained inside H0 distribution; the ROC
curve shows the characteristic "hook" at the bottom. ................................................................... 18
Figure 2.2. Scatter plots of 2 simulated spatial maps, corresponding to reproducibility of r=0.5
(A) and r=0 (B). Major and minor axes are displayed for each scatter plot. The major axis
contains signal and noise, and variance along this axis is 1+r, while the minor axis contains only
noise, with variance 1-r. ................................................................................................................ 23
Figure 2.3. Prediction-Reproducibility (P-R) plots in the presence and absence of a simulated
cortical network. The black line shows the P-R trajectory when the activation loci are linearly
coupled into a single covarying network. The grey line shows the trajectory when the activation
loci are independent. The size of the markers corresponds to the number of principal components
used in the analysis (varied from 1 to 40). We show the average trajectory for 100 simulated data
sets described in Section 2.2.1. ..................................................................................................... 26
Figure 2.4. The phantom in baseline (left) and activation (right) states. Noise is not displayed. ... 26
Figure 5.1. Normalized cost function of several methods of dimensionality estimation. ............ 65
Figure 5.2. Reproducibility and classification accuracy for linear and quadratic discriminants, as
a function of number of principal components, for two simulated data sets. Left and right plots
correspond to a weak (V=0.1) and strong (V=1.6) variance of the signal, respectively, for
moderate connectivity of ρ=0.5. The dotted line shows the classification accuracy corresponding
to random guessing. ...................................................................................................................... 68
Figure 5.3. Plot of the first 10 eigenvalues of the covariance matrix of a single data set, for M =
0.01 (top) and M = 0.03 (bottom); ρ is set to 0.5, and V varies from 0.1 to 1.6 in increments of
0.5. Eigenvalues are averaged across 100 simulated data sets. .................................................... 69
Figure 5.4. Median dimensionality estimates in simulations, as calculated by various methods
(see legend and text), shown as a function of the relative signal variance, V, defined as the
variance of the amplitude of the Gaussian activation blobs relative to the variance of the
independent background Gaussian noise added to each voxel. M is set to 0.01 for the top row and
to 0.03 for the bottom row. The three panels from left to right in A and B show three increasing
levels of correlation, ρ, between Gaussian activation blob amplitudes. Range bars on the first
(V=0.1) and last (V=1.6) data points reflect the 25%–75% interquartile distribution range across
100 simulation estimates. .............................................................................................................. 72
Figure 5.5. Asymptotic relationship between global signal-to-noise ratio (gSNR) and
dimensionality that optimizes reproducibility, for linear (A) and quadratic (B) discriminant maps.
Marker size indicates relative signal variance, V, from 0.1 (small) to 1.6 (large). Five colours
encode five different levels of M, and the spatial correlation is encoded by different symbols. .. 74
Figure 5.6. Partial ROC area (corresponding to false positive frequency range of [0…0.1]) as a
function of the relative signal variance, V, calculated for linear discriminant (LD, on the principal
component subspace, with subspace size selected by various methods), and for univariate general
linear model (GLM). M is set to 0.01 for the top row (A) and to 0.03 for the bottom row (B). The
three panels from left to right in A and B show three levels of correlation, ρ, between Gaussian
activation blob amplitudes. Error bars show standard deviation across 16 active loci (centers of
Gaussian activation blobs). ........................................................................................................... 76
Figure 5.7. Asymptotic relationship between global signal-to-noise ratio (gSNR) and optimal
dimensionality in real fMRI data: stroke study (A) and aging study (B). Each marker indicates a
subject. .......................................................................................................................................... 76
Figure 5.8. Optimal dimensionality and global SNR in a group study. Each marker indicates an
age group; each solid line indicates a task that is contrasted with fixation.. ................................ 76
Figure 6.1. Intrinsic dimensionality in the self-control study, for the subjects in the high-delaying
and the low-delaying groups. Dimensionality is estimated by optimization of LD classification
accuracy. Error bars represent standard errors across the subject group. ..................................... 85
Figure 6.2. Scatter plots for four combinations of fMRI-based measures and behavioural
measures: QD dimensionality versus final peg test performance (A); generalization-error
dimensionality versus final pinch test performance (B); sphericity versus pinch test improvement
(C); spectral distance versus pinch test improvement (D). Each subject is represented with a
specific symbol.. ........................................................................................................................... 85
Figure 6.3. Scatter plot for the first vs. the second principal component produced by PLS analysis
of the correlation matrix given in Table 6.1. Squares and circles denote fMRI-based and
behavioural measures, respectively... ........................................................................................... 91
Figure 7.1. Performance of the pool of six classifiers on simulated data sets. Top, middle and
bottom row show three metrics of performance. The three columns correspond to three levels of
mean signal magnitude M, and the three sub-columns to three levels of spatial correlation ρ... .. 97
Figure 7.2. Performance of the pool of six classifiers on simulated data sets, measured by partial
area under the ROC curve. The three columns correspond to three levels of mean signal
magnitude M, and the three sub-columns to three levels of spatial correlation ρ.... ................... 100
Figure 7.3. Partial area under ROC curve measured for maps that are produced by GLM, ICA,
PLS and PCA... ........................................................................................................................... 103
Figure 8.1. Performance of the pool of classifiers on the stroke recovery dataset for three
contrasts (healthy/impaired, early/late, and finger/wrist). The top figure shows the accuracy of
classification, and the bottom figure shows the reproducibility of spatial maps for six algorithms
of classification... ........................................................................................................................ 108
Figure 8.2. Ranking of six classifiers for two performance metrics: classification accuracy (top)
and map reproducibility (bottom). Ranks of 1 and 6 correspond to the best and the worst
performer, respectively. If the ranking of classifiers is not significantly different, they are linked
with a thick horizontal bar. Significance is established with Friedman and post-hoc
nonparametric testing, as described in the text. For the contrasts with significant difference in
ranking, critical distances (CD) are also specified ...................................................................... 111
Figure 8.3. Performance of a larger group of classifiers, used in a study by Schmah et al. (2010)
to classify the data in the stroke recovery study... ...................................................................... 113
Figure 8.4. Performance of the pool of classifiers on the dataset from the aging study. Left and
right columns correspond to subjects in the young and the older age groups, respectively... .... 114
Figure 8.5. Ranking of classifiers in the young age group. Classifiers linked with a horizontal bar
are not significantly different in their ranking... ......................................................................... 115
Figure 8.6. Ranking of classifiers in the older age group... ........................................................ 115
Figure 8.7. Across-subject reproducibility of within-subject spatial maps created by different
classifiers. The left and right panels correspond to the young and the older groups of subjects,
respectively... .............................................................................................................................. 119
Figure 8.8. Jaccard overlap of within-subject spatial maps across subjects. The left and right
panels correspond to the young and the older groups of subjects, respectively ......................... 120
Figure 8.9. Correlation of average spatial maps across classifiers. Individual subject maps created
by each of the 6 classifiers have been averaged across all subjects from our study... ................ 121
Figure 8.10. Jaccard overlap of average spatial maps across classifiers, for 2 strong contrasts
(RT/FIX and DM/FIX) .............................................................................................................. 122
Figure 8.11. DISTATIS plots of similarity of within-subject maps created with different
classifiers... .................................................................................................................................. 124
Figure 8.12. Accuracy of group-level classification of individual maps that have been created by
six different classifiers. Within-subject maps have been classified according to the age group of
the subject ("young" versus "older") ........................................................................................... 126
Figure 8.13. Cortical areas affected by aging, as revealed by the RT/FIX contrast. The top row
shows the group-difference map, thresholded at p<0.05. The middle and bottom rows show the
unthresholded group-average maps for the young and the older group, respectively ................ 130
Figure 8.14. Cortical areas affected by aging, as revealed by the DM/FIX contrast. The top row
shows the group- difference map, thresholded at p<0.05. The middle and bottom rows show the
unthresholded group-average maps for the young and the older group, respectively ................ 131
Chapter 1 Introduction

1.1 The nature of signal and noise in fMRI
In the past two decades, functional Magnetic Resonance Imaging (fMRI) has become one of the
most popular methods of studying brain activity (Bandettini, 2007). Broadly speaking, it is based
on measuring magnetic properties of the blood in the brain that are associated with cortical
activity (strictly speaking, these magnetic properties are measured at the level of the nuclei of
hydrogen atoms that are part of the water molecules of the blood). This technique is made
possible by a series of advances in neuroscience, physics and medical engineering.
1.1.1 Magnetic Resonance Imaging and the BOLD signal
Magnetic Resonance (MR) imaging is well established in medicine. It is commonly used for
diagnostics, as a safer (although more expensive) alternative to X-rays. A comprehensive
treatment of the principles of fMRI can be found in, e.g., Buxton (2009); here we give a
simplified picture. When a strong constant magnetic field is applied to the body, a portion of the
hydrogen nuclei that constitute the water molecules in the tissue align along the direction of the
magnetic field. This creates a net magnetization of the tissue; the amount of magnetization
depends on the magnetic susceptibility of the tissue. Deoxygenated hemoglobin is paramagnetic,
while oxygenated hemoglobin is not; deoxygenated blood therefore perturbs the local magnetic field more strongly.
While the constant magnetic field is on, a second magnetic field is briefly turned on; this second
field is of much smaller magnitude and of perpendicular direction to the first field. If it oscillates
with a specific frequency (which depends on the magnitude of the constant field and on the type
of the tissue nuclei, which is, in our case, hydrogen), the phenomenon of magnetic resonance
occurs: the protons constituting the hydrogen nuclei enter a higher-energy state. When the
oscillating field is turned off, these protons return to their previous energy state, emitting a radio
wave in the process. This wave forms the basis for the MR signal received by the MR scanner.
Because of the difference in magnetic susceptibility, this signal is higher in oxygenated blood
than it is in deoxygenated blood. In 1936, Pauling and Coryell studied magnetic properties of
hemoglobin molecules, and discovered that oxygenated and deoxygenated hemoglobin had
different magnetic susceptibility. Thulborn et al. (1982) applied this idea to MR imaging and
found that the transverse relaxation time of water molecules in blood depended on the level of
oxygenation. The relationship between blood oxygenation and neuronal activity was studied with
Positron Emission Tomography (PET). Synaptic activity requires oxygen and glucose, which are
supplied via blood circulation. Using PET imaging, Fox and colleagues (1988) observed a large
increase in blood flow and metabolic rate of glucose after tactile stimulation, but the measured
increase in metabolic consumption of oxygen was much lower in comparison. The change in
oxygenated blood flow driven by neuronal activity can be measured by an MR scanner; unlike PET,
this measurement is not an invasive procedure.
Application of MR technology to study brain activity was suggested in a series of papers by
Ogawa and colleagues. In a study conducted in 1990 (see Ogawa, Lee, Nayak, & Glynn, 1990),
they acquired high-resolution images of rat brains at high magnetic fields (7 and 8.5 Tesla). They
saw dark lines that corresponded to anatomical divisions of the cortex. The prominence of these
lines depended on the level of oxygenation and on dilation of blood vessels, and the authors
proposed MRI as an alternative to PET in studying brain function. Indeed, the follow-up studies
(see Kwong et al., 1992, and Ogawa et al., 1992) showed that changes in MR images were
related to neuronal activity.
The mechanism of coupling between neurons and blood vessels is a subject of intense
investigation. The current understanding can be found in the review by Attwell and colleagues
(Attwell et al., 2010). When glutamate, which mediates the most common type of neural
activity, is absorbed by the neurons and by nearby astrocyte cells, a series of signaling chemicals
are released. This chemical signal is received by vascular smooth muscle, which responds by dilating the
blood vessels. This causes a marked increase in oxygenated blood flow to active cortical regions.
However, only a small fraction of oxygen is metabolized. The resulting over-supply of
oxygenated hemoglobin is used in fMRI as a marker of neuronal activity. The term “blood-
oxygenation-level dependent signal” (“BOLD signal”) was proposed by Ogawa et al. (1990) and
used to refer to the MR signal that is sensitive to deoxygenated hemoglobin. Later studies found that
the BOLD signal matches electrophysiological measurements of neuronal activity, such as local
field potential (Logothetis et al., 2001) as well as firing rate (Mukamel et al., 2005).
1.1.2 Hemodynamic response and temporal resolution of fMRI
The temporal profile of a BOLD response to a short neuronal event is called a “hemodynamic
response function” (HRF). The HRF is quite slow relative to the
time course of the underlying neuronal activity. It takes about 5 seconds for the BOLD signal to
reach its peak, and then it slowly decays. Often, an “undershoot” is observed: the BOLD signal,
after reaching the peak, decays below the baseline level, and then (up to 20 seconds after the
neuronal event) it returns to the equilibrium level (Glover, 1999). The profile of the BOLD
response varies across subjects (Aguirre, Zarahn, & D'esposito, 1998). Also, the magnitude of
the BOLD signal is influenced by the vascular density and therefore varies across spatial
locations in the brain (Harrison et al., 2002). In many cases, researchers use a simple model of
the hemodynamic response, which does not account for spatial and inter-subject variations: the HRF
is modeled as a difference of two gamma functions (Glover, 1999; Friston et al., 1998). We have
used this model to create hemodynamic effects in our simulations; this was done by convolving
the simulated signal with the impulse response function given by
h(t) = (t/d_1)^{a_1} exp(-(t - d_1)/b_1) - c (t/d_2)^{a_2} exp(-(t - d_2)/b_2).  (1.1)
Figure 1.1. The temporal profile of a model hemodynamic response function.
The parameters of the function were set according to Worsley (2001): a_1 = 6, a_2 = 12, b_1 = b_2 =
0.9 seconds, c = 0.35, and d_k = a_k b_k. Figure 1.1 illustrates the impulse response function h(t)
graphically.
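As a concrete illustration, the double-gamma impulse response of Equation 1.1 can be computed numerically and convolved with a stimulus time course, as done in our simulations. The sketch below (in Python/NumPy; the code is illustrative and not part of any cited toolbox) uses the parameter values quoted above from Worsley (2001):

```python
import numpy as np

def hrf(t, a1=6.0, a2=12.0, b1=0.9, b2=0.9, c=0.35):
    """Double-gamma hemodynamic impulse response (Glover, 1999; Worsley, 2001)."""
    d1, d2 = a1 * b1, a2 * b2          # times-to-peak of the two gamma terms
    t = np.asarray(t, dtype=float)
    g1 = (t / d1) ** a1 * np.exp(-(t - d1) / b1)
    g2 = (t / d2) ** a2 * np.exp(-(t - d2) / b2)
    return g1 - c * g2

# Sample the HRF over 25 s at 0.1-s steps and convolve with a 5-s "task" block.
t = np.arange(0, 25, 0.1)
h = hrf(t)
stimulus = np.zeros(400)
stimulus[50:100] = 1.0                 # neuronal activity from 5 s to 10 s
bold = np.convolve(stimulus, h)[:400] * 0.1   # scale by the sampling step
```

The resulting `bold` time series shows the delayed rise, the peak at roughly 5 seconds after stimulus onset, and the post-stimulus undershoot described above.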
Because of the slow time course of HRF, it is hard to resolve neuronal responses to a quick
succession of events with fMRI. Discrete neuronal events should be separated by at least 4
seconds in order to be resolved (Zarahn, Aguirre, & D'Esposito, 1997A). However, provided that
the stimuli are presented repeatedly and their timing is randomized, studies have shown
significant differential functional responses between two events (e.g. flashing visual stimuli)
spaced as closely as 500 ms apart (Burock et al., 1998).
1.1.3 Spatial resolution of fMRI
The spatial resolution of the BOLD signal is tied to the density of the spatial structure of the
vascular system (Harrison et al., 2002). Blood is supplied to the active neurons by capillaries,
which form a dense net around neurons. The diameter of a capillary is comparable to the size of a
red blood cell (Huettel et al., 2004). Blood circulates through a system of arteries and veins,
which branch out into arterioles and venules. It has been suggested (Harel et al., 2006) that the
blood flow mechanism is controlled on the level of arterioles on the pial surface of the brain.
Arterioles are thin (10 to 50 micrometers in diameter), and feed the cortical tissue in about 1-mm
thick cylinders around them. In comparison, the thickness of the grey matter in neocortex is
about 3 to 5 mm.
Spatial resolution of an fMRI study is usually chosen by the researcher with several
considerations in mind. The limit of this resolution is usually defined using the notion of a voxel,
a three-dimensional cell which is our sampling unit in space. Researchers in current fMRI studies
usually use voxels of size 1×1×1 to 5×5×5 mm³. The finer the spatial resolution, the more time
needed to scan the cortical surface of interest. The researcher needs to make sure that the choice
of spatial resolution translates to reasonable volume acquisition time, especially if the whole
brain needs to be covered.
Field strength influences the spatial structure of fMRI images. To achieve reliable acquisition at
reasonably high resolution, the majority of modern MRI scanners use 1.5 to 3 Tesla fields.
Above 3 Tesla, we can image the brain at the resolution of voxels of sub-millimeter volume and
approach the spatial resolution of cortical columns (Kriegeskorte & Bandettini, 2007). Such high
fields have been found safe for human subject exposure, and are slowly being accepted in the
neuroimaging community as more 7 Tesla scanners are purchased and installed. However, the
usefulness of high-field fMRI is a somewhat controversial issue, because, at higher field
strength, the MR signal is more severely affected by physiological processes as well as by
physical artifacts (see Kleinschmidt, 2007, and also Huettel et al., 2004, p. 236-241).
1.1.4 Sources of noise in fMRI
The amplitude of BOLD signal changes caused by task-related neural activity is quite small
relative to the intensity of the background MR signal. More importantly, BOLD signal is
corrupted by noise from various sources. A good summary of noise in fMRI can be found in
Huettel et al. (2004). The fMRI literature distinguishes several kinds of noise, characterized by
their origin and spatio-temporal properties. The noise that does not originate from neuronal
activity needs to be removed from the data prior to statistical analysis; this step is called pre-
processing. The most important kinds of BOLD noise are listed below.
Thermal noise (also called Johnson noise) is due to thermal motion of the electrons within the
subject and the scanner hardware. This noise is always present at temperatures above absolute
zero, and has been shown to be proportional to the field strength (Edelstein et al., 1986) and
independent of the magnitude of the MR signal (Kruger & Glover, 2001). This noise is
temporally and spatially independent and has an additive effect.
Scanner noise is caused by imperfections in the imaging hardware, most importantly by the
inhomogeneities of the magnetic field and by gradient coil nonlinearities (Jezzard & Clare,
1999). It is proportional to the field strength (Kruger & Glover, 2001). Special shimming coils
are usually used to correct for the magnetic field inhomogeneities, but it should not be assumed
that shimming removes them completely. The effect of these inhomogeneities is the strongest in
the parts of the brain where different tissues are located next to each other, for example, when
the brain tissue is situated close to bone or air-containing sinuses.
Because of subjects’ head motion, the signal recorded over time from a specific volume of the
magnetic field may contain, at different time points, contributions from different spatial areas of
a subject’s head; this severely degrades the BOLD signal. Head motion can also cause drifts in
fMRI signal. The severity of motion effects is at its worst at the edges of the brain, as well as at
the brain-CSF and brain-air interfaces. To alleviate the impact of head motion, the subject's
head is secured in an MR head holder with padded cushions or vacuum packs.
The effect of head motion is usually corrected during pre-processing of the data, typically by
aligning the fMRI volumes to a reference volume using rigid-body transformation (see Jenkinson
et al., 2002; Woods et al., 1998; Cox & Jesmanowicz, 1999). However, residual motion effects
remain even after this procedure (Lund et al., 2006). An alternative approach is to discard
volumes severely impacted by head motion (Power et al., 2011). Head motion is a significant
contributor to noise, especially in older subjects as well as clinical patients (e.g. patients with
Parkinson’s disease). It should also be noted that the motion of the head could be coupled to the
task (many fMRI experiments require some kind of motor or verbal response from the subject).
Another possible source of head motion is the vibration of the scanner, which may cause drifts in
BOLD signal (Foerster, Tomasi, & Caparelli, 2005).
The contribution of physiological processes to BOLD fluctuations is often referred to as
physiological noise (Huettel et al., 2004). In contrast to thermal and scanner noise, which can be
reduced by decreasing spatial resolution, physiological noise is independent of the size of the
voxel (Bodurka et al., 2007; Kriegeskorte & Bandettini, 2007). The variability due to
physiological noise increases with the amplitude of the BOLD signal and with field strength
(Kruger & Glover, 2001). Cardiac, respiratory and neural activity are the main sources of
physiological noise.
The effect of cardiac activity is especially prominent in the vicinity of major blood vessels
(Dagli, Ingeholm, & Haxby, 1999). Heartbeat induces subtle tissue motion. It also creates
fluctuations in blood volume, which in turn create fluctuations in BOLD signal. In the power
spectrum of fMRI time series, cardiac activity creates peaks at the frequency of the heartbeat
(around 1 Hz) and its harmonics.
The displacement of brain tissue due to respiration is considerably larger than displacement due
to cardiac activity. In addition, changing lung volume creates susceptibility variations in the
magnetic field (Raj, Anderson, & Gore, 2001). Respiration, like cardiac activity, introduces
peaks in the power spectrum at the frequency of respiration (Raj et al., 2001), which lies at the
range 0.1-0.3 Hz (Wise et al., 2004). The spatial effect of respiration is rather more global than
the effect of cardiac activity (Glover, Li, & Ress, 2000), but it is most prominent in large CSF
pools such as ventricles (Perlbarg et al., 2007). A common approach to correction for cardiac and
respiratory artifacts is to externally measure the pulse and the lung volume (pulse-oximeter and
respiratory belt are often used for these purposes; see Glover et al., 2000; Lund et al., 2006).
Another important source of physiological noise is spontaneous neuronal activity. The brain is
always active, and the neuronal activity related to the response to experimental tasks is a fraction
of overall neuronal activity (when analyzing experimental fMRI data, it does not seem possible
to tell whether BOLD signal corresponds to spontaneous or to salient neuronal activity). It has
been suggested that spontaneous brain activity makes the brain more efficient in switching from
one state to another (Deco, Jirsa, & McIntosh, 2011; McIntosh et al., 2010). The power spectrum
of this source of noise obeys the 1/f law, which suggests a significant amount of autocorrelation
(Zarahn, Aguirre, & D'Esposito, 1997B).
Kruger and Glover (2001) have studied the relative contribution of the mentioned sources to
overall BOLD noise. The picture is somewhat different across brain tissues: overall variability
(measured with standard deviation of BOLD signal) is about twice as high in grey matter as it is
in white matter. Variability due to thermal and scanner noise is homogeneous across the brain and
across subjects, and it explains about 10% of variance in the grey matter and about 35% in white
matter. Variability due to spontaneous activity explains 70% of grey matter variance and 45% of
white matter variance; the actual percentage of explained variance varies between subjects, but it
was uniformly found to be the largest source of variance. Variability due to head motion and to
cardiac and respiratory activity explained 10% of grey matter and 8% of white matter BOLD
signal variance; the contribution of these sources of noise was very different in different subjects.
1.2 Experimental design
The traditional approach in designing fMRI experiments is contrasting the neural response in two
(or more) experimental conditions. This approach uses subtraction logic, where two
experimental conditions or brain states are assumed to be different in only one factor (Culham,
2006), and differential BOLD response measures the influence of this factor. The condition
where this factor is present is called the task condition, and the other condition is called a
baseline condition.
Ideally, the two conditions should be different either in their presented stimuli or in behavioural
tasks. For example, when we want to study attention to visual stimuli, it makes sense to compare
passive viewing of the stimuli to the engaged viewing of the same stimuli (when the subject is
instructed to pay attention). When we are interested in response to visual stimuli, we may
contrast passive viewing with looking at a blank screen. If we contrast engaged viewing of
stimuli with looking at a blank screen, the differences in BOLD signal may be attributed to two
different factors, visual perception and attention, and the influence of these two factors cannot be
resolved. Careful selection of a baseline task is therefore critical in task-driven experiments.
A common way to design the presentation of the conditions is to use block design. A block is the
duration of time during which the subject repeatedly performs the task. “Task blocks” alternate
with “baseline blocks”. This design is statistically powerful, because trials are
repeated many times within a block. However, sometimes we are interested in the response to a
single trial. In this case, the researchers use event-related design, where the task is performed
once, followed by some time of inactivity (Buckner et al., 1996).
Subtraction logic naturally leads towards hypothesis-driven research, where the researcher uses
the experimental data to test a certain hypothesis about the neural effect of the studied factor.
The research hypothesis might state that the effect is observed in certain cortical areas. Also, the
researcher might want to quantify the relationship between the stimulus and its effect. The research
hypothesis is contrasted with the null hypothesis, which states that the effect is not observed.
Specifically, the null hypothesis implies that the data sampled in different experimental
conditions comes from the same population.
There is an alternative to hypothesis-driven research, where the researcher is not looking for a
confirmation or rejection of a certain hypothesis, but rather lets the data “speak for themselves”
without formulating a hypothesis prior to analysis. This alternative is called data-driven
research. A popular data-driven approach is to use principal component analysis (Sychra et al.,
1994; Friston et al., 2000) or independent component analysis (Beckmann & Smith, 2004); both
of these methods represent the data as a sum of spatio-temporal components which are ordered
by the amount of data variance they explain. Another data-driven approach is to use clustering,
for example, Cordes et al. (2002) propose to group voxels into clusters according to the
similarity of time courses.
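To make the clustering idea concrete, a rudimentary version of time-course grouping might merge voxels whose time series correlate above a threshold. The sketch below is a toy greedy procedure of our own construction, not the actual algorithm of Cordes et al. (2002):

```python
import numpy as np

def correlation_clusters(ts, threshold=0.7):
    """Greedily group voxels (columns of ts) by time-course correlation."""
    corr = np.corrcoef(ts.T)
    unassigned = list(range(ts.shape[1]))
    clusters = []
    while unassigned:
        seed = unassigned.pop(0)
        members = [seed] + [v for v in unassigned if corr[seed, v] > threshold]
        unassigned = [v for v in unassigned if v not in members]
        clusters.append(members)
    return clusters

# Toy data: voxels 0-2 share a common time course; voxels 3-5 are pure noise.
rng = np.random.default_rng(5)
base = rng.normal(0.0, 1.0, 200)
ts = rng.normal(0.0, 1.0, (200, 6)) * 0.3
ts[:, :3] += base[:, None]
clusters = correlation_clusters(ts)
```

On these data the three voxels sharing the common signal fall into one cluster, while the independent voxels remain singletons.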
1.3 Statistical analysis
After the fMRI data have been collected and pre-processed, statistical analysis helps the
researcher to answer the most important questions: does the data support the research
hypothesis? What is the relationship between reported behavioural states and the observed data?
Where in the brain is the effect of interest observed? The methodology of answering these
questions has been an active field of research since the beginning of the 20th century, when the
pioneering work of Ronald Fisher, Karl Pearson, William Gosset and others developed the
framework for analyzing experimental results. This framework has been adopted by all
experimental natural sciences, including neuroscience.
1.3.1 General Linear Model
In the fMRI community, the first complete framework for statistical analysis was developed by
Worsley and Friston (2005). This framework is univariate; that is, the observed signal is assumed
to be independent and identically distributed across all spatial locations. This framework is based
on multiple regression analysis commonly called the General Linear Model (GLM), which we
will describe in Section 3.4. The common use of GLM is to produce a spatial map, where the
significance of each voxel’s expression of the effect of interest is evaluated with a t test.
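The core of this mass-univariate procedure can be sketched in a few lines: each voxel's time series y is regressed on a design matrix X, and a t statistic is computed for the contrast of interest. The code below is a simplified illustration (ordinary least squares only; real GLM packages also model serial correlation in the noise):

```python
import numpy as np

def glm_tstat(y, X, contrast):
    """Ordinary least-squares GLM for one voxel; t statistic for a contrast."""
    beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = X.shape[0] - np.linalg.matrix_rank(X)
    sigma2 = resid @ resid / dof                      # noise variance estimate
    c = np.asarray(contrast, dtype=float)
    var_c = sigma2 * c @ np.linalg.pinv(X.T @ X) @ c  # variance of the contrast
    return (c @ beta) / np.sqrt(var_c)

# Toy example: a boxcar task regressor plus an intercept, one "active" voxel.
rng = np.random.default_rng(0)
task = np.tile([1.0] * 10 + [0.0] * 10, 5)            # 100 time points
X = np.column_stack([task, np.ones_like(task)])
y = 2.0 * task + rng.normal(0.0, 1.0, task.size)
t_active = glm_tstat(y, X, [1, 0])
```

Applying `glm_tstat` voxel by voxel yields exactly the kind of spatial t map described above.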
The univariate approach to fMRI data analysis assumes independence of the BOLD signal across
voxels, and, therefore, ignores all the interactions between the voxels. The advantage of this
approach is the simplicity of the model, with a relatively small number of degrees of freedom.
However, the brain is a connected system of neurons, organized into several functional networks
(Toro et al., 2008); therefore, the univariate model of the brain is hardly realistic. This suggests
an expansion of the GLM analysis where the interactions between brain areas are taken into
account. This type of analysis is multivariate: the fMRI volume is conceptualized as a vector in
multi-dimensional data space, where each voxel corresponds to a dimension. A subtype of
multivariate analysis is multivariate classification, where the volumes are classified into groups
based on the cognitive state of the subject at the time of the volume's acquisition. Multivariate
classification is discussed below; some other approaches to multivariate analysis are briefly
discussed in Section 7.2.4.
1.3.2 Multivariate classification
An alternative to univariate GLM is to use classification algorithms for decoding the brain state.
An important example of this approach is classification of fMRI volumes according to the type
of the behavioural task that the subject was performing when the volume was acquired. If the
subjects come from different groups, we can also try to classify the subject’s fMRI data
according to the group to which the subject belongs. There are many algorithms developed for
pattern classification (see, for example, Bishop, 2006), and several have been successfully
applied to fMRI data (see e.g. Schmah et al., 2010, Misaki et al., 2010; Ku et al., 2008; Mitchell
et al., 2004). Many of these algorithms are multivariate1, that is, they assume that the brain state
information is encoded in the interactions between the voxels.
The advantages of multivariate classification over univariate analysis are described in the review
article by Haynes & Rees (2006). Multivariate analysis is a natural choice when the brain state is
encoded by the neuronal activity of a group of brain areas, rather than a single area (this is called
distributed representation; see Haxby et al., 2001). In any isolated voxel, the difference in
BOLD signal across the brain states can be too small and/or inconsistent for successful decoding
of a brain state. However, the brain state can be decoded when the BOLD signal measured for an
ensemble of voxels is analyzed together. For example, in a multivariate classification study by
Kamitani and Tong (2005), the subject was presented with simple visual stimuli: bars rotated by
a specific angle. In the visual cortex, this angle is encoded in orientation-selective cortical
columns, which are much smaller in size than a typical fMRI voxel (such as 3×3×3 mm voxels
used in the study). Individual voxels in that study showed poor selectivity to stimulus
orientation; however, the output of a linear combination of visual-cortex voxels contained
enough information for successful decoding of the stimulus orientation. This study, along with
several others (e.g., Haxby et al., 2001; Mitchell et al., 2004;
see also reviews of Norman et al., 2006, and Haynes & Rees, 2006) has popularized multivariate
classification as a method of fMRI data analysis. However, it should be noted that multivariate
methods had been applied to fMRI and PET data analysis well before the mid-2000s
"multivariate" boom (for example, see Lautrup et al., 1994; Friston et al., 1995; Morch et
al., 1997; Strother et al., 1997).

1 An important exception is Gaussian Naïve Bayes (GNB) classifiers, described below.
Algorithms of classification can be grouped into two categories: probabilistic and non-
probabilistic. Classifiers from the first group assume a probabilistic model for the data. Model
parameters are estimated during the training of the classifier. For example, the probabilistic
classifiers discussed in this thesis use a multivariate Gaussian model for each class ("brain state")
of the data. Classification of a volume is done by assigning the volume to the most probable
class. On the other hand, non-probabilistic classifiers do not use a probabilistic model. A simple
example of a non-probabilistic classifier is nearest-neighbour classification (see e.g. Schmah et
al., 2010): a volume is assigned to the same class as its nearest neighbour. Another important
example is Support Vector Machine (SVM) method (see e.g. LaConte et al., 2005; Mourao-
Miranda et al., 2005; Cox & Savoy, 2003; this method was also used by Kamitani and Tong in
the study described above). SVM methodology constructs the surface that separates the classes
by maximizing the margin between the two classes while simultaneously minimizing the
misclassification rate (see Section 3.6 for details); this process does not involve estimation of the
parameters of the probabilistic model. In general, non-probabilistic approaches view construction
of probabilistic models as a more general problem that can be bypassed when solving a more
specific problem of classification (Ng & Jordan, 2002). However, the "extra step" of
probabilistic model estimation can be useful in the context of a neurobiological study. For
example, the estimated Gaussian model captures a substantial share of the information about the
connectivity between brain areas (Hlinka et al., 2011).
In Chapter 3, we describe several methods of probabilistic classification, all of which use a
multivariate Gaussian distribution to model the fMRI data. Each class is sampled from a Gaussian
distribution defined by a mean vector and a covariance matrix. The most restrictive model,
Gaussian Naïve Bayes classifier, assumes that the covariance matrices are diagonal; in essence,
this is a univariate model of classification. Linear and quadratic discriminants are methods that
do not assume diagonality of covariance matrices. Of these two, linear discriminant (LD) is the
more constrained method because it assumes that the population covariance matrix is the same
for all classes; quadratic discriminant (QD) makes no such assumption. Both linear and quadratic
discriminants can classify the data just as accurately as SVMs, if the probabilistic model is
carefully regularized to prevent overfitting. To our knowledge, quadratic discriminant has not
been used previously to classify fMRI data. We demonstrate that in some situations it is the most
accurate among the classifiers we have tested (see Sections 7.2.1 and 8.2).
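The decision rules of these Gaussian classifiers are straightforward to sketch: each class is modeled as a Gaussian, and a volume is assigned to the class with the highest log-likelihood. The code below is illustrative only; LD pools one covariance matrix across classes, QD fits one per class, and in practice the covariance estimates would be PCA-regularized as discussed later in the thesis:

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gaussians(X_by_class, pooled=False):
    """Fit a mean (and covariance) per class; pooled=True gives the linear discriminant."""
    means = [X.mean(axis=0) for X in X_by_class]
    if pooled:
        centered = np.vstack([X - m for X, m in zip(X_by_class, means)])
        covs = [np.cov(centered.T)] * len(X_by_class)   # one shared covariance
    else:
        covs = [np.cov(X.T) for X in X_by_class]        # quadratic discriminant
    return means, covs

def classify(x, means, covs):
    """Assign x to the class with the highest Gaussian log-likelihood."""
    scores = [multivariate_normal.logpdf(x, m, S) for m, S in zip(means, covs)]
    return int(np.argmax(scores))

# Toy example: two well-separated 2-D classes, linear-discriminant rule.
rng = np.random.default_rng(1)
X0 = rng.normal(0.0, 1.0, (200, 2))
X1 = rng.normal(3.0, 1.0, (200, 2))
means, covs = fit_gaussians([X0, X1], pooled=True)
label = classify([2.5, 2.8], means, covs)
```

With pooled covariances the decision boundary is linear in x; allowing class-specific covariances (QD) makes it quadratic.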
If we are able to classify the out-of-sample fMRI volumes accurately, we can conclude that there
is a difference in the brain’s response to different behavioural tasks. The next question is how to
identify the spatial locations where this difference is manifest. Kjems et al. (2002) have proposed
a method of constructing spatial maps for a given classification algorithm, where each voxel is
weighted according to its contribution to classification. They have also derived the general
equation of spatial maps for probabilistic canonical variate analysis, which is an extension of
linear discriminant to three or more classes. Rasmussen and colleagues (2011) have further
developed it for SVMs, as well as for kernel logistic regression and kernel linear discriminant.
We propose a modification to Kjems’s method (discussed at the end of Section 3.1) and show
how to construct spatial maps for quadratic discriminant and for the univariate classifier known as
Gaussian Naive Bayes (GNB).
1.4 Selecting the method of analysis
The current thesis examines the question of selecting the method of classification for fMRI data.
We use a framework proposed by Strother and colleagues (2002; 2010). Previously, this
framework was applied to evaluate pre-processing techniques (Strother et al., 2004; LaConte et
al., 2005; Churchill et al., 2012A; Churchill et al., 2012B). The framework tests the algorithm’s
ability to (a) accurately decode the task, and to (b) construct a reproducible spatial map.
Comparative evaluation of classifiers has been carried out on real fMRI data in a series of studies
(e.g., Misaki et al., 2010; Ku et al., 2008; Mitchell et al., 2004; see also our paper, Schmah et al.,
2010). However, these studies have evaluated only the accuracy of out-of-sample classification,
without assessing the quality of spatial maps. Reproducibility of spatial maps is a metric that is
complementary to classification accuracy (LaConte et al., 2003); both of these metrics should be
taken into account when evaluating a classifier. An algorithm that accurately predicts the brain
state but is unable to create a reproducible spatial map is of limited use to the researcher: it
indicates that the brain states are indeed different, but is not able to say which areas of the brain
are reliably implicated in this difference. Also, spatial maps are useful for identifying task-
coupled artifacts. Consider the case when a subject is performing a motor task and a baseline
(rest) task. The subject’s motion could be stronger during the motor task than during the
baseline, and (if the motion is not regressed out carefully) a classifier can capitalize on this
difference to accurately predict which task the subject was performing. Inspection of spatial
maps in this case will reveal that the edges of the brain (i.e. the voxels where the motion is the
strongest) are the biggest contributors to this classification. Here, the classification is accurate
not because of the difference in neuronal response to the task, but rather in some interacting
external factor.
1.5 Data sets for evaluation of algorithms
We have evaluated a pool of classifiers on several fMRI data sets, simulated as well as real.
Simulated environments have several attractive features: it is possible to create a large number of
artificial “subjects”, and we can carefully explore the influence of different aspects of simulated
fMRI signal on the classifiers’ performance. The key advantage is the knowledge of “ground
truth” (that is, we always know the location of the active signal in a given volume). This
knowledge can be utilized in order to perform ROC (receiver operating characteristic) analysis.
Partial area under the ROC curve serves as the third metric of performance, together with
classification accuracy and reproducibility of spatial maps.
We have used a simulation framework proposed by Lukic et al. (2002) and further developed in
our paper (Yourganov et al., 2011). The data are generated to model a block-design fMRI study
with two conditions, “active” and “baseline”. The task-related signal is absent from the
“baseline” volumes; in the “active” volumes, it is distributed across a spatial network of “active
areas”. We model a wide range of situations by changing three parameters: the mean magnitude
of task-related signal, its temporal variance, and the correlation across the active areas. This
allows us to identify the situations when certain algorithms are better performers, and situations
when all classifiers perform equally well.
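A minimal version of such a two-condition simulation might look like the sketch below. This is our own simplified illustration, inspired by but not reproducing the framework of Lukic et al. (2002); the parameter names are ours:

```python
import numpy as np

def simulate_volumes(n_per_class=50, n_voxels=100, active_idx=(3, 17, 42),
                     mean_signal=1.0, signal_sd=0.3, signal_corr=0.8, seed=0):
    """Generate 'baseline' and 'active' volumes; active voxels share a correlated signal."""
    rng = np.random.default_rng(seed)
    baseline = rng.normal(0.0, 1.0, (n_per_class, n_voxels))
    active = rng.normal(0.0, 1.0, (n_per_class, n_voxels))
    # Correlated task signal across the active areas: common + private components.
    common = rng.normal(0.0, 1.0, (n_per_class, 1))
    private = rng.normal(0.0, 1.0, (n_per_class, len(active_idx)))
    signal = mean_signal + signal_sd * (np.sqrt(signal_corr) * common
                                        + np.sqrt(1.0 - signal_corr) * private)
    active[:, list(active_idx)] += signal
    return baseline, active
```

The three parameters discussed above map directly onto `mean_signal` (mean magnitude), `signal_sd` (temporal variance), and `signal_corr` (correlation across active areas), so sweeping them reproduces the range of simulated conditions in spirit.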
The real data for the evaluation come from two studies: a longitudinal study of recovery from a
motor stroke, and a study of cognitive aspects of aging. The first study was conducted on
nine patients recovering from stroke, who were scanned in four separate fMRI sessions
spanning half a year after the stroke (Small et al., 2002). At each session, the subjects’ BOLD
activity was measured during performance of simple motor tasks; in addition, a series of
behavioural tests was performed outside of the scanner. Pooling across the four sessions
gives us an opportunity to evaluate the classifiers on a large amount of data within a subject. In
addition to this evaluation, we can also pose a question: does the recovery from a stroke,
measured with behavioural tests, correspond to a change in cortical activity, measured by fMRI?
The second study involved subjects from two age groups: young and old (Grady et al., 2010).
Each subject performed a series of tasks of varying difficulty (the difficulty has been matched
across the age groups so the behavioural accuracies of the young and the old subjects are
approximately equal). The pool of classifiers was applied to classify the fMRI volumes within a
subject to each of the tasks. After that, we performed a group-level analysis, where we classified
the within-subject spatial maps into two age groups. This classification helped identify the task-
related brain areas recruited differently by the young and the old subjects.
1.6 Structure of the thesis
The initial goal of the work presented in the thesis was to evaluate linear and quadratic
discriminants on simulated as well as real fMRI data sets, and to compare them with univariate
methods and SVM. LD and QD use multivariate Gaussian distributions to model the data; this
distribution is perhaps the best-studied multivariate distribution in statistics. In addition, Hlinka
et al. (2011) have shown that the multivariate Gaussian model provides a reasonably good
approximation of fMRI data: it captures about 95% of mutual information of the connectivity
between brain regions. The theoretical properties of linear discriminant have also been well-
researched; see, for example, books by Mardia et al. (1979) and by Seber (2004). Quadratic
discriminant is somewhat more obscure; to our knowledge, we are the first to apply it to
classification of fMRI data (Schmah et al., 2010). These methods are perhaps better suited for
fMRI analysis than univariate methods, because the brain areas are known to be highly
correlated (Toro et al., 2008, among others), and, presumably, better described by a multivariate
model. The MATLAB code for our implementations of linear and quadratic discriminant, as well
as for the evaluation framework, will be available from Dr. Strother's lab website.
The structure of this thesis is as follows. Chapter 2 gives a detailed description of the framework
of evaluation: the metrics of performance and the split-half resampling framework. Also, it
describes the data sets used for evaluation (simulated and real). Chapter 3 describes the
methodology of probabilistic classification of fMRI data. The three algorithms based on
Gaussian distribution are described: multivariate linear and quadratic discriminants, and
univariate GNB. Then we give a brief description of regularization of covariance matrices (a
necessary step in both linear and quadratic discriminants). We focus on one approach to
regularization, where the covariance matrix is approximated by a subset of its principal
components.
Chapter 4 is dedicated to the problem of estimating the number of components that give an
efficient approximation to the covariance matrix. This number is referred to as intrinsic
dimensionality of the data: the number of dimensions required to capture the signal of interest.
The problem of its estimation is rather difficult, and we give a survey of methods developed to
address this problem. In Chapter 5, these methods are tested on the simulated data (where the
structure of an underlying active spatial network is known). We conclude that intrinsic
dimensionality is best estimated with cross-validation methods, rather than with information
theory. We also show how the estimated dimensionality in real data is related to signal-to-noise
ratio. Chapter 6 presents a further description of intrinsic dimensionality in two real data sets. In
the first data set, we demonstrate that intrinsic dimensionality is linked to self-control ability in
healthy individuals. In the second set, we show how dimensionality (as well as other measures of
complexity of fMRI signal) reflects the process of cortical re-organization that accompanies
stroke recovery.
Chapters 7 and 8 contain an overview of the evaluation of a pool of classifiers. First, they are tested
on simulated data (Chapter 7): we describe how the performance of classifiers is influenced by
changes in magnitude, variance and connectivity of the task-related signal. Then, we test the
algorithms on two real fMRI data sets (Chapter 8): the stroke recovery set, and the set from an
aging study. We describe the spatial maps created by the classifiers: their within-subject and
across-subject reproducibility, and the correlation of spatial maps across different classifier
algorithms. We also classify the individual maps according to the age group of the participant, and
show how to use this classification to identify cortical areas that are affected by age. Finally,
Chapter 9 is the conclusion of this thesis with discussion of several future research directions.
Chapter 2 Evaluating algorithms for fMRI data analysis
2.1 Metrics of performance
The performance of an algorithm can be measured in multiple ways. We have selected three
performance metrics. The first one, area under a Receiver Operating Characteristics (ROC)
curve, has been widely used in the medical and machine learning literature as a measure of
accuracy and susceptibility to errors. This metric uses error rates that are measured using
knowledge of the "ground truth". We apply this metric only to simulated data. The other two
metrics, predictive accuracy and reproducibility of spatial maps, were proposed by Strother and
colleagues (2002; see also Kjems et al., 2002) as they can be obtained from the same resampling
procedure.
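Schematically, split-half resampling yields both metrics from the same split: train a classifier on each half, predict the held-out half to obtain accuracy, and correlate the two halves' spatial maps to obtain reproducibility. The sketch below uses hypothetical helper names (`fit_and_map`, `predict`) and a deliberately trivial mean-difference classifier; any classifier that produces a spatial map can be substituted:

```python
import numpy as np

def split_half_metrics(X, y, fit_and_map, predict, seed=0):
    """One split-half step: (out-of-sample accuracy, spatial-map reproducibility)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    h1, h2 = idx[: len(y) // 2], idx[len(y) // 2 :]
    model1, map1 = fit_and_map(X[h1], y[h1])
    model2, map2 = fit_and_map(X[h2], y[h2])
    acc = 0.5 * (np.mean(predict(model1, X[h2]) == y[h2])
                 + np.mean(predict(model2, X[h1]) == y[h1]))
    return acc, np.corrcoef(map1, map2)[0, 1]

# Demo "classifier": class means as the model, their difference as the spatial map.
def fit_and_map(Xtr, ytr):
    m0, m1 = Xtr[ytr == 0].mean(axis=0), Xtr[ytr == 1].mean(axis=0)
    return (m0, m1), m1 - m0

def predict(model, Xte):
    m0, m1 = model
    return (((Xte - m1) ** 2).sum(axis=1) < ((Xte - m0) ** 2).sum(axis=1)).astype(int)

rng = np.random.default_rng(3)
X = rng.normal(0.0, 1.0, (200, 20))
X[100:, :10] += 2.0                      # "active" signal in the first 10 voxels
y = np.array([0] * 100 + [1] * 100)
acc, rep = split_half_metrics(X, y, fit_and_map, predict, seed=4)
```

In practice the split is repeated many times and the metrics are averaged, but one split already shows how accuracy and reproducibility arise from the same resampling step.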
2.1.1 Receiver Operating Characteristics (ROC) methodology
The ROC methodology is used to measure the quality of signal detection. It was first applied to
evaluate detection of objects with radar, and has been used in psychophysics as a way to describe
how well humans can detect stimuli (see Metz, Herman, & Shen, 1998). It has become a standard
tool in medical diagnostics and imaging, after the work of Lusted (1968) and Swets (1988). This
methodology, generally speaking, summarizes how well a detection algorithm detects a signal
when it is present in the data, and reports its absence when it is not. To make this evaluation, we need to know the "ground truth", i.e.
whether the signal is really present in the data or not. It was developed for binary detectors
(which detect the presence or absence of a signal), but it can be extended to a detector that rates
the magnitude of the signal using a discrete scale (Metz, 1986). Here we will consider the binary
case only.
A binary detector analyzes the input data and decides whether the signal is absent (such a
decision is called a "negative") or present (a "positive"). We can utilize the knowledge of
"ground truth" and say whether the detector was right or wrong. Therefore, a "positive" is a "true
positive" if the detector was right and the signal was in the data, or a "false positive" otherwise.
Analogously, we can talk about a "true negative" and a "false negative". In the case of detecting
task-related activation in fMRI data, the detecting algorithm produces a spatial map where each
voxel value indicates the magnitude of task-related effect in the corresponding spatial location.
To decide whether a voxel is active or inactive, we see whether the voxel value passes a
predefined threshold or not. There are two kinds of errors a detection algorithm can make given a
specific threshold. Type I error happens when the voxel known to be inactive surpasses the
threshold of activation and is therefore a false positive. Type II error is, conversely, an
occurrence of a false negative. ROC methodology evaluates the detector in terms of the
frequency of Type I and Type II errors. It computes the "false positive frequency" (FPF) and the
"true positive frequency" (TPF) for all possible thresholds. The plot of TPF versus FPF is known
as the ROC curve.
To compute this frequency, we need two types of data sets, which are labeled H0 and H1. Sets of
H0 type ("null" sets) contain data where the effect of interest is not present, and sets of H1 type
("alternative" sets) contain data where the effect is present. If the voxel from H1 that is known to
be active passes the threshold, it is a true positive; if it is below the threshold, it is a false
negative. All voxels in H0 that pass the threshold are false positives. We need many example sets
of both H1 and H0 type for robust estimation of the error frequencies.
A ROC curve is constructed for a single voxel that is known to be active in H1. We move the
threshold from most conservative (when no voxels pass the threshold) to most liberal (when all
voxels pass it). For each threshold, we look at this voxel in all the H1 sets, and count the number
of occurrences when the voxel value passes the threshold. This gives us the number of true
positives. Then we count the number of H0 sets where this voxel has a value that passes the
threshold; this gives us the number of false positives. Dividing by the total number of H1 and H0
sets gives us estimates of true positive frequency and false positive frequency, respectively.
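For a single voxel, this counting procedure can be sketched as follows (an illustrative Python fragment, assuming the voxel's values across the H1 and H0 sets have been collected into arrays; this is not the evaluation code used in this thesis):

```python
import numpy as np

def empirical_roc(h1_values, h0_values):
    """Empirical (FPF, TPF) pairs for one voxel, given its values
    across the H1 ("alternative") and H0 ("null") sets."""
    # Sweep the threshold from most conservative to most liberal.
    thresholds = np.sort(np.concatenate([h1_values, h0_values]))[::-1]
    tpf = np.array([(h1_values > t).mean() for t in thresholds])
    fpf = np.array([(h0_values > t).mean() for t in thresholds])
    return fpf, tpf
```

Plotting `tpf` against `fpf` traces the empirical ROC curve from (0, 0) to (1, 1).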
We can also define TPF and FPF in continuous, rather than discrete, terms; this would
correspond to an ideal situation with infinitely many examples of H0 and H1 sets. For the
threshold t, we can define FPF and TPF in terms of continuous probability distributions of the
voxel value x:
FPF(t) = ∫_t^∞ p0(x) dx
TPF(t) = ∫_t^∞ p1(x) dx    (2.1)
Here, p0(x) and p1(x) are probability distributions of voxel values in the H0 and H1 sets,
respectively. We use the LABROC software (Metz et al., 1998) to generate smooth ROC curves
from a set of discrete (FPF, TPF) pairs.
Figure 2.1. Examples of different distributions of H1 (solid green) and H0 (dashed red),
and the corresponding ROC curves. A: perfect detector; the ROC curve is a step function.
B: chance performance; the distributions overlap completely and the ROC curve is the identity line. C: the
usual situation, distributions overlap somewhat and the area under the ROC curve is
between 0.5 and 1. D: degenerate case, the H1 distribution is completely contained inside H0
distribution; the ROC curve shows the characteristic "hook" at the bottom.
The shape of the ROC curve depends on the amount of overlap between p0(x) and p1(x). Figure
2.1 shows some characteristic examples of ROC curves. In the best case, the detector is so good
that the distributions p0(x) and p1(x) do not overlap. If we set our threshold above all the values
of p0(x) but below all the values of p1(x), FPF will be zero and TPF will be one. If the threshold
is higher than that, TPF is less than one but FPF is still zero. For a smaller threshold value, FPF
is greater than zero but TPF is one. The corresponding ROC curve is a step function, where TPF
instantaneously rises from 0 to 1 at FPF = 0. If the separation between p0(x) and p1(x) is not ideal
and there is some overlap, the ROC curve rises smoothly rather than instantaneously, because
there exist thresholds for which both FPF and TPF are less than 1. In the worst case, p0(x) and
p1(x) completely overlap, so for any possible threshold TPF and FPF are equal and the ROC
curve is an identity line.
There is one more case that we should consider: when p0(x) is wider than p1(x) so p1(x) is
completely contained within p0(x). In this case there is no threshold for which TPF>FPF. ROC
curves in this case have a characteristic "hook" at the bottom, which corresponds to thresholds
that give FPF>TPF. Pan and Metz (1997) call such curves "degenerate" because they indicate
performance which is, for a certain range of thresholds, worse than chance. This kind of curve
indicates that the signal and noise have very different distributions. Examples of such curves
were reported in the fMRI literature; see Figure 2 in Constable et al. (1995), and Figures 4 and 5
in Lange et al. (1999).
To compare the performance of several detectors, we can inspect their ROC curves. Several
papers (for example, Constable et al., 1995; Lange et al., 1999; Skudlarski et al., 1999; Lukic et
al., 2002; Beckmann & Smith, 2004) have used this method to evaluate the performance of
algorithms for fMRI data analysis. As a quantitative metric of performance, we can use the area
under the curve. This area corresponds to the probability that the detector will assign a higher
value to a voxel randomly chosen from H1 than to a voxel that is randomly chosen from H0. It is
proportional to a Mann-Whitney U statistic, which is used in a non-parametric test (Conover,
1999) to determine whether two samples (in our case, the voxel values from H1 and H0) come
from the same distribution (Mason & Graham, 2002). We can also use the partial, rather than the
full, area under a ROC curve, for example the area for FPF between 0 and 0.1; this is equivalent
to setting the critical significance level α to 0.1 (Skudlarski et al., 1999). We use this metric to
evaluate signal detection of a series of classifiers (see Chapter 7). Also, we use it to evaluate
different methods of intrinsic dimensionality estimation for discriminant analysis (see Chapter
5).
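The equivalence between the area under the curve and this pairwise probability can be illustrated directly (a Python sketch; the function name is ours, and ties are counted as 1/2, as in the Mann-Whitney U statistic):

```python
import numpy as np

def auc_pairwise(h1_values, h0_values):
    """Area under the ROC curve, computed as the probability that a
    randomly chosen H1 value exceeds a randomly chosen H0 value.
    Equal to the Mann-Whitney U statistic divided by n1 * n0."""
    h1 = np.asarray(h1_values, dtype=float)[:, None]
    h0 = np.asarray(h0_values, dtype=float)[None, :]
    # Ties contribute 1/2, as in the U statistic.
    return (h1 > h0).mean() + 0.5 * (h1 == h0).mean()
```

A perfect detector (no overlap between H1 and H0 values) gives an area of 1; complete overlap gives 0.5.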
The knowledge of "ground truth" is essential to ROC methodology. Therefore, simulated data
are commonly used to construct ROC curves. There are two common approaches. The first is to
generate H0 and H1 sets using a Gaussian distribution (Lukic et al., 2002; Beckmann & Smith,
2004; Yourganov et al., 2011). The second approach uses real resting-state fMRI data for H0
sets, and adds artificial activation signal at specific locations in resting-state data to generate H1
sets (Constable et al., 1995; Lange et al., 1999; Skudlarski et al., 1999; Beckmann & Smith,
2004). With the first approach, we can easily generate a large number of sets from both H0 and
H1, and test our detectors in a variety of simple, easily-controlled situations, which, however,
might be a poor approximation to real fMRI data. The second approach uses realistic H0 data, but
the H1 sets are a mixture of very complex real fMRI "noise" and typically simplistic artificial
"signal" so again the extent to which the results reflect real data performance is largely unknown.
Also, it has been proposed (Garret et al., 2012) that the variability of BOLD signal in resting
state is different from the variability observed when the subject is performing a task; therefore, it
seems questionable to use resting-state "noise" to generate H1 sets.
It should also be added that a number of studies (Nandy & Cordes, 2004; Le & Hu, 1997) have
proposed methods of applying ROC methodology to real fMRI data. The drawback of these
methods is their reliance on assumptions that are impossible to test. For example, it is assumed that the
regions found using a t test with a high degree of confidence are the "true" activation regions and
contain no false positives. However, there is a way to apply ROC analysis to probabilistic
classification of real data: the "ground truth" information can be provided by the class labels. A
real data set, preferably with reasonably large separation between the classes, can serve as the H1
set; the H0 set can be obtained from it by permuting the class labels. We use a probabilistic classifier
(such as linear or quadratic discriminant) to compute the probability of each voxel belonging to
the class indicated by its class label. After this, we apply the typical ROC analysis: we vary the
threshold and compute the frequency of true and false positives for each threshold (a volume
from the H1 set gives a true positive when the corresponding probability is higher than the
threshold; for the H0 set, the class labels are random, therefore the volume is a false positive
when the corresponding probability surpasses the threshold). The set of (FPF, TPF) pairs is
visualized as an ROC curve, and the area under the curve can serve as a performance metric.
This type of ROC analysis was not carried out in the work described in this thesis; it presents an
interesting possibility for future research.
2.1.2 Reproducibility; NPAIRS framework
Like all experimental sciences, neuroscience aims to discover reproducible results. A
measure of reproducibility is particularly useful for fMRI studies, because the data are often
corrupted by noise of strong magnitude and complicated structure. Reproducibility of spatial
maps evaluates the stability of spatial locations where the neural effect of interest is expressed.
For the unthresholded maps, reproducibility is monotonically related to global signal-to-noise
ratio (LaConte et al., 2003; Yourganov et al., 2011).
Several reproducibility metrics were proposed in the literature for both PET and fMRI
experiments. For example, a paper by Grabowski and colleagues (1996) obtained repeated
measurements on a relatively large number of subjects (eighteen). Subjects were randomly
separated into two cohorts, and a set of univariate algorithms was applied to analyze each cohort.
Voxels that passed the threshold of activation were grouped into clusters, and neuroanatomical
interpretation was given to each active cluster. The researchers studied whether the active
regions obtained on one cohort would also be found in (a) a different cohort, (b) the same cohort
but different session, and (c) different methods of univariate analysis. The frequency of obtaining
the same active region was used as a per-region measure of reproducibility.
The problem with such an approach is its reliance on neuroanatomical interpretation, which can
be subjective, ambiguous and variable across subjects. An fMRI study by Rombouts et al. (1998)
proposed a different measure of reproducibility between two repeated sessions: the proportion of
voxels that are found to be active in both sessions. Unlike the metric proposed by Grabowski et
al., this measure is computed on voxels, not on the anatomical regions; however, both measures
are very sensitive to the choice of activation threshold. Maximum reproducibility across two
sessions (averaged across subjects) was 0.75 for Bonferroni-corrected critical level of α=0.05,
but with the increase of threshold the number of activated voxels dropped and the proportion of
overlapping voxels dropped accordingly. To complicate the issue, maximum reproducibility
could be even higher (0.78) when different thresholds could be applied to the two sessions.
Maitra (2010) gives an overview of various metrics of similarity of thresholded maps, and
advocates the use of Jaccard's overlap metric. For two thresholded maps, Jaccard overlap is
determined by the ratio of the number of voxels in the intersection of the two maps to the number
of voxels in their union. Maitra argues that this metric is more intuitive than Dice overlap (used
by Rombouts et al., 1998), to which it is monotonically related.
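In code form (an illustrative Python sketch; the function name and the convention for empty maps are ours):

```python
import numpy as np

def jaccard_overlap(map_a, map_b, threshold):
    """Jaccard overlap of two thresholded maps: the number of voxels
    active in both, divided by the number active in either."""
    a = np.asarray(map_a) > threshold
    b = np.asarray(map_b) > threshold
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0  # neither map has active voxels
    return np.logical_and(a, b).sum() / union
```

Note that the result depends directly on `threshold`, which is the sensitivity discussed above.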
A PET study by Strother et al. (1997) proposed a metric of reproducibility that is voxel-based
and at the same time threshold-independent. Given a pair of unthresholded spatial maps,
reproducibility is defined as Pearson's product-moment correlation coefficient r. The two maps
are computed using split-half resampling scheme: the data are split into two independent sets of
approximately equal size, and both sets are analyzed independently. Later fMRI studies (e.g.,
Tegeler et al., 1999; Raemaekers et al., 2007) used Pearson’s correlation to estimate
reproducibility of maps that came from two repetitions of the experiment on the same subject.
In a follow-up paper by Strother et al. (2002), split-half resampling became a basis of the
NPAIRS (Nonparametric Prediction, Activation, Influence and Reproducibility reSampling), a
data-driven pseudo-ROC framework for evaluation of the analysis chain. In NPAIRS,
reproducibility is the median correlation of two split-half maps (the median is taken across the
splits). The splitting procedure is repeated as many times as possible to stabilize the estimation of
reproducibility. The advantage of splitting the data into two halves (rather than a larger number
of subsets as in k-fold cross-validation) is that split-half resampling maximizes the amount of
data used to compute the spatial maps; it has been shown to stabilize variable estimates in
analyses of high-dimensional, ill-posed data sets (Meinshausen & Buhlmann, 2010). When
resampling strategies for evaluation are used, the half-splits must be mutually independent. The
presence of correlation between samples leads to over-estimation of performance metrics, such
as reproducibility. In group studies, this independence can be achieved by using different
subjects for each split (but keeping the number of volumes in each split approximately equal). In
within-subject splits, the half-splits can be composed from different experimental runs. If there is
only one run for each subject, the split must be done so that the temporal separation between the
volumes in two splits is as large as possible, preferably more than 20 seconds (the timecourse of
the hemodynamic response; see Glover, 1998). This may often be achieved in block designs by
making sure that images from the same scanning block end up in the same half-split.
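The core of the reproducibility computation is simple (a minimal Python sketch, assuming each half-split analysis has already produced an unthresholded spatial map as a vector of voxel values):

```python
import numpy as np

def reproducibility(maps_a, maps_b):
    """NPAIRS-style reproducibility: median Pearson correlation between
    the two spatial maps of each split-half pair.
    maps_a, maps_b: arrays of shape (n_splits, n_voxels)."""
    rs = [np.corrcoef(a, b)[0, 1] for a, b in zip(maps_a, maps_b)]
    return np.median(rs)
```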
Examining the scatter plot of the two spatial maps can provide some insights about
reproducibility of our results. Given that the two maps are matched in their scale and variance,
the scatter plot of perfectly matching spatial maps (r=1) is a line of identity. In the worst case of
zero reproducibility (r=0), the plot is a roughly circular cloud (see Figure 2.2 B). For
intermediate values of r, the cloud is elongated along the line of identity (see Figure 2.2 A). The
amount of deviation from this line can serve as an indicator of non-reproducible effects, such as
random noise. Our scatter plot can be analyzed using two axes, the "major" axis along the line of
identity and the "minor" axis orthogonal to it. The major axis contains a mixture of signal and
noise, and the minor axis contains only the noise that is uncorrelated with the signal. If the two
Figure 2.2. Scatter plots of 2 simulated spatial maps, corresponding to reproducibility of
r=0.5 (A) and r=0 (B). Major and minor axes are displayed for each scatter plot. The major
axis contains signal and noise, and variance along this axis is 1+r, while the minor axis
contains only noise, with variance 1-r.
maps are normalized to have unit variance, the variance along the major and minor axes is
e1 = 1 + r and e2 = 1 − r, respectively (Strother et al., 2002). If the variance of the noise along the
major and the minor axes is equal, we can estimate the signal variance as e1 − e2, and define a
measure of global signal-to-noise as a ratio of the signal variance to the noise variance
(Yourganov et al., 2011):
gSNR = (e1 − e2) / e2 = 2r / (1 − r)    (2.2)
This measure indicates the strength of reproducible signal that is contained in the spatial map
relative to non-reproducible noise.
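Equation 2.2 in code form (trivial, but it makes the behaviour explicit: gSNR is zero at r = 0 and grows without bound as r approaches 1):

```python
def gsnr(r):
    """Global signal-to-noise ratio implied by map reproducibility r
    (equation 2.2): signal variance (e1 - e2) over noise variance e2,
    with e1 = 1 + r and e2 = 1 - r."""
    e1, e2 = 1.0 + r, 1.0 - r
    return (e1 - e2) / e2   # = 2r / (1 - r)
```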
2.1.3 Accuracy of Classification
Accuracy in predicting brain states is a natural metric to evaluate the fMRI data classifier. In the
NPAIRS framework, this metric is computed simultaneously with reproducibility of spatial
maps. Taken together, these two independent and complementary metrics can serve as an
alternative to evaluation with ROC methodology (LaConte et al., 2003).
There is a subtle difference between the "prediction accuracy" and "classification accuracy"
metrics. The task of a classifier is to assign a label to a data point (i.e. fMRI volume).
Probabilistic classifiers do it by computing the Bayesian posterior probability of the data point
belonging to class 1, class 2, etc., and assign it to the class that corresponds to the highest
probability. Non-probabilistic classifiers, notably Support Vector Machines (Vapnik, 1995),
assign the data point to a class without computing these probabilities explicitly. By
“classification accuracy”, we will refer to the accuracy of class
assignments (i.e. the proportion of correct assignments). For probabilistic classifiers, we can also
compute the “prediction accuracy” which is the posterior probability of the data point belonging
to the correct class. Consider, for example, applying a probabilistic classifier to a data point in a
two-class problem. The probabilities of this data point belonging to class 1 and class 2 have been
estimated as 0.6 and 0.4, respectively. If class 1 is indeed the correct class, the classification
accuracy is 1 (because 0.6>0.4, class participation has been estimated correctly) and prediction
accuracy is 0.6. The original NPAIRS framework (Strother et al., 2002) uses prediction
accuracy; however, we use classification accuracy instead, in order to be able to evaluate both
probabilistic and non-probabilistic classifiers.
To compute unbiased and robust estimates of classification/prediction accuracy, cross-validation
and resampling methods are normally used (see Efron & Tibshirani, 1993, particularly Chapter
17). The data set is split into a “training set” and a “test set”. The training set is used to train the
classifier, i.e. to estimate the parameters in the model that is used for classification. Then the data
in the test set are classified according to that model. Training and test sets should be as close to
independent as possible, so the estimate of classification/prediction accuracy computed on the
test set data is realistic (otherwise it will be biased upwards). We perform many such splits and
use the mean (or median) accuracy as our metric of classifier’s performance. In the NPAIRS
framework, we use one half of the data to train the classifier, and the other half as a test set; then,
we reverse it and use the second half for training and the first half for testing.
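The distinction between the two metrics can be made concrete (a Python sketch; the array layout is our assumption):

```python
import numpy as np

def classification_and_prediction_accuracy(posteriors, labels):
    """posteriors: (n_points, n_classes) posterior class probabilities;
    labels: index of the correct class for each point.
    Returns (classification accuracy, prediction accuracy)."""
    posteriors = np.asarray(posteriors)
    labels = np.asarray(labels)
    # Classification accuracy: proportion of correct class assignments.
    classification = (posteriors.argmax(axis=1) == labels).mean()
    # Prediction accuracy: mean posterior probability of the correct class.
    prediction = posteriors[np.arange(len(labels)), labels].mean()
    return classification, prediction
```

For the example above (posteriors 0.6 and 0.4, class 1 correct), this returns a classification accuracy of 1 and a prediction accuracy of 0.6.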
2.1.4 Prediction-Reproducibility Plots
The use of prediction/classification accuracy as a performance metric has a long history in
statistics and machine learning (see Efron & Tibshirani, 1993; also Demsar, 2006). The novel
contribution of the NPAIRS framework was to combine this metric with the measure of
reproducibility of spatial maps. Excellent prediction accuracy does not guarantee that the
algorithm also produces reproducible spatial maps. This can be demonstrated on an example
when the signal that differs between the two classes is localized in a large and strongly correlated
cortical network, and a classifier randomly picks one voxel from that network. When a spatial
map is constructed according to each voxel's contribution to classification (see Section 3.2), the
voxels in the network will have low weights in the spatial map, with the exception of the one
voxel used in classification. Since the location of that voxel is random, the reproducibility of the
maps will be poor. Also, a highly reproducible map does not imply accurate classification. As a
somewhat pathological example, consider an algorithm that assigns constant values to all voxels.
Spatial maps are therefore perfectly reproducible, but have no predictive value whatsoever. In
general, we can view prediction and reproducibility as two semi-independent and complementary
measures that serve as indicators of bias and variance of our analytic model.
NPAIRS uses Prediction-Reproducibility plots (P-R plots) as a way to display these
complementary metrics (see Strother et al., 2002; LaConte et al., 2003). We use a split-half
resampling framework to compute these metrics for different splits, and plot median value
(across splits) of prediction accuracy P versus median value of reproducibility R. In the ideal
case (with infinitely large contrast-to-noise ratio and gSNR), the analytic model produces
perfectly reproducible maps and is able to predict mental states perfectly, corresponding to the
point (P=1, R=1) on the plot. We can judge the quality of our model by its proximity to this
point. This is analogous to the ROC methodology, where the ROC curve for a perfect detector
passes through the point (FPF=0, TPF=1).
P-R plots are particularly useful when our analytical model contains a hyperparameter. A
hyperparameter is a parameter that cannot be estimated automatically from the training data
alone; it is typically estimated by optimizing the model’s performance on the independent test set
(Lemm et al., 2010). NPAIRS provides a resampling framework for such estimation, and we can
tune the hyperparameter by optimizing prediction accuracy and/or map reproducibility
(Rasmussen et al., 2012B). The evaluation of our model for a particular value of the
hyperparameter gives us a point on the P-R plot, and by changing this parameter we can observe
a P-R trajectory. An important example of a hyperparameter is the number of principal
components that is used to regularize the linear discriminant. In Figure 2.3, we show two P-R
trajectories for this hyperparameter using the simulated data sets described in the upcoming
Section 2.2.1. In one case, simulated data
contained a correlated active network (black line with circles); in the other case, the active areas
were not correlated (grey line with squares). The size of the marker indicates the number of
principal components: smallest markers correspond to using just the first principal component in
our analysis, and largest markers indicate that the first 30 components were used. We can see
that the P-R trajectory is different in the presence and absence of underlying network
connections. If it is present, our prediction accuracy does not achieve the level that we can have
in the absence of the network (using a sufficiently large number of components). On the other
hand, spatial maps are much more reproducible if the network is present (and a small number of
components is used).
Figure 2.3. Prediction-Reproducibility (P-R) plots in the presence and absence of a
simulated cortical network. The black line shows the P-R trajectory when the activation
loci are linearly coupled into a single covarying network. The grey line shows the trajectory
when the activation loci are independent. The size of the markers corresponds to the
number of principal components used in the analysis (varied from 1 to 40). We show the
average trajectory for 100 simulated data sets described in Section 2.2.1.
2.2 Data sets
Our proposed framework of evaluation uses both simulated and real data to evaluate the
performance of an algorithm. When we generate artificial fMRI data, we have control over the
structure of signal and noise in the data, and therefore we can study the behaviour of our
algorithm in a variety of controlled situations. It is possible to create artificial data using
complicated methods such as neural mass models (Deco et al., 2008), but we have chosen to use
simple Gaussian simulations with many fewer parameters. This allowed us to perform a thorough
examination of how a certain parameter influences the performance of a classifier. Another
advantage of simple simulations is an opportunity to generate very large data sets and therefore
decrease the variance of performance metrics in our evaluations.
However, simulated data sets are not enough for evaluation. Real data have to be included as
well to provide us with a reality check, although comparisons between real and simulated data
are difficult due to lack of control over the real data and simplicity of the simulated data. We
have used real data from two studies: a longitudinal study of stroke recovery, and a study of
aging that involved a large number of people from different age groups performing cognitive
tasks of varying difficulty. The stroke recovery study, because of its longitudinal nature, provides
a large number of fMRI volumes per subject, which is advantageous for complicated machine
learning algorithms that require a large amount of training data (Schmah et al., 2010). The aging
study allows us to evaluate our analytical methods on different age groups and on different
cognitive tasks of incremental behavioural difficulty that use the same visual stimuli; in addition,
it allows us to compare our findings with previously published results (Grady et al., 2010; Garret
et al., 2012) that have been obtained with different analytical methods.
2.2.1 Simulated data
We have used computer-generated data to simulate a block-design experiment with two
conditions: activation and baseline. Data have been generated using the algorithm described by
Lukic and colleagues (2002), with some modifications, and results are reported in (Yourganov et
al., 2011). All images contain the same simplified single-slice "brain-like" background structure
with additive Gaussian noise. An elliptical background structure contained in a 60×60 pixel
image consists of “grey matter” in the center and on the rim of the phantom, with “white matter”
in between. The amplitude of the background signal in the “grey matter” is 4 times higher than in
the “white matter”; informally, this reflects the fact that the grey
matter in the human brain consumes 4 times more energy than the white matter (Logothetis and
Wandell, 2004). Parameters of the phantom have been deduced from a PET study, and are also
representative of spatially smoothed fMRI data (Lukic et al., 2002). Gaussian noise is spatially
smoothed by convolving the image with a Gaussian filter that has a full-width-at-half-maximum
(FWHM) 2 of 2 pixels. After smoothing, the standard deviation of the noise is 5% of the
background signal. Images in the “activation” condition contain 16 Gaussian-shaped signal
“blobs” distributed over the image (12 in the “grey matter” and 4 in the “white matter”) and
added to the smoothed noisy background image. Figure 2.4 shows examples of baseline and
activation images (noise is not displayed; although activation signal could be negative as well as
positive, we only show positive signal for the sake of clarity). The FWHM of the activation blobs
vary between 2 and 4 pixels. Simulated experimental data sets are composed of N baseline and N
activation images per set (N = 100), so the total number of observations in a set is 2N=200. We
use a mask with J=2072 pixels covering the "brain" to exclude locations outside of the phantom
from analysis.
Images are arranged into 10 “epochs” of 20 images each to simulate a block design with epochs
of 10 “baseline” images followed by 10 “activation” images. To simulate the hemodynamic
response, each pixel’s time course is convolved with a hemodynamic response function (HRF)
defined by the sum of two Gamma functions (Glover, 1999). Parameters of the HRF model have
been taken from Worsley (2001): a1 = 6, a2 = 12, b1 = b2 = 0.9 seconds, c = 0.35, TR (time to
acquire the full brain volume) = 2 seconds.
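With these parameters, the HRF can be written as the difference of two gamma-shaped functions (a Python sketch following Glover's (1999) form as given by Worsley (2001); the exact expression used to generate the simulations may differ in normalization):

```python
import numpy as np

A1, A2 = 6.0, 12.0   # shape parameters
B1, B2 = 0.9, 0.9    # time scales, seconds
C = 0.35             # relative size of the undershoot

def hrf(t):
    """Double-gamma hemodynamic response, t in seconds."""
    d1, d2 = A1 * B1, A2 * B2   # peak times of the two components
    peak = (t / d1) ** A1 * np.exp(-(t - d1) / B1)
    undershoot = (t / d2) ** A2 * np.exp(-(t - d2) / B2)
    return peak - C * undershoot

# Each pixel's time course would then be convolved with the HRF
# sampled at TR = 2 s, e.g.:
#   t = np.arange(0.0, 30.0, 2.0)
#   convolved = np.convolve(timecourse, hrf(t))[:len(timecourse)]
```

The response peaks at roughly 5 seconds after stimulus onset and shows the characteristic post-stimulus undershoot.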
Figure 2.4. The phantom in baseline (left) and activation (right) states. Noise is not
displayed.
2 FWHM defines the spread of the filter. For a Gaussian filter, it is proportional to the standard deviation σ: FWHM = 2σ√(2 ln 2) ≈ 2.355 σ.
Amplitudes of the Gaussian activation signal blobs are sampled from a multivariate Gaussian
distribution. The mean amplitude of each activation is specified proportionally to the local value
of the background signal:
E[ak] = Mbk, (2.3)
where ak is the amplitude of kth activation, E[ak] is its expected value, bk is the value of noise-
free baseline image at the center of the kth area, and M is the proportionality constant. To study
the effect of M on performance of the algorithms, M has been set to different levels (0, 0.01,
0.02, 0.03 and 0.05) in different realizations of our simulated experiment. These levels of M
correspond to contrast-to-noise ratios (CNRs) of 0, 0.2, 0.4, 0.6 and 1.0. HRF convolution
changes these values to empirical measurements of 0, 0.3, 0.6, 1.0, and 1.6, respectively3.
The variance of the amplitude of the Gaussian activation signal in our multivariate Gaussian
distribution, denoted by σk², is defined proportionally to the variance of the independent
background Gaussian noise added to each voxel, vk²:
σk² = V vk²,    (2.4)
where the proportionality constant V has been varied from 0.1 to 1.6 in different realizations of
the experiment. In this dissertation, we refer to V as the relative signal variance, which may be
thought of as a form of physiological variation of the activation signal. The third parameter of
our multivariate Gaussian model is the correlation coefficient, ρ, which defines the covariance
between Gaussian activation signal amplitudes at the kth and lth locations (k ≠ l):
cov(ak, al) = ρ σk σl.    (2.5)
The value of ρ has been set to 0, 0.5 and 0.99 to define a simple distributed spatial network
(Lukic et al., 2002). This value is the same for all regions in the network.
3 Empirical measures of CNR are computed as follows. For a given active locus, we compute the difference in mean
signal of the active and baseline images (discarding the first 2 volumes in each block) in the H1 set, and divide it by the standard deviation of the timecourse at this locus in the corresponding H0 set. This is repeated for all 16 loci, and the result is averaged across the loci and then across the 100 sets.
The amplitudes of the multivariate Gaussian signal in the “active” state are defined by the three
parameters: CNR (or M), V and ρ. These values are the same for all volumes in a simulated
experimental set, but may differ across sets. All CNR values are those measured empirically after
convolution by the HRF. This Gaussian signal simulation incorporates the three ideas of (1) a
mean signal level, (2) physiological variation of signal levels about their mean across successive
scans, and (3) a partition of this signal variation between physiological noise and network
variation defined by the chosen value of the correlation coefficient coupling between distributed
Gaussian blobs.
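The activation-amplitude model above can be sketched in a few lines of NumPy. This is a minimal illustration, not the simulation code used in the thesis: the number of volumes, the mean amplitude, the background-noise standard deviations, and the values of V and ρ are placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

n_loci, n_vols = 16, 100      # 16 active loci; the volume count is arbitrary here
V, rho = 0.6, 0.5             # relative signal variance and coupling (placeholders)
v = np.ones(n_loci)           # background-noise s.d. at each locus (assumed 1)

# Covariance of the activation amplitudes: sigma_k^2 = V * v_k^2 on the
# diagonal (Eq. 2.4), rho * sigma_k * sigma_l off the diagonal (Eq. 2.5).
sigma = np.sqrt(V) * v
cov = rho * np.outer(sigma, sigma)
np.fill_diagonal(cov, sigma ** 2)

mean = np.full(n_loci, 1.0)   # mean amplitude (placeholder for M)
amplitudes = rng.multivariate_normal(mean, cov, size=n_vols)

# Empirical check: off-diagonal correlations should be close to rho.
emp_corr = np.corrcoef(amplitudes.T)
print(round(float(emp_corr[0, 1]), 2))
```

Convolving these amplitudes with an HRF and adding independent voxel noise would complete the simulated time courses.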
To compute ROC curves, we have created 100 examples of H0 and H1 type sets for each setting
of (M, V, ρ). Each of the H1 sets consists of N activation images and N baseline images. Each of
the H0 sets consists of baseline images only. We have computed spatial maps for each set using
the algorithm under evaluation. At each of the 16 activation loci, we have built the ROC curve
using the voxel values at this particular locus. True positive frequency has been computed using
100 maps from the H1 sets, and false positive frequency has been computed from 100 H0 sets.
We have used LABROC1 software (Metz et al., 1998) to generate smooth ROC curves from
discrete values of (TPF, FPF).
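The single-locus ROC construction can be illustrated with a small sketch. The thesis used LABROC1 to fit smooth curves; here the discrete (FPF, TPF) pairs are computed directly, and the map values at one locus are simulated rather than taken from real H0/H1 sets.

```python
import numpy as np

def empirical_roc(h1_vals, h0_vals):
    """TPF/FPF pairs for one locus, sweeping a threshold over the
    pooled map values from the H1 (signal) and H0 (noise-only) sets."""
    thresholds = np.sort(np.concatenate([h1_vals, h0_vals]))[::-1]
    tpf = np.array([(h1_vals >= t).mean() for t in thresholds])
    fpf = np.array([(h0_vals >= t).mean() for t in thresholds])
    return fpf, tpf

rng = np.random.default_rng(1)
h1 = rng.normal(1.0, 1.0, 100)   # map values at an active locus, 100 H1 sets
h0 = rng.normal(0.0, 1.0, 100)   # same locus, 100 H0 (baseline-only) sets
fpf, tpf = empirical_roc(h1, h0)

# Trapezoidal area under the discrete (FPF, TPF) curve.
auc = float(np.sum(np.diff(fpf) * (tpf[1:] + tpf[:-1]) / 2))
print(round(auc, 2))
```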
2.2.2 Stroke recovery study
We also analyzed data collected by Small et al. (2002), in a longitudinal block-design study of
stroke recovery, in which 9 stroke patients were scanned in 4 different sessions at 1, 2, 3 and 6
months after the stroke. During each session, subjects were instructed to perform a motor task
(finger tapping alternating with wrist flexion) alternating with blocks of rest. Two runs were
recorded for subjects performing the task with their healthy hand, and two with the hand
impaired by the stroke. Whole-brain fMRI data were acquired on a 1.5 T MRI scanner; 24
horizontal 6-mm-thick slices were obtained. Scanning parameters were: volume acquisition time
(TR) = 4 seconds, echo time (TE) = 35 milliseconds, flip angle = 60º. Out of 24 slices, we
selected a subset of 7 slices that corresponded to parts of the brain involved in finger and wrist
motion (4 slices containing the cerebellum and 3 slices containing the motor areas). The number
of voxels selected for analysis was 10,499.
The data for all 9 subjects were spatially aligned to each other and corrected for motion. The
volumes within each motor-task block were divided by the average volume of the last two scans
of the preceding rest block, for intensity normalization and to filter out low temporal frequencies
(McIntosh & Lobaugh, 2004). Rest blocks were discarded after that. By pooling the data across
runs and sessions, we obtained 1280 volumes per individual subject; this large number of
volumes per subject allowed us to carry out within-subject analysis easily. The data set is used to
examine the utility of dimensionality estimates and related covariance measures as biomarkers of
stroke recovery (Yourganov et al., 2010; see also Chapter 6). An analysis of the performance of
various within-subject classifiers is also described in detail in Schmah et al. (2010) and
summarized in Chapter 8.
2.2.3 Aging study
We analyzed another set of real fMRI data that was collected in a study by Grady and colleagues
(2010). To study the impact of aging on cognitive abilities, a number of subjects from three age
groups (“young”, 20-31 years, 19 subjects; “middle aged", 56-65 years, 14 subjects; “old”, 66-85
years, 14 subjects)4 were scanned during performance of five behavioural tasks. The tasks were:
1. fixation to a dot presented in the middle of the screen ( "FIX");
2. reaction task: detection of a visual stimulus and reporting its position on the screen ("RT");
3. perceptual matching, where the participant had to match the "target" sample presented in the
upper portion of the screen to one of the three stimuli presented in the lower portion ("PM");
4. delayed matching test of working memory, where the target stimulus was presented and then
removed from the screen, followed by a 2.5-second blank-screen delay. After this, three stimuli
were presented and the participant had to match them to the target ("DM").
During each experimental run, the fixation condition was presented in eight 20-second blocks. The
other four conditions were presented in 2 blocks for each condition, each block lasting
approximately 40 seconds (the duration varied slightly because the stimuli were generated at the
time of scanning runs). Four scanning runs were acquired on each participant, with 300 volumes
in each run. Scanning was done on a 3T scanner with the following parameters: TR = 2 seconds,
4 In our group-level analysis, we have pooled the “middle aged” and the “old” groups together to form one group of
older subjects.
TE = 30 milliseconds, flip angle = 70º. Functional images of the whole brain were acquired in 28
axial slices, 5-mm thick.
Preprocessing of the data in this study was somewhat more extensive than in the stroke study; it
is described in detail by Grady et al. (2010). First, the transformation aligning the functional
images to a common atlas was computed. Then the images underwent slice-timing correction
(with the AFNI package; Cox, 1996) and motion correction (with the AIR package; Woods et al.,
1998). This was applied to the original images,
which were afterwards transformed into a common anatomical space. Then the images were
smoothed with a Gaussian kernel (FWHM = 7 mm), and artifact-carrying components were
removed by using Independent Component Analysis (with the MELODIC package; Beckmann
& Smith, 2004). Using a standard white matter mask, mean white-matter signal was obtained by
averaging white-matter voxels; the mean signal was then regressed from the time course of each
voxel. The same was done for the mean CSF signal. Finally, linear trends were removed.
Chapter 3 Probabilistic classification of fMRI data
3.1 General considerations
The problem of predicting a mental state for an fMRI volume is a problem of classification: a
specific volume is classified into one of the groups that represent various mental states.
Typically, a mental state is associated with performing a certain behavioural task. For example,
in the stroke recovery study described in Section 2.2.2, mental states are associated with left
finger tapping, right finger tapping, left and right wrist flexion, and rest. During training, a
classifier learns the association between training volumes and the corresponding mental states.
Afterwards, the classifier applies this association to the test data in order to assign each
test volume to one of the mental states.
A researcher who wants to perform classification on fMRI data faces the problem of selecting a
classification algorithm from a large group of methods created by the statistical and machine
learning community. Some of these methods are probabilistic: they construct a probabilistic
model for each class, and compute the probability of the fMRI volume belonging to each class,
so the volume is assigned to the most probable class. Another group of classifiers is non-
probabilistic: the mapping of fMRI volumes to class labels is done without constructing
probabilistic models. A popular example of non-probabilistic classifiers is Support Vector
Machines (SVMs; see Vapnik, 1995).
Classification algorithms can be univariate or multivariate: univariate classifiers assume that
fMRI signal is independent across voxels, and multivariate classifiers take interactions between
voxels into account. The brain is a network of interacting cortical areas, so the assumption of
voxel independence does not hold for fMRI data (also, artifacts in the data could be another
source of dependency between voxels); nevertheless, univariate classifiers usually have fewer
degrees of freedom than multivariate classifiers, and therefore require a smaller amount of
training data. Another distinction is between linear and nonlinear classifiers: linear classifiers
separate the two classes with a linear hyperplane in feature space, and nonlinear classifiers use a
nonlinear surface.
Generative classifiers operate in a Bayesian framework. Consider the classification problem
where the data vector x needs to be assigned to a class. The class memberships are unique, and
the number of classes is Nclass. Bayes rule can be adapted to the classification problem as follows
(Bishop, 2006):
\[ P(\mathrm{class}=c \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid \mathrm{class}=c)\,P(\mathrm{class}=c)}{P(\mathbf{x})} , \tag{3.1} \]
where,
P(class = c | x) is the probability that a given data vector x belongs to class c. It is also
called posterior probability of class c. Data vector x is assigned to the class with the
largest posterior probability.
P(x | class = c) is the probability of x, if we know that it belongs to a class c. Generative
classifiers assume a probabilistic model for each class, so we can compute this
probability easily. This probability is called the likelihood of x for a class c.
P(class = c) is the probability that an arbitrary vector, without considering its numerical
value, belongs to class c; in other terms, it is the likelihood of occurrence of vectors
belonging to class c. For example, if all class memberships are equally likely, this
probability is equal to 1/ Nclass for all classes. This is called the prior probability of class
c, because it is not influenced by x and can be computed before obtaining x. In contrast,
the posterior probability of class c is computed (using Bayes rule) after obtaining x.
P(x) is the probability of x, i.e. the likelihood of observing this data vector. It is also
called the marginal probability of x. It can be written as a sum of likelihoods of x for all
our classes:
\[ P(\mathbf{x}) = \sum_{k=1}^{N_{\mathrm{class}}} P(\mathbf{x} \mid \mathrm{class}=k)\,P(\mathrm{class}=k) . \tag{3.2} \]
For simplicity, consider a problem of two-class classification (Nclass = 2). If the two classes are
equally likely to occur, their prior probabilities are equal5, and a vector x is assigned to class 1 or
class 2 depending on which likelihood is greater, P(x | class = 1) or P(x | class = 2).
Equivalently, the class membership of x is given by the sign of the decision function, defined as
\[ D(\mathbf{x}) = \log \frac{P(\mathbf{x} \mid \mathrm{class}=1)}{P(\mathbf{x} \mid \mathrm{class}=2)} . \tag{3.3} \]
If this function is positive, x is assigned to class 1; if it is negative, to class 2. If this function is
zero, the assignment cannot be made because the membership in either class is equally likely.
The set of points where D(x) = 0 is called the decision boundary, which can be seen as a hyper-
surface that separates the two classes.
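The decision rule of Equations 3.1–3.3 can be sketched generically; the one-dimensional Gaussian likelihoods and their parameters below are illustrative assumptions, not models from the thesis.

```python
import numpy as np
from math import sqrt, pi, exp

def posterior(x, likelihoods, priors):
    """Posterior class probabilities via Bayes' rule (Eqs. 3.1-3.2).
    `likelihoods` holds one function p(x | class = c) per class."""
    joint = np.array([lik(x) * pr for lik, pr in zip(likelihoods, priors)])
    return joint / joint.sum()          # dividing by the marginal P(x)

def decision(x, lik1, lik2):
    """Two-class decision function of Eq. 3.3: the log-likelihood ratio."""
    return np.log(lik1(x) / lik2(x))

# Illustrative 1-D Gaussian classes (toy parameters).
gauss = lambda mu, s: (lambda x: exp(-(x - mu) ** 2 / (2 * s * s)) / (s * sqrt(2 * pi)))
lik1, lik2 = gauss(-1.0, 1.0), gauss(1.0, 1.0)

print(decision(0.0, lik1, lik2))        # the midpoint lies on the decision boundary
post = posterior(-2.0, [lik1, lik2], [0.5, 0.5])
print(post.argmax())                    # x = -2 is assigned to the first class
```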
Here, we will describe three probabilistic classifiers that use a Gaussian model to compute
likelihood functions. However, the assumptions behind the Gaussian model are different:
1. Quadratic discriminant is the most general method. Each class is modeled with a
separate multivariate Gaussian distribution with a specific mean vector and covariance
matrix.
2. Linear discriminant is a constrained version of the above: the covariance matrix is
assumed to be the same for all classes. The only difference between classes is in their
mean vectors.
3. Gaussian Naive Bayes classifier is still more constrained: the covariance matrix is
assumed to be diagonal. This is equivalent to using a univariate Gaussian distribution for
each dimension of the data. If we make an additional assumption that the covariance
matrix is the same for all classes, the decision function becomes linear; if this assumption
is not made, it is a nonlinear function. We will call these two variants linear and nonlinear
Gaussian Naïve Bayes, respectively.
5 If we do not want to assume that the class memberships have equal prior probabilities, we can estimate the prior
probability of a class using training data: as a ratio of the training vectors belonging to the class to the total number of training vectors.
3.2 Constructing spatial maps for classifiers
A spatial map, computed for a given classifier, indicates the relative importance of different
spatial locations in classification of fMRI volumes. We propose to construct spatial maps by
taking a voxel-wise derivative of the decision function. This is similar to the technique of
“sensitivity maps” proposed by Kjems and colleagues (2002). The value of the ith voxel of the
spatial map is computed as
\[ y_i = \frac{1}{N} \sum_{j=1}^{N} \frac{\partial}{\partial x_i} D(\mathbf{x}^{(j)}) , \tag{3.4} \]
where x(j) is the jth volume, and N is the number of volumes. Essentially, we take the decision
function for each volume, compute its partial derivative at a specific voxel location, and average
it across all volumes. Therefore, yi indicates the average impact of the ith voxel on the decision
function, and reflects the importance of this voxel in classification.
In the original method proposed by Kjems et al., the maps are made by taking the square of
voxel-wise derivative of P(class = c | x), posterior probability of class membership. This
approach has two disadvantages. The first disadvantage is numerical instability; Formula 3.3 can
be rewritten as
\[ P(\mathrm{class}=1 \mid \mathbf{x}) = \frac{1}{1+\exp(-D(\mathbf{x}))} = \sigma(D(\mathbf{x})) , \tag{3.5} \]
where σ denotes the sigmoid function: σ(z) = 1 / (1 + exp(-z)). Thus
\[ \frac{\partial}{\partial x_i} P(\mathrm{class}=1 \mid \mathbf{x}) = \sigma'(D(\mathbf{x}))\,\frac{\partial}{\partial x_i} D(\mathbf{x}) \tag{3.6} \]
where σ´(z) = exp(-z) / (1 + exp(-z))2 = σ(z) (1 - σ(z)). The term σ´(D(x)) is close to zero
whenever x is far from the decision boundary. This may be considered a desirable property
theoretically, but in our experiments, only a small number of volumes in the training set were
close enough to the decision boundary for this term to be numerically nonzero, leading to a non-
robust dependence on a small number of training volumes. Therefore, we advocate using the
derivative of decision function D(x) instead of P(class = c | x) when making a sensitivity map
(Yourganov et al., 2010).
The second disadvantage is loss of sign information due to squaring of ∂P(class = c | x)/∂x_i. The
sign of ∂P(class = c | x)/∂x_i encodes the class preference of the ith voxel: it indicates whether the
signal in that voxel should be increased or decreased in order to increase P(class = c | x)
(Rasmussen et al., 2012A). This also applies when we take the derivative of D(x) instead of
P(class = c | x); for a two-class problem, we can say that a positive value of ∂D(x)/∂x_i
corresponds to a preference of the ith voxel for class 1, and a negative value to a
preference for class 2. If the sign information is preserved (as in Formula 3.4), sensitivity maps
can be interpreted analogously to statistical parametric maps (Worsley, 2001), where the sign of
the voxel indicates whether the contrast is expressed positively or negatively in that voxel.
In this thesis, we construct the spatial maps using Formula 3.4; this corrects the drawback of
our earlier paper (Yourganov et al., 2010), where we used the square of ∂D(x)/∂x_i. For a
multi-class problem, sensitivity maps can also be constructed according to a method proposed in
a recent paper by Rasmussen et al. (2012A). This method computes one sensitivity map for each
observation by taking the voxel-wise partial derivative of log P(class = c | x), and then groups
the maps into homogeneous clusters, producing one map per cluster (the number of clusters is
determined by optimizing generalization error in a cross-validation framework).
Let us consider the case when the derivative of the decision function D(x) with respect to x can
be expressed analytically as a function of x: d(x) = ∂D(x)/∂x. Then the sensitivity map can be
expressed as a function of the derivative:
\[ \mathbf{y} = \frac{1}{N_{\mathrm{train}}} \sum_{k=1}^{N_{\mathrm{train}}} \mathbf{d}(\mathbf{x}^{(k)}) . \tag{3.7} \]
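When d(x) is not available in closed form, the average in Formula 3.4 can still be approximated numerically. The sketch below is an illustration, not the thesis implementation: it uses central finite differences and checks the result on a decision function whose gradient is known.

```python
import numpy as np

def sensitivity_map(D, volumes, eps=1e-5):
    """Eq. 3.4: average across volumes of the voxel-wise partial
    derivatives of the decision function D, taken here by central
    finite differences."""
    n_vols, n_vox = volumes.shape
    y = np.zeros(n_vox)
    for x in volumes:
        for i in range(n_vox):
            step = np.zeros(n_vox)
            step[i] = eps
            y[i] += (D(x + step) - D(x - step)) / (2 * eps)
    return y / n_vols

# Sanity check on a decision function with a known gradient:
# D(x) = w.x has constant derivative w, so the map should recover w.
w = np.array([1.0, -2.0, 0.5])
vols = np.random.default_rng(2).normal(size=(10, 3))
map_est = sensitivity_map(lambda x: w @ x, vols)
print(np.allclose(map_est, w))
```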
3.3 Quadratic discriminant
The quadratic discriminant (QD) method was first proposed by Smith (1947) to classify data that
come from multivariate Gaussian distributions with class-specific means and
covariance matrices. Cooper (1963) developed this method further and advocated its use for a
wide range of multivariate distributions. The description of this method and comparison to other
classification methods can be found in books by Seber (2004) and by Hastie et al. (2009). We
have demonstrated the efficacy of QD in classifying fMRI volumes (Schmah et al., 2010). Each
class is modeled with a multivariate Gaussian distribution with mean vector μc and covariance
matrix Σc:
\[ P(\mathbf{x} \mid \mathrm{class}=c) = (2\pi)^{-K/2}\,|\boldsymbol{\Sigma}_c|^{-1/2} \exp\!\Big(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_c)^T \boldsymbol{\Sigma}_c^{-1} (\mathbf{x}-\boldsymbol{\mu}_c)\Big) . \tag{3.8} \]
Here, K is the number of dimensions in our data. In real-world situations, the class-specific
population parameters μc and Σc are unknown. Instead, we can use unbiased estimates of mean
and covariance, which are computed using the training set:
\[ \mathbf{m}_c = \frac{1}{N_c} \sum_{j=1}^{N_c} \mathbf{x}^{(j)}, \qquad \mathbf{S}_c = \frac{1}{N_c - 1} \sum_{j=1}^{N_c} (\mathbf{x}^{(j)}-\mathbf{m}_c)(\mathbf{x}^{(j)}-\mathbf{m}_c)^T . \tag{3.9} \]
Here, Nc is the number of training examples in class c. When computing likelihood functions, we
substitute the sample mean mc and sample covariance matrix Sc into Equation 3.8 in place of the
population mean μc and population covariance matrix Σc, respectively.
For two equally likely classes, the decision function is
\[ D_{QD}(\mathbf{x}) = \frac{1}{2}\log\frac{|\mathbf{S}_2|}{|\mathbf{S}_1|} - \frac{1}{2}(\mathbf{x}-\mathbf{m}_1)^T \mathbf{S}_1^{-1} (\mathbf{x}-\mathbf{m}_1) + \frac{1}{2}(\mathbf{x}-\mathbf{m}_2)^T \mathbf{S}_2^{-1} (\mathbf{x}-\mathbf{m}_2) . \tag{3.10} \]
We can see that this function is quadratic in x. The decision boundary is a quadric surface in K
dimensions.
Let us now derive the formula for the sensitivity map. First, we can compute the derivative of the
decision function DQD (x). We can re-arrange the terms in Formula 3.10 and express DQD (x) as a
sum of a quadratic term, a linear term and a constant term:
\[ D_{QD}(\mathbf{x}) = -\frac{1}{2}\mathbf{x}^T(\mathbf{S}_1^{-1}-\mathbf{S}_2^{-1})\mathbf{x} + (\mathbf{m}_1^T \mathbf{S}_1^{-1} - \mathbf{m}_2^T \mathbf{S}_2^{-1})\mathbf{x} + \mathrm{const} . \tag{3.11} \]
The derivatives of the quadratic and linear terms can be computed easily since we know that, for
any symmetric matrix A, we can write (see e.g. the appendix in Mardia et al., 1979)
\[ \frac{\partial}{\partial \mathbf{x}}\,\mathbf{x}^T \mathbf{A} \mathbf{x} = 2\mathbf{A}\mathbf{x}, \qquad \frac{\partial}{\partial \mathbf{x}}\,\mathbf{x}^T \mathbf{A} \mathbf{y} = \mathbf{A}\mathbf{y} . \tag{3.12} \]
Therefore, the derivative of DQD (x) is a vector
\[ \mathbf{d}_{QD}(\mathbf{x}) = -(\mathbf{S}_1^{-1}-\mathbf{S}_2^{-1})\mathbf{x} + (\mathbf{S}_1^{-1}\mathbf{m}_1 - \mathbf{S}_2^{-1}\mathbf{m}_2) = -\big(\mathbf{S}_1^{-1}(\mathbf{x}-\mathbf{m}_1) - \mathbf{S}_2^{-1}(\mathbf{x}-\mathbf{m}_2)\big) . \tag{3.13} \]
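Formulas 3.10 and 3.13 translate directly into code. The following sketch uses illustrative two-dimensional class parameters; a real application would substitute the (regularized) sample means and covariances.

```python
import numpy as np

def qd_decision(x, m1, S1, m2, S2):
    """Eq. 3.10: QD decision function for two equally likely classes."""
    d1, d2 = x - m1, x - m2
    return (0.5 * np.log(np.linalg.det(S2) / np.linalg.det(S1))
            - 0.5 * d1 @ np.linalg.solve(S1, d1)
            + 0.5 * d2 @ np.linalg.solve(S2, d2))

def qd_derivative(x, m1, S1, m2, S2):
    """Eq. 3.13: analytic derivative of the QD decision function."""
    return -(np.linalg.solve(S1, x - m1) - np.linalg.solve(S2, x - m2))

# Illustrative 2-D class parameters (toy values).
m1, m2 = np.array([1.0, 0.0]), np.array([-1.0, 0.0])
S1 = np.array([[1.0, 0.3], [0.3, 1.0]])
S2 = np.array([[2.0, 0.0], [0.0, 0.5]])
x = np.array([0.8, -0.2])
print(qd_decision(x, m1, S1, m2, S2) > 0)   # x lies closer to class 1
```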
3.4 Linear discriminant
When the classes are sampled from multivariate Gaussian populations that all share the same
covariance matrix, the Bayes-optimal method of classification is called linear discriminant (LD).
It can be seen as a special case of QD, but this method was developed before QD and is overall a
more popular method of classification. The algorithm was outlined in a 1936 paper by Fisher,
although Hotelling and Mahalanobis were developing similar ideas around the same time (see
Hodges, 1955, for a historical review). The earliest neural network prototype, Rosenblatt's
perceptron (1958), was mathematically equivalent to a linear discriminant, and therefore limited
to solving linearly separable classification problems, as shown by Minsky and Papert (1969).
In neuroimaging, LD has a long history as a method of analysis of fMRI and PET data (see e.g.
Friston et al., 1996; Tegeler et al., 1999; Strother et al., 1997, 2002). The treatment of the
mathematical theory behind LD is described in several books (e.g., Mardia et al., 1979; Seber,
2004). For binary classification, linear discriminant analysis is equivalent to canonical variate
analysis (Strother et al., 2002; Mardia et al., 1979).
LD assumes that data from all classes are sampled from multivariate Gaussian distributions,
where the mean vectors are specific to a class but the covariance matrix is the same across
classes. The likelihood function of a vector x which belongs to a class c is
\[ P(\mathbf{x} \mid \mathrm{class}=c) = (2\pi)^{-K/2}\,|\boldsymbol{\Sigma}|^{-1/2} \exp\!\Big(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_c)^T \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu}_c)\Big) . \tag{3.14} \]
This expression is identical to Equation 3.8, except that the common covariance matrix Σ is now
used in place of the class-specific covariance matrix Σc. This sharing of covariance matrix across
classes leads to several significant simplifications. The decision rule for two equally likely
classes is now a linear, rather than quadratic, function of x:
\[ D_{LD}(\mathbf{x}) = \Big(\mathbf{x} - \tfrac{1}{2}(\mathbf{m}_1+\mathbf{m}_2)\Big)^T \mathbf{S}^{-1} (\mathbf{m}_1-\mathbf{m}_2) . \tag{3.15} \]
Here, sample means mc are estimates of class means μc (see Formula 3.9). Matrix S is an
unbiased estimate of population covariance Σ, and is computed by pooling sample covariance
matrices Sc:
\[ \mathbf{S} = \frac{1}{N - N_{\mathrm{class}}} \sum_{c=1}^{N_{\mathrm{class}}} (N_c - 1)\,\mathbf{S}_c . \tag{3.16} \]
Here, Nc is the size of the cth class, N is the total number of data points, Nclass is the number of
classes, and class-specific sample covariance matrices Sc are computed using Formula 3.9.
Since the decision function is linear, the decision boundary (the locus where DLD (x) is zero) is a
hyperplane in K dimensions. The derivative of the decision function can be easily computed
applying Formula 3.12:
\[ \mathbf{d}_{LD} = \mathbf{S}^{-1}(\mathbf{m}_1-\mathbf{m}_2) . \tag{3.17} \]
The derivative does not depend on x, so spatial maps can be computed without averaging across
training volumes: the vector dLD itself serves as the spatial map.
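Putting Formulas 3.9 and 3.15–3.17 together, a minimal LD sketch might look as follows; the data are toy draws with illustrative dimensions and class means, not fMRI volumes.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy data: two classes sharing unit covariance, means at +1 and -1.
n1, n2, K = 40, 60, 5
X1 = rng.normal(0.0, 1.0, (n1, K)) + 1.0
X2 = rng.normal(0.0, 1.0, (n2, K)) - 1.0

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S1 = np.cov(X1, rowvar=False)        # unbiased class covariances (Eq. 3.9)
S2 = np.cov(X2, rowvar=False)

# Pooled covariance estimate (Eq. 3.16) with N_class = 2.
S = ((n1 - 1) * S1 + (n2 - 1) * S2) / (n1 + n2 - 2)

# Eq. 3.17: the LD derivative/spatial map does not depend on x.
d_ld = np.linalg.solve(S, m1 - m2)

def ld_decision(x):
    """Eq. 3.15: positive for class 1, negative for class 2."""
    return (x - 0.5 * (m1 + m2)) @ d_ld

acc = 0.5 * (np.mean([ld_decision(x) > 0 for x in X1])
             + np.mean([ld_decision(x) < 0 for x in X2]))
print(round(float(acc), 2))
```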
It is interesting that Fisher (1936) derived Formula 3.15 without using a probabilistic (i.e.,
multivariate Gaussian) model. Instead, he used an intuitive criterion of "good classification" that
maximizes differences between classes while minimizing differences within a class.
Mathematically, the problem is to find a vector d that maximizes the ratio
\[ \frac{\mathbf{d}^T \mathbf{B} \mathbf{d}}{\mathbf{d}^T \mathbf{S} \mathbf{d}} . \tag{3.18} \]
Here, B is the "between-class covariance matrix"; for a two-class problem, it is equal to
\[ \mathbf{B} = \frac{N_1 N_2}{N} (\mathbf{m}_1-\mathbf{m}_2)(\mathbf{m}_1-\mathbf{m}_2)^T , \tag{3.19} \]
which is a singular matrix of rank one. The vector d that maximizes the ratio 3.18 is the first
eigenvector of the matrix S^{-1}B, and it exactly corresponds to d_LD given by Formula 3.17. To
classify a test vector x, Fisher proposed to look at the sign of the product d^T x, and to assign x
to class 1 or 2 depending on whether d^T x is positive or negative. It should be mentioned that,
while Fisher had not assumed a Gaussian distribution for the data, he still used the assumption
that the covariance matrices were equal across classes. This assumption is called
homoscedasticity; conversely, the situation where covariance matrices differ is called
heteroscedasticity.
3.5 Univariate methods: Gaussian Naive Bayes classifier, General Linear Model
The Gaussian Naive Bayes (GNB) classifier is designed to handle situations where our data are
represented in a coordinate system with independent dimensions. We still make the assumption
that data are sampled from a multivariate Gaussian population. There are two versions of GNB,
linear and nonlinear; we will now describe the nonlinear scenario as the more general case. Here,
we don’t make the assumption of homoscedasticity (that is, the assumption that the population
covariance matrix is the same for all classes). Since the dimensions are independent, the
covariance matrices are diagonal. The parameters for each class are:
\[ \boldsymbol{\mu}_c = \begin{pmatrix} \mu_{c1} \\ \mu_{c2} \\ \vdots \\ \mu_{cK} \end{pmatrix} , \qquad \boldsymbol{\Sigma}_c = \mathrm{diag}\big(\sigma_{c1}^2, \sigma_{c2}^2, \dots, \sigma_{cK}^2\big) . \tag{3.20} \]
Substituting this diagonal form of Σc into the formula for the multivariate Gaussian distribution,
we get the expression for the likelihood function:
\[ P(\mathbf{x} \mid \mathrm{class}=c) = (2\pi)^{-K/2} \prod_{i=1}^{K} \sigma_{ci}^{-1} \exp\!\Big(-\frac{(x_i-\mu_{ci})^2}{2\sigma_{ci}^2}\Big) = \prod_{i=1}^{K} P(x_i \mid \mathrm{class}=c) . \tag{3.21} \]
Here, xi is the ith element of vector x, and it is sampled from a univariate Gaussian distribution
with mean μci and standard deviation σci.
The decision function for two equally likely classes is:
\[ D_{NGNB}(\mathbf{x}) = \sum_{i=1}^{K} \log\frac{s_{2i}}{s_{1i}} - \sum_{i=1}^{K} \frac{(x_i-m_{1i})^2}{2 s_{1i}^2} + \sum_{i=1}^{K} \frac{(x_i-m_{2i})^2}{2 s_{2i}^2} . \tag{3.22} \]
Here, mci is the ith element of the class-specific sample mean mc, and
\[ s_{ci} = \Big( \frac{1}{N_c - 1} \sum_{j=1}^{N_c} \big(x_i^{(j)} - m_{ci}\big)^2 \Big)^{1/2} \tag{3.23} \]
is the sample standard deviation along the ith dimension for class c.
Let us now consider the linear case, where all the classes are sampled from the distribution with
the same population covariance matrix. Therefore, the standard deviation along each dimension
is the same for all classes. We can estimate it using the mean of the standard deviations across
classes. The decision function simplifies to
\[ D_{LGNB}(\mathbf{x}) = \sum_{i=1}^{K} \frac{(x_i-m_{2i})^2 - (x_i-m_{1i})^2}{2 s_i^2} \tag{3.24} \]
and the ith element of its derivative is simply
\[ d_i = \frac{m_{1i}-m_{2i}}{s_i^2} . \tag{3.25} \]
Note that this expression is very similar to a two-sided t test (Freund, 1992) along the ith
dimension:
\[ t_i = \frac{m_{1i}-m_{2i}}{s_i \sqrt{2/N}} . \tag{3.26} \]
The t test has been widely used in neuroimaging data analysis since the early 1990s. Following the
work of Fox et al. (1988), Friston and colleagues (1991) have proposed t tests to measure task-
related activation in PET images. Later, they adopted this approach for fMRI data (Friston et al.,
1995A; Friston et al., 1995B; Worsley & Friston, 1995), by extending it and making it more
flexible. The extended approach is called the General Linear Model (GLM), or, more properly,
univariate GLM; see Kiebel & Holmes (2003) for a comprehensive introduction to GLM. Like
GNB, it is based on the assumption that the signal is independent across dimensions (voxels).
However, the analysis is more elaborate than performing the simple t test as in Formula 3.26.
The signal from each voxel is represented as a linear sum of predictors and an error term, so that
for the ith voxel of the jth volume we can write:
\[ x_i^{(j)} = \sum_{k} g_{jk}\,\beta_{ki} + e_{ij} . \tag{3.27} \]
Here, gjk is the value of the kth predictor at time j. The predictors are pre-defined before the
analysis, and the first part of the analysis is estimation of the predictor weights βki by linear
regression. A predictor is defined to encode a particular effect. For example, to encode task
effects, the predictor is defined so that gjk = 1 if the jth volume was acquired during that task, and gjk =
0 otherwise (to model the hemodynamic effect, the predictor is convolved with a hemodynamic
response function). It is also common to include predictors that model linear drifts (in the form
gjk = j), as well as low-order polynomial drifts.
Equation 3.27 can be written in vector form, where x_i is the time course of the ith voxel:
\[ \mathbf{x}_i = \sum_{k} \beta_{ki}\,\mathbf{g}_k + \mathbf{e}_i , \tag{3.28} \]
and also in matrix form, where X is the data matrix:
\[ \mathbf{X} = \mathbf{G}\boldsymbol{\beta} + \mathbf{e} . \tag{3.29} \]
The least-squares unbiased estimate of β is
\[ \hat{\boldsymbol{\beta}} = (\mathbf{G}^T\mathbf{G})^{-1} \mathbf{G}^T \mathbf{X} . \tag{3.30} \]
To study the differential response of a voxel to a set of tasks, we define a contrast vector c. We
select the predictors that we want to contrast with each other and set the corresponding elements
of c to 1 or -1. For example, to contrast the task-related response to a baseline, we can select
predictors that encode "task" and "baseline" conditions; the "task" predictor is set to 1, and the
baseline predictor to -1. The remaining predictors that we do not want to include into our
contrast have ck set to zero. Then, the magnitude of the estimated differential response of the ith
voxel is simply c^T β̂_i, where β̂_i is the vector of estimated predictor weights for the ith voxel. The
significance of this effect can be estimated with a t statistic, computed as follows:
\[ t_i = \frac{\mathbf{c}^T \hat{\boldsymbol{\beta}}_i}{\sqrt{\mathbf{c}^T\, \mathrm{Var}\{\hat{\boldsymbol{\beta}}_i\}\, \mathbf{c}}} . \tag{3.31} \]
Estimation of the variance of β̂_i is difficult. The simplest approach assumes that the residual errors
(eij in Formula 3.27) are independent and normally distributed. In this case,
\[ \mathrm{Var}\{\hat{\boldsymbol{\beta}}_i\} = \mathbf{e}_i^T \mathbf{e}_i\, (\mathbf{G}^T\mathbf{G})^{-1} . \tag{3.32} \]
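The GLM steps of Formulas 3.27–3.32 can be sketched on a simulated voxel time course. The design matrix, effect sizes, and noise level below are illustrative choices, not the thesis settings; note also that the residual variance here is normalized by the residual degrees of freedom, a common convention not spelled out in Formula 3.32.

```python
import numpy as np

rng = np.random.default_rng(4)
n_vols = 80

# Hypothetical design matrix G: task boxcar, linear drift, intercept.
task = np.tile(np.r_[np.zeros(10), np.ones(10)], 4)
drift = np.linspace(-1, 1, n_vols)
G = np.column_stack([task, drift, np.ones(n_vols)])

# Simulated voxel time course: true task effect 2.0, drift 0.5, unit noise.
x = 2.0 * task + 0.5 * drift + rng.normal(0, 1, n_vols)

# Eq. 3.30: least-squares estimate of the predictor weights.
beta = np.linalg.solve(G.T @ G, G.T @ x)

# Contrast selecting the task effect; Eqs. 3.31-3.32 under i.i.d. errors,
# with the residual variance normalized by the degrees of freedom.
c = np.array([1.0, 0.0, 0.0])
e = x - G @ beta
var_beta = (e @ e / (n_vols - G.shape[1])) * np.linalg.inv(G.T @ G)
t = (c @ beta) / np.sqrt(c @ var_beta @ c)
print(round(float(beta[0]), 2), round(float(t), 1))
```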
However, the assumption of temporal independence of residual errors is violated in typical fMRI
data. Because the hemodynamic response is so slow, there is strong temporal dependence in the error
terms. Worsley and Friston (1995) have tried to account for this by estimating temporal
autocorrelation and including it in Formula 3.32. The use of t tests to make spatial maps has
been very popular in the neuroscience community. Several of the most frequently used software
packages for fMRI data analysis are based on GLM, or at least include it as an option. Some
examples of such software are: SPM6, AFNI7, FSL8 and BrainVoyager9.
6 http://www.fil.ion.ucl.ac.uk/spm/
7 http://afni.nimh.nih.gov/afni/
8 http://www.fmrib.ox.ac.uk/fsl/
Temporal autocorrelation in the BOLD signal creates a problem for GLM analysis. Ordinary
least squares estimation assumes that the error terms are temporally independent and identically
distributed. If the voxelwise residual errors are correlated across time, the test statistics (and their
corresponding p values) are biased. A naïve application of a t test uses (number of volumes
minus number of predictors) as number of degrees of freedom; but the number of effective
degrees of freedom is smaller because of temporal autocorrelation, hence the bias in the t
statistic. Purdon and Weisskoff (1998) proposed an approach to deal with this bias: remove
long-range temporal correlation by high-pass filtering, and use a first-order autoregressive
model to account for short-range correlation. It is not necessary to perform high-pass filtering at
the pre-processing stage; it can be done within the univariate GLM framework by including low-
frequency oscillations into the design matrix.
3.6 Regularization of the covariance matrix
The expression of the probability density for a multivariate Gaussian distribution includes an
inverse of its covariance matrix. This matrix could be class-specific, as in the quadratic
discriminant method (see Formula 3.9), or common to all classes, as in the linear discriminant
method (see Formula 3.16). This matrix is of size K×K, where K is the number of dimensions in
the data (in fMRI, it is the number of voxels). This number could be quite large, especially if the
whole brain is acquired; typically, this number is on the order of tens of thousands. On the other
hand, the number of observations (i.e. fMRI volumes) is usually much less than that; it is on the
order of hundreds for single-subject studies, or it could be several thousand when we pool across
subjects or sessions.
High spatial resolution of fMRI comes at a price: the number of observations is less than the
number of dimensions. In this case, the covariance matrix is rank-deficient because we don't
have enough data. The rank of our matrix Sc (see Formula 3.9) is Nc-1 (here, Nc is the number of
observations, and we need to subtract 1 because of taking out the mean when computing the
covariance matrix), and it is much less than the number of voxels. The problem of inverting a
9 http://www.BrainVoyager.com/
46
rank-deficient covariance matrix is ill-posed (Friedman, 1989), and the inverse of Sc does not
exist. This is problematic for our classification (the expressions for the decision functions, in
Formulas 3.10 and 3.15, involve S_c^{-1} and S^{-1}). The GNB method does not have this problem in
either heteroscedastic or homoscedastic case, because the covariance matrix is diagonal and
therefore invertible.
The problem of inverting a rank-deficient covariance matrix is solved by regularizing the matrix,
that is, by approximating it with a full-rank matrix. One approach to regularization is to use
Principal Components Analysis (PCA) to project the data into low-dimensional space. Chapters
4 and 5 are dedicated to a detailed description of this approach. Another method of making the
covariance matrix invertible was described in a paper by Kustra and Strother (2001). In the
homoscedastic case, the covariance matrix S can be replaced by Sλ = S + λI, which is full-rank.
The heteroscedastic equivalent is class-specific matrix Sλc = Sc + λcI. The parameter λ needs to be
selected carefully; some metrics of optimization are proposed in the next chapter.
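The effect of ridge regularization is easy to verify numerically; the dimensions and the value of λ below are placeholders (in practice λ must be tuned, as discussed above).

```python
import numpy as np

rng = np.random.default_rng(5)

# Rank-deficient case: fewer observations (N) than dimensions (K).
N, K = 20, 50
X = rng.normal(size=(N, K))
S = np.cov(X, rowvar=False)            # rank is at most N - 1

print(np.linalg.matrix_rank(S))        # less than K, so S is singular

lam = 0.1                              # placeholder; must be tuned in practice
S_reg = S + lam * np.eye(K)            # ridge of Kustra & Strother (2001)
print(np.linalg.matrix_rank(S_reg))    # now K: the inverse exists
S_inv = np.linalg.inv(S_reg)           # inversion is no longer ill-posed
```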
Finally, we can reduce the data dimensionality by feature selection, that is, selecting a subset of
voxels for our analysis. If the number of selected voxels does not exceed the number of
observations in any class, the covariance matrices are full-rank. This method is described by
Mitchell et al. (2004). Although GNB does not have a problem with inverting the covariance
matrix, its efficiency is improved with this kind of voxel selection, as found by Mitchell and
colleagues.
3.7 Non-probabilistic classification: Support Vector Machines
In our evaluation of probabilistic classifiers, we have compared them with a non-probabilistic
classifier, the method of Support Vector Machines (SVM), which is popular in the field of fMRI
analysis (see e.g. Cox & Savoy, 2003; LaConte et al., 2005; Mourao-Miranda et al., 2005;
Misaki et al., 2010). This method does not build a probabilistic model for the classes, but creates
the decision function in a way that simultaneously maximizes the margin between the two
classes and minimizes the misclassification rate (Cortes & Vapnik, 1995). We have tested the
simplest version of SVM, which uses a linear kernel. The decision function for a linear-kernel SVM is
linear in x:
\[ D_{SVM}(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + b . \tag{3.34} \]
The vector w and the scalar b are found by minimizing the expression
\[ \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{n=1}^{N_{\mathrm{train}}} \xi_n , \tag{3.35} \]
subject to the constraints:
\[ t_n(\mathbf{w}^T\mathbf{x}_n + b) \ge 1 - \xi_n , \qquad \xi_n \ge 0 , \qquad n = 1, \dots, N , \]
where x1, …, xN are the training volumes and tn is 1 for volumes in class 1 and –1 for volumes in
class 2. The problem of finding the optimal set of (w, b, ξ1, …, ξN) has a unique solution, which can
be found by quadratic programming. The variables ξ1, …, ξN are called slack variables; ξi
measures the degree of misclassification for vector xi. The quantity 2/||w|| is called the margin.
The tradeoff hyperparameter C specifies the importance of accuracy of classification relative to
maximizing the margin; higher values of C force the slack variables to be smaller.
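To illustrate the objective in 3.35, the following Python/NumPy sketch minimizes the equivalent hinge-loss formulation by subgradient descent; this is a toy substitute for the quadratic-programming solver used by LIBSVM, and the learning rate and iteration count are arbitrary choices of ours:

```python
import numpy as np

def train_linear_svm(X, t, C=1.0, lr=0.01, n_iter=2000):
    """Soft-margin linear SVM trained by subgradient descent on
    (1/2)||w||^2 + C * sum_n max(0, 1 - t_n (w.x_n + b)),
    the hinge-loss form of objective 3.35.
    X: N x J matrix of training volumes; t: labels in {-1, +1}."""
    N, J = X.shape
    w = np.zeros(J)
    b = 0.0
    for _ in range(n_iter):
        margins = t * (X @ w + b)
        viol = margins < 1                      # volumes with nonzero slack
        grad_w = w - C * (t[viol, None] * X[viol]).sum(axis=0)
        grad_b = -C * t[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def decision(X, w, b):
    """D_SVM(x) = w^T x + b (Formula 3.34), applied to each row of X."""
    return X @ w + b
```

Larger values of C penalize slack more heavily, pushing the solution toward fewer margin violations, exactly as described above.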
We have used the MATLAB library LIBSVM10 (Chang & Lin, 2011) to compute the weights w and
offset b. The spatial map for this decision function is defined by the vector w (LaConte et al.,
2005). Because of time limitations, we did not evaluate SVMs with nonlinear kernels, although
they sometimes outperform linear-kernel SVMs in classifying fMRI volumes (Schmah et al.,
2010). Construction of spatial maps for nonlinear-kernel SVMs is possible, although
computationally expensive (it is described in Rasmussen et al., 2011 and 2012A).
10. http://www.csie.ntu.edu.tw/~cjlin/libsvm/
Chapter 4 Methods for estimating the intrinsic dimensionality of the data
4.1 Principal Component Analysis and dimensionality reduction
A typical fMRI data set consists of a relatively small number of volumes, each volume
containing measurements of BOLD signal for a large number of voxels. In within-subject
analysis of fMRI volumes, each volume is an observation, and each voxel defines a dimension of
the data space. Because of the complex spatio-temporal structure of BOLD signal, there is a
large amount of correlation across dimensions, making this representation highly redundant.
Another problem with high dimensionality of fMRI data is difficulty of multivariate modeling: if
we aspire to describe interactions between voxels, the number of degrees of freedom in our
model will exceed the sample size. This problem is encountered in applying Linear and
Quadratic Discriminants, as outlined in Section 3.5. Also, high data dimensionality results in a
heavy computational burden.
Principal Component Analysis can greatly improve the situation by projecting the data into a
low-dimensional space. The vector basis of this space is defined by a relatively small number of
"eigenimages". Eigenimages are mutually orthogonal vectors in voxel space. Each volume can
be represented as a linear combination of eigenimages. This orthogonal basis is given by
factoring of the data matrix X according to Eckart-Young Theorem (Reyment & Joreskog,
1996):
X = UΓVᵀ = UZᵀ .  (4.1)

Here, U and V are matrices with orthonormal columns, Γ is a diagonal matrix, and Z = VΓᵀ. All
these matrices are real-valued. The diagonal entries in the Γ matrix are called the singular values
of X. If J is the number of voxels and N is the number of volumes, then the size of X and U is
J×N, and the size of Γ, V and Z is N×N. The columns of U form the orthonormal basis of
eigenimages, and the rows of Z contain the coordinates of each data point in the eigenimage
space. The number of eigenimages is N, so this representation reduces the dimensionality of the
data from J (typically tens of thousands) to N (typically hundreds). This decomposition of X into
U, Γ and V is called a singular value decomposition (SVD) of X.
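A Python/NumPy sketch of this decomposition (economy-size SVD, so that U has N columns rather than J; the function name is ours):

```python
import numpy as np

def eigenimage_basis(X):
    """Economy-size SVD of the J x N data matrix: X = U Gamma V^T = U Z^T.
    Columns of U are the eigenimages; row i of Z = V Gamma^T holds the
    coordinates of volume i in eigenimage space."""
    U, gamma, Vt = np.linalg.svd(X, full_matrices=False)
    Z = Vt.T @ np.diag(gamma)
    return U, gamma, Z
```

NumPy returns the singular values already sorted in the decreasing order γ₁ ≥ γ₂ ≥ … ≥ γ_N assumed in the text.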
We can rewrite the Formula 4.1 as a sum:
X = Σ_{i=1}^{N} γ_i u_i v_iᵀ ,  (4.2)
where ui and vi are the ith columns of U and V, and γi is the ith singular value. The product
γiuiviT is the ith principal component (PC) of the data (Reyment & Joreskog, 1996). Principal
components are usually ordered according to the amount of variance they explain. The first
component is the best one-dimensional approximation to our data, in the least-squares sense. The
linear combination of the first two components is the best two-dimensional approximation, and
so on. Alternatively, we can say that the first eigenimage defines the direction along which the
data is most variable. If we remove the variation along this direction from our data, the new
direction of the largest variance is the direction of the second eigenimage. The third eigenimage
defines the direction of largest variance once the variations along the directions of the first and
second eigenimages have been removed, and so on. With this ordering, a declining order is imposed on the singular
values: γ1 ≥ γ2 ≥ … ≥ γN.
For mean-centered data, eigenimages are also the eigenvectors of the covariance matrix. This
matrix is computed as S = (1/(N−1)) XXᵀ. If X is factored according to the Formula 4.1, then, taking
the orthonormality of V into account, S can be written as

S = (1/(N−1)) U ΓΓᵀ Uᵀ = U Λ Uᵀ ,  (4.3)

where Λ is a diagonal matrix with entries λ_i = γ_i²/(N−1). This formula can be rewritten as

SU = UΛ ,  (4.4)

so for any i between 1 and N we can write Su_i = λ_i u_i. Therefore, the eigenvectors of the
covariance matrix are the eigenimages of X, and the corresponding eigenvalues of S are the squared
singular values of X, scaled by 1/(N−1). The proportion of variance of S explained by the first K
principal components is (Σ_{i=1}^{K} λ_i) / (Σ_{i=1}^{N} λ_i). The rank of S is determined by the number of its non-zero
eigenvalues.
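These relations are easy to verify numerically; the following Python/NumPy sketch (with arbitrary dimensions of ours) checks the eigenvalue identity and the variance decomposition:

```python
import numpy as np

rng = np.random.default_rng(3)
J, N = 300, 30
X = rng.standard_normal((J, N))
X = X - X.mean(axis=1, keepdims=True)     # mean-center each voxel's time course

S = X @ X.T / (N - 1)                     # sample covariance matrix
U, gamma, Vt = np.linalg.svd(X, full_matrices=False)
lam = gamma ** 2 / (N - 1)                # eigenvalues of S (Formula 4.3)

# S u_i = lambda_i u_i: the eigenimages are eigenvectors of S
for i in range(5):
    assert np.linalg.norm(S @ U[:, i] - lam[i] * U[:, i]) < 1e-8

# proportion of variance explained by the first K principal components
K = 5
prop = lam[:K].sum() / lam.sum()
```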
We can think of fMRI data as a mixture of signal of interest and irrelevant noise. One goal of
dimensionality reduction is to get rid of components that contain mostly noise. In a cross-
validation framework, these noisy components are specific to the training set, and do not
generalize to the test set (Hansen et al., 1999). Principal components dominated by random noise
tend to have eigenvalues at the tail end of the spectrum (it should be noted that components
containing correlated structured noise can have relatively high eigenvalues). The number of
components that contain signal of interest is the intrinsic dimensionality of the data (Cordes &
Nandy, 2006).
Overall, the data dimensionality is reduced with two steps. First, we project the data onto a set of
eigenimages given by the Eckart-Young theorem, reducing the dimensionality from J to N. Then,
we further reduce it by identifying the K principal components that contain the signal (K≤N). The
second step is more difficult than the first, and a number of methods of estimating K have been
proposed; we review and test them in this chapter.
4.2 Probabilistic Principal Component Analysis
Using the ideas of Principal Component Analysis, Tipping and Bishop (1999) have created a
probabilistic generative model called Probabilistic Principal Component Analysis (PPCA). We
will briefly discuss this model because of its importance in estimation of intrinsic dimensionality
(see, for example, Minka, 2000; Beckmann & Smith, 2004; Ulfarsson & Solo, 2008; Hansen et
al., 1999); for a comprehensive treatment of PPCA, see also Bishop (2006). J-dimensional vector
x is represented as a weighted sum of basis vectors hj, the mean vector m and an error term e:
x = Σ_{j=1}^{K} w_j h_j + m + e = Hw + m + e .  (4.5)
Here, K is the number of basis vectors, and H is a J × K matrix whose jth column is the basis
vector h_j. PPCA assumes the following distributions of w and e:

w ~ N(0, I) ,   e ~ N(0, σ²I) .  (4.6)
Therefore, the distribution of x is also a multivariate Gaussian, with mean m and covariance
matrix given by Σ = HHᵀ + σ²I. The parameters that define the distribution of x can be
combined into a parameter vector θ = (H, m, σ²).
Tipping and Bishop (1999) have provided the solution for θ that maximizes p(X|θ), the
likelihood of the observed data. The logarithm of the likelihood function is
log p(X|θ) = −(N/2) ( J log(2π) + log|Σ| + trace(Σ⁻¹S) ) ,  (4.7)
where S is the sample covariance matrix. We maximize the expression 4.7 using the eigenvalue
decomposition S = UΛUT. Since the number of basis vectors is K, the first K eigenvectors of S
define the signal subspace, and the remaining J − K eigenvectors define the noise subspace. The
variance of the noise is estimated as the mean of the last J − K eigenvalues of S:
σ̂² = (1/(J−K)) Σ_{i=K+1}^{N} λ_i ,  (4.8)
and H is estimated from the eigendecomposition of S (UK contains the first K eigenvectors of S,
and diagonal matrix ΛK contains the corresponding eigenvalues):
Ĥ = U_K (Λ_K − σ̂²I)^{1/2} R ,  (4.9)
where R is an arbitrary orthogonal matrix. Finally, the maximum-likelihood estimate of m is the
sample mean:
m̂ = (1/N) Σ_i x_i .  (4.10)
The maximum-likelihood estimate of the model parameters is θ̂ = (Ĥ, m̂, σ̂²). The coordinates of
data vector x_i in principal-component space are given by

ŵ_i = (ĤᵀĤ + σ̂²I)⁻¹ Ĥᵀ (x_i − m̂) .  (4.11)
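The estimates in Formulas 4.8–4.11 can be sketched in Python/NumPy as follows (R is taken to be the identity; the function name and test dimensions are ours):

```python
import numpy as np

def ppca_fit(X, K):
    """Maximum-likelihood PPCA estimates for a J x N data matrix X.
    Returns H_hat, m_hat, sigma2_hat and the K-dim coordinates W."""
    J, N = X.shape
    m = X.mean(axis=1, keepdims=True)                  # Formula 4.10
    Xc = X - m
    U, gamma, _ = np.linalg.svd(Xc, full_matrices=False)
    lam = gamma ** 2 / (N - 1)                         # eigenvalues of S
    sigma2 = lam[K:].sum() / (J - K)                   # Formula 4.8
    H = U[:, :K] @ np.diag(np.sqrt(lam[:K] - sigma2))  # Formula 4.9, R = I
    M = H.T @ H + sigma2 * np.eye(K)
    W = np.linalg.solve(M, H.T @ Xc)                   # Formula 4.11
    return H, m, sigma2, W
```

Note that ŵ_i is a posterior mean, so the reconstruction Ĥŵ_i + m̂ is slightly shrunk relative to the plain PCA projection.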
4.3 Methods of dimensionality estimation
Estimation of intrinsic data dimensionality has been identified as "perhaps the most important
problem in using PCA" (Ulfarsson & Solo, 2008). Many methods of solving this problem have
been proposed in the statistical, machine-learning, and signal-detection literature. Not all of these
methods work in the situation when the number of variables (J) greatly exceeds the number of
observations (N), a situation typical for fMRI data (J >> N). A review of methods for the J < N
situation is given by Peres-Neto and colleagues (2005).
Some early methods (or, rather, rules-of-thumb) of intrinsic dimensionality estimation are
described by Mardia et al. (1979). For example, one can retain the PCs that, taken together,
explain 90% of the variance in the data; or one can look for the “knee” in the eigenvalue plot (the
point at which the eigenvalue spectrum of the covariance matrix flattens out, which in a white-
noise model indicates noise-dominated components); this is called the "scree-plot method". Both
of these methods are subjective, because the threshold of 90% is an arbitrary choice, and the
scree-plot method involves visual inspection. The estimates thus obtained are poor estimates of
intrinsic dimensionality (Beckmann & Smith, 2004).
In our publication (Yourganov et al., 2011), we have surveyed a set of methods of dimensionality
estimation. They can be classified into two broad categories, which we call analytic and
empirical. Methods in the first category estimate intrinsic dimensionality K by maximizing some
criterion that is computed on the whole data set X. The criterion is an analytic expression on X,
and it is formulated so the resulting PPCA model has some desirable properties from the point of
view of Bayesian prediction and/or information theory. Empirical methods do not use an analytic
criterion, but instead optimize some metric of performance on an independent test set. Cross-
validation is used to repeatedly split the data into training and test sets. This is computationally
expensive, compared with analytic methods. However, there is evidence that empirical methods are
more sensitive to the true structure of the data; analytic methods rely on asymptotic properties
and tend to over-estimate the intrinsic dimensionality (see Hansen et al., 1999; Yourganov et al.,
2011). Also, from the pragmatic point of view, metrics of performance on an independent test set
are often easier to interpret than information-theoretic criteria.
In our survey, we have tested a variety of methods that will be described below. The analytic
approach is represented by optimization of Akaike information criterion, minimum description
length, Bayesian evidence and Stein's unbiased risk estimator. Examples of metrics to be
optimized with an empirical approach are predicted residual sum of squares, generalization error,
spatial map reproducibility, and prediction/classification accuracy. These methods have been
tested on our synthetic data described in Section 2.2.1. In addition, optimization of the area under
the ROC curve has served as the "gold standard" method of dimensionality estimation.
4.3.1 Akaike Information Criterion
Akaike (1974) proposed an information-theoretic criterion for selecting the number of free
parameters in a model. This criterion is known as the Akaike
Information Criterion (AIC), and it has been applied to estimate the number of Gaussian signal
sources (Wax & Kailath, 1985) and, subsequently, the number of independent components in
fMRI data (Calhoun et al., 2001; Li et al., 2007). Here we follow the formulation of AIC given
by Stoica & Selen (2004), who give a useful review of several other analytical methods.
The aim of AIC is to make the data likelihood computed with maximum-likelihood estimates of
the model parameters, p(X|θ̂), asymptotically approach the true data likelihood p(X|θ) as the
sample size grows to infinity. The similarity between p(X|θ̂) and p(X|θ) is measured with
cross-entropy:
I = E_θ[ log p(X|θ̂) ] .  (4.12)
Here, Eθ is the expected value under true model parameters θ. In practice, they are unknown.
Akaike approximates the expression 4.12 with the expected value under the maximum-likelihood
parameter estimates (rather than true model parameters):
I ≈ E_θ̂[ log p(X|θ̂) ] − D ,  (4.13)
where D is the number of degrees of freedom in the model. The unbiased estimate of this
approximation to I is
Î = log p(X|θ̂) − D .  (4.14)
The model is chosen to maximize Î, or, alternatively, to minimize the expression
AIC = −2 log p(X|θ̂) + 2D ,  (4.15)
which is known as the Akaike Information Criterion.
Wax and Kailath (1985) applied AIC to select the number of Gaussian sources for a model that
represents the data as a linear combination of sources. This model is similar to PPCA, except that
the sources are not assumed to be orthogonal. Also, the expected mean m of the data vectors is
assumed to be zero. The method was developed for complex sources, but we will assume that the
sources are real-valued. The model can be written as
x = Σ_{j=1}^{K} w_j h_j + e = Hw + e .  (4.16)
The noise vector e is independent from the data, and has a multivariate Gaussian distribution
with zero mean and a covariance matrix given by σ2I. The log-likelihood term in 4.15 can be
computed using the maximum-likelihood estimators of H and σ2 given by 4.9 and 4.8. This term
is given by
log p(X|θ̂) = N(J−K) log [ ( Π_{i=K+1}^{J} λ_i )^{1/(J−K)} / ( (1/(J−K)) Σ_{i=K+1}^{J} λ_i ) ] ,  (4.17)

that is, the log-likelihood is governed by the ratio of the geometric mean to the arithmetic mean of the eigenvalues outside the signal subspace.
Let us now compute the number of degrees of freedom in the model. Using maximum-likelihood
estimates, the model is fully specified by σ2 and the first K eigenvalues and eigenvectors of the
sample covariance matrix S. The eigenvectors are J-dimensional, and constrained to be mutually
orthonormal. The number of degrees of freedom in U_K is therefore JK − K − (1/2)K(K−1)
(normalization removes K degrees of freedom, and mutual orthogonalization removes
(1/2)K(K−1) degrees of freedom). The remaining parameters (K eigenvalues and σ²) contribute
K+1 degrees of freedom to the model. Overall, the number of degrees of freedom is

D = JK − (1/2)K(K−1) + 1 .  (4.18)
By substituting expressions 4.18 and 4.17 into 4.15, we get an expression for AIC, which can be
evaluated for a range of values of K. The value of K that minimizes AIC can serve as an estimate
of intrinsic data dimensionality (Calhoun et al., 2001; Li et al., 2007).
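A Python/NumPy sketch of this procedure, computing AIC from a (sorted, descending) eigenvalue spectrum via Formulas 4.15, 4.17 and 4.18 (the function names are ours):

```python
import numpy as np

def loglik_wax_kailath(lam, K, N):
    """Log-likelihood term of Formula 4.17: N(J-K) times the log of the
    ratio of the geometric to the arithmetic mean of the eigenvalues
    outside the K-dimensional signal subspace."""
    J = len(lam)
    tail = lam[K:]
    geom = np.exp(np.mean(np.log(tail)))
    arith = np.mean(tail)
    return N * (J - K) * np.log(geom / arith)

def aic(lam, K, N):
    """Akaike Information Criterion, Formula 4.15, with degrees of
    freedom D from Formula 4.18."""
    J = len(lam)
    D = J * K - K * (K - 1) / 2 + 1
    return -2.0 * loglik_wax_kailath(lam, K, N) + 2.0 * D
```

When the tail eigenvalues are equal (pure white noise beyond K components), the log-likelihood term vanishes and AIC reduces to the 2D penalty, so the criterion trades goodness of fit against model order.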
4.3.2 Minimum Description Length
Rissanen (1978) proposed another information-theoretic criterion based on encoding theory.
Imagine that we encode our data X with binary digits. To minimize the length of encoded data,
we assign shorter codes to data vectors that occur more often. Minimum Description Length
(MDL) criterion is used to select the model that assigns the shortest binary sequence to the data.
Here we are considering the encoding of any possible data, not just the observed data that were
used to estimate the model parameters. Strictly speaking, the problem of finding such a model is
equivalent to computing Kolmogorov complexity and is not computable. Rissanen's criterion is
an approximation developed for large data sets.
For an estimate θ̂ of the parameter vector, the probability of the data is given by the
likelihood p(X|Ĥ, m̂, σ̂²). For the observed data, the model that maximizes the likelihood is
also the model that produces the shortest description length. However, maximum-likelihood
estimates are biased (Bishop, 2006) and need to be corrected so the model produces the shortest
description of the previously unseen data. Rissanen arrived at the following criterion:
MDL = −log p(X|θ̂) + (1/2) D log N ,  (4.19)
which is similar to AIC (formula 4.15) but includes the logarithm of N (sample size) into the
term that corrects the bias of the maximum-likelihood estimate. It is worth noting that the
Bayesian Information Criterion (BIC) introduced by Schwarz (1978) has the same formulation
(see Wax & Kailath, 1985).
Wax and Kailath (1985) have applied MDL to the problem of selection of the number of
Gaussian sources. The resulting formulation of MDL is similar to their formulation of AIC, with
the log-likelihood term computed according to 4.17 and the number of degrees of freedom
computed according to 4.18. This formulation was adapted to the problem of selecting the
number of independent components by Calhoun et al. (2001) and further by Li et al. (2007).
MDL was implemented as a part of the GIFT software package11. The efficiency of BIC as a
dimensionality estimation method was tested by Cordes and Nandy (2002), by Ulfarsson and
Solo (2008) and by Beckmann and Smith (2004).
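A Python/NumPy sketch of MDL evaluated on an eigenvalue spectrum, reusing the Wax-Kailath log-likelihood (Formula 4.17) and degrees of freedom (Formula 4.18); the function name is ours:

```python
import numpy as np

def mdl(lam, K, N):
    """Minimum Description Length, Formula 4.19, computed from a sorted
    (descending) eigenvalue spectrum lam of the sample covariance matrix."""
    J = len(lam)
    tail = lam[K:]
    geom = np.exp(np.mean(np.log(tail)))        # geometric mean of tail
    arith = np.mean(tail)                       # arithmetic mean of tail
    loglik = N * (J - K) * np.log(geom / arith) # Formula 4.17
    D = J * K - K * (K - 1) / 2 + 1             # Formula 4.18
    return -loglik + 0.5 * D * np.log(N)
```

Compared with AIC, the penalty grows with log N, so MDL tends to select smaller dimensionalities for large samples.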
4.3.3 Bayesian evidence
Minka (2000) put the problem of selecting the number of principal components into a fully
Bayesian framework. PPCA is used as the probabilistic model. For a given data set and its
sample covariance matrix, the maximum-likelihood estimates of the model parameters H and σ² are
determined according to the value of K. Minka proposed to select the K that produces Ĥ and σ̂² in
such a way that the likelihood p(X|Ĥ, m̂, σ̂²) is maximized. In other words, we need to select the K
that maximizes the Bayesian evidence p(X|K).
Bayesian evidence can be computed by integrating over the parameters of the model:
p(X|K) = ∫ p(X|θ) p(θ|K) dθ .  (4.20)
Here, θ denotes the parameter vector (H, m, σ2). The factor p(θ|K) is the prior distribution of
model parameters for a given value of K. Minka uses a non-informative (flat) prior for m, and
the prior distributions for H and σ² have a hyper-parameter α which controls the sharpness of
the prior (α is small for non-informative priors). H is factorized as H = U(Λ_K − σ̂²I)^{1/2}R, where
R is an arbitrary orthogonal matrix. The prior for U is uniform over the set of orthonormal
K-frames in R^J (the Stiefel manifold):
11. http://icatb.sourceforge.net
p(U|K) = 2^{−K} Π_{i=1}^{K} Γ((J−i+1)/2) π^{−(J−i+1)/2} ,  (4.21)

while the priors for Λ_K and σ² are broad distributions whose sharpness is controlled by the
hyper-parameter α (see Minka, 2000, for their exact form).
Approximate solutions for the integral in 4.20 can be found using the Laplace approximation, in
which the integrand is approximated with a Gaussian centered at its mode (see Bishop (2006) for a
treatment of the Laplace approximation). We arrive at the following approximation of the Bayesian
evidence:

p(X|K) ≈ p(U|K) ( Π_{i=1}^{K} λ_i )^{−N/2} (σ̂²)^{−N(J−K)/2} (2π)^{(m+K)/2} |A_Z|^{−1/2} N^{−K/2} .  (4.22)
Here, m = JK − K(K+1)/2, and the determinant of the Hessian matrix A_Z is given by

|A_Z| = Π_{i=1}^{K} Π_{j=i+1}^{J} ( λ̃_j⁻¹ − λ̃_i⁻¹ )( λ_i − λ_j ) N ,  (4.23)

where λ_k is the kth eigenvalue of the sample covariance matrix S, and λ̃_k takes the value of λ_k if
k ≤ K and the value of σ̂² otherwise.
Beckmann and Smith (2004) use this Bayesian approach to select the number of principal
components in fMRI data. The components are then rotated so the mutual information shared
across components becomes zero. This method is called Probabilistic Independent Component
Analysis and is implemented in the popular MELODIC software package12.
4.3.4 Stein's Unbiased Risk Estimator
Ulfarsson and Solo (2008) have developed a method of selecting the number of principal
components, which is based on minimization of risk (defined as the expected mean squared
error). The data are represented with the PPCA model given in equation 4.5. In this model, the
12. http://www.fmrib.ox.ac.uk/analysis/research/melodic/
risk is the expected value of the squared difference between Hw + m and its estimate Ĥŵ + m̂.
The authors use the unbiased estimate of risk proposed by Stein:

SURE = (1/N) Σ_{i=1}^{N} ‖x_i − (Ĥŵ_i + m̂)‖² + (2σ̂²/N) Σ_{i=1}^{N} trace( ∂(Ĥŵ_i + m̂)/∂x_iᵀ ) − Jσ̂² .  (4.24)
Here, Ĥ and m̂ are the maximum-likelihood estimates given by Formulas 4.9 and 4.10, and ŵ_i is
given by 4.11. We could use the maximum-likelihood estimate of σ² given by 4.8, but Ulfarsson and
Solo state that this estimate is too strongly influenced by our choice of dimensionality K. Using
random matrix theory, they develop a different estimate of σ² which does not require a good
estimate of K. Note, however, that the maximum-likelihood estimate of σ² is still used to compute Ĥ
and ŵ_i.
The noise variance σ² can be estimated from the eigenvalues of the sample covariance matrix S.
If the columns of the J × N data matrix X come from a multivariate Gaussian distribution, S has a
Wishart distribution. Let γ = J/N. As J → ∞ and N → ∞, the distribution of the eigenvalues of S
converges to a Marcenko-Pastur distribution:

f(λ) = √( (b − λ)(λ − a) ) / ( 2πγσ²λ )  if a ≤ λ ≤ b,  and 0 otherwise,  (4.25)

where a = σ²(1 − γ^{1/2})² and b = σ²(1 + γ^{1/2})². The values of a and b define the asymptotic
boundaries on the range of eigenvalues: as the size of the matrix grows, λ₁ → b and λ_J → a.
Ulfarsson and Solo also provide the asymptotic formula for the median eigenvalue for given γ
and σ2. Let F½ be the asymptotic median eigenvalue for the situation when σ2 = 1.
The random-matrix estimate of noise variance is computed in two steps. First, we get a rough
estimate by dividing the median eigenvalue of S by F½. Then, b is obtained using this rough
estimate. A rough estimate of K is obtained as the index of the largest eigenvalue of S which is
greater than b. The final estimate of σ2 is computed as the ratio of the median of λK+1,…, λN to
F½.
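A Python/NumPy sketch of this two-step estimate, with the median F½ obtained by numerically integrating the density 4.25 rather than from the asymptotic formula of Ulfarsson and Solo; it is illustrated in the γ = J/N < 1 regime, where the density integrates to one (the thesis data have γ > 1, where a point mass at zero must also be handled):

```python
import numpy as np

def mp_bounds(gamma_ratio, sigma2):
    """Asymptotic eigenvalue range [a, b] of the Marcenko-Pastur law (4.25)."""
    a = sigma2 * (1.0 - np.sqrt(gamma_ratio)) ** 2
    b = sigma2 * (1.0 + np.sqrt(gamma_ratio)) ** 2
    return a, b

def mp_median(gamma_ratio):
    """Median F_1/2 of the Marcenko-Pastur density for sigma^2 = 1,
    found by numerically integrating f(lambda) over [a, b]."""
    a, b = mp_bounds(gamma_ratio, 1.0)
    lam = np.linspace(a, b, 20001)[1:-1]        # open interval, avoid endpoints
    f = np.sqrt((b - lam) * (lam - a)) / (2 * np.pi * gamma_ratio * lam)
    cdf = np.cumsum(f)
    cdf = cdf / cdf[-1]
    return lam[np.searchsorted(cdf, 0.5)]

def noise_variance(lam_sorted, gamma_ratio):
    """Two-step random-matrix estimate of sigma^2: a rough estimate from
    the median eigenvalue, then a refined estimate using only the
    eigenvalues below the rough upper edge b (lam_sorted: descending)."""
    F_half = mp_median(gamma_ratio)
    rough = np.median(lam_sorted) / F_half
    _, b = mp_bounds(gamma_ratio, rough)
    K_rough = int(np.sum(lam_sorted > b))       # eigenvalues above the noise edge
    return np.median(lam_sorted[K_rough:]) / F_half
```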
With this estimate of noise variance, we can compute SURE (Stein's Unbiased Risk Estimator)
for a given K according to 4.24. To estimate intrinsic dimensionality, we select K that minimizes
SURE.
4.3.5 Predicted Residual Sum of Squares
Cross-validatory procedures of dimensionality estimation were introduced by Wold (1978) and
by Eastment and Krzanowski (1982; see also Krzanowski & Kline, 1995). The number of
principal components is selected to optimize the Predicted Residual Sum of Squares (PRESS)
statistic. Let K be the number of components that are used to approximate the data matrix. With
respect to a specific voxel, this approximation can be written as
x̂_ij = Σ_{k=1}^{K} u_jk γ_k v_ik .  (4.26)
The PRESS statistic is computed as the mean squared residual, obtained by squaring the difference
between the true voxel signals and their corresponding approximations:

PRESS = (1/(NJ)) Σ_{j=1}^{J} Σ_{i=1}^{N} ( x_ij − x̂_ij )² .  (4.27)
To estimate the intrinsic dimensionality, we select the K that produces the approximation that minimizes
PRESS.
However, there is a flaw in this simple scheme of computing PRESS. All voxels of the data
matrix are used to compute the U, Γ and V matrices. Therefore, the value of x_ij is used to
compute its own approximation x̂_ij given by 4.26. This introduces a bias into the computed PRESS
statistic. To avoid it, we need to make sure that the computation of x̂_ij is performed without
utilizing x_ij. Eastment and Krzanowski (1982) solve this issue by removing, in two separate
steps, the jth row and ith column from X. Let X(i) be the data matrix with ith column removed,
and X(j) be the matrix with the jth row removed. We compute singular value decomposition for
these matrices:
X(i) = U(i) Γ(i) V(i)ᵀ ,
X(j) = U(j) Γ(j) V(j)ᵀ .  (4.28)
Then we compute the approximation for x_ij using U(i) (which retains the row for voxel j), V(j)
(which retains the entry for volume i), and both Γ(i) and Γ(j):

x̂_ij = Σ_{k=1}^{K} u_jk(i) √(γ_k(i)) √(γ_k(j)) v_ik(j) .  (4.29)
This way, PRESS is computed by leave-one-out cross-validation: the true value of x_ij serves as
test data that is independent of the training data that was used to compute x̂_ij.
4.3.6 Generalization error
Hansen and colleagues (1999) have developed a cross-validatory approach to dimensionality
estimation which is more efficient than PRESS minimization. It utilizes repeated splits into
training and test data, which is faster than the somewhat cumbersome process of removing rows
and columns from the data matrix as described above. The metric of performance is
generalization error, which tells us how well the model computed from the training set
generalizes to the independent test set. Hansen et al. have also produced a valuable comparison
of the empirical estimate of generalization error to its analytic estimate.
Generalization error is computed using the log-likelihood function. The theoretical formula for
generalization error is
G(θ) = −∫ log p(x|θ) p(x) dx ,  (4.30)
which tells us how well the model specified by θ generalizes to all possible data vectors. In
practice, we can estimate G(θ) from a finite set of observations. In an earlier paper, Hansen and
Larsen (1996) have derived an analytical estimate for G(θ):
Ĝ(θ) = ( −log p(X|θ) + D ) / N ,  (4.31)
where D is the number of degrees of freedom in θ. Note the similarity to other analytical criteria,
such as AIC (Formula 4.15) and MDL (Formula 4.19).
Hansen et al. use the PPCA model, where the log-likelihood function is given by Formula 4.7.
This function is used to compute the analytical estimate of G(θ) in Formula 4.31. An empirical
estimate of G(θ) can be computed using a split into training and test sets. The training set is used
to compute a maximum-likelihood estimate θ̂ = (Ĥ, m̂, σ̂²) for a specific value of intrinsic
dimensionality K, as described in Section 4.2. Then, the generalization error is the negative log-
likelihood of the test set. Omitting the proportionality coefficient N/2 from the log-likelihood
function given by expression 4.7, we get
Ĝ(θ̂) = J log(2π) + log|Σ̂| + trace( Σ̂⁻¹ S_test ) .  (4.32)

Here, S_test is the sample covariance matrix of the test data, and Σ̂ is the estimate of the population
covariance matrix according to the model θ̂: Σ̂ = ĤĤᵀ + σ̂²I. We compute Ĝ(θ̂) for a series of
training-test splits of the data over a range of K, and select the K that minimizes the mean value
(across splits) of Ĝ(θ̂).
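A Python/NumPy sketch of this empirical estimate: fit PPCA (Formulas 4.8-4.10) on a training half and evaluate Formula 4.32 on the covariance of the test half (the function names and data dimensions are ours):

```python
import numpy as np

def ppca_cov(X_train, K):
    """Model covariance Sigma_hat = H H^T + sigma^2 I from a PPCA fit
    (Formulas 4.8-4.9) on the training data."""
    J, N = X_train.shape
    Xc = X_train - X_train.mean(axis=1, keepdims=True)
    U, gamma, _ = np.linalg.svd(Xc, full_matrices=False)
    lam = gamma ** 2 / (N - 1)
    sigma2 = lam[K:].sum() / (J - K)
    H = U[:, :K] * np.sqrt(lam[:K] - sigma2)
    return H @ H.T + sigma2 * np.eye(J)

def gen_error(X_train, X_test, K):
    """Empirical generalization error, Formula 4.32:
    J log(2 pi) + log|Sigma_hat| + trace(Sigma_hat^{-1} S_test)."""
    J, Nt = X_test.shape
    Sigma = ppca_cov(X_train, K)
    Xc = X_test - X_test.mean(axis=1, keepdims=True)
    S_test = Xc @ Xc.T / (Nt - 1)
    _, logdet = np.linalg.slogdet(Sigma)
    return J * np.log(2 * np.pi) + logdet + np.trace(np.linalg.solve(Sigma, S_test))
```

In practice this is evaluated over many split-half resamplings and a range of K, and the K with the smallest mean error is retained.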
4.3.7 Reproducibility and Classification Accuracy
Another method of intrinsic dimensionality estimate comes from the NPAIRS literature (e.g.,
Strother et al., 2002; Strother et al., 2004; LaConte et al., 2003; Shaw et al., 2003). Here, the
dimensionality of the data is reduced with SVD, and a subset of principal components forms the
low-dimensional input to an analysis model. Performance of the model is greatly influenced by
the number of principal components selected for analysis. Selecting too few components leads to
a simplistic model with an inherent bias; selecting too many is analogous to overfitting (trying to
model the noise in the training set). The result of overfitting is increased variance of the model
performance on the independent test set (LaConte et al., 2003). This is an example of bias-
variance trade-off, where a complex model has high variance and low bias, and a simplistic
model has high bias and low variance (Hastie et al., 2009). In our case, the number of PCs that
we retain for our analysis defines the complexity of the model.
The NPAIRS framework uses two metrics of performance: reproducibility of spatial maps and
accuracy in predicting mental states. These metrics were described in Sections 2.1.2 and 2.1.3,
respectively. The approach suggested by the NPAIRS literature is to retain the number of
principal components such that one or both of the two metrics are optimized. It is possible to
optimize a combination of the two metrics, namely, the Euclidean distance from the "perfect
predictor". On prediction-reproducibility plots, "perfect predictor" corresponds to the point (R=1,
P=1), and the number of components that corresponds to the point closest to (R=1, P=1) can
serve as an estimate of intrinsic dimensionality. The prediction-reproducibility plots are
described in Section 2.1.4.
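A minimal sketch of this selection rule (the prediction and reproducibility arrays in the test are hypothetical values, one per candidate dimensionality):

```python
import numpy as np

def best_dimensionality(prediction, reproducibility):
    """Select the number of PCs whose (P, R) point lies closest, in
    Euclidean distance, to the perfect predictor (P=1, R=1)."""
    P = np.asarray(prediction, dtype=float)
    R = np.asarray(reproducibility, dtype=float)
    d = np.sqrt((1.0 - P) ** 2 + (1.0 - R) ** 2)
    return int(np.argmin(d)) + 1      # dimensionalities counted from 1
```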
Instead of optimizing accuracy of prediction, we can select the number of components that
optimizes accuracy of classification (the difference between prediction accuracy and
classification accuracy is described in Section 2.1.3). Classification accuracy can be used if our
algorithm classifies the data without explicitly computing posterior probability of data belonging
to a particular class. In a recent paper (Schmah et al., 2010) we have selected the number of
principal components to optimize classification accuracy. In a follow-up paper (Yourganov et al.,
2010), we have shown that these estimates of dimensionality are relevant for prediction of
recovery from a motor stroke. Post-stroke recovery of function is a process of re-building the
cortical networks disrupted by the stroke, and the number of principal components that best
describes our data is an indicator of this recovery.
4.3.8 The Area under a ROC curve
In simulated data where the "ground truth" is known, we can use ROC metrics to select the
number of principal components. We use a subset of principal components as a low-dimensional
approximation to our data, and process it with our analytical model. Then we compute the
frequency of false positives and true positives, and construct the ROC curve as described in
Section 2.1.1. The partial area under the ROC curve that lies between FPF = 0 and FPF = 0.1 is
our optimization metric for selecting the number of principal components.
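A Python/NumPy sketch of the partial ROC area by trapezoidal integration (the linear interpolation at the cutoff is our implementation choice):

```python
import numpy as np

def partial_auc(fpf, tpf, fpf_max=0.1):
    """Partial area under the ROC curve between FPF = 0 and FPF = fpf_max.
    fpf and tpf must be sorted by non-decreasing FPF and start at (0, 0);
    the curve is linearly interpolated at the cutoff."""
    fpf = np.asarray(fpf, dtype=float)
    tpf = np.asarray(tpf, dtype=float)
    t_cut = np.interp(fpf_max, fpf, tpf)      # TPF at the FPF cutoff
    keep = fpf < fpf_max
    x = np.concatenate([fpf[keep], [fpf_max]])
    y = np.concatenate([tpf[keep], [t_cut]])
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)
```

A perfect classifier attains the maximum possible value fpf_max (0.1 here), while the chance diagonal yields fpf_max²/2.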
In our survey of intrinsic dimensionality estimation (Yourganov et al., 2011), we have used this
method as our "gold standard" method of estimating the dimensionality in simulated data. In our
simulated data, it is not always possible to say what is the "true" intrinsic dimensionality. When
ρ=0.99, the active areas are strongly coupled and can be described with a single PC, therefore it
could be argued that the true dimensionality in this case is 1. In the other extreme case, when
ρ=0, the active areas are mutually independent and the true dimensionality is the number of
active areas (sixteen). However, it is unclear what the true dimensionality is when ρ=0.5.
Therefore, we do not use the idea of "true dimensionality" when evaluating the methods of
dimensionality estimation; instead, we see how closely their results match the estimates obtained
with ROC area optimization.
Chapter 5 Intrinsic dimensionality estimation: results
5.1 Simulated data
Dimensionality estimation methods have been tested on a large number of artificial fMRI data
sets. Simulations are described in Section 2.2.1. Each data set contains 200 volumes organized
into alternating "activation" and "baseline" epochs, with 10 volumes per epoch. Volumes in
“baseline” epochs are constructed by adding spatially smoothed Gaussian noise to a background
structure. Volumes of the “activation” epochs are constructed in the same way, with an addition
of Gaussian signal at 16 specific “active” areas. We call this signal “task-related”, because it is
present in the “activation” epochs and absent in “baseline” epochs. Time courses from every
pixel have been convolved with a model HRF. Because of the slow time-course of the
hemodynamic response, we have discarded the first 2 volumes of each epoch, leaving us with
160 volumes in total.
We have used these simulated data to study the impact of various parameters of task-related
signal on the estimates of dimensionality. These parameters are: mean magnitude of the active
signal (M), ratio of variance of this signal to the variance of background noise (V), and
correlation of timecourses across the 16 nodes of our simulated signal network (ρ). We have
varied M from 0 to 0.05 (in increments of 0.01), V from 0.1 to 1.6 (in increments of 0.25); ρ was
set to the values of 0, 0.5, and 0.99. For each setting of (M, V, ρ), we have constructed 100
artificial data sets as described above, and an additional group of 100 “false positive” data sets
consisting of 200 “baseline” epochs.
We have discovered that, in fact, many methods give the same estimates of dimensionality
irrespective of M, V and ρ. Other methods, however, are more sensitive: when the task-related
signal is stronger and/or the network coupling is more pronounced, this is reflected in smaller
estimates. Our results have been reported in a publication (Yourganov et al., 2011).
5.1.1 Analytic methods
We have tested a number of analytic methods of dimensionality estimation: optimization of
Akaike Information Criterion (AIC), Minimum Description Length (MDL), Stein's Unbiased
Risk Estimator (SURE), and Bayesian Evidence. The first three methods have been implemented
in MATLAB, and the corresponding cost functions have been evaluated for the range of PC
dimensionalities from 1 to 160. Bayesian evidence has been computed by the MELODIC
software (a part of the FSL package for fMRI data analysis); this software has computed
Bayesian evidence for a range of K from 1 to 103. In the results presented below, there is a
difference between Bayesian evidence (as a part of MELODIC software) and other cost
functions (implemented by us in Matlab). This difference might be due to the additional
processing steps that MELODIC takes, such as eigenspectrum adjustment. It would be
interesting to test our own implementation of Bayesian evidence, rather than test MELODIC as a
black box; however, this testing was not done because of time limitations.
Figure 5.1. Normalized cost function of several methods of dimensionality estimation.
The results are displayed in Figure 5.1, where we plot the cost function for each PC
dimensionality. Cost functions are normalized to the [0...1] range. The number of PCs that
minimizes the cost function is the estimate of intrinsic dimensionality. Dashed lines correspond
to the analytic methods of estimation; AIC and MDL cost functions are plotted together, because
they are virtually identical except for the scale factor. Solid lines show the cost functions for two
empirical methods (PRESS and generalization error), which are discussed in the next section.
The figure plots the results for the data set created using the following parameters: M = 0.01, V =
0.1, and ρ = 0. The plot of cost functions for other values of M, V and ρ looks identical, which
suggests that the four analytic methods give the same dimensionality estimates irrespective of the
underlying network structure and of signal-to-noise ratio in the data.
With the exception of Bayesian evidence optimization, these methods tend to produce inflated
estimates of intrinsic dimensionality. Cost functions for AIC/MDL and SURE tend to decrease
with increasing K until they reach the point of saturation (AIC/MDL is the first method to reach
this point, at K = 140). It is interesting to observe the similarity between SURE (an analytic
method) and PRESS (an empirical method). Optimization of PRESS is equivalent to
minimization of squared difference between the data matrix and its approximation computed
using a subset of principal components. Optimization of SURE is similar conceptually, but,
instead of trying to approximate the data matrix as closely as possible, we try to minimize the
squared difference between the "true" PPCA model and its maximum-likelihood estimate. The
similarity between these two cost functions indicates that PPCA provides a fairly good
representation of our simulated data (except for the fact that our simulations involve HRF
convolution).
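The quantity underlying PRESS-style criteria can be illustrated with a short NumPy sketch (an illustration, not the thesis's MATLAB code): the squared error of a rank-K PCA reconstruction of the data matrix. Without a leave-one-voxel-out safeguard, this training-set error can only decrease as K grows, mirroring the monotone AIC/MDL and SURE curves in Figure 5.1:

```python
import numpy as np

def lowrank_sse(X, K):
    """Squared error between data matrix X (time x voxels) and its
    reconstruction from the first K principal components."""
    Xc = X - X.mean(axis=0)                        # center each voxel
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    Xk = (U[:, :K] * s[:K]) @ Vt[:K]               # rank-K reconstruction
    return float(np.sum((Xc - Xk) ** 2))

rng = np.random.default_rng(0)
X = rng.standard_normal((160, 300))                # pure-noise "data set"
errors = [lowrank_sse(X, K) for K in (1, 10, 50, 160)]
# On the training set the error only decreases as K grows,
# vanishing once all components are retained.
print(errors)
```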
Analytical methods do not involve evaluation on an independent test set and can be thought of as
"optimization on the training set". Indeed, in order to get the "best" approximation of the training
set, we should use all of the principal components; this is reflected in Figure 5.1, where the
cost function of AIC/MDL and SURE steadily decreases as we increase the number of
components. The "saturation" of the cost function when K>140 is perhaps due to the temporal
autocorrelation in the data imposed by HRF convolution.
The cost function for Bayesian evidence behaves quite differently from AIC/MDL and SURE. It
must be noted that we used the FSL MELODIC software to compute Bayesian evidence,
whereas the other analytical methods were implemented by us in Matlab; therefore, the
difference might be due to implementation as well as due to conceptual difference between the
cost functions. For instance, prior to dimensionality estimation, MELODIC performs high-pass
temporal filtering and adjustment of eigenspectrum in order to account for limited sample size
(Beckmann & Smith, 2004). It is interesting to note that the dimensionality estimate obtained
with this method is 16, which is the number of active regions in our simulations. The goal of
MELODIC is to represent the data matrix as a linear combination of sparse sources of signal
(Beckmann et al., 2005); we can say that MELODIC identifies the sparse sources (active areas)
correctly. It fails to detect the fact that these sources are correlated into a single spatial network;
this is consistent with reports that MELODIC splits functional cortical networks into different
"sources" (e.g. Abou-Elseoud et al., 2010). Therefore, it is a bad choice of analysis if we are
interested in analyzing functional networks.
5.1.2 Empirical methods: PRESS and Generalization error
Figure 5.1 shows results of some empirical methods, in addition to analytic methods. The
empirical methods shown here are optimization of PRESS and of generalization error (GE in
short), computed for the same data set. The corresponding cost functions are normalized to
[0...1] range. As described in the previous section, the PRESS method behaves in much the same
way as the cost functions for SURE, AIC and MDL methods. Optimization of GE gives us a cost
function with very different behaviour. It has a minimum at K=1, and then rises sharply.
The difference between these two empirical methods could be due to the difference in their
"training/test" framework. GE optimization utilizes the resampling framework common in
machine learning, when a subset of observations is held out during the training stage, to be used
later for testing the trained model. PRESS optimization does not use a test set in the traditional
sense; rather, it tries to build a voxel-wise approximation of the training set (making sure that the
voxel which is to be approximated is not used to compute the approximation; see Section 4.3.5).
Figure 5.1 shows that this approach is very similar to optimization of SURE.
GE optimization tries to identify the PC subspace that is common to the training and test sets.
Our results indicate that this subspace is limited to the first principal component, that is, the
direction of maximum variance in the data. This variance could be driven by the difference
between the "active" and "baseline" volumes (GE optimization is an unsupervised method, that
is, it does not account for the fact that the data come from two classes). It could be argued that
our active network is indeed one-dimensional when ρ>0, making GE optimization suitable for
this situation. However, when ρ=0, the signal is independent across loci and therefore cannot be
considered truly one-dimensional. Overall, the understanding of GE optimization requires some
further analysis (including simulations of multi-dimensional networks).
5.1.3 Empirical methods: Reproducibility and Classification Accuracy
The methods discussed so far in this chapter give very consistent estimates of intrinsic
dimensionality of simulated data, irrespective of the signal parameters. The cost functions
displayed in Figure 5.1 are virtually identical for all levels of M, V and ρ (with one exception:
Bayesian evidence optimization shows a slight tendency to increase with growing V, as shown
below). No matter whether loci of activation form a spatial network or are mutually independent,
the cost function is optimized with the same number of components.
Figure 5.2. Reproducibility and classification accuracy for linear and quadratic
discriminants, as a function of number of principal components, for two simulated data
sets. Left and right plots correspond to a weak (V=0.1) and strong (V=1.6) variance of the
signal, respectively, for moderate connectivity of ρ=0.5.
However, other empirical methods are much more sensitive to the spatio-temporal structure of
the signal. This sensitivity has been observed when we have used cost functions that directly
measure the performance of our data analysis, using such metrics as reproducibility of spatial
maps, accuracy of predicting the task, and area under ROC curve. Figure 5.2 shows how
reproducibility and classification accuracy vary with K, for two methods of analysis (linear and
quadratic discriminant). Left and right plots correspond to two artificial data sets. Both sets
contain relatively strong task-related signal (M =0.03), and a moderate level of spatial correlation
across active loci (ρ =0.5). What is different in the two plots is the temporal variance of the task-
related signal: it is very small (V=0.1) in the data set displayed on the left plot, and quite large
(V=1.6) in the data set displayed on the right plot.
This difference in V has a marked effect on the performance of both linear and quadratic
discriminants. For small V (left plot), both performance metrics are optimized when we use the
number of components that lies somewhere between 7 and 10. For large V, we get much better
performance when just the first component is used. Reproducibility of LD and QD has a clear
maximum at K=1, and classification accuracy is optimized when K is no more than 3.
The difference in cost functions for different values of V can be explained by the effect of V on
the shape of the eigenspectrum of the sample covariance matrix. This is demonstrated in Figure
5.3, which shows the first 10 eigenvalues of the sample covariance matrix. The network
correlation ρ is 0.5, M is set to 0.01 (upper panel) and 0.03 (lower panel), V varies from 0.1 to
1.6; the plot shows the eigenvalues averaged across 100 simulated data sets. We can see that, for
high values of V, the first eigenvalue is clearly separated from the rest. If V is not sufficiently
high, the first eigenvalue is not distinct from the rest of the spectrum.
Figure 5.3. Plot of the first 10 eigenvalues of the covariance matrix of a single data set, for
M = 0.01 (top) and M = 0.03 (bottom); ρ is set to 0.5, and V varies from 0.1 to 1.6 in
increments of 0.5. Eigenvalues are averaged across 100 simulated data sets.
The effect of V is most evident in the first eigenvalue, and close to negligible in the remaining
spectrum. This happens because the active areas form a single correlated network, which, in the
absence of noise, would be fully captured by the first principal component. In our simulations,
modification of V changes the variance of the active signal, but the variance of the noise is kept
constant. The increase of V leads to the increase of the total variance that is due to the active
signal, and this, in turn, increases the proportion of variance explained by the first principal
component (that is, the magnitude of the first eigenvalue).
This shift in eigenspectrum shape is analyzed in a paper by Hoyle and Rattray (2004), where it is
described as a phase transition of the eigenspectrum. Their simulated data contained a
multivariate-Gaussian latent variable embedded in white Gaussian noise, which is similar to our
simulations. The latent variable can be thought of as a symmetry-breaking direction: white
background noise is symmetrical in data space, and the latent variable defines the direction of
asymmetry. The authors studied the effect of the variance along this symmetry-breaking
direction (which is similar to our definition of relative signal variance V) on the eigenspectrum.
For a given number of N-dimensional data samples, there is a critical value of variance along the
symmetry-breaking direction; if the variance is above it, the first eigenvalue is separated from
the remaining eigenspectrum.
This phase transition is described in some earlier papers in statistical physics (Biehl & Mietzner,
1994; Watkin & Nadal, 1994; Reimann et al., 1996; see also a later paper by Hoyle & Rattray,
2007). It is usually formulated using α, the ratio of sample size to the number of variables (which
is the number of voxels for fMRI data). If α is above a certain critical threshold, it is possible to
learn the symmetry-breaking direction from the data: this direction is given by the first principal
component. If α is below the critical level, such learning is impossible. Earlier results by Biehl
and Mitzner (1994) show that this critical value depends on contrast-to-noise ratio as well as on
the variance along the symmetry-breaking direction. The order parameter of the phase transition
is the angle that the first eigenvector of the covariance matrix forms with the symmetry-breaking
direction. Below the critical value of α, the expected value of this angle is zero, i.e. we don't have
enough training examples to learn this direction. Above the critical value, the angle is non-zero
and eventually approaches 1.
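The transition can be illustrated numerically. In this NumPy sketch (an illustration of the phenomenon, not a reproduction of the cited analyses), the overlap between the first sample eigenvector and a known symmetry-breaking direction grows from near zero towards 1 as the number of samples n increases past the critical ratio:

```python
import numpy as np

def overlap(n, p=400, extra_var=2.0, seed=0):
    """|cosine| between the first sample eigenvector (from n samples in
    p dimensions) and the true symmetry-breaking direction."""
    rng = np.random.default_rng(seed)
    w = np.zeros(p)
    w[0] = 1.0                                   # symmetry-breaking direction
    X = rng.standard_normal((n, p))              # isotropic background noise
    X += np.sqrt(extra_var) * rng.standard_normal((n, 1)) * w
    _, _, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
    return abs(float(Vt[0] @ w))

overlaps = [overlap(n) for n in (50, 400, 4000)]
print([round(o, 2) for o in overlaps])   # grows towards 1 with sample size
```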
The effect of V on our performance metrics can be explained with this phase transition. If the
network correlation ρ is above zero, our activation signal is essentially one-dimensional, because
most of the task-related variance can be explained by the first principal component. For low V,
the number of training examples is not enough to learn this one-dimensional network of
activation signal. If V is high enough, learning is possible because we have passed through the
phase transition and the number of examples is sufficiently high to capture the active network in
the first principal component. In this situation, adding more components is equivalent to
overfitting, and performance degrades when the number of components is more than one (see
Figure 5.2). This phase transition will be further explained in the next section.
5.1.4 Summary of performance on simulated data
Figure 5.4 shows the dimensionality of simulated data estimated by the methods described
above. The value of M is fixed at 0.01 (Figure 5.4 A) and at 0.03 (Figure 5.4 B). The three panels
from left to right in A and B correspond to levels of long-range spatial correlation in the network
(ρ=0, ρ=0.5, ρ=0.99), and the horizontal axis shows the relative signal variance, V, for each
panel. The plots record the median estimate of dimensionality, measured across 100 data sets,
and the error bars show the 25%-75% percentile range of the estimates. Error bars are displayed
only for the smallest (0.1) and the largest (1.6) levels of V, and there are no error bars for ROC
optimization curves because all 100 data sets are used to generate a single ROC curve for each
plot value.
This figure shows the estimates of dimensionality that are obtained by optimization of
performance of linear and quadratic discriminants. Three metrics of performance are used:
reproducibility of spatial maps, accuracy of classification, and partial area under the ROC curve.
For comparison, we also show the estimates obtained by optimization of Bayes evidence (as
computed by MELODIC software). We do not include other methods of estimation described in
chapter 4 (optimization of AIC, MDL, SURE, PRESS and GE criteria). As described in Sections
5.1.1 and 5.1.2, these methods give consistent estimates that are not influenced by M, V and ρ.
Plots of dimensionality estimates clearly show the transition of dimensionality that happens with
the emergence of a spatial active network in the background noise. This transition happens
when ρ is above zero, and when V reaches a certain threshold. At this point the network is
effectively captured in the first principal component, and the performance metrics are optimized
Figure 5.4. Median dimensionality estimates in simulations, as calculated by various
methods (see legend and text), shown as a function of the relative signal variance, V,
defined as the variance of the amplitude of the Gaussian activation blobs relative to the
variance of the independent background Gaussian noise added to each voxel. M is set to
0.01 for the top row and to 0.03 for the bottom row. The three panels from left to right in A
and B show three increasing levels of correlation, ρ, between Gaussian activation blob
amplitudes. Range bars on the first (V=0.1) and last (V=1.6) data points reflect the 25%–
75% interquartile distribution range across 100 simulation estimates.
at K=1. This is more evident when M = 0.03 (bottom row of Figure 5.4 B), where all
performance metrics converge and the signal is estimated as one-dimensional. When M = 0.01,
LD cannot robustly classify the simulated data, and the optimization of classification produces
highly variable dimensionality estimates (the variability in estimates increases as V grows). QD
is a much better classifier at such low levels of M, getting better with increasing V, due to the
fact that QD (unlike LD) is sensitive to the difference in covariance matrices of “active” and
“baseline” epochs, and this difference increases with V. Therefore, at low levels of M, estimates
of dimensionality obtained with QD are more robust compared to LD.
For both LD and QD, optimization of reproducibility of spatial maps is a method that is quite
sensitive to the emergence of the spatial activation network. For non-zero ρ, this method robustly
estimates the one-dimensional signal subspace if V is above the critical level. This critical level
depends on both M and ρ, and is somewhere between V=0.5 and V = 1. There is a good
correspondence in dimensionality estimates given by optimization of reproducibility and by
maximization of ROC area. Therefore, when we select K to optimize reproducibility of maps, we
also optimize signal detection of LD and QD. In the case of LD, optimization of classification
accuracy does not necessarily optimize signal detection when M is low. Using analytical methods
such as optimization of Bayesian evidence does not optimize signal detection because this
method fails to discover the one-dimensional spatial network in the data.
When ρ=0, the spatial network of activation cannot be captured by the first principal component.
All performance metrics are optimized with K>1. Performance of QD is optimized using a lower
number of components relative to LD, reflecting the difference in the degrees of freedom in the
LD and QD models (QD has roughly twice as many degrees of freedom as LD). In the
situation when ρ=0, using Bayesian evidence will not be as damaging to signal detection as it is
when ρ>0.
The transition in intrinsic dimensionality can be also demonstrated using the notion of global
signal-to-noise ratio (gSNR), introduced in Section 2.1.2. Figure 5.5 shows a plot of intrinsic
dimensionality (estimated by optimizing reproducibility) versus gSNR. Spatial maps are
computed with linear discriminant (A) and with quadratic discriminant (B). On this plot, V is
varied between 0.1 and 1.6, ρ is at 0, 0.5 and 0.99, and M takes the values of 0, 0.01, 0.02, 0.03
and 0.05. Each marker displays the average dimensionality and gSNR across 100 simulated sets.
Size of the marker indicates V, shape of the marker encodes ρ, and colour encodes M. The plot
demonstrates the asymptotic relationship between gSNR and intrinsic dimensionality. If gSNR
(which measures the strength of reproducible signal) is high, the intrinsic dimensionality is
estimated as K=1. The structure of task-related signal is apparent in the data, because the first
eigenvalue clearly stands out in the eigenspectrum. As we lower gSNR by manipulating the
covariance matrix (i.e. by lowering V and/or ρ, keeping M constant), there comes a point of
phase transition. The first principal component loses its privileged position, and single-
dimensional representation of the data as one correlated network is broken up into apparently-
independent subnetworks. A large number of principal components is required for this multi-
dimensional representation. The phase transition can be seen as a loss of structure in the
representation of the data when gSNR falls below the critical level.
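For concreteness, the split-half computation behind this plot can be sketched as follows. The NumPy fragment below (illustrative, not the thesis code) assumes the NPAIRS definition gSNR = sqrt(2r/(1−r)), where r is the Pearson correlation between the two split-half spatial maps (see Section 2.1.2):

```python
import numpy as np

def reproducibility_and_gsnr(map1, map2):
    """Split-half reproducibility r (correlation of two independently
    derived spatial maps) and gSNR under the assumed NPAIRS definition."""
    r = float(np.corrcoef(map1, map2)[0, 1])
    return r, float(np.sqrt(2 * r / (1 - r)))

rng = np.random.default_rng(2)
true_map = rng.standard_normal(500)              # common "signal" map
m1 = true_map + 0.7 * rng.standard_normal(500)   # split-half map 1
m2 = true_map + 0.7 * rng.standard_normal(500)   # split-half map 2
r, gsnr = reproducibility_and_gsnr(m1, m2)
print(round(r, 2), round(gsnr, 2))
```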
Figure 5.5. Asymptotic relationship between global signal-to-noise ratio (gSNR) and
dimensionality that optimizes reproducibility, for linear (A) and quadratic (B) discriminant
maps. Marker size indicates relative signal variance, V, from 0.1 (small) to 1.6 (large). Five
colours encode five different levels of M, and the spatial correlation is encoded by different
symbols.
The critical level of gSNR (where the phase transition occurs) depends on the value of M.¹³ For
LD, it is around 0.5 for M =0.01 and around 1 for M =0.03. For QD, the transition is smoother
and critical levels are harder to determine, although they seem to be close to the corresponding
critical levels in LD. For very strong M (0.05), it seems that all of our examples have gSNR
¹³ It is important to keep in mind that our definition of M uses our knowledge of the "ground truth" in simulated data, whereas gSNR requires no such knowledge. M is specific to the simulation procedure, and gSNR is computed from the reproducibility of spatial maps and is therefore specific to the method of data analysis, such as LD.
greater than the critical level, because the dimensionality is estimated as K=1 or K=2 for all
levels of ρ and V; therefore, the phase transition does not occur when M=0.05.
5.1.5 Effect on data analysis
Linear and quadratic discriminants, operating on a subspace of principal components of the data,
require careful selection of the PC subspace size. A non-optimal choice of PC dimensionality
leads to poor performance of LD and QD. This is especially evident when we retain all principal
components for our analysis, without discarding the noisy PCs. The paper by Mourao-Miranda et
al. (2005) compares the performance of LD on a full PC basis to support vector machines, and,
indeed, LD performs very poorly. However, when noisy PCs are discarded, the performance of LD
is comparable to SVM, as shown by LaConte et al. (2005).
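The LD-PC pipeline itself is compact. The following NumPy sketch (illustrative; regularization details differ from our actual implementation) projects the data onto the first K principal components and fits a two-class linear discriminant with pooled covariance and equal priors there:

```python
import numpy as np

def ld_pc_train(X, y, K):
    """Two-class linear discriminant on the first K principal components."""
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    V = Vt[:K]
    Z = (X - mu) @ V.T                           # K-dimensional PC coordinates
    m0, m1 = Z[y == 0].mean(axis=0), Z[y == 1].mean(axis=0)
    S = (np.cov(Z[y == 0].T) + np.cov(Z[y == 1].T)) / 2   # pooled covariance
    w = np.linalg.solve(np.atleast_2d(S), m1 - m0)
    b = -0.5 * float(w @ (m0 + m1))
    return mu, V, w, b

def ld_pc_predict(model, X):
    mu, V, w, b = model
    return (((X - mu) @ V.T) @ w + b > 0).astype(int)

rng = np.random.default_rng(3)
X0 = rng.standard_normal((80, 50))
X1 = rng.standard_normal((80, 50)) + 0.8         # class 1 shifted in all voxels
X, y = np.vstack([X0, X1]), np.repeat([0, 1], 80)
model = ld_pc_train(X, y, K=5)
print((ld_pc_predict(model, X) == y).mean())     # training accuracy
```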
Figure 5.6 shows how the performance of a linear discriminant is affected by dimensionality
estimation. Here, the performance is measured with the partial area under a ROC curve, for false
positive frequency between 0 and 0.1. Separate ROC curves have been constructed for each
center of activation using LABROC software (Metz et al., 1998). The plot shows the average
partial area across 16 active loci, with error bars indicating standard deviation across loci.
Several methods of dimensionality estimation have been used to determine the number of
components to be used in linear discriminant analysis. We use the MELODIC and GIFT
software packages to estimate dimensionality with optimization of Bayesian evidence and MDL,
respectively. The performance of univariate General Linear Model (GLM) is also shown for
reference. The dashed line indicates the performance of a randomly guessing detector, where true
and false positives are equally likely.
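Partial ROC area can be computed directly from the two score distributions. The sketch below (illustrative NumPy; the thesis used the LABROC software) integrates the empirical true-positive rate over false-positive rates in [0, 0.1], so a perfect detector scores 0.1 and a randomly guessing one 0.005:

```python
import numpy as np

def partial_auc(scores_null, scores_active, fp_max=0.1, grid=1000):
    """Area under the empirical ROC curve restricted to false-positive
    rates in [0, fp_max]; a perfect detector attains fp_max."""
    fprs = np.linspace(0.0, fp_max, grid)
    # threshold achieving each FPR = upper quantile of the null scores
    thr = np.quantile(scores_null, 1.0 - fprs)
    tprs = np.array([np.mean(scores_active > t) for t in thr])
    return float(np.sum((tprs[1:] + tprs[:-1]) / 2 * np.diff(fprs)))

rng = np.random.default_rng(4)
null = rng.standard_normal(10000)                 # inactive-voxel scores
active = rng.standard_normal(10000) + 2.0         # active-voxel scores
pa = partial_auc(null, active)
print(round(pa, 3))   # well above the chance level of fp_max**2 / 2 = 0.005
```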
Among the methods of dimensionality estimation used in Figure 5.6, reproducibility
optimization and classification optimization are the two methods that are influenced by global
signal-to-noise ratio. Cost functions for the other three methods (MDL, generalization error and
Bayes evidence) have the same shape irrespective of gSNR. These functions are plotted on
Figure 5.1; generalization error is optimized when K=1, MDL is optimized with a very high number of
PCs (K=140), and Bayesian evidence is optimized for values of K between 15 and 17 (which
roughly corresponds to our number of active loci, 16). As we can see in Figure 5.6, optimization
of MDL is the most disadvantageous method to estimate K: when LD is performed using 140
principal components, it performs approximately at the level of random guessing. MDL
optimization severely overestimates the dimensionality; 140-dimensional representation of data
is dominated by noise.
Figure 5.6. Partial ROC area (corresponding to false positive frequency range of [0…0.1])
as a function of the relative signal variance, V, calculated for linear discriminant (LD, on
the principal component subspace, with subspace size selected by various methods), and for
univariate general linear model (GLM). M is set to 0.01 for the top row (A) and to 0.03 for
the bottom row (B). The three panels from left to right in A and B show three levels of
correlation, ρ, between Gaussian activation blob amplitudes. Error bars show standard
deviation across 16 active loci (centers of Gaussian activation blobs).
Estimates of K obtained with optimizing Bayesian evidence are much more reasonable, and LD
always performs well above chance when these estimates are used. However, this method can be
recommended only when the activation signal is independent across the 16 active loci (that is,
when ρ=0). In this situation, the data are best described by the model that uses 16 principal
components, which is close to the estimates given by Bayesian-evidence optimization. However,
when ρ>0, this method of estimation is clearly sub-optimal (except for the lowest levels of signal
variance V). The obtained estimates cause LD to perform below the level of univariate GLM,
whereas optimization of reproducibility, classification, or generalization results in LD
performing better than GLM and, in some cases, near the theoretical maximum of signal
detection (corresponding to partial ROC area of 0.1).
Optimization of generalization is a good strategy when the activation signal is correlated across
loci (ρ>0). In this situation, the active loci form a single spatial network; therefore, the data can
be adequately modeled with K=1. This is especially true for strong mean signal (M=0.03);
performance of LD with K=1 (which is the value that optimizes generalization) is close to
perfect. For weaker mean signal (M=0.01), performance is more modest but nevertheless the
highest of all dimensionality-estimation methods we have used.
There is some variability in the estimates of K that optimize reproducibility; this variability is
greater for weak mean signal (M=0.01). For both levels of M, as we increase V and ρ, K is
estimated to be 1 in a larger proportion of simulated data sets. This trend results in better ROC
performance of LD for growing V and ρ. The same trend is observed in LD when it is combined
with optimization of classification. The variability of estimates here is even higher than in
reproducibility optimization; nevertheless, for M=0.03 performance is comparable to the level
obtained with optimization of reproducibility. When M=0.01, ρ=0.99, and V>1, LD performs at a
high level for three methods of dimensionality estimation (generalization, reproducibility and
classification).
5.2 Intrinsic dimensionality estimation in real data
Intrinsic dimensionality has also been estimated on the real fMRI data sets described in Sections
2.2.1 and 2.2.2. ROC-type analysis cannot be performed on real data, so we cannot use it to
answer which method of estimation is better for signal detection. However, we have seen the
same asymptotic relationship between gSNR and optimal PC dimensionality as we have seen in
simulated data (Figure 5.5). For the data from the stroke study, this relationship is demonstrated
in Figure 5.7 A.
The vertical axis shows the number of PCs that maximizes reproducibility of LD maps in each
individual subject, and the horizontal axis shows the gSNR that corresponds to this maximum
reproducibility. Each marker corresponds to an individual subject. LD analysis has been
performed to classify the fMRI volumes according to the behavioural task ("finger tapping"
versus "wrist clenching"). Although the number of subjects is too small to make strong
statements, the pattern here is similar to the one observed in the simulated data (Figure 5.5):
subjects with greater gSNR require, in general, a smaller number of principal components to
optimize map reproducibility. This does not hold across all subjects; for example, subject S054
uses more principal components than subject S059 and yet achieves greater reproducibility, and,
therefore, greater gSNR. It is possible that we observe two asymptotic curves here, one with
subjects S059, S103 and S145, and the other with subjects S054 and S090; the remaining
subjects could belong to either asymptotic curve. LD classification accuracy for subjects S059,
S103 and S145 is lower (90%) than on subjects S054 and S090 (96%). This difference could be
driven by different levels of magnitude of underlying task-related BOLD signal. In simulated
data, we have shown that varying M can shift the asymptotic curve along the horizontal axis (see
Figure 5.5).
Figure 5.7. Asymptotic relationship between global signal-to-noise ratio (gSNR) and
optimal dimensionality in real fMRI data: stroke study (A) and aging study (B). Each
marker indicates a subject.
This asymptotic relationship has been observed in all of our data sets, for all classification tasks.
For example, Figure 5.7 B demonstrates this relationship for the aging study data. Here, each
marker represents one subject, with age groups encoded with marker types. LD has been used to
classify each subject's volumes according to the task (delayed matching versus fixation), and to
build the corresponding spatial maps. Reproducibility of these maps has been used to estimate
intrinsic dimensionality and to compute gSNR. Compared with the stroke set displayed in
Figure 5.7 A, the gSNR values are larger and the estimated dimensionality is overall on a smaller
scale. This can be explained by the stronger contrast in the aging-study set, perhaps produced by
the stronger magnetic field (3T here versus 1.5T in the stroke study set). Within each age group,
the asymptotic relationship tends to hold. The overlap across the age groups is large, but it can
nevertheless be observed that subjects from the youngest age group exhibit smaller gSNR values
and larger dimensionality, relative to older subjects.
Figure 5.8. Optimal dimensionality and global SNR in a group study. Each marker
indicates an age group; each solid line indicates a task that is contrasted with fixation.
This asymptotic relationship is not specific to within-subject analysis; we have also observed it
in a group analysis (Yourganov et al., 2011). The data comes from a study of cognitive aspects of
aging (described in Grady et al., 2006); its design is similar to the aging study that we have used
for our evaluation. In the 2006 study, the participants come from 3 age groups (young: 20-30
years, 12 subjects; middle-aged: 40-60 years, 12 subjects; old: 65-78 years, 12 subjects). They
are presented with black line drawings of nameable objects, and words corresponding to names
of objects. The experiment consists of two “shallow” memory-encoding tasks, two “deep”
encoding tasks, and two recognition tasks. During the two shallow-encoding tasks, the subjects
are asked to report whether the pictures are large or small, and whether the words are printed in
upper or lower case. During the two deep-encoding tasks, the subjects are asked to determine
whether the pictures (or words) corresponded to living or non-living entities. During the two
recognition tasks, the subjects are instructed to report whether or not they had seen the presented
stimuli (pictures or words) previously. The tasks are performed during 24-second epochs, which
are interspersed with fixation blocks of equal length.
fMRI data has been acquired on a 1.5 T scanner. The analysis has consisted of LDA classification
of fMRI volumes according to the task; there are 6 contrasts, each active task being contrasted
with fixation. The volumes have been pooled across the subjects. Figure 5.8 plots the
dimensionality that optimizes reproducibility of maps versus global SNR, for each of the six
contrasts and 3 age groups. The same asymptotic relationship between gSNR and dimensionality
can be observed here. In the young group, dimensionality is lower and gSNR is higher than in
the two older groups. In within-subject analysis, we observe the opposite trend, with younger
subjects having higher dimensionality and lower gSNR (see Figure 5.7B); this could be
explained by the hypothesis that younger subjects form a more homogeneous group, although the
variability in within-subject BOLD data is higher than in the older subjects (for discussion of
BOLD variability in different age groups, see Garrett et al., 2012).
5.3 Lessons learned
In this chapter, we have evaluated several methods of estimating the intrinsic dimensionality of
the fMRI data, that is, of determining which principal components contain relevant signal and
should be retained for further analysis, and which PCs contain noise and should be discarded.
This step is important for multivariate Gaussian classifiers that use PCA regularization; for
example, the performance of linear discriminant can be greatly influenced by the number of
retained PCs (see Figure 5.6, as well as LaConte et al., 2003, and Yourganov et al., 2011).
Methods for estimating intrinsic dimensionality of the data can be categorized into two groups:
analytic and empirical methods. Methods in the first group determine the number of principal
components that produce an approximation of the data, which is optimal for some information-
theoretical criterion. We have tested optimization of the following criteria: Akaike information
criterion, minimum description length, Stein's unbiased risk estimator, and Bayesian evidence
(computed using Laplace approximation). All of these methods, except for Bayesian evidence,
are optimized when an unreasonably large number of components are used (in our simulations,
the optimum was reached when 87.5% of PCs were retained). This estimate of dimensionality is
virtually useless: when LD-PC operates on such a high number of PCs, its performance is
barely above chance (see Figure 5.6). This over-estimation is consistent with results reported by
Cordes & Nandy (2006); they have demonstrated that temporal autocorrelation in the noise
inflates the estimates given by analytical methods such as AIC and MDL. Optimization of
Bayesian evidence, as implemented in the MELODIC FSL package, produces much more
reasonable estimates of dimensionality (retaining 10% of PCs in our simulations). This is
perhaps not an inherent advantage of Bayesian evidence optimization per se, but is due to the
way it is implemented in FSL MELODIC: before the dimensionality estimation is carried out,
the eigenspectrum of the data is adjusted to account for limited sample size (see Beckmann &
Smith, 2004). When LD-PC is operating on the PC subspace defined with this method, its signal
detection is well above chance; in the situations when the simulated sources of signal are
independent of each other, this method is the best performer (Figure 5.6). However, when the
sources are spatially correlated, PC subspace size is estimated better with empirical methods.
The empirical methods select the number of PCs that optimize the actual performance on the
independent test set. We have evaluated the following performance metrics: classification
accuracy of LD-PC, reproducibility of LD-PC maps, area under the ROC curve computed for
LD-PC maps, predicted sum of squares (PRESS), and generalization error of the probabilistic
PCA model. Of these methods, optimization of PRESS gives highly inflated dimensionality
estimates, comparable to some analytic methods (AIC, MDL, SURE). This method is also very
computationally intensive: for a J×N data matrix, it needs to compute J×N singular value
decompositions. This peculiar behaviour of PRESS is perhaps due to the fact that PRESS
optimization is theoretically different from the other empirical methods we have tested: it does
not attempt to separate the PCs into the "signal-containing" and "noise-containing" subspaces,
and does not use a held-out test set to optimize its cost function.
In our simulations, generalization error is optimized when 1 PC is used (in rare and seemingly
random cases, it is optimized with 2 PCs). This is a reasonable estimate when the simulated
sources are spatially correlated and form a single network, which is best described with a single
PC. This estimate is more questionable when the sources are uncorrelated (theoretically, the best
estimate of dimensionality in this case is the number of sources); however, this presumed
under-estimation of dimensionality does not impair signal detection when the mean signal is
strong. In this condition, LD-PC is not very sensitive to the number of PCs as long as they are in the
range from 1 to (approximately) the number of sources. For weak mean signal and uncorrelated
sources, optimization of generalization error is a suboptimal choice of dimensionality estimation
for LD-PC.
In our simulations, analytic dimensionality-estimation methods, as well as a subset of empirical
methods (optimization of PRESS and of generalization error), are not sensitive to the parameters
of simulated signal of interest (that is, reproducible signal which is different between classes).
Such sensitivity is observed in empirical methods that optimize performance of LD-PC.
Estimates of PC dimensionality are small when the signal is correlated spatially and its variance
is relatively large. In this situation, the signal of interest is captured by a small number of
components. This behaviour is observed for all three measures of LD-PC performance: area
under the ROC curve, classification accuracy and reproducibility. However, if the performance
itself is poor and highly variable, the dimensionality estimates that optimize it are unstable (for
example, this is observed when we optimize classification accuracy when M=0.01).
Reproducibility is a more robust measure of performance than classification accuracy; its
optimization gives more stable estimates.
Overall, we recommend selecting the PC subspace size that optimizes a combination of the two
performance metrics (reproducibility and classification accuracy). As proposed by Zhang et al.
(2008), the number of components is selected in order to minimize
\Delta = \sqrt{(1 - \text{med. reproducibility})^2 + (1 - \text{med. class. accuracy})^2} . \qquad (5.1)
The geometrical interpretation of this cost function can be demonstrated using the idea of
prediction-reproducibility plots (Section 2.1.4). A perfect classifier would have both
reproducibility and classification accuracy equal to 1; therefore, it is represented as the point that
has coordinates (1,1) in the prediction-reproducibility space. The Δ metric is the Euclidean
distance from the perfect classifier. By minimizing it, we attempt to find the PC subspace that is
both reproducible and predictive; it is more stable than optimization of either of the two metrics
by themselves (classification accuracy is particularly unreliable in the noisy data sets that are
inherently hard to classify). A study by Rasmussen et al. (2012B) also suggested optimization of
the Δ metric to select regularization parameters of multivariate classifiers that do not use PCA
regularization. That study demonstrated an advantage of the Δ metric
over prediction accuracy: when the regularization of classifiers was optimized for prediction
accuracy, the resulting spatial maps were sparser and less reproducible compared to Δ-optimized
regularization. This can serve as a reminder that in fMRI classification the quality of spatial
maps is just as important as classification accuracy.
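For concreteness, the selection rule defined by the Δ metric can be sketched in a few lines of Python (an illustrative sketch, not the code used in this thesis; the function and variable names are ours):

```python
import numpy as np

def delta_metric(reproducibility, accuracy):
    """Euclidean distance from the perfect classifier, which sits at
    (1, 1) in prediction-reproducibility space (Zhang et al., 2008)."""
    return np.sqrt((1.0 - reproducibility) ** 2 + (1.0 - accuracy) ** 2)

def select_subspace_size(med_reproducibility, med_accuracy):
    """Pick the number of retained PCs that minimizes the Delta metric.

    Both arguments are arrays of medians (taken across split-half
    resamples), one entry per candidate subspace size K = 1, 2, ...;
    returns the selected K."""
    delta = delta_metric(np.asarray(med_reproducibility),
                         np.asarray(med_accuracy))
    return int(np.argmin(delta)) + 1
```

A perfectly reproducible and perfectly accurate classifier yields Δ = 0; in practice, the minimum of Δ trades the two metrics off against each other.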
Chapter 6 Intrinsic dimensionality and complexity of fMRI data
In the previous chapter, we have used simulated data to show how estimation of intrinsic PC
dimensionality is influenced by the structure of the signal of interest (that is, the signal that is
reliably different between classes in a classification problem). Our measure of gSNR indicates
how easy it is to detect this signal in the data. If gSNR is high, the algorithms (such as LD and
QD) are better able to classify the data accurately and to produce reproducible maps. We have
demonstrated that higher gSNR is linked to lower intrinsic dimensionality (estimated by
optimizing map reproducibility or classification accuracy of linear and quadratic discriminants).
In this chapter, we present some behavioural correlates of PC dimensionality. We were motivated
by the result of McIntosh et al. (2008), where the intrinsic dimensionality of EEG signal (for a
specific channel) was estimated as the number of PCs that explain 90% of the trial-to-trial
variability. This dimensionality was measured for a group of children 8 to 15 years old, as well
as for adults, and has been shown to positively correlate (1) with age and (2) with behavioural
accuracy in a face memory task. These positive correlations were also observed for per-channel
multi-scale entropy, which is a nonlinear measure of signal complexity. These results suggest PC
dimensionality as a linear (and perhaps indirect) measure of complexity of EEG signal, which
increases with maturation. We have extended this idea to fMRI data, obtained from two studies:
a study of self-control and a longitudinal stroke recovery study; the results are described in this
chapter. In the first study, we have found significant difference in PC dimensionality between the
groups that exhibit different tendencies in self-control. In the second study, we have found that
PC dimensionality estimates (as well as some other measures of fMRI eigenspectrum) correlate
with behavioural measures of post-stroke recovery of motor function.
6.1 Intrinsic dimensionality in a study of self-control
In a paper recently published in Nature Communications (Berman et al., 2013), we examined PC
dimensionality in healthy adults with varying levels of self-control abilities. The participants
have previously been involved in a seminal study of self-control across the life span (Mischel et
al., 2011). At the age of 4, they were tested for their ability to control appetitive impulses:
they were presented with a choice, either to eat one marshmallow immediately, or to wait
for a period of time and be rewarded with two marshmallows. In adolescence, these participants
were assessed for their self-control ability using parental ratings; assessments were
repeated (with self-reported ratings) when the participants were in their 20s and 30s.
Participants who demonstrated better self-control at age 4 (that is, who could resist
immediate temptation on the marshmallow test) also showed, as adolescents, greater ability
to plan, to concentrate, and to cope with stress, as well as higher SAT scores. As adults, these
participants demonstrate higher educational achievement, higher subjective sense of self-worth,
and have lower rates of drug abuse. We separated the participants into two groups, "high
delayers" and "low delayers", according to whether the participant was above or below average
on the self-control measures.
Figure 6.1. Intrinsic dimensionality in the self-control study, for the subjects in the high-
delaying and the low-delaying groups. Dimensionality is estimated by optimization of LD
classification accuracy. Error bars represent standard errors across the subject group.
A subset of participants (12 from each group), now in their 40s, was recruited for an fMRI
study. During scanning, the participants were instructed to perform the directed-forgetting
task that tests one's ability to control the contents of working memory. There are two types of
trials: "lure" and "control". "Lure" trials require suppressing information in working memory
whereas "control" trials do not. The difference in behavioural performance between "lure" and
"control" trials can be used as an assay of one's ability to control the contents of working
memory. Behavioural performance does not differ significantly across the two groups as both are
equally impaired on "lure" trials relative to "control" trials in accuracy and reaction time. LD
analysis was performed to classify the fMRI volumes according to trial type; the accuracy of
this classification is not significantly different for the two groups. However, there is a significant
difference in the intrinsic dimensionality, that is, the number of principal components required to
optimize LD classification. Figure 6.1 shows the optimal number of PCs for each subject, as well
as the group averages. The high delayers, as a group, need a smaller number of PCs to optimize
LD classification; the low delayers show larger mean dimensionality as well as larger variability
in dimensionality. Univariate analysis demonstrated that subjects in both groups recruit the same
cortical areas in the directed-forgetting task, i.e., there is no significant group difference in the
magnitude of BOLD signal. However, recalling the results of our simulations, smaller values of
dimensionality in the high-delaying group might suggest that the signal of interest in the active
areas had more variability (V) and/or stronger network coupling (ρ). In addition, we have
constructed within-subject LD maps using the optimum-classification dimensionality estimates.
We can classify the within-subject maps according to the group with 71% accuracy using QD;
LD classification is lower (58%), because the group difference in homogeneity of the covariance
matrix requires a highly non-linear boundary to separate the two groups.
6.2 Complexity of cortical networks in fMRI study of stroke recovery
Using data from the longitudinal study of stroke recovery (Small et al., 2002), we studied the
relationship between PC dimensionality, computed on fMRI data, and improvement of motor
function, measured with standard behavioural tests. We were able to find the correlation of post-
stroke recovery with intrinsic dimensionality, although not all of our methods of dimensionality
estimation have demonstrated such correlation. Our results were published (Yourganov et al.,
2011).
Nine subjects recovering from a stroke in the motor area of the brain were tested in their ability
to perform hand movements, both with the healthy hand and with the hand impaired by the
lesion. All subjects were scanned at 4 sessions (taking place at 1, 2, 3 and 6 months post stroke).
Each session included behavioural assessment of recovery of motor function. Of the several
behavioural tests in the original study, we use the results of three tests (each performed
with both the healthy and the impaired hand):
1. Strength of the hand grip, measured with a dynamometer.
2. Strength of the pinch between thumb and index finger, measured the same way.
3. Performance on the nine-hole peg test, defined as 1/(time to complete).
Each test produces two behavioural measures of recovery: improvement of performance and final
performance. Improvement was calculated as the difference between the performance of the
impaired hand on the first and last sessions, divided by the mean (across all 4 sessions)
performance of the healthy hand. Final performance was computed as the performance of the
impaired hand on the fourth session, again divided by the mean performance of the healthy
hand. During the fMRI scanning sessions, subjects were
instructed to perform two tasks: wrist flexion/extension and tapping index finger and thumb
together. These movements are performed with the healthy as well as with the impaired hand, on
alternating runs. Each subject has been tested during four sessions, each session consisting of
behavioural evaluation and fMRI recording.
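The two behavioural measures can be computed directly from the session scores; a minimal sketch under the definitions above (the function name is ours):

```python
import numpy as np

def recovery_measures(impaired, healthy):
    """Improvement and final performance for one behavioural test.

    impaired, healthy: length-4 sequences of scores for the impaired
    and the healthy hand across the four sessions.  Both measures are
    normalized by the mean healthy-hand performance."""
    reference = np.mean(healthy)                 # mean across all 4 sessions
    improvement = (impaired[-1] - impaired[0]) / reference
    final = impaired[-1] / reference
    return improvement, final
```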
In an earlier paper (Schmah et al., 2010), we used these data to evaluate a number of
classification algorithms. fMRI volumes were classified into several categories: according to the
task (finger or wrist movement), to the side of the body (movement of the healthy hand or of the
impaired hand), and to the time elapsed since the stroke ("2 earlier sessions" versus "2 later
sessions"). Both linear and quadratic discriminants performed well relative to other methods of
classification. Quadratic discriminant is particularly good at differentiating between early and
late sessions; the accuracy of classification is at least 99% in all subjects. The PC dimensionality
that provided maximum classification accuracy of QD is highly variable across subjects, ranging
from 41 to 191 principal components. Encouraged by the high accuracy of classification (which
suggests that QD is the right model for early/late session classification), we correlated this
dimensionality with behavioural recovery measures (improvement as well as final performance
of all three tests).
Our results show a negative correlation between dimensionality estimates and performance on
the final session: Pearson's r is -0.56 for the pinch test, -0.69 for the grip test, and -0.74 for the
peg test, with p-values (uncorrected for multiple comparisons) of 0.11, 0.04 and 0.02,
respectively.
Figure 6.2. Scatter plots for four combinations of fMRI-based measures and behavioural
measures: QD dimensionality versus final peg test performance (A); generalization-error
dimensionality versus final pinch test performance (B); sphericity versus pinch test
improvement (C); spectral distance versus pinch test improvement (D). Each subject is
represented with a specific symbol.
We can speculate that the stroke recoverers with the highest behavioural performance on the
final session are the subjects that show the greatest recovery, therefore, the difference between
the early and the late sessions is more pronounced and can be captured by a small number of
PCs. However, correlation with behavioural improvement of performance in all three tests is not
significant (p>0.5), perhaps due to the small and anatomically heterogeneous sample. The scatter plot
of the final performance on the 9-hole peg test versus QD dimensionality is shown on Figure 6.2
A.
We have observed another interesting correlation: final performance is positively correlated with
dimensionality estimates obtained by minimizing generalization error ("GE dimensionality"). In
our simulations, this method consistently estimates the dimensionality as 1 principal component
(and, very rarely, as 2 PCs), irrespective of the M, V and ρ of the task-related signal; this could be
due to the fact that the variation in our simulated data is driven by the difference between the
"active" and "baseline" classes and therefore could be captured with a single PC However, when
this method is applied to the stroke set, estimates of intrinsic dimensionality differ across
subjects (varying between 20 and 80 principal components), because the variance in the BOLD
signal in stroke recovery data has a much more complicated structure than what is observed in
our simulations . These estimates show strong correlation with all three behavioural measures of
final performance: Pearson's r is 0.73 for the pinch test, 0.84 for the grip test, and 0.57 for the
peg test (p=0.03, 0.004, and 0.1). Correlation with improvement of performance is, again,
insignificant (p>0.5). Scatter plot is displayed on Figure 6.2 B.
It is interesting to observe the difference between optimization of generalization and
optimization of classification: for the first method, estimates correlate positively with final
recovery, and for the second method, this correlation is negative (but also strong). There is a
fundamental difference between these two methods of dimensionality estimation: QD
dimensionality is modeling the difference between the early and the late sessions, and
generalization-error dimensionality captures the overall complexity of the data, with early and
late sessions pooled together14. It could be assumed that good recoverers have more complex
cortical networks that require larger number of components to describe them (with a probabilistic
PCA model). At the same time, the difference between the early and late stages of recovery is
more pronounced, requiring a QD model with fewer components.
14 We have also tried to estimate generalization-error dimensionality for early and late sessions separately, to see whether there is a change that reflects behavioural improvement. The difference between early and late sessions was not significant (p=0.86 for the paired Wilcoxon test).
We found a correlation between improvement of behavioural performance and some measures
computed on the eigenspectrum of the covariance matrix (where the eigenvalues are sorted in
descending order). The first of such measures is the index of sphericity. This measure is
calculated on the covariance matrix with the data pooled across all sessions. The sphericity index
reflects the curvature of the plot of this matrix's eigenvalues. An index of 1 corresponds to a
flat eigenspectrum, in which all eigenvalues are identical; this would be observed for an
infinitely large sample of white Gaussian noise. The smaller the index, the sharper the drop in
the eigenspectrum. In a limited sample of white Gaussian noise, the eigenspectrum of the
covariance matrix drops off slowly. If we add spatially
correlated signal that has sufficiently high magnitude and variance, the spectrum becomes less
spherical. Figure 5.3 shows how the presence of the network makes the first eigenvalue stand
out, making the spectrum less spherical. We define the sphericity index using the Greenhouse-Geisser
correction to the Box criterion (Schmah et al., 2010; Abdi, 2010):

\epsilon = \frac{\left(\sum_{i=1}^{N} \lambda_i\right)^2}{N \sum_{i=1}^{N} \lambda_i^2} . \qquad (6.1)
Here, N is the number of eigenvalues. This measure shows strong negative correlation with
improvement on pinch test and peg test (Pearson's r = -0.79 and -0.63; p = 0.01 and 0.07).
Correlation with improvement on the grip test is insignificant (Pearson's r = -0.38, p = 0.3), as are
correlations with final measures of recovery (p>0.35 for all three behavioural tests). Overall, the
subjects showing a highly non-spherical covariance matrix (with a sharp drop in the
eigenspectrum) show the greatest improvement in two out of three behavioural measures of
recovery (pinch test and peg test). The scatter plot of sphericity versus improvement in pinch
test is shown on Figure 6.2 C. In addition, we computed the sphericity for the covariance
matrices of early and late sessions separately. A paired Wilcoxon test fails to reveal any
significant difference between "early" and "late" sphericity (p=0.91).
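Formula 6.1 is straightforward to compute from an eigenspectrum; a minimal sketch (the function name is ours):

```python
import numpy as np

def sphericity_index(eigenvalues):
    """Greenhouse-Geisser sphericity of a covariance eigenspectrum:
    (sum of eigenvalues)^2 / (N * sum of squared eigenvalues).
    Equals 1 when all eigenvalues are identical (flat spectrum) and
    approaches 1/N when a single eigenvalue dominates."""
    lam = np.asarray(eigenvalues, dtype=float)
    return lam.sum() ** 2 / (lam.size * (lam ** 2).sum())
```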
To get a measure of difference in covariance matrices between the early and late sessions, we use
spectral distance, which we define as the Euclidean distance between the two ordered
eigenspectra:
d = \sqrt{\sum_{i=1}^{N} \left(\lambda_i^{(E)} - \lambda_i^{(L)}\right)^2} . \qquad (6.2)
Here, λ_i^{(E)} denotes the eigenvalues of the "early" covariance matrix (sessions 1 and 2 combined),
and λ_i^{(L)} the eigenvalues of the "late" covariance matrix (sessions 3 and 4 combined). This measure
shows strong positive correlation with improvement on the pinch test (Pearson's r=0.79, p=0.01)
and a weaker one with the improvement on the grip test (Pearson's r=0.56, p=0.11). Correlation
with improvement on the peg test is insignificant (Pearson's r=0.38, p=0.31). Correlations with
behavioural measures of final recovery are also not significant (p>0.57 for all three behavioural
tests). The scatter plot of spectral distance versus improvement on pinch test is displayed on
Figure 6.2 D. A large spectral distance between the early and late sessions indicates a prominent
longitudinal change in the brain networks. It is therefore not surprising that it positively
correlates with behavioural improvement of motor performance.
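Formula 6.2 can be computed directly from the two covariance matrices; a sketch (note that `eigvalsh` assumes symmetric matrices, which covariance matrices are):

```python
import numpy as np

def spectral_distance(cov_early, cov_late):
    """Euclidean distance between the ordered (descending) eigenspectra
    of the 'early' and 'late' covariance matrices (Formula 6.2)."""
    lam_early = np.sort(np.linalg.eigvalsh(cov_early))[::-1]
    lam_late = np.sort(np.linalg.eigvalsh(cov_late))[::-1]
    return float(np.sqrt(np.sum((lam_early - lam_late) ** 2)))
```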
Figure 6.3. Scatter plot for the first vs. the second principal component produced by PLS
analysis of the correlation matrix given in Table 6.1. Squares and circles denote fMRI-
based and behavioural measures, respectively.
To further analyze the correlations between fMRI-based measures and behavioral test results, we
performed a Partial Least Squares (PLS) analysis (McIntosh & Lobaugh, 2004; Krishnan et al.,
2011). The matrix of correlations R (displayed in Table 6.1) was decomposed using singular-
value decomposition: R=UΓVT. The first two principal components explain 43% and 39.5%,
respectively, of the total correlation (this is computed by dividing each singular value in Γ by the
total sum of singular values); taken together, they explain 82.5% of the total correlation. These
two principal components can be seen as the two orthogonal directions that capture (in the
optimal least-squares sense) the associations between fMRI-based and behavioural measures.
Figure 6.3 shows the scatter plot of the weights on the first two principal components (which
correspond to columns of U for fMRI-based measures and to columns of V for behavioural
measures). We see that the first (horizontal) direction captures the fMRI-based measures that are
computed on the eigenvalues (sphericity and spectral distance) and the behavioural measures
based on the improvement in performance. The second (vertical) direction captures the measures
based on the PC dimensionality that optimizes prediction (unsupervised or supervised) and the
behavioural measures based on the performance at the final session.
                       peg test                     pinch strength               grip strength
                       improvement    final         improvement     final        improvement   final
sphericity index       -0.68 (0.05)   -0.30 (0.44)  -0.87 (0.0045)  -0.07 (0.88) -0.43 (0.25)   0.17 (0.68)
spectral distance       0.48 (0.19)    0.18 (0.64)   0.73 (0.03)    -0.18 (0.64)  0.50 (0.18)  -0.03 (0.95)
QD dimensionality      -0.28 (0.46)   -0.83 (0.008)  0.15 (0.71)    -0.38 (0.31) -0.02 (0.98)  -0.42 (0.27)
GE dimensionality      -0.01 (0.99)    0.45 (0.22)  -0.21 (0.6)      0.67 (0.05)  0.16 (0.68)   0.78 (0.015)
Table 6.1. Correlations between fMRI-based measures and behavioural measures. Each
table entry shows Spearman's correlation coefficient, with the corresponding p-value in
parentheses. When corrected for multiple comparisons, none of the correlations are
significant at FDR≤0.05 level.
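The PLS decomposition of the correlation matrix described above can be sketched as follows (an illustrative sketch; the function name is ours):

```python
import numpy as np

def pls_correlation_svd(R):
    """Singular-value decomposition R = U diag(s) V^T of the matrix of
    correlations between fMRI-based measures (rows) and behavioural
    measures (columns).  Returns the saliences U and V, plus the
    fraction of total correlation explained by each component,
    computed as each singular value divided by the sum of all
    singular values."""
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    explained = s / s.sum()
    return U, Vt.T, explained
```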
We have demonstrated that the process of post-stroke recovery of motor function is reflected in
the fMRI data, and, in particular, in the eigendecomposition of the covariance matrix. The PC
dimensionality is correlated with the performance on the last session, whereas the measures
computed on the eigenvalues of the data matrix (sphericity and spectral distance) correlate with
the improvement of performance in the 6-month period. These two directions of correlation are
mutually orthogonal (Figure 6.3), indicating that they might capture two relatively separate
behavioral recovery processes: the absolute level of performance reached over 6 months
recovery, and the change from the initial damaged brain required to reach this final performance.
This might also explain the opposing trends observed in sphericity and GE dimensionality (good
behavioural recovery is correlated with low sphericity and high GE dimensionality). Overall, our
results suggest that significant changes in the eigenspectra across sessions reflect underlying
changes in the BOLD functional connectivity and the re-organization of the brain, which leads to
effective recovery (Yourganov et al., 2010).
Chapter 7 Evaluation of classifiers: simulated fMRI data
7.1 Pool of classifiers
In Chapter 5, we have evaluated different methods of estimating the intrinsic PC dimensionality
of the data. In this chapter, we describe the evaluation of classification algorithms that have been
used to classify fMRI volumes according to the task performed during the acquisition of the
volume. We have evaluated a group of classifiers well-established in the field, such as linear
discriminant, support vector machines, and Gaussian Naïve Bayes classifier; in addition, we have
tested quadratic discriminant, which is novel in fMRI classification. We have also constructed
spatial maps, which display the contribution of each spatial location to classification of the
volumes. Both simulated and real fMRI data have been used in this evaluation. This chapter
describes the results of the evaluation on the simulated data; this study will be submitted for
publication in summer of 2013.
Our pool of classifiers consisted of quadratic and linear discriminants (noted as QD and LD),
Gaussian Naïve Bayes classifiers (GNB-L and GNB-N for linear and nonlinear variants of GNB,
respectively), and a support vector machine (SVM) with a linear kernel. QD, LD and GNB are
probabilistic classifiers that use Gaussian distributions to model the classes; Chapter 3 describes
these classifiers and corresponding spatial maps. To evaluate SVM, we use the implementation
provided by the LIBSVM library (Chang & Lin, 2011). For linear SVMs, the training weight vector
serves as a spatial map (this is proposed, among other approaches to visualizing the SVM model,
in a paper by LaConte and colleagues (2005); see also Rasmussen et al., 2012B). We did not test
SVMs with non-linear kernels, despite the evidence that they provide an advantage over linear
kernels in some situations (Schmah et al., 2010). The decision to exclude nonlinear SVMs from
our pool of classifiers was motivated by the fact that we wanted to evaluate the spatial maps
created by the classifiers (unlike our previous study described in Schmah et al., 2010, where we
evaluated classification accuracy of classifiers without considering the corresponding spatial
maps). It is possible to construct spatial maps for nonlinear-kernel SVMs (Rasmussen et al.,
2011 and 2012A); however, this is computationally expensive and was not performed due to
time limitations.
For LD, we try two approaches to regularizing the pooled covariance matrix. The first approach
approximates it with a subset of its principal components; this is also the approach we used for
QD. The question of selecting the size of this subset is addressed in detail in Chapters 4 and 5,
where we show the usefulness of resampling-based approaches. In particular, reproducibility of
spatial maps and, to a lesser extent, accuracy of classification are two optimization metrics that
are sensitive to connectivity of spatial networks and yield reasonably good ROC performance of
LD. In the current evaluation, we have used a combination of reproducibility and classification
accuracy to determine K, the size of the optimal PC subset. Following Zhang et al. (2008), we
use split-half resampling to compute classification accuracy and map reproducibility for a range
of values of K, and compute the median values of these two metrics across splits. Then we select
the value of K that minimizes
\Delta = \sqrt{(1 - \text{med. reproducibility})^2 + (1 - \text{med. class. accuracy})^2} . \qquad (7.1)
A perfect classifier would have both reproducibility and classification accuracy equal to 1. We
can think of these two metrics as axes that define our performance space; the perfect classifier
has coordinates (1, 1) in this space, and the Δ metric is the Euclidean distance from the perfect
classifier.
The second approach to regularizing the covariance matrix is ridge regularization, described in
Section 3.5; here, a diagonal matrix is added to the covariance matrix, making it full-rank. The
diagonal matrix is λI, and the value of λ is selected with a cross-validation procedure similar to
selecting K. Using split-half resampling, we compute the reproducibility of maps and accuracy of
classification for different values of λ. The value of λ that optimizes the Δ metric (given by
Formula 7.1) is used to regularize the within-class covariance in the linear discriminant.
The SVM model contains the hyper-parameter C, which regulates the trade-off between the
complexity of the model and the accuracy of classification (LaConte et al., 2005). We determine
the value of C in the same way as we select K and λ: by running split-half resampling, computing
median (across splits) classification accuracy and map reproducibility, and optimizing the Δ
metric. Our implementation of GNB (linear and nonlinear) has no regularization hyper-
parameters that need to be tuned with cross-validation.
Overall, our pool of classifiers consists of 6 models:
1) QD: quadratic discriminant with PC regularization of covariance matrices (the same
value of K is used for both classes);
2) LD-PC: linear discriminant with PC regularization of the pooled covariance matrix;
3) LD-RR: linear discriminant with ridge regularization of the pooled covariance matrix;
4) SVM: support vector machine with a linear kernel;
5) GNB-N: nonlinear Gaussian Naïve Bayes classifier, where the (diagonal) covariance
matrices are allowed to differ across classes and the decision function is nonlinear;
6) GNB-L: linear GNB classifier, where the covariance matrix is pooled across classes and
the decision function is linear.
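As a rough modern approximation (an assumption on our part: the evaluation itself used custom implementations plus the LIBSVM library), five of the six models can be mimicked with scikit-learn estimators. There is no off-the-shelf pooled-covariance GNB-L, and shrinkage stands in for ridge regularization in LD-RR:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def build_pool(n_components=10, shrinkage=0.5, svm_C=1.0):
    """Approximate versions of five of the six classifiers; in practice
    the hyper-parameters (K, lambda, C) would be tuned with split-half
    resampling as described in the text."""
    return {
        "QD":    make_pipeline(PCA(n_components), QuadraticDiscriminantAnalysis()),
        "LD-PC": make_pipeline(PCA(n_components), LinearDiscriminantAnalysis()),
        "LD-RR": LinearDiscriminantAnalysis(solver="lsqr", shrinkage=shrinkage),
        "SVM":   LinearSVC(C=svm_C),
        "GNB-N": GaussianNB(),   # per-class diagonal covariances
    }
```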
We will now proceed to the results of evaluation of these classifiers on simulated data.
7.2 Performance of classifiers on simulated data
The simulation framework described in Section 2.2.1 has been used to simulate a block-design
experiment with two conditions: active and baseline. A simulated experimental run consists of
100 active and 100 baseline volumes, arranged in alternating 10-volume blocks. Task-related
signal (which is present in the active blocks and absent from baseline blocks) is determined by
three parameters: mean expected magnitude (M), variance (V, defined as a ratio between the
variances of task-related signal and of the background noise), and correlation across the spatial
network of distributed activations (ρ). To evaluate the effect of each of these parameters on the
performance of classifiers, we have tested various levels of M (0, 0.02, and 0.03), V (0.1 to 1.6,
in increments of 0.25), and ρ (0, 0.5 and 0.99). For each setting of (M, V, ρ), we have generated
100 data sets; an additional 100 sets consisting of baseline volumes are created for ROC analysis.
When analyzing the simulated data, the first 2 volumes of each block have been discarded because of the slow temporal response of the HRF. This reduces the size of the data sets to 160 volumes.
Voxels outside the “simulated brain” have been discarded, leaving 2072 voxels for further
analysis. Each simulated run has been split into two sets, both containing 5 active and 5 baseline
blocks. We have computed the median values of map reproducibility and classification accuracy
across 20 such splits.
The results of the evaluation are displayed on Figure 7.1. The rows of this figure correspond to
the metric of performance: classification accuracy (top), and reproducibility of maps (bottom).
The lines show median performance, taken across 100 data sets generated for each setting of (M,
V, ρ); the error bars show standard deviation of performance. The columns of Figure 7.1
correspond to the levels of M (0, 0.02 and 0.03). The three levels of ρ (0, 0.5 and 0.99) are shown
as three sub-panels of each level of M. Each of these sub-panels consists of a performance plot,
where the horizontal axis represents V, going from 0.1 to 1.6. The vertical axis of the plot is the
mean magnitude of the performance metric.
Figure 7.1. Performance of the pool of six classifiers on simulated data sets. Performance is
measured with classification accuracy (top row) and reproducibility of spatial maps
(bottom row). The three columns correspond to three levels of mean signal magnitude M,
and the three sub-columns to three levels of spatial correlation ρ.
98
7.2.1 Classification accuracy
In terms of classification accuracy (top row of Figure 7.1), there are strong similarities between
LD-PC, LD-RR, SVM and GNB-L. These methods can all be referred to as “linear classifiers”,
because they all use a linear decision function to compute class membership, and the decision
boundary that separates the two classes is a hyperplane. Their accuracy is lowest when M = 0 and increases with mean signal strength M. For M = 0.03, we see a negative effect of increasing V, which is modulated by ρ (negligible when ρ = 0, and strongest when ρ = 0.99). Changing V influences the spread of the "active" volumes; as V grows, the two classes become harder to separate and their class covariances become less similar.
Separation between the two classes can be estimated with the Mahalanobis distance between the
class centroids m1 and m2:
d_{Mah} = \sqrt{(m_1 - m_2)^T S^{-1} (m_1 - m_2)}. \qquad (6.2)
Here, S is the pooled within-class covariance matrix, that is, the average of the sample covariance matrices of the two classes. To invert it, we use PC regularization. LD-PC is
used to select the number of principal components for this approximation, by determining K that
optimizes the Δ metric in each data set.
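A minimal sketch of this PC-regularized Mahalanobis distance, assuming the two classes are stored as volumes-by-voxels arrays (the function name is ours, not from the thesis):

```python
import numpy as np

def mahalanobis_pc(X1, X2, K):
    """Mahalanobis distance between class centroids, with the pooled
    within-class covariance inverted on its top-K principal components.
    X1, X2: (n_samples, n_voxels) arrays for the two classes."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # pooled within-class covariance: average of the two sample covariances
    S = 0.5 * (np.cov(X1, rowvar=False) + np.cov(X2, rowvar=False))
    evals, evecs = np.linalg.eigh(S)            # symmetric eigendecomposition
    order = np.argsort(evals)[::-1][:K]         # keep the top-K components
    evals, evecs = evals[order], evecs[:, order]
    diff = m1 - m2
    proj = evecs.T @ diff                       # project onto the PC subspace
    return np.sqrt(np.sum(proj ** 2 / evals))   # whiten and take the norm
```

Restricting the inversion to K components is what makes the distance computable when the number of voxels exceeds the number of volumes.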
When M = 0.03, mean classification accuracy is to a large extent predicted by median d_Mah. For a given M, we have computed 21 mean values (across 100 data sets) of d_Mah, one for each of 3 levels of ρ and 7 levels of V. Pearson's correlation coefficient between the 21 values of mean d_Mah and the
corresponding mean classification accuracies is 0.954 when M = 0.03. For lower levels of M, this
correlation is not observed, perhaps due to instability in estimating K.
The remaining classifiers in our pool, QD and GNB-N, show quite different trends in
performance (different from the linear classifiers as well as from each other). The positive
influence of M is also observed, but the effect of V and ρ is more complex. These two methods
use a non-linear decision function to classify the volumes, and the covariance matrix is estimated
separately for each class. Therefore, these methods can use the difference in the covariance
matrices to their advantage: if the separation between the class centroids is small (that is, M is
low), prediction of class membership can be enhanced using the different sample covariance
matrices. This is evident when M = 0: here, both nonlinear methods get better as V increases. For
QD, this beneficial effect of V is modulated by ρ; when M = 0, ρ > 0, and V is sufficiently large,
QD is the most accurate classifier, peaking at a mean accuracy of 66% when ρ = 0.99 and V = 1.6.
When ρ = 0, the functional nodes of the active network are independent, and GNB-N is the best
model for our data, achieving 60% mean accuracy at the largest setting of V.
When M>0, the performance of these two nonlinear classifiers is influenced by V in a more
complex way. When M is 0.03, performance is high (greater than 70%), but the influence of V is
detrimental to performance, due to growing overlap between the classes. In this aspect, the
nonlinear classifiers resemble the linear ones; indeed, the performance of the methods starts to converge when M = 0.03. At M = 0.02, there seems to be a transition point where the effect of V changes from beneficial to detrimental, and V seems to have no effect on mean classification accuracy.
7.2.2 Reproducibility
The bottom row of Figure 7.1 shows the reproducibility of spatial maps produced by our pool of
classifiers. If we compare it with the classification accuracy plot, we see that the classifiers here
can be grouped in a different way:
1. univariate methods: two versions of GNB
2. multivariate methods that use PC regularization: LD-PC and QD
3. other multivariate methods: SVM and LD-RR
Inside each group, reproducibility is quite similar, but the groups are clearly distinct in most
cases. Let us inspect the situation for a given level of mean signal, M. We see the same pattern
across all values of M:
• As expected, network coupling (ρ) has no effect on univariate methods. There is a slight detrimental effect of V, which is noticeable at high levels of M. In most cases, univariate maps are less reproducible than multivariate maps. It is interesting to note here that pooling of variance across classes has no effect on reproducibility, because performance of GNB-N and GNB-L is the same.
• PC-based methods (LD-PC and QD) get a tremendous boost from increasing V when the active areas are coupled (ρ > 0). For sufficiently large levels of V, the reproducibility of these two methods greatly surpasses that of all other methods in our pool.
• Other multivariate methods (SVM and LD-RR) are not influenced by V and ρ, and have effectively identical performance (reflecting the real-data findings of Rasmussen et al., 2012). However, the relative ranking of SVM and LD-RR among the pool of classifiers depends on ρ: when ρ = 0, they are sometimes the best methods in terms of reproducibility, along with LD-PC. The same holds when ρ > 0 and V is very small. In other situations (ρ > 0 and V > 0.1) they perform worse than PC-based methods.
This ranking of the methods is largely consistent across M. Overall, M has a positive effect on
reproducibility: for all methods, spatial maps become more reproducible as M grows.
Figure 7.2. Performance of the pool of six classifiers on simulated data sets, measured by
partial area under the ROC curve. The three columns correspond to three levels of mean
signal magnitude M, and the three sub-columns to three levels of spatial correlation ρ.
7.2.3 Partial area under the ROC curve
In addition to the NPAIRS metrics (classification accuracy and reproducibility), we have also
measured the performance using partial area under the ROC curve. The results are displayed on
Figure 7.2, which is organized similarly to Figure 7.1: the three panels correspond to three levels
of M (0, 0.02, 0.03), and the three sub-panels inside each panel correspond to three levels of ρ (0,
0.5, 0.99). We have computed a ROC curve for each of the 16 active areas, and estimated the
partial area corresponding to false positive frequency range from 0 to 0.1 (see Section 2.1.1). The
lines show the median value of partial ROC area across the 16 areas, and error bars show its
standard deviation. The dashed line shows the partial ROC area of 0.05, which corresponds to
random guessing.
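The partial ROC area can be computed by sweeping a threshold over the spatial map values. The sketch below follows the standard thresholding construction and assumes a boolean ground-truth labeling of the simulated active voxels; it is illustrative, not the thesis implementation.

```python
import numpy as np

def partial_roc_area(map_values, is_active, fpr_max=0.1):
    """Partial area under the ROC curve obtained by thresholding a
    spatial map, integrated over false positive rates in [0, fpr_max].
    The theoretical maximum is fpr_max (perfect detection)."""
    order = np.argsort(map_values)[::-1]          # sweep threshold downward
    truth = np.asarray(is_active, bool)[order]
    tpr = np.cumsum(truth) / truth.sum()          # true positive rate
    fpr = np.cumsum(~truth) / (~truth).sum()      # false positive rate
    fpr = np.concatenate(([0.0], fpr))            # start the curve at (0, 0)
    tpr = np.concatenate(([0.0], tpr))
    keep = fpr <= fpr_max
    fpr_c = np.concatenate((fpr[keep], [fpr_max]))
    tpr_c = np.concatenate((tpr[keep], [np.interp(fpr_max, fpr, tpr)]))
    # trapezoidal integration over the clipped curve
    return float(np.sum(np.diff(fpr_c) * (tpr_c[1:] + tpr_c[:-1]) / 2.0))
```

A map that ranks every active voxel above every inactive one attains the maximum value of fpr_max.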
We see that the classifiers group in the same fashion with respect to ROC area as they do with respect to spatial map reproducibility, reflecting the fact that both metrics estimate the same spatial signal detection performance:
1) PC-based multivariate methods;
2) other multivariate methods (SVM and LD-RR);
3) univariate methods.
The first group is the best performer in terms of ROC area. QD is better than LD-PC when M is
zero, and LD-PC is better than QD when M = 0.03 and ρ = 0. In other situations, their
performance is identical. Both methods are sensitive to V, and their performance increases as V
grows from 0 to 1. When mean signal is relatively strong (M = 0.03), both LD-PC and QD are
near-perfect in their signal detection (partial ROC area approaches the theoretical maximum
value of 0.1), for all levels of V and ρ.
Univariate methods are uniformly the worst performers. In the absence of mean signal, they
never rise significantly above chance. When M>0, they are much better than chance, but their
performance drops as V grows. This decline is less severe when M is large. As expected, ρ has no
effect on performance of univariate detectors. Pooling of variance across classes is beneficial for
signal detection: GNB-L is slightly, but consistently, better than GNB-N. The second group of
algorithms (SVM and LD-RR) is intermediate in terms of performance: better than univariate
methods, but never as good as PC-based multivariate methods.
7.2.4 ROC evaluation of GLM, PCA, ICA and PLS
In addition to the pool of classifiers described above, we have evaluated a different group of
popular methods of fMRI data analysis, using the ROC methodology. These methods do not
classify the fMRI volumes; their goal is to produce spatial maps that summarize brain activity.
Some of these methods are unsupervised (i.e. they produce maps that show important sources of
temporal variance in the fMRI signal, irrespective of the task), and some are supervised (they
characterize the response of the brain to a particular task or set of tasks). We have tested the
following unsupervised methods:
• Principal Component Analysis (PCA): signal detection of the first eigenimage of the mean-centered data matrix has been evaluated with ROC methodology;
• Independent Component Analysis (ICA): MELODIC software has been used to decompose the data matrix into independent components (ICs). Each component consists of a spatial map and the associated timecourse. We have correlated these timecourses with the predictor function to identify the component with the strongest correlation. For the predictor function, we have convolved the boxcar function (alternation between task and rest blocks) with the hemodynamic response function (HRF, described in Section 1.1.2). The spatial map of the most-correlated independent component has been evaluated with ROC methodology.
The supervised methods that we have tested are:
• Mean-centered Partial Least Squares (PLS; see McIntosh et al., 1996; McIntosh & Lobaugh, 2004; Krishnan et al., 2011). This method examines the principal components of the product of the data matrix and the design matrix (the latter encodes the task for each volume). For our situation of two-class, one-subject analysis, this product has only one eigenvector, which is proportional to the difference in class means. We do not perform the bootstrapping procedure that is standard in PLS, because of the nature of our simulations (one session per subject, and one subject per experiment).
• General Linear Model (GLM; see Section 3.4). We have used only one predictor in our analysis: the boxcar function convolved with the HRF.
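The predictor shared by the ICA component selection and the GLM can be sketched as below. The double-gamma HRF parameters here are common defaults and are our assumptions, not values taken from Section 1.1.2; the function names are also ours.

```python
import numpy as np
from math import gamma as gamma_fn

def _gamma_pdf(t, shape):
    """Gamma density with unit scale (used to build a double-gamma HRF)."""
    t = np.asarray(t, float)
    return np.where(t > 0, t ** (shape - 1) * np.exp(-t) / gamma_fn(shape), 0.0)

def boxcar_predictor(n_pairs=10, block_len=10, tr=2.0):
    """Task predictor: a baseline/active boxcar (n_pairs alternating pairs
    of block_len volumes each) convolved with a canonical double-gamma
    HRF. The HRF shape parameters (6 and 16) are common defaults."""
    boxcar = np.tile(np.concatenate([np.zeros(block_len),
                                     np.ones(block_len)]), n_pairs)
    t = np.arange(0.0, 32.0, tr)                 # HRF sampled at the TR
    hrf = _gamma_pdf(t, 6.0) - _gamma_pdf(t, 16.0) / 6.0
    hrf /= hrf.sum()
    return np.convolve(boxcar, hrf)[: boxcar.size]

def most_correlated_component(timecourses, predictor):
    """Index of the ICA component whose timecourse has the strongest
    absolute correlation with the predictor."""
    r = [abs(np.corrcoef(tc, predictor)[0, 1]) for tc in timecourses]
    return int(np.argmax(r))
```

The same convolved boxcar serves as the single GLM regressor; for ICA it is compared against each component's timecourse.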
The results are displayed on Figure 7.3. As in the previous section, for our metric of signal detection we use the partial area under the ROC curve for a false positive rate between 0 and 0.1.
Figure 7.3. Partial area under ROC curve measured for maps that are produced by GLM,
ICA, PLS and PCA.
When M = 0 we see that PCA is better (sometimes dramatically so) at signal detection than other
methods. Signal detection of PCA improves as V grows; this is expected, because V is the ratio
of task-coupled variance to noise variance. When V is large, the variance in the active loci is
higher than in the inactive loci, helping the active loci to “stand out” in the eigenimages. For
example, when V = 1, the variance is twice as much in the active loci (which contain the sum of
signal and noise) as in the inactive loci (which contain only the noise). When active loci form a
spatial network (i.e. when ρ > 0), the first eigenimage accounts for a relatively large share of
variance. Therefore, at large levels of V and ρ, the first eigenimage captures most of the task-
driven variance. Because this variance is expressed only in the active loci, they get high weights
in the eigenimages. This explains the success of the PCA algorithm at M = 0, when other
methods are either at chance or only slightly better than chance. PCA also outperforms all the
methods in our classifier pool, including LD and QD (compare Figures 7.3 and 7.2).
When M > 0, performance of PCA gets better because the mean difference is another source of
variance that is captured by the first eigenimage. This variance is manifest only in the active loci,
so PCA is doing a good job at detecting them. The performance of GLM and (to a lesser extent)
PLS improves dramatically when M > 0. The PLS map is directly proportional to the difference in class means and is not affected by ρ or V. GLM, as a univariate method, is not affected by ρ, and V has a
detrimental effect; the GLM map consists of the t values at each voxel location; these values
decrease in magnitude as V grows. Overall, GLM is the best method of the four when either ρ or V is small, but it is still worse than LD or QD regularised on a PC subspace (see Figure 7.2). It
outperforms univariate GNB (both linear and nonlinear). Remember that in our simulations the
signal is convolved with HRF, which has a somewhat slow response; consequently, the first
couple of volumes in the "active" blocks do not reflect the mean difference. In our implementation of GNB, we discard the transition scans at the beginning of each block, reducing the sample size (2 images are discarded at the beginning of each block, so only 160 volumes out of the original 200 are retained). GLM does not discard the transition scans; rather, this slow response is modeled in the predictor. In our simulations, GLM has the additional benefit of using the "true" HRF in the predictor, i.e. the same function that has been used in the construction of the data.
7.3 Summary of evaluation of classifiers on simulated data
Figure 7.2 shows that the multivariate classifiers based on PCA regularization (that is, QD and
LD-PC) are the most efficient methods of detecting the simulated signal. This advantage over
other classifiers from our pool is explained by the performance of PCA as shown in Figure 7.3.
For V > 1 in Figure 7.3, PCA by itself performs similarly to or better than SVM and LD-RR.
Therefore, when an adaptive PCA subspace is used to provide discriminant features for QD and
LD, they do not need to further improve performance by much to easily outperform the other
approaches as signal detectors for variance-driven covariance structures.
As we can see in the top row of Figure 7.1, PCA regularization does not provide any advantage
to LD with respect to the accuracy of classification: LD performs at the same level of accuracy
regardless of the method of regularization (PCA or ridge regularization). At M=0.03, the
accuracy of all classifiers from our pool is more or less similar and displays the same declining
trend with increasing V. At lower levels of M, the difference in accuracy is driven by linearity of
classifiers rather than by regularization method (there is an additional difference between
nonlinear univariate and nonlinear multivariate classifiers, that is, between GNB-N and QD).
Comparing Figures 7.1 and 7.2, we can say that signal detection has more similarities to
reproducibility of spatial maps than to classification accuracy. We observe the same trends in
signal detection (Figure 7.2) as in reproducibility (Figure 7.1, bottom row): the PCA-regularized
classifiers are the best performers, and their performance improves with the growing variance of
correlated active signal; the univariate methods perform the worst, and are negatively influenced
by growing V; finally, SVM and LD-RR perform at the intermediate level, and the influence of
increasing V is minimal (a slight negative trend can be observed with respect to signal detection).
Principal components capture the most important sources of variance in the data. When these sources are correlated, they are to some degree captured by the first principal component. With growing V, the portion of total variance that is due to the active signal increases; this is reflected in the improved signal detection of the first principal component (Figure 7.3). This, in turn, helps the methods that use PCA for regularization to improve their signal detection (Figure 7.2) and the reproducibility of their maps (Figure 7.1, bottom row). To the extent that variance-driven covariance structures provide a good description of brain networks as reflected in real BOLD fMRI data sets, we can expect QD and LD on a regularised subspace to also outperform other approaches as signal detectors.
Chapter 8 Evaluation of classifiers: real fMRI data
8.1 Data sets
This chapter presents the results of evaluation of our pool of classifiers on two real fMRI data
sets. In the previous chapter, we have used simulations to investigate the influence of parameters
of simulated signal (namely, M, V and ρ) on the performance of our classifier pool. In this
chapter, we apply this classifier pool to real fMRI data; the ranking of classifiers according to
their performance is largely similar to the one observed in simulations. We concentrate on
within-subject analysis, so the analysis of real fMRI data is comparable to our analysis of
simulated data, where we have not attempted to simulate the across-subject heterogeneity.
However, one of our real data sets was also subject to group-level analysis, described in Section
8.4.2.
The two real fMRI sets come from the stroke recovery study (see Section 2.2.2) and from an
aging study (see Section 2.2.3). These two sets represent two scenarios with different advantages
and disadvantages. The first set is composed of a small number of highly heterogeneous subjects,
with a large number of fMRI volumes per subject. In the second set, the number of subjects is
much larger, and they are more homogeneous (within their age group) because they all come
from a healthy population. However, the number of fMRI volumes per subject is much smaller in
the second set. Also, the first set is a longitudinal study with 4 scanning sessions per subject; this
gives us an interesting opportunity to classify the volumes based on the time after stroke ("early"
versus "late" sessions) in addition to classification of volumes based on the task ("finger tapping"
versus "wrist flexion" and "healthy hand movement" versus "impaired hand movement"). The
second study is not longitudinal, but the subjects come from two age groups, which allows us to
study the effects of age on classifier performance.
Another important advantage of the second set is the range of behavioural tasks performed by
subjects during scanning. In the first set, the subjects performed simple motor tasks (finger
tapping and wrist flexion, with alternating hands). In the second set, the subjects performed a set
of visuomotor tasks with varying cognitive load: simple reaction task (RT), matching of a target
stimulus to one of the three simultaneously presented stimuli (PM, for "perceptual matching"),
and a short-term memory task, where the target stimulus was removed from the screen before
presentation of the three stimuli (DM, for "delayed matching"). The fourth "task" was passive
fixation on a dot in the middle of the screen (FIX). We have used four binary contrasts for our
within-subject analysis: DM/FIX, RT/FIX, DM/RT and DM/PM. We can assume that the
RT/FIX and DM/FIX are relatively strong contrasts, because different brain networks are
recruited during active visuomotor tasks and during passive fixation (see Grady et al., 2010). The
DM/PM contrast can be assumed to be the weakest, because of the similarity between the DM
and PM tasks (because of this similarity, we did not study PM/FIX and PM/RT contrasts).
Finally, the DM/RT contrast is intermediate. These four contrasts represent a range of "contrast
strengths", which is somewhat analogous to varying M, V and ρ in our simulations.
Both data sets were analyzed using a split-half resampling framework. The volumes in the stroke
set were pooled across 4 scanning sessions, and then split into two half-sets; the temporal
separation between the two half-sets was at least 16 seconds. 20 such splits were created for each
subject. In the aging study, 4 scanning runs were acquired for each subject; these 4 runs were
evenly split into half-sets, providing 3 splits per subject. For each binary contrast (in both stroke
and aging data), we have balanced the number of volumes between the two classes within half-sets by subsampling from the larger class.
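The subsampling step can be sketched as follows (the function name is ours; any NumPy random generator will do):

```python
import numpy as np

def balance_classes(labels, rng):
    """Indices that balance a binary half-set by randomly subsampling
    the larger class down to the size of the smaller one."""
    labels = np.asarray(labels)
    idx0 = np.flatnonzero(labels == 0)
    idx1 = np.flatnonzero(labels == 1)
    n = min(idx0.size, idx1.size)
    keep0 = rng.choice(idx0, size=n, replace=False)
    keep1 = rng.choice(idx1, size=n, replace=False)
    return np.sort(np.concatenate([keep0, keep1]))
```

Balancing the classes keeps chance-level classification accuracy at 50% within each half-set.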
The first part of this chapter (Sections 8.2 and 8.3) evaluates the classifiers in within-subject
analysis framework. The results obtained on the two real datasets are compared with the results
of within-subject analysis of simulated data sets. The second part of the chapter (Section 8.4)
uses the aging study data to investigate some questions that go beyond within-subject analysis.
First, the within-subject spatial maps are evaluated on their reproducibility across subjects.
Second, we inspect the agreement between the maps created by different classifiers on the same
subject. Finally, we classify the within-subject maps based on the age group of the participant;
we create the group-level spatial map for this classification and use it to identify the spatial
locations where the effect of age is expressed.
The results of evaluation on the stroke recovery set have been published previously (Schmah et
al., 2010); for that paper, we have used only one metric of evaluation: the accuracy of
classification. The remaining results will be submitted for publication in early 2013.
8.2 Evaluation on real data: Stroke study
Data collected for the longitudinal study of stroke recovery has been used to evaluate the
performance of our pool of classifiers defined in Section 7.1. The results are displayed in Figure
8.1. Classifiers have been applied to three binary classification problems: “healthy hand versus
impaired hand”, “early session versus late session”, and “finger tapping versus wrist flexion”.
For each of the 9 participants, we have created 20 training-test splits of equal size; classification
accuracy and map reproducibility have been averaged across 20 splits for each subject
separately. The figure shows box-and-whisker plots of the mean values of classification accuracy
(left panel) and map reproducibility (right panel) across the 9 subjects. Multivariate classifiers
have been regularized by optimization of the Δ metric in the split-half framework (defined in
Formula 7.1).
Figure 8.1. Performance of the pool of classifiers on the stroke recovery dataset for three
contrasts (healthy/impaired, early/late, and finger/wrist). The top figure shows the
accuracy of classification, and the bottom figure shows the reproducibility of spatial maps
for six algorithms of classification.
The plot shows that there is a large amount of overlap in the performance of our classifiers.
Accuracy of classification is above chance for all three classification contrasts. For the
"healthy/impaired" contrast, the distribution of accuracies across the nine subjects is roughly the
same for all classifiers, with QD being slightly better than others on average. For the "early/late"
contrast, the across-subject accuracy is the greatest for QD, followed by SVM and nonlinear
GNB. The difference in classifiers' accuracy in the third contrast, "finger/wrist", is more
pronounced than in the first two contrasts. Multivariate linear classifiers (LD-PC, LD-RR and
SVM) are most accurate. In this contrast, the across-subjects variability of classification accuracy
is smaller than in the other two; this also applies to across-subjects variability of reproducibility.
For the other two contrasts, reproducibility is more heterogeneous across subjects (in the extreme
case, reproducibility of nonlinear GNB for the "early/late" contrast ranges from 0 to 1). Of the
three classification contrasts, performance in the "finger/wrist" contrast is the most stable across
subjects. Anatomically, the subjects in this data set showed a large heterogeneity in the location
and severity of the lesion, as well as in their behavioural recovery of motor function (Small et al.,
2002). It is possible that this drives the heterogeneity of classification accuracy and
reproducibility that we observe for the "healthy/impaired" and the "early/late" contrasts.
The classification performance is more stable in the "finger/wrist" contrast, because the two
tasks, "finger tapping" and "wrist flexion", recruit different areas of the motor cortex. The
function of motor cortices has been disrupted in stroke patients, but the data from the healthy and
impaired hands are pooled together for this classification task, which can reduce the
heterogeneity due to stroke.
With such a heterogeneous data set, it is difficult to compare classifiers against each other using
just the box-and-whisker plot (such as the one shown on Figure 8.1). We perform additional
statistical testing to answer the question whether the ranking of classifiers is consistent across
subjects. The framework for testing this question with non-parametric statistical tests is
described in Conover (1999; pages 369-373), and also in Demsar (2006). For each contrast and
subject, we rank the methods from 1 to 6 according to their median performance (either accuracy
or reproducibility), with rank 1 being the best performer. Whenever two methods are tied, say for
ranks r and r+1, they are assigned a common rank of r+0.5. First, we test the null hypothesis that
all methods perform equally well (and the difference in their ranking is therefore not significant),
using the Friedman test.¹⁵ We compute the statistic

\chi^2 = \frac{(k-1) \sum_{j=1}^{k} \left( \sum_{i=1}^{b} r_{ij} - \frac{b(k+1)}{2} \right)^2}{\sum_{j=1}^{k} \sum_{i=1}^{b} r_{ij}^2 - \frac{bk(k+1)^2}{4}}, \qquad (8.1)

where rij is the rank of the jth classifier in the ith subject, b is the number of subjects and k is the number of classifiers. The approximate distribution of this statistic is χ2 with k-1 degrees of freedom. The Friedman test is the non-parametric equivalent of repeated-measures ANOVA.
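The tie-adjusted statistic can be sketched on a subjects-by-classifiers performance table as below; the helper names are ours, and rows are negated before ranking so that rank 1 goes to the best performer, matching the text.

```python
import numpy as np

def average_ranks(values):
    """Rank a 1-D array, assigning tied entries their average rank
    (rank 1 = smallest value)."""
    values = np.asarray(values)
    order = np.argsort(values)
    sorted_vals = values[order]
    ranks = np.empty(len(values))
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and sorted_vals[j + 1] == sorted_vals[i]:
            j += 1                                  # extend the tie group
        ranks[order[i:j + 1]] = (i + j) / 2 + 1     # average rank for ties
        i = j + 1
    return ranks

def friedman_statistic(perf):
    """Tie-adjusted Friedman statistic (after Conover, 1999) for a
    (b subjects) x (k classifiers) performance table."""
    b, k = perf.shape
    r = np.vstack([average_ranks(-row) for row in perf])  # rank 1 = best
    col_sums = r.sum(axis=0)
    num = (k - 1) * np.sum((col_sums - b * (k + 1) / 2) ** 2)
    den = np.sum(r ** 2) - b * k * (k + 1) ** 2 / 4
    return num / den   # approximately chi-square with k-1 df
```

With no ties, this reduces to the classic Friedman chi-square statistic.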
If the Friedman test results in the rejection of the null hypothesis, we can proceed to post-hoc testing,
where we test the significance of difference in ranking between a pair of classifiers. For each
contrast separately, we compute the average rank of each classifier across subjects. Then, the
ranking of a pair of classifiers is significantly different if the difference in their average ranks is
no less than the critical distance, defined as

CD = t_{1-\alpha/2} \left[ \frac{2 \left( b \sum_{j=1}^{k} \sum_{i=1}^{b} r_{ij}^2 - \sum_{j=1}^{k} \left( \sum_{i=1}^{b} r_{ij} \right)^2 \right)}{b^2 (b-1)(k-1)} \right]^{1/2}, \qquad (8.2)
where t_{1-α/2} is the 1-α/2 quantile of the t distribution with (b-1)(k-1) degrees of freedom, and α is the significance level, which we set to 0.05. This threshold of significance uses a Bonferroni correction for multiple comparisons, and is therefore somewhat conservative.
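The post-hoc comparison on average ranks can be sketched as follows, after Conover (1999). The t quantile is passed in rather than computed, to keep the sketch dependency-free; the function name is ours.

```python
import numpy as np

def critical_distance(ranks, t_quantile):
    """Critical distance for post-hoc comparison of *average* ranks:
    two classifiers rank significantly differently when their average
    ranks differ by at least this much.
    ranks: (b subjects) x (k classifiers) within-subject ranks;
    t_quantile: the 1 - alpha/2 quantile of the t distribution with
    (b-1)(k-1) degrees of freedom, looked up separately."""
    b, k = ranks.shape
    col_sums = ranks.sum(axis=0)                  # rank sums per classifier
    a1 = np.sum(ranks ** 2)
    var_term = 2.0 * (b * a1 - np.sum(col_sums ** 2)) / ((b - 1) * (k - 1))
    return t_quantile * np.sqrt(var_term) / b     # divide by b: average ranks
```

When all subjects produce identical rankings, the residual variance (and hence the critical distance) is zero, so any difference in average ranks is significant.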
¹⁵ We give the formulas according to Conover (1999), where the tests are adjusted for the situation when ties are present. Demsar (2006) does not make this adjustment.
Figure 8.2. Ranking of six classifiers for two performance metrics: classification accuracy
(top) and map reproducibility (bottom). Ranks of 1 and 6 correspond to the best and the
worst performer, respectively. If the ranking of classifiers is not significantly different, they
are linked with a thick horizontal bar. Significance is established with Friedman and post-hoc nonparametric testing, as described in the text. For the contrasts with significant
difference in ranking, critical distances (CD) are also specified.
Figure 8.2 displays the results of this test graphically. Classification accuracy is evaluated in the
top row of Figure 8.2, and the reproducibility of spatial maps is evaluated in the bottom row.
Each bar represents a contrast; if there is a significant difference between the ranks (as estimated
by Friedman test), critical distances (CDs) are computed. The average ranks of classifiers are
marked on each bar, with 1 being the best performer and 6 being the worst. If the ranking of
classifiers is not significantly different (that is, if the distance between the corresponding markers
is smaller than the critical distance), they are linked together with a horizontal line under the bar.
Figure 8.2 shows that, in terms of classification accuracy, QD is the best-ranking method for two
out of three contrasts, but its ranking is not significantly different from SVM and LD-PC (for
"healthy/impaired" contrast) and from SVM and GNB-N (for "early/late" contrast). For that
second contrast, the methods fall into two groups with significantly different ranks: (QD, SVM,
GNB-N) are consistently better than (LD-PC, GNB-L, and LD-RR). For the "finger/wrist"
contrast, SVM has the highest rank, and two versions of LD are not significantly worse. QD and
GNB-L are ranked consistently lower than these three methods, and GNB-N is uniformly ranked
as the worst method in all subjects. In terms of reproducibility, we see that PC-based methods
(LD and QD) tend to have the highest ranking (with the exception of QD in "early/late" contrast).
This is consistent with our simulations, where maps for QD and LD-PC are much more
reproducible compared with other classifiers (see Figure 7.1, when M=0.02 or 0.03, ρ=0.5 or
0.99, and V>1). However, this is not significant in any of the three contrasts.
In our publication (Schmah et al., 2010), we used the stroke dataset to evaluate a large group of
classifiers, which included QD, LD-PC, linear SVM and GNB, and also K-nearest neighbours,
logistic regression, restricted Boltzmann machines (RBMs), and two nonlinear kernels of SVM.
In that publication, we used classification accuracy as our only metric of performance, and did not study the spatial maps (which are difficult to construct for some of the classifiers). Regularization was performed by optimizing classification accuracy, rather than the Δ metric we used above. The box-and-whisker plots of within-subject classification accuracies are
shown in Figure 8.3. Classification accuracies are higher than the numbers shown on the top row
of Figure 8.1, for two reasons. First, the classifiers in Schmah et al. publication were tuned to
optimize classification, without accounting for reproducibility of maps (as in Figure 8.1).
Second, the size of the training set was larger (75% in the Schmah et al. publication, compared to
50% in split-half resampling procedure used to compute Figure 8.1). The most dramatic
difference is demonstrated by QD in the "early/late" contrast: with accuracy-driven
regularization, it is at least 99% accurate in all the subjects, with a median accuracy of 99.81%.
The only method that outperforms QD for this contrast is RBM with generative training, a much
more complex method with a much larger computation time (it requires 7.4 hours to classify one
subject's data set, whereas QD requires only 49 seconds). Overall, the performance of LD and
QD is comparable to (and, in many cases, surpasses) the performance of more complex methods,
such as RBMs, logistic regression, and kernel SVM. This suggests that multivariate Gaussian
distribution is a useful probabilistic model for fMRI data; this is supported by a study by Hlinka
et al. (2011) that showed that linear correlation provides a good description of connectivity
between brain regions as measured by fMRI.
Figure 8.3. Performance of a larger group of classifiers, used in a study by Schmah et al.
(2010) to classify the data in the stroke recovery study.
8.3 Evaluation on real data: Aging study
The data recorded in the aging study (see Section 2.2.3) have been used as another real data set
for the evaluation of our pool of classifiers. This data set is larger than the stroke set: it consists
of 19 young and 28 older subjects, whereas the stroke set consists of 9 subjects. It is also
presumably less heterogeneous: all the subjects have been screened with a questionnaire to
exclude health problems, and the anatomical MRI scans have been inspected to rule out severe
abnormalities (Grady et al., 2010). We have used the set of six classifiers (QD, LD-PC, LD-RR,
linear SVM, linear and nonlinear GNB) to classify the fMRI volumes based on behavioural task.
We examine interactions between performance of classifiers, strength of the contrast, and age
group (our dimensionality studies show the difference in intrinsic dimensionality between the
two age groups; see Figure 5.7).
We have tested 4 contrasts: RT/FIX, DM/FIX, DM/RT, and DM/PM (these tasks are described
in detail in Section 2.2.3). All analysis has been performed separately for each subject and
contrast. Multivariate classifiers have been regularized with the optimization of the Δ metric.
Figure 8.4 displays the within-subject classification accuracy and reproducibility on a box-and-
whisker plot; two age groups are plotted separately (left and right columns correspond to young
and older subjects).
Figure 8.4. Performance of the pool of classifiers on the dataset from the aging study. Left
and right columns correspond to subjects in the young and the older age groups,
respectively.
Examination of performance in Figure 8.4 shows that the grouping of classifiers with respect to
their performance is similar to the grouping we see in the simulations (see bottom row of Figure
7.1; also, Figure 7.2). In simulated data, we observed the following grouping of methods with
respect to reproducibility: (1) QD and LD-PC, (2) SVM and LD-RR, (3) GNB-L and GNB-N.
This trend can be observed in the aging data as well. To study the ranking of classifiers, we
performed post-hoc Friedman tests; results are shown on Figures 8.5 (for the young group) and
8.6 (for the older group).
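The ranking procedure behind these figures can be sketched in Python. This is an illustrative reconstruction, not the thesis code: it assumes a Nemenyi-style critical distance (Demšar, 2006) as the post-hoc step, and the function name and array layout are hypothetical.

```python
import numpy as np
from scipy import stats

# Critical values of the studentized range statistic (already divided by
# sqrt(2)) at alpha = 0.05, indexed by the number of compared classifiers.
# Values from Demsar (2006); the Nemenyi post-hoc test is an assumption
# about how the "critical distance" bars were obtained.
Q_ALPHA_05 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850}

def rank_classifiers(accuracy):
    """Friedman test across subjects plus the Nemenyi critical distance.

    accuracy: (n_subjects, n_classifiers) array of per-subject accuracies.
    Returns each classifier's mean rank (higher = more accurate), the
    Friedman p-value, and the critical distance: two classifiers whose
    mean ranks differ by less than this are linked as "not significantly
    different".
    """
    n_subj, n_clf = accuracy.shape
    ranks = stats.rankdata(accuracy, axis=1)  # rank within each subject
    mean_ranks = ranks.mean(axis=0)
    _, p_value = stats.friedmanchisquare(*accuracy.T)
    cd = Q_ALPHA_05[n_clf] * np.sqrt(n_clf * (n_clf + 1) / (6.0 * n_subj))
    return mean_ranks, p_value, cd
```

Classifiers whose mean ranks fall within one critical distance of each other would then be linked by a horizontal bar, as in Figure 8.2.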
Figure 8.5. Ranking of classifiers in the young age group. Classifiers linked with a
horizontal bar are not significantly different in their ranking.
Figure 8.6. Ranking of classifiers in the older age group.
Figure 8.4 shows that in both age groups the RT/FIX and DM/FIX contrasts are the easiest to
classify. This is expected, because the RT and DM behavioural tasks require visual attention and
motor action, whereas FIX is a passive condition. Therefore, the cortical recruitment in the two
classes of the binary classification problem is expected to be quite different. In the DM/PM
contrast, this is not the case: both of these tasks required matching a target to one of the three
stimuli. In the PM task, the target is presented simultaneously with the three stimuli, whereas in
the DM task the target is presented first and then removed, and there is a 2.5 second interval of
blank-screen display before the three stimuli are presented. Therefore, the difference between the
two tasks is in the 2.5-second interval (with a TR of 2 seconds) when the subject had to
retain the target in short-term memory. Our pool of classifiers largely ignores this difference:
when distinguishing between DM and PM, they are only slightly better than random guessing.
The plot of reproducibility indicates that the PC-based methods (LD-PC and QD) are able to find
reproducible spatial networks within the DM/PM contrast data; however, the activity of these
networks is not predictive of a mean difference between the stimuli16. The DM/RT contrast is
intermediate: the RT task requires attention and motor action, but, unlike the DM task, it requires
neither perceptual matching nor short-term memory. Due to this difference between the classes,
the accuracy is better than chance on most subjects, but, on average, worse than the accuracy in
the two strongest contrasts (RT/FIX and DM/FIX).
Comparing the performance of our classifier pool across the age groups, we can see that the
classifiers frequently perform better on the older subjects. We know that the older subjects have
performed the behavioural tasks as accurately as younger subjects; however, the reaction time is
higher in the older group (Grady et al., 2010). The older subjects’ data may be easier to classify
because they spent more time carrying out the behavioural task. Another contributing factor
might be compensation in older adults: to achieve the same level of behavioural accuracy as the
younger subjects, the older subjects recruit larger areas of the cortex (Mattay et al., 2006),
making the signal easier to detect.
We have evaluated the significance of the age-related difference in accuracy with a Mann-
Whitney U test (Conover, 1999; this test is a non-parametric equivalent to Student's unpaired t
test). To account for multiple comparisons, we have used false discovery rate (FDR) correction
with FDR=0.05 (Genovese et al., 2002). In two contrasts (DM/FIX and RT/FIX), the accuracy of
some classifiers has been found to be significantly different between the young and the older
groups. In the DM/FIX contrast, these classifiers are LD-PC, LD-RR and SVM; in the RT/FIX
contrast, the difference is observed in LD-PC, LD-RR and QD. This significant difference in
accuracy implies the difference in cortical recruitment, which was explored in our group-level
classification presented in Section 8.4.2.
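The significance test above can be sketched as follows. The function name and input layout are assumptions; the Benjamini-Hochberg procedure is one standard way to implement the FDR correction of Genovese et al. (2002).

```python
import numpy as np
from scipy import stats

def group_difference_fdr(young_accs, older_accs, q=0.05):
    """Mann-Whitney U tests with Benjamini-Hochberg FDR control.

    young_accs, older_accs: lists of 1-D arrays of per-subject accuracies,
    one pair of arrays per comparison (e.g. per classifier-contrast
    combination; this layout is an assumption for illustration).
    Returns the p-values and a boolean mask of comparisons that survive
    FDR correction at level q.
    """
    pvals = np.array([
        stats.mannwhitneyu(y, o, alternative='two-sided').pvalue
        for y, o in zip(young_accs, older_accs)
    ])
    m = len(pvals)
    order = np.argsort(pvals)
    # Benjamini-Hochberg: find the largest i with p_(i) <= (i/m) * q
    passed = pvals[order] <= (np.arange(1, m + 1) / m) * q
    significant = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.nonzero(passed)[0].max()
        significant[order[:k + 1]] = True
    return pvals, significant
```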
16. The BOLD activity in these networks might not be predictive of the task, but the correlation within these networks might be very high (relative to the rest of the brain), making them easy to detect with PC analysis and, therefore, highly reproducible. Visual inspection of PC-based classifier maps for the DM/PM contrast showed strong sensitivity in the default-mode regions. These regions are not expected to be "recruited" per se by either DM or PM tasks, but they exhibit high correlation of BOLD signal (Spreng & Grady, 2010).
8.4 Spatial maps for the aging study
In the previous section, we have focused on within-subject analysis, and (among other things) on
reproducibility of spatial maps within a subject. The next step of evaluation is to address the
following questions:
1) for a given classifier, how reproducible are the maps across subjects?
2) for a given subject, how well do the maps computed with different classifiers agree with
each other?
In order to compare the maps across subjects and methods, they have to be normalized so the
distribution of noise is matched across subjects/methods. This normalization is described in
NPAIRS literature (Strother et al., 2002; LaConte et al., 2003), where the resulting normalized
maps are called "reproducible statistical parametric maps, Z-scored" (rSPM{Z}). The
distribution of signal and noise is computed from the scatter plots of two split-half maps (which
are divided by their respective standard deviations in order to bring them to the same scale). As
described in Section 2.1.2, the scatter plot has a major axis along the line of identity and a minor
axis perpendicular to it. Variation along the minor axis is due to the noise uncorrelated with
signal of interest, and variation along the major axis contains a mixture of signal and noise. For
two split-half maps z1 and z2, the projection of scatter-plot points onto the major and minor axes
is (z1+ z2)/2 and (z1- z2)/2, respectively (Strother et al., 2002). To obtain rSPM{Z}, we divide the
projection onto the major axis by the standard deviation of the projection onto the minor axis.
We repeat this procedure for all splits; each split gives us a scatter plot, and rSPM{Z} patterns
are averaged across splits.
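The normalization just described can be sketched in Python. This is an illustrative reconstruction of the NPAIRS rSPM{Z} procedure, not a reference implementation; the array layout and function name are assumptions.

```python
import numpy as np

def rspm_z(maps_half1, maps_half2):
    """Average rSPM{Z} from pairs of split-half spatial maps.

    maps_half1, maps_half2: (n_splits, n_voxels) arrays; row i holds the
    two half-maps from split i. Each half-map is scaled to unit variance,
    the voxelwise scatter plot is projected onto the major (signal) and
    minor (noise) axes, and the signal projection is divided by the
    noise standard deviation pooled across voxels.
    """
    z = np.zeros(maps_half1.shape[1])
    for m1, m2 in zip(maps_half1, maps_half2):
        m1 = m1 / m1.std()
        m2 = m2 / m2.std()
        signal = (m1 + m2) / 2.0   # projection onto the line of identity
        noise = (m1 - m2) / 2.0    # projection onto the minor axis
        z += signal / noise.std()  # Z-score against the pooled noise SD
    return z / maps_half1.shape[0]  # average rSPM{Z} across splits
```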
Visual inspection of normalized rSPM{Z} maps, averaged across subjects for each contrast and
classifier, has shown that the brain areas with the strongest expression of contrast (i.e., largest Z-
scores) are neurobiologically meaningful. For the RT/FIX and DM/FIX contrasts, the brain areas
with preference17 to FIX condition are members of the default mode network18 (Fox et al., 2003;
17. Preference is defined analogously to Rasmussen et al. (2012A). Let's say that a volume is assigned to class 1 when the decision function, computed on that volume, is positive; if it is negative, the volume is assigned to class 2. A voxel has a preference for class 1 if increasing the signal in that voxel leads to an increase in the decision function.
Greicius et al., 2005; Toro et al., 2008). These areas are: posterior cingulate cortex, anterior
cingulate cortex with adjacent ventro-medial prefrontal area, bilateral angular gyri, and bilateral
superior frontal gyri. The areas that show preference for the active task (RT in the RT/FIX
contrast, DM in the DM/FIX contrast) are the motor areas (primary motor cortex and middle
cingulate/supplementary motor area), bilateral insulae/frontal opercula and dorso-lateral
prefrontal cortices, and large areas of bilateral intraparietal lobules. These regions are often
recruited when performing an externally driven task (Grady et al., 2010); they have been shown
to be correlated amongst themselves and anti-correlated with default-mode regions (Toro et al.,
2008). Grady et al. (2010) refer to the network formed by these areas as the "task-positive
network".
For the DM/RT contrast, the preference for RT is found in bilateral insulae, anterior and
posterior cingulate gyri, middle cingulate/supplementary motor area, and left primary motor
cortex. Preference for DM is found in posterior area of intraparietal lobule (bilaterally). The
same area shows preference for DM in the DM/PM contrast, whereas preference for PM is found
in visual and middle cingulate areas. Neurobiological interpretation of the results of DM/RT and
DM/PM contrasts is difficult because of high variability across subjects (see below), and requires
further analysis.
8.4.1 Reproducibility of spatial maps across subjects and across methods
Across-subject reproducibility has been evaluated by correlating the individual spatial maps
between all possible pairs of subjects within the age group. We have computed spatial maps for
all subjects and contrasts, using 6 classifiers from our pool; the regularization hyperparameters
have been tuned to optimize the Δ metric. The spatial maps have been normalized as described
above. After that, for each classifier separately, we have computed Pearson's correlation
coefficient across all possible pairings of subjects within each group. The young group consists
of 19 subjects, giving us 171 possible pairings; older group has 28 subjects and 378 possible
18. Greicius et al. (2003) define the default-mode areas as the areas that show "relative decreases in neural activity during task performance compared with a baseline state".
pairings. The box-and-whisker plot on Figure 8.7 presents the distribution of these pairwise
correlations in the young and the older groups. This figure demonstrates that across-subject
reproducibility is very similar for all classifiers in our pool. The advantage that PC-based
methods have in within-subject reproducibility does not generalize across subjects. LD-PC and
QD are operating on a subset of highest-ranked principal components; these components capture
the primary sources of variance within a subject, which boosts the reproducibility of QD and LD-
PC individual maps. However, across-subject reproducibility of PC-based classifier maps is
much smaller than their within-subject reproducibility, which suggests that these primary sources
of variance are heterogeneous across subjects. Figure 8.7 also shows that across-subjects
reproducibility of univariate maps (GNB-L and GNB-N) is comparable to reproducibility of
multivariate maps. It should be noted that our procedure of normalization is essentially
multivariate, because the standard deviation along the noise (minor) axis is pooled across voxels;
therefore, the rSPM{Z} patterns created for GNB-L and GNB-N are not truly univariate.
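The pairwise computation described above can be sketched as follows (the function name and array layout are assumptions for illustration):

```python
import numpy as np
from itertools import combinations

def pairwise_map_correlations(maps):
    """Across-subject reproducibility of spatial maps.

    maps: (n_subjects, n_voxels) array of normalized rSPM{Z} maps for one
    classifier and contrast. Returns Pearson's correlation for every pair
    of subjects: n*(n-1)/2 values (19 subjects give 171 pairings,
    28 subjects give 378, as in the text).
    """
    corrs = []
    for i, j in combinations(range(maps.shape[0]), 2):
        corrs.append(np.corrcoef(maps[i], maps[j])[0, 1])
    return np.array(corrs)
```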
Figure 8.7. Across-subject reproducibility of within-subject spatial maps created by
different classifiers. The left and right panels correspond to the young and the older groups
of subjects, respectively.
The contrast has a marked effect on across-subject reproducibility, which tends to be larger for
the strong contrasts (RT/FIX and DM/FIX). Reproducibility of maps for the weakest contrast,
DM/PM, is near zero on average. There is also a difference between the age groups: in the older
group, the maps are more reproducible across subjects. This difference is significant for all
classifiers in the two strong contrasts (p<0.05 on a Mann-Whitney U test, corrected for multiple
comparisons with FDR at α=0.05). In the DM/RT and DM/PM contrasts, it is significant in LD-
RR, GNB-L and GNB-N maps. In the DM/PM contrast, it is also significant in SVM maps.
Recall that within-subject reproducibility is not significantly different between the age groups
(Figure 8.4, bottom row); there seems to be more group heterogeneity in the older group, but the
amount of within-subject heterogeneity is roughly equivalent for the two age groups.
Figure 8.8. Jaccard overlap of within-subject spatial maps across subjects. The left and
right panels correspond to the young and the older groups of subjects, respectively.
Similarity of within-subject spatial maps across subjects was also evaluated using Jaccard
overlap. We have thresholded the rSPM{Z} maps, created for each subject, classifier, and
contrast, to match the false discovery rate of 0.1 (see Genovese et al., 2002, for a description of
this method of correcting spatial maps for multiple comparisons). The Jaccard overlap between
the two thresholded spatial maps X and Y is
J(X, Y) = |X ∩ Y| / |X ∪ Y| (8.3)
This overlap was computed for all possible pairings of subjects within an age group, for each
classifier and contrast. Results are displayed on Figure 8.8; the similarity of Jaccard overlap to
reproducibility (Figure 8.7) is evident. Similarity across subjects is more pronounced in the two
strong contrasts; in this case, there is more similarity in the older group compared with the young
group. In the weakest contrast (DM/PM), across-subject similarity is close to zero.
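Formula 8.3 applied to thresholded maps amounts to the following sketch (the function name is hypothetical, and the FDR-derived cutoff is assumed to be computed beforehand):

```python
import numpy as np

def jaccard_overlap(map_x, map_y, threshold):
    """Jaccard overlap (Formula 8.3) between two thresholded maps.

    map_x, map_y: 1-D arrays of Z-scores; threshold: the cutoff derived
    from the FDR correction (e.g. at FDR = 0.1). Voxels whose absolute
    Z-score reaches the threshold form the sets X and Y.
    """
    x = np.abs(map_x) >= threshold
    y = np.abs(map_y) >= threshold
    union = np.logical_or(x, y).sum()
    if union == 0:
        return 0.0
    return np.logical_and(x, y).sum() / union
```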
Figure 8.9. Correlation of average spatial maps across classifiers. Individual subject maps
created by each of the 6 classifiers have been averaged across all subjects from our study.
The next question is, how similar are spatial maps across methods? When we average the
normalized maps across subjects, the maps created by different classifiers tend to be quite
similar. Figure 8.9 plots Pearson's correlation between pairs of average maps for each pairing of
6 classifiers. The average has been taken across all subjects, young and older pooled together
(the same trends have been observed for group averages). For three contrasts (RT/FIX, DM/FIX
and DM/RT), we observe a large amount of consensus in our classifiers: the correlation values
are 0.84 and higher. For these contrasts, the weakest correlation is observed between pairings of
one multivariate and one univariate method. Consensus between pairs of multivariate methods is
higher: the smallest observed value is 0.95 for RT/FIX, 0.99 for DM/FIX, and 0.93 for DM/RT.
Between the two univariate methods, correlation of average maps is 1.0 in all observed contrasts.
Across-methods correlation in the weakest contrast (DM/PM) is smaller than in the other three
contrasts; it is at least 0.63. In all contrasts, highest correlation is observed in pairs (QD, LD-PC),
(LD-RR, SVM), (GNB-N, GNB-L).
In addition, we computed the Jaccard overlap of average maps created by different classifiers for
the same contrast. The within-subject rSPM{Z} maps were averaged across all subjects, and
thresholded at FDR≤0.1. Using Formula 8.3, we computed the Jaccard overlap between averaged
maps created by all possible pairs of classifiers for a given contrast. For the two weakest
contrasts (DM/RT and DM/PM), the Jaccard overlap was consistently zero (even with the
somewhat liberal choice of threshold of 0.1). The results for the two strongest contrasts are
presented in Figure 8.10. The strongest similarity is observed in classifier pairs (QD, LD-PC);
(LD-RR, SVM); (GNB-L, GNB-N). The similarity between the univariate and multivariate
classifiers is relatively low. Finally, the similarity between different multivariate classifiers is
stronger in the DM/FIX contrast than in the RT/FIX contrast. These results are consistent with
across-method reproducibility results shown in Figure 8.9.
Figure 8.10. Jaccard overlap of average spatial maps across classifiers, for 2 strong
contrasts (RT/FIX and DM/FIX).
Correlation and overlap of averaged maps do not consider the variability of maps across
subjects. To further investigate across-methods similarity, we have used DISTATIS, a variant of
multidimensional scaling which takes the across-subject variability into account. Full treatment
of DISTATIS can be found in publications by Abdi and colleagues (2005, 2009). The goal of
multidimensional scaling is to find a low-dimensional representation of high-dimensional data in
such a way that the distances between low-dimensional representations of any two data points
are good approximations to the distances between these points in the original high-dimensional
space (see Mardia et al., 1979, pages 394-409). For our purpose, the high-dimensional data are
the spatial maps, and we define the distance between ith and jth map as 1-ρij, where ρij is
Pearson's correlation coefficient of ith and jth maps. Distance matrix contains the distances
between all possible pairings of the data points. Multidimensional scaling finds a low-
dimensional representation from the eigendecomposition of the distance matrix. DISTATIS is a
generalization of this method to a set of distance matrices. This method combines the distance
matrices into a single compromise matrix, and projects the original distance matrices onto the
compromise matrix.
In order to apply DISTATIS, we compute within-subject distance matrices. Our pool of
classifiers consists of six methods, so we compute a 6×6 distance matrix for each of the 47
subjects. We double-center these matrices; let Si denote the doubly-centered distance matrix for
the ith subject. Then we compute the similarities between distance matrices for each pair of
subjects. The similarities are computed with an RV coefficient, which indicates how much
information is shared between two matrices (Abdi et al., 2009). We form the 47×47 matrix of RV
coefficients, and compute its first eigenvector p1. The ith coordinate of this eigenvector indicates
how similar the ith subject's distance matrix is to all other subjects' distance matrices. Then, the
compromise matrix S+ is formed as a weighted sum of doubly-centered distance matrices:
S+ = Σ_{i=1}^{#subjects} α_i S_i (8.4)
where the weight αi is the ith coordinate of p1, divided by the sum of all coordinates of p1. The
compromise matrix S+ is the best way (in the least-squares sense) to represent all 47 distance
matrices with a single matrix.
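The construction of the compromise matrix (Formula 8.4) can be sketched as follows; this is an illustrative reconstruction of the DISTATIS steps described above, with an assumed function name and array layout.

```python
import numpy as np

def distatis_compromise(dist_matrices):
    """Compromise matrix for DISTATIS (after Abdi and colleagues).

    dist_matrices: (n_subjects, k, k) array of within-subject distance
    matrices (here k = 6 classifiers; distance = 1 - Pearson correlation).
    """
    n, k, _ = dist_matrices.shape
    # Double-center each distance matrix: S_i = -0.5 * J D_i J, J = I - 1/k
    J = np.eye(k) - np.ones((k, k)) / k
    S = np.array([-0.5 * J @ D @ J for D in dist_matrices])
    # RV coefficients between every pair of centered matrices
    flat = S.reshape(n, -1)
    inner = flat @ flat.T
    norms = np.sqrt(np.diag(inner))
    rv = inner / np.outer(norms, norms)
    # First eigenvector of the RV matrix gives the subject weights
    vals, vecs = np.linalg.eigh(rv)
    p1 = np.abs(vecs[:, -1])  # eigh sorts ascending; take the largest
    alpha = p1 / p1.sum()
    # Weighted sum of the centered distance matrices (Formula 8.4)
    return np.tensordot(alpha, S, axes=1)
```

With identical distance matrices for all subjects, the weights are uniform and the compromise reduces to the doubly-centered version of that shared matrix, which is a convenient sanity check.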
A low-dimensional representation of S+ is computed from its eigendecomposition. For easy
visualization, it is common to use 2-dimensional representation (using the first two principal
components of S+). Figure 8.11 plots this 2-dimensional representation for each of the four
contrasts. The first and second principal components are represented by the horizontal and
vertical axes, respectively; on each axis, we specify the amount of variance explained by each
component. We project the centroids of the spatial maps created by each of the six methods onto
this coordinate space; the projections of centroids are marked with +. In order to compute the
confidence intervals around the centroids, we have drawn 1000 bootstrap samples from our set of
47 subjects. The confidence intervals are shown as ellipses around the centroids.
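The bootstrap behind the confidence ellipses can be sketched as follows. Only the resampling of subjects and recomputation of per-method centroids is shown, not the ellipse fitting; the array layout is an assumption.

```python
import numpy as np

def bootstrap_centroids(coords, n_boot=1000, rng=None):
    """Bootstrap distribution of per-method centroids in compromise space.

    coords: (n_subjects, n_methods, 2) array holding each subject's
    projection of each method onto the first two principal components of
    the compromise matrix (assumed layout). Subjects are resampled with
    replacement; returns a (n_boot, n_methods, 2) array of centroids.
    """
    rng = rng or np.random.default_rng()
    n = coords.shape[0]
    out = np.empty((n_boot,) + coords.shape[1:])
    for b in range(n_boot):
        sample = rng.integers(0, n, size=n)    # resample subjects
        out[b] = coords[sample].mean(axis=0)   # centroid per method
    return out
```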
Figure 8.11. DISTATIS plots of similarity of within-subject maps created with different
classifiers.
Figure 8.11 displays the familiar grouping of methods: (QD and LD-PC), (SVM and LD-RR),
and (GNB-L and GNB-N). This pairing is also observed in simulated data when we evaluate the
reproducibility and ROC properties of the algorithms (Figure 7.1, bottom row; Figure 7.2); it is
also observed in evaluation of within-subject reproducibility in the aging study (bottom rows of
Figures 8.5 and 8.6). Within each pair of methods, there is a similarity in computational models:
both QD and LD-PC use PCA-based regularization, both GNB-L and GNB-N are univariate
Gaussian Naive Bayes classifiers, and both SVM and LD-RR use an L2 penalty for regularization.
This pairwise similarity between methods is especially strong in the DM/FIX and RT/FIX
contrasts, where the corresponding ellipses overlap almost completely. In two weak contrasts
(DM/RT and DM/PM), the overlap in the (QD and LD-PC) and in the (SVM and LD-RR) pairs
of ellipses is reduced, although the two GNB ellipses fully overlap. Therefore, the strength of the
contrast influences the consensus between the multivariate methods that use the same
regularization scheme.
8.4.2 Group-level classification of spatial maps
In addition to reproducibility, we have evaluated another aspect of spatial maps: their ability to
capture age-related information. Grady et al. (2010), using the same data set, have reported the
differences in cortical recruitment between the young and the older groups, along with
behavioural differences in reaction time. These differences have been found with Partial Least
Squares (PLS) analysis pooling all the subjects within the age group. In this section, we have
studied how well the age-related difference is preserved in within-subject spatial maps created by
different classifiers. We frame this problem in terms of classification: given a within-subject
spatial map, can we reliably predict the age group of that subject? We have already shown that
classification accuracy, as well as across-subjects reproducibility, is different between the age
groups, at least for some classifiers and contrasts. We can therefore expect, at least in some
situations, a better-than-chance accuracy when we classify the individual maps according to the
age group.
We have used LD-PC and QD for this group-level classification. Within-subject maps have been
created for 6 classifiers and 4 contrasts, as before; multivariate classifiers have been regularized
with Δ-metric optimization. The maps have been converted to rSPM{Z} as described in the
previous section. To match the sample size from the two age groups, we have used all of our 19
young subjects, and have randomly selected 19 subjects from the older group. This procedure has
been performed 10 times. During each iteration, we have performed 1000 splits of the within-
subject maps into training, validation and test sets. The test set has been formed by randomly
selecting one young and one older subject; the same procedure has been used to form the
validation set (assuring that different subjects are selected for test and validation). The remaining
2*17=34 subjects have formed the training set. This splitting scheme maximizes the size of the
training set, which is advantageous when data are heterogeneous. For each subject, the accuracy
of classification has been computed as the frequency of correct prediction of age group for that
subject.
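One split of this scheme can be sketched as follows; the index bookkeeping (young subjects first, then the matched older subjects) is an assumption for illustration.

```python
import numpy as np

def split_subjects(n_per_group=19, rng=None):
    """One training/validation/test split of the group-level scheme.

    From 19 young and 19 matched older subjects, pick one subject per
    group for the test set, one per group for the validation set, and
    keep the remaining 2*17 = 34 subjects for training. Indices
    0..18 denote young subjects, 19..37 denote older subjects.
    """
    rng = rng or np.random.default_rng()
    young = np.arange(n_per_group)
    older = np.arange(n_per_group, 2 * n_per_group)
    test = np.array([rng.choice(young), rng.choice(older)])
    # Validation must not reuse the test subjects
    val = np.array([
        rng.choice(np.setdiff1d(young, test)),
        rng.choice(np.setdiff1d(older, test)),
    ])
    train = np.setdiff1d(np.arange(2 * n_per_group),
                         np.concatenate([test, val]))
    return train, val, test
```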
Figure 8.12. Accuracy of group-level classification of individual maps that have been
created by six different classifiers. Within-subject maps have been classified according to
the age group of the subject ("young" versus "older").
The accuracy of LD-PC is displayed on a box-and-whisker plot in Figure 8.12 (QD produces
very similar results). We can see that the age group could be predicted from individual maps
with better-than-chance accuracy, irrespective of which classifier has been used to create the map
(the only exceptions are QD and LD-RR in DM/PM contrast). In the two strongest contrasts,
RT/FIX and DM/FIX, the maps are easily classifiable; median classification accuracy is between
74% (for LD-PC maps) and 85% (for GNB-N maps). This suggests that the individual spatial
maps contain information that encodes age; all the classifiers from our pool are capable of
capturing this information. The maps for the DM/RT contrast are harder to classify; median
accuracy ranges from 57% for SVM maps to 70% for the GNB-N maps. Classification for the
weakest contrast, DM/PM, is less accurate but still above chance (except for QD and LD-RR
maps in some subjects).
This indicates that the individual maps indeed contain age-relevant information. To locate the
voxels that encode this information, we have created spatial maps for the LD-PC classification of
individual maps. For each contrast, six of these group-level maps have been made, one per
classifier from our pool. Maps have been computed according to Formula 3.17 from Chapter 3
(with our data set consisting of individual maps rather than fMRI volumes), and converted to
normalized rSPM{Z} patterns. We refer to these maps as "group-difference" maps, to clearly
distinguish them from the "group-average maps", that is, group-specific averages of individual
maps. On the group-difference map, voxel weights reflect sensitivity to age, that is, the
importance of this voxel in classifying the subjects’ maps according to the age of participants.
Positive values of a voxel indicate that it tends to have higher Z-scores in the older subjects’
individual maps; a negative value indicates the opposite (the voxel tends to be higher in the young
subjects’ individual maps).
Interpretation of the group-difference maps is not trivial. In our situation, the individual maps are
created for a binary classification problem (say, task A versus task B); a positive voxel value in
individual maps reflects preference for task A, and a negative value reflects preference for task
B. A positive voxel value in a group map indicates that, in this spatial location, the difference (A-
B) is higher in the older than in the young subjects. This could happen in a number of situations:
- older subjects are sensitive to A in this spatial location, and young subjects to B;
- in both groups, this spatial location is sensitive to A, but more so in the older subjects;
- in both groups, it is sensitive to B, but more so in the young subjects.
To distinguish between these three situations, individual maps need to be inspected along with
group-difference maps.
Using LD-PC, we have constructed 6 group-difference maps for each of the 4 contrasts (making
one group-level map for each bar in Figure 8.12). Group-difference maps have been normalized
to rSPM{Z} and thresholded at α=0.05 using false discovery rate correction for multiple
comparisons. Thresholded maps have been visually inspected to find brain areas that might
encode the age effect. Table 8.1 lists the areas that survived the correction for multiple
comparisons in a majority of group-difference maps for a given contrast19. No such areas have
been found for the DM/PM contrast because of lack of consensus among group-difference maps.
19. Since the maps for GNB-L and GNB-N classifiers were very similar, they counted as one vote. The areas listed in Table 8.1 were the areas found significant by at least 3 votes out of 5 (4 multivariate methods and one univariate).
This table also lists the preferred condition for each area and each age group. The preference is
determined as follows: if the "Task A versus Task B" contrast is expressed positively in this area
of the normalized group-average map (that is, the z-scores are positive), the preference is for
Task B; if the expression of the contrast is negative, the preference is for task A. The table also
lists the z-scores for each area and age group; first, we have identified location of most active
voxel in the group-difference map, and then we have obtained the z-score from this location in
the group-average LD-PC maps.
area preference in young preference in older
DM/FIX
precuneus FIX (2.2) DM (‐2.3)
L intraparietal sulcus DM (‐2.5) DM (‐5.4)
L dorsolateral PFC FIX (0.4) DM (‐3.3)
R intraparietal sulcus DM (‐2) FIX (0.1)
posterior cingulate FIX (3.5) FIX (1.6)
R sup. cerebellum DM (‐3.7) DM (‐6.6)
L sup. cerebellum DM (‐3.3) DM (‐5.6)
L primary motor DM (‐3.2) DM (‐5.6)
RT/FIX
precuneus FIX (1.5) RT (‐2.6)
L intraparietal sulcus RT (‐3.2) RT (‐6.5)
R intraparietal sulcus RT (‐2.8) RT (‐5.2)
L primary motor RT (‐2.4) RT (‐5.1)
R primary motor RT (‐0.8) RT (‐3.3)
L temporal pole RT (‐3.2) RT (‐2.6)
DM/RT
R intraparietal sulcus DM (‐0.7) RT (2.1)
L intraparietal sulcus RT (0.8) RT (2.2)
R primary motor RT (0.4) RT (2.4)
L primary motor RT (0.9) RT (3)
middle cingulate / SMA RT (0.7) DM (‐0.7)
L superior temporal RT (0.4) RT (2.4)
Table 8.1. Cortical areas that show sensitivity to age. Abbreviations: PFC, prefrontal
cortex; SMA, supplementary motor area.
The majority of the areas listed in Table 8.1 for the RT/FIX and DM/FIX contrasts are the
task-positive areas with stronger preference (that is, higher absolute z-scores) for the active task
(RT or DM) in the older group. In addition, a subregion of right intraparietal sulcus (a task-
positive area) has been found to have higher preference for DM in the younger group, and an
area located at the left temporal pole had a higher RT preference in the younger group. Posterior
cingulate cortex, the default-mode region, has a stronger preference for FIX in the young group
(however, it is below the FDR-corrected threshold in the RT/FIX contrast). Precuneus is
particularly interesting, because it has a strong FIX preference in the young subjects and strong
preference for the active task in the older subjects. This is the only area where we have found the
strong difference in preferred condition between the young and the older groups.
Overall, the difference between the age groups seems to be driven by the stronger preference for
the active task in the older subjects, and, perhaps, a stronger preference for FIX in the young
subjects. This is consistent with the hypothesis of compensation in the older individuals: to reach
the same level of behavioural performance as the young subjects, they recruit the task-positive
regions more prominently (see Grady, 2008, for the discussion of compensatory recruitment in
aging). The areas listed for the DM/RT contrast follow this trend: most of them are the task-
positive areas with stronger RT preference in the older group. The only exception is the region in
the middle cingulate / supplementary motor area, which shows RT preference in the young group
and DM preference in the older group; however, this preference is somewhat weak (|z|<1).
Figures 8.13 and 8.14 show some examples of group-difference maps with corresponding group-
average maps. The maps are constructed with LD-PC; most areas listed in Table 8.1 for the
DM/FIX and RT/FIX contrasts are displayed in Figures 8.13 and 8.14, respectively. In the group-
average maps, the cortical recruitment for the active tasks (RT and DM) is more spatially
extensive in the older group, which conforms with the hypothesis of compensatory recruitment.
In addition, the younger group shows more extensive recruitment of default-mode regions for
FIX (this is not universal: in Figure 8.13, the older group shows more extensive recruitment of the
ventromedial prefrontal region, a default-mode area). This is perhaps because the
active task is behaviourally harder for the older subjects, who showed longer response times
than the young subjects. The activity of the default-mode network is known to be
modulated by the behavioural difficulty of the active task (Grady et al., 2010); it can be higher in
young subjects because they find the active task easier to perform than the older
subjects do.
It is interesting to compare our findings with the results obtained by Grady and
colleagues (2010) on the same data set with a different method of multivariate analysis, partial
least squares (McIntosh et al., 1996). This method finds the latent variables (LVs) that best
explain the covariance between the fMRI data and the experimental design variables; for this
study, the experimental variables are the age group and the behavioural task. The most
significant LV (accounting for 24% of the covariance) captures the difference between the
fixation and the active tasks; the corresponding brain areas are members of the default mode and
the task-positive networks. The second most significant LV (accounting for 12.6% of the
covariance) captures the age effect, that is, the difference between the young and the older
groups. The significance of the latent variables was estimated with a permutation test, and the
remaining LVs were found to be insignificant.

Figure 8.13. Cortical areas affected by aging, as revealed by the RT/FIX contrast. The top
row shows the group-difference map, thresholded at p<0.05. The middle and bottom rows
show the unthresholded group-average maps for the young and the older group,
respectively.

Figure 8.14. Cortical areas affected by aging, as revealed by the DM/FIX contrast. The top
row shows the group-difference map, thresholded at p<0.05. The middle and bottom rows
show the unthresholded group-average maps for the young and the older group,
respectively.
The second latent variable reveals a list of areas that encode the age effect. It is important to
point out that these areas are strong sources of covariance only after the variance due to the first
LV has been factored out. Therefore, the areas where the fixation-versus-active-task effect is
stronger than the young-versus-older effect will generally not be captured by the second LV.
This is different from our group-level analysis, where we can identify the areas that show a
strong age effect regardless of the strength of the fixation/active task effect in these areas. In our
group-difference maps, the areas identified by the second LV encode the age effect to some
degree, but most of them do not survive the FDR correction for multiple comparisons. The only
area found to be significant in both studies is the precuneus; consistent with our
results, Grady et al. observed a difference in preferred condition between the age groups.
The majority of the areas where we found the age effect (see Table 8.1) are not captured by
the second LV, but they are captured by the first LV, potentially because the task effect in these
areas dominates the age effect. The conclusion of Grady and colleagues, that the older group
shows more extensive recruitment of the task-positive regions while the younger group shows more
extensive recruitment of the default-mode regions, is supported by our results.
Chapter 9 Conclusions and Future Research
9.1 Evaluation of classifiers for fMRI data
The evaluation of classifiers presented in the thesis shows that predictive multivariate Gaussian
methods (linear and quadratic discriminants) are useful and efficient tools for fMRI
classification. However, optimal performance of LD/QD requires careful regularization, because the
number of dimensions (voxels) in a typical fMRI data set far exceeds the number of observations
(volumes). If PCA is used to reduce the data dimensionality, performance of LD/QD is strongly
influenced by the number of PCs retained for analysis. As pointed out by LaConte et al. (2003),
selecting too few or too many components produces a model that does not describe the data
adequately. In some previous studies that evaluated LD-PC against SVM (Cox & Savoy, 2003;
Mourao-Miranda et al., 2005), all principal components were retained for analysis; as a result,
SVM greatly outperformed LD-PC, which sometimes performed at the level of random guessing.
However, LaConte et al. (2006) showed that, with careful selection of PC basis, the
performances of LD-PC and linear SVM are comparable (in that study, LD-PC used the number
of components that optimized classification accuracy, and the complexity of SVM was
controlled in the same way). Another example of poor performance of LD-PC as a result of
suboptimal choice of dimensionality is a study by Lukic et al. (2002), which introduced the
simple simulation algorithm for fMRI/PET data that we have adapted (we have modified it by
simulating the effect of hemodynamic response). In that study, 1/3 of the total number of PCs is
retained; this is a fixed number that depends only on the sample size, and is not influenced by
signal-to-noise ratio in the data. The resulting performance of LD-PC is never better than a
univariate t-test, and sometimes much worse.
In our evaluation, LD and QD were regularized by optimizing the Δ metric, which is a
combination of two performance metrics: accuracy of classification and reproducibility of spatial
maps. As a result, the classification accuracy of LD-PC was comparable to SVM, in simulated as
well as real data, with two exceptions: early/late contrast in the stroke data (Section 8.2), and, to
a lesser extent, RT/FIX contrast in the aging data (Section 8.3). This replicated the result
obtained by LaConte et al. (2006) on a different data set. In our simulations (Section 7.2.1), QD
was shown to be the most accurate classifier in the situation of strong heteroscedasticity, when
the following three conditions were satisfied:
a) the expected values of the class means were identical (M=0);
b) the within-class covariance matrices were sufficiently different (V>1);
c) the correlation between the nodes of the active network was large (ρ=0.99).
If the correlation between the active areas was zero, nonlinear GNB was the most accurate of the
classifiers we have tested. However, the usefulness of GNB-N classifier in real fMRI data is
questionable, because of the strong spatial correlation between (at least a subset of) voxels in the
brain. We never observed GNB-N significantly outperforming other classifiers in real data;
situations when it was the worst-ranking classifier (in terms of classification accuracy) were
observed in both stroke (finger/wrist contrast) and aging data sets (DM/RT and DM/PM
contrasts). On the other hand, QD was the most accurate classifier for the early/late and
healthy/impaired contrasts in the stroke set. Also, in our study of dimensionality in self-control,
QD was more accurate than LD in classifying spatial maps (71% versus 58%); one class was
much more heterogeneous than the other, and the classes were better discriminated with a
nonlinear decision surface. Our conclusion is that, in terms of classification accuracy, a
combination of LD-PC and QD is no worse than, and sometimes better than, linear SVM. Running a
combination of QD and LD-PC takes no longer than running either method alone,
because the computation is dominated by the PCA decomposition of the data; once that is done, LD-
PC and QD take very little time to compute. When the difference between classes is driven by
the difference in means, LD-PC is likely to be the better of the two methods; when it is driven by a
difference in connectivity, QD is likely to be the better model.
However, the real advantage of LD-PC and QD over LD-RR, SVM and GNB is the high
reproducibility of their within-subject spatial maps. This point is often ignored in comparative
evaluation studies of classifiers for fMRI, which frequently use classification accuracy as the
sole metric of performance (e.g., Cox & Savoy, 2003; Mourao-Miranda et al., 2005; Ku et al.,
2008; Misaki et al., 2010; Schmah et al., 2010). Our simulations (Section 7.2.2) show that LD-PC
and QD tend to produce within-subject maps that are (sometimes dramatically) more
reproducible than the maps produced by methods that do not use PCA regularization; this is
replicated in the aging data set (Section 8.3). This is not altogether surprising, because the highest-
ranking principal components tend to extract strongly reproducible spatial patterns. Therefore,
LD-PC and QD are particularly useful tools for within-subject analysis in neuroscience, where
obtaining a reproducible spatial map is no less important than accurate classification.
Another advantage of PC-based classifiers is the additional information that is obtained by
determining the optimal PC subspace. The size of this subspace is the number of orthogonal
dimensions in a model that gives a "good" description of our data20. Chapters 4 and 5 present a
survey of optimization metrics that can be used as quantitative measures of how "good" the
model is at describing the data. We conclude that the most useful metrics are: reproducibility of
LD/QD spatial maps, accuracy of LD/QD classification (which is, however, a less robust metric
compared to reproducibility), and generalization of PPCA model. All three metrics are measured
on an independent test set using NPAIRS resampling scheme.
In Chapter 6, we show that intrinsic PC dimensionality can have a neurobiological significance:
we demonstrate that it relates to behavioural measures of post-stroke recovery of motor function
(Yourganov et al., 2010; see Section 6.1), as well as the strength of self control (Berman et al.,
2013; see Section 6.2). Also, preliminary results of Schmah and colleagues show that intrinsic
dimensionality is a predictor of post-stroke recovery of speech after the intensive course of
speech therapy (the abstract has been submitted for presentation at Rotman annual neuroscience
conference, March 2013).
9.2 Directions for future research
9.2.1 Simulations of multiple networks
The simulation framework that we use for evaluation is somewhat simplistic: the active areas are
either organized into a single network, or uncorrelated. This is convenient for the purpose of
evaluation, because the active network is controlled by only three parameters (M, V and ρ); the
influence of these parameters on the classifier performance and on dimensionality estimation can
20 The regularization parameter in LD-RR can be transformed into effective degrees of freedom (Kustra & Strother, 2001). We have not investigated the correspondence between effective d.o.f. and estimated dimensionality (which defines the number of degrees of freedom in the LD/QD model).
be explored thoroughly. However, it is somewhat unrealistic: the brain is organized into a set of
networks, not into one network (Toro et al., 2008). An obvious direction for future research is to
modify the simulation framework so the active areas are coupled into two or more networks.
This would increase the number of parameters of simulated signal; however, the gain in
neurobiological relevance would be more important than the growth of parameter space.
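A two-network version of the simulated signal could look like the following sketch (numpy; the node counts, the common mean M=1, and the two within-network correlations are illustrative assumptions): each network gets a compound-symmetry covariance block, and the blocks are left mutually uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(2)

def network_cov(n_nodes, rho, var=1.0):
    """Compound-symmetry covariance: every pair of nodes within the
    network shares the same correlation rho."""
    C = np.full((n_nodes, n_nodes), rho * var)
    np.fill_diagonal(C, var)
    return C

# Two independent active networks with different internal correlations.
n_a, n_b = 4, 4
cov = np.zeros((n_a + n_b, n_a + n_b))
cov[:n_a, :n_a] = network_cov(n_a, rho=0.9)
cov[n_a:, n_a:] = network_cov(n_b, rho=0.4)

# Mean activation of 1.0 on every active node (the M parameter).
signal = rng.multivariate_normal(np.full(n_a + n_b, 1.0), cov, size=200)

# Check the simulated structure: strong within-network correlation,
# near-zero correlation between the two networks.
emp = np.corrcoef(signal, rowvar=False)
within_a = emp[:n_a, :n_a][np.triu_indices(n_a, 1)].mean()
within_b = emp[n_a:, n_a:][np.triu_indices(n_b, 1)].mean()
between = np.abs(emp[:n_a, n_a:]).mean()
```

Adding hemodynamic convolution and background noise, as in the single-network framework, would then complete the simulated fMRI volumes.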
In particular, these simulations would be useful for studying generalization-error optimization.
For single-network simulation, generalization error is optimized when 1 principal component is
used (regardless of the correlation of the network). In real data from the stroke study, we have
observed that the optimum number of PCs is always greater than one, and, more importantly, this
number correlates with behavioural post-stroke recovery. This suggests that this measure could
be sensitive to the post-stroke reorganization of cortical networks; to test this hypothesis,
multiple-network simulations are needed.
A more complex (and considerably more realistic) framework for simulating fMRI data is
provided by the Virtual Brain project21 (see Jirsa et al., 2010). This framework models the brain
as a dynamical system of interconnected nodes, each node being an ensemble of neurons. The
information about the connectivity between the nodes is taken from the database of anatomical
connections in the macaque cortex. This framework can simulate neuronal activity and the
corresponding BOLD signal; in addition, it can be used to study the impact of brain lesions.
9.2.2 Dimensionality in the aging study set
The data set from the aging study is a natural candidate for studying the relationship between
dimensionality and behaviour. We have shown that PC dimensionality is coupled with post-stroke
recovery (Section 6.1) and with self-control ability (Section 6.2). The aging study data
can be used to answer another question: is intrinsic dimensionality affected by age of the
participant? The original study had three age groups (20-31 years; 56-65 years; 66-85 years); for
the work presented in this thesis, the two older groups were pooled together. It would be
interesting to look at potential difference in dimensionality between these two older groups, to
21 http://thevirtualbrain.org/
see whether it reflects the accelerating differences in brain structure after the age of 65 (Grady et
al., 2010).
This study can also be used to look at the intrinsic dimensionality for different cognitive tasks.
The behavioural tasks range from simple (reaction) to more complex (perceptual and delayed
matching); however, they all require a motor response and use the same visual stimuli. This can
be used to study the link between dimensionality and cognitive load.
9.2.3 Modifications to LD and QD; additional performance metrics
We have not investigated some potentially interesting modifications to our classifiers. In
particular, our implementation of QD uses the same number of PCs for both classes. QD models
the two within-class covariance matrices separately, rather than pooling them together the way it
is done in LD. This gives QD the advantage over LD when the two covariance matrices are
indeed significantly different. However, our current implementation of QD does not consider the
possibility that the intrinsic dimensionality of the two classes might be different, and, therefore,
the two within-class covariance matrices might be better approximated using different numbers of
PCs. Fitting the number of PCs separately for each class is an optimization on a 2D grid; it
should be no more computationally intensive than our current implementation because
computation of QD is dominated by singular value decomposition of the data matrix. After SVD
is computed, the grid optimization should be almost as fast as our current one-dimensional
optimization. For a data matrix composed of J voxels and N volumes, one-dimensional
optimization has the computational complexity of O(N), and grid optimization has the
complexity of O(N²); since J>>N, both optimization methods are much faster than SVD, which
has computational complexity of O(JN²).
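The per-class dimensionality proposed here can be prototyped directly. The sketch below uses toy data with equal class means but different covariances; the PPCA-style isotropic floor over the discarded eigenvalues and the candidate k values are assumptions, not the thesis's implementation. It fits a separate Gaussian per class and grid-searches the pair (k0, k1) on a held-out half:

```python
import numpy as np

rng = np.random.default_rng(3)

# Equal means, different covariances: class 1 has a correlated,
# higher-variance block of 10 "voxels" (a setting that favours QD).
n, p = 200, 30
block = np.full((10, 10), 0.9 * 3.0)
np.fill_diagonal(block, 3.0)
cov1 = np.eye(p)
cov1[:10, :10] = block
X = np.vstack([rng.standard_normal((n, p)),
               rng.multivariate_normal(np.zeros(p), cov1, size=n)])
y = np.repeat([0, 1], n)

def fit_class(Xc, k):
    """Gaussian with a rank-k covariance model: keep the top-k eigenpairs,
    average the remaining eigenvalues into an isotropic floor."""
    mu = Xc.mean(axis=0)
    lam, V = np.linalg.eigh(np.cov(Xc, rowvar=False))
    lam, V = lam[::-1], V[:, ::-1]               # descending order
    sigma2 = max(lam[k:].mean(), 1e-6)
    return mu, V[:, :k], lam[:k], sigma2

def log_density(Xt, model, p):
    mu, Vk, lam_k, sigma2 = model
    D = Xt - mu
    Z = D @ Vk                                    # retained-subspace coordinates
    resid = (D ** 2).sum(axis=1) - (Z ** 2).sum(axis=1)
    logdet = np.log(lam_k).sum() + (p - len(lam_k)) * np.log(sigma2)
    return -0.5 * (logdet + (Z ** 2 / lam_k).sum(axis=1) + resid / sigma2)

# Train on one half, grid-search (k0, k1) on the other half.
tr = np.r_[0:100, 200:300]
te = np.r_[100:200, 300:400]
best_acc, best_k = 0.0, None
for k0 in (1, 5, 10, 15):
    for k1 in (1, 5, 10, 15):
        m0 = fit_class(X[tr][y[tr] == 0], k0)
        m1 = fit_class(X[tr][y[tr] == 1], k1)
        pred = (log_density(X[te], m1, p) > log_density(X[te], m0, p))
        acc = (pred.astype(int) == y[te]).mean()
        if acc > best_acc:
            best_acc, best_k = acc, (k0, k1)
```

As in the complexity argument above, once the per-class eigendecompositions are computed, scanning the (k0, k1) grid only re-weights eigenpairs, so the decomposition dominates the cost.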
It is also possible to combine LD and QD into a single classifier, as proposed by Friedman
(1989), who calls this approach "regularized discriminant analysis". QD uses class-specific
covariance matrices, and LD pools them together; regularized discriminant analysis uses a linear
combination of class-specific and pooled covariance matrices. The linear weights must be tuned
for each class separately, which involves a grid optimization. For this approach, we do not expect
a large increase in computational intensity, again because computation is dominated by SVD.
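The covariance interpolation at the heart of this approach is simple to write down. The sketch below shows a simplified form of Friedman's blend (his per-class tuning of the mixing weight and the additional shrinkage toward a scaled identity are omitted; the toy data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
X0 = rng.standard_normal((60, 5))            # class 0: unit variance
X1 = 2.0 * rng.standard_normal((60, 5))      # class 1: inflated variance

S0, S1 = np.cov(X0, rowvar=False), np.cov(X1, rowvar=False)
n0, n1 = len(X0), len(X1)
S_pooled = ((n0 - 1) * S0 + (n1 - 1) * S1) / (n0 + n1 - 2)

def rda_cov(S_class, S_pooled, alpha):
    """Regularized discriminant analysis covariance: alpha = 0 recovers the
    pooled covariance (LD), alpha = 1 the class-specific one (QD)."""
    return alpha * S_class + (1.0 - alpha) * S_pooled
```

Intermediate values of alpha interpolate between the two models, which is what the grid optimization over per-class weights would tune.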
In our evaluation, we use two metrics of performance: classification accuracy and
reproducibility. It would be interesting to see whether our results are replicated for some
additional metrics of spatial pattern similarity, such as mutual information (Bell & Sejnowski,
1995; Afshinpour et al., 2011). For probabilistic classifiers, we could look at prediction accuracy
(which is subtly different from classification accuracy; see Section 2.1.3) as a metric of
performance as well as a cost function for regularization tuning. Also, in the aging study data, we
have looked at across-subject reproducibility, but not at across-subject classification; that is, how
well we can classify one subject's volumes using a model trained on another
subject, which would be the natural implementation of the more common group analysis of
combined subjects.
References
Abdi, H., O'Toole, A. J., Valentin, D., & Edelman, B. (2005). DISTATIS: The analysis of multiple distance matrices. In Proceedings of the IEEE Computer Society: International Conference on Computer Vision and Pattern Recognition (42-47).
Abdi, H., Dunlop, J. P., & Williams, L. J. (2009). How to compute reliability estimates and display confidence and tolerance intervals for pattern classifiers using the Bootstrap and 3-way multidimensional scaling (DISTATIS). NeuroImage, 45(1), 89-95.
Abdi, H. (2010). The Greenhouse-Geisser Correction. Encyclopedia of Research Design. Thousand Oaks: Sage.
Abou-Elseoud, A., Starck, T., Remes, J., Nikkinen, J., Tervonen, O., & Kiviniemi, V. (2010). The effect of model order selection in group PICA. Human Brain Mapping, 31(8), 1207-1216.
Afshin-Pour, B., Soltanian-Zadeh, H., Hossein-Zadeh, G. A., Grady, C. L., & Strother, S. C. (2010). A mutual information-based metric for evaluation of fMRI data-processing approaches. Human Brain Mapping, 32(5), 699-715.
Aguirre, G. K., Zarahn, E., & D'esposito, M. (1998). The variability of human, BOLD hemodynamic responses. NeuroImage, 8(4), 360-369.
Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6), 716-723.
Attwell, D., Buchan, A. M., Charpak, S., Lauritzen, M., Macvicar, B. A., & Newman, E. A. (2010). Glial and neuronal control of brain blood flow. Nature, 468, 232-243.
Bandettini, P. (2007). Functional MRI today. International Journal of Psychophysiology, 63(2), 138-145.
Beckmann, C. F., & Smith, S. M. (2004). Probabilistic independent component analysis for functional magnetic resonance imaging. IEEE Transactions on Medical Imaging, 23, 137-152.
Beckmann, C. F., DeLuca, M., Devlin, J. T., & Smith, S. M. (2005). Investigations into resting-state connectivity using independent component analysis. Philosophical Transactions of the Royal Society B: Biological Sciences, 360(1457), 1001-1013.
Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural computation, 7(6), 1129-1159.
Berman, M., Yourganov, G., Askren, M. K., Ayduk, O., Casey, B. J., Gotlib, I. H., Kross, E., McIntosh, A.R., Strother, S.C., Wilson, N.L., Zayas, V., Mischel, W., Shoda, Y., & Jonides, J. (2013). Dimensionality of brain networks linked to life-long individual differences in self-control. Nature Communications, Article # 1373 (doi:10.1038/ncomms2374)
Biehl, M., & Mietzner, A. (1994). Statistical mechanics of unsupervised structure recognition. Journal of Physics A: Mathematical and General, 27, 1885-1897.
Bishop, C. M. (2006). Pattern Recognition and Machine Learning (Information Science and Statistics). Secaucus, NJ, USA: Springer-Verlag New York, Inc.
Bodurka, J., Ye, F., Petridou, N., Murphy, K., & Bandettini, P. A. (2007). Mapping the MRI voxel volume in which thermal noise matches physiological noise-implications for fMRI. NeuroImage, 34(2), 542-549.
Buckner, R. L., Bandettini, P. A., O'Craven, K. M., Savoy, R. L., Petersen, S. E., Raichle, M. E., et al. (1996). Detection of cortical activation during averaged single trials of a cognitive task using functional magnetic resonance imaging. Proceedings of National Academy of Science U.S.A., 93, 14878-14883.
Burock, M. A., Buckner, R. L., Woldorff, M. G., Rosen, B. R., & Dale, A. M. (1998). Randomized event-related experimental designs allow for extremely rapid presentation rates using functional MRI. Neuroreport, 9, 3735-3739.
Buxton, R. B. (2009). Introduction to functional magnetic resonance imaging: principles and techniques. Cambridge University Press.
Calhoun, V. D., Adali, T., Pearlson, G. D., & Pekar, J. J. (2001). A method for making group inferences from functional MRI data using independent component analysis. Human Brain Mapping, 14(3), 140-151.
Chang, C.-C., & Lin, C.-J. (2011). LIBSVM : a library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3), Article # 27.
Churchill, N. W., Oder, A., Abdi, H., Tam, F., Lee, W., Thomas, C., et al. (2012A). Optimizing preprocessing and analysis pipelines for single-subject fMRI. 1. Standard temporal motion and physiological noise correction methods. Human Brain Mapping, 33(3), 609-627.
Churchill, N. W., Yourganov, G., Oder, A., Tam, F., Graham, S. J., & Strother, S. C. (2012B). Optimizing preprocessing and analysis pipelines for single-subject fMRI. 2. Interactions with ICA, PCA, Task Contrast and Inter-Subject Heterogeneity. PLoS One, 7(2).
Conover, W. J. (1999). Practical nonparametric statistics (3rd ed.). New York, NY: Wiley.
Constable, R. T., Skudlarski, P., & Gore, J. C. (1995). An ROC approach for evaluating functional brain MR imaging and postprocessing protocols. Magnetic Resonance in Medicine, 34, 57-64.
Cooper, P. W. (1963). Statistical classification with quadratic forms. Biometrika, 50(3-4), 439-448.
Cordes, D., Haughton, V., Carew, J. D., Arfanakis, K., & Maravilla, K. (2002). Hierarchical clustering to measure connectivity in fMRI resting-state data. Magnetic Resonance Imaging, 20(4), 305-317.
Cordes, D., & Nandy, R. R. (2006). Estimation of the intrinsic dimensionality of fMRI data. NeuroImage, 29, 145-154.
Cortes, C., & Vapnik, V. N. (1995). Support-Vector Networks. Machine Learning, 20(3), 273-297.
Cox, D. D., & Savoy, R. L. (2003). Functional magnetic resonance imaging (fMRI) "brain reading": detecting and classifying distributed patterns of fMRI activity in human visual cortex. NeuroImage, 19, 261-270.
Cox, R. W. (1996). AFNI: software for analysis and visualization of functional magnetic resonance neuroimages. Computers and Biomedical Research, 29, 162-173.
Cox, R. W., & Jesmanowicz, A. (1999). Real-time 3D image registration for functional MRI. Magnetic Resonance in Medicine, 42(4), 1014-1018.
Culham, J. C. (2006). Functional Neuroimaging: Experimental Design and Analysis. Handbook of Functional Neuroimaging of Cognition (pp. 53-82). Cambridge MA: MIT Press.
Dagli, M. S., Ingeholm, J. E., & Haxby, J. V. (1999). Localization of cardiac-induced signal change in fMRI. NeuroImage, 9, 407-415.
Deco, G., Jirsa, V. K., Robinson, P. A., Breakspear, M., & Friston, K. (2008). The Dynamic Brain: From Spiking Neurons to Neural Masses and Cortical Fields. PLoS Computational Biology, 4(8).
Deco, G., Jirsa, V. K., & McIntosh, A. R. (2011). Emerging concepts for the dynamical organization of resting-state activity in the brain. Nature Reviews Neuroscience 12, 43-56.
Demsar, J. (2006). Statistical Comparisons of Classifiers over Multiple Data Sets. Journal of machine learning research, 7, 1-30.
Eastment, H. T., & Krzanowski, W. J. (1982). Cross-Validatory Choice of the Number of Components from a Principal Component Analysis. Technometrics, 24(1).
Edelstein, W. A., Glover, G. H., Hardy, C. J., & Redington, R. W. (1986). The intrinsic signal-to-noise ratio in NMR imaging. Magnetic Resonance in Medicine, 3(4), 604-618.
Efron, B., & Tibshirani, R. (1993). Introduction to the Bootstrap: Academic Press, San Diego.
Fox, P. T., Mintun, M. A., Reiman, E. M., & Raichle, M. E. (1988). Enhanced detection of focal brain responses using intersubject averaging and change-distribution analysis of subtracted PET images. Journal of Cerebral Blood Flow & Metabolism, 8(5), 642-653.
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, 179-188.
Foerster, B. U., Tomasi, D., & Caparelli, E. C. (2005). Magnetic field shift due to mechanical vibration in functional magnetic resonance imaging. Magnetic Resonance in Medicine, 54(5), 1261-1267.
Fox, P. T., Raichle, M. E., Mintun, M. A., & Dence, C. (1988). Nonoxidative glucose consumption during focal physiologic neural activity. Science, 241, 462-464.
Fox, M. D., Snyder, A. Z., Vincent, J. L., Corbetta, M., Van Essen, D. C., & Raichle, M. E. (2005). The human brain is intrinsically organized into dynamic, anticorrelated functional networks. Proceedings of the National Academy of Sciences, 102(27), 9673-9678.
Freund, J. E. (1992). Mathematical Statistics: Prentice Hall, New Jersey
Friedman, J. H. (1989). Regularized Discriminant Analysis. Journal of the American Statistical Association, 84(405), 165-175.
Friston, K. J., Frith, C. D., Liddle, P. F., & Frackowiak, R. S. (1991). Comparing functional (PET) images: the assessment of significant change. Journal of Cerebral Blood Flow and Metabolism, 11, 690-699.
Friston, K. J., Frith, C., Turner, R., & Frackowiak, R. S. J. (1995A). Characterizing evoked hemodynamics with fMRI. NeuroImage, 2, 157-165.
Friston, K. J., Holmes, A. P., Worsley, K. J., Poline, J. B., Frith, C., & Frackowiak, R. S. J. (1995B). Statistical Parametric Maps in Functional Imaging: A General Linear Approach. Human Brain Mapping, 2, 189-210.
Friston, K. J., Poline, J. B., Holmes, A. P., Frith, C. D., & Frackowiak, R. S. (1996). A multivariate analysis of PET activation studies. Human Brain Mapping, 4(2), 140-151.
Friston, K. J., Fletcher, P., Josephs, O., Holmes, A., Rugg, M. D., & Turner, R. (1998). Event-Related fMRI: Characterizing Differential Responses. NeuroImage, 7, 30-40.
Friston, K., Phillips, J., Chawla, D., & Buchel, C. (2000). Nonlinear PCA: characterizing interactions between modes of brain activity. Philosophical Transactions of the Royal Society of London, B., Biological Sciences, 355, 135-146.
Garrett, D. D., Kovacevic, N., McIntosh, A. R., & Grady, C. L. (2012). The modulation of BOLD variability between cognitive states varies by age and processing speed. Cerebral Cortex [doi: 10.1093/cercor/bhs055].
Genovese, C. R., Lazar, N. A., & Nichols, T. (2002). Thresholding of statistical maps in functional neuroimaging using the false discovery rate. NeuroImage, 15(4), 870-878.
Glover, G. H. (1999). Deconvolution of impulse response in event-related BOLD fMRI. NeuroImage, 9, 416-429.
Glover, G. H., Li, T. Q., & Ress, D. (2000). Image-based method for retrospective correction of physiological motion effects in fMRI: RETROICOR. Magnetic Resonance in Medicine, 44, 162-167.
Grabowski, T. J., Frank, R. J., Brown, C. K., Damasio, H., Ponto, L. L., Watkins, G. L., et al. (1996). Reliability of PET activation across statistical methods, subject groups, and sample sizes. Human Brain Mapping, 4, 23-46.
Grady, C. L., Springer, M. V., Hongwanishkul, D., Mcintosh, A. R., & Winocur, G. (2006). Age-related changes in brain activity across the adult lifespan. Journal of Cognitive Neuroscience, 18(2), 227-241.
Grady, C. L. (2008). Compensatory reorganization of brain networks in older adults. In W. Jagust & M. D'Esposito (Eds.), Imaging the Aging Brain (pp. 105-114). New York, NY: Oxford University Press.
Grady, C. L., Protzner, A. B., Kovacevic, N., Strother, S. C., Afshin-Pour, B., Wojtowicz, M., Anderson, J. A. E., Churchill, N., & McIntosh, A. R. (2010). A multivariate analysis of age-related differences in default mode and task-positive networks across multiple cognitive domains. Cerebral Cortex, 20, 1432-1447.
Greicius, M. D., Krasnow, B., Reiss, A. L., & Menon, V. (2003). Functional connectivity in the resting brain: a network analysis of the default mode hypothesis. Proceedings of the National Academy of Sciences, 100(1), 253-258.
Hansen, L. K., & Larsen, J. (1996). Unsupervised Learning and Generalization, Proceedings of the IEEE International Conference on Neural Networks 1996, Washington DC (pp. 25-30).
Hansen, L. K., Larsen, J., Nielsen, F. A., Strother, S. C., Rostrup, E., Savoy, R., Lange, N., Sidtis, J., Svarer, C., & Paulson, O. B. (1999). Generalizable Patterns in Neuroimaging: How Many Principal Components? NeuroImage, 9, 534-544.
Harel, N., Ugurbil, K., Uludag, K., & Yacoub, E. (2006). Frontiers of brain mapping using MRI. Journal of Magnetic Resonance Imaging, 23(6), 945-957.
Harrison, R. V., Harel, N., Panesar, J., & Mount, R. J. (2002). Blood capillary distribution correlates with hemodynamic-based functional imaging in cerebral cortex. Cerebral Cortex, 12, 225-233.
Hastie, T., Tibshirani, R., & Friedman, J. H. (2009). The elements of statistical learning: data mining, inference, and prediction: Springer.
Haxby, J. V., Gobbini, M. I., Furey, M. L., Ishai, A., Schouten, J. L., & Pietrini, P. (2001). Distributed and Overlapping Representations of Faces and Objects in Ventral Temporal Cortex. Science, 293, 2425-2430.
Haynes, J. D., & Rees, G. (2006). Decoding mental states from brain activity in humans. Nature Revues Neuroscience, 7, 523-534.
Hlinka, J., Palus, M., Vejmelka, M., Mantini, D., & Corbetta, M. (2011). Functional connectivity in resting-state fMRI: is linear correlation sufficient? NeuroImage, 54(3), 2218-2225.
Hodges, J. L. (1955). Discriminatory Analysis I. Survey: USAF School of Aviation Medicine, Randolph AFB, Texas.
Hoyle, D. C., & Rattray, M. (2004). Principal-component-analysis eigenvalue spectra from data with symmetry-breaking structure. Physical Review E, 69, 026124.
Hoyle, D. C., & Rattray, M. (2007). Statistical mechanics of learning multiple orthogonal signals: asymptotic theory and fluctuation effects. Physical Review E, 75, 016101.
Huettel, S. A., Song, A. W., & McCarthy, G. (2004). Functional Magnetic Resonance Imaging: Sinauer Associates.
Jenkinson, M., Bannister, P. R., Brady, J. M., & Smith, S. M. (2002). Improved optimisation for the robust and accurate linear registration and motion correction of brain images. NeuroImage, 17(2), 825-841.
Jezzard, P., & Clare, S. (1999). Sources of distortion in functional MRI data. Human Brain Mapping, 8, 80-85.
Jirsa, V., Sporns, O., Breakspear, M., Deco, G., & McIntosh, A. R. (2010). Towards the virtual brain: network modeling of the intact and the damaged brain. Archives italiennes de biologie, 148(3), 189-205.
Kamitani, Y., & Tong, F. (2005). Decoding the visual and subjective contents of the human brain. Nature Neuroscience, 8, 679-685.
Kiebel, S. & Holmes, A. P. (2003). The general linear model. Human Brain Function, 2, 725-760.
Kjems, U., Hansen, L. K., Anderson, J., Frutiger, S., Muley, S., Sidtis, J., Rottenberg, D., & Strother, S. C. (2002). The quantitative evaluation of functional neuroimaging experiments: mutual information learning curves. NeuroImage, 15, 772-786.
Kleinschmidt, A. (2007) Different analysis solutions for different spatial resolutions? Moving towards a mesoscopic mapping of functional architecture in the human brain. NeuroImage, 38, 663–665
Kriegeskorte, N., & Bandettini, P. (2007). Analyzing for information, not activation, to exploit high-resolution fMRI. NeuroImage, 38, 649-662.
Krishnan, A., Williams, L. J., McIntosh, A. R., & Abdi, H. (2011). Partial Least Squares (PLS) Methods for Neuroimaging: a Tutorial and Review. NeuroImage, 56, 455-475.
Kruger, G., & Glover, G. H. (2001). Physiological noise in oxygenation-sensitive magnetic resonance imaging. Magnetic Resonance in Medicine, 46, 631-637.
Krzanowski, W. J., & Kline, P. (1995). Cross-validation for choosing the number of important components in principal component analysis. Multivariate Behavioral Research, 30(2), 149-165.
Ku, S.-P., Gretton, A., Macke, J., & Logothetis, N. K. (2008). Comparison of pattern recognition methods in classifying high-resolution BOLD signals obtained at high magnetic field in monkeys. Magnetic Resonance Imaging, 26, 1007-1014.
Kustra, R., & Strother, S. (2001). Penalized discriminant analysis of [15O]-water PET brain images with prediction error selection of smoothness and regularization hyperparameters. IEEE Transactions on Medical Imaging, 20, 376-387.
Kwong, K. K., Belliveau, J. W., Chesler, D. A., Goldberg, I. E., Weisskoff, R. M., Poncelet, B. P., et al. (1992). Dynamic magnetic resonance imaging of human brain activity during primary sensory stimulation. Proceedings of the National Academy of Sciences, 89, 5675-5679.
LaConte, S., Anderson, J., Muley, S., Ashe, J., Frutiger, S., Rehm, K., et al. (2003). The evaluation of preprocessing choices in single-subject BOLD fMRI using NPAIRS performance metrics. NeuroImage, 18, 10-27.
LaConte, S., Strother, S., Cherkassky, V., Anderson, J., & Hu, X. (2005). Support vector machines for temporal classification of block design fMRI data. NeuroImage, 26, 317-329.
Lange, N., Strother, S. C., Anderson, J. R., Nielsen, F. Å., Holmes, A. P., Kolenda, T., Savoy, R., & Hansen, L. K. (1999). Plurality and resemblance in fMRI data analysis. NeuroImage, 10, 282-303.
Le, T. H., & Hu, X. (1997). Methods for assessing accuracy and reliability in functional MRI. NMR in Biomedicine, 10(4-5), 160-164.
Li, Y.-O., Adali, T., & Calhoun, V. D. (2007). Estimating the number of independent components for functional magnetic resonance imaging data. Human Brain Mapping, 28(11), 1251-1266.
Logothetis, N. K., Pauls, J., Augath, M., Trinath, T., & Oeltermann, A. (2001). Neurophysiological investigation of the basis of the fMRI signal. Nature, 412, 150-157.
Lukic, A. S., Wernick, M. N., & Strother, S. C. (2002). An evaluation of methods for detecting brain activations from functional neuroimages. Artificial Intelligence in Medicine, 25, 69-88.
Lund, T. E., Madsen, K. H., Sidaros, K., Luo, W. L., & Nichols, T. E. (2006). Non-white noise in fMRI: does modelling have an impact? NeuroImage, 29, 54-66.
Lusted, L. B. (1968). Introduction to Medical Decision Making: Charles C. Thomas, Springfield, IL.
Maitra, R. (2010). A re-defined and generalized percent-overlap-of-activation measure for studies of fMRI reproducibility and its use in identifying outlier activation maps. NeuroImage, 50(1), 124-135.
Mardia, K. V., Kent, J. T., & Bibby, J. M. (1979). Multivariate analysis: Academic Press.
Mason, S. J., & Graham, N. E. (2002). Areas beneath the relative operating characteristics (ROC) and relative operating levels (ROL) curves: Statistical significance and interpretation. Quarterly Journal of the Royal Meteorological Society, 128, 2145-2166.
Mattay, V. S., Fera, F., Tessitore, A., Hariri, A. R., Berman, K. F., Das, S., Meyer-Lindenberg, A., Goldberg, T. E., Callicott, J. H., & Weinberger, D. R. (2006). Neurophysiological correlates of age-related changes in working memory capacity. Neuroscience letters, 392(1-2), 32-37.
McIntosh, A. R., Bookstein, F. L., Haxby, J. V., & Grady, C. L. (1996). Spatial pattern analysis of functional brain images using partial least squares. NeuroImage, 3, 143-157.
McIntosh, A. R., & Lobaugh, N. J. (2004). Partial least squares analysis of neuroimaging data: applications and advances. NeuroImage, 23 Suppl 1, S250-263.
McIntosh, A. R., Kovacevic, N., & Itier, R. J. (2008). Increased Brain Signal Variability Accompanies Lower Behavioral Variability in Development. PLoS Computational Biology, 4(7), e1000106.
McIntosh, A. R., Kovacevic, N., Lippe, S., Garrett, D., Grady, C., & Jirsa, V. (2010). The development of a noisy brain. Archives Italiennes de Biologie, 148(3), 323-337.
Meinshausen, N., & Bühlmann, P. (2010). Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4), 417-473.
Metz, C. E. (1986). ROC Methodology in Radiologic Imaging. Investigative Radiology, 21, 720-733.
Metz, C. E., Herman, B. A., & Shen, J.-H. (1998). Maximum Likelihood Estimation of Receiver Operating Characteristic (ROC) Curves from Continuously-Distributed Data. Statistics in Medicine, 17, 1033-1053.
Minka, T. P. (2000). Automatic choice of dimensionality for PCA: MIT Media Laboratory Perceptual Computing Section.
Minsky, M., & Papert, S. (1969). Perceptrons. Oxford, England: M.I.T. Press.
Misaki, M., Kim, Y., Bandettini, P. A., & Kriegeskorte, N. (2010). Comparison of multivariate classifiers and response normalizations for pattern-information fMRI. NeuroImage, 53(1), 103-118.
Mischel, W., Ayduk, O., Berman, M. G., Casey, B. J., Gotlib, I. H., Jonides, J., Kross, E., Teslovich, T., Wilson, N. L., Zayas, V., & Shoda, Y. (2011). 'Willpower' over the life span: decomposing self-regulation. Social Cognitive and Affective Neuroscience, 6(2), 252-256.
Mitchell, T. M., Hutchinson, R., Niculescu, R. S., Pereira, F., Wang, X., Just, M., & Newman, S. (2004). Learning to Decode Cognitive States from Brain Images. Machine Learning, 57(1-2), 145-175.
Morch, N., Hansen, L. K., Strother, S. C., Svarer, C., Rottenberg, D. A., Lautrup, B., Savoy, R., & Paulson, O. (1997). Nonlinear versus linear models in functional neuroimaging: Learning curves and generalization crossover, Information Processing in Medical Imaging (Vol. 1230, pp. 259-270). New York: Springer-Verlag.
Mourao-Miranda, J., Bokde, A. L., Born, C., Hampel, H., & Stetter, M. (2005). Classifying brain states and determining the discriminating activation patterns: Support Vector Machine on functional MRI data. NeuroImage, 28, 980-995.
Mukamel, R., Gelbard, H., Arieli, A., Hasson, U., Fried, I., & Malach, R. (2005). Coupling between neuronal firing, field potentials, and FMRI in human auditory cortex. Science, 309(5736), 951-954.
Nandy, R. R., & Cordes, D. (2004). New approaches to receiver operating characteristic methods in functional magnetic resonance imaging with real data using repeated trials. Magnetic Resonance in Medicine, 52(6), 1424-1431.
Ng, A. Y., & Jordan, M. I. (2002). On Discriminative vs. Generative classifiers: A comparison of logistic regression and naive Bayes. Neural Information Processing Systems, 14.
Norman, K. A., Polyn, S. M., Detre, G. J., & Haxby, J. V. (2006). Beyond mind-reading: multi-voxel pattern analysis of fMRI data. TRENDS in Cognitive Sciences, 10(9).
Ogawa, S., Lee, T. M., Nayak, A. S., & Glynn, P. (1990). Oxygenation-sensitive contrast in magnetic resonance image of rodent brain at high magnetic fields. Magnetic Resonance in Medicine, 14, 68-78.
Ogawa, S., Tank, D. W., Menon, R., Ellermann, J. M., Kim, S. G., Merkle, H., & Ugurbil, K. (1992). Intrinsic signal changes accompanying sensory stimulation: functional brain mapping with magnetic resonance imaging. Proceedings of the National Academy of Sciences, 89(13), 5951-5955.
Pan, X., & Metz, C. E. (1997). The "Proper" Binormal Model: Parametric Receiver Operating Characteristic Curve Estimation with Degenerate Data. Academic Radiology, 4(5), 380-389.
Pereira, F., Mitchell, T., & Botvinick, M. (2009). Machine learning classifiers and fMRI: a tutorial overview. NeuroImage, 45, S199-S209.
Peres-Neto, P. R., Jackson, D. A., & Somers, K. M. (2005). How many principal components? stopping rules for determining the number of non-trivial axes revisited. Computational Statistics and Data Analysis, 49(4), 974-997.
Perlbarg, V., Bellec, P., Anton, J. L., Pelegrini-Issac, M., Doyon, J., & Benali, H. (2007). CORSICA: correction of structured noise in fMRI by automatic identification of ICA components. Magnetic Resonance Imaging, 25(1), 35-46.
Power, J. D., Barnes, K. A., Snyder, A. Z., Schlaggar, B. L., & Petersen, S. E. (2012). Spurious but systematic correlations in functional connectivity MRI networks arise from subject motion. Neuroimage, 59(3), 2142-2154.
Purdon, P. L., & Weisskoff, R. M. (1998). Effect of temporal autocorrelation due to physiological noise and stimulus paradigm on voxel-level false-positive rates in fMRI. Human Brain Mapping, 6(4), 239-249.
Raemaekers, M., Vink, M., Zandbelt, B., van Wezel, R. J., Kahn, R. S., & Ramsey, N. F. (2007). Test-retest reliability of fMRI activation during prosaccades and antisaccades. NeuroImage, 36, 532-542.
Raj, D., Anderson, A. W., & Gore, J. C. (2001). Respiratory effects in human functional magnetic resonance imaging due to bulk susceptibility changes. Physics in Medicine and Biology, 46, 3331-3340.
Rasmussen, P. M., Madsen, K. H., Lund, T. E., & Hansen, L. K. (2011). Visualization of nonlinear kernel models in neuroimaging by sensitivity maps. NeuroImage, 55(3), 1120-1131.
Rasmussen, P. M., Schmah, T., Madsen, K. H., Lund, T. E., Yourganov, G., Strother, S., & Hansen, L. K. (2012A). Visualization of nonlinear classification models in neuroimaging - signed sensitivity maps. Paper presented at the Biosignals 2012, International Conference on Bio-inspired Systems and Signal Processing.
Rasmussen, P. M., Hansen, L. K., Madsen, K. H., Churchill, N. W., & Strother, S. C. (2012B). Model sparsity and brain pattern interpretation of classification models in neuroimaging. Pattern Recognition, 45(6), 2085-2100.
Reimann, P., Van Den Broeck, C., & Bex, G. J. (1996). A Gaussian scenario for unsupervised learning. Journal of Physics A: Mathematical and General, 29, 3521-3535.
Reyment, R. A., & Joreskog, K. G. (1996). Applied Factor Analysis in the Natural Sciences: Cambridge University Press.
Rissanen, J. (1978). Modeling By Shortest Data Description. Automatica, 14, 465-471.
Rombouts, S. A., Barkhof, F., Hoogenraad, F. G., Sprenger, M., & Scheltens, P. (1998). Within-subject reproducibility of visual activation patterns with functional magnetic resonance imaging using multislice echo planar imaging. Magnetic Resonance Imaging, 16, 105-113.
Rosenblatt, F. (1958). The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65(6), 386-408.
Schmah, T., Yourganov, G., Zemel, R. S., Hinton, G. E., Small, S. L., & Strother, S. C. (2010). Comparing classification methods for longitudinal fMRI studies. Neural Computation, 22(11), 2729-2762.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461-464.
Seber, G. A. F. (2004). Multivariate observations: Wiley-Interscience.
Shaw, M. E., Strother, S. C., Gavrilescu, M., Podzebenko, K., Waites, A., Watson, J., et al. (2003). Evaluating subject specific preprocessing choices in multisubject fMRI data sets using data-driven performance metrics. NeuroImage, 19, 988-1001.
Skudlarski, P., Constable, T. R., & Gore, J. C. (1999). ROC Analysis of Statistical Methods Used in Functional MRI: Individual Subjects. NeuroImage, 9(3), 311-329.
Small, S. L., Hlustik, P., Noll, D. C., Genovese, C., & Solodkin, A. (2002). Cerebellar hemispheric activation ipsilateral to the paretic hand correlates with functional recovery after stroke. Brain, 125(7), 1544-1557.
Smith, C. A. B. (1947). Some Examples of Discrimination. Annals of Eugenics, 13, 272-282.
Spreng, R. N., & Grady, C. L. (2010). Patterns of brain activity supporting autobiographical memory, prospection, and theory of mind, and their relationship to the default mode network. Journal of Cognitive Neuroscience, 22(6), 1112-1123.
Stoica, P., & Selen, Y. (2004). Model-order selection: a review of information criterion rules. IEEE Signal Processing Magazine, 21(4), 36-47.
Strother, S. C., Lange, N., Anderson, J. R., Schaper, K. A., Rehm, K., Hansen, L. K., et al. (1997). Activation pattern reproducibility: measuring the effects of group size and data analysis models. Human Brain Mapping, 5, 312-316.
Strother, S. C., Anderson, J., Hansen, L. K., Kjems, U., Kustra, R., Sidtis, J., et al. (2002). The quantitative evaluation of functional neuroimaging experiments: the NPAIRS data analysis framework. NeuroImage, 15, 747-771.
Strother, S., LaConte, S., Hansen, L. K., Anderson, J., Zhang, J., Pulapura, S., et al. (2004). Optimizing the fMRI data-processing pipeline using prediction and reproducibility performance metrics: I. A preliminary group analysis. NeuroImage, 23 Suppl 1, 196-207.
Strother, S. C., Oder, A., Spring, R., & Grady, C. (2010). The NPAIRS Computational Statistics Framework for Data Analysis in Neuroimaging. Paper presented at the 19th International Conference on Computational Statistics: Keynote, Invited and Contributed Papers.
Swets, J. A. (1988). Measuring the accuracy of diagnostic systems. Science, 240(4857), 1285-1293.
Sychra, J. J., Bandettini, P. A., Bhattacharya, N., & Lin, Q. (1994). Synthetic images by subspace transforms I. Principal components images and related filters. Medical Physics, 21, 193.
Tegeler, C., Strother, S. C., Anderson, J. R., & Kim, S. G. (1999). Reproducibility of BOLD-based functional MRI obtained at 4 T. Human Brain Mapping, 7, 267-283.
Thulborn, K. R., Waterton, J. C., Matthews, P. M., & Radda, G. K. (1982). Oxygenation dependence of the transverse relaxation time of water protons in whole blood at high field. Biochim. Biophys. Acta, 714, 265-270.
Tipping, M. E., & Bishop, C. M. (1999). Probabilistic Principal Component Analysis. Journal of the Royal Statistical Society, Series B, 61, 611-622.
Toro, R., Fox, P. T., & Paus, T. (2008). Functional coactivation map of the human brain. Cerebral Cortex, 18(11), 2553-2559.
Ulfarsson, M. O., & Solo, V. (2008). Dimension Estimation in Noisy PCA With SURE and Random Matrix Theory. IEEE Transactions on Signal Processing, 56(12), 5804-5816.
Vapnik, V. N. (1995). The nature of statistical learning theory. New York, NY, USA: Springer-Verlag New York, Inc.
Watkin, T. L. H., & Nadal, J.-P. (1994). Optimal Unsupervised Learning. Journal of Physics A: Mathematical and General, 27, 1899-1915.
Wax, M., & Kailath, T. (1985). Detection of signals by information theoretic criteria. IEEE Transactions on Acoustics, Speech and Signal Processing, 33(2), 387-392.
Wise, R. G., Ide, K., Poulin, M. J., & Tracey, I. (2004). Resting fluctuations in arterial carbon dioxide induce significant low frequency variations in BOLD signal. NeuroImage, 21(4), 1652-1664.
Wold, S. (1978). Cross-validatory estimation of the number of components in factor and principal component models. Technometrics, 20(4), 397-405.
Woods, R. P., Grafton, S. T., Holmes, C. J., Cherry, S. R., & Mazziotta, J. C. (1998). Automated image registration: I. General methods and intrasubject, intramodality validation. Journal of computer assisted tomography, 22, 139-152.
Worsley, K. J. (2001). Statistical Analysis of Activation Images. Functional MRI: Introduction to Methods, pp. 251-270.
Worsley, K. J., & Friston, K. J. (1995). Analysis of fMRI time-series revisited-again. NeuroImage, 2, 173-181.
Yamamoto, Y., Ihara, M., Tham, C., Low, R. W. C., Slade, J. Y., Moss, T., et al. (2009). Neuropathological Correlates of Temporal Pole White Matter Hyperintensities in CADASIL. Stroke, 40(6), 2004-2011.
Yourganov, G., Schmah, T., Small, S. L., Rasmussen, P. M., & Strother, S. C. (2010). Functional connectivity metrics during stroke recovery. Archives Italiennes de Biologie, 148(3), 259-270.
Yourganov, G., Chen, X., Lukic, A., Grady, C., Small, S., Wernick, M., et al. (2011). Dimensionality Estimation for Optimal Detection of Functional Networks in BOLD fMRI Data. NeuroImage, 56(2), 531-543.
Zarahn, E., Aguirre, G., & D'Esposito, M. (1997A). A trial-based experimental design for fMRI. NeuroImage, 6, 122-138.
Zarahn, E., Aguirre, G. K., & D'Esposito, M. (1997B). Empirical analyses of BOLD fMRI statistics. I. Spatially unsmoothed data collected under null-hypothesis conditions. NeuroImage, 5, 179-197.
Zhang, J., Liang, L., Anderson, J. R., Gatewood, L., Rottenberg, D. A., & Strother, S. C. (2008). A Java-based fMRI Processing Pipeline Evaluation System for Assessment of Univariate General Linear Model and Multivariate Canonical Variate Analysis-based Pipelines. Neuroinformatics, 6, 123-134.
Appendix
Works with significant contribution from the author

1 Peer-reviewed publications

Grigori Yourganov, Tanya Schmah, Nathan W. Churchill, Marc G. Berman, Cheryl L. Grady, Stephen C. Strother, "Evaluation of classifiers and their spatial maps with simulated and experimental fMRI data". In preparation.

Nathan W. Churchill, Grigori Yourganov, Stephen C. Strother, "Comparing Classification and Regularization Methods in fMRI for Large and Small Sample Sizes". In preparation.

Tanya Schmah, Stephen C. Strother, Grigori Yourganov, Nathan W. Churchill, Richard S. Zemel, Steven L. Small, "Complexity of functional connectivity predicts recovery from aphasia". In preparation.

Marc G. Berman, Grigori Yourganov, Mary K. Askren, Ozlem Ayduk, B. J. Casey, Ian H. Gotlib, Ethan Kross, Anthony R. McIntosh, Stephen Strother, Nicole L. Wilson, Vivian Zayas, Walter Mischel, Yuichi Shoda, John Jonides, "Dimensionality of brain networks linked to life-long individual differences in self-control". Nature Communications, Article 1373 (doi:10.1038/ncomms2374).

Nathan W. Churchill, Grigori Yourganov, Anita Oder, Fred Tam, Simon J. Graham, Stephen C. Strother, "Optimizing preprocessing and analysis pipelines for single-subject fMRI: 2. Interactions with ICA, PCA, task contrast and inter-subject heterogeneity". PLOS One, 7(2):e31147, 2012.

Nathan W. Churchill, Grigori Yourganov, Robyn Spring, Peter M. Rasmussen, Wayne Lee, Jon E. Ween, Stephen C. Strother, "PHYCAA: Data-Driven Measurement and Removal of Physiological Noise in BOLD fMRI". NeuroImage, 59(2), 2011.

Grigori Yourganov, Xu Chen, Ana S. Lukic, Cheryl L. Grady, Steven L. Small, Miles N. Wernick, Stephen C. Strother, "Dimensionality Estimation for Optimal Detection of Functional Networks in BOLD fMRI Data". NeuroImage, 56(2), 2011.

Grigori Yourganov, Tanya Schmah, Steven L. Small, Peter M. Rasmussen, Stephen C. Strother, "Functional connectivity metrics during stroke recovery". Archives Italiennes de Biologie, 148(3), 2010.

Tanya Schmah, Grigori Yourganov, Richard S. Zemel, Geoffrey E. Hinton, Steven L. Small, Stephen C. Strother, "Comparing Classification Methods for Longitudinal fMRI Studies". Neural Computation, 22, 2010.

2 Conference presentations

Tanya Schmah, E. Susan Duncan, Grigori Yourganov, Richard S. Zemel, Steven L. Small, Stephen C. Strother, "Predicting Language Recovery after Stroke Using Variability of Performance and Complexity of Functional Connectivity". Poster presented at the 23rd Annual Rotman Neuroscience Conference, "Brain Plasticity & Neurorehabilitation", March 2013.

Grigori Yourganov, Xu Chen, Stephen C. Strother, "Detection of functional networks in fMRI data: evaluation of univariate and multivariate approaches". Poster presented at the Annual Meeting of the Organization for Human Brain Mapping, June 2010.

Grigori Yourganov, Ana S. Lukic, Cheryl L. Grady, Miles N. Wernick, Stephen C. Strother, "Optimizing Activation Detection with Better Dimensionality Estimation in BOLD fMRI". Poster presented at the Annual Meeting of the Organization for Human Brain Mapping, June 2009.

Grigori Yourganov, Xu Chen, Ana S. Lukic, Miles N. Wernick, Stephen C. Strother, "The Impact of Dimensionality Estimation on Spatial Signal Detection in Multivariate Gaussian Image Data". Poster presented at the Annual Meeting of the Organization for Human Brain Mapping, June 2008.