PERSONALIZED HYPOTHESIS TESTS FOR DETECTING MEDICATION RESPONSE IN PARKINSON DISEASE PATIENTS USING iPHONE SENSOR DATA

ELIAS CHAIBUB NETO∗, BRIAN M. BOT, THANNEER PERUMAL, LARSSON OMBERG, JUSTIN GUINNEY, MIKE KELLEN, ARNO KLEIN, STEPHEN H. FRIEND, ANDREW D. TRISTER

Sage Bionetworks, 1100 Fairview Avenue North, Seattle, Washington 98109, USA
∗corresponding author e-mail: [email protected]
We propose hypothesis tests for detecting dopaminergic medication response in Parkinson disease patients, using longitudinal sensor data collected by smartphones. The processed data is composed of multiple features extracted from active tapping tasks performed by the participant on a daily basis, before and after medication, over several months. Each extracted feature corresponds to a time series of measurements annotated according to whether the measurement was taken before or after the patient has taken his/her medication. Even though the data is longitudinal in nature, we show that simple hypothesis tests for detecting medication response, which ignore the serial correlation structure of the data, are still statistically valid, showing type I error rates at the nominal level. We propose two distinct personalized testing approaches. In the first, we combine multiple feature-specific tests into a single union-intersection test. In the second, we construct personalized classifiers of the before/after medication labels using all the extracted features of a given participant, and test the null hypothesis that the area under the receiver operating characteristic curve of the classifier is equal to 1/2. We compare the statistical power of the personalized classifier tests and personalized union-intersection tests in a simulation study, and illustrate the performance of the proposed tests using data from the mPower Parkinson's disease study, recently launched as part of Apple's ResearchKit mobile platform. Our results suggest that the personalized tests, which ignore the longitudinal aspect of the data, can perform well in real data analyses, suggesting they might be used as a sound baseline approach to which more sophisticated methods can be compared.
Keywords: personalized medicine, hypothesis tests, sensor data, remote monitoring, Parkinson
1. Introduction
Parkinson disease is a severe neurodegenerative disorder of the central nervous system caused by the death of dopamine-generating cells in the midbrain. The disease has considerable worldwide morbidity and is associated with a substantial decrease in the quality of life of the patients (and their caregivers), decreased life expectancy, and high costs related to care. Early symptoms in the motor domain include shaking, rigidity, slowness of movement, and difficulty walking. Later symptoms include issues with sleeping, thinking and behavioral problems, depression, and finally dementia in the more advanced stages of the disease. Treatments are usually based on levodopa and dopamine agonist medications. Nonetheless, as the disease progresses, these drugs often become less effective, while still causing side effects, including involuntary twisting movements (dyskinesias). Statistical approaches aiming to determine if a given patient responds to medication have key practical importance, as they can help the physician make more informed treatment recommendations for a particular patient.
In this paper we propose personalized hypothesis tests for detecting medication response in Parkinson patients, using longitudinal sensor data collected by iPhones. Remote monitoring of Parkinson patients, based on active tasks delivered by smartphone applications, is an active research field.1 Here we illustrate the application of our personalized tests using sensor data
Pacific Symposium on Biocomputing 2016
collected by the mPower study, recently launched as part of Apple's ResearchKit2,3 mobile platform. The active tests implemented in the mPower app include tapping, voice, memory, posture, and gait tests, although in this paper we focus on the tapping data only. During a tapping test the patient is asked to tap two buttons on the iPhone screen, alternating between two fingers of the same hand, for 20 seconds. Raw sensor data collected during a single test is given by a time series of the screen x-y coordinates of each tap. Processed data corresponds to multiple features extracted from the tapping task, such as the number of taps and the mean inter-tapping interval. Since the active tests are performed by the patient on a daily basis, before and after medication, over several months, the processed data corresponds to time series of feature measurements annotated according to whether the measurement was taken before or after the patient has taken his/her medication. Though others have investigated the feasibility of monitoring medication response in Parkinson patients using smartphone sensor data, this previous work did not focus on the individual effects that medications have, but rather focused on classification at a population level.4
The first step in analyzing these data is to show that simple feature-specific tests, which ignore the serial correlation in the extracted features, are statistically valid (the distribution of the p-values for tests applied to data generated under the null hypothesis is uniform). This condition guarantees that the tests are exact, that is, the type I error rates match the nominal levels, so that our inferences are neither conservative nor liberal. In other words, if we adopt a significance level cutoff of α, the probability that our tests will incorrectly reject the null when it is actually true is given by α.
Even though the simple feature-specific tests are valid procedures for testing for medication response, in practice we have multiple features and need to combine them into a single decision procedure. The second main contribution of this paper is to propose two distinct approaches to combine all the extracted features into a single hypothesis test. In the first, and most standard, approach, we combine simple tests, applied to each one of our extracted features, into a single union-intersection test. Although simple to implement, scalable, and computationally efficient, this approach requires multiple testing correction, which might become burdensome when the number of extracted features is large. In order to circumvent this potential issue, our second approach is to construct personalized classifiers of the before/after medication labels using all the extracted features of a given patient, and test the null hypothesis that the area under the receiver operating characteristic curve (AUROC) of the classifier is equal to 1/2 (in which case the patient's extracted features are unable to predict the before/after medication labels, implying that the patient does not respond to the medication). A slight disadvantage of the classifier approach, compared to the union-intersection tests, is the larger computational cost (especially for classifiers that require tuning parameter optimization by cross-validation) involved in the classifier training. In any case, the increased computational demand is by no means a limiting factor for the application of the approach.
The rest of this paper is organized as follows. In Section 2 we present our personalized tests, discuss their statistical validity, and perform a power study comparison. In Section 3 we illustrate the application of our tests to the tapping data of the mPower study. Finally, in Section 4 we discuss our results.
2. Methods
2.1. Notation and a few preliminary comments on the data
Throughout this paper, we let xkt, k = 1, . . . , p, t = 1, . . . , n, represent the measurement of feature k at time point t, and let yt ∈ {b, a} represent the binary outcome variable, corresponding to the before/after medication label, where b and a stand for “before” and “after” medication, respectively.
Even though the participants were asked to perform the active tasks 3 times per day, one before the medication, one after, and one at any other time of their choice, participants did not always follow the instructions correctly. As a result, the data is non-standard, with a variable number of daily tasks (sometimes fewer, sometimes more than 3 tasks per day), and variable timing relative to medication patterns (e.g., bbabba. . . , aaabbb. . . , instead of bababa. . . ). Furthermore, the data also contains missing medication labels, as sometimes a participant performed the active task but did not report whether the task was taken before or after medication. In our analysis we restrict our attention to data collected before and after medication only. Hence, for each participant, the number of data points used in our tests is given by n = nb + na, where nb and na correspond, respectively, to the number of before and after medication labels.
2.2. On the statistical validity of personalized tests which ignore the autocorrelation structure of the data
It is common knowledge that the t-test, the Wilcoxon rank-sum test, and other two-sample problem tests suffer from inflated type I error rates in the presence of dependency. We point out, however, that this can happen when the data within each group is dependent, but the two groups are themselves statistically independent. When the data from both groups is sampled jointly from the same multivariate distribution, the dependency of the data might no longer be an issue. Figure 1 provides an illustrative example with t-tests applied to simulated data.
The t-test's assumption of independence (within and between the groups' data) is required in order to make the analytical derivation of the null distribution feasible. It does not mean the test will always generate inflated type I error rates in the presence of dependency (as illustrated in Figure 1f). As a matter of fact, a permutation test based on the t-test statistic is valid if the group labels are exchangeable under the null,5 even when the data is statistically dependent. Exchangeability6 captures a notion of symmetry/similarity in the data, without requiring independence. In the examples presented in Figure 1, the group labels are exchangeable in panels a and c, as illustrated by the symmetry/similarity of the data between groups 1 and 2 in each row of the heatmaps. For panel b, on the other hand, the lack of symmetry between the groups in each row illustrates that the group labels are not exchangeable.
In the context of our personalized tests, the before/after medication labels are exchangeable under the null of no medication response, even though the measurements of any extracted feature are usually serially correlated. Note that the exchangeability is required for the medication labels, and not for the feature measurements, which are not exchangeable due to their serial correlation. Figure 2 illustrates this point, showing the symmetry/similarity of the separate time series for the before and after medication data.
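This exchangeability argument can be checked numerically. The sketch below (a hypothetical illustration, not the paper's code) simulates a serially correlated AR(1) feature, assigns before/after labels at random as under the null, and builds a permutation null for the t statistic by shuffling the labels; the resulting p-value remains valid despite the autocorrelation.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Serially correlated AR(1) feature, with before/after labels assigned
# at random, i.e., data generated under the null of no medication response.
n, phi = 100, 0.8
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + rng.normal()
labels = rng.permutation(np.repeat(["before", "after"], n // 2))

def t_stat(x, labels):
    return stats.ttest_ind(x[labels == "before"], x[labels == "after"]).statistic

# Permutation null: shuffling the labels leaves the data's joint
# distribution unchanged under the null, so the test is valid even
# though the measurements themselves are serially correlated.
t_obs = t_stat(x, labels)
t_perm = np.array([t_stat(x, rng.permutation(labels)) for _ in range(1000)])
p_value = np.mean(np.abs(t_perm) >= np.abs(t_obs))
```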
[Figure 1, panels a–c: heatmaps of simulated data for the three dependency scenarios (independent within and between groups; dependent within and independent between groups; dependent within and between groups), with columns split between group 1 and group 2. Panels d–f: the corresponding p-value histograms.]
Fig. 1. The effect of data dependency on the t-test. Panels a, b, and c show heatmaps of the simulated data. Columns are split between groups 1 and 2, and each row corresponds to one simulated null data set (we show the top 30 simulations only). Bottom panels show the p-value distributions for 10,000 tests applied to null data simulated according to: (i) x1 ∼ N30(µ1, I) and x2 ∼ N30(µ2, I) with µ1 = µ2 = 0 (panel d); (ii) x1 ∼ N30(µ1, Σ) and x2 ∼ N30(µ2, Σ), where µ1 = µ2 = 0 and Σ is a correlation matrix with off-diagonal elements equal to ρ = 0.95 (panel e); and (iii) (x1, x2)ᵗ ∼ N60(µ, Σ), with µ = (µ1, µ2)ᵗ = 0 and Σ as before (panel f). The density of the Uniform[0, 1] distribution is shown in red. Panel d shows that under the standard assumptions of the t-test, the p-value distribution under the null is (as expected) uniform. Panel e shows the p-value distribution for strongly dependent data, showing highly inflated type I error rates, even though the data was simulated according to the t-test's null hypothesis that µ1 = µ2. Panel b clarifies why that is the case. For each row (i.e., simulated data set), the data tends to be quite homogeneous inside each group, but quite distinct between the groups. Because on each simulation we sample the data vectors x1 and x2 from a multivariate normal distribution with a very strong correlation structure, all elements in the x1 vector tend to be close to each other, and all elements in x2 tend to be similar to each other. However, because x1 and x2 are sampled independently from each other, their values tend to be distinct. In combination, the small variability in each group vector together with the difference in their means leads to high test statistic values and small p-values. Panel f shows the p-value distribution for strongly dependent data, when sampled jointly. In this case, the distribution is uniform. Panel c clarifies why. Now, each row tends to be entirely homogeneous (within and between groups), since the joint sampling of x1 and x2 makes all elements in both vectors quite similar to each other, so that the difference in their means tends to be small.
[Figure 2, panels: (a) AR(1) data simulated under the null, with before/after labels over time; (b) autocorrelation plot of the feature data; (c) before and after medication series.]
Fig. 2. Exchangeability of the before/after medication labels in time series data. Panel a shows a feature, simulated from an AR(1) process, under the null hypothesis that the patient does not respond to medication. In this case, the before medication (red dots) and after medication (blue dots) labels are randomly assigned to the feature measurements. Panel b shows an autocorrelation plot of the feature data. Panel c shows the separate “before medication” (red) and “after medication” (blue) series. Note the symmetry/similarity of the two series. Clearly, under the null hypothesis that the patient does not respond to medication, the medication labels are exchangeable, since shuffling of the before/after medication labels would not destroy the serial correlation structure or the trend of the series.
Hence, even though our longitudinal data violates the independence assumption of the t-test, the permutation test based on the t-test statistic is still valid. Of course, the same argument is valid for permutation tests based on other test statistics. Figure 3 illustrates this point, with permutation tests based on the t-test and Wilcoxon rank-sum test. Panel a shows the original data. Red and blue dots represent measurements before and after medication, respectively. The grey dots represent data collected at another time or where the medication label is missing. Panel b shows one realization of a random permutation of the before/after medication labels. In order to generate a permutation null distribution, we perform a large number of random label shuffles, and for each one, we evaluate the adopted test statistic on the permuted data. Panels c and d show the permutation null distributions generated from 10,000 random permutations of the medication labels based, respectively, on the t-test and on the Wilcoxon rank-sum test statistics. The red curve on panel c shows the analytical density of a
[Figure 3, panels: (a) original data (number of taps by date); (b) shuffled before/after medication labels; (c) permutation null, t-test statistic; (d) permutation null, Wilcoxon test statistic; (e, f) p-value distributions under the null from analytical t-tests and Wilcoxon tests.]
Fig. 3. Personalized tests for the null hypothesis that the patient does not respond to medication, according to a single feature (number of taps), and based on the t-test and Wilcoxon rank-sum test statistics. Panel a shows the data on the number of taps for one of mPower's study participants during April 2015. Panel b shows one realization of a random permutation of the before/after medication labels. Panel c shows the permutation null distribution based on the t-test statistic. The red curve represents the analytical density of a t-test, namely, a t-distribution with nb + na − 2 degrees of freedom. Panel d shows the permutation null distribution based on the Wilcoxon rank-sum test statistic. The red curve shows the density of the normal asymptotic approximation for Wilcoxon's test, namely, a Gaussian density with mean nb na/2 and variance nb na(nb + na + 1)/12. In this example nb = 31 and na = 34. Panels e and f show the analytical p-value distributions under the null. Results are based on 10,000 null data sets. Each null data set was generated as follows: (i) randomly sample the number of “before” medication labels, nb, from the set {10, 11, . . . , n − 10}, where n is the total number of measurements; (ii) compute the number of “after” medication labels as na = n − nb; and (iii) randomly assign the “before medication” and “after medication” labels to the number of taps measurements. The plots show the histograms of the p-values derived from the application of t-tests (panel e) and Wilcoxon's tests (panel f) to each of the null data sets. The density of the Uniform[0, 1] distribution is shown in red.
t-test for the data on panel a, while the red curve on panel d shows the density of the normal asymptotic approximation for the Wilcoxon test represented in panel b (we show the normal approximation density, since the exact null distribution is discrete). The close similarity of the permutation and analytical distributions suggests that, in practice, we can use the analytical p-values of t-tests or Wilcoxon rank-sum tests instead of the permutation p-values. Panels e and f further corroborate this point, showing uniform distributions for the analytical p-values of t-tests and Wilcoxon tests, respectively, derived from 10,000 distinct null data sets.
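A minimal sketch of this comparison (using hypothetical tap-count data, not the mPower measurements): for a single feature we compute the analytical t-test and Wilcoxon p-values, as well as a permutation p-value based on the t statistic, which should be in close agreement.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Hypothetical before/after tap counts for one participant
# (nb = 31, na = 34, matching the sample sizes quoted in the text).
before = rng.normal(70, 8, size=31)
after = rng.normal(70, 8, size=34)

# Analytical p-values.
p_t = stats.ttest_ind(before, after).pvalue
p_wilcoxon = stats.mannwhitneyu(before, after, alternative="two-sided").pvalue

# Permutation p-value based on the t-test statistic: shuffle the
# before/after labels and recompute the statistic on each shuffle.
pooled = np.concatenate([before, after])
t_obs = stats.ttest_ind(before, after).statistic
t_perm = np.array([
    stats.ttest_ind(s[:31], s[31:]).statistic
    for s in (rng.permutation(pooled) for _ in range(2000))
])
p_perm = np.mean(np.abs(t_perm) >= np.abs(t_obs))
```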
2.3. Personalized union-intersection tests
In order to combine the feature-specific tests, H0k: the patient does not respond to the medication, according to feature k, versus H1k: the patient responds to the medication, across all extracted features, we construct the union-intersection test,
H0 : ∩_{k=1}^{p} H0k  versus  H1 : ∪_{k=1}^{p} H1k ,   (1)
where, in words, we test the null hypothesis that the patient does not respond to medication, for all p features, versus the alternative hypothesis that he/she responds to medication according to at least one of the features. Under this test, we reject H0 if the p-value of at least one of the feature-specific tests is small. Hence, the p-value for the union-intersection test corresponds to the smallest p-value (across all p tests) after multiple testing correction.
We implement union-intersection tests based on the t-test and Wilcoxon rank-sum test statistics. We adopt the Benjamini-Hochberg approach7 for multiple testing correction. As detailed in Figure 4, the union-intersection test tends to be slightly conservative when applied to correlated features.
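The union-intersection p-value can be sketched as follows (a hypothetical helper, not the paper's implementation): apply the feature-specific Wilcoxon test to each feature column, adjust the p-values with Benjamini-Hochberg, and report the smallest adjusted p-value.

```python
import numpy as np
from scipy import stats

def union_intersection_pvalue(X, labels):
    """Union-intersection test p-value for one participant: the smallest
    Benjamini-Hochberg-adjusted feature-specific Wilcoxon p-value.

    X: (n, p) array of feature measurements; labels: length-n array of
    'before'/'after' strings. (Hypothetical helper names.)
    """
    b, a = X[labels == "before"], X[labels == "after"]
    pvals = np.array([
        stats.mannwhitneyu(b[:, k], a[:, k], alternative="two-sided").pvalue
        for k in range(X.shape[1])
    ])
    # Benjamini-Hochberg adjustment: p_(i) * m / i, then enforce
    # monotonicity from the largest p-value downwards, capped at 1.
    m = pvals.size
    order = np.argsort(pvals)
    adj = pvals[order] * m / np.arange(1, m + 1)
    adj = np.minimum(np.minimum.accumulate(adj[::-1])[::-1], 1.0)
    adjusted = np.empty(m)
    adjusted[order] = adj
    return adjusted.min()
```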
[Figure 4, panels: (a) median tapping interval; (b) number of taps; (c) UI-test with correlated features; (d) UI-test with uncorrelated features; each showing a p-value histogram.]
Fig. 4. P-value distributions for the union-intersection test under the null. Results are based on 10,000 null data sets generated by randomly permuting the before/after medication labels. Panels a and b show the p-value distributions, from H0k, for 2 out of the 24 features combined in the union-intersection test. We observed uniform distributions for all features, but report just 2 here due to space constraints. Panel c shows the p-value distribution of the union-intersection test using the Benjamini-Hochberg correction. The skewness of the distribution towards larger p-values indicates that the test is conservative, meaning that at a nominal significance level α the probability of rejecting the null when it is actually true is smaller than α. Since the p-value distributions of the features are uniform (panels a and b), the skewness of the union-intersection p-value distribution is clearly due to the multiple testing correction. We experimented with other procedures, but the Benjamini-Hochberg correction was the least conservative one. Panel d reports the p-value distribution for the union-intersection test using a shuffled version of the feature data. Note that by shuffling the data, we destroy the correlation among the features, so that the now uniform distribution suggests that the violation of the independence assumption (required by the Benjamini-Hochberg approach) seems to be the reason for the slightly conservative behavior of the union-intersection test. Results were generated using Wilcoxon's test.
2.4. Personalized classifier tests
An alternative approach to combining the information across multiple features into a single decision procedure is to test whether a classifier trained using all the features is able to predict the before/after medication labels better than a random guess. Adopting the AUROC as a classification performance metric, we have that a prediction equivalent to a random guess would have an AUROC equal to 0.5, whereas a perfect prediction would lead to an AUROC equal to 1. Furthermore, if a classifier generates an AUROC smaller than 0.5, we only need to switch the labels in order to make the AUROC larger than 0.5. Therefore, we can test if a classifier's prediction is better than random using the one-sided test,
H0 : AUROC = 1/2 versus H1 : AUROC > 1/2 . (2)
It has been shown8 that, when there are no ties in the predicted class probabilities used for the computation of the AUROC, the test statistic of the Wilcoxon rank-sum test (also known as the Mann-Whitney U test), U, is related to the AUROC statistic by U = nb na (1 − AUROC) (see Section 2 of reference 9 for details). Hence, under the assumption of independence (required by Wilcoxon's test) the analytical p-value for the hypothesis test in (2) is given by the left tail probability, P(U ≤ nb na (1 − AUROC)), of Wilcoxon's null distribution. In the presence of ties, the p-value is given by the left tail of the asymptotic approximate null,
U ≈ N( nb na / 2 ,  nb na (n + 1) / 12  −  [ nb na / (12 n (n − 1)) ] Σ_{j=1}^{τ} tj (tj − 1)(tj + 1) ) ,   (3)
where τ is the number of groups of ties, and tj is the number of ties in group j.9 Alternatively, we can get the p-value as the right tail probability of the corresponding AUROC null,
AUROC ≈ N( 1/2 ,  (n + 1) / (12 nb na)  −  [ 1 / (12 nb na n (n − 1)) ] Σ_{j=1}^{τ} tj (tj − 1)(tj + 1) ) .   (4)
As before, even though the test described above assumes independence, the exchangeability of the before/after medication labels guarantees the validity of the permutation test based on the AUROC statistic. Figure 5 illustrates this point.
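The classifier test reduces to a one-sided Mann-Whitney test on the predicted class probabilities, via the U = nb na (1 − AUROC) equivalence. The sketch below (our own helper, not the authors' code) computes both the AUROC and the corresponding analytical p-value; scipy's normal approximation applies the tie correction of eq. (3).

```python
import numpy as np
from scipy import stats

def auroc_test(scores, labels):
    """One-sided test of H0: AUROC = 1/2 vs H1: AUROC > 1/2.

    scores: predicted probability of the 'after' class from any classifier;
    labels: length-n array of 'before'/'after' strings. (A sketch; the
    paper's own implementation may differ.)
    """
    s_after = scores[labels == "after"]
    s_before = scores[labels == "before"]
    # U statistic for the 'after' scores; AUROC = U / (na * nb), and the
    # 'greater' alternative matches H1: AUROC > 1/2.
    res = stats.mannwhitneyu(s_after, s_before, alternative="greater")
    auroc = res.statistic / (s_after.size * s_before.size)
    return auroc, res.pvalue
```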
[Figure 5, panels: (a) permutation null for the random forest classifier test; (b) p-value distribution under the null (analytic p-values).]
Fig. 5. Panel a shows the permutation null distribution for the personalized classifier test (based on the random forest algorithm). The red curve represents the density of the AUROC (approximate) null distribution in eq. (4). Panel b shows the p-value distribution for the (analytical) classifier test, based on 10,000 null data sets, generated by randomly permuting the before/after medication labels as described in Figure 3. The density of the Uniform[0, 1] distribution is shown in red.
2.5. Statistical power comparison
In this section we compare the statistical power of the personalized classifier tests (based on the random forest10 and extra trees11 classifiers) against the personalized union-intersection tests (based on t-tests and Wilcoxon rank-sum tests). We simulated feature data, xkt, according to the model,
xkt = A cos(π t) + ϵkt , (5)
where ϵkt ∼ N(0, σk²) represents i.i.d. error terms, and the function A cos(π t) describes the periodic signal, with A representing the peak amplitude of the signal. Figure 6 describes the additional steps involved in the generation of the before/after medication labels.
[Figure 6, panels: (a) periodic signal; (b) add before/after medication labels; (c) introduce labeling errors; (d) add random noise; each shown over time points 0–30.]
Fig. 6. Data simulation steps. First, we generate the periodic signal, A cos(π t), shown in panel a for t = 1, . . . , 30, and A = 0.25. Second, we assign the “before medication” label (red dots) to the negative values, and the “after medication” label (blue dots) to the positive values (panel b). Third, we introduce some labeling errors (panel c). Finally, at the fourth step (panel d), we add ϵkt ∼ N(0, σk²) error terms to the measurements.
We ran six simulation experiments, covering all combinations of sample size, n = {50, 150}, and number of features, p = {10, 50, 200}. In all simulations we adopted A = 0.25, a mislabeling error rate of 10%, and increasing σk = {1.00, 1.22, 1.44, 1.67, 1.89, 2.11, 2.33, 2.56, 2.78} for the first 9 features, and σk = 3 for 10 ≤ k ≤ p. In each experiment, we generated 1,000 data sets and computed the personalized tests' p-values. Figure 7 presents a comparison of the empirical power of the personalized tests as a function of the significance threshold α. For each test, the empirical power curve was estimated as the fraction of the p-values smaller than α, for a dense grid of α values varying between 0 and 1.
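The data-generation steps above can be sketched as follows (a hypothetical reimplementation of the simulation design, under the assumption that the first 9 noise levels are evenly spaced between 1.00 and 2.78):

```python
import numpy as np

def simulate_participant(n=50, p=10, A=0.25, mislabel=0.10, seed=0):
    """Simulate one participant's feature matrix and medication labels
    following eq. (5): x_kt = A cos(pi t) + eps_kt.

    Labels come from the sign of the periodic signal, with a fraction
    `mislabel` of them flipped. Noise s.d. grows over the first 9
    features and equals 3 thereafter. (A sketch, not the authors' code.)
    """
    rng = np.random.default_rng(seed)
    t = np.arange(1, n + 1)
    signal = A * np.cos(np.pi * t)                      # alternates +/- A
    sigmas = np.concatenate([np.linspace(1.0, 2.78, 9),
                             np.full(max(p - 9, 0), 3.0)])[:p]
    X = signal[:, None] + rng.normal(scale=sigmas, size=(n, p))
    labels = np.where(signal < 0, "before", "after")    # label by sign
    flip = rng.random(n) < mislabel                     # labeling errors
    labels = np.where(flip,
                      np.where(labels == "before", "after", "before"),
                      labels)
    return X, labels
```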
The results showed some clear patterns. First, the comparison of the top and bottom panels showed that for all 4 tests an increase in sample size leads to an increase in statistical power (as one would have expected). Second, the empirical power of the union-intersection tests based on Wilcoxon and t-tests was very similar in all simulation experiments, with the t-test being slightly better powered than the Wilcoxon test. This result was expected since the simulated features were generated using Gaussian errors, and the t-test is known to be slightly better powered than the Wilcoxon rank-sum test under normality. Similarly, the empirical power of the personalized classifier tests was also similar, with the extra trees algorithm tending to be slightly better powered. Third, this study shows that neither the personalized classifier nor the union-intersection approach dominates the other. Rather, we see that for this particular data generation mechanism and choice of simulation parameters, the union-intersection tests tended to be better powered than the classifier tests for smaller values of p, whereas the converse was
[Figure 7, panels a–f: empirical power versus α for randomForest, extraTrees, UI-Wilcoxon, and UI-t-test, for (n, p) = (50, 10), (50, 50), (50, 200), (150, 10), (150, 50), and (150, 200).]
Fig. 7. Comparison of the personalized tests' empirical power as a function of the significance cutoff, α.
true for larger p. Furthermore, observe that the power of the union-intersection tests tended to decrease as the number of features increased (note the slight decrease in the slope of the cyan and pink curves as we go from the panels on the left to the panels on the right). The tests based on the classifiers, on the other hand, tended to get better powered as the number of features increased (note the respective increase in the slope of the blue and black curves).
The observed decrease in power of the union-intersection tests might be explained by the increased burden caused by the multiple testing correction required by these tests. On the other hand, the personalized classifier tests are not plagued by the multiple testing issue, since all features are simultaneously accounted for by the classifiers. Furthermore, in situations where none of the features is particularly informative, the classifiers might still be able to better aggregate information across the multiple noisy features.
3. Real data illustrations
In this section we illustrate the performance of our personalized hypothesis tests, based on tapping data collected by the mPower study between 03/09/2015 (the date the study opened) and 06/25/2015. We restrict our analyses to 57 patients who performed at least 30 tapping tasks before medication, as well as 30 or more tasks after medication.
Figure 8 presents the results. Panel a shows the number of tapping tasks (before and after medication) performed by each participant. Panel b reports the AUROC scores for 4 classifiers (random forest, logistic regression with elastic-net penalty,12 logistic regression, and extra trees). Panel c presents the p-values for the respective personalized classifier testsa, as well as for 2 personalized union-intersection tests (based on t- and Wilcoxon rank-sum tests).
Panel b shows that the tree-based classifier tests (random forest and extra trees) showed comparable performance across all participants, whereas the regression-based approaches (elastic net and logistic regression) were sometimes comparable but sometimes strongly outperformed by the tree-based tests. Panel c shows that the union-intersection tests, nonetheless, produced at times much smaller p-values than the classifier tests. At a significance level of 0.001 (grey horizontal line), about one quarter of the patients (the leftmost quarter) respond to dopaminergic medication, according to most tests.

aThe results for the personalized classifier tests were based on 100 random splits of the data into training and test sets, using roughly half of the data for training and the other half for testing. The AUROC and p-values reported in Figure 8 b and c correspond to the medians of the AUROCs and p-values across the 100 data splits.
Fig. 8. Results of personalized hypothesis tests applied to the tapping data from the mPower study. Panel a shows the number of tapping tasks performed before (red) and after (blue) medication, per participant. Panel b shows the AUROC scores (across 100 random splits of the data into training and test sets) for four classifiers. Panel c shows the adjusted p-values (in negative log base 10 scale) of 4 classification tests and 2 union-intersection tests. The p-values were adjusted using the Benjamini-Hochberg multiple testing correction. The grey horizontal line represents an adjusted p-value cutoff of 0.001. The participants were sorted according to the AUROC of the random forest algorithm (black dots in panel b).
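The Benjamini-Hochberg adjustment applied to the p-values in panel c can be sketched in a few lines. This minimal Python step-up implementation is equivalent to `p.adjust(p, method = "BH")` in R (where the paper's analyses were done); the helper name `bh_adjust` is illustrative.

```python
# Minimal Benjamini-Hochberg (FDR) step-up adjustment.
import numpy as np

def bh_adjust(pvals):
    """Return Benjamini-Hochberg adjusted p-values (monotone step-up)."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)   # p_(i) * m / i
    # enforce monotonicity, working down from the largest p-value
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adj = np.empty(m)
    adj[order] = np.minimum(scaled, 1.0)
    return adj

# adjusted values: 0.005, 0.025, 0.05, 0.05, 0.2
print(bh_adjust([0.001, 0.01, 0.03, 0.04, 0.2]))
```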
To illustrate how our personalized tests are able to pick up meaningful signal from the extracted features, Figure 9 presents time series plots of 2 features (number of taps and mean tapping interval) which were consistently ranked among the top 3 features for the top 3 patients on the left of Figure 8c, and compares them with the data from the bottom 3 patients on the right of Figure 8c. (For the random forest classifier test, we ranked the features according to the importance measure provided by the random forest algorithm. For the union-intersection tests, we ranked the features according to the p-values of the feature-specific tests.) Comparison of panels a to f, which show the data from patients who respond to medication (according to our tests), against panels g to l, which report the data from patients who do not respond to medication, shows that our tests clearly pick up meaningful signal from these two extracted features.
Fig. 9. Comparison of the leftmost 3 patients against the rightmost 3 patients in Figure 8c, according to 2 tapping features (number of taps and mean tapping interval). Panels a to f show the results for the top 3 patients (who respond to medication according to our tests), while panels g to l show the results for the bottom 3 patients (who do not respond to medication according to our tests).
4. Discussion
In this paper we describe personalized hypothesis tests for detecting dopaminergic medication response in Parkinson patients. We propose and compare two distinct strategies for combining the information from multiple extracted features into a single decision procedure, namely: (i) union-intersection tests; and (ii) hypothesis tests based on classifiers of the medication labels. We carefully evaluated the statistical properties of our tests, and illustrated their performance with tapping data from the mPower study. We have also successfully applied our tests to features extracted from the voice and accelerometer data collected using the mPower app, but cannot present these results here due to space limitations.
About one quarter of the patients analyzed in this study showed strong statistical evidence of medication response. For these patients, we observed that the union-intersection tests tended to generate smaller p-values than the classifier-based tests. This result corroborates our observations in the empirical power simulation studies, where union-intersection tests tended to outperform classifier tests when the number of features is small (recall that we employed only 24 features in our analyses).
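The union-intersection construction can be sketched as follows: compute one feature-specific p-value per extracted feature, and reject the global null (no response in any feature) when the smallest p-value survives a multiplicity correction. The Python sketch below uses Wilcoxon rank-sum tests with a Bonferroni-scaled min-p combination on simulated data; the data, effect size, and choice of correction are illustrative assumptions, not the paper's exact procedure.

```python
# Illustrative personalized union-intersection test (simulated data).
import numpy as np
from scipy.stats import ranksums

rng = np.random.default_rng(1)
n, p = 120, 24                     # 120 tapping tasks, 24 features (as in the paper)
y = rng.integers(0, 2, size=n)     # 0 = before, 1 = after medication
X = rng.normal(size=(n, p))
X[:, 0] += 1.0 * y                 # medication shifts only the first feature

# One Wilcoxon rank-sum test per feature
pvals = np.array([ranksums(X[y == 1, j], X[y == 0, j]).pvalue for j in range(p)])

# Union-intersection decision: reject if any feature rejects;
# Bonferroni-scaled min-p gives a valid combined p-value.
p_ui = min(1.0, p * pvals.min())
print(p_ui)
```

A single strongly responsive feature is enough to drive `p_ui` far below conventional thresholds, which is consistent with the good performance of the union-intersection tests when few features carry signal.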
Although our tests can detect medication responses, they do not explicitly determine the direction of the response; that is, they cannot differentiate a response in the expected direction from a paradoxical one. We point out, however, that their main utility is to help the physician pinpoint patients in need of a change in their drug treatment. For instance, our tests are able to detect patients for whom the drug has an effect in the expected direction but is not well calibrated (so that its effect has worn off by the time the patient takes the medication), as well as patients showing paradoxical responses. Note that both cases flag a situation that requires action by the physician (calibrating the medication dosage in the first case, and stopping or changing the medication in the second). Therefore, even though our tests cannot distinguish between these cases, they are still able to detect patients who can potentially benefit from a dose calibration or a change in medication.
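Although the tests themselves are two-sided, the direction of a detected response can still be read off descriptively, for example from the sign of the median before/after shift of a top-ranked feature. The minimal Python sketch below is our own illustration on simulated data, not part of the paper's procedure.

```python
# Reading off the direction of a detected response (illustrative sketch).
import numpy as np

rng = np.random.default_rng(2)
before = rng.normal(40, 5, size=60)   # number of taps before medication
after = rng.normal(45, 5, size=60)    # more taps after medication (simulated)

shift = np.median(after) - np.median(before)
direction = "expected (improvement)" if shift > 0 else "paradoxical"
print(round(shift, 1), direction)
```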
Even though we restricted our attention to 4 classifiers, other classification algorithms could easily be used to generate additional personalized classifier tests. In our experience, however, tree-based approaches such as the random forest and extra trees classifiers tend to perform remarkably well in practice, providing a robust default choice. Similarly, we focused on t- and Wilcoxon rank-sum tests in the derivation of our personalized union-intersection tests, since these tests show robust practical performance for detecting group differences.
Although others have investigated the feasibility of monitoring medication response in Parkinson patients using smartphone sensor data,4 their study focused on the classification of the before/after medication response at the population level, with the classifier applied to data across all patients, and not at a personalized level, as done here.
In this paper we show that simple hypothesis tests, which ignore the serial correlation in the feature data, are statistically valid and well powered to detect medication response in the mPower study data. We point out, however, that more sophisticated approaches able to leverage the longitudinal nature of the data could in principle improve the statistical power to detect medication response. In any case, the good practical performance of our simple personalized tests suggests that they can be used as sound baseline approaches, against which more sophisticated methods can be compared.
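The validity claim can be probed with a quick simulation. The Python sketch below is our own illustration, not the paper's simulation design: under the null, a Wilcoxon rank-sum test that ignores AR(1) serial correlation in a feature's time series (with before/after labels alternating over days) keeps the empirical type I error at or below the nominal level. The autocorrelation parameter and series length are illustrative assumptions.

```python
# Type I error of a naive Wilcoxon test under AR(1) serial correlation
# (illustrative null simulation; no medication effect is present).
import numpy as np
from scipy.stats import ranksums

rng = np.random.default_rng(3)
rho, n_days, n_sims, alpha = 0.5, 100, 500, 0.05

rejections = 0
for _ in range(n_sims):
    # AR(1) feature series with no before/after difference
    e = rng.normal(size=n_days)
    x = np.empty(n_days)
    x[0] = e[0]
    for t in range(1, n_days):
        x[t] = rho * x[t - 1] + e[t]
    labels = np.tile([0, 1], n_days // 2)   # before/after measurements each day
    p = ranksums(x[labels == 1], x[labels == 0]).pvalue
    rejections += p < alpha

rate = rejections / n_sims
print(rate)   # empirical type I error rate at alpha = 0.05
```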
Code and data availability. All the R13 code implementing the personalized tests, and used to generate the results and figures in this paper, is available at https://github.com/Sage-Bionetworks/personalized_hypothesis_tests. The processed tapping data is available at doi:10.7303/syn4649804.

Acknowledgements. This work was funded by the Robert Wood Johnson Foundation.
References
1. S. Arora, et al, Parkinsonism and related disorders 21, 650-653 (2015).
2. Editorial, Nature Biotechnology 33, 567 (2015).
3. S. H. Friend, Science Translational Medicine 7, 297ed10 (2015).
4. A. Zhan, et al, High frequency remote monitoring of Parkinson's disease via smartphone: platform overview and medication response detection (manuscript under review) (2015).
5. B. Efron, R. J. Tibshirani, An introduction to the bootstrap, Chapter 15 (Chapman and Hall/CRC, 1993).
6. J. Galambos, Encyclopedia of Statistical Sciences 7, 573-577 (1996).
7. Y. Benjamini, Y. Hochberg, Journal of the Royal Statistical Society, Series B 57, 289-300 (1995).
8. D. Bamber, Journal of Mathematical Psychology 12, 387-415 (1975).
9. S. L. Mason, N. E. Graham, Quarterly Journal of the Royal Meteorological Society 128, 2145-2166 (2002).
10. L. Breiman, Machine Learning 45, 5-32 (2001).
11. P. Geurts, D. Ernst, L. Wehenkel, Machine Learning 63, 3-42 (2006).
12. J. H. Friedman, T. Hastie, R. Tibshirani, Journal of Statistical Software 33 (1), (2010).
13. R Core Team, R Foundation for Statistical Computing (Vienna, Austria, ISBN 3-900051-07-0, URL http://www.R-project.org/, 2015).