1
Quantifying Opinion about a Logistic Regression using
Interactive Graphics
Paul Garthwaite, The Open University
Joint work with Shafeeqah Al-Awadhi
2
Introduction/Plan
• This work arose from a practical problem in logistic regression.
• The theory extends easily to elicit opinion about the link function of any glm.
• I will outline the method for glms in general.
• The motivating problem has some additional (commonly occurring) structure that the elicitation method exploits.
• Interactive computing is used to elicit opinion.
• Prior models can be formed that aim to allow a small amount of data to correct some potential systematic biases in assessments.
• Results for the practical problem will be given.
3
Motivating Example
The task is to model the habitat distribution of fauna
in south-east Queensland - bats, birds, mammals etc.
Available information:
• Environmental attributes on a GIS database.
• Sample information of presence/absence at 300-400 sites.
• Background knowledge of ecologists.
The ecologists have seen the bat (say) in various
locations but this information is difficult to use
in a traditional statistical analysis because it has
not been obtained from any sampling scheme.
Prob(presence) = f (environmental attributes)
4
Continuous variables: elevation; quarterly rainfall and temperatures; canopy cover; slope; aspect.
Factors: land type; vegetation; forest structure; logging; grazing; etc.
A workshop with 15 ecologists indicated
• unimodal or monotonic relationships
• independence between attributes in their effect on the probability of presence.
[Figure: example curve of prob(presence), on a vertical axis from 0 to 0.5, against an attribute, on a horizontal axis from 0 to 12.]
5
Generalised Linear Model (glm)
The model has the form $Y = g[\mu(\mathbf{r})]$, where g[.] is the link function.
For logistic regression, $g[\mu] = \ln(\mu/(1-\mu))$ and $\mu$ is the probability of presence.
$\mathbf{r}$ is the vector of predictor variables.
From the ith predictor variable, $r_i$, a vector of explanatory variables $X_i = (X_{i,1}, \ldots, X_{i,\delta(i)})'$ is constructed such that we have the linear equation
$$Y = \alpha + X_1'\beta_1 + \cdots + X_{m+n}'\beta_{m+n}.$$
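As a rough illustration of the model form above, the following Python sketch evaluates the logistic link, its inverse, and the linear predictor. The function names and the numerical values are hypothetical and are not taken from the talk.

```python
import math


def logit(mu):
    """Logistic link g[mu] = ln(mu / (1 - mu)), defined for 0 < mu < 1."""
    return math.log(mu / (1.0 - mu))


def inv_logit(y):
    """Inverse link: the probability mu for which g[mu] = y."""
    return 1.0 / (1.0 + math.exp(-y))


def linear_predictor(alpha, betas, xs):
    """Y = alpha + sum_i X_i' beta_i, with one coefficient list per predictor."""
    return alpha + sum(
        b * x for beta_i, x_i in zip(betas, xs) for b, x in zip(beta_i, x_i)
    )


# Hypothetical values: two predictors, the first with two explanatory variables.
alpha = logit(0.2)
betas = [[0.5, -0.1], [0.3]]
xs = [[1.0, 2.0], [1.0]]

y = linear_predictor(alpha, betas, xs)
mu = inv_logit(y)  # probability of presence implied by Y
```

Because the link is monotone, any statement about quantiles of $\mu$ translates directly into a statement about quantiles of $Y$.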
6
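Define:
$$X_{i,j} = \begin{cases} 0 & \text{if } R_i \le r_{i,j-1} \\ R_i - r_{i,j-1} & \text{if } r_{i,j-1} < R_i \le r_{i,j} \\ r_{i,j} - r_{i,j-1} & \text{if } r_{i,j} < R_i, \end{cases}$$
where $R_i$ is the value of the ith predictor variable and the $r_{i,j}$ are dividing points on its range, and then Y is a linear function of $X_i = (X_{i,1}, \ldots, X_{i,\delta(i)})'$.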
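The piecewise-linear construction for a continuous variable can be sketched in a few lines of Python. This is only an illustration: the function name and the dividing points used in the example are hypothetical.

```python
def piecewise_linear_basis(value, knots):
    """Return [X_1, ..., X_d] for one continuous predictor.

    knots = [r_0, r_1, ..., r_d] must be increasing. X_j is the length of
    the part of the interval (r_{j-1}, r_j] that lies at or below `value`.
    """
    xs = []
    for lower, upper in zip(knots[:-1], knots[1:]):
        if value <= lower:
            xs.append(0.0)            # value has not reached this interval
        elif value <= upper:
            xs.append(value - lower)  # value falls inside this interval
        else:
            xs.append(upper - lower)  # value lies beyond this interval
    return xs


# Hypothetical dividing points for an attribute measured on 0 to 12.
knots = [0.0, 4.0, 8.0, 12.0]
basis = piecewise_linear_basis(6.0, knots)  # [4.0, 2.0, 0.0]
```

With this coding each coefficient is the slope of Y over one interval, so a unimodal or monotonic relationship corresponds to a simple sign pattern in the slopes.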
7
Factors:
One factor level (the best one, say) is chosen as the reference level. Each other level is given a dummy 0/1 variable, $X_{i,j}$, that equals 1 for that level and 0 for all other levels:
$$X_{i,j} = \begin{cases} 1 & \text{if } R_i = r_{i,j} \\ 0 & \text{otherwise.} \end{cases}$$
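A minimal Python sketch of this dummy coding follows. The function name and the level labels are hypothetical and serve only to illustrate the idea.

```python
def factor_dummies(level, levels, reference):
    """0/1 dummy variables for every level except the reference level."""
    if level not in levels:
        raise ValueError(f"unknown level: {level!r}")
    return [1 if level == other else 0 for other in levels if other != reference]


# Hypothetical levels of a vegetation factor; "rainforest" is the reference.
levels = ["rainforest", "wet sclerophyll", "dry sclerophyll"]
at_other_level = factor_dummies("wet sclerophyll", levels, reference="rainforest")  # [1, 0]
at_reference = factor_dummies("rainforest", levels, reference="rainforest")         # [0, 0]
```

At the reference level every dummy is zero, so that level contributes nothing to Y beyond the constant term.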
8
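The sampling model is
$$Y = \alpha + X_1'\beta_1 + \cdots + X_{m+n}'\beta_{m+n}.$$
Let $\beta = (\beta_1', \ldots, \beta_{m+n}')'$.
For the prior distribution we put
$$\begin{pmatrix} \alpha \\ \beta \end{pmatrix} \sim \text{MVN}\left( \begin{pmatrix} b_0 \\ b \end{pmatrix}, \begin{pmatrix} \sigma_{00} & \sigma_1' \\ \sigma_1 & \Sigma \end{pmatrix} \right).$$
The values of the parameters in red ($b_0$, $b$, $\sigma_{00}$, $\sigma_1$ and $\Sigma$) must be chosen by the expert to represent his or her opinions.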
9
Assessing medians and quartiles
These are fundamental assessment tasks the expert performs.
How far is it from Aberdeen to Southampton?
     25%     |     25%     |     25%     |     25%
         470 miles     525 miles     600 miles
The median (blue) is assessed first and then the lower and upper quartiles (red).
Ecologists were given practice at performing these tasks in preparatory training and explanation.
10
Eliciting $b_0$ and $\sigma_{00}$
$E(\alpha) = b_0$ and $\text{Var}(\alpha) = \sigma_{00}$. Also, $Y = \alpha$ at the reference point. The expert assesses $m_{0.50}$, the median of $\mu$ at this point. (For logistic regression $\mu$ is the probability of presence.)
We put $b_0 = g[m_{0.50}]$.
The expert also assesses the lower and upper quartiles $m_{0.25}$ and $m_{0.75}$. We put
$$\sigma_{00} = \left( \frac{g(m_{0.75}) - g(m_{0.25})}{1.348} \right)^2.$$
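The step from the three assessed quantiles to the prior mean and variance of the constant term can be sketched as follows, assuming the logistic link. The divisor is the interquartile range of a standard normal distribution (about 1.349, which the slide gives as 1.348); it is computed here rather than hard-coded. The function name and the example assessments are hypothetical.

```python
import math
from statistics import NormalDist


def logit(mu):
    """Logistic link g[mu] = ln(mu / (1 - mu))."""
    return math.log(mu / (1.0 - mu))


# Interquartile range of a standard normal distribution (about 1.349).
IQR_STD_NORMAL = NormalDist().inv_cdf(0.75) - NormalDist().inv_cdf(0.25)


def elicit_intercept_prior(m25, m50, m75, link=logit):
    """Turn assessed quartiles of mu at the reference point into (b0, sigma00)."""
    b0 = link(m50)                                 # prior mean on the link scale
    sd = (link(m75) - link(m25)) / IQR_STD_NORMAL  # normal sd matching the IQR
    return b0, sd ** 2


# Hypothetical assessments: median 0.20, quartiles 0.10 and 0.35.
b0, sigma00 = elicit_intercept_prior(0.10, 0.20, 0.35)
```

Only the median fixes $b_0$; the two quartiles enter solely through their separation on the link scale, so an asymmetric pair of quartiles is absorbed into a single variance.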
11
Eliciting $b$ and $\sigma_1$
• $b$ is determined from the unconditional assessments.
• $\sigma_1$ is determined from assessments conditional on $\mu$ at the reference point equalling $m_{0.75}$.
12
Eliciting $b$ and $\sigma_1$ for factors.
Put $y_{0.75} = g[m_{0.75}]$. Then
$$E[\beta \mid \alpha = y_{0.75}] = b + \sigma_1 \sigma_{00}^{-1} (y_{0.75} - b_0),$$
enabling $\sigma_1$ to be estimated.
[Go to program]
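Rearranging the conditional-mean formula for a multivariate normal distribution gives $\sigma_1 = \sigma_{00} \, (E[\beta \mid \alpha = y_{0.75}] - b) / (y_{0.75} - b_0)$. The Python sketch below applies that rearrangement. It assumes the conditional assessments have already been converted into a vector of conditional means for $\beta$; the function name and all numbers are hypothetical.

```python
def estimate_sigma1(b, conditional_mean, b0, sigma00, y75):
    """Solve E[beta | alpha = y75] = b + sigma1 * (y75 - b0) / sigma00 for sigma1."""
    if y75 == b0:
        raise ValueError("y75 must differ from b0 to identify sigma1")
    scale = sigma00 / (y75 - b0)
    return [scale * (c - m) for c, m in zip(conditional_mean, b)]


# Hypothetical values: unconditional means b and means conditional on alpha = y75.
sigma1 = estimate_sigma1(
    b=[0.5, -0.2],
    conditional_mean=[0.7, -0.1],
    b0=-1.4,
    sigma00=1.2,
    y75=-0.6,
)  # approximately [0.3, 0.15]
```

If the conditional assessments equal the unconditional ones, the estimate of $\sigma_1$ is zero, which corresponds to the expert's opinion about $\beta$ being unrelated to the constant term.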
13
Assessments to obtain $\Sigma$
Conditional on the first three line segments being
correct, the dashed lines are quartiles of where the
line might continue.
14
Conditional Assessments for Factors
• The circles indicate conditions.
• Dotted horizontal bars are previous assessments.
• Solid bars are current assessments and must be within the dotted bars if $\Sigma$ is positive-definite.
[Go to program]
15
Calculating $\Sigma$
Iterative calculations determine $\Sigma = A_p$.
Start by estimating the lower-right scalar element of $\Sigma$, and call it $A_1$. Then estimate the lower-right $2 \times 2$ submatrix of $\Sigma$ and call it $A_2$, etc.
If
$$A_i = \begin{pmatrix} a_{ii} & a_i' \\ a_i & A_{i-1} \end{pmatrix}$$
and $A_{i-1}$ is positive-definite, then so is $A_i$ provided $a_{ii} > a_i' A_{i-1}^{-1} a_i$.
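The iterative condition above can be checked numerically by growing the lower-right block one row and column at a time. The following pure-Python sketch does this for a symmetric matrix given as a list of rows; it is an illustration of the condition only, not the program used in the talk, and the function names are hypothetical.

```python
def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[k]] for k, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def first_failing_step(sigma):
    """Return None if symmetric sigma is positive-definite; otherwise the
    size i of the first lower-right block A_i that violates the condition."""
    p = len(sigma)
    for i in range(1, p + 1):
        top = p - i                     # index of a_ii within sigma
        a_ii = sigma[top][top]
        if i == 1:
            margin = a_ii               # A_1 is the scalar itself
        else:
            a_i = [sigma[r][top] for r in range(top + 1, p)]
            previous = [row[top + 1:] for row in sigma[top + 1:]]  # A_{i-1}
            z = solve(previous, a_i)    # A_{i-1}^{-1} a_i
            margin = a_ii - sum(u * v for u, v in zip(a_i, z))
        if margin <= 0:
            return i
    return None


ok = first_failing_step([[2.0, 1.0, 0.0], [1.0, 2.0, 1.0], [0.0, 1.0, 2.0]])  # None
bad = first_failing_step([[1.0, 2.0], [2.0, 1.0]])                            # 2
```

Reporting the step at which the condition first fails indicates which newly assessed row is incompatible with the assessments already made.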
16
Alternative Prior Models
Individuals can show systematic bias in their
subjective assessments. The aim is to form prior
models that allow a small amount of data to
largely correct some potential biases.
Prior 2
The marginal distribution of $\alpha$ is diffuse, rather than $N(b_0, \sigma_{00})$. The conditional distribution of $\beta$ is assumed to be unchanged:
$$\beta \mid \alpha \sim \text{MVN}(b, \Sigma).$$
This allows for error in specifying the origin of the Y-axis.
17
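Prior 3
Prior 3 replaces the scale for Y with some other linear scale. $\alpha$ is again given a diffuse distribution and the conditional distribution of $\beta$ is taken to be
$$\beta \mid \alpha, \lambda \sim \text{MVN}(\lambda b, \lambda^2 \Sigma),$$
where $\lambda$ denotes the scale factor. $\lambda$ is also given a diffuse distribution.
Prior 4
This is the same as Prior 3, except it allows for systematic bias in quartile assessments by putting
$$\beta \mid \alpha, \lambda, \omega \sim \text{MVN}(\lambda b, \omega \Sigma),$$
with a separate multiplier $\omega$ for the covariance. $\alpha$, $\lambda$ and $\omega$ are given diffuse distributions.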
18
Cross-validation and scoring
• The usefulness of a prior distribution can be objectively examined by using cross-validation and a scoring rule.
• For the cross-validation the data for a species were divided into four sets. Each set in turn was omitted and the remaining sets used to form prediction equations.
• Prediction equations were applied to the omitted set and squared error loss determined:
$$\text{Squared error loss} = \sum_k (w_k - \mu_k)^2,$$
where the summation is over all sites in the omitted (validation) set, $\mu_k$ is the probability of presence given by the prediction equation, and $w_k$ is a 0/1 dummy variable indicating absence/presence.
• This defines a proper scoring rule.
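The cross-validation and scoring procedure described in the bullets above can be sketched in Python as follows. The talk does not say how sites were allocated to the four sets, so the interleaved split is an assumption, and `fit_prevalence` is a hypothetical stand-in for whatever method produces the prediction equation.

```python
def squared_error_loss(outcomes, probabilities):
    """Sum over validation sites of (w_k - mu_k)^2."""
    return sum((w - mu) ** 2 for w, mu in zip(outcomes, probabilities))


def cross_validated_losses(sites, fit, n_sets=4):
    """sites is a list of (x, w) pairs; fit(training) returns a function x -> mu.

    Each set in turn is omitted, the rest are used to fit, and the loss is
    computed on the omitted set. The interleaved split is an assumption.
    """
    sets = [sites[k::n_sets] for k in range(n_sets)]
    losses = []
    for k, omitted in enumerate(sets):
        training = [s for j, group in enumerate(sets) if j != k for s in group]
        predict = fit(training)
        losses.append(
            squared_error_loss(
                [w for _, w in omitted], [predict(x) for x, _ in omitted]
            )
        )
    return losses


def fit_prevalence(training):
    """Hypothetical baseline: predict the training prevalence at every site."""
    rate = sum(w for _, w in training) / len(training)
    return lambda x: rate


# Hypothetical data: eight sites, presence recorded at sites 0 and 4.
sites = [(k, 1 if k % 4 == 0 else 0) for k in range(8)]
losses = cross_validated_losses(sites, fit_prevalence)
total = sum(losses)
```

Summing the per-set losses gives a total of the kind reported for each method, with smaller totals indicating better predictions.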
19
Results for little bent-wing bat
Method           Set 1    Set 2    Set 3    Set 4    Total
Prior 1           9.57     8.93     8.94     9.30    36.74
Prior 2           9.62     9.03     8.98     9.24    36.87
Prior 3           9.52     8.86     8.92     8.81    36.11
Prior 4           9.73     8.87     8.90     8.62    36.13
Frequent.        11.03     9.72     9.55    10.78    41.07
No data          10.83     9.81     9.92    10.56    41.12
Sample results   11/94    10/94    10/93    11/94    42 in 375
20
[Figure: four scatter plots of posterior value against prior value, one panel each for Prior 1, Prior 2, Prior 3 and Prior 4. Both axes run from -5.0 to 2.0. The panels for Priors 1 and 2 are labelled $\text{MVN}(b, \Sigma)$; the panels for Priors 3 and 4 are labelled with the correspondingly rescaled MVN distributions.]
21
Method           Little bent-wing bat   Common bent-wing bat   Frogmouth    Powerful owl   Greater glider
Prior 1                 36.74                  12.75             28.76         13.61           43.90
Prior 2                 36.87                  12.73             28.91         13.60           43.94
Prior 3                 36.11                  12.41             25.99         13.17           42.35
Prior 4                 36.13                  12.75             28.61         13.61           43.90
Frequent.               41.07                  13.70             30.91         14.38           44.15
No data                 41.12                  13.66             29.54         15.07           48.81
Sample results        42 in 375              13 in 375         31 in 324     14 in 324       53 in 343
22
Concluding Comments
• The elicitation method described here is able to handle large problems by:
(a) using interactive graphics
(b) suggesting values to the expert that might represent his or her opinions.
• It is believed that the use of graphs can improve the quality of the assessed distributions.
• Cross-validation can demonstrate clearly the gain from using prior knowledge, when there is such gain.
• Additional parameters in the prior model can allow limited data to be used more effectively.