Classification by CUT: Clearance Under Threshold
Ryan McBride, Ke Wang, and Wenyuan Li
June 17, 2015
Summary
- Domain knowledge helps identify “bad” cases.
- Usual domain knowledge: each outcome’s cost or relative benefit (cost-sensitive classification).
- But costs are too hard to specify in practice.
- Our idea: model with a regulatory threshold, a maximum acceptable frequency in future cases.
- Experiments: CUT+ clears more cases than competing classifiers.
- Problem: Given a collection of sampled electrical transformers, predict the ones containing carcinogenic polychlorinated biphenyls (PCBs), known to be harmful to humans and the environment.
Similar Problems
- Predict a cancer patient
- Predict an unqualified applicant
- Predict a broken car brake
Conventional Solution
- User sets cost matrix (note: negative = bad):

                          Object class j
                          Positive   Negative
  Predicted   Positive    C1         C2
  class i     Negative    C3         C4

- Issue: What is the cost of not removing a public health hazard?
Our Solution: Thresholds
- Insight: Problems without costs focus on acceptable rates of negatives:
  1. Regulations: At most “1 hazard out of 100”.
  2. Power Industries: Too frequent outages in equipment ⇒ Strengthen equipment.
- Idea: Model to find “under threshold” groups.
- CUT Classification: Given t, partition the attribute space:
[Figure: scatter plot of positive (+) and negative (-) objects over attributes x and y, partitioned into groups Gi.]
- Gi over threshold ⇒ mitigate risk.
- Gi under threshold ⇒ delay action.
Defining Cleared Groups
- When is a group “under threshold”?
  - One sample that isn’t contaminated?
  - One hundred samples with no PCBs?
  - A million samples with no PCBs?
- Only “clear” if there are enough observations...
- Use statistics to estimate potential frequencies.
Statistical Clearance
- Use a confidence interval with some confidence level (e.g. 99%):
  - Frequency in future cases is no more than the upper bound: ub(Gi)
- Example: With 99% confidence, no more than 5% of Dynamo Incorporated transformers are contaminated.
- An object o of unknown class is cleared if it falls in a group Gi where ub(Gi) ≤ t.
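The clearance test can be sketched in a few lines. The slides do not say which confidence interval is used, so this sketch assumes a one-sided Clopper-Pearson (exact binomial) upper bound; the names `binom_cdf`, `upper_bound`, and `is_cleared` are hypothetical, and the values it produces need not match the ub figures quoted elsewhere in the slides.

```python
import math


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p), with each term computed in log space."""
    if p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 1.0 if k >= n else 0.0
    total = 0.0
    for i in range(k + 1):
        log_term = (
            math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * math.log(p) + (n - i) * math.log1p(-p)
        )
        total += math.exp(log_term)
    return min(total, 1.0)


def upper_bound(bad: int, n: int, confidence: float = 0.99) -> float:
    """Assumed method: one-sided Clopper-Pearson upper bound on the rate of bad cases."""
    if bad >= n:
        return 1.0
    alpha = 1.0 - confidence
    lo, hi = bad / n, 1.0
    # Bisection: P(X <= bad) decreases as p grows, so find where it drops to alpha.
    for _ in range(60):
        mid = (lo + hi) / 2
        if binom_cdf(bad, n, mid) > alpha:
            lo = mid
        else:
            hi = mid
    return hi


def is_cleared(bad: int, n: int, t: float, confidence: float = 0.99) -> bool:
    """A group is cleared when its upper bound does not exceed the threshold t."""
    return n > 0 and upper_bound(bad, n, confidence) <= t
```

Under this assumed bound, a single uncontaminated sample gives `upper_bound(0, 1)` of 0.99, while one hundred uncontaminated samples give about 0.045, which is why a group needs enough observations before it can be cleared at t = 5%.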
Partitioning Objective
- Goal: Prove many future cases are cleared.
- CUT+ Algorithm: Repeated search for large cleared groupings.
Example with t = 5% on next slide.
- List valid partitions and choose one:
Partition A: Region for t=5%
  Group              PCB cases     ub       Status
  Lowlands           2 of 300      1.6%     300 CLEARED
  Midlands           103 of 150    76.3%    NON-CLEARED
  Highlands          45 of 550     10.3%    NON-CLEARED
Partition B: Manufacturer for t=5%
  Group              PCB cases     ub       Status
  Made-Up Electric   130 of 400    36.4%    NON-CLEARED
  Dynamo Inc         20 of 600     4.8%     600 CLEARED
- Partition A clears 300 samples.
- Partition B clears 600 samples.
- Partition B is preferred because it clears more objects.
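The comparison above is simple arithmetic, which a short sketch can make explicit. The group sizes and ub values are copied from the example; the function names `cleared_count` and `choose_partition` are hypothetical.

```python
# (PCB cases, group size, upper bound) for each group, copied from the example.
partitions = {
    "Region": {
        "Lowlands": (2, 300, 0.016),
        "Midlands": (103, 150, 0.763),
        "Highlands": (45, 550, 0.103),
    },
    "Manufacturer": {
        "Made-Up Electric": (130, 400, 0.364),
        "Dynamo Inc": (20, 600, 0.048),
    },
}


def cleared_count(groups: dict, t: float) -> int:
    """Number of objects in groups whose upper bound is at most t."""
    return sum(size for _, size, ub in groups.values() if ub <= t)


def choose_partition(candidates: dict, t: float) -> str:
    """Pick the partition that clears the most objects."""
    return max(candidates, key=lambda name: cleared_count(candidates[name], t))


best = choose_partition(partitions, t=0.05)
```

With t = 0.05, the Region partition clears 300 objects and the Manufacturer partition clears 600, so `best` is `"Manufacturer"`.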
Current Tree Partition:
All Objects
  Produced by Dynamo Inc: 20 PCB of 600, ub(Dynamo): 4.8%, 600 CLEARED
  Produced by Made-Up Electric: 130 PCB of 400, ub(Made-Up): 36.4%, NON-CLEARED
Improvement 1: Repeat partition search in
non-cleared groups.
Final Tree
All Objects
  Produced by Dynamo Inc: 20 PCB of 600, ub(Dynamo): 4.8%, 600 CLEARED
  Produced by Made-Up Electric (split by region; extracted branch labels: In Surrey, In Lowlands, In Midlands, In Highlands):
    98 PCB of 100, ub(G): 100%, NON-CLEARED
    30 PCB of 150, ub(G): 25.8%, NON-CLEARED
    2 PCB of 150, ub(G): 4.2%, 150 CLEARED
Improvement 2: Merge all non-cleared
regions then search again.
CUT+ Algorithm
- Given a set of training objects, G, and a clearance threshold, t
- REPEAT UNTIL no cleared group is found:
  - CUT Tree(G, t)
  - Remove the objects assigned to a cleared group from G
- Three heuristics for building trees:
  1. Immediate Clearance
  2. Risk Reduction
  3. Pure Potential
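The outer loop above can be sketched as follows. This is not the authors' implementation: the `best_split` helper is a hypothetical one-level stand-in for CUT Tree that simply picks the attribute clearing the most objects, none of the three heuristics is reproduced, and the Wilson score upper bound at a one-sided 99% level is an assumed choice of confidence bound. Objects are assumed to be dicts with a 0/1 `"bad"` field.

```python
import math
from collections import defaultdict

Z_99 = 2.326  # one-sided 99% normal quantile (assumed confidence level)


def wilson_ub(bad: int, n: int, z: float = Z_99) -> float:
    """Assumed bound: Wilson score upper limit on the rate of bad cases."""
    p = bad / n
    centre = p + z * z / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return min(1.0, (centre + spread) / (1 + z * z / n))


def best_split(objects, attributes, t):
    """Hypothetical stand-in for CUT Tree: the one-level split clearing most objects."""
    best_groups, best_count = [], 0
    for attr in attributes:
        groups = defaultdict(list)
        for obj in objects:
            groups[obj[attr]].append(obj)
        cleared = [
            (attr, value, members)
            for value, members in groups.items()
            if wilson_ub(sum(o["bad"] for o in members), len(members)) <= t
        ]
        count = sum(len(members) for _, _, members in cleared)
        if count > best_count:
            best_groups, best_count = cleared, count
    return best_groups


def cut_plus(objects, attributes, t):
    """Repeat the search on the merged non-cleared objects until nothing clears."""
    remaining = list(objects)
    rules = []
    while remaining:
        cleared = best_split(remaining, attributes, t)
        if not cleared:
            break
        cleared_ids = set()
        for attr, value, members in cleared:
            rules.append((attr, value, len(members)))
            cleared_ids.update(id(o) for o in members)
        # Remove the objects assigned to a cleared group before searching again.
        remaining = [o for o in remaining if id(o) not in cleared_ids]
    return rules, remaining
```

The design choice worth noting is that each pass works on everything left uncleared so far, so a group that could not be cleared by one attribute may still be partly cleared by another attribute in a later pass.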
Experiments (1)
- Use cross-validation and compare:
  - 3 CUT+ algorithms.
  - Competitors from other classification areas.
- Problem Set: PCB identification problems.
Experiments (2)
- Evaluate partition {G1, ..., Gn} with test set by:
  - Percent of positives cleared (TPR).
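A minimal sketch of this evaluation, assuming each test object is reduced to a pair of flags: whether it is positive (here, not bad) and whether the partition places it in a cleared group. The function name `evaluate` and the input layout are hypothetical. Besides TPR, the sketch also reports the rate of negatives among cleared test objects, since that is the frequency the threshold t is meant to bound.

```python
def evaluate(test_objects: list, t: float):
    """test_objects: list of (is_positive, is_cleared) boolean pairs."""
    positives = sum(1 for is_pos, _ in test_objects if is_pos)
    cleared = [is_pos for is_pos, is_clr in test_objects if is_clr]
    cleared_positives = sum(cleared)
    # TPR: share of all positive test objects that end up in a cleared group.
    tpr = cleared_positives / positives if positives else 0.0
    # Rate of negatives among cleared objects, compared against the threshold t.
    negative_rate = (len(cleared) - cleared_positives) / len(cleared) if cleared else 0.0
    return tpr, negative_rate, negative_rate <= t
```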
PCB Experiment (1)
- t ranges from 0% to p̂.
- p̂ is the observed rate of PCB cases.
[Figure: Results for PCB50. Two panels plot FPR(t) (0% to 5%) and TPR (0% to 100%) against the clearance threshold t (0% to 1.0 p̂) for Pure Potential, Baseline1: C4.5, Baseline2: SMOTE, and Baseline3: MetaCost.]
CUT+ clears more non-PCB transformers.
Paper results show that there are not too
many “over threshold” errors.
PCB Experiment (2)
- t ranges from 0% to p̂.
- p̂ is the observed rate of PCB cases.
[Figure: Results for PCB50. Two panels plot FPR(t) (0% to 5%) and TPR (0% to 100%) against the clearance threshold t (0% to 1.0 p̂) for Pure Potential, Baseline1: C4.5, Baseline2: SMOTE, and Baseline3: MetaCost.]
Competitors have few cleared groups since:
- Too few observations to clear group.
- Or frequency too high to clear group.
- More Experiments on UCI Sets: Pure Potential is the best algorithm in 22 out of 25 tests.
- Code available at http://www.cs.sfu.ca/~wangk/software/CUT_classification
Acknowledgments
- Funding: BC Hydro R&D program and Canada’s NSERC.
- Transformer Image Source: Wikipedia user Benutzer:Stahlkocher; License: GFDL.