Wang ke classification by cut clearance under threshold

Classification by CUT: Clearance Under Threshold

Ryan McBride ([email protected]),

Ke Wang ([email protected]),

and Wenyuan Li ([email protected])

June 17, 2015

Summary

- Domain knowledge helps identify “bad” cases.
- Usual domain knowledge: each outcome’s cost or relative benefit (cost-sensitive classification).
- But costs are too hard to specify in practice.
- Our idea: Model with a regulatory threshold, a maximum acceptable frequency in future cases.
- Experiments: Our numbers > other numbers.

- Problem: Given a collection of sampled electrical transformers, predict the ones containing carcinogenic polychlorinated biphenyls (PCBs), which are known to be harmful to humans and the environment.

Similar Problems

- Predict a cancer patient
- Predict an unqualified applicant
- Predict a broken car brake

Conventional Solution

- User sets a cost matrix (note: negative = bad):

                          Object class j
                          Positive   Negative
  Predicted   Positive    C1         C2
  class i     Negative    C3         C4

- Issue: What is the cost of not removing a public health hazard?

Our Solution: Thresholds

- Insight: Problems without costs focus on acceptable rates of negatives:
  1. Regulations: At most “1 hazard out of 100”.
  2. Power industries: Too frequent outages in equipment ⇒ strengthen equipment.
- Idea: Model to find “under threshold” groups.

- CUT Classification: Given t, partition the attribute space into groups Gi:

[Figure: scatter plot of positive (+) and negative (-) objects over attributes x and y, partitioned into groups.]

- Gi over threshold ⇒ mitigate risk.
- Gi under threshold ⇒ delay action.

Defining Cleared Groups

- When is a group “under threshold”?
  - One sample that isn’t contaminated?
  - One hundred samples with no PCBs?
  - A million samples with no PCBs?
- Only “clear” if there are enough observations...
- Use statistics to estimate potential frequencies.

Statistical Clearance

- Use a confidence interval with some confidence level (e.g. 99%):
  - The frequency in future cases is no more than the upper bound: ub(Gi)
- Example: There is a 99% chance that no more than 5% of Dynamo Incorporated transformers are contaminated.
- An object o of unknown class is cleared if it falls in a group Gi where ub(Gi) ≤ t.
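The clearance test can be sketched in a few lines. The slides do not say which confidence interval ub(Gi) is based on, so the one-sided Wilson score bound used here is an assumption, as are the function names `upper_bound` and `is_cleared`. Values from this sketch need not match the ub figures quoted elsewhere in the slides.

```python
from math import sqrt
from statistics import NormalDist


def upper_bound(bad: int, n: int, confidence: float = 0.99) -> float:
    """One-sided Wilson score upper bound on the future frequency of bad cases.

    Assumption: the slides do not name the interval; Wilson is one common choice.
    """
    if n <= 0:
        return 1.0  # no observations: nothing can be ruled out
    z = NormalDist().inv_cdf(confidence)  # one-sided normal quantile
    p_hat = bad / n
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return min(1.0, centre + half)


def is_cleared(bad: int, n: int, t: float, confidence: float = 0.99) -> bool:
    """A group is cleared when its upper bound does not exceed the threshold t."""
    return upper_bound(bad, n, confidence) <= t
```

Under this assumed bound, a single uncontaminated sample does not clear a group at t = 5%, while 1000 uncontaminated samples do: `is_cleared(0, 1, 0.05)` is false and `is_cleared(0, 1000, 0.05)` is true. This is the "enough observations" requirement in code form.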

Partitioning Objective

- Goal: Prove many future cases are cleared.
- CUT+ Algorithm: Repeated search for large cleared groupings.

Example with t = 5% on next slide.

- List valid partitions and choose one:

Partition A: Region, for t = 5%

  Group       PCB cases    ub(Gi)   Status
  Lowlands    2 of 300     1.6%     300 CLEARED
  Midlands    103 of 150   76.3%    NON-CLEARED
  Highlands   45 of 550    10.3%    NON-CLEARED

Partition B: Manufacturer, for t = 5%

  Group              PCB cases    ub(Gi)   Status
  Made-Up Electric   130 of 400   36.4%    NON-CLEARED
  Dynamo Inc         20 of 600    4.8%     600 CLEARED

- Partition A clears 300 samples.
- Partition B clears 600 samples.
- Partition B is preferred because it clears more objects.
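The choice between the two partitions can be written out as a small sketch. The sample counts and ub values are copied from the slide; the names `partitions`, `cleared_count`, and `best_partition` are illustrative and not taken from the authors' code.

```python
# Groups per candidate partition: (name, sample count, ub as stated on the slide).
partitions = {
    "A: Region": [
        ("Lowlands", 300, 0.016),
        ("Midlands", 150, 0.763),
        ("Highlands", 550, 0.103),
    ],
    "B: Manufacturer": [
        ("Made-Up Electric", 400, 0.364),
        ("Dynamo Inc", 600, 0.048),
    ],
}


def cleared_count(groups, t):
    """Number of samples in groups whose upper bound is within the threshold."""
    return sum(n for _name, n, ub in groups if ub <= t)


def best_partition(candidates, t):
    """Pick the candidate partition that clears the most samples."""
    return max(candidates, key=lambda name: cleared_count(candidates[name], t))


t = 0.05
counts = {name: cleared_count(groups, t) for name, groups in partitions.items()}
print(counts)                        # {'A: Region': 300, 'B: Manufacturer': 600}
print(best_partition(partitions, t)) # B: Manufacturer
```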

Current Tree Partition:

All Objects
  Produced by Dynamo Inc: 20 PCB of 600, ub(Dynamo): 4.8%, 600 CLEARED
  Produced by Made-Up Electric: 130 PCB of 400, ub(Made-Up): 36.4%, NON-CLEARED

Improvement 1: Repeat partition search in non-cleared groups.

Final Tree

All Objects
  Produced by Dynamo Inc: 20 PCB of 600, ub(Dynamo): 4.8%, 600 CLEARED
  Produced by Made-Up Electric
    In Lowlands: 2 PCB of 150, ub(G): 4.2%, 150 CLEARED
    In Midlands: 98 PCB of 100, ub(G): 100%, NON-CLEARED
    In Highlands: 30 PCB of 150, ub(G): 25.8%, NON-CLEARED

Improvement 2: Merge all non-cleared regions, then search again.

CUT+ Algorithm

- Given a set of training objects, G, and a clearance threshold, t
- REPEAT UNTIL no cleared group is found:
  - CUT Tree(G, t)
  - Remove the objects assigned to a cleared group from G
- Three heuristics for building trees:
  1. Immediate Clearance
  2. Risk Reduction
  3. Pure Potential
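The outer loop can be sketched as follows. This is not the authors' implementation. The CUT Tree step is replaced by a hypothetical one-level search over single attributes (`best_cleared_group`), and the three heuristics are not modelled. The Wilson bound and the 95% confidence level are assumptions. The per-region counts for Dynamo Inc are obtained by subtracting the Made-Up Electric counts from the region totals in the example.

```python
from collections import defaultdict
from math import sqrt
from statistics import NormalDist


def upper_bound(bad, n, confidence=0.95):
    # Assumed one-sided Wilson score bound; the slides do not name the interval.
    z = NormalDist().inv_cdf(confidence)
    p = bad / n
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return min(1.0, (p + z * z / (2 * n) + spread) / (1 + z * z / n))


def best_cleared_group(objects, attributes, t):
    """Hypothetical stand-in for CUT Tree(G, t): largest cleared one-attribute group."""
    best = None
    for attr in attributes:
        groups = defaultdict(list)
        for obj in objects:
            groups[obj[attr]].append(obj)
        for value, members in groups.items():
            bad = sum(o["bad"] for o in members)
            if upper_bound(bad, len(members)) <= t:
                if best is None or len(members) > len(best[2]):
                    best = (attr, value, members)
    return best


def cut_plus(objects, attributes, t):
    """Repeat until no cleared group is found, removing cleared objects each round."""
    remaining = list(objects)
    rules = []
    while remaining:
        found = best_cleared_group(remaining, attributes, t)
        if found is None:
            break
        attr, value, members = found
        rules.append((attr, value, len(members)))
        cleared_ids = {id(o) for o in members}
        remaining = [o for o in remaining if id(o) not in cleared_ids]
    return rules, remaining


def make(manufacturer, region, n, n_bad):
    # Build n objects for one manufacturer and region, the first n_bad contaminated.
    return [{"manufacturer": manufacturer, "region": region, "bad": i < n_bad}
            for i in range(n)]


data = (
    make("Made-Up Electric", "Lowlands", 150, 2)
    + make("Made-Up Electric", "Midlands", 100, 98)
    + make("Made-Up Electric", "Highlands", 150, 30)
    + make("Dynamo Inc", "Lowlands", 150, 0)    # derived by subtraction
    + make("Dynamo Inc", "Midlands", 50, 5)     # derived by subtraction
    + make("Dynamo Inc", "Highlands", 400, 15)  # derived by subtraction
)

rules, remaining = cut_plus(data, ["manufacturer", "region"], t=0.05)
print(rules, len(remaining))
```

With these inputs the loop first clears the Dynamo Inc group (600 objects), then the Lowlands group among the remaining objects (150 objects), and stops with 250 objects left uncleared. Because the uncleared objects are pooled before each new search, the loop also covers the "merge, then search again" improvement.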

Experiments (1)

- Use cross-validation and compare:
  - 3 CUT+ algorithms.
  - Competitors from other classification areas.
- Problem set: PCB identification problems.

Experiments (2)

- Evaluate partition {G1, . . . , Gn} with the test set by:
  - Percent of positives cleared (TPR).
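A minimal sketch of this measure, assuming that "positive" means a non-PCB object (the cost-matrix slide notes that negative = bad) and that each test object can be mapped to its group. The names `true_positive_rate`, `group_of`, and `cleared_groups` are illustrative.

```python
def true_positive_rate(test_objects, cleared_groups, group_of):
    """Fraction of positive (non-PCB) test objects that fall in a cleared group."""
    positives = [o for o in test_objects if not o["bad"]]
    if not positives:
        return 0.0  # no positives in the test set: nothing to clear
    cleared = sum(1 for o in positives if group_of(o) in cleared_groups)
    return cleared / len(positives)


# Toy test set: 5 positives, 3 of them in the cleared group.
test_set = (
    [{"manufacturer": "Dynamo Inc", "bad": False}] * 3
    + [{"manufacturer": "Dynamo Inc", "bad": True}]
    + [{"manufacturer": "Made-Up Electric", "bad": False}] * 2
    + [{"manufacturer": "Made-Up Electric", "bad": True}] * 2
)

tpr = true_positive_rate(test_set, {"Dynamo Inc"}, lambda o: o["manufacturer"])
print(tpr)  # 0.6
```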

PCB Experiment (1)

- t ranges from 0% to p̂.
- p̂ is the observed rate of PCB cases.

[Figure: Results for PCB50. Two panels plot FPR(t) (axis 0% to 5%) and TPR (axis 0% to 100%) against the clearance threshold t (0% to 1.0 p̂). Legend: Pure Potential, Baseline1: C4.5, Baseline2: SMOTE, Baseline3: MetaCost.]

CUT+ clears more non-PCB transformers. Paper results show that there are not too many “over threshold” errors.

PCB Experiment (2)

- t ranges from 0% to p̂.
- p̂ is the observed rate of PCB cases.

[Figure: Results for PCB50. Two panels plot FPR(t) (axis 0% to 5%) and TPR (axis 0% to 100%) against the clearance threshold t (0% to 1.0 p̂). Legend: Pure Potential, Baseline1: C4.5, Baseline2: SMOTE, Baseline3: MetaCost.]

Competitors have few cleared groups since:
- Too few observations to clear a group.
- Or the frequency is too high to clear a group.

- More experiments on UCI sets: Pure Potential is the best algorithm in 22 out of 25 tests.
- Code available at http://www.cs.sfu.ca/~wangk/software/CUT_classification

Acknowledgments

- Funding: BC Hydro R&D program and Canada’s NSERC.
- Transformer image source: Wikipedia user Benutzer:Stahlkocher; license: GFDL.
