TRANSFER LEARNING AND SE
[email protected], WVU, JULY 2013

Page 1

TRANSFER LEARNING AND SE [email protected]

WVU, JULY 2013

Page 2

SOUND BITES

•  Ye olde worlde SE

•  “The” model of SE (defects, effort, etc)

•  21st century SE

•  Models (plural)
•  No generality in models
•  But, perhaps, generality in how we find those models

•  Transfer learning


Page 3


Page 4

WHAT IS TRANSFER LEARNING?

•  Source = old = Domain1 = <Eg1, P1>

•  Target = new = Domain2 = <Eg2, P2>

•  If we move from Domain1 to Domain2, do we have to start afresh?

•  Or can we learn faster in “new” …
•  … using lessons learned from “old”?

•  NSF funding (2013-2017):

•  Transfer learning in Software Engineering
•  Menzies, Layman, Shull, Diep
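A toy sketch of the <Eg1, P1> / <Eg2, P2> framing above (illustrative names only, not from the talk): each domain is a set of examples plus a prediction task, and "transfer" asks whether a target-domain learner can reuse source-domain examples rather than starting afresh.

# Toy framing of transfer as <Eg, P> pairs. All names are hypothetical;
# the stand-in "learner" just predicts the mean class value.
from typing import Callable, List, Tuple

Example = Tuple[List[float], float]   # (features, class value)

def learn(examples: List[Example]) -> Callable[[List[float]], float]:
    mean = sum(y for _, y in examples) / len(examples)
    return lambda x: mean             # stand-in model P

source_eg: List[Example] = [([1.0], 10.0), ([2.0], 12.0)]   # Domain1 = <Eg1, P1>
target_eg: List[Example] = [([1.5], 11.0)]                  # Domain2 = <Eg2, P2>

p2_fresh = learn(target_eg)                  # start afresh in "new"
p2_transfer = learn(target_eg + source_eg)   # reuse lessons from "old"
print(p2_fresh([1.5]), p2_transfer([1.5]))   # 11.0 vs 11.0 on this toy data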


Page 5

WHO CARES? (WHAT’S AT STAKE?)

•  “Transfer” is a core scientific issue

•  Lack of transfer is the scandal of SE

•  Replication in Empirical SE is rare

•  Conclusion instability
•  It all depends.

•  The full stop syndrome

•  The result?

•  A funding crisis


Page 6

MANUAL TRANSFER (WAR STORIES)

•  Brazil, SEL, 2002: need domain knowledge (but now gone)?

•  NSF, SEL, 2006: need better automatic support

•  Kitchenham, Mendes et al., TSE 2007: for = against

•  Zimmermann, FSE 2009: cross-project prediction works in 4/600 cases


Page 7

WAR STORIES (EFFORT ESTIMATION)

Effort = a * loc^x * y

•  learned using Boehm’s methods

•  20 repeats * 66% samples of NASA93
•  COCOMO attributes
•  Linear regression (log pre-processor)
•  Sort the coefficients found for each member of x, y


Page 8

WAR STORIES (DEFECT ESTIMATION)


Page 9

BUT THERE IS HOPE

•  Maybe we’ve been looking in the wrong direction

•  SE project data = surface features of an underlying effect
•  Go beneath the surface


Page 10

Focused too much on what we can see at first glance

Did not check the nuances of the hidden structure beneath


BUT THERE IS HOPE

Page 11

With new data mining technologies, the true picture emerges and we can see what is going on


BUT THERE IS HOPE

Page 12

ESEM, 2011: How to Find Relevant Data for Effort Estimation

TIM MENZIES, EKREM KOCAGUNELI

Page 13

THERE IS HOPE

•  Maybe we’ve been looking in the wrong direction

•  SE project data = surface features of an underlying effect
•  Go beneath the surface


Page 14

US DOD MILITARY PROJECTS (LAST DECADE)


You must segment to find relevant data

Page 15


DOMAIN SEGMENTATIONS


Q: What to do about rare zones?

A: Select the nearest ones from the rest. But how?

Page 16

IN THE LITERATURE: WITHIN VS CROSS = ??

BEFORE THIS WORK


Kitchenham et al., TSE 2007

•  Within-company learning (just use local data)

•  Cross-company learning (just use data from other companies)

Results mixed:
•  No clear win from cross or within

Cross vs. within are not rigid boundaries:
•  They are soft borders
•  And we can move a few examples across the border
•  And after making those moves, “cross” is the same as “local”

Page 17

SOME DATA DOES NOT DIVIDE NEATLY ON EXISTING DIMENSIONS


Page 18

THE LOCALITY(1) ASSUMPTION


Data divides best on one attribute:
1.  development centers of developers;
2.  project type, e.g. embedded, etc.;
3.  development language;
4.  application type (MIS, GNC, etc.);
5.  targeted hardware platform;
6.  in-house vs. outsourced projects;
7.  etc.

If Locality(1): hard to use data across these boundaries
•  Then harder to build effort models
•  Need to collect local data (slow)

Page 19

THE LOCALITY(N) ASSUMPTION


Data divides best on a combination of attributes.

If Locality(N):
•  Easier to use data across these boundaries
•  Relevant data spread all around
•  Little diamonds floating in the dust

Page 20

HOW TO FIND RELEVANT TRAINING DATA?


              independent attributes
              w    x    y    z   class
similar 1     0    1    1    1       2
similar 2     0    1    1    1       3
different 1   7    7    6    2       5
different 2   1    9    1    8       8
different 3   5    4    2    6      10
alien 1      74   15   73   56      20
alien 2      77   45   13    6      40
alien 3      35   99   31   21      60
alien 4      49   55   37    4      80

Use similar?

Use more variant?

Use aliens?

Page 21

VARIANCE PRUNING


              independent attributes
              w    x    y    z   class
similar 1     0    1    1    1       2
similar 2     0    1    1    1       3
different 1   7    7    6    2       5
different 2   1    9    1    8       8
different 3   5    4    2    6      10
alien 1      74   15   73   56      20
alien 2      77   45   13    6      40
alien 3      35   99   31   21      60
alien 4      49   55   37    4      80

1) Sort the clusters by “variance”
2) Prune those high-variance things
3) Estimate on the rest

“Easy path”: cull the examples that hurt the learner

PRUNE! (the aliens)
KEEP! (the rest)
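A minimal sketch of those three steps (toy clusters; the class values echo the table above):

# Variance pruning: rank clusters by the variance of their class values,
# drop the high-variance clusters, estimate from what remains.
import statistics

clusters = {
    "similar":   [2, 3],
    "different": [5, 8, 10],
    "alien":     [20, 40, 60, 80],
}

# 1) sort clusters by variance of the class variable
ranked = sorted(clusters.items(), key=lambda kv: statistics.variance(kv[1]))

# 2) prune the highest-variance half
keep = ranked[: max(1, len(ranked) // 2)]
print("kept:", [name for name, _ in keep])        # -> ['similar']

# 3) estimate on the rest (here: median of the surviving class values)
survivors = [v for _, vals in keep for v in vals]
print("estimate:", statistics.median(survivors))  # -> 2.5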

Page 22

TEAK: CLUSTERING + VARIANCE PRUNING (TSE, JAN 2011)


•  TEAK is a variance-based instance selector
•  It is built via GAC trees
•  TEAK is a two-pass system
•  First pass selects low-variance relevant projects
•  Second pass retrieves projects to estimate from
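Not the published TEAK algorithm (which builds GAC trees); below is a much-simplified two-pass selector in the same spirit: pass one keeps the low-variance regions of the training data, pass two estimates a test case from its nearest surviving neighbors.

import math
import statistics

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def two_pass_estimate(train, test_x, k=3):
    """train: list of (features, effort) pairs."""
    # pass 1: pair each project with its nearest neighbor and keep
    # only the pairs whose efforts have below-median variance
    scored = []
    for i, (x, y) in enumerate(train):
        j = min((j for j in range(len(train)) if j != i),
                key=lambda j: dist(x, train[j][0]))
        scored.append((i, j, statistics.variance([y, train[j][1]])))
    cutoff = statistics.median(v for _, _, v in scored)
    keep = set()
    for i, j, v in scored:
        if v <= cutoff:
            keep.update((i, j))
    survivors = [train[i] for i in sorted(keep)]
    # pass 2: estimate from the k nearest survivors
    survivors.sort(key=lambda t: dist(test_x, t[0]))
    return statistics.median(y for _, y in survivors[:k])

train = [([1, 1], 10), ([1, 2], 11), ([9, 9], 90), ([9, 8], 15), ([5, 5], 40)]
print(two_pass_estimate(train, [1, 1.5], k=2))   # -> 10.5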

Page 23

ESSENTIAL POINT


TEAK finds local regions important to the estimation of particular cases

TEAK finds those regions via locality(N)

•  Not locality(1)

Page 24

WITHIN AND CROSS DATASETS


Note: all Locality(1) divisions

Page 25

EXPERIMENT 1: PERFORMANCE COMPARISON OF WITHIN AND CROSS-SOURCE DATA


•  TEAK on within & cross data for each dataset group (lines separate groups)
•  LOOCV used for runs
•  20 runs performed for each treatment
•  Results evaluated w.r.t. MAR, MMRE, MdMRE and Pred(30) (sketched below); but see http://goo.gl/6q0tw

•  If within data outperforms cross, the dataset is highlighted with gray

•  Note: only 2 datasets are highlighted
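A sketch of those evaluation measures, using their standard definitions and toy numbers:

# MAR = mean absolute residual; MRE = |actual - predicted| / actual;
# MMRE/MdMRE = mean/median MRE; Pred(30) = fraction with MRE <= 0.30.
import statistics

def measures(actual, predicted):
    ar = [abs(a - p) for a, p in zip(actual, predicted)]
    mre = [abs(a - p) / a for a, p in zip(actual, predicted)]
    return {
        "MAR": statistics.mean(ar),
        "MMRE": statistics.mean(mre),
        "MdMRE": statistics.median(mre),
        "Pred(30)": sum(m <= 0.30 for m in mre) / len(mre),
    }

print(measures([100, 200, 50], [120, 150, 55]))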

Page 26

EXPERIMENT 2: RETRIEVAL TENDENCY OF TEAK FROM WITHIN AND CROSS-SOURCE DATA


Page 27

EXPERIMENT 2: RETRIEVAL TENDENCY OF TEAK FROM WITHIN AND CROSS-SOURCE DATA


Diagonal (WC) vs. Off-Diagonal (CC) selection percentages sorted

Percentiles of diagonals and off-diagonals

Page 28

HIGHLIGHTS


1.  Don’t listen to everyone
•  When listening to a crowd, first filter the noise

2.  Once the noise clears: bits of me are similar to bits of you
•  Probability of selecting cross or within instances is the same

3.  Cross-vs-within is not a useful distinction
•  Locality(1) not informative
•  Enables “cross-company” learning

Page 29

SO, THERE IS HOPE

•  Maybe we’ve been looking in the wrong direction

•  SE project data = surface features of an underlying effect
•  Go beneath the surface

•  Assuming locality(N), not locality(1)

•  No cross-, no within-
•  It’s all data we can learn from


Page 30

TSE, 2013: LOCAL VS. GLOBAL MODELS FOR EFFORT ESTIMATION AND DEFECT PREDICTION

TIM MENZIES, ANDREW BUTCHER (WVU), ANDRIAN MARCUS (WAYNE STATE), THOMAS ZIMMERMANN (MICROSOFT), DAVID COK (GRAMMATECH)

Page 31

Do not focus on what we can see at first glance

Check the nuances of the hidden structure beneath


THERE IS HOPE

Page 32


Cluster, then learn (using envy)

Page 33

•  Seek the fence where the grass is greener on the other side.

•  Learn from there

•  Test on here

•  Cluster to find “here” and “there”


ENVY = THE WISDOM OF THE COWS

Page 34


@attribute recordnumber real
@attribute projectname {de,erb,gal,X,hst,slp,spl,Y}
@attribute cat2 {Avionics, application_ground, avionicsmonitoring, … }
@attribute center {1,2,3,4,5,6}
@attribute year real
@attribute mode {embedded,organic,semidetached}
@attribute rely {vl,l,n,h,vh,xh}
@attribute data {vl,l,n,h,vh,xh}
…
@attribute equivphyskloc real
@attribute act_effort real
@data
1,de,avionicsmonitoring,g,2,1979,semidetached,h,l,h,n,n,l,l,n,n,n,n,h,h,n,l,25.9,117.6
2,de,avionicsmonitoring,g,2,1979,semidetached,h,l,h,n,n,l,l,n,n,n,n,h,h,n,l,24.6,117.6
3,de,avionicsmonitoring,g,2,1979,semidetached,h,l,h,n,n,l,l,n,n,n,n,h,h,n,l,7.7,31.2
4,de,avionicsmonitoring,g,2,1979,semidetached,h,l,h,n,n,l,l,n,n,n,n,h,h,n,l,8.2,36
5,de,avionicsmonitoring,g,2,1979,semidetached,h,l,h,n,n,l,l,n,n,n,n,h,h,n,l,9.7,25.2
6,de,avionicsmonitoring,g,2,1979,semidetached,h,l,h,n,n,l,l,n,n,n,n,h,h,n,l,2.2,8.4
….

DATA = MULTI-DIMENSIONAL VECTORS

Page 35

CAUTION: DATA MAY NOT DIVIDE NEATLY ON RAW DIMENSIONS

The best description for SE projects may be synthesized dimensions extracted from the raw dimensions


Page 36

FASTMAP


Fastmap: Faloutsos [1995], O(2N) generation of an axis of large variability:

•  Pick any point W
•  Find X, furthest from W
•  Find Y, furthest from X

c = dist(X,Y). All points have distances a, b to (X, Y):

•  x = (a^2 + c^2 - b^2) / (2c)
•  y = sqrt(a^2 - x^2)

Find median(x), median(y). Recurse on the four quadrants.
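A minimal sketch of that recipe (assumed Euclidean distance; toy points):

# FastMap projection onto the (x, y) plane defined by two far-apart pivots.
import math
import random

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def fastmap_xy(rows):
    w = random.choice(rows)                          # pick any point W
    x_piv = max(rows, key=lambda r: dist(r, w))      # X: furthest from W
    y_piv = max(rows, key=lambda r: dist(r, x_piv))  # Y: furthest from X
    c = dist(x_piv, y_piv)
    points = []
    for r in rows:
        a, b = dist(r, x_piv), dist(r, y_piv)
        x = (a ** 2 + c ** 2 - b ** 2) / (2 * c)     # cosine-rule projection
        y = math.sqrt(max(0.0, a ** 2 - x ** 2))     # distance off the axis
        points.append((x, y))
    return points

random.seed(1)
rows = [tuple(random.random() for _ in range(5)) for _ in range(8)]
print(fastmap_xy(rows)[:3])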

Page 37

HIERARCHICAL PARTITIONING

Grow:
•  Find two orthogonal dimensions
•  Find median(x), median(y)
•  Recurse on four quadrants

Prune:
•  Combine quadtree leaves with similar densities
•  Score each cluster by the median score of the class variable
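A sketch of the Grow step (assumed representation: items are ((x, y), class_value) pairs after the FastMap projection), plus the cluster scoring used by Prune:

import random
import statistics

def grow(items, min_size=4):
    """Recursively median-split (x, y) points into quadtree leaves."""
    if len(items) <= min_size:
        return [items]
    xs = [x for (x, _), _ in items]
    ys = [y for (_, y), _ in items]
    mx, my = statistics.median(xs), statistics.median(ys)
    quads = [[], [], [], []]
    for item in items:
        (x, y), _ = item
        quads[(x > mx) + 2 * (y > my)].append(item)
    if max(len(q) for q in quads) == len(items):
        return [items]          # degenerate split: stop recursing
    return [leaf for q in quads if q for leaf in grow(q, min_size)]

def score(leaf):
    """Score a cluster by the median of its class variable."""
    return statistics.median(v for _, v in leaf)

random.seed(1)
items = [((random.random(), random.random()), random.random()) for _ in range(32)]
leaves = grow(items)
print(len(leaves), [round(score(leaf), 2) for leaf in leaves])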

Page 38


Learning via “envy”

Page 39

•  Seek the fence where the grass is greener on the other side.

•  Learn from there

•  Test on here

•  Cluster to find “here” and “there”


ENVY = THE WISDOM OF THE COWS

Page 40

HIERARCHICAL PARTITIONING

Grow:
•  Find two orthogonal dimensions
•  Find median(x), median(y)
•  Recurse on four quadrants

Prune:
•  Combine quadtree leaves with similar densities
•  Score each cluster by the median score of the class variable

Page 41

HIERARCHICAL PARTITIONING

Grow:
•  Find two orthogonal dimensions
•  Find median(x), median(y)
•  Recurse on four quadrants

Prune:
•  Combine quadtree leaves with similar densities
•  Score each cluster by the median score of the class variable

Where is the grass greenest? This cluster envies its neighbor with a better score and max abs(score(this) - score(neighbor)).
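A sketch of that envy rule (hypothetical cluster names; lower scores assumed better, as with effort or defects):

# Each cluster "envies" the neighboring cluster whose score is better
# by the largest margin; rules are then learned "there", tested "here".
def envied_neighbor(cluster, scores, neighbors):
    better = [n for n in neighbors[cluster] if scores[n] < scores[cluster]]
    if not better:
        return None                       # grass is greenest right here
    return max(better, key=lambda n: abs(scores[cluster] - scores[n]))

scores = {"c1": 10.0, "c2": 4.0, "c3": 7.0}
neighbors = {"c1": ["c2", "c3"], "c2": ["c1"], "c3": ["c1", "c2"]}
print(envied_neighbor("c1", scores, neighbors))   # -> 'c2'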

Page 42

Q: HOW TO LEARN RULES FROM NEIGHBORING CLUSTERS?

A: It doesn’t really matter:
•  Many competent rule learners

But to evaluate global vs. local rules:
•  Use the same rule learner for local and global rule learning

This study uses WHICH (Menzies [2010]):
•  Customizable scoring operator
•  Faster termination
•  Generates very small rules (good for explanation)
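Not the real WHICH; below is a much-simplified stand-in showing its key idea of a pluggable, customizable scoring operator driving the search for small rules:

import random
import statistics

def matches(rule, row):
    # a rule is a set of (attribute, value) tests; all must hold
    return all(row.get(attr) == val for attr, val in rule)

def learn_rule(rows, candidates, score, rounds=100):
    best, best_score = frozenset(), score(rows)
    for _ in range(rounds):
        # try a random small rule; keep it if the treated subset scores better
        rule = frozenset(random.sample(candidates, random.randint(1, 2)))
        treated = [r for r in rows if matches(rule, r)]
        if treated and score(treated) < best_score:
            best, best_score = rule, score(treated)
    return best, best_score

rows = [{"rely": "h", "data": "l", "effort": e} for e in (5, 7, 30)] + \
       [{"rely": "l", "data": "n", "effort": e} for e in (40, 60)]
candidates = [("rely", "h"), ("rely", "l"), ("data", "l"), ("data", "n")]
# the customizable bit: here, score a treated subset by its median effort
score = lambda rs: statistics.median(r["effort"] for r in rs)
print(learn_rule(rows, candidates, score))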


Page 43

DATA FROM HTTP://PROMISEDATA.ORG/DATA

Effort reduction = {NasaCoc, China}: COCOMO or function points

Defect reduction = {lucene, xalan, jedit, synapse, etc.}: CK metrics (OO)

Clusters have an untreated class distribution.

Rules select a subset of the examples:
•  generate a treated class distribution

[Chart: class-value distributions as percentiles (25th, 50th, 75th, 100th; x-axis 0-100), comparing untreated data, “global” (treated with rules learned from all data), and “local” (treated with rules learned from the neighboring cluster).]

Page 44

Lower median efforts/defects (50th percentile)

Greater stability (75th – 25th percentile)

Decreased worst case (100th percentile)

BY ANY MEASURE, LOCAL BETTER THAN GLOBAL


Page 45

RULES LEARNED IN EACH CLUSTER

What works best “here” does not work “there”

•  Misguided to try and tame conclusion instability
•  Inherent in the data

Can’t tame conclusion instability.

•  Instead, you can exploit it
•  Learn local lessons that do better than overly generalized global theories



Page 47

Do not focus on what we can see at first glance

Check the nuances of the structures within our data

•  Cluster, then envy


SO THERE IS HOPE

Page 48


Conclusion

Page 49

LACK OF TRANSFER = THE GREAT SCANDAL OF SE

•  Replication in Empirical SE is rare

•  Conclusion instability

•  “It all depends.” is not good enough

•  A funding crisis


Page 50

BUT THERE IS HOPE

•  Maybe we’ve been looking in the wrong direction

•  SE project data = surface features of an underlying effect
•  Go beneath the surface

•  Assuming locality(N), not locality(1)

•  No cross-, no within-
•  It’s all data we can learn from


Page 51

Do not focus on what we can see at first glance

Check the nuances of the structures within our data

•  Cluster, then envy


BUT THERE IS HOPE

Page 52

With new data mining technologies, the true picture emerges and we can see what is going on


BUT THERE IS HOPE

Page 53
