Practical Computing with Chaos

Ted Dunning, June 9, 2015
© 2014 MapR Technologies

Practical Computing with Chaos Ted Dunning, Chief Applications Architect MapR Technologies

Email [email protected] [email protected] Twitter @Ted_Dunning

e-book available courtesy of MapR Also at MapR booth

http://bit.ly/1jQ9QuL

A New Look at Anomaly Detection by Ted Dunning and Ellen Friedman © June 2014 (published by O’Reilly)

Practical Machine Learning series (O’Reilly)
•  Machine learning is becoming mainstream
•  Need pragmatic approaches that take into account real-world business settings:
   –  Time to value
   –  Limited resources
   –  Availability of data
   –  Expertise and cost of team to develop and to maintain system
•  Look for approaches with big benefits for the effort expended

Agenda
•  Monty Hall
•  Randomized geo-coding
•  Thompson sampling
   –  Bayesian Bandits
   –  Targeting
   –  Bayesian ranking
•  Dithering (sound, signals)
•  Synthetic data (preview)

Let’s Start with Trouble
•  Monty Hall problem (oops, done)
•  Three doors, one with a fabulous prize
•  You pick one
•  Monty shows you that one of the remaining doors is empty
•  You can switch at this point to the other door or not
•  Should you switch?

The Real Problem

•  Doing the math isn’t too hard

•  Convincing somebody you have the right answer is really hard

Live Coding With REAL Chaos
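
One way to convince a skeptic is to simulate the game rather than argue about the math. The following is a minimal sketch of such a simulation, not the code from the live demo; the function names `play` and `win_rate`, the trial count, and the seed are assumptions made for illustration.

```python
import random


def play(switch, rng):
    """Play one round of the three-door game; return True on a win."""
    prize = rng.randrange(3)
    pick = rng.randrange(3)
    # The host opens a door that is neither the contestant's pick nor the prize.
    opened = rng.choice([d for d in range(3) if d != pick and d != prize])
    if switch:
        # Move to the one door that is neither picked nor opened.
        pick = next(d for d in range(3) if d != pick and d != opened)
    return pick == prize


def win_rate(switch, trials=100_000, seed=42):
    """Fraction of rounds won when always switching or always staying."""
    rng = random.Random(seed)
    return sum(play(switch, rng) for _ in range(trials)) / trials


if __name__ == "__main__":
    print("stay:  ", win_rate(switch=False))
    print("switch:", win_rate(switch=True))
```

Staying should win about one round in three and switching about two in three, which is the answer the math gives.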

Geo-coding

•  Some databases have disk locality ⇔ key locality
•  The primary key is totally ordered
•  Embedding a total ordering of the points in a plane is possible
   –  But loses some distance information
   –  A line is not a square!
•  We want to do proximity searches
   –  This gets harder in the polar regions for most codings

Space Filling Curve

[Figure, two slides: a space-filling curve drawn through cells numbered 0 to 3.]

Z-coding – Interleave Bits

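x = 010, y = 011, geo = 00.11.01

[Figure: 8 × 8 grid with both axes labeled 000 to 111; quadrants and cells are labeled with their interleaved-bit codes, for example 11, 1110, and 110010.]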
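
The interleaving on this slide can be sketched in a few lines. This is an illustrative sketch only: the function names `z_encode` and `z_decode` and the dotted output format (one x bit followed by one y bit per pair, most significant pair first) are assumptions chosen to match the example x = 010, y = 011.

```python
def z_encode(x, y, bits=3):
    """Interleave the bits of x and y, most significant first, x bit before y bit."""
    pairs = []
    for i in reversed(range(bits)):
        x_bit = (x >> i) & 1
        y_bit = (y >> i) & 1
        pairs.append(f"{x_bit}{y_bit}")
    return ".".join(pairs)


def z_decode(key):
    """Recover (x, y) from a dotted key produced by z_encode."""
    x = y = 0
    for pair in key.split("."):
        x = (x << 1) | int(pair[0])
        y = (y << 1) | int(pair[1])
    return x, y


print(z_encode(0b010, 0b011))  # the slide's example
print(z_decode("00.11.01"))
```

Because the high-order bits of both coordinates come first, all cells inside one quadrant share the same leading pair, which is what makes prefix scans useful for proximity search.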

Neighbors Often Share Prefix

[Figure: the 8 × 8 grid with three cells highlighted: 00.11.11, 10.01.01, and 00.11.01.]

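Often, not always

[Figure: number line showing keys 13 (00.11.01), 15 (00.11.11), and 37 (10.01.01); 13 and 15 are marked Close, 37 is marked Far.]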
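
A short sketch makes the failure case concrete. Treating each key as an integer, the cells at x = 2, 3, and 4 in row y = 3 are neighbors in the plane, yet the last step crosses a quadrant boundary and the key jumps. The helper name `z_value` is an assumption; the three cells are the ones highlighted on the previous slide.

```python
def z_value(x, y, bits=3):
    """Integer Z-code: interleave the bits of x and y, x bit first in each pair."""
    code = 0
    for i in reversed(range(bits)):
        code = (code << 2) | (((x >> i) & 1) << 1) | ((y >> i) & 1)
    return code


a = z_value(2, 3)  # 00.11.01
b = z_value(3, 3)  # 00.11.11
c = z_value(4, 3)  # 10.01.01

# Each pair of cells is one step apart in the plane.
print("key distance between (2,3) and (3,3):", b - a)
print("key distance between (3,3) and (4,3):", c - b)
```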

Random Sampling to Derive Keys

"00.01.01" "00.01.10" "00.01.11" "00.11.00" "00.11.01" "00.11.10" "00.11.11" "01.00.10" "01.10.00" "01.10.10"

"00.01.10" - "00.01.11"
"00.11.00" - "00.11.11"
"01.00.10"
"01.10.00" - "01.10.10"

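
The idea of deriving keys by random sampling can be sketched as follows: throw random points into the search region, record the Z-code of every cell that gets hit, and merge nearby codes into scan ranges. This is a sketch under stated assumptions, not the author's implementation: the circular region, the sample count, the seed, the `max_gap` merging rule, and all function names are hypothetical.

```python
import random


def z_value(x, y, bits=3):
    """Integer Z-code: interleave the bits of x and y, x bit first in each pair."""
    code = 0
    for i in reversed(range(bits)):
        code = (code << 2) | (((x >> i) & 1) << 1) | ((y >> i) & 1)
    return code


def fmt(code, bits=3):
    """Render an integer Z-code in the dotted form used on the slides."""
    digits = format(code, f"0{2 * bits}b")
    return ".".join(digits[i:i + 2] for i in range(0, len(digits), 2))


def sample_keys(cx, cy, radius, samples=1000, bits=3, seed=1):
    """Sample points in a circle and return the sorted Z-codes of the cells hit."""
    rng = random.Random(seed)
    size = 1 << bits
    keys = set()
    for _ in range(samples):
        px = rng.uniform(cx - radius, cx + radius)
        py = rng.uniform(cy - radius, cy + radius)
        inside_circle = (px - cx) ** 2 + (py - cy) ** 2 <= radius ** 2
        inside_grid = 0 <= px < size and 0 <= py < size
        if inside_circle and inside_grid:
            keys.add(z_value(int(px), int(py), bits))
    return sorted(keys)


def merge_ranges(codes, max_gap=1):
    """Merge sorted codes into (low, high) ranges, bridging gaps up to max_gap."""
    ranges = []
    for code in codes:
        if ranges and code - ranges[-1][1] <= max_gap:
            ranges[-1][1] = code
        else:
            ranges.append([code, code])
    return [(lo, hi) for lo, hi in ranges]


keys = sample_keys(cx=3.0, cy=3.0, radius=1.5)
for lo, hi in merge_ranges(keys):
    print(fmt(lo) if lo == hi else f"{fmt(lo)} - {fmt(hi)}")
```

Sampling avoids any exact geometry: a cell that the region barely touches may be missed, but it also contributes few results, so the trade-off is usually acceptable. A larger `max_gap` gives fewer, longer scans at the cost of reading some cells outside the region.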

Dithering

•  4 bit sine wave (listen for artifacts as volume decreases)

•  White dithering (artifacts gone, we hear through the noise)

•  Noise shaping (noise is easier to hear through)
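
The effect of white dithering can be reproduced numerically. The sketch below is an illustration under assumptions made here: the signal is a sine wave quieter than half a quantization step, the dither is uniform noise one step wide, and averaging is done over repeated quantizations of the same signal. Noise shaping is not covered.

```python
import math
import random


def quantize(value):
    """Round to the nearest integer level (halves round up)."""
    return math.floor(value + 0.5)


def averaged_output(signal, trials, dither, rng):
    """Quantize the signal `trials` times and average the results per sample."""
    totals = [0.0] * len(signal)
    for _ in range(trials):
        for i, value in enumerate(signal):
            noise = rng.uniform(-0.5, 0.5) if dither else 0.0
            totals[i] += quantize(value + noise)
    return [total / trials for total in totals]


def rms_error(a, b):
    """Root-mean-square difference between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


rng = random.Random(0)
# A quiet sine wave: amplitude 0.4 of one quantization step, four full periods.
signal = [0.4 * math.sin(2 * math.pi * t / 50) for t in range(200)]

plain = averaged_output(signal, trials=200, dither=False, rng=rng)
dithered = averaged_output(signal, trials=200, dither=True, rng=rng)

print("error without dither:", rms_error(plain, signal))
print("error with dither:   ", rms_error(dithered, signal))
```

Without dither every sample rounds to zero, so the signal is lost entirely. With dither each individual sample is noisier, but the average converges on the original value.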

[Figure: signal amplitude (−4 to 4) plotted against Time (0 to 6).]

The Shape of the Noise

[Figure: histogram of the noise; x-axis Noise (−0.4 to 0.4), y-axis Frequency (0 to 3000).]

The Effect After Averaging

[Figure: signal amplitude (−4 to 4) plotted against Time (0 to 6) after averaging.]

Thompson Sampling

Learning in the Real World
•  In the real world we get to pick our training examples
   –  Do we try this restaurant or not?
•  Learning has real and opportunity costs
•  Not learning has real and opportunity costs as well
•  Every sub-optimal choice we make incurs regret
   –  We would like to minimize this
   –  But we can’t quantify regret without incurring regret!

An Example
•  Pick one of five options
   –  Purple, blue, green, red, yellow
   –  Each has a random payoff
•  If you pick a bad option, regret = mean(best) - mean(yours)
•  The best known algorithm uses randomization
   –  Best = minimal regret + minimal code complexity

Demo – The Algorithm
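
For readers without the demo, here is a minimal sketch of Thompson sampling for the five-option example. It is not the demo code. It assumes each option pays 0 or 1 with a fixed hidden probability and keeps a Beta posterior per option; the payoff rates, round count, seed, and the function name `thompson_sampling` are all hypothetical.

```python
import random

# Hypothetical hidden payoff rates for the five options on the slide.
TRUE_RATES = {"purple": 0.1, "blue": 0.2, "green": 0.3, "red": 0.4, "yellow": 0.8}


def thompson_sampling(true_rates, rounds=2000, seed=0):
    """Run a Bernoulli bandit with Beta(wins + 1, losses + 1) posteriors.

    Returns the number of times each option was picked and the total regret,
    where regret per round is mean(best) - mean(chosen).
    """
    rng = random.Random(seed)
    names = list(true_rates)
    wins = {name: 0 for name in names}
    losses = {name: 0 for name in names}
    pulls = {name: 0 for name in names}
    best_rate = max(true_rates.values())
    regret = 0.0
    for _ in range(rounds):
        # Draw one plausible payoff rate per option from its posterior,
        # then act as if that draw were the truth.
        draws = {n: rng.betavariate(wins[n] + 1, losses[n] + 1) for n in names}
        choice = max(names, key=draws.get)
        if rng.random() < true_rates[choice]:
            wins[choice] += 1
        else:
            losses[choice] += 1
        pulls[choice] += 1
        regret += best_rate - true_rates[choice]
    return pulls, regret


pulls, regret = thompson_sampling(TRUE_RATES)
print(pulls)
print("total regret:", round(regret, 1))
```

The randomization does the exploring: an option with little data has a wide posterior and occasionally produces the highest draw, while a well-measured poor option almost never does.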

Synthetic Data

select IR.ENC_KEY ,IR.ENCOUNTER_ ,IR.ETYPE ,IR.bill_type ,IR.CONTR_ ,IR.SOURCE_CD
       ,IR.sub_source_cd ,IR.HP_CD ,IR.LOB_CD ,IR.FDO ,IR.TDOS ,IR.member_Nbr
       ,IR.HIC_NBR ,IR.MEMBER_SOURCE_CD ,IR.HDR_ERRCD ,IR.HDR_ERRDESC
       ,IR.PROVIDER_NBR ,IR.provider_type ,IR.PROVIDER_SOURCE_CD
       ,IR.cms_provider_type ,IR.SPEC_CD ,IR.SPEC_DESC ,IR.rev_cd ,IR.rev_cd_desc
       ,IR.proc_cd ,IR.diag_cd ,IR.DIAG_CD_KEY ,IR.DIAGNOSIS_KEY ,IR.rec_state_cd
       ,IR.rec_status_cd ,IR.DG_ERRCD ,IR.DG_ERRDESC
FROM (SELECT distinct enc.encounter_key as ENC_KEY,
       enc.encounter_nbr as ENCOUNTER_, typ.encounter_type_cd as ETYPE,
       bt.bill_type, cnt.contract_nbr as CONTR_,
       ds.SOURCE_CD, enc.sub_source_cd, enc.HP_CD, lob.LOB_CD,
       enc.new_min_dt as FDOS, substr(enc.new_max_dt, 1, 10) as TDOS,
       enc.member_Nbr, m.HIC_NBR, m.MEMBER_SOURCE_CD, eerr.error_cd as HDR_ERRCD,
       eerr.ERROR_DESC as HDR_ERRDESC, enc.PROVIDER_NBR, prv.provider_type,
       prv.PROVIDER_SOURCE_CD, diag.cms_provider_type,
       sp.specialty_cd as SPEC_CD, sp.specialty_desc as SPEC_DESC, svc.rev_cd,
       rev.rev_cd_desc, svc.proc_cd, dgcd.diag_cd, dgcd.DIAG_CD_KEY, diag.DIAGNOSIS_KEY,
       st.rec_state_cd, sts.rec_status_cd, derr.error_cd as DG_ERRCD,
       derr.error_desc as DG_ERRDESC
  FROM oicpcuhg.ir_encounter enc

Can You See the Problem?

INNER JOIN oicpcuhg.ir_encountertype typ
    ON (typ.encounter_type_key = enc.encounter_type_key)
LEFT OUTER JOIN oicpcuhg.ir_billtype bt
    ON (bt.bill_type_key = enc.bill_type_key)
LEFT OUTER JOIN oicpcuhg.ir_contract cnt
    ON (cnt.contract_key = enc.contract_key)
LEFT OUTER JOIN oicpcuhg.ir_datasource ds
    ON (ds.source_key = enc.data_source_key)
LEFT OUTER JOIN oicpcuhg.ir_lineofbusiness lob
    ON (lob.lob_key = enc.lob_key)
INNER JOIN oicpcuhg.ir_member m
    ON (m.hp_cd = enc.hp_cd
        AND m.member_source_cd = enc.member_source_cd
        AND m.member_nbr = enc.member_nbr)
LEFT OUTER JOIN oicpcuhg.ir_encountererror eerror
    ON (eerror.encounter_key = enc.encounter_key and
        eerror.active_flg = 'Y')
LEFT OUTER JOIN oicpcuhg.ir_error eerr
    ON (eerr.error_key = eerror.error_key)
LEFT OUTER JOIN oicpcuhg.ir_provider prv
    ON (prv.hp_cd = enc.hp_cd and
        prv.provider_source_cd = enc.provider_source_cd and
        prv.provider_nbr = enc.provider_nbr)

LEFT OUTER JOIN oicpcuhg.ir_encounterspecialty esp
    ON (esp.encounter_key = enc.encounter_key)
LEFT OUTER JOIN oicpcuhg.ir_specialty sp
    ON (sp.specialty_key = esp.specialty_key)
LEFT OUTER JOIN oicpcuhg.ir_service svc
    ON (svc.encounter_key = enc.encounter_key)
LEFT OUTER JOIN oicpcuhg.ir_revenue rev
    ON (rev.rev_cd = svc.rev_cd)
LEFT OUTER JOIN oicpcuhg.ir_diagnosis diag
    ON (diag.encounter_key = enc.encounter_key)
INNER JOIN oicpcuhg.ir_diagcd dgcd
    ON (dgcd.diag_cd_key = diag.diag_cd_key)
INNER JOIN oicpcuhg.ir_recordstate st
    ON (st.rec_state_key = diag.rec_state_key)
INNER JOIN oicpcuhg.ir_recordstatus sts
    ON (sts.rec_status_key = diag.rec_status_key)
LEFT OUTER JOIN oicpcuhg.ir_diagnosiserror derror
    ON (derror.diagnosis_key = diag.diagnosis_key and
        derror.active_flg = 'Y')
LEFT OUTER JOIN oicpcuhg.ir_error derr
    ON (derr.error_key = derror.error_key)) IR
INNER JOIN oicpcuhg.umr_req_inbound umr
    ON (trim(umr.member_nbr) = IR.member_Nbr AND
        trim(umr.hhc_from_ccyymmdd) = IR.TDOS AND
        trim(umr.sub_mcare_mbr) = IR.HIC_NBR AND
        trim(umr.diag1) = IR.diag_cd)

One Attack
•  The customer can’t give you the data
   –  They can’t trust you, by law
•  But they can probably summarize the data
   –  How many columns
   –  What types
   –  Perhaps statistical summaries
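
A summary of that kind is enough to drive a simple generator. The sketch below is a hypothetical illustration, not a tool described in the talk: the summary format, the column statistics, and the function names are invented here, and each column is drawn independently from a simple distribution.

```python
import random

# Hypothetical summary a customer might share: names, types, simple statistics.
SUMMARY = [
    {"name": "member_nbr", "type": "int", "min": 1, "max": 99999},
    {"name": "charge", "type": "float", "mean": 120.0, "stddev": 35.0},
    {"name": "active_flg", "type": "category", "values": {"Y": 0.9, "N": 0.1}},
]


def fake_value(column, rng):
    """Draw one value that matches the column's type and summary statistics."""
    kind = column["type"]
    if kind == "int":
        return rng.randint(column["min"], column["max"])
    if kind == "float":
        return rng.gauss(column["mean"], column["stddev"])
    if kind == "category":
        values = list(column["values"])
        weights = list(column["values"].values())
        return rng.choices(values, weights=weights)[0]
    raise ValueError(f"unknown column type: {kind}")


def fake_rows(summary, count, seed=0):
    """Generate `count` fake rows, one dict per row, keyed by column name."""
    rng = random.Random(seed)
    return [{col["name"]: fake_value(col, rng) for col in summary} for _ in range(count)]


rows = fake_rows(SUMMARY, 1000)
print(rows[0])
```

Independent columns are only a starting point. Reproducing a bug in a many-way join usually also depends on how keys repeat and match across tables, which per-column statistics do not capture.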

Bug Replication Without Security Violation

[Figure: diagram with a Customer side and a You side, showing real data, fake data, and the shared parameters x, y, α, ξ.]

The Upshot
•  So random numbers are useful

•  But simple distributions not so much

•  How can YOU generate cool data?

Last October: Time Series Databases by Ted Dunning and Ellen Friedman © Oct 2014 (published by O’Reilly)

Coming in February: Real World Hadoop by Ted Dunning and Ellen Friedman © Feb 2015 (published by O’Reilly)

Thank you for coming today!