Predictive Analytics on a Big Data Scale! Afshin Goodarzi [email protected] April, 2014

Rethinking classical approaches to analysis and predictive modeling

DESCRIPTION

Synopsis: The speaker will address the need to rethink classical approaches to analysis and predictive modeling. He will examine "iterative analytics" and extremely fine-grained segmentation down to a single customer, ultimately building one model per customer, or millions of predictive models, delivering on the promise of a "segment of one". The speaker will also address the speed at which all this has to work to maintain a competitive advantage for innovative businesses.

Speaker: Afshin Goodarzi, Chief Analyst, 1010data

A veteran of analytics, Goodarzi has led several teams in designing, building and delivering predictive analytics and business analytical products to a diverse set of industries. Prior to joining 1010data, Goodarzi was the Managing Director of Mortgage at Equifax, responsible for the creation of new data products and supporting analytics for the financial industry. Previously, he led the development of various classes of predictive models aimed at the mortgage industry during his tenure at Loan Performance (Core Logic). Earlier, he worked at BlackRock, the research center for NYNEX (present-day Verizon) and Norkom Technologies. Goodarzi's publications span the fields of data mining, data visualization, optimization and artificial intelligence.

Sponsors: 1010Data [ http://1010data.com ], Microsoft NERD [ http://microsoftnewengland.com ], Cognizeus [ http://cognizeus.com ]

Page 1

Predictive Analytics on a Big Data Scale!

Afshin Goodarzi [email protected]

April, 2014

Page 2

About 1010data

•  Founded in 2000
•  Based in NYC
•  Big Data analytics platform in the cloud
•  Library of pre-built analytical applications
•  Speed, power and flexibility second to none

Page 3

We Host/Analyze 14+ Trillion Rows of Data

All Quotes and Trades since 2003 on NYSE are done on 1010data

All mortgages ever issued are analyzed on 1010data

Nearly all real-estate transactions are completed on 1010data

Big Data - Granular Data - Time series Data  

All data for ~35,000 Retail outlets across the US are analyzed on 1010data

Page 4

A Typical BI Technology Stack

Diagram labels: Administrators; Data Sources; ETL; Inter-Enterprise Users; EDW; Data Cubes / Marts; Reporting / Visualization; Analysis / Modeling

Page 5

The Stack Has Fallen!

Page 6

The Analytics Continuum & A Single Version of the Truth

Page 7

Intuitive Access to Unlimited Amounts of Data

Diagram labels: Partner Data; 3rd Party Data; 1010data Cloud; Corporate Data

425,369,127,325 Rows!

Page 8

The code: Chart 1

<layout background_="white" border_="1" height_="525" name="candlestick_layout" relpos_="0,50" width_="650">
    <widget base_="nyse.trades.hist.all" class_="graphics" invmode_="hide" name="candlestick" relpos_="25,25" update_="manual" width_="600">
        <sel value="between(date;'{@startdate}';'{@enddate}')"/>
        <sel value="(symbol='{@symbol}')"/>
        <tabu label="Candle Stick" breaks="date">
            <break col="date" sort="up"/>
            <tcol source="prc" fun="wavg" name="vwap" weight="vol" label="VWAP"/>
            <tcol source="prc" fun="hi" name="high" label="High"/>
            <tcol source="prc" fun="lo" name="low" label="Low"/>
            <tcol source="prc" fun="first" name="open" label="Open"/>
            <tcol source="prc" fun="last" name="close" label="Close"/>
        </tabu>
        <graphspec>
            <chart type="candlestick" title="Candlestick Chart for {@symbol}">
                <axes xlabel="Date" ylabel="Trading Price"/>
            </chart>
        </graphspec>
    </widget>
    <widget class_="button" name="candlestick_refresh" relpos_="475,475" submit_="candlestick" text_="Refresh" type_="submit"/>
    <widget class_="field" label_="Choose Symbol:" name="symbol_input" relpos_="125,475" value_="@symbol"/>
</layout>

Callouts: Query; Chart Spec
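For readers unfamiliar with the macro language, the tabulation in the query part of the listing can be approximated in plain Python. This is only a sketch: the column names `date`, `prc` and `vol` come from the listing, while the function name `daily_candles`, the tuple layout, and the reading of `wavg`, `hi`, `lo`, `first` and `last` as volume-weighted mean, maximum, minimum, first and last value per date are assumptions, not 1010data's documented behavior.

```python
def daily_candles(trades):
    """Aggregate trades into one candlestick row per date.

    trades: iterable of (date, prc, vol) tuples, assumed to be in
    time order within each date (hypothetical input layout).
    """
    by_date = {}
    for date, prc, vol in trades:
        by_date.setdefault(date, []).append((prc, vol))

    candles = []
    for date in sorted(by_date):          # like <break col="date" sort="up"/>
        rows = by_date[date]
        prices = [p for p, _ in rows]
        total_vol = sum(v for _, v in rows)
        # Volume-weighted average price; None if no volume was traded.
        vwap = sum(p * v for p, v in rows) / total_vol if total_vol else None
        candles.append({
            "date": date,
            "vwap": vwap,
            "high": max(prices),
            "low": min(prices),
            "open": prices[0],
            "close": prices[-1],
        })
    return candles


if __name__ == "__main__":
    sample = [
        ("2014-04-01", 10.0, 100),
        ("2014-04-01", 12.0, 300),
        ("2014-04-01", 11.0, 100),
        ("2014-04-02", 9.0, 50),
    ]
    for row in daily_candles(sample):
        print(row)
```

The sample trades are made up for illustration; on the platform the same aggregation would run against the selected rows of `nyse.trades.hist.all`.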

Page 9

Predictive Analytics on a Big Data Scale!

Big Data mandated analytics and predictive modeling - an example: Larger data sets have mandated more rigorous sampling strategies, as traditional systems have not kept up with the computational needs of predictive analytic solutions on Big Data.

•  Can we use all but a small holdout set in predictive modeling?
•  What are the challenges?
•  What is an approach that works?
•  Are the results any good?
•  Is this solution only applicable to one industry?
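The first question, training on everything except a small holdout, can be illustrated with a short sketch. It assumes rows are identified by unique, hashable IDs; the function name `split_holdout`, the 1% default and the fixed seed are illustrative choices, not details from the talk.

```python
import random


def split_holdout(row_ids, holdout_fraction=0.01, seed=0):
    """Return (train_ids, holdout_ids), keeping only a small holdout.

    The seed makes the split reproducible, so every modeling iteration
    is scored against the same held-out rows.
    """
    ids = list(row_ids)
    k = max(1, round(len(ids) * holdout_fraction))
    holdout = set(random.Random(seed).sample(ids, k))
    train = [i for i in ids if i not in holdout]
    return train, sorted(holdout)


if __name__ == "__main__":
    train, holdout = split_holdout(range(10_000))
    print(len(train), len(holdout))
```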

Page 10

Common Predictive Modeling Approach

CPU intensive & error prone steps:

»  Data selection
»  IV (independent variable) to DV (dependent variable) relationship
»  Transformations
»  Sampling and validation
»  Model estimation
»  Model testing
»  Repeat

Diagram callouts: CPU; Error Prone

http://onlinepubs.trb.org/onlinepubs/nchrp/cd-22/v2chapter5.html
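The loop listed above (transform, sample, estimate, test, repeat) can be sketched on a toy problem. This is a simplified illustration, not the speaker's workflow: it assumes a single independent variable, a one-variable least-squares line as the model, every fifth row as the holdout, and hypothetical function names throughout.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y ~ slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("independent variable is constant")
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def mse(model, xs, ys):
    """Mean squared error of a fitted line on the given rows."""
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def select_transformation(xs, ys, transforms, holdout_every=5):
    """Try each candidate IV transformation and score it on the holdout."""
    rows = list(zip(xs, ys))
    train = [r for i, r in enumerate(rows) if i % holdout_every != 0]
    test = [r for i, r in enumerate(rows) if i % holdout_every == 0]
    results = {}
    for name, f in transforms.items():            # "Repeat"
        model = fit_line([f(x) for x, _ in train],  # "Model estimation"
                         [y for _, y in train])
        results[name] = mse(model,                  # "Model testing"
                            [f(x) for x, _ in test],
                            [y for _, y in test])
    best = min(results, key=results.get)
    return best, results


if __name__ == "__main__":
    xs = list(range(1, 21))
    ys = [3 * x * x + 2 for x in xs]   # synthetic data for illustration
    print(select_transformation(xs, ys, {"identity": lambda x: x,
                                         "square": lambda x: x * x}))
```

Even in this toy form, every extra candidate transformation costs another pass over the training rows, which is where the CPU cost grows with data size.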

Page 11

"One Segment" => "A Segment of One"

"Any customer can have a car painted any color that he wants so long as it is black." re: the Model-T in 1909 (from My Life and Work, Henry Ford, 1922, Chap. 4, p. 71)

Page 12

Harry Truman displays a copy of the Chicago Daily Tribune newspaper that erroneously reported the election of Thomas Dewey in 1948. Truman's narrow victory embarrassed pollsters, members of his own party, and the press who had predicted a Dewey landslide.

Page 13

Build A 30 Day Shopping List For Each Loyal Shopper at a Retail Chain

Shopper    SKU     Probability of purchase in the next 30 days
A. Smith   12345   90%
A. Smith   23567   85%
A. Smith   ….
A. Smith   87996   30%

Data sources: POS; Loyalty; Econ (House prices, Mortgage Rates, BLS - Unemployment); Inventory

With Permission from A&P
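To make the shape of the output table concrete, here is a deliberately naive baseline that ranks SKUs per shopper by how often they were bought in past 30-day windows. It is not the model described in the talk, which also draws on loyalty, economic and inventory data; the function name, the input layout and the shopper "B. Jones" are hypothetical, and only the SKU numbers echo the slide.

```python
from collections import defaultdict


def shopping_lists(purchases, n_windows, window_days=30):
    """Rank SKUs per shopper by empirical repurchase frequency.

    purchases: iterable of (shopper, sku, day), with day an integer
    offset from the start of the history (0 <= day < n_windows * window_days).
    Returns {shopper: [(sku, share_of_windows_with_a_purchase), ...]}.
    """
    windows_seen = defaultdict(set)
    for shopper, sku, day in purchases:
        windows_seen[(shopper, sku)].add(day // window_days)

    lists = defaultdict(list)
    for (shopper, sku), windows in windows_seen.items():
        lists[shopper].append((sku, len(windows) / n_windows))

    # Highest estimated probability first; SKU breaks ties deterministically.
    for shopper in lists:
        lists[shopper].sort(key=lambda item: (-item[1], item[0]))
    return dict(lists)


if __name__ == "__main__":
    history = [
        ("A. Smith", "12345", 0),
        ("A. Smith", "12345", 35),
        ("A. Smith", "12345", 61),
        ("A. Smith", "87996", 10),
        ("B. Jones", "12345", 5),
    ]
    for shopper, items in shopping_lists(history, n_windows=3).items():
        print(shopper, items)
```

Each shopper gets a separate list built only from that shopper's history, which is the "one model per customer" idea in its simplest possible form.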

Page 14

If The Shopper Bought "It" Before Will They Buy "It" Again?

•  Classical modeling: variables as either positively or negatively correlated with target
•  Shoppers don't behave the same!
•  The demographic attributes have distributions for each variable!

Page 15

Subscribers are "A Segment Of One"!
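A small numeric illustration of why a single pooled model can mislead when individuals behave differently: in the made-up data below, one shopper buys more as a variable rises, another buys less, and the pooled correlation comes out as zero. The shopper labels, the variable and the numbers are invented for this sketch.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical data: x is a promotion level, y is units bought.
shoppers = {
    "A": ([1, 2, 3], [1, 2, 3]),   # buys more as x rises
    "B": ([1, 2, 3], [3, 2, 1]),   # buys less as x rises
}

per_shopper = {name: pearson(xs, ys) for name, (xs, ys) in shoppers.items()}
pooled_x = [x for xs, _ in shoppers.values() for x in xs]
pooled_y = [y for _, ys in shoppers.values() for y in ys]
pooled = pearson(pooled_x, pooled_y)

if __name__ == "__main__":
    print(per_shopper, pooled)
```

A classical model that assigns the variable one sign for the whole population would miss both shoppers here, while a model per shopper captures each.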

Page 16

All sources of Prepay as analyzed in 1989

Diagram labels: D; R; M; Interest Rates; House prices; Unemployment; Loan Age; Cost of option; Regional economy I

Image sources:
http://www.freeusandworldmaps.com/html/US_Counties/US_Counties.html
http://www.tradingeconomics.com/united-states/unemployment-rate
http://www.wfa.gov/
http://www.richmondfed.org/banking/markets_trends_and_statistics/trends/pdf/delinquency_and_foreclosure_rates.pdf

Page 17

Quality Measures: Lift => AUC
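The slide names the two measures without defining them, so the following sketch uses their standard definitions: lift is the response rate among the top-scored fraction divided by the overall response rate, and AUC is the probability that a randomly chosen positive outscores a randomly chosen negative (ties count half). Function names and the sample scores are illustrative, and the pairwise AUC loop is quadratic, so it suits small examples only.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def lift_at(scores, labels, fraction=0.1):
    """Response rate in the top-scored fraction relative to the base rate."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    k = max(1, round(len(ranked) * fraction))
    top_rate = sum(l for _, l in ranked[:k]) / k
    base_rate = sum(labels) / len(labels)
    return top_rate / base_rate


if __name__ == "__main__":
    scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
    labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
    print(auc(scores, labels), lift_at(scores, labels, fraction=0.2))
```

Lift depends on the cutoff chosen, whereas AUC summarizes ranking quality across all cutoffs, which is one plausible reading of the arrow in the slide title.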

Page 18

Fine vs. Coarse: Cash flows

Page 19

InQuery analytics – User Defined Group Functions

•  User defined
    −  KNN
    −  Naïve Bayes
    −  ARCH/AR
    −  PCA
    −  Kernel
    −  Decision Tree
    −  Logistics trees
    −  FFT
    −  Etc.
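The general idea of a user-defined group function, running an arbitrary routine once per group of rows, can be sketched as follows. This is not the 1010data API: `group_apply` and `ar1_coefficient` are hypothetical names, and the AR(1) estimate (least squares without an intercept) stands in for any of the methods listed above.

```python
from collections import defaultdict


def group_apply(rows, key, fn):
    """Split rows into groups by key(row) and apply fn to each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    return {k: fn(g) for k, g in groups.items()}


def ar1_coefficient(values):
    """Least-squares AR(1) slope without intercept: x[t] ~ phi * x[t-1]."""
    num = sum(cur * prev for cur, prev in zip(values[1:], values[:-1]))
    den = sum(prev * prev for prev in values[:-1])
    return num / den if den else 0.0


if __name__ == "__main__":
    # (series_id, value) rows in time order; made-up numbers.
    rows = [("s1", 1.0), ("s1", 0.5), ("s1", 0.25),
            ("s2", 1.0), ("s2", -1.0), ("s2", 1.0)]
    print(group_apply(rows,
                      key=lambda r: r[0],
                      fn=lambda g: ar1_coefficient([v for _, v in g])))
```

With one group per customer, the same pattern yields one fitted model per customer.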

Page 20

Questions?