DESCRIPTION
Synopsis: The speaker will address the need to rethink classical approaches to analysis and predictive modeling. He will examine "iterative analytics" and extremely fine grained segmentation down to a single customer -- ultimately building one model per customer or millions of predictive models delivering on the promise of "segment of one". The speaker will also address the speed at which all this has to work to maintain a competitive advantage for innovative businesses.

Speaker: Afshin Goodarzi, Chief Analyst, 1010data

A veteran of analytics, Goodarzi has led several teams in designing, building and delivering predictive analytics and business analytical products to a diverse set of industries. Prior to joining 1010data, Goodarzi was the Managing Director of Mortgage at Equifax, responsible for the creation of new data products and supporting analytics to the financial industry. Previously, he led the development of various classes of predictive models aimed at the mortgage industry during his tenure at Loan Performance (Core Logic). Earlier on he had worked at BlackRock, the research center for NYNEX (present day Verizon) and Norkom Technologies. Goodarzi's publications span the fields of data mining, data visualization, optimization and artificial intelligence.

Sponsor:
1010Data [ http://1010data.com ]
Microsoft NERD [ http://microsoftnewengland.com ]
Cognizeus [ http://cognizeus.com ]
2
About 1010data
• Founded in 2000
• Based in NYC
• Big Data analytics platform in the cloud
• Library of pre-built analytical applications
• Speed, power and flexibility second to none
3
We Host/Analyze 14+ Trillion Rows of Data
All Quotes and Trades since 2003 on NYSE are done on 1010data
All mortgages ever issued are analyzed on 1010data
Nearly all real-estate transactions are completed on 1010data
Big Data - Granular Data - Time series Data
All data for ~35,000 Retail outlets across the US are analyzed on 1010data
4
A Typical BI Technology Stack
Administrators
Data Sources
ETL
Inter-Enterprise Users
EDW
Data Cubes/ Marts
Reporting / Visualization
Analysis / Modeling
5
The Stack Has Fallen!
6
The Analytics Continuum & A Single Version of the Truth
7
Intuitive Access to Unlimited Amounts of Data
Partner Data
3rd Party Data
1010data Cloud
Corporate Data
425,369,127,325 Rows!
8
The code: Chart 1
<layout background_="white" border_="1" height_="525" name="candlestick_layout" relpos_="0,50" width_="650">
  <widget base_="nyse.trades.hist.all" class_="graphics" invmode_="hide" name="candlestick" relpos_="25,25" update_="manual" width_="600">
    <sel value="between(date;'{@startdate}';'{@enddate}')"/>
    <sel value="(symbol='{@symbol}')"/>
    <tabu label="Candle Stick" breaks="date">
      <break col="date" sort="up"/>
      <tcol source="prc" fun="wavg" name="vwap" weight="vol" label="VWAP"/>
      <tcol source="prc" fun="hi" name="high" label="High"/>
      <tcol source="prc" fun="lo" name="low" label="Low"/>
      <tcol source="prc" fun="first" name="open" label="Open"/>
      <tcol source="prc" fun="last" name="close" label="Close"/>
    </tabu>
    <graphspec>
      <chart type="candlestick" title="Candlestick Chart for {@symbol}">
        <axes xlabel="Date" ylabel="Trading Price"/>
      </chart>
    </graphspec>
  </widget>
  <widget class_="button" name="candlestick_refresh" relpos_="475,475" submit_="candlestick" text_="Refresh" type_="submit"/>
  <widget class_="field" label_="Choose Symbol:" name="symbol_input" relpos_="125,475" value_="@symbol"/>
</layout>
Query Chart Spec
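To make the query half of the chart spec concrete, here is a minimal Python sketch of what the tabulation appears to compute per trading date: a volume-weighted average price (wavg of prc weighted by vol), plus high, low, first (open) and last (close). It is not 1010data code; the function name candlestick_rows and the sample trades are hypothetical, and it assumes the trades arrive in time order within each date.

```python
from collections import defaultdict


def candlestick_rows(trades):
    """Aggregate (date, price, volume) trades into one bar per date.

    Assumes trades are listed in time order, so the first and last
    price seen for a date stand in for its open and close.
    """
    by_date = defaultdict(list)
    for date, price, volume in trades:
        by_date[date].append((price, volume))

    bars = []
    for date in sorted(by_date):  # ascending dates, like sort="up"
        day = by_date[date]
        prices = [p for p, _ in day]
        total_volume = sum(v for _, v in day)
        bars.append({
            "date": date,
            "vwap": sum(p * v for p, v in day) / total_volume,
            "high": max(prices),
            "low": min(prices),
            "open": prices[0],
            "close": prices[-1],
        })
    return bars


# Hypothetical sample trades, not data from the talk.
sample_trades = [
    ("2013-01-02", 10.0, 100),
    ("2013-01-02", 12.0, 300),
    ("2013-01-02", 11.0, 100),
    ("2013-01-03", 11.5, 200),
]
bars = candlestick_rows(sample_trades)
```

Each dictionary in bars corresponds to one candlestick on the chart; the symbol and date-range selections in the spec would be applied before this aggregation.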
9
Predictive Analytics on a Big Data Scale!
Big Data mandated Analytics and predictive modeling - an example:
The larger data sets have mandated more rigorous sampling strategies as traditional systems have not kept up with the computational needs of predictive analytic solutions on Big Data.
• Can we use all but a small holdout set in predictive modeling?
• What are the challenges?
• What is an approach that works?
• Are the results any good?
• Is this solution only applicable to one industry?
10
Common Predictive Modeling Approach
• CPU intensive & error prone steps:
  » Data selection
  » IV to DV relationship
  » Transformations
  » Sampling and validation
  » Model estimation
  » Model testing
  » Repeat
http://onlinepubs.trb.org/onlinepubs/nchrp/cd-22/v2chapter5.html
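The sampling, estimation and testing steps of this loop can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data is synthetic, the model is a one-variable least-squares line standing in for whatever model the analyst would actually estimate, and the helper names (split_holdout, fit_line, mean_squared_error) are hypothetical. It trains on all but a small holdout set and then tests on the holdout.

```python
import random


def split_holdout(rows, holdout_fraction=0.1, seed=0):
    """Shuffle a copy of the rows and set aside a small holdout set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_holdout = max(1, int(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]


def fit_line(rows):
    """Model estimation: closed-form least squares for y = slope * x + intercept."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_squared_error(model, rows):
    """Model testing: average squared prediction error on the given rows."""
    slope, intercept = model
    return sum((y - (slope * x + intercept)) ** 2 for x, y in rows) / len(rows)


# Synthetic data (an assumption for illustration): y = 2x + 1 plus small noise.
rng = random.Random(42)
data = [(float(x), 2.0 * x + 1.0 + rng.gauss(0, 0.1)) for x in range(200)]

train, holdout = split_holdout(data, holdout_fraction=0.1)
model = fit_line(train)
holdout_mse = mean_squared_error(model, holdout)
```

The "Repeat" step corresponds to rerunning this with different variables, transformations or samples, which is where the CPU cost accumulates on large data.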
11
“One Segment” => “A Segment of One”
“Any customer can have a car painted any color that he wants so long as it is black.” re: the Model-T in 1909 (from My Life and Work, Henry Ford, 1922, Chap. 4, p. 71)
12
Harry Truman displays a copy of the Chicago Daily Tribune newspaper that erroneously reported the election of Thomas Dewey in 1948. Truman’s narrow victory embarrassed pollsters, members of his own party, and the press who had predicted a Dewey landslide.
13
Build A 30 Day Shopping List For Each Loyal Shopper at a Retail Chain
Shopper     SKU      Probability of purchase in the next 30 days
A. Smith    12345    90%
A. Smith    23567    85%
A. Smith    ….
A. Smith    87996    30%
POS
Loyalty
Econ: House prices, Mortgage Rates, BLS - Unemployment
Inventory
With Permission from A&P
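As a rough illustration of the per-shopper shopping list, the Python sketch below scores each (shopper, SKU) pair separately from that shopper's own purchase history. It is not the model from the talk: it assumes a simple frequency estimate (the share of past 30-day periods in which the shopper bought the SKU, with add-one smoothing), uses only POS-style purchase records, and the names shopping_list and sample_purchases are hypothetical.

```python
from collections import defaultdict


def shopping_list(purchases, n_periods):
    """Build a ranked list of (sku, probability) for each shopper.

    purchases: iterable of (shopper, sku, period_index), where each
    period is one past 30-day window numbered 0 .. n_periods - 1.
    """
    periods_bought = defaultdict(set)
    for shopper, sku, period in purchases:
        periods_bought[(shopper, sku)].add(period)

    lists = defaultdict(list)
    for (shopper, sku), periods in periods_bought.items():
        # Add-one (Laplace) smoothing keeps estimates away from 0% and 100%.
        probability = (len(periods) + 1) / (n_periods + 2)
        lists[shopper].append((sku, probability))

    for shopper in lists:
        lists[shopper].sort(key=lambda item: item[1], reverse=True)
    return dict(lists)


# Hypothetical purchase history over four past 30-day periods.
sample_purchases = [
    ("A. Smith", "12345", 0), ("A. Smith", "12345", 1),
    ("A. Smith", "12345", 2), ("A. Smith", "12345", 3),
    ("A. Smith", "87996", 1),
    ("B. Jones", "12345", 3),
]
lists = shopping_list(sample_purchases, n_periods=4)
```

Here the same SKU gets a different probability for each shopper, which is the point of treating every shopper as a segment of one; a fuller model would also bring in the loyalty, economic and inventory data shown on the slide.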
14
If The Shopper Bought “It” Before Will They Buy “It” Again?
• Classical modeling treats variables as either positively or negatively correlated with the target
• Shoppers don’t behave the same!
• The demographic attributes have distributions for each variable!
15
Subscribers are “A Segment Of One”!
16
All sources of Prepay as analyzed in 1989
Interest Rates
House prices
Unemployment
Loan Age
Cost of option
Regional economy
http://www.freeusandworldmaps.com/html/US_Counties/US_Counties.html
http://www.tradingeconomics.com/united-states/unemployment-rate
http://www.wfa.gov/
http://www.richmondfed.org/banking/markets_trends_and_statistics/trends/pdf/delinquency_and_foreclosure_rates.pdf
17
Quality Measures: Lift => AUC
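For readers unfamiliar with the two measures named on this slide, here is a small Python sketch of how each can be computed from predicted scores and actual outcomes. The definitions are the standard ones (lift as the response rate in the top-scored fraction divided by the overall rate; AUC as the chance that a random positive outscores a random negative); the sample labels and scores are made up for illustration.

```python
def lift_at(labels, scores, fraction):
    """Response rate among the top-scored fraction, relative to the base rate."""
    ranked = sorted(zip(scores, labels), reverse=True)
    k = max(1, int(len(ranked) * fraction))
    top_rate = sum(label for _, label in ranked[:k]) / k
    base_rate = sum(labels) / len(labels)
    return top_rate / base_rate


def auc(labels, scores):
    """Probability that a random positive scores above a random negative.

    Ties count as half a win. This pairwise form is quadratic in the
    number of rows, so it is only meant for small illustrations.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up outcomes (1 = event happened) and model scores.
labels = [1, 1, 0, 1, 0, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

top_20_lift = lift_at(labels, scores, 0.2)
model_auc = auc(labels, scores)
```

Lift describes one cut-off at a time, while AUC summarizes ranking quality over all cut-offs, which may be why the slide moves from one to the other.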
18
Fine vs. Coarse: Cash flows
19
InQuery analytics – User Defined Group Functions
• User defined
  − KNN
  − Naïve Bayes
  − ARCH/AR
  − PCA
  − Kernel
  − Decision Tree
  − Logistics trees
  − FFT
  − Etc.
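The idea of a user defined group function can be sketched generically in Python: rows are grouped by a key (for example one customer or one loan) and a user-supplied function is run once per group. This is not the 1010data API; group_apply and ar1_coefficient are hypothetical names, and the per-group function shown is a zero-intercept AR(1) fit chosen only because AR appears in the list above.

```python
from collections import defaultdict


def group_apply(rows, func):
    """Run func on the ordered values of each group.

    rows: iterable of (group_key, value), in time order within a group.
    Returns {group_key: func(values)}.
    """
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    return {key: func(values) for key, values in groups.items()}


def ar1_coefficient(values):
    """Least-squares estimate of phi in x[t] = phi * x[t-1] (no intercept)."""
    previous, current = values[:-1], values[1:]
    numerator = sum(c * p for c, p in zip(current, previous))
    denominator = sum(p * p for p in previous)
    return numerator / denominator


# Hypothetical per-customer series: one doubles each period, one halves.
rows = [
    ("cust_a", 1.0), ("cust_a", 2.0), ("cust_a", 4.0), ("cust_a", 8.0),
    ("cust_b", 8.0), ("cust_b", 4.0), ("cust_b", 2.0), ("cust_b", 1.0),
]
coefficients = group_apply(rows, ar1_coefficient)
```

Running one small fit per group like this is the pattern behind building one model per customer; any of the other listed techniques could be swapped in as the per-group function.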
20
Questions?