Discontinuities Demo Shrayes Ramesh, PhD. Data Tactics Corporation Thursday, November 14, 13


Page 1: Discontinuities Demo

Discontinuities Demo

Shrayes Ramesh, PhD.
Data Tactics Corporation

Page 2: Discontinuities Demo

Challenges

• how do you decide which variables capture what happened?

• /when/ did an event happen?

• what's the effect of the event on the variables?

• Can we construct a UI and algorithm to tackle all three problems simultaneously?


Page 3: Discontinuities Demo

Goals

• The goal is to feed in raw data as the sole input, and obtain answers to all three questions:

• (1) when did an event likely occur

• (2) what variables can we use to measure the event

• (3) what was the effect of the event on those variables


Page 4: Discontinuities Demo

Simple example

[Figure: variable outcome (vertical axis) against time, with the event and its effect marked]


Page 5: Discontinuities Demo

Scaled up


Page 6: Discontinuities Demo

With limited insight…

• if we know timing and the important variables, we can measure the effect of the shock on the variables (standard regression techniques).

• if we know the set of important variables and track variables over time, we can identify timing of shocks.

• if we know timing and have a long history of variable evolution, we can cluster variables by their behavior at the important point in time (relative to other points in time)


Page 7: Discontinuities Demo

Methodology

[Figure: estimated effect (vertical axis), with the correct timing and the correct effect marked]


Page 8: Discontinuities Demo

Methodology

• For every time T and variable K, run an OLS regression under the hypothesis that a shock occurred at time T to variable K

• The sample is restricted to observations in a neighborhood around T, i.e. t in [T-bandwidth, T+bandwidth]:

Y(K,t) = A(K,T) + B(K,T)S(t) + e(K,t)

with S(t) = 1(t>T), an indicator with T as the time to test

• Results are stored as the matrix of coefficients B(K,T)

• OLS estimates of B(K,T) are biased towards zero to the extent that S(t) is misspecified.

• In other words, B(K,T) will be maximally different from zero (and unbiased) at the true break T
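The regression loop above can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not the code behind the demo: the function name `break_coefficients`, the (time x variable) input layout, and the choice to leave B(K,T) as NaN where the window has no observations on one side of T are all hypothetical.

```python
import numpy as np

def break_coefficients(Y, bandwidth):
    """Estimate B(K,T) for every variable K and candidate break time T.

    Y has shape (n_times, n_vars); transpose a KxT matrix before calling.
    Returns B with shape (n_vars, n_times). Entries stay NaN where S(t)
    has no variation inside the window (assumed convention).
    """
    n_times, n_vars = Y.shape
    B = np.full((n_vars, n_times), np.nan)
    for T in range(n_times):
        # Neighborhood [T - bandwidth, T + bandwidth], clipped to the data.
        lo = max(0, T - bandwidth)
        hi = min(n_times, T + bandwidth + 1)
        t = np.arange(lo, hi)
        S = (t > T).astype(float)          # S(t) = 1(t > T)
        if S.min() == S.max():
            continue                       # slope is not identified
        X = np.column_stack([np.ones_like(S), S])
        # One least-squares solve fits all variables at once:
        # row 0 holds A(K,T), row 1 holds B(K,T).
        coef, *_ = np.linalg.lstsq(X, Y[lo:hi], rcond=None)
        B[:, T] = coef[1]
    return B
```

With an intercept and a single 0/1 regressor, the OLS slope equals the difference between the mean after T and the mean before T within the window, which is why the estimate shrinks when T is placed away from the true break.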


Page 9: Discontinuities Demo

Methodology


Page 10: Discontinuities Demo

Answers

• when did an event likely occur?
– aggregate (sum) effects across all variables

• what variables can we use to measure the event?
– which variables had the largest effect at that time point?

• what was the effect of the event on those variables?
– we just measured that

• what variables move together often across time?
– show similar variables
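The first two answers can be read off a coefficient matrix B(K,T) as in the sketch below. The function names are hypothetical, and summing absolute values is an assumption: the slide says only "aggregate (sum) effects", and absolute values are used here so that positive and negative shocks do not cancel.

```python
import numpy as np

def likely_event_time(B):
    """Return (T_hat, score) for B of shape (n_vars, n_times).

    score[T] is the sum over variables of |B(K,T)|, ignoring NaN entries.
    """
    score = np.nansum(np.abs(B), axis=0)
    return int(np.argmax(score)), score

def top_variables(B, T, n=5):
    """Return the n variables with the largest |B(K,T)| at time T,
    as (variable index, estimated effect) pairs."""
    effects = B[:, T]
    # NaN effects are ranked as zero so they sort to the end.
    order = np.argsort(-np.abs(np.nan_to_num(effects)))
    return [(int(k), float(effects[k])) for k in order[:n]]
```

The effect of the event on each selected variable is then simply the stored coefficient B(K,T) at the chosen time.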


Page 11: Discontinuities Demo

Example 1: Super Bowl tweets

• Twitter streaming API (every tweet)

• Sample of data selected from Sunday, February 3, 1600-2210 hours

• Binned into minute-by-minute word counts

• Out of 651k 1-grams, kept 1035 least sparse (> 30% sparse) words.

• Input data is a 371x1035 matrix
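The binning and filtering steps above might look roughly like the following sketch. Everything here is an assumption for illustration: tweets are taken to be (timestamp, text) pairs, tokenization is a plain lowercase whitespace split, and "sparsity" is read as the share of minutes in which a word has a zero count, which is only one possible reading of the slide.

```python
from collections import Counter
from datetime import datetime
import numpy as np

def minute_word_counts(tweets, start, n_minutes):
    """Bin (datetime, text) pairs into a (n_minutes, n_words) count matrix."""
    per_minute = [Counter() for _ in range(n_minutes)]
    for ts, text in tweets:
        minute = int((ts - start).total_seconds() // 60)
        if 0 <= minute < n_minutes:
            per_minute[minute].update(text.lower().split())
    vocab = sorted(set().union(*per_minute))
    index = {w: j for j, w in enumerate(vocab)}
    counts = np.zeros((n_minutes, len(vocab)), dtype=int)
    for i, counter in enumerate(per_minute):
        for word, n in counter.items():
            counts[i, index[word]] = n
    return counts, vocab

def keep_least_sparse(counts, vocab, n_keep):
    """Keep the n_keep words with the smallest share of zero-count minutes."""
    sparsity = (counts == 0).mean(axis=0)
    keep = np.sort(np.argsort(sparsity, kind="stable")[:n_keep])
    return counts[:, keep], [vocab[j] for j in keep]
```

Applied to the sample described above, `n_minutes` would be 371 and `n_keep` would be 1035, giving a matrix of the stated shape.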


Page 12: Discontinuities Demo

SUPERBOWL SHINY


Page 13: Discontinuities Demo

Network graph of variables with correlations > .95

Power outage

Halftime show
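A graph like this one can be built from an edge list of variable pairs whose correlation exceeds the threshold. The sketch below is a guess at the construction, not the demo's code: the slides do not say whether correlations are taken over the raw series or over the rows of B(K,T), so the function accepts any (variables x time) matrix, and its name is hypothetical.

```python
import numpy as np

def correlation_edges(M, threshold=0.95):
    """Return (i, j, r) for every pair i < j of rows of M whose Pearson
    correlation r exceeds threshold. M has shape (n_vars, n_times).

    Pairs with undefined correlation (NaN, e.g. a constant row) are dropped
    because NaN > threshold is False.
    """
    R = np.corrcoef(M)
    rows, cols = np.triu_indices_from(R, k=1)
    return [(int(i), int(j), float(R[i, j]))
            for i, j in zip(rows, cols) if R[i, j] > threshold]
```

Connected groups in the resulting graph correspond to sets of variables that move together, such as the clusters labeled in the figure.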


Page 14: Discontinuities Demo

Deployable and Repeatable

• The model only requires data to be transformed to a KxT matrix.
– K variables
– T time periods

We could use this model on many other data sets!

• minute-by-minute word count in Twitter

• stock prices

• chatter on social media forums


Page 15: Discontinuities Demo

Network graph of forums with correlations > .27

Hezbollah


Page 16: Discontinuities Demo

Future improvements

• OLS is simple and efficient, but other models may be more accurate at estimating effects in some cases

• exploring different approaches to choosing which variables to consider and to aggregating variable effects.

• massively parallel on all 630k words simultaneously?

• real-time analytics on streaming data
