When is Data Science a House of Cards?
Replication in Data Science
Dr June Andrews, Frances Haugen
March 2017
Agenda
1. Explore Pinterest’s content
2. Pinterest Replication Study
3. Inspire the future
Clothing Cooking Decorating Beauty Teaching Carpentry Cars Animated GIFs Electronics
Stereos Fashion Sewing Articles Painting Photography Nature Cute cats Tattoos Hair
Microscopy TV shows Apps Self help Motorcycles
General Methodology
1. Refine Goals with Stakeholders
2. ETL data
3. Analyze data (Some iteration involved)
4. Draw Conclusions
5. Share Conclusions & Support Stakeholders
Want Increased Visibility
Reproducibility in Data Science
Same Data + Same Code = Same Results
Jason Chin Writing a Genome Assembler with IPython: http://nbviewer.jupyter.org/github/cschin/Write_A_Genome_Assembler_With_IPython/blob/master/Write_An_Assembler.ipynb
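As a small illustration of "Same Data + Same Code = Same Results", the sketch below seeds a local random number generator so that a sampling step returns the same sample on every run. The function name `sample_domains`, the seed value, and the toy domain names are hypothetical and not from the talk.

```python
import random

def sample_domains(domains, k, seed=2017):
    """Draw k domains reproducibly: same input + same seed -> same sample."""
    rng = random.Random(seed)  # local generator, unaffected by global random state
    return rng.sample(sorted(domains), k)  # sort first so input order does not matter

domains = ["a.com", "b.com", "c.com", "d.com", "e.com"]
first = sample_domains(domains, 3)
second = sample_domains(list(reversed(domains)), 3)
print(first == second)  # True
```

Sorting before sampling is a design choice: it removes a hidden dependency on the order in which the data happened to arrive.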
Replication in Data Science
Same Goal + Same Data = Same Conclusions
Treat with Favipiravir
Replication Crisis in Psychology
Nature August 2015
Monya Baker - Over half of psychology studies fail reproducibility test
Crowdsourced study on red cards in soccer
Nature October 2015
Silberzahn & Uhlmann - Crowdsourced research: Many hands make tight work
For a sample set of link domains we’re interested in:
• All Pin creates in their first year on Pinterest
• All repins in their first year on Pinterest
• 100k link domains sampled total
Links are behind every Pin
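The sampling window described above can be sketched as a filter that keeps, for each link domain, only the events from its first year. The `(domain, date, kind)` event schema and the 365-day definition of "first year" are assumptions made for illustration; the talk does not describe the actual ETL.

```python
from datetime import date, timedelta

def first_year_events(events):
    """Keep, per link domain, only events within 365 days of its first event.

    `events` is an iterable of (domain, event_date, kind) tuples, where kind
    is "create" or "repin". This schema is hypothetical.
    """
    events = list(events)
    first_seen = {}
    for domain, day, _kind in events:
        if domain not in first_seen or day < first_seen[domain]:
            first_seen[domain] = day
    window = timedelta(days=365)
    return [e for e in events if e[1] - first_seen[e[0]] < window]

events = [
    ("example.com", date(2015, 1, 10), "create"),
    ("example.com", date(2015, 6, 1), "repin"),
    ("example.com", date(2016, 3, 1), "repin"),  # more than a year after the first Pin
]
kept = first_year_events(events)
```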
Current cluster analysis
1. ETL data into clustering algorithm
2. Build cluster visualizations
3. Tune parameters
4. Add human labels to each cluster
5. Share human interpretation of clusters
Expensive
Tool | Pros | Cons
Cluster algorithms (SVM, K-Means, Spectral) | Considers all users; Accurate | Tough to communicate; Definitions change over time
User experience studies | Deep knowledge; Captures the immeasurable | Costly; Considers few users
Domain expert hypothesis | Human interpretable | Inaccurate
Human in the loop computing
Community membership identification from small seed sets (Kloumann & Kleinberg)
Kloumann & Kleinberg - Community Membership Identification from Small Seed Sets - KDD
[Diagram: a Domain Expert and a Favorite Clustering Algorithm exchange labels; the algorithm marks items as Confident or Unsure]
• When machine confidence dips, engage with domain expert
• Iterate through problem space
• Terminate when Domain Expert determines labeling is done ("That’s all!")
• Stage 2: Domain expert creates 1 human interpretable cluster
• Stage 3: Remove human labeled clusters and iterate
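The loop described above can be sketched roughly as follows. This is not the Kloumann & Kleinberg method, and it is not the notebook used in the talk: it swaps in a simple one-dimensional distance-to-centroid score so the control flow is easy to follow. The function name, the thresholds, and the `ask_expert` callback are all hypothetical. The sketch stops when a pass adds nothing new, whereas in the talk the domain expert decides when labeling is done.

```python
def label_cluster(points, seeds, ask_expert, confident_within=0.25, unsure_within=0.75):
    """Grow one human-interpretable cluster from a small seed set.

    points: dict name -> float feature (1-D for brevity)
    seeds: names the domain expert has already labeled as members
    ask_expert: callable(name) -> bool, the human in the loop
    Thresholds are illustrative distances, not values from the talk.
    """
    members = set(seeds)
    rejected = set()
    changed = True
    while changed:
        changed = False
        center = sum(points[m] for m in members) / len(members)
        for name, value in points.items():
            if name in members or name in rejected:
                continue
            distance = abs(value - center)
            if distance <= confident_within:
                members.add(name)  # machine is confident: accept without asking
                changed = True
            elif distance <= unsure_within:
                # machine is unsure: defer to the domain expert
                if ask_expert(name):
                    members.add(name)
                else:
                    rejected.add(name)
                changed = True
    return members

points = {"a": 1.0, "b": 1.1, "c": 1.6, "d": 5.0}
asked = []

def expert(name):
    asked.append(name)
    return name == "c"

cluster = label_cluster(points, ["a"], expert)
```

In this toy run only "c" is sent to the expert; "b" is accepted automatically and "d" is never close enough to be considered.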
Python Notebook
Sample visualization for each cluster
[Figure: panels for Pin Creates, Repins, Total Pins, Repin/Create Ratio, Months Active, and Peak Distance, plus a Pin creates and Repins plot scaled from Few to Many]
Iteration 1: Dark content
Description: Fewer than 2 Pins a week on average
Examples: Noisy low quality content
Machine Cluster 0
Cluster Size: 60587 link domains representing 58.00% of link domains
Feature Quantities: Pin Creates, Repins, Repins + Pin Creates, Months Active, Peak Distance, Repin to Pin Create Ratio
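The slide gives no calculation for this cluster, only the description. One possible reading of "fewer than 2 Pins a week on average" is sketched below, assuming that the counts cover one year (52 weeks) and that "Pins" means Pin creates plus repins. The `DomainEngagement` record and the function name are hypothetical; the field names mirror those used in the later iterations.

```python
from collections import namedtuple

# Hypothetical record; field names mirror those used in the later iterations.
DomainEngagement = namedtuple("DomainEngagement", ["n_pin_creates", "n_repins"])

WEEKS_PER_YEAR = 52

def detect_dark_content(domain_engagement, max_pins_per_week=2.0):
    """True when the domain averages fewer than 2 Pins a week over the year."""
    total_pins = domain_engagement.n_pin_creates + domain_engagement.n_repins
    return total_pins / WEEKS_PER_YEAR < max_pins_per_week

quiet = DomainEngagement(n_pin_creates=10, n_repins=40)   # 50 Pins, under 1 a week
busy = DomainEngagement(n_pin_creates=100, n_repins=400)  # 500 Pins, over 9 a week
```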
Iteration 2: Pinterest Specials
[Figure: Pin creates and Repins, Few to Many]
Description: Domains with few Pins, but these Pins thrive in the Pinterest ecosystem
Calculation:
def detect_pinterest_specials(domain_engagement):
    # Repins per Pin create; max() guards against division by zero.
    ratio = domain_engagement.n_repins / max(1.0, float(domain_engagement.n_pin_creates))
    # X and Y are thresholds left unspecified on the slide.
    return domain_engagement.n_pin_creates <= X and ratio >= Y
Examples: Fashion and impulse sites
Iteration 3: Steady growth
[Figure: Pin creates and Repins, Few to Many]
Description: Active Pin creates and steady growth throughout the year
Calculation:
import numpy as np

def detect_steady_growth(domain_engagement):
    # Slope of a straight-line fit to monthly repins.
    growth_rate, intercept = np.polyfit(
        range(len(domain_engagement.monthly_repins)),
        domain_engagement.monthly_repins, 1)
    # months_pins_created is not defined on the slide; X and Y are unspecified thresholds.
    return months_pins_created >= X and growth_rate >= Y
Examples: Recipe and DIY sites
Iteration 4: Slow growth
[Figure: Pin creates and Repins, Few to Many]
Description: Similar to steady growth, but not as fast
Calculation:
import numpy as np

# The slide reuses detect_steady_growth; the thresholds X and Y are not given.
def detect_steady_growth(domain_engagement):
    # Slope of a straight-line fit to monthly repins.
    growth_rate, intercept = np.polyfit(
        range(len(domain_engagement.monthly_repins)),
        domain_engagement.monthly_repins, 1)
    # months_pins_created is not defined on the slide.
    return months_pins_created >= X and growth_rate >= Y
Examples: Slightly lower quality recipe and DIY sites
Iteration 5: Churning
[Figure: Pin creates and Repins, Few to Many]
Description: Slowly fade through the year
Calculation:
import numpy as np

def detect_churning(domain_engagement):
    # Skip the first two months, then fit a line to each monthly series.
    months = range(len(domain_engagement.monthly_repins) - 2)
    repin_growth, _ = np.polyfit(months, domain_engagement.monthly_repins[2:], 1)
    pin_create_growth, _ = np.polyfit(months, domain_engagement.monthly_pin_creates[2:], 1)
    # Churning: both repins and Pin creates trend downward.
    return repin_growth < 0 and pin_create_growth < 0
Examples: Fashion sale and click bait sites
Iteration 6: Yearly
[Figure: Pin creates and Repins, Few to Many]
Description: Slowly fade through the year
Calculation:
import numpy as np

def detect_churning(domain_engagement):
    # Skip the first two months, then fit a line to each monthly series.
    months = range(len(domain_engagement.monthly_repins) - 2)
    repin_growth, _ = np.polyfit(months, domain_engagement.monthly_repins[2:], 1)
    pin_create_growth, _ = np.polyfit(months, domain_engagement.monthly_pin_creates[2:], 1)
    # Both repins and Pin creates trend downward.
    return repin_growth < 0 and pin_create_growth < 0
Examples: Seasonal fashion, such as snow boots
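Iteration 7: Late Bloomer
[Figure: Pin creates and Repins, Few to Many]
Description: Peak mid year
Calculation:
import numpy as np

def detect_late_bloomer(domain_engagement):
    # Total monthly activity (repins + Pin creates), skipping the first two months.
    totals = [r + p for (r, p) in zip(domain_engagement.monthly_repins[2:],
                                      domain_engagement.monthly_pin_creates[2:])]
    # Quadratic fit; a negative leading coefficient means the curve peaks and then falls.
    concavity, pin_growth, intercept = np.polyfit(
        range(len(domain_engagement.monthly_repins) - 2), totals, 2)
    return concavity < 0
Examples: Blogs that get off to a slow start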
Clusters
• Dark content
• Pinterest specials
• Steady growth
• Slow growth
• Churning
• Yearly
• Late bloomer
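Because each iteration removes the domains it has labeled before the next one runs, the seven clusters behave like an ordered cascade. The sketch below shows that control flow only. The function name `assign_clusters`, the tuple records, and the thresholds in the toy detectors are made up for illustration and are not the values used at Pinterest.

```python
def assign_clusters(domains, detectors):
    """Label each domain with the first detector that fires.

    domains: dict name -> engagement record
    detectors: ordered list of (label, predicate) pairs, one per iteration
    Domains claimed by an earlier iteration are removed before the next runs.
    """
    remaining = dict(domains)
    labels = {}
    for label, predicate in detectors:
        matched = [name for name, rec in remaining.items() if predicate(rec)]
        for name in matched:
            labels[name] = label
            del remaining[name]
    for name in remaining:
        labels[name] = "unlabeled"
    return labels

# Toy records: (pin creates, repins). All thresholds below are invented.
domains = {"a.com": (3, 1), "b.com": (4, 400), "c.com": (900, 2000)}
detectors = [
    ("Dark content", lambda rec: rec[0] + rec[1] < 104),
    ("Pinterest specials", lambda rec: rec[0] <= 5 and rec[1] / max(1.0, rec[0]) >= 50),
]
labels = assign_clusters(domains, detectors)
```

One consequence of this design is that the order of the iterations matters: a domain that satisfies two detectors is always claimed by the earlier one.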
Interactive Notebook
9 data scientists and machine learning engineers. Same data, same UI, same day. Everyone finished in about one hour.
So we did it again
Existing clusters as our baseline
Baseline clusters | Results e | Results l | Results d | Results m | Results z | Results b | Results k
Dark content
Pinterest specials
Steady growth
Slow growth
Churning
Yearly
Late bloomer
90% Matches
Columns: Baseline clusters | Results e | Results l | Results d | Results m | Results z | Results b | Results k
Dark content: Unpopular (95%), Trailing (90%)
Pinterest specials: Trailing (100%), Viral on Pinterest (98%), Pin creates drop off (97%)
Steady growth: Increasing repins (94%), Continuous growth (94%)
Slow growth, Churning, Yearly, Late bloomer: none
50% Matches
Columns: Baseline clusters | Results e | Results l | Results d | Results m | Results z | Results b | Results k
Dark content: Unpopular (95%), Trailing (90%), Original Pinny (84%)
Pinterest specials: Trailing (100%), Minimal original Pins (66%), Viral on Pinterest (98%), Pin creates drop off (97%)
Steady growth: Pinterest viral content (62%), Other (53%), Original Pinny (51%), Viral on the internet (69%), Increasing repins (94%), Continuous growth (94%), Suspected Save button high Pin creates (73%)
Slow growth: Pinterest viral content (55%), Original Pinny (82%), Viral on the internet (65%), Increasing repins (65%), Continuous growth (86%), Suspected Save button high Pin creates (51%)
Churning: Original Pinny (68%), Viral on the internet (53%)
Yearly: Original Pinny (71%)
Late bloomer: Original Pinny (71%), Continuous growth (55%), Suspected Save button high Pin creates (59%)
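The slides do not say how the match percentages were computed. One plausible reading, offered purely as an assumption, is the share of a baseline cluster's link domains that also fall in an analyst's cluster. That reading would be consistent with a single analyst label such as Original Pinny appearing against several baseline clusters. The function names and toy data below are hypothetical.

```python
def match_percent(baseline_members, analyst_members):
    """Percent of a baseline cluster's domains that land in an analyst's cluster."""
    if not baseline_members:
        return 0.0
    return 100.0 * len(baseline_members & analyst_members) / len(baseline_members)

def matches(baseline, analyst, threshold):
    """List (baseline label, analyst label, percent) at or above threshold."""
    found = []
    for b_label, b_members in baseline.items():
        for a_label, a_members in analyst.items():
            pct = match_percent(b_members, a_members)
            if pct >= threshold:
                found.append((b_label, a_label, pct))
    return found

baseline = {"Dark content": {"a.com", "b.com", "c.com", "d.com"},
            "Yearly": {"e.com", "f.com"}}
analyst = {"Unpopular": {"a.com", "b.com", "c.com", "e.com"}}

strict = matches(baseline, analyst, 90)  # nothing clears the 90% bar
loose = matches(baseline, analyst, 50)   # both baseline clusters clear 50%
```

Lowering the threshold from 90 to 50 is what turns a sparse table into a crowded one, which mirrors the difference between the two match slides.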
Ideologically similar clusters
But not related in implementation
Columns: Baseline clusters | Results e | Results l | Results d | Results m | Results z | Results b | Results k
Yearly: Seasonal, Throwback, Seasonal, Annual
Steady growth: Gaining popularity, Increasing repins, Continuous growth, High engagement
Pinterest specials: Initial flurry, Minimal original Pins, Viral on Pinterest, Pin create drop off, Unpopular domains with good content
9 data scientists, 9 answers
Impact implications
• Build different products
• Same product applied to different users
Signs of suboptimal clustering
• Leading with biases
• Cherry-picking: responding to a limited subset of the data
[Figure: Pin creates and Repins, Few to Many, for a cluster labeled Seasonal]
Differences of perspective
Cluster m - Viral growth centric
• Viral on Pinterest
• Viral on the internet
• Lame
Turning of the tide
Measuring data science impact
• Experimental systems are now standard
• Data scientists are more available
• Reproducibility is saving analysis
• [Now] Fast and cheap analysis by multiple people from changing algorithms and open source contributions [Prophet]
Next Steps
Test variations in analysis
• Record analysis decisions to product outcomes
• Toss analysis variations at experimental systems
• Borrow from additional fields for rigorous processes
• Tailor our analysis techniques to replication
Concrete experiments
Break down the problem and build up
• Prime analysts before jumping into the clustering
• Set expectations of what good is
• Train analysts with generated data
• Add process reminders for the goal
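One way to picture "train analysts with generated data" is to plant a known trend in synthetic monthly repin counts, so that an analyst's labels can be graded against the truth. The sketch below is an assumption about what such data might look like, not a description of an experiment from the talk. The function names, slopes, starting values, and noise range are all invented.

```python
import random

def generate_domain(kind, months=12, seed=0):
    """Monthly repin counts with a planted trend: 'growing' or 'churning'."""
    rng = random.Random(seed)
    slope = 10.0 if kind == "growing" else -10.0
    start = 50.0 if kind == "growing" else 200.0
    # Noise of at most +/-3 is too small to flip a slope of +/-10 per month.
    return [start + slope * m + rng.uniform(-3, 3) for m in range(months)]

def fitted_slope(series):
    """Ordinary least-squares slope of the series against the month index."""
    n = len(series)
    mean_x = (n - 1) / 2
    mean_y = sum(series) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(series))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den

truth = ["growing", "churning", "growing"]
data = [generate_domain(kind, seed=i) for i, kind in enumerate(truth)]
recovered = ["growing" if fitted_slope(s) > 0 else "churning" for s in data]
print(recovered == truth)  # True
```

With the ground truth known in advance, disagreement between analysts can be separated from disagreement with the data.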
Pinterest
Dr. Frances Haugen
FrancesHaugen, Frances_Haugen
We’re hiring! https://engineering.pinterest.com/
Dr. June Andrews
DrAndrews, DrJuneAndrews
pin.it/data