Data Science on a Budget: Maximizing Insight and Impact
Nicholas Arcolano, Ph.D.
Senior Data Scientist
@arcolano
Photo by giuseppemilo / CC BY
A little background…
• Spent 10 years at MIT Lincoln Laboratory working in ballistic missile defense and cyber security research
• Areas of interest: statistics, machine learning, parallel computing, “big data”
• Realized these things had been collectively re-branded as “data science”
• Started calling myself a “data scientist” and joined a start-up
Nicholas Arcolano – Data Science on a Budget – November 2014 2
What does a data scientist do?
• Something that happens at the intersection of statistics, machine learning, and computer science
• Usually involves data (typically lots of it)
• Actually, this isn’t the most critical question to be worrying about
A better question…
• What does a data team do?
• Basically, two things:
1. Use data to help the rest of the company understand what our users are doing
2. Help the rest of the company use this information to improve our product and our business
The Company
• Started in 2008
• Based in Boston
• About 50 people
• 4-person data team
Our Product
• RunKeeper app for GPS and manual tracking of running, walking, cycling, other activities
• Long-term fitness goals, training plans, and performance insights
• iOS, Android, web, 3rd party devices
The Data
• 37 million users
• 450 million fitness activities
• 200 billion GPS points
• 17 billion interactions and events
[Diagram: the data team and the groups it works with: Systems, Product, Marketing, Executive, Business Development, User Experience, Quality Assurance, and Support]
“Data science” covers:
• analytics and business intelligence
• modeling and forecasting
• data systems and archiving
• user research and testing
• data-driven features
• data stories and visualizations
How can we accomplish all this, quickly and with a small team?
It’s hard… but here are some steps to making it easier
Step 1: Communicate. A lot.
• You have a lot to learn about the rest of the company
– Every part of the company has its own blend of tools, systems, processes, environments
– Every part has data it understands and cares about
– Every part knows things that affect the data that you won’t see: user interviews, support feedback, product bugs, system failures
• You also have a lot to teach people
– What data we have
– What it can (and can’t) do
– Empower people to “think with data”
• Be patient—sometimes you have to say the same things many times
• You may be the only one looking at certain data—if you see something, say something!
Setting expectations
Anticipated impact of data exploration. Things our data team will discover:
• exciting new things
• things we already knew
Actual impact of data exploration. Things our data team will discover:
• bugs, missing data, and bad data
• things we already knew
• exciting new things
Step 2: Move quickly but carefully.
“Wisely and slow. They stumble that run fast.”
– Friar Laurence, from Shakespeare’s Romeo and Juliet
• On moving fast…
– Data science can work well in an agile framework
– Make assumptions, but understand them
– Don’t be afraid to provide caveats
• On being cautious…
– Bad analysis is worse than no analysis
– Make time for data QA
– Use common sense: if it seems too good (or bad) to be true, it usually is
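As a rough illustration of what "making time for data QA" might look like, the sketch below counts a few basic problems in a batch of activity records. The field names (`activity_id`, `distance_km`, `duration_h`) and the speed cutoff are assumptions made up for this example, not RunKeeper's actual schema or rules.

```python
from collections import Counter

# Assumed cutoff for a plausible running speed; purely illustrative.
MAX_PLAUSIBLE_KMH = 45.0


def qa_report(activities):
    """Count basic data-quality problems in a list of activity dicts."""
    issues = Counter()
    seen_ids = set()
    for row in activities:
        # Duplicate identifiers often point to double-logging upstream.
        if row["activity_id"] in seen_ids:
            issues["duplicate_id"] += 1
        seen_ids.add(row["activity_id"])

        distance_km = row.get("distance_km")
        duration_h = row.get("duration_h")
        if distance_km is None or duration_h is None:
            issues["missing_value"] += 1
            continue
        if distance_km < 0 or duration_h <= 0:
            issues["impossible_value"] += 1
            continue
        # "Too good to be true": average speed above the assumed cutoff.
        if distance_km / duration_h > MAX_PLAUSIBLE_KMH:
            issues["too_good_to_be_true"] += 1
    return dict(issues)


# Hypothetical records, hand-written for the example.
sample = [
    {"activity_id": 1, "distance_km": 5.0, "duration_h": 0.5},   # fine
    {"activity_id": 2, "distance_km": None, "duration_h": 0.4},  # missing distance
    {"activity_id": 3, "distance_km": 42.2, "duration_h": 0.5},  # implausibly fast
    {"activity_id": 3, "distance_km": 3.0, "duration_h": 0.3},   # duplicate id
    {"activity_id": 4, "distance_km": -1.0, "duration_h": 0.2},  # negative distance
]
print(qa_report(sample))
```

Running a report like this before any analysis is cheap, and the counts themselves are often worth sharing with the teams that own the data.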
Step 3: Keep it simple.
• Go for lots of small, quick wins
• Learn and iterate
• Resist the urge to show everyone how smart you are by doing something super complicated
• Do the “stupid thing” first
– It helps build understanding
– It helps uncover issues with the data
– It may turn out that you’re not even solving the right problem
– It may actually work pretty well
• When in doubt, favor a simpler method that you understand better over a more complex one
– Easier to implement
– Easier to debug
– Easier to explain to others
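One possible reading of the "stupid thing" is a baseline that ignores every feature and always predicts the training average. The sketch below is only an illustration under that assumption; the weekly-run numbers are invented, and `mean_baseline` and `mean_absolute_error` are hypothetical helper names.

```python
from statistics import mean


def mean_baseline(train_targets):
    """Return a predictor that ignores its input and predicts the training mean."""
    avg = mean(train_targets)
    return lambda _features: avg


def mean_absolute_error(predict, rows):
    """Average absolute gap between predictions and observed targets."""
    return mean(abs(predict(features) - target) for features, target in rows)


# Invented data: runs per week for a handful of users.
train_runs_per_week = [2, 3, 4, 3]
test_rows = [({"user": "a"}, 2), ({"user": "b"}, 5)]

baseline = mean_baseline(train_runs_per_week)
print("baseline error:", mean_absolute_error(baseline, test_rows))
```

Any more complex model then has a concrete number to beat, and if it cannot beat it, that is worth knowing early.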
You don’t have to use all the data
• Sometimes, using all the data is the right thing to do:
SELECT COUNT(userid) FROM rk_user;
• Sometimes, though, you can solve your problem entirely with a small data set
• Benefits
– Easier computation and data wrangling means faster results
– “Curse of dimensionality” is a real thing
– Mitigate bad assumptions (lack of stationarity, different product versions, changing environments, regional and seasonal effects, etc.)
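To make the small-data-set point concrete, here is a minimal sketch that estimates a rate from a random sample and reports an approximate 95% margin of error (normal approximation). The population is synthetic, the 30% rate is invented, and `estimate_rate` is a hypothetical helper, not something from the talk.

```python
import math
import random


def margin_of_error(p_hat, n, z=1.96):
    """Approximate 95% margin of error for a sample proportion."""
    return z * math.sqrt(p_hat * (1.0 - p_hat) / n)


def estimate_rate(population, n, seed=0):
    """Estimate the share of 1s in `population` from a simple random sample of size n."""
    rng = random.Random(seed)
    sample = rng.sample(population, n)
    p_hat = sum(sample) / n
    return p_hat, margin_of_error(p_hat, n)


# Synthetic population: 1 = user did the thing, 0 = did not (true rate 30%).
population = [1] * 60_000 + [0] * 140_000

p_hat, moe = estimate_rate(population, n=10_000)
print(f"estimated rate: {p_hat:.3f} +/- {moe:.3f}")
```

Under these assumptions, the margin of error depends on the sample size rather than on how large the full population is, which is why a modest sample can answer many rate-style questions.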
Step 4: Use the right tools.
• In any given scenario, the “right tool” is one of the following:
– The tool you already know and are comfortable with
– Something you don’t know but suspect would work really well
– Something that doesn’t exist yet
• It’s up to you to figure out which one it is
Languages and technologies I used during 10 years at my last job
Languages and technologies I’ve used during 1 year at my current job
• Be comfortable using a variety of tools
• Make time to learn new ones
• Build your own tools for repeatable analysis—once you know it’s worth it
• Open source: take advantage of the hard work of others, but make sure you understand what you’re using
• Give back
• Many of the same principles apply to your “analytical toolkit”
• Try to learn when to stick with a well-worn approach and when to try something new
• Be skeptical of the conventional wisdom
– Just because a metric or analytical approach is common doesn’t mean it’s the right thing to do for your situation
– Typical example: A/B testing
Hypothesis testing (“A/B testing”)
• USERS are split into two groups
– GROUP A, “Control” (90%): standard flow
– GROUP B, “Treatment” (10%): experimental flow
• Each group yields a # of successes and failures
• The counts feed a test statistic
• DECISION: “reject / fail to reject the null hypothesis”
– “Null hypothesis”: treatment has no effect
– “Alternate hypothesis”: treatment has some effect
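The slide does not say which test statistic is used. One common choice for success/failure counts is a two-proportion z-test, sketched below with invented counts for a 90%/10% split. Treat it as an illustration of the flow, not as the method used in the talk.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for a difference in success rates between two groups."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    # Under the null hypothesis both groups share one rate, so pool the counts.
    pooled = (success_a + success_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Invented counts: 9,000 control users, 1,000 treatment users.
z, p_value = two_proportion_z_test(success_a=900, n_a=9_000, success_b=120, n_b=1_000)
alpha = 0.05  # assumed significance level
decision = "reject" if p_value < alpha else "fail to reject"
print(f"z = {z:.2f}, p = {p_value:.3f}: {decision} the null hypothesis")
```

Note that the small treatment group dominates the standard error here, which is one reason an uneven split needs more total users to detect the same effect.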
Thoughts about A/B testing
• A/B testing is hard to do well
– Need lots of data and good estimates of baseline rates to have a chance at significance
– Need lots of data infrastructure to do it quickly on a large scale
– Need to manage variables such as multiple testing, changes in product and environment, interactions between tests, subjects
– Need to make sure tests align with high-level vision and learning goals
• An A/B test can help with one very specific decision, but typically will not...
– Help you understand how multiple different factors interact
– Predict long-term reactions (the “taste test” phenomenon); that needs a longitudinal study
– Always give you the answer you want; results may be null or inconclusive
– Tell you anything of any value whatsoever if you did it wrong
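To give a feel for "lots of data and good estimates of baseline rates", the sketch below uses the standard normal-approximation formula for the number of users per group needed to detect a change in a success rate. It assumes two equal-sized groups, a two-sided test at 5% significance, and 80% power; the rates are invented, and `users_per_group` is a hypothetical helper.

```python
import math
from statistics import NormalDist


def users_per_group(baseline, target, alpha=0.05, power=0.80):
    """Approximate users needed in EACH of two equal groups to detect baseline -> target."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    variance = baseline * (1 - baseline) + target * (1 - target)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (target - baseline) ** 2)


# Invented rates: a 10% baseline with two different hoped-for lifts.
for target in (0.12, 0.11):
    print(f"10% -> {target:.0%}: {users_per_group(0.10, target)} users per group")
```

Halving the detectable lift roughly quadruples the required sample, and a wrong baseline estimate shifts the answer too, so both points in the first bullet matter in practice.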
Even when performed “correctly”, an A/B test may not tell you what you think it does
Step 5: Have faith and have fun
• Don’t try to understand everything all at once—keep looking from multiple angles and trust that more understanding will come in time
• Working with data from millions of engaged users is awesome
• Helping your company have a real impact on their lives is even more awesome
• All the tools are available to do truly amazing things
• Make sure everyone knows how much you love the data, and they will grow to love it too
Things we’re still working on
• Synthesizing knowledge and communicating results
• Data-driven products and features
• Analytics and instrumentation
• Giving back (open source, blogging, tutorials, talks)
http://arcolano.com
@arcolano
Thanks for listening! Questions?
http://www.runkeeper.com