Data Scientists
Leonid Zhukov
Higher School of Economics, Moscow, 2013
www.hse.ru


The Sexiest Job of the 21st Century


McKinsey estimates a shortage of 140,000-190,000 people by 2018


Data Scientists wanted!


Supply and demand


Who are Data Scientists?


Some backgrounds are better than others:
•  Computer Science
•  Statistics (mathematics)
•  Natural sciences with a strong quantitative component
•  PhDs, but not exclusively

Data Scientist:
•  Loves data
•  Has an investigative mindset
•  Goal is to find patterns in data and create data-driven products
•  Is a practitioner, not a theorist
•  Has hands-on skills
•  Has domain expertise (*)
•  Is a team player

demand for a certain set of skills, while later demand wanes as many of those initial skills are automated by even newer tools. Consider, for instance, the way many data processing and network management jobs that used to require legions of computer operators are now handled by automated monitoring tools. Data science is still in its very early phase, with the amount of data exploding and the right tools to process it just becoming available.

Although data science is generating new opportunities, our capacity to train new data scientists is not keeping up, and nearly two-thirds of respondents foresee a looming shortfall in the number of data scientists over the next five years. This aligns with other research, including a recent McKinsey Global Institute study that predicts a shortage of 190,000 data scientists by the year 2019 [iii]. And when our respondents were asked where the best source for talent was, few looked to today’s business intelligence professionals. Instead, nearly two-thirds looked to today’s university students.

Who is the Data Scientist?

Although the term data science has been around for decades – indeed, most scientists use data of some form – the term data scientist in its current context is relatively new, frequently credited to DJ Patil, who started the data science team at LinkedIn [iv]. But as a new term, the field is still very much in flux, and without evidence about the practitioners, we’re left to speculate about what it may mean. In our survey, we allowed users to self-identify as “data science professionals” in order to avoid conflicts over terminology in job titles. In this section we’ll attempt to define data scientists by comparing them with the previous big players in the analytics space, business intelligence professionals.

Twenty years ago, business intelligence was itself a new term, just emerging to take over the various database management and decisions support functions within an organization. As the field grew rapidly in the 90s, it also coalesced around a smaller number of tools, more consistent expectations for talent, better training, and more rigorous organizational standards. As our data demonstrates, data scientists are currently going through that transition,

The best source of new Data Science talent is:
•  Students studying computer science: 34%
•  Professionals in disciplines other than IT or computer science: 27%
•  Students studying fields other than computer science: 24%
•  Today's BI professionals: 12%
•  Other: 3%

Jim Asplund, Chief Scientist at Gallup Consulting, is a data scientist focused on evaluating the role that human perception has on everything from disease conditions and GDP to worker productivity and consumer behavior. He works with massive data sets linking perception with actual behavior, and micro- and macroeconomic outcomes. His work has isolated emotional factors that are most highly related to outcomes organizations care about.

From: EMC Data Science Community Survey, 2011


What do Data Scientists do?

•  Designs customized systems and tools
•  Works with structured and unstructured data
•  Creates data processing pipelines
•  Analyzes massive datasets (TB, PB)
•  Builds predictive models
•  Creates visualizations
•  Designs data products
•  Uses Hadoop, MapReduce, Hive, Python, R


Tools of the trade

•  Operating systems:
   •  Linux + shell tools

•  Big data instruments:
   •  Hadoop (MapReduce) + Hadoop tools
   •  Hive, Pig
   •  NoSQL (HBase, MongoDB, Cassandra, Neo4j)

•  Database:
   •  SQL

•  Programming:
   •  Python
   •  Java
   •  Scala

•  Machine Learning:
   •  R
   •  MATLAB
   •  Python libraries (NumPy, SciPy, NLTK, scikit-learn)
   •  Java libraries (Mahout)


Required skills

•  Programming
•  Algorithms
•  Statistics
•  Data mining
•  Machine learning
•  NLP
•  Distributed systems
•  Big data tools
•  Databases
•  Visualization


From: Swami Chandrasekaran, Executive Architect, IBM Watson Solutions


Data Scientist roles


From: “Analyzing the Analyzers” by Harlan Harris, Sean Murphy, and Marck Vaisman, O’Reilly Strata 2012


Data Science “dream team”


From: “Doing Data Science: Straight Talk from the Frontline”, Rachel Schutt, Cathy O'Neil, O'Reilly Media, 2013


Data Science project pipeline

•  Learning a problem
•  Acquiring data
•  Parsing data
•  Cleaning, filtering and organizing
•  Exploring and mining for patterns
•  Building models
•  Visualizing results
•  Communicating findings
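The pipeline stages listed above can be walked through on a toy example. The following is only a minimal sketch in Python, using the standard library alone: the dataset, the column names (visits, signups) and every function name are hypothetical and chosen for illustration, and a simple least-squares line stands in for the "building models" stage.

```python
import csv
import io
from statistics import mean

# Acquiring data: a hypothetical in-memory CSV stands in for a real source.
RAW = """day,visits,signups
1,100,10
2,150,16
3,,12
4,200,21
5,250,24
"""

def parse(text):
    """Parsing: turn raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def clean(rows):
    """Cleaning and filtering: drop rows with missing values, convert to numbers."""
    return [
        {"visits": float(r["visits"]), "signups": float(r["signups"])}
        for r in rows
        if r["visits"] and r["signups"]
    ]

def explore(rows):
    """Exploring: one summary statistic, the overall signup rate."""
    return sum(r["signups"] for r in rows) / sum(r["visits"] for r in rows)

def fit_line(rows):
    """Building a model: least-squares fit of signups = a * visits + b."""
    xs = [r["visits"] for r in rows]
    ys = [r["signups"] for r in rows]
    mx, my = mean(xs), mean(ys)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

def visualize(rows):
    """Visualizing: a crude text bar chart, one '#' per two signups."""
    return "\n".join(
        f"{int(r['visits']):>4} | " + "#" * int(r["signups"] // 2) for r in rows
    )

rows = clean(parse(RAW))
rate = explore(rows)
slope, intercept = fit_line(rows)
print(visualize(rows))
# Communicating findings: state the result in plain language.
print(f"Overall signup rate: {rate:.1%}; "
      f"each extra 100 visits adds about {slope * 100:.1f} signups.")
```

In practice each stage would be far larger (and "learning a problem" happens before any code is written), but keeping the stages as separate functions mirrors the order of the pipeline and makes each one easy to replace.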


Business applications

•  Marketing:
   •  Market segmentation
   •  Product and media mix analysis
   •  Customer acquisition and churn modeling
   •  Recommendation systems and cross-selling
   •  Social media analysis

•  Finance & Insurance:
   •  Fraud prevention
   •  Anomaly detection
   •  Credit risk analysis
   •  Usage-based insurance modeling
   •  Portfolio optimization

•  Healthcare and Pharmaceuticals:
   •  Genetic analysis
   •  Clinical trials analysis
   •  Clinical decision support systems


Industry training


Course Outline: Cloudera Introduction to Data Science

Introduction

Data Science Overview
> What Is Data Science?
> The Growing Need for Data Science
> The Role of a Data Scientist

Use Cases
> Finance
> Retail
> Advertising
> Defense and Intelligence
> Telecommunications and Utilities
> Healthcare and Pharmaceuticals

Project Lifecycle
> Steps in the Project Lifecycle
> Lab Scenario Explanation

Data Acquisition
> Where to Source Data
> Acquisition Techniques

Evaluating Input Data
> Data Formats
> Data Quantity
> Data Quality

Data Transformation
> Anonymization
> File Format Conversion
> Joining Datasets

Data Analysis and Statistical Methods
> Relationship Between Statistics and Probability
> Descriptive Statistics
> Inferential Statistics

Fundamentals of Machine Learning
> Overview
> The Three Cs of Machine Learning
> Spotlight: Naïve Bayes Classifiers
> Importance of Data and Algorithms

Recommender Overview
> What Is a Recommender System?
> Types of Collaborative Filtering
> Limitations of Recommender Systems
> Fundamental Concepts

Introduction to Apache Mahout
> What Apache Mahout Is (and Is Not)
> A Brief History of Mahout
> Availability and Installation
> Demonstration: Using Mahout’s Item-Based Recommender

Implementing Recommenders with Apache Mahout
> Overview
> Similarity Metrics for Binary Preferences
> Similarity Metrics for Numeric Preferences
> Scoring

Experimentation and Evaluation
> Measuring Recommender Effectiveness
> Designing Effective Experiments
> Conducting an Effective Experiment
> User Interfaces for Recommenders

Production Deployment and Beyond
> Deploying to Production
> Tips and Techniques for Working at Scale
> Summarizing and Visualizing Results
> Considerations for Improvement
> Next Steps for Recommenders

Conclusion

Appendix A: Hadoop Overview

Appendix B: Mathematical Formulas

Appendix C: Language and Tool Reference

Cloudera Certified Professional: Data Scientist (CCP:DS)

Establish yourself as an expert by completing the certification exam for data scientists. CCP:DS is the highest level of technical certification Cloudera offers and certifies your knowledge and skills as a data scientist using Apache Hadoop on large data sets. The credential requires both a multiple-choice Data Science Essentials exam and a hands-on, performance-based Data Science Challenge with a real-world problem on a live system.


Cloudera Introduction to Data Science: Building Recommender Systems

Take your knowledge to the next level with Cloudera’s Data Science Training and Certification

Data scientists build information platforms to ask and answer previously unimaginable questions. Learn how data science helps companies reduce costs, increase profits, improve products, retain customers, and identify new opportunities.

Cloudera University’s three-day course helps participants understand what data scientists do and the problems they solve. Through in-class simulations, participants apply data science methods to real-world challenges in different industries and, ultimately, prepare for data scientist roles in the field.

Hands-On Hadoop

Through instructor-led discussion and interactive, hands-on exercises, participants will navigate the Hadoop ecosystem, learning topics such as:

> The role of data scientists, vertical use cases, and business applications of data science
> Where and how to acquire data, methods for evaluating source data, and data transformation and preparation
> Types of statistics and analytical methods and their relationship
> Machine learning fundamentals and breakthroughs, the importance of algorithms, and data as a platform
> How to implement and manage recommenders using Apache Mahout and how to set up and evaluate data experiments
> Steps for deploying new analytics projects to production and tips for working at scale

Audience & Prerequisites

This course is suitable for developers, data analysts, and statisticians with basic knowledge of Apache Hadoop: HDFS, MapReduce, Hadoop Streaming, and Apache Hive. Students should have proficiency in a scripting language; Python is strongly preferred, but familiarity with Perl or Ruby is sufficient.

Data Scientist Certification

Following successful completion of the training class, attendees receive a Data Science Essentials practice test. Data Science Essentials plus the Data Science Challenge constitute the Cloudera Certified Professional: Data Scientist (CCP:DS). Certification is a great differentiator; it helps establish you as a leader in the field, providing employers and customers with tangible evidence of your skills and expertise.

“The professionalism and expansive technical knowledge demonstrated by our instructor were incredible. The quality of the Cloudera training was on par with a university.”
GENERAL DYNAMICS


Educational programs

University programs:
•  University of Washington: Certificate in Data Science
•  UC Berkeley: Master of Information and Data Science program
•  New York University: Data Science at NYU
•  Columbia University: Institute for Data Sciences and Engineering
•  University of Southern California (USC): Master of Science in Data Science

Online MOOC courses:
•  Coursera
•  edX
•  Udacity

Accelerated educational programs:
•  Zipfian Academy (12-week intensive program)
•  Insight Data Science Fellows Program (6-week postdoctoral training)


Conferences

•  Industry conferences and meetings:
   •  O’Reilly Strata Conference: Making Data Work
   •  Hadoop World
   •  Big Data TechCon
   •  Big Data Innovation summits

•  Meetups

•  Academic conferences (peer reviewed):
   •  IEEE & ACM Supercomputing
   •  IEEE Big Data
   •  ACM KDD Knowledge Discovery and Data Mining
   •  ACM SIGIR Information Retrieval
   •  ICML International Conference on Machine Learning
   •  ICDM International Conference on Data Mining
   •  NIPS Neural Information Processing
   •  WWW World Wide Web Conference
   •  VLDB Very Large Data Bases
   •  ACM CIKM Information and Knowledge Management
   •  SIAM SDM International Conference on Data Mining
   •  IEEE ICDE Data Engineering
   •  IEEE Visualization


Textbooks


Open questions

•  How important is domain expertise?
•  What is needed more: education or experience?
•  What is the future of data scientists: will they be replaced by software?


20, Myasnitskaya str., Moscow, Russia, 101000 Tel.: +7 (495) 628-8829, Fax: +7 (495) 628-7931

www.hse.ru