Life of a Data Scientist: Myths and Reality

Jeong, Buhwan, Ph.D.

Data Hacker / Kakao Corp.



Data scientists are big data wranglers. They take an enormous mass of messy data points and use their formidable skills in math, statistics and programming to clean, massage and organize them. Then they apply all their analytic powers and domain knowledge to uncover hidden solutions to business challenges.

Script (modified) from http://www.mastersindatascience.org/careers/data-scientist/

A data scientist is someone who is better at statistics than any software engineer and better at software engineering than any statistician.

+ domain knowledge, business understanding

http://www.mastersindatascience.org/careers/data-scientist/

Diagram from https://www.quora.com/What-is-a-data-scientist-3

Image from http://paper4pc.com/superman-logosuperman.html

http://www.sintetia.com/wp-content/uploads/2014/05/Data-Scientist-What-I-really-do.png

DB

Log

SQL Data

TXT / EXL

Visualization

Implement

Test & Deploy

[KR] Algorithm

- Regression - Classification - Clustering

[Insight]

Big Data?

Volume Variety

Velocity Value

Value

BIG & FAST / SMART

Count & Trend / Predictive

Technical / Meaningful

Analytics

Volume Variety

Velocity

Engineering Science

Data Science

Scientific Method

Proved by Theory

Verified with Experiment

Algorithm(Equations)

Testing(Evidence)

Experiment & Test

Hypothesis

Experiment

Graduation

Observation

Deployment

Test (Comparison)

Observation

Off-line Test

Deployment

On-line Test

Test Deploy

Modeling Test set

Observe

A (Treatment)

B (Control)

← Offline : Online →

Solve

M T W T F S S

Code Release

Off-line Test

On-line Test

Deployment

Monitoring & Improvement

Netflix’s Weekly Test & Deployment

Image from https://vwo.com/ab-testing/

On-line A/B Test


From Yahoo! (Creative Best Practices: Native Ads)


A/B Test Configuration

Traffic-driven: for every incoming request, if random() < 0.1, assign the request to the treatment group (10%); otherwise, assign it to the control group (90%).

User-driven: for every requester whose userId ends with 'NN', if 'NN' is in '00' ~ '09', assign the user to the treatment group; otherwise, assign the user to the control group.
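
The two configurations above can be sketched in Python roughly as follows. This is a minimal illustration, not the production setup: the function names, the `TREATMENT_RATIO` constant, and the sample user IDs are assumptions made for the example; only the 10% threshold and the '00' ~ '09' suffix rule come from the slide.

```python
import random

TREATMENT_RATIO = 0.1  # 10% of requests go to treatment, as on the slide

# Two-digit suffixes "00" .. "09" (10 of the 100 possible suffixes)
TREATMENT_SUFFIXES = {f"{i:02d}" for i in range(10)}


def assign_by_traffic(rng=random):
    """Traffic-driven: decide per request, so the same user may see both variants."""
    return "treatment" if rng.random() < TREATMENT_RATIO else "control"


def assign_by_user(user_id):
    """User-driven: decide from the last two characters of the user ID,
    so a given user always lands in the same group."""
    return "treatment" if user_id[-2:] in TREATMENT_SUFFIXES else "control"


if __name__ == "__main__":
    # Hypothetical user IDs, for illustration only
    for uid in ("user1203", "user1210", "user1299"):
        print(uid, assign_by_user(uid))
    print(assign_by_traffic())
```

The user-driven rule gives each user a stable experience across requests, while the traffic-driven rule only controls the overall share of requests. Which one fits depends on whether the metric is measured per request or per user.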

Random

Control Group

Treatment (A)

A/B Test

Random

Control Group

Treatment A

Treatment B

Treatment C

Multivariate Test

Multivariate test: https://www.optimizely.com/resources/multivariate-testing/
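
A split into one control group and several treatments, as in the diagram above, might look like the sketch below. Note the swap: the slide says "Random", whereas this example uses a hash of the user ID so that each user stays in the same arm. The arm names, the 70/10/10/10 shares, and the `experiment` label are all hypothetical.

```python
import hashlib

# Hypothetical arms and traffic shares in percent; shares must sum to 100.
ARMS = [
    ("control", 70),
    ("treatment_a", 10),
    ("treatment_b", 10),
    ("treatment_c", 10),
]


def assign_arm(user_id, experiment="demo"):
    """Map a user to one arm deterministically via a hash bucket in 0..99."""
    key = f"{experiment}:{user_id}".encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest()[:8], 16) % 100
    upper = 0
    for arm, share in ARMS:
        upper += share  # running upper bound of this arm's bucket range
        if bucket < upper:
            return arm
    raise ValueError("ARMS shares must sum to 100")


if __name__ == "__main__":
    # Hypothetical user IDs, for illustration only
    for uid in ("user1", "user2", "user3"):
        print(uid, assign_arm(uid))
```

Including the experiment label in the hash key keeps assignments in one experiment independent of assignments in another.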

Red Daum vs Blue Daum

Data over Algorithm

Forbes.com: http://goo.gl/bauDHw

DB

Log

SQL Data

Implement

Test & Deploy

[KR] Algorithm

- Regression - Classification - Clustering

[Insight]

20 60

15

5

Forbes.com: http://goo.gl/bauDHw

Hacking Data for business goals
- Right data
- Right algorithm
- Right evaluation

Good UI/UX is defined by

User Adoption

Human Hacker

Image from https://goo.gl/vClux5

Enjoy your Jeju