Data Science Starter Program Introduction to Data Science E. Le Pennec, A. Fermin Spring 2015

Page 1: Data Science Starter Program Introduction to Data Science

Data Science Starter Program: Introduction to Data Science

E. Le Pennec, A. Fermin

Spring 2015

Introduction to Data Science: Outline

1. Data Science in the media
2. From Data to Product
3. An example: Real Time Bidding
4. Data Science ecosystem
5. Data cycle
6. Data Science project
7. Data scientists
8. Big Data, Data Science, Statistics
9. Computing and Distributed Computing
10. Data Science Challenges
11. Vocabulary of Data Science
12. Bibliography

1. Data Science in the media

Data Science in the media: Le Monde

Data Science in the media: NY Times

Data Science in the media: World Bank

Data Science in the media: Criteo

Data Science in the media: We are in the press as well...

Data Science in the media: Data is the new oil?

2. From Data to Product

From Data to Product: Web search

From Data to Product: Recommendation system

From Data to Product: Advertisement

From Data to Product: Intrusion detection

From Data to Product: Crime prevention

From Data to Product: Marketing

[Figure: Marketing Technology Landscape, January 2014, by Scott Brinker (@chiefmartec, http://chiefmartec.com). Layers shown: Infrastructure, Backbone Platforms, Middleware, Marketing Experiences, Marketing Operations, Internet. Categories include databases, big data, cloud, CRM, marketing automation, web site management, e-commerce, APIs, marketing analytics, business intelligence, SEO, email marketing, display advertising, social media marketing and mobile marketing.]

From Data to Product: Health

From Data to Product: LinkedIn

From Data to Product: Smart city

From Data to Product: Sports

From Data to Product: Genomics

From Data to Product: Physics

3. An example: Real Time Bidding

An example: Real Time Bidding

A customer visits a webpage with his browser: a complex process of content selection and delivery begins.

An advertiser might want to display an ad to this customer on the webpage he is going to.

The webpage belongs to a publisher. The publisher delivers content: news, music, information, sports, etc. This content draws an audience.

The publisher sells ad space to advertisers who want to reach that audience.

1. The customer visits a publisher’s webpage: the browser opens a connection to the publisher’s content server.

2. The content server returns the content for the page (HTML code).

The HTML code describing this content is retrieved by the browser, which starts to render and interpret it.

But... there is a line in this HTML code that says “follow this URL to retrieve ad content”.

3. The publisher has an ad server. It answers the request by considering possibilities: can I place an ad for my premium buyers? Do I have data about the consumer viewing my content? (It could help me decide which buyer should get this display opportunity.) Only logical rules apply (no machine learning here).

The ad display opportunity is not premium, and this space or type of customer is not already reserved by a buyer. The publisher’s ad server puts this ad display opportunity on the open ad market.

4. The publisher ad server connects to an SSP (Supply-Side Platform). This platform monetizes its programmable display inventory.

The SSP asks: have I already seen this consumer before? Do I have additional data on him? The SSP requests extra information about the user from a DMP (Data-Management Platform): profiling, audience segments, etc. Here machine learning is applied.

5. Using this information, the SSP sends the ad request to an ad-exchange.

6. Meanwhile, the ad-exchange is connected to, and exchanges with, many potential buying systems: DSPs (Demand-Side Platforms), ad-networks, even other ad-exchange networks.

Ad-networks and DSPs can have pre-cached bids: “I’m paying $1 for 1000 displays to 25-year-old males in France; I buy 100 displays as soon as the price is below some threshold” (like a broker).

If there are no pre-cached bids, the ad-exchange says: no direct buyer for this display. Let’s use an auction rule! The RTB (Real Time Bidding) begins.

6. RTB: buyers have 10 ms (!) to give a price to the ad-exchange. Buyers assess in real time how willing they are to display an ad to this customer.

Machine learning is used here, but only the prediction step, e.g. to assess the probability that the customer will click on some ads. The model must contain few parameters to answer quickly: the use of feature selection in the training step is crucial here.
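The prediction step can be sketched as follows. This is only an illustration, assuming a logistic model over a handful of binary features; the feature names, the weights and the value of a click are hypothetical and do not come from any real bidder.

```python
import math

# Hypothetical weights learned offline. Keeping only a few features
# (after feature selection) keeps the prediction step cheap.
WEIGHTS = {
    "bias": -4.0,
    "is_mobile": 0.3,
    "visited_shop_before": 1.2,
    "evening": 0.1,
}

def click_probability(features):
    """Logistic model: p = 1 / (1 + exp(-score)) over the active binary features."""
    score = WEIGHTS["bias"]
    for name in features:
        # Features unknown to the model are simply ignored.
        score += WEIGHTS.get(name, 0.0) if name != "bias" else 0.0
    return 1.0 / (1.0 + math.exp(-score))

def bid_price(features, value_per_click):
    """Bid the expected value of the display: P(click) * value of a click."""
    return click_probability(features) * value_per_click

if __name__ == "__main__":
    active = ["is_mobile", "visited_shop_before"]
    print(click_probability(active), bid_price(active, 0.50))
```

At prediction time the cost is one lookup and one addition per active feature, which is why a small model fits in a millisecond-scale budget.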

7. The ad-exchange selects the highest bidder. The winning DSP instructs the ad-exchange to retrieve the ad creative.

8. The ad-exchange passes these instructions to the SSP.

9. The SSP sends the request to the publisher ad server.

10-12. The publisher ad server responds over the browser’s still-open HTTP connection and tells the browser to go to the agency’s ad server to download the ad.
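The selection in step 7 can be sketched in a few lines. The slides only say that the highest bidder wins; charging the winner its own bid (a first-price rule) is an assumption made here for illustration, and the buyer names are hypothetical.

```python
def run_auction(bids):
    """Return (winner, price) for the highest bid, or None if nobody bid.

    `bids` maps a buyer name to its bid for this single ad display.
    The winner pays its own bid: an assumed first-price rule.
    """
    if not bids:
        return None
    winner = max(bids, key=bids.get)
    return winner, bids[winner]

if __name__ == "__main__":
    print(run_auction({"dsp_a": 0.8, "dsp_b": 1.1, "ad_network": 0.9}))
```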

Now the ad can be displayed in the browser.

The full process takes < 100 ms!

Where is data science:

- DMP side, to cluster the audience into marketing segments and to profile customers: clustering and classification
- Buyer’s side (DSP, ad-network), to compute the price proposed for RTB. Need to estimate the probability of a click on ads: regression and classification

Some numbers for a large web-advertisement company:

- 10 million predictions of click probability per second
- answers within 10 ms
- stores 20 terabytes of data daily
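A quick back-of-the-envelope computation gives a feel for these numbers. It assumes decimal terabytes and, as a worst case, that every prediction uses the full 10 ms; both are assumptions, not figures from the slides.

```python
PREDICTIONS_PER_SECOND = 10_000_000
LATENCY_SECONDS = 0.010               # assumed worst case: the full 10 ms budget
BYTES_STORED_PER_DAY = 20 * 10**12    # 20 TB, assuming decimal terabytes
SECONDS_PER_DAY = 24 * 60 * 60

# Little's law: requests in flight = arrival rate * time spent in the system.
in_flight = PREDICTIONS_PER_SECOND * LATENCY_SECONDS

# Average sustained write rate needed to store one day of data.
write_rate_mb_per_s = BYTES_STORED_PER_DAY / SECONDS_PER_DAY / 10**6

predictions_per_day = PREDICTIONS_PER_SECOND * SECONDS_PER_DAY

if __name__ == "__main__":
    print(in_flight, write_rate_mb_per_s, predictions_per_day)
```

Under these assumptions, about a hundred thousand predictions are in flight at any instant and storage must absorb a little over 230 MB every second.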

4. Data Science ecosystem

Data Science ecosystem: A new Context

- Data everywhere: huge volume, huge variety...
- Affordable computation units: cloud computing, Graphical Processor Units (GPU)...
- Growing academic and industrial interest!


Data Science ecosystem: Big Data is (quite) Easy

Example of an off-the-shelf solution

# AWS credentials used by the spark-ec2 launch script
export AWS_ACCESS_KEY_ID=<your-access-keyid>
export AWS_SECRET_ACCESS_KEY=<your-access-key-secret>
# Launch the cluster, then log in to its master node
cellule/spark/ec2/spark-ec2 -i cellule.pem -k cellule -s <number of machines> launch <cluster-name>
ssh -i cellule.pem root@<your-cluster-master-dns>
spark-ec2/copy-dir ephemeral-hdfs/conf
# Copy the training set from S3 into HDFS
ephemeral-hdfs/bin/hadoop distcp s3n://celluledecalcul/dataset/raw/train.csv /data/train.csv
scp -i cellule.pem cellule/challenge/target/scala-2.10/target/scala-2.10/challenges_2.10-0.0.jar

# Preprocess the raw data
cellule/spark/bin/spark-submit \
  --class fr.cc.challenge.Preprocess \
  challenges_2.10-0.0.jar \
  /data/train.csv \
  /data/train2.csv

# Fit the logistic regression
cellule/spark/bin/spark-submit \
  --class fr.cc.sparktest.LogisticRegression \
  challenges_2.10-0.0.jar \
  /data/train2.csv

⇒ Logistic regression for arbitrarily large datasets!


Data Science ecosystem: A Complex Ecosystem!

5. Data cycle

Data cycle: Data Cycle

[Diagram: Data; Acquisition, Cleaning and Storage; Integration; Analysis; Visualization and Interface; Decision Process; Practical Issue]

- Data/Information flow vision
- Goal oriented
- Iterative and interactive process


Data cycle: Data

- Raw material: structured and unstructured data (Variety), data quality issues (Veracity), quantity (Volume and Velocity)
- Various sources: open data, proprietary data

Data cycle: Acquisition, Cleaning, Storage and Integration

- Get the data from the sources.
- Storage issue and availability for processing.
- Cleaning and formatting
- Integration: data preparation for analysis
- Time consuming!

Data cycle: Analysis

- Extract information from the data
- Statistics/Machine learning
- Big Data: hardware is the limit (time/volume)

Data cycle: Visualization and Interface

- Reporting part: visualization, text...
- Also used for data exploration
- Very important aspect!

Data cycle: Decision and goal oriented analysis

- Better decisions: Value
- Need to answer a problem/question!
- Need to formalize the problem: no answer without a question!
- Feedback everywhere...

Data cycle: Real Data Cycle

[Diagram: Data; Acquisition, Cleaning and Storage; Integration; Analysis; Visualization and Interface; Decision Process; Practical Issue]

- Data/Information flow vision
- Goal oriented
- Iterative and interactive process


Data cycle: Doing Data Science

Doing Data Science: Straight Talk from the Frontline. Rachel Schutt, Cathy O’Neil. O’Reilly.

6. Data Science project

Data Science project: A 7 step program

1. Identify the problem
- Type of problem and metric used to measure success
- Identify key people within your organization and outside
- Get specifications, requirements, priorities, budgets
- How accurate does the solution need to be?
- Do we need all the data?
- Outsourcing?


2. Identify available data sources
- Extract and check sample data / Perform Exploratory Data Analysis
- Assess quality of data, and value available in data
- Identify data glitches, find work-arounds
- Data quality improvement?
- Verify with a field expert that you understand the data
- Infrastructure?

3. Identify if additional data sources are needed
- What? How much? How to?
- Real time?
- Do we need experimental design?


4. Data preparation and analyses
- Data preparation and cleaning
- Explore methodologies
- Select variables and models
- Detect / remove outliers
- Validate chosen methodology
- Measure accuracy, provide confidence intervals
- Provide visualization and ask for feedback

5. Implementation, development
- FSSRR: Fast, simple, scalable, robust, re-usable
- Debugging
- Need to create an API to communicate with other apps?
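The "measure accuracy, provide confidence intervals" item of step 4 can be illustrated with a short sketch. It assumes a classifier evaluated on a held-out test set and uses the normal approximation of the binomial, which is only one of several possible interval constructions; the function name is hypothetical.

```python
import math

def accuracy_with_interval(y_true, y_pred, z=1.96):
    """Accuracy on a held-out set with a normal-approximation interval.

    z=1.96 corresponds to a 95% confidence level. The approximation is
    poor for very small test sets or accuracies close to 0 or 1.
    """
    n = len(y_true)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    acc = correct / n
    half_width = z * math.sqrt(acc * (1.0 - acc) / n)
    return acc, max(0.0, acc - half_width), min(1.0, acc + half_width)

if __name__ == "__main__":
    truth = [1] * 80 + [0] * 20
    predicted = [1] * 100
    print(accuracy_with_interval(truth, predicted))
```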


6. Communicate results
- Integration and visualization
- Discuss potential improvements (with cost estimates)
- Provide training
- Code and methodology documentation

7. Maintenance
- Test the model or implementation; stress tests
- Regular updates
- Outsourcing?

7. Data scientists

Data scientists: Skills

- Business: business analysis, market knowledge, product usage...
- Data Management: data collection, storage, cleaning, filtering, integration...
- Statistics and Machine Learning: data modeling, inference, prediction, pattern recognition...
- Programming: software development, large-scale or parallel data processing...
- Interface and Data Visualization: HCI design, visualization, story-telling...

Data scientists: Profiles

No one masters all the skills!

Data scientists: Data science team

Gather people having different skills

Data scientists: Main types of data scientists

There are the ones...

- Strong in statistics: develop new statistical theories for big data: statistical modeling, experimental design, sampling, clustering, data reduction, confidence intervals, testing, modeling, predictive modeling, etc.
- Strong in mathematics: operations research, analytic business (inventory management and forecasting, pricing optimization, supply chain, quality control, yield optimization)
- Strong in data engineering, Hadoop, database/memory/file systems optimization and architecture, APIs, Analytics as a Service, optimization of data flows
- Strong in computer science (algorithms, computational complexity, optimization)
- Strong in business, ROI optimization, decision sciences (dashboard design, metric mix selection and metric definitions, ROI optimization, high-level database design)
- Strong in production code development, software engineering

Data scientists: More than data scientists?

8. Big Data, Data Science, Statistics

Big Data, Data Science, Statistics: Wikipedia

- Big data is an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using traditional data processing applications.
- Data science is the study of the generalizable extraction of knowledge from data, yet the key word is science.
- Statistics is the study of the collection, analysis, interpretation, presentation and organization of data.

Big Data, Data Science, Statistics: Data science evolution

Main Paradigmatic Changes in Big Data Analytics Environment

Three periods are compared:
- Statistical Data Analysis, <1985 (Pure Statistical Inference)
- Business Intelligence, 1985-2008 (Constrained Data Mining)
- Big Analytics, >2008 up to now (Unconstrained Data Mining)

Data storing
- <1985: line & column dimensions fixed. Flat files, hierarchical DBs, & first relational DBs
- 1985-2008: column dimensions fixed. SQL DBs: MySQL, DB2, ORACLE & OLAP cubes
- >2008: no dimensions fixed. NoSQL DBs: column oriented DBs, object oriented DBs, etc.

Basic analytical principles
- <1985: hypotheses driven mode: power use of sampling techniques
- 1985-2008: mix of hypotheses driven & data driven: dimension reduction & population segmentation
- >2008: full data driven mode: power use of learning techniques, mainly unsupervised

Main algorithmic approaches
- <1985: regression analysis, factorial analysis, statistical inference through sampling, linear general models, decision trees, etc.
- 1985-2008: clustering (K-means, K neighbours), classification & support vector machines, multi-layer neural nets, scoring techniques, sequential patterns, etc.
- >2008: deep adaptive learning techniques, auto-encoded neural nets, huge graph modularization & visual analytics, full unsupervised linear clustering, etc.

New types of business deliverables
- Score cards, decisional models based on sampling
- Population profiling: CRM, churn & attrition analysis, loyalty & propensity programs, cross selling

Data types
- <1985: homogeneous structured data (proprietary)
- 1985-2008: homogeneous structured & homogeneous unstructured data, separately
- >2008: mix of heterogeneous unstructured & structured data (proprietary + open data)

Volume: exponential volume increase
Cost/volume: exponential cost decrease

Big Data, Data Science, Statistics: 3Vs of Big Data

Big Data, Data Science, Statistics: 5Vs of Big Data

Big Data, Data Science, Statistics: Data science or statistics?

A vocabulary problem:

- data scientist or statistician?
- statistics or data science?

A possible answer:

9. Computing and Distributed Computing

Computing and Distributed Computing: Computer Architecture

Everything should go through the CPU...

Computing and Distributed Computing: Memories

Level                     Typical size
CPU register              64 b × 16
Level 1 cache access      8-128 KB
Level 2 cache access      32-1024 KB
Level 3 cache access      1-8 MB
Main memory access        2-16 GB
Solid-state disk I/O      250 GB-1 TB (4 TB)
Rotational disk I/O       500 GB-4 TB

Computing and Distributed Computing: Memories

Event                             Latency      Scaled (1 CPU cycle = 1 s)
1 CPU cycle                       0.3 ns       1 s
Level 1 cache access              0.9 ns       3 s
Level 2 cache access              2.8 ns       9 s
Level 3 cache access              12.9 ns      43 s
Main memory access                120 ns       6 min
Solid-state disk I/O              50-150 µs    2-6 days
Rotational disk I/O               1-10 ms      1-12 months
Internet: SF to NYC               40 ms        4 years
Internet: SF to UK                81 ms        8 years
Internet: SF to Australia         183 ms       19 years
OS virtualization reboot          4 s          423 years
SCSI command time-out             30 s         3000 years
Hardware virtualization reboot    40 s         4000 years
Physical system reboot            5 min        32 millennia
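The scaled column follows from a single rule: divide each latency by the duration of one CPU cycle, so that one cycle lasts one second. A short sketch of that arithmetic, using a few rows of the table (the dictionary and function names are only illustrative):

```python
CPU_CYCLE_NS = 0.3  # one CPU cycle, the unit that is rescaled to 1 second

LATENCIES_NS = {
    "Level 1 cache access": 0.9,
    "Main memory access": 120.0,
    "Rotational disk I/O (10 ms)": 10e6,
    "Internet: SF to NYC": 40e6,
}

def human_scale_seconds(latency_ns):
    """Seconds the event would take if one CPU cycle lasted one second."""
    return latency_ns / CPU_CYCLE_NS

if __name__ == "__main__":
    for name, ns in LATENCIES_NS.items():
        seconds = human_scale_seconds(ns)
        print(f"{name}: {seconds:,.0f} s = {seconds / 86400:.2f} days")
```

The table rounds the results: a main memory access scales to 400 s (about 6 min), and a 40 ms round trip to a little over 4 years.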

Computing and Distributed Computing: Distributed/Parallel Computing

[Figure: three arrangements of processors and memories, labeled (a), (b) and (c). Distributed: (a) and (b). Parallel: (c).]

Computing and Distributed Computing: MultiCore

- Several processors/cores sharing the same RAM.
- Transfers between cores are not too expensive.
- Strategies: independent batches, parallelization techniques limiting information transfer...
- System memory limitation!

Computing and Distributed Computing: Hadoop and Map/Reduce

- Data transfer through disk and networked file system!
- Hadoop: node failure handling and ecosystem.
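The Map/Reduce programming model itself can be sketched in a single process with the classic word count. This is not Hadoop code: in a real cluster the map and reduce calls run on different nodes and the shuffle goes through disk and the network, which is exactly the cost noted above. All function names are illustrative.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit one (word, 1) pair per word of a document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all values emitted for the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values of one key, here by summing the counts."""
    return key, sum(values)

def word_count(documents):
    """Run map, shuffle and reduce in sequence on a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

if __name__ == "__main__":
    print(word_count(["big data", "Big analytics"]))
```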

Computing and Distributed Computing: Spark

Strategy: keep as much as possible in memory...

Computing and Distributed Computing: GP-GPU

- Combine different processor types...
- CPU < DSP < FPGA < ASIC

10. Data Science Challenges

Data Science Challenges: New Interdisciplinary Challenges

- Applied math AND computer science
- Huge importance of domain specific knowledge: physics, signal processing, biology, health, marketing...

Some joint math/computer science challenges:
- Data acquisition
- Unstructured data and their representation
- Huge datasets and computation
- High dimensional data and model selection
- Learning with less supervision
- Visualization

Data Science Challenges: Data acquisition

Some challenges:
- How to measure new things?
- How to choose what to measure?
- How to deal with distributed sensors?
- How to look for new sources of information?

Data Science Challenges: Unstructured Data

Some challenges:
- How to store the data efficiently?
- How to describe (model) them to be able to process them?
- How to combine data of different nature?
- How to learn dynamics?

Data Science Challenges: Huge Dataset

Some challenges:
- How to take into account the locality of the data?
- How to construct distributed architectures?
- How to design adapted algorithms?

Data Science Challenges: High Dimensional Data

Some challenges:
- How to describe (model) the data?
- How to reduce the data dimensionality?
- How to select/mix models?

Data Science Challenges: Learning and Supervision

Some challenges:
- How to learn with as few interactions as possible?
- How to learn several related tasks simultaneously?

Data Science Challenges: Visualization

Some challenges:
- How to look at the data?
- How to present results?
- How to help make better-informed decisions?

11. Vocabulary of Data Science

Vocabulary of Data Science: Lots of words

Page 96: Data Science Starter Program Introduction to Data Science

Vocabulary of Data Science: Main Fields

Data mining. Extract patterns from data by combining methods from statistics, machine learning and data processing technologies. Example: market basket analysis to model the purchase behavior of customers.

Machine learning. Design and develop algorithms allowing computers to learn from data, in order to make intelligent decisions automatically. Example: natural language processing.

Statistics. Collection, organization, and interpretation of data. Mathematical methods to construct quantitative assessments of errors and risks when making decisions, estimating parameters and making predictions. Examples: quantitative assessment of relationships between variables, computing confidence intervals for model parameters, hypothesis testing.

Natural language processing (NLP). Specialization of machine learning and linguistics that builds algorithms to analyze human (natural) language. Example: sentiment analysis on social networks.

Network analysis. Characterize relationships among nodes in a graph or a network: understand the communities, the influence of nodes on one another, and how information travels in the network. Examples: identify key opinion leaders in a social network, identify the information flows in a large company.

Predictive modeling. Use of a mathematical model to predict an outcome, e.g. regression, classification, etc. Example: predict the probability that a customer will churn.

Supervised learning. Machine learning techniques that infer a function or a relationship from a set of labeled training data. Examples: classification, regression.

Unsupervised learning. Machine learning techniques that find structure in unlabeled data. Example: clustering.

Visualization. Techniques for representing data through images, diagrams and animations, in order to communicate, explore and improve understanding of the data.

Vocabulary of Data Science: Machine Learning

Labels. Characteristics or categories of interest attached to data points. This is the information one wants to predict in supervised learning.

Features. A set of information about a data point (a customer, a company, a country, etc.).

Clustering. Algorithms used in unsupervised learning to assign a group to each data point. Groups are called clusters. Example: customer segmentation on an e-commerce platform.

Classification. Algorithms used in supervised learning to predict the labels of data points. It relies on training a model or algorithm on labeled data. Keywords: Logistic Regression, SVM, CART, Boosting, etc.

Feature selection. Algorithms in machine learning that select the features that best explain an outcome. Example: in biology, find the genetic information that best explains a patient’s response to a drug.
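
As an illustration of the Clustering entry, here is a minimal sketch of k-means (Lloyd's algorithm) on one-dimensional data, in plain Python. K-means is only one clustering algorithm among many; the function name `kmeans_1d` and the customer spend figures are hypothetical and chosen purely for the example.

```python
import random


def kmeans_1d(points, k, n_iter=20, seed=0):
    """Lloyd's algorithm on 1-D data: returns (centroids, assignments)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise with k of the data points
    assignments = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        assignments = [
            min(range(k), key=lambda j: abs(x - centroids[j])) for x in points
        ]
        # Update step: each centroid moves to the mean of its cluster.
        for j in range(k):
            members = [x for x, a in zip(points, assignments) if a == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = sum(members) / len(members)
    return centroids, assignments


# Hypothetical yearly spend (in euros) of eight customers.
spend = [12.0, 15.0, 14.0, 11.0, 210.0, 190.0, 205.0, 200.0]
centroids, clusters = kmeans_1d(spend, k=2)
```

On this toy data the two clusters separate low-spend from high-spend customers, which is the customer segmentation idea in miniature. No labels are used at any point, which is what makes the method unsupervised.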

Vocabulary of Data Science: Mathematical Concepts

Generalization. Ability of a predictive algorithm to give good predictive results on a sample different from the one used to train the algorithm.

Parameters. A set of coefficients (vectors, matrices) that specifies a model. Example: the mean and standard deviation of a Gaussian distribution.

Statistical model. A mathematical formulation that attempts to explain how data is generated. Example: data is generated by a multivariate Gaussian distribution.

Likelihood. The probability (or probability density) of the observed data under a model, for a given choice of parameters.
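
The entries on parameters, statistical models and likelihood can be tied together in a short worked example. It assumes, for illustration only, that the observations x_1, ..., x_n are independent draws from a one-dimensional Gaussian model with parameters mu (mean) and sigma (standard deviation); this notation is not from the slides.

```latex
% Likelihood of i.i.d. observations x_1, ..., x_n under a Gaussian model
L(\mu, \sigma) = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}}
    \exp\!\left( -\frac{(x_i - \mu)^2}{2\sigma^2} \right)

% Log-likelihood
\ell(\mu, \sigma) = \log L(\mu, \sigma)
    = -\frac{n}{2} \log\!\left( 2\pi\sigma^2 \right)
      - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2

% Maximum-likelihood estimates
\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i ,
\qquad
\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu})^2
```

In this simple model the likelihood is maximized in closed form by the sample mean and the (uncorrected) sample variance; for most models the maximization has to be done numerically.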

Optimization. Study and design of numerical algorithms used, among other things, to minimize or maximize functions. Example: maximizing a likelihood is the training step of many machine learning methods.

Goodness-of-fit. A quantity that assesses the closeness of a (trained) model to data. Example: the least-squares error for linear regression.

Over-fitting. Fitting the training data too closely, noise included, so that the model predicts poorly on new data. It must be avoided to obtain good predictive performance on new data.

Cross-validation. Splitting of data into several subsets. A model is trained on a subset and tested on another, to check its goodness-of-fit not only on the data used for training but also on new data.
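
The goodness-of-fit and cross-validation entries can be sketched together in plain Python. The example below fits a straight line by least squares and compares training and test errors over k folds. The helper names (`fit_line`, `mean_squared_error`, `k_fold_cv`), the interleaved fold assignment and the data are all assumptions made for this illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a * x + b (closed-form solution)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def mean_squared_error(model, xs, ys):
    """Goodness-of-fit: average squared gap between predictions and data."""
    a, b = model
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def k_fold_cv(xs, ys, k):
    """Return one (train_error, test_error) pair per fold."""
    scores = []
    data = list(zip(xs, ys))
    for fold in range(k):
        # Every k-th point, starting at `fold`, is held out for testing.
        test_idx = set(range(fold, len(data), k))
        train = [p for i, p in enumerate(data) if i not in test_idx]
        test = [p for i, p in enumerate(data) if i in test_idx]
        model = fit_line(*zip(*train))
        scores.append((
            mean_squared_error(model, *zip(*train)),
            mean_squared_error(model, *zip(*test)),
        ))
    return scores


# Hypothetical data lying close to the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1, 11.0, 12.9, 15.2]
scores = k_fold_cv(xs, ys, k=4)
```

A test error much larger than the training error in most folds would be a sign of over-fitting; with a model as simple as a straight line on nearly linear data, the two should stay close.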

Vocabulary of Data Science: Data Engineering

Structured data. Data structured in fixed fields. Examples: relational databases or Excel spreadsheets.

Semi-structured data. Data that is not structured in fixed fields but contains markers to separate data elements. Examples: XML or HTML-tagged text.

Unstructured data. Data not structured in fixed fields. Examples: books, articles, bodies of e-mail messages, audio, image and video data, etc.

Metadata. Data that describes the content and context of data: creation, purpose, time and date, author, etc.
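
To illustrate the difference between semi-structured and structured data, the sketch below parses a small XML fragment with Python's standard `xml.etree.ElementTree` module and turns it into rows with fixed fields. The XML content and the field names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Semi-structured input: tags mark the data elements, but records differ in shape.
xml_text = """
<orders>
  <order id="1"><customer>Alice</customer><amount>30.5</amount></order>
  <order id="2"><customer>Bob</customer><note>gift</note></order>
</orders>
""".strip()

root = ET.fromstring(xml_text)
rows = []
for order in root.findall("order"):
    amount = order.findtext("amount")  # None when the <amount> marker is absent
    rows.append({
        "id": int(order.get("id")),
        "customer": order.findtext("customer"),
        # Structured output: every row has the same fields, missing values are None.
        "amount": float(amount) if amount is not None else None,
    })
```

Each resulting row has the same three fields, so it could be loaded into a table; the second order's `note` element is simply dropped here, a choice that a real pipeline would have to make explicitly.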

Data fusion and data integration. Set of techniques that integrate and analyze data from multiple sources, instead of analyzing single sources of data. Example: combine analysis of social network data with NLP and real-time sales data, to assess the effect of a marketing campaign on customer sentiment and purchases.

Vocabulary of Data Science: Databases and Data Processing

Cloud computing. A computing paradigm where computing resources are configured as a distributed system, which provides a service through a network.

Distributed system. Several computers, communicating through a network, used to solve a common storage or computational problem. The aim is higher performance at a lower cost, together with higher reliability and scalability.

Relational database. A database consisting of a collection of tables (relations); that is, data are stored in rows and columns. SQL is the most widely used language for managing relational databases.

Non-relational database. A database that does not store data in tables (rows and columns).

SQL. Acronym for Structured Query Language. It is a computer language designed for managing data in relational databases. Examples: insert, query, update and delete data, manage database structures, and control access to data in the database.

NoSQL. A group of database management systems in which data is not stored in tables as in a relational database. They do not rely on the mathematical relationships between tables. They provide a way of storing and retrieving unstructured data quickly.

HBase. A distributed, non-relational database. It is managed as a project of the Apache Software Foundation and is part of the Hadoop ecosystem.
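
The SQL operations listed above can be tried out with Python's built-in `sqlite3` module, used here as a convenient stand-in for a relational database server. The table and its contents are hypothetical. SQLite has no user accounts, so the access-control part of SQL is not shown.

```python
import sqlite3

# In-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Manage database structure: create a table (a relation).
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")

# Insert rows.
cur.executemany(
    "INSERT INTO customers (name, city) VALUES (?, ?)",
    [("Alice", "Paris"), ("Bob", "Lyon"), ("Chloe", "Paris")],
)

# Update and delete rows.
cur.execute("UPDATE customers SET city = ? WHERE name = ?", ("Nice", "Bob"))
cur.execute("DELETE FROM customers WHERE name = ?", ("Chloe",))
conn.commit()

# Query the remaining rows.
rows = cur.execute("SELECT name, city FROM customers ORDER BY name").fetchall()
conn.close()
```

The `?` placeholders let the database driver handle quoting of the values, which is the usual way to pass data into SQL statements from a program.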

Hadoop. A framework that supports large-scale data processing by decomposing large tasks into smaller tasks that are executed in parallel on independent slices of the data; their results are finally merged to complete the task.

MapReduce. A software framework introduced by Google for processing huge datasets on a distributed system. Implemented in Hadoop. It supports large-scale data processing by decomposing large tasks into smaller tasks, executed in parallel on independent parts of the data, whose results are finally merged to complete the task.

Stream processing. Technologies designed to process large real-time streams of event data. Examples: high-frequency algorithmic trading, analysis of Twitter data streams.
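
The MapReduce idea of splitting a task, processing the slices independently and merging the results can be sketched on a single machine with the classic word-count example. This is a toy illustration in plain Python, not Hadoop's API; the function names and the input slices are hypothetical.

```python
from collections import defaultdict
from functools import reduce


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one slice of the data."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(mapped_pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: merge the values of each key into a single result."""
    return {key: reduce(lambda a, b: a + b, values) for key, values in groups.items()}


# Hypothetical slices of a dataset; on a cluster each would sit on a different machine.
slices = ["big data is big", "data science uses data"]
# The map calls are independent of each other, so they could run in parallel.
mapped = [pair for doc in slices for pair in map_phase(doc)]
word_counts = reduce_phase(shuffle_phase(mapped))
```

Because each call to `map_phase` only looks at its own slice and each key is reduced separately, both phases can be distributed across machines; only the shuffle step needs to move data between them.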

Bibliography

T. Hastie, R. Tibshirani, and J. Friedman (2009). The Elements of Statistical Learning. Springer Series in Statistics.

G. James, D. Witten, T. Hastie, and R. Tibshirani (2013). An Introduction to Statistical Learning with Applications in R. Springer Texts in Statistics.

B. Schölkopf and A. Smola (2002). Learning with Kernels. The MIT Press.

R. Schutt and C. O’Neil (2014). Doing Data Science: Straight Talk from the Frontline. O’Reilly.