DATA SCIENCE MEETS SOFTWARE DEVELOPMENT Alexis Seigneurin - Ippon Technologies




Page 1: Data Science meets Software Development

DATA SCIENCE MEETS SOFTWARE DEVELOPMENT

Alexis Seigneurin - Ippon Technologies


Who I am

• Software engineer for 15 years

• Consultant at Ippon Tech in Paris, France

• Favorite subjects: Spark, Cassandra, Ansible, Docker

• @aseigneurin


• 200 software engineers in France and the US

• In the US: offices in DC, NYC and Richmond, Virginia

• Digital, Big Data and Cloud applications

• Java & Agile expertise

• Open-source projects: JHipster, Tatami, etc.

• @ipponusa


The project

• Data Innovation Lab of a large insurance company

• Data → Business value

• Team of 30 Data Scientists + Software Developers


Data Scientists

Who they are & how they work


Skill set of a Data Scientist

• Strong in:
  • Science (maths / statistics)
  • Machine Learning
  • Analyzing data

• Good / average in:
  • Programming

• Not good in:
  • Software engineering


Programming languages

• Mostly Python, incl. libraries:
  • NumPy
  • Pandas
  • scikit-learn

• SQL

• R


Development environments

• IPython Notebook


Development environments

• Dataiku


Machine Learning

• Algorithms:
  • Logistic Regression
  • Decision trees
  • Random forests

• Implementations:
  • Dataiku
  • scikit-learn
  • Vowpal Wabbit


Programmers

Who they are & how they work

http://xkcd.com/378/


Skill set of a Developer

• Strong in:
  • Software engineering
  • Programming

• Good / average in:
  • Science (maths / statistics)
  • Analyzing data

• Not good in:
  • Machine Learning


How Developers work

• Programming languages
  • Java
  • Scala

• Development environment
  • Eclipse
  • IntelliJ IDEA

• Toolbox
  • Maven
  • …


A typical Data Science project

In the Lab


Workflow

1. Data Cleansing

2. Feature Engineering

3. Train a Machine Learning model
   1. Split the dataset: training/validation/test datasets
   2. Train the model

4. Apply the model on new data
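Step 3.1 of this workflow can be sketched as a shuffled three-way split. This is a minimal illustration using only the Python standard library. The function name `split_dataset`, the 60/20/20 ratios and the fixed seed are assumptions chosen for the example, not values taken from the project.

```python
import random


def split_dataset(rows, train_ratio=0.6, validation_ratio=0.2, seed=42):
    """Shuffle rows and split them into training/validation/test lists.

    The ratios and the seed are illustrative defaults (assumptions).
    Whatever is left after training and validation goes to the test set.
    """
    if not 0 < train_ratio + validation_ratio < 1:
        raise ValueError("train_ratio + validation_ratio must be in (0, 1)")
    shuffled = list(rows)                  # copy: leave the input untouched
    random.Random(seed).shuffle(shuffled)  # seeded for reproducible splits
    train_end = int(len(shuffled) * train_ratio)
    validation_end = train_end + int(len(shuffled) * validation_ratio)
    return (shuffled[:train_end],
            shuffled[train_end:validation_end],
            shuffled[validation_end:])


training, validation, test_set = split_dataset(range(100))
```

Seeding the shuffle keeps the three subsets stable between runs, which makes model comparisons repeatable.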


Data Cleansing

• Convert strings to numbers/booleans/…

• Parse dates

• Handle missing values

• Handle data in an incorrect format

• …
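The cleansing steps listed above could look like the following sketch, assuming raw rows arrive as dictionaries of strings (for example from `csv.DictReader`). The column names, the accepted boolean spellings and the two date formats are hypothetical. Missing or malformed values become `None` so that a later step can decide what to do with them.

```python
from datetime import datetime

# Accepted spellings are an assumption for this example.
TRUE_VALUES = {"true", "yes", "y", "1"}
FALSE_VALUES = {"false", "no", "n", "0"}


def parse_number(text):
    """Return a float, or None if the value is missing or malformed."""
    try:
        # Accept a decimal comma ("1234,50") as well as a decimal point.
        return float(text.strip().replace(",", "."))
    except (AttributeError, ValueError):
        return None


def parse_boolean(text):
    """Return True/False, or None if the value is missing or unknown."""
    value = (text or "").strip().lower()
    if value in TRUE_VALUES:
        return True
    if value in FALSE_VALUES:
        return False
    return None


def parse_date(text, formats=("%Y-%m-%d", "%d/%m/%Y")):
    """Try each format in turn; return a date, or None if none matches."""
    for fmt in formats:
        try:
            return datetime.strptime((text or "").strip(), fmt).date()
        except ValueError:
            continue
    return None


def cleanse(raw):
    """Convert one raw row (hypothetical columns) into typed values."""
    return {
        "premium": parse_number(raw.get("premium")),
        "active": parse_boolean(raw.get("active")),
        "birth_date": parse_date(raw.get("birth_date")),
    }


row = cleanse({"premium": " 1234,50 ", "active": "Yes",
               "birth_date": "31/12/1980"})
```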


Feature Engineering

• Transform data into numerical features

• E.g.:
  • A birth date → age
  • Dates of phone calls → Number of calls
  • Text → Vector of words
  • 2 names → Levenshtein distance
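The four examples above can each be written as a small pure function. The sketch below uses only the standard library. All function names are illustrative, and the word vector is a plain bag-of-words count rather than whatever representation the project actually used.

```python
import re
from collections import Counter
from datetime import date


def age_in_years(birth_date, today):
    """Birth date -> age: completed years as of `today`."""
    had_birthday = (today.month, today.day) >= (birth_date.month, birth_date.day)
    return today.year - birth_date.year - (0 if had_birthday else 1)


def number_of_calls(call_dates, since):
    """Dates of phone calls -> number of calls on or after `since`."""
    return sum(1 for d in call_dates if d >= since)


def bag_of_words(text):
    """Text -> word counts (a simple stand-in for a vector of words)."""
    return Counter(re.findall(r"\w+", text.lower()))


def levenshtein(a, b):
    """Two names -> edit distance (dynamic programming, two rows)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


age = age_in_years(date(1980, 12, 31), today=date(2015, 6, 1))
calls = number_of_calls([date(2015, 1, 5), date(2015, 3, 9)],
                        since=date(2015, 2, 1))
distance = levenshtein("kitten", "sitting")
```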


Machine Learning

• Train a model

• Test an algorithm with different params

• Cross validation (Grid Search)

• Compare different algorithms, e.g.:
  • Logistic regression
  • Gradient boosting trees
  • Random forest
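To make "Cross validation (Grid Search)" concrete, here is a deliberately small, library-free sketch: every parameter combination is scored on k folds and the best average wins. The names `grid_search`, `train_threshold` and `accuracy` are hypothetical, and the toy "model" is a fixed threshold, an assumption made only to keep the example short. In practice a library such as scikit-learn would provide this.

```python
from itertools import product
from statistics import mean


def k_fold_indices(n_rows, k):
    """Yield (train_idx, validation_idx) for k contiguous folds."""
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = [i for i in range(n_rows) if not start <= i < start + size]
        yield train, validation
        start += size


def grid_search(train_fn, score_fn, rows, param_grid, k=3):
    """Return (best_params, best_score); scores are averaged over k folds."""
    best_params, best_score = None, float("-inf")
    names = sorted(param_grid)
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = []
        for train_idx, validation_idx in k_fold_indices(len(rows), k):
            model = train_fn([rows[i] for i in train_idx], **params)
            scores.append(score_fn(model, [rows[i] for i in validation_idx]))
        average = mean(scores)
        if average > best_score:  # higher score is better
            best_params, best_score = params, average
    return best_params, best_score


def train_threshold(rows, threshold):
    # Toy "model": ignores the training rows, only the parameter matters.
    return lambda x: x >= threshold


def accuracy(model, rows):
    return mean(1.0 if model(x) == label else 0.0 for x, label in rows)


data = [(x, x >= 5) for x in range(9)]
best_params, best_score = grid_search(
    train_threshold, accuracy, data, {"threshold": [2, 5, 7]})
```

Comparing different algorithms follows the same pattern: run the search once per algorithm and compare the best averaged scores.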


Machine Learning

• Evaluate the accuracy of the model
  • Root Mean Square Error (RMSE)
  • ROC curve
  • …

• Examine predictions
  • False positives, false negatives…
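Two of the measures above are simple enough to spell out. The sketch below computes RMSE for numeric predictions and counts false positives and false negatives for boolean predictions. The function names are illustrative, and the ROC curve is left out because it needs predicted scores rather than hard labels.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root Mean Square Error between two equally long sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    squared_errors = ((a - p) ** 2 for a, p in zip(actual, predicted))
    return sqrt(sum(squared_errors) / len(actual))


def confusion_counts(actual, predicted):
    """Count true/false positives/negatives for boolean labels."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p:
            counts["tp" if a else "fp"] += 1
        else:
            counts["fn" if a else "tn"] += 1
    return counts


error = rmse([3.0, 5.0], [1.0, 3.0])
counts = confusion_counts([True, True, False, False],
                          [True, False, True, False])
```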


Industrialization Cookbook


Disclaimer

• Context of this project:
  • Not So Big Data (but Smart Data)
  • No real-time workflows (yet?)


Distribute the processing

RECIPE #1


Distribute the processing

• Data Scientists work with data samples

• No constraint on processing time

• Processing on the Data Scientist’s workstation (IPython Notebook) or on a single server (Dataiku)


Distribute the processing

• In production:
  • H/W resources are constrained
  • Large data sets to process

• Spark:
  • Included in CDH
  • DataFrames (Spark 1.3+) ≃ Pandas DataFrames
  • Fast!


Use a centralized data store

RECIPE #2


Use a centralized data store

• Data Scientists store data on their workstations
  • Limited storage
  • Data not shared within the team
  • Data privacy not enforced
  • Subject to data losses


Use a centralized data store

• Store data on HDFS:
  • Hive tables (SQL)
  • Parquet files

• Security: Kerberos + permissions

• Redundant + potentially unlimited storage

• Easy access from Spark and Dataiku


Rationalize the use of programming languages

RECIPE #3


Programming languages

• Data Scientists write code on their workstations
  • This code may not run in the datacenter

• Language variety → Hard to share knowledge


Programming languages

• Use widely used languages

• Spark in Python/Scala
  • Support for R is still too immature

• Provide assistance to ease the adoption!


Use an IDE

RECIPE #4


Use an IDE

• Notebooks:
  • Powerful for exploratory work
  • Weak for code editing and code structuring
  • Inadequate for code versioning


Use an IDE

• IntelliJ IDEA / PyCharm
  • Code compilation
  • Refactoring
  • Execution of unit tests
  • Support for Git


Source Control

RECIPE #5


Source Control

• Data Scientists work on their workstations
  • Code is not shared
  • Code may be lost
  • Intermediate versions are not preserved

• Lack of code review


Source Control

• Git + GitHub / GitLab

• Versioning
  • Easy to go back to a version running in production

• Easy sharing (+permissions)

• Code review


Packaging the code

RECIPE #6


Packaging the code

• Source code has dependencies

• Dependencies in production ≠ at dev time

• Assemble the code + its dependencies


Packaging the code

• Freeze the dependencies:
  • Scala → Maven
  • Python → Setuptools

• Packaging:
  • Scala → Jar (Maven Shade plugin)
  • Python → Egg (Setuptools)

• Compliant with spark-submit


RECIPE #7

Secure the build process


Secure the build process

• Data Scientists may commit code… without running tests first!

• Quality may decrease over time

• Packages built by hand on a workstation are not reproducible


Secure the build process

• Jenkins
  • Unit test report
  • Code coverage report
  • Packaging: Jar / Egg
  • Dashboard
  • Notifications (Slack + email)


Automate the process

RECIPE #8


Automate the process

• Data is loaded manually into HDFS:
  • CSV files, sometimes compressed
  • Often received by email
  • Often samples


Automate the process

• No human intervention should be required
  • All steps should be code / tools
  • E.g. automate file transfers, unzipping…
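As one possible illustration of "all steps should be code", the sketch below stages incoming files from a drop directory and decompresses gzip archives on the way. The directory layout, the file names and the function `stage_incoming_files` are assumptions for the example. Loading the staged files into HDFS would be a further scripted step and is not shown.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def stage_incoming_files(inbox, staging):
    """Copy every file from `inbox` to `staging`, decompressing *.gz files.

    Returns the staged paths, in the name order of the source files.
    """
    inbox, staging = Path(inbox), Path(staging)
    staging.mkdir(parents=True, exist_ok=True)
    staged = []
    for source in sorted(inbox.iterdir()):
        if not source.is_file():
            continue
        if source.suffix == ".gz":
            target = staging / source.stem  # "data.csv.gz" -> "data.csv"
            with gzip.open(source, "rb") as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
        else:
            target = staging / source.name
            shutil.copyfile(source, target)
        staged.append(target)
    return staged


# Demo with throwaway directories and made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    inbox = Path(tmp) / "inbox"
    inbox.mkdir()
    with gzip.open(inbox / "claims.csv.gz", "wb") as f:
        f.write(b"id,amount\n1,100\n")
    (inbox / "contracts.csv").write_bytes(b"id\n7\n")

    staged = stage_incoming_files(inbox, Path(tmp) / "staging")
    staged_names = [path.name for path in staged]
    claims_content = (Path(tmp) / "staging" / "claims.csv").read_bytes()
```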


Adapt to living data

RECIPE #9


Adapt to living data

• Data Scientists work with:
  • Frozen data
  • Samples

• Risks with data received on a regular basis:
  • Incorrect format (dates, numbers…)
  • Corrupt data (incl. encoding changes)
  • Missing values


Adapt to living data

• Data Checking & Cleansing
  • Preliminary steps before processing the data
  • Decide what to do with invalid data

• Thetis
  • Internal tool
  • Performs most checking & cleansing operations
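Thetis is an internal tool and its API is not described here, so the following is not Thetis. It is only a generic sketch of a checking step: each incoming row is tested against per-column checks, and invalid rows are set aside with the names of the failing columns so that someone can decide what to do with them. The column names and checks are hypothetical.

```python
from datetime import datetime


def is_iso_date(value):
    """True if value is a YYYY-MM-DD string."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except (TypeError, ValueError):
        return False


def is_number(value):
    """True if value can be read as a float."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False


# Hypothetical expectations for one incoming file: column name -> check.
CHECKS = {
    "contract_id": lambda value: bool(value and value.strip()),
    "start_date": is_iso_date,
    "premium": is_number,
}


def check_rows(rows, checks=CHECKS):
    """Split rows into (valid, rejected); rejected rows list failing columns."""
    valid, rejected = [], []
    for row in rows:
        failures = [column for column, check in checks.items()
                    if not check(row.get(column))]
        if failures:
            rejected.append({"row": row, "failed_columns": failures})
        else:
            valid.append(row)
    return valid, rejected


incoming = [
    {"contract_id": "C1", "start_date": "2015-01-31", "premium": "120.5"},
    {"contract_id": "C2", "start_date": "31/01/2015", "premium": "abc"},
    {"contract_id": "", "start_date": "2015-02-01"},  # missing premium
]
valid, rejected = check_rows(incoming)
```

Keeping rejected rows, rather than silently dropping them, makes format or encoding changes in a regular feed visible as soon as they appear.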


Provide a library of transformations

RECIPE #10


Library of transformations

• Dataiku "shakers":
  • Parse dates
  • Split a URL (protocol, host, path, …)
  • Transform a post code into a city / department name
  • …

• Cannot be used outside Dataiku


Library of transformations

• All transformations should be code

• Reuse transformations between projects

• Provide a library
  • Transformation = DataFrame → DataFrame
  • Unit tests
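The "Transformation = DataFrame → DataFrame" idea can be sketched without Spark. In the example below a plain list of dictionaries stands in for a DataFrame (an assumption made so the code runs anywhere), and every transformation is a function from frame to frame, so transformations compose and can be tested in isolation. The names `split_url`, `drop_column` and `pipeline` are illustrative.

```python
from functools import reduce
from urllib.parse import urlsplit


def split_url(column):
    """Build a transformation that adds protocol/host/path columns."""
    def transform(frame):
        result = []
        for row in frame:
            parts = urlsplit(row[column])
            result.append({**row,
                           column + "_protocol": parts.scheme,
                           column + "_host": parts.netloc,
                           column + "_path": parts.path})
        return result
    return transform


def drop_column(column):
    """Build a transformation that removes one column."""
    def transform(frame):
        return [{key: value for key, value in row.items() if key != column}
                for row in frame]
    return transform


def pipeline(*transformations):
    """Compose transformations left to right into a single transformation."""
    return lambda frame: reduce(lambda acc, t: t(acc), transformations, frame)


frame = [{"id": 1, "url": "https://example.com/a/b?x=1"}]
prepare = pipeline(split_url("url"), drop_column("url"))
prepared = prepare(frame)
```

Because each transformation returns a new frame and leaves its input untouched, the same function can be reused between projects and unit tested with a few rows of mock data.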


Unit test the data pipeline

RECIPE #11


Unit test the data pipeline

• Independent data processing steps

• Data pipeline not often tested from beginning to end

• Data pipeline easily broken


Unit test the data pipeline

• Unit test each data transformation stage
  • Scala: ScalaTest
  • Python: unittest

• Use mock data

• Compare DataFrames:
  • No library (yet?)
  • Compare lists of lists
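A test in this style might look like the sketch below: a transformation is fed a few rows of mock data and its output is compared as a sorted list of lists. The transformation `add_full_name` and the helper `as_sorted_lists` are hypothetical. With a Spark DataFrame, the rows to compare would typically be obtained with `collect()` first.

```python
import unittest


def add_full_name(rows):
    """Transformation under test (hypothetical): append 'first last'."""
    return [[first, last, f"{first} {last}"] for first, last in rows]


def as_sorted_lists(rows):
    """Normalise rows to a sorted list of lists so row order is ignored."""
    return sorted(list(row) for row in rows)


class AddFullNameTest(unittest.TestCase):
    def test_adds_full_name_column(self):
        mock_data = [("Ada", "Lovelace"), ("Alan", "Turing")]
        expected = [["Alan", "Turing", "Alan Turing"],
                    ["Ada", "Lovelace", "Ada Lovelace"]]
        self.assertEqual(as_sorted_lists(add_full_name(mock_data)),
                         as_sorted_lists(expected))


# Run the test case programmatically (a test runner would do this for you).
suite = unittest.defaultTestLoader.loadTestsFromTestCase(AddFullNameTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Sorting before comparing avoids spurious failures when a distributed engine returns rows in a different order.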


Assemble the Workflow

RECIPE #12


Assemble the Workflow

• Separate transformation processes:
  • Transformations applied to some data
  • Results are frozen and used in other processes

• Jobs are launched manually
  • No built-in scheduler in Spark


Assemble the workflow

• Oozie:
  • Spark
  • MapReduce
  • Shell
  • …

• Scheduling

• Alerts

• Logs


Summary & Conclusion


Summary

• Keys:
  • Use industrialization-ready tools
  • Pair Programming: Data Scientist + Developer

• Success criteria:
  • Lower time to market
  • Higher processing speed
  • More robust processes


Thank you!

@aseigneurin - @ipponusa