Introduction to Guerrilla Analytics

Presented by:

Enda Ridge, PhD

A Practical Approach to Doing Data Science

What this talk is about

• Why doing Data Science is hard

• What is Guerrilla Analytics?

• How to cope in a Guerrilla Analytics environment

– The Guerrilla Analytics Principles

• Applying the Principles

• Next steps and research topics

Copyright Enda Ridge 2014 1

What we are told about Data Science

“Data is the new science. Big data holds the answers.”

“the sexy job in the next 10 years will be statisticians”

“Data Scientist: The Sexiest Job of the 21st Century”

“Information is the oil of the 21st century, and analytics is the combustion engine.”

http://www.gapminder.org/

http://www.statistics.com/data-science-quotes/

https://github.com/mbostock/d3/wiki/Gallery

The Data Science reality

Hi, we need to produce last week’s list of customers with incorrect addresses and report their total value. It’s going to the Chief Risk Officer this afternoon.

Um. Which list? I sent at least 2 lists and Jo sent one too. Maybe.

I’ll check my mailbox and send you the Excel file.

And we’ll also need the change in customer population since last week’s report.

Er... the customer population has changed with the new data from this morning.

Oh and we might have deleted some duplicate customers yesterday so we can’t go back to last week.

My background

Mechanical Engineer

PhD Computer Science (York 2007)

• “Design of Experiments for the Tuning of Algorithms”

Boutique Consultancy

• Social Network Analysis for Fraud

Forensic Data Analytics

• Professional Services

Senior Manager

• Data Science Product Development

What is Data Science?

Data Analytics Insight

Common Perception of the Analytics Workflow

Shearer C., The CRISP-DM model: the new blueprint for data mining, J Data Warehousing (2000); 5:13—22

Project Reality – Guerrilla Analytics

• Changing Data

• Changing Requirements

• Changing Resources

• Changing Business Rules

• Limited Time

• Limited Toolsets

• Limited Resources

• Reproducible

• Explainable

• Testable

Guerrilla Analytics Workflow

GUERRILLA ANALYTICS PRINCIPLES

Some of the Guerrilla Analytics Principles

1. Space is cheap, confusion is expensive.

2. Prefer simple, visual project structures over heavily documented and project-specific rules.

3. Prefer automation with program code over manual graphical methods.

4. Maintain a link between DATA on the file system, in the ANALYTICS environment, and in work products (INSIGHT).

5. Version control changes to data and program code.

Stage 2: Data Receipt

Guerrilla Analytics Environment

• Lost Data

• Multiple Copies of data

• No supporting information

• Local copies of data

• Renamed data

Guerrilla Analytics Approach

• Have 1 Data location

• Data Unique Identifiers

• Data log

• Keep supporting material near its data
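The approach above (one data location, unique identifiers, a data log) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `receive_data`, the `D001`-style identifier format, the `data_log.csv` file name and the log columns are all hypothetical and do not come from the talk.

```python
import csv
import hashlib
import shutil
from datetime import date
from pathlib import Path

# Hypothetical log columns; the talk does not prescribe a log layout.
LOG_FIELDS = ["data_id", "received", "file_name", "sha256", "description"]


def receive_data(source: Path, data_root: Path, description: str) -> str:
    """Copy a delivered file into the single data location and log it."""
    data_root.mkdir(parents=True, exist_ok=True)
    log_path = data_root / "data_log.csv"

    # Next identifier: one more than the number of deliveries already logged.
    logged = []
    if log_path.exists():
        with log_path.open(newline="") as f:
            logged = list(csv.DictReader(f))
    data_id = f"D{len(logged) + 1:03d}"

    # Each delivery gets its own folder; the original file name is kept,
    # so supporting material can later be stored next to the data.
    target_dir = data_root / data_id
    target_dir.mkdir()
    shutil.copy2(source, target_dir / source.name)

    # A checksum lets later stages detect corruption or silent edits.
    checksum = hashlib.sha256(source.read_bytes()).hexdigest()

    write_header = not log_path.exists()
    with log_path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=LOG_FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow({
            "data_id": data_id,
            "received": date.today().isoformat(),
            "file_name": source.name,
            "sha256": checksum,
            "description": description,
        })
    return data_id
```

Calling `receive_data(Path("customers.csv"), Path("data"), "weekly customer extract")` would copy the file to `data/D001/` on the first delivery and append one row to the log, so the question "which list?" can be answered by identifier rather than by searching mailboxes.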

Stage 3: Data Load

Guerrilla Analytics Environment

• Data corruption

• Data preparation for load

• Loss of loaded data

• Cluttered Data Manipulation Environment
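One way to guard against the load problems listed above is to load each file untransformed, name the table after its data identifier, and check the row count after the load. The sketch below uses SQLite purely for illustration. The function name `load_delivery`, the table naming scheme and the all-text column typing are assumptions, not something the talk specifies, and the code assumes column names contain no double quotes.

```python
import csv
import sqlite3
from pathlib import Path


def load_delivery(conn: sqlite3.Connection, csv_path: Path, data_id: str) -> int:
    """Load a CSV as-is into a table named after its data identifier."""
    # Hypothetical naming scheme: identifier plus file stem, so the table
    # can always be traced back to the file it came from.
    table = f"{data_id}_{csv_path.stem}"

    with csv_path.open(newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(reader)

    # Every column is loaded as text so that nothing is changed during load.
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    conn.commit()

    # Compare what arrived in the database with what was in the file.
    loaded = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    if loaded != len(rows):
        raise RuntimeError(f"{table}: expected {len(rows)} rows, loaded {loaded}")
    return loaded
```

Keeping the identifier in the table name is one possible way to maintain the link between data on the file system and data in the analytics environment.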

[Figure: File system and Data Manipulation Environment]

Stage 4: Analytics

Stage 4: Analytics Examples & Themes

• Multiple languages

• Multiple code files

• Output of images

SQL code file that goes through a supplier address table and identifies the address country for each supplier. This derived address country is added to the dataset as a new data field and plotted on a map.

• Data manipulation on file system

• External tools

Python script that runs through thousands of office documents, calls a tool to convert these to XML, and saves the XML file. This process prepares the data for further entity enrichment with another tool.

• Larger number of code files

• Multiple languages

• Multiple tools

Twenty code files are run in a particular order to manipulate and reshape data so it can be imported into a data-mining tool.

• Even simple code snippets

Export of a dataset out of the Data Manipulation Environment so the customer can do their own work with the data.

Stage 4: Analytics

Guerrilla Analytics Approach

Clearly labelled running order
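A clearly labelled running order can be as simple as a numeric prefix on each code file. The sketch below assumes that convention (for example `01_load.sql`, `02_clean.py`), which is an illustration and not something the talk prescribes. The function name `running_order` is hypothetical.

```python
import re
from pathlib import Path

# Assumed convention: file names start with digits followed by an underscore.
PREFIX = re.compile(r"^(\d+)_")


def running_order(code_dir: Path) -> list[Path]:
    """Return the code files in code_dir sorted by their numeric prefix."""
    numbered: dict[int, Path] = {}
    for path in code_dir.iterdir():
        if not path.is_file():
            continue
        match = PREFIX.match(path.name)
        if match is None:
            raise ValueError(f"{path.name} has no numeric prefix")
        position = int(match.group(1))
        if position in numbered:
            raise ValueError(
                f"{path.name} and {numbered[position].name} share position {position}"
            )
        numbered[position] = path
    # Sort numerically so that 10_ comes after 2_, not before it.
    return [numbered[position] for position in sorted(numbered)]
```

Failing loudly on an unnumbered or duplicated file keeps the order visible in the folder itself, in the spirit of preferring simple, visual project structures over documented rules.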

Stage 4: Analytics

[Figure: Guerrilla Analytics Environment. Labels: Raw 1, Raw 2, Union_and_clean, Known_Population, Result]

[Figure: Guerrilla Analytics Approach. Labels: Raw 1, Raw 2, Union, Clean 1, Clean 2, Tagged, Known_Population, Result_filtered]

Stage 7: Reporting

Stage 7: Reporting – what is a report?

Stage 7: Reporting – Guerrilla environment

Stage 7: Reporting – Guerrilla Analytics approach

[Figure: report components labelled with work product identifiers. Select min/max of transaction_time (WP_030), Select min/max of customer_age (WP_035), Purchases by type (WP_042)]

Wrap Up

Discussed

• Guerrilla Analytics Projects

– Disruptions

– Constraints

• Guerrilla Analytics Principles and Practice Tips

– Data Receipt and Load

– Analytics and Reporting

– Consolidation in Builds

Other topics

• Testing in a Guerrilla Analytics Environment

• Capability – People, Process, Technology

• Data Gymnastics – common analytics patterns

Open questions and challenges

Software Engineering

• Version control of data and of analytics

• Build tools across multiple languages

Workflows & Project Management

• Appropriate workflow and supporting tools?

• Project Management methodologies

Testing

• Types (Builds, ad-hoc, data quality)

• Test Harnesses (multi-language, dataset vs code)

‘Big Data’

• Do Guerrilla Analytics Principles work with volume / velocity?

• NoSQL analytics

Keep in Touch!

@Enda_Ridge

www.guerrilla-analytics.net