Democratizing Data Science in the Enterprise


Democratizing Data Science in the Enterprise

Better Title: The NO BS Guide to Getting Insights from your Business Data

About Me

• Hackerpreneur

• Founder of Tellago

• Founder of KidoZen

• Board member

• Advisor: Microsoft, Oracle

• Angel Investor

• Speaker, Author

http://jrodthoughts.com

https://twitter.com/jrdothoughts

Agenda

• A brief history of data science

• Democratizing data science in the enterprise

• Building a great data science infrastructure

• Solving the last mile usability challenge

Key Takeaways

• How to build data science solutions in the real world without breaking the bank?

• What technologies can help?

• Myths and realities of data science solutions

Data Science… Still Magic?

It’s not a trick, it’s an illusion.

“Any sufficiently advanced technology is indistinguishable from magic.”

— Arthur C. Clarke

1. create technology: people who are not experts can use it easily, with little difficulty, and trust the output

2. make it “sufficiently advanced”

“data science”

d. conway, 2010


Basic Research → “Maybe someday, someone can use this.”

Applied Research → “I might be able to use this.”

Working Prototype → “I can use this (sometimes).”

Quality Code → “Software engineers can use this.”

Tool or Service → “People can use this.”

The Wizard… The Data Scientist

Fred Benenson (@fredbenenson), 21 Aug 2013:

“IMHO the majority of data work boils down to 3 things:

1. Counting stuff

2. Figuring out the denominator

3. The reproducibility of 1 & 2”

They’re hot these days…

“data science”: jobs, jobs, jobs

Where do they come from?

“data science”: ancient history

1962: John W. Tukey, “The Future of Data Analysis,” introduces “exploratory data analysis”

Tukey 1965, via John Chambers: Tukey begat S, which begat R

Tukey 1972, with Jerome H. Friedman: Tukey begat ESL

Tukey begat VDQI

Tukey 1977: Tukey begat EDA

fast forward -> 2001

Data Science in the Enterprise

Seems like magic…

But it boils down to 2 factors….

Data Science Success Factors in the Enterprise

• Building a great data science infrastructure

• Solving the last mile problem

Tricks to build a great data science infrastructure

Trick#1: Centralized Data Aggregation…

Goals & Challenges

I would like to…

• Correlate data from disparate data sources

• Enable a centralized data store for your enterprise

• Incorporate new information sources in an agile way

But…

• Traditional multi-dimensional data warehouses are difficult to modify

• They are designed around a specific set of questions (schema-first)

• It is challenging to incorporate semi-structured and unstructured data

Centralized Data Aggregation: Best Practices

• Implement an enterprise data lake

• Rely on big data DW platforms such as Apache Hive

• Use a federated architecture efficiently partitioned for different business units

• Establish SQL as the common query language

• Leverage in-memory computing to optimize query performance
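As a minimal sketch of the aggregation idea, the snippet below loads two hypothetical feeds (a CSV export and a JSON export, both invented for illustration) into one central SQL store, so disparate sources can be correlated with a single query. A real data lake would use a platform like Hive rather than SQLite, but the schema-on-read loading pattern is the same.

```python
import csv, io, json, sqlite3

# Hypothetical sample feeds standing in for two line-of-business systems.
crm_csv = "customer_id,name\n1,Acme\n2,Globex\n"
billing_json = '[{"customer_id": 1, "amount": 120.0}, {"customer_id": 2, "amount": 80.5}]'

# One central store plays the role of the "data lake".
lake = sqlite3.connect(":memory:")
lake.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT)")
lake.execute("CREATE TABLE invoices (customer_id INTEGER, amount REAL)")

# Each source keeps its native format until load time, so adding a
# new information source is just another small loader.
for row in csv.DictReader(io.StringIO(crm_csv)):
    lake.execute("INSERT INTO customers VALUES (?, ?)",
                 (int(row["customer_id"]), row["name"]))
for rec in json.loads(billing_json):
    lake.execute("INSERT INTO invoices VALUES (?, ?)",
                 (rec["customer_id"], rec["amount"]))

# Once aggregated, the disparate sources correlate with plain SQL.
total = lake.execute(
    "SELECT c.name, SUM(i.amount) FROM customers c "
    "JOIN invoices i ON c.customer_id = i.customer_id "
    "GROUP BY c.name ORDER BY c.name"
).fetchall()
print(total)  # [('Acme', 120.0), ('Globex', 80.5)]
```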

Centralized Data Aggregation: Technologies & Vendors

Trick#2: Data Discovery…

Goals & Challenges

I would like to…

• Organically discover data sources relevant to my job

• Help others discover data more efficiently

• Collaborate with colleagues about specific data sources

But…

• Business users typically don’t have access to the data lake

• There is no corporate data repository

• There is no search and metadata repository

Data Discovery: Best Practices

• Implement a corporate data catalog

• The data catalog should be the user interface to interact with the corporate data lake

• Copy ideas from data catalogs on the internet

• Provide rich metadata experience in your data catalog

• Extend your data lake with search capabilities
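The shape of a data catalog with metadata and search can be sketched in a few lines. Everything here (the dataset names, owners, tags) is invented for illustration; a production catalog would sit on a metadata store and a search engine, but the register-then-index pattern is the same.

```python
from collections import defaultdict

catalog = {}              # dataset name -> metadata
index = defaultdict(set)  # keyword -> dataset names (a toy inverted index)

def register(name, owner, description, tags):
    """Register a data source with rich metadata and index it for search."""
    catalog[name] = {"owner": owner, "description": description, "tags": tags}
    for token in description.lower().split() + [t.lower() for t in tags]:
        index[token].add(name)

def search(keyword):
    """Keyword lookup, so users can discover relevant sources organically."""
    return sorted(index.get(keyword.lower(), set()))

register("sales_2015", "finance", "monthly sales figures by region",
         ["sales", "revenue"])
register("web_clicks", "marketing", "clickstream sales funnel events",
         ["web", "sales"])

print(search("sales"))  # ['sales_2015', 'web_clicks']
print(search("web"))    # ['web_clicks']
```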

Data Discovery: Technologies & Vendors

Trick#3: Establish a Common Query Language…

Goals & Challenges

I would like to…

• Query data from different business systems in a consistent way

• Correlate information from different line of business systems

• Reuse queries as new sources of information

But…

• Different business systems use different protocols to query data

• I need to learn a new query language to interact with my big data infrastructure

• Queries over large data sources can be SLOW

Query Language: Best Practices

• Standardize on SQL as the language to query business data

• Implement a SQL interface for your data lake

• Correlate data sources using simple SQL joins

• Materialize query results in your data lake for future reuse

• Invest in in-memory technologies to optimize performance
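The join-and-materialize practice can be illustrated with SQLite and two invented tables: correlate sources with a plain SQL join, then materialize the result (`CREATE TABLE AS`) so later queries and other teams reuse it instead of recomputing.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE products (product_id INTEGER, category TEXT);
    CREATE TABLE orders   (product_id INTEGER, qty INTEGER);
    INSERT INTO products VALUES (1, 'books'), (2, 'games');
    INSERT INTO orders   VALUES (1, 3), (2, 5), (1, 2);
""")

# Correlate two systems with a simple SQL join, and materialize the
# result so it becomes a reusable source of information in the lake.
db.execute("""
    CREATE TABLE category_sales AS
    SELECT p.category, SUM(o.qty) AS units
    FROM orders o JOIN products p ON o.product_id = p.product_id
    GROUP BY p.category
""")

units = dict(db.execute("SELECT category, units FROM category_sales"))
print(units)  # {'books': 5, 'games': 5}
```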

Query Language: Technologies & Vendors

Trick#4: Focus on Data Quality…

Goals & Challenges

I would like to…

• Trust corporate data for my applications

• Actively merge new and historical data

• Integrate new data back into line of business systems

But…

• Data in line of business systems is poorly curated

• Some data records need to be validated or cleansed

• Some data records need to be enriched with additional data points

Data Quality: Best Practices

• Implement a data quality process

• Leverage your data catalog as the main user interface to control data quality

• Trust the wisdom of the crowds to manage data quality

• Provide a great user experience for data quality
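A data quality process usually reduces to three passes over each record: validate, cleanse, enrich. The sketch below uses invented records and a hypothetical country lookup to show the pipeline shape; rejected records are routed back for curation rather than silently dropped.

```python
# Hypothetical raw records from a line-of-business system.
raw = [
    {"email": "ann@example.com ", "country": "us"},
    {"email": "not-an-email",     "country": "DE"},
    {"email": "bob@example.com",  "country": "de"},
]

def validate(rec):
    # Crude check for illustration only, not a real email validator.
    return "@" in rec["email"] and "." in rec["email"].split("@")[-1]

def cleanse(rec):
    return {"email": rec["email"].strip().lower(),
            "country": rec["country"].upper()}

COUNTRY_NAMES = {"US": "United States", "DE": "Germany"}  # enrichment lookup

def enrich(rec):
    rec["country_name"] = COUNTRY_NAMES.get(rec["country"], "Unknown")
    return rec

clean, rejected = [], []
for rec in raw:
    if validate(rec):
        clean.append(enrich(cleanse(rec)))
    else:
        rejected.append(rec)  # route bad records back for curation

print(len(clean), len(rejected))  # 2 1
```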

Data Quality: Technologies & Vendors

Trick#5: Understand your data….

Goals & Challenges

I would like to…

• Execute efficient queries against my corporate data

• Discover patterns and trends about business data sources

• Rapidly adapt to new data sources added to our business processes

But…

• There is no simple way to understand corporate data sources

• We rely on users to determine which queries to execute

• New data patterns and trends often go undetected

Understanding your Data : Best Practices

• Leverage machine learning algorithms to understand business data sources

• Leverage clustering algorithms to detect interesting patterns from your business data

• Leverage classification algorithms to place data records in well-defined groups

• Leverage statistical distribution algorithms to reveal interesting information about your data
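To make the clustering bullet concrete, here is a deliberately tiny one-dimensional k-means (k = 2) over invented values. Real systems would use a proper library over higher-dimensional features, but the loop shows how clustering surfaces groups nobody thought to query for.

```python
import statistics

# Invented metric values with two visibly separated groups.
values = [1.0, 1.2, 0.8, 9.5, 10.1, 9.9]
centroids = [min(values), max(values)]  # naive initialization

for _ in range(10):
    # Assign each value to its nearest centroid...
    clusters = [[], []]
    for v in values:
        nearest = min(range(2), key=lambda i: abs(v - centroids[i]))
        clusters[nearest].append(v)
    # ...then move each centroid to the mean of its cluster.
    centroids = [statistics.mean(c) for c in clusters]

print(sorted(len(c) for c in clusters))   # [3, 3]
print([round(c, 1) for c in centroids])   # [1.0, 9.8]
```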

Understanding your Data : Technologies & Vendors

Trick#6: Predict…

Goals & Challenges

I would like to…

• Efficiently predict well-known variables in my business data

• Adapt results to future predictions

• Take actions based on the predicted outcomes

But…

• Our analytics are based on after-the-fact reports

• Traditional predictive analytics technologies don’t work well with semi-structured and unstructured data

• Traditional predictive analytics require complex infrastructure

Predict : Best Practices

• Implement a modern predictive analytics platform

• Leverage the data lake as the main source of information for predictive analytics algorithms

• Leverage classification and clustering algorithms as the main mechanisms to train predictions

• Expose predictions to other applications for future reuse
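As a sketch of the classification idea, the snippet below trains a toy nearest-centroid classifier on invented, labeled usage records and then predicts a well-known variable (churn) for new ones. This is one of the simplest classification schemes, not a recommendation of a specific platform.

```python
import math

# Hypothetical labeled training data: (logins/week, minutes/session).
train = {
    "churned":  [(1.0, 30.0), (2.0, 25.0)],
    "retained": [(9.0, 60.0), (11.0, 55.0)],
}

# "Training" here is just computing one centroid per label.
centroids = {
    label: tuple(sum(dim) / len(points) for dim in zip(*points))
    for label, points in train.items()
}

def predict(point):
    # Classify by the nearest centroid in Euclidean distance.
    return min(centroids, key=lambda lbl: math.dist(point, centroids[lbl]))

print(predict((1.5, 20.0)))   # churned
print(predict((10.0, 58.0)))  # retained
```

Exposing `predict` behind an API is what lets other applications reuse the model, per the last bullet above.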

Predict : Technologies & Vendors

Trick#7: Take Actions…

Goals & Challenges

I would like to…

• Not have to read a report to take actions on my business data

• Model automatic actions based on well-defined data rules

• Evaluate the effectiveness of the rules and adapt

But…

• Data results are mostly communicated via reports and dashboards

• There is no interface to design rules against business data

• Actions are implemented based on human interpretation of data

Take Actions : Best Practices

• Implement a rules engine on top of your analytics platform

• Model actions as well-defined rules over business data and predicted outcomes

• Trigger actions automatically instead of relying on human interpretation of reports

• Track which rules fire and their outcomes so rules can be evaluated and adapted
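The goals above (model automatic actions as well-defined data rules, then evaluate and adapt them) can be sketched as a minimal rules engine. Every rule, field name, and action here is invented for illustration; the point is the condition-action pairing and the fired-rule count that supports effectiveness review.

```python
alerts = []

# Each rule pairs a condition over a data record with an action, so
# responses no longer depend on someone reading a report.
rules = [
    (lambda r: r["predicted_churn"] > 0.8,
     lambda r: alerts.append(f"offer discount to {r['customer']}")),
    (lambda r: r["overdue_days"] > 30,
     lambda r: alerts.append(f"escalate {r['customer']} to collections")),
]

def evaluate(record):
    fired = 0
    for condition, action in rules:
        if condition(record):
            action(record)
            fired += 1
    return fired  # counting fired rules supports effectiveness review

evaluate({"customer": "acme",   "predicted_churn": 0.9, "overdue_days": 40})
evaluate({"customer": "globex", "predicted_churn": 0.1, "overdue_days": 2})
print(alerts)  # two alerts, both triggered by the acme record
```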

Take Actions: Technologies & Vendors

Trick#8: Embrace developers…

Goals & Challenges

I would like to…

• Leverage data analyses in new applications

• Help developers embrace corporate data infrastructure

• Expose data analyses to new channels such as mobile or IoT

But…

• Data results are mostly communicated via reports and dashboards

• Data analysis efforts are typically led by non-developers

• There is no easy way to organically discover and reuse corporate data sources

Leverage Developers: Best Practices

• Expose data sources and analyses via APIs

• Leverage industry standards to integrate with third-party tools

• Provide data access samples and SDKs for different environments such as mobile and IoT clients

• Incorporate developers’ feedback into your data sources
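Exposing an analysis via an API can be as small as a read-only JSON endpoint. The path and payload below are invented; the sketch uses only the standard library so any client (mobile, IoT, a BI tool) could consume the result without touching the data lake directly.

```python
import json, threading, urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical precomputed analysis result, keyed by URL path.
ANALYSES = {"/sales/summary": {"total": 4200, "top_region": "EMEA"}}

class AnalysisAPI(BaseHTTPRequestHandler):
    def do_GET(self):
        body = json.dumps(ANALYSES.get(self.path, {"error": "not found"})).encode()
        self.send_response(200 if self.path in ANALYSES else 404)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), AnalysisAPI)  # port 0 = any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

# Any HTTP client can now reuse the analysis.
url = f"http://127.0.0.1:{server.server_port}/sales/summary"
summary = json.loads(urllib.request.urlopen(url).read())
print(summary)  # {'total': 4200, 'top_region': 'EMEA'}
server.shutdown()
```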

Embrace Developers: Technologies & Vendors

Trick#9: Real time data is different…

Goals & Challenges

I would like to…

• Process large volumes of real time data

• Aggregate real time and historical data

• Detect and filter conditions in my real time data before it goes into corporate systems

But…

• There is no infrastructure to query real time data

• We process real time and historical data using the same models

• Large data volumes affect performance

Real Time Data Processing: Best Practices

• Implement a stream analytics platform

• Model queries over real time data streams

• Add the results of the aggregated queries into the data lake

• Replay data streams to simulate real time conditions
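A stream query is usually a window plus an aggregate. The sketch below (invented readings and threshold) detects a condition over a sliding window as events arrive, instead of batch-querying after the fact; the windowed aggregates are what you would also land in the data lake.

```python
from collections import deque

WINDOW = 3          # aggregate over the last N readings
THRESHOLD = 100.0   # alert when the windowed average exceeds this

window = deque(maxlen=WINDOW)  # deque drops the oldest reading for us
alerts = []

def on_event(value):
    window.append(value)
    avg = sum(window) / len(window)
    if len(window) == WINDOW and avg > THRESHOLD:
        alerts.append(round(avg, 1))  # aggregate would also land in the lake

# Replaying a recorded stream simulates real time conditions.
for reading in [90, 95, 110, 120, 130, 80]:
    on_event(reading)

print(alerts)  # [108.3, 120.0, 110.0]
```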

Real Time Data Processing: Technologies & Vendors

Solving the last mile problem

Trick#1: Killer user experience…

Create a Killer User Experience

• Design matters

• Invest in an easy way for users to interact with corporate data sources

• Leverage modern UX principles that work across channels (mobile, web)

• Make data discoverable

• Leverage metadata

• Facilitate collaboration

Trick#2: Test test test…

Test Test Test

• Incorporate test models into your data sources

• Simulate real world conditions at the data level

• Assume everything will fail
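Testing data works like testing code: run assertions against the records themselves. The checks and records below are invented, including one deliberately bad row to simulate real-world conditions at the data level.

```python
# Hypothetical records, with one simulated real-world failure baked in.
records = [
    {"id": 1, "price": 9.99},
    {"id": 2, "price": 4.50},
    {"id": 3, "price": -1.0},  # bad data: negative price
]

# Named checks over the whole data source, unit-test style.
checks = {
    "ids are unique":
        lambda rs: len({r["id"] for r in rs}) == len(rs),
    "prices are non-negative":
        lambda rs: all(r["price"] >= 0 for r in rs),
    "no missing fields":
        lambda rs: all({"id", "price"} <= r.keys() for r in rs),
}

failures = [name for name, check in checks.items() if not check(records)]
print(failures)  # ['prices are non-negative']
```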

Trick#3: Integrate with existing tools…

Integrate with Third Party Tools

• Integrate your data lake with mainstream tools like Tableau or Excel

• Use industry standards so that data sources can be incorporated into existing tools

Trick#4: Collaborate…

Collaborate

• Integrate data sources with modern messaging and collaboration tools such as Slack and Yammer

• Distribute updates via email, push notifications, and SMS

Other things to consider

• On-premise, cloud or hybrid?

• Apply agile development practices to your data science infrastructure

• Infrastructure is cool but usability is more important

Summary

• Data science is not magic; it’s an illusion

• Implementing data science in the enterprise is about solving two problems:

• Building a great data infrastructure

• Solving the last mile usability challenge

• Today this can be done with commodity technology

• Data scientists are just “people” ;)

THANKS

Jesus Rodriguez

https://twitter.com/jrdothoughts

http://jrodthoughts.com/
