Big Data and Open Source Swapnil (Neil) Jadhav


Swapnil (Neil) Jadhav | [email protected] | 408.636.3772

Agenda

Introduction

Key strategic challenges for CDOs/CAOs

Key operational challenges for CDOs/CAOs

Top 10 big data tools and technologies

Why open source?

One-page strategy to implement big data programs (Source: Gartner)

Next steps


Introduction

Current: Head of Business Intelligence & Analytics for the City of Carlsbad

Previously: Neil has provided technical and organizational leadership in the areas of big data and statistical analysis, database management, data mining, data architecture, and data warehouse design. He has experience in various industries.

Organizations:

Large consulting firms

Dynamic startup organizations

Fortune 500 companies

Government organizations

Industries:

Oil & Gas – BP (formerly British Petroleum)

Hi-Tech – Adobe, Fujitsu

Health & Fitness – Beachbody LLC

FMCG – Cadbury, Australia

State & local government – City of Carlsbad


Key strategic challenges for a CDO/CAO

Identify and communicate the business context for data within big data analytic projects

Move from “cool experiments” to driving business value

Use analytics and information governance to develop a culture of evidence-based decision making

Information risk management


Key operational challenges

New technologies require an experimental approach - it's a learning exercise

Repeatability is the new demand in big data

Getting the right tools and skills in place

Implement self-service data preparation tools that can accelerate the shift towards business-user-generated data discovery and advanced analytics

Reduce the time and complexity of preparing data for analysis


Big Data tools & technologies (non open source)


‘Open Source’ is the new normal


Top 10 Big Data tools/technologies

1. Apache Spark™

Runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk

Started at UC Berkeley’s Algorithms, Machines and People Lab (AMPLab) in 2009, open-sourced in 2010, and donated to the Apache Software Foundation in 2013

In-memory processing vs. Hadoop’s two-stage, disk-based MapReduce

IBM will invest $300 million, 3,500 developers, and over a dozen of its labs worldwide in Spark-related projects over the next few years

Latest stable release: 1.6 (January 4, 2016)
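The two-stage map/reduce model mentioned above can be sketched in plain Python. This is a toy illustration of the programming model only, not Spark or Hadoop code; the function names (`map_phase`, `shuffle`, `reduce_phase`) and the sample lines are assumptions made for the example. In Hadoop MapReduce the intermediate pairs are written to disk between the two stages, whereas an in-memory engine keeps them in RAM, as this sketch happens to do.

```python
from collections import defaultdict


def map_phase(lines):
    """Stage 1: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs):
    """Group intermediate pairs by key (the step MapReduce spills to disk)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Stage 2: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}


def word_count(lines):
    return reduce_phase(shuffle(map_phase(lines)))


# Hypothetical sample input, for illustration only.
sample = ["big data and open source", "open source is the new normal"]
counts = word_count(sample)
```

Here `counts` maps each distinct word to how often it appears across the sample lines.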


Top 10 Big Data tools/technologies

2. R

Needs no explanation for why it made this list

One of the highest-paid skills

Most-used data science language after SQL

Used by 70% of data miners

Growing faster than any other data science language

#1 Google Search for Advanced Analytics software

More than 2 million users worldwide

7,829 packages available for use

#1 choice for new graduates


Top 10 Big Data tools/technologies

3. Talend Open Studio

#1 integration solution to offer GUI support for YARN 2.0

Big data integration without writing code

Real-time statistics so developers can test data jobs and get immediate feedback

Connect anything: over 900 connectors, with native support for Hadoop HDFS, HBase, Hive, Pig, Sqoop, Google BigQuery and NoSQL databases

Massive scalability through generated MapReduce, Pig and Hive code


Top 10 Big Data tools/technologies

4. Apache Storm – it’s all about real-time processing!

Storm, a distributed computation framework for event stream processing, began life as a project of BackType, a marketing intelligence company bought by Twitter in 2011

Twitter soon open-sourced the project and put it on GitHub, but Storm ultimately moved to the Apache Incubator and became an Apache top-level project in September 2014

Apache Storm is getting ready to take on IoT
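Event stream processing of the kind described above can be sketched with plain Python generators. Storm itself organizes work into spouts (stream sources) and bolts (processing steps); the code below only mimics that shape and does not use Storm's API. All function names, the temperature readings, and the threshold are hypothetical.

```python
from collections import Counter


def sensor_spout(readings):
    """Spout stand-in: emits one event at a time from a source."""
    for device_id, temperature in readings:
        yield {"device": device_id, "temp": temperature}


def threshold_bolt(events, limit):
    """Bolt stand-in: passes on only the events above the limit."""
    for event in events:
        if event["temp"] > limit:
            yield event


def count_bolt(events):
    """Bolt stand-in: keeps a running alert count per device."""
    totals = Counter()
    for event in events:
        totals[event["device"]] += 1
        yield event["device"], totals[event["device"]]


# Hypothetical readings: (device id, temperature).
readings = [("a", 71), ("b", 95), ("a", 99), ("b", 97), ("a", 60)]
alerts = list(count_bolt(threshold_bolt(sensor_spout(readings), limit=90)))
```

Each event flows through the chain as soon as it is emitted, which is the essential difference from batch processing, where results appear only after the whole input has been read.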


Top 10 Big Data tools/technologies

5. Lumify

Open source under the Apache 2.0 license

Map integration: lets users integrate their preferred GIS solution

Graph visualization: analyze relationships, automatically discover paths between entities, and establish new links in 2D or 3D

Live, shared workspaces: organize work into separate workspaces that users can share with colleagues; updates are pushed to all users viewing the workspace in real time

Fine-grained security: protect data with separate access controls on entire entities, individual properties, and each relationship


Top 10 Big Data tools/technologies

6. Apache Hive

Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis

Initially developed by Facebook

Query language: HiveQL

Execution engines: MapReduce, Tez, Spark

Data stored in HDFS or HBase

Use cases: data mining, analytics, machine learning, ad hoc analysis
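HiveQL is a SQL-like language, so the kind of summarization query Hive is built for can be illustrated with Python's built-in `sqlite3` module as a stand-in. This sketch does not connect to Hive; the `page_views` table, its columns, and the rows are made up for the example. A query of this shape (GROUP BY with COUNT and SUM) is typical of what would be submitted to Hive over data in HDFS.

```python
import sqlite3

# In-memory SQLite database used purely as a stand-in for a Hive table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE page_views (user_id TEXT, country TEXT, views INTEGER)"
)
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?, ?)",
    [("u1", "US", 3), ("u2", "US", 5), ("u3", "AU", 2)],  # hypothetical rows
)

# Summarize users and total views per country.
summary = conn.execute(
    "SELECT country, COUNT(*) AS users, SUM(views) AS total_views "
    "FROM page_views GROUP BY country ORDER BY country"
).fetchall()
conn.close()
```

The practical difference is scale: Hive compiles such a query into jobs for one of its execution engines and runs them across a cluster, rather than in a single process.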


NoSQL Databases skills index


Top 10 Big Data tools/technologies

7. MongoDB

First developed in 2007 by 10gen (now MongoDB Inc.); the company shifted to open source in 2009 and offers commercial support and other services

First choice of NoSQL developers because it’s easy to learn

Not a one-trick pony: a balanced approach that supports a wide variety of applications

Suitable for OLTP workloads, not necessarily for reporting-style workloads

Simplicity makes it a great start

The most widely adopted document store DB
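The document-store idea can be sketched in plain Python: records are nested documents with no fixed schema, and queries match on field values. The `matches` and `find` helpers below are hypothetical and are not the MongoDB driver API; they only borrow MongoDB's dotted-path convention for reaching into nested documents. The customer records are invented for the example.

```python
def matches(document, query):
    """Return True if every queried field equals the document's value.

    Dotted keys such as "address.city" reach into nested documents.
    """
    for path, expected in query.items():
        value = document
        for key in path.split("."):
            if not isinstance(value, dict) or key not in value:
                return False
            value = value[key]
        if value != expected:
            return False
    return True


def find(collection, query):
    """Return all documents in the collection that match the query."""
    return [doc for doc in collection if matches(doc, query)]


# Hypothetical documents; note that each one has a different shape.
customers = [
    {"name": "Ana", "address": {"city": "Carlsbad"}, "tags": ["vip"]},
    {"name": "Ben", "address": {"city": "San Diego"}},
    {"name": "Cy", "phone": "555-0100"},
]
result = find(customers, {"address.city": "Carlsbad"})
```

Because documents in the same collection need not share fields, new attributes can be added without a schema migration, which is a large part of the ease of learning noted above.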


Top 10 Big Data tools/technologies

8. Apache Cassandra

Development simplicity (MongoDB) vs. operational simplicity (Cassandra)

MongoDB gets credit for an easy out-of-the-box experience; Cassandra earns full marks for being easy to manage at scale

Apple is one of the largest production deployments, with over 75,000 nodes storing over 10 PB of data

Other large Cassandra installations include Netflix (2,500 nodes, 420 TB, over 1 trillion requests per day), Chinese search engine Easou (270 nodes, 300 TB, over 800 million requests per day), and eBay (over 100 nodes, 250 TB)


Top 10 Big Data tools/technologies

9. Apache HBase

A column-oriented key-value store that gets a lot of use because of its common pedigree with Hadoop

Highly scalable, modeled after Google’s Bigtable

Users include Facebook (messaging platform), LinkedIn, Sophos, and Spotify

Data is readily available to users and applications via SQL queries (using Cloudera Impala, Apache Phoenix, or Apache Hive)
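The Bigtable-style data model behind a column-oriented key-value store can be sketched as nested maps: a row key leads to columns named `family:qualifier`, and each cell keeps timestamped versions. The `TinyColumnStore` class below is a hypothetical toy, not the HBase client API, and the row keys and values are invented for the example.

```python
from collections import defaultdict


class TinyColumnStore:
    """Toy model: row key -> "family:qualifier" -> [(timestamp, value)]."""

    def __init__(self):
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, row_key, column, value, timestamp):
        """Add a new version of a cell; older versions are kept."""
        self._rows[row_key][column].append((timestamp, value))

    def get(self, row_key, column):
        """Return the most recent value for a cell, or None if absent."""
        versions = self._rows.get(row_key, {}).get(column, [])
        if not versions:
            return None
        return max(versions, key=lambda version: version[0])[1]

    def scan(self, start_key, stop_key):
        """Yield row keys in sorted order within [start_key, stop_key)."""
        for key in sorted(self._rows):
            if start_key <= key < stop_key:
                yield key


store = TinyColumnStore()
store.put("user#001", "info:email", "old@example.com", timestamp=1)
store.put("user#001", "info:email", "new@example.com", timestamp=2)
store.put("user#002", "info:name", "Ben", timestamp=1)
latest = store.get("user#001", "info:email")
rows = list(store.scan("user#001", "user#002"))
```

Keeping rows sorted by key is what makes range scans cheap in this model, so row-key design tends to matter more than it does in a relational table.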


Top 10 Big Data tools/technologies

10. Your pick!


Benefits of Open Source

Ease of access: Instantly accessible, an option for limited budgets, enables immediate progress

Low investment entry point: Good support and assistance are available from within the developer community through online forums, chat rooms and developer networks. Support and maintenance from vendors such as H2O, Dato, Databricks and DataStax cost less than from commercial proprietary vendors

Growing base of skills: A lot of training is available online and through meetup groups and seminars, and the community encourages constant learning and training

Professional satisfaction: Developers are typically comfortable with, and enjoy using, tools and frameworks to craft tailor-made analytic solutions. They can participate in and contribute toward the open source community


Benefits of Open Source

Flexibility: Foster analytic agility and avoid vendor lock-in

Innovation: Investment in learning is not wasted, even if the specific model does not deliver an immediate outcome

Cutting-edge capabilities: Cutting-edge approaches, such as new ensemble techniques and deep learning capabilities, are sometimes found in open-source solutions years before they are put into commercial software


Benefits of Open Source

Compatibility with open features by commercial vendors: Many vendors are already incorporating compatibility with popular open-source languages, interfaces, analytics libraries and packages, thereby offering more flexibility to their enterprise analytics platforms. Examples include Datameer, IBM, Microsoft Azure, Oracle, SAP, Tableau, Teradata and Tibco Software

Avoiding large IT vendors:

They typically create a large (costly) footprint

Skills training in a special product configuration becomes increasingly scarce and expensive

If working with a large vendor, you are locked into its product roadmap


Go “BiModal” – says Gartner!

Combine corporate software with open-source software to be able to support both bimodal Mode 1 (engineered) and Mode 2 (innovative) approaches

Make investment decisions for advanced analytics capability based on overall ROI and TCO, not only initial capital purchase costs
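A back-of-the-envelope comparison shows why TCO and ROI can rank options differently from purchase price alone. Every figure below is invented for illustration, and the two helper functions are simplified assumptions (flat annual costs, no discounting), not a method prescribed by Gartner.

```python
def total_cost_of_ownership(initial, annual_costs, years):
    """Initial purchase cost plus recurring costs over the period."""
    return initial + sum(annual_costs.values()) * years


def simple_roi(total_benefit, total_cost):
    """Return on investment as a fraction of total cost."""
    return (total_benefit - total_cost) / total_cost


# Hypothetical option A: open source, no purchase price, more staff time.
tco_a = total_cost_of_ownership(
    initial=0,
    annual_costs={"support": 40_000, "staff": 120_000},
    years=3,
)

# Hypothetical option B: commercial license, less staff time.
tco_b = total_cost_of_ownership(
    initial=150_000,
    annual_costs={"licenses": 60_000, "staff": 80_000},
    years=3,
)

# Assume, for the sake of the example, the same benefit from either option.
roi_a = simple_roi(total_benefit=600_000, total_cost=tco_a)
roi_b = simple_roi(total_benefit=600_000, total_cost=tco_b)
```

With these made-up inputs the option with no purchase price still carries a substantial three-year cost, which is why the comparison should cover the whole period rather than the initial outlay.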


Quick 1 page strategy


Get Inspired

Look Outside

Agree Inside

Big data use cases usually center around four types:

1. Operational excellence: Using big data to improve operations

2. Customer intimacy: Delivering a superior experience, aka Amazonification

3. Risk management: Mitigating operational, reputational, financial and strategic risks, including fraud detection

4. New business development: Introducing new products and services


Get Going

How do you explain to someone who has never eaten an orange how it tastes? It is far easier if you just give them one.

Start with skills: Build a small team

Try techniques and technologies: Be pragmatic about investments.

Start with the free versions of open-source software (you can move to a managed version later), and with a straightforward data lake as the basis for the data.

Use existing hardware or go to the cloud, and either run the initiative under the radar, or use a small amount of portfolio funding to seed a number of experiments.

Anticipate that some of these efforts will lead to no results.


Get Organized

Create the right architecture: Use the concept of the "logical data warehouse." In almost all cases, big data implementations complement the data warehouse instead of replacing it

Create a governance model: Organizing too early takes forever and eliminates the experimentation effect. But implementing governance, and the process for taking results into production, too late leads to yet another disconnected stovepipe and hurts user adoption


Next Steps


Questions? Connect with me!

Email: [email protected]

Cell: 408.636.3772

LinkedIn: https://www.linkedin.com/in/jadhavswapnil