Big Data and Open Source
- Swapnil (Neil) Jadhav
Swapnil (Neil) Jadhav | [email protected] | 408.636.3772
Agenda
Introduction
Key strategic challenges for CDOs/CAOs
Key operational challenges for CDOs/CAOs
Top 10 big data tools and technologies
Why open source?
1-page strategy to implement big data programs (Source: Gartner)
Next steps
Introduction
Current : Head of Business Intelligence & Analytics for the City of Carlsbad
Previously : Neil has provided technical and organizational leadership in the areas of big data and statistical analysis, database management, data mining, data architecture, and data warehouse design. He has experience in various industries.
Organizations : Large consulting firms, dynamic startup organizations, Fortune 500 companies, government organizations
Industries :
Oil & Gas – BP (formerly British Petroleum)
Hi-Tech – Adobe, Fujitsu
Health & Fitness – Beachbody LLC
FMCG – Cadbury, Australia
State & local government – City of Carlsbad
Key strategic challenges for a CDO/CAO
Identify and communicate the business context for data within big data analytic projects
Move from “cool experiments” to driving business value
Use analytics and information governance to develop a culture of evidence-based decision making
Information risk management
Key operational challenges
New technologies require an experimental approach - it's a learning exercise
Repeatability is the new demand in big data
Getting the right tools and skills in place
Implement self-service data preparation tools that can accelerate the shift towards business-user-generated data discovery and advanced analytics
Reduce the time and complexity of preparing data for analysis
Big Data tools & technologies (non open source)
‘Open Source’ is the new normal
Top 10 Big Data tools/technologies
#1
1. Apache Spark™
Runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk
Developed at UC Berkeley’s Algorithms, Machines and People Lab (AMPLab) in 2009, open-sourced in 2010, and donated to the Apache Software Foundation in 2013
In-memory processing vs. Hadoop’s two-stage, disk-based MapReduce
IBM will invest $300 million, 3,500 developers, and over a dozen of its labs worldwide in Spark-related projects over the next few years
Latest stable release: 1.6, January 4th 2016
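To make the two-stage model concrete, here is a minimal sketch in plain Python (it is not Spark or Hadoop code) of a word count split into a map stage and a reduce stage. The function names map_stage, shuffle and reduce_stage are illustrative assumptions. In Hadoop MapReduce the intermediate pairs are written to disk between the stages; an in-memory engine such as Spark avoids much of that disk traffic.

```python
from collections import defaultdict


def map_stage(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs):
    """Group values by key; this is the intermediate data MapReduce keeps on disk."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_stage(groups):
    """Reduce: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}


def word_count(lines):
    """Chain the stages: map, then shuffle, then reduce."""
    return reduce_stage(shuffle(map_stage(lines)))


if __name__ == "__main__":
    sample = ["big data and open source", "open source is the new normal"]
    print(word_count(sample))
```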
Top 10 Big Data tools/technologies
#2
2. R
Needs no explanation of why it made this list
One of the highest-paid skills
Most-used data science language after SQL
Used by 70% of data miners
Growing faster than any other data science language
#1 in Google searches for advanced analytics software
More than 2 million users worldwide
7,829 packages available for use
#1 choice for new graduates
Top 10 Big Data tools/technologies
#3
3. Talend Open Studio
#1 integration solution to offer GUI support for YARN 2.0
Big data integration without writing code
Real-time statistics for developers to test data jobs and get immediate statistics
Connect anything, with over 900 connectors and native support for Hadoop HDFS, HBase, Hive, Pig, Sqoop, Google BigQuery and NoSQL databases
Massive scalability that offers MapReduce, Pig and Hive code
Top 10 Big Data tools/technologies
#4
4. Apache Storm – it’s all about real-time processing!
Storm, a distributed computation framework for event stream processing, began life as a project of BackType, a marketing intelligence company bought by Twitter in 2011
Twitter soon open-sourced the project and put it on GitHub; Storm later moved to the Apache Incubator and became an Apache top-level project in September 2014
Apache Storm is getting ready to take on IoT
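As a rough illustration of event stream processing, the sketch below chains Python generators so that each event is handled as it arrives rather than in a batch. It borrows Storm's terms "spout" (stream source) and "bolt" (processing step) for the function names, but it is plain Python and does not use the Storm API; the sensor readings and the threshold are hypothetical.

```python
from collections import Counter


def sensor_spout(readings):
    """Stream source (a 'spout' in Storm terms): yield one event at a time."""
    for device_id, temperature in readings:
        yield {"device": device_id, "temp": temperature}


def alert_bolt(events, threshold):
    """Processing step (a 'bolt'): pass on only events above the threshold."""
    for event in events:
        if event["temp"] > threshold:
            yield event


def count_bolt(events):
    """Final bolt: keep a running count of alerts per device."""
    counts = Counter()
    for event in events:
        counts[event["device"]] += 1
    return dict(counts)


def run_topology(readings, threshold=30.0):
    """Wire spout -> alert bolt -> count bolt, like a very small topology."""
    return count_bolt(alert_bolt(sensor_spout(readings), threshold))


if __name__ == "__main__":
    stream = [("a", 25.0), ("a", 31.5), ("b", 40.0), ("a", 35.0), ("b", 29.9)]
    print(run_topology(stream))
```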
Top 10 Big Data tools/technologies
#5
5. Lumify
Open source under the Apache 2.0 license
Map Integration that allows users to integrate their preferred GIS solution
Graph Visualization to analyze relationships, automatically discover paths between entities, and establish new links in 2D or 3D
Live, Shared Workspaces to organize work into separate workspaces that users can share with colleagues; updates are pushed to all users viewing the workspace in real time
Fine-Grained Security to protect data with separate access controls on entire entities, individual properties, and each relationship
Top 10 Big Data tools/technologies
#6
6. Apache Hive
Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis
Initially developed by Facebook
Query language : HiveQL (SQL-like)
Execution environment : MapReduce, Tez, Spark
Data in HDFS or HBase
Data mining, analytics, machine learning, ad hoc analysis
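HiveQL is a SQL-like language, so a summarization query reads much like standard SQL. The sketch below runs such a query with Python's built-in sqlite3 module purely as a local stand-in; it does not connect to Hive, and the table page_views and its columns are hypothetical. On a cluster, Hive would turn a similar query into MapReduce, Tez or Spark jobs.

```python
import sqlite3

# A summarization query of the kind HiveQL is used for (hypothetical schema).
SUMMARY_QUERY = """
SELECT country, COUNT(*) AS views, AVG(duration_sec) AS avg_duration
FROM page_views
GROUP BY country
ORDER BY views DESC
"""


def summarize(rows):
    """Load (country, duration_sec) rows into an in-memory table and summarize them."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE page_views (country TEXT, duration_sec REAL)")
    conn.executemany("INSERT INTO page_views VALUES (?, ?)", rows)
    result = conn.execute(SUMMARY_QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in summarize([("US", 10.0), ("US", 20.0), ("IN", 30.0)]):
        print(row)
```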
NoSQL Databases skills index
Top 10 Big Data tools/technologies
#7
7. MongoDB
First developed in 2007 by 10gen (now MongoDB Inc.); the company shifted to an open source model in 2009, offering commercial support and other services
First choice of NoSQL developers because it’s easy to learn
Not a one-trick pony: a balanced approach that supports a wide variety of applications
Suitable for OLTP workloads, not necessarily for reporting-style workloads
Simplicity makes it a great start
The most widely adopted document store DB
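For readers new to document stores, here is a minimal sketch of the idea in plain Python: each record is a self-contained, JSON-like document, and documents in the same collection need not share the same fields. The find helper and the sample orders are hypothetical and are not the MongoDB driver API.

```python
def find(collection, **criteria):
    """Return the documents whose top-level fields equal all given criteria."""
    return [
        doc for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


if __name__ == "__main__":
    # Two documents in one collection; only the second has a "coupon" field.
    orders = [
        {"_id": 1, "customer": "Ada", "status": "shipped",
         "items": [{"sku": "A1", "qty": 2}]},
        {"_id": 2, "customer": "Bob", "status": "open", "coupon": "SPRING",
         "items": [{"sku": "B7", "qty": 1}]},
    ]
    print(find(orders, status="open"))
```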
Top 10 Big Data tools/technologies
#8
8. Apache Cassandra
Development simplicity (MongoDB) vs. operational simplicity (Cassandra)
MongoDB gets credit for an easy out-of-the-box experience; Cassandra earns full marks for being easy to manage at scale
Apple is one of the largest production deployments, with over 75,000 nodes storing over 10 PB of data
Other large Cassandra installations include Netflix (2,500 nodes, 420 TB, over 1 trillion requests per day), Chinese search engine Easou (270 nodes, 300 TB, over 800 million requests per day), and eBay (over 100 nodes, 250 TB)
Top 10 Big Data tools/technologies
#9
9. Apache HBase
A column-oriented key-value store that gets a lot of use because of its common pedigree with Hadoop
Highly scalable, modeled after Google’s Bigtable
Used by Facebook (messaging platform), LinkedIn, Sophos, Spotify
Data is readily available to users and applications via SQL queries (using Cloudera Impala, Apache Phoenix, or Apache Hive)
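The data model behind a column-oriented key-value store can be sketched in a few lines of plain Python: a row key maps to columns named "family:qualifier", and each cell keeps timestamped versions. The class TinyColumnStore and its methods are a toy, in-memory assumption for illustration and are not the HBase client API.

```python
import time
from collections import defaultdict


class TinyColumnStore:
    """Toy model: row key -> 'family:qualifier' -> list of (timestamp, value)."""

    def __init__(self):
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, row_key, column, value, timestamp=None):
        """Store a new version of a cell; earlier versions are kept."""
        ts = time.time() if timestamp is None else timestamp
        self._rows[row_key][column].append((ts, value))

    def get(self, row_key, column):
        """Return the most recent value of a cell, or None if it is absent."""
        versions = self._rows.get(row_key, {}).get(column)
        if not versions:
            return None
        return max(versions, key=lambda version: version[0])[1]

    def scan(self, start_row, stop_row):
        """Yield row keys in sorted order within [start_row, stop_row)."""
        for key in sorted(self._rows):
            if start_row <= key < stop_row:
                yield key


if __name__ == "__main__":
    store = TinyColumnStore()
    store.put("user#001", "info:name", "Ada", timestamp=1)
    store.put("user#001", "info:name", "Ada L.", timestamp=2)
    print(store.get("user#001", "info:name"))
```

Keeping row keys sorted is what makes range scans cheap in this style of store, which is why row key design matters so much in practice.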
Top 10 Big Data tools/technologies
#10
10. Your pick!
Benefits of Open Source
Ease of access : Instantly accessible, limited budget option, enables immediate progress
Low investment entry point : Good support and assistance available from within the developer community through online forums, chat rooms and developer networks. Low cost of support and maintenance (e.g. H2O, Dato, Databricks, DataStax) compared to commercial proprietary vendors
Growing base of skills : A lot of training is available online, and meetup groups, seminars, and the community encourage constant learning and training
Professional satisfaction : Developers are typically comfortable with, and enjoy using, tools and frameworks to craft tailor-made analytic solutions. They can participate in and contribute toward the open source community
Benefits of Open Source
Flexibility : Foster analytic agility and avoid vendor lock-in
Innovation : Investment in learning is not wasted, even if the specific model does not deliver an immediate outcome
Cutting-edge capabilities : Cutting-edge approaches, such as new ensemble techniques and deep learning capabilities, are sometimes found in open-source solutions years before they are put into commercial software
Benefits of Open Source
Compatibility with open-source features from commercial vendors : Many vendors already incorporate compatibility with popular open-source languages, interfaces, analytics libraries and packages, thereby adding flexibility to their enterprise analytics platforms. Examples include Datameer, IBM, Microsoft Azure, Oracle, SAP, Tableau, Teradata and Tibco Software
Avoiding large IT vendors : They typically create a large (costly) footprint
Skills training in a special product configuration becomes increasingly scarce and expensive
If working with a large vendor, you are locked into its product roadmap
Go “BiModal” – says Gartner!
Combine corporate software with open-source software to be able to support both bimodal Mode 1 (engineered) and Mode 2 (innovative) approaches
Make investment decisions for advanced analytics capability based on overall ROI and TCO, not only initial capital purchase costs
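A small worked example of why initial purchase cost alone can mislead. All figures below are hypothetical and chosen only to show the arithmetic; the formulas used (TCO = initial cost + yearly running costs × years, ROI = (benefit − TCO) / TCO) are a common simplification, not Gartner's method.

```python
def total_cost_of_ownership(initial_cost, annual_costs, years):
    """TCO = initial purchase cost + all yearly running costs over the period."""
    return initial_cost + sum(annual_costs.values()) * years


def return_on_investment(total_benefit, tco):
    """ROI expressed as a fraction of the total cost."""
    return (total_benefit - tco) / tco


if __name__ == "__main__":
    years = 3
    benefit = 2_000_000  # hypothetical business value over the same period

    # Hypothetical figures in USD; not real vendor prices.
    options = {
        "commercial": total_cost_of_ownership(
            500_000, {"support": 100_000, "staff": 200_000}, years),
        "open source": total_cost_of_ownership(
            0, {"support": 50_000, "staff": 350_000, "training": 50_000}, years),
    }
    for name, tco in options.items():
        roi = return_on_investment(benefit, tco)
        print(f"{name}: TCO = {tco:,}  ROI = {roi:.0%}")
```

With these made-up numbers the gap in initial cost is $500,000, but the gap in 3-year TCO is only $50,000, so the comparison has to be made on the full period.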
Quick 1 page strategy
Get Inspired
Look Outside
Agree Inside
Big data use cases usually center around four types:
1. Operational excellence: Using big data to improve operations
2. Customer intimacy: Delivering a superior experience, aka Amazonification
3. Risk management: Mitigating operational, reputational, financial and strategic risks, including fraud detection
4. New business development: Introducing new products and services
Get Going
How do you explain to someone who has never eaten an orange how it tastes? It is far easier if you just give them one.
Start with Skills : Build a small team
Try Techniques and Technologies : Be pragmatic about investments.
Start with the free versions of open-source software (you can move to a managed version later), and with a straightforward data lake as the basis for the data.
Use existing hardware or go to the Cloud, and either run the initiative under the radar, or use a small amount of portfolio funding to seed a number of experiments.
Anticipate that some of these efforts will lead to no results.
Get Organized
Create the right architecture - Use the concept of the "logical data warehouse." In almost all cases, big data implementations complement the data warehouse instead of replacing it
Create a governance model
Organizing too early takes forever and eliminates the experimentation effect. But implementing governance, and the process of taking results into production, too late leads to yet another disconnected stovepipe and hurts user adoption
Next Steps
Questions? Connect with me!
Email : [email protected]
Cell : 408.636.3772
LinkedIn : https://www.linkedin.com/in/jadhavswapnil