Big Data is the evolution of supercomputing for commercial enterprises and governments. Originally the domain of companies operating at Internet scale, Big Data today lets organizations of all sizes discover patterns in their data and gain insight into their business. But understanding the differences among the plethora of new technologies can be daunting. Graph, columnar, key-value, and document stores are all called NoSQL, but which is best? How does Hadoop fit into this ecosystem? Its low cost and high efficiency have made it very popular, but where does it belong?
In this webinar, we will explore:
• The full spectrum of Big Data
• Hadoop and MongoDB: friends or frenemies?
• Differences between Systems of Record and Systems of Engagement
• MongoDB customer examples of Systems of Engagement
Hadoop & MongoDB: Understanding your Big Data
MongoDB World
Jnan Dash
• Last 12 years (2002-now) – Executive consultant; on the boards and advisory boards of several new software companies, including Big Data players such as MongoDB
• 10 years (1992-2002) – Oracle, Group Vice President, Systems Architecture and Technology; responsible for server product planning and rollout
• 16 years (1975-1992) – IBM; planner, architect, and development manager for the DB2 product line at Silicon Valley Lab and Austin Lab; head of IBM’s database architecture, strategy, and technology
Why am I excited about MongoDB?
• Finally, some real innovation in DBMS
• MongoDB momentum is unprecedented!
• The changing landscape needs MongoDB: “Internet scale” distributed operations + highly flexible data model for agile development + open source
• Perfect fit for cloud, mobility, and big data
Agenda
• Big Data - Observations
• Evolution of Database Technology
• Hadoop+MongoDB
• Customer Examples
• Roadmap
• Summary
The Fourth Paradigm
1. Thousand years ago – Experimental Science: description of natural phenomena
2. Last few hundred years – Theoretical Science: Newton’s Laws, Maxwell’s Equations, ...
3. Last few decades – Computational Science: simulation of complex phenomena
4. Today – Data-intensive Science: scientists overwhelmed with data deluge
Unify theory, experiment & simulation
Internet Scale Commercial Supercomputing
• Originated with companies operating at Internet scale (to process ever-increasing numbers of users and volumes of data)
– Yahoo in the 1990s, then Google, Facebook, Twitter
– They needed to do it quickly and affordably at scale
• Hadoop is the first commercial supercomputing software platform
– Works at scale, affordable at scale
• HPC was used for scientific supercomputing in meteorology and engineering. Big Data is the commercial equivalent of HPC
– Less about equations, more about discovery, patterns
• Many of the technologies have been around for decades
– Clustering
– Parallel processing
– Distributed file systems
Big Data: 3V’s
Some Make it 4V’s
What’s driving Big Data
- Ad-hoc querying and reporting
- Data mining techniques
- Structured data, typical sources
- Small to mid-size datasets

- Optimizations and predictive analytics
- Complex statistical analysis
- All types of data, and many sources
- Very large datasets
- More of a real-time
Big Data – the full spectrum
Categories: Transaction Processing; Analytical Processing; Data Mining, Visualization, and Integration Tools
Technologies: RDBMS; OLAP/DW; DW Appliance; Hadoop, Impala,..; NoSQL; NewSQL, In-Memory, Stream...
Modes: Online/Realtime; Offline/Batch
Hadoop Ecosystem
Core Apache Hadoop
• HDFS (Hadoop Distributed File System)
• MapReduce (Distributed Programming Framework)
Related Apache Projects
• Hive (SQL)
• Pig (Data Flow)
• HBase (Wide Column Storage)
• HCatalog (Meta Data)
• Zookeeper (Coordination)
• HMS (Management)
Layers: Programming Languages, Computation, Table Storage, Object Storage
Database Technology Evolution
Data Management over the years
• 1960’s: File Systems
• 1970’s: 1st Generation DBMS; Data as Shared Resource
• 1980’s: Relational Technology; Ease of Query
• 1990’s: New data types; OLAP/DW; Web Support; Unstructured Data
• 2005+: Big Data; Post-PC, Data Deluge, 3Vs, NoSQL
Operational vs. Analytics
• 1990: RDBMS
• 2000: RDBMS; OLAP/DW
• 2010: RDBMS; Key-Value/Wide-column; OLAP/DW; Hadoop
Other labels: Operational Database; Data warehouse; Document DB; NoSQL
MongoDB Features
• JSON Document Model with Dynamic Schemas
• Auto-Sharding for Horizontal Scalability
• Text Search
• Aggregation Framework and MapReduce
• Full, Flexible Index Support and Rich Queries
• Native Replication for High Availability
• Advanced Security
• Large Media Storage with GridFS
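To make "Auto-Sharding for Horizontal Scalability" concrete, here is a minimal sketch of the general idea behind hash-based sharding: each document is routed to one of several partitions by hashing a chosen key. This is an illustration only, not MongoDB's actual chunk-based implementation; the function names, the `device_id` key, and the shard count are all hypothetical.

```python
import hashlib

def shard_for(shard_key, num_shards):
    """Map a key to a shard index in [0, num_shards) using a stable hash."""
    digest = hashlib.md5(str(shard_key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

def distribute(docs, key_field, num_shards):
    """Place each document in the list belonging to its shard."""
    shards = [[] for _ in range(num_shards)]
    for doc in docs:
        shards[shard_for(doc[key_field], num_shards)].append(doc)
    return shards

# Hypothetical machine readings spread over 4 hypothetical shards.
readings = [{"device_id": f"device-{i}", "value": i} for i in range(1000)]
shards = distribute(readings, "device_id", 4)
```

Because the hash is stable, the same key always lands on the same shard, so a lookup by key needs to visit only one partition.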
Documents are Rich Data Structures
{
  first_name: 'Paul',
  surname: 'Miller',
  cell: '+447557505611',
  city: 'London',
  location: [45.123, 47.232],
  Profession: ['banking', 'finance', 'trader'],
  cars: [
    { model: 'Bentley', year: 1973, value: 100000, … },
    { model: 'Rolls Royce', year: 1965, value: 330000, … }
  ]
}
• Fields
• Typed field values: String, Number, Geo-Coordinates
• Fields can contain arrays
• Fields can contain an array of sub-documents
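As a rough illustration of the points on this slide, the same document can be written as a plain Python dictionary. This is only a sketch: the fields elided with "…" on the slide are left out, and the `describe` helper is a hypothetical name introduced here.

```python
import json

# Python dict mirroring the slide's document ("…" fields omitted).
person = {
    "first_name": "Paul",                              # string
    "surname": "Miller",
    "cell": "+447557505611",
    "city": "London",
    "location": [45.123, 47.232],                      # coordinate pair
    "Profession": ["banking", "finance", "trader"],    # array of strings
    "cars": [                                          # array of sub-documents
        {"model": "Bentley", "year": 1973, "value": 100000},
        {"model": "Rolls Royce", "year": 1965, "value": 330000},
    ],
}

def describe(doc):
    """Return the Python type name of each top-level field."""
    return {field: type(value).__name__ for field, value in doc.items()}

shape = describe(person)
car_models = [car["model"] for car in person["cars"]]
round_trip = json.loads(json.dumps(person))  # the document survives a JSON round trip
```

The nested `cars` list is what a relational design would typically split into a second table; here it travels with its owner.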
Machine Generated Data
Fast Moving Data
• Hundreds of thousands of records per second
• Fast response required
• Sometimes all data kept, sometimes just summary
• Horizontal scalability required
Data is Structured, but Varied…
• A machine generates a specific kind of data
• The data model is unlikely to change
• But there are so many different machines…
• Queryability across all types
Time Series Data
• Event data written multiple times per second, minute, or hour
• Tracking progression of metrics over time
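One common way to handle such event streams, when only a summary needs to be kept, is to roll individual readings up into one summary document per source per minute. The sketch below shows that idea in plain Python; the field names (`sensor`, `ts`, `value`), the function name, and the sample readings are all hypothetical and are not taken from the talk.

```python
from datetime import datetime

def summarize_per_minute(events):
    """Roll raw events up into one summary document per sensor per minute."""
    buckets = {}
    for event in events:
        minute = event["ts"].replace(second=0, microsecond=0)
        key = (event["sensor"], minute)
        bucket = buckets.setdefault(key, {
            "sensor": event["sensor"], "minute": minute,
            "count": 0, "total": 0.0,
            "min": event["value"], "max": event["value"],
        })
        bucket["count"] += 1
        bucket["total"] += event["value"]
        bucket["min"] = min(bucket["min"], event["value"])
        bucket["max"] = max(bucket["max"], event["value"])
    return sorted(buckets.values(), key=lambda b: (b["sensor"], b["minute"]))

# Hypothetical readings: two in the first minute, one in the next.
events = [
    {"sensor": "s1", "ts": datetime(2020, 1, 1, 9, 0, 5), "value": 20.0},
    {"sensor": "s1", "ts": datetime(2020, 1, 1, 9, 0, 35), "value": 22.0},
    {"sensor": "s1", "ts": datetime(2020, 1, 1, 9, 1, 5), "value": 21.0},
]
summaries = summarize_per_minute(events)
```

Keeping count and total (rather than a precomputed mean) lets later readings update the same summary without revisiting the raw events.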
Do More With Your Data
MongoDB
Rich Queries
• Find Paul’s cars
• Find everybody in London with a car built between 1970 and 1980
Geospatial
• Find all of the car owners within 5km of Trafalgar Sq.
Text Search
• Find all the cars described as having leather seats
Aggregation
• Calculate the average value of Paul’s car collection
Map Reduce
• What is the ownership pattern of colors by geography over time? (is purple trending up in China?)
{
  first_name: 'Paul',
  surname: 'Miller',
  city: 'London',
  location: [51.524, -0.087],
  cars: [
    { model: 'Bentley', year: 1973, value: 100000, … },
    { model: 'Rolls Royce', year: 1965, value: 330000, … }
  ]
}
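To show what several of these questions compute, the sketch below answers them in plain Python over an in-memory list, with no database involved. Assumptions: the second person ("Ana") is hypothetical sample data, the helper names are invented here, the location pair is read as [latitude, longitude], and the Trafalgar Square coordinates are approximate. The one filter document is written in MongoDB query syntax (`$elemMatch`, `$gte`, `$lte`) for comparison only and is never sent to a server.

```python
import math

# Paul's document is from the slide ("…" fields omitted); Ana is hypothetical.
people = [
    {"first_name": "Paul", "surname": "Miller", "city": "London",
     "location": [51.524, -0.087],
     "cars": [{"model": "Bentley", "year": 1973, "value": 100000},
              {"model": "Rolls Royce", "year": 1965, "value": 330000}]},
    {"first_name": "Ana", "surname": "Silva", "city": "Lisbon",
     "location": [38.722, -9.139],
     "cars": [{"model": "Mini", "year": 1975, "value": 15000}]},
]

# MongoDB-style filter for "in London with a car built 1970-1980" (not executed here).
LONDON_1970S_FILTER = {
    "city": "London",
    "cars": {"$elemMatch": {"year": {"$gte": 1970, "$lte": 1980}}},
}

def cars_of(docs, first_name):
    """Rich query: all cars owned by people with this first name."""
    return [car for d in docs if d["first_name"] == first_name for car in d["cars"]]

def owners_in_city_with_car_built_between(docs, city, first_year, last_year):
    """Rich query: people in a city owning at least one car from the year range."""
    return [d for d in docs
            if d["city"] == city
            and any(first_year <= car["year"] <= last_year for car in d["cars"])]

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = (math.radians(x) for x in a)
    lat2, lon2 = (math.radians(x) for x in b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

TRAFALGAR_SQ = (51.508, -0.128)  # approximate (lat, lon); an assumption

def owners_within_km(docs, centre, radius_km):
    """Geospatial: car owners whose location is within radius_km of centre."""
    return [d for d in docs
            if d["cars"] and haversine_km(d["location"], centre) <= radius_km]

def average_car_value(docs, first_name):
    """Aggregation: mean value of one person's cars, or None if they have none."""
    values = [car["value"] for car in cars_of(docs, first_name)]
    return sum(values) / len(values) if values else None

pauls_cars = cars_of(people, "Paul")
londoners = owners_in_city_with_car_built_between(people, "London", 1970, 1980)
nearby = owners_within_km(people, TRAFALGAR_SQ, 5)
average = average_car_value(people, "Paul")
```

In a real deployment these loops would be replaced by indexed queries on the server; the point here is only to pin down what each question asks of the data.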
Hadoop & MongoDB
Enterprise Big Data Stack
• Applications: CRM, ERP, Collaboration, Mobile, BI
• Data Management: RDBMS (Online Data); EDW, Hadoop (Offline Data)
• Infrastructure: OS & Virtualization, Compute, Storage, Network
• Across all layers: Management & Monitoring; Security & Auditing
MongoDB & Hadoop
Operational (MongoDB)
• Online, Real-time
• High concurrency & HA
• Live analytics
Analytical (Hadoop)
• Multi-source analytics
• Interactive & Batch
• Data lake
MongoDB Connector for Hadoop
Hadoop Is Good for…
• Risk Modeling
• Churn Analysis
• Recommendation Modeling
• Ad Targeting
• Transaction Analysis
• Trade Surveillance
• Network Failure Prediction
• Search Quality
• Data Lake
MongoDB Is Good for…
• Single View
• Mobile Apps
• Fraud Detection
• Customer Data Management
• Content Management & Delivery
• Database-as-a-Service
• Product & Asset Catalogs
• Internet of Things
• Social & Collaboration
Customer Examples
Many more examples
Big Data
Product & Asset Catalogs
Security & Fraud
Internet of Things
Database-as-a-Service
Mobile Apps
Customer Data Management
Single View
Social & Collaboration
Content Management
Intelligence Agencies
Top Investment and Retail Banks
Top US Retailer
Top Global Shipping Company
Top Industrial Equipment Manufacturer
Top Media Company
Top Investment and Retail Banks
MongoDB Enterprise Value
MongoDB+Hadoop Connector
• Makes MongoDB a Hadoop-enabled file system
• Full use of MongoDB’s indexes
• Read and write to live data, in-place
• Copy data between Hadoop and MongoDB
• Full support for data processing
– Hive
– MapReduce
– Pig
– Streaming
– EMR
MongoDB Connector for Hadoop
Customer Example – MetLife
Customer Service
• Insurance policies
• Demographic data
• Customer web data
• Call center data
• Real-time churn detection
Churn Analysis
• Customer action analysis
• Churn prediction algorithms
MongoDB Connector for Hadoop
Customer Example - eCommerce
Travel
• Flights, hotels and cars
• Real-time offers
• User profiles, reviews
• User metadata (previous purchases, clicks, views)
Algorithms
• User segmentation
• Offer recommendation engine
• Ad serving engine
• Bundling engine
MongoDB Connector for Hadoop
Roadmap
Capability         | Today                | Soon
Connectivity       | Custom               | Centralized Administration
MongoDB → Hadoop   | Dynamic reads        | Automated Snapshots
BSON Support       | MapReduce, Hive, Pig | Impala, Tez, Spark
Hadoop → MongoDB   | Dynamic writes       | Bulk Loader
Summary
• Big Data covers a wide spectrum
– Volume, Velocity, Variety
– Hence the mythical equation Big Data = Hadoop
• Enterprises are more concerned about Variety
– MongoDB provides the best platform
• Hadoop and MongoDB are complementary
– MongoDB for operational workloads
– Hadoop for analytical workloads