© Copyright 2013. Apps Associates LLC. 1
What Next for DBAs in the Big Data Era
February 21st, 2015
Satyendra Kumar Pasalapudi
Associate Practice Director – IMS @ Apps Associates
Co Founder & President of AIOUG
@pasalapudi
© Copyright 2014. Apps Associates LLC. 4
Agenda
• Technology Trends
• Big Data Overview
• Hadoop Basics
• NoSQL Databases
• Big Data SQL
• What Next for DBAs
Cost-effectively manage and analyze all available data in its native form: unstructured, structured, streaming
ERP CRM
RFID
Website
Network Switches
Social Media
Billing
Big data Challenge
History of databases
Pre-computer technologies: printing press, Dewey decimal system, punched cards
Magnetic tape: “flat” (sequential) files
Magnetic Disk
IMS
Relational Model defined
Indexed-Sequential Access Mechanism (ISAM)
Network Model
IDMS
ADABAS
System R
Oracle V2
Ingres
dBase
DB2
Informix
Sybase
SQL Server
Access
Postgres
MySQL
Cassandra
Hadoop
Vertica
Riak
HBase
Dynamo
MongoDB
Redis
VoltDB
Hana
Neo4J
Aerospike
Hierarchical model
1940-50 1950-60 1960-70 1970-80 1980-90 1990-2000 2000-2010
Why?
• 3rd Platform drives new demands on the database:
– Global High Availability
– Data volumes
– Unstructured data
– Transaction rates
– Latency
• A single architecture cannot meet all those demands
Why
Operational RDBMS (Oracle, SQL Server, …)
In-memory Analytics (HANA, Exalytics, …)
In-memory processing (Spark)
Hadoop
Web DBMS (MySQL, Mongo, Cassandra)
ERP & in-house CRM
Analytic/BI software (SAS, Tableau)
Web Server
Data Warehouse RDBMS (Oracle, Teradata, …)
Enterprise Big Data Architecture
The instrumented human
• Bluetooth Personal Area Network
• 3G/WiFi Wide Area Network
• GPS
• Storage
• Pulse, temp monitor
• Silent alarms
• Pedometer, sleep monitoring
• Compass
• Camera
• Mic/earphones
• Heads up display
• Emotion/Attention monitor
Google File System (GFS)
Map Reduce BigTable
Google Applications
Google Software Architecture (circa 2005)
Map Reduce
(Diagram: Start, then many parallel Map tasks, then Reduce)
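The Map and Reduce stages in the diagram can be illustrated with a word count. This is a minimal single-process sketch, not Hadoop code: the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are made up for illustration, and a real cluster would run the map tasks in parallel on many nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over a list of lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data", "Big SQL"])
```

Each call to `map_phase` depends only on its own line, which is what lets the Map boxes in the diagram run independently.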
Hadoop Design Principles
• System shall manage and heal itself
– Automatically and transparently route around failure
– Speculatively execute redundant tasks if certain nodes are detected to be slow
• Performance shall scale linearly
– Proportional change in capacity with resource change
• Compute should move to data
– Lower latency, lower bandwidth
• Simple core, modular and extensible
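The "speculatively execute redundant tasks" principle can be sketched with threads. This is a simplified illustration under stated assumptions: the helper `run_speculatively` and the `node_delays` parameter are hypothetical, and the sketch launches all attempts up front, whereas Hadoop starts a backup attempt only after it detects a slow task.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_speculatively(task, node_delays):
    """Run the same task once per simulated node and return the first result.

    node_delays holds a hypothetical slowdown in seconds for each node.
    Returns a (node_id, result) tuple from whichever attempt finishes first.
    """
    def attempt(node_id, delay):
        time.sleep(delay)  # simulate a slow or healthy node
        return node_id, task()

    pool = ThreadPoolExecutor(max_workers=len(node_delays))
    futures = [pool.submit(attempt, i, d) for i, d in enumerate(node_delays)]
    done, _pending = wait(futures, return_when=FIRST_COMPLETED)
    pool.shutdown(wait=False)  # do not block on the straggler
    return next(iter(done)).result()


# Node 0 is the straggler; node 1 finishes first and its result is used.
winner, value = run_speculatively(lambda: 42, [0.3, 0.01])
```

Both attempts compute the same answer, so taking the first one to finish is safe as long as the task has no side effects.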
Hadoop History
• Dec 2004 – Google MapReduce paper published
• July 2005 – Nutch uses MapReduce
• Feb 2006 – Starts as a Lucene subproject
• Apr 2007 – Yahoo! on 1000-node cluster
• Jan 2008 – An Apache Top Level Project
• Jul 2008 – A 4000 node test cluster
• May 2009 – Hadoop sorts Petabyte in 17 hours
Hadoop Ecosystem
HDFS (Hadoop Distributed File System)
HBase (key-value store)
MapReduce (Job Scheduling/Execution System)
Data Access
Sqoop Flume
Client Access
Hue, Hive (SQL)
Pig (PL/SQL)
ZooKeeper (Coordination)
(Streaming/Pipes APIs)
Chukwa (Monitoring)
Data Mining
Mahout
OS – Red Hat, SUSE, Ubuntu, Windows
Commodity Hardware
Java Virtual Machine
Networking
Orchestration
Oozie
Hadoop 2.0
Hadoop at Yahoo
• 2010 (biggest cluster):
– 4000 nodes
– 16 PB disk
– 64 TB of RAM
– 32,000 cores
• 2014:
– 16 clusters
– 32,500 nodes
Database Market Disruption
$30B Database Market Being Disrupted
Name Site Counter
Dick Ebay 507,018
Dick Google 690,414
Jane Google 716,426
Dick Facebook 723,649
Jane Facebook 643,261
Jane ILoveLarry.com 856,767
Dick MadBillFans.com 675,230
NameId Name
1 Dick
2 Jane
SiteId SiteName
1 Ebay
2 Google
3 Facebook
4 ILoveLarry.com
5 MadBillFans.com
NameId SiteId Counter
1 1 507,018
1 2 690,414
2 2 716,426
1 3 723,649
2 3 643,261
2 4 856,767
1 5 675,230
Id Name Ebay Google Facebook (other columns) MadBillFans.com
1 Dick 507,018 690,414 723,649 . . . . . . . . . . . . . . 675,230
Id Name Google Facebook (other columns) ILoveLarry.com
2 Jane 716,426 643,261 . . . . . . . . . . . . . . 856,767
BigTable Data Model
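The step from the relational rows to the wide, sparse rows of the BigTable model can be sketched as a pivot. The sketch below assumes plain Python dictionaries stand in for a column family; `to_wide_rows` is a hypothetical helper, not part of any BigTable or HBase API. The data is copied from the tables above.

```python
from collections import defaultdict

# (name, site, counter) rows from the denormalized table above.
visits = [
    ("Dick", "Ebay", 507018),
    ("Dick", "Google", 690414),
    ("Jane", "Google", 716426),
    ("Dick", "Facebook", 723649),
    ("Jane", "Facebook", 643261),
    ("Jane", "ILoveLarry.com", 856767),
    ("Dick", "MadBillFans.com", 675230),
]


def to_wide_rows(rows):
    """Pivot rows into one sparse record per name, keyed by site name."""
    wide = defaultdict(dict)
    for name, site, counter in rows:
        wide[name][site] = counter  # the site name itself becomes the column
    return dict(wide)


wide_rows = to_wide_rows(visits)
```

Each row stores only the columns it actually has, so Jane's row carries no Ebay column at all rather than a NULL.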
Financial services: Discover fraud patterns based on multiple years' worth of credit card transactions, on a time scale that does not allow new patterns to accumulate significant losses. Measure transaction processing latency across many business processes by processing and correlating system log data.
Internet retailer: Discover fraud patterns in Internet retailing by mining Web click logs. Assess risk by product type and session/Internet Protocol (IP) address activity.
Retailers: Perform sentiment analysis by analyzing social media data.
Drug discovery: Perform large-scale text analytics on publicly available information sources.
Healthcare: Analyze medical insurance claims data for financial analysis, fraud detection, and preferred patient treatment plans. Analyze patient electronic health records for evaluation of patient care regimes and drug safety.
Mobile telecom: Discover mobile phone churn patterns based on analysis of call detail records (CDRs) and correlation with activity in subscribers’ networks of callers.
IT technical support: Perform large-scale text analytics on help desk support data and publicly available support forums to correlate system failures with known problems.
Scientific research: Analyze scientific data to extract features (e.g., identify celestial objects from telescope imagery).
Internet travel: Improve product ranking (e.g., of hotels) by analysis of multiple years' worth of Web click logs.
Big Data / Hadoop Use Cases
Document databases
• Structured documents – XML and JSON (JavaScript Object Notation) become more prevalent within applications
• Web programmers start storing these in BLOBS in MySQL
• Emergence of XML and JSON databases
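The difference between a JSON document kept in a BLOB and a document the database can look inside can be sketched as follows. The `customer` document, its field names and the `get_path` helper are all hypothetical and only illustrate the idea.

```python
import json

# Hypothetical customer document; field names are illustrative only.
customer = {
    "name": "Jane",
    "sites": [
        {"site": "Google", "counter": 716426},
        {"site": "Facebook", "counter": 643261},
    ],
}

# Stored as an opaque string, the way a BLOB column would hold it.
blob = json.dumps(customer)


def get_path(document, path):
    """Follow a dotted path such as 'sites.0.site' through dicts and lists."""
    node = document
    for part in path.split("."):
        node = node[int(part)] if isinstance(node, list) else node[part]
    return node


# With a BLOB, the application must parse the text before it can look inside.
parsed = json.loads(blob)
first_site = get_path(parsed, "sites.0.site")
```

A document database moves this kind of path lookup into the database itself, so it can index and filter on fields inside the document.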
Graph Database
Neo4J
Infinite Graph
FlockDB
Document
JSON based
MongoDB
CouchDB
RethinkDB
XML based
MarkLogic
BerkeleyDB XML
Key Value
MemcacheDB
Oracle NoSQL
Dynamo
Voldemort
DynamoDB
Riak
Table Based
BigTable
Cassandra
HBase
HyperTable
Accumulo
Big Data Architecture
DATA SOURCES
DATA CONNECTORS
DATA LAKE
DATA LAKE on Oracle Big Data Appliance (Option 1)
DATA LAKE on AWS Big Data Infra (Option 2)
DATA LAKE on On-Premise Hadoop Infra (Option 3)
ANALYTICS
On Premise Hadoop as RDBMS “active archive”
SALES 2013
Oracle Database
Structured Data Analytics from Apps
SALES 2012
SALES 2011
SALES 2010
SALES 2011
SALES 2010
“Hive” provides an SQL-like query layer over Hadoop and MapReduce
Unstructured + Structured Data Analytics from Apps
Hadoop for Structured Archive and Unstructured data
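The "active archive" split can be sketched as a routing decision per year. This is a sketch under assumptions: the function `plan_sales_query` and the `archive_cutoff` setting are hypothetical, and the cutoff of 2011 is read off the diagram, where SALES 2011 and SALES 2010 appear on the Hadoop side.

```python
def plan_sales_query(years, archive_cutoff=2011):
    """Split the requested years between the RDBMS and the Hadoop archive.

    archive_cutoff is a hypothetical setting: years up to and including
    the cutoff are read from the archive through Hive, later years from
    the Oracle database.
    """
    plan = {"oracle": [], "hive": []}
    for year in sorted(years):
        target = "hive" if year <= archive_cutoff else "oracle"
        plan[target].append(year)
    return plan


plan = plan_sales_query([2013, 2010, 2012, 2011])
```

Recent, frequently queried years stay in the database, while older years remain queryable through the SQL-like layer.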
AWS EMR as RDBMS “active archive”
SALES 2013
Oracle Database
Structured Data Analytics from Apps
SALES 2012
SALES 2011
SALES 2010
SALES 2011
SALES 2010
“Hive” provides an SQL-like query layer over Amazon EMR
Unstructured + Structured Data Analytics from Apps
AWS EMR for Structured Archive and Unstructured data
Amazon Elastic MapReduce (Amazon EMR)
Oracle Database Support for All Data
• Structured Data
– Numeric, String, Date, …
– Row and column formats
• Unstructured Data
– LOB
– Text
– XML
– JSON
– Spatial
– Graph
Oracle Support for Any Data Management System
Relational: Run the Business
• Scale-out and scale-up
• Collect any data
• SQL
• Transactional and analytic applications for the enterprise
• Secure and highly available
Hadoop: Change the Business
• Scale-out, low cost store
• Collect any data
• Map-reduce, SQL
• Analytic applications
NoSQL: Scale the Business
• Scale-out, low cost store
• Collect key-value data
• Find data by key
• Web applications
Big Data SQL
SELECT w.sess_id, c.name
FROM web_logs w, customers c
WHERE w.source_country = 'Brazil'
  AND w.cust_id = c.customer_id;
Relevant SQL runs on BDA nodes
Tens of gigabytes of data
Only columns and rows needed to answer query are returned
Hadoop Cluster
B B B
Big Data SQL
Oracle Database
CUSTOMERS WEB_LOGS
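The join between WEB_LOGS and CUSTOMERS can be tried on a laptop with SQLite standing in for both systems. This is only a sketch of the query's logic: the sample rows and column types are invented, the two tables live in one in-memory database rather than in Hadoop and Oracle, and an ORDER BY is added so the output order is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE web_logs (sess_id INTEGER, cust_id INTEGER, source_country TEXT);
    INSERT INTO customers VALUES (1, 'Dick'), (2, 'Jane');
    INSERT INTO web_logs VALUES
        (100, 1, 'Brazil'),
        (101, 2, 'India'),
        (102, 2, 'Brazil');
""")

# Same shape as the query on the slide, plus ORDER BY for stable output.
rows = conn.execute("""
    SELECT w.sess_id, c.name
    FROM web_logs w, customers c
    WHERE w.source_country = 'Brazil'
      AND w.cust_id = c.customer_id
    ORDER BY w.sess_id
""").fetchall()
conn.close()
```

Only the two sessions from Brazil come back, each with the matching customer name.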
SQL Push Down in Big Data SQL
• Hadoop Scans on Unstructured Data
• WHERE Clause Evaluation
• Column Projection
• Bloom Filters for Better Join Performance
• JSON Parsing, Data Mining Model Evaluation
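Three of the push-down steps (WHERE clause evaluation, column projection and Bloom filtering of join keys) can be sketched in a few lines. This is a toy illustration, not Oracle's implementation: `BloomFilter`, `scan_with_pushdown`, the sample rows and the filter sizes are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def scan_with_pushdown(rows, predicate, columns, join_key, bloom):
    """Simulated storage-side scan: filter, Bloom-check the join key, project."""
    for row in rows:
        if not predicate(row):                      # WHERE clause evaluation
            continue
        if not bloom.might_contain(row[join_key]):  # skip rows that cannot join
            continue
        yield {col: row[col] for col in columns}    # column projection


# Hypothetical sample data; column names follow the query on the slide.
web_logs = [
    {"sess_id": 100, "cust_id": 1, "source_country": "Brazil", "payload": "..."},
    {"sess_id": 101, "cust_id": 2, "source_country": "India", "payload": "..."},
    {"sess_id": 102, "cust_id": 2, "source_country": "Brazil", "payload": "..."},
]
customer_ids = [1, 2]

# The database side builds the filter from its join keys and ships it out.
bloom = BloomFilter()
for customer_id in customer_ids:
    bloom.add(customer_id)

result = list(scan_with_pushdown(
    web_logs,
    predicate=lambda row: row["source_country"] == "Brazil",
    columns=["sess_id", "cust_id"],
    join_key="cust_id",
    bloom=bloom,
))
```

Only the filtered, projected rows leave the scan, which is the point of returning just the columns and rows needed to answer the query.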
Big Data SQL: A New Hadoop Processing Engine
Processing Layer: MapReduce and Hive, Spark, Impala, Search, Big Data SQL
Resource Management (YARN, cgroups)
Storage Layer: Filesystem (HDFS), NoSQL Databases (Oracle NoSQL DB, HBase)
What Next for DBAs in the Big Data Era?
• NoSQL
• Hadoop
• Big Data SQL
• 12c New Features on Big Data
• Engineered Systems Knowledge
Connect with Us
Web: www.appsassociates.com
Email: satyendra.pasalapudi@appsassociates.com | satyendra.kumar@aioug.org
YouTube: www.youtube.com/user/AppsAssociates
LinkedIn: www.us.linkedin.com/company/apps-associates
Twitter: @AppsAssociates
Facebook: www.facebook.com/AppsAssociatesGlobal