Introduction to Big Data and NoSQL
NJ SQL Server User Group, May 15, 2012
Don Demsak
Advisory Solutions Architect
EMC Consulting
www.donxml.com
Melissa Demsak
SQL Architect
Realogy
www.sqldiva.com
Meet Don
• Advisory Solutions Architect, EMC Consulting
• Application Architecture, Development & Design
• DonXml.com, Twitter: donxml
• Email: [email protected]
• SlideShare: http://www.slideshare.net/dondemsak
The era of Big Data
How did we get here?
• Expensive
  o Processors
  o Disk space
  o Memory
  o Operating systems
  o Software
  o Programmers
• Culture of Limitations
  o Limit CPU cycles
  o Limit disk space
  o Limit memory
  o Limited OS development
  o Limited software
  o Limited programmers
• One language
• One persistence store
Typical RDBMS Implementations
• Fixed table schemas
• Small but frequent reads/writes
• Large batch transactions
• Focus on ACID
  o Atomicity
  o Consistency
  o Isolation
  o Durability
How we scale RDBMS implementations
1st Step – Build a relational database
Relational Database
2nd Step – Table Partitioning
Relational Database: partitions p1, p2, p3
3rd Step – Database Partitioning
Customer #1: Browser → Web Tier → B/L Tier → Relational Database
Customer #2: Browser → Web Tier → B/L Tier → Relational Database
Customer #3: Browser → Web Tier → B/L Tier → Relational Database
4th Step – Move to the cloud?
Customer #1: Browser → Web Tier → B/L Tier → SQL Azure Federation
Customer #2: Browser → Web Tier → B/L Tier → SQL Azure Federation
Customer #3: Browser → Web Tier → B/L Tier → SQL Azure Federation
Problems created by too much data
• Where to store
• How to store
• How to process
• Organization, searching, and metadata
• How to manage access
• How to copy, move, and backup
• Lifecycle
Polyglot Programmer
Polyglot Persistence
(how to store)
• Atlanta 2009 - No:sql(east) conference
select fun, profit from real_world where relational=false
• Billed as “conference of no-rel datastores”
• (often) Open source
• Non-relational
• Distributed
• (often) Does not guarantee ACID
(loose) Definition
Types Of NoSQL Data Stores
5 Groups of Data Models
Relational
Document
Key Value
Graph
Column Family
Document?
• Think of a web page...
  o Relational model requires a column/tag
  o Lots of empty columns
  o Wasted space and processing time
• Document model just stores the page as is
  o Saves on space
  o Very flexible
• Document Databases
  o Apache Jackrabbit
  o CouchDB
  o MongoDB
  o SimpleDB
  o XML Databases
    • MarkLogic Server
    • eXist
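The "stores the page as is" idea can be shown with a minimal in-memory sketch (the collection name and documents below are invented for illustration, not any product's API):

```javascript
// Minimal document-store sketch: each document keeps only the fields
// it actually has; no fixed column schema, no empty columns.
const pages = new Map(); // hypothetical "pages" collection, keyed by id

pages.set("home", { title: "Home", body: "<h1>Hi</h1>" });
pages.set("about", { title: "About", author: "Don", tags: ["bio"] });

// Documents with different shapes live side by side.
const about = pages.get("about");
const home = pages.get("home");
```

Note that `home` simply has no `author` field; a relational table would need the column anyway.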
Key/Value Stores
• Simple index on key
• Value can be any serialized form of data
• Lots of different implementations
  o Eventually Consistent
    • "If no updates occur for a period, eventually all updates will propagate through the system and all replicas will be consistent"
  o Cached in RAM
  o Cached on disk
  o Distributed Hash Tables
• Examples
  o Azure AppFabric Cache
  o Memcached
  o VMware vFabric GemFire
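The key/value contract is just get/put over an opaque serialized value; a minimal sketch (key names and JSON serialization are illustrative choices, not any particular store's format):

```javascript
// Key/value store sketch: the only index is on the key, and the value
// is any serialized form of data (JSON here).
const store = new Map();

function put(key, value) {
  store.set(key, JSON.stringify(value)); // serialize on write
}

function get(key) {
  const raw = store.get(key);
  return raw === undefined ? undefined : JSON.parse(raw); // deserialize on read
}

put("user:42", { name: "Melissa", roles: ["architect"] });
const user = get("user:42");
```

There is no query on the value's contents, which is exactly the trade that makes these stores easy to cache and distribute.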
Graph?
• A graph consists of
  o Nodes (the 'stations' of the graph)
  o Edges (the lines between them)
• Graph Stores
  o AllegroGraph
  o Core Data
  o Neo4j
  o DEX
  o FlockDB
    • Created by the Twitter folks
    • Nodes = users
    • Edges = nature of the relationship between nodes
  o Microsoft Trinity (research project)
    • http://research.microsoft.com/en-us/projects/trinity/
Column Family?
• Lots of variants
  o Object Stores
    • Db4o
    • GemStone/S
    • InterSystems Caché
    • Objectivity/DB
    • ZODB
  o Tabular
    • BigTable
    • Mnesia
    • HBase
    • Hypertable
    • Azure Table Storage
  o Column-oriented
    • Greenplum
    • Microsoft SQL Server 2012
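The tabular/column-family shape can be sketched as a two-level map, row key → column name → value (the family, row, and column names below are made up):

```javascript
// Column-family sketch: rows are sparse; each row stores only the
// columns it actually uses, grouped under a row key.
const users = new Map(); // hypothetical "users" column family

function putCell(rowKey, column, value) {
  if (!users.has(rowKey)) users.set(rowKey, new Map());
  users.get(rowKey).set(column, value);
}

putCell("row1", "name", "Don");
putCell("row1", "twitter", "donxml");
putCell("row2", "name", "Melissa"); // row2 never gets a "twitter" column
```

Unlike a relational table, the absent `twitter` cell in `row2` costs nothing.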
Okay Got It, Now Let's Compare Some Real World Scenarios
You Need Constant Consistency
• You're dealing with financial transactions
• You're dealing with medical records
• You're dealing with bonded goods
• Best to use an RDBMS
You Need Horizontal Scalability
• You're working across defined timezones
• You're aggregating large quantities of data
• Maintaining a chat server (Facebook chat)
• Use Column Family storage
Frequently Written, Rarely Read
• Think web counters and the like
• Every time a user comes to a page = ctr++
• But it's only read when the report is run
• Use Key-Value storage
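The counter pattern is one cheap write per page view and a rare read at report time; a minimal sketch (page paths are illustrative):

```javascript
// Web-counter sketch: every page view is a single increment (ctr++)
// against a key; the value is only read when a report runs.
const counters = new Map();

function hit(page) {
  counters.set(page, (counters.get(page) || 0) + 1); // write-heavy path
}

hit("/home");
hit("/home");
hit("/about");

// Report time: the rare read.
const homeViews = counters.get("/home");
```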
Here Today, Gone Tomorrow
• Transient data like...
  o Web sessions
  o Locks
  o Short-term stats
  o Shopping cart contents
• Use Key-Value storage
Where to store
• RAM
  o Fast
  o Expensive
  o Volatile
• Parallel File System
  o HDFS (Hadoop)
  o Auto-replicated for parallel decentralized I/O
• Local Disk
  o SSD: super fast
  o Fast spinning disks (7200+ RPM)
  o High bandwidth possible
  o Persistent
• SAN
  o Storage Area Network
  o Fully managed
  o Expensive
• Cloud
  o Amazon
  o Box.Net
  o DropBox
Big Data
Big Data Definition
Volume
• Beyond what traditional environments can handle
Velocity
• Need decisions fast
Variety
• Many formats
Additional Big Data Concepts
• Volumes & volumes of data
• Unstructured
• Semi-structured
• Not suited for Relational Databases
• Often utilizes MapReduce frameworks
Big Data Examples
• Cassandra
• Hadoop
• Greenplum
• Azure Storage
• EMC Atmos
• Amazon S3
• SQL Azure (with Federations support)?
Real World Example
• Twitter
  o The challenges
    • Needs to store many graphs
      o Who you are following
      o Who's following you
      o Who you receive phone notifications from, etc.
    • To deliver a tweet requires rapid paging of followers
    • Heavy write load as followers are added and removed
    • Set arithmetic for @mentions (intersection of users)
What did they try?
• Started with relational databases
• Tried Key-Value storage of denormalized lists
• Did it work?
  o Nope
• Either good at
  o Handling the write load
  o Or paging large amounts of data
  o But not both
What did they need?
• Simplest possible thing that would work
• Allow for horizontal partitioning
• Allow write operations to
  o Arrive out of order
  o Or be processed more than once
• Failures should result in redundant work, not lost work!
The Result was FlockDB
• Stores graph data
• Not optimized for graph traversal operations
• Optimized for large adjacency lists
  o The list of all edges in a graph
• Key is the edge; value is a set of the node end points
• Optimized for fast read and write
• Optimized for page-able set arithmetic.
How Does it Work?
• Stores graphs as sets of edges between nodes
• Data is partitioned by node
  o All queries can be answered by a single partition
• Write operations are idempotent
  o Can be applied multiple times without changing the result
• And commutative
  o Changing the order of operands doesn't change the result
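Idempotent, commutative writes fall out naturally from set semantics: applying the same edge-add twice, or in a different order, yields the same adjacency list. A sketch (the node ids are made up, and this is not FlockDB's actual storage format):

```javascript
// Sketch of idempotent + commutative edge writes: each partition
// holds the adjacency set for a single node.
const partitions = new Map(); // nodeId -> Set of neighbor ids

function addEdge(from, to) {
  if (!partitions.has(from)) partitions.set(from, new Set());
  partitions.get(from).add(to); // adding twice changes nothing (idempotent)
}

// Apply the same operations again, reordered and duplicated.
addEdge(1, 2); addEdge(1, 3);
addEdge(1, 3); addEdge(1, 2);

const neighbors = [...partitions.get(1)].sort();
```

Because redundant or reordered writes converge to the same set, retrying a failed write is always safe, which is the "redundant work, not lost work" goal above.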
How to Process Big Data
ACID
• Atomicity
  o All or nothing
• Consistency
  o Valid according to all defined rules
• Isolation
  o No transaction should be able to interfere with another transaction
• Durability
  o Once a transaction has been committed, it will remain so, even in the event of power loss, crashes, or errors
BASE
• Basically Available
  o High availability, but not always consistent
• Soft state
  o Background cleanup mechanism
• Eventual consistency
  o Given a sufficiently long period of time over which no changes are sent, all updates can be expected to propagate eventually through the system, and all the replicas will be consistent
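The BASE properties can be sketched with two replicas and a background sync: reads from the second replica may be stale until the sync runs, after which the replicas agree (the replica names and explicit `sync()` trigger are simplifications of a real replication protocol):

```javascript
// Eventual-consistency sketch: writes land on one replica and are
// propagated to the other by a background job.
const replicaA = new Map();
const replicaB = new Map();
const pending = []; // soft state: updates queued for propagation

function write(key, value) {
  replicaA.set(key, value);   // basically available: one replica acks
  pending.push([key, value]); // queued for the cleanup/propagation job
}

function sync() {
  // Eventual consistency: once no new changes arrive, draining the
  // queue makes every replica agree.
  while (pending.length) {
    const [k, v] = pending.shift();
    replicaB.set(k, v);
  }
}

write("cart:7", ["book"]);
const staleRead = replicaB.get("cart:7"); // undefined: not yet propagated
sync();
const freshRead = replicaB.get("cart:7"); // now consistent
```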
Traditional (relational) Approach
Transactional Data Store → Extract → Transform → Load → Data Warehouse
Big Data Approach
• MapReduce Pattern/Framework
  o An input reader
  o Map function: to transform to a common shape (format)
  o A partition function
  o A compare function
  o Reduce function
  o An output writer
MongoDB Example
> // map function
> m = function(){
...   this.tags.forEach(
...     function(z){
...       emit( z , { count : 1 } );
...     }
...   );
... };

> // reduce function
> r = function( key , values ){
...   var total = 0;
...   for ( var i=0; i<values.length; i++ )
...     total += values[i].count;
...   return { count : total };
... };

> // execute
> res = db.things.mapReduce(m, r, { out : "myoutput" });
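Outside the mongo shell, the same tag-count map/reduce can be sketched in plain JavaScript; the `docs` collection below is hypothetical, standing in for `db.things`:

```javascript
// In-memory sketch of the map/reduce pattern used above.
const docs = [
  { tags: ["nosql", "mongodb"] },
  { tags: ["nosql"] },
];

// Map: emit (tag, {count: 1}) for every tag of every document,
// mirroring the shell's m() function.
const emitted = [];
for (const doc of docs) {
  doc.tags.forEach((tag) => emitted.push([tag, { count: 1 }]));
}

// Shuffle: group emitted values by key (the framework normally does this).
const grouped = new Map();
for (const [key, value] of emitted) {
  if (!grouped.has(key)) grouped.set(key, []);
  grouped.get(key).push(value);
}

// Reduce: sum the counts per key, mirroring the shell's r() function.
const result = new Map();
for (const [key, values] of grouped) {
  let total = 0;
  for (const v of values) total += v.count;
  result.set(key, { count: total });
}
```

The three phases (map, group-by-key, reduce) are exactly what Hadoop and MongoDB distribute across machines.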
What is Hadoop?
• A scalable, fault-tolerant grid operating system for data storage and processing
• Its scalability comes from the marriage of:
  o HDFS: self-healing, high-bandwidth clustered storage
  o MapReduce: fault-tolerant distributed processing
• Operates on unstructured and structured data
• A large and active ecosystem (many developers and additions like HBase, Hive, Pig, …)
• Open source under the friendly Apache License
• http://wiki.apache.org/hadoop/
Hadoop Design Axioms
1. System Shall Manage and Heal Itself
2. Performance Shall Scale Linearly
3. Compute Should Move to Data
4. Simple Core, Modular and Extensible
Hadoop Core Components
• Store: HDFS (self-healing, high-bandwidth clustered storage)
• Process: Map/Reduce (fault-tolerant distributed processing)
HDFS: Hadoop Distributed File System
Block Size = 64 MB
Replication Factor = 3
Cost/GB is a few ¢/month vs $/month
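The block/replication math can be sketched: a file is split into 64 MB blocks and each block is placed on 3 of the cluster's data nodes. The node names and the round-robin placement below are simplifications; real HDFS placement is rack-aware:

```javascript
// Sketch of HDFS-style chunking and replica placement.
const BLOCK_SIZE = 64 * 1024 * 1024; // 64 MB
const REPLICATION = 3;
const nodes = ["node1", "node2", "node3", "node4"]; // hypothetical cluster

function placeBlocks(fileSizeBytes) {
  const blockCount = Math.ceil(fileSizeBytes / BLOCK_SIZE);
  const placement = [];
  for (let b = 0; b < blockCount; b++) {
    const replicas = [];
    for (let r = 0; r < REPLICATION; r++) {
      // Round-robin assignment; real HDFS considers racks and load.
      replicas.push(nodes[(b + r) % nodes.length]);
    }
    placement.push(replicas);
  }
  return placement;
}

const layout = placeBlocks(200 * 1024 * 1024); // a 200 MB file
```

A 200 MB file needs 4 blocks (ceil(200/64)), each living on 3 different nodes, which is how HDFS keeps bandwidth high and survives node loss.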
Hadoop Map/Reduce
Hadoop Job Architecture
Microsoft embraces Hadoop
Good for enterprises & developers
Great for end users!
A SEAMLESS OCEAN OF INFORMATION PROCESSING AND ANALYTICS
[Diagram] HADOOP [Azure and Enterprise] sits on an OCEAN OF DATA [unstructured, semi-structured, structured] in HDFS, fed by NOSQL ETL from sources such as EIS / ERP, RDBMS, File System, OData [RSS], and Azure Storage, and programmed through Java OM, Streaming OM, HiveQL, PigLatin, (T)SQL, and .NET/C#/F#.
Hive Plug-in for Excel
THANK YOU