© 2014 The MITRE Corporation. All rights reserved
April 10th, 2013
An Introduction to Big Data
Ken Smith
For Internal MITRE Use
Big Data … Its Technologies & Analytic Ecosystem
Course Goal
Hype curve … tethered to reality
Outline
– Background: What is “Big Data”? … Why is it big?
– Parallel Technologies for Big Data Problems
– Big Data Ecosystem
– Ongoing Challenges
What is “Big Data”?
O’Reilly:
– “Big data is when the size of the data itself becomes part of the problem”
EMC/IDC:
– “Big data technologies describe a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data, by enabling high-velocity capture, discovery, and/or analysis.”
IBM (the famous 3-V’s definition):
– Volume (Gigabytes -> Exabytes)
– Velocity (Batch -> Streaming Data)
– Variety (Structured, Semi-structured, & Unstructured)
Credit: Big Data Now, Current Perspectives from O’Reilly Radar (O’Reilly definition); Extracting Value from Chaos, Gantz et al. (IDC definition); Understanding Big Data, Eaton et al. (IBM definition)
Data Size Terminology
A Simple Data Structure Taxonomy
Structured data
– Data adheres to a strict template/schema
– Spreadsheets, relational databases, sensor feeds, …
Semi-structured data
– Data adheres to a flexible (grammar-based) format: optional fields, repeating fields
– Web pages / forms, documents, XML, JSON, …
Unstructured data
– Data adheres to an unknown format: no schema or grammar; you discover what each byte is and means by examining the data
– Unparsed text, raw disks, raw video & images, …
“Variety”: constantly coping with structure variations; multiple types; changing types
Why Are Volume & Velocity Increasing?
1) Internet-Scale Datasets
– Activity logfiles (e.g., clickstreams, network logs)
– Internet indices
– Relationship data / social networks
– Velocity note: Bin Laden’s death resulted in 5106 tweets/second
2) Sensor Proliferation
– Weather satellites; flight recorders; GPS feeds; medical and scientific instruments; cameras
– Government agencies who want a sensor on every potentially mad cow, in every cave in Afghanistan, on every cargo container, etc. What if their wish is granted?
– Velocity notes: the Large Hadron Collider generates 40T/sec; high-def UAVs collect 1.4P/mission
– Variety note: increasing # of sensor feeds -> increasing variety
3) Because, with modern cloud parallelism, you can …
– Problem: “Frequent close encounters” are suspicious
  Given: 73,241 ships reporting {id, lat, long} every 5 minutes for 2 weeks
  Resulting dataset = 15 GB (uncompressed and indexed)
– How do you detect all pairs of ships within X meters of each other?
  Many solutions generate intermediate “big data”
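A rough sketch of why this problem produces intermediate “big data”, and one common way to keep it in check. The record count follows from the figures on the slide. The spatial-grid approach, the name close_pairs, and the assumption that positions are already projected to planar (x, y) meters are illustrative choices, not taken from the original talk.

```python
from collections import defaultdict
from math import floor, hypot

# Figures from the slide: 73,241 ships, one report every 5 minutes, 2 weeks.
SHIPS = 73_241
REPORTS_PER_SHIP = (60 // 5) * 24 * 14
TOTAL_REPORTS = SHIPS * REPORTS_PER_SHIP

# Comparing every ship with every other ship at ONE time step is the
# intermediate result that explodes.
NAIVE_PAIRS_PER_STEP = SHIPS * (SHIPS - 1) // 2


def close_pairs(positions, max_dist):
    """Return all pairs of ids whose positions are within max_dist.

    positions: dict of ship_id -> (x, y) in meters for one time step
    (assumption: lat/long were projected to a plane beforehand).
    """
    # Bucket ships into square cells of side max_dist. Two ships within
    # max_dist of each other must sit in the same or an adjacent cell.
    grid = defaultdict(list)
    for ship_id, (x, y) in positions.items():
        grid[(floor(x / max_dist), floor(y / max_dist))].append(ship_id)

    pairs = set()
    for (cx, cy), ids in grid.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                neighbours = grid.get((cx + dx, cy + dy), ())
                for a in ids:
                    for b in neighbours:
                        if a < b:  # report each pair once
                            ax, ay = positions[a]
                            bx, by = positions[b]
                            if hypot(ax - bx, ay - by) <= max_dist:
                                pairs.add((a, b))
    return pairs


if __name__ == "__main__":
    print(TOTAL_REPORTS, NAIVE_PAIRS_PER_STEP)
    demo = {"A": (0, 0), "B": (50, 0), "C": (1000, 1000), "D": (-30, -40)}
    print(sorted(close_pairs(demo, 100)))
```

Each grid cell can be processed independently, which is what makes this kind of solution a natural fit for the parallel technologies discussed next.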
What Good is Big Data? Some Examples!
1) As a basis for analysis
– As a human behavior sensor
– Supporting new approaches to science
2) To create a useful service
Outline
– Background: What is “Big Data”? … Why is it big?
– Parallel Technologies for Big Data Problems
– Big Data Ecosystem
– Ongoing Challenges
Traditional Scaling “Up”: Improve The Components of One System
OS: multiple threads / VMs
CPU: increase clock speed, bus speed, cache size
RAM: increase capacity
Disk: Increase capacity, decrease seek time, RAID
Scaling “Out”: From Component Speedup to Aggregation
Multiple cores on a chip (2, 4, 6, 8, …)
Multiserver Racks (“Shared Nothing” – only interconnect)
Multi-Rack Data Centers
If you are Google or a few others: Multiple Data Centers
The Resulting “Computer” & Its Applications
[Figure: many nodes, each with its own OS, CPU, RAM, and Disk]
This massively parallel architecture can be treated as a single computer
Applications for this “computer”:
– Can exploit computational parallelism (near linear speedup)
– Can have a vastly larger effective address space
– Google and Facebook field applications whose user base is measured as a reasonable fraction of the human race
The Power of Parallelism: Divide & Conquer
[Figure: “Work” is partitioned into w1, w2, w3; three “workers” produce r1, r2, r3; these are combined into the “Result”]
Source: a slide by Jimmy Lin, cc-licensed
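The partition / worker / combine pattern in the figure can be sketched in a few lines. This is a minimal illustration only: the function names and the sum-of-squares task are made up for the example, and Python threads are used to show the structure rather than to gain real CPU speedup (in CPython the interpreter lock limits that).

```python
from concurrent.futures import ThreadPoolExecutor


def partition(work, n):
    """Split a list into n contiguous chunks of near-equal size."""
    k, m = divmod(len(work), n)
    return [work[i * k + min(i, m):(i + 1) * k + min(i + 1, m)]
            for i in range(n)]


def worker(chunk):
    """Each worker computes a partial result (here: a sum of squares)."""
    return sum(x * x for x in chunk)


def divide_and_conquer(work, n_workers=3):
    chunks = partition(work, n_workers)                 # "Partition"
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_results = list(pool.map(worker, chunks))  # r1, r2, r3
    return sum(partial_results)                         # "Combine"


if __name__ == "__main__":
    print(divide_and_conquer(list(range(10))))
```

The combine step only works this simply because addition is associative; tasks whose partial results cannot be merged independently need a different decomposition.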
Some Important Software Realities In a Massively Parallel Architecture
– Communication costs
– Fault-tolerance
– Programming abstractions
“Numbers Everyone Should Know” (from SoCC 2010 keynote, Jeffrey Dean, Google)
L1 cache reference                               0.5 ns
Branch mispredict                                  5 ns
L2 cache reference                                 7 ns
Mutex lock/unlock                                 25 ns
Main memory reference                            100 ns
Compress 1K w/cheap algorithm                  3,000 ns
Send 2K bytes over 1 Gbps network             20,000 ns
Read 1 MB sequentially from memory           250,000 ns
Round trip within same datacenter            500,000 ns
Disk seek                                 10,000,000 ns
Read 1 MB sequentially from disk          20,000,000 ns
Send packet CA->Netherlands->CA          150,000,000 ns
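A small back-of-the-envelope calculation using these numbers shows why communication and disk costs dominate design decisions. The dictionary keys and the helper name read_seconds are illustrative; the latencies are the ones listed on the slide.

```python
# Latencies from the slide, in nanoseconds.
NS = {
    "main_memory_reference": 100,
    "read_1mb_memory": 250_000,
    "datacenter_round_trip": 500_000,
    "disk_seek": 10_000_000,
    "read_1mb_disk": 20_000_000,
}


def read_seconds(megabytes, ns_per_mb):
    """Time to read the given number of megabytes sequentially, in seconds."""
    return megabytes * ns_per_mb / 1e9


# Reading 1 GB (1024 MB) sequentially from memory versus from one disk.
gb_from_memory = read_seconds(1024, NS["read_1mb_memory"])
gb_from_disk = read_seconds(1024, NS["read_1mb_disk"])

# The same 1 GB spread evenly over 100 disks read in parallel
# (ignoring coordination costs, which is an optimistic assumption).
gb_from_100_disks = gb_from_disk / 100

# How many main-memory references fit in the time of one disk seek.
memory_refs_per_seek = NS["disk_seek"] // NS["main_memory_reference"]

if __name__ == "__main__":
    print(gb_from_memory, gb_from_disk, gb_from_100_disks, memory_refs_per_seek)
```

Spreading a scan over many disks is the arithmetic behind the scale-out designs that follow.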
Fault Tolerance
Frequency of faults in massively parallel architectures:
– Google reports an average of 1.2 failures per analysis job
– We assume our laptop will last through the week, but you lose this assurance when you compute with 1000’s of commodity machines
What if the result waits because 499 / 500 worker tasks have completed, but #500 never will finish?
Strategy:
– Redundancy and checkpointing
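A toy sketch of the two ideas, under simplifying assumptions: “checkpointing” is a dictionary of finished task results that survives a restart, and “redundancy” is reduced to re-executing a failed task. The function run_job and its parameters are hypothetical; real frameworks also replicate data and may launch duplicate copies of slow tasks.

```python
def run_job(task_ids, worker, checkpoint, max_attempts=3):
    """Run worker(task_id) for every task, reusing checkpointed results.

    checkpoint: dict of task_id -> result for tasks that already finished.
    A task that raises is re-executed, up to max_attempts times in total.
    """
    for task_id in task_ids:
        if task_id in checkpoint:
            continue  # finished in an earlier run; do not redo the work
        for attempt in range(max_attempts):
            try:
                checkpoint[task_id] = worker(task_id)
                break
            except Exception:
                if attempt == max_attempts - 1:
                    raise  # give up after the last allowed attempt
    return [checkpoint[task_id] for task_id in task_ids]


if __name__ == "__main__":
    calls = {}

    def flaky(task_id):
        # Simulated worker: task 2 crashes on its first two attempts.
        calls[task_id] = calls.get(task_id, 0) + 1
        if task_id == 2 and calls[task_id] < 3:
            raise RuntimeError("simulated crash")
        return task_id * 10

    print(run_job([0, 1, 2], flaky, checkpoint={0: 0}), calls)
```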
How Do You Program A Massively Parallel Computer?
Parallel programming without help can be very painful!
– Parallelize: translate your application into a set of parallel tasks
– Task management: assigning tasks to processors, inter-task communication, task restart when they crash
– Task synchronization: avoiding extended waits and deadlocks
Programmers need simplifying abstractions to be productive
– Pioneers Google & Facebook were forced to invent these
– Hadoop now provides a tremendous suite
Analogy: RDBMSs provide the atomic transaction abstraction
– No programmer wants to worry about the details of who is reading & writing data while they do!
– Just use “begin transaction” and “end transaction” to insulate your code from others using the system
Apache Hadoop
Open source framework for developing & running parallel applications on hardware clusters
– Cloudera & Hortonworks sell “premium” versions & support
– Adapted from Google’s internal programming model
– Available at: hadoop.apache.org
Key components:
– HDFS (Hadoop Distributed File System)
– Map-Reduce (parallel programming pattern)
– Hive, Pig (higher-level languages which compile into Map-Reduce)
– HBase (key-value store)
– Mahout (data mining library)
Some non-Hadoop parallel frameworks also exist:
– Asterdata & Greenplum sell {RDBMS + Map-Reduce + analytics}
HDFS (Hadoop Distributed File System)
[Figure: Map Reduce runs over HDFS files, which are layered on the files of each machine’s underlying file system]
HDFS:
– Provides a single unified file system abstracting away the many underlying machines’ file systems
– Load balances file fragments, maintains replication levels
HDFS components:
– NameNode manages overall file system metadata
– DataNodes (one per machine) manage actual data
– DataNodes are easy to add, expanding the file system
– Both DataNodes and the NameNode include a webserver, so node status can be easily checked
Example commands:
– “/bin/hdfs dfs -ls”: lists files in an HDFS directory; corresponds to linux “ls”
– “/bin/hdfs dfs -rm xx”: removes HDFS file xx; corresponds to linux “rm xx”
HDFS Architecture
[Figure: HDFS architecture diagram, adapted from (Ghemawat et al., SOSP 2003)]
MapReduce
■ Iterate over a large number of records
■ Extract something of interest from each
■ Shuffle and sort intermediate results
■ Aggregate intermediate results
■ Generate final output
Key idea: provide a functional abstraction for these two operations: Map and Reduce
Build a sequence of MR steps
Ideal MapReducable Problems
1) Input data can be naturally split into “chunks” and distributed
2) Large amounts of data
– If smaller than HDFS block size, don’t bother
3) Data independence
– Ideally, map operation does not depend on data at other nodes
4) Good redistribution key exists
– Output of map job is key-value pairs
– The key is used to shuffle/sort the output to the reducers
Example: build a word-count index for a huge document corpus
– Map: emit {docid, word, 1} tuple for each occurrence
– Reduce: sum similar tuples, like: {“War And Peace”, *, 1}
Not all problems are “ideal”, but MR can still work: www.adjoint-functors.net/su/web/354/references/graph-processing-w-mapreduce.pdf
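The word-count example can be simulated on one machine to make the map / shuffle / reduce roles concrete. This is a plain-Python sketch of the pattern, not Hadoop’s API; the function names and the tiny corpus are invented for illustration.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Map: emit ((docid, word), 1) for each word occurrence."""
    for word in text.lower().split():
        yield (doc_id, word), 1


def shuffle(pairs):
    """Shuffle/sort: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: sum the counts that share a key."""
    return key, sum(values)


def word_count(corpus):
    """corpus: dict of doc_id -> text. Returns {(doc_id, word): count}."""
    pairs = [kv for doc_id, text in corpus.items()
             for kv in map_phase(doc_id, text)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values)
                for key, values in sorted(grouped.items()))


if __name__ == "__main__":
    print(word_count({"d1": "war and peace and war", "d2": "peace"}))
```

In a real cluster each call to map_phase runs where its document lives, and the (docid, word) key decides which reducer receives each group; that is why a good redistribution key matters.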
MapReduce/HDFS Architecture
[Figure from Wikipedia Commons: http://en.wikipedia.org/wiki/File:Hadoop_1.png]
Higher Level Languages: Hive
Hive is a system for managing and querying structured data
– Used extensively to provide SQL-like functionality
– Compiles into map-reduce jobs
– Includes an optimizer*
Developed by Facebook
– Almost 99.9% of Hadoop jobs at Facebook are generated by a Hive front-end system
*Hive optimizations at: citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.151.2637
Apache Pig
Open source scripting language
– Provides SQL-like primitives in a scripting language
– Developed by Yahoo! Almost 30% of their analytic jobs are written in “Pig Latin”
Execution Model
– Compiles into MapReduce (over HDFS files, HBase tables)
– Approximately 30% overhead
– Optimizes multi-query scripts; filter and limit optimizations reduce the size of intermediate results
Example commands– FILTER: hour00 = FILTER hour_frequency2 BY hour eq '00';
– ORDER: ordered_uniq_frequency = ORDER filtered_uniq_frequency BY (hour, score);
– GROUP: hour_frequency1 = GROUP ngramed2 BY (ngram, hour);
– COUNT: hour_frequency2 = FOREACH hour_frequency1 GENERATE flatten($0), COUNT($1) as count;
The Human Approach
Massively parallel human beings
– “crowdsourcing”
A good list of projects:
– en.wikipedia.org/wiki/List_of_crowdsourcing_projects
Outline
– Background: What is “Big Data”? … Why is it big?
– Parallel Technologies for Big Data Problems
– Big Data “Ecosystems”
– Ongoing Challenges in Big Data Ecosystems
General “Funnel” Model of Big Data Analytic Workflows
1) Ingest of diverse raw data sources: text, sensor feeds, semi-structured (e.g., web content, email)
2) Transform, clean, subset, integrate, and index new datasets. Enrich: extract entities, compute features & metrics.
3) Generate more structured datasets as needed: RDBMS tables, objects, triple stores
4) Generate & explore user-facing analytic models (data cubes, graphs, clusters). Drill down to details.
Data science teams work across entire spectrum
Some examples & technology stacks:
– Clickstream analysis; stock “tick” analysis; social network analysis
– Google’s Tenzing stack: SQL/OLAP over Hadoop
– Cloudera’s stack: Hive/Pig compiling into Hadoop
– Greenplum’s stack: SQL compiling directly onto servers, OR into MapReduce via “external tables”
Ecosystem Overview
A frequent workflow is emerging:
– 1) Ingest data from diverse sources
– 2) ETL / enrichment
– 3) Intermediate data management
– 4) Refined data management (graphs, parsed triples from text, OLAP/relational data)
– 5) Analytics & viz tools to build/test models, support decisions
– 6) Reachback into earlier steps by “data scientists”
Common to diverse types of organizations:
– Marketing, financial research, scientists, intelligence agencies, …
– (Social media providers are a bit different: they host the big data)
Many technologies working together
– Map reduce, semistructured (“NoSQL”) databases, graph databases, RDBMSs, machine learning/data mining algorithms, analytic tools, visualization techniques
We will touch on some of these through the rest of today
– Many are new and evolving; this is a rapidly moving train!
Emergence of the Data Scientist
Spectrum of Big Data Ecosystem Classes
Big Data Ecosystems differ along several key questions:
1) Is there a hypothesis being tested?
– Testing a hypothesis requires a more sophisticated analysis process
2) Is external data being gathered?
– Versus all internally generated data
– External data requires more ETL effort
3) Does it make sense to evolve and expand this ecosystem?
– The greater the up-front investment, the more important it is to address serendipitous new hypotheses by reusing/augmenting existing data resources
Spectrum of Big Data Ecosystem Classes
1) Non-experiment (no hypothesis exists, external data)
– No hypothesis or learning experiment
– Ecosystem reports aspects of external data, little analysis / new truth
– Example: CNN “trending now” alerts
– (Note: subject to being “gamed” by manipulation of external data!)
2) Evolving experimental ecosystem (hypotheses & external data)
3) Self-contained experiment (hypothesis exists, no external data)
The Non-Experiment (Example: “Trending Now”)
1) External data ingested
2) Basic processing applied to “add value” for consumers (but no rigorous model learning, or hypothesis testing)
A Spectrum of Big Data Ecosystems
1) Non-experiment
2) Evolving experimental ecosystem (hypotheses, external data)
3) Self-contained experiment (a hypothesis exists, no external data)
– Pre-existent (scientific) hypothesis to test
– All necessary data generated to spec within the ecosystem
– Example: Argonne National Labs
The Self-Contained Experiment (Example: Argonne National Labs)
1) A scientific hypothesis H exists, and a plan to test H by analyzing large datasets.
2) Any data needed to test H is generated “internally”
3) Data analysis, perhaps requiring a predictive model to be learned & refined
4) Plan / model applied to data to validate / invalidate H (outcome: valid or not valid)
Spectrum of Big Data Ecosystem Classes
1) Non-experiment (no hypothesis exists, external data)
2) Evolving experimental ecosystem (potential hypotheses, external data)
– Massive external datasets suggest new insights / competitive advantage
– Hypothesis formed and external data gathered
– Experiment / ecosystem designed to test hypothesis, provide insight
– Once in place, ecosystem is reused & evolves: new data & hypotheses, cost amortized
– Sweet spot … (Consumer analysis, Intelligence analysis …)
3) Self-contained experiment
Evolving Experiment Ecosystem: E3 (Example: Google Adwords)
1) Massive external data suggests new insights / competitive advantages
1b) Incremental data suggests incremental insights …
2) Initial hypothesis H formed & data gathered to test it.
3) Data analysis, perhaps requiring a predictive model to be learned & refined
4) Plan / model applied to data to validate / invalidate H (outcome: valid or not valid)
Questions?
Outline
– Background: What is “Big Data”? … Why is it big?
– Parallel Technologies for Big Data Problems
– Big Data “Ecosystems”
– Ongoing Challenges in Big Data Ecosystems
Some General & Ongoing Challenges
Ecosystems are mature to the extent that they work now. But definitely not a fully “solved problem”!!
Some outstanding issues to keep an eye on:
– Sampling: what if two sources are sampled differently?
– Security
– Privacy
– Metadata: e.g., how do we deal with evolution of processing?
– Moving/loading big data
– People: finding, retaining, assigning to roles, training/growing, paying
– Outsourcing options: disk growth beyond your budget, need for services you can’t provide
Normal “Funnel” Model of Big Data Analytic Workflows
- Assumption that all data “melts together” within the funnel
Security-Partitioned “Funnel” Model of Big Data Analytic Workflows
- Assumption that certain data must not be mixed …
- How do you implement separation?
- Issues: What does this mean for the ability to aggregate, infer?
Other Security Issues
Parallel HW is often managed by 3rd parties for economics:
– Should I expose my sensitive data to DBAs who don’t work for me?
– What about other unknown/untrusted tenants of a rented HW infrastructure?
Standard encryption only addresses data at rest
– When a query hits the DBMS it becomes plaintext in RAM. A rogue cloud DBA can see all my “encrypted” data.
It’s hard to map high level policies onto detailed implementations; big data makes this worse
– E.g., Books about the stock market cannot be checked out to freshmen
Accumulo Data Sensitivity Labels
[Figure: an Accumulo entry is a Key and a Value; the Key consists of Row ID, Column (Family, Qualifier, Visibility), and Timestamp]
Label definition
– Labels (e.g., SECRET, NOFORN) are defined, applied on ingest
– Cryptographically bound to data
– Applied at the key-level (i.e., to every value individually)
See: accumulo.apache.org/1.4/user_manual/Security.html
Access:
– Database users are assigned labels; these are used to gain access when a user authenticates as that user.
Issues to consider:
– Admin overhead of defining and applying labels to every value
– Aligning heterogeneous label sets to realize possible sharing
– Label assurance
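The idea of per-value labels can be sketched without any database. This is deliberately not Accumulo’s API: the table layout, the names can_read and scan, and the rule that a user must hold every label on a cell are simplifying assumptions (Accumulo’s real visibility expressions also allow alternatives with “|” and parentheses).

```python
def can_read(cell_labels, user_authorizations):
    """A cell is readable only if the user holds every label on it."""
    return set(cell_labels) <= set(user_authorizations)


def scan(table, user_authorizations):
    """Return only the cells this user is allowed to see.

    table: dict of (row_id, column) -> (value, labels)
    """
    return {key: value
            for key, (value, labels) in table.items()
            if can_read(labels, user_authorizations)}


if __name__ == "__main__":
    # Hypothetical data: every value carries its own label set.
    table = {
        ("ship42", "position"): ("12.5N 44.1E", {"SECRET"}),
        ("ship42", "owner"): ("Acme Shipping", set()),
        ("ship42", "source"): ("partner feed", {"SECRET", "NOFORN"}),
    }
    print(scan(table, {"SECRET"}))
```

Because every value carries its own label, the filtering cost and the admin overhead noted above grow with the number of cells, not the number of tables.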
Lack of Metadata As Harmful:
Metadata Challenges in Sponsor Ecosystems
1) Exploiting myriads of datasets with agility
– What columns link voice recordings to radar? When do they simultaneously exist in this table? Where are temperature readings?
2) Dealing with “shape-changing” data sources
– When data format continually changes, how does my reader interpret serialized data instances without schema information?
3) Accurately matching analytics to datasets
– Analytic A requires column C1, derived by f8(). Does C1 exist for May? If C1 exists, but was derived from f7(), it would be bad if A “fails silently”!
4) Rapidly incorporating unknown data sources
– Can I reuse the ingest & transformation code from other data sources?
5) Reasoning about the data (data scientist needs)
– Where are value distributions & trends over time (e.g., to test a hypothesis, to infer semantics, for process optimization)…
Theme: Poorly understood datasets result in high overhead & degraded analytics
More Use Cases For Metadata
Our Big Data sponsors are obligated to know:
What data should be retained?
– Given the size of the data, all information can’t be retained forever. Decisions about which data to retain, and which to let go, are currently made ‘off the cuff’. Can we characterize data’s use to support retention decisions?
Where did this data come from?
– Analysts are writing reports and need to know the source of the data so they can determine trustworthiness, legality, dissemination restrictions, and potentially reference the original data object.
Where does a class of data reside?
– This is largely a compliance and auditing function. A redacted use case would be: “Which of my systems currently house PII data? Do any systems house this data that aren’t approved for it? Are my security controls working?” With an increasing reliance on both public and private clouds, this is growing increasingly challenging.
Where does a specific data item reside?
– If the lawyers call and say I need to get rid of a certain piece of intelligence, can I locate all copies of it? Who else did I send it to? If there is a breach at a cloud provider or partner, do I know what data items landed within their perimeter? This would enable more granular breach notifications.
What is Provenance?
“Family Tree” of relationships
– Ovals = data, rectangles = processes
– Show how data is used and reused
Basic metadata
– Timestamp
– Owner
– Name/Descr
Can also include annotations
– E.g. quality info
Is not the actual data object
How is it Done Today?
The general approach is: “The developers just kinda know.”
– This does not scale! (with variety … the under-served “V”)
Some large companies are now developing point solutions, as vast #’s of different data formats accumulate:
– Protobuf schema repository from Google
– Avro schema repository from LinkedIn
– Hive metacatalog (basis of hCatalog)
■ But these are not general & powerful “first principles” solutions
– Format-specific data model (e.g., hCatalog favors Hive)
– Typically focus only on the “SerDe” issue: a “poor man’s metadata repository”
– https://issues.apache.org/jira/si/jira.issueviews:issue-html/AVRO-1124/AVRO-1124.html
Questions?
Next Topic in the Outline
Intro to Big Data and Scalable Databases
– Part 1: Big Data … Its Technologies & Analytic Ecosystem
– Part 2: An Introduction To Parallel Databases
– Part 3: Technological Innovations and MPP RDBMS
An Introduction To Parallel Databases
Purpose of This Talk
Let’s say you have a problem involving lots of data, and you can apply multiple processors:
– What can a database do for me?
– What databases are available?
– How do I pick?
Outline
Taxonomy
Software realities for parallel databases
Systems engineering strategies
A Simple Taxonomy of Parallel Databases
[Figure: systems plotted by Data Model Structure (Structured Relational; Semi-structured, e.g., “Document-oriented”; Triples, Key-value) against Max Number of Processors (1, 10, 100, 1000, “A Lot!”). Plotted systems: Traditional RDBMS; Parallel Relational (Greenplum, Aster Data); Non-relational, aka NoSQL (BigTable / HBase / Accumulo, MongoDB, FlockDB)]
“Clouds” are increasingly attractive computational platforms
– Traditional solutions don’t automatically scale well to clouds; innovation is occurring rapidly ...
Market Trends
– Consolidation
– Hybrids
– To “upper left”
A More Complex Taxonomy (451 group)
Oh My!
Taxonomy Used In This Talk
Key-value stores
Semi-structured databases
Parallel relational
Graph databases & Triplestores
A Short History of Key Value Stores
2004: Google invented BigTable
– Now being replaced by Spanner (distributed transactions, SQL)
2007: HBase (open source BigTable): hbase.apache.org
– Large & growing user community; HDFS file system
2008: Facebook invents Cassandra
– HBase data model, but P2P file system; released open source
2010: Facebook enhances & adopts HBase internally
2011: NSA releases Accumulo open source: accumulo.apache.org
– Similar to HBase; includes data sensitivity labels
2012: Basho releases Riak: wiki.basho.com
– Web friendly; based on Amazon’s Dynamo paper
Key-Value Store Data Model
Datasets typically modeled as one very large table
Key: <row id, column id, version>
– Row id (canonical Google row id: reversed URL)
– Column id: a static number of carefully designed column “families”; each family can have an unbounded number of columns
– Version-timestamp: database keeps record of all previous values (update = append)
Query examples:
– Given a full key, return the value
– Given a column ID and a value, return all matching rows
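The data model can be mimicked with a small in-memory class, which makes the “update = append” behavior and the two query styles concrete. The class VersionedTable and its methods are hypothetical and ignore distribution, column families, and persistence.

```python
import time
from collections import defaultdict


class VersionedTable:
    """Toy key-value table keyed by <row id, column id, version>."""

    def __init__(self):
        # (row_id, column_id) -> list of (version, value); never overwritten
        self._cells = defaultdict(list)

    def put(self, row_id, column_id, value, version=None):
        """An update appends a new version instead of replacing the old one."""
        if version is None:
            version = time.time_ns()
        self._cells[(row_id, column_id)].append((version, value))

    def get(self, row_id, column_id, version=None):
        """Given a full key, return the value (latest version by default)."""
        versions = self._cells.get((row_id, column_id), [])
        if not versions:
            return None
        if version is None:
            return max(versions, key=lambda pair: pair[0])[1]
        for stored_version, value in versions:
            if stored_version == version:
                return value
        return None

    def rows_matching(self, column_id, value):
        """Given a column id and a value, return all matching row ids.

        There is no index on values, so this is a full scan.
        """
        return sorted(row for (row, col) in self._cells
                      if col == column_id and self.get(row, col) == value)


if __name__ == "__main__":
    t = VersionedTable()
    t.put("com.example.www", "cf:title", "Old title", version=1)
    t.put("com.example.www", "cf:title", "New title", version=2)
    print(t.get("com.example.www", "cf:title"))
    print(t.get("com.example.www", "cf:title", version=1))
```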
Other Characteristics of Key Value Stores
Performance: designed for scale out
– 1 index on the key (faster than HDFS scan), no optimizer
Cost: Typically open source; need Hadoop / programming skills
– Cloudera support is ~$4K/node
Roles:
– Great fit: for data you don’t understand well yet (e.g., ETL): massive, rapidly arriving, highly non-homogenous datasets; need for query by key; enriching by adding arbitrary columns
– Poor fit: if you know exactly what your data looks like (lose schema)
HBase Table Creation Example
Create a table named test with a single column family named cf. Verify its creation by listing all tables and then insert some values.
hbase(main):003:0> create 'test', 'cf'
0 row(s) in 1.2200 seconds
hbase(main):003:0> list 'test'
..
1 row(s) in 0.0550 seconds
hbase(main):004:0> put 'test', 'row1', 'cf:a', 'value1'
0 row(s) in 0.0560 seconds
hbase(main):005:0> put 'test', 'row2', 'cf:b', 'value2'
0 row(s) in 0.0370 seconds
hbase(main):006:0> put 'test', 'row3', 'cf:c', 'value3'
0 row(s) in 0.0450 seconds
HBase Example
Verify the data insert by running a scan of the table:
hbase(main):007:0> scan 'test'
ROW    COLUMN+CELL
row1   column=cf:a, timestamp=1288380727188, value=value1
row2   column=cf:b, timestamp=1288380738440, value=value2
row3   column=cf:c, timestamp=1288380747365, value=value3
3 row(s) in 0.0590 seconds
Get a single row:
hbase(main):008:0> get 'test', 'row1'
COLUMN   CELL
cf:a     timestamp=1288380727188, value=value1
1 row(s) in 0.0400 seconds
Taxonomy Used In This Talk
Key-value stores
Semi-structured databases
Parallel relational
Graph databases & Triplestores
A Short History of Semi-structured Databases
1980’s: “Object-oriented” DBs invented; didn’t take off
– Addressed gap between relations & programming languages
– Good for data hard for RDBMS’s: aircraft & chip designs
1995: Stanford LORE project induces XML schema from data
– Coined term “semi-structured” due to flexible schema
2000’s: “Sharding” gave semi-structured databases new life
– Now often called “document oriented” (but not “Documentum”)
– Great list at en.wikipedia.org/wiki/Document-oriented_database
2009: open source MongoDB; 10gen support; JSON data model
2012: UCI Asterix project: www.cs.ucsb.edu/common/wordpress/?p=1533
– Goal: Open source “Postgres-quality” flexible schema DBMS
Semi-structured Database Data Model
Objects defined by grammar (XML, JSON)
– One table per object type; optional attributes
– Tight programming language interface
– Good compromise between Key-Value and RDBMS
JSON Example (JavaScript Object Notation):
– JSON provides syntax for storing and exchanging text information; JSON is smaller than XML and easier to parse.
– Looks much like C, Java, etc. data structures
{"employees": [
  { "firstName":"John" , "lastName":"Doe" },
  { "firstName":"Anna" , "lastName":"Smith" },
  { "firstName":"Peter" , "lastName":"Jones" }
]}
– The employees object is an array of 3 employee records (objects).
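To illustrate the “tight programming language interface”, the same JSON text can be parsed with Python’s standard json module into ordinary lists and dictionaries. The variable names and the optional middleName attribute are illustrative only.

```python
import json

DOC = """
{"employees": [
  { "firstName":"John" , "lastName":"Doe" },
  { "firstName":"Anna" , "lastName":"Smith" },
  { "firstName":"Peter" , "lastName":"Jones" }
]}
"""

data = json.loads(DOC)            # JSON text -> dict / list / str values
employees = data["employees"]     # an array of 3 employee objects
last_names = [e["lastName"] for e in employees]

# Optional attributes: a record may simply lack a field, so read it
# defensively instead of assuming a fixed schema.
middle_names = [e.get("middleName") for e in employees]

if __name__ == "__main__":
    print(len(employees), last_names, middle_names)
```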
Other Features of Semi-structured Databases
Speed: shards for scale out; often a limited optimizer
Cost: Some free, few features; some $500K, many features
Killer app(s):
– Good fit for “like-but-varying” objects, accessed similarly
– Would have used a relational database, but objects aren’t regular
– Rapid prototyping in scientific lab
– “Cloud server”: serving objects used as web content
MongoDB Table Creation Example
Create a collection named library with a maximum of 50,000 entries.
> db.createCollection("library", { capped : true, size : 536870912, max : 50000 } )
Insert a book (a JSON object):
> p = { author: "F. Scott Fitzgerald", acquisitiondate: new Date(), title: "The Great Gatsby", tags: ["Crash", "Reckless", "1920s"] }
> db.library.save(p)
Retrieve the book:
> db.library.find( { title: "The Great Gatsby" } )
{ "_id" : ObjectId("50634d86be4617f17bb159cd"), "author" : "F. Scott Fitzgerald", "acquisitiondate" : "10/28/2012", "title" : "The Great Gatsby", "tags" : ["Crash", "Reckless", "1920s"] }
Taxonomy Used In This Talk
Key-value stores
Semi-structured databases
Parallel relational
Graph databases & Triplestores (this is Irina’s talk)
Example Systems
Key-value stores
– BigTable, HBase, Accumulo, Cassandra, Riak, …
Semi-structured
– MongoDB, CouchDB (JSON-like); Gemfire (OQL); Marklogic (Xquery, SQL); Asterix, …
Parallel relational
– Vertica, Greenplum, AsterData, Paraccel, Teradata, Netezza, …
Graph databases & Triplestores
– FlockDB (simple), “Big Linked Data”, Titan (Gremlin/Tinkerpop), Neo4j (Gremlin/Tinkerpop, SPARQL), AllegroGraph (SPARQL)
Legend (color coding on the original slide): Proprietary; Open source or Research; Open source, commercial version / support; Open source, GOTS; Commercially available
Many are “noSQL” systems
Outline
Taxonomy
Some important software realities for parallel databases
– Sharding
– Optimizers
– Data Consistency
Systems engineering strategies
A Simple Comparison of Properties
[Table comparing Key Value, Semi-Struct, and Parallel RDBMS systems on five properties: Sharding, Optimizer, Programming Language Integration, Flexible Data Model, Data Consistency]
The Asterix system being developed at UCI intends to have a high score on all 5 properties
© 2012 The MITRE Corporation. All rights reserved
81
Sharding
“Sharding” maps one table into a set of distributed fragments
– Each fragment is located at a single compute node
Horizontal partitioning
– Shards are typically defined by key-range partition, but various hashing strategies are possible
– Speeds up parallel operations (e.g., search, summation)
Replication
– Multiple copies can be generated for each partition
– Speeds read access, improves availability
Issue: how do you shard graph data?
– Facebook does it randomly! (No good split)
All parallel DBMSs shard data somehow
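The two partitioning strategies mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the key ranges are the ones from the sharding illustration (0..30, 31..60, 61..90, 91..100), the function names are hypothetical, and the "nodes" are plain in-memory lists rather than real servers.

```python
from bisect import bisect_left
import hashlib

# Upper bounds of the four key ranges: 0..30, 31..60, 61..90, 91..100.
RANGE_UPPER_BOUNDS = [30, 60, 90, 100]


def range_shard(key: int) -> int:
    """Return the index of the key-range shard that owns `key`."""
    if not 0 <= key <= RANGE_UPPER_BOUNDS[-1]:
        raise ValueError(f"key {key} is outside 0..{RANGE_UPPER_BOUNDS[-1]}")
    return bisect_left(RANGE_UPPER_BOUNDS, key)


def hash_shard(key: int, num_shards: int) -> int:
    """Alternative strategy: spread keys by a stable hash instead of by range."""
    digest = hashlib.md5(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def build_shards(rows):
    """Horizontally partition (key, value) rows into one list per key range."""
    shards = [[] for _ in RANGE_UPPER_BOUNDS]
    for key, value in rows:
        shards[range_shard(key)].append((key, value))
    return shards


def sharded_sum(shards):
    """Sum values shard by shard; each partial sum could run on its own node."""
    partial_sums = [sum(value for _, value in shard) for shard in shards]
    return sum(partial_sums)


rows = [(key, key * 2) for key in range(0, 101)]
shards = build_shards(rows)
print([len(shard) for shard in shards], sharded_sum(shards))
```

Range sharding keeps neighboring keys together, which helps range scans; hash sharding spreads load more evenly but scatters adjacent keys. Replication would add secondary copies of each list, which this sketch leaves out.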
Sharding Illustration
[Figure: four horizontal partitions by key range (0..30, 31..60, 61..90, 91..100); each partition is kept as multiple copies, one primary and two secondaries.]
Software Realities for Parallel Databases
Realities:
– Sharding
– Optimizers
– Transactions & Data Consistency
Optimizers & Efficient Queries
Optimizers automatically rewrite user queries into an equivalent and more efficiently executable form
– Invented in the 70’s to make SQL possible
– The crown jewels of commercial (one node) RDBMSs!
Parallel databases can “scale out” to improve performance
– Want an order of magnitude speedup? 100 → 1000 nodes!
– Many use a far simpler query language, if one at all (e.g., search by key)
Less need/benefit for an optimizer
– Example: HBase provides 1 index, bloom filters, caching, no optimizer
Parallel relational databases
– Can scale out, and also provide optimizers to get more done with fewer nodes
– Very sophisticated data migration primitives (moving shards to the computation, if cheaper, managing solid state & disk, …)
Scale out and/or optimizers? It depends!!
Optimizing a Single Node RDBMS
Given 3 relations (tables) of data: Pilots, Flights, Aircraft
– Join conditions: Pilots.name = Flights.pilot_name; Flights.aircraft_id = Aircraft.id
Which pilots have flown prop-jets? (In SQL)
SELECT DISTINCT Pilots.name
FROM Pilots, Flights, Aircraft
WHERE Pilots.name = Flights.pilot_name
  AND Flights.aircraft_id = Aircraft.id
  AND Aircraft.type = 'prop-jets'
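The query above can be tried on a toy database. The sketch below uses Python's built-in sqlite3 module, with table names spelled consistently as Pilots, Flights, and Aircraft; the column types and the sample rows are assumptions made up for illustration, not data from the talk.

```python
import sqlite3

# In-memory database with the three relations; the schema details are assumed.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Pilots   (name TEXT PRIMARY KEY);
    CREATE TABLE Aircraft (id INTEGER PRIMARY KEY, type TEXT);
    CREATE TABLE Flights  (pilot_name TEXT, aircraft_id INTEGER);

    INSERT INTO Pilots   VALUES ('Ada'), ('Ben'), ('Cy');
    INSERT INTO Aircraft VALUES (1, 'prop-jets'), (2, 'jet');
    INSERT INTO Flights  VALUES ('Ada', 1), ('Ada', 1), ('Ben', 2), ('Cy', 1);
""")

QUERY = """
    SELECT DISTINCT Pilots.name
    FROM Pilots, Flights, Aircraft
    WHERE Pilots.name = Flights.pilot_name
      AND Flights.aircraft_id = Aircraft.id
      AND Aircraft.type = 'prop-jets'
"""

# DISTINCT collapses Ada's two prop-jet flights into a single answer row.
prop_jet_pilots = sorted(row[0] for row in conn.execute(QUERY))
print(prop_jet_pilots)
```

The query only states what is wanted; the order in which the joins and the selection run is left to the optimizer.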
Initial Query Execution Plan
Database: Pilots (50), Flights (10,000,000), Aircraft (2000)
[Figure: plan tree, read bottom to top. Scan Pilots (50), Flights (10,000,000), and Aircraft (2000); join Pilots with Flights (10,000,000); join the result with Aircraft (10,000,000); select only prop-jets, 0.1% (10,000); project the distinct pilot names (10) to give the answer.]
Total tuples processed: 30,012,060
Query Optimization: Improved Plan
Database: Pilots (50), Flights (10,000,000), Aircraft (2000)
[Figure: plan tree, read bottom to top. Select only prop-jets, 0.1% of Aircraft (2); indexed retrieval from Flights (10,000); join (10,000); scan Pilots (50); join (10,000); project only distinct pilots’ names (10) to give the answer.]
Total tuples processed: 30,062
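The two totals can be checked with simple arithmetic. The sketch below adds up the tuple counts shown on the two plan slides; the step labels and the assignment of each count to a step are a reading of the diagrams, so treat them as assumptions.

```python
# Table sizes and result sizes as given on the slides.
PILOTS, FLIGHTS, AIRCRAFT = 50, 10_000_000, 2_000
DISTINCT_PILOTS = 10


def prop_jet_share(count: int) -> int:
    """Prop-jets are 0.1% of the data, i.e. one tuple in a thousand."""
    return count // 1000


# Initial plan: scan everything, join everything, filter last.
initial_plan = [
    ("scan Pilots", PILOTS),
    ("scan Flights", FLIGHTS),
    ("scan Aircraft", AIRCRAFT),
    ("join Pilots with Flights", FLIGHTS),
    ("join with Aircraft", FLIGHTS),
    ("select prop-jets", prop_jet_share(FLIGHTS)),
    ("project distinct names", DISTINCT_PILOTS),
]

# Improved plan: filter Aircraft first, then fetch only the matching flights.
improved_plan = [
    ("select prop-jets from Aircraft", prop_jet_share(AIRCRAFT)),
    ("indexed retrieval from Flights", prop_jet_share(FLIGHTS)),
    ("join Aircraft with Flights", prop_jet_share(FLIGHTS)),
    ("scan Pilots", PILOTS),
    ("join with Pilots", prop_jet_share(FLIGHTS)),
    ("project distinct names", DISTINCT_PILOTS),
]


def total_tuples(plan) -> int:
    """Sum the tuples handled by every step of a plan."""
    return sum(count for _, count in plan)


print(total_tuples(initial_plan), total_tuples(improved_plan))
```

Pushing the selection below the joins is what shrinks the work by roughly three orders of magnitude: the same answer, with about a thousandth of the tuples processed.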
Parallel DBMS Optimizer Comparison
Key value stores
– Typically do not optimize queries; rely on scale out
Semi-structured DBMS’s
– Typically a simple approach, also relying on scale-out
– MongoDB tries to determine best index when two are available
Parallel RDBMSs
– Typically provide sophisticated optimizers
– Migration; reasoning about storage hierarchy
– Greenplum migration primitives (www.greenplum.com/technology/optimizer):
1) Broadcast Motion (N:N): every segment sends target data to all others
2) Redistribute Motion (N:N): every segment rehashes the target data (by join column) and redistributes each row to the appropriate segment
3) Gather Motion (N:1): every segment sends the target data to a single node (usually the master)
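The three motions can be pictured with a toy simulation. The sketch below is not Greenplum code or its API: segments are plain Python lists of row dictionaries, the function names are hypothetical, and the CRC32-based hash is just an assumed stand-in for whatever hash function a real system uses.

```python
import zlib


def broadcast_motion(segments):
    """N:N. Every segment ends up with a copy of all rows from all segments."""
    all_rows = [row for segment in segments for row in segment]
    return [list(all_rows) for _ in segments]


def redistribute_motion(segments, join_column):
    """N:N. Rehash each row on the join column and send it to the owning segment."""
    result = [[] for _ in segments]
    for segment in segments:
        for row in segment:
            key_bytes = str(row[join_column]).encode()
            target = zlib.crc32(key_bytes) % len(segments)
            result[target].append(row)
    return result


def gather_motion(segments):
    """N:1. Every segment sends its rows to a single node (e.g., the master)."""
    return [row for segment in segments for row in segment]


segments = [
    [{"id": 1, "k": 10}, {"id": 2, "k": 11}],
    [{"id": 3, "k": 10}],
    [{"id": 4, "k": 12}],
]

broadcasted = broadcast_motion(segments)
redistributed = redistribute_motion(segments, "k")
gathered = gather_motion(segments)
print(len(broadcasted[0]), [len(s) for s in redistributed], len(gathered))
```

After redistribution, rows that share a join key sit on the same segment, so each segment can join its own rows locally. Broadcast achieves the same goal by copying, which is usually only worthwhile for a small table.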
Software Realities for Parallel Databases
Realities:
– Sharding
– Optimizers
– Transactions & Data Consistency
Global Data Consistency
Given updates to replicated data shards, how do you keep them all consistent?
[Figure: an update (+1) is sent to three replicas of a value; the replicas show 3, 3, and 2.]
Classic DB theory solution:
– Two phase commit (2PC): all vote; if all say yes, then all commit
– Nice, but communication is costly in a global data center network!
– Thus, Amazon has sometimes been happy to sell a book it doesn’t have.
Eventual consistency (a hallmark of early “NoSQL”)
– No guarantee of “snapshot isolation”
– Over time, replicas converge despite node failures & network partitions
– Many different flavors / implementations (e.g., HBase, Cassandra)
– See also: www.cs.kent.edu/~jin/Cloud12Spring/HbaseHivePig.pptx
Google just invented “Spanner” (~2PC!)
– Global consistency via atomic clocks/GPS (not everyone has these); reduces communications
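The "all vote; if all say yes, then all commit" rule of two phase commit can be sketched as follows. This is a minimal in-memory illustration with hypothetical class and function names; it leaves out everything that makes real 2PC hard (durable logs, timeouts, coordinator failure, network messages).

```python
class Replica:
    """One copy of a replicated value that can vote on a proposed update."""

    def __init__(self, name, value, can_commit=True):
        self.name = name
        self.value = value
        self.can_commit = can_commit  # assumed flag simulating a failing node
        self.pending = None

    def prepare(self, delta):
        """Phase 1: stage the update and vote yes, or vote no."""
        if not self.can_commit:
            return False
        self.pending = delta
        return True

    def commit(self):
        """Phase 2 (all voted yes): apply the staged update."""
        self.value += self.pending
        self.pending = None

    def abort(self):
        """Phase 2 (someone voted no): discard the staged update."""
        self.pending = None


def two_phase_commit(replicas, delta):
    """Coordinator: collect every vote, then commit everywhere or nowhere."""
    votes = [replica.prepare(delta) for replica in replicas]
    if all(votes):
        for replica in replicas:
            replica.commit()
        return True
    for replica in replicas:
        replica.abort()
    return False


healthy = [Replica("a", 2), Replica("b", 2), Replica("c", 2)]
committed = two_phase_commit(healthy, +1)

degraded = [Replica("a", 2), Replica("b", 2), Replica("c", 2, can_commit=False)]
aborted = not two_phase_commit(degraded, +1)

print(committed, [r.value for r in healthy], aborted, [r.value for r in degraded])
```

Every update costs two rounds of messages to every replica, which is the communication overhead the slide calls costly across a global network. Eventual consistency drops the vote and lets replicas apply updates on their own schedule instead.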
Outline
Taxonomy
Software realities for parallel databases
Systems engineering strategies
Systems Engineering Strategy
You can often get by with just one parallel database
– a key value store for ETL, and some BI
– a parallel RDBMS for BI, and as a cloud server
– or no DBMS (e.g., just use HDFS)
… But one size is NOT the best fit for all
– Sweet spots exist for each type
– This is different from the relational era!
Roles In The Funnel Workflow Model
Funnel workflow stages:
1) Ingest diverse raw data sources: text, sensor feeds, semi-structured (e.g., web content, email)
2) Transform, clean, subset, integrate, index, new datasets. Enrich: extract entities, compute features & metrics.
3) Generate more structured datasets as needed: RDBMS tables, objects, triple stores
4) Generate & explore user-facing analytic models (data cubes, graphs, clusters). Drill down to details.
DBMS roles:
1) Key value stores: Manage & query ETL datasets, compute metrics
2) Semi-structured DBS: Persist / query generated objects
3) Parallel RDBMSs, Graph DBS: Support BI queries, graph exploration, …
Some Systems Engineering Strategies
1) Tunnel vision:
– Use one type of DBMS & just live with its shortcomings if/when you encounter them
2) Optimal assignment:
– Pick the best one for each type of workload you will encounter
– It takes skill to know how to pick, mix, match up front!
3) Keep your eye on it:
– Look at user experiences (forums), best practices
– Pick initial system(s) that look right & be ready to learn as you go
– May migrate to a more “final” system over time
– Google, Facebook are doing this all the time! BigTable to Caffeine to Spanner; Cassandra to (customized) HBase
Questions?