View
4
Download
0
Category
Preview:
Citation preview
© 2014 MapR Technologies ‹#›© 2014 MapR Technologies
HBase and Drill How loosely typed SQL is ideal for NoSQL Tugdual Grall @tgrall
Sept. 30, 2015
© 2015 MapR Technologies ‹#›@tgrall
{“about” : “me”}Tugdual “Tug” Grall • MapR
• Technical Evangelist • MongoDB
• Technical Evangelist • Couchbase
• Technical Evangelist • eXo
• CTO • Oracle
• Developer/Product Manager • Mainly Java/SOA
• Developer in consulting firms
• Web • @tgrall • http://tgrall.github.io • tgrall
• NantesJUG co-founder
• Pet Project : • http://www.resultri.com
• tug@mapr.com • tugdual@gmail.com
© 2014 MapR Technologies ‹#›
Agenda• What does good mean? • What do we mean by loose typing? • Examples of what you can do • Real database with 10-20x fewer tables • Looking forward • Questions
© 2014 MapR Technologies ‹#›
What Does Good Mean (for a DB)?• Expressive
– Must express the concepts we need
• Efficient – Must run fast enough on cheap enough hardware
© 2014 MapR Technologies ‹#›
What Does Good Mean (for a DB)?• Expressive
– Must express the concepts we need
• Efficient – Must run fast enough on cheap enough hardware
• Introspectable – Must be able to inspect the data and schema and gain understanding
© 2014 MapR Technologies ‹#›
What is New Here• Introspection is better when
– A minimum of data entities are used to describe our model – No name overflow – Referential scoping helps narrow our focus to a simpler problem – Many-to-one relations can in-lined
• Introspection was not a goal for the design of the relational model • Introspection was therefore not a result either
© 2014 MapR Technologies ‹#›
Older than Dirt• Relational theory is old (1970)
– Pre-dates data structures – Predates mainstream recursive procedures – Predates lexical scoping – Predates logic programming – Predates real functional programming (Church, McCarthy, Iverson,
Backus and not-withstanding)
• Some updates are in order to enhance introspection
© 2014 MapR Technologies ‹#›
Contrast Relational and HBase Style noSQLRelational
• Rows containing fields • Fields contain primitive types • Structure is fixed and uniform • Structure is pre-defined • Referential integrity (optional)
• Expressions over sets of rows
HBase / MapR DB • Rows contain fields • Fields bytes • Structure is flexible • No pre-defined structure • Single key • Column families • Timestamps • Versions
© 2014 MapR Technologies ‹#›
Contrast relational and HBase with StructuringRelational
• Rows containing fields • Fields contain primitive types • Structure is fixed and uniform • Structure is pre-defined • Referential integrity (optional)
• Expressions over sets of rows
HBase + Structuring • Rows contain fields • Fields contain primitive types
– Or objects, or lists • Structure is flexible, ragged • No pre-defined structure • Single key
© 2014 MapR Technologies ‹#›
Turtle Models for Databases• Allows complex objects in field values
– JSON style lists and objects • Allow references to objects via join
– Includes references localized within lists • Lists of objects and objects of lists are isomorphic to tables so …
• Complex data in tables, • But also tables in complex data, • Even tables containing complex data containing tables
© 2014 MapR Technologies ‹#›
A Catalog of noSQL Idioms
© 2014 MapR Technologies ‹#›
Tables as Objects, Objects as Tables
c1 c2 c3
Row-wise form
c1 c2 c3
Column-wise form
[ { c1:v1, c2:v2, c3:v3 }, { c1:v1, c2:v2, c3:v3 }, { c1:v1, c2:v2, c3:v3 } ]
List of objects
{ c1:[v1, v2, v3], c2:[v1, v2, v3], c3:[v1, v2, v3] }
Object containing lists
© 2014 MapR Technologies ‹#›
c1 c2 c3
c1 c2 c3
Micro Columnar Formats
An entire table stored in columnar form can be a
first-class value using these techniques
This is very powerful for in-lining one-to-many relations.
© 2014 MapR Technologies ‹#›
Note• If embedded tables are first-class, schema becomes data
• If schema is data-driven when embedded, constructs that elevate tables to top-level are impossible
• Thus, embedded first-class objects implies late discovery of schema information
© 2014 MapR Technologies ‹#›
A first example: Time-series data
© 2014 MapR Technologies ‹#›
Column names as data• When column names are not pre-defined, they can convey
information
• Examples – Time offsets within a window for time series – Top-level domains for web crawlers – Vendor id’s for customer purchase profiles
• Predefined schema is impossible for this idiom
© 2014 MapR Technologies ‹#›
Relational Model for Time-series
© 2014 MapR Technologies ‹#›
Table Design: Point-by-Point
© 2014 MapR Technologies ‹#›
Table Design: Hybrid Point-by-Point + Sub-table
After close of window, data in row is restated as column-oriented tabular value in different column family.
© 2014 MapR Technologies ‹#›
Compression Results
Samples are 64b time, 16 bit sample
Sample time at 10kHz
Sample time jitter makes it important to keep original time-stamp
How much overhead to retain time-stamp?
© 2014 MapR Technologies ‹#›
A second example: Music meta-data
© 2014 MapR Technologies ‹#›
MusicBrainz on NoSQL• Artists, albums, tracks and labels are key objects • Reality check:
– Add works (compositions), recordings, release, release group • 7 tables for artist alone • 12 for place, 7 for label, 17 for release/group, 8 for work
– (but only 4 for recording!) – Total of 12 + 7 + 17 + 8 + 4 = 48 tables
• But wait, there’s more! – 10 annotation tables, 10 edit tables, 19 tag tables, 5 rating tables, 86 link
tables, 5 cover art tables and 3 tables for CD timing info (138 total) – And 50 more tables that aren’t documented yet
© 2014 MapR Technologies ‹#›
© 2014 MapR Technologies ‹#›
180 tablesnot shown
© 2014 MapR Technologies ‹#›
236 tables to describe 7 kinds of things
© 2014 MapR Technologies ‹#›
Can we do better?
© 2014 MapR Technologies ‹#›
artist
idgidnamesort_namebegin_dateend_dateendedtypegenderareabeing_areaend_areacommentlist<ipi>list<isni>list<alias>list<release_id>list<recording_id>
artist
idgidnamesort_namebegin_dateend_dateendedtypegenderareabeing_areaend_areacommentlist<ipi>list<isni>list<alias>
© 2014 MapR Technologies ‹#›
artist
idgidnamesort_namebegin_dateend_dateendedtypegenderareabeing_areaend_areacommentlist<ipi>list<isni>list<alias>list<release_id>list<recording_id>
© 2014 MapR Technologies ‹#›
artist
idgidnamesort_namebegin_dateend_dateendedtypegenderareabegin_areaend_areacommentlist<ipi>list<isni>list<alias>list<release_id>list<recording_id>
{ name, begin_date, end_date }
© 2014 MapR Technologies ‹#›
© 2014 MapR Technologies ‹#›
{id, recording_id, name, list<credit>length}
recordingidgidlist<credit>namelist<track_ref>
{id, format, name, list<track>}
release_groupidgidnamelist<credit>typelist<release_id>
© 2014 MapR Technologies ‹#›
27 tables reduce to 4
© 2014 MapR Technologies ‹#›
27 tables reduce to 4 so far
© 2014 MapR Technologies ‹#›
Further Reductions• All 86 link tables become properties on artists, releases and other
entities • All 44 tag, rating and annotation tables become list properties • All 5 cover art tables become lists of file references
• Current score: 162 tables become 4
• You get the idea
© 2014 MapR Technologies ‹#›
Is This Good?• Expressivity
– The JSON data model is at least as expressive as the original relational model
• Many cases easier to describe in nested data • No cases are harder
• Efficiency – Inlining can increase data size. Locality improves, however – Sessionizing can substantially decrease data size – Inlining back-references is more efficient than ordinary indexes – Inlined columnar data allows 1000x speedup for time series
• Introspection (you decide)
© 2014 MapR Technologies ‹#›
But How Can We Query This?• Can’t use SQL
– SQL is strongly typed – SQL is heavily tied into the original relational model – SQL generating tools require relational model
• Must use SQL – Vast numbers of tools and people understand how to write SQL – SQL is the lingua franca of databases
© 2014 MapR Technologies ‹#›
Squaring the Circle• Enter Apache Drill
• Drill is SQL compliant – Uses standard syntax and semantics
• Drill extends SQL – First class treatment of objects, lists – Full support for destructuring, flattening – Full power of relational model can be applied to complex data
© 2014 MapR Technologies ‹#›
Drill Provides Scalable and Extended SQL
© 2014 MapR Technologies ‹#›
Sample Query• Find Elvis
select distinct id, name, alias from ( select id, flatten(alias.name) alias from artist where alias like 'Elvis%Presley' )
© 2014 MapR Technologies ‹#›
Example Query• Find discs where Elvis was credited
select distinct album_id, name from ( select id album_id, name, flatten(credit) from release ) albums join ( select distinct artist_id from ( select id artist_id, flatten(alias) from artist where name like 'Elvis%Presley’ ) ) artists using artist_id
© 2014 MapR Technologies ‹#›
Summary• Extended relational model allows massive simplification
– On a real example, we see >20x reduction in number of tables
• Simplification drives improved introspection – This is good
• Apache Drill gives very high performance execution for extended relational problems
• You can try this out today
© 2014 MapR Technologies ‹#›
© 2014 MapR Technologies ‹#›
Thank You!
© 2015 MapR Technologies ‹#›@tgrall
Q & A@tgrall maprtech
tug@mapr.com
Engage with us!
MapR
maprtech
mapr-technologies
Recommended