HBase and Drill How loosely typed SQL is ideal for NoSQL...HBase and Drill How loosely typed SQL is...

Preview:

Citation preview

© 2014 MapR Technologies ‹#›© 2014 MapR Technologies

HBase and Drill How loosely typed SQL is ideal for NoSQL Tugdual Grall @tgrall

Sept. 30, 2015

© 2015 MapR Technologies ‹#›@tgrall

{“about” : “me”}Tugdual “Tug” Grall • MapR

• Technical Evangelist • MongoDB

• Technical Evangelist • Couchbase

• Technical Evangelist • eXo

• CTO • Oracle

• Developer/Product Manager • Mainly Java/SOA

• Developer in consulting firms

• Web • @tgrall • http://tgrall.github.io • tgrall

• NantesJUG co-founder

• Pet Project : • http://www.resultri.com

• tug@mapr.com • tugdual@gmail.com

© 2014 MapR Technologies ‹#›

Agenda• What does good mean? • What do we mean by loose typing? • Examples of what you can do • Real database with 10-20x fewer tables • Looking forward • Questions

© 2014 MapR Technologies ‹#›

What Does Good Mean (for a DB)?• Expressive

– Must express the concepts we need

• Efficient – Must run fast enough on cheap enough hardware

© 2014 MapR Technologies ‹#›

What Does Good Mean (for a DB)?• Expressive

– Must express the concepts we need

• Efficient – Must run fast enough on cheap enough hardware

• Introspectable – Must be able to inspect the data and schema and gain understanding

© 2014 MapR Technologies ‹#›

What is New Here• Introspection is better when

– A minimum of data entities are used to describe our model – No name overflow – Referential scoping helps narrow our focus to a simpler problem – Many-to-one relations can in-lined

• Introspection was not a goal for the design of the relational model • Introspection was therefore not a result either

© 2014 MapR Technologies ‹#›

Older than Dirt• Relational theory is old (1970)

– Pre-dates data structures – Predates mainstream recursive procedures – Predates lexical scoping – Predates logic programming – Predates real functional programming (Church, McCarthy, Iverson,

Backus and not-withstanding)

• Some updates are in order to enhance introspection

© 2014 MapR Technologies ‹#›

Contrast Relational and HBase Style noSQLRelational

• Rows containing fields • Fields contain primitive types • Structure is fixed and uniform • Structure is pre-defined • Referential integrity (optional)

• Expressions over sets of rows

HBase / MapR DB • Rows contain fields • Fields bytes • Structure is flexible • No pre-defined structure • Single key • Column families • Timestamps • Versions

© 2014 MapR Technologies ‹#›

Contrast relational and HBase with StructuringRelational

• Rows containing fields • Fields contain primitive types • Structure is fixed and uniform • Structure is pre-defined • Referential integrity (optional)

• Expressions over sets of rows

HBase + Structuring • Rows contain fields • Fields contain primitive types

– Or objects, or lists • Structure is flexible, ragged • No pre-defined structure • Single key

© 2014 MapR Technologies ‹#›

Turtle Models for Databases• Allows complex objects in field values

– JSON style lists and objects • Allow references to objects via join

– Includes references localized within lists • Lists of objects and objects of lists are isomorphic to tables so …

• Complex data in tables, • But also tables in complex data, • Even tables containing complex data containing tables

© 2014 MapR Technologies ‹#›

A Catalog of noSQL Idioms

© 2014 MapR Technologies ‹#›

Tables as Objects, Objects as Tables

c1 c2 c3

Row-wise form

c1 c2 c3

Column-wise form

[ { c1:v1, c2:v2, c3:v3 }, { c1:v1, c2:v2, c3:v3 }, { c1:v1, c2:v2, c3:v3 } ]

List of objects

{ c1:[v1, v2, v3], c2:[v1, v2, v3], c3:[v1, v2, v3] }

Object containing lists

© 2014 MapR Technologies ‹#›

c1 c2 c3

c1 c2 c3

Micro Columnar Formats

An entire table stored in columnar form can be a

first-class value using these techniques

This is very powerful for in-lining one-to-many relations.

© 2014 MapR Technologies ‹#›

Note• If embedded tables are first-class, schema becomes data

• If schema is data-driven when embedded, constructs that elevate tables to top-level are impossible

• Thus, embedded first-class objects implies late discovery of schema information

© 2014 MapR Technologies ‹#›

A first example: Time-series data

© 2014 MapR Technologies ‹#›

Column names as data• When column names are not pre-defined, they can convey

information

• Examples – Time offsets within a window for time series – Top-level domains for web crawlers – Vendor id’s for customer purchase profiles

• Predefined schema is impossible for this idiom

© 2014 MapR Technologies ‹#›

Relational Model for Time-series

© 2014 MapR Technologies ‹#›

Table Design: Point-by-Point

© 2014 MapR Technologies ‹#›

Table Design: Hybrid Point-by-Point + Sub-table

After close of window, data in row is restated as column-oriented tabular value in different column family.

© 2014 MapR Technologies ‹#›

Compression Results

Samples are 64b time, 16 bit sample

Sample time at 10kHz

Sample time jitter makes it important to keep original time-stamp

How much overhead to retain time-stamp?

© 2014 MapR Technologies ‹#›

A second example: Music meta-data

© 2014 MapR Technologies ‹#›

MusicBrainz on NoSQL• Artists, albums, tracks and labels are key objects • Reality check:

– Add works (compositions), recordings, release, release group • 7 tables for artist alone • 12 for place, 7 for label, 17 for release/group, 8 for work

– (but only 4 for recording!) – Total of 12 + 7 + 17 + 8 + 4 = 48 tables

• But wait, there’s more! – 10 annotation tables, 10 edit tables, 19 tag tables, 5 rating tables, 86 link

tables, 5 cover art tables and 3 tables for CD timing info (138 total) – And 50 more tables that aren’t documented yet

© 2014 MapR Technologies ‹#›

© 2014 MapR Technologies ‹#›

180 tablesnot shown

© 2014 MapR Technologies ‹#›

236 tables to describe 7 kinds of things

© 2014 MapR Technologies ‹#›

Can we do better?

© 2014 MapR Technologies ‹#›

artist

idgidnamesort_namebegin_dateend_dateendedtypegenderareabeing_areaend_areacommentlist<ipi>list<isni>list<alias>list<release_id>list<recording_id>

artist

idgidnamesort_namebegin_dateend_dateendedtypegenderareabeing_areaend_areacommentlist<ipi>list<isni>list<alias>

© 2014 MapR Technologies ‹#›

artist

idgidnamesort_namebegin_dateend_dateendedtypegenderareabeing_areaend_areacommentlist<ipi>list<isni>list<alias>list<release_id>list<recording_id>

© 2014 MapR Technologies ‹#›

artist

idgidnamesort_namebegin_dateend_dateendedtypegenderareabegin_areaend_areacommentlist<ipi>list<isni>list<alias>list<release_id>list<recording_id>

{ name, begin_date, end_date }

© 2014 MapR Technologies ‹#›

© 2014 MapR Technologies ‹#›

{id, recording_id, name, list<credit>length}

recordingidgidlist<credit>namelist<track_ref>

{id, format, name, list<track>}

release_groupidgidnamelist<credit>typelist<release_id>

© 2014 MapR Technologies ‹#›

27 tables reduce to 4

© 2014 MapR Technologies ‹#›

27 tables reduce to 4 so far

© 2014 MapR Technologies ‹#›

Further Reductions• All 86 link tables become properties on artists, releases and other

entities • All 44 tag, rating and annotation tables become list properties • All 5 cover art tables become lists of file references

• Current score: 162 tables become 4

• You get the idea

© 2014 MapR Technologies ‹#›

Is This Good?• Expressivity

– The JSON data model is at least as expressive as the original relational model

• Many cases easier to describe in nested data • No cases are harder

• Efficiency – Inlining can increase data size. Locality improves, however – Sessionizing can substantially decrease data size – Inlining back-references is more efficient than ordinary indexes – Inlined columnar data allows 1000x speedup for time series

• Introspection (you decide)

© 2014 MapR Technologies ‹#›

But How Can We Query This?• Can’t use SQL

– SQL is strongly typed – SQL is heavily tied into the original relational model – SQL generating tools require relational model

• Must use SQL – Vast numbers of tools and people understand how to write SQL – SQL is the lingua franca of databases

© 2014 MapR Technologies ‹#›

Squaring the Circle• Enter Apache Drill

• Drill is SQL compliant – Uses standard syntax and semantics

• Drill extends SQL – First class treatment of objects, lists – Full support for destructuring, flattening – Full power of relational model can be applied to complex data

© 2014 MapR Technologies ‹#›

Drill Provides Scalable and Extended SQL

© 2014 MapR Technologies ‹#›

Sample Query• Find Elvis

select distinct id, name, alias from ( select id, flatten(alias.name) alias from artist where alias like 'Elvis%Presley' )

© 2014 MapR Technologies ‹#›

Example Query• Find discs where Elvis was credited

select distinct album_id, name from ( select id album_id, name, flatten(credit) from release ) albums join ( select distinct artist_id from ( select id artist_id, flatten(alias) from artist where name like 'Elvis%Presley’ ) ) artists using artist_id

© 2014 MapR Technologies ‹#›

Summary• Extended relational model allows massive simplification

– On a real example, we see >20x reduction in number of tables

• Simplification drives improved introspection – This is good

• Apache Drill gives very high performance execution for extended relational problems

• You can try this out today

© 2014 MapR Technologies ‹#›

© 2014 MapR Technologies ‹#›

Thank You!

© 2015 MapR Technologies ‹#›@tgrall

Q & A@tgrall maprtech

tug@mapr.com

Engage with us!

MapR

maprtech

mapr-technologies

Recommended