HBase and Drill
How loosely typed SQL is ideal for NoSQL

Tugdual Grall @tgrall

Sept. 30, 2015

© 2014 MapR Technologies


{“about” : “me”}

Tugdual “Tug” Grall

• MapR
  – Technical Evangelist
• MongoDB
  – Technical Evangelist
• Couchbase
  – Technical Evangelist
• eXo
  – CTO
• Oracle
  – Developer/Product Manager
  – Mainly Java/SOA
• Developer in consulting firms
• Web
  – @tgrall
  – http://tgrall.github.io
  – tgrall
• NantesJUG co-founder
• Pet Project
  – http://www.resultri.com


Agenda
• What does good mean?
• What do we mean by loose typing?
• Examples of what you can do
• Real database with 10-20x fewer tables
• Looking forward
• Questions


What Does Good Mean (for a DB)?
• Expressive
  – Must express the concepts we need
• Efficient
  – Must run fast enough on cheap enough hardware
• Introspectable
  – Must be able to inspect the data and schema and gain understanding


What is New Here
• Introspection is better when
  – A minimum of data entities are used to describe our model
  – No name overflow
  – Referential scoping helps narrow our focus to a simpler problem
  – Many-to-one relations can be in-lined
• Introspection was not a goal for the design of the relational model
• Introspection was therefore not a result either


Older than Dirt
• Relational theory is old (1970)
  – Predates data structures
  – Predates mainstream recursive procedures
  – Predates lexical scoping
  – Predates logic programming
  – Predates real functional programming (Church, McCarthy, Iverson and Backus notwithstanding)
• Some updates are in order to enhance introspection


Contrast Relational and HBase Style noSQL

Relational
• Rows containing fields
• Fields contain primitive types
• Structure is fixed and uniform
• Structure is pre-defined
• Referential integrity (optional)
• Expressions over sets of rows

HBase / MapR DB
• Rows contain fields
• Fields contain bytes
• Structure is flexible
• No pre-defined structure
• Single key
• Column families
• Timestamps
• Versions


Contrast relational and HBase with Structuring

Relational
• Rows containing fields
• Fields contain primitive types
• Structure is fixed and uniform
• Structure is pre-defined
• Referential integrity (optional)
• Expressions over sets of rows

HBase + Structuring
• Rows contain fields
• Fields contain primitive types
  – Or objects, or lists
• Structure is flexible, ragged
• No pre-defined structure
• Single key


Turtle Models for Databases
• Allow complex objects in field values
  – JSON style lists and objects
• Allow references to objects via join
  – Includes references localized within lists
• Lists of objects and objects of lists are isomorphic to tables so …
• Complex data in tables,
• But also tables in complex data,
• Even tables containing complex data containing tables


A Catalog of noSQL Idioms


Tables as Objects, Objects as Tables

Row-wise form, with columns c1, c2, c3 (list of objects):

[ { c1:v1, c2:v2, c3:v3 }, { c1:v1, c2:v2, c3:v3 }, { c1:v1, c2:v2, c3:v3 } ]

Column-wise form, with columns c1, c2, c3 (object containing lists):

{ c1:[v1, v2, v3], c2:[v1, v2, v3], c3:[v1, v2, v3] }
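The equivalence between the two forms can be sketched in a few lines of Python. This is only an illustration, not something from the deck: the helper names `rows_to_columns` and `columns_to_rows` and the sample values are made up, and ragged rows are padded with `None`, which is one possible convention among several.

```python
from typing import Any

Row = dict[str, Any]
Columns = dict[str, list[Any]]


def rows_to_columns(rows: list[Row]) -> Columns:
    """Turn a list of objects into an object containing lists."""
    # Collect column names in first-seen order so ragged rows are tolerated.
    names: list[str] = []
    for row in rows:
        for name in row:
            if name not in names:
                names.append(name)
    # Missing fields become None so every column has one entry per row.
    return {name: [row.get(name) for row in rows] for name in names}


def columns_to_rows(columns: Columns) -> list[Row]:
    """Turn an object containing lists back into a list of objects."""
    length = max((len(values) for values in columns.values()), default=0)
    return [
        {name: values[i] for name, values in columns.items() if i < len(values)}
        for i in range(length)
    ]


# Made-up sample table with three columns and three rows.
table = [
    {"c1": 1, "c2": "a", "c3": True},
    {"c1": 2, "c2": "b", "c3": False},
    {"c1": 3, "c2": "c", "c3": True},
]
columnar = rows_to_columns(table)
round_trip = columns_to_rows(columnar)
```

For a uniform table like this one the round trip gives back the original rows, so either form can be stored as a field value and the other recovered on demand.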


Micro Columnar Formats

An entire table stored in columnar form can be a first-class value using these techniques.

This is very powerful for in-lining one-to-many relations.


Note
• If embedded tables are first-class, schema becomes data
• If schema is data-driven when embedded, constructs that elevate tables to top-level are impossible
• Thus, embedded first-class objects imply late discovery of schema information


A first example: Time-series data


Column names as data
• When column names are not pre-defined, they can convey information
• Examples
  – Time offsets within a window for time series
  – Top-level domains for web crawlers
  – Vendor IDs for customer purchase profiles
• Predefined schema is impossible for this idiom
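A minimal sketch of the time-offset example, using plain Python dictionaries in place of a real table. The one-second window, the millisecond units and the `add_sample` helper are all assumptions made for illustration; the point is only that the column name itself carries the offset, so the set of columns cannot be known ahead of time.

```python
WINDOW_MS = 1_000  # hypothetical window length: one second

Table = dict[int, dict[str, int]]


def add_sample(table: Table, time_ms: int, value: int) -> None:
    """Store one sample; the row key is the window, the column name is the offset."""
    window_start = time_ms - time_ms % WINDOW_MS  # row key
    offset = time_ms - window_start               # becomes the column name
    row = table.setdefault(window_start, {})
    row[str(offset)] = value


table: Table = {}
# Made-up samples: (time in ms, measured value).
for time_ms, value in [(5_000, 17), (5_001, 19), (5_250, 23), (6_003, 29)]:
    add_sample(table, time_ms, value)
```

Each window ends up with a different set of column names, which is exactly what a predefined schema cannot describe.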


Relational Model for Time-series


Table Design: Point-by-Point


Table Design: Hybrid Point-by-Point + Sub-table

After close of window, data in row is restated as column-oriented tabular value in different column family.
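The restating step might look roughly like the sketch below. The family names `points` and `packed`, the JSON encoding and the `restate_window` helper are hypothetical; the deck does not say how the column-oriented value is actually encoded.

```python
import json


def restate_window(row: dict[str, dict]) -> dict[str, dict]:
    """Fold a closed window's per-sample columns into one columnar cell."""
    points = row["points"]  # hypothetical family holding one column per sample
    # Column names carry the offsets, so parse them back into integers and sort.
    pairs = sorted((int(name), value) for name, value in points.items())
    packed = {
        "offset_ms": [offset for offset, _ in pairs],
        "value": [value for _, value in pairs],
    }
    # A single cell in a second, hypothetical family now holds the whole window.
    return {"points": {}, "packed": {"window": json.dumps(packed)}}


# Made-up open window with three samples written point-by-point.
open_row = {"points": {"250": 23, "0": 17, "1": 19}, "packed": {}}
closed_row = restate_window(open_row)
```

Writes stay cheap while the window is open, and reads of a closed window touch one cell instead of one cell per sample.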


Compression Results

Each sample is a 64-bit time-stamp plus a 16-bit value

Sampling rate is 10 kHz

Sample time jitter makes it important to keep the original time-stamp

How much overhead does retaining the time-stamp add?
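The uncompressed starting point follows from the figures above, and a small sketch shows one way a jittered time-stamp can be kept losslessly as a deviation from the nominal sampling grid. The microsecond units, the sample times and the `encode_times` / `decode_times` helpers are assumptions for illustration; this is not the encoding behind the deck's results and makes no claim about the compression actually achieved.

```python
TIMESTAMP_BITS = 64
SAMPLE_BITS = 16
RATE_HZ = 10_000

# Raw cost before any compression.
bytes_per_point = (TIMESTAMP_BITS + SAMPLE_BITS) // 8
raw_bytes_per_second = bytes_per_point * RATE_HZ
timestamp_share = TIMESTAMP_BITS / (TIMESTAMP_BITS + SAMPLE_BITS)


def encode_times(times_us: list[int], period_us: int) -> tuple[int, list[int]]:
    """Keep the first time-stamp plus each sample's deviation from the nominal grid."""
    start = times_us[0]
    jitter = [t - (start + i * period_us) for i, t in enumerate(times_us)]
    return start, jitter


def decode_times(start: int, jitter: list[int], period_us: int) -> list[int]:
    """Rebuild the exact original time-stamps."""
    return [start + i * period_us + j for i, j in enumerate(jitter)]


period_us = 1_000_000 // RATE_HZ  # 100 microseconds between samples at 10 kHz
times = [1_000_000, 1_000_101, 1_000_199, 1_000_300]  # made-up jittered time-stamps
start, jitter = encode_times(times, period_us)
restored = decode_times(start, jitter, period_us)
```

Stored naively, the time-stamp is 8 of every 10 bytes. The deviations are small integers, which is why a columnar layout of them has room to compress well while still preserving the original time-stamps.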


A second example: Music meta-data


MusicBrainz on NoSQL
• Artists, albums, tracks and labels are key objects
• Reality check:
  – Add works (compositions), recordings, release, release group
  – 7 tables for artist alone
  – 12 for place, 7 for label, 17 for release/group, 8 for work
  – (but only 4 for recording!)
  – Total of 12 + 7 + 17 + 8 + 4 = 48 tables
• But wait, there’s more!
  – 10 annotation tables, 10 edit tables, 19 tag tables, 5 rating tables, 86 link tables, 5 cover art tables and 3 tables for CD timing info (138 total)
  – And 50 more tables that aren’t documented yet
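The tallies can be checked with a short sketch. The dictionary keys are just labels chosen here, and the first sum follows the slide's own stated addition (12 + 7 + 17 + 8 + 4), which does not count the 7 artist tables separately.

```python
# Counts copied from the slide; the keys are only labels.
core = {"place": 12, "label": 7, "release/group": 17, "work": 8, "recording": 4}
supporting = {
    "annotation": 10,
    "edit": 10,
    "tag": 19,
    "rating": 5,
    "link": 86,
    "cover art": 5,
    "CD timing": 3,
}
undocumented = 50

core_total = sum(core.values())
supporting_total = sum(supporting.values())
grand_total = core_total + supporting_total + undocumented
```

The three groups add up to 48, 138 and 50, for 236 tables in all.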


180 tables not shown


236 tables to describe 7 kinds of things


Can we do better?


artist
• id
• gid
• name
• sort_name
• begin_date
• end_date
• ended
• type
• gender
• area
• begin_area
• end_area
• comment
• list<ipi>
• list<isni>
• list<alias>
  – { name, begin_date, end_date }
• list<release_id>
• list<recording_id>


{id, recording_id, name, list<credit>, length}

recording
• id
• gid
• list<credit>
• name
• list<track_ref>

{id, format, name, list<track>}

release_group
• id
• gid
• name
• list<credit>
• type
• list<release_id>


27 tables reduce to 4 so far


Further Reductions
• All 86 link tables become properties on artists, releases and other entities
• All 44 tag, rating and annotation tables become list properties
• All 5 cover art tables become lists of file references
• Current score: 162 tables become 4
• You get the idea
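The running score works out as follows. This is only a check of the slide's arithmetic, with labels chosen here for readability.

```python
# Table counts folded away so far, as stated on the slides.
folded = {
    "entity tables already merged": 27,
    "link": 86,
    "tag, rating and annotation": 44,
    "cover art": 5,
}
tables_before = sum(folded.values())
tables_after = 4
reduction_factor = tables_before / tables_after
```

That is 162 tables replaced by 4, a reduction of roughly 40x.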


Is This Good?
• Expressivity
  – The JSON data model is at least as expressive as the original relational model
  – Many cases easier to describe in nested data
  – No cases are harder
• Efficiency
  – Inlining can increase data size. Locality improves, however
  – Sessionizing can substantially decrease data size
  – Inlining back-references is more efficient than ordinary indexes
  – Inlined columnar data allows 1000x speedup for time series
• Introspection (you decide)


But How Can We Query This?
• Can’t use SQL
  – SQL is strongly typed
  – SQL is heavily tied into the original relational model
  – SQL generating tools require relational model
• Must use SQL
  – Vast numbers of tools and people understand how to write SQL
  – SQL is the lingua franca of databases


Squaring the Circle
• Enter Apache Drill
• Drill is SQL compliant
  – Uses standard syntax and semantics
• Drill extends SQL
  – First class treatment of objects, lists
  – Full support for destructuring, flattening
  – Full power of relational model can be applied to complex data


Drill Provides Scalable and Extended SQL


Sample Query
• Find Elvis

select distinct id, name, alias
from (
  -- one output row per alias; name is projected so the outer query can select it
  select id, name, flatten(alias.name) alias
  from artist
)
where alias like 'Elvis%Presley'
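What flattening does to nested rows can be imitated in plain Python. This sketch is not Drill and makes no claim about Drill's internals: the `flatten` and `like` helpers are hypothetical stand-ins, and the three artist rows are made-up placeholders.

```python
import re
from typing import Any, Iterator


def flatten(rows: list[dict[str, Any]], column: str) -> Iterator[dict[str, Any]]:
    """Emit one output row per element of the list stored in `column`."""
    for row in rows:
        for element in row.get(column, []):
            yield {**row, column: element}


def like(text: str, pattern: str) -> bool:
    """Minimal SQL LIKE: % matches any run of characters, _ matches one."""
    regex = "".join(
        ".*" if ch == "%" else "." if ch == "_" else re.escape(ch) for ch in pattern
    )
    return re.fullmatch(regex, text, flags=re.DOTALL) is not None


# Made-up placeholder rows shaped like the nested artist table.
artists = [
    {"id": 1, "name": "Artist One", "alias": [{"name": "Elvis Presley"}, {"name": "The King"}]},
    {"id": 2, "name": "Artist Two", "alias": [{"name": "Someone Else"}]},
    {"id": 3, "name": "Artist Three", "alias": []},
]

# Equivalent of: select distinct id, name, alias ... where alias like 'Elvis%Presley'
matches = sorted(
    {
        (row["id"], row["name"], row["alias"]["name"])
        for row in flatten(artists, "alias")
        if like(row["alias"]["name"], "Elvis%Presley")
    }
)
```

In this sketch an artist with an empty alias list produces no flattened rows, so it cannot match the filter.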


Example Query
• Find discs where Elvis was credited

select distinct album_id, name
from (
  -- assumes each entry in credit is an artist id
  select id album_id, name, flatten(credit) artist_id
  from release
) albums
join (
  select distinct artist_id
  from (
    select id artist_id, flatten(alias) alias
    from artist
    where name like 'Elvis%Presley'
  )
) artists
using (artist_id)


Summary
• Extended relational model allows massive simplification
  – On a real example, we see >20x reduction in number of tables
• Simplification drives improved introspection
  – This is good
• Apache Drill gives very high performance execution for extended relational problems
• You can try this out today


Thank You!


Q & A

@tgrall

Engage with us!
• MapR
• maprtech
• mapr-technologies