HBase and Drill How loosely typed SQL is ideal for NoSQL...HBase and Drill How loosely typed SQL is...

HBase and Drill How loosely typed SQL is ideal for NoSQL Tugdual Grall @tgrall

Sept. 30, 2015

{“about” : “me”}Tugdual “Tug” Grall • MapR

• Technical Evangelist • MongoDB

• Technical Evangelist • Couchbase

• Technical Evangelist • eXo

• CTO • Oracle

• Developer/Product Manager • Mainly Java/SOA

• Developer in consulting firms

• Web • @tgrall • http://tgrall.github.io • tgrall

• NantesJUG co-founder

• Pet Project : • http://www.resultri.com

• tug@mapr.com • tugdual@gmail.com

Agenda• What does good mean? • What do we mean by loose typing? • Examples of what you can do • Real database with 10-20x fewer tables • Looking forward • Questions

What Does Good Mean (for a DB)?• Expressive

– Must express the concepts we need

• Efficient – Must run fast enough on cheap enough hardware

What Does Good Mean (for a DB)?• Expressive

– Must express the concepts we need

• Efficient – Must run fast enough on cheap enough hardware

• Introspectable – Must be able to inspect the data and schema and gain understanding

What is New Here• Introspection is better when

– A minimum of data entities are used to describe our model – No name overflow – Referential scoping helps narrow our focus to a simpler problem – Many-to-one relations can in-lined

• Introspection was not a goal for the design of the relational model • Introspection was therefore not a result either

Older than Dirt• Relational theory is old (1970)

– Pre-dates data structures – Predates mainstream recursive procedures – Predates lexical scoping – Predates logic programming – Predates real functional programming (Church, McCarthy, Iverson,

Backus and not-withstanding)

• Some updates are in order to enhance introspection

Contrast Relational and HBase Style noSQLRelational

• Rows containing fields • Fields contain primitive types • Structure is fixed and uniform • Structure is pre-defined • Referential integrity (optional)

• Expressions over sets of rows

HBase / MapR DB • Rows contain fields • Fields bytes • Structure is flexible • No pre-defined structure • Single key • Column families • Timestamps • Versions

Contrast relational and HBase with StructuringRelational

• Rows containing fields • Fields contain primitive types • Structure is fixed and uniform • Structure is pre-defined • Referential integrity (optional)

• Expressions over sets of rows

HBase + Structuring • Rows contain fields • Fields contain primitive types

– Or objects, or lists • Structure is flexible, ragged • No pre-defined structure • Single key

Turtle Models for Databases• Allows complex objects in field values

– JSON style lists and objects • Allow references to objects via join

– Includes references localized within lists • Lists of objects and objects of lists are isomorphic to tables so …

• Complex data in tables, • But also tables in complex data, • Even tables containing complex data containing tables

A Catalog of noSQL Idioms

Tables as Objects, Objects as Tables

c1 c2 c3

Row-wise form

c1 c2 c3

Column-wise form

[ { c1:v1, c2:v2, c3:v3 }, { c1:v1, c2:v2, c3:v3 }, { c1:v1, c2:v2, c3:v3 } ]

List of objects

{ c1:[v1, v2, v3], c2:[v1, v2, v3], c3:[v1, v2, v3] }

Object containing lists

c1 c2 c3

Micro Columnar Formats

An entire table stored in columnar form can be a

first-class value using these techniques

This is very powerful for in-lining one-to-many relations.

Note• If embedded tables are first-class, schema becomes data

• If schema is data-driven when embedded, constructs that elevate tables to top-level are impossible

• Thus, embedded first-class objects implies late discovery of schema information

A first example: Time-series data

Column names as data• When column names are not pre-defined, they can convey

information

• Examples – Time offsets within a window for time series – Top-level domains for web crawlers – Vendor id’s for customer purchase profiles

• Predefined schema is impossible for this idiom

Relational Model for Time-series

Table Design: Point-by-Point

Table Design: Hybrid Point-by-Point + Sub-table

After close of window, data in row is restated as column-oriented tabular value in different column family.

Compression Results

Samples are 64b time, 16 bit sample

Sample time at 10kHz

Sample time jitter makes it important to keep original time-stamp

How much overhead to retain time-stamp?

A second example: Music meta-data

MusicBrainz on NoSQL• Artists, albums, tracks and labels are key objects • Reality check:

– Add works (compositions), recordings, release, release group • 7 tables for artist alone • 12 for place, 7 for label, 17 for release/group, 8 for work

– (but only 4 for recording!) – Total of 12 + 7 + 17 + 8 + 4 = 48 tables

• But wait, there’s more! – 10 annotation tables, 10 edit tables, 19 tag tables, 5 rating tables, 86 link

tables, 5 cover art tables and 3 tables for CD timing info (138 total) – And 50 more tables that aren’t documented yet

180 tablesnot shown

236 tables to describe 7 kinds of things

Can we do better?

artist

idgidnamesort_namebegin_dateend_dateendedtypegenderareabeing_areaend_areacommentlist<ipi>list<isni>list<alias>list<release_id>list<recording_id>

artist

idgidnamesort_namebegin_dateend_dateendedtypegenderareabeing_areaend_areacommentlist<ipi>list<isni>list<alias>

artist

idgidnamesort_namebegin_dateend_dateendedtypegenderareabeing_areaend_areacommentlist<ipi>list<isni>list<alias>list<release_id>list<recording_id>

artist

idgidnamesort_namebegin_dateend_dateendedtypegenderareabegin_areaend_areacommentlist<ipi>list<isni>list<alias>list<release_id>list<recording_id>

{ name, begin_date, end_date }

{id, recording_id, name, list<credit>length}

recordingidgidlist<credit>namelist<track_ref>

{id, format, name, list<track>}

release_groupidgidnamelist<credit>typelist<release_id>

27 tables reduce to 4

27 tables reduce to 4 so far

Further Reductions• All 86 link tables become properties on artists, releases and other

entities • All 44 tag, rating and annotation tables become list properties • All 5 cover art tables become lists of file references

• Current score: 162 tables become 4

• You get the idea

Is This Good?• Expressivity

– The JSON data model is at least as expressive as the original relational model

• Many cases easier to describe in nested data • No cases are harder

• Efficiency – Inlining can increase data size. Locality improves, however – Sessionizing can substantially decrease data size – Inlining back-references is more efficient than ordinary indexes – Inlined columnar data allows 1000x speedup for time series

• Introspection (you decide)

But How Can We Query This?• Can’t use SQL

– SQL is strongly typed – SQL is heavily tied into the original relational model – SQL generating tools require relational model

• Must use SQL – Vast numbers of tools and people understand how to write SQL – SQL is the lingua franca of databases

Squaring the Circle• Enter Apache Drill

• Drill is SQL compliant – Uses standard syntax and semantics

• Drill extends SQL – First class treatment of objects, lists – Full support for destructuring, flattening – Full power of relational model can be applied to complex data

Drill Provides Scalable and Extended SQL

Sample Query• Find Elvis

select distinct id, name, alias from ( select id, flatten(alias.name) alias from artist where alias like 'Elvis%Presley' )

Example Query• Find discs where Elvis was credited

select distinct album_id, name from ( select id album_id, name, flatten(credit) from release ) albums join ( select distinct artist_id from ( select id artist_id, flatten(alias) from artist where name like 'Elvis%Presley’ ) ) artists using artist_id

Summary• Extended relational model allows massive simplification

– On a real example, we see >20x reduction in number of tables

• Simplification drives improved introspection – This is good

• Apache Drill gives very high performance execution for extended relational problems

• You can try this out today

Thank You!

Q & A@tgrall maprtech

tug@mapr.com

Engage with us!

maprtech

mapr-technologies

HBase and Drill How loosely typed SQL is ideal for NoSQL...HBase and Drill How loosely typed SQL is...

Documents

hbase - arif.works

HBase internals

Apache HBase 0 - files.meetup.comfiles.meetup.com/1350427/Apache HBase 0.98.v2.pdfEndpoint EXEC Grants (HBASE-6104) • HBase ACLs can grant a familiar set of privileges to users (and

HBase inside

PHP, MVC, SQL - d2o9nyf4hwsci4.cloudfront.netd2o9nyf4hwsci4.cloudfront.net/2013/fall/quizzes/1/zamyla.pdf · PHP variables! start with $! loosely typed! That doesn’t mean that there

Introduction to Hbase. Agenda What is Hbase About RDBMS Overview of Hbase Why Hbase instead of RDBMS Architecture of Hbase Hbase interface

MyLife with HBase or HBase three flavors

Lesson 4 Typed Arithmetic Typed Lambda Calculus

HBase Snapshots

Faster HBase queries - events.static.linuxfound.org · Faster HBase queries Introducing hindex – Secondary indexes for HBase Rajeshbabu Chintaguntla rajeshbabu@apache.org ApacheCon

HBASE Overview

HBase Basic

HBase + Hue - LA HBase User Group

HBase @ Stumbleupon

Toward Loosely Loosely Coupled Programming …iraicu/research/presentations/2008_SC08...Toward Loosely Loosely Coupled Programming Coupled Programming on Petascale Systems Ioan Raicu

HBase ArcheTypes

hadoop developer - SevenMentor · 2021. 2. 17. · D. HBASE: Introduction to HBASE Basic Configurations of HBASE Fundamentals of HBase What is NoSQL? HBase Data Model Table and Row

HBase Intro

HBase operations

Hbase Tutorial