DESCRIPTION
The Kiji Project is a modular, open-source framework that enables developers to efficiently build real-time Big Data applications. Kiji is built upon popular open-source technologies such as Cassandra, HBase, Hadoop, and Scalding, and contains components that implement functionality critical for Big Data applications, including the following:
• Support for evolvable schemas of complex data types
• Batch training of machine learning models with Hadoop
• Real-time scoring with trained models
• Integration with Hive and R
• A REST endpoint
Recently, we have updated Kiji to use Cassandra as a backing data store (previously, Kiji worked only with HBase). In this talk, we describe the process of integrating Cassandra and Kiji. Topics we cover include the following:
• The Kiji architecture and data model
• Implementing the Kiji data model in Cassandra using the Java driver and CQL3
• Integrating Cassandra with Hadoop 2.x
• Building a flexible middleware platform that supports Cassandra and HBase (including projects that use both simultaneously)
• Exposing unique features of Cassandra (e.g., variable consistency) to Kiji users
Building a Flexible, Real-time Big Data Applications Platform
on Cassandra with Kiji
Cassandra Day Silicon Valley, 07 April 2014
Clint Kelly, Member of Technical Staff, WibiData
Have this...
Want to build this...
Kiji
History of the Kiji Project
• Created at WibiData
• Originally built on top of HBase
• Now works with Cassandra
One data model for two databases ➔ challenges!
Overview
• The Kiji Project
• Kiji data model
• Kiji on Cassandra
The Kiji Project
Components of Kiji
Batch
Data storage
Real-time
Kiji real-time components
• Score models
• Manage models
• REST interface
[Diagram: Kiji components (KijiSchema, KijiMR, KijiREST, KijiHive, KijiScoring, KijiExpress) and the underlying technologies (Hadoop, C*, HBase, Avro)]
Kiji batch components
• Expressive DSL
• Machine learning library
• Hive
KijiSchema
• Serialization
• Complex data types
• Schema management
record UserLog {
  long timestamp;
  int user_id;
  string url;
  long session_id;
  array<string> terms;
}
KijiSchema
• Initially HBase-only
• Now HBase and Cassandra
In production now
• Fortune 500 retailer: Personalized recommendations
• OPower: Energy usage and analytics reporting
Kiji data model
[Diagram: a table made up of rows; each row has an entity ID and data]
Row key = entity ID
[Diagram: a row with entity ID (0xfa, “bob”) and its data]
Composite entity IDs
[Diagram: the row for entity ID (0xfa, “bob”) with column families info and songs]
Data organized into column families
Example columns for entity ID (0xfa, “bob”):
  info:email
  info:payment
  songs:let it be
  songs:help
  songs:helter skelter
Column families contain columns
Column name = qualifier
[Diagram: the same row, with the songs:let it be column drawn as a stack of versions, one labeled with the timestamp 1396560123]
Columns can have timestamped versions
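The data model described so far (entity ID, column family, qualifier, timestamped versions) can be sketched as a small in-memory structure. This is only an illustration in plain Python, not the Kiji API: the class name `KijiRowSketch`, its methods, and the sample ratings and timestamps are all hypothetical.

```python
from collections import defaultdict


class KijiRowSketch:
    """Toy model of one Kiji row (hypothetical, not the Kiji API)."""

    def __init__(self, entity_id):
        self.entity_id = entity_id
        # (family, qualifier) -> {timestamp: value}
        self.cells = defaultdict(dict)

    def put(self, family, qualifier, timestamp, value):
        """Write one timestamped version of a column."""
        self.cells[(family, qualifier)][timestamp] = value

    def get(self, family, qualifier, max_versions=1):
        """Return up to max_versions (timestamp, value) pairs, newest first."""
        versions = self.cells.get((family, qualifier), {})
        newest_first = sorted(versions.items(), key=lambda tv: tv[0], reverse=True)
        return newest_first[:max_versions]

    def qualifiers(self, family):
        """List the column names (qualifiers) present in one family."""
        return sorted(q for (f, q) in self.cells if f == family)


# Composite entity ID, as on the slides: (0xfa, "bob").
row = KijiRowSketch(entity_id=(0xFA, "bob"))
row.put("info", "email", 1396560000, "bob@gmail.com")
# Values can be complex records; these field values are made up.
row.put("songs", "let it be", 1396560123, {"song_id": 1, "user_rating": 5})
row.put("songs", "let it be", 1396560999, {"song_id": 1, "user_rating": 4})
row.put("songs", "help", 1396560500, {"song_id": 2, "user_rating": 3})

latest_play = row.get("songs", "let it be")
all_plays = row.get("songs", "let it be", max_versions=10)
```

Reading with `max_versions=1` returns only the most recent version of a column, which is the common case for real-time lookups.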
Column values can be complex data types
record SongPlay {
  long song_id;
  int user_rating;
  long session_id;
  device_type device;
}
Locality groups
Arrange data based on query pattern
• info: Need only one version of each column.
  MAX_VERSIONS=1, TTL=FOREVER
• songs_today: Need ASAP for real-time scoring; expires quickly.
  MAX_VERSIONS=INFINITE, TTL=”1 DAY”, CACHED
• songs_prev_year: Used for training ML algorithms in batch; keep forever.
  MAX_VERSIONS=INFINITE, TTL=FOREVER, COMPRESSED
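As a rough illustration of what the MAX_VERSIONS and TTL settings mean for the data a locality group keeps, here is a minimal Python sketch. It is not Kiji code: `LocalityGroupSketch` and `retained` are hypothetical names, `None` stands in for INFINITE and FOREVER, and the CACHED and COMPRESSED storage flags are ignored because they do not change which versions are visible.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LocalityGroupSketch:
    """Hypothetical stand-in for a locality group's retention settings."""
    name: str
    max_versions: Optional[int]  # None stands in for INFINITE
    ttl_seconds: Optional[int]   # None stands in for FOREVER


def retained(versions, group, now):
    """Return the (timestamp, value) pairs the group would keep, newest first."""
    live = [
        (ts, value)
        for ts, value in versions
        if group.ttl_seconds is None or now - ts < group.ttl_seconds
    ]
    live.sort(key=lambda tv: tv[0], reverse=True)
    if group.max_versions is not None:
        live = live[: group.max_versions]
    return live


DAY = 24 * 60 * 60
INFO = LocalityGroupSketch("info", max_versions=1, ttl_seconds=None)
SONGS_TODAY = LocalityGroupSketch("songs_today", max_versions=None, ttl_seconds=DAY)
SONGS_PREV_YEAR = LocalityGroupSketch("songs_prev_year", max_versions=None, ttl_seconds=None)

now = 1396560123
# Three made-up versions of one column: two recent, one two days old.
plays = [(now - 10, "a"), (now - 2 * DAY, "b"), (now - 100, "c")]

kept_info = retained(plays, INFO, now)
kept_today = retained(plays, SONGS_TODAY, now)
kept_prev_year = retained(plays, SONGS_PREV_YEAR, now)
```

The same three writes yield one version under the info settings, two under songs_today, and all three under songs_prev_year.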
KijiSchema
• Similar to Cassandra, HBase, BigTable
• Originally based on HBase ➔ timestamped versions
• Logical and physical organization are separate
• Complex data types
Kiji on Cassandra
Locality groups ➔ Tables
Locality group ~ query
CREATE TABLE loc_grp...
Entity ID ➔ Primary key
CREATE TABLE loc_grp (userid bigint, user text,
PRIMARY KEY (userid, user) )
WITH CLUSTERING ORDER BY (user ASC);
Family, Qualifier, Version ➔ Clustering Columns
CREATE TABLE loc_grp (userid bigint, user text,
family text, qualifier text, version bigint,
PRIMARY KEY (userid, user, family, qualifier, version) )
WITH CLUSTERING ORDER BY (user ASC, family ASC, qualifier ASC, version DESC);
Column values ➔ Blobs
CREATE TABLE loc_grp (userid bigint, user text,
family text, qualifier text, version bigint, value blob,
PRIMARY KEY (userid, user, family, qualifier, version) )
WITH CLUSTERING ORDER BY (user ASC, family ASC, qualifier ASC, version DESC);
cqlsh:kiji_music> SELECT * FROM kiji_table_users;

 userid | user | family | qualifier      | timestamp | value
--------+------+--------+----------------+-----------+---------------
 123456 | bob  | songs  | abbey road     | 139656012 | 0x81274b31032
 123456 | bob  | songs  | help           | 139625013 | 0x7c13270f129
 123456 | bob  | songs  | help           | 139621359 | 0x2307ff10370
 123456 | bob  | songs  | help           | 139625013 | 0x45e1822a497
 123456 | bob  | songs  | helter skelter | 139621324 | 0x104bb974c34
Distinct Kiji column ➔ CQL row
Physical organization of data on disk
0xfa:bob:info:email:t0:bob@gmail.com
0xfa:bob:info:payment:t1:AMEX1234...
0xfa:bob:songs:let it be:t5:...
0xfa:bob:songs:let it be:t4:…
0xfa:bob:songs:let it be:t2:…
0xfa:bob:songs:help:t2:…
0xfa:bob:songs:helter skelter:t1:…
Efficient queries = continuous scans!
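To make the link between sort order and scan efficiency concrete, the following Python sketch orders cells the way the clustering order above specifies (family and qualifier ascending, version descending) and shows that any query fixing a prefix of the key maps to one contiguous slice. It is a model only: no Cassandra is involved, the helper names `sort_key` and `scan_prefix` are hypothetical, and the integer versions stand in for the t0..t5 timestamps on the slide.

```python
from bisect import bisect_left, bisect_right


def sort_key(cell):
    """(userid, user, family, qualifier) ascending, version descending."""
    userid, user, family, qualifier, version, _value = cell
    return (userid, user, family, qualifier, -version)


# Cells in arbitrary write order: (userid, user, family, qualifier, version, value).
cells = [
    (0xFA, "bob", "songs", "let it be", 2, b"..."),
    (0xFA, "bob", "info", "email", 0, b"bob@gmail.com"),
    (0xFA, "bob", "songs", "help", 2, b"..."),
    (0xFA, "bob", "songs", "let it be", 5, b"..."),
    (0xFA, "bob", "info", "payment", 1, b"AMEX1234..."),
    (0xFA, "bob", "songs", "helter skelter", 1, b"..."),
    (0xFA, "bob", "songs", "let it be", 4, b"..."),
]

# Simulated physical layout: everything sorted by the clustering order.
on_disk = sorted(cells, key=sort_key)
keys = [sort_key(c) for c in on_disk]


def scan_prefix(prefix):
    """Return the contiguous slice of on_disk whose key starts with prefix."""
    n = len(prefix)
    prefixes = [k[:n] for k in keys]  # still sorted, since keys is sorted
    lo = bisect_left(prefixes, prefix)
    hi = bisect_right(prefixes, prefix)
    return on_disk[lo:hi]


songs = scan_prefix((0xFA, "bob", "songs"))
let_it_be = scan_prefix((0xFA, "bob", "songs", "let it be"))
```

Because versions sort newest first within a qualifier, the most recent play of a song is always the first cell of its slice.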
Kiji queries ➔ CQL queries
Kiji queries can be complicated...
All data in “info” column family for “bob” ➔
SELECT qualifier, value FROM loc_grp_info WHERE userid=0xfa AND user='bob' AND family='info' LIMIT 1;
Data in “info:email” and last play of “help” for “bob” ➔
SELECT value FROM lg_music WHERE userid=0xfa AND user='bob' AND family='info' AND qualifier='email';
SELECT value FROM lg_music WHERE userid=0xfa AND user='bob' AND family='songs' AND qualifier='help' LIMIT 1;
All songs played by “bob” on April 2nd ➔
SELECT qualifier, value FROM lg_music WHERE userid=0xfa AND user='bob' AND family='songs' AND timestamp >= 1396396800 AND timestamp <= 1396483200 ALLOW FILTERING; 😱😱
Bad Request: PRIMARY KEY part timestamp cannot be restricted (preceding part qualifier is either not restricted or by a non-EQ relation)
Tricky queries
• Filter in CQL where possible
• Break up into multiple CQL queries
• Filter on the client
• Designing table layout is important
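Two of these workarounds can be sketched for the "all songs played on April 2nd" query, which CQL rejects because the timestamp is restricted while the qualifier is not. The Python below is a simulation over an in-memory list, not driver code: the function names and the sample rows are hypothetical, and each inner loop in the second function stands in for one CQL query with `qualifier = ?` plus a timestamp range.

```python
# (family, qualifier, version, value) for one entity ID, in clustering order.
# Timestamps and values are made up for illustration.
rows = [
    ("info", "email", 1396300000, "bob@gmail.com"),
    ("songs", "help", 1396400000, "play-3"),
    ("songs", "help", 1396300000, "play-1"),
    ("songs", "let it be", 1396480000, "play-4"),
    ("songs", "let it be", 1396390000, "play-2"),
]

# April 2nd, 2014 (UTC), the bounds used in the query on the slide.
START, END = 1396396800, 1396483200


def filter_on_client(rows, family, start, end):
    """Fetch the whole family in one scan, then drop out-of-range versions locally."""
    fetched = [r for r in rows if r[0] == family]  # stand-in for one CQL query
    return [r for r in fetched if start <= r[2] <= end]


def one_query_per_qualifier(rows, family, start, end):
    """Issue one range-restricted lookup per qualifier and concatenate the results."""
    qualifiers = sorted({q for f, q, _, _ in rows if f == family})
    results = []
    for qualifier in qualifiers:
        # Stand-in for: ... AND qualifier = ? AND timestamp >= ? AND timestamp <= ?
        results.extend(
            r for r in rows
            if r[0] == family and r[1] == qualifier and start <= r[2] <= end
        )
    return results


client_side = filter_on_client(rows, "songs", START, END)
multi_query = one_query_per_qualifier(rows, "songs", START, END)
```

Client-side filtering transfers more data but needs a single round trip; the per-qualifier approach keeps each query a contiguous range but requires knowing the qualifiers up front.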
MapReduce
• New InputFormat, OutputFormat
• Java driver
• Hadoop 2.x
• Multiple C* queries per RecordReader
Project status
Initial release in ~2 weeks
www.kiji.org/getstarted
Next quarter
• Cassandra in all Kiji components
• Expose Cassandra-specific features
• Kiji support in CQLSH
Thanks to Cassandra community
• Great help on mailing lists for users, dev, java driver
• Webinars, meetups, C* Summit all available online
• Free training from DataStax
• Very easy to get up-to-speed
• Thanks to hosts and organizers today
Try it now — Kiji Bento Box
• Latest compatible versions of all components
• Hadoop, ZooKeeper, HBase
• Cassandra in ~2 weeks
www.kiji.org/getstarted
http://jobs.wibidata.com/