jbellis
Brisk: More Powerful Hadoop Powered by [email protected]
Monday, July 25, 2011
The evolution of Analytics
✤ Analytics + Realtime
✤ Analytics, Realtime: replication
✤ ETL
Brisk re-unifies realtime and analytics
The Traditional Hadoop Stack
Master Nodes
Name Node
Secondary Name Node
Job Tracker
ZooKeeper
MetaStore
Slave Nodes
Data Node
Task Tracker
Region Server
HBase Master
Pig
Hive
Region Server
Client Nodes
Brisk Architecture
Brisk Highlights
✤ Easy to deploy and operate
✤ No single points of failure
✤ Scale and change nodes with no downtime
✤ Cross-DC, multi-master clusters
✤ Allocate resources for OLAP vs OLTP
✤ With no ETL
Cassandra data model
✤ ColumnFamilies contain rows + columns
✤ (Not really schemaless for a while now)

row key   password  name              site
zznate    *         Nate McCall
driftx    *         Brandon Williams
jbellis   *         Jonathan Ellis    datastax.com
Sparse
zznate:  password = *, name = Nate McCall
driftx:  password = *, name = Brandon Williams
jbellis: password = *, name = Jonathan Ellis, site = datastax.com
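One way to picture the sparse rows on this slide is a map from row key to a per-row column map, where each row carries only the columns it actually has. The sketch below uses plain Python dicts as a stand-in for a ColumnFamily; the helper `get_column` is hypothetical and is not part of any Cassandra client API.

```python
# Plain dicts standing in for a ColumnFamily: row key -> {column name: value}.
# Illustration only; this is not a Cassandra client API.
users = {
    "zznate": {"password": "*", "name": "Nate McCall"},
    "driftx": {"password": "*", "name": "Brandon Williams"},
    "jbellis": {"password": "*", "name": "Jonathan Ellis", "site": "datastax.com"},
}


def get_column(cf, row_key, column, default=None):
    """Return one column of one row, or `default` if the row lacks it."""
    return cf.get(row_key, {}).get(column, default)


# Only jbellis stores a "site" column; the other rows simply omit it.
rows_with_site = [key for key, cols in users.items() if "site" in cols]
```

A row without a column costs nothing for that column, which is the point of "sparse": there is no NULL placeholder to store.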
Rows as containers / materialized views
circle1: driftx, thobbs, pcmanus, jbellis, zznate
circle2: xedin, mdennis
circle3: xedin, pcmanus, ymorishita
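The "row as container" idea can be sketched as follows: the row key names the container (a circle) and the column names themselves are the data (member ids), so reading one row answers the whole query. This is a minimal Python illustration under the assumption that columns are kept sorted by name; `add_member` and `members` are hypothetical helpers, not a client API.

```python
import bisect

# Row key = circle, column names = member ids. Column values are unused
# here, so a sorted list of names is enough for the illustration.
circles = {
    "circle1": ["driftx", "jbellis", "pcmanus", "thobbs", "zznate"],
    "circle2": ["mdennis", "xedin"],
    "circle3": ["pcmanus", "xedin", "ymorishita"],
}


def add_member(cf, circle, member):
    """Insert a member as a new column, keeping column names sorted and unique."""
    columns = cf.setdefault(circle, [])
    pos = bisect.bisect_left(columns, member)
    if pos == len(columns) or columns[pos] != member:
        columns.insert(pos, member)


def members(cf, circle):
    """Reading the whole row answers 'who is in this circle?' in one lookup."""
    return list(cf.get(circle, []))
```

Adding the same member twice leaves the row unchanged, which mirrors how rewriting an existing column name does not create a duplicate column.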
CassandraFS
✤ data stored as ByteBuffer internally -- excellent fit for blocks
✤ local reads mmap data directly (no RPC)
✤ blocks are compressed with Google Snappy
✤ hadoop distcp hdfs:///mydata cfs:///mydata
Hive support
✤ Hive MetaStore in Cassandra
✤ Unified schema view from any node, with no external systems and no SPOF
✤ Automatically maps Cassandra column families to Hive tables
✤ Supports static and dynamic column families (and supercolumns)
Hive: CFS and ColumnFamilies
-- Ordinary Hive table, loaded from a local file
CREATE TABLE users (name STRING, zip INT);
LOAD DATA LOCAL INPATH 'kv2.txt' OVERWRITE INTO TABLE users;

-- Static column family: one Hive column per Cassandra column
CREATE EXTERNAL TABLE Keyspace1.Users (name STRING, zip INT)
STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler';

-- Dynamic column family: one Hive row per (row key, column name, value)
CREATE EXTERNAL TABLE Keyspace1.Users
(row_key STRING, column_name STRING, value STRING)
STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler';
Pig Support
✤ With standard Cassandra:
$ export PIG_HOME=/path/to/pig
$ export PIG_INITIAL_ADDRESS=localhost
$ export PIG_RPC_PORT=9160
$ export PIG_PARTITIONER=org.apache.cassandra.dht.RandomPartitioner
$ contrib/pig/bin/pig_cassandra
grunt>
✤ With Brisk:
$ bin/brisk pig
grunt>
Pig: CFS and ColumnFamilies
grunt> data = LOAD 'cfs:///example.txt' using PigStorage() as (name:chararray, value:long);
grunt> data = LOAD 'cassandra://Demo1/Scores' using CassandraStorage() AS (key, columns: {T: tuple(name, value)});
grunt> data = LOAD 'cassandra://Demo1/Scores&slice_start=M&slice_end=S' using CassandraStorage() AS (key, columns: {T: tuple(name, value)});
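The `slice_start` / `slice_end` parameters in the last LOAD restrict each row to a contiguous range of column names. A rough Python sketch of that idea follows. It assumes columns are sorted by name and that both bounds are inclusive; the function `slice_columns` and the sample row are hypothetical and only illustrate the range selection, not how CassandraStorage is implemented.

```python
import bisect


def slice_columns(columns, slice_start, slice_end):
    """Return (name, value) pairs with slice_start <= name <= slice_end.

    `columns` must already be sorted by name. Treating both bounds as
    inclusive is an assumption of this sketch.
    """
    names = [name for name, _ in columns]
    lo = bisect.bisect_left(names, slice_start)
    hi = bisect.bisect_right(names, slice_end)
    return columns[lo:hi]


# Hypothetical row from Demo1/Scores: (column name, score), sorted by name.
row = [("Alice", 10), ("Mallory", 7), ("Nina", 12), ("Sam", 3), ("Zed", 9)]
selected = slice_columns(row, "M", "S")
```

With these bounds the slice keeps `Mallory` and `Nina`. `Sam` is left out because the string "Sam" sorts after "S", which is worth remembering when choosing an end bound.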
Data model: Realtime
Portfolios
Portfolio1: GOOG = 80, LNKD = 20, P = 40, AMZN = 100, AAPL = 20

StockHist
GOOG: 2011-01-01 = $79.85, 2011-01-02 = $75.23, 2011-01-03 = $82.11

LiveStocks
GOOG: last = $95.52
AAPL: last = $186.10
AMZN: last = $112.98
Data model: Analytics
HistLoss
row key     worst_date  loss
Portfolio1  2011-07-23  -$34.81
Portfolio2  2011-03-11  -$11432.24
Portfolio3  2011-05-21  -$1476.93
Data model: Analytics
10dayreturns
ticker  rdate       return
GOOG    2011-07-25  $8.23
GOOG    2011-07-24  $6.14
GOOG    2011-07-23  $7.78
AAPL    2011-07-25  $15.32
AAPL    2011-07-24  $12.68

INSERT OVERWRITE TABLE 10dayreturns
SELECT a.row_key ticker, b.column_name rdate, b.value - a.value
FROM StockHist a JOIN StockHist b
ON (a.row_key = b.row_key AND date_add(a.column_name, 10) = b.column_name);
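To make the self-join easier to follow, here is a small Python sketch of the same computation: for every price on some date, look up the price ten days later for the same ticker and subtract. The function name `ten_day_returns` and the sample prices (stored in cents to avoid floating-point noise) are assumptions made for illustration; they are not taken from the demo.

```python
from datetime import date, timedelta


def ten_day_returns(stock_hist, days=10):
    """Return sorted (ticker, rdate, ret) with ret = price(rdate) - price(rdate - days)."""
    rows = []
    for ticker, prices in stock_hist.items():
        for start, start_price in prices.items():
            end = start + timedelta(days=days)
            if end in prices:  # join condition: date_add(a.column_name, 10) = b.column_name
                rows.append((ticker, end, prices[end] - start_price))
    return sorted(rows)


# Hypothetical prices in cents, keyed by ticker and then by date.
stock_hist = {
    "GOOG": {
        date(2011, 1, 1): 7985,
        date(2011, 1, 2): 7523,
        date(2011, 1, 11): 8808,
        date(2011, 1, 12): 8137,
    }
}
returns = ten_day_returns(stock_hist)
```

As in the Hive query, dates with no price exactly ten days earlier produce no output row, because the inner join drops them.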
StockHist
GOOG: 2011-01-01 = $79.85, 2011-01-02 = $75.23, 2011-01-03 = $82.11

row_key  column_name  value
GOOG     2011-01-01   $79.85
GOOG     2011-01-02   $75.23
GOOG     2011-01-03   $82.11
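The mapping on this slide, from one wide Cassandra row to several (row_key, column_name, value) Hive rows, can be sketched in a few lines of Python. The function `to_hive_rows` is a hypothetical helper that only illustrates the flattening; it is not the storage handler's code.

```python
def to_hive_rows(column_family):
    """Flatten {row_key: {column_name: value}} into (row_key, column_name, value) tuples."""
    return [
        (row_key, column_name, value)
        for row_key, columns in column_family.items()
        for column_name, value in sorted(columns.items())
    ]


# The GOOG row from StockHist: one column per trading date.
stock_hist = {
    "GOOG": {"2011-01-01": "$79.85", "2011-01-02": "$75.23", "2011-01-03": "$82.11"}
}
hive_rows = to_hive_rows(stock_hist)
```

Each column of the wide row becomes its own Hive row, which is what lets ordinary joins and GROUP BY work over dynamic column families.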
Data model: Analytics
portfolio_returns
portfolio   rdate       preturn
Portfolio1  2011-07-25  $118.21
Portfolio1  2011-07-24  $60.78
Portfolio1  2011-07-23  -$34.81
Portfolio2  2011-07-25  $2143.92
Portfolio3  2011-07-24  -$10.19

INSERT OVERWRITE TABLE portfolio_returns
SELECT row_key portfolio, rdate, SUM(b.return)
FROM portfolios a JOIN 10dayreturns b
ON (a.column_name = b.ticker)
GROUP BY row_key, rdate;
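The join-then-aggregate step can be pictured with the Python sketch below: match each holding to the per-ticker returns, then sum per (portfolio, rdate). The function name and the sample inputs are hypothetical, with returns given in cents. Like the query as written, the sketch sums the per-ticker returns without weighting them by the number of shares held.

```python
from collections import defaultdict


def portfolio_returns(portfolios, ten_day):
    """Join holdings to per-ticker returns on ticker, then sum per (portfolio, rdate)."""
    totals = defaultdict(int)
    for portfolio, holdings in portfolios.items():
        for ticker, rdate, ret in ten_day:
            if ticker in holdings:  # join condition: a.column_name = b.ticker
                totals[(portfolio, rdate)] += ret
    return dict(totals)


# Hypothetical inputs: holdings map ticker -> shares; returns are in cents.
portfolios = {"Portfolio1": {"GOOG": 80, "AAPL": 20}, "Portfolio2": {"AAPL": 5}}
ten_day = [
    ("GOOG", "2011-07-25", 823),
    ("AAPL", "2011-07-25", 1532),
    ("GOOG", "2011-07-24", 614),
]
result = portfolio_returns(portfolios, ten_day)
```

A portfolio with no matching ticker on a given date produces no entry for that date, as with the inner join in the query.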
Data model: Analytics
INSERT OVERWRITE TABLE HistLoss
SELECT a.portfolio, rdate, minp
FROM (
  SELECT portfolio, min(preturn) AS minp
  FROM portfolio_returns
  GROUP BY portfolio
) a JOIN portfolio_returns b
ON (a.portfolio = b.portfolio AND a.minp = b.preturn);

HistLoss
row key     worst_date  loss
Portfolio1  2011-07-23  -$34.81
Portfolio2  2011-03-11  -$11432.24
Portfolio3  2011-05-21  -$1476.93
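The HistLoss query finds each portfolio's minimum return and then joins back to recover the date it occurred on. A single pass can do the same thing, as in this Python sketch. The function `worst_losses` is a hypothetical helper, and the input is only the five sample rows shown on the portfolio_returns slide, so its output covers that sample rather than the full HistLoss table.

```python
def worst_losses(return_rows):
    """For each portfolio, return (rdate, preturn) of its minimum preturn."""
    worst = {}
    for portfolio, rdate, preturn in return_rows:
        if portfolio not in worst or preturn < worst[portfolio][1]:
            worst[portfolio] = (rdate, preturn)
    return worst


# Sample rows from the portfolio_returns slide: (portfolio, rdate, preturn).
rows = [
    ("Portfolio1", "2011-07-25", 118.21),
    ("Portfolio1", "2011-07-24", 60.78),
    ("Portfolio1", "2011-07-23", -34.81),
    ("Portfolio2", "2011-07-25", 2143.92),
    ("Portfolio3", "2011-07-24", -10.19),
]
hist_loss = worst_losses(rows)
```

One difference from the query: when two dates tie for the minimum, the join returns both rows, while this sketch keeps only the first one it sees.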
Portfolio Demo dataflow
Portfolios
Historical Prices
Intermediate Results
Largest loss
Web-based Portfolios
Live Prices for today
Largest loss
OpsCenter
Where to get it
✤ http://www.datastax.com/brisk