
Page 1: Physical Data Storage

Physical Data Storage

Stephen Dawson-Haggerty

Page 2: Physical Data Storage

[Overview diagram: physical data sources exposed through multiple sMAP feeds, storage components (StreamFS, Hadoop/HDFS), and applications: data exploration/visualization, control loops, demand response, analytics, mobile feedback, fault detection]

Page 3: Physical Data Storage

Time-Series Databases

• Expected workload
• Related work
• Server architecture
• API
• Performance
• Future directions

Page 4: Physical Data Storage

[Figure: Dent circuit meters exposed as sMAP sources]

Write Workload

• sMAP sources (see the sketch below)
  – HTTP/REST protocol for exposing physical information
  – Data trickles in as it is generated
  – Typical data rates: 1 reading every 1-60 s
• Bulk imports
  – Existing databases
  – Migrations
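To make the first bullet concrete, here is a minimal Python sketch of polling an sMAP-style HTTP/REST feed. The URL and the "Readings" field are illustrative assumptions for this sketch, not the actual sMAP schema these sources implement.

import json
import time
import urllib.request

SOURCE_URL = "http://meter.example.local/data/dent0"  # hypothetical sMAP-style feed

def poll_once(url):
    # Fetch the feed and pull out a list of (unix timestamp, value) pairs.
    # The "Readings" key is an assumption for this sketch.
    with urllib.request.urlopen(url) as resp:
        doc = json.loads(resp.read())
    return doc.get("Readings", [])

# Data trickles in as it is generated, so an importer simply polls at roughly
# the source's reporting rate (typically one reading every 1-60 s) and forwards
# each new point to the time-series store.
if __name__ == "__main__":
    for _ in range(3):                 # bounded loop for the sketch
        for ts, value in poll_once(SOURCE_URL):
            print(ts, value)
        time.sleep(60)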

Page 5: Physical Data Storage

Read Workload

• Plotting engine
• Matlab & Python adaptors for analysis
• Mobile apps
• Batch analysis

Dominated by range queries

Latency is important for interactive data exploration

Page 6: Physical Data Storage

[Server architecture diagram: the readingdb process exposes a time-series interface (insert, resample, aggregate, query) over a streaming RPC pipeline, with bucketing and compression layered above a key-value store (page cache, lock manager, storage allocation); a separate storage mapper component uses SQL/MySQL]

Page 7: Physical Data Storage

Time series interface

db_open()
db_query(streamid, start, end): query points in a range
db_next(streamid, ref), db_prev(...): query points near a reference time
db_add(streamid, vector): insert points into the database
db_avail(streamid): retrieve the storage map
db_close()

All data is part of a stream, identified only by its streamid

A stream is a series of tuples: (timestamp, sequence, value, min, max)
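A hedged usage sketch of the interface above, written against a hypothetical Python adaptor; the module name and return-value shapes are assumptions, while the call names and arguments follow the slide.

import readingdb as rdb   # hypothetical Python adaptor for the server

rdb.db_open()

# Insert a vector of (timestamp, sequence, value, min, max) tuples into stream 42.
rdb.db_add(42, [(1300000000, 0, 20.5, 20.5, 20.5),
                (1300000060, 1, 20.7, 20.7, 20.7)])

# Range query: every point of stream 42 in a one-hour window.
points = rdb.db_query(42, 1300000000, 1300003600)

# Points adjacent to a reference time, and the stream's storage map.
after = rdb.db_next(42, 1300000000)
before = rdb.db_prev(42, 1300003600)
print(rdb.db_avail(42))

rdb.db_close()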

Page 8: Physical Data Storage

Storage Manager: BDB

• Berkeley Database: embedded key-value store
• Stores binary blobs using B+ trees
• Very mature: around since 1992; supports transactions, free-threading, replication
• We use version 4
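For a feel of what the storage manager provides, a minimal sketch using the bsddb3 Python bindings for Berkeley DB (the server embeds the C library directly; the key and value bytes here are placeholders).

from bsddb3 import db  # Python bindings for Berkeley DB

# Open (or create) a B+ tree database; the real server also sets up an
# environment with transactions and locking, omitted here for brevity.
buckets = db.DB()
buckets.open("buckets.db", None, db.DB_BTREE, db.DB_CREATE)

# Values are opaque binary blobs (compressed buckets); byte keys sort
# lexicographically, which is what makes range scans cheap.
key = bytes.fromhex("0000002a" "4d6b2780")   # placeholder (streamid, timestamp) key
buckets.put(key, b"<compressed bucket bytes>")
print(buckets.get(key))

buckets.close()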

Page 9: Physical Data Storage

RPC Evolution

• First: shared memory
  – Low latency
• Move to threaded TCP
• Google protocol buffers
  – Zig-zag integer representation, multiple language bindings
  – Extensible for multiple versions
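Since the slide highlights protocol buffers' zig-zag integer representation, here is a small Python illustration of that mapping (the server uses generated protobuf code, not hand-written encoders like this).

def zigzag_encode(n, bits=64):
    # Interleave signed values so small magnitudes become small unsigned
    # numbers: 0, -1, 1, -2, 2 -> 0, 1, 2, 3, 4 (cheap to varint-encode).
    return (n << 1) ^ (n >> (bits - 1))

def zigzag_decode(z):
    return (z >> 1) ^ -(z & 1)

assert [zigzag_encode(v) for v in (0, -1, 1, -2, 2)] == [0, 1, 2, 3, 4]
assert all(zigzag_decode(zigzag_encode(v)) == v for v in range(-1000, 1000))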

Page 10: Physical Data Storage

On-Disk Format

• All data stores perform poorly with one key per reading
  – Index size is high
  – Unnecessary
• Solution: bucket readings
• Excellent locality of reference with B+ tree indexes
  – Data sorted by (streamid, timestamp)
  – Range queries translate into mostly large sequential IOs

[Diagram: readings grouped into buckets keyed by (streamid, timestamp)]
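A minimal sketch of the bucketing idea: readings fall into fixed-width time buckets, and each bucket is keyed by (streamid, bucket start) packed big-endian, so lexicographic B+ tree order matches (streamid, timestamp) order. The bucket width and exact key layout below are assumptions, not the project's actual on-disk format.

import struct

BUCKET_SECONDS = 3600  # assumed bucket width for this sketch

def bucket_key(streamid, timestamp):
    # Big-endian packing keeps byte-wise key order identical to numeric order,
    # so one stream's buckets are laid out contiguously in the B+ tree.
    start = timestamp - (timestamp % BUCKET_SECONDS)
    return struct.pack(">IQ", streamid, start)

# Readings from the same hour of the same stream share a bucket key,
# and all of stream 7 sorts before any of stream 8.
assert bucket_key(7, 1300000010) == bucket_key(7, 1300000900)
assert bucket_key(7, 0) < bucket_key(8, 0)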

Page 11: Physical Data Storage

On-Disk Format

• Represent in memory with a materialized structure: 32b/rec
  – Inefficient on disk: lots of repeated data, missing fields
• Solution: compression
  – First: delta-encode each bucket in a protocol buffer
  – Second: Huffman tree or run-length encoding (zlib)
• Combined compression is 2x better than gzip or either one alone
• 1M rec/second compress/decompress on modest hardware

[Diagram: buckets are compressed before being written to BDB pages]
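A rough Python sketch of the two-stage scheme above: delta-encode a bucket, then compress it with zlib. Fixed 16-byte records stand in for the protocol buffer used by the real server, purely for illustration.

import struct
import zlib

def compress_bucket(timestamps, values):
    # Delta-encode timestamps so regular sampling turns into long runs of
    # identical small deltas, which zlib then squeezes well.
    deltas = [timestamps[0]] + [b - a for a, b in zip(timestamps, timestamps[1:])]
    raw = b"".join(struct.pack(">qd", d, v) for d, v in zip(deltas, values))
    return zlib.compress(raw)

def decompress_bucket(blob):
    raw = zlib.decompress(blob)
    recs = [struct.unpack(">qd", raw[i:i + 16]) for i in range(0, len(raw), 16)]
    ts, vals, t = [], [], 0
    for d, v in recs:
        t += d
        ts.append(t)
        vals.append(v)
    return ts, vals

ts = list(range(1300000000, 1300003600, 60))   # one reading per minute
vals = [20.0] * len(ts)
blob = compress_bucket(ts, vals)
assert decompress_bucket(blob) == (ts, vals)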

Page 12: Physical Data Storage

Other Services: Storage Mapping

• What is in the database?
  – Compute a set of tuples (start, end, n)
  – The desired interpretation is "the data source was alive"
• Different data sources have different ways of maintaining this information and maintaining confidence
  – Sometimes you have to infer it from the data
  – Sometimes data sources give you liveness/presence guarantees: "I haven't heard from you in an hour, but I'm still alive!"

Dead or alive?
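To illustrate the "infer it from the data" case, a small sketch that collapses a stream's timestamps into (start, end, n) regions, declaring the source dead whenever a gap exceeds a threshold; both the threshold and the rule are assumptions, not the project's actual heuristic.

def storage_map(timestamps, max_gap=300):
    # Collapse sorted timestamps into (start, end, n) regions; a gap longer
    # than max_gap seconds is treated as the source having been dead.
    if not timestamps:
        return []
    regions = []
    start = prev = timestamps[0]
    n = 1
    for t in timestamps[1:]:
        if t - prev > max_gap:
            regions.append((start, prev, n))
            start, n = t, 0
        prev = t
        n += 1
    regions.append((start, prev, n))
    return regions

# A source that reported for two minutes, went silent, then came back:
assert storage_map([0, 60, 120, 7200, 7260]) == [(0, 120, 3), (7200, 7260, 2)]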

Page 13: Physical Data Storage

readingdb6

• Up since December, supporting Cory Hall, SDH Hall, and most other LoCal deployments
  – Behind www.openbms.org
• > 2 billion points in 10k streams
  – 12 GB on disk, ~5 B/rec including the index
  – So... we fit in memory!
• Imports at around 300k points/sec
  – We maxed out the NIC

Page 14: Physical Data Storage

Low Latency RPC

Page 15: Physical Data Storage

Compression ratios

Page 16: Physical Data Storage

Write load

Importing old data: 150k points/sec
Continuous write load: 300-500 pts/sec

Page 17: Physical Data Storage

Future thoughts

• A component of a cloud storage stack for physical data
• Hadoop adaptor: improve MapReduce performance over an HBase-based solution
• The data is small: 2 billion points in 12 GB
  – We can go a long time without distributing this very much
  – Distribution is probably necessary for reasons other than performance

Page 18: Physical Data Storage

THE END