(Distributed) (Structured) Storage Systems


(Distributed) (Structured) Storage Systems
Mark Feltner

Big Data

2.5 Petabytes/day: Wal-Mart's transaction database
40 Terabytes/second: CERN
1 Terabyte/day: NYSE Trading data
10 billion: Facebook photos

Overview

Theory
Algorithms
Implementations & Technology

Relational databases

ACID

Atomicity

All-or-nothing

Consistency

Data is always in a valid state

Isolation

Concurrent transactions leave the database in the same state as if they had been executed serially

Durability

After COMMIT, the transaction's effects are permanent and visible to all clients
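
The all-or-nothing property can be sketched with Python's built-in sqlite3 module. The accounts table, the account names, and the amounts are hypothetical and chosen only for illustration; this is a minimal sketch, not a description of any production system.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst atomically; return True on success."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # The CHECK constraint rejects a negative balance, which
            # aborts this statement and triggers the rollback above.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return {name: balance} for every account."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
)
conn.commit()
```

A transfer that would overdraw the source account fails on its second statement, and the credit already applied by the first statement is rolled back with it, so the balances are unchanged.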

Non-relational databases

Key-value

Document-oriented

Graphs
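
The three models can be compared by holding the same song in each shape. The sketch below uses plain Python containers rather than any real database API; the key format, the field names, the edge labels, and the helper functions `find` and `neighbors` are all assumptions made for illustration.

```python
# Key-value: an opaque value looked up by a single key.
kv_store = {"song:1": "Aces High|Iron Maiden|Powerslave|1984"}

# Document-oriented: the value is a structured document whose
# fields can be queried individually.
documents = [
    {"_id": 1, "title": "Aces High", "artist": "Iron Maiden", "year": 1984},
    {"_id": 2, "title": "Raining Blood", "artist": "Slayer", "year": 1986},
]


def find(docs, **criteria):
    """Return documents whose top-level fields match all criteria."""
    return [
        d for d in docs
        if all(d.get(k) == v for k, v in criteria.items())
    ]


# Graph: nodes connected by labeled edges; queries follow relationships.
edges = [
    ("Aces High", "PERFORMED_BY", "Iron Maiden"),
    ("Aces High", "APPEARS_ON", "Powerslave"),
    ("Powerslave", "RELEASED_BY", "Iron Maiden"),
]


def neighbors(node, label):
    """Follow outgoing edges with the given label from a node."""
    return [dst for src, lbl, dst in edges if src == node and lbl == label]
```

The key-value store can only fetch by key, the document store can filter on fields, and the graph answers questions by walking edges.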

Distributed Systems

Fallacies of Distributed Computing

1. The network is reliable.
2. Latency is zero.
3. Bandwidth is infinite.
4. The network is secure.
5. Topology doesn't change.
6. There is one administrator.
7. Transport cost is zero.
8. The network is homogeneous.

CAP Theorem

Consistency

“…there must exist a total order on all operations such that each operation looks as if it were completed at a single instant. This is equivalent to requiring requests of the distributed shared memory to act as if they were executing on a single node, responding to operations one at a time.” (Gilbert, Lynch)

Eventual consistency: a weaker guarantee in which replicas may disagree temporarily but converge once updates stop

Availability

“For a distributed system to be continuously available, every request received by a non-failing node in the system must result in a response” (Gilbert, Lynch)

Partition Tolerance

“In order to model partition tolerance, the network will be allowed to lose arbitrarily many messages sent from one node to another. When a network is partitioned, all messages sent from nodes in one component of the partition to nodes in another component are lost” (Gilbert, Lynch)

(CA || CP || AP) ?
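
One way to picture the AP choice is a pair of replicas that keep answering during a partition and reconcile afterwards. The sketch below is a hypothetical last-write-wins register, not the design of any system named in these slides; the `Replica` class, the caller-supplied logical timestamps, and the tie-break on replica name are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    """One node holding a last-write-wins register (hypothetical design)."""

    name: str
    value: object = None
    stamp: tuple = (0, "")  # (logical time, writer name); name breaks ties

    def write(self, value, time):
        # Always accepts the request, even when cut off from peers:
        # availability is kept at the cost of consistency.
        candidate = (time, self.name)
        if candidate > self.stamp:
            self.value, self.stamp = value, candidate

    def read(self):
        return self.value

    def merge(self, other):
        """Anti-entropy step: adopt the peer's value if it is newer."""
        if other.stamp > self.stamp:
            self.value, self.stamp = other.value, other.stamp


a, b = Replica("a"), Replica("b")

# During a partition neither write reaches the other node.
a.write("x=1", time=1)
b.write("x=2", time=2)
stale = a.read()  # a still answers, but with a value b has superseded

# The partition heals and the replicas exchange state.
a.merge(b)
b.merge(a)
```

While partitioned, the two replicas return different values, which violates the single-node illusion in the consistency quote above. After the merge both hold the newer write, which is the convergence that eventual consistency promises.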

Algorithms

Row- versus column-orientation

Title                 Artist        Album           Year
Breaking the Law      Judas Priest  British Steel   1980
Aces High             Iron Maiden   Powerslave      1984
Kickstart My Heart    Motley Crue   Dr. Feelgood    1989
Raining Blood         Slayer        Reign in Blood  1986
I Wanna Be Somebody   W.A.S.P.      W.A.S.P.        1984

Row-oriented data storage model:
Breaking the Law, Judas Priest, British Steel, 1980, Aces High, Iron Maiden, Powerslave, 1984, Kickstart My Heart, Motley Crue, Dr. Feelgood, 1989, Raining Blood, Slayer, Reign in Blood, 1986, I Wanna Be Somebody, W.A.S.P., W.A.S.P., 1984

Column-oriented data storage model:
Breaking the Law, Aces High, Kickstart My Heart, Raining Blood, I Wanna Be Somebody, Judas Priest, Iron Maiden, Motley Crue, Slayer, W.A.S.P., British Steel, Powerslave, Dr. Feelgood, Reign in Blood, W.A.S.P., 1980, 1984, 1989, 1986, 1984
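
The two layouts can be reproduced in a few lines of Python. This is a sketch using flat in-memory lists to stand in for on-disk storage; the `FIELDS` tuple and the helpers `read_record` and `read_column` are hypothetical names introduced for illustration.

```python
songs = [
    ("Breaking the Law", "Judas Priest", "British Steel", 1980),
    ("Aces High", "Iron Maiden", "Powerslave", 1984),
    ("Kickstart My Heart", "Motley Crue", "Dr. Feelgood", 1989),
    ("Raining Blood", "Slayer", "Reign in Blood", 1986),
    ("I Wanna Be Somebody", "W.A.S.P.", "W.A.S.P.", 1984),
]
FIELDS = ("title", "artist", "album", "year")

# Row-oriented: all fields of one record are adjacent.
row_store = [value for record in songs for value in record]

# Column-oriented: all values of one field are adjacent.
column_store = [value for column in zip(*songs) for value in column]


def read_record(i):
    """Row store: one record is a single contiguous slice."""
    width = len(FIELDS)
    return tuple(row_store[i * width:(i + 1) * width])


def read_column(name):
    """Column store: one field is a single contiguous slice."""
    n = len(songs)
    j = FIELDS.index(name)
    return column_store[j * n:(j + 1) * n]


# An aggregate touches only the values of the column it needs.
latest_year = max(read_column("year"))
```

Fetching a whole song is one slice of the row store, while an aggregate such as the latest year is one slice of the column store. Doing either operation against the other layout would require striding across the entire list.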

Comparison of Row- vs. Column-Orientation

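CREATE: row-orientation writes a whole record in one place
SELECT MAX, MIN, SUM, AVG, …: column-orientation reads only the column being aggregated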

MapReduce
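
The shape of a MapReduce job can be sketched in a single process: a map phase emits key/value pairs, a shuffle groups them by key, and a reduce phase folds each group. Real frameworks distribute these phases across many machines; the function names below are hypothetical and the example job (counting songs per release year) is chosen only for illustration.

```python
from collections import defaultdict
from functools import reduce


def map_phase(records, mapper):
    """Apply the mapper to each record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Fold each key's list of values into a single result."""
    return {key: reduce(reducer, values) for key, values in groups.items()}


def map_reduce(records, mapper, reducer):
    return reduce_phase(shuffle(map_phase(records, mapper)), reducer)


songs = [
    ("Breaking the Law", 1980),
    ("Aces High", 1984),
    ("Kickstart My Heart", 1989),
    ("Raining Blood", 1986),
    ("I Wanna Be Somebody", 1984),
]

# Emit (year, 1) for every song, then sum the ones for each year.
songs_per_year = map_reduce(
    songs,
    mapper=lambda song: [(song[1], 1)],
    reducer=lambda a, b: a + b,
)
```

Because the mapper sees one record at a time and the reducer sees one key at a time, both phases can be split across machines without changing the job's code.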

Technology

Implementations

BigTable

High performance
MapReduce
Powers: Google Reader, Maps, Book Search, YouTube, Gmail, …

Hadoop

MapReduce
Yahoo!
World Record Holder!

Cassandra

Key-value
MapReduce
Facebook
Eventual consistency
Scalable, fault-tolerant

MySQL

Relational
LAMP

Redis

Key-value
What it lacks in durability, it makes up for in speed and simplicity.

HBase

MapReduce
Hadoop + HDFS
Java and REST API
Column-oriented
Excellent fault-tolerance
Replication
Streaming

Neo4j

Graph Database

MongoDB

Document-oriented

Conclusions

Pick the right tool for the job.
