BigTable: A Distributed Storage System for Structured Data
Amir H. Payberah
amir@sics.se
Amirkabir University of Technology (Tehran Polytechnic)
Amir H. Payberah (Tehran Polytechnic) BigTable 1393/7/26 1 / 57


Motivation

• Lots of (semi-)structured data at Google.
    • URLs: contents, crawl metadata, links, anchors
    • Per-user data: user preference settings, recent queries, search results
    • Geographical locations: physical entities, e.g., shops, restaurants, roads

• Scale is large.
    • Billions of URLs, many versions per page, 20 KB per page
    • Hundreds of millions of users, thousands of queries per second, with a latency requirement
    • 100+ TB of satellite image data


Goals

• Need to support:
    • Very high read/write rates (millions of operations per second): Google Talk
    • Efficient scans over all or an interesting subset of the data.
    • Efficient joins of large 1-1 and 1-* datasets.

• Often want to examine data changes over time.
    • Contents of a web page over multiple crawls.


BigTable

• Distributed multi-level map

• Fault-tolerant, persistent

• Scalable
    • 1000s of servers
    • Terabytes of in-memory data
    • Petabytes of disk-based data
    • Millions of reads/writes per second, efficient scans

• Self-managing
    • Servers can be added/removed dynamically
    • Servers adjust to load imbalance

• CAP: strong consistency and partition tolerance


Data Model


Reminder

[http://highlyscalable.wordpress.com/2012/03/01/nosql-data-modeling-techniques]


Column-Oriented Data Model (1/2)

• Similar to a key/value store, but the value can have multiple attributes (columns).

• Column: a set of data values of a particular type.

• Store and process data by column instead of by row.

Column-Oriented Data Model (2/2)

• In many analytical database queries, only a few attributes are needed.

• Column values are stored contiguously on disk, which reduces I/O.

[Lars George, "HBase: The Definitive Guide", O'Reilly, 2011]
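The difference between the two layouts can be sketched with plain Python lists. This is only a toy illustration, not a storage engine; the table contents and the names `rows`, `columns`, and `average_age` are made up for the example.

```python
# Toy table: each record has three attributes (row-oriented layout).
rows = [
    {"id": 1, "name": "alice", "age": 30},
    {"id": 2, "name": "bob", "age": 25},
    {"id": 3, "name": "carol", "age": 41},
]

# Column-oriented layout: one contiguous list per attribute.
columns = {key: [r[key] for r in rows] for key in rows[0]}


def average_age(cols):
    """Touch only the 'age' column; 'id' and 'name' are never read."""
    ages = cols["age"]
    return sum(ages) / len(ages)


print(average_age(columns))
```

With the row layout, the same query would have to walk every full record even though it needs a single attribute.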

BigTable Data Model (1/5)

• Table

• Distributed multi-dimensional sparse map
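One way to picture a sparse multi-dimensional map is a dictionary keyed by (row, column, timestamp). The sketch below assumes exactly that shape; the helper names `put` and `get` and the sample keys are illustrative, not part of any BigTable API.

```python
# (row key, column key, timestamp) -> value.
# Cells that were never written simply have no entry, so the map is sparse.
table = {}


def put(row, column, timestamp, value):
    table[(row, column, timestamp)] = value


def get(row, column, timestamp):
    return table.get((row, column, timestamp))


put("com.cnn.www", "contents:", 3, "<html>v3</html>")
put("com.cnn.www", "anchor:cnnsi.com", 9, "CNN")

print(get("com.cnn.www", "contents:", 3))
print(get("com.cnn.www", "contents:", 4))  # never written
```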

BigTable Data Model (2/5)

• Rows

• Every read or write of data under a single row key is atomic.

• Rows are sorted in lexicographical order of their keys.
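Because rows are kept in lexicographic order, a range of related keys can be read with two binary searches. The sketch below assumes reversed-hostname row keys purely as an example; `scan` is a hypothetical helper built on Python's standard `bisect` module.

```python
import bisect

# Row keys in lexicographic order (Python compares str by code point).
row_keys = sorted([
    "com.cnn.www",
    "com.cnn.money",
    "org.wikipedia.en",
    "com.example.blog",
    "com.cnn.edition",
])


def scan(keys, start, end):
    """Return all row keys k with start <= k < end."""
    lo = bisect.bisect_left(keys, start)
    hi = bisect.bisect_left(keys, end)
    return keys[lo:hi]


# "/" is the character right after "." so "com.cnn/" sorts after
# every key that begins with "com.cnn.".
print(scan(row_keys, "com.cnn.", "com.cnn/"))
```

Keys that share a prefix end up next to each other, so one contiguous scan returns all of them.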

BigTable Data Model (3/5)

• Column

• The basic unit of data access.

• Column families: groups of column keys (of the same type).

• Column key naming: family:qualifier
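The family:qualifier naming can be illustrated with a small parser. The function names and the sample column keys below are assumptions made for the example.

```python
from collections import defaultdict


def split_column_key(column_key):
    """Split 'family:qualifier' at the first colon."""
    family, _, qualifier = column_key.partition(":")
    return family, qualifier


def group_by_family(column_keys):
    """Collect the qualifiers that belong to each column family."""
    families = defaultdict(list)
    for key in column_keys:
        family, qualifier = split_column_key(key)
        families[family].append(qualifier)
    return dict(families)


print(group_by_family(["anchor:cnnsi.com", "anchor:my.look.ca", "contents:"]))
```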

BigTable Data Model (4/5)

• Timestamp

• Each column value may contain multiple versions.
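A single cell with several versions can be modeled as a mapping from timestamp to value. In this sketch the timestamps are plain integers, and `keep_last_n` is a hypothetical retention policy added only to show how old versions might be dropped.

```python
# One cell: timestamp -> value.
cell = {3: "<html>v1</html>", 5: "<html>v2</html>", 6: "<html>v3</html>"}


def latest(versions):
    """Value with the largest timestamp."""
    return versions[max(versions)]


def as_of(versions, timestamp):
    """Most recent value written at or before `timestamp`, or None."""
    eligible = [t for t in versions if t <= timestamp]
    return versions[max(eligible)] if eligible else None


def keep_last_n(versions, n):
    """Drop all but the n newest versions (illustrative policy)."""
    newest = sorted(versions, reverse=True)[:n]
    return {t: versions[t] for t in newest}


print(latest(cell))
print(as_of(cell, 5))
```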

BigTable Data Model (5/5)

• Tablet: a contiguous range of rows stored together.

• Tables are split into tablets by the system when they become too large.

• Auto-sharding

• Each tablet is served by exactly one tablet server.
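Splitting a sorted table into contiguous row ranges can be sketched as follows. The split threshold here is a row count, which is an assumption for simplicity (a real system would more likely split by size), and both function names are hypothetical.

```python
import bisect


def split_into_tablets(sorted_rows, max_rows):
    """Cut a sorted list of row keys into contiguous ranges of at most max_rows."""
    return [sorted_rows[i:i + max_rows]
            for i in range(0, len(sorted_rows), max_rows)]


def find_tablet(tablets, row_key):
    """Index of the tablet whose row range covers row_key."""
    end_keys = [t[-1] for t in tablets]   # last row key of each tablet
    idx = bisect.bisect_left(end_keys, row_key)
    return min(idx, len(tablets) - 1)     # keys past the end go to the last tablet


rows = sorted(["a", "b", "c", "d", "e", "f", "g"])
tablets = split_into_tablets(rows, max_rows=3)
print(tablets)
print(find_tablet(tablets, "e"))
```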


Building Blocks


BigTable Cell


Main Components

• Master server

• Tablet server

• Client library

Master Server

• One master server.

• Assigns tablets to tablet servers.

• Balances tablet server load.

• Garbage-collects unneeded files in GFS.

Tablet Server

• Many tablet servers.

• Can be added or removed dynamically.

• Each manages a set of tablets (typically 10-1000 tablets/server).

• Handles read/write requests to its tablets.

• Splits tablets when they grow too large.

Client Library

• Library that is linked into every client.

• Client data does not move through the master.

• Clients communicate directly with tablet servers for reads/writes.

Building Blocks

• The building blocks for BigTable are:
    • Google File System (GFS): raw storage
    • Chubby: distributed lock manager
    • Scheduler: schedules jobs onto machines

Google File System (GFS)

• Large-scale distributed file system.

• Master: responsible for metadata.

• Chunk servers: responsible for reading and writing large chunks of data.

• Chunks are replicated on 3 machines; the master is responsible for ensuring the replicas exist.

Chubby Lock Service (1/2)

• Name space consists of directories/files used as locks.

• Reads and writes to a file are atomic.

• Consists of 5 active replicas: one is elected master and serves requests.

• Needs a majority of its replicas to be running for the service to be alive.

• Uses Paxos to keep its replicas consistent during failures.
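The majority requirement is simple arithmetic: with 5 replicas, at least 3 must be running, so up to 2 failures are tolerated. A small sketch (function names are illustrative):

```python
def majority(n_replicas):
    """Smallest number of replicas that forms a majority."""
    return n_replicas // 2 + 1


def service_alive(n_replicas, n_running):
    """The service is alive only while a majority of replicas is running."""
    return n_running >= majority(n_replicas)


print(majority(5))
print(service_alive(5, 3), service_alive(5, 2))
```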

Chubby Lock Service (2/2)

• Chubby is used to:
    • Ensure there is only one active master.
    • Store the bootstrap location of BigTable data.
    • Discover tablet servers.
    • Store BigTable schema information.
    • Store access control lists.


SSTable

• Immutable, sorted file of key-value pairs.

• Chunks of data plus an index.

• Index of block ranges, not values.
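The "blocks plus a block index" idea can be sketched in memory. This is not the on-disk SSTable format; the class below is a hypothetical stand-in, the block size of two entries is arbitrary, and the index stores the last key of each block as one possible way to index block ranges.

```python
import bisect


class SSTable:
    """In-memory stand-in for an immutable, sorted key-value file."""

    def __init__(self, items, block_size=2):
        pairs = sorted(items.items())
        # Fixed-size blocks of (key, value) pairs, frozen as tuples.
        self._blocks = tuple(
            tuple(pairs[i:i + block_size])
            for i in range(0, len(pairs), block_size)
        )
        # One index entry per block (its last key), not one per value.
        self._index = tuple(block[-1][0] for block in self._blocks)

    def get(self, key):
        # Binary search over the index picks the only block that can hold key.
        i = bisect.bisect_left(self._index, key)
        if i == len(self._blocks):
            return None
        for k, v in self._blocks[i]:
            if k == key:
                return v
        return None


sstable = SSTable({"d": 4, "a": 1, "c": 3, "b": 2, "e": 5})
print(sstable.get("c"))
```

A lookup reads the small index first and then a single block, rather than the whole file.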


Implementation


Tablet Assignment

• 1 tablet → 1 tablet server.

• The master uses Chubby to keep track of the set of live tablet servers and unassigned tablets.
    • When a tablet server starts, it creates and acquires an exclusive lock in Chubby.

• The master detects the status of each tablet server's lock by checking periodically.

• The master is responsible for detecting when a tablet server is no longer serving its tablets and for reassigning those tablets as soon as possible.
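The lock-based liveness check can be sketched with an in-memory stand-in for Chubby. Everything here is hypothetical: `FakeChubby` is just a set of lock names, and the round-robin reassignment policy is an assumption, since the slides do not say how the master picks a new server.

```python
class FakeChubby:
    """Hypothetical stand-in for Chubby: a set of held lock names."""

    def __init__(self):
        self.locks = set()

    def acquire(self, name):
        if name in self.locks:
            return False          # exclusive: second acquire fails
        self.locks.add(name)
        return True

    def release(self, name):
        self.locks.discard(name)

    def is_held(self, name):
        return name in self.locks


class Master:
    def __init__(self, chubby):
        self.chubby = chubby
        self.assignment = {}      # tablet -> server (one server per tablet)
        self.unassigned = set()

    def assign(self, tablet, server):
        self.unassigned.discard(tablet)
        self.assignment[tablet] = server

    def check_servers(self):
        """One round of the periodic check: free tablets of dead servers."""
        for tablet, server in list(self.assignment.items()):
            if not self.chubby.is_held(server):
                del self.assignment[tablet]
                self.unassigned.add(tablet)

    def reassign(self, candidates):
        """Hand out unassigned tablets round-robin over live servers."""
        live = sorted(s for s in candidates if self.chubby.is_held(s))
        for i, tablet in enumerate(sorted(self.unassigned)):
            if live:
                self.assign(tablet, live[i % len(live)])


chubby = FakeChubby()
chubby.acquire("ts1")
chubby.acquire("ts2")
master = Master(chubby)
master.assign("t1", "ts1")
master.assign("t2", "ts2")

chubby.release("ts1")             # ts1 dies and loses its lock
master.check_servers()
master.reassign(["ts1", "ts2"])
print(master.assignment)
```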


Finding a Tablet

• Three-level hierarchy.

• The root tablet contains the location of all tablets in a special METADATA table.

• The METADATA table contains the location of each tablet under a row.

• The client library caches tablet locations.
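The lookup path can be sketched with nested dictionaries, assuming (as in the original Bigtable paper) that the first level is a Chubby file pointing at the root tablet. All server names and row ranges are invented, and the cache is keyed by row key for brevity, whereas a real client would more plausibly cache by tablet range.

```python
import bisect

# Level 1: a Chubby file names the server holding the root tablet.
CHUBBY_FILE = "server-root"

# Level 2: the root tablet maps the end row of each METADATA tablet
# to the server that holds it ("~" stands in for "end of table").
ROOT_TABLETS = {
    "server-root": [("m", "server-meta-1"), ("~", "server-meta-2")],
}

# Level 3: each METADATA tablet maps the end row of a user tablet
# to the tablet server that serves it.
METADATA_TABLETS = {
    "server-meta-1": [("f", "ts-1"), ("m", "ts-2")],
    "server-meta-2": [("t", "ts-3"), ("~", "ts-4")],
}


def lookup(entries, row_key):
    """Location stored under the first end row that is >= row_key."""
    ends = [end for end, _ in entries]
    return entries[bisect.bisect_left(ends, row_key)][1]


class Client:
    def __init__(self):
        self.cache = {}   # row key -> tablet server
        self.walks = 0    # number of full hierarchy walks

    def locate(self, row_key):
        if row_key not in self.cache:
            self.walks += 1
            root = ROOT_TABLETS[CHUBBY_FILE]
            meta_server = lookup(root, row_key)
            self.cache[row_key] = lookup(METADATA_TABLETS[meta_server], row_key)
        return self.cache[row_key]


client = Client()
print(client.locate("cat"), client.locate("pig"), client.locate("cat"))
print(client.walks)
```

The second request for "cat" is answered from the cache, so only two walks of the hierarchy are counted.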


Master Startup

• The master executes the following steps at startup:
    • Grabs a unique master lock in Chubby, which prevents concurrent master instantiations.
    • Scans the servers directory in Chubby to find the live servers.
    • Communicates with every live tablet server to discover what tablets are already assigned to each server.
    • Scans the METADATA table to learn the set of tablets.
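The four startup steps can be walked through with plain Python containers standing in for Chubby, the tablet servers, and the METADATA table. The function and all argument names are hypothetical; the last step assumes that any tablet found in METADATA but not reported by a live server counts as unassigned.

```python
def master_startup(chubby_locks, servers_dir, server_tablets, metadata_tablets):
    """Return (assignment, unassigned), or None if a master already exists.

    chubby_locks     - set of lock names currently held in "Chubby"
    servers_dir      - names of live tablet servers listed in "Chubby"
    server_tablets   - server -> tablets it reports as already assigned
    metadata_tablets - every tablet listed in the METADATA table
    """
    # 1. Grab the unique master lock.
    if "master" in chubby_locks:
        return None
    chubby_locks.add("master")

    # 2. Scan the servers directory for live servers.
    live = sorted(servers_dir)

    # 3. Ask every live server which tablets it already serves.
    assignment = {}
    for server in live:
        for tablet in server_tablets.get(server, []):
            assignment[tablet] = server

    # 4. Scan METADATA; tablets nobody serves still need a server.
    unassigned = set(metadata_tablets) - set(assignment)
    return assignment, unassigned


locks = set()
result = master_startup(
    locks,
    ["ts1", "ts2"],
    {"ts1": ["a"], "ts2": ["b"], "ts3": ["c"]},   # ts3 is not live
    ["a", "b", "c", "d"],
)
print(result)
```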


Tablet Serving (1/2)

• Updates are committed to a commit log.

• Recently committed updates are stored in memory, in the memtable.

• Older updates are stored in a sequence of SSTables.
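The write and read paths described above can be sketched as follows. The `Tablet` class is a hypothetical in-memory model: the commit log is a Python list, SSTables are plain dictionaries, and `recover` shows how replaying the log could rebuild a lost memtable.

```python
class Tablet:
    def __init__(self, sstables=None):
        self.commit_log = []                  # append-only (key, value) records
        self.memtable = {}                    # recent updates, in memory
        self.sstables = list(sstables or [])  # older data, oldest first

    def write(self, key, value):
        self.commit_log.append((key, value))  # log first, so it can be replayed
        self.memtable[key] = value

    def read(self, key):
        # Newest data wins: memtable first, then SSTables newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):
            if key in sstable:
                return sstable[key]
        return None

    def recover(self):
        """Rebuild the memtable by replaying the commit log."""
        self.memtable = {}
        for key, value in self.commit_log:
            self.memtable[key] = value


tablet = Tablet([{"a": 1, "b": 2}, {"b": 20}])
tablet.write("a", 100)
print(tablet.read("a"), tablet.read("b"))
```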


Tablet Serving (2/2)

I Strong consistency

• Only one tablet server is responsible for a given piece of data.

• Replication is handled on the GFS layer.

I Tradeoff with availability

• If a tablet server fails, its portion of data is temporarily unavailable until a new server is assigned.


Compaction

I Triggered when the memtable reaches its size threshold.

I Minor compaction

• Converts the memtable into an SSTable.

• Reduces memory usage and the amount of commit log that must be read on restart.

I Merging compaction

• Reduces the number of SSTables.

• Reads the contents of a few SSTables and the memtable, and writes out a new SSTable.

I Major compaction

• A merging compaction that results in only one SSTable.

• The result contains no deletion entries or deleted data, only live data, which matters for services that store sensitive data.
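The three compaction kinds can be sketched over plain dictionaries. This is a toy under stated assumptions: the function names, the `TOMBSTONE` marker, and the dict-based SSTables are all hypothetical, and real SSTables are on-disk files rather than in-memory maps.

```python
TOMBSTONE = object()  # hypothetical marker for a deleted key


def merge(tables):
    """Merge tables given oldest first; newer entries override older ones."""
    merged = {}
    for table in tables:
        merged.update(table)
    return dict(sorted(merged.items()))


def minor_compaction(memtable, sstables):
    """Write the memtable out as a new SSTable and start an empty memtable."""
    return {}, sstables + [dict(sorted(memtable.items()))]


def merging_compaction(memtable, sstables, count):
    """Merge the `count` newest SSTables and the memtable into one SSTable."""
    split = len(sstables) - count
    kept, newest = sstables[:split], sstables[split:]
    # Tombstones are kept: older SSTables may still hold the deleted keys.
    return {}, kept + [merge(newest + [memtable])]


def major_compaction(memtable, sstables):
    """Merge everything into one SSTable and drop deleted entries."""
    merged = merge(sstables + [memtable])
    live = {k: v for k, v in merged.items() if v is not TOMBSTONE}
    return {}, [live]


sstables = [{"a": 1, "b": 2}, {"b": TOMBSTONE}]
memtable = {"c": 3}
memtable, sstables = minor_compaction(memtable, sstables)       # 3 SSTables
memtable["a"] = 10
memtable, sstables = merging_compaction(memtable, sstables, 2)  # 2 SSTables
memtable, sstables = major_compaction(memtable, sstables)       # 1 SSTable
print(sstables)  # [{'a': 10, 'c': 3}]
```

Only the major compaction can safely discard tombstones, because it is the only one that sees every SSTable that might contain the deleted key.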


The Bigtable API

I Metadata operations

• Create/delete tables and column families, change metadata.

I Writes: single-row, atomic

• Write/delete cells in a row, delete all cells in a row.

I Reads: read arbitrary cells in a Bigtable table

• Each row read is atomic.

• Can restrict returned rows to a particular range.

• Can ask for just data from one row, all rows, etc.

• Can ask for all columns, just certain column families, or specific columns.

• Can ask for certain timestamps only.
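The read restrictions listed above can be illustrated with a toy scan over a nested dictionary. This is not the Bigtable client API: the `scan` function, its parameters, and the sample `webtable` contents are hypothetical and serve only to show how row-range, column, and timestamp filters combine.

```python
def scan(table, start_row=None, end_row=None, families=None, columns=None,
         min_ts=None, max_ts=None):
    """Yield (row, column, timestamp, value) for cells matching the filters.

    `table` maps row -> "family:qualifier" -> timestamp -> value.
    Rows are visited in sorted order; `end_row` is exclusive.
    """
    for row in sorted(table):
        if start_row is not None and row < start_row:
            continue
        if end_row is not None and row >= end_row:
            break
        for column in sorted(table[row]):
            family = column.split(":", 1)[0]
            if families is not None and family not in families:
                continue
            if columns is not None and column not in columns:
                continue
            # Newest version first.
            for ts in sorted(table[row][column], reverse=True):
                if min_ts is not None and ts < min_ts:
                    continue
                if max_ts is not None and ts > max_ts:
                    continue
                yield row, column, ts, table[row][column][ts]


webtable = {
    "com.cnn.www": {
        "anchor:cnnsi.com": {9: "CNN"},
        "contents:": {3: "<html>v1", 5: "<html>v2"},
    },
    "org.example.www": {"contents:": {4: "<html>"}},
}
rows = list(scan(webtable, end_row="d", families={"contents"}, min_ts=4))
print(rows)  # [('com.cnn.www', 'contents:', 5, '<html>v2')]
```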


Writing Example

// Open the table
Table *T = OpenOrDie("/bigtable/web/webtable");

// Write a new anchor and delete an old anchor
RowMutation r1(T, "com.cnn.www");
r1.Set("anchor:www.c-span.org", "CNN");
r1.Delete("anchor:www.abc.com");
Operation op;
Apply(&op, &r1);


Reading Example

Scanner scanner(T);
scanner.Lookup("com.cnn.www");
ScanStream *stream;
stream = scanner.FetchColumnFamily("anchor");
stream->SetReturnAllVersions();
for (; !stream->Done(); stream->Next()) {
  printf("%s %s %lld %s\n",
         scanner.RowName(),
         stream->ColumnName(),
         stream->MicroTimestamp(),
         stream->Value());
}


HBase

I Type of NoSQL database, based on Google Bigtable

I Column-oriented data store, built on top of HDFS

I CAP: strong consistency and partition tolerance


Region and Region Server

[Lars George, “HBase: The Definitive Guide”, O’Reilly, 2011]


HBase Cell

[Lars George, “HBase: The Definitive Guide”, O’Reilly, 2011]

I (Table, RowKey, Family, Column, Timestamp) → Value
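The cell mapping can be read as a map keyed by a five-part tuple. The sketch below is a toy model of that idea, not the HBase client API: `cells`, `put`, and `get` are hypothetical names, and the sketch assumes that a read without a timestamp returns the newest version.

```python
# Toy model: (table, row key, family, column, timestamp) -> value.
cells = {}


def put(table, row, family, column, timestamp, value):
    cells[(table, row, family, column, timestamp)] = value


def get(table, row, family, column, timestamp=None):
    """Return the value at `timestamp`, or the newest version if omitted."""
    versions = {key[4]: value for key, value in cells.items()
                if key[:4] == (table, row, family, column)}
    if not versions:
        return None
    if timestamp is None:
        timestamp = max(versions)
    return versions.get(timestamp)


put("Blog", "Matt-001", "info", "title", 1, "Elephant")
put("Blog", "Matt-001", "info", "title", 2, "Elephants")
print(get("Blog", "Matt-001", "info", "title"))               # Elephants
print(get("Blog", "Matt-001", "info", "title", timestamp=1))  # Elephant
```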


HBase Cluster

[Tom White, “Hadoop: The Definitive Guide”, O’Reilly, 2012]


HBase Components

[Lars George, “HBase: The Definitive Guide”, O’Reilly, 2011]


HBase Components - Region Server

I Region servers handle all read and write requests for the regions they serve.

I They split regions that have exceeded the configured size threshold.

I Region servers can be added or removed dynamically.


HBase Components - Master

I Responsible for managing regions and their locations.

• Assigning regions to region servers (uses ZooKeeper).

• Handling load balancing of regions across region servers.

I Doesn’t actually store or read data.

• Clients communicate directly with region servers.

I Responsible for schema management and changes.

• Adding/removing tables and column families.


HBase Components - ZooKeeper

I A coordination service; a separate project, not part of HBase itself.

I The master uses ZooKeeper for region assignment.

I Ensures that there is only one master running.

I Stores the bootstrap location for region discovery: a registry for region servers.


HBase Components - HFile

I The data is stored in HFiles.

I HFiles are immutable sequences of blocks and saved in HDFS.

I Block index is stored at the end of HFiles.

I Key-value pairs cannot be removed from an HFile in place.

I A delete marker (tombstone marker) indicates the removed records.

• Hides the marked data from reading clients.

I Updating key/value pairs: readers pick the entry with the latest timestamp.
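A minimal sketch of how immutable files, tombstones, and timestamps interact on a read. It simplifies HBase's actual delete semantics: it assumes the entry with the newest timestamp wins and that a tombstone as the newest entry hides the key. `DELETE`, `read_latest`, and the tuple-based files are hypothetical.

```python
DELETE = object()  # hypothetical tombstone marker


def read_latest(hfiles, key):
    """Return the visible value for `key` across immutable files, or None."""
    newest = None  # (timestamp, value) of the newest entry seen so far
    for hfile in hfiles:
        for entry_key, timestamp, value in hfile:
            if entry_key == key and (newest is None or timestamp > newest[0]):
                newest = (timestamp, value)
    # A tombstone as the newest entry hides the key from readers.
    if newest is None or newest[1] is DELETE:
        return None
    return newest[1]


# Files are tuples of (key, timestamp, value) entries and are never modified.
hfile_old = (("row1", 1, "v1"), ("row2", 1, "x"))
hfile_new = (("row1", 2, "v2"), ("row2", 3, DELETE))
print(read_latest([hfile_old, hfile_new], "row1"))  # v2
print(read_latest([hfile_old, hfile_new], "row2"))  # None
```

The old value of `row2` is still physically present in `hfile_old`; it stays there until a compaction rewrites the files.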


HBase Components - WAL and memstore

I When data is added, it is written to a log called the Write Ahead Log (WAL) and is also stored in memory (memstore).

I When the in-memory data exceeds a maximum size, it is flushed to an HFile.


HBase Installation and Shell


HBase Installation

I Three options

• Local (Standalone) Mode

• Pseudo-Distributed Mode

• Fully-Distributed Mode


Installation - Local

I Uses local filesystem (not HDFS).

I Runs HBase and ZooKeeper in the same JVM.


Installation - Pseudo-Distributed (1/3)

I Requires HDFS.

I Mimics Fully-Distributed but runs on just one host.

I Good for testing, debugging and prototyping, not for production.

I Configuration files:

• hbase-env.sh

• hbase-site.xml


Installation - Pseudo-Distributed (2/3)

I Specify environment variables in hbase-env.sh

export JAVA_HOME=/opt/jdk1.7.0_51


Installation - Pseudo-Distributed (3/3)

I Starts an HBase Master process, a ZooKeeper server, and a RegionServer process.

I Configure in hbase-site.xml

<property>
  <name>hbase.cluster.distributed</name>
  <value>true</value>
</property>

<property>
  <name>hbase.rootdir</name>
  <value>hdfs://localhost:8020/hbase</value>
</property>


Start HBase and Test

I Start the HBase daemon.

start-hbase.sh

hbase shell

I Web-based management

• Master host: http://localhost:60010 (port 16010 from HBase 1.0 onward)

• Region host: http://localhost:60030 (port 16030 from HBase 1.0 onward)


HBase Shell

status
list
create 'Blog', {NAME=>'info'}, {NAME=>'content'}

# put 'table', 'row_id', 'family:column', 'value'
put 'Blog', 'Matt-001', 'info:title', 'Elephant'
put 'Blog', 'Matt-001', 'info:author', 'Matt'
put 'Blog', 'Matt-001', 'info:date', '2009.05.06'
put 'Blog', 'Matt-001', 'content:post', 'Do elephants like monkeys?'

# get 'table', 'row_id'
get 'Blog', 'Matt-001'
get 'Blog', 'Matt-001', {COLUMN=>['info:author','content:post']}
scan 'Blog'


Summary

I BigTable

I Column-oriented

I Main components: master, tablet server, client library

I Basic components: GFS, Chubby, SSTable

I HBase


Questions?
