Bigtable: A Distributed Storage System for Structured Data
Authors: Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson Hsieh, Deborah Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert Gruber
Presented by: Arif Bin Hossain, Dept. of Computer Science, UTSA
Slide 2
Motivation
Large-scale structured data
  URLs: contents, links, anchors, page rank
  User data: preference settings, recent queries, search results
  Geographic locations: physical entities, roads, satellite imagery
Large sets of structured MATLAB data
  EEG, EMG, eye motion
  Fields are not uniform among datasets
  Data types are not uniform among datasets
Slide 3
Why not a Relational Database?
Scale is too large for most commercial databases
Even if it weren't, the cost would be very high
Low-level storage optimizations help performance significantly
Hard to map semi-structured data to a relational database
  Non-uniform fields make it difficult to insert and query data
Slide 4
Bigtable
Bigtable is a distributed storage system for managing structured data
Designed to scale to a very large size
Used for many Google projects
  Web indexing, Personalized Search, Google Earth, Google Analytics, Google Finance
Efficient scans over all or interesting subsets of data
Efficient joins of large one-to-one and one-to-many datasets
Slide 5
Bigtable
Used for a variety of demanding workloads
  Throughput-oriented batch processing
  Latency-sensitive data serving
Data is indexed using row and column names
Treats data as uninterpreted strings
Clients can control the locality of their data
Dynamic controls to serve data out of memory or from disk
Slide 6
Building Blocks
Google File System (GFS)
  Large-scale distributed file system
  Maintains multiple replicas of the data
  Consists of a master and chunk servers
Chunk server
  Stores the data files
  Each data file is broken into fixed-size chunks
  Each chunk is replicated, three times by default
Master
  Stores the metadata associated with the chunks
Slide 7
Building Blocks
Chubby lock service
  Has five active replicas
  Provides a namespace that consists of directories and files
  Each file can be used as a lock
  Each Chubby client maintains a session with the Chubby service
  When a client's session expires, the client loses any locks and open handles
Slide 8
Building Blocks
SSTable
  Immutable file format used internally to store data files
  Sorted key-value pairs of arbitrary byte strings
  Contains a sequence of blocks
  A block index is used to locate blocks
  The index is loaded into memory when the SSTable is opened
  A lookup can then be performed with a single disk access
[Figure: an SSTable drawn as a sequence of 64K blocks plus an index]
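The block-index lookup can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the real file format: the class name `ToySSTable`, the `block_size` parameter (counted in entries rather than bytes), and the choice to index each block by its last key are all assumptions made for the example.

```python
import bisect

class ToySSTable:
    """Immutable, sorted key-value store split into fixed-size blocks."""

    def __init__(self, items, block_size=2):
        pairs = sorted(items.items())
        # Split the sorted pairs into blocks (block_size counts entries here,
        # standing in for the 64K byte blocks on the slide).
        self.blocks = [pairs[i:i + block_size]
                       for i in range(0, len(pairs), block_size)]
        # In-memory index: the last key stored in each block.
        self.index = [block[-1][0] for block in self.blocks]
        self.block_reads = 0  # counts simulated disk accesses

    def get(self, key):
        # Binary-search the index for the first block whose last key >= key.
        i = bisect.bisect_left(self.index, key)
        if i == len(self.blocks):
            return None  # key sorts after everything in the table
        self.block_reads += 1  # only this one block is read
        for k, v in self.blocks[i]:
            if k == key:
                return v
        return None

table = ToySSTable({"apple": 1, "banana": 2, "cherry": 3, "date": 4, "fig": 5})
print(table.get("cherry"), table.block_reads)
```

Because the index lives in memory, the search touches storage only once, for the single block that could hold the key.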
Slide 9
Basic Data Model
A table is a sparse, distributed, persistent, multidimensional sorted map
Data is organized into three dimensions
  (row: string, column: string, time: int64) -> string
Each cell is referenced by a row key, a column key, and a timestamp
Slide 10
Basic Data Model
(row, column, timestamp) -> cell contents
Example: webtable
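A minimal Python sketch of this map may help make the model concrete. `ToyTable` and its method names are hypothetical, and the row key `com.cnn.www` is taken from the webtable example in the Bigtable paper; nothing here reflects the real client API.

```python
import time

class ToyTable:
    """Sparse map: (row, column, timestamp) -> uninterpreted bytes."""

    def __init__(self):
        self.cells = {}  # only cells that were written take up space

    def put(self, row, column, value, timestamp=None):
        if timestamp is None:
            timestamp = time.time_ns() // 1000  # microseconds, fits in int64
        self.cells[(row, column, timestamp)] = value

    def get(self, row, column, max_versions=1):
        # Collect every version of the cell, newest timestamp first.
        versions = sorted(
            ((ts, v) for (r, c, ts), v in self.cells.items()
             if r == row and c == column),
            reverse=True,
        )
        return versions[:max_versions]

    def scan(self, start_row, end_row):
        # Keys come back in lexicographic order of the row key.
        return sorted(k for k in self.cells if start_row <= k[0] < end_row)

webtable = ToyTable()
webtable.put("com.cnn.www", "contents:", b"<html>v1", timestamp=1)
webtable.put("com.cnn.www", "contents:", b"<html>v2", timestamp=2)
webtable.put("com.cnn.www", "anchor:cnnsi.com", b"CNN", timestamp=3)
print(webtable.get("com.cnn.www", "contents:"))
```

A plain dictionary keyed by the full triple keeps the map sparse: a row that never stores a given column costs nothing.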
Slide 11
Data Model: Row
The row name is an arbitrary string
Access to data in a row is atomic
Row creation is implicit upon storing data
Transactions are supported within a row
Rows are ordered lexicographically by row key
Rows that are close together lexicographically usually sit on one or a small number of machines
Rows are grouped together to form the unit of load balancing
Slide 12
Data Model: Column
Columns have a two-level name structure: family:qualifier
  Example: anchor:cnnsi.com
Column keys are grouped into sets called column families
  Unit of access control
  All data stored in a column family is usually of the same type
The qualifier provides an additional level of indexing, if desired
Main idea: limited families, unbounded columns
Slide 13
Data Model: Timestamp
Used to store different versions of data in a cell
New writes default to the current time
Timestamps can also be set explicitly by clients
Lookup examples
  Return the most recent K values
  Return all values in a timestamp range (or all values)
Column families can be marked with a garbage-collection setting
  Only retain the most recent K values in a cell
  Keep values until they are older than K seconds
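The two retention settings can be illustrated with a short Python sketch. The function names and the representation of a cell as a list of `(timestamp, value)` pairs are assumptions made for this example, not part of Bigtable's interface.

```python
def keep_last_k(versions, k):
    """Retain only the k most recent (timestamp, value) pairs."""
    return sorted(versions, reverse=True)[:k]

def keep_newer_than(versions, now, max_age):
    """Retain values that are at most max_age seconds older than now."""
    return sorted(
        ((ts, v) for ts, v in versions if now - ts <= max_age),
        reverse=True,
    )

cell = [(100, "a"), (200, "b"), (300, "c")]
print(keep_last_k(cell, 2))
print(keep_newer_than(cell, now=310, max_age=120))
```

Both helpers return versions newest first, matching the order in which a "most recent K values" lookup would read them.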
Slide 14
Tablets
Rows with consecutive keys are grouped into tablets
The tablet is the unit of load balancing
Reads of short row ranges are efficient and require communication with only a small number of machines
Clients can exploit this property to get good locality by choosing their row keys carefully
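One way to choose row keys for locality, used for the webtable in the Bigtable paper, is to reverse the host name so that pages from the same domain sort next to each other. The sketch below illustrates the idea; `row_key` is a hypothetical helper and the host names are sample data.

```python
def row_key(host, path="/"):
    """Reverse the host name so pages of one domain get adjacent row keys."""
    return ".".join(reversed(host.split("."))) + path

hosts = ["maps.google.com", "www.cnn.com", "mail.google.com", "money.cnn.com"]
keys = sorted(row_key(h) for h in hosts)
for key in keys:
    print(key)
```

After sorting, the two cnn.com rows are adjacent, as are the two google.com rows, so a scan over one domain stays within a short row range.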
Slide 15
Tablets (cont.)
A tablet contains some range of rows and is essentially a set of SSTables
[Figure: a tablet made up of two SSTables, each drawn as 64K blocks plus an index]
Slide 16
Implementation
Three major components
Library linked into every client
Single master server
  Assigns tablets to tablet servers
  Detects the addition and expiration of tablet servers
  Balances tablet-server load
  Garbage-collects files in GFS
Many tablet servers
  Each tablet server manages a set of tablets
  Handles read and write requests to the tablets it has loaded
  Splits tablets that have grown too large
Slide 17
Implementation (cont.)
Clients communicate directly with tablet servers for reads and writes
Each table consists of a set of tablets
Initially, each table has just one tablet
Tablets are automatically split as the table grows
Row size can be arbitrary (hundreds of GB)
Slide 18
Locating Tablets
How do clients find the right machine?
  Need to find the tablet whose row range covers the target row
Three-level hierarchy
  Level 1: a Chubby file contains the location of the root tablet
  Level 2: the root tablet contains the locations of the METADATA tablets
  Level 3: each METADATA tablet contains the locations of user tablets
The location of a tablet is stored under a row key that encodes the tablet's table identifier and its end row
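The three-level lookup can be sketched as two binary searches over end-row keys. Everything named below (`TABLETS`, `CHUBBY_FILE`, the tablet and server names, the `table:row` key encoding, and `"\xff"` as an end-of-table marker) is a made-up stand-in for illustration, not the real METADATA schema.

```python
import bisect

def find_covering(entries, key):
    """entries: list of (end_row_key, value) sorted by end_row_key.
    Return the value of the first entry whose end row is >= key."""
    ends = [end for end, _ in entries]
    i = bisect.bisect_left(ends, key)
    if i == len(entries):
        raise KeyError(key)
    return entries[i][1]

# Hypothetical cluster state.
CHUBBY_FILE = "root-tablet"  # level 1: Chubby file names the root tablet
TABLETS = {
    # level 2: root tablet maps end rows to METADATA tablets
    "root-tablet": [("users:m", "meta-1"), ("users:\xff", "meta-2")],
    # level 3: METADATA tablets map end rows to user-tablet locations
    "meta-1": [("users:f", "server-A"), ("users:m", "server-B")],
    "meta-2": [("users:t", "server-C"), ("users:\xff", "server-D")],
}

def locate(table, row):
    meta_key = f"{table}:{row}"  # key encodes table identifier and row
    root = TABLETS[CHUBBY_FILE]
    metadata_tablet = TABLETS[find_covering(root, meta_key)]
    return find_covering(metadata_tablet, meta_key)

print(locate("users", "alice"), locate("users", "zoe"))
```

Keying each entry by its end row means a single "first key greater than or equal to the target" search finds the tablet whose range covers the row.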
Slide 19
Locating Tablets
Slide 20
Assigning Tablets
Each tablet is assigned to one tablet server at a time
The master server keeps track of
  The set of live tablet servers
  The current assignment of tablets to servers
  Unassigned tablets
When a tablet is unassigned, the master assigns it to a tablet server with sufficient room
Slide 21
Assigning Tablets
Tablet server startup
  The server creates, and acquires an exclusive lock on, a uniquely named file in a Chubby directory
  The master monitors this directory to discover tablet servers
A tablet server stops serving its tablets if it loses its exclusive lock
  It tries to reacquire the lock on its file as long as the file still exists
  If the file no longer exists, the tablet server will never be able to serve again
Slide 22
Assigning Tablets
Master server startup
  Grabs the unique master lock in Chubby
  Scans the tablet server directory in Chubby
  Communicates with every live tablet server
  Scans the METADATA table to learn the set of tablets
The master is responsible for detecting when a tablet server is no longer serving its tablets and for reassigning those tablets as soon as possible
  Periodically asks each tablet server for the status of its lock
  If there is no reply, the master tries to acquire the server's lock itself
  If the master acquires the lock, the tablet server is either dead or having network trouble
Slide 23
Tablet Serving
Updates are committed to a commit log that stores redo records
Recently committed updates are stored in memory in a sorted buffer called the memtable
The memtable maintains the updates on a row-by-row basis
Older updates are stored in a sequence of immutable SSTables
To recover a tablet
  The tablet server reads the tablet's metadata from the METADATA table
  The metadata contains the list of SSTables and a set of redo points
  The server reads the indices of the SSTables into memory
  It reconstructs the memtable by applying all of the updates committed since the redo points
Slide 24
Tablet Serving
Write operation
  The server checks that the request is well-formed
  Checks that the sender is authorized
  Writes the mutation to the commit log
  After the commit, the contents are inserted into the memtable
Read operation
  Similar checks for well-formedness and authorization
  Executed on a merged view of the sequence of SSTables and the memtable
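The write and read paths above can be sketched as follows. `ToyTablet` is a hypothetical class; plain dictionaries stand in for the memtable and the SSTables, a list stands in for the commit log, and the well-formedness and authorization checks are reduced to trivial placeholders.

```python
class ToyTablet:
    def __init__(self):
        self.commit_log = []  # redo records, appended before anything else
        self.memtable = {}    # recent writes, held in memory
        self.sstables = []    # older data as read-only dicts, oldest first

    def write(self, key, value, authorized=True):
        if not isinstance(key, str) or not key:
            raise ValueError("malformed request")
        if not authorized:
            raise PermissionError("sender not authorized")
        self.commit_log.append((key, value))  # 1. commit to the log
        self.memtable[key] = value            # 2. then update the memtable

    def read(self, key):
        # Merged view: check the memtable first, then SSTables newest to oldest.
        for layer in [self.memtable, *reversed(self.sstables)]:
            if key in layer:
                return layer[key]
        return None

tablet = ToyTablet()
tablet.sstables = [{"k": "old"}, {"k": "mid", "j": "x"}]
tablet.write("k", "new")
print(tablet.read("k"), tablet.read("j"))
```

Checking the newest layer first is what lets a recent write in the memtable shadow older values that still sit in SSTables.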
Slide 25
Compaction: Minor
As write operations execute, the size of the memtable increases
When the memtable reaches a threshold
  The memtable is frozen and converted to an SSTable
  The SSTable is written to the file system
Goals
  Reduce the memory usage of the tablet server
  Reduce the amount of data to read from the commit log during recovery
Slide 26
Compaction
Problem: too many SSTables
  Read operations might need to merge updates from a number of SSTables
Merging compaction
  Reads the contents of a few SSTables and the memtable
  Writes out a new SSTable
A merging compaction that rewrites all SSTables into exactly one SSTable is a major compaction
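The compaction steps can be sketched in Python as below. This is a simplified illustration under several assumptions: dictionaries stand in for SSTables, the function names are hypothetical, the merging step takes only SSTables (the memtable is handled by the minor compaction), and deletion markers are ignored.

```python
def minor_compaction(memtable, sstables):
    """Freeze the memtable into a new SSTable; return (fresh memtable, SSTables)."""
    frozen = dict(sorted(memtable.items()))  # SSTables are stored sorted
    return {}, sstables + [frozen]

def merging_compaction(sstables, count):
    """Merge the `count` newest SSTables into one; newer values win."""
    if count < 1:
        return list(sstables)
    split = max(len(sstables) - count, 0)
    keep, to_merge = sstables[:split], sstables[split:]
    merged = {}
    for table in to_merge:  # oldest first, so later tables overwrite
        merged.update(table)
    return keep + [dict(sorted(merged.items()))]

def major_compaction(sstables):
    """A merging compaction over every SSTable, leaving exactly one."""
    return merging_compaction(sstables, len(sstables))

memtable, sstables = minor_compaction({"b": 2, "a": 1}, [{"a": 0, "c": 3}])
print(sstables)
print(major_compaction(sstables))
```

SSTables are ordered oldest first here, so applying them in list order lets the most recent value for each key survive the merge.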
Slide 27
Locality Groups
Each column family is assigned to a locality group defined by the client
A separate SSTable is created for each locality group during compaction
Increases read efficiency, since columns that are grouped together are usually accessed together
Used to organize the underlying storage representation for performance
  Scans over one locality group are O(bytes_in_locality_group), not O(bytes_in_table)
  Data in a locality group can be explicitly memory-mapped
Slide 28
Refinements
Compression
  Clients can control SSTable compression for a locality group
Caching
  Scan Cache: a higher-level cache that caches key-value pairs returned by the SSTable interface
  Block Cache: a lower-level cache that caches SSTable blocks read from the file system
Bloom filters
  Let a reader ask whether an SSTable might contain any data for a given row/column pair
  Reduce disk accesses while reading SSTables
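A Bloom filter answers "might this key be present?" with possible false positives but no false negatives. The sketch below is a generic textbook construction, not Bigtable's implementation; the class name, the bit-array size, the number of hash functions, and the use of SHA-256 with a seed prefix are all assumptions for illustration.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, key):
        # Derive num_hashes bit positions by hashing the key with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
bf.add("com.cnn.www/anchor:cnnsi.com")
print(bf.might_contain("com.cnn.www/anchor:cnnsi.com"))
```

When `might_contain` returns False for a row/column pair, a read can skip that SSTable without touching disk.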
Slide 29
Example: Cassandra
Initially developed by Facebook for inbox search
Built on the Bigtable data model
Provides a structured key-value store
Keys map to multiple values, which are grouped into column families
Used by: [logos not recoverable from the slide]
Slide 30
Cassandra
A table in Cassandra is a distributed multidimensional map indexed by a key
The row key in a table is a string with no size restrictions
Usually a four-dimensional map
  Keyspace -> Column Family
  Column Family -> Column Family Row
  Column Family Row -> Columns
  Column -> Data value
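The four levels of this map can be pictured as nested dictionaries. The sketch below uses made-up names (`Inbox`, `user42`, `subject`) and a hypothetical `lookup` helper; it illustrates only the nesting, not Cassandra's API or storage layout.

```python
# keyspace -> column family -> row key -> column -> data value
keyspace = {
    "Inbox": {                    # column family
        "user42": {               # row key
            "subject": "hello",   # column -> data value
            "sender": "user7",
        },
    },
}

def lookup(keyspace, family, row, column):
    """Walk the four levels; return None if any level is missing."""
    return keyspace.get(family, {}).get(row, {}).get(column)

print(lookup(keyspace, "Inbox", "user42", "subject"))
```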