Title 1

Bigtable: A Distributed Storage System for Structured Data

Google’s NoSQL Solution

2013/4/1

Chao Wang [email protected]

Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach

Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber

An Example 2

Webtable Example

How many web pages are there?
Recently Google reported finding 1 trillion unique URLs, which would require 80 terabytes to store.

How much storage is required to hold a single snapshot of the Web?
1 trillion web pages at 100K bytes per page requires 100 petabytes.

How is the data stored in Bigtable?
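The storage figures on this slide can be checked with a few lines of arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 PB = 10^15 bytes) and reads "100K bytes" as 100,000 bytes; the variable names are illustrative only.

```python
# Back-of-the-envelope check of the storage figures quoted on this slide.
# Assumption: decimal units, i.e. 1 TB = 10**12 bytes and 1 PB = 10**15 bytes.
TB = 10**12
PB = 10**15

num_urls = 10**12                # 1 trillion unique URLs
url_store_bytes = 80 * TB        # storage quoted for the URLs alone
bytes_per_url = url_store_bytes // num_urls

page_size_bytes = 100 * 10**3    # "100K bytes" per page, read as 100,000 bytes
snapshot_bytes = num_urls * page_size_bytes
snapshot_pb = snapshot_bytes // PB

print(bytes_per_url)   # 80
print(snapshot_pb)     # 100
```

Under these assumptions, the 80 terabytes correspond to an average of 80 bytes per URL, and a single snapshot comes to 100 petabytes, which matches the slide.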

Introduction 3

Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers.

Many projects at Google store data in Bigtable, including web indexing, Google Analytics, Google Finance, Orkut, Personalized Search, Writely, and Google Earth.

Bigtable has achieved several goals: wide applicability, scalability, high performance, and high availability.

Data Model 4

A Bigtable is a sparse, distributed, persistent multidimensional sorted map. The map is indexed by a row key, a column key, and a timestamp; each value in the map is an uninterpreted array of bytes.
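To make the "sorted map" idea concrete, here is a minimal in-memory sketch of a map from (row key, column key, timestamp) to bytes. It is a toy model, not Bigtable's implementation: the class name `TinyTable` and its methods are hypothetical, and distribution and persistence are not modeled at all.

```python
from bisect import insort


class TinyTable:
    """Toy model of a sparse, sorted (row, column, timestamp) -> bytes map."""

    def __init__(self):
        # Sorted list of (row, column, -timestamp, value) tuples. Negating the
        # timestamp makes newer versions of a cell sort before older ones.
        self._cells = []

    def put(self, row: str, column: str, timestamp: int, value: bytes) -> None:
        # Only cells that are written take up space, hence "sparse".
        insort(self._cells, (row, column, -timestamp, value))

    def get(self, row: str, column: str):
        """Return the most recent value of a cell, or None if it is absent."""
        for r, c, _neg_ts, value in self._cells:
            if (r, c) == (row, column):
                return value  # first match is the newest version
        return None

    def rows(self):
        """Return the distinct row keys in lexicographic order."""
        seen = []
        for r, *_rest in self._cells:
            if not seen or seen[-1] != r:
                seen.append(r)
        return seen


table = TinyTable()
table.put("com.google.maps/index.html", "contents:", 1, b"<html>v1</html>")
table.put("com.google.maps/index.html", "contents:", 2, b"<html>v2</html>")
table.put("com.cnn.www", "language:", 1, b"EN")
```

The linear scan in `get` keeps the sketch short; a real store would use an index or binary search instead.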

Data Model 5

Rows
The row keys in a table are arbitrary strings (currently up to 64KB in size, although 10-100 bytes is a typical size for most of our users). Bigtable maintains data in lexicographic order by row key. The row range for a table is dynamically partitioned. Each row range is called a tablet, which is the unit of distribution and load balancing.

Column Families
Column keys are grouped into sets called column families, which form the basic unit of access control. A column key is named using the following syntax: family:qualifier. Column family names must be printable, but qualifiers may be arbitrary strings.

Timestamps
Each cell in a Bigtable can contain multiple versions of the same data; these versions are indexed by timestamp. Bigtable timestamps are 64-bit integers. They can be assigned by Bigtable, in which case they represent "real time" in microseconds, or be explicitly assigned by client applications.
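The family:qualifier syntax and the microsecond timestamps can be illustrated with two small helpers. This is a sketch under assumptions: `parse_column_key` and `now_micros` are hypothetical names, and splitting at the first colon is one plausible reading of the syntax rather than a documented rule.

```python
import time


def parse_column_key(column_key: str) -> tuple[str, str]:
    """Split 'family:qualifier' at the first colon (assumed convention)."""
    family, sep, qualifier = column_key.partition(":")
    if not sep or not family:
        raise ValueError(f"expected 'family:qualifier', got {column_key!r}")
    if not family.isprintable():
        raise ValueError("column family names must be printable")
    # The qualifier may be any string, including the empty string.
    return family, qualifier


def now_micros() -> int:
    """Current wall-clock time in microseconds, as a server-assigned timestamp might be."""
    return time.time_ns() // 1000


family, qualifier = parse_column_key("contents:")
timestamp = now_micros()
```

An empty qualifier is allowed here so that a key such as `contents:` parses cleanly.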

An Example 6

Webtable Example

Rows
In Webtable, pages in the same domain are grouped together into contiguous rows by reversing the hostname components of the URLs. For example, we store data for maps.google.com/index.html under the key com.google.maps/index.html. Storing pages from the same domain near each other makes some host and domain analyses more efficient.

Column Families
An example column family for the Webtable is language, which stores the language in which a web page was written. We use only one column key in the language family, and it stores each web page's language ID.

Timestamps
In our Webtable example, we set the timestamps of the crawled pages stored in the contents: column to the times at which these page versions were actually crawled. The garbage-collection mechanism lets us keep only the most recent few versions of every page, where we specify how many.
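The row-key reversal and the "keep only the newest versions" rule described above can be sketched as follows. Both function names are hypothetical, the URL is assumed to have no scheme prefix (as in the slide's example), and the version count of 3 is only an example value.

```python
def webtable_row_key(url: str) -> str:
    """Reverse the hostname components, leaving the path untouched."""
    host, slash, path = url.partition("/")
    reversed_host = ".".join(reversed(host.split(".")))
    return reversed_host + slash + path


def keep_latest_versions(versions: dict[int, bytes], n: int) -> dict[int, bytes]:
    """Keep only the n versions with the largest timestamps."""
    newest = sorted(versions, reverse=True)[:n]
    return {ts: versions[ts] for ts in newest}


urls = ["maps.google.com/index.html", "www.cnn.com/", "www.google.com/"]
row_keys = sorted(webtable_row_key(u) for u in urls)

crawls = {1: b"v1", 2: b"v2", 3: b"v3", 4: b"v4"}
kept = keep_latest_versions(crawls, 3)
```

After sorting, the two google.com pages sit next to each other, which is the locality the reversed key is meant to provide.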

API 7

The Bigtable API provides functions for creating and deleting tables and column families. It also provides functions for changing cluster, table, and column family metadata, such as access control rights.

API 8

Bigtable supports single-row transactions, which can be used to perform atomic read-modify-write sequences on data stored under a single row key.

Bigtable allows cells to be used as integer counters.
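A single-row read-modify-write and a counter cell can be imitated in memory with one lock per row. This is only an analogy under stated assumptions: `RowStore`, its methods, and the `stats:hits` column are hypothetical, and nothing here reflects how Bigtable itself implements row-level atomicity.

```python
import threading
from collections import defaultdict


class RowStore:
    """Toy store in which every update to a row is atomic with respect to that row."""

    def __init__(self):
        self._rows = defaultdict(dict)
        self._row_locks = defaultdict(threading.Lock)
        self._guard = threading.Lock()  # protects the two dictionaries above

    def _row_and_lock(self, row_key):
        with self._guard:
            return self._rows[row_key], self._row_locks[row_key]

    def read(self, row_key, column):
        row, lock = self._row_and_lock(row_key)
        with lock:
            return row.get(column)

    def read_modify_write(self, row_key, column, update):
        """Apply update(old_value) to one cell while holding the row's lock."""
        row, lock = self._row_and_lock(row_key)
        with lock:
            new_value = update(row.get(column))
            row[column] = new_value
            return new_value

    def increment(self, row_key, column, delta=1):
        """Use a cell as an integer counter."""
        return self.read_modify_write(
            row_key, column, lambda old: (old or 0) + delta
        )


store = RowStore()
store.increment("com.cnn.www", "stats:hits")
store.increment("com.cnn.www", "stats:hits", delta=4)
```

Because the lock is per row, updates to different rows do not block each other, while concurrent increments of the same counter are never lost.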

Bigtable supports the execution of client-supplied scripts in the address spaces of the servers. The scripts are written in a language developed at Google for processing data called Sawzall.

Bigtable can be used with MapReduce, a framework for running large-scale parallel computations developed at Google.

Building Blocks 9

Bigtable is built on several other pieces of Google infrastructure. Bigtable uses the distributed Google File System (GFS) to store log and data files.

The Google SSTable file format is used internally to store Bigtable data. An SSTable provides a persistent, ordered immutable map from keys to values, where both keys and values are arbitrary byte strings.
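The "ordered immutable map" behaviour of an SSTable can be sketched with sorted tuples and binary search. The class `SSTableSketch` is hypothetical and lives purely in memory; the real file format and its on-disk persistence are not modeled.

```python
from bisect import bisect_left


class SSTableSketch:
    """In-memory stand-in for an ordered, immutable map from bytes to bytes."""

    def __init__(self, items: dict[bytes, bytes]):
        pairs = sorted(items.items())
        # Tuples cannot be modified after construction, mirroring immutability.
        self._keys = tuple(k for k, _ in pairs)
        self._values = tuple(v for _, v in pairs)

    def get(self, key: bytes):
        """Look up a single key with binary search; None if it is absent."""
        i = bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._values[i]
        return None

    def scan(self, start: bytes, end: bytes):
        """Yield (key, value) pairs with start <= key < end, in key order."""
        i = bisect_left(self._keys, start)
        while i < len(self._keys) and self._keys[i] < end:
            yield self._keys[i], self._values[i]
            i += 1


sstable = SSTableSketch({b"b": b"2", b"a": b"1", b"c": b"3"})
in_range = list(sstable.scan(b"a", b"c"))
```

Keeping the keys sorted is what makes both point lookups and range scans cheap.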

Bigtable relies on a highly-available and persistent distributed lock service called Chubby. Bigtable uses Chubby for a variety of tasks: to ensure that there is at most one active master at any time; to store the bootstrap location of Bigtable data; to discover tablet servers and finalize tablet server deaths; to store Bigtable schema information (the column family information for each table); and to store access control lists.

Implementation 10

The Bigtable implementation has three major components: a library that is linked into every client, one master server, and many tablet servers.

The master is responsible for assigning tablets to tablet servers, detecting the addition and expiration of tablet servers, balancing tablet-server load, and garbage collection of files in GFS.

Each tablet server manages a set of tablets. The tablet server handles read and write requests to the tablets that it has loaded, and also splits tablets that have grown too large.

As with many single-master distributed storage systems, client data does not move through the master: clients communicate directly with tablet servers for reads and writes.

Implementation 11

Tablet Location
Bigtable uses a three-level hierarchy analogous to that of a B+-tree to store tablet location information.
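A lookup through such a hierarchy can be sketched as a chain of sorted indexes. The sketch assumes the layout described in the Bigtable paper (a Chubby file points to a root tablet, the root tablet points to METADATA tablets, and those point to user tablets) and that each index entry is keyed by the last row key of the range it covers. All server names, row ranges, and function names below are made up for illustration.

```python
from bisect import bisect_left

LAST = "\uffff"  # sentinel end key for the final row range (an assumption of this sketch)


def find_entry(index, row_key):
    """Return the target of the first entry whose end row key is >= row_key.

    index is a list of (end_row_key, target) pairs sorted by end_row_key.
    """
    end_keys = [end for end, _ in index]
    i = bisect_left(end_keys, row_key)
    if i == len(index):
        raise KeyError(row_key)
    return index[i][1]


# METADATA tablets map user-table row ranges to tablet servers.
metadata_tablet_a = [
    ("com.cnn.www", "tabletserver-1"),
    ("com.google.maps/index.html", "tabletserver-2"),
]
metadata_tablet_b = [
    ("org.example/", "tabletserver-3"),
    (LAST, "tabletserver-4"),
]

# The root tablet maps row ranges to METADATA tablets.
root_tablet = [
    ("com.google.maps/index.html", metadata_tablet_a),
    (LAST, metadata_tablet_b),
]

# The Chubby file stores where the root tablet lives.
chubby_file = {"root_tablet": root_tablet}


def locate(row_key: str) -> str:
    """Walk Chubby file -> root tablet -> METADATA tablet -> tablet server."""
    root = chubby_file["root_tablet"]
    metadata_tablet = find_entry(root, row_key)
    return find_entry(metadata_tablet, row_key)


server = locate("com.google.maps/index.html")
```

Each step narrows the search to one entry of the next level, which is the B+-tree-like behaviour the slide refers to.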

Tablet Assignment
Each tablet is assigned to one tablet server at a time. Bigtable uses Chubby to keep track of tablet servers.

Implementation 12

Tablet Serving

Compactions
As write operations execute, the size of the memtable increases. When the memtable size reaches a threshold, the memtable is frozen, a new memtable is created, and the frozen memtable is converted to an SSTable and written to GFS. This process is called a minor compaction. A merging compaction that rewrites all SSTables into exactly one SSTable is called a major compaction.
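The compaction flow above can be sketched with a dictionary as the memtable and sorted tuples as SSTables. This is a simplified model under assumptions: the class `TabletSketch` is hypothetical, the threshold counts entries rather than bytes, deletions are ignored, and GFS is replaced by an in-memory list.

```python
class TabletSketch:
    """Toy tablet: a mutable memtable plus a list of immutable SSTables."""

    def __init__(self, memtable_limit: int = 2):
        self.memtable_limit = memtable_limit
        self.memtable = {}
        self.sstables = []  # oldest first; each is a tuple of sorted (key, value) pairs

    def write(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.minor_compaction()

    def minor_compaction(self):
        """Freeze the memtable, start a new one, and store the frozen one as an SSTable."""
        if not self.memtable:
            return
        frozen, self.memtable = self.memtable, {}
        self.sstables.append(tuple(sorted(frozen.items())))

    def read(self, key):
        """Check the memtable first, then SSTables from newest to oldest."""
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):
            for k, v in sstable:
                if k == key:
                    return v
        return None

    def major_compaction(self):
        """Rewrite all SSTables into exactly one, letting newer values win."""
        merged = {}
        for sstable in self.sstables:  # oldest first, so later updates overwrite
            merged.update(sstable)
        self.sstables = [tuple(sorted(merged.items()))] if merged else []


tablet = TabletSketch(memtable_limit=2)
for key, value in [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("d", 5)]:
    tablet.write(key, value)
```

In this run the first four writes trigger two minor compactions, and the last write stays in the memtable until the next threshold is reached.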

Refinements 13

Locality groups
Clients can group multiple column families together into a locality group. A separate SSTable is generated for each locality group in each tablet. Segregating column families that are not typically accessed together into separate locality groups enables more efficient reads.

Compression
Clients can control whether or not the SSTables for a locality group are compressed, and if so, which compression format is used. The user-specified compression format is applied to each SSTable block.

Refinements 14

Caching for read performance
To improve read performance, tablet servers use two levels of caching. The Scan Cache is a higher-level cache that caches the key-value pairs returned by the SSTable interface to the tablet server code. The Block Cache is a lower-level cache that caches SSTable blocks that were read from GFS.

Bloom filters
A Bloom filter allows us to ask whether an SSTable might contain any data for a specified row/column pair. For certain applications, a small amount of tablet server memory used for storing Bloom filters drastically reduces the number of disk seeks required for read operations.

Commit-log implementation
Using one commit log per tablet server, rather than one per tablet, provides significant performance benefits during normal operation, but it complicates recovery.
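The "might contain" question that a Bloom filter answers can be sketched in a few lines. This is a generic Bloom filter, not Bigtable's: the bit-array size, the number of hash functions, the use of SHA-256, and the way a row/column pair is encoded into bytes are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: bytes):
        # Derive several bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: bytes) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: bytes) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


def pair_key(row: str, column: str) -> bytes:
    """Encode a row/column pair as one byte string (hypothetical encoding)."""
    return f"{row}\x00{column}".encode()


bloom = BloomFilter()
bloom.add(pair_key("com.google.maps/index.html", "contents:"))
```

If `might_contain` returns False, the SSTable can be skipped without touching disk; if it returns True, the SSTable still has to be read, because the answer may be a false positive.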

Refinements 15

Speeding up tablet recovery
If the master moves a tablet from one tablet server to another, the source tablet server first does a minor compaction on that tablet. After finishing this compaction, the tablet server stops serving the tablet. Before it actually unloads the tablet, the tablet server does another (usually very fast) minor compaction to eliminate any remaining uncompacted state in the tablet server's log that arrived while the first minor compaction was being performed. After this second minor compaction is complete, the tablet can be loaded on another tablet server without requiring any recovery of log entries.

Exploiting immutability
Besides the SSTable caches, various other parts of the Bigtable system have been simplified by the fact that all of the SSTables that we generate are immutable.

Pros 16

The paper introduces the structure and function of Bigtable comprehensively.

It discusses how Bigtable meets different requirements.

It shares the experience gained while designing Bigtable.

Cons 17

According to Professor Eric Brewer's CAP theorem, a distributed system cannot provide consistency, availability, and partition tolerance all at the same time. Bigtable is treated here as a typical AP database, so consistency is its weak point, yet the paper does not discuss it.
