
Big Table: Distributed Storage System For Structured Data

Page 1: Big Table: Distributed Storage System For Structured Data

Dennis Kafura – CS5204 – Operating Systems

Big Table: Distributed Storage System For Structured Data

Sergejs Melderis

Page 2: Big Table: Distributed Storage System For Structured Data


Unstructured Data vs. Structured Data

Unstructured data refers to computerized information that does not have a data model, for example plain text or audio.

Structured data can be described by a data model:
  • Flat
  • Hierarchical
  • Network
  • Relational
  • Dimensional
  • Object-relational

Page 3: Big Table: Distributed Storage System For Structured Data


Relational Model and RDBMS

  • most popular model of organizing structured data
  • model based on first-order predicate logic
  • provides a declarative method for specifying data and queries via SQL
  • data is organized in tables of fixed-length records
  • variety of open source and commercial implementations
  • provides ACID properties

Page 4: Big Table: Distributed Storage System For Structured Data


NoSQL

  • not a relational database
  • no fixed table schemas
  • no join operations
  • no SQL
  • flexible data model, or no data model at all
  • usually does not provide ACID properties
  • scales horizontally

Page 5: Big Table: Distributed Storage System For Structured Data


BigTable

  • distributed, high-performance, fault-tolerant NoSQL storage system built on top of the Google File System
  • designed to scale to a very large size on low-cost commodity hardware
  • designed by Google and used in various projects (for example, web indexing)
  • the paper was published in 2006
  • related implementations:
      • HBase
      • Hypertable
      • Apache Cassandra
      • Neptune

Page 6: Big Table: Distributed Storage System For Structured Data


BigTable Data Model

  • sparse, distributed, persistent multi-dimensional sorted map
  • the map is indexed by a row key, column family, column key, and a timestamp

{ row : {
    column_family : {
      column : { timestamp : value }
    }
  }
}
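A minimal Python sketch of the nested map above, assuming a plain in-memory dictionary. The helper names `make_table`, `put`, and `get_latest` are hypothetical and are not part of Bigtable's API; the sample values are illustrative only.

```python
from collections import defaultdict


def make_table():
    # row -> column_family -> column -> {timestamp: value}
    return defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))


def put(table, row, family, column, timestamp, value):
    """Store one versioned cell."""
    table[row][family][column][timestamp] = value


def get_latest(table, row, family, column):
    """Return the value with the largest timestamp, or None if the cell is empty."""
    # .get() is used so that a lookup never creates empty rows as a side effect.
    versions = table.get(row, {}).get(family, {}).get(column, {})
    if not versions:
        return None
    return versions[max(versions)]


webtable = make_table()
put(webtable, "com.cnn.www", "contents", "", 5, "<html>v1")
put(webtable, "com.cnn.www", "contents", "", 6, "<html>v2")
put(webtable, "com.cnn.www", "anchor", "cnnsi.com", 9, "CNN")
```

The map is sparse in the sense that a row only holds the columns that were actually written.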

Page 7: Big Table: Distributed Storage System For Structured Data


Webtable

Row key: “com.cnn.www”

Column key            Timestamp   Value
“contents”            t6          “<html>...”
“anchor:cnnsi.com”    t9          “CNN”
“anchor:my.look.ca”   t9          “CNN.com”

Page 8: Big Table: Distributed Storage System For Structured Data


Relational Data Model

Student
  • student_id (PK)
  • first_name
  • last_name
  • birthday
  • major
  • academic_level

Course
  • crn (PK)
  • course
  • title
  • type
  • instructor_id
  • seats

StudentCourse
  • student_id
  • crn

Page 9: Big Table: Distributed Storage System For Structured Data


Student table

Row key: student_id

Column family   Column qualifiers
info            last_name, first_name, birthday, major, academic_level
courses         <crn>

Page 10: Big Table: Distributed Storage System For Structured Data


Course table

Row key: crn

Column family   Column qualifiers
info            course, title, type, instructor_id, seats
students        <student_id>

Page 11: Big Table: Distributed Storage System For Structured Data


Example

Student row, row key “905514”

Column key          Value
info:first_name     “Sergejs”
info:last_name      “Melderis”
info:major          “Computer Science”
courses:96322       “YES”
courses:96320       “NO”

Course row, row key “96322”

Column key            Value
info:course           “CS5204”
info:title            “Operating Systems”
info:instructor_id    “1983943”
students:905514       “YES”
students:905520       “YES”

Page 12: Big Table: Distributed Storage System For Structured Data


Students data view in JSON

{
  "905514": {
    "info": {
      "first_name": { "t1": "Sergejs" },
      "last_name": { "t1": "Melderis" },
      "major": { "t1": "Comp Science" }
    },
    "courses": {
      "96322": { "t1": "YES" },
      "96320": { "t2": "NO" }
    }
  }
}
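One way to get from the relational Student and StudentCourse tables to a row of this shape, sketched in Python. The function name `denormalize`, the input layout, and the single timestamp label are assumptions made for illustration.

```python
def denormalize(students, enrollments, timestamp="t1"):
    """Build Bigtable-style student rows from relational-style records.

    students:    list of dicts, each with a "student_id" key
    enrollments: list of (student_id, crn) pairs, as in StudentCourse
    """
    rows = {}
    for student in students:
        # Every attribute except the key becomes a column in the "info" family.
        info = {
            name: {timestamp: value}
            for name, value in student.items()
            if name != "student_id"
        }
        rows[student["student_id"]] = {"info": info, "courses": {}}
    for student_id, crn in enrollments:
        # The crn itself is the column qualifier, so no join table is needed.
        rows[student_id]["courses"][crn] = {timestamp: "YES"}
    return rows


students = [
    {"student_id": "905514", "first_name": "Sergejs",
     "last_name": "Melderis", "major": "Comp Science"},
]
enrollments = [("905514", "96322")]
rows = denormalize(students, enrollments)
```

The join table disappears: a student's courses are read from the student's own row.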

Page 13: Big Table: Distributed Storage System For Structured Data


Rows

  • row keys are arbitrary strings up to 64 KB
  • read and write of data under a single row is atomic
  • rows are ordered lexicographically by row key
  • the row range is dynamically partitioned into blocks called tablets
  • tablets are units of distribution and load balancing
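Because rows are sorted, finding the tablet that holds a row key is a binary search over the tablet boundaries. A small Python sketch, assuming each tablet is described by its last (inclusive) row key; the split points and tablet names are made up.

```python
import bisect

# Last row key covered by each tablet, in sorted order.
# The final tablet is open-ended and takes every key past the last boundary.
tablet_end_keys = ["com.cnn.www", "com.google.www", "org.apache.www"]
tablet_names = ["tablet-0", "tablet-1", "tablet-2", "tablet-3"]


def find_tablet(row_key):
    """Return the name of the tablet whose row range contains row_key."""
    index = bisect.bisect_left(tablet_end_keys, row_key)
    return tablet_names[index]
```

Keys that sort close together, such as pages of one domain stored under reversed host names, tend to land in the same tablet.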

Page 14: Big Table: Distributed Storage System For Structured Data


Columns

  • column keys are grouped into column families
  • a column family is the basic unit of access control
  • all data stored in a column family is of the same type
  • the number of column families should be small
  • there can be an unlimited number of columns
  • a column key is named using family:qualifier
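The family:qualifier naming can be illustrated with a short Python sketch. The helper names `split_column_key` and `group_by_family` are hypothetical.

```python
def split_column_key(column_key):
    """Split "family:qualifier" into (family, qualifier).

    A key without a colon is treated as a family with an empty qualifier.
    """
    family, _, qualifier = column_key.partition(":")
    return family, qualifier


def group_by_family(column_keys):
    """Group column keys by family, preserving the order in which they appear."""
    groups = {}
    for key in column_keys:
        family, qualifier = split_column_key(key)
        groups.setdefault(family, []).append(qualifier)
    return groups
```

For example, `group_by_family(["info:major", "courses:96322", "courses:96320"])` collects both course columns under the single `courses` family.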

Page 15: Big Table: Distributed Storage System For Structured Data


Timestamps

  • Bigtable can contain multiple versions of the same data
  • timestamps are 64-bit integers assigned by Bigtable or by the client
  • the client can specify that only the last n versions of data be kept
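The "keep the last n versions" setting can be sketched as a function over one cell's versions. This is an illustrative Python sketch, not Bigtable's garbage-collection code; `keep_last_n` is a hypothetical name.

```python
def keep_last_n(versions, n):
    """Return a copy of {timestamp: value} holding only the n newest versions."""
    newest = sorted(versions, reverse=True)[:n]
    return {timestamp: versions[timestamp] for timestamp in newest}


cell = {3: "first crawl", 5: "second crawl", 6: "third crawl"}
trimmed = keep_last_n(cell, 2)
```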

Page 16: Big Table: Distributed Storage System For Structured Data


Implementation

  • client library
  • one master server
  • distributed lock service called Chubby
  • many tablet servers, each containing several tablets
  • a tablet server
      • handles read and write requests
      • automatically splits tablets that have grown too large (100-200 MB)
  • client data goes directly to the tablet servers

Page 17: Big Table: Distributed Storage System For Structured Data


Tablet Location

  • three-level hierarchy to store tablet locations
  • the first level is stored in the lock service and points to the root tablet
  • the root tablet contains the locations of the METADATA tablets
  • METADATA tablets contain the locations of user tablets

Lock Service → Root tablet → METADATA tablets → UserTable1, UserTable2
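The three-level lookup can be sketched in Python as three chained searches. Everything concrete here is an assumption made for illustration: the `table:row_key` key encoding, the tablet and server names, and the use of `"\xff"` as an end-of-table marker.

```python
import bisect


def lookup(index, key):
    """index is a sorted list of (end_key, target) pairs.

    Returns the target of the first entry whose end_key is >= key.
    """
    end_keys = [end_key for end_key, _ in index]
    return index[bisect.bisect_left(end_keys, key)][1]


# Level 1: the lock service holds the location of the root tablet.
lock_service = {"root_tablet_location": "root"}

# Level 2: the root tablet maps key ranges to METADATA tablets.
# Level 3: each METADATA tablet maps key ranges to the servers of user tablets.
tablets = {
    "root": [("UserTable1\xff", "meta-1"), ("UserTable2\xff", "meta-2")],
    "meta-1": [("UserTable1:m", "server-a"), ("UserTable1\xff", "server-b")],
    "meta-2": [("UserTable2\xff", "server-c")],
}


def locate(table, row_key):
    """Follow lock service -> root tablet -> METADATA tablet -> tablet server."""
    key = f"{table}:{row_key}"
    root = tablets[lock_service["root_tablet_location"]]
    metadata_tablet = tablets[lookup(root, key)]
    return lookup(metadata_tablet, key)
```

A client would normally cache the result so that most reads skip all three levels.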

Page 18: Big Table: Distributed Storage System For Structured Data


Distribution of data

  • one master server
  • Chubby distributed lock service
  • hundreds or thousands of tablet servers
  • each tablet contains a contiguous range of rows
  • the master distributes tablets across the tablet servers
  • each tablet server holds tablets with different row ranges

Page 19: Big Table: Distributed Storage System For Structured Data


Tablet Representation

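[Figure: tablet representation. Memory holds the memtable; GFS holds the tablet log and the SSTable files. A write op is recorded in the tablet log and applied to the memtable; a read op is served from the memtable and the SSTables.]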
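The read and write paths in the figure can be sketched in Python as follows. This is a simplified in-memory model: the class name `Tablet` and its fields are hypothetical, and the log and SSTables, which Bigtable keeps in GFS, are plain Python objects here.

```python
class Tablet:
    def __init__(self):
        self.log = []        # commit log of (key, value) records
        self.memtable = {}   # recent writes, held in memory
        self.sstables = []   # immutable key/value dicts, oldest first

    def write(self, key, value):
        # Log first so the write can be replayed after a crash,
        # then apply it to the memtable.
        self.log.append((key, value))
        self.memtable[key] = value

    def read(self, key):
        # Newest data wins: check the memtable, then SSTables newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):
            if key in sstable:
                return sstable[key]
        return None
```

Writes never modify an SSTable, which is why a read has to merge the memtable with every SSTable that might hold the key.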

Page 20: Big Table: Distributed Storage System For Structured Data


Compactions

  • compaction is the process of writing the memtable out to an SSTable
  • minor compaction writes the memtable to a new SSTable
      • shrinks the memory usage of the tablet server
      • reduces the amount of commit log that must be read during recovery
  • merging compaction merges several SSTables into a new SSTable
  • major compaction rewrites all SSTables into exactly one SSTable
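The three kinds of compaction can be sketched in Python with dictionaries standing in for SSTables. This is a simplification: the function names are hypothetical, deletion markers are ignored, and the merging compaction here only touches SSTables.

```python
def minor_compaction(memtable, sstables):
    """Write the memtable out as a new SSTable and return an empty memtable."""
    if memtable:
        sstables.append(dict(sorted(memtable.items())))
    return {}


def merge(sstables):
    """Merge SSTables given oldest first; newer values overwrite older ones."""
    merged = {}
    for sstable in sstables:
        merged.update(sstable)
    return dict(sorted(merged.items()))


def merging_compaction(sstables, count):
    """Replace the `count` newest SSTables with one merged SSTable."""
    if count < 1:
        raise ValueError("count must be at least 1")
    return sstables[:-count] + [merge(sstables[-count:])]


def major_compaction(sstables):
    """Rewrite all SSTables into exactly one SSTable."""
    return [merge(sstables)]
```

A read after a major compaction has to consult only the memtable and one SSTable.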

Page 21: Big Table: Distributed Storage System For Structured Data


API

  • create and delete tables and column families
  • write or delete values
  • look up values from individual rows
  • scan over a subset of the data in a table
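A toy Python model of these operations, to show how they fit together. It is not the Bigtable client API: the class `Table` and its method names are hypothetical, and versions and timestamps are left out to keep it short.

```python
class Table:
    def __init__(self, families):
        self.families = set(families)   # column families are fixed up front
        self.rows = {}                  # row key -> {column key: value}

    def write(self, row, column, value):
        family = column.split(":", 1)[0]
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows.setdefault(row, {})[column] = value

    def delete(self, row, column):
        self.rows.get(row, {}).pop(column, None)

    def lookup(self, row):
        """Return a copy of all columns stored under one row key."""
        return dict(self.rows.get(row, {}))

    def scan(self, start, end):
        """Yield (row, columns) for start <= row < end, in sorted row order."""
        for row in sorted(self.rows):
            if start <= row < end:
                yield row, dict(self.rows[row])


students = Table(["info", "courses"])
students.write("905514", "info:last_name", "Melderis")
students.write("905514", "courses:96322", "YES")
students.write("905520", "info:last_name", "Doe")
students.delete("905514", "courses:96322")
```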
