Anatomy of Google Service Platform
March 2, 2007
Jaesun Han ([email protected]), NexR
Contact: http://www.web2hub.com
Contents
Web 2.0 Technologies & GISP
Google Service Platform
Google File System(GFS)
Bigtable
MapReduce
Chubby
Google Services over Platform
Google Analytics
Google Earth
Personalized Search
Web 2.0 Technology Map
Web 2.0 Technology Layer
[Layer diagram: Web 2.0 technologies organized from client down to platform, over raw data, processed data, DBs, and external data sources]
• Client Layer: XHTML, CSS, Microformats, RIA (Ajax, Flex, XUL, XAML, Gadget)
• Front-End Layer: PHP, Python, Ruby, RoR, Dojo, DWR, Atlas, GWT, Apache, MySQL; RSS, Atom, OpenAPI, REST, JSON, SOAP, Mashup
• Data Processing Layer: Recommendation (Collaborative Filtering), Ranking, Clustering, Data Mining, Personalization, Social Network Analysis
• Platform Layer: Distributed/Parallel Processing, Distributed File System, Distributed Storage, Cluster Management; Cluster Computing, Beowulf, Grid, Globus, Condor, P2P, DHT, MPI, Utility Computing, Virtualization, Autonomous Computing
GISP (Global Internet Service Platform)
Web Technologies: Open Standard, Open Source
Web Services: Integration
Web Business: Globalization
• Client technologies (RIA): Ajax, Flash & Flex, XAML, XUL, SVG, Widgets …
• Server technologies: LAMP, JSP, ASP, Ruby, RoR …
• Content technologies: Blog, Wiki, RSS, Tagging, Podcasting, Mashup …
• Global platform technologies: Development, Deployment, Operation, Management …
GISP (Global Internet Service Platform)
• Development: ease of development, reusability (Internet service technologies, CBD)
• Operation: scalability & robustness, resource control (Grid & Utility Computing)
• Management: automated monitoring, evaluation, and analysis; autonomous handling of problems (Autonomic Computing)
• Deployment: automatic global deployment, easy migration/replication (Virtualization)
Core components: server cluster, distributed file system
Related Project: UCB RAD Lab (http://radlab.cs.berkeley.edu/)
Reliable, Adaptable, Distributed systems
UC Berkeley CS 5-year project launched in 2005
Funded by Google, MS, and Sun with $1.5M per year ($500K each)
The 5-year vision: a single person can invent and run the next revolutionary IT service ("the Fortune 1 million")
Technical goals: open source; systematize the process for developing, assessing, deploying, and operating (DADO) services
Google Service Platform
[Stack diagram: Services on top of a Service Library, on top of the Google OS, running on the Google Cluster]
• System software technologies: Google Linux, Google File System, MapReduce library, Chubby, Bigtable, intelligent systems, programming models (River, TACC), replication/redundancy …
• Hardware technologies: clusters, geographic distribution, automated setup, automated backup, standard components, commodity drives, flexible co-location, easy-access design …
• Service software technologies: search engine, email server, IM server, map database, various web sites …
• 450,000 or more servers (NYT)
• All PC servers less than $1,000
• 40 or more pizza-box servers per rack
Advantages: easy development, scalability, robustness
Principles of Google Platform
Cheap hardware and smart software
• Use cheap commodity hardware → frequent failures; develop smart software to reduce the cost of failure
Easy management
• High scalability through automatic discovery of new servers and racks
• High redundancy against failure of servers, racks, even data centers
Speed, and then more speed
• High speed at low cost (580MB/s read rate at $1,000 vs. 58MB/s at $18,000 for an IBM EXP)
• Rapid development and deployment of new products
Use existing technologies
• Use techniques from the leading edge of computer science
• Use open source code as a starting point
Google's new data center (Project 02)
On the banks of the Columbia River at The Dalles, Oregon,
at the intersection of cheap electricity and readily accessible data networking
A computing center as big as two football fields, with twin cooling plants
cf. MS: 200,000 servers → 800,000 servers (by 2011)
From The New York Times (2006.6.14)
Google Service Platform
• GFS (SOSP 2003): distributed file system
• Bigtable (OSDI 2006): distributed storage system for structured data
• MapReduce (OSDI 2004): distributed data processing library
• Chubby (OSDI 2006): distributed lock manager
[Figure: GFS and Bigtable form the storage layer; MapReduce forms the computation layer]
GFS
GFS: Overview
Scalable distributed file system for large distributed data-intensive applications
Running on inexpensive commodity hardware, delivering high aggregate performance to a large number of clients
Features
• user-level distributed file system
• centralized architecture (metadata: client <-> a single master; data: client <-> chunkservers)
• 64MB fixed large chunk size
• non-standard file system interface (not the POSIX API): create, delete, open, close, read, write, snapshot, and record append
• three replicas of each chunk
• no client data caching (but clients cache metadata such as chunk locations)
GFS: Architecture
[Architecture diagram: clients, a single master, and many chunkservers, with separation of control flow (client <-> master) and data flow (client <-> chunkservers)]
Master metadata (in-memory data structures):
• file and chunk namespaces, and the mapping from files to chunks (persisted in the operation log)
• chunk locations (an in-memory lookup table, rebuilt by asking chunkservers rather than persisted)
Handled by the master: file creation/deletion, file renaming, chunk addition/deletion
Handled by chunkservers: file read/write
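To make the control-flow/data-flow split concrete, here is a minimal, self-contained sketch (my own illustration, not the real GFS client library; every class and method name is hypothetical) of a client read: compute which 64MB chunk holds the byte offset, ask the master for that chunk's handle and replica locations, then fetch the data from a chunkserver.

#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch of the GFS read path: metadata from the master,
// data from a chunkserver. Names and structures are illustrative only.
constexpr uint64_t kChunkSize = 64ULL * 1024 * 1024;  // 64MB fixed chunk size

struct ChunkInfo {
  uint64_t chunk_handle;
  std::vector<std::string> chunkserver_locations;  // three replicas
};

// Stand-in for the single master: maps (file, chunk index) -> chunk info.
class Master {
 public:
  void AddChunk(const std::string& file, uint64_t index, ChunkInfo info) {
    table_[{file, index}] = std::move(info);
  }
  ChunkInfo Lookup(const std::string& file, uint64_t index) const {
    return table_.at({file, index});
  }
 private:
  std::map<std::pair<std::string, uint64_t>, ChunkInfo> table_;
};

// Client-side read: compute the chunk index, get locations from the master
// (control flow), then read from one replica (data flow, simulated here).
std::string Read(const Master& master, const std::string& file,
                 uint64_t offset, uint64_t length) {
  uint64_t chunk_index = offset / kChunkSize;          // which chunk holds offset
  ChunkInfo info = master.Lookup(file, chunk_index);   // metadata from master
  // A real client would now contact one of info.chunkserver_locations
  // and read `length` bytes within the chunk.
  return "read " + std::to_string(length) + " bytes of chunk " +
         std::to_string(info.chunk_handle) + " from " +
         info.chunkserver_locations.front();
}

int main() {
  Master master;
  master.AddChunk("/gfs/test/bigfile", 0,
                  {42, {"chunkserver-a", "chunkserver-b", "chunkserver-c"}});
  std::cout << Read(master, "/gfs/test/bigfile", 1024, 4096) << "\n";
}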
GFS: Write
[Write-flow diagram]
• The master grants a primary lease for the chunk (initial timeout = 60s) and tells the client the primary and replica locations
• Data is delivered to the replicas in a pipelined fashion to fully utilize each machine's network bandwidth
• The primary orders concurrent write requests for the same chunk and forwards that order to the secondary replicas
GFS: Relaxed Consistency
[Figure: clients 2 and 4 concurrently write records A and B to a region spanning chunk1 and chunk2; the replicas of chunk2 may apply the two mutations as B -> A (case 1) or A -> B (case 2)]
• Within each chunk, all replicas apply mutations in the same order, so the chunk region is consistent (identical on every replica)
• Across chunks the chosen orders may differ, so a concurrent write spanning chunks can leave the file region undefined: consistent, but a mix of fragments from different writes
GFS: Atomic Record Appends
[Figure: clients 1 and 2 write records A and B to chunk1; with ordinary writes at a client-specified exact offset, the replicas may serialize them as B -> A (case 1) or A -> B (case 2), and a failed write of B leaves the replicas inconsistent; with record append the outcome is the same on every replica]
• With record append, GFS chooses the offset and appends the record atomically at least once; all replicas of chunk1 place the record at the same offset
• Max record size: 1/4 of the max chunk size (the remainder of a chunk is padded when a record does not fit)
Bigtable
Bigtable: Overview
Motivation
• Lots of structured and semi-structured data: web crawl data, satellite imagery, user data, email, …
• No commercial system big enough
Bigtable
• Distributed storage system for structured data
• A sparse, distributed, persistent, multi-dimensional sorted map
Goals: wide applicability, scalability, high performance, and high availability
Target workloads: from throughput-oriented batch-processing jobs to latency-sensitive serving of data to end users
Applications: more than 60 Google products and projects (Google Analytics, Google Finance, Orkut, Personalized Search, Writely, and Google Earth)
Bigtable: Data Model
[Figure: a table indexed by row key (e.g. com.web2hub.www), column key (column family:qualifier), and timestamp; contiguous row ranges form tablets]
• Indexing: (row:string, column:string, time:int64) → string, e.g. (com.cnn.www, anchor:my.look.ca, t8) → "CNN.com"
• Reads and writes under a single row key are atomic
• Tablets are the unit of distribution and load balancing
• A column family is the basic unit of access control
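The data model can be pictured as nested sorted maps. The sketch below (my own illustration, not Bigtable code) models a table as row → column → timestamp → value, with timestamps ordered newest-first as in Bigtable.

#include <cstdint>
#include <functional>
#include <iostream>
#include <map>
#include <string>

// Illustrative model of the Bigtable data model:
// (row:string, column:string, time:int64) -> string.
// Timestamps are ordered descending so that begin() is the newest version.
using VersionMap = std::map<int64_t, std::string, std::greater<int64_t>>;
using ColumnMap = std::map<std::string, VersionMap>;  // "family:qualifier" -> versions
using Table = std::map<std::string, ColumnMap>;       // sorted by row key

int main() {
  Table webtable;
  // In Bigtable, all writes under one row key ("com.cnn.www") would be atomic.
  webtable["com.cnn.www"]["anchor:my.look.ca"][8] = "CNN.com";
  webtable["com.cnn.www"]["anchor:my.look.ca"][3] = "CNN";
  webtable["com.cnn.www"]["contents:"][6] = "<html>...</html>";

  // Look up the most recent version of one cell.
  const VersionMap& versions = webtable["com.cnn.www"]["anchor:my.look.ca"];
  std::cout << "latest anchor:my.look.ca = " << versions.begin()->second
            << " (t" << versions.begin()->first << ")\n";
}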
Bigtable: API
Metadata operations: create/delete tables and column families, change metadata
Other API features: single-row transactions (atomic read-modify-write sequences); execution of client-supplied scripts (written in Sawzall)

Writing to Bigtable:
// Open the table
Table *T = OpenOrDie("/bigtable/web/webtable");
// Write a new anchor and delete an old anchor
RowMutation r1(T, "com.cnn.www");
r1.Set("anchor:www.c-span.org", "CNN");
r1.Delete("anchor:www.abc.com");
Operation op;
Apply(&op, &r1);

Reading from Bigtable:
Scanner scanner(T);
ScanStream *stream;
stream = scanner.FetchColumnFamily("anchor");
stream->SetReturnAllVersions();
scanner.Lookup("com.cnn.www");
for (; !stream->Done(); stream->Next()) {
  printf("%s %s %lld %s\n",
         scanner.RowName(),
         stream->ColumnName(),
         stream->MicroTimestamp(),
         stream->Value());
}
Bigtable: SSTable
• Used internally to store Bigtable data
• Immutable, sorted file of key-value pairs: data blocks plus a block index
• block size is 64KB, but configurable
• the index is used to locate blocks and is loaded into memory when the SSTable is opened
[Figure: an SSTable laid out as a sequence of 64KB key-value blocks followed by the index]
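As a rough illustration of that layout (a toy, not the real SSTable file format), the sketch below keeps sorted key-value pairs in fixed-size blocks plus an index of each block's first key, and answers a lookup by binary-searching the index and scanning one block.

#include <algorithm>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

// Toy in-memory stand-in for an SSTable: sorted key-value pairs grouped into
// fixed-size blocks, plus an index mapping each block's first key to the block.
class ToySSTable {
 public:
  // Input must already be sorted by key; real SSTables are written that way.
  ToySSTable(const std::vector<std::pair<std::string, std::string>>& sorted_kv,
             size_t entries_per_block) {
    for (size_t i = 0; i < sorted_kv.size(); i += entries_per_block) {
      size_t end = std::min(i + entries_per_block, sorted_kv.size());
      blocks_.emplace_back(sorted_kv.begin() + i, sorted_kv.begin() + end);
      index_.push_back(sorted_kv[i].first);  // first key of the block
    }
  }

  // Lookup: binary-search the (in-memory) index, then scan a single block.
  bool Get(const std::string& key, std::string* value) const {
    auto it = std::upper_bound(index_.begin(), index_.end(), key);
    if (it == index_.begin()) return false;  // key precedes every block
    size_t block = (it - index_.begin()) - 1;
    for (const auto& kv : blocks_[block]) {
      if (kv.first == key) { *value = kv.second; return true; }
    }
    return false;
  }

 private:
  std::vector<std::vector<std::pair<std::string, std::string>>> blocks_;
  std::vector<std::string> index_;  // loaded into memory when the table opens
};

int main() {
  ToySSTable table({{"a", "1"}, {"b", "2"}, {"c", "3"}, {"d", "4"}}, 2);
  std::string value;
  if (table.Get("c", &value)) std::cout << "c -> " << value << "\n";
}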
Bigtable: Tablet & Locality Group
[Figure: a tablet (100~200MB, e.g. the rows com.cnn.www/abc.html ~ help.html) whose column families (contents, anchor, language, checksum) are grouped into locality groups 1-3; each locality group is stored in its own SSTable (here 100MB, 50MB, and 30MB), and the SSTables are stored as 64MB GFS chunks]
• A locality group holds column families that are typically read together, so reading one group does not touch the SSTables of the others
Bigtable: Tablet Location
Features
• Three-level hierarchy of location tablets (served on tablet servers, not the master)
• The client library caches and prefetches tablet locations
[Figure: each METADATA row maps a row key (the tablet's table id + its end row, e.g. webtable:com.cnn.www) to a tablet location; a 128MB METADATA tablet holds 2^17 rows of ~1KB each, and the hierarchy addresses 2^34 tablets]
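To spell out the arithmetic behind those figures (following the Bigtable paper's estimate): each METADATA row is about 1KB, so one 128MB METADATA tablet holds 128MB / 1KB = 2^27 / 2^10 = 2^17 rows. With one root tablet pointing at up to 2^17 METADATA tablets, each pointing at up to 2^17 user tablets, the three-level scheme addresses 2^17 × 2^17 = 2^34 tablets, or roughly 2^61 bytes at 128MB per tablet.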
Bigtable: Tablet Assignment
[Figure: tablet-server registration and failure handling through Chubby]
1) The cluster management system starts a tablet server (tab_svr10)
2) The tablet server creates a file in a well-known Chubby directory (/servers/tab_svr10)
3) It acquires an exclusive lock on that file
4) The Bigtable master monitors /servers to discover tablet servers
5) The master assigns tablets to the server
6) The master periodically checks the server's lock status
7) If the server fails or loses its lock,
8) the master acquires and then deletes the server's lock file, and
9) reassigns the now-unassigned tablets to other tablet servers
Bigtable: Master Failure
[Figure: master startup and recovery through Chubby]
0) A new master starts
1) It acquires the exclusive master lock in Chubby (/servers/master)
2) It scans /servers in Chubby to get the list of live tablet servers
3) It asks each tablet server which tablets it already has assigned
4) It scans the METADATA tablets to learn the full set of tablets
5) It reassigns any unassigned tablets
Tablet changes: create, delete, and merge are initiated by the master; split is initiated by a tablet server
Bigtable: Read/Write
[Figure: writes go to a memtable (sorted in-memory buffer) and to a single commit log per tablet server; older data sits in SSTables (anchor v1.0 … v4.0); reads see a merged view of all of them]
Example mutations:
t1: Set("anchor:www.c-span.org", "CNN")
t2: Delete("anchor:www.abc.com")
t3: Set("anchor:www.abc.com", "ABC")
After these mutations the memtable holds anchor:www.c-span.org → CNN, plus a deletion entry and the newer value ABC for anchor:www.abc.com
• Fast writes: each mutation is appended to the tablet server's commit log and buffered in the memtable
• Efficient reads: served from a merged view of the memtable and the SSTables, all of which are sorted data structures (see the sketch below)
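A rough sketch of that merged read (my own simplification, not tablet-server code): consult the memtable first, then the SSTables from newest to oldest, honoring deletion markers.

#include <iostream>
#include <map>
#include <optional>
#include <string>
#include <vector>

// Toy merged-view read over a memtable and a stack of immutable SSTables.
// A map entry holding std::nullopt represents a deletion marker.
using KVMap = std::map<std::string, std::optional<std::string>>;

std::optional<std::string> MergedRead(
    const KVMap& memtable,
    const std::vector<KVMap>& sstables_newest_first,  // newest SSTable first
    const std::string& key) {
  // 1) The memtable holds the most recent mutations.
  if (auto it = memtable.find(key); it != memtable.end()) return it->second;
  // 2) Fall back to SSTables, newest first; the first hit wins.
  for (const KVMap& sstable : sstables_newest_first) {
    if (auto it = sstable.find(key); it != sstable.end()) return it->second;
  }
  return std::nullopt;  // never written (or deleted)
}

int main() {
  KVMap memtable = {{"anchor:www.abc.com", "ABC"}};          // t3: Set
  std::vector<KVMap> sstables = {
      {{"anchor:www.abc.com", std::nullopt}},                // t2: Delete
      {{"anchor:www.abc.com", "ABC-old"}, {"anchor:www.c-span.org", "CNN"}}};
  auto v = MergedRead(memtable, sstables, "anchor:www.c-span.org");
  std::cout << (v ? *v : std::string("<deleted or missing>")) << "\n";  // CNN
}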
Bigtable: Compactions
[Figure: the memtable and SSTables v1.0 … v4.0; compactions produce new SSTables v5.0 and v6.0]
• Minor compaction: memtable → a new SSTable
• Major compaction: memtable + all SSTables → only one SSTable
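Continuing the toy representation from the read-path sketch above (again my own simplification), a minor compaction freezes the memtable into a new SSTable, while a major compaction merges everything into a single SSTable and can finally discard deletion markers.

#include <iostream>
#include <map>
#include <optional>
#include <string>
#include <vector>

// Same toy representation as before: nullopt values are deletion markers.
using KVMap = std::map<std::string, std::optional<std::string>>;

// Minor compaction: write the memtable out as the newest SSTable and start
// a fresh, empty memtable.
void MinorCompaction(KVMap* memtable, std::vector<KVMap>* sstables_newest_first) {
  sstables_newest_first->insert(sstables_newest_first->begin(), *memtable);
  memtable->clear();
}

// Major compaction: merge the memtable and all SSTables into one SSTable,
// keeping only the newest version of each key and dropping deletion markers.
KVMap MajorCompaction(const KVMap& memtable,
                      const std::vector<KVMap>& sstables_newest_first) {
  KVMap merged;
  auto absorb = [&merged](const KVMap& source) {
    // map::insert keeps existing (newer) entries, so older values never win.
    merged.insert(source.begin(), source.end());
  };
  absorb(memtable);
  for (const KVMap& sstable : sstables_newest_first) absorb(sstable);
  // Deletion markers can now be discarded: no older data remains below them.
  KVMap result;
  for (const auto& [key, value] : merged) {
    if (value.has_value()) result[key] = value;
  }
  return result;
}

int main() {
  KVMap memtable = {{"a", "new"}, {"b", std::nullopt}};   // b was deleted
  std::vector<KVMap> sstables = {{{"a", "old"}, {"b", "old"}, {"c", "keep"}}};
  MinorCompaction(&memtable, &sstables);                  // memtable -> SSTable
  KVMap one = MajorCompaction(memtable, sstables);        // {a: new, c: keep}
  std::cout << "after major compaction: " << one.size() << " keys\n";
}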
MapReduce
MapReduce: Overview
Motivation
• Input data is large
• Lots of machines: hundreds of thousands of PC servers
MapReduce
• Programming model and implementation for parallel processing of large data sets
• Parallelization, fault-tolerance, data distribution, and load balancing are handled by the MapReduce library
map & reduce functions
• map(k1, v1) → list(k2, v2)
• reduce(k2, list(v2)) → list(v2)
Usage examples: distributed grep, count of URL access frequency, reverse web-link graph, term vector per host, inverted index, distributed sort
MapReduce: Data Processing Flow
MapReduce: Architecture
[Figure: (0) the library splits the input files (stored on GFS); map workers apply (k1,v1) → list(k2,v2); intermediate keys are partitioned across the R reduce tasks by hash(key) mod R; the master is notified of completed map outputs and points reduce workers at them; reduce workers apply (k2, list(v2)) → list(v2) and write the global output files back to GFS; other MapReduce programs share the same cluster]
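The partitioning step is just a deterministic hash of the intermediate key, so every occurrence of a key lands in the same reduce task. A minimal, self-contained sketch (illustrative only, not the Google library):

#include <functional>
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Assign each intermediate key to one of R reduce tasks: hash(key) mod R.
// Every map worker uses the same function, so all values for a given key
// end up in the same reduce partition.
size_t Partition(const std::string& key, size_t num_reduce_tasks) {
  return std::hash<std::string>{}(key) % num_reduce_tasks;
}

int main() {
  const size_t R = 4;
  std::vector<std::pair<std::string, std::string>> intermediate = {
      {"web", "1"}, {"the", "1"}, {"web", "1"}, {"google", "1"}};

  // Group intermediate pairs by partition, as the map side would before
  // writing R partitioned files to local disk.
  std::vector<std::multimap<std::string, std::string>> partitions(R);
  for (const auto& [key, value] : intermediate) {
    partitions[Partition(key, R)].insert({key, value});
  }
  for (size_t r = 0; r < R; ++r) {
    std::cout << "reduce task " << r << " gets " << partitions[r].size()
              << " pairs\n";
  }
}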
MapReduce: Code Example

class WordCounter : public Mapper {
 public:
  virtual void Map(const MapInput& input) {
    const string& text = input.value();
    const int n = text.size();
    for (int i = 0; i < n; ) {
      // Skip past leading whitespace
      while ((i < n) && isspace(text[i])) i++;
      // Find word end
      int start = i;
      while ((i < n) && !isspace(text[i])) i++;
      if (start < i)
        Emit(text.substr(start, i-start), "1");
    }
  }
};
REGISTER_MAPPER(WordCounter);

class Adder : public Reducer {
  virtual void Reduce(ReduceInput* input) {
    // Iterate over all entries with the same key and add the values
    int64 value = 0;
    while (!input->done()) {
      value += StringToInt(input->value());
      input->NextValue();
    }
    // Emit sum for input->key()
    Emit(IntToString(value));
  }
};
REGISTER_REDUCER(Adder);

int main(int argc, char** argv) {
  ParseCommandLineFlags(argc, argv);
  MapReduceSpecification spec;
  // Store the list of input files into "spec"
  for (int i = 1; i < argc; i++) {
    MapReduceInput* input = spec.add_input();
    input->set_format("text");
    input->set_filepattern(argv[i]);
    input->set_mapper_class("WordCounter");
  }
  // Specify the output files
  MapReduceOutput* out = spec.output();
  out->set_filebase("/gfs/test/freq");
  out->set_num_tasks(100);
  out->set_format("text");
  out->set_reducer_class("Adder");
  // Optional: do partial sums within map tasks
  out->set_combiner_class("Adder");
  // Tuning parameters
  spec.set_machines(2000);
  spec.set_map_megabytes(100);
  spec.set_reduce_megabytes(100);
  // Now run it
  MapReduceResult result;
  if (!MapReduce(spec, &result)) abort();
  return 0;
}
MapReduce: Fault-tolerance
Worker failure
• Re-execution of the failed worker's map or reduce tasks
• Completed map tasks (output on local disk) → re-executed
• Completed reduce tasks (output on GFS) → no need for re-execution
Master failure
• Periodic checkpointing of the master data structures → re-execution from the last checkpoint
Semantics in the presence of failure: guarantee atomic commits of map and reduce task outputs
• map output: committed via the master's confirmation
• reduce output: committed via GFS's atomic rename operation (illustrated below)
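As a small aside, the rename idiom itself is easy to show with an ordinary file system (this is a local-filesystem illustration, not GFS): write the task's output to a private temporary file and publish it with one atomic rename, so a re-executed task can never expose a half-written result.

#include <filesystem>
#include <fstream>
#include <iostream>

// Commit idiom relied on by the reduce side: write the task's complete output
// to a private temporary file, then publish it with a single atomic rename.
// (GFS provides the same guarantee for its namespace; this uses the local FS.)
int main() {
  namespace fs = std::filesystem;
  const fs::path tmp = "reduce-task-0.tmp";
  const fs::path final_out = "part-00000";

  {
    std::ofstream out(tmp);
    out << "the\t1024\n" << "web\t512\n";   // the task's complete output
  }                                          // closed (flushed) before commit

  fs::rename(tmp, final_out);                // atomic publish: all or nothing
  std::cout << "committed " << final_out << "\n";
}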
Chubby
Chubby: Overview
Distributed lock service
• Target: loosely-coupled distributed systems (a moderately large number of small machines connected by a high-speed network)
• Goals: reliability, availability, and easy-to-understand semantics; throughput and storage capacity are considered secondary
• Interface similar to a simple file system, but restricted to whole-file reads/writes and augmented with advisory locks and event notifications
Usage in both GFS and Bigtable:
• for master election
• for discovering servers and finding the master
• as a well-known location to store a small amount of metadata
• as the root of their distributed data structures
Chubby: System structure
[Figure: a Chubby cell of several replicas, each holding a simple database kept consistent by a distributed consensus protocol used for master election and database updates; clients obtain the replica list through DNS and talk to the elected master; typical clients include the Bigtable master and tablet servers and the GFS master and chunkservers]
Chubby: Interface
Similar to a file system interface
Example: /ls/datacenter000/servers/svr_10980
• /ls: stands for lock service
• /datacenter000: the Chubby cell's name (/local: the client's local Chubby cell; /global: the global Chubby cell)
• /servers/svr_10980: interpreted within the named Chubby cell
Node (file & directory) metadata
• three ACL file names (for reading, writing, and changing ACL names)
• four monotonically increasing 64-bit numbers: an instance number, a content generation number, a lock generation number, and an ACL generation number
Handles: returned when clients open nodes; include check digits, a sequence number, and mode information
Chubby: Global cell
[Figure: a global cell (/ls/global) and per-datacenter local cells (/ls/cellname); the subtree /ls/global/master is mirrored to the subtree /ls/cell/slave in each local cell]
Stored in the global cell:
• Chubby's own ACLs
• Advertisement of presence to monitoring services
• Pointers allowing clients to locate large data sets such as Bigtable cells
• Many configuration files for other systems
Chubby: API
APIs
• Open(), Close(), Poison()
• GetContentsAndStat(), GetStat(), ReadDir(): contents and metadata are read atomically and in their entirety
• SetContents(), SetACL(): written atomically and in their entirety
• Delete()
• Acquire(), TryAcquire(), Release()
• GetSequencer(), SetSequencer(), CheckSequencer()
Usage example: primary election (sketched below)
• All potential primaries Open() the lock file and try to Acquire() the lock
• The winner (the primary) calls SetContents() to write its identity
• All replicas are notified by an event and call GetContentsAndStat() to learn the primary's identity
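The sketch below spells that election sequence out against an in-memory stand-in for a single Chubby lock file (the class, its method signatures, and the addresses are my own assumptions, not Chubby's real client library).

#include <iostream>
#include <optional>
#include <string>

// In-memory stand-in for one Chubby lock file, exposing the calls named on
// this slide (Acquire/SetContents/GetContentsAndStat). A toy illustration of
// the election sequence only.
class FakeLockFile {
 public:
  bool Acquire(const std::string& client) {       // exclusive advisory lock
    if (holder_) return false;                    // someone already holds it
    holder_ = client;
    return true;
  }
  void SetContents(const std::string& client, const std::string& contents) {
    if (holder_ == client) contents_ = contents;  // only the lock holder writes
  }
  std::string GetContentsAndStat() const { return contents_; }

 private:
  std::optional<std::string> holder_;
  std::string contents_;
};

// One candidate's election attempt: try to grab the lock; the winner records
// its identity in the file so everyone else can discover the primary.
void RunElection(FakeLockFile& lock, const std::string& my_address) {
  if (lock.Acquire(my_address)) {
    lock.SetContents(my_address, my_address);     // "I am the primary"
    std::cout << my_address << ": became primary\n";
  } else {
    // Losers (replicas) would normally learn the result via an event
    // notification; here they simply read the file contents.
    std::cout << my_address << ": replica, primary is "
              << lock.GetContentsAndStat() << "\n";
  }
}

int main() {
  FakeLockFile master_lock;                       // e.g. a master lock file
  RunElection(master_lock, "server-1:7000");      // wins the lock
  RunElection(master_lock, "server-2:7000");      // becomes a replica
}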
Chubby: Database & Backup
Database implementation
• The first version: a replicated version of Berkeley DB
• Now: a simple database written in-house, with write-ahead logging, snapshotting, and atomic operations
Backup
• Every few hours, the master writes a snapshot of its DB to a GFS file server in a different building
• Uses: disaster recovery; initializing the DB of a newly replaced replica
Overall View
Bird's View Revisited
[Stack diagram: Google OS (Linux file system, multithreading) at the bottom; GFS, Bigtable, the Workqueue (scheduler), MapReduce, and Chubby layered on top, alongside the cluster management system and a DB; client interfaces in C++, Sawzall, Java, and Python serve batch clients and runtime clients]
Server Process View
[Figure: the processes on a single server and the cluster-wide roles. Each server runs a GFS chunkserver holding chunks, a Bigtable tablet server serving tablets (whose locality-group SSTables LG1-LG3 live in GFS), a local scheduler, and a pool of MapReduce map/reduce workers (M, M, R). Cluster-wide: the GFS master (Storage: GFS), the Bigtable master (Database: Bigtable), a global scheduler (Computation: MapReduce), the cluster management system, and a Chubby cell]
Google Services over Google Platform
Google Analytics
Embedded JavaScript:
<script src="http://www.google-analytics.com/urchin.js" type="text/javascript"></script>
<script type="text/javascript">
_uacct = "xxxxxxxxxx";
urchinTracker();
</script>
Google Analytics
[Figure: the Bigtable tables and the MapReduce job behind Analytics]
• Raw click table (~200TB): row key = tuple (website URL, time), e.g. com.abc.www:0001 … com.abc.www:0050; the value is each session's info; tablets such as com.abc.www (a.html ~ o.html) are stored as SSTables (GFS files)
• Summary table (~20TB): row key = the website's URL (e.g. com.abc.www); the value is the analyzed info; likewise stored as SSTables (GFS files)
• A MapReduce job analyzes the raw click table (map: emit key = website's URL, value = analyzed session info) and aggregates the results (reduce) into the summary table
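What that raw-click → summary aggregation computes can be sketched with a plain in-memory map (a toy model of the map and reduce steps; the Session and Summary fields are made up for illustration):

#include <iostream>
#include <map>
#include <string>
#include <vector>

// Toy model of the Analytics aggregation performed by the MapReduce job:
// raw click rows keyed by (website URL, time) are reduced to one summary row
// per website. Field names (page_views, sessions) are hypothetical.
struct Session { std::string website; int page_views; };
struct Summary { int sessions = 0; int page_views = 0; };

std::map<std::string, Summary> Aggregate(const std::vector<Session>& raw_clicks) {
  std::map<std::string, Summary> summary_table;   // row key = website URL
  for (const Session& s : raw_clicks) {           // "map": emit (website, info)
    Summary& row = summary_table[s.website];      // "reduce": aggregate per key
    row.sessions += 1;
    row.page_views += s.page_views;
  }
  return summary_table;
}

int main() {
  std::vector<Session> raw = {{"com.abc.www", 3}, {"com.abc.www", 5},
                              {"com.xyz.www", 2}};
  for (const auto& [site, sum] : Aggregate(raw)) {
    std::cout << site << ": " << sum.sessions << " sessions, "
              << sum.page_views << " page views\n";
  }
}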
Google Earth
[Figure: raw imagery flows through preprocessing & consolidating before serving]
Google Earth
[Figure: the tables and MapReduce jobs in the imagery pipeline]
• Raw images in GFS are preprocessed into the imagery table (~70TB, 8 column families, 3 locality groups): row key = geographic segment, e.g. (x1,y1),(x2,y2); columns hold the image sources; tablets such as (x0,y0),(x4,y4) are stored as SSTables (GFS files)
• A MapReduce job (map: value = image source; reduce: key = segment, value = final image) consolidates and indexes the imagery, writing the final images to GFS
• The index table (~500GB, 7 column families, 2 locality groups): row key = geographic segment; columns hold the final images
Personalized Search
[Figure: user histories (web queries, click URLs, search keywords, …) are distilled into a user profile]
Personalized Search
[Figure: the user table and the profile-generating MapReduce job]
• User table (~4TB, 93 column families, 11 locality groups): row key = userid (e.g. jaesun_han, jisoo1004, jk_tong); column families hold web queries, search keywords, click URLs, and the user profile; tablets such as ja ~ jn are stored as SSTables (GFS files)
• A MapReduce job analyzes each kind of user action (map over web queries, click URLs, …), reduces them per userid into a user history, and generates the profile that is written back to the table
Q & A