Anatomy of Google Service Platform
March 2, 2007
Jaesun Han ([email protected]), NexR
Contact: http://www.web2hub.com
Contents
Web 2.0 Technologies & GISP
Google Service Platform
Google File System(GFS)
Bigtable
MapReduce
Chubby
Google Services over Platform
Google Analytics
Google Earth
Personalized Search
Web 2.0 Technology Map
Web 2.0 Technology Layer
[Layer diagram: Web 2.0 technologies organized from client down to platform, over raw data, processed data, DBs, and external data sources]
• Client Layer: XHTML, CSS, Microformats, RIA (Ajax, Flex, XUL, XAML, Gadget)
• Front-End Layer: PHP, Python, Ruby, RoR, Dojo, DWR, Atlas, GWT, Apache, MySQL; RSS, Atom, OpenAPI, REST, JSON, SOAP, Mashup
• Data Processing Layer: Recommendation (Collaborative Filtering), Ranking, Clustering, Data Mining, Personalization, Social Network Analysis
• Platform Layer: Distributed/Parallel Processing, Distributed File System, Distributed Storage, Cluster Management; Cluster Computing, Beowulf, Grid, Globus, Condor, P2P, DHT, MPI, Utility Computing, Virtualization, Autonomous Computing
GISP (Global Internet Service Platform)
Web Technologies: Open Standard, Open Source
Web Services: Integration
Web Business: Globalization
• Client technologies (RIA): Ajax, Flash & Flex, XAML, XUL, SVG, Widgets …
• Server technologies: LAMP, JSP, ASP, Ruby, RoR …
• Content technologies: Blog, Wiki, RSS, Tagging, Podcasting, Mashup …
• Global platform technologies: Development, Deployment, Operation, Management …
GISP (Global Internet Service Platform)
• Development: ease of development, reusability (Internet service technologies, CBD)
• Operation: scalability & robustness, resource control (Grid & Utility Computing)
• Management: automated monitoring, evaluation, and analysis; autonomous handling of problems (Autonomic Computing)
• Deployment: automatic global deployment, easy migration/replication (Virtualization)
Core components: server cluster, distributed file system
Related Project: UCB RAD Lab (http://radlab.cs.berkeley.edu/)
Reliable, Adaptable, Distributed systems
UC Berkeley CS 5-year project launched in 2005
Funded by Google, MS, and Sun with $1.5M per year ($500K each)
The 5-year vision: a single person can invent and run the next revolutionary IT service ("the Fortune 1 million")
Technical goals: open source; systematize the process for developing, assessing, deploying, and operating (DADO) services
Google Service Platform
[Stack diagram: Services on top of a Service Library, on top of the Google OS, running on the Google Cluster]
• System software technologies: Google Linux, Google File System, MapReduce library, Chubby, Bigtable, intelligent systems, programming models (River, TACC), replication/redundancy …
• Hardware technologies: clusters, geographic distribution, automated setup, automated backup, standard components, commodity drives, flexible co-location, easy-access design …
• Service software technologies: search engine, email server, IM server, map database, various web sites …
• 450,000 or more servers (NYT)
• All PC servers less than $1,000
• 40 or more pizza-box servers per rack
Advantages: easy development, scalability, robustness
Principles of Google Platform
Cheap hardware and smart software
• Use cheap commodity hardware → frequent failures; develop smart software to reduce the cost of failure
Easy management
• High scalability through automatic discovery of new servers and racks
• High redundancy against failure of servers, racks, even data centers
Speed, and then more speed
• High speed at low cost (580MB/s read rate at $1,000 vs. 58MB/s at $18,000 for an IBM EXP)
• Rapid development and deployment of new products
Use existing technologies
• Use techniques from the leading edge of computer science
• Use open source code as a starting point
Google's new data center (Project 02)
On the banks of the Columbia River at The Dalles, Oregon,
at the intersection of cheap electricity and readily accessible data networking
A computing center as big as two football fields, with twin cooling plants
cf. MS: 200,000 servers → 800,000 servers (by 2011)
From The New York Times (2006.6.14)
Google Service Platform
• GFS (SOSP 2003): distributed file system
• Bigtable (OSDI 2006): distributed storage system for structured data
• MapReduce (OSDI 2004): distributed data processing library
• Chubby (OSDI 2006): distributed lock manager
[Figure: GFS and Bigtable form the storage layer; MapReduce forms the computation layer]
GFS
GFS: Overview
Scalable distributed file system for large distributed data-intensive applications
Running on inexpensive commodity hardware, delivering high aggregate performance to a large number of clients
Features
• user-level distributed file system
• centralized architecture (metadata: client <-> a single master; data: client <-> chunkservers)
• 64MB fixed large chunk size
• non-standard file system interface (not the POSIX API): create, delete, open, close, read, write, snapshot, and record append
• three replicas of each chunk
• no client data caching (but clients cache metadata such as chunk locations)
GFS: Architecture
[Architecture diagram: clients, a single master, and many chunkservers, with separation of control flow (client <-> master) and data flow (client <-> chunkservers)]
Master metadata (in-memory data structures):
• file and chunk namespaces, and the mapping from files to chunks (persisted in the operation log)
• chunk locations (an in-memory lookup table, rebuilt by asking chunkservers rather than persisted)
Handled by the master: file creation/deletion, file renaming, chunk addition/deletion
Handled by chunkservers: file read/write
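To make the control-flow/data-flow split concrete, here is a minimal, self-contained sketch (my own illustration, not the real GFS client library; every class and method name is hypothetical) of a client read: compute which 64MB chunk holds the byte offset, ask the master for that chunk's handle and replica locations, then fetch the data from a chunkserver.

#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch of the GFS read path: metadata from the master,
// data from a chunkserver. Names and structures are illustrative only.
constexpr uint64_t kChunkSize = 64ULL * 1024 * 1024;  // 64MB fixed chunk size

struct ChunkInfo {
  uint64_t chunk_handle;
  std::vector<std::string> chunkserver_locations;  // three replicas
};

// Stand-in for the single master: maps (file, chunk index) -> chunk info.
class Master {
 public:
  void AddChunk(const std::string& file, uint64_t index, ChunkInfo info) {
    table_[{file, index}] = std::move(info);
  }
  ChunkInfo Lookup(const std::string& file, uint64_t index) const {
    return table_.at({file, index});
  }
 private:
  std::map<std::pair<std::string, uint64_t>, ChunkInfo> table_;
};

// Client-side read: compute the chunk index, get locations from the master
// (control flow), then read from one replica (data flow, simulated here).
std::string Read(const Master& master, const std::string& file,
                 uint64_t offset, uint64_t length) {
  uint64_t chunk_index = offset / kChunkSize;          // which chunk holds offset
  ChunkInfo info = master.Lookup(file, chunk_index);   // metadata from master
  // A real client would now contact one of info.chunkserver_locations
  // and read `length` bytes within the chunk.
  return "read " + std::to_string(length) + " bytes of chunk " +
         std::to_string(info.chunk_handle) + " from " +
         info.chunkserver_locations.front();
}

int main() {
  Master master;
  master.AddChunk("/gfs/test/bigfile", 0,
                  {42, {"chunkserver-a", "chunkserver-b", "chunkserver-c"}});
  std::cout << Read(master, "/gfs/test/bigfile", 1024, 4096) << "\n";
}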
GFS: Write
[Write-flow diagram]
• The master grants a primary lease for the chunk (initial timeout = 60s) and tells the client the primary and replica locations
• Data is delivered to the replicas in a pipelined fashion to fully utilize each machine's network bandwidth
• The primary orders concurrent write requests for the same chunk and forwards that order to the secondary replicas
GFS: Relaxed Consistency
[Figure: clients 2 and 4 concurrently write records A and B to a region spanning chunk1 and chunk2; the replicas of chunk2 may apply the two mutations as B -> A (case 1) or A -> B (case 2)]
• Within each chunk, all replicas apply mutations in the same order, so the chunk region is consistent (identical on every replica)
• Across chunks the chosen orders may differ, so a concurrent write spanning chunks can leave the file region undefined: consistent, but a mix of fragments from different writes
GFS: Atomic Record Appends
[Figure: clients 1 and 2 write records A and B to chunk1; with ordinary writes at a client-specified exact offset, the replicas may serialize them as B -> A (case 1) or A -> B (case 2), and a failed write of B leaves the replicas inconsistent; with record append the outcome is the same on every replica]
• With record append, GFS chooses the offset and appends the record atomically at least once; all replicas of chunk1 place the record at the same offset
• Max record size: 1/4 of the max chunk size (the remainder of a chunk is padded when a record does not fit)
Bigtable
Bigtable: Overview
Motivation
• Lots of structured and semi-structured data: web crawl data, satellite imagery, user data, email, …
• No commercial system big enough
Bigtable
• Distributed storage system for structured data
• A sparse, distributed, persistent, multi-dimensional sorted map
Goals: wide applicability, scalability, high performance, and high availability
Target workloads: from throughput-oriented batch-processing jobs to latency-sensitive serving of data to end users
Applications: more than 60 Google products and projects (Google Analytics, Google Finance, Orkut, Personalized Search, Writely, and Google Earth)
Bigtable: Data Model
[Figure: a table indexed by row key (e.g. com.web2hub.www), column key (column family:qualifier), and timestamp; contiguous row ranges form tablets]
• Indexing: (row:string, column:string, time:int64) → string, e.g. (com.cnn.www, anchor:my.look.ca, t8) → "CNN.com"
• Reads and writes under a single row key are atomic
• Tablets are the unit of distribution and load balancing
• A column family is the basic unit of access control
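The data model can be pictured as nested sorted maps. The sketch below (my own illustration, not Bigtable code) models a table as row → column → timestamp → value, with timestamps ordered newest-first as in Bigtable.

#include <cstdint>
#include <functional>
#include <iostream>
#include <map>
#include <string>

// Illustrative model of the Bigtable data model:
// (row:string, column:string, time:int64) -> string.
// Timestamps are ordered descending so that begin() is the newest version.
using VersionMap = std::map<int64_t, std::string, std::greater<int64_t>>;
using ColumnMap = std::map<std::string, VersionMap>;  // "family:qualifier" -> versions
using Table = std::map<std::string, ColumnMap>;       // sorted by row key

int main() {
  Table webtable;
  // In Bigtable, all writes under one row key ("com.cnn.www") would be atomic.
  webtable["com.cnn.www"]["anchor:my.look.ca"][8] = "CNN.com";
  webtable["com.cnn.www"]["anchor:my.look.ca"][3] = "CNN";
  webtable["com.cnn.www"]["contents:"][6] = "<html>...</html>";

  // Look up the most recent version of one cell.
  const VersionMap& versions = webtable["com.cnn.www"]["anchor:my.look.ca"];
  std::cout << "latest anchor:my.look.ca = " << versions.begin()->second
            << " (t" << versions.begin()->first << ")\n";
}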
Bigtable: API
Metadata operations: create/delete tables and column families, change metadata
Other API features: single-row transactions (atomic read-modify-write sequences); execution of client-supplied scripts (written in Sawzall)

Writing to Bigtable:
// Open the table
Table *T = OpenOrDie("/bigtable/web/webtable");
// Write a new anchor and delete an old anchor
RowMutation r1(T, "com.cnn.www");
r1.Set("anchor:www.c-span.org", "CNN");
r1.Delete("anchor:www.abc.com");
Operation op;
Apply(&op, &r1);

Reading from Bigtable:
Scanner scanner(T);
ScanStream *stream;
stream = scanner.FetchColumnFamily("anchor");
stream->SetReturnAllVersions();
scanner.Lookup("com.cnn.www");
for (; !stream->Done(); stream->Next()) {
  printf("%s %s %lld %s\n",
         scanner.RowName(),
         stream->ColumnName(),
         stream->MicroTimestamp(),
         stream->Value());
}
Bigtable: SSTable
• Used internally to store Bigtable data
• Immutable, sorted file of key-value pairs: data blocks plus a block index
• block size is 64KB, but configurable
• the index is used to locate blocks and is loaded into memory when the SSTable is opened
[Figure: an SSTable laid out as a sequence of 64KB key-value blocks followed by the index]
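As a rough illustration of that layout (a toy, not the real SSTable file format), the sketch below keeps sorted key-value pairs in fixed-size blocks plus an index of each block's first key, and answers a lookup by binary-searching the index and scanning one block.

#include <algorithm>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

// Toy in-memory stand-in for an SSTable: sorted key-value pairs grouped into
// fixed-size blocks, plus an index mapping each block's first key to the block.
class ToySSTable {
 public:
  // Input must already be sorted by key; real SSTables are written that way.
  ToySSTable(const std::vector<std::pair<std::string, std::string>>& sorted_kv,
             size_t entries_per_block) {
    for (size_t i = 0; i < sorted_kv.size(); i += entries_per_block) {
      size_t end = std::min(i + entries_per_block, sorted_kv.size());
      blocks_.emplace_back(sorted_kv.begin() + i, sorted_kv.begin() + end);
      index_.push_back(sorted_kv[i].first);  // first key of the block
    }
  }

  // Lookup: binary-search the (in-memory) index, then scan a single block.
  bool Get(const std::string& key, std::string* value) const {
    auto it = std::upper_bound(index_.begin(), index_.end(), key);
    if (it == index_.begin()) return false;  // key precedes every block
    size_t block = (it - index_.begin()) - 1;
    for (const auto& kv : blocks_[block]) {
      if (kv.first == key) { *value = kv.second; return true; }
    }
    return false;
  }

 private:
  std::vector<std::vector<std::pair<std::string, std::string>>> blocks_;
  std::vector<std::string> index_;  // loaded into memory when the table opens
};

int main() {
  ToySSTable table({{"a", "1"}, {"b", "2"}, {"c", "3"}, {"d", "4"}}, 2);
  std::string value;
  if (table.Get("c", &value)) std::cout << "c -> " << value << "\n";
}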
Bigtable: Tablet & Locality Group
[Figure: a tablet (100~200MB, e.g. the rows com.cnn.www/abc.html ~ help.html) whose column families (contents, anchor, language, checksum) are grouped into locality groups 1-3; each locality group is stored in its own SSTable (here 100MB, 50MB, and 30MB), and the SSTables are stored as 64MB GFS chunks]
• A locality group holds column families that are typically read together, so reading one group does not touch the SSTables of the others
Bigtable: Tablet Location
Features
• Three-level hierarchy of location tablets (served on tablet servers, not the master)
• The client library caches and prefetches tablet locations
[Figure: each METADATA row maps a row key (the tablet's table id + its end row, e.g. webtable:com.cnn.www) to a tablet location; a 128MB METADATA tablet holds 2^17 rows of ~1KB each, and the hierarchy addresses 2^34 tablets]
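To spell out the arithmetic behind those figures (following the Bigtable paper's estimate): each METADATA row is about 1KB, so one 128MB METADATA tablet holds 128MB / 1KB = 2^27 / 2^10 = 2^17 rows. With one root tablet pointing at up to 2^17 METADATA tablets, each pointing at up to 2^17 user tablets, the three-level scheme addresses 2^17 × 2^17 = 2^34 tablets, or roughly 2^61 bytes at 128MB per tablet.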
Bigtable: Tablet Assignment
[Figure: tablet-server registration and failure handling through Chubby]
1) The cluster management system starts a tablet server (tab_svr10)
2) The tablet server creates a file in a well-known Chubby directory (/servers/tab_svr10)
3) It acquires an exclusive lock on that file
4) The Bigtable master monitors /servers to discover tablet servers
5) The master assigns tablets to the server
6) The master periodically checks the server's lock status
7) If the server fails or loses its lock,
8) the master acquires and then deletes the server's lock file, and
9) reassigns the now-unassigned tablets to other tablet servers
Bigtable: Master Failure
[Figure: master startup and recovery through Chubby]
0) A new master starts
1) It acquires the exclusive master lock in Chubby (/servers/master)
2) It scans /servers in Chubby to get the list of live tablet servers
3) It asks each tablet server which tablets it already has assigned
4) It scans the METADATA tablets to learn the full set of tablets
5) It reassigns any unassigned tablets
Tablet changes: create, delete, and merge are initiated by the master; split is initiated by a tablet server
Bigtable: Read/Write
[Figure: writes go to a memtable (sorted in-memory buffer) and to a single commit log per tablet server; older data sits in SSTables (anchor v1.0 … v4.0); reads see a merged view of all of them]
Example mutations:
t1: Set("anchor:www.c-span.org", "CNN")
t2: Delete("anchor:www.abc.com")
t3: Set("anchor:www.abc.com", "ABC")
After these mutations the memtable holds anchor:www.c-span.org → CNN, plus a deletion entry and the newer value ABC for anchor:www.abc.com
• Fast writes: each mutation is appended to the tablet server's commit log and buffered in the memtable
• Efficient reads: served from a merged view of the memtable and the SSTables, all of which are sorted data structures (see the sketch below)
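A rough sketch of that merged read (my own simplification, not tablet-server code): consult the memtable first, then the SSTables from newest to oldest, honoring deletion markers.

#include <iostream>
#include <map>
#include <optional>
#include <string>
#include <vector>

// Toy merged-view read over a memtable and a stack of immutable SSTables.
// A map entry holding std::nullopt represents a deletion marker.
using KVMap = std::map<std::string, std::optional<std::string>>;

std::optional<std::string> MergedRead(
    const KVMap& memtable,
    const std::vector<KVMap>& sstables_newest_first,  // newest SSTable first
    const std::string& key) {
  // 1) The memtable holds the most recent mutations.
  if (auto it = memtable.find(key); it != memtable.end()) return it->second;
  // 2) Fall back to SSTables, newest first; the first hit wins.
  for (const KVMap& sstable : sstables_newest_first) {
    if (auto it = sstable.find(key); it != sstable.end()) return it->second;
  }
  return std::nullopt;  // never written (or deleted)
}

int main() {
  KVMap memtable = {{"anchor:www.abc.com", "ABC"}};          // t3: Set
  std::vector<KVMap> sstables = {
      {{"anchor:www.abc.com", std::nullopt}},                // t2: Delete
      {{"anchor:www.abc.com", "ABC-old"}, {"anchor:www.c-span.org", "CNN"}}};
  auto v = MergedRead(memtable, sstables, "anchor:www.c-span.org");
  std::cout << (v ? *v : std::string("<deleted or missing>")) << "\n";  // CNN
}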
Bigtable: Compactions
[Figure: the memtable and SSTables v1.0 … v4.0; compactions produce new SSTables v5.0 and v6.0]
• Minor compaction: memtable → a new SSTable
• Major compaction: memtable + all SSTables → only one SSTable
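Continuing the toy representation from the read-path sketch above (again my own simplification), a minor compaction freezes the memtable into a new SSTable, while a major compaction merges everything into a single SSTable and can finally discard deletion markers.

#include <iostream>
#include <map>
#include <optional>
#include <string>
#include <vector>

// Same toy representation as before: nullopt values are deletion markers.
using KVMap = std::map<std::string, std::optional<std::string>>;

// Minor compaction: write the memtable out as the newest SSTable and start
// a fresh, empty memtable.
void MinorCompaction(KVMap* memtable, std::vector<KVMap>* sstables_newest_first) {
  sstables_newest_first->insert(sstables_newest_first->begin(), *memtable);
  memtable->clear();
}

// Major compaction: merge the memtable and all SSTables into one SSTable,
// keeping only the newest version of each key and dropping deletion markers.
KVMap MajorCompaction(const KVMap& memtable,
                      const std::vector<KVMap>& sstables_newest_first) {
  KVMap merged;
  auto absorb = [&merged](const KVMap& source) {
    // map::insert keeps existing (newer) entries, so older values never win.
    merged.insert(source.begin(), source.end());
  };
  absorb(memtable);
  for (const KVMap& sstable : sstables_newest_first) absorb(sstable);
  // Deletion markers can now be discarded: no older data remains below them.
  KVMap result;
  for (const auto& [key, value] : merged) {
    if (value.has_value()) result[key] = value;
  }
  return result;
}

int main() {
  KVMap memtable = {{"a", "new"}, {"b", std::nullopt}};   // b was deleted
  std::vector<KVMap> sstables = {{{"a", "old"}, {"b", "old"}, {"c", "keep"}}};
  MinorCompaction(&memtable, &sstables);                  // memtable -> SSTable
  KVMap one = MajorCompaction(memtable, sstables);        // {a: new, c: keep}
  std::cout << "after major compaction: " << one.size() << " keys\n";
}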
MapReduce
MapReduce: Overview
Motivation
• Input data is large
• Lots of machines: hundreds of thousands of PC servers
MapReduce
• Programming model and implementation for parallel processing of large data sets
• Parallelization, fault-tolerance, data distribution, and load balancing are handled by the MapReduce library
map & reduce functions
• map(k1, v1) → list(k2, v2)
• reduce(k2, list(v2)) → list(v2)
Usage examples: distributed grep, count of URL access frequency, reverse web-link graph, term vector per host, inverted index, distributed sort
MapReduce: Data Processing Flow
MapReduce: Architecture
[Figure: (0) the library splits the input files (stored on GFS); map workers apply (k1,v1) → list(k2,v2); intermediate keys are partitioned across the R reduce tasks by hash(key) mod R; the master is notified of completed map outputs and points reduce workers at them; reduce workers apply (k2, list(v2)) → list(v2) and write the global output files back to GFS; other MapReduce programs share the same cluster]
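The partitioning step is just a deterministic hash of the intermediate key, so every occurrence of a key lands in the same reduce task. A minimal, self-contained sketch (illustrative only, not the Google library):

#include <functional>
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Assign each intermediate key to one of R reduce tasks: hash(key) mod R.
// Every map worker uses the same function, so all values for a given key
// end up in the same reduce partition.
size_t Partition(const std::string& key, size_t num_reduce_tasks) {
  return std::hash<std::string>{}(key) % num_reduce_tasks;
}

int main() {
  const size_t R = 4;
  std::vector<std::pair<std::string, std::string>> intermediate = {
      {"web", "1"}, {"the", "1"}, {"web", "1"}, {"google", "1"}};

  // Group intermediate pairs by partition, as the map side would before
  // writing R partitioned files to local disk.
  std::vector<std::multimap<std::string, std::string>> partitions(R);
  for (const auto& [key, value] : intermediate) {
    partitions[Partition(key, R)].insert({key, value});
  }
  for (size_t r = 0; r < R; ++r) {
    std::cout << "reduce task " << r << " gets " << partitions[r].size()
              << " pairs\n";
  }
}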
MapReduce: Code Example

class WordCounter : public Mapper {
 public:
  virtual void Map(const MapInput& input) {
    const string& text = input.value();
    const int n = text.size();
    for (int i = 0; i < n; ) {
      // Skip past leading whitespace
      while ((i < n) && isspace(text[i])) i++;
      // Find word end
      int start = i;
      while ((i < n) && !isspace(text[i])) i++;
      if (start < i)
        Emit(text.substr(start, i-start), "1");
    }
  }
};
REGISTER_MAPPER(WordCounter);

class Adder : public Reducer {
  virtual void Reduce(ReduceInput* input) {
    // Iterate over all entries with the same key and add the values
    int64 value = 0;
    while (!input->done()) {
      value += StringToInt(input->value());
      input->NextValue();
    }
    // Emit sum for input->key()
    Emit(IntToString(value));
  }
};
REGISTER_REDUCER(Adder);

int main(int argc, char** argv) {
  ParseCommandLineFlags(argc, argv);
  MapReduceSpecification spec;
  // Store the list of input files into "spec"
  for (int i = 1; i < argc; i++) {
    MapReduceInput* input = spec.add_input();
    input->set_format("text");
    input->set_filepattern(argv[i]);
    input->set_mapper_class("WordCounter");
  }
  // Specify the output files
  MapReduceOutput* out = spec.output();
  out->set_filebase("/gfs/test/freq");
  out->set_num_tasks(100);
  out->set_format("text");
  out->set_reducer_class("Adder");
  // Optional: do partial sums within map tasks
  out->set_combiner_class("Adder");
  // Tuning parameters
  spec.set_machines(2000);
  spec.set_map_megabytes(100);
  spec.set_reduce_megabytes(100);
  // Now run it
  MapReduceResult result;
  if (!MapReduce(spec, &result)) abort();
  return 0;
}
MapReduce: Fault-tolerance
Worker failure
• Re-execution of the failed worker's map or reduce tasks
• Completed map tasks (output on local disk) → re-executed
• Completed reduce tasks (output on GFS) → no need for re-execution
Master failure
• Periodic checkpointing of the master data structures → re-execution from the last checkpoint
Semantics in the presence of failure: guarantee atomic commits of map and reduce task outputs
• map output: committed via the master's confirmation
• reduce output: committed via GFS's atomic rename operation (illustrated below)
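As a small aside, the rename idiom itself is easy to show with an ordinary file system (this is a local-filesystem illustration, not GFS): write the task's output to a private temporary file and publish it with one atomic rename, so a re-executed task can never expose a half-written result.

#include <filesystem>
#include <fstream>
#include <iostream>

// Commit idiom relied on by the reduce side: write the task's complete output
// to a private temporary file, then publish it with a single atomic rename.
// (GFS provides the same guarantee for its namespace; this uses the local FS.)
int main() {
  namespace fs = std::filesystem;
  const fs::path tmp = "reduce-task-0.tmp";
  const fs::path final_out = "part-00000";

  {
    std::ofstream out(tmp);
    out << "the\t1024\n" << "web\t512\n";   // the task's complete output
  }                                          // closed (flushed) before commit

  fs::rename(tmp, final_out);                // atomic publish: all or nothing
  std::cout << "committed " << final_out << "\n";
}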
Chubby
Chubby: Overview
Distributed lock service
• Target: loosely-coupled distributed systems (a moderately large number of small machines connected by a high-speed network)
• Goals: reliability, availability, and easy-to-understand semantics; throughput and storage capacity are considered secondary
• Interface similar to a simple file system, but restricted to whole-file reads/writes and augmented with advisory locks and event notifications
Usage in both GFS and Bigtable:
• for master election
• for discovering servers and finding the master
• as a well-known location to store a small amount of metadata
• as the root of their distributed data structures
Chubby: System structure
[Figure: a Chubby cell of several replicas, each holding a simple database kept consistent by a distributed consensus protocol used for master election and database updates; clients obtain the replica list through DNS and talk to the elected master; typical clients include the Bigtable master and tablet servers and the GFS master and chunkservers]
Chubby: Interface
Similar to a file system interface
Example: /ls/datacenter000/servers/svr_10980
• /ls: stands for lock service
• /datacenter000: the Chubby cell's name (/local: the client's local Chubby cell; /global: the global Chubby cell)
• /servers/svr_10980: interpreted within the named Chubby cell
Node (file & directory) metadata
• three ACL file names (for reading, writing, and changing ACL names)
• four monotonically increasing 64-bit numbers: an instance number, a content generation number, a lock generation number, and an ACL generation number
Handles: returned when clients open nodes; include check digits, a sequence number, and mode information
Chubby: Global cell
[Figure: a global cell (/ls/global) and per-datacenter local cells (/ls/cellname); the subtree /ls/global/master is mirrored to the subtree /ls/cell/slave in each local cell]
Stored in the global cell:
• Chubby's own ACLs
• Advertisement of presence to monitoring services
• Pointers allowing clients to locate large data sets such as Bigtable cells
• Many configuration files for other systems
Chubby: API
APIs
• Open(), Close(), Poison()
• GetContentsAndStat(), GetStat(), ReadDir(): contents and metadata are read atomically and in their entirety
• SetContents(), SetACL(): written atomically and in their entirety
• Delete()
• Acquire(), TryAcquire(), Release()
• GetSequencer(), SetSequencer(), CheckSequencer()
Usage example: primary election (sketched below)
• All potential primaries Open() the lock file and try to Acquire() the lock
• The winner (the primary) calls SetContents() to write its identity
• All replicas are notified by an event and call GetContentsAndStat() to learn the primary's identity
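The sketch below spells that election sequence out against an in-memory stand-in for a single Chubby lock file (the class, its method signatures, and the addresses are my own assumptions, not Chubby's real client library).

#include <iostream>
#include <optional>
#include <string>

// In-memory stand-in for one Chubby lock file, exposing the calls named on
// this slide (Acquire/SetContents/GetContentsAndStat). A toy illustration of
// the election sequence only.
class FakeLockFile {
 public:
  bool Acquire(const std::string& client) {       // exclusive advisory lock
    if (holder_) return false;                    // someone already holds it
    holder_ = client;
    return true;
  }
  void SetContents(const std::string& client, const std::string& contents) {
    if (holder_ == client) contents_ = contents;  // only the lock holder writes
  }
  std::string GetContentsAndStat() const { return contents_; }

 private:
  std::optional<std::string> holder_;
  std::string contents_;
};

// One candidate's election attempt: try to grab the lock; the winner records
// its identity in the file so everyone else can discover the primary.
void RunElection(FakeLockFile& lock, const std::string& my_address) {
  if (lock.Acquire(my_address)) {
    lock.SetContents(my_address, my_address);     // "I am the primary"
    std::cout << my_address << ": became primary\n";
  } else {
    // Losers (replicas) would normally learn the result via an event
    // notification; here they simply read the file contents.
    std::cout << my_address << ": replica, primary is "
              << lock.GetContentsAndStat() << "\n";
  }
}

int main() {
  FakeLockFile master_lock;                       // e.g. a master lock file
  RunElection(master_lock, "server-1:7000");      // wins the lock
  RunElection(master_lock, "server-2:7000");      // becomes a replica
}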
Chubby: Database & Backup
Database implementation
• The first version: a replicated version of Berkeley DB
• Now: a simple database written in-house, with write-ahead logging, snapshotting, and atomic operations
Backup
• Every few hours, the master writes a snapshot of its DB to a GFS file server in a different building
• Uses: disaster recovery; initializing the DB of a newly replaced replica
Overall View
Bird's View Revisited
[Stack diagram: Google OS (Linux file system, multithreading) at the bottom; GFS, Bigtable, the Workqueue (scheduler), MapReduce, and Chubby layered on top, alongside the cluster management system and a DB; client interfaces in C++, Sawzall, Java, and Python serve batch clients and runtime clients]
Server Process View
[Figure: the processes on a single server and the cluster-wide roles. Each server runs a GFS chunkserver holding chunks, a Bigtable tablet server serving tablets (whose locality-group SSTables LG1-LG3 live in GFS), a local scheduler, and a pool of MapReduce map/reduce workers (M, M, R). Cluster-wide: the GFS master (Storage: GFS), the Bigtable master (Database: Bigtable), a global scheduler (Computation: MapReduce), the cluster management system, and a Chubby cell]
Google Services over Google Platform
Google Analytics
Embedded JavaScript:
<script src="http://www.google-analytics.com/urchin.js" type="text/javascript"></script>
<script type="text/javascript">
_uacct = "xxxxxxxxxx";
urchinTracker();
</script>
Google Analytics
[Figure: the Bigtable tables and the MapReduce job behind Analytics]
• Raw click table (~200TB): row key = tuple (website URL, time), e.g. com.abc.www:0001 … com.abc.www:0050; the value is each session's info; tablets such as com.abc.www (a.html ~ o.html) are stored as SSTables (GFS files)
• Summary table (~20TB): row key = the website's URL (e.g. com.abc.www); the value is the analyzed info; likewise stored as SSTables (GFS files)
• A MapReduce job analyzes the raw click table (map: emit key = website's URL, value = analyzed session info) and aggregates the results (reduce) into the summary table
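What that raw-click → summary aggregation computes can be sketched with a plain in-memory map (a toy model of the map and reduce steps; the Session and Summary fields are made up for illustration):

#include <iostream>
#include <map>
#include <string>
#include <vector>

// Toy model of the Analytics aggregation performed by the MapReduce job:
// raw click rows keyed by (website URL, time) are reduced to one summary row
// per website. Field names (page_views, sessions) are hypothetical.
struct Session { std::string website; int page_views; };
struct Summary { int sessions = 0; int page_views = 0; };

std::map<std::string, Summary> Aggregate(const std::vector<Session>& raw_clicks) {
  std::map<std::string, Summary> summary_table;   // row key = website URL
  for (const Session& s : raw_clicks) {           // "map": emit (website, info)
    Summary& row = summary_table[s.website];      // "reduce": aggregate per key
    row.sessions += 1;
    row.page_views += s.page_views;
  }
  return summary_table;
}

int main() {
  std::vector<Session> raw = {{"com.abc.www", 3}, {"com.abc.www", 5},
                              {"com.xyz.www", 2}};
  for (const auto& [site, sum] : Aggregate(raw)) {
    std::cout << site << ": " << sum.sessions << " sessions, "
              << sum.page_views << " page views\n";
  }
}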
Google Earth
[Figure: raw imagery flows through preprocessing & consolidating before serving]
Google Earth
[Figure: the tables and MapReduce jobs in the imagery pipeline]
• Raw images in GFS are preprocessed into the imagery table (~70TB, 8 column families, 3 locality groups): row key = geographic segment, e.g. (x1,y1),(x2,y2); columns hold the image sources; tablets such as (x0,y0),(x4,y4) are stored as SSTables (GFS files)
• A MapReduce job (map: value = image source; reduce: key = segment, value = final image) consolidates and indexes the imagery, writing the final images to GFS
• The index table (~500GB, 7 column families, 2 locality groups): row key = geographic segment; columns hold the final images
Personalized Search
[Figure: user histories (web queries, click URLs, search keywords, …) are distilled into a user profile]
Personalized Search
[Figure: the user table and the profile-generating MapReduce job]
• User table (~4TB, 93 column families, 11 locality groups): row key = userid (e.g. jaesun_han, jisoo1004, jk_tong); column families hold web queries, search keywords, click URLs, and the user profile; tablets such as ja ~ jn are stored as SSTables (GFS files)
• A MapReduce job analyzes each kind of user action (map over web queries, click URLs, …), reduces them per userid into a user history, and generates the profile that is written back to the table
Q & A