Beolink.org
myS3
Fabrizio Manfredi Furuholmen, Federico Mosca
FOSDEM 2014
Agenda
• Introduction: Goals, Principles
• myS3: Architecture, Internals, Sub project
• Conclusion: Developments
Unsolved problem
Web Interface
“Amazon S3 provides a simple web-services interface that can be used to store and retrieve any amount of data, at any time, from anywhere on the web…”
S3
• Every file you upload to Amazon S3 is stored in a container called a bucket.
• Each bucket name must be unique.
• Each bucket can contain an unlimited number of objects (key/value).
• Buckets cannot be nested: you cannot create a bucket within a bucket.
• Object
  – Id
  – Version
  – Metadata
  – Subresources
  – ACL
• HTTP REST calls
• Byte-range transfer
• Parallel transfer
myS3
Translate S3 requests to the local disk
Mapping
• S3 Bucket: a directory in the AFS space
• S3 Object: a file or a directory; the directory is a fake object
• S3 ACL: AFS ACL permissions and Unix permissions are both returned as S3 metadata
• All other S3 features are not implemented
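The bucket-to-directory mapping can be sketched in a few lines of Python. This is only an illustration, not the myS3 code: the `BUCKET_DB` table, the function names and the escape check are assumptions, and the single entry reuses the `Myhome` example from the Internal DB slide.

```python
import posixpath

# Hypothetical bucket table: bucket name -> AFS directory.
BUCKET_DB = {"Myhome": "/afs/beolink/home/manfred"}


def object_path(bucket: str, key: str) -> str:
    """Translate an S3 bucket/key pair into a path below the bucket's AFS directory."""
    root = BUCKET_DB[bucket]
    # Normalise the key and refuse anything that would leave the bucket directory.
    path = posixpath.normpath(posixpath.join(root, key.lstrip("/")))
    if path != root and not path.startswith(root + "/"):
        raise ValueError("key escapes the bucket directory")
    return path


def is_directory_key(key: str) -> bool:
    """A key ending in '/' stands for a directory (a 'fake object' on the S3 side)."""
    return key.endswith("/")


print(object_path("Myhome", "yesterday/puppy.jpg"))
```

The check after `normpath` is one possible way to keep keys such as `../../x` from resolving outside the bucket directory.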
S3 Request
GET /mybucket/puppy.jpg HTTP/1.1
User-Agent: dotnet
Host: s3.amazonaws.com
Date: Tue, 15 Jan 2008 21:20:27 +0000
x-amz-date: Tue, 15 Jan 2008 21:20:27 +0000
Authorization: AWS AKIAIOSFODNN7EXAMPLE:k3nL7gH3+PadhTEVn5EXAMPLE
Objects in the same bucket have no relation to each other: there is no hierarchy.
GET /mybucket/puppy.jpg
GET /mybucket/yesterday/puppy.jp
“yesterday” does not exist.
S3 Request
To retrieve the content of a directory:
- Prefix: the parent directory
- Delimiter: ‘/’, marking the end of a name
To create a directory:
- Object name with ‘/’ at the end
<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Name>ExampleBucket</Name>
  <Prefix>/mydir/</Prefix>
  <Marker></Marker>
  <MaxKeys>1000</MaxKeys>
  <Delimiter>/</Delimiter>
  <IsTruncated>false</IsTruncated>
  <Contents>
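How a prefix and a delimiter turn a flat key space into a directory listing can be shown with a small Python sketch. The function name `list_keys` and the sample keys are made up for illustration; the grouping follows the S3 listing semantics described above (keys under the prefix, with anything beyond the next delimiter rolled up into a common prefix).

```python
def list_keys(keys, prefix="", delimiter="/"):
    """Return (contents, common_prefixes) for a flat collection of keys."""
    contents, common = [], set()
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter and delimiter in rest:
            # Everything up to the first delimiter after the prefix is rolled up.
            common.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            contents.append(key)
    return contents, sorted(common)


# Hypothetical bucket content; "mydir/" is the marker object for the directory.
sample = ["mydir/", "mydir/a.txt", "mydir/sub/b.txt", "other.txt"]
print(list_keys(sample, prefix="mydir/"))
```

Here the directory marker `mydir/` and `mydir/a.txt` come back as contents, while `mydir/sub/b.txt` is reported only through the common prefix `mydir/sub/`.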
AWS Auth
Authorization = "AWS" + " " + AWSAccessKeyId + ":" + Signature;
Signature = Base64( HMAC-SHA1( YourSecretAccessKeyID, UTF-8-Encoding-Of( StringToSign ) ) );
StringToSign = HTTP-Verb + "\n" +
    Content-MD5 + "\n" +
    Content-Type + "\n" +
    Date + "\n" +
    CanonicalizedAmzHeaders +
    CanonicalizedResource;
CanonicalizedResource = [ "/" + Bucket ] +
    <HTTP-Request-URI, from the protocol name up to the query string> +
    [ subresource, if present. For example "?acl", "?location", "?logging", or "?torrent"];
CanonicalizedAmzHeaders = <described below>
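The signature formula above maps directly onto the Python standard library. The sketch below is a minimal illustration with hypothetical function names; it takes the canonicalized headers and resource as ready-made strings and does not build them, and the secret key in the usage lines is a dummy value.

```python
import base64
import hashlib
import hmac


def sign_request(secret_key, verb, resource, date,
                 content_md5="", content_type="", amz_headers=""):
    """Signature = Base64(HMAC-SHA1(secret, UTF-8(StringToSign)))."""
    string_to_sign = (
        "\n".join([verb, content_md5, content_type, date]) + "\n"
        + amz_headers   # CanonicalizedAmzHeaders, already canonicalized by the caller
        + resource      # CanonicalizedResource, e.g. "/mybucket/puppy.jpg"
    )
    digest = hmac.new(secret_key.encode("utf-8"),
                      string_to_sign.encode("utf-8"),
                      hashlib.sha1).digest()
    return base64.b64encode(digest).decode("ascii")


def authorization_header(access_key, signature):
    """Authorization = "AWS" + " " + AWSAccessKeyId + ":" + Signature."""
    return "AWS " + access_key + ":" + signature


sig = sign_request("dummy-secret", "GET", "/mybucket/puppy.jpg",
                   "Tue, 15 Jan 2008 21:20:27 +0000")
print(authorization_header("AKIAIOSFODNN7EXAMPLE", sig))
```

A server that knows the secret key recomputes the same value from the incoming request and compares it with the one in the header; `hmac.compare_digest` is the usual choice for that comparison.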
Authentication
• IP based: computer account; the authentication of the users is handled by the internal DB
• Impersonate: the ticket for the user is forged on the server side; the authentication is handled by the internal DB
• Token generation: web interface authentication (Kerberos auth), one-time AWS token generation
Server Architecture
[Diagram: server layers Interface, Managers, Drivers, Plugin. Components shown: S3 Interface, Web Interface, Storage Manager, Auth Manager, Bucket Manager, Token Manager, Storage Driver, Cache, /afs]
Internal DB
• Bucket DB: contains the map between the bucket name and the AFS path, e.g. Myhome -> /afs/beolink/home/manfred
• Token DB: contains the access key and secret key for Amazon authentication; with web-based authentication the DB also contains the Kerberos token
Storage Manager
• NFS style: most operations are performed on a temporary file (.NFSXXX)
• Caching: save the temporary file in non-AFS space
• NoWait: return OK as soon as the file is on the S3 server
• Mem: keep the transferred file in memory (max 100 MB)
• ACL: enable write operations on the AFS ACL
• MD5: enable or disable MD5
TODO
• Parallel transfer
• Locking
• Kerberos token based
• Chunked transfer (HTTP 100) / byte-range transfer
• Create an interface for CloudStack
• Automatic volume release
RestFS
GOAL
Create a framework for testing new technologies and paradigms
Principle 1/3
“Moving Computation is Cheaper than Moving Data”
Principle 2/3
“There is always a failure waiting around the corner”
*Werner Vogels
Principle 3/3
“Decompose into small loosely coupled, stateless building blocks”
*“Leaving a Legacy System Revisited”, Chad Fowler
Five pylons
Objects
• Separation between data and metadata
• Each element is marked with a revision
• Each element is marked with a hash
Cache
• Client side
• Callback/Notify
• Persistent
Transmission
• Parallel operation
• HTTP-like protocol
• Compression
• Transfer by difference
Distribution
• Resource discovery by DNS
• Data spread on a multi-node cluster
• Decentralized
• Independent clusters
• Data replication
Security
• Secure connection
• Client-side encryption
• Extended ACL
• Delegation/Federation
• Admin delegation
RestFS Key Words
• Cell: collection of servers
• Bucket: virtual container, hosted by one or more servers
• Object: entity (file, dir, …) contained in a bucket
Object
[Diagram: an Object split into Data and Metadata. Data: Segments made of Block 1, Block 2, …, Block n, each labelled with a Hash and a Serial. Metadata: Attributes set by user, Properties, ACL, Ext Properties]
Bucket Discovery
[Diagram: Client, DNS lookup, Cell 1 (N servers), Cell 2 (N servers). Flow labels: bucket name -> cell RL IP list; bucket name -> server list + load info (server, priority, type); server list -> priority list]
RestFS Cache client side
[Diagram: client-side components DNS, RestFS Metadata, RestFS Block, Federated Auth, Callbacks, Metadata cache, Block cache. Labels: Persistent, Cache, Temporary. Stored items: Resource Locator, Server list, Tokens, Pub/Sub list, Locks]
Server Architecture
[Diagram: server layers Interface, Managers, Drivers, Plugin.
Services: S3, Meta Service, RL Service, Callback Service, Auth Service, Token Service, Block Service, Locks Service.
Managers: Storage Mgr, Auth Manager, Meta Mgr, Resource Manager, Callbacks Manager, Token Manager, Locks Mgr.
Drivers: Storage Driver, Token Driver, Meta Driver, Auth Driver, Callbacks Driver, Resource Driver, Locks Driver.
Other labels: RestFS RPC, Distributed Cache, Resource Locator, Backends, Token Sub/Pub, Auth]
Mounting
[Diagram: two cells, each showing Cell -> Bucket N -> Objects]
Object Versioning
[Diagram: Cell -> Bucket N -> chain of Objects]
The segment contains the diff to the upstream object.
Each object knows the previous and the next one. The current object knows the previous and the last.
Block Storage
Backend: Consistent Hashing
Number of keys to move when a node is added or removed:
Keys / Nodes = keys to relocate
Blocks are collected in shards
http://highlyscalable.wordpress.com/2012/09/18/distributed-algorithms-in-nosql-databases/
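The relocation property of consistent hashing can be tried out with a small ring. This is a generic textbook sketch, not the RestFS backend: the class name, the use of SHA-1 for ring positions, the 64 virtual nodes per server and the node names are all assumptions chosen for the example.

```python
import bisect
import hashlib


class HashRing:
    """Consistent-hash ring with virtual nodes."""

    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes
        self._points = []   # sorted ring positions
        self._owner = {}    # ring position -> node name
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(label):
        return int(hashlib.sha1(label.encode("utf-8")).hexdigest(), 16)

    def add(self, node):
        for i in range(self.vnodes):
            point = self._hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove(self, node):
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: self._owner[p] for p in self._points}

    def lookup(self, key):
        # The first ring position clockwise from the key's hash owns the key.
        idx = bisect.bisect(self._points, self._hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"block-{i}" for i in range(1000)]
before = {k: ring.lookup(k) for k in keys}
ring.add("node-d")
moved = [k for k in keys if ring.lookup(k) != before[k]]
print(len(moved), "of", len(keys), "keys moved, all to node-d")
```

Only keys that now fall on the new node change owner, which is why the expected relocation cost is about Keys / Nodes rather than a reshuffle of the whole key space.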
Block Storage
AFS
- A volume stores a range of hashes
- Each chunk is written to 3 volumes
- Server
PISA
- Cluster of nodes
- Communication based on zmq
- Consensus based on Raft
CEPH
- Uses CEPH nodes directly
Backend: Storage
3 copies
Configurable read and write consistency level and security:
- 2W1R
- 2W2R
- 1W1R
- …
Monitoring of neighbours in small clusters of 3 nodes (gossip)
Mini-cluster election: key space reclaim for replica coordination, leave/join cluster
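The consistency labels can be compared with the usual quorum rule: with N copies, a read is guaranteed to overlap the latest acknowledged write when W + R > N. The sketch below applies that standard rule to the labels on the slide; the rule itself and the helper names are not taken from the deck.

```python
def parse_level(level: str):
    """Parse a label such as '2W1R' into (writes, reads)."""
    writes, rest = level.upper().split("W")
    return int(writes), int(rest.rstrip("R"))


def read_overlaps_write(level: str, copies: int = 3) -> bool:
    """True when every read set must intersect every write set (W + R > N)."""
    writes, reads = parse_level(level)
    return writes + reads > copies


for label in ("2W1R", "2W2R", "1W1R"):
    print(label, read_overlaps_write(label))
```

Under this rule only 2W2R gives overlapping read and write sets with 3 copies; 2W1R and 1W1R trade that guarantee for lower latency.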
Protocols
Europython 2013
RestFS Protocol
WebSocket is a web technology providing bi-directional, full-duplex communication channels over a single TCP connection. It uses the standard HTTP/HTTPS ports.
GET /mychat HTTP/1.1
Host: server.example.com
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: x3JJHMbDL1EzLkh9GBhXDw==
Sec-WebSocket-Protocol: chat
Sec-WebSocket-Version: 13
Origin: http://example.com
JSON-RPC is a lightweight remote procedure call protocol similar to XML-RPC. It is designed to be simple, and it is simple to convert into a Python dict.
--> { "method": "readBlock", "params": ["…"], "id": 1}
<-- { "result": [..], "error": null, "id": 1}
BSON, short for Binary JSON, is a binary-encoded serialization of JSON-like documents. BSON can be compared to binary interchange formats.
{"hello": "world"} → "\x16\x00\x00\x00\x02hello\x00\x06\x00\x00\x00world\x00\x00"
*Compression is a long story…
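The BSON bytes shown for {"hello": "world"} can be reproduced by hand, which makes the layout easier to read: a 4-byte little-endian document length, then one element per key (type byte, NUL-terminated name, value), then a closing NUL. The encoder below is a deliberately tiny sketch that handles string values only; a real project would use a BSON library, and the function name is made up.

```python
import struct


def bson_encode_strings(doc):
    """Encode a flat dict of str -> str as a BSON document (string elements only)."""
    body = b""
    for name, value in doc.items():
        data = value.encode("utf-8") + b"\x00"      # string value, NUL-terminated
        body += (b"\x02"                            # element type 0x02 = UTF-8 string
                 + name.encode("utf-8") + b"\x00"   # element name as a C string
                 + struct.pack("<i", len(data))     # string length, including the NUL
                 + data)
    # The total length counts the 4-byte length prefix and the trailing NUL.
    return struct.pack("<i", 4 + len(body) + 1) + body + b"\x00"


print(bson_encode_strings({"hello": "world"}))
```

For this document the length prefix is 22 bytes (0x16), matching the first byte of the example.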
Protocols Metadata
Read blocks: requests are collected per segment and sent in parallel, one request per segment.
--> { "method": "readBlock", "params": ["bucket_name: test, segment:1, blocks:[1,2,3,4]"], "id": 1}
Check cached data: ask for the segment version.
--> { "method": "getSegmentVer", "params": ["bucket_name: test, segment:1"], "id": 1}
<-- { "result": [ver: 1335519328.091779], "error": null, "id": 1}
Block hash list for a specific segment:
--> { "method": "getSegmentHash", "params": ["bucket_name: test, segment:1"], "id": 1}
<-- { "result": [1:16db0420c9cc29a9d89ff89cd191bd2045e47378, 2:9bcf720b1d5aa9b78eb1bcdbf3d14c353517986c, …], "error": null, "id": 1}
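A client for these calls needs little more than the `json` module. The sketch below is an assumption-laden illustration: it passes the parameters as a named object rather than the single string shown on the slide, adds the JSON-RPC 2.0 `jsonrpc` member, and uses hypothetical helper names. Only the method names (`readBlock`, `getSegmentVer`) and the version value come from the slides.

```python
import itertools
import json

_ids = itertools.count(1)   # request id generator


def make_request(method, **params):
    """Build a JSON-RPC request string with named parameters."""
    return json.dumps({"jsonrpc": "2.0", "method": method,
                       "params": params, "id": next(_ids)})


def parse_response(raw):
    """Return the result of a JSON-RPC response, or raise if it carries an error."""
    message = json.loads(raw)
    if message.get("error") is not None:
        raise RuntimeError(message["error"])
    return message["result"]


def cache_is_fresh(cached_version, server_version):
    """Cached segment data is reusable when the server version has not moved on."""
    return cached_version >= server_version


print(make_request("readBlock", bucket_name="test", segment=1, blocks=[1, 2, 3, 4]))
print(cache_is_fresh(1335519328.091779,
                     parse_response('{"result": 1335519328.091779, "error": null, "id": 1}')))
```

A client would call `getSegmentVer` first and issue `readBlock` only for segments whose cached version is stale.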
NOSQL DB
Redis performance
$ ./redis-benchmark -r 1000000 -n 2000000 -t get,set,lpush,lpop -P 16 -q
SET: 552028.75 requests per second
GET: 707463.75 requests per second
LPUSH: 767459.75 requests per second
LPOP: 770119.38 requests per second
Code
Pluggable
Interface, dynamic load
Protocol
• Connection handler
• Data transcoding
Service
• High-level operations across multiple functions (like locking)
• Integrity operations/transactions
Manager
• Operations handler for a specific area (e.g. metadata)
• Splits info into sub-info
Driver
• Read and write operations to the storage system, agnostic operations
Support
Bucket
Bucket Name
zebra
Property (default parameters, Python dict)
segment_size = 512
block_size = 16k
max_read = 1000
Bucket_size = 0
Bucket_quota = 10000
storage_class = STANDARD
compression = none
logging = enable
bucket_type = fs
…
The bucket has many properties. The property element is a collection of object information; from it you can retrieve the default values for the bucket (logging level, security level, etc.).
Properties objects:
- Property
- Property Ext
- Property ACL
- Property Stats
Bucket type:
- Filesystem: the bucket is used as a filesystem
- Logging: logging of the operations done on the specific bucket
- Replica RO: bucket shadow replication
- …
- Custom definition
Objects
MetaData Properties
Object
zebra.c1d2197420bd41ef24fc665f228e2c76e98da247
- "zebra" is the bucket name, the part after the dot is the object id. The special id bucket_name.ROOT is the starting point of the file system.
Property (object defaults)
Object_type = data
segment_size = 512
block_size = 16k
content_type =
md5 = ab86d732d11beb65ed0183d6a87b9b0 (object hash, to be replaced by a Merkle tree)
max_read = 1000
storage_class = STANDARD
compression = none
Name = “my first object”
Object_size = 10000
Object_prev = zebra.c1d2197420bd41ef24fc665f228e2c76e98dartg (pointer to the previous object)
…
vers: 1335519328.091779 (object version)
Object type:
- Data: contains files
- Folder: special object that contains other objects
- Mount point: contains the name of the buckets
- Link: contains the name of the objects
- Immutable: gold image
- Custom: defined by the users
Metadata Segment
Segment (Python dict)
Segment-1
Segment-id
1:16db0420c9cc29a9d89ff89cd191bd2045e47378
2:9bcf720b1d5aa9b78eb1bcdbf3d14c353517986c
3:158aa47df63f79fd5bc227d32d52a97e1451828c
4:1ee794c0785c7991f986afc199a6eee1fa4
5:c3c662928ac93e206e025a1b08b14ad02e77b29d
…
vers:1335519328.091779
…
Segment element: block position : integrity hash
Version based on timestamp + incremental counter, useful for vector-clock conflict resolution
Total segments = Data_size / (block_size * segment_size)
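The segment formula can be checked with the default values from the property slides. The sketch assumes that `block_size = 16k` means 16 KiB and that `segment_size = 512` counts blocks per segment; it also rounds up, since a partly filled block or segment still has to exist. The function name is made up for the example.

```python
def layout(data_size, block_size=16 * 1024, segment_size=512):
    """Return (blocks, segments) needed to store data_size bytes."""
    blocks = -(-data_size // block_size)     # ceiling division
    segments = -(-blocks // segment_size)    # same as ceil(data_size / (block_size * segment_size))
    return blocks, segments


print(layout(10000))                 # the Object_size used on the properties slide
print(layout(100 * 1024 * 1024))     # a 100 MiB object
```

With these defaults one segment covers 8 MiB, so the 10000-byte example object fits in a single block of a single segment.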
RestFS ID
• Bucket id: plain-text DNS name
• Object id: randomly generated UUID
• Segment id and block id: based on the position of the content
• Chunk data on the storage: SHA-1 hash of the concatenation Bucket.object.segment.block_id
The object id is unique inside the bucket; combined with the bucket name, the id is a UUID.
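The storage id of a chunk can be derived with `hashlib`. The sketch assumes the four parts are joined with dots and encoded as UTF-8 before hashing, which is one reading of "Bucket.object.segment.block_id"; the function name and the sample object id are illustrative.

```python
import hashlib


def chunk_id(bucket, object_id, segment, block):
    """SHA-1 hex digest of 'bucket.object.segment.block'."""
    name = f"{bucket}.{object_id}.{segment}.{block}"
    return hashlib.sha1(name.encode("utf-8")).hexdigest()


# Hypothetical object id; segment 1, blocks 1 and 2.
print(chunk_id("zebra", "c1d2197420bd41ef24fc665f228e2c76e98da247", 1, 1))
print(chunk_id("zebra", "c1d2197420bd41ef24fc665f228e2c76e98da247", 1, 2))
```

Because the id depends only on the position of the block, any client can compute where a chunk lives without asking a central index.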
Cache
• Server side
• Client side
• Distributed cache
• Publish/Subscribe
• Pattern matching
• Persistent cache
Security
Protocol
• SSL protocol
Authentication
• Token for devices (enrollment)
• Session token for users
• External password provider
Data Integrity
• Encryption at block level
Authorization
• Extended ACL based on NFSv4 ACL
• Admin delegation at the bucket level
NOSQL DB
Redis as much as possible
Main characteristics
- Fast
- Stores hashes of hashes
- Atomic operations
- Pub/Sub primitives
Hash of hash
• Primary key: object id, e.g. zebra.c1d2197420bd41ef24fc665f228e2c76e98da247 (dot format to simplify subscription operations, i.e. callbacks)
• Subkey: name of the properties, e.g. GLP
• Value: serialized Python dict (BSON in the future), e.g. 00101010101010
* The version and the hash of the object each have a dedicated subkey, with no serialization
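The primary key / subkey / value layout can be tried without a running server by using a small in-memory stand-in. Everything here is an assumption made for illustration: the `HashStore` class only mimics the shape of the Redis hash commands (with the redis-py client the corresponding calls would be `hset(name, key, value)` and `hget(name, key)`), JSON stands in for the unspecified dict serialization, and the helper names and the shortened object id are made up.

```python
import json


class HashStore:
    """In-memory stand-in for a Redis hash store: primary key -> subkey -> value."""

    def __init__(self):
        self._data = {}

    def hset(self, name, key, value):
        self._data.setdefault(name, {})[key] = value

    def hget(self, name, key):
        return self._data.get(name, {}).get(key)


def save_element(store, bucket, object_id, element, values):
    """Store one metadata element as a serialized dict under 'bucket.object_id'."""
    store.hset(f"{bucket}.{object_id}", element, json.dumps(values))


def load_element(store, bucket, object_id, element):
    raw = store.hget(f"{bucket}.{object_id}", element)
    return None if raw is None else json.loads(raw)


def save_version(store, bucket, object_id, version):
    """The version gets its own subkey and is stored without serialization."""
    store.hset(f"{bucket}.{object_id}", "vers", version)


store = HashStore()
save_element(store, "zebra", "c1d2", "property", {"storage_class": "STANDARD"})
save_version(store, "zebra", "c1d2", "1335519328.091779")
print(load_element(store, "zebra", "c1d2", "property"), store.hget("zebra.c1d2", "vers"))
```

Keeping the version outside the serialized dict means a freshness check reads one small field instead of decoding the whole element.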
What we are using
Module          Software
Storage         Filesystem, DHT (Kademlia, Pastry*)
Metadata        SQL (MySQL, SQLite), NoSQL (Redis)
Auth            OAuth (Google, Twitter, Facebook), Kerberos*, internal
Protocol        WebSocket
Message format  JSON-RPC 2.0, Amazon S3
Encoding        Plain, BSON
Callback        Subscribe/Publish WebSocket/Redis, Async I/O TornadoWeb, ZeroMQ*
Hash            SHA-XXX, MD5-XXX, AES
Encryption      SSL, ciphers supported by crypto++
Discovery       DNS, file based*
* planned
What is it good for?
User
• Home directory
• Remote/Internet disks
Application
• Object storage
• Shared space
• Virtual machine
Distribution
• CDN (multimedia)
• Data replication
• Disaster recovery
Backend: Storage
Transport layer: ZeroMQ
Storage: compressed data