NoSQL on Windows Azure Table Storage - Vítor Tomaz

In this session we will analyze the characteristics of this service and give a brief introduction to the architecture that supports it. We will look at the considerations to keep in mind when creating and using this type of storage, analyzing the impact that design decisions have on performance and scalability targets. We will also show some usage examples in different scenarios, including some optimizations that can improve performance. Comunidade NetPonto, the .NET community in Portugal! http://netponto.org


http://netponto.org
37th In-Person Meeting @ Lisbon - 23/03/2013

Vítor Tomaz
ISEL – LEIC
SAFIRA

• NetPonto
• AzurePT
• Revista Programar
• Portugal@Programar
• SQLPort
• MSDN

Agenda

• Characteristics & Concepts
• Service Architecture
• Scalability Targets
• Non-Relational Data Modeling
• Best Practices

Windows Azure Storage Characteristics
• A “pay for what you use” cloud storage system
• Durable: stores multiple replicas of your data
  • Local replication: synchronous replication before returning success
  • Geo replication: replicated to a data center at least 400 miles away; asynchronous replication after returning success to the user
• Available: multiple replicas are placed to provide fault tolerance
• Scalable: automatically partitions data across servers to meet traffic demands
• Strong consistency: default behavior is consistent reads once data is committed

Windows Azure Storage Abstractions

• Tables: Structured storage. A table is a set of entities; an entity is a set of properties.
• Queues: Reliable storage and delivery of messages for an application.
• Blobs: Simple named files along with metadata for the file.
• Drives: Durable NTFS volumes for Windows Azure applications to use. Based on Blobs.

Storage Libraries in Many Languages

Windows Azure Storage Account
• User-specified, globally unique account name
• Can choose the geo-location to host the storage account:
  • US: North Central US, South Central US, West US, East US
  • Europe: Northern Europe, Western Europe
  • Asia: East Asia, South East Asia

Table Storage Concepts
Account > Table > Entity
• Account: contoso
  • Table: customers
    • Entity: Name = …, Email = …
    • Entity: Name = …, EMailAdd = …
  • Table: photos
    • Entity: Photo ID = …, Date = …
    • Entity: Photo ID = …, Date = …

No Fixed Schema

FIRST     LAST      BIRTHDATE
Wade      Wegner    2/2/1981
Nathan    Totten    3/15/1965
Nick      Harris    May 1, 1976

Extra property, not set on every entity: FAV SPORT = Canoeing

Table Details

Table
• Create, Query, Delete
• Tables can have metadata
• Not an RDBMS!

Entities
• Insert
• Update
  • Merge – Partial update
  • Replace – Update entire entity
• Upsert
• Delete
• Query
• Entity Group Transactions: multiple CUD operations in a single atomic transaction
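The difference between Merge and Replace can be sketched with plain dictionaries. This is only an illustration of the semantics, not the storage client API; the function names `merge` and `replace` and the sample entity are assumptions made for the example.

```python
def merge(stored: dict, update: dict) -> dict:
    """Merge: partial update. Properties missing from `update` are kept."""
    result = dict(stored)
    result.update(update)
    return result


def replace(stored: dict, update: dict) -> dict:
    """Replace: whole-entity update. Properties missing from `update` are dropped."""
    keys = {"PartitionKey": stored["PartitionKey"], "RowKey": stored["RowKey"]}
    return {**update, **keys}


# Hypothetical entity used only for this illustration.
stored = {"PartitionKey": "contoso", "RowKey": "42",
          "Name": "Ana", "Email": "ana@example.com"}
change = {"Email": "ana@contoso.example"}

merged = merge(stored, change)      # Name survives, Email changes
replaced = replace(stored, change)  # Name is gone, only keys + Email remain
```

Merge suits clients that only know part of the entity; Replace suits clients that own the full entity shape.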

Entity Properties
• Entity can have up to 255 properties
• Up to 1MB per entity

Mandatory Properties for every entity
• PartitionKey & RowKey (only indexed properties)
  • Uniquely identifies an entity
  • Defines the sort order
• Timestamp
  • Optimistic Concurrency
  • Exposed as an HTTP ETag

No fixed schema for other properties
• Each property is stored as a <name, typed value> pair
• No schema stored for a table
• Properties can be the standard .NET types: String, binary, bool, DateTime, GUID, int, int64, and double
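Optimistic concurrency with an ETag can be sketched with a toy in-memory table. Everything here (`InMemoryTable`, `PreconditionFailed`, the counter-based ETag values) is hypothetical and only mimics the idea: an update succeeds only if the caller still holds the latest version.

```python
import itertools


class PreconditionFailed(Exception):
    """The supplied ETag no longer matches the stored entity."""


class InMemoryTable:
    """Toy stand-in for a table; not the real service or client library."""

    def __init__(self):
        self._rows = {}                      # (PartitionKey, RowKey) -> (etag, properties)
        self._versions = itertools.count(1)  # source of fresh ETag values

    def insert(self, pk, rk, props):
        if (pk, rk) in self._rows:
            raise KeyError("entity already exists")
        etag = str(next(self._versions))
        self._rows[(pk, rk)] = (etag, dict(props))
        return etag

    def get(self, pk, rk):
        etag, props = self._rows[(pk, rk)]
        return etag, dict(props)

    def replace(self, pk, rk, props, if_match):
        current_etag, _ = self._rows[(pk, rk)]
        if if_match != current_etag:
            raise PreconditionFailed("entity was modified by another writer")
        new_etag = str(next(self._versions))
        self._rows[(pk, rk)] = (new_etag, dict(props))
        return new_etag


table = InMemoryTable()
table.insert("contoso", "42", {"Name": "Ana"})
etag_a, entity_a = table.get("contoso", "42")  # writer A reads
etag_b, entity_b = table.get("contoso", "42")  # writer B reads the same version
table.replace("contoso", "42", {"Name": "Ana M."}, if_match=etag_a)
try:
    table.replace("contoso", "42", {"Name": "Anna"}, if_match=etag_b)
    conflict = False
except PreconditionFailed:
    conflict = True  # writer B must re-read and retry
```

The losing writer re-reads the entity, reapplies its change, and tries again with the fresh ETag.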

Scalability

• Partition: range of entities with the same partition key value
• Partitions are fanned out based on load
• They can be condensed when load decreases
• Reads are load balanced against three replicas

[Diagram: partitions P1, P2, … Pn distributed across Server 1, Server 2, Server 3]

Service Architecture

Storage Stamp Architecture

[Diagram: an incoming write request enters the Front End Layer (FE nodes); the Partition Layer contains the Partition Servers, the Partition Master and the Lock Service; the Stream Layer contains the Extent Nodes (EN); an Ack is returned for the write.]

Windows Azure Storage - Architecture

• PartitionKey: unique identifier for the partition within a given table.
• RowKey: unique identifier for an entity within a given partition.
• Both keys matter!
  • Together they define the primary key
  • They form a single clustered index

Scalability

• Slowest: no Partition Key, no Row Key
• Slower: only Partition Key, no Row Key
• Very fast: Partition Key + Row Key
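Why the three cases differ can be sketched with a sorted in-memory list standing in for the single clustered index on (PartitionKey, RowKey). The data and function names are made up for the example; each function also reports how many entities it had to read.

```python
from bisect import bisect_left

# Hypothetical entities: (PartitionKey, RowKey, Message), kept sorted by key.
entities = sorted([
    ("alice", "0001", "hello"),
    ("alice", "0002", "azure tables"),
    ("bob", "0001", "nosql"),
    ("carol", "0001", "azure"),
])
keys = [(pk, rk) for pk, rk, _ in entities]


def point_query(pk, rk):
    """Partition Key + Row Key: seek straight to one entity."""
    i = bisect_left(keys, (pk, rk))
    found = entities[i] if i < len(keys) and keys[i] == (pk, rk) else None
    return found, 1


def partition_scan(pk, predicate):
    """Only Partition Key: seek to the partition, then read all of it."""
    i = bisect_left(keys, (pk,))
    hits, touched = [], 0
    while i < len(keys) and keys[i][0] == pk:
        touched += 1
        if predicate(entities[i]):
            hits.append(entities[i])
        i += 1
    return hits, touched


def table_scan(predicate):
    """No keys: every entity in every partition is read."""
    return [e for e in entities if predicate(e)], len(entities)


has_azure = lambda e: "azure" in e[2]
found, point_cost = point_query("bob", "0001")
in_partition, partition_cost = partition_scan("alice", has_azure)
everywhere, table_cost = table_scan(has_azure)
```

With real data the gap widens: the point query stays at one entity while the table scan grows with the whole table.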

Table Storage – Key Points

1000 Entities
• A single response returns at most 1000 entities. Any query other than one that specifies exactly the PartitionKey and RowKey (and nothing else) needs to handle continuation tokens: http://tinyurl.com/ContToken

Continuation Tokens
• Next Table
• Next PartitionKey
• Next RowKey

Transient Fault Handling
• Network
• Hardware
• DataCenter
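Handling continuation tokens amounts to looping until the service stops returning one. A minimal sketch, assuming a hypothetical `query_page` function that plays the role of the server; the real tokens carry the next PartitionKey/RowKey rather than a list offset.

```python
def query_page(rows, page_size, token=None):
    """Hypothetical server call: returns one page and a token, or None when done."""
    start = 0 if token is None else token
    page = rows[start:start + page_size]
    end = start + page_size
    next_token = end if end < len(rows) else None
    return page, next_token


def query_all(rows, page_size=1000):
    """Client loop: keep asking until no continuation token comes back."""
    token = None
    while True:
        page, token = query_page(rows, page_size, token)
        yield from page  # a page may be short, or even empty
        if token is None:
            break


data = list(range(2500))
fetched = list(query_all(data))
```

The loop must rely on the token, not on the page size: a short or empty page with a token still means there is more data to fetch.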

Scalability Targets

Scalability Targets - Storage Account
Storage account level targets by end of 2012. Applies to accounts created after June 7th 2012.
• Capacity – Up to 200 TBs
• Transactions – Up to 20,000 entities per second
• Bandwidth for a Geo Redundant storage account
  • Ingress - up to 5 Gibps
  • Egress - up to 10 Gibps
• Bandwidth for a Locally Redundant storage account
  • Ingress - up to 10 Gibps
  • Egress - up to 15 Gibps

Scalability Targets – Partition
Partition level targets by end of 2012. Applies to accounts created after June 7th 2012.
• Single table partition = Account Name + Table Name + PartitionKey value
• Up to 2,000 entities per second
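As a back-of-the-envelope illustration of how the two targets interact, the sketch below computes the minimum number of partitions a workload must be spread over. It assumes perfectly even load and treats the targets as hard limits, which is a simplification; the function name is made up.

```python
import math

PARTITION_TARGET = 2_000   # entities per second, single table partition
ACCOUNT_TARGET = 20_000    # entities per second, whole storage account


def min_partitions(required_per_second: int) -> int:
    """Smallest number of evenly loaded partitions for the given throughput."""
    if required_per_second > ACCOUNT_TARGET:
        raise ValueError("exceeds a single account; more accounts are needed")
    return math.ceil(required_per_second / PARTITION_TARGET)


needed = min_partitions(5_000)  # 5,000 / 2,000 rounds up to 3 partitions
```

Reaching the account-level target therefore takes at least ten busy partitions; a single hot partition caps out at a tenth of it.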

Non-Relational Data Modeling

Why Partition

Data Volume (too many bytes)

Work Load (too many transactions/second)

Cost (using different cost storage)

Elasticity (just in time partitioning for high load periods)

Choosing a Partition Key

Natural Keys
• Country
• First letter, last name
• Date

Mathematical
• Hash functions
• Modulo operator

Lookup Based
• Lookup table to resolve value to partitions

Using Modulo

The remainder of a division
Nice properties for partitioning:
• Given two positive integers M and N
• M mod N will return a number between 0 and N-1

Want equi-sized partitions?
• Given an appropriate distribution of M we will get N ‘equally full’ buckets.
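A small sketch of modulo partitioning, assuming sequential integer IDs and N = 4. The zero-padded key format is an arbitrary choice for the example.

```python
from collections import Counter


def partition_key(m: int, n: int) -> str:
    """Map a positive integer M to one of N buckets, named 0000 .. N-1."""
    return f"{m % n:04d}"


# 1000 sequential IDs spread over 4 partitions.
counts = Counter(partition_key(i, 4) for i in range(1000))
```

Sequential IDs are an ideal distribution of M; skewed inputs (for example, IDs that are all multiples of N) would pile into one bucket.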

Using Hash Values

• Using a hash function projects one distribution into another
• Use a hash function that projects a random distribution
• Do NOT use a cryptographic hash function
• Be careful if using Object.GetHashCode()
  • Boxed types may return a different value from their un-boxed equivalent
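A sketch of hash-based bucketing using CRC32, a cheap non-cryptographic checksum from the Python standard library. CRC32 is a stand-in chosen for this example, not a recommendation from the talk. Python has its own counterpart to the GetHashCode() caveat: the built-in `hash()` of a string is salted per process, so it must not be used for keys that are persisted.

```python
import zlib


def bucket_for(key: str, buckets: int) -> int:
    """Stable bucket for a string key: same input, same bucket, in every process."""
    return zlib.crc32(key.encode("utf-8")) % buckets


# Hypothetical user IDs spread over 8 partitions.
assignments = {f"user-{i}": bucket_for(f"user-{i}", 8) for i in range(1000)}
```

The bucket number can then be used as the PartitionKey, or as a prefix to it.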

Partition Stability Over Time

May need to change the partitioning scheme. Two options:
1. Re-partition all data
2. Version the partitioning scheme

e.g. <Version><PartitionKey>
<v1><A3E567D7D8C68789>
<v2><A8B978C8B6D77836>

where
v1 = GUID mod 4
v2 = GUID mod 10
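A sketch of the versioned scheme, assuming v1 = GUID mod 4 and v2 = GUID mod 10. The `v1-3` key format and the helper names are made up for the example; the slide only shows the <Version><PartitionKey> shape.

```python
import uuid

# Scheme version -> number of buckets (assumed values for this illustration).
SCHEMES = {"v1": 4, "v2": 10}


def versioned_key(guid: uuid.UUID, version: str) -> str:
    """Partition key that records which scheme produced it."""
    return f"{version}-{guid.int % SCHEMES[version]}"


def candidate_keys(guid: uuid.UUID) -> list:
    """Every partition an entity may live in, one per scheme version."""
    return [versioned_key(guid, v) for v in SCHEMES]


example = uuid.UUID(int=13)
new_writes_go_to = versioned_key(example, "v2")
reads_must_check = candidate_keys(example)
```

New writes use the latest version only, while reads check every version until old data has been migrated or has expired. That is the cost of not re-partitioning everything at once.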

E.g. Tweet Storage

Tweet: TweetID, UserID, DateTimeStamp, Message

With an RDBMS you’d probably start with something like this:
SELECT * FROM Tweet WHERE Message LIKE '%SearchTerm%'

E.g. Tweet Storage
You’d soon realize that LIKE isn’t so wonderful.

You’d do a little normalization

[Diagram: tables with columns (Message, TweetID, WordID) and (WordID, Word (IX)); then (Message, TweetID, Word (IX))]

E.g. Tweet Storage

With Tables we go the whole way
• Table 1: UserID (PK), TweetID (RK), DateTimeStamp, Message
• Table 2: Word (PK), TweetID (RK), UserID, DateTimeStamp, Message

E.g. Tweet Storage

We may create multiple indexes
• Table 1: UserID (PK), TweetID (RK), DateTimeStamp, Message
• Table 2: UserID (PK), TweetID (RK), UserID, DateTimeStamp, Message

Entity Group Transactions

Modeling In Tables

• Currently no secondary indexes (coming)
  • Be careful to minimize cross-partition queries
• Build indexes yourself
  • Concentrate on useful partition keys
• If associated data is small enough
  • Save additional queries
  • Duplicate data with each index
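The "build indexes yourself" idea from the tweet example can be sketched with two in-memory tables, one partitioned by UserID and one by Word, each holding a full copy of the tweet. The dictionaries stand in for tables, and the whitespace tokenizer is a deliberate oversimplification.

```python
from collections import defaultdict

# Each "table" maps PartitionKey -> {RowKey: entity}.
tweets_by_user = defaultdict(dict)  # PartitionKey = UserID, RowKey = TweetID
tweets_by_word = defaultdict(dict)  # PartitionKey = Word,   RowKey = TweetID


def store_tweet(tweet_id, user_id, timestamp, message):
    """Write the tweet once per index, duplicating the data each time."""
    entity = {"TweetID": tweet_id, "UserID": user_id,
              "DateTimeStamp": timestamp, "Message": message}
    tweets_by_user[user_id][tweet_id] = entity
    for word in set(message.lower().split()):
        tweets_by_word[word][tweet_id] = dict(entity)


def search(word):
    """Single-partition query: no LIKE, no join, no second lookup."""
    matches = tweets_by_word.get(word.lower(), {})
    return [matches[tweet_id] for tweet_id in sorted(matches)]


store_tweet("0001", "u1", "2013-03-23T10:00:00Z", "NoSQL on Azure")
store_tweet("0002", "u2", "2013-03-23T10:05:00Z", "Azure table storage")
results = search("azure")
```

Writes get more expensive and the copies must be kept in sync, but each search term is answered from a single partition.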

Best Practices

Common Design & Scalability
Common Settings
• Turn off Nagling & Expect 100 (.NET – ServicePointManager)
• Set connection limit (.NET – ServicePointManager.DefaultConnectionLimit)
• Turn off proxy detection when running in the cloud (.NET – Config: autodetect setting in proxy element)
• Design your application so that requests are distributed across your range of partition keys, to avoid hotspots
  • Avoid the Append/Prepend pattern: an access pattern lexically sorted by partition key values
• Perform one-time operations at startup rather than on every request
  • Creating containers/tables/queues which should always exist
  • Setting required constant ACLs on container/table/queue

Common Design & Scalability
• Turn on analytics & take control of your investigations – Logging and Metrics
  • Who deleted my container? – Look at the client IP for the delete container request
  • Why has my request latency increased? – Look at E2E vs. server latency
  • What are my user demographics? – Use client request id to trace requests & client IP
  • How can I tune my service usage? – Use metrics to analyze API usage & peak traffic stats
  • And many more…
• Use an appropriate retry policy for intermittent errors
  • Storage client uses exponential retry by default
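An exponential retry policy can be sketched as below. This is a generic illustration, not the storage client's own policy: `TransientError`, the attempt count, the base delay and the jitter amount are all assumptions.

```python
import random
import time


class TransientError(Exception):
    """Stand-in for an intermittent failure (timeout, server busy, ...)."""


def with_retries(operation, max_attempts=4, base_delay=0.1, sleep=time.sleep):
    """Call `operation`, doubling the wait after each transient failure."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))  # small jitter


# Usage: an operation that fails twice and then succeeds.
calls = {"count": 0}

def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("try again")
    return "ok"

waits = []
result = with_retries(flaky, sleep=waits.append)  # record waits instead of sleeping
```

Only errors that are actually transient should be retried; a request that is invalid will fail the same way on every attempt.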

Storage Accounts
• Collocate storage accounts with your compute roles, as egress is free within the same region
• Use multiple storage accounts to:
  • achieve targets that exceed a single storage account
  • achieve client proximity
• Map multiple clients to the same storage account
  • Use different containers/tables/queues instead of an account for each customer

Storage Accounts

Design to add more accounts as needed

Use different account for Windows Azure Diagnostics

Choose locally redundant storage if:
• Data can be restored after a major disaster
• There are geographical boundary constraints on where data can be stored

WA Table Client - Service Layer
• Option 1 – WCF Data Services
  • Good for fixed schema used like relational tables
  • Does not require control over serialization/deserialization
• Option 2 – Table Service Layer’s Dynamic Table Entity
  • Entity containing a Dictionary of Key-Value properties
  • Used when the schema is not known, for example: Explorers
  • Performance!
• Option 3 – Table Service Layer’s POCO
  • POCO derives from ITableServiceEntity or TableServiceEntity
  • Control over serialization and deserialization – make your data dance to your tune!
  • ETag maintained with Entities - easy to update!
  • Performance!

Performance - Storage Client Library 2.0

[Chart: Batch Stress Scenario, Per Entity Latencies. Series: Delete, Query, Insert (time in ms), Processor Time (s), Test Duration (s). Compared: Storage Client 1.7; Storage Client 2.0 : DataServices; Storage Client 2.0 : Reflection; Storage Client 2.0 : No Reflection.]

Faster NoSQL table access
• Up to 72.06% reduction in execution time
• Up to 31.92% reduction in processor time
• Up to 69-90% reduction in latency

Performance - Storage Client Library 2.0

[Chart: Large Blob Scenario (256MB), Resource Utilization. Total Test Time (s) and Total Processor Time (s) for Storage Client 1.7 vs. Storage Client 2.0.]

[Chart: Large Blob Scenario (256MB), Latencies. Upload and Download time (s) for Storage Client 1.7 vs. Storage Client 2.0.]

Faster uploads and downloads
• 31.46% reduction in processor time
• Up to 22.07% reduction in latency

Take Away

Partitioning data is key to cloud-scale apps

Horizontally Partition for Scale Out

Choose appropriate partition keys

Table storage requires a different approach to data modeling

Don’t be afraid to aggressively de-normalize and duplicate data

Questions?

Upcoming in-person meetings

• 23/03/2013 – March (Lisbon)
• 20/04/2013 – April (Lisbon)
• 22/06/2013 – June (Lisbon)
• ??/??/2013 – ? (Porto)
• ??/??/2013 – ? (Coimbra)
Save these dates in your calendar! :)

“GOLD” Sponsor

Twitter: @PTMicrosoft
http://www.microsoft.com/portugal

“Bronze” Sponsors

Thank you!
Vítor Tomaz
vitorbstomaz AT gmail.com
http://twitter.com/vitortomaz
