Azure DocumentDB: Advanced Features for Large-Scale Apps
{ "name": "Andrew Liu", "e-mail": "[email protected]", "twitter": "@aliuy8"}
First… a Rant
managing servers makes me cry
structuring data is really hard
managing schema and indexes makes me angry
DocumentDB
NoSQL… as a service!
Let's talk about…
• A quick recap on NoSQL
• Big Data Challenges
• Partitioning, Data Modeling, Stored Procedures
• Q&A
• NoSQL is a buzzword
• NoSQL is varied
  • Key-value
  • Wide-column
  • Graph
  • Document-oriented
NoSQL in a nutshell
{ "name": "SmugMug", "permalink": "smugmug", "homepage_url": "http://www.smugmug.com", "blog_url": "http://blogs.smugmug.com/", "category_code": "photo_video", "products": [ { "name": "SmugMug", "permalink": "smugmug" } ], "offices": [ { "description": "", "address1": "67 E. Evelyn Ave", "address2": "", "zip_code": "94041", "city": "Mountain View", "state_code": "CA", "country_code": "USA", "latitude": 37.390056, "longitude": -122.067692 } ]}
Perfect for these
Documents
Schema-agnostic JSON store for hierarchical and de-normalized data at scale
Not these documents
Azure DocumentDB
Elastic, limitless scale
• Millions of RPS
• Many TBs of data
• Transparent partitioning
Guaranteed low latency
• <10 ms reads, <15 ms writes @ P99
Globally replicated
• Low-latency access around the globe!
Schema freedom
• Automatic indexing
• Easy-to-learn query grammar
• Multi-record transactions
Blazing fast, planet scale NoSQL service
99.99% SLAs for availability, latency, and throughput
How does this fit in the Azure family?
“If all you have is a hammer, everything looks like a nail”
-Abraham Maslow
The database renaissance!
Choose the right tools for the right job
Problem 1: Variety
Item | Author | Pages | Language
Harry Potter and the Sorcerer’s Stone | J.K. Rowling | 309 | English
Game of Thrones: A Song of Ice and Fire | George R.R. Martin | 864 | English
Lenovo Thinkpad X1 Carbon | ??? | ??? | ???

Book != Laptop
Item | Author | Pages | Language | Processor | Memory | Storage
Harry Potter and the Sorcerer’s Stone | J.K. Rowling | 309 | English | ??? | ??? | ???
Game of Thrones: A Song of Ice and Fire | George R.R. Martin | 864 | English | ??? | ??? | ???
Lenovo Thinkpad X1 Carbon | ??? | ??? | ??? | Core i7 3.3 GHz | 8 GB | 256 GB SSD
What a waste of space…
Item | Author | Pages | Language
Harry Potter and the Sorcerer’s Stone | J.K. Rowling | 309 | English
Game of Thrones: A Song of Ice and Fire | George R.R. Martin | 864 | English

Item | CPU | Memory | Storage
Lenovo Thinkpad X1 Carbon | Core i7 3.3 GHz | 8 GB | 256 GB SSD
More tables!
Okay… What if I have 100,000 product types?
Or what if I have varying features for a single product type?
ProductId | Item
1 | Harry Potter and the Sorcerer’s Stone
2 | Game of Thrones: A Song of Ice and Fire
3 | Lenovo Thinkpad X1 Carbon

ProductId | Attribute | Value
1 | Author | J.K. Rowling
1 | Pages | 309
…
2 | Author | George R.R. Martin
2 | Pages | 864
…
3 | Processor | Core i7 3.3 GHz
3 | Memory | 8 GB
…
{
  "ItemType": "Book",
  "Title": "Harry Potter and the Sorcerer's Stone",
  "Author": "J.K. Rowling",
  "Pages": "309",
  "Languages": [ "English", "Spanish", "Portuguese", "Russian", "French" ]
}

{
  "ItemType": "Laptop",
  "Name": "Lenovo Thinkpad X1 Carbon",
  "Processor": "Core i7 3.3 GHz",
  "Memory": "8 GB DDR3L SDRAM",
  "Storage": "256 GB SSD",
  "Graphics": "Intel HD Graphics 4400",
  "Weight": "1 pound"
}
It just works.
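To make the "variety" point concrete, here is a minimal sketch in plain JavaScript. It assumes nothing about the DocumentDB SDK: an ordinary array stands in for a schema-agnostic collection, and `itemsWith` is a hypothetical helper introduced only for this illustration.

```javascript
// Illustrative only: a plain array stands in for a schema-agnostic collection.
const catalog = [
  { ItemType: "Book", Title: "Harry Potter and the Sorcerer's Stone", Author: "J.K. Rowling", Pages: 309 },
  { ItemType: "Laptop", Name: "Lenovo Thinkpad X1 Carbon", Processor: "Core i7 3.3 GHz", Memory: "8 GB" },
];

// Hypothetical helper: return the items that carry a given property, whatever their type.
function itemsWith(property) {
  return catalog.filter((item) => Object.prototype.hasOwnProperty.call(item, property));
}

// Books and laptops live side by side; no column is wasted on attributes an item lacks.
const onlyBooks = catalog.filter((item) => item.ItemType === "Book");
const withProcessor = itemsWith("Processor");
console.log(onlyBooks.length, withProcessor.length);
```

Adding a new product type here means adding a new document shape, not a new table or a wider one.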
Problem 2: Scale (Volume and Velocity)
Let’s begin with a Story
Indexing JSON and fighting zombies at SCALE
Next Games
Game development studio based in Helsinki, Finland
65 employees
Develop F2P mobile games for iOS and Android
Based on own & licensed IP
The Walking Dead TV show
Drama about a zombie walker apocalypse on AMC
First cable drama to beat broadcast shows
Most watched cable TV show in the US (16M users)
The Challenge
Scale with expectation of millions of users on Day 1
Deliver real-time responsiveness for a lag-free gaming experience
Highly competitive – high scores and global leaderboards critical
More Users, More Problems
The Results
#1 in Apple app store free apps during launch week
>1M downloads
~1B queries per day
99th-percentile queries served in under 10 ms
How?
Just throw some data in a database!
Not that easy…
Why is this such a hard problem?
Caches: scoreboard keeps updating…
SQL database: need to shard; schema and index management; loss of relational benefits
Azure Table Storage: secondary indexes, latency, throughput
Planet-Scale NoSQL
Horizontal scaling for storage and throughput
High performance with SSDs and automatic indexing
Operating on a global scale
Partitioning
Fact: Managing shards is really painful.
Elastic Scale
Good news: DocumentDB has done all the heavy lifting.
Request Units
Request Unit (RU) is the normalized currency
% Memory
% IOPS
% CPU
Replica gets a fixed budget of Request Units
Operation | REST request
READ | GET resource (returns a resource or resource set)
INSERT | POST resource
REPLACE | PUT resource
DELETE | DELETE resource
QUERY | POST documents (SQL)
EXECUTE | POST sprocs (args)
Predictable performance
The most important metric in DocumentDB!
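The idea of a fixed Request Unit budget can be sketched as a simple per-second counter. This is a toy model, not the service's implementation: the class name `RequestUnitBudget`, the one-second window, and the RU charges below are all assumptions made for illustration.

```javascript
// Hypothetical model of a per-second Request Unit (RU) budget.
class RequestUnitBudget {
  constructor(rusPerSecond) {
    this.rusPerSecond = rusPerSecond;
    this.windowStart = 0; // start of the current one-second window (ms)
    this.consumed = 0;    // RUs spent in the current window
  }

  // Try to charge `cost` RUs at time `nowMs`; returns true if the request fits the budget.
  tryConsume(cost, nowMs) {
    if (nowMs - this.windowStart >= 1000) {
      // A new one-second window begins: reset the spend.
      this.windowStart = nowMs - (nowMs % 1000);
      this.consumed = 0;
    }
    if (this.consumed + cost > this.rusPerSecond) {
      return false; // over budget: the caller should back off and retry later
    }
    this.consumed += cost;
    return true;
  }
}

// Made-up charges: three 4-RU requests against a 10 RU/s budget.
const budget = new RequestUnitBudget(10);
console.log(budget.tryConsume(4, 0), budget.tryConsume(4, 100), budget.tryConsume(4, 200));
// One second later the budget is available again.
console.log(budget.tryConsume(4, 1000));
```

Because every operation is priced in the same currency, a reserved budget translates directly into predictable throughput.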
Partitioned Collections
What’s left? Choosing a Partition Key
Choosing a Partition Key
• Workload – Read vs Write heavy?
• Top Queries
• Transaction Boundary
• Avoid Storage + Performance Bottlenecks
• Multi-Tenancy: Tenant Size
• Examples: partition by tenant, device, timestamp, or composite
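To see why the choice of key matters for storage and performance bottlenecks, the sketch below hashes documents onto partitions and counts where they land. The hash function, the partition count of 8, and the `tenant` field are assumptions for illustration only; the service's actual partitioning scheme is not shown in this deck.

```javascript
// Hypothetical hash partitioner: maps a partition key value to one of N partitions.
// The hash is a simple string hash chosen for illustration, not the service's.
function partitionFor(keyValue, partitionCount) {
  let hash = 0;
  for (const ch of String(keyValue)) {
    hash = (hash * 31 + ch.codePointAt(0)) >>> 0; // keep within unsigned 32-bit
  }
  return hash % partitionCount;
}

// Count how many documents land on each partition for a given key selector.
function distribution(docs, keySelector, partitionCount) {
  const counts = new Array(partitionCount).fill(0);
  for (const doc of docs) {
    counts[partitionFor(keySelector(doc), partitionCount)] += 1;
  }
  return counts;
}

// 1,000 made-up walker documents spread over only 4 tenants.
const docs = Array.from({ length: 1000 }, (_, i) => ({
  walkerId: "walker-" + i,
  tenant: "tenant-" + (i % 4),
}));

console.log("by walkerId:", distribution(docs, (d) => d.walkerId, 8));
console.log("by tenant:  ", distribution(docs, (d) => d.tenant, 8));
```

A key with many distinct values (walkerId) can spread across every partition, while a key with only four distinct values (tenant) can never use more than four of them.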
Creating partitioned collections

// Pre-defined collections
DocumentCollection collectionSpec = new DocumentCollection { Id = "Walkers" };
RequestOptions options = new RequestOptions { OfferType = "S3" };
DocumentCollection documentCollection = await client.CreateDocumentCollectionAsync("dbs/" + database.Id, collectionSpec, options);

// Partitioned collections: set a partition key path and a provisioned throughput
DocumentCollection collectionSpec = new DocumentCollection { Id = "Walkers" };
collectionSpec.PartitionKey.Paths.Add("/walkerId");
int collectionThroughput = 100000;
RequestOptions options = new RequestOptions { OfferThroughput = collectionThroughput };
DocumentCollection documentCollection = await client.CreateDocumentCollectionAsync("dbs/" + database.Id, collectionSpec, options);
Let's talk about a physics problem
Globally Distributed
• Not just for disaster recovery…. DocumentDB is unreasonably highly available
• Replicate data across any # of regions of your choice
• Low-latency access to your data around the globe
• Dynamically configure your write and read regions
Azure DocumentDB gives you the ability to cheat the speed of light!
… with well-defined consistency models!
Strong → Bounded Staleness → Session → Eventual
Left to right: relaxed consistency => better performance and availability
Consistency Level | Strong | Bounded Staleness | Session | Eventual
Total global order | Yes | Yes, outside of the “staleness window” | No, partial “session” order | No
Consistent prefix guarantee | Yes | Yes | Yes | Yes
Monotonic reads | Yes | Yes, across regions outside of the staleness window and within a region all the time | Yes, for the given session | No
Monotonic writes | Yes | Yes | Yes | Yes
Read your writes | Yes | Yes (in the write region) | Yes | No
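The "read your writes" row can be illustrated with a toy model of one write region and one lagging read replica. Everything here is an assumption for illustration: the `Region` class, the version counter used as a session token, and the fallback to the write region are stand-ins, not a description of how the service implements session consistency.

```javascript
// Hypothetical model: one write region and one lagging read replica.
class Region {
  constructor() {
    this.version = 0;      // number of writes this region has seen
    this.data = new Map();
  }
}

const writeRegion = new Region();
const readReplica = new Region();

// Each write bumps the version; the caller keeps the result as its session token.
function write(key, value) {
  writeRegion.version += 1;
  writeRegion.data.set(key, value);
  return writeRegion.version;
}

// Replication copies the write region's state to the replica at some later time.
function replicate() {
  readReplica.data = new Map(writeRegion.data);
  readReplica.version = writeRegion.version;
}

// Eventual read: whatever the replica currently has, possibly stale.
function eventualRead(key) {
  return readReplica.data.get(key);
}

// Session read: use the replica only if it has caught up to the session token;
// otherwise fall back to the write region so the session sees its own writes.
function sessionRead(key, sessionToken) {
  const source = readReplica.version >= sessionToken ? readReplica : writeRegion;
  return source.data.get(key);
}

const token = write("score", 42);
console.log(eventualRead("score"));       // replica has not caught up yet
console.log(sessionRead("score", token)); // the session still sees its own write
replicate();
console.log(eventualRead("score"));       // replica has now caught up
```

The session read pays extra latency only when the nearby replica is behind, which is one way to picture why the relaxed levels trade guarantees for performance.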
Strong consistency, High latency
Eventual consistency, Low latency
[Chart: Observed Distribution of consistency levels (Bounded Staleness, Eventual, Session, Strong); slices of 27%, 3%, 54%, and 16%]
App-defined regional preferences

ConnectionPolicy docClientConnectionPolicy = new ConnectionPolicy {
    ConnectionMode = ConnectionMode.Direct,
    ConnectionProtocol = Protocol.Tcp
};

docClientConnectionPolicy.PreferredLocations.Add(LocationNames.EastUS2);
docClientConnectionPolicy.PreferredLocations.Add(LocationNames.WestUS);

docClient = new DocumentClient(
    new Uri("https://myglobaldb.documents.azure.com:443"),
    "PARvqUuBw2QTO4rRXr6d1GnLCR7VinERcYrBQvDRh6EDTJLOHtZxgjTS4pv8nQv2Lg1QQLBLfO6TVziOZKvYow==",
    docClientConnectionPolicy);
Enjoy true schema-freedom
Automatic Indexing
• Index is a union of all the document trees
[Figure: documents 1 and 2 drawn as trees that share a common structure]

Terms | Postings List/Values
$/location/0/ | 1, 2
location/0/country/ | 1, 2
location/0/city/ | 1, 2
0/country/Germany | 1, 2
1/country/France | 2
… | …
0/city/Moscow | 2
0/dealers/0 | 2
http://aka.ms/docdbvldb
No need to define secondary indices / schema hints!
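The "union of all the document trees" idea can be sketched as an inverted index from path terms to document ids. This is a simplification under stated assumptions: only full root-to-leaf paths are indexed, the helper names `terms` and `buildIndex` are hypothetical, and the two documents are made up to resemble the ones in the figure.

```javascript
// Walk a JSON value and emit one path term per leaf, e.g. "$/location/0/country/Germany".
function terms(value, prefix = "$") {
  if (value !== null && typeof value === "object") {
    // Arrays and objects: recurse, using the index or property name as the path step.
    return Object.entries(value).flatMap(([key, child]) => terms(child, prefix + "/" + key));
  }
  return [prefix + "/" + String(value)]; // leaf: the value is the last path step
}

// Build term -> set of document ids (the postings list).
function buildIndex(docs) {
  const index = new Map();
  for (const doc of docs) {
    for (const term of terms(doc.body)) {
      if (!index.has(term)) index.set(term, new Set());
      index.get(term).add(doc.id);
    }
  }
  return index;
}

// Made-up documents: both share location/0/country, only document 2 has a second location.
const index = buildIndex([
  { id: 1, body: { location: [{ country: "Germany", city: "Berlin" }] } },
  { id: 2, body: { location: [{ country: "Germany", city: "Bonn" }, { country: "France" }] } },
]);

console.log([...index.get("$/location/0/country/Germany")]);
console.log([...index.get("$/location/1/country/France")]);
```

No schema is declared anywhere: a new property in a new document simply contributes new terms.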
Index policies
Customize index management, including storage overhead, throughput, and query consistency
• Range, hash, and spatial indexes
• Included and excluded paths
• Indexing mode: consistent or lazy
• Index precision
• Online, in-place index transformations
{ "indexingMode": "consistent", "automatic": true, "includedPaths": [ { "path": "/*", "indexes": [ { "kind": "Range", "dataType": "Number", "precision": -1 }, { "kind": "Hash", "dataType": "String", "precision": 3 }, { "kind": "Spatial", "dataType": "Point" } ] } ], "excludedPaths": []}
-- Nested lookup against index
SELECT Books.Author
FROM Books
WHERE Books.Author.Name = "Leo Tolstoy"

-- Transformation, Filters, Array access
SELECT { Name: Books.Title, Author: Books.Author.Name }
FROM Books
WHERE Books.Price > 10 AND Books.Languages[0] = "English"

-- Joins, User Defined Functions (UDF)
SELECT CalculateRegionalTax(Books.Price, "USA", "WA")
FROM Books
JOIN LanguagesArr IN Books.Languages
WHERE LanguagesArr.Language = "Russian"
SQL Query Grammar
Query over schema-free JSON
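For readers who think in code rather than SQL, here is one way to read the "Transformation, Filters, Array access" query as plain JavaScript over an in-memory array. The sample documents are made up, and treating a missing field as "does not match" is this sketch's own choice, not a statement about the query engine.

```javascript
// Made-up documents; note that they do not all share the same fields.
const bookDocs = [
  { Title: "War and Peace", Author: { Name: "Leo Tolstoy" }, Price: 15, Languages: ["English", "Russian"] },
  { Title: "Anna Karenina", Author: { Name: "Leo Tolstoy" }, Price: 8, Languages: ["English"] },
  { Title: "Untitled draft", Price: 20 }, // no Author, no Languages
];

// Plain-JavaScript reading of:
//   SELECT { Name: Books.Title, Author: Books.Author.Name }
//   FROM Books
//   WHERE Books.Price > 10 AND Books.Languages[0] = "English"
const result = bookDocs
  .filter((b) => b.Price > 10 && b.Languages?.[0] === "English")
  .map((b) => ({ Name: b.Title, Author: b.Author?.Name }));

console.log(result);
```

The filter navigates into nested objects and arrays directly, which is the point of querying schema-free JSON.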
JavaScript as a Modern Day T-SQL
Transactional Integrated JavaScript
function(playerId1, playerId2) {
  var playersToSwap = __.filter(function (document) {
    return (document.id == playerId1 || document.id == playerId2);
  });

  var player1 = playersToSwap[0], player2 = playersToSwap[1];
  var player1ItemTemp = player1.item;
  player1.item = player2.item;
  player2.item = player1ItemTemp;

  __.replaceDocument(player1)
    .then(function() { return __.replaceDocument(player2); })
    .fail(function(error) { throw 'Unable to update players, abort'; });
}
client.executeStoredProcedureAsync("procs/1234", ["MasterChief", "SolidSnake"])
  .then(function (response) {
    console.log("success!");
  }, function (err) {
    console.log("Failed to swap!", err);
  });
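The all-or-nothing behavior that the stored procedure relies on can be sketched without the service at all. The in-memory store below, its `transaction` method, and the `failOn` switch are hypothetical constructs for illustration; they only mimic the idea that a thrown error leaves both players untouched.

```javascript
// Hypothetical in-memory store with copy-then-commit transactions.
function createStore(initialDocs) {
  const docs = new Map(initialDocs.map((d) => [d.id, { ...d }]));
  return {
    get: (id) => ({ ...docs.get(id) }),
    // Apply all updates or none: work on a copy and commit only if nothing throws.
    transaction(work) {
      const draft = new Map([...docs].map(([id, d]) => [id, { ...d }]));
      work(draft); // may throw to abort, in which case the draft is discarded
      docs.clear();
      for (const [id, d] of draft) docs.set(id, d);
    },
  };
}

// Swap the items of two players; `failOn` simulates a failed second write.
function swapItems(store, playerId1, playerId2, failOn = null) {
  store.transaction((draft) => {
    const p1 = draft.get(playerId1);
    const p2 = draft.get(playerId2);
    const temp = p1.item;
    p1.item = p2.item;
    if (failOn === playerId2) throw new Error("Unable to update players, abort");
    p2.item = temp;
  });
}

const store = createStore([
  { id: "MasterChief", item: "rifle" },
  { id: "SolidSnake", item: "box" },
]);

try {
  swapItems(store, "MasterChief", "SolidSnake", "SolidSnake"); // aborts halfway
} catch (err) {
  console.log("aborted:", store.get("MasterChief").item, store.get("SolidSnake").item);
}

swapItems(store, "MasterChief", "SolidSnake"); // succeeds
console.log("swapped:", store.get("MasterChief").item, store.get("SolidSnake").item);
```

Without the transaction, the aborted run would have left both players holding the same item.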
Getting Started
Fully managed as a service
API and Toolchain Options
DocumentDB
REST over HTTPS/TCP
Java .NET
PowerBI
Tip: Data Modeling
{
  "id": "1",
  "firstName": "Thomas",
  "lastName": "Andersen",
  "addresses": [
    { "line1": "100 Some Street", "line2": "Unit 1", "city": "Seattle", "state": "WA", "zip": 98012 }
  ],
  "contactDetails": [
    { "email": "[email protected]" },
    { "phone": "+1 555 555-5555", "extension": 5555 }
  ]
}
Try to model your entity as a self-contained document.
Generally, use embedded data models when:
• There are "contains" relationships between entities
• There are one-to-few relationships between entities
• Embedded data changes infrequently
• Embedded data won’t grow without bound
• Embedded data is integral to data in a document
Data modeling with denormalization
Denormalizing typically provides for better read performance
In general, use normalized data models when:
• Write performance is more important than read performance
• Representing one-to-many relationships
• Representing many-to-many relationships
• Related data changes frequently

Referencing provides more flexibility than embedding, at the cost of more round trips to read data.
Data modeling with referencing
User document
{ "id": "xyz", "username": "user xyz" }

Address document
{ "id": "address_xyz", "userid": "xyz", "address": { … } }

Contact details document
{ "id": "contact_xyz", "userid": "xyz", "email": "[email protected]", "phone": "555 5555" }
Normalizing typically provides better write performance
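The read-side trade-off between the two models can be sketched by counting lookups. The counting store below is hypothetical and each `read` call stands in for one round trip; the `address_` and `contact_` id prefixes follow the naming used in the documents above, while the field values are made up.

```javascript
// Hypothetical store that counts lookups, standing in for network round trips.
function createCountingStore(docs) {
  const byId = new Map(docs.map((d) => [d.id, d]));
  let reads = 0;
  return {
    read(id) { reads += 1; return byId.get(id); },
    readCount: () => reads,
  };
}

// Embedded model: everything about the user lives in one document.
const embedded = createCountingStore([
  { id: "xyz", username: "user xyz", address: { city: "Seattle" }, contact: { phone: "555 5555" } },
]);

// Referenced model: related data lives in separate documents.
const referenced = createCountingStore([
  { id: "xyz", username: "user xyz" },
  { id: "address_xyz", userid: "xyz", address: { city: "Seattle" } },
  { id: "contact_xyz", userid: "xyz", phone: "555 5555" },
]);

function loadEmbeddedProfile(store, userId) {
  return store.read(userId); // one read returns the whole profile
}

function loadReferencedProfile(store, userId) {
  const user = store.read(userId);
  const address = store.read("address_" + userId).address;
  const contact = store.read("contact_" + userId);
  return { ...user, address, contact: { phone: contact.phone } };
}

const fromEmbedded = loadEmbeddedProfile(embedded, "xyz");
const fromReferenced = loadReferencedProfile(referenced, "xyz");
console.log(embedded.readCount(), referenced.readCount());
```

The flip side is on writes: updating the phone number in the referenced model touches one small document, while the embedded model rewrites the whole profile.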
No magic bullet
Think about how your data is going to be written and read, and model accordingly
Hybrid models ~ denormalize + reference + aggregate
Author document
{
  "id": "1",
  "firstName": "Thomas",
  "lastName": "Andersen",
  "countOfBooks": 3,
  "books": [1, 2, 3],
  "images": [
    { "thumbnail": "http://....png" },
    { "profile": "http://....png" }
  ]
}

Book document
{
  "id": 1,
  "name": "DocumentDB 101",
  "authors": [
    { "id": 1, "name": "Thomas Andersen", "thumbnail": "http://....png" },
    { "id": 2, "name": "William Wakefield", "thumbnail": "http://....png" }
  ]
}
• De-normalize data where appropriate
• Collections != Tables
• Tuning / Perf
  • Consistency Levels
  • Index Policies
  • Understand Query Costs / Limits / Avoid Scans
  • Pre-aggregate where possible
Quick Tips
Thank You
Get started with Azure DocumentDB
http://www.azure.com/docdb
Query Demo: https://www.documentdb.com/sql/demo
Andrew [email protected]
@aliuy8