
MongoDB for Time Series Data Part 3: Sharding


Page 1: MongoDB for Time Series Data Part 3: Sharding

Sharding Time Series Data

Jake Angerman, Sr. Solutions Architect, MongoDB

#MongoDBWorld

Page 2: MongoDB for Time Series Data Part 3: Sharding

Let's Pretend We Are DevOps

The classic six-panel meme:

• What my friends think I do
• What society thinks I do
• What my Mom thinks I do
• What my boss thinks I do
• What I think I do
• What I really do

DevOps

Page 3: MongoDB for Time Series Data Part 3: Sharding

Sharding Overview

(Diagram: Application → Driver → Query Routers (mongos) → Shards 1..N, each shard a replica set with one primary and two secondaries)

Page 4: MongoDB for Time Series Data Part 3: Sharding

Why do we need to shard?

• Reaching a limit on some resource:
– RAM (working set)
– Disk space
– Disk IO
– Client network latency on writes (tag aware sharding)
– CPU

Page 5: MongoDB for Time Series Data Part 3: Sharding

Do we need to shard right now?

• Two schools of thought:
1. Shard at the outset to avoid technical debt later
2. Shard later to avoid complexity and overhead today

• Either way, shard before you need to!
– 256GB data size threshold published in documentation
– Chunk migrations can cause memory contention and disk IO

(Diagram: free RAM shrinking as the working set grows; "Things seemed fine… then I waited too long to shard")

Page 6: MongoDB for Time Series Data Part 3: Sharding

collection stats:

> db.mdbw.stats()
{
    "ns" : "test.mdbw",
    "count" : 16000,              // one hour's worth of documents
    "size" : 65280000,            // size of user data, padding included
    "avgObjSize" : 4080,
    "storageSize" : 93356032,     // size of data extents, unused space included
    "numExtents" : 11,
    "nindexes" : 1,
    "lastExtentSize" : 31354880,
    "paddingFactor" : 1,
    "systemFlags" : 1,
    "userFlags" : 1,
    "totalIndexSize" : 801248,
    "indexSizes" : { "_id_" : 801248 },
    "ok" : 1
}

Page 7: MongoDB for Time Series Data Part 3: Sharding

Storage model spreadsheet

sensors                        16,000
years to keep data             6
docs per day                   384,000
docs per year                  140,160,000
docs total across all years    840,960,000
indexes per day                801,248 bytes
storage per hour               63 MB
storage per day                1.5 GB
storage per year               539 GB
storage across all years       3,235 GB
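The spreadsheet is plain arithmetic on the collection stats above; a quick mongo shell sketch (assuming one document per sensor per hour, and rounding the per-hour "size" up to 63 MB as the spreadsheet does) reproduces it:

// back-of-envelope storage model, derived from db.mdbw.stats() above
var sensors     = 16000;
var years       = 6;
var docsPerDay  = sensors * 24;                        // 384,000
var docsPerYear = docsPerDay * 365;                    // 140,160,000
var docsTotal   = docsPerYear * years;                 // 840,960,000
var mbPerHour   = Math.ceil(65280000 / 1024 / 1024);   // "size" per hour ≈ 63 MB
var gbPerDay    = mbPerHour * 24 / 1024;               // ≈ 1.5 GB
var gbPerYear   = mbPerHour * 24 * 365 / 1024;         // ≈ 539 GB
var gbTotal     = gbPerYear * years;                   // ≈ 3,235 GB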

Page 8: MongoDB for Time Series Data Part 3: Sharding

Why we need to shard now

• 539 GB in year one alone

(Chart: total storage in GB by year, growing roughly linearly to ≈3,235 GB by year 6)

• 16,000 sensors today… 47,000 tomorrow?

Page 9: MongoDB for Time Series Data Part 3: Sharding

What will our sharded cluster look like?

• We need to model the application to answer this question

• The model should include:
– application write patterns (sensors)
– application read patterns (clients)
– analytic read patterns
– data storage requirements

• Two main collections:
– summary data (fast query times)
– historical data (analysis of environmental conditions)

Page 10: MongoDB for Time Series Data Part 3: Sharding

Option 1: Everything in one sharded cluster

(Diagram: a single sharded cluster; Shards 1..N, each a three-node replica set, with Shard 1 as the primary shard)

• Issue: prevent analytics jobs from affecting application performance

• Summary data is small (16,000 * N bytes) and accessed frequently

Page 11: MongoDB for Time Series Data Part 3: Sharding

Option 2: Distinct replica set for summaries

(Diagram: Shards 1..N holding historical data, plus a separate three-node replica set for summary data)

• Pros: operational separation between business functions

• Cons: application must write to two different databases

Page 12: MongoDB for Time Series Data Part 3: Sharding

Application read patterns

• Web browsers, mobile phones, and in-car navigation devices

• Working set should be kept in RAM

• 5M subscribers * 1% active * 50 sensors/query * 1 device query/min = 41,667 reads/sec

• 41,667 reads/sec * 4080 bytes = 162 MB/sec

– and that's without any protocol overhead

• Gigabit Ethernet is ≈ 118 MB/sec

(Diagram: three-node replica set serving clients over a single 1 Gbps link)
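The arithmetic behind those figures, as a shell sketch (binary vs. decimal megabytes account for the small discrepancies):

// client read load vs. a single 1 Gbps pipe
var readsPerSec = 5000000 * 0.01 * 50 / 60;        // ≈ 41,667 sensor reads/sec
var bytesPerSec = readsPerSec * 4080;              // avgObjSize from db.mdbw.stats()
var mbPerSec    = bytesPerSec / (1024 * 1024);     // ≈ 162 MB/sec, no protocol overhead
var gigE        = 1000000000 / 8 / (1024 * 1024);  // ≈ 118-119 MB/sec at line rate: saturated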

Page 13: MongoDB for Time Series Data Part 3: Sharding

Application read patterns (continued)

• Options:
– provision more bandwidth ($$$)
– tune application read pattern
– add a caching layer
– secondary reads from the replica set

(Diagram: replica set with a 1 Gbps link to each of the three nodes)

Page 14: MongoDB for Time Series Data Part 3: Sharding

Secondary Reads from the Replica Set

• Stale data is OK in this use case

• Caution: a read preference of secondary could be disastrous in a 3-member replica set if a secondary fails!

• App servers with mixed read preferences of primary and secondary are operationally cumbersome

• Use the nearest read preference to access all nodes

(Diagram: reads spread across all three nodes of the replica set, 1 Gbps to each)

db.collection.find().readPref('nearest')

Page 15: MongoDB for Time Series Data Part 3: Sharding

Replica Set Tags

• App servers in different data centers use replica set tags plus the nearest read preference:

db.collection.find().readPref('nearest', [ { 'datacenter': 'east' } ])

(Diagram: three-node replica set, all members in the east data center)

> rs.conf()
{
    "_id" : "rs0",
    "version" : 2,
    "members" : [
        { "_id" : 0,
          "host" : "node0.example.net:27017",
          "tags" : { "datacenter": "east" }
        },
        { "_id" : 1,
          "host" : "node1.example.net:27017",
          "tags" : { "datacenter": "east" }
        },
        { "_id" : 2,
          "host" : "node2.example.net:27017",
          "tags" : { "datacenter": "east" }
        }
    ]
}

Page 16: MongoDB for Time Series Data Part 3: Sharding

Replica Set Tags

• Enables geographic distribution

(Diagram: replica set members spread across the east, central, and west data centers)
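A geographically distributed configuration might look like this sketch (hostnames hypothetical):

conf = {
    _id : "rs0",
    members : [
        { _id : 0, host : "ny0.example.net:27017",  tags : { "datacenter": "east" } },
        { _id : 1, host : "stl0.example.net:27017", tags : { "datacenter": "central" } },
        { _id : 2, host : "sf0.example.net:27017",  tags : { "datacenter": "west" } }
    ]
}
rs.initiate(conf)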

Page 17: MongoDB for Time Series Data Part 3: Sharding

Replica Set Tags

• Enables geographic distribution

• Allows scaling within each data center

(Diagram: nine-member replica set, three members in each of the east, central, and west data centers)

Page 18: MongoDB for Time Series Data Part 3: Sharding

Analytic read patterns

• How does an analyst look at the data on the sharded cluster?

• 1 Year of data = 539 GB

(Chart: number of machines needed to hold one year of data in RAM, for servers with 32, 64, 128, 192, or 256 GB of RAM each)

Page 19: MongoDB for Time Series Data Part 3: Sharding

Application write patterns

• 16,000 sensors every minute = 267 writes/sec

• Could we handle 16,000 writes in one second?

– 16,000 writes * 4080 bytes = 62 MB

• Load test the app!
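A minimal load-test sketch in the mongo shell (collection name from this deck; the padding approximates the ~4 KB average document; a real test would exercise the actual application write path):

// insert one minute's worth of sensor checkins as a single bulk write
var pad  = new Array(4000).join("x");      // ~4 KB per doc, like avgObjSize
var docs = [];
for (var i = 0; i < 16000; i++) {
    docs.push({ _id: i + ":140312", padding: pad });
}
var start = new Date();
db.mdbw.insert(docs);
print("16,000 writes took " + (new Date() - start) + " ms");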

Page 20: MongoDB for Time Series Data Part 3: Sharding

Modeling the Application - summary

• We modeled:
– application write patterns (sensors)
– application read patterns (clients)
– analytic read patterns
– data storage requirements
– the network, a little bit

Page 21: MongoDB for Time Series Data Part 3: Sharding

Shard Key

Page 22: MongoDB for Time Series Data Part 3: Sharding

Shard Key characteristics

• A good shard key has:
– sufficient cardinality
– distributed writes
– targeted reads ("query isolation")

• The shard key should be in every query if possible
– scatter-gather otherwise (see the sketch below)

• Choosing a good shard key is important!
– affects performance and scalability
– changing it later is expensive
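For instance, with the compound {linkID, date} key this deck settles on (values hypothetical):

// Targeted: includes the shard key prefix, so mongos routes the
// query only to the shard(s) owning this linkID's chunks
db.mdbw.find({ linkID: 9000006, date: { $gte: 140301, $lte: 140312 } })

// Scatter-gather: no shard key in the predicate, so mongos must
// broadcast the query to every shard and merge the results
db.mdbw.find({ pavement: "Wet Spots" })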

Page 23: MongoDB for Time Series Data Part 3: Sharding

Hashed shard key

• Pros:
– evenly distributed writes

• Cons:
– random data (and index) updates can be IO intensive
– range-based queries turn into scatter-gather

(Diagram: mongos spraying writes evenly across Shards 1..N)
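Declaring one is a single command; a sketch against this deck's collection (hashing _id here is illustrative, not the deck's final choice):

sh.enableSharding("test")
sh.shardCollection("test.mdbw", { _id: "hashed" })   // writes spread evenly; range queries scatter-gather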

Page 24: MongoDB for Time Series Data Part 3: Sharding

Low cardinality shard key

• Induces "jumbo chunks"

• Examples: sensor ID

(Diagram: a single jumbo chunk [ a, b ) holding too many documents for mongos to split or migrate)

Page 25: MongoDB for Time Series Data Part 3: Sharding

Ascending shard key

• Monotonically increasing shard key values cause "hot spots" on inserts

• Examples: timestamps, _id

(Diagram: every insert landing in the one [ ISODate(…), $maxKey ) chunk, creating a hot spot on a single shard)

Page 26: MongoDB for Time Series Data Part 3: Sharding

Choosing a shard key for time series data

• Consider a compound shard key: {arbitrary value, incrementing value}

• Best of both worlds: local hot spotting, targeted reads

(Diagram: each shard holds time-ordered chunks for a subset of the arbitrary values, e.g.:)

[ {V1, ISODate(A)}, {V1, ISODate(B)} ), [ {V1, ISODate(B)}, {V1, ISODate(C)} ), [ {V1, ISODate(C)}, {V1, ISODate(D)} ), …
[ {V2, ISODate(A)}, {V2, ISODate(B)} ), [ {V2, ISODate(B)}, {V2, ISODate(C)} ), [ {V2, ISODate(C)}, {V2, ISODate(D)} ), …
[ {V3, ISODate(A)}, {V3, ISODate(B)} ), [ {V3, ISODate(B)}, {V3, ISODate(C)} ), [ {V3, ISODate(C)}, {V3, ISODate(D)} ), …
[ {V4, ISODate(A)}, {V4, ISODate(B)} ), [ {V4, ISODate(B)}, {V4, ISODate(C)} ), [ {V4, ISODate(C)}, {V4, ISODate(D)} ), …

Page 27: MongoDB for Time Series Data Part 3: Sharding

What is our shard key?

• Let's choose: {linkID, date}
– example: { linkID: 9000006, date: 140312 }
– example: { _id: "900006:140312" }
– this application's _id is in this form already, yay!
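The enabling commands would look like this sketch (database and collection names from the stats slide earlier):

sh.enableSharding("test")

// either as two explicit fields…
sh.shardCollection("test.mdbw", { linkID: 1, date: 1 })

// …or range-sharded on the composite _id the app already writes ("linkID:date")
// sh.shardCollection("test.mdbw", { _id: 1 })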

Page 28: MongoDB for Time Series Data Part 3: Sharding

Summary

• Model the read/write patterns and storage

• Choose an appropriate shard key

• DevOps influenced the application:
– write recent summary data to a separate database
– replica set tags for the summary database
– avoid synchronous sensor checkins
– consider changing client polling frequency
– consider throttling REST API access to app servers

Page 29: MongoDB for Time Series Data Part 3: Sharding

Which DevOps person are you?

Page 30: MongoDB for Time Series Data Part 3: Sharding

Sr. Solutions Architect, MongoDB

Jake Angerman

#MongoDBWorld

Thank You

Page 31: MongoDB for Time Series Data Part 3: Sharding

$ mongo --nodb
> cluster = new ShardingTest({"shards": 1, "chunksize": 1})

# in a second terminal, once the test cluster is up:
$ mongo --nodb
> // now connect to mongos on 30999
> db = (new Mongo("localhost:30999")).getDB("test")

Sharding Experimentation
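From there you can shard a collection and watch chunks split under the 1 MB chunk size; a minimal continuation (collection name from this deck):

sh.enableSharding("test")
sh.shardCollection("test.mdbw", { _id: 1 })

var pad = new Array(1024).join("x");     // ~1 KB per doc to force splits quickly
for (var i = 0; i < 10000; i++) {
    db.mdbw.insert({ _id: i, padding: pad });
}
sh.status()                              // shows the chunk ranges as they split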

Page 32: MongoDB for Time Series Data Part 3: Sharding

I decided to shard from the outset

• Sensor summary documents can all fit in RAM

– 16,000 sensors * N bytes

• Velocity of sensor events is only 267 writes/sec

• Volume of sensor events is what dictates sharding

{ _id : <linkID>,
  update : ISODate("2013-10-10T23:06:37.000Z"),
  last10 : {
      avgSpeed : <int>,
      avgTime : <int>
  },
  lastHour : {
      avgSpeed : <int>,
      avgTime : <int>
  },
  speeds : [ 52, 49, 45, 51, ... ],
  times : [ 237, 224, 246, 233, ... ],
  pavement : "Wet Spots",
  status : "Wet Conditions",
  weather : "Light Rain"
}

Page 33: MongoDB for Time Series Data Part 3: Sharding

> this_is_for_replica_sets_not_sharding = {
    _id : "mySet",
    members : [
        { _id : 0, host : "A", priority : 3 },
        { _id : 1, host : "B", priority : 2 },
        { _id : 2, host : "C" },
        { _id : 3, host : "D", hidden : true },
        { _id : 4, host : "E", hidden : true, slaveDelay : 3600 }
    ]
}
> rs.initiate(this_is_for_replica_sets_not_sharding)

Configuring Sharding

Page 34: MongoDB for Time Series Data Part 3: Sharding

I'm off to my private island in New Zealand

Page 35: MongoDB for Time Series Data Part 3: Sharding

Replica Set Diagram

Page 36: MongoDB for Time Series Data Part 3: Sharding

> conf = {
    _id : "mySet",
    members : [
        { _id : 0, host : "A", priority : 3 },
        { _id : 1, host : "B", priority : 2 },
        { _id : 2, host : "C" },
        { _id : 3, host : "D", hidden : true },
        { _id : 4, host : "E", hidden : true, slaveDelay : 3600 }
    ]
}
> rs.initiate(conf)

Configuration Options

Page 37: MongoDB for Time Series Data Part 3: Sharding

My Wonderful Subsection

Page 38: MongoDB for Time Series Data Part 3: Sharding

> conf = {
    _id : "mySet",
    members : [
        { _id : 0, host : "A", priority : 3 },
        { _id : 1, host : "B", priority : 2 },
        { _id : 2, host : "C" },
        { _id : 3, host : "D", hidden : true },
        { _id : 4, host : "E", hidden : true, slaveDelay : 3600 }
    ]
}
> rs.initiate(conf)

Configuration Options

Primary DC

Page 39: MongoDB for Time Series Data Part 3: Sharding

Tag Aware Sharding

• Control where data is written to, and read from

• Each member can have one or more tags:
– tags: { dc: "ny" }
– tags: { dc: "ny", subnet: "192.168", rack: "row3rk7" }

• The replica set defines rules for write concerns

• Rules can change without changing app code
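A sketch of the commands involved (shard name, tag, and range values hypothetical): tagging a shard and pinning a key range to it, then a tag-based write-concern rule on the replica set side:

// tag-aware sharding: pin a range of the shard key to shards tagged "ny"
sh.addShardTag("shard0000", "ny")
sh.addTagRange("test.mdbw", { _id: MinKey }, { _id: "5000000:000000" }, "ny")

// replica set side: a custom write concern built from member tags
conf = rs.conf()
conf.settings = { getLastErrorModes: { multiDC: { dc: 2 } } }   // ack from 2 distinct dc values
rs.reconfig(conf)
db.mdbw.insert({ _id: "9000006:140312" }, { writeConcern: { w: "multiDC" } })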