MongoDB: Scaling write performance
Junegunn Choi
First impression: Easy
• Easy installation
• Easy data model
• No prior schema design
• Native support for secondary indexes
Second thought: Not so easy
• No SQL
• Coping with massive data growth
• Setting up and operating sharded cluster
• Scaling write performance
Today we’ll talk about insert performance
Insert throughput on a replica set
* 1kB record, ObjectId as PK
* WriteConcern: journal sync on majority
Steady 5k inserts/sec
Insert throughput with a secondary index
Culprit: B+Tree index
• Good at sequential insert
• e.g. ObjectId, Sequence #, Timestamp
• Poor at random insert
• Indexes on randomly-distributed data
Sequential vs. Random insert
[Figure: B+Tree after sequential inserts (1, 2, 3, ..., 12) vs. B+Tree after random inserts, where keys land all over the tree]
Sequential insert ➔ small working set ➔ fits in RAM ➔ sequential I/O
(bandwidth-bound)
Random insert ➔ large working set ➔ cannot fit in RAM ➔ random I/O
(IOPS-bound)
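The working-set effect above can be sketched with a toy model. This is not from the slides: the page size and key counts are made-up numbers, chosen only to show how key order changes how many index pages stay hot.

```python
import random

PAGE_SIZE = 100  # keys per leaf page (hypothetical)

def pages_touched(keys):
    """Distinct leaf pages written, assuming keys map to pages by range."""
    return len({k // PAGE_SIZE for k in keys})

sequential = list(range(10_000))                       # ObjectId-like keys
scattered = [random.randrange(10_000_000) for _ in range(10_000)]

print(pages_touched(sequential))  # 100 pages: the working set stays tiny
print(pages_touched(scattered))   # thousands of pages: the whole tree is hot
```

With sequential keys, only the rightmost pages are ever written; with random keys, nearly every insert dirties a different page, which is why the tree must fit in RAM to avoid random I/O.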
So, what do we do now?
1. Partitioning
[Figure: one big B+Tree does not fit in memory; partitioned by month (Aug 2012, Sep 2012, Oct 2012), each partition's B+Tree fits in memory]
1. Partitioning
• MongoDB doesn’t support partitioning
• Partitioning at the application level
  • e.g. daily log collection: logs_20121012
  • Switch collection every hour
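A minimal sketch of this application-level scheme, assuming a helper that derives the collection name from a record's timestamp (the function name is illustrative, not part of any API):

```python
from datetime import datetime

def log_collection_name(ts: datetime) -> str:
    """Daily log collection name, e.g. logs_20121012."""
    return "logs_" + ts.strftime("%Y%m%d")

print(log_collection_name(datetime(2012, 10, 12)))  # logs_20121012
```

With a driver such as PyMongo, inserts would then target `db[log_collection_name(now)]`, and expired data can be dropped a whole collection at a time instead of deleting individual documents.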
2. Better H/W
• More RAM
• More IOPS
• RAID striping
• SSD
• AWS Provisioned IOPS (1k ~ 10k)
3. More H/W: Sharding
• Automatic partitioning across nodes
[Figure: mongos router in front of SHARD1, SHARD2, SHARD3]
[Graph: insert throughput, 3 shards (3x3) vs. 3 shards (3x3) on RAID 1+0]
There’s no free lunch
• Manual partitioning
• Incidental complexity
• Better H/W
• $
• Sharding
• $$
• Operational complexity
“Do you really need that index?”
Scaling insert performance with sharding
= Choosing the right shard key
Shard key example: year_of_birth
[Figure: USERS collection split into 64MB chunks by year_of_birth (~1950, 1951~1970, 1971~1990, 1991~2005, 2006~2010, 2010~∞), distributed across SHARD1, SHARD2, and SHARD3 behind a mongos router]
5k inserts/sec w/o sharding
Sequential key
• ObjectId as shard key
• Sequence #
• Timestamp
Worse throughput with 3x H/W.
Sequential key
• All inserts into one chunk
• Chunk migration overhead
[Figure: SHARD-x holds USERS chunks 1000~2000, 5000~7500, and 9000~∞; every new key (9001, 9002, 9003, 9004, ...) lands in the last chunk, 9000~∞]
Hash key
• e.g. SHA1(_id) = 9f2feb0f1ef425b292f2f94 ...
• Distributes evenly across all ranges
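A sketch of how such a key could be derived. The application computes the digest itself and stores it in a field that the collection is sharded on; the function and field names are assumptions, not MongoDB API:

```python
import hashlib

def hash_shard_key(doc_id: str) -> str:
    """SHA-1 of the document id, stored in a field used as the shard key."""
    return hashlib.sha1(doc_id.encode()).hexdigest()

# Hex digests spread uniformly over the shard key space.
print(hash_shard_key("507f191e810c19729de860ea")[:12] + "...")
```

Because the digest is effectively random, consecutive inserts scatter across all chunk ranges instead of piling into the last one.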
Hash key
• Performance drops as collection grows
• Why? Mandatory shard key index
• B+Tree problem again!
[Graph: insert throughput, sequential key vs. hash key]
Sequential + hash key
• Coarse-grained sequential prefix
• e.g. Year-month + hash value
• 201210_24c3a5b9
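One way to compose such a key, as a sketch under the slide's year-month + hash scheme (function name and the 8-hex-char suffix width are assumptions):

```python
import hashlib
from datetime import datetime

def month_hash_key(doc_id: str, ts: datetime) -> str:
    """Coarse sequential prefix + hash suffix, e.g. 201210_24c3a5b9."""
    prefix = ts.strftime("%Y%m")                            # "201210"
    suffix = hashlib.sha1(doc_id.encode()).hexdigest()[:8]  # 8 hex chars
    return f"{prefix}_{suffix}"

print(month_hash_key("user-42", datetime(2012, 10, 24)))
</strike>```

The prefix keeps the hot part of the B+Tree confined to the current month, while the hash suffix spreads writes across chunks within that month.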
[Figure: B+Tree over ranges 201208_*, 201209_*, 201210_*; inserts touch only the newest range, 201210_*]
But what if...
[Figure: writes spread across 201208_*, 201209_*, and 201210_* at once, creating a large working set]
Sequential + hash key
• Can you predict data growth rate?
• Balancer not clever enough
• Only considers # of chunks
• Migration is slow during heavy writes
[Graph: insert throughput, sequential key vs. hash key vs. sequential + hash key]
Low-cardinality hash key
• e.g. A~Z, 00~FF
• Alleviates B+Tree problem
• Sequential access on fixed # of parts
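A sketch of deriving the low-cardinality key, assuming 256 buckets taken from the first byte of a SHA-1 digest (the function name is illustrative):

```python
import hashlib

def bucket(doc_id: str) -> str:
    """Map a document to one of 256 shard key values, "00" ~ "ff"."""
    return hashlib.sha1(doc_id.encode()).hexdigest()[:2]

print(bucket("user-42"))  # a value in "00" ~ "ff"
```

With only 256 possible key values, the index has a fixed number of hot insertion points, and inserts within each bucket are appended sequentially.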
[Figure: local B+Tree on a shard covering shard key range A ~ D; inserts hit a fixed number of points (A, B, C), sequential within each]
Low-cardinality hash key
• Limits the # of possible chunks
• e.g. 00 ~ FF ➔ 256 chunks
• Chunk grows past 64MB
• Balancing becomes difficult
[Graph: insert throughput, sequential key vs. hash key vs. sequential + hash key vs. low-cardinality hash key]
Low-cardinality hash prefix + sequential part
• e.g. Short hash prefix + timestamp
• FA1350005981
• Nice index access pattern
• Unlimited # of chunks
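A sketch of composing this final key, assuming a two-hex-char SHA-1 prefix and a Unix timestamp as the sequential part, as in the FA1350005981 example (function name is illustrative):

```python
import hashlib

def prefixed_seq_key(doc_id: str, unix_ts: int) -> str:
    """Fixed set of insertion points (00 ~ FF), sequential within each."""
    prefix = hashlib.sha1(doc_id.encode()).hexdigest()[:2].upper()
    return f"{prefix}{unix_ts}"

print(prefixed_seq_key("user-42", 1350005981))  # a key like FA1350005981
```

The prefix bounds the number of hot points in the index, while the timestamp keeps each point growing sequentially, so chunks can keep splitting without the 256-chunk ceiling of the pure low-cardinality key.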
[Figure: local B+Tree on a shard covering shard key range A000 ~ C999; inserts advance sequentially within each prefix: A000→A123, B000→B123, C000→C123]
Finally, 2x throughput
Lessons learned
• Know the performance impact of secondary indexes
• Choose the right shard key
• Test with large data sets
• Linear scalability is hard
• If you really need it, consider HBase or Cassandra
• SSD