
Geo Searches for Health Care Pricing Data with MongoDB


DESCRIPTION

I presented this updated version of my talk at NoSQL Now! 2013 in San Jose, CA, on August 22, 2013. The presentation describes how Castlight Health uses MongoDB to support very low latency searches for very large volumes of health care pricing data. Key factors are geospatial indexes, SSDs and replica sets.


Page 1: Geo Searches for Health Care Pricing Data with MongoDB

CONFIDENTIAL

Geo Searches for Health Care Pricing Data with MongoDB

NoSQL Now 2013

Robert Stewart

Senior Architect, Castlight Health

[email protected]

@wombatnation

1

Page 2: Geo Searches for Health Care Pricing Data with MongoDB


Castlight Health

The Business and Technical Problems

Initial Solution

MongoDB, Geospatial Indexes and SSDs

Replica Set Flipping

2

Page 3: Geo Searches for Health Care Pricing Data with MongoDB

3

Castlight Health

Hosted web and mobile applications providing unbiased information on health care cost and quality

Customers are employers and health plans

Founded in San Francisco in 2008

$181 million in VC funding

#1 on Wall Street Journal’s list of “Top 50 Venture-Backed Companies” for 2011

Hiring!

Page 4: Geo Searches for Health Care Pricing Data with MongoDB

4

Home Page

Page 5: Geo Searches for Health Care Pricing Data with MongoDB

5

Search Results

Page 6: Geo Searches for Health Care Pricing Data with MongoDB

6

Business Problem

Support searches for

Prices for a procedure performed by any in-network provider in a geographical area

Prices for all procedures performed by a single provider

Sub-second response, even if returning data on thousands of prices

Page 7: Geo Searches for Health Care Pricing Data with MongoDB

7

Technical Problems

Need a very fast geospatial index

Rate count at 1 billion and rising

Major rate updates monthly

Difficult to index data to ensure sequential reads

Sometimes lots of random reads

[Chart: time axis from Apr-11 to Aug-13 in two-month steps]

Page 8: Geo Searches for Health Care Pricing Data with MongoDB

8

Pricing Retrieval Architecture

Page 9: Geo Searches for Health Care Pricing Data with MongoDB

9

Initial Solution

Store pricing data in MySQL

When Pricing Service starts, create two in-memory indexes and cache most of the rates

55 GB JVM Heap with lots of GC tuning

20-minute service startup time to build indexes

3 hours for background caching of most rates

Trouble Brewing:

Total rates growing quickly

Rolling restart becoming unacceptably slow

If rates not in Java or MySQL cache, retrieval was very slow

Page 10: Geo Searches for Health Care Pricing Data with MongoDB


Enter the Mongo

10

Page 11: Geo Searches for Health Care Pricing Data with MongoDB

11

Geospatial Indexes We Evaluated

Standard 2D index in MongoDB 2.2 too slow for my use case

Geo Haystack index

From docs.mongodb.org:

“A haystack index is a special index that is optimized to return results over small areas. Haystack indexes improve performance on queries that use flat geometry.”

2DSphere index in MongoDB 2.4

Page 12: Geo Searches for Health Care Pricing Data with MongoDB

12

Mercator Projection with 10 degree grid

Page 13: Geo Searches for Health Care Pricing Data with MongoDB

13

Geo Haystack

We chose degrees long-lat for x-y coordinate system

25 miles is our default search radius

Roughly 0.5 degrees in middle of the US

db.priceables_1.ensureIndex(
  { loc: "geoHaystack", pm: 1 },
  { bucketSize: 0.5 })

db.runCommand(
  { geoSearch: "priceables_1",
    near: [-122.4, 37.79],
    maxDistance: 0.5,
    search: { pm: 6757 },
    limit: 50000 })
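The "roughly 0.5 degrees" figure can be checked with a little arithmetic. The following is a minimal sketch, not code from the talk: it assumes about 69 miles per degree of latitude and a simple cosine correction for longitude, and the function names are hypothetical. Near San Francisco the longitude span for 25 miles comes out a little under 0.5 degrees, and the latitude span is smaller, so a 0.5 degree bucket and maxDistance err on the generous side.

```javascript
// Sketch: convert a search radius in miles to degrees on a flat long-lat grid.
// ASSUMPTION: about 69 miles per degree of latitude (an approximation).
const MILES_PER_DEGREE_LAT = 69.0;

// Degrees of latitude covered by the given number of miles.
function milesToDegreesLat(miles) {
  return miles / MILES_PER_DEGREE_LAT;
}

// Degrees of longitude covered by the given number of miles at a latitude.
// A degree of longitude shrinks by cos(latitude) away from the equator.
function milesToDegreesLng(miles, latitudeDeg) {
  const latRad = (latitudeDeg * Math.PI) / 180;
  return miles / (MILES_PER_DEGREE_LAT * Math.cos(latRad));
}

// 25 miles at the latitude used in the sample geoSearch above.
const latDegrees = milesToDegreesLat(25);
const lngDegrees = milesToDegreesLng(25, 37.79);
console.log(latDegrees.toFixed(2), lngDegrees.toFixed(2));
```

Because the haystack index uses flat geometry, a single degree value is applied in both directions, which is why a rounded 0.5 is a reasonable choice here.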

Page 14: Geo Searches for Health Care Pricing Data with MongoDB

14

Geo Haystack Cons

Only one secondary filter

Second part of index can’t have an array value

Error on unindexed query on only the second part of the key

Page 15: Geo Searches for Health Care Pricing Data with MongoDB

15

2DSphere Index

Supports earth-like spherical geometries

Points can be GeoJSON or x,y pairs

GeoJSON LineString and Polygon

Queries for inclusion, intersection and proximity

Page 16: Geo Searches for Health Care Pricing Data with MongoDB

16

2DSphere Index Creation and Sample Query

db.priceables_1.ensureIndex(
  { loc: "2dsphere", pm: 1, pn: 1 })

db.priceables_1.find(
  { "loc":
      { "$geoWithin":
          { "$centerSphere":
              [ [ -94.2128, 36.3840 ], 0.006314 ] } },
    "pm": 6441,
    "pn": { "$in": [ 5236, 5237 ] } })
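The second argument to $centerSphere is a radius in radians, so 0.006314 is consistent with the 25 mile default radius divided by the Earth's radius. Here is a hedged sketch of that conversion and of assembling the same query document. The Earth radius constant and the helper names milesToRadians and buildPriceQuery are assumptions for illustration, not part of the talk.

```javascript
// ASSUMPTION: approximate mean Earth radius in miles.
const EARTH_RADIUS_MILES = 3959;

// $centerSphere expects its radius in radians: distance / sphere radius.
function milesToRadians(miles) {
  return miles / EARTH_RADIUS_MILES;
}

// Hypothetical helper that builds the query document shown on the slide.
// Coordinates are [longitude, latitude], matching the sample query.
function buildPriceQuery(lng, lat, radiusMiles, pm, pnList) {
  return {
    loc: {
      $geoWithin: {
        $centerSphere: [[lng, lat], milesToRadians(radiusMiles)],
      },
    },
    pm: pm,
    pn: { $in: pnList },
  };
}

const query = buildPriceQuery(-94.2128, 36.384, 25, 6441, [5236, 5237]);
console.log(JSON.stringify(query));
```

In the mongo shell the resulting object could be passed straight to db.priceables_1.find(query).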

Page 17: Geo Searches for Health Care Pricing Data with MongoDB

17

2DSphere Results

Geospatially Accurate

Even Faster than Haystack

Page 18: Geo Searches for Health Care Pricing Data with MongoDB

18

SSDs

For uncached data on HDD, MongoDB geo index was twice as fast as custom Java geo index with MySQL

Still close to 1 minute for big queries with full data set

Death by random read

Tested with a $200 Samsung SSD

Typical query dropped to 20 millis

Big query only about 150 millis

Page 19: Geo Searches for Health Care Pricing Data with MongoDB

19

Mongoperf on SSDs

Random 4k block reads, 5 GB file, 16 threads

Env     SSD                           Read Ops/s   Read MB/s
Prod    Samsung 200GB SLC             74k          288
QA VM   Samsung 200GB SLC             30k          117
Dev     Samsung 830 256GB SATA MLC    47k          183

Sequential write of the 5 GB file

Env     SSD                           Write Ops/s  Write MB/s
Prod    Samsung 200GB SLC             1074         289
QA VM   Samsung 200GB SLC             405          196
Dev     Samsung 830 256GB SATA MLC    438          210
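The read throughput column follows from the ops column: each operation reads one 4k block, so MB/s is roughly ops/s times 4096 bytes. The sketch below is illustrative only. The config object reflects how I would expect the test described on the slide to be set up (mongoperf reads a JSON document on stdin), and the file size of 5000 MB is an assumption standing in for "5 GB".

```javascript
// ASSUMPTION: a mongoperf config matching the slide's description
// (16 threads, 5 GB file, reads only). fileSizeMB: 5000 is a guess at "5 GB".
// It would be piped to mongoperf on stdin as JSON.
const mongoperfConfig = { nThreads: 16, fileSizeMB: 5000, r: true, w: false };

const BLOCK_SIZE_BYTES = 4096; // "4k block reads"
const BYTES_PER_MB = 1024 * 1024;

// Throughput in MB/s implied by a random-read rate of 4k blocks.
function readThroughputMB(opsPerSecond) {
  return (opsPerSecond * BLOCK_SIZE_BYTES) / BYTES_PER_MB;
}

// Read ops/s values taken from the table (rounded to thousands on the slide).
const readOps = [["Prod", 74000], ["QA VM", 30000], ["Dev", 47000]];
for (const [env, ops] of readOps) {
  console.log(env, readThroughputMB(ops).toFixed(1));
}
```

The computed values land within about 1 MB/s of the table (for example, 74k ops/s gives about 289 MB/s against the reported 288), which is what you would expect given that the ops figures are rounded.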

Page 20: Geo Searches for Health Care Pricing Data with MongoDB

20

Low Impact Pricing Updates

Requirements

Major price updates monthly

Minor updates more frequently

Huge bulk loads with no impact on active replica set

I/O bound, not CPU bound

Solution

Two MongoDB replica sets

Multiple SSDs per server

Page 21: Geo Searches for Health Care Pricing Data with MongoDB

21

Replica Set Architecture

Replica sets: prodpricing1, prodpricing2

Physical servers:

Server pricing1: mongod 28001 (primary), mongod 28002 (secondary)

Server pricing2: mongod 28001 (secondary), mongod 28002 (primary)

Server db1: mongod 28001 (arbiter)

Server db2: mongod 28002 (arbiter)
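A layout like this could be described with two replica set configuration documents of the kind passed to rs.initiate(). This is a sketch under assumptions: the slide does not say which port belongs to which set, so pairing prodpricing1 with 28001 and prodpricing2 with 28002 is a guess, and replicaSetConfig is a hypothetical helper.

```javascript
// Hypothetical helper: build a replica set config with two data-bearing
// members (one per pricing server) and one arbiter on a separate host.
function replicaSetConfig(setName, port, arbiterHost) {
  return {
    _id: setName,
    members: [
      { _id: 0, host: `pricing1:${port}` },
      { _id: 1, host: `pricing2:${port}` },
      // Arbiters vote in elections but hold no data.
      { _id: 2, host: `${arbiterHost}:${port}`, arbiterOnly: true },
    ],
  };
}

// ASSUMPTION: set-to-port mapping is inferred, not stated on the slide.
const configs = [
  replicaSetConfig("prodpricing1", 28001, "db1"),
  replicaSetConfig("prodpricing2", 28002, "db2"),
];
console.log(JSON.stringify(configs, null, 2));
```

Which data-bearing member becomes primary is decided by election. To favor the arrangement on the slide, where each pricing server is primary for one set, a higher priority could be set on the preferred member.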

Page 22: Geo Searches for Health Care Pricing Data with MongoDB

22

Replica Set Flipping Solution

Transfer compressed data files to passive replica set

Protip: to compress and uncompress

tar cvf - pricing | pigz > ~/pricing.tgz

pigz -dc pricing.tgz | tar xvf -

Page in index and data

db.runCommand({ touch: "priceables_1", index: true, data: true })

Pricing Service operation to atomically flip
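The talk does not show how the Pricing Service performs the flip, so the following is only a sketch of the general idea: readers always ask a small switch object which replica set is active, and the flip swaps the active and passive names in one step. The class name ReplicaSetSwitch and its methods are hypothetical.

```javascript
// Hypothetical sketch of an active/passive switch between two replica sets.
class ReplicaSetSwitch {
  constructor(activeName, passiveName) {
    this.active = activeName;
    this.passive = passiveName;
  }

  // Every query looks up the active set at call time, so a flip takes
  // effect for the next request without a restart.
  current() {
    return this.active;
  }

  // Swap the roles in a single synchronous step. In single-threaded
  // JavaScript no reader can observe a half-finished swap.
  flip() {
    [this.active, this.passive] = [this.passive, this.active];
    return this.active;
  }
}

const pricingSwitch = new ReplicaSetSwitch("prodpricing1", "prodpricing2");
const before = pricingSwitch.current();
const after = pricingSwitch.flip();
console.log(before, "->", after);
```

In a multi-threaded service the same idea would need an atomic reference or a lock around the swap. The point is that the freshly loaded and warmed passive set becomes active in one operation.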

Page 23: Geo Searches for Health Care Pricing Data with MongoDB

23

Replica Set Flipping Drawbacks

Obviously, increased cost, but only for extra SSDs

Recently added caching of remote pricing lookups

TTL collections

Cache is lost during a flip

But, usually flip late at night

Cache eviction time is only a few hours
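A TTL collection is an ordinary collection with a TTL index on a date field, which lets MongoDB delete documents once they are older than expireAfterSeconds. A minimal sketch of the index arguments follows. The field name cachedAt, the collection name in the comment, and the three hour lifetime are all assumptions chosen to illustrate "a few hours".

```javascript
// Hypothetical helper: arguments for a TTL index that expires cached
// documents a given number of hours after their cachedAt timestamp.
function ttlIndexArgs(hours) {
  const keys = { cachedAt: 1 }; // ASSUMPTION: date field set on insert
  const options = { expireAfterSeconds: hours * 3600 };
  return [keys, options];
}

const [ttlKeys, ttlOptions] = ttlIndexArgs(3);
console.log(JSON.stringify(ttlKeys), JSON.stringify(ttlOptions));

// In the mongo shell of that era this might be applied as:
//   db.remote_price_cache.ensureIndex(ttlKeys, ttlOptions)
// where remote_price_cache is a made-up collection name.
```

Since the cache lives in the replica set being swapped out, a short lifetime like this also limits how much is lost when a flip happens.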

Page 24: Geo Searches for Health Care Pricing Data with MongoDB

24

Overall Results

Geo search speed with cold cache acceptable

Geo search speed with warm cache awesome

Pricing Service startup down to a few seconds

No production impact for major rate updates

Lowered risk for minor rate updates

Page 25: Geo Searches for Health Care Pricing Data with MongoDB

25

Summary

Geo Haystack Index great for …

Retrieving lots of documents in a constrained search area

Very simple geospatial searches with a single secondary filter

2DSphere Index great for …

Complex geospatial searches or complex indexing

SSDs great for …

Random reads

Reducing need for lots of complex indexes

Replica set flipping great for …

Instant swap of large amounts of data

Primarily, if not solely, read only

Trading cost for operational flexibility

Page 26: Geo Searches for Health Care Pricing Data with MongoDB


Q & A

26