One Catalog Service to rule them all Antoine Girbal Principal Solutions Engineer, MongoDB Inc. @antoinegirbal

Retail referencearchitecture productcatalog

DESCRIPTION

During this session we will cover the best practices for implementing a product catalog with MongoDB. We will cover how to model an item properly when it can have thousands of variations and thousands of properties of interest. You'll learn how to index properly and allow for faceted search with milliseconds response latency and how to implement per-store, per-sku pricing while still keeping a sane number of documents. We will also cover operational considerations, like how to bring the data closer to users to cut down the network latency.


Page 1: Retail referencearchitecture productcatalog

One Catalog Service to rule them all

Antoine Girbal, Principal Solutions Engineer, MongoDB Inc., @antoinegirbal

Page 2: Retail referencearchitecture productcatalog

Problem Statement

Page 3: Retail referencearchitecture productcatalog

3

The many catalogs problem

Page 4: Retail referencearchitecture productcatalog

4

1. One department in charge of master product works hard at fitting data into SQL tables

2. Resulting data sits in a SQL server with a couple replicas. It's forbidden to hit it more than 100 times / sec

3. Other departments need to access the data way more often for their own services

4. Other departments need more information that is not available since it did not fit in that long devised rigid SQL schema

5. ETLs and Message Buses are put in place, leaving other teams to try to figure it out themselves…

6. Data becomes inconsistent, fragmented, not up-to-date… The problem is visible both internally and to customers!

The many catalogs problem

Page 5: Retail referencearchitecture productcatalog

5

How many Catalogs and Catalog Caches do you have?

Page 6: Retail referencearchitecture productcatalog

6

The many catalogs problem

Online Store

Catalog

Marketing

Catalog

Department 3

Catalog

Product Department

MasterCatalog

Department 4

Catalog

Department 5

Catalog

Department 1

Catalog

Message Bus

ETLs

Dozens of catalogs!

Page 7: Retail referencearchitecture productcatalog

7

• Single view of a product, one central catalog service

• Flexible schema containing all useful data

• Read volume high and sustained, 100k reads / s

• Can seamlessly take write spikes during catalog update

• Advanced indexing and querying

• Geographical distribution for HA and low latency

Goal: Single View of Product

Page 8: Retail referencearchitecture productcatalog

8

1. MongoDB Overview

2. Catalog Service Architecture

3. Data Store Models

4. Product Search

Agenda

Page 9: Retail referencearchitecture productcatalog

MongoDB Overview

Page 10: Retail referencearchitecture productcatalog

10

• Holds complex JSON structures

• Dynamic Schema for Agility

• complex querying and in-place updating

• Secondary, compound and geo indexing

• full consistency, durability, atomic operations

• HA and geo-distributed via Replication

• Near linear scaling via Sharding

• Overall, MongoDB is a unique fit!

MongoDB is a great fit

Page 11: Retail referencearchitecture productcatalog

11

MongoDB Strategic Advantages

Horizontally Scalable - Sharding

Agile, Flexible

High Performance & Strong Consistency

Application

Highly Available - Replica Sets

{ customer: "roger", date: new Date(), comment: "Spirited Away", tags: ["Tezuka", "Manga"] }

Page 12: Retail referencearchitecture productcatalog

12

build your data to fit your application

Relational:

CustomerID  First Name  Last Name  City
0           John        Doe        New York
1           Mark        Smith      San Francisco
2           Jay         Black      Newark
3           Meagan      White      London
4           Edward      Danields   Boston

Order Number  Store ID  Product       Customer ID
10            100       Tablet        0
11            101       Smartphone    0
12            101       Dishwasher    0
13            200       Sofa          1
14            200       Coffee table  1
15            201       Suit          2

MongoDB:

{
  customer_id : 1,
  name : "Mark Smith",
  city : "San Francisco",
  orders: [
    {
      order_number : 13,
      store_id : 10,
      date: "2014-01-03",
      products: [
        { SKU: 24578234, Qty: 3, Unit_price: 350 },
        { SKU: 98762345, Qty: 1, Unit_price: 110 }
      ]
    },
    { <...> }
  ]
}

Page 13: Retail referencearchitecture productcatalog

13

Notions

RDBMS MongoDB

Database Database

Table Collection

Row Document

Column Field

Page 14: Retail referencearchitecture productcatalog

Catalog Service Architecture

Page 15: Retail referencearchitecture productcatalog

15

Information Management

Merchandising

Content

Inventory

Customer

Channel

Sales & Fulfillment

Insight

Social

Architecture Overview

Customer

Channels: Amazon, Ebay…

Stores: POS, Kiosk

Mobile: Smartphone, Tablet

Website

Contact Center

API

Data and Service Integration

Social: Facebook, Twitter…

Data Warehouse

Analytics

Supply Chain Management

System

Suppliers

3rd Party

In Network

Web Servers

Application Servers

Page 16: Retail referencearchitecture productcatalog

16

Commerce Functional Components

Information Layer

Look & Feel

Navigation

Customization

Personalization

Branding

Promotions

Chat

Ads

Customer's Perspective

Research: Browse, Search

Select: Shopping Cart

Purchase: Checkout

Receive: Track

Use: Feedback, Maintain

Dialog: Assist

Market / Offer

Guide

Offer

Semantic Search

Recommend

Rule-based Decisions

Pricing

Coupons

Sell / Fulfill

Orders

Payments

Fraud Detection

Fulfillment

Business Rules

Insight: Session Capture, Activity Monitoring

Customer Enterprise

Information Management

Merchandising

Content

Inventory

Customer

Channel

Sales & Fulfillment

Insight

Social

Page 17: Retail referencearchitecture productcatalog

17

Merchandising Components

Merchandising

MongoDB

Variant

Hierarchy

Pricing

Promotions

Ratings & Reviews

Calendar

Semantic Search

Item

Localization

Page 18: Retail referencearchitecture productcatalog

19

MongoDB Data Store

Merchandising - Architecture

Items, Pricing, Promotions, Variants, Ratings & Reviews

Search Engine

Product Service API

Online Store Marketing Inventory SCMS Public API …

Page 19: Retail referencearchitecture productcatalog

Data Store Models

Page 20: Retail referencearchitecture productcatalog

21

Models - Product Page

Product images

General Information

List of Variants

External Information

Localized Description

Page 21: Retail referencearchitecture productcatalog

22

• Item: the overall product info (e.g. Levi’s 501)

• Variant: a specific variant of an item (e.g. in black size 6) which typically has a specific SKU / UPC

• Price: price information may vary based on the store, the variant, etc

• Hierarchy: the item taxonomy

• Facet: facets to search products by

• Vendors: a given sku may be available through several vendors if the site is a marketplace

> Don't try to fit all in the same document!

Models - Overview

Page 22: Retail referencearchitecture productcatalog

23

Hundreds of sizes

One Item

Dozens of colors

Models – Overview

Page 23: Retail referencearchitecture productcatalog

24

• A single item may have thousands of variants

• Each variant can have hundreds of attributes

• Altogether a single item can represent many MBs worth of JSON text

• Don't try to fit everything into the same document!

• Use a schema that is natural and fits the API

Models - Overview

Page 24: Retail referencearchitecture productcatalog

25

{
  "_id": "054VA72303012P", // the item id
  "desc": [ // item descriptions
    { "lang": "en", "val": "Give your dressy look a lift with ..." },
    ...
  ],
  "name": "Women's Kate Ivory Peep-Toe Stiletto Heel",
  "category": "/84700/80009/1282094266/1200003270", // hierarchy
  "brand": { "id": "2483510", "img": "http://...", "name": "Metaphor" },
  "assets": { // references to all assets
    "imgs": [
      { "img": { "width": 1900, "height": 1900, "src": "http://..." } },
      ...
    ]
  },
  "shipping": { // shipping specs
  },
  "specs": { // item specs
  },
  "attrs": [ // list of item attributes (facets)
    { "name": "Heel Height", "value": "High (2-1/2 to 4 in.)" },
    { "name": "Toe", "value": "Open toe" },
    ...
  ],
  "variants": { // quick info on the variants
    "cnt": 9,
    "attrs": [
      { "dispType": "DROPDOWN", "name": "Color" },
      { "dispType": "DROPDOWN", "name": "Shoe Size" },
      ...
    ]
  },
  "lastUpdated": 1400877254787 // keep track of updates
}

Models - Item Model

Page 25: Retail referencearchitecture productcatalog

26

• Get item by id

db.definition.findOne( { _id: "301671" } )

• Get items from list of ids

db.definition.find( { _id: { $in: [ "301671", "301672" ] } } )

• Get items by department

db.definition.find({ category: { $regex: "^/84700/" } })

• Get items by category prefix

db.definition.find( { category: { $regex: "^/84700/80009/" } } )

• Secondary Indices

name, category, lastUpdated

Models - Item Model
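The category queries above rely on an anchored prefix regex, which is the form that can be served by the index on category. A minimal sketch of building such a filter safely from a path follows; the helper names escapeRegex and categoryPrefixFilter are hypothetical, not part of the deck.

```javascript
// Escape characters that have a special meaning in a regular expression.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build a filter matching every item below the given category path.
// The trailing "/" keeps "/84700/80009" from also matching "/84700/800091".
// The leading "^" anchor makes this a prefix match.
function categoryPrefixFilter(path) {
  const prefix = path.endsWith("/") ? path : path + "/";
  return { category: { $regex: "^" + escapeRegex(prefix) } };
}

const filter = categoryPrefixFilter("/84700/80009");
// In the mongo shell this would be passed as: db.definition.find(filter)
console.log(JSON.stringify(filter));
```

Note that, like the queries on the slide, this matches items strictly below the path; an item whose category is exactly "/84700/80009" would need an additional equality clause.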

Page 26: Retail referencearchitecture productcatalog

27

{ "_id": "05458452563", // the sku

"name": "Width:Medium,Color:Ivory,Shoe Size:6.5",

"itemId": "054VA72303012P", // reference to the item id

"altIds": { "upc": "632576103580" },

"assets": { // list of assets specific to variant

"imgs": [

{ "width": 1900, "height": 1900, "src": "http://..." },

{ "width": 1900, "height": 1900, "src": "http://..." }, ...

]

},

"attrs": [ // list of attributes specific to variant

{ "name": "Width", "value": "Medium" },

{ "name": "Color", "family": "White", "value": "Ivory" },

{ "name": "Size", "value": "6.5" }, ...

],

"lastUpdated": 1400877254787 // keep track of updates

}

Models – Variant Model

Page 27: Retail referencearchitecture productcatalog

28

• Get variant from SKU

db.variant.find( { _id: "05458452563" } )

• Get all variants for a product, sorted by SKU

db.variant.find( { itemId: "054VA72303012P" } ).sort( { _id: 1 } )

• Indices

itemId, lastUpdated

Models – Variant Model

Page 28: Retail referencearchitecture productcatalog

29

Models - Hierarchy

{

"_id": "1200003270", // the node id

"name": "Women's Heels & Pumps",

"count": 22305, // how many items in this category

"parents": [ // list of parents

"1282094266"

],

"facets": [ // facets that exists for this category

"Heel Height",

"Toe",

"Upper Material",

"Width",

"Shoe Size",

"Color"

]

}

Page 29: Retail referencearchitecture productcatalog

30

• Get hierarchy node by id

db.hierarchy.find( { _id: "1200003270" } )

• Get hierarchy node from parent id

db.hierarchy.find( { parents: "1282094266" } )

• Get departments (no parent)

db.hierarchy.find( { parents: null } )

• Secondary Indices

parents

Models – Hierarchy
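The item's category field ("/84700/80009/1282094266/1200003270") can be derived from hierarchy nodes by walking the parents list up to a department. The sketch below does this over an in-memory map; the intermediate node names are made up, and following only the first parent is an assumption, since a node may list several.

```javascript
// Hypothetical in-memory copy of hierarchy nodes, keyed by _id.
// Only the last node is taken from the slides; the names of the others are
// placeholders that complete the chain implied by the item's category path.
const nodes = new Map([
  ["84700", { _id: "84700", name: "Department (placeholder)" }],
  ["80009", { _id: "80009", name: "Level 2 (placeholder)", parents: ["84700"] }],
  ["1282094266", { _id: "1282094266", name: "Level 3 (placeholder)", parents: ["80009"] }],
  ["1200003270", { _id: "1200003270", name: "Women's Heels & Pumps", parents: ["1282094266"] }],
]);

// Walk up the first parent of each node until a department (no parent) is
// reached, then join the ids into a path usable as the item's category.
function categoryPath(nodeId) {
  const ids = [];
  let current = nodes.get(nodeId);
  while (current) {
    ids.unshift(current._id);
    const parentId = (current.parents || [])[0];
    current = parentId ? nodes.get(parentId) : undefined;
  }
  return "/" + ids.join("/");
}

const path = categoryPath("1200003270");
console.log(path);
```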

Page 30: Retail referencearchitecture productcatalog

31

Per store pricing could result in billions of documents…unless it is built in a modular way:

_id: concatenation of item and store.

Item: can be an item id or variant id (sku)

Store: can be a store group (online) or store id.

Models – per Store Pricing

{
  "_id": "skuSPM8824542513_1234/store123",
  "price": 69.99,
  "sale": {
    "salePrice": 42.72,
    "saleEndDate": "2050-12-31 23:59:59"
  },
  "lastUpdated": 1374647707394
}

Page 31: Retail referencearchitecture productcatalog

32

• Get all prices for a given item

db.prices.find( { _id: /^item301671/ } )

• Get all prices for a given sku (price could be at item level)

db.prices.find( { _id: { $in: [ /^sku730223104376/, /^item301671/ ] } } )

• Get minimum and maximum prices for a sku

db.prices.aggregate( [
  { $match: { _id: { $in: [ /^sku730223104376/, /^item301671/ ] } } },
  { $group: { _id: 1, min: { $min: "$price" }, max: { $max: "$price" } } }
] )

• Get price for a sku and store id (returns up to 4 prices)

db.prices.find(
  { _id: { $in: [ "sku730223104376/store1234",
                  "sku730223104376/sgroup0",
                  "item301671/store1234",
                  "item301671/sgroup0" ] } },
  { price: 1 } )

Models – per store Pricing
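The last query returns up to four price documents, so the caller still has to pick one. A minimal sketch of that step follows. The helper names are hypothetical, and the precedence order (sku before item, store before store group) is an assumption that the slides do not state.

```javascript
// Candidate _ids for a sku in a store, from most specific to least specific.
// The precedence order is an assumption made for this sketch.
function priceLookupIds(sku, itemId, storeId, storeGroup) {
  return [
    `sku${sku}/store${storeId}`,
    `sku${sku}/sgroup${storeGroup}`,
    `item${itemId}/store${storeId}`,
    `item${itemId}/sgroup${storeGroup}`,
  ];
}

// Return the first document, in candidate order, that the query found.
function resolvePrice(candidateIds, docs) {
  const byId = new Map(docs.map((doc) => [doc._id, doc]));
  for (const id of candidateIds) {
    if (byId.has(id)) return byId.get(id);
  }
  return null;
}

const ids = priceLookupIds("730223104376", "301671", "1234", "0");
// In the mongo shell: db.prices.find({ _id: { $in: ids } }, { price: 1 })
// Hypothetical result set: only two of the four documents exist.
const docs = [
  { _id: "item301671/sgroup0", price: 69.99 },
  { _id: "sku730223104376/sgroup0", price: 64.99 },
];
const best = resolvePrice(ids, docs);
console.log(best);
```

Because every candidate is an exact _id, the lookup stays a handful of primary-key reads regardless of how many stores exist.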

Page 32: Retail referencearchitecture productcatalog

Product Search

Page 33: Retail referencearchitecture productcatalog

34

Search – Browse and Search products

Browse by category

Special Lists

Filter by attributes

Lists hundreds of item summaries

By far the toughest page to get right and fast …

Page 34: Retail referencearchitecture productcatalog

35

The previous page presents many challenges:

• Response within milliseconds for hundreds of items

• Faceted search on many attributes: category, brand, …

• Efficient sorting on several attributes: price, popularity

• Pagination feature which requires deterministic ordering

> Search engines are built for this purpose!

Search – Browse and Search products

Page 35: Retail referencearchitecture productcatalog

36

Search – Traditional Architecture

Product Data Store Product Search

Indexing

#1 obtain search results IDs

Application

Cache

#2 obtain objects by ID from cache or DB

Pre-joined into objects

Page 36: Retail referencearchitecture productcatalog

37

The traditional architecture issues:

• 3 different systems to maintain: RDBMS, Search engine, Caching layer

• RDBMS schema is complex and static

• Applications need to speak many languages

Search – Traditional Architecture

Page 37: Retail referencearchitecture productcatalog

38

Search – Architecture with MongoDB

Product Data Store Product Search

Indexing

#1 obtain search results IDs

Applications

#2 obtain objects by list of IDs

MongoDB

Ready-to-use product documents

Search Engine

Product API

Application issues single query

Page 38: Retail referencearchitecture productcatalog

39

MongoDB

Search - Mongo-Connector

Search Engine

Oplog

Mongo Connector

#1 Initial dump of the collections

#2 Updates streaming via Oplog

Translation, filtering

Indexing

Indexing

Page 39: Retail referencearchitecture productcatalog

40

• Open-source Project at https://github.com/10gen-labs/mongo-connector

• Python app that reads from MongoDB's oplog and publishes to target of choice

• Supports initial sync by dumping the data

• Default connectors for Solr, Elastic Search, or another MongoDB cluster

• Easily extensible to update other systems like SQL

Search - Mongo-Connector
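To illustrate what "reads from the oplog and publishes to a target" amounts to, here is a much simplified sketch that replays oplog-style entries into an in-memory Map standing in for the search index. It is not mongo-connector's code: the entries are hypothetical, only $set updates are handled, and real oplog entries carry more fields and other update formats.

```javascript
// Apply one simplified oplog entry to a Map acting as the search index.
// Handles inserts ("i"), $set updates ("u") and deletes ("d") only.
function applyOplogEntry(index, entry, namespace) {
  if (entry.ns !== namespace) return; // filtering: ignore other collections
  switch (entry.op) {
    case "i":
      index.set(entry.o._id, { ...entry.o });
      break;
    case "u": {
      const id = entry.o2._id;
      const current = index.get(id) || { _id: id };
      index.set(id, { ...current, ...(entry.o.$set || {}) });
      break;
    }
    case "d":
      index.delete(entry.o._id);
      break;
    default:
      break; // no-ops, commands, etc. are skipped in this sketch
  }
}

// Hypothetical stream of entries for the "catalog.summary" namespace.
const entries = [
  { op: "i", ns: "catalog.summary", o: { _id: "p1", name: "Shoe", price: 69.99 } },
  { op: "u", ns: "catalog.summary", o2: { _id: "p1" }, o: { $set: { price: 42.72 } } },
  { op: "i", ns: "catalog.other", o: { _id: "x" } },
  { op: "i", ns: "catalog.summary", o: { _id: "p2", name: "Sofa" } },
  { op: "d", ns: "catalog.summary", o: { _id: "p2" } },
];

const index = new Map();
for (const entry of entries) applyOplogEntry(index, entry, "catalog.summary");
console.log([...index.values()]);
```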

Page 40: Retail referencearchitecture productcatalog

41

What is the data to index?

Search – Mongo-Connector

Page 41: Retail referencearchitecture productcatalog

42

Search – More Searching

Images of the matching variants are displayed

Facets for variants

Price and Rating

Page 42: Retail referencearchitecture productcatalog

43

… more challenges:

• Attributes at the variant level: color, size, etc

• Attributes from other docs: pricing, ratings, etc

• Display the matching variant's image and details

• Thousands of matching variants for an item, still need to display a single item

• Challenge to properly index the data

> Need for a single summary document per item

Search – More Searching

Page 43: Retail referencearchitecture productcatalog

44

MongoDB Data Store

Search - Architecture

Summaries, Items, Pricing, Promotions, Variants, Ratings & Reviews

Page 44: Retail referencearchitecture productcatalog

45

{
  "_id": "3ZZVA46759401P", // the item id
  "name": "Women's Chic - Black Velvet Suede",
  "dep": "84700", // useful as standalone for indexing
  "cat": "/84700/80009/1282094266/1200003270",
  "desc": { "lang": "en", "val": "This pointy toe slingback ..." },
  "img": { "width": 450, "height": 330, "src": "http://..." },
  "attrs": [ // global attributes, easily indexable by SE
    "heel height=mid (1-3/4 to 2-1/4 in.)",
    "brand=metaphor",
    "shoe size=6",
    "shoe size=6.5",
    ...
  ],
  "sattrs": [ // global attributes, not to be indexed
    "upper material=synthetic",
    "toe=open toe",
    ...
  ],
  "vars": [
    {
      "id": "05497884001",
      "img": [ // images
      ],
      "attrs": [ // list of variant attributes to index
      ],
      "sattrs": [ // list of variant attributes not to index
      ]
    },
    …
  ]
}

Search – Summary Model
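A summary like the one above could be assembled from the item and variant documents shown earlier. The sketch below is one possible way, not the deck's actual pipeline: buildSummary and toFacetStrings are hypothetical helpers, and the split between indexed and secondary attributes (attrs vs. sattrs) is left out for brevity.

```javascript
// Turn [{ name, value }] attribute lists into lowercase "name=value" strings,
// the form used by the summary's attrs arrays.
function toFacetStrings(attrs) {
  return attrs.map((attr) => `${attr.name}=${attr.value}`.toLowerCase());
}

// Build a single summary document from an item and its variants.
function buildSummary(item, variants) {
  return {
    _id: item._id,
    name: item.name,
    dep: item.category.split("/")[1], // first path segment is the department
    cat: item.category,
    desc: item.desc.find((d) => d.lang === "en") || item.desc[0],
    attrs: toFacetStrings(item.attrs),
    vars: variants.map((variant) => ({
      id: variant._id,
      attrs: toFacetStrings(variant.attrs),
    })),
  };
}

// Trimmed-down versions of the item and variant documents from the slides.
const item = {
  _id: "054VA72303012P",
  name: "Women's Kate Ivory Peep-Toe Stiletto Heel",
  category: "/84700/80009/1282094266/1200003270",
  desc: [{ lang: "en", val: "Give your dressy look a lift with ..." }],
  attrs: [
    { name: "Heel Height", value: "High (2-1/2 to 4 in.)" },
    { name: "Toe", value: "Open toe" },
  ],
};
const variants = [
  {
    _id: "05458452563",
    itemId: "054VA72303012P",
    attrs: [
      { name: "Width", value: "Medium" },
      { name: "Color", family: "White", value: "Ivory" },
      { name: "Size", value: "6.5" },
    ],
  },
];

const summary = buildSummary(item, variants);
console.log(JSON.stringify(summary, null, 2));
```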

Page 45: Retail referencearchitecture productcatalog

46

Let's use Solr …

Search – Using Solr

Page 46: Retail referencearchitecture productcatalog

47

Search - Using Solr

Page 47: Retail referencearchitecture productcatalog

48

Search - Using Solr

Defining the schema in schema.xml

<fields>
  <!-- some of the core fields -->
  <field name="_id" type="string" indexed="true" stored="true" />
  <field name="name" type="text_general" indexed="true" stored="true" />
  <field name="cat" type="string" indexed="true" stored="true" />
  <field name="price" type="float" indexed="true" stored="true"/>

  <!-- the full text to index -->
  <field name="desc.0.val" type="text_general" indexed="true" stored="true"/>

  <!-- dynamic attributes for faceting -->
  <dynamicField name="attrs.*" type="string" indexed="true" stored="true"/>

  <!-- some Solr specific fields -->
  <field name="_version_" type="long" indexed="true" stored="true"/>
  <field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>
  <dynamicField name="*" type="ignored" multiValued="true"/>
</fields>

Page 48: Retail referencearchitecture productcatalog

49

Search - Using Solr

Starting up the connector

> Keep it running, it will just stream the Oplog

> mongo-connector
    -m ec2-54-80-63-229.compute-1.amazonaws.com:27017    // the mongo
    -t http://localhost:8983/solr                        // the solr
    -d mongo_connector/doc_managers/solr_doc_manager.py
    -n "catalog.summary"                                 // target summary collection
    --auto-commit-interval=60                            // commit every 1 min
    …

Page 49: Retail referencearchitecture productcatalog

50

Document in Solr looks like:

Lists are flattened which is difficult to use

> Must use named fields to implement facets

Search – Using Solr

{
  "desc.0.val": "Our classic \"Flying Duck\" styled as a ...",
  "name": "Drake Waterfowl Duck Label SS T-Shirt Army Green",
  "attrs.1": "brand=Drake Waterfowl",
  "attrs.0": "style=t-shirts",
  "cat": "/84700/1200000239/1282094207/1200000817",
  "_id": "SPM10823491916",
  "_version_": 1479173524477182000,
  "timestamp": "2014-09-13T23:09:59.782Z"
}
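The "attrs.0" and "desc.0.val" keys come from flattening nested documents and lists into dotted field names. The sketch below reproduces that shape with a generic flatten function; it illustrates the idea only and is not the connector's own code.

```javascript
// Flatten nested objects and arrays into a single-level object whose keys are
// dotted paths; array elements use their position as the path segment.
function flatten(value, prefix = "", out = {}) {
  if (value !== null && typeof value === "object") {
    const entries = Array.isArray(value)
      ? value.map((child, i) => [String(i), child])
      : Object.entries(value);
    for (const [key, child] of entries) {
      flatten(child, prefix ? `${prefix}.${key}` : key, out);
    }
  } else {
    out[prefix] = value;
  }
  return out;
}

// Trimmed-down summary using values from the Solr document above.
const source = {
  _id: "SPM10823491916",
  desc: [{ lang: "en", val: "Our classic \"Flying Duck\" styled as a ..." }],
  attrs: ["style=t-shirts", "brand=Drake Waterfowl"],
};

const flat = flatten(source);
console.log(flat);
```

Since the position of a value in the list ends up in the field name, the same facet can land in "attrs.0" for one document and "attrs.1" for another, which is why the flattened form is awkward for faceting.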

Page 50: Retail referencearchitecture productcatalog

51

Let's use Elastic Search…

Search – Using Elastic Search

Page 51: Retail referencearchitecture productcatalog

52

Search - Using Elastic Search

Page 52: Retail referencearchitecture productcatalog

53

Search - Using Elastic Search

ElasticSearch understands the whole document right off the bat

Just need to tell ES not to tokenize the facets:

> Everything else is indexed auto-magically!

$ curl -XPOST localhost:9200/largecat3.summary -d '{
  "settings" : { "number_of_shards" : 1 },
  "mappings" : {
    "string" : { // string is the name of default mapping type
      "properties" : {
        "attrs" : { "type" : "string", "index" : "not_analyzed" }
      }
    }
  }
}'

Page 53: Retail referencearchitecture productcatalog

54

Search - Using Elastic Search

Starting up the connector

> Keep it running, it will just stream the Oplog

> mongo-connector
    -m ec2-54-80-63-229.compute-1.amazonaws.com:27017       // the mongo
    -t http://localhost:9200                                // the ES
    -d mongo_connector/doc_managers/elastic_doc_manager.py
    -n "catalog.summary"                                    // target summary collection
    --auto-commit-interval=60                               // commit every 1 min
    …

Page 54: Retail referencearchitecture productcatalog

55

Search - Using Elastic Search

Querying for documents, with facet info… works well

$ curl -X POST "http://localhost:9200/largecat3.summary/_search?pretty=true" -d '{
  "query" : { "query_string" : { "query" : "Ipad" } },
  "facets" : { "tags" : { "terms" : { "field" : "attrs" } } }
}'

{
  "took" : 6,
  "hits" : {
    "total" : 151,
    "max_score" : 0.5892989,
    "hits" : [
      {
        "_index" : "largecat3.summary",
        "_type" : "string",
        "_id" : "000000000000000012730000000000QAU-QR2442P",
        "_score" : 0.5892989,
        "_source" : { // original JSON from MongoDB
        }
      },
      ...
    ]
  },
  "facets" : {
    "tags" : {
      "_type" : "terms",
      "total" : 1577,
      "terms" : [
        { "term" : "ring size=9", "count" : 120 },
        { "term" : "ring size=8", "count" : 120 },
        { "term" : "metal=sterling silver", "count" : 112 },
        ...
      ]
    }
  }
}

Page 55: Retail referencearchitecture productcatalog

56

How about MongoDB's indexes and Full-Text-Search?

Search – Using MongoDB Indexing

Page 56: Retail referencearchitecture productcatalog

57

The summary contains:

• department e.g. "Shoes"

• Fields to index

– Category path, e.g. "Shoes/Women/Pumps"

– Price

– List of Item Attributes, e.g. Brand = Guess

– List of Variant Attributes, e.g. Color = red

• Fields not to index

– List of Item Secondary Attributes, e.g. Style = Designer

– List of Variant Secondary Attributes, e.g. heel height = 4.0

Search – Using MongoDB indexing

Page 57: Retail referencearchitecture productcatalog

58

• Get summary from item id

db.variation.find( { _id: "p301671" } )

• Get summary's specific variation from SKU

db.variation.find( { "vars.sku": "730223104376" }, { "vars.$": 1 } )

• Get summary by department, sorted by rating

db.variation.find( { department: "Shoes" } ).sort( { rating: 1 } )

• Get summary with mix of parameters

db.variation.find( { "department": "Shoes",
                     "vars.attrs": { "color": "Gray" },
                     "category": /^\/Shoes\/Women\//,
                     "price": { "$gte": 65.99, "$lte": 180.99 } } )

Search - Using MongoDB indexing

Page 58: Retail referencearchitecture productcatalog

59

Search – Using MongoDB indexing

• The following indices are used:

– department + attr + category + _id

– department + vars.attrs + category + _id

– department + category + _id

– department + price + _id

– department + rating + _id

• _id used for pagination

• Can take advantage of index intersection

• With several attributes specified (e.g. color=red and size=6), which one is looked up?
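Using _id for pagination usually means appending it to the sort key as a tiebreaker and resuming after the last document seen, rather than skipping. A minimal sketch of building that "next page" filter follows; nextPageFilter is a hypothetical helper and ascending order is assumed.

```javascript
// Build the filter for the page that follows the last summary already shown.
// sortField is the indexed sort key (e.g. "price"); _id breaks ties so the
// ordering is deterministic. Assumes an ascending sort on both fields.
function nextPageFilter(baseFilter, sortField, last) {
  if (!last) return { ...baseFilter }; // first page: no lower bound
  return {
    ...baseFilter,
    $or: [
      { [sortField]: { $gt: last[sortField] } },
      { [sortField]: last[sortField], _id: { $gt: last._id } },
    ],
  };
}

const sortSpec = { price: 1, _id: 1 };
const lastSeen = { _id: "p301671", price: 65.99 }; // last document of the previous page
const pageFilter = nextPageFilter({ department: "Shoes" }, "price", lastSeen);
// In the mongo shell: db.variation.find(pageFilter).sort(sortSpec).limit(100)
console.log(JSON.stringify(pageFilter));
```

This pairs naturally with the "department + price + _id" index listed above, since the filter and the sort both follow the index order.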

Page 59: Retail referencearchitecture productcatalog

60

Facet samples:

{ "_id" : "Accessory Type=Hosiery" , "count" : 14}

{ "_id" : "Ladder Material=Steel" , "count" : 2}

{ "_id" : "Gold Karat=14k" , "count" : 10138}

{ "_id" : "Stone Color=Clear" , "count" : 1648}

{ "_id" : "Metal=White gold" , "count" : 10852}

Single operations to insert / update:

db.facet.update( { _id: "Accessory Type=Hosiery" },

{ $inc: { count: 1 } }, true, false )

The facet with lowest count is the most restrictive…

It should come first in the $all query!

Search – Using MongoDB indexing
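Ordering the requested facets by their stored counts can be sketched as follows. The counts are the sample values from this slide; the helper names and the attrs field name in the resulting filter are assumptions.

```javascript
// Facet counts as stored in the facet collection (sample values from above).
const facetCounts = new Map([
  ["Gold Karat=14k", 10138],
  ["Stone Color=Clear", 1648],
  ["Metal=White gold", 10852],
]);

// Sort the requested facets from the lowest count to the highest.
// A facet missing from the collection has never been counted, so it is
// treated as 0 and placed first.
function orderByCount(requested, counts) {
  const countOf = (facet) => (counts.has(facet) ? counts.get(facet) : 0);
  return [...requested].sort((a, b) => countOf(a) - countOf(b));
}

// Build the $all filter with the most restrictive facet first.
function facetFilter(requested, counts) {
  return { attrs: { $all: orderByCount(requested, counts) } };
}

const facetQuery = facetFilter(
  ["Metal=White gold", "Gold Karat=14k", "Stone Color=Clear"],
  facetCounts
);
console.log(JSON.stringify(facetQuery));
```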

Page 60: Retail referencearchitecture productcatalog

61

• Search Engine advantages:

– Index size (~ 10x smaller than MongoDB's)

– Indexing speed

– Read speed, integrated cache

– All languages support

– Built-in faceted search, which includes facet counts

• MongoDB's Indexing advantages:

– Built into the data store, no additional server / software needed

– Single query to get the results

– Can filter down the variant entry and save computing

> Winner here is Elastic Search

Search – Comparing Solutions

Page 61: Retail referencearchitecture productcatalog

62

Search – Benchmarking

Department  Category  Price  Primary attribute  Time: Average (ms)  90th (ms)  95th (ms)
1           0         0      0                  2                   3          3
1           1         0      0                  1                   2          2
1           0         1      0                  1                   2          3
1           1         1      0                  1                   2          2
1           0         0      1                  0                   1          2
1           1         0      1                  0                   1          1
1           0         1      1                  1                   2          2
1           1         1      1                  0                   1          1
1           0         0      2                  1                   3          3
1           1         0      2                  0                   2          2
1           0         1      2                  10                  20         35
1           1         1      2                  0                   1          1

Page 62: Retail referencearchitecture productcatalog

Closing Comments

Page 63: Retail referencearchitecture productcatalog

64

Q & A Time

Page 64: Retail referencearchitecture productcatalog

Thank You!

Antoine Girbal, Principal Solutions Engineer, MongoDB Inc., @antoinegirbal
