We describe the architecture of a personalization platform that captures customer profiles and behavioral data. A Cassandra cluster serves as an intermediate storage backend that replicates updates to profile records and timeline events across multiple data centers. A caching tier serves the user data and provides a real-time execution environment in which predictive models can, for example, calculate propensities or update category histograms. We delve into the metrics used to track replication performance and data freshness. We also discuss applications and features, such as user badges, that are powered by this new P13N platform.
Bullseye P13n Platform
April 7, 2014
Charles Bracher Bullseye Dev Manager
Ranjan Sinha, PhD Lead Research Scientist
Bullseye
Outline
P13n Platform
Why Cassandra?
Cassandra Setup
Cassandra Usage
Cassandra Issues and Resolutions
Hand over to Ranjan for the Data Science Perspective
Bullseye Functional Architecture
[Diagram components:]
Offline Analysis
Offline Database / Batch Processing
Recent User Data, 1-5 days (Cassandra)
Real Time Model Evaluation & Caching (sharded/full user state in memory)
Client Access
Near Real Time Event Collection
Tracking
Long Term User Data (Local SSD)
Why Cassandra?
Great write performance
Great replication performance
Reasonable read performance
Reasonable cost
Client controlled consistency settings
Cassandra Setup
Cassandra Version 1.2.9
We use Replication
– Cassandra rings deployed to 3 datacenters
Cassandra clients
– We use both the DataStax Java client and the DataStax C++ client (beta)
Using CQL Table specifications and commands
Not on SSDs
Cassandra Usage
Column Family Design:
– Avoid Tombstones
– Avoid Compaction
With Focus on Short Term Storage:
– Turn off automatic compaction / only manual compaction
– Use unique column key names to avoid tombstones
– Clear out old data with truncation
Cache Miss Flow (New Session)
CREATE TABLE DAY_N (USER_ID TEXT, RECORD_NAME TEXT, RECORD_VALUE BLOB, PRIMARY KEY (USER_ID, RECORD_NAME));
– Write to the active day's column family, keyed by user id.
– Truncate the oldest day's column family.
– When going from one day to the next, do a manual compaction for the old day.
– On read, pull the user id's info from all column families newer than the local SSD data.
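The day-bucket rotation described here can be sketched in a few lines of Python. This is only an in-memory illustration, not the Bullseye code: the class `DayBuckets`, its method names, and the modulo mapping of days onto tables are all assumptions, and plain dictionaries stand in for the DAY_N column families. Manual compaction has no in-memory equivalent and is left out.

```python
from collections import defaultdict


class DayBuckets:
    """In-memory stand-in for the DAY_N tables (hypothetical helper, not Cassandra code)."""

    def __init__(self, num_days):
        self.num_days = num_days
        # One dict per DAY_N table: {user_id: {record_name: record_value}}
        self.tables = [defaultdict(dict) for _ in range(num_days)]

    def _slot(self, day):
        # Assumption: day numbers map onto a ring of num_days tables.
        return day % self.num_days

    def write(self, day, user_id, record_name, record_value):
        # Unique record names mean columns are only added, never overwritten
        # or deleted, which is what avoids tombstones in the real tables.
        self.tables[self._slot(day)][user_id][record_name] = record_value

    def rotate(self, new_day):
        # The slot about to be reused holds the oldest day: truncate it.
        self.tables[self._slot(new_day)].clear()

    def read(self, user_id, ssd_day, today):
        # Merge the user's records from every day newer than the local SSD
        # snapshot. Assumes today - ssd_day is smaller than num_days.
        merged = {}
        for day in range(ssd_day + 1, today + 1):
            merged.update(self.tables[self._slot(day)].get(user_id, {}))
        return merged
```

On a cache miss, a caller would load the long-term record from local SSD and then overlay the result of `read(user_id, ssd_day, today)`, so only the days not yet covered by the SSD snapshot are fetched from the short-term store.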
Queuing Flow (Ongoing Activity)
CREATE TABLE HOUR_N (ID TEXT, RECORD_NAME TEXT, RECORD_VALUE BLOB, PRIMARY KEY (ID, RECORD_NAME));
– Read/write from the active hour, with the key being the timestamp rounded to the nearest second.
– Store the column family that is one hour old to the offline DB.
– Truncate the column family that is two hours old.
– Do an async probe of the record for the current second, as well as recent seconds, until state is captured.
– Data may be read 1-3 times, or more if replication is lagging.
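To make the keying and probing concrete, here is a minimal Python sketch of how a reader might pick the row keys to probe. The function names, the ring of three hour tables, and the two-second lookback are assumptions chosen to match the "one hour old / two hours old" rotation and the "read 1-3 times" remark; the slides do not specify them.

```python
import math


def second_key(ts):
    """Round a Unix timestamp (seconds, non-negative) to the nearest whole second."""
    return math.floor(ts + 0.5)


def hour_table(second, num_tables=3):
    """Name of the HOUR_N table holding this second (num_tables is an assumption)."""
    return f"HOUR_{(second // 3600) % num_tables}"


def probe_targets(now, last_captured, max_lookback=2):
    """(table, row key) pairs to probe, newest first.

    Covers the current second plus recent seconds that have not been
    captured yet, looking back at most max_lookback seconds.
    """
    current = second_key(now)
    oldest = max(last_captured + 1, current - max_lookback)
    return [(hour_table(s), str(s)) for s in range(current, oldest - 1, -1)]
```

With a lookback of two seconds, each second's row is probed up to three times before it falls out of the window; a larger lookback would be the knob to turn when replication is lagging.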
Cassandra Issues and Resolutions
Issues with the DataStax C++ Cassandra client (beta)
– open sourced, so could apply fixes
Performance issues with the cache miss query
– increased heap size
– reduced replication factor
– turned off cross colo read repair
– deployed data center aware policy for C++
Personalization Applications
Ranjan Sinha, PhD Lead Research Scientist
April 7, 2014
Disclaimer: Some of the content in this talk is based on my personal opinion. It does not reflect the views of eBay.
Outline
Why Personalize?
P13N Platform
– Introduction
– Conceptual architecture
– Modeling stages
P13N Applications
– User badges
– Search ranking
– Contextual models
– Deals
Why Personalize?
Enable more relevant experience
Retention of existing users
New user acquisition
Reactivating churned users
Increasing activity per user
Improving conversion from visits to transactions
P13N Platform: Introduction
Maintains activity timeline information
Enables event processing at near real-time
Enables in-session personalization
Provides environment for predictive model evaluation
Backup and restore to and from Hadoop/HBase
P13N Platform: Conceptual Architecture
[Diagram. Labels: Tracking Event Source; CEP Processor; Filters and forwards events; Model Executor (models m1, m2, m3, …, mn); In-memory Cache + Model Evaluation; Activity Timeline + User Badges; Cassandra; Hadoop/HBase; Offline Modeling Platform; User Badges; Client Access]
P13N Platform: Modeling stages
Realtime
– In-session user intent
– Contextual Models
Nearline
– Update propensity models (aka User Badges)
Offline
– Bootstrap propensity models by mining long-term behavior history
Application (1): User Badges
Name            Description
SaleType        Auction vs. Buy-it-now
ItemCondition   New vs. Used
Category        Preference of categories
Price           Price range of purchasing activity
Deals           Propensity to purchase deals
Social Share    Propensity to share items in social media
Profile based on long-term behavior
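As an illustration of how a badge such as Category might be maintained, the Python sketch below keeps a running category histogram and derives a badge from it. It is a simplified assumption, not the platform's model: the event shape (a dict with a `category` field), the function names, and the "top N categories by share" rule are all hypothetical.

```python
from collections import Counter


def update_histogram(histogram, events):
    """Nearline step: fold a finished session's events into the running category counts."""
    histogram.update(event["category"] for event in events)
    return histogram


def category_badge(histogram, top_n=2):
    """Turn counts into shares and keep the top categories as the badge."""
    total = sum(histogram.values())
    if total == 0:
        return []
    return [(cat, count / total) for cat, count in histogram.most_common(top_n)]
```

In this sketch the offline stage would bootstrap the histogram from long-term history, and the nearline stage would call `update_histogram` after each session.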
Application (2): Search Ranking …
Should all queries be personalized in the same manner?
– For some queries (ebay or google), everyone would like the same results
– For other queries, different people may want completely different results
Query: “big ben puzzles”
Not_P13N Rank   P13N Rank   Sold   IsNew   Title
1               1           No     No      LOT OF 7 BIG BEN PUZZLES 5/1000PC. 2/1500 PUZZLES EUC
2               3           No     Yes     1000 Pc MB Big Ben Jigsaw Puzzle Mount Shuksan North Cascades National Park WA
3               2           Yes    No      COMPLETE Fishing Village,Smalls Island MB Big Ben Puzzle 1000 Piece Puzzle Size!
User: always buys used items
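The reordering in this example can be reproduced with a toy re-ranker. The Python sketch below is an assumption for illustration only, not eBay's ranking function: it gives each result a base score from its unpersonalized rank and adds a fixed boost (the hypothetical `boost` parameter) when the item's condition matches the user's ItemCondition badge.

```python
def personalize(results, prefers_used, boost=1.5):
    """Re-rank results for a user whose badge says they prefer used (or new) items."""
    n = len(results)

    def score(item):
        base = n - item["rank"] + 1  # unpersonalized rank 1 gets the highest base score
        matches = (not item["is_new"]) if prefers_used else item["is_new"]
        return base + (boost if matches else 0.0)

    # sorted() is stable, so items with equal scores keep their original order.
    return sorted(results, key=score, reverse=True)


results = [
    {"rank": 1, "is_new": False},  # lot of 7 puzzles
    {"rank": 2, "is_new": True},   # Mount Shuksan puzzle
    {"rank": 3, "is_new": False},  # Fishing Village puzzle
]
reranked = personalize(results, prefers_used=True)
```

For the user who always buys used items, the used item originally at rank 3 moves ahead of the new item at rank 2, matching the P13N Rank column.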
Application (3): Contextual models …
Infer categories that user is interested in within the current session
Long and Short term behavior
– Historic behavior may provide benefits at the start of the session
– Short-term behavior may contribute gains in an extended search session
– Combination of session and historic behavior may outperform using either alone
[Diagram. Event source emitting events e1, e2, e3, … along a time axis t, with three modeling phases labeled: Offline, historical; Online, in-session; Nearline, after session expiry]
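One simple way to combine historic and in-session behavior is a weighted blend whose session weight grows as the session gets longer. The Python sketch below is a hedged illustration of that idea, not the contextual model from the talk; the function name, the weighting formula, and the `pivot` parameter are assumptions.

```python
def blend_categories(historic, session, session_events, pivot=5.0):
    """Blend historic and in-session category scores.

    The session weight is 0 at the start of a session, so the result equals
    the historic profile, and it approaches 1 as more events arrive. pivot
    is the number of events at which both sources count equally.
    """
    w = session_events / (session_events + pivot)
    categories = set(historic) | set(session)
    return {
        c: (1 - w) * historic.get(c, 0.0) + w * session.get(c, 0.0)
        for c in categories
    }
```

With this shape, the historic profile drives results at the start of the session and the short-term signal takes over in an extended session, which mirrors the three observations listed above.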
Application (4): Deals
Personalize categories
Personalize modules
Personalize tabs
Personalize items
fin