Austin bdug 2011_01_27_small_and_big_data


Short overview of data infrastructure at Bazaarvoice. We use a combination of many different data stores such as MySQL, SOLR, Infobright, MongoDB and Hadoop.


I CAN HAS BIG DATA?
Small and Big Data at Bazaarvoice

Alex Pinkin
@apinkin

whois apinkin

● Alex Pinkin, Software Engineering Lead, Data Infrastructure team, Bazaarvoice

● Loves both SQL and NoSQL. Can't commit to one! :-)


Big Data?

A few facts about Bazaarvoice

● Bazaarvoice is a SaaS company powering user-generated content such as ratings and reviews on thousands of web sites

● Over 75 Million reviews

● 280 Billion impressions

● 5 Billion Page Views per month

How Do We Do It?

● Client-side integration

● Code and Servers :)

What Do We Run in Prod?

● SQL
  ○ MySQL
  ○ Infobright

● NoSQL
  ○ SOLR
  ○ ElasticSearch
  ○ MongoDB
  ○ CouchDB
  ○ Hadoop

Four Pillars

MySQL and Big Data?!!

● Yes, MySQL is our Master. Mostly used as a K/V store.

● Scaling Reads: Replication
● Scaling Writes: Sharding
● HA: Hot Back-up, Multiple DC

● Pros
  ○ Rock solid
  ○ SQL

● Cons
  ○ Inflexible schema
  ○ Replication lag
  ○ Sharding not built-in
  ○ HA
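
Because sharding is not built in, the application has to decide which database holds each key. A minimal sketch of hash-based routing for a K/V workload follows. The shard names, the `shard_for` helper, and the `ShardedKV` class are hypothetical, and plain dictionaries stand in for the MySQL instances; this is not the scheme used at Bazaarvoice.

```python
import hashlib

# Hypothetical shard names; a real deployment would map these to MySQL hosts.
SHARDS = ["shard0", "shard1", "shard2", "shard3"]


def shard_for(key, shards=SHARDS):
    """Hash the key so the same key always routes to the same shard."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return shards[int.from_bytes(digest[:8], "big") % len(shards)]


class ShardedKV:
    """In-memory stand-in for a key/value store spread over several databases."""

    def __init__(self, shards=SHARDS):
        self.shards = list(shards)
        # One dict per shard plays the role of one database.
        self.stores = {name: {} for name in self.shards}

    def put(self, key, value):
        self.stores[shard_for(key, self.shards)][key] = value

    def get(self, key, default=None):
        return self.stores[shard_for(key, self.shards)].get(key, default)


kv = ShardedKV()
kv.put("review:1", {"rating": 5})
print(kv.get("review:1"))
```

Modulo routing like this is simple, but changing the number of shards remaps most keys, which is one reason hand-rolled sharding counts as a con.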

Search: SOLR/Lucene

● Document Store
● Inverted Index

Term               Document IDs
rating:5           1,2
rating:4           3
productId:12345    1,2,3
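
The term table above can be reproduced with a toy inverted index. This is only a sketch of the data structure, not of how SOLR/Lucene store postings internally; `build_index`, `search`, and the sample documents are illustrative assumptions.

```python
from collections import defaultdict


def build_index(docs):
    """Map each "field:value" term to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, fields in docs.items():
        for field, value in fields.items():
            index["%s:%s" % (field, value)].add(doc_id)
    return index


def search(index, *terms):
    """Return sorted IDs of documents that match every term (AND query)."""
    postings = [index.get(term, set()) for term in terms]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


# Three reviews of the same product, matching the table on the slide.
docs = {
    1: {"rating": 5, "productId": 12345},
    2: {"rating": 5, "productId": 12345},
    3: {"rating": 4, "productId": 12345},
}
index = build_index(docs)
print(search(index, "rating:5", "productId:12345"))  # [1, 2]
```

A query for five-star reviews of product 12345 becomes an intersection of two posting lists, so no document has to be scanned.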

Analytics

Analytics - Infobright

● Columnar storage
  ○ Compression (10x+)
  ○ Reduced disk I/O

● Partitioning
  ○ Horizontal: Data Packs
  ○ Vertical: Columns

● Knowledge grid
  ○ MIN(C), MAX(C), SUM(C), AVG(C), COUNT(DISTINCT(C))
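
To illustrate why per-pack statistics reduce disk I/O, here is a simplified sketch of pruning with MIN/MAX metadata. It assumes one column split into fixed-size packs; `build_packs`, `count_greater_than`, and the tiny pack size are hypothetical and do not reflect Infobright's actual internals.

```python
def build_packs(values, pack_size):
    """Split one column into packs and record summary statistics per pack."""
    packs = []
    for start in range(0, len(values), pack_size):
        rows = values[start:start + pack_size]
        packs.append({
            "rows": rows,
            "min": min(rows),
            "max": max(rows),
            "sum": sum(rows),
            "count": len(rows),
        })
    return packs


def count_greater_than(packs, threshold):
    """Count values > threshold; return (count, number of packs actually read)."""
    count = 0
    packs_read = 0
    for pack in packs:
        if pack["max"] <= threshold:
            continue  # no row can match: skip the pack entirely
        if pack["min"] > threshold:
            count += pack["count"]  # every row matches: metadata is enough
            continue
        packs_read += 1  # mixed pack: the rows themselves must be scanned
        count += sum(1 for value in pack["rows"] if value > threshold)
    return count, packs_read


ratings = [1, 1, 2, 2, 3, 4, 4, 5, 5, 5, 5, 5]
packs = build_packs(ratings, pack_size=4)
print(count_greater_than(packs, 3))  # (7, 1)
```

Only one of the three packs is read here; the other two are answered from their statistics alone.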

Infobright - Pros and Cons

● Pros
  ○ 30x faster than MySQL on analytics queries
  ○ Open Source

● Cons
  ○ No DML in OSS version
  ○ No MPP (good for up to 5 TB)

Hadoop Use Case

Bazaarvoice EMR - Phase 1

Bazaarvoice EMR - Phase 2

Summary

● We use the best tool for the job

● NoSQL is maturing quickly. Query languages are still in flux though.

● Hadoop is here to stay

● We are (slowly) moving away from MySQL

@apinkin
