I CAN HAS BIG DATA? Small and Big Data at Bazaarvoice Alex Pinkin @apinkin

Austin bdug 2011_01_27_small_and_big_data


DESCRIPTION

Short overview of data infrastructure at Bazaarvoice. We use a combination of many different data stores such as MySQL, SOLR, Infobright, MongoDB and Hadoop.


Page 1: Austin bdug 2011_01_27_small_and_big_data

I CAN HAS BIG DATA?
Small and Big Data at Bazaarvoice

Alex Pinkin
@apinkin

Page 2: Austin bdug 2011_01_27_small_and_big_data

whois apinkin

● Alex Pinkin, Software Engineering Lead, Data Infrastructure team, Bazaarvoice

● Loves both SQL and NoSQL. Can't commit to one! :-)

@apinkin

Page 3: Austin bdug 2011_01_27_small_and_big_data

Big Data?

Page 4: Austin bdug 2011_01_27_small_and_big_data

A few facts about Bazaarvoice

● Bazaarvoice is a SaaS company powering user-generated content, such as ratings and reviews, on thousands of web sites

● Over 75 Million reviews

● 280 Billion impressions

● 5 Billion Page Views per month

Page 5: Austin bdug 2011_01_27_small_and_big_data

How Do We Do It?

● Client-side integration

● Code and Servers :)

Page 6: Austin bdug 2011_01_27_small_and_big_data

What Do We Run in Prod?

● SQL
  ○ MySQL
  ○ Infobright

● NoSQL
  ○ SOLR
  ○ ElasticSearch
  ○ MongoDB
  ○ CouchDB
  ○ Hadoop

Page 7: Austin bdug 2011_01_27_small_and_big_data

Four Pillars

Page 8: Austin bdug 2011_01_27_small_and_big_data

MySQL and Big Data?!!

● Yes, MySQL is our Master. Mostly used as K/V store.

● Scaling Reads: Replication

● Scaling Writes: Sharding

● HA: Hot Back-up, Multiple DC

● Pros
  ○ Rock solid
  ○ SQL

● Cons
  ○ Inflexible schema
  ○ Replication lag
  ○ Sharding not built-in
  ○ HA
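
Since sharding is not built into MySQL, the application has to decide which shard owns each key. A minimal sketch of hash-based routing for a K/V workload follows. It is an illustration only, not Bazaarvoice's actual scheme: `shard_for_key` and `ShardedKV` are hypothetical names, and plain in-memory dicts stand in for the MySQL shards.

```python
import hashlib


def shard_for_key(key: str, num_shards: int) -> int:
    """Pick a shard index from a stable hash of the key.

    hashlib is used instead of the built-in hash() because the latter is
    randomized per process for strings, so it would not route consistently.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedKV:
    """Toy K/V store: each dict stands in for one MySQL shard (assumption)."""

    def __init__(self, num_shards: int):
        self.shards = [dict() for _ in range(num_shards)]

    def put(self, key: str, value) -> None:
        self.shards[shard_for_key(key, len(self.shards))][key] = value

    def get(self, key: str):
        return self.shards[shard_for_key(key, len(self.shards))].get(key)


store = ShardedKV(num_shards=4)
store.put("review:1", {"rating": 5, "productId": 12345})
store.put("review:2", {"rating": 4, "productId": 12345})
```

Each write touches exactly one shard, which is what lets writes scale. The trade-off with simple modulo placement is that changing the number of shards remaps most keys.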

Page 9: Austin bdug 2011_01_27_small_and_big_data

Search: SOLR/Lucene

● Document Store

● Inverted Index

Term              Document IDs
rating:5          1, 2
rating:4          3
productId:12345   1, 2, 3
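
The term table above can be reproduced with a few lines of Python. This is a rough sketch of the inverted index idea only, not how SOLR/Lucene stores its index internally. The three sample documents are inferred from the table, and `build_inverted_index` and `search` are hypothetical helper names.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each 'field:value' term to the sorted IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, fields in docs.items():
        for field, value in fields.items():
            index[f"{field}:{value}"].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search(index, *terms):
    """AND query: IDs of documents that match every given term."""
    if not terms:
        return []
    postings = [set(index.get(term, ())) for term in terms]
    return sorted(set.intersection(*postings))


# Sample documents inferred from the table: three reviews of one product.
docs = {
    1: {"rating": 5, "productId": 12345},
    2: {"rating": 5, "productId": 12345},
    3: {"rating": 4, "productId": 12345},
}

index = build_inverted_index(docs)
five_star = search(index, "rating:5", "productId:12345")  # [1, 2]
```

A query never scans the documents themselves: it looks up each term's ID list and intersects the lists.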

Page 10: Austin bdug 2011_01_27_small_and_big_data

Analytics

Page 11: Austin bdug 2011_01_27_small_and_big_data

Analytics - Infobright

● Columnar storage
  ○ Compression (10x+)
  ○ Reduced disk I/O

● Partitioning
  ○ Horizontal: Data Packs
  ○ Vertical: Columns

● Knowledge grid
  ○ MIN(C), MAX(C), SUM(C), AVG(C), COUNT(DISTINCT(C))
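
To see why per-pack statistics reduce disk I/O, here is a toy sketch of answering `SUM(C) WHERE C > threshold` over one column. It is a simplified illustration of the data pack and knowledge grid idea, not Infobright's implementation: the `DataPack` class, the helper names, and the tiny pack size of 4 are all assumptions made for readability.

```python
from dataclasses import dataclass


@dataclass
class DataPack:
    values: list   # the stored column values (compressed on disk in a real engine)
    minimum: int   # statistics kept separately from the data
    maximum: int
    total: int


def build_packs(column, pack_size):
    """Split a column horizontally into packs and record MIN/MAX/SUM for each."""
    packs = []
    for start in range(0, len(column), pack_size):
        chunk = column[start:start + pack_size]
        packs.append(DataPack(chunk, min(chunk), max(chunk), sum(chunk)))
    return packs


def sum_where_greater(packs, threshold):
    """Return (SUM of values > threshold, number of packs actually scanned)."""
    total, scanned = 0, 0
    for pack in packs:
        if pack.maximum <= threshold:
            continue                      # no row can match: skip the pack
        if pack.minimum > threshold:
            total += pack.total           # every row matches: use stored SUM
            continue
        scanned += 1                      # mixed pack: must read the values
        total += sum(v for v in pack.values if v > threshold)
    return total, scanned


ratings = [1, 2, 2, 1, 3, 4, 5, 2, 5, 5, 4, 5]
packs = build_packs(ratings, pack_size=4)
result, scanned = sum_where_greater(packs, threshold=3)  # (28, 1)
```

Of the three packs, one is skipped outright, one is answered from its stored SUM, and only one has to be read.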

Page 12: Austin bdug 2011_01_27_small_and_big_data

Infobright - Pros and Cons

● Pros
  ○ 30x faster than MySQL on analytics queries
  ○ Open Source

● Cons
  ○ No DML in OSS version
  ○ No MPP (good for up to 5 TB)

Page 13: Austin bdug 2011_01_27_small_and_big_data

Hadoop Use Case

Page 14: Austin bdug 2011_01_27_small_and_big_data

Bazaarvoice EMR - Phase 1

Page 15: Austin bdug 2011_01_27_small_and_big_data

Bazaarvoice EMR - Phase 2

Page 16: Austin bdug 2011_01_27_small_and_big_data

Summary

● We use the best tool for the job

● NoSQL is maturing quickly. Query languages are still in flux though.

● Hadoop is here to stay

● We are (slowly) moving away from MySQL

Page 17: Austin bdug 2011_01_27_small_and_big_data

@apinkin