On Benchmarking Online Social Media Analytical Queries Weining Qian with Haixin Ma, Fan Xia, Jinxian Wei, Chengcheng Yu, and Aoying Zhou http://database.ecnu.edu.cn/



Slides for GRADES 2013 (Workshop affiliated with SIGMOD 2013) (GRADES: Graph Data-management Experiences & Systems) http://event.cwi.nl/grades2013/


Page 1: On Benchmarking Online Social Media Analytical Queries

On Benchmarking Online Social Media Analytical Queries

Weining Qian

with Haixin Ma, Fan Xia, Jinxian Wei, Chengcheng Yu, and Aoying Zhou

http://database.ecnu.edu.cn/

Page 2: On Benchmarking Online Social Media Analytical Queries

6/23/2013 GRADES 2013 @ NY, USA 2

Outline

• Motivation
• BSMA: Benchmark for Social Media Analytical query processing
  – Data set
  – Queries
  – Measurements
• Preliminary results
• Discussion/on-going work

Page 3: On Benchmarking Online Social Media Analytical Queries


Motivation

• Social media has become a major source for sensing the world
  – Emergent event monitoring, political election/stock market prediction, product surveys, etc.
• Social media = social network + media
  – Social network: large-scale static/dynamic networks
  – Media: content with timestamps
• Both collective behavior analysis and personalized data analysis have many applications
  – Various kinds of queries

Page 4: On Benchmarking Online Social Media Analytical Queries


Motivation

• Many "big data" management/mining systems exist (and maybe more are coming)
  – Parallel RDBMS, NoSQL/NewSQL systems (Hadoop-related ones, Cassandra, etc.)
• Which system/technology is most suitable for a given problem?
  – A benchmark is needed

Page 5: On Benchmarking Online Social Media Analytical Queries


Social media data

Page 6: On Benchmarking Online Social Media Analytical Queries


Schema

Page 7: On Benchmarking Online Social Media Analytical Queries


BSMA

Queries (to be extended/revised)

Data set (crawled from Sina Weibo)

Data generator (under development)

BSMA performance testing tool (based on YCSB)

Page 8: On Benchmarking Online Social Media Analytical Queries


Data acquisition

• Crawled from Sina Weibo ("Chinese Twitter")

Haixin Ma, Weining Qian, Fan Xia, Xiaofeng He, Jun Xu, Aoying Zhou: Towards modeling popularity of microblogs. Frontiers of Computer Science 7(2): 171-184 (2013)

Page 9: On Benchmarking Online Social Media Analytical Queries


Data set

• Followship network
  – Seed users: 11 lawyers and opinion leaders and 21 researchers
  – 2nd level users from seeds: 120,000+ users
  – 3rd level users from seeds: 1.7+ million users
  – 4th level users from seeds: 18+ million users (incomplete)
• More than 1 billion following relationships
  – Tweets from 1.7+ million users
  – From Aug. 2009 to Jun. 2012
  – 480+ million tweets (about 51.11% of them are retweeted tweets, and the others are original tweets)

Page 10: On Benchmarking Online Social Media Analytical Queries


Queries

• Queries on social networks
  – E.g. list the common followees of users A and B
• Queries on hotspots
  – Hotspots may be: users, tweets, topics, etc.
  – E.g. list the tweets with the highest number of retweets
• Queries on timelines
  – E.g. list the 10 most recent tweets posted by A's followees
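The three query classes above can be illustrated with a minimal in-memory sketch. It is not taken from BSMA: the toy `follows` and `tweets` structures, the tuple layout, and all function names are assumptions made only for illustration.

```python
from heapq import nlargest

# Hypothetical toy data; the real BSMA schema is not reproduced here.
follows = {  # user -> set of followees
    "A": {"u1", "u2", "u3"},
    "B": {"u2", "u3", "u4"},
}
tweets = [  # (tweet_id, author, timestamp, retweet_count)
    ("t1", "u1", 100, 5),
    ("t2", "u2", 105, 42),
    ("t3", "u3", 110, 7),
    ("t4", "u4", 120, 1),
]

def common_followees(a, b):
    """Social network query: followees shared by users a and b."""
    return follows.get(a, set()) & follows.get(b, set())

def hottest_tweets(k):
    """Hotspot query: the k tweets with the highest retweet count."""
    return nlargest(k, tweets, key=lambda t: t[3])

def recent_followee_tweets(user, k=10):
    """Timeline query: the k most recent tweets posted by user's followees."""
    followees = follows.get(user, set())
    candidates = [t for t in tweets if t[1] in followees]
    return nlargest(k, candidates, key=lambda t: t[2])
```

On real data the same three shapes become a set intersection over adjacency lists, a top-k over a global counter, and a top-k over a join between the follow relation and the tweet relation.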

Page 11: On Benchmarking Online Social Media Analytical Queries


Query example (Q12)

Rank the tweets appearing in A's followees' timelines according to the number of retweets.
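A rough sketch of how Q12 could be evaluated over in-memory data follows. It assumes that a followee's timeline consists of the tweets that followee posted, and it adds a time window and a top-k cut-off as illustrative parameters; the field names (`id`, `author`, `ts`, `retweets`) are hypothetical and not the benchmark's schema.

```python
def q12(follows, tweets, user, start, end, k):
    """Rank tweets posted by `user`'s followees within [start, end]
    by retweet count (descending), returning at most k tweets."""
    followees = follows.get(user, set())
    in_scope = [
        t for t in tweets
        if t["author"] in followees and start <= t["ts"] <= end
    ]
    # Ties are broken by tweet id so that the ranking is deterministic.
    return sorted(in_scope, key=lambda t: (-t["retweets"], t["id"]))[:k]

# Hypothetical sample data.
follows = {"A": {"u1", "u2"}}
tweets = [
    {"id": "t1", "author": "u1", "ts": 10, "retweets": 3},
    {"id": "t2", "author": "u2", "ts": 20, "retweets": 9},
    {"id": "t3", "author": "u3", "ts": 30, "retweets": 50},  # not a followee of A
    {"id": "t4", "author": "u1", "ts": 99, "retweets": 7},   # outside the window
]
ranked = q12(follows, tweets, "A", start=0, end=50, k=10)
```

The explicit tie-break matters for benchmarking: without it, two systems could return equally valid but differently ordered results.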

Page 12: On Benchmarking Online Social Media Analytical Queries


BSMA performance testing tool based on YCSB

• YCSB: Yahoo! Cloud Serving Benchmark
  – http://wiki.github.com/brianfrankcooper/YCSB/
• BSMA modifications
  – Query argument and parameter generation
    • User IDs, top-k, timespan, etc.
  – Query wrappers
  – https://github.com/xiafan68/BSMA
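The idea of query argument generation can be sketched as follows. This is not the BSMA tool's code (YCSB itself is written in Java); the function name, the parameter ranges, and the uniform sampling are all assumptions chosen to keep the illustration short.

```python
import random

def make_query_args(user_ids, rng, max_k=50, t_min=0, t_max=1000, max_span=100):
    """Draw one set of query arguments: a user ID, a top-k value,
    and a time span [start, end] inside [t_min, t_max]."""
    start = rng.randint(t_min, t_max - 1)
    return {
        "user_id": rng.choice(user_ids),
        "k": rng.randint(1, max_k),
        "start": start,
        "end": min(t_max, start + rng.randint(1, max_span)),
    }

# A seeded generator makes a benchmark run repeatable.
rng = random.Random(42)
user_ids = ["u1", "u2", "u3"]
args = make_query_args(user_ids, rng)
```

Uniform sampling of user IDs is the simplest choice; a real workload might instead draw users from a skewed distribution so that popular accounts are queried more often.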

Page 13: On Benchmarking Online Social Media Analytical Queries


Measurements

• Throughput
  – The highest throughput of the system under different settings of number of threads
• Latency
  – The (average) latency of the system under the setting with the 2nd highest throughput
• Scalability
  – The slope of the throughput/latency plot
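The three measurements can be computed from a series of runs, one per thread setting, roughly as sketched below. The run tuples are made-up numbers, and the reading of "slope of the throughput/latency plot" as a least-squares slope with throughput on the x-axis and latency on the y-axis is an assumption, since the slide does not fix the axes.

```python
def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares line through the points (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

def summarize(runs):
    """runs: list of (threads, throughput, avg_latency), at least two entries.
    Returns (peak throughput, latency at the 2nd highest throughput, slope)."""
    by_throughput = sorted(runs, key=lambda r: r[1], reverse=True)
    peak_throughput = by_throughput[0][1]
    latency_at_second = by_throughput[1][2]
    slope = least_squares_slope([r[1] for r in runs], [r[2] for r in runs])
    return peak_throughput, latency_at_second, slope

# Hypothetical measurements: (threads, ops/sec, average latency in ms).
runs = [
    (1, 100.0, 10.0),
    (2, 190.0, 10.5),
    (4, 350.0, 11.4),
    (8, 500.0, 16.0),
    (16, 480.0, 33.0),
]
peak, latency, slope = summarize(runs)
```

Reporting latency at the second-highest throughput, as the slide specifies, avoids reading latency at the single setting where the system is most saturated.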

Page 14: On Benchmarking Online Social Media Analytical Queries


WISE 2012 Challenge Performance Track

• A preliminary version of BSMA was used in the WISE 2012 Challenge Performance Track
• 4 teams
  – A special-purpose (in-memory) system
  – An HBase-based system with a secondary index
  – An SQLite-based system with many optimizations
  – A special-purpose system with B+-tree optimizations for different kinds of queries

Page 15: On Benchmarking Online Social Media Analytical Queries


Results

Find the set of people who share the same followee with the specified user.
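The query measured on this slide is effectively a self-join of the follow relation on the followee column. A minimal in-memory sketch, with a hypothetical `follows` mapping and function name, is:

```python
from collections import defaultdict

def share_followee(follows, user):
    """Return every other user who follows at least one of `user`'s followees."""
    # Invert the follow relation: followee -> set of followers.
    followers_of = defaultdict(set)
    for follower, followees in follows.items():
        for followee in followees:
            followers_of[followee].add(follower)

    result = set()
    for followee in follows.get(user, ()):
        result |= followers_of[followee]
    result.discard(user)  # the user trivially shares followees with themselves
    return result

# Hypothetical toy network: user -> set of followees.
follows = {
    "A": {"x", "y"},
    "B": {"y"},
    "C": {"z"},
    "D": {"x", "z"},
}
```

Here `share_followee(follows, "A")` yields B (through y) and D (through x). If a followee is a very popular account, its follower set dominates the cost of the union.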

Page 16: On Benchmarking Online Social Media Analytical Queries


Difficulties

• Joins of very large tables
• Skewness of the data distribution
  – Power-law distribution
• Preserving the order of results

Page 17: On Benchmarking Online Social Media Analytical Queries


Future work

• Data generator
  – More than a social network generator
  – Simulate user activities
    • Followship network
    • Tweeting and retweeting actions
    • Timeline
    • Topics

Page 18: On Benchmarking Online Social Media Analytical Queries


Future work

• Queries related to content of tweets
  – Queries with keyword search
  – Real-life data set needed
• More queries
• Performance testing of more systems
  – RDBMS, graph database, etc.

Page 19: On Benchmarking Online Social Media Analytical Queries


More on BSMA

• Original WISE 2012 Challenge page
  – http://www.wise2012.cs.ucy.ac.cy/challenge.html
• WISE 2012 Challenge follow-up information
  – https://wnqian.wordpress.com/research/wise2012challenge/
• BSMA performance testing tool
  – https://github.com/xiafan68/BSMA
• Suggestions or comments are welcome!
  – Mailto: [email protected]

Page 20: On Benchmarking Online Social Media Analytical Queries

Thanks!

http://database.ecnu.edu.cn/