Cloud Elephants and Witches: A Big Data Tale from Mendeley
Kris Jack, PhD
Data Mining Team Lead

DataScience Meeting I - Cloud Elephants and Witches: A Big Data Tale from Mendeley

DESCRIPTION

Data science talk by Kris Jack, Data Mining Team Lead at Mendeley Ltd. Date: February 9th, 2012. Location: Graz, Austria.


Overview

➔ What's Mendeley?
➔ The curse that comes with success
➔ A framework for scaling up (Hadoop + MapReduce)
➔ Moving to the cloud (AWS)
➔ Conclusions

What's Mendeley?

What is Mendeley?

...a large data technology startup company
...and it's on a mission to change the way that research is done!

Last.fm / Mendeley

Last.fm works like this:

1) Install “Audioscrobbler”
2) Listen to music
3) Last.fm builds your music profile and recommends music you might also like... and it's the world's biggest open music database

Last.fm → Mendeley

music libraries → research libraries
artists → researchers
songs → papers
genres → disciplines

Mendeley provides tools to help users...

...organise their research
...collaborate with one another
...discover new research

The curse that comes with success

In the beginning, there was...

➔ MySQL:
    ➔ Normalised tables for storing and serving:
        ➔ User data
        ➔ Article data
➔ The system was happy
➔ With this, we launched the article catalogue
    ➔ Lots of number crunching
    ➔ Many joins for basic stats
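
To illustrate why basic stats needed many joins on normalised tables, here is a minimal sketch. It uses an in-memory SQLite database as a stand-in for MySQL, and the table and column names (`users`, `articles`, `library`) are hypothetical, not Mendeley's actual schema.

```python
import sqlite3

# In-memory SQLite stands in for MySQL; the schema below is hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users    (user_id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE articles (article_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE library  (user_id INTEGER REFERENCES users(user_id),
                           article_id INTEGER REFERENCES articles(article_id));
""")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(1, "usa"), (2, "austria"), (3, "china")])
conn.executemany("INSERT INTO articles VALUES (?, ?)",
                 [(10, "Paper A"), (11, "Paper B")])
conn.executemany("INSERT INTO library VALUES (?, ?)",
                 [(1, 10), (3, 10), (2, 11), (3, 11)])


def readers_by_country(connection):
    """Count readers per article and country; even this basic stat needs two joins."""
    query = """
        SELECT a.title, u.country, COUNT(*) AS readers
        FROM library AS l
        JOIN users    AS u ON u.user_id = l.user_id
        JOIN articles AS a ON a.article_id = l.article_id
        GROUP BY a.title, u.country
        ORDER BY a.title, u.country
    """
    return connection.execute(query).fetchall()


rows = readers_by_country(conn)
for title, country, readers in rows:
    print(title, country, readers)
```

With a handful of rows this is instant; the point is only that every such statistic touches several tables, which becomes costly as the tables grow.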

Here's where the curse of success comes

➔ More articles came
➔ More users came
➔ The system became unhappy
    ➔ Keeping data fresh was a burden
    ➔ Algorithms relied on global counts
    ➔ Iterating over tables was slow
    ➔ Needed to shard tables to grow catalogue
➔ In short, our system didn't scale

1.6 million+ users; the 20 largest userbases:

University of Cambridge
Stanford University
MIT
University of Michigan
Harvard University
University of Oxford
Sao Paulo University
Imperial College London
University of Edinburgh
Cornell University
University of California at Berkeley
RWTH Aachen
Columbia University
Georgia Tech
University of Wisconsin
UC San Diego
University of California at LA
University of Florida
University of North Carolina

Real-time data on 28m unique papers:

Thomson Reuters’ Web of Knowledge (dating from 1934)

Mendeley after 16 months:

50m

>150 million individual articles (>25TB)

We had serious needs

➔ Scale up to the millions (billions for some items)
➔ Keep data fresh
➔ Support newly planned services
    ➔ Search
    ➔ Recommendations
➔ Business context
    ➔ Agile development (rapid prototyping)
    ➔ Cost effective
    ➔ Going viral

A framework for scaling up (Hadoop and MapReduce)

What is Hadoop?

The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing

www.hadoop.apache.org

Hadoop

➔ Designed to operate on a cluster of computers
    ➔ 1...thousands
    ➔ Commodity hardware (low cost units)
➔ Each node offers local computation and storage
➔ Provides framework for working with petabytes of data
➔ When learning about Hadoop, you need to learn about:
    ➔ HDFS
    ➔ MapReduce

HDFS

➔ Hadoop Distributed File System
➔ Based on Google File System
➔ Replicates data storage (reliability, x3, across racks)
➔ Designed to handle very large files (stored in large blocks, e.g. 64MB)
➔ Provides high throughput
➔ File access through Java and Thrift APIs, command line and web app
➔ Name node is a single point of failure (availability issue)
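
As a rough illustration of what x3 replication means for storage, the sketch below estimates the block count and raw disk footprint of one file. It reads the slide's 64MB figure as the HDFS block size, which is an assumption about the slide's intent; the function name and the 1000MB example file are hypothetical.

```python
import math

BLOCK_SIZE_MB = 64   # assumed to be the block size behind the slide's "64MB"
REPLICATION = 3      # the "x3" replication factor from the slide


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return (blocks, block replicas, raw storage in MB) for a single file.

    The last block only occupies the bytes it actually holds, so raw storage
    is the file size times the replication factor, not blocks times block size.
    """
    blocks = math.ceil(file_size_mb / block_size_mb)
    replicas = blocks * replication
    raw_storage_mb = file_size_mb * replication
    return blocks, replicas, raw_storage_mb


blocks, replicas, raw_mb = hdfs_footprint(1000)
print(blocks, replicas, raw_mb)
```

The same arithmetic is why a cluster needs roughly three times the disk space of the data it holds.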

MapReduce

➔ MapReduce is a programming model
➔ Allows distributed processing of large data sets
➔ Based on Google's MapReduce
➔ Inspired by functional programming
➔ Take the program to the data, not the data to the program

MapReduce Example: Article Readers by Country

HDFS: large file (150M entries), flattened data, stored across nodes

doc_id1, reader_id1, usa, 2010, …
doc_id2, reader_id2, austria, 2012, …
doc_id1, reader_id3, china, 2010, …
...

Map (pivot countries by doc id)

doc_id1, {usa, china, usa, uk, china, china...}
doc_id2, {austria, austria, china, china, uk …}
...

Reduce (calc. document stats)

doc_id1, usa, 0.27
doc_id1, china, 0.09
doc_id1, uk, 0.09
doc_id2, austria, 0.99
...
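
The flow above can be sketched in plain Python on a single machine. This is only an illustration of the programming model, not Hadoop code: the function names are hypothetical, the grouping step stands in for the shuffle that Hadoop performs between map and reduce, and the fourth input row is made up so that one document has a repeated country.

```python
from collections import Counter, defaultdict


def map_phase(records):
    """Emit (doc_id, country) pairs from flattened reader rows."""
    for doc_id, _reader_id, country, _year in records:
        yield doc_id, country


def shuffle(pairs):
    """Group values by key, as the framework would do between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(doc_id, countries):
    """Turn one document's list of reader countries into per-country shares."""
    counts = Counter(countries)
    total = len(countries)
    return [(doc_id, country, count / total)
            for country, count in sorted(counts.items())]


records = [
    ("doc_id1", "reader_id1", "usa", 2010),
    ("doc_id2", "reader_id2", "austria", 2012),
    ("doc_id1", "reader_id3", "china", 2010),
    ("doc_id1", "reader_id4", "usa", 2011),  # hypothetical extra row
]

grouped = shuffle(map_phase(records))
stats = [row
         for doc_id in sorted(grouped)
         for row in reduce_phase(doc_id, grouped[doc_id])]
for row in stats:
    print(row)
```

Because each document's statistics depend only on that document's group, the reduce step can run independently on whichever node holds the group.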

Hadoop

➔ HDFS for storing data
➔ MapReduce for processing data
➔ Together, bring the program to the data

Hadoop's Users

We make a lot of use of HDFS and MapReduce

➔ Catalogue Stats
➔ Recommendations (Mahout)
➔ Log Analysis (business analytics)
➔ Top Articles
➔ … and more

➔ Quick, reliable and scalable

Beware that these benefits have costs

➔ Migrating to a new system (data consistency)
➔ Setup costs
    ➔ Learn black magic to configure
    ➔ Hardware for cluster
➔ Administrative costs
    ➔ Steep learning curve to administer Hadoop
    ➔ Still an immature technology
    ➔ You may need to debug the source code
➔ Tips
    ➔ Get involved in the community (e.g. meetups, forums)
    ➔ Use good commodity hardware
    ➔ Consider moving to the cloud...

Moving to the cloud (AWS)

What is AWS?

Amazon Web Services (AWS) delivers a set of services that together form a reliable, scalable, and inexpensive computing platform “in the cloud”

www.aws.amazon.com

Why move to AWS?

➔ The cost of running your own cluster can be high
    ➔ Monetary (e.g. hardware)
    ➔ Time (e.g. training, setup, administration)
➔ AWS takes on these problems, renting their services to you based on your usage

Article Recommendations

➔ Aim: help researchers to find interesting articles
    ➔ Combat information deluge
    ➔ Keep up-to-date with recent movements
➔ 1.6M users
➔ 50M articles
➔ Batch process for generating regular recommendations (using Mahout)

Article Recommendations in EMR

➔ Use Amazon's Elastic MapReduce (EMR)
    ➔ Upload input data (user libraries)
    ➔ Upload Mahout jar
    ➔ Spin up cluster
    ➔ Run the job
        ➔ You decide the number of nodes (cost vs time)
        ➔ You decide the spec of the nodes (cost vs quality)
    ➔ Retrieve the output
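
The cost vs time trade-off when choosing the number of nodes can be sketched with simple arithmetic. Everything here is an assumption for illustration: the job is treated as perfectly parallel, billing is assumed to round each node up to whole hours, and the 40 node-hours and 0.50 per node-hour figures are made up rather than taken from Mendeley or AWS pricing.

```python
import math


def estimate_run(total_node_hours, nodes, price_per_node_hour):
    """Return (wall-clock hours, cost) for a perfectly parallel batch job.

    Assumes each node is billed for whole hours, rounded up.
    """
    wall_hours = total_node_hours / nodes
    billed_hours = math.ceil(wall_hours)
    cost = billed_hours * nodes * price_per_node_hour
    return wall_hours, cost


# Hypothetical job: 40 node-hours of work at 0.50 per node-hour.
for nodes in (5, 16, 20):
    wall_hours, cost = estimate_run(40, nodes, 0.50)
    print(f"{nodes} nodes: {wall_hours:.2f} h, cost {cost:.2f}")
```

Under these assumptions, more nodes shorten the run at roughly the same cost, except where rounding to whole hours leaves paid capacity idle (16 nodes here).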

Catalogue Search

➔ 50 million articles
➔ 50GB index in Solr
➔ Variable load (over 24 hours)
    ➔ 1AM is quieter (100 q/s), 1PM is busier (150 q/s)

Catalogue Search in Context of Variable Load

➔ Amazon's Elastic Load Balancer
➔ Only pay for nodes when you need them
    ➔ Spin up when load is high
    ➔ Tear down when load is low
➔ Cost effective and scalable

[Diagram: queries pass through the AWS elastic load balancer to AWS instances. At 1AM, 100 queries/second; at 1PM, 150 queries/second, with a further AWS instance.]
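
The "spin up when load is high, tear down when load is low" idea comes down to sizing the pool to the current query rate. A minimal sketch, assuming a hypothetical capacity of 50 queries/second per instance (the slides do not state per-instance capacity) and a hypothetical helper name:

```python
import math


def instances_needed(queries_per_second, capacity_per_instance, minimum=1):
    """Smallest number of instances that covers the load, never below the minimum."""
    return max(minimum, math.ceil(queries_per_second / capacity_per_instance))


CAPACITY = 50  # hypothetical queries/second one search instance can serve

for label, load in (("1AM", 100), ("1PM", 150)):
    print(label, load, "q/s ->", instances_needed(load, CAPACITY), "instances")
```

Paying only for the instances returned at each hour, rather than for the daily peak around the clock, is where the cost saving comes from.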

Problems we've faced

➔ Lack of control can be an issue
    ➔ Trade-off between administration and control
➔ Orchestration issues
    ➔ We have many services to coordinate
    ➔ CloudFormation & Elastic Beanstalk
➔ Migrating live services is hard work

Conclusions

➔ Mendeley has created the world's largest scientific database
➔ Storing and processing this data is a large scale challenge
➔ Hadoop, through HDFS and MapReduce, provides a framework for large scale data processing
➔ Be aware of administration costs when doing this in house

➔ AWS can make scaling up efficient and cost effective
➔ Tap into the rich big data community out there
➔ We plan to make no more substantial hardware purchases and to use AWS instead
➔ Scaling up isn't a trivial problem; to save pain, plan for it from the outset

➔ Magic elephants that live in clouds can lift the curses of evil witches

www.mendeley.com