Large-Scale Analysis of Web Pages - on a Startup Budget?

Hannes Mühleisen, Web-Based Systems Group

AWS Summit 2012 | Berlin

AWS Summit Berlin 2012 Talk on Web Data Commons


Our Starting Point

• Websites now embed structured data in HTML

• Various Vocabularies possible

  • schema.org, Open Graph protocol, ...

• Various Encoding Formats possible

  • Microformats (μFormats), RDFa, Microdata

Question: How are Vocabularies and Formats used?
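To make "structured data embedded in HTML" concrete, here is a minimal sketch that pulls Microdata annotations out of a page fragment. The sample HTML, the class name `MicrodataScanner`, and the helper `scan_microdata` are assumptions made for illustration; this is not the extraction code used for the talk, and it ignores nested items, RDFa, and Microformats.

```python
from html.parser import HTMLParser

# Hypothetical page fragment: schema.org vocabulary, Microdata encoding.
SAMPLE_HTML = """
<div itemscope itemtype="http://schema.org/Product">
  <span itemprop="name">Espresso Machine</span>
  <span itemprop="brand">ExampleBrand</span>
</div>
"""


class MicrodataScanner(HTMLParser):
    """Collects itemtype URLs and itemprop text values (flat, no nesting)."""

    def __init__(self):
        super().__init__()
        self.item_types = []
        self.properties = {}
        self._current_prop = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemtype" in attrs:
            # The vocabulary term that says what kind of entity this is.
            self.item_types.append(attrs["itemtype"])
        if "itemprop" in attrs:
            # Remember the property name until its text content arrives.
            self._current_prop = attrs["itemprop"]

    def handle_data(self, data):
        if self._current_prop and data.strip():
            self.properties[self._current_prop] = data.strip()
            self._current_prop = None


def scan_microdata(html):
    """Return (item types, property dict) found in an HTML string."""
    scanner = MicrodataScanner()
    scanner.feed(html)
    scanner.close()
    return scanner.item_types, scanner.properties


item_types, properties = scan_microdata(SAMPLE_HTML)
print(item_types)
print(properties)
```

Counting which `itemtype` values and which encodings appear across many pages is the kind of question the rest of the talk addresses at scale.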

Web Indices

• To answer our question, we need access to raw Web data.

• However, maintaining Web indices is insanely expensive

  • Re-crawling, storage; currently ~50 B pages (Google)

• Google and Bing have indices, but do not let outsiders in

Common Crawl

• Non-Profit Organization

• Runs crawler and provides HTML dumps

• Available data:

  • Index 02-12: 1.7 B URLs (21 TB)

  • Index 09/12: 2.8 B URLs (29 TB)

• Available on AWS Public Data Sets

Why AWS?

• Now that we have a web crawl, how do we run our analysis?

• Unpacking and DOM-Parsing on 50 TB? (CPU-heavy!)

• Preliminary analysis: 1 GB / hour / CPU possible

  • 8-CPU Desktop: 8 months

  • 64-CPU Server: 1 month

  • 100 8-CPU EC2-Instances: ~ 3 days
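The estimates above follow from simple arithmetic, sketched here. The 50 TB and 1 GB per hour per CPU figures come from the slide; the conversion 1 TB = 1000 GB, the 30-day month, and perfect parallel scaling are assumptions made for this illustration, and `wall_clock_days` is a hypothetical helper name.

```python
# From the slide: 50 TB of input, 1 GB per hour per CPU.
# Assumed here (not stated in the talk): 1 TB = 1000 GB, 30-day months,
# and perfect scaling across CPUs.
DATASET_GB = 50 * 1000
GB_PER_CPU_HOUR = 1.0


def wall_clock_days(total_cpus):
    """Days needed if the work is spread evenly over total_cpus CPUs."""
    hours = DATASET_GB / (GB_PER_CPU_HOUR * total_cpus)
    return hours / 24


for label, cpus in [
    ("8-CPU desktop", 8),
    ("64-CPU server", 64),
    ("100 x 8-CPU EC2 instances", 800),
]:
    days = wall_clock_days(cpus)
    print(f"{label}: {days:.1f} days (~{days / 30:.1f} months)")
```

Under these assumptions the three setups need roughly 260, 33, and 2.6 days, in line with the rounded figures on the slide.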

Common Crawl Dataset Size

[Figure: Common Crawl dataset size compared with what can be processed in 1 h by 1 CPU, a 1000 € PC, a 5000 € server, and 17 € of EC2 instances]

AWS Setup

• Data Input: Read Index Splits from S3

• Job Coordination: SQS Message Queue

• Workers: 100 EC2 Spot Instances (c1.xlarge, ~0.17 € / h)

• Result Output: Write to S3

• Logging: SimpleDB (SDB)

[Figure: input files (42, 43, ...) in an S3 bucket labelled CC are queued in SQS; EC2 workers take them and write results (R42, R43, ...) to an S3 bucket labelled WDC]

• Each input file queued in SQS

• EC2 Workers take tasks from SQS

• Workers read and write S3 buckets
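The queue-and-worker pattern in these three bullets can be sketched without any AWS dependency. Everything below is an in-memory stand-in chosen for illustration: a `dict` plays the role of each S3 bucket and `queue.Queue` the role of SQS; `extract` and `run_worker` are hypothetical names, not the code run for the talk.

```python
import queue

# In-memory stand-ins (assumptions for illustration, not the AWS APIs):
# a dict for each S3 bucket, queue.Queue for the SQS message queue.
input_bucket = {"42": "<html>page A</html>", "43": "<html>page B</html>"}
output_bucket = {}
task_queue = queue.Queue()

# Each input file is queued as one task.
for key in input_bucket:
    task_queue.put(key)


def extract(raw):
    """Placeholder for the real extraction step; it only measures length."""
    return f"{len(raw)} characters processed"


def run_worker(tasks, source, sink):
    """Take tasks until the queue is empty; read from source, write to sink."""
    handled = 0
    while True:
        try:
            key = tasks.get_nowait()
        except queue.Empty:
            return handled
        sink["R" + key] = extract(source[key])
        tasks.task_done()
        handled += 1


print(run_worker(task_queue, input_bucket, output_bucket), sorted(output_bucket))
```

Because workers share nothing except the queue and the buckets, adding or losing a worker does not require any coordination beyond the queue itself. With a real SQS queue, a worker would typically delete a message only after its result is written, so a task held by a terminated spot instance becomes visible again for another worker.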

Results - Types of Data

[Figure: entity count (log scale) by type, for Microdata 02/2012, RDFa 02/2012, RDFa 2009/2010, and Microdata 2009/2010]

2012 Microdata Breakdown:

Website Structure 23 %

Products, Reviews 19 %

Movies, Music, ... 15 %

Geodata 8 %

People, Organizations 7 %

• Available data largely determined by major player support

• “If Google consumes it, we will publish it”

Results - Formats

• URLs with embedded Data: +6%

• Microdata +14% (schema.org?)

• RDFa +26% (Facebook?)

[Figure: percentage of URLs (axis 0 to 4) per format (RDFa, Microdata, geo, hcalendar, hcard, hreview, XFN), 2009/2010 vs. 02-2012]

Results - Extracted Data

• Extracted data available for download at www.webdatacommons.org

• Formats: RDF (~90 GB) and CSV Tables for Microformats (!)

• Have a look!

AWS Costs

• Approx. 5500 machine-hours were required

• 1100 € billed by AWS for that

• Cost for other services negligible *

* At first, we underestimated SDB cost
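As a quick sanity check on these figures, the sketch below derives the effective price per machine-hour and compares it with the ~0.17 € / h spot price quoted for c1.xlarge on the setup slide. The numbers come from the slides; the comparison itself is only illustrative, and the variable names are assumptions.

```python
# Figures from the slides; the comparison is only illustrative.
MACHINE_HOURS = 5500              # approximate machine-hours used
BILLED_EUR = 1100                 # amount billed by AWS
QUOTED_SPOT_EUR_PER_HOUR = 0.17   # approximate c1.xlarge spot price

effective_rate = BILLED_EUR / MACHINE_HOURS
expected_at_quoted_price = MACHINE_HOURS * QUOTED_SPOT_EUR_PER_HOUR

print(f"effective rate: {effective_rate:.2f} EUR per machine-hour")
print(f"cost at quoted spot price: {expected_at_quoted_price:.0f} EUR")
```

The effective rate works out to 0.20 € per machine-hour, slightly above the quoted spot price. The slides do not break the difference down; spot prices fluctuate, so it need not indicate anything unusual.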

Takeaways

• Web Data Commons now publishes the largest set of structured data from Web pages available

• Large-Scale Web Analysis now possible with Common Crawl datasets

• AWS great for massive ad-hoc computing power and complexity reduction

• Choose your architecture wisely and test by experiment; for us, EMR was too expensive.

Thank You!

Web Resources: http://webdatacommons.org, http://hannes.muehleisen.org

Questions? Want to hire me?