Project 1: Who is Popular, and Who is Not

Angel Trifonov, Anh Pham, Xiao Qin



Tasks

Task b, c both in Pig and Java

Task h in Java


Task b in Java

Write a job(s) that reports for each country, how many of its citizens have a Facebook page.

Single map-reduce job

Input: MyPage dataset

Mapper: examines each file line by line

• Each line is converted to a string
• The string is split using the “,” delimiter
• The nationality is extracted and mapped to an IntWritable

Reducer: takes all pairs and sums the values for each key

Output: number of users per nationality

Single reducer
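The mapper and reducer steps above can be sketched in plain Java. This is only a simulation of the data flow under stated assumptions, not the project's Hadoop code: it uses no Hadoop classes, the class name `NationalityCount` is made up, and the position of the nationality field (index 2) and the sample lines are hypothetical because the slides do not give the MyPage layout.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of the single map-reduce job (no Hadoop classes).
class NationalityCount {
    // ASSUMPTION: nationality is the 3rd comma-separated field (index 2).
    static final int NATIONALITY_FIELD = 2;

    // Map step: split one line on "," and emit (nationality, 1).
    static Map.Entry<String, Integer> map(String line) {
        String[] fields = line.split(",");
        return Map.entry(fields[NATIONALITY_FIELD].trim(), 1);
    }

    // Reduce step: sum the values for each key.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return totals;
    }

    // Runs the map step over every line, then a single reduce step.
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.add(map(line));
        }
        return reduce(pairs);
    }

    public static void main(String[] args) {
        // Hypothetical MyPage lines; the real field layout may differ.
        List<String> sample = List.of(
            "1,Alice,US,5,reading",
            "2,Boris,BG,3,chess",
            "3,Carol,US,7,music");
        System.out.println(run(sample));
    }
}
```

In a real Hadoop job the emitted value would be an IntWritable and the summing would happen in the single reducer; here a TreeMap stands in for the grouped, sorted reducer input.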


Task b in Pig

Group the MyPage dataset by country code:

    countrygrp = group mypage by cc;

Report the number of people who have a Facebook page for each country:

    taskb = foreach countrygrp generate group, COUNT(mypage.id);
    dump taskb;

Running Time Comparison:

Plain MapReduce: 1 min 36 sec (Job time)
Pig: 24 sec (Job time)


Task c in Java

Find the top 10 interesting Facebook pages, namely, those that got the most accesses based on your AccessLog dataset compared to all other pages.

Hadoop settings: multiple mappers and one reducer (setNumReduceTasks(1)).

Input: AccessLog

1st round:

• Mapper(s): parse the input data and get the WhatPage field. Set WhatPage as the key and the constant 1 as the value.
• Reducer: for each key, sum the values. Set WhatPage as the key and the total count as the value.

2nd round: swap the key and value (InverseMapper.class)

Output: [Count], [WhatPage] (in descending order)
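The two rounds above can be sketched in plain Java as a count followed by a swap and a descending sort. This is an illustration under assumptions, not the project's Hadoop job: the class name `TopPages`, the position of WhatPage (index 2), and the sample log lines are hypothetical, and the tie-break on page ID is added here only to make the order deterministic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java simulation of the two rounds (no Hadoop classes).
class TopPages {
    // ASSUMPTION: WhatPage is the 3rd comma-separated field (index 2).
    static final int WHAT_PAGE_FIELD = 2;

    // Round 1: emit (WhatPage, 1) for each line and sum the values per key.
    static Map<String, Integer> countAccesses(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            String whatPage = line.split(",")[WHAT_PAGE_FIELD].trim();
            counts.merge(whatPage, 1, Integer::sum);
        }
        return counts;
    }

    // Round 2: swap to (count, WhatPage), sort by count descending, keep n.
    static List<Map.Entry<Integer, String>> topN(Map<String, Integer> counts, int n) {
        List<Map.Entry<Integer, String>> swapped = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            swapped.add(Map.entry(e.getValue(), e.getKey()));
        }
        swapped.sort((a, b) -> {
            int byCount = Integer.compare(b.getKey(), a.getKey());
            // Tie-break on page ID (a choice made for this sketch only).
            return byCount != 0 ? byCount : a.getValue().compareTo(b.getValue());
        });
        return new ArrayList<>(swapped.subList(0, Math.min(n, swapped.size())));
    }

    public static void main(String[] args) {
        // Hypothetical AccessLog lines; the real field layout may differ.
        List<String> log = List.of(
            "1,10,7,view,100",
            "2,11,7,view,101",
            "3,12,8,view,102",
            "4,10,7,view,103",
            "5,13,9,view,104",
            "6,13,9,view,105");
        for (Map.Entry<Integer, String> e : topN(countAccesses(log), 10)) {
            System.out.println(e.getKey() + " , " + e.getValue());
        }
    }
}
```

With the counts as keys, a single reducer sees all pairs in one sorted stream, which is why the swap makes a global top 10 straightforward.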


Task c in Pig

Group the AccessLog dataset by the accessed Facebook ID:

    access_fid_grp = group alog by fid;

Get the access count for each accessed Facebook ID:

    grpcnt = foreach access_fid_grp generate group, COUNT(alog.aid) as alogcnt;

Order by count, descending:

    grporder = order grpcnt by alogcnt desc;

List the top 10:

    taskc = limit grporder 10;
    dump taskc;

Running Time Comparison:

Plain MapReduce: 2 min 1 sec (Job time)
Pig: 1 min 52 sec (Job time)


Task h : Define Potential Stalkers

A person who visits another person’s Facebook page too often, although the two are not friends.


Mapper

- Output key: 2nd field (Person ID): IntWritable

Input record: 1st Field, PersonID, 3rd Field …

- Output value: “<dataset tag>, <ID>”: Text

Friends: key = personID, value = f, friendID

AccessLog: key = personID, value = a, visitedID
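A minimal sketch of this tagging step in plain Java follows. It is not the project's mapper: the class name `StalkerMapper` and the sample lines are hypothetical, and it assumes the friend ID or visited ID is the 3rd field of its dataset, which the slide does not state.

```java
import java.util.Map;

// Plain-Java sketch of the tagging mapper (no Hadoop classes).
class StalkerMapper {
    static final String FRIEND_TAG = "f"; // record comes from Friends
    static final String ACCESS_TAG = "a"; // record comes from AccessLog

    // Emits (personID, "<dataset tag>,<ID>") for one input line.
    // The key is the 2nd field, as on the slide.
    // ASSUMPTION: the friend/visited ID is the 3rd field (index 2).
    static Map.Entry<Integer, String> map(String line, String tag) {
        String[] fields = line.split(",");
        int personId = Integer.parseInt(fields[1].trim());
        String otherId = fields[2].trim();
        return Map.entry(personId, tag + "," + otherId);
    }

    public static void main(String[] args) {
        // Hypothetical records; the real field layouts may differ.
        System.out.println(map("1,42,77,2010-01-01,college", FRIEND_TAG));
        System.out.println(map("9,42,88,view,12345", ACCESS_TAG));
    }
}
```

Keying both datasets by the same person ID sends a person's friendships and page visits to the same reducer call, and the tag lets the reducer tell the two record types apart.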


Reducer

Key: <personID>

Value list: <(f,friendID) (a,visitedID) (f,friendID) (a,visitedID) …>

- Sort the list based on the second field of each element.
- All visitedID and friendID entries that have the same value will be placed next to each other.
- If all entries for an ID are visitedID (there is no friendID entry for it), and the ID appears too many times (based on a predefined threshold) => potential stalker.

Output: personID visitedID
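The reducer logic above can be sketched in plain Java as a sort followed by a scan over runs of equal IDs. This is an illustration, not the project's reducer: the class name `StalkerReducer` and the sample values are hypothetical, and "too many times" is assumed here to mean strictly more visits than the threshold, which the slides do not specify.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Plain-Java sketch of the reducer for one personID (no Hadoop classes).
class StalkerReducer {
    // values: tagged entries such as "f,3" (friend) or "a,5" (visit).
    // Returns the visited IDs for which this person is a potential stalker.
    // ASSUMPTION: flagged when visits > threshold (strictly greater).
    static List<String> reduce(List<String> values, int threshold) {
        List<String[]> elems = new ArrayList<>();
        for (String v : values) {
            String[] parts = v.split(",");
            elems.add(new String[] {parts[0].trim(), parts[1].trim()});
        }
        // Sort on the second field so equal IDs sit next to each other.
        elems.sort(Comparator.comparing((String[] e) -> e[1]));

        List<String> flagged = new ArrayList<>();
        int i = 0;
        while (i < elems.size()) {
            String id = elems.get(i)[1];
            int visits = 0;
            boolean isFriend = false;
            int j = i;
            // Scan the run of entries that share this ID.
            while (j < elems.size() && elems.get(j)[1].equals(id)) {
                if (elems.get(j)[0].equals("f")) {
                    isFriend = true;
                } else {
                    visits++;
                }
                j++;
            }
            if (!isFriend && visits > threshold) {
                flagged.add(id);
            }
            i = j;
        }
        return flagged;
    }

    public static void main(String[] args) {
        // Hypothetical value list for a single personID.
        List<String> values = List.of(
            "a,5", "f,3", "a,5", "a,3", "a,5", "a,9", "a,3", "a,3");
        System.out.println(reduce(values, 2));
    }
}
```

The IDs are compared as strings because the scan only needs equal IDs to be adjacent; numeric order does not matter.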


Sample Result


Thank you! Questions?