Angel Trifonov, Anh Pham, Xiao Qin
Project 1: Who is Popular, and Who is Not
Tasks
Tasks b and c, both in Pig and Java
Task h in Java
Task b in Java
Write a job(s) that reports for each country, how many of its citizens have a Facebook page.
Single map-reduce job
Input: MyPage dataset
Mapper: examine each file line by line
• Each line is converted to a string
• The string is split using the "," delimiter
• Extract the nationality and map it to an IntWritable
Reducer: take all pairs and sum the values for each key
Output: number of users per nationality
Single reducer
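The mapper and reducer steps above can be sketched in plain Java, without the Hadoop classes, to show the logic only. This is a minimal sketch, not the authors' job: the class name, the sample records, and the column index NATIONALITY_FIELD are assumptions, since the slides do not give the MyPage column layout.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class CountryCountSketch {
    // Assumption: hypothetical column index of the nationality field in a MyPage record.
    static final int NATIONALITY_FIELD = 2;

    // "Map" step: split one line on "," and return its nationality,
    // or null when the line has too few fields.
    static String mapLine(String line) {
        String[] fields = line.split(",");
        return fields.length > NATIONALITY_FIELD ? fields[NATIONALITY_FIELD].trim() : null;
    }

    // "Reduce" step: add 1 to the running total of each nationality seen.
    static Map<String, Integer> countByNationality(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            String nationality = mapLine(line);
            if (nationality != null) {
                counts.merge(nationality, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample records, for illustration only.
        List<String> sample = List.of(
            "1,Alice,US,5,reading",
            "2,Bob,VN,7,chess",
            "3,Carol,US,9,hiking");
        System.out.println(countByNationality(sample));
    }
}
```

In a real Hadoop job the map and reduce steps run in separate Mapper and Reducer classes; here they are folded into one loop to keep the example short.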
Task b in Pig
Group the MyPage dataset by country code:
    countrygrp = group mypage by cc;
Report the number of people that have a Facebook page for each country:
    taskb = foreach countrygrp generate group, COUNT(mypage.id);
    dump taskb;
Running Time Comparison:
Plain MapReduce: 1 min 36 sec (job time)
Pig: 24 sec (job time)
Task c in Java
Find the top 10 interesting Facebook pages, namely, those that got the most accesses based on your AccessLog dataset compared to all other pages.
Hadoop settings: multiple mappers and one reducer (setNumReduceTasks(1)).
Input: AccessLog
1st round:
Mapper(s): parse the input data and get the WhatPage. Set WhatPage as the key and the constant 1 as the value.
Reducer: for each key, sum up the values. Set WhatPage as the key and the total count as the value.
2nd round: swap the key and value (InverseMapper.class)
Output: [Count], [WhatPage] (in descending order)
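The two rounds above can be illustrated in plain Java, again leaving out the Hadoop classes. This is a sketch under stated assumptions: the class name, the column index WHAT_PAGE_FIELD, and the tie-break by page ID are hypothetical and do not come from the slides.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TopPagesSketch {
    // Assumption: hypothetical column index of WhatPage in an AccessLog record.
    static final int WHAT_PAGE_FIELD = 2;

    // Round 1: treat each record as (WhatPage, 1) and sum the ones per page.
    static Map<String, Integer> countAccesses(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            String[] fields = line.split(",");
            if (fields.length > WHAT_PAGE_FIELD) {
                counts.merge(fields[WHAT_PAGE_FIELD].trim(), 1, Integer::sum);
            }
        }
        return counts;
    }

    // Round 2: order by count, descending, and keep the first n as "count,page".
    static List<String> topN(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            // Assumption: ties are broken by page ID so the result is deterministic.
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            top.add(entries.get(i).getValue() + "," + entries.get(i).getKey());
        }
        return top;
    }

    public static void main(String[] args) {
        // Made-up sample records, for illustration only.
        List<String> log = List.of("1,10,7,view", "2,11,7,view", "3,12,9,view");
        System.out.println(topN(countAccesses(log), 10));
    }
}
```

With a single reducer, as in the slides, the whole sorted list arrives in one place, which is what makes taking the first 10 entries straightforward.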
Task c in Pig
Group the AccessLog dataset by accessed Facebook ID:
    access_fid_grp = group alog by fid;
Get the access count for each accessed Facebook ID:
    grpcnt = foreach access_fid_grp generate group, COUNT(alog.aid) as alogcnt;
Order by count, descending:
    grporder = order grpcnt by alogcnt desc;
List the top 10:
    taskc = limit grporder 10;
    dump taskc;
Running Time Comparison:
Plain MapReduce: 2 min 1 sec (job time)
Pig: 1 min 52 sec (job time)
Task h : Define Potential Stalkers
A potential stalker is a person who visits another person's Facebook page too often without being that person's friend.
Mapper
- Input record: 1st field, PersonID, 3rd field, …
- Output key: 2nd field (PersonID): IntWritable
- Output value: "<dataset tag>,<ID>": Text
  Friends: key personID, value "f,friendID"
  AccessLog: key personID, value "a,visitedID"
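The tagging done by the mapper can be sketched in plain Java as follows. This is an illustration only: the class and method names are hypothetical, the ID placed in the value is assumed to be the 3rd field of the record, and the dataset tag is passed in as a parameter because the slides do not say how the job tells the two inputs apart.

```java
import java.util.Map;

final class StalkerMapSketch {
    // Turn one record into (key, value):
    //   key   = 2nd field, the person ID
    //   value = "<dataset tag>,<3rd field>", e.g. "f,40" or "a,40"
    // Assumption: the friend ID or visited ID is the 3rd field of the record.
    static Map.Entry<Integer, String> tag(String line, String datasetTag) {
        String[] fields = line.split(",");
        int personId = Integer.parseInt(fields[1].trim());
        return Map.entry(personId, datasetTag + "," + fields[2].trim());
    }

    public static void main(String[] args) {
        // Made-up sample records, for illustration only.
        System.out.println(tag("5,12,40,2010-01-01,college", "f"));
        System.out.println(tag("9,12,77,viewed,1234", "a"));
    }
}
```

Because both datasets emit the same key type, all friend and access records of one person meet in the same reduce call.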
Reducer
Key: <personID>
Value list: <(f,friendID) (a,visitedID) (f,friendID) (a,visitedID) …>
- Sort the list based on the second field of each element.
- All visitedID and friendID entries that have the same value will be placed next to each other.
- If all entries for an ID are visitedID entries, and the ID appears too many times (based on a predefined threshold) => potential stalker.
Output: personID visitedID
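The reducer logic described above (sort by ID, then scan each run of equal IDs) can be sketched in plain Java. This is not the authors' code: the class and method names are hypothetical, and "too many times" is assumed to mean strictly more visits than the threshold.

```java
import java.util.ArrayList;
import java.util.List;

final class StalkerReduceSketch {
    // values holds the tagged entries of one personID: "f,<friendID>" or "a,<visitedID>".
    // Returns the visited IDs that this person may be stalking.
    static List<Integer> stalkedIds(List<String> values, int threshold) {
        List<String[]> pairs = new ArrayList<>();
        for (String v : values) {
            String[] parts = v.split(",");
            pairs.add(new String[] {parts[0].trim(), parts[1].trim()});
        }
        // Sort on the second field so that equal IDs end up next to each other.
        pairs.sort((x, y) -> Integer.compare(Integer.parseInt(x[1]), Integer.parseInt(y[1])));

        List<Integer> result = new ArrayList<>();
        int i = 0;
        while (i < pairs.size()) {
            String id = pairs.get(i)[1];
            boolean isFriend = false;
            int visits = 0;
            int j = i;
            // Scan the run of entries that share this ID.
            while (j < pairs.size() && pairs.get(j)[1].equals(id)) {
                if (pairs.get(j)[0].equals("f")) {
                    isFriend = true;
                } else {
                    visits++;
                }
                j++;
            }
            // Assumption: "too many" means strictly more visits than the threshold.
            if (!isFriend && visits > threshold) {
                result.add(Integer.parseInt(id));
            }
            i = j;
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up values for one person, for illustration only.
        List<String> values = List.of("a,7", "f,3", "a,7", "a,3", "a,7", "a,9");
        System.out.println(stalkedIds(values, 2));
    }
}
```

A hash map keyed by ID would give the same answer without sorting; the sort is kept here because it mirrors the steps listed on the slide.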
Sample Result
Thank you!
Questions?