Upload
lippo-group-digital
View
147
Download
0
Embed Size (px)
Citation preview
Profiler for Smartphone Users Interests Using Modified Hierarchical
Agglomerative Clustering Algorithm Based on Browsing History
Priagung Khusumanegara, Rischan Mafrur, and Deokjai Choi
School of Electronics & Computer Engineering Chonnam National University
{priagung.123, rischanlab}@gmail.com, [email protected]
The 23rd IFIP World Computer Congress (WCC 2015)Tuesday, October 6
Outline• Introduction• Proposed System• Data Collection• Data Extraction• URL Filtering Categories• Modified Hierarchical Agglomerative
Clustering• Experimental Results
IntroductionMotivations:• Smartphone provides many applications to support our activity
which one of the applications is web browser applications.• People spend much time on browsing activity for finding useful
information that they are interested on it. • It is not easy to find the particular pieces of information that they
interested on it.
Proposed System:
• We proposed a Modified Hierarchical Agglomerative Clustering Algorithms on a server-based application to provides smartphone users profiling for interests-focused based on browsing history.
Proposed System
Figure: Illustration of Proposed System
Data Collection• Data Collection Application
– It is built using the Funf framework. Funf framework is an extensible sensing and data processing framework for mobile devices.
• Number of participants– 30 Chonnam National University Students (aged 19-22)
• Period of data collection– 1 month (July 2014)
Figure: Data Collection Application
Data Extraction• Example of Collected Data
• We extract collected data to observe a resource name part of URL structure which is useful information to analyze user interest.
• Example of data extraction
URL Filtering Categories• URL Labeling
The URL data is labeled based on a popular Web directory.Open Directory Project (ODP) (dmoz.org).
• URL Filtering Categories
• The matrix of filtering categories
Category Group Category Type
Business Business/Economy, Job Search/Careers, Real Estate, and Shopping
Communications and search Blog/Web Communication, Social Networks, Email, and Search Engines
General Computer/Internet, Education, News/Media, and Reference
Lifestyle Entertainment, Games, Arts, Humor, Religion, Restaurants/Food, and Travel
𝐶4𝑥𝑛 = ۏ����������ێ?????ێ?????ێ?????𝑘𝑒𝑦1,1ۍ����������������� 𝑘𝑒𝑦1,2 … 𝑘𝑒𝑦1,𝑛𝑘𝑒𝑦2,1 𝑘𝑒𝑦2,2 … 𝑘𝑒𝑦2,𝑛𝑘𝑒𝑦3,1𝑘𝑒𝑦4,1
𝑘𝑒𝑦3,2𝑘𝑒𝑦4,2…… 𝑘𝑒𝑦3,𝑛𝑘𝑒𝑦4,𝑛 ۑ��������������ے������������������
ۑ��������������ۑ��������������ې?????
Modified Hierarchical Agglomerative Clustering
Input: 𝐴 𝑠𝑒𝑡 𝑋 𝑜𝑓 𝑒𝑥𝑡𝑟𝑎𝑐𝑡𝑒𝑑 𝑈𝑅𝐿 𝑑𝑎𝑡𝑎 {𝑈𝑅𝐿1,…,𝑈𝑅𝐿𝑛} 𝑈𝑅𝐿 𝐹𝑖𝑙𝑡𝑒𝑟𝑖𝑛𝑔 𝐶𝑎𝑡𝑒𝑔𝑜𝑟𝑖𝑒𝑠 𝐶4𝑥𝑛 =
ۏ����������
𝑘𝑒𝑦1,1 𝑘𝑒𝑦1,2 ⋯ 𝑘𝑒𝑦1,𝑛𝑘𝑒𝑦2,1 𝑘𝑒𝑦2,2 … 𝑘𝑒𝑦2,𝑛𝑘𝑒𝑦3,1 𝑘𝑒𝑦3,2 … 𝑘𝑒𝑦3,𝑛𝑘𝑒𝑦4,1 𝑘𝑒𝑦4,2 … 𝑘𝑒𝑦4,1 ے������������������
𝐴 𝐿𝑎𝑣𝑒𝑛𝑠ℎ𝑡𝑒𝑖𝑛 𝑑𝑖𝑠𝑡𝑎𝑛𝑐𝑒 𝑓𝑢𝑛𝑐𝑡𝑖𝑜𝑛 𝑑𝑖𝑠𝑡(𝑐1,𝑐2) 𝒐𝒖𝒕𝒑𝒖𝒕: 𝐷𝑒𝑔𝑟𝑒𝑒 𝑜𝑓 𝐼𝑛𝑡𝑒𝑟𝑒𝑠𝑡, 𝑢𝑠𝑒𝑟 𝑖𝑛𝑡𝑒𝑟𝑒𝑠𝑡𝑠 1: 𝐶𝑔𝑟𝑜𝑢𝑝1= [] 2: 𝐶𝑔𝑟𝑜𝑢𝑝2= [] 3: 𝐶𝑔𝑟𝑜𝑢𝑝3= [] 4: 𝐶𝑔𝑟𝑜𝑢𝑝4= [] 5: 𝒇𝒐𝒓 𝑖 = 1 𝑡𝑜 𝑛 6: 𝑐𝑖 = {𝑈𝑅𝐿𝑖} 7: 𝑒𝑛𝑑 𝑓𝑜𝑟 8: 𝐶 = {𝑐1,…,𝑘} 9: 𝑙 = 𝑛+ 1 10: 𝑾𝒉𝒊𝒍𝒆 𝐶.𝑠𝑖𝑧𝑒 > 1 𝑑𝑜 11: (𝑐𝑚𝑖𝑛1,𝑐𝑚𝑖𝑛2) = 𝑚𝑖𝑛.𝑑𝑖𝑠𝑡𝑎𝑛𝑐𝑒 (𝑐𝑖,𝑐𝑗) 𝑓𝑜𝑟 𝑎𝑙𝑙 𝑐𝑖,𝑐𝑗 𝑖𝑛 𝐶 12: 𝑅𝑒𝑚𝑜𝑣𝑒 𝑐𝑚𝑖𝑛1 𝑎𝑛𝑑 𝑐𝑚𝑖𝑛2 𝑓𝑟𝑜𝑚 𝐶 13: 𝐴𝑑𝑑 {𝑐𝑚𝑖𝑛1,𝑐𝑚𝑖𝑛2} 𝑡𝑜 𝐶 14: 𝐼𝑓 {𝑐𝑚𝑖𝑛1,𝑐𝑚𝑖𝑛2} 𝑎𝑛𝑦 [𝑘𝑒𝑦1,𝑛],𝑤ℎ𝑒𝑟𝑒 𝑛 = 1,2,..,𝑛: 15: 𝐴𝑑𝑑 {𝑐𝑚𝑖𝑛1,𝑐𝑚𝑖𝑛2} 𝑡𝑜 𝐶𝑔𝑟𝑜𝑢𝑝1 16: 𝑒𝑙𝑖𝑓 {𝑐𝑚𝑖𝑛1,𝑐𝑚𝑖𝑛2} 𝑎𝑛𝑦 [𝑘𝑒𝑦2,𝑛],𝑤ℎ𝑒𝑟𝑒 𝑛 = 1,2,..,𝑛: 17: 𝐴𝑑𝑑 ሼ𝑐𝑚𝑖𝑛1,𝑐𝑚𝑖𝑛2ሽ 𝑡𝑜 𝐶𝑔𝑟𝑜𝑢𝑝2 18: 𝑒𝑙𝑖𝑓 {𝑐𝑚𝑖𝑛1,𝑐𝑚𝑖𝑛2} 𝑎𝑛𝑦 [𝑘𝑒𝑦3,𝑛],𝑤ℎ𝑒𝑟𝑒 𝑛 = 1,2,..,𝑛: 19: 𝐴𝑑𝑑 ሼ𝑐𝑚𝑖𝑛1,𝑐𝑚𝑖𝑛2ሽ 𝑡𝑜 𝐶𝑔𝑟𝑜𝑢𝑝3 20: 𝑒𝑙𝑖𝑓 {𝑐𝑚𝑖𝑛1,𝑐𝑚𝑖𝑛2} 𝑎𝑛𝑦 [𝑘𝑒𝑦4,𝑛],𝑤ℎ𝑒𝑟𝑒 𝑛 = 1,2,..,𝑛: 21: 𝐴𝑑𝑑 ሼ𝑐𝑚𝑖𝑛1,𝑐𝑚𝑖𝑛2ሽ 𝑡𝑜 𝐶𝑔𝑟𝑜𝑢𝑝4 22: 𝑙 = 𝑙 + 1 23: 𝒆𝒏𝒅 𝒘𝒉𝒊𝒍𝒆 24: 𝐷𝑒𝑔𝑟𝑒𝑒 𝑜𝑓 𝐼𝑛𝑡𝑒𝑟𝑒𝑠𝑡 𝐶𝑔𝑟𝑜𝑢𝑝1 =
𝑙𝑒𝑛𝑔𝑡ℎ.𝐶𝑔𝑟𝑜𝑢𝑝 1σ 𝑙𝑒𝑛𝑔𝑡 ℎ.𝐶𝑔𝑟𝑜𝑢𝑝 𝑖4𝑖=1
25: 𝐷𝑒𝑔𝑟𝑒𝑒 𝑜𝑓 𝐼𝑛𝑡𝑒𝑟𝑒𝑠𝑡 𝐶𝑔𝑟𝑜𝑢𝑝2 = 𝑙𝑒𝑛𝑔𝑡ℎ.𝐶𝑔𝑟𝑜𝑢𝑝 2σ 𝑙𝑒𝑛𝑔𝑡 ℎ.𝐶𝑔𝑟𝑜𝑢𝑝 𝑖4𝑖=1
26: 𝐷𝑒𝑔𝑟𝑒𝑒 𝑜𝑓 𝐼𝑛𝑡𝑒𝑟𝑒𝑠𝑡 𝐶𝑔𝑟𝑜𝑢𝑝3 = 𝑙𝑒𝑛𝑔𝑡ℎ.𝐶𝑔𝑟𝑜𝑢𝑝 3σ 𝑙𝑒𝑛𝑔𝑡 ℎ.𝐶𝑔𝑟𝑜𝑢𝑝 𝑖4𝑖=1
27: 𝐷𝑒𝑔𝑟𝑒𝑒 𝑜𝑓 𝐼𝑛𝑡𝑒𝑟𝑒𝑠𝑡 𝐶𝑔𝑟𝑜𝑢𝑝4 = 𝑙𝑒𝑛𝑔𝑡ℎ.𝐶𝑔𝑟𝑜𝑢𝑝 4σ 𝑙𝑒𝑛𝑔𝑡 ℎ.𝐶𝑔𝑟𝑜𝑢𝑝 𝑖4𝑖=1
28: set 𝐶𝑔𝑟𝑜𝑢𝑝 𝑖 which has max ሺ𝐷𝑒𝑔𝑟𝑒𝑒 𝑜𝑓 𝐼𝑛𝑡𝑒𝑟𝑒𝑠𝑡 ሻ𝑎𝑠 𝑢𝑠𝑒𝑟 𝑖𝑛𝑡𝑒𝑟𝑒𝑠𝑡𝑠
Experimental Results
Figure: Degree of Users Interests
Figure: Users Interests Profile
Experimental Results (Cont’d)• C 4.5 is the best known classification
algorithm used to generate decision trees for continuous and discrete attributes.
• Attributes:– Total session time– Total time a user stays at the site– Total number of accessed pages during the
whole session
Experimental Results (Cont’d)
Figure : Execution time comparison with C4.5 algorithm
Experimental Results (Cont’d)
Memory Size (MB) MHAC (%) C4.5 (%)
25 83.3 76.7
50 86.7 80.0
75 93.3 80.0
100 96.7 83.3
125 96.7 86.7
Table. Accuracy comparison with C4.5 algorithm
Conclusion and Futures Work• We proposed a modified Hierarchical Agglomerative Clustering
that can automatically provides a user interests profile.
• The proposed algorithms can measure degree of users’ interests and inferring particular pieces of information that they interested on it based on browsing history
• Our work can outperforms the C4.5 algorithm in execution time and accuracy on all memory utilization.
• In the future, we need to implement Map-Reduce algorithm on modified Hierarchical Agglomerative Clustering to enhance performance of clustering algorithm.
AcknowledgementsThis research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (2012R1A1A2007014).
Thank You감사합니다