An Internet Forum Index
Curtis Spencer, Ezra Burgoyne

Page 1

Curtis Spencer

Ezra Burgoyne

An Internet Forum Index

Page 2

The Problem

• Forums provide a wealth of information often overlooked by search engines.

• Popular search software (e.g. Google, Yahoo) does not take advantage of the semi-structured data forums provide.

• Despite being crawled, many useful, information-rich posts never appear in results due to low page rank.

• Discovering the best forums for a given topic is difficult, even with the help of a search engine.

• Forum users are often unaware of related information found on rival forums.

• A forum’s own search software is often slow and returns poor results.

Page 3

Quick summary of the solution

• Forum-detection crawlers continually find new forums with the help of a web search engine (e.g. Dogpile).

• These discovered forums are eventually wrapped in their entirety by a distributed crawler.

• Forum content collected in the database is indexed using MySQL's full-text natural-language index.

• The search-ranking algorithm uses data ignored by traditional search engines, such as the number of replies, number of views, and popularity of the poster.

Page 4

Page 5

Forums supported by Forum Looter

• phpBB

• vBulletin

Page 6

Page 7

Discovering forums

• Using WordNet, a program serves dictionary words and their synonyms to a set of distributed crawlers.

• Every link returned by Dogpile is subjected to a detection algorithm that checks URL formations as well as common patterns in the markup.

• Detects the three most popular forum types used on the internet: vBulletin, phpBB, and Invision.

• To be good netizens, the Dogpile website is accessed only once every two minutes, and robots.txt is respected.
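The URL half of the detection step can be sketched as follows. The class and method names are hypothetical, and the script names checked for are the footprints typical of each engine; the markup checks mentioned above would run after this:

```java
import java.util.regex.Pattern;

// Hypothetical sketch of URL-based forum detection: each engine's
// default install leaves recognizable script names in its URLs.
public class ForumDetector {
    private static final Pattern PHPBB =
        Pattern.compile("viewtopic\\.php|viewforum\\.php");
    private static final Pattern VBULLETIN =
        Pattern.compile("showthread\\.php|forumdisplay\\.php");
    private static final Pattern INVISION =
        Pattern.compile("index\\.php\\?showtopic=|index\\.php\\?showforum=");

    // Returns the detected forum engine, or null if the URL alone
    // is inconclusive (the markup check would then decide).
    public static String detect(String url) {
        if (PHPBB.matcher(url).find()) return "phpBB";
        if (VBULLETIN.matcher(url).find()) return "vBulletin";
        if (INVISION.matcher(url).find()) return "Invision";
        return null;
    }

    public static void main(String[] args) {
        System.out.println(detect("http://example.com/forum/viewtopic.php?t=42"));
        System.out.println(detect("http://example.com/showthread.php?t=7"));
    }
}
```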

Page 8

Dogpile example query

Page 9

Page 10

Distributed forum wrapping (architecture)

• Synchronized Java RMI server stores a queue of jobs.

• Distributed crawlers retrieve jobs from the central RMI server.

• Distributed crawlers wrap whatever page their fetched job specifies and save the results into the database.

• Distributed crawlers can schedule new jobs on the RMI server, too.
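The job queue at the heart of this architecture can be sketched as below (names hypothetical). In the real system the class would implement a `Remote` interface and be exported through `UnicastRemoteObject`, with each method declaring `RemoteException`; that RMI plumbing is omitted here:

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Sketch of the synchronized job queue the RMI server exposes.
// A job is simply the URL of a forum page to wrap.
public class JobQueue {
    private final Queue<String> jobs = new ArrayDeque<>();

    // Called remotely by a crawler to obtain its next page to wrap.
    public synchronized String nextJob() {
        return jobs.poll(); // null when the queue is empty
    }

    // Called remotely by crawlers that discover new pages, e.g. a
    // thread-listing page schedules one job per thread it links to.
    public synchronized void schedule(String url) {
        jobs.add(url);
    }

    public static void main(String[] args) {
        JobQueue q = new JobQueue();
        q.schedule("http://example.com/forum/viewforum.php?f=3");
        System.out.println(q.nextJob());
    }
}
```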

Page 11

More on distributed forum wrapping (access)

• To be good netizens, each forum website is accessed only once in any 20-second period.

• At this rate, it would take two months to completely wrap some of the largest forums out there.

• The last request time for each forum website is set by the individual client crawlers and tracked in the RMI server.

• An exponential back-off algorithm is used for slow sites.
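A minimal sketch of this politeness bookkeeping, with hypothetical names: the server records each site's last request time, enforces the 20-second spacing, and doubles the delay for a site that responded slowly:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of per-site request spacing with exponential back-off.
public class PolitenessPolicy {
    private static final long BASE_DELAY_MS = 20_000; // 20 s per site
    private final Map<String, Long> lastRequest = new HashMap<>();
    private final Map<String, Long> delay = new HashMap<>();

    // Milliseconds the caller must still wait before hitting this host.
    public synchronized long waitTimeMs(String host, long nowMs) {
        if (!lastRequest.containsKey(host)) return 0;
        long d = delay.getOrDefault(host, BASE_DELAY_MS);
        return Math.max(0, lastRequest.get(host) + d - nowMs);
    }

    // Record a completed request; a slow response doubles the delay,
    // a healthy one resets it to the 20-second baseline.
    public synchronized void recordRequest(String host, long nowMs, boolean slow) {
        long d = delay.getOrDefault(host, BASE_DELAY_MS);
        lastRequest.put(host, nowMs);
        delay.put(host, slow ? d * 2 : BASE_DELAY_MS);
    }

    public static void main(String[] args) {
        PolitenessPolicy p = new PolitenessPolicy();
        p.recordRequest("forum.example.com", 0, false);
        System.out.println(p.waitTimeMs("forum.example.com", 5_000)); // 15000 ms left
    }
}
```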

Page 12

More on distributed forum wrapping (performance)

• The Java RMI server performed very well, using only 10% of an AMD Athlon CPU.

• Individual crawlers were rather memory intensive.

• Individual crawlers used JTidy to parse pages and combined DOM manipulation with regular expressions to extract data.

• Memory use was attributed to Hibernate and JTidy.

• Database access was the main bottleneck in the distributed system.

Page 13

Page 14

Indexing of data

• A shadow database periodically pulls in forum data from the crawl and builds an incremental MySQL full-text index.

• MySQL's full-text indexing has built-in support for stop words and document-frequency scaling.
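MySQL's natural-language full-text search is exposed through `MATCH ... AGAINST`. A sketch of how the search front end might build such a query is below; the table and column names (`posts`, `body`) are assumptions, and real code would use a prepared statement rather than string escaping:

```java
// Sketch of building a MySQL natural-language full-text query.
public class FulltextQuery {
    // MATCH ... AGAINST defaults to natural-language mode in MySQL.
    public static String build(String terms) {
        // Naive quote escaping for illustration only; use a
        // PreparedStatement in real code.
        return "SELECT id FROM posts WHERE MATCH(body) AGAINST('"
             + terms.replace("'", "''") + "')";
    }

    public static void main(String[] args) {
        System.out.println(build("overclocking heatsink"));
    }
}
```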

Page 15

Ranking algorithm for search results

• Forum software exposes metadata that presents many analysis opportunities missed by general-purpose search engines.

• We perform a hierarchical weighted-value calculation for each post once it matches a natural-language query from MySQL.

• An approximation of this calculation is Value = w(apc)*apc + w(numViews)*numViews + w(numReplies)*numReplies + w(isThread)*isThread()
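Assuming each term of the formula is a post feature scaled by its weight, the calculation can be sketched as below; the weight values are purely illustrative, and `apc` is kept as an opaque numeric feature as in the slide:

```java
// Sketch of the weighted post-value calculation from the slide.
public class PostRanker {
    // Illustrative weights; the real system would tune these.
    static final double W_APC = 1.0, W_VIEWS = 0.01,
                        W_REPLIES = 0.1, W_THREAD = 5.0;

    // Value = w(apc)*apc + w(numViews)*numViews
    //       + w(numReplies)*numReplies + w(isThread)*isThread()
    public static double value(double apc, int numViews,
                               int numReplies, boolean isThread) {
        return W_APC * apc
             + W_VIEWS * numViews
             + W_REPLIES * numReplies
             + (isThread ? W_THREAD : 0.0);
    }

    public static void main(String[] args) {
        // A thread with 2000 views and 30 replies, poster apc of 4.
        System.out.println(value(4, 2000, 30, true));
    }
}
```

Posts matched by the full-text query would be sorted by this value before being returned.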

Page 16

Future changes

• Natural-language parsing of the forum-post corpus, so that ranking may be affected by replies such as “thank you” or “great post”.

• Collaborative filtering of results per query by tracking user clicks.

• Improve resource usage of crawlers.

Page 17

Page 18