
Page 1: A Quality Focused Crawler for Health Information

Tim Tang

Page 2: Outline

- Overview
- Contributions
- Experiments and results
- Issues for discussion
- Future work
- Questions & Suggestions?

Page 3: Overview

Many people use the Internet to search for health information

But health web pages may contain low-quality information, and acting on it may put people at risk. (example)

It is therefore important to find ways to evaluate the quality of health websites and to provide high-quality results in health search.

Page 4: Motivation

Web users can search for health information using general engines or domain-specific engines such as health portals.

79% of Web users in the U.S. search for health information on the Internet (Fox S. Health Info Online, 2005).

No technique is available for measuring the quality of Web health search results.

Also, there is no method for automatically enhancing the quality of health search results.

Therefore, people building a high-quality health portal have to do it manually, and, without work on measurement, we cannot tell how good a job they are doing.

An example of such a health portal is BluePages Search, developed by the ANU's Centre for Mental Health Research.

Page 5: BluePages Search (BPS)

Page 6: BPS result list

Page 7: Research Objectives

To produce a health portal search that:
- Is built automatically to save time, effort, and expert knowledge (cost saving)
- Contains (only) high quality information in the index by applying some quality criteria
- Satisfies users' demand for getting good advice (evidence-based medicine) about specific health topics from the Internet

Page 8: Contributions

New and effective quality indicators for health websites using some IR-related techniques

Techniques to automate the manual quality assessment of health websites

Techniques to automate the process of building high quality health search engines

Page 9: Expt1: General vs. domain-specific search engines

Aim: To compare the performance of general search engines (Google, GoogleD) and domain-specific engines (BPS) in terms of domain relevance and quality.

Details: 100 depression queries were run against each engine; the top 10 results for each query from each engine were evaluated.

Results: next slide.

Page 10: Expt1: Results

          Relevance            Quality
Engine    Mean MAP    NDCG     Score
GoogleD   0.407       0.609    78
BPS       0.319       0.553    127
Google    0.195       0.349    28

MAP = Modified Average Precision

NDCG = Normalised Discounted Cumulative Gain
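For context, NDCG@k compares the discounted gain of the returned ranking against the gain of an ideal reordering of the same judgements. Below is a minimal, generic sketch of an NDCG@10 computation in Python; it is an illustration of the metric, not the evaluation script used in the experiment, and the graded judgements are invented.

```python
import math

def dcg(gains):
    """Discounted cumulative gain of a ranked list of graded relevance gains."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))

def ndcg(gains, k=10):
    """NDCG@k: DCG of the ranking divided by the DCG of the ideal (sorted) ranking."""
    ideal = dcg(sorted(gains, reverse=True)[:k])
    return dcg(gains[:k]) / ideal if ideal > 0 else 0.0

# Example: graded judgements (0-2) for the top 10 results of one query.
print(ndcg([2, 1, 0, 2, 0, 1, 0, 0, 1, 0], k=10))
```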

Page 11: Expt1: Findings

Findings: GoogleD can retrieve more relevant pages, but fewer high-quality pages, than BPS. Domain-specific engines (BPS) have poor coverage, which causes worse relevance performance.

What next: How can coverage be improved for domain-specific engines? How can the process of constructing a domain-specific engine be automated?

Page 12: Expt2: Prospect of Focused Crawling in building domain-specific engines

Aim: To investigate the prospect of using focused crawling (FC) techniques to build health portals. In particular:
- Seed list: BPS uses a seed list (the start list for a crawl) that was manually selected by experts in the field. Can we automate this process?
- Relevance of outgoing links: Is it feasible to follow outgoing links from the currently crawled pages to obtain more relevant links?
- Link prediction: Can we successfully predict relevant links from available link information?

Page 13: Expt2: Results & Findings

- Out of 227 URLs from DMOZ, 186 were relevant (81%) => DMOZ provides a good starting list of URLs for a FC.
- An unrestricted crawler starting from the BPS crawl can reach 25.3% more known relevant pages in a single step from the currently crawled pages => outgoing links from a constrained crawl lead to additional relevant content.
- The C4.5 decision tree learning algorithm can predict link relevance with a precision of 88.15% => a decision tree built from features such as anchor text, URL words, and link anchor context can help a focused crawler obtain new relevant pages (a sketch follows below).
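The link-prediction step can be illustrated with a small sketch. scikit-learn's DecisionTreeClassifier (CART with an entropy criterion) stands in for C4.5 here, and the combined anchor-text/URL-word/anchor-context strings and their labels are invented for illustration, not data from the experiment.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Each training example is the textual evidence around one outgoing link:
# anchor text + URL words + anchor context, concatenated into one string.
# Label 1 means the link led to a relevant page, 0 means it did not.
link_evidence = [
    "depression symptoms depression treatment html information about treating depression",
    "click here ads banner buy cheap pills online now",
]
labels = [1, 0]

# CART with criterion="entropy" approximates the information-gain splits of C4.5.
model = make_pipeline(
    CountVectorizer(),                        # bag of words over the link evidence
    DecisionTreeClassifier(criterion="entropy"),
)
model.fit(link_evidence, labels)

# Score a newly discovered link before deciding whether to enqueue it.
print(model.predict(["coping with depression overview html evidence based advice"]))
```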

Page 14: Expt3: Automatic evaluation of websites

Aim: To investigate whether the Relevance Feedback (RF) technique can help in the automatic evaluation of health websites.

Details: RF is used to learn terms (words and phrases) representing high-quality documents, together with their weights. This weighted query is then compared with the text of web pages to find the degree of similarity. We call this the "Automatic Quality Tool" (AQT).

Findings: A significant correlation was found between human-rated (EBM) results and AQT results.
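The general idea behind such an RF-based scorer can be sketched as follows: learn a weighted term vector from judged pages and compare it with a page's text by cosine similarity. The Rocchio-style weighting, the 0.25 down-weight for low-quality pages, and the toy texts are assumptions for illustration, not the exact AQT formulation.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins for judged pages.
high_quality = [
    "cognitive behavioural therapy is an evidence based treatment for depression",
    "clinical guidelines recommend medication for moderate to severe depression",
]
low_quality = [
    "miracle herbal cure ends depression in one day buy now",
]

vectorizer = TfidfVectorizer()
vectorizer.fit(high_quality + low_quality)

# Relevance feedback: the "quality query" is the mean vector of the high-quality
# pages minus a down-weighted mean of the low-quality pages (Rocchio-style).
pos = np.asarray(vectorizer.transform(high_quality).mean(axis=0))
neg = np.asarray(vectorizer.transform(low_quality).mean(axis=0))
quality_query = pos - 0.25 * neg

def quality_score(page_text):
    """Similarity between the learned quality query and a page's text."""
    page_vec = vectorizer.transform([page_text]).toarray()
    return float(cosine_similarity(quality_query, page_vec)[0, 0])

print(quality_score("evidence based information about treatment options for depression"))
```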

Page 15: Expt3: Results – Correlation between AQT score and EBM score

Page 16: Expt3: Results – Correlation between Google PageRank and EBM score

Correlation: small and non-significant (r = 0.23, P = 0.22, n = 30).

Excluding sites with a PageRank of 0, we obtained a better correlation, but still significantly lower than the correlation between AQT and EBM.
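Such correlations can be checked with a standard Pearson test; a minimal sketch is below. The site-level scores are invented placeholders, not the data behind the reported r = 0.23.

```python
from scipy.stats import pearsonr

# Hypothetical site-level scores: human EBM ratings vs. automatic scores.
ebm_scores = [3.0, 5.5, 2.0, 7.0, 4.5]
aqt_scores = [0.21, 0.40, 0.15, 0.52, 0.33]
pagerank_scores = [4, 6, 0, 5, 3]

r, p = pearsonr(ebm_scores, aqt_scores)
print(f"AQT vs EBM:      r={r:.2f}, p={p:.3f}")

r, p = pearsonr(ebm_scores, pagerank_scores)
print(f"PageRank vs EBM: r={r:.2f}, p={p:.3f}")
```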

Page 17: Expt4: Building a health portal using FC

Aim: To build a high-quality health portal automatically, using FC techniques

Details:
- Relevance scores for links are predicted using the decision tree from Expt. 2, and are transformed into probability scores using the Laplace correction formula.
- We found that machine learning did not work well for predicting quality, but RF helps.
- The quality of a target page is predicted using the mean of the quality scores of all its known (visited) source pages.
- Combination of relevance and quality: the product of the relevance score and the quality score is used to determine crawling priority (a sketch follows below).
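A minimal sketch of how these pieces could fit together when ordering the crawl frontier. The Laplace correction shown is the standard (p + 1) / (n + k) form for a leaf with p positive examples out of n and k classes; the leaf counts, quality scores, and helper names are illustrative, not values from the experiment.

```python
def laplace_probability(positive, total, classes=2):
    """Laplace-corrected probability for a decision-tree leaf: (p + 1) / (n + k)."""
    return (positive + 1) / (total + classes)

def predicted_quality(source_quality_scores):
    """Quality of an unvisited target page: mean quality of its known source pages."""
    return sum(source_quality_scores) / len(source_quality_scores)

def crawl_priority(leaf_positive, leaf_total, source_quality_scores):
    """Frontier priority: predicted relevance probability times predicted quality."""
    return laplace_probability(leaf_positive, leaf_total) * predicted_quality(source_quality_scores)

# A link whose features reach a leaf with 40 relevant out of 50 training examples,
# discovered on pages with (hypothetical) quality scores 0.8 and 0.6.
print(crawl_priority(40, 50, [0.8, 0.6]))
```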

Page 18: Expt4: Results – Quality scores

Three crawls were built: BF, Relevance, and Quality.

Page 19: Expt4: Results – Below Average Quality (BAQ) pages in each crawl

Page 20: Expt4: Findings

RF is a good technique for predicting the quality of web pages based on the quality of known source pages.

Quality is an important measure in health search because a lot of relevant information is of poor quality (e.g. in the relevance crawl).

Further analysis shows that quality of content might be further improved by post-filtering a very large BF crawl, but at the cost of substantially increased network traffic.

Page 21: Issues for discussion

- Combination of scores
- Untrusted sites
- Quality evaluation
- Relevance threshold choice
- Coverage
- Combination of quality indicators
- RF vs machine learning

Page 22: Issue: Combination of scores

The decision to multiply the relevance and quality scores was made somewhat arbitrarily; the idea was to keep a balance between relevance and quality, so that both quality and coverage are maintained.

Question: Would addition (or another linear combination) be a better way to calculate this score? Or should only the quality score be considered? In general, how should relevance and quality scores be combined?

Page 23: Issue: Untrusted sites

RF was used for predicting high quality, but analysis showed that low-quality health sites are often untrusted sites, such as commercial sites, chat sites, forums, bulletins, and message boards. Our results do not seem to exclude some of these sites.

Question: Is it feasible to use RF, or any other means, to detect these sources? How should that be incorporated into the crawler?

Page 24: Issue: Quality evaluation expt.

This is expensive because manual evaluation of quality requires a lot of expert knowledge and effort: to know the quality of a site, we have to judge all the pages of that site.

Question: How can we design a cheaper but still effective evaluation experiment for quality? Can lay judgements of quality be used somehow?

Page 25: Issue: Relevance threshold choice

A relevance classifier was built to help reduce the relevance judging effort, and a cut-off point for the relevance score needs to be identified. The classifier was run on 2000 pre-judged documents, half of which are relevant. I chose the cut-off threshold as the score at which the total number of false positives and false negatives is minimised (a sketch follows below).
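A minimal sketch of that threshold choice, assuming each pre-judged document has a classifier score and a binary relevance label; the toy scores and labels are illustrative.

```python
import numpy as np

def choose_threshold(scores, labels):
    """Return the cut-off score that minimises false positives + false negatives."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_threshold, best_errors = None, None
    for t in np.unique(scores):
        predicted_relevant = scores >= t
        fp = int(np.sum(predicted_relevant & (labels == 0)))
        fn = int(np.sum(~predicted_relevant & (labels == 1)))
        if best_errors is None or fp + fn < best_errors:
            best_threshold, best_errors = float(t), fp + fn
    return best_threshold, best_errors

# Toy stand-in for the 2000 pre-judged documents (1 = relevant, 0 = not relevant).
print(choose_threshold([0.10, 0.40, 0.35, 0.80], [0, 0, 1, 1]))
```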

Question: Is this a reasonable way to decide a relevance threshold? Are there alternatives?

Page 26: Issue: Coverage

The FC may not explore all directions of the Web, resulting in low coverage. It is important to know how much of the high-quality Web the FC can index.

Question: How can we design an experiment that evaluates coverage? (How do we measure recall?)

Page 27: Issue: Combination of quality indicators

Health experts have identified several indicators that may help in evaluating quality, such as content currency, authoring information, and disclosure information.

Question: How can/should these indicators be used in my work to predict quality?

Page 28: Issue: RF vs Machine Learning

Compared to RF, ML has the flexibility of adding more features, such as an 'inherited quality score' (from source pages), into the learning process to predict the quality of the results.

However, we initially tried ML to predict quality and found that RF was much better. Maybe we did not do it right?

Question: Could ML be used in a similar way to RF? Would it promise better results?

Page 29: Future work

Better combination of quality and relevance scores to improve quality

Involve the quality dimension in the ranking of health search results (create something similar to BM25 that incorporates a quality measure? A sketch follows after this list.)

Move to another topic in the health domain, or to an entirely new topic?

Combine heuristics and other medical quality indicators with RF?
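On the BM25-with-quality idea above: a minimal sketch of one possible combination, in which a standard BM25 term weight is boosted multiplicatively by a page-level quality score in the range 0 to 1. The multiplicative form, the beta parameter, and all the numbers are assumptions, not a worked-out ranking function.

```python
import math

def bm25_weight(tf, df, doc_len, avg_doc_len, n_docs, k1=1.2, b=0.75):
    """Standard BM25 weight of one query term in one document."""
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))

def quality_aware_score(relevance_score, quality, beta=1.0):
    """Speculative combination: boost the relevance score by a quality score in 0..1."""
    return relevance_score * (1 + beta * quality)

# One query term with tf=3 in a 500-word page, df=1000 over 100000 pages,
# average page length 400 words, and a hypothetical quality score of 0.6.
relevance = bm25_weight(tf=3, df=1000, doc_len=500, avg_doc_len=400, n_docs=100000)
print(quality_aware_score(relevance, quality=0.6))
```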

Page 30: Suggestions

Any more suggestions to improve my work? Any more suggestions for future work? Other suggestions?

The end!