Automatic indexing

Page 1: Automatic indexing

Automatic indexing

Salton: “When the assignment of content identifiers is carried out with the aid of modern computing equipment the operation becomes automatic indexing.”

Page 2: Automatic indexing

Approaches and Methods

Initial approach
– Create an inverted file
– On-the-fly (natural language processing)

Methods
– All words, remove stop words
– Word frequencies (Wilson’s objective method of determining aboutness)
– More sophisticated IR methods
  • Semantic/linguistic analysis
  • Co-occurrence/similarity measures, etc.
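The first two methods (all words minus stop words, then word frequencies) can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular system: the stop list, the tokenizing pattern, and the function names `index_terms` and `term_frequencies` are assumptions made for the example.

```python
import re
from collections import Counter

# Tiny illustrative stop list (an assumption); real stop lists are much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "is", "in", "to", "with", "when"}

def index_terms(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into alphabetic words, and drop stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in stop_words]

def term_frequencies(text):
    """Count how often each remaining word occurs in the text."""
    return Counter(index_terms(text))

freqs = term_frequencies("The indexing of documents is automatic indexing")
print(freqs.most_common(2))
```

The most frequent remaining words are candidate index terms, which is the idea behind using word frequency as a rough, objective signal of what a document is about.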

Page 3: Automatic indexing

Basic arrangement of automatic indexes

Inverted file: contains all the index terms automatically drawn from the document records according to the indexing technique used.
– Position of term
  - Record number
  - Field number
  - Number of occurrences
  - Position in the field (digits 45-57)
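As a rough sketch of such an inverted file, the Python below maps each term to a list of postings that carry the four items listed above. The record layout (a dict from record number to a list of field texts), the use of character offsets as positions, and the name `build_inverted_file` are assumptions for illustration; the fixed-digit posting layout mentioned on the slide is not modeled.

```python
import re
from collections import defaultdict

def build_inverted_file(records):
    """Build {term: [posting, ...]} from {record_number: [field_text, ...]}.

    Each posting records the record number, the field number (1-based),
    the number of occurrences in that field, and the character offsets
    at which the term starts within the field.
    """
    inverted = defaultdict(list)
    for record_no, fields in records.items():
        for field_no, text in enumerate(fields, start=1):
            offsets_by_term = defaultdict(list)
            for match in re.finditer(r"[a-z]+", text.lower()):
                offsets_by_term[match.group()].append(match.start())
            for term, offsets in offsets_by_term.items():
                inverted[term].append({
                    "record": record_no,
                    "field": field_no,
                    "occurrences": len(offsets),
                    "positions": offsets,
                })
    return dict(inverted)

# Hypothetical records: record number -> [title, abstract]
records = {
    1: ["Automatic indexing", "Indexing with an inverted file"],
    2: ["Inverted file access", "Sequential access and hashing"],
}
inverted_file = build_inverted_file(records)
print(inverted_file["indexing"])
```

Looking up a term then returns every place it occurs without scanning the document records themselves.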

Page 4: Automatic indexing

Access of Inverted Files

Sequential access
- alphabetical ordering
Binary chain access
- binary search tree
Hashing
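The three access methods can be contrasted with a small Python sketch. It assumes the inverted file is a mapping from term to postings; the class and function names are hypothetical, and reading "binary chain access" as a plain (unbalanced) binary search tree follows the slide's own gloss rather than any specific system.

```python
class Node:
    """One node of a binary search tree keyed by term."""
    def __init__(self, term, postings):
        self.term, self.postings = term, postings
        self.left = self.right = None

def bst_insert(root, term, postings):
    """Insert a term into the tree and return the (possibly new) root."""
    if root is None:
        return Node(term, postings)
    if term < root.term:
        root.left = bst_insert(root.left, term, postings)
    elif term > root.term:
        root.right = bst_insert(root.right, term, postings)
    else:
        root.postings = postings
    return root

def bst_lookup(root, term):
    """Follow left/right links until the term is found or the chain ends."""
    while root is not None and root.term != term:
        root = root.left if term < root.term else root.right
    return root.postings if root else None

def sequential_lookup(sorted_entries, term):
    """Scan alphabetically ordered (term, postings) pairs; stop once past the term."""
    for entry_term, postings in sorted_entries:
        if entry_term == term:
            return postings
        if entry_term > term:
            break
    return None

def hash_lookup(table, term):
    """A Python dict is a hash table, so lookup is a single hashed probe."""
    return table.get(term)

index = {"index": [1], "file": [1, 2], "term": [3]}  # hypothetical postings
entries = sorted(index.items())
tree = None
for term, postings in index.items():
    tree = bst_insert(tree, term, postings)
print(sequential_lookup(entries, "file"), bst_lookup(tree, "file"), hash_lookup(index, "file"))
```

All three return the same postings; they differ in how many comparisons a lookup costs as the index grows.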

Page 5: Automatic indexing

Pros and Cons of Automatic Indexing

Pros
– Consistency
– Cost reduction
– Time reduction

Cons / limitations
– Human intellect
– Term relationships
– Misleading in retrieval
– Good algorithms, but generally domain-specific

Page 6: Automatic indexing

Natural language vs. Controlled Vocabulary

Natural language continuum
<-basic key word------------IR------------full NLP->

Page 7: Automatic indexing

Natural language vs. Controlled Vocab.

Pros and cons, compared on:
– Production cost
– Cost to the end-user
– Facilitate specificity in terms of access
– Exhaustivity of indexing
– Handling of errors

Page 8: Automatic indexing

What is Automatic Classification?

Automatic manipulation of a document’s contents to support logical grouping with other similar documents for organization and/or retrieval activities. Can include the assignment of, or manipulation of, classification notation.
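One simple way to picture "logical grouping with other similar documents" is to assign a document to the class whose sample text it most resembles. The Python below is a sketch under stated assumptions: it uses cosine similarity over raw word counts, and the class labels and profile texts are hypothetical, not drawn from any classification scheme discussed here.

```python
import math
import re
from collections import Counter

def term_vector(text):
    """Represent a text as a bag of lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def classify(text, class_profiles):
    """Return the label whose profile text is most similar to the document."""
    doc = term_vector(text)
    return max(class_profiles, key=lambda label: cosine(doc, term_vector(class_profiles[label])))

# Hypothetical class profiles for illustration only.
profiles = {
    "computing": "computer software index file search",
    "medicine": "gene disease patient clinical",
}
print(classify("search the index file", profiles))
```

A real system would also map the chosen class to its notation (for example a classification number), which is the "assignment of classification notation" mentioned above.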

Page 9: Automatic indexing

Why Automatic Classification?

Classification is time consuming and expensive
Knowledge structuring
– Too much information
Status of automatic classification
– Fairly experimental, although not completely…
  • Operational systems for e-mail
  • Web retrieval harvesting METADATA (semi-automatic)

Page 10: Automatic indexing

Automatic Classification on the Web

Automatic Classification of Web resources using Java and Dewey Decimal Classification
http://www.scit.wlv.ac.uk/~ex1253/classifier/

SOSIG: Associated Research and Development
http://www.sosig.ac.uk/about_us/research.html

Page 11: Automatic indexing

Why Automatic Classification?

Your articles…
– How defined
– What was the purpose
– How was the automatic classification done, or discussed
– Outcome

Page 12: Automatic indexing

Recall
Proportion of all the possible relevant documents in the system that were retrieved.
[quantity: did you get it all?]

Precision
Percentage of documents retrieved that were relevant.
[quality of what you found]
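Both measures reduce to simple set arithmetic. The sketch below assumes documents are identified by hashable IDs and that the full set of relevant documents is known, which in practice it rarely is; the function names and sample IDs are illustrative.

```python
def recall(retrieved, relevant):
    """Fraction of the relevant documents that were retrieved."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(retrieved) & relevant) / len(relevant)

def precision(retrieved, relevant):
    """Fraction of the retrieved documents that are relevant."""
    retrieved = set(retrieved)
    if not retrieved:
        return 0.0
    return len(retrieved & set(relevant)) / len(retrieved)

# Hypothetical document IDs: 4 retrieved, 3 relevant, 2 in common.
retrieved_ids = {1, 2, 3, 4}
relevant_ids = {2, 4, 5}
print(recall(retrieved_ids, relevant_ids), precision(retrieved_ids, relevant_ids))
```

Retrieving more documents can only keep recall the same or raise it, but it usually lowers precision, which is the tradeoff the two measures capture.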

Page 13: Automatic indexing

Tradeoff between Recall and Precision

We can easily recall everything that matches a particular text string or pattern; however, we cannot search through all the matching results (there are too many).

We can do an OK job of limiting results to the most relevant, but as we “tune” the results to be more relevant, we leave out more and more matching results.

Page 14: Automatic indexing

Major Issues

Information is mostly online
Information is increasingly available in full-text (full-content)
There is an explosion in the amount of information being produced. So much so that even in fields like medical literature, where there are major efforts like NLM Medline to index content, we cannot keep up.

Page 15: Automatic indexing

What this means

Need ways to index without requiring paid experts
– Automatic indexing, classification, keyword extraction, and even relationship and fact extraction.
– Need to take advantage of experts who are reading the materials to comment on them and provide rankings, summarizations, keywords, “factoids” (like Amazon).

Page 16: Automatic indexing

Future Search

Full-text searching of content, of associated annotations on content, and of metadata (including reader rankings, tags, etc.), as in Connotea, NeoNote, etc.

Facet-based searching (Endeca; e.g. Home Depot, NCSU library).

Cluster-based searching (Clusty).

Page 17: Automatic indexing

Study on gene name searching

Looks at full-text searching
Tradeoff between precision and recall (Hemminger 2007).

Page 18: Automatic indexing

Article Discovery Study

                                       Schizophrenia +
                                       Schizophrenia Gene   Schizophrenia Gene   Arabidopsis Gene
Genes Found in Metadata Only           172 (8.58%)          3541 (20.63%)        2712 (8.83%)
Genes Found in Full-text Only          1671 (83.38%)        10125 (58.99%)       5705 (18.57%)
Genes Found in Metadata and Full-text  161 (8.03%)          3498 (20.38%)        22305 (72.60%)
Totals for Found Genes                 2004                 17164                30722
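The percentages in this table are each count divided by its column total. The short Python check below recomputes them from the counts given above; the grouping of the column headings into three search sets is an assumption based on the three totals, and the name `shares` is illustrative.

```python
# Counts copied from the table above; the column labels are an assumed reading
# of the slide's headings.
counts = {
    "Schizophrenia + Schizophrenia Gene": {"metadata only": 172, "full-text only": 1671, "both": 161},
    "Schizophrenia Gene": {"metadata only": 3541, "full-text only": 10125, "both": 3498},
    "Arabidopsis Gene": {"metadata only": 2712, "full-text only": 5705, "both": 22305},
}

def shares(column):
    """Return the column total and each row's percentage of that total."""
    total = sum(column.values())
    return total, {row: round(100 * n / total, 2) for row, n in column.items()}

for label, column in counts.items():
    total, percentages = shares(column)
    print(label, total, percentages)
```

Read this way, most gene mentions in the first two sets are found only in the full text, while in the Arabidopsis set most are found in both metadata and full text.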

Page 19: Automatic indexing

Article Review Study

Two literature cohorts
– Schizophrenia (Pat Sullivan)
– Arabidopsis (Todd Vision)
Each cohort had three readers
Readers are asked to “review the article and judge its relevance to them as someone new to the gene in this biological setting, trying to build an understanding of the state of knowledge in that research area.”

Page 20: Automatic indexing

Metadata Articles More Valuable

In both cases and for all observers, the mean quality rating values were lower (more useful) for the metadata-discovered articles. The differences between the mean quality ratings for the metadata-discovered articles and the full-text-discovered articles were statistically significant for both the Arabidopsis and Schizophrenia sets at the p < 0.05 level.

Page 21: Automatic indexing

Precision and Recall

                           Schizophrenia             Arabidopsis
                           Recall         Precision  Recall         Precision
Metadata discovered        15.7% (16.6%)  94.7%      84.1% (84.1%)  100%
Full-text only discovered  100%           63.7%      100%           69%

Page 22: Automatic indexing

Article Features that correlate with Value: Number of Hits

The number of hits or matches of the search term within the returned document is a commonly used feature to rank returned articles. To test the value of this feature, the number of hits was correlated with the mean quality ranking for each article (averaged across all observers). The results clearly show a relationship in which articles with many matches of the search term tend to be much more highly valued.
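A test of this kind can be sketched as counting matches per article and correlating the counts with the mean ratings. The slide does not say which correlation statistic was used, so the Pearson coefficient below is an assumption, as are the function names; the sample values are hypothetical and are not data from the study.

```python
import math
import re

def count_hits(search_term, text):
    """Count case-insensitive occurrences of the search term in the text."""
    return len(re.findall(re.escape(search_term), text, flags=re.IGNORECASE))

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # undefined for constant input; reported as 0.0 here
    return cov / (spread_x * spread_y)

# HYPOTHETICAL values for illustration only, not results from the study.
hits_per_article = [1, 3, 8, 20, 35]
mean_rating = [4.5, 4.0, 3.0, 2.0, 1.5]  # lower rating = more useful
print(pearson(hits_per_article, mean_rating))
```

Because lower ratings mean more useful articles in this study, a negative coefficient corresponds to articles with more hits being valued more highly.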

Page 23: Automatic indexing

Improving Relevance for Metadata Searching

Repeating the calculations on the schizophrenia and Arabidopsis article review sets, but limited to only matches with high hit counts (Schizophrenia ≥ 20 hits and Arabidopsis ≥ 15 hits), shows that precision for the full text is now the same (100% in Arabidopsis) or slightly better than that of the metadata retrieved articles (95% versus 94.4% in schizophrenia). However, the number of additional cases discovered by full-text searching is now only slightly better, finding 5% more cases in schizophrenia and 28% more in Arabidopsis.

Page 24: Automatic indexing

Conclusions

This suggests that rather than accepting metadata searching as a surrogate for full-text searching, it may be time to make the transition to direct full-text searching as the standard. This could be accomplished by using certain features of the full-text article, such as the number of hits of the search string or whether the search string is found in the metadata (i.e. our current metadata search), as filters that allow us to increase the precision of our results and put the user in control of the filtering.
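The filtering idea can be sketched as a function over full-text search results in which the user chooses the thresholds. The `Hit` record, its field names, and `filter_results` are hypothetical names for this illustration, and the sample results are made up.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    article_id: str
    hit_count: int     # matches of the search string in the full text
    in_metadata: bool  # True if the search string also matches the metadata

def filter_results(results, min_hits=0, require_metadata=False):
    """Keep full-text results that pass the user-chosen filters, most hits first."""
    kept = [
        r for r in results
        if r.hit_count >= min_hits and (r.in_metadata or not require_metadata)
    ]
    return sorted(kept, key=lambda r: r.hit_count, reverse=True)

# Hypothetical full-text search results.
results = [Hit("a", 25, False), Hit("b", 3, True), Hit("c", 18, True)]
print([r.article_id for r in filter_results(results, min_hits=15)])
print([r.article_id for r in filter_results(results, require_metadata=True)])
```

With no filters the user sees everything full-text search found (high recall); raising `min_hits` or requiring a metadata match trades some of that recall for precision.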