Math Information Retrieval Zhao Jin


Math Information Retrieval

Zhao Jin

Zhao Jin. Math Information Retrieval

Why Math Information Retrieval?

Examples:
– Looking for formulas
– Collecting teaching resources
– Keeping updated on research development

Generic search engines are ineffective in such situations
– Unaware of user needs and math expressions

23/4/20 2


Linked Expression: a² + b² = c²


Filter by: resource category, experience, specificity

Tutorials, Slides, Problem Solution Set, Applets, Tools, Data, Algorithms

[Mock-up result labels: Proof, Definition…, Tutorial, Proof]

Goal: a user-centric and math-aware digital library (DL)
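The filtering idea on this slide could be sketched roughly as below. The `Resource` record, its field names, and the facet values ("beginner", "general", and so on) are hypothetical placeholders chosen for illustration; the slides name only the three facets, not their values or any data model.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Resource:
    """Hypothetical record for one retrieved resource."""
    title: str
    category: str      # e.g. "Tutorial", "Proof" (illustrative values)
    experience: str    # e.g. "beginner", "expert" (assumed values)
    specificity: str   # e.g. "general", "specific" (assumed values)


def filter_resources(resources: List[Resource],
                     category: Optional[str] = None,
                     experience: Optional[str] = None,
                     specificity: Optional[str] = None) -> List[Resource]:
    """Keep resources matching every facet the user actually set."""
    wanted = {"category": category,
              "experience": experience,
              "specificity": specificity}
    return [
        r for r in resources
        if all(value is None or getattr(r, facet) == value
               for facet, value in wanted.items())
    ]


catalogue = [
    Resource("Pythagorean theorem tutorial", "Tutorial", "beginner", "general"),
    Resource("Proof by rearrangement", "Proof", "beginner", "specific"),
    Resource("Proof via similar triangles", "Proof", "expert", "specific"),
]
beginner_proofs = filter_resources(catalogue, category="Proof", experience="beginner")
print([r.title for r in beginner_proofs])
```

Facets left as `None` are ignored, so a user can filter on any subset of the three.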


Outline

• Introduction
• Literature Review
  – Domain-specific Information Seeking Studies
  – Current Math Resources
  – Math Information Retrieval
• User Study
• Prototype Implementation
• Future Work


Literature Review

• Domain-specific Information Seeking Studies
  – Monograph to Online Resources
  – Time-saving or relevant
  – Usability and accessibility problems

Key requirements: Usefulness, Usability, and Accessibility

• Current Math Resources Online
  – Isolated
  – Different degree of math awareness

Examples: Math Web Search, Wolfram Function Site

1. Hamper Accessibility
2. Limited search capability and hard to judge usefulness


Literature Review

Current Math Information Retrieval Research
• Expression Matching
  – Text-based approaches
    Notational Variation Problem: “a² + b² = c²” ≠ “x² + y² = z²”
  – Non-text-based approach
• Query language
  – Expressiveness vs. User-Friendliness

[Figure: query languages placed on Expressiveness and User-Friendliness axes: LaTeX, MathML; Text Keyword; GUI, Natural Language, ASCII]

Assume an expression input from the user
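The notational variation problem can be made concrete with a small sketch. The function below is an illustrative assumption, not one of the approaches surveyed in the talk: it renames single-letter variables in order of first appearance, so two expressions that differ only in variable names compare equal as strings.

```python
import re


def normalize_variables(expression: str) -> str:
    """Rename single-letter variables to v1, v2, ... in order of first appearance."""
    names = {}

    def rename(match):
        letter = match.group(0)
        if letter not in names:
            names[letter] = f"v{len(names) + 1}"
        return names[letter]

    compact = expression.replace(" ", "")  # ignore spacing differences
    return re.sub(r"[A-Za-z]", rename, compact)


pythagoras_abc = normalize_variables("a^2+b^2=c^2")
pythagoras_xyz = normalize_variables("x^2 + y^2 = z^2")

# Plain string comparison fails on the raw inputs but succeeds after renaming.
print("a^2+b^2=c^2" == "x^2 + y^2 = z^2")
print(pythagoras_abc == pythagoras_xyz)
```

This toy version treats every letter as a variable, so multi-letter names such as `sin` would be split apart; handling those, or reordered terms, needs a structural representation rather than plain text.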


Unanswered Issues

Whether the information needs of the users are satisfied by such resources
• What do the users really need?
• How do they perform information seeking?
• What are the difficulties encountered?

Whether the current research focus is appropriate
• Do they really need/prefer expression search?

Further Study Needed!


Outline

• Introduction
• Literature Review
• User Study
  – Findings
  – Desiderata in Math Information Retrieval
• Prototype Implementation
• Conclusion


Findings from the User Study

Three Approaches
• Keyword Search / Browsing / Personal Contacts
Trade-off between cost and benefit

Expression Search
• Attractive but utility unknown
Keyword search still popular and preferred

The multi-faceted user needs
• Information-oriented / Format-oriented
• Specificity and Experience for filtering
• Domain and Intent as context
Each facet needs to be catered for specifically


Desiderata in Math Retrieval

Multi-collection search
• Search through multiple collections on behalf of the user
Enhance the usability and accessibility of collections

Resource Categorization
• Automatically classify the materials according to the different facets of the user needs
Return results that best suit the user needs
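Multi-collection search, as described on this slide, might look like the sketch below. The two collection functions are toy stand-ins returning fixed results, and the merge rule (keep the highest score per resource, assuming scores are comparable across collections) is an assumption made for illustration, not the prototype's method.

```python
from typing import Callable, Dict, List, Tuple

Hit = Tuple[str, float]  # (resource identifier, relevance score)


def meta_search(query: str,
                collections: Dict[str, Callable[[str], List[Hit]]]
                ) -> List[Tuple[str, float, str]]:
    """Query every collection and merge the hits into one ranked list."""
    best = {}
    for name, search in collections.items():
        for resource, score in search(query):
            # Keep the best-scoring copy when collections overlap.
            if resource not in best or score > best[resource][0]:
                best[resource] = (score, name)
    ranked = [(resource, score, name) for resource, (score, name) in best.items()]
    ranked.sort(key=lambda hit: hit[1], reverse=True)
    return ranked


# Hypothetical collections with hard-coded results.
def search_collection_a(query: str) -> List[Hit]:
    return [("pythagoras-proof", 0.9), ("triangle-tutorial", 0.4)]


def search_collection_b(query: str) -> List[Hit]:
    return [("pythagoras-proof", 0.7), ("pythagoras-applet", 0.6)]


results = meta_search("pythagorean theorem",
                      {"A": search_collection_a, "B": search_collection_b})
for resource, score, source in results:
    print(resource, score, source)
```

In practice scores from different collections are rarely on the same scale, so a real system would need score normalization or rank-based merging.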


Outline

• Introduction
• Literature Review
• User Study
• Prototype Implementation
  – Focus on Resource Categorization
• Future Work


Prototype Implementation

Multi-collection Search
• Meta-search
• Offline indexing based on open source package
• Easier requirement to meet between the two

Resource Categorization
• Domain-specific text categorization on webpages
• More interesting as a research topic

Focus of the prototype is on Resource Categorization


Webpage Segmentation

An entire page is not a suitable unit for categorization
• Vision-based Page Segmentation (VIPS) used

[Figure: example page with segments labelled Definition, Variation, Toolbar]


Resource Categorization

Supervised Machine Learning Pipeline
• Labels
  – Math-related / non-math-related
• Features
  – Word, Image, Formatting, Hyperlink, Layout, Context
• Machine Learner
  – SVM
• Training/Testing Data
  – Small corpus of webpages for 5 math topics
  – Manually annotated
  – Kappa agreement: 0.87
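For readers unfamiliar with kappa agreement, the sketch below computes Cohen's kappa for two annotators. It assumes the two-annotator Cohen variant, which the slides do not specify, and the label sequences are made-up toy data, not the study's annotations.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labelled at random with
    # their own label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:  # both annotators used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy annotations for six page segments (illustrative only).
annotator_1 = ["math", "math", "other", "other", "math", "other"]
annotator_2 = ["math", "math", "other", "math", "math", "other"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Here the annotators agree on 5 of 6 items (0.833 observed) against 0.5 expected by chance, giving a kappa of about 0.667.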


Evaluation

Average F1 score: 0.36
• Strength: separating math contents from the rest
• Weakness: identifying their exact type

Feature Utility
• Text: competitive baseline
• Image: filters non-math information
• Formatting: identifies section headings etc.
• Hyperlink: separates related concepts and resources from the rest
• Layout: improves precision at the cost of recall
• Context: not effective overall
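As a reminder of how an averaged F1 figure is obtained, here is a minimal sketch of per-label F1 and its macro average. Macro-averaging is an assumption, since the slides do not say how the 0.36 average was computed, and the labels and predictions are invented toy data.

```python
from typing import Dict, Sequence


def f1_per_label(gold: Sequence[str], predicted: Sequence[str]) -> Dict[str, float]:
    """F1 for each label, from true/false positives and false negatives."""
    scores = {}
    for label in sorted(set(gold) | set(predicted)):
        pairs = list(zip(gold, predicted))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        scores[label] = 2 * precision * recall / total if total else 0.0
    return scores


def macro_f1(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Unweighted mean of the per-label F1 scores."""
    scores = f1_per_label(gold, predicted)
    return sum(scores.values()) / len(scores)


# Toy segment labels (illustrative only).
gold = ["definition", "proof", "proof", "tutorial", "other", "other"]
predicted = ["definition", "definition", "proof", "other", "other", "other"]
per_label = f1_per_label(gold, predicted)
average = macro_f1(gold, predicted)
print(per_label)
print(round(average, 3))
```

A label that is never predicted correctly (here "tutorial") scores 0 and pulls the macro average down, which is one way skewed training data can depress an averaged F1.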


Potential Sources of Error

• Training Data
  – Insufficient examples
  – Skewed distributions
• Segmentation
  – Over- or under-segmented


Outline

• Introduction
• Literature Review
• User Study
• Prototype Implementation
• Future Work


Future Work

Iterative Development Process
• Resource Categorization
• Prototype fielding

Text-to-Expression Linking
• Resolve text keywords to expressions
  – Reduce the need for expression input
  – Help to solve the notational variation problem
  – Fit well with the rest of the desiderata

Extension to Medical Domain
• NUH evidence Project

“Pythagorean Theorem” : “a² + b² = c²” & “x² + y² = z²”
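The Pythagorean example can be read as a lookup from a text keyword to its known notational variants. The sketch below assumes a hand-built table and simple substring matching; both are placeholders for illustration, since the talk proposes the linking as future work without describing a mechanism.

```python
from typing import Dict, List

# Hypothetical hand-built link table: keyword -> known notational variants.
KEYWORD_TO_EXPRESSIONS: Dict[str, List[str]] = {
    "pythagorean theorem": ["a^2+b^2=c^2", "x^2+y^2=z^2"],
}


def link_keywords(query: str) -> List[str]:
    """Return every expression linked to a keyword found in the query."""
    normalized = " ".join(query.lower().split())  # lowercase, collapse spaces
    return [
        expression
        for keyword, expressions in KEYWORD_TO_EXPRESSIONS.items()
        if keyword in normalized
        for expression in expressions
    ]


linked = link_keywords("Proof of the  Pythagorean Theorem")
print(linked)
```

Because one keyword maps to several variants, the user types plain text and the system can still search for all notations of the same formula.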


Conclusion

To create a user-centric and math-aware digital library on math materials

Two Desiderata:
• Multi-Collection Search
• Resource Categorization

Prototype classification performance: 0.36 F1

Future: Text-to-Expression Linking

Thank you for listening
Questions?