Zhao Jin. Math Information Retrieval
Examples:
– Looking for formulas
– Collecting teaching resources
– Keeping updated on research developments
Generic search engines are ineffective in such situations
– Unaware of user needs and math expressions
Why Math Information Retrieval?
23/4/20 2
Linked Expression: a²+b²=c²
Filter by: resource category, experience, specificity
Tutorials, Slides, Problem Solution Set, Applets, Tools, Data, Algorithms
Proof
Definition…
Tutorial
Proof
Goal: a user-centric and math-aware DL
• Introduction
• Literature Review
  • Domain-specific Information Seeking Studies
  • Current Math Resources
  • Math Information Retrieval
• User Study
• Prototype Implementation
• Future Work
Outline
• Domain-Specific Information Seeking Studies
  • Shift from monographs to online resources
  • Time-saving or relevant
  • Usability and accessibility problems
• Current Math Resources Online
  • Isolated
  • Different degrees of math awareness
Literature Review
Key requirements: Usefulness, Usability, and Accessibility
Math Web Search, Wolfram Function Site
1. Hampered accessibility
2. Limited search capability; usefulness hard to judge
Current Math Information Retrieval Research
• Expression Matching
  • Text-based approaches
    – Notational Variation Problem: “a²+b²=c²” ≠ “x²+y²=z²”
  • Non-text-based approach
• Query language
  • Expressiveness vs. User-Friendliness
Literature Review
[Figure: query languages compared by Expressiveness and User-Friendliness. Labels: LaTeX, MathML; Text Keyword; GUI, Natural Language, ASCII]
Assume an expression input from the user
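
The notational variation problem noted on this slide can be illustrated with a minimal sketch. It is not the method of any system cited in the talk: the function name `canonicalize` and the plain-text `^` notation are assumptions made for illustration. The idea is to rename variables in order of first appearance, so that two expressions differing only in variable names compare equal as strings.

```python
import re


def canonicalize(expr: str) -> str:
    """Rename single-letter variables to v1, v2, ... in order of first appearance.

    Hypothetical helper: it treats every ASCII letter as a variable, so
    function names such as "sin" would be mangled, and it does not handle
    reordered terms (a^2+b^2 vs. b^2+a^2).
    """
    mapping = {}

    def rename(match):
        name = match.group(0)
        if name not in mapping:
            mapping[name] = f"v{len(mapping) + 1}"
        return mapping[name]

    # Remove spaces first so spacing differences do not affect the comparison.
    return re.sub(r"[A-Za-z]", rename, expr.replace(" ", ""))


# Plain string comparison fails, comparison after canonicalization succeeds.
print("a^2+b^2=c^2" == "x^2+y^2=z^2")
print(canonicalize("a^2+b^2=c^2") == canonicalize("x^2+y^2=z^2"))
```

A real system would more likely work on a parsed form such as MathML than on raw strings; the sketch only shows why exact text matching is too strict.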
Whether the information needs of users are satisfied by such resources
• What do users really need?
• How do they perform information seeking?
• What difficulties do they encounter?
Whether the current research focus is appropriate
• Do they really need/prefer expression search?
Unanswered Issues
Further Study Needed!
• Introduction
• Literature Review
• User Study
  • Findings
  • Desiderata in Math Information Retrieval
• Prototype Implementation
• Conclusion
Outline
Three Approaches
• Keyword Search / Browsing / Personal Contacts
Trade-off between cost and benefit
Expression Search
• Attractive but utility unknown
Keyword search still popular and preferred
Multi-faceted user needs
• Information-oriented / Format-oriented
• Specificity and Experience for filtering
• Domain and Intent as context
Each needs to be catered for specifically
Findings from the User Study
Multi-collection Search
• Search through multiple collections on behalf of the user
Enhances the usability and accessibility of collections
Resource Categorization
• Automatically classify materials according to the different facets of user needs
Returns the results that best suit the user's needs
Desiderata in Math Retrieval
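
As a rough illustration of multi-collection search, the sketch below merges ranked result lists from several collections into one list. The round-robin interleaving, the function name `merge_round_robin`, and the placeholder result identifiers are all assumptions; the slides do not say how results are to be combined.

```python
from itertools import zip_longest


def merge_round_robin(result_lists):
    """Interleave ranked result lists, keeping the first copy of each item.

    Each inner list is assumed to be ordered best-first by its own
    collection; None cannot be used as a result identifier because it
    serves as the padding value for shorter lists.
    """
    merged, seen = [], set()
    # zip_longest yields one "tier" per rank position, padding short lists with None.
    for tier in zip_longest(*result_lists):
        for item in tier:
            if item is not None and item not in seen:
                seen.add(item)
                merged.append(item)
    return merged


# Placeholder identifiers standing in for results from two collections.
collection_a = ["page-1", "page-2"]
collection_b = ["page-2", "page-3", "page-4"]
print(merge_round_robin([collection_a, collection_b]))
```

Round-robin merging needs no comparable relevance scores across collections, which is why it is a common starting point when the underlying engines are black boxes.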
• Introduction
• Literature Review
• User Study
• Prototype Implementation
  – Focus on Resource Categorization
• Future Work
Outline
Multi-collection Search
• Meta-search
• Offline indexing based on an open-source package
• The easier of the two requirements to meet
Resource Categorization
• Domain-specific text categorization on webpages
• More interesting as a research topic
Prototype Implementation
Focus of the prototype is on Resource Categorization
An entire page is not a suitable unit for categorization
• Vision-based Page Segmentation (VIPS) used
Webpage Segmentation
[Figure labels: Definition, Variation, Toolbar]
Supervised Machine Learning Pipeline
• Labels
  • Math-related / non-math-related
• Features
  • Word, Image, Formatting, Hyperlink, Layout, Context
• Machine Learner
  • SVM
• Training/Testing Data
  • Small corpus of webpages for 5 math topics
  • Manually annotated
  • Kappa agreement: 0.87
Resource Categorization
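
The kappa agreement reported for the manual annotation can be made concrete with a short sketch. It assumes Cohen's kappa for two annotators, which the slides do not state explicitly; the function name and the toy labels are illustrative only and are not the study's data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: (observed - expected agreement) / (1 - expected agreement)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Toy annotations for four page segments (not the study's data).
annotator_1 = ["math", "math", "other", "other"]
annotator_2 = ["math", "math", "other", "math"]
print(cohen_kappa(annotator_1, annotator_2))  # 0.5
```

Here the annotators agree on 3 of 4 segments (0.75) while chance agreement is 0.5, giving a kappa of 0.5; the reported 0.87 would indicate much stronger agreement.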
Average F1: 0.36
• Strength: separating math content from the rest
• Weakness: identifying its exact type
Feature Utility
• Text: competitive baseline
• Image: filters non-math information
• Formatting: identifies section headings etc.
• Hyperlink: separates related concepts and resources from the rest
• Layout: improves precision at the cost of recall
• Context: not effective overall
Evaluation
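
For reference, the F1 measure used in the evaluation can be computed per category and then averaged. The sketch below assumes an unweighted (macro) average over categories, which the slides do not confirm; the category names and labels are toy values, not the reported experiment.

```python
def f1_per_class(gold, predicted):
    """Return {label: F1} for every label in the gold or predicted lists."""
    scores = {}
    for label in set(gold) | set(predicted):
        pairs = list(zip(gold, predicted))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        # F1 is the harmonic mean of precision and recall.
        scores[label] = (2 * precision * recall / (precision + recall)
                         if precision + recall else 0.0)
    return scores


def macro_f1(gold, predicted):
    """Unweighted mean of the per-class F1 scores."""
    scores = f1_per_class(gold, predicted)
    return sum(scores.values()) / len(scores)


# Toy segment labels (illustrative only).
gold = ["definition", "proof", "proof", "toolbar"]
pred = ["definition", "definition", "proof", "toolbar"]
print(f1_per_class(gold, pred))
print(macro_f1(gold, pred))
```

A macro average weights rare categories as heavily as frequent ones, so skewed training distributions can pull the overall score down even when frequent categories are classified well.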
• Training Data
  • Insufficient examples
  • Skewed distributions
• Segmentation
  • Over- or under-segmented
Potential Sources of Error
• Introduction
• Literature Review
• User Study
• Prototype Implementation
• Future Work
Outline
Iterative Development Process
• Resource Categorization
• Prototype fielding
Text-to-Expression Linking
• Resolve text keywords to expressions
  • Reduce the need for expression input
  • Help to solve the notational variation problem
  • Fit well with the rest of the desiderata
Extension to Medical Domain
• NUH evidence Project
Future Work
“Pythagorean Theorem” : “a²+b²=c²” & “x²+y²=z²”
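
The Pythagorean Theorem example suggests the simplest possible form of text-to-expression linking: a lookup from a text keyword to its known notational variants. The sketch below is only that. The hand-built table, the name `link_expressions`, and the `^` notation are assumptions, and how such links would actually be learned or stored is left open in the talk.

```python
# Hypothetical hand-built table; a real system would have to derive these links.
KEYWORD_TO_EXPRESSIONS = {
    "pythagorean theorem": ["a^2+b^2=c^2", "x^2+y^2=z^2"],
}


def link_expressions(query, table=KEYWORD_TO_EXPRESSIONS):
    """Return the expressions linked to a text keyword, or [] if none are known."""
    # Normalize case and whitespace so "Pythagorean  Theorem" still matches.
    normalized = " ".join(query.lower().split())
    return list(table.get(normalized, []))


print(link_expressions("Pythagorean Theorem"))
```

With such a link in place the user types only a keyword, and both notational variants can be searched on the user's behalf.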
To create a user-centric and math-aware digital library of math materials
Two Desiderata:
• Multi-Collection Search, Resource Categorization
Prototype classification performance: 0.36 F1
Future: Text-to-Expression Linking
Thank you for listening. Questions?
Conclusion