44
Logical Structure Analysis of Scientific Publications in Mathematics Valery Solovyev, Nikita Zhiltsov Kazan (Volga Region) Federal University, Russia 1 / 44

Logical Structure Analysis of Scientific Publications in Mathematics

Embed Size (px)

DESCRIPTION

A talk at WIMS'11.

Citation preview

Page 1: Logical Structure Analysis of Scientific Publications in Mathematics

Logical Structure Analysis ofScientific Publications in Mathematics

Valery Solovyev, Nikita Zhiltsov

Kazan (Volga Region) Federal University, Russia

1 / 44

Page 2: Logical Structure Analysis of Scientific Publications in Mathematics

Overview

É LOD Cloud has been growing at 200-300%per year since 2007∗

É Prevalent domains: government (43%),geographic (22%) and life sciences (9%)

É However, it lacks data sets related toacademic mathematics

∗C.Bizer et al. State of the Web of Data.LDOW WWW’11

2 / 44

Page 3: Logical Structure Analysis of Scientific Publications in Mathematics

1 Background

2 Proposed Semantic Model

3 Analysis Methods

4 Experiments and Evaluation

5 Prototype

3 / 44

Page 4: Logical Structure Analysis of Scientific Publications in Mathematics

Mathematical Scholarly PapersEssential features

É Well-structured documentsÉ The presence of mathematical formulaeÉ Peculiar vocabulary (“mathematical

vernacular”)

4 / 44

Page 5: Logical Structure Analysis of Scientific Publications in Mathematics

Research Objectives

Current studyÉ Specification of the document logical structureÉ Methods for extracting structural elements

Long-term goalsÉ A large corpus of semantically annotated papersÉ Semantic search of mathematical papers

5 / 44

Page 6: Logical Structure Analysis of Scientific Publications in Mathematics

Modelling the Structure of ScientificPublicationsABCDE format

É LaTeX-based format to represent the narrativestructure of proceedings and workshop contributions

É Sections:� Annotations (Dublin Core metadata)� Background (e.g. description of research positioning)� Contribution (description of the presented work)� Discussion (e.g. comparison with other work)� Entities (citations)

6 / 44

Page 7: Logical Structure Analysis of Scientific Publications in Mathematics

Modelling the Structure of ScientificPublicationsSALT

É LaTeX-based authoring tool for generatingsemantically annotated PDF documents

É Three ontologies:� SALT Document Ontology� SALT Annotation Ontology� SALT Rhetorical Ontology

7 / 44

Page 8: Logical Structure Analysis of Scientific Publications in Mathematics

SALT Layers

8 / 44

Page 9: Logical Structure Analysis of Scientific Publications in Mathematics

Mathematical KnowledgeRepresentation

É Languages for formalized mathematics� Mizar� Coq� Isabelle

É Semiformal math languages� HELM ontology� MathLang� OMDoc format (+ OMDoc ontology, sTeX)

É Presentation/authoring formats� PDF� LATEX

9 / 44

Page 10: Logical Structure Analysis of Scientific Publications in Mathematics

Mathematical KnowledgeRepresentation

10 / 44

Page 11: Logical Structure Analysis of Scientific Publications in Mathematics

Trade-off Candidates

É arXMLiv format� XHTML+MathML� Marked up theorem-like elements, sections,

equations� Automatic conversion for LaTeX documents with

styles of available bindings (LaTeXML)� 60% of arXiv.org were converted into the format

É Present work� Follow the slides ⇒

11 / 44

Page 12: Logical Structure Analysis of Scientific Publications in Mathematics

1 Background

2 Proposed Semantic Model

3 Analysis Methods

4 Experiments and Evaluation

5 Prototype

12 / 44

Page 13: Logical Structure Analysis of Scientific Publications in Mathematics

Mathematical KnowledgeRepresentation

13 / 44

Page 14: Logical Structure Analysis of Scientific Publications in Mathematics

Proposed Semantic Model

É It is an ontology that captures the structural layoutof mathematical scholarly papers (as in the LaTeXmarkup)

É The segment represents the finest level ofgranularity and has the properties:

� starting and ending positions� the text or math contents� functional role

É Select most frequent segments from samplecollections of genuine papers

É Consider synonyms as one concept (e.g. conjectureand hypothesis)

14 / 44

Page 15: Logical Structure Analysis of Scientific Publications in Mathematics

Proposed Semantic Model (cont.)

É Select basic semantic relations between segmentsfrom the prior-art models

É Integration with SALT Document Ontology classes:� Publication� Section� Figure� Table

15 / 44

Page 16: Logical Structure Analysis of Scientific Publications in Mathematics

Ontology Elementshttp://cll.niimm.ksu.ru/ontologies/mocassin#

16 / 44

Page 17: Logical Structure Analysis of Scientific Publications in Mathematics

1 Background

2 Proposed Semantic Model

3 Analysis Methods

4 Experiments and Evaluation

5 Prototype

17 / 44

Page 18: Logical Structure Analysis of Scientific Publications in Mathematics

Logical Structure Analysis

É The ontology specifies a controlled vocabulary tosemantic analysis

É Two analysis tasks:� recognizing the types of document segments� recognizing the semantic relations between them

18 / 44

Page 19: Logical Structure Analysis of Scientific Publications in Mathematics

Example

19 / 44

Page 20: Logical Structure Analysis of Scientific Publications in Mathematics

Example (cont.)

20 / 44

Page 21: Logical Structure Analysis of Scientific Publications in Mathematics

Example (cont.)

21 / 44

Page 22: Logical Structure Analysis of Scientific Publications in Mathematics

Recognizing the Types of DocumentSegments

We exploit the LaTeX markup extensively

1 Elicit a LaTeX environment2 Associate it with a string that may beeither the environment name

or the environment title (if available)

3 Filter out standard formatting environments (e.g.center, align, itemize)

4 Compute string similarity between a string andcanonical names of ontology concepts

5 Check if the found most similar concept isappropriate using a predefined threshold

22 / 44

Page 23: Logical Structure Analysis of Scientific Publications in Mathematics

Recognizing Navigational Relations

The dependsOn and refersTo relations are navigational

AssumptionNavigational relations are induced by referentialsentences

ExamplesÉ “By applying Lemma 1, we obtain ...” (dependsOn)É “Theorem 2 provides an explicit algorithm ...”

(refersTo)

23 / 44

Page 24: Logical Structure Analysis of Scientific Publications in Mathematics

Recognizing Navigational RelationsSupervised method

1 Given a segment S; split its text into sentences,tokenize and do POS tagging

2 Referential sentences are ones that contain the \refcommand entries

3 For each sentence:� find mentioned segments; each of them makes a pair

with S (type feature)� for each pair, compute relative positions of segments

normalized by the document size (distance feature)� build a boolean vector for its verbs (verb feature)

24 / 44

Page 25: Logical Structure Analysis of Scientific Publications in Mathematics

Recognizing Navigational Relations(cont.)Supervised method

Example training instance

t1 t2 d1 d2 add ... apply ... relation

proof lemma 0.09 0.27 0 ... 1 ... dependsOn

É Train a learning model using these features and alabeled example set

É Apply the model to classify new induced relations

25 / 44

Page 26: Logical Structure Analysis of Scientific Publications in Mathematics

Recognizing Restricted Relations

The hasConsequence, exemplifies and proves relationsare restricted

AssumptionRestricted relations occur between consecutivesegments

26 / 44

Page 27: Logical Structure Analysis of Scientific Publications in Mathematics

Recognizing Restricted Relations (cont.)Baseline method

According to the ontology, restricted relations involveinstances of three types, separately: Corollary, Exampleand Proof

1 Seek a segment of one of these types

2 Find its segments-predecessors

3 Filter out segments of inappropriate types

4 Return the closest predecessor

27 / 44

Page 28: Logical Structure Analysis of Scientific Publications in Mathematics

1 Background

2 Proposed Semantic Model

3 Analysis Methods

4 Experiments and Evaluation

5 Prototype

28 / 44

Page 29: Logical Structure Analysis of Scientific Publications in Mathematics

Experimental SetupCollectionsÉ 1355 papers of the “Izvestiya Vysshikh Uchebnykh

Zavedenii. Matematika” journalÉ A sample of 1031 papers from arXiv.org

ImplementationAn open source Java library built upon:É LaTeX-to-XML convertersÉ GATE frameworkÉ WekaÉ Jena

See http://code.google.com/p/mocassin

29 / 44

Page 30: Logical Structure Analysis of Scientific Publications in Mathematics

Segment Recognition EvaluationÉ Evaluation on the arXiv sample onlyÉ Q-gram string matching algorithm was usedÉ The threshold value was optimized w.r.t. F1-score

Type # of F1-scoretrue instances

Axiom 5 1.000Claim 114 0.987Conjecture 152 0.987Corollary 1715 0.995Definition 1838 1.000Example 771 0.999Lemma 4061 0.998Proof 4943 0.997Proposition 3052 0.999Remark 2114 1.000Theorem 4670 0.991other 671 0.892

30 / 44

Page 31: Logical Structure Analysis of Scientific Publications in Mathematics

Ontology Coverage Evaluation

É Evaluation on the both entire collections (“Izvestiya”and arXiv)

É Equations are most ubiquitous segments (52% and69%, respectively)

É The ontology covers types of 91.9% and 91.6% ofsegments (with SALT Section class – 99.5% and99.6%)

31 / 44

Page 32: Logical Structure Analysis of Scientific Publications in Mathematics

Distribution of Segment Types

0%

5%

10%

15%

20%

25%

30%T

heo

rem

Pro

of

Lem

ma

Rem

ark

Coro

llar

y

Def

init

ion

Pro

posi

tion

Ex

ample

oth

ers

Cla

im

Co

nje

cture

Per

centa

ge

of

seg

men

t o

ccu

rren

ces

Izvestiya

arXiv

32 / 44

Page 33: Logical Structure Analysis of Scientific Publications in Mathematics

Evaluation of Navigational RelationRecognition

É A paper contains 51.4 (Izvestiya) and 53.9 (arXiv)referential sentences on the average

É 243 referential sentences were randomly selectedand manually annotated

É 95% were true navigational relationsÉ A decision tree learner (C4.5) was trainedÉ The results were from 10-fold cross validation

Features Accuracy F1-score F1-scorerefersTo dependsOn

type 0.663 0.566 0.752type+distance 0.658 0.663 0.704type+verb 0.704 0.653 0.770type + distance + verb 0.741 0.744 0.772

33 / 44

Page 34: Logical Structure Analysis of Scientific Publications in Mathematics

A Cloud of Frequent Verbs

34 / 44

Page 35: Logical Structure Analysis of Scientific Publications in Mathematics

Evaluation of Restricted RelationRecognition

É Evaluation on the arXiv sample onlyÉ 10% of the documents which contain certain

segments were randomly selectedÉ For each such a segment, corresponding relations

were annotated manuallyÉ Known issues: imported corollaries and examples

for arbitrary text fragments

Relation # of instances F1-scorehasConsequence 178 0.687exemplifies 62 0.613proves 216 0.954

35 / 44

Page 36: Logical Structure Analysis of Scientific Publications in Mathematics

Conclusion on Evaluation

É The ontology covers the largest part of the logicalstructure and appears to be feasible for automaticextraction methods

É The task of segment type recognition has beenaccomplished

É The method for recognizing navigational relationsestablishes ground truth, however, a large-scaleevaluation and learning model selection are required

É The baseline method for recognizing restrictedrelations must be improved by leveraging additionalinformation (discussed in the paper!)

36 / 44

Page 37: Logical Structure Analysis of Scientific Publications in Mathematics

1 Background

2 Proposed Semantic Model

3 Analysis Methods

4 Experiments and Evaluation

5 Prototype

37 / 44

Page 38: Logical Structure Analysis of Scientific Publications in Mathematics

Prototype

A prototype:É demonstrates our ongoing research on

semantic search of mathematical papersÉ incorporates the logical structure analysis

methodsÉ is integrated with arXiv APIÉ enables enhanced search for arXiv papers

and visualization of their logical structureÉ publishes the semantic index as Linked

Data via SPARQL endpoint

38 / 44

Page 39: Logical Structure Analysis of Scientific Publications in Mathematics

Search Interfacehttp://cll.niimm.ksu.ru/mocassin

39 / 44

Page 40: Logical Structure Analysis of Scientific Publications in Mathematics

Formulating a Queryhttp://cll.niimm.ksu.ru/mocassin

40 / 44

Page 41: Logical Structure Analysis of Scientific Publications in Mathematics

Search Resultshttp://cll.niimm.ksu.ru/mocassin

41 / 44

Page 42: Logical Structure Analysis of Scientific Publications in Mathematics

Preview a Search Resulthttp://cll.niimm.ksu.ru/mocassin

42 / 44

Page 43: Logical Structure Analysis of Scientific Publications in Mathematics

Summary

É The proposed approach aims to analyze thestructure of mathematical scholarly papersin an automatic way

É Our ontology provides a controlledvocabulary for analysis

É The methods elicit document segments interms of the ontology

É The extracted semantic graph can be usedfor:

� discovering important document parts� semantic search of theoretical results

43 / 44

Page 44: Logical Structure Analysis of Scientific Publications in Mathematics

Thanks for your attention!Questions?

44 / 44