20170916 CREST Symposium

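Analysis of Scientific and Technical Papers and Knowledge Acquisition

Yuji Matsumoto

Nara Institute of Science and Technology (NAIST)

• CREST Big Data Application

"Knowledge Acquisition through Structural Document Analysis"

December 25, 2017, Big Data Application (2015.10-2021.3)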

Objective of the Project

Help researchers by analyzing scholarly documents (scientific papers)

For finding similar/dissimilar research papers

For grasping contents of papers

For construction/completion of domain Knowledge Bases

For visualizing research trends

Development of tools and environment for scientific paper analysis and presentation

Search: Aspect-based similar/dissimilar paper search

Visualization: Citation relation / research trend

Extraction: Concept / Relation / Knowledge Base completion

Analysis:

Document Analysis: PDF, Tables, Graphs, Math formulas

Text Analysis: natural language analysis

Annotation tools

Research Items for Scientific Document Analysis

Project Overview

Text/Document Analysis (T1, T5)

Knowledge Base Fertilization (T2, T3, T4)

Structural Similarity Analysis (T6, T7)

User Interface / Survey Generation (T8, T9)

Inner-Document Analysis

• Text/Document Analysis
• Concept/Relation Analysis
• Predicate-argument structure analysis
• Event Chain Analysis
• Document Structure Analysis
• Document Summarization

Inter-Document Analysis

• Citation Analysis
• Document Relation
• Document Similarity
• Multi-Document Summarization
• Survey Generation

User Interface / Document Visualization

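Research Group Organization

G0: NAIST: Matsumoto, Shimbo, Shindo, Noji, Nakamura, … Core technologies for text analysis and document structure analysis; collaboration with researchers in specialized fields: completion of domain databases

G1: NII: Ken Satoh; JAIST: Minh Le Nguyen. Retrieval of similar legal precedents

G2: Tohoku University: Inui, Inoue. Analysis of argumentation structure based on abductive reasoning

G3: NII: Aizawa, Abekawa, Miyao; Hiroshima City University: Nanba. Paper structure analysis that enables cross-lingual and cross-domain knowledge acquisition

G4: University of Tokyo: Tsuruoka. Development of core technologies for deep semantic understanding of papers

G5: University of Tokyo: Mori. Extraction of latent related knowledge based on large-scale citation networks and structural relations among document texts

G6: Shizuoka University: Kano. Text mining of neuroscience papers and its applications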

Aspect-based similarity learning for paper retrieval


Visualization of similar papers

User can control density of graphs by setting similarity threshold
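The threshold control described above can be illustrated with a minimal sketch. It assumes a precomputed symmetric similarity matrix between papers; the function name `similarity_graph` and the toy scores are hypothetical and not taken from the project's tool.

```python
def similarity_graph(similarity, threshold):
    """Return edges (i, j, score) for paper pairs whose similarity >= threshold.

    `similarity` is assumed to be a symmetric square matrix (list of lists).
    """
    edges = []
    n = len(similarity)
    for i in range(n):
        for j in range(i + 1, n):  # upper triangle only: each pair once
            if similarity[i][j] >= threshold:
                edges.append((i, j, similarity[i][j]))
    return edges


# Toy similarity scores for three papers (illustrative values only).
toy_similarity = [
    [1.0, 0.8, 0.3],
    [0.8, 1.0, 0.5],
    [0.3, 0.5, 1.0],
]

dense = similarity_graph(toy_similarity, threshold=0.4)   # more edges
sparse = similarity_graph(toy_similarity, threshold=0.7)  # fewer edges
```

Raising the threshold removes weaker links, so the displayed graph becomes sparser.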

Relation Extraction (Knowledge Acquisition)

Knowledge Base Completion

KNApSAcK Database (NAIST): Database of Metabolite-Plant Species Relationship

KEGG Pathway (© Kyoto University): collection of manually drawn pathway maps representing knowledge on the molecular interaction, reaction and relation networks (substrate-enzyme-product)

PolyInfo (National Institute for Materials Science): Information on polymers extracted from scientific papers

KNApSAcK Database

Database of Metabolite-Plant Species Relationship

Metabolite Species

Knowledge Base Completion by distant supervision

Existing Knowledge Base (KNApSAcK DB)

Automatic Annotation of Terms and Relations

Metabolite Species Reference

Classifier / Sequence Labeler

Training data

Semi-automatic construction of training data
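The distant supervision idea can be sketched as follows: sentences that mention a metabolite and a species are labeled automatically by looking the pair up in an existing knowledge base. This is a minimal sketch under stated assumptions. The toy knowledge base, the substring matching, and the function name `label_sentences` are hypothetical and do not reflect the KNApSAcK data format or the project's actual pipeline.

```python
# Toy knowledge base of (metabolite, species) pairs; illustrative only.
TOY_KB = {
    ("caffeine", "Coffea arabica"),
    ("nicotine", "Nicotiana tabacum"),
}


def label_sentences(sentences, kb):
    """Create (sentence, metabolite, species, label) training examples.

    label = 1 if the co-occurring pair is in the knowledge base, else 0.
    Term matching is a naive case-insensitive substring test (an assumption).
    """
    metabolites = sorted({m for m, _ in kb})
    species = sorted({s for _, s in kb})
    examples = []
    for sentence in sentences:
        lowered = sentence.lower()
        found_metabolites = [m for m in metabolites if m.lower() in lowered]
        found_species = [s for s in species if s.lower() in lowered]
        for m in found_metabolites:
            for s in found_species:
                examples.append((sentence, m, s, 1 if (m, s) in kb else 0))
    return examples


toy_sentences = [
    "Caffeine was isolated from Coffea arabica leaves.",
    "Nicotine levels were compared between Coffea arabica and Nicotiana tabacum.",
]
training_data = label_sentences(toy_sentences, TOY_KB)
```

The labels produced this way are noisy, which is presumably why the slide describes the construction as semi-automatic. The resulting examples would then be used to train the classifier or sequence labeler.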

Annotation of KEGG relations

Target relations

gene – enzyme, enzyme – substrate,

enzyme – product

KEGG Pathway Map Generation

KEGG: Kyoto Encyclopedia of Genes and Genomes

Gene products (KEGG Ortholog)

Chemical compounds

• Relation (protein network)

• Reaction (chemical network)

Document Analysis

PDF analysis: XML conversion, Table / Graph / Math formula analysis

Citation analysis / Document similarity

Text (Language) Analysis

Base NLP analysis tools: POS tagging, parsing, NE recognition, Relation analysis, Predicate-argument structure analysis

Complex sentence structure analysis

Interface

Document retrieval / Visualization

Annotation tools

Document and Text Analysis

Analysis of Layout, Logical, and Semantic Structures of Scientific Documents

Analysis of paper structure and support for information access in the field of computational linguistics

Different layers of structure information are indistinguishably embedded in real documents

Recommendation and summarization

Layout structure analysis: PDF, (X)HTML

Logical structure analysis: XML

Semantic structure analysis: Annotated text

Linking to external knowledge base: Knowledge Representation

Difficulty of PDF layout analysis

Gap between tag types and their semantic roles

Reasoning implicit relationships

AI-augmented research navigator

Reasoning Knowledge-resources Agents

Figure: Real documents in printable format are analyzed at the levels Document, Section, Paragraph, Sentence, Dependency, and Word, and reorganized by content restructuring into a knowledge structure.

Figure labels: Equation (1); Figure 1. / (fig.1); in-document citation; formula description; references; encyclopedic knowledge; Binomial theorem (Wikipedia)

Sample reference in figure: Bag, Amulya Kumar (1966). "Binomial theorem in ancient India". Indian J. History Sci 1 (1): 68–74.

Sample text in figure: Tensor product identities in two variables are quite common in mathematics: Exponential, logarithmic, trigonometric, …

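Members: Akiko Aizawa, Yusuke Miyao, Takeshi Abekawa (NII), Hidetsugu Nanba (Hiroshima City University), Masaharu Yoshioka (Hokkaido University)

Layout and logical structure analysis

Semantic structure analysis

Document restructuring

Define the context-dependent and context-independent semantic classes and relations that typically appear in information science papers, and build a corpus

Span annotation

PDF, XML, plain text

location-aware XHTML

Language-independent recognition of figure and table regions

Extraction of mathematical formulas

Extraction of logical relations between blocks

Extraction of reference information and analysis of citation contexts

Paper browsing system

Reorganization of knowledge

Search and summarization based on semantic analysis

Figure

Graph

Formula

Method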

Improving entailment recognition in legal texts using corpus generation

Truong-son Nguyen (JAIST)

Purpose: propose two methods for generating training examples that enlarge the COLIEE corpus and cover linguistic phenomena relevant to the entailment task, including negation and sub-sentences

Data generation result: 4876 new training examples generated from the Japanese Civil Code (2160 pairs from method 1 + 2716 pairs from method 2)

e1r1 r2 e2

(1) utilizes the syntactic parse tree (using Stanford parser)

Input: t = <a sentence in legal text>

Output: {(t, h1, y1), (t, h2, y2), …}, where hi is a sub-sentence or the negation of a sub-sentence in t

(2) utilizes the requisite-effectuation (RE) structure (using Japan Civil Code RE corpus)

Input: t = <a sentence in legal text>

Output: {(t, h1, y1), (t, h2, y2), …}, where hi is a sentence constructed from an RE pair
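The input/output shape of method (1) can be illustrated with a minimal sketch. It assumes the sub-sentences have already been extracted from the parse tree (the parser step is not shown), that a sub-sentence is labeled y = 1 and its negation y = 0, and that negation is produced by a naive rule that inserts "not" after the first auxiliary verb. The function names, the auxiliary list, and the label convention are all hypothetical; the slides do not specify them.

```python
import re

# Auxiliary verbs handled by the naive negation rule (an assumption).
AUXILIARIES = ("shall", "may", "must", "is", "are", "can", "will")
AUX_PATTERN = re.compile(r"\b(" + "|".join(AUXILIARIES) + r")\b")


def negate(sentence):
    """Insert 'not' after the first auxiliary verb; return None if none is found."""
    negated, count = AUX_PATTERN.subn(r"\1 not", sentence, count=1)
    return negated if count else None


def generate_examples(t, sub_sentences):
    """Build (t, h, y) triples: y = 1 for a sub-sentence, y = 0 for its negation."""
    examples = []
    for h in sub_sentences:
        examples.append((t, h, 1))
        negated = negate(h)
        if negated is not None:
            examples.append((t, negated, 0))
    return examples


# Toy sentence written for illustration; not quoted from the Civil Code.
t = "The seller shall deliver the goods, and the buyer shall pay the price."
subs = ["The seller shall deliver the goods.", "The buyer shall pay the price."]
examples = generate_examples(t, subs)
```

Each source sentence yields several hypothesis sentences, which is how a small corpus can grow by thousands of pairs.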


Evaluate the generated dataset:

– Compare the performance of classifiers trained on the original dataset and on the new dataset (= original + generated data)

– Train classifiers using two types of deep learning models:

Result: Models trained on the new dataset improve significantly (~8%) over models trained on the original dataset.

The results outperform other systems of COLIEE 2016 and 2017 on the English dataset (by 7-10%)

(1) Sentence encoding-based models (CBOW + BiLSTM)

Articles, Question: words (a1, …, an), (b1, …, bm) → Word representation → An encoding method (CBOW / BiLSTM) → Sentence representation (<vector>, <vector>) → Aggregation (<vector>) → Transformation (MLP) → Prediction (Logistic regression)

(2) Attention-based models (Decomposable + ESIM)

Article, Question → Attention → Comparison and Aggregation (<vector>) → Transformation (MLP) → Prediction (Logistic regression)
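The sentence encoding-based pipeline (1) can be sketched in plain Python. This is an untrained toy with random weights, meant only to show the data flow from words to a prediction. The CBOW encoder is taken to be an average of word vectors; the aggregation features (concatenation, absolute difference, element-wise product), the dimensions, and all names are assumptions that the slides do not specify.

```python
import math
import random

DIM, HIDDEN = 4, 6
rng = random.Random(0)
embeddings = {}  # word -> randomly initialized vector (stands in for learned vectors)


def embed(word):
    """Word representation: look up (or lazily create) a vector for a word."""
    if word not in embeddings:
        embeddings[word] = [rng.gauss(0, 1) for _ in range(DIM)]
    return embeddings[word]


def cbow_encode(tokens):
    """Sentence representation: average of the word vectors (tokens must be non-empty)."""
    vectors = [embed(w) for w in tokens]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Random, untrained parameters of the MLP layer and the logistic regression layer.
W1 = [[rng.gauss(0, 0.1) for _ in range(HIDDEN)] for _ in range(4 * DIM)]
w2 = [rng.gauss(0, 0.1) for _ in range(HIDDEN)]


def predict(article_tokens, question_tokens):
    """Return a probability that the article entails the question."""
    a = cbow_encode(article_tokens)
    b = cbow_encode(question_tokens)
    # Aggregation: combine the two sentence vectors into one feature vector.
    features = (a + b
                + [abs(x - y) for x, y in zip(a, b)]
                + [x * y for x, y in zip(a, b)])
    # Transformation: one hidden layer with tanh activation.
    hidden = [math.tanh(sum(f * W1[i][j] for i, f in enumerate(features)))
              for j in range(HIDDEN)]
    # Prediction: logistic regression on the hidden layer.
    logit = sum(h * w for h, w in zip(hidden, w2))
    return 1.0 / (1.0 + math.exp(-logit))


probability = predict("the seller shall deliver the goods".split(),
                      "the seller shall not deliver the goods".split())
```

Replacing `cbow_encode` with a BiLSTM encoder would leave the aggregation, transformation, and prediction steps unchanged.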

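Trend detection in paper citation networks using network representation learning

Citation network

Distributed representations of scholarly documents in the citation network

Distributed representation

Representation learning on the citation network

Using distributed representations:
• Visualization of academic fields
• Trend extraction
• Citation prediction

Learn distributed representations from the network structure, based on random walks between nodes and node proximity

Apply the distributed representations of the network to visualization of academic fields, trend extraction, citation prediction, and similar tasks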
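The random-walk step can be sketched as follows. The slides do not name a specific algorithm, so this assumes uniform random walks in the style of DeepWalk; the toy citation graph and the function name `random_walks` are hypothetical. The walks would then be fed to a skip-gram style model to learn node vectors, which is not shown here.

```python
import random


def random_walks(graph, walk_length, walks_per_node, seed=0):
    """Sample uniform random walks over a directed graph given as {node: [neighbors]}."""
    rng = random.Random(seed)
    walks = []
    for start in sorted(graph):
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                neighbors = graph.get(walk[-1], [])
                if not neighbors:  # dead end: the paper cites nothing in the graph
                    break
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks


# Toy citation graph: an edge A -> B means paper A cites paper B.
toy_citations = {"A": ["B", "C"], "B": ["C"], "C": [], "D": ["A"]}
walks = random_walks(toy_citations, walk_length=4, walks_per_node=2)
```

Papers that often appear close together in such walks end up with similar vectors, which is what makes the representation useful for visualization and citation prediction.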

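Proposed method
• Identify the direction of network growth in the 512-dimensional latent space obtained by network representation learning
• Quantify the degree to which each paper lies in the cutting-edge region (the leading edge of that direction)

Direction of growth

Cutting-edge region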
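One way to make the two steps concrete is sketched below. The slides do not say how the growth direction is identified, so this assumes the simplest option: the direction is the unit vector from the mean embedding of older papers to the mean embedding of newer papers, and the cutting-edge score of a paper is its projection onto that direction. The function names and the 2-dimensional toy vectors (standing in for the 512-dimensional space) are hypothetical.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def growth_direction(old_vectors, new_vectors):
    """Unit vector pointing from the centroid of old papers to that of new papers."""
    old_mean = mean_vector(old_vectors)
    new_mean = mean_vector(new_vectors)
    diff = [n - o for n, o in zip(new_mean, old_mean)]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]  # assumes the two centroids differ


def cutting_edge_score(vector, direction, origin):
    """Projection of (vector - origin) onto the growth direction."""
    return sum((v - o) * d for v, o, d in zip(vector, origin, direction))


# Toy 2-dimensional embeddings (illustrative values only).
old_papers = [[0.0, 0.0], [0.0, 1.0]]
new_papers = [[2.0, 0.0], [2.0, 1.0]]
direction = growth_direction(old_papers, new_papers)
origin = mean_vector(old_papers)
score_far = cutting_edge_score([3.0, 0.5], direction, origin)
score_near = cutting_edge_score([0.0, 0.5], direction, origin)
```

A paper that lies further along the growth direction receives a higher score, which matches the idea of measuring how close a paper is to the leading edge.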

Kimitaka Asatani, School of Engineering, The University of Tokyo


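Improved citation prediction accuracy: confirmed that, for example, papers in the cutting-edge region of the nanocarbon field have a high probability of gaining citations in the future

Prediction of future frequent keywords: keywords that appear often in papers in the cutting-edge region tend to be used heavily in future papers

The method was evaluated on three tasks: paper visualization, prediction of emerging research, and keyword prediction

Understanding the growth of academic fields: confirmed that the academic field evolves in a specific direction, from gray to green to red

Demonstrated that detecting and quantifying trends is effective for a variety of tasks

Future plan: improve keyword prediction by combining it with natural language processing techniques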