Three Typical Works on Automatic Relation Extraction
Wu Wenjuan
2009.06.04
Outline
DIPRE,1998 KnowItAll, 2005 Open IE, 2007
1 DIPRE: Dual Iterative Pattern Expansion
Sergey Brin,
Extracting Patterns and Relations from the World Wide Web,
In: Proceedings of the International Workshop on the Web and Databases, 1998.
1 DIPRE: Dual Iterative Pattern Expansion
The first work to use an iterative method to discover patterns and relations between data entities; it successfully extracted (author, title) pairs for books.
Input: a seed set of 5 (author, title) book pairs
Output: automatically expanded to about 15,000 books
Some of these books were not carried even by Amazon, the largest online bookstore.
1.1 Idea
patterns --extract--> tuples; tuples --discover--> patterns
There is a duality between patterns and relations.
1.2 Algorithm
R (tuple set), initialized with the seed samples
Repeat: Occurrences ← FindOccurrences(R, D); Patterns ← Generate & Filter(Occurrences); new tuples ← Search(Patterns), added to R
Each occurrence is a 7-tuple: (author, title, order, url, prefix, middle, suffix)
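The occurrence-finding step above can be sketched over an in-memory corpus. This is an illustrative reconstruction, not Brin's code: the helper name find_occurrences and the 10-character context windows are assumptions, and the real system ran over millions of crawled pages rather than a Python dict.

```python
# Hedged sketch of DIPRE's FindOccurrences: for each known (author, title)
# pair found in a page, record the 7-tuple occurrence
# (author, title, order, url, prefix, middle, suffix).
def find_occurrences(tuples, corpus):
    occurrences = []
    for author, title in tuples:
        for url, text in corpus.items():
            a, t = text.find(author), text.find(title)
            if a == -1 or t == -1:
                continue
            order = a < t                      # True if author precedes title
            lo, hi = min(a, t), max(a, t)
            first_len = len(author) if order else len(title)
            second_len = len(title) if order else len(author)
            occurrences.append({
                "author": author, "title": title,
                "order": order, "url": url,
                "prefix": text[max(0, lo - 10):lo],   # context before the pair
                "middle": text[lo + first_len:hi],    # text between the two
                "suffix": text[hi + second_len:][:10],  # context after the pair
            })
    return occurrences
```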
Pattern generation
Group the occurrence 7-tuples (author, title, order, url, prefix, middle, suffix) by (order, middle).
For each group of occurrences O1, O2, ..., Ok sharing (order, middle), GenOnePattern produces a pattern p, a 5-tuple (order, urlprefix, prefix, middle, suffix).
If p is specific enough, output p; otherwise discard it.
Matching: the URL must match urlprefix*, and the page content must match *prefix, author, middle, title, suffix*.
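The grouping step can be sketched as follows. This is a simplified reading of the slides, not Brin's implementation: urlprefix, prefix, and suffix are taken as longest common parts across the group, and the min_support check stands in for the real "is p specific enough?" test.

```python
import os
from collections import defaultdict

def common_suffix(strings):
    """Longest common suffix, via commonprefix on reversed strings."""
    return os.path.commonprefix([s[::-1] for s in strings])[::-1]

def generate_patterns(occurrences, min_support=2):
    groups = defaultdict(list)
    for o in occurrences:
        groups[(o["order"], o["middle"])].append(o)
    patterns = []
    for (order, middle), occs in groups.items():
        if len(occs) < min_support:   # crude stand-in for the specificity test
            continue
        patterns.append({
            "order": order,
            "urlprefix": os.path.commonprefix([o["url"] for o in occs]),
            # the pattern's prefix ends right before the pair, so it is the
            # longest common *suffix* of the occurrence prefixes
            "prefix": common_suffix([o["prefix"] for o in occs]),
            "middle": middle,
            "suffix": os.path.commonprefix([o["suffix"] for o in occs]),
        })
    return patterns
```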
1.3 Experiments
Corpus: a repository of 24 million web pages (147 GB)
1.3 Experiments: Initial sample
1.3 Experiments: 3 Patterns in First Iteration
1.3 Experiments: 4047 new pairs in First Iteration
1.3 Experiments: review
iteration       corpus       occurrences   patterns   (author, title) pairs
1st iteration   24 million   199           3          4047
2nd iteration   5 million    3972          105        9369
3rd iteration   156,000      9938          346        15257
1.4 Conclusion
DIPRE is the earliest work on semi-supervised relation learning. It exploits the duality between relations and patterns: starting from a small number of seed samples, it iteratively extracts new patterns and new instances from a Web-scale corpus.
Outline
DIPRE,1998 KnowItAll, 2005 Open IE, 2007
KNOWITALL
Oren Etzioni et al.
University of Washington
Unsupervised Named-Entity Extraction from the Web: An Experimental Study
AAAI 2005
Introduction
Previous work: HMMs, CRFs; small corpora; required seed data.
KNOWITALL: an unsupervised, domain-independent system that extracts information from the Web.
Key challenges: ensuring precision, via a novel generate-and-test architecture; improving recall, via
Pattern Learning (PL), Subclass Extraction (SE), and List Extraction (LE)
1 Flowchart of the main components in KnowItAll
For every predicate, KnowItAll creates extraction rules and discriminators, then trains the discriminators. Example rule: "cities such as " NPList
Information Focus
The only domain-specific input is a set of predicates specifying the domain of interest.
Generic extraction templates
Extraction Rules: generic extraction templates are combined with the predicate's labels to generate domain-specific extraction rules. For Class1 = 'city', the rules are
"cities such as " NPList
"towns such as " NPList
Keywords submitted to the search engine: "cities such as ", "towns such as "
Discriminator
Used to validate whether an extracted item is correct, based on PMI (pointwise mutual information).
Training discriminators: bootstrapping.
The result of training:
a set of discriminators, e.g. the discriminator "<I> is a city"
a learned threshold T = 0.000016
conditional probabilities P(PMI > T | class) = 0.83 and P(PMI > T | ¬class) = 0.08
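A minimal sketch of the PMI test a discriminator performs: the discriminator phrase with the instance substituted is divided by the instance's own hit count, then compared to the learned threshold. The hits() lookup table stands in for search-engine hit counts; the function names are illustrative.

```python
# PMI-based discriminator check, with search-engine hit counts replaced by a
# plain dict. "<I>" in the discriminator phrase marks the instance slot.
def pmi(instance, discriminator_phrase, hits):
    """Hits of the filled-in discriminator phrase over hits of the instance."""
    phrase = discriminator_phrase.replace("<I>", instance)
    return hits.get(phrase, 0) / hits[instance]

def passes(instance, discriminator_phrase, threshold, hits):
    return pmi(instance, discriminator_phrase, hits) > threshold
```

With the slide's numbers ("Fes": 446,000 hits, "Fes is a city": 14 hits), the PMI of about 3.1e-5 exceeds the learned threshold of 0.000016.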
An Example
Predicate: city
Bootstrapping: generate extraction rules and discriminators; train all discriminators and select the 5 best.
An Example: Trained Discriminators
An Example, Main Cycle: Extract
Suppose the query is "and other cities", from a rule with extraction pattern NP "and other cities".
Two instances are extracted: Fes and East Coast.
An Example, Main Cycle: Assess
To compute the probability of City(Fes), the Assessor sends six queries:
"Fes" has 446,000 hits; "Fes is a city" has 14 hits; "cities Fes" has 201 hits; "cities such as Fes" has 10 hits; "cities including Fes" has 4 hits; "Fes and other towns" has 0 hits.
City(East Coast) falls below the threshold for all discriminators.
Combining the probabilities from all discriminators, the final probability for Fes is 0.99815; for East Coast it is 0.00027.
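The combination step can be sketched with naive Bayes over the discriminator outcomes. The 0.83 and 0.08 conditionals come from the slides; the uniform prior of 0.5 and the assumption that all discriminators share the same conditionals are simplifications for illustration, so this will not reproduce the exact 0.99815 figure.

```python
# Naive Bayes combination of discriminator outcomes. Each outcome is True if
# the instance's PMI exceeded that discriminator's threshold.
def assess(outcomes, p_pos=0.83, p_neg=0.08, prior=0.5):
    """Return P(class | outcomes) under naive Bayes with shared conditionals."""
    pc, pn = prior, 1.0 - prior
    for above in outcomes:
        pc *= p_pos if above else 1.0 - p_pos
        pn *= p_neg if above else 1.0 - p_neg
    return pc / (pc + pn)
```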
1.2 Experiment: noise tolerance
1.2 Experiment: finding negative training seeds for the Assessor
1.2 Experiment: search cutoff metrics. Signal-to-Noise ratio (STN): the ratio of positive to negative instances. Query Yield Ratio (QYR): the amount of new information extracted from n pages.
2 Improving Recall
Pattern Learning (PL): learns extraction rules as well as discriminator patterns for assessing instance correctness.
Subclass Extraction (SE): automatically identifies subclasses to aid extraction. For example, to extract instances of scientist, first find its subclasses (physicist, geologist, etc.), then extract instances of those subclasses.
List Extraction (LE): learns a "wrapper" for each list, and uses the wrapper to extract list elements.
Information extracted with the generic templates serves as the initial seeds for all three methods, so none of them requires hand-supplied training data.
2.1 Pattern Learning (PL):
Generic templates are usually not the most effective patterns for a specific domain, e.g. "the film <film> starring", "headquartered in <city>".
Pattern Learning algorithm: I, a set of seed instances → collect the context of each instance i → search → select the best patterns → filter by estimated recall and precision.
Estimating recall and precision efficiently: take the positive examples of one class to be negative examples for all other classes.
(Slide shows 3 of the most productive learned rules.)
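The efficiency trick above can be sketched directly: a pattern's matches are scored against the seeds of its own class (positives) and the seeds of every other class (negatives). The function name and data layout are illustrative, not from the paper.

```python
# Estimate a learned pattern's precision using seed instances of other
# classes as negatives, avoiding any hand-labeled negative data.
def pattern_precision(pattern_matches, target_class, seeds_by_class):
    positives = seeds_by_class[target_class]
    negatives = set().union(
        *(s for c, s in seeds_by_class.items() if c != target_class))
    hits = sum(1 for m in pattern_matches if m in positives)
    misses = sum(1 for m in pattern_matches if m in negatives)
    # Matches outside all seed sets are simply not counted.
    return hits / (hits + misses) if hits + misses else 0.0
```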
Improving recall: Pattern Learning (PL), Subclass Extraction (SE), List Extraction (LE)
2.2 Subclass Extraction: Basic subclass extraction (SEbase)
Extracting candidate subclasses: the generic extraction rules extract subclasses along with instances. How to tell them apart?
Instances are proper nouns, capitalized: "Scientists such as Einstein, Newton, ..."
Subclasses are common nouns: "Scientists such as physical scientist, biologist, ..."
Assessing candidate subclasses, a combination method:
does the subclass name contain the superclass name? ("microbiologist" is a subclass of "biologist")
is there a hypernym relation in WordNet?
SEbase Assessor: bootstrap training method
Rules for subclass extraction (see table)
Improving Subclass Extraction recall: apply the last two rules in Table 2 to the extracted candidate subclasses to extract their siblings, yielding more candidates.
Two kinds of subclasses: context-independent (Person - Priest) and context-dependent (Person - Pharmacist).
Two assessing methods: SEself trains a classifier by self-training; SEiter iteratively computes a confidence score for each extraction rule.
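The two checks above (capitalization to separate instances from subclasses, and name subsumption as one assessing signal) can be sketched simply. Treating capitalization as a proper-noun test is a simplification of what the slides describe.

```python
# Separate instance candidates (proper nouns) from subclass candidates
# (common nouns) among "X such as Y1, Y2, ..." extractions.
def split_candidates(noun_phrases):
    instances, subclasses = [], []
    for np in noun_phrases:
        if all(w[0].isupper() for w in np.split()):
            instances.append(np)     # every word capitalized: proper noun
        else:
            subclasses.append(np)    # common-noun phrase: subclass candidate
    return instances, subclasses

def name_subsumes(child, parent):
    """Name-containment signal: 'microbiologist' contains 'biologist'."""
    return parent in child and child != parent
```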
Experimental result: Context-independent subclass
Experimental result: Context-dependent subclass
Improving recall: Pattern Learning (PL), Subclass Extraction (SE), List Extraction (LE)
Unlike the previous two methods, which work on unstructured text, LE exploits the structure of web pages to extract information.
2.3 List Extractor
Many lists in web pages are generated from databases and therefore have clear structural regularities.
Basic approach: locate the lists in a page, then learn a wrapper that automatically extracts every item in each list.
Learning a Wrapper
An Example
W3 is the BEST wrapper because (1) its corresponding HTML block is as small as possible and (2) it matches as many keywords as possible.
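Once LE has located a list whose items share surrounding HTML delimiters, those delimiters themselves act as the wrapper. The sketch below assumes the delimiters are already known (here, <option> tags, an illustrative choice echoing the "long selection lists" mentioned later); real wrapper learning must first induce them from the page.

```python
import re

# Build an extractor from a pair of learned left/right delimiters.
def make_wrapper(left, right):
    pattern = re.compile(re.escape(left) + r"(.*?)" + re.escape(right))
    def extract(html):
        return pattern.findall(html)   # all items between the delimiters
    return extract

extract_options = make_wrapper("<option>", "</option>")
```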
Experiments of LE
Discussion
LE can extract a large amount of information with relatively few queries.
Although its precision is not high, it narrows down the set of candidates, greatly reducing the Assessor's workload.
It can find information that standard IE methods miss,
e.g. rare cities buried in long selection lists in HTML documents.
2.4 Comparing PL, SE and LE: recall
(Chart compares recall on the classes city, film, and scientist.)
For extracting instances of general concepts, SE is the most effective.
Comparing PL, SE and LE: extraction rate
extraction rate = num(unique extractions) / num(queries)
The Trade-off between Recall and Precision
3 Conclusion
KnowItAll: unsupervised information extraction from the Web
Input: a set of predicate names; no hand-labeled training examples of any kind
Precision: utilizes a novel generate-and-test architecture (Extractor, Assessor)
Recall: Pattern Learning, Subclass Extraction, List Extraction
Outline
DIPRE,1998 KnowItAll, 2005 Open IE, 2007
Open IE
Michele Banko et al., University of Washington
Open Information Extraction from the Web
IJCAI 2007
1 Introduction
Traditional IE works on small, homogeneous corpora,
and can therefore rely heavily on NLP techniques such as named-entity recognition.
It targets a fixed set of relation types.
1.1 New Challenge: Automation
Initially, hand-labeled instances, document fragments, and automatically learned domain-specific extraction templates were required as system input.
Later systems needed only a few seed instances or hand-written extraction templates per target relation (DIPRE, SNOWBALL, Web-based question answering systems).
However, producing even this data still requires expertise; training data must be supplied for every target relation; and the relations to extract must be specified in advance.
New Challenge: Corpus Heterogeneity
Previous work extracted only a handful of specific relations from small, domain-specific corpora:
kernel-based methods [Bunescu and Mooney, 2005]; maximum-entropy models [Kambhatla, 2004]; graphical models [Rosario and Hearst, 2004; Culotta et al., 2006]; co-occurrence statistics [Lin and Pantel, 2001; Ciaramita et al., 2005]
Most of this work relies on NER, lexical analysis, and dependency parsing. These language-processing techniques make many more errors on heterogeneous web text, and existing NER systems cannot cope with the number and variety of named entities on the Web.
New Challenge: Efficiency
KNOWITALL:
Automation: automatically labels its training set using a small number of domain-independent extraction patterns.
Web heterogeneity: uses a part-of-speech tagger instead of a parser, and requires no NER.
But it needs a huge number of search-engine queries and web page downloads, so experiments often take weeks; and it takes relation names as input, so every change of target relation requires a fresh run.
1.2 Contributions of This Paper: Open Information Extraction
Automatically discovers possible relations without specifying them in advance, so the corpus needs to be scanned only once.
TEXTRUNNER: a full implementation of Open IE.
A statistical report on the extraction results.
2 Open IE
Three modules:
Self-Supervised Learner. Input: a small corpus. Output: a classifier that judges whether a candidate relation is trustworthy.
Single-Pass Extractor: scans the entire corpus once, extracts candidate relation tuples, classifies them, and keeps the positives.
Redundancy-Based Assessor.
2.1 Self-Supervised Learner
Automatically labels positive and negative training examples (ei, ri,j, ej):
parse several thousand sentences to obtain dependency graphs; in each sentence, find noun phrases to serve as ei;
for each pair (ei, ej), follow their connection in the dependency graph to find the word sequence ri,j expressing the relation between them; label a tuple positive if it satisfies predefined heuristic constraints:
the path between ei and ej must not exceed a length threshold; the path must stay within one sentence; neither ei nor ej may consist solely of a pronoun.
2.1 Self-Supervised Learner
Train a classifier on the labeled training set, mapping each tuple to a feature vector:
the number of tokens in ri,j, the number of stopwords in ri,j, whether or not an object e is found to be a proper noun, the part-of-speech tag to the left of ei, the part-of-speech tag to the right of ej.
A Naive Bayes classifier is trained on these feature vectors. The features are neither relation-specific nor lexical.
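The feature mapping just listed can be sketched directly. The POS tags are assumed to be supplied by a tagger (TextRunner uses a POS tagger, not a parser), and the tiny STOPWORDS set and the capitalization test for proper nouns are illustrative simplifications.

```python
# Map a candidate tuple (e1, relation, e2) to the slide's feature vector.
STOPWORDS = {"a", "an", "the", "of", "in", "to", "is"}

def features(e1, relation, e2, pos_left_of_e1, pos_right_of_e2):
    tokens = relation.split()
    return {
        "num_tokens": len(tokens),
        "num_stopwords": sum(t.lower() in STOPWORDS for t in tokens),
        "e2_proper_noun": e2.istitle(),      # crude proper-noun test
        "pos_left": pos_left_of_e1,          # POS tag left of e1
        "pos_right": pos_right_of_e2,        # POS tag right of e2
    }
```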
2.2 Single-Pass Extractor
Identifies noun phrases, looks for a relation in the text connecting each pair of noun phrases, and produces candidate tuples. Uses relatively lightweight NLP, which makes the method robust and able to cope with heterogeneous web text.
Heuristically removes prepositional phrases that over-specify an entity, and other unnecessary modifiers.
Applies the classifier, keeping only tuples labeled "trustworthy".
2.3 Redundancy-Based Assessor
Merges identical tuples and drops unnecessary modifiers.
For each tuple, counts the number of distinct sentences in which it appears,
and uses this count to estimate the probability that the tuple is correct. This probabilistic method has been shown to be considerably more accurate than alternatives based on noisy-or or pointwise mutual information.
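The counting step can be sketched as below. The probability mapping is a simple noisy-or-style stand-in chosen for illustration; the slides note that the paper's actual probabilistic model is more accurate than exactly this kind of estimate.

```python
from collections import Counter

def support_counts(extractions):
    """extractions: iterable of (tuple, sentence_id) pairs.
    Duplicates are merged; each tuple gets its distinct-sentence count."""
    seen = set(extractions)
    return Counter(t for t, _ in seen)

def correctness(count, p_single=0.5):
    """Illustrative stand-in: probability the tuple is correct given that
    'count' independent sentences each support it with probability p_single."""
    return 1 - (1 - p_single) ** count
```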
2.4 Query Processing
Fast queries, at interactive speeds.
Builds an inverted index over the tuples and their source text, with each relation assigned to a single machine.
Because relation names are extracted from web text, they also make natural query keywords.
Uses a relation-centric index, unlike the standard inverted indexes of current search engines; it supports complex relational queries: relationship queries, unnamed-item queries, and multiple-attribute queries.
2.5 Analysis
Time complexity: Open IE is O(D) for extraction plus O(T log T) to sort, count, and assess the tuples; traditional IE is O(R * D).
Speed: because no dependency parsing or similar heavyweight NLP is used, Open IE processes a sentence in 0.036 CPU seconds, versus 3 CPU seconds for traditional IE.
3 Experimental Results: 3.1 Comparison with Traditional IE
9 million Web page corpus, 10 relations
3 Experimental Results: Comparison with Traditional IE
On the same set of correctly extracted tuples, TEXTRUNNER has a lower error rate.
In running time, TextRunner is somewhat slower (85 vs. 63 CPU hours), while extracting many more relations.
3 Experimental Results: 3.2 Global Statistics on Facts Learned
11.3 million tuples containing 278,085 distinct relation strings.
Filtering rules: probability of at least 0.8; the tuple's relation is supported by at least 10 distinct sentences in the corpus; the tuple's relation is not in the top 0.1% of relations by number of supporting sentences (those relations were so general as to be nearly vacuous, such as (NP1, has, NP2)).
Estimating the Correctness of Facts
400 of the tuples were manually labeled as a sample and judged on:
Well-formed? Well-formed relation: (FCI, specializes in, software development). Ill-formed entities: (29, dropped, instruments)
Concrete or abstract? Concrete, useful for IE and question answering:
(Tesla, invented, coil transformer). Abstract, useful for ontology learning:
(Einstein, derived, theory)
True or false? Consistent with the meaning of the sentence it came from.
Estimating the Number of Distinct Facts
Distinct relation Merging
首尾的标点符号,助动词,开头的停用词,如“ are consistent with”, “, which is consistent with”.
主动和被动语态 关系的多义性
Eg. developed使得如果不借助 domain-specific type checking ,同义的 relation将对应着有重叠但是差别很大的 tuple 集
Estimating the Number of Distinct Facts
Distinct relation Build “synonymy clusters” for 11.3 million tuples:
(e1,r,e2), (e1,q,e2), where r≠q
1/3 belong to the “synonymy clusters” Distinct facts in the “synonymy clusters”: ¾ hat 2/3 + (1/3 × 3/4 ) or roughly 92% of the tuples
found by TEXTRUNNER express distinct assertions. overestimated
4 Conclusion
Open IE: an unsupervised extraction paradigm
for the Web; all relations;
one-time relation discovery. TEXTRUNNER:
a fully implemented Open IE system; demonstrates its ability to extract massive amounts of high-quality information from a 9 million web page corpus; compared against KnowItAll.
SUMMARY
DIPRE, 1998:
the first work to use an iterative method to discover patterns and relations between data entities; the earliest work on semi-supervised relation extraction.
KnowItAll, 2005 unsupervised, domain-independent extracts information from the Web
Open IE, 2007 unsupervised, domain-independent Web All relations one-time relation discovery Higher precision than KnowItAll
Thank you! Questions?