Three Typical Works on Automatic Relation Extraction
Wu Wenjuan
2009.06.04
Outline
DIPRE,1998 KnowItAll, 2005 Open IE, 2007
1 DIPRE: Dual Iterative Pattern Expansion
Sergey Brin,
Extracting Patterns and Relations from the World Wide Web,
In: Proceedings of the International Workshop on the Web and Databases, 1998.
1 DIPRE: Dual Iterative Pattern Expansion
The first work to use an iterative method to discover patterns and relations between data entities; it successfully extracted (author, title) pairs for books.
Input: a seed set of 5 (author, title) book pairs
Output: automatically expanded to about 15,000 books
Some of these books were not carried even by Amazon, the largest online bookstore.
1.1 Idea
patterns --extract--> tuples; tuples --discover--> patterns
There is a duality between patterns and relations.
1.2 Algorithm
R (tuple set), initialized with the seed samples
Repeat: Occurrences ← FindOccurrences(R, D); Patterns ← Generate & Filter(Occurrences); new tuples ← Search(Patterns), added to R
Each occurrence is a 7-tuple: (author, title, order, url, prefix, middle, suffix)
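The occurrence-finding step above can be sketched over an in-memory corpus. This is an illustrative reconstruction, not Brin's code: the helper name find_occurrences and the 10-character context windows are assumptions, and the real system ran over millions of crawled pages rather than a Python dict.

```python
# Hedged sketch of DIPRE's FindOccurrences: for each known (author, title)
# pair found in a page, record the 7-tuple occurrence
# (author, title, order, url, prefix, middle, suffix).
def find_occurrences(tuples, corpus):
    occurrences = []
    for author, title in tuples:
        for url, text in corpus.items():
            a, t = text.find(author), text.find(title)
            if a == -1 or t == -1:
                continue
            order = a < t                      # True if author precedes title
            lo, hi = min(a, t), max(a, t)
            first_len = len(author) if order else len(title)
            second_len = len(title) if order else len(author)
            occurrences.append({
                "author": author, "title": title,
                "order": order, "url": url,
                "prefix": text[max(0, lo - 10):lo],   # context before the pair
                "middle": text[lo + first_len:hi],    # text between the two
                "suffix": text[hi + second_len:][:10],  # context after the pair
            })
    return occurrences
```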
Pattern generation
Group the occurrence 7-tuples (author, title, order, url, prefix, middle, suffix) by (order, middle).
For each group of occurrences O1, O2, ..., Ok sharing (order, middle), GenOnePattern produces a pattern p, a 5-tuple (order, urlprefix, prefix, middle, suffix).
If p is specific enough, output p; otherwise discard it.
Matching: the URL must match urlprefix*, and the page content must match *prefix, author, middle, title, suffix*.
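The grouping step can be sketched as follows. This is a simplified reading of the slides, not Brin's implementation: urlprefix, prefix, and suffix are taken as longest common parts across the group, and the min_support check stands in for the real "is p specific enough?" test.

```python
import os
from collections import defaultdict

def common_suffix(strings):
    """Longest common suffix, via commonprefix on reversed strings."""
    return os.path.commonprefix([s[::-1] for s in strings])[::-1]

def generate_patterns(occurrences, min_support=2):
    groups = defaultdict(list)
    for o in occurrences:
        groups[(o["order"], o["middle"])].append(o)
    patterns = []
    for (order, middle), occs in groups.items():
        if len(occs) < min_support:   # crude stand-in for the specificity test
            continue
        patterns.append({
            "order": order,
            "urlprefix": os.path.commonprefix([o["url"] for o in occs]),
            # the pattern's prefix ends right before the pair, so it is the
            # longest common *suffix* of the occurrence prefixes
            "prefix": common_suffix([o["prefix"] for o in occs]),
            "middle": middle,
            "suffix": os.path.commonprefix([o["suffix"] for o in occs]),
        })
    return patterns
```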
1.3 Experiments
Corpus: a repository of 24 million web pages (147 GB)
1.3 Experiments: Initial sample
1.3 Experiments: 3 Patterns in First Iteration
1.3 Experiments: 4047 new pairs in First Iteration
1.3 Experiments: review
iteration       corpus       occurrences   patterns   (author, title) pairs
1st iteration   24 million   199           3          4047
2nd iteration   5 million    3972          105        9369
3rd iteration   156,000      9938          346        15257
1.4 Conclusion
DIPRE is the earliest work on semi-supervised relation learning. It exploits the duality between relations and patterns: starting from a small number of seed samples, it iteratively extracts new patterns and new instances from a Web-scale corpus.
Outline
DIPRE,1998 KnowItAll, 2005 Open IE, 2007
KNOWITALL
Oren Etzioni et al.
University of Washington
Unsupervised Named-Entity Extraction from the Web: An Experimental Study
AAAI 2005
Introduction
Previous work: HMMs, CRFs; small corpora; required seed data.
KNOWITALL: an unsupervised, domain-independent system that extracts information from the Web.
Key challenges: ensuring precision, via a novel generate-and-test architecture; improving recall, via
Pattern Learning (PL), Subclass Extraction (SE), and List Extraction (LE)
1 Flowchart of the main components in KnowItAll
For every predicate, KnowItAll creates extraction rules and discriminators, then trains the discriminators. Example rule: "cities such as " NPList
Information Focus
The only domain-specific input is a set of predicates specifying the domain of interest.
Generic extraction templates
Extraction Rules: generic extraction templates are combined with the predicate's labels to generate domain-specific extraction rules. For Class1 = 'city', the rules are
"cities such as " NPList
"towns such as " NPList
Keywords submitted to the search engine: "cities such as ", "towns such as "
Discriminator
Used to validate whether an extracted item is correct, based on PMI (pointwise mutual information).
Training discriminators: bootstrapping.
The result of training:
a set of discriminators, e.g. the discriminator "<I> is a city"
a learned threshold T = 0.000016
conditional probabilities P(PMI > T | class) = 0.83 and P(PMI > T | ¬class) = 0.08
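A minimal sketch of the PMI test a discriminator performs: the discriminator phrase with the instance substituted is divided by the instance's own hit count, then compared to the learned threshold. The hits() lookup table stands in for search-engine hit counts; the function names are illustrative.

```python
# PMI-based discriminator check, with search-engine hit counts replaced by a
# plain dict. "<I>" in the discriminator phrase marks the instance slot.
def pmi(instance, discriminator_phrase, hits):
    """Hits of the filled-in discriminator phrase over hits of the instance."""
    phrase = discriminator_phrase.replace("<I>", instance)
    return hits.get(phrase, 0) / hits[instance]

def passes(instance, discriminator_phrase, threshold, hits):
    return pmi(instance, discriminator_phrase, hits) > threshold
```

With the slide's numbers ("Fes": 446,000 hits, "Fes is a city": 14 hits), the PMI of about 3.1e-5 exceeds the learned threshold of 0.000016.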
An Example
Predicate: city
Bootstrapping: generate extraction rules and discriminators; train all discriminators and select the 5 best.
An Example: Trained Discriminators
An Example, Main Cycle: Extract
Suppose the query is "and other cities", from a rule with extraction pattern NP "and other cities".
Two instances are extracted: Fes and East Coast.
An Example, Main Cycle: Assess
To compute the probability of City(Fes), the Assessor sends six queries:
"Fes" has 446,000 hits; "Fes is a city" has 14 hits; "cities Fes" has 201 hits; "cities such as Fes" has 10 hits; "cities including Fes" has 4 hits; "Fes and other towns" has 0 hits.
City(East Coast) falls below the threshold for all discriminators.
Combining the probabilities from all discriminators, the final probability for Fes is 0.99815; for East Coast it is 0.00027.
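The combination step can be sketched with naive Bayes over the discriminator outcomes. The 0.83 and 0.08 conditionals come from the slides; the uniform prior of 0.5 and the assumption that all discriminators share the same conditionals are simplifications for illustration, so this will not reproduce the exact 0.99815 figure.

```python
# Naive Bayes combination of discriminator outcomes. Each outcome is True if
# the instance's PMI exceeded that discriminator's threshold.
def assess(outcomes, p_pos=0.83, p_neg=0.08, prior=0.5):
    """Return P(class | outcomes) under naive Bayes with shared conditionals."""
    pc, pn = prior, 1.0 - prior
    for above in outcomes:
        pc *= p_pos if above else 1.0 - p_pos
        pn *= p_neg if above else 1.0 - p_neg
    return pc / (pc + pn)
```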
1.2 Experiment: noise tolerance
1.2 Experiment: finding negative training seeds for the Assessor
1.2 Experiment: search cutoff metrics. Signal-to-Noise ratio (STN): the ratio of positive to negative instances. Query Yield Ratio (QYR): the amount of new information extracted from n pages.
2 Improving Recall
Pattern Learning (PL): learns extraction rules as well as discriminator patterns for assessing instance correctness.
Subclass Extraction (SE): automatically identifies subclasses to aid extraction. For example, to extract instances of scientist, first find its subclasses (physicist, geologist, etc.), then extract instances of those subclasses.
List Extraction (LE): learns a "wrapper" for each list, and uses the wrapper to extract list elements.
Information extracted with the generic templates serves as the initial seeds for all three methods, so none of them requires hand-supplied training data.
2.1 Pattern Learning (PL):
Generic templates are usually not the most effective patterns for a specific domain, e.g. "the film <film> starring", "headquartered in <city>".
Pattern Learning algorithm: I, a set of seed instances → collect the context of each instance i → search → select the best patterns → filter by estimated recall and precision.
Estimating recall and precision efficiently: take the positive examples of one class to be negative examples for all other classes.
(Slide shows 3 of the most productive learned rules.)
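The efficiency trick above can be sketched directly: a pattern's matches are scored against the seeds of its own class (positives) and the seeds of every other class (negatives). The function name and data layout are illustrative, not from the paper.

```python
# Estimate a learned pattern's precision using seed instances of other
# classes as negatives, avoiding any hand-labeled negative data.
def pattern_precision(pattern_matches, target_class, seeds_by_class):
    positives = seeds_by_class[target_class]
    negatives = set().union(
        *(s for c, s in seeds_by_class.items() if c != target_class))
    hits = sum(1 for m in pattern_matches if m in positives)
    misses = sum(1 for m in pattern_matches if m in negatives)
    # Matches outside all seed sets are simply not counted.
    return hits / (hits + misses) if hits + misses else 0.0
```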
Improving recall: Pattern Learning (PL), Subclass Extraction (SE), List Extraction (LE)
2.2 Subclass Extraction: Basic subclass extraction (SEbase)
Extracting candidate subclasses: the generic extraction rules extract subclasses along with instances. How to tell them apart?
Instances are proper nouns, capitalized: "Scientists such as Einstein, Newton, ..."
Subclasses are common nouns: "Scientists such as physical scientist, biologist, ..."
Assessing candidate subclasses, a combination method:
does the subclass name contain the superclass name? ("microbiologist" is a subclass of "biologist")
is there a hypernym relation in WordNet?
SEbase Assessor: bootstrap training method
Rules for subclass extraction (see table)
Improving Subclass Extraction recall: apply the last two rules in Table 2 to the extracted candidate subclasses to extract their siblings, yielding more candidates.
Two kinds of subclasses: context-independent (Person - Priest) and context-dependent (Person - Pharmacist).
Two assessing methods: SEself trains a classifier by self-training; SEiter iteratively computes a confidence score for each extraction rule.
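The two checks above (capitalization to separate instances from subclasses, and name subsumption as one assessing signal) can be sketched simply. Treating capitalization as a proper-noun test is a simplification of what the slides describe.

```python
# Separate instance candidates (proper nouns) from subclass candidates
# (common nouns) among "X such as Y1, Y2, ..." extractions.
def split_candidates(noun_phrases):
    instances, subclasses = [], []
    for np in noun_phrases:
        if all(w[0].isupper() for w in np.split()):
            instances.append(np)     # every word capitalized: proper noun
        else:
            subclasses.append(np)    # common-noun phrase: subclass candidate
    return instances, subclasses

def name_subsumes(child, parent):
    """Name-containment signal: 'microbiologist' contains 'biologist'."""
    return parent in child and child != parent
```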
Experimental result: Context-independent subclass
Experimental result: Context-dependent subclass
Improving recall: Pattern Learning (PL), Subclass Extraction (SE), List Extraction (LE)
Unlike the previous two methods, which work on unstructured text, LE exploits the structure of web pages to extract information.
2.3 List Extractor
Many lists in web pages are generated from databases and therefore have clear structural regularities.
Basic approach: locate the lists in a page, then learn a wrapper that automatically extracts every item in each list.
Learning a Wrapper
An Example
W3 is the BEST wrapper because (1) its corresponding HTML block is as small as possible and (2) it matches as many keywords as possible.
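Once LE has located a list whose items share surrounding HTML delimiters, those delimiters themselves act as the wrapper. The sketch below assumes the delimiters are already known (here, <option> tags, an illustrative choice echoing the "long selection lists" mentioned later); real wrapper learning must first induce them from the page.

```python
import re

# Build an extractor from a pair of learned left/right delimiters.
def make_wrapper(left, right):
    pattern = re.compile(re.escape(left) + r"(.*?)" + re.escape(right))
    def extract(html):
        return pattern.findall(html)   # all items between the delimiters
    return extract

extract_options = make_wrapper("<option>", "</option>")
```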
Experiments of LE
Discussion
LE can extract a large amount of information with relatively few queries.
Although its precision is not high, it narrows down the set of candidates, greatly reducing the Assessor's workload.
It can find information that standard IE methods miss,
e.g. rare cities buried in long selection lists in HTML documents.
2.4 Comparing PL, SE and LE: recall
(Chart compares recall on the classes city, film, and scientist.)
For extracting instances of general concepts, SE is the most effective.
Comparing PL, SE and LE: extraction rate
extraction rate = num(unique extractions) / num(queries)
The Trade-off between Recall and Precision
3 Conclusion
KnowItAll: unsupervised information extraction from the Web
Input: a set of predicate names; no hand-labeled training examples of any kind
Precision: utilizes a novel generate-and-test architecture (Extractor, Assessor)
Recall: Pattern Learning, Subclass Extraction, List Extraction
Outline
DIPRE,1998 KnowItAll, 2005 Open IE, 2007
Open IE
Michele Banko et al., University of Washington
Open Information Extraction from the Web
IJCAI 2007
1 Introduction
Traditional IE works on small, homogeneous corpora,
and can therefore rely heavily on NLP techniques such as named-entity recognition.
It targets a fixed set of relation types.
1.1 New Challenge: Automation
Initially, hand-labeled instances, document fragments, and automatically learned domain-specific extraction templates were required as system input.
Later systems needed only a few seed instances or hand-written extraction templates per target relation (DIPRE, SNOWBALL, Web-based question answering systems).
However, producing even this data still requires expertise; training data must be supplied for every target relation; and the relations to extract must be specified in advance.
New Challenge: Corpus Heterogeneity
Previous work extracted only a handful of specific relations from small, domain-specific corpora:
kernel-based methods [Bunescu and Mooney, 2005]; maximum-entropy models [Kambhatla, 2004]; graphical models [Rosario and Hearst, 2004; Culotta et al., 2006]; co-occurrence statistics [Lin and Pantel, 2001; Ciaramita et al., 2005]
Most of this work relies on NER, lexical analysis, and dependency parsing. These language-processing techniques make many more errors on heterogeneous web text, and existing NER systems cannot cope with the number and variety of named entities on the Web.
New Challenge: Efficiency
KNOWITALL:
Automation: automatically labels its training set using a small number of domain-independent extraction patterns.
Web heterogeneity: uses a part-of-speech tagger instead of a parser, and requires no NER.
But it needs a huge number of search-engine queries and web page downloads, so experiments often take weeks; and it takes relation names as input, so every change of target relation requires a fresh run.
1.2 Contributions of This Paper: Open Information Extraction
Automatically discovers possible relations without specifying them in advance, so the corpus needs to be scanned only once.
TEXTRUNNER: a full implementation of Open IE.
A statistical report on the extraction results.
2 Open IE
Three modules:
Self-Supervised Learner. Input: a small corpus. Output: a classifier that judges whether a candidate relation is trustworthy.
Single-Pass Extractor: scans the entire corpus once, extracts candidate relation tuples, classifies them, and keeps the positives.
Redundancy-Based Assessor.
2.1 Self-Supervised Learner
Automatically labels positive and negative training examples (ei, ri,j, ej):
parse several thousand sentences to obtain dependency graphs; in each sentence, find noun phrases to serve as ei;
for each pair (ei, ej), follow their connection in the dependency graph to find the word sequence ri,j expressing the relation between them; label a tuple positive if it satisfies predefined heuristic constraints:
the path between ei and ej must not exceed a length threshold; the path must stay within one sentence; neither ei nor ej may consist solely of a pronoun.
2.1 Self-Supervised Learner
Train a classifier on the labeled training set, mapping each tuple to a feature vector:
the number of tokens in ri,j, the number of stopwords in ri,j, whether or not an object e is found to be a proper noun, the part-of-speech tag to the left of ei, the part-of-speech tag to the right of ej.
A Naive Bayes classifier is trained on these feature vectors. The features are neither relation-specific nor lexical.
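The feature mapping just listed can be sketched directly. The POS tags are assumed to be supplied by a tagger (TextRunner uses a POS tagger, not a parser), and the tiny STOPWORDS set and the capitalization test for proper nouns are illustrative simplifications.

```python
# Map a candidate tuple (e1, relation, e2) to the slide's feature vector.
STOPWORDS = {"a", "an", "the", "of", "in", "to", "is"}

def features(e1, relation, e2, pos_left_of_e1, pos_right_of_e2):
    tokens = relation.split()
    return {
        "num_tokens": len(tokens),
        "num_stopwords": sum(t.lower() in STOPWORDS for t in tokens),
        "e2_proper_noun": e2.istitle(),      # crude proper-noun test
        "pos_left": pos_left_of_e1,          # POS tag left of e1
        "pos_right": pos_right_of_e2,        # POS tag right of e2
    }
```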
2.2 Single-Pass Extractor
Identifies noun phrases, looks for a relation in the text connecting each pair of noun phrases, and produces candidate tuples. Uses relatively lightweight NLP, which makes the method robust and able to cope with heterogeneous web text.
Heuristically removes prepositional phrases that over-specify an entity, and other unnecessary modifiers.
Applies the classifier, keeping only tuples labeled "trustworthy".
2.3 Redundancy-Based Assessor
Merges identical tuples and drops unnecessary modifiers.
For each tuple, counts the number of distinct sentences in which it appears,
and uses this count to estimate the probability that the tuple is correct. This probabilistic method has been shown to be considerably more accurate than alternatives based on noisy-or or pointwise mutual information.
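The counting step can be sketched as below. The probability mapping is a simple noisy-or-style stand-in chosen for illustration; the slides note that the paper's actual probabilistic model is more accurate than exactly this kind of estimate.

```python
from collections import Counter

def support_counts(extractions):
    """extractions: iterable of (tuple, sentence_id) pairs.
    Duplicates are merged; each tuple gets its distinct-sentence count."""
    seen = set(extractions)
    return Counter(t for t, _ in seen)

def correctness(count, p_single=0.5):
    """Illustrative stand-in: probability the tuple is correct given that
    'count' independent sentences each support it with probability p_single."""
    return 1 - (1 - p_single) ** count
```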
2.4 Query Processing
Fast queries, at interactive speeds.
Builds an inverted index over the tuples and their source text, with each relation assigned to a single machine.
Because relation names are extracted from web text, they also make natural query keywords.
Uses a relation-centric index, unlike the standard inverted indexes of current search engines; it supports complex relational queries: relationship queries, unnamed-item queries, and multiple-attribute queries.
2.5 Analysis
Time complexity: Open IE is O(D) for extraction plus O(T log T) to sort, count, and assess the tuples; traditional IE is O(R * D).
Speed: because no dependency parsing or similar heavyweight NLP is used, Open IE processes a sentence in 0.036 CPU seconds, versus 3 CPU seconds for traditional IE.
3 Experimental Results: 3.1 Comparison with Traditional IE
9 million Web page corpus, 10 relations
3 Experimental Results: Comparison with Traditional IE
On the same set of correctly extracted tuples, TEXTRUNNER has a lower error rate.
In running time, TextRunner is somewhat slower (85 vs. 63 CPU hours), while extracting many more relations.
3 Experimental Results: 3.2 Global Statistics on Facts Learned
11.3 million tuples containing 278,085 distinct relation strings.
Filtering rules: probability of at least 0.8; the tuple's relation is supported by at least 10 distinct sentences in the corpus; the tuple's relation is not in the top 0.1% of relations by number of supporting sentences (those relations were so general as to be nearly vacuous, such as (NP1, has, NP2)).
Estimating the Correctness of Facts
400 of the tuples were manually labeled as a sample and judged on:
Well-formed? Well-formed relation: (FCI, specializes in, software development). Ill-formed entities: (29, dropped, instruments)
Concrete or abstract? Concrete, useful for IE and question answering:
(Tesla, invented, coil transformer). Abstract, useful for ontology learning:
(Einstein, derived, theory)
True or false? Consistent with the meaning of the sentence it came from.
Estimating the Number of Distinct Facts
Distinct relation Merging
首尾的标点符号,助动词,开头的停用词,如“ are consistent with”, “, which is consistent with”.
主动和被动语态 关系的多义性
Eg. developed使得如果不借助 domain-specific type checking ,同义的 relation将对应着有重叠但是差别很大的 tuple 集
Estimating the Number of Distinct Facts
Distinct relation Build “synonymy clusters” for 11.3 million tuples:
(e1,r,e2), (e1,q,e2), where r≠q
1/3 belong to the “synonymy clusters” Distinct facts in the “synonymy clusters”: ¾ hat 2/3 + (1/3 × 3/4 ) or roughly 92% of the tuples
found by TEXTRUNNER express distinct assertions. overestimated
4 Conclusion
Open IE: an unsupervised extraction paradigm
for the Web; all relations;
one-time relation discovery. TEXTRUNNER:
a fully implemented Open IE system; demonstrates its ability to extract massive amounts of high-quality information from a 9 million web page corpus; compared against KnowItAll.
SUMMARY
DIPRE, 1998:
the first work to use an iterative method to discover patterns and relations between data entities; the earliest work on semi-supervised relation extraction.
KnowItAll, 2005 unsupervised, domain-independent extracts information from the Web
Open IE, 2007 unsupervised, domain-independent Web All relations one-time relation discovery Higher precision than KnowItAll
Thank you! Questions?