Table Extraction Using Conditional Random Fields
D. Pinto, A. McCallum, X. Wei and W. Bruce Croft
- SIGIR 2003 -
Presented by Vitor R. Carvalho
March 15th, 2004
Warm up
• Why table extraction?
– Applications: question answering, data mining and IR
– Tables: “textual tokens laid out in tabular form”
– Tables: “databases designed for human eyes”
• Related Work:
– Pyreddy and Croft, 1997: purely layout-based approach; a Character Alignment Graph (CAG) is used to identify the whole table
– Ng et al., 1999: machine learning to identify row and column positions; no extraction of content
– Hurst, 2000: combination of layout and language perspectives; text is broken into blocks using spatial and linguistic evidence
– Pinto et al., 2002: CAG-based heuristic method to extract table cells for a QA system
Objectives
• In this paper:
– Only text tables are studied, not HTML tables
– Table extraction can be broken down into 6 subproblems:
» Locate the table (*)
» Identify the row positions and types (*)
» Identify the column positions and types
» Segment tables into cells
» Tag cells as data or headers
» Associate data cells with their corresponding headers
– Only the (*) tasks are addressed in the paper
– CRFs are compared to maximum entropy (MaxEnt) models and to HMMs
Example
• From www.FedStats.com, July 2001
12 Line Labels
• Non-extraction Labels
– { NONTABLE, BLANKLINE, SEPARATOR }
• Header Labels
– { TITLE, SUPERHEADER, TABLEHEADER, SUBHEADER, SECTIONHEADER }
• Data Row Labels
– { DATAROW, SECTIONDATAROW }
• Caption Labels
– { TABLEFOOTNOTE, TABLECAPTION }
Feature Set
• White Space Features
– Presence of: 4 consecutive white spaces, a 4-space indent, 2 consecutive white spaces between non-space characters, a complete white-space line, a single-space indent, etc.
– Percentage of: white space from the first non-white-space character on
• Text Features
– Presence of: 3 cells on a line, etc.
– Percentage of: digits (0-9) on a line, alphabetic characters (a-z) on a line, header features (year strings, month abbreviations, etc.) on a line
• Separator Features
– Presence of: 4 consecutive periods
– Percentage of: separator characters (-, +, !, =, :, *) on a line
• Conjunction of Features
– Conjunctions: current & previous line, current & next line, next & next-next line
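To make the feature set concrete, here is a minimal Python sketch of a few of the per-line features and the line conjunctions. It is an illustration under stated assumptions, not the authors' implementation: the feature names, the rule that a "cell" is a run of text delimited by two or more spaces, the use of the text from the first non-space character on as the denominator of every percentage, and the case-insensitive letter count are all choices made here.

```python
import re

SEPARATOR_CHARS = set("-+!=:*")  # separator characters listed on the slide


def fraction(count, total):
    """Return count / total, or 0.0 when the denominator is zero."""
    return count / total if total else 0.0


def line_features(line):
    """Compute a subset of the per-line features (names are illustrative)."""
    text = line.rstrip("\n")
    body = text.lstrip(" ")  # text from the first non-space character on
    indent = len(text) - len(body)
    # Assumption: cells are chunks of text separated by 2+ spaces.
    cells = [c for c in re.split(r" {2,}", body.strip()) if c]
    n = len(body)  # assumed denominator for all percentages
    return {
        "four_spaces": "    " in text,
        "four_space_indent": indent >= 4,
        "single_space_indent": indent == 1,
        "gap_between_text": re.search(r"\S {2,}\S", text) is not None,
        "blank_line": text.strip() == "",
        "three_cells": len(cells) >= 3,
        "four_periods": "...." in text,
        "pct_white_space": fraction(body.count(" "), n),
        "pct_digits": fraction(sum(c.isascii() and c.isdigit() for c in body), n),
        "pct_alpha": fraction(sum(c.isascii() and c.isalpha() for c in body), n),
        "pct_separator": fraction(sum(c in SEPARATOR_CHARS for c in body), n),
    }


def conjunction_features(per_line, i):
    """Conjoin active binary features across the line pairs named on the slide:
    current & previous, current & next, next & next-next."""
    def active(j):
        if 0 <= j < len(per_line):
            return sorted(k for k, v in per_line[j].items() if v is True)
        return []

    conjunctions = set()
    for a_off, b_off in ((0, -1), (0, 1), (1, 2)):
        for a in active(i + a_off):
            for b in active(i + b_off):
                conjunctions.add(f"{a}@{a_off}&{b}@{b_off}")
    return conjunctions
```

Calling `line_features` on every line of a document and then `conjunction_features(per_line, i)` for each index `i` gives one feature bundle per line, which is the kind of input a line-labelling model consumes.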
Task 1: Table Line Location
• A table line is any line whose label is not NONTABLE, BLANKLINE or SEPARATOR
• F-measure = (2 * Precision * Recall) / (Precision + Recall)
• Both CRFs used a Gaussian prior and were trained using L-BFGS
• Data: training set (52 documents), development set (6 documents), test set (62 documents)
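The table-line definition and the F-measure above can be sketched in a few lines of Python. This is a minimal illustration, not the evaluation code used in the paper; the function names and the toy label sequences in the usage note are hypothetical.

```python
NON_TABLE_LABELS = {"NONTABLE", "BLANKLINE", "SEPARATOR"}


def is_table_line(label):
    """A table line is any line whose label is not a non-extraction label."""
    return label not in NON_TABLE_LABELS


def table_line_scores(gold, predicted):
    """Return (precision, recall, F-measure) for table-line location."""
    gold_table = [is_table_line(g) for g in gold]
    pred_table = [is_table_line(p) for p in predicted]
    true_pos = sum(g and p for g, p in zip(gold_table, pred_table))
    n_pred = sum(pred_table)
    n_gold = sum(gold_table)
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure
```

For example, with gold labels `["NONTABLE", "TITLE", "DATAROW", "DATAROW", "BLANKLINE"]` and predictions `["NONTABLE", "TITLE", "DATAROW", "NONTABLE", "DATAROW"]`, two of the three predicted table lines are correct and two of the three true table lines are found, so precision, recall and F-measure are all 2/3.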
Task 2: Line Identification
• How many of these lines were actually table lines?
Additional Results
• Pinto et al. heuristic method
• 4 labels: CAPTIONS, HEADERS, DATA, NON-TABLE
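One way to relate the 12 line labels to these 4 labels is to collapse each label group into a single coarse label, as in the Python sketch below. The mapping is an assumption made here for illustration: it follows the group names on the "12 Line Labels" slide and sends all non-extraction labels to NON-TABLE, but the slides do not state the exact mapping used for the comparison.

```python
# Assumed mapping, derived from the label groups on the "12 Line Labels" slide.
LABEL_GROUPS = {
    "NON-TABLE": {"NONTABLE", "BLANKLINE", "SEPARATOR"},
    "HEADERS": {"TITLE", "SUPERHEADER", "TABLEHEADER", "SUBHEADER", "SECTIONHEADER"},
    "DATA": {"DATAROW", "SECTIONDATAROW"},
    "CAPTIONS": {"TABLEFOOTNOTE", "TABLECAPTION"},
}

# Invert the groups into a fine-label -> coarse-label lookup table.
COARSE_LABEL = {
    fine: coarse for coarse, fines in LABEL_GROUPS.items() for fine in fines
}


def collapse(labels):
    """Map a sequence of fine-grained line labels onto the 4 coarse labels."""
    return [COARSE_LABEL[label] for label in labels]
```

With this lookup, predictions made over the 12 labels can be scored on the same 4-label scale as a system that only outputs the coarse labels.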
Conclusions
• The table extraction problem has complex linguistic and formatting characteristics. To attack it, a combination of textual and spatial features was used.
• CRFs handle arbitrary, overlapping features very well and offer the combined benefits of conditionally trained models and Markov finite-state context models.