Table Extraction Using Conditional Random Fields D. Pinto, A. McCallum, X. Wei and W. Bruce Croft - on SIGIR03 - Presented by Vitor R. Carvalho March 15

Table Extraction Using Conditional Random Fields

D. Pinto, A. McCallum, X. Wei and W. Bruce Croft

- on SIGIR03 -

Presented by Vitor R. Carvalho

March 15th 2004

Warm up

• Why table extraction?

– Applications: question answering, data mining and IR

– Tables: “textual tokens laid out in tabular form”

– Tables: “databases designed for human eyes”

• Related Work:

– Pyreddy and Croft, 1997: purely layout-based approach; a Character Alignment Graph (CAG) is used to identify the whole table

– Ng et al., 1999: machine learning to identify row and column positions; no extraction of content

– Hurst, 2000: combination of layout and language perspectives; text is broken into blocks using spatial and linguistic evidence

– Pinto et al., 2002: based on CAG; a heuristic method to extract table cells for a QA system

Objectives

• In this paper:

– Only text tables are studied, not HTML tables

– Table extraction can be broken down into 6 subproblems:

» Locate the table (*)

» Identify the row positions and types (*)

» Identify column positions and types

» Segment tables into cells

» Tag cells as data or headers

» Associate data cells with their corresponding headers

– Only the tasks marked (*) are addressed in the paper

– CRFs are compared with Maximum Entropy and HMM classifiers

Example

• From www.FedStats.com, July 2001

12 Line Labels

• Non-extraction Labels

– { NONTABLE, BLANKLINE, SEPARATOR }

• Header Labels

– { TITLE, SUPERHEADER, TABLEHEADER, SUBHEADER, SECTIONHEADER }

• Data Row Labels

– { DATAROW, SECTIONDATAROW }

• Caption Labels

– { TABLEFOOTNOTE, TABLECAPTION }
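The four groups above can be written down as plain sets, which makes it easy to check that they add up to 12 labels. This is a minimal sketch; the label strings come from the slide, while the Python names (`NON_EXTRACTION`, `label_group`, and so on) are illustrative assumptions rather than anything defined in the paper.

```python
# Label groups as listed on the slide; the variable names are hypothetical.
NON_EXTRACTION = {"NONTABLE", "BLANKLINE", "SEPARATOR"}
HEADER = {"TITLE", "SUPERHEADER", "TABLEHEADER", "SUBHEADER", "SECTIONHEADER"}
DATA_ROW = {"DATAROW", "SECTIONDATAROW"}
CAPTION = {"TABLEFOOTNOTE", "TABLECAPTION"}

GROUPS = {
    "non_extraction": NON_EXTRACTION,
    "header": HEADER,
    "data_row": DATA_ROW,
    "caption": CAPTION,
}

# Union of all groups: the full label alphabet for the line classifier.
ALL_LABELS = NON_EXTRACTION | HEADER | DATA_ROW | CAPTION


def label_group(label: str) -> str:
    """Return the name of the group a line label belongs to."""
    for name, members in GROUPS.items():
        if label in members:
            return name
    raise ValueError(f"unknown line label: {label!r}")
```

For example, `label_group("SUBHEADER")` yields `"header"`, and an unknown label raises `ValueError` instead of being silently treated as non-table.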

Feature Set

• White Space Features

– Presence of: 4 consecutive white spaces, 4-space indents, 2 consecutive white spaces between non-space characters, a complete white space line, single-space indent, etc.

– Percentage of: white space from the first non-white space on

• Text Features

– Presence of: 3 cells on a line, etc.

– Percentage of: digits (0-9) on a line, alphabet characters (a-z) on a line, header features (year strings, month abbreviations, etc.) on a line

• Separator Features

– Presence of: 4 consecutive periods

– Percentage of: separator characters (-, +, !, =, :, *) on a line

• Conjunction of Features

– Conjunctions: current & previous line, current & next line, next & next-next line
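A rough sketch of how such per-line features might be computed is given below. It covers only a subset of the features listed, and several details are assumptions made for illustration: a "cell" is taken to be a run of text delimited by two or more spaces, only the space character counts as white space, letters of either case count as alphabet characters, and a conjunction is formed by AND-ing the same binary feature on two lines. The function and feature names are hypothetical.

```python
import re

SEPARATOR_CHARS = set("-+!=:*")
DIGITS = set("0123456789")
BINARY = ["four_spaces", "four_space_indent", "single_space_indent",
          "gap_between_text", "blank_line", "three_cells", "four_periods"]


def fraction(count: int, total: int) -> float:
    """Safe ratio that returns 0.0 for an empty line."""
    return count / total if total else 0.0


def line_features(line: str) -> dict:
    """Compute a subset of the white space, text and separator features."""
    text = line.rstrip("\n")
    stripped = text.lstrip(" ")          # text from the first non-space on
    indent = len(text) - len(stripped)
    # Assumption: cells are separated by runs of 2+ spaces.
    cells = [c for c in re.split(r" {2,}", stripped.rstrip()) if c]
    n = len(text)
    return {
        "four_spaces": " " * 4 in text,
        "four_space_indent": indent >= 4,
        "single_space_indent": indent == 1,
        "gap_between_text": re.search(r"\S {2,}\S", text) is not None,
        "blank_line": text.strip() == "",
        "three_cells": len(cells) >= 3,
        "four_periods": "...." in text,
        "pct_space": fraction(stripped.count(" "), len(stripped)),
        "pct_digits": fraction(sum(ch in DIGITS for ch in text), n),
        "pct_alpha": fraction(sum(ch.isascii() and ch.isalpha() for ch in text), n),
        "pct_separators": fraction(sum(ch in SEPARATOR_CHARS for ch in text), n),
    }


def with_conjunctions(lines: list) -> list:
    """Add current&previous, current&next and next&next-next conjunctions."""
    base = [line_features(l) for l in lines]

    def at(i: int) -> dict:
        # Lines outside the document contribute no active features.
        return base[i] if 0 <= i < len(base) else {}

    out = []
    for i in range(len(lines)):
        feats = dict(base[i])
        offsets = {"cur&prev": (i, i - 1), "cur&next": (i, i + 1),
                   "next&nextnext": (i + 1, i + 2)}
        for name, (a, b) in offsets.items():
            for key in BINARY:
                feats[f"{name}:{key}"] = bool(at(a).get(key)) and bool(at(b).get(key))
        out.append(feats)
    return out
```

Calling `with_conjunctions(document.splitlines())` gives one feature dictionary per line, which is the shape of input a line-labelling model would consume.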

Task 1: Table Line Location

• A table line is any line whose label is not NONTABLE, BLANKLINE or SEPARATOR

• F-Measure = (2*Precision * Recall)/(Recall+Precision)

• Both CRFs used a Gaussian Prior and were trained using L-BFGS

• Training set (52 documents), development set (6 documents), test set (62 documents)
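The table-line definition and the F-measure formula above can be combined into a small scoring routine. The sketch below assumes gold and predicted labels are given as two equal-length lists, one label per line; the function names are illustrative, not taken from the paper.

```python
NON_TABLE_LABELS = {"NONTABLE", "BLANKLINE", "SEPARATOR"}


def is_table_line(label: str) -> bool:
    """A table line is any line not labelled NONTABLE, BLANKLINE or SEPARATOR."""
    return label not in NON_TABLE_LABELS


def table_line_scores(gold: list, predicted: list) -> tuple:
    """Return (precision, recall, F-measure) for table line location."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    pairs = list(zip(gold, predicted))
    tp = sum(is_table_line(g) and is_table_line(p) for g, p in pairs)
    fp = sum(not is_table_line(g) and is_table_line(p) for g, p in pairs)
    fn = sum(is_table_line(g) and not is_table_line(p) for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F-Measure = (2 * Precision * Recall) / (Recall + Precision)
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure
```

Note that under this definition a line predicted as TITLE when the gold label is DATAROW still counts as a correctly located table line; only the table versus non-table distinction matters for Task 1.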

Task 2: Line Identification

• How many of these lines were actually table lines?

Additional Results

• Pinto et al. heuristic method

• 4 labels: CAPTIONS, HEADERS, DATA, NON-TABLE

Conclusions

• The table extraction problem has complex linguistic and formatting characteristics. To attack it, a combination of textual and spatial features was used.

• CRFs handle arbitrary, overlapping features well, and offer the combined benefits of conditionally trained models and Markov finite-state context models.
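To illustrate what finite-state context adds over labelling each line independently, here is a minimal Viterbi decoding sketch for a linear-chain model. It is not the paper's implementation: the per-line scores and transition scores are hypothetical numbers standing in for the weighted feature sums a trained CRF would produce, and missing entries are assumed to score 0.

```python
def viterbi(emission: list, transition: dict, labels: list) -> list:
    """Find the highest-scoring label sequence.

    emission:   one dict per line, mapping label -> score (hypothetical values)
    transition: dict mapping (previous_label, label) -> score
    """
    if not emission:
        return []
    best = [{y: emission[0].get(y, 0.0) for y in labels}]
    back = []
    for t in range(1, len(emission)):
        scores, pointers = {}, {}
        for y in labels:
            # Best previous label for reaching y at line t.
            prev = max(labels,
                       key=lambda p: best[-1][p] + transition.get((p, y), 0.0))
            scores[y] = (best[-1][prev] + transition.get((prev, y), 0.0)
                         + emission[t].get(y, 0.0))
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


labels = ["NONTABLE", "DATAROW"]
emission = [{"NONTABLE": 1.0},
            {"NONTABLE": 0.5, "DATAROW": 0.4},
            {"DATAROW": 2.0}]
transition = {("DATAROW", "DATAROW"): 1.0}
decoded = viterbi(emission, transition, labels)
```

In this toy input the middle line slightly prefers NONTABLE on its own, but the reward for staying in DATAROW across consecutive lines tips the joint decision toward DATAROW for that line.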
