ITTL.ppt-1
Information Technology & Telecommunications Laboratory
Document Type Recognition and Content Summarization
William Underwood
Persistent Archives Testbed Working Meeting
SDSC, La Jolla, CA
Feb 17-18, 2005
ITTL.ppt-2
Overview
• Information Extraction
• Machine learning and recognition of document types
• Content Extraction
• Summarization (Folder titles and Content Notes)
• FOIA Review
ITTL.ppt-3
Access Restriction Architecture
[Architecture diagram; labels recovered from the slide]
• Archivist
• Interface Agent
• Community of Collaborating Intelligent Agents: Reader, Info Extractor, Document Typer, Record Typer, Profiler, Access Restriction Checker, FOIA/PRA Restriction Checker, Summarizer, Learner, Interaction Historian, Advisors
• Control: Agenda
• Blackboard: Document, Archivist's Annotations, Document Context, ASCII Version of Document, Marked-up Document, Document Profile, Document Type, Restrictions, Locations, Rationale, Questions to Archivists, Archivists' Answers, Conclusions
• Domain Knowledge: Office & Staff Names, Family & Friend Names, Lexical Knowledge, Scenario Templates, Ontologies (Political, Military, etc.)
ITTL.ppt-4
Information Extraction
• Information extraction (IE) is a procedure that selects, extracts and combines data from text in order to produce structured information.
• The Named Entity (NE) task is to identify all named persons, organizations, and locations, as well as dates, times, monetary amounts, and percentages, in text.
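A minimal sketch of the NE task for three of the entity types listed above (dates, monetary amounts, percentages). The regular expressions and the function name `find_entities` are illustrative assumptions, not the recognizer used in the project; recognizing persons, organizations, and locations generally needs name lists and context that simple patterns cannot supply.

```python
import re

# Hypothetical patterns for illustration only; they cover a few common
# surface forms and are not the project's actual recognizer.
MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
PATTERNS = {
    "date": re.compile(rf"\b(?:{MONTHS}) \d{{1,2}}, \d{{4}}\b"),
    "money": re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?(?: (?:million|billion))?"),
    "percent": re.compile(r"\b\d+(?:\.\d+)?(?:%| percent\b)"),
}


def find_entities(text):
    """Return (type, matched text, start offset) triples in text order."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.group(0), match.start()))
    return sorted(hits, key=lambda hit: hit[2])


sample = ("Thank you for your letter of March 15, 1990. "
          "The program costs $2.5 million, about 3 percent of the budget.")
for label, value, offset in find_entities(sample):
    print(f"{offset:3d}  {label:8s} {value}")
```

Each hit carries its character offset so that a later step could wrap the matched span in markup.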
ITTL.ppt-5
Letter From George Bush to Ronald Reagan
ITTL.ppt-6
Named Entity Recognition
ITTL.ppt-7
Content Extraction Tasks
• The Template Element (TE) task is to fill in templates about persons and organizations from an automatic analysis of text.
• The Scenario Template (ST) task is to fill in templates about events and their participants (persons, organizations, etc.) from an automatic analysis of text.
ITTL.ppt-8
Content Extraction Applied to Recognizing a Request for Confidential Advice
ITTL.ppt-9
Content Extraction and Access Restriction Rules
Action: Request
Agent: Person
    Job_Title: President
Object: Analysis of the War Powers Resolution
Patient: C. Boyden Gray
    Job_Title: Counsel to the President
Presidential_Advisor: C. Boyden Gray

If Document(X), and
   Action(X) = Request, and
   Agent(X) = Y, and
   (Job_Title(Y) = President, or Presidential_Advisor(Y)), and
   Patient(X) = Z, and
   Presidential_Advisor(Z), and
   Object(X) = Information
Then Access_Restriction(X) = a(5).
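The rule above can be sketched as a check over a filled scenario template. This is only an illustration under stated assumptions: the class names `Person` and `ScenarioTemplate`, the field `object_kind`, and the function `access_restriction` are hypothetical, and classifying "Analysis of the War Powers Resolution" as Information is assumed here rather than derived from an ontology.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Person:
    name: Optional[str] = None          # the slide leaves the Agent unnamed
    job_title: str = ""
    presidential_advisor: bool = False


@dataclass
class ScenarioTemplate:
    action: str
    agent: Person
    patient: Person
    object_text: str
    object_kind: str                    # assumed semantic class of the Object slot


def is_president_or_advisor(person: Person) -> bool:
    return person.job_title == "President" or person.presidential_advisor


def access_restriction(template: ScenarioTemplate) -> Optional[str]:
    """Return "a(5)" when every condition of the rule holds, else None."""
    if (template.action == "Request"
            and is_president_or_advisor(template.agent)
            and template.patient.presidential_advisor
            and template.object_kind == "Information"):
        return "a(5)"
    return None


# The filled template from the slide; object_kind is an assumption.
example = ScenarioTemplate(
    action="Request",
    agent=Person(job_title="President"),
    patient=Person(name="C. Boyden Gray",
                   job_title="Counsel to the President",
                   presidential_advisor=True),
    object_text="Analysis of the War Powers Resolution",
    object_kind="Information",
)
print(access_restriction(example))
```

Keeping the rule as a pure function of the template separates extraction (filling the slots) from review (applying restriction rules), which mirrors the split between the extractor and checker agents named earlier in the deck.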
ITTL.ppt-10
Some Document Types in Bush Presidential Electronic Records
• Agenda
• Biographical Information
• Briefing Memo
• Decision Memo
• Executive Order
• Information Memo
• White House Letter
• List of Candidates for Appointment to Federal Office
• Mailing List
• Minutes of Meeting
• Nomination for Appointment to Federal Office
• Press Release
• Resume
• Schedule
• Telephone Call Recommendation
ITTL.ppt-11
Document Type Recognition
• Convert the document format to ASCII or HTML.
• Use information extraction technology to mark up documents of different types.
• Learn each document type by machine, through grammatical inference.
• Evaluate performance.
• Use the result to recognize the document types of other records.
ITTL.ppt-12
Annotated White House Correspondence
<date>March 27, 1990</date>
<greeting>Dear</greeting><person>Mr. Allen</person>
<p>Thank you very much for your letter of <date>March 15, 1990</date> which
stated your concerns and suggestions regarding the Americans with Disabilities Act.</p>
<p>In order to fulfill <person>President Bush's</person> campaign promise of bringing
Americans with handicaps into the mainstream of American life, the
Bush Administration supports the objectives of the A.D.A.</p>
<p>As you may know, the bill is still in <organization>House Committee</organization>
for consideration and change. You can be sure that your thoughts have been
fully noted and are appreciated.</p>
<formula of respect>Sincerely,</formula of respect>
<person>Doug Wead</person>
<job title>Special Assistant to the President for Public Liaison</job title>
<address><person>Ray Allen</person>, <job title>President</job title>
<organization>American Cultural Traditions</organization>
<postal address>P.O. Box 1895</postal address>
<location>Washington, D.C.</location> <zipcode>20013</zipcode></address>
ITTL.ppt-13
Regular Grammar for the Layout of White House Correspondence
Letter → <date></date> A
A → <greeting></greeting> B
B → <p></p> B
B → <p></p> C
C → <formula of respect></formula of respect> D
D → <person></person> E
E → <job title></job title> F
F → <address></address>
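The grammar above can be run as a recognizer by treating each nonterminal as a state and each tag as an input symbol. The following is a sketch under assumptions: the `GRAMMAR` table and the function `accepts` are illustrative names, the input is a hand-written sequence of top-level tag names rather than parsed markup, and the salutation is treated as a single `greeting` unit because the grammar has no production for a `person` element directly after the greeting.

```python
# Productions from the slide: nonterminal -> list of (tag, next nonterminal).
# None as the next nonterminal marks the end of the letter.
GRAMMAR = {
    "Letter": [("date", "A")],
    "A": [("greeting", "B")],
    "B": [("p", "B"), ("p", "C")],
    "C": [("formula of respect", "D")],
    "D": [("person", "E")],
    "E": [("job title", "F")],
    "F": [("address", None)],
}


def accepts(tags, start="Letter"):
    """Simulate the regular grammar as a nondeterministic finite automaton."""
    states = {start}
    for tag in tags:
        # Follow every production whose terminal matches the current tag.
        states = {nxt
                  for state in states if state is not None
                  for terminal, nxt in GRAMMAR[state]
                  if terminal == tag}
        if not states:
            return False
    # Accept only if some derivation has consumed the final <address>.
    return None in states


letter_layout = ["date", "greeting", "p", "p", "p",
                 "formula of respect", "person", "job title", "address"]
print(accepts(letter_layout))
```

Tracking a set of states handles the two `B` productions that share the terminal `<p>`: after each paragraph the recognizer may either expect another paragraph or move on to the closing formula.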
ITTL.ppt-14
Scope and Content Note for John Sununu’s Files
These files contain correspondence from senior level staff in the Executive Office of the President, and from every member of the Cabinet. The material covers issues that faced the Bush Administration from 1989 to 1990, including abortion / fetal research, the Exxon Valdez oil spill, the savings and loan industry, the Clean Air Act, the White House Conference on Global Climate Change, relations with China following the student demonstrations in Tiananmen Square, the National Drug Control Strategy, the 1990 Bipartisan Budget Agreement, the spotted owl issue, the Americans with Disabilities Act, and the nomination of Supreme Court Justice David Souter. It includes correspondence, routine reports, press releases, press clippings, papers produced by organizations outside the Administration, and speech drafts.
ITTL.ppt-15
Relationship to Persistent Archives Testbed
• Information extraction, document type learning and recognition, and series summarization will be provided as archival services within the NARA Persistent Archives Prototype, and could also be provided within the Persistent Archives Testbed (PAT).
ITTL.ppt-16
Additional Information
• http://perpos.gtri.gatech.edu
• Archival Processing Tools: User Manual
• An Analysis of the Knowledge Required to Perform FOIA and PRA Review, PERPOS Technical Report ITTL/CSITD 04-1, Mar 2004
• PERPOS: Results of Laboratory Experiments and Use by Archivists, Nov 2003
• Recognizing Named Entities in Presidential Electronic Records, PERPOS Technical Report ITTL/CISTD 04-4, June 2004