Semi-Automatic Content Extraction from Specifications. Krishnaprasad Thirunarayan, Department of Computer Science & Engineering, Wright State University. Aaron Berkovich and Dan Sokol, Cohesia Corporation. Extraction: Summarize in a prescribed vocabulary. Spec: Text. Spec: SDR.
1
Semi-Automatic Content Extraction from Specifications
Krishnaprasad Thirunarayan
Department of Computer Science & Engineering, Wright State University
Aaron Berkovich and Dan Sokol
Cohesia Corporation
2
Extraction : Summarize in a prescribed vocabulary
Spec: Text Spec: SDR
Domain Library
3
Participants
Sponsor: National Science Foundation SBIR: Phase I and Phase II
Industry: Cohesia Corporation
  Developer of (B2B) content and lower-level infrastructure
University: Wright State University
  User-level tools: conceptualization and design
Others: Geometric Software Solutions, …
  Tool/Product development and integration
4
Outline
Background and Goal (What?)
Motivation (Why?)
Details (How?)
Conclusions
5
Background and Goal
6
Manual Content Extraction
Input:
  Paper-based specifications of a manufacturing task describing composition, processing, and testing of materials
  Additional constraints imposed by customers and vendors
  Appropriate Ontology and Domain Library defining standard vocabulary
7
Output: An “equivalent” formalized description of specs in Specification Definition Representation (SDR)
Observation: Specs originating from a common source (ASTM, SAE, GE) share vocabulary and structure.
Linguistic patterns found in specs are exploited by an experienced extractor to interpret them.
8
Assistance for Extraction
[Diagram labels: Paper Document, Text Document, Mark-Up Editor (Wizard), SDR Document, Document Proofer, original]
9
Semi-automatic Content Extraction
Starting from an electronic version of a spec, develop a strategy for semantic markup, to assist in creating an “equivalent” SDR.
Semantic Markup: The task of overlaying an abstract syntax (“the essence”) on the “free-form” text.
• Spec: Human-sensible
• Mark-up: Computer-sensible
Automate routine mechanical tasks.
10
AEROSPACE SPECIFICATION
TOLERANCES
Corrosion and Heat Resistant Steel, Iron Alloy, Titanium, and Titanium Alloy Bars and Wire
1. SCOPE: This specification covers established inch/pound manufacturing tolerances applicable to corrosion and heat resistant steel, iron alloy, titanium, and titanium alloy bars and wire ordered to inch/pound dimensions. These tolerances apply to all conditions unless otherwise noted. The term excl. is used to apply only to the higher figure of the specified range.
2. DIAMETER AND THICKNESS:
2.1 Cold Finished Bars:
2.1.1 Rounds, Squares, Hexagons, and Octagons (See 2.1.3 and 2.1.4)
TABLE I
Tolerance, Inch
Specified Diameter or Thickness, Inches | Rounds, plus and minus (See 2.1.1.1) | Squares, Hexagons, and Octagons, minus only (See 2.1.1.2)
Over 0.500 to 1.000, excl | 0.002 | 0.004
1.000 | 0.0025 | 0.004
Over 1.000 to 1.500, excl | 0.0025 | 0.006
1.500 to 2.000, incl | 0.003 | 0.006
Over 2.000 to 3.000, incl | 0.003 | 0.008
Over 3.000 to 4.000, incl | 0.003 | 0.010
2.1.1.1 Size tolerances for round bars are plus and minus as shown in Table I, unless otherwise specified. If required, however, they may be specified all plus and nothing minus, or all minus and nothing plus, or any combination of plus and minus, if the total spread in size tolerance for a specified size is not less than the total spread shown in the table.
2.1.1.2 For titanium and titanium alloys, the difference among the three measurements of the distance between opposite faces of hexagons shall be not greater than one-half the size tolerance and the difference between the measurements of the distance between opposite faces of octagons shall be not greater than the size tolerance.
AS 2241J Issued 5-1-75 Revised 1-1-83
Semantic Mark-up
[Callout labels overlaid on the spec text: Spec Name, Spec Title, Revision, Revision Date, Qualifier, Procedure, Characteristic, Value, Values]
11
Ontology
(Gruber) An ontology is an explicit specification of a conceptualization, which is an abstract, simplified view of the world that we wish to represent for some purpose.
12
SDL Ontology
[Diagram: entities Document, Revision, Reference, Layer, Procedure, Characteristic, Value, and Domain Library, connected by links annotated “1 or many”, “0, 1 or many”, and “Ref: 0, 1 or many”]
13
Spec: Text Spec: SDR
Extraction: Spec to SDR
14
Fundamental Obstacles
The relation between the spec and its SDR rendition is “not linear”.
  Same spec information duplicated in SDR in different contexts.
  Contiguous block of information in SDR spread out in spec.
Equivalence of phrases hard to formalize.
Tables and footnotes abbreviate information in irregular and complicated ways.
15
Linearizing through Abstraction: Introducing Specification Definition Language
[Diagram labels: Original Spec, SDL, SDR; Manual (original); Manual (Ph-I); Compiled (Ph-I); Literal, Integrated, Semi-automatic (Ph-II)]
Original AMS-4976 spec is 8 pages. Its SDL equivalent is 15 pages.
Original AMS-5662J spec is 11 pages. Its SDR equivalent is 30 pages.
16
17
18
Introducing Extraction Wizard
19
Motivation (Why?)
20
Business Background (Supply Chain)
[Diagram labels: Engine, Metal, Forger, Drawing, Spec]
21
Diverse and large number of specs and spec users
[Diagram labels: Quality Assurance, Inspecting/Testing, Sales, Engineering, Certificate of Test, Sales Order, Lab Routing, Production Routing]
Specs: AMS, DIN, JIS, PWA, GE, ASTM, GM, etc.
22
Quality Issues
Transcription Errors
  From spec to hand-written sheet to computer
Completeness
  Info in spec but missing in SDR
Soundness
  Info in SDR but not in spec
Uniformity of Form
Uniformity in Interpretation
  Different understanding of the meaning while mapping to SDR (Ambiguity/Inconsistency)
23
Efficiency Issues
Minimize time/effort required.
  Automate routine mechanizable tasks
  Eliminate “cut-paste-modify” cycle
Minimize duplication of information.
  Concise representation
  Size of translation = O(Size of spec).
  Update consistency
Flexible rendition into various external forms.
24
Details (How?)
25
Essence of our Approach : Literal Translation
Conceptually, every piece of info in SDR owes its existence to phrases in spec.
Enable maintenance of correspondence between spec and its translation, and attempt to embed the translation into spec.
Requires compilation into SDL/SDR. Cf. XML/XSL Technology
26
A semi-automatic approach is feasible only if the partially generated translations (annotations) are intelligible to an extractor in the context of the original spec, and are systematically extensible.
Note that current manual extractions into SDL are not literal even though SDL enables it to an extent.
27
SDL Studio and its Extension
SDL Studio enables creation and editing of SDL documents. It has facilities to search the domain library and compile SDL into an equivalent SDR. This can be further enriched using
  Improved Domain Library Search
  Extraction and composition of SDL fragments
  Providing templates for commonly occurring “procedures”
  Table processor etc …
28
Domain Library Search Engine
29
Domain Library
Currently, it contains technical phrases pertinent to materials and processing requirements.
Cohesia creates and maintains DLs for in-house use and for use by its clients such as GE, Alcoa, Allvac, etc.
Typical size: 10,000 phrases
30
31
Improving Domain Library Search
Goal: Mapping “equivalent” phrases to the same Domain Library Term
Uses:
  Techniques for prefix removal, stemming, and dealing with other variations for root recognition
  Stop words elimination
  Abbreviation expander and alias normalization
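The normalization steps listed above could be sketched roughly as follows. This is only an illustration, not the tool's actual code: the stop-word list, the abbreviation and alias tables, the affix tables, and the function names are all hypothetical stand-ins for tables that a real domain library would supply.

```python
import re

# Hypothetical tables, for illustration only.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}
ABBREVIATIONS = {"max": "maximum", "min": "minimum", "temp": "temperature"}
ALIASES = {"columbium": "niobium"}
PREFIXES = ("pre", "non")
SUFFIXES = ("ing", "ed", "es", "s")


def root(word):
    """Strip at most one known prefix and one known suffix (naive root recognition)."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= 4:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[:-len(suffix)]
            break
    return word


def normalize_phrase(phrase):
    """Lower-case, drop stop words, expand abbreviations and aliases, reduce to roots."""
    roots = []
    for word in re.findall(r"[a-z0-9]+", phrase.lower()):
        if word in STOP_WORDS:
            continue
        word = ABBREVIATIONS.get(word, word)
        word = ALIASES.get(word, word)
        roots.append(root(word))
    return roots


print(normalize_phrase("Preheating of the bars"))  # same roots as "Heated bar"
print(normalize_phrase("Max. Temp."))
```

Naive affix stripping of this kind will over-strip some words (for example, any word that merely happens to start with "pre"), which is one reason to drive it from tables tuned to the vocabulary that actually occurs in specs.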
32
Algorithm Sketch
List[Phrase] dl;
Phrase ip; Int mt;
List[Word] dlwm, inwm;   % with back references
List[Phrase] dlts;
begin
  dl := readAndBuildDomainLibrary();
  dlwm := buildWordMapAndBackLinks(dl);   % delete stop words, link words to DLTs
  (ip,mt) := readInputPhraseAndMatchThreshold();
  inwm := buildWordMap(ip);
  dlts := buildDLTsListContainingMatchedWords(dlwm,inwm);
  dlts := evaluateAndFilterDLTs(dlts,mt);
end;
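One possible reading of this sketch in Python is given below. It builds a word map with back links from each word to the domain library terms (DLTs) containing it, collects the DLTs that share a word with the input phrase, and filters them by a threshold. The scoring rule (percentage of a DLT's words found in the input), the exact-equality word comparison, and all names are assumptions made for illustration; the slides do not specify them.

```python
from collections import defaultdict

# Hypothetical stop-word list.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}


def words_of(phrase):
    """Split a phrase into lower-case words, dropping stop words."""
    return [w for w in phrase.lower().split() if w not in STOP_WORDS]


def build_word_map(domain_library):
    """Map each word to the set of DLTs that contain it (the 'back links')."""
    word_map = defaultdict(set)
    for term in domain_library:
        for word in words_of(term):
            word_map[word].add(term)
    return word_map


def find_terms(domain_library, input_phrase, match_threshold):
    """Return (DLT, score) pairs whose score meets the threshold, best first."""
    word_map = build_word_map(domain_library)
    input_words = set(words_of(input_phrase))

    # Candidate DLTs: any term sharing at least one word with the input.
    candidates = set()
    for word in input_words:
        candidates |= word_map.get(word, set())

    # Assumed scoring: percentage of the DLT's words present in the input.
    results = []
    for term in candidates:
        term_words = words_of(term)
        matched = sum(1 for w in term_words if w in input_words)
        score = 100 * matched // len(term_words)
        if score >= match_threshold:
            results.append((term, score))
    return sorted(results, key=lambda r: (-r[1], r[0]))


library = ["tensile strength", "heat treatment", "yield strength"]
print(find_terms(library, "the tensile and yield strength of the bar", 100))
```

Because lookup goes through the word map rather than scanning every DLT, the work per query depends on the number of input words and their candidate lists, which fits the observation that the number of distinct words is smaller than the number of DLTs.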
33
Matching words
Int wordMatch(w1,w2)
begin
  % normalized = vowels deleted, i.e., only consonants present
  if caseUniformAndCleanedMatch(w1,w2) return 100;
  if normalizedMatch(w1,w2) return 90;
  if orderedNormalizedMatch(w1,w2) return 70;
  % analyze for differences due to prefix and suffix
  if normalizedDifferenceInPrefixSuffixTables(w1,w2) return 90;
end;
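A runnable approximation of wordMatch might look like the following. Several points are assumptions rather than details from the slides: "ordered normalized match" is read here as comparing the consonants after sorting them (to tolerate transposed letters), a non-matching pair returns 0, and the prefix and suffix tables are small hypothetical examples.

```python
import re

VOWELS = set("aeiou")
# Hypothetical affix tables; real ones would be tuned to the domain library.
PREFIXES = ("pre", "non")
SUFFIXES = ("ing", "ed", "es", "s")


def clean(word):
    """Lower-case and drop anything that is not a letter or digit."""
    return re.sub(r"[^a-z0-9]", "", word.lower())


def consonants(word):
    """Normalized form: the cleaned word with vowels deleted."""
    return "".join(c for c in clean(word) if c not in VOWELS)


def strip_affixes(word):
    """Remove at most one known prefix and one known suffix."""
    w = clean(word)
    for prefix in PREFIXES:
        if w.startswith(prefix) and len(w) - len(prefix) >= 4:
            w = w[len(prefix):]
            break
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            w = w[:-len(suffix)]
            break
    return w


def word_match(w1, w2):
    """Score how well two words match, on the 0-100 scale used in the sketch."""
    if clean(w1) == clean(w2):
        return 100
    c1, c2 = consonants(w1), consonants(w2)
    if c1 and c1 == c2:
        return 90
    if c1 and sorted(c1) == sorted(c2):  # assumed meaning of "ordered" match
        return 70
    s1, s2 = consonants(strip_affixes(w1)), consonants(strip_affixes(w2))
    if s1 and s1 == s2:
        return 90
    return 0  # assumed default when nothing matches


print(word_match("gray", "grey"), word_match("hardness", "hadrness"))
```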
34
Design Rationale
Input phrase may contain multiple DLTs.
DLT words may not appear contiguous in input.
Consonants are significant, and "correct" spellings may differ in vowels.
Robustness with respect to spelling errors such as transposition of letters or missing vowels.
Stemmers do not work satisfactorily for words appearing in DLTs. Instead, create tables customized to deal with prefixes and suffixes that arise in practice, and normalize dynamically.
Err on the side of recall rather than precision.
Number of words < Number of DLTs
35
Extraction Tool
36
Overall Approach
Preprocessing: Obtain spec in plain text form (from MSWord format).
This is a practical alternative to scanning and OCR-ing a paper-based spec.
Saving it in HTML format has the benefit of isolating tables. On the downside, it retains formatting tags.
Semi-Automatic Extraction: Recognize phrases in spec text that are associated with a requirement and generate SDL fragments to assist an extractor.
37
Two possible Avenues (From Document to SDL)
Iteratively annotate the document text with XML tags reflecting the SDL structure and ontology.
  Generate various views of the document and SDL from this single XML Master.
Iteratively generate a sequence of progressively detailed SDL documents from spec text.
38
First Avenue: Via XML
Semi-automatic extraction is accomplished in two phases:
  Initial automatic markup phase: Systematically recognize domain library terms in spec text and add suitable XML annotations. Then generate a first-cut SDL fragment.
  Subsequent manual conversion phase: Extractor organizes the information and completes the translation into an equivalent SDL.
Further steps: As the tool matures, automation can be attempted to produce more detailed extractions.
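The initial automatic markup phase can be pictured with a small sketch like the one below, which wraps each recognized domain library term in an XML element while leaving the surrounding spec text in place. The element and attribute names (spec, dlt, term) and the function name are invented for illustration; the slides do not give the actual tag vocabulary, and the matching here is plain case-insensitive phrase lookup rather than the fuzzy matching described earlier.

```python
import re
from xml.sax.saxutils import escape, quoteattr


def tag_domain_terms(text, domain_terms):
    """Wrap each occurrence of a domain library term in a <dlt> element."""
    if not domain_terms:
        return "<spec>" + escape(text) + "</spec>"

    # Try longer terms first so "titanium alloy" wins over "titanium".
    terms = sorted(domain_terms, key=len, reverse=True)
    lookup = {t.lower(): t for t in domain_terms}
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )

    parts = []
    last = 0
    for m in pattern.finditer(text):
        parts.append(escape(text[last:m.start()]))  # untouched spec text
        canonical = lookup[m.group(0).lower()]
        parts.append(
            "<dlt term=" + quoteattr(canonical) + ">" + escape(m.group(0)) + "</dlt>"
        )
        last = m.end()
    parts.append(escape(text[last:]))
    return "<spec>" + "".join(parts) + "</spec>"


print(tag_domain_terms("Titanium and titanium alloy bars < 2 inch",
                       ["titanium", "titanium alloy"]))
```

Keeping the original wording inside the element and the canonical term in an attribute is one way to preserve the link between the spec text and the extraction that the XML Master is meant to maintain.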
39
Advantages:
  Focus is on a single persistent XML Master that tries to maintain a link between the spec and the extractions.
  All the processing is orchestrated on this XML file.
  Implements various views of the XML source using XSL-FO and various transformations on the XML source using XSLT.
(cont’d)
40
Disadvantages:
  There is a need to manage a separate SDL version to incorporate user inputs and corrections. This is because, even though it may be possible to represent SDL constructs using XML tags, it may not be possible to integrate user edits literally into the XML source.
(cont’d)
41
Semantic-Markup Algorithm
[Flow diagram steps: Insert Structure Tags; Insert Ontology Tags; Infer Missing Char.; Group Char. & Values; Group C-Vs into Procedures]
42
Functional Components
[Diagram labels: Structure Tagger, DLT Tagger, Group Tagger, SDL Converter; Text file, XML file, XML file, XML file, SDL file; Domain Library]
43
Tagging and Transforming
flex structTagger.flex
gcc lex.yy.c -lfl
a < GE.txt > GE.xml
java org.apache.xalan.xslt.Process -in GE.xml -xsl CSDLStylesheet.xsl -out GE.sdl
…
java org.apache.xalan.xslt.Process -in GE.xml -xsl CExpSDLStylesheet.xsl -out GE.exp.sdl
java org.apache.xalan.xslt.Process -in GE.xml -xsl OriginalStylesheet.xsl -out GE.org.txt
44
45
46
Second Avenue: SDL all along
As there is no obvious way of incorporating SDL edits into the XML source in general, try to generate legal SDL at different levels of detail all along.
Advantage: Yields SDL documents that can be immediately used in Spec Studio and extended by an extractor.
Disadvantage: This form does not retain correspondence with the original document explicitly.
47
Extraction Tool – Prototype Operation
48
49
Views: In the context of Spec
  Plain text view
  Text view with “requirement” phrases color coded and highlighted
  View of domain library terms found in the spec
Views: In the context of SDL
  Spec identity view + Large Note: Method D Extraction, Method C Extraction
  Procedure view
  Characteristic-value pair view
50
Extraction Method | Qualifiers | Requirements | Procedures | References
D | Spec Class Only | All information in notes | Not used | In notes
C | Spec Class, Product, Alloy | All information in notes | Not used | In notes
B | Many Qualifiers | Characteristic-Value pairs and notes | Used | Retrieved
A | Many Qualifiers | CV pairs, pre-conditions, permissibility, formulas, etc | Used | Retrieved
51
Additional Standalone Tools
Domain Library Browser
  Given a word or a phrase, display all the domain library information related to it.
SDL Fragment Generator
  Given a sentence, generate an SDL fragment that captures its essence.
These tools can assist an extractor in composing an SDL document incrementally.
52
Future Work
53
Longer-term Vision
Marketplace continues to confirm the need for tools to capture the semantic interpretation of document content.
Cohesia plans to productize the results of the research into a viable commercial product.
54
Example Engineering Tasks
How to express and represent templates for well-known “procedures”?
  Alternative to cut-paste-modify cycle
  Tensile Test
  Heat Treatment
  Melt Method
  Chemistry
  Packaging
55
How to express and represent heterogeneous tables and non-trivial footnotes in a spec in a convenient and uniform way?
How to create, manipulate, and store specs in SDR and SDL among other forms and maintain interoperability?
56
Example Research Questions
What are the forms of extraction rules?
  Phrase pattern matching
  Theory of equivalence/subsumption
Example: Aliases / Equivalent Phrases
  Creep = Plastic Strain
  Delivery Condition = Surface Finish
  Cause for Rejection = Rejection Criteria
  Imperfections detrimental to usage of product = Free of injurious defects
57
Rules for interpreting “logic words”
  o Connectives: and, or, …
  o Quantifiers: all, every, each, …
  o Modifiers: over, under, more, less, …
  o Negation: not, no, unless, except, “free of” ...
Mismatch?
  • A, B, and C => {A,B,C} union/OR-logic
Distributive Laws?
  • Lot and order number => lot number and order number
58
Another Example Scenario
Buyers’ Purchase Order
  Melt Atmosphere = Inert Gas
  Sulphur < 2.0%
  Niobium < 0.5%
Sellers’ Inventory
  Melt Atmosphere = Argon
  Sulphur < 1.7%
  Columbium < 0.2%
Match?
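To make the “Match?” question concrete, the sketch below checks the inventory item against the purchase order using two small hand-written tables: an alias table (Columbium is an older name for Niobium) and a subsumption table (Argon is an inert gas). The tables, the function names, and the treatment of every numeric entry as an upper limit in percent are assumptions for illustration, not a method described in the slides.

```python
# Hypothetical knowledge tables.
ALIASES = {"columbium": "niobium"}   # alias -> canonical name
IS_A = {"argon": "inert gas"}        # specific value -> more general value


def canonical(name):
    """Lower-case a name and replace a known alias by its canonical form."""
    key = name.strip().lower()
    return ALIASES.get(key, key)


def value_satisfies(offered, required):
    """A symbolic value satisfies a requirement if equal to it or subsumed by it."""
    o, r = canonical(offered), canonical(required)
    return o == r or IS_A.get(o) == r


def matches(order, inventory):
    """True if every requirement in the order is met by the inventory item.

    Strings are symbolic values; numbers are upper limits ("< x%").
    """
    stock = {canonical(k): v for k, v in inventory.items()}
    for name, required in order.items():
        offered = stock.get(canonical(name))
        if offered is None:
            return False
        if isinstance(required, str):
            if not isinstance(offered, str) or not value_satisfies(offered, required):
                return False
        else:
            # A tighter (smaller) offered limit implies the required limit.
            if isinstance(offered, str) or offered > required:
                return False
    return True


order = {"Melt Atmosphere": "Inert Gas", "Sulphur": 2.0, "Niobium": 0.5}
inventory = {"Melt Atmosphere": "Argon", "Sulphur": 1.7, "Columbium": 0.2}
print(matches(order, inventory))
```

Under these assumptions the inventory item satisfies the order, but not the other way around: an order asking specifically for Argon would not be met by an item described only as melted under inert gas.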
59
What are the strategies for searching and matching?
  Top-down: Template-driven expectations
  Bottom-up: Identifying requirements present
  Closure: Manual addition / modification / disambiguation
60
Relevant Information Extraction Research and Technologies
References
  Message Understanding Conferences.
  Work on NLP and IE at UMass, NYU, SRI, etc.
  Search and Filtering tools.
61
Conclusions
62
[Diagram labels: Before; NSF SBIR Phase I; NSF SBIR Phase II; Spec Text on Paper; Paper Scanning; Spec Text as Electronic Image; Optical Character Recognition; Spec Text in HTML/XML; Extraction Wizard; Read, Interpret, & Type; SDL Editor; SDL (XML); SDL Compiler; SDR]
63
Appendix
64
AMS 4928N (Ti Alloy)
65
Tensile Test
66