Semi-Automatic Content Extraction from Specifications. Krishnaprasad Thirunarayan, Department of Computer Science & Engineering, Wright State University; Aaron Berkovich and Dan Sokol, Cohesia Corporation

Page 1: Semi-Automatic Content Extraction from Specifications

1

Semi-Automatic Content Extraction from Specifications

Krishnaprasad Thirunarayan
Department of Computer Science & Engineering, Wright State University

Aaron Berkovich and Dan Sokol
Cohesia Corporation

Page 2: Semi-Automatic Content Extraction from Specifications

2

Extraction: Summarize in a prescribed vocabulary

[Figure: a spec as text (Spec: Text) shown with its extraction (Spec: SDR) and the Domain Library.]

Page 3: Semi-Automatic Content Extraction from Specifications

3

Participants

Sponsor: National Science Foundation
• SBIR: Phase I and Phase II

Industry: Cohesia Corporation
• Developer of (B2B) content and lower-level infrastructure

University: Wright State University
• User-level tools: conceptualization and design

Others: Geometric Software Solutions, …
• Tool/Product development and integration

Page 4: Semi-Automatic Content Extraction from Specifications

4

Outline

• Background and Goal (What?)
• Motivation (Why?)
• Details (How?)
• Conclusions

Page 5: Semi-Automatic Content Extraction from Specifications

5

Background and Goal

Page 6: Semi-Automatic Content Extraction from Specifications

6

Manual Content Extraction

Input:
• Paper-based specifications of a manufacturing task describing composition, processing, and testing of materials
• Additional constraints imposed by customers and vendors
• Appropriate Ontology and Domain Library defining standard vocabulary

Page 7: Semi-Automatic Content Extraction from Specifications

7

Output: An “equivalent” formalized description of the specs in Specification Definition Representation (SDR)

Observation:
• Specs originating from a common source (ASTM, SAE, GE) share vocabulary and structure.
• An experienced extractor exploits the linguistic patterns found in specs to interpret them.

Page 8: Semi-Automatic Content Extraction from Specifications

8

Assistance for Extraction

[Figure: workflow diagram with elements labeled Paper Document (original), Text Document, Mark-Up Editor (Wizard), SDR Document, and Document Proofer.]

Page 9: Semi-Automatic Content Extraction from Specifications

9

Semi-automatic Content Extraction

Starting from an electronic version of a spec, develop a strategy for semantic markup, to assist in creating an “equivalent” SDR.

Semantic Markup: The task of overlaying an abstract syntax (“the essence”) on the “free-form” text.

• Spec: Human-sensible
• Mark-up: Computer-sensible

Automate routine mechanical tasks.

Page 10: Semi-Automatic Content Extraction from Specifications

10

AEROSPACE SPECIFICATION

TOLERANCES

Corrosion and Heat Resistant Steel, Iron Alloy, Titanium, and Titanium Alloy Bars and Wire

1. SCOPE: This specification covers established inch/pound manufacturing tolerances applicable to corrosion and heat resistant steel, iron alloy, titanium, and titanium alloy bars and wire ordered to inch/pound dimensions. These tolerances apply to all conditions unless otherwise noted. The term excl. is used to apply only to the higher figure of the specified range.

2. DIAMETER AND THICKNESS:
2.1 Cold Finished Bars:
2.1.1 Rounds, Squares, Hexagons, and Octagons (See 2.1.3 and 2.1.4)

TABLE I
Tolerance, Inch

Specified Diameter            Rounds,            Squares, Hexagons, and Octagons,
or Thickness, Inches          plus and minus     minus only
                              (See 2.1.1.1)      (See 2.1.1.2)
Over 0.500 to 1.000, excl     0.002              0.004
1.000                         0.0025             0.004
Over 1.000 to 1.500, excl     0.0025             0.006
1.500 to 2.000, incl          0.003              0.006
Over 2.000 to 3.000, incl     0.003              0.008
Over 3.000 to 4.000, incl     0.003              0.010

2.1.1.1 Size tolerances for round bars are plus and minus as shown in Table I, unless otherwise specified. If required, however, they may be specified all plus and nothing minus, or all minus and nothing plus, or any combination of plus and minus, if the total spread in size tolerance for a specified size is not less than the total spread shown in the table.

2.1.1.2 For titanium and titanium alloys, the difference among the three measurements of the distance between opposite faces of hexagons shall be not greater than one-half the size tolerance and the difference between the measurements of the distance between opposite faces of octagons shall be not greater than the size tolerance.

AS 2241J Issued 5-1-75 Revised 1-1-83

Semantic Mark-up

[Figure: callout labels overlaid on the spec page above: Spec Name, Spec Title, Revision, Revision Date, Qualifier, Characteristic, Value, Values, Procedure.]

Page 11: Semi-Automatic Content Extraction from Specifications

11

Ontology

(Gruber) An ontology is an explicit specification of a conceptualization, which is an abstract, simplified view of the world that we wish to represent for some purpose.

Page 12: Semi-Automatic Content Extraction from Specifications

12

SDL Ontology

[Figure: entity diagram relating Document, Revision, Reference, Layer, Procedure, Characteristic, Value, and Domain Library. Containment links are labeled "1 or many" or "0, 1 or many"; reference links are labeled "Ref: 0, 1 or many".]

Page 13: Semi-Automatic Content Extraction from Specifications

13

Spec: Text Spec: SDR

Extraction: Spec to SDR

Page 14: Semi-Automatic Content Extraction from Specifications

14

Fundamental Obstacles

The relation between the spec and its SDR rendition is “not linear”.
• Same spec information duplicated in SDR in different contexts.
• Contiguous block of information in SDR spread out in spec.

Equivalence of phrases hard to formalize.

Tables and footnotes abbreviate information in irregular and complicated ways.

Page 15: Semi-Automatic Content Extraction from Specifications

15

Linearizing through Abstraction: Introducing Specification Definition Language

[Figure: Original Spec, SDL, and SDR connected by arrows labeled Manual (original), Manual (Ph-I), Compiled (Ph-I), and Literal, Integrated, Semi-automatic (Ph-II).]

Original AMS-4976 spec is 8 pages. Its SDL equivalent is 15 pages.

Original AMS-5662J spec is 11 pages. Its SDR equivalent is 30 pages.

Page 16: Semi-Automatic Content Extraction from Specifications

16

Page 17: Semi-Automatic Content Extraction from Specifications

17

Page 18: Semi-Automatic Content Extraction from Specifications

18

Introducing Extraction Wizard

Page 19: Semi-Automatic Content Extraction from Specifications

19

Motivation (Why?)

Page 20: Semi-Automatic Content Extraction from Specifications

20

Business Background (Supply Chain)

[Figure: supply chain with parties labeled Engine, Forger, and Metal; the labels Drawing and Spec each appear twice between them.]

Page 21: Semi-Automatic Content Extraction from Specifications

21

Diverse and Large number of specs and spec users

[Figure: spec users and their documents. Users: Quality Assurance, Inspecting/Testing, Sales, Engineering. Documents: Certificate of Test (shown twice), Sales Order, Lab Routing, Production Routing.]

Specs: AMS, DIN, JIS, PWA, GE, ASTM, GM, etc.

Page 22: Semi-Automatic Content Extraction from Specifications

22

Quality Issues

Transcription Errors
• From spec to hand-written sheet to computer

Completeness
• Info in spec but missing in SDR

Soundness
• Info in SDR but not in spec

Uniformity of Form

Uniformity in Interpretation
• Different understanding of the meaning while mapping to SDR (Ambiguity/Inconsistency)

Page 23: Semi-Automatic Content Extraction from Specifications

23

Efficiency Issues

Minimize time/effort required.
• Automate routine mechanizable tasks
• Eliminate “cut-paste-modify” cycle

Minimize duplication of information.
• Concise representation: Size of translation = O(Size of spec).
• Update consistency

Flexible rendition into various external forms.

Page 24: Semi-Automatic Content Extraction from Specifications

24

Details (How?)

Page 25: Semi-Automatic Content Extraction from Specifications

25

Essence of our Approach : Literal Translation

Conceptually, every piece of info in SDR owes its existence to phrases in spec.

Enable maintenance of correspondence between spec and its translation, and attempt to embed the translation into spec.

Requires compilation into SDL/SDR. Cf. XML/XSL Technology

Page 26: Semi-Automatic Content Extraction from Specifications

26

Semi-automatic approach is feasible only if the partially generated translations (annotations) are intelligible to an extractor in the context of the original spec and are systematically extensible.

Note that current manual extractions into SDL are not literal even though SDL enables it to an extent.

Page 27: Semi-Automatic Content Extraction from Specifications

27

SDL Studio and its Extension

SDL Studio enables creation and editing of SDL documents. It has facilities to search the domain library and compile SDL into an equivalent SDR. This can be further enriched using:

• Improved Domain Library Search
• Extraction and composition of SDL fragments
• Providing templates for commonly occurring “procedures”
• Table processor, etc.

Page 28: Semi-Automatic Content Extraction from Specifications

28

Domain Library Search Engine

Page 29: Semi-Automatic Content Extraction from Specifications

29

Domain Library

• Currently, it contains technical phrases pertinent to materials and processing requirements.
• Cohesia creates and maintains DLs for in-house use and for use by its clients such as GE, Alcoa, Allvac, etc.
• Typical size: 10,000 phrases

Page 30: Semi-Automatic Content Extraction from Specifications

30

Page 31: Semi-Automatic Content Extraction from Specifications

31

Improving Domain Library Search

Goal: Mapping “equivalent” phrases to the same Domain Library Term (DLT)

Uses:
• Techniques for prefix removal, stemming, and dealing with other variations for root recognition
• Stop words elimination
• Abbreviation expander and alias normalization
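The normalization steps listed on this slide can be sketched roughly as follows. This is only an illustration in Python: the stop-word list, the abbreviation table, and the function name `normalize_phrase` are assumptions made for the example and do not come from the tool itself. The one alias shown (columbium for niobium) is the pair that appears in the matching scenario later in the deck.

```python
import re

# Hypothetical tables for illustration only; the real tool's tables are not shown.
STOP_WORDS = {"of", "the", "for", "and", "to", "in", "a", "an"}
ABBREVIATIONS = {"temp": "temperature", "min": "minimum", "max": "maximum"}
ALIASES = {"columbium": "niobium"}


def normalize_phrase(phrase):
    """Return the normalized content words of a phrase, in order."""
    # Lowercase and keep only alphanumeric runs, which drops punctuation.
    words = re.findall(r"[a-z0-9]+", phrase.lower())
    result = []
    for word in words:
        if word in STOP_WORDS:
            continue  # stop words elimination
        word = ABBREVIATIONS.get(word, word)  # abbreviation expansion
        word = ALIASES.get(word, word)  # alias normalization
        result.append(word)
    return result


print(normalize_phrase("Max. Temp of the Melt"))
print(normalize_phrase("Columbium content"))
```

Two phrases that normalize to the same word list can then be mapped to the same Domain Library Term.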

Page 32: Semi-Automatic Content Extraction from Specifications

32

Algorithm Sketch

List[Phrase] dl; Phrase ip; Int mt;
List[Word] dlwm, inwm;   % with back references
List[Phrase] dlts;
begin
  dl := readAndBuildDomainLibrary();
  dlwm := buildWordMapAndBackLinks(dl);   % delete stop words, link words to DLTs
  (ip, mt) := readInputPhraseAndMatchThreshold();
  inwm := buildWordMap(ip);
  dlts := buildDLTsListContainingMatchedWords(dlwm, inwm);
  dlts := evaluateAndFilterDLTs(dlts, mt);
end;
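As a rough, runnable reading of the sketch above, the word map with back links can be modeled as an inverted index from words to Domain Library Terms. The scoring rule used here (percentage of a term's words found in the input) is an assumption, as are the function names and the stop-word list; the slides do not say how evaluateAndFilterDLTs scores a term. Words are compared exactly in this version, whereas the tool uses the fuzzy wordMatch shown on the next slide.

```python
from collections import defaultdict

STOP_WORDS = {"of", "the", "for", "and", "to", "in", "a", "an"}  # assumed list


def content_words(phrase):
    """Lowercase a phrase and drop stop words."""
    return [w for w in phrase.lower().split() if w not in STOP_WORDS]


def build_word_map(domain_library):
    """Map each content word to the set of DLTs containing it (back links)."""
    word_map = defaultdict(set)
    for term in domain_library:
        for word in content_words(term):
            word_map[word].add(term)
    return word_map


def find_terms(domain_library, input_phrase, threshold):
    """Return DLTs whose share of matched words (0-100) reaches the threshold."""
    word_map = build_word_map(domain_library)
    input_words = set(content_words(input_phrase))
    # Collect every DLT that shares at least one word with the input.
    candidates = set()
    for word in input_words:
        candidates |= word_map.get(word, set())
    scored = []
    for term in candidates:
        term_words = content_words(term)
        matched = sum(1 for w in term_words if w in input_words)
        score = 100 * matched // len(term_words)
        if score >= threshold:
            scored.append((score, term))
    # Best score first; ties broken alphabetically for a stable result.
    return [term for score, term in sorted(scored, key=lambda p: (-p[0], p[1]))]


library = ["tensile strength", "heat treatment", "melt method"]
print(find_terms(library, "Tensile strength after heat treatment", 100))
print(find_terms(library, "melt temperature", 50))
```

Lowering the threshold trades precision for recall, which is the direction the design rationale favors.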

Page 33: Semi-Automatic Content Extraction from Specifications

33

Matching words

Int wordMatch(w1, w2)
begin
  % normalized = vowels deleted, i.e., only consonants present
  if caseUniformAndCleanedMatch(w1, w2) return 100;
  if normalizedMatch(w1, w2) return 90;
  if orderedNormalizedMatch(w1, w2) return 70;
  % analyze for differences due to prefix and suffix
  if normalizedDifferenceInPrefixSuffixTables(w1, w2) return 90;
end;
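One possible Python rendering of wordMatch is given below, under several stated assumptions. The slide does not define orderedNormalizedMatch; here it is read as "same consonants in any order", which would catch transposed letters. The prefix and suffix tables are hypothetical placeholders, and the final `return 0` for "no match" is also an assumption, since the pseudocode has no fall-through value.

```python
import re

VOWELS = set("aeiou")
# Hypothetical affix tables; the slides do not list their contents.
PREFIXES = ("re", "un", "pre")
SUFFIXES = ("ing", "ed", "es", "s")


def clean(word):
    """Lowercase and strip everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", word.lower())


def skeleton(word):
    """Normalized form: the cleaned word with vowels deleted."""
    return "".join(ch for ch in clean(word) if ch not in VOWELS)


def strip_affixes(word):
    """Remove at most one known prefix and one known suffix."""
    word = clean(word)
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) > len(prefix) + 2:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            word = word[: -len(suffix)]
            break
    return word


def word_match(w1, w2):
    """Score two words from 0 to 100, following the order of checks on the slide."""
    if clean(w1) == clean(w2):
        return 100
    s1, s2 = skeleton(w1), skeleton(w2)
    if s1 and s1 == s2:
        return 90  # differ only in vowels
    if s1 and sorted(s1) == sorted(s2):
        return 70  # same consonants, different order (assumed reading)
    a1, a2 = skeleton(strip_affixes(w1)), skeleton(strip_affixes(w2))
    if a1 and a1 == a2:
        return 90  # differ only by a known prefix or suffix
    return 0  # assumed default: no match


print(word_match("colour", "color"))
print(word_match("hardness", "hadrness"))
print(word_match("tested", "testing"))
```

Because a cheap skeleton comparison runs before the affix tables are consulted, most lookups never reach the more expensive step.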

Page 34: Semi-Automatic Content Extraction from Specifications

34

Design Rationale

• Input phrase may contain multiple DLTs.
• DLT words may not appear contiguous in input.
• Consonants are significant, and "correct" spellings may differ in vowels.
• Robustness with respect to spelling errors such as transposition of letters or missing vowels.
• Stemmers do not work satisfactorily for words appearing in DLTs. Instead, create tables customized to deal with prefixes and suffixes that arise in practice, and normalize dynamically.
• Err on the side of recall rather than precision.
• Number of words < Number of DLTs

Page 35: Semi-Automatic Content Extraction from Specifications

35

Extraction Tool

Page 36: Semi-Automatic Content Extraction from Specifications

36

Overall Approach

Preprocessing: Obtain spec in plain text form (from MSWord format).

This is a practical alternative to scanning and OCR-ing a paper-based spec.

Saving it in HTML format has the benefit of isolating tables. On the downside, it retains formatting tags.

Semi-Automatic Extraction: Recognize phrases in spec text that are associated with a requirement and generate SDL fragments to assist an extractor.

Page 37: Semi-Automatic Content Extraction from Specifications

37

Two Possible Avenues (From Document to SDL)

• Iteratively annotate the document text with XML tags reflecting the SDL structure and ontology. Generate various views of the document and SDL from this single XML Master.
• Iteratively generate a sequence of progressively detailed SDL documents from spec text.

Page 38: Semi-Automatic Content Extraction from Specifications

38

First Avenue: Via XML

Semi-automatic extraction is accomplished in two phases:
• Initial automatic markup phase: Systematically recognize domain library terms in spec text and add suitable XML annotations. Then generate a first-cut SDL fragment.
• Subsequent manual conversion phase: Extractor organizes the information and completes the translation into an equivalent SDL.

Further steps: As the tool matures, automation can be attempted to produce more detailed extractions.
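The initial automatic markup phase can be illustrated with a small Python sketch that wraps recognized domain library terms in XML elements. The element names `spec` and `dlt` and the function `tag_terms` are hypothetical; the actual tag vocabulary of the prototype is not given in these slides. The sketch assumes a non-empty term list and uses exact, case-insensitive matching rather than the fuzzy word matching described earlier.

```python
import re
from xml.sax.saxutils import escape


def tag_terms(text, domain_library):
    """Wrap each domain library term found in text in a <dlt> element."""
    # Longest terms first so that multi-word terms win over their sub-terms.
    terms = sorted(domain_library, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b", re.IGNORECASE
    )
    pieces = []
    last = 0
    for match in pattern.finditer(text):
        # Escape untouched text so the result stays well-formed XML.
        pieces.append(escape(text[last:match.start()]))
        pieces.append("<dlt>" + escape(match.group(0)) + "</dlt>")
        last = match.end()
    pieces.append(escape(text[last:]))
    return "<spec>" + "".join(pieces) + "</spec>"


print(tag_terms("Titanium alloy bars & wire", ["titanium alloy", "titanium", "wire"]))
```

Since the original text survives between the tags, a stylesheet can later recover either the plain spec or a first-cut extraction from the same file.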

Page 39: Semi-Automatic Content Extraction from Specifications

39

First Avenue: Via XML (cont’d)

Advantages:
• Focus is on a single persistent XML Master that tries to maintain a link between the spec and the extractions.
• All the processing is orchestrated on this XML file.
• Implements various views of the XML source using XSL-FO and various transformations on the XML source using XSLT.

Page 40: Semi-Automatic Content Extraction from Specifications

40

First Avenue: Via XML (cont’d)

Disadvantages:
• There is a need to manage a separate SDL version to incorporate user inputs and corrections. This is because, even though it may be possible to represent SDL constructs using XML tags, it may not be possible to integrate user edits literally into the XML source.

Page 41: Semi-Automatic Content Extraction from Specifications

41

Semantic-Markup Algorithm

[Figure: sequence of steps: Insert Structure Tags; Insert Ontology Tags; Infer Missing Characteristics; Group Characteristics & Values; Group C-Vs into Procedures.]

Page 42: Semi-Automatic Content Extraction from Specifications

42

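Functional Components

[Figure: component pipeline with the stages Structure Tagger, DLT Tagger (using the Domain Library), Group Tagger, and SDL Converter. The input is a text file, the intermediate files are XML files, and the output is an SDL file.]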

Page 43: Semi-Automatic Content Extraction from Specifications

43

Tagging and Transforming

flex structTagger.flex
gcc lex.yy.c -lfl
a < GE.txt > GE.xml
java org.apache.xalan.xslt.Process -in GE.xml -xsl CSDLStylesheet.xsl -out GE.sdl
…
java org.apache.xalan.xslt.Process -in GE.xml -xsl CExpSDLStylesheet.xsl -out GE.exp.sdl
java org.apache.xalan.xslt.Process -in GE.xml -xsl OriginalStylesheet.xsl -out GE.org.txt

Page 44: Semi-Automatic Content Extraction from Specifications

44

Page 45: Semi-Automatic Content Extraction from Specifications

45

Page 46: Semi-Automatic Content Extraction from Specifications

46

Second Avenue: SDL all along

As there is no obvious way of incorporating SDL edits into the XML source in general, try to generate legal SDL at different levels of detail all along.

Advantage: Yields SDL documents that can be immediately used in Spec Studio and extended by an extractor.

Disadvantage: This form does not retain correspondence with the original document explicitly.

Page 47: Semi-Automatic Content Extraction from Specifications

47

Extraction Tool – Prototype Operation


Page 48: Semi-Automatic Content Extraction from Specifications

48

Page 49: Semi-Automatic Content Extraction from Specifications

49

Views: In the context of Spec
• Plain text view
• Text view with “requirement” phrases color coded and highlighted
• View of domain library terms found in the spec

Views: In the context of SDL
• Spec identity view + Large Note: Method D Extraction
• Method C Extraction
• Procedure view
• Characteristic-value pair view

Page 50: Semi-Automatic Content Extraction from Specifications

50

Extraction Method | Qualifiers                 | Requirements                                             | Procedures | References
D                 | Spec Class Only            | All information in notes                                 | Not used   | In notes
C                 | Spec Class, Product, Alloy | All information in notes                                 | Not used   | In notes
B                 | Many Qualifiers            | Characteristic-Value pairs and notes                     | Used       | Retrieved
A                 | Many Qualifiers            | CV pairs, pre-conditions, permissibility, formulas, etc. | Used       | Retrieved

Page 51: Semi-Automatic Content Extraction from Specifications

51

Additional Standalone Tools

Domain Library Browser
• Given a word or a phrase, display all the domain library information related to it.

SDL Fragment Generator
• Given a sentence, generate an SDL fragment that captures its essence.

These tools can assist an extractor in composing an SDL document incrementally.

Page 52: Semi-Automatic Content Extraction from Specifications

52

Future Work

Page 53: Semi-Automatic Content Extraction from Specifications

53

Longer-term Vision

• Marketplace continues to confirm the need for tools to capture the semantic interpretation of document content.
• Cohesia plans to productize the results of the research into a viable commercial product.

Page 54: Semi-Automatic Content Extraction from Specifications

54

Example Engineering Tasks

How to express and represent templates for well-known “procedures”?
• Alternative to cut-paste-modify cycle
• Examples: Tensile Test, Heat Treatment, Melt Method, Chemistry, Packaging

Page 55: Semi-Automatic Content Extraction from Specifications

55

How to express and represent heterogeneous tables and non-trivial footnotes in a spec in a convenient and uniform way?

How to create, manipulate, and store specs in SDR and SDL among other forms and maintain interoperability?

Page 56: Semi-Automatic Content Extraction from Specifications

56

Example Research Questions

What are the forms of extraction rules?
• Phrase pattern matching
• Theory of equivalence/subsumption

Example: Aliases / Equivalent Phrases
• Creep = Plastic Strain
• Delivery Condition = Surface Finish
• Cause for Rejection = Rejection Criteria
• Imperfections detrimental to usage of product = Free of injurious defects

Page 57: Semi-Automatic Content Extraction from Specifications

57

Rules for interpreting “logic words”
o   Connectives: and, or, …
o   Quantifiers: all, every, each, …
o   Modifiers: over, under, more, less, …
o   Negation: not, no, unless, except, “free of” ...

Mismatch?
• A, B, and C => {A,B,C} (union/OR-logic)

Distributive Laws?
• Lot and order number => lot number and order number
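The distributive reading of "Lot and order number" can be illustrated with a deliberately naive Python sketch. It assumes the pattern "modifier and modifier head" with a single-word head at the end of the phrase; the function name `distribute_head` is hypothetical, and a real rule would need linguistic knowledge to decide when distribution applies at all.

```python
def distribute_head(phrase):
    """Expand 'A and B head' into ['A head', 'B head']; otherwise return the phrase."""
    words = phrase.lower().split()
    for conjunction in ("and", "or"):
        if conjunction in words:
            i = words.index(conjunction)
            left, right = words[:i], words[i + 1:]
            # Need a modifier on the left and at least modifier + head on the right.
            if not left or len(right) < 2:
                break
            head = right[-1]
            return [" ".join(left + [head]), " ".join(right)]
    return [" ".join(words)]


print(distribute_head("Lot and order number"))
print(distribute_head("bars and wire"))
```

The second call is left unchanged because "bars and wire" has no shared head word, which is the kind of case where blind distribution would be wrong.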

Page 58: Semi-Automatic Content Extraction from Specifications

58

Another Example Scenario

Melt Atmosphere = Inert Gas
Sulphur < 2.0%
Niobium < 0.5%

Melt Atmosphere = Argon
Sulphur < 1.7%
Columbium < 0.2%

Buyers’ Purchase Order

Sellers’ Inventory

Match?
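A minimal Python sketch of how such a match might be decided is shown below. It assumes that the first block of requirements belongs to the buyer's purchase order and the second to the seller's inventory, which the slide layout suggests but does not state. The alias table, the is-a table, and all function names are hypothetical; numeric values stand for upper limits in percent.

```python
ALIASES = {"columbium": "niobium"}  # equivalent names for the same element
IS_A = {"argon": "inert gas"}  # subsumption: argon is an inert gas


def canonical(name):
    """Lowercase a name and replace a known alias by its preferred form."""
    name = name.lower()
    return ALIASES.get(name, name)


def value_satisfies(offered, required):
    """True if the offered value equals or specializes the required value."""
    offered, required = canonical(offered), canonical(required)
    return offered == required or IS_A.get(offered) == required


def matches(order, inventory):
    """True if every requirement in the order is met by the inventory item."""
    stock = {canonical(key): value for key, value in inventory.items()}
    for key, required in order.items():
        offered = stock.get(canonical(key))
        if offered is None:
            return False
        if isinstance(required, str):
            if not (isinstance(offered, str) and value_satisfies(offered, required)):
                return False
        # Numeric case: a tighter upper limit implies the looser one.
        elif isinstance(offered, str) or offered > required:
            return False
    return True


order = {"Melt Atmosphere": "Inert Gas", "Sulphur": 2.0, "Niobium": 0.5}
inventory = {"Melt Atmosphere": "Argon", "Sulphur": 1.7, "Columbium": 0.2}
print(matches(order, inventory))
print(matches(inventory, order))
```

Under these assumptions the inventory item satisfies the order but not the other way around, since "inert gas" does not imply "argon" and the looser limits do not imply the tighter ones.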

Page 59: Semi-Automatic Content Extraction from Specifications

59

What are the strategies for searching and matching?
• Top-down: Template-driven expectations
• Bottom-up: Identifying requirements present
• Closure: Manual addition / modification / disambiguation

Page 60: Semi-Automatic Content Extraction from Specifications

60

Relevant Information Extraction Research and Technologies References

• Message Understanding Conferences.
• Work on NLP and IE at UMass, NYU, SRI, etc.
• Search and Filtering tools.

Page 61: Semi-Automatic Content Extraction from Specifications

61

Conclusions

Page 62: Semi-Automatic Content Extraction from Specifications

62

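[Figure: workflow from Spec Text on Paper to SDR. Labeled elements: Paper Scanning, Spec Text as Electronic Image, Optical Character Recognition, Spec Text in HTML/XML, Extraction Wizard, SDL Editor, SDL (XML), SDL Compiler, SDR, and the manual path Read, Interpret, & Type. Parts of the workflow are marked Before, NSF SBIR Phase I, and NSF SBIR Phase II.]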

Page 63: Semi-Automatic Content Extraction from Specifications

63

Appendix

Page 64: Semi-Automatic Content Extraction from Specifications

64

AMS 4928N (Ti Alloy)

Page 65: Semi-Automatic Content Extraction from Specifications

65

Tensile Test

Page 66: Semi-Automatic Content Extraction from Specifications

66