Lab 1 - LASI Product Description 1 Lab 1 - LASI Product Description Red Aluan Haddad CS411W Gene H. Price, Janet B. Brunelle 03/18/2013 Version 2

CS411 Lab 1 - cs.odu.edu · Web view2.2.4 Key Type System and Grammatical Representation Foundations17. 2.2.3 Key Algorithms24. 3. Identification of Case Study26. Prototype Description27


Lab 1 - LASI Product Description

Red

Aluan Haddad

CS411W

Gene H. Price, Janet B. Brunelle

03/18/2013

Version 2


Contents

List of Figures
1. Introduction
2. Product Description
2.1 Key Product Features and Capabilities
2.1.1 Deployment and Modularity
2.1.3 Native Source Document Compatibility
2.1.4 Output
2.2 Major Functional Components (Hardware and Software)
2.1.1 Host Platform Software System Requirements
2.2.2 Current off the Shelf Software Components
2.2.4 Key Type System and Grammatical Representation Foundations
2.2.3 Key Algorithms
3. Identification of Case Study
Prototype Description
4.1 Major Functional Components (Hardware and Software)
4.2 Features and Capabilities
Glossary of Terms


List of Figures

Figure 1 the user flow through the LASI GUI
Figure 2 Tornado Chart of words in a paragraph written by Drs. Hester and Myers
Figure 3 demonstrates the use of syntax highlighting in the GUI
Figure 4 demonstrates a statistical view of a parsed document
Figure 6 provides an abstract conceptual description of the syntactic analysis process
Figure 7 loosely describes LASI's hardware and software separation
Figure 8 contrasts the functionality of the prototype with the conceptual package


1. Introduction

LASI (Linguistic Analysis for Subject Identification) is a natural language processing engine that will combine raw lexical analysis heuristics with sophisticated, syntax-aware heuristics and thereby form a basis to extrapolate, interrelate, and abstract statistically derived semantic content over an input domain containing multiple written works in English. The process of linguistic analysis, defined herein as the procedural study of how the words and phrases within a written work compose to form emergent meanings, is the central concept behind the LASI project. The use of language is a constantly evolving, self-describing process which builds on the composition of syntactic rules to express complex, emergent ideas, which in turn compose to form themes: distillations of the relationships literally described by the document.

Essentially, linguistic analysis in this context aims, at least conceptually, to quantify qualitative information sufficiently that trivial disagreements can be dispelled, allowing for faster and more effective decision making. Such analysis tools can provide key services in their role as decision support tools. LASI is such a tool.

The notion of theme refers to emergent, overlapping, intra and inter-textually derived, mental

constructs which represent one of the key bases for human communication. In a sense, themes

provide an abstraction interface which allows for the expression of linguistic ideas.

However, although communicating via thematic abstraction is something without which humans would be unable to express complex ideas to one another, interpreting and expressing themes is often fraught with misunderstandings, conflation, and subject-arbitrary emotional associations. Any of these pitfalls can impede and stifle linguistic communication. For example, consider the case of an author who genuinely expresses a certain theme with eloquence and brevity, yet is criticized for stating something he did not in fact assert. The reference frame of the reader clashed with that of the author in a way neither of them was capable of predicting, perhaps resulting in a mutual perception of disagreement over a subject on which, had the author's words been parameterized differently, they might have wholeheartedly agreed.

Thus, in spite of or perhaps because of the critical role which themes play in communication and expression, a multitude of potentially baseless or conflated concepts are communicated between authors and readers as well as between individual readers. In the case of readers, small differences in their respective interpretations of a work can compound into serious disagreement over what a given author is trying to express. While this has many powerful and sometimes even positive implications, and while it forms one of the key underpinnings of meta-explorative disciplines such as epistemology and philosophy, discord over needless misunderstanding can have very harmful effects in areas where justifiable, imperative decisions must be made based solely on textual perusal. For example, consider a situation in which government agencies or large corporations must carefully determine how to allocate scarce resources, or make time-critical financial or military decisions. As these situations involve multiple individuals doing independent research and then pooling their knowledge, and since some degree of semantic disagreement is inevitable in a relatively democratic environment, serious problems such as needless delays, resource misappropriations, or outright inaction may result and cause severe damage.

2. Product Description

LASI is a software package which aims to provide such decision support and validation via a combination of high-performance, context-aware statistical heuristics and a graphical front end which will allow researchers to glean more meaningful results more quickly from their sources of written information. The pairing of these sophisticated algorithms with a graphical user interface (GUI) will allow the tool to reach and assist a broad range of individuals with widely varying technical backgrounds and areas of research. Additionally, because of the broad societal significance of the problem it approaches, LASI has many potential applications beyond the domain of pure research. For example, its pattern recognition and synonym generalization features could help professors identify plagiarism that has been disguised by basic paraphrasing. Its contextual awareness capabilities, on the other hand, could help students quickly find relevant sources to cite for written assignments. In more in-depth contexts, advanced users such as researchers, like the progenitor of the LASI project, Dr. Patrick T. Hester, can use the algorithms' inferential capabilities to provide their clients with quantitatively verifiable assessments of the complex systems they assess, and thereby provide more specialized recommendations. Broadly speaking, any individual needing to quickly become familiar with a specific area of a broad field of study or knowledge could employ LASI’s functionality to quickly home in on increasingly relevant written resources.

Due to both the selection of C# as the implementation language and Dr. Hester’s use of Windows enterprise software, LASI will initially target Microsoft’s .NET Framework. However, due to the availability of reliable C# framework implementations for non-Windows platforms, the slow but steady transition Microsoft is making toward supporting open source programs, and the conservative selection of core language features used in its implementation, LASI will ultimately be accessible to users of a wide array of software platforms, including Windows 7 and 8, various iterations of Mac OS X, and a multitude of Linux and UNIX-like platforms such as Red Hat and BSD.

2.1 Key Product Features and Capabilities

LASI will feature a number of different assessment techniques over which the user can exert control as desired to extrapolate and construct, from a linear set of word-tokens, a reflexive web of syntactic and semantic associations, all of which will be revised and refined recursively as the engine continues to infer potential relationships.

Before attempting any higher level analysis, a set of syntactic parsing libraries will examine the text and identify, statistically and locally, the likely part of speech of each of the text’s lexical constructs. The result of this phase is a collection of words and phrases which have been categorized by usage and thus mapped to program constructs which encapsulate their syntactic roles.

After this initial step, which results in a dynamic word and phrase behavior driven data model, a

large number of independent statistical functions and element association techniques will be

applied, their results compared, procedures potentially reordered and reevaluated, and finally

interrelated over multiple sources and representations in an attempt to find the common thematic

ideas and shared concepts of the input domain. A key technique that allows this to be

accomplished is the assignment of a variety of numerical weights, both to individual word and

phrase elements and to sets of potentially associated constructs which are iteratively modified

and scaled by each subsequent metric applied.
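The weighting technique described above can be sketched roughly as follows. This is an illustrative Python sketch, not LASI's C# implementation; the function names (`apply_metrics`, `frequency_metric`, `length_metric`) and the scaling rules are hypothetical and chosen only to show how each metric rescales the weights left by the previous one.

```python
from collections import Counter
from typing import Callable, Dict, List

Weights = Dict[str, float]
Metric = Callable[[List[str], Weights], Weights]

def frequency_metric(tokens: List[str], weights: Weights) -> Weights:
    """Hypothetical metric: scale each weight by the word's relative frequency."""
    counts = Counter(tokens)
    total = len(tokens)
    return {w: weights[w] * (1.0 + counts[w] / total) for w in weights}

def length_metric(tokens: List[str], weights: Weights) -> Weights:
    """Hypothetical metric: slightly favor longer words."""
    return {w: weights[w] * (1.0 + 0.01 * len(w)) for w in weights}

def apply_metrics(tokens: List[str], metrics: List[Metric]) -> Weights:
    """Start every word at weight 1.0, then let each metric rescale in turn."""
    weights: Weights = {w: 1.0 for w in set(tokens)}
    for metric in metrics:
        weights = metric(tokens, weights)
    return weights

tokens = "the system analyzes the system text".split()
weights = apply_metrics(tokens, [frequency_metric, length_metric])
```

Because each metric receives the weights produced by its predecessor, reordering the list of metrics, or applying one of them more than once, changes the final values without any change to the metrics themselves.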

2.1.1 Deployment and Modularity

LASI is an open source, portable, and standalone application. After its release version is finalized, its source code will be made freely available under an as yet undetermined open source license. Although much of the functionality it aims to provide is designed to be hardware platform agnostic, a mid-range laptop or desktop computer, currently valued at roughly 600 USD, will be required in order to gain timely results from comparative analysis of multiple documents.

The core of the LASI Framework consists of a set of modular algorithms and data structures which together function as an engine that weaves English textual information into a queryable run-time data structure, thus allowing new statistical functions to be added and new orderings of metrics to be defined with minimal overhead. This facilitates efficient debugging and ever increasing understanding as the LASI team develops the project and, once the source is publicly released, inherently defines a convenient, flexible API with which new programmers can write extensions or add to the base.

Because LASI is designed as a standalone desktop application that can be run on affordable,

commonplace computers, and because it has been designed with open source extensibility and

accepted, familiar GUI conventions in mind, it has the potential to reach beyond the archetypal

science academics that generally make use of heuristic data processing engines.

The primary goal of LASI is to assess, compose, and interrelate large quantities of Sufficiently Contiguous English text. Initially, the domain of input documents will be limited to peer-reviewed research papers and academic journal articles in order to reduce early issues that may arise from the increasingly accepted use of colloquialisms in popular writing. LASI will then process these documents in order to extrapolate thematic content and ultimately, via a concept roughly analogous to set intersection, to determine a valid intertextual commonality between them if one can be found. Because it can be assumed there will be many cases where common patterns are found between documents that have little to do with one another, determining a range of valid thresholds for statistical significance, especially when analyzing the results of some of the more naïve metrics, is a key aspect of the implementation.
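The set-intersection analogy and the role of a significance threshold can be illustrated with a small Python sketch. The names `significant_terms` and `common_themes`, the sample weights, and the threshold values are assumptions made for illustration only; the source does not specify how LASI computes either.

```python
from typing import Dict, List, Set

def significant_terms(weights: Dict[str, float], threshold: float) -> Set[str]:
    """Keep only the terms whose weight meets the significance threshold."""
    return {term for term, weight in weights.items() if weight >= threshold}

def common_themes(documents: List[Dict[str, float]], threshold: float) -> Set[str]:
    """Intersect the significant terms of every document."""
    if not documents:
        return set()
    term_sets = [significant_terms(doc, threshold) for doc in documents]
    return set.intersection(*term_sets)

# Hypothetical per-document term weights.
doc_a = {"resilience": 0.9, "network": 0.7, "the": 0.1}
doc_b = {"resilience": 0.8, "policy": 0.6, "the": 0.2}

shared = common_themes([doc_a, doc_b], threshold=0.5)
too_permissive = common_themes([doc_a, doc_b], threshold=0.05)
```

With the lower threshold, the function word "the" survives the intersection, which is the kind of spurious commonality between unrelated documents that a well-chosen threshold is meant to exclude.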


2.1.2 Native Source Document Compatibility

In terms of common capabilities and user accessibility features, the LASI software package will accept English textual works in multiple popular file formats. Currently, all Microsoft Word document types as well as raw ASCII text files are fully supported. LASI will also eventually provide native support for adding Adobe Acrobat documents directly to user projects. Implementing this functionality has been given a relatively low priority by the team, as it requires that an optical character recognition system be implemented or integrated such that all potentially erroneous characters parsed from an Acrobat document containing scanned text are identified and dealt with before the text is passed to the tagging module.

In addition to parsing the data provided by the user, LASI allows users to provide custom dictionary-like inputs containing weight adjustments, static associations, explicit synonym collections, and syntactic-role overrides for lexical entities in order to facilitate more focused, user-intent-driven results. While this has the advantage of increasing user control over the process and allowing for more customizable selection and arrangement of results, it is inseparably tied to a loss of demonstrable validity, detracting from any assertions developers can make about accuracy and likelihood of bias when shipping an iteration of LASI which provides such a feature. The most agreeable middle ground is probably an approach that allows users to make some adjustments through a properly abstracted interface while providing clear, unmistakable warnings regarding the decreasing verifiability of results. The user interface provides standard, responsive navigation functions that explicitly provide for all of the possible branches, as illustrated by Figure 1.


Figure 1 the user flow through the LASI GUI


2.1.3 Output

The default UI results format will consist of a colorized graphical display of the key results. Tabbed views and context menus will allow the user to filter and organize the results by specifying source documents, word and phrase relationships, and correlation views. A simple but expressive example of such a view is illustrated by Figure 2.

Figure 2 Tornado Chart of words in a paragraph written by Drs. Hester and Myers

An additional results format, also exhibited by the prototype GUI, conveys syntactic information to the user via part-of-speech-colorized syntax highlighting. Although the rasterization below is static, the dynamic run-time representation of textual entities allows detailed association information to be displayed for each instance of each word via syntactic inference. Sample colorized output is shown in Figure 3.

Figure 3 demonstrates the use of syntax highlighting in the GUI

In addition to these contextual, high level views of LASI’s analysis results, more austerely quantitative representations will be provided. Such views have the twofold advantage of providing the user with comprehensive results and serving as a manual validation tool. A prototype example can be seen in Figure 4, which displays the frequency of every word in the document, sorted by part of speech and then by text content.


Figure 4 demonstrates a statistical view of a parsed document

The goal is to provide, for each category of information which the LASI engine can infer from a document, a human-readable view that is not overloaded with numbers, highlights the relevant information, and provides contextual navigation to other perspectives. With this in mind, the prototype screen rasterizations shown in Figure 2, Figure 3, and Figure 4 are only some of the views that are intended. In addition to in-program dynamic result renderings, the LASI UI will facilitate exporting static views of all results to common presentational, tabular, and serialization-oriented file formats such as the Adobe Acrobat and Microsoft Excel formats, as well as simple non-proprietary formats such as CSV (Comma-Separated Values), XML (Extensible Markup Language), and JSON (JavaScript Object Notation). This allows results to be flexibly retained, viewed, and shared independently of the LASI environment itself.
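As a rough illustration of exporting results to the non-proprietary formats mentioned above, the following Python sketch serializes a small result set to CSV and JSON using only the standard library. The record fields (`word`, `part_of_speech`, `count`) and the helper names are hypothetical; the source does not define LASI's export schema.

```python
import csv
import io
import json
from typing import Dict, List, Union

Row = Dict[str, Union[str, int]]
FIELDS = ["word", "part_of_speech", "count"]  # hypothetical export schema

def to_csv(rows: List[Row]) -> str:
    """Render the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

def to_json(rows: List[Row]) -> str:
    """Render the same rows as an indented JSON array."""
    return json.dumps(rows, indent=2)

results: List[Row] = [
    {"word": "system", "part_of_speech": "noun", "count": 4},
    {"word": "analyze", "part_of_speech": "verb", "count": 2},
]
csv_text = to_csv(results)
json_text = to_json(results)
```

Both outputs carry the same records, so a result set exported in one format can be compared against the other as a simple consistency check.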

2.2 Major Functional Components (Hardware and Software)

The requirements for the host operating system are fairly standard, consisting of an up-to-date build of Microsoft Windows 7 Home Premium or above and version 4.5 or above of the Microsoft .NET Framework. Support for non-Microsoft platforms, such as a Red Hat build paired with the Mono Framework, is a planned feature. Support for UNIX platforms via DotGNU is also a future possibility.

The physical hardware requirements being targeted, irrespective of the operating system hosting LASI, are those of a fast but affordable desktop or notebook computer. While some requirements are more flexible than others, the absolute minimum system specifications are a processor with at least four logical cores (via a dual-core Intel Core series processor with hyper-threading enabled, or a quad-core AMD processor) clocked at or above 2.0 GHz, and a minimum of eight gigabytes of DDR3 (Double Data Rate type 3) system memory clocked at or above 1,066 MHz.

For an optimal experience, or for an open source developer experimenting with the code post release, the recommended hardware consists of a processor with at least eight logical cores (via a quad-core Intel Core series processor with hyper-threading enabled, or an eight-core AMD processor); eight gigabytes of low-latency DDR3 memory clocked at or above 1,333 MHz with access timings no greater than 9-9-9; and a solid-state storage medium for document retrieval with at least 128 megabytes of onboard DDR3 cache and a rated random read speed of at least 40 megabytes per second for arbitrary 512-kilobyte data blocks.


2.2.1 Current off the Shelf Software Components

The LASI project library contains source and executable code files from two preexisting open source C# projects. First, LASI incorporates executable code files from b2xtranslator, an open source binary-to-XML file format converter. Specifically, LASI contains two of its child programs: the precompiled executable doc2x, which converts legacy (1997) Microsoft Word .doc files to .docx Open XML files, and the precompiled executable ppt2x, which converts legacy (1997) Microsoft PowerPoint .ppt files to .pptx Open XML files. Both are included and used under the FreeBSD open source license.

Secondly, and far more significantly, LASI contains the part-of-speech tagging library SharpNLP, an open source C# fork of OpenNLP, together with its dynamic link libraries, which are included and used under the GNU Lesser General Public License (LGPL). The methods provided therein are critical to the LASI project, as they are used to convert ASCII text files containing whitespace-delimited word-tokens into tagged files wherein these tokens are represented as serialized Tagged Word Objects (TWO), which contain the original lexical string annotated with embedded syntactic role information.
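To make the idea of a tagged word object concrete, the following Python sketch parses a line of tagged tokens into simple records. The `token/TAG` serialization, the Penn Treebank style tags in the sample, and the names `TaggedWord` and `parse_tagged_line` are all assumptions made for illustration; the source does not specify the format in which the tagger's output is serialized.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class TaggedWord:
    """A lexical string paired with its syntactic role tag."""
    text: str
    tag: str

def parse_tagged_line(line: str) -> List[TaggedWord]:
    """Split whitespace-delimited 'token/TAG' items into TaggedWord records."""
    words = []
    for token in line.split():
        # Split on the LAST slash so tokens such as "and/or" stay intact.
        text, separator, tag = token.rpartition("/")
        if not separator:  # no tag present on this token
            text, tag = token, "UNKNOWN"
        words.append(TaggedWord(text, tag))
    return words

tagged = parse_tagged_line("The/DT engine/NN parses/VBZ text/NN")
```

Each record keeps the original string alongside its tag, so later stages can map the tag to a role-specific type without re-reading the source text.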

The reasons for returning a constructor to an object instead of the object itself in this case

are twofold. First, the pattern of returning a constructor provides beneficial abstraction between

the Word and Phrase types used by the algorithm, only requiring that instantiated objects derive

from the abstract class Word, and secondly, it allows for deferred execution of object

instantiation which can be used with other patterns, such as monadic function composition, to

provide unique and useful behavior not efficiently achieved otherwise.
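The pattern of returning a constructor rather than an instance can be sketched in Python, where classes are themselves callable. The tag-to-type table and the names `constructor_for` and `UnknownWord` are hypothetical, and the plain `Word` base class stands in for the abstract class mentioned above.

```python
from typing import Callable, Dict

class Word:
    """Stand-in for the abstract base type from which all words derive."""
    def __init__(self, text: str) -> None:
        self.text = text

class Noun(Word):
    pass

class Verb(Word):
    pass

class UnknownWord(Word):
    pass

# Hypothetical mapping from part-of-speech tags to word types.
CONSTRUCTORS: Dict[str, Callable[[str], Word]] = {"NN": Noun, "VBZ": Verb}

def constructor_for(tag: str) -> Callable[[str], Word]:
    """Return the constructor for a tag; no object is created here."""
    return CONSTRUCTORS.get(tag, UnknownWord)

make_word = constructor_for("NN")  # nothing has been instantiated yet
word = make_word("engine")         # instantiation is deferred until this call
```

The caller depends only on the promise that the result derives from `Word`, and it decides when, or whether, the object is actually built, which is what allows the constructor to be composed with other functions before it is invoked.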


An additional third-party asset used by LASI, though not strictly software, is Princeton University’s free, manually compiled set of synonym database text files. These files are mapped at runtime to thesaurus constructs which provide various types of synonym lookup. These thesauri make it possible for LASI to generalize many patterns that would otherwise be random or unsafe, thereby potentially improving the quality of results and allowing for significant performance increases.


2.2.4 Key Type System and Grammatical Representation Foundations

The core analysis functionalities LASI implements are built around compositions and

permutations of Enumerable collections of redundantly linked data structures which directly

represent words and phrases as instances of corresponding class types. Figure 4 provides a

detailed view of the static composition of non-syntactic-linked-based representation of a

document object at runtime. In particular note the multidirectional many-to-one and one-to-many

aggregation relationships as well as the deliberate multi-parent and multi-child redundancy

which allow for independent iteration of the document representation from any construct within

it. For example, this allows for useful data abstractions such as functions which can return free

words without the need to store, maintain and return the unused context of the word in the

function which returns it.

Figure 4 illustrates the reflexive of lexical elements


To help overcome the challenge of mapping the fluidity of relationships between words in the English language, it becomes necessary to associate objects which have no suitable direct inheritance relationship. For example, the object of a transitive verb is a role compatible with both nouns and noun phrases, but it is conceptually, and here programmatically, incorrect to have noun and noun phrase share an inheritance relationship: phrases are compositions of words and not words themselves, so causing noun phrase to derive from noun would introduce a literal and intellectual circular dependency and additionally lead to inexpressive, awkwardly written functions. To provide the desired syntactic role generalization between words and phrases which have parallel behaviors but not parallel compositions or derivations, a number of interface types are defined. These allow for elegant coding patterns and a level of abstraction which more closely matches one’s mental concept of their parallel relationships.
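The role of such interface types can be illustrated with a short Python sketch in which an abstract base class plays the part of a C# interface. The names `Entity` and `bind_object` are hypothetical and do not come from the source; the point is only that a noun and a noun phrase share a role without either deriving from the other.

```python
from abc import ABC, abstractmethod
from typing import List

class Entity(ABC):
    """Role interface: anything that can serve as the object of a verb."""
    @property
    @abstractmethod
    def text(self) -> str:
        ...

class Word:
    def __init__(self, text: str) -> None:
        self._text = text

class Phrase:
    """A phrase is a composition of words, not a kind of word."""
    def __init__(self, words: List[Word]) -> None:
        self.words = list(words)

class Noun(Word, Entity):
    @property
    def text(self) -> str:
        return self._text

class NounPhrase(Phrase, Entity):
    @property
    def text(self) -> str:
        return " ".join(w._text for w in self.words)

class Verb(Word):
    def __init__(self, text: str) -> None:
        super().__init__(text)
        self.direct_objects: List[Entity] = []

    def bind_object(self, entity: Entity) -> None:
        """Accept any Entity, whether a single noun or a whole noun phrase."""
        self.direct_objects.append(entity)

verb = Verb("reads")
verb.bind_object(Noun("text"))
verb.bind_object(NounPhrase([Word("the"), Word("long"), Noun("report")]))
objects = [entity.text for entity in verb.direct_objects]
```

`Verb.bind_object` is written once against the role, so no function needs separate overloads for nouns and noun phrases.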


2.2.3 Key Algorithms

The main analysis process can be conceptually broken down into three distinct phases. Each phase is designed to build upon the links established and statistical information gathered during the previous one. It is important to note that, while this linear breakdown provides a useful conceptual view of how the system operates, the component algorithms may be reordered or executed multiple times, depending on the nature of the input set.

Figure 5 provides an abstract conceptual description of the syntactic analysis process


2.2.2.3. A Primary Phase

During the initial phase of analysis, the primary focus will be to build a textually agnostic

set of associations based on the determined part of speech for each token in a document. At this

relatively low level of abstraction, the document will be represented by a text-wise-linear

collection of role specialized word objects where each corresponds to a text token and an

inferred part of speech. Through this, a frequency assessment of the words will be used to

determine an initial, naïve significance metric for each usage of each word.
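A frequency-based starting metric of this kind might look like the following Python sketch. Counting each (word, part of speech) pair separately and using relative frequency as the score are assumptions made for illustration, as are the function name and the sample tags.

```python
from collections import Counter
from typing import Dict, List, Tuple

TaggedToken = Tuple[str, str]  # (text, part-of-speech tag)

def naive_significance(tagged_tokens: List[TaggedToken]) -> Dict[TaggedToken, float]:
    """Score each (word, tag) usage by its share of all tokens in the document."""
    counts = Counter((text.lower(), tag) for text, tag in tagged_tokens)
    total = sum(counts.values())
    return {usage: count / total for usage, count in counts.items()}

tagged = [("Systems", "NNS"), ("model", "VBP"), ("systems", "NNS"), ("model", "NN")]
scores = naive_significance(tagged)
```

Keying on the pair rather than on the bare string keeps "model" as a verb distinct from "model" as a noun, which matches the idea of one score per usage of each word.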

2.2.2.3. B Secondary Phase

After the initial assessment, this naïve metric will be refined through two syntactic

association procedures. First, binding instances of pronouns with the nouns they refer to will

significantly increase the accuracy of the noun weights determined during the primary phase.

Further, by binding adjectives to the nouns they describe, we begin to associate adjective counts

to specific noun instances. Both of these procedures are significant in that they begin the process

of associating together the linear text into constructs, thus beginning to raise the level of

abstraction from individual words to inter-word relationships.
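The two binding procedures can be sketched with deliberately naive heuristics: each pronoun is bound to the nearest preceding noun, and each adjective to the next noun that follows it. These heuristics, the Penn Treebank style tags, and the function name are assumptions for illustration and are far simpler than real pronoun resolution.

```python
from typing import List, Tuple

TaggedToken = Tuple[str, str]  # (text, part-of-speech tag)
Link = Tuple[int, int]         # (index of dependent token, index of noun)

def bind_secondary(tagged_tokens: List[TaggedToken]) -> Tuple[List[Link], List[Link]]:
    """Return (pronoun links, adjective links) as pairs of token indices."""
    pronoun_links: List[Link] = []
    adjective_links: List[Link] = []
    last_noun = None
    pending_adjectives: List[int] = []
    for index, (_, tag) in enumerate(tagged_tokens):
        if tag.startswith("NN"):
            # Attach every adjective seen since the previous noun.
            adjective_links.extend((adj, index) for adj in pending_adjectives)
            pending_adjectives = []
            last_noun = index
        elif tag.startswith("JJ"):
            pending_adjectives.append(index)
        elif tag == "PRP" and last_noun is not None:
            pronoun_links.append((index, last_noun))
    return pronoun_links, adjective_links

sentence = [("The", "DT"), ("fast", "JJ"), ("engine", "NN"), ("runs", "VBZ"),
            ("because", "IN"), ("it", "PRP"), ("scales", "VBZ")]
pronoun_links, adjective_links = bind_secondary(sentence)
```

Once "it" is linked to "engine", the pronoun can be counted as another mention of that noun, which is how binding can sharpen the noun weights from the primary phase.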

2.2.2.3. C Tertiary Phase

Following the secondary phase, we will begin to bind subject and object entities to each other by way of the verbs which associate them. Through this, we can identify and correlate the ways entities are related to one another and the significance of these associations. These linkages form the basis for statistical synonym association and, most significantly, for the themes the LASI analysis engine will identify for the user.
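Subject and object binding can be sketched with a similarly naive positional heuristic: for each verb, the nearest preceding noun is taken as the subject and the nearest following noun as the object. The heuristic, the tags, and the name `bind_svo` are illustrative assumptions rather than the method LASI uses.

```python
from typing import List, Tuple

TaggedToken = Tuple[str, str]    # (text, part-of-speech tag)
Triple = Tuple[str, str, str]    # (subject, verb, object)

def bind_svo(tagged_tokens: List[TaggedToken]) -> List[Triple]:
    """Link each verb to its nearest preceding and nearest following noun."""
    triples: List[Triple] = []
    for index, (text, tag) in enumerate(tagged_tokens):
        if not tag.startswith("VB"):
            continue
        before = reversed(tagged_tokens[:index])
        after = tagged_tokens[index + 1:]
        subject = next((t for t, g in before if g.startswith("NN")), None)
        obj = next((t for t, g in after if g.startswith("NN")), None)
        if subject is not None and obj is not None:
            triples.append((subject, text, obj))
    return triples

sentence = [("Analysts", "NNS"), ("review", "VBP"), ("reports", "NNS"),
            ("and", "CC"), ("agencies", "NNS"), ("fund", "VBP"),
            ("projects", "NNS")]
triples = bind_svo(sentence)
```

Counting how often the same entities recur across such triples is one plausible way the significance of an association could be estimated.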


2.2.2.3. D Progression and Refinement

As more and more constructs are linked together, the relationships between them begin to

complexify into abstract semantic associations allowing for continuing refinement of the results

via multi-pass, iterative refinements and branching assessment of parallel possible implications.

These are designed to form a basis for future higher order algorithm phases to be developed as

abstractions on top of the analysis provided by the three outlined here.

3. Identification of Case Study

The case study which sparked the creation of LASI and gives it a realistic purpose derives from the work of Dr. Patrick T. Hester and Dr. Tom Meyers through their organization NCSOSE (the National Centers for System of Systems Engineering). Through NCSOSE, they provide critical research and analysis services to corporations and government agencies. In this context, they provide information and perspective to help guide high level decisions.

In order to provide such a service, they must combine domain specific insight with strong

independent . They generally spend hours and even days researching the technical aspects

inherent in the complex organizational systems of their clients. A high-performance linguistic

analysis tool that could elucidate and validate the key areas of interest with respect to their client

would help them research a domain efficiently. Additionally, the ability to synthesize a

concise, human-readable synopsis derived orthogonally to its input data, encapsulating the

commonalities between documents and what they focus on in a timely, objective manner, would

be a powerful asset that would increase accuracy, efficiency, and client satisfaction.

4. Prototype Description

The LASI prototype represents a scaled-back version of the real-world solution. Although it retains the majority of the language processing functionality proposed, it has been reduced in scope due to the time constraints imposed by the nature of undergraduate university education. It has, however, been constrained in such a way that the truncated features can be developed and easily integrated in the future.

4.1 Major Functional Components (Hardware and Software)

The major functional components of the LASI project fall into hardware and software categories. As

LASI is an application designed for the desktop, software components make up the bulk of the discrete

elements of the package. Broadly, these consist of a set of composable algorithms that perform weighting,

syntactic element binding, and semantic referencing. On the hardware front, all that is required is a single

personal computer.

Figure 6 loosely describes LASI's hardware and software separation.

MAJOR FUNCTIONAL COMPONENTS

4.2 Features and Capabilities

Although it remains fairly robust, retaining most of its planned core features, the scaled-back

prototype version lacks several noteworthy features. These include a human-readable

explanation of its reasoning process and support for scanned text formats. Additionally, the LASI

prototype will support an input set of at most five documents, while the real-world solution would

be able to handle an arbitrary number of documents.

Figure 7 contrasts the functionality of the prototype with that of the conceptual package.

Glossary of Terms

Theme: A subject-object-verb relationship that LASI is attempting to generate from the input set.

LASI: Linguistic Analysis for Subject Identification.

Document Converter: Takes in DOC and DOCX files and converts them to TXT files.

WordNet: A library of synonym information provided by Princeton University.

Word (noted by a capital W): An instance of the Word class.

Phrase (noted by a capital P): An instance of the Phrase class or one of its descendants.

Phrase: A group of words standing together as a conceptual unit, typically forming a new component.

Analysis: Detailed examination of the elements or structure of something, typically as a basis for interpretation.

Linguistic Analysis: The technical analysis of language.

Tag: A label, or the act of attaching a label, that specifies the role (such as part of speech or location) of a selected element in a document.

Document: A document herein refers to a formally written, expository paper which expounds, via a declarative approach, on a relatively quantifiable issue, goal, or area of research.

Word Weight: A numeric value, associated with each syntactically and lexically unique word in a written work, which indicates the relative significance of that word.

Tornado Chart: A horizontal, bar-graph-like visualization representing the relative frequency or significance of elements, sorted in descending order by magnitude.

Head Word: The locally distinct word within a phrase which, by its syntactic associations, determines the syntactic role of the phrase itself.

C#: A programming language which provides safe, flexible, and performant abstractions across multiple programming paradigms. LASI will initially target Microsoft’s .NET Framework. However, due to the availability of reliable framework implementations for non-Windows platforms, and the conservative selection of core language features used in its implementation, LASI will ultimately target any number of operating platforms, including Windows 7 and 8, Mac OS X, and a multitude of Linux- and Unix-based platforms, including Red Hat and BSD.

Sharp NLP: A natural language processing tool, written in C#, used to parse text and tag parts of speech.

Part of Speech Tagging: The process of binding a part of speech to a word.

Tagged Word Object: A word that has an associated part-of-speech.

Tagged Set: A group of words whose parts of speech have been identified by a parser.

Lexer: A piece of our parsing tool that isolates each word, along with its part of speech and location in a sentence, into machine-readable tokens.

Syntactic Analysis: A form of linguistic analysis that focuses on grammar in sentences and identifies themes based on sentence structure and formatting.

Semantic Analysis: A form of linguistic analysis that identifies key words based on their location in the sentence rather than their overall meaning throughout the document.

Subject Identification: The identification of the main actor in a sentence. However, in a broader sense, the word subject is synonymous with the theme of a document. Subject identification is the process of determining the subjects, or themes, of a document or documents.

Part of Speech Tagger: Software that parses text and assigns each word a label identifying its use.

Semantic Analysis: Relating syntactic structures to the independent meaning of words.

A.I.D. Process: Assessment Improvement Design Process developed and utilized by Dr. Patrick T. Hester and Dr. Tom Meyers to determine problems and solutions.

Strategic Document: A document that defines goals, visions, and mission statements.

Sufficiently Contiguous: The requirement that there must be at least a single paragraph from a single written work for results to be extracted. A corollary, or caveat, of the above is that the same condition applies if a source file is merely stored contiguously but actually contains sentences which, while grammatically correct, are strung together arbitrarily. In both cases, LASI's creators, Red Team, do not provide any guarantee of rational output given nonsensical input. Conversely, Red Team unfortunately cannot guarantee that LASI will recognize and reject all nonsense. That is, invoking LASI on nonsense is not necessarily closed over a semantic nonsense subspace because, when comparing multiple or very long nonsensical files, patterns may emerge and be considered.