Natural language processing (and computational linguistics)
• Computational linguistics: computer science, theory, cognition, algorithms
• Natural language processing: software development, application, practical techniques
• Computer methods and their usefulness (or uselessness) for human language processing (textual, spoken, gestural, etc.)
• Implementation of techniques, procedures, and algorithms for language computation
• Enabling human–machine communication; enhancing human–human communication
[Diagram: NLP at the intersection of computer science, psychology/cognitive science, linguistics, math/statistics, philosophy, and communication]
• Tokenization
• Part-of-speech tagging
• Computational morphology
• Syntactic parsing
• Lexical relations
• Dialogue move engines
• Dialectizer
• Speech recognition (speech to text)
• Speech synthesis (text to speech)
• Diacritization, Romanization
• Corpus annotation (Syriac)
• Thought identification
• Question answering
• Summarization
• Natural language generation
• Machine translation
• Spoken language identification
• Spoken language translation
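To make the first task on the list concrete, tokenization can be sketched in a few lines. This is a naive regex approach for illustration only, not the course's actual implementation; the function name `tokenize` is our own.

```python
import re

def tokenize(text):
    """Split raw text into word and punctuation tokens (a naive sketch)."""
    # \w+ grabs runs of word characters; [^\w\s] grabs single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("Dr. Smith's car won't start."))
```

Even this toy version shows why tokenization is nontrivial: abbreviations and clitics ("Dr.", "won't") are split in ways a real tokenizer must handle deliberately.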
• Humanities, natural and behavioral sciences, and engineering
• Linguistics, computer science, psychology, and mathematics
• Theory and practice, science and art
• Models, foundations vs. corpora, data (top-down vs. bottom-up)
• Math: statistics, calculus, algebra, modelling
• Computational paradigms: connectionist, rule-based, cognitively plausible
• Linguistics: LFG, HPSG, GB, OT, CG, etc.
• Architectures: stacks, automata, networks, compilers
• Several approaches implemented and taught here
• Homegrown: analogical modeling (AM)
• State-of-the-art performance in various applications for various languages:
  • Written language identification
  • Part-of-speech tagging
  • Morpheme boundary detection
  • Named entity recognition
  • Word sense disambiguation
  • Shallow parsing
  • Semantic role labeling
  • Spoken language identification
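As one illustration of the listed applications, written language identification can be sketched with character n-gram profiles. This is a generic textbook technique, not the analogical-modeling (AM) system described here; the profiles and function names are our own.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Count overlapping character trigrams, with padding spaces at the edges."""
    text = f" {text.lower()} "
    return Counter(text[i:i+n] for i in range(len(text) - n + 1))

def identify(text, profiles):
    """Pick the language whose training profile shares the most n-gram mass."""
    grams = char_ngrams(text)
    def overlap(profile):
        return sum(min(count, profile[g]) for g, count in grams.items())
    return max(profiles, key=lambda lang: overlap(profiles[lang]))

# Tiny toy "training data"; real systems use large corpora per language.
profiles = {
    "en": char_ngrams("the quick brown fox jumps over the lazy dog and the cat"),
    "fr": char_ngrams("le renard brun saute par-dessus le chien paresseux et le chat"),
}
print(identify("the dog and the fox", profiles))  # en
```

With realistic training corpora, this simple overlap scheme already discriminates many written languages reliably.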
[UML-style class diagram: a Car has Year, Make, Model, Mileage, Price, and Feature attributes; a PhoneNr (with Extension) is for a Car; cardinalities include 1..*, 0..1, and 0..*]
• Work on information extraction (data-rich text, web)
• Recognition and extraction of low-level data elements
• Ontology-based
• Related applications: ontology generation, text similarity and classification, information integration, etc.
• NSF-funded
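As a toy illustration of recognizing low-level data elements, the sketch below pulls a few fields out of obituary-like text with regular expressions. The field names and patterns are our own invention; a real ontology-based extractor would also draw on lexicons (e.g. place names) and document structure.

```python
import re

# Toy recognizers for low-level data elements; patterns are illustrative only.
PATTERNS = {
    "age":        re.compile(r"\baged?\s+(\d{1,3})\b"),
    "death_date": re.compile(r"\bdied\s+(?:on\s+)?(\w+ \d{1,2}, \d{4})"),
    "phone":      re.compile(r"\b(\d{3}-\d{3}-\d{4})\b"),
}

def extract(text):
    """Return the first match for each recognized field in the text."""
    facts = {}
    for field, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            facts[field] = match.group(1)
    return facts

print(extract("John Doe, aged 87, died on March 3, 1999. Call 801-555-0142."))
```

The issues listed on the next slide (typos, unnamed deceased, factored lists, anaphora) are exactly where pattern-level extraction like this breaks down.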
Results and issues
• Corpus of 1,500 obituaries, 500 hand-annotated
• Preliminary evaluation on a few features: name, age, title, birth date, death date, death place, funeral time/location
• Results: around 80% precision, slightly less on recall
• Lexicon coverage (especially place names)
• Occasional typos
• Deceased sometimes not named
• Factored lists: "Pierre et Marie, son fils et belle-fille" ("Pierre and Marie, his son and daughter-in-law")
• Anaphora resolution: "Né à Paris et y décédé…" ("Born in Paris and died there…")
[Table excerpt: extracted family facts, e.g. grandchildren of Mary Ely]
• Number of facts extracted: 22,251
  • 8,740 Person-BirthDate facts
  • 3,803 Person-DeathDate facts
  • 9,708 children facts, including:
    ▪ 5,020 Child-has-parent-Person facts
    ▪ 2,394 Son-of-Person facts
    ▪ 2,294 Daughter-of-Person facts
• Number of implied grandchild facts inferred: 5,277
• Processing time: ~18 seconds per page; CPU time: ~4 hours
• Precision: 0.52 (spot-checking 100 of the 22,251 facts)
• Recall: 0.33 and precision: 0.40 (spot-checking 2 fact-filled family pages)
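The implied grandchild facts can be derived by joining the parent-of facts with themselves. A minimal sketch, with hypothetical data and a function name of our own (this is not the project's actual inference pipeline):

```python
def infer_grandchildren(parent_of):
    """parent_of: set of (parent, child) pairs.
    Returns the set of (grandparent, grandchild) pairs implied by a self-join."""
    children = {}
    for parent, child in parent_of:
        children.setdefault(parent, set()).add(child)
    # A grandchild is any child of any of a person's children.
    return {(gp, gc)
            for gp, kids in children.items()
            for kid in kids
            for gc in children.get(kid, set())}

# Hypothetical facts in the spirit of the extracted family pages.
facts = {("Mary Ely", "John"), ("John", "Alice"), ("John", "Bob")}
print(sorted(infer_grandchildren(facts)))
# [('Mary Ely', 'Alice'), ('Mary Ely', 'Bob')]
```

Because inferred facts compound the errors of the base facts they join, precision on implied relations is typically lower than on directly extracted ones, consistent with the spot-check numbers above.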
“Find a BBQ restaurant near the Umeda station, with typical prices under $40”
Language-Agnostic Ontology
Oral proficiency testing for language learners
• Sentences presented aurally, repeated back (elicited imitation, EI)
• Carefully engineered for vocabulary level, grammatical complexity, length in syllables
• Score responses with forced alignment
• Correlate to standard testing methods
• English, French, Spanish, Japanese
• In use at language training facilities, universities, industry
• Too short: just a working-memory task with parroting
• Too long: impossible to repeat
• Too complex: even native speakers can't repeat
• Too simple: can't discriminate non-native speakers' levels
• EI item design is a linguistic engineering task!
  • Sentence length
  • Sentence complexity
  • Vocabulary levels
  • Breadth of sampling of grammatical structures, constructions
681,925 annotated sentences of length 5-20 words
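The system scores responses by forced alignment of the learner's audio; as a rough text-level stand-in, the sketch below scores a transcribed repetition against the target sentence with word-level edit distance. The function names `edit_distance` and `ei_score` are our own, not part of the system described.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between reference and hypothesis."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i-1] == hyp[j-1] else 1
            d[i][j] = min(d[i-1][j] + 1,      # deletion
                          d[i][j-1] + 1,      # insertion
                          d[i-1][j-1] + cost) # substitution / match
    return d[-1][-1]

def ei_score(target, response):
    """Fraction of the target reproduced correctly (1.0 = perfect repetition)."""
    ref, hyp = target.lower().split(), response.lower().split()
    return 1.0 - edit_distance(ref, hyp) / max(len(ref), 1)

print(ei_score("the boy who ran fast won the race",
               "the boy ran fast won race"))  # 0.75
```

Scores like this, aggregated over a bank of carefully graded items, are what get correlated against standard proficiency measures.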
• NLP in a cognitive modeling framework
• Goal-directed, incremental
• Machine learning
• Trying to model/mimic human performance in language tasks
• Several modalities: parsing, generation, translation, dialogue
Cognitive modeling
• Model human behavior: agent-based, goal-directed, representation of world, decomposable actions, learned skills, behaviors, expertise, memory
• Fatigue, emotion, attention, overload, confusion
• Plausible: processes, time course, constraints
• Robots: explore control, agency, interaction
• Language: cognition, acquisition, modeling, agency, incrementality, discourse/dialogue, process (parsing, lexical access, generation, translation, …)
• Develop NLP capability in Soar
• Parsing, generation, discourse/dialogue, translation, speech
• Fit models of human performance data
• Incremental, learning, agent-based
• WordNet, other resources for lexical info
• English, French, Japanese
• Use in HCI, modeling (reading, acquisition), task interactions, emotion, attention, ambiguity resolution, parser breakdown, etc.
[Diagrams: dialogue architecture linking comprehension and generation]
• Operationalize language processing of all kinds (mostly for DoD)
• Machine translation, sentiment analysis, dialect recognition, prevarication detection, etc.
• Beyond the current paradigms, language resources (cf. trained on newswire)
• MT and CLIR (A), HCI English+Arabic (B), ST English+Arabic (C), Arabic dialects (D)
• Activity E: language, agents, and robotics
• Grounded language acquisition by robots
• Deep semantics, visual+tactile input, experiential learning of objects, actions, and consequences
• Acquires language via grounding, hypothesizing, automated reasoning
• Human guides acquisition via situated, interactive instruction
• Robot demonstrates understanding via performance
• Social band (10^5 to 10^7 s: days to months)
• Rational band (10^2 to 10^4 s: minutes to hours)
• Cognitive band (10^-1 to 10^1 s: 100 ms to 10 s)
• Biological band (10^-4 to 10^-2 s: 100 μs to 10 ms)
Put <object> in <location>
• Includes moving to <object>, picking it up, moving to <location>, opening <location> if necessary, depositing <object>, closing <location> if necessary
• Fails if another object is already in <location> (or can extend to put the second object in the work area?)
Cook <object>
• Clears the location where the object will be cooked
• Turns on location to correct temperature (background knowledge in semantic memory!)
• If it needs to preheat (oven), waits for it to preheat
• Puts object in location; waits
• Tests temperature or other appropriate sensor (toothpick for cake?)
• Removes object from oven/stove and places it on the workspace
• Turns off oven/stove
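The "Put <object> in <location>" decomposition above can be sketched as ordinary code. The primitive action names and flags below are our own invention for illustration, not Soar operators.

```python
def put(obj, loc, loc_openable, loc_occupied):
    """Decompose 'Put <obj> in <loc>' into a list of primitive action strings.

    loc_openable: whether the location must be opened/closed (e.g. an oven).
    loc_occupied: whether another object already fills the location.
    """
    if loc_occupied:
        # Per the slide: fails if another object is already in the location.
        raise RuntimeError(f"{loc} already holds another object")
    steps = [f"move-to {obj}", f"pick-up {obj}", f"move-to {loc}"]
    if loc_openable:
        steps.append(f"open {loc}")
    steps.append(f"deposit {obj} in {loc}")
    if loc_openable:
        steps.append(f"close {loc}")
    return steps

print(put("cake", "oven", loc_openable=True, loc_occupied=False))
# ['move-to cake', 'pick-up cake', 'move-to oven', 'open oven',
#  'deposit cake in oven', 'close oven']
```

The "Cook <object>" task would compose this with further steps (preheat, wait, sense, remove, turn off), which is exactly the kind of hierarchical, condition-guarded decomposition a cognitive architecture encodes as operators.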