Ideas for 100K Word Data Set for Human and Machine Learning Lori Levin Alon Lavie Jaime Carbonell...


Ideas for 100K Word Data Set for Human and Machine Learning

Lori Levin, Alon Lavie, Jaime Carbonell
Language Technologies Institute, Carnegie Mellon University

The data set should support

Machine learning
Machine learning from small data can work if the data is structured.

Analysis by humans
Humans can learn a lot from a small data set if the form-function mappings are clear.

Concrete Suggestions

1. Hand align a portion of the corpus.
2. Include parse trees and feature structures for a portion of the corpus.
3. Include a representative sample of diversity of phrase structures.
4. Include a representative sample of diversity in function/meaning.
5. Include some simple, single sentences.
6. Include some full texts.
7. Look for well-known divergences.
8. Conduct an evaluation to be sure that the corpus elicits what you want it to elicit.

Hand align a portion of the corpus

Automatic alignment algorithms can be bootstrapped from the hand alignments.

A lexicon can be created from the alignments.

Humans can study word usage.
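
As a rough illustration of how a lexicon might be read off hand alignments, the sketch below counts which target word each source word is linked to. The data format (token lists plus index pairs), the function name build_lexicon, and the target-language words are all hypothetical placeholders, not part of the proposal.

```python
from collections import Counter, defaultdict

# Hypothetical hand-aligned sentence pairs:
# (source tokens, target tokens, list of (source index, target index) links).
# The target words are invented placeholders, not data from a real language.
aligned_pairs = [
    (["the", "dog", "sleeps"], ["kano", "lumi"], [(1, 0), (2, 1)]),
    (["the", "dog", "runs"], ["kano", "tavi"], [(1, 0), (2, 1)]),
    (["a", "cat", "sleeps"], ["miso", "lumi"], [(1, 0), (2, 1)]),
]


def build_lexicon(pairs):
    """Count how often each source word is linked to each target word."""
    lexicon = defaultdict(Counter)
    for source, target, links in pairs:
        for s_idx, t_idx in links:
            lexicon[source[s_idx]][target[t_idx]] += 1
    return lexicon


lexicon = build_lexicon(aligned_pairs)
for word in sorted(lexicon):
    # Each entry lists candidate translations with their link counts.
    print(word, dict(lexicon[word]))
```

Unaligned words (here the English articles) simply never enter the lexicon, which is one way a human reader could spot function words that have no direct counterpart.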

Provide parse trees for a portion of the corpus

Parse trees plus alignments can be input to Avenue-style rule learning

Automatic treebanking of the minor language

Humans can study the translation of specific structures.

There should be semantic and functional information in addition to structural information. See below.
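
One possible way to keep structural, functional, and alignment information together is a single record per sentence pair. The sketch below is only an assumption about what such a record could look like: the class name CorpusEntry, its field names, the feature labels, and the target-language tokens are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CorpusEntry:
    """One sentence pair with optional annotation layers (hypothetical layout)."""
    source: list[str]                      # English tokens
    target: list[str]                      # minor-language tokens
    alignment: list[tuple[int, int]] = field(default_factory=list)  # (source, target) links
    parse: str = ""                        # bracketed parse of the English side
    features: dict[str, str] = field(default_factory=dict)  # functional information


entry = CorpusEntry(
    source=["the", "dog", "slept"],
    target=["kano", "lumi-ta"],            # invented placeholder words
    alignment=[(1, 0), (2, 1)],
    parse="(S (NP (DT the) (NN dog)) (VP (VBD slept)))",
    features={"tense": "past", "definiteness": "definite"},
)


def is_fully_annotated(e: CorpusEntry) -> bool:
    """True if the entry has alignment, structure, and functional layers."""
    return bool(e.alignment and e.parse and e.features)


print(is_fully_annotated(entry))
```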

Include a representative example of structural diversity

Part of the corpus can be structured to include simple, common sub-trees from the English Penn TreeBank.

Learn a collection of structural mappings that is compositional

A lot of mileage from small data

Preliminary work with Katharina Probst

Raw WSJ data requires editing

Need redundant examples of each structure
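
The need for redundant examples of each structure can be checked mechanically. The sketch below assumes each corpus sentence is tagged with the sub-tree patterns it is meant to elicit; the tagging scheme, the rule strings, the threshold, and the function name under_represented are illustrative assumptions only.

```python
from collections import Counter

# Hypothetical corpus entries, each tagged with the sub-tree patterns it is
# intended to elicit (Penn TreeBank-style labels used only for illustration).
corpus = [
    {"text": "the dog", "structures": ["NP -> DT NN"]},
    {"text": "the big dog", "structures": ["NP -> DT JJ NN"]},
    {"text": "a cat", "structures": ["NP -> DT NN"]},
    {"text": "the dog in the room", "structures": ["NP -> NP PP", "NP -> DT NN"]},
]


def under_represented(entries, min_examples=2):
    """Return structures that occur in fewer than min_examples entries."""
    counts = Counter(s for e in entries for s in e["structures"])
    return sorted(s for s, n in counts.items() if n < min_examples)


# Structures listed here would need more example sentences.
print(under_represented(corpus))
```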

Include a representative example of function or meaning

Finding out how English structures translate into minor language structures is not enough.

For example, finding out how to translate English auxiliary verbs is not useful because they have many functions: tense, aspect, epistemics, evidentials, etc.

Finding out how to express tense, aspect, epistemics, evidentials, etc. is useful.

Include some multi-sentence texts

In order to observe:

Temporal sequencing of events

Causation

Rhetorical relations: contrast, elaboration, etc.

Given and new information

Co-reference

Look for well-known divergences

E.g., run across the street vs cross the street running

But see below for our view of divergences.

Include some simple sentences

So that the form-function mapping is clear to a human without confounding factors

As a seed for machine learning

Evaluation

Test the corpus on a few languages to be sure that the intended structures and functions are elicited.

Need to watch out for idiosyncrasies, lexical gaps, special constructions, etc.

For example, if you want to elicit a noun modified by a prepositional phrase, "the person in the room" will work better than "a bottle of wine".

Hard problems

Body of common phenomena with a tail of phenomena that are individually rare, but collectively massive.

Extra slides

Our view of translation divergences

Elaboration on the different roles of structure and function

Our view of divergences which is divergent from some other views of divergences

Divergences arise when the same function is expressed by a different structure.

Many functions are expressed by specialized constructions that do not translate literally into other languages.

Divergences cannot be neatly grouped into a few classes.

Typological differences between languages are relevant:

Embedding vs serialization

Synthetic vs analytic causative constructions

Coverage: Structure and Function

Structural Diversity
Appositives, adjuncts, embedded clauses, coordinate structures, ellipsis, etc.

Functional/Meaning Diversity
Temporal relations, rhetorical relations, modality, negation, tense, aspect, etc.

Structure and Function

The way you understand a text is by knowing which structure has which function.

The same function is expressed by different structures in different languages.

What a human needs to know (function)

Who did what to who when?

What happened before/after what?

What caused what?

Is it first hand knowledge, hearsay, or inference?

Is it certain, probable, or improbable?

Did it happen or not?

What do these words mean?

How a human knows these things (structure/grammar)

Who did what to who when?
Grammatical relations, coreference, time expressions, pronouns/pro-drop, nominalizations, subordinate clauses, case marking, word order, agreement, tense, aspect

What happened before/after what?
Time expressions, temporal connectives, tense and aspect morphemes

What caused what?
Markers of rhetorical relations between sentences

Is it first hand knowledge, hearsay, or inference? Is it certain, probable, or improbable?
Markers of modality and epistemics

Did it happen or not?
Markers of negation and counterfactuals

What do these words mean?
Vocabulary

Other
Questions, existentials, possessives, coordinate structures

How to make sure the corpus captures what a human needs to know

Organize the corpus by function and then a human can observe the corresponding structure.
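
A minimal sketch of what organizing by function could mean in practice: each elicitation item carries a function label, and items are grouped under that label so the structures used for one function can be compared side by side. The label names, the example sentences, and the function group_by_function are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical elicitation items, each tagged with the function it targets.
items = [
    {"function": "negation", "english": "The dog did not sleep."},
    {"function": "tense:past", "english": "The dog slept."},
    {"function": "negation", "english": "Nobody came."},
    {"function": "evidentiality:hearsay", "english": "They say the dog slept."},
]


def group_by_function(entries):
    """Group English prompts by the function they are meant to elicit."""
    groups = defaultdict(list)
    for e in entries:
        groups[e["function"]].append(e["english"])
    return dict(groups)


groups = group_by_function(items)
for function in sorted(groups):
    print(function, groups[function])
```

With the translations added alongside each prompt, a reader could scan one function group at a time and observe which structures the minor language uses for it.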

Coverage of data for human analysis: basics

Closed Class and Special Constructions
Dates, names, numbers, prices, etc.
Pronouns, prepositions, etc.

Encoding of grammatical relations and/or semantic roles
How do you know who did what to who? Word order, case marking, agreement

Encoding of old and new information
Word order, special constructions (e.g., clefts), etc.

Questions

Negation

Modification

Possession

Coordination

Indirect speech

Coverage of data for human analysis: multi-sentence and multi-clause

Rhetorical relations
Cause, elaboration, contrast, etc.

Temporal relations
Before, after, during, etc.

Same subject and obviation phenomena

Subordination
As subject or object
As complement
As adjunct

Other grammatically encoded meanings

Modality and Epistemics
Certainty, source of information (first hand, second hand, inference), etc.

Conditionals

Comparatives

Existentials

Tense and aspect

Definiteness
