
Source: sfs.uni-tuebingen.de/~keberle/ICL/slides/slidesSum.pdf


A bit of Philosophy of Science Some technical stuff from different chapters. . . Encoding Language Writers’ Aids Regular Expressions Classifying Documents

Introduction to Computational Linguistics: Summary

Kurt Eberle

[email protected]

January 31, 2018


Outline
- A bit of Philosophy of Science
- Some technical stuff from different chapters. . .
- Encoding Language
- Writers’ Aids
  - N-gram analysis
  - Rule-based methods
  - Similarity key techniques
  - Using Probabilities
  - Minimal Edit Distance
- Regular Expressions
  - REs and Finite State Automata
  - REs and Finite State Transducers
- Classifying Documents


A bit of Philosophy of Science

Is Artificial Intelligence Possible?
- Can Machines Think?
- Can Machines Understand?
- Can Machines Learn?
- Can Machines Have Emotions?


Artificial Intelligence
- Strong AI
- Weak AI


Interesting positions
- Alan M. Turing: COMPUTING MACHINERY AND INTELLIGENCE
- John R. Searle: MINDS, BRAINS, AND PROGRAMS


Interesting positions
- Alan M. Turing: The Turing Test
- John R. Searle: The Chinese Room Argument


Turing

- Imitation game. How is it defined? What is its purpose?
  → I propose to consider the question, ”Can machines think?” This should begin with definitions of the meaning of the terms ”machine” and ”think.” . . .
  → man A, woman B, interrogator C, two rooms, teleprinter
  → Who is the man, who is the woman?
  → Replace A or B by a computer. Difference?


Searle

- Chinese room scenario. How is it defined? What is its purpose?
  → Three batches of Chinese text
  → Manipulation Guideline
  → Story, Script (Logical derivation rules), Questions
  → Answers by symbol manipulation
  → Not sufficient for ’understanding’


Some technical stuff from different sections. . .


Converting decimal numbers to binary

Two methods:
- Tabular method
- Division method


Converting decimal numbers to binary: Tabular Method

Using the first 4 bits, we want to know how to write 10 in bit (or binary) notation.

  8 | 4 | 2 | 1
  ? | ? | ? | ?
  1 | ? | ? | ?    (8 < 10)
  1 | 0 | ? | ?    (8 + 4 = 12 > 10)
  1 | 0 | 1 | ?    (8 + 2 = 10)
  1 | 0 | 1 | 0

(Task: Apply the method for more bits and higher numbers!)
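The tabular method can be sketched in a few lines of Python (the function name is ours, not from the slides):

```python
def to_binary_tabular(n, bits=4):
    """Tabular method: walk the bit weights left to right (8, 4, 2, 1
    for 4 bits); a bit is 1 when its weight still fits into what is
    left of n, otherwise 0."""
    result = ""
    for i in range(bits - 1, -1, -1):
        weight = 2 ** i
        if weight <= n:   # weight fits: bit is 1, use it up
            result += "1"
            n -= weight
        else:             # weight too big: bit is 0
            result += "0"
    return result

print(to_binary_tabular(10))  # 1010
```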


Converting decimal numbers to binary: Division Method

  Decimal     Remainder?   Binary
  10/2 = 5    no           0
  5/2 = 2     yes          10
  2/2 = 1     no           010
  1/2 = 0    yes          1010

(Task: Apply the method for higher numbers!)
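The division method translates just as directly; again a minimal sketch, with an illustrative function name:

```python
def to_binary_division(n):
    """Division method: repeatedly halve n; each remainder becomes the
    next binary digit, read from the last division back to the first."""
    digits = ""
    while n > 0:
        digits = str(n % 2) + digits  # prepend the remainder
        n //= 2
    return digits or "0"

print(to_binary_division(10))  # 1010
```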


UTF-8 details
- The first byte unambiguously tells you how many bytes to expect after it
  - e.g., a first byte of the form 11110xxx signals four total bytes
- All non-initial bytes start with 10 = not the initial byte

  Byte 1    Byte 2    Byte 3    Byte 4    Byte 5    Byte 6
  0xxxxxxx
  110xxxxx  10xxxxxx
  1110xxxx  10xxxxxx  10xxxxxx
  11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
  111110xx  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx
  1111110x  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx

Task: Greek β has a code value of 946. What is it in UTF-8?
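As a check on the β task: 946 needs more than 7 but at most 11 payload bits, so it falls in the two-byte range. A small sketch that encodes that case by hand (the function name is ours):

```python
def utf8_two_byte(code_point):
    """Encode a code point in the two-byte UTF-8 range (0x80-0x7FF):
    11 payload bits split across 110xxxxx 10xxxxxx."""
    assert 0x80 <= code_point <= 0x7FF
    byte1 = 0b11000000 | (code_point >> 6)        # top 5 bits
    byte2 = 0b10000000 | (code_point & 0b111111)  # low 6 bits
    return bytes([byte1, byte2])

print(utf8_two_byte(946))                          # b'\xce\xb2'
print(utf8_two_byte(946) == "β".encode("utf-8"))   # True
```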


N-grams: Motivation
Let’s say we’re having trouble telling what word a person said in an ASR system.
- We could look it up in a phonetic dictionary
- But if we hear something like ni, how can we tell if it’s knee, neat, need, or some other word?
- All of these are plausible words
- So, we can assign a probability, or weight, to each change:
  - e.g., deleting a [t] at the end of a word is slightly more common than deleting a [d]
- We can look at how far off a word is from the pronunciation; we’ll return to the issue of minimum edit distance with spell checking
- But if the previous word was I, the right choice becomes clearer ...

Material based upon chapter 5 of Jurafsky and Martin 2000


N-gram definition

An n-gram is a stretch of text n words long.
- Approximation of language: information in n-grams tells us something about language, but doesn’t capture the structure
- Efficient: finding and using every, e.g., two-word collocation in a text is quick and easy to do

N-grams help a variety of NLP applications, including word prediction.
- N-grams can be used to aid in predicting the next word of an utterance, based on the previous n − 1 words


Bigram example

What is the probability of seeing the sentence The quick brown fox jumped over the lazy dog?
- P(The quick brown fox jumped over the lazy dog) = P(The|START) P(quick|The) P(brown|quick) ... P(dog|lazy)

Or, for our ASR example, we can compare:
- P(need|I) > P(neat|I)

(Task: List the 3-, 4- and 5-grams of the sentence above)
(Task: What is the probability that ’quick’ follows ’the’ given the bigrams of the sentence?)
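The second task can be checked with the standard maximum-likelihood bigram estimate, P(w2|w1) = count(w1 w2) / count(w1); a Python sketch (function name ours):

```python
from collections import Counter

def bigram_prob(corpus_tokens, w1, w2):
    """Maximum-likelihood bigram estimate: P(w2 | w1) =
    count(w1 w2) / count(w1)."""
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    unigrams = Counter(corpus_tokens)
    if unigrams[w1] == 0:
        return 0.0
    return bigrams[(w1, w2)] / unigrams[w1]

tokens = "the quick brown fox jumped over the lazy dog".split()
# 'the' occurs twice, once followed by 'quick':
print(bigram_prob(tokens, "the", "quick"))  # 0.5
```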


N-gram analysis
- An n-gram here is a string of n letters:

  a      1-gram (unigram)
  at     2-gram (bigram)
  ate    3-gram (trigram)
  late   4-gram
  ...

- We can use this n-gram information to define what the possible strings in a language are,
  e.g., po is a possible English string, whereas kvt is not.

This is more useful to correct optical character recognition (OCR) output, but we’ll still take a look.


Bigram array
- We can define a bigram array = information stored in a tabular fashion.
- An example, for the letters k, l, m, with examples in parentheses:

        k           l           m
  k     0           1 (tackle)  1 (Hackman)
  l     1 (elk)     1 (hello)   1 (alms)
  m     0           0           1 (hammer)

- The first letter of the bigram is given by the vertical letters (i.e., down the side), the second by the horizontal ones (i.e., across the top).
- This is a non-positional bigram array = the array’s 1’s and 0’s apply for a string found anywhere within a word (beginning, 4th character, ending, etc.).
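Such a non-positional bigram array is easy to build from a word list; a sketch (the word list and function name are illustrative):

```python
from collections import defaultdict

def bigram_array(words):
    """Build a non-positional letter-bigram table: seen[(a, b)] is True
    iff letters a, b occur adjacently anywhere in some word."""
    seen = defaultdict(bool)
    for word in words:
        for a, b in zip(word, word[1:]):
            seen[(a, b)] = True
    return seen

table = bigram_array(["tackle", "elk", "hello", "alms", "hammer"])
print(table[("c", "k")])  # True  (tackle)
print(table[("m", "k")])  # False
```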


Rule-based methods

One can generate correct spellings by writing rules:
- Common misspelling rewritten as correct word:
  - e.g., hte → the
- Rules based on inflections:
  - e.g., VCing → VCCing, where
    V = letter representing a vowel, basically the regular expression [aeiou]
    C = letter representing a consonant, basically [bcdfghjklmnpqrstvwxyz]
- Rules based on other common spelling errors (such as keyboard effects or common transpositions):
  - e.g., CsC → CaC
  - e.g., Cie → Cei
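A couple of these rule types can be written as regular-expression rewrites; the rule list below is illustrative, not a real spell checker’s:

```python
import re

# Two of the rule types above as regex rewrites (toy rule list).
VOWEL = "[aeiou]"
CONS = "[bcdfghjklmnpqrstvwxyz]"
RULES = [
    (r"\bhte\b", "the"),        # common misspelling -> correct word
    (rf"({CONS})ie", r"\1ei"),  # transposition rule: Cie -> Cei
]

def apply_rules(text):
    for pattern, repl in RULES:
        text = re.sub(pattern, repl, text)
    return text

print(apply_rules("hte book"))  # the book
print(apply_rules("recieve"))   # receive
```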


Similarity key techniques (SOUNDEX)
- Problem: How can we find a list of possible corrections?
- Solution: Store words in different boxes in a way that puts similar words together.
- Example:
  1. Start by storing words by their first letter (first letter effect),
     e.g., punc starts with the code P.
  2. Then assign numbers to each letter,
     e.g., 0 for vowels, 1 for b, p, f, v (all bilabials), and so forth,
     e.g., punc → P052.
  3. Then throw out all zeros and repeated letters,
     e.g., P052 → P52.
  4. Look for real words within the same box,
     e.g., punk is also in the P52 box.
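The four steps can be sketched as a toy SOUNDEX-style key function; the digit table follows the standard SOUNDEX letter groups, and the edge cases of real SOUNDEX (h/w/y handling, fixed key length) are omitted:

```python
def similarity_key(word):
    """Toy SOUNDEX-style key: keep the first letter, code the rest
    (0 = vowels, 1 = b/p/f/v, ...), then drop zeros and repeated codes."""
    codes = {c: "0" for c in "aeiou"}
    codes.update({c: "1" for c in "bpfv"})
    codes.update({c: "2" for c in "cgjkqsxz"})
    codes.update({c: "3" for c in "dt"})
    codes.update({"l": "4", "m": "5", "n": "5", "r": "6"})
    key = word[0].upper()
    prev = ""
    for c in word[1:].lower():
        digit = codes.get(c, "")
        if digit != "0" and digit != prev:  # drop zeros and repeats
            key += digit
        prev = digit
    return key

print(similarity_key("punc"))  # P52
print(similarity_key("punk"))  # P52 -- same box, so punk is a candidate
```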


Using Probabilities

Confusion probabilities

- It is impossible to fully investigate all possible error causes and how they interact, but we can learn from watching how often people make errors and where.
- One way is to build a confusion matrix = a table indicating how often one letter is mistyped for another.
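A confusion matrix can be harvested from observed (intended, typed) character pairs; a toy sketch with invented data:

```python
from collections import Counter

def confusion_counts(pairs):
    """Count substitutions from (intended, typed) character pairs,
    e.g. harvested from a corpus of corrected typos (toy data here)."""
    matrix = Counter()
    for intended, typed in pairs:
        if intended != typed:  # only actual mistypings count
            matrix[(intended, typed)] += 1
    return matrix

observed = [("e", "e"), ("e", "r"), ("e", "r"), ("a", "s")]
m = confusion_counts(observed)
print(m[("e", "r")])  # 2 -- 'r' typed for intended 'e' twice
```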


Minimal Edit Distance

Computing edit distances: using a graph to map out the options
- To calculate minimum edit distance, we set up a directed, acyclic graph, a set of nodes (circles) and arcs (arrows).
- Horizontal arcs correspond to deletions, vertical arcs correspond to insertions, and diagonal arcs correspond to substitutions (a letter can be “substituted” for itself).

Discussion here based on Roger Mitton’s book English Spelling and the Computer.


Computing edit distances: an example graph
- Say the user types in fyre.
- We want to calculate how far away fry is (one of the possible corrections). In other words, we want to calculate the minimum edit distance (or minimum edit cost) from fyre to fry.
- As the first step, we draw the following directed graph: (graph figure omitted)


Computing edit distances: adding numbers to the example graph
- The graph is acyclic = for any given node, it is impossible to return to that node by following the arcs.
- We can add identifiers to the states, which allows us to define a topological order: (graph figure omitted)


Computing edit distances: adding costs to the arcs of the example graph
- We need to add the costs involved to the arcs.
- In the simplest case, the cost of deletion, insertion, and substitution is 1 each (and substitution with the same character is free).
- Instead of assuming the same cost for all operations, in reality one will use different costs, e.g., for the first character or based on the confusion probability.
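The graph walk with unit costs is the classic dynamic-programming edit-distance computation; a sketch, with the cost parameters exposed so they could be replaced by confusion-based costs:

```python
def min_edit_distance(source, target, del_cost=1, ins_cost=1, sub_cost=1):
    """Dynamic-programming version of the graph above: cell (i, j) holds
    the cheapest way to turn source[:i] into target[:j]; the moves mirror
    the arcs (horizontal = deletion, vertical = insertion, diagonal =
    substitution, free when the letters match)."""
    m, n = len(source), len(target)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * del_cost
    for j in range(1, n + 1):
        d[0][j] = j * ins_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            same = 0 if source[i - 1] == target[j - 1] else sub_cost
            d[i][j] = min(d[i - 1][j] + del_cost,      # delete
                          d[i][j - 1] + ins_cost,      # insert
                          d[i - 1][j - 1] + same)      # substitute/match
    return d[m][n]

print(min_edit_distance("fyre", "fry"))  # 2 (delete y, substitute e by y)
```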


Regular Expressions

For each of the following pairs of regular expressions, state whether they are equivalent or not, i.e. whether they denote exactly the same set of strings.

1. [c a | a | a c] = [a c | c a | a]
2. [c 0 b 0 a 0] = [0 c b a 0]
3. a∗ = a+
4. [a | b] − a = a
5. [a∗] − [a a a+] = [0 | a | a a | a a a]
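Pair 3 is the one easiest to check mechanically: a∗ and a+ differ on exactly one string, the empty string. In Python-regex terms (the bracketed notation on the slide is xfst-style, not Python’s):

```python
import re

# a* accepts zero or more a's; a+ requires at least one.
star = re.compile(r"a*$")
plus = re.compile(r"a+$")
for s in ["", "a", "aa", "aaa"]:
    print(repr(s), bool(star.match(s)), bool(plus.match(s)))
```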


REs and Finite State Automata

Please draw a finite state automaton which accepts the language that the regular expression describes. Make sure to specify the initial state and the final states of your automata.

1. [c | c b]

2. [a∗ b+]

3. [0 0]

4. [b a c | b a c d | b a b d]

5. [[[a | b]+ c] | b+]


REs and Finite State Transducers

1. Consider the finite state transducer specified by
   [a:0 a | a b:0]
   What output string does the transducer give for the input string ab?
   → Result: ab → a (and also aa → a)

2. Draw the finite state transducer specified in (1).

3. Consider the finite-state transducer specified by
   [[a c]:d] .o. [d:[c d]]
   What output string does the transducer give for the input string ac?
   → Result: cd
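Exercise (1) can be checked by simulating the transducer’s two paths by hand; this hard-coded sketch covers only that one expression (0 in the notation is the empty string):

```python
# The transducer [a:0 a | a b:0]: two alternative paths, each reading
# two input symbols; "" stands for epsilon output (0 in the notation).
PATHS = [
    [("a", ""), ("a", "a")],   # path for a:0 a
    [("a", "a"), ("b", "")],   # path for a b:0
]

def transduce(s):
    """Return the set of outputs the transducer yields for input s."""
    outputs = set()
    for path in PATHS:
        if len(s) == len(path) and all(c == inp for c, (inp, _) in zip(s, path)):
            outputs.add("".join(out for _, out in path))
    return outputs

print(transduce("ab"))  # {'a'}
print(transduce("aa"))  # {'a'}
```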


4. Consider the following expression containing the replacement operator, (→), which denotes unconditional optional replacement:
   [aba] (→) x
   For the input string abadddaba, what output(s) does the replace operation yield?
   → abadddaba, xdddaba, abadddx, xdddx

5. The following expression contains an operator denoting left-to-right, longest-match replacement:
   [b∗(a)c] @→ x
   For the input string abbaccbcadad, what output(s) does the replace operation yield?
   → axxxadad


Bag of words

- Simple strategy
- Text as an unstructured collection of words
- Ignore sentence structure and order of words
- Put words in a bag (bag of words assumption)
- If preclassified texts exist, compute similarity using patterns of word distribution


- Bag of words with knowledge about positive and negative words:

  〈ridiculousN , movieU , worst/badN , . . . 〉
  〈favoriteP , movieU , lengthyN , greatP 〉

- Combine with relevance of the word in the text as such
- Example of a measure: term frequency/inverse document frequency (tf/idf)

  〈ridiculousN,high, movieU , worst/badN,low , . . . 〉
  〈favoriteP,high, movieU , lengthyN,neutr , greatP,low 〉

→ Compute values for P, N (using Naive Bayes applied to frequencies of P-/N-words or more sophisticated measures)
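A toy version of this pipeline, with an invented polarity lexicon and tf/idf as the relevance measure (not the slides’ actual data or classifier):

```python
import math
from collections import Counter

# Invented polarity lexicon for illustration only.
POSITIVE = {"favorite", "great"}
NEGATIVE = {"ridiculous", "worst", "bad", "lengthy"}

def tf_idf(term, doc, docs):
    """tf/idf relevance: frequency of the term in this document times the
    log-scaled inverse of how many documents contain it."""
    tf = Counter(doc)[term] / len(doc)
    df = sum(term in d for d in docs)
    return tf * math.log(len(docs) / df) if df else 0.0

def polarity_score(doc, docs):
    """Weight each polarity word by its tf/idf relevance; the sign of the
    sum classifies the review."""
    score = 0.0
    for term in set(doc):
        if term in POSITIVE:
            score += tf_idf(term, doc, docs)
        elif term in NEGATIVE:
            score -= tf_idf(term, doc, docs)
    return score

docs = [["ridiculous", "movie", "worst"], ["favorite", "movie", "great"]]
print(polarity_score(docs[0], docs) < 0)  # True: negative review
print(polarity_score(docs[1], docs) > 0)  # True: positive review
```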
