35
Hypothesis Testing Hypothesis Generation Corpus Linguistics A Simple Introduction Niko Schenk [email protected] Applied Computational Linguistics Lab Computer Science Department / Department of English- and American Studies Goethe University Frankfurt, Germany April 17, 2019 Niko Schenk Corpus Linguistics – Introduction 1 / 35

Corpus Linguistics - A Simple Introduction · Corpus Linguistics A Simple Introduction Niko Schenk [email protected] Applied Computational Linguistics Lab Computer Science

  • Upload
    others

  • View
    28

  • Download
    2

Embed Size (px)

Citation preview

Page 1: Corpus Linguistics - A Simple Introduction · Corpus Linguistics A Simple Introduction Niko Schenk n.schenk@em.uni-frankfurt.de Applied Computational Linguistics Lab Computer Science

Hypothesis Testing Hypothesis Generation

Corpus LinguisticsA Simple Introduction

Niko [email protected]

Applied Computational Linguistics LabComputer Science Department /

Department of English- and American StudiesGoethe University Frankfurt, Germany

April 17, 2019

Niko Schenk Corpus Linguistics – Introduction 1 / 35

Page 2: Corpus Linguistics - A Simple Introduction · Corpus Linguistics A Simple Introduction Niko Schenk n.schenk@em.uni-frankfurt.de Applied Computational Linguistics Lab Computer Science

Hypothesis Testing Hypothesis Generation

1 Hypothesis Testing

2 Hypothesis Generation

Niko Schenk Corpus Linguistics – Introduction 2 / 35

Page 3: Corpus Linguistics - A Simple Introduction · Corpus Linguistics A Simple Introduction Niko Schenk n.schenk@em.uni-frankfurt.de Applied Computational Linguistics Lab Computer Science

Hypothesis Testing Hypothesis Generation

1 Hypothesis Testing

2 Hypothesis Generation

Niko Schenk Corpus Linguistics – Introduction 3 / 35

Page 4: Corpus Linguistics - A Simple Introduction · Corpus Linguistics A Simple Introduction Niko Schenk n.schenk@em.uni-frankfurt.de Applied Computational Linguistics Lab Computer Science

Hypothesis Testing Hypothesis Generation

“Correct” vs. “Incorrect” Use of Language

Figure: Source: http://www.elektrojournal.at/bilder/d166/Saturn_Claim.jpg

Niko Schenk Corpus Linguistics – Introduction 4 / 35

Page 5: Corpus Linguistics - A Simple Introduction · Corpus Linguistics A Simple Introduction Niko Schenk n.schenk@em.uni-frankfurt.de Applied Computational Linguistics Lab Computer Science

Hypothesis Testing Hypothesis Generation

Soo! muss Technik

Regarding the slogan four things are striking:

Missing main verb “sein” (verbal ellipsis?/“ungrammatical”)Orthographic variation “soo” vs. “so” (spelling “error”)Misleading punctuation “soo! ...”, all letters capitalized...

Question: How would you objectively/formally test that, “soo”—compared to“so”—is not “regular” language?

Dictionary...Better: → www.google.com!

Niko Schenk Corpus Linguistics – Introduction 5 / 35

Page 6: Corpus Linguistics - A Simple Introduction · Corpus Linguistics A Simple Introduction Niko Schenk n.schenk@em.uni-frankfurt.de Applied Computational Linguistics Lab Computer Science

Hypothesis Testing Hypothesis Generation

Comparing Frequencies of Contrastive Elements

(a) Google results for “so” (b) Google results for “soo”

Niko Schenk Corpus Linguistics – Introduction 6 / 35

Page 7: Corpus Linguistics - A Simple Introduction · Corpus Linguistics A Simple Introduction Niko Schenk n.schenk@em.uni-frankfurt.de Applied Computational Linguistics Lab Computer Science

Hypothesis Testing Hypothesis Generation

Refining Comparisons of Contrastive Elements

(a) Google results for “so schon” (b) Google results for “soo schon”

Niko Schenk Corpus Linguistics – Introduction 7 / 35

Page 8: Corpus Linguistics - A Simple Introduction · Corpus Linguistics A Simple Introduction Niko Schenk n.schenk@em.uni-frankfurt.de Applied Computational Linguistics Lab Computer Science

Hypothesis Testing Hypothesis Generation

Congratulations!

We’ve successfully performed our first corpus linguistic search.

How is that different from using a dictionary?Answer:

1 consult tons of real data instead single of contemporary rule.

2 use tendencies instead of absolute true/false answer (Thedictionary claims that “soo” is false or—even worse–that thephrase does not exist).

Niko Schenk Corpus Linguistics – Introduction 8 / 35

Page 9: Corpus Linguistics - A Simple Introduction · Corpus Linguistics A Simple Introduction Niko Schenk n.schenk@em.uni-frankfurt.de Applied Computational Linguistics Lab Computer Science

Hypothesis Testing Hypothesis Generation

Language Change-An Example

Figure: “Thrived” vs. ...

Niko Schenk Corpus Linguistics – Introduction 9 / 35

Page 10: Corpus Linguistics - A Simple Introduction · Corpus Linguistics A Simple Introduction Niko Schenk n.schenk@em.uni-frankfurt.de Applied Computational Linguistics Lab Computer Science

Hypothesis Testing Hypothesis Generation

Language Change-An Example cont’d

Figure: ... “throve”.

Niko Schenk Corpus Linguistics – Introduction 10 / 35

Page 11: Corpus Linguistics - A Simple Introduction · Corpus Linguistics A Simple Introduction Niko Schenk n.schenk@em.uni-frankfurt.de Applied Computational Linguistics Lab Computer Science

Hypothesis Testing Hypothesis Generation

Language Change-An Example cont’d

Question: How would you objectively/formally show that “throve” is obsolete/nolonger in use?

You could ask a native speaker...(Much) better: → Google books Ngram Viewer

Niko Schenk Corpus Linguistics – Introduction 11 / 35

Page 12: Corpus Linguistics - A Simple Introduction · Corpus Linguistics A Simple Introduction Niko Schenk n.schenk@em.uni-frankfurt.de Applied Computational Linguistics Lab Computer Science

Hypothesis Testing Hypothesis Generation

Figure: Distributions of “thrived” vs. “throve” in the Google books corpus.

Possible explanation: Low-frequency words change to fit the main paradigm.Niko Schenk Corpus Linguistics – Introduction 12 / 35

Page 13: Corpus Linguistics - A Simple Introduction · Corpus Linguistics A Simple Introduction Niko Schenk n.schenk@em.uni-frankfurt.de Applied Computational Linguistics Lab Computer Science

Hypothesis Testing Hypothesis Generation

Again, how is that different from asking a native speaker?Answer: consult real data instead of intuition of one individualspeaker.

Niko Schenk Corpus Linguistics – Introduction 13 / 35

Page 14: Corpus Linguistics - A Simple Introduction · Corpus Linguistics A Simple Introduction Niko Schenk n.schenk@em.uni-frankfurt.de Applied Computational Linguistics Lab Computer Science

Hypothesis Testing Hypothesis Generation

What is Corpus Linguistics? (I)

Starting with a linguistic phenomenon (see previous examples) and a hypothesis, youuse

large textual resources (a corpus!) and software to objectively test(falsify/verify) the hypothesis.

hypothesis testing is usually based on frequencies obtained through search.

Niko Schenk Corpus Linguistics – Introduction 14 / 35

Page 15: Corpus Linguistics - A Simple Introduction · Corpus Linguistics A Simple Introduction Niko Schenk n.schenk@em.uni-frankfurt.de Applied Computational Linguistics Lab Computer Science

Hypothesis Testing Hypothesis Generation

1 Hypothesis Testing

2 Hypothesis Generation

Niko Schenk Corpus Linguistics – Introduction 15 / 35

Page 16: Corpus Linguistics - A Simple Introduction · Corpus Linguistics A Simple Introduction Niko Schenk n.schenk@em.uni-frankfurt.de Applied Computational Linguistics Lab Computer Science

Hypothesis Testing Hypothesis Generation

Extracting Useful Information

Contrary to the previous examples, you don’t have to start with a concrete hypothesis.

You could just “do something” with the corpus itself:

e.g., compute various statistics and inspect the output.

Usually you count words, phrases, etc. This is done automatically with the helpof computer programs.

→ As a result, you can come up with a hypothesis.

Niko Schenk Corpus Linguistics – Introduction 16 / 35

Page 17: Corpus Linguistics - A Simple Introduction · Corpus Linguistics A Simple Introduction Niko Schenk n.schenk@em.uni-frankfurt.de Applied Computational Linguistics Lab Computer Science

Hypothesis Testing Hypothesis Generation

Motivation

Co-occurrence probabilities between words...

Niko Schenk Corpus Linguistics – Introduction 17 / 35

Page 18: Corpus Linguistics - A Simple Introduction · Corpus Linguistics A Simple Introduction Niko Schenk n.schenk@em.uni-frankfurt.de Applied Computational Linguistics Lab Computer Science

Hypothesis Testing Hypothesis Generation

Motivation

Niko Schenk Corpus Linguistics – Introduction 18 / 35

Page 19: Corpus Linguistics - A Simple Introduction · Corpus Linguistics A Simple Introduction Niko Schenk n.schenk@em.uni-frankfurt.de Applied Computational Linguistics Lab Computer Science

Hypothesis Testing Hypothesis Generation

Motivation

Niko Schenk Corpus Linguistics – Introduction 19 / 35

Page 20: Corpus Linguistics - A Simple Introduction · Corpus Linguistics A Simple Introduction Niko Schenk n.schenk@em.uni-frankfurt.de Applied Computational Linguistics Lab Computer Science

Hypothesis Testing Hypothesis Generation

Motivation

Niko Schenk Corpus Linguistics – Introduction 20 / 35

Page 21: Corpus Linguistics - A Simple Introduction · Corpus Linguistics A Simple Introduction Niko Schenk n.schenk@em.uni-frankfurt.de Applied Computational Linguistics Lab Computer Science

Hypothesis Testing Hypothesis Generation

Motivation

Niko Schenk Corpus Linguistics – Introduction 21 / 35

Page 22: Corpus Linguistics - A Simple Introduction · Corpus Linguistics A Simple Introduction Niko Schenk n.schenk@em.uni-frankfurt.de Applied Computational Linguistics Lab Computer Science

Hypothesis Testing Hypothesis Generation

Motivation

Niko Schenk Corpus Linguistics – Introduction 22 / 35

Page 23: Corpus Linguistics - A Simple Introduction · Corpus Linguistics A Simple Introduction Niko Schenk n.schenk@em.uni-frankfurt.de Applied Computational Linguistics Lab Computer Science

Hypothesis Testing Hypothesis Generation

Motivation

Niko Schenk Corpus Linguistics – Introduction 23 / 35

Page 24: Corpus Linguistics - A Simple Introduction · Corpus Linguistics A Simple Introduction Niko Schenk n.schenk@em.uni-frankfurt.de Applied Computational Linguistics Lab Computer Science

Hypothesis Testing Hypothesis Generation

Motivation

Niko Schenk Corpus Linguistics – Introduction 24 / 35

Page 25: Corpus Linguistics - A Simple Introduction · Corpus Linguistics A Simple Introduction Niko Schenk n.schenk@em.uni-frankfurt.de Applied Computational Linguistics Lab Computer Science

Hypothesis Testing Hypothesis Generation

Motivation

Niko Schenk Corpus Linguistics – Introduction 25 / 35

Page 26: Corpus Linguistics - A Simple Introduction · Corpus Linguistics A Simple Introduction Niko Schenk n.schenk@em.uni-frankfurt.de Applied Computational Linguistics Lab Computer Science

Hypothesis Testing Hypothesis Generation

How to Automatically Find Collocations

Example: Automatically collect words which co-occur more frequently than whatwould be expected:

Generated hypothesis: → These words are idiomatic expressions, proper names...Niko Schenk Corpus Linguistics – Introduction 26 / 35

Page 27: Corpus Linguistics - A Simple Introduction · Corpus Linguistics A Simple Introduction Niko Schenk n.schenk@em.uni-frankfurt.de Applied Computational Linguistics Lab Computer Science

Hypothesis Testing Hypothesis Generation

How to Automatically Detect Similar Facebook Users

Based on the data, you could just count the words which are used by different peopleand compare the numbers.

Niko Schenk Corpus Linguistics – Introduction 27 / 35

Page 28: Corpus Linguistics - A Simple Introduction · Corpus Linguistics A Simple Introduction Niko Schenk n.schenk@em.uni-frankfurt.de Applied Computational Linguistics Lab Computer Science

Hypothesis Testing Hypothesis Generation

How to Automatically Detect Similar Facebook Users

Generated hypothesis: → Groups of people with the same opinion share a similarvocabulary.

Niko Schenk Corpus Linguistics – Introduction 28 / 35

Page 29: Corpus Linguistics - A Simple Introduction · Corpus Linguistics A Simple Introduction Niko Schenk n.schenk@em.uni-frankfurt.de Applied Computational Linguistics Lab Computer Science

Hypothesis Testing Hypothesis Generation

How to Automatically Identify the Author of a Text

The same technique can be applied to find the author of a document.

For each document (written by an author), count the number of distinct wordswhich he/she uses.

Niko Schenk Corpus Linguistics – Introduction 29 / 35

Page 30: Corpus Linguistics - A Simple Introduction · Corpus Linguistics A Simple Introduction Niko Schenk n.schenk@em.uni-frankfurt.de Applied Computational Linguistics Lab Computer Science

Hypothesis Testing Hypothesis Generation

How to Automatically Identify the Author of a Text

● ●

●●

● ●

●●

0.30 0.35 0.40 0.45 0.50

0.25

0.30

0.35

0.40

0.45

0.50

standard type / token ratio

type

(le

mm

a) /

toke

n ra

tio

christopher.txt

christopher2.txt

instructions.txt

lisa.txt lisa2.txt

lisa3.txt

lisa4.txt

lisa5.txtlisa6.txt

marie−luise.txt

marie−luise2.txt

marie−luise3.txt

marie−luise4.txt

miriam.txt

miriam2.txt

miriam3.txtmiriam4.txt

miriam5.txt

miriam6.txt

philipp.txt

simon.txt

simon2.txt

simon3.txt

thu.txt

thu2.txt

thu3.txt

thu4.txt

thu5.txt

thu6.txtthu7.txt

thu8.txt

viktor.txt

viktor2.txt

Niko Schenk Corpus Linguistics – Introduction 30 / 35

Page 31: Corpus Linguistics - A Simple Introduction · Corpus Linguistics A Simple Introduction Niko Schenk n.schenk@em.uni-frankfurt.de Applied Computational Linguistics Lab Computer Science

Hypothesis Testing Hypothesis Generation

Hypothesis Generation

Generated hypothesis:→ Students appearing closer together in the visualization are similar in language use.

Niko Schenk Corpus Linguistics – Introduction 31 / 35

Page 32: Corpus Linguistics - A Simple Introduction · Corpus Linguistics A Simple Introduction Niko Schenk n.schenk@em.uni-frankfurt.de Applied Computational Linguistics Lab Computer Science

Hypothesis Testing Hypothesis Generation

What is Corpus Linguistics? (II)

Starting with the corpus and no specific hypothesis, you use

large textual resources and statistics to detect contrastive (interesting)patterns, i.e. you generate a hypothesis.

Niko Schenk Corpus Linguistics – Introduction 32 / 35

Page 33: Corpus Linguistics - A Simple Introduction · Corpus Linguistics A Simple Introduction Niko Schenk n.schenk@em.uni-frankfurt.de Applied Computational Linguistics Lab Computer Science

Hypothesis Testing Hypothesis Generation

Summary

Corpus Linguistics can be divided into two parts

1 hypothesis testing

2 hypothesis generation

You usually use

large textual (linguistic) resources which are electronically available.

software to analyze (search) the data.

Frequencies are essential.

Niko Schenk Corpus Linguistics – Introduction 33 / 35

Page 34: Corpus Linguistics - A Simple Introduction · Corpus Linguistics A Simple Introduction Niko Schenk n.schenk@em.uni-frankfurt.de Applied Computational Linguistics Lab Computer Science

Hypothesis Testing Hypothesis Generation

Homework Assignment

Niko Schenk Corpus Linguistics – Introduction 34 / 35

Page 35: Corpus Linguistics - A Simple Introduction · Corpus Linguistics A Simple Introduction Niko Schenk n.schenk@em.uni-frankfurt.de Applied Computational Linguistics Lab Computer Science

Hypothesis Testing Hypothesis Generation

Corpus Linguistics—An Example

Google offers an exploratory search functionality as a corpus linguistic application.

Cf. Google Books Corpus1 / Google Ngram Viewer2

1http://googlebooks.byu.edu/x.asp – AE: 155 billion words2Cf. http://books.google.com/ngrams/

Niko Schenk Corpus Linguistics – Introduction 35 / 35