Corpora in Linguistic Research

Nanjing University (南京大学)

Li Changsheng (李长生)

Tel: 025-8443-6787

Email: csli@jlonline.com

Order of Presentation

I. Corpus Research versus Linguistic Research
II. Influential Corpora
III. Corpus Analysis
IV. More on Statistical Analysis
V. Q and maybe A (anytime during presentation)

I. Corpus Research versus Linguistic Research

Corpus Research = Linguistic Research

Language (features)
Learner language (features)

I. Corpus Research versus Linguistic Research

Corpus Research ≠ Linguistic Research

(Large,) representative authentic data

II. Influential Corpora

Native-speaker corpora
Learner corpora

Native-speaker Corpora

Collins Corpus/Bank of English

A 2.5-billion-word analytical database of English.

Contains written material from websites, newspapers, magazines and books published around the world, and spoken material from radio, TV and everyday conversations.

New data is fed into the corpus every month, to help the Collins dictionary editors identify new words and meanings from the moment they are first used.

Bank of English: part of the Collins Corpus. Contains 650 million words from a carefully chosen selection of sources, to give a balanced and accurate reflection of English as it is used every day.

British National Corpus

Contains approximately 100 million words of written texts (90%) and transcripts of speech (10%) in modern British English.

Can be accessed online remotely using the BNC Online service.

American National Corpus

Contains 11.5 million words of written and spoken American English data (8.3 million words of writing and 3.2 million words of speech).

Longman/Lancaster Corpus

Contains about 30 million words of published English.

British data takes up 50% and American data 40% while the other 10% represents other varieties such as Australian, African and Irish English.

Learner Corpora

International Corpus of Learner English

Contains argumentative essays written by advanced learners of English, i.e. university students of English as a foreign language (EFL) in their 3rd or 4th year of study.

Contains over 2.5 million words in the form of 3,640 texts, each from 500 to 1,000 words long, written by EFL learners from 11 mother tongue backgrounds: Bulgarian, Czech, Dutch, Finnish, French, German, Italian, Polish, Russian, Spanish, and Swedish.

CLEC

Contains one million words from writing produced by Chinese learners of English from five proficiency levels: middle school students, junior and senior non-English majors, and junior and senior English majors.

Annotated with learner errors using an annotation scheme which consists of 61 error types clustered in 11 categories.

SWECCL

Contains about 2 million words in total of spoken and written English produced by Chinese university English majors.

LSECCL

Year 1

Recording 1
Task 1 - Reading aloud
Task 2 - Monologue - The Most Unforgettable Birthday
Task 3 - Dialogue - Holiday plan

Recording 2
Task 1 - Retelling
Task 2 - Monologue - Whether it is appropriate for college students to rent apartments outside the campus and live there
Task 3 - Dialogue - Whether exams should be abolished

LSECCL

Year 2

Recording 1
Task 1 - Reading aloud
Task 2 - Monologue - Describe one of the persons you admire most
Task 3 - Dialogue - What gift to buy for a friend - Lily

Recording 2
Task 1 - Retelling
Task 2 - Monologue - Make critical comments on the use of electronic dictionaries among college students
Task 3 - Dialogue - Whether it is a good practice or not to keep one’s own computer in the dorm

LSECCL

Year 3

Recording 1
Task 1 - Reading aloud
Task 2 - Monologue - Describe one of your experiences when you had a great ambition to do something
Task 3 - Dialogue - Talk about ways of relaxation after a month-long preparation for an exam

Recording 2
Task 1 - Retelling
Task 2 - Monologue - Do you think it is appropriate for college students to get married
Task 3 - Dialogue - Talk about the necessity of having certificates

LSECCL

Year 4

Recording 1
Task 1 - Reading aloud
Task 2 - Monologue - The Most Unforgettable Birthday
Task 3 - Dialogue - Holiday plan

Recording 2
Task 1 - Retelling
Task 2 - Monologue - Whether it is appropriate for college students to rent apartments outside the campus and live there
Task 3 - Dialogue - Whether exams should be abolished

III. Corpus Analysis

(Tagging corpus data)
Calculating frequencies and frequency differences
Frequencies of occurrence
Frequencies of co-occurrence
Frequency differences across registers/corpora/periods of time
(Transferring frequencies)
Statistical analysis

Lexis

College English Curriculum Requirements (《大学英语课程教学要求》, 2007): reference word list

Lexis

headwords

Lexis

meanings: deal (Biber et al., 1998)

Lexis

synonyms: utterly, perfectly

Lexis

synonyms: big, large, great (Biber et al., 1998)

Lexis

collocations: system

Lexis

chunks (Qi, 2006)

Step 1: Run WordList
Step 2: Select the corpus
Step 3: Build the index
Step 4: Click Compute, then Clusters
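For readers without the concordancing software, the idea behind the cluster count can be sketched in a few lines of Python. This is only an illustration of counting recurrent word sequences, not the tool's own algorithm; the function name, the cluster size of 3 and the minimum frequency of 2 are assumptions chosen for the example.

```python
from collections import Counter

def count_clusters(tokens, size=3, min_freq=2):
    """Count recurrent word sequences (clusters) of a given size.

    `size` and `min_freq` are illustrative settings, not values from the slides.
    """
    # Slide a window of `size` tokens over the text.
    grams = zip(*(tokens[i:] for i in range(size)))
    counts = Counter(" ".join(gram) for gram in grams)
    # Keep only clusters that recur at least `min_freq` times.
    return [(gram, n) for gram, n in counts.most_common() if n >= min_freq]

# Hypothetical learner text, lower-cased and split on whitespace.
text = "i think that it is good and i think that it is useful"
clusters = count_clusters(text.split())
print(clusters)
```

Real corpus work would also need tokenization that handles punctuation and case, which this sketch leaves out.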

Grammar

that-clause, to-clause (Biber et al., 1998)

<V* that <CST>
to <TO> * <V?I>/to <TO> * <R* * <V?I>/to <TO> * <R* R <* * <V?I>

Grammar

syntactic co-occurrences of try (McEnery and Wilson, 2001)

Learner Language

Frequency differences across corpora
Frequency differences across periods of time

Across Corpora

SWECCL

ICLE

BNC

L1 (NNS-NNS)

L1 (NNS-NS)

Corpus Analysis

Tagging Corpus Data

CLAWS: book → book_NN1

Super Batch Text Replace (超级批量文本替换): book_NN1 → book <NN1>
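The same reformatting step can be sketched in Python for readers who do not have the batch-replace tool. This is a minimal sketch, assuming whitespace-separated tokens of the form word_TAG; the function name and the sample sentence with its tags are hypothetical.

```python
def underscore_to_angle(tagged_text):
    """Rewrite word_TAG tokens as 'word <TAG>', e.g. book_NN1 -> book <NN1>."""
    converted = []
    for token in tagged_text.split():
        # Split on the LAST underscore so words containing "_" keep their form.
        word, sep, tag = token.rpartition("_")
        if sep and word and tag:
            converted.append(f"{word} <{tag}>")
        else:
            converted.append(token)  # leave untagged tokens untouched
    return " ".join(converted)

# Hypothetical tagger output used only for illustration.
print(underscore_to_angle("the_AT0 book_NN1 was_VBD written_VVN"))
```

Note that the sketch collapses all whitespace to single spaces, so line breaks in the input are not preserved.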

Calculating Frequencies and Frequency Differences

passive voice (be done) (Li, 2007a)

* <VB* * <V?N>
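As a rough Python counterpart to the search pattern above, the sketch below counts a form of be immediately followed by a past participle in text formatted as "word <TAG>". It assumes that format and CLAWS-style tags; the regular expression, the function name and the sample sentence are illustrative assumptions, and like the pattern itself it misses passives with intervening adverbs.

```python
import re

# Any word tagged VB... followed directly by any word tagged V?N.
PASSIVE_RE = re.compile(r"\S+ <VB[^>]*> \S+ <V.N>")

def count_passives(tagged_text):
    """Count adjacent 'be + past participle' sequences in tagged text."""
    return len(PASSIVE_RE.findall(tagged_text))

# Hypothetical tagged sample.
sample = ("the <AT0> book <NN1> was <VBD> written <VVN> and <CJC> "
          "it <PNP> is <VBZ> read <VVN> widely <AV0>")
print(count_passives(sample))
```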

Statistical Analysis

Differences (two or three corpora)

1. chi-square
Under Analyze, choose Descriptive Statistics, then Crosstabs. Move one variable into the Row(s) box and the other into the Column(s) box. Click Statistics, and check off Chi-square. Click Cells, and check off Expected.

2. one-way chi-square
Under Analyze, choose Nonparametric Tests, then Chi-Square. Move the variable into the Test Variable List box. Click OK.
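To make the Crosstabs output less of a black box, here is a minimal Python sketch of the Pearson chi-square for a 2 x 2 table (two corpora by feature present/absent), including the expected counts. It applies no continuity correction, and the p-value shortcut holds only for one degree of freedom. The counts and the function name are hypothetical.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square for a 2x2 table of counts (no continuity correction)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected count for each cell = row total * column total / grand total.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    chi2 = sum((table[i][j] - expected[i][j]) ** 2 / expected[i][j]
               for i in range(2) for j in range(2))
    # With df = 1 the upper-tail probability can be written with erfc.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value, expected

# Hypothetical counts: rows = two corpora, columns = [feature, other].
chi2, p_value, expected = chi_square_2x2([[30, 70], [10, 90]])
print(chi2, p_value, expected)
```

With three corpora the table has more than one degree of freedom, so the p-value line would need a chi-square distribution function from a statistics package.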

Another Example

AWL (Li, 2007a)

+matchlist

Across Periods of Time

LSECCL

Grades (Year 1-Year 2-Year 3-Year 4)

Title

1) Key terms
3) Noun phrase
4) Word limit (<20)
5) Capitalization

Li (2007b)

Abstract

Summary

Acknowledgments

Specific

Introduction

Motivation for the study, theoretical and practical significance of the study, overall structure

Literature Review

Key terms
Theoretical issues
Empirical studies
Unresolved issues

Literature Review

Bibliographies/Indices/Databases (ERIC, NJU, Google Scholar, corpus4u)
Papers (Chen, 2004)
Journals (Applied Linguistics, Language Learning)
Books (FLTRP)

Research Questions

LSECCL

Grades (Year 1-Year 2-Year 3-Year 4)

Corpus Analysis

Tagging Corpus Data

Microsoft Word: I think → I think <sv> <ip> <cm> <0>

Calculating Frequencies and Frequency Differences

<sv>/<ap>/<dn> <cm>

Transferring Frequencies

Microsoft Excel

=COUNTIF(N1:N5000,"D:\YEAR1\1-2-B02B.TXT")
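The COUNTIF formula tallies how many rows in the column name a given file. A rough Python equivalent, assuming the file names have already been read into a list, is sketched below; the function name, the sample list and its second path are hypothetical. Unlike COUNTIF, the comparison here is case-sensitive.

```python
from collections import Counter

def hits_per_file(file_column):
    """Count how many rows (concordance hits) belong to each source file."""
    return Counter(file_column)

# Hypothetical column of file names, one entry per concordance hit.
column = [
    r"D:\YEAR1\1-2-B02B.TXT",
    r"D:\YEAR1\1-2-B02B.TXT",
    r"D:\YEAR1\1-2-B03B.TXT",
]
counts = hits_per_file(column)
print(counts[r"D:\YEAR1\1-2-B02B.TXT"])
```

One pass gives the frequency for every file at once, so no separate formula per file is needed.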

Statistical Analysis

Changes in frequency differences (three or more sets of data)

Wilcoxon
Under Analyze, choose Nonparametric Tests, then 2 Related Samples. Move the variables into the Test Pair(s) List box.

Results and Discussion

Answers to the research questions, and reasons for the answers

Conclusion

Summary of the findings, theoretical and practical implications of the findings, and limitations of the study

References

Works cited

Appendices

Sample tagged text, etc

IV. More on Statistical Analysis

Research Questions in Linguistic Research

1. Differences
2. Changes
3. Correlation
4. Effects

Differences (2 groups of subjects, 1 test)

1) independent t-test
Entering the data
Analyzing the data
Under Analyze, choose Compare Means, then Independent-Samples T Test. Move the dependent variable into the Test Variable box, and the independent variable into the Grouping Variable box. Click Define Groups and type in the values of the two groups.
Tabulating the results
Describing the results
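For a sense of what the t value in the output represents, the following Python sketch computes the equal-variances (pooled) t statistic and its degrees of freedom. It does not compute a p-value, and it is not the unequal-variances version. The scores and the function name are hypothetical.

```python
import math
from statistics import mean, variance

def independent_t(group_a, group_b):
    """Pooled-variance t statistic and degrees of freedom for two groups."""
    n1, n2 = len(group_a), len(group_b)
    # Weighted average of the two sample variances.
    pooled_var = ((n1 - 1) * variance(group_a) +
                  (n2 - 1) * variance(group_b)) / (n1 + n2 - 2)
    standard_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (mean(group_a) - mean(group_b)) / standard_error
    return t, n1 + n2 - 2

# Hypothetical scores for two groups of subjects on one test.
t, df = independent_t([2, 4, 6], [1, 2, 3])
print(round(t, 3), df)
```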

2) Mann-Whitney U
Entering the data
Analyzing the data
Under Analyze, choose Nonparametric Tests, then 2 Independent Samples. Move the dependent variable into the Test Variable List box, and the independent variable into the Grouping Variable box. Click Define Groups. Check off Mann-Whitney U.
Tabulating the results
Describing the results
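The U statistic itself is simple to compute by comparing every score in one group with every score in the other. The Python sketch below does this directly, counting ties as half; it gives the statistic only, not a significance level. The data and the function name are hypothetical.

```python
def mann_whitney_u(group_a, group_b):
    """Return (U, U_a, U_b), where U is the smaller of the two U values."""
    # U_a counts the pairs in which the score from group_a is higher; ties count 0.5.
    u_a = sum(1.0 if a > b else 0.5 if a == b else 0.0
              for a in group_a for b in group_b)
    u_b = len(group_a) * len(group_b) - u_a
    return min(u_a, u_b), u_a, u_b

# Hypothetical scores for two independent groups.
u, u_a, u_b = mann_whitney_u([3, 5, 8], [1, 2, 4, 5])
print(u, u_a, u_b)
```

The pairwise approach is quadratic in the group sizes, which is acceptable for classroom-sized samples.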

Differences (3 groups of subjects, 1 test)

1) one-way ANOVA
Entering the data
Analyzing the data
Under Analyze, choose Compare Means, then One-Way ANOVA. Move the dependent variable into the Dependent List box, and the independent variable into the Factor box. Click Post Hoc, and choose Tukey (equal number of cases in each group) or Bonferroni (unequal number of cases).
Tabulating the results
Describing the results
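The F ratio reported for a one-way ANOVA compares between-group and within-group variability. A minimal Python sketch of that calculation follows; it omits the p-value and the post hoc comparisons, and the scores and function name are hypothetical.

```python
from statistics import mean

def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of groups of scores."""
    all_values = [x for group in groups for x in group]
    grand_mean = mean(all_values)
    # Variability of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variability of the scores around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

# Hypothetical scores for three groups of subjects.
f, df_between, df_within = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(round(f, 2), df_between, df_within)
```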

2) Kruskal-Wallis H
Entering the data
Analyzing the data
Under Analyze, choose Nonparametric Tests, then K Independent Samples. Move the dependent variable into the Test Variable List box, and the independent variable into the Grouping Variable box. Click Define Range. Check off Kruskal-Wallis H.
Tabulating the results
Describing the results

Differences (3 groups of subjects, 2 tests)

MANOVA
Entering the data
Analyzing the data
Under Analyze, choose General Linear Model, then Multivariate. Move the dependent variables into the Dependent Variables box, and the independent variable into the Fixed Factor(s) box.
Tabulating the results
Describing the results

Differences (2 or 3 groups of subjects)

1) chi-square
Entering the data
Analyzing the data
Under Analyze, choose Descriptive Statistics, then Crosstabs. Move one variable into the Row(s) box and the other into the Column(s) box. Click Statistics, and check off Chi-square. Click Cells, and check off Expected.
Tabulating the results
Describing the results

2) one-way chi-square
Entering the data
Analyzing the data
Under Analyze, choose Nonparametric Tests, then Chi-Square. Move the variable into the Test Variable List box. Click OK.
Tabulating the results
Describing the results
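The one-way chi-square compares observed frequencies with the frequencies expected if all categories were equally likely. The Python sketch below assumes equal expected frequencies and returns a p-value only when there are two categories (one degree of freedom). The counts and the function name are hypothetical.

```python
import math

def chi_square_goodness_of_fit(observed):
    """One-way chi-square against equal expected frequencies."""
    expected = sum(observed) / len(observed)
    chi2 = sum((o - expected) ** 2 / expected for o in observed)
    df = len(observed) - 1
    # The erfc shortcut is valid for df = 1 only; otherwise no p-value is given.
    p_value = math.erfc(math.sqrt(chi2 / 2)) if df == 1 else None
    return chi2, df, p_value

# Hypothetical frequencies of one feature in two corpora of equal size.
gof_chi2, gof_df, gof_p = chi_square_goodness_of_fit([30, 10])
print(gof_chi2, gof_df, gof_p)
```

If the corpora differ in size, equal expected frequencies are not appropriate and the expected values should be set in proportion to corpus size instead.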

Changes (1 group of subjects, 2 tests)

1) paired t-test
Entering the data
Analyzing the data
Under Analyze, choose Compare Means, then Paired-Samples T Test. Click on a pair of variables, and move them into the Paired Variables box.
Tabulating the results
Describing the results
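The paired t statistic is the mean of the per-subject differences divided by its standard error. A minimal Python sketch, with hypothetical scores and function name and without a p-value, might look like this:

```python
import math
from statistics import mean, stdev

def paired_t(first, second):
    """t statistic and degrees of freedom for two tests taken by one group."""
    diffs = [b - a for a, b in zip(first, second)]
    standard_error = stdev(diffs) / math.sqrt(len(diffs))
    return mean(diffs) / standard_error, len(diffs) - 1

# Hypothetical scores of four subjects on a first and a second test.
paired_t_value, paired_df = paired_t([10, 12, 9, 14], [12, 15, 10, 18])
print(round(paired_t_value, 3), paired_df)
```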

2) Wilcoxon
Entering the data
Analyzing the data
Under Analyze, choose Nonparametric Tests, then 2 Related Samples. Move the variables into the Test Pair(s) List box.
Tabulating the results
Describing the results
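The Wilcoxon signed-rank test ranks the absolute differences between paired scores and sums the ranks of the positive and negative differences separately. The Python sketch below computes those two rank sums, dropping zero differences and averaging tied ranks; it does not compute the Z value or significance. The data and the function names are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = (start + end) / 2 + 1
        start = end + 1
    return ranks

def wilcoxon_signed_rank(before, after):
    """Return (sum of positive ranks, sum of negative ranks)."""
    diffs = [a - b for b, a in zip(before, after) if a != b]  # drop zero differences
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return w_plus, w_minus

# Hypothetical frequencies for five learners at two points in time.
w_plus, w_minus = wilcoxon_signed_rank([10, 12, 9, 14, 11], [13, 12, 8, 18, 16])
print(w_plus, w_minus)
```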

Changes (1 group of subjects, 3 tests)

1) repeated-measures ANOVA
Entering the data
Analyzing the data
Under Analyze, choose General Linear Model, then Repeated Measures.
Tabulating the results
Describing the results

2) Wilcoxon
Entering the data
Analyzing the data
Under Analyze, choose Nonparametric Tests, then 2 Related Samples. Move the variables into the Test Pair(s) List box.
Tabulating the results
Describing the results

Correlation (2 or 3 variables)

1) Pearson
Entering the data
Analyzing the data
Under Analyze, choose Correlate, then Bivariate. Move the variables into the Variables box. Check off Pearson.
Tabulating the results
Describing the results
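Pearson's r can be computed by hand from the deviations of each variable from its mean. The following Python sketch shows the calculation for two variables; the data and the function name are hypothetical, and no significance test is included.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical scores on two measures for five subjects.
r = pearson_r([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(r, 4))
```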

2) Spearman
Entering the data
Analyzing the data
Under Analyze, choose Correlate, then Bivariate. Move the variables into the Variables box. Check off Spearman.
Tabulating the results
Describing the results
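Spearman's rho is the Pearson correlation applied to ranks instead of raw scores. A minimal Python sketch, assuming tied values receive the mean of their ranks, is given below; the data and the function names are hypothetical.

```python
import math

def mean_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    sorted_values = sorted(values)
    ranks = []
    for v in values:
        first = sorted_values.index(v) + 1        # lowest rank taken by v
        last = first + sorted_values.count(v) - 1  # highest rank taken by v
        ranks.append((first + last) / 2)
    return ranks

def spearman_rho(xs, ys):
    """Spearman rank correlation: Pearson's r computed on the ranks."""
    rx, ry = mean_ranks(xs), mean_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sxx = sum((a - mean_x) ** 2 for a in rx)
    syy = sum((b - mean_y) ** 2 for b in ry)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical scores on two measures for five subjects.
rho = spearman_rho([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(rho, 4))
```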

Effects (2 or 3 variables)

1) linear regression
Entering the data
Analyzing the data
Under Analyze, choose Regression, then Linear. Enter the dependent and independent variables. Choose an appropriate method (Stepwise or Enter), and click OK.
Tabulating the results
Describing the results
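For the simplest case of one independent variable, the regression line can be fitted by ordinary least squares in a few lines of Python. This sketch covers only that single-predictor case, so the Stepwise and Enter methods, which concern several predictors, do not apply to it. The data and the function name are hypothetical.

```python
def simple_linear_regression(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) /
             sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: x = independent variable, y = dependent variable.
slope, intercept = simple_linear_regression([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(slope, 2), round(intercept, 2))
```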

2) categorical regression
Entering the data
Analyzing the data
Under Analyze, choose Regression, then Optimal Scaling. Enter the dependent and independent variables. Choose an appropriate method (Stepwise or Enter), and click OK.
Tabulating the results
Describing the results

V. Q and A