Corpora in Linguistic Research

Nanjing University (南京大学)

Li Changsheng (李长生)

Tel: 025-8443-6787

Email: csli@jlonline.com

Order of Presentation

I. Corpus Research versus Linguistic Research
II. Influential Corpora
III. Corpus Analysis
IV. More on Statistical Analysis
V. Q and maybe A (anytime during presentation)

I. Corpus Research versus Linguistic Research

Corpus Research=Linguistic Research

Language (features)
Learner language (features)

I. Corpus Research versus Linguistic Research

Corpus Research≠Linguistic Research

(Large,) representative authentic data

II. Influential Corpora

Native-speaker corpora
Learner corpora

Native-speaker Corpora

Collins Corpus/Bank of English

A 2.5-billion word analytical database of English. Contains written material from websites, newspapers, magazines and books published around the world, and spoken material from radio, TV and everyday conversations.

New data is fed into the corpus every month, to help the Collins dictionary editors identify new words and meanings from the moment they are first used.

Bank of English: part of the Collins Corpus. Contains 650 million words from a carefully chosen selection of sources, to give a balanced and accurate reflection of English as it is used every day.

British National Corpus

Contains approximately 100 million words of written texts (90%) and transcripts of speech (10%) in modern British English.

Can be accessed online remotely using the BNC Online service.

American National Corpus

Contains 11.5 million words of written and spoken American English data (8.3 million words for writing and 3.2 million words for speech).

Longman/Lancaster Corpus

Contains about 30 million words of published English.

British data takes up 50% and American data 40% while the other 10% represents other varieties such as Australian, African and Irish English.

Learner Corpora

International Corpus of Learner English

Contains argumentative essays written by advanced learners of English, i.e. university students of English as a foreign language (EFL) in their 3rd or 4th year of study.

Contains over 2.5 million words in the form of 3,640 texts, each 500 to 1,000 words long, written by EFL learners from 11 mother tongue backgrounds, namely Bulgarian, Czech, Dutch, Finnish, French, German, Italian, Polish, Russian, Spanish, and Swedish.

CLEC

Contains one million words from writing produced by Chinese learners of English from five proficiency levels: middle school students, junior and senior non-English majors, and junior and senior English majors.

Annotated with learner errors using an annotation scheme which consists of 61 error types clustered in 11 categories.

SWECCL

Contains about 2 million words in total of spoken and written English produced by Chinese university English majors.

LSECCL

Year 1
Recording 1
Task 1 - Reading aloud
Task 2 - Monologue - The Most Unforgettable Birthday
Task 3 - Dialogue - Holiday plan
Recording 2
Task 1 - Retelling
Task 2 - Monologue - Whether it is appropriate for college students to rent apartments outside the campus and live there
Task 3 - Dialogue - Whether exams should be abolished

LSECCL

Year 2
Recording 1
Task 1 - Reading aloud
Task 2 - Monologue - Describe one of the persons you admire most
Task 3 - Dialogue - What gift to buy for a friend - Lily
Recording 2
Task 1 - Retelling
Task 2 - Monologue - Make critical comments on the use of electronic dictionaries among college students
Task 3 - Dialogue - Whether it is a good practice or not to keep one’s own computer in the dorm

LSECCL

Year 3
Recording 1
Task 1 - Reading aloud
Task 2 - Monologue - Describe one of your experiences when you had a great ambition to do something
Task 3 - Dialogue - Talk about ways of relaxation after a month-long preparation for an exam
Recording 2
Task 1 - Retelling
Task 2 - Monologue - Do you think it is appropriate for college students to get married
Task 3 - Dialogue - Talk about the necessity of having certificates

LSECCL

Year 4
Recording 1
Task 1 - Reading aloud
Task 2 - Monologue - The Most Unforgettable Birthday
Task 3 - Dialogue - Holiday plan
Recording 2
Task 1 - Retelling
Task 2 - Monologue - Whether it is appropriate for college students to rent apartments outside the campus and live there
Task 3 - Dialogue - Whether exams should be abolished

III. Corpus Analysis

(Tagging corpus data)
Calculating frequencies and frequency differences
Frequencies of occurrence
Frequencies of co-occurrence
Frequency differences across registers/corpora/periods of time
(Transferring frequencies)
Statistical analysis

Lexis

College English Curriculum Requirements (《大学英语课程教学要求》, 2007): reference word list

Lexis

headwords

Lexis

meanings: deal (Biber et al., 1998)

Lexis

synonyms: utterly, perfectly

Lexis

synonyms: big, large, great (Biber et al., 1998)

Lexis

collocations: system

Lexis

chunks (Qi, 2006)

Step 1: Run WordList
Step 2: Select the corpus
Step 3: Make an index
Step 4: Click Compute, then Clusters
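The cluster count produced by the WordList steps above can be approximated in a few lines of Python. This is only a minimal sketch, assuming whitespace-tokenised, lower-cased text; the function name `count_clusters` and the sample sentence are illustrative and do not come from the cited study.

```python
from collections import Counter

def count_clusters(text, n=3):
    """Count n-word clusters (n-grams) in a whitespace-tokenised text."""
    tokens = text.lower().split()
    grams = (" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return Counter(grams)

# Hypothetical learner text, for illustration only
sample = "I think that I think that it is good"
clusters = count_clusters(sample, n=3)
print(clusters.most_common(2))
```

A real corpus would need proper tokenisation (punctuation, contractions), which this sketch leaves out.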

Grammar

that-clause, to-clause (Biber et al., 1998)

that-clause: <V* that <CST>
to-clause: to <TO> * <V?I>/to <TO> * <R* * <V?I>/to <TO> * <R* R <* * <V?I>

Grammar

syntactic co-occurrences of try (McEnery and Wilson, 2001)

Learner Language

Frequency differences across corpora
Frequency differences across periods of time

Across Corpora

SWECCL

ICLE

BNC

L1 (NNS-NNS)

L1 (NNS-NS)

Corpus Analysis

Tagging Corpus Data

CLAWS: book → book_NN1

Batch text replacement tool (超级批量文本替换): book_NN1 → book <NN1>
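The same reformatting of CLAWS output (book_NN1 to book <NN1>) can be sketched with a regular expression. This is an illustrative alternative to the batch replacement tool, assuming tags consist of upper-case letters and digits only; `claws_to_angle` is a hypothetical helper name.

```python
import re

# word_TAG -> "word <TAG>"; assumes tags are upper-case letters and digits
TAGGED_TOKEN = re.compile(r"([^\s_]+)_([A-Z0-9]+)")

def claws_to_angle(text):
    """Rewrite every word_TAG token as 'word <TAG>'."""
    return TAGGED_TOKEN.sub(r"\1 <\2>", text)

print(claws_to_angle("the_AT0 book_NN1 is_VBZ read_VVN"))
```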

Calculating Frequencies and Frequency Differences

passive voice (be done) (Li, 2007a)

* <VB* * <V?N>
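One possible reading of the search pattern above is "any word tagged VB*, immediately followed by any word tagged V?N". Under that assumption, and assuming the text is in the word <TAG> format, the hits can be counted as sketched below. The regular expression and the sample sentence are illustrative, not the author's own procedure.

```python
import re

# Assumed reading of "* <VB* * <V?N>": a form of "be" directly followed
# by a past participle (three-character tag starting with V, ending in N)
PASSIVE = re.compile(r"\S+ <VB\w*> \S+ <V\wN>")

def count_passives(tagged_text):
    """Count adjacent be + past participle sequences in tagged text."""
    return len(PASSIVE.findall(tagged_text))

# Hypothetical tagged sentence
tagged = ("the <AT0> book <NN1> was <VBD> written <VVN> "
          "and <CJC> it <PNP> is <VBZ> read <VVN>")
print(count_passives(tagged))
```

Because the pattern requires adjacency, passives with an intervening adverb (was often written) are not counted.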

Statistical Analysis

Differences (two or three corpora)

1. chi-square

Under Analyze, choose Descriptive Statistics, then Crosstabs. Move one variable into the Row(s) box and the other into the Column(s) box. Click Statistics, and check off Chi-square. Click Cells, and check off Expected.

2. one-way chi-square

Under Analyze, choose Nonparametric Tests, then Chi-Square. Move the variable into the Test Variable List box. Click OK.
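To see what the Crosstabs chi-square computes, the statistic can be worked out by hand. The sketch below uses invented counts for two corpora; `chi_square` is a hypothetical helper, and the result is the uncorrected Pearson value (for 2 x 2 tables SPSS also prints a continuity-corrected value).

```python
def chi_square(table):
    """Pearson chi-square and degrees of freedom for an r x c count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns
            expected = row_totals[i] * col_totals[j] / grand
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

# Invented counts: [target feature, everything else] in corpus A and corpus B
observed = [[30, 70], [50, 50]]
stat, df = chi_square(observed)
print(round(stat, 3), df)
```

The statistic is then compared with the critical value for the given degrees of freedom, which SPSS reports as a significance level.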

Another Example

AWL (Li, 2007a)

+matchlist

Across Periods of Time

LSECCL

Grades (Year 1-Year 2-Year 3-Year 4)

Title

1) Key terms
3) Noun phrase
4) Word limit (<20)
5) Capitalization

Li (2007b)

Abstract

Summary

Acknowledgments

Specific

Introduction

Motivation for the study, theoretical and practical significance of the study, overall structure

Literature Review

Key terms
Theoretical issues
Empirical studies
Unresolved issues

Literature Review

Bibliographies/Indices/Databases (ERIC, NJU, Google Scholar, corpus4u)

Papers (Chen, 2004)
Journals (Applied Linguistics, Language Learning)
Books (FLTRP)

Research Questions

LSECCL

Grades (Year 1-Year 2-Year 3-Year 4)

Corpus Analysis

Tagging Corpus Data

Microsoft Word: I think → I think <sv> <ip> <cm> <0>

Calculating Frequencies and Frequency Differences

<sv>/<ap>/<dn> <cm>

Transferring Frequencies

Microsoft Excel

=COUNTIF(N1:N5000,"D:\YEAR1\1-2-B02B.TXT")
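The COUNTIF formula above counts how many cells in a column equal a given file name, that is, how many hits each file contributes. A minimal Python equivalent is sketched below. The list `file_column` stands in for column N, and the second file name is invented for illustration.

```python
from collections import Counter

# Stand-in for column N: one file name per concordance hit.
# The second file name is hypothetical.
file_column = [
    r"D:\YEAR1\1-2-B02B.TXT",
    r"D:\YEAR1\1-2-B02B.TXT",
    r"D:\YEAR1\1-2-B03B.TXT",
]

# COUNTIF ignores case, so names are upper-cased before counting
hits_per_file = Counter(name.upper() for name in file_column)
print(hits_per_file[r"D:\YEAR1\1-2-B02B.TXT"])
```

A single `Counter` gives the frequency for every file at once, so no formula has to be repeated per file.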

Statistical Analysis

Changes in frequency differences (data from three or more points in time)

Wilcoxon

Under Analyze, choose Nonparametric Tests, then 2 Related Samples. Move the variables into the Test Pair(s) List box.

Results and Discussion

Answers to the research questions, and reasons for the answers

Conclusion

Summary of the findings, theoretical and practical implications of the findings, and limitations of the study

References

Works cited

Appendices

Sample tagged text, etc.

IV. More on Statistical Analysis

Research Questions in Linguistic Research

1. Differences
2. Changes
3. Correlation
4. Effects

Differences (2 groups of subjects, 1 test)

1) independent t-test
Entering the data
Analyzing the data

Under Analyze, choose Compare Means, then Independent-Samples T Test. Move the dependent variable into the Test Variable box, and the independent variable into the Grouping Variable box. Click Define Groups and type in the values of the two groups.

Tabulating the results
Describing the results
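As a rough check on the independent t-test output, the t value can be computed by hand. The sketch below assumes equal variances in the two groups (the "equal variances assumed" row in SPSS) and uses invented scores; `independent_t` is a hypothetical helper name.

```python
from math import sqrt
from statistics import mean, variance

def independent_t(a, b):
    """Student's t and df for two independent samples (pooled variance)."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))
    return t, na + nb - 2

# Invented scores for two groups of subjects
group1 = [2, 4, 6, 8]
group2 = [1, 3, 5, 7]
t, df = independent_t(group1, group2)
print(round(t, 3), df)
```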

2) Mann-Whitney U
Entering the data
Analyzing the data

Under Analyze, choose Nonparametric Tests, then 2 Independent Samples. Move the dependent variable into the Test Variable List box, and the independent variable into the Grouping Variable box. Click Define Groups. Check off Mann-Whitney U.

Tabulating the results
Describing the results
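The Mann-Whitney U statistic is based on ranks rather than raw scores. A minimal sketch of the calculation follows, with invented data. The helper names are hypothetical, and only the U value is computed, not its significance.

```python
def average_ranks(values):
    """Rank values from 1 upward; tied values share their mean rank."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]

def mann_whitney_u(a, b):
    """Return the smaller of the two U values for samples a and b."""
    ranks = average_ranks(list(a) + list(b))
    rank_sum_a = sum(ranks[:len(a)])
    u_a = rank_sum_a - len(a) * (len(a) + 1) / 2
    u_b = len(a) * len(b) - u_a
    return min(u_a, u_b)

# Invented scores for two groups of subjects
u = mann_whitney_u([3, 5, 8], [4, 6, 7, 9])
print(u)
```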

Differences (3 groups of subjects, 1 test)

1) one-way ANOVA
Entering the data
Analyzing the data

Under Analyze, choose Compare Means, then One-Way ANOVA. Move the dependent variable into the Dependent List box, and the independent variable into the Factor box. Click Post Hoc, and choose Tukey (equal number of cases in each group) or Bonferroni (unequal number of cases).

Tabulating the results
Describing the results
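The F ratio reported by One-Way ANOVA compares variation between groups with variation within groups. The sketch below shows the arithmetic on invented scores. `one_way_anova` is a hypothetical helper, and post hoc tests are not covered.

```python
from statistics import mean

def one_way_anova(groups):
    """Return F, between-groups df and within-groups df."""
    scores = [x for g in groups for x in g]
    grand = mean(scores)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(scores) - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

# Invented scores for three groups of subjects
f, df_b, df_w = one_way_anova([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
print(round(f, 3), df_b, df_w)
```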

2) Kruskal-Wallis H
Entering the data
Analyzing the data

Under Analyze, choose Nonparametric Tests, then K Independent Samples. Move the dependent variable into the Test Variable List box, and the independent variable into the Grouping Variable box. Click Define Range. Check off Kruskal-Wallis H.

Tabulating the results
Describing the results

Differences (3 groups of subjects, 2 tests)

MANOVA
Entering the data
Analyzing the data

Under Analyze, choose General Linear Model, then Multivariate. Move the dependent variables into the Dependent Variables box, and the independent variable into the Fixed Factor(s) box.

Tabulating the results
Describing the results

Differences (2 or 3 groups of subjects)

1) chi-square
Entering the data
Analyzing the data

Under Analyze, choose Descriptive Statistics, then Crosstabs. Move one variable into the Row(s) box and the other into the Column(s) box. Click Statistics, and check off Chi-square. Click Cells, and check off Expected.

Tabulating the results
Describing the results

2) one-way chi-square
Entering the data
Analyzing the data

Under Analyze, choose Nonparametric Tests, then Chi-Square. Move the variable into the Test Variable List box. Click OK.

Tabulating the results
Describing the results
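The one-way chi-square compares observed frequencies with expected ones. The sketch below assumes equal expected frequencies across categories, which is the default in the SPSS dialog. The counts are invented and `goodness_of_fit` is a hypothetical helper name.

```python
def goodness_of_fit(observed):
    """One-way chi-square against equal expected frequencies."""
    expected = sum(observed) / len(observed)
    stat = sum((o - expected) ** 2 / expected for o in observed)
    return stat, len(observed) - 1

# Invented frequencies of one feature in three corpora of equal size
stat_gof, df_gof = goodness_of_fit([20, 30, 50])
print(round(stat_gof, 3), df_gof)
```

If the corpora differ in size, the expected frequencies should be made proportional to corpus size instead of equal.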

Changes (1 group of subjects, 2 tests)

1) paired t-test
Entering the data
Analyzing the data

Under Analyze, choose Compare Means, then Paired-Samples T Test. Click on a pair of variables, and move them into the Paired Variables box.

Tabulating the results
Describing the results
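The paired t-test works on the difference between each subject's two scores. A minimal sketch with invented scores is given below; `paired_t` is a hypothetical helper name.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(first, second):
    """t and df for the mean difference between two related samples."""
    diffs = [b - a for a, b in zip(first, second)]
    t = mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))
    return t, len(diffs) - 1

# Invented scores of the same four subjects on two tests
test1 = [10, 12, 9, 11]
test2 = [12, 15, 10, 15]
t_paired, df_paired = paired_t(test1, test2)
print(round(t_paired, 3), df_paired)
```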

2) Wilcoxon
Entering the data
Analyzing the data

Under Analyze, choose Nonparametric Tests, then 2 Related Samples. Move the variables into the Test Pair(s) List box.

Tabulating the results
Describing the results
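The Wilcoxon signed-rank statistic ranks the absolute differences between paired scores and sums the ranks by sign. The sketch below uses invented data and drops zero differences, as is conventional. The helper names are hypothetical, and no significance level is computed.

```python
def average_ranks(values):
    """Rank values from 1 upward; tied values share their mean rank."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]

def wilcoxon_w(first, second):
    """Return the smaller of the positive and negative rank sums."""
    diffs = [b - a for a, b in zip(first, second) if a != b]
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return min(w_plus, w_minus)

# Invented scores of the same five subjects on two tests
w = wilcoxon_w([10, 12, 9, 11, 14], [12, 11, 12, 15, 14])
print(w)
```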

Changes (1 group of subjects, 3 tests)

1) repeated-measures ANOVA
Entering the data
Analyzing the data

Under Analyze, choose General Linear Model, then Repeated Measures.

Tabulating the results
Describing the results

2) Wilcoxon
Entering the data
Analyzing the data

Under Analyze, choose Nonparametric Tests, then 2 Related Samples. Move the variables into the Test Pair(s) List box.

Tabulating the results
Describing the results

Correlation (2 or 3 variables)

1) Pearson
Entering the data
Analyzing the data

Under Analyze, choose Correlate, then Bivariate. Move the variables into the Variables box. Check off Pearson.

Tabulating the results
Describing the results

2) Spearman
Entering the data
Analyzing the data

Under Analyze, choose Correlate, then Bivariate. Move the variables into the Variables box. Check off Spearman.

Tabulating the results
Describing the results
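The two correlation coefficients differ only in what they are computed on: Pearson uses the raw scores, Spearman uses their ranks. The sketch below illustrates this with invented data. The helper names are hypothetical.

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)

def average_ranks(values):
    """Rank values from 1 upward; tied values share their mean rank."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]

def spearman_rho(x, y):
    """Spearman correlation: Pearson's r applied to the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Invented scores: y rises with x, but not in a straight line
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 26]
print(round(pearson_r(x, y), 3), round(spearman_rho(x, y), 3))
```

With these data Spearman's rho is exactly 1 because the ordering is perfectly preserved, while Pearson's r falls slightly below 1 because the relationship is not linear.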

Effects (2 or 3 variables)

1) linear regression
Entering the data
Analyzing the data

Under Analyze, choose Regression, then Linear. Enter the dependent and independent variables. Choose an appropriate method (Stepwise or Enter), and click OK.

Tabulating the results
Describing the results
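For a single independent variable, the regression coefficients can be found by ordinary least squares. The sketch below covers only this simple case with invented data; `simple_regression` is a hypothetical helper, and the Stepwise and Enter methods for several predictors are not covered.

```python
from statistics import mean

def simple_regression(x, y):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return slope, my - slope * mx

# Invented data: independent variable x, dependent variable y
slope, intercept = simple_regression([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)
```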

2) categorical regression
Entering the data
Analyzing the data

Under Analyze, choose Regression, then Optimal Scaling. Enter the dependent and independent variables. Choose an appropriate method (Stepwise or Enter), and click OK.

Tabulating the results
Describing the results

V. Q and A