Sketch Engine corpus query tool for Japanese and its possible applications

SRDANOVIĆ ERJAVEC Irena, NISHINA Kikuko (Tokyo Institute of Technology)

SKE modified last3 - Sketch Engine · 2016. 6. 25.

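Keywords: Sketch Engine, corpus linguistics, lexicography, second language education, collocation

Summary

As research on building and using corpora has expanded in recent years, this paper reports on the creation of the Japanese version of the Sketch Engine corpus query tool and on how to use it. In addition to the functions of a standard corpus tool, Sketch Engine offers the “Word Sketch” function, which summarizes the grammatical and collocational behavior of a word on a single Web page, and the “Thesaurus” and “Sketch Difference” functions, which provide thesaurus information and show what semantically similar words share and where they differ. The Japanese version of Sketch Engine carries JpWaC, a Web corpus of about 400 million words. The paper focuses on the use of Sketch Engine for Japanese learner's dictionaries and also describes possible applications in Japanese linguistics research and Japanese language education.

1. Introduction

Corpora began to spread in linguistic research around 1980, and about ten years later the discussion had moved on to corpus size and representativeness. Today, when large-scale data is no longer as hard to obtain as it once was, the discussion centers on how to extract just the appropriate data from it. Tools for extracting information from corpora were developed in the 1980s and have become the traditional tools (Kilgarriff & Rundell 2002). With a large corpus, however, a single query word can return 500, 1,000, or 20,000 or more examples, and analyzing them is not easy. Around 2000, corpus tools with functions beyond concordancing therefore began to be developed (Heid et al. 2000, Kilgarriff & Tugwell 2001). This report introduces one of them, Sketch Engine (Kilgarriff et al. 2004), together with the Japanese version that we created (Srdanović et al. 2008, to appear) and its potential.

In addition to the concordance function, Sketch Engine has a function that describes the behavior of a word on a single Web page (Word Sketch), provides thesaurus information (Thesaurus), and presents what near-synonyms share and where they differ (Sketch Difference), offering new methods of linguistic analysis. Starting with English, it has been used over the past few years for a range of languages in linguistics, lexicography, and second language education.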

Sketch Engine

2. Sketch Engine

Sketch Engine Kilgarriff et al. 2004

Erjavec et al. 2007

4 4 Web

Web

Sketch Engine

Sketch Engine

2.1. Sketch Engine

Web Sketch Engine (http://www.sketchengine.com)

4 JpWaC Web 1 Sharoff (2006)

Ueyama & Baroni (2005) Web 5 WAC Baroni & Bernardini, eds. 2006 BootCaT Baroni et al. 2006

HTML

boilerplate removal Web

ChaSen token

lemma tag Erjavec et al. 2006

.jp .com Erjavec et al. 2007 Srdanović et al. 2008

Sketch Engine

2 3

URL

Web

JpWaC

2007

[Figure 1: Sketch Engine]

[Figure 2: Sketch Engine]

[Figure 3: Sketch Engine]

2.2. Word Sketches

22

Word Sketch, Thesaurus Sketch Difference

ChaSen Gahl 1998

corpus query syntax ( ) 4

Word Sketch

4

4 2 1

2 salience 1

modifies_N

( )

4 2 dual

*DUAL

=modifier_Ana/modifies_N

2:"N.Ana" "Aux" "Pref.*"? 1:[tag="N.*" & tag!="N.Suff.*" & tag!="N.bnd.*"]

modifier_Ana / modifies_N

In the first part, 2:"N.Ana" "Aux" "Pref.*"?, a token tagged N.Ana is followed by a token tagged Aux and, optionally, by a token tagged Pref.*. In the second part, 1:[tag="N.*" & tag!="N.Suff.*" & tag!="N.bnd.*"], the token has a tag that matches N.* but not N.Suff.* or N.bnd.*.

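The asterisk (*) means zero or more repetitions, so N.* covers tags such as N.g and N.Prop; the question mark (?) means zero or one occurrence.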
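The modifier_Ana/modifies_N definition can be tried out on a plain list of POS tags. The following is only a minimal sketch, not Sketch Engine's own matcher: it assumes that the bare quoted strings in the rule are matched against the tag attribute, that patterns must match the whole tag, and the function names and the sub-tags `Pref.x`, `N.Suff.x`, and `N.bnd.x` are hypothetical.

```python
import re
from typing import List, Tuple


def tag_matches(pattern: str, tag: str) -> bool:
    """True if the regular expression matches the whole tag, as assumed for CQL."""
    return re.fullmatch(pattern, tag) is not None


def is_head_noun(tag: str) -> bool:
    """Mirror of [tag="N.*" & tag!="N.Suff.*" & tag!="N.bnd.*"]."""
    return (
        tag_matches(r"N.*", tag)
        and not tag_matches(r"N.Suff.*", tag)
        and not tag_matches(r"N.bnd.*", tag)
    )


def find_modifier_pairs(tags: List[str]) -> List[Tuple[int, int]]:
    """Return (modifier index, noun index) pairs for
    2:"N.Ana" "Aux" "Pref.*"? 1:[noun constraint]."""
    pairs = []
    for i in range(len(tags) - 2):
        if not (tag_matches(r"N.Ana", tags[i]) and tag_matches(r"Aux", tags[i + 1])):
            continue
        j = i + 2
        # "Pref.*"? : skip one prefix token if present. A prefix tag can never
        # satisfy the noun constraint, so no backtracking is needed here.
        if tag_matches(r"Pref.*", tags[j]):
            j += 1
        if j < len(tags) and is_head_noun(tags[j]):
            pairs.append((i, j))
    return pairs


if __name__ == "__main__":
    print(find_modifier_pairs(["Adv.P", "N.Ana", "Aux", "N.g", "Aux"]))
```

Note that the dot in these patterns is a regular-expression wildcard rather than a literal period, which is why `N.*` also matches sub-tags such as `N.Prop`.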

Sketch Engine

Concordance CQL (Corpus Query Language)

  • [word=" " | word=" "]

ChaSen

[word=" "] [word=" "] [lemma=" "] 3.2

  • [tag="N.*"]&[word=" "]

Word Sketch

Sketch Engine ChaSen IPADIC)

IPADIC Sketch Engine

Web

http://tell.fll.purdue.edu/chakoshipub/index2.html ChaSen

5 ChaSen

ChaSen Sketch

Engine

token | kana | lemma | POS tag | POS tag-eng

- Adv.P

- N.Ana

Aux

- N.g

Aux

Aux

- Sym.p

ChaSen

ChaSen IPADIC ChaSen

ChaSen

Word Sketch ChaSen

Word Sketch

Word Sketch Concordance

100 Word Sketch

ChaSen

Web

2.3. Thesaurus Sketch Difference

Thesaurus Sketch Difference shared triples 3

triple

Srdanović et al. 2008
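The idea of comparing two words through the triples they share can be illustrated with plain sets. This is a simplified sketch under stated assumptions, not the scoring that Sketch Engine actually uses: each word is reduced to a set of (grammatical relation, collocate) patterns, the overlap ratio is a plain Jaccard coefficient chosen only for illustration, and the collocates `x`, `y`, and `z` are placeholders.

```python
from typing import Dict, Set, Tuple

# One pattern of a word: (grammatical relation, collocate).
Pattern = Tuple[str, str]


def sketch_difference(a: Set[Pattern], b: Set[Pattern]) -> Dict[str, Set[Pattern]]:
    """Split the patterns of two words into shared ones and the ones only one word has."""
    return {"common": a & b, "only_a": a - b, "only_b": b - a}


def shared_ratio(a: Set[Pattern], b: Set[Pattern]) -> float:
    """Share of patterns the two words have in common (Jaccard coefficient)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


if __name__ == "__main__":
    word_a = {("modifier_Ana", "x"), ("modifier_Ai", "y")}
    word_b = {("modifier_Ai", "y"), ("modifier_Ai", "z")}
    print(sketch_difference(word_a, word_b))
    print(shared_ratio(word_a, word_b))
```

In this picture a thesaurus entry lists the words with the highest overlap, while a Sketch Difference display corresponds to the "common" and "only" groups.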

Thesaurus

6

Sketch Difference 7

8

16,309 6,486 2.5

Web

Thesaurus

[Figure 7: Sketch Difference, "only" patterns]

[Figure 8: Sketch Difference, "only" patterns]

2.4. Web

Web

Web

Web

Web

Keller & Lapata 2003 Web

Web JpWaC

Web

Web Sharoff 2006 Ueyama & Baroni 2005

Web Web

Web

Sharoff 2006 Ueyama & Baroni 2005

Web

narrative style Web

interactive style

Web

Web

Web

Ghani et al. 2001

Web

Web

Web

Web

Web

Crystal 2006

Web

• Web

• Web

Web

3. Sketch Engine

Sketch Engine

3.1. Sketch Engine

80 Cobuild

90 Church & Hanks 1989 (MI)

2000 Word Sketch

Sketch Engine BNC British National Corpus

Rundell, ed. 2002 Kilgarriff & Rundell (2002)

Word Sketch Word Sketch

Word Sketch

Sketch Engine

Word

Sketch

Sketch Engine

Kilgarriff & Rundell 2002

‘challenge’

2004

Sketch Engine

3.1.1

9 Word Sketch

[Figure 9: Word Sketch]

9 modifier_Ana modifier_Ai

verb verb verb verb

9

‘initiation’ ‘trial’

-

Word Sketch ‘challenge to something/somebody‘

Concordance 10 Concordance

CQL [word=" "] []{0,3} [word=" "]

Here {0,3} stands for a gap of 0 to 3 tokens between the two words.
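The effect of the []{0,3} gap can be reproduced on a tokenized sentence. This is a minimal sketch, not the Sketch Engine implementation: the function name is hypothetical, and because the query words are missing from the text, English placeholder tokens are used.

```python
from typing import List, Tuple


def find_with_gap(
    tokens: List[str],
    first: str,
    second: str,
    min_gap: int = 0,
    max_gap: int = 3,
) -> List[Tuple[int, int]]:
    """Emulate [word=first] []{min_gap,max_gap} [word=second] on a token list.

    Returns (index of first word, index of second word) for every match.
    """
    hits = []
    for i, token in enumerate(tokens):
        if token != first:
            continue
        for gap in range(min_gap, max_gap + 1):
            j = i + 1 + gap  # position of the second word after `gap` free tokens
            if j < len(tokens) and tokens[j] == second:
                hits.append((i, j))
    return hits


if __name__ == "__main__":
    sentence = ["take", "on", "a", "big", "challenge"]
    print(find_with_gap(sentence, "take", "challenge"))
```

With three tokens between the two words the pair is found; with four or more it is not, which is the window the query fixes.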

11 ( 3

199

[Figure 10]

[Figure 11]

Word Sketch

jaSlo Erjavec et al. 2006

3.1.2

2004

2004

Word Sketch

10 Word Sketch

1) 2) 3)

4)

1)

1,180 364

Sketch Engine

22 2

Sketch Engine

Sketch

Engine Sketch Engine

2)

Word Sketch

Word Sketch Sketch

Engine Web

Sketch Engine

3)

Word Sketch

Word Sketch 12

[Figure 12: Word Sketch]

4)

Word Sketch Sketch

Engine Thesaurus Sketch Difference

A B A

B A

Sketch Difference

Web Web

Word Sketch

Sketch Engine

3.2. Sketch Engine

Sketch Engine

Word Sketch Thesaurus Sketch Difference

Concordance

• suffix ( ) prefix

• suffix_base prefix_base

• bound_V

• V_bound

suffix bound_V

V_bound

Sketch Difference

/ /

Word

Sketch Word Sketch

lemma

2)

Concordance

Concordance 2.2 3.3.1 Concordance

CQL

Concordance CQL

[word=" "][word=" "][lemma=" "]

[word=" "][word=" "][lemma=" "]

lemma

432 2,975

Collocation candidates

Concordance CQL [tag="V.*"][word=" | "][word=" "][lemma=" "]

Web 1,170

CQL [word=" | "][word=" "][lemma=" "]

Collocation candidates 10

Concordance [word=" "] [word=" "] [lemma="

"] 10,845 Collocation candidates

4,000 13

(lexical sets)

[Figure 13]

[word=" "][word=" | "][word=" "][word=" "] [word=" "] [lemma=" "]

Srdanović 2007

Word Sketch

Word Sketch

3.3. Sketch Engine

Sketch Engine

Sketch Engine

1)

Sketch Engine

a b

Sketch Engine

Sketch Engine

Nishina & Yoshihashi 2007

Smrž 2004 Sketch Engine

2)

Sketch Engine

3)

a ( )

b

c

d

3.1 3.2 Sketch Engine

Smrž 2004 Sketch Difference

Thesaurus

Sketch Engine

Smrž 2004

Sketch Engine

Sketch Engine

4)

a

b

c

Sketch Engine

Sketch Engine Smith et al. 2007

3.4.

Sketch Engine

2.3

Web Web

Word Sketch

Thesaurus Joyce 2005 Sketch Engine

ChaSen

ChaSen

Corpus Builder Sketch Engine

WebBootCat Web

Baroni et al. 2006

4.

Sketch Engine

1) ChaSen 4 Web

2) ChaSen

Sketch Engine

Word Sketch Thesaurus Sketch Difference Concordance

1) Web

2)

3) ChaSen

ChaSen

Srdanović Erjavec, Irena 2007

19 , 83-89,

2007 Sketch Engine

18 , 109-112,

2004

Baroni, Marco, Adam Kilgarriff, Jan Pomikalek & Pavel Rychly (2006) WebBootCaT: a web tool for instant corpora, Proceedings of the EuraLex Conference 2006, 123-132.

Baroni, Marco & Silvia Bernardini, eds. (2006) Wacky! Working papers on the Web as Corpus, Bologna: GEDIT.

Church, Kenneth Ward & Patrick Hanks (1989) Word association norms, mutual information, and lexicography, Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics, 76-83.

Crystal, David (2006) Language and the Internet, Cambridge: Cambridge University Press.

Erjavec, Tomaž, Kristina Hmeljak Sangawa & Irena Srdanović Erjavec (2006) jaSlo, A Japanese-Slovene Learners' Dictionary: Methods for Dictionary Enhancement, Proceedings of the 12th EURALEX International Congress.

Erjavec, Tomaž, Adam Kilgarriff & Irena Srdanović Erjavec (2007) A large public-access Japanese corpus and its query tool, CoJaS 2007, The Inaugural Workshop on Computational Japanese Studies.

Gahl, Susanne (1998) Automatic Extraction of subcategorization frames for corpus-based dictionary-building, Proc. EURALEX 1998, 445-452.

Ghani, Rayid, Rosie Jones & Dunja Mladenic (2001) Using the Web to Create Minority Language Corpora, Proceedings of the 2001 ACM CIKM: Tenth International Conference on Information and Knowledge Management, 279-286.

Heid, Ulrich, Stefan Evert, Vincent Docherty, Wolfgang Worsch & Matthias Wermke (2000) Computational tools for semi-automatic corpus-based updating of dictionaries, EURALEX 2000 Proceedings, 183-196.

Joyce, Terry (2005) Constructing a large-scale database of Japanese word associations, In Katsuo Tamaoka (ed.) Corpus Studies on Japanese Kanji (Glottometrics 10), 82-98, Tokyo: Hituzi Syobo & Lüdenscheid, Germany: RAM-Verlag.

Keller, Frank & Maria Lapata (2003) Using the Web to Obtain Frequencies for Unseen Bigrams, Computational Linguistics 29 (3), 459-484.

Kilgarriff, Adam & Michael Rundell (2002) Lexical Profiling Software and its Lexicographic Applications - a Case Study, EURALEX 2002 Proceedings, 807-818.

Kilgarriff, Adam, Pavel Rychly, Pavel Smrž & David Tugwell (2004) The Sketch Engine, Proc. Euralex, 105-116.

Kilgarriff, Adam & David Tugwell (2001) WORD SKETCH: Extraction and Display of Significant Collocations for Lexicography, Proc. workshop "COLLOCATION: Computational Extraction, Analysis and Exploitation", 39th ACL & 10th EACL, 32-38.

Nishina, Kikuko & Kenji Yoshihashi (2007) Japanese Composition Support System Displaying Occurrences and Example Sentences, Symposium on Large-scale Knowledge Resources (LKR2007), 119-122.

Rundell, Michael, ed. (2002) Macmillan English Dictionary for Advanced Learners, London: Macmillan.

Sharoff, Serge (2006) Open-source corpora: using the net to fish for linguistic data, International Journal of Corpus Linguistics 11(4), 435-462.

Smith, Simon, Alice Chen & Adam Kilgarriff (2007) A corpus query tool for SLA: learning Mandarin with the help of Sketch Engine, Practical Applications in Language and Computers - PALC 2007.

Smrž, Pavel (2004) Integrating Natural Language Processing into E-learning - A Case of Czech, Proceedings of the Workshop on eLearning for Computational Linguistics and Computational Linguistics for eLearning, COLING 2004, 106-111.

Srdanović Erjavec, Irena, Tomaž Erjavec & Adam Kilgarriff (2008) A web corpus and word-sketches for Japanese.

Ueyama, Motoko & Marco Baroni (2005) Automated construction and evaluation of a Japanese web-based reference corpus, Proceedings of Corpus Linguistics 2005.

Sketch Engine corpus query tool for Japanese and its possible applications

SRDANOVIĆ ERJAVEC Irena, NISHINA Kikuko

Tokyo Institute of Technology

Keywords

Sketch Engine, corpus linguistics, lexicography, second language learning, collocations

Abstract

Although corpus-based language research has been developing rapidly in recent years, there is still a lack of resources with regard to their size, textual variety, and time of creation, and of efficient and user-friendly corpus query tools. This is also the case for Japanese corpus linguistics, which is one of the primary reasons for the recent rise in projects constructing Japanese corpus resources.

In this paper, we present a method for extracting linguistic information from corpora using the Sketch Engine corpus query tool, which has recently been extended to the Japanese language. The Japanese version is based on a 400-million-word Japanese Web corpus, which is linguistically annotated by the morphological analyzer ChaSen, and a Japanese grammatical relations file. The tool offers efficient and user-friendly ways of extracting concise linguistic data about words: their grammatical and collocational behavior, as well as thesaurus-like information and differences in usage for similar words. We explain, through examples, how the tool could be used in corpus lexicography, linguistic research, and computer-assisted language learning of Japanese. The investigation part of the article concentrates mainly on ways the tool could be applied within the dictionary creation process, and the results illustrate how each of the tool's functions can contribute to that process.