View
3
Download
0
Category
Preview:
Citation preview
Hilofumi YamamotoTokyo Institute of Technology
Bor HodoscekOsaka University
A Study on the Distribution of Cooccurrence Weight Patterns ofClassical Japanese Poetry
Introduction
← more generic more specific →
• P: Hairball effect or spoke effect (Yamamoto 2005)• P: Difficult to observe all word adjacency features.• O: Checking the distribution of cooccurrence
weight (Yamamoto 2006)• G: Drawing clear figures and extracting function words.• S: Cooccurrence weight is like the mean
value of two person’s weights.
light heavyC.F.Gauss(1777–1855)
x
σ0−σ
y
1
Tall HeadTall HeadTall Head
~Long Tail DinosaurLong Tail DinosaurLong Tail Dinosaur
Content TailFunction Tail
MethodsMaterial: the Hachidaishu (ca. 905–1205)
Calculation of Cooccurrence Weight: cw
We need the process of untangling hairball to obtain a clear graph model.(Nocajet al. 2014)
One of the methods used in Hodoscek and Yamamoto (2013) exploited theoccurrence not of individual words but of pairwise or cooccurrence patternssuch as‘ fragranceflower ’relationships and revealed that the distributionof cooccurrence weights in modern Japanese texts approximately fitted aGaussian curve. In this study, we will attempt to expand this analysis toclassical texts by utilizing the characteristics of the Gaussian distribution toautomatically group words into three clusters of cooccurrence patterns. Wewill show the differences between each of the three by visualizing with graphmodels.
2 Methods
We use the Hachidaishu as the material of the present study, which comprisesthe eight anthologies compiled under order of the Emperors (ca. 9051205)and contains about 9,500 poems. We developed the corpus and a method ofcooccurrence weighting similar to the tf-idf method, cw (Yamamoto 2006),which calculates the weight of patterns of any two words occurring in a poemsentence (Sparck Jones 1972, Robertson 2004, Manning and Schutze 1999,Rajaraman and Ullman 2012).
w(t, d) = (1+log tf(t, d)) · idf(t)cw(t1, t2, d) = (1+log ctf(t1, t2, d)) · cidf(t1, t2)cidf(t1, t2) =
√idf(t1) · idf(t2)
idf(t) = logN
df(t)
Where w is a weight, t a token, and N the number of tokens. Thefunction idf is called the “inverse document frequency.”(Sparck Jones 1972,Robertson 2004, Manning and Schutze 1999) The function cw is called the“co-occurrence weight,” which allows us to examine the patterns of poeticword constructions through mathematical modeling.
As in Fig. 1, there is a concept (Losee 2001: 1019) of terms located ineach layer being effective query terms. Luhn (1968) cuts the top and bot-tom words of the frequency and uses mid-range vocabulary for the devel-opment of an automatic outline generation system.(Fig. 1) Nagao (1983: 28)also mentioned mid-range vocabulary to be effective in generating automaticabstracts. Nagao (1983)’s viewpoint is slightly different with Luhn (1968) in
2
Bell curve~Distribution of cw becomes Bell curve.ring!ring!
• Over σ =⇒ Content Tail.• Under −σ ⇒ Function Tail.
Result
−2 −1 0 1 2 30
200
400
600
800
1000
z−value
frequency Sakura
(cherry)
−2 −1 0 1 2 3 40
200
400
600
800
1000
1200
1400
1600
z−value
frequency Ume
(plum)
Figure 1: Bell curves
Table 1: Upper cutoff patterns of ame (sakura): cw = co-occurrence weight; z = z-value(normalized value of frequency). word annotations: ari(be), ba(cond.), ha(topic.),hana(flower), hito(human), keri(past.), ki(past.), koso(emphatic.), miru(see), mo(also), nasi(no exist), nu(neg.), o(obj.), omou(think), ramu(aux.will), su(do), te(p.),to(and), ware(we), zo(emphatic.), zu(neg.)
cw z pattern cw z pattern cw z pattern
1 0.62 -0.91 mo–keri 11 0.59 -0.96 nasi–ha 21 0.52 -1.05 nu–o2 0.62 -0.92 hana–o 12 0.57 -0.98 o–ramu 22 0.52 -1.05 o–zo3 0.62 -0.92 o–koso 13 0.57 -0.98 mo–ramu 23 0.52 -1.05 miru–o4 0.60 -0.94 zu–keri 14 0.57 -0.98 ha–ki 24 0.48 -1.09 ba–mo5 0.60 -0.94 su–ha 15 0.56 -1.00 zu–mo 25 0.48 -1.09 o–keri6 0.60 -0.94 to–ba 16 0.56 -1.00 o–te 26 0.43 -1.16 zu–ha7 0.59 -0.96 ari–ha 17 0.55 -1.01 hito–mo 27 0.43 -1.16 to–o8 0.59 -0.96 ari–mo 18 0.54 -1.02 zu–te 28 0.43 -1.16 te–ha9 0.59 -0.96 ware–mo 19 0.52 -1.05 zo–ha 29 0.34 -1.27 o–ha
10 0.59 -0.96 nasi–o 20 0.52 -1.05 omou–o 30 0.34 -1.27 o–mo
as well as in classical texts (Fig. 2). Therefore we will attempt to divide thisshape into three layers by inflection points.
The co-occurrence patterns of sakura (cherry) under -0.9 (near -1) cwvalue are adjacent patterns comprising function words, and over 1 cw valueare those of the patterns with content words as we expected (Table 1 and2). As for the upper cutoff, we used an under -0.9 (near -1) σ value ofcw, which could extract patterns of functional tokens: almost all patternsincluded functional words, while as lower cutoff, we used over 1 σ values,which could extract patterns of content tokens: almost all patterns includedcontent words. Both under -1 and over 1 σ are regarded as inflection pointswhich have mathematically interesting property.
4 Discussion
Inflection points are defined as the points on the curve where the curvaturechanges its sign while a tangent exists.(Bronshtein et al. 2004: 231) We con-sider the threshold values that separate upper cutoff, mid-range, and lowercutoff not as coincidental but as evidential points. It is, however, necessary toconduct further experiments and continue to discuss the mathematical traitsbehind the distributions of cooccurrence weights. In terms of removing thelow-range (upper cutoff) and extracting the high-range (lower cutoff) frompoetic texts, we found that we do not need to use any filters to eliminateterms, since cw values returned semantically cooccurring patterns. Apartfrom low-range and high-range, the characteristics of the mid-range lexicallayer are still unknown.
4
Table 2: Lower cutoff patterns of ame (sakura) in Kokinshu: 30 out of 164 patternsextracted; cw = co-occurrence weight; z = z-value (normalized value of fre-quency) word annotations: ba(cond.), bakari(only), besi(should be), chiru(fall),fukakusa(deepgreen), hana(flower), isa(already), kakusu(hide), katu(win), koku(pull),komoru(go deep inside), magiru(mix), makasu(entrust), maku(wind up), manimani(asit is), masi(as), mazu(mix), me(eye), minami(south), miyako(city), mono(thing),nagara(even if), sakura(cherry), si(emphasic.), sumi(black ink), tatu(start,stand),tazumu(being around), tu(past.), uturou(change), watasu(give), yamakaze(mountainwind), yamu(stop), yanagi(willow), yononaka(world)
cw z pattern cw z pattern
1 3.86 3.18 yamu–manimani 106 2.38 1.31 si–fukakusa2 3.75 3.04 minami–magiru 107 2.38 1.31 sakura–hana3 3.67 2.93 minami–maku 108 2.38 1.31 sakura–isa4 3.61 2.86 maku–magiru 109 2.38 1.31 sakura–ba5 3.42 2.62 yanagi–koku 110 2.38 1.30 sakura–me6 3.38 2.57 yamu–makasu —7 3.38 2.56 mazu–koku 155 2.17 1.04 chiru–katu8 3.27 2.43 yanagi–mazu 156 2.17 1.04 bakari–sumi9 3.26 2.42 sakura–yamu 157 2.16 1.03 maku–besi
10 3.25 2.40 minami–yamakaze 158 2.16 1.03 tatu–maku– 159 2.16 1.03 tatu–tazumu
101 2.40 1.33 uturou–komoru 160 2.16 1.03 tazumu–tu102 2.40 1.33 sakura–watasu 161 2.16 1.03 miyako–sakura103 2.40 1.33 katu–nagara 162 2.16 1.02 kakusu–si104 2.39 1.32 sakura–masi 163 2.14 1.00 yononaka–sakura105 2.39 1.31 sakura–makasu 164 2.14 1.00 mono–sakura
5 Conclusion
Using the distribution characteristics of cooccurrence weights, we were ableto classify cooccurrence patterns into three layers of cooccurrence patterns:high-, mid-, and low-range patterns.
We found that 1) the distribution of classical texts fits a Gaussian curveas well as in modern texts; 2) the cw value can separate patterns into threelayers (low-, mid-, and high-range) using inflection points (-1σ and 1σ); 3)of the three layers, the high-range could be extracted without a list of stopwords; 4) the mid-range lexical layer might include mathematical traits notyet revealed in the present study.
References
Bronshtein, I.N., K. A. Semendyayev, G. Musiol, and H. Muehlig (2004)Handbook of Mathematics : Springer-Verlag, 4th edition.
Hodoscek, Bor and Hilofumi Yamamoto (2013) “Analysis and Application ofMidrange Terms of Modern Japanese”, in Computer and Humanities 2013Symposium Proceedings, No. 4, pp. 21–26.
5
omou (think.1) B1
VERB a3 LK a1 a2 DN gamma b1 b2 END
omou(think)
suru(do.2)
1
haseru(run)
1
(ra)reru(PASS/POT)5
te/de(LK)
24
ni(LK)
1
ta(PERF)
5
no/koto(DN)38
node(because.2)
1
hodo(as much as..)
2
yo(FP)
7
wa(FP)
13
.
37
u(CJR)
1
torawareru(be captured)
1
1
1
2
11
iru(be-AS)23 kuru
(toward-AS)1 koi(love.2)
1
fuan(anxiety)
1
1
15
ka(Q)
1
2
5
5
1
5
da(ASSER)
34
4 sa(FP)
125
kara(because.1) 1
taisetsu(precious)
1
1 1
6
1
14
20
Figure 2: Construction of the predicate of omou (think)
with Function Tail cherry blossom (28/131,4.28): CT ctf.>0.00;non-dist=off; idf=off; pruned under U:1:
here way home1village.11
lodging1
scatter.1
1
forget1
confuse
1
do.1
1
(adv.neg)
1
(must)
1
while
2
�
2
�
1
11
1
1
1
1
1
1
2 21
1
1
1
1
1
1
1
22 1
1
1
1
1
1
1
2 21
1
21
104
2
8
15
8
perplexed
1
(adv.will)4
forcibly
1
�9
�6
even
3
see.310�11
��
2
get used
1
know
1say
2
(p.wish)
6
�7
bloom
1
ç�¡ã�˙5
(adv.would)
1
ã�˛
4break off
1
�
1
�
2
ã�`ã��
1
increase
1
love.1
1
will not
1
kurabu.pl
1
�2
cloud.vi
1
such as.1
1
��2
wait2
�8�3
borrow1
regrettable1
here we go1
like.2
1
blow
2
�佯�
1
regret
1
bear1
think
1
fast/early1
ã�˙ 1
(ku)
3
lovely
1
sorrowful.1
11
1
1
1
22 1
1
1
1
2
3
1
1
1
1
1
1
1
1 1
1
6
3 2
2
1
1
2
(aux.past)
1(aux.perf)1
2
1
8
4
118
41
2 13
1
1
2
113
3
6 1
1
33
(suffix.forbid)
11
painful.1
1high.1
2
(emph)
1
wish
1
1
(interj.)
2
Yoshino.PN1seemed
1
3
2
1
151
quiet
2
although
1
21
1
2
22 1
2 1
1
15
71
7
7
4
��
1
seem
1
1
5
2
1
1
11
5
14
2
ã�ˆã�˝
2
2
11
6
2
1
11
1
7
7
2
2
3
2
2
2
1 2
2
1
1
2
233
1
1
42
2
1
2
1
41
1
either way
1 till
1
person
1
road1
stop.vi.21
go
1
1
1
1
1
2
2
1
1
11
1
1
2
1
1
2 111
1
1
11 lodge.n
1
be.3
1
take1
1
1
1
1
4
2
727 8
2
1
1
1
1
6
1
3
34
73
mountain village
1
1
3
1
2
mind2
wind.n1
mountain2
desolate
1
come1
2
be lonely
1
prosper
1
1
1
1
1
stone
1
waterfall
1
hand
11
run
1
1
stand.vi
11
cross
1
go over
1
send.v2 word
1
hear
1 1
1
present
1
tell
1
2
add 1
get tired.1
12
1
name
1
1vain
1
rare.11
1
1
1
fall
1
1spring rain
1
tear
1
1
1
1
1
1
2
1
1
1
1
1
2
1
2
2
1
1
exchange
1
11
1
1
1
1
1
1
1
2
1112
1
1
2
1
2
2
11
1
see.1
1
1
1
1
1
12
65
4
1
3
2
4
1
1
1
3
1
2
1
1
1
1
1
2
4 8
1
45
2
3
2
1
2
1
1
1
2
1
12
1
1
13
2
3
1
12
1
what1
1
2
thing.2
1
between
1
1spring haze1
hide.vt1
1
1
1
1
1
2
1
11
1
11
1
21
1
24
11
2
11
1
12
1
2
1
myself1
2
1
number
1
only
1
1
1 1
1
1
11
1
1
1
2
1
2 1
1
11
2
1
12
3
11
11
2
1
2
1 1
1
1
1
1
11 1
1
11
111 1
1
snow
1
11
1
such as.2
1
11
1
1
1
11
1
12
1
1
1
2
1
23
1
2
1
1
1
1
trail2
1
change.1
1
transfer.1
11
1
2
1
1
1
1
71
1
7
56
62
2
1
10
1
3
5
23
11
3
1
trust.2
1
1
1
1
2
1
1 11
3
1
1
1
4
2
1
2
51
1
11
1
1 11
11
1
1
1
2
1
1
2
12
1 1
1
thing.1
11
1
spring
1
this year1
��
1
begin1
11
11
1
1
1
21
11
1
1
1
1
1
111
1
1
21
1
year
1
1
11
1
1
1
1
1
111
11
1
11
1
2
1
111
1
1
1
1
1
1
1
white cloud
1
foot
1
pull
1
1
1
1
1
11
1
1
11
1
1
11
1
1
1
1
1
1
11 1
2
1
1
11
11
1
2
2
1
22
1
1
2
1
4
1
2
1
1
1
1
1
3
21
1
2
1
out
1
12
1
111
after
11
11
11112
11
1
1
1
11
1
1
11
2
111
11
1
1
2 225
1
1
11
2
12
1
1
1 1
1
1
1
1
11
1
1
5 25
9
3
2
1
1
34
14
1
1
1
1
1
1
1
1
3
1
1
1
1
1
211
1
31
2
1
1
1
1
mi
221
1
1 1
121
1
1
1
day
111
2
1
cloud.1
1
3
1
1
1
1
1
1
1
1
1
3
1
11
1
1
11
11
1
1
1
1
1
1
1
1
1
13122
1 11
211
1 12
1
11
1
1 12
1
1
11
1
1
1
112
1 11
1
1
11
1
1
1
11
1
1
1
1
413
3
1
1
1 132
1
2
2
1
2
1
1
1
1
21
1 1
1
1
1
11
11
1
1
1
3
2
233
3
11
2
1
2
12
1
1
1
1
1
1 1
1
wave
1
sky
1
11
1
11
1
2
1
1
1
1
11
11
1
1
1
1
1
1
1
1 415
9
1
2
2
2
3
43
121
1
1
1
1
2
1
1
1
1
1
32
1 11
1
1
11
1
1
1
2
1
1
1 1
1
1
1
1
1
2
1
1
1
11
1
1
1
1
1
1
1
1
6
1
32
33
22
1
5
3
12
2
1
1
1
1
1
11
2
1
1
11
1
1
1
2
1
1
24
2
4
421
2
1
1
1
11
1
1
1
1
1
11
1
111
1
1
1
1
1
11
13
1 1
1
1
11
1
1
1
11
11
1
111
1
1
1
1
11
1
1
1
1
11
1
1
21
1
11
1
1
1
1 11
1
22
1
12
1 1
1
1
1
1
1
1
1
for
11
1
1 11
1
1
1
1
1
11 1
1
1
11
11
1
1
1
1
11 1
1
1
1 1
111
1
1
1
1 1
1
1
131 1
1
2
11 11
1
1
11
1
1
1
1
1
1
11
11
1
1
1
1
1
1 1
11
remnant
112
1
1
1
1
1
water1
1
1
11
1 121
1
11
1
1
1
11
1121
1
1
1
1
11 11
2 1
1
11
1
1
11
41
1
11
1
1
2
1
11
1
1
1
1
112
1
11
1
11
1
1
1
1
1
1
1
1
5
4
1
22
3
11
1
1
11
1
1
1
1
1
111
1
1
1
1
11
1
1
1
2
mistake
1
21
11
1
112
1
11
11
1
112
1
1
11
111 2
1
11
1
1 2
1
11
1
111
1
12
1
11
1
111
112
1
1
1 1
1
gorge12
1
1
1
1
1
1
1
1
1
12
1
1
1
1
1
1
1
12
1
1
1
1
1
121
1
1
1
every 1
3
1
12
11
1
1
1 1
1
1
3
1
21
1
1
1
1
31
2
1
1
1
11
old age
1
1
1
1
11
1
11
11
1
1
1
1
11
11
1
11
11
1
1
1
1
131
21
1
211
11
1
1
12
1
1
11
1
1
1
1
1 1
11
1
11
1
1
2
2
2
31
1
1
1
1
1
2
1
2
1
2
1
1
11
1
111
1
2
1
2
2
1
1
1
1
11
11
1
1
21 1
1
1
1
1
2
1
1
1
1
11
1
1
today2
1
1
21
1
11
1
1
1
a glance
1
you.3
1 12
1
1
2
1
1
1
1
1
1
1
11
21
1 1
1
1
1
33
1
1
11
11
11
1
1
1
1
1
34
1
1
2
2
2
1
1
mountain area
11
1
1
1
1
1
1
1
1
11
1
112
3
1
111
1
11
1
1
2
111
1
1 1
1
11
1
11
11
1
1
1 1
1
1
1
1
1
111
1
11
11
1
11
11
1
332
3
2
12
13
3
1
2
1
2
1
1
1
1
2
1
32
1
1
11
1
1
1
1
1
1
1
1
remain
1
be over
111
1
1
1
11
1
1
1
11
1
1
1
1
1
1111
11
1
1
1
1
111
1
11
1
1
1
11 1
11
1
1
1
1
1
1
1
2
1
1
11
1
1
1
1
1
1
1
1
2 12
2
2
1
1
1 121
1
1
1
2 12
31
2
11
1
1
1
1
2
1
1
1
112
1
1
1
12
1
11
11
1
2
111
1
1
1
1 11
1
11
1
1 11
1
1
1 1
1
1
1
1
111
1
1 11
1
21
1
1
2
1
1
1
1
21
1
1
2
11
1
1
1
2
1
person from old village32
1
11
1
1
1
1
1
1
1
world.2
1
11
1
1
1
1
11
11
111
1
1
1
1
11
11
1
1
1
1
11
111
1
1
111 1
colour1
1
1
1
1
1
1
1
1
1
1
11
1
1
1
1
1
1
1
1
11
Figure 3: Hairball effect
cherry blossom (28/131,4.28): CT ctf.>0.00;
non-dist=off; idf=off; pruned under U:1:
here
way home
1
lodging1
confuse
1
1
1
village.1
1
scatter.1
1
forget
1
(must)
1
1cloud.vi
1
such as.1
1
either way
road
1
stop.vi.2
1
perplexed1
forcibly1
1
1
1
1
2
exchange
11
1
1
1
what
hide.vt
1
out after1
mountain village
1 1
mi
cloud.1
1
mountain
3
desolate1
prosper
1
painful.1
1
high.13
seem
1
1
break off
1
1
3
trail
2
1
1
1
(suffix.forbid)
1
be lonely
1
(emph)
1
1
11
1
1
1
1
1
for
stone
1waterfall 1
hand
1
run 1
wish
1
1
1
1
1
1
1
1
1
1
1
1
present
1
tell
1
1 1
come
1
1
1
remnant
wave1
1
1
2
1kurabu.pl
1
1(p.wish)
6
see.310even3
(ku)3
lovely
1
fast/early
1
between
1
2
spring haze
1
send.v2 word
1
1
1
2
transfer.11
2
get tired.1
1Yoshino.PN
1
gorge
white cloud
1
foot 1
pull
1
bloom
1
1
seemed1
1
1
11
1
1
1
1
such as.2
1
quiet
1
1
1
1
1
every1
11
1
1
(adv.will)
2
old age
1
11
1
say
11
1
1
1
1 1
1
11
spring
add
1
1
(aux.past)
1
(aux.perf)
1
year
1
rare.1
1
day
1
1
this year
begin 1
get used
1
11
1
1
1
1
today
a glance
1
2
1
you.3
1
2
wait
1
mountain area
mistake
1
1
1
1
1
number
1
till
1
borrow
1
like.2
1
1
regrettable
1
here we go
1
1
1
1
1
only
1
1
myself
1
1
1
1
1
1 1
1
1
person from old village3
1
11
lodge.n
1
world.2
1
sorrowful.1
1
name
1
vain
1
1
wind.n
3
1
1
1
spring rain
regret1
snow1
6
remain
1
be over
1
take
1 1
1
11
1
1
1
1
1
1
1
1
change.11
cross
1
go over1
1
increase
1
bear
1
think
1
1
love.11
blow
1
trust.2
1
1
will not
1
Figure 4: Only with Content TailConclusion1) the distribution of classical texts fits a Gaussian (Bell) curve as well as in modern texts (Hodoscek and Yamamoto 2013);2) the cw value can separate patterns into three layers (low-, mid-, and high-range) using inflection points (-1σ and 1σ);3) of the three layers, the high-range could be extracted without a list of stop words;4) the mid-range lexical layer might include mathematical traits not yet revealed in the present study.
Reference• Yamamoto, H. (2005), Visualisation of the construction of poetic vocabulary using the database of the Kokinshu., Jinbun kagaku todetabesu (Humanities and Database) the 11th symposium, 81–8, The council of humanities and database.
• Yamamoto, H. (2006), Extraction and Visualisation of the Connotation of Classical Japanese Poetic Vocabulary, Symposium forComputer and Humanities, vol. 2006, 21–28, The information processing society of Japan.
• Hodoscek, B. and H. Yamamoto (2013) “Analysis and Application of Midrange Terms of Modern Japanese”, in Computer and Hu-manities 2013 Symposium Proceedings, No. 4, pp. 21–26.
1
Recommended