CHAPTER THREE
METHOD
In order to investigate the research questions listed in Chapter Two, both quantitative and qualitative analyses were performed. To investigate the equivalence of test scores across modes quantitatively, the magnitude of raw scores was compared by means of t-tests; a factor analysis approach was used to compare the constructs measured by the monologic tasks delivered in the two modes.
For comparing examinees' speech samples, the transcripts were first coded on a range of qualitative analytic measures. The measures were then calculated and compared quantitatively across test modes. Questionnaires were also utilized to gather information on examinees' attitudes toward testing speaking on computer. The responses to Likert-scale items were compared across modes quantitatively. In addition, qualitative data from responses to open-ended questions were analyzed through content analysis to shed light on the Likert-type responses.
3.1 Participants
A total of 96 Japanese learners of English as a foreign language (EFL) in Japan participated in the study. They were 78 undergraduate students (81%) from 3 universities and 18 high school students (19%) from 2 high schools. There were 23 male students (24%) and 73 female students (76%). Students from University A, a foreign language university, were specializing in foreign languages other than English. Students from University B were majoring in English language and literature, and those from
University C, a women's university, in nursing and domestic science. High school D is a boys' high school, while High school E is a co-educational high school. Table 3.1 shows the characteristics of the participants. These students can be considered to represent well a wide range of university and high school students in Japan. All the students participated on a voluntary basis and received an honorarium after their participation in this project.
Table 3.1 Characteristics of the original 96 participants

School          Academic year (1 / 2 / 3 / 4)    Male    Female    Total
University A
University B
University C
High school D
High school E
Total                                             23      73        96

[The individual cell values are illegible in the source scan; only the totals recoverable from the text are shown.]

Note. NA = Not applicable (high schools have no fourth-year students).
However, data from a total of 17 participants were excluded from the analyses comparing performance across the two delivery modes due to (a) poor recording quality (n = 7), (b) improper testing procedures (n = 7), and (c) missing data (i.e., no utterance on either task) (n = 3). As a result, data from 79 participants were used for investigating Research Questions 1-4. They were 61 undergraduate students (77%) and 18 high school students (23%); 19 were male (24%) and 60 female (76%). Table 3.2 lists the detailed description of the 79 students. Despite the invalid oral response data, all the participants filled out the questionnaires administered to them after the tests. Thus, for
Research Question 5, the responses to the questionnaires from the original 96 participants were analyzed.
Table 3.2 Characteristics of the 79 participants

School          Academic year (1 / 2 / 3 / 4)    Male    Female    Total
University A
University B
University C
High school D
High school E
Total                                             19      60        79

[The individual cell values are illegible in the source scan; only the totals recoverable from the text are shown.]

Note. NA = Not applicable (high schools have no fourth-year students).
3.2 Instruments
The data for this study were collected from multiple sources, including (1) monologic tasks delivered by computer, (2) monologic tasks administered in the face-to-face mode, (3) a questionnaire on examinees' attitudes toward each mode, and (4) a questionnaire comparing examinee attitudes toward the two modes.
3.2.1 Computer-delivered monologic tasks
Tasks that are utilized to elicit ratable speech samples in testing speaking can be divided into two types, namely, the monologic task and the interactive task. Monologic tasks, the focus of the present study, usually refer to tasks that elicit long individual discourses without the examinee interacting with an interlocutor. By this definition, they include such tasks as reading aloud, sentence repetition, information transfer tasks, and oral presentation tasks (O'Sullivan, 2008). Appendix A presents a summary of test
tasks used in various large-scale speaking tests. As can be seen in this appendix, reading aloud and sentence repetition are not common tasks in face-to-face tests. Thus, it was decided to compare only the information transfer task and the oral presentation task in this study. These are defined as tasks in which examinees take some time to make several points and to develop an adequate reply to task prompts.
Specifically, a narrative task and an opinion task from the speaking section of the GTEC for STUDENTS⁵ were used, representing the information transfer task and the oral presentation task, respectively (see Appendix B for task prompts). The narrative task contained 4 pictures that told a simple story, which participants had one minute to relate. The opinion task provided a graph and required participants to give opinions on a topic based on the information in the graph within two minutes. In the narrative task, a video prompt was presented in which an American woman played the role of asking questions and giving simple preset feedback. Examinees were given some time to prepare a response for each task. They started to record their responses by clicking the "start" icon on the screen when they were ready; when the preparation time was over, their responses were recorded automatically.
3.2.2 Face-to-face monologic tasks
The face-to-face tasks were constructed with the same content and format as those used in the computer mode and were conducted on a one-to-one basis. The author, a Chinese female, served as the interviewer in all face-to-face tests. Instructions were written on a prompt card for each task (see Appendix C for task prompts).

⁵ The GTEC for STUDENTS, developed by Benesse Corporation, is a four-skill computer-based English test. It targets mainly Japanese high school students and university students.
In the narrative task, the interviewer played the role of asking questions and giving feedback, as the video character did in the computer-delivered task. The interviewer timed the preparation and response time and recorded examinees' oral responses using an IC recorder. The interviewer tried not to give any verbal reaction other than simple backchannels (e.g., mm-hm and uh-huh) accompanied by nodding and eye contact. Appendix D summarizes the features of the tasks used in the computer and face-to-face modes.
3.2.3 Questionnaires
Two questionnaires in Japanese were used to investigate examinee attitudes toward testing speaking in the computer mode and the face-to-face mode. The questionnaire items were adapted from Hill (1998) and O'Loughlin (2001). The author translated the questionnaires, which had originally been developed in English, into Japanese. A Japanese Ph.D. student specializing in applied linguistics at TUFS⁶ checked the translation. A back-translation of the Japanese version into English was performed by another Japanese Ph.D. student. Refer to Appendix E for the English version and Appendix F for the Japanese version of the questionnaires.
3.2.3.1 Feedback sheet on each test
After taking each test, examinees completed Questionnaire 1 about that test. The questionnaires for the two tests, which were parallel in format and content, comprised five statements covering the following five aspects: examinees' nervousness, examinees' perceptions of test difficulty, test fairness, the accuracy of the test as a measure of the examinees' English speaking level, and the affective appeal of the test.

⁶ TUFS is the abbreviation of Tokyo University of Foreign Studies.
For instance: I felt nervous when I was taking the test (nervousness); I feel this test was difficult (test difficulty). Examinees were asked to indicate the degree to which they agreed with each statement using a five-point Likert scale: 5 = Strongly agree, 4 = Agree, 3 = Neutral, 2 = Disagree, 1 = Strongly disagree.
3.2.3.2 Feedback sheet on test comparisons
A feedback sheet on test comparisons (Questionnaire 2) was also provided. This consisted of six questions. Five of these questions paralleled the content of the statements in Questionnaire 1 and required examinees to compare the two modes directly with regard to how nervous they felt, how difficult and how fair they perceived the tests to be, and how accurate they felt the tests were as measurements of their English speaking level. Moreover, they were asked to rate the affective appeal of the tests. For example: Which test did you feel more nervous taking? (nervousness); Which test did you feel was more difficult? (test difficulty). In addition, one question asked which of the two modes examinees preferred. For each question, examinees chose one of three options: "Computer-delivered speaking test", "Both the same", and "Face-to-face test". They were also asked to write comments on each question after their responses.
3.3 Data collection procedure
3.3.1 Design
Data were collected from November to December 2006. All participants were randomly assigned to two groups in a counterbalanced design. Group A (n = 48) took the computer mode first and then the face-to-face mode, while Group B
(n = 48) took the face-to-face mode first and then the computer mode. Due to the invalid data, 41 data sets from Group A and 38 from Group B were used in the subsequent analyses of examinees' performance. Examinees took the two tests with an interval of 7 to 10 days. It was considered that during this period the participants' English ability would not grow much and the practice effect would be minimized.
The computer-delivered speaking test was administered to each participant on campus, either individually in a quiet room (Universities A and B) or in a computer lab with about 20 classmates (University C). All the high school students took the test in a computer lab with 2 or 3 classmates. They were seated far enough apart to ensure that they would not influence each other. For each face-to-face test, participants met individually with the author in a quiet room on their campus. None of the students except one female participant from University A knew the author beforehand. The participants were not informed beforehand that the second test would have the same task content as the first one. After each test, they were asked to fill out a questionnaire about their attitudes toward that specific test (Questionnaire 1). After completing both tests and the questionnaire for each test, examinees completed a final questionnaire comparing the tests delivered in the two modes (Questionnaire 2). Table 3.3 presents the design of the study.
Table 3.3 Design of the study

          Group A (N = 48)             Group B (N = 48)
Week 1    Computer-delivered test      Face-to-face test
          ↓ Questionnaire 1            ↓ Questionnaire 1
Week 2    Face-to-face test            Computer-delivered test
          ↓ Questionnaire 1            ↓ Questionnaire 1
          ↓ Questionnaire 2            ↓ Questionnaire 2
3.3.2 Scoring
Two independent accredited raters⁷ for the GTEC for STUDENTS scored the examinees' responses to the two monologic tasks delivered by computer on each of the following four elements: grammar, vocabulary, fluency, and pronunciation. All the elements were rated on a 0-4 analytic scale (see Appendix G for the scoring rubric). The score for each element of each task was determined by taking the mean of the ratings from the two raters. The final score for each element was calculated by averaging the element scores across the two tasks, and these were added up to obtain a total score for each participant.
Tasks in the face-to-face mode were scored in the same way as the computer-delivered tasks, by the same pool of raters with the same scoring rubric. Reliability and agreement of the ratings assigned by the two raters for both the computer-delivered and face-to-face monologic tasks are reported in Chapter Four.
⁷ For this study, no further information on the raters was available. Examinees may have received ratings from different pairs of raters. As pointed out in Lee (2006), this is quite a common practice for a large-scale performance test, since it is usually impractical to ask the same pair of raters to do all the ratings.
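Expressed as arithmetic, the aggregation above is: element score per task = mean of the two ratings; final element score = mean of that element across the two tasks; total score = sum of the four final element scores. A minimal Python sketch with hypothetical ratings (only two of the four elements shown for brevity; not the study's actual data files):

    import pandas as pd

    # Hypothetical ratings: one row per examinee x task x element x rater,
    # each on the 0-4 analytic scale described above.
    ratings = pd.DataFrame({
        "examinee": [1, 1, 1, 1, 1, 1, 1, 1],
        "task":     ["narrative"] * 4 + ["opinion"] * 4,
        "element":  ["grammar", "grammar", "fluency", "fluency"] * 2,
        "rater":    ["R1", "R2"] * 4,
        "rating":   [3, 2, 3, 3, 2, 2, 3, 2],
    })

    # Step 1: element score per task = mean of the two raters' ratings.
    per_task = ratings.groupby(["examinee", "task", "element"])["rating"].mean()

    # Step 2: final element score = mean of that element across the two tasks.
    per_element = per_task.groupby(["examinee", "element"]).mean()

    # Step 3: total score = sum of the final element scores.
    total = per_element.groupby("examinee").sum()
    print(per_element, total, sep="\n")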
3.3.3 Coding of analytic measures
Speech samples were transcribed by five senior students at TUFS and double-checked by the author⁸. Appendix H provides some examples of the transcripts⁹. The utterances were then coded in order to calculate measures of fluency, accuracy, and complexity. The coded speech samples were entered into a database for use with the CLAN (Computerized Language Analysis) program developed as part of the CHILDES project (MacWhinney, 2000). The CLAN program allows a large number of automatic analyses to be performed on data, including frequency counts, word searches, and calculation of type/token ratios. This section describes how linguistic performance in terms of fluency, accuracy, and complexity was operationalized and how the speech samples were coded in this study. Table 3.4 at the end of this section presents a summary of the measures.

⁸ All errors in pronunciation were transcribed in their correct forms.
⁹ These examples are presented for the high-, middle-, and low-proficiency groups based on scores on the computer-delivered speaking tasks. Refer to Section 4.3.2 for details of the grouping.
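CLAN is operated through its own commands, so purely to illustrate the kind of counting involved, here is a rough Python analogue of a frequency count and a type/token ratio (this is not CLAN's actual interface, and the transcript line is invented):

    from collections import Counter

    transcript = "I I going to speaking test and I think we should get information"
    tokens = transcript.lower().split()

    freq = Counter(tokens)            # frequency count per word form
    ttr = len(freq) / len(tokens)     # type/token ratio

    print(freq.most_common(3))
    print(f"types={len(freq)} tokens={len(tokens)} TTR={ttr:.2f}")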
3.3.3.1 Fluency
Measures of fluency can be categorized into two types: hesitation phenomena relating to dysfluency, and temporal variables concerned with the speed of speaking (Ellis & Barkhuizen, 2005). The hesitation dimension was operationalized as the number of dysfluent words divided by the total amount of speech measured in seconds (e.g., Kormos & Denes, 2004). Dysfluent words include words of functionless repetition, self-correction, and false starts.
Following Foster et al. (2000), functionless repetition means exact repetition of words, syllables, or phrases, or partial repetition of some part of a word or utterance.
Self-correction occurs when the speaker identifies an error either during or immediately following production and stops and reformulates the speech. A false start is defined as an incomplete utterance, which is begun and then either abandoned altogether or reformulated in some way. In addition to dysfluent words overall, separate measures for the above three aspects were also reported. The following are some examples of words of repetition, self-correction, and false starts:

・ I I going to speaking test. (1 repeated word)
・ However however however some some man talked to me. (3 repeated words)
・ I can I couldn't win the speech contest. (2 words of self-correction)
・ This pie chart said talked shows that nearly sixty percent of people know that what's happening in society on TV through TV. (4 words of self-correction)
・ So we can so the needs of the newspaper is not so important. (3 words of false start)
・ But y- anyway I think we should get information. (6 words of false start)
For temporal variables, three measures were employed: (1) speech rate, represented by the number of words divided by the total amount of speech measured in seconds (e.g., Lennon, 1990); (2) the number of unfilled pauses divided by the total amount of speech in seconds (e.g., Kormos & Denes, 2004); and (3) the number of filled pauses divided by the total amount of speech in seconds (e.g., Brown et al., 2005).
To calculate speech rate, the transcribed speech was first pruned by excluding dysfluent words. When the number of words was counted, filled pauses in both Japanese and English, such as "uh", "um", and "eito", were excluded. Contractions, such as "wasn't" and "he's", were counted as two words. Katakana words such as "telebi" and "pasokon" were not counted as words. Then the resulting total number of words
was divided by the total speech time the examinee took to complete the task, excluding pauses of three or more seconds¹⁰. The number of unfilled pauses was calculated by counting the number of pauses of two seconds or more that occurred in the examinee's speech¹¹. Here are some examples of filled pauses and word counts:

・ Uh I speak um spoke very well. (2 words of filled pauses)
・ I don't agree with your idea because um I can get ob- got information by telebi. (13 words¹²)
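Taken together, the fluency indices are ratios of counts to speech time. A minimal Python sketch with hypothetical counts for a single response (speech time excludes pauses of three seconds or more, as in footnote 10):

    # Hypothetical values for one response.
    speech_time = 55.0        # total speech time in seconds (pauses >= 3 s excluded)
    pruned_words = 83         # words after pruning dysfluencies and filled pauses
    unfilled_pauses = 6       # pauses of two seconds or more
    filled_pauses = 9         # "uh", "um", "eito", ...
    repetitions, self_corrections, false_starts = 4, 3, 2
    dysfluent_words = repetitions + self_corrections + false_starts

    speech_rate         = pruned_words / speech_time      # words per second
    unfilled_pause_rate = unfilled_pauses / speech_time
    filled_pause_rate   = filled_pauses / speech_time
    dysfluency_rate     = dysfluent_words / speech_time

    print(f"speech rate = {speech_rate:.2f} words/s")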
3.3.3.2 Accuracy
Accuracy was assessed by two general measures: the percentage of error-free clauses in the total number of clauses (e.g., Foster & Skehan, 1996) and the percentage of error-free AS-units in the total number of AS-units (e.g., Robinson, 2001). General measures of accuracy are suggested to have the advantage of being potentially the most comprehensive, in that all errors are considered (e.g., Bygate, 2001). Another reason for adopting general measures is that previous studies using specific measures (Luoma, 1997; Koike, 1998) failed to find any difference across test modes.
The Analysis of Speech Unit (AS-unit) was chosen as the basic syntactic unit of analysis in this study, as it has been considered a better measure than others (e.g., T-units) for spoken data produced by L2 speakers (Foster, Tonkyn, & Wigglesworth, 2000). An AS-unit is defined as a "single speaker's utterance consisting of an independent clause, or sub-clausal unit, together with any subordinate clause(s) associated with
¹⁰ Following Brown et al. (2005), pauses of three or more seconds were considered to be a substantial length of elapsed time and were thus excluded from total speech time.
¹¹ The author found it difficult to measure pause times under two seconds reliably with a stopwatch. Thus, only pauses over two seconds were counted.
¹² The underlined words were excluded from counting, since they represented self-correction.
either" (Foster et al., 2000, p. 365). Examples of AS-units are as follows:

・ Students have to read newspaper every day. (1 AS-unit)
・ I speech (1 AS-unit)
・ And happy to attend that speech contest (1 AS-unit)
・ And then very loud a clap hands (1 AS-unit)
To calculate the number of clauses, clauses were classified into independent clauses and subordinate clauses. An independent clause was "minimally a clause including a finite verb" (Foster et al., 2000, p. 365). A subordinate clause included adverbial clauses, adjective clauses, and nominal clauses and "consisted minimally of a finite or non-finite verb element plus at least one other clause element (subject, object, complement or adverbial)" (Foster et al., p. 366). In coding the clauses, all the utterances, including those with errors as defined above, were used. The following gives some examples of independent and subordinate clauses:

・ I had a speech contest. (1 independent clause)
・ And happy to attend that speech contest. (0 independent clauses)
・ I don't say that [...]. (1 independent and 1 subordinate clause, underlined in the original)
・ [...] is important. (1 independent and 1 subordinate clause, underlined in the original)
Error-free AS-units are AS-units that do not contain any errors in syntax, morphology, word order, or lexical choice. Error-free clauses refer to clauses free from any errors in syntax, morphology, or lexical choice. Lexical errors were defined as errors in lexical form or collocation. In counting errors, only the pruned utterances were used. Also, errors in pronunciation were not considered since, as mentioned earlier, they were not reflected in the transcription. Native-like use of the language, in terms of
grammar and lexis, was generally taken as the criterion in judging error-free clauses and error-free AS-units.
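Both accuracy indices reduce to simple proportions once the units have been coded. A sketch with hypothetical coded counts:

    # Hypothetical coded counts for one response.
    clauses, error_free_clauses = 18, 11
    as_units, error_free_as_units = 14, 8

    pct_error_free_clauses = error_free_clauses / clauses * 100     # 61.1%
    pct_error_free_as_units = error_free_as_units / as_units * 100  # 57.1%

    print(f"{pct_error_free_clauses:.1f}% error-free clauses, "
          f"{pct_error_free_as_units:.1f}% error-free AS-units")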
3.3.3.3 Complexity
Complexity was measured in terms of syntactic complexity and lexical complexity. Three types of syntactic complexity measures were adopted: (1) the percentage of clauses in the total number of AS-units (e.g., O'Sullivan, 2002), (2) the percentage of subordinate clauses in the total number of AS-units (e.g., Wigglesworth, 1997), and (3) the ratio of the number of words to the total number of AS-units (e.g., Bygate, 2001).
Lexical complexity was established by three measures, namely Guiraud's Index, a lexical density index, and a weighted lexical density index. For all these measures, only pruned utterances were used. Guiraud's Index was calculated by dividing the number of types by the square root of the number of tokens. This measure is thought to be more appropriate than the type-token ratio (TTR), as it takes sample length into account (Vermeer, 2000). Following Daller, van Hout, and Treffers-Daller (2003), words that differed from each other only in inflectional morphology were counted as a single type (e.g., go-went; be-is-was-are), whereas words carrying different derivational morphemes (e.g., danger-dangerous) were counted as different types.
Lexical density was calculated as the ratio of lexical words to the total number of words. Following O'Loughlin (2001) and Koizumi (2005), words were classified into grammatical and lexical words. Given that the lexical density index has been criticized for ignoring the relative significance of words of different frequency, this study also adopted a weighted lexical density that takes the frequency of words into account. To
calculate weighted lexical density, lexical words were further divided into high-frequency and low-frequency words. High-frequency words were defined as the most frequent 2000 words¹³ in the JACET list of 8000 Basic Words (JACET Basic Revision Committee, 2003)¹⁴. High-frequency lexical words were given half the weight of low-frequency words, and the number of weighted lexical words as a percentage of the total number of words was then calculated.
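The three lexical indices can be computed directly from the pruned word lists. A Python sketch; the token list, the grammatical/lexical split, and the small high-frequency set standing in for the first 2000 JACET words are all hypothetical:

    import math

    # Pruned tokens, already lemmatized so inflectional variants share a type.
    tokens = ["student", "have", "to", "read", "newspaper", "every",
              "day", "student", "read", "newspaper"]
    types = set(tokens)

    # Guiraud's Index = types / sqrt(tokens).
    guiraud = len(types) / math.sqrt(len(tokens))

    # Hypothetical classifications (in the study: grammatical vs. lexical
    # words, and JACET-2000 membership for the frequency split).
    grammatical = {"to", "every", "have"}
    lexical = [t for t in tokens if t not in grammatical]
    high_freq = {"student", "read", "day"}   # stand-in for the JACET first 2000

    lexical_density = len(lexical) / len(tokens)

    # Weighted lexical density: high-frequency lexical words get half weight.
    n_high = sum(t in high_freq for t in lexical)
    n_low = len(lexical) - n_high
    weighted_lexical_density = (n_high / 2 + n_low) / len(tokens)

    print(f"Guiraud={guiraud:.2f}, LD={lexical_density:.2f}, "
          f"WLD={weighted_lexical_density:.2f}")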
Two coders coded the data. The author coded all the data; the second coder, a Ph.D. student specializing in language testing at TUFS, coded a randomly selected sample of approximately 10% of the data. Inter-coder reliability was calculated as the percentage of agreement between the coders, which is reported in Chapter Four.
¹³ The 2000-word cutoff was used because high school students in Japan are supposed to learn about 2000 words (Ministry of Education, Science & Culture, 1989).
¹⁴ The JACET list of 8000 Basic Words uses a lemma count. Operationally, a lexical frequency profile analysis was conducted using the computer program "JACET 8000 Analysis Program" (Shimizu, 2004), which resulted in a distinction between the words belonging to the 2000 most frequent words and those with a lower frequency.
Table 3.4 Measures of fluency, accuracy, and complexity used in the study

Measure                                Definition
Fluency
  No. of words                         The number of words / the total amount of speech (in seconds)
  No. of unfilled pauses               The number of unfilled pauses / the total amount of speech (in seconds)
  No. of filled pauses                 The number of filled pauses / the total amount of speech (in seconds)
  No. of dysfluent words               The number of dysfluent words / the total amount of speech (in seconds)
  No. of repetition words              The number of repetition words / the total amount of speech (in seconds)
  No. of self-correction words         The number of self-correction words / the total amount of speech (in seconds)
  No. of false start words             The number of false start words / the total amount of speech (in seconds)
Accuracy
  Percentage of error-free clauses     The number of error-free clauses / the total number of clauses
  Percentage of error-free AS-units    The number of error-free AS-units / the total number of AS-units
Complexity
  Syntactic complexity
    Percentage of clauses              The number of clauses / the total number of AS-units
    Percentage of subordinate clauses  The number of subordinate clauses / the total number of AS-units
    No. of words                       The number of words / the total number of AS-units
  Lexical complexity
    Guiraud's Index                    The number of types / √(the number of tokens)
    Lexical density                    The number of lexical words / the total number of words
    Weighted lexical density           (The number of high-frequency lexical words / 2 + the number of low-frequency lexical words) / the total number of words
3.4 Data analysis
3.4.1 Comparing the magnitude of raw scores
In order to evaluate the agreement of the ratings assigned by the two randomly selected raters, two types of indexes were employed: (1) Pearson product-moment correlation coefficients for inter-rater reliability and (2) perfect agreement and adjacent agreement between the ratings of the two independent raters.
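As an illustration, both indexes might be computed as follows (hypothetical ratings; the Pearson coefficient via scipy):

    import numpy as np
    from scipy.stats import pearsonr

    # Hypothetical ratings from two raters on the 0-4 scale.
    r1 = np.array([3, 2, 4, 3, 1, 2, 3, 4])
    r2 = np.array([3, 3, 4, 2, 1, 2, 4, 4])

    r, _ = pearsonr(r1, r2)                   # inter-rater reliability
    perfect = np.mean(r1 == r2)               # identical ratings
    adjacent = np.mean(np.abs(r1 - r2) <= 1)  # within one scale point

    print(f"r={r:.2f}, perfect={perfect:.0%}, adjacent={adjacent:.0%}")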
Before proceeding to the analysis of raw scores, the order effect on test scores was checked. As described earlier, participants were assigned at random to two testing conditions: (a) computer mode first and face-to-face mode second or (b) face-to-face mode first and computer mode second. With a counterbalanced design like this, one concern is that taking the test in one mode followed by the other could affect the score in the second mode. Consequently, descriptive statistics of the scores for each rating element were first examined for the four mode-by-order conditions separately to detect any possible order effect.
To assess the effect of delivery mode and the mode-by-order interaction statistically, a series of two-way repeated measures analyses of variance (ANOVAs), with one within-subject factor (mode) and one between-subject factor (order), was performed on element scores and total scores. If there was no mode-by-order interaction effect, the results for the main effect of delivery mode could be interpreted directly. If such interactions were found to be statistically significant, it would imply that the mode effect was not consistent across the two groups assigned to the different testing conditions. It would then be invalid to use the combined data from the two groups. In that case, following Choi, Kim, and Boo (2003), independent t-tests using scores from the first test administered would be conducted to compare the magnitude of test scores.
For reference, the results of independent t-tests based on scores from the second test would also be presented. SPSS 11.0 for Windows was employed for all the statistical analyses in this study.
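The original analyses were run in SPSS; a rough Python equivalent of the mixed mode-by-order ANOVA and the first-administration t-test fallback might look as follows, assuming the pingouin package and hypothetical file and column names:

    import pandas as pd
    import pingouin as pg
    from scipy.stats import ttest_ind

    # Long format: one row per examinee x mode, with the order group recorded.
    df = pd.read_csv("scores_long.csv")   # columns: id, mode, order, total_score

    # Two-way repeated measures ANOVA: mode (within) x order (between).
    aov = pg.mixed_anova(data=df, dv="total_score", within="mode",
                         subject="id", between="order")
    print(aov)

    # Fallback if the mode-by-order interaction is significant:
    # independent t-test on first-administration scores only.
    first = df[((df["mode"] == "computer") & (df["order"] == "computer_first")) |
               ((df["mode"] == "face_to_face") & (df["order"] == "ftf_first"))]
    cbt = first.loc[first["mode"] == "computer", "total_score"]
    ftf = first.loc[first["mode"] == "face_to_face", "total_score"]
    print(ttest_ind(cbt, ftf))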
3.4.2 Comparing psychometric constructs
The data analyses in this section consisted of a data transformation as a measure against the order effect, a preliminary analysis of descriptive statistics, and a series of exploratory factor analyses.
If no order effect was found through the analysis in the previous section, the scores for each of the four elements on each of the two tasks, yielding eight variables per mode, were to be used in the subsequent analysis. However, even if an order effect was observed, it was decided not to restrict the analysis to data from the first test only, as this would reduce the amount of data and violate the sample-size assumption of exploratory factor analysis¹⁵. Instead, after consulting with a statistical expert, it was decided to transform the original data by subtracting them from the mean score of each variable within each order group.
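A sketch of such a group-wise transformation, written here in the usual centering form, with each variable centered within each order group (hypothetical file and column names; either sign convention leaves the correlation structure that the factor analysis depends on unchanged):

    import pandas as pd

    df = pd.read_csv("efa_scores.csv")   # "order" column + 8 score columns per mode
    score_cols = [c for c in df.columns if c != "order"]

    # Center each variable within each order group so that mean differences
    # attributable to administration order are removed before factor analysis.
    centered = df.copy()
    centered[score_cols] = df.groupby("order")[score_cols].transform(
        lambda x: x - x.mean())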
Descriptive statistics were calculated for the 16 variables in total to examine the assumptions of exploratory factor analysis. The univariate normality of the data was assessed through the skewness and kurtosis of the variables, which should be within a range of -2 to 2. The Pearson product-moment correlation matrix was computed to assess the multicollinearity of the data. Following Field (2005), a correlation above .90 between two variables was considered a threat to the singularity of the data; in that case, one of the variables would be eliminated from further analysis. Univariate outliers, defined as cases with values more than three standard deviations from the mean of a given variable, were also examined.

¹⁵ Many criteria have been proposed to determine the sample size for exploratory factor analysis. Generally, it is considered proper to have at least five times as many subjects as variables (Field, 2005).
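These screening steps are mechanical; a pandas sketch (hypothetical file name):

    import pandas as pd

    df = pd.read_csv("efa_scores_centered.csv")   # the 16 score variables

    # Univariate normality: skewness and kurtosis should fall within [-2, 2].
    print(df.skew(), df.kurtosis(), sep="\n")

    # Multicollinearity/singularity: flag correlations above .90.
    corr = df.corr()
    high = corr.where(corr.abs() > .90).stack()
    print(high[high.index.get_level_values(0) != high.index.get_level_values(1)])

    # Univariate outliers: values more than 3 SDs from the variable mean.
    z = (df - df.mean()) / df.std()
    print((z.abs() > 3).sum())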
In order to investigate whether there were components shared in common as measured by the monologic tasks delivered in the computer and the face-to-face modes, principal factor analysis¹⁶ with varimax rotation as the factor analytic method was performed on the data for the two modes respectively. The number of factors, which was determined based on the sizes of the eigenvalues and the scree test (Tabachnick & Fidell, 2001), was compared across modes. The factor loading for each variable was also checked and compared across modes. Another principal factor analysis was run on the combined data from the two modes to confirm the comparability of the factor structures from the above analyses.
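A sketch of this step, assuming the factor_analyzer package, which provides principal-axis extraction and varimax rotation (hypothetical file and column names; the actual study used SPSS):

    import pandas as pd
    from factor_analyzer import FactorAnalyzer

    df = pd.read_csv("efa_scores_centered.csv")
    mode_a = df.filter(like="cbt_")        # the 8 computer-mode variables

    # Eigenvalues for the scree test / eigenvalue criterion.
    fa = FactorAnalyzer(rotation=None, method="principal")
    fa.fit(mode_a)
    ev, _ = fa.get_eigenvalues()
    n_factors = sum(ev > 1)                # one common rule of thumb

    # Principal factor analysis with varimax rotation.
    fa = FactorAnalyzer(n_factors=n_factors, rotation="varimax",
                        method="principal")
    fa.fit(mode_a)
    print(pd.DataFrame(fa.loadings_, index=mode_a.columns))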
3.4.3 Comparing speech samples
To address Research Question 3, the coded responses of examinees in the computer mode and the face-to-face mode were compared in terms of measures of fluency, accuracy, and complexity to examine the effect of the computer delivery mode on speech samples. The purpose of Research Question 4 was to examine a possible interaction effect of examinees' proficiency on their speech samples across delivery modes. To this end, participants were first categorized into three groups based on their total scores on the computer-delivered tasks. A one-way ANOVA was then performed to ensure that the scores were statistically different across the three groups.

¹⁶ Principal factor analysis (PFA) is also called common factor analysis or principal axis factoring. This extraction method was selected over principal component analysis since it reflects only the common variance of the variables, excluding the unique variance.
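A sketch of the grouping check; the tercile split shown here is only illustrative, since the actual grouping is described in Section 4.3.2:

    import pandas as pd
    from scipy.stats import f_oneway

    df = pd.read_csv("cbt_totals.csv")     # columns: id, total_score (hypothetical)

    # Split into three proficiency groups by total computer-mode score.
    df["level"] = pd.qcut(df["total_score"], 3, labels=["low", "mid", "high"])

    groups = [g["total_score"] for _, g in df.groupby("level")]
    print(f_oneway(*groups))   # confirm scores differ across the three groups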
A series of two-way repeated measures ANOVAs, with delivery mode (computer vs. face-to-face) as a within-subject factor and examinees' proficiency level (low vs. mid vs. high) as a between-subject factor, was conducted. The ANOVA results were expected to determine whether there were significant differences in measures of fluency, accuracy, and complexity across test modes and whether there was any interaction effect between delivery mode and proficiency level.
3.4.4 Comparing examinee attitudes
Means and standard deviations for the Likert-scale statements in Questionnaire 1 were calculated for both tests. A mean score greater than 3 was considered to show a positive trend and one less than 3 a negative trend. A repeated measures t-test was conducted to compare examinees' responses to the two tests statistically. For Questionnaire 2, chi-square tests were used to compare the observed frequencies of examinees choosing each option to the statistically expected values. In addition, the examinees' responses to the open-ended questions were categorized through content analysis and tallied.