CHAPTER THREE
METHOD
In order to investigate the research questions listed in Chapter Two, both quantitative and qualitative analyses were performed. To investigate the equivalence of test scores across modes quantitatively, the magnitude of raw scores was compared by means of t-tests; a factor analysis approach was used to compare the constructs measured by the monologic tasks delivered in the two modes.
For comparing examinees' speech samples, the transcripts were first coded on a range of qualitative analytic measures. The measures were then calculated and compared quantitatively across test modes. Questionnaires were also utilized to gather information on examinees' attitudes toward testing speaking on computer. The responses to Likert-scale items were compared across modes quantitatively. In addition, qualitative data from responses to open-ended questions were analyzed through content analysis to shed light on the Likert-type responses.
3.1 Participants
A total of 96 Japanese learners of English as a foreign language (EFL) in Japan participated in the study. They were 78 undergraduate students (81%) from 3 universities and 18 high school students (19%) from 2 high schools. There were 23 male students (24%) and 73 female students (76%). Students from University A, a foreign language university, were specializing in foreign languages other than English. Students from University B were majoring in English language and literature, and those from
University C, a women's university, in nursing and domestic science. High school D is a boys' high school, while High school E is a co-educational high school. Table 3.1 shows the characteristics of the participants. These students can be considered to represent well a wide range of university and high school students in Japan. All the students participated on a voluntary basis and received an honorarium after their participation in this project.
Table 3.1 Characteristics of the original 96 participants

School          Academic year (1 / 2 / 3 / 4)    Male    Female    Total
University A
University B
University C
High school D
High school E
Total                                             23      73        96

[The individual cell values are illegible in the source scan; only the totals recoverable from the text are shown.]

Note. NA = Not applicable (high schools have no fourth-year students).
However, data from a total of 17 participants were excluded from the analyses comparing performance across the two delivery modes due to (a) poor recording quality (n = 7), (b) improper testing procedures (n = 7), and (c) missing data (i.e., no utterance on either task) (n = 3). As a result, data from 79 participants were used for investigating Research Questions 1-4. They were 61 undergraduate students (77%) and 18 high school students (23%); 19 were male (24%) and 60 female (76%). Table 3.2 lists the detailed description of the 79 students. Despite the invalid oral response data, all the participants filled out the questionnaires administered to them after the tests. Thus, for
Research Question 5, the responses to the questionnaires from the original 96 participants were analyzed.
Table 3.2 Characteristics of the 79 participants

School          Academic year (1 / 2 / 3 / 4)    Male    Female    Total
University A
University B
University C
High school D
High school E
Total                                             19      60        79

[The individual cell values are illegible in the source scan; only the totals recoverable from the text are shown.]

Note. NA = Not applicable (high schools have no fourth-year students).
3.2 Instruments
The data for this study were collected from multiple sources, including (1) monologic tasks delivered by computer, (2) monologic tasks administered in the face-to-face mode, (3) a questionnaire on examinees' attitudes toward each mode, and (4) a questionnaire comparing examinee attitudes toward the two modes.
3.2.1 Computer-delivered monologic tasks
Tasks that are utilized to elicit ratable speech samples in testing speaking can be divided into two types, namely, the monologic task and the interactive task. Monologic tasks, the focus of the present study, usually refer to tasks that elicit long individual discourses without the examinee interacting with an interlocutor. By this definition, they include such tasks as reading aloud, sentence repetition, information transfer tasks, and oral presentation tasks (O'Sullivan, 2008). Appendix A presents a summary of test
tasks used in various large-scale speaking tests. As can be seen in this appendix, reading aloud and sentence repetition are not common tasks in face-to-face tests. Thus, it was decided to compare only the information transfer task and the oral presentation task in this study. These are defined as tasks in which examinees take some time to make several points and to develop an adequate reply to task prompts.
Specifically, a narrative task and an opinion task from the speaking section of the GTEC for STUDENTS⁵ were used, representing the information transfer task and the oral presentation task, respectively (see Appendix B for task prompts). The narrative task contained 4 pictures that told a simple story, which participants had one minute to relate. The opinion task provided a graph and required participants to give opinions on a topic based on the information in the graph within two minutes. In the narrative task, a video prompt was presented in which an American woman played the role of asking questions and giving simple preset feedback. Examinees were given some time to prepare a response for each task. They started to record their responses by clicking the "start" icon on the screen when they were ready; when the preparation time was over, their responses were recorded automatically.
3.2.2 Face-to-face monologic tasks
The face-to-face tasks were constructed with the same content and format as those used in the computer mode and were conducted on a one-to-one basis. The author, a Chinese female, served as the interviewer in all face-to-face tests. Instructions were written on a prompt card for each task (see Appendix C for task prompts).

⁵ The GTEC for STUDENTS, developed by Benesse Corporation, is a four-skill computer-based English test. It targets mainly Japanese high school students and university students.
In the narrative task, the interviewer played the role of asking questions and giving feedback, as the video character did in the computer-delivered task. The interviewer timed the preparation and response time and recorded examinees' oral responses using an IC recorder. The interviewer tried not to give any verbal reaction other than simple backchannels (e.g., mm-hm and uh-huh) accompanied by nodding and eye contact. Appendix D summarizes the features of the tasks used in the computer and face-to-face modes.
3.2.3 Questionnaires
Two questionnaires in Japanese were used to investigate examinee attitudes toward testing speaking in the computer mode and the face-to-face mode. The questionnaire items were adapted from Hill (1998) and O'Loughlin (2001). The author translated the questionnaires, which had originally been developed in English, into Japanese. A Japanese Ph.D. student specializing in applied linguistics at TUFS⁶ checked the translation. A back-translation of the Japanese version into English was performed by another Japanese Ph.D. student. Refer to Appendix E for the English version and Appendix F for the Japanese version of the questionnaires.
3.2.3.1 Feedback sheet on each test
After taking each test, examinees completed Questionnaire 1 about that test. The questionnaires for the two tests, which were parallel in format and content, comprised five statements covering the following five aspects: examinees' nervousness, examinees' perceptions of test difficulty, test fairness, the accuracy of the test as a measure of the examinees' English speaking level, and the affective appeal of the test.

⁶ TUFS is the abbreviation of Tokyo University of Foreign Studies.
For instance: I felt nervous when I was taking the test (nervousness); I feel this test was difficult (test difficulty). Examinees were asked to indicate the degree to which they agreed with each statement using a five-point Likert scale: 5 = Strongly agree, 4 = Agree, 3 = Neutral, 2 = Disagree, 1 = Strongly disagree.
3.2.3.2 Feedback sheet on test comparisons
A feedback sheet on test comparisons (Questionnaire 2) was also provided. This consisted of six questions. Five of these questions paralleled the content of the statements in Questionnaire 1 and required examinees to compare the two modes directly with regard to how nervous they felt, how difficult and how fair they perceived the tests to be, and how accurate they felt the tests were as measurements of their English speaking level. Moreover, they were asked to rate the affective appeal of the tests. For example: Which test did you feel more nervous taking? (nervousness); Which test did you feel was more difficult? (test difficulty). In addition, one question asked which of the two modes examinees preferred. For each question, examinees chose one of three options: "Computer-delivered speaking test", "Both the same", and "Face-to-face test". They were also asked to write comments on each question after their responses.
3.3 Data collection procedure
3.3.1 Design
Data were collected from November to December 2006. All participants were randomly assigned to two groups in a counterbalanced design. Group A (n = 48) took the computer mode first and then the face-to-face mode, while Group B
(n = 48) took the face-to-face mode first and then the computer mode. Due to the invalid data, 41 data sets from Group A and 38 from Group B were used in the subsequent analyses of examinees' performance. Examinees took the two tests with an interval of 7 to 10 days. It was considered that during this period the participants' English ability would not grow much and the practice effect would be minimized.
The computer-delivered speaking test was administered to each participant on campus, either individually in a quiet room (Universities A and B) or in a computer lab with about 20 classmates (University C). All the high school students took the test in a computer lab with 2 or 3 classmates. They were seated far enough apart to ensure that they would not influence each other. For each face-to-face test, participants met individually with the author in a quiet room on their campus. None of the students except one female participant from University A knew the author beforehand. The participants were not informed beforehand that the second test would have the same task content as the first one. After each test, they were asked to fill out a questionnaire about their attitudes toward that specific test (Questionnaire 1). After completing both tests and the questionnaire for each test, examinees completed a final questionnaire comparing the tests delivered in the two modes (Questionnaire 2). Table 3.3 presents the design of the study.
Table 3.3 Design of the study

          Group A (N = 48)             Group B (N = 48)
Week 1    Computer-delivered test      Face-to-face test
          ↓ Questionnaire 1            ↓ Questionnaire 1
Week 2    Face-to-face test            Computer-delivered test
          ↓ Questionnaire 1            ↓ Questionnaire 1
          ↓ Questionnaire 2            ↓ Questionnaire 2
3.3.2 Scoring
Two independent accredited raters⁷ for the GTEC for STUDENTS scored the examinees' responses to the two monologic tasks delivered by computer on each of the following four elements: grammar, vocabulary, fluency, and pronunciation. All the elements were rated on a 0-4 analytic scale (see Appendix G for the scoring rubric). The score for each element of each task was determined by taking the mean of the ratings from the two raters. The final score for each element was calculated by averaging the element scores across the two tasks, and these were added up to obtain a total score for each participant.
Tasks in the face-to-face mode were scored in the same way as the computer-delivered tasks, by the same pool of raters with the same scoring rubric. Reliability and agreement of the ratings assigned by the two raters for both the computer-delivered and face-to-face monologic tasks are reported in Chapter Four.
⁷ For this study, no further information on the raters was available. Examinees may have received ratings from different pairs of raters. As pointed out in Lee (2006), this is quite a common practice for a large-scale performance test, since it is usually impractical to ask the same pair of raters to do all the ratings.
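Expressed as arithmetic, the aggregation above is: element score per task = mean of the two ratings; final element score = mean of that element across the two tasks; total score = sum of the four final element scores. A minimal Python sketch with hypothetical ratings (only two of the four elements shown for brevity; not the study's actual data files):

    import pandas as pd

    # Hypothetical ratings: one row per examinee x task x element x rater,
    # each on the 0-4 analytic scale described above.
    ratings = pd.DataFrame({
        "examinee": [1, 1, 1, 1, 1, 1, 1, 1],
        "task":     ["narrative"] * 4 + ["opinion"] * 4,
        "element":  ["grammar", "grammar", "fluency", "fluency"] * 2,
        "rater":    ["R1", "R2"] * 4,
        "rating":   [3, 2, 3, 3, 2, 2, 3, 2],
    })

    # Step 1: element score per task = mean of the two raters' ratings.
    per_task = ratings.groupby(["examinee", "task", "element"])["rating"].mean()

    # Step 2: final element score = mean of that element across the two tasks.
    per_element = per_task.groupby(["examinee", "element"]).mean()

    # Step 3: total score = sum of the final element scores.
    total = per_element.groupby("examinee").sum()
    print(per_element, total, sep="\n")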
3.3.3 Coding of analytic measures
Speech samples were transcribed by five senior students at TUFS and double-checked by the author⁸. Appendix H provides some examples of the transcripts⁹. The utterances were then coded in order to calculate measures of fluency, accuracy, and complexity. The coded speech samples were entered into a database for use with the CLAN (Computerized Language Analysis) program developed as part of the CHILDES project (MacWhinney, 2000). The CLAN program allows a large number of automatic analyses to be performed on data, including frequency counts, word searches, and calculation of type/token ratios. This section describes how linguistic performance in terms of fluency, accuracy, and complexity was operationalized and how the speech samples were coded in this study. Table 3.4 at the end of this section presents a summary of the measures.

⁸ All errors in pronunciation were transcribed in their correct forms.
⁹ These examples are presented for the high-, middle-, and low-proficiency groups based on scores on the computer-delivered speaking tasks. Refer to Section 4.3.2 for details of the grouping.
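CLAN is operated through its own commands, so purely to illustrate the kind of counting involved, here is a rough Python analogue of a frequency count and a type/token ratio (this is not CLAN's actual interface, and the transcript line is invented):

    from collections import Counter

    transcript = "I I going to speaking test and I think we should get information"
    tokens = transcript.lower().split()

    freq = Counter(tokens)            # frequency count per word form
    ttr = len(freq) / len(tokens)     # type/token ratio

    print(freq.most_common(3))
    print(f"types={len(freq)} tokens={len(tokens)} TTR={ttr:.2f}")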
3.3.3.1 Fluency
Measures of fluency can be categorized into two types: hesitation phenomena relating to dysfluency, and temporal variables concerned with the speed of speaking (Ellis & Barkhuizen, 2005). The hesitation dimension was operationalized as the number of dysfluent words divided by the total amount of speech measured in seconds (e.g., Kormos & Denes, 2004). Dysfluent words include words of functionless repetition, self-correction, and false starts.
Following Foster et al. (2000), functionless repetition means exact repetition of words, syllables, or phrases, or partial repetition of some part of a word or utterance.
Self-correction occurs when the speaker identifies an error either during or immediately following production and stops and reformulates the speech. A false start is defined as an incomplete utterance, which is begun and then either abandoned altogether or reformulated in some way. In addition to dysfluent words overall, separate measures for the above three aspects were also reported. The following are some examples of words of repetition, self-correction, and false starts:

・ I I going to speaking test. (1 repeated word)
・ However however however some some man talked to me. (3 repeated words)
・ I can I couldn't win the speech contest. (2 words of self-correction)
・ This pie chart said talked shows that nearly sixty percent of people know that what's happening in society on TV through TV. (4 words of self-correction)
・ So we can so the needs of the newspaper is not so important. (3 words of false start)
・ But y- anyway I think we should get information. (6 words of false start)
For temporal variables, three measures were employed: (1) speech rate, represented by the number of words divided by the total amount of speech measured in seconds (e.g., Lennon, 1990); (2) the number of unfilled pauses divided by the total amount of speech in seconds (e.g., Kormos & Denes, 2004); and (3) the number of filled pauses divided by the total amount of speech in seconds (e.g., Brown et al., 2005).
To calculate speech rate, the transcribed speech was first pruned by excluding dysfluent words. When the number of words was counted, filled pauses in both Japanese and English, such as "uh", "um", and "eito", were excluded. Contractions, such as "wasn't" and "he's", were counted as two words. Katakana words such as "telebi" and "pasokon" were not counted as words. Then the resulting total number of words
was divided by the total speech time the examinee took to complete the task, excluding pauses of three or more seconds¹⁰. The number of unfilled pauses was calculated by counting the number of pauses of two seconds or more that occurred in the examinee's speech¹¹. Here are some examples of filled pauses and word counts:

・ Uh I speak um spoke very well. (2 words of filled pauses)
・ I don't agree with your idea because um I can get ob- got information by telebi. (13 words¹²)
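Taken together, the fluency indices are ratios of counts to speech time. A minimal Python sketch with hypothetical counts for a single response (speech time excludes pauses of three seconds or more, as in footnote 10):

    # Hypothetical values for one response.
    speech_time = 55.0        # total speech time in seconds (pauses >= 3 s excluded)
    pruned_words = 83         # words after pruning dysfluencies and filled pauses
    unfilled_pauses = 6       # pauses of two seconds or more
    filled_pauses = 9         # "uh", "um", "eito", ...
    repetitions, self_corrections, false_starts = 4, 3, 2
    dysfluent_words = repetitions + self_corrections + false_starts

    speech_rate         = pruned_words / speech_time      # words per second
    unfilled_pause_rate = unfilled_pauses / speech_time
    filled_pause_rate   = filled_pauses / speech_time
    dysfluency_rate     = dysfluent_words / speech_time

    print(f"speech rate = {speech_rate:.2f} words/s")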
3.3.3.2 Accuracy
Accuracy was assessed by two general measures: the percentage of error-free clauses in the total number of clauses (e.g., Foster & Skehan, 1996) and the percentage of error-free AS-units in the total number of AS-units (e.g., Robinson, 2001). General measures of accuracy are suggested to have the advantage of being potentially the most comprehensive, in that all errors are considered (e.g., Bygate, 2001). Another reason for adopting general measures is that previous studies using specific measures (Luoma, 1997; Koike, 1998) failed to find any difference across test modes.
The Analysis of Speech Unit (AS-unit) was chosen as the basic syntactic unit of analysis in this study, as it has been considered a better measure than others (e.g., T-units) for spoken data produced by L2 speakers (Foster, Tonkyn, & Wigglesworth, 2000). An AS-unit is defined as a "single speaker's utterance consisting of an independent clause, or sub-clausal unit, together with any subordinate clause(s) associated with
¹⁰ Following Brown et al. (2005), pauses of three or more seconds were considered to be a substantial length of elapsed time and were thus excluded from total speech time.
¹¹ The author found it difficult to measure pause times under two seconds reliably with a stopwatch. Thus, only pauses over two seconds were counted.
¹² The underlined words were excluded from counting, since they represented self-correction.
either" (Foster et al., 2000, p. 365). Examples of AS-units are as follows:

・ Students have to read newspaper every day. (1 AS-unit)
・ I speech (1 AS-unit)
・ And happy to attend that speech contest (1 AS-unit)
・ And then very loud a clap hands (1 AS-unit)
To calculate the number of clauses, clauses were classified into independent clauses and subordinate clauses. An independent clause was "minimally a clause including a finite verb" (Foster et al., 2000, p. 365). A subordinate clause included adverbial clauses, adjective clauses, and nominal clauses and "consisted minimally of a finite or non-finite verb element plus at least one other clause element (subject, object, complement or adverbial)" (Foster et al., p. 366). In coding the clauses, all the utterances, including those with errors as defined above, were used. The following gives some examples of independent and subordinate clauses:

・ I had a speech contest. (1 independent clause)
・ And happy to attend that speech contest. (0 independent clauses)
・ I don't say that [...]. (1 independent and 1 subordinate clause, underlined in the original)
・ [...] is important. (1 independent and 1 subordinate clause, underlined in the original)
Error-free AS-units are AS-units that do not contain any errors in syntax, morphology, word order, or lexical choice. Error-free clauses refer to clauses free from any errors in syntax, morphology, or lexical choice. Lexical errors were defined as errors in lexical form or collocation. In counting errors, only the pruned utterances were used. Also, errors in pronunciation were not considered since, as mentioned earlier, they were not reflected in the transcription. Native-like use of the language, in terms of
grammar and lexis, was generally taken as the criterion in judging error-free clauses and error-free AS-units.
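Both accuracy indices reduce to simple proportions once the units have been coded. A sketch with hypothetical coded counts:

    # Hypothetical coded counts for one response.
    clauses, error_free_clauses = 18, 11
    as_units, error_free_as_units = 14, 8

    pct_error_free_clauses = error_free_clauses / clauses * 100     # 61.1%
    pct_error_free_as_units = error_free_as_units / as_units * 100  # 57.1%

    print(f"{pct_error_free_clauses:.1f}% error-free clauses, "
          f"{pct_error_free_as_units:.1f}% error-free AS-units")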
3.3.3.3 Complexity
Complexity was measured in terms of syntactic complexity and lexical complexity. Three types of syntactic complexity measures were adopted: (1) the percentage of clauses in the total number of AS-units (e.g., O'Sullivan, 2002), (2) the percentage of subordinate clauses in the total number of AS-units (e.g., Wigglesworth, 1997), and (3) the ratio of the number of words to the total number of AS-units (e.g., Bygate, 2001).
Lexical complexity was established by three measures, namely Guiraud's Index, a lexical density index, and a weighted lexical density index. For all these measures, only pruned utterances were used. Guiraud's Index was calculated by dividing the number of types by the square root of the number of tokens. This measure is thought to be more appropriate than the type-token ratio (TTR), as it takes sample length into account (Vermeer, 2000). Following Daller, van Hout, and Treffers-Daller (2003), words that differed from each other only in inflectional morphology were counted as a single type (e.g., go-went; be-is-was-are), whereas words carrying different derivational morphemes (e.g., danger-dangerous) were counted as different types.
Lexical density was calculated as the ratio of lexical words to the total number of words. Following O'Loughlin (2001) and Koizumi (2005), words were classified into grammatical and lexical words. Given that the lexical density index has been criticized for ignoring the relative significance of words of different frequency, this study also adopted a weighted lexical density that takes the frequency of words into account. To
calculate weighted lexical density, lexical words were further divided into high-frequency and low-frequency words. High-frequency words were defined as the most frequent 2000 words¹³ in the JACET list of 8000 Basic Words (JACET Basic Revision Committee, 2003)¹⁴. High-frequency lexical words were given half the weight of low-frequency words, and the number of weighted lexical words as a percentage of the total number of words was then calculated.
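The three lexical indices can be computed directly from the pruned word lists. A Python sketch; the token list, the grammatical/lexical split, and the small high-frequency set standing in for the first 2000 JACET words are all hypothetical:

    import math

    # Pruned tokens, already lemmatized so inflectional variants share a type.
    tokens = ["student", "have", "to", "read", "newspaper", "every",
              "day", "student", "read", "newspaper"]
    types = set(tokens)

    # Guiraud's Index = types / sqrt(tokens).
    guiraud = len(types) / math.sqrt(len(tokens))

    # Hypothetical classifications (in the study: grammatical vs. lexical
    # words, and JACET-2000 membership for the frequency split).
    grammatical = {"to", "every", "have"}
    lexical = [t for t in tokens if t not in grammatical]
    high_freq = {"student", "read", "day"}   # stand-in for the JACET first 2000

    lexical_density = len(lexical) / len(tokens)

    # Weighted lexical density: high-frequency lexical words get half weight.
    n_high = sum(t in high_freq for t in lexical)
    n_low = len(lexical) - n_high
    weighted_lexical_density = (n_high / 2 + n_low) / len(tokens)

    print(f"Guiraud={guiraud:.2f}, LD={lexical_density:.2f}, "
          f"WLD={weighted_lexical_density:.2f}")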
Two coders coded the data. The author coded all the data; the second coder, a Ph.D. student specializing in language testing at TUFS, coded a randomly selected sample of approximately 10% of the data. Inter-coder reliability was calculated as the percentage of agreement between the coders, which is reported in Chapter Four.
¹³ The 2000-word cutoff was used because high school students in Japan are supposed to learn about 2000 words (Ministry of Education, Science & Culture, 1989).
¹⁴ The JACET list of 8000 Basic Words uses a lemma count. Operationally, a lexical frequency profile analysis was conducted using the computer program "JACET 8000 Analysis Program" (Shimizu, 2004), which resulted in a distinction between the words belonging to the 2000 most frequent words and those with a lower frequency.
Table 3.4 Measures of fluency, accuracy, and complexity used in the study

Measure                                Definition
Fluency
  No. of words                         The number of words / the total amount of speech (in seconds)
  No. of unfilled pauses               The number of unfilled pauses / the total amount of speech (in seconds)
  No. of filled pauses                 The number of filled pauses / the total amount of speech (in seconds)
  No. of dysfluent words               The number of dysfluent words / the total amount of speech (in seconds)
  No. of repetition words              The number of repetition words / the total amount of speech (in seconds)
  No. of self-correction words         The number of self-correction words / the total amount of speech (in seconds)
  No. of false start words             The number of false start words / the total amount of speech (in seconds)
Accuracy
  Percentage of error-free clauses     The number of error-free clauses / the total number of clauses
  Percentage of error-free AS-units    The number of error-free AS-units / the total number of AS-units
Complexity
  Syntactic complexity
    Percentage of clauses              The number of clauses / the total number of AS-units
    Percentage of subordinate clauses  The number of subordinate clauses / the total number of AS-units
    No. of words                       The number of words / the total number of AS-units
  Lexical complexity
    Guiraud's Index                    The number of types / √(the number of tokens)
    Lexical density                    The number of lexical words / the total number of words
    Weighted lexical density           (The number of high-frequency lexical words / 2 + the number of low-frequency lexical words) / the total number of words
3.4 Data analysis
3.4.1 Comparing the magnitude of raw scores
In order to evaluate the agreement of the ratings assigned by the two randomly selected raters, two types of indexes were employed: (1) Pearson product-moment correlation coefficients for inter-rater reliability and (2) perfect agreement and adjacent agreement between the ratings of the two independent raters.
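As an illustration, both indexes might be computed as follows (hypothetical ratings; the Pearson coefficient via scipy):

    import numpy as np
    from scipy.stats import pearsonr

    # Hypothetical ratings from two raters on the 0-4 scale.
    r1 = np.array([3, 2, 4, 3, 1, 2, 3, 4])
    r2 = np.array([3, 3, 4, 2, 1, 2, 4, 4])

    r, _ = pearsonr(r1, r2)                   # inter-rater reliability
    perfect = np.mean(r1 == r2)               # identical ratings
    adjacent = np.mean(np.abs(r1 - r2) <= 1)  # within one scale point

    print(f"r={r:.2f}, perfect={perfect:.0%}, adjacent={adjacent:.0%}")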
Before proceeding to the analysis of raw scores, the order effect on test scores was checked. As described earlier, participants were assigned at random to two testing conditions: (a) computer mode first and face-to-face mode second or (b) face-to-face mode first and computer mode second. With a counterbalanced design like this, one concern is that taking the test in one mode followed by the other could affect the score in the second mode. Consequently, descriptive statistics of the scores for each rating element were first examined for the four mode-by-order conditions separately to detect any possible order effect.
To assess the effect of delivery mode and the mode-by-order interaction statistically, a series of two-way repeated measures analyses of variance (ANOVAs), with one within-subject factor (mode) and one between-subject factor (order), was performed on element scores and total scores. If there was no mode-by-order interaction effect, the results for the main effect of delivery mode could be interpreted directly. If such interactions were found to be statistically significant, it would imply that the mode effect was not consistent across the two groups assigned to the different testing conditions. It would then be invalid to use the combined data from the two groups. In that case, following Choi, Kim, and Boo (2003), independent t-tests using scores from the first test administered would be conducted to compare the magnitude of test scores.
For reference, the results of independent t-tests based on scores from the second test would also be presented. SPSS 11.0 for Windows was employed for all the statistical analyses in this study.
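The original analyses were run in SPSS; a rough Python equivalent of the mixed mode-by-order ANOVA and the first-administration t-test fallback might look as follows, assuming the pingouin package and hypothetical file and column names:

    import pandas as pd
    import pingouin as pg
    from scipy.stats import ttest_ind

    # Long format: one row per examinee x mode, with the order group recorded.
    df = pd.read_csv("scores_long.csv")   # columns: id, mode, order, total_score

    # Two-way repeated measures ANOVA: mode (within) x order (between).
    aov = pg.mixed_anova(data=df, dv="total_score", within="mode",
                         subject="id", between="order")
    print(aov)

    # Fallback if the mode-by-order interaction is significant:
    # independent t-test on first-administration scores only.
    first = df[((df["mode"] == "computer") & (df["order"] == "computer_first")) |
               ((df["mode"] == "face_to_face") & (df["order"] == "ftf_first"))]
    cbt = first.loc[first["mode"] == "computer", "total_score"]
    ftf = first.loc[first["mode"] == "face_to_face", "total_score"]
    print(ttest_ind(cbt, ftf))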
3.4.2 Comparing psychometric constructs
The data analyses in this section consisted of a data transformation as a measure against the order effect, a preliminary analysis of descriptive statistics, and a series of exploratory factor analyses.
If no order effect was found through the analysis in the previous section, the scores for each of the four elements on each of the two tasks, yielding eight variables per mode, were to be used in the subsequent analysis. However, even if an order effect was observed, it was decided not to restrict the analysis to data from the first test only, as this would reduce the amount of data and violate the sample-size assumption of exploratory factor analysis¹⁵. Instead, after consulting with a statistical expert, it was decided to transform the original data by subtracting them from the mean score of each variable within each order group.
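A sketch of such a group-wise transformation, written here in the usual centering form, with each variable centered within each order group (hypothetical file and column names; either sign convention leaves the correlation structure that the factor analysis depends on unchanged):

    import pandas as pd

    df = pd.read_csv("efa_scores.csv")   # "order" column + 8 score columns per mode
    score_cols = [c for c in df.columns if c != "order"]

    # Center each variable within each order group so that mean differences
    # attributable to administration order are removed before factor analysis.
    centered = df.copy()
    centered[score_cols] = df.groupby("order")[score_cols].transform(
        lambda x: x - x.mean())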
Descriptive statistics were calculated for the 16 variables in total to examine the assumptions of exploratory factor analysis. The univariate normality of the data was assessed through the skewness and kurtosis of the variables, which should be within a range of -2 to 2. The Pearson product-moment correlation matrix was computed to assess the multicollinearity of the data. Following Field (2005), a correlation above .90 between two variables was considered a threat to the singularity of the data; in that case, one of the variables would be eliminated from further analysis. Univariate outliers, defined as cases with values more than three standard deviations from the mean of a given variable, were also examined.

¹⁵ Many criteria have been proposed to determine the sample size for exploratory factor analysis. Generally, it is considered proper to have at least five times as many subjects as variables (Field, 2005).
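These screening steps are mechanical; a pandas sketch (hypothetical file name):

    import pandas as pd

    df = pd.read_csv("efa_scores_centered.csv")   # the 16 score variables

    # Univariate normality: skewness and kurtosis should fall within [-2, 2].
    print(df.skew(), df.kurtosis(), sep="\n")

    # Multicollinearity/singularity: flag correlations above .90.
    corr = df.corr()
    high = corr.where(corr.abs() > .90).stack()
    print(high[high.index.get_level_values(0) != high.index.get_level_values(1)])

    # Univariate outliers: values more than 3 SDs from the variable mean.
    z = (df - df.mean()) / df.std()
    print((z.abs() > 3).sum())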
In order to investigate whether there were components shared in common as measured by the monologic tasks delivered in the computer and the face-to-face modes, principal factor analysis¹⁶ with varimax rotation as the factor analytic method was performed on the data for the two modes respectively. The number of factors, which was determined based on the sizes of the eigenvalues and the scree test (Tabachnick & Fidell, 2001), was compared across modes. The factor loading for each variable was also checked and compared across modes. Another principal factor analysis was run on the combined data from the two modes to confirm the comparability of the factor structures from the above analyses.
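A sketch of this step, assuming the factor_analyzer package, which provides principal-axis extraction and varimax rotation (hypothetical file and column names; the actual study used SPSS):

    import pandas as pd
    from factor_analyzer import FactorAnalyzer

    df = pd.read_csv("efa_scores_centered.csv")
    mode_a = df.filter(like="cbt_")        # the 8 computer-mode variables

    # Eigenvalues for the scree test / eigenvalue criterion.
    fa = FactorAnalyzer(rotation=None, method="principal")
    fa.fit(mode_a)
    ev, _ = fa.get_eigenvalues()
    n_factors = sum(ev > 1)                # one common rule of thumb

    # Principal factor analysis with varimax rotation.
    fa = FactorAnalyzer(n_factors=n_factors, rotation="varimax",
                        method="principal")
    fa.fit(mode_a)
    print(pd.DataFrame(fa.loadings_, index=mode_a.columns))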
3.4.3 Comparing speech samples
To address Research Question 3, the coded responses of examinees in the computer mode and the face-to-face mode were compared in terms of measures of fluency, accuracy, and complexity to examine the effect of the computer delivery mode on speech samples. The purpose of Research Question 4 was to examine a possible interaction effect of examinees' proficiency on their speech samples across delivery modes. To this end, participants were first categorized into three groups based on their total scores on the computer-delivered tasks. A one-way ANOVA was then performed to ensure that the scores were statistically different across the three groups.

¹⁶ Principal factor analysis (PFA) is also called common factor analysis or principal axis factoring. This extraction method was selected over principal component analysis since it reflects only the common variance of the variables, excluding the unique variance.
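A sketch of the grouping check; the tercile split shown here is only illustrative, since the actual grouping is described in Section 4.3.2:

    import pandas as pd
    from scipy.stats import f_oneway

    df = pd.read_csv("cbt_totals.csv")     # columns: id, total_score (hypothetical)

    # Split into three proficiency groups by total computer-mode score.
    df["level"] = pd.qcut(df["total_score"], 3, labels=["low", "mid", "high"])

    groups = [g["total_score"] for _, g in df.groupby("level")]
    print(f_oneway(*groups))   # confirm scores differ across the three groups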
A series of two-way repeated measures ANOVAs, with delivery mode (computer vs. face-to-face) as a within-subject factor and examinees' proficiency level (low vs. mid vs. high) as a between-subject factor, was conducted. The ANOVA results were expected to determine whether there were significant differences in measures of fluency, accuracy, and complexity across test modes and whether there was any interaction effect between delivery mode and proficiency level.
3.4.4 Comparing examinee attitudes
Means and standard deviations for the Likert-scale statements in Questionnaire 1 were calculated for both tests. A mean score greater than 3 was considered to show a positive trend and one less than 3 a negative trend. A repeated measures t-test was conducted to compare examinees' responses to the two tests statistically. For Questionnaire 2, chi-square tests were used to compare the observed frequencies of examinees choosing each option to the statistically expected values. In addition, the examinees' responses to the open-ended questions were categorized through content analysis and tallied.