東京外国語大学 博士学位論文 Doctoral thesis (Tokyo University of Foreign Studies)

CHAPTER THREE

METHOD

     In order to investigate the research questions listed in Chapter Two, both quantitative and qualitative analyses were performed. To investigate the equivalence of test scores across modes quantitatively, the magnitude of raw scores was compared by means of t-tests; a factor analysis approach was used to compare the constructs measured by the monologic tasks delivered in the two modes.

     For comparing examinees’ speech samples, the transcripts were first coded on a range of qualitative analytic measures. The measures were then calculated and compared quantitatively across test modes. Questionnaires were also utilized to gather information on examinees’ attitudes toward testing speaking on computer. The responses to Likert-scale type items were compared across modes quantitatively. In addition, qualitative data from responses to open-ended questions was analyzed through content analysis to shed light on Likert-type responses.

3.1 Participants

     A total of 96 Japanese learners of English as a foreign language (EFL) in Japan participated in the study. They were 78 undergraduate students (81%) from 3 universities and 18 high school students (19%) from 2 high schools. There were 23 male students (24%) and 73 female students (76%). Students from University A, a foreign language university, were specializing in foreign languages other than English. Students from University B were majoring in English language and literature and those from University C, a women’s university, in nursing and domestic science. High school D is a boys’ high school, while High school E is a co-educational high school. Table 3.1 shows the characteristics of the participants. These students could be considered to represent a wide range of university and high school students in Japan. All the students participated on a voluntary basis and received an honorarium after their participation in this project.

Table 3.1 Characteristics of the original 96 participants

Columns: School; Academic year (1, 2, 3, 4); Gender (Male, Female); Total

Rows: University (A, B, C); High school (D, E); Total

Total row: Male 23, Female 73, Total 96

[The per-school cell values are illegible in the source.]

Note. NA = Not applicable.

     However, data from a total of 17 participants were excluded from the analyses comparing performance across the two delivery modes due to (a) poor quality of recording (n = 7), (b) improper testing procedures (n = 7), and (c) missing data (i.e., no utterance on either task) (n = 3). As a result, data from 79 participants were used for investigating Research Questions 1-4. They were 61 undergraduate students (77%) and 18 high school students (23%). There were 19 males (24%) and 60 females (76%). Table 3.2 lists the detailed description of the 79 students. Despite the invalid data of oral responses to the tasks, all the participants filled out the questionnaires administered to them after the tests. Thus, for Research Question 5, the responses on the questionnaires from the original 96 participants were analyzed.

Table 3.2 Characteristics of the 79 participants

Columns: School; Academic year (1, 2, 3, 4); Gender (Male, Female); Total

Rows: University (A, B, C); High school (D, E); Total

Total row: Male 19, Female 60, Total 79

[The per-school cell values are illegible in the source.]

Note. NA = Not applicable.

3.2 Instruments

     The data for this study were collected from multiple sources, including (1) monologic tasks delivered by computer, (2) monologic tasks administered in the face-to-face mode, (3) a questionnaire on examinees’ attitudes toward each mode, and (4) a questionnaire comparing examinee attitudes toward the two modes.

3.2.1 Computer-delivered monologic tasks

     Tasks that are utilized to elicit ratable speech samples in testing speaking can be divided into two types, namely, the monologic task and the interactive task. Monologic tasks, the focus of the present study, usually refer to tasks that can elicit long individual discourses without the examinees’ interacting with an interlocutor. From this definition, they consist of such tasks as reading aloud, sentence repetition, information transfer task, and oral presentation task (O’Sullivan, 2008). Appendix A presents a summary of test tasks used in various large-scale speaking tests. As can be seen in this Appendix, reading aloud and sentence repetition are not common tasks in the face-to-face test. Thus, it was decided to compare only the information transfer task and the oral presentation task in this study. These are defined as tasks in which examinees take some time to make several points and to develop an adequate reply to task prompts.

     Specifically, a narrative task and an opinion task from the speaking section of the GTEC for STUDENTS⁵ were used, representing the information transfer task and the oral presentation task respectively (see Appendix B for task prompts). The narrative task contained 4 pictures that told a simple story which participants had one minute to relate. The opinion task provided a graph and required participants to give opinions on a topic based on the information in the graph within two minutes. In the narrative task, a video prompt in which an American female acted the role of asking questions and giving simple preset feedback was presented. Examinees were given some time to prepare a response for each task. They started to record their responses by clicking the “start” icon on the screen when they were ready. When the preparation time was over, their responses were recorded automatically.

3.2.2 Face-to-face monologic tasks

     The face-to-face tasks were constructed using the same content and format as those used in the computer mode and conducted on a one-to-one basis. The author, a Chinese female, served as the interviewer in all face-to-face tests. Instructions were written on a prompt card for each task (see Appendix C for task prompts). In the narrative task, the interviewer played the role of asking questions and giving feedback as the video character did in the computer-delivered task. The interviewer timed preparation and response time and recorded examinees’ oral responses using an IC recorder. The interviewer tried not to give any verbal reaction except simple backchannels (e.g., mm-hm and uh-huh) with nodding and eye contact. Appendix D summarizes the features of tasks used in the computer and face-to-face modes.

⁵ The GTEC for STUDENTS, developed by Benesse Corporation, is a four-skill computer-based English test. It targets mainly Japanese high school students and university students.

3.2.3  Questionnaires

     Two questionnaires in Japanese were used for investigating examinee attitudes toward testing speaking in the computer mode and the face-to-face mode. The questionnaire items were adapted from Hill (1998) and O’Loughlin (2001). The author translated the questionnaires, which had originally been developed in English, into Japanese. A Japanese Ph.D. student specializing in applied linguistics at TUFS⁶ checked the translation. A back-translation of the Japanese transcript into English was performed by another Japanese Ph.D. student. Refer to Appendix E for the English version and Appendix F for the Japanese version of the questionnaires.

3.2.3.1 Feedback sheet on each test

     After taking each test, examinees completed Questionnaire 1 about the test. The questionnaires for each test, which were parallel in format and content, comprised five statements regarding each of the following five aspects: examinees’ nervousness, examinees’ perceptions of test difficulty, test fairness, accuracy of the test as a measure of the examinees’ English speaking level, and affective appeal of the test. For instance, “I felt nervous when I was taking the test” (nervousness); “I feel this test was difficult” (test difficulty). Examinees were asked to indicate the degree to which they agreed with each statement using a five-point Likert scale: 5 = Strongly agree, 4 = Agree, 3 = Neutral, 2 = Disagree, 1 = Strongly disagree.

⁶ TUFS is the abbreviation of Tokyo University of Foreign Studies.

3.2.3.2 Feedback sheet on test comparisons

      A feedback sheet on test comparisons (Questionnaire 2) was also provided. This consisted of six questions. Five of these questions paralleled the content of the statements in Questionnaire 1 and required examinees to compare the two modes directly in regard to how nervous they felt, how difficult and how fair they perceived the tests to be, and how accurate they felt the tests were as a measurement of their speaking English level. Moreover, they were asked to rate the affective appeal of the tests. For example, “Which test did you feel more nervous taking?” (nervousness); “Which test did you feel more difficult?” (test difficulty). In addition, one question asking which of the two modes examinees preferred was included. For each question, examinees chose one of three options: “Computer-delivered speaking test”, “Both the same”, and “Face-to-face test”. They were also asked to write their comments on each question after their responses.

3.3  Data collection procedure

3.3.1 Design

     Data were collected during the period of November to December in 2006. All participants were randomly assigned to two groups in a counterbalanced design. Group A (n = 48) took the computer mode first and then the face-to-face mode, while Group B (n = 48) took the face-to-face mode first and then the computer mode. Due to the invalid data, data from 41 participants in Group A and 38 participants in Group B were used in the subsequent analyses of examinees’ performance. Examinees took the two tests with an interval of 7 to 10 days. It was considered that during this period, the participants’ English ability would not grow so much and the practice effect might be minimized.
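     The random assignment to the two test orders can be illustrated with a short Python sketch. The chapter does not state how the randomization was carried out, so the function name, the participant IDs, and the seed below are assumptions made only for illustration.

```python
import random

def assign_counterbalanced(participant_ids, seed=None):
    """Randomly split participants into two equal-sized order groups.

    Group A takes the computer mode first; Group B takes the face-to-face
    mode first. The shuffling procedure is an assumption, not the
    author's documented method.
    """
    rng = random.Random(seed)
    ids = list(participant_ids)
    rng.shuffle(ids)
    half = len(ids) // 2
    return {"A": ids[:half], "B": ids[half:]}

# Hypothetical IDs 1-96 standing in for the 96 participants.
groups = assign_counterbalanced(range(1, 97), seed=2006)
print(len(groups["A"]), len(groups["B"]))
```

Fixing the seed makes the allocation reproducible, which is convenient when the assignment has to be documented afterwards.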

     The computer-delivered speaking test was administered to each participant on campus, either individually in a quiet room (University A and B), or in a computer lab with about 20 classmates (University C). All the high school students took the test in a computer lab with 2 or 3 classmates. They were seated far enough apart to ensure that they would not influence each other. For each face-to-face test, participants met individually with the author in a quiet room on their campus. None of the students except one female participant from University A knew the author beforehand. The participants were not informed beforehand that the second test would have the same task content as the first one. After each test, they were asked to fill out a questionnaire about their attitudes toward that specific test (Questionnaire 1). After completing both tests and the questionnaire for each test, examinees completed a final questionnaire comparing the tests delivered in the two modes (Questionnaire 2). Table 3.3 presents the design of the study.


Table 3.3 Design of the study

          Group A (N = 48)              Group B (N = 48)

Week 1    Computer-delivered test       Face-to-face test
                    ↓                           ↓
          Questionnaire 1               Questionnaire 1
                    ↓                           ↓
Week 2    Face-to-face test             Computer-delivered test
                    ↓                           ↓
          Questionnaire 1               Questionnaire 1
                    ↓                           ↓
          Questionnaire 2               Questionnaire 2

3.3.2  Scoring

     Two independent accredited raters⁷ for the GTEC for STUDENTS scored the examinees’ responses to the two monologic tasks delivered by computer on each of the following four elements: grammar, vocabulary, fluency, and pronunciation. All the elements were rated on a 0-4 analytic scale (see Appendix G for the scoring rubric). The scores for each element of each task were determined by taking the mean of the ratings from two raters. The final scores for each element were calculated by averaging the element scores of each task, and these were added up to obtain a total score for each participant.
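     The aggregation described above (mean of two raters per task, average over the two tasks, sum over the four elements) can be sketched in Python as follows. The data layout, the function name, and the ratings in the example are hypothetical and serve only to show the arithmetic.

```python
ELEMENTS = ("grammar", "vocabulary", "fluency", "pronunciation")

def total_score(ratings):
    """Aggregate 0-4 analytic ratings into element scores and a total.

    `ratings` maps a task name to a list of per-rater dicts, each giving
    a rating for every element (an assumed layout, not the author's).
    """
    element_scores = {}
    for element in ELEMENTS:
        # Step 1: mean of the raters' ratings for each task.
        task_means = [
            sum(rater[element] for rater in raters) / len(raters)
            for raters in ratings.values()
        ]
        # Step 2: average of the task-level scores.
        element_scores[element] = sum(task_means) / len(task_means)
    # Step 3: sum of the four element scores.
    return element_scores, sum(element_scores.values())

# Hypothetical ratings for one examinee.
example = {
    "narrative": [
        {"grammar": 2, "vocabulary": 3, "fluency": 2, "pronunciation": 3},
        {"grammar": 3, "vocabulary": 3, "fluency": 2, "pronunciation": 2},
    ],
    "opinion": [
        {"grammar": 2, "vocabulary": 2, "fluency": 1, "pronunciation": 3},
        {"grammar": 2, "vocabulary": 3, "fluency": 2, "pronunciation": 3},
    ],
}
elements, total = total_score(example)
print(elements["grammar"], total)  # 2.25 9.5
```

With four elements each on a 0-4 scale, the total score under this scheme ranges from 0 to 16.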

     Tasks in the face-to-face mode were scored in the same way as the computer-delivered tasks by the same pool of raters with the same scoring rubric. Reliability and agreement of ratings assigned by two raters for both computer-delivered and face-to-face monologic tasks are reported in Chapter Four.

⁷ For this study, no further information on raters was available. Examinees may have received ratings from different pairs of raters. As pointed out in Lee (2006), this is quite a common practice for a large-scale performance test, since it is usually impractical to ask the same pair of raters to do all the ratings.


3.3.3 Coding of analytic measures

     Speech samples were transcribed by five senior students at TUFS and double-checked by the author⁸. Appendix H provides some examples of the transcripts⁹. The utterances were then coded in order to calculate measures of fluency, accuracy, and complexity. The coded speech samples were entered into a database for use with the CLAN (Computerized Language Analysis) program developed as part of the CHILDES project (MacWhinney, 2000). The CLAN program allows a large number of automatic analyses to be performed on data, including frequency counts, word searches, and calculation of type/token ratios. This section describes how the linguistic performance in terms of fluency, accuracy and complexity was operationalized and how the speech samples were coded in this study. Table 3.4 at the end of this section presents a summary of the measures.

3.3.3.1  Fluency

     Measures of fluency could be categorized into two types: hesitation phenomena relating to dysfluency and temporal variables concerned with the speed of speaking (Ellis & Barkhuizen, 2005). The hesitation dimension was operationalized as the number of dysfluent words divided by the total amount of speech measured in seconds (e.g., Kormos & Denes, 2004). Dysfluent words include words of functionless repetition, self-correction, and false starts.

⁸ All errors in pronunciation were transcribed in their correct forms.

⁹ These examples were presented in the high, middle, and low-proficiency groups based on their scores in computer-delivered speaking tasks. Refer to the details of grouping in Section 4.3.2.

     Following Foster et al. (2000), functionless repetition means repetition of exact words, syllables, phrases, or partial repetition of some part of a word or utterance. Self-correction occurs when the speaker identifies an error either during or immediately following production and stops and reformulates the speech. A false start is defined as an incomplete utterance, which is begun and then either abandoned altogether or reformulated in some way. In addition to dysfluent words, separate measures for the above three aspects were also reported. The following are some examples for words of repetition, self-correction, and false starts.

I I going to speaking test. (1 repeated word)

However however however some some man talked to me. (3 repeated words)

I can I couldn’t win the speech contest. (2 words of self-correction)

This pie chart said talked shows that nearly sixty percent of people know that what’s happening in society on TV through TV. (4 words of self-correction)

So we can so the needs of the newspaper is not so important. (3 words of false start)

… anyway I think we should get information. (6 words of false start)

     For temporal variables, three measures were employed: (1) speech rate as represented by the number of words divided by the total amount of speech measured in seconds (e.g., Lennon, 1990), (2) the number of unfilled pauses divided by the total amount of speech measured in seconds (e.g., Kormos & Denes, 2004), and (3) the number of filled pauses divided by the total amount of speech measured in seconds (e.g., Brown et al., 2005).

     To calculate speech rate, the transcribed speech was first pruned by excluding dysfluent words. When the number of words was counted, filled pauses both in Japanese and English, such as “uh”, “um”, and “eito”, were excluded. Contractions, such as “wasn’t” and “he’s”, were counted as two words. Katakana words such as “telebi” and “pasokom” were not counted as words. Then the resulting total number of words was divided by the total speech time an examinee took to complete the task, excluding pauses of three or more seconds¹⁰. The number of unfilled pauses was calculated by counting the number of pauses of two seconds or more that occurred in the examinees’ speech¹¹. Here are some examples for filled pauses and words.

・Uh I speak um spoke very well. (2 words of filled pauses)

・I don’t agree with your idea because um I can … got information by telebi. (13 words¹²)
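     The word-counting rules for speech rate can be sketched in Python as below. This is a minimal illustration that assumes the utterance has already been pruned of dysfluent words; the word lists contain only the examples named in the text, and the function names, the apostrophe test for contractions, and the 26-second speech time are assumptions.

```python
FILLED_PAUSES = {"uh", "um", "eito"}      # examples named in the text
KATAKANA_WORDS = {"telebi", "pasokom"}    # examples named in the text

def count_words(pruned_tokens):
    """Count words in an utterance already pruned of dysfluent words."""
    total = 0
    for token in pruned_tokens:
        word = token.lower().strip(".,?!")
        if not word or word in FILLED_PAUSES or word in KATAKANA_WORDS:
            continue  # filled pauses and Katakana words are not counted
        # Contractions such as "wasn't" and "he's" count as two words.
        # Simplification: any token containing an apostrophe is treated
        # as a contraction.
        total += 2 if ("'" in word or "\u2019" in word) else 1
    return total

def speech_rate(pruned_tokens, speech_seconds):
    """Words per second; speech_seconds should already exclude pauses
    of three or more seconds."""
    return count_words(pruned_tokens) / speech_seconds

tokens = ("I don't agree with your idea because um "
          "I can got information by telebi").split()
print(count_words(tokens))        # 13
print(speech_rate(tokens, 26.0))  # 0.5
```

The count of 13 agrees with the second example above once the self-corrected words, the filled pause, and the Katakana word are left out.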

3.3.3.2 Accuracy

     Accuracy was assessed by two general measures: the percentage of error-free clauses in the total number of clauses (e.g., Foster & Skehan, 1996) and the percentage of error-free AS-units in the total number of AS-units (e.g., Robinson, 2001). General measures for accuracy are suggested to have the advantage of being potentially the most comprehensive, in that all errors are considered (e.g., Bygate, 2001). Another reason for adopting general measures is that previous studies (Luoma, 1997; Koike, 1998) using specific measures failed to find any difference across test modes.

¹⁰ Following Brown et al. (2005), pauses of three or more seconds were considered to be a substantial length of elapsed time and were thus excluded from total speech time.

¹¹ The author found it difficult to measure pause time under two seconds with a stopwatch reliably. Thus, only pauses over two seconds were counted.

¹² The underlined words were excluded from counting, since they represented self-correction.

     The Analysis of Speech Unit (AS-unit) was chosen as the basic syntactic unit of analysis in this study, as it has been considered to be a better measure than others (e.g., T-units) for spoken data produced by L2 speakers (Foster, Tonkyn, & Wigglesworth, 2000). This is defined as a “single speaker’s utterance consisting of an independent clause, or sub-clausal unit, together with any subordinate clause(s) associated with either” (Foster et al., p. 365). Examples of an AS-unit are as follows:

・Students have to read newspaper every day. (1 AS-unit)

・I speech (1 AS-unit)

・And happy to attend that speech contest (1 AS-unit)

・And then very loud a clap hands (1 AS-unit)

     To calculate the number of clauses, they were classified into independent clauses and subordinate clauses. An independent clause was “minimally a clause including a finite verb” (Foster et al., 2000, p. 365). A subordinate clause included an adverbial clause, an adjective clause, and a nominal clause and “consisted minimally of a finite or non-finite verb element plus at least one other clause element (subject, object, complement or adverbial)” (Foster et al., p. 366). In coding the clauses, all the utterances, including those with errors as defined above, were used. The following gives some examples of independent and subordinate clauses.

I had a speech contest. (1 independent clause)

And happy to attend that speech contest. (0 independent clause)

I don’t say that …. (1 independent and 1 subordinate clause as underlined in the sentence)

… is important. (1 independent and 1 subordinate clause as underlined in the sentence)

     Error-free AS-units are AS-units that do not contain any errors in syntax, morphology, word order, or lexical choice. Error-free clauses refer to clauses free from any errors in syntax, morphology, or lexical choice. Lexical errors were defined as errors in lexical form or collocation. In counting errors, only the pruned utterances were used. Also, errors in pronunciation were not considered since, as mentioned earlier, they were not reflected in the transcription. The native-like use of the language, in terms of the grammar and lexis, was generally considered as a criterion in judging error-free clauses and error-free AS-units.

3.3.3.3 Complexity

     Complexity was measured in terms of syntactic complexity and lexical complexity. Three types of syntactic complexity measures were adopted: (1) the percentage of clauses in the total number of AS-units (e.g., O’Sullivan, 2002), (2) the percentage of subordinate clauses in the total number of AS-units (e.g., Wigglesworth, 1997), and (3) the ratio of the number of words divided by the total number of AS-units (e.g., Bygate, 2001).
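     As an illustration of the three syntactic complexity measures, the following Python sketch computes them from per-AS-unit counts. The input layout and the counts in the example are hypothetical; the values are returned as ratios, which can be multiplied by 100 where a percentage is wanted.

```python
def syntactic_complexity(as_units):
    """Compute the three ratios from coded AS-units.

    Each AS-unit is a dict with 'clauses' (all clauses), 'subordinate'
    (subordinate clauses only), and 'words' counts (an assumed layout).
    """
    n = len(as_units)
    return {
        "clauses_per_as_unit": sum(u["clauses"] for u in as_units) / n,
        "subordinate_per_as_unit": sum(u["subordinate"] for u in as_units) / n,
        "words_per_as_unit": sum(u["words"] for u in as_units) / n,
    }

# Two hypothetical AS-units from one speech sample.
sample = [
    {"clauses": 1, "subordinate": 0, "words": 7},
    {"clauses": 2, "subordinate": 1, "words": 9},
]
result = syntactic_complexity(sample)
print(result)
```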

     Lexical complexity was established by three measures, namely, Guiraud’s Index, lexical density index, and weighted lexical density index. For all these measures, only pruned utterances were used. Guiraud’s Index was calculated by dividing the number of types by the square root of the number of tokens. This measure is thought to be more appropriate than the type-token ratio (TTR), as it takes sample length into account (Vermeer, 2000). Following Daller, van Hout, and Treffers-Daller (2003), words that differed from each other only in inflectional morphology were counted as one single type (e.g., go-went; is-was-are), whereas words carrying different derivational morphemes (e.g., danger-dangerous) were counted as different types.
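     Guiraud’s Index, as defined above, can be sketched in Python as follows. The `lemma_of` mapping, which folds inflected forms into a single type, is a hypothetical stand-in for whatever lemmatization was actually applied; the token list is invented for illustration.

```python
import math

def guiraud_index(tokens, lemma_of=None):
    """Number of types divided by the square root of the number of tokens.

    `lemma_of` maps inflected forms to a base form so that they count as
    one type; forms missing from the mapping are their own type.
    """
    lemma_of = lemma_of or {}
    types = {lemma_of.get(t.lower(), t.lower()) for t in tokens}
    return len(types) / math.sqrt(len(tokens))

# "go" and "went" share one type; "danger" and "dangerous" stay separate.
tokens = ["go", "went", "danger", "dangerous"]
print(guiraud_index(tokens, {"went": "go"}))  # 1.5
```

Here three types over four tokens give 3 / √4 = 1.5.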

     Lexical density was calculated as the ratio of lexical words in the total number of words. Following O’Loughlin (2001) and Koizumi (2005), words were classified into grammatical and lexical words. Given that the lexical density index has been criticized for ignoring the relative significance of words of different frequency, this study also adopted weighted lexical density that takes the frequency of words into account. To calculate weighted lexical density, lexical words were further divided into high-frequency and low-frequency words. High-frequency words were defined as the most frequent 2000 words¹³ in the JACET list of 8000 Basic Words (JACET Basic Revision Committee, 2003)¹⁴. High-frequency lexical words were given half the weight of low-frequency words, and the number of weighted lexical words as a percentage of the total number of words was then calculated.
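     The two density measures can be written out as a small Python sketch. The function names and the counts in the example are hypothetical; both functions return proportions, which can be multiplied by 100 to express percentages.

```python
def lexical_density(n_lexical, n_words):
    """Proportion of lexical words in the total number of words."""
    return n_lexical / n_words

def weighted_lexical_density(n_high_freq_lexical, n_low_freq_lexical, n_words):
    """High-frequency lexical words get half the weight of low-frequency ones."""
    return (n_high_freq_lexical / 2 + n_low_freq_lexical) / n_words

# Hypothetical sample: 40 words, of which 10 are high-frequency and
# 5 are low-frequency lexical words.
print(lexical_density(15, 40))               # 0.375
print(weighted_lexical_density(10, 5, 40))   # 0.25
```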

     Two coders coded the data. The author coded all the data; the second coder, a Ph.D. student specializing in language testing at TUFS, coded a randomly selected sample of approximately 10% of the data. Inter-coder reliability was calculated by the percentage of agreement between the coders, which is reported in Chapter Four.

¹³ The criterion of beyond 2000 words was used because high school students in Japan are supposed to learn about 2000 words (Ministry of Education, Science & Culture, 1989).

¹⁴ The JACET list of 8000 Basic Words uses a lemma count. Operationally, a lexical frequency profile analysis was conducted using the computer program “JACET 8000 Analysis Program” (Shimizu, 2004), which resulted in a distinction between the words belonging to the 2000 most frequent words and those with a lower frequency.


Table 3.4 Measures of fluency, accuracy, and complexity used in the study

Measure                                Definition

Fluency
  No. of words                         The number of words / the total amount of speech (in seconds)
  No. of unfilled pauses               The number of unfilled pauses / the total amount of speech (in seconds)
  No. of filled pauses                 The number of filled pauses / the total amount of speech (in seconds)
  No. of dysfluent words               The number of dysfluent words / the total amount of speech (in seconds)
  No. of repetition words              The number of repetition words / the total amount of speech (in seconds)
  No. of self-correction words         The number of self-correction words / the total amount of speech (in seconds)
  No. of false start words             The number of false start words / the total amount of speech (in seconds)

Accuracy
  Percentage of error-free clauses     The number of error-free clauses / the total number of clauses
  Percentage of error-free AS-units    The number of error-free AS-units / the total number of AS-units

Complexity
 Syntactic complexity
  Percentage of clauses                The number of clauses / the total number of AS-units
  Percentage of subordinate clauses    The number of subordinate clauses / the total number of AS-units
  No. of words                         The number of words / the total number of AS-units
 Lexical complexity
  Guiraud’s Index                      The number of types / √(the number of tokens)
  Lexical density                      The number of lexical words / the total number of words
  Weighted lexical density             (The number of high-frequency lexical words / 2 + the number of low-frequency lexical words) / the total number of words


3.4  Data analysis

3.4.1 Comparing the magnitude of raw scores

     In order to evaluate the agreement of the ratings assigned by the two randomly-selected raters, two types of indexes were employed: (1) Pearson product moment correlation coefficients for inter-rater reliability and (2) perfect agreement and adjacent agreement between the ratings of the two independent raters.
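     The two kinds of index can be illustrated with a minimal Python sketch. The ratings are invented, and adjacent agreement is taken here to mean ratings that differ by exactly one scale point, which is an assumption since the chapter does not define it.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two rating lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def agreement_rates(x, y):
    """Return (perfect, adjacent) agreement as proportions of all cases."""
    n = len(x)
    perfect = sum(a == b for a, b in zip(x, y)) / n
    adjacent = sum(abs(a - b) == 1 for a, b in zip(x, y)) / n
    return perfect, adjacent

# Hypothetical 0-4 ratings from two raters for five responses.
rater1 = [0, 1, 2, 3, 4]
rater2 = [0, 2, 2, 3, 3]
print(round(pearson_r(rater1, rater2), 3))
print(agreement_rates(rater1, rater2))  # (0.6, 0.4)
```

Reporting both kinds of index is useful because a high correlation can coexist with systematic differences in severity that only the agreement rates reveal.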

     Before proceeding to analysis of raw scores, the order effect on test scores was checked. As described earlier, participants were assigned at random to two testing conditions: (a) computer mode first and face-to-face mode second or (b) face-to-face mode first and computer mode second. With a counterbalanced design like this, a concern is that taking the test in one mode followed by the other could affect the score for the second mode. Consequently, descriptive statistics of the scores for each rating element were first examined for the four mode-by-order conditions separately to detect any possible order effect.

     To assess the effect of delivery mode and the mode-by-order interaction statistically, a series of two-way repeated measures analyses of variance (ANOVAs), with one within-subject factor (mode) and one between-subject factor (order), was performed on element scores and total scores. If there was no mode-by-order interaction effect, the results of the main effect of delivery mode would be interpreted directly. If such interactions were found to be statistically significant, it would imply that mode effect was not consistent across the two groups assigned to the different testing conditions. It would then be invalid to use the combined data from the two groups. In that case, following Choi, Kim, and Boo (2003), independent t-tests using scores from the first test administered would be conducted to compare the magnitude of test scores. For the purpose of reference, the results of independent t-tests based on scores from the second test would also be presented. SPSS 11.0 for Windows was employed for all the statistical analyses in this study.

3.4.2 Comparing psychometric constructs

     The data analyses in this section consisted of data transformation as a measure

against order effect, a preliminary analysis of descriptive statistics, and a series of

exploratory factor analyses.

     If an order effect was not found through the analysis in the previous section, scores for each of the four elements for each of the two tasks, which resulted in eight variables for each mode, were used in the subsequent analysis. However, even when an order effect was observed, it was decided not to take the measure of using only the data from the first test. The reason was that this would lead to a decrease of data and violation of the assumption of exploratory factor analysis for sample size¹⁵. Instead, after consulting with a statistical expert, it was decided to transform the original data by subtracting them from the mean scores of each variable for the group that took the tests in different orders.
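     One reading of this transformation is a centering of each variable within its order group, sketched below in Python. The function name, the group labels, and the scores are hypothetical. The sketch returns the score minus the group mean; the opposite sign (mean minus score) would leave the correlations among variables, and hence a factor analysis based on them, unchanged.

```python
def center_within_group(scores, groups):
    """Return each score as a deviation from the mean of its order group."""
    totals, counts = {}, {}
    for s, g in zip(scores, groups):
        totals[g] = totals.get(g, 0.0) + s
        counts[g] = counts.get(g, 0) + 1
    means = {g: totals[g] / counts[g] for g in totals}
    return [s - means[g] for s, g in zip(scores, groups)]

# Hypothetical scores on one variable for four examinees in two groups.
scores = [1.0, 3.0, 2.0, 6.0]
groups = ["A", "A", "B", "B"]
print(center_within_group(scores, groups))  # [-1.0, 1.0, -2.0, 2.0]
```

After centering, both groups have a mean of zero on the variable, so a difference between the two orders no longer contributes to the pooled correlations.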

     Descriptive statistics were calculated for the total of 16 variables to examine the assumptions of exploratory factor analysis. The univariate normality of the data was assessed through skewness and kurtosis of the variables, which should be within a range of -2 to 2. The Pearson product-moment correlation matrix was computed to assess the multicollinearity of the data. Following Field (2005), a correlation of above .90 between two variables was considered to be a threat to the singularity of the data. In that case, one of the variables would be eliminated from further analysis. Univariate outliers were examined, which were defined as cases with values more than three standard deviations from the mean of a given variable.

¹⁵ Many criteria have been proposed to determine the sample size for exploratory factor analysis. Generally, it is considered to be proper to have the number of subjects at least 5 times the number of variables (Field, 2005).
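     The outlier criterion can be illustrated with a brief Python sketch. The use of the sample standard deviation (n - 1 in the denominator) is an assumption, and the scores are invented for illustration.

```python
import statistics

def univariate_outliers(values, threshold=3.0):
    """Return indices of cases more than `threshold` SDs from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (assumed)
    return [i for i, v in enumerate(values) if abs(v - mean) > threshold * sd]

# Hypothetical scores on one variable: twenty typical cases and one extreme.
values = [2.0] * 10 + [3.0] * 10 + [9.0]
print(univariate_outliers(values))  # [20]
```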

     In order to investigate whether there were components that were shared in common as measured by monologic tasks delivered in the computer and the face-to-face modes, principal factor analysis¹⁶ using varimax rotations as a factor analytic method was performed on the data for the two modes respectively. The number of factors, which was determined based on the sizes of eigenvalues and the scree test (Tabachnick & Fidell, 2001), was compared across modes. The factor loading for each variable was also checked and compared across modes. Another principal factor analysis was run on the combined data from the two modes to confirm the comparability of the factor structure from the above analyses.

3.4.3 Comparing speech samples

     To address Research Question 3, the coded responses of examinees in the computer mode and the face-to-face mode were compared in terms of measures of fluency, accuracy, and complexity to examine the effect of computer delivery mode on speech samples. The purpose of Research Question 4 was to examine a possible interaction effect of examinees’ proficiency on their speech samples across delivery modes. To this end, participants were first categorized into three groups based on their total scores in the computer-delivered tasks. A one-way ANOVA was then performed to ensure that the scores were statistically different across the three groups.

¹⁶ Principal factor analysis (PFA) is also called common factor analysis or principal axis factoring. This extraction method was selected over principal component analysis since it reflects only the common variance of variables, not including the unique variance.

     A series of two-way repeated measures ANOVAs, with delivery mode (computer vs. face-to-face) as a within-subject factor and examinees’ proficiency level (low vs. mid vs. high proficiency) as a between-subject factor, was conducted. The ANOVA results were expected to determine whether there were significant differences in measures of fluency, accuracy, and complexity across test modes and whether there was any interaction effect between delivery mode and proficiency level.

3.4.4 Comparing examinee attitudes

     Means and the standard deviations for Likert-scale statements in Questionnaire 1 were calculated for both tests. A mean score greater than 3 was considered to show a positive trend and one less than 3 a negative trend. A repeated measures t-test was conducted to compare examinees’ responses in each test statistically. For Questionnaire 2, the chi-square test was used to compare the observed values across groups who chose each option to the statistically expected values. In addition, the examinees’ responses to the open-ended questions were categorized through content analysis and were tallied.
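     The test statistics behind these two comparisons can be sketched in Python as follows. All responses and counts are invented, the function names are hypothetical, and equal expected frequencies across the three options are an assumption; p-values are omitted because they require a t or chi-square distribution table or a statistics package.

```python
import math

def paired_t(x, y):
    """Repeated measures (paired) t statistic for two sets of responses."""
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    mean_d = sum(d) / n
    sd_d = math.sqrt(sum((v - mean_d) ** 2 for v in d) / (n - 1))
    return mean_d / (sd_d / math.sqrt(n))

def chi_square_statistic(observed, expected=None):
    """Chi-square statistic; expected counts default to equal frequencies."""
    n = sum(observed)
    if expected is None:
        expected = [n / len(observed)] * len(observed)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical Likert responses from four examinees to one statement.
computer = [4, 5, 3, 4]
face_to_face = [3, 3, 2, 4]
print(round(paired_t(computer, face_to_face), 3))

# Hypothetical counts for "Computer", "Both the same", "Face-to-face".
print(chi_square_statistic([30, 30, 36]))  # 0.75
```

For three response options the chi-square statistic has two degrees of freedom, and the paired t statistic has n - 1 degrees of freedom.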
