[Siepmann Dirk] Discourse Markers Across Languages


Cover

title: Discourse Markers Across Languages: A Contrastive Study of Second-level Discourse Markers in Native and Non-native Text With Implications for General and Pedagogic Lexicography (Routledge Advances in Corpus Linguistics; 6)
author: Siepmann, Dirk
publisher: Taylor & Francis Routledge
isbn10 | asin: 0415349494
print isbn13: 9780415349499
ebook isbn13: 9780203315262
language: English
subject: Discourse markers, Contrastive linguistics, Lexicography, Language and languages--Study and teaching.
publication date: 2005
lcc: P302.35.S54 2005eb
ddc: 401/.41


Discourse Markers Across Languages

This book deals with ready-made phrases, or second-level discourse markers, such as "it is argued that" or "the same goes for". Specifically, the book answers questions such as "how can such phrases be defined or translated?" or "how can they be recorded in dictionaries?"

The book falls into two parts. Part I presents a functional taxonomy of second-level markers in English, French and German as well as an analysis of their use in continuous text. Part II offers a contrastive interlanguage analysis of the performance of non-native writers and translators.

The book is essential reading for professional linguists or lexicographers with an interest in collocation and phraseology, as well as for academics, translators and language teachers seeking to produce well-crafted text in a foreign language.

Dirk Siepmann is Lecturer in English at Siegen University, Germany.


Routledge advances in corpus linguistics

Edited by Anthony McEnery, Lancaster University, UK, and Michael Hoey, Liverpool University, UK.

Corpus-based linguistics is a dynamic area of linguistic research. The series aims to reflect the diversity of approaches to the subject, and thus to provide a forum for debate and detailed discussion of the various ways of building, exploiting and theorizing about the use of corpora in language studies.

1. Swearing in English
   Anthony McEnery
2. Antonymy
   A corpus-based perspective
   Steven Jones
3. Modelling Variation in Spoken and Written English
   David Y.W. Lee
4. The Linguistics of Political Argument
   The spin-doctor and the wolf-pack at the White House
   Alan Partington
5. Corpus Stylistics
   Speech, writing and thought presentation in a corpus of English writing
   Elena Semino and Mick Short
6. Discourse Markers Across Languages
   A contrastive study of second-level discourse markers in native and non-native text with implications for general and pedagogic lexicography
   Dirk Siepmann


Discourse Markers Across Languages

A contrastive study of second-level discourse markers in native and non-native text with implications for general and pedagogic lexicography

Dirk Siepmann

LONDON AND NEW YORK


First published 2005 by Routledge
2 Park Square, Milton Park, Abingdon, Oxon OX14 4RN

Simultaneously published in the USA and Canada by Routledge
270 Madison Ave, New York, NY 10016

Routledge is an imprint of the Taylor & Francis Group

This edition published in the Taylor & Francis e-Library, 2005.

To purchase your own copy of this or any of Taylor & Francis or Routledge's collection of thousands of eBooks please go to www.eBookstore.tandf.co.uk.

© 2005 Dirk Siepmann

All rights reserved. No part of this book may be reprinted or reproduced or utilized in any form or by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying and recording, or in any information storage or retrieval system, without permission in writing from the publishers.

British Library Cataloguing in Publication Data
A catalogue record for this book is available from the British Library

Library of Congress Cataloging in Publication Data
A catalog record for this book has been requested

ISBN 0-203-31526-X Master e-book ISBN
ISBN 0-415-34949-4 (Print Edition)


Unter Südländern ist die Sprache ein Ingredienz der Lebensfreude, dem man weit lebhaftere gesellschaftliche Schätzung entgegenbringt, als der Norden sie kennt. Es sind vorbildliche Ehren, in denen das nationale Bindemittel der Muttersprache bei diesen Völkern steht, und etwas heiter Vorbildliches hat die genußreiche Ehrfurcht, mit der man ihre Formen und Lautgesetze betreut. Man spricht mit Vergnügen, man hört mit Vergnügen, und man hört mit Urteil.*
Thomas Mann, Mario und der Zauberer

Mache die Dinge so einfach wie möglich, aber nicht einfacher.
Albert Einstein

Il faut ramener la linguistique vers le lexique où la complexité des langues parvient à son plus haut degré de force et d'épanouissement.
Harald Weinrich, Le Français dans le Monde (303)

* Among southern peoples language is an ingredient of life's joys which is held in much livelier social esteem than in the north. The honours paid by these nations to that national binder, the mother tongue, are exemplary, and there is something joyfully exemplary about the appreciation and awe with which they treat its forms and sounds. One speaks with pleasure, one listens with pleasure, and one listens with discernment. (my translation)

Everything should be made as simple as possible, but not simpler. (my translation)

Linguistics has to be steered back towards the area of lexis, where language attains its highest degree of expressiveness and complexity. (my translation)


Contents

Preface
What this book is about
Acknowledgements

PART I Linguistic considerations

1 Observing languages: introduction to Part I
  1.1 Aims, scope and methodology
  1.2 Corpora and corpus-enquiry tools
2 Investigating routines: defining and describing multi-word discourse markers
  2.1 Pragmatic perspectives on discourse markers
  2.2 Lexicological perspectives on multi-word discourse markers
  2.3 Syntactic realizations of SLDMs
3 Identifying meanings and functions: an attempt at a functional taxonomy of SLDMs
  3.1 Introduction
  3.2 Language functions and textual relations
  3.3 A taxonomy of SLDMs
  3.4 Points of interest
4 Straddling cultures: three types of second-level discourse markers in contrastive perspective
  4.1 Exemplifiers
  4.2 Reformulators and resumers
  4.3 Inferrers
  4.4 Summary and conclusion

PART II A contrastive interlanguage analysis with implications for dictionary making

1 Introduction
2 Facing realities: the performance of non-native writers and translators
  2.1 Interlanguage analysis
  2.2 German writers' performance in the field of discourse markers
  2.3 Translations under the spotlight
  2.4 Conclusion
3 Lexicographic treatment of SLDMs
  3.1 Lexicographic coverage of SLDMs
  3.2 Macrostructural and microstructural treatment of SLDMs
  3.3 Sample entries
4 Avenues for further research

Notes
Bibliography
Index


Preface

The present work stands at the interface of several converging developments in linguistics and language teaching. Most importantly, perhaps, there has been in recent decades a dramatic increase in the amount of scholarship on text and discourse. Indeed, despite a time-honoured concern with both written and spoken texts in classical rhetoric, systematic discourse analysis did not really get off the ground until the 1970s, with the work of such linguists as Quirk et al. (1972) and van Dijk (1972). The overarching concern in such work has been the empirical investigation of the structure and functions of naturally occurring text rather than the atomistic study of sentence-level syntax inspired by Chomsky.

Such paradigm change has not been without influence on the study of language for specific purposes (LSP), a field in which the greatest research effort has probably been expended on academic writing. Here too there has been a move from the microlinguistic analysis of syntax, terminology and word formation prevalent until well into the 1970s (Drozd and Seibicke 1973; Kocourek 1982) towards the study of specialist text (Gläser 1979; Hoffmann 1983; Baumann 1986). A natural consequence of this has been the establishment of contrastive text linguistics as a discipline in its own right; numerous works have been published comparing specific languages at the textual level (cf. for example Newsham 1977 for the language pair English/French; Hinds 1983 for English/Japanese; Kachru 1983 for English/Hindi; Clyne 1987 for English/German; Blumenthal 1997 for French/German; Hatim 1997 for English/Arabic).

Another rapidly growing line of research is the study of phraseology. Founded by Bally (1909), this branch of linguistics suffered comparative neglect in Western Europe until fairly recently. It was the great merit of Russian linguistics to intensify phraseological research from the 1940s onwards, establishing precise criteria for the description of various types of conventionalized expressions. With the large-scale shift in the West from idealist Chomskyan linguistics to renewed empirical research, the phraseology of languages such as English, German and French became in its turn the subject of numerous research ventures (e.g. Cowie 1975; Burger 1973; Feilke 1996; Moon 1998; Bresson 1998). Among the results of such research was the discovery that phraseological units occur in far greater numbers than previously thought, leading researchers such as Sinclair to posit the "idiom principle", whereby "a language user has available to him or her a large number of semi-preconstructed phrases that constitute single choices, even though they might appear to be analysable into segments" (Sinclair 1991:110). Stemming from this, there has been growing recognition in language-teaching circles that multi-word units should be accorded more detailed attention in the foreign language classroom (Nattinger and DeCarrico 1992; Lewis 1993; Cowie 1998), and this view has been confirmed by recent research showing such units to be either under- or over-represented in the linguistic output of L2 learners (Granger 1998a; DeCock 1998; Milton 1999).

Last but by no means least, all of the above strands of research have received further impetus from computer corpus linguistics. With unprecedented riches of authentic data at their fingertips, linguists are now in a position to go beyond intuition and pen-and-paper analysis. More empirical and less speculative, their research finally bears comparison with that of "hard-pure" sciences such as physics and chemistry. It is beginning to open up exciting new perspectives on language.

This study is an attempt to weave together the aforementioned strands of research with a view to offering a broad picture of one sub-section of the native writer's phrasicon: multi-word, or, more technically, second-level discourse markers. As their name indicates, multi-word discourse markers are recurrent word combinations serving pragmatic and/or discourse-structuring functions. Adequate use of such items is pivotal in making academic and other prose texts comprehensible and effective; however, many linguists, teachers and writers have so far been unaware of their very existence. It thus appears that the description of second-level markers is one of the most compelling tasks for the applied linguist, whose aim it is to provide a sound basis for language teaching and expert communication across cultural barriers.

In such an endeavour there is much to be said for the adoption of a multilingual approach. Cross-linguistic research has shown that the ways sentences connect may differ from one language to another, often making the rendition of linking words less straightforward than dictionary equivalence might suggest. This fits in with evidence from contrastive rhetoric which suggests that L2 writers are prone to reproduce L1 patterns of text organization. In a world where translation and L2 writing skills are at a premium, it therefore makes both lexicographic and pedagogic sense to investigate differences and similarities between markers across languages.

This brings us to the question of language pedagogy. With the return of the language awareness movement (Moore 1995), it is now fairly generally realized that the communicative approach to language teaching needs to be enhanced through discovery-based and awareness-raising practices grounded in constructivist learning theory (Rüschoff and Wolff 1999; Wolff 2000). It is towards this conception of language learning that the present study wishes to make a useful contribution. In particular, my hope is that learner lexicography, composition teaching and translator training will incorporate some of the findings of the present research, thereby sensitizing language professionals to a discourse-centred approach to language teaching. Secondary-level foreign language teachers in particular continue to see language as a ragbag of vocabulary governed by a fixed set of grammar rules. The more teaching materials incorporate evidence of multi-word units and information on their use in text, the easier it will be to instil in teachers a more holistic view of language integrating textual, situational and lexico-grammatical aspects.

To sum up, the present study tries to be innovative in at least three ways. To begin with, it is the first large-scale corpus-based contrastive study to look at three languages at a time. Second, it considers an almost entirely unexplored set of pragmatic markers functioning as structural and semantic units. Third, it closes gaps in dictionaries and textbooks, thus providing assistance for writers, translators and second-language teachers; the index will help the language practitioner locate specific items of interest. For obvious reasons, it is impossible within the narrow compass of a monograph to treat every question in meticulous detail; many avenues of research opened up here can be further explored. What the reader can expect to find, though, is a fairly comprehensive account of a few, carefully selected categories of multi-word marker.

Finally, a note on language: in German and North-American academic circles in particular, there is a deplorable tendency to equate pretentious language with superior thought content. I have been at pains to distance myself from this tendency, trying to avoid wherever possible the overelaboration and abstruseness of much pseudo-scientific jargon. Culturally, I hope to have fused British practicalism with German thoroughness. Stylistically, I hope to have balanced the elegance of the past with the easy familiarity of the present.

Dirk Siepmann
Herdecke, February 2004


What this book is about

This study falls into two parts. The core aim of Part I is to provide a functional taxonomy of multi-word or second-level discourse markers in three European languages (English, German and French) and a contrastive analysis of their use in continuous text across the three idioms. The author begins by defining the term "second-level discourse marker" and follows this up with an overview of possible syntactic realizations of such markers. He proceeds to investigate what functions they serve and to address the problems attendant upon the translation of the members of three functional categories. A wide variety of corpus sources is laid under contribution with a view to building up a comprehensive picture of second-level discourse marker use.

In Part II a critical analysis is made of a corpus comprising advanced English-language writing by German academics and students, as well as of a small number of published translations by professional translators. It is found that the use of second-level markers by these groups compares unfavourably with that of natives. Unnatural writing is shown to be the result of overt errors on the one hand, and of the unusual frequency of occurrence of particular items on the other hand. Lexicographic implications are then considered, which lead to suggestions for the lexicographic treatment of second-level discourse markers in general and pedagogic dictionaries.

Author's note

My interest in multi-word discourse markers continues. My website (www.dirk-siepmann.de/multiwordmarkers) contains sections of my thesis that I have not been able to include in this book, an executive summary, and journal articles which deal with other types of multi-word markers than those discussed here.


Acknowledgements

This book is based on my doctoral thesis, submitted at the University of Wuppertal (Germany) in 2003. It took four years of unceasing effort, alongside a full-time job, to bring that thesis to birth, but its actual gestation spanned a much longer period. Directly or indirectly, it profited from the benevolent influence of many. My first and probably greatest debt is to Petra Heise, my English teacher at Otto-Pankok-Schule, Mülheim an der Ruhr, who, although of German origin, used English with a naturalness and precision unequalled by any other German I have ever met. Her love of language was infectious, and prompted me to strive for mastery of the foreign languages of my choice. Among other things, I started drawing up vocabulary lists, some of which formed the basis of the present work.

Later in life, my quest for linguistic perfection received further impetus from kindred spirits, such as Professor Serge Gouazé of Valenciennes University and John D. Gallagher of Münster University, who initiated me into the art of translation and taught me the importance of good style. I also derived much benefit from a lecture given by Professor John Sinclair during my student days at Durham and a subsequent visit to COBUILD at the University of Birmingham. This sparked my interest in computer corpus linguistics, the practice of which has given me ever deeper insight into language.

My Ph.D. supervisors, Professor Dieter Wolff and Professor Peter Scherfer, deserve special thanks for their part in the making of this book. Dieter Wolff kindly allowed me to use the multilingual translation corpus partly compiled at Wuppertal University. He patiently read two drafts of the manuscript and offered valuable advice. I have found his friendly and supportive manner extremely helpful. Peter Scherfer, a theoretician in the best sense of that abused term, helped to give greater conceptual precision to some of my ideas.

I am also indebted to the Series Editors, Professor Tony McEnery and Professor Michael Hoey, as well as the editorial team at Routledge (Terry Clague, Joe Whiting and Yeliz Ali) for seeing this book through to publication.

Finally, I should like to thank my partner, Sabine, for her unflinching moral and practical support.


Part I
Linguistic considerations






1 Observing languages
Introduction to Part I

Begin with the end in mind.
Stephen R. Covey, The Seven Habits of Highly Effective People

Anyone reading academic treatises, reference books or quality newspapers with any regularity will be struck by the recurrence of a particular kind of multi-word unit. Typical examples are "it is argued that", "it is a fair guess that" or "the same goes for". Curiously enough, such firmly established phrases seem to have escaped scholarly notice until fairly recently. As a result, they have never been dealt with in any extensive manner, with the exception perhaps of Gallagher's (1992) ground-breaking article on the translation of the German discourse device "erschwerend kommt hinzu, dass", Grieve's (1996) dictionary of French connectors and four recent articles by Siepmann (2000, 2001a) and Oakey (2002a, 2002b). The first part of this study, then, may be seen as an attempt to redress a perceived imbalance between the poor stock of knowledge about the routine formulae in question and the somewhat larger body of research on related one-word connectives.

1.1 Aims, scope and methodology

The core aim of the present work is to provide a contrastive analysis of such multi-word units. The term "contrastive analysis" is here taken to refer to the processes involved in identifying, recording and describing lexical items which assume identical or similar functions in actual manifestations of English, French and German language use (for further detail, see Section 1.1.2.3 below). Such an analysis depends on two prerequisites: first, an inventory of the lexical class under consideration; second, a categorization, or taxonomy, of the members of this class. Since previous work on phraseology has severely neglected multi-word units of the type exemplified above, both the inventory and the taxonomy used in the present study had to be built up from scratch. There being no copyright on lexicographic material, the full-size inventory underlying this study will, however, be published in a separate work (Gallagher et al., in preparation). The taxonomy therefore confines itself to typical instances of each category, and the contrastive analysis is restricted to three major categories of multi-word units.

The account given here of this part of the phrasicon situates itself within the Firthian tradition of British text linguistics. Accordingly, it is essentially descriptive and data-driven rather than theoretical and introspective. The present chapter discusses the essential foundation of such an approach and the methodological choices entailed by it. Section 1.1 is concerned with general questions regarding the use of corpora and the setting up of interlingual equivalence, while Section 1.2 focuses on corpus compilation.

Chapter 2 sets out to define the object of study. I shall argue that it forms a separate sub-class of the lexicon whose members can be described as second-level markers. Unfortunately, there is as yet no agreed-upon conceptual framework for describing the items belonging to this class. Traditional grammars have tended to classify some of its members, such as "à cet effet" or "pour ce faire", under the ragbag heading of adverbials; other items, such as the sentence fragments exemplified above, have been altogether excluded from consideration. A satisfactory definition can, however, be arrived at by drawing on two distinct strands of current linguistic research: discourse analysis or text linguistics on the one hand, and lexicology on the other. Drawing on the discourse-analytical literature, I show that second-level markers bear close similarities to oral discourse markers. Drawing on the phraseological literature, I argue that they constitute a distinct type of phraseological unit. Having thus provided a functional-lexicological definition of second-level markers, I move on to outline the entire gamut of syntactic forms they can take.

The sheer size of the lexical class under investigation forbids a detailed contrastive analysis for all the members of the class. Chapter 3 therefore sets up a broad taxonomy predicated on the general functions of second-level markers.

After these methodological and theoretical preliminaries, the way is open for a finely honed contrastive analysis of three functional categories of second-level markers: exemplifiers, reformulators and inferrers. This is the subject of Chapter 4. In a first step, I give a detailed description of the behaviour of particular items, including their frequency. On this basis, I then consider to what extent interlingual equivalence obtains between items of the functional type under investigation. With the most complex categories it seems wise to follow a stepwise approach proceeding from monolingual analysis to multilingual comparison, thus allowing the reader to form a just understanding of the rationale behind particular equivalences. Where equivalence relations turn out to be more straightforward, discussion will move directly to the multilingual comparison.

On the basis of these investigations, Part II looks at

• how successfully non-native writers and translators cope with multi-word markers;
• what implications research into multi-word markers has for general and learner lexicography.

1.1.1 Theoretical foundation

The present research locates itself squarely within the British tradition of text analysis established by Firth (1957). This empiricist tradition has a number of features which distinguish it from theory-driven approaches to the study of language. To begin with, the Firthian tradition, the most prominent present-day adherents of which are Halliday, Sinclair, Stubbs, Francis and Hunston, views linguistics as an applied social science (Stubbs 1993:3). On the theoretical side, it foregrounds social interaction as the main determinant of linguistic form. This is in keeping with state-of-the-art discussions of language science such as Feilke (1996), who convincingly argues that whatever disposition for language a child may have, it is a silent disposition which only develops into a functioning language through communicative interaction and the institutional practices such interaction puts in place (Feilke 1996:32). On the practical side, a dominant current in the British tradition aims at direct relevance to both the L1 (Coulthard and Sinclair 1975) and the L2 language classroom (cf. the long list of COBUILD publications).

Second, the Firthian tradition takes actual occurrences of language as the object of study. Introspective data are considered to be invalid as a primary source of evidence. In this sense, Firth adopts a different approach from Chomsky, who relies almost entirely on invented, sentence-level examples. In the Firthian tradition the data comes first, and the theory is built up from the data.

Closely linked with these fundamental features of the Firthian approach is a third one: linguistics is regarded as the study of the meaning of linguistic units. It thus views language from a different angle than Chomskyan linguistics, which stresses the autonomy of particular components of the language system, most importantly syntax (Chomsky 1957:17). Corpus linguists working in the Firthian tradition have ingeniously suggested that much if not all syntax is lexis-driven, thereby at least complementing the Chomskyan view:

syntax is driven by lexis: lexis is communicatively prior. As communicators we do not proceed by selecting syntactic structures and independently choosing lexis to slot into them. Instead, we have concepts to convey and communicative choices to make which require central lexical items, and these choices find themselves syntactic structures in which they can be said comfortably and grammatically.
(Francis 1993:142)



Page 6The lexico-grammars (Francis et al. 1996, 1998) produced under the supervision of Sinclair provide ample proof ofthe close association between sense and syntax posited by Francis, specifying, as they do, the behaviour of themajor lexical classes (verbs, nouns and adjectives) in terms of their meaning and syntactic preferences. The resultsshow clearly that particular semantic sets are linked with particular syntactic choices and that, conversely, particularsyntactic patterns cluster around particular semantic sets.By the same token, linguists working in the Firthian tradition see meaning as at least partially determined by typicalcombinations of lexical choices or collocability on the one hand, and typical combinations of grammatical choices orcolligation on the other (Hunston 2001). That is, words obtain at least part of their meaning from the contexts inwhich they typically occur, and, more specifically, from collocates with which they typically team up. In this sensethe Firthian tradition is far removed both from componential analysis and from semantic field theory as framed byTrier (1931), where the meaning of a word arises from its position in an abstractly conceived paradigmatic field. Acrucial aspect of an items meaning is its semantic prosody, a term which reflects the realization that lexical itemsbecome infused with particular connotations due to their typical linguistic environment (Sinclair 1991; Louw 1993;Stubbs 1995). Thus, Sinclair (1991:7375) demonstrates that the phrasal verb set in carries unfavourableconnotations because it co-occurs significantly with words denoting undesirable events or processes such as decay,bad weather or disillusionment. 
Similarly, the French multi-word unit NP na(ura) qu bien se tenir, which isunaccountably absent from the major dictionaries, is imbued with a semantic prosody of rivalry, with the referent ofthe noun phrase being in competition with another entity mentioned in the surrounding discourse:Lessence na qu bien se tenir, la lutte pour le titre de carburant le moins nocif sest resserre.(LHumanit, 11.8.2001)This example conveniently brings us to a fourth major characteristic of the Firthian tradition: it recognizes thepervasiveness of what Sinclair (1991) has called the idiom principle, whereby a language user has available to himor her a large number of semi-preconstructed phrases that constitute single choices, even though they might appearto be analysable into segments (Sinclair 1991:110). Corpus work in the Firthian tradition has shown that between 50and 80 per cent of all text is made up of habitual word associations (Gross 1988; Stubbs 1997; Altenberg 1998).Most importantly for this study, around 20 per cent of all academic text has been shown to consist of lexical bundlesof the type by the fact that, similar to that of the, presence of the, it is interesting to, size and shape or theory andprac-



Page 7tice (Biber et al. 1999:995ff.), and it is reasonable to put the amount of academic text taken up by various types ofcollocation, including multi-word markers, at a minimum of 30 per cent. From a Firthian perspective, then, thenegotiation of meaning between language users relies on conventionalized, multi-word sense units rather thanisolated words. It is interesting to note that this view of language was anticipated by Bally as early as 1909; heviewed phraseological items as conceptual units (unit de conception, Bally [1909] 1951:76):Ainsi, lunit lexicologique, telle quelle est donne par lcriture, le mot enfin, est une unit trompeuse et illusoiredans beaucoup de cas et ne correspond pas toujours aux units de pense, aux reprsentations, aux concepts, auxnotions de lespritles faits de langage ont un caractre beaucoup plus synthtique quon ne le pense et que ne lefait supposer lanalyse dite logique.(Bally [1909] 1951:3ff.)The present study espouses all the above-mentioned features of the Firthian tradition: underpinning it is an applied-linguistic approach to the study of language which has direct relevance to lexicography (see Part II) as well as theteaching of composition and translation. Any theorization arises directly from the textual evidence produced frommanual or computer-driven analysis. Among other things, this book will provide compelling evidence for an evenstronger version of the idiom principle than that put forward by Sinclair (1991). It will be shown that collocationalpatterns are formed not only by two words belonging to different lexical classes but also by multi-word units; suchmulti-word units will be seen to form lexical dependencies over considerable spans of text (see Chapter 3).1.1.2 MethodologyThis study, unlike the bulk of recent corpus-based scholarship, takes a multi-method approach. 
For one thing, it combines traditional, manual text analysis with computer-based corpus linguistics, thereby bringing together qualitative and quantitative perspectives on the object of study. For another, it blends the contrastivist's approach with the translatologist's.

1.1.2.1 Corpus vs. intuition

There is growing recognition that language study can benefit from a combination of manual and computer-based analysis. As Kennedy points out:

The use of a corpus as a source of evidence however is not necessarily incompatible with any linguistic theory, and progress in the language sciences as a whole is likely to benefit from a judicious use of evidence from various sources: texts, introspection, elicitation or other types of experimentation as appropriate.
(Kennedy 1998:8; my emphasis)

The evidence is abundant that it would be methodologically misguided to let present-day computers, or rather software, perform corpus searches on their own, and some authors wisely stress caution. Siepmann (1998) points up the dangers inherent in the wholesale compilation of dictionaries by computer. A stark result of such an approach is seen in The Cobuild Collocations CD-ROM (Sinclair 1995): it includes compounds (disaster relief, schools inspector), which can be found more readily in a general bilingual or monolingual dictionary, free combinations¹ (thus new is given as a collocate of gallery, or such+disaster) and, most inappropriately, such manifest absurdities as nature+because, religious+between, or advances+heavy, all of which reveal a regrettable lack of human interference and might lead learners of English badly astray.

As such examples show, over-dependence on computational methods may lead to a distorted picture of collocational significance. Thus, a restricted collocation such as note that, which may be observed with some frequency in an academic corpus, may be totally absent from a heavily newspaper-based corpus such as the Bank of English. In other words, what is statistically significant in one corpus may be insignificant in another, a point which needs to be kept in mind by any corpus linguist.

This leads to the question of whether any corpus can approach the collective linguistic experience of a language community (Howarth 1996:72). Clearly, the answer still has to be in the negative at the moment of writing, especially since most of today's major corpora are narrowly synchronic, comprising only the last ten years or so.
Yet in future very large corpora may well be built which will reflect the knowledge and experience of language accumulated over several generations. Everything stands or falls by corpus size, so that it would obviously be wrong at the present time to infer the non-existence of a word or phrase from its absence from a corpus:

Sinclair sets the data from a finite set of texts against the intuition of an individual and dismisses the latter. However, to have absolute faith in either is limiting. The most productive course is to begin with no pre-determined philosophy and to employ data from both sources co-operatively.
(Howarth 1996:73)

This statement needs to be set against evidence from a substantial literature (e.g. Sinclair 1991; Francis 1993; Siepmann 1999) that both intuition and manual analysis are prone to error and imprecision when it comes to describing the realities of language use. Many pre-electronic descriptions testify to this, either in making gross misjudgements about the behaviour of a language item (examples relevant to this study are Nattinger and DeCarrico 1992 and Grieve 1996; see Chapters 2 and 4) or in presenting examples which have the unmistakable ring of artificiality. It is clear, then, that Howarth's co-operative use of data from both sources should be reinterpreted as use in temporal succession: hypotheses generated from introspection must always be tested against corpus evidence, and corpus evidence may never be altered to fit intuitive beliefs.

A further limitation of computer corpus investigation stems from the inability of current retrieval software to extract complex, variable sequences of words (cf. for example Moon 1998:51ff.). In an ideal world, the computer would identify all the instances, including the permutations, of a particular lexical pattern (e.g. it is to be noted, it must be noted, it will be noted, it is notable, it is noticeable, it is worth noting, etc.). This would have the added advantage that categories worth investigating could be determined empirically rather than derived from intuition (cf. McEnery and Wilson 1996:86, footnote 1). The reality, however, is that automatic routines are not yet sophisticated enough to process data in terms of semantic function, making it particularly difficult to detect equivalences among transformationally unpredictable marker words and phrases of the kind under analysis here.
It follows that to rely on automatic routines would be to commit a serious fallacy: that of allowing linguistic theory to be ruled by computer technology.

1.1.2.2 Sampling and categorization

The rationale for a combination of qualitative and quantitative methods is simple enough: to be able to study multi-word discourse markers in a machine-readable corpus one needs first to compile an ocular scan-based inventory, or look-up list, of such items and then to categorize its content (cf. Schmied 1987). Next a computer corpus can be tapped to check the categories thus developed against a larger amount of authentic data.² When conducting such qualitative research, the analyst has to extend the corpus enquiry beyond the narrow span of the usual KWIC concordances to take account of longer pieces of text. This kind of investigation may then provide feedback which will necessitate a rethinking of categories,³ additions to the inventory, and so on, in an iterative cycle.

Once a categorized list has been drawn up, the investigation can proceed on a quantitative basis, enabling the analyst to assign frequencies to various tokens of discourse markers and to separate regular and typical associations from marginal occurrences. It is then possible to approximate to citation forms of discourse markers to be used in dictionaries and teaching materials such as those discussed in Part II of this study.
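
By way of illustration only, the cycle just described (a manually compiled look-up list, a corpus check with a generous context span, then frequency counts) might be sketched in Python as follows. The look-up list, the regular expressions, the sample text and the function names are all hypothetical; they are assumptions made for the purposes of the sketch and do not reproduce the inventory or the software actually used in this study.

```python
import re
from collections import Counter

# Hypothetical look-up list: each citation form is paired with a regular
# expression covering a few of its variants (illustrative, not exhaustive).
MARKER_PATTERNS = {
    "it is to be noted": r"\bit (?:is to|must|will|should) be noted\b",
    "it is worth noting": r"\bit is worth noting\b",
    "the same goes for": r"\bthe same (?:goes|holds|is true) for\b",
}

def count_markers(text, patterns=MARKER_PATTERNS):
    """Return a Counter mapping each citation form to its number of matches."""
    counts = Counter()
    for citation_form, pattern in patterns.items():
        counts[citation_form] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts

def concordance(text, pattern, span=60):
    """Return (left context, match, right context) triples.

    `span` is the number of characters of context kept on either side; it can
    be widened well beyond the usual KWIC line to inspect longer stretches.
    """
    hits = []
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - span):m.start()]
        right = text[m.end():m.end() + span]
        hits.append((left, m.group(0), right))
    return hits

# Invented sample text standing in for a machine-readable corpus.
SAMPLE = (
    "It must be noted that the sample is small. "
    "The same goes for the second corpus. "
    "It is worth noting that it will be noted elsewhere."
)

counts = count_markers(SAMPLE)
for citation_form, frequency in counts.most_common():
    print(citation_form, frequency)
```

A sketch of this kind matches surface strings only. It cannot tell whether a given match actually functions as a discourse marker, so the manual categorization remains indispensable, and any marker missing from the look-up list will simply go unnoticed.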



1.1.2.3 Contrastive linguistics vs. translation studies

Coming now to the distinction between contrastive and translation studies, we find that the area is fraught with terminological confusion (cf. Krzeszowski 1990:11; Wilss 1996:71ff.).⁴ The distinction is admittedly a rather fine one; contrastivity, it will be said, is what unites the two branches of language science under consideration. Yet, as Wilss (1996:71ff., esp. 81) and Gallagher (1993:150) point out, contrastive linguistics is generally regarded as operating at the level of abstract language systems, with the aim of segmenting and classifying linguistic data, whereas translatology is deemed to work at the level of actual language use. This reflects the recognition that linguistic units singled out as equivalent may not automatically correspond to translation units and that, conversely, text constituents of widely varying grammatical structure and length may take on similar functions in different languages.

There is some truth in the above distinction insofar as one wishes to uphold the traditional version of the Saussurean dichotomy between langue and parole, which is rapidly losing ground in the face of new evidence from corpus linguistics. Especially problematic is the structuralist assumption that lexis and syntax are neatly distinct and autonomous systems which do not impact on one another other than through the operation of general semantic rules and regularities. Lexico-grammars such as Francis et al. (1996, 1998) provide overwhelming evidence to the contrary: there is, in fact, a high degree of interdependence between communicative, lexical and syntactic choices or, more simply put, between sense and syntax. This clearly undermines the Saussurean dichotomy (cf. Sinclair 1991): in the absence of an independent syntactic core of language it becomes difficult, if not impossible, to specify the essence of an abstractly conceived language system; at best, we may assume a large number of heterogeneous lexico-textual subsystems or patterns⁵ or, as Sinclair (1991:105) has it, an integrated sense-structure complex. The only unifying feature would be the notion of pattern as such, as described in Hunston and Francis (2000) and Hunston (2001). Given the impossibility of generalizing across instances of language use to arrive at a unifying theory, the distinction between langue-centred contrastive linguistics and parole-centred translatology becomes blurred accordingly.⁶ The time is thus ripe for a paradigm shift which will considerably widen the scope of contrastive linguistics.

A new-paradigm contrastive linguistics drawing on computer corpora will enable us to compare lexico-textual subsystems across languages at a hitherto unimagined degree of delicacy. Many equivalences which have until now been assumed to be peculiar to one text, and therefore within the purview of translation studies rather than contrastive linguistics, will then turn out to be attributable to regular correspondences between lexico-textual subsystems or to deviations from such subsystems. This will be particularly true with pragmatic texts such as newspaper articles, treaties or manuals, and only slightly less so with literary texts (to the extent that novelists or poets defamiliarize language use). Table 1.1 contains a brief example from a textbook of translation (Lozes and Lozes 1994) which will illustrate what is meant.

Table 1.1 An excerpt from a textbook of translation

English original:
A silver lining to antiques fair in Dublin
As long as I can remember, the Irish Antiques Dealers' Fair, which opens next Monday in the Mansion House in Dublin, has been preceded by groans of despair from the antique trade. This year, the 27th year of the event is no exception.

French translation:
Eclaircie sur le Salon des Antiquaires de Dublin
Aussi loin que je me souvienne, le Salon des Antiquaires irlandais, qui ouvre ses portes lundi prochain à Mansion House à Dublin, est précédé de pleurs et de gémissements de la part des gens de la profession. Cette année, vingt-septième anniversaire de cette manifestation, ne fait pas exception.

Source: adapted from Lozes and Lozes 1994:42–43

For the inexperienced linguist, both the English original and the French translation in Table 1.1 may at first glance appear to contain a large number of creative, one-off occurrences. As a corpus-linguistic investigation shows, nothing could be further from the truth. Leaving aside the headline for the moment, we can see that the English text begins with a fixed expression (as long as I can remember), which can be rendered by a small number of equally fixed French equivalents (d'aussi loin que je me souvienne, aussi loin que je me souvienne). Here the relevant English and French lexico-textual subsystems resemble each other perfectly.

It is somewhat different with the lexico-textual subsystem comprising the subject and the verb of the relative clause. This pattern can be glossed as in Table 1.2.

Table 1.2 A lexico-textual subsystem

Event/public place (trade fair, museum, shop, ...) | Verb expressing start of event
The Irish Antique Dealers' Fair | opens

In Francis et al. (1996:8) this pattern is subsumed under a more general pattern termed the Begin and Stop Group. Other typical members of this group include the items presented in Table 1.3.

Table 1.3 Examples of a lexico-textual subsystem comprising verbs denoting beginning and ending

The talks | began
The negotiations | ended

It is fairly easy to locate the same lexico-textual subsystem in newspaper French; one then finds that the verb ouvrir is not normally used on its own in this pattern (see Table 1.4).

Table 1.4 A segment of the French lexico-textual subsystem noun (event)+verb (expressing start of event)

Event/public place (trade fair, museum, shop, ...) | Verb expressing start of event
le musée de l'Aventure Peugeot | ouvre ses portes
le Salon de l'agriculture | ouvrira ses portes

A look at the entire subsystem also reveals that an indirect object is often appended to the phrase ouvrir ses portes, a variant which Lozes and Lozes (1994) fail to mention. This indirect object commonly takes the form aux visiteurs or au public.

In their commentary Lozes and Lozes (1994:43) describe their rendition of open by ouvrir ses portes as an instance of étoffement (Vinay and Darbelnet 1958:9), or syntactic augmentation. It thus appears as if they have used a text-specific translation procedure which falls outside the scope of contrastive linguistics, especially since the target-language syntagm differs in structure and length from the source-language syntagm. However, as our corpus investigation has shown, the augmentation in question might equally well be regarded as a regular equivalence amenable to contrastive analysis. Similar analyses could be made for all the other translatorial choices evident in the above texts. This is because, as corpus linguists (Gross 1988; Stubbs 1997; Altenberg 1998) have demonstrated, up to 80 per cent of all text is made up of habitual word associations, while the remaining 20 per cent consists of language of regular composition or slight deviations from the collocational norm.

Another example from my own translation work may serve to illustrate how contrastive analysis can be operationalized; the source and target texts are excerpts from the website of a Mallorca-based German photographer (see Table 1.5).

Table 1.5 German original and English translation of a website

German original | English translation
Für Journalisten bieten wir einen umfassenden Service, der sich schon oft arbeitserleichtend bewährt hat: | We provide a comprehensive service for journalists, which has often contributed to lightening their workload:
Terminplanung vor der Anreise | time scheduling prior to arrival
Flughafenabholung | collection from airport
Separate Gästewohnung mit Terrasse | separate guest flat with roof garden
Übersetzungen | translation service
Internetzugang ADSL und LEONARDO PRO | Internet access ADSL and LEONARDO PRO
Fahrdienst, Inselscout | chauffeur service, island scout
und natürlich Fotografie. | and, of course, photography.
So ersparen Sie sich unnötige Zeitverluste wegen Ortsunkenntnis, Taxi, Mietwagen, Hotelbuchung, Übersetzer etc. | This will save you from wasting time unnecessarily trying to find your way around and looking for taxis, hire cars, hotels, translators, etc.

The final sentence of this excerpt is not easy to translate into idiomatic English, and the superficial difference in structure and length between the source and target versions might suggest that the English translation is merely a matter of intuition. It is of course true that, in practice, the translator's accumulated knowledge and experience or, to put it in cognitive-psychological terms, his procedural knowledge, will lead him to automatically transpose the noun Zeitverluste to a verb. Yet such a strategy can be operationalized in terms of lexico-textual subsystems. The translator has to convey the concept of Zeitverlust in neutral English style. The central lexical items available for this purpose are noun+verb collocations, notably waste time, lose time and squander time, rather than the highly formal compound noun time loss;⁷ of these noun+verb collocations, waste time is the most common. The lexico-syntactic subsystem containing the verb waste in this sense is described in Francis et al. (1996:289–290). There the translator learns that verbs concerned with passing time in a particular way typically enter the pattern verb+noun phrase+ -ing clause; he therefore has to construct his target sentence around this pattern, so that the prepositional phrase wegen Ortsunkenntnis, Taxi, Mietwagen, Hotelbuchung, Übersetzer has to be converted into an -ing clause and the compound nouns have to be translated by means of verbs.
The translator may now consult a corpus or the Internet for the construction under discussion, and will find the German meaning expressed as follows:

Plan the storage of your equipment so that you will not waste time unnecessarily in looking around for them.
Firms spend half their time dealing with lawyers
Your effects unit really saved me from lounging around and wasting time unnecessarily.
We should not spend our time worrying about the future
Don't spend too much time shopping
But before Mr Major and Mr Blair waste more time trying to double-guess them
waste management time dealing with such a challenge
it does not waste much time worrying about its pride being hurt
skilled reserves who can jump back in without losing time learning a routine

This leaves him with possible chunks such as

waste time (unnecessarily)/spend too much time/lose time (unnecessarily)
worrying about/dealing with/looking for/trying to

Note that such corpus-based analysis throws up a far greater variety of equivalences than intuition, precisely because it is based on a comparison of lexico-textual subsystems. The last step is to ferret out an English equivalent for the German collocation Zeitverlust+ersparen, such as save (s.o.) from wasting time, help (s.o.) avoid wasting time or stop (s.o.) [from] wasting time. Thus, we arrive at the variants shown in Table 1.6.

Table 1.6 Lexico-syntactic variants

This | will save you from | wasting time (unnecessarily) (in) | trying to find your way around and looking for
These services | will stop you (from) | losing time (unnecessarily) (in) | having to find your way around and worrying about
 | will help you (to) avoid | spending too much time (in) | finding your way around and dealing with (such matters as)

In a way such an analysis exemplifies the interplay between the open-choice principle and the idiom principle (Sinclair 1991). Each open choice of a particular variant entails specific idiomatic constraints on the surrounding discourse, the central open choices in the present example being the verbs stop/save and waste/spend. An alternative corpus search could start with the concept of problem avoidance, yielding less faithful but functionally equivalent translations such as this will save you the hassle/the trouble of finding your way around Mallorca, these services will save (you) hours of searching for, these services will save hours of research time for journalists, etc.⁸

It thus appears that, in cases where two texts are designed to assume the same, or closely similar functions in two cultures (Funktionskonstanz, or functional invariability, as skopos, see Reiß and Vermeer 1984), a corpus-based contrastive analysis can supply objective criteria for the discovery and assessment of any translation decision, thereby providing the basis for a fusion of contrastive linguistics and translatology. In this view translation solutions, rather than being one-off, parole-based occurrences, turn out to be instantiations of sense-structure complexes existing in more than one language; the translator's task is to identify the key semantic concepts contained in the text to be translated, to study target-language lexico-textual subsystems encoding these concepts and to build the target text around the patterns of colligation, collocation and text grammar found in these subsystems.
In the rare event, however, that the client commissions a translation whose function differs from that of the source text, contrastive linguistics and translation science must part company. The relevance of this latter type of translation situation has, however, been overstated by translation theorists (cf. Schmitt 1990), and this has led to a concurrent overstatement of the differences between contrastive and translation studies.

In this study, then, for want of a better expression, I use the term contrastive analysis to refer to the processes involved in identifying and recording multi-word units which assume identical or similar functions in actual manifestations of English, French and German language use. The investigation is primarily non-directional, thereby avoiding the pitfalls of comparing already translated texts in an area where discourse patterns have been shown to be culture-specific (cf. for example Kußmaul 1978; Clyne 1981, 1987; Gnutzmann and Lange 1990; Oldenburg 1992).⁹ Before considering the procedures of contrastive analysis in greater detail, we need to address the vexed question of equivalence.

1.1.2.4 Equivalence in contrastive analysis

How can we define and assess interlingual equivalence between discourse markers? The concept, fundamental to translation studies (Sager 1994:142) and to contrastive linguistics, has probably caused more ink to flow than any other translatological term, and various typologies of equivalence have been proposed: a distinction may, for example, be made between connotative and denotative equivalence or between cognitive and linguistic equivalence (see Koller 1992; Sager 1994).

Full equivalence between two or more languages is a rare event; it can be said to obtain only in the case of one-to-one correspondences between place-names such as English London and French Londres or between clearly defined monosemous words such as French téléphone and German Telefon. Generally speaking, however, equivalence denotes similarity rather than identity of source and target-language units. For evidence of this, we need look no further than the word level, where the fuzzy, context-bound nature of meaning leads to a potentially infinite number of renderings for any polysemous word. Thus, a French adjective like sauvage covers such a wide spectrum of meanings that English equivalents may range from shy at one end of the spectrum to cruel at the other, with an abundance of subtly differentiated, context-specific translations in between (cf. Hausmann 1995:21).

The foregoing implies that the equivalence of source and target-language multi-word markers can only be established on the basis of functional similarities in their contextual uses, and will usually be subject to certain constraints. For the purposes of the present study, therefore, equivalence can be defined as follows:

Equivalence is said to obtain between a source-language item A and a target-language item B if and only if A performs a function in source-language contexts which is identical with, or similar to, the function assumed by B in target-language contexts.

The basis of comparison, or tertium comparationis, is thus onomasiological or, more precisely, functional-onomasiological. Possible objections to a functional definition of equivalence can be easily dismissed. Hoey and Houghton (1998:47), for example, object to the use of semantic and/or functional equivalence as a tertium comparationis on the grounds that a pair of sentences might be semantically and/or pragmatically equivalent, but have widely differing likelihoods of occurrence in the languages from which they are drawn. Clearly, this objection becomes invalid in the context of a corpus-driven methodology using frequency counts to control for such factors as differing probabilities of occurrence across languages.

Beyond this, functional equivalence has been found too narrow a notion in Métrich's (1998) investigation into the translation of German particles. Comparing such sentences as

Ich glaube kaum, dass er das schafft, aber er kann es ja versuchen.
Je ne pense pas qu'il réussisse, mais il peut toujours essayer.
(Métrich 1998:197)

he notes that, strictly speaking, the German particle ja and the French particle toujours are not functionally equivalent: evidential ja points to the existence of a remote possibility of success, whereas toujours has a chiefly temporal meaning. There is thus no functional equivalence between the two markers, although the rendition of ja by toujours is intuitively satisfactory. This leads Métrich (1998:198) to propose a much broader definition of the target-language equivalent as that lexical item which must be omitted in the target-language text if the particle is omitted from the source-language text. The problem with this definition is that it relies entirely on already translated material, taking for granted the equivalence of the source and target-language texts in which the markers are embedded. Hence also its circularity: the equivalence of the target sentences defines the equivalence of the marker words, and vice versa. Thus, while Métrich's definition of equivalence is of great lexicographic value, it would be methodologically unsound for this study to depart from a functional definition.
It is also unnecessary, since multi-word units behave differently from the kind of particle studied by Métrich.

1.1.2.5 Procedure in contrastive analysis

This brings us to the procedure followed in establishing functional equivalence between the monolingual data, a procedure common to all contrastive studies (cf. Schmidt 1996): the first step is to identify potential equivalents, i.e. items fulfilling identical or similar functions in the languages under survey. This is followed by a close comparison of such potential equivalents with a view to determining formal, semantic and/or functional similarities and differences. This then leads to a sub-categorization of the material by semantic-pragmatic field; in the process it may be found that some items defy straightforward classification, as is the case with any attempt to arrange lexis into neatly bounded sets. In a final step an analysis is made of the use of potential translation equivalents in actual text, the focus being on the various discoursal constraints placed upon their use as well as on the strategies which can be employed by translators to cope with such constraints. A contrastive analysis of the kind just described yielded the taxonomy of discourse markers presented in Chapter 3 and the more finely grained analysis of particular functional sets offered in Chapter 4.

To ensure perfect reliability, the analyst would have to make a complex factor analysis for each individual type of equivalence. More concretely, she would have to analyse dozens of occurrences of each marker, taken from a wide variety of authors, within a large number of parameters. Such an investigation, while clearly beyond the capacities of a single individual, would equally well be outside the reach of professional research groups. This is because function can only be analysed according to common-sense criteria which may differ from one person to another (Courdier et al. 1994).

Added to this are problems stemming from the fuzzy nature of meaning or function, a point already touched upon in Section 1.1.2.4. If a complex factor analysis were to be made, it would be found that no two occurrences of a particular marker are completely identical, and that the set of function/meaning components to be taken into account for definitional purposes can be extended at will (cf. Hausmann 1995:19). In other words, the neat compartmentalizing of meanings or functions characteristic of lexicography and other branches of linguistics can do no more than partially capture an infinitely complex reality.

With multilingual contrastive analysis, the situation is further compounded by the fact that no two linguists will have exactly the same knowledge of two or more languages. That is, no individual linguist will have exactly the same command of, say, English and French; nor will she have the same command of, say, French as another linguist.
This can be easily illustrated with Grieve's (1996) study of discourse devices in written French, for Grieve, although a native speaker of English, overlooks a large number of English equivalents of the connectors he discusses (see Chapter 4 for detailed evidence).

1.1.2.6 Analytic framework

The remaining question is what framework to use for describing and analysing the complex web of discourse relations. Among the wide variety of text-linguistic models two distinct currents are discernible (Brinker 1988:12–17; Oldenburg 1992:42–47): on one side are the proponents of system-oriented text linguistics, who set out to describe the propositional structure of texts by analogy with traditional sentence-based grammar; on the other are communication-oriented text linguists, who analyse text in terms of a model of communication which takes full account of the situational and cultural embedding of text. A moment's reflection will reveal that the two approaches are in fact complementary, as it is necessary to relate the function and propositional content of particular text segments to their linguistic realizations at the text surface.

Fusing as it does the two ways of seeing just mentioned, the model developed by Hatim and Mason (1990) has been chosen for the present study. Hatim and Mason distinguish between two hierarchically arranged subdivisions of text, namely elements and sequences, defining an element as the smallest lexico-grammatical unit which can fulfil some rhetorical function and a sequence as a unit which normally consists of more than one element and which serves a higher-order rhetorical function than that of the individual elements in question. In a later version of this model, Hatim (1997:57) recognizes that a sequence may be realized by only one element. To elements and sequences I have added the notion of text segment, or Teiltextsegment, by which is here meant a piece of text comprising at least two sequences. Firm support for such a notion is provided by German studies in text linguistics such as Hengst (1985), Oldenburg (1992), Gil (1995) and Eggs (1996), which involve close readings of the logico-semantic and pragmatic structure of texts.

The functions served by elements, sequences and text segments can be described in terms of Rhetorical Structure Theory (Mann and Thompson 1988), hereafter abbreviated as RST. RST likens the rhetorical relations between text spans to those obtaining between clause complexes. There are two major types of rhetorical relations: nucleus-satellite and list. Drawing an analogy with the clause complex, we can compare to hypotaxis the complex nucleus-satellite relations, and to parataxis the list relations formed by the simple joining together of independent elements. Judging from available RST analyses, the former are far more frequent than the latter. Table 1.7 briefly explains the majority of nucleus-satellite relations.
Table 1.7 Nucleus-satellite relations

Relation name | Nucleus | Satellite
Antithesis | Ideas favoured by the author | Ideas disfavoured by the author
Background | Text whose understanding is being facilitated | Text for facilitating understanding
Circumstance | Text expressing the events or ideas occurring in the interpretive context | An interpretive context of situation or time
Concession | Situation affirmed by author | Situation which is apparently inconsistent but also affirmed by author
Condition | Action or situation whose occurrence results from the occurrence of the conditioning situation | Conditioning situation
Contrast | Situation which is compared with another situation that is (a) identical with another situation in at least some respects, (b) similar to, or different from, another situation in a few respects | Situation which is compared with another situation that is (a) identical with another situation in at least some respects, (b) similar to, or different from, another situation in a few respects
Elaboration | Basic information | Additional information
Enablement | An action | Information intended to aid the reader in performing an action
Evaluation | A situation | An evaluative comment about the situation
Evidence | A claim | Information intended to increase the reader's belief in the claim
Interpretation | A situation | An interpretation of the situation
Justify | Text | Information supporting the writer's right to express the text
Motivation | An action | Information intended to increase the reader's desire to perform the action
Non-volitional cause | A situation | Another situation which causes that one, but not by anyone's deliberate action
Non-volitional result | A situation | Another situation which is caused by that one, but not by anyone's deliberate action
Otherwise (anti-conditional) | Action or situation whose occurrence results from the lack of occurrence of the conditioning situation | Conditioning situation
Purpose | An intended situation | The intent behind the situation
Restatement | A situation | A re-expression of the situation
Solutionhood | A situation or method supporting full or partial satisfaction of the need | A question, request, problem, or other expressed need
Summary | Text | A short summary of that text
Volitional cause | A situation | Another situation which causes that one, by someone's deliberate action
Volitional result | A situation | Another situation which is caused by that one, by someone's deliberate action

Sources: adapted from Mann and Thompson 1988, Mann 1999

Each of these relations receives more detailed treatment in terms of the constraints placed upon it, its effect and the locus of this effect. Thus, for example, the background relation is defined as follows (Mann and Thompson 1988), where N stands for the nucleus, S for the satellite and R for the reader:

constraints on N: R won't comprehend N sufficiently before reading text of S
constraints on the N+S combination: S increases the ability of R to comprehend an element in N
the effect: R's ability to comprehend N increases
locus of the effect: N

To illustrate the workings of RST, let us consider a simple example taken from the beginning of a Scientific American article (Mann 1999):

1) Lactose and Lactase
2) Lactose is milk sugar, 3) the enzyme lactase breaks it down. 4) For want of lactase most adults cannot digest milk. 5) In populations that drink milk the adults have more lactase, perhaps through natural selection.

Five text elements may be recognized in the excerpt under discussion, preceded by the numbers 1 to 5. At the broadest level, two relations are clearly discernible: the relation holding between 2 and 3 is one of elaboration, that holding between 4 and 5 is one of contrast. Adding a second layer of dependency structure, we find that elements 2 and 3 form a background relation with elements 4 and 5. At yet another level of analysis we find that the heading (1) stands in a preparation relation with the rest of the excerpt.

An important reason for the adoption of RST is that this framework is open to extension. New relations or sub-relations can be added to existing ones if need be. RST is thus in keeping with the Firthian tradition of text analysis, which, as explained on p. 5, gives absolute priority to the data. Moreover, RST has already proved a useful heuristic tool in contrastive analysis (cf. Salkie and Oates 1999); its particular value lies in its ability to relate linkage by conjunctions and tacit, or zero, linkage.

A number of criticisms have been made of RST. The most serious charge is that it fails to recognize the possibility of more than one relation at a time existing between nucleus and satellite (Martin 1992:259-260). At first blush, a simple solution might be to extend RST with a view to covering such dual relations, but there is the added complication that two or more stretches of text may be interrelated at different hierarchical levels. For example, two pieces of evidence following and supporting an assertion may in turn be related to each other through addition or comparison.
However, this shortcoming is not directly relevant to the present research, which focuses on the primary functions of specific second-level markers in specific text types, not on the whole range of possible conjunctive relations in text at large.

More broadly speaking, the plethora of discourse-analytic approaches demonstrates that there is no one definitive way of analysing discursive relationships. The human analyst's identification of textual units is crucially dependent on her understanding of the text being examined. Nowhere can this be seen more clearly than in the self-defeating attempts of earlier literary criticism to establish the ultimate meaning of a text. Today, with the rise of reader and reception theory, the idea associated with author-centred or text-centred criticism of a singular, unified reading has given way to reader-centred notions of infinitely many possible readings. Admittedly, academic texts are a far less complex text type in this regard, and we are here concerned with discourse relations rather than ultimate meaning, but there clearly remains an element of uncertainty in any text analysis.

The same, it may be noted at this juncture, holds true for the classification of discourse markers. As a rule, there seems to be agreement among analysts on made-up examples deemed to be prototypical, but such theorizing cannot normally stand up to the complex evidence of language as it is (see Degand 1998; Mauranen 1993). Given the bewildering variety of heuristic-interpretative methods of text analysis, it seems judicious, therefore, to keep an open mind:

Further, given the currently unsatisfactory state of knowledge, descriptive and analytic textlinguistic models should be intuitively plausible. As verifiable findings on the structure of language acts are still largely missing, the criterion of plausibility seems best suited to the evaluation of such models and of the results they yield. This is because every native speaker of a language has an intuitive knowledge of text and text structures.
(Oldenburg 1992:47; my translation)

Mann and Thompson also take explicit account of the subjectivity of the human analyst:

Given the nature of text analysis, these are judgments of plausibility rather than certainty: the analyst is judging whether it is plausible that the writer desires the specified condition.
(Mann and Thompson 1988:245)

The model I have proposed, borrowing from Hatim and Mason (1990) as well as Mann and Thompson (1988), clearly meets the criterion of intuitive plausibility.

1.2 Corpora and corpus-enquiry tools

Shall I make spirits fetch me what I please, resolve me of all ambiguities.
Christopher Marlowe, Doctor Faustus

Since at least the 1980s language science has been in the throes of a corpus revolution. Naturally occurring data, available in their millions from machine-held collections of text, have become the life-blood of the discipline. Indeed, the ever-expanding size of corpora may tempt linguists and translation scholars to indulge Faustian fantasies of omniscience. It is not, of course, the use of electronic corpora in itself that is new; in France, for example, the computerized database Frantext has existed for over forty years (Habert et al. 1997:7).
Rather, the revolution has been made possible by increasing ease of access to electronic text and corpora,10 their enrichment through computerized annotation and the ability of micro-computers to process ever-larger stretches of text at ever-greater speeds. This latter development has also resulted in data-driven approaches making inroads into translation and foreign language classes (Johns and King 1991).

In view of the foregoing considerations it is not surprising that this study, too, avails itself of machine-readable corpora. Work with such corpora raises a number of methodological and practical issues. Foremost amongst these is the relationship, discussed in Section 1.1.2.1, between intuition and corpus evidence, but almost equally important are the issues associated with the compilation and analysis of corpora. These are the subject of the present chapter, which confines itself to native speaker corpora; learner corpora will occupy us in Part II.

1.2.1 Corpus building and information retrieval

This is not the place to give a detailed account of the history, aims and methods of corpus linguistics.11 I shall therefore confine my discussion to the issues of corpus building and information retrieval as they apply to the present research, which draws its data from five different types of computer-readable text archive in each language:

- electronic editions of wide-circulation quality newspapers and news magazines (The Times, the Guardian, The Economist, Le Monde, Le Monde diplomatique, Süddeutsche Zeitung, Frankfurter Rundschau, Der Spiegel) (see Section 1.2.2.1);
- the largest reference work available on CD-ROM (Britannica CD, CD-ROM UNIVERSALIS, Gabler Wirtschaftslexikon CD-ROM) (see Section 1.2.2.2);
- a large corpus of academic texts produced from reviews, journal articles, doctoral theses and portions of books (see Section 1.2.2.3);
- a ten-million-word parallel corpus made up of evenly sized subsections representing various disciplines of the full-size academic corpora; the texts in the parallel corpora are roughly matched by such criteria as date of composition and text category (see Section 1.2.2.4);
- a multilingual translation corpus comprising a variety of text categories from both fictional and non-fictional sources.12

Table 1.8 gives a breakdown of the sources used by corpus type, content, size, baseline year and analysis software. The research based on these corpora was divided into two phases: corpus building (Section 1.2.2) and text
analysis (Section 1.2.3). I will deal with these in turn.

Table 1.8 Corpora used

Corpus | Type | Content | Word count | Baseline year | Analysis software
News corpora (see Section 1.2.2.1)
British newspapers and news magazines (NE) | Full-text | Issues of The Times, the Guardian and The Economist, published in London | 100 million | 1990 | WordSmith Concordancer, Microconcord; Search Engine
French newspapers and news magazines (NF) | Full-text | Issues of Le Monde and Le Monde diplomatique, published in Paris | 100 million | |
German newspapers and news magazines (NG) | Full-text | Issues of Süddeutsche Zeitung, Frankfurter Rundschau and Der Spiegel, published respectively in Stuttgart, Frankfurt and Hamburg | 100 million | |
CD-ROM-based corpora (see Section 1.2.2.2)
Encyclopaedia Britannica (EB) | Full-text | Humanities and sciences texts | 50 million | 1996 | Netscape Navigator
Encyclopedic Universalis (EU) | Full-text | Humanities and sciences texts | 50 million | 1996 | Search Engine
Gablers Wirtschaftslexikon (GW) | Full-text | Economics texts | 6 million | 1994 | Search Engine
Full-text academic corpora (see Section 1.2.2.3)
Corpus of academic English (CAE) | Full-text | Reviews, journal articles, doctoral theses and portions of books | 50 million | 1980 (less than 5% of texts predate 1980) | WordSmith Concordancer, Microconcord
Corpus of academic French (CAF) | Full-text | | 30 million | |
Corpus of academic German (CAG) | Full-text | | 30 million | |
Parallel academic corpora (see Section 1.2.2.4)
Parallel corpus of academic English (PCAE) | Full-text | Reviews, journal articles and portions of books from CAE | 9.5 million | 1980 | WordSmith Concordancer, Microconcord
Parallel corpus of academic French (PCAF) | Full-text | Reviews, journal articles and portions of books from CAF | 9.5 million | |
Parallel corpus of academic German (PCAG) | Full-text | Reviews, journal articles and portions of books from CAG | 9.5 million | |
Multilingual translation corpus (see Section 1.2.2.4)
Multilingual translation corpus (MTC) | Full-text | Academic prose, novels, EU financial regulations, manuals, etc. | | 1950 | Multiconcord

1.2.2 Corpus design and compilation

The corpus linguist has to address a number of competing design issues (see Kennedy 1998; Pearson 1998), including:

- time and resources available for corpus building
- purpose of the research
- size
- text type and length
- text categorization by subject area
- baseline year
- sources and availability of electronic text
- publication status and technicality

As we shall see, these issues are fairly easy to settle as regards news corpora, while academic corpora are more problematic.

1.2.2.1 News corpora

Such has been the invasion of journalistic genres into all spheres of life that a study of discourse cannot do without them. Newspaper language has been aptly described as a perennial Fountain of Youth (Gallagher 1982:7) in which living idioms are constantly renewed. Next to the novel, the newspaper article is probably the single most widely read written genre (Kennedy 1998:49), holding a pivotal place in the evolution of any language of the civilized world. It is also common knowledge that there is ample cross-fertilization between the language of academics and that of journalists. Just as academic shop talk tends to influence the popular science writer, so journalistic coinages spill over into academic treatises.

Annual editions of major quality papers on CD-ROM provide a commercially available, pre-designed means for linguists to study news media language. The prodigious dimensions of such text archives, usually between 30 and 40 million words per CD-ROM, make it possible to give an accurate and reliable portrait of the use which journalists make of such linguistic items as discourse markers.

The ready availability of text on archival CD-ROMs led to a pragmatic sampling approach which took account of both the representativeness and the quality of the writing involved. For each language under investigation, 90 per cent of the corpus texts was collected from highbrow broadsheets, while 10 per cent was culled from quality news magazines (see Table 1.8).
Texts were selected from across the various daily, weekly or monthly sections appearing in the news media under consideration; the corpus material thus obtained was transferred on to the hard disk of a computer.

No claim is made here that the news corpora are representative of the written language as a whole; rather, the assumption is simply that they represent a broad sample of ordinary usage in the British, French and German quality press in the 1990s. While it is true that even a 100-million-word corpus is small compared to the totality of written news media text published in the 1990s, work on other large corpora such as the Bank of English has shown that 100 million words taken from various sources is a sufficient sample to guarantee representativeness and reliability.
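
The arithmetic behind this sampling approach can be made explicit. The following Python sketch is purely illustrative and forms no part of the original study. It takes only the figures given in this section (a 100-million-word news corpus per language, 90 per cent of it from broadsheets and 10 per cent from news magazines, and archival CD-ROMs holding between 30 and 40 million words each) and derives the word budget per source type. The estimate of how many CD-ROM editions such a corpus would span is an assumption of the sketch, since the text does not say how many editions were actually used; all function and variable names are hypothetical.

```python
import math

# Figures taken from the text of this section.
TOTAL_WORDS = 100_000_000                      # size of each news corpus
BROADSHEET_PERCENT = 90                        # share collected from broadsheets
CDROM_WORDS_RANGE = (30_000_000, 40_000_000)   # words per archival CD-ROM


def split_by_source(total_words, broadsheet_percent):
    """Return the word budget for broadsheets and for news magazines."""
    broadsheets = total_words * broadsheet_percent // 100
    news_magazines = total_words - broadsheets
    return {"broadsheets": broadsheets, "news_magazines": news_magazines}


def cdroms_needed(total_words, words_per_cdrom):
    """Assumption of this sketch: whole CD-ROMs are used until the target is met."""
    return math.ceil(total_words / words_per_cdrom)


budget = split_by_source(TOTAL_WORDS, BROADSHEET_PERCENT)
fewest = cdroms_needed(TOTAL_WORDS, max(CDROM_WORDS_RANGE))
most = cdroms_needed(TOTAL_WORDS, min(CDROM_WORDS_RANGE))

print(budget)
print(f"roughly {fewest} to {most} annual CD-ROM editions per language")
```

On these figures each news corpus would draw about 90 million words from broadsheets and 10 million from news magazines. Because texts were selected from across sections rather than taken wholesale, the real number of editions consulted may well have been higher than this lower-bound estimate.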



1.2.2.2 Reference CD-ROMs

With the reference CD-ROMs the issues of design and compilation have, as it were, been settled by the publishers. The Britannica CD boasts 65,000 text articles, whereas the French-language CD-ROM Universalis contains 22,000, a figure exactly equalled by the German Gabler Wirtschaftslexikon CD-ROM. With articles ranging from brief definitions through impressionistic overviews to all-encompassing expositions, it may be appropriate to express more precisely the size of these storehouses of scholarly writing. Thus, at a conservative estimate, the Britannica and Universalis CD-ROMs run to approximately 50 million words each, while the Gabler Wirtschaftslexikon, at around 6 million words, is considerably smaller. This apparent drawback is tempered by the fact that the Gabler Wirtschaftslexikon is devoted in its entirety to a humanities discipline and therefore easy to understand for the analyst.

1.2.2.3 The full-size academic corpora

Let us now consider one by one the aforementioned design issues as they apply to the construction of the three full-size corpora of academic English, French and German.

1.2.2.3.1 TIME AND RESOURCES AVAILABLE

For the corpus linguist working in isolation, practical considerations preclude any attempt to emulate large-scale corpus compilation projects. Since no sizeable corpus of academic English, French or German was publicly available at the time this study got under way, I approached a number of publishers to obtain electronic versions of academic books. Apart from my French publisher Ellipses, who kindly sent two textbooks on floppy disk, no other company was willing to contribute in the absence of substantial pecuniary rewards.

I therefore had no alternative but to collect texts from other sources. Text collection took two forms: first and foremost, I downloaded copyright-free texts on to my PC from electronic journals and academic text archives mounted on the Internet.
Time-consuming as this text capture was, taking up at least 100 man-hours per corpus, it still seemed the most viable option considering the extra expenditure of money, time and effort involved in scanning texts. Second, a small number of authors kindly supplied a few publications on floppy disk.

1.2.2.3.2 PURPOSE

The aim of the present study is to set up a taxonomy of discourse markers as well as to provide a detailed and systematic study of their use in academic text. It is therefore not necessary for the corpora to be fully representative of academic language as a whole; nor is it indispensable for them to have the same size. The corpora merely need to contain a sufficiently large number of tokens of any category of discourse marker in the languages under investigation. This raises questions about the adequate size of the corpus, and the range of text types to be covered.

1.2.2.3.3 CORPUS SIZE

Just as corpus work usually proceeds in an iterative cycle (see Section 1.1), so too does corpus construction (cf. Biber 1993): to ensure that the size of the academic corpora fitted the purpose of the present study, a number of pre-tests had to be carried out at various stages of the compilation process. These pre-tests were based on a wide range of items taken from an extensive list of discourse markers I had compiled manually over the years. The establishment of such an inventory of scholarly metadiscourse had long been a research interest of mine, arising from a simple didactic desire: helping myself and others to write English, French and German in an authentic, correct and sufficiently sophisticated manner. The inventory data were collected during the six years of my studies and beyond, as I happened upon them in scholarly treatises, academic textbooks and popular scientific works.13 Initially, a separate inventory was developed for each language. A number of correspondences then almost automatically forced themselves on my mind as the monolingual collections grew in size. Part of the rationale behind the present corpus compilation project is to check, with regard to some types and tokens of discourse markers, whether the matches thus obtained are viable.

Two questions then arise: 1. How much text have I read over the years? 2. How reliable and representative have my notes been? It is, of course, impossible to answer these questions exactly.
As already noted, humans generally read for content rather than form, and since their eyes tend to alight on odd or felicitous turns of phrase rather than common-or-garden wordings, it seems arguable that some tokens of discourse markers will have been missed in the course of such reading. Even allowing for human fallibility, though, it is evident that the amount of text I have scanned by eye over the years is at least equivalent to, if not larger than, the 30-million-word academic corpora assembled for the purposes of this study. This becomes apparent when we translate the usual word measure into more familiar terms: 30 million words of text correspond roughly to a body of 240 medium-sized books or 150 voluminous doctoral theses (Ball 1996). It is a good guess that normally gifted students get through at least half that amount in the course of their studies.

The aforementioned pre-tests showed that a 10-million-word academic corpus is large enough to provide a substantial number of instances of multi-word discourse markers which can usefully, if not quite exhaustively, fill out the picture emerging from manual analysis. A fully exhaustive description would require corpora larger by at least one order of magnitude; but to date no such corpora have been assembled. Accordingly, it was felt desirable to increase the size of the corpora with a view to increasing their representativeness. The growing availability of electronic text made it possible to increase corpus size for the academic corpora to around 30 million words for French and German, and 50 million words for English. While it would have been easy to obtain a larger amount of English text at the moment of writing, the opposite is true of French and German text. Deplorable as this may be, English clearly dominates the international academic scene, and probably more than 90 per cent of all academic texts published on the Internet or on CD-ROM are in English. That said, however, it should be pointed out that the academic corpora built for this study, most especially the French and German corpora, are easily the largest and most diverse electronic archives of academic language ever created, including substantial quantities of text from various subject fields and genres. The English corpus, for example, is considerably larger than the relevant sections of the British National Corpus or the corpus used for the Longman Grammar of Spoken and Written English (Biber et al. 1999).

1.2.2.3.4 TEXT TYPE AND LENGTH

The text types chosen to represent academic prose were reviews, journal articles, doctoral theses and other monographs. The vast majority of the material included was literature written by specialists for specialists, with just a few articles and books intended for a wider readership. Only complete texts found their way into the corpora, with the exception of book chapters, which are usually self-contained in their discursive structure.
Equally valid arguments can of course be found in favour of both sample and full-text corpora, but, as Sinclair (1991:23-24) points out, the use of sample corpora may pose problems when the analysis requires surveying large, logically connected portions of text, as in the case of discourse markers, which tend to have scope over fairly extended text spans.

The actual compilation process
