‹#› 1 Introduction to Corpus Linguistics

‹#› 2 What is a corpus?
- A corpus is a collection of naturally-occurring language texts, chosen to characterize a state or variety of a language. In modern computational linguistics, a corpus typically contains many millions of words: this is because it is recognised that the creativity of natural language leads to such immense variety of expression that it is difficult to isolate the recurrent patterns that are the clues to the lexical structure of the language (Sinclair 1991: 171).

‹#› 3 What is corpus linguistics? (Teubert & Krishnamurthy 2007)
- The linguists are not in charge of the language; the discourse community is (ibid.: 9)
- “The discourse community establishes the conventions for what is acceptable and what is not” (ibid.: 9)

‹#› 4 What is corpus linguistics?
- “Corpus linguistics is concerned with meaning, with symbolic content. People are not interested in grammatical constructions; they want to know the meaning of what has been said.” (ibid.: 9)

‹#› 5 What is corpus linguistics?
- “What sets corpus linguistics apart from cognitive linguistics is that it looks at language from a social, not a psychological perspective. Language is verbal communication between people, is the discourse of what is actually being said (written) and listened to (read).” (ibid.: 9)

‹#› 6 What is corpus linguistics?
- “Corpus linguistics is bottom-up … accommodate the full evidence of the corpus. It analyses the evidence with the aim of finding probabilities, trends, patterns, co-occurrences of elements, features or groupings of features” (ibid.: 6) that form units of meaning
- “the starting point is always the corpus, real language data” (ibid.: 6)

‹#› 7 What is corpus linguistics?
- “Corpus linguistics uses frequency to arrive at generalisations. Statistical significance makes us aware of connections that we would not see otherwise. The generalisations that corpus linguistics arrives at are not interpreted as laws or rules, but as plausible ways to group similar things together.” (ibid.: 9)

‹#› 8 What is corpus linguistics?
- “Corpus linguistics can also make specific claims concerning unique events of language phenomena by showing in which aspects this event differs from all other occurrences of the same type of phenomenon.” (ibid.: 9)

‹#› 9 History of corpus linguistics (1)
- Late 19th century
  - The Oxford English Dictionary, compiled from an enormous number of collected slips containing authentic examples of language in use
- Late 1950s and early 1960s
  - The beginning of proper corpus linguistics (Tognini-Bonelli 2001: 52)

‹#› 10 History of corpus linguistics (2)
- 1959
  - Randolph Quirk announced his plan to start a Survey of English Usage of both written and spoken English
  - Not computerized, because 50% of it was spoken language
  - A Comprehensive Grammar of the English Language (Quirk, Greenbaum, Leech, & Svartvik 1985)

‹#› 11 The input of new technologies
- The computer
  - To assemble corpora (large amounts of data) from the Web, scan electronic databases on CD-ROM or connect to a database by remote access
  - To store large amounts of information
  - A very fast tool to process and systematise a quantity of information in real time (Tognini-Bonelli 2001: 5-6)

‹#› 12 Corpus analysis software
- Software packages such as WordSmith Tools (Scott 1999)
- The software “selects, sorts, matches, counts and calculates” (Hunston and Francis 2000: 15)

‹#› 13 Three stages of the computer corpus in linguistics work (1)
1. As a tool to process, in real time, a quantity of information
2. A distinctive and enhanced methodology of enquiry into language, providing abundant new evidence in a speedy and systematic way

‹#› 14 Three stages of the computer corpus in linguistics work (2)
3. Corpus linguistics is a domain of research: “a new philosophical approach to linguistic enquiry” (Tognini-Bonelli 2001: 1). It re-unites the activities of data gathering and theorising, which leads to a qualitative change in our understanding of language (Halliday 1993: 24)

‹#› 15 Authentic language use
- “in the final analysis if linguistics is not about language as it is actually being spoken and written by human beings, then it is about nothing at all” (Trudgill 1996: xi)
- Corpus linguistics is the study of language through observation of language evidence in corpora. It differs from traditional linguistics in its insistence on the systematic study of authentic examples of language in use (Tognini-Bonelli 2001: 1).

‹#› 16 The contextual theory of meaning (Firth 1957)
- J.R. Firth (1890-1960) died before the advent of computers and electronic corpora, but “laid the theoretical foundation of a contextual theory of meaning which is central to our present-day view of corpus work” (Tognini-Bonelli 2001: 157).

‹#› 17 Assumption of the contextual theory of meaning
- “We must take our facts from speech sequences, verbally complete in themselves operating in contexts of situation which are typical, recurrent, and repeatedly observable. Such contexts of situation should themselves be placed in categories of some sort, sociological and linguistic, within the wider context of culture.” (Firth 1957: 35)

‹#› 18 Applications of the contextual theory of meaning
- “Speech events have to be apprehended in their contexts, as shaped by the creative acts of speaking persons.” (Firth 1957: 193)
- Firth’s (1957) contextual theory of meaning can be applied to:
  - the analysis of a text: language as function in context
  - the analysis of a corpus, as a corpus contains texts

‹#› 19 Why use a corpus?
- Dictionary explanations are not always accurate
  - Is ‘place’ mostly used to refer to the ‘physical environment’, as defined in the dictionary? Sinclair (2003) disputed this and found that it is most frequently used in the phrase ‘take place’.
- Intuition alone is not enough
  - Is “starting” always replaceable by “beginning”?
  - Is it only “time” that is “immemorial”?
  - “think of” vs. “think about”
- Native speaker intuition is also unreliable
  - It provides no information on frequency of occurrence
  - “head” => body part: is this the most used sense?

‹#› 20 The Word Counter
- http://www.youtube.com/watch?v=ixw-XyycGdU

‹#› 21 How to Read a Text vs. Corpus (Tognini-Bonelli 2001: 3)
TEXT                                     CORPUS
Read whole                               Read fragmented
Read horizontally                        Read vertically
Read for content                         Read for formal patterning
Read as a unique event                   Read for repeated events
Read as an individual act of will        Read as a sample of social practice
Read as a coherent communicative event   Not a coherent communicative event

‹#› 22 How to read a corpus
1. Read fragmented and vertical: Concordance
   A concordance is a list of the occurrences of a particular word or sequence of words, each shown in its context. The concordance is at the centre of corpus linguistics, because it gives access to many important language patterns in texts. The computer has made concordances easy to compile.
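
As a rough illustration of how a computer compiles a concordance, the following Python sketch finds every occurrence of a search word and prints it in one aligned column, with a few words of context on either side. The tokenizer, the function names (`tokenize`, `concordance`, `format_kwic`), the five-word span and the sample sentences are all assumptions made for this example; they are not taken from WordSmith Tools or any other concordancer.

```python
import re


def tokenize(text):
    """Lower-case the text and split it into word tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def concordance(tokens, node, span=5):
    """Return one (left context, search word, right context) triple per occurrence of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - span):i])
            right = " ".join(tokens[i + 1:i + 1 + span])
            lines.append((left, tok, right))
    return lines


def format_kwic(lines, width=28):
    """Right-align the left context so that the search word forms a single column."""
    rows = []
    for left, node, right in lines:
        left_part = left[-width:].rjust(width)
        rows.append(f"{left_part}  {node}  {right[:width]}")
    return rows


# Hypothetical three-sentence "corpus", invented for this illustration only.
sample = (
    "The meeting will take place on Monday. "
    "This is a nice place to live. "
    "The ceremony took place in the main hall."
)

for row in format_kwic(concordance(tokenize(sample), "place")):
    print(row)
```

Reading down the middle column, rather than along each line, is what reading a corpus "vertically" amounts to in practice. Sorting the lines by the word immediately to the left or right of the search word would make recurrent patterns such as ‘take place’ easier to spot.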

‹#› 23 Concordance / Concordancer
- KWIC
  - KWIC is an acronym for Key Word In Context, the most common format for concordance lines.
  - A KWIC index is formed by sorting and aligning the words within a corpus search either in alphabetical order or in frequency order.
- Concordancers: online concordancers; software such as WordSmith Tools
  - http://www.americancorpus.org/

‹#› 24 Text vs. Corpus
[-5 key word +5]

‹#› 25 Types of corpora
- Specialized corpus
- General corpus
- Multilingual corpora
- Comparable corpora
- Parallel corpora
- Free-translation corpus
- Learner corpus
- Pedagogic corpus
- Historical or diachronic corpus
- The Internet as corpus
(see Hunston 2002: 14-16; Tognini-Bonelli 2001: 6-9)

‹#› 26 Uses of corpora
- the tracking of changes in the English language
- the production of dictionaries and other reference materials
- the development of aids to translation
- language teaching materials
- the investigation of ideologies and cultural assumptions
- the study of all aspects of linguistic behaviour, including vocabulary, grammar and pragmatics
- the study of register variation
- natural language processing

‹#› 27 US and British English corpora
- All of these corpora, spanning 60 years, are based on written texts and use the same design criteria, so that comparisons can be made across the two varieties of English and across time.

US English      British English
                BLOB (1931)
Brown (1961)    LOB (1961)
Frown (1991)    FLOB (1991)

‹#› 28 Brown Corpus (1961) (1)
- A computerized corpus of US English
- “a standard sample of present-day edited American English, for use with digital computers” (Francis and Kucera 1979)

‹#› 29 Major English Corpora
- The Brown Corpus (1964)
  - 1 million words (500 samples of 2,000 words each), written American English, texts published in the US in 1961
- The Lancaster-Oslo/Bergen (LOB) Corpus (1978)
  - Similar to the Brown Corpus, British English, texts from 1961 (compiled 1970-1978)
- The London-Lund Corpus (LLC)
  - 200 samples of about 5,000 words each, 1953-1987, spoken British English, transcribed

‹#› 30 Monitor Corpora
- The world’s two largest corpora are in the UK:
- Bank of English: approx. 500 million words
- The Collins WordbanksOnline English corpus: 56 million words of contemporary written and spoken text (http://www.collins.co.uk/corpus/CorpusSearch.aspx)
- British National Corpus (BNC): 100 million words (http://sara.natcorp.ox.ac.uk/lookup.html)

‹#› 31 British National Corpus (BNC)
- The British National Corpus (BNC) is a 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of British English from the later part of the 20th century. The latest edition is the BNC XML Edition, released in 2007.
- http://www.natcorp.ox.ac.uk/

‹#› 32 BNC written
- The written part of the BNC (90%) includes, for example, extracts from regional and national newspapers, specialist periodicals and journals for all ages and interests, academic books and popular fiction, published and unpublished letters and memoranda, school and university essays, among many other kinds of text.

‹#› 33 BNC Written (1)
- “The BNC was designed to characterise the state of contemporary British English in its various social and generic uses.” (Aston & Burnard 1998: 28)
  - Imaginative 20%
  - Arts 8%
  - Belief & thought 4%
  - Commerce & finance 8%
  - Leisure 11%
  - Natural & pure science 4%
  - Applied science 11%
  - Social science 15%
  - World affairs 14%
  - Unclassified 2%
  (p. 29)

‹#› 34 BNC Written (2)
- Book 46%
- Periodical 36%
- Miscellaneous published 6%
- Miscellaneous unpublished 7%
- To-be-spoken 1%
- Unclassified 2%
(Aston & Burnard 1998: 30)

‹#› 35 BNC Spoken
- The spoken part (10%) includes a large amount of unscripted informal conversation, recorded by volunteers selected from different ages, regions and social classes in a demographically balanced way, together with spoken language collected in all kinds of different contexts, ranging from formal business or government meetings to radio shows and phone-ins.

‹#› 36 The Cobuild project: Bank of English
- COLLINS Birmingham University International Language Database, 1980-1986
- The use of the computer plays “a clerical role in lexicography” (Sinclair 1991: 2)
- A huge database of annotated examples of language use was assembled
- A substantial dictionary was edited from the database: Collins Cobuild Dictionary (Sinclair et al. 1987)

‹#› 37 American National Corpus
- The ANC is being developed to provide, for American English, the kind of linguistic documentation that exists for British English in the British National Corpus.
- The goal for the ANC is to parallel the general structure of the BNC, while adding genres like blogging and instant messaging that did not exist when the BNC was created.

‹#› 38 Online Resources
1. PolyU Language Bank
   The PolyU Language Bank
2. Mark Davies’s website
   http://corpus.byu.edu
3. David Lee’s website
   http://tinyurl.com/r7zubf
4. Sketch Engine
   http://ca.sketchengine.co.uk/login/

‹#› 39 Classification of Corpora (Mode)
- Corpora: Written; Spoken
- Monolingual; Bi-/Multi-lingual
- Synchronous: online chatting; online conferencing; instant messaging
- Asynchronous: emailing; blogging; BBS forum posting; online film/book reviewing; etc.

‹#› 40 Classification of Corpora (Content)
- Monolingual
  - Language for General Purposes (LGP)
    - Reference corpus: BNC
  - Language for Special Purposes (LSP)
    - Academic corpus: BAWE
    - Legal corpus: HKLAW
    - Tourism corpus: TnT

‹#› 41 Classification of Corpora (CMC)
- CMC corpora
  - Synchronous corpus: online chatting; instant messaging; online conferencing
  - Asynchronous corpus: blogging; web pages; emails
  - Business; Education; Government

‹#› 42 Methods of corpus-based analysis
- Wordlists
- Concordances
- Collocations
- Keywords

‹#› 43 Analytical procedure: Four steps
- Step 1: Word listing and counting: tearing the text apart
- Step 2: Compiling a concordance: putting words back into context
- Step 3: Sorting the context in a concordance: uncovering patterns
- Step 4: Examining the context of a word: looking for collocations

‹#› 44 A problem for you
- What verbs go with “battle”?
- What adjectives go with “battle”?
- What phrases contain “battle”?

‹#› 45
- “The ability to examine large text corpora in a systematic manner allows access to a quality of evidence that has not been available before” (Sinclair 1991: 4)

‹#› 46 ‘battle’: LTP Dictionary of Selected Collocations
- Verbs to the left: engage in, fight, force, go into, join in, lose, take part in, win ~
- Verbs to the right: ~ continues, dragged on, ended in stalemate, is in progress, raged
- Adj: bitter, bloody, crucial, decisive, fierce, final, hopeless, important, last-ditch, long, long-running, major, mock, pitched, real, relentless, running, successful ~
- Phrases: fight a losing ~, outcome of ~

‹#› 47 BNC Written
- In 90 million words, “battle” occurs over 6,000 times, about once every 14,000 words.
- Collocated verbs among the top 100 collocates ranked by MI score:
  - fought (153) / fighting (93)
  - rages (5) / raged (12)
  - waged (10) / waging (12)
  - ensued (8) / ensuing (13)
  - defeated (39)
  - losing (68)
  - won (152)
  - commence (5)

‹#› 48 Clusters in BNC Written
- to do battle (54)
- fighting a losing battle (24)
- win the battle (22)
- won the battle (22)
- fighting a losing battle (21)
- to fight a battle (15)

‹#› 49 Lexicography and corpora
- A corpus provides authentic uses of language
- Extract samples (concordance) to identify different senses
- Word frequency information
- Helps identify collocations and set phrases
- Set phrases: night and day, black and white
- Modern English dictionaries are all now corpus-based.
- Oxford, Collins, Longman, Cambridge…

‹#› 50 Linguistics and Corpora
- Verify linguistic theory and hypotheses
- Research on empirical linguistics
- Language variation
- e.g. Intonation
  - Cheng et al. (2008). A corpus-driven study of discourse intonation
- Grammar
  - Biber et al. (1999). Longman Grammar of Spoken and Written English
  - Hunston et al. (1996). Pattern Grammar
- Discourse
  - Baker (2006). Using corpora in discourse analysis.
- Language variation
  - Reppen et al. (Eds.) (2002). Using corpora to explore linguistic variation

‹#› 51 Language Teaching and Corpora
- Use a corpus as a knowledge resource:
  - Know more about English: e.g. answer specific questions about certain words, phrases, structures.
  - Know where the problems are: e.g. error analysis on a learner corpus
  - Know what should be taught: e.g. syllabus design, teaching materials

‹#› 52 Language Teaching and Corpora
- Use a corpus for syllabus design:
  - Native corpora => what is actually used
  - Learner corpora => what the problems are
  - Find out which aspects should be given priority
  - Lexical syllabus = focus on frequency of occurrence
  - How many words should the students know? What are they?

‹#› 53 Corpus and literary study
- Corpus of literary works:
  - e.g. Corpus of Shakespeare’s drama
- Stylistic studies
  - Compare the works of different writers
  - Compare the literary works of different genres, for different readerships
- Historical studies
  - Compare works of different historical periods
  - Investigate changes in the patterns of language use
  - Examine changes in vocabulary

‹#› 54 Corpus and Translation Study
- Corpora as a resource for translation
- Parallel corpora: a corpus of translations and their original texts
  - Provide examples of translation
- Corpus of translation vs. corpus of target language
  - Helps edit translations to be native-like
  - Helps in understanding difficult words/concepts
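
The word-listing and collocation steps of the analytical procedure on slide 43, together with the MI-ranked collocates of ‘battle’ on slide 47, can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function names, the crude tokenizer, the span and the toy sentence are invented for illustration, and the MI formula is the plain pointwise version, log2 of (co-occurrences × corpus size) / (node frequency × collocate frequency). Corpus tools often adjust this for window size, so their scores will differ.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into word tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def collocates(tokens, node, span=4):
    """Count the words found within `span` tokens either side of each occurrence of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts


def mi_score(cooc, f_node, f_coll, n_tokens):
    """Plain pointwise mutual information (assumed formula, no window-size adjustment)."""
    return math.log2((cooc * n_tokens) / (f_node * f_coll))


def per_n_words(freq, n_tokens):
    """Average number of running words per occurrence of an item."""
    return n_tokens / freq


# Toy sentence invented for this illustration; real MI scores need a large corpus.
tokens = tokenize("They fought a battle and won the battle after the battle raged.")
wordlist = Counter(tokens)                      # word listing and counting
counts = collocates(tokens, "battle", span=2)   # words found near the node word

for word, cooc in counts.most_common():
    score = mi_score(cooc, wordlist["battle"], wordlist[word], len(tokens))
    print(f"{word:8} co-occurrences={cooc}  MI={score:.2f}")
```

On the frequency figure from slide 47, `per_n_words(6000, 90_000_000)` gives one occurrence per 15,000 words, so "once every 14,000 words" corresponds to roughly 6,400 occurrences, which is consistent with "over 6,000". MI tends to favour rare words, which is one reason a low-frequency collocate such as "rages" can rank alongside far more frequent ones such as "fought".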