What exactly is corpus linguistics?

This post, part of my mini-series ‘What exactly is…’, gives an overview of corpus linguistics and will hopefully pique your interest in finding out more.

First of all, a definition: a corpus is a collection of texts, often used to study language. These days, corpora are generally held electronically – access is much faster and analysis can be more powerful.

Corpora have a considerable history. The very first corpora date back to ancient times – one example is the Hippocratic Corpus of Ancient Greek – a collection of medical texts. Another well-known corpus was used by Dr Johnson to produce his Dictionary of the English Language – it was based on quotations from famous authors, copied onto slips of paper, becoming part of a huge filing system (the ‘corpus’).

Corpus linguistics as a discipline

The first book dedicated to the subject was edited by Aarts and Meijs in 1984. Corpus linguistics has developed quickly in recent decades thanks to the possibilities offered by computerized processing of natural language. It is, however, not the same thing as obtaining language data by computer: corpus linguistics is the study and analysis of data obtained from a corpus.

To find out more about corpus linguistics, see the W3-Corpora Project at the University of Essex website.

Corpora can be used to:

help translators (the corpora can be bilingual or monolingual) – for example the DGT translation memories can be described as bilingual or parallel corpora
‘teach’ machine translation programs (corpora of United Nations documents were among the first to be used in developing Google Translate)
study translation habits (how translators translate)
study changes in language over time
help people to learn a language (or even jargon in a particular field) – corpora were used to produce the well-known COBUILD dictionary for learners of English as a foreign language, for example
analyze discourse – the linguistic ‘behavior’ of a given community and the way in which meaning is constructed or construed
establish the ‘plain’ or ‘ordinary’ meaning of a word or term – corpora are starting to be used in the interpretation of statutes, regulations and contracts by judges & forensic linguists, especially in the US
study a body of law (an early example is the Corpus Juris – fundamental works of jurisprudence collected in Ancient Rome)

As stated above, corpora can be used to study how language changes over time. Below I used Google n-grams (see this post for more information) to check usage of two terms. Look how “corpus linguistics” jumps in use from around 1987 onwards! And note how “forensic linguistics” only started to appear around 1998.
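For readers who like to see the idea in code, here is a minimal Python sketch of what an n-gram chart measures: how often a phrase occurs in each year, relative to all phrases of the same length from that year. The function names and the two-sentence toy corpus are my own assumptions for illustration; this is not how Google computes its figures.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def phrase_frequency_by_year(docs, phrase):
    """Return {year: relative frequency of `phrase`} for (year, text) pairs.

    The relative frequency is the number of times the phrase occurs in a
    year divided by the number of same-length word sequences in that year.
    """
    target = tuple(tokenize(phrase))
    n = len(target)
    hits = defaultdict(int)
    totals = defaultdict(int)
    for year, text in docs:
        tokens = tokenize(text)
        # Build every run of n consecutive tokens (the "n-grams").
        ngrams = list(zip(*(tokens[i:] for i in range(n))))
        totals[year] += len(ngrams)
        hits[year] += sum(1 for gram in ngrams if gram == target)
    return {y: hits[y] / totals[y] for y in sorted(totals) if totals[y]}


# Hypothetical toy corpus: one short "document" per year.
toy_docs = [
    (1980, "The study of language relies on intuition and on texts."),
    (1990, "Corpus linguistics is growing. Corpus linguistics uses computers."),
]

frequencies = phrase_frequency_by_year(toy_docs, "corpus linguistics")
for year, share in frequencies.items():
    print(year, round(share, 3))
```

A real diachronic corpus would hold millions of words per year, but the principle is the same: count, normalize by the size of each year's sample, and compare across time.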

Kinds of corpora

monolingual, bilingual, multilingual
parallel* (texts and their translations)
comparable* (similar, untranslated texts in different languages)
spoken or written
general or specialized
unidirectional, bidirectional or multidirectional
diachronic (covering language change over time) or synchronic (restricted to a particular period of time)

* Scholars disagree about the definition of these two terms.

How corpora can be collected

by hand (think index cards or even parchment)
automatically
by humans assisted by electronic tools (e.g. BootCat)

Corpus size

From 10,000 words (see Bowker & Pearson, 2002, p. 48) to billions (e.g. enTenTen2) or even infinitely large (the Web). Some people take a ‘big is beautiful’ approach, while others consider that a small corpus in a specialized domain can be very useful too.

Access to corpora

Unfortunately some organizations and countries are better than others at sharing corpus data openly. 😉

Consulting a corpus

Many corpora can be consulted through online portals, and you can also use software (such as AntConc) to search a corpus on your own computer.
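The most common way of consulting a corpus is a concordance: every occurrence of a search word shown with a few words of context on either side (often called KWIC, ‘keyword in context’). The Python sketch below illustrates the idea only. The function name `kwic`, the `width` parameter and the sample text are assumptions of mine, and it says nothing about how AntConc or any online portal works internally.

```python
import re


def kwic(text, keyword, width=3):
    """Return (left context, match, right context) for each hit of `keyword`.

    `width` is the number of words of context kept on each side.
    Matching is case-insensitive and works on whole words only.
    """
    tokens = re.findall(r"\w+", text)
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


# Hypothetical sample text standing in for a corpus file.
sample = (
    "A corpus is a collection of texts. "
    "The corpus can be searched with a concordancer."
)

for left, match, right in kwic(sample, "corpus", width=2):
    # Right-align the left context so the keyword lines up in a column.
    print(f"{left:>12} | {match} | {right}")
```

Dedicated tools add a great deal on top of this (sorting, wildcards, part-of-speech tags, statistics), so treat the sketch as a way of seeing what a concordance line is rather than as a replacement for them.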

One online portal giving access to several corpora (and lots of other things besides) is the Sketch Engine. For more details about that see my post here.

Mark Davies, at Brigham Young University in the United States, makes several corpora available at this portal: http://corpus.byu.edu/. You just need to register in order to use them. However, the user interface can be a little tricky.

And to round off, some key terms:

Selected publications

Aarts, J. & Meijs, W. (Eds.). (1984). Corpus linguistics: Recent developments in the use of computer corpora in English language research. Amsterdam: Rodopi.
Bowker, L. & Pearson, J. (2002). Working with specialized language: A practical guide to using corpora. London: Routledge.
McEnery, T. & Hardie, A. (2011). Corpus linguistics. Cambridge: Cambridge University Press.
Taylor, C. (2008). What is corpus linguistics? What the data says. ICAME Journal, 32, 179-200.

You might also be interested in other posts in this series: What exactly is forensic linguistics?, What exactly is a lawyer-linguist?, and What exactly is comparative law?