What is a corpus?

A corpus is a collection of texts of written (or spoken) language presented in electronic form. It provides evidence of how language is used in real situations, from which lexicographers can write accurate and meaningful dictionary entries. The Oxford English Corpus is at the heart of dictionary-making in Oxford in the 21st century and ensures that we can track and record the very latest developments in the language. By analysing the corpus with specialized software, we can see words in context and find out how new words and senses are emerging, as well as spot other trends in usage, spelling, world English, and so on.

Using the corpus enables lexicographers to examine a word in detail by looking at all the different contexts in which it occurs. Below is a typical way of viewing the results of a search of the corpus, using a display format called KWIC (or ‘key word in context’):

[Figure: corpus search example in KWIC format]

The Oxford English Corpus gives us the fullest, most accurate picture of the language today. It represents all types of English, from literary novels and specialist journals to everyday newspapers and magazines, and even the language of blogs, emails, and Internet message boards. And, as English is a global language, used by an estimated one third of the world’s population, the Oxford English Corpus contains language from all parts of the world – not only from the UK and the United States but also from Ireland, Australia, New Zealand, the Caribbean, Canada, India, Singapore, and South Africa. It is the largest English corpus of its type: the most representative slice of the English language available.

The corpus reaches new heights

The corpus contains over 2.5 billion words of real 21st-century English; this is the largest lexical corpus in the world. Size is not all that matters, though: it is the size of the corpus, coupled with the careful selection and development of its contents, that makes it a resource unlike any other in the world. Moreover, because the corpus is a collection of running text, those 2.5 billion words are not all different words: the humble word ‘the’, the commonest in the written language, accounts for almost 100 million of all the words in the corpus!
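The difference between the total number of words in a text and the number of different words can be shown with a small sketch in Python. The function name `token_type_counts`, the simple word-splitting rule, and the sample sentence are assumptions for illustration only; a real corpus would need far more careful handling of punctuation, hyphens, and word forms.

```python
import re
from collections import Counter


def token_type_counts(text):
    """Count running words and distinct words in `text` (hypothetical helper).

    Words are lower-cased runs of letters and apostrophes, which is a
    deliberately crude rule chosen only to keep the example short.
    """
    words = re.findall(r"[a-z']+", text.lower())
    frequencies = Counter(words)
    return len(words), len(frequencies), frequencies


# Invented sample sentence standing in for corpus text.
sample = "The cat sat on the mat and the dog sat by the door."

total, distinct, frequencies = token_type_counts(sample)
print("running words:", total)
print("different words:", distinct)
print("most frequent:", frequencies.most_common(1))
```

Even in a single sentence, the repeated ‘the’ means there are fewer different words than running words; across billions of words the gap becomes enormous.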

Keeping track of our language

Meanings of words and phrases change, and so do spellings, despite the existence of ‘standard’ or ‘correct’ spelling. A strength of the corpus is that it contains not only published works in which the text has been edited (and made to conform to standard spellings and grammar) but also unpublished and unedited writing like emails and blogs. Some of the most inventive uses and deliberate exploitations of language (as well as genuine mistakes) start out in this kind of informal and unselfconscious language, so tracking them is an essential part of tracking the language as a whole.
