Technical information about the corpus
Web crawling and text processing
Documents in the Oxford English Corpus consist of a series of text segments derived from closely related pages within a single website. Pages are considered related when they link to each other and share a common thread: for example, they all discuss a particular topic or are all written by a particular author. Text is collected using a custom-built web crawler. A configuration file directs the crawler to a particular website (or an area of a website) and defines its behaviour within that site: the navigational route it should follow and the type of pages it should collect along that route.
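The format of the crawler's configuration file is not described here. As a rough illustration only, the sketch below models a single entry as a Python object. The class name `CrawlEntry`, its fields, the regular expressions, and the example.org URLs are all hypothetical.

```python
import re
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class CrawlEntry:
    """One hypothetical configuration entry, describing one corpus document."""
    start_url: str        # where the crawler enters the website
    follow_pattern: str   # regex on URL paths: the navigational route
    collect_pattern: str  # regex on URL paths: the pages whose text is kept

    def in_scope(self, url: str) -> bool:
        # Never leave the website named in the start URL.
        return urlparse(url).netloc == urlparse(self.start_url).netloc

    def should_follow(self, url: str) -> bool:
        path = urlparse(url).path
        return self.in_scope(url) and re.search(self.follow_pattern, path) is not None

    def should_collect(self, url: str) -> bool:
        path = urlparse(url).path
        return self.in_scope(url) and re.search(self.collect_pattern, path) is not None


# Assumed example: walk a blog's index pages and collect the individual posts.
entry = CrawlEntry(
    start_url="https://example.org/blog/",
    follow_pattern=r"^/blog/(page/\d+/?)?$",
    collect_pattern=r"^/blog/\d{4}/[\w-]+/?$",
)
```

Separating the route (`follow_pattern`) from the pages to keep (`collect_pattern`) mirrors the two things the configuration is said to define; a real entry would presumably also carry the document's metadata.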
Collecting text in this way is more labour-intensive than randomized or exhaustive crawling methods, because each document requires a new entry in the configuration file to specify the crawler’s route and behaviour. However, this approach has two important benefits. Firstly, it means that metadata (domain, year, author, etc.) can be accurately defined in advance. Secondly, it facilitates removal of ‘boilerplate’ text: that is, standardized text that recurs from page to page, such as navigation menus, headers, and footers, rather than forming part of a page’s own content. Boilerplate can be identified by comparing a cluster of related pages and looking for HTML strings that recur across them.
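The comparison step can be sketched as follows. This is a simplified illustration, not the corpus's actual procedure: it treats each line of HTML as a unit, uses exact matching where the real process looks for similar strings, and the function name and `threshold` parameter are assumptions.

```python
from collections import Counter


def strip_boilerplate(pages: list[str], threshold: float = 0.8) -> list[str]:
    """Drop lines that occur in at least `threshold` of the related pages."""
    counts = Counter()
    for page in pages:
        # Count each distinct non-empty line once per page.
        counts.update({line.strip() for line in page.splitlines() if line.strip()})

    cutoff = threshold * len(pages)
    boilerplate = {line for line, n in counts.items() if n >= cutoff}

    return [
        "\n".join(
            line for line in page.splitlines()
            if line.strip() and line.strip() not in boilerplate
        )
        for page in pages
    ]


# Three related pages sharing a navigation bar and a footer.
pages = [
    '<div id="nav">Home | About</div>\n<p>First post.</p>\n<div id="footer">Contact</div>',
    '<div id="nav">Home | About</div>\n<p>Second post.</p>\n<div id="footer">Contact</div>',
    '<div id="nav">Home | About</div>\n<p>Third post.</p>\n<div id="footer">Contact</div>',
]
cleaned = strip_boilerplate(pages)
```

Because the pages in a document come from the same site and the same kind of page, anything they nearly all share is unlikely to be content, which is why a cluster of related pages makes this comparison effective.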
Once a series of web pages has been collected and the boilerplate removed, the pages are stripped of tags, links, and other markup, and normalized to unformatted plain text. The text is then ‘tokenized’ into individual words, annotated for part of speech, and parsed. Finally, the annotated text is converted to XML, and document metadata is added.
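A minimal sketch of these stages, using only the Python standard library, might look like the following. Part-of-speech annotation and parsing are left out, the tokenizer is a simple regular expression rather than the one actually used, and the XML element names `document` and `w` are assumptions, since the corpus's schema is not given here.

```python
import re
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Keeps text content, discards tags, and skips <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def to_plain_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse all runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


def tokenize(text: str) -> list[str]:
    # A word (allowing internal apostrophes and hyphens) or one punctuation mark.
    return re.findall(r"\w+(?:['’-]\w+)*|[^\w\s]", text)


def to_xml(tokens: list[str], metadata: dict[str, str]) -> str:
    doc = ET.Element("document", metadata)  # metadata become attributes
    for token in tokens:
        ET.SubElement(doc, "w").text = token
    return ET.tostring(doc, encoding="unicode")


text = to_plain_text('<p>The <a href="/x">corpus</a> isn\'t small.</p>')
tokens = tokenize(text)
xml = to_xml(tokens, {"year": "2006"})
```

Joining text fragments with a space keeps adjacent block elements apart, at the cost of occasionally splitting a word that is interrupted by inline markup; a production pipeline would need to treat block and inline elements differently.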
Each document has the following metadata:
- author (if known; many websites make this difficult to determine reliably)
- author gender (if known)
- language type (e.g. British English, American English)
- source website
- year (+ date, if known)
- date of collection
- domain + subdomain
- document statistics (number of tokens, sentences, etc.)
In addition, each page within a document has metadata giving the URL of the source webpage.
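As an illustration, the metadata listed above could be modelled as a record like the one below. The class and field names are hypothetical, the sample values are invented for the example, and the whitespace-based token count is only a stand-in for the statistics produced by the real tokenizer.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Page:
    url: str   # URL of the source webpage
    text: str


@dataclass
class Document:
    source_website: str
    language_type: str                   # e.g. "British English"
    domain: str
    subdomain: str
    year: int
    collection_date: str
    author: Optional[str] = None         # None when not known
    author_gender: Optional[str] = None  # None when not known
    date: Optional[str] = None           # fuller date, if known
    pages: list[Page] = field(default_factory=list)

    def statistics(self) -> dict[str, int]:
        # Crude stand-in: count whitespace-separated tokens.
        tokens = sum(len(page.text.split()) for page in self.pages)
        return {"pages": len(self.pages), "tokens": tokens}


# Invented sample values, for illustration only.
doc = Document(
    source_website="example.org",
    language_type="British English",
    domain="leisure",
    subdomain="gardening",
    year=2006,
    collection_date="2006-05-01",
    pages=[
        Page("https://example.org/a", "one two three"),
        Page("https://example.org/b", "four five"),
    ],
)
```

Making the optional fields default to `None` reflects the "if known" qualification on author, author gender, and date.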
Tagging and parsing
The corpus is tagged using TreeTagger, developed at the University of Stuttgart. Each word in the corpus is annotated with its lemma and a part-of-speech tag, drawn from a fine-grained tagset based on the Penn Treebank. Major grammatical relations between words, such as subjects, objects, and modifiers, are then determined using a sketch grammar.
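To give a sense of what this annotation makes possible, the sketch below reads tagged text in a tab-separated word, tag, lemma layout and applies one toy rule for the modifier relation. The sample lines are hand-written rather than real tagger output, and the rule is a plain Python stand-in: it is not written in the Sketch Engine's sketch grammar formalism and is far cruder than a real sketch grammar.

```python
from typing import NamedTuple


class Token(NamedTuple):
    word: str
    tag: str
    lemma: str


def parse_tagged(text: str) -> list[Token]:
    """Read one token per line: word, tag, and lemma separated by tabs."""
    return [Token(*line.split("\t")) for line in text.splitlines() if line.strip()]


def modifiers(tokens: list[Token]) -> list[tuple[str, str]]:
    """Toy rule: an adjective directly before a noun modifies that noun."""
    return [
        (adj.lemma, noun.lemma)
        for adj, noun in zip(tokens, tokens[1:])
        if adj.tag.startswith("JJ") and noun.tag.startswith("NN")
    ]


# Hand-written sample using Penn-style tags.
sample = "large\tJJ\tlarge\ncorpora\tNNS\tcorpus\nare\tVBP\tbe\nuseful\tJJ\tuseful\n"
tokens = parse_tagged(sample)
pairs = modifiers(tokens)
```

Reporting the relation on lemmas rather than word forms means that "large corpora" and "large corpus" count as the same modifier pair.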
The principal tool used for analysis of the Oxford English Corpus is the Sketch Engine, software developed by Lexical Computing Ltd: see www.sketchengine.co.uk.