The Oxford New Words Corpus (New Monitor Corpus)
What is the Oxford New Words Corpus?
Collection for the Oxford New Words Corpus started in early 2012, and the corpus now totals approximately 7 billion words. It complements and extends the Oxford English Corpus, a carefully balanced collection of English covering the period 2000-2006.
What information is collected in the New Words Corpus?
Whatever we’re interested in, it’s hard for us to keep up with the sheer quantity of new information that appears on the Internet daily. Many people subscribe to RSS feeds, which gather links to recently published web pages on particular topics.
Here at Oxford Dictionaries, we’re interested in anything talked about in words, so we identified more than 10,000 RSS feeds covering the gamut of topics, from current affairs, science, sport, hobbies, and popular culture to hundreds of others. Twice a day we visit each of these feeds, identify links to pages we haven’t seen before, and collect the text of these pages. We get pages in English written by people all over the world and in all sorts of styles, including newspapers, scientific articles, and individual blogs. Most importantly, we know to within a day or so when the page was first published.
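As a rough illustration only (this is not Oxford Dictionaries' actual code), the "identify links to pages we haven't seen before" step can be sketched in Python. The sketch assumes plain RSS 2.0 feeds, and the names new_links and SAMPLE_FEED and the example.com URLs are hypothetical. Fetching the feed and extracting page text are left out.

```python
import xml.etree.ElementTree as ET

# A tiny, made-up RSS 2.0 document standing in for a real feed.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item><title>A</title><link>http://example.com/a</link></item>
    <item><title>B</title><link>http://example.com/b</link></item>
  </channel>
</rss>"""


def new_links(feed_xml, seen):
    """Return item links not yet in `seen`, and add them to `seen`."""
    root = ET.fromstring(feed_xml)
    fresh = []
    for item in root.iter("item"):
        link = (item.findtext("link") or "").strip()
        if link and link not in seen:
            seen.add(link)       # remember it so the next visit skips it
            fresh.append(link)
    return fresh


if __name__ == "__main__":
    seen = {"http://example.com/a"}      # links collected on earlier visits
    print(new_links(SAMPLE_FEED, seen))  # only the unseen link
    print(new_links(SAMPLE_FEED, seen))  # nothing new on the second visit
```

Keeping the set of seen links between visits is what lets each page be collected once, close to the day it was first published.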
How does Oxford Dictionaries use the New Words Corpus to analyse language?
Every month we analyse what we’ve collected, removing duplicate copies of pages and even of individual paragraphs, leaving us with more than 150 million words. We identify parts of speech and grammatical patterns to give us a wide-ranging and up-to-date picture of the vocabulary of current English.
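One simple way to remove duplicated pages and paragraphs is to fingerprint normalised text and keep only the first occurrence. The sketch below assumes exact matching after lowercasing and whitespace collapsing, and assumes paragraphs are separated by blank lines. The real pipeline may well use something more tolerant of near-duplicates, and the function names here are hypothetical.

```python
import hashlib


def _fingerprint(text):
    """Hash text after lowercasing and collapsing runs of whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def deduplicate(pages):
    """Drop repeated pages, then repeated paragraphs across all pages."""
    seen_pages = set()
    seen_paragraphs = set()
    result = []
    for page in pages:
        page_fp = _fingerprint(page)
        if page_fp in seen_pages:
            continue                      # whole page already collected
        seen_pages.add(page_fp)
        kept = []
        for para in page.split("\n\n"):   # assumption: blank line = paragraph break
            if not para.strip():
                continue
            para_fp = _fingerprint(para)
            if para_fp in seen_paragraphs:
                continue                  # e.g. a footer shared by many pages
            seen_paragraphs.add(para_fp)
            kept.append(para.strip())
        if kept:
            result.append("\n\n".join(kept))
    return result


if __name__ == "__main__":
    pages = [
        "Hello world.\n\nShared footer.",
        "Hello world.\n\nShared footer.",
        "Another story.\n\nShared  footer.",
    ]
    for page in deduplicate(pages):
        print(repr(page))
```

Paragraph-level removal matters because syndicated text and boilerplate would otherwise inflate the counts of the words they contain.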
Using the detailed data about when the pages were written, we apply a combination of statistical and editorial techniques to identify the appearance of new words and words which are becoming more widely used. We examine not only changes in how often a word is used, but also how that usage is distributed across the world and, very importantly, whether it is moving from specialist forums such as science journals or websites for particular interest groups into the wider domain of English discussion.
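The statistical side of this can be illustrated with a very crude stand-in: compare a word's rate per million words in an earlier period with its rate in a later one, and flag words whose rate has grown sharply. The thresholds, the add-one smoothing, the function names, and the toy counts below are all assumptions made for illustration. The sketch ignores geographic spread and movement between specialist and general sources, and it is no substitute for editorial judgement.

```python
from collections import Counter


def per_million(count, total):
    """Occurrences per million words of running text."""
    return 1_000_000 * count / total


def rising_words(earlier, later, min_ratio=2.0, min_count=5):
    """Return (word, growth ratio) pairs, largest growth first."""
    earlier_total = sum(earlier.values())
    later_total = sum(later.values())
    rising = []
    for word, count in later.items():
        if count < min_count:
            continue  # too rare in the later period to judge
        later_rate = per_million(count, later_total)
        # Add-one smoothing so words absent earlier do not divide by zero.
        earlier_rate = per_million(earlier.get(word, 0) + 1, earlier_total)
        ratio = later_rate / earlier_rate
        if ratio >= min_ratio:
            rising.append((word, ratio))
    return sorted(rising, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Toy counts, not real corpus data.
    earlier = Counter({"the": 1000, "newword": 1})
    later = Counter({"the": 1000, "newword": 40})
    for word, ratio in rising_words(earlier, later):
        print(word, round(ratio, 1))
```

Using rates rather than raw counts keeps the comparison fair when the two periods contain different amounts of text.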