Corpora of misspellings for download

birkbeck

The birkbeck file contains 36,133 misspellings of 6,136 words. It is an amalgamation of errors taken from the native-speaker section (British or American writers) of the Birkbeck spelling error corpus. That corpus is a collection of files of spelling errors gathered from various sources; the separate files, with detailed documentation, are available from the Oxford Text Archive. The errors come from spelling tests and from free writing, mostly by schoolchildren, university students or adult literacy students. Most were originally handwritten.

Each correct word is preceded by a dollar sign and followed by its misspellings, each on one line, without duplicates. (A misspelling might appear more than once in the corpus, but only as a misspelling of different words.) Where the spelling or misspelling contained a space, this has been replaced by an underscore (a_lot, Christ_mas). While most of the misspellings are non-words, there are also some real-word errors, such as "omitted" for "admitted".
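The layout described above can be read with a few lines of code. The following is a minimal sketch, not an official loader: the function name parse_misspellings is hypothetical, the sample lines are built only from examples quoted in this description, and restoring underscores to spaces is a design choice you may prefer to skip.

```python
from collections import defaultdict


def parse_misspellings(lines):
    """Map each correct word to the list of misspellings listed under it.

    A line starting with '$' introduces a correct word; every following
    line, up to the next '$' line, is one misspelling of that word.
    """
    corpus = defaultdict(list)
    target = None
    for raw in lines:
        entry = raw.strip()
        if not entry:
            continue  # tolerate blank lines
        if entry.startswith("$"):
            # Underscores stand for spaces in the file; restore them here.
            target = entry[1:].replace("_", " ")
            corpus[target]  # register the word even if nothing follows it
        elif target is not None:
            corpus[target].append(entry.replace("_", " "))
    return dict(corpus)


# Sample lines in the described format, using examples from this page.
sample = ["$admitted", "omitted", "$too_much", "tomuch", "$accordingly", "o"]
corpus = parse_misspellings(sample)
print(corpus["too much"])  # ['tomuch']
```

To read a downloaded file, pass an open file object instead of the list, for example parse_misspellings(open(path, encoding="utf-8")), where path is wherever you saved the file; the encoding is an assumption.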

Correct spellings of dictionary words are given in Oxford English form. Where the misspellings were taken from American writers, attempts at specifically American forms (color, theater etc.) have been excluded. Where a correct American form appears as a misspelling, it represents a British writer's attempt at the British form, such as "color" for "colour". Apart from dictionary words, the correct spellings also contain some proper nouns, abbreviations, words with apostrophes or hyphens, made-up words and two-word items (e.g. "too much") where the misspelling was a single string ("tomuch").

Users of this corpus should bear in mind that it includes the efforts of young children and extremely poor spellers who were given spelling tests far beyond their ability, so some of the misspellings are very different from their targets; the single letter "o", for example, appears as a misspelling of the word "accordingly".

holbrook

The passages in holbrook are taken from the book 'English for the Rejected' by David Holbrook, Cambridge University Press, 1964. They are extracts from the writings of secondary-school children, in their next-to-last year of schooling. In a couple of the passages, the children were copying material supplied to them, but all the rest is their own writing. The passages are linked, in the book, by description, explanation and comment by the author, and some of them make little sense without that context.

Holbrook, who was the children's English teacher, preserved the original spelling and punctuation when he typed the passages for publication. The children are given fictitious names in the book, and these are retained in the file, together with page references.

The text was put into computer-readable form, with the permission of David Holbrook and the Cambridge University Press, via a data-entry machine and was manually proofread.

The material is presented here in two forms. The file holbrook-tagged contains the passages largely as published, except that the misspellings are tagged. For example, <ERR targ=sister> siter </ERR> means that "siter" was written for "sister". In a few cases, where I could not decide what word was intended, a question mark is given in place of the target.
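The tags can be pulled out with a regular expression. This is a rough sketch that assumes every tag has exactly the shape shown in the example above (an unquoted targ value and matching </ERR>); the function names and the sample sentence are invented for illustration.

```python
import re

# One tag looks like: <ERR targ=sister> siter </ERR>
ERR_TAG = re.compile(r"<ERR targ=(.*?)>\s*(.*?)\s*</ERR>", re.DOTALL)


def extract_errors(text):
    """Return (misspelling, target) pairs in order of appearance."""
    return [(m.group(2), m.group(1)) for m in ERR_TAG.finditer(text)]


def corrected(text):
    """Replace each tagged error with its target.

    Where the target is unknown ('?'), the original misspelling is kept.
    """
    def pick(match):
        target, error = match.group(1), match.group(2)
        return error if target == "?" else target

    return ERR_TAG.sub(pick, text)


# Invented sentence wrapped around the tag quoted in the description.
passage = "my <ERR targ=sister> siter </ERR> went home"
pairs = extract_errors(passage)
print(pairs)               # [('siter', 'sister')]
print(corrected(passage))  # my sister went home
```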

The file holbrook-missp contains the misspellings extracted from the tagged file and put into the same form as the birkbeck file described above, with the difference that the frequency of each error-target pair is also provided. It contains 1,791 misspellings of 1,200 target words (including 20 of unknown targets represented as '?').
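The description does not say where on the line the frequency appears. The sketch below assumes, purely for illustration, that each misspelling line ends with a whitespace-separated integer count (for example "siter 2"); check the file itself before relying on this. The function name and the sample counts are made up.

```python
from collections import Counter


def parse_with_counts(lines):
    """Count (misspelling, target) pairs.

    ASSUMPTION: a misspelling line is '<misspelling> <count>'. A line
    without a trailing integer is counted once.
    """
    pairs = Counter()
    target = None
    for raw in lines:
        entry = raw.strip()
        if not entry:
            continue
        if entry.startswith("$"):
            target = entry[1:]  # '?' marks an unknown target
            continue
        if target is None:
            continue  # ignore anything before the first '$' line
        parts = entry.split()
        if len(parts) >= 2 and parts[-1].isdigit():
            pairs[(parts[0], target)] += int(parts[-1])
        else:
            pairs[(entry, target)] += 1
    return pairs


# Hypothetical sample; the counts and the 'xyz' entry are invented.
sample = ["$sister", "siter 2", "$?", "xyz 1"]
pairs = parse_with_counts(sample)
total_errors = sum(pairs.values())  # error tokens
distinct_pairs = len(pairs)         # error-target types
```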

aspell

The aspell file contains 531 misspellings of 450 words. It is derived from one assembled by Atkinson for testing the GNU Aspell spellchecker.

This version is based closely on one used by Deorowicz and Ciura in a recent paper ("Correcting spelling errors by modelling their causes"). They removed some errors from the original file where the correct words were not in their lexicon. (I used their version in order to preserve comparability with their work.) I have reformatted it along the lines of the birkbeck file described above. I also added British spellings as correct forms. Where the initial letter was upper-case but the word was not a proper noun, I changed this to lower-case.

wikipedia

The wikipedia file contains 2,455 misspellings of 1,922 words. It is a list of misspellings made by Wikipedia editors.

Like the aspell file (see above), this version is closely based on one used by Deorowicz and Ciura, and the same remarks apply to it as to aspell. In addition, I changed the correct spelling "rigeur" to "de_rigeur" and the correct spelling "orignal" to "Orignal" (a place in Ontario), and I deleted "vigeur" as a correct spelling.

Note that a single misspelling may appear as the misspelling of several correct words; for example "buring" appears as a misspelling of "burin", "burning", "burying" and "during".
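Such ambiguous misspellings can be found by inverting the word-to-misspellings mapping. The sketch below assumes the file has already been loaded into a dictionary from each correct word to its list of misspellings; the function name is hypothetical, and the sample data is just the "buring" example above.

```python
from collections import defaultdict


def targets_by_misspelling(corpus):
    """Invert {correct word: [misspellings]} into {misspelling: {correct words}}."""
    index = defaultdict(set)
    for target, errors in corpus.items():
        for error in errors:
            index[error].add(target)
    return index


# The example from the note above, as it would look once loaded.
corpus = {
    "burin": ["buring"],
    "burning": ["buring"],
    "burying": ["buring"],
    "during": ["buring"],
}
index = targets_by_misspelling(corpus)

# Keep only misspellings that point at more than one correct word.
ambiguous = {e: sorted(t) for e, t in index.items() if len(t) > 1}
print(ambiguous)  # {'buring': ['burin', 'burning', 'burying', 'during']}
```

When scoring a spellchecker against the file, a suggestion for an ambiguous misspelling can reasonably be counted as correct if it matches any of the listed targets, or scored separately for each pair; which is appropriate depends on the evaluation.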