Real world typo statistics? [closed]

A possible source of real-world typo statistics would be Wikipedia's complete edit history.

http://download.wikimedia.org/
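If you mine the edit history, one simple heuristic is to diff consecutive revisions of a page and keep only word-for-word replacements where the two words are nearly the same length; those are far more likely to be typo fixes than rewrites. A minimal sketch along those lines (the tokenising regex, the length threshold, and the sample texts are my own assumptions, not anything the dump format prescribes):

```python
import difflib
import re

def candidate_typo_pairs(old_text, new_text, max_len_diff=2):
    """Compare two revisions of a page and yield (before, after) word pairs
    that differ only slightly -- likely typo corrections rather than rewrites."""
    old_words = re.findall(r"[a-z']+", old_text.lower())
    new_words = re.findall(r"[a-z']+", new_text.lower())
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # 'replace' blocks of equal length are the interesting ones:
        # one word swapped for another, similar-looking word.
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            for before, after in zip(old_words[i1:i2], new_words[j1:j2]):
                if abs(len(before) - len(after)) <= max_len_diff:
                    yield before, after

# Example with two made-up "revisions":
old = "This is a smal example sentense."
new = "This is a small example sentence."
print(list(candidate_typo_pairs(old, new)))
# [('smal', 'small'), ('sentense', 'sentence')]
```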

Also, you might be interested in AWB's RegExTypoFix:

http://en.wikipedia.org/wiki/Wikipedia:AWB/T


I would advise you to check out the trigram algorithm. In my opinion it works better for finding typos than an edit-distance algorithm. It should work faster as well, and if you keep the dictionary in a Postgres database, you can make use of an index.
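As a rough illustration of the trigram idea, the sketch below scores dictionary words by the overlap of their 3-character substrings with the misspelled input (the padding scheme and the 0.3 cut-off imitate pg_trgm's defaults; the tiny dictionary and the misspelled word are just placeholders):

```python
def trigrams(word):
    """Return the set of 3-character substrings of a padded, lowercased word."""
    padded = f"  {word.lower()} "   # pg_trgm-style padding: two spaces before, one after
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def trigram_similarity(a, b):
    """Jaccard-style similarity of two words' trigram sets (1.0 = identical)."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

def best_matches(word, dictionary, threshold=0.3):
    """Rank dictionary words by trigram similarity to the (possibly misspelled) input."""
    scored = ((trigram_similarity(word, w), w) for w in dictionary)
    return sorted((s, w) for s, w in scored if s >= threshold)[::-1]

dictionary = ["statistics", "statistical", "static", "statute"]
print(best_matches("statsitics", dictionary))
```

In Postgres itself, the pg_trgm extension provides this kind of similarity function together with GiST/GIN index support, which is what makes the dictionary lookup fast instead of a full scan.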

You may also find the Stack Overflow topic about Google's "Did you mean?" feature useful.


Probability Scoring for Spelling Correction by Church and Gale might help. In that paper, the authors model typos as a noisy channel between the author and the computer. The appendix has tables for typos seen in a corpus of Associated Press publications. There is a table for each of the following kinds of typos:

  • deletion
  • insertion
  • substitution
  • transposition

For example, examining the insertion table, we can see that l was incorrectly inserted after l 128 times (the highest number in that column). Using these tables, you can generate the probabilities you're looking for.
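As a hedged sketch of that last step, the snippet below turns an insertion table into conditional probabilities roughly the way the noisy-channel model does: the count of "y wrongly inserted after x" divided by how often x appears in the corpus. Only the 128 figure comes from the table as quoted above; the other counts are placeholders you would replace with the real data from the paper's appendix.

```python
from collections import defaultdict

# ins_counts[(x, y)] = times letter y was wrongly inserted after letter x.
# Only the ("l", "l") entry is taken from the paper's table; the rest are placeholders.
ins_counts = defaultdict(int, {("l", "l"): 128})

# char_counts[x] = how often letter x occurred in the corpus (placeholder totals).
char_counts = defaultdict(int, {"l": 250_000})

def p_insertion(prev_char, inserted_char):
    """Estimate P(inserted_char wrongly typed after prev_char).
    In the noisy-channel model this is one factor of P(typo | intended word)."""
    if char_counts[prev_char] == 0:
        return 0.0
    return ins_counts[(prev_char, inserted_char)] / char_counts[prev_char]

# e.g. probability of "hello" being typed as "helllo" (an extra l inserted after l)
print(p_insertion("l", "l"))
```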