How do I find the frequency count of a word in English using WordNet?

How do I find the frequency count of a word in English using WordNet?


In WordNet, every Lemma has a frequency count, which is returned by the method lemma.count() and stored in the file nltk_data/corpora/wordnet/cntlist.rev.

Code example:

from nltk.corpus import wordnet

syns = wordnet.synsets('stack')
for s in syns:
    for l in s.lemmas():
        print l.name + " " + str(l.count())

Result:

stack 2
batch 0
deal 1
flock 1
good_deal 13
great_deal 10
hatful 0
heap 2
lot 13
mass 14
mess 0
...

However, many counts are zero, and neither the source file nor the documentation says which corpus was used to create this data. According to the book Speech and Language Processing by Daniel Jurafsky and James H. Martin, the sense frequencies come from the SemCor corpus, which is a subset of the already small and outdated Brown Corpus.

So it's probably best to choose the corpus that best fits your application and create the data yourself, as Christopher suggested.

To make this compatible with Python 3.x (and NLTK 3, where name() is a method), use:

Code example:

from nltk.corpus import wordnet

syns = wordnet.synsets('stack')
for s in syns:
    for l in s.lemmas():
        print(l.name() + " " + str(l.count()))
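The loop above prints one count per lemma across all synsets. If a single number for the word itself is wanted, one possible approach is to sum the counts of only those lemmas whose name matches the word. This is a rough sketch, not something the WordNet documentation prescribes; the helper name word_count is made up for illustration, and it assumes the NLTK 3 interface (lemmas(), name(), count()).

```python
def word_count(synsets, word):
    """Sum the counts of every lemma whose name equals `word`.

    `synsets` is any iterable of objects exposing lemmas(), where each
    lemma exposes name() and count(), as NLTK 3 WordNet objects do.
    """
    # WordNet joins multiword lemmas with underscores, e.g. "good_deal".
    target = word.lower().replace(" ", "_")
    return sum(
        lemma.count()
        for synset in synsets
        for lemma in synset.lemmas()
        if lemma.name().lower() == target
    )

# Usage with NLTK 3 (requires the WordNet data to be downloaded):
#   from nltk.corpus import wordnet
#   print(word_count(wordnet.synsets("stack"), "stack"))
```

The total still inherits the limitations of the underlying counts, so a word that is rare in the tagged corpus will come out as zero or close to it.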


You can sort of do it using the Brown Corpus, though it's out of date (last revised in 1979), so it's missing lots of current words.

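from nltk.corpus import brown
from nltk.probability import FreqDist

# Count every lower-cased token in the Brown Corpus.
words = FreqDist()
for sentence in brown.sents():
    for word in sentence:
        # FreqDist.inc() was removed in NLTK 3; increment the entry directly.
        words[word.lower()] += 1

print(words["and"])       # absolute count
print(words.freq("and"))  # relative frequency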

You could then pickle the FreqDist to a file (with cPickle in Python 2, or pickle in Python 3) for faster loading later.
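A minimal sketch of that caching step is below. To keep it runnable without NLTK or the corpus data, it uses collections.Counter as a stand-in; in NLTK 3, FreqDist is a subclass of Counter, so the same two helpers should work on it unchanged. The helper names and the file name are made up for illustration.

```python
import pickle
from collections import Counter

def save_counts(counts, path):
    """Write a Counter (or FreqDist) to disk in binary pickle format."""
    with open(path, "wb") as f:
        pickle.dump(counts, f, protocol=pickle.HIGHEST_PROTOCOL)

def load_counts(path):
    """Read back an object written by save_counts()."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Stand-in data; with NLTK this would be the FreqDist built from the corpus.
counts = Counter(["and", "the", "and"])

# Hypothetical file name:
#   save_counts(counts, "brown_freq.pkl")
#   counts = load_counts("brown_freq.pkl")
```

Only load pickle files you created yourself, since unpickling untrusted data can execute arbitrary code.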

A corpus is basically just a file full of sentences, one per line, and there are lots of other corpora out there, so you could probably find one that fits your purpose. A couple of other sources of more current corpora: Google, American National Corpus.
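If your corpus really is a plain text file with one sentence per line, counting can be sketched with the standard library alone. This is only an illustration: the regex tokenizer is deliberately naive (lower-cased runs of ASCII letters and apostrophes), and the function names and the file name in the usage comment are made up.

```python
import re
from collections import Counter

# Naive tokenizer: runs of lowercase ASCII letters and apostrophes.
TOKEN = re.compile(r"[a-z']+")

def count_words(lines):
    """Count lower-cased tokens over an iterable of sentences."""
    counts = Counter()
    for line in lines:
        counts.update(TOKEN.findall(line.lower()))
    return counts

def relative_frequency(counts, word):
    """Share of all tokens that are `word` (0.0 for an empty corpus)."""
    total = sum(counts.values())
    return counts[word.lower()] / total if total else 0.0

# Usage with a hypothetical corpus file:
#   with open("corpus.txt", encoding="utf-8") as f:
#       counts = count_words(f)
#   print(counts["and"], relative_frequency(counts, "and"))
```

For anything beyond a quick estimate you would probably want a real tokenizer, but the counting logic stays the same.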

You can also supposedly get a current list of the top 60,000 words and their frequencies from the Corpus of Contemporary American English.


Check out this site for word frequencies: http://corpus.byu.edu/coca/

Somebody compiled a list of words taken from opensubtitles.org (movie scripts). A free, simple text file in the format below is available for download, in many different languages.

you 6281002
i 5685306
the 4768490
to 3453407
a 3048287
it 2879962

http://invokeit.wordpress.com/frequency-word-lists/
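Loading a list in that "word count" format takes only a few lines. The sketch below assumes each line holds exactly one word and one integer separated by whitespace, as in the sample above, and silently skips anything else; the function name and the file name in the usage comment are made up.

```python
def parse_frequency_list(lines):
    """Turn 'word count' lines into a {word: count} dictionary."""
    freqs = {}
    for line in lines:
        parts = line.split()
        # Skip blank or malformed lines rather than failing.
        if len(parts) != 2 or not parts[1].isdigit():
            continue
        word, count = parts
        freqs[word] = int(count)
    return freqs

# The first entries from the sample above.
sample = ["you 6281002", "i 5685306", "the 4768490"]
freqs = parse_frequency_list(sample)

# Usage with a downloaded list (hypothetical file name):
#   with open("en.txt", encoding="utf-8") as f:
#       freqs = parse_frequency_list(f)
```

Dividing a word's count by the sum of all counts gives its relative frequency in that list.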