
Determining Word Frequency of Specific Terms


I would go with the second idea. Here is a simple Perl program that reads a list of words from the first file given on the command line and prints, in tab-separated format, a count of each of those words found in the second file. The word list in the first file should contain one word per line.

#!/usr/bin/perl
use strict;
use warnings;

my $word_list_file = shift;
my $process_file   = shift;

my %word_counts;

# Open the word list file, read a line at a time, remove the newline,
# add it to the hash of words to track, initialize the count to zero
open(WORDS, $word_list_file) or die "Failed to open list file: $!\n";
while (<WORDS>) {
  chomp;
  # Store words in lowercase for case-insensitive match
  $word_counts{lc($_)} = 0;
}
close(WORDS);

# Read the text file one line at a time, break the text up into words
# based on word boundaries (\b), iterate through each word incrementing
# the word count in the word hash if the word is in the hash
open(FILE, $process_file) or die "Failed to open process file: $!\n";
while (<FILE>) {
  chomp;
  while ( /-$/ ) {
    # If the line ends in a hyphen, remove the hyphen and
    # continue reading lines until we find one that doesn't
    chop;
    my $next_line = <FILE>;
    last unless defined $next_line;
    chomp $next_line;   # strip its newline so a trailing hyphen is detected
    $_ .= $next_line;
  }
  my @words = split /\b/, lc;  # Split the lower-cased version of the string
  foreach my $word (@words) {
    $word_counts{$word}++ if exists $word_counts{$word};
  }
}
close(FILE);

# Print each word in the hash in alphabetical order along with the
# number of times encountered, delimited by tabs (\t)
foreach my $word (sort keys %word_counts) {
  print "$word\t$word_counts{$word}\n";
}

If the file words.txt contains:

linux
frequencies
science
words

And the file text.txt contains the text of your post, the following command:

perl analyze.pl words.txt text.txt

will print:

frequencies     3
linux   1
science 1
words   3

Note that breaking on word boundaries using \b may not work the way you want in all cases; for example, if your text files contain words that are hyphenated across lines, you will need to do something a little more intelligent to match them. In that case you could check whether the last character of a line is a hyphen and, if it is, remove the hyphen and read another line before splitting the line into words.

Edit: Updated version that handles words case-insensitively and handles hyphenated words across lines.

Note that if there are hyphenated words, some of which are broken across lines and some of which are not, this won't find them all, because it only removes hyphens at the end of a line. In that case you may want to simply remove all hyphens and match the words after the hyphens are gone. You can do this by adding the following line right before the split:

s/-//g;


I do this sort of thing with a script like the following (in bash syntax):

for file in *.txt
do
  sed -r 's/([^ ]+) +/\1\n/g' "$file" \
    | grep -F -f 'go-words' \
    | sort | uniq -c > "${file}.frq"
done

You can tweak the regex used to delimit individual words; in the example I just treat whitespace as the delimiter. The -f option to grep names a file containing your words of interest, one per line.


First, familiarize yourself with lexical analysis and how to write a scanner generator specification. Read the introductions to tools like YACC, Lex, Bison, or my personal favorite, JFlex. There you define what constitutes a token; this is where you learn how to create a tokenizer.
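
You don't need a full scanner generator for something this small; as a rough, hand-rolled stand-in (the token definition below, letters with optional internal apostrophes or hyphens, is my own simplification rather than an actual Lex or JFlex specification), a Java tokenizer might look like this:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordTokenizer {
    // Toy token definition: a "word" is a run of letters, optionally with
    // internal apostrophes or hyphens (e.g. "don't", "well-known").
    private static final Pattern WORD = Pattern.compile("[A-Za-z]+(?:['-][A-Za-z]+)*");

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            tokens.add(m.group().toLowerCase());
        }
        return tokens;
    }
}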

Next you need what is called a seed list. The opposite of a stop list is usually referred to as a start list or limited lexicon (lexicon is another term worth learning). Part of the app needs to load the start list into memory so it can be queried quickly. The typical way to store it is a file with one word per line, which the app reads once at startup into something like a map. You might also want to learn about the concept of hashing.
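
As a minimal sketch of that load step, assuming a plain-text file with one start word per line (the class and method names are just placeholders of mine):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;

public class StartList {
    // Read the start list (one word per line) into a HashSet so that each
    // later lookup is a single O(1) hash probe instead of a file scan.
    public static Set<String> load(Path file) throws IOException {
        Set<String> words = new HashSet<>();
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            String word = line.trim().toLowerCase();
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }
}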

From here you want to think about the basic algorithm and the data structures needed to store the result. A distribution is easily represented as a two-dimensional sparse array, so learn the basics of a sparse matrix; you don't need six months of linear algebra to understand what it does.
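
Purely as an illustration (treating the two dimensions as files by words is my assumption), a sparse matrix can be as simple as nested hash maps that store only the non-zero cells:

import java.util.HashMap;
import java.util.Map;

public class SparseDistribution {
    // A dense 2-D array would be int[file][vocabularyWord] and mostly zeros.
    // The sparse version stores only the non-zero cells: outer key = file,
    // inner key = word, value = count. Missing entries are implicitly zero.
    private final Map<String, Map<String, Integer>> cells = new HashMap<>();

    public void set(String file, String word, int count) {
        cells.computeIfAbsent(file, f -> new HashMap<>()).put(word, count);
    }

    public int get(String file, String word) {
        Map<String, Integer> row = cells.get(file);
        return (row == null) ? 0 : row.getOrDefault(word, 0);
    }
}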

Because you are working with larger files, I would advocate a stream-based approach. Don't read the whole file into memory; read it as a stream into the tokenizer, which produces a stream of tokens.
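
One simple way to sketch that in Java, without a generated scanner, is java.util.Scanner over a Reader, which pulls one token at a time off the stream; the delimiter pattern below is my own simplification:

import java.io.FileReader;
import java.io.IOException;
import java.util.Scanner;

public class StreamTokens {
    public static void main(String[] args) throws IOException {
        // Read the file as a stream of tokens rather than loading it whole:
        // Scanner hands back one delimited token at a time.
        try (Scanner in = new Scanner(new FileReader(args[0]))) {
            in.useDelimiter("[^A-Za-z'-]+");   // anything that is not part of a word
            while (in.hasNext()) {
                String token = in.next().toLowerCase();
                // hand the token on to the filtering/counting steps sketched below
                System.out.println(token);
            }
        }
    }
}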

For the next part of the algorithm, think about how to transform the token stream into a list containing only the words you want. That list sits in memory and can be very large, so it is better to filter out non-start-words as early as possible. At the critical point where you get a new token from the tokenizer, and before adding it to the token list, look it up in the in-memory start-word list. If it is a start word, keep it in the output token list; otherwise ignore it and move on to the next token until the whole file has been read.
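
Sketching that check (startWords is the set loaded earlier in these sketches, and the class and method names are only illustrative):

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class StartWordFilter {
    // Decide, at the moment a token arrives, whether it is worth keeping:
    // tokens not in the start list are dropped before they use any memory.
    public static List<String> keepStartWords(Iterable<String> tokens, Set<String> startWords) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (startWords.contains(token)) {
                kept.add(token);
            }
            // otherwise: ignore it and move on to the next token
        }
        return kept;
    }
}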

Now you have a list containing only the tokens of interest. Since you are not looking at other indexing metrics like position, case, and context, you don't really need a list of all tokens; you really just want a sparse matrix of distinct tokens with associated counts.

So, first create an empty sparse matrix. Then think about inserting a newly found token during parsing: when one occurs, increment its count if it is already in the structure, otherwise insert it with a count of 1. At the end of parsing the file, you have a list of distinct tokens, each with a frequency of at least 1.
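
A plain HashMap keyed by token can serve as that sparse structure, since only tokens that actually occur ever get an entry; the class name below is just for the sketch:

import java.util.HashMap;
import java.util.Map;

public class TokenCounter {
    // Only distinct tokens that actually occur get an entry, so every stored
    // count is at least 1; every other word's count is implicitly zero.
    private final Map<String, Integer> counts = new HashMap<>();

    public void add(String token) {
        // Increment if the token is already present, otherwise insert it with a count of 1.
        counts.merge(token, 1, Integer::sum);
    }

    public Map<String, Integer> counts() {
        return counts;
    }
}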

That list is now in memory and you can do whatever you want with it. Dumping it to a CSV file is a trivial matter of iterating over the entries and writing one line per entry along with its count.
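
Writing that out as CSV is then only a few lines; the alphabetical sorting below is my own choice, not a requirement:

import java.io.IOException;
import java.io.PrintWriter;
import java.util.Map;
import java.util.TreeMap;

public class CsvDump {
    // One "word,count" line per entry; wrapping the map in a TreeMap just
    // sorts the output alphabetically, which is purely cosmetic.
    public static void write(Map<String, Integer> counts, String path) throws IOException {
        try (PrintWriter out = new PrintWriter(path)) {
            for (Map.Entry<String, Integer> entry : new TreeMap<>(counts).entrySet()) {
                out.println(entry.getKey() + "," + entry.getValue());
            }
        }
    }
}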

For that matter, take a look at the non-commercial product called "GATE", commercial products such as TextAnalyst, or the products listed at http://textanalysis.info