What is a simple way to generate keywords from a text?
The name for the "high frequency English words" is stop words, and many lists are available. I'm not aware of any Python or Perl libraries for this, but you could store your stop word list in a binary tree or hash (or in Python's frozenset). Then, as you read each word from the input text, check whether it is in your stop list and filter it out if so.
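As a rough illustration of the frozenset idea, here is a minimal sketch. The stop list is a tiny made-up sample (real lists run to a few hundred words), and the function name remove_stop_words and the simple regex tokenizer are assumptions made for the example, not part of any library.

```python
import re

# Tiny illustrative stop list (an assumption); substitute a real list here.
STOP_WORDS = frozenset({
    "a", "an", "and", "are", "for", "in", "is", "of", "on", "the", "to", "with",
})

def remove_stop_words(text):
    """Return the words of `text`, lowercased, with stop words filtered out."""
    # Very simple tokenizer: runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    # Membership tests against a frozenset take constant time on average.
    return [word for word in words if word not in STOP_WORDS]

print(remove_stop_words("The cat is on the mat"))
```

A frozenset suits this because the list never changes after it is built; a hash in Perl would play the same role.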
Note that after you remove the stop words, you'll need to do some stemming to normalize the resulting text (remove plurals, -ings, -eds), then remove all the duplicate "keywords".
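The normalize-then-deduplicate step could look roughly like the sketch below. The suffix stripping is deliberately crude and only stands in for a proper stemming algorithm such as Porter's; the names naive_stem and unique_keywords are hypothetical, chosen for this example.

```python
def naive_stem(word):
    """Strip a few common English suffixes; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three letters so short words like "is" or "red" survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def unique_keywords(words):
    """Stem each word and drop duplicates, keeping first-seen order."""
    seen = set()
    keywords = []
    for word in words:
        stem = naive_stem(word.lower())
        if stem not in seen:
            seen.add(stem)
            keywords.append(stem)
    return keywords

print(unique_keywords(["words", "word", "counted", "counting", "counts"]))
```

A rule set this small will mangle plenty of words (it turns "parsers" into "parser" but "parsed" into "pars"), so treat it as a placeholder for a real stemmer.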
You could try the Perl module Lingua::EN::Tagger for a quick and easy solution.
A more complicated module, Lingua::EN::Semtags::Engine, uses Lingua::EN::Tagger with a WordNet database to produce more structured output. Both are pretty easy to use; just check the documentation on CPAN, or use perldoc after you install the module.
To find the most frequently-used words in a text, do something like this:
    #!/usr/bin/perl
    use strict;
    use warnings 'all';

    # Read the whole text into one string:
    open my $ifh, '<', 'text.txt' or die "Cannot open file: $!";
    my $text = do { local $/; <$ifh> };
    close $ifh;

    # Find all the words, and count how many times they appear:
    my %words = ( );
    $words{$_}++
        for grep { length > 1 && $_ =~ m/^[\@a-z'-]+$/i }
            map  { s/[",\.]//g; $_ }
            split /\s+/, $text;

    print "Words, sorted by frequency:\n";

    # Output format: word left-justified, count right-justified.
    my (@data_line);
    format FMT =
    @<<<<<<<<<<<<<<<<<<<<<<... @########
    @data_line
    .
    local $~ = 'FMT';

    # Print every word seen more than twice, most frequent first:
    for my $word (sort { $words{$b} <=> $words{$a} } grep { $words{$_} > 2 } keys(%words)) {
        @data_line = ($word, $words{$word});
        write();
    }
Example output looks like this:
    john@ubuntu-pc1:~/Desktop$ perl frequency.pl
    Words, sorted by frequency:
    for                               32
    Jan                               27
    am                                26
    of                                21
    your                              21
    to                                18
    in                                17
    the                               17
    Get                               13
    you                               13
    OTRS                              11
    today                             11
    PSM                               10
    Card                              10
    me                                 9
    on                                 9
    and                                9
    Offline                            9
    with                               9
    Invited                            9
    Black                              8
    get                                8
    Web                                7
    Starred                            7
    All                                7
    View                               7
    Obama                              7