Counting words on an HTML web page using PHP
The one-liner below does a case-insensitive word count after stripping all HTML tags from your string.
print_r(array_count_values(str_word_count(strip_tags(strtolower($str)), 1)));
To grab the source code of a page, you can use cURL or file_get_contents():
$str = file_get_contents('http://www.example.com/');
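If you would rather go the cURL route, a minimal sketch could look like the following. The helper name fetch_page() and the 10 second timeout are assumptions made for illustration, not part of the original answer, and the cURL extension has to be enabled in your PHP build.

```php
<?php
// Hypothetical helper: fetch a page with cURL and return its body as a string.
function fetch_page(string $url): string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // assumed timeout, in seconds

    $body = curl_exec($ch);
    if ($body === false) {
        // curl_exec() returns false on failure; surface the reason to the caller
        throw new RuntimeException('cURL request failed: ' . curl_error($ch));
    }
    return $body;
}

// Usage (needs network access):
// $str = fetch_page('http://www.example.com/');
```

Either way you end up with the page source in $str, ready for the word count.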
Reading the one-liner from the inside out:
- Use strtolower() to make everything lower case.
- Strip HTML tags using strip_tags()
- Create an array of words used using str_word_count(). Passing 1 as the second argument makes it return an array containing all the words found inside the string.
- Use array_count_values() to capture words used more than once by counting the occurrence of each value in your array of words.
- Use print_r() to display the results.
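The same steps can be written out one per line, which makes the intermediate values easier to inspect. This is only a sketch: the sample string and the variable names are made up for illustration.

```php
<?php
// Hypothetical sample input; any HTML string would do.
$str = '<p>The cat saw <b>the</b> other Cat.</p>';

$lower  = strtolower($str);            // lower-case everything, tags included
$text   = strip_tags($lower);          // 'the cat saw the other cat.'
$words  = str_word_count($text, 1);    // format 1: return the words as an array
$counts = array_count_values($words);  // word => number of occurrences

print_r($counts);
```

For this input, "the" and "cat" are each counted twice, and "saw" and "other" once.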
The script below will read the contents of the remote URL, remove the HTML tags, and count the occurrences of each unique word therein.
Caveat: In your expected output, "This" has a value of 2, but the script below is case-sensitive, so "this" and "This" are recorded as separate words. You could convert the whole input string to lower case before processing if the original case is not significant for your purposes.
Additionally, as only a basic strip_tags is run on the input, malformed tags will not be removed, so the assumption is that your source HTML is valid.
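To make the case caveat concrete, here is a small sketch that counts the same words both ways. The input string and variable names are illustrative only, and array_map('strtolower', ...) is just one of several ways to fold the case.

```php
<?php
// Illustrative input containing both "This" and "this".
$html = '<h1> This is the title </h1><p> some description text here, <b>this</b> is a word. </p>';

$words = str_word_count(strip_tags($html), 1);

// Case-sensitive: "This" and "this" are separate keys.
$caseSensitive = array_count_values($words);

// Case-insensitive: lower-case every word before counting.
$caseInsensitive = array_count_values(array_map('strtolower', $words));

echo $caseSensitive['This'], ' ', $caseSensitive['this'], "\n"; // 1 1
echo $caseInsensitive['this'], "\n";                            // 2
```

Lower-casing the whole string before calling strip_tags() gives the same counts. Which one you pick is mostly a matter of taste.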
Edit: Charlie points out in the comments that things like the head section will still be counted. With the help of a function defined in the user notes of the strip_tags function, these are also now taken care of.
generichtml.com
<html><body><h1> This is the title </h1><p> some description text here, <b>this</b> is a word. </p></body></html>
parser.php
<?php
// Assumed: URL of the sample page shown above
$htmlurl = 'http://generichtml.com/';

// Fetch remote html
$contents = file_get_contents($htmlurl);

// Get rid of style, script etc
$search = array(
    '@<script[^>]*?>.*?</script>@si',  // Strip out javascript
    '@<head>.*?</head>@si',            // Lose the head section
    '@<style[^>]*?>.*?</style>@si',    // Strip style tags properly
    '@<![\s\S]*?--[ \t\n\r]*>@'        // Strip multi-line comments including CDATA
);
$contents = preg_replace($search, '', $contents);

// Strip the remaining tags, split the text into words, and count each unique word
$result = array_count_values(
    str_word_count(
        strip_tags($contents), 1
    )
);

print_r($result);
?>
Output:
Array
(
    [This] => 1
    [is] => 2
    [the] => 1
    [title] => 1
    [some] => 1
    [description] => 1
    [text] => 1
    [here] => 1
    [this] => 1
    [a] => 1
    [word] => 1
)