
Counting words on an HTML web page using PHP


The one-liner below does a case-insensitive word count after stripping all HTML tags from your string.


print_r(array_count_values(str_word_count(strip_tags(strtolower($str)), 1)));

To grab the source code of a page, you can use cURL or file_get_contents():

$str = file_get_contents('http://www.example.com/');
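The cURL route mentioned above could look roughly like the sketch below. The helper name fetch_html, the 10-second timeout and the choice to follow redirects are assumptions made for illustration, not part of the original answer.

```php
<?php
// Hypothetical helper: fetch a page body with cURL, or throw on failure.
function fetch_html(string $url): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow HTTP redirects
        CURLOPT_TIMEOUT        => 10,    // seconds; an arbitrary choice
    ]);

    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('cURL error: ' . curl_error($ch));
    }
    return $body;
}

// Usage (needs network access and the curl extension):
// $str = fetch_html('http://www.example.com/');
```

Compared with file_get_contents(), this gives explicit control over timeouts and redirects, and it still works when allow_url_fopen is disabled.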

From inside out:

  1. Use strtolower() to make everything lower case.
  2. Strip the HTML tags using strip_tags().
  3. Create an array of the words used with str_word_count(). Passing 1 as the second argument makes it return an array containing all the words found inside the string.
  4. Use array_count_values() to count the occurrences of each value in your array of words, so that words used more than once are captured.
  5. Use print_r() to display the results.
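The same steps can be written out one per line, which may be easier to follow than the nested one-liner. This is only a sketch: the function name count_words_in_html and the sample string are made up for illustration.

```php
<?php
// Hypothetical helper wrapping the one-liner above, one step per line.
function count_words_in_html(string $html): array
{
    $lower = strtolower($html);         // 1. make everything lower case
    $text  = strip_tags($lower);        // 2. strip the HTML tags
    $words = str_word_count($text, 1);  // 3. format 1 returns the list of words
    return array_count_values($words);  // 4. word => number of occurrences
}

// 5. display the result for a small sample string
print_r(count_words_in_html('<p>Hello <b>hello</b> world</p>'));
```

Here "Hello" and "hello" end up under the same key because the string is lower-cased before counting.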


The script below reads the contents of the remote URL, removes the HTML tags, and counts the occurrences of each unique word therein.

Caveat: In your expected output, "This" has a value of 2, but the script below is case-sensitive, so "this" and "This" are recorded as separate words. You could convert the whole input string to lower case before processing if the original case is not significant for your purposes.

Additionally, as only a basic strip_tags() is run on the input, malformed tags will not be removed, so the assumption is that your source HTML is valid.

Edit: Charlie points out in the comments that things like the head section will still be counted. With the help of a function defined in the user notes of the strip_tags function, these are also now taken care of.

generichtml.com

<html><body><h1> This is the title </h1><p> some description text here, <b>this</b> is a word. </p></body></html>

parser.php

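<?php
// URL of the page to parse (assumed here; the original snippet does not show where $htmlurl is set)
$htmlurl = 'http://generichtml.com/';

// Fetch remote html
$contents = file_get_contents($htmlurl);

// Get rid of style, script etc
// The patterns rely on lazy ".*?" quantifiers, so the U (ungreedy) modifier
// is not used: it would flip them back to greedy.
$search = array(
    '@<script[^>]*?>.*?</script>@si',  // Strip out javascript
    '@<head>.*?</head>@si',            // Lose the head section
    '@<style[^>]*?>.*?</style>@si',    // Strip style tags properly
    '@<![\s\S]*?--[ \t\n\r]*>@'        // Strip multi-line comments including CDATA
);
$contents = preg_replace($search, '', $contents);

$result = array_count_values(
    str_word_count(
        strip_tags($contents), 1
    )
);

print_r($result);
?>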

Output:

Array
(
    [This] => 1
    [is] => 2
    [the] => 1
    [title] => 1
    [some] => 1
    [description] => 1
    [text] => 1
    [here] => 1
    [this] => 1
    [a] => 1
    [word] => 1
)
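For the caveat about malformed markup, one alternative to the regex-plus-strip_tags() approach is to let PHP's DOM extension parse the page, since its HTML parser tolerates unclosed tags. The sketch below is an assumption-laden illustration, not the answer's method: the helper name visible_text is hypothetical, it requires the dom extension, and loadHTML() may misread non-ASCII pages that do not declare their encoding.

```php
<?php
// Hypothetical helper: extract the visible text with DOMDocument instead of strip_tags().
function visible_text(string $html): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);  // collect parser warnings instead of printing them
    $doc->loadHTML($html);
    libxml_clear_errors();

    // Remove elements whose content is not page text.
    foreach (['script', 'style', 'head'] as $tag) {
        // Copy the live node list first, because removing nodes changes it.
        foreach (iterator_to_array($doc->getElementsByTagName($tag)) as $node) {
            $node->parentNode->removeChild($node);
        }
    }

    return $doc->documentElement->textContent;
}

$html = '<html><head><title>Ignored title</title><style>p { color: red }</style></head>'
      . '<body><h1> This is the title </h1><p> some text, <b>this</b> is a word. </p>'
      . '<script>var hidden = 1;</script></body></html>';

// Lower-casing first merges "This" and "this" into a single key.
print_r(array_count_values(str_word_count(strtolower(visible_text($html)), 1)));
```

As with strip_tags(), words in adjacent elements with no whitespace between them are run together, so real pages may need extra spacing inserted between block elements.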


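The previous code is a starting point. The next step is to delete the HTML tags with regular expressions. Look at the ereg and eregi functions (note that both were deprecated in PHP 5.3 and removed in PHP 7; preg_match() and preg_replace() are the PCRE replacements). Some other tricks are required for style and script tags: you have to remove their content as well. Periods and commas have to be removed too.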
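Removing periods and commas can be folded into the word-splitting step. A minimal sketch, assuming the tags have already been stripped and using PCRE's preg_split(); the helper name tokenize_words and the decision to keep apostrophes and digits inside words are illustrative choices.

```php
<?php
// Hypothetical helper: split plain text into lower-case words, discarding punctuation.
function tokenize_words(string $text): array
{
    // Use the multibyte-aware function when the mbstring extension is available.
    $lower = function_exists('mb_strtolower')
        ? mb_strtolower($text, 'UTF-8')
        : strtolower($text);

    // Split on every run of characters that is not a letter, a digit or an apostrophe.
    $words = preg_split('/[^\p{L}\p{N}\']+/u', $lower, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split() returns false on failure, for example on invalid UTF-8 input.
    return $words === false ? [] : $words;
}

$counts = array_count_values(tokenize_words('Points, commas. And points!'));
arsort($counts);  // most frequent words first
print_r($counts);
```

Unlike str_word_count(), which is locale-dependent, the /u pattern treats accented letters as part of a word when the input is valid UTF-8.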