Counting words on an HTML web page using PHP

The one-liner below does a case-insensitive word count after stripping all HTML tags from your string.

print_r(array_count_values(str_word_count(strip_tags(strtolower($str)), 1)));

To grab the source code of a page, you can use cURL or file_get_contents():

$str = file_get_contents('http://www.example.com/');
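
The cURL route mentioned above is not shown in the answer, so here is a minimal sketch of it. The helper name fetch_page() and the 10-second timeout are assumptions made for this illustration, and the code needs PHP's cURL extension to be enabled.

```php
<?php
// Hypothetical helper (not part of the original answer):
// returns the page body as a string, or null if the request fails.
function fetch_page(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow HTTP redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // assumed timeout, in seconds
    $body = curl_exec($ch);                         // false on failure
    return $body === false ? null : $body;
}
```

Calling $str = fetch_page('http://www.example.com/'); would then give you the same kind of string as file_get_contents(), with the difference that a failed request yields null rather than false plus a warning.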

From the inside out:

Use strtolower() to make everything lower case.
Strip the HTML tags using strip_tags().
Create an array of the words used with str_word_count(). Passing 1 as the second argument makes it return an array containing all the words found inside the string.
Use array_count_values() to count the occurrences of each value in your array of words, which captures words used more than once.
Use print_r() to display the results.
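
The steps above can also be written out one per line, which makes the one-liner easier to follow. This is only a sketch: the function name count_words() and the sample string are made up for the illustration.

```php
<?php
// Hypothetical helper that unrolls the one-liner step by step.
function count_words(string $html): array
{
    $lower = strtolower($html);         // 1. make everything lower case
    $text  = strip_tags($lower);        // 2. strip the HTML tags
    $words = str_word_count($text, 1);  // 3. format 1: array of the words found
    return array_count_values($words);  // 4. word => number of occurrences
}

// Sample input invented for this example.
$str = '<p>The cat and the <b>Cat</b></p>';
print_r(count_words($str)); // the => 2, cat => 2, and => 1
```

Note that strtolower() works byte by byte, so for pages containing non-ASCII letters you may want mb_strtolower() instead.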

php html scripting bots

The script below reads the contents of the remote URL, removes the HTML tags, and counts the occurrences of each unique word therein.

Caveat: In your expected output, "This" has a value of 2, but the script below is case-sensitive, so "this" and "This" are recorded as separate words. You could convert the whole input string to lower case before processing if the original case is not significant for your purposes.

Additionally, as only a basic strip_tags() is run on the input, malformed tags will not be removed, so the assumption is that your source HTML is valid.

Edit: Charlie points out in the comments that things like the head section will still be counted. With the help of a function defined in the user notes of the strip_tags function, these are also now taken care of.

generichtml.com

<html><body><h1> This is the title </h1><p> some description text here, <b>this</b> is a word. </p></body></html>

parser.php

// Address of the page to fetch (the sample page shown above)
$htmlurl = 'http://generichtml.com';

// Fetch remote html
$contents = file_get_contents($htmlurl);

// Get rid of style, script etc
$search = array(
    '@<script[^>]*?>.*?</script>@si',  // Strip out javascript
    '@<head>.*?</head>@si',            // Lose the head section
    '@<style[^>]*?>.*?</style>@si',    // Strip style tags properly
    '@<![\s\S]*?--[ \t\n\r]*>@'        // Strip multi-line comments including CDATA
);
$contents = preg_replace($search, '', $contents);

// Strip the remaining tags, split into words, and count each word
$result = array_count_values(
    str_word_count(
        strip_tags($contents), 1
    )
);

print_r($result);

Output:

Array
(
    [This] => 1
    [is] => 2
    [the] => 1
    [title] => 1
    [some] => 1
    [description] => 1
    [text] => 1
    [here] => 1
    [this] => 1
    [a] => 1
    [word] => 1
)


The previous code is a starting point. The next step is to delete the HTML tags with regular expressions. The ereg and eregi functions once suggested for this were removed in PHP 7.0, so use the preg_* functions such as preg_replace() instead. Style and script tags need extra handling, because their content has to be removed as well. Periods and commas have to be removed too.
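
As a rough sketch of that regular-expression approach, the function below removes script and style elements together with their content, then the remaining tags, then the punctuation. The name words_from_html() is hypothetical, and the patterns are deliberately simple, so they assume reasonably well-formed, UTF-8 encoded HTML.

```php
<?php
// Hypothetical helper: returns the list of words found in an HTML string.
function words_from_html(string $html): array
{
    // Remove <script> and <style> elements including their content.
    $html = preg_replace('@<(script|style)\b[^>]*>.*?</\1>@si', '', $html);
    // Replace the remaining tags with a space so adjacent words do not merge.
    $text = preg_replace('@<[^>]+>@', ' ', $html);
    // Replace everything except letters, digits, and apostrophes
    // (periods, commas, and so on) with a space.
    $text = preg_replace("@[^\\p{L}\\p{N}']+@u", ' ', $text);
    // Split on whitespace, dropping empty entries.
    return preg_split('@\s+@', trim($text), -1, PREG_SPLIT_NO_EMPTY);
}
```

Passing the result to array_count_values() then gives the per-word counts, as in the other answers.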
