
How to detect browser spoofing and robots from a user agent string in php


In addition to filtering keywords in the user agent string, I have had luck with putting a hidden honeypot link on all pages:

<a style="display:none" href="autocatch.php">A</a>

Then in "autocatch.php" record the session (or IP address) as a bot. This link is invisible to users but it's hidden characteristic would hopefully not be realized by bots. Taking the style attribute out and putting it into a CSS file might help even more.


Because, as previously stated, both user agents and IP addresses can be spoofed, neither can be used for reliable bot detection.

I work for a security company and our bot detection algorithm looks something like this:

  1. Step 1 - Gathering data:

    a. Cross-check the user agent against the IP (both need to be consistent).

    b. Check header parameters (what is missing, what order they appear in, etc.)

    c. Check behavior (early access to and compliance with robots.txt, general behavior, number of pages visited, visit rates, etc.)

  2. Step 2 - Classification:

    By cross-verifying the data, the visitor is classified as "Good", "Bad" or "Suspicious".

  3. Step 3 - Active Challenges:

    Suspicious bots undergo the following challenges (a rough PHP sketch follows this list):

    a. JS challenge (can it execute JavaScript?)

    b. Cookie challenge (can it accept cookies?)

    c. If still not conclusive -> CAPTCHA
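
As a rough illustration of the cookie and JS challenges in PHP (the cookie name and the flow are made up for this sketch; a real service does considerably more):

<?php
// challenge.php -- naive cookie + JS challenge (illustrative names only).
session_start();

if (!isset($_COOKIE['challenge'])) {
    // Set a cookie and ask the browser to reload the page via JavaScript.
    // A client that cannot store cookies or execute JS never returns with
    // the cookie and remains "suspicious".
    setcookie('challenge', bin2hex(random_bytes(8)), time() + 300);
    echo '<script>location.reload();</script>';
    echo '<noscript>Please enable JavaScript and cookies.</noscript>';
    exit;
}

// The cookie came back, so the client accepted cookies and executed the
// reload script; treat it as a browser. Otherwise fall back to a CAPTCHA.
$_SESSION['passed_challenge'] = true;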

This filtering mechanism is VERY effective, but I don't think it could be replicated by a single person or even an unspecialized provider (for one thing, the challenges and the bot DB need to be constantly updated by a security team).

We offer a sort of "do it yourself" tool in the form of Botopedia.org, our directory, which can be used for IP/user-name cross-verification, but for a truly effective solution you will have to rely on specialized services.

There are several free bot monitoring solutions, including our own, and most use the same strategy I've described above (or a similar one).

GL


Beyond just comparing user agents, you would keep a log of activity and look for robot behavior. Often this includes requests for /robots.txt and never loading images. Another trick is to ask the client whether it has JavaScript enabled, since most bots won't report it as enabled.
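
A simple way to start collecting that kind of activity log in PHP is sketched below; the SQLite file and table layout are assumptions for illustration:

<?php
// log_request.php -- include at the top of every page (and in handlers for
// robots.txt and image requests) to build a per-IP behavior profile.
$db = new PDO('sqlite:' . __DIR__ . '/requests.sqlite');
$db->exec('CREATE TABLE IF NOT EXISTS hits (ip TEXT, path TEXT, ua TEXT, ts INTEGER)');

$stmt = $db->prepare('INSERT INTO hits (ip, path, ua, ts) VALUES (?, ?, ?, ?)');
$stmt->execute([
    $_SERVER['REMOTE_ADDR'],
    $_SERVER['REQUEST_URI'],
    isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '',
    time(),
]);

// Later analysis: an IP that fetched /robots.txt, never requested an image,
// and hit many pages per minute is probably a bot.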

However, beware: you may well accidentally flag some visitors who are genuinely people.