How to prevent scraping my blog's updates?

wordpress


You can't really stop them in the end, but you might be able to identify them and mess with them. Try hiding the requesting IP address in each page (in an HTML comment, in white-on-white text, or just somewhere out of the way), then check which IPs show up in the copies. You can also obfuscate that text by encoding it as a hex string, or make it look like an error code, so that someone who doesn't know what to look for won't catch on to what you're doing.
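A minimal sketch of that watermark as a WordPress content filter; the function name, the comment format, and the hex encoding are all illustrative choices, not a standard API:

```php
<?php
// Sketch: stamp each post with the requesting IP, hex-encoded so it
// reads like an opaque reference code rather than an address.
function watermark_with_request_ip( $content ) {
    $ip    = isset( $_SERVER['REMOTE_ADDR'] ) ? $_SERVER['REMOTE_ADDR'] : '';
    $token = bin2hex( $ip ); // "203.0.113.9" -> "3230332e302e3131332e39"

    // Hide the token in an HTML comment at the end of every post.
    return $content . "\n<!-- ref:" . $token . " -->";
}
add_filter( 'the_content', 'watermark_with_request_ip' );
```

When a copied page turns up elsewhere, hex-decoding the token recovers the IP that fetched it from you.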

In the end, though, I'm not sure how much it will buy you. If they're really inattentive, then rather than shutting them down and calling attention to the fact that you're onto them, you can serve gibberish whenever one of their IPs crops up. That might be fun, and it's not hard to build a gibberish generator by feeding sample text into a Markov chain.
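A sketch of such a generator, using a simple word-level Markov chain; the function names are illustrative, and you'd feed it text from your own posts:

```php
<?php
// Sketch: word-level Markov chain. build_chain() maps each word in the
// sample to the list of words that follow it; generate_gibberish() walks
// that map randomly to produce plausible-looking nonsense.
function build_chain( $sample ) {
    $words = preg_split( '/\s+/', trim( $sample ) );
    $chain = array();
    for ( $i = 0; $i < count( $words ) - 1; $i++ ) {
        $chain[ $words[ $i ] ][] = $words[ $i + 1 ];
    }
    return $chain;
}

function generate_gibberish( $chain, $length = 50 ) {
    $word = array_rand( $chain );          // random starting word
    $out  = array( $word );
    for ( $i = 1; $i < $length; $i++ ) {
        if ( empty( $chain[ $word ] ) ) {  // dead end: restart anywhere
            $word = array_rand( $chain );
        } else {
            $word = $chain[ $word ][ array_rand( $chain[ $word ] ) ];
        }
        $out[] = $word;
    }
    return implode( ' ', $out );
}
```

You'd then swap this output in for the real post body whenever the request comes from one of the scraper's IPs.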

EDIT: Oh, and if the pages aren't rewritten too heavily and the scraper doesn't strip scripts, you might be able to add some inline JS that links back to you: say, a banner that only appears when the page isn't being viewed on your site, giving the original link to your article and suggesting that people read it there.
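One way to wire that up is a WordPress filter that appends the inline script to every post; `myblog.example.com` is a placeholder for your real hostname, and all names here are illustrative:

```php
<?php
// Sketch: append an inline script that shows a "read the original"
// banner whenever the page is served from some other domain.
function append_canonical_banner( $content ) {
    $script = <<<'JS'
<script>
(function () {
    if (window.location.hostname !== 'myblog.example.com') {
        var banner = document.createElement('div');
        banner.innerHTML = 'This article was copied. Read the original at ' +
            '<a href="https://myblog.example.com/">myblog.example.com</a>.';
        document.body.insertBefore(banner, document.body.firstChild);
    }
})();
</script>
JS;
    return $content . "\n" . $script;
}
add_filter( 'the_content', 'append_canonical_banner' );
```

On your own site the hostname check fails and nothing appears; on a scraped copy the banner shows at the top of the page.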


Are you willing to shut down your RSS feed? If so, you could do something like this:

function fb_disable_feed() {
    wp_die( __( 'No feed available, please visit our <a href="' . get_bloginfo('url') . '">homepage</a>!' ) );
}
add_action('do_feed', 'fb_disable_feed', 1);
add_action('do_feed_rdf', 'fb_disable_feed', 1);
add_action('do_feed_rss', 'fb_disable_feed', 1);
add_action('do_feed_rss2', 'fb_disable_feed', 1);
add_action('do_feed_atom', 'fb_disable_feed', 1);

It means that if you go to a feed page, it just returns the message passed to wp_die() on line two. We use it in the 'free' versions of our WP software, wrapped in an if-statement, so those installs can't hook into their RSS feeds to link to their main website. It's an upsell opportunity for us; my point is that it works well, haha.
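That if-statement gate might look something like the following; `FB_IS_FREE_VERSION` is a hypothetical flag standing in for however your build distinguishes free copies:

```php
<?php
// Sketch: only kill the feeds on free-version installs.
// FB_IS_FREE_VERSION is a hypothetical constant, not a WordPress API.
if ( defined( 'FB_IS_FREE_VERSION' ) && FB_IS_FREE_VERSION ) {
    add_action( 'do_feed',      'fb_disable_feed', 1 );
    add_action( 'do_feed_rdf',  'fb_disable_feed', 1 );
    add_action( 'do_feed_rss',  'fb_disable_feed', 1 );
    add_action( 'do_feed_rss2', 'fb_disable_feed', 1 );
    add_action( 'do_feed_atom', 'fb_disable_feed', 1 );
}
```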


Even though this post is a little old, I thought it would still be helpful to weigh in, in case other people find it with the same question. Since you've eliminated the RSS feed from the mix and you're pretty confident it isn't a manual effort, what you need to do is stop the bots they are using.

First, I would recommend blocking proxy servers in your iptables rules. You can get a list of known proxy server addresses from Maxmind. This should limit their ability to anonymize themselves.
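A sketch of turning such a list (one IP or CIDR per line) into iptables DROP rules; the function name and the `proxies.txt` filename are illustrative:

```shell
# Sketch: emit an iptables DROP rule for each proxy address on stdin.
generate_drop_rules() {
    while read -r ip; do
        [ -n "$ip" ] && echo "iptables -A INPUT -s $ip -j DROP"
    done
}

# Typical use, piping the generated rules to a root shell:
#   generate_drop_rules < proxies.txt | sudo sh
```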

Second, make it harder for them to scrape. You could accomplish this in a couple of ways. You could render part or all of your site in JavaScript; if nothing else, render at least the links in JavaScript. This will make it significantly harder for them to scrape you. Alternatively, you can put your content within an iframe inside the pages, which will also make it somewhat harder to crawl and scrape.
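The links-in-JavaScript idea could be sketched as a pair of WordPress hooks; the `data-l` attribute and function names are illustrative, and a scraper that does execute JavaScript will still see the links:

```php
<?php
// Sketch: strip hrefs out of the served HTML and rebuild them client-side,
// so a scraper that doesn't run JavaScript finds no crawlable links.
function obfuscate_links( $content ) {
    return preg_replace_callback(
        '/<a\s+href="([^"]+)"/i',
        function ( $m ) {
            // Replace the href with a base64-encoded data attribute.
            return '<a data-l="' . base64_encode( $m[1] ) . '"';
        },
        $content
    );
}
add_filter( 'the_content', 'obfuscate_links' );

function restore_links_script() {
    // Decode the data attributes back into real hrefs in the browser.
    echo '<script>
    document.querySelectorAll("a[data-l]").forEach(function (a) {
        a.href = atob(a.getAttribute("data-l"));
    });
    </script>';
}
add_action( 'wp_footer', 'restore_links_script' );
```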

All this said, if they really want your content they will get past these traps fairly easily. Honestly, fighting off web scrapers is an arms race: you cannot put a static trap in place and be done with it; instead, you have to continuously evolve your tactics.

For full disclosure, I am a co-founder of Distil Networks, and we offer an anti-scraping solution as a service.