How to Identify, Disable and Prevent Scrapers from Thieving Your Blogs’ Content

The first time I saw one of my blog posts on another site, I was enraged. “They stole my content!” It was copied verbatim; they could have at least changed the wording a bit. Then I realized that they were tapping into my RSS feed to take the content, strip out its links, and repost it without credit.

Unless you religiously monitor your analytics and server logs, you may not even know that your content has been stolen. I found someone copying my content when I viewed trackbacks on another blog that supposedly linked to mine but actually linked to the thief’s website. If you have an active blogging community that is cranking out high-quality content, you need to be able to give your bloggers some protection against scrapers. Here’s how:

How To Identify Scraper Sites:

  • No RSS feed available
  • Many quality posts that contain no links
  • Many quality posts but very low subscriber count
  • Great content but with zero comments on any posts
  • Lots of good content but with lots of Adsense or other ads
  • No “About” page or business information
  • And the number one brain-dead giveaway: no contact form or email address

 (Source for list above: Perishable Press)

How to Diagnose the Scraper’s Source:

Monitor your incoming links.

In your WordPress dashboard you will see a section for “Incoming Links.” This is probably the easiest way to track down content thieves.

Monitor your bandwidth.

Log analysis software is essential if you want to find the source of your most persistent bots. If you have cPanel, you have Analog, AWStats, and Webalizer available to help you identify the evil bots crawling your server. Dig into the data and find out their IP addresses. Then you can block them in your .htaccess file.
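
If you would rather dig through the raw access log yourself, here is a minimal sketch in Python. It assumes your server writes the usual Common or Combined Log Format, where each line starts with the client IP address. The function name top_clients and the sample log lines are made up for illustration; the addresses come from ranges reserved for documentation.

```python
import re
from collections import Counter

# Start of a Common/Combined Log Format line:
# client IP, identity, user, [timestamp], then the quoted request.
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "')

def top_clients(lines, limit=10):
    """Return the busiest client addresses as (ip, hits) pairs, busiest first."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match:  # silently skip lines that are not in the expected format
            hits[match.group(1)] += 1
    return hits.most_common(limit)

# Hypothetical sample lines; in practice, pass an open log file instead.
sample = [
    '203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] "GET /feed/ HTTP/1.1" 200 5120',
    '198.51.100.4 - - [10/Oct/2023:13:55:40 +0000] "GET / HTTP/1.1" 200 1024',
    '203.0.113.7 - - [10/Oct/2023:13:56:36 +0000] "GET /feed/ HTTP/1.1" 200 5120',
]

for ip, count in top_clients(sample):
    print(count, ip)
```

An address that hammers your feed URL far more often than any real reader would is a good candidate for a closer look, though a high count alone does not prove scraping (search engine crawlers are busy too).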


Monitor your trackbacks.

Scrapers will often send a “ping” to your site for a trackback. You will generally be notified in the same way as if it were a comment. Check out their content, verify the theft, and then grab their IP address and block it in order to prevent them from returning to your site.
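
Verifying the theft by eye gets tedious if you have many posts. As a rough aid, the sketch below measures how much of your post's wording reappears in a suspect page by comparing overlapping five-word sequences. The function names and the five-word window are my own choices for illustration, and the suspect text is assumed to be plain text you have already copied out of the page.

```python
import re

def shingles(text, size=5):
    """Return the set of overlapping `size`-word sequences found in `text`."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}

def copied_fraction(original, suspect, size=5):
    """Fraction of the original's word sequences that also appear in the suspect."""
    source = shingles(original, size)
    if not source:  # original shorter than one window: nothing to compare
        return 0.0
    return len(source & shingles(suspect, size)) / len(source)

# Illustrative texts only.
mine = "The quick brown fox jumps over the lazy dog"
theirs = "Posted today: the quick brown fox jumps over the lazy dog. Click our ads!"
print(copied_fraction(mine, theirs))
```

A value near 1.0 means nearly every phrase of your post shows up in the other page; a value near 0.0 means the two texts share almost no wording.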

How to Disable and Prevent Scrapers

Find the IP address of your scraper and then deny it access in your .htaccess file, putting one address after each “deny from”:

order allow,deny
deny from
deny from
deny from
allow from all
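
If your block list grows, typing the lines by hand invites typos. Here is a minimal sketch that builds the same kind of block from a list of addresses and rejects anything that is not a valid IP address or CIDR range. The function name deny_block is hypothetical, and the addresses shown are documentation placeholders, not real scrapers. Note that these order/deny/allow directives are the older Apache 2.2 style; Apache 2.4 still accepts them through mod_access_compat.

```python
import ipaddress

def deny_block(addresses):
    """Build an order/deny/allow block with one 'deny from' line per address."""
    lines = ["order allow,deny"]
    for raw in addresses:
        # Raises ValueError for anything that is not an IP address or CIDR range.
        network = ipaddress.ip_network(raw.strip(), strict=False)
        if network.num_addresses == 1:
            lines.append(f"deny from {network.network_address}")  # single host
        else:
            lines.append(f"deny from {network}")  # whole range, e.g. a /24
    lines.append("allow from all")
    return "\n".join(lines)

# Placeholder addresses from ranges reserved for documentation.
print(deny_block(["203.0.113.7", "198.51.100.0/24"]))
```

Paste the printed lines into your .htaccess file and keep a backup of the old one, since a malformed line can take the whole site down with a server error.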

Crack down on hotlinking:

RewriteEngine on
RewriteCond %{HTTP_REFERER} ^http://.*lame-bandwidth-theft\.com [NC]
RewriteRule .* - [F]

Use partial feeds.

This may not appeal to you, but you will attract fewer scrapers if your feed carries only excerpts or titles. It is one measure that will seriously curb the amount of content being scraped from your site.

Contact the Scraper Directly

This might seem ridiculous, but if you ask them to remove your content, they may comply. If they do not, you can report the theft: file a formal DMCA notice with each of the major search engines.

Add Copyright Information to Your RSS Footer

Make sure that anyone who reads your content knows the original source. You can customize your RSS footer using this WordPress plugin.