WPMU robots.txt done globally

I know there are several robots.txt plugins out there that you could add to your plugins collection within your MU install, and they work very well. They allow each individual blog to customize its own robots.txt file. But in my opinion, if you are managing an MU install, there are several pitfalls to this method.

1. You have to activate the plugin

Let’s say you manage all of the blogs on your MU install. Yes, I know you could use Plugin Commander to auto-activate the plugin, or to activate or deactivate it for every blog, but I still find this somewhat of a manual job, and anything I can do to automate my work makes my life that much easier.

2. You have to modify the plugin to customize the robots.txt file

If you want to customize the robots.txt file, you either have to go into each individual blog or modify the plugin so that the customization is already there when it is activated. Neither method is preferable.

So why would you want to create a global robots.txt file for your MU install?

1. Same directory structure for every blog

Because you are running multiple blogs off the same install, the directory structure for every blog will be exactly the same.

2. Many of your users might not have a clue as to what a robots.txt file is

Again, if you are managing the MU install and want to offer the best platform for your users, why not help them out by having a robots.txt file already in place the moment their blog is created?

3. Better protect yourself

I see a lot of wp-login.php pages appearing in the search engines. Not that this is extremely bad, but as the main admin, you can feel more secure knowing that every blog is blocking the spiders from crawling pages that you don’t want crawled.

4. The main site’s SEO affects everyone’s SEO

Whether you are using subdomains or subfolders for your domain structure, the SEO authority of your main blogs will affect every other blog’s SEO authority. Look at wordpress.com: you can sign up for a wordpress.com blog and instantly have a better SEO start than someone starting out on their own domain, because a blog at jason.wordpress.com picks up some authority simply by being associated with the wp.com domain.

So how can you do this?

First, create a file called global.php. It goes in your mu-plugins directory, and we are going to add all of our global settings to it.
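
As a rough sketch (the header comment is optional and the description text is just a placeholder; any .php file placed directly in mu-plugins is loaded automatically by MU), global.php can start out as simply as:

<?php
/*
Plugin Name: Global Settings
Description: Site-wide settings applied to every blog on this MU install.
*/

// The robots.txt function below, and any other global tweaks, live in this file.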

Next we actually want to create our robots.txt function.


function my_global_robots_function() {
	global $wpdb;

	// Grab the current blog's ID so the Allow rule points at the right upload path.
	$blog = $wpdb->blogid;

	echo "Disallow: /wp-admin\n";
	echo "Disallow: /wp-includes\n";
	echo "Disallow: /wp-login.php\n";
	echo "Disallow: /wp-content/plugins\n";
	echo "Disallow: /wp-content/cache\n";
	echo "Disallow: /wp-content/themes\n";
	echo "Disallow: /trackback\n";
	echo "Disallow: /comments\n";
	echo "Disallow: */trackback\n";
	echo "Disallow: */comments\n";
	echo "Disallow: /*?*\n";
	echo "Disallow: /*?\n";
	echo "Allow: /wp-content/blogs.dir/" . $blog . "/files/*\n\n";

	// Point the spiders at this blog's sitemap.
	echo "Sitemap: " . get_bloginfo('url') . "/sitemap.xml";
}

We start by grabbing the blog ID to ensure that we direct the spiders to the correct upload path for the blog’s images.

And finally, we grab the blog’s URL to point the spiders at the correct sitemap location.

Now, to tie everything together, we hook into the WordPress do_robots action.


add_action('do_robots', 'my_global_robots_function');

Now, any time someone requests the robots.txt file for any of your blogs, it will already be set up, blocking the spiders from crawling any content you do not want crawled.
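
Just to illustrate the per-blog pieces, assuming a hypothetical blog with ID 3 whose URL is http://blog3.example.com (both made up for this example, and keeping in mind that core’s do_robots() prints its own User-agent and Disallow lines as well), the blog-specific tail of that blog’s robots.txt output from the function above would be:

Allow: /wp-content/blogs.dir/3/files/*

Sitemap: http://blog3.example.com/sitemap.xml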

Here are some live examples:

San Diego Real Estate, Sacramento Real Estate, Indianapolis Real Estate

Some information on robot files:

Ask Apache, Daily Blog Tips

What say you?


Comments (14)

  1. Probably should add in the feed URLs as well. I know most SEO folks suggest keeping feeds out of Google since they’re not normally meant to be seen by people.

    Do you really need the blogs.dir line in there? It’s already being rewritten by MU so that it looks just like a regular WordPress install. I know on the MU installs we host, there’s no problem with images being indexed, and they get indexed under the rewritten URLs.

    The sitemap line depends on how you’re serving the sitemaps. A long while back, I wrote up a method on the MU forums for getting Arne’s plugin to place the sitemaps so they look like they are in the root of each blog. (We duplicated the .htaccess line and the wp-content/blogs.php file but modified them to serve the sitemaps from their own subdirectory. Works fine for us.)

    We just use a regular robots.txt file in the root of the install. Works fine for us.

  2. @Dr Mike:

    Yeah, probably a good call on the blogs.dir line. My impression was that there was no need for an Allow line anyway; everything gets crawled unless you disallow it. I got that line from askapache.com.

    So if you just have a robots.txt file in the root directory, there is no way to customize the sitemap URL per blog? Maybe that’s not needed either…

    As for the sitemap:

    I add a couple of lines to my .htaccess file:

    RewriteRule ^(.*/)?sitemap.xml wp-content/blogs.php?file=sitemap.xml [L]
    RewriteRule ^(.*/)?sitemap.xml.gz wp-content/blogs.php?file=sitemap.xml.gz [L]

    Then I manually write the sitemaps to the appropriate blogs.dir folder.
