Sitemap.xml and robots.txt

What is the recommended way of handling Sitemaps with WPMU?

It seems that every site reads the same robots.txt from the document root.

How can Google or other spiders find each site's Sitemap.xml?
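
(For what it's worth: when no physical robots.txt file sits in the document root, WordPress answers /robots.txt requests itself via do_robots(), and newer versions expose a robots_txt filter on that output. A rough mu-plugin sketch of appending a per-blog Sitemap line that way, assuming each blog publishes its sitemap at /sitemap.xml:)

     // Sketch: append a per-blog Sitemap line to the virtual robots.txt
     // that WordPress builds in do_robots(). This only fires when no
     // physical robots.txt file in the document root shadows it.
     function per_blog_sitemap_in_robots( $output, $public ) {
          if ( '0' != $public ) {	// skip blogs that have blocked indexing
               // Assumption: a sitemap plugin publishes /sitemap.xml for each blog.
               $output .= "\nSitemap: " . get_bloginfo( 'url' ) . "/sitemap.xml\n";
          }
          return $output;
     }
     add_filter( 'robots_txt', 'per_blog_sitemap_in_robots', 10, 2 );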


  • drmike

If you use the kb-robots.txt plugin, you can include a Sitemap line in its default output if you hard-code it, although I'm still concerned about the plugin's security.

    We whacked it a bit like this:

function build_robots_txt() {
     global $wpdb;
     $blog = $wpdb->blogid;	// ID of the blog currently being served
     $kb_defaultrobotstxt = "# This is your robots.txt file. Visit Options->Robots.txt to change this text.";
     $kb_defaultrobotstxt .= "\n";
     $kb_defaultrobotstxt .= "User-agent: *\n";	// the rules below need a User-agent record to apply
     $kb_defaultrobotstxt .= "Disallow: /cgi-bin/\n";
     $kb_defaultrobotstxt .= "Disallow: /wp-admin/\n";
     $kb_defaultrobotstxt .= "Disallow: /wp-includes/\n";
     $kb_defaultrobotstxt .= "Disallow: /wp-login.php\n";
     $kb_defaultrobotstxt .= "Disallow: /wp-content/plugins/\n";
     $kb_defaultrobotstxt .= "Disallow: /wp-content/mu-plugins/\n";
     $kb_defaultrobotstxt .= "Disallow: /wp-content/cache/\n";
     $kb_defaultrobotstxt .= "Disallow: /wp-content/themes/\n";
     $kb_defaultrobotstxt .= "Disallow: /trackback/\n";
     $kb_defaultrobotstxt .= "Disallow: /comments/\n";
     $kb_defaultrobotstxt .= "Disallow: */trackback/\n";
     $kb_defaultrobotstxt .= "Disallow: */comments/\n";
     $kb_defaultrobotstxt .= "Disallow: /feed/\n";
     $kb_defaultrobotstxt .= "Disallow: /rss/\n";
     $kb_defaultrobotstxt .= "Disallow: */feed/\n";
     $kb_defaultrobotstxt .= "Disallow: */rss/\n";
     $kb_defaultrobotstxt .= "Disallow: /*?*\n";
     $kb_defaultrobotstxt .= "Disallow: /*?\n";
     // Re-allow this blog's own uploaded files, then point spiders at its sitemap.
     $kb_defaultrobotstxt .= "Allow: /wp-content/blogs.dir/" . $blog . "/files/*\n\n";
     $kb_defaultrobotstxt .= "Sitemap: " . get_bloginfo('url') . "/sitemap.xml";
     return $kb_defaultrobotstxt;
}

$kb_defaultrobotstxt = build_robots_txt();
add_option("kb_robotstxt", $kb_defaultrobotstxt, "Contents of robots.txt", 'no');	// seed the default value; the third argument (description) is deprecated
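
    For a blog with ID 3 whose URL is http://example.com/blog3 (both hypothetical), the seeded default comes out as:

         # This is your robots.txt file. Visit Options->Robots.txt to change this text.
         User-agent: *
         Disallow: /cgi-bin/
         Disallow: /wp-admin/
         # ...remaining Disallow lines as above...
         Disallow: /*?
         Allow: /wp-content/blogs.dir/3/files/*

         Sitemap: http://example.com/blog3/sitemap.xml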
  • Ovidiu


I played around with this robots.txt plugin because users can easily (with a GUI) select not to index different stuff, i.e. categories, etc. I made a quick hack to keep users from editing .htaccess, though I left them with access to their respective robots.txt files.

Now I started wondering whether that is safe :wink:

I just found the plugin you are all talking about.

@drmike: Why the need to hack anything? What's wrong/unsafe with the original?

  • drmike

    users can easily (with a gui) select not to index different stuff, i.e. categories, etc.

    I only have two installs using this plugin. Both of them present it to their clients as something for advanced users. That's actually why we added the bit above, to give it a decent default for those who choose not to use it.

    why the need to hack anything?

Plus, when you deal with soccer moms who open support tickets with "Google stole my site!", you learn that with decent instructions even a mother of a half dozen can edit a robots.txt file once she sees it as something she has to do. Sure, you can explain the net to them, but chances are they're not going to understand. (We did try to explain to the client what Google was, and even noted their Grateful Dead collection, which was met with a long rant about the evil drug culture of the '60s.)

    whats wrong/unsafe with the original?

The concern is JavaScript making it into the database. While we're talking about a text file, something that normally isn't runnable within a browser, the concern is that the script isn't being filtered and is, in fact, showing up intact within the database. We feel that's a security concern that could give an opening to hackers out to do harm.
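
    A rough sketch of the kind of filtering that would close that hole, assuming the text is stored in the kb_robotstxt option as above (the pre_update_option_* filter and wp_strip_all_tags() are stock WordPress API; the function name here is made up):

         // Strip tags, <script> included, out of the robots.txt text before
         // the kb_robotstxt option is written to the database. The
         // pre_update_option_{$option} filter runs on every save of that option.
         function kb_robotstxt_strip_markup( $value, $old_value ) {
              return wp_strip_all_tags( $value );
         }
         add_filter( 'pre_update_option_kb_robotstxt', 'kb_robotstxt_strip_markup', 10, 2 );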

You'll note that uploads of *.txt files aren't allowed any longer.