[SmartCrawl Pro] Sitemap not working in multisite

Dear support team,

I'm trying to generate a sitemap for my multisite network : defi-ecologique.com

On the network admin panel, I'm able to change my sitemap's settings.
https://www.defi-ecologique.com/sitemap.xml

On my subdomains, the sitemaps don't work :
https://blog.defi-ecologique.com/sitemap.xml

Also, in the URL crawl report, I've got weird issues.
For instance :
https://www.defi-ecologique.com/services/
"We couldn’t find this URL in your sitemap, which means search engines are probably having trouble finding and indexing it, too."

This is odd, since this URL (and the other listed URLs) should be in the sitemap.

Besides, the pages are organized in a hierarchy, and the sitemap doesn't seem to correlate with the website structure.

Regards,
Greg

  • Adam Czajczyk

    Hi Greg

    That was just a first step as initially only one sitemap was even being displayed while the other one had validation errors due to style being included.

    Anyway, as you provided me with access to the site, I checked the settings. The point is: this is a sub-domain multisite with a main site and separate sub-sites. SmartCrawl was set to use "network settings", which wouldn't allow sitemap creation for sub-sites (a sub-site is not part of the main site; it's a separate site).

    On "Network Admin -> SmartCrawl -> Settings", switching "Sitewide Mode" off and switching the individual module ("sitemap" in this case) on would give you direct access to the "SmartCrawl -> Sitemap" option from the dashboard of each site (and I thought it was initially set that way). That lets you create a specific sitemap for each site, so the "blog" sitemap would include URLs from the "blog" sub-site and the main site sitemap would include URLs from the main site.

    However, on your site this doesn't seem to work properly either. Initially, when I changed that setting, both maps were updated to point to the right URLs. But it seems that any "update sitemap" action, triggering a crawl on any of these sites, or even changing settings (which triggers the sitemap auto-update) breaks that and makes both sitemaps point to the same URLs (either the "blog" sub-site's or the main site's). This isn't expected or valid behavior. I also double-checked on my own setup to make sure that these maps should stay separate.

    I'm confused by this, I admit, and I'll need a helping hand, so I asked our second line support team to take a look. Please keep an eye on this ticket and we'll update you as soon as we get feedback from them.

    Kind regards,
    Adam

  • Konstantinos Xenos

    Hi Greg,

    Sorry for the long wait, but I've been trying to figure out what might be going wrong with the sitemaps.

    After a lot of digging, I think I've finally found the issue, but I need some more information, as it doesn't make much sense at first glance.

    Let me explain the issue:

    Usually on a multisite, every sub-site has its own upload folder, named after its ID. For example, the "blog." sub-site should be uploading its images to /uploads/sites/2/, and so on for every sub-site.

    On your installation, though, all of the uploads end up in the main site's folder, and their links point to the main site.

    Because of this, "sitemap.xml" is also read by all sub-sites from one folder, so every sub-site tries to overwrite and read the same file.

    Now, I'm not sure if this is how your hosting provider works or if this was done intentionally, but it is not the correct way of handling files on a multisite.

    Unfortunately, since I don't know your setup, I can't help further at this point: this file management might have been intentional, even though it's not correct.

    Please tell me if you had any knowledge of this for your installation.
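
    If it helps with checking, here is a rough sketch that compares each site's upload directory with the usual layout described above. It assumes the main site has ID 1 unless WordPress reports otherwise, and the helper names expected_uploads_suffix() and uploads_dir_looks_standard() are made up for this example. The loop at the bottom only runs when WordPress is loaded.

    ```php
    <?php
    // Hypothetical helper: the sub-directory a site's uploads normally end in.
    // Assumption: the main site uploads straight into wp-content/uploads.
    function expected_uploads_suffix(int $siteId, int $mainSiteId = 1): string {
        return $siteId === $mainSiteId ? '' : "/sites/{$siteId}";
    }

    // Hypothetical helper: does $basedir match the usual layout for this site?
    function uploads_dir_looks_standard(int $siteId, string $basedir, int $mainSiteId = 1): bool {
        $basedir = rtrim($basedir, '/');
        $suffix  = expected_uploads_suffix($siteId, $mainSiteId);
        if ($suffix === '') {
            // The main site should not point into a /sites/<id> folder.
            return !preg_match('#/sites/\d+$#', $basedir);
        }
        return substr($basedir, -strlen($suffix)) === $suffix;
    }

    // Only runs inside WordPress: report the upload folder of every site.
    if (function_exists('get_sites')) {
        $mainId = (int) get_main_site_id();
        foreach (get_sites() as $site) {
            $id = (int) $site->blog_id;
            switch_to_blog($id);
            $dir = wp_upload_dir();
            restore_current_blog();
            $ok = uploads_dir_looks_standard($id, $dir['basedir'], $mainId);
            printf("%d  %s  %s\n", $id, $dir['basedir'], $ok ? 'standard' : 'shared or custom');
        }
    }
    ```

    Any sub-site reported as "shared or custom" is writing into a folder it does not own, which would match the overwriting described above.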

    Regards,
    Konstantinos

  • Predrag Dubajic

    Hi Greg,

    Perhaps this will help with creating the XML file:
    https://www.guru99.com/php-and-xml.html

    As for what the sitemap should contain: pretty much all the pages that you want indexed. Some suggest including all your pages, but that is up to you. This might help with deciding what to include:
    https://ux.stackexchange.com/a/107025

    You asked: "Would you have any suggestions as to how to structure the sitemap for a multisite? Any real-life examples?"

    Adding sitemap URLs to your robots.txt might help with having separate sitemaps for your sub-sites. For example:

    User-agent: *
    Disallow: /wp-content/plugins/
    Sitemap: https://maindomain.com/sitemap_location/sitemap.xml
    Sitemap: https://subdomain1.maindomain.com/sitemap_location/sitemap.xml
    Sitemap: https://subdomain2.maindomain.com/sitemap_location/sitemap.xml

    Hope that helps to get you started.

    Best regards,
    Predrag

    • Greg

      Dear Predrag,

      Thank you very much for your help.

      I've been able to sort this out.
      I've created a plug-in that generates an XML sitemap for each site, as well as an index, and then pings Google and Bing.
      I've also set up a cron job that triggers the function.

      Even though I disabled the sitemap feature network-wide, the SmartCrawl sitemap is still visible online : https://www.defi-ecologique.com/sitemap.xml
      You can see it there : https://www.defi-ecologique.com/wp-admin/network/admin.php?page=wds_settings
      (I've enabled support access)

      Here is the result :
      Index : https://www.defi-ecologique.com/wp-content/uploads/sitemaps/sitemap_index.xml
      Site officiel : https://www.defi-ecologique.com/wp-content/uploads/sitemaps/www-defi-ecologique-sitemap.xml
      Blog : https://www.defi-ecologique.com/wp-content/uploads/sitemaps/blog-defi-ecologique-sitemap.xml
      E-books : https://www.defi-ecologique.com/wp-content/uploads/sitemaps/ebook-defi-ecologique-sitemap.xml
      Banc Refuge : https://www.defi-ecologique.com/wp-content/uploads/sitemaps/banc-refuge-defi-ecologique-sitemap.xml
      Conteneur Collector : https://www.defi-ecologique.com/wp-content/uploads/sitemaps/conteneur-collector-defi-ecologique-sitemap.xml

      Here is the code (FYI)

      <?php
      /*
      Plugin Name: DEFI-Écologique : générer les sitemaps
      Plugin URI: https://www.defi-ecologique.com
      Description: Generates sitemaps for the DEFI-Écologique multisite
      Author: Grégoire Llorca
      Version: 1.0
      Author URI: https://www.defi-ecologique.com/
      */
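      // Note: get_all_sites(), get_sitemap_dir(), get_uploads_dir(), display_response()
      // and debug_var() are custom helpers that are not included in this snippet.
      function generer_sitemaps(){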
      
      	global $wpdb;
      	$output = '';
      	$sitemaps = array();
      	$sites = get_all_sites();
      	$i_name = 'sitemap_index.xml';
      	$i_path = get_sitemap_dir( 'abs' ) . $i_name;
      	$i_url = get_sitemap_dir( 'http' ) . $i_name;
      	$today = date( 'c' , time() );
      
      	// Sitemap index
      	$index = '<?xml version="1.0" encoding="UTF-8"?>
      <sitemapindex xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      	xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/siteindex.xsd"
      	xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">';
      
      	foreach( $sites as $s ){
      		if( !isset( $s['private'] ) ){
      			$slug = $s['slug'];
      			$s_url = get_sitemap_dir( 'http' ) . "$slug-defi-ecologique-sitemap.xml";
      			$s_path = get_sitemap_dir( 'abs' ) . "$slug-defi-ecologique-sitemap.xml";
      			// add this site to the index
      			$index .= "
      		<sitemap>
      			<loc>$s_url</loc>
      			<lastmod>$today</lastmod>
      		</sitemap>";
      			// start this site's sitemap (stored in the array below)
      			$s_temp = '<?xml version="1.0" encoding="UTF-8"?>';
      			$s_temp .= '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">';
      
      			// look up the content to list
      			$posts = array();
      			$reqs = array(
      				array(
      					"req" => "SELECT * FROM {$s['prefix']}posts WHERE post_type='page' AND post_status='publish' ORDER BY post_modified_gmt DESC",
      					"type" => 'pages'
      				),
      				array(
      					"req" => "SELECT * FROM {$s['prefix']}posts WHERE post_type='post' AND post_status='publish' ORDER BY post_modified_gmt DESC",
      					"type" => 'posts'
      				),
      			);
      			switch( $slug ){
      				case "banc-refuge":
      					$reqs[] = array(
      						"req" => "SELECT * FROM {$s['prefix']}posts WHERE post_type='insecte' AND post_status='publish' ORDER BY post_modified_gmt DESC",
      						"type" => 'insectes'
      					);
      				break;
      				case "blog":
      					$reqs[] = array(
      						"req" => "SELECT * FROM `{$s['prefix']}term_taxonomy` AS t1 LEFT JOIN `{$s['prefix']}terms` AS t2 ON t1.term_taxonomy_id=t2.term_id WHERE t1.taxonomy='category'",
      						"type" => "categories"
      					);
      					$reqs[] = array(
      						"req" => "SELECT * FROM `{$s['prefix']}term_taxonomy` AS t1 LEFT JOIN `{$s['prefix']}terms` AS t2 ON t1.term_taxonomy_id=t2.term_id WHERE t1.taxonomy='post_tag'",
      						"type" => "tags"
      					);
      				break;
      				case "www":
      					$reqs[] = array(
      						"req" => "SELECT * FROM `{$s['prefix']}term_taxonomy` AS t1 LEFT JOIN `{$s['prefix']}terms` AS t2 ON t1.term_taxonomy_id=t2.term_id WHERE t1.taxonomy='type'",
      						"type" => "types"
      					);
      					$reqs[] = array(
      						"req" => "SELECT t1.* , t3.slug AS type FROM {$s['prefix']}posts AS t1 LEFT JOIN {$s['prefix']}term_relationships AS t2 ON t1.ID=t2.object_ID LEFT JOIN {$s['prefix']}terms AS t3 ON t3.term_id=t2.term_taxonomy_id WHERE t1.post_type='service' AND t1.post_status='publish' GROUP BY t1.ID",
      						"type" => "services"
      					);
      				break;
      			}
      			foreach( $reqs as $t ){
      				$res = $wpdb->get_results( $t['req'] );
      				// if( in_array( $t['type'] , array( 'services' ) ) )
      					// debug_var( $t['req'] );
      				if( $t['type'] == 'types' ){
      					$posts[] = array(
      						"loc" => $s['url'] . 'services/',
      						"lastmod" => $today,
      						"changefreq" => 'never',
      						"priority" => '1.0',
      					);
      				}
      				elseif( $t['type'] == 'insectes' ){
      					$posts[] = array(
      						"loc" => $s['url'] . 'insectes/',
      						"lastmod" => $today,
      						"changefreq" => 'never',
      						"priority" => '1.0',
      					);
      				}
      				foreach( $res as $r ){
      					$is_noi = 0;
      					$p = '0.5';
      					$cf = 'monthly';
      					$images = array();
      					switch( $t['type'] ){
      						case "tags": // tags
      							$url = $s['url'] . "tag/{$r->slug}/";
      							$lm = $today;
      							$p = '0.4';
      							$cf = 'monthly';
      						break;
      						case "types": // types
      						case "categories": // categories
      							$url = $s['url'] . $r->slug . '/';
      							$lm = $today;
      							if( $slug == 'www' )
      								$url = "{$s['url']}services/{$r->slug}/";
      							$p = '0.9';
      							$cf = 'weekly';
      						break;
      						case "services": // services
      							$url = $s['url'] . "{$r->type}/{$r->post_name}/";
      							$lm = date( 'c' , strtotime( $r->post_modified_gmt ) );
      							$p = '0.8';
      							$cf = 'weekly';
      						break;
      						case "posts": // posts
      						case "insectes": // insectes
      							$url = $s['url'] . $r->post_name;
      							$lm = date( 'c' , strtotime( $r->post_modified_gmt ) );
      							if( $slug == 'ebook' )
      								$p = '0.9';
      							elseif( $slug == 'banc-refuge' ){
      								$url = "{$s['url']}insectes/{$r->post_name}/";
      							}
      							elseif( $slug == 'blog' ){
      								$images = get_post_images( $r );
      							}
      						break;
      						default: // pages
      							$url = $s['url'] . $r->post_name;
      							$lm = date( 'c' , strtotime( $r->post_modified_gmt ) );
      							// debug_var( $r->post_title , $r->ID );
      							$p = '0.8';
      							$cf = 'weekly';
      							if( $slug == 'www' ){
      								$p = '0.5';
      								$cf = 'monthly';
      							}
      							if( in_array( $r->post_name , array( 'home' , 'banc-refuge' , 'conteneur-collector' , 'le-blog-de-la-faune-et-de-la-flore-sous-tous-les-angles' , 'e-books-naturalistes-publies-defi-ecologique' ) ) ){
      								$url = $s['url'];
      								$p = '1.0';
      								$lm = $today;
      								$cf = 'daily';
      								if( $slug == 'conteneur-collector' )
      									$cf = 'yearly';
      							}
      							else{
      								$req = "SELECT IFNULL( meta_value , 0 ) FROM {$s['prefix']}postmeta WHERE meta_key='_wds_meta-robots-noindex' AND post_id={$r->ID}";
      								$is_noi = $wpdb->get_var( $req );
      								// debug_var( $is_noi , $req );
      								if( $is_noi == 1 ){
      									$p = '0.0';
      									$cf = 'yearly';
      								}
      							}
      						break;
      					}
      					if( !in_array( $slug , array( 'blog' , 'conteneur-collector' , 'ebook' ) ) || $is_noi == 0 ){
      						$posts[] = array(
      							"loc" => $url,
      							"lastmod" => $lm,
      							"changefreq" => $cf,
      							"priority" => $p,
      							"images" => $images
      						);
      					}
      				}
      			}
      			// each entry: loc (URL), lastmod (sitemap date format), changefreq (sitemap value), priority (0.0 to 1.0)

      			// add the entries to the sitemap
      			foreach( $posts as $p ) {
      
      				$s_temp .= "
      				<url>
      					<loc>{$p['loc']}</loc>
      					<lastmod>{$p['lastmod']}</lastmod>
      					<changefreq>{$p['changefreq']}</changefreq>
      					<priority>{$p['priority']}</priority>";
      				if( isset( $p['images'][0] ) ){
      					foreach( $p['images'] as $i ){
      						$s_temp .= "
      					<image:image>
      						<image:loc>{$i['loc']}</image:loc>
      						<image:caption>{$i['caption']}</image:caption>
      						<image:title>{$i['title']}</image:title>
      					</image:image>";
      					}
      				}
      				$s_temp .= "
      				</url>";
      			  }
      
      			// close the urlset and store this site's sitemap data
      			$s_temp .= '</urlset>';
      
      			$s_array = array(
      				"slug" => $slug,
      				"label" => $s['label'],
      				"url" => $s_url,
      				"path" => $s_path,
      				"xml" => $s_temp
      			);
      			$sitemaps[$slug] = $s_array;
      		}
      	}
      
      	$index .= '
      </sitemapindex>';
      
      	// save the index

      	set_time_limit(999999);
      	$fi = fopen( $i_path , 'w' );
      	fwrite( $fi, $index );
      	fclose( $fi );
      	// the sitemap URL is a query-string value, so it is URL-encoded
      	$output .= SubmitSiteMap( "http://www.google.com/webmasters/sitemaps/ping?sitemap=" . urlencode( $i_url ) );
      	$output .= SubmitSiteMap( "http://www.bing.com/webmaster/ping.aspx?siteMap=" . urlencode( $i_url ) );
      	$output .= "<li><a href='$i_url'><strong>Index</strong> : $i_url</a></li>";
      
      	// save the sitemaps
      	foreach( $sitemaps as $sm ){
      		$fs = fopen( $sm['path'] , 'w' );
      		fwrite( $fs, $sm['xml'] );
      		fclose( $fs );
      		$output .= SubmitSiteMap( "http://www.google.com/webmasters/sitemaps/ping?sitemap=" . urlencode( $sm['url'] ) );
      		$output .= SubmitSiteMap( "http://www.bing.com/webmaster/ping.aspx?siteMap=" . urlencode( $sm['url'] ) );
      		$output .= "<li><a href='{$sm['url']}'><strong>{$sm['label']}</strong> : {$sm['url']}</a></li>";
      	}
      
      	return "<ul>$output</ul>";
      }
      
      function get_post_images( $post ){
      
      	$result = array();
      	//get shortcode regex pattern wordpress function
      	$pattern = get_shortcode_regex();
      	// debug_var( $pattern );
      	// $pattern = 'artimg';
      
      	if (   preg_match_all( '/'. $pattern .'/s', $post->post_content, $matches ) )
      	{
      		// foreach( $matches as $m )
      			// debug_var( $m , "Matches" );
      		$keys = array();
      		$result = array();
      		foreach( $matches[0] as $key => $value) {
      			$image = array();
      			if( $matches[2][$key] == 'artimg' ){
      				// debug_var( $value , $key );
      				$str = explode( '|' , str_replace( '" ' , '|' , $matches[3][$key] ) );
      				$loc = $caption = $title = '';
      				foreach( $str as $s ){
      					$atts = explode( '=' , str_replace( '"' , '' , $s ) );
      					// debug_var( $atts[1] , trim( $atts[0] ) );
      					switch( trim( $atts[0] ) ){
      						case "title":
      							$loc = get_uploads_dir( 'http' ) . $atts[1];
      						break;
      						case "alt":
      							$title = $caption = html_entity_decode( $atts[1] );
      						break;
      					}
      				}
      				$image = array(
      					"loc" => $loc,
      					"caption" => $caption,
      					"title" => $title,
      				);
      				// debug_var( $image , "Image" );
      				$result[] = $image;
      			}
      		}
      	}
      	return $result;
      }
      
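      // Ping search engines
      function Submit($url){
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_HEADER, 0);
        // Capture the response body instead of echoing it into the page
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_exec($ch);
        $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        curl_close($ch);
        return $httpCode;
      }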
      function SubmitSiteMap($url) {
        $returnCode = Submit($url);
        if ($returnCode != 200) {
          return display_response( "error" , "Error $returnCode: $url" );
        } else {
          return display_response( "success" , "$returnCode: $url" );
        }
      }

      Also, I've created two user sitemaps :
      https://www.defi-ecologique.com/sitemap/
      https://blog.defi-ecologique.com/sitemap/

      What do you think about that ? Would you have any suggestions for improvement ?

      I'm also having issues in Google Search Console. First of all, the SmartCrawl sitemap is still listed there and I can't find any button to remove it.

      I'm also having notifications about pages not being included in the sitemap :

      But, they are visible within the sitemaps :
      https://www.defi-ecologique.com/wp-content/uploads/sitemaps/www-defi-ecologique-sitemap.xml

      Could you please help me sort this out ?

      Thank you very much,
      Greg

  • Adam Czajczyk

    Hi Greg

    I took a look at these sitemaps and they look fine. Great work with the code, and thanks for sharing it!
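
    One possible hardening, offered as a sketch rather than a required change: the plugin concatenates URLs, titles and captions straight into the XML string, so a value containing & or < would make the file invalid. PHP's XMLWriter escapes text as it writes. The function name build_urlset() and the shape of the $entries array are my own assumptions for illustration, not part of your plugin.

    ```php
    <?php
    // Hypothetical sketch: build a <urlset> with XMLWriter so text is escaped.
    // Assumed entry shape: ['loc' => ..., 'lastmod' => ..., 'images' => [['loc' => ..., 'caption' => ...]]]
    function build_urlset(array $entries): string {
        $w = new XMLWriter();
        $w->openMemory();
        $w->setIndent(true);
        $w->startDocument('1.0', 'UTF-8');

        $w->startElement('urlset');
        $w->writeAttribute('xmlns', 'http://www.sitemaps.org/schemas/sitemap/0.9');
        $w->writeAttribute('xmlns:image', 'http://www.google.com/schemas/sitemap-image/1.1');

        foreach ($entries as $e) {
            $w->startElement('url');
            $w->writeElement('loc', $e['loc']);          // & becomes &amp; automatically
            $w->writeElement('lastmod', $e['lastmod']);
            foreach ($e['images'] ?? [] as $img) {
                $w->startElement('image:image');
                $w->writeElement('image:loc', $img['loc']);
                $w->writeElement('image:caption', $img['caption']);
                $w->endElement();                        // </image:image>
            }
            $w->endElement();                            // </url>
        }

        $w->endElement();                                // </urlset>
        $w->endDocument();
        return $w->outputMemory();
    }
    ```

    The returned string could be written to disk with fwrite() in the same way $s_temp is written today.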

    As for the SmartCrawl sitemap: it won't be removed from Google Search Console automatically just because the feature is disabled. I admit I couldn't find a way to remove it manually in the current version of Google Search Console either.

    However, I found out that you should still be able to switch GSC to "old version" and in that "old" version of the interface on the list of sitemaps you can actually select the sitemap and use the "Delete" button above the list. So you could temporarily switch to "old interface", remove the sitemap and switch back. I think that should do the trick.

    As for the "missing pages". The screenshot that you shared, I think it's showing the pages that are actually "included in Google index" currently - from the Sitemap. So the "missing ones" would be those that are not listed there, is that right? (I'm sorry, I don't speak French so I could only use Google Translator to try to understand the screenshot).

    It might take some time for Google to crawl the site and include all the pages, so it's quite possible that all it needs is just some additional time :)

    Best regards,
    Adam

  • Greg

    Dear Predrag Dubajic,

    I've tried to use the same set up on a multilingual install.

    This time, I'm having issues with the rel=canonical stuff.

    The install is as follows :
    - curly-de-provence.fr -> main site, selling horses (fr-fr)
    - en.curly-de-provence.fr -> translation of the horse selling site (en-gb)
    - lavande.curly-de-provence.fr -> selling lavender (fr-fr)

    I've added xmlns:xhtml="http://www.w3.org/1999/xhtml" to the URL set.
    In the index and lavender sitemaps, everything works fine, since there aren't any translations involved :
    - https://curly-de-provence.fr/wp-content/uploads/sitemaps/sitemap_index.xml
    - https://curly-de-provence.fr/wp-content/uploads/sitemaps/lavande-curly-de-provence-sitemap.xml

    However, in the main sitemap and the English translation sitemap, the display is badly mixed up.
    If I remove the alternate links, the display problem goes away.
    - https://curly-de-provence.fr/wp-content/uploads/sitemaps/main-curly-de-provence-sitemap.xml
    - https://curly-de-provence.fr/wp-content/uploads/sitemaps/en-curly-de-provence-sitemap.xml

    Here is what it looks like:

    And on notepad++ :

    Do you see anything that might explain this broken display ?

    Regards,
    Greg

  • Predrag Dubajic

    Hi Greg,

    Can you try editing your non-working sitemap and replace this part:
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:image="http://www.google.com/schemas/sitemap-image/1.1" xmlns:xhtml="http://www.w3.org/1999/xhtml">
    With this and see if that does the trick for you:
    <urlset xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd http://www.w3.org/TR/xhtml11/xhtml11_schema.html http://www.w3.org/2002/08/xhtml/xhtml1-strict.xsd" xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xhtml="http://www.w3.org/TR/xhtml11/xhtml11_schema.html">

    These discussions should further help you with your issue:
    https://stackoverflow.com/questions/51923627/xml-sitemap-rendering-as-plain-text/51943325#51943325
    https://stackoverflow.com/questions/39450542/namespace-prefix-xhtml-on-link-is-not-defined-sitemap-xml/39451291#39451291
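
    For reference, here is a minimal sketch of how the alternate links could be generated for one site's sitemap, assuming every entry lists all language versions of the page, itself included. The function name build_hreflang_urlset() and the $pages array (hreflang code => URL) are assumptions of mine, not your plugin's code. The namespace URI is a parameter, so either of the two values shown above can be passed in.

    ```php
    <?php
    // Hypothetical sketch: one <url> per page of the site whose language is
    // $siteLang, each carrying xhtml:link alternates for every version.
    function build_hreflang_urlset(
        array $pages,
        string $siteLang,
        string $xhtmlNs = 'http://www.w3.org/1999/xhtml'
    ): string {
        $w = new XMLWriter();
        $w->openMemory();
        $w->setIndent(true);
        $w->startDocument('1.0', 'UTF-8');

        $w->startElement('urlset');
        $w->writeAttribute('xmlns', 'http://www.sitemaps.org/schemas/sitemap/0.9');
        $w->writeAttribute('xmlns:xhtml', $xhtmlNs);

        foreach ($pages as $versions) {              // $versions: hreflang => URL
            if (!isset($versions[$siteLang])) {
                continue;                            // page not available on this site
            }
            $w->startElement('url');
            $w->writeElement('loc', $versions[$siteLang]);
            foreach ($versions as $lang => $altUrl) {
                $w->startElement('xhtml:link');
                $w->writeAttribute('rel', 'alternate');
                $w->writeAttribute('hreflang', $lang);
                $w->writeAttribute('href', $altUrl);
                $w->endElement();                    // </xhtml:link>
            }
            $w->endElement();                        // </url>
        }

        $w->endElement();                            // </urlset>
        $w->endDocument();
        return $w->outputMemory();
    }
    ```

    Calling it once with 'fr-fr' and once with 'en-gb' on the same $pages array would give the two sitemaps matching entries that point at each other.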

    Best regards,
    Predrag
