Post Indexer network-wide tagging is killing my site

On a large multi-DB network of 30k blogs, we've had a LOT of spam signups recently for "Black Friday / Cyber Monday" spam blogs. Unlike previous spam blogs, these make HEAVY use of tags and categories. When a new spam blog is created, the owner/bot automatically publishes a batch of spam posts, each of which is indexed by the Post Indexer plugin. That wouldn't be a big deal on its own, except that in less than one day it balloons the wp_site_terms, wp_site_term_relationships and wp_term_counts tables (40k+ per day). Because the system tries to purge these records when it deletes a blog, my site dies every time I try to delete one of these blogs. Instead, I must first disable Post Indexer and purge these tables.

truncate wp_site_terms;
truncate wp_site_term_relationships;
truncate wp_term_counts;

Is there any way to reduce the load from this so it doesn't keep killing the server? For now I have post indexer disabled.

  • Fee

    Hi there, I'm one step away from launching my big network and am concerned about site performance once there are a great many posts. I also read this older thread: https://premium.wpmudev.org/forums/topic/my-post-indexer-is-flooded
    What about adding an option to delete old posts from the global tables directly from the backend? Or maybe an option to have posts deleted automatically after a set period of time (a year, a few months and so on)? This would be a great feature!

  • aecnu

    Greetings Everyone,

    Though a lot has been mentioned here about performance, flooding, and the like, not a thing was mentioned about the hosting itself, nor about Antisplog, captcha, etc.

    Concerning hosting if these are by any chance on a Go Daddy/Hostgator $5 hosting package ... enough said.

    Concerning the other items, it is obvious that spam and splogs are the real problem here and not a word about stopping that has been mentioned.

    Deleting old posts is fine, except when search engines have already indexed those very posts; everyone loves to get the 404 "page cannot be found" error .... lol

    So my approach to resolving this issue would be twofold: first and foremost, stopping the spammers/sploggers using antisplog and the "Cats" captcha, which totally burns the spam bots.

    Next I would take a close look at my hosting and see if it indeed delivers the high performance I would expect: CPU, RAM, RAM allocated to PHP, and one critical item that is constantly overlooked, I/O wait time.

    There you have my two cents as a server guru.

    Have a GREAT weekend!

    Cheers, Joe

  • digitsoft

    Since I was just working on the post indexer for another thread I'll chime in...

    Like Joe said - preventing splogs should be your primary concern followed up by a great performing host.

    Joe - of course I had to take it one step further.

    After a great deal of trial and error I have 2 filters that will limit tags and categories to 2 each (you can change the limit in the code). Put these in the theme's functions.php or another place that gets executed on all sites. People can type in as many tags and cats as they want, but only 2 will appear.

    Hope this helps...

    add_filter( 'term_links-post_tag', 'limittags' );
    function limittags( $links )
    {
    	// Keep only the first 2 tag links (change the 2 to adjust the limit).
    	return array_slice( (array) $links, 0, 2 );
    }

    add_filter( 'the_category', 'limitcat' );
    function limitcat( $list )
    {
    	// Assumes the theme prints categories with a comma separator.
    	// Keep the first 2 entries and put the commas back between them.
    	$cats = explode( ',', $list );
    	return implode( ',', array_slice( $cats, 0, 2 ) );
    }
  • Aaron

    Cool code, though those seem to only affect display of cats on a per site basis. They would still be indexed by the post indexer.

    I think adding an option to delete older indexed posts and their tags is the best way to keep the tables manageable. For now you could do it with one SQL query, I think.

    The indexer built into MarketPress does this:
    "DELETE p.*, r.* FROM {$wpdb->base_prefix}mp_products p LEFT JOIN {$wpdb->base_prefix}mp_term_relationships r ON p.id = r.post_id WHERE p.site_id = {$wpdb->siteid} AND p.blog_id = $blog_id"

    There's a lot of room for improving the performance of deleting items from post indexer. Right now it has to perform a bunch of queries each time to adjust the counts.
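
    As a rough sketch of the one-query approach mentioned above, the MarketPress query could be wrapped in a small helper that casts the IDs to integers before they reach the SQL string. The function name and its parameters are made up for this illustration; only the table and column names come from the quoted query, and Post Indexer's own tables would need their own names substituted.

```php
<?php
// Hypothetical helper (not part of MarketPress or Post Indexer).
// Builds the multi-table DELETE quoted above for one blog.
// $base_prefix is assumed to be a trusted value such as $wpdb->base_prefix.
function build_blog_purge_sql( $base_prefix, $site_id, $blog_id ) {
    // Cast the IDs so nothing but integers can end up in the query.
    $site_id = (int) $site_id;
    $blog_id = (int) $blog_id;

    return "DELETE p.*, r.* FROM {$base_prefix}mp_products p "
         . "LEFT JOIN {$base_prefix}mp_term_relationships r ON p.id = r.post_id "
         . "WHERE p.site_id = {$site_id} AND p.blog_id = {$blog_id}";
}

// Example: print the statement for blog 42 on site 1.
echo build_blog_purge_sql( 'wp_', 1, 42 ), "\n";
```

    Because posts and their term relationships go in a single statement, the database does the work in one pass instead of once per post.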

    I'm going to put this on our todo list officially.

  • Shawn

    @aecnu - the dedicated server used is a quad-core 3.4 GHz with 16 GB RAM, and the DB tables are using a 4096 multi-DB setup. It's not an issue of resources - it's an issue of indexing stuff that shouldn't be indexed.

    Around Black Friday we had 70k new user/blog signups, and each would publish a couple hundred posts within a few hours, each post carrying 20-100 tags. Because of the way Post Indexer is designed, these tags would be globally indexed and tabulated. While the intent behind that was wonderful and performance was "okay" even under this load, the real issue came when I deleted these users & blogs. Deleting each blog required its posts to be de-indexed by Post Indexer, which had to run a non-indexed query against tables that were, at their largest, about 5.5 GB total. The queries Post Indexer needed to "reduce" the count on each indexed term would take a full second (minimum) per term x ~50 terms/post x 200 posts per blog x 70k spam blogs. Do the math there. You end up with a grand total of WTF.
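
    Taking the figures quoted above at face value (1 second per term, ~50 terms per post, 200 posts per blog, 70k spam blogs), the math can be worked through in a few lines of PHP. The variable names are only for this illustration.

```php
<?php
// Figures quoted in the post above; treat them as rough estimates.
$seconds_per_term = 1;      // "a full second (minimum) per term"
$terms_per_post   = 50;     // "~50 terms/post"
$posts_per_blog   = 200;    // "200 posts per blog"
$spam_blogs       = 70000;  // "70k spam blogs"

// Time spent adjusting term counts when one blog is deleted.
$seconds_per_blog = $seconds_per_term * $terms_per_post * $posts_per_blog;
$hours_per_blog   = $seconds_per_blog / 3600;

// Time for every spam blog, one after another.
$total_seconds = $seconds_per_blog * $spam_blogs;
$total_years   = $total_seconds / ( 365 * 24 * 3600 );

printf( "Per blog: %d s (about %.1f hours)\n", $seconds_per_blog, $hours_per_blog );
printf( "All blogs: %d s (about %.1f years)\n", $total_seconds, $total_years );
```

    That works out to 10,000 seconds (roughly 2.8 hours) of count adjustments per blog, and about 700 million seconds (around 22 years) if all 70k blogs were deleted one after another.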

    Bottom line, when you're deleting even a SINGLE spam blog (if you let it create that many posts) it would take a couple hours, and the page would die before it was done, requiring way too many tries to catch up with the spammers. The ONLY way I could get around it was to disable Post Indexer and do the SQL stuff on the backend - and truncating them was the easiest solution.

    @digitsoft - this code is really cosmetic only, and wouldn't have prevented the spammers from being able to stuff the Post Indexer data nor minimize the time required to delete the spam blogs. Thanks for trying to help though.

    @Aaron - thank you! I'd really like to be able to use it again, as I believe it helped anti-splog identify spammers more accurately. Here's a feature request or two, though, kind of based on Fee's comment above:
    * Provide a couple filters:
    1) new users (x days) aren't included in Post Indexer. This will minimize viability of drive-by spam.
    2) purge any 'site terms' that have fewer than x (~3) instances after y (~14) days. This will prevent the terms database from including irrelevant data.
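
    The two requested filters could look roughly like the sketch below. Post Indexer does not ship anything like this; the function names, the constant, and the shape of the term arrays are all assumptions made for illustration, and the thresholds are the approximate values suggested above.

```php
<?php
// Hypothetical helpers; none of these names exist in Post Indexer.
const SECONDS_PER_DAY = 86400;

// Request 1: only index blogs that are at least $min_age_days old.
function blog_is_old_enough( $registered_ts, $now_ts, $min_age_days ) {
    return ( $now_ts - $registered_ts ) >= $min_age_days * SECONDS_PER_DAY;
}

// Request 2: find terms used fewer than $min_count times that were
// first seen more than $max_age_days ago.
// Each term is assumed to look like:
//   array( 'slug' => string, 'count' => int, 'first_seen' => unix timestamp )
function terms_to_purge( array $terms, $now_ts, $min_count = 3, $max_age_days = 14 ) {
    $stale = array();
    foreach ( $terms as $term ) {
        $age_days = ( $now_ts - $term['first_seen'] ) / SECONDS_PER_DAY;
        if ( $term['count'] < $min_count && $age_days > $max_age_days ) {
            $stale[] = $term['slug'];
        }
    }
    return $stale;
}

// Example with made-up data: only the rarely used, old term is selected.
$now   = 1700000000;
$terms = array(
    array( 'slug' => 'cheap-deals', 'count' => 1,   'first_seen' => $now - 20 * SECONDS_PER_DAY ),
    array( 'slug' => 'wordpress',   'count' => 250, 'first_seen' => $now - 400 * SECONDS_PER_DAY ),
    array( 'slug' => 'new-tag',     'count' => 1,   'first_seen' => $now - 2 * SECONDS_PER_DAY ),
);
print_r( terms_to_purge( $terms, $now ) );
```

    Keeping both checks as plain functions of timestamps and counts means the thresholds can be tuned without touching whatever code does the actual indexing or deleting.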

    Thanks again, everyone!