How to scale WordPress to half a million blogs and 8,000,000 page views a month

We figured it was about time we shared some of the lessons we’ve learned scaling Edublogs to nearly half a million blogs and a place in the Quantcast top 5000 sites! So if you have grand plans for your site (or want to improve your existing setup / performance) read on and feel free to ask any questions :)

Scaling a large WordPress installation follows the same basic principles as scaling any other large site.

The key is to truly understand your application, its architecture and the potential areas of contention. For WordPress specifically, the two key points of contention and work are page-generation time and time spent in the database.

Database Layer:

Given the flexibility of WordPress, the database is the storage point not only for the “larger” items, such as users, posts and comments, but also for many little options and details. By its nature, WordPress may make many round-trip calls to the database to load these options, each requiring database and network resources.

The first level of “defense” against overloading the database is the MySQL Query Cache.

The Query Cache is a nifty little feature in MySQL which stores, in a dedicated area within main memory, the results of any query against a table that has not recently changed.

That is, if a request comes in to retrieve a specific row in a table, that table has not recently been modified in any way, and the cache has not filled up (requiring purging/cleaning), then the query can be satisfied from this cache. The major benefit, of course, is that the database does not need to go to the disk (generally the slowest part of the system) to satisfy the request; it can be answered immediately from memory.
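
Turning it on is a couple of lines in my.cnf. A minimal sketch, with sizes that are purely illustrative rather than tuned values:

    [mysqld]
    # cache SELECT results until the underlying table changes
    query_cache_type = 1
    # dedicated area in main memory for cached result sets
    query_cache_size = 64M

The flip side, worth keeping in mind, is that any write to a table invalidates every cached result for that table, so write-heavy tables benefit far less.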

Memory

The other major boost for the database is to keep the working set in memory. The working set is loosely defined as the set of data which will be aggressively referenced in a given period of time. Your database can hold 500GB worth of data, but the working set, the data actually needed NOW (and in the next N amount of time), may be only 5GB.

Keeping that 5GB within memory (using generous key caches and system I/O buffers for MyISAM, or a large buffer pool for InnoDB) will of course reduce the round trips to disk. If the contention in the database is write-related, consider changing the storage engine for the WordPress tables to InnoDB. Depending on the number of tables, this can lead to memory starvation, so approach with caution.
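
As a hedged illustration (the numbers are placeholders for a machine whose working set is around 5GB), the relevant my.cnf knobs are:

    [mysqld]
    # MyISAM caches only index blocks here; data blocks rely on the OS I/O buffers
    key_buffer_size = 2G
    # InnoDB caches both data and index pages together in the buffer pool
    innodb_buffer_pool_size = 5G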

Disks

The last point on databases is disks. In the event the working set doesn’t fit in memory (which is usually the case), make the disk sub-system as quick as possible. Trade in those “ultra-fast 3.0Gb/s SATA” disks for high-speed SCSI disks. Consider a striped array (RAID-0), but for safety’s sake make it a RAID-10. Spread the workload over multiple disks: for 150GB of disk space, consider getting several 50GB disks so that a larger total throughput can be obtained. If you will be doing heavy writes to this disk sub-system, add a battery-backed write-back cache; the throughput will be a lot higher.

The really nice “defense mechanism” for the database is to avoid the database altogether. As mentioned earlier, WordPress tends to make many, many database calls per page. If these calls can be drastically reduced or eliminated, database time goes down and page generation gets faster. This is usually done with memcached.

There are two types of cache: the object cache (loosely defined as holding things like options, settings, counts, etc.) and the full-page cache. A full-page cache stores a fully-generated page (HTML output and all) in the cache. This type of cache virtually eliminates page-generation time altogether.
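
To make the object cache concrete, here is a minimal sketch of the pattern a memcached-backed object cache applies for you. The wp_cache_get()/wp_cache_set() calls are WordPress’s standard object-cache API (backed by memcached once the drop-in is installed); the key and query are just examples:

    <?php
    // Try the cache first; fall back to the database only on a miss.
    global $wpdb;
    $value = wp_cache_get( 'blog_description', 'options' );
    if ( false === $value ) {
        $value = $wpdb->get_var(
            "SELECT option_value FROM wp_options
             WHERE option_name = 'blog_description'"
        );
        // Store the result so the next request skips the database entirely.
        wp_cache_set( 'blog_description', $value, 'options' );
    }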

We should not forget to mention MySQL slave replication. If your single database server cannot keep up, consider MySQL replication together with a plugin like MultiDB or HyperDB to split the reads and the writes. Keep in mind that you will always have to write to a single master, but you should be able to read from many/any.
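
A sketch of what that split looks like in HyperDB’s db-config.php (hostnames and credentials below are placeholders, not our actual setup):

    <?php
    // Master: the single server all writes must go to (it can serve reads too).
    $wpdb->add_database( array(
        'host'     => 'db-master.internal',
        'user'     => 'wp',
        'password' => 'secret',
        'name'     => 'wordpress',
        'write'    => 1,
        'read'     => 1,
    ) );
    // Slave: kept current by MySQL replication, read-only to WordPress.
    $wpdb->add_database( array(
        'host'     => 'db-slave.internal',
        'user'     => 'wp',
        'password' => 'secret',
        'name'     => 'wordpress',
        'write'    => 0,
        'read'     => 1,
    ) );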

Page-Generation Time

WordPress spends a considerable amount of time compiling and generating the resultant HTML page ultimately served to the client. For many, the typical choice is a server like Apache, which along with its benefits also brings some limitations. By default, Apache builds PHP into the processes that serve all pages on the site, regardless of whether they are PHP or not.

By using an alternate web server (e.g. nginx, lighttpd, etc.) you essentially “box in” all PHP requests and send them directly to a PHP worker pool which handles the page-generation part of the request. This leaves the web server free to continue serving static files, or anything else it needs to. Unlike with Apache, the PHP worker pool does not even need to reside on the same physical server as the web server. The most widely used implementation runs PHP as a FastCGI process (with the php-fpm patches applied).
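
A minimal nginx sketch of the hand-off (the address is an example; it could just as well point to a pool on another machine, and the block lives inside an ordinary server { } section):

    # Static files are served by nginx directly; only .php requests
    # are passed to the FastCGI worker pool.
    location ~ \.php$ {
        include fastcgi_params;
        fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
        fastcgi_pass 127.0.0.1:9000;  # local or remote php-fpm/php-fcgi pool
    }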

File Storage

When using multiple web-tier servers to compile and generate WordPress pages, one of the issues encountered is uploaded multimedia. In a single-server install, the files get placed into the wp-content/blogs.dir folder and we forget about them. Once we introduce more than one server, we can no longer store these files locally, as they would not be accessible from the other servers.

To work around this issue, consider having a dedicated or semi-dedicated file server running a shared/distributed file-system (NFS, AFS, etc.). When a user uploads a file, write it to the shared storage, which makes it accessible to all connected web servers. Alternatively, you may opt to upload it to Amazon S3, Rackspace CloudFiles or some other Content Delivery Network. Either way, the key here is to make sure the files are not local to a single web server; if they are, they will not be known to the other servers.
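
For the NFS route, each web server needs little more than a shared mount. A sketch (server name and paths are hypothetical):

    # /etc/fstab on every web server: all nodes mount the same export,
    # so a file uploaded via one server is immediately visible to the rest.
    fileserver.internal:/export/blogs.dir  /var/www/wp-content/blogs.dir  nfs  rw,hard  0  0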

On a distributed file-system, refrain from serving files off this system directly. Instead, place a web server or another caching service (varnish, squid) in front, responsible for reading the data off the shared storage device and returning it to the web server for sending back to the client. One advantage of using something like varnish is that you can create a fairly large and efficient cache in front of the shared file system. This lets the file system focus on serving new files while leaving the highly-requested files to the cache.
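
A hedged VCL sketch of that arrangement (syntax is from the Varnish 2/3 era and varies between versions; the backend name and TTL are assumptions):

    # Varnish sits in front of the file server and keeps hot files in cache.
    backend fileserver {
        .host = "fileserver.internal";
        .port = "80";
    }
    sub vcl_fetch {
        # Uploaded media rarely changes once written, so cache it for a day.
        set beresp.ttl = 24h;
    }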

Semi-static requests

For requests which can be viewed as semi-static, treat them so. Requests such as RSS feeds, although technically updated and available immediately after a post or comment is published, are good candidates for caching for a period of time (5 minutes or so) in a caching proxy such as varnish, squid, etc. This way a high number of requests for things like RSS feeds can be satisfied almost for “free”: they only need to be generated once and can then be fed from the cache hundreds or thousands of times.
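
In Varnish terms that is only a few lines of VCL. Again a sketch, with the feed URL pattern and TTL as assumptions:

    sub vcl_fetch {
        # Treat feeds as semi-static: serve the cached copy for five minutes.
        if (req.url ~ "/feed") {
            set beresp.ttl = 5m;
        }
    }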

What we use at Edublogs:

3x web-tier servers
2x database servers
1x file server

Each web-tier server runs an nginx instance, a php-fcgi pool and a memcached instance. The Edublogs.org name resolves to three IP addresses, each fronted by one of the nginx servers. nginx is configured to distribute the PHP requests across the three servers (itself or the other two in the pool).
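
In nginx configuration terms the distribution looks roughly like this (the IPs are placeholders for the three web servers):

    # Each nginx spreads PHP work across the whole php-fcgi pool,
    # itself included.
    upstream php_pool {
        server 10.0.0.1:9000;
        server 10.0.0.2:9000;
        server 10.0.0.3:9000;
    }
    location ~ \.php$ {
        include fastcgi_params;
        fastcgi_pass php_pool;
    }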

The database servers in this case function as a split setup. The heavier traffic (e.g. blog content) is stored on one set of servers and the global data on a separate set. “Global” data can be thought of as options, settings, etc.

The file server is fronted by a varnish pool and connected via NFS to all three web servers. Each web server has a local copy of the PHP files which comprise the site (no reading off of NFS). When a user uploads a multimedia file, it gets copied over to the NFS mounts. On subsequent requests, the data is served back by varnish (which also caches it for future requests).

Global Tables, InnoDB & Memcache

The global tables are InnoDB: there are not that many of them, and they get better performance that way. One of the primary reasons the individual blog tables are not InnoDB is InnoDB’s data dictionary issues: with a large number of tables, the dictionary can become too large and exhaust all the memory on the system. Though there are patches available to change this behavior, the individual tables are still mostly read-only, which MyISAM handles quite well.
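
Since the global tables are few, converting them is a one-off statement per table, e.g. (using a stock WPMU table name):

    -- Switch a single global table to InnoDB:
    ALTER TABLE wp_users ENGINE = InnoDB;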

As for caching: we use the memcached-backed object cache, and on top of that we also use Batcache (which itself utilizes the memcached-backed object cache) for full-page caching.

We hope that helps… and special shout out to our SysAdmin Michael who pretty much wrote this guide :)

Comments (19)

  1. Very useful information.

    My WordPress database is approaching 2GB (about 50 tables total), including the wp_posts table with more than 70K posts and tons of postmeta records. I’ve been considering switching to InnoDB from MyISAM for the WordPress tables, but most of our hits are reads. Does the number of records matter when choosing between the two or should I just focus on the types of queries?

    Many of our tables are non-WordPress-related, but kept in the same database because of an automated job that merges data from them into WordPress posts. Should I be concerned as our database approaches 2GB, and should I be thinking about splitting out the WordPress stuff?

  2. Thanks for posting this, James. No doubt you guys are leading the way with WPMU implementations.

    Would you consider posting a similar “security” guide — i.e. how you guys protect your site against hackers, spammers, malicious robots, and other bad guys?

  3. Thanks for the comments guys – we’re trying to share as much as we can to assist anyone in the same boat!

    Also, you never know, we might be missing something that you could help us with :)

    I think the security setup is an excellent idea – we have a raft of measures in place including SVN and multiple on site and offsite backups…

  4. I had to ask :)

    While this guide is fantastic, I think many readers may be overwhelmed by how much effort is required to duplicate your setup.

    Stated another way, this is the “end result” whereas a “roadmap to get there” would be really helpful as well.

    For example, would a smaller site with ~5,000 blogs and ~100,000 pageviews/month also require such an elaborate setup? Of all of these recommendations, which should be considered “mandatory” and which are “nice to have if you’ve got at least X number of blogs/pageviews?”

    Can you provide any rough guidelines as to the order things should be implemented in while growing a new site from a few blogs to a few thousand? Which of these should be implemented from Day 1 (like Multi-DB and memcached, for instance)?

    As a side note, Yoast has an article along the same lines where he recommends a VPS Cloud Server over a dedicated server (for scalability), along with a CDN.

    Would love to get your thoughts.

  5. There’s never any harm in asking :)

    I think that this would definitely be a valuable new manual for WPMU DEV… going from the basics to this advanced stuff.

    We’ll look into it!

  6. @Dennis:

    At a 50,000-foot overview, the key to an optimally running database (regardless of size) is ensuring that the “working set” fits into the available memory footprint. The devil, of course, is in the details, and a superior hardware configuration goes a long way too. However, in your case, 2GB should easily fit into memory, even on today’s commodity hardware.

    If there are still performance concerns, try to isolate and determine what the bottleneck is. If it’s high disk I/O, perhaps the memory buffers aren’t configured correctly. If it’s CPU, perhaps there is a lot of unneeded contention which can be looked at. If the issue is high concurrency, definitely explore switching to InnoDB from MyISAM.

    There is one thing to watch out for with WordPress specifically: it generates many, many temporary tables. Generally these temporary tables shouldn’t be much of a concern; however, in MySQL, as long as there is either a TEXT or BLOB column in the temporary table, it always spools to disk. This means that even if you have a highly optimized data set which fits snugly into memory, you will still experience a significant amount of disk I/O for these temporary tables (assuming their creation rate is high)… The WordPress queries which generate temporary tables almost always include BLOB or TEXT columns.

    -Michael

  7. @Nick:

    Quite obviously the most “basic” steps should be something to alleviate the largest bottleneck (which usually ends up being the database). The obvious choice is memcached in combination with page-level caching (Batcache, Super-Cache, etc.)

    The other two significant things would be monitoring/graphing as well as load testing. Getting a graphical representation of what your server load looks like over a period of time can help in capacity planning as well as understanding traffic patterns and site usage.

    For load testing, understanding what your server(s) can handle at capacity is also invaluable, as you can know (somewhat) in advance how potential hits will affect your install. Benchmark with different plugins/caching installed to see the differences.

    In regard to VPS vs. dedicated, here’s my take: it depends :) … it depends on the underlying problem you’re trying to scale or solve. If the bottleneck is CPU cycles and you just need a few more ticks, then yeah, an elastic VPS is for you. If you need some more memory, perhaps an elastic VPS is for you. If you need faster disks and the VPS can support that too, then yeah, that can be for you too. But the key point in choosing a hosting solution is knowing what aspect you are trying to scale. Elasticizing a VPS to solve an overloaded database by throwing some more CPUs at it might not help the problem… etc.

    For monitoring/graphing, I’d recommend checking out Munin and/or Cacti. For benchmarking, try Apache Bench or httperf…

    -Michael

  8. This is such a great primer on how to run a mega-large WP install. I’d love to hear more about how you set up SVN. My corporate house has, of late, been interested in setting up something like this. Also, do you have a staging server set up for all changes you make? If so, how do you keep the content synced?

  9. The obvious choice is memcached in combination with page-level caching (Batcache, Super-Cache, etc.)

    I’ve read lots of posts where WPMU DEV folks have mentioned memcached, but I don’t quite understand what it is… can you briefly explain how to implement it on a WPMU install? For example, is it a plugin, a script, or an Apache/MySQL/PHP setting?

    Thanks so much for your help, Michael and James! You guys rock.

  10. Hi James,

    Thanks for this great post. I’m interested to know what hosting provider you use, and also what the monthly costs of a set-up such as the one you describe above are?

    Thanks
    Tim

  11. Hi there,
    great info.
    I am concerned about multisite and the users table.
    Does it get stored in one massive table while only blog content is sharded? If so, how can I make user lookups fast as my users grow into the thousands and thousands?

    Thanks in advance.
    Jerome

  12. How does edublogs handle the Linux sub-directory limit of 31,998?
    Does edublogs use a Windows file server on NTFS for the NFS mounts to get around the limit? Or do you use some other hashing scheme to further sub-divide the subfolders?

  13. Would be interested in hearing an update to this blog post or any best practices that have come up since this was written…

    I am most interested to see if there is yet a hosting solution for starting a WP MS install that can autoscale in both cost and performance to go from 100s of blog sites to 100s of thousands… thanks!
