How to scale WordPress to half a million blogs and 8,000,000 page views a month

How to scale WordPress to half a million blogs and 8,000,000 page views a month

We figured it was about time we shared some of the lessons we’ve learned scaling Edublogs to nearly half a million blogs and a place in the Quantcast top 5000 sites! So if you have grand plans for your site (or want to improve your existing setup / performance) read on and feel free to ask any questions :)

The fundamental principle in scaling a large WordPress installation runs along the same basic principles of scaling any large site.

The key component is to truly understand your application, the architecture and the potential areas of contention. For WordPress specifically, the two key points of contention and work is the page-generation time as well as the time spent with the database.

Database Layer:

Given the flexibility of WordPress, the database is the storage point not only for the “larger” items, such as users, posts, comments, but also for many little options and details. The nature of WordPress is that it may make many round trip calls to the database to load many of these options — each requiring database and network resources.

The first-level of “defense” on overloading the database would be to use the MySQL Query Cache.

The Query Cache is a nifty little feature in MySQL, where it stores — in a dedicated are within main memory — any results of a query for a table which has not recently changes.

That is, assuming a request comes in to retrieve a specific row in a table — and that table has not recently been modified in any way — and the cache has not filled up requiring purging/cleaning — the query/data can be satisfied from this cache. The major benefit here of course is the to satisfy the request — the database does not need to go to the disk (which is generally the slowest part of the system) and can be immediately satisfied.

Memory

The other major boost for the database would be to keep the working set in memory. The working set is loosely defined as the current set of data which will be aggressively referenced in a period of time. Your database can have 500GB worth of data — but the working set — the data actually needed NOW [and in the next N amount of time] is only 5GB.

If you can keep that 5GB within memory (either using generous key-caches & system I/O buffers for MyISAM or a large Buffer Pool for InnoDB) will of course reduce the required round-trip-time to the disk. If the contention in the database is write related, consider changing the storage engine for the WordPress tables to InnoDB. Depending on the number of tables — this can lead to memory starvation, so approach with caution.

Disks

The last point on databases is disks. In the even the working set doesn’t fit in memory (which is most of the time usually), have the the disk sub-system be as quick as possible. Trade in those “ultra-fast 3.0GB SATA” disks for high-speed SCSI disks. Consider a striped array (RAID-0) — but for safeties sake let it be a RAID-10. Spread the workload over multiple disks: for 150GB of disk space, consider getting several 50GB disks so that a large throughput can be obtained. If you will be doing heavy writes to this disk-subsystem, a battery-backed write-back cache. The throughput will be a lot higher.

The really nice “defense mechanism” for the database is to avoid the database all-together. As mentioned earlier, per-page WordPress tends to make many many database calls. If these calls can be drastically reduced or eliminated the database time goes down and page-generation time goes up. This is usually done by using memcached.

There are two types of cache: object-cache (which are loosely defined as be being things like options, settings, counts, etc.) and full-page cache. A full-page cache is a fully-generated page (HTML output and all) which is stuffed into cache. This type of cache of course virtually eliminates page-generation time altogether.

We should not forget to mention MySQL slave replication. If your single database server cannot keep up — consider using MySQL replication and using a plugin like MultiDB or HyperDB to split the reads and the writes. Keep in mind that you will always have to write to a single database — but should be able to read from many/any.

Page-Generation Time

WordPress spends a considerable amount of time compiling and generating the resultant HTML page ultimately served to the client. For many, the typical choice is using a server like Apache — which with its benefits also brings some limitations. By default, in Apache the PHP processes are built into the processes serving all pages on the site — regardless if they are PHP or not.

By using an alternate web server (e.g. nginx, lighttpd, etc.) you essentially “box-in” all PHP requests — and send them directly to a PHP worker pool which can work on the page-generation part of the request. This leaves the web server free to continue serving static files — or anything else it needs to. Unlike Apache, the PHP worker pool does not even need to reside on the same physical server as the web server. The most widely used implementation is using PHP as a FastCGI process (with the php-fpm patches applied).

File Storage

When using multiple web-tier servers to compile and generate WordPress pages, one of the issues encountered is uploaded multi-media. In a single-server install, the files get placed into the wp-content/blogs.dir folder and we forget about it. If we introduce more than one server — we need to be careful that we no longer store these data files locally as they will not be accessible from the other servers.

To work around this issue, consider having a dedicated or semi-dedicated file server running a distributed file-system (NFS, AFS, etc.). When a user uploads a file, write it to the shared storage — which makes it accessible to all connected web-servers. Alternatively, you may opt to upload it to Amazon S3, Rackspace CloudFiles or some other Content Delivery Network. Either way, the key here is to make sure the files are not going to be local to a single web-server — as if they are — they will not be know to other servers.

On a distributed file-system, refrain — or never — serve files off this system directly. Place a web-server or some other caching services (varnish, squid) who is responsible from reading the data off the shared storage device and returning it to the web server for sending back to the client. One advantage of using something like varnish is that you can create a fairly large and efficient cache — in front of the shared file system. This allows the file-system to focus on serving new files and leaving the highly-requested files to the cache to serve.

Semi-static requests

For requests which can be viewed as semi-static, treat them so. Requests such as RSS feeds, although are technically updated and are available immediately following the publishing of a post, comment, etc. consider caching those for a period of time (5 minutes or so) in a caching proxy such as varnish, squid, etc. This way you can have a high number of requests for things like RSS feeds be satisfied almost for “free” — as they only need to be generated once and then fed by the cache hundreds or thousands of times.

What we use at Edublogs:

3x web-tier servers
2x database servers
1x file server

The web-tier service each has an nginx running, a php-fcgi pool and a memcached instance. The Edublogs.org name resolves to three IP addresses – each being fronted by one of the nginx servers. The nginx is configured to distribute the PHP requests to one of the three servers (itself or the other two in the pool).

The database servers in this case are functioning as a split-setup. The heavier traffic (e.g. blog content) is stored on one set of servers and the global data is stored on a separate set. “Global” data can be thought of options, settings, etc.

The file server is fronted by a varnish pool and connected via NFS to all three web servers. Each web server has a local copy of the PHP files which comprise the site (no reading off of NFS). The user uploads a multi-media file which then gets copied over to the NFS mounts. Upon subsequent requests — the data is server in return by varnish (who also caches it for future requests).

Global Tables, InnoDB & Memcache

The global tables are InnoDB as there are not that many of them and thus have better performance. One of the primary reasons for the individual blog tables are not InnoDB is because of InnoDB data dictionary issues. For large amounts of tables the dictionary can become too large and exhaust all memory on the system. Though there are patches available to change this behavior — the individual tables are still mostly read-only which MyISAM does quite well.

As for caching: We use the memcached-backed object cache and on top of that we also use Batcache (which utilizes the memcached-backed object cache).

We hope that helps… and special shout out to our SysAdmin Michael who pretty much wrote this guide :)