Scaling Strategy

Hi,

This is a subject that's been dealt with many times here and in other forums, but I wanted to bring it up again and see if I can get a fresh perspective.

I'm starting a multisite network (a specialized hosting service) that I expect to do well. And I want to plan the best scaling strategy that will allow me to start with a reasonable amount of gear now (e.g., one or two dedicated servers), but grow with minimal headache as needed.

The strategies I'm looking at are:

#1: Putting Everything in One Basket

The entire network would be under one domain, and would scale as needed. The question is how it would scale. I assume I could start on one or two servers now, then span the system across multiple servers later via Multi-DB and a distributed file system for uploads. The downside is that, as mentioned, everything would be in one basket. A failure on one server could cause problems for the whole network.
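(For reference, my understanding of the Multi-DB approach is that each blog's tables get mapped to one of several databases by hashing the blog ID. The rough Python sketch below is only to illustrate that idea; the shard count, hostnames, and naming scheme are made up and are not the plugin's actual configuration.)

# Rough sketch of blog-ID-to-database sharding (illustrative only).
import hashlib

# Hypothetical shard map: 16 databases spread across two DB servers.
DB_SHARDS = (
    [{"host": "db1.example.com", "name": "wpms_%02d" % i} for i in range(8)]
    + [{"host": "db2.example.com", "name": "wpms_%02d" % i} for i in range(8, 16)]
)

def shard_for_blog(blog_id):
    """Deterministically pick the database that holds a given blog's tables."""
    digest = hashlib.md5(str(blog_id).encode()).hexdigest()
    return DB_SHARDS[int(digest, 16) % len(DB_SHARDS)]

# Blog 42 always lands on the same shard, no matter which web server asks.
print(shard_for_blog(42))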

So, if I go with this strategy, I'd certainly appreciate any recommendations on the best way to grow the system when it's time.

#2: Standalone Mini-Networks

With this strategy, when the original network gets to a predetermined capacity, I'd bring another one online under a different domain name. The domain names are somewhat arbitrary, since most users will probably map their own domains to their sites anyway. So users probably won't care whether they're signing up as mysite.bigbloghostingservice.com or mysite.bighostingcompany.com or whatever.

A variant of this strategy would be to start with two or more networks, each on its own dedicated server, to give people a choice of domains from the start. This would essentially create a situation similar to using the multi-domain plugin offered here, except that users would actually be choosing between different networks (unbeknownst to them).

The advantage of this strategy is NOT putting all the eggs in one basket. Instead of growing one big network and having the related growing pains, this strategy would just consist of creating a new network. If one network fails, only the users on that network are affected.

The disadvantage is that I'll have to install themes, plugins and other modifications multiple times instead of once, but I think the extra work probably wouldn't be any more than the efforts that would be needed to manage the "single basket" strategy above.

#3: Hybrid Combination of #1 and #2

This approach would consist of separate networks...but when each one outgrows its server (or comes close to it), that network is grown into a small cluster. For example, a single-server network might be split off into three servers (database, web, file uploads). And we'd leave it there. Then, when each cluster gets as big as we want it, we bring another new network online.

The advantages of this strategy would be limiting the number of networks (instead of putting a new small one online whenever we need more), while limiting the headaches and risk of putting everything in one huge basket. This would be several smaller baskets, but not lots of tiny ones.

Am I on the right track or barking up the wrong tree entirely? (Woof.) Any thoughts would be greatly appreciated.

Thanks,

Mark

  • bhaun

    wpcdn,

    I am in a similar situation and decided it would be best to scale horizontally. I don't like the amount of administration work necessary to shard (#2) or a hybrid like #3.

    So the network looks something like this:

    [ MASTER ] --- [ SLAVE1 ] -- [ SLAVE2 ] ... [ SLAVE10 ]

    So what I do is have a central copy of all the WPMS data available for each server. I check the code into a revision control server (GIT). When I need a new server I have the SLAVE grab a copy of the code from GIT.

    The users are all directed to the MASTER server for their dashboard (which I am having a problem with). When they log in they can publish their user-generated content. The MASTER server has a set of scripts based on inotify that automatically push that static content to S3. I then use S3 as an origin for a CDN. I rewrote a CDN plugin to create a set of static asset hostnames for all the static content.
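    Roughly, the push piece looks like the sketch below. This is a simplified stand-in rather than my actual script: it uses the watchdog library (which wraps inotify on Linux) plus boto3, and the bucket name and upload path are placeholders.

    # Simplified sketch: push newly uploaded files to S3 as they appear.
    # watchdog wraps inotify on Linux; bucket and paths are placeholders.
    import os
    import time
    import boto3
    from watchdog.observers import Observer
    from watchdog.events import FileSystemEventHandler

    UPLOADS = "/var/www/wp-content/uploads"   # watched directory (placeholder)
    BUCKET = "my-wpms-static"                 # placeholder bucket

    s3 = boto3.client("s3")

    class PushToS3(FileSystemEventHandler):
        def on_created(self, event):
            if event.is_directory:
                return
            key = os.path.relpath(event.src_path, UPLOADS)
            # Upload immediately so the CDN can origin-pull it from S3.
            s3.upload_file(event.src_path, BUCKET, key)

    observer = Observer()
    observer.schedule(PushToS3(), UPLOADS, recursive=True)
    observer.start()
    try:
        while True:
            time.sleep(1)
    finally:
        observer.stop()
        observer.join()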

    The SLAVE servers are basically read-only slaves of MASTER. They boot as needed and download the latest copy of the code (without the user-generated content in wp-content).
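    The boot-time step itself is nothing fancy; conceptually it is just this (repo URL and docroot are placeholders):

    # On SLAVE boot: fetch the latest code from the central Git repo.
    # Repo URL and target path are placeholders.
    import os
    import subprocess

    REPO = "git@git.example.com:wpms/site.git"
    DOCROOT = "/var/www/wpms"

    if os.path.isdir(os.path.join(DOCROOT, ".git")):
        subprocess.check_call(["git", "-C", DOCROOT, "pull", "--ff-only"])
    else:
        subprocess.check_call(["git", "clone", REPO, DOCROOT])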

    For the DB I currently use a different sharding method that lets me take advantage of AWS's RDS. I utilize the multi-AZ feature to give me an HA pair of MySQL servers, and I use asynchronous read-only replicas to scale out the reads. I have chosen a sharding scheme that will allow me to scale writes to multiple RDS instances in the future.
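    To give a feel for the read/write split, a stripped-down routing sketch is below. The endpoints are placeholders, and the real scheme also picks the shard from the blog ID first.

    # Stripped-down read/write routing: writes go to the multi-AZ master
    # endpoint, SELECTs are spread across read replicas. Endpoints are
    # placeholders; the real setup also chooses a shard per blog ID.
    import random

    WRITE_ENDPOINT = "wpms-master.abc123.us-east-1.rds.amazonaws.com"
    READ_REPLICAS = [
        "wpms-replica-1.abc123.us-east-1.rds.amazonaws.com",
        "wpms-replica-2.abc123.us-east-1.rds.amazonaws.com",
    ]

    def endpoint_for(query):
        """Route SELECTs to a replica, everything else to the master."""
        if query.lstrip().upper().startswith("SELECT"):
            return random.choice(READ_REPLICAS)
        return WRITE_ENDPOINT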

    I also take advantage of nginx, php-fpm, eAccelerator, and memcache.

    I personally made the choice to do this all in AWS so that I can automatically spin up additional resources as the site needs it as well as for temporary loads.

    Just because I did this in AWS does not mean it can't be replicated on physical gear if you feel more comfortable with that. One option many large sites are taking now is virtualizing the entire physical infrastructure. You can set up something like cloud.com's CloudStack and build a virtual environment in your own MSP or colo, with each VM taking up an entire physical machine. This lets you move instances around and, more importantly, perform tons of automated tasks that would be far more difficult with straight physical gear.

    I would be glad to share more details if this interests anyone.

  • wpcdn

    So far I'm very impressed with bhaun's strategy but also a bit intimidated by it, to be honest. It seems very complex, which is a testament to your knowledge and expertise.

    Before moving forward with that approach, I'd still be interested in hearing other opinions.

    One question: Does our strategy #2 make sense? This strategy would be to create separate mini-networks (each hosting a few hundred clients, for example). So in other words, we might start with two dedicated servers. Each would have its own multisite network under its own domain name, and the names would be somewhat generic so users probably wouldn't have any reason to choose one over another. Each network would be identical from the end users' standpoint. Other than the domain name, a user would have no indication of being on a different server. Once one server reaches our quota margin, we'd add another server.

    We like not putting our eggs in one basket, and maintaining a "capsulized" approach that would make sense for various verticals. But we also wonder if we're crazy for thinking that way. The only downsides we can think of are:

    - What happens if a client wants to add a lot of sites? If that server is "full" we wouldn't be able to let them do that under the same account. So we'd have to leave plenty of margin in our quota to allow for this sort of thing.

    - We'd be managing multiple networks, although I don't know if that's any worse than managing a big complex one.

    Thanks,

    Mark

  • LexBlog

    Mark,

    @bhaun's setup is a good starting point. I like his strategy of using git (though @bhaun, you might look at something like Chef or Puppet for provisioning instead of git alone).

    I am building out a similar infrastructure using NFS to map wp-content and blogs.dir to different NFS servers for scaling. My choice to use NFS instead of S3/Cloudfront has strictly been around the issues with content expiration and content refresh. I'm staying away from that option for now - although trying to keep myself in a position to add it in the future.

    My setup is:

    [WPMU Server 1] ---> [NFS for wp-content]
    [WPMU Server 1] ---> [NFS for blogs.dir]

    then I use Chef to provision additional servers with my custom config (nginx/php-fpm/memcache), and they automagically mount the NFS shares on boot.
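    In case it helps picture it, the boot-time mount step the Chef recipe performs is roughly equivalent to the plain-Python sketch below (server names and export paths are placeholders, and the real thing is a Chef recipe rather than a script):

    # Rough stand-in for what the Chef recipe does at boot: mount the NFS
    # exports for wp-content and blogs.dir. Hostnames/paths are placeholders.
    import os
    import subprocess

    MOUNTS = [
        ("nfs1.example.com:/exports/wp-content", "/var/www/wp-content"),
        ("nfs2.example.com:/exports/blogs.dir", "/var/www/wp-content/blogs.dir"),
    ]

    for export, mountpoint in MOUNTS:
        os.makedirs(mountpoint, exist_ok=True)
        if not os.path.ismount(mountpoint):
            subprocess.check_call(["mount", "-t", "nfs", "-o", "rw,hard", export, mountpoint])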

    I chose memcache, as I hope to have traffic volumes in the future that will require me to segregate my caching from the WP server to another group of caching-only servers. If my goal was to keep everything on one server, I would have chosen APC instead.

    @bhaun - Why do you force all users to login to a specific server? I can understand if you were using a CMS like Movable Type, but not WP. Can you let us know your strategy around that? I'm more curious if you're doing something I haven't thought of, vs. just trying to talk you out of it.

    Tim

  • wpcdn

    Tim,

    Thank you for the info.

    I've heard that NFS has issues at larger scale; do you know if that's true? Have you ever considered GFS?

    I think one of the biggest factors in our strategy is that it's not necessary for everything to be part of the same network. We're selling a hosting product, but we aren't aggregating users' content or creating a community, so each user site is still a standalone entity. In our eyes it's a lot like our cPanel hosting, where we just start a new server when an existing one reaches our maximum quota zone and end up with a cluster of standalone cPanel servers. By the same token, our multisite hosting clients won't know or care whether they're part of one large network or one of a cluster of smaller WP multisite installations. That's probably the most important factor in giving us the freedom to explore our options.

    Thanks,

    Mark

  • LexBlog

    Mark,

    By GFS do you mean GlusterFS? I have tried to use that with another infrastructure project and it failed miserably. If that is what you mean, I can take my insight offline and discuss it over email.

    NFS works fine, depending on your load and your caching strategy. I come from a large ISP background and used NFS extensively. I always worry more about DB access with WP.

    RE: NFS vs. GFS, the key is to not tie yourself to any technology too tightly. You should continually be asking yourself, "How tough will it be to move from NFS to GFS in a year? What can I do now to prepare for that?" Answering that question takes twice as much work, because you not only have to figure out how to implement NFS, but also how to move from NFS to GFS later. A little homework today saves you long-term headaches.

    Good luck.

    Tim

  • bhaun

    @bhaun's setup is a good starting point. I like his strategy of using git (though @bhaun, you might look at something like Chef or Puppet for provisioning instead of git alone).

    Tim,

    You need Git plus Chef or Puppet for the orchestration. Git (or any revision control) simply maintains a central place to hold all the WordPress code, so as the site auto-scales all the slave instances have identical content. The orchestration lets me start from a vanilla base image and install all the software and configuration on boot.

    Because I use AWS, NFS is not a practical option: it presents a single point of failure across multiple availability zones. Clustered filesystems can be used with varying degrees of success. My concern in AWS is that you have a limited network pipe which is potentially shared (depending on the instance type) with all the other VMs on that machine, so your performance will vary.

    I personally have instances in three availability zones, giving me highly available infrastructure. Based on performance metrics (CPU, memory, NIC), each machine "votes" on whether it is over- or under-loaded. If enough machines vote that they are out of spec, I can automatically launch or destroy instances as needed. This keeps our costs low while letting us provision the resources we need almost instantly. Currently, during a load event it takes about five minutes to provision three additional servers and add them to the load balancer.
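    Conceptually, the voting check each machine runs boils down to something like the sketch below. The thresholds and quorum are placeholders, and the real version reads live CPU/memory/NIC metrics and calls the EC2 and load-balancer APIs to act on the decision.

    # Conceptual sketch of the per-machine "vote" and the cluster decision.
    # Thresholds and quorum are placeholders, not the production values.
    def vote(cpu_pct, mem_pct, nic_pct):
        """Return +1 if overloaded, -1 if underloaded, 0 if in spec."""
        busiest = max(cpu_pct, mem_pct, nic_pct)
        if busiest > 80:
            return 1
        if busiest < 20:
            return -1
        return 0

    def decide(votes, quorum=2):
        """Turn the individual votes into a scale-out / scale-in decision."""
        if sum(1 for v in votes if v == 1) >= quorum:
            return "launch"       # spin up instances and add them to the LB
        if sum(1 for v in votes if v == -1) >= quorum:
            return "terminate"    # drain and destroy spare instances
        return "hold"

    # e.g. three machines report their votes each minute:
    print(decide([vote(92, 70, 40), vote(85, 60, 55), vote(30, 25, 10)]))  # "launch"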

    If I were building this in a traditional DC and had the option of a highly available NAS, I would definitely go that route. It is much easier to implement; the tradeoff is cost.

    One note: both S3 and CloudFront let you set expiration with the Expires header and TTLs, as well as API purges, if that helps you at all. I will say that CloudFront can be expensive and may not have the best POPs for your users. You may want to look at using S3 as the origin-pull for some other CDN.
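    Setting the expiration at upload time is a one-liner; for example, with boto3 (bucket, key, and TTL below are placeholders):

    # Example of setting a cache TTL when pushing an object to S3, so
    # CloudFront (or any origin-pull CDN) honors it. Values are placeholders.
    import boto3

    s3 = boto3.client("s3")
    s3.upload_file(
        "/var/www/wp-content/uploads/logo.png",       # local file (placeholder)
        "my-wpms-static",                              # placeholder bucket
        "uploads/logo.png",                            # object key
        ExtraArgs={"CacheControl": "max-age=604800"},  # 7-day TTL
    )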

    OK, so why redirect all users to a single dashboard? I would love to hear alternatives to this...

    WordPress creates a virtual dashboard for each new subdomain that is created. When a new user logs into their dashboard, they may land on any one of the many instances in the cluster, and since we want to keep the user-generated data in sync, that can make things very difficult. To keep the data synced, I keep a single machine that people upload their user-generated content to.

    As soon as the content is uploaded, the inotify subsystem will sync that content to S3. That makes it instantly available to the end user.

    I would be glad to discuss, share, and help anyone who is interested. In particular, if you want to get started with AWS, I would be glad to help the WPMU DEV community get all this in place.

  • wpcdn

    Thanks for the additional information and the clarification.

    We're still struggling with this. Again, since each of our user sites is a separate entity, it's not so important for us to have one giant network. But there are advantages...we have to decide whether those advantages outweigh the work required to set up and maintain an infrastructure for scaling.

    I'm very impressed by what you have here.

  • bhaun

    I've heard that NFS has issues at larger scale; do you know if that's true? Have you ever considered GFS?

    Mark,

    NFS does have issues at scale. As Tim pointed out, if you are caching most of the content then it should not be an issue. The cases where I do see it become an issue are either a large number of connections to the NFS server or an extremely large number of files in a single directory. WordPress does a good job of sharding the user-generated content into subdirectories, which makes it far easier for any filesystem to drill down to the correct file.

    I do have servers running GFS and it works very well. You should know that to fence in GFS you need access to the remote management NIC on your server (iLO, IPMI, etc.). This may be tricky to set up, and depending on where you get your server, you may or may not be given access to it. OCFS and Gluster are also very popular choices worth checking out.

    Another interesting option that I have set up in the past is Unison (http://www.cis.upenn.edu/~bcpierce/unison/). Unison is like rsync, but it can be configured across multiple servers to keep the data in sync, over either SSH or TCP. In my experience, though, it has had some interesting issues across three servers when files are constantly changing.
