Some feedback

I've just quickly scanned through this plugin, and here's some quick feedback / suggestions.
Note this is mostly applicable if you're working on very large WPMU instances (we've been manipulating a 200,000+ blog DB and a 70,000+ blog DB).
First, it's really easier to use a basic algorithm to connect to DBs rather than the "dblist" thing.
Basically the password is dynamically calculated for each DB using a simple (secret) hash algorithm, and the host is selected with a simple calculation too, e.g. $blog_id % $nb_servers -> server to connect to.
In such a context, you just need to be careful about your repartition algorithm. For instance, we use a number such as 2520 rather than 1024 or 4096 for the number of DBs. Why?
Because 2520 is a multiple of 2, 3, 4, 5, 6, 7, 8, 9 and 10.
It means that if your connection algorithm is done properly, then until you reach 11 DB servers (which should give you time), adding a DB server only means moving some of the DBs onto it, without changing your repartition algorithm. With 4096 you'll have more trouble once you reach 5 DB servers and will likely have to move all or most blogs around to fit a new algorithm. You can use the exact same algorithm with 4096 DBs; it's only that you have to build it from the start in a way that is compatible with an unknown number of servers up to whatever you target.
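For illustration, here's a minimal sketch of how such a scheme could look; the function names, the secret salt and the range-based db-to-server assignment are assumptions for the example, not our exact code:

define( 'NB_DBS', 2520 );           // 2520 divides evenly by 2..10, so any server count up to 10 gets an equal share
define( 'NB_SERVERS', 4 );          // current number of DB servers
define( 'DB_SECRET', 'change-me' ); // secret salt used to derive per-DB passwords on the fly

function blog_to_db( $blog_id ) {
    return $blog_id % NB_DBS;       // e.g. blog 56497 -> db 1057
}

function db_to_server( $db ) {
    // With NB_DBS divisible by NB_SERVERS, every server holds exactly NB_DBS / NB_SERVERS databases.
    // (In practice you'd keep whatever mapping lets you move only a slice of DBs when a server is added.)
    return (int) floor( $db / ( NB_DBS / NB_SERVERS ) );
}

function db_password( $db ) {
    // Password calculated per DB from the secret, so no password list is stored anywhere.
    return md5( DB_SECRET . '_' . $db );
}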

  • Quenting

    also, the algorithm that says that:

    if( $this->blogs != '' && preg_match("/(" . $this->blogs . "|" . $this->users . "|" . $this->usermeta . "|" . $this->site . "|" . $this->sitemeta . "|" . $this->sitecategories . ")/i",$query) )

    then the query is global seems incorrect to me. As far as I remember, we found that there were some types of local queries in which these values could appear, so you couldn't preg_match against $query but had to cut it first, up to the fragment relevant to determining the destination table (the where or set clause in most cases).
    Do you guys who use this multi-db scheme not find that some queries don't go to the right destination?

  • Quenting

    Well I don't know; before the first time we split our DBs, we tracked for a while the queries that would match a rule such as yours but still be aimed at a local table, and we found quite a few.
    Those would be queries on options, spams, or content that contained the global table names.
    For instance, if I write a post that contains the string "wp_sitemeta", it will in the end result in something along the lines of:

    insert into wp_XXX_posts (...) values (..."wp_sitemeta"...).

    This is a local query, but it will match

    if( $this->blogs != '' && preg_match("/(" . $this->blogs . "|" . $this->users . "|" . $this->usermeta . "|" . $this->site . "|" . $this->sitemeta . "|" . $this->sitecategories . ")/i",$query) )

    and in your system it will be routed to a global table. This is just an example, and it may be unlikely that your users talk about WP internals; however, when we tested we found some more system-generated queries that would end up in the same situation.

    In our system, to avoid such things, we first cut $query before a keyword that depends on the type of query (insert/select, etc.), which makes sure that only the first part of the query is checked for global table names, this way avoiding the parts of the query that may contain this content too and be misleading. Hence, to make the test, we cut "select" and "update" queries before "where" or "set", etc.
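    For illustration, a minimal sketch of that idea (the keyword list and regex handling are assumptions for the example, not the actual code):

    // Hypothetical: only inspect the head of the query when deciding whether it targets a global table.
    function is_global_query( $query, $global_tables_regex ) {
        // Keep only the part before WHERE / SET / VALUES, so table names appearing
        // inside post content or option values can no longer trigger a match.
        $parts = preg_split( '/\b(where|set|values)\b/i', $query, 2 );
        $head  = $parts[0];
        return (bool) preg_match( '/(' . $global_tables_regex . ')/i', $head );
    }

    // insert into wp_12_posts (...) values (..."wp_sitemeta"...)
    // -> only "insert into wp_12_posts (...)" is tested, so the embedded "wp_sitemeta" is ignored.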

  • Luke

    Um, yeah... VIP blogs was my idea. Mainly so you could separate certain blogs onto their own db, for instances where you needed to. Like you have a "high profile" blogger/site that needs the extra speed of being separated into another db with fewer users (or no other users at all).

    One prime example of this, at least for me, is that I put my home blog in its own database. I do find it comes in handy, and it's a pretty good idea. That way, if you have some high-end blogs, they could even be moved onto their own database server if needed.

    As for mixed queries, I can't say that I've noticed the issue myself. I mean, maybe it exists and I just haven't had an instance of it yet.

    With the current code, I think it's pretty flexible. Pick the number of databases, set the connection info, done.

    If I only want half as many (8, 128, 1024), I can double up the connection info.

    To me, it would seem that using the hash method to determine the correct connection (using the first 1-3 characters of the hash) would use less overhead vice an algorithm, but maybe not.

    This particular topic is pretty interesting to me though, so feel free to continue.

  • Quenting

    nah, although our algorithm is currently changing, I'm doing pretty much the same as what's in this plugin. The code is different but the idea is similar.
    The main changes are :
    - I cut queries to their first half to avoid the 'insert into globaltable ... "content localtablename content"' issue
    - I use decimal matching rather than hexadecimal. Basically requests for a blog_id go to DB blog_id%1000 (this changed recently), and then there's another map for dbname%10 which tells which DB is on which server. I'm not sure why hexadecimal matching is used in your plugin. It seems that basically you don't have to extract the blog_id from the query and can just md5 the subdomain, is that why? I do have to extract the blog_id from the query; however, the good point is that my DB class can be used for cron tasks (or even potentially live cross-blog functions) that loop through multiple blogs.
    - I have support for replication of global tables. Some of these (usermeta etc.) have lots of reads/writes, and splitting them, just for the sake of avoiding locks, provides a nice performance boost.

  • Luke

    Having to grab the blog_id anyway, it's hashed, and then the first 1-3 characters of the hash tell it which database it's in.

    Example: 16 databases.

    With the hash, you get 0-9, and A-F. 16 values.

    Then you have 16 databases, which correlate to 0-9, or A-F.

    Hash is C, use db C.

    Not saying this makes it right, but it's the same thing wp.com is doing as well, using a hash of the blog_id to figure out the db.

  • Luke

    Maybe I'm missing your question, but you have to do something with the blog_id to translate/assign it to a database, right?

    I mean, you couldn't just say blog_id's 1-500 are db1, 501-1000 are db2, etc, since you would end up with an unbalanced load as users come and go. You have to do something to the blog_id, and this method just happens to use a hash.

    With the hash, it balances it out fairly well.

    One hash, take the first (1, 2, or 3) character(s) as $value, connect to $db[$value].
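    A minimal sketch of that lookup, with illustrative placeholder connection info (the array layout and names are assumptions, not the plugin's actual structure):

    // Hypothetical: pick the database from the first characters of md5(blog_id).
    $db = array(
        '00' => array( 'host' => 'db1.example.com', 'name' => 'wpmu_00', 'user' => 'wp', 'pass' => 'secret' ),
        // ... one entry per prefix: 16 with 1 character, 256 with 2, 4096 with 3 ...
    );

    function db_prefix_from_blog_id( $blog_id, $chars = 2 ) {
        return substr( md5( $blog_id ), 0, $chars );
    }

    $blog_id = 42;                                  // whichever blog the current request is for
    $value   = db_prefix_from_blog_id( $blog_id );
    $info    = $db[ $value ];                       // hash prefix is 'c4', use db 'c4'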

  • Quenting

    Luke: I do %1000 (for 1000 DBs), which means blog_id 56497 will go to db_497. It's an even repartition as well, and it makes for easy deduction of the DB from the blog id. Not that big a deal anyway :-p.

    Andrew: for (some) global tables, my wpdb routes the write queries to a master and the reads to a slave. For these tables all writes go one way, all reads the other (which minimizes locks). mysql handles the master/slave configuration. I have an experimental master-master configuration (still with writes going one way and reads the other), the goal of the master-master being failover: if one DB fails, all reads/writes go to the other, and when the broken DB gets back up, the replication brings it up to date.
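    For illustration, a minimal sketch of what such routing could look like in a wpdb-style class, using the old mysql_* API of the time (the class layout and the crude write detection are assumptions, not the actual code):

    // Hypothetical routing of global-table queries: writes go to the master, reads to the slave.
    class replicated_global_db {
        var $master;   // mysql link resource for the global master
        var $slave;    // mysql link resource for the global slave
        function query_global( $query ) {
            // Crude write detection; a real class would reuse the plugin's own query parsing.
            $is_write = preg_match( '/^\s*(insert|update|delete|replace|alter|create|drop)\b/i', $query );
            $link = $is_write ? $this->master : $this->slave;
            return mysql_query( $query, $link );
        }
    }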

  • Luke

    I'm curious. Do you ever notice any lag between the data that is in the global write, and what's being read from the global reads?

    This has been one concern for me after quite a bit of consultation for another (non-mu related) project, where it was brought up that data being replicated could be queued up for several hours before it is actually able to be read across the replicated database(s).

    Even still, while there are obviously times where replication has its advantages, the db's reserved for reading still have to be written to as well when the data is replicated, right?

  • Andrew

    Quenting: I'm assuming that you have the global databases on a db server by themselves and then have a few other db servers set up as slaves? Basically I'm curious whether you have the master/slaves on separate physical machines or if you are doing the replication on a single machine with several mysql instances?

    Out of curiosity, how many database servers do you guys have?

    Thanks,
    Andrew

  • Quenting

    Luke, we're not really experiencing lag so far, although we only do it on some global tables, and both global servers are breezing at this stage. I'm not sure what would happen if they were busier, because replication is indeed asynchronous. In practice we see the data replicated very quickly in our setup, but if it wasn't it would definitely be a problem (update option, save page, option not updated). Maybe a question to ask on hyperdb? I know they use replication for wp.com.
    Regarding the fact that the slaves still have to be written to, that's true, only it makes all writes "low priority", enabling all selects to be very fast and avoiding clients waiting for their inserts to complete due to locks on the DB. Overall we're talking about very small amounts of time (locks aren't normally very long anyway), but it's the bulk that ends up making it worth it.
    For performance only, it's probably not worth the extra trouble though. For us it's rather as part of a progressive approach to a full failover mechanism that we've taken this route. Ideally (some day maybe) I'm aiming at a setup with no SPoF, of which I have tons today and it bothers me :-p.
    Andrew, we have one global master and (currently) only one slave, since we're still playing with things. They are on separate physical machines, yes. We currently have 3 DB servers (quad cores) with the 4th soon to come.

  • Luke

    That's pretty interesting Q.

    I'm moving my primary baby over to a bigger/better server in the next week or two, which will help its situation out immensely. Effectively, it's 8 times the server that it's on now.

    This month we've had our busiest month by far, damn near triple our previous best, with more increases to come. It's not really super significant yet, we're just now approaching the 1 million/month mark, but all things considered, from where it came from to what I've built it into thus far, it's pretty nice.

    For me, I've taken a few different approaches with building it. There's your general user, and I'm putting together some specially invited/applied for spots with a few select people. Thus far, it's working very well, and I've got a couple more (so far) to add into the mix once I get servers changed over.

    Initially, I'm going to go with Dual Quad Core Xeons, 8G of RAM, and 3 146G SAS drives in a RAID array. I've also got additional networked storage, which eventually I'll map all storage to.

    What I'm finding interesting, at least in theory thus far, is the potential use of clustering. Pointing all db requests to the cluster, and letting it sort it out. The downside thus far has been that it's relatively new to MySQL, at least in a usable format. However, improved in 5.1 and looking even better in 6.0, it's something that down the line could prove a valuable asset as well.

    If you could maintain, say, 2 clusters, you could load balance between them and also replicate between them. The potential is there to set up read and write separation, etc. Thus far, and I would expect the findings to scale along with size, there isn't a doubt that reads are by far the majority of db requests. That being the case, one cluster for each, and also setting up replication as well.

    On our current time line, that's 18-24 months off, and in all honesty by that point in time I'll have hired someone to handle all that. lol

    But hey, the real fun hasn't even begun for me yet. Although I've got enough of a taste that it's more like a constant teasing now.

  • Quenting

    I assume you are careful with the wording and are actually talking about mysql clustering (which is different from replication).
    We studied mysql clustering and to me it doesn't look very appealing for a WPMU setup.
    First, you need 3 boxes for a mysql cluster to be truly redundant. That alone could be OK if it weren't for another problem: mysql clusters work by putting the whole DB in memory. The issue here is the number of files generated by wpmu. To be able to put the whole DB in memory, mysql needs pointers to all the files (indexes and data), which means that if you have lots of blogs, you get a ton of tables in table_cache and a ton of open files. In our experience, going above 20000 tables in the table cache results in performance degrading, and degrading more and more quickly as this number goes further up (we even experienced random crashes when we set this value to something like 100000). 20000 tables being about 3000 blogs, a mysql cluster would be comfortable serving maybe 6000 blogs (assuming each set of data is kept in the memory of 2 of the 3 servers). 3 servers per batch of 6000 blogs is really expensive, which is why we took the pseudo-cluster replication route instead.
    By having a master-master replication scheme for global tables, and doing all writes on one side, all reads on the other, then all queries on one of the two in case of a failure, you get an almost redundant setup for global tables. Once we have this working properly we'll look into making the local tables setup redundant too :-p.
    One remark on your new box: in our experience, going with a big box is good for global tables, but for local tables, since the limit is more on the number of files manipulated and kept in memory, and even without considering the extra maintenance cost, it is a better route to go with 2 smaller machines than with one machine twice as big.

  • Quenting

    What I'm saying is that loads of memory isn't very helpful with local tables, because the tables are generally small, and you reach the max number of files manipulated faster than you reach your memory limit. This is of course untrue for global tables, which are big.
    In my opinion the best DB setup is one big (in terms of memory) DB server for global tables (resp. several if you do replication) and plenty of smaller machines for local tables (like 4 gigs of RAM). For the global server, "big" means big enough to handle all your global tables (the generic rules for mysql server sizing apply here; there are formulas out there to calculate how much memory you need depending on your DB size, for us 8G is good enough).

    Of course though, better CPUs / faster disks always help, and 64 bits is really nice too in all cases.

  • Luke

    What I was getting at with clustering:

    Starting with 5.1, only the indexes are kept in the cluster's memory, and 6.0 further improves/enhances clustering. That being the case, at some point clustering may be a viable option to run in combination with other things to further enhance redundancy and failover. For example, as mysql gets better with clustering, if you had a small cluster of 3-4 servers that handled 20% of your user data, you're done. Why? All your backup is done within the cluster. Lose 1 machine, and the others still have all the data, and can still perform until the failed server is corrected.

    Again, as mysql gets better with it, a small cluster could be an interesting tool down the road. You'll need to be pretty big to need it, but the possibility isn't far off if their roadmap is on track. Right now, clustering in mysql 5.0.x requires enough memory within the cluster on the order of 1.1 or 1.2 times the size of the database. In 5.1, it's the same formula but compared to the total size of the database indexes. That effectively cuts your memory needs within the cluster by a minimum of 60% (typically more).

    On the same line, as mysql improves overall, it will need less memory as well. Sure, 6.0 is a ways off and 5.1 is just emerging for stable use, but it's getting better. What's available in 2 years with this will be interesting indeed.

    For now, at least for me, a single box with that much power will be quite sufficient as it will be serving both web and db requests. As things get moving, the database itself (all of them) will end up on a separate server from the web requests. Your basic natural progression.

    However, MU won't be all that's encompassed here, either. Most certainly a big part of it, but not the whole enchilada. When we get into the need for having 4 or 5 db servers, 2 or 3 web servers, and load balancing, that's another ball of wax.

    Later on, as mysql progresses and things get bigger, I could see clustering being useful for certain databases. Like the global database, maybe larger databases like a VIP database with all your high traffic/size clients. Mysql has to get there, and I wouldn't do it with anything less than the 5.1 branch, but it could be quite a viable option then.

    Disks will be plenty fast enough. We're using SAS drives, naturally they'll be in an array. User files will all end up on network mapped storage, which will be nice as well.

  • Quenting

    Interesting that now only the indexes need to be in memory for mysql cluster; that can only help.
    However, note that the number of open tables is the real limitation that we've seen, so that's really the factor you need to look at if considering such a solution. I suspect it's not only a mysql but also a linux limitation that results in performance degrading in such a situation, which means it might not get solved very quickly. However, the approach of clustering parts of a db (typically global tables) seems totally viable.

  • Luke

    I thought it was an interesting consideration. Of course I'm far from needing that at the moment, but the concept interests me considerably. However, there will come a time when things jump literally overnight, so at the same time I'll need to be as prepared as possible. But, that's a whole different story.

    There are some people out there that swear by clustering, even now. I wouldn't, from the sheer memory needs of it now, but with 5.1 and just keeping indexes in memory, that's a plus and a step in the right direction.

    By clustering global tables, for example, it could relieve the need for specific read/write db's, as the cluster would handle it all. Certain things like locking issues could be side-stepped with it, although the pipe to the cluster would need to be pretty large, I would think.

    For certain things, I can see it being a potential solution. At least part of it, anyway.

    It solves replication issues, as each machine in the cluster has a full copy of the database, which provides the redundancy and backup. One server goes down, the cluster continues to operate, and when the issue is resolved the server is brought up to sync with the cluster.

    What I'm not sure of, however, is if you could run multiple databases off of a single cluster. My understanding is that it would be possible, but I'm out on a limb with that one.

    If that were the case, and the cluster acts like a big db server, I could see potential with that as well.

    Using the md5 method of separating db's as an example, if you had say 256 user databases, you could potentially put 64 per cluster, and have 16 total db servers (4 clusters w/ 4 machines).

    I have no idea if that would work or not, but it doesn't seem impossible down the road as mysql improves.

    Also to come into play here is specifically what equipment is available as well.

    For me, I know that while a db server for strictly user tables would be quite sufficient with (2) dual core processors, I'd end up with (1) Quad core. The network provider I work with has already pushed them aside. That brings up some interesting things as well, because IMHO 2 sockets are better than 1 regardless of cores. So, you end up with (2) Quads. That being the case, 8G of RAM is looking to be the new standard. Sure RAM is shared, unlike L2, but at the same time I've read numerous cases for having a sufficient amount of RAM in relation to the number of cores as well.

    Also, cost is going down on these as well. For example, this server, compared to a server with the same specs (with SATA drives vice SAS) that I put online last August, is now just over 30% less than what that one was. And it's improved in the hd area as well.

    At some point, I think the servers are going to be overpowered for some uses for a while. Especially as mysql becomes more efficient in succeeding versions.

    That's OK though, I think.

    That gets us into the file limitations, however, and the number of users per database.

    At that point, I can see where more db's (not servers, but individual db's) can help out as well.

    Which gets interesting with the md5 solution. With the first 3 digits, you're already up to 4096.
    That's one big config file. I think most users of this right now have 256, but that's still a big config file.

    What I'm interested in, is the configuration file. There "has" to be some way to keep that in check. While I'm not sure of the exact file size of a configuration with 4096 db's in it, I would think at that point it would be more efficient to include a smaller file vice having all of that data included.

    When you parse the query, which happens before the connection is made, a configuration file could be included at that point, that has the relevant information in it.

    Better still would be some sort of algorithm to calculate what the connection information is instead of loading a file at all (see the sketch a bit further down). Naturally there are concerns to be taken into consideration with that, but it's nothing that would be impossible.

    If you have your databases named accordingly, that's one step.
    Next, you can break up your password into 2 chunks. Define the first and last pieces, and
    then use the corresponding hash value (up to the first 3 characters), to mix in the missing piece of the password. The same could be used on the username as well.

    The difficult part is going to be the actual server connection, and setting the IP.

    The easy way is to have an array for each database server, then check for the value in it.

    To keep it simple, let's say you have 4 database servers with 1024 db's on each one. That would give you 4 arrays, with 1024 values in each one. To me, that's going to be a little much and defeat the whole purpose of dynamically generating the connection info.

    For your 001, 002, 003, 004 type values (first 3 md5 chars), it wouldn't be so bad.

    When you get into your alphanumeric values of A0D, B4A, etc, then it becomes more complicated.

    At that point, it would be easier to skip it all, and just include a file with the username, pass, and connection info once the db to connect to has been determined.

    You could also combine files a bit, as at that point it isn't going to matter if it's 4 lines or 16 lines.

    So if you had 4096 db's, you could have 256 files which hold the info for 16 db's. 256*16 being 4096. For 256 db's, you end up with 16 files. (0-f, or 00-ff).

    However, you're going to have 000-999 in terms of purely numeric values. You could potentially put 000-499 on one DB server, 500-999 on another. Then, if it's one of the alphanumeric db's, include a file. Have 16 files (0-f) which only map the db to a particular server. That would be something like 524 lines per file, which isn't that bad, and only included some of the time instead of all the time.

    Either way, there are plenty of options out there for it. The question comes down to what's most efficient: is it easier to just grab a file and skip determining it on the fly, or some combination of the two?
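    Picking up the "calculate it instead of loading a file" idea from above, here's a minimal sketch; the prefix length, server count and naming scheme are made up for the example:

    // Hypothetical: derive everything from the hash prefix, no 4096-entry config array.
    $blog_id    = 42;                                 // current blog
    $nb_servers = 4;                                  // e.g. db servers named dbserver0..dbserver3
    $prefix     = substr( md5( $blog_id ), 0, 3 );    // '000'..'fff' -> 4096 possible db's
    $db_name    = 'wpmu_' . $prefix;                  // databases named after their prefix
    $server_id  = hexdec( $prefix ) % $nb_servers;    // spread the 4096 db's evenly across servers
    $db_host    = 'dbserver' . $server_id . '.example.com';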

    Don't ask me where I'm going with this all, just brain dumping I guess.

    At the very minimum though, breaking up the connection information into smaller files has to be better than having it all in one.

  • Quenting

    well I read it :-p.
    Two things:

    - On the config file issue. I don't know how you guys do it (do you have an array of 4096 values, one for each DB??), but I have no such thing as a config file anymore. I have a set of functions:
    get_db_from_blog_id
    get_server_from_db
    get_server_connection_info
    that return the various things I need to connect to the proper DB in the proper location. The db user and password are all dynamically generated from the blog_id, and the repartition of blogs to DBs and DBs to servers is a plain simple algorithm that takes 3 lines of code.
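    For illustration, a minimal sketch of what those three functions could look like (the modulo constants, naming and salt are assumptions for the example; the real code isn't published here):

    // Hypothetical implementations of the three routing helpers.
    function get_db_from_blog_id( $blog_id ) {
        return 'db_' . ( $blog_id % 1000 );           // e.g. blog 56497 -> db_497
    }

    function get_server_from_db( $db_name ) {
        $db_num = (int) substr( $db_name, 3 );        // strip the 'db_' prefix
        return 'dbserver' . ( $db_num % 10 );         // dbname%10 -> server
    }

    function get_server_connection_info( $blog_id ) {
        $db = get_db_from_blog_id( $blog_id );
        return array(
            'host' => get_server_from_db( $db ),
            'name' => $db,
            'user' => $db,                            // one mysql user per db (assumption)
            'pass' => md5( 'secret-salt-' . $db ),    // derived on the fly, never stored in a file
        );
    }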

    - On the file limitations: I think (I may be wrong though) that you're mixing up 2 concepts. There's one file limitation for which splitting your DB into pieces helps: the number of files in a directory. There's no real limit to that number, but according to various benchmarks, above 15000 it starts getting slower and slower over time (although at some stage we had some DBs with over 300000 files and it was still working all right, if not fast; note we also got rid of the "show tables" statement in MU). Splitting DBs helps immensely with that, and on that I totally agree with you. But the file limitation I'm talking about is the number of open files, which is directly linked to the number of configured tables in table_cache. Splitting your DBs doesn't have any impact on this issue, because this is a per-server (and not per-db) issue. It's the total number of tables mysql can keep a file pointer to in memory server-wide, so the layout of your dbs is unrelated. It has a direct impact on performance, because only the tables in table_cache have their indexes kept in memory. If you access a table that isn't in the cache, the index has to be read and put in memory before the query is executed. If your cache is full, mysql needs to first take a table off its table_cache before it moves the new one up there. Which means that if your cache is full, you get a lot of disk reads, because every time a table not in the cache is accessed, its index needs to be read.
    So theoretically, you would have to use a table_cache that's larger than the sum of all your tables, or at least all your "active tables" (meaning the number of tables being accessed between 2 mysql restarts). Only (and that was my point originally) if your table cache gets too high, performance degrades, probably due to this number of tables maintained in memory. Which is why I suggested that plenty of small servers was better than fewer bigger ones.

  • ZappoMan

    Quenting,

    Are you still using the design you described in this thread? I realize it was a year ago... and it seemed like a pretty reasonable solution.

    http://mu.wordpress.org/forums/topic.php?id=2163

    My site is still relatively small with about 2,100 blogs. I'm looking to get ahead of the curve.

    I just signed up to premium so I could get access to Andrew's Multi-db... (and all the other cool stuff, of course)... Not to start a "my code is better than your code" type discussion, but I'm curious what you guys would recommend.

    Btw, on the related note of user files, my plan is to move file-based stuff to Amazon S3. I moved all the theme files over to S3 about a month ago, and am happily serving 3m+ resources a month for less than $5.

  • Quenting

    We used it originally, although it has now evolved quite a bit.
    We still have reverse proxying, then a layer of http servers, then a layer of db servers. Our db access method is neither the one here, nor hyperdb, nor the one we used at that time :p. Basically we split our local DBs into 1000 DBs located on multiple servers, with a slightly different repartition method than the one you will find here, and we use master/slave replication for global tables. Our DB code is now in between the one here and hyperdb, I'd say.

  • Luke

    What's your average number of tables per DB?

    It would "seem" that wp.com is running about 4k/DB, with 2M blogs (give or take) and 4096 db's, and something like 256 db's per server. Give or take, of course.

    Overall though, without needing any special algorithms, I would have to say that using the hash of the blog id makes sense. It balances users pretty well, can be scaled, and doesn't create holes or gaps on one server vice another.

    Let's say you use the first 2 characters of the hash (256), but only want 128 db's, or even 64.
    Easy. Break it down in half and point 2 results (like 00 and 01) to db1, etc. You end up with 128 db's.

    Or, if you only want 64, you can take 00,01,02,03 and point them to db1. All done easily through the connection array.

    The downside of all of this, that I see anyway, is when you get to even 256, you have to have a huge db connection file.

    IMHO, I would like to see something where the information was generated on the fly to keep the connection array to a minimum. That's just me though.

    Not that it doesn't work, it works pretty damn well. Then again, I'm always looking for efficiency where it can be found. Part of the good thing about having a group of developers to bounce ideas back and forth off of.

    For generating a user/pass, it could be something as simple as the user supplying the start of the username, then the pass, and then letting the hash provide a 1, 2, or 3 character difference, followed by a second user input. So you would end up with something like:

    u: johnny{a}black
    p: xvFu{a}jI18

    Where {a} is the db hash value being used.
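    A minimal sketch of that idea (the chunk values are just the placeholders from above):

    // Hypothetical: build the db user/pass from two fixed chunks with the hash value in the middle.
    function make_credentials( $blog_id, $chars = 1 ) {
        $a = substr( md5( $blog_id ), 0, $chars );    // the {a} value
        return array(
            'user' => 'johnny' . $a . 'black',        // u: johnny{a}black
            'pass' => 'xvFu' . $a . 'jI18',           // p: xvFu{a}jI18
        );
    }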

    Where I could see a problem is with multiple db servers, but you're still cutting the connection array by 75% or better. Even still, I couldn't see someone having more than 16 db servers; at that point you can use the first character of the hash, compare it against an array key (array_key_exists), and be done.

    Also, something I've thought about is how you have to list user tables and global tables. Why list user tables at all? If it isn't global, it's user. Whether they're in a VIP db or not isn't relevant, as this is the table connection part.

    This would help out for adding plugins and such, and prevent the need for adding it to the array. Further scrutinizing the connection information.

    I guess the plugin now takes a different approach, but I would think that global tables aren't necessarily added to often, which would be easier to handle. Potentially a shorter list as well, considering the vast amount of plugin possibilities.

  • Quenting

    [quote]What's your average number of tables per DB?[/quote]

    We have about 200, but we could grow a lot in the same layout.

    [quote]Overall though, without needing any special algorithms, I would have to say that using the hash of the blog id makes sense. It balances users pretty well, can be scaled, and doesn't create holes or gaps on one server vice another.

    Let's say you use the first 2 characters of the hash (256), but only want 128 db's, or even 64.
    Easy. Break it down in half and point 2 results (like 00 and 01) to db1, etc. You end up with 128 db's.

    Or, if you only want 64, you can take 00,01,02,03 and point them to db1. All done easily through the connection array.[/quote]

    I don't see any of these as things you don't also have when using blog_id % 1000, which has all of them and is just plain simpler. And you really don't need a config file; I have 8 lines of php that generate all the authentication and DB routing stuff.

    [quote]Also, something I've thought about is how you have to list user tables and global tables. Why list user tables at all? If it isn't global, it's user. Whether they're in a VIP db or not isn't relevant, as this is the table connection part.[/quote]

    I'll re-ask your question: why the need to list any tables at all? We don't list either global or local tables. Either the table name has a _XXX_ in it, with XXX a number, and then it's local, or it doesn't and it's global. If you add new tables, then depending just on the format of their name, they'll automatically be interpreted as local or global, no extra config needed.
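    For illustration, a minimal sketch of that test (the regex is an assumption about how the check could be written, not the actual code):

    // Hypothetical: classify a table as local or global purely from its name.
    function is_local_table( $table_name ) {
        // A _XXX_ segment with XXX numeric means a per-blog table:
        // wp_123_posts -> local, wp_sitemeta -> global.
        return (bool) preg_match( '/_\d+_/', $table_name );
    }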

  • ZappoMan

    Hey Q,

    I agree with everything you're saying here. Any chance you'd be willing to share your code?

    By the way, my team and I were discussing migration and doing upgrades in an always hot manner. I hate the idea of ever taking the service off line.

    One thing that we were discussing is the idea of not using a hash at all, and instead simply storing in the blogs table (or a similar global table) a map from blog domain to server/port/schema/password. Even if this table were 100m rows, MySQL should have no problem getting these results back quickly and easily.

    One advantage of this completely non-algorithmic way of assigning blogs to DBs is that it allows you to administratively control what cluster a user is in. Of course you still assign new users to a cluster algorithmically upon creation... but after that you use a lookup.

    The main advantage of not assigning blogs to clusters with an algorithm would be the ability to migrate blogs between clusters without worrying that an algorithm wants it in a different cluster.
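    For illustration, a minimal sketch of that lookup approach, assuming a hypothetical wp_blog_db_map global table (purely illustrative, nothing that exists in MU today):

    // Hypothetical: non-algorithmic routing via a lookup table instead of a hash.
    // Assumed table: wp_blog_db_map( domain, host, port, schema_name, db_user, db_pass )
    function get_connection_for_domain( $global_link, $domain ) {
        $sql = sprintf(
            "SELECT host, port, schema_name, db_user, db_pass FROM wp_blog_db_map WHERE domain = '%s' LIMIT 1",
            mysql_real_escape_string( $domain, $global_link )
        );
        $res = mysql_query( $sql, $global_link );
        return $res ? mysql_fetch_assoc( $res ) : false;   // admins can re-point a blog by editing this row
    }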

    Thoughts?

  • ZappoMan

    Q:

    I just reread this...

    ...we use master/slave replication for global tables.

    And I realize now that what you are saying is that you didn't have to do any special db class code for determining on the fly if a query call should go from one db instance to another... At least you shouldn't have to if all the DBs contain a copy of all the global tables...

    Are you using the standard wpmu DB class?