multi-db: hash distribution stats

Has anyone done any analysis of the distribution of the md5-style hash that multi-db uses?

I'm not too worried about it, since I know it will be hella better than no hashing (as in, not using multi-db), but I was chatting with a buddy of mine and he was curious what kind of distribution you actually get from this.

What I'm noticing on the 6000 blogs I have using the 4096 distribution is that about 20% of the hashes are getting used (49 slots out of 256) and about 80% are going unused.

I'm not saying that's a problem, I'm just curious if that's what other people see as well.

Obviously, I understand that with 4096 hash slots and only 6000 keys so far, we wouldn't expect to see an even distribution. I'm just curious what others are seeing on larger deployments.
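
For a rough baseline, the sketch below estimates how many slots an ideal, perfectly uniform hash would be expected to fill. It is an idealized model, not a measurement of multi-db, and the function name `expected_used_buckets` is made up for illustration. It uses the standard occupancy estimate: each slot stays empty with probability (1 - 1/slots)^keys.

```php
<?php
// Expected number of non-empty buckets when $keys keys are hashed
// uniformly at random into $buckets buckets. This is an assumption-based
// model (ideal hash, independent keys), not multi-db's own code.
function expected_used_buckets(int $keys, int $buckets): float
{
    // Probability that one particular bucket receives none of the keys.
    $pEmpty = pow(1 - 1 / $buckets, $keys);
    return $buckets * (1 - $pEmpty);
}

$buckets = 4096;
$keys = 6000;
$used = expected_used_buckets($keys, $buckets);
printf("expected used: %.0f of %d (%.1f%%)\n", $used, $buckets, 100 * $used / $buckets);
```

Under that model, roughly three quarters of 4096 slots would already hold at least one of 6000 keys, so it makes a handy yardstick to compare real numbers against.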

Thanks.

  • drmike

    Damn, he beat me. :slight_smile:

    [link removed - drmike]

    It's hard coded to display a million blogs with 4096 databases and you should be able to edit it.

    Download a copy, change the extension to php and put it somewhere up on a server to mess with it.

    edit: Don't forget that you'll be deleting blogs left and right from spammers, folks who leave and all those test blogs that Luke likes to set up. It'll throw off the numbers.

  • ZappoMan

    Here's a slightly modified version of Dr. Mike's script that takes params for the number of blogs (blogcount) and the amount to split (split=1,2,3).

    It also displays the buckets, even if they have 0 hashes assigned to them, and then displays the number of unused, underutilized (<40% of expected), and overutilized (>160% of expected) buckets.

    I've run this on a couple of blog counts: after about 40,000 blogs you end up with a very flat distribution, and by 100,000 blogs it's completely flat.


    <?php
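    // Request parameters: cast to int and clamp so that bad input cannot
    // cause a division by zero or a bucket key that was never created.
    $blogcount = isset($_REQUEST["blogcount"]) ? max(1, (int) $_REQUEST["blogcount"]) : 1000000;
    $split = isset($_REQUEST["split"]) ? min(3, max(1, (int) $_REQUEST["split"])) : 3;
    $display_work = isset($_REQUEST["display_work"]);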

    $hashes = pow(16,$split);
    // evenly distributed = $blogcount/$hashes
    $evenDistro = $blogcount/$hashes;

    echo "hashes=$hashes evenDistro=$evenDistro<br/>\n";

    $count = 0;

    $databasecouunt = array();

    $hex = Array ('0','1','2','3','4','5','6','7','8','9','a','b','c','d','e','f');
    // Pre-create every bucket so unused ones still show up with a count of 0.
    switch ($split)
    {
        case 3:
            foreach ($hex as $val1)
                foreach ($hex as $val2)
                    foreach ($hex as $val3)
                    {
                        $db_hash = $val1.$val2.$val3;
                        $databasecouunt[$db_hash] = 0;
                    }
            break;
        case 2:
            foreach ($hex as $val1)
                foreach ($hex as $val2)
                {
                    $db_hash = $val1.$val2;
                    $databasecouunt[$db_hash] = 0;
                }
            break;
        case 1:
            foreach ($hex as $val1)
            {
                $db_hash = $val1;
                $databasecouunt[$db_hash] = 0;
            }
            break;
    }

    if ($display_work)
        echo "start setup count<br/>\n";

    // Hash exactly $blogcount IDs (0 .. $blogcount - 1) and tally the bucket
    // named by the first $split hex characters of each md5.
    while ($count < $blogcount)
    {
        $md5number = substr(md5((string) $count), 0, $split);

        $databasecouunt[$md5number]++;

        if ($display_work)
            echo "count - " . $count . " md5number - " . $md5number . " how many so far - " . $databasecouunt[$md5number] . "<br/>\n";

        $count++;
    }

    if ($display_work)
        echo "end setup count<br/>\n";

    // Sort the damn thing

    ksort($databasecouunt);

    echo "display data<br/>\n";

    $underUsed=0;
    $overUsed=0;
    $notUsed=0;

    foreach ($databasecouunt as $key => $value)
    {
        // Ratio of this bucket's count to a perfectly even share (1.0 = even).
        $ratioToEven = $value / $evenDistro;
        echo "Database - $key - $value ($ratioToEven)<br>\n";

        // Test the integer count for emptiness rather than the float ratio.
        if ($value == 0)
            $notUsed++;
        else if ($ratioToEven < 0.40)
            $underUsed++;
        else if ($ratioToEven > 1.60)
            $overUsed++;
    }
    echo "<h1>Outliers</h1>\n";
    echo "unused hashes=$notUsed<br/>\n";
    echo "under utilized hashes=$underUsed<br/>\n";
    echo "over utilized hashes=$overUsed<br/>\n";
    ?>
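
For anyone who would rather run this from the command line or call it from other code, here is a minimal function-based sketch of the same idea. The names `bucket_counts` and `summarize_buckets` are hypothetical and not part of multi-db. It assumes sequential blog IDs starting at 1 and builds the bucket list with `dechex`, so it is not limited to split values of 1 to 3.

```php
<?php
// Hypothetical helper: count how many of the IDs 1..$blogCount land in each
// bucket named by the first $split hex characters of md5($id).
function bucket_counts(int $blogCount, int $split): array
{
    $buckets = 16 ** $split;
    $counts = [];
    // Pre-create every bucket so empty ones are reported with a count of 0.
    for ($i = 0; $i < $buckets; $i++) {
        $counts[str_pad(dechex($i), $split, '0', STR_PAD_LEFT)] = 0;
    }
    for ($id = 1; $id <= $blogCount; $id++) {
        $counts[substr(md5((string) $id), 0, $split)]++;
    }
    return $counts;
}

// Hypothetical helper: classify buckets with the same thresholds as the
// script above (unused, under 40% of an even share, over 160% of it).
function summarize_buckets(array $counts, int $blogCount): array
{
    $even = $blogCount / count($counts);
    $summary = ['unused' => 0, 'under' => 0, 'over' => 0];
    foreach ($counts as $n) {
        if ($n === 0) {
            $summary['unused']++;
        } elseif ($n < 0.40 * $even) {
            $summary['under']++;
        } elseif ($n > 1.60 * $even) {
            $summary['over']++;
        }
    }
    return $summary;
}

// Example: 6000 blogs spread over 256 buckets (split = 2).
print_r(summarize_buckets(bucket_counts(6000, 2), 6000));
```

Saved as a file, it can be run with `php file.php`. Swap in real blog IDs instead of a sequential range if you want to account for deleted blogs.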