WPMU randomly freezing OSX server(s) — no errors

Alright — this is a big one, and really driving me crazy.

My ENTIRE WPMU (2.8.4a) server is crashing, and the error logs show nothing: no errors, no system errors. Just a locked-up system that has to be hard reset. Up until a few days ago, it was happening about once every 3 weeks. Now, it’s been every 1-2 days.

I know I’m not offering much, but I don’t have much to go on, either.

We moved the entire WPMU instance onto a new, identical box and had the same issues (we wanted to make sure it wasn’t hardware related).

We are running on a Mac OS X server with Apache, and have used this setup for over a year with 0 issues until the past couple of months.

I’m desperate and would love some suggestions for things to begin trying.

I already started removing any administrator plugins I didn’t absolutely need, and even uninstalled some useful ones like Super Cache. The themes I created do not rely on any plugins for displaying content except Viper’s Video Quicktags and a contact form plugin used on only 2 sites (out of 50).

Traffic is pretty low, only about 4-5k every day across all sites.

I wish I had a trail of errors to follow, but the logs have been clean.

Thanks for any help you can offer!

  • cmorales954
    • Flash Drive

    mu-plugins:

    AHP Sitewide Recent Posts for WordPress MU 0.6.1

    cets_blog_defaults.php 1.2.4

    cets_simple_dashboard.php 1.3.2

    domain_mapping.php 0.4.3 (donncha)

    More Privacy Options 2.9.1

    Toggle Admin Menus 2.5.2

    LDAP Authentication Plug-in 2.8.2

    Listem .1

    WPMU Plugin Manager 1.4.1

    WordPress Mu Google Analytics (by Rafik)

    Viper’s Video Quicktags 6.2.6

    52 blogs, 183 users

    We aren’t seeing any consistency in regards to who is accessing us before the server dies (the last traces in the access logs are ‘normal’ ‘random’ page hits, and nothing seems repetitive).

  • James Farmer
    • CEO (of WPMU DEV, honest)

    I reckon it’d have to be an infinite loop or similar within one of those plugins then.

    Can you try replacing AHP Sitewide Recent Posts for WordPress MU 0.6.1 with our sitewide posts plugins?

    Also, can you bring up the Apache server status, like the below?

    0-1 28783 0/48/121826 _ 41.26 0 12 0.0 0.19 5774.24 65.55.51.17 k12onlineconference.org GET /docs/k12online2007schedule.html HTTP/1.1

    1-1 - 0/0/117370 . 1.08 14 0 0.0 0.00 5513.00 127.0.0.1 server2.edublogs.org OPTIONS * HTTP/1.0

    2-1 29546 0/62/120599 _ 14.46 1 253 0.0 9.95 5164.98 60.216.154.219 blogs.mu GET /wp-admin/options-general.php HTTP/1.1

    3-1 29917 1/33/115234 C 7.88 0 0 0.0 0.28 5223.81 127.0.0.1 server2.edublogs.org OPTIONS * HTTP/1.0

    4-1 28962 0/18/114728 W 1.57 277 0 0.0 0.01 5384.39 69.86.59.99 edublogs.tv GET /uploads/audio/ZzBUCqxRxr0bYwm3SiFT.mp3 HTTP/1.1

    5-1 29897 0/34/112364 _ 18.61 1 0 0.0 0.29 5324.19 65.55.51.17 k12onlineconference.org GET /robots.txt HTTP/1.1

    6-1 - 0/0/114011 . 16.99 4 0 0.0 0.00 5467.95 127.0.0.1 server2.edublogs.org OPTIONS * HTTP/1.0

    7-1 - 0/0/108506 . 1.37 1 0 0.0 0.00 4708.70 127.0.0.1 server2.edublogs.org OPTIONS * HTTP/1.0

    8-1 - 0/0/101603 . 5.24 21 0 0.0 0.00 4587.12 127.0.0.1 server2.edublogs.org OPTIONS * HTTP/1.0

    9-1 - 0/0/101670 . 2.19 22 0 0.0 0.00 4064.49 127.0.0.1 server2.edublogs.org OPTIONS * HTTP/1.0

    That might give you some ideas?

    (I get that via WHM).
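A rough sketch of how rows like the ones above could be summarised, in case it helps. Unlike the access log, which records a request once it has finished, the status page is a live snapshot of what each worker is doing right now, so a request that hangs may show up here without ever reaching the access log. The column order assumed below (Srv, PID, Acc, M, CPU, SS, Req, Conn, Child, Slot, Client, VHost, Request) is the standard mod_status extended layout; the function names and the 60-second threshold are made up for illustration.

```python
from collections import Counter
from typing import List, NamedTuple, Optional

# Rows copied from the status output quoted in this thread.
SAMPLE = """\
0-1 28783 0/48/121826 _ 41.26 0 12 0.0 0.19 5774.24 65.55.51.17 k12onlineconference.org GET /docs/k12online2007schedule.html HTTP/1.1
1-1 - 0/0/117370 . 1.08 14 0 0.0 0.00 5513.00 127.0.0.1 server2.edublogs.org OPTIONS * HTTP/1.0
3-1 29917 1/33/115234 C 7.88 0 0 0.0 0.28 5223.81 127.0.0.1 server2.edublogs.org OPTIONS * HTTP/1.0
4-1 28962 0/18/114728 W 1.57 277 0 0.0 0.01 5384.39 69.86.59.99 edublogs.tv GET /uploads/audio/ZzBUCqxRxr0bYwm3SiFT.mp3 HTTP/1.1
"""


class Worker(NamedTuple):
    slot: str
    pid: Optional[int]   # None for an open slot with no process
    mode: str            # "_" waiting, "W" sending reply, "." open slot, ...
    seconds_since: int   # SS: seconds since the most recent request began
    client: str
    vhost: str
    request: str


def parse_worker(line: str) -> Worker:
    """Parse one scoreboard row; the request keeps its internal spaces."""
    parts = line.strip().split(None, 12)
    if len(parts) < 13:
        raise ValueError(f"unexpected status row: {line!r}")
    srv, pid, _acc, mode, _cpu, ss, _req, _conn, _child, _slot, client, vhost, request = parts
    return Worker(srv, int(pid) if pid.isdigit() else None, mode, int(ss), client, vhost, request)


def parse_status(text: str) -> List[Worker]:
    """Parse every non-blank row of a pasted status table."""
    return [parse_worker(line) for line in text.splitlines() if line.strip()]


def mode_counts(workers: List[Worker]) -> Counter:
    """How many workers are in each mode (a pile-up of "W" is suspicious)."""
    return Counter(w.mode for w in workers)


def busy_longer_than(workers: List[Worker], seconds: int) -> List[Worker]:
    """Workers that are neither idle nor empty and have been busy a while."""
    return [w for w in workers if w.mode not in ("_", ".") and w.seconds_since >= seconds]
```

Calling `busy_longer_than(parse_status(SAMPLE), 60)` picks out the long-running `W` worker from the rows above. Whether a long-running request is actually a problem depends on what it is serving; a large audio download can legitimately take minutes.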

  • Ovidiu
    • Code Wrangler

    Had a similar issue, but running Debian: the server was completely inaccessible. Connecting via my serial console, I saw memory was chock-full, swap was at 100%, and response times were minutes instead of seconds, so I had to hard-restart.

    What I am trying to say is you should install a system monitoring tool like Nagios or whatever you prefer, and if it happens again, check memory and swap usage before the crash. It’ll give you a starting point.
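If a full monitoring tool is more than you want to set up right away, a small sampler can at least leave a trail. The sketch below is only an illustration, not a replacement for Nagios: it runs a list of commands at a fixed interval and appends their output to a file, calling `os.fsync` after every write so the most recent sample has a better chance of being on disk when the box is hard reset. The two default commands (`vm_stat` and `sysctl vm.swapusage`) are an assumption for Mac OS X; the function names, log path, and interval are hypothetical.

```python
import os
import subprocess
import time
from datetime import datetime

# Assumed Mac OS X commands for memory and swap; substitute your platform's own.
DEFAULT_COMMANDS = [
    ["vm_stat"],
    ["sysctl", "vm.swapusage"],
]


def run_command(cmd, timeout=10):
    """Run one command and return its output, or a marker if it could not run."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
        return result.stdout + result.stderr
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"[failed: {exc}]\n"


def take_snapshot(commands):
    """Build one timestamped block of text containing each command's output."""
    lines = [f"=== {datetime.now().isoformat(timespec='seconds')} ==="]
    for cmd in commands:
        lines.append("$ " + " ".join(cmd))
        lines.append(run_command(cmd).rstrip())
    return "\n".join(lines) + "\n"


def append_durably(path, text):
    """Append text and ask the OS to push it to disk before returning."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(text)
        fh.flush()
        os.fsync(fh.fileno())


def monitor(path, commands=DEFAULT_COMMANDS, interval=10, iterations=None):
    """Take a snapshot every `interval` seconds; run forever if iterations is None."""
    count = 0
    while iterations is None or count < iterations:
        append_durably(path, take_snapshot(commands))
        count += 1
        if iterations is None or count < iterations:
            time.sleep(interval)
```

Something like `monitor("/var/log/freeze-watch.log", interval=10)` would sample every ten seconds (the path is just an example). After the next freeze, the last few blocks in the file show what memory and swap looked like right before it.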

  • cmorales954
    • Flash Drive

    James: for now, I’ll go one step further and just remove that plugin. It was only used on the main site page, and I bet mostly by me to monitor activity. I can live with it off for now.

    I wouldn’t be able to bring up that server status info while the breakdown is occurring, but looking at it now, it almost seems like an Apache access log. Are they the same thing?

    Ovidiu: Did you end up isolating the problem? Any fix or resolution? I will suggest the monitoring to our server admin on the next crash.

  • svteg
    • Flash Drive

    Had the exact same problem. It all started when a pretty big weblog (over 1400 posts and 3000 visitors a day) moved from Blogger to my WPMU install. I searched through all the log files, over and over again, but found no clear reason for the freezing. It was very random. I solved it by adding an extra 1 GB of RAM. The server has been online for 7 days now without any reboot. Before that upgrade I had to reboot the server 2 times a day. It’s the solution for now; I don’t know if I solved the real problem.

  • Ovidiu
    • Code Wrangler

    @cmorales954

    To help solve your problem, you first have to figure out what exactly the problem is. The server crashing is the symptom, not the problem.

    Again, I suggest installing monitoring software. It would help answer half of these questions; the other half you can answer by checking your configuration.

    Otherwise, can you tell me what your swap usage was before the crash? How many Apache processes? What’s your Apache configuration? Max clients? Allowing persistent connections? Timeouts? PHP configuration? I’m interested in all the values inside php.ini. What about your MySQL configuration? Max connections? Timeouts? Allowing persistent connections? How many MySQL connections did you have right before the crash? How much RAM does the server has available in total? Any other services running on the machine? Can you give a list of processes running right before the crash?
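For the "how many Apache processes" and "list of processes" questions, here is one possible way to tally a saved process listing by executable name. It assumes the listing came from something like `ps -axo comm` (a header line followed by one command per line); that command choice, the function name, and the sample text are illustrative and not taken from the server in this thread.

```python
import os
from collections import Counter

# Made-up example of what a saved `ps -axo comm` listing might look like.
ILLUSTRATIVE_PS_OUTPUT = """\
COMM
/sbin/launchd
/usr/sbin/httpd
/usr/sbin/httpd
/usr/sbin/httpd
/usr/local/mysql/bin/mysqld
"""


def count_processes(ps_output: str) -> Counter:
    """Count processes per executable name, skipping the header line."""
    counts = Counter()
    for line in ps_output.splitlines()[1:]:
        command = line.strip()
        if command:
            # Keep only the last path component so /usr/sbin/httpd counts as httpd.
            counts[os.path.basename(command)] += 1
    return counts
```

Comparing `count_processes(text)["httpd"]` from the last listing before a freeze against the configured Apache client limit would show whether the server was running out of workers at that moment.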

  • cmorales954
    • Flash Drive

    Well.. I hope I am not posting this prematurely, but the crashes have stopped for over 3 weeks.

    The only thing I did this time was to remove an extra file, a blogs.php that was in the /wp-admin/ folder. I believe it was left over from an old update.

    …I am sure that as I type this, the server is going to crash :)

    I don’t understand how that could have caused these crashes, however.

  • cmorales954
    • Flash Drive

    Andrew,

    I actually agree with you, it’s just that I hadn’t changed anything else since the last batch of crashes. It’s been over a month now, whereas before it was occurring every 2-3 days.

    Maybe the sysadmin changed something and I don’t know about it :)

  • fiddyp
    • Site Builder, Child of Zeus

    I get the same thing happening to me now and again: it’ll go for 14 days without a problem and then crash with 100% CPU and a load of about 60-70, requiring a reboot.

    I didn’t see anything in the logs that pointed to it. The long queries log was empty, and the maximum connections in the hour before and during the problem didn’t go over what had already been handled with no problem.

    My hosting provider has enabled a lot more logging to see what happens when/if it happens again, but reading what you said about blogs.php, I might just remove that just in case! :)

    I’ll post again if the nice folks at UKFast find out what is causing the problem.

  • cmorales954
    • Flash Drive

    ..well, we were crash-free for 2 months. I hadn’t TOUCHED the install, and the crashes started back up again.

    At this point, my theory is some kind of attack, perhaps an exploit.

    We enabled a more rigorous firewall (didn’t work, so it’s coming in on standard ports).

    We had top dump all the processes every minute to see how performance was leading up to the crash, and within 30 seconds of a crash happening, top does not show anything out of the ordinary (normal loads).

    It’s like a button is getting pressed, with no build-up.
