[Uptime monitor] Monitoring seems to trigger e-mails when it shouldn't

Dear all,

I've had a chat with my hosting provider about the many downtimes I've had across all my websites.

They tell me that the issue is related to your services, and that the websites are in fact up and running 99% of the time.

Here is the chat transcript:

Dian Tue, 09/18/18 11:38:49 am Europe/Paris

Hi Gregoire, how can I help you?

Gregoire 11:38:59 am

Hello Dian
All my websites are down 11:39:03 am

Dian 11:39:04 am

Yes, it is a temporary issue with the UK server, we are working on the fix right now.

Gregoire 11:39:14 am

It happens a lot lately
I use an uptime monitor
and I keep receiving alters
alerts
at least 5 per day, for each website

Dian 11:41:42 am

The issue started just now, so the websites were up on previous days. Could you send us the monitor results you've had so we can take a look?

Gregoire 11:43:34 am

Here is an example :
https://i.postimg.cc/pTRzJmpd/Capture.png

Dian 11:46:39 am

Everything should be back to normal now.

Gregoire 11:54:32 am

How can we make sure that uptime is better in the future ?
88% over the last 30 days isn't good enough

Dian 11:56:35 am

I believe that the uptime monitor you use is returning false positives; the real uptime is around 99% over the last 30 days.
Your website has been running without issues for the past 30 days; today is the first time that the server has been down, for around 15 minutes. 11:58:13 am

Gregoire 11:59:17 am

How do you explain that the monitoring gives different results ?

Dian 12:01:24 pm

It is possible that the monitor pings the website so often that ModSecurity blocks its IP for 15 minutes as a security measure. Another possibility is that the website returns a local error, which can cause this result. If you see downtime again, please contact us as soon as possible so we can check the website immediately. We've checked the website and everything looks good now.

Gregoire 12:05:25 pm

OK, I'll have a word with WPMU about the monitoring
have a nice day

How do you feel about this issue?
Regards,
Greg

  • Adam Czajczyk

    Hello Greg

    I hope you're well today and thank you for your question!

    I think I should start with a bit of explanation about how this works. Our Uptime service tries to contact your site by requesting only the HTTP headers (similar to running "curl -I" on the command line). That should return an HTTP status code.

    If it returns 2xx (successful) or some proper redirect (3xx) that's fine and the site's considered up. If it returns some error code, it's considered down. That's the basic check.

    However, there's also a time limit, as we cannot wait for a response indefinitely. I believe it's 20 seconds, though I'm not sure; it might recently have been raised to 30 seconds. Either way, if the site responds too slowly and doesn't fit in that timeframe, it will be considered down as well - even though it really is up. That actually makes sense, because from a visitor's perspective the site might as well be down: most visitors wouldn't wait that long.
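    The check described above can be sketched roughly as follows. This is only an illustration, not the actual Uptime code: the function names are made up, and the 30-second limit is an assumption based on the uncertain figure mentioned above.

```python
import urllib.error
import urllib.request

# Assumption: the real limit may be 20 or 30 seconds.
TIMEOUT_SECONDS = 30


def is_up(status):
    """Treat 2xx and 3xx as up; any other status, or no response, as down."""
    return status is not None and 200 <= status < 400


def head_status(url, timeout=TIMEOUT_SECONDS):
    """Send a headers-only (HEAD) request and return the HTTP status code.

    Returns None when no response arrives in time or the connection fails.
    Note that urllib follows redirects itself, so the status returned here
    is usually that of the final page rather than the 3xx response.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 4xx and 5xx responses are raised as HTTPError; keep the code.
        return error.code
    except OSError:
        # Covers timeouts, DNS failures and refused connections.
        return None
```

    Calling is_up(head_status(url)) for a site's URL would then report a slow site as down even though the server itself is running, which is exactly the situation described above.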

    Now, on top of that we have to account for network latency. You can run a simple experiment with the Pingdom Tools service at https://tools.pingdom.com/ and check your site a couple of times, each time selecting a different location from the "Test from" drop-down list. You will notice that the load time can vary significantly depending on the location. That can also affect the result of the check in relation to the aforementioned time limit.

    Then, your host might actually be right: if for some reason Uptime detects that the site is down, it checks it again and again to see whether it has come back up, so it is possible that this triggers some additional temporary "blacklist" on the server. They're also right about possible errors: for example, the site can be online in the sense that the server is up and should be serving it, but something caused a temporary overload that ate up resources and resulted in a time-out or a 500 or 503 (or similar) error.

    So, in that case I'd suggest getting back to your host and asking them if they could:

    1. Permanently whitelist these two IPs so they don't trigger any "bans" (via mod_security or other security tools on the server):

    34.196.51.17
    52.57.5.20

    2. Check the server logs for you to see whether any "resource limits being exhausted" events line up with the times that Uptime reported downtime.

    The first step might actually fix this permanently, while the second one could give us better insight into this case.
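    If your host can share timestamps from their logs, the correlation in the second step could be done along these lines. This is a minimal sketch: the function name, the (timestamp, message) input format and the 5-minute window are all assumptions for illustration, not something your host or Uptime provides in this form.

```python
from datetime import timedelta


def entries_near_downtime(downtimes, log_entries, window_minutes=5):
    """Group log messages by the reported downtime they fall close to.

    downtimes: datetimes at which the monitor reported the site as down.
    log_entries: (datetime, message) pairs taken from the server logs.
    window_minutes: assumed tolerance around each downtime report.
    """
    window = timedelta(minutes=window_minutes)
    matches = {}
    for down in downtimes:
        matches[down] = [
            message
            for stamp, message in log_entries
            if abs(stamp - down) <= window
        ]
    return matches
```

    A downtime report with an empty list next to it would suggest that nothing unusual was logged on the server at that moment, which points more towards a blocked IP or a network issue than towards exhausted resources.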

    Best regards,
    Adam
