cbrandtbuffalo has asked for the wisdom of the Perl Monks concerning the following question:

As part of our application monitoring, we have scheduled jobs that run WWW::Mechanize scripts to check our web apps. They run fairly frequently, some every 10 minutes.

This past Sunday, one of our monitoring scripts sent email with a "500 SSL read timeout" error twice, a few hours apart. I tracked it down in the code, and it appears WWW::Mechanize uses LWP, which uses Crypt::SSLeay for https. Crypt::SSLeay includes a module called Net::SSL, and this is where the error came from. It appears to be a timeout on the read, and the node "500 SSL read timeout" supports that.

Question: Can anyone speculate on what might cause this error? I just want to have an idea where to look if it crops up again. It has happened so infrequently (this is the first time I've seen it in several years), I'm hesitant to chalk it up as general internet slowness.

Some thoughts I had:

We have a typical Apache/mod_perl configuration with a front proxy server and a back app server. SSL takes place on the front. We have Sun boxes with dedicated crypto-cards, if that makes a difference.

Thanks.

Re: Net::SSL SSL read timeout
by jhourcle (Prior) on Dec 20, 2005 at 13:49 UTC

    Do you have any other monitoring?

    Typically, when I'm doing monitoring, I don't just look for yes/no answers, but I try to find things that might be a sign of trends -- in a case like this, I might look at the total time that it took to get the page ... that way, I can graph it in MRTG (or whatever your favorite graphing program is).

    If I saw an increased response time around the timeout, I'd look to see what the load was on the machine, and assume that the monitoring wasn't at fault ... if it was an isolated failure, then I'd have to look at the monitoring from end to end.
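    Recording the total fetch time described above could be sketched roughly as follows. This is only an illustration: `timed_call` is a hypothetical helper name, the short sleep stands in for the real request, and the output format is an assumption rather than anything MRTG requires.

```perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# Hypothetical helper: run a code reference and return
# (elapsed_seconds, result_of_the_code).
sub timed_call {
    my ($code) = @_;
    my $start  = [gettimeofday];
    my $result = $code->();
    # With a single argument, tv_interval measures from $start until now.
    return ( tv_interval($start), $result );
}

# Stand-in for the real check. In a monitoring script the code reference
# would wrap the actual page fetch instead of a 50 ms sleep.
my ( $elapsed, $result ) = timed_call(
    sub { select( undef, undef, undef, 0.05 ); return 'ok' }
);

# One line per run, easy to append to a log and graph later.
printf "%d elapsed=%.3f result=%s\n", time, $elapsed, $result;
```

    Logging the elapsed time on every run, not just on failures, is what makes a slow upward trend visible before a timeout is ever hit.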

    (it's like the 'check engine' light on your car -- I'd much rather have a series of gauges, so I had some earlier warnings that something's trending upwards, rather than some light turning on once it hits a threshold)

    Anyway, I'd try to look at whatever other logs you might have, as there might be anomalies that would indicate what went wrong, or at least hint at where to look.

      I thought of many of the same things when I first started looking at this.

      The first logs I looked at were the apache logs, but this really isn't a 500 error, so it is nearly impossible to put the request together with anything on the apache side.
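      One way to tell this kind of client-side 500 apart from a 500 that Apache actually sent is sketched below. It relies on LWP marking responses it generates itself with a `Client-Warning: Internal response` header. The helper name `is_client_side_error` is hypothetical, and the two responses are built by hand so the sketch runs without a network connection.

```perl
use strict;
use warnings;
use HTTP::Response;

# Hypothetical helper: true when LWP itself generated the response
# (timeout, failed connect, ...) instead of receiving it from the server.
sub is_client_side_error {
    my ($res) = @_;
    my $warning = $res->header('Client-Warning') // '';
    return $warning eq 'Internal response' ? 1 : 0;
}

# Hand-built responses standing in for real ones.
my $timeout = HTTP::Response->new( 500, 'SSL read timeout' );
$timeout->header( 'Client-Warning' => 'Internal response' );
my $server_error = HTTP::Response->new( 500, 'Internal Server Error' );

for my $res ( $timeout, $server_error ) {
    printf "%s => %s\n", $res->status_line,
        is_client_side_error($res) ? 'client side' : 'from server';
}
```

      If the check reports a client-side error, there is probably no matching 500 entry to find in the Apache logs.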

      Unfortunately, this monitoring script doesn't record response time. That's something we need to add because, as you rightly point out, this would be a good extra piece of information. However, in this case I assume it would be 60 seconds since that's the timeout.

      Correlating with other data is also difficult because we have a farm-type architecture. Other monitoring scripts may or may not be hitting the same servers as this one because we hit the front door.

      As far as other logs go, the only thing I could think of was the SSL logs.

      Thanks for the suggestions.

        I had the same "500 SSL read timeout" problem using LWP and Crypt::SSLeay. Adding a "timeout => XX" value sorted this out. It was simply a step in a process which took slightly longer to return than the previous requests, which had all succeeded.
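        A minimal sketch of setting that timeout on an LWP::UserAgent object follows. The values 120 and 90 are arbitrary illustrations, not numbers taken from this thread.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Set the timeout (in seconds) when the user agent is created.
# 120 is an arbitrary example value.
my $ua = LWP::UserAgent->new( timeout => 120 );
printf "timeout is now %d seconds\n", $ua->timeout;

# The value can also be changed later on an existing object.
$ua->timeout(90);
printf "timeout is now %d seconds\n", $ua->timeout;
```

        Since WWW::Mechanize is a subclass of LWP::UserAgent, the same `timeout` option should apply to a Mechanize object as well.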