Cody Pendant has asked for the wisdom of the Perl Monks concerning the following question:

This is really a question about web servers, but it relates to Perl so I know you'll forgive me.

I have a Perl script which goes off to various websites and aggregates content, suitably HTML-munged, for the Palm Pilot.

Another script, over which I have no control, then hits on the page created and downloads it to the Palm (AvantGo, for those who know).

I was doing it in this way (pseudocode):

    for each page-I-want-to-get {
        get it using LWP::Simple, or skip if error
        munge the HTML
        write it out to a file
    }
    then write out a new "index.html" file with links to all the files rewritten at the previous step.

What I would do is manually run the script that did all that, and then the other script would come by and hit on the "index.html" file created by it. Which is inefficient. So I re-wrote it to get the HTML, write the pages and output the links to them all in one.

Then the other script started saying that my script was "idle too long" and timed out when accessing it.

So I wrote it another way:

    for each page-I-want-to-get {
        check it's up using LWP::Simple's head(), or skip if error
    }
    output a new "index.cgi" page with links to all the files which are *about* to be created from the pages which passed the test above
    then go and actually get the pages' HTML, munge it and write the files

Now the other script was happy and my script wasn't timing out on it.

Then I thought -- that's not logical. My script takes the same time to run, in fact more. Just because it stops outputting text/html to the other script at an earlier point, doesn't mean it actually takes less time to run.

How does a web server "know" when a CGI script is finished, in other words? It surely isn't just when it sends

print '</body></html>';

And what's the effect of putting $| = 0; at the top of my script? I thought that would have it output the header and top of the page, just to "keep the other script interested" kind of thing, before it output the main parts of the content, but that didn't seem to help either.

--
Weaselling out of things is important. It's what separates us from the animals ... except the weasel.

Replies are listed 'Best First'.
Re: Perl Output/Web Server Question
by dws (Chancellor) on Jan 27, 2002 at 02:44 UTC
    How does a web server "know" when a CGI script is finished, in other words? It surely isn't just when it sends ... '</body></html>'

    The CGI is finished when it exits. (Or, in rarer cases, when it closes STDOUT.) Some web servers are configured to time out if a script runs too long. ISPs do this for protection. It doesn't look like this is what you're running into.

    The question that might get you closer to your problem is

    How does the client know when a CGI script is finished?
    The answer here is that a client, which can be a browser or an LWP script, can't easily tell whether the page being fetched is being generated dynamically or not. (It's possible, but not easy.) But the client can detect that either (a) the number of bytes specified in the HTTP "Content-Length" response header have been read, or (b) that the server has closed the socket. What you're probably seeing is the result of the client timing out before (b) is satisfied: "The web server hasn't sent me anything in the last n seconds, so it's probably borked."
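To make (a) and (b) concrete, here is a rough sketch of how a very simple client might decide that a response body is complete. The read_body routine and the lowercased header hash are made up for illustration, in-memory filehandles stand in for sockets, and a real client such as AvantGo's would also apply an idle timer that this sketch leaves out.

```perl
use strict;
use warnings;

# Read a response body from $fh the way a simple client might.
# $headers is a hash ref of already-parsed response headers with
# lowercased names (a hypothetical structure for this sketch).
sub read_body {
    my ($fh, $headers) = @_;
    my $body = '';

    if (defined(my $len = $headers->{'content-length'})) {
        # Case (a): stop once the advertised number of bytes has arrived.
        while (length($body) < $len) {
            my $got = read($fh, $body, $len - length($body), length($body));
            last unless $got;    # EOF or error before the full length
        }
        return ($body, 'content-length');
    }

    # Case (b): no length given, so keep reading until the server closes.
    while (1) {
        my $got = read($fh, $body, 4096, length($body));
        last unless $got;        # EOF: the other end has closed
    }
    return ($body, 'eof');
}

# In-memory filehandles stand in for sockets here.
my $data_with_len = "hello worldEXTRA";
open my $with_len, '<', \$data_with_len or die $!;
my ($body1, $why1) = read_body($with_len, { 'content-length' => 11 });
print "$why1: $body1\n";

my $data_no_len = "streamed page";
open my $no_len, '<', \$data_no_len or die $!;
my ($body2, $why2) = read_body($no_len, {});
print "$why2: $body2\n";
```

A CGI script that prints its output as it goes usually cannot announce a length in advance, so the client is left waiting for case (b), which only happens once the script exits or closes STDOUT.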

    A solution to this is to slowly dribble some content out to the client. Have your index.cgi disable buffering, emit some content immediately, then for each link you check that succeeds, emit that link immediately. This should keep the client from timing out.
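Something along these lines is what the dribbling approach could look like. It is only a sketch: emit_index, the page list and the stub check are invented for the example, and in a real index.cgi the check would more likely wrap LWP::Simple's head().

```perl
use strict;
use warnings;
use IO::Handle;

# Write the index to $out one link at a time, flushing as we go.
# $is_up is a code ref that returns true when a URL looks reachable;
# in a real CGI it might wrap LWP::Simple's head().
sub emit_index {
    my ($out, $is_up, @pages) = @_;
    $out->autoflush(1);                       # same effect as $| = 1 on this handle

    print {$out} "Content-type: text/html\n\n";
    print {$out} "<html><body>\n";            # something for the client right away

    my @passed;
    for my $page (@pages) {
        next unless $is_up->($page->{url});   # skip pages that fail the check
        print {$out} qq{<a href="$page->{file}">$page->{title}</a><br>\n};
        push @passed, $page;
    }

    print {$out} "</body></html>\n";
    return @passed;                           # pages still to be fetched and munged
}

# Hypothetical page list and a stub check so the sketch runs without a network.
my @pages = (
    { url => 'http://example.com/news',  file => 'news.html',  title => 'News'  },
    { url => 'http://example.com/down',  file => 'down.html',  title => 'Down'  },
    { url => 'http://example.com/sport', file => 'sport.html', title => 'Sport' },
);
my $stub_check = sub { $_[0] !~ /down/ };

my @todo = emit_index(\*STDOUT, $stub_check, @pages);
```

Each link goes out as soon as its check succeeds, so the gap between writes is at most the time one check takes rather than the time the whole loop takes.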

Re: Perl Output/Web Server Question
by maverick (Curate) on Jan 27, 2002 at 02:30 UTC
    There are a couple of different places it could time out.
    • The browser. IE waits about 30 seconds for a response before giving up.
    • The OS socket itself may have a timeout.
    • The webserver may have a timeout.
    But, the bottom line is, as long as your CGI outputs SOMETHING then these timeouts are reset...that's why your rewritten code works. It produces something sooner, thus eliminating the timeout.

    The webserver knows your script is done when it stops running :) not by any sentinel output.

    If you took your original script and changed it to

    $|++; # turn off output buffering
    for each page-I-want-to-get {
        print "getting page"
        get it using LWP::Simple, or skip if error
        print "done"
        munge the HTML
        write it out to a file
    }

    This should prevent the timeout from occurring.

    HTH

    /\/\averick
    perl -l -e "eval pack('h*','072796e6470272f2c5f2c5166756279636b672');"

Re: Perl Output/Web Server Question
by Aristotle (Chancellor) on Jan 27, 2002 at 03:56 UTC
    And what's the effect of putting $| = 0; at the top of my script? I thought that would have it output the header and top of the page, just to "keep the other script interested" kind of thing

    Oops :-) What you wanted is $|=1; which enables autoflushing the output buffer. What you did disables the effect you wanted to achieve. (Actually, it made no difference since the default is to buffer).
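For reference, a small sketch of what the two settings do to the $| variable. The variable names are just for the demo, and it assumes STDOUT is the currently selected handle, which is the normal case in a CGI script.

```perl
use strict;
use warnings;
use IO::Handle;

my $default = $|;          # normally 0: output to STDOUT is buffered
$| = 1;                    # autoflush the currently selected handle
my $enabled = $|;

$| = 0;                    # back to the default buffered behaviour
STDOUT->autoflush(1);      # like $| = 1, but names the handle explicitly
my $via_method = $|;       # STDOUT is the selected handle, so $| reflects it

print "default=$default enabled=$enabled via_method=$via_method\n";
```

Note that $| only controls Perl's own buffer for that handle. It does not stop a web server or a proxy in between from holding output back.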

    Makeshifts last the longest.

      Thanks all for your help.

      I think I made a couple of mistakes, or wasn't clear, in the first post.

      Yes, dws, the question is "what causes the client to time out", and the client is, unfortunately, the script being run by AvantGo that I have no control over.

      But in answer to your other points, I did have $|=1; not zero, (mistake in posting, not scripting) and I did have little bits of output dribbling out as the script progressed to try and keep the other server interested.

      My "when does a web server know the script is finished" question should have been formulated "my web server seems to be signalling somehow earlier to the AvantGo client, in the new version, that it's finished, which doesn't make sense because it actually finishes, if anything, later than the old version, it just stops outputting HTML earlier."


      --
      Weaselling out of things is important. It's what separates us from the animals ... except the weasel.

        Reading again, I see something I overlooked before. The message you got from the AvantGo client is idle too long - not a timeout per se. I am guessing that the AvantGo client is still timing out in a way, except it doesn't complain because by that time, your output is complete. Even if that's not entirely correct: I think the answer is in how that client works. It may make uncommon assumptions.

        Makeshifts last the longest.