ghettofinger has asked for the wisdom of the Perl Monks concerning the following question:

Hello all,

I am having some issues with a CGI script that I have written. It uses LWP::UserAgent to scrape webpages, then runs through each page using a regex to match content and puts that content into a hash. It also extracts links using HTML::SimpleLinkExtor and follows specified links to run the whole process over again. At the end it puts the information into a database.

This script works like a champ on a page with a small amount of content, but when it is presented with a large amount of content it chokes. This script is run on a web server using CGI. There are no errors in the apache logs, nor is any error presented to the screen.

I will spare everyone here from burning their eyes on my hobbled code, but I have a couple of questions about debugging in this kind of scenario. Does anyone use the strace command on the process id? I tried that. This is what I get:

    fcntl64(9, F_GETFL) = 0x2 (flags O_RDWR)
    fcntl64(9, F_SETFL, O_RDWR|O_NONBLOCK) = 0
    write(9, "GET /ublank/0___files/4155."..., 416) = 416
    select(16, [9], NULL, NULL, {180, 0}) = ? ERESTARTNOHAND (To be restarted)
    --- SIGTERM (Terminated) ---

This is really cryptic and starts to scare me after a while. Can anyone recommend maybe other ways of getting low level debugging output from a script? Does this sound like a memory issue? If so, could it be due to poorly written code, or maybe even buggy perl (5.6.1 on Debian)?

I appreciate everyone's help.

Take care,
ghettofinger

Replies are listed 'Best First'.
Re: Methods to debug CGI issues
by chb (Deacon) on Apr 20, 2005 at 06:35 UTC
    Is the webserver running under your control? Some ISPs put a time limit on running CGI scripts. If your script takes too long on a large amount of content, it may be killed off by the server (SIGTERM...). HTH, although it is not really a debugging hint...
      LWP::UserAgent also may have a timeout

        LWP::UserAgent does have a timeout, but I have increased it to 20 minutes. I have also run tcpdump on the interface, and the script is making connections in rapid succession with no errors. The TCP sessions are also shutting down cleanly.

Re: Methods to debug CGI issues
by polettix (Vicar) on Apr 20, 2005 at 08:31 UTC
    It's not a Perl-related problem; chb should be on the right path. Googling a bit, I see that this ERESTARTNOHAND has to do with the way signals interfere with system calls. Maybe you can unfog it a little by reading this.

    It's probable that the web server sets some alarm that intervenes during the select syscall, interrupting it and trying to make the whole stuff restart. Probably you'd better search more in Linux kernel mailing lists or fora to see if this problem (if there's a problem!) has been addressed in some recent version of the kernel.

    Flavio (perl -e "print(scalar(reverse('ti.xittelop@oivalf')))")

    Don't fool yourself.
Re: Methods to debug CGI issues
by eXile (Priest) on Apr 20, 2005 at 06:51 UTC
    You could use something like Devel::DProf to dump a profile of your program's run, and have 'dprofpp -T' print out the call tree of your program. That should give you some idea of what is happening.
Re: Methods to debug CGI issues
by kprasanna_79 (Hermit) on Apr 20, 2005 at 07:04 UTC
    Hi,
    Since you didn't see any errors in the error log, it's always good to trace the code.
    So use the following:
    use Data::Dumper; print STDERR Dumper($variable);

    The module Data::Dumper can be downloaded from CPAN. This is the easiest way to debug, I think, and it is what we use as well.
    Please excuse any typos...
    --prasanna.k
      Usually no need to download from CPAN, AFAIK Data::Dumper is a core module included in every perl installation.
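
The Data::Dumper suggestion above can be sketched a little more fully. This is only an illustration: `debug_dump` and the `%scraped` hash are made-up names standing in for whatever the real script uses, and `$Data::Dumper::Sortkeys` needs a reasonably recent Data::Dumper (older versions should simply ignore it).

```perl
use strict;
use warnings;
use Data::Dumper;

# Sorted keys and light indentation make successive dumps easier to compare.
$Data::Dumper::Sortkeys = 1;
$Data::Dumper::Indent   = 1;

# Hypothetical helper: label a dump and send it to STDERR, which under CGI
# normally ends up in the web server's error log.
sub debug_dump {
    my ($label, $data) = @_;
    my $text = sprintf "[%s] [pid %d] %s: %s",
        scalar localtime, $$, $label, Dumper($data);
    print STDERR $text;
    return $text;
}

# Example data standing in for the scraped content hash.
my %scraped = ( title => 'Example page', links => [ '/a', '/b' ] );
debug_dump('after scrape', \%scraped);
```

Calling the helper before and after each fetch gives a timestamped trail, so the last entry in the log shows roughly where the script was when it stopped.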
Re: Methods to debug CGI issues
by ropey (Hermit) on Apr 20, 2005 at 13:00 UTC

    Out of interest, why are you running this as a CGI rather than a simple script? Perhaps you could make a command-line script do the same thing and see if it has the same trouble. At least then you could see whether the problem is related to it being a CGI and some issue with the webserver, or to the code itself.
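
One way to try this without maintaining two copies of the code is to let the same file run in both modes. A minimal sketch, assuming the work can be wrapped in a single routine; `run_job`, `running_as_cgi`, and the placeholder URL are hypothetical names, not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the real scrape-and-store routine.
sub run_job {
    my ($start_url) = @_;
    return "processed $start_url";
}

# Web servers set GATEWAY_INTERFACE (for example "CGI/1.1") for CGI requests;
# when the script is started from a shell it is normally unset.
sub running_as_cgi {
    return defined $ENV{GATEWAY_INTERFACE} ? 1 : 0;
}

# From the command line, take the start URL as the first argument.
my $url = shift(@ARGV) || 'http://example.com/';    # placeholder URL

# Only emit an HTTP header when a web server is actually listening.
if (running_as_cgi()) {
    print "Content-Type: text/plain\n\n";
}
print run_job($url), "\n";
```

If the command-line run finishes on the large page and the CGI run does not, that points at the web server rather than at the code.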

Re: Methods to debug CGI issues
by sfink (Deacon) on Apr 20, 2005 at 16:30 UTC
    Looks to me like chb is exactly right. And for this problem, strace is precisely the right tool, if you know how to read the output. If not, then as usual the granddaddy of all Swiss Army Intercontinental Ballistic Missiles would be the right tool (you may know it as "Google").

    I am not Google, merely someone who can interpret strace output, so I'll break it down for you.

    • Your script sends a GET request to some web server using file descriptor 9.
    • It then waits up to three minutes (180 sec) for the response to come back on that file descriptor.
    • The select call is interrupted by something, which generally means a signal.
    • Sure enough, the next line shows a SIGTERM received.
    • There were no intervening syscalls, so your script didn't send the signal to itself. Something else must have.
    That something else is probably the web server, since it's generally the only thing that knows enough and cares enough about CGI processes to murder them. The next step is to look into its error log to see if it reported why it terminated your script -- but you did that and it didn't, which is kind of mysterious. Personally, I'd next strace the parent httpd of the script (run the script first, then attach to the parent you get in the process table with ps axf). But unless you're familiar with using strace output, that may just be a sure recipe for insanity.

    I'll echo ropey's comment and ask why you're running this as a CGI?

    As a last comment, try running
      strace perl -le '$v=""; vec($v,0,1)=1; select($v,undef,undef,180)'
    in one window and then
      kill pid
    in another (where pid is the pid of the perl script). You should see very familiar-looking strace output.
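
The same event can also be made visible from inside the script. The following is a rough sketch, not a fix: it installs a SIGTERM handler that writes a line to STDERR before the process goes away. Two caveats are assumptions worth checking on your own setup: signal handlers in Perl before 5.8 are not fully safe, and a server may follow up with SIGKILL, which cannot be caught.

```perl
use strict;
use warnings;

my $got_term = 0;

# Record the signal instead of dying silently. Under CGI, STDERR usually
# goes to the web server's error log. A real script would probably clean
# up and exit here rather than carry on.
$SIG{TERM} = sub {
    $got_term = 1;
    print STDERR "[pid $$] caught SIGTERM at ", scalar localtime, "\n";
};

# Demonstration only: send SIGTERM to ourselves.
kill 'TERM', $$;

# Short pause so the handler has certainly run before we report.
select(undef, undef, undef, 0.1);
print STDERR "handler ran: $got_term\n";
```

With a handler like this in place, a log line at the moment of death confirms that something outside the script sent the signal, which is what the strace output already suggests.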

      Your strace skills are awesome. You and chb are right on the money.

      I have never written any CGI that can take this long to send output to the browser. It seems that Apache will end the script if the amount of time reaches a threshold defined as Timeout in httpd.conf.

      I have increased the timeout and everything works great. I also learned about debugging CGI as well. Thanks everyone for your help!

      Thanks again,
      ghettofinger
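
An alternative to raising the limit is to keep output flowing while the script works. This is a sketch under an assumption: that the server's timeout is measured between pieces of output rather than over the whole run, which is worth checking against the documentation for your Apache version. The page list is made up for illustration.

```perl
use strict;
use warnings;

$| = 1;    # unbuffer STDOUT so each progress line leaves the script at once

print "Content-Type: text/plain\n\n";

# Hypothetical list of pages; the real script would get these from its
# link extractor.
my @pages = map { "page$_" } 1 .. 3;

my $done = 0;
for my $page (@pages) {
    # ... fetch and parse $page here ...
    $done++;
    print "processed $page ($done of ", scalar(@pages), ")\n";
}
print "finished\n";
```

A side benefit is that the browser shows how far the run got, which helps with debugging even when the timeout is not the problem.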

Re: Methods to debug CGI issues
by arc_of_descent (Hermit) on Apr 20, 2005 at 14:46 UTC

    You really should check out Log::Log4perl. It serves as a powerful logger and who says you can't use it for debugging too?