Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I'm writing a web spider using LWP::RobotUA and running it on WinXP (ActivePerl). When I retrieve certain webpages, the anchor parsing (which uses HTML::TokeParser) just dies - in the middle of printing out a progress report if I have tracing turned on, or later if not. I have installed handlers for all the signals, and the (single, common) handler is never invoked (though I have tried manually raising a few signals, and they seem to work fine). The failure seems to be 100% repeatable so far, stopping at the exact same character of printout in each case.
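
(For reference, the signal handling amounts to something like the sketch below - simplified, and the handler name is just illustrative, not my actual code:)

    # install one common handler for every signal Perl knows about on this build
    sub common_handler {
        my ($sig) = @_;
        print STDERR "caught SIG$sig\n";
        exit 1;
    }
    $SIG{$_} = \&common_handler for keys %SIG;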

I would suspect that memory allocation failure is at the root of this, but have no idea how to confirm that suspicion... the failing pages (5 so far, completely unrelated to each other) all seem fairly long - over 80k. However, most other, much longer (200k+) pages parse quite successfully.

Is there any known way to trap a memory allocation failure?

Is there any known way to trap any other normally silent failure? (and what might fall into this category?)
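
(To be concrete about what I mean by "trap": something along the lines of the sketch below - parse_anchors and $content are just stand-ins, not my real code - so that anything that quietly warns or dies at least leaves a trace:)

    $SIG{__WARN__} = sub { print STDERR "WARN: $_[0]" };
    $SIG{__DIE__}  = sub { print STDERR "DIE:  $_[0]" };

    my $ok = eval {
        parse_anchors($content);    # hypothetical stand-in for the real parsing call
        1;
    };
    print STDERR "parse failed: $@" unless $ok;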

Thanks for any ideas or pointers. I am a noob to Perl, but have >40 years' experience programming... so I probably have some incorrect assumptions that are blinding me.

Dick Martin Shorter

Re: Why would a Perl script stop with no informatives?
by ww (Archbishop) on Dec 22, 2009 at 02:23 UTC
    As I read the source of the page you referenced with the tinyurl, there are 40 <a href...</a>s; 3 of those (17, 18 and 19) are hidden in a comment.

    Number 24, counting those in the comment - <a href="http://gs.statcounter.com/press/bing-gains-another-1-perc-of-search-market">Bing Gains Another 1% of Search Market</a> - doesn't seem to have anything that would cause the effect; neither does number 21, ignoring those in the comment.

    In fact, IMO, the most remarkable thing about that page is that there are no obvious html, css or js errors (though I haven't looked at the linked css, nor at the linked js).

    1. its html is entirely valid to PerlTidy and the w3c validator;
        and
    2. it's marked for utf-8.

    You might help wiser monks assist you by providing the partial data from the point where the script stops, pinpointing the character you identify as the one at which the issue arises.

      First, my sincere thanks for looking at what is, in all likelihood, a perception error on my part. Since my last post, I've looked into the buffering, and believe that, yes, it's stopping elsewhere than indicated due to buffering... the real clue was that it stopped WAY later when run in the debugger....

      That being said, I found some coding errors that occur shortly after where the prints stopped, fixed them, and that appears to have removed the problem.

      I am attempting - more for the learning experience than otherwise - to take the anchors looted from each page and pass each one (as a string), along with a ref to the URI object (from URI->new) holding the current page's URI, as the two args to a sub (called fqURL) that handles relative and short-absolute URIs and returns a fully qualified URL for the next page-fetch.

      called as:

      my ($newURI, $newScheme, $newHost, $newPathSegs, $newExtension, $newQuery)
          = fqURL($anchor, $thisURI);
      # so it returns an object, a string, a string, an array of strings, and two more strings
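
      (For the curious, a minimal sketch of what such a sub might look like if it leaned on the URI module - illustrative only, not my actual fqURL:)

          use URI;

          sub fqURL {
              my ($anchor, $thisURI) = @_;                   # href string, URI object for the current page

              my $newURI = URI->new_abs($anchor, $thisURI);  # resolves relative and short-absolute hrefs

              my $scheme = $newURI->scheme;
              my $host   = $newURI->can('host') ? $newURI->host : '';
              my @segs   = $newURI->path_segments;           # path split into segments
              my $ext    = '';
              $ext = $1 if @segs and defined $segs[-1] and $segs[-1] =~ /\.([^.\/]+)$/;
              my $query  = defined $newURI->query ? $newURI->query : '';

              # segments go back as a ref so the list survives alongside the other return values
              return ($newURI, $scheme, $host, \@segs, $ext, $query);
          }

      (URI->new_abs does the heavy lifting for the relative and short-absolute cases.)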

      My first problem was how to make the sub recognize the object passed as its second parm - I had used a convention of partial-caps to denote the object (oldURI), and initial-caps to denote the string representation of the object that I created from it (oldUri). Of course, I looked right past the mistake several times.... but - once it was found, it turned out to be just obscuring smoke! The actual problem was an anchor that had an href of '.' - a single standalone dot. This, when fed back into the Mech for a fetch, made the program refetch the same page forever, which somehow (still not clear to me) caused the script to crash - probably because it was making fqURLs of successively larger size, eventually running out of memory.
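
      (The eventual guard amounts to something like the sketch below - written from the idea rather than copied from the script - and sits inside the loop over anchors:)

          # skip any anchor that resolves back to the page we are already on;
          # href="." is exactly that case
          my $resolved = URI->new_abs($anchor, $thisURI);
          next if $resolved->canonical->eq($thisURI->canonical);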

      So, as expected, the problem was not in the parser at all, was obscured by the buffering delays until the helpful posts here were read, and was entirely in my inadequate understanding of ref-passing in and out of subs - there actually WAS code in the sub to detect and defang the single-dot problem - it just wasn't being called, because of a capitalization-typo. With this experience in hand, I'm gonna re-write the whole dang thing to erase that caps convention, and replace it with something less easy to elide visually while debugging. At least, thanks again to prodding from my brother monks here, I got some experience with the debugger out of the whole mess.

      Thanks, Folks, for all the help and patience!

      Dick Martin Shorter, novice perl monk

Re: Why would a Perl script stop with no informatives?
by gmargo (Hermit) on Dec 21, 2009 at 22:51 UTC

    What are the 5 bad pages? I'd be really curious to see if the HTML is so bad that it chokes up the parser.

    You could also try the perl debugger.

      only have one at my fingertips:
      http://tinyurl/nvzfar

      It seems to be pretty straightforward... and the parse completes... both with HTML::TokeParser and WWW::Mechanize. The failure occurs a fixed amount of execution after the parse has finished. Even if I just stack meaningless prints, it fails... (only tested in Mechanize)

      Let me say this again: after this loop:

      foreach my $link (@links) {
          print "LINK: " . $link->url() . "\n" if ($DEBUG >= 1);
          push(@anchors, $link->url());
      }
      my $goodAnchors = 0;
      print " @ 1ANCLOOP\n"  if ($DEBUG >= 1);
      print " @ 2ANCLOOP\n"  if ($DEBUG >= 1);
      print " @ 3ANCLOOP\n"  if ($DEBUG >= 1);
      print " @ 4ANCLOOP\n"  if ($DEBUG >= 1);
      .
      .
      .
      print " @ 29ANCLOOP\n" if ($DEBUG >= 1);
      print " @ 30ANCLOOP\n" if ($DEBUG >= 1);
      it stops at "24ANCLOOP" - 24 out of 30 meaningless print statements...

      BTW, I just replaced the link parsing code (which used HTML::TokeParser) with WWW::Mechanize, and the same thing still happens, albeit at a slightly different place. Of course, the failure point changes each time I add tracing prints... but it is completely repeatable, down to the character it fails on in a print, if that's where it is failing.
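
      (For reference, the Mechanize side of that boils down to something like this minimal sketch - $thisURL is just an assumed variable holding the page being fetched:)

          use WWW::Mechanize;

          my $mech = WWW::Mechanize->new( autocheck => 1 );   # autocheck => 1 makes failed fetches die loudly
          $mech->get($thisURL);

          foreach my $link ( $mech->links() ) {               # WWW::Mechanize::Link objects
              print "LINK: ", $link->url_abs(), "\n";         # url_abs() resolves relative hrefs against the page
          }

      (url_abs() already resolves relative hrefs, which overlaps with what fqURL does - but fqURL was the learning exercise.)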

        it stops at "24ANCLOOP" - 24 out of 30 meaningless print statements..

        Not necessarily. Do you get the same result if you add $|=1;? You could be suffering from buffering.
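
        (That is, near the top of the script, so every print is flushed immediately instead of sitting in a buffer:)

            $| = 1;                    # unbuffer the currently selected handle (normally STDOUT)

            # or, equivalently, per handle:
            use IO::Handle;
            STDOUT->autoflush(1);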

Re: Why would a Perl script stop with no informatives?
by ikegami (Patriarch) on Dec 21, 2009 at 22:35 UTC
    What's the process's exit code? (echo $?)
      ummmm, this is Windoze... there's probably a way to get the process exit code by putting it in a batch script, but...
        Oh, I got fooled by your mention of signals. Windows doesn't have signals*. Seeing as I was trying to figure out which signal killed your app, ignore my request.

        It also preempts the other ideas I had, sorry.

        * — Well, you could consider Ctrl-C and Ctrl-Break signals, but that's it. Windows apps use messages instead, and they aren't deadly. You can't even send one to a console app unless it creates a Window.