As I read the source of the page you referenced with the tinyurl, there are 40 <a href=...>...</a> anchors; 3 of those (numbers 17, 18 and 19) are hidden in a comment.
Number 24, counting those in the comment, <a href="http://gs.statcounter.com/press/bing-gains-another-1-perc-of-search-market">Bing Gains Another 1% of Search Market</a>, doesn't seem to have anything that would cause the effect; neither does number 24 when the commented-out anchors are ignored.
In fact, IMO, the most remarkable thing about that page is that there are no obvious html, css or js errors (though I haven't looked at the linked css, nor at the linked js).
- its html is entirely valid according to PerlTidy and the w3c validator; and
- it's marked for utf-8.
You might help wiser monks assist you by posting the partial output up to where the script stops, pinpointing the character at which you believe the issue arises.
First, my sincere thanks for looking at what is, in all likelihood, a perception error on my part. Since my last post, I've looked into the buffering, and believe that, yes, it's stopping elsewhere than indicated because of buffering... the real clue was that it stopped WAY later when run in the debugger.
That being said, I found some coding errors that occurred shortly after the point where the prints stopped, fixed them, and that appears to have removed the problem.
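For anyone who hits the same thing: the usual quick way to rule buffering in or out (generic boilerplate, not lifted from my script) is to unbuffer STDOUT before the debug prints:

    use IO::Handle;          # gives filehandles an autoflush() method
    STDOUT->autoflush(1);    # flush every print immediately
    # or, old-school, for the currently selected handle:
    $| = 1;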
I am attempting - more for the learning experience than otherwise - to take the anchors looted from each page and pass them (as a string), along with a ref to the URI->new()'d object holding the current page's URI, as two args to a sub (called fqURL) that handles relative and short-absolute URIs and returns a fully qualified URL for the next page-fetch.
called as:
my ($newURI, $newScheme, $newHost, $newPathSegs, $newExtension, $newQuery)
    = fqURL($anchor, $thisURI);
# so it returns an object, a string, a string, an array of strings, and two more strings #
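(For illustration only: a minimal sub along those lines can be built on the core URI module. The body below is a sketch of that approach, not the script's actual fqURL.)

    use strict;
    use warnings;
    use URI;

    sub fqURL {
        my ($anchor, $thisURI) = @_;

        # new_abs() resolves relative hrefs ('.', '..', 'page.html', '//host/x')
        # against the URI of the page the anchor was taken from.
        my $newURI = URI->new_abs($anchor, $thisURI);

        my $scheme   = $newURI->scheme // '';
        my $host     = $newURI->can('host')          ? $newURI->host          : '';
        my @pathSegs = $newURI->can('path_segments') ? $newURI->path_segments : ();
        my ($ext)    = ($pathSegs[-1] // '') =~ /\.(\w+)\z/;
        my $query    = $newURI->query // '';

        # the path segments go back as a reference so they survive the flat list return
        return ($newURI, $scheme, $host, \@pathSegs, $ext // '', $query);
    }

Returning \@pathSegs as a reference, rather than a flat array, is what keeps the six return values lined up the way the comment above describes.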
My first problem was how to make the sub recognize the object passed as its second parm - I had used a convention of partial-caps to denote the object (oldURI), and initial-caps to denote the string representation of the object that I created from it (oldUri). Of course, I looked right past the mistake several times.... but - once it was found, it turned out to be just obscuring smoke! The actual problem was an anchor that had an href of '.' - a single standalone dot. This, when fed back into the Mech for a fetch, made the program refetch the same page forever, which somehow (still not clear to me) caused the script to crash - probably because it was making fqURLs of successively larger size, eventually running out of memory.
So, as expected, the problem was not in the parser at all; it was obscured by the buffering delays until the helpful posts here were read, and was entirely in my inadequate understanding of ref-passing in and out of subs - there actually WAS code in the sub to detect and defang the single-dot problem, it just wasn't being called, because of a capitalization typo. With this experience in hand, I'm gonna re-write the whole dang thing to erase that caps convention and replace it with something less easy to overlook visually while debugging. At least, thanks again to prodding from my brother monks here, I got some experience with the debugger out of the whole mess.
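One generic way to defang that case (the %seen hash and the names below are illustrative, not the script's actual code) is to compare each resolved link against the page it came from, and against everything already fetched:

    my @toFetch;
    my %seen;
    foreach my $anchor (@anchors) {
        my ($newURI) = fqURL($anchor, $thisURI);

        next if $newURI->eq($thisURI);          # '.' resolves straight back to the current page
        next if $seen{ $newURI->canonical }++;  # never queue the same URL twice

        push @toFetch, $newURI;
    }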
Thanks, Folks, for all the help and patience!
Dick Martin Shorter, novice perl monk
What are the 5 bad pages? I'd be really curious to see if the HTML is so bad that it chokes up the parser.
You could also try the perl debugger.
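For example (the script name is a placeholder):

    perl -d yourscript.pl

Inside it, n steps over a statement, s steps into it, and x dumps a variable - usually enough to see exactly where a run dies.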
I only have one at my fingertips: http://tinyurl/nvzfar
It seems to be pretty straightforward... and the parse completes, both with HTML::TokeParser and WWW::Mechanize. The failure occurs some fixed amount of execution after the parse is finished. Even if I'm just stacking meaningless prints, it fails... (only tested with Mechanize)
Let me say this again: after this loop:
foreach my $link (@links) {
    print "LINK: " . $link->url() . "\n" if ($DEBUG>=1);
    push(@anchors, $link->url());
}
my $goodAnchors = 0;
print " @ 1ANCLOOP\n" if ($DEBUG>=1);
print " @ 2ANCLOOP\n" if ($DEBUG>=1);
print " @ 3ANCLOOP\n" if ($DEBUG>=1);
print " @ 4ANCLOOP\n" if ($DEBUG>=1);
. . .
print " @ 29ANCLOOP\n" if ($DEBUG>=1);
print " @ 30ANCLOOP\n" if ($DEBUG>=1);
it stops at "24ANCLOOP" - 24 out of 30 meaningless print statements...
BTW, I just replaced the link parsing code (used to use HTML::TokeParser) with WWW::Mechanize, and the same thing still happens, albeit at a slightly different place. Of course, the failure changes each time I add tracing prints... but is completely repeatable, down to the character it fails on in a print, if that's where it is failing.
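For reference, the Mechanize side of that looks roughly like this (variable names and the autocheck choice are mine, not necessarily the script's):

    use strict;
    use warnings;
    use WWW::Mechanize;

    my $mech = WWW::Mechanize->new( autocheck => 1 );   # die on fetch errors
    $mech->get($startUrl);                               # $startUrl is a placeholder
    my @links = $mech->links();                          # WWW::Mechanize::Link objects
    print "LINK: ", $_->url(), "\n" for @links;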
What's the process's exit code? (echo $?)
ummmm, this is Windoze... there's probably a way to get the process exit code by putting it in a batch script, but...
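(For what it's worth, cmd.exe keeps the last exit code in ERRORLEVEL, so the rough equivalent of echo $? is:)

    echo %ERRORLEVEL%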
Oh, I got fooled by your mention of signals. Windows doesn't have signals*. Seeing as I was trying to figure out which signal killed your app, ignore my request.
That also rules out the other ideas I had, sorry.
* — Well, you could consider Ctrl-C and Ctrl-Break signals, but that's it. Windows apps use messages instead, and they aren't deadly. You can't even send one to a console app unless it creates a Window.