Problem timing out XML::LibXML parsing

samtregar has asked for the wisdom of the Perl Monks concerning the following question:

Hello all. I'm using XML::LibXML to parse some HTML. Mostly it's working great - fast, with very useful XPath support. My problem is that it's choking on some very bad HTML in a very bad way - it sits on the CPU until killed manually. I expected that some HTML wouldn't parse, so that by itself isn't such a tragedy. The big problem is that my attempts to work around this with alarm() aren't working!

Here's my code:

use strict;
use warnings;
use XML::LibXML;

my $html = do { local $/; <> };

my $libxml = XML::LibXML->new();
#$libxml->recover(2);

eval {
    local $SIG{ALRM} = sub { die "TIMEOUT\n" };
    alarm(10);
    $libxml->parse_html_string($html);
    alarm(0);
};
if ($@ and $@ eq "TIMEOUT\n") {
    warn "Timed out ok.\n";
} elsif ($@) {
    die $@;
}

If I replace the parse call with sleep(20) then it works as expected - the alarm triggers and the timeout is caught. If I run it as-is with my sample HTML then it never stops until killed. If you want to play along at home here's the test file:

http://sam.tregar.com/libxml-fail.html

BEWARE: that's some really bad HTML and it not only breaks XML::LibXML but it also crashed Firefox while I was writing this post the first time! You probably don't want to load it in your browser.

I've never had alarm() fail like this. Is there an alternative I can try? Any other ideas about how to handle this?

Thanks!

-sam

UPDATE: perrin reminded me about how safe-signals work in recent perls. That is indeed the problem - setting PERL_SIGNALS=unsafe makes my code DWIM, at the cost of a certain degree of safety. Ideas for alternatives are still welcome of course.
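
PERL_SIGNALS is read when the interpreter starts, so setting it from inside a running script has no effect on that process. One way to apply the workaround without changing how the script is launched is to re-exec the script once with the variable set. This is only a sketch: the helper name `ensure_unsafe_signals` is made up for illustration, and it assumes the script was started from a file (so that `$0` can be re-executed) and not with `perl -e`.

```perl
use strict;
use warnings;

# Hypothetical helper: make sure this process runs with PERL_SIGNALS=unsafe.
# If the variable is not already set that way, set it and re-exec the same
# script with the same arguments; the new process then returns 1 right away.
sub ensure_unsafe_signals {
    return 1 if ($ENV{PERL_SIGNALS} // '') eq 'unsafe';
    $ENV{PERL_SIGNALS} = 'unsafe';
    exec $^X, $0, @ARGV or die "Cannot re-exec $^X: $!\n";
}
```

Calling `ensure_unsafe_signals();` near the top of the script, before reading any input, keeps the re-exec cheap. Note that this turns off deferred signals for every handler in the process, which is the safety trade-off mentioned above.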


Re: Problem timing out XML::LibXML parsing by gwadej (Chaplain) on Feb 03, 2009 at 20:26 UTC
I'm not completely sure, but weren't signals in Perl changed so that they are only delivered when the interpreter is somewhere safe? See "Deferred Signals (Safe Signals)" in perlipc. There's an environment variable (PERL_SIGNALS) you can set to unsafe that disables this behavior. That might allow you to test whether `alarm` is being affected by this feature. G. Wade
Re^2: Problem timing out XML::LibXML parsing by hobbs (Monk) on Feb 04, 2009 at 00:54 UTC
The other thing that you can do, which is also documented in perlipc, is to use `POSIX::sigaction()` to set your handler, which has the advantage of only invoking the "unsafe" behavior for that particular SIGALRM and not for any signal your app might catch, ever. You decide whether that sounds worth your time or not :)
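
A minimal sketch of that approach, modeled on the `POSIX::sigaction` example in perlipc. The helper name `run_with_timeout` and its return convention are invented for illustration and are not part of any module. Handlers installed this way are not deferred, so the `die` can fire while a C library call is still running, with the same caveats about interrupting C code that apply to PERL_SIGNALS=unsafe.

```perl
use strict;
use warnings;
use POSIX qw(SIGALRM);

# Hypothetical helper: run $code, giving up after $seconds.
# Returns 1 if $code finished, 0 if it timed out; rethrows any other error.
sub run_with_timeout {
    my ($seconds, $code) = @_;

    # A handler installed through POSIX::sigaction is called immediately
    # (not deferred to a "safe" point) unless its safe() flag is turned on.
    my $action = POSIX::SigAction->new(sub { die "TIMEOUT\n" });
    my $old    = POSIX::SigAction->new;
    POSIX::sigaction(SIGALRM, $action, $old)
        or die "Error setting SIGALRM handler: $!\n";

    my $ok = eval {
        alarm($seconds);
        $code->();
        alarm(0);
        1;
    };
    my $err = $@;
    alarm(0);                            # never leave an alarm pending
    POSIX::sigaction(SIGALRM, $old);     # put the previous handler back

    return 1 if $ok;
    return 0 if $err eq "TIMEOUT\n";
    die $err;
}
```

With the code from the question this would be called as `run_with_timeout(10, sub { $libxml->parse_html_string($html) })`. Only SIGALRM gets the immediate delivery; any other handlers in the program keep the default deferred behavior.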
Re: Problem timing out XML::LibXML parsing by hbm (Hermit) on Feb 03, 2009 at 20:21 UTC
The recipes I've seen have an inner eval block and a second `alarm(0)`. Paraphrasing, the inner eval is to trap any exception from your long process, which otherwise might pop you out of the outer eval with the alarm still pending; and the second alarm reset is for the slim chance of an exception occurring after the inner eval, but before the first alarm reset. See the three new lines (marked with ######) below. I haven't tested this, but does it work?

use strict;
use warnings;
use XML::LibXML;

my $html = do { local $/; <> };

my $libxml = XML::LibXML->new();
#$libxml->recover(2);

eval {
    local $SIG{ALRM} = sub { die "TIMEOUT\n" };
    alarm(10);
    eval { ######
        $libxml->parse_html_string($html);
    }; ######
    alarm(0);
};
alarm(0); ######
if ($@ and $@ eq "TIMEOUT\n") {
    warn "Timed out ok.\n";
} elsif ($@) {
    die $@;
}
Re^2: Problem timing out XML::LibXML parsing by samtregar (Abbot) on Feb 03, 2009 at 20:25 UTC
No, that doesn't help - the problem is that the alarm() signal isn't interrupting the call to libxml2's C code. I added an UPDATE on my question with one fix that works... -sam
Re: Problem timing out XML::LibXML parsing by mirod (Canon) on Feb 03, 2009 at 20:01 UTC
Can HTML::Parser deal with this code? And do you really need to use XML::LibXML? If the answers are yes and no, you can use HTML::TreeBuilder (and HTML::TreeBuilder::XPath for very useful XPath support). Or use XML::Twig, which uses HTML::TreeBuilder to wrestle XML out of the HTML. Otherwise you could use HTML::Tidy, or just plain `tidy`, to clean up the HTML before parsing it. IIRC, when I was looking for a way to convert HTML to XML, HTML::TreeBuilder seemed to be the most robust parser available in Perl.
Re^2: Problem timing out XML::LibXML parsing by samtregar (Abbot) on Feb 03, 2009 at 20:12 UTC
Thanks, but yes, I really want to use XML::LibXML. It's so much faster than HTML::TreeBuilder and speed is critical in my application. So far it's actually been pretty reliable - this problem only occurs in around 1 out of every 100,000 or so pages I've parsed. -sam

