Problem timing out XML::LibXML parsing

by samtregar (Abbot)
on Feb 03, 2009 at 19:40 UTC (#741100)

samtregar has asked for the wisdom of the Perl Monks concerning the following question:

Hello all. I'm using XML::LibXML to parse some HTML. Mostly it's working great - it's fast and has very useful XPath support. My problem is that it's choking on some very bad HTML in a very bad way - it's sitting on the CPU until killed manually. I expected some HTML wouldn't parse, so this isn't such a tragedy. What is a big problem is that my attempts to work around this with alarm() aren't working!

Here's my code:

use strict;
use warnings;
use XML::LibXML;

my $html = do { local $/; <> };

my $libxml = XML::LibXML->new();
#$libxml->recover(2);

eval {
    local $SIG{ALRM} = sub { die "TIMEOUT\n" };
    alarm(10);
    $libxml->parse_html_string($html);
    alarm(0);
};
if ($@ and $@ eq "TIMEOUT\n") {
    warn "Timed out ok.\n";
} elsif ($@) {
    die $@;
}

If I replace the parse call with sleep(20) then it works as expected - the alarm triggers and the timeout is caught. If I run it as-is with my sample HTML then it never stops until killed. If you want to play along at home here's the test file:

http://sam.tregar.com/libxml-fail.html

BEWARE: that's some really bad HTML and it not only breaks XML::LibXML but it also crashed Firefox while I was writing this post the first time! You probably don't want to load it in your browser.

I've never had alarm() fail like this. Is there an alternative I can try? Any other ideas about how to handle this?

Thanks!

-sam

UPDATE: perrin reminded me about how safe-signals work in recent perls. That is indeed the problem - setting PERL_SIGNALS=unsafe makes my code DWIM, at the cost of a certain degree of safety. Ideas for alternatives are still welcome of course.
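
For anyone who would rather not depend on the calling environment, here is a minimal sketch of one way to apply that setting from inside the script. As far as I know, perl reads PERL_SIGNALS only at interpreter startup, so changing %ENV in a running process is not enough and the script has to re-execute itself. The helper name ensure_unsafe_signals is hypothetical, not part of any module.

```perl
use strict;
use warnings;

# Hypothetical helper: make sure this process runs with PERL_SIGNALS=unsafe.
# Returns 1 if the variable is already set that way. Otherwise it sets the
# variable and re-executes the current script with the same arguments,
# because (assumption) perl only consults PERL_SIGNALS at startup.
sub ensure_unsafe_signals {
    return 1 if ($ENV{PERL_SIGNALS} || '') eq 'unsafe';
    $ENV{PERL_SIGNALS} = 'unsafe';
    exec($^X, $0, @ARGV) or die "Could not re-exec $0: $!\n";
}
```

Calling ensure_unsafe_signals() near the top of the script, before any input is read, should leave STDIN and @ARGV intact for the re-executed process. The trade-off is the one noted above: every signal in the program is delivered immediately again, not just SIGALRM.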

Replies are listed 'Best First'.
Re: Problem timing out XML::LibXML parsing
by gwadej (Chaplain) on Feb 03, 2009 at 20:26 UTC

    I'm not completely sure, but weren't signals in Perl changed so that they are only delivered when the interpreter is somewhere safe? See "Deferred Signals (Safe Signals)" in perlipc.

    There's an environment variable (PERL_SIGNALS) that you can set to unsafe to disable this behavior. That might allow you to test whether alarm is being affected by this feature.

    G. Wade
      The other thing that you can do, which is also documented in perlipc, is to use POSIX::sigaction() to set your handler, which has the advantage of only invoking the "unsafe" behavior for that particular SIGALRM and not for any signal your app might catch, ever. You decide whether that sounds worth your time or not :)
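
A minimal sketch of that POSIX::sigaction() approach, modeled on the snippet in perlipc. The wrapper name with_timeout and its arguments are hypothetical. It also assumes that resetting SIGALRM to DEFAULT afterwards is acceptable, so it does not restore any handler the program had installed earlier.

```perl
use strict;
use warnings;
use POSIX qw(SIGALRM);

# Hypothetical helper: run $code and die with "TIMEOUT\n" if it takes
# longer than $seconds. The handler is installed with POSIX::sigaction(),
# so SIGALRM is delivered immediately instead of being deferred until the
# interpreter reaches a safe point.
sub with_timeout {
    my ($seconds, $code) = @_;

    POSIX::sigaction(SIGALRM, POSIX::SigAction->new(sub { die "TIMEOUT\n" }))
        or die "Error setting SIGALRM handler: $!\n";

    my $result = eval {
        alarm($seconds);
        my $r = $code->();
        alarm(0);
        $r;
    };
    my $err = $@;

    alarm(0);                  # disarm in case $code died for another reason
    $SIG{ALRM} = 'DEFAULT';    # assumption: no earlier handler to restore

    die $err if $err;
    return $result;
}
```

With the script from the question, the call would look something like with_timeout(10, sub { $libxml->parse_html_string($html) }). Bear in mind that the handler dies in the middle of libxml2's C code, so memory may leak or the parser may be left in an odd state. That is the "unsafe" part, limited here to SIGALRM.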
Re: Problem timing out XML::LibXML parsing
by hbm (Hermit) on Feb 03, 2009 at 20:21 UTC

    The recipes I've seen have an inner eval block and a second alarm(0). Paraphrasing, the inner eval is to trap any exception from your long process, which otherwise might pop you out of the outer eval with the alarm still pending; and the second alarm reset is for the slim chance of an exception occurring after the inner eval, but before the first alarm reset.

    See the three new lines below. I haven't tested this, but does it work?

    use strict;
    use warnings;
    use XML::LibXML;

    my $html = do { local $/; <> };

    my $libxml = XML::LibXML->new();
    #$libxml->recover(2);

    eval {
        local $SIG{ALRM} = sub { die "TIMEOUT\n" };
        alarm(10);
        eval {                                     ######
            $libxml->parse_html_string($html);
        };                                         ######
        alarm(0);
    };
    alarm(0);                                      ######
    if ($@ and $@ eq "TIMEOUT\n") {
        warn "Timed out ok.\n";
    } elsif ($@) {
        die $@;
    }

      No, that doesn't help - the problem is that the alarm() signal isn't interrupting the call to libxml2's C code. I added an UPDATE on my question with one fix that works...

      -sam

Re: Problem timing out XML::LibXML parsing
by mirod (Canon) on Feb 03, 2009 at 20:01 UTC

    Can HTML::Parser deal with this code? And do you really need to use XML::LibXML? If the answers are yes and no, you can use HTML::TreeBuilder (and HTML::TreeBuilder::XPath for very useful XPath support). Or use XML::Twig, which uses HTML::TreeBuilder to wrestle XML out of the HTML.

    Otherwise you could use HTML::Tidy, or just plain tidy, to clean up the HTML before using it.

    IIRC, when I was looking for a way to convert HTML to XML, HTML::TreeBuilder seemed to be the most robust parser available in Perl.

      Thanks, but yes, I really want to use XML::LibXML. It's so much faster than HTML::TreeBuilder and speed is critical in my application. So far it's actually been pretty reliable - this problem only occurs in around 1 out of every 100,000 or so pages I've parsed.

      -sam
