Om-mani- ...
Thanks. While brute-force reading of the docs and source didn't do much for me, browsing the CPAN page for XML::LibXML::Parser just gave me a breakthrough, and I feel stupid for having asked in the first place =o)
All it took was to change from
my $source = XML::LibXML->load_xml(location => 'blub.html');
to
my $source = XML::LibXML->load_html(location => 'blub.html');
*sigh*
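For anyone landing here later, a minimal sketch of the difference, assuming a made-up tag-soup string rather than the real blub.html. The `$html` content is hypothetical; `recover => 2` is only there to quieten recoverable parse warnings.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical tag-soup input: <br> and <p> are never closed,
# so this is acceptable HTML but not well-formed XML.
my $html = '<html><body><p>Hello<br>world</body></html>';

# The XML parser is strict and dies on the tag mismatch; eval catches that.
my $as_xml = eval { XML::LibXML->load_xml(string => $html) };
print "load_xml refused the input\n" unless defined $as_xml;

# The HTML parser is forgiving and builds a usable tree from the same string.
my $as_html = XML::LibXML->load_html(string => $html, recover => 2);
print $as_html->findvalue('//p'), "\n";
```

Both calls return an XML::LibXML::Document on success, so the rest of the script (XPath, XSLT) should not need to change.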
| [reply] [d/l] [select] |
Tinkster? Seriously? My name is Toby Inkster.
Anyway, the difference in times may be due to DTDs. By default libxml (and libxslt is all libxml-based) downloads DTDs and uses them to expand entities (i.e. convert &eacute; → é). This network activity significantly slows down parsing.
LibXML can thankfully be pointed at a local catalogue of DTDs. (See XML::LibXML::Parser and the load_catalog method.) This speeds it up significantly.
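A rough sketch of what that might look like. The catalogue path `/etc/xml/catalog` is an assumption (it is a common location on Linux, but check your system), and `no_network` is added here so that a DTD missing from the catalogue is not silently fetched; expect parse complaints in that case rather than a slow download.

```perl
use strict;
use warnings;
use XML::LibXML;

my $catalog = '/etc/xml/catalog';   # assumed location, adjust for your system
my $file    = 'blub.html';          # file name taken from the post above

my $parser = XML::LibXML->new(
    load_ext_dtd    => 1,   # still load the DTD so entities are known
    expand_entities => 1,   # expand entity references while parsing
    no_network      => 1,   # never go out to the network for it
);

# Point libxml at the local catalogue, if there is one at the assumed path.
$parser->load_catalog($catalog) if -e $catalog;

# Only parse when the file is actually present.
my $doc = -e $file ? $parser->load_xml(location => $file) : undef;
print defined $doc ? "parsed $file\n" : "no $file here\n";
```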
Also check out my module HTML::HTML5::Parser, which (IMHO) parses HTML much better than libxml's built-in HTML parser.
perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'
| [reply] [d/l] |
Thanks Toby,
Re my nick: that's a long story, doesn't belong here ;]
Re the parser: I'm using an XSLT sheet to translate some ugly (non-standard) Apple wiki HTML(-like) documents to wiki markup, so I'm not sure how I'd integrate HTML::HTML5::Parser with that approach. Thanks for the recommendation anyway.
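One way the two might fit together, as a sketch only: HTML::HTML5::Parser returns an XML::LibXML::Document, so it can be handed straight to XML::LibXSLT. The input string and the one-template stylesheet are made up for illustration. The assumption to watch is that the HTML5 parser puts elements in the XHTML namespace, so the stylesheet matches them through a prefix (`h:` here).

```perl
use strict;
use warnings;
use HTML::HTML5::Parser;
use XML::LibXML;
use XML::LibXSLT;

# Hypothetical input; the real documents would come from parse_file instead.
my $html = '<p>Hello <b>wiki</b></p>';
my $doc  = HTML::HTML5::Parser->new->parse_string($html);

# Toy stylesheet: turn <b> into '''bold''' wiki markup, emit everything as text.
my $xsl = <<'XSL';
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:h="http://www.w3.org/1999/xhtml">
  <xsl:output method="text"/>
  <xsl:template match="h:b">'''<xsl:value-of select="."/>'''</xsl:template>
</xsl:stylesheet>
XSL

my $stylesheet = XML::LibXSLT->new->parse_stylesheet(
    XML::LibXML->load_xml(string => $xsl)
);

# transform() takes the DOM from the HTML5 parser like any other document.
my $result = $stylesheet->transform($doc);
my $wiki   = $stylesheet->output_as_chars($result);
print $wiki, "\n";
```

An existing sheet written against un-namespaced HTML would need its match patterns prefixed in the same way.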
Will have a play with the XML::LibXML::Parser once sanity is restored here. Ta ;)
Cheers,
Tink