Re: Parsing HTML/XML with Regular Expressions (XML::LibXML)

Re^2: Parsing HTML/XML with Regular Expressions (XML::LibXML; updated!) by haukex (Archbishop) on Oct 16, 2017 at 15:12 UTC
`<update nr="4">` For the sake of completeness, here's a working script with the changes mentioned below:

```perl
use warnings;
use strict;
use XML::LibXML;

my $doc = XML::LibXML->load_xml(
    location => 'example.xhtml', no_network=>1, recover=>1 );
my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs('html', 'http://www.w3.org/1999/xhtml');
my @ids2text = map { [ $_->getAttribute('id'), $_->textContent ] }
    $xpc->findnodes(q{//html:div[@class='data']});
$_->[1] =~ s/\W+//g for @ids2text;
print join ", ", map sprintf("%s=%s", @$_), @ids2text;
```

`</update>`

Thanks very much for the reply! Your post inspired some more test cases for my file, and I'm sorry to say I broke your code `:-(` But here's the fix:

```perl
my @ids2text = map { [ $_->getAttribute('id'), $_->textContent ] }
    $doc->findnodes(q{//div[@class='data']});
```

Update: And yes, it does seem that `load_html` doesn't like XHTML - `load_xml` seems to work a bit better, although fetching the DTD from the net is pretty slow at the moment; adding the options `{no_network=>1,recover=>1}` disables the network access. However, with `load_xml` one also has to start using XML::LibXML::XPathContext:

```perl
my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs('html', 'http://www.w3.org/1999/xhtml');
my @ids2text = map { [ $_->getAttribute('id'), $_->textContent ] }
    $xpc->findnodes(q{//html:div[@class='data']});
```

Update 2: Even with network, XML::LibXML is still complaining about `&nbsp;` ("`Entity 'nbsp' not defined`"). I'm not entirely sure why yet, as it seems to be defined in the DTD...

Update 3: The W3C Validator doesn't complain...
Re^3: Parsing HTML/XML with Regular Expressions (validation of the content) by Discipulus (Canon) on Oct 17, 2017 at 07:38 UTC
Hello again haukex,

the thread is interesting, and I did my best last night to provide an XML::Twig solution, but due to my limited understanding of XML in general, I report here some things I do not understand about the file you presented as input.

First, I cheated: I got the sample XML file before writing the program, because with XML I always take a try-and-check path.

Second, in my wide ignorance, I really don't know how XHTML, DTD, DOM and "transitional" affect the approach to parsing the XML. My sin.

Third: if XML::Twig (the only module I use for these tasks) complains about the document, I use the W3C validator to check the content before banging my head against it, a task I really don't like. So, your sample is a valid one.

I put it after the `__DATA__` token and I got the following error:

```
no element found at line 2, column 0, byte 39 at D:/ulisse/perl5.26.64bit/perl/vendor/lib/XML/Parser.pm line 187.
 at dontregexXML03.pl line 20.
```

After half an hour of searching the web I ended up reading about XPath bugs dated 2009, but found no clue at all. Every attempt to brutally cut the XML, removing lines and tags, ended with the very same error, at the same line (??).

So I tested YourMother's solution with your own modification, and I got many errors but also the correct solution:

```
sample.html:11: HTML parser error : Element script embeds close tag
console.log(' <div class="data" id="Hello">World</div> ');
^
sample.html:49: HTML parser error : htmlParseStartTag: invalid element name
<![CDATA[
^
sample.html:50: HTML parser error : Unexpected end tag : div
<div class="data" id="Bye">Bye</div>
^
Zero=, One=Monday, Two=Tuesday, Three=Wednesday, Four=Thursday, Five=Friday, Six=Saturday, Seven=Sunday
```

So I assumed the XML really did have some problems: my other attempts to `fix` it using the detailed reports emitted by XML::LibXML had no more luck than the previous ones.
As a last resort I put the XML sample into a separate file and: TADA! Everything runs smoothly (not considering the `&nbsp;` issue) with XML::Twig as presented above.

Any suggestions? Which is the best module to report formal errors in the XML structure? Are the errors reported above real ones, or are they due to limits of the parsing module?

If the thread continues, it could become the Rosetta Stone of Perl XML parsing. Good one!

L*

There are no rules, there are no thumbs.. Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
Re^4: Parsing HTML/XML with Regular Expressions (validation of the content) by haukex (Archbishop) on Oct 17, 2017 at 11:24 UTC
Thanks for looking into that! So as for the `&nbsp;`, my understanding so far is this: of course an HTML parser will know what it is, but a generic XML parser will by default not know that entity - for that, it has to load the DTDs, but apparently not all XML parsers do that. So, to separate the two problems (the parsing of the XML in the root node vs. figuring out the right options to get the XML parser to recognize the HTML entities), I've updated the example XHTML in the root node to replace the `&nbsp;` (and made a few other updates - unfortunately causing `load_html` to throw more errors, but `load_xml` to work better).

"Which is the best module to report formal errors in the XML structure?"

I typically use xmllint, which is also based on `libxml2` just like XML::LibXML, so really either of those two tools should do XML validation pretty well (as I said above, I'm not sure yet what's going on with the DTDs). For example, to validate the example from the root node against the XHTML schema, the following command works; it's also possible to speed it up by downloading the schema files locally and using the options `--nonet --path /path/to/schemas/ --schema /path/to/schemas/xhtml1-strict.xsd` (the "`I/O error : Attempt to load network entity`" messages can usually be ignored).

```
$ xmllint --noout --schema \
    'http://www.w3.org/2002/08/xhtml/xhtml1-strict.xsd' example.xhtml
example.xhtml validates
```

`<update>` Or, you can use the `--valid` option for DTD validation. `</update>`

For any (X)HTML, I'd consider the W3C Validator the gold standard. I've also often just used the above `xmllint` command.

As for your problem with parsing the XML file from the `DATA` section, I'd have to look into that a bit when I find some more time. Perhaps the parser is doing something with the filehandle that is not compatible with `DATA`.
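On the question of reporting formal errors from Perl itself, here is a minimal sketch, assuming XML::LibXML is installed. It only checks well-formedness (not validity against a DTD or schema), and the helper name `wellformed_error` and both sample strings are made up for illustration:

```perl
use warnings;
use strict;
use XML::LibXML;

# Hypothetical helper: returns an empty string if $xml is well-formed,
# otherwise the parser's error message as a plain string.
sub wellformed_error {
    my ($xml) = @_;
    # Without recover=>1, load_xml dies on the first well-formedness error.
    my $ok = eval { XML::LibXML->load_xml(string => $xml); 1 };
    return '' if $ok;
    my $err = "$@";    # stringify the XML::LibXML::Error object
    chomp $err;
    return $err;
}

# Made-up samples: the second one is missing the closing </div>.
my $good = '<root><div class="data" id="Hello">World</div></root>';
my $bad  = '<root><div class="data" id="Hello">World</root>';

for my $case ([good => $good], [bad => $bad]) {
    my ($name, $xml) = @$case;
    my $err = wellformed_error($xml);
    print "$name: ", ($err eq '' ? "well-formed" : "NOT well-formed: $err"), "\n";
}
```

Note that passing `recover=>1`, as in the script further up, would make the parser try to carry on instead of dying, so a check like this should leave that option off.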
Also, ikegami made an excellent point a while back: XML files should be treated like binary files, and it's better to let the XML parser handle the decoding (although my example file is currently pure 7-bit ASCII).
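To illustrate that point, here is a small sketch, assuming XML::LibXML is installed. The Latin-1 sample document and the temporary file are invented for the demonstration; the idea is only that the file is opened with `:raw` and the parser picks the encoding from the XML declaration:

```perl
use warnings;
use strict;
use File::Temp qw(tempfile);
use XML::LibXML;

# Made-up sample: a Latin-1 document that declares its own encoding.
# "\xE9" is the single Latin-1 byte for e-acute.
my $bytes = qq{<?xml version="1.0" encoding="ISO-8859-1"?>\n}
          . qq{<root>caf\xE9</root>\n};

# Write the raw bytes to a temporary file, with no encoding layer.
my ($out, $filename) = tempfile(UNLINK => 1);
binmode $out;
print {$out} $bytes;
close $out or die "close: $!";

# Read the file back as binary and let the parser do the decoding.
open my $in, '<:raw', $filename or die "open $filename: $!";
my $doc = XML::LibXML->load_xml(IO => $in);
close $in;

# The DOM hands back a decoded Perl character string.
my $text = $doc->documentElement->textContent;
printf "length=%d, last char=U+%04X\n", length($text), ord(substr $text, -1);
```

Opening the same file with an `:encoding(...)` layer instead would mean decoding it once in Perl while the XML declaration still announces the original encoding, which is the kind of mismatch the advice is meant to avoid.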