Re^4: Parsing HTML/XML with Regular Expressions

Could it just be a version issue? The code you posted works fine for me:

use warnings;
use strict;
use XML::LibXML;

my $XML = <<'_END_XML_';
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html [ <!ENTITY atad "data"> ] >
<html xmlns="http://www.w3.org/1999/xhtml">
<div class="&atad;" id="Hello" />
<div class="&atad;" id="World" />
</html>
_END_XML_

print $XML::LibXML::VERSION, " ",
    XML::LibXML::LIBXML_DOTTED_VERSION, " ",
    XML::LibXML::LIBXML_VERSION, " ",
    XML::LibXML::LIBXML_RUNTIME_VERSION, "\n";
my $dom = XML::LibXML->load_xml(string=>$XML);
my $xpc = XML::LibXML::XPathContext->new;
$xpc->registerNs(xh => 'http://www.w3.org/1999/xhtml');
for my $div ($xpc->findnodes('//xh:div[@class="data"]', $dom)) {
    print "1:", $div->{id}, "\n"
}
for my $div ($xpc->findnodes('//xh:div', $dom)) {
    print "2:", $div->{id}, " ", $div->{class}, "\n"
}

__END__

2.0129 2.9.1 20901 20901
1:Hello
1:World
2:Hello data
2:World data

Re^5: Parsing HTML/XML with Regular Expressions by choroba (Cardinal) on Oct 18, 2017 at 22:13 UTC
I modified your original input:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"[
  <!ENTITY atad 'data'>
  <!ATTLIST html
    xmlns:xsi CDATA #FIXED "http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation CDATA #IMPLIED
  >
]>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/1999/xhtml http://www.w3.org/2002/08/xhtml/xhtml1-strict.xsd">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>Hello, World</title>
<script type="text/javascript">
//<![CDATA[
console.log(' <div class="data" id="Hello">World</div> ');
//]]>
</script>
</head>
<body>
<div class="data" id="Zero" />
<div class="data" id="One">Monday</div><div class="data" id="Two">Tuesday</div>
<div id="Three" class='data'>Wednes<div id="day">day</div></div>
<div class="&atad;" id='Four'><b>Thursday</b></div>

etc. Four was missing from the output of both libraries.

($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
Re^6: Parsing HTML/XML with Regular Expressions (updated) by haukex (Archbishop) on Oct 18, 2017 at 22:54 UTC
Although I'm definitely not an expert in the latter, there does seem to be a difference between XML::LibXML and XML::XSH2.

`<update>` I think I found it, looks like it's the option `load_ext_dtd=>0`, from XML::LibXML::Parser: "Thus switching off external DTD loading, will disable entity expansion, validation, and complete attributes on internal subsets as well." `</update>`

#!/usr/bin/env perl
use warnings;
use strict;

our $XML = <<'_END_XML_';
<!DOCTYPE html [ <!ENTITY atad "data"> ] >
<html xmlns="http://www.w3.org/1999/xhtml">
<div class="data" id="One" />
<div class="&atad;" id="Two" />
<div class="&atad;" id="Three" />
</html>
_END_XML_

use XML::LibXML ();
print $XML::LibXML::VERSION, " ",
    XML::LibXML::LIBXML_DOTTED_VERSION, " ",
    XML::LibXML::LIBXML_VERSION, " ",
    XML::LibXML::LIBXML_RUNTIME_VERSION, "\n";
my $dom = XML::LibXML->load_xml(string=>$XML);
my $xpc = XML::LibXML::XPathContext->new($dom);
$xpc->registerNs(xh => 'http://www.w3.org/1999/xhtml');
for my $div ($xpc->findnodes('//xh:div[@class="data"]')) {
    print "1:", $div->{id}, "\n"
}
for my $div ($xpc->findnodes('//xh:div')) {
    print "2:", $div->{id}, " ", $div->{class}, "\n"
}

use XML::XSH2 'xsh';
print $XML::XSH2::VERSION, "\n";
xsh(<<'_END_XSH_');
open :s {$::XML} ;
register-namespace xh http://www.w3.org/1999/xhtml ;
for //xh:div[@class="data"] { echo "a:" @id ; }
for //xh:div { echo "b:" @id @class ; }
_END_XSH_

__END__

2.0130 2.9.1 20901 20901
1:One
1:Two
1:Three
2:One data
2:Two data
2:Three data
2.1.26
parsing string
done.
a: One
b: One data
b: Two data
b: Three data
Re^7: Parsing HTML/XML with Regular Expressions (updated) by choroba (Cardinal) on Oct 19, 2017 at 10:02 UTC
Yes, that's it. I turned it off because otherwise loading the XML took 2 minutes. To get the same result as XML::LibXML in XML::XSH2, turn it back on with `load_ext_dtd 1 ;`
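
For anyone who wants to see the effect with XML::LibXML alone, here is a minimal sketch. It assumes that passing `load_ext_dtd => 0` to `load_xml` is what makes the entity-valued attributes drop out of the match, as the XML::LibXML::Parser quote above suggests. `matching_ids` is a made-up helper for illustration, not something from the thread. The result of the second call may well depend on your XML::LibXML and libxml2 versions, so no output is shown.

```perl
use warnings;
use strict;
use XML::LibXML;

# Same small test document as in the post above.
my $XML = <<'_END_XML_';
<!DOCTYPE html [ <!ENTITY atad "data"> ] >
<html xmlns="http://www.w3.org/1999/xhtml">
<div class="data" id="One" />
<div class="&atad;" id="Two" />
<div class="&atad;" id="Three" />
</html>
_END_XML_

# Hypothetical helper (not from the thread): parse $XML with the given
# parser options and return the ids of all divs matched by the XPath.
sub matching_ids {
    my (%parser_opts) = @_;
    my $dom = XML::LibXML->load_xml(string => $XML, %parser_opts);
    my $xpc = XML::LibXML::XPathContext->new($dom);
    $xpc->registerNs(xh => 'http://www.w3.org/1999/xhtml');
    return map { $_->getAttribute('id') }
        $xpc->findnodes('//xh:div[@class="data"]');
}

# Options given explicitly, so the comparison does not rely on defaults.
my @on  = matching_ids(load_ext_dtd => 1, expand_entities => 1);
my @off = matching_ids(load_ext_dtd => 0);
print "load_ext_dtd on:  @on\n";
print "load_ext_dtd off: @off\n";
```

If the second line lists fewer ids than the first, you are seeing the behaviour discussed here: the `&atad;` attributes are no longer matched by `@class="data"`.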