Beefy Boxes and Bandwidth Generously Provided by pair Networks
"be consistent"
 
PerlMonks  

Re^3: Parsing HTML/XML with Regular Expressions

by choroba (Cardinal)
on Oct 18, 2017 at 21:52 UTC ( [id://1201623]=note: print w/replies, xml ) Need Help??


in reply to Re^2: Parsing HTML/XML with Regular Expressions
in thread Parsing HTML/XML with Regular Expressions

XML::XSH2 is just a wrapper around XML::LibXML. I'd be surprised if it didn't work the same. And indeed, the following doesn't print the id of the div that uses the &atad; class:
#!/usr/bin/perl use warnings; use strict; use feature qw{ say }; use XML::LibXML; my $dom = 'XML::LibXML'->load_xml(location => '1.xml', load_ext_dtd => + 0); my $xpc = 'XML::LibXML::XPathContext'->new; $xpc->registerNs(xh => 'http://www.w3.org/1999/xhtml'); for my $div ($xpc->findnodes('//xh:div[@class="data"]', $dom)) { print $div->{id}, "\n" }

Interestingly, at the same time the following shows the classes of all the divs as data:

for my $div ($xpc->findnodes('//xh:div', $dom)) { print join ' ', @{ $div }{qw{ id class }}, "\n" }

Bugreport anyone?

($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,

Replies are listed 'Best First'.
Re^4: Parsing HTML/XML with Regular Expressions
by haukex (Archbishop) on Oct 18, 2017 at 22:05 UTC

    Could it just be a version issue? The code you posted works fine for me:

    use warnings; use strict; use XML::LibXML; my $XML = <<'_END_XML_'; <?xml version="1.0" encoding="UTF-8"?> <!DOCTYPE html [ <!ENTITY atad "data"> ] > <html xmlns="http://www.w3.org/1999/xhtml"> <div class="&atad;" id="Hello" /> <div class="&atad;" id="World" /> </html> _END_XML_ print $XML::LibXML::VERSION, " ", XML::LibXML::LIBXML_DOTTED_VERSION, " ", XML::LibXML::LIBXML_VERSION, " ", XML::LibXML::LIBXML_RUNTIME_VERSION, "\n"; my $dom = XML::LibXML->load_xml(string=>$XML); my $xpc = XML::LibXML::XPathContext->new; $xpc->registerNs(xh => 'http://www.w3.org/1999/xhtml'); for my $div ($xpc->findnodes('//xh:div[@class="data"]', $dom)) { print "1:", $div->{id}, "\n" } for my $div ($xpc->findnodes('//xh:div', $dom)) { print "2:", $div->{id}, " ", $div->{class}, "\n" } __END__ 2.0129 2.9.1 20901 20901 1:Hello 1:World 2:Hello data 2:World data
      I modified your original input:
      <?xml version="1.0" encoding="UTF-8"?> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"[ <!ENTITY atad 'data'> <!ATTLIST html xmlns:xsi CDATA #FIXED "http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation CDATA #IMPLIED > ]> <html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/1999/xhtml http://www.w3.org/2002/08/xhtml/xhtml1-strict.xsd"> <head> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" + /> <title>Hello, World</title> <script type="text/javascript"> //<![CDATA[ console.log(' <div class="data" id="Hello">World</div> '); //]]> </script> </head> <body> <div class="data" id="Zero" /> <div class="data" id="One">Monday</div><div class="data" id="Two">Tues +day</div> <div id="Three" class='data'>Wednes<div id="day">day</div></div> <div class="&atad;" id='Four'><b>Thursday</b></div>
      etc. Four was missing from the output for both the libraries.

      ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,

        Although I'm definitely not an expert in the latter, there does seem to be a difference between XML::LibXML and XML::XSH2.

        <update> I think I found it, looks like it's the option load_ext_dtd=>0, from XML::LibXML::Parser: "Thus switching off external DTD loading, will disable entity expansion, validation, and complete attributes on internal subsets as well." </update>

        #!/usr/bin/env perl use warnings; use strict; our $XML = <<'_END_XML_'; <!DOCTYPE html [ <!ENTITY atad "data"> ] > <html xmlns="http://www.w3.org/1999/xhtml"> <div class="data" id="One" /> <div class="&atad;" id="Two" /> <div class="&atad;" id="Three" /> </html> _END_XML_ use XML::LibXML (); print $XML::LibXML::VERSION, " ", XML::LibXML::LIBXML_DOTTED_VERSION, " ", XML::LibXML::LIBXML_VERSION, " ", XML::LibXML::LIBXML_RUNTIME_VERSION, "\n"; my $dom = XML::LibXML->load_xml(string=>$XML); my $xpc = XML::LibXML::XPathContext->new($dom); $xpc->registerNs(xh => 'http://www.w3.org/1999/xhtml'); for my $div ($xpc->findnodes('//xh:div[@class="data"]')) { print "1:", $div->{id}, "\n" } for my $div ($xpc->findnodes('//xh:div')) { print "2:", $div->{id}, " ", $div->{class}, "\n" } use XML::XSH2 'xsh'; print $XML::XSH2::VERSION, "\n"; xsh(<<'_END_XSH_'); open :s {$::XML} ; register-namespace xh http://www.w3.org/1999/xhtml ; for //xh:div[@class="data"] { echo "a:" @id ; } for //xh:div { echo "b:" @id @class ; } _END_XSH_ __END__ 2.0130 2.9.1 20901 20901 1:One 1:Two 1:Three 2:One data 2:Two data 2:Three data 2.1.26 parsing string done. a: One b: One data b: Two data b: Three data

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://1201623]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others making s'mores by the fire in the courtyard of the Monastery: (2)
As of 2024-04-26 03:10 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found