Re^4: Parsing HTML/XML with Regular Expressions

by haukex (Archbishop)
on Oct 18, 2017 at 22:05 UTC


in reply to Re^3: Parsing HTML/XML with Regular Expressions
in thread Parsing HTML/XML with Regular Expressions

Could it just be a version issue? The code you posted works fine for me:

use warnings;
use strict;
use XML::LibXML;

my $XML = <<'_END_XML_';
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html [
<!ENTITY atad "data">
] >
<html xmlns="http://www.w3.org/1999/xhtml">
<div class="&atad;" id="Hello" />
<div class="&atad;" id="World" />
</html>
_END_XML_

print $XML::LibXML::VERSION, " ",
    XML::LibXML::LIBXML_DOTTED_VERSION, " ",
    XML::LibXML::LIBXML_VERSION, " ",
    XML::LibXML::LIBXML_RUNTIME_VERSION, "\n";
my $dom = XML::LibXML->load_xml(string=>$XML);
my $xpc = XML::LibXML::XPathContext->new;
$xpc->registerNs(xh => 'http://www.w3.org/1999/xhtml');
for my $div ($xpc->findnodes('//xh:div[@class="data"]', $dom))
    { print "1:", $div->{id}, "\n" }
for my $div ($xpc->findnodes('//xh:div', $dom))
    { print "2:", $div->{id}, " ", $div->{class}, "\n" }
__END__
2.0129 2.9.1 20901 20901
1:Hello
1:World
2:Hello data
2:World data

Re^5: Parsing HTML/XML with Regular Expressions
by choroba (Cardinal) on Oct 18, 2017 at 22:13 UTC
    I modified your original input:
    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"[
        <!ENTITY atad 'data'>
        <!ATTLIST html
            xmlns:xsi CDATA #FIXED "http://www.w3.org/2001/XMLSchema-instance"
            xsi:schemaLocation CDATA #IMPLIED
        >
    ]>
    <html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.w3.org/1999/xhtml http://www.w3.org/2002/08/xhtml/xhtml1-strict.xsd">
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
    <title>Hello, World</title>
    <script type="text/javascript">
    //<![CDATA[
    console.log(' <div class="data" id="Hello">World</div> ');
    //]]>
    </script>
    </head>
    <body>
    <div class="data" id="Zero" />
    <div class="data" id="One">Monday</div><div class="data" id="Two">Tuesday</div>
    <div id="Three" class='data'>Wednes<div id="day">day</div></div>
    <div class="&atad;" id='Four'><b>Thursday</b></div>
    etc. Four was missing from the output of both libraries.

    ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,

      Although I'm definitely not an expert in XML::XSH2, there does seem to be a difference between it and XML::LibXML.

      <update> I think I found it: it looks like the option load_ext_dtd=>0 is responsible. From XML::LibXML::Parser: "Thus switching off external DTD loading, will disable entity expansion, validation, and complete attributes on internal subsets as well." </update>

      #!/usr/bin/env perl
      use warnings;
      use strict;

      our $XML = <<'_END_XML_';
      <!DOCTYPE html [
      <!ENTITY atad "data">
      ] >
      <html xmlns="http://www.w3.org/1999/xhtml">
      <div class="data" id="One" />
      <div class="&atad;" id="Two" />
      <div class="&atad;" id="Three" />
      </html>
      _END_XML_

      use XML::LibXML ();
      print $XML::LibXML::VERSION, " ",
          XML::LibXML::LIBXML_DOTTED_VERSION, " ",
          XML::LibXML::LIBXML_VERSION, " ",
          XML::LibXML::LIBXML_RUNTIME_VERSION, "\n";
      my $dom = XML::LibXML->load_xml(string=>$XML);
      my $xpc = XML::LibXML::XPathContext->new($dom);
      $xpc->registerNs(xh => 'http://www.w3.org/1999/xhtml');
      for my $div ($xpc->findnodes('//xh:div[@class="data"]'))
          { print "1:", $div->{id}, "\n" }
      for my $div ($xpc->findnodes('//xh:div'))
          { print "2:", $div->{id}, " ", $div->{class}, "\n" }

      use XML::XSH2 'xsh';
      print $XML::XSH2::VERSION, "\n";
      xsh(<<'_END_XSH_');
      open :s {$::XML} ;
      register-namespace xh http://www.w3.org/1999/xhtml ;
      for //xh:div[@class="data"] { echo "a:" @id ; }
      for //xh:div { echo "b:" @id @class ; }
      _END_XSH_
      __END__
      2.0130 2.9.1 20901 20901
      1:One
      1:Two
      1:Three
      2:One data
      2:Two data
      2:Three data
      2.1.26
      parsing string
      done.
      a: One
      b: One data
      b: Two data
      b: Three data
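
To see the effect of load_ext_dtd from within XML::LibXML alone, something like the following might help. It is only a sketch: matching_ids is a hypothetical helper name (not from the thread), the document is a trimmed copy of the test input above, and the option is passed explicitly in both directions because the default may differ between XML::LibXML versions. No expected output is shown, since it depends on the installed versions.

```perl
use strict;
use warnings;
use XML::LibXML;

# Trimmed copy of the thread's test document: one literal class value and
# one class value supplied through the internal-subset entity &atad;.
my $XML = join "\n",
    '<!DOCTYPE html [ <!ENTITY atad "data"> ]>',
    '<html xmlns="http://www.w3.org/1999/xhtml">',
    '<div class="data" id="One" />',
    '<div class="&atad;" id="Two" />',
    '</html>';

# Hypothetical helper: parse $xml with load_ext_dtd set to the given value
# and return the ids of all XHTML divs that XPath sees as class="data".
sub matching_ids {
    my ($xml, $load_ext_dtd) = @_;
    my $dom = XML::LibXML->load_xml(
        string       => $xml,
        load_ext_dtd => $load_ext_dtd,
    );
    my $xpc = XML::LibXML::XPathContext->new($dom);
    $xpc->registerNs(xh => 'http://www.w3.org/1999/xhtml');
    return map { $_->getAttribute('id') }
        $xpc->findnodes('//xh:div[@class="data"]');
}

# Compare the two settings side by side.
for my $flag (1, 0) {
    print "load_ext_dtd=>$flag: ", join(" ", matching_ids($XML, $flag)), "\n";
}
```

If the documentation quoted in the update applies to your version, the second line should list fewer ids than the first, because the div whose class comes from &atad; is no longer matched.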
        Yes, that's it. I turned it off because otherwise loading the XML took 2 minutes.

        You can get the same effect in XML::XSH2 with

        load_ext_dtd 1 ;
