
Re^5: Parsing HTML/XML with Regular Expressions

by choroba (Cardinal)
on Oct 18, 2017 at 22:13 UTC ( [id://1201625] )


in reply to Re^4: Parsing HTML/XML with Regular Expressions
in thread Parsing HTML/XML with Regular Expressions

I modified your original input:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"[
  <!ENTITY atad 'data'>
  <!ATTLIST html
    xmlns:xsi CDATA #FIXED "http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation CDATA #IMPLIED
  >
]>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/1999/xhtml http://www.w3.org/2002/08/xhtml/xhtml1-strict.xsd">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>Hello, World</title>
<script type="text/javascript">
//<![CDATA[
console.log(' <div class="data" id="Hello">World</div> ');
//]]>
</script>
</head>
<body>
<div class="data" id="Zero" />
<div class="data" id="One">Monday</div><div class="data" id="Two">Tuesday</div>
<div id="Three" class='data'>Wednes<div id="day">day</div></div>
<div class="&atad;" id='Four'><b>Thursday</b></div>
etc. Four was missing from the output of both libraries.

($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,

Re^6: Parsing HTML/XML with Regular Expressions (updated)
by haukex (Archbishop) on Oct 18, 2017 at 22:54 UTC

    Although I'm definitely not an expert in the latter, there does seem to be a difference between XML::LibXML and XML::XSH2.

    <update> I think I found it, looks like it's the option load_ext_dtd=>0, from XML::LibXML::Parser: "Thus switching off external DTD loading, will disable entity expansion, validation, and complete attributes on internal subsets as well." </update>

    #!/usr/bin/env perl
    use warnings;
    use strict;

    our $XML = <<'_END_XML_';
    <!DOCTYPE html [ <!ENTITY atad "data"> ] >
    <html xmlns="http://www.w3.org/1999/xhtml">
    <div class="data" id="One" />
    <div class="&atad;" id="Two" />
    <div class="&atad;" id="Three" />
    </html>
    _END_XML_

    use XML::LibXML ();
    print $XML::LibXML::VERSION, " ", XML::LibXML::LIBXML_DOTTED_VERSION,
        " ", XML::LibXML::LIBXML_VERSION, " ",
        XML::LibXML::LIBXML_RUNTIME_VERSION, "\n";
    my $dom = XML::LibXML->load_xml(string=>$XML);
    my $xpc = XML::LibXML::XPathContext->new($dom);
    $xpc->registerNs(xh => 'http://www.w3.org/1999/xhtml');
    for my $div ($xpc->findnodes('//xh:div[@class="data"]'))
        { print "1:", $div->{id}, "\n" }
    for my $div ($xpc->findnodes('//xh:div'))
        { print "2:", $div->{id}, " ", $div->{class}, "\n" }

    use XML::XSH2 'xsh';
    print $XML::XSH2::VERSION, "\n";
    xsh(<<'_END_XSH_');
    open :s {$::XML} ;
    register-namespace xh http://www.w3.org/1999/xhtml ;
    for //xh:div[@class="data"] { echo "a:" @id ; }
    for //xh:div { echo "b:" @id @class ; }
    _END_XSH_

    __END__
    2.0130 2.9.1 20901 20901
    1:One
    1:Two
    1:Three
    2:One data
    2:Two data
    2:Three data
    2.1.26
    parsing string
    done.
    a: One
    b: One data
    b: Two data
    b: Three data
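To isolate the option named in the update, here is a minimal sketch that parses the same small document twice with XML::LibXML alone, once with load_ext_dtd on and once with it off. The helper name count_data_divs is made up for this illustration, and expand_entities is passed explicitly only so the result does not depend on the installed version's defaults. No output is shown because it may vary with the XML::LibXML and libxml2 versions in use.

```perl
#!/usr/bin/env perl
use warnings;
use strict;
use XML::LibXML ();

# Same shape as the test document above: two of the three class
# attributes are written as the internal entity &atad;.
my $xml = <<'_END_XML_';
<!DOCTYPE html [ <!ENTITY atad "data"> ] >
<html xmlns="http://www.w3.org/1999/xhtml">
<div class="data" id="One" />
<div class="&atad;" id="Two" />
<div class="&atad;" id="Three" />
</html>
_END_XML_

# Hypothetical helper (not part of any library): parse with the given
# load_ext_dtd setting and count the divs matched by the class predicate.
sub count_data_divs {
    my ($load_ext_dtd) = @_;
    my $dom = XML::LibXML->load_xml(
        string          => $xml,
        load_ext_dtd    => $load_ext_dtd,
        expand_entities => 1,   # explicit, so only load_ext_dtd varies
    );
    my $xpc = XML::LibXML::XPathContext->new($dom);
    $xpc->registerNs(xh => 'http://www.w3.org/1999/xhtml');
    my @hits = $xpc->findnodes('//xh:div[@class="data"]');
    return scalar @hits;
}

print "load_ext_dtd=>$_: ", count_data_divs($_), " match(es)\n" for 1, 0;
```

If the quoted documentation applies, the second parse leaves &atad; unexpanded, which would explain why the predicate matched only One on the XML::XSH2 side of the run above. Whether a given libxml2 build reproduces that exactly is worth checking rather than assuming.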
      Yes, that's it. I turned it off because otherwise loading the XML took 2 minutes.

      You can get the same effect in XML::XSH2 with

      load_ext_dtd 1 ;
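As a sketch of where that setting would go, the XSH2 half of the test script above can be rerun with the line placed before the document is opened. Everything here is taken from the script in the parent node except the position of load_ext_dtd, which is an assumption: it seems reasonable that the setting has to be in effect before open parses the string.

```perl
#!/usr/bin/env perl
use warnings;
use strict;
use XML::XSH2 'xsh';

# Same test document as in the parent node.
our $XML = <<'_END_XML_';
<!DOCTYPE html [ <!ENTITY atad "data"> ] >
<html xmlns="http://www.w3.org/1999/xhtml">
<div class="data" id="One" />
<div class="&atad;" id="Two" />
<div class="&atad;" id="Three" />
</html>
_END_XML_

# Switch DTD loading back on, then run the same query as before.
# Placing the setting ahead of "open" is an assumption of this sketch.
xsh(<<'_END_XSH_');
load_ext_dtd 1 ;
open :s {$::XML} ;
register-namespace xh http://www.w3.org/1999/xhtml ;
for //xh:div[@class="data"] { echo "a:" @id ; }
_END_XSH_
```

With entity expansion back on, the a: lines should line up with the 1: lines from XML::LibXML, at the cost of the slower load mentioned above when the document points at a real external DTD.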
