PerlMonks  

Re^5: Parsing HTML/XML with Regular Expressions (regex)

by RonW (Parson)
on Oct 19, 2017 at 23:50 UTC


in reply to Re^4: Parsing HTML/XML with Regular Expressions (XML::Twig)
in thread Parsing HTML/XML with Regular Expressions

I tried it and got no output. Did you fix something in my code?

I did add a statement to output the list of elements from the shallow parse regex. As far as I can tell, it split out the elements correctly, but it left the embedded newlines in the mark-up elements.

For example, the following:

</div >

became </div\n>

In the case of the Sunday div:

<div title=" class='data' id='Foo'>Bar" id="Seven" class="data">&#xA0;Sunda&#121;</div>

became:

<div title=" class='data' id='Foo'>Bar"\nid="Seven" class="data"> &#xA0;Sunda&#121; </div>

So, I added tr/\n/ / for (@elements); to get rid of the embedded newlines. Still no output (other than the dump of the elements list).

I did encounter an unexpected error: Variable "$XML_SPE" is not imported at extractor.pl line 46. So, I changed:

my @elements = $xml =~ /$XML_SPE/g;

to:

my @elements = $xml =~ /$::XML_SPE/g;
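A minimal sketch of why the `$::` prefix makes that line compile, assuming the REX variables are assigned without `my` before `use strict;` takes effect. They are then package variables in `main`, and under strict they must either be fully qualified (`$::name` is shorthand for `$main::name`) or declared with `our`. The variable `$pattern` and both subs are hypothetical stand-ins, not part of REX.

```perl
#!perl
use warnings;

# Assigned before "use strict", so this is the package variable $main::pattern.
# Hypothetical stand-in for $XML_SPE: text runs or simple tags.
$pattern = "[^<]+|<[^>]*>";

use strict;

# From here on the bare name $pattern no longer compiles.
# Option 1: qualify it. $::pattern is short for $main::pattern.
sub split_qualified {
    my ($text) = @_;
    return $text =~ /$::pattern/g;
}

# Option 2: declare the package variable with "our"; its value is unchanged.
our $pattern;
sub split_declared {
    my ($text) = @_;
    return $text =~ /$pattern/g;
}

print join('|', split_qualified('<a>text</a>')), "\n";   # prints "<a>|text|</a>"
```

Declaring all the REX variables with `our` near the top would probably let `use strict;` stay at the top of the script as well.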

I don't have time to debug my code now. Will try later.

Current code:

#!perl

# use strict;
use warnings;

# regex from http://www.cs.sfu.ca/~cameron/REX.html#AppA
# Robert D. Cameron "REX: XML Shallow Parsing with Regular Expressions",
# Technical Report TR 1998-17, School of Computing Science, Simon Fraser
# University, November, 1998.
# Copyright (c) 1998, Robert D. Cameron.
# The following code may be freely used and distributed provided that
# this copyright and citation notice remains intact and that modifications
# or additions are clearly identified.

$TextSE = "[^<]+";
$UntilHyphen = "[^-]*-";
$Until2Hyphens = "$UntilHyphen(?:[^-]$UntilHyphen)*-";
$CommentCE = "$Until2Hyphens>?";
$UntilRSBs = "[^\\]]*](?:[^\\]]+])*]+";
$CDATA_CE = "$UntilRSBs(?:[^\\]>]$UntilRSBs)*>";
$S = "[ \\n\\t\\r]+";
$NameStrt = "[A-Za-z_:]|[^\\x00-\\x7F]";
$NameChar = "[A-Za-z0-9_:.-]|[^\\x00-\\x7F]";
$Name = "(?:$NameStrt)(?:$NameChar)*";
$QuoteSE = "\"[^\"]*\"|'[^']*'";
$DT_IdentSE = "$S$Name(?:$S(?:$Name|$QuoteSE))*";
$MarkupDeclCE = "(?:[^\\]\"'><]+|$QuoteSE)*>";
$S1 = "[\\n\\r\\t ]";
$UntilQMs = "[^?]*\\?+";
$PI_Tail = "\\?>|$S1$UntilQMs(?:[^>?]$UntilQMs)*>";
$DT_ItemSE = "<(?:!(?:--$Until2Hyphens>|[^-]$MarkupDeclCE)|\\?$Name(?:$PI_Tail))|%$Name;|$S";
$DocTypeCE = "$DT_IdentSE(?:$S)?(?:\\[(?:$DT_ItemSE)*](?:$S)?)?>?";
$DeclCE = "--(?:$CommentCE)?|\\[CDATA\\[(?:$CDATA_CE)?|DOCTYPE(?:$DocTypeCE)?";
$PI_CE = "$Name(?:$PI_Tail)?";
$EndTagCE = "$Name(?:$S)?>?";
$AttValSE = "\"[^<\"]*\"|'[^<']*'";
$ElemTagCE = "$Name(?:$S$Name(?:$S)?=(?:$S)?(?:$AttValSE))*(?:$S)?/?>?";
$MarkupSPE = "<(?:!(?:$DeclCE)?|\\?(?:$PI_CE)?|/(?:$EndTagCE)?|(?:$ElemTagCE)?)";
$XML_SPE = "$TextSE|$MarkupSPE";

use strict;

my $xml = join '', <DATA>;
my $nest = 0;
my $out = '';

my @elements = $xml =~ /$::XML_SPE/g; # see http://www.cs.sfu.ca/~cameron/REX.html#AppA

tr/\n/ / for (@elements);
print " $_\n" for (@elements);
print "\n";

for (@elements) {
    if (/^<div/) {
        $nest++ if ($nest > 0); # only increment if inside an interesting <div>
        next unless (/class\h*=\h*['"]data['"]/); # \h is horizontal white space
        next unless (/id\h*=\h*['"](\w+)['"]/);
        $out .= ", $1=";
        $nest = 1 if ($nest == 0); # if this is the outer most interesting <div>
        next;
    }
    $nest--, next if (/^<\/div/);
    next if (/^[<]/); # skip other mark-up
    $out .= $_ if ($nest > 0);
}
$out =~ s/^, //;
print "$out\n";

__DATA__
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"[
<!ATTLIST html
    xmlns:xsi CDATA #FIXED "http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation CDATA #IMPLIED
>
]>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/1999/xhtml
    http://www.w3.org/2002/08/xhtml/xhtml1-strict.xsd">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>Hello, World</title>
<script type="text/javascript">
//<![CDATA[
console.log(' <div class="data" id="Hello">World</div> ');
//]]>
</script>
</head>
<body>
<div class="data" id="Zero" />
<div class="data" id="One">Monday</div><div class="data" id="Two">Tuesday</div>
<div id="Three" class='data'>Wednes<div id="day">day</div></div>
<div class="data" id='Four'><b>Thursday</b></div>
<div class="data" id="Five">
Friday
</div>
<div class
=
"data"
id
=
"Six"
>
<div
>
Satur
</div
>
day
</div
>
<div title=" class='data' id='Foo'>Bar"
    id="Seven" class="data">&#xA0;Sunda&#121;</div>
<div class="data otherclass" id="aaa">bbb</div>
<div class="otherclass" id="ccc">ddd</div>
<p class="data">eee</p>
<p id="fff">ggg</p>
<!--
<div class="data" id="Quz">Baz</div>
-->
<p><![CDATA[
<div class="data" id="Bye">Bye</div>
]]></p>
</body>
</html>

And the output:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"[<!ATTLIST html xmlns:xsi CDATA #FIXED "http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation CDATA #IMPLIED > ]>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/1999/xhtml http://www.w3.org/2002/08/xhtml/xhtml1-strict.xsd">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>
Hello, World
</title>
<script type="text/javascript">
//
<![CDATA[console.log(' <div class="data" id="Hello">World</div> ');//]]>
</script>
</head>
<body>
<div class="data" id="Zero" />
<div class="data" id="One">
Monday
</div>
<div class="data" id="Two">
Tuesday
</div>
<div id="Three" class='data'>
Wednes
<div id="day">
day
</div>
</div>
<div class="data" id='Four'>
<b>
Thursday
</b>
</div>
<div class="data" id="Five">
Friday
</div>
<div class="data"id="Six">
<div>
Satur
</div>
day
</div>
<div title=" class='data' id='Foo'>Bar" id="Seven" class="data">
&#xA0;Sunda&#121;
</div>
<div class="data otherclass" id="aaa">
bbb
</div>
<div class="otherclass" id="ccc">
ddd
</div>
<p class="data">
eee
</p>
<p id="fff">
ggg
</p>
<!--<div class="data" id="Quz">Baz</div>-->
<p>
<![CDATA[<div class="data" id="Bye">Bye</div>]]>
</p>
</body>
</html>

Re^6: Parsing HTML/XML with Regular Expressions (regex)
by haukex (Archbishop) on Oct 20, 2017 at 09:03 UTC

    Here's the code I ran. Other than adding the necessary stuff to get it to compile and read the external file, the only difference from your code is the addition of s/\W+//g;. The output I get is the following. <update> You were right, it does pick up the wrong id for Sunday; it was the id of Saturday that was missing. My mistake. </update>

    Zero=, One=Monday, Two=Tuesday, Three=Wednesday, Four=Thursday, Five=FridaySaturday, Foo=xA0Sunda121bbbdddeeeggg

      Ignoring entity decoding and handling of quoted strings (the attribute values), I've gotten as close to the expected output as I care to pursue.

      The extra output of the Sunday division ("bbbdddeeeggg") was a side effect of incorrectly handling the empty division. The interesting divisions after the empty one were handled correctly because my code allows any division to contain an interesting division. The provided input does have that, but once the empty division handling was fixed, the extra output was eliminated and the rest of the output was correct (other than not decoding the entities and the incorrect id of the Sunday division as mentioned above).
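One way the empty-division fix could look, as a sketch only: a self-closing `<div ... />` records its id but never changes the nesting depth, because no `</div>` will arrive to undo it. The sub name `extract_data` is hypothetical, the attribute tests are the ones from the loop earlier in the thread, and the guard on the decrement is an extra assumption of mine rather than something from the posted code.

```perl
use strict;
use warnings;

# Takes the element list from the shallow parse, returns "id=text, id=text".
sub extract_data {
    my @elements = @_;
    my ($nest, $out) = (0, '');
    for (@elements) {
        if (/^<div\b/) {
            my $empty = m{/>\z};                  # <div ... /> has no end tag
            my $id;
            ($id) = /id\h*=\h*['"](\w+)['"]/
                if /class\h*=\h*['"]data['"]/;    # only interesting divs get an id
            $out .= ", $id=" if defined $id;
            next if $empty;                       # depth stays unchanged
            $nest++ if $nest > 0 || defined $id;  # enter, or go one level deeper
            next;
        }
        if (m{^</div}) { $nest-- if $nest > 0; next }
        next if /^</;                             # skip other mark-up
        $out .= $_ if $nest > 0;                  # text inside an interesting div
    }
    $out =~ s/^, //;
    return $out;
}

print extract_data(
    '<div class="data" id="Zero" />', "\n",
    '<div class="data" id="One">', 'Monday', '</div>', "\n",
    '<p>', 'eee', '</p>',
), "\n";                                          # prints "Zero=, One=Monday"
```

With the depth left alone for the empty division, the text of the later `<p>` elements is no longer treated as being inside an open division.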

      The embedded newlines (that the shallow parsing regex leaves as-is) have no general "solution". For mark-up elements, converting them to spaces provides clear enough syntax to reliably parse the attributes (at least for this challenge). The content elements need case by case handling. For this challenge, removing leading and trailing newlines gave the desired results.
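A small sketch of that two-way treatment, assuming the elements arrive one at a time from the shallow parse: mark-up gets its newlines turned into spaces, content only loses leading and trailing newlines. The helper name `normalize_element` is hypothetical.

```perl
use strict;
use warnings;

# Normalize a single element from the shallow-parse list.
sub normalize_element {
    my ($el) = @_;
    if ($el =~ /^</) {
        $el =~ tr/\n/ /;      # mark-up: newline -> space keeps attributes apart
    }
    else {
        $el =~ s/\A\n+//;     # content: drop leading newlines
        $el =~ s/\n+\z//;     # content: drop trailing newlines
    }
    return $el;
}

print normalize_element("</div\n>"), "\n";    # prints "</div >"
print normalize_element("\nSatur\n"), "\n";   # prints "Satur"
```

Newlines in the middle of a content element are left alone here, which is where the case-by-case handling would come in.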

      The shallow parsing regex is interesting and might be useful for some projects, but most projects will be better served by using one of the better XML modules from CPAN.

      Thanks, again, to haukex for the challenge and for contributing to this little regex adventure.

      My (probably) final code for this:

      I ran your version of my code and got the same output you did.

      Since I already discovered the embedded newlines in the elements list, I added tr/\n//d; at the top of the for loop:

      for (@elements) { tr/\n//d;

      After doing that, the id for Saturday picked up correctly. Also, out of curiosity, I removed the s/\W+//g; you added. The result was:

      Zero=, One=Monday, Two=Tuesday, Three=Wednesday, Four=Thursday, Five=Friday, Six=Saturday, Foo=&#xA0;Sunda&#121;bbbdddeeeggg

      So, Saturday is cleaned up.

      I know why the id for Sunday is Foo, but still not sure why the "bbbdddeeeggg" is picked up. I will have to step through the code to see what's happening.
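For anyone following along, a sketch of the Foo effect: the id test is an unanchored search over the whole tag, so it finds the first `id=` it meets, and that one sits inside the quoted title value. Walking the name=value pairs left to right consumes each quoted value whole, so it is never searched. Both sub names are hypothetical, and the attribute pattern is a simplification that assumes every value is quoted.

```perl
use strict;
use warnings;

my $tag = q{<div title=" class='data' id='Foo'>Bar" id="Seven" class="data">};

# The test used in the thread: first id=... anywhere in the tag text.
sub naive_id {
    my ($t) = @_;
    my ($id) = $t =~ /id\h*=\h*['"](\w+)['"]/;
    return $id;
}

# Consume attributes in order; /g resumes after each complete quoted value.
sub tag_attributes {
    my ($t) = @_;
    my %attr;
    while ($t =~ /\s([\w:.-]+)\s*=\s*(?:"([^"]*)"|'([^']*)')/g) {
        $attr{$1} = defined $2 ? $2 : $3;
    }
    return %attr;
}

my %attr = tag_attributes($tag);
print "naive=", naive_id($tag), " parsed=$attr{id}\n";   # prints "naive=Foo parsed=Seven"
```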

      As for the &#xA0;, that's encoding dependent. Not sure why it would get excluded other than by explicitly filtering out non-ASCII characters.

      The &#121; is the y in Sunday. Just requires entity decoding.
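A minimal sketch of that decoding step, limited to numeric character references (named entities such as `&amp;` are left untouched). The helper name `decode_numeric_refs` is hypothetical; on CPAN, `HTML::Entities::decode_entities` covers the general case.

```perl
use strict;
use warnings;

# Replace &#NNN; and &#xHHH; with the characters they stand for.
sub decode_numeric_refs {
    my ($text) = @_;
    $text =~ s/&#x([0-9A-Fa-f]+);/chr(hex($1))/ge;   # hexadecimal form
    $text =~ s/&#([0-9]+);/chr($1)/ge;               # decimal form
    return $text;
}

my $decoded = decode_numeric_refs('&#xA0;Sunda&#121;');
binmode(STDOUT, ':encoding(UTF-8)');   # U+00A0 (no-break space) is outside ASCII
print "$decoded\n";
```

The result is a no-break space followed by "Sunday", so the leading character is still there after decoding and would need trimming separately if it is unwanted.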
