I tried it and got no output. Did you fix something in my code?
I did add a statement to output the list of elements from the shallow parse regex. As far as I can tell, it split out the elements correctly, but it left the embedded newlines in the mark-up elements.
For example, the following:
</div >
became </div\n>
In the case of the Sunday div:
<div title=" class='data' id='Foo'>Bar" id="Seven" class="data"> Sunday</div>
became:
<div title=" class='data' id='Foo'>Bar"\nid="Seven" class="data">  Sunday </div>
So, I added tr/\n/ / for (@elements); to get rid of the embedded newlines. Still no output (other than the dump of the elements list).
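A minimal sketch of that cleanup step in isolation. The sample elements are made up for illustration and only shaped like the ones quoted above; they are not the real dump.

```perl
use strict;
use warnings;

# Hypothetical sample elements with embedded newlines, as the shallow
# parse would return them for mark-up that spans several lines.
my @elements = (
    "</div\n>",
    qq{<div title="x"\nid="Seven" class="data">},
    " Sunday",
);

# tr/// works on $_, and the statement modifier "for" aliases $_ to each
# array element in turn, so the array is changed in place.
tr/\n/ / for @elements;

print "[$_]\n" for @elements;
```

Each newline becomes a single space, so the element text stays the same length and attribute boundaries are preserved.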
I did encounter an unexpected error: Variable "$XML_SPE" is not imported at extractor.pl line 46. So, I changed:
my @elements = $xml =~ /$XML_SPE/g;
to:
my @elements = $xml =~ /$::XML_SPE/g;
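For anyone wondering why the qualified name is needed: `use strict` is lexically scoped and only takes effect from the line where it appears, so variables assigned above it are plain package globals in `main`. The sketch below reproduces that layout with a simplified stand-in pattern (an assumption for brevity, not the REX expression) and shows two ways to refer to the same global once strict is on.

```perl
#!perl
use warnings;

# Assigned before "use strict", so this is the package variable $main::XML_SPE.
# Simplified stand-in pattern for illustration only, NOT the REX expression.
$XML_SPE = "[^<]+|<[^>]*>";

use strict;    # strict vars applies from here to the end of the file

# Option 1: declare the package variable for the rest of this scope.
our $XML_SPE;
my @unqualified = "<b>hi</b>" =~ /$XML_SPE/g;

# Option 2: fully qualify it; $::XML_SPE is shorthand for $main::XML_SPE.
my @qualified = "<b>hi</b>" =~ /$::XML_SPE/g;

print scalar(@unqualified), " ", scalar(@qualified), "\n";
```

Both forms read the same variable, so which one to use is a matter of taste. `our` is convenient when the name appears many times after the `use strict` line.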
I don't have time to debug my code right now. I will try later.
Current code:
#!perl

# use strict;
use warnings;

# regex from http://www.cs.sfu.ca/~cameron/REX.html#AppA
# Robert D. Cameron "REX: XML Shallow Parsing with Regular Expressions",
# Technical Report TR 1998-17, School of Computing Science, Simon Fraser
# University, November, 1998.
# Copyright (c) 1998, Robert D. Cameron.
# The following code may be freely used and distributed provided that
# this copyright and citation notice remains intact and that modifications
# or additions are clearly identified.

$TextSE = "[^<]+";
$UntilHyphen = "[^-]*-";
$Until2Hyphens = "$UntilHyphen(?:[^-]$UntilHyphen)*-";
$CommentCE = "$Until2Hyphens>?";
$UntilRSBs = "[^\\]]*](?:[^\\]]+])*]+";
$CDATA_CE = "$UntilRSBs(?:[^\\]>]$UntilRSBs)*>";
$S = "[ \\n\\t\\r]+";
$NameStrt = "[A-Za-z_:]|[^\\x00-\\x7F]";
$NameChar = "[A-Za-z0-9_:.-]|[^\\x00-\\x7F]";
$Name = "(?:$NameStrt)(?:$NameChar)*";
$QuoteSE = "\"[^\"]*\"|'[^']*'";
$DT_IdentSE = "$S$Name(?:$S(?:$Name|$QuoteSE))*";
$MarkupDeclCE = "(?:[^\\]\"'><]+|$QuoteSE)*>";
$S1 = "[\\n\\r\\t ]";
$UntilQMs = "[^?]*\\?+";
$PI_Tail = "\\?>|$S1$UntilQMs(?:[^>?]$UntilQMs)*>";
$DT_ItemSE = "<(?:!(?:--$Until2Hyphens>|[^-]$MarkupDeclCE)|\\?$Name(?:$PI_Tail))|%$Name;|$S";
$DocTypeCE = "$DT_IdentSE(?:$S)?(?:\\[(?:$DT_ItemSE)*](?:$S)?)?>?";
$DeclCE = "--(?:$CommentCE)?|\\[CDATA\\[(?:$CDATA_CE)?|DOCTYPE(?:$DocTypeCE)?";
$PI_CE = "$Name(?:$PI_Tail)?";
$EndTagCE = "$Name(?:$S)?>?";
$AttValSE = "\"[^<\"]*\"|'[^<']*'";
$ElemTagCE = "$Name(?:$S$Name(?:$S)?=(?:$S)?(?:$AttValSE))*(?:$S)?/?>?";
$MarkupSPE = "<(?:!(?:$DeclCE)?|\\?(?:$PI_CE)?|/(?:$EndTagCE)?|(?:$ElemTagCE)?)";
$XML_SPE = "$TextSE|$MarkupSPE";

use strict;

my $xml = join '', <DATA>;
my $nest = 0;
my $out = '';
my @elements = $xml =~ /$::XML_SPE/g; # see http://www.cs.sfu.ca/~cameron/REX.html#AppA
tr/\n/ / for (@elements);
print " $_\n" for (@elements);
print "\n";
for (@elements)
{
    if (/^<div/)
    {
        $nest++ if ($nest > 0); # only increment if inside an interesting <div>
        next unless (/class\h*=\h*['"]data['"]/); # \h is horizontal white space
        next unless (/id\h*=\h*['"](\w+)['"]/);
        $out .= ", $1=";
        $nest = 1 if ($nest == 0); # if this is the outer most interesting <div>
        next;
    }
    $nest--, next if (/^<\/div/);
    next if (/^[<]/); # skip other mark-up
    $out .= $_ if ($nest > 0);
}
$out =~ s/^, //;
print "$out\n";

__DATA__
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC
"-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"[
<!ATTLIST html
xmlns:xsi CDATA #FIXED "http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation CDATA #IMPLIED >
]>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/1999/xhtml
http://www.w3.org/2002/08/xhtml/xhtml1-strict.xsd">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>Hello, World</title>
<script type="text/javascript">
//<![CDATA[
console.log(' <div class="data" id="Hello">World</div> ');
//]]>
</script>
</head>
<body>
<div class="data" id="Zero" />
<div class="data" id="One">Monday</div><div class="data" id="Two">Tuesday</div>
<div id="Three" class='data'>Wednes<div id="day">day</div></div>
<div class="data" id='Four'><b>Thursday</b></div>
<div class="data" id="Five">
Friday
</div>
<div class = "data" id = "Six" > <div > Satur </div
> day </div >
<div title=" class='data' id='Foo'>Bar"
id="Seven" class="data"> Sunday</div>
<div class="data otherclass" id="aaa">bbb</div>
<div class="otherclass" id="ccc">ddd</div>
<p class="data">eee</p>
<p id="fff">ggg</p>
<!-- <div class="data" id="Quz">Baz</div> -->
<p><![CDATA[ <div class="data" id="Bye">Bye</div> ]]></p>
</body>
</html>
And the output:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"[<!ATTLIST html xmlns:xsi CDATA #FIXED "http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation CDATA #IMPLIED > ]>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/1999/xhtml http://www.w3.org/2002/08/xhtml/xhtml1-strict.xsd">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>
Hello, World
</title>
<script type="text/javascript">
//
<![CDATA[console.log(' <div class="data" id="Hello">World</div> '); //]]>
</script>
</head>
<body>
<div class="data" id="Zero" />
<div class="data" id="One">
Monday
</div>
<div class="data" id="Two">
Tuesday
</div>
<div id="Three" class='data'>
Wednes
<div id="day">
day
</div>
</div>
<div class="data" id='Four'>
<b>
Thursday
</b>
</div>
<div class="data" id="Five">
Friday
</div>
<div class="data"id="Six">
<div>
Satur
</div>
day
</div>
<div title=" class='data' id='Foo'>Bar" id="Seven" class="data">
 Sunday
</div>
<div class="data otherclass" id="aaa">
bbb
</div>
<div class="otherclass" id="ccc">
ddd
</div>
<p class="data">
eee
</p>
<p id="fff">
ggg
</p>
<!--<div class="data" id="Quz">Baz</div>-->
<p>
<![CDATA[<div class="data" id="Bye">Bye</div>]]>
</p>
</body>
</html>
In reply to Re^5: Parsing HTML/XML with Regular Expressions (regex)
by RonW
in thread Parsing HTML/XML with Regular Expressions
by haukex