I tried it and got no output. Did you fix something in my code?

I did add a statement to output the list of elements from the shallow parse regex. As far as I can tell, it split out the elements correctly, but it left the embedded newlines in the mark-up elements.

For example, the following:

</div >

became </div\n>

In the case of the Sunday div:

<div title=" class='data' id='Foo'>Bar" id="Seven" class="data">&#xA0;Sunda&#121;</div>

became:

<div title=" class='data' id='Foo'>Bar"\nid="Seven" class="data"> &#xA0;Sunda&#121; </div>

So, I added tr/\n/ / for (@elements); to get rid of the embedded newlines. Still no output (other than the dump of the elements list).

I did encounter an unexpected error: Variable "$XML_SPE" is not imported at extractor.pl line 46. So, I changed:

my @elements = $xml =~ /$XML_SPE/g;

to:

my @elements = $xml =~ /$::XML_SPE/g;

I don't have time to try to debug my code, now. Will try, later.

Current code:

#!perl # use strict; use warnings; # regex from http://www.cs.sfu.ca/~cameron/REX.html#AppA # Robert D. Cameron "REX: XML Shallow Parsing with Regular Expressions +", # Technical Report TR 1998-17, School of Computing Science, Simon Fras +er # University, November, 1998. # Copyright (c) 1998, Robert D. Cameron. # The following code may be freely used and distributed provided that # this copyright and citation notice remains intact and that modificat +ions # or additions are clearly identified. $TextSE = "[^<]+"; $UntilHyphen = "[^-]*-"; $Until2Hyphens = "$UntilHyphen(?:[^-]$UntilHyphen)*-"; $CommentCE = "$Until2Hyphens>?"; $UntilRSBs = "[^\\]]*](?:[^\\]]+])*]+"; $CDATA_CE = "$UntilRSBs(?:[^\\]>]$UntilRSBs)*>"; $S = "[ \\n\\t\\r]+"; $NameStrt = "[A-Za-z_:]|[^\\x00-\\x7F]"; $NameChar = "[A-Za-z0-9_:.-]|[^\\x00-\\x7F]"; $Name = "(?:$NameStrt)(?:$NameChar)*"; $QuoteSE = "\"[^\"]*\"|'[^']*'"; $DT_IdentSE = "$S$Name(?:$S(?:$Name|$QuoteSE))*"; $MarkupDeclCE = "(?:[^\\]\"'><]+|$QuoteSE)*>"; $S1 = "[\\n\\r\\t ]"; $UntilQMs = "[^?]*\\?+"; $PI_Tail = "\\?>|$S1$UntilQMs(?:[^>?]$UntilQMs)*>"; $DT_ItemSE = "<(?:!(?:--$Until2Hyphens>|[^-]$MarkupDeclCE)|\\?$Name(?: +$PI_Tail))|%$Name;|$S"; $DocTypeCE = "$DT_IdentSE(?:$S)?(?:\\[(?:$DT_ItemSE)*](?:$S)?)?>?"; $DeclCE = "--(?:$CommentCE)?|\\[CDATA\\[(?:$CDATA_CE)?|DOCTYPE(?:$DocT +ypeCE)?"; $PI_CE = "$Name(?:$PI_Tail)?"; $EndTagCE = "$Name(?:$S)?>?"; $AttValSE = "\"[^<\"]*\"|'[^<']*'"; $ElemTagCE = "$Name(?:$S$Name(?:$S)?=(?:$S)?(?:$AttValSE))*(?:$S)?/?>? +"; $MarkupSPE = "<(?:!(?:$DeclCE)?|\\?(?:$PI_CE)?|/(?:$EndTagCE)?|(?:$Ele +mTagCE)?)"; $XML_SPE = "$TextSE|$MarkupSPE"; use strict; my $xml = join '', <DATA>; my $nest = 0; my $out = ''; my @elements = $xml =~ /$::XML_SPE/g; # see http://www.cs.sfu.ca­/~cam +eron/REX.html#A­ppA tr/\n/ / for (@elements); print " $_\n" for (@elements); print "\n"; for (@elements) { if (/^<div/) { $nest++ if ($nest > 0); # only increment if inside an interest +ing <div> next unless (/class\h*=\h*['"]da­ta['"]/); # \h is horizontal +white space next unless (/id\h*=\h*['"](\w+)­['"]/); $out .= ", $1="; $nest = 1 if ($nest == 0); # if this is the outer most interes +ting <div> next; } $nest--, next if (/^<\/div/); next if (/^[<]/); # skip other mark-up $out .= $_ if ($nest > 0); } $out =~ s/^, //; print "$out\n"; __DATA__ <?xml version="1.0" encoding="UTF-8"?> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"[ <!ATTLIST html xmlns:xsi CDATA #FIXED "http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation CDATA #IMPLIED > ]> <html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/1999/xhtml http://www.w3.org/2002/08/xhtml/xhtml1-strict.xsd"> <head> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" + /> <title>Hello, World</title> <script type="text/javascript"> //<![CDATA[ console.log(' <div class="data" id="Hello">World</div> '); //]]> </script> </head> <body> <div class="data" id="Zero" /> <div class="data" id="One">Monday</div><div class="data" id="Two">Tues +day</div> <div id="Three" class='data'>Wednes<div id="day">day</div></div> <div class="data" id='Four'><b>Thursday</b></div> <div class="data" id="Five"> Friday </div> <div class = "data" id = "Six" > <div > Satur </div > day </div > <div title=" class='data' id='Foo'>Bar" id="Seven" class="data">&#xA0;Sunda&#121;</div> <div class="data otherclass" id="aaa">bbb</div> <div class="otherclass" id="ccc">ddd</div> <p class="data">eee</p> <p id="fff">ggg</p> <!-- <div class="data" id="Quz">Baz</div> --> <p><![CDATA[ <div class="data" id="Bye">Bye</div> ]]></p> </body> </html>

And the output:

<?xml version="1.0" encoding="UTF-8"?> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http: +//www.w3.o rg/TR/xhtml1/DTD/xhtml1-strict.dtd"[<!ATTLIST html xmlns:xsi CDAT +A #FIXED " http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation CDA +TA #IMPLIE D > ]> <html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en" + xmlns:xs i="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="ht +tp://www.w 3.org/1999/xhtml http://www.w3.org/2002/08/xhtml/xhtml1-strict. +xsd"> <head> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" +/> <title> Hello, World </title> <script type="text/javascript"> // <![CDATA[console.log(' <div class="data" id="Hello">World</div> '); +//]]> </script> </head> <body> <div class="data" id="Zero" /> <div class="data" id="One"> Monday </div> <div class="data" id="Two"> Tuesday </div> <div id="Three" class='data'> Wednes <div id="day"> day </div> </div> <div class="data" id='Four'> <b> Thursday </b> </div> <div class="data" id="Five"> Friday </div> <div class="data"id="Six"> <div> Satur </div> day </div> <div title=" class='data' id='Foo'>Bar" id="Seven" class="data"> &#xA0;Sunda&#121; </div> <div class="data otherclass" id="aaa"> bbb </div> <div class="otherclass" id="ccc"> ddd </div> <p class="data"> eee </p> <p id="fff"> ggg </p> <!--<div class="data" id="Quz">Baz</div>--> <p> <![CDATA[<div class="data" id="Bye">Bye</div>]]> </p> </body> </html>

In reply to Re^5: Parsing HTML/XML with Regular Expressions (regex) by RonW
in thread Parsing HTML/XML with Regular Expressions by haukex

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.