I tried it and got no output. Did you fix something in my code?
I did add a statement to output the list of elements from the shallow parse regex. As far as I can tell, it split out the elements correctly, but it left the embedded newlines in the mark-up elements.
For example, the following:
</div
>
became </div\n>
In the case of the Sunday div:
<div title=" class='data' id='Foo'>Bar"
id="Seven" class="data"> Sunday</div>
became:
<div title=" class='data' id='Foo'>Bar"\nid="Seven" class="data">
 Sunday
</div>
So, I added tr/\n/ / for (@elements); to get rid of the embedded newlines. Still no output (other than the dump of the elements list).
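As a minimal sketch of what that statement does (the two sample strings are just the examples from above, hard-coded rather than produced by the shallow parser): tr/// works on $_, and for aliases $_ to each array element, so the array is changed in place.

```perl
use strict;
use warnings;

# Sample elements copied from the examples above (hard-coded for illustration).
my @elements = (
    "</div\n>",
    qq{<div title=" class='data' id='Foo'>Bar"\nid="Seven" class="data">},
    " Sunday",
);

# "for" aliases $_ to each element, so tr/// edits the array in place,
# turning every embedded newline into a single space.
tr/\n/ / for @elements;

print "[$_]\n" for @elements;
# prints:
# [</div >]
# [<div title=" class='data' id='Foo'>Bar" id="Seven" class="data">]
# [ Sunday]
```

Note that this replaces each newline with a space rather than deleting it, so </div\n> becomes </div >, which the later /^<\/div/ test still matches.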
I did encounter an unexpected error: Variable "$XML_SPE" is not imported at extractor.pl line 46. So, I changed:
my @elements = $xml =~ /$XML_SPE/g;
to:
my @elements = $xml =~ /$::XML_SPE/g;
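A minimal sketch of why the qualified name works, assuming the usual strict-vars rule: once use strict is in effect, a package variable has to be fully qualified ($::XML_SPE is short for $main::XML_SPE) or declared with our. The $MarkupSPE pattern below is a simplified stand-in, not the REX pattern, and the sample string is made up.

```perl
use strict;
use warnings;

# "our" declares package variables, so the short names are legal under strict.
our $TextSE    = "[^<]+";
our $MarkupSPE = "<[^>]*>";              # simplified stand-in, NOT the REX pattern
our $XML_SPE   = "$TextSE|$MarkupSPE";

my $xml = "<p>Hello</p>";                # made-up sample input

# Both spellings name the same package variable, $main::XML_SPE.
my @short     = $xml =~ /$XML_SPE/g;
my @qualified = $xml =~ /$::XML_SPE/g;

print join('|', @short), "\n";           # prints <p>|Hello|</p>
print join('|', @qualified), "\n";       # prints <p>|Hello|</p>
```

Declaring the REX variables with our (or my) would also allow use strict to sit at the top of the script instead of below the pattern definitions.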
I don't have time to debug my code right now. I will try later.
Current code:
#!perl
# use strict;
use warnings;
# regex from http://www.cs.sfu.ca/~cameron/REX.html#AppA
# Robert D. Cameron "REX: XML Shallow Parsing with Regular Expressions",
# Technical Report TR 1998-17, School of Computing Science, Simon Fraser
# University, November, 1998.
# Copyright (c) 1998, Robert D. Cameron.
# The following code may be freely used and distributed provided that
# this copyright and citation notice remains intact and that modifications
# or additions are clearly identified.
$TextSE = "[^<]+";
$UntilHyphen = "[^-]*-";
$Until2Hyphens = "$UntilHyphen(?:[^-]$UntilHyphen)*-";
$CommentCE = "$Until2Hyphens>?";
$UntilRSBs = "[^\\]]*](?:[^\\]]+])*]+";
$CDATA_CE = "$UntilRSBs(?:[^\\]>]$UntilRSBs)*>";
$S = "[ \\n\\t\\r]+";
$NameStrt = "[A-Za-z_:]|[^\\x00-\\x7F]";
$NameChar = "[A-Za-z0-9_:.-]|[^\\x00-\\x7F]";
$Name = "(?:$NameStrt)(?:$NameChar)*";
$QuoteSE = "\"[^\"]*\"|'[^']*'";
$DT_IdentSE = "$S$Name(?:$S(?:$Name|$QuoteSE))*";
$MarkupDeclCE = "(?:[^\\]\"'><]+|$QuoteSE)*>";
$S1 = "[\\n\\r\\t ]";
$UntilQMs = "[^?]*\\?+";
$PI_Tail = "\\?>|$S1$UntilQMs(?:[^>?]$UntilQMs)*>";
$DT_ItemSE = "<(?:!(?:--$Until2Hyphens>|[^-]$MarkupDeclCE)|\\?$Name(?:$PI_Tail))|%$Name;|$S";
$DocTypeCE = "$DT_IdentSE(?:$S)?(?:\\[(?:$DT_ItemSE)*](?:$S)?)?>?";
$DeclCE = "--(?:$CommentCE)?|\\[CDATA\\[(?:$CDATA_CE)?|DOCTYPE(?:$DocTypeCE)?";
$PI_CE = "$Name(?:$PI_Tail)?";
$EndTagCE = "$Name(?:$S)?>?";
$AttValSE = "\"[^<\"]*\"|'[^<']*'";
$ElemTagCE = "$Name(?:$S$Name(?:$S)?=(?:$S)?(?:$AttValSE))*(?:$S)?/?>?";
$MarkupSPE = "<(?:!(?:$DeclCE)?|\\?(?:$PI_CE)?|/(?:$EndTagCE)?|(?:$ElemTagCE)?)";
$XML_SPE = "$TextSE|$MarkupSPE";
use strict;
my $xml = join '', <DATA>;
my $nest = 0;
my $out = '';
my @elements = $xml =~ /$::XML_SPE/g; # see http://www.cs.sfu.ca/~cameron/REX.html#AppA
tr/\n/ / for (@elements);
print " $_\n" for (@elements);
print "\n";
for (@elements)
{
    if (/^<div/)
    {
        $nest++ if ($nest > 0); # only increment if inside an interesting <div>
        next unless (/class\h*=\h*['"]data['"]/); # \h is horizontal white space
        next unless (/id\h*=\h*['"](\w+)['"]/);
        $out .= ", $1=";
        $nest = 1 if ($nest == 0); # if this is the outer most interesting <div>
        next;
    }
    $nest--, next if (/^<\/div/);
    next if (/^[<]/); # skip other mark-up
    $out .= $_ if ($nest > 0);
}
$out =~ s/^, //;
print "$out\n";
__DATA__
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"[
<!ATTLIST html
xmlns:xsi CDATA #FIXED "http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation CDATA #IMPLIED > ]>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/1999/xhtml
http://www.w3.org/2002/08/xhtml/xhtml1-strict.xsd">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>Hello, World</title>
<script type="text/javascript">
//<![CDATA[
console.log(' <div class="data" id="Hello">World</div> ');
//]]>
</script>
</head>
<body>
<div class="data" id="Zero" />
<div class="data" id="One">Monday</div><div class="data" id="Two">Tuesday</div>
<div id="Three" class='data'>Wednes<div id="day">day</div></div>
<div class="data" id='Four'><b>Thursday</b></div>
<div
class="data" id="Five">
Friday
</div>
<div
class
=
"data"
id
=
"Six"
>
<div
>
Satur
</div
>
day
</div
>
<div title=" class='data' id='Foo'>Bar"
id="Seven" class="data"> Sunday</div>
<div class="data otherclass" id="aaa">bbb</div>
<div class="otherclass" id="ccc">ddd</div>
<p class="data">eee</p>
<p id="fff">ggg</p>
<!--
<div class="data" id="Quz">Baz</div>
-->
<p><![CDATA[
<div class="data" id="Bye">Bye</div>
]]></p>
</body>
</html>
And the output:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"[<!ATTLIST html xmlns:xsi CDATA #FIXED "http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation CDATA #IMPLIED > ]>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/1999/xhtml http://www.w3.org/2002/08/xhtml/xhtml1-strict.xsd">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<title>
Hello, World
</title>
<script type="text/javascript">
//
<![CDATA[console.log(' <div class="data" id="Hello">World</div> ');//]]>
</script>
</head>
<body>
<div class="data" id="Zero" />
<div class="data" id="One">
Monday
</div>
<div class="data" id="Two">
Tuesday
</div>
<div id="Three" class='data'>
Wednes
<div id="day">
day
</div>
</div>
<div class="data" id='Four'>
<b>
Thursday
</b>
</div>
<div class="data" id="Five">
Friday
</div>
<div class="data"id="Six">
<div>
Satur
</div>
day
</div>
<div title=" class='data' id='Foo'>Bar" id="Seven" class="data">
 Sunday
</div>
<div class="data otherclass" id="aaa">
bbb
</div>
<div class="otherclass" id="ccc">
ddd
</div>
<p class="data">
eee
</p>
<p id="fff">
ggg
</p>
<!--<div class="data" id="Quz">Baz</div>-->
<p>
<![CDATA[<div class="data" id="Bye">Bye</div>]]>
</p>
</body>
</html>