
I tried it and got no output. Did you fix something in my code?

I added a statement to print the list of elements returned by the shallow-parse regex. As far as I can tell, it splits out the elements correctly, but it leaves the embedded newlines in the markup elements.

For example, the following:

</div
>

became </div\n>

In the case of the Sunday div:

<div title=" class='data' id='Foo'>Bar"
id="Seven" class="data">&#xA0;Sunda&#121;</div>

became:

<div title=" class='data' id='Foo'>Bar"\nid="Seven" class="data">
&#xA0;Sunda&#121;
</div>

So, I added tr/\n/ / for (@elements); to replace the embedded newlines with spaces. Still no output (other than the dump of the elements list).
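A minimal sketch of what that statement does, using a few hypothetical sample elements rather than the real parse result: inside a statement-modifier `for`, `$_` is aliased to each array element, so `tr///` edits the array in place.

```perl
use strict;
use warnings;

# Hypothetical sample elements, shaped like what the shallow parser returns.
my @elements = ("</div\n>", "<div\nclass=\"data\" id=\"Five\">", "Friday");

# $_ is an alias for each element, so the array itself is modified.
tr/\n/ / for @elements;

print "[$_]\n" for @elements;
```

Each embedded newline becomes a single space, so `</div\n>` ends up as `</div >` rather than `</div>`.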

I did encounter an unexpected error: Variable "$XML_SPE" is not imported at extractor.pl line 46. So, I changed:

my @elements = $xml =~ /$XML_SPE/g;

to:

my @elements = $xml =~ /$::XML_SPE/g;
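The message comes from `use strict`: an unqualified `$XML_SPE` has to be declared, and the regex variables here are plain package globals. Writing `$::XML_SPE` is one way around it; declaring the variable with `our` is another. A minimal sketch of the `our` variant, where the regex is a simplified stand-in (an assumption for illustration, not the REX expression):

```perl
use strict;
use warnings;

# "our" declares the package variable so strict code can use the short name.
# Simplified stand-in pattern, NOT the REX shallow-parsing regex.
our $XML_SPE = '[^<]+|<[^>]*>';

my $xml = '<p>Hi</p>';
my @elements = $xml =~ /$XML_SPE/g;

print scalar(@elements), " elements\n";
```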

I don't have time to debug my code right now. I will try later.

Current code:

#!perl
# use strict;
use warnings;

# regex from http://www.cs.sfu.ca/~cameron/REX.html#AppA
# Robert D. Cameron "REX: XML Shallow Parsing with Regular Expressions",
# Technical Report TR 1998-17, School of Computing Science, Simon Fraser 
# University, November, 1998.
# Copyright (c) 1998, Robert D. Cameron. 
# The following code may be freely used and distributed provided that
# this copyright and citation notice remains intact and that modifications
# or additions are clearly identified.

$TextSE = "[^<]+";
$UntilHyphen = "[^-]*-";
$Until2Hyphens = "$UntilHyphen(?:[^-]$UntilHyphen)*-";
$CommentCE = "$Until2Hyphens>?";
$UntilRSBs = "[^\\]]*](?:[^\\]]+])*]+";
$CDATA_CE = "$UntilRSBs(?:[^\\]>]$UntilRSBs)*>";
$S = "[ \\n\\t\\r]+";
$NameStrt = "[A-Za-z_:]|[^\\x00-\\x7F]";
$NameChar = "[A-Za-z0-9_:.-]|[^\\x00-\\x7F]";
$Name = "(?:$NameStrt)(?:$NameChar)*";
$QuoteSE = "\"[^\"]*\"|'[^']*'";
$DT_IdentSE = "$S$Name(?:$S(?:$Name|$QuoteSE))*";
$MarkupDeclCE = "(?:[^\\]\"'><]+|$QuoteSE)*>";
$S1 = "[\\n\\r\\t ]";
$UntilQMs = "[^?]*\\?+";
$PI_Tail = "\\?>|$S1$UntilQMs(?:[^>?]$UntilQMs)*>";
$DT_ItemSE = "<(?:!(?:--$Until2Hyphens>|[^-]$MarkupDeclCE)|\\?$Name(?:$PI_Tail))|%$Name;|$S";
$DocTypeCE = "$DT_IdentSE(?:$S)?(?:\\[(?:$DT_ItemSE)*](?:$S)?)?>?";
$DeclCE = "--(?:$CommentCE)?|\\[CDATA\\[(?:$CDATA_CE)?|DOCTYPE(?:$DocTypeCE)?";
$PI_CE = "$Name(?:$PI_Tail)?";
$EndTagCE = "$Name(?:$S)?>?";
$AttValSE = "\"[^<\"]*\"|'[^<']*'";
$ElemTagCE = "$Name(?:$S$Name(?:$S)?=(?:$S)?(?:$AttValSE))*(?:$S)?/?>?";
$MarkupSPE = "<(?:!(?:$DeclCE)?|\\?(?:$PI_CE)?|/(?:$EndTagCE)?|(?:$ElemTagCE)?)";
$XML_SPE = "$TextSE|$MarkupSPE";

use strict;

my $xml = join '', <DATA>;

my $nest = 0;
my $out = '';
my @elements = $xml =~ /$::XML_SPE/g; # see http://www.cs.sfu.ca/~cameron/REX.html#AppA

tr/\n/ / for (@elements);

print "   $_\n" for (@elements);
print "\n";

for (@elements)
{
    if (/^<div/)
    {
        $nest++ if ($nest > 0); # only increment if inside an interesting <div>
        next unless (/class\h*=\h*['"]data['"]/); # \h is horizontal white space
        next unless (/id\h*=\h*['"](\w+)['"]/);
        $out .= ", $1=";
        $nest = 1 if ($nest == 0); # if this is the outer most interesting <div>
        next;
    }
    $nest--, next if (/^<\/div/);
    next if (/^[<]/); # skip other mark-up
    $out .= $_ if ($nest > 0);
}
$out =~ s/^, //;
print "$out\n";

__DATA__
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"[
<!ATTLIST html
    xmlns:xsi CDATA #FIXED "http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation CDATA #IMPLIED  > ]>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/1999/xhtml
    http://www.w3.org/2002/08/xhtml/xhtml1-strict.xsd">
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
    <title>Hello, World</title>
    <script type="text/javascript">
//<![CDATA[
console.log(' <div class="data" id="Hello">World</div> ');
//]]>
    </script>
</head>
<body>
<div class="data" id="Zero" />
<div class="data" id="One">Monday</div><div class="data" id="Two">Tuesday</div>
<div id="Three" class='data'>Wednes<div id="day">day</div></div>
<div class="data" id='Four'><b>Thursday</b></div>
<div
class="data" id="Five">
Friday
</div>
<div
class
=
"data"
id
=
"Six"
>
<div
>
Satur
</div
>
day
</div
>
<div title=" class='data' id='Foo'>Bar"
id="Seven" class="data">&#xA0;Sunda&#121;</div>
<div class="data otherclass" id="aaa">bbb</div>
<div class="otherclass" id="ccc">ddd</div>
<p class="data">eee</p>
<p id="fff">ggg</p>
<!--
<div class="data" id="Quz">Baz</div>
-->
<p><![CDATA[
<div class="data" id="Bye">Bye</div>
]]></p>
</body>
</html>
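
One detail of the loop above that may be worth checking, separately from the missing output: the self-closing `<div class="data" id="Zero" />` matches `/^<div/` and sets `$nest` to 1, but no `</div` element ever follows to bring it back down. The sketch below is one possible adjustment, not a tested fix for the full sample. The helper name `extract_data_divs`, the self-closing check, and the use of `\s` in place of `\h` are assumptions added for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original script): collect "id=text" pairs
# from shallow-parsed elements, treating <div ... /> as already closed.
sub extract_data_divs {
    my @elements = @_;
    my ($nest, $out) = (0, '');
    for (@elements) {
        if (/^<div\b/) {
            my $self_closing = m{/\s*>\z};   # e.g. <div class="data" id="Zero" />
            $nest++ if $nest > 0 && !$self_closing;
            next unless /\bclass\s*=\s*['"]data['"]/;
            next unless /\bid\s*=\s*['"](\w+)['"]/;
            $out .= ", $1=";
            $nest = 1 if $nest == 0 && !$self_closing;
            next;
        }
        if (m{^</div}) {
            $nest-- if $nest > 0;            # never drop below zero
            next;
        }
        next if /^</;                        # skip other markup
        $out .= $_ if $nest > 0;
    }
    $out =~ s/^, //;
    return $out;
}

my @sample = ('<div class="data" id="Zero" />', ' ',
              '<div class="data" id="One">', 'Monday', '</div>',
              '<p>', 'eee', '</p>');
print extract_data_divs(@sample), "\n";   # prints "Zero=, One=Monday"
```

Like the original loop, this still matches `class=` and `id=` anywhere in the tag text, including inside another attribute's quoted value.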

And the output:

   <?xml version="1.0" encoding="UTF-8"?>

   <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"[<!ATTLIST html      xmlns:xsi CDATA #FIXED "http://www.w3.org/2001/XMLSchema-instance"      xsi:schemaLocation CDATA #IMPLIED  > ]>

   <html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en"   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"   xsi:schemaLocation="http://www.w3.org/1999/xhtml        http://www.w3.org/2002/08/xhtml/xhtml1-strict.xsd">

   <head>

   <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

   <title>
   Hello, World
   </title>

   <script type="text/javascript">
   //
   <![CDATA[console.log(' <div class="data" id="Hello">World</div> ');//]]>

   </script>

   </head>

   <body>

   <div class="data" id="Zero" />

   <div class="data" id="One">
   Monday
   </div>
   <div class="data" id="Two">
   Tuesday
   </div>

   <div id="Three" class='data'>
   Wednes
   <div id="day">
   day
   </div>
   </div>

   <div class="data" id='Four'>
   <b>
   Thursday
   </b>
   </div>

   <div class="data" id="Five">
   Friday
   </div>

   <div class="data"id="Six">

   <div>
   Satur
   </div>
   day
   </div>

   <div title=" class='data' id='Foo'>Bar" id="Seven" class="data">
   &#xA0;Sunda&#121;
   </div>

   <div class="data otherclass" id="aaa">
   bbb
   </div>

   <div class="otherclass" id="ccc">
   ddd
   </div>

   <p class="data">
   eee
   </p>

   <p id="fff">
   ggg
   </p>

   <!--<div class="data" id="Quz">Baz</div>-->

   <p>
   <![CDATA[<div class="data" id="Bye">Bye</div>]]>
   </p>

   </body>

   </html>

In reply to Re^5: Parsing HTML/XML with Regular Expressions (regex) by RonW
in thread Parsing HTML/XML with Regular Expressions by haukex
