comment on

Ignoring entity decoding and handling of quoted strings (the attribute values), I've gotten as close to the expected output as I care to pursue.

The extra output of the Sunday division ("bbbdddeeeggg") was a side effect of incorrectly handling the empty division. The interesting divisions after the empty one were handled correctly because my code allows any division to contain an interesting division. The provided input does have that, but once the empty division handling was fixed, the extra output was eliminated and the rest of the output was correct (other than not decoding the entities and the incorrect id of the Sunday division as mentioned above).

The embedded newlines (that the shallow parsing regex leaves as-is) have no general "solution". For mark-up elements, converting them to spaces provides clear enough syntax to reliably parse the attributes (at least for this challenge). The content elements need case by case handling. For this challenge, removing leading and trailing newlines gave the desired results.

The shallow parsing regex is interesting and might be useful for some projects, but most projects will be better served by using one of the better XML modules from CPAN.

Thanks, again, to [id://haukex] for the challenge and contributing to this little regex adventure.

My (probably) final code for this:

use warnings;
use strict;

our $XML_SPE;
{ no strict 'vars';
# copied from http://www.cs.sfu.ca/~cameron/REX.html#AppA
# REX/Perl 1.0 
# Robert D. Cameron "REX: XML Shallow Parsing with Regular Expressions
+",
# Technical Report TR 1998-17, School of Computing Science, Simon Fras
+er 
# University, November, 1998.
# Copyright (c) 1998, Robert D. Cameron. 
# The following code may be freely used and distributed provided that
# this copyright and citation notice remains intact and that modificat
+ions
# or additions are clearly identified.

$TextSE = "[^<]+";
$UntilHyphen = "[^-]*-";
$Until2Hyphens = "$UntilHyphen(?:[^-]$UntilHyphen)*-";
$CommentCE = "$Until2Hyphens>?";
$UntilRSBs = "[^\\]]*](?:[^\\]]+])*]+";
$CDATA_CE = "$UntilRSBs(?:[^\\]>]$UntilRSBs)*>";
$S = "[ \\n\\t\\r]+";
$NameStrt = "[A-Za-z_:]|[^\\x00-\\x7F]";
$NameChar = "[A-Za-z0-9_:.-]|[^\\x00-\\x7F]";
$Name = "(?:$NameStrt)(?:$NameChar)*";
$QuoteSE = "\"[^\"]*\"|'[^']*'";
$DT_IdentSE = "$S$Name(?:$S(?:$Name|$QuoteSE))*";
$MarkupDeclCE = "(?:[^\\]\"'><]+|$QuoteSE)*>";
$S1 = "[\\n\\r\\t ]";
$UntilQMs = "[^?]*\\?+";
$PI_Tail = "\\?>|$S1$UntilQMs(?:[^>?]$UntilQMs)*>";
$DT_ItemSE = "<(?:!(?:--$Until2Hyphens>|[^-]$MarkupDeclCE)|\\?$Name(?:
+$PI_Tail))|%$Name;|$S";
$DocTypeCE = "$DT_IdentSE(?:$S)?(?:\\[(?:$DT_ItemSE)*](?:$S)?)?>?";
$DeclCE = "--(?:$CommentCE)?|\\[CDATA\\[(?:$CDATA_CE)?|DOCTYPE(?:$DocT
+ypeCE)?";
$PI_CE = "$Name(?:$PI_Tail)?";
$EndTagCE = "$Name(?:$S)?>?";
$AttValSE = "\"[^<\"]*\"|'[^<']*'";
$ElemTagCE = "$Name(?:$S$Name(?:$S)?=(?:$S)?(?:$AttValSE))*(?:$S)?/?>?
+";
$MarkupSPE = "<(?:!(?:$DeclCE)?|\\?(?:$PI_CE)?|/(?:$EndTagCE)?|(?:$Ele
+mTagCE)?)";
$XML_SPE = "$TextSE|$MarkupSPE";

}

my $xml = do { open my $fh, '<', $ARGV[0] or die $!; local $/; <$fh> }
+;

# Not tested and assumes proper nesting of <div> elements (and valid X
+ML syntax)
# (Warning: Messy hack. Read at your own risk.)
my $nest = 0;
my $out = '';
my @elements = $xml =~ /$XML_SPE/g; # see http://www.cs.sfu.ca/~camero
+n/REX.html#AppA
for (@elements)
{
    if (/^<div/)
    {
        tr/\n/ /;
        print "$_\n";
        $nest++ if ($nest > 0); # only increment if inside an interest
+ing <div>
        next unless (/class\h*=\h*['"]data['"]/); # \h is horizontal w
+hite space
        next unless (/id\h*=\h*['"](\w+)['"]/);
        $out .= ", $1=";
        $nest = 1 if ($nest == 0); # if this is the outer most interes
+ting <div>
        $nest-- if (/\/>$/); # if empty <div>
        next;
    }
    $nest--, next if (/^<\/div/);
    next if (/^[<]/); # skip other mark-up
    s/^\n//;
    s/\n$//;
    print "=$_\n";
    $out .= $_ if ($nest > 0);
}
$out =~ s/^, //;
print "$out\n";
[download]

In reply to Re^7: Parsing HTML/XML with Regular Expressions (regex) by RonW
in thread Parsing HTML/XML with Regular Expressions by haukex

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.