It was possible to produce a regex that parses all of Perl, so why not one for HTML?

There is a regex to parse XML (and therefore XHTML): XML Shallow Parsing

That regex produces a list of strings that needs further processing. Shallow parsing is mostly useful for XML-to-XML filtering. Technically, this challenge could be considered filtering, just not to XML. The code has to keep track of <div> nesting to find the end of the contained text.
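To illustrate what such a list of strings looks like, here is a minimal sketch. The pattern `$SIMPLE_SPE` and the helper `shallow_tokens` are hypothetical names made up for this example. The pattern is a much-simplified stand-in, not the real `$XML_SPE` from the REX paper, and it ignores comments, CDATA sections, processing instructions, and `>` inside attribute values.

```perl
use strict;
use warnings;

# Hypothetical, much-simplified stand-in for $XML_SPE (an assumption for
# illustration only): a token is either one tag (<...>) or a run of text
# up to the next '<'.
my $SIMPLE_SPE = qr/<[^>]*>|[^<]+/;

# Split a string into a flat list of tags and text runs.
sub shallow_tokens {
    my ($xml) = @_;
    return $xml =~ /$SIMPLE_SPE/g;
}

my $xml    = '<div class="data" id="a">one<b>two</b></div>';
my @tokens = shallow_tokens($xml);

# Prints six tokens, one per line: the opening <div ...> tag, "one",
# <b>, "two", </b>, and </div>.
print "[$_]\n" for @tokens;
```

Each token is either mark-up or text, which is why a loop over the list only needs to look at the start of each string to decide what it is.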

# Not tested; assumes properly nested <div> elements and valid XML syntax.
# (Warning: Messy hack. Read at your own risk.)
use feature 'say'; # say needs Perl 5.10+, as does \h
my $nest = 0;
my $out = '';
my @elements = $xml =~ /$XML_SPE/g; # see http://www.cs.sfu.ca/~cameron/REX.html#AppA
for (@elements)
{
    if (/^<div\b/)
    {
        $nest++ if ($nest > 0); # only increment if inside an interesting <div>
        next unless (/class\h*=\h*['"]data['"]/); # \h is horizontal white space
        next unless (/id\h*=\h*['"](\w+)['"]/);
        $out .= ", $1=";
        $nest = 1 if ($nest == 0); # if this is the outermost interesting <div>
        next;
    }
    if (/^<\/div\b/)
    {
        $nest-- if ($nest > 0); # ignore closing tags of <div>s outside the data
        next;
    }
    next if (/^</); # skip other mark-up
    $out .= $_ if ($nest > 0);
}
$out =~ s/^, //;
say $out; # say appends the newline itself

Update: Changed title to indicate (regex)

In reply to Re^3: Parsing HTML/XML with Regular Expressions (regex) by RonW
in thread Parsing HTML/XML with Regular Expressions by haukex
