Re^3: Parsing HTML/XML with Regular Expressions (regex)

It was possible to produce a regex that parses all of Perl, why not one for HTML?

There is a regex for parsing XML (and therefore XHTML): see "XML Shallow Parsing".

That regex produces a list of strings that need further processing. Shallow parsing is mostly useful for XML-to-XML filtering. Technically, this challenge could be considered filtering, just not to XML. The code will need to keep track of <div> nesting to find the end of the contained text.

# Not tested and assumes proper nesting of <div> elements (and valid XML syntax)
# (Warning: Messy hack. Read at your own risk.)
my $nest = 0;
my $out = '';
my @elements = $xml =~ /$XML_SPE/g; # see http://www.cs.sfu.ca/~cameron/REX.html#AppA
for (@elements)
{
    if (/^<div/)
    {
        $nest++ if ($nest > 0); # only increment if inside an interesting <div>
        next unless (/class\h*=\h*['"]data['"]/); # \h is horizontal white space
        next unless (/id\h*=\h*['"](\w+)['"]/);
        $out .= ", $1=";
        $nest = 1 if ($nest == 0); # if this is the outermost interesting <div>
        next;
    }
    $nest--, next if (/^<\/div/);
    next if (/^[<]/); # skip other mark-up
    $out .= $_ if ($nest > 0);
}
$out =~ s/^, //;
say "$out\n";
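
To see what kind of list the shallow parse hands to that loop, here is a minimal sketch. It assumes a much-simplified stand-in pattern (`$TOKEN_RE` is a hypothetical name, not the REX `$XML_SPE`) that only knows about tags and text; unlike the real pattern, it ignores comments, CDATA sections and processing instructions. The sample input is made up.

```perl
use strict;
use warnings;

# Simplified stand-in for $XML_SPE (NOT the REX pattern): a token is either
# one tag, whose quoted attribute values may contain '>', or a run of text.
my $TOKEN_RE = qr/<(?:[^<>"']|"[^"]*"|'[^']*')*>|[^<]+/;

# Made-up sample input.
my $xml = q{<div id="One" class="data">Monday</div><div title="a>b">x</div>};

# With no capture groups, m//g in list context returns every match.
my @tokens = $xml =~ /$TOKEN_RE/g;

# Prints six tokens, one per line: three per <div> (open tag, text, close tag).
print "[$_]\n" for @tokens;
```

Each tag arrives as a single string, which is why the loop can test it with `/^<div/` and then look inside it for attributes.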

Update: Changed title to indicate (regex)


Replies are listed 'Best First'.
Re^4: Parsing HTML/XML with Regular Expressions (XML::Twig) by haukex (Archbishop) on Oct 19, 2017 at 16:27 UTC
Interesting post, thank you! I tested it, and apart from having to strip non-word characters out of the values, it mostly works: it doesn't pick up the `id` of the ~~`Sunday`~~ `Saturday` entry, and it also picks up the values "`bbbdddeeeggg`", but overall it's a very interesting start. Regexes are a fine tool for lexing, and by adding some logic around them to keep track of nested tags and so on, you are basically building a simple parser.
Re^5: Parsing HTML/XML with Regular Expressions (regex) by RonW (Parson) on Oct 19, 2017 at 23:50 UTC
I tried it and got no output. Did you fix something in my code?

I added a statement to output the list of elements from the shallow parse regex. As far as I can tell, it split out the elements correctly, but it left the embedded newlines in the mark-up elements. For example, the following:

`</div >`

became `</div\n>`. In the case of the Sunday div:

`<div title=" class='data' id='Foo'>Bar" id="Seven" class="data"> Sunday</div>`

became:

`<div title=" class='data' id='Foo'>Bar"\nid="Seven" class="data">  Sunday </div>`

So I added `tr/\n/ / for (@elements);` to get rid of the embedded newlines. Still no output (other than the dump of the elements list).

I did encounter an unexpected error:

`Variable "$XML_SPE" is not imported at extractor.pl line 46.`

So I changed:

`my @elements = $xml =~ /$XML_SPE/g;`

to:

`my @elements = $xml =~ /$::XML_SPE/g;`

I don't have time to debug my code now. Will try later.
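
For anyone following along, the two changes described above can be sketched in isolation. This assumes the pattern lives in a package variable in `main`; `TOKEN_RE` is a hypothetical, simplified stand-in for `$XML_SPE`, and the input string is made up. Under `use strict`, the fully qualified `$::TOKEN_RE` (shorthand for `$main::TOKEN_RE`) is accepted without a `my` or `our` declaration in scope.

```perl
use strict;
use warnings;

# Assumption: the pattern is a package variable in main. This is a simplified
# stand-in named TOKEN_RE, not the REX $XML_SPE pattern.
$main::TOKEN_RE = qr/<[^>]*>|[^<]+/;

# Made-up input with newlines inside the tags.
my $xml = "<div\nid='Seven'> Sunday</div\n>";

# $::TOKEN_RE is shorthand for $main::TOKEN_RE.
my @elements = $xml =~ /$::TOKEN_RE/g;

# Replace newlines embedded in each token with spaces, in place
# ($_ is aliased to each array element).
tr/\n/ / for @elements;

print "[$_]\n" for @elements;
```

Declaring the variable with `our $TOKEN_RE;` in the file that uses it would be another way to satisfy `strict`.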
Re^6: Parsing HTML/XML with Regular Expressions (regex) by haukex (Archbishop) on Oct 20, 2017 at 09:03 UTC
Other than adding the necessary stuff to get it to compile and read the external file, the only difference between the code I ran and yours is the addition of `s/\W+//g;`. The output I get is the following:

`Zero=, One=Monday, Two=Tuesday, Three=Wednesday, Four=Thursday, Five=FridaySaturday, Foo=xA0Sunda121bbbdddeeeggg`

`<update>` You were right, it does pick up the wrong `id` for `Sunday`; it was the `id` of `Saturday` that was missing, my mistake. `</update>`
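
The `xA0Sunda121` fragment looks like what `s/\W+//g` leaves behind when the text contains numeric character references. A minimal sketch, assuming the Sunday text was something like `&#xA0;Sunda&#121;` (a guess from the output, since the input file isn't shown here), with an alternative that decodes the references before stripping:

```perl
use strict;
use warnings;

# Assumed input text; the real input file is not shown in the thread.
my $raw = "\n  &#xA0;Sunda&#121; ";

# What s/\W+//g does: every non-word character is deleted, leaving the
# letters and digits of the character references behind.
(my $stripped = $raw) =~ s/\W+//g;

# Alternative: decode numeric character references first, then strip.
(my $decoded = $raw) =~ s/&#x([0-9A-Fa-f]+);/chr(hex($1))/ge;
$decoded =~ s/&#([0-9]+);/chr($1)/ge;
$decoded =~ s/\W+//g;

print "$stripped\n";   # xA0Sunda121
print "$decoded\n";    # Sunday
```

The decoding step here only handles numeric references; named entities would need a lookup table or a module.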
Re^7: Parsing HTML/XML with Regular Expressions (regex) by RonW (Parson) on Oct 23, 2017 at 22:33 UTC
Re^7: Parsing HTML/XML with Regular Expressions (regex) by RonW (Parson) on Oct 20, 2017 at 21:29 UTC
Re^5: Parsing HTML/XML with Regular Expressions (regex) by RonW (Parson) on Oct 19, 2017 at 22:13 UTC
Thanks. You've also got me curious. I still haven't tested it, but I'm guessing the interesting title attribute on the Sunday div is part of the problem. You said it didn't pick up the id; I would have thought my code would pick up `id='Foo'`. As for the `bbbdddeeeggg`, I'm thinking my code had trouble finding the correct `</div>`. I will try it and look at the list of elements generated by the shallow parsing regex.
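
The guess about the title attribute can be checked on the Sunday tag alone. The sketch below runs the posted `id` pattern against that tag, then shows one possible alternative (an illustration only, not code from the thread): walking the attributes left to right so that each quoted value is consumed whole. `%attr` and `$naive_id` are hypothetical names.

```perl
use strict;
use warnings;

# The Sunday tag as quoted in the thread.
my $tag = q{<div title=" class='data' id='Foo'>Bar" id="Seven" class="data">};

# The posted pattern takes the first id=... anywhere in the tag,
# including one inside the quoted title value.
my ($naive_id) = $tag =~ /id\h*=\h*['"](\w+)['"]/;

# Walk the attributes left to right so each quoted value is consumed whole
# and its contents are never mistaken for attributes.
my %attr;
while ($tag =~ /([\w:-]+)\s*=\s*(?:"([^"]*)"|'([^']*)')/g) {
    $attr{$1} = defined $2 ? $2 : $3;
}

print "naive: $naive_id, walked: $attr{id}\n";   # naive: Foo, walked: Seven
```

The `class` test in the posted code has the same weakness: on this tag it can match the `class='data'` inside the title value.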

