PerlMonks  

Re^3: Parsing HTML/XML with Regular Expressions (regex)

by RonW (Parson)
on Oct 19, 2017 at 00:05 UTC [id://1201633]


in reply to Re^2: Parsing HTML/XML with Regular Expressions (XML::Twig)
in thread Parsing HTML/XML with Regular Expressions

It was possible to produce a regex that parses all of Perl, why not one for HTML?

There is a regex to parse XML (and, therefore, XHTML): XML Shallow Parsing

That regex produces a list of strings that need further processing. Shallow parsing is mostly useful for XML-to-XML filtering. Technically, this challenge could be considered filtering, just not to XML. The code will need to keep track of <div> nesting to find the end of the contained text.

    # Not tested and assumes proper nesting of <div> elements (and valid XML syntax)
    # (Warning: Messy hack. Read at your own risk.)
    my $nest = 0;
    my $out = '';
    my @elements = $xml =~ /$XML_SPE/g; # see http://www.cs.sfu.ca/~cameron/REX.html#AppA
    for (@elements) {
        if (/^<div/) {
            $nest++ if ($nest > 0); # only increment if inside an interesting <div>
            next unless (/class\h*=\h*['"]data['"]/); # \h is horizontal white space
            next unless (/id\h*=\h*['"](\w+)['"]/);
            $out .= ", $1=";
            $nest = 1 if ($nest == 0); # if this is the outer most interesting <div>
            next;
        }
        $nest--, next if (/^<\/div/);
        next if (/^[<]/); # skip other mark-up
        $out .= $_ if ($nest > 0);
    }
    $out =~ s/^, //;
    say "$out\n";
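
The nesting bookkeeping can also be sketched on its own. The following is a minimal sketch under stated assumptions, not the REX shallow parser: `$TOKEN` is a simplified stand-in regex invented for illustration (it ignores comments, CDATA sections, and processing instructions), and `extract_data_divs` is a hypothetical helper name. Unlike the hack above, it only decrements the depth counter while inside an interesting <div>, so a stray </div> cannot drive it negative. The class and id checks are still naive regex matches.

```perl
use strict;
use warnings;

# Simplified stand-in for the shallow-parsing regex (assumption, NOT $XML_SPE):
# a token is either a tag (quoted attribute values may contain '>')
# or a run of text up to the next '<'.
my $TOKEN = qr/<(?:[^<>"']|"[^"]*"|'[^']*')*>|[^<]+/;

# Hypothetical helper: returns a list of [id, text] pairs, one for each
# outermost <div class="data" id="..."> element.
sub extract_data_divs {
    my ($xml) = @_;
    my @pairs;
    my $depth = 0;    # 0 = outside any interesting <div>
    for my $tok ($xml =~ /$TOKEN/g) {
        if ($tok =~ /^<div\b/) {
            next if $tok =~ m{/>\z};               # self-closing <div/>: no content
            if ($depth > 0) { $depth++; next; }    # nested <div>: just count it
            next unless $tok =~ /\bclass\s*=\s*['"]data['"]/;
            next unless $tok =~ /\bid\s*=\s*['"](\w+)['"]/;
            push @pairs, [ $1, '' ];
            $depth = 1;
        }
        elsif ($tok =~ m{^</div\b}) {
            $depth-- if $depth > 0;    # ignore </div> of uninteresting elements
        }
        elsif ($tok !~ /^</ && $depth > 0) {
            $pairs[-1][1] .= $tok;     # text inside an interesting <div>
        }
    }
    return @pairs;
}

my $sample = '<div><div class="data" id="One">Mon<div>day</div></div>skip</div>';
print join(', ', map { "$_->[0]=$_->[1]" } extract_data_divs($sample)), "\n";
# One=Monday
```

Keeping the id and text as pairs, rather than appending to one string, makes it easier to post-process each value separately.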

Update: Changed title to indicate (regex)

Replies are listed 'Best First'.
Re^4: Parsing HTML/XML with Regular Expressions (XML::Twig)
by haukex (Archbishop) on Oct 19, 2017 at 16:27 UTC

    Interesting post, thank you! I tested it and except that I had to strip non-word characters out of the values, it mostly works - it doesn't pick up the id of the Sunday Saturday entry, and it also picks up the values "bbbdddeeeggg", but overall it's a very interesting start. Regexes are a fine tool for lexing, and by adding some logic around them keeping track of the nested tags etc., it's basically like building a simple parser.

      I tried it and got no output. Did you fix something in my code?

      I did add a statement to output the list of elements from the shallow parse regex. As far as I can tell, it split out the elements correctly, but it left the embedded newlines in the mark-up elements.

      For example, the following:

      </div >

      became </div\n>

      In the case of the Sunday div:

      <div title=" class='data' id='Foo'>Bar" id="Seven" class="data">&#xA0;Sunda&#121;</div>

      became:

      <div title=" class='data' id='Foo'>Bar"\nid="Seven" class="data"> &#xA0;Sunda&#121; </div>

      So, I added tr/\n/ / for (@elements); to get rid of the embedded newlines. Still no output (other than the dump of the elements list).

      I did encounter an unexpected error: Variable "$XML_SPE" is not imported at extractor.pl line 46. So, I changed:

      my @elements = $xml =~ /$XML_SPE/g;

      to:

      my @elements = $xml =~ /$::XML_SPE/g;
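
      For reference, a minimal sketch of the two usual ways to refer to a package variable under `use strict`. The pattern assigned here is a stand-in assumed for illustration; the real $XML_SPE comes from the REX paper. Either fully qualify the name (`$::XML_SPE` is shorthand for `$main::XML_SPE`), or declare it once with `our` and keep the short form.

```perl
use strict;
use warnings;

# Stand-in pattern (assumption): the real $XML_SPE is defined by the REX code.
$main::XML_SPE = qr/<[^>]*>|[^<]+/;

# Option 1: fully qualify the package variable, as in the change above.
my @a = '<p>hi</p>' =~ /$::XML_SPE/g;

# Option 2: declare the name once with 'our', then use the short form.
our $XML_SPE;
my @b = '<p>hi</p>' =~ /$XML_SPE/g;

print scalar(@a), ' ', scalar(@b), "\n";    # 3 3
```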

      I don't have time to debug my code now. Will try later.

      Current code:

      And the output:

        Here's the code I ran. Other than adding the necessary stuff to get it to compile and read the external file, the only difference from your code is the addition of s/\W+//g;. The output I get is the following. <update> You were right, it does pick up the wrong id for Sunday; it was the id of Saturday that was missing. My mistake. </update>

        Zero=, One=Monday, Two=Tuesday, Three=Wednesday, Four=Thursday, Five=FridaySaturday, Foo=xA0Sunda121bbbdddeeeggg
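
        The `xA0Sunda121` fragment in that output is what s/\W+//g leaves behind when the text contains numeric character references: the `&`, `#`, and `;` are stripped and the digits stay. A minimal sketch of the difference between stripping and decoding follows; `decode_numeric_entities` is a hypothetical helper written for illustration, and it handles numeric references only, not named entities.

```perl
use strict;
use warnings;

# Hypothetical helper: decode numeric character references only
# (named entities such as &amp; are left untouched).
sub decode_numeric_entities {
    my ($s) = @_;
    $s =~ s/&#x([0-9A-Fa-f]+);/chr(hex($1))/ge;    # hexadecimal, e.g. &#xA0;
    $s =~ s/&#(\d+);/chr($1)/ge;                   # decimal, e.g. &#121;
    return $s;
}

my $raw = '&#xA0;Sunda&#121;';

# Stripping non-word characters keeps the digits of each reference.
(my $stripped = $raw) =~ s/\W+//g;
print "$stripped\n";    # xA0Sunda121

# Decoding first turns &#121; into 'y' and &#xA0; into a no-break space.
my $decoded = decode_numeric_entities($raw);
print substr($decoded, 1), "\n";    # Sunday (after the leading no-break space)
```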

      Thanks. Also, you've got me curious. I still haven't tested it, but I'm guessing the interesting title attribute for the Sunday division is part of the problem. You said it didn't pick up the id. I would have thought my code would have picked up id='Foo'. About the bbbdddeeeggg, I'm thinking my code had trouble finding the correct </div>.

      I will try it and look at the list of elements generated by the shallow parsing regex.
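
      The guess about the title attribute can be checked in isolation. A bare `/id\h*=\h*['"](\w+)['"]/` match finds `id='Foo'` inside the quoted title value before it reaches the real `id="Seven"`. One possible way around that, sketched below, is to walk the start tag left to right and consume each quoted value as a whole. `tag_attrs` is a hypothetical helper, and the sketch assumes the tokenizer already delivers the complete start tag as one string.

```perl
use strict;
use warnings;

# Hypothetical helper: walk a start tag left to right and return its
# attributes as a hash, so quoted values are consumed as whole units.
sub tag_attrs {
    my ($tag) = @_;
    my %attr;
    # Each match consumes name="value" or name='value' in one step, so
    # text inside an earlier quoted value is never re-scanned for names.
    while ($tag =~ /([\w:-]+)\s*=\s*(?:"([^"]*)"|'([^']*)')/g) {
        $attr{$1} = defined $2 ? $2 : $3;
    }
    return %attr;
}

my $tag  = q{<div title=" class='data' id='Foo'>Bar" id="Seven" class="data">};
my %attr = tag_attrs($tag);
print "$attr{id} $attr{class}\n";    # Seven data
```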
