Re: parsing HTML

If all you want to do is find a specific tag in HTML, whether or not it is well formed, you can try the following. I'm not saying you shouldn't use one of the standard parsers; it's just that if you are looking for a single simple tag, this can do it without the overhead. It should match tags like:

<img src="my\"stuff">
<img src="" width="" height="" >
<img bareword>
<img src="duh>">

Notice that it doesn't account for bare tags like <img> or self-closing XML-style tags such as <br/>, because the pattern requires whitespace after the tag name. That could be worked in. We've been using it for well over a year on HTML from over 1,000,000 different people, on documents parsed live. I've changed it slightly to make it more relevant to the situation. You can do whatever you want to the text in the tag_handler routine. The variable $matches will hold the total number of substitutions that occurred.

  my $txt = "Some long html document";
  my $tag = "img";
  my $matches = ( 
    $txt =~ s%
       (<\Q$tag\E\s+        # begin with tag and space
         (?:
          \w+                   # key/bareword
            (?:
             =(["']?)           # begin with quote or not 
             (?:|.*?[^\\])      # some value that doesn't end with \
             \2         # close quote maybe
            )?          # possible bareword
          (?:\s+|>)     # something trailing (force match)
         )+             # multiple groups
        >?)             # trailing > (handles <$tag word=val >)
       %&tag_handler($1)%gexis );
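
To see the substitution run end to end, here is a minimal sketch. The body of tag_handler and the sample text are assumptions made up for illustration; the pattern is the one above with the comments stripped.

```perl
use strict;
use warnings;

# Hypothetical handler: wraps each matched tag in square brackets so the
# effect of the substitution is easy to see. Replace with your own logic.
sub tag_handler {
    my ($tag_text) = @_;
    return "[$tag_text]";
}

# Made-up sample input containing three img tags.
my $txt = '<p>one <img src="a.png"> two <img src="duh>"> three <img bareword></p>';
my $tag = "img";

my $matches = (
    $txt =~ s%
       (<\Q$tag\E\s+
         (?:
          \w+
            (?:
             =(["']?)
             (?:|.*?[^\\])
             \2
            )?
          (?:\s+|>)
         )+
        >?)
       %tag_handler($1)%gexis );

# s///g returns the number of substitutions made.
print "matches: $matches\n";
print "$txt\n";
```

Whatever tag_handler returns replaces the whole tag, so returning $tag_text unchanged leaves the document as it was while still letting you inspect each tag.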

After this it is up to you to parse the tag itself.
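
As one possible starting point for that step, here is a rough sketch of splitting a matched tag into its attributes. The parse_attrs helper is hypothetical and not part of the code above; it assumes the tag text looks like what the pattern captures (a tag name, then key=value pairs or barewords, then a closing >), and it leaves backslash escapes inside values untouched.

```perl
use strict;
use warnings;

# Hypothetical helper: turn one matched tag into a hash of name => value.
# Barewords (attributes with no "=value" part) get an empty string.
sub parse_attrs {
    my ($tag_text) = @_;
    my %attr;
    (my $body = $tag_text) =~ s/\A<\w+//;    # drop the "<img" part
    $body =~ s/>\z//;                        # drop the closing ">"
    while ( $body =~ m{
              \G \s* (\w+)                   # attribute name or bareword
              (?: =
                (?: "((?:[^"\\]|\\.)*)"      # double-quoted value
                  | '((?:[^'\\]|\\.)*)'      # single-quoted value
                  | ([^\s"']*)               # unquoted value
                )
              )?
            }gcxs ) {
        # Exactly one of $2, $3, $4 is defined when a value was present.
        $attr{ lc $1 } = $2 // $3 // $4 // '';
    }
    return %attr;
}

my %attr = parse_attrs(q{<img src="duh>" width='10' bareword>});
print "$_ => '$attr{$_}'\n" for sort keys %attr;
```

You could call something like this from inside tag_handler and rebuild the tag from the hash. Attribute names are lowercased here so that SRC and src land on the same key; drop the lc if you need to preserve case.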

my @a=qw(random brilliant braindead); print $a[rand(@a)];
