Re^2: Using Perl to snip the end off of HTML

I started looking at this, but I find the code quite hard to read. The heavy use of global variables contributed to that, as did the inconsistent indentation.

In general you should try to avoid building up large strings such as $biglist one small chunk at a time. (See "what's faster than .=" for some analysis of this.) It would probably be better to save the results in an array and then create the big string with a single join, or to replace the whole thing with a join on a map, something like:

  my $message = join '', map cleanline($_), <DATA>;
  ...
  {
    my $prevline;
    sub cleanline {
      my $line = shift;
      # skip duplicate lines
      return if defined($prevline) && $line eq $prevline;
      $prevline = $line;
      # crudely HTMLify
      return "$line\n<br>\n";
    }
  }
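For comparison, here is a minimal sketch of the other option mentioned above: collect the pieces in an array and join once at the end. The sample @lines data is made up to stand in for whatever is read from <DATA>, and the chomp is my own assumption so that each line does not end up with a doubled newline.

```perl
use strict;
use warnings;

# Hypothetical sample input standing in for the lines read from <DATA>.
my @lines = ("first\n", "first\n", "second\n", "third\n", "third\n");

my @chunks;      # cleaned pieces, collected instead of appended with .=
my $prevline;    # last line seen, used to skip consecutive duplicates

for my $line (@lines) {
    # skip duplicate lines
    next if defined($prevline) && $line eq $prevline;
    $prevline = $line;

    # crudely HTMLify (chomp first so the newline is not doubled)
    chomp(my $text = $line);
    push @chunks, "$text\n<br>\n";
}

# build the big string in one go
my $message = join '', @chunks;
print $message;
```

Whether this or the map version reads better is mostly a matter of taste; both avoid the repeated .= on one growing string.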

With code like this:

   if ($biglist =~ m%$bq%ig )
      {
       $biglist =~ s%$bq% %ig;         # replace with space, case insensitive, global
      }

.. the initial "test if the pattern matches" gains nothing except to double the work if the pattern does match. You'll get cleaner and faster code if you take out the test:

  $biglist =~ s%$bq% %ig;
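If you do need to know whether anything was replaced, the substitution can tell you itself: s/// returns the number of substitutions made, or a false value if there were none. A small sketch, in which the contents of $biglist and $bq are made-up stand-ins for the real data:

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the original $biglist and $bq.
my $biglist = "one <BLOCKQUOTE>two</blockquote> three";
my $bq      = qr{</?blockquote>};

# s/// returns the number of replacements, so no separate m// test is needed
my $count = ($biglist =~ s%$bq% %ig);
print "replaced $count tag(s)\n" if $count;

# With nothing to replace, the return value is false.
my $plain = "no tags here";
my $none  = ($plain =~ s%$bq% %ig);
print "nothing replaced\n" unless $none;
```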

Note that this substitution doesn't actually remove the quoted text, though, only the <blockquote> tags themselves. I think the quote removal in the example message is intended to leave only:

<pre>
blah blah blah blah blah

Andrew Darby
Web Services Librarian
Ithaca College Library
<a  rel="nofollow" href="http://www.ithaca.edu/library/">http://www.ithaca.edu/library/</a>



Vishwam Annam wrote:
</pre>

.. all the rest being a single trailing blockquote (and a bit of <pre>'d whitespace). If that is indeed the case, you could (if the HTML is sufficiently restricted to allow it) strip it with a single recursive regexp, something like (untested):

  our $re_bq;
  $re_bq = qr{
    # open tag
    < blockquote
      (?: \s+ style \s* = \s* " .*? " )? \s*
    \s* >
    (?:
      # some character that doesn't start a nested blockquote
      (?!<blockquote\b) .
    |
      # or a whole (recursively) nested blockquote
       (??{ $re_bq })
    )*
    < / blockquote \s* >
  }xsi;
  ...
  $message =~ s{$re_bq\s*\z}{};
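To get a feel for how the recursive pattern behaves, here is a small self-contained sketch that runs it against two made-up messages. It differs from the pattern above in two ways that are my own assumptions rather than part of the original suggestion: the style value is matched with [^"]* and the single-character branch also refuses to step over a closing </blockquote>, so the nesting is balanced without depending on backtracking.

```perl
use strict;
use warnings;

our $re_bq;
$re_bq = qr{
  # open tag, with an optional style attribute
  < blockquote
    (?: \s+ style \s* = \s* " [^"]* " )?
  \s* >
  (?:
    # a character that starts neither an opening nor a closing blockquote tag
    (?! < /? blockquote \b ) .
  |
    # or a whole (recursively) nested blockquote
    (??{ $re_bq })
  )*
  < / blockquote \s* >
}xsi;

# Hypothetical message: a reply followed by one trailing, nested quote.
my $message = qq{blah blah\n<blockquote>outer <blockquote style="color: blue">inner</blockquote> tail</blockquote>\n\n};
(my $trimmed = $message) =~ s{$re_bq\s*\z}{};

# A quote that is not at the very end should be left alone.
my $midquote = "<blockquote>quoted</blockquote>\nreply text\n";
(my $kept = $midquote) =~ s{$re_bq\s*\z}{};

print $trimmed;
```

The first message should be cut back to just the reply text, while the second should come through unchanged because its blockquote is followed by more text.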

A regexp like this won't do the right thing in some cases, for example if something that looks like a blockquote tag is actually inside an HTML comment or embedded in an attribute string, which is why an HTML parsing module would be a much better bet.

Hope this gives you some useful ideas,

Hugo
