
I started looking at this, but I find the code quite hard to read. The heavy use of global variables contributed to that, as did the inconsistent indentation.

In general you should try to avoid building up large strings such as $biglist one small chunk at a time. (See the node "what's faster than .=" for some analysis of this.) It would probably be better to save the results in an array and then create the big string with a single join, or to replace the whole thing with a join on a map, something like:

  my $message = join '', map cleanline($_), <DATA>;
  ...
  {
    my $prevline;
    sub cleanline {
      my $line = shift;
      # skip duplicate lines
      return if defined($prevline) && $line eq $prevline;
      $prevline = $line;
      # crudely HTMLify
      return "$line\n<br>\n";
    }
  }
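The first alternative mentioned above, collecting the pieces in an array and joining once, might look roughly like this. It is only a sketch: @lines is hypothetical sample input standing in for whatever the real script reads, and @chunks is a name made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical input; the real script would read its lines from a filehandle.
my @lines = ("one\n", "one\n", "two\n", "three\n", "three\n");

my @chunks;      # illustrative name: the cleaned-up pieces, in order
my $prevline;
for my $line (@lines) {
    # skip consecutive duplicate lines
    next if defined($prevline) && $line eq $prevline;
    $prevline = $line;
    # crudely HTMLify, as in cleanline() above
    push @chunks, "$line\n<br>\n";
}

# Build the big string once, instead of appending with .= inside the loop.
my $message = join '', @chunks;
print $message;
```

Whether this is measurably faster than .= will depend on the data and the perl version, so it is worth benchmarking before relying on it.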

With code like this:

   if ($biglist =~ m%$bq%ig )        # 
      {
       $biglist =~ s%$bq% %ig;         # replace with space, case insensitive, global
      }

.. the initial "test if the pattern matches" gains nothing except to double the work if the pattern does match. You'll get cleaner and faster code if you take out the test:

  $biglist =~ s%$bq% %ig;
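If you do need to know whether anything was replaced, the substitution itself can tell you, since s/// returns the number of substitutions it made (and a false value if there were none). A small sketch, with made-up sample data and a plain-string pattern standing in for the real $bq:

```perl
use strict;
use warnings;

# Hypothetical sample data and pattern, for illustration only.
my $biglist = "keep <BLOCKQUOTE>this<blockquote> text";
my $bq      = '<blockquote>';

# s/// returns the number of substitutions made, so no separate
# m// test is needed beforehand.
my $count = ($biglist =~ s%$bq% %ig);
print "replaced $count tag(s)\n" if $count;
```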

This doesn't actually remove the quotes though, only the <blockquote> tags themselves. I think the quote removal in the example message is intended to leave only:

<pre>
blah blah blah blah blah

Andrew Darby
Web Services Librarian
Ithaca College Library
<a  rel="nofollow" href="http://www.ithaca.edu/library/">http://www.ithaca.edu/library/</a>



Vishwam Annam wrote:
</pre>

.. all the rest being a single trailing blockquote (and a bit of <pre>'d whitespace). If that is indeed the case, you could (if the HTML is sufficiently restricted to allow it) strip it with a single recursive regexp, something like (untested):

  our $re_bq;
  $re_bq = qr{
    # open tag
    < blockquote
      (?: \s+ style \s* = \s* " .*? " )? \s*
    \s* >
    (?:
      # some character that doesn't start a nested blockquote
      (?!<blockquote\b) .
    |
      # or a whole (recursively) nested blockquote
       (??{ $re_bq })
    )*
    < / blockquote \s* >
  }xsi;
  ...
  $message =~ s{$re_bq\s*\z}{};
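On perl 5.10 or later, the same idea can probably be written with a named recursive group instead of (??{ }), which avoids the package variable. This is a variation rather than the pattern above: it assumes any attributes are acceptable on the opening tag, and it also refuses to let the "some character" branch step over a closing tag, so the match cannot run past the end of the blockquote it started in. The sample $message is invented for illustration.

```perl
use strict;
use warnings;
use 5.010;    # named recursion with (?&name) needs perl 5.10+

my $re_bq = qr{
  (?<bq>
    # open tag, with any attributes
    <blockquote \b [^>]* >
    (?:
      # a character that starts neither an opening nor a closing blockquote tag
      (?! </? blockquote \b ) .
    |
      # or a whole (recursively) nested blockquote
      (?&bq)
    )*
    </blockquote \s* >
  )
}xsi;

# Hypothetical message ending in a nested blockquote.
my $message = "text\n<blockquote>outer <blockquote>inner</blockquote> more</blockquote>\n";
$message =~ s{$re_bq\s*\z}{};

# A blockquote that is not at the end of the string is left alone.
my $other = "a<blockquote>x</blockquote>b";
$other =~ s{$re_bq\s*\z}{};
```

The same caveats about comments and attribute strings apply to this version.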

This won't do the right thing in some cases - for example if something that looks like a blockquote tag is actually in an HTML comment, or embedded in an attribute string - which is why an HTML parsing module would be a much better bet.

Hope this gives you some useful ideas,

Hugo

In reply to Re^2: Using Perl to snip the end off of HTML by hv
in thread Using Perl to snip the end off of HTML by eastcoastcoder
