I started looking at this, but I find the code quite hard to read. The heavy use of global variables contributed to that, as did the inconsistent indentation.
In general you should try to avoid building up large strings such as $biglist one small chunk at a time. (See "what's faster than .=" for some analysis of this.) It would probably be better to save the results in an array and then create the big string with a single join, or to replace the whole thing with a join on a map, something like:
    my $message = join '', map cleanline($_), <DATA>;
    ...
    {
        my $prevline;
        sub cleanline {
            my $line = shift;
            # skip duplicate lines
            return if defined($prevline) && $line eq $prevline;
            $prevline = $line;
            # crudely HTMLify
            return "$line\n<br>\n";
        }
    }
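For comparison, here is a minimal sketch of the other option mentioned above: collect the pieces in an array and join them once at the end. The name build_message and the sample input are made up for illustration; the duplicate-skipping and the crude HTMLifying follow the same rules as cleanline.

```perl
use strict;
use warnings;

# Hypothetical helper: takes a list of lines and returns one string,
# built with a single join instead of repeated .= appends.
sub build_message {
    my @lines = @_;
    my @chunks;
    my $prevline;
    for my $line (@lines) {
        # skip a line that repeats the one before it
        next if defined($prevline) && $line eq $prevline;
        $prevline = $line;
        # crudely HTMLify
        push @chunks, "$line\n<br>\n";
    }
    return join '', @chunks;
}

print build_message("first", "first", "second");
```

The second "first" is dropped as a duplicate, so only two chunks end up in the joined string.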
With code like this:
    if ($biglist =~ m%$bq%ig ) #
    {
        $biglist =~ s%$bq% %ig; # replace with space, case insensitive, global
    }
.. the initial "test if the pattern matches" gains nothing except to double the work if the pattern does match. You'll get cleaner and faster code if you take out the test:
    $biglist =~ s%$bq% %ig;
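The reason the test is redundant is that s/// simply does nothing when the pattern doesn't match, and its return value already tells you how many replacements were made. A small sketch to illustrate; blank_out and the '<blockquote>' pattern are stand-ins I made up, since I don't know what $bq holds in the original code:

```perl
use strict;
use warnings;

# Hypothetical wrapper, for illustration only: replaces every match of
# $bq with a space and reports how many replacements were made.
sub blank_out {
    my ($text, $bq) = @_;
    # s///g returns the number of substitutions, or false if there were none
    my $count = ($text =~ s%$bq% %ig) || 0;
    return ($text, $count);
}

my ($cleaned, $n) = blank_out("a<BLOCKQUOTE>b<blockquote>c", '<blockquote>');
print "$n replaced: $cleaned\n";

my ($same, $zero) = blank_out("no tags here", '<blockquote>');
print "$zero replaced: $same\n";
```

If you do need to know whether anything changed, test the return value of the substitution itself rather than matching first.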
This doesn't actually remove the quotes though, only the <blockquote> tags themselves. I think the quote removal in the example message is intended to leave only:
    <pre>
    blah blah blah blah blah
    Andrew Darby
    Web Services Librarian
    Ithaca College Library
    <a rel="nofollow" href="http://www.ithaca.edu/library/">http://www.ithaca.edu/library/</a>
    Vishwam Annam wrote:
    </pre>
.. all the rest being a single trailing blockquote (and a bit of <pre>'d whitespace). If that is indeed the case, you could (if the HTML is sufficiently restricted to allow it) strip it with a single recursive regexp, something like (untested):
    our $re_bq;
    $re_bq = qr{
        # open tag
        < blockquote
        (?: \s+ style \s* = \s* " .*? " )?
        \s* \s*
        >
        (?:
            # some character that doesn't start a nested blockquote
            (?!<blockquote\b) .
          |
            # or a whole (recursively) nested blockquote
            (??{ $re_bq })
        )*
        < / blockquote \s* >
    }xsi;
    ...
    $message =~ s{$re_bq\s*\z}{};
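On Perl 5.10 or later the same idea can probably be written without the package variable, using a named group that recurses into itself with (?&name). The following is only a sketch under a couple of assumptions: strip_trailing_blockquote and the sample message are made up for illustration, and the open tag is loosened to accept any attributes rather than just an optional style.

```perl
use strict;
use warnings;

# Hypothetical helper: strips one trailing <blockquote>...</blockquote>
# (which may contain nested blockquotes) from the end of a message.
sub strip_trailing_blockquote {
    my ($message) = @_;
    $message =~ s{
        (?<bq>
            # open tag; any attributes are accepted here (an assumption)
            <blockquote \b [^>]* >
            (?:
                # some character that doesn't start a nested blockquote
                (?! <blockquote \b ) .
              |
                # or a whole nested blockquote, matched recursively
                (?&bq)
            )*
            </blockquote \s* >
        )
        \s* \z
    }{}xsi;
    return $message;
}

my $sample = "<pre>reply text</pre>\n"
           . "<blockquote>outer <blockquote>inner</blockquote> more</blockquote>\n";
print strip_trailing_blockquote($sample);
```

For the sample above this should leave just the <pre> line. A blockquote that is followed by anything other than whitespace is left alone, because of the \s*\z anchor.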
This won't do the right thing in some cases - for example if something that looks like a blockquote tag is actually in an HTML comment, or embedded in an attribute string - which is why an HTML parsing module would be a much better bet.
Hope this gives you some useful ideas,
Hugo