Re: Using Perl to snip the end off of HTML

After additional offline correspondence about eastcoastcoder's specs (none of which turned out to be crucial to the issue), the following is more or less responsive (i.e., it does most of what I think ecc wants (:<})).

REPEATED CAVEAT: There's gotta be a better way; pursue the suggestions for modules to help handle this; perhaps find a better way to acquire source data, and recognize the code below as written for ease of understanding, not elegance!
Other caveats, provided to ecc earlier, are below. Comments re those and "teaching comments" re the undoubtedly "ugly" coding are both welcome.

#!c:/perl/bin/perl -w
use diagnostics;
use strict;

use vars qw ( $prevmail $biglist $mail @mail $mailcleaned $interim $result);
$prevmail = "";
$biglist = "";
$mailcleaned = '';
$result = "cleaned.htm";
$interim = "biglist.txt";
chomp (@mail = <DATA>);

foreach $mail(@mail) {
   &onelist($mail);
}
open (INTERIM, ">$interim") or die "Cannot write $interim: $!";
print INTERIM "$biglist";
close INTERIM;
print STDERR "\n\t  See $interim\n\n";

&clean($biglist);

open (RESULT, ">$result") or die "Cannot write $result: $!";
print RESULT "<html><head><title>Cleaned</title></head><body>\n$mailcleaned\n</body></html>\n";
close RESULT;
print STDERR "\n\tSee $result\n\n";
exit;

### END MAIN
### SUBS ##

# sub onelist - delete blank and dupe lines; remainder into $biglist ##
sub onelist {
   if ( $mail eq $prevmail ) {
      return;                          # return NOTHING; get next line
   } elsif ( $mail eq "" && $prevmail eq "\n<br>" ) {   # logical &&, not bitwise &
      return;                          # again return NOTHING
   } else {
      $prevmail = $mail;               # save current to check next for dupe
      $biglist = $biglist . $mail . "\n<br>\n";   # biglist needs spaces to stay readable;
                                                  # effectively, replace \n with " "
   }
   return($biglist);
}

# sub clean -  get rid of blockquote, pre tags, redundant credits, etc.;  save editorial content ##
sub clean {

use vars qw ( $bq $pq $rel $credit $garbage $garbage2 );

$bq = qr!<blockquote style="border-left: #5555EE solid 0.2em; margin: 0em; padding-left: 0.85em">|</blockquote>!i;
$pq = qr!<pre style="margin: 0em;">|<pre>|</pre>!i;
$rel = qr!<a  rel="nofollow"!;
$credit = qr!_{47}\n<br>\nWeb4lib mailing list!;     # 47 underscores, then the list name
$garbage = qr!<br><br>--------------<br>!;
$garbage2 = qr!-{72}!;                               # the 72-hyphen rule in the sample data

# remove <blockquote style="...0.85em">
   if ($biglist =~ m%$bq%ig )        # 
      {
       $biglist =~ s%$bq% %ig;         # replace with space, case insensitive, global
      }

# replace $pq (styled <pre...>) with a \n<br>. 
   if ( $biglist =~ m%$pq%ig )
      {
       $biglist =~ s%$pq%\n<br>%ig;
      }

# remove redundant credits 
   if ( $biglist =~ m%$credit.*(?=$credit)$credit%igs )    
      {
       # print STDERR "\t Testing credits\n";  (UNcomment this, insert similar elsewhere
       # for simple execution tracing; add similar to see values...)
       $biglist =~ s%($credit).*(?=$credit)$credit%$1%igs;
      }

# replace '<a  rel="nofollow"' with '<a'  
   if ( $biglist =~ m%$rel% )    
      {
       $biglist =~ s%$rel%<a%ig;
      }

# replace 47x underlines - separators for redundant credits - with shorter line of hyphens
   if ( $biglist =~ m%_{47}% )
      {
       $biglist =~ s%_{47}%\n<br>--------------%g
      }

# replace multiple newlines separated only by spaces with '<br>'
   if ( $biglist =~ m%\n\s*\n% )
      {
       $biglist =~ s%\n\s*\n%<br>%ig;
      # print STDERR "\t After mult newlines, \$biglist is $biglist\n";
      }


# fix breaks around original message hyphens
   if ( $biglist =~ m%(-----.*)(?=-----)% )
      {
       $biglist =~ s%(-----.*)(?=-----)(-----)%<br>\n<br>\n$1$2<br>\n%ig;
      }

# delete garbage
   if ($biglist =~ m%$garbage% )
      {
       $biglist =~ s%$garbage%%;
      }

# delete garbage2, the long rule of hyphens (72 in the sample data, not 47)
   if ($biglist =~ m%$garbage2% )
      {
       $biglist =~ s%$garbage2%%;
      }

# multiple <br> split by newlines    THIS ONE STILL HAS PROBLEMS
   if ( $biglist =~ m%(?:<br>)\n{0,4}%isg )
      {
      print STDERR "\n\t Testing multiple <br> sep by newlines\n";
      $biglist =~ s%(?:<br>)\n{0,4}(?:<br>)%<br>%isg;
      }

# multiple <br> on same line
   if ( $biglist =~ m%(?:<br>)+%ig )
      {
      # print STDERR "\n\t Testing multiple <br> ON SAME LINE\n";
      $biglist =~ s%(?:<br>)+%<br>%ig;
      }

   $mailcleaned = $biglist;
   return($mailcleaned);
}
# ENDSUB
__DATA__
<pre>
blah blah blah blah blah

Andrew Darby
Web Services Librarian
Ithaca College Library
<a  rel="nofollow" href="http://www.ithaca.edu/library/">http://www.ithaca.edu/library/</a>



Vishwam Annam wrote:
</pre><blockquote style="border-left: #5555EE solid 0.2em; margin: 0em; padding-left: 0.85em"><pre style="margin: 0em;">I use &quot;Web Accessibility Toolbar&quot; as link checker, accessibility
checker, HTML and CSS validator. This works for IE, and there is a 
similar tool available for firefox, web developer extn has some 
features of this. 
<a  rel="nofollow" href="http://www.nils.org.au/ais/web/resources/toolbar/">http://www.nils.org.au/ais/web/resources/toolbar/</a>

Color Contrast Analyzer, 
<a  rel="nofollow" href="http://www.nils.org.au/ais/web/resources/contrast_analyser/index.html">http://www.nils.org.au/ais/web/resources/contrast_analyser/index.html</a>

You can create javascript menus with Navstudio, which is also a free 
software.

I am interested in knowing, what you findout from all. Goodluck with 
your meeting!

Vishwam

----- Original Message -----
From: Isabel Danforth &lt;isabel@shelltown.com&gt;
Date: Friday, May 20, 2005 11:03 am
Subject: Re: [Web4lib] Favorite Free Web Tools


</pre><blockquote style="border-left: #5555EE solid 0.2em; margin: 0em; padding-left: 0.85em"><pre style="margin: 0em;">I am in a very small academic library.  I use MarcEdit to grab MARC
records to use in my system.  It is free.  I would really love to 
see what
other tools you find.

Isabel

On Fri, 20 May 2005, Susan Boland wrote:


</pre><blockquote style="border-left: #5555EE solid 0.2em; margin: 0em; padding-left: 0.85em"><pre style="margin: 0em;">I am working on a program for the American Association of Law
</pre></blockquote><pre style="margin: 0em;">

Libraries'&gt; annual meeting that focuses on free or very cheap Web 
development tools.

</pre><blockquote style="border-left: #5555EE solid 0.2em; margin: 0em; padding-left: 0.85em"><pre style="margin: 0em;">While I have my favorites, I'm hoping the list can help me out by
suggesting free or cheap tools that I might not be aware of.  If you
want to reply to me privately, I can summarize for the list.

Thanks!



Susan M. Boland
Chair, Computing Services Special Interest Section
Research &amp; Instructional Services Librarian
Northern Illinois University College of Law Library
DeKalb, Il 60115
815-753-9492
Fax:  815-753-9499
e-mail:  sboland@niu.edu
<a  rel="nofollow" href="http://law.niu.edu">http://law.niu.edu</a>
_______________________________________________
Web4lib mailing list
Web4lib@webjunction.org
<a  rel="nofollow" href="http://lists.webjunction.org/web4lib/">http://lists.webjunction.org/web4lib/</a>

</pre></blockquote><pre style="margin: 0em;">

_______________________________________________
Web4lib mailing list
Web4lib@webjunction.org
<a  rel="nofollow" href="http://lists.webjunction.org/web4lib/">http://lists.webjunction.org/web4lib/</a>



------------------------------------------------------------------------

_______________________________________________
Web4lib mailing list
Web4lib@webjunction.org
<a  rel="nofollow" href="http://lists.webjunction.org/web4lib/">http://lists.webjunction.org/web4lib/</a>
</pre></blockquote></blockquote>
<pre>

</pre>

Some observations:

eastcoastcoder wrote (some formatting added): "if we have <bq1>fred said this</bq1>I agree<bq2>he said<bq3>she said so</bq3>and so do I</bq2> - we can delete from <bq2> on and have proper html"

1. We DON'T have EXACTLY that (note especially the variant quoting of the Boland message), and (more below) valid HTML is FAR from the whole of your issue or 'problem.'

Simplifying the source you provided (and omitting the links to each writer's institution):

001: <pre>
002: blah blah blah blah blah ...
003: 
004: Andrew Darby
005: title...
006: 
007: Vishwam Annam wrote:
008: </pre> <!-- ENDS THE initial, UN_styled <pre> -->
009: <bq1> <!--bqs are all identical -->
010: <pq1> <!-- all '<pre style="...">' are identical -->
011:      <!-- and the style is effectively a no_op -->
012: I use &quot;Web Accessibility Toolbar&quot;...
013: 
014: <a...resources/toolbar/">...toolbar/</a>
015: 
016: Color Contrast Analyzer, 
017: <a ... nalyser/index.html">...index.html</a>
018: 
019: blah, blah, blah
020: 
021: Vishwam  <!-- end of msg, but NOT of bq or pq -->
022: 
023: ----- Original Message -----
024: From: Isabel Danforth &lt;isabel@shelltown.com&gt;
025: Date: Friday, May 20, 2005 11:03 am
026: Subject: Re: [Web4lib] Favorite Free Web Tools
027: 
028: 
029: </pre> <!-- ends the styled_pq immediately preceding the text of Vishwam's msg -->
030: <bq2> <!-- identical to bq1 and NESTED INSIDE bq1 -->
031: <pq2>I am in a ... find.
032: 
033: Isabel
034: 
035: On Fri, 20 May 2005, Susan Boland wrote:
036: </pre> <!-- ends pq2 -->
037: 
038: <bq3> <!-- AGAIN, nested, bq1 and bq2 are still open! -->
039: <pq3>
040: I am working on a program for the American Association of Law  <!-- NB: this writer or the writer of the child, above, used a different MODE for quoting -->
041: </pre>  <!-- ends pq3 -->
042: </blockquote>  <!-- ends bq3 -->
043: <pq4>
044: Libraries ... tools.
045: 
046: </pre> <!-- ends pq3 -->
047: 
048: <bq4> <!-- This is NESTED in bq2, since 2 is still open but 3 has closed -->
049: <pq4>
050: While I ... (THRU SIGNATURE BLOCK)
051: (LONG UNDERLINE, SETS OFF LIST CREDIT)
052: (LIST CREDIT)
053: </pre>  <!-- ends pq4 -->
054: </blockquote>  <!-- ends bq4 -->
055: <pre style="margin: 0em;">
056: (REPEAT REPEATEDLY (quantity depends on num of replies))
057: </pre>  <!-- ends yet another... not shown, I think -->
058: </blockquote> <!-- end bq2 -->
059: </blockquote>  <!-- end bq1 -->
060: <pre>  <!-- empty pre ... -->
061:        <!-- ...no sweat removing above and below -->
062: </pre> <!-- end empty, source_terminating, pre_pair -->
063:
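The nesting noted in the listing (bq2 inside bq1, bq4 inside bq2) can be checked mechanically rather than by eye. What follows is a minimal sketch, assuming blockquote tags are the only thing that matters for depth; `blockquote_depths` is a hypothetical helper name of mine, not something from ecc's code or from any module:

```perl
use strict;
use warnings;

# Hypothetical helper: walk every blockquote tag in order and report
# the nesting depth at which it opens or closes.
sub blockquote_depths {
    my ($html) = @_;
    my ($depth, @trace) = (0);
    while ($html =~ m{<(/?)blockquote\b[^>]*>}gi) {
        if ($1) {                      # "/" captured: a closing tag
            push @trace, "$depth close";
            $depth--;
        }
        else {                         # an opening tag
            $depth++;
            push @trace, "$depth open";
        }
    }
    return @trace;
}

# Tiny stand-in for the real message: one quote nested inside another.
print "$_\n" for blockquote_depths(
    '<blockquote>a<blockquote>b</blockquote>c</blockquote>'
);
```

Fed the whole sample instead of the stand-in string, any "open" entry deeper than 1 is a quote inside a quote, which is exactly what makes a naive "delete from the second blockquote on" risky.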

2. As strictly followed as possible, given the above, your spec would remove the base message, or a fragment thereof, thereby removing the 'reason' for the replies. This does NOT sound like a 'good idea' if what you're trying to do is build a database (...info resource, whatever).

3. Removing the redundant list credits (added at the bottom of the thread, more-or-less once-per-reply) and the numerous, no_op pre's (<pre style="margin: 0em;">) will do a lot to compact the data, while retaining the appearance. But better yet, removing the bqs, inserting (yes, programmatically) <br>s or <p>s while retaining the editorial content seems likely to serve you better. But while that's relatively trivial for your sample, doing it for multiple listservs and/or non-uniform quoting techniques could get VERY hard...

4. The quoting style in your source -- the blockquotes, etc -- looks to me like it may be coming from a webmail reader or via (...gag!) MS Outlook. I believe there are better ways to get the source... perhaps right off the listserv, and perhaps even free of some of the problematic (correct, but VERY damn problematic/bad style....) markup. But for that, you'll need to try a new question in SOPW.
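Observation 3 can be sketched in a few lines: keep the first list credit, drop the repeats, then swap the blockquote and pre tags for breaks so the editorial content survives. This is only a sketch for the one sample above. The name `compact_thread` and the footer pattern (47 underscores, the list name, everything up to the next closing `</a>`) are my assumptions, and other lists will need other patterns:

```perl
use strict;
use warnings;

# Hypothetical helper: compact one archived message.
sub compact_thread {
    my ($html) = @_;

    # List credit as it appears in the sample: 47 underscores, the list
    # name, then everything up to the closing </a> of the list link.
    my $footer = qr/_{47}\s*Web4lib mailing list.*?<\/a>\s*/s;

    # Keep the first credit, delete every later copy.
    my $seen = 0;
    $html =~ s/($footer)/$seen++ ? '' : $1/ge;

    # Drop blockquote tags (any attributes) but keep what they enclose.
    $html =~ s{</?blockquote\b[^>]*>}{}gi;

    # Turn pre tags, styled or not, into a line break.
    $html =~ s{</?pre\b[^>]*>}{<br>\n}gi;

    # Collapse runs of <br> separated only by whitespace.
    $html =~ s{(?:<br>\s*){2,}}{<br>\n}gi;

    return $html;
}
```

Nothing here is conditional: each `s///` simply does nothing when its pattern is absent, so no separate match test is needed first.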

Re^2: Using Perl to snip the end off of HTML by hv (Prior) on Jun 13, 2005 at 23:39 UTC
I started looking at this, but I find the code quite hard to read. The heavy use of global variables contributed to that, as did the inconsistent indentation.

In general you should try to avoid building up large strings such as `$biglist` one small chunk at a time. (See "what's faster than .=" for some analysis of this.) Better would probably be to save the results in an array, then create the big string with a single join, or replace the whole thing with a join on a map, something like:

    my $message = join '', map cleanline($_), <DATA>;
    ...
    {
      my $prevline;
      sub cleanline {
        my $line = shift;
        # skip duplicate lines
        return if defined($prevline) && $line eq $prevline;
        $prevline = $line;
        # crudely HTMLify
        return "$line\n<br>\n";
      }
    }

With code like this:

    if ($biglist =~ m%$bq%ig )        #
       {
        $biglist =~ s%$bq% %ig;         # replace with space, case insensitive, global
       }

.. the initial "test if the pattern matches" gains nothing except to double the work if the pattern does match. You'll get cleaner and faster code if you take out the test:

    $biglist =~ s%$bq% %ig;

This doesn't actually remove the quotes though, only the `<blockquote>` tags themselves. I think the quote removal in the example message is intended to leave only:

    <pre>
    blah blah blah blah blah

    Andrew Darby
    Web Services Librarian
    Ithaca College Library
    <a  rel="nofollow" href="http://www.ithaca.edu/library/">http://www.ithaca.edu/library/</a>

    Vishwam Annam wrote:
    </pre>

.. all the rest being a single trailing blockquote (and a bit of <pre>'d whitespace). If that is indeed the case, you could (if the HTML is sufficiently restricted to allow it) strip it with a single recursive regexp, something like (untested):

    our $re_bq;
    $re_bq = qr{
      # open tag
      < blockquote (?: \s+ style \s* = \s* " .*? " )? \s* >
      (?:
        # some character that doesn't start a nested blockquote
        (?!<blockquote\b) .
      |
        # or a whole (recursively) nested blockquote
        (??{ $re_bq })
      )*
      < / blockquote \s* >
    }xsi;
    ...
    $message =~ s{$re_bq\s*\z}{};

This won't do the right thing in some cases - for example if something that looks like a blockquote tag is actually in an HTML comment, or embedded in an attribute string - which is why an HTML parsing module would be a much better bet.

Hope this gives you some useful ideas,

Hugo
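For anyone who wants to try the recursive-regexp idea on a small string, here is a sketch of the same approach. It is not hv's code: it swaps his `(??{ ... })` construct for the named-group recursion `(?&bq)`, which needs Perl 5.10 or later, and `strip_trailing_quote` is a name made up for this example. The same caveats apply (comments and attribute strings that merely look like tags will fool it):

```perl
use strict;
use warnings;

# One balanced <blockquote>...</blockquote>, nested quotes included.
my $re_bq = qr{
    (?<bq>
        <blockquote\b [^>]* >
        (?:
            (?! </? blockquote \b ) .    # any character not starting a blockquote tag
          |
            (?&bq)                       # or a whole nested blockquote
        )*
        </blockquote \s* >
    )
}xsi;

# Hypothetical helper: remove a blockquote only if it ends the string.
sub strip_trailing_quote {
    my ($html) = @_;
    $html =~ s{$re_bq\s*\z}{};
    return $html;
}

print strip_trailing_quote(
    "<p>reply</p><blockquote>a<blockquote>b</blockquote>c</blockquote>\n"
), "\n";
```

Because the match must run up to `\z`, a blockquote followed by anything other than whitespace is left alone, so the sample's trailing empty `<pre>` pair would have to be removed first.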
Re^2: Using Perl to snip the end off of HTML by eastcoastcoder (Sexton) on Jun 15, 2005 at 05:10 UTC
ww (and hv) - thanks. I'm going to work on digesting your code and respond.

BTW, while daydreaming, I thought of a "lateral solution". I know there are two classic means of parsing XML: stack based, reactive (SAX) and tree based, proactive (DOM). Couldn't a tree based module handle this easily? Pseudocode:

    my $node = tree.getbodynode.getlasttoplevelnode
    delete tree.$node if ($node.type == blockquote)
    (recurse)

Voila! Any nesting would be irrelevant, since it wouldn't show up on the top level tree. Would this work? More importantly, is there a tree style HTML parser for Perl? (The one I am familiar with is event based, tag based, and reactive.)
Re^3: Using Perl to snip the end off of HTML by ww (Archbishop) on Jun 15, 2005 at 13:38 UTC
Pay more attention to hv's critique (++!) than to my ugly code... and apologies for not writing the regexen in extended format with explanations in comments. (If you need same I may be able to produce, but not quickly; workload heated up bigtime yesterday.)

But, back to hv's wisdom vs. mine: I'm (maybe) halfway decent at translating structure such as you showed into a minimally working regex, but I'm already busy trying to internalize his advice, which appears to be very good.

That said, I am not entirely sure all his alternatives are applicable to your project: note that we appear to differ in our understandings of your intent. For example, if you wish to lose the blockquote tags, but not the editorial content they surround, as I read it, then the regexen MUST (TTBOMK) work on a string, because not only the tags but also the emails' 'editorial contents' span multiple '\n'. Even if so, though, I think his method of building the string is a large improvement over mine.

Re parser: I believe prior comments mentioned several, which may or may not facilitate your work.
Re^3: Using Perl to snip the end off of HTML by eastcoastcoder (Sexton) on Jun 17, 2005 at 04:49 UTC
Here's what I came up with, based on the discussion:

    use HTML::TokeParser::Simple;

    sub remove_final_blockquotes {
        my $html_string = shift;
        $html_string =~ s/\s+$//smg;     # remove trailing whitespace
        my $qty_blockquote = 0;
        my $scratch_pad = '';
        my $parser = HTML::TokeParser::Simple->new(string => $html_string);
        $parser->unbroken_text(1);
        my $ret = '';
        while (my $token = $parser->get_token) {
            if ($token->is_start_tag('blockquote')) {
                $qty_blockquote++;
                #$ret .= "\nqty_blockquote-> $qty_blockquote";
            }
            if (! $qty_blockquote) {
                $ret .= $scratch_pad;   # We've left the blockquote, so add it
                $scratch_pad = '';
                $ret .= $token->as_is();
            }
            else {
                $scratch_pad .= $token->as_is();
            }
            if ($token->is_end_tag('blockquote')) {
                $qty_blockquote--;
                #$ret .= "\nqty_blockquote-> $qty_blockquote";
            }
        }
        return $ret;
    }

Also, please: Comments and criticism greatly appreciated!!