in reply to Using Perl to snip the end off of HTML
REPEATED CAVEAT: There's gotta be a better way; pursue the suggestions for modules to help handle this; perhaps find a better way to acquire source data, and recognize the code below as written for ease of understanding, not elegance!
Other caveats, provided to ecc earlier, are below. Comments re those and "teaching comments" re the undoubtedly "ugly" coding are both welcome.
#! c:/perl/bin -w
use diagnostics;
use strict;
use vars qw ( $prevmail $biglist $mail @mail $mailcleaned $interim $result);

$prevmail = "";
$biglist = "";
$mailcleaned = '';
$result = "cleaned.htm";
$interim = "biglist.txt";

chomp (@mail = <DATA>);

foreach $mail(@mail) {
    &onelist($mail);
}

open (INTERIM, ">$interim");
print INTERIM "$biglist";
close INTERIM;
print STDERR "\n\t See $interim\n\n";

&clean($biglist);

open (RESULT, ">$result");
print RESULT "<html><head><title>Cleaned</title></head><body>\n$mailcleaned\n</body></html>\n";
close RESULT;
print STDERR "\n\tSee $result\n\n";
exit;
### END MAIN

### SUBS
##
# sub onelist - delete blank and dupe lines; remainder into $biglist
##
sub onelist {
    if ( $mail eq $prevmail ) {
        return;                      # return NOTHING; get next line
    } else {
        if ( $mail eq "" && $prevmail eq "\n<br>" ) {
            return;                  # again return NOTHING
        } else {
            $prevmail = $mail;       # save current to check next for dupe . "\n<br>"
            $biglist = $biglist . $mail . "\n<br>\n";   # biglist needs spaces to stay readable;
                                                        # effectively, replace \n with " "
        }
        return($biglist);
    }
}

# sub clean - get rid of blockquote, pre tags, redundant credits, etc.; save editorial content
##
sub clean {
    use vars qw ( $bq $pq $rel $credit $garbage $garbage2 );
    $bq = qr!<blockquote style="border-left: #5555EE solid 0.2em; margin: 0em; padding-left: 0.85em">|</blockquote>!i;
    $pq = qr!<pre style="margin: 0em;">|<pre>|</pre>!i;
    $rel = qr!<a rel="nofollow"!;
    $credit = qr!_______________________________________________\n<br>\nWeb4lib mailing list!;
    $garbage = qr!<br><br>--------------<br>!;
    $garbage2 = qr!------------------------------------------------------------------------!;

    # remove <blockquote style="...0.85em">
    if ($biglist =~ m%$bq%ig )
    {
        $biglist =~ s%$bq% %ig;      # replace with space, case insensitive, global
    }

    # replace $pq (styled <pre...>) with a \n<br>.
    if ( $biglist =~ m%$pq%ig ) {
        $biglist =~ s%$pq%\n<br>%ig;
    }

    # remove redundant credits
    if ( $biglist =~ m%$credit.*(?=$credit)$credit%igs ) {
        # print STDERR "\t Testing credits\n";   (UNcomment this, insert similar elsewhere for
        # simple execution tracing; add similar to see values...
        $biglist =~ s%($credit).*(?=$credit)$credit%$1%igs;
    }

    # replace '<a rel="nofollow"' with '<a'
    if ( $biglist =~ m%$rel% ) {
        $biglist =~ s%$rel%<a%ig;
    }

    # replace 47x underlines - separators for redundant credits - with shorter line of hyphens
    if ( $biglist =~ m%_______________________________________________% ) {
        $biglist =~ s%_______________________________________________%\n<br>--------------%g;
    }

    # replace multiple newlines separated only by spaces with '<br>'
    if ( $biglist =~ m%\n\s*\n% ) {
        $biglist =~ s%\n\s*\n%<br>%ig;
        # print STDERR "\t After mult newlines, \$biglist is $biglist\n";
    }

    # fix breaks around original message hyphens
    if ( $biglist =~ m%(-----.*)(?=-----)% ) {
        $biglist =~ s%(-----.*)(?=-----)(-----)%<br>\n<br>\n$1$2<br>\n%ig;
    }

    # delete garbage
    if ($biglist =~ m%$garbage% ) {
        $biglist =~ s%$garbage%%;
    }

    # delete garbage2, the long rule of hyphens
    if ($biglist =~ m%$garbage2% ) {
        $biglist =~ s%$garbage2%%;
    }

    # multiple <br> split by newlines THIS ONE STILL HAS PROBLEMS
    if ( $biglist =~ m%(?:<br>)\n{0,4}%isg ) {
        print STDERR "\n\t Testing multiple <br> sep by newlines\n";
        $biglist =~ s%(?:<br>)\n{0,4}(?:<br>)%<br>%isg;
    }

    # multiple <br> on same line
    if ( $biglist =~ m%(?:<br>)+%ig ) {
        # print STDERR "\n\t Testing multiple <br> ON SAME LINE\n";
        $biglist =~ s%(?:<br>)+%<br>%ig;
    }

    $mailcleaned = $biglist;
    return($mailcleaned);
} # ENDSUB

__DATA__
<pre>
blah blah blah blah blah

Andrew Darby
Web Services Librarian
Ithaca College Library
<a rel="nofollow" href="http://www.ithaca.edu/library/">http://www.ithaca.edu/library/</a>

Vishwam Annam wrote:
</pre><blockquote style="border-left: #5555EE solid 0.2em; margin: 0em; padding-left: 0.85em"><pre style="margin: 0em;">I use "Web Accessibility Toolbar" as link checker, accessibility checker, HTML and
CSS validator. This works for IE, and there is a similar tool available for
firefox, web developer extn has some features of this.

<a rel="nofollow" href="http://www.nils.org.au/ais/web/resources/toolbar/">http://www.nils.org.au/ais/web/resources/toolbar/</a>

Color Contrast Analyzer,
<a rel="nofollow" href="http://www.nils.org.au/ais/web/resources/contrast_analyser/index.html">http://www.nils.org.au/ais/web/resources/contrast_analyser/index.html</a>

You can create javascript menus with Navstudio, which is also a free
software. I am interested in knowing, what you findout from all.
Goodluck with your meeting!

Vishwam

----- Original Message -----
From: Isabel Danforth <isabel@shelltown.com>
Date: Friday, May 20, 2005 11:03 am
Subject: Re: [Web4lib] Favorite Free Web Tools

</pre><blockquote style="border-left: #5555EE solid 0.2em; margin: 0em; padding-left: 0.85em"><pre style="margin: 0em;">I am in a very small academic library. I use MarcEdit to grab MARC
records to use in my system. It is free. I would really love to see
what other tools you find.

Isabel

On Fri, 20 May 2005, Susan Boland wrote:

</pre><blockquote style="border-left: #5555EE solid 0.2em; margin: 0em; padding-left: 0.85em"><pre style="margin: 0em;">I am working on a program for the American Association of Law
</pre></blockquote><pre style="margin: 0em;">
Libraries'> annual meeting that focuses on free or very cheap Web
development tools.

</pre><blockquote style="border-left: #5555EE solid 0.2em; margin: 0em; padding-left: 0.85em"><pre style="margin: 0em;">While I have my favorites, I'm hoping the list can help me out by suggesting
free or cheap tools that I might not be aware of. If you want to reply to
me privately, I can summarize for the list.

Thanks!

Susan M. Boland
Chair, Computing Services Special Interest Section
Research & Instructional Services Librarian
Northern Illinois University College of Law Library
DeKalb, Il 60115
815-753-9492
Fax: 815-753-9499
e-mail: sboland@niu.edu
<a rel="nofollow" href="http://law.niu.edu">http://law.niu.edu</a>
_______________________________________________
Web4lib mailing list
Web4lib@webjunction.org
<a rel="nofollow" href="http://lists.webjunction.org/web4lib/">http://lists.webjunction.org/web4lib/</a>

</pre></blockquote><pre style="margin: 0em;">
_______________________________________________
Web4lib mailing list
Web4lib@webjunction.org
<a rel="nofollow" href="http://lists.webjunction.org/web4lib/">http://lists.webjunction.org/web4lib/</a>

------------------------------------------------------------------------
_______________________________________________
Web4lib mailing list
Web4lib@webjunction.org
<a rel="nofollow" href="http://lists.webjunction.org/web4lib/">http://lists.webjunction.org/web4lib/</a>

</pre></blockquote></blockquote>
<pre>
</pre>
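On the step flagged "THIS ONE STILL HAS PROBLEMS": one possible approach is a single substitution that treats any whitespace between consecutive <br> tags as part of the run, instead of handling "split by newlines" and "same line" separately. This is only a sketch, not tried against the full thread; collapse_breaks is a name made up for illustration and is not part of the script above.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): collapse any run of
# two or more <br> tags, however they are separated by spaces or
# newlines, into a single <br> followed by one newline.
sub collapse_breaks {
    my ($html) = @_;
    $html =~ s{(?:<br>\s*){2,}}{<br>\n}gi;
    return $html;
}

my $sample = "one\n<br>\n<br><br>\n\n<br>\ntwo\n<br>\nthree";
print collapse_breaks($sample), "\n";
```

A lone <br> is left untouched because the pattern requires at least two in a row.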
Some observations:
eastcoastcoder wrote (some formatting added): "if we have <bq1>fred said this</bq1>I agree<bq2>he said<bq3>she said so</bq3>and so do I</bq2> - we can delete from <bq2> on and have proper html"
1. We DON'T have EXACTLY that (note especially the variant quoting of the Boland message), and (more below) valid HTML is FAR from the whole of your issue or 'problem.'
Simplifying the source you provided (and omitting the links to each writer's institution):
<pre>
blah blah blah blah blah ...

Andrew Darby
title...

Vishwam Annam wrote:
</pre>          <!-- ENDS THE initial, UN_styled <pre> -->
<bq1>           <!-- bqs are all identical -->
<pq1>           <!-- all '<pre style="...">' are identical -->
                <!-- and the style is effectively a no_op -->
I use "Web Accessibility Toolbar"...

<a...resources/toolbar/">...toolbar/</a>

Color Contrast Analyzer,
<a ... nalyser/index.html">...index.html</a>

blah, blah, blah

Vishwam         <!-- end of msg, but NOT of bq or pq -->

----- Original Message -----
From: Isabel Danforth <isabel@shelltown.com>
Date: Friday, May 20, 2005 11:03 am
Subject: Re: [Web4lib] Favorite Free Web Tools


</pre>          <!-- ends the styled_pq immediately preceding the text of Vishwam's msg -->
<bq2>           <!-- identical to bq1 and NESTED INSIDE bq1 -->
<pq2>I am in a ... find.

Isabel

On Fri, 20 May 2005, Susan Boland wrote:
</pre>          <!-- ends pq2 -->

<bq3>           <!-- AGAIN, nested, bq1 and bq2 are still open! -->
<pq3>
I am working on a program for the American Association of Law
                <!-- NB: this writer or the writer of the child, above, used a different MODE for quoting -->
</pre>          <!-- ends pq3 -->
</blockquote>   <!-- ends bq3 -->
<pq4>
Libraries ... tools.

</pre>          <!-- ends pq4 -->

<bq4>           <!-- This is NESTED in bq2, since 2 is still open but 3 has closed -->
<pq5>
While I ... (THRU SIGNATURE BLOCK)
(LONG UNDERLINE, SETS OFF LIST CREDIT)
(LIST CREDIT)
</pre>          <!-- ends pq5 -->
</blockquote>   <!-- ends bq4 -->
<pre style="margin: 0em;">
(REPEAT REPEATEDLY (quantity depends on num of replies))
</pre>          <!-- ends yet another... not shown, I think -->
</blockquote>   <!-- end bq2 -->
</blockquote>   <!-- end bq1 -->
<pre>           <!-- empty pre ... -->
                <!-- ...no sweat removing above and below -->
</pre>          <!-- end empty, source_terminating, pre_pair -->
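If you'd rather have Perl do the bookkeeping than annotate nesting by hand, something like the following might help. It is a rough sketch that tokenizes with a regex, which is an assumption that only holds for markup as regular as this sample; blockquote_depths is a hypothetical name, and a real HTML parsing module is the safer tool for anything messier.

```perl
use strict;
use warnings;

# Hypothetical helper: list every <blockquote> open/close in order,
# together with the nesting depth in effect right after that tag.
sub blockquote_depths {
    my ($html) = @_;
    my $depth = 0;
    my @events;
    while ( $html =~ m{<(/?)blockquote\b[^>]*>}gi ) {
        if ( $1 eq '/' ) {
            $depth--;
            push @events, "close $depth";
        }
        else {
            $depth++;
            push @events, "open $depth";
        }
    }
    return @events;
}

# Same shape as the simplified source: bq3 closes before bq4 opens.
my $shape = '<blockquote>a<blockquote>b<blockquote>c</blockquote>'
          . 'd<blockquote>e</blockquote></blockquote></blockquote>';
print "$_\n" for blockquote_depths($shape);
```

A final depth other than 0 would be a quick sign that a thread's blockquotes don't balance.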
2. As strictly followed as possible, given the above, your spec would remove the base message, or a fragment thereof, thereby removing the 'reason' for the replies. This does NOT sound like a 'good idea' if what you're trying to do is build a database (...info resource, whatever).
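To make that concrete, here is a small sketch of one literal reading of the spec: cut at the second nested <blockquote> and close the first. The function name and the toy thread are invented for illustration; this is not a recommendation.

```perl
use strict;
use warnings;

# Hypothetical reading of the spec: keep everything up to the point where
# a second <blockquote> opens inside the first, then close the first.
sub snip_at_second_blockquote {
    my ($html) = @_;
    my $depth = 0;
    while ( $html =~ m{<(/?)blockquote\b[^>]*>}gi ) {
        $depth += $1 eq '/' ? -1 : 1;
        if ( $depth == 2 ) {
            # $-[0] is the offset where the current match (the nested tag) starts.
            return substr( $html, 0, $-[0] ) . '</blockquote>';
        }
    }
    return $html;    # no nested blockquote: nothing to snip
}

my $thread = 'Reply 2<blockquote>Reply 1<blockquote>Original question'
           . '</blockquote></blockquote>';
print snip_at_second_blockquote($thread), "\n";
```

On the toy thread the two replies survive and "Original question" is dropped, which is the loss described above.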
3. Removing the redundant list credits (added at the bottom of the thread, more-or-less once per reply) and the numerous no_op pre's (<pre style="margin: 0em;">) will do a lot to compact the data while retaining the appearance. But better yet, removing the bqs and inserting (yes, programmatically) <br>s or <p>s while retaining the editorial content seems likely to serve you better. That's relatively trivial for your sample, but doing it for multiple listservs and/or non-uniform quoting techniques could get VERY hard...
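A minimal sketch of that idea, under loud assumptions: the credit pattern is guessed from the Web4lib footer in the sample only, flatten_thread is a made-up name, and keeping the first credit (rather than the last, or none) is an arbitrary choice. Every blockquote or pre tag is treated as a paragraph boundary, which will split a sentence wherever the quoting breaks mid-sentence, as it does in the Boland message.

```perl
use strict;
use warnings;

# Assumed footer shape, taken from the sample thread; other lists will differ.
my $credit = qr{_{20,}\s*Web4lib mailing list\s*\S+\s*<a\b[^>]*>[^<]*</a>\s*}i;

# Hypothetical helper: keep one list credit, drop the blockquote/pre
# wrappers, and wrap each remaining chunk of text in <p>...</p>.
sub flatten_thread {
    my ($html) = @_;
    my $seen = 0;
    $html =~ s{($credit)}{ $seen++ ? '' : $1 }ge;        # keep only the first credit
    $html =~ s{</?(?:blockquote|pre)\b[^>]*>}{\n\n}gi;   # wrappers become paragraph breaks
    my @paras = grep { /\S/ } split /\n\s*\n/, $html;
    s/^\s+|\s+$//g for @paras;
    return join "\n", map { "<p>$_</p>" } @paras;
}

my $sample = <<'END';
<pre>
Reply text

Someone wrote:
</pre><blockquote style="margin: 0em"><pre style="margin: 0em;">Quoted text
_______________________________________________
Web4lib mailing list
Web4lib@webjunction.org
<a rel="nofollow" href="http://lists.webjunction.org/web4lib/">http://lists.webjunction.org/web4lib/</a>
</pre></blockquote><pre style="margin: 0em;">
_______________________________________________
Web4lib mailing list
Web4lib@webjunction.org
<a rel="nofollow" href="http://lists.webjunction.org/web4lib/">http://lists.webjunction.org/web4lib/</a>
</pre>
END

print flatten_thread($sample), "\n";
```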
4. The quoting style in your source -- the blockquotes, etc -- looks to me like it may be coming from a webmail reader or via (...gag!) MS Outlook. I believe there are better ways to get the source... perhaps right off the listserv, and perhaps even free of some of the problematic (correct, but VERY damn problematic/bad style....) markup. But for that, you'll need to try a new question in SOPW.