Using Perl to snip the end off of HTML

eastcoastcoder has asked for the wisdom of the Perl Monks concerning the following question:

Re: Using Perl to snip the end off of HTML by TheStudent (Scribe) on Jun 08, 2005 at 15:39 UTC
How are you defining the end of the document? TheStudent
Re: Using Perl to snip the end off of HTML by tlm (Prior) on Jun 08, 2005 at 15:35 UTC
I don't know if it will help you, but check out `HTML::PrettyPrinter`. the lowliest monk
Re: Using Perl to snip the end off of HTML by ww (Archbishop) on Jun 08, 2005 at 16:56 UTC
Additional info would help: Do you know what the .html looks like? Or are you working against the full range of (semi-reasonable) possibilities? If the former, provide a sample, so that we may better understand your purpose/intent. As noted above, there are bits and pieces of your question that are less_than_clear re removal of <pre.../pre> and charentity spaces: the webmaster probably put 'em there for a reason. If you're merely extracting content to .txt or a db or some such, the webmaster's intent implies no requirement for you; but if you have some web-ish or "rendered" use in mind, beware.

Re `s/<blockquote.?\/blockquote>$//s`: fixing that regex is comparatively easy (if you mean what I suspect), but -- if you come to frequent the monastery, you will read often re parsing html with handrolled code: "DON'T!". Instead, search on html parse for relevant nodes pointing to the modules that will serve you well...

...AND for the best ways to get good help here, read How do I post a question effectively?

Welcome!
Re^2: Using Perl to snip the end off of HTML by eastcoastcoder (Sexton) on Jun 08, 2005 at 17:30 UTC
Okay! Thank you for the welcome, here is the additional info:

1) I am working with emails in text/html format (no groans, please). My goal is to strip any trailing whitespace, and also to strip the trailing blockquote when the sender simply quotes the previous message at the end (quotes in the middle are okay, as they are often used as reference points). I am certain that this is what I want to do, so please hold all "Why would you want to do that?" responses.

2) All the emails will be similar to this in format:

<pre> blah blah blah blah blah Andrew Darby Web Services Librarian Ithaca College Library <a rel="nofollow" href="http://www.ithaca.edu/library/">http://www.it +haca.edu/library/</a> Vishwam Annam wrote: </pre><blockquote style="border-left: #5555EE solid 0.2em; margin: 0em +; padding-left: 0.85em"><pre style="margin: 0em;">I use "Web Acc +essibility Toolbar" as link checker, accessibility checker, HTML and CSS validator. This works for IE, and there is a similar tool available for firefox, web developer extn has some features of this. <a rel="nofollow" href="http://www.nils.org.au/ais/web/resources/tool +bar/">http://www.nils.org.au/ais/web/resources/toolbar/</a> Color Contrast Analyzer, <a rel="nofollow" href="http://www.nils.org.au/ais/web/resources/cont +rast_analyser/index.html">http://www.nils.org.au/ais/web/resources/co +ntrast_analyser/index.html</a> You can create javascript menus with Navstudio, which is also a free software. I am interested in knowing, what you findout from all. Goodluck with your meeting! Vishwam ----- Original Message ----- From: Isabel Danforth <isabel@shelltown.com> Date: Friday, May 20, 2005 11:03 am Subject: Re: [Web4lib] Favorite Free Web Tools </pre><blockquote style="border-left: #5555EE solid 0.2em; margin: 0em +; padding-left: 0.85em"><pre style="margin: 0em;">I am in a very smal +l academic library. I use MarcEdit to grab MARC records to use in my system. It is free. I would really love to see what other tools you find. 
Isabel On Fri, 20 May 2005, Susan Boland wrote: </pre><blockquote style="border-left: #5555EE solid 0.2em; margin: 0em +; padding-left: 0.85em"><pre style="margin: 0em;">I am working on a p +rogram for the American Association of Law </pre></blockquote><pre style="margin: 0em;"> Libraries'> annual meeting that focuses on free or very cheap Web development tools. </pre><blockquote style="border-left: #5555EE solid 0.2em; margin: 0em +; padding-left: 0.85em"><pre style="margin: 0em;">While I have my fav +orites, I'm hoping the list can help me out by suggesting free or cheap tools that I might not be aware of. If you want to reply to me privately, I can summarize for the list. Thanks! Susan M. Boland Chair, Computing Services Special Interest Section Research & Instructional Services Librarian Northern Illinois University College of Law Library DeKalb, Il 60115 815-753-9492 Fax: 815-753-9499 e-mail: sboland@niu.edu <a rel="nofollow" href="http://law.niu.edu">http://law.niu.edu</a> _______________________________________________ Web4lib mailing list Web4lib@webjunction.org <a rel="nofollow" href="http://lists.webjunction.org/web4lib/">http:/ +/lists.webjunction.org/web4lib/</a> </pre></blockquote><pre style="margin: 0em;"> _______________________________________________ Web4lib mailing list Web4lib@webjunction.org <a rel="nofollow" href="http://lists.webjunction.org/web4lib/">http:/ +/lists.webjunction.org/web4lib/</a> ---------------------------------------------------------------------- +-- _______________________________________________ Web4lib mailing list Web4lib@webjunction.org <a rel="nofollow" href="http://lists.webjunction.org/web4lib/">http:/ +/lists.webjunction.org/web4lib/</a> </pre></blockquote></blockquote> <pre> </pre>

Note the nested blockquotes. (Those pluses are not in the original post; PerlMonks just puts them in.)
Re: Using Perl to snip the end off of HTML by ww (Archbishop) on Jun 08, 2005 at 18:28 UTC
OK... referencing your Reply, a coupla more questions:

- Are you trying to scrape this or are you working with a saved_local_file? (If so, I suspect you can swap some extra processing for much better results.)
- Either it's waaay too late to comment (cuz brain went to sleep before body) or there is some very odd usage in the Boland blockquote (near "Libraries'>", which might be the close of a quoted link, except that I can't find a start). So the question is: Can you count on well_formed html? Mere fragments (like your sample data)?
- added> This is NOT a complete .html page. If, in fact, you have such a page, you could use the </body> tag as a marker to determine which (possibly nested) blockquotes are at the end of the page.
- added> If all the .html fragments you want to clean up come from the same origin (the web4lib listserv), you can clean up the end of the file by starting near the end. As it turns out (depending on your example) there's a very easy place to truncate your file... just before the very first instance of

  _______________________________________________
  Web4lib mailing list

  where there are 47 underlines, a newline, and the web4lib... credit.

added> In fact, the more I puzzle over this, the more I suspect that your certainty about your intent (as indicated by your caution to 'hold all "Why would you want to do that?" responses') may reflect inadequate analysis of what you need to do to achieve your (still vague) objective.

added> Others, more skilled, may approach this differently, but IMO, writing complex regexen is only about 10% knowing syntax; the rest is analysing your dataset so that the logic of the regex is plain. So, consider writing (or at least, 'thinking through') a detailed example of what output you want from the processing and compare that -- in detail -- to the source, to get a clear view of what you need to do.

And again: This clearly is a case where you'll ascend unto the heavens more quickly and surely on the backs of those who've written "gold standard" modules such as HTML::Parser, etc. than by re-inventing the wheel.

By way of confession: writing the regex you're seeking is not quite as simple as I suggested above for your source example, ... but ...

update: Last 2 bullets and the 2 following paragraphs added (2005 Jun 8 2100 GMT), after 'puzzling' for a bit.
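The truncation idea from the reply above (cut just before the first list credit) could be sketched roughly as follows. The function name `truncate_at_list_footer` and its arguments are made up for illustration, and the footer shape (47 underscores, a newline, then "<list> mailing list") is assumed from the single web4lib sample, so it may not hold for other lists.

```perl
use strict;
use warnings;

# Cut a message just before the first mailing-list footer.
# Assumption: the footer is a run of 47 underscores, a newline,
# then "<list name> mailing list", as in the web4lib sample.
sub truncate_at_list_footer {
    my ($text, $list_name) = @_;
    my $footer = qr/_{47}\r?\n\Q$list_name\E mailing list/;

    # $-[0] is the offset where the match started; keep what precedes it.
    if ($text =~ $footer) {
        return substr($text, 0, $-[0]);
    }
    return $text;    # no footer found: leave the text alone
}

my $sample = "Thanks!\nSusan\n" . ('_' x 47)
           . "\nWeb4lib mailing list\nWeb4lib\@webjunction.org\n";
print truncate_at_list_footer($sample, 'Web4lib');
```

Note that a blunt cut like this leaves any `<pre>` or `<blockquote>` that was open at the cut point unclosed, so the result would still need its tags balanced before being used as HTML.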
Re^2: Using Perl to snip the end off of HTML by eastcoastcoder (Sexton) on Jun 08, 2005 at 23:09 UTC
It seems like everyone would like some background. I thought it was orthogonal to my objective and so omitted it, but apparently it's not, so here it is.

I'm working on adding code to an (already working) mailing list converter. The converter converts emails to html, suitable for use in archives. Some features that I would like to add are: to get rid of whitespace at the end (easy), HTML whitespace (medium), and trailing blockquotes (quite hard).

The email is already stored locally in a file, and is converted to a variable in memory. By end of the file, I simply mean the last byte (character for those of us who speak Unicode) of the variable. The `</body>` etc tags will be added on later.

The emails will be coming from a variety of sources and lists. The one I posted above is simply an example, but I can't count on its specifics.

The html should be well formed, with the understanding that the `<body>` and `</body>` tags will be left out. There should be no tables or anything - it should really be more or less straight text, with simple markup - as you would find in an email sent as text/html.

Again, by "end" I only mean the end of the variable.

I hope I've provided everything that you've asked - if not, please let me know. And I hope someone here has some ideas, since I'm stumped, specifically on the last one (removing (possibly nested) blockquotes).

PS Yes, I realize that regexes should never be used for real html: they can get confused by html in comments, and by whitespace or attributes in the middle of a tag, etc. I was only trying to give a simple example of why they were totally inadequate here. The code, as is, can already convert
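For the "easy" and "medium" items in the list above, a minimal sketch might look like this. What counts as "HTML whitespace" is an assumption here (`&nbsp;`, `&#160;`, `<br>`, and empty `<pre>` or `<p>` elements), and `strip_trailing_space` is a made-up name, not part of the converter.

```perl
use strict;
use warnings;

# Strip trailing plain and HTML whitespace from an HTML fragment.
# Assumption: "HTML whitespace" means &nbsp; / &#160; entities, <br> tags,
# and <pre> or <p> elements that contain nothing but such whitespace.
sub strip_trailing_space {
    my ($html) = @_;
    my $blank = qr/(?:\s|&nbsp;|&#160;|<br\s*\/?>)/i;
    # \g{-1} refers back to the tag name captured just before it.
    my $empty = qr/<(pre|p)\b[^>]*>$blank*<\/\g{-1}\s*>/i;

    # Remove the longest run of blanks and empty elements that reaches the end.
    $html =~ s/(?:$blank|$empty)+\z//;
    return $html;
}

print strip_trailing_space("hi<br>&nbsp; \n<pre> </pre>\n"), "\n";
```

Whitespace inside an element that still has content is left alone, because the run being removed has to extend all the way to the end of the string.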
Re: Using Perl to snip the end off of HTML by ww (Archbishop) on Jun 13, 2005 at 19:49 UTC
After additional offline correspondence re eastcoastcoder's specs (none of which turned out to be crucial to the issue), the following is more-or-less responsive (ie, it does most of what I think should be what ecc wants (:<})).

REPEATED CAVEAT: There's gotta be a better way; pursue the suggestions for modules to help handle this; perhaps find a better way to acquire source data; and recognize the code below as written for ease of understanding, not elegance! Other caveats, provided to ecc earlier, are below. Comments re those and "teaching comments" re the undoubtedly "ugly" coding are both welcome.

Some observations:

eastcoastcoder wrote (some formatting added):

"if we have <bq1>fred said this</bq1>I agree<bq2>he said<bq3>she said so</bq3>and so do I</bq2> - we can delete from <bq2> on and have proper html"

1. We DON'T have EXACTLY that (note especially the variant quoting of the Boland message) and (more below) valid .html is FAR from the whole of your issue or 'problem.' Simplifying the source you provided (and omitting the links to each writer's institution):

    <pre>
    blah blah blah blah blah ...

    Andrew Darby
    title...

    Vishwam Annam wrote:
    </pre>        <!-- ENDS THE initial, UN_styled <pre> -->
    <bq1>         <!--bqs are all identical -->
    <pq1>         <!-- all '<pre style="...">' are identical -->
                  <!-- and the style is effectively a no_op -->
    I use "Web Accessibility Toolbar"...

    <a...resources/toolbar/">...toolbar/</a>

    Color Contrast Analyzer,
    <a ... nalyser/index.html">...index.html</a>

    blah, blah, blah

    Vishwam       <!-- end of msg, but NOT of bq or pq -->

    ----- Original Message -----
    From: Isabel Danforth <isabel@shelltown.com>
    Date: Friday, May 20, 2005 11:03 am
    Subject: Re: [Web4lib] Favorite Free Web Tools


    </pre>        <!-- ends the styled_pq immediately preceding the text of Vishwam's msg -->
    <bq2>         <!-- identical to bq1 and NESTED INSIDE bq1 -->
    <pq2>I am in a ... find.

    Isabel

    On Fri, 20 May 2005, Susan Boland wrote:
    </pre>        <!-- ends pq2 -->

    <bq3>         <!-- AGAIN, nested, bq1 and bq2 are still open! -->
    <pq3>
    I am working on a program for the American Association of Law  <!-- NB: this writer or the writer of the child, above, used a different MODE for quoting -->
    </pre>        <!-- ends pq3 -->
    </blockquote> <!-- ends bq3 -->
    <pq4">
    Libraries ... tools.

    </pre>        <!-- ends pq3 -->

    <bq4>         <!-- This is NESTED in bq2, since 2 is still open but 3 has closed -->
    <pq4>
    While I ... (THRU SIGNATURE BLOCK)
    (LONG UNDERLINE, SETS OFF LIST CREDIT)
    (LIST CREDIT)
    </pre>        <!-- ends pq4 -->
    </blockquote> <!-- ends bq4 -->
    <pre style="margin: 0em;">
    (REPEAT REPEATEDLY (quantity depends on num of replies))
    </pre>        <!-- ends yet another... not shown, I think -->
    </blockquote> <!-- end bq2 -->
    </blockquote> <!-- end bq1 -->
    <pre>         <!-- empty pre ... -->
                  <!-- ...no sweat removing above and below -->
    </pre>        <!-- end empty, source_terminating, pre_pair -->

2. As strictly followed as possible, given the above, your spec would remove the base message, or a fragment thereof, thereby removing the 'reason' for the replies. This does NOT sound like a 'good idea' if what you're trying to do is build a database (...info resource, whatever).

3. Removing the redundant list credits (added at the bottom of the thread, more-or-less once-per-reply) and the numerous, no_op pre's (<pre style="margin: 0em;">) will do a lot to compact the data, while retaining the appearance. But better yet, removing the bqs and inserting (yes, programmatically) <br>s or <p>s while retaining the editorial content seems likely to serve you better. But while that's relatively trivial for your sample, doing it for multiple listservs and/or non-uniform quoting techniques could get VERY hard...

4. The quoting style in your source -- the blockquotes, etc -- looks to me like it may be coming from a webmail reader or via (...gag!) MS Outlook. I believe there are better ways to get the source... perhaps right off the listserv, and perhaps even free of some of the problematic (correct, but VERY damn problematic/bad style....) markup. But for that, you'll need to try a new question in SOPW.
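A rough sketch of observation 3 (drop the repeated list credits, and remove the blockquote tags while keeping what they quote) might look like this. `compact_thread` and its `$credit` argument are hypothetical, and keeping the first copy of the credit is an assumption about what "redundant" means; this is not the code from the collapsed listing.

```perl
use strict;
use warnings;

# Compact a quoted thread: keep only the first list credit and
# drop <blockquote> tags while keeping the quoted text itself.
# Assumption: $credit is the literal footer text for the list in question.
sub compact_thread {
    my ($html, $credit) = @_;

    # Keep the first occurrence of the credit, delete later copies.
    my $seen = 0;
    $html =~ s/(\Q$credit\E\s*)/$seen++ ? '' : $1/ge;

    # Remove opening and closing blockquote tags, whatever their attributes.
    $html =~ s{</?blockquote\b[^>]*>}{}gi;
    return $html;
}

print compact_thread(
    "a\nFOOT\nb\nFOOT\n<blockquote style=\"x\">q</blockquote>", 'FOOT'
), "\n";
```

As the observation itself warns, this only stays simple while every message quotes the same way; mixed quoting styles would need per-source handling.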
Re^2: Using Perl to snip the end off of HTML by hv (Prior) on Jun 13, 2005 at 23:39 UTC
I started looking at this, but I find the code quite hard to read. The heavy use of global variables contributed to that, as did the inconsistent indentation.

In general you should try to avoid building up large strings such as `$biglist` one small chunk at a time. (See what's faster than .= for some analysis of this.) Better would probably be to save the results in an array, then create the big string with a single join, or replace the whole thing with a join on a map, something like:

    my $message = join '', map cleanline($_), <DATA>;
    ...
    {
      my $prevline;
      sub cleanline {
        my $line = shift;

        # skip duplicate lines
        return if defined($prevline) && $line eq $prevline;
        $prevline = $line;

        # crudely HTMLify
        return "$line\n<br>\n";
      }
    }

With code like this:

    if ($biglist =~ m%$bq%ig ) #
    {
        $biglist =~ s%$bq% %ig;    # replace with space, case insensitive, global
    }

.. the initial "test if the pattern matches" gains nothing except to double the work if the pattern does match. You'll get cleaner and faster code if you take out the test:

    $biglist =~ s%$bq% %ig;

This doesn't actually remove the quotes though, only the `<blockquote>` tags themselves. I think the quote removal in the example message is intended to leave only:

    <pre>
    blah blah blah blah blah

    Andrew Darby
    Web Services Librarian
    Ithaca College Library
    <a rel="nofollow" href="http://www.ithaca.edu/library/">http://www.ithaca.edu/library/</a>

    Vishwam Annam wrote:
    </pre>

.. all the rest being a single trailing blockquote (and a bit of <pre>'d whitespace). If that is indeed the case, you could (if the HTML is sufficiently restricted to allow it) strip it with a single recursive regexp, something like (untested):

    our $re_bq;
    $re_bq = qr{
      # open tag
      < blockquote (?: \s+ style \s* = \s* " .*? " )? \s* >
      (?:
        # some character that doesn't start a nested blockquote
        (?!<blockquote\b) .
      |
        # or a whole (recursively) nested blockquote
        (??{ $re_bq })
      )*
      < / blockquote \s* >
    }xsi;
    ...
    $message =~ s{$re_bq\s*\z}{};

This won't do the right thing in some cases - for example if something that looks like a blockquote tag is actually in an HTML comment, or embedded in an attribute string - which is why an HTML parsing module would be a much better bet.

Hope this gives you some useful ideas,

Hugo
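For anyone wanting to try the recursive-regexp idea, here is a self-contained variant. It is a sketch, not Hugo's pattern: it swaps `(??{ ... })` for named recursion with `(?&bq)` (available from Perl 5.10), accepts any attributes on the opening tag, and refuses to treat a stray closing tag as ordinary text. The name `strip_trailing_blockquotes` is made up for the example.

```perl
use strict;
use warnings;

# One balanced <blockquote>...</blockquote>, nested ones included.
my $bq = qr{
    (?<bq>
        <blockquote \b [^>]* >
        (?:
            [^<]++                      # text that contains no tag start
          | < (?! /? blockquote \b )    # a "<" that does not begin a blockquote tag
          | (?&bq)                      # or a whole nested blockquote
        )*+
        </blockquote \s* >
    )
}xi;

# Remove every top-level blockquote that sits at the very end of the fragment.
sub strip_trailing_blockquotes {
    my ($html) = @_;
    $html =~ s{ (?: $bq \s* )+ \z }{}x;
    return $html;
}

print strip_trailing_blockquotes(
    "reply<blockquote>old<blockquote>older</blockquote></blockquote>\n"
), "\n";
```

The same caveats apply as for the original: blockquote-like text inside comments or attribute values will confuse it, and anything after the last blockquote (such as the sample's final `<pre> </pre>`) has to be stripped first or the `\z` anchor will never match.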
Re^2: Using Perl to snip the end off of HTML by eastcoastcoder (Sexton) on Jun 15, 2005 at 05:10 UTC
ww (and hv) - thanks. I'm going to work on digesting your code and respond.

BTW, while daydreaming, I thought of a "lateral solution". I know there are two classic means of parsing XML - event based, reactive (SAX) and tree based, proactive (DOM). Couldn't a tree based module handle this easily? Pseudocode:

    my $node = tree.getbodynode.getlasttoplevelnode
    delete tree.$node if ($node.type == blockquote)
    (recurse)

Voila! Any nesting would be irrelevant, since it wouldn't show up on the top level of the tree. Would this work? More importantly, is there a tree style HTML parser for Perl? (The one I am familiar with is event based, tag based, and reactive.)
Re^3: Using Perl to snip the end off of HTML by ww (Archbishop) on Jun 15, 2005 at 13:38 UTC
Pay more attention to hv's critique (++!) than to my ugly code... and apologies for not writing the regexen in extended format with explanations in comments. (If you need same I may be able to produce, but not quickly; workload heated up bigtime yesterday.)

But, back to hv's wisdom vs. mine: I'm (maybe) halfway decent at translating structure such as you showed into a minimally working regex, but I'm already busy trying to internalize his advice, which appears to be very good.

That said, I am not entirely sure all his alternatives are applicable to your project: note that we appear to differ in our understandings of your intent. For example, if you wish to lose the blockquote tags, but not the editorial content they surround, as I read it, then the regexen MUST (TTBOMK) work on a string because not only the tags, but also the emails' 'editorial contents' span multiple '\n'. Even if so, though, I think his method of building the string is a large improvement over mine.

Re parser: I believe prior comments mentioned several, which may or may not facilitate your work.
Re^3: Using Perl to snip the end off of HTML by eastcoastcoder (Sexton) on Jun 17, 2005 at 04:49 UTC
Here's what I came up with, based on the discussion:

    use HTML::TokeParser::Simple;

    sub remove_final_blockquotes {
        my $html_string = shift;
        $html_string =~ s/\s+\z//;    # remove trailing whitespace at the end of the string only
        my $qty_blockquote = 0;       # current blockquote nesting depth
        my $scratch_pad = '';         # holds a blockquote until we know whether it is trailing
        my $parser = HTML::TokeParser::Simple->new(string => $html_string);
        $parser->unbroken_text(1);
        my $ret = '';
        while (my $token = $parser->get_token) {
            if ($token->is_start_tag('blockquote')) {
                $qty_blockquote++;
                #$ret .= "\nqty_blockquote-> $qty_blockquote";
            }
            if (! $qty_blockquote) {
                $ret .= $scratch_pad;    # We've left the blockquote, so add it
                $scratch_pad = '';
                $ret .= $token->as_is();
            } else {
                $scratch_pad .= $token->as_is();
            }
            if ($token->is_end_tag('blockquote')) {
                $qty_blockquote--;
                #$ret .= "\nqty_blockquote-> $qty_blockquote";
            }
        }
        # Whatever is still in $scratch_pad is a trailing blockquote, so it is dropped.
        return $ret;
    }

Also, please: Comments and criticism greatly appreciated!!