eastcoastcoder has asked for the wisdom of the Perl Monks concerning the following question:
Hi. I'd like to use Perl to strip any trailing whitespace from the end of an HTML document. A simple s/\s*$//s won't suffice because, in HTML, whitespace can also look like <pre> </pre> and so on.
Also, I'd like to remove any <blockquote>blah blah blah</blockquote> at the END of the document ONLY (i.e., not from the middle of the page). Again, a simple s/<blockquote.*?\/blockquote>$//s won't suffice, since blockquotes can be nested.
Any ideas?
|
Re: Using Perl to snip the end off of HTML
by TheStudent (Scribe) on Jun 08, 2005 at 15:39 UTC | |
How are you defining the end of the document?
TheStudent
|
Re: Using Perl to snip the end off of HTML
by tlm (Prior) on Jun 08, 2005 at 15:35 UTC | |
I don't know if it will help you, but check out HTML::PrettyPrinter.
the lowliest monk
|
Re: Using Perl to snip the end off of HTML
by ww (Archbishop) on Jun 08, 2005 at 16:56 UTC | |
by eastcoastcoder (Sexton) on Jun 08, 2005 at 17:30 UTC | |
1) I am working with emails in text/html format (no groans, please). My goal is to strip any trailing whitespace, and also to remove the quoted text when a poster simply quotes the previous message at the end (quotes in the middle are okay, as they are often used as reference points). I am certain that this is what I want to do, so please hold all "Why would you want to do that?" responses.
2) All the emails will be similar to this in format:
Note the nested blockquotes. (Those pluses are not in the original post; PerlMonks just puts them in.)
|
Re: Using Perl to snip the end off of HTML
by ww (Archbishop) on Jun 08, 2005 at 18:28 UTC | |
added> In fact, the more I puzzle over this, the more I suspect that your certainty about your intent (as indicated by your caution to 'hold all "Why would you want to do that?" responses') may reflect inadequate analysis of what you need to do to achieve your (still vague) objective.
added> Others, more skilled, may approach this differently, but IMO, writing complex regexen is only about 10% knowing syntax; the rest is analysing your dataset so that the logic of the regex is plain. So, consider writing (or at least 'thinking through') a detailed example of the output you want from the processing, and compare that, in detail, to the source to get a clear view of what you need to do.
And again: this clearly is a case where you'll ascend unto the heavens more quickly and surely on the backs of those who've written "gold standard" modules such as HTML::Parser than by re-inventing the wheel.
By way of confession: writing the regex you're seeking is not quite as simple as I suggested above for your source example.
Update: last 2 bullets and the 2 following paragraphs added (2005 Jun 8 2100 GMT), after 'puzzling' for a bit.
by eastcoastcoder (Sexton) on Jun 08, 2005 at 23:09 UTC | |
I'm working on adding code to an (already working) mailing list converter. The converter converts emails to HTML suitable for use in archives. Some features that I would like to add are removal of: whitespace at the end (easy), HTML whitespace (medium), and trailing blockquotes (quite hard). The email is already stored locally in a file, and is converted to a variable in memory. By "end of the file", I simply mean the last byte (character, for those of us who speak Unicode) of the variable. The </body> etc. tags will be added on later.
The emails will be coming from a variety of sources and lists. The one I posted above is simply an example, and I can't count on its specifics. The HTML should be well formed, with the understanding that the <body> and </body> tags will be left out. There should be no tables or anything; it should really be more or less straight text with simple markup, as you would find in an email sent as text/html. Again, by "end" I only mean the end of the variable.
I hope I've provided everything that you've asked for; if not, please let me know. PS: Yes, I realize that regexes should never be used for real HTML (they can get confused by HTML in comments, by whitespace or attributes in the middle of a tag, etc.); I was only trying to give a simple example of why they were totally inadequate here. The code, as is, can already convert
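For the "HTML whitespace (medium)" item, here is a minimal sketch. It assumes trailing whitespace only ever shows up as plain whitespace, &nbsp;, <br> tags, or <pre>/<p>/<div> elements with nothing but whitespace inside. The helper name strip_trailing_html_ws and that list of tags are assumptions made for illustration, not part of the converter, and like any regex over HTML it can be fooled by comments or unusual attributes.

```perl
use strict;
use warnings;
use 5.010;    # for the \g{1} backreference

# Hypothetical helper: peel "blank" material off the end of an HTML
# fragment, one layer per pass, until nothing blank is left at the end.
sub strip_trailing_html_ws {
    my ($html) = @_;

    # One unit of blank content: whitespace, &nbsp;, or a <br> tag.
    my $blank = qr{(?:\s|&nbsp;|<br\s*/?>)}i;

    1 while $html =~ s{
        (?: $blank+                          # a run of blank content
          | <(pre|p|div)\b[^>]*>             # or an element such as <pre ...>
              $blank*                        #   holding only blank content
            </\g{1}\s*>                      #   and closed by the same tag
        )
        \z                                   # anchored at the very end
    }{}xi;

    return $html;
}

print strip_trailing_html_ws(qq{<p>Hi</p>\n<pre style="margin: 0em;"> </pre>\n<br>&nbsp;\n}), "\n";
```

Blank elements in the middle of the message are left alone because every alternative has to end at \z.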
|
Re: Using Perl to snip the end off of HTML
by ww (Archbishop) on Jun 13, 2005 at 19:49 UTC | |
REPEATED CAVEAT: There's gotta be a better way; pursue the suggestions for modules to help handle this; perhaps find a better way to acquire source data, and recognize the code below as written for ease of understanding, not elegance!
Some observations: eastcoastcoder wrote (some formatting added): "if we have <bq1>fred said this</bq1>I agree<bq2>he said<bq3>she said so</bq3>and so do I</bq2> - we can delete from <bq2> on and have proper html"
1. We DON'T have EXACTLY that (note especially the variant quoting of the Boland message) and (more below) valid HTML is FAR from the whole of your issue or 'problem.' Simplifying the source you provided (and omitting the links to each writer's institution):
2. Followed as strictly as possible, given the above, your spec would remove the base message, or a fragment thereof, thereby removing the 'reason' for the replies. This does NOT sound like a 'good idea' if what you're trying to do is build a database (...info resource, whatever).
3. Removing the redundant list credits (added at the bottom of the thread, more-or-less once per reply) and the numerous no-op pre's (<pre style="margin: 0em;">) will do a lot to compact the data while retaining the appearance. But better yet, removing the bqs and inserting (yes, programmatically) <br>s or <p>s while retaining the editorial content seems likely to serve you better. But while that's relatively trivial for your sample, doing it for multiple listservs and/or non-uniform quoting techniques could get VERY hard...
4. The quoting style in your source (the blockquotes, etc.) looks to me like it may be coming from a webmail reader or via (...gag!) MS Outlook. I believe there are better ways to get the source... perhaps right off the listserv, and perhaps even free of some of the problematic (correct, but VERY damn problematic/bad style....) markup. But for that, you'll need to try a new question in SOPW.
by hv (Prior) on Jun 13, 2005 at 23:39 UTC | |
I started looking at this, but I find the code quite hard to read. The heavy use of global variables contributed to that, as did the inconsistent indentation. In general you should try to avoid building up large strings such as $biglist one small chunk at a time. (See what's faster than .= for some analysis of this.) Better would probably be to save the results in an array, then create the big string with a single join, or replace the whole thing with a join on a map, something like:
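The snippet that followed is not shown here. As a rough illustration of the join-on-a-map idea (not hv's code), assuming each item is formatted by a hypothetical format_item helper and the result ends up in something like $biglist:

```perl
use strict;
use warnings;

# Hypothetical formatter for a single item.
sub format_item {
    my ($item) = @_;
    return "<li>$item</li>";
}

# Build the whole string with one join over a map, rather than
# appending to it with .= on every pass through a loop.
sub build_biglist {
    my @items = @_;
    return join "\n", map { format_item($_) } @items;
}

my $biglist = build_biglist('first', 'second', 'third');
print "$biglist\n";
```

Keeping the pieces in a list until the end also makes it easy to filter or reorder them before the final join.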
With code like this: .. the initial "test if the pattern matches" gains nothing except to double the work if the pattern does match. You'll get cleaner and faster code if you take out the test:
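A minimal sketch of that point, assuming the pattern in question strips <blockquote> tags (the sub name strip_blockquote_tags is made up for illustration). The substitution is simply a no-op when nothing matches, and its return value already tells you how many replacements happened:

```perl
use strict;
use warnings;

# Remove every <blockquote ...> and </blockquote> tag, keeping the text
# between them. There is no separate "does it match?" test beforehand:
# s/// does nothing when the pattern is absent.
sub strip_blockquote_tags {
    my ($text) = @_;
    # s///g returns the number of substitutions, or a false value if none.
    my $count = ($text =~ s{</?blockquote\b[^>]*>}{}gi) || 0;
    return ($text, $count);
}

my ($clean, $removed) = strip_blockquote_tags('<blockquote type="cite">quoted</blockquote> reply');
print "$clean ($removed tags removed)\n";
```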
This doesn't actually remove the quotes though, only the <blockquote> tags themselves. I think the quote removal in the example message is intended to leave only: .. all the rest being a single trailing blockquote (and a bit of <pre>'d whitespace). If that is indeed the case, you could (if the HTML is sufficiently restricted to allow it) strip it with a single recursive regexp, something like (untested):
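hv's own regexp is not reproduced here. As an independent illustration of the same idea, the sketch below uses the named recursion syntax available since Perl 5.10, (?<bq>...) together with (?&bq). The sub name strip_trailing_blockquotes and the definition of "filler" (whitespace or a whitespace-only <pre>) are assumptions, and it shares the usual weaknesses of regexes applied to HTML.

```perl
use strict;
use warnings;
use 5.010;    # named captures and (?&name) recursion

# Strip balanced <blockquote> elements, plain whitespace and whitespace-only
# <pre> elements from the END of a fragment; quotes in the middle are kept.
sub strip_trailing_blockquotes {
    my ($html) = @_;

    # One balanced blockquote, possibly containing nested blockquotes.
    my $balanced = qr{
        (?<bq>
            <blockquote\b[^>]*>
            (?:
                (?> (?: (?! </?blockquote\b ) . )+ )   # text and other tags
              | (?&bq)                                 # or a nested blockquote
            )*
            </blockquote\s*>
        )
    }xis;

    # Trailing filler: whitespace, or a <pre> with only whitespace inside.
    my $filler = qr{ \s+ | <pre\b[^>]*> \s* </pre\s*> }xi;

    # Peel one trailing unit per pass until nothing more matches at the end.
    1 while $html =~ s{ (?: $balanced | $filler ) \z }{}x;

    return $html;
}

my $mail = '<blockquote>fred said this</blockquote>I agree'
         . '<blockquote>he said<blockquote>she said so</blockquote>and so do I</blockquote>'
         . qq{\n<pre style="margin: 0em;"> </pre>\n};
print strip_trailing_blockquotes($mail), "\n";
```

An unclosed <blockquote> at the end is left in place, because the pattern only matches balanced pairs.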
This won't do the right thing in some cases - for example if something that looks like a blockquote tag is actually in an HTML comment, or embedded in an attribute string - which is why an HTML parsing module would be a much better bet. Hope this gives you some useful ideas, Hugo | [reply] [d/l] [select] |
by eastcoastcoder (Sexton) on Jun 15, 2005 at 05:10 UTC | |
BTW, while daydreaming, I thought of a "lateral solution". I know there are two classic means of parsing XML: stack based, reactive (SAX) and tree based, proactive (DOM). Couldn't a tree based module handle this easily? Pseudocode:
Voila! Any nesting would be irrelevant, since it wouldn't show up in the top level of the tree. Would this work? More importantly, is there a tree style HTML parser for Perl? (The one I am familiar with is event based, tag based, and reactive.)
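To make the top-level idea concrete without committing to a particular module, here is a dependency-free sketch. It is not a real parser: it assumes well-formed markup, tracks nesting for blockquote only, and the names top_level_nodes and drop_trailing_quotes are made up for illustration. For real mail, a tree-style parser from CPAN such as HTML::TreeBuilder would be the sturdier way to get the same top-level view.

```perl
use strict;
use warnings;

# Split a fragment into top-level nodes. A top-level <blockquote> and
# everything nested inside it becomes ONE node; anything else at depth 0
# (text runs, other tags) is a node of its own.
sub top_level_nodes {
    my ($html) = @_;
    my @nodes;
    my $depth = 0;    # current blockquote nesting depth

    # Tokens are complete tags, runs of text, or a stray '<'.
    for my $tok ($html =~ /(<[^>]*>|[^<]+|<)/g) {
        my $opens  = $tok =~ /^<blockquote\b/i;
        my $closes = $tok =~ m{^</blockquote\b}i;

        if ($depth > 0) {
            $nodes[-1] .= $tok;     # still inside a top-level blockquote
        }
        else {
            push @nodes, $tok;      # start a new top-level node
        }

        $depth++ if $opens;
        $depth-- if $closes && $depth > 0;
    }
    return @nodes;
}

# Drop trailing nodes while they are blockquotes or whitespace-only text.
sub drop_trailing_quotes {
    my @nodes = top_level_nodes(shift);
    pop @nodes while @nodes
        and ( $nodes[-1] =~ /^<blockquote\b/i
           or $nodes[-1] !~ /\S/ );
    return join '', @nodes;
}

print drop_trailing_quotes(
    '<blockquote>fred said this</blockquote>I agree'
  . "<blockquote>he said<blockquote>she said so</blockquote>and so do I</blockquote>\n"
), "\n";
```

The nested quote never appears as a node of its own, which is exactly why the nesting stops mattering once you only look at the top level.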
by ww (Archbishop) on Jun 15, 2005 at 13:38 UTC | |
Pay more attention to hv's critique (++!) than to my ugly code... and apologies for not writing the regexen in extended format with explanations in comments. (If you need that, I may be able to produce it, but not quickly; my workload heated up big time yesterday.)
But, back to hv's wisdom vs. mine: I'm (maybe) halfway decent at translating structure such as you showed into a minimally working regex, but I'm already busy trying to internalize his advice, which appears to be very good. That said, I am not entirely sure all his alternatives are applicable to your project: note that we appear to differ in our understandings of your intent. For example, if you wish to lose the blockquote tags but not the editorial content they surround, as I read it, then the regexen MUST (TTBOMK) work on a string, because not only the tags but also the emails' 'editorial contents' span multiple '\n'. Even so, though, I think his method of building the string is a large improvement over mine.
Re parser: I believe prior comments mentioned several, which may or may not facilitate your work.
by eastcoastcoder (Sexton) on Jun 17, 2005 at 04:49 UTC | |
Also, please: Comments and criticism greatly appreciated!! | [reply] [d/l] |