Re^2: Using Perl to snip the end off of HTML

ww (and hv) - thanks. I'm going to work on digesting your code and respond.

BTW, while daydreaming, I thought of a "lateral solution". I know there are two classic ways of parsing XML: stack based and reactive (SAX), and tree based and proactive (DOM). Couldn't a tree based module handle this easily:

pseudocode:

my $node = tree.getbodynode.getlasttoplevelnode
delete tree.$node if ($node.type == blockquote)
(recurse)

Voila! Any nesting would be irrelevant, since nested blockquotes wouldn't show up at the top level of the tree. Would this work? More importantly, is there a tree style HTML parser for Perl? (The one I am familiar with is event based, tag based, and reactive.)
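For what it's worth, here is a minimal sketch of that idea. It assumes HTML::TreeBuilder (from the HTML-Tree distribution on CPAN, not core Perl) as the tree style parser. The helper name strip_trailing_blockquotes and the sample mail are made up for illustration, and the "(recurse)" step is written as a loop. Untested against real mail, so treat it as a starting point only.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;    # CPAN module (HTML-Tree distribution), assumed installed

# Hypothetical helper: repeatedly delete the last top-level child of <body>
# for as long as that child is a <blockquote>.
sub strip_trailing_blockquotes {
    my ($html) = @_;

    my $tree = HTML::TreeBuilder->new_from_content($html);
    my $body = $tree->find_by_tag_name('body');

    while (my @kids = $body->content_list) {
        my $last = $kids[-1];
        if (ref $last) {
            # Element node: stop at the first trailing non-blockquote.
            last unless $last->tag eq 'blockquote';
            $last->delete;    # removes the element and everything nested in it
        }
        else {
            # Text nodes are plain strings: skip whitespace-only ones,
            # stop at real text.
            last if $last =~ /\S/;
            $body->splice_content($#kids, 1);
        }
    }

    # The empty hashref asks as_HTML to emit every closing tag.
    my $out = $body->as_HTML(undef, undef, {});
    $tree->delete;            # free the tree explicitly
    return $out;
}

# Made-up sample: one quote nested in a <div>, two trailing quotes.
my $mail = '<html><body><p>My reply</p>'
         . '<div><blockquote>kept: not at top level</blockquote></div>'
         . '<p>Regards</p>'
         . '<blockquote>old <blockquote>older</blockquote></blockquote>'
         . '<blockquote>oldest</blockquote></body></html>';

print strip_trailing_blockquotes($mail), "\n";
```

The result is the serialized body element, tags included. Because only the top-level children of the body are inspected, a quote wrapped in another element (the div above) is left alone, while nested quotes disappear together with their trailing parent.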


Replies are listed 'Best First'.
Re^3: Using Perl to snip the end off of HTML by ww (Archbishop) on Jun 15, 2005 at 13:38 UTC
Pay more attention to hv's critique (++!) than to my ugly code... and apologies for not writing the regexen in extended format with explanations in comments. (If you need that I may be able to produce it, but not quickly; my workload heated up bigtime yesterday.)

But, back to hv's wisdom vs. mine: I'm (maybe) halfway decent at translating structure such as you showed into a minimally working regex, but I'm already busy trying to internalize his advice, which appears to be very good. That said, I am not entirely sure all his alternatives are applicable to your project: note that we appear to differ in our understandings of your intent. For example, if you wish to lose the blockquote tags but not the editorial content they surround, as I read it, then the regexen MUST (TTBOMK) work on a string, because not only the tags but also the emails' 'editorial contents' span multiple '\n'. Even if so, though, I think his method of building the string is a large improvement over mine.

Re the parser: I believe prior comments mentioned several, which may or may not facilitate your work.
Re^3: Using Perl to snip the end off of HTML by eastcoastcoder (Sexton) on Jun 17, 2005 at 04:49 UTC
Here's what I came up with, based on the discussion:

use HTML::TokeParser::Simple;

sub remove_final_blockquotes {
    my $html_string = shift;
    $html_string =~ s/\s+$//smg;    # remove trailing whitespace
    my $qty_blockquote = 0;
    my $scratch_pad    = '';
    my $parser = HTML::TokeParser::Simple->new(string => $html_string);
    $parser->unbroken_text(1);
    my $ret = '';
    while (my $token = $parser->get_token) {
        if ($token->is_start_tag('blockquote')) {
            $qty_blockquote++;
            #$ret .= "\nqty_blockquote-> $qty_blockquote";
        }
        if (! $qty_blockquote) {
            $ret .= $scratch_pad;    # We've left the blockquote, so add it
            $scratch_pad = '';
            $ret .= $token->as_is();
        }
        else {
            $scratch_pad .= $token->as_is();
        }
        if ($token->is_end_tag('blockquote')) {
            $qty_blockquote--;
            #$ret .= "\nqty_blockquote-> $qty_blockquote";
        }
    }
    return $ret;
}

Also, please: Comments and criticism greatly appreciated!!