HTML module help please?

Cody Pendant has asked for the wisdom of the Perl Monks concerning the following question:

Re: HTML module help please? by tachyon (Chancellor) on Sep 01, 2004 at 02:55 UTC
Here is an HTML::Parser API2 example that shows you the basics. You basically need to deal with two cases: one where you have a </p> and one where you don't. You need to maintain state between the callbacks so you know what to do.

    my $data = <<DATA;
    <p>foo
    <h4>h4.1</h4>
    <p>bar
    <p>baz
    <h4>h4.2</h4>
    <p>bar</p>
    <p>baz</p>
    <h4>h4.3</h4>
    <hr>
    <p>bar
    <p>baz
    DATA

    {
        package MyParser;
        use base 'HTML::Parser';

        sub start {
            my ( $self, $tagname, $attr, $attrseq, $origtext ) = @_;
            if ( $self->{blockquote} ) {
                # deal with no closing </p>
                print "</blockquote>\n$origtext";
                $self->{blockquote} = 0;
            }
            elsif ( $tagname eq 'p' and $self->{h4last} ) {
                print '<blockquote>';
                $self->{blockquote} = 1;
            }
            else {
                print $origtext;
            }
            # remember whether the tag we just saw was an <h4>
            $self->{h4last} = $tagname eq 'h4' ? 1 : 0;
        }

        sub end {
            my ( $self, $tagname, $origtext ) = @_;
            if ( $self->{blockquote} and $tagname eq 'p' ) {
                print '</blockquote>';
                $self->{blockquote} = 0;
            }
            else {
                print $origtext;
            }
        }

        sub text {
            my ( $self, $origtext, $is_cdata ) = @_;
            print $origtext;
        }

        sub comment {
            # the version 2 callback receives the comment without its delimiters
            my ( $self, $comment ) = @_;
            print "<!--$comment-->";
        }
    }

    my $p = MyParser->new;
    $p->parse($data);
    $p->eof;    # flush any text still buffered by the parser

cheers
tachyon
Re^2: HTML module help please? by Cody Pendant (Prior) on Sep 01, 2004 at 03:01 UTC
Thank you very much.

($_='kkvvttuubbooppuuiiffssqqffssmmiibbddllffss') =~y~b-v~a-z~s; print
Re: HTML module help please? by eric256 (Parson) on Sep 01, 2004 at 01:49 UTC
HTML::TreeBuilder should allow you to do what you want. It's pretty easy to get it to give you a list of the h4 elements; then just check whether each one's next sibling (or child, depending on layout) is a p and replace it. Be warned, though, that if your HTML is not pretty standard HTML to start with, HTML::TreeBuilder can do some pretty strange things.

___________
Eric Hodges
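A minimal sketch of the TreeBuilder approach described above, assuming the h4 and the p are siblings. The helper name `h4_p_to_blockquote` and the sample input are made up for illustration. Note that `as_HTML` writes out the whole tree, including the html, head and body wrappers that TreeBuilder adds.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: rename the <p> that directly follows each <h4>
# to <blockquote> and return the re-serialised document.
sub h4_p_to_blockquote {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);

    for my $h4 ( $tree->look_down( _tag => 'h4' ) ) {
        # In scalar context right() gives the next sibling, which may be
        # an element, a plain text string, or undef.
        my $next = $h4->right;
        next unless ref $next and $next->tag eq 'p';
        $next->tag('blockquote');    # rename in place, keeping attributes and children
    }

    # undef: default entity encoding; '  ': indent string;
    # {}: no optional end tags, so every element is explicitly closed.
    my $out = $tree->as_HTML( undef, '  ', {} );
    $tree->delete;
    return $out;
}

print h4_p_to_blockquote("<h4>ALICE</h4>\n<p>Hello.\n<p>She exits.\n");
```

Because the tree is built before anything is renamed, unclosed paragraphs are already resolved by the parser, for better or worse.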
Re: HTML module help please? by Your Mother (Archbishop) on Sep 01, 2004 at 02:48 UTC
I think the code below will do what you want (semi-tested). But I'd have to booster pretty strongly against the perl solution. In this case you probably should do it with CSS. If the blockquote is truly a "quote," okay, but if you're only changing it for the formatting, you should format from without. Both solutions are below.

    use strict;
    use HTML::TokeParser;
    use CGI qw( blockquote );

    my $string     = join '', <DATA>;
    my $string_ref = \$string;
    my $p          = HTML::TokeParser->new( $string_ref );

    my $html;
    my $last_end_tag = '';    # start defined so the first comparison is safe

    while ( my $token = $p->get_token ) {
        if ( $token->[0] =~ /[TCD]/ ) {
            $html .= $token->[1];
        }
        elsif ( $token->[0] eq 'S' ) {
            if ( $token->[1] eq 'p' and $last_end_tag eq 'h4' ) {
                $html .= blockquote( $token->[2], $p->get_text('/p') );
                $p->get_token();    # toss the </p>
                $last_end_tag = 'blockquote';
            }
            else {
                $html .= $token->[-1];
            }
        }
        elsif ( $token->[0] eq 'E' ) {
            # if it's a new blockquote, it's closed already
            $html .= $token->[-1]
                unless $token->[1] eq '/p' and $last_end_tag eq 'h4';
            $last_end_tag = $token->[1];
        }
    }
    print $html;

    __END__
    <p>Stand alone p, or, aw, skip it.</p>
    <h4 class="title">This is an h4</h4>
    <p class="salad" id="taco">A first paragraph.</p>
    <p>A follower.</p>
    <p>Big finish. Or Finnish?</p>

CSS solution.

    blockquote {
        /* format definition, whatever you want... */
        margin: 1em 1em 1em 2em;
        font-style: italic;
        font-size: 105%;
    }
    h4 + p {
        /* format definition identical to your blockquote def */
    }

update: fixed grammar flatulence + one more, sigh.
Re^2: HTML module help please? by tachyon (Chancellor) on Sep 01, 2004 at 03:03 UTC
The closing </p> tags are optional. Run the test data at Re: HTML module help please? to see why this is broken.

cheers
tachyon
Re^3: HTML module help please? by Your Mother (Archbishop) on Sep 01, 2004 at 18:17 UTC
Oh, of course, you're right. I've become so accustomed to well-formed XHTML that I forget what the web really looks like :) Mine will fail on any unclosed paras. I tried fixing it, but your approach is much more sound. One thing mine does that yours omits is retain the attributes of the original para. But that's easy to update.

    print '<blockquote>';
    # becomes
    print CGI::start_blockquote( $attr );
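For readers who would rather not pull in CGI just for this, here is a rough core-only sketch of the same idea: rebuild the opening tag from the `$attr` hash and `$attrseq` list that the `start` callback receives. The helper name `open_tag` is invented for illustration, and the escaping covers only the characters that matter inside a double-quoted attribute value.

```perl
use strict;
use warnings;

# Hypothetical helper: build an opening tag named $name that carries the
# attributes of the original element, in their original order.
sub open_tag {
    my ( $name, $attr, $attrseq ) = @_;
    my $tag = "<$name";
    for my $key ( @{ $attrseq || [] } ) {
        my $value = $attr->{$key};
        next unless defined $value;
        # The parser hands over decoded values, so re-escape them.
        $value =~ s/&/&amp;/g;
        $value =~ s/"/&quot;/g;
        $value =~ s/</&lt;/g;
        $tag .= qq{ $key="$value"};
    }
    return "$tag>";
}

print open_tag( 'blockquote', { class => 'salad', id => 'taco' }, [ 'class', 'id' ] ), "\n";
```

Inside the `start` handler the call would look like `print open_tag( 'blockquote', $attr, $attrseq );` in place of the bare `print '<blockquote>';`.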
Re^4: HTML module help please? by tachyon (Chancellor) on Sep 01, 2004 at 22:58 UTC
Re^4: HTML module help please? by Cody Pendant (Prior) on Sep 03, 2004 at 23:11 UTC
Re^2: HTML module help please? by Cody Pendant (Prior) on Sep 01, 2004 at 02:59 UTC
"I'd have to booster pretty strongly against the perl solution. In this case you probably should do it with CSS."

Thank you so much for the code. And, to reassure you, it's not about the formatting at all. I'm parsing HTML into XHTML. The original is a screenplay, and I have a kind of rough schema where character names are H4, stage directions are P and speeches are BLOCKQUOTE. Ironically, both will probably look exactly the same in the browser. It might change later, but first I've got to achieve valid XHTML where the speeches are marked up as a distinct data type from the stage directions.
Re: HTML module help please? by dragonchild (Archbishop) on Sep 01, 2004 at 01:54 UTC
HTML::Parser is the standard way to go.

------
We are the carpenters and bricklayers of the Information Age.

Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose

I shouldn't have to say this, but any code, unless otherwise stated, is untested
Re^2: HTML module help please? by Cody Pendant (Prior) on Sep 01, 2004 at 02:21 UTC
Thanks both of you. I appreciate those links. I now remember I've installed and used HTML::TokeParser in the past as well.

I'm feeling particularly dumb today, obviously, but I'm not seeing any "write whole file back out" in those modules, and I'm not sure about the logic. What I obviously need is something like

    for each tag:
        if (tag is </h4>) {
            $just_found_h4_flag = 1;
        } elsif (tag is <p> && $just_found_h4_flag == 1) {
            change p to blockquote;
            $just_found_h4_flag = 0;
        } else {
            $just_found_h4_flag = 0;
        }
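One way the flag idea above could be sketched with HTML::TokeParser: there is no separate "write back out" step, you simply append each token's original text to an output string and substitute only the tags you want changed. The function name `retag` and the `%ends_p` list of tags assumed to end an unclosed paragraph are illustrative choices, not part of the module.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Assumption: these start tags implicitly end an open <p>.
my %ends_p = map { $_ => 1 }
    qw(p h1 h2 h3 h4 h5 h6 hr div blockquote ul ol pre table);

# Hypothetical helper: convert the <p> that follows each </h4> into a <blockquote>.
sub retag {
    my ($html) = @_;
    my $p = HTML::TokeParser->new( \$html ) or die "cannot set up parser";
    my $out      = '';
    my $after_h4 = 0;    # true once </h4> was the most recent tag
    my $in_quote = 0;    # true while a converted paragraph is still open

    while ( my $t = $p->get_token ) {
        my $type = $t->[0];
        if ( $type eq 'S' ) {
            my $tag = $t->[1];
            if ( $in_quote and $ends_p{$tag} ) {
                $out .= "</blockquote>\n";    # the paragraph had no </p>
                $in_quote = 0;
            }
            if ( $tag eq 'p' and $after_h4 ) {
                $out .= '<blockquote>';
                $in_quote = 1;
            }
            else {
                $out .= $t->[4];              # original text of the start tag
            }
            $after_h4 = 0;
        }
        elsif ( $type eq 'E' ) {
            my $tag = $t->[1];                # reported without the slash
            if ( $in_quote and $tag eq 'p' ) {
                $out .= '</blockquote>';
                $in_quote = 0;
            }
            else {
                $out .= $t->[2];              # original text of the end tag
            }
            $after_h4 = $tag eq 'h4' ? 1 : 0;
        }
        elsif ( $type eq 'PI' ) { $out .= $t->[2] }
        else                    { $out .= $t->[1] }    # text, comment, declaration
    }
    $out .= "</blockquote>\n" if $in_quote;   # input ended inside a converted <p>
    return $out;
}

print retag("<h4>ALICE</h4>\n<p>Hello.\n<p>She exits.\n");
```

Text between the </h4> and the <p> does not reset the flag here, so whitespace or a stray word between them still leads to a conversion; tighten that if your markup needs it.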
Re^3: HTML module help please? by dragonchild (Archbishop) on Sep 01, 2004 at 02:25 UTC
I've never actually used any HTML parsing module (or any parsing module, for that matter). However, if you SuperSearch, you might be able to find some more info.