Cody Pendant has asked for the wisdom of the Perl Monks concerning the following question:

I know I'm being really lazy, but I'd appreciate monks' help/recommendations anyway.

I have a bunch of HTML files.

Every time an <h4> is directly followed by a <p>, I want to convert that <p> to a <blockquote>, then at the end of course I want to save the file with the changes.

Now, I can do this by slurping, regexing and writing back, but, you don't want me to do that, do you?

What's a good way to do this with a module?



($_='kkvvttuubbooppuuiiffssqqffssmmiibbddllffss')
=~y~b-v~a-z~s; print

Re: HTML module help please?
by tachyon (Chancellor) on Sep 01, 2004 at 02:55 UTC

    Here is an HTML::Parser API2 example that shows you the basics. You need to deal with two cases: one where you have a </p> and one where you don't. You also need to maintain state between the callbacks so you know what to do.

    my $data=<<DATA;
    <p>foo
    <h4>h4.1</h4>
    <p>bar
    <p>baz
    <h4>h4.2</h4>
    <p>bar</p>
    <p>baz</p>
    <h4>h4.3</h4>
    <hr>
    <p>bar
    <p>baz
    DATA

    {
        package MyParser;
        use base 'HTML::Parser';

        sub start {
            my($self, $tagname, $attr, $attrseq, $origtext) = @_;
            if ( $self->{blockquote} ) {
                # deal with no closing </p>
                print "</blockquote>\n$origtext";
                $self->{blockquote} = 0;
            }
            elsif ( $tagname eq 'p' and $self->{h4last} ) {
                print '<blockquote>';
                $self->{blockquote} = 1;
            }
            else {
                print $origtext;
            }
            # remember whether the tag just seen was an <h4>
            $self->{h4last} = $tagname eq 'h4' ? 1 : 0;
        }

        sub end {
            my($self, $tagname, $origtext) = @_;
            if ( $self->{blockquote} and $tagname eq 'p' ) {
                print '</blockquote>';
                $self->{blockquote} = 0;
            }
            else {
                print $origtext;
            }
        }

        sub text {
            my($self, $origtext, $is_cdata) = @_;
            print $origtext;
        }

        sub comment {
            my($self, $origtext ) = @_;
            print $origtext;
        }
    }

    my $p = MyParser->new;
    $p->parse($data);
    $p->eof;    # flush any text the parser is still buffering

    cheers

    tachyon

      Thank you very much.


Re: HTML module help please?
by eric256 (Parson) on Sep 01, 2004 at 01:49 UTC

    HTML::TreeBuilder should allow you to do what you want. It's pretty easy to get it to give you a list of the h4 elements; then just check whether each one's next sibling (or child, depending on layout) is a p and replace it. Be warned, though, that if your HTML is not pretty standard HTML to start with, HTML::TreeBuilder can do some pretty strange things.
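A minimal sketch of the sibling check described above, not code from the thread. The helper name `retag_with_tree` is hypothetical. It assumes TreeBuilder's default behaviour of trying not to keep whitespace-only text nodes between block elements, so that the `<p>` really is the next sibling of the `<h4>`.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: parse $html into a tree, retag each <p> that is the
# next sibling of an <h4>, and return the serialized document.
sub retag_with_tree {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);

    for my $h4 ( $tree->look_down( _tag => 'h4' ) ) {
        my $next = $h4->right;    # scalar context: the immediate right sibling
        # Text nodes are plain strings, so only elements pass the ref check.
        next unless ref $next and $next->tag eq 'p';
        $next->tag('blockquote');    # rename the element; attributes are kept
    }

    # The empty hashref asks as_HTML to write every end tag, </p> included.
    my $out = $tree->as_HTML( undef, undef, {} );
    $tree->delete;    # release the tree
    return $out;
}
```

Note that the result is a whole document: TreeBuilder adds the html, head and body elements it considers implied, which is one of the "strange things" to watch for on non-standard input.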


    ___________
    Eric Hodges
Re: HTML module help please?
by Your Mother (Archbishop) on Sep 01, 2004 at 02:48 UTC

    I think the code below will do what you want (semi-tested). But I'd have to booster pretty strongly against the perl solution. In this case you probably should do it with CSS. If the blockquote is truly a "quote," okay, but if you're only changing it for the formatting, you should format from without. Both solutions are below.

    use strict;
    use HTML::TokeParser;
    use CGI qw( blockquote );

    my $string = join '', <DATA>;
    my $string_ref = \$string;

    my $p = HTML::TokeParser->new( $string_ref );

    my $html;
    my $last_end_tag;
    while ( my $token = $p->get_token ) {
        if ( $token->[0] =~ /[TCD]/ ) {
            $html .= $token->[1];
        }
        elsif ( $token->[0] eq 'S' ) {
            if ( $token->[1] eq 'p' and $last_end_tag eq 'h4' ) {
                $html .= blockquote( $token->[2], $p->get_text('/p') );
                $p->get_token(); # toss the </p>
                $last_end_tag = 'blockquote';
            }
            else {
                $html .= $token->[-1];
            }
        }
        elsif ( $token->[0] eq 'E' ) {
            # if it's a new blockquote, it's closed already
            $html .= $token->[-1]
                unless $token->[1] eq '/p' and $last_end_tag eq 'h4';
            $last_end_tag = $token->[1];
        }
    }
    print $html;

    __END__
    <p>Stand alone p, or, aw, skip it.</p>
    <h4 class="title">This is an h4</h4>
    <p class="salad" id="taco">A first paragraph.</p>
    <p>A follower.</p>
    <p>Big finish. Or Finnish?</p>

    CSS solution.

    blockquote {
        /* format definition, whatever you want... */
        margin:1em 1em 1em 2em;
        font-style:italic;
        font-size:105%;
    }
    h4 + p {
        /* format definition identical to your blockquote def */
    }

    update: fixed grammar flatulence + one more, sigh.

        Oh, of course, you're right. I've become so accustomed to well-formed XHTML that I forget what the web really looks like :) Mine will fail on any unclosed paras. I tried fixing it, but your approach is much more sound for that. One thing mine does that yours omits is retain the attributes of the original para. But that's easy to update.

        print '<blockquote>';
        # becomes
        print CGI::start_blockquote( $attr );
      I'd have to booster pretty strongly against the perl solution. In this case you probably should do it with CSS.

      Thank you so much for the code.

      And, to reassure you, it's not about the formatting at all. I'm parsing HTML into XHTML and the original is a screenplay and I have a kind of rough schema where character names are H4, stage directions are P and speeches are BLOCKQUOTE, and ironically, both will probably look exactly the same in the browser.

      It might change later, but I've got to first achieve valid XHTML where the speeches are marked up as a distinct data type to the stage directions.



Re: HTML module help please?
by dragonchild (Archbishop) on Sep 01, 2004 at 01:54 UTC
    HTML::Parser is the standard way to go.

    ------
    We are the carpenters and bricklayers of the Information Age.

    Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose

    I shouldn't have to say this, but any code, unless otherwise stated, is untested

      Thanks both of you. I appreciate those links. I now remember I've installed and used HTML::TokeParser in the past as well.

      I'm feeling particularly dumb today obviously, but I'm not seeing any "write whole file back out" in those modules, and I'm not sure about the logic.

      What I obviously need is something like

      get each tag
      if (tag is </h4>){
          $just_found_h4_flag = 1;
          if tag is p && $just_found_h4_flag == 1
              change p to blockquote;
      } else {
          $just_found_h4_flag = 0.
      }
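The flag logic above can be sketched with HTML::TokeParser roughly as follows. This is a hedged sketch under stated assumptions, not code posted in the thread: the helper names `h4_p_to_blockquote` and `rewrite_file` are hypothetical, and it assumes every `<p>` that gets converted has an explicit `</p>`. The parser only reads the document, so writing the file back is ordinary Perl file I/O.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: return $html with each <p> that directly follows
# an </h4> rewritten as <blockquote>. Assumes converted paragraphs are closed.
sub h4_p_to_blockquote {
    my ($html) = @_;
    my $p = HTML::TokeParser->new( \$html ) or die "Cannot parse input";

    my $out           = '';
    my $just_found_h4 = 0;    # true right after an </h4>
    my $in_blockquote = 0;    # true while inside a converted <p>

    while ( my $token = $p->get_token ) {
        my $type = $token->[0];
        if ( $type eq 'S' ) {       # start tag: [S, tag, attr, attrseq, text]
            my ( $tag, $text ) = @{$token}[ 1, 4 ];
            if ( $tag eq 'p' and $just_found_h4 ) {
                # Swap only the tag name so the attributes survive.
                $text =~ s/^<p\b/<blockquote/i;
                $in_blockquote = 1;
            }
            $just_found_h4 = 0;
            $out .= $text;
        }
        elsif ( $type eq 'E' ) {    # end tag: [E, tag, text]
            my ( $tag, $text ) = @{$token}[ 1, 2 ];
            if ( $tag eq 'p' and $in_blockquote ) {
                $text          = '</blockquote>';
                $in_blockquote = 0;
            }
            $just_found_h4 = $tag eq 'h4' ? 1 : 0;
            $out .= $text;
        }
        elsif ( $type eq 'T' ) {    # text: [T, text, is_data]
            # Whitespace between </h4> and <p> must not clear the flag.
            $just_found_h4 = 0 if $token->[1] =~ /\S/;
            $out .= $token->[1];
        }
        else {                      # comments, declarations, PIs: text is last
            $out .= $token->[-1];
        }
    }
    return $out;
}

# Hypothetical helper: transform one file in place.
sub rewrite_file {
    my ($path) = @_;
    open my $in, '<', $path or die "Cannot read $path: $!";
    my $html = do { local $/; <$in> };
    close $in;

    my $new = h4_p_to_blockquote($html);    # transform before truncating

    open my $out, '>', $path or die "Cannot write $path: $!";
    print {$out} $new;
    close $out or die "Cannot close $path: $!";
}
```

Unlike the pseudocode, the check for `<p>` sits outside the `</h4>` branch: a single tag cannot be both, so the flag has to be set on one token and tested on a later one.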


        I've never actually used any HTML parsing module (or any parsing module, for that matter). However, if you SuperSearch, you might be able to find some more info.
