Re: HTML module help please?
by tachyon (Chancellor) on Sep 01, 2004 at 02:55 UTC
|
Here is an HTML::Parser API2 example that shows you the basics. You basically need to deal with 2 cases: one where you have a </p> and one where you don't. You need to maintain state between the callbacks so you know what to do.
my $data=<<DATA;
<p>foo
<h4>h4.1</h4>
<p>bar
<p>baz
<h4>h4.2</h4>
<p>bar</p>
<p>baz</p>
<h4>h4.3</h4>
<hr>
<p>bar
<p>baz
DATA
{
    package MyParser;
    use base 'HTML::Parser';

    sub start {
        my($self, $tagname, $attr, $attrseq, $origtext) = @_;
        if ( $self->{blockquote} ) {
            # deal with no closing </p>
            print "</blockquote>\n$origtext";
            $self->{blockquote} = 0;
        }
        elsif ( $tagname eq 'p' and $self->{h4last} ) {
            # first <p> after an <h4>: open a blockquote instead
            print '<blockquote>';
            $self->{blockquote} = 1;
        }
        else {
            print $origtext;
        }
        # remember whether the most recent start tag was an <h4>
        $self->{h4last} = $tagname eq 'h4' ? 1 : 0;
    }

    sub end {
        my($self, $tagname, $origtext) = @_;
        if ( $self->{blockquote} and $tagname eq 'p' ) {
            print '</blockquote>';
            $self->{blockquote} = 0;
        }
        else {
            print $origtext;
        }
    }

    sub text {
        my($self, $origtext, $is_cdata) = @_;
        print $origtext;
    }

    sub comment {
        my($self, $origtext ) = @_;
        print $origtext;
    }
}
my $p = MyParser->new;
$p->parse($data);
$p->eof;    # signal end of input so any buffered text is flushed
|
Thank you very much.
($_='kkvvttuubbooppuuiiffssqqffssmmiibbddllffss')
=~y~b-v~a-z~s; print
Re: HTML module help please?
by eric256 (Parson) on Sep 01, 2004 at 01:49 UTC
|
HTML::TreeBuilder should allow you to do what you want. It's pretty easy to get it to give you a list of the h4 elements; then just check whether each one's next sibling (or child, depending on layout) is a p and replace it. Be warned, though: if your HTML is not pretty standard to start with, HTML::TreeBuilder can do some pretty strange things.
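A rough, untested sketch of that idea, assuming the paragraph to convert is the next sibling of the h4. The sub name `quote_after_h4_tree` and the sample markup are made up for illustration. Note that TreeBuilder adds the implied html/head/body elements, so the output is a whole document rather than the original fragment.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: turn the first <p> following each <h4> into a
# <blockquote> and return the serialized tree.
sub quote_after_h4_tree {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);

    for my $h4 ( $tree->look_down( _tag => 'h4' ) ) {
        # right() in list context returns all following siblings;
        # text nodes come back as plain strings, elements as objects
        for my $sib ( $h4->right ) {
            next if !ref $sib and $sib !~ /\S/;    # skip whitespace-only text
            $sib->tag('blockquote') if ref $sib and $sib->tag eq 'p';
            last;                                  # only the nearest sibling counts
        }
    }

    # empty hash ref: treat no end tags as optional, so all get written
    my $out = $tree->as_HTML( undef, '  ', {} );
    $tree->delete;
    return $out;
}

print quote_after_h4_tree("<h4>ALICE</h4>\n<p>Hello there.\n<p>She exits.\n");
```

Changing the tag in place keeps the paragraph's attributes and children, which is the main attraction of the tree approach over a streaming one.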
Re: HTML module help please?
by Your Mother (Archbishop) on Sep 01, 2004 at 02:48 UTC
|
I think the code below will do what you want (semi-tested). But I'd have to argue pretty strongly against the Perl solution. In this case you probably should do it with CSS. If the blockquote is truly a "quote," okay, but if you're only changing it for the formatting, you should format from without. Both solutions are below.
use strict;
use HTML::TokeParser;
use CGI qw( blockquote );

my $string = join '', <DATA>;
my $string_ref = \$string;
my $p = HTML::TokeParser->new( $string_ref );
my $html;
my $last_end_tag = '';    # start empty so the first comparison is defined
while ( my $token = $p->get_token )
{
    if ( $token->[0] =~ /[TCD]/ )
    {
        # text, comments and declarations pass through untouched
        $html .= $token->[1];
    }
    elsif ( $token->[0] eq 'S' )
    {
        if ( $token->[1] eq 'p' and $last_end_tag eq 'h4' )
        {
            $html .= blockquote( $token->[2],
                                 $p->get_text('/p')
                               );
            $p->get_token(); # toss the </p>
            $last_end_tag = 'blockquote';
        }
        else
        {
            $html .= $token->[-1];
        }
    }
    elsif ( $token->[0] eq 'E' )
    {
        # the </p> of a converted paragraph was already consumed above,
        # so every end tag that reaches here is passed through unchanged
        $html .= $token->[-1];
        $last_end_tag = $token->[1];
    }
}
print $html;
__END__
<p>Stand alone p, or, aw, skip it.</p>
<h4 class="title">This is an h4</h4>
<p class="salad" id="taco">A first paragraph.</p>
<p>A follower.</p>
<p>Big finish. Or Finnish?</p>
CSS solution.
blockquote {
    /* format definition, whatever you want... */
    margin:1em 1em 1em 2em;
    font-style:italic;
    font-size:105%;
}
h4 + p {
    /* format definition identical to your blockquote def */
}
update: fixed grammar flatulence + one more, sigh.
|
Oh, of course, you're right. I've become so accustomed to well-formed XHTML that I forget what the web really looks like :) Mine will fail on any unclosed paras. I tried fixing it, but your approach is much sounder. One thing mine does that yours omits is retain the attributes of the original para. But that's easy to update.
print '<blockquote>';
# becomes
print CGI::start_blockquote( $attr );
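If you'd rather not pull in CGI just for the start tag, here is a minimal sketch of rebuilding it by hand. `blockquote_start_tag` is a hypothetical helper, not part of HTML::Parser or CGI. It assumes the `$attr` hash ref and `$attrseq` array ref that the `start` callback receives, and that the attribute values arrive already entity-decoded, so it escapes them again.

```perl
use strict;
use warnings;

# Hypothetical helper: build an opening <blockquote> tag that carries
# over the attributes of the original <p>, in their original order.
sub blockquote_start_tag {
    my ( $attr, $attrseq ) = @_;
    my $tag = '<blockquote';
    for my $name ( @{ $attrseq || [] } ) {
        my $value = $attr->{$name};
        next unless defined $value;
        # re-escape characters that are unsafe inside a quoted attribute
        $value =~ s/&/&amp;/g;
        $value =~ s/"/&quot;/g;
        $value =~ s/</&lt;/g;
        $value =~ s/>/&gt;/g;
        $tag .= qq{ $name="$value"};
    }
    return "$tag>";
}

print blockquote_start_tag( { class => 'salad', id => 'taco' },
                            [ 'class', 'id' ] ), "\n";
# prints: <blockquote class="salad" id="taco">
```

In the parser above, `print blockquote_start_tag( $attr, $attrseq );` would then take the place of `print '<blockquote>';`.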
|
I'd have to argue pretty strongly against the Perl solution. In this case you probably should do it with CSS.
Thank you so much for the code.
And, to reassure you, it's not about the formatting at all. I'm converting HTML to XHTML. The original is a screenplay, and I have a kind of rough schema where character names are H4, stage directions are P and speeches are BLOCKQUOTE. Ironically, both will probably look exactly the same in the browser.
It might change later, but I've got to first achieve valid XHTML where the speeches are marked up as a data type distinct from the stage directions.
Re: HTML module help please?
by dragonchild (Archbishop) on Sep 01, 2004 at 01:54 UTC
|
HTML::Parser is the standard way to go.
------
We are the carpenters and bricklayers of the Information Age.
Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose
I shouldn't have to say this, but any code, unless otherwise stated, is untested
|
Thanks both of you. I appreciate those links. I now remember I've installed and used HTML::TokeParser in the past as well.
I'm feeling particularly dumb today obviously, but I'm not seeing any "write whole file back out" in those modules, and I'm not sure about the logic.
What I obviously need is something like
get each tag
if (tag is </h4>) {
    $just_found_h4_flag = 1;
} elsif (tag is p && $just_found_h4_flag == 1) {
    change p to blockquote;
    $just_found_h4_flag = 0;
} else {
    $just_found_h4_flag = 0;
}
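For what it's worth, that flag logic maps fairly directly onto HTML::Parser's version 3 handlers. The sketch below is one possible way to do it, not a tested solution: the sub name `h4_paras_to_blockquotes` and the sample markup are made up, and it assumes that an unclosed paragraph ends at the next `<p>` or `<h4>`. Everything the parser sees is appended to a string, which is what gets you the whole file back out.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: returns the rewritten markup as one string.
sub h4_paras_to_blockquotes {
    my ($html) = @_;
    my $out      = '';
    my $after_h4 = 0;    # true when the last end tag seen was </h4>
    my $in_quote = 0;    # true while inside a converted paragraph

    # close a converted paragraph that never got its </p>
    my $close = sub {
        $out .= "</blockquote>\n" if $in_quote;
        $in_quote = 0;
    };

    my $p = HTML::Parser->new(
        api_version => 3,
        start_h     => [ sub {
            my ( $tag, $text ) = @_;
            $close->() if $tag eq 'p' or $tag eq 'h4';
            if ( $tag eq 'p' and $after_h4 ) {
                $out .= '<blockquote>';
                $in_quote = 1;
            }
            else {
                $out .= $text;
            }
            $after_h4 = 0;
        }, 'tagname, text' ],
        end_h => [ sub {
            my ( $tag, $text ) = @_;
            if ( $tag eq 'p' and $in_quote ) {
                $out .= '</blockquote>';
                $in_quote = 0;
            }
            else {
                $out .= $text;
            }
            $after_h4 = $tag eq 'h4' ? 1 : 0;
        }, 'tagname, text' ],
        # text, comments, declarations: copy through unchanged
        default_h => [ sub { $out .= shift }, 'text' ],
    );

    $p->parse($html);
    $p->eof;        # flush anything the parser is still buffering
    $close->();     # input ended inside a converted paragraph
    return $out;
}

print h4_paras_to_blockquotes("<h4>ALICE</h4>\n<p>Hello.</p>\n<p>Exit.</p>\n");
```

To write the result to disk, print the returned string to a filehandle opened for writing instead of to STDOUT.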
|
I've never actually used any HTML parsing module (or any parsing module, for that matter). However, if you SuperSearch, you might be able to find some more info.