Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

The short and skinny: I need to strip the face="foo" value from <font face="foo,bar,blort">My Foo</font> tags found in a document I have slurped into a scalar.

I tried messing around with some HTML::Parser code, as well as hstrip, but they didn't get me where I need to be. I also tried HTML::TagFilter and HTML::TreeBuilder with the same level of success: none. merlyn also has an article on something similar, but it removes the tags themselves, leaving the text values. Close to what I need, but not quite there.

The glitch here is that I need the color="#RRGGBB" value in the tag, but I need to drop anything else that appears in there, leaving just the font tag and its color attribute and value. The other sticky point is that many people use single quotes around attribute values, some use none, and a simple regex would have to be quite smart to figure this out (and would likely be rife with errors).

Doing this with exclusively regexes is going to be prone to failure, especially since tags can be improperly nested, so I can't just yank from <font .*?> to </font> and work on the remainder.

Here's an example of what my input could look like, and what I need for final output:

Input:  <font color="#000000" face="Arial,Helvetica" size="1"> Some text </font>
Output: <font color="#000000"> Some text </font>

Can any monk lend a hand?

Re: Stripping font "face" values from font tags
by Fletch (Bishop) on May 28, 2003 at 16:36 UTC

    Fish.

    use HTML::TokeParser ();

    ## index of the original source text within each token type
    ## (S = start tag, E = end tag, T = text, C = comment,
    ##  D = declaration, PI = processing instruction)
    my %verb = ( S => 4, E => 2, T => 1, C => 1, D => 1, PI => 2 );

    my $p = HTML::TokeParser->new( "foo.html" )
      or die "can't open foo.html: $!\n";

    while( my $t = $p->get_token ) {
      if( $t->[0] eq 'S' and $t->[1] eq 'font' ) {
        ## rebuild the font tag without its face attribute
        my $attr = $t->[2];
        delete $attr->{face};
        print "<font ", join( " ", map { qq{$_="$attr->{$_}"} } keys %{$attr} ), ">";
      } else {
        ## print verbatim . . .
        print $t->[ $verb{ $t->[0] } ]
      }
    }

    exit 0;
    __END__

      Nice code. It's even easier with HTML::TokeParser::Simple, though.

      use HTML::TokeParser::Simple;

      my $p = HTML::TokeParser::Simple->new( "foo.html" )
        or die "can't open foo.html: $!\n";

      while( my $t = $p->get_token ) {
        if( $t->is_start_tag('font') ) {
          my $attr = $t->return_attr;
          delete $attr->{face};
          my $attributes = join( " ", map { qq{$_="$attr->{$_}"} } keys %$attr );
          print "<font $attributes>";
        } else {
          print $t->as_is;
        }
      }

      Cheers,
      Ovid

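    For the case in the original question (HTML already slurped into a scalar, and only color to be kept), the token loops above might be adapted roughly as follows. This is a sketch, not something tested against real-world pages: keep_font_attrs is a hypothetical helper name, it assumes HTML::TokeParser from the HTML-Parser distribution is installed, and it re-quotes values naively, so a value containing a double quote would need escaping.

```perl
use strict;
use warnings;
use HTML::TokeParser ();    # from the HTML-Parser distribution (not core Perl)

# Hypothetical helper: on every <font> start tag, keep only the
# attributes named in @keep and drop everything else.
sub keep_font_attrs {
    my ( $html, @keep ) = @_;
    my %keep = map { lc($_) => 1 } @keep;

    # Index of the original source text within each token type.
    my %text_at = ( S => 4, E => 2, T => 1, C => 1, D => 1, PI => 2 );

    # A scalar reference makes the parser read the string, not a file.
    my $p = HTML::TokeParser->new( \$html )
        or die "can't create parser\n";

    my $out = '';
    while ( my $t = $p->get_token ) {
        if ( $t->[0] eq 'S' and $t->[1] eq 'font' ) {
            my ( $attr, $order ) = @{$t}[ 2, 3 ];

            # $t->[3] lists the attribute names in document order.
            # Assumption: values contain no double quotes.
            my @pairs = map  { qq{$_="$attr->{$_}"} }
                        grep { $keep{$_} } @$order;
            $out .= join( ' ', '<font', @pairs ) . '>';
        }
        else {
            # Everything else is copied through verbatim.
            $out .= $t->[ $text_at{ $t->[0] } ];
        }
    }
    return $out;
}

my $in = q{<font color='#000000' face=Arial size="1"> Some text </font>};
print keep_font_attrs( $in, 'color' ), "\n";
```

    Using $t->[3] rather than keys keeps any surviving attributes in their original order, and a font tag left with no attributes comes out as a plain <font>.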

Re: Stripping font "face" values from font tags
by CukiMnstr (Deacon) on May 28, 2003 at 16:31 UTC
    From HTML::TagFilter's POD:
    "It can act in a more or less fine-grained way - you can specify permitted tags, permitted attributes of each tag, and permitted values for each attribute in as much detail as you like."
    So it is possible to do what you want. What problem are you having with HTML::TagFilter? Can you post some code so we can look at it and help you spot the problem?

    Hope this helps.

Re: Stripping font "face" values from font tags
by BrowserUk (Patriarch) on May 28, 2003 at 18:08 UTC

    This probably doesn't work ... somewhere? But on all of the most pathological pages I could find (a sample of 5 is in the DATA section), it seems to do fine.

    #! perl -slw
    use strict;
    use LWP::Simple;

    while( <DATA> ) {
        chomp;
        my $html = get $_;
        defined $html or do { warn "can't fetch $_\n"; next };

        ## match each whole <FONT ...> start tag, stepping over quoted values
        $html =~ s[(
            <FONT \s+
            (?: "[^"]*" | '[^']*' | [^>"'] )*
            >
        )]{
            my $tag = $1;
            ## drop face=value, whether double-quoted, single-quoted or bare
            $tag =~ s[ face \s* = \s* (?: "[^"]*" | '[^']*' | [^\s>]+ ) ][]ixm;
            $tag
        }eximg;

        ## write the result to a temp file and hand it to the shell
        my $out = $ENV{TMP} . '\\' . time() . '.htm';
        open OUT, '>', $out or warn $!;
        print OUT $html;
        close OUT;
        system( $out );
    }

    __DATA__
    http://www.webdiner.com/annexe/font/font.htm
    http://www.electricearl.com/fonttest.html
    http://www.ilovethisplace.com/webdesign/fonts.html
    http://www.york.ac.uk/depts/maths/symbchrc.htm
    http://www.tedmontgomery.com/tutorial/style.html#face

    Note: system probably won't load the modified sample directly into the browser on systems whose command line doesn't know what to do with .htm files.
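
    To try a substitution of this kind without fetching anything over the network, it can be wrapped in a sub and run against the three quoting styles mentioned in the question. This is only a sketch using core Perl: strip_face is a hypothetical name, the sample strings are made up, and a regex like this can still be fooled (for example by the text face=... appearing inside another attribute's value).

```perl
use strict;
use warnings;

# Hypothetical helper: remove face=... from every <font ...> start tag.
sub strip_face {
    my ($html) = @_;
    $html =~ s{
        ( <font \b (?: "[^"]*" | '[^']*' | [^>"'] )* > )   # one whole start tag
    }{
        my $tag = $1;
        # The value may be double-quoted, single-quoted or bare.
        $tag =~ s{ \s+ face \s* = \s* (?: "[^"]*" | '[^']*' | [^\s>]+ ) }{}xig;
        $tag;
    }xige;
    return $html;
}

for my $sample (
    q{<font color="#000000" face="Arial,Helvetica" size="1">a</font>},
    q{<FONT face='Times New Roman' color='#ff0000'>b</FONT>},
    q{<font size=2 face=Courier>c</font>},
) {
    print strip_face($sample), "\n";
}
```

    The bare-value branch stops at whitespace or ">", so a trailing face=Courier does not swallow the end of the tag.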


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller