Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I made these regexes to match and replace html font tags, but rather than doing what they're supposed to, they end up cutting out the entire html document, save for the last line or so of text.

These are the parameters I need to meet:
-the only important attributes are COLOR, SIZE, and FACE (thus the ones I need to match and send to the correct subroutine)
-the above attributes will ALWAYS be in the order of COLOR, SIZE, FACE
-each font tag could have all or any combination of the above attributes
-some font tags also have BACKGROUND-COLOR and PTSIZE attributes, which could match falsely as COLOR and SIZE if they're not prefixed by whitespace

$instream =~ s/<FONT(.*?)\sCOLOR\s?=\s?('|")?(\#......)\2(.*?)\sSIZE\s?=\s?(\d+)(.+?)FACE\s?=\s?('|")?([^\7]+)\7[^>]*>/genStyleCSF($3,$5,$8)/iegs;
$instream =~ s/<FONT(.*?)\sCOLOR\s?=\s?('|")?(\#......)\2(.*?)\sSIZE\s?=\s?(\d+)[^>]*>\s*/genStyleCS($3,$6)/iegs;
$instream =~ s/<FONT(.*?)\sCOLOR\s?=\s?('|")?(\#......)\2(.+?)FACE\s?=\s?('|")?([^\5]+)\5[^>]*>\s*/genStyleCF($3,$6)/iegs;
$instream =~ s/<FONT(.*?)\sCOLOR\s?=\s?('|")?(\#......)\2[^>]*>\s*/genStyleC($3)/iegs;
$instream =~ s/<FONT(.*?)\sSIZE\s?=\s?(\d+)(.+?)FACE\s?=\s?('|")?([^\4]+)\4[^>]*>\s*/genStyleSF($2,$4)/iegs;
$instream =~ s/<FONT(.*?)\sSIZE\s?=\s?(\d+)[^>]*>\s*/genStyleS($2)/iegs;
$instream =~ s/<FONT(.+?)FACE\s?=\s?('|")?([^\2]+)\2[^>]*>\s*/genStyleF($3)/iegs;

Edit kudra, 2001-10-09 Changed title per ntc request

Replies are listed 'Best First'.
Re: regex problem...
by hopes (Friar) on Oct 08, 2001 at 22:21 UTC
    You can use HTML::Parser
    See this node for more information. It would be easier.
    Hopes
Re: regex problem...
by wog (Curate) on Oct 08, 2001 at 23:33 UTC
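    The reason it's cutting out the entire file is probably that [^\2] does not mean "any character except the one captured by the second group of capturing parens". Instead it matches anything but character number 2 (ASCII STX). (The \<number> escape does not appear to be special in character classes.) Since that class also matches quotes, > and newlines, the greedy + runs to the end of the document and backtracks only as far as the last quote that still has a > somewhere after it, so a single match swallows nearly everything.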
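
A minimal sketch of the point above, not the original poster's code. The sample document, the helper face_span, and the "corrected" pattern are all assumptions made for illustration. It compares the FACE-only pattern from the question with a variant that spells out "anything except the captured quote" as a negative lookahead, and counts how many substitutions each one makes.

```perl
use strict;
use warnings;

# Inside a character class, \2 is read as an octal escape (chr 2),
# so [^\2] rejects only that one control character.
my $rejects_stx   = "\x02" !~ /^[^\2]$/;
my $accepts_quote = '"'    =~ /^[^\2]$/;

# Hypothetical sample input with two separate font tags.
my $html = qq{<FONT FACE="Arial">one</FONT>\n<p>two</p>\n<FONT FACE="Times">three</FONT>\n};

# Hypothetical stand-in for the poster's genStyleF().
sub face_span {
    my ($face) = @_;
    return qq{<span style="font-family:$face">};
}

# Pattern from the question: [^\2]+ is free to run past quotes and tags.
my $broken = $html;
my $broken_count =
    ($broken =~ s/<FONT(.+?)FACE\s?=\s?('|")?([^\2]+)\2[^>]*>\s*/face_span($3)/iegs);

# Assumed fix: require a quote, then take characters that are neither
# that quote nor '>' by testing (?!\2) before each one.
my $fixed = $html;
my $fixed_count =
    ($fixed =~ s/<FONT\b([^>]*?)\sFACE\s?=\s?(['"])((?:(?!\2)[^>])+)\2[^>]*>\s*/face_span($3)/iegs);

print "original pattern:  $broken_count substitution(s)\n";
print "lookahead pattern: $fixed_count substitution(s)\n";
```

On this sample the original pattern makes a single substitution that spans both tags, while the lookahead variant replaces each tag on its own. The variant only handles quoted values, and it is still a regex working on HTML, so the advice about using a parser stands.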

    As stated before, you are better off using HTML::Parser (or similar) for this task. Parsing HTML correctly with a regex is virtually impossible, and the HTML::Parser or HTML::TokeParser modules make it relatively easy.

      Alright, everyone keeps telling me to use HTML::Parser... but I need to match each opening font tag with the correct closing font tag. See, for each font tag, I substitute code that sets the font color, font size, and font face separately, so I also need to end the color, size, and face stylings separately. To do that, I need to match each opening font tag with the closing font tag that belongs to it. How would I do that with HTML::Parser? I've read through the POD documentation, and I don't see how I could do that...
        Well, I think I can do it, but in this code:
        $p = HTML::Parser->new(api_version => 3, start_h => [sub { ... }, "self,tagname,attr,text"], );
        how do I replace the starting font tag *and* the text contained in the font tags with something generated in sub { ... }?
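
One possible way to approach the question above is sketched below, under stated assumptions. It uses the HTML::Parser version 3 handler API (start_h, end_h, default_h) and keeps a stack, so that each closing font tag emits the closers for whatever its own opening tag set. The function name convert_fonts and the bracketed placeholder markup ([color=...], [/color]) are hypothetical; the real output would come from the poster's genStyle* subroutines, which the thread does not show.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical converter: rewrites <font> tags, copies everything else through.
sub convert_fonts {
    my ($html) = @_;
    my $out = '';
    my @stack;    # one array ref of closing markup per open <font>

    my $p = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tagname, $attr, $text) = @_;
            if ($tagname ne 'font') { $out .= $text; return; }
            # Attribute names arrive lowercased and as separate hash keys,
            # so ptsize and background-color cannot be mistaken for size/color.
            my @present = grep { defined $attr->{$_} } qw(color size face);
            $out .= join '', map { sprintf '[%s=%s]', $_, $attr->{$_} } @present;
            # Remember the closers for this tag, innermost style first.
            push @stack, [ map { sprintf '[/%s]', $_ } reverse @present ];
        }, 'tagname,attr,text' ],
        end_h => [ sub {
            my ($tagname, $text) = @_;
            if ($tagname eq 'font' && @stack) {
                $out .= join '', @{ pop @stack };
            }
            else {
                $out .= $text;
            }
        }, 'tagname,text' ],
        # Text, comments and anything else pass through unchanged.
        default_h => [ sub { $out .= $_[0] if defined $_[0] }, 'text' ],
    );

    $p->parse($html);
    $p->eof;
    return $out;
}

print convert_fonts(
    '<p><FONT COLOR="#ff0000" PTSIZE=10 FACE="Arial">hi <FONT SIZE=2>there</FONT></FONT></p>'
), "\n";
```

The text between the tags does not need to be replaced by the start handler at all: it flows through default_h, and the stack guarantees that nested font tags are closed in the right order. A stray closing font tag with nothing on the stack is copied through as-is, which is one arbitrary choice among several.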