PerlMonks  

Regex not working

by Beaker (Beadle)
on Jul 17, 2015 at 15:13 UTC (#1135180=perlquestion)

Beaker has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to turn HTML into BBCode where there is a matching end tag. The following regex is not doing anything:

$text =~ s~<font color=("|')?((\w|\s)+)("|')?>(.*?)<\/font>~\[color=$2\]$5\[/color\]~ig;

Sample input:

<font color="blue"><i><b> <br>Some text</font>,  <br>an a

P.S. I know the HTML is deprecated; it's legacy code and I don't have the time to work out how to re-parse all the content (of which there is a lot!).

Thanks for any help.

Replies are listed 'Best First'.
Re: Regex not working
by hippo (Bishop) on Jul 17, 2015 at 15:26 UTC

    I do not know why you think it does nothing:

    $ cat foo.pl
    #!/usr/bin/perl -w
    use strict;
    use warnings;

    my $text = '<font color="blue"><i><b> <br>Some text</font>, <br>an a';
    $text =~ s~<font color=("|')?((\w|\s)+)("|')?>(.*?)<\/font>~\[color=$2\]$5\[/color\]~ig;
    print $text . "\n";
    $ ./foo.pl
    [color=blue]<i><b> <br>Some text[/color], <br>an a
    $

    That's not to say the regex couldn't be improved/simplified, but it certainly seems to have an effect as it stands.

      Maybe $text actually has newlines in it and you need to add the /s modifier? Also, bracketing regex delimiters are more readable.
      $text =~ s{...}{...}igs;
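
To illustrate the newline guess: by default `.` does not match "\n", so a font element whose content spans two lines is left untouched unless /s is added. A minimal sketch, assuming a made-up two-line variant of the sample input (the variable names are only for illustration):

```perl
use strict;
use warnings;

# Hypothetical input: the sample markup, but with a newline inside the tag content.
my $input = qq{<font color="blue"><i><b> <br>Some\ntext</font>, <br>an a};

my $without_s = $input;
my $with_s    = $input;

# Without /s, "." cannot cross the newline, so (.*?)</font> never matches.
my $n1 = $without_s =~ s{<font color=("|')?((\w|\s)+)("|')?>(.*?)</font>}{[color=$2]$5\[/color]}ig;

# With /s, "." also matches "\n", so the tag is converted.
my $n2 = $with_s =~ s{<font color=("|')?((\w|\s)+)("|')?>(.*?)</font>}{[color=$2]$5\[/color]}igs;

print "without /s: ", ($n1 || 0), " substitution(s)\n";
print "with /s:    ", ($n2 || 0), " substitution(s)\n";
print "$with_s\n";
```

If the real data contains line breaks inside font elements, this difference alone would explain "nothing changes".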
Re: Regex not working
by Monk::Thomas (Friar) on Jul 17, 2015 at 19:15 UTC

    The script you posted is working perfectly fine. There must be a problem with the actual input you are using. That being said, there's quite a lot you could do to improve the RegExp:

    ((\w|\s)+) is better written as a character class, e.g.: ([\w\s]+)

    Are you sure this is what you want? According to W3C there are no HTML colors with whitespace in their name.
    Use \s*(\w+)\s* instead? Additionally, the pattern fails to match valid colors like #ff0000.

    ("|')? .... ("|')? -> would match "abc'; better use a back-reference: (["']?) .... \1

    <\/font> can be replaced with </font> since you are using s~~~ instead of s///

    \[color=$2\]$5\[/color\] can be converted into [color=$2]$5\[/color] (must keep the escape after $5 or perl complains)
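
Putting the suggestions above together, one possible rewrite looks like the sketch below. It is an illustration, not the only way to do it: the sub name `font_to_bbcode` is hypothetical, the capture groups are renumbered ($2 is the color, $3 the content), and accepting `#` in the color value as well as adding /x and /s are assumptions on top of the list above.

```perl
use strict;
use warnings;

# Hypothetical helper, only for illustration.
sub font_to_bbcode {
    my ($text) = @_;
    $text =~ s{
        <font \s+ color \s* = \s*
        (["']?)              # group 1: optional opening quote
        \s* ([\#\w]+) \s*    # group 2: color name or hex value such as #ff0000
        \1                   # the same quote character (or nothing) closes the value
        \s* >
        (.*?)                # group 3: tag content, non-greedy
        </font>
    }{[color=$2]$3\[/color]}xgis;
    return $text;
}

print font_to_bbcode(q{<font color="blue"><i><b> <br>Some text</font>,  <br>an a}), "\n";
print font_to_bbcode(q{<font color='#ff0000'>red</font>}), "\n";
```

The back-reference means a mismatched pair such as "blue' no longer matches, so that tag is left unchanged rather than converted.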

    Additionally: I don't know where the input is coming from and how much you can trust it. Your code 'fails open': nothing happens if it fails to match. I would suggest converting it to 'fail closed': complain loudly, or even terminate, if a font tag cannot be matched properly.

    while (parse input) {
      if (a font tag is found) {
        if (font tag matches expected pattern) {
          (replace font tag)
        }
        else {
          (complain loudly)
        }
      }
    }
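
The pseudocode above could be sketched roughly as follows. This is only one way to get the fail-closed behaviour: instead of looping over tags, it converts everything the strict pattern recognises and then dies if anything that still looks like a font tag is left over. The sub name `convert_or_die` and the exact pattern are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: convert colored font tags and die on any font tag
# that the strict pattern did not recognise.
sub convert_or_die {
    my ($text) = @_;

    $text =~ s{<font \s+ color \s* = \s* (["']?) ([\#\w]+) \1 \s* > (.*?) </font>}
              {[color=$2]$3\[/color]}xgis;

    # Anything that still looks like a font tag was not converted: complain loudly.
    if ($text =~ m{(</?font\b[^>]*>)}i) {
        die "Unconverted font tag: $1\n";
    }
    return $text;
}

print convert_or_die(q{<font color="blue">Some text</font>}), "\n";

# A font tag without a color attribute is reported instead of being silently skipped.
my $ok = eval { convert_or_die(q{<font size="2">small</font>}); 1 };
print "error: $@" unless $ok;
```

Whether to die, warn, or log and continue depends on how much the input can be trusted; nested font tags also end up in the error branch here, because the inner tag is left unconverted.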
    
Re: Regex not working
by codiac (Beadle) on Jul 18, 2015 at 05:09 UTC

    HTML::BBReverse - Perl module to convert HTML to BBCode and back.

    If you post an actual line that is failing it might be helpful.

Re: Regex not working
by tangent (Parson) on Jul 18, 2015 at 00:32 UTC
    When I saw this question my first reaction was the usual "just use a parser", so I thought I'd knock up something using HTML::Parser. That brought up some problems that I don't think can properly be solved using regular expressions, especially, what do you do if there are nested font tags as can happen (this is what I have the $font++ and $font-- for below).

    So this bit of code is a bit longer than I expected but now that I've done it I might as well post. Note it only replaces font tags that have a color attribute, everything else is left as is.

    use HTML::Parser ();

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h => [\&start_tag, "tagname, attr, text"],
        text_h  => [\&text_content, "text"],
        end_h   => [\&end_tag, "tagname, text"],
    );

    # $font tracks the current <font> nesting depth;
    # %colored records which depths were opened with a color attribute.
    my ( $font, %colored );

    sub start_tag {
        my ($tagname, $attr, $text) = @_;
        if ($tagname eq 'font') {
            $font++;
            if (my $color = $attr->{'color'}) {
                print "[color=$color]";
                $colored{$font}++;
                return;
            }
        }
        print $text;
    }

    sub text_content {
        my $text = shift;
        print $text;
    }

    sub end_tag {
        my ($tagname, $text) = @_;
        if ($tagname ne 'font') {
            print $text;
            return;
        }
        # delete clears the flag, so a later uncolored <font> at the same
        # depth is not closed with [/color] by mistake
        if (delete $colored{$font}) {
            print "[/color]";
        }
        else {
            print $text;
        }
        $font--;
    }

    my $html1 = q|<font color="blue"><i><b> <br>Some text</font>, <br>an a|;
    $parser->parse($html1);
    $parser->eof;

    print "\n\n";

    my $html2 = q|<font color="blue"><i><b> <font>breaker</font><br><font color="#ff0000">Some</font> text</font>, <br>an a|;
    $parser->parse($html2);
    $parser->eof;

    Output:
    [color=blue]<i><b> <br>Some text[/color], <br>an a

    [color=blue]<i><b> <font>breaker</font><br>[color=#ff0000]Some[/color] text[/color], <br>an a
Re: Regex not working
by GotToBTru (Prior) on Jul 17, 2015 at 15:47 UTC

    "Not working" is not helpful. What output do you desire? What output are you getting? We can be much more helpful when there's less guessing involved!

    Dum Spiro Spero

      Not working in that nothing changes with the input

      Desired output from

      <color=blue><i><b> <br>Some text</color>,  <br>an a

      would be..

      [color=blue]<i><b> <br>Some text[/color],  <br>an a
        Sorry, I'm struggling to use this website and can't seem to edit my post. The input is given in my first post; ignore the input in my second post, as it's incorrect. The expected output is correct.

Node Type: perlquestion [id://1135180]
Approved by toolic
Front-paged by toolic