markhoy has asked for the wisdom of the Perl Monks concerning the following question:

Using regular expressions, I am trying to replace a pattern within a string, but the replacement should have no effect on text between "<" and ">". The result should be HTML whose tags are unaffected by the changes, with only the text between the tags altered. Thanks for any help!! Mark

Re: Regexp to ignore HTML tags
by davorg (Chancellor) on Mar 31, 2003 at 14:59 UTC

    If you're dealing with HTML then you should look at HTML::Parser or one of its subclasses.

    Update: Here's one example:

    #!/usr/bin/perl
    use warnings;
    use strict;
    use HTML::Parser;

    # Text nodes go to text(); everything else (tags, comments, ...) passes through unchanged.
    my $p = HTML::Parser->new(text_h    => [\&text, 'text'],
                              default_h => [\&passthru, 'text']);
    $p->parse_file(*DATA);

    sub text {
        $_[0] =~ s/foo/bar/;
        print $_[0];
    }

    sub passthru {
        print $_[0];
    }

    __DATA__
    <html>
    <head><title>foo</title></head>
    <body>
    <h1 class="foo">foo</h1>
    </body>
    </html>
    --
    <http://www.dave.org.uk>

    "The first rule of Perl club is you do not talk about Perl club."
    -- Chip Salzenberg

(jeffa) Re: Regexp to ignore HTML tags
by jeffa (Bishop) on Mar 31, 2003 at 15:06 UTC
    Here is a very brute-force method that replaces all text items 'foo' with 'bar' using HTML::Parser (not HTML::Template, as I first wrote). The idea is to set callbacks for every time:
    1. a start tag is encountered
    2. the inside text element is encountered
    3. the end tag is encountered
    For the first and third cases, we simply regurgitate what we found to STDOUT. For the middle case, we substitute. Hope this helps. :)
    use strict;
    use warnings;
    use HTML::Parser;

    my $parser = HTML::Parser->new(api_version => 3);
    $parser->handler(start => \&start, 'self,tagname,attr');
    $parser->handler(text  => \&text,  'self,dtext');
    $parser->handler(end   => \&end,   'self,tagname');
    $parser->parse(q|<foo bar="qux" baz="foo">foo</foo>|);

    sub start {
        my ($parser,$tag,$attr) = @_;
        print "<$tag";
        # we lose the original order of attribs, but we'll live ;)
        print qq| $_="$attr->{$_}"| for keys %$attr;
        print ">";
    }

    sub text {
        my ($parser,$text,$attr) = @_;
        $text =~ s/foo/bar/g;
        print "$text";
    }

    sub end {
        my ($parser,$tag) = @_;
        print "</$tag>";
    }
    UPDATE:
    Thanks hiseldl ... maybe it really is time for me to switch from HTML::Template to TT! :D

    jeffa

    L-LL-L--L-LL-L--L-LL-L--
    -R--R-RR-R--R-RR-R--R-RR
    B--B--B--B--B--B--B--B--
    H---H---H---H---H---H---
    (the triplet paradiddle with high-hat)
    
Re: Regexp to ignore HTML tags
by hardburn (Abbot) on Mar 31, 2003 at 15:00 UTC

    Regexen are generally considered unsuitable for working with HTML. They'll work in simple cases, but fail on the many, many complex ones. You're better off using one of the many HTML parsing modules on CPAN.
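
    To make the "simple cases work, complex cases don't" point concrete, here is a minimal sketch of a naive regex approach. The sub name naive_replace and the sample strings are made up for illustration; the pattern assumes a tag is anything from "<" up to the next ">", which is exactly the assumption that breaks when a quoted attribute value contains ">".

```perl
use strict;
use warnings;

# Naive approach: assume a tag runs from "<" to the next ">" and
# substitute only in the pieces between such tags.
sub naive_replace {
    my ($html) = @_;

    # The capturing group makes split() return the tags as well as the text.
    my @parts = split /(<[^>]*>)/, $html;

    for my $part (@parts) {
        next if $part =~ /^</;    # looks like a tag: leave it alone
        $part =~ s/foo/bar/g;     # plain text: substitute
    }
    return join '', @parts;
}

# Simple case: the attribute is left alone, the text is changed.
print naive_replace('<p class="foo">foo</p>'), "\n";

# Complex case: the ">" inside the quoted alt value ends the "tag" too early,
# so the rest of the attribute is treated as text and gets rewritten.
print naive_replace('<img alt="a > foo" src="x.png">foo'), "\n";
```

    The second call rewrites the foo inside the alt attribute, which is the kind of silent damage a real parser avoids.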

    ----
    I wanted to explore how Perl's closures can be manipulated, and ended up creating an object system by accident.
    -- Schemer

    Note: All code is untested, unless otherwise stated

Re: Regexp to ignore HTML tags
by Jenda (Abbot) on Mar 31, 2003 at 15:39 UTC

    Quite likely the others are right that you should use an existing HTML parser. If for some reason that is overkill for you, or if the string is not really valid HTML, you may try something like this:

    use HTML::Entities;

    # $AllowXHTML is a flag defined elsewhere; when true, self-closing tags ("/>") are also accepted.
    sub PolishHTML {
        my $str = shift;
        if ($AllowXHTML) {
            $str =~ s{(.*?)(&\w+;|&#\d+;|<\w[\w\d]*(?:\s+\w[\w\d]*(?:\s*=\s*(?:[^" '><\s]+|(?:'[^']*')+|(?:"[^"]*")+))?)*\s*/?>|</\w[\w\d]*>|$)}
                     {HTML::Entities::encode($1, '^\r\n\t !\#\$%\"\'-;=?-~').$2}gem;
        } else {
            $str =~ s{(.*?)(&\w+;|&#\d+;|<\w[\w\d]*(?:\s+\w[\w\d]*(?:\s*=\s*(?:[^" '><\s]+|(?:'[^']*')+|(?:"[^"]*")+))?)*\s*>|</\w[\w\d]*>|$)}
                     {HTML::Entities::encode($1, '^\r\n\t !\#\$%\"\'-;=?-~').$2}gem;
        }
        return $str;
    }
    (This function escapes all characters special to HTML that are not part of valid HTML tags or entities.)
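
    The same idea of matching whole tags, quoted attribute values included, can be turned towards the original question. The following is only a sketch under stated assumptions: the sub name replace_outside_tags is invented here, the tag pattern is much simpler than the one above, and comments, CDATA and script bodies are not treated specially.

```perl
use strict;
use warnings;

# Replace a literal string only outside of tags.
# A "tag" here is "<", then any mix of quoted strings and other characters, then ">".
sub replace_outside_tags {
    my ($html, $from, $to) = @_;

    # Alternation trick: if the tag branch matched, put the tag back unchanged;
    # otherwise the literal $from matched in plain text, so substitute $to.
    $html =~ s{(<(?:"[^"]*"|'[^']*'|[^'">])*>)|\Q$from\E}
              { defined($1) ? $1 : $to }ge;

    return $html;
}

print replace_outside_tags('<h1 class="foo">foo</h1>', 'foo', 'bar'), "\n";
print replace_outside_tags('<img alt="a > foo" src="x.png">foo', 'foo', 'bar'), "\n";
```

    Because quoted values are consumed as a unit, a ">" inside an attribute no longer ends the tag early. Malformed markup (an unclosed quote, a stray "<") can still confuse it, which is where a parser module remains the safer choice.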

    Jenda
    Always code as if the guy who ends up maintaining your code will be a violent psychopath who knows where you live.
       -- Rick Osborne

    Edit by castaway: Closed small tag in signature

      Thanks All!! Will give all suggestions a try ASAP.