ChevLucas has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I need to delete all HTML tags from an HTML file. First I read the file into a single variable, and then I delete the tags with this code:
while ( $file =~ /(<[^>]*>)/g ) {
    # print "'$1'\n";
    $file =~ s/$1//g;
}
It works fine, but it cannot cope with some expressions, e.g. '<?xml version="1.0" encoding="UTF-8"?>'. Any ideas?

Replies are listed 'Best First'.
Re: s/// don't delete matching phrase
by Corion (Patriarch) on Oct 28, 2014 at 12:25 UTC

    s/$1//g does not automatically escape regex metacharacters such as * or (in your case) ?, so the interpolated text is treated as a pattern rather than as a literal string. See \Q..\E and/or quotemeta.

    Most likely, you want:

    $file =~ s/\Q$1\E//g;
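A minimal sketch of the difference, assuming the captured tag is the XML declaration from the question (the variable names are illustrative only). Interpolated unescaped, the `?` characters act as quantifiers and the `.` matches any character, so the pattern no longer matches the literal tag text:

```perl
use strict;
use warnings;

my $tag  = '<?xml version="1.0" encoding="UTF-8"?>';
my $file = $tag . '<p>Hello</p>';

# Unescaped: "<?" means "an optional <" and '"?' means "an optional quote",
# so the literal "?" characters in the text are never matched.
(my $unescaped = $file) =~ s/$tag//g;

# Escaped: \Q...\E (or quotemeta) makes every character match literally.
(my $escaped = $file) =~ s/\Q$tag\E//g;

print "unescaped: $unescaped\n";   # declaration is still present
print "escaped:   $escaped\n";     # declaration has been removed
```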
Re: s/// don't delete matching phrase
by Loops (Curate) on Oct 28, 2014 at 12:38 UTC

    If you hit a point where regex isn't working for you, consider one of the parsers on CPAN. For instance HTML::HTML5::Parser can make quick work of removing all the tags:

    use HTML::HTML5::Parser;
    my $parser = HTML::HTML5::Parser->new;
    my $doc = $parser->parse_string(<<'EOT');
    <?xml version="1.0" encoding="UTF-8"?>
    <title>Foo</title>
    <p><b><i>Foo</b> bar</i>.
    <p>Baz</br>Quux.
    EOT
    print $doc->textContent;
    Output:
    Foo Foo bar. BazQuux.
Re: s/// don't delete matching phrase
by hippo (Archbishop) on Oct 28, 2014 at 12:28 UTC

    Your example expression includes metacharacters (such as "?") which need to be escaped when the text is used as a regular expression, e.g. with quotemeta.

    Alternatively, why not just do:

    $file =~ s/<[^>]*>//g;
Re: s/// don't delete matching phrase
by Eily (Monsignor) on Oct 28, 2014 at 12:51 UTC

    The /g modifier already means "while it matches", so you can actually just do $file =~ s/<.*?>//g; (without the while loop). This also means that the matched text is never reused as a pattern, so its characters cannot be misinterpreted as metacharacters.
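A small sketch comparing the single-pass patterns suggested in this thread, assuming a made-up input in which one tag spans two lines. Without the /s modifier, `.` does not match a newline, so `<.*?>` skips such a tag, while `[^>]*` removes it:

```perl
use strict;
use warnings;

# Hypothetical input: the <p ...> tag is split across two lines.
my $html = qq{<?xml version="1.0" encoding="UTF-8"?><p\nclass="x">Hello</p>};

(my $negated = $html) =~ s/<[^>]*>//g;   # [^>] also matches "\n"
(my $lazy    = $html) =~ s/<.*?>//g;     # . stops at "\n", multi-line tag survives
(my $lazy_s  = $html) =~ s/<.*?>//gs;    # /s lets . match "\n" as well

print "negated: $negated\n";
print "lazy:    $lazy\n";
print "lazy /s: $lazy_s\n";
```

All three variants are still fooled by a `>` inside an attribute value or a comment, which is where the parser suggestions elsewhere in this thread come in.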

Re: s/// don't delete matching phrase
by Hameed (Acolyte) on Oct 29, 2014 at 06:24 UTC
    The suggestion I got when I first asked a similar question was not to use a regex at all, but to use an HTML/XML parser instead.
      This is good advice. For example, in xsh, you can just do:
      open :F html file.html ; echo //text() ;
      لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re: s/// don't delete matching phrase
by ChevLucas (Initiate) on Oct 28, 2014 at 12:34 UTC
    OK, thanks for the helpful comments ;)
Re: s/// don't delete matching phrase (/ge)
by tye (Sage) on Oct 29, 2014 at 14:33 UTC
    $file =~ s{(<[^>]*>)}{
        # print "'$1'\n";
        ''
    }ge;

    - tye        
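A runnable sketch of the same /ge idea, assuming you want to inspect each tag as it is removed. With /e the replacement is evaluated as Perl code once per match, and the value of its last expression becomes the replacement text. The `@removed` array and the sample input are hypothetical and exist only for illustration:

```perl
use strict;
use warnings;

my $file = '<?xml version="1.0" encoding="UTF-8"?><p>Hello <b>world</b></p>';
my @removed;    # hypothetical: collects the tags for inspection

$file =~ s{(<[^>]*>)}{
    push @removed, $1;    # side effect, runs once for every matched tag
    '';                   # last expression is used as the replacement
}ge;

print "$file\n";
print scalar(@removed), " tags removed\n";
```

Because `$1` is only read inside the code block and never interpolated into a pattern, no escaping is needed.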

Re: s/// don't delete matching phrase
by Pizentios (Scribe) on Oct 31, 2014 at 16:10 UTC
    Don't use regexes to read or modify HTML. I prefer modules like Mojo::DOM.
    It makes working with HTML/XML a breeze!
    -Pizentios