Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Greetings and salutations. Can someone tell me why this doesn't work, and what would be correct? Thank you kindly.
$value =~ s/<(([^ >]|\n)*)>//g;
The above code should be removing any html being parsed through my web form.

Replies are listed 'Best First'.
Re: Code does Remove Html
by graff (Chancellor) on Dec 17, 2006 at 06:20 UTC
    If you are trying to use a regex to remove tags from html data, the data is not "being parsed" in the proper sense of the term -- parsing is what HTML::TokeParser does.

    But apart from that, you haven't really given enough information about how your script is handling the html data. Are you going line-by-line? (Are there newline characters in the data, and do you use your regex within a  while (<>) {} kind of loop?) If so, then the regex will fail if there is a newline between a < and the next >.

    Assuming that the html data has all been slurped into a single scalar variable, then your regex fails on tags like:

    <font size="-1">
    because it's not allowing for tags that contain a space character. Maybe what you were looking for was something simpler:
    s/<[^>]+>/ /g;
    (Note that some tags, like <p>, function as whitespace. Deleting them outright can lose word boundaries, so replace them with a space instead.)

    But again, the simpler regex still won't work if you're treating the data one line at a time and you happen to run into tags like:

    <img
     src="/super/long/path/string/that/the/author/puts/on/a/separate/line.jpg"
     alt="missing image"
    >
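    A minimal sketch of the difference between the two cases described above, assuming a made-up three-line input (the sample string and variable names are illustrative, not from the original post). It applies the simpler regex once to the whole string and once to each line separately:

```perl
use strict;
use warnings;

# Hypothetical sample input: the <img> tag is split across two lines.
my $html = qq{<p>Hello\n<img src="x.jpg"\n alt="pic">\nworld</p>};

# Case 1: the whole document is in one scalar, so [^>]+ can cross the newline.
(my $slurped = $html) =~ s/<[^>]+>/ /g;

# Case 2: the same substitution applied to each line on its own.
my $by_line = join "\n",
    map { (my $line = $_) =~ s/<[^>]+>/ /g; $line } split /\n/, $html;

print "slurped: [$slurped]\n";
print "by line: [$by_line]\n";
```

    In the line-by-line case the split tag never has both its "<" and its ">" on the same line, so it is left in the output.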
Re: Code does Remove Html
by stonecolddevin (Parson) on Dec 17, 2006 at 08:36 UTC

    Try out HTML::Scrubber. You can define a set of rules as to what kind of HTML you want stripped and such. I really like this module, as it's pretty fast, reliable, and definitely grokkable.

    meh.
Re: Code does Remove Html
by Anonymous Monk on Dec 17, 2006 at 05:47 UTC
Re: Code does Remove Html
by bart (Canon) on Dec 17, 2006 at 12:57 UTC
    Why can't your tags contain any spaces?
    /[^ >]/
    In plain English: any character except ">" or a space.

    Besides, quoted attributes, either with single or with double quotes, may contain ">" characters.
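    To illustrate the point about quoted attributes, here is a small sketch using a hypothetical tag whose alt text contains ">". It uses the simpler s/<[^>]+>/ /g pattern suggested earlier in the thread; the input string is an assumption made up for this example:

```perl
use strict;
use warnings;

# Hypothetical input: the alt attribute legitimately contains a ">".
my $html = q{<img src="arrow.png" alt="a > b">after};

(my $stripped = $html) =~ s/<[^>]+>/ /g;

# The match ends at the first ">", which sits inside the quoted value,
# so the tail of the tag leaks into the output as ordinary text.
print "[$stripped]\n";
```

    Handling this correctly means tracking quote state, which is the kind of work a real HTML parser does.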

      bart, from my findings, if one were to remove the space from the character class, changing
      s/<(([^ >]|\n)*)>/ /
      to
      s/<(([^>]|\n)*)>/ /

      the code would also remove text that is not HTML, such as:
      1) < sfsdf > 2) < sddsds "dfdsfds" > 3) < sdsds sdsd"sdsdasd" >
      In those three cases the input is plain text, not HTML.
        But what about tags like these?
        • <a href="http://perlmonks.org" class="link">
        • <br />
Re: Code does Remove Html
by shmem (Chancellor) on Dec 17, 2006 at 19:03 UTC
    and what would be correct.

    There's Tom Christiansen's striphtml since 1996 - still works.

    --shmem

    _($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                                  /\_¯/(q    /
    ----------------------------  \__(m.====·.(_("always off the crowd"))."·
    ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
Re: Code does Remove Html
by graff (Chancellor) on Dec 18, 2006 at 07:47 UTC
    I just noticed that this line of code in the OP:
    $value =~ s/<(([^ >]|\n)*)>//g;
    also appears in code that was posted in another recent SoPW thread: How to use tr///. It looks like ApolloOne may have been replying as Anonymonk when posting this sub-reply, which contains the infamous "Matt's scripts" code, which in turn contains this strange regex.

    ApolloOne has already been told to stop using Matt's scripts, and has been given a pointer to the NMS CGI scripts to do the job right.

    If the person posting the current thread is also trying to patch Matt's scripts (or if this is ApolloOne posting anonymously again), then the correct answer to the question is the same as before: don't bother patching Matt's scripts, because you shouldn't be using that code in the first place.

Re: Code does Remove Html
by SFLEX (Chaplain) on Dec 17, 2006 at 06:43 UTC
    This code will remove all html tags.
    Test code
    my $value = qq(<html> <head> <title>HTML Document?</title>
    <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
    </head> <body bgcolor="#FFFFFF" text="#000000"> Some Text.
    <img src="/super/long/path/string/that/the/author/puts/on/a/separate/line.jpg" alt="missing image" >
    </body> </html>);
    $value =~ s/<(([^ >]|\n|\s\w)*)>/ /gso;
    print $value;

    What I like to do is just escape some parts to make the form text harmless and characters viewable when printed in the browser.
    Example code
    $value =~ s{&}{&amp;}gso;
    $value =~ s{"}{&quot;}gso;
    $value =~ s{ }{ \&nbsp;}gso;
    $value =~ s{<}{&lt;}gso;
    $value =~ s{>}{&gt;}gso;
    $value =~ s{\)}{&#41;}gso;
    $value =~ s{\(}{&#40;}gso;
    $value =~ s{\t}{ \&nbsp; \&nbsp; \&nbsp;}gso;
    $value =~ s{\n}{<br>}gso;
    print $value;

      If you "remove HTML" with a regex like that, then I can still get whatever HTML I want in like so:

      <a <b>href="www.example.com">Cheap Viagra!</</b>a> <script<b>> alert("CHEAP VIAGRA!") </script</b>>

      Escaping is a much better idea.

      - tye        
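      A small sketch of the kind of bypass described above, assuming a simplified hostile input modelled on tye's example (the exact string is made up for illustration). It runs the tag-stripping substitution from the parent post over a link that has a decoy <b> tag placed inside it:

```perl
use strict;
use warnings;

# Hypothetical hostile input: a real <a> tag with a decoy <b> inside it.
my $value = q{<a <b>href="http://www.example.com">click</a>};

# The tag-stripping substitution from the parent post, unchanged.
$value =~ s/<(([^ >]|\n|\s\w)*)>/ /gso;

# The decoy is removed, and what is left behind is a working opening
# anchor tag that the single pass never looks at again.
print "[$value]\n";
```

      The substitution only makes one pass, so removing the decoy is what assembles the tag the filter was meant to block.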

        Escaping is a much better idea.

        Yes it is!
        If done right, the text entered is safe and you don't lose any of the data.
        Done the other way, lots of data can be lost and the code could be bypassed.


        Update: I just had to write some code to remove the text in tye's post.
        This is what I came up with:
        $value =~ s/<(([^ >]|\n|\s\/|\s\S\S)*)>/ /gso;

        Yeah, that code works somewhat better, but it still removes parts that it should not and, as tye stated, escaping is the better way to go.
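        For comparison, a minimal sketch of escaping in a single pass using only core Perl. The helper name escape_html and the %ENTITY table are hypothetical, chosen for this example and not taken from the thread; it covers only the five characters that are special in HTML text and attribute values:

```perl
use strict;
use warnings;

# The five characters that are special in HTML text and attribute values.
my %ENTITY = (
    '&' => '&amp;',
    '<' => '&lt;',
    '>' => '&gt;',
    '"' => '&quot;',
    "'" => '&#39;',
);

# escape_html is a hypothetical helper name, not taken from the thread.
# One pass over the input means the "&" of an entity that was just
# produced is never escaped a second time.
sub escape_html {
    my ($text) = @_;
    $text =~ s/([&<>"'])/$ENTITY{$1}/g;
    return $text;
}

print escape_html(q{<script>alert("hi")</script> & more}), "\n";
```

        Nothing the user typed is thrown away: the browser shows the markup as literal text instead of interpreting it.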

Re: Code does Remove Html
by jonsmith1982 (Beadle) on Dec 17, 2006 at 21:49 UTC
    I use these to get rid of the HTML selectively:
    $content =~ s/<[paiutfdsb](.*?)>//gi;
    $content =~ s/<\/[bfiudspat](.*?)>//gi;
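    A short sketch of what these two substitutions select, using a made-up input string. The character classes match on the first letter of the tag name only, so the selection is by initial letter, not by whole tag name:

```perl
use strict;
use warnings;

# Hypothetical input mixing tags whose names start with letters inside
# the character class (p, script) and outside it (em, h1).
my $content = q{<p>Hello <em>world</em></p><script>x</script><h1>Title</h1>};

$content =~ s/<[paiutfdsb](.*?)>//gi;     # opening tags starting with p a i u t f d s b
$content =~ s/<\/[bfiudspat](.*?)>//gi;   # closing tags starting with the same letters

print "$content\n";
```

    Tags such as <em> and <h1> pass through untouched, while any tag that merely begins with one of the listed letters is removed along with the intended ones.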