Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

There is a node on here that tells you how to remove any non-word character from a variable. I need to allow certain characters (spaces and single ticks') and ditch anything else that's not a-zA-Z0-9. Can anyone help figure out the best way to do this?

Can someone also check to see if this regex could be made better? It works but it seems to collect more than expected.

if($line =~ /code="(.*)"/i) { my $collected = $1; };
This matches more than what's inside the quotes. It seems to match the " itself and the > that follows it in the code. Is there a better way to collect only what is BETWEEN the two double quotes?

Re: Removing certain non-word characters
by davido (Cardinal) on Apr 26, 2004 at 07:01 UTC
    The approaches that enumerate a-z and A-Z don't play nice with locales. For example, in the Portuguese character set, you will have vowels with ~, ', `, and ^ over them as part of the alphabet, and they don't fall within the range of a-z.

    \w is locale-smart, but has the unfortunate disadvantage of also containing '_' (underscore). So if you were to use \w, you would have to figure out some way of using s/// to eliminate all \W characters except hyphen, space, and tick, plus eliminate underscore. That can get a little convoluted.

    The easiest solution might be to use a couple of regexes instead of just one. Another solution might be to match what you want and leave out the rest. A solution that I considered (and Zaxo also mentioned in the CB) is to use the oft-neglected POSIX character classes:

    $string =~ s/[^[:alnum:]\s'-]//g;

    Which says, "Substitute anything that is not alphanumeric, whitespace, tick, or hyphen, with nothing (i.e., just get rid of it)."

    POSIX character classes get along with locales, so if your code ever ends up running in an environment where use locale; is in effect, it shouldn't break.
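A minimal sketch of the difference, under stated assumptions: the sample string is made up, and the example switches on Unicode rules with the /u modifier (Perl 5.14 or later) instead of relying on use locale;, because which locales are installed varies from system to system. The accented characters are written as escapes so the file stays plain ASCII.

```perl
use strict;
use warnings;

# Hypothetical input: "São João's pão-de-ló_#1!" written with escapes
# so this file stays plain ASCII.
my $string = "S\x{E3}o Jo\x{E3}o's p\x{E3}o-de-l\x{F3}_#1!";

# Enumerated ASCII ranges: the accented vowels are stripped as well.
(my $ascii = $string) =~ s/[^a-zA-Z0-9\s'-]//g;

# POSIX class under Unicode rules (/u): accented letters count as alnum.
(my $posix = $string) =~ s/[^[:alnum:]\s'-]//gu;

binmode STDOUT, ':encoding(UTF-8)';
print "ascii: $ascii\n";
print "posix: $posix\n";
```

The enumerated version drops the accented vowels along with the punctuation, while the POSIX version keeps them and removes only the underscore, the # and the !.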


    Dave

Re: Removing certain non-word characters
by tkil (Monk) on Apr 26, 2004 at 06:38 UTC

    Ask for what you want matched. In this case, you most likely want everything up to the next double-quote, yes? If so, something like this should work:

    my ( $code ) = ( $line =~ /code="([^"]+)"/ );

    The use (and abuse) of regexes to match HTML content has been beaten to death. If you want stronger results, consider using a module designed to parse HTML. This is also covered in the fantastic book Mastering Regular Expressions.

    Some cases to watch out for:

    <!-- watch out for greedy matching -->
    <tag code="blah" attr="nothing">

    <!-- and for less-than characters in attribute values (which is
         likely illegal, but HTML in the wild is notoriously nasty
         this way) -->
    <tag code="<bang!>">

    <!-- finally, make sure you can handle multiple-line tags -->
    <tag foo="bar"
         code="nothing">
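To see how the greedy pattern from the question and the negated character class behave on the cases listed above, here is a small sketch. The sample lines are modelled on those cases, and the variable names are only illustrative.

```perl
use strict;
use warnings;

# Sample lines modelled on the cases listed above.
my @lines = (
    '<tag code="blah" attr="nothing">',
    '<tag code="<bang!>">',
    qq{<tag foo="bar"\n     code="nothing">},
);

my (@greedy, @negated);
for my $line (@lines) {
    # Greedy: .* runs on to the LAST double-quote it can reach.
    my ($g) = $line =~ /code="(.*)"/i;
    # Negated class: stops at the first closing double-quote.
    my ($n) = $line =~ /code="([^"]+)"/i;
    push @greedy,  $g;
    push @negated, $n;
    print "greedy:  [$g]\nnegated: [$n]\n";
}
```

On the first line the greedy version also swallows the attr attribute, which is the kind of over-collection described in the question. The negated class returns just the value each time. Note that [^"]+ needs at least one character, so an empty code="" would not match; use [^"]* if empty values should be accepted.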
Re: Removing certain non-word characters
by Zed_Lopez (Chaplain) on Apr 26, 2004 at 05:53 UTC
    For the first:
    s/[^a-zA-Z0-9'\x20]//g;
    With the second, please give an example. I suspect you're running into problems with the match being greedy.

      I'm collecting meta tag information. I want to get the keywords section out, but just the keywords, not the meta tag itself.
      <meta name="keywords" content="one,two,three,four">
      And I want to match anything inside of content="" but nothing else. I tried your s///, but that doesn't work; it actually made the regex a little worse. If I can get it to collect the information I want, then the rest of the problems should go away. Thanks.
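One way this could be approached, as a sketch only: capture what sits between the quotes after content= with a negated character class, then split on commas. It assumes the attribute value is double-quoted and the whole tag is on one line; for HTML in the wild, a dedicated parsing module is the sturdier route.

```perl
use strict;
use warnings;

# The meta tag from the post above.
my $line = '<meta name="keywords" content="one,two,three,four">';

my @keywords;
# Capture only what sits between the quotes after content=.
if ( my ($content) = $line =~ /\bcontent="([^"]*)"/i ) {
    # Split on commas, tolerating optional spaces around them.
    @keywords = split /\s*,\s*/, $content;
}

print join('|', @keywords), "\n";
```

Here @keywords ends up holding the four keywords with none of the surrounding markup.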
Re: Removing certain non-word characters
by Anomynous Monk (Scribe) on Apr 26, 2004 at 06:03 UTC
    $var =~ tr/ 'A-Za-z0-9//cd;
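For comparison, a short sketch that runs the tr/// version next to the s/// version from the earlier reply on a made-up input. With tr///, the c flag complements the character list and the d flag deletes whatever matches, so everything outside the list is removed.

```perl
use strict;
use warnings;

# Hypothetical input with characters outside the allowed set.
my $var = "O'Reilly & Sons, est. 1978!";

# s/// version from the earlier reply (\x20 is a literal space).
(my $via_s = $var) =~ s/[^a-zA-Z0-9'\x20]//g;

# tr version: c complements the list, d deletes what matches.
(my $via_tr = $var) =~ tr/ 'A-Za-z0-9//cd;

print "$via_s\n$via_tr\n";
```

Both produce the same result on this input. Neither collapses the two spaces left behind where the & was removed, so squeeze runs of spaces in a separate step if that matters.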