Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I want to know how I can remove all non-English characters from a submitted field.

I want to accept the characters 'A'..'Z', 'a'..'z',

and the digits 0..9,

plus spaces and (. , - , _ ).

Thanks

Re: regex question
by davido (Cardinal) on Dec 05, 2006 at 08:04 UTC

    Use tr///

    $string =~ tr/A-Za-z0-9.-_//cd;

    The /c modifier "complements" the listed characters. Instead of acting on A-Z, tr/// acts on the complement of A-Z (every character that is NOT A-Z). We've specified A-Z along with your other ranges and special characters. The /d modifier deletes any listed character (here, any character in the complement) that has no counterpart on the right-hand side of the operator. Since the right-hand side is empty, everything outside our list is deleted. The intent is to delete any character that is not A-Z, a-z, 0-9, ., -, or _.

    It's also possible (and easy) with the s/// operator, like this:

    $string =~ s/[^A-Za-z0-9._-]//g;

    This works a little differently: It substitutes any character not found in the character class with nothing (which means to delete that character). The /g modifier causes s/// to iterate through every match.

    You could have a look at perlop to better understand tr///, and perlre for help with the regular expression. It's important to note that even though it looks kind of like a regular expression, the transliteration operator (tr///) is not a regexp.


    Dave
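For anyone who wants to try the two approaches side by side, here is a minimal sketch. The sample string and the variable names ($input, $via_tr, $via_s, $removed) are made up for illustration, and the hyphen is placed last in the list so that both operators take it literally.

```perl
use strict;
use warnings;

# Hypothetical sample input containing punctuation, spaces and one
# non-ASCII character (o-umlaut, written as \x{f6}).
my $input = "Hello, W\x{f6}rld! file_name-01.txt";

# tr/// with /c (complement) and /d (delete). In this form tr/// returns
# the number of characters it deleted.
my $via_tr  = $input;
my $removed = ( $via_tr =~ tr/A-Za-z0-9._-//cd );

# s/// with a negated character class; /g repeats for every match.
my $via_s = $input;
$via_s =~ s/[^A-Za-z0-9._-]//g;

print "$via_tr\n";
print "$via_s\n";
print "tr/// removed $removed characters\n";
```

Both operators should leave the same string behind here. tr/// builds a fixed character table instead of running the regex engine, which is part of why it is not a regexp even though it looks like one.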

      $string =~ tr/A-Za-z0-9.-_//cd;
      You forgot to escape the hyphen, the one you want taken literally:
      $string =~ tr/A-Za-z0-9.\-_//cd;

      Yours will keep everything in the range

      ./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_
      too, which is more than you wanted.
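The difference can be checked with a small sketch that runs both versions over the printable ASCII characters. The variable names ($ascii, $unescaped, $escaped) are only illustrative.

```perl
use strict;
use warnings;

# Every printable ASCII character, from space (0x20) to tilde (0x7E).
my $ascii = join '', map { chr($_) } 0x20 .. 0x7E;

# Unescaped: ".-_" is read as a range from "." (0x2E) to "_" (0x5F).
( my $unescaped = $ascii ) =~ tr/A-Za-z0-9.-_//cd;

# Escaped: "\-" is a literal hyphen, so only ".", "-" and "_" are added.
( my $escaped = $ascii ) =~ tr/A-Za-z0-9.\-_//cd;

print "unescaped keeps: $unescaped\n";
print "escaped keeps:   $escaped\n";
```

Note that the unescaped version does not even keep the hyphen itself, because "-" (0x2D) sits just below the start of the range.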

        What I should have said (and intended to but didn't) is:

        tr/A-Za-z0-9._-//cd;

        The hyphen doesn't have special meaning in that case. I even did that when I composed the character class demonstrated in my s/// example, but somehow missed it in the tr/// example.

        Good catch bart. :) You've got to love Perl's density huh? ;)


        Dave

      Thanks for your reply, but ...

      #!/perl/bin/perl -w
      use strict;
      my $string = "Hello $ World \n";
      $string =~ tr/A-Za-z0-9.-_//cd;
      print $string;
      ###
      ### prints out: Global symbol "$World" requires explicit package name

      The same happens with the other example.

      I have one more question: how can I add a space

      to the s/// in the example you gave me?

      Leaving a space at the end or adding \s gives me errors.

        Yes, but if you used the correct type of quotes that wouldn't be a problem:

        my $string = 'Hello $World' . "\n";

        Or escape your $ character.

        By the way, your script is failing on this line:

        my $string = "Hello $World \n";

        That's happening in the compilation phase, not runtime.


        Dave
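On the question about spaces: one way that should work is to put a literal space inside the list of allowed characters. The sketch below combines that with the two quoting fixes mentioned above. The variable names are illustrative, and the test string is just the one from this thread.

```perl
use strict;
use warnings;

# Single quotes do not interpolate, so the dollar sign stays literal.
my $single = 'Hello $World' . "\n";

# In double quotes, a backslash before the dollar sign has the same effect.
my $escaped = "Hello \$World\n";

# A literal space inside the character class keeps spaces. The hyphen
# stays last so that it is not read as a range.
my $via_s = $single;
$via_s =~ s/[^A-Za-z0-9 ._-]//g;

# The same list works for tr///, where the space is also taken literally.
my $via_tr = $escaped;
$via_tr =~ tr/A-Za-z0-9 ._-//cd;

print "$via_s\n";
print "$via_tr\n";
```

Using \s in the s/// character class instead of a literal space would also keep tabs and newlines, which may or may not be what you want. \s is not available in tr/// at all, since tr/// does not understand regexp character classes.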

Re: regex question
by madbombX (Hermit) on Dec 05, 2006 at 16:29 UTC
    Have you considered something other than a regex? For instance, HTML::Entities may help here. The following code should do what you want by encoding the characters in the given range as HTML entities:
    use HTML::Entities;
    $foo = encode_entities($foo, "\x80-\xff");

    With the information you provided, using \x80-\xff should do the replacements you need. If other characters need to be replaced, removed, or modified, you may want to look at a conversion table or a lookup table for the appropriate codes.

    Update thanks to ideas from ww.