If you properly decoded your string, s/\W//g will do that.
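To illustrate why the decoding step matters, here is a minimal sketch that compares the same substitution on raw UTF-8 bytes and on a decoded character string. The variable names are made up for this example, and it assumes the default regex semantics (no `use locale`, no `unicode_strings` feature).

```perl
use strict;
use warnings;
use Encode qw(decode);

# UTF-8 bytes of the Cyrillic word from the example URL, followed by a slash.
my $bytes = "\xD0\xB8\xD0\xB3\xD1\x80\xD0\xB0/";

# Undecoded: every byte >= 0x80 counts as a non-word character, so all go.
(my $stripped_bytes = $bytes) =~ s/\W//g;

# Decoded: the string now holds four Cyrillic letters, which \w matches.
my $chars = decode('UTF-8', $bytes);
(my $stripped_chars = $chars) =~ s/\W//g;

printf "left after stripping bytes: %d, after stripping characters: %d\n",
    length $stripped_bytes, length $stripped_chars;
```

On the byte string nothing survives; on the decoded string only the slash is removed.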

See also: Unicode and Character Encodings in Perl.

Re^4: Strip utf-8 dangerous url chars
by AlfaProject (Beadle) on Apr 03, 2011 at 16:32 UTC
    I have tried it with a few languages, but \W removes all characters that are not English.
    I don't know what the problem is. Here is my code.
    Russian URL (with Russian characters in it): domain.com/ru/search/игра
    Perl code:
        $params{$_} =~ s/[\W]//g for keys %params;
    Another try, with encode/decode:
        use Encode qw(encode decode);
        $params{search_in_page} = decode('utf-8', $params{search_in_page});
        $params{$_} =~ s/[\W]//g for keys %params;
        $params{search_in_page} = encode('utf-8', $params{search_in_page});
    What can be the problem?
      What can be the problem?

      The problem is most likely that your data isn't what you think it is.

      Try the following script:

      use strict;
      use warnings;
      use Encode qw/decode_utf8/;

      binmode STDOUT, ':encoding(UTF-8)';

      while (<>) {
          $_ = decode_utf8 $_;
          s/\W//g;
          print;
      }

      Some in- and output:

      möp spaß     # input
      möpspaß      # output
      АБВГ-ДЕЖ/ЗDD # input
      АБВГДЕЖЗDD   # output
      

      So it preserves both German umlauts and Cyrillic characters.
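The same approach could be applied to a hash like the %params in the question. This is only a sketch: `strip_nonword_params` is a hypothetical helper, and it assumes every value arrives as raw UTF-8 bytes (not percent-encoded and not already decoded).

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: decode each value, strip non-word characters,
# then re-encode so the hash holds UTF-8 bytes again.
sub strip_nonword_params {
    my ($params) = @_;
    for my $key (keys %$params) {
        my $chars = decode('UTF-8', $params->{$key});
        $chars =~ s/\W//g;
        $params->{$key} = encode('UTF-8', $chars);
    }
    return $params;
}

# Sample data (assumed): UTF-8 bytes of a Cyrillic word plus "!", and an ASCII value.
my %params = (
    search_in_page => "\xD0\xB8\xD0\xB3\xD1\x80\xD0\xB0!",
    lang           => "r-u",
);
strip_nonword_params(\%params);
```

Unlike the second attempt in the question, this decodes every value before the substitution runs, not just one key.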

      Please actually read the article I linked to; it contains advice on how to debug such problems.
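One simple way to check what a string really contains is to print the numeric value of each of its characters. The sketch below is not taken from the linked article; `dump_codepoints` is a hypothetical helper, and the percent-encoded sample is only an assumption about what a URL parameter might look like before unescaping.

```perl
use strict;
use warnings;

# Hypothetical helper: list the numeric value of every character in a string.
sub dump_codepoints {
    my ($str) = @_;
    return join ' ', map { sprintf '0x%02X', ord($_) } split //, $str;
}

my $raw     = "\xD0\xB8";   # undecoded UTF-8: two bytes for one Cyrillic letter
my $decoded = "\x{438}";    # decoded: a single character
my $escaped = "%D0%B8";     # assumed case: still percent-encoded, pure ASCII

print dump_codepoints($_), "\n" for $raw, $decoded, $escaped;
```

Two values above 0x7F per letter suggest undecoded bytes, a single value above 0xFF suggests a decoded string, and a run of ASCII values starting with 0x25 suggests the data is still URL-escaped.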