in reply to Re^3: Strip utf-8 dangerous url chars
in thread Strip utf-8 dangerous url chars

I have tried it with a few languages, but \W removes every character that is not English.
I don't know what the problem is. Here is my code:
Russian URL (with Russian characters in the path): domain.com/ru/search/игра
Perl code: $params{$_} =~ s/[\W]//g for keys %params;
Another try, with encode/decode:
use Encode qw(encode decode);
$params{search_in_page} = decode('utf-8', $params{search_in_page});
$params{$_} =~ s/[\W]//g for keys %params;
$params{search_in_page} = encode('utf-8', $params{search_in_page});
What can be the problem?

Re^5: Strip utf-8 dangerous url chars
by moritz (Cardinal) on Apr 03, 2011 at 17:02 UTC
    What can be the problem?

    The problem is most likely that your data isn't what you think it is.

    Try the following script:

    use strict;
    use warnings;
    binmode STDOUT, ':encoding(UTF-8)';
    use Encode qw/decode_utf8/;
    while (<>) {
        $_ = decode_utf8 $_;
        s/\W//g;
        print;
    }

    Some in- and output:

    möp spaß     # input
    möpspaß      # output
    АБВГ-ДЕЖ/ЗDD # input
    АБВГДЕЖЗDD   # output
    

    So it preserves both German umlauts and Cyrillic characters.
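
To see why the decoding step matters, here is a minimal sketch that runs the same substitution on raw UTF-8 bytes and on a decoded string. The helper name strip_nonword_utf8 and the sample value are illustrative and not taken from the thread. The sketch assumes the input really is UTF-8 bytes and that the unicode_strings feature is off, which is the default unless the script says `use feature 'unicode_strings'` or `use v5.12` or later.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: remove non-word characters from a UTF-8 byte string.
# It decodes first so that \W is matched against characters, not bytes.
sub strip_nonword_utf8 {
    my ($bytes) = @_;
    my $chars = decode('UTF-8', $bytes);   # bytes -> characters
    $chars =~ s/\W//g;                     # Cyrillic letters count as \w here
    return encode('UTF-8', $chars);        # characters -> bytes for output
}

# UTF-8 bytes of "игра-1", written as escapes so the source file's
# own encoding does not matter.
my $raw = "\xD0\xB8\xD0\xB3\xD1\x80\xD0\xB0-1";

# Same regex on the undecoded bytes: every byte above 0x7F is treated
# as a non-word character, so only ASCII word characters survive.
(my $naive = $raw) =~ s/\W//g;

printf "naive:   %d byte(s) left\n", length $naive;
printf "decoded: %d byte(s) left\n", length strip_nonword_utf8($raw);
```

If the decoded variant still strips everything, the value is probably not plain UTF-8 at that point (for example, it may still be percent-encoded or already decoded once), which is the "data isn't what you think it is" case.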

    Please actually read the article I linked to; it contains advice on how to debug such problems.
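
One common way to check what a string really contains is to print its elements as code points. The sketch below is not necessarily the method the linked article describes; dump_codepoints is a hypothetical helper name. If a Cyrillic letter shows up as two values below U+0100, the string is still undecoded bytes.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical debugging helper: render each element of a string as U+XXXX.
sub dump_codepoints {
    my ($s) = @_;
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $s;
}

my $bytes = "\xD0\xB8";                 # UTF-8 encoding of Cyrillic "и"
my $chars = decode('UTF-8', $bytes);    # the same letter as one character

print dump_codepoints($bytes), "\n";    # two elements, both below U+0100
print dump_codepoints($chars), "\n";    # one element in the Cyrillic block
```

Calling the helper on each value in %params just before the s/\W//g line should show whether the data has been decoded at that point.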