in reply to Re: Strip utf-8 dangerous url chars
in thread Strip utf-8 dangerous url chars

It should support all languages, not only English.

Re^3: Strip utf-8 dangerous url chars
by moritz (Cardinal) on Apr 03, 2011 at 15:17 UTC
      I have tried it in a few languages, but \W removes all characters that are not English.
      I don't know what the problem is. Here is my code:
      Russian URL (with Russian characters in it): domain.com/ru/search/игра
      Perl code: $params{$_} =~ s/[\W]//g for keys %params;
      Another try, with encode/decode:
      use Encode qw(encode decode);
      $params{search_in_page} = decode('utf-8', $params{search_in_page});
      $params{$_} =~ s/[\W]//g for keys %params;
      $params{search_in_page} = encode('utf-8', $params{search_in_page});
      What can be the problem?
        What can be the problem?

        The problem is most likely that your data isn't what you think it is.

        Try the following script:

        use strict;
        use warnings;
        binmode STDOUT, ':encoding(UTF-8)';
        use Encode qw/decode_utf8/;
        while (<>) {
            $_ = decode_utf8 $_;
            s/\W//g;
            print;
        }

        Some in- and output:

        möp spaß     # input
        möpspaß      # output
        АБВГ-ДЕЖ/ЗDD # input
        АБВГДЕЖЗDD   # output
        

        So it preserves both German umlauts and Cyrillic characters.

        Please actually read the article I gave you a link to; it contains advice on how to debug such problems.
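
        One way to check whether the data is what you think it is: a minimal sketch, assuming the parameter arrives as raw UTF-8 bytes. The variable names and the codepoints helper are illustrative, not taken from the code above. It prints the code points of the string before and after decoding and shows how \W treats each form.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumption: this is what the parameter looks like when it arrives
# as undecoded UTF-8 bytes (the word "игра", 8 bytes).
my $bytes = "\xD0\xB8\xD0\xB3\xD1\x80\xD0\xB0";

# Decoding turns the 8 bytes into 4 characters.
my $chars = decode('UTF-8', $bytes);

# Illustrative helper: show each element of a string as a hex code point,
# which makes it obvious whether you hold bytes or characters.
sub codepoints {
    my ($str) = @_;
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $str;
}

# On the byte string, the high bytes are not word characters and are removed.
(my $stripped_bytes = $bytes) =~ s/\W//g;

# On the decoded string, Cyrillic letters are word characters and are kept.
(my $stripped_chars = $chars) =~ s/\W//g;

binmode STDOUT, ':encoding(UTF-8)';
print "bytes: ", codepoints($bytes), "\n";
print "chars: ", codepoints($chars), "\n";
print "bytes after s/\\W//g: '$stripped_bytes'\n";
print "chars after s/\\W//g: '$stripped_chars'\n";
```

        If the first line of output shows values such as U+00D0 rather than U+0438, the string has not been decoded yet, and \W will strip it.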

Re^3: Strip utf-8 dangerous url chars
by Corion (Patriarch) on Apr 03, 2011 at 14:08 UTC

    Then you need to get far more specific in what you want to remove, and also why.

Re^3: Strip utf-8 dangerous url chars
by Anonymous Monk on Apr 03, 2011 at 14:09 UTC
    It should support all languages, not only English.

    Great, then what should it remove, what are the dangerous chars?

      Let's say I need a regex to get rid of everything that is NOT a letter, in any language that UTF-8 supports.
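
      One way to express "remove everything that is not a letter" is the Unicode property \P{L}, applied to a decoded string. This is a minimal sketch, assuming the input is UTF-8 bytes; letters_only is a hypothetical helper name. Note that \P{L} also strips digits, underscores and combining marks, which \W would keep, so scripts that rely on combining marks may need [^\p{L}\p{M}] instead.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: takes UTF-8 bytes and returns UTF-8 bytes
# containing only characters with the Unicode "Letter" property.
sub letters_only {
    my ($bytes) = @_;
    my $text = decode('UTF-8', $bytes);   # bytes -> characters
    $text =~ s/\P{L}//g;                  # drop every non-letter character
    return encode('UTF-8', $text);        # characters -> bytes
}

# Sample input: the string "игра-1/x" encoded as UTF-8 bytes.
my $in  = encode('UTF-8', "\x{438}\x{433}\x{440}\x{430}-1/x");
my $out = letters_only($in);

# STDOUT is left as a byte stream here because $out is already encoded.
print $out, "\n";
```

      Whether "letters only" is really the right filter depends on what the dangerous characters are, as asked above; the sketch only covers the rule as stated.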