in reply to Strip utf-8 dangerous url chars

I assume that by "letters", you mean only a-zA-Z, and not äöü for example. Then it's easy - just use a negation in your character class:

$param =~ s/[^a-zA-Z0-9_]//g;
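For example, with a hypothetical input:

my $param = 'foo-bar_42!';      # hypothetical input
$param =~ s/[^a-zA-Z0-9_]//g;   # $param is now 'foobar_42'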

But I'm not sure what kind of safety that buys you.

Re^2: Strip utf-8 dangerous url chars
by cavac (Prior) on Apr 03, 2011 at 16:44 UTC

    Probably none.

    In my webapps I usually do away with URL parameters completely. URL parameters often enough open the door for simple XSS attacks (e.g. emailing a manipulated URL to a victim).

    Of course, depending on your webserver, this is not always an option. *But* - and here is the important part - when you use dynamic URL parameters, they should not contain form input. These parameters should only contain values generated by your backend - which you then look up against IDs stored in whatever you use for data storage.
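    A minimal sketch of that lookup pattern, assuming SQLite and a "pages" table (the table, column and parameter names are invented for illustration):

        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect('dbi:SQLite:dbname=app.db', '', '',
                               { RaiseError => 1 });

        sub page_for_id {
            my ($id) = @_;
            # The URL carries only an opaque backend-generated ID,
            # never raw form input - reject anything that is not a
            # plain integer before touching the database.
            return unless defined $id && $id =~ /\A[0-9]+\z/;
            my ($path) = $dbh->selectrow_array(
                'SELECT path FROM pages WHERE id = ?', undef, $id);
            return $path;   # undef if the ID is unknown
        }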

    Form input (which you POST, not GET, according to RFC 2616) is harder to validate. This depends strongly on the data you expect and on how you are going to store and display it later. The key here is to whitelist (allow) characters, not blacklist (deny) them. It's easier to expand the list after checking that a character is safe than it is to fix a never-ending list of new "open holes".
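    A minimal sketch of the whitelist approach - reject the whole value unless every character is on the allowed list, rather than stripping "bad" ones (the allowed set and the length limit here are just examples):

        sub valid_username {
            my ($name) = @_;
            # Accept only 1-32 characters from an explicit whitelist;
            # anything else fails validation outright.
            return defined $name && $name =~ /\A[a-zA-Z0-9_]{1,32}\z/;
        }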

    If you use a database (which I strongly recommend), use the quote() function of DBI or prepare()d statements with placeholders. This should at least save you from exploits like SQL injection.
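    A short sketch of a placeholder in action (the table and column names are invented); DBI sends the value separately from the SQL text, so it cannot break out of the query:

        my $sth = $dbh->prepare('SELECT id FROM users WHERE login = ?');
        $sth->execute($params{login});
        my ($id) = $sth->fetchrow_array;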

    I'm also taking a wild guess here, based on my own experience, and say that probably about 10-50% of your input can be validated by the database by using foreign keys against static tables or by using ENUMs. Your mileage may vary, but at some point or another, you have to trust at least your own staff to enter correct data anyway.
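    A sketch of letting the database do that validation, assuming MySQL (ENUM is MySQL-specific; other databases would use CHECK constraints instead):

        # An ENUM rejects unknown values, a foreign key rejects
        # references to rows that don't exist.
        $dbh->do(q{
            CREATE TABLE tickets (
                id      INT PRIMARY KEY AUTO_INCREMENT,
                status  ENUM('open','closed','pending') NOT NULL,
                user_id INT NOT NULL,
                FOREIGN KEY (user_id) REFERENCES users(id)
            )
        });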

    One last personal note: While it's certainly nice to have support for Unicode and therefore many written languages, this kind of support may bind quite a large amount of resources on your part and may lead to all kinds of weird behaviour on badly implemented clients. Unicode in itself may pose a security risk even when correctly implemented: one example is Unicode domain names; depending on the font, the user might not be able to distinguish between the correct link to his online banking website and a fake one a scammer set up.

    Don't use '#ff0000':
    use Acme::AutoColor; my $redcolor = RED();
    All colors subject to change without notice.
      I use flat files in that app. Of course I'm whitelisting all the letters of all languages, plus numbers. The whole app works only with a-zA-Z0-9_ chars, except for the search, which is needed to generate some random links for SEO in all the languages.
Re^2: Strip utf-8 dangerous url chars
by AlfaProject (Beadle) on Apr 03, 2011 at 14:04 UTC
    It should support all languages, not only English.
        I have tried it in a few languages, but \W removes all characters that are not English.
        I don't know what the problem is. Here is my code:
        Russian URL (with Russian characters in it): domain.com/ru/search/игра
        Perl code:
        $params{$_} =~ s/[\W]//g for keys %params;
        Another try, with encode/decode:
        use Encode qw(encode decode);
        $params{search_in_page} = decode('utf-8', $params{search_in_page});
        $params{$_} =~ s/[\W]//g for keys %params;
        $params{search_in_page} = encode('utf-8', $params{search_in_page});
        What can be the problem?

      Then you need to get far more specific about what you want to remove, and also why.

      It should support all languages, not only English.

      Great - then what should it remove? What are the dangerous chars?

        Let's say I need a regex to get rid of everything that is NOT a letter in any of the languages that UTF-8 supports.
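        A minimal sketch of such a regex, assuming the parameter is first decoded from UTF-8 bytes into Perl characters (on raw bytes, \p{L} cannot see multi-byte letters):

        use Encode qw(decode encode);

        # Keep letters (\p{L}) in any script, digits (\p{N}) and
        # underscore; strip everything else.
        my $value = decode('UTF-8', $params{search_in_page});
        $value =~ s/[^\p{L}\p{N}_]//g;
        $params{search_in_page} = encode('UTF-8', $value);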