nkropols has asked for the wisdom of the Perl Monks concerning the following question:

Hello everybody.
When I use the JavaScript encodeURI() method to encode a string that is posted to a Perl script, I cannot find a good way to decode it in Perl. Regular English characters translate fine using uri_unescape from URI::Escape. However, all other characters are encoded with Unicode escapes, using more than one escape sequence per character, which creates problems for things like Å or Å as I would say. ("våre norske tegn bør æres", roughly "our Norwegian characters should be honored") While the latest URI::Escape module has a function called uri_escape_utf8 (which corresponds to JavaScript's encodeURI()), there is no uri_unescape_utf8.
That would be the function I am looking for. Has anyone seen it? Or something similar?

Replies are listed 'Best First'.
Re: javascript encodeURI() -> perl uri_unescape_utf8() ?
by nkropols (Sexton) on Dec 15, 2004 at 15:12 UTC
    Solution
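    use Encode;
    use URI::Escape;

    # $str rather than $a, since $a is reserved for sort()
    my $str = 'ab%C3%A5cd%3C/H4%3E%0D%0A';
    $str = uri_unescape($str);     # percent-escapes -> raw bytes
    print decode('utf8', $str);    # UTF-8 bytes -> Perl character string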
    Which in effect is exactly the same as the first suggestion posted by graff.
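For anyone who wants the missing function by name, the two steps above could be wrapped in a small helper. This is only a sketch: uri_unescape_utf8 is a hypothetical name defined here, not something URI::Escape provides, and it assumes the input really is percent-encoded UTF-8.

```perl
use strict;
use warnings;
use URI::Escape qw(uri_unescape);
use Encode qw(decode);

# Hypothetical helper (NOT part of URI::Escape): undo percent-escaping,
# then interpret the resulting bytes as UTF-8.
sub uri_unescape_utf8 {
    my ($escaped) = @_;
    my $bytes = uri_unescape($escaped);   # %C3%A5 -> the two bytes C3 A5
    return decode('UTF-8', $bytes);       # bytes -> Perl character string
}

my $text = uri_unescape_utf8('v%C3%A5re%20norske%20tegn');

# Encode on the way out so the terminal or browser receives UTF-8 again.
binmode STDOUT, ':encoding(UTF-8)';
print "$text\n";
```

The decode step is kept separate from uri_unescape on purpose: escaped data is not always UTF-8, so the caller decides when to decode.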
      Helped ! Thanks for sharing!
Re: javascript encodeURI() -> perl uri_unescape_utf8() ?
by graff (Chancellor) on Dec 14, 2004 at 19:25 UTC
    Maybe someone else who is more familiar with this will chime in, but in the meantime, I'd need to ask for a little more detail.

    Is your cgi script getting things that look like  Å for Å (or  Å for Å) ? If you're getting something else, what exactly is it that is coming back to the cgi script? (if it seems to be binary, say so and show us the hex digits)

      You can see the characters by opening this simple script in your browser. (encodeURIComponent encodes a few more characters, so you can see the difference better.)
      <script> alert(encodeURIComponent('a å /')); </script>
      outputs: a%20%C3%A5%20%2F
      The problem is that å is encoded into two "sequences": %C3%A5
        The thing you are referring to as "two sequences" is actually the two-byte UTF-8 encoding of the single character U+00E5. (updated the grammar slightly to make more sense)
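To see where the two escapes come from, here is a small sketch that goes in the other direction: it encodes U+00E5 as UTF-8 and percent-escapes each resulting byte. The variable names are just for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# One character, U+00E5 (LATIN SMALL LETTER A WITH RING ABOVE)
my $char = "\x{E5}";

# Encoding it as UTF-8 yields two bytes
my $bytes = encode('UTF-8', $char);

# Percent-escape every byte, the way the browser does for non-ASCII text
my $escaped = join '', map { sprintf '%%%02X', ord } split //, $bytes;

print "$escaped\n";   # prints %C3%A5
```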

        Naturally, we'd love to have an elegant and concise way to interpret this correctly as utf8 text, but I don't know enough about the URI modules to provide much guidance in that direction.

        So instead, I'll offer an ad-hoc (but still somewhat concise) work-around -- it's a kluge, but it should work until you or some other monk can find the needed gems in the appropriate module(s):

        use Encode;

        # ... get the uri string into $_ by whatever means ...
        $_ = "a%20%C3%A5%20%2F";

        # first, let's turn the uri encoded string (with "%HH" for some bytes) into binary:
        s/\%([0-9a-f]{2})/chr(hex($1))/egi;

        # then, since this produces a utf-8 byte sequence, let's "decode" that into utf-8
        $_ = decode( 'utf8', $_ );

        # $_ now has utf8 flag set, and contains the string with expected unicode characters
        binmode STDOUT, ":utf8";
        print;
        The "binmode STDOUT" thing could be taken out if you add a "-CO" flag on the shebang line, I believe -- that "perlrun" option does the same thing as 'binmode STDOUT, ":utf8";'.
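One way to check that the decode step did what is described here is to compare lengths before and after it. This is just a sketch using the same sample string; the variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $escaped = 'a%20%C3%A5%20%2F';

# Same substitution as in the work-around, applied to a copy
(my $bytes = $escaped) =~ s/%([0-9a-f]{2})/chr(hex($1))/egi;

# Interpret the raw bytes as UTF-8
my $chars = decode('UTF-8', $bytes);

# Before decoding, the two UTF-8 bytes count separately;
# afterwards they are a single character.
printf "bytes: %d, characters: %d\n", length($bytes), length($chars);
```

If both numbers come out the same for input that contains non-ASCII characters, the string was probably never decoded.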