in reply to javascript encodeURI() -> perl uri_unescape_utf8() ?

Maybe someone else who is more familiar with this will chime in, but in the meantime, I'd need to ask for a little more detail.

Is your cgi script getting things that look like &#197; for Å (or &#xC5; for Å)? If you're getting something else, what exactly is it that is coming back to the cgi script? (If it seems to be binary, say so and show us the hex digits.)

Re^2: javascript encodeURI() -> perl uri_unescape_utf8() ?
by nkropols (Sexton) on Dec 14, 2004 at 20:05 UTC
    You can see the characters by opening this simple script in your browser. (encodeURIComponent encodes a few more characters than encodeURI, so the difference is easier to see.)
    <script> alert(encodeURIComponent('a å /')); </script>
    outputs: a%20%C3%A5%20%2F
    The problem is that å is encoded into two "sequences": %C3%A5
      The thing you are referring to as "two sequences" is actually the two-byte utf-8 encoding of the single character U+00E5. (updated the grammar slightly to make more sense)
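As a small illustration of that point, here is a sketch using only the core Encode module (the variable names are just for this example). Encoding U+00E5 as utf-8 should give exactly the two bytes that show up in the URI as %C3%A5:

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+00E5 (LATIN SMALL LETTER A WITH RING ABOVE) as a one-character Perl string
my $char = "\x{E5}";

# Encode it to utf-8 bytes; this yields two bytes, 0xC3 and 0xA5
my $bytes = encode('UTF-8', $char);

# Write each byte the way a URI escape would write it
my $escaped = join '', map { sprintf '%%%02X', ord } split //, $bytes;

print length($bytes), " bytes: $escaped\n";    # 2 bytes: %C3%A5
```

So the browser is not producing two separate "sequences"; it is escaping each byte of one multi-byte character.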

      Naturally, we'd love to have an elegant and concise way to interpret this correctly as utf8 text, but I don't know enough about the URI modules to provide much guidance in that direction.

      So instead, I'll offer an ad-hoc (but still somewhat concise) work-around -- it's a kluge, but it should work until you or some other monk can find the needed gems in the appropriate module(s):

      use Encode;

      # ... get the uri string into $_ by whatever means ...
      $_ = "a%20%C3%A5%20%2F";

      # first, let's turn the uri encoded string (with "%HH" for some bytes) into binary:
      s/\%([0-9a-f]{2})/chr(hex($1))/egi;

      # then, since this produces a utf-8 byte sequence, let's "decode" that into utf-8
      $_ = decode( 'utf8', $_ );

      # $_ now has utf8 flag set, and contains the string with expected unicode characters
      binmode STDOUT, ":utf8";
      print;
      The "binmode STDOUT" thing could be taken out if you add a "-CO" flag on the shebang line, I believe -- that "perlrun" option does the same thing as 'binmode STDOUT, ":utf8";'.
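If the work-around is needed in more than one place, it could be wrapped in a subroutine. The following is only a sketch under a few assumptions: the name uri_unescape_utf8 is hypothetical (borrowed from the thread title, not a function from any module), the input is assumed to be utf-8 that was percent-encoded by the browser, and Encode::FB_CROAK is used so that malformed input dies instead of being silently replaced.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: the name comes from the thread title, not from a real module.
sub uri_unescape_utf8 {
    my ($escaped) = @_;

    # Turn each %HH escape back into the single byte it stands for
    (my $bytes = $escaped) =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg;

    # Interpret the bytes as utf-8; FB_CROAK makes invalid byte sequences fatal
    return decode('UTF-8', $bytes, Encode::FB_CROAK);
}

my $text = uri_unescape_utf8('a%20%C3%A5%20%2F');

binmode STDOUT, ':encoding(UTF-8)';
print "$text\n";    # a å /
```

Callers that would rather not die on bad input can wrap the call in eval { ... } and check the result.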
        I think that CGI's 'unescape' method does the same thing.
        use CGI qw(unescape);
        $_ = "a%20%C3%A5%20%2F";
        $value = CGI::unescape($_);

        (it's in CGI/Util.pm)

        Thanks for your suggestion.
        This does not solve my problem, unfortunately. I have found that the decode function in the Encode module should be able to do the job.
        use Encode; print decode('utf8', "\xC3\xA5");
        This prints the character I need.
        The problem now is:
        How to go from %C3%A5 etc. to \xC3\xA5?
        I tried
        use Encode; $_='%C3%A5'; s/%/\\x/g; eval { $_=$_}; print decode('utf8', $_);
        but it does not work as expected. :-)

        This was actually solved in the previous post by graff.
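For anyone wondering why the s/%/\\x/g attempt prints the wrong thing, here is a sketch that uses the same input string. After that substitution the variable holds the eight literal characters \xC3\xA5 (backslash, x, C, 3, and so on), not two bytes. An escape like "\xC3" only becomes a byte when it appears in Perl source inside a double-quoted literal, and the block form eval { $_ = $_ } does not re-parse the contents of a string. Converting the hex digits with chr(hex(...)), as in graff's post, does that conversion at run time:

```perl
use strict;
use warnings;
use Encode qw(decode);

# The attempted approach: replace each % with the two characters "\x"
my $attempt = '%C3%A5';
$attempt =~ s/%/\\x/g;
# $attempt is now 8 literal characters: \ x C 3 \ x A 5
print length($attempt), "\n";    # 8

# The working approach: convert each pair of hex digits into a byte at run time
my $bytes = '%C3%A5';
$bytes =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg;
print length($bytes), "\n";      # 2

# Those two bytes are the same ones written as "\xC3\xA5" in source code
print $bytes eq "\xC3\xA5" ? "same bytes\n" : "different\n";    # same bytes

# Decoding the two bytes as utf-8 gives the single character U+00E5
my $char = decode('UTF-8', $bytes);
print length($char), "\n";       # 1
```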