http://qs1969.pair.com?node_id=607763

rsiedl has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks,

I have a script that retrieves several URLs from a web page. The URLs are already uri_escaped and I would like to unescape them. The problem is that they contain non-ASCII characters from other languages, which are not being unescaped correctly.
    #!/usr/bin/perl
    use strict;
    use warnings;
    use URI::Escape;

    my $terms = "%22Celades%22+%22Aspectos+cl%C3%ADnicos+*+*+menopausia%22";

    # uri_unescape($terms);
    # should return:
    #   "Celades"+"Aspectos+clínicos+*+*+menopausia"
    # actually returns:
    #   "Celades"+"Aspectos+clínicos+*+*+menopausia"
    print uri_unescape($terms);
    exit;
Anyone got any suggestions?

Replies are listed 'Best First'.
Re: uri_unescape not correct
by shmem (Chancellor) on Apr 02, 2007 at 08:18 UTC
    The URI you present is UTF-8 encoded. Quick fix -
        #!/usr/bin/perl
        use strict;
        use warnings;
        use URI::Escape;
        use Encode qw(from_to);

        my $terms = "%22Celades%22+%22Aspectos+cl%C3%ADnicos+*+*+menopausia%22";

        # uri_unescape($terms);
        # should return:
        #   "Celades"+"Aspectos+clínicos+*+*+menopausia"
        # actually returns:
        #   "Celades"+"Aspectos+clínicos+*+*+menopausia"
        $_ = uri_unescape($terms);
        from_to($_, "utf-8", "iso-8859-1");   # convert the octets in place
        print $_, "\n";

    That should do it. It seems there's no uri_unescape_utf8 in URI::Escape.
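A minimal sketch of what the from_to call does to the unescaped data, assuming the escaped input really is UTF-8 and the output is headed for an ISO-8859-1 terminal. The shortened test string is only for illustration. from_to rewrites its first argument in place and returns the new length in octets, or undef on failure.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use URI::Escape qw(uri_unescape);
use Encode qw(from_to);

# uri_unescape returns raw octets: "%C3%AD" becomes the two bytes C3 AD,
# so this string is 9 octets long.
my $octets = uri_unescape('cl%C3%ADnicos');

# Convert in place from UTF-8 to ISO-8859-1; the two-byte sequence C3 AD
# collapses to the single Latin-1 byte ED.
my $len = from_to($octets, 'utf-8', 'iso-8859-1');
die "input was not convertible\n" unless defined $len;

printf "converted to %d octets; equals Latin-1 cl\\xEDnicos: %s\n",
    $len, ($octets eq "cl\xEDnicos" ? 'yes' : 'no');
```

Note that this only helps when every character in the URL exists in ISO-8859-1; anything outside that charset cannot be represented after the conversion.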

    --shmem

    _($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                                  /\_¯/(q    /
    ----------------------------  \__(m.====·.(_("always off the crowd"))."·
    ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
Re: uri_unescape not correct
by Joost (Canon) on Apr 02, 2007 at 11:43 UTC
    There's no general way to know if a URI is UTF-8 encoded or not. See RFC 2396:

    In the simplest case, the original character sequence contains only characters that are defined in US-ASCII, and the two levels of mapping are simple and easily invertible: each 'original character' is represented as the octet for the US-ASCII code for it, which is, in turn, represented as either the US-ASCII character, or else the "%" escape sequence for that octet.

    For original character sequences that contain non-ASCII characters, however, the situation is more difficult. Internet protocols that transmit octet sequences intended to represent character sequences are expected to provide some way of identifying the charset used, if there might be more than one [RFC 2277]. However, there is currently no provision within the generic URI syntax to accomplish this identification. An individual URI scheme may require a single charset, define a default charset, or provide a way to indicate the charset used.

    You could use decode('utf-8',$string) to get the right characters after uri_unescaping, provided the uris are always utf-8 encoded.
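Since the encoding cannot be known in general, one possible approach is to try a strict UTF-8 decode and fall back to another charset when that fails. The sketch below is only a heuristic under that assumption: unescape_guess_utf8 is a hypothetical helper name (it is not part of URI::Escape), and ISO-8859-1 as the fallback is an arbitrary choice for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use URI::Escape qw(uri_unescape);
use Encode qw(decode);

# Hypothetical helper: unescape, then try strict UTF-8 decoding and fall
# back to ISO-8859-1 if the octets are not valid UTF-8.
sub unescape_guess_utf8 {
    my ($escaped) = @_;
    my $octets = uri_unescape($escaped);

    # FB_CROAK makes decode die on malformed input; LEAVE_SRC keeps
    # $octets intact so it can be reused by the fallback.
    my $chars = eval {
        decode('UTF-8', $octets, Encode::FB_CROAK | Encode::LEAVE_SRC);
    };
    return defined $chars ? $chars : decode('iso-8859-1', $octets);
}

# Encode characters on output; this line assumes a UTF-8 terminal.
binmode STDOUT, ':encoding(UTF-8)';

print unescape_guess_utf8('cl%C3%ADnicos'), "\n";   # escaped from UTF-8
print unescape_guess_utf8('cl%EDnicos'),    "\n";   # escaped from Latin-1
```

Both calls should yield the same character string. The guess can still be wrong, because some Latin-1 byte sequences happen to be valid UTF-8 as well, so a known charset is preferable whenever the source of the URLs documents one.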

Re: uri_unescape not correct
by valdez (Monsignor) on Apr 02, 2007 at 07:18 UTC

    That code works properly on my computer; it is probably just a display effect. My console is UTF-8 and shows the proper accented letter, i.e. í

    Ciao, Valerio

      I'm not so sure. I modified my code to:

          #!/usr/bin/perl
          use strict;
          use warnings;
          use URI::Escape;

          my $terms = "%22Celades%22+%22Aspectos+cl%C3%ADnicos+*+*+menopausia%22";

          print "Expecting: \"Celades\"+\"Aspectos+clínicos+*+*+menopausia\"\n";
          print "Returning: ", uri_unescape($terms), "\n";
          exit;

      and it returned:

          Expecting: "Celades"+"Aspectos+clínicos+*+*+menopausia"
          Returning: "Celades"+"Aspectos+clínicos+*+*+menopausia"

      If it were a display issue, wouldn't they both be wrong?

      Perhaps it has to do with the version of URI::Escape? Mine is 3.28; yours?
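One way to tell a display problem apart from a module problem is to look at the octets themselves rather than at what the terminal draws. The following is a small diagnostic sketch, using a shortened test string as an assumption; it prints the installed URI::Escape version alongside the raw bytes.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use URI::Escape qw(uri_unescape);
use Encode qw(decode);

my $octets = uri_unescape('cl%C3%ADnicos');
my $chars  = decode('UTF-8', $octets);

print "URI::Escape version: $URI::Escape::VERSION\n";

# Hex dump of the raw result, independent of the terminal's charset.
printf "octets: %s\n",
    join ' ', map { sprintf '%02X', ord } split //, $octets;

# One accented letter takes two octets in UTF-8, so the counts differ.
printf "octet count: %d, character count after decoding: %d\n",
    length($octets), length($chars);
```

If the dump shows C3 AD where the accented letter should be, uri_unescape has done its job and returned the UTF-8 bytes unchanged. A UTF-8 console draws those two bytes as í, while an ISO-8859-1 console draws them as two separate characters (a capital A with tilde followed by a soft hyphen), which would explain why the same code looks right on one machine and wrong on another.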