uri_unescape not correct

rsiedl has asked for the wisdom of the Perl Monks concerning the following question:

hi monks,

i have a script that retrieves several urls from a webpage. the urls are already uri_escaped and i would like to unescape them. the problem is they contain special characters from other languages which are not being unescaped correctly.

#!/usr/bin/perl

use strict;
use warnings;

use URI::Escape;

my $terms = "%22Celades%22+%22Aspectos+cl%C3%ADnicos+*+*+menopausia%22";

# uri_unescape($terms);
# should return:
#       "Celades"+"Aspectos+clínicos+*+*+menopausia"
# actually returns:
#       "Celades"+"Aspectos+clÃnicos+*+*+menopausia"

print uri_unescape($terms);

exit;

anyone got any suggestions?


Re: uri_unescape not correct by shmem (Chancellor) on Apr 02, 2007 at 08:18 UTC
The URI you present is UTF-8 encoded. Quick fix -

#!/usr/bin/perl

use strict;
use warnings;

use URI::Escape;
use Encode qw(from_to);

my $terms = "%22Celades%22+%22Aspectos+cl%C3%ADnicos+*+*+menopausia%22";

# uri_unescape($terms);
# should return:
#       "Celades"+"Aspectos+clínicos+*+*+menopausia"
# actually returns:
#       "Celades"+"Aspectos+clÃnicos+*+*+menopausia"

$_ = uri_unescape ($terms);
from_to ($_,"utf-8","iso-8859-1");
print $_,"\n";

That should do. It seems there's no uri_unescape_utf8 in URI::Escape.

--shmem
Re: uri_unescape not correct by Joost (Canon) on Apr 02, 2007 at 11:43 UTC
There's no general way to know if a URI is UTF-8 encoded or not. See RFC 2396:

    In the simplest case, the original character sequence contains only characters that are defined in US-ASCII, and the two levels of mapping are simple and easily invertible: each 'original character' is represented as the octet for the US-ASCII code for it, which is, in turn, represented as either the US-ASCII character, or else the "%" escape sequence for that octet.

    For original character sequences that contain non-ASCII characters, however, the situation is more difficult. Internet protocols that transmit octet sequences intended to represent character sequences are expected to provide some way of identifying the charset used, if there might be more than one [RFC2277]. However, there is currently no provision within the generic URI syntax to accomplish this identification. An individual URI scheme may require a single charset, define a default charset, or provide a way to indicate the charset used.

You could use decode('utf-8',$string) to get the right characters after uri_unescaping, provided the URIs are always UTF-8 encoded.

"What should it profit a man, if he should win a flame war, yet lose his cool?"
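A minimal sketch of the decode suggestion above, assuming the escaped string really is UTF-8 and that the terminal expects UTF-8 output. The helper name unescape_utf8 is hypothetical and is not something URI::Escape provides; uri_unescape and Encode's decode are the only library calls used.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use URI::Escape qw(uri_unescape);
use Encode qw(decode);

# Hypothetical helper (not part of URI::Escape): turn the %XX escapes
# back into octets, then decode those octets as UTF-8.
# Assumption: the URI was UTF-8 encoded in the first place.
sub unescape_utf8 {
    my ($escaped) = @_;
    my $octets = uri_unescape($escaped);    # raw bytes, e.g. "\xC3\xAD"
    return decode('UTF-8', $octets);        # Perl character string
}

my $terms = "%22Celades%22+%22Aspectos+cl%C3%ADnicos+*+*+menopausia%22";
my $chars = unescape_utf8($terms);

# Encode on the way out to match the terminal; UTF-8 is assumed here.
# For a Latin-1 terminal, ':encoding(iso-8859-1)' would be used instead.
binmode STDOUT, ':encoding(UTF-8)';
print $chars, "\n";
```

Decoding to characters first and encoding only at output keeps the choice of terminal encoding in one place, whereas from_to converts the bytes directly from one encoding to another.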
Re: uri_unescape not correct by valdez (Monsignor) on Apr 02, 2007 at 07:18 UTC
That code works properly on my computer; it is probably just a display effect. My console is UTF-8 and shows the proper accented letters, i.e. ì

Ciao, Valerio
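One way to check whether this is only a display effect is to look at the octets uri_unescape returns instead of how a console renders them. The sketch below uses a hypothetical helper, hex_octets, written for illustration only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use URI::Escape qw(uri_unescape);

# Hypothetical helper: render each octet of a byte string as two hex digits.
sub hex_octets {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $bytes;
}

my $unescaped = uri_unescape("cl%C3%ADnicos");

# The accented letter arrives as the two octets C3 AD.
print hex_octets($unescaped), "\n";    # prints "63 6C C3 AD 6E 69 63 6F 73"
```

A UTF-8 console shows the pair C3 AD as í, while an ISO-8859-1 console shows C3 as Ã followed by AD, which is a soft hyphen in that charset and is usually invisible.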
Re^2: uri_unescape not correct by rsiedl (Friar) on Apr 02, 2007 at 07:40 UTC
i'm not so sure. i modified my code to:

#!/usr/bin/perl

use strict;
use warnings;

use URI::Escape;

my $terms = "%22Celades%22+%22Aspectos+cl%C3%ADnicos+*+*+menopausia%22";

print "Expecting: \"Celades\"+\"Aspectos+clínicos+*+*+menopausia\"\n";
print "Returning: ",uri_unescape($terms),"\n";

exit;

and it returned:

Expecting: "Celades"+"Aspectos+clínicos+*+*+menopausia"
Returning: "Celades"+"Aspectos+clÃnicos+*+*+menopausia"

if it were a visual thing wouldn't they both be wrong? perhaps it is to do with the version of URI::Escape? mine is 3.28, yours?
