You're feeding an invalid URL to LWP, so unexpected results are to be expected. I bet it works fine when you provide a valid URL.
    use Encode      qw( encode decode );
    use URI::Escape qw( uri_escape );

    # From DB
    my $title = decode('UTF-8', "OverlordQ/R\x{C4}\x{AB}ga-Herson-Astrahan");

    # Escape each URL component.
    my @uri_components =
        map { uri_escape(encode('UTF-8', $_)) }
        split qr{/}, $title;

    # Prints OverlordQ/R%C4%ABga-Herson-Astrahan
    print(join('/', @uri_components), "\n");
Note that uri_escape(encode('UTF-8', $_)) can be written as uri_escape_utf8($_), which URI::Escape also exports.
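To make the difference concrete, here is a minimal sketch that compares the two spellings and also reproduces the doubly encoded form from the question. It assumes the database hands back raw UTF-8 bytes; the variable names are only for illustration.

```perl
use strict;
use warnings;
use Encode      qw( encode decode );
use URI::Escape qw( uri_escape uri_escape_utf8 );

# Raw UTF-8 bytes, as they might come back from the database (assumption).
my $bytes = "R\x{C4}\x{AB}ga";

# Decoded: a 4-character string whose second character is U+012B.
my $chars = decode('UTF-8', $bytes);

# Both spellings encode to UTF-8 and then percent-escape the bytes.
my $explicit = uri_escape(encode('UTF-8', $chars));
my $shortcut = uri_escape_utf8($chars);

# Skipping the decode step treats the two bytes as the characters
# U+00C4 and U+00AB, so each one gets UTF-8 encoded a second time.
my $double = uri_escape_utf8($bytes);

print "$explicit\n";   # R%C4%ABga
print "$shortcut\n";   # R%C4%ABga
print "$double\n";     # R%C3%84%C2%ABga
```

The last line matches the %C3%84%C2%AB seen in the parent post, which suggests the bytes were being encoded twice rather than being mangled by URI itself.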
| Original content of the parent |
|---|
|
Alright, in my Perl coding, I've done some work with Wikipedia. One thing you'll find on Wikipedia is plenty of Unicode. Unfortunately, I've hit some snags along the way. Since I'm not conversant with all the Black Magic(tm) of character encodings, when I mention Unicode I likely mean its UTF-8 encoding. Let's establish some facts:
Stepping through the code I have provided below, you eventually get to URI at line 77. On the first run through the regex, it eats a character:

    DB<20> p $1
    ▒
    DB<21> x unpack("U*",$1);
    0  196
Odd. Oh well, let us let the regex finish until we get to line 78.
Now let's see what the URL contains:
Hurm, not fun; that's not what we should have gotten. Bug? Or should I not be telling Perl that these strings may contain UTF-8 characters? Example below. (It abuses the pre tag since the code tag eats the characters.)
Output:
Title: OverlordQ/Rīga-Herson-Astrahan
is not UTF8
URI: http://...?...&titles=User:OverlordQ/R%C4%ABga-Herson-Astrahan&...
OverlordQ/Rīga-Herson-Astrahan
is now UTF8
URI: http://...?...&titles=User:OverlordQ/R%C3%84%C2%ABga-Herson-Astrahan&...
Title: OverlordQ/Rīga-Herson-Astrahan
|
Update: Shortened URLs in PRE tags as per reply.
In reply to Re: URIs and UTF8 by ikegami, in thread URIs and UTF8 by OverlordQ