You're feeding an invalid URL to LWP (non-ASCII characters must be percent-escaped in a URL), so unexpected results are to be expected. I bet it works fine when you provide a valid URL.

use Encode      qw( encode decode );
use URI::Escape qw( uri_escape );

# From DB
my $title = decode('UTF-8', "OverlordQ/R\x{C4}\x{AB}ga-Herson-Astrahan");

# Escape each URL component.
my @uri_components = map { uri_escape(encode('UTF-8', $_)) }
                     split qr{/},
                     $title;

# Prints OverlordQ/R%C4%ABga-Herson-Astrahan
print(join('/', @uri_components), "\n");

Note that uri_escape(encode('UTF-8', $_)) can be written more compactly as uri_escape_utf8($_).
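As a quick check of that equivalence, here is a minimal sketch. It assumes the title part has already been decoded into a character string, as in the snippet above; the variable names are only illustrative.

```perl
use strict;
use warnings;
use Encode      qw( decode encode );
use URI::Escape qw( uri_escape uri_escape_utf8 );

# Decoded (character) string, standing in for one path component of the title.
my $part = decode('UTF-8', "R\x{C4}\x{AB}ga-Herson-Astrahan");

# Long form: encode to UTF-8 octets, then percent-escape the octets.
my $long = uri_escape(encode('UTF-8', $part));

# Short form: uri_escape_utf8 does the UTF-8 encoding itself.
my $short = uri_escape_utf8($part);

print "$long\n";
print "$short\n";
print $long eq $short ? "same\n" : "different\n";
```

Both forms should print R%C4%ABga-Herson-Astrahan.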

Original content of the parent

Alright, in my Perl coding, I've done some work with Wikipedia. One thing you'll find on Wikipedia is plenty of Unicode. Unfortunately, I've hit some snags along the way. Since I'm not conversant with all the Black Magic(tm) of character encodings, when I mention Unicode I most likely mean its UTF-8 encoding.

Let's establish some facts:

Titles can be Unicode strings
Example title is: Rīga-Herson-Astrahan
This should be escaped to: R%C4%ABga-Herson-Astrahan
When the string is not marked as UTF8, it is escaped correctly
When the string is marked as UTF8, it is escaped incorrectly

Stepping through the code I have provided below, you eventually get to URI at line 77:

  DB<18> x $str
0  'http://en.wikipedia.org/w/api.php?prop=revisions&format=xml&titles=User:OverlordQ/Rīga-Herson-Astrahan&action=query&rvlimit=20'

On the first run through the regex, it eats a single character:

  DB<20> p $1
▒
  DB<21> x unpack("U*",$1);
0  196
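The value 196 is 0xC4, which is the first octet of the UTF-8 encoding of ī (U+012B, encoded as C4 AB). That suggests the string holds the UTF-8 octets as two separate characters rather than one decoded character. A minimal sketch of the difference, assuming the DB handed back raw UTF-8 octets (code_points is a hypothetical helper, not part of any module):

```perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical helper: list the code points a string contains.
sub code_points {
    my ($string) = @_;
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $string;
}

my $bytes = "R\x{C4}\x{AB}ga";        # UTF-8 octets, assumed to be what the DB returned
my $chars = decode('UTF-8', $bytes);  # the same text, decoded into characters

my $bytes_view = code_points($bytes);
my $chars_view = code_points($chars);

print length($bytes), " elements: $bytes_view\n";
print length($chars), " elements: $chars_view\n";
```

The undecoded string has five elements (including U+00C4 and U+00AB), while the decoded one has four (including U+012B).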

Odd. Oh well, let's let the regex finish until we get to line 78. Now let's see what the URL contains:

  DB<24> x $str
0  'http://en.wikipedia.org/w/api.php?prop=revisions&format=xml&titles=User:OverlordQ/R%C3%84%C2%ABga-Herson-Astrahan&action=query&rvlimit=20'

Hurm, not fun, that's not what we should have got. Is this a bug? Or should I not be telling Perl that these strings may contain UTF-8 characters? Example below. (It abuses the pre tag since the code tag eats the characters.)

#!/usr/bin/perl
use strict;
use warnings;
use lib '/home/overlordq/lib';
use LWP::UserAgent;
use Data::Dumper;
use DBI;
use wikidb;

$|++;

my $ua = LWP::UserAgent->new();

my $dbh = DBI->connect("DBI:mysql:database=enwiki_p;host=sql-s1",$user,$password);
my $query = "SELECT page_title FROM page WHERE page_title LIKE 'OverlordQ%' AND page_id = '22325873'";
my $sth = $dbh->prepare($query);
$sth->execute();
my $title;
while(my $ref = $sth->fetchrow_hashref() ) {
        $title = $ref->{'page_title'};
}

print "Title: $title\n";
if( isUTF($title) ) {
        print "\tis UTF8\n";
} else {
        print "\tis not UTF8\n";
}

my $res = $ua->post('http://en.wikipedia.org/w/api.php?prop=revisions&format=xml&titles=User:' . $title . '&action=query&rvlimit=20');

my $uriUsed = $res->request->uri->as_string;

print "URI: $uriUsed\n";

if( isUTF($title) ) {
        print "\tis already UTF8\n";
} else {
        utf8::upgrade($title);
        if( isUTF($title) ) {

                print "$title\n\tis now UTF8\n";
        }
}

$res = $ua->post('http://en.wikipedia.org/w/api.php?prop=revisions&format=xml&titles=User:' . $title . '&action=query&rvlimit=20');

$uriUsed = $res->request->uri->as_string;

print "URI: $uriUsed\n";


print "Title: $title\n";

sub isUTF {
        my $string = shift;
        return utf8::is_utf8($string);
}

Output:

Title: OverlordQ/Rīga-Herson-Astrahan
        is not UTF8
URI: http://...?...&titles=User:OverlordQ/R%C4%ABga-Herson-Astrahan&...
OverlordQ/Rīga-Herson-Astrahan
        is now UTF8
URI: http://...?...&titles=User:OverlordQ/R%C3%84%C2%ABga-Herson-Astrahan&...
Title: OverlordQ/Rīga-Herson-Astrahan
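The second URI looks like the title was UTF-8 encoded twice: the octets C4 and AB were treated as the characters U+00C4 and U+00AB and then encoded again. The following is a minimal sketch that reproduces both escaped forms using only core modules. It assumes the DB returns raw UTF-8 octets, and percent_escape is a hypothetical stand-in for whatever escaping happens inside LWP/URI, not their actual code.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Hypothetical helper: percent-escape every octet outside the unreserved set.
sub percent_escape {
    my ($octets) = @_;
    $octets =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;
    return $octets;
}

my $from_db = "R\x{C4}\x{AB}ga";  # UTF-8 octets, assumed to be what the DB returned

# Octets mistaken for characters and encoded again: double encoding.
my $double = percent_escape(encode('UTF-8', $from_db));

# Decoded to characters first, then encoded exactly once.
my $single = percent_escape(encode('UTF-8', decode('UTF-8', $from_db)));

print "double: $double\n";
print "single: $single\n";
```

Under these assumptions the first line shows R%C3%84%C2%ABga and the second shows R%C4%ABga, matching the two URIs in the output above.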

Update: Shortened URLs in PRE tags as per reply.

In reply to Re: URIs and UTF8 by ikegami
in thread URIs and UTF8 by OverlordQ
