http://qs1969.pair.com?node_id=830948

amir_e_a has asked for the wisdom of the Perl Monks concerning the following question:

I have a problem with encoding of file names on Ubuntu.

I am using glob to get a list of file names that include a certain string, slurping each file's contents into a variable, removing the file's extension using s///, and then trying to use MediaWiki::API->edit to upload the contents to a Wikipedia page whose title is the file name without the extension. The file name and its contents include Hebrew characters; the content is utf8, but I am not sure about the file name.

The content comes out correctly at the target page, but the page title is gibberish. What can I do to make the file name proper utf8, like the file's content?

Here's the relevant code:

    #!/usr/bin/perl
    use 5.010;
    use strict;
    use warnings;
    use open ':encoding(utf8)';
    use utf8;
    use English qw(-no_match_vars);
    use Carp qw(croak cluck);
    use MediaWiki::API;

    my $INPUT_EXTENSION = 'wiki.txt';

    my $mw = MediaWiki::API->new();
    $mw->{config}->{api_url} = "http://he.wikipedia.org/w/api.php";

    $mw->login(
        {
            lgname     => 'Amire80',
            lgpassword => 'secret80',    # not really
        }
    ) or croak $mw->{error}->{code} . ': ' . $mw->{error}->{details};

    my $page_prefix = 'User:Amire80';
    my $dirname     = './out.he/';

    # in the next line the word 'category' is actually supposed to be
    # written in Hebrew characters, but this website doesn't seem
    # to like it
    my @filenames = glob "${dirname}category*.$INPUT_EXTENSION";

    foreach my $filename (@filenames) {
        my $pagename = $filename;
        $pagename =~ s/\A $dirname//xms;
        $pagename =~ s/\.$INPUT_EXTENSION \z//xms;
        $pagename = "$page_prefix/$pagename";
        say $pagename;

        my $ref = $mw->get_page({ title => $pagename });
        if ($ref->{missing}) {
            say "page $pagename is missing, trying to create";
        }
        my $timestamp = $ref->{timestamp};

        local $INPUT_RECORD_SEPARATOR;
        open my $file, '<', $filename
            or croak "Can't open $filename: $OS_ERROR";
        my $text = <$file>;
        close $file;

        $mw->edit(
            {
                action        => 'edit',
                title         => $pagename,
                summary       => 'cat 001',
                basetimestamp => $timestamp,    # to avoid edit conflicts
                text          => $text,
            },
            {
                skip_encoding => 1,
            }
        ) or croak $mw->{error}->{code} . ': ' . $mw->{error}->{details};
    }

If I just give a literal Hebrew string as the title parameter to $mw->edit, then everything works correctly. What can I do with $pagename so it will be encoded the same way as $text?
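One thing that might be worth trying, as a minimal sketch only: glob returns file names as raw bytes, while $text (read through the :encoding(utf8) layer) and a literal under use utf8 are character strings. Assuming the file system stores names as UTF-8, which is the usual default on Ubuntu but is not confirmed here, decoding the name with the core Encode module should put it in the same form as $text. The helper name filename_to_title is made up for illustration and is not part of MediaWiki::API.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not part of MediaWiki::API): turn the raw bytes
# that glob returns into a Perl character string.
# Assumption: the file system encodes names as UTF-8.
sub filename_to_title {
    my ($bytes) = @_;
    # FB_CROAK makes decode die on malformed UTF-8 instead of
    # silently substituting replacement characters.
    return decode('UTF-8', $bytes, Encode::FB_CROAK);
}

# "\xD7\xA9" are the two UTF-8 bytes of the Hebrew letter shin (U+05E9),
# standing in for a name as glob would return it.
my $raw   = "\xD7\xA9";
my $title = filename_to_title($raw);

printf "bytes: %d, characters: %d\n", length($raw), length($title);
# prints "bytes: 2, characters: 1"
```

In the loop above this would amount to decoding $pagename once, after the directory and extension have been stripped, and keeping the undecoded $filename for open.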

Thanks in advance.

Version: Perl 5.10 on Ubuntu 9.10.