perlquestion
amir_e_a
<p>I have a problem with the encoding of file names on Ubuntu.</p>
<p>I am using <code>glob</code> to get a list of file names that include a certain string. For each file, I slurp its contents into a variable, strip the extension from the file name using <code>s///</code>, and then try to use <code>MediaWiki::API->edit</code> to upload the contents to a Wikipedia page whose title is the file name without the extension. Both the file name and the contents include Hebrew characters; the contents are UTF-8, but I am not sure about the file name.</p>
<p>The content comes out correctly on the target page, but the page title is gibberish. What can I do to make the file name proper UTF-8, like the file's content?</p>
<p>Here's the relevant code:</p>
<readmore>
<code>#!/usr/bin/perl
use 5.010;
use strict;
use warnings;
use open ':encoding(utf8)';
use utf8;
use English qw(-no_match_vars);
use Carp qw(croak cluck);
use MediaWiki::API;
my $INPUT_EXTENSION = 'wiki.txt';
my $mw = MediaWiki::API->new();
$mw->{config}->{api_url} = "http://he.wikipedia.org/w/api.php";
$mw->login(
    {
        lgname     => 'Amire80',
        lgpassword => 'secret80', # not really
    }
) or croak $mw->{error}->{code} . ': ' . $mw->{error}->{details};
my $page_prefix = 'User:Amire80';
my $dirname = './out.he/';
# in the next line the word 'category' is actually supposed to be
# written in Hebrew characters, but this website doesn't seem
# to like it
my @filenames = glob "${dirname}category*.$INPUT_EXTENSION";
foreach my $filename (@filenames) {
    # Derive the page title: strip the directory prefix and the extension.
    # \Q...\E keeps the dots in the interpolated strings literal.
    my $pagename = $filename;
    $pagename =~ s/\A \Q$dirname\E//xms;
    $pagename =~ s/\. \Q$INPUT_EXTENSION\E \z//xms;
    $pagename = "$page_prefix/$pagename";
    say $pagename;

    my $ref = $mw->get_page({ title => $pagename });
    if ($ref->{missing}) {
        say "page $pagename is missing, trying to create";
    }
    my $timestamp = $ref->{timestamp};

    # Slurp the whole file; "use open" decodes its contents as UTF-8.
    local $INPUT_RECORD_SEPARATOR;
    open my $file, '<', $filename
        or croak "Can't open $filename: $OS_ERROR";
    my $text = <$file>;
    close $file;

    $mw->edit(
        {
            action        => 'edit',
            title         => $pagename,
            summary       => 'cat 001',
            basetimestamp => $timestamp, # to avoid edit conflicts
            text          => $text,
        },
        { skip_encoding => 1, }
    ) or croak $mw->{error}->{code} . ': ' . $mw->{error}->{details};
}
</code>
</readmore>
<p>If I pass a literal Hebrew string as the <code>title</code> parameter to <code>$mw->edit</code>, everything works correctly. What can I do with <code>$pagename</code> so that it is encoded the same way as <code>$text</code>?</p>
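<p>To illustrate the difference I suspect is involved, here is a minimal sketch. It assumes that the file system stores file names as UTF-8 bytes and that <code>glob</code> returns those bytes undecoded, whereas <code>use open</code> decodes the file contents into characters. The sample name is a hypothetical Hebrew word written with escapes, and the <code>glob</code> result is simulated with <code>Encode::encode</code> so that the snippet runs without any files. I have not verified how this interacts with <code>skip_encoding</code> in <code>MediaWiki::API</code>.</p>
<code>#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical page name: seven Hebrew letters as a character string.
my $literal = "\x{05E7}\x{05D8}\x{05D2}\x{05D5}\x{05E8}\x{05D9}\x{05D4}";

# Simulate what glob() would return: the raw UTF-8 bytes of the same name.
# Assumption: the file system stores file names as UTF-8.
my $from_glob = encode('UTF-8', $literal);

# Decode the bytes into characters, as "use open" does for file contents.
my $decoded = decode('UTF-8', $from_glob);

printf "literal: %d characters, glob-like: %d bytes, decoded: %d characters\n",
    length $literal, length $from_glob, length $decoded;
print "decoded name equals the literal\n" if $decoded eq $literal;
</code>
<p>If that assumption holds, then in the script above I would presumably decode only <code>$pagename</code> this way and leave <code>$filename</code> as bytes, since <code>open</code> still needs the name as the file system stores it.</p>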
<p>Thanks in advance.</p>
<p>Version: Perl 5.10 on Ubuntu 9.10.</p>