I have a problem with encoding of file names on Ubuntu.
I am using glob to get a list of file names that include a certain string. For each file, I slurp its contents into a variable, strip the extension from the file name using s///, and then try to use MediaWiki::API->edit to upload the contents to a Wikipedia page whose title is the file name without the extension. Both the file names and the file contents include Hebrew characters; the content is UTF-8, but I am not sure about the file names.
The content comes out correctly on the target page, but the page title is gibberish. What can I do to make the file name proper UTF-8, like the file's content?
Here's the relevant code:
#!/usr/bin/perl
use 5.010;
use strict;
use warnings;
use open ':encoding(utf8)';
use utf8;
use English qw(-no_match_vars);
use Carp qw(croak cluck);
use MediaWiki::API;

my $INPUT_EXTENSION = 'wiki.txt';

my $mw = MediaWiki::API->new();
$mw->{config}->{api_url} = "http://he.wikipedia.org/w/api.php";
$mw->login(
    {
        lgname     => 'Amire80',
        lgpassword => 'secret80',    # not really
    }
) or croak $mw->{error}->{code} . ': ' . $mw->{error}->{details};

my $page_prefix = 'User:Amire80';
my $dirname     = './out.he/';

# in the next line the word 'category' is actually supposed to be
# written in Hebrew characters, but this website doesn't seem
# to like it
my @filenames = glob "${dirname}category*.$INPUT_EXTENSION";

foreach my $filename (@filenames) {
    my $pagename = $filename;
    $pagename =~ s/\A $dirname//xms;
    $pagename =~ s/\.$INPUT_EXTENSION \z//xms;
    $pagename = "$page_prefix/$pagename";
    say $pagename;

    my $ref = $mw->get_page({ title => $pagename });
    if ($ref->{missing}) {
        say "page $pagename is missing, trying to create";
    }
    my $timestamp = $ref->{timestamp};

    local $INPUT_RECORD_SEPARATOR;
    open my $file, '<', $filename
        or croak "Can't open $filename: $OS_ERROR";
    my $text = <$file>;
    close $file;

    $mw->edit(
        {
            action        => 'edit',
            title         => $pagename,
            summary       => 'cat 001',
            basetimestamp => $timestamp,    # to avoid edit conflicts
            text          => $text,
        },
        { skip_encoding => 1, }
    ) or croak $mw->{error}->{code} . ': ' . $mw->{error}->{details};
}
If I just give a literal Hebrew string as the title parameter to $mw->edit, then everything works correctly. What can I do with $pagename so that it will be encoded the same way as $text?
Thanks in advance.
Version: Perl 5.10 on Ubuntu 9.10.
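For reference, here is a minimal sketch of one thing that might be worth trying. It rests on two assumptions: that the file system stores the names as UTF-8 bytes, and that the difference between $text and $pagename is that $text is decoded by the ':encoding(utf8)' layer while glob returns raw bytes. The helper name decode_filename is hypothetical, made up for this illustration; Encode::decode is the core Encode module's function.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turn a raw file name (bytes, as returned by glob)
# into a character string. ASSUMPTION: the names on disk are UTF-8.
sub decode_filename {
    my ($raw_name) = @_;
    return decode('UTF-8', $raw_name);
}

# "\xD7\xA9" is the UTF-8 byte sequence for the Hebrew letter shin (U+05E9),
# used here as a stand-in for a name that glob might return.
my $raw     = "\xD7\xA9.wiki.txt";
my $decoded = decode_filename($raw);

# The raw name is counted in bytes, the decoded name in characters.
printf "bytes: %d, characters: %d\n", length($raw), length($decoded);
```

If this is the cause, the decoded string would be used only to build $pagename, while the original byte string in $filename would still be the one passed to open.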