encoding of file names

amir_e_a has asked for the wisdom of the Perl Monks concerning the following question:

I have a problem with encoding of file names on Ubuntu.

I am using glob to get a list of file names that include a certain string, slurping each file's contents into a variable, removing the file's extension using s///, and then trying to use MediaWiki::API->edit to upload the contents to a Wikipedia page whose title is the file's name without the extension. The file name and its contents include Hebrew characters; the content is utf8, but I am not sure about the file name.

The content comes out correctly at the target page, but the page title is gibberish. What can I do to make the file name proper utf8, like the file's content?

Here's the relevant code:

#!/usr/bin/perl

use 5.010;

use strict;
use warnings;
use open ':encoding(utf8)';
use utf8;

use English qw(-no_match_vars);
use Carp qw(croak cluck);

use MediaWiki::API;

my $INPUT_EXTENSION = 'wiki.txt';

my $mw = MediaWiki::API->new();
$mw->{config}->{api_url} = "http://he.wikipedia.org/w/api.php";

$mw->login(
    {
        lgname     => 'Amire80',
        lgpassword => 'secret80', # not really
    }
) or croak $mw->{error}->{code} . ': ' . $mw->{error}->{details};

my $page_prefix = 'User:Amire80';

my $dirname   = './out.he/';
# in the next line the word 'category' is actually supposed to be
# written in Hebrew characters, but this website doesn't seem
# to like it
my @filenames = glob "${dirname}category*.$INPUT_EXTENSION";

foreach my $filename (@filenames) {
    my $pagename = $filename;
    # \Q...\E keeps the dots in the directory name and extension literal
    $pagename =~ s/\A \Q$dirname\E//xms;
    $pagename =~ s/[.]\Q$INPUT_EXTENSION\E \z//xms;
    $pagename = "$page_prefix/$pagename";
    say $pagename;

    my $ref = $mw->get_page({ title => $pagename });
    if ($ref->{missing}) {
        say "page $pagename is missing, trying to create";
    }
    my $timestamp = $ref->{timestamp};

    local $INPUT_RECORD_SEPARATOR;
    open my $file, '<', $filename
        or croak "Can't open $filename: $OS_ERROR";
    my $text      = <$file>;
    close $file;

    $mw->edit(
        {
            action        => 'edit',
            title         => $pagename,
            summary       => 'cat 001',
            basetimestamp => $timestamp,    # to avoid edit conflicts
            text          => $text,
        },
        { skip_encoding => 1, }
    ) or croak $mw->{error}->{code} . ': ' . $mw->{error}->{details};
}

If I just give a literal Hebrew string as the title parameter to $mw->edit, then everything works correctly. What can I do with $pagename so that it is encoded the same way as $text?

Thanks in advance.

Version: Perl 5.10 on Ubuntu 9.10.


Replies are listed 'Best First'.
Re: encoding of file names by almut (Canon) on Mar 25, 2010 at 20:14 UTC
You could try to Encode::decode() `$filename`, e.g. from UTF-8, if you suspect that's how the names are stored in the filesystem.
Re^2: encoding of file names by amir_e_a (Hermit) on Mar 25, 2010 at 20:53 UTC
That's the thing: I would probably try using Encode, but I don't know the from encoding. The to encoding is supposed to be utf8. Or do you mean to say that UTF-8 and utf8 are different things?
Re^3: encoding of file names by almut (Canon) on Mar 25, 2010 at 21:10 UTC
"but I don't know the from encoding"

I think you can't do much harm by just trying 'UTF-8' as the from encoding :)

`$filename = decode("UTF-8", $filename);`

When your file names are in fact in UTF-8, things will likely work out fine. Otherwise, you'll know they aren't UTF-8 encoded, and you can try some other encoding...
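
A minimal sketch of the decode step suggested above, assuming the file system really does store names as UTF-8. The helper name `decode_filename` and the sample bytes are hypothetical, chosen only for illustration. Passing `Encode::FB_CROAK` makes a wrong guess fail loudly instead of silently producing replacement characters.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turn a byte-string file name into a character
# string, ASSUMING the file system stores names as UTF-8.
sub decode_filename {
    my ($bytes) = @_;

    # decode() may modify its input when a CHECK value is given,
    # so work on a copy and leave the caller's bytes untouched.
    my $copy = $bytes;

    # FB_CROAK makes decode() die on malformed UTF-8.
    return decode('UTF-8', $copy, Encode::FB_CROAK);
}

# Sample input: the UTF-8 bytes of the Hebrew letters alef and bet,
# followed by an ASCII extension.
my $raw  = "\xD7\x90\xD7\x91.wiki.txt";
my $name = decode_filename($raw);

printf "%d bytes became %d characters\n", length $raw, length $name;
```

In the script from the question, the undecoded `$filename` would presumably still be the one passed to `open`, since the file system expects bytes, while the decoded copy would be used to build `$pagename`.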
Re^4: encoding of file names by amir_e_a (Hermit) on Mar 25, 2010 at 22:39 UTC
Re: encoding of file names by ikegami (Patriarch) on Mar 25, 2010 at 20:17 UTC
Perl treats file names as opaque strings of bytes. On Unix, they are usually characters encoded using the current locale's encoding, which in turn is usually UTF-8. You need to decode the file names, and you'll be all set. (This presents a problem on Windows, which stores them as characters, but that's not relevant here.)
Re^2: encoding of file names by amir_e_a (Hermit) on Mar 25, 2010 at 20:53 UTC
Can you please tell me how to decode them? I am not very experienced with Encode. I still don't understand what to specify as the from encoding. And since you mention it, I actually am curious about Windows, because I plan to make this program portable and I have already had similar problems on Windows in the past.
Re^3: encoding of file names by ikegami (Patriarch) on Mar 25, 2010 at 21:34 UTC
"I still don't understand what to specify as the from encoding."

Usually, the file names are text encoded as per the locale's encoding. In fact, I dare say that's the expectation. Most users have a UTF-8 locale. You could assume UTF-8, and worry about it when someone complains.

If you want to actually get the right encoding, your best bet is probably the following undocumented function:

`require encoding; # Or "use encoding ();" with the parens.`
`my $locale_encoding = encoding::_get_locale_encoding();`

This is what the core module open uses.

"And since you mention it, I actually am curious about Windows"

It's a real mess. Builtins and modules have bad support for accessing Windows's wide-character interface, and (last time I checked) bad support for finding the code page of the single-byte interface, even though that is easier than locales on Unix. Maybe some other time.
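
As a possible alternative to the undocumented function, here is a sketch that asks the locale for its character set through the documented `I18N::Langinfo` module. This is not the method from the reply above. The fallback to UTF-8 and the helper name `decode_with_locale` are assumptions made for illustration, and `langinfo(CODESET)` may not be available on every platform or Perl version.

```perl
use 5.010;
use strict;
use warnings;
use Encode qw(find_encoding);
use I18N::Langinfo qw(langinfo CODESET);

# Ask the C library which character set the current locale uses.
my $codeset = langinfo(CODESET);

# Encode may not recognise every name the C library reports,
# so fall back to UTF-8 (an assumption) when the lookup fails.
my $enc = find_encoding($codeset) // find_encoding('UTF-8');

# Hypothetical helper: decode a byte-string file name using the
# encoding object chosen above.
sub decode_with_locale {
    my ($bytes) = @_;
    return $enc->decode($bytes);
}

say "locale codeset: $codeset, using Encode name: ", $enc->name;
```

With this in place, each name returned by `glob` could be passed through `decode_with_locale` before it is used as a page title.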
