http://qs1969.pair.com?node_id=830948

amir_e_a has asked for the wisdom of the Perl Monks concerning the following question:

I have a problem with encoding of file names on Ubuntu.

I am using glob to get a list of file names that include a certain string, slurp each file's contents to a variable, remove the file's extension using s///, and then i am trying to use MediaWiki::API->edit to upload the contents to a Wikipedia page whose title is the file's name without the extension. The file name and its contents include Hebrew characters; the content is utf8, but i am not sure about the file name.

The content comes out correctly at the target page, but the page title is gibberish. What can i do to make the file name proper utf8, like the file's content?

Here's the relevant code:

#!/usr/bin/perl
use 5.010;
use strict;
use warnings;
use open ':encoding(utf8)';
use utf8;
use English qw(-no_match_vars);
use Carp qw(croak cluck);
use MediaWiki::API;

my $INPUT_EXTENSION = 'wiki.txt';

my $mw = MediaWiki::API->new();
$mw->{config}->{api_url} = "http://he.wikipedia.org/w/api.php";

$mw->login(
    {
        lgname     => 'Amire80',
        lgpassword => 'secret80',    # not really
    }
) or croak $mw->{error}->{code} . ': ' . $mw->{error}->{details};

my $page_prefix = 'User:Amire80';
my $dirname     = './out.he/';

# in the next line the word 'category' is actually supposed to be
# written in Hebrew characters, but this website doesn't seem
# to like it
my @filenames = glob "${dirname}category*.$INPUT_EXTENSION";

foreach my $filename (@filenames) {
    my $pagename = $filename;
    $pagename =~ s/\A $dirname//xms;
    $pagename =~ s/\.$INPUT_EXTENSION \z//xms;
    $pagename = "$page_prefix/$pagename";
    say $pagename;

    my $ref = $mw->get_page({ title => $pagename });
    if ($ref->{missing}) {
        say "page $pagename is missing, trying to create";
    }
    my $timestamp = $ref->{timestamp};

    local $INPUT_RECORD_SEPARATOR;
    open my $file, '<', $filename
        or croak "Can't open $filename: $OS_ERROR";
    my $text = <$file>;
    close $file;

    $mw->edit(
        {
            action        => 'edit',
            title         => $pagename,
            summary       => 'cat 001',
            basetimestamp => $timestamp,    # to avoid edit conflicts
            text          => $text,
        },
        {
            skip_encoding => 1,
        }
    ) or croak $mw->{error}->{code} . ': ' . $mw->{error}->{details};
}

If i just give a literal Hebrew string as the title parameter to $mw->edit, then everything works correctly. What can i do with $pagename so it will be encoded the same way as $text?

Thanks in advance.

Version: Perl 5.10 on Ubuntu 9.10.

Re: encoding of file names
by almut (Canon) on Mar 25, 2010 at 20:14 UTC

    You could try to Encode::decode() $filename, e.g. from UTF-8, if you suspect that's how the names are stored in the filesystem.

      That's the thing - i would probably try using Encode, but i don't know the from encoding. The to encoding is supposed to be utf8.

      Or do you mean to say that UTF-8 and utf8 are different things?

        but i don't know the from encoding

        I think you can't do much harm by just trying 'UTF-8' as the from encoding :)

        use Encode qw(decode);
        $filename = decode("UTF-8", $filename);

        When your file names are in fact in UTF-8, things will likely work out fine. Otherwise, you'll know they aren't UTF-8 encoded, and you can try some other encoding...
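
        One way to find out whether the names really are UTF-8 is to make the decode fail loudly instead of silently substituting replacement characters. The following is only a sketch: decode_filename is a hypothetical helper name, and the sample bytes are the UTF-8 encoding of a Hebrew file name made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode a file name assumed to be UTF-8 bytes.
# Returns the character string, or undef if the bytes are not valid UTF-8.
sub decode_filename {
    my ($bytes) = @_;

    # FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps
    # the source string unmodified. eval turns the die into undef.
    my $chars = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return $chars;
}

# Illustrative input: UTF-8 bytes of a seven-letter Hebrew word plus ".wiki.txt".
my $raw  = "\xD7\xA7\xD7\x98\xD7\x92\xD7\x95\xD7\xA8\xD7\x99\xD7\x94.wiki.txt";
my $name = decode_filename($raw);

printf "%d bytes, %d characters\n", length($raw), length($name);   # 23 bytes, 16 characters
```

        If decode_filename returns undef for a real file name, the name is not valid UTF-8 and another encoding has to be tried.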

Re: encoding of file names
by ikegami (Patriarch) on Mar 25, 2010 at 20:17 UTC

    Perl treats file names as opaque strings of bytes*. On Unix, they are usually characters encoded using the current locale's encoding, which in turn is usually UTF-8.

    You need to decode the file names, and you'll be all set.

    * — This presents a problem on Windows which stores them as characters, but that's not relevant here.
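
    Applied to the code in the question, the decoding step could look roughly like this. It is a sketch under the assumption that the names are UTF-8; pagename_from_path is a hypothetical helper, and \Q...\E is added so that the dots in the directory name and extension are matched literally.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turn a byte-string path, as returned by glob,
# into a page title made of characters.
sub pagename_from_path {
    my ($path_bytes, $dirname, $extension, $prefix) = @_;

    # Bytes -> characters. Assumes the file system stores UTF-8 names.
    my $name = decode('UTF-8', $path_bytes);

    $name =~ s/\A\Q$dirname\E//;        # strip the leading directory
    $name =~ s/\.\Q$extension\E\z//;    # strip the trailing extension
    return "$prefix/$name";
}

# Illustrative input: the bytes glob might return for a Hebrew-named file.
my $path = "./out.he/\xD7\xA7\xD7\x98\xD7\x92\xD7\x95\xD7\xA8\xD7\x99\xD7\x94.wiki.txt";
my $pagename = pagename_from_path($path, './out.he/', 'wiki.txt', 'User:Amire80');

printf "%d characters in page name\n", length $pagename;   # 20 characters
```

    The original byte string should still be the one passed to open; only the title derived from it needs to be a character string.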

      Can you please tell me how to decode them? I am not quite experienced with Encode. I still don't understand what to specify as the from encoding.

      And since you mention it, i actually am curious about Windows, because i plan to make this program portable and i already had similar problems on Windows in the past.

        I still don't understand what to specify as the from encoding.

        Usually, the file names are text encoded as per the locale's encoding. In fact, I dare say that's the expectation.

        Most users have a UTF-8 locale. You could assume UTF-8, and worry about it when someone complains.

        If you want to actually get the right encoding, your best bet is probably the following undocumented function:

        require encoding; # Or "use encoding ();" with the parens.
        my $locale_encoding = encoding::_get_locale_encoding();

        This is what core module open uses.
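
        A documented route to similar information is the core module I18N::Langinfo, which reports the codeset of the current locale. The sketch below is an alternative to the undocumented function above, not a description of what it does internally. filename_encoding is a hypothetical helper, and falling back to UTF-8 when the codeset is missing or unknown to Encode is an assumption.

```perl
use strict;
use warnings;
use Encode qw(decode find_encoding);
use I18N::Langinfo qw(langinfo CODESET);

# Hypothetical helper: name of the encoding to assume for file names.
sub filename_encoding {
    # langinfo() may be unsupported on some platforms, hence the eval.
    my $codeset = eval { langinfo(CODESET()) };

    # Assumption: fall back to UTF-8 when the locale gives no usable answer.
    return 'UTF-8' unless defined $codeset && length $codeset;
    return find_encoding($codeset) ? $codeset : 'UTF-8';
}

my $encoding = filename_encoding();
print "Assuming file names are encoded as $encoding\n";

# Usage: decode a byte-string file name with the detected encoding.
my $decoded = decode($encoding, "plain-ascii-name.wiki.txt");
```

        The result depends on the environment the script runs in (LANG, LC_ALL and so on), so it reflects the current user's locale rather than whatever encoding was in effect when the files were created.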

        And since you mention it, i actually am curious about Windows

        It's a real mess. Builtins and modules have poor support for accessing Windows's wide-character interface, and (last time I checked) poor support for finding the code page of the single-byte interface, even though that is easier than dealing with locales on Unix. Maybe some other time.