Beefy Boxes and Bandwidth Generously Provided by pair Networks
Don't ask to ask, just ask
 
PerlMonks  

comment on

( #3333=superdoc: print w/replies, xml ) Need Help??

I have a problem with encoding of file names on Ubuntu.

I am using glob to get a list of file names that include a certain string, slurp each file's contents to a variable, remove the file's extension using s///, and then i am trying to use MediaWiki::API->edit to upload the contents to a Wikipedia page whose title is the file's name without the extension. The file name and its contents include Hebrew characters; the content is utf8, but i am not sure about the file name.

The content comes out correctly at the target page, but the the page title is gibberish. What can i do to make the file name proper utf8, as the file's content?

Here's the relevant code:

#!/usr/bin/perl use 5.010; use strict; use warnings; use open ':encoding(utf8)'; use utf8; use English qw(-no_match_vars); use Carp qw(croak cluck); use MediaWiki::API; my $INPUT_EXTENSION = 'wiki.txt'; my $mw = MediaWiki::API->new(); $mw->{config}->{api_url} = "http://he.wikipedia.org/w/api.php"; $mw->login( { lgname => 'Amire80', lgpassword => 'secret80', # not really } ) or croak $mw->{error}->{code} . ': ' . $mw->{error}->{details}; my $page_prefix = 'User:Amire80'; my $dirname = './out.he/'; # in the next line the word 'category' is actually supposed to be # written in Hebrew characters, but this website doesn't seem # to like it my @filenames = glob "${dirname}category*.$INPUT_EXTENSION"; foreach my $filename (@filenames) { my $pagename = $filename; $pagename =~ s/\A $dirname//xms; $pagename =~ s/\.$INPUT_EXTENSION \z//xms; $pagename = "$page_prefix/$pagename"; say $pagename; my $ref = $mw->get_page({ title => $pagename }); if ($ref->{missing}) { say "page $pagename is missing, trying to create"; } my $timestamp = $ref->{timestamp}; local $INPUT_RECORD_SEPARATOR; open my $file, '<', $filename or croak "Can't open $filename: $OS_ERROR"; my $text = <$file>; close $file; $mw->edit( { action => 'edit', title => $pagename, summary => 'cat 001', basetimestamp => $timestamp, # to avoid edit conflicts text => $text, }, { skip_encoding => 1, } ) or croak $mw->{error}->{code} . ': ' . $mw->{error}->{details}; }

If i just give a literal Hebrew string as the title parameter to $mw->edit, then everything works correctly. What can i do with $pagename so it will be encoded the same way as $text?

Thanks in advance.

Version: Perl 5.10 on Ubuntu 9.10.


In reply to encoding of file names by amir_e_a

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others chilling in the Monastery: (5)
As of 2023-06-06 16:28 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    How often do you go to conferences?






    Results (29 votes). Check out past polls.

    Notices?