mike.scharnow has asked for the wisdom of the Perl Monks concerning the following question:

Hi!

I'm stuck here and could not find any good search keywords to answer my question. So, sorry if this is already answered elsewhere.

I'm working on a SLES Linux system with UTF-8 support switched on. Given is a file which contains german umlauts in its name (e.g. 'Fehler für Projekt x.xls'). The following code worked perfectly on an older SLES:

unless (opendir(DIR,$scandir)) {
    confess "can't open the directory $scandir: $@";
}
@files=grep {-f "$scandir/$_"} readdir(DIR);

On this machine, it is possible to use

chdir ($scandir);
opendir(DIR,'.');
@files=grep {-f $_} readdir(DIR);

but as soon as I concatenate the retrieved filename with some other string, e.g. "$scandir/$_" or $scandir."/".$_ or $scandir."/".$files[0], "-f" does not work any more, and I cannot copy or move $scandir."/".$files[0]. Is there any general setting that I need in my Perl code to treat UTF-8 filenames correctly? I already tried "use utf8;" and 'use encoding "utf8";', but neither made a difference.

Thanks for your help

Mike

Replies are listed 'Best First'.
Re: treat files with umlauts (utf)
by Anonymous Monk on Apr 01, 2014 at 07:01 UTC

    see perlunitut: Unicode in Perl#I/O flow (the actual 5 minute tutorial) and perlport

    what readdir returns are bytes, so you need to decode them as UTF-8 before any appending/concatenation
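
A minimal sketch of why that matters, using the core Encode module. The directory name "Büro" and the idea that it arrived as an already-decoded character string are assumptions made purely for illustration; nothing in the question says where $scandir comes from. When a character string is joined with raw UTF-8 bytes, Perl treats each byte as a Latin-1 character, so the umlaut ends up encoded twice in roughly what gets handed to the operating system.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# UTF-8 bytes for "für", as readdir would return them on a UTF-8 filesystem.
my $name_bytes = "f\xC3\xBCr";

# Hypothetical directory name held as a decoded character string ("Büro").
my $dir_chars = decode('UTF-8', "B\xC3\xBCro");

# Mixing: each byte of $name_bytes is treated as one Latin-1 character.
my $mixed       = "$dir_chars/$name_bytes";
my $mixed_bytes = encode('UTF-8', $mixed);   # umlaut in the name is double-encoded

# Decoding first keeps the name intact.
my $clean       = $dir_chars . '/' . decode('UTF-8', $name_bytes);
my $clean_bytes = encode('UTF-8', $clean);

printf "mixed: %d chars, clean: %d chars\n", length $mixed, length $clean;
print  "byte paths differ\n" if $mixed_bytes ne $clean_bytes;
```

The two byte paths differ, so a file test on the mixed one looks for a name that does not exist on disk.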

    before using -f/stat... you need to use encode_utf8

    I imagine that if you stick with Path::Tiny's children method, the Perl Unicode string behaviour you've stumbled upon won't play a role, and -f will work the way you want it to

    use Unicode::UTF8 qw[decode_utf8 encode_utf8];
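
A minimal sketch of the decode-on-input, encode-on-output flow described above. It uses the core Encode module instead of Unicode::UTF8, a swap made only so the example runs without CPAN. The temporary directory and the test file name are illustrative assumptions, and the sketch presumes a filesystem that stores names as UTF-8 bytes.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use File::Temp qw(tempdir);

# Hypothetical scratch directory standing in for $scandir; decoded on input.
my $scandir = decode('UTF-8', tempdir(CLEANUP => 1));

# Create a test file; the name is a character string, encoded for the OS.
my $wanted = "Fehler f\x{FC}r Projekt x.xls";
open(my $fh, '>', encode('UTF-8', "$scandir/$wanted"))
    or die "can't create test file: $!";
close $fh;

# readdir returns bytes: decode them before any concatenation.
opendir(my $dh, encode('UTF-8', $scandir))
    or die "can't open the directory $scandir: $!";
my @names = map { decode('UTF-8', $_) } readdir $dh;
closedir $dh;

# Build paths as character strings; encode just before touching the filesystem.
my @files = grep {
    my $path_bytes = encode('UTF-8', "$scandir/$_");
    -f $path_bytes;
} @names;

binmode(STDOUT, ':encoding(UTF-8)');
print "$_\n" for @files;
```

The same encoded byte paths would be what you pass to copy or move from File::Copy.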
Re: treat files with umlauts (utf)
by zentara (Cardinal) on Apr 01, 2014 at 08:59 UTC
    Here is a quick fix which might work in your case.
    use Encode;
    chdir ($scandir);
    opendir(DIR,'.');
    @files=grep {-f $_} readdir(DIR);
    # this line tells Perl to interpret the filenames as utf8,
    # and once done, your umlauts should appear
    @files = map{ decode('utf8',$_) } @files;


      Where is decode defined?

Re: treat files with umlauts (utf)
by kcott (Archbishop) on Apr 01, 2014 at 10:50 UTC

    G'day Mike,

    Welcome to the monastery.

    In what way does '-f' "not work any more"?

    Please provide more information on "I cannot copy or move $scandir."/".$files[0]". How are you attempting to copy and move? What's happening? What errors, warnings or other feedback are you getting?

    Are you asking Perl to point out problems to you (e.g. strict, warnings, autodie, etc.)? The documentation shows opendir reporting problems in "$!": you're using "$@".
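
A minimal sketch of the point about "$!": when opendir fails, the system error text is in $!, whereas $@ holds the error from the last eval and is normally empty here. The path below is a hypothetical one chosen so the call fails.

```perl
use strict;
use warnings;

# Hypothetical path that is assumed not to exist.
my $missing = '/nonexistent/pm_example_dir';

my $error = '';
unless (opendir(my $dh, $missing)) {
    # $! carries the reason the system call failed.
    $error = "can't open the directory $missing: $!";
}
print "$error\n";
```

With autodie enabled, as in the script further down, the explicit check becomes unnecessary because a failed opendir throws an exception on its own.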

    You're using package global variables exclusively (in the code you've shown): this could be causing any number of problems. Without seeing all of your code, it's impossible to tell. Use lexical variables and avoid the issue altogether.

    I created this directory (some characters may not render in your browser but the filenames indicate the code points):

    $ ls -al pm_1080490_utf8_readdir
    total 0
    drwxr-xr-x    5 ken  staff    170  1 Apr 21:01 .
    drwxr-xr-x  599 ken  staff  20366  1 Apr 21:02 ..
    -rw-r--r--    1 ken  staff      0  1 Apr 20:47 Fehler für Projekt x.txt
    -rw-r--r--    1 ken  staff      0  1 Apr 20:58 ᚠᚡᚢᚣᚤᚥᚦ (U+16a0 to U+16a6)
    -rw-r--r--    1 ken  staff      0  1 Apr 21:01 🜁 🜂 🜃 🜄 (U+01f701 to U+01f704)
    

    I ran this script:

    #!/usr/bin/env perl -l
    use strict;
    use warnings;
    use autodie;

    my $scandir = './pm_1080490_utf8_readdir';
    opendir(my $dh, $scandir);
    my @files = grep { -f "$scandir/$_" } readdir $dh;
    print for @files;

    And got this output:

    Fehler für Projekt x.txt
    ᚠᚡᚢᚣᚤᚥᚦ (U+16a0 to U+16a6)
    🜁 🜂 🜃 🜄 (U+01f701 to U+01f704)
    

    So, I'm unable to reproduce your problem.

    Try running exactly the same code as I did (except with a different value for $scandir) and report the result. Show the full output of any errors, warnings, or other messages: vague references to "does not work" and the like are of no use at all. The guidelines in "How do I post a question effectively?" explain what information is useful and how to present it.

    -- Ken

      So, I'm unable to reproduce your problem.
      That's because your code is missing the two key ingredients:
      use utf8; ... my $scandir = 'something with umlauts it it';
        "That's because your code is missing the two key ingredients:"

        No, it's not.

        "use utf8;"

        Neither my code nor the code the OP posted requires the utf8 pragma. See the following, from that documentation, which it shows in bold-faced type:

        "Do not use this pragma for anything else than telling Perl that your script is written in UTF-8."
        "my $scandir = 'something with umlauts it it';"

        The OP does not say that $scandir contains umlauts. The only mention of umlauts by the OP is:

        "Given is a file which contains german umlauts in its name (e.g. 'Fehler für Projekt x.xls')"

        He says "file" and provides what would be reasonable to assume is an MS Excel spreadsheet file. There's nothing about any directory whose name contains umlauts.

        -- Ken