
Perl on Windows: file names with accented characters, UTF-8 and -e

by bart (Canon)
on May 11, 2016 at 20:47 UTC ( [id://1162804]=perlquestion )

bart has asked for the wisdom of the Perl Monks concerning the following question:

Using Strawberry Perl x64 on Windows, I'm having a problem with -e: it fails to recognize file names that contain an accented character. That is, -e $path returns false even though the file exists.

I'm reading the file names out of an XML file, and thus, they are strings with the UTF-8 flag set.

I'm quite sure the cause of the problem is the mix of character encodings in Perl on Windows: the standard encoding of the CMD shell (under which Perl runs) is CP-850, Explorer and the Windows API appear to use either CP-1252 or 16-bit wide Unicode characters, and Perl has its own Unicode-enabled strings.

I have no idea which encoding -e expects to work as it should. It does work if the path contains only plain ASCII characters.

So... what do I have to do to make it work?

Replies are listed 'Best First'.
Re: Perl on Windows: file names with accented characters, UTF-8 and -e
by dasgar (Priest) on May 11, 2016 at 21:20 UTC

    I had a different Perl issue on Windows a while back and what I found that helped me might end up helping you out.

    In my situation, I was trying to deal with very long paths. In Windows, there are two file system APIs. One of the APIs (used by File Explorer and the command prompt) is limited to a maximum of 250-256 characters (I've seen different reference material with differing values), and that limitation is in place for backwards compatibility. The other API gives access to the full capabilities of NTFS, namely support for significantly longer paths as well as Unicode. Although you might not be hitting the path length issue, the Unicode support may be an issue for your code.

    The module that I found that gave access to the second API is Win32::LongPath, which provides replacement functions that use "Windows wide-character functions which support Unicode and extended-length paths". In your case, you would import Win32::LongPath into your code and change your -e $path to testL ('e', $path) instead. (See the documentation on the testL function for the list of other -x file tests that it can replace.)

    Also, if you are dealing with files that have Unicode in their names, you may have issues with other file-related tasks, such as opening a file for reading/writing or calling stat on it. The Win32::LongPath module should have other functions that provide a workaround for those as well.
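
The testL suggestion above could be wrapped roughly as follows. This is only a sketch: path_exists is a hypothetical helper name, not something from Win32::LongPath, and the fallback branch assumes a file system that stores names as UTF-8 octets (typical on Linux). It also assumes $path is a Perl character string, as it would be when read from an XML parser.

```perl
use strict;
use warnings;
use Encode ();
use File::Temp ();

# Hypothetical helper: existence test for a path held as a Perl
# character string (UTF-8 flag possibly set).
sub path_exists {
    my ($path) = @_;

    # On Windows, use the wide-character API if Win32::LongPath is installed.
    if ($^O eq 'MSWin32' && eval { require Win32::LongPath; 1 }) {
        # testL('e', $path) is the module's replacement for -e $path
        return Win32::LongPath::testL('e', $path) ? 1 : 0;
    }

    # Assumption: elsewhere, file names are stored as UTF-8 octets,
    # so encode the characters explicitly before asking the OS.
    my $octets = Encode::encode('UTF-8', $path);
    return -e $octets ? 1 : 0;
}

# Usage: a temporary file that certainly exists, and a name that does not.
my $tmp = File::Temp->new;
print path_exists($tmp->filename)            ? "found\n" : "missing\n";
print path_exists($tmp->filename . '.nope')  ? "found\n" : "missing\n";
```

Loading the module with require inside eval keeps the script usable on machines where Win32::LongPath is not installed; on Windows without the module, the fallback branch is subject to the same code page problem described in the question.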

Re: Perl on Windows: file names with accented characters, UTF-8 and -e
by beech (Parson) on May 11, 2016 at 23:07 UTC
Re: Perl on Windows: file names with accented characters, UTF-8 and -e
by andal (Hermit) on May 12, 2016 at 07:31 UTC

    Your understanding is correct. Everything depends on which encoding your Windows uses to store file names (the same is true on Linux, but Linux normally uses only UTF-8, apart from some exotic cases). You can try some of the standard modules mentioned by others, or you can convert the strings yourself. Perl comes with Encode, a very powerful and simple module; use it to convert your strings to the desired encoding.

    There's one tricky point, though. Perl can be configured to automatically convert data received from the OS, and that automatic conversion may be wrong or confusing. In general, the OS passes "octets" to perl, and perl may convert them to "characters" (automatically or on request). Likewise, when working with file names, perl normally has to pass "octets" to the OS.

    So, if you know that you get data from the file in UTF-8 encoding and Windows encodes file names using CP1252, then you'd have to do the following:

    Encode::from_to($fname, "UTF-8", "CP1252");

    But the above code assumes that $fname contains "octets" (Encode::is_utf8($fname) returns false). If it already contains "characters", then the code should be:

    $fname = Encode::encode("CP1252", $fname);

    Read through perldoc Encode to get the details.
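
The two cases above can be folded into one helper, sketched below. The name fs_octets is hypothetical, and CP1252 is only the example code page from this reply; whether it is the right one for a given Windows installation is an assumption to verify. Note that Encode::is_utf8 only reports Perl's internal flag, so this check is reliable only when you know where the string came from (for instance, an XML parser that returns decoded characters).

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: return the file name as octets in $codepage.
# Accepts either a character string (UTF-8 flag set) or raw UTF-8 octets.
sub fs_octets {
    my ($name, $codepage) = @_;

    # Characters: encode straight to the target code page.
    return Encode::encode($codepage, $name) if Encode::is_utf8($name);

    # Octets: from_to converts in place, so work on a copy.
    my $copy = $name;
    Encode::from_to($copy, 'UTF-8', $codepage);
    return $copy;
}

my $octets = "caf\xC3\xA9.txt";                  # raw UTF-8 octets
my $chars  = Encode::decode('UTF-8', $octets);   # the same name as characters

# Both calls should yield the same CP1252 octets; print them as hex.
print unpack('H*', fs_octets($chars,  'cp1252')), "\n";
print unpack('H*', fs_octets($octets, 'cp1252')), "\n";
```

The result is what you would then hand to -e or open. Characters that have no CP1252 equivalent are replaced by a substitution character by default, so a name outside that code page still cannot be matched this way.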

Re: Perl on Windows: file names with accented characters, UTF-8 and -e
by Anonymous Monk on May 12, 2016 at 15:32 UTC
    maybe this will help? windows unicode issues in Perl
