cjk32 has asked for the wisdom of the Perl Monks concerning the following question:

This was posted to comp.lang.perl.misc with no replies. Hello, I'm trying to understand the correct way to handle the following. Say I have a file whose name contains non-ASCII characters, e.g. as 8-bit values 66 69 6C E9, or 'f' 'i' 'l' 'e_acute', and I want to establish whether that file exists. If I execute:
ex("c:\\fil\x{e9}.txt");
ex("c:\\fil" . pack("U", 0xe9 ) . ".txt");

sub ex {
    my $f = shift;
    print $f;
    print ((-e $f) ? " exists" : " doesn't exist");
    print "\n";
}
Then I get the following:
c:\filé.txt exists
c:\filé.txt doesn't exist
The filename is displayed in exactly the same format on the screen in both cases, but the file isn't found when the filename is passed as a utf8 string. Is there any way to tell perl that the operand being passed to -e is utf8? One workaround would be to simply convert everything from utf8 to a single-byte representation, but I've no idea whether this will work in environments other than mine (Win32, EN_GB, perl v5.8.8). It also won't handle Unicode characters that don't fit into a single byte. What is the correct way to handle this to ensure portability?
Chris Key

Replies are listed 'Best First'.
Re: utf8 filenames
by wazoox (Prior) on Apr 10, 2006 at 13:43 UTC
    Did you try:
    use utf8;

    # Convert a Perl scalar to/from UTF-8.
    $num_octets = utf8::upgrade($string);
    $success    = utf8::downgrade($string[, FAIL_OK]);
    You'll find many more gory details in "perldoc utf8".
      I've just tried:
      use utf8;

      ex("c:\\fil\x{e9}.txt");
      ex("c:\\fil" . pack("U", 0xe9 ) . ".txt");

      sub ex {
          my $f2 = shift;

          my $f = $f2;
          print $f;
          print ((-e $f) ? " exists" : " doesn't exist");
          print "\n";

          $f = $f2;
          utf8::upgrade($f);
          print $f;
          print ((-e $f) ? " exists" : " doesn't exist");
          print "\n";

          $f = $f2;
          utf8::downgrade($f);
          print $f;
          print ((-e $f) ? " exists" : " doesn't exist");
          print "\n\n";
      }
      which produces:
      c:\filé.txt exists
      c:\filé.txt doesn't exist
      c:\filé.txt exists

      c:\filé.txt doesn't exist
      c:\filé.txt doesn't exist
      c:\filé.txt exists
      Is utf8::downgrade always guaranteed to produce a string with the required encoding for any environment though?
        Is utf8::downgrade always guaranteed to produce a string with the required encoding for any environment though?

        No, it just converts UTF-8 back to Latin1 (or EBCDIC in case you're using an AS/400 or S/390). It simply looks like your file names are encoded in Latin1 on your drives. IIRC Windows 2000 used Latin1, while Windows XP uses utf8. I may be wrong though because I'm a strict Un*x/Linux guy :)
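A small sketch of what upgrade and downgrade actually change may help here. It assumes nothing beyond core perl 5.8 or later; the variable names are made up for illustration. The two strings from the original question compare equal as character strings, but they are stored differently, and utf8::downgrade only rewrites that storage (failing for characters above 0xFF):

```perl
use strict;
use warnings;

# The same characters, built the two ways used in the thread.
my $bytes = "fil\x{e9}.txt";                    # stored as one byte per character
my $wide  = "fil" . pack("U", 0xe9) . ".txt";   # stored internally as UTF-8

# As character strings the two are equal...
my $same = ($bytes eq $wide) ? 1 : 0;

# ...but the internal storage differs.
my $bytes_flag = utf8::is_utf8($bytes) ? 1 : 0;
my $wide_flag  = utf8::is_utf8($wide)  ? 1 : 0;

# utf8::downgrade rewrites the storage in place. With FAIL_OK set it returns
# false instead of dying when a character does not fit in a single byte.
my $copy = $wide;
my $ok   = utf8::downgrade($copy, 1) ? 1 : 0;

my $too_wide = "fil\x{263a}.txt";               # U+263A cannot be downgraded
my $fails    = utf8::downgrade($too_wide, 1) ? 1 : 0;

print "same=$same bytes_flag=$bytes_flag wide_flag=$wide_flag ok=$ok fails=$fails\n";
# prints: same=1 bytes_flag=0 wide_flag=1 ok=1 fails=0
```

So a downgraded name only matches the file on disk when the filesystem happens to store names as Latin-1 bytes, which is the situation described above.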

Re: utf8 filenames
by Anonymous Monk on Apr 10, 2006 at 13:50 UTC
      Does this mean that support must be explicitly written for every environment, or is it just Win32 that requires exceptions?
        Does this mean that support must be explicitly written for every environment

        That depends on what platforms you want to support. There are platforms (and filesystems) that have no provision for multibyte characters in filenames. (For that matter there are platforms that have no provision for filenames, but I don't know that perl runs on any of them.)

        Besides Win32, what else do you need to support? Modern POSIX-type systems? OS/2? VMS? Older, pre-POSIX Unix-type systems? Archimedes? Mac System 7? Amiga? PC-DOS 3.3? Whenever you're asking about portability, you've got to give us some hint exactly _HOW_ portable you need to be, i.e., what types of systems you need to support. If you just want to support current major desktop systems (Windows, Mac, the major free Unices, and maybe Solaris) it's one thing; if your portability needs are more extensive, it's another thing.

        And with Unicode (unlike most other things), it also makes a big difference exactly what versions of perl you need to support. If you only have to support perl 5.8 and higher, that makes a big difference for Unicode.
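For the platforms where a filename is ultimately a byte string, one possible approach is to encode the name explicitly before handing it to a file test, so that perl's internal representation no longer matters. The following is only a sketch under assumptions: exists_as is a hypothetical helper, and the encoding name is something the caller has to choose per platform (for example 'cp1252' is a plausible guess for an EN_GB Win32 box, 'UTF-8' for many current Unix systems). The demo creates its own file in a temporary directory, using UTF-8 for the name.

```perl
use strict;
use warnings;
use Encode ();
use File::Spec ();
use File::Temp ();

# Hypothetical helper: encode the character-string name into the bytes the
# filesystem is assumed to use, then run -e on those bytes.
# Only the file name is encoded; the directory is passed through untouched.
sub exists_as {
    my ($encoding, $dir, $name) = @_;
    my $bytes = Encode::encode($encoding, $name);
    return (-e File::Spec->catfile($dir, $bytes)) ? 1 : 0;
}

# Demo: create a file whose name is stored as UTF-8 bytes.
my $dir     = File::Temp::tempdir(CLEANUP => 1);
my $on_disk = File::Spec->catfile($dir, Encode::encode('UTF-8', "fil\x{e9}.txt"));
open my $fh, '>', $on_disk or die "cannot create test file: $!";
close $fh;

# Both internal representations of the name now give the same answer.
my $narrow = "fil\x{e9}.txt";
my $wide   = "fil" . pack("U", 0xe9) . ".txt";

my $found_narrow = exists_as('UTF-8', $dir, $narrow);
my $found_wide   = exists_as('UTF-8', $dir, $wide);

print "narrow: $found_narrow, wide: $found_wide\n";
```

Unlike utf8::downgrade, Encode::encode is not limited to characters below 0x100, provided the chosen encoding can represent them.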


        Sanity? Oh, yeah, I've got all kinds of sanity. In fact, I've developed whole new kinds of sanity. Why, I've got so much sanity it's driving me crazy.
Re: utf8 filenames
by graff (Chancellor) on Apr 11, 2006 at 03:23 UTC
    Since your file name includes a single-byte (latin1) version of e_acute, the question about compatibility and portability applies to your choice of file system, not your choice of perl coding idiom.

    It looks (surprisingly) like the text display in your terminal window (or whatever is presenting the stated output) is able to render Latin1 accented characters regardless of whether they are in single-byte or utf8-two-byte form. If your file names are likely to be limited to Latin-1 characters, you should check out this other current thread: Re: how to check the encoding of a file, to see how to tell whether the names are in utf8 or single-byte latin-1.
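A rough sketch of that kind of check, assuming the names arrive as raw bytes (for instance from readdir). The helper name guess_name_encoding is made up for illustration, and the test is a heuristic: a byte string that is valid UTF-8 could in principle still be intended as Latin-1.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: guess how a raw byte string is encoded.
# Pure ASCII needs no guess; otherwise try a strict UTF-8 decode and
# fall back to Latin-1 if the bytes are not well-formed UTF-8.
sub guess_name_encoding {
    my ($raw) = @_;
    return 'ascii' unless $raw =~ /[\x80-\xff]/;

    my $copy = $raw;    # decode with a CHECK value may modify its argument
    my $ok = eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK()); 1 };
    return $ok ? 'utf8' : 'latin1';
}

for my $name ("file.txt", "fil\xe9.txt", "fil\xc3\xa9.txt") {
    printf "%-8s %s\n", guess_name_encoding($name), unpack("H*", $name);
}
```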

    To the extent that your file names involve single-byte latin-1 characters (and your terminal can show these correctly without further ado), I don't see why you'd want to do anything with utf8 -- at least, not until your code ends up on another system where the file names are utf8 instead.

Re: utf8 filenames
by zentara (Cardinal) on Apr 10, 2006 at 15:32 UTC
      use Encode;

      ex("c:\\fil\x{e9}.txt");
      ex("c:\\fil" . chr(0xc3) . chr(0xa9) . ".txt");
      ex("c:\\fil" . pack("U", 0xe9 ) . ".txt");

      sub ex {
          my $f2 = shift;

          my $f = $f2;
          print $f;
          print ((-e $f) ? " exists" : " doesn't exist");
          print "\n";

          $f = Encode::decode( 'utf8', $f2);
          print $f;
          print ((-e $f) ? " exists" : " doesn't exist");
          print "\n\n";
      }
      Produces:
      c:\filé.txt exists
      Wide character in print at C:\ex.pl line 16.
      c:\fil�.txt doesn't exist

      c:\filé.txt doesn't exist
      c:\filé.txt doesn't exist

      c:\filé.txt doesn't exist
      Wide character in print at C:\ex.pl line 16.
      c:\fil�.txt doesn't exist