in reply to MD5 non ascii file name

1. Can the file naming system under Windows be assumed to be UTF-8?

Windows uses UTF-16LE for most stuff, but I don't know if that applies to file names as well.

2. Am I correct in assuming that, if encoding/decoding is done correctly, the usual Perl packages should work for wide-character names?

Do you mean using non-ASCII characters in package names in Perl? I don't think that's supported.

Or do you want to access arbitrary file names with Perl? That's only partly supported, because most operating systems don't know which character encoding the file names are stored in. If you treat the file names as binary data, it should mostly work, though.

 $ctx->addfile(I);

You don't initialize (or even declare) the variable $ctx. That's why it's uninitialized.
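
For illustration, here is a minimal sketch of what a properly declared context could look like. The helper name md5_of_file is hypothetical (it is not from the original code), and the sketch assumes the file is opened by a name Perl can pass to the OS unchanged.

```perl
use strict;
use warnings;
use Digest::MD5;

# Hypothetical helper: return the hex MD5 digest of a file's contents.
sub md5_of_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open '$path': $!";
    binmode $fh;                    # hash the raw bytes, no newline translation
    my $ctx = Digest::MD5->new;     # declare and initialize the context
    $ctx->addfile($fh);
    close $fh;
    return $ctx->hexdigest;
}
```

With "use strict" in place, a missing "my $ctx" is reported at compile time instead of surfacing later as an uninitialized value.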

Re^2: MD5 non ascii file name
by Corion (Patriarch) on Aug 19, 2008 at 20:39 UTC

    Perl assumes Latin1 (for Win32) or "native" (for other) for all filenames. Under Win32, Perl mostly calls the *A APIs, which deal with "ASCII" data. In theory, Perl should move to using the *W APIs so that it uses UTF-16LE for filenames and all strings passed to the OS, but it doesn't. There is no abstraction layer for handling the encoding(s) returned by readdir and for the encoding(s) passed to open. They are not necessarily compatible with each other and not necessarily compatible with other strings in Perl.

      Perl assumes Latin1 (for Win32) or "native" (for other) for all filenames.

      Do you have an example of where Perl treats file names as anything but opaque binary strings? Is that what you mean by "native"?

      If anything, Perl (such as File::Spec) treats file names as any other (undecoded) text string: as iso-latin-1, regardless of platform.

        I think the problem occurs when you do stuff like:

        use utf8;
        my $filename = "Söme Weird File";
        open my $fh, "<", $filename or die;

        Except that "ö" is likely still a valid character. The same happens with filenames read from an external file I guess.

Re^2: MD5 non ascii file name
by benjwlee (Initiate) on Aug 19, 2008 at 21:31 UTC
    ok, so if I adopt UTF-16LE would MD5 be able to handle non-ASCII file object?
    My intention is to calculate MD5 on a file named '零一.txt'.
    $ctx isn't really the problem, as this is a code fragment I got from somewhere else, tested, and it works; and since there's no 'warnings' or 'strict', Perl is supposed to be lenient about it.
    Alas, I added 'my $ctx' and '|| die' on a failed open. Neither was the issue.
    On top of that, File::Find isn't working on a directory named '零一', either, same problem?!?
      ok, so if I adopt UTF-16LE would MD5 be able to handle non-ASCII file object?

      Digest and Digest::MD5 can calculate the hash sum of any binary data that you can read in perl.

      And on Linux this simply works:

      $ echo foo > 零一.txt
      $ perl -we 'print while <>' 零一.txt
      foo

      (Intentionally no code tags here because they break non-latin1-chars)

      So yes, it's possible, by treating the file names as simple binary data. On Linux, at least ;-).
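
      A rough sketch of that idea, combined with Digest::MD5: the names returned by readdir are passed back to open without any decoding. The helper name md5_for_dir is hypothetical, and the sketch assumes a system such as Linux where file names are plain byte strings.

```perl
use strict;
use warnings;
use Digest::MD5;

# Hypothetical helper: map each plain file in $dir to the hex MD5 of its contents.
# Names are kept exactly as readdir returns them (undecoded bytes).
sub md5_for_dir {
    my ($dir) = @_;
    opendir my $dh, $dir or die "Cannot open directory '$dir': $!";
    my %digest;
    for my $name (readdir $dh) {
        my $path = "$dir/$name";
        next unless -f $path;       # skip ".", "..", and subdirectories
        open my $fh, '<:raw', $path or die "Cannot open '$path': $!";
        $digest{$name} = Digest::MD5->new->addfile($fh)->hexdigest;
        close $fh;
    }
    closedir $dh;
    return \%digest;
}
```

      Because the name is never decoded or re-encoded, the bytes handed to open are the same bytes the OS returned.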

      On top of that, File::Find isn't working on a directory named '零一', either, same problem?!?

      "Not working" is not an error description. What code fails? With what message?

        I will reduce my code for File::Find to a workable fragment before I post, but here are the error messages:
        Use of uninitialized value $file in print at unicodeTest6.pl line 28.
        Use of uninitialized value $file in -f at unicodeTest6.pl line 29.
        invalid top directory at C:/Perl/lib/File/Find.pm line 593.
      Never mind, you guys are right, I didn't define $ctx.
      Thanks, everybody. Banging head.
Re^2: MD5 non ascii file name
by ikegami (Patriarch) on Aug 22, 2008 at 00:46 UTC

    Windows uses UTF-16LE for most stuff, but I don't know if that applies to file names as well.

    It does, for the *W system calls. *A system calls use the local code page, if I understand correctly.
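
    To make the difference concrete, here is a small sketch of the byte string a *W call would expect for a given name: UTF-16LE with a terminating NUL. The helper name to_wide is hypothetical, and actually passing the result to a *W function (for example through a Win32 API binding) is outside the scope of this sketch.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: turn a decoded Perl string into a
# NUL-terminated UTF-16LE byte string, the form the *W calls take.
sub to_wide {
    my ($name) = @_;
    return encode('UTF-16LE', $name . "\0");
}

# Each of the two characters becomes two bytes, plus two bytes for the NUL.
my $wide = to_wide("\x{96F6}\x{4E00}");
printf "%d bytes\n", length $wide;
```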