benjwlee has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to calculate the MD5 of files whose file names and directory names are non-ASCII.
For example, if the file is named '零一.txt', execution fails with:
Can't call method "addfile" on an undefined value at unicodeTest2.pl line 16.

I am using ActiveState Perl v5.10.0 built for MSWin32-x86-multi-thread, binary build 1003, May 13, 2008.
I have the following questions:
1. If the naming system under Windows can be assumed as UTF-8?
2. Am I correct assuming if encoding/decoding is done correctly, the usually Perl package should work for wide-character names
require Encode;
use Digest::MD5 qw(md5 md5_hex md5_base64);

my $ctx = Digest::MD5->new;;
my $where = "C:/0work/test";
open O, ">:utf8", "C:\\0work\\test\\out.txt" or die "Couldn't open STDOUT: $!" || die "can't open\n";
opendir (D, $where) or print " \ncould not open the directory: $!";
binmode D, ":utf8";
binmode O, ":utf8";
print O "reading the list from the directory $where\n";
my @list = readdir (D);
for my $file (sort @list){
    next if ! -f $file;
    print O "$file\n";
    open(I,"<$file") || die "couldbn't open $file";
    binmode(I, ":utf8");
    $ctx->addfile(I);
    my $digest = $ctx->b64digest;
    print "digest:", $digest;
}
<Updated> added open || die and my $ctx

And here is the File::Find portion of my test:
use strict;
use warnings;
use Cwd;
use File::Find;
use Digest::MD5 qw(md5 md5_hex md5_base64);
use File::Basename;
require Encode;

$|++;
my $ctx = Digest::MD5->new;
my $totcnt=0;
find (&d, "C:/0work/pics/Done");
binmode STDOUT, ":utf8";

sub d {
    my $file = $File::Find::name;
    print $file, "\n";
    return unless -f $file;
    open(I,$file);
    binmode(I);
    $ctx->addfile(*I);
    my $digest = $ctx->b64digest;
    print $file, $digest, "\n";
    close(I);
}

Replies are listed 'Best First'.
Re: MD5 non ascii file name
by Joost (Canon) on Aug 19, 2008 at 20:19 UTC
      (though IIRC Digest::MD5 does not support wide characters).
      That's actually not a Digest::MD5 limitation, but rather a design choice of the MD5 hash algorithm (and most or all other hash algorithms share that limitation).

      Hash algorithms generally work with binary data, so you'd have to Encode::encode the data first.
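A minimal sketch of that encode-first step, assuming UTF-8 is the byte encoding you want to hash (the choice of encoding is an assumption; a different encoding gives a different digest). The file name is written with code point escapes so the script does not depend on its own source encoding.

```perl
use strict;
use warnings;
use Encode qw(encode);
use Digest::MD5 qw(md5_hex);

# "\x{96F6}\x{4E00}" are the characters 零一 from the question.
my $name = "\x{96F6}\x{4E00}.txt";

# md5_hex() croaks on wide characters, so hash the encoded bytes instead.
my $bytes  = encode('UTF-8', $name);
my $digest = md5_hex($bytes);

print length($name), " characters, ", length($bytes), " bytes\n";
print "MD5 of the UTF-8 bytes: $digest\n";
```

This hashes the name itself as a string; hashing a file's contents does not need this step as long as the handle is read as raw bytes.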

Re: MD5 non ascii file name
by moritz (Cardinal) on Aug 19, 2008 at 20:23 UTC
    1. If the naming system under Windows can be assumed as UTF-8?

    Windows uses UTF-16LE for most stuff, but I don't know if that applies to file names as well.

    2. Am I correct assuming if encoding/decoding is done correctly, the usually Perl package should work for wide-character names

    Do you mean using non-ASCII-characters in package names in perl? I don't think that's supported.

    Or do you want to access arbitrary file names with perl? That's only partly supported, because most operating systems don't know in which character encodings the file names are stored. If you treat the file names as binary data, it should mostly work, though.

     $ctx->addfile(I);

    You don't initialize (or even declare) the variable $ctx. That's why it's uninitialized.

      Perl assumes Latin1 (for Win32) or "native" (for other) for all filenames. Under Win32, Perl mostly calls the *A APIs, which deal with "ASCII" data. In theory, Perl should move to using the *W APIs so that it uses UTF-16LE for filenames and all strings passed to the OS, but it doesn't. There is no abstraction layer for handling the encoding(s) returned by readdir and the encoding(s) passed to open. They are not necessarily compatible with each other and not necessarily compatible with other strings in Perl.

        Perl assumes Latin1 (for Win32) or "native" (for other) for all filenames.

        Do you have an example of where Perl treats file names as anything but opaque binary strings? Is that what you mean by "native"?

        If anything, Perl (such as File::Spec) treats file names as any other (undecoded) text string: as iso-latin-1, regardless of platform.

      ok, so if I adopt UTF-16LE would MD5 be able to handle non-ASCII file object?
      My intention is to calculate MD5 on a file named '零一.txt'.
      $ctx isn't really the problem: this is a code fragment I got from elsewhere that was tested and works, and since there's no 'warnings' or 'strict', perl is supposed to be lenient about it.
      Alas, I added 'my $ctx' and '|| die' on the open calls. Neither was the issue.
      On top of that, File::Find isn't working on a directory named '零一', either, same problem?!?
        ok, so if I adopt UTF-16LE would MD5 be able to handle non-ASCII file object?

        Digest and Digest::MD5 can calculate the hash sum of any binary data that you can read in perl.

        And on linux this simply works:

        $ echo foo > 零一.txt
        $ perl -we 'print while <>' 零一.txt
        foo

        (Intentionally no code tags here because they break non-latin1-chars)

        So yes, it's possible, by treating the file names as simple binary data. On Linux, at least ;-).
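The same bytes-only approach can be sketched for a whole directory tree with File::Find. This is an illustration under assumptions, not a fix verified on Win32: `md5_tree` is a hypothetical helper name, and the file names are passed through exactly as the OS returns them. Note that `find` expects a code reference (`\&d` or an anonymous sub); writing `&d` calls the sub immediately instead.

```perl
use strict;
use warnings;
use File::Find;
use Digest::MD5;

# Hypothetical helper: maps each plain file under $root to its hex MD5.
sub md5_tree {
    my ($root) = @_;
    my %digest_for;
    find({
        no_chdir => 1,    # stay in the current directory so full paths remain valid
        wanted   => sub {
            my $path = $File::Find::name;
            return unless -f $path;
            open(my $fh, '<', $path)
                or do { warn "Cannot open $path: $!\n"; return };
            binmode($fh);    # hash the raw bytes of the file
            $digest_for{$path} = Digest::MD5->new->addfile($fh)->hexdigest;
            close($fh);
        },
    }, $root);
    return \%digest_for;
}
```

Calling `md5_tree('some/dir')` returns a hash reference keyed by path; a file that cannot be opened is reported with a warning rather than silently skipped.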

        On top of that, File::Find isn't working on a directory named '零一', either, same problem?!?

        "Not working" is not an error description. What code fails? With what message?

        Never mind, you guys are right, I didn't define $ctx.
        Thanks, everybody. Banging head.

      Windows uses UTF-16LE for most stuff, but I don't know if that applies to file names as well.

      It does, for the *W system calls. *A system calls use the local code page, if I understand correctly.
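To make the difference concrete, here is a small sketch that shows the bytes of the name from the question in UTF-16LE next to UTF-8. It only illustrates the encodings; it does not call any Windows API.

```perl
use strict;
use warnings;
use Encode qw(encode);

# "\x{96F6}\x{4E00}" are the characters 零一 from the question.
my $name = "\x{96F6}\x{4E00}";

my $utf16le = unpack('H*', encode('UTF-16LE', $name));
my $utf8    = unpack('H*', encode('UTF-8',    $name));

print "UTF-16LE: $utf16le\n";    # f696004e
print "UTF-8:    $utf8\n";       # e99bb6e4b880
```

The two byte sequences differ, so a name read through one interface cannot simply be handed back through another without converting it.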

Re: MD5 non ascii file name
by JavaFan (Canon) on Aug 19, 2008 at 20:29 UTC
    As others have pointed out, $ctx isn't set. You also don't check the return value of the open(I, "<$file"); call, so you don't know whether the open succeeded.

    And isn't it time to slowly kill the bare-word-as-a-file-handle idiom for new code? Bare word file handles are so 20th century.
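A sketch of the directory loop with lexical handles and checked opens might look like the following. `md5_for_dir` is a hypothetical helper name. It also prefixes each name with the directory, since readdir returns bare names rather than paths. Whether non-ASCII names survive the round trip on Win32 is a separate question that this does not settle.

```perl
use strict;
use warnings;
use Digest::MD5;

# Hypothetical helper: returns { file name => base64 MD5 } for the plain files in $dir.
sub md5_for_dir {
    my ($dir) = @_;
    opendir(my $dh, $dir) or die "Cannot open directory $dir: $!";
    my %digest_for;
    for my $name (sort readdir $dh) {
        my $path = "$dir/$name";    # readdir returns names relative to $dir
        next unless -f $path;
        open(my $fh, '<', $path) or die "Cannot open $path: $!";
        binmode($fh);               # raw bytes, no :utf8 layer, for hashing
        $digest_for{$name} = Digest::MD5->new->addfile($fh)->b64digest;
        close($fh);
    }
    closedir($dh);
    return \%digest_for;
}
```

A fresh Digest::MD5 object per file keeps each digest independent, and the lexical handles close themselves if the sub exits early.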

      And isn't it time to slowly kill the bare-word-as-a-file-handle idiom for new code?

      Unfortunately, open still uses barewords for most of its examples. But I think someone submitted a patch to change that in the next version?