in reply to Unicode (ä, ö, ü in German) Problem with File::Find under Windows2000

Perl 5.8 is capable of using UTF8 internally, but it can't always tell if a string of octets is UTF8, latin1, big5, ..., or just some random binary data.

It looks like readdir is returning a latin1 encoding of the name. This leaves the interesting question of what it would do with some name that isn't representable in latin1.
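
To make the latin1 point concrete, here is a minimal sketch (not part of the original test) of turning latin1 octets, such as a name readdir might hand back, into Perl characters with Encode. The sample bytes and the helper name latin1_to_chars are made up for illustration; whether readdir really returns latin1 on a given system is an assumption.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: interpret raw octets as ISO-8859-1 (latin1).
sub latin1_to_chars {
    my ($octets) = @_;
    return decode('iso-8859-1', $octets);
}

# "Gödel" as latin1 octets: the ö is the single byte 0xF6.
my $octets = "G\xF6del";
my $chars  = latin1_to_chars($octets);

# Re-encoding as UTF-8 turns the ö into two bytes (0xC3 0xB6).
my $utf8_octets = encode('UTF-8', $chars);

printf "characters: %d, UTF-8 bytes: %d\n",
    length($chars), length($utf8_octets);
```

The same character count with a different byte count is why Perl cannot guess the encoding from the octets alone: you have to tell it.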

As BrowserUK pointed out, the error means that the string in your source file is not encoded as UTF-8. If your editor can work in UTF-8, you can still use such strings in your program. In vim, for example, use ":set encoding=utf8". That will work for newly typed text, but may not convert pre-existing non-ASCII characters.

I tried a test with File::Find, and finddepth seemed to work okay.

If you want to display non-ASCII data in a DOS box, you need to convert it to the correct code page. Here's an example program:

    #!perl -w
    use Encode;
    use utf8;

    my $test = "This is a test. Gödel";
    my $cp = `chcp`;    # get code page from DOS CHCP command
    if ($cp =~ /(\d+)/) {
        $cp = "cp$1";
    }
    else {
        $cp = "cp437";
    }
    binmode STDOUT, ":encoding($cp)" or die "Error on binmode: $!";
    print STDOUT "$test\n";

readdir UTF8
by Thelonius (Priest) on Sep 02, 2003 at 17:55 UTC
    Addendum: I found out how to get UTF8 results from readdir, if you need them (for German, you don't). You use the "perl -C" flag or set ${^WIDE_SYSTEM_CALLS}.

    Right now it's slightly broken because it returns utf8 strings, but doesn't set the utf8 flag on the strings. There is a workaround:

        #!perl -w
        use File::Find;
        use strict;
        use Encode qw(decode_utf8 is_utf8);

        my $start = "/home/Hirschk/pmonks/utftest";
        {
            local ${^WIDE_SYSTEM_CALLS} = 1;
            finddepth( \&showme, $start );
        }

        sub fixutf8 {
            for (@_) {
                if (${^WIDE_SYSTEM_CALLS} && !is_utf8($_)) {
                    $_ = decode_utf8($_);
                }
            }
        }

        sub showme {
            fixutf8($File::Find::dir, $File::Find::name, $_);
            print "\$_ = $_\n";
        }
    The fixutf8 function should, well, fix it.
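
A minimal sketch of what that decode step does, using a hard-coded byte string as a stand-in for a real readdir result (an assumption, so that it runs anywhere and does not depend on ${^WIDE_SYSTEM_CALLS}):

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 is_utf8);

# Stand-in for a name returned without the utf8 flag:
# "Gödel" as raw UTF-8 octets (ö = 0xC3 0xB6).
my $raw = "G\xC3\xB6del";

my $flag_before   = is_utf8($raw) ? 1 : 0;
my $length_before = length $raw;       # counts octets

my $fixed = decode_utf8($raw);         # same call as in fixutf8

my $flag_after   = is_utf8($fixed) ? 1 : 0;
my $length_after = length $fixed;      # counts characters

print "before: flag=$flag_before length=$length_before\n";
print "after:  flag=$flag_after length=$length_after\n";
```

Before the decode, Perl treats the name as six unrelated octets; afterwards it is five characters with the flag set, which is the state fixutf8 is trying to reach.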

      Thank you BrowserUk! Thank you Thelonius!
      I think I got more than I hoped for.

      Because of some other trouble, I have only been able to try a little more since this morning.
      Thelonius, I've tried your code, and I think something goes wrong inside File::Find::finddepth before your fix is applied.
      I used XML to get a UTF8 string (similar to my old program).
      config.xml

          <?xml version="1.0" encoding="UTF-8" ?>
          <config>
            <srcdir>d:\temp\source\test2</srcdir>
            <dstdir>d:\temp\source\test5</dstdir>
          </config>

      newcopy6.pl
          #!d:\perl\bin\perl.exe -w
          use File::Find;
          use strict;
          use Encode qw(encode_utf8 decode_utf8 is_utf8);
          use XML::Simple;

          my $configfile=".\\config.xml";
          my $config=XMLin($configfile);
          my $srcdir="d:\\temp\\source\\test2";
          print "\$srcdir: $srcdir\n";
          if(is_utf8($srcdir)){
              print "is utf8\n";
          }else{
              print "is NOT utf8\n";
              $srcdir=decode_utf8($srcdir); # ???
          }
          # line "!!!" gets srcdir from the XML config
          # or you can comment it out to test
          # whether line "???" has any effect or not
          $srcdir=$$config{'srcdir'}; # !!!
          if(is_utf8($srcdir)){
              print "is utf8\n";
          }else{
              print "is NOT utf8\n";
          }
          {
              local ${^WIDE_SYSTEM_CALLS} = 1;
              finddepth( \&showme, $srcdir );
          }

          sub fixutf8 {
              for (@_) {
                  if (${^WIDE_SYSTEM_CALLS} && !is_utf8($_)) {
                      $_ = decode_utf8($_);
                  }
              }
          }

          sub showme {
              print "\$_ = $_\n";
              fixutf8($File::Find::dir,$File::Find::name,$_);
              print "\$_ = $_\n";
          }
      And I got these results in the DOS box, but it's NOT depth first!

          D:\temp\source>newcopy6.pl
          $srcdir: d:\temp\source\test2
          is NOT utf8
          is utf8
          Can't cd to (d:\temp\source\test2/) ├â┬╢a: No such file or directory at D:\temp\source\newcopy6.pl line 28
          $_ = ├╢a
          $_ = ├╢a
          $_ = .
          $_ = .

      and in Komodo

          $srcdir: d:\temp\source\test2
          is NOT utf8
          is utf8
          $_ = öa
          $_ = öa
          $_ = .
          $_ = .

      And if I comment out line "!!!", I get the following in Komodo.
      Line "???" has NO effect, but the traversal is depth first:

          $srcdir: d:\temp\source\test2
          is NOT utf8
          is NOT utf8
          $_ = ü.txt
          $_ = &#52212;xt
          $_ = öa
          $_ = &#30797;
          $_ = .
          $_ = .

      (I've set UTF8 as the editor encoding in Komodo's preferences. Some characters can't be posted here correctly; see the note from BrowserUK.)
      Then I tested fixutf8 with some extra output:

          sub fixutf8 {
              for (@_) {
                  print "\$_=$_";
                  if (${^WIDE_SYSTEM_CALLS} && !is_utf8($_)) {
                      $_ = decode_utf8($_);
                  }
                  if(is_utf8($_)){
                      print "#\$_=$_ is utf8\n";
                  }else{
                      print "#\$_=$_ is NOT utf8\n";
                  }
              }
          }

      and got:

          $srcdir: d:\temp\source\test2
          is NOT utf8
          is NOT utf8
          $_ = ü.txt
          $_=d:\temp\source\test2/öa#$_=d:\temp\source\test2/&#30816;is utf8
          $_=d:\temp\source\test2/öa/ü.txt#$_=d:\temp\source\test2/&#30831;&#52212;xt is utf8
          $_=ü.txt#$_=&#52212;xt is utf8
          $_ = &#52212;xt
          $_ = öa
          $_=d:\temp\source\test2#$_=d:\temp\source\test2 is NOT utf8
          $_=d:\temp\source\test2/öa#$_=d:\temp\source\test2/&#30816;is utf8
          $_=öa#$_=&#30816;is utf8
          $_ = &#30797;
          $_ = .
          $_=d:\temp\source\test2#$_=d:\temp\source\test2 is NOT utf8
          $_=d:\temp\source\test2#$_=d:\temp\source\test2 is NOT utf8
          $_=.#$_=. is NOT utf8
          $_ = .

      So, my guess is:
      If I give finddepth a UTF8 directory name, it gets non-UTF8 names for the child nodes and can't handle them correctly, as in the first two results (DOS and Komodo).

      If I give finddepth a normal string written in the program source, it handles the names without problems, as in the last result.
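
One possible reading of the odd characters in the DOS-box output, offered as a sketch and not as a diagnosis: they look like the UTF-8 octets of "öa" displayed in code page 437, and the name in the "Can't cd to" message looks like the same name UTF-8 encoded twice. The helper name as_seen_in_cp437 is made up for illustration, and cp437 as the console code page is an assumption.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: how would these octets look in a cp437 console?
sub as_seen_in_cp437 {
    my ($octets) = @_;
    return decode('cp437', $octets);
}

my $name = "\x{F6}a";                  # the directory name "öa"

# Encoded once as UTF-8: ö becomes the octets 0xC3 0xB6.
my $once  = encode('UTF-8', $name);
# Encoded a second time, with the octets mistaken for latin1 characters.
my $twice = encode('UTF-8', $once);

my $seen_once  = as_seen_in_cp437($once);
my $seen_twice = as_seen_in_cp437($twice);

binmode STDOUT, ':encoding(UTF-8)';
print "once:  $seen_once\n";
print "twice: $seen_twice\n";
```

If that reading is right, the name was upgraded to UTF-8 a second time somewhere between the flagged start directory and the unflagged child name, which would also explain why the chdir fails.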

      In the end I used a plain text file as the config file...
      somewhat disappointed.
      But I still can't understand:
      --Why does line "???" have no effect?
      --According to Thelonius' post, there is no function like getEncoding, so what is the encoding inside the program?

      By the way, if you visit www.perl-community.de (where I also posted), you can see some other German-in-Win32 problems. For German in the DOS box there is a solution from Crian.