terris has asked for the wisdom of the Perl Monks concerning the following question:

Something my script is doing has caused File::Find to return UTF-8 instead of ANSI strings. At least I think it's UTF-8.

The ANSI filename is:
çá¬áº ¡á ¬á¡µ. G«óáad.doc

File::Find::name returns:
çá¬áº ¡á ¬á¡µ. G«óáad.doc

I tried using File::Find in a simple script. File::Find::name returns the ANSI string. Therefore, my program is doing something strange to cause File::Find to alter its behavior.

I copied all the "use" statements to the simple test program and I can't reproduce the problem.

My program reads an XML file using XML::Parser. I did the same in my simple test program and wasn't able to replicate the File::Find behavior.

Perl Version:
This is perl, v5.8.3 built for MSWin32-x86-multi-thread (with 8 registered patches, see perl -V for more detail)

Binary build 809 provided by ActiveState Corp.

Any ideas?

Thanks,
Terris

Replies are listed 'Best First'.
Re: File::Find returning utf-8 characters
by traveler (Parson) on Mar 26, 2004 at 20:21 UTC
    I had a similar problem when I used Switch. See this node for more information. It only happened under some circumstances.

    HTH, --traveler

Re: File::Find returning utf-8 characters
by kvale (Monsignor) on Mar 26, 2004 at 20:19 UTC
    If your script and test script both use the same modules, then the difference is in either the non-module code or a method call that you used in one program, but not the other.

    A general method for solving this sort of problem is to take out code, bit by bit, until the problem stops happening. The last bit you took out then has something to do with the problem.
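While removing code piece by piece, it can help to have a small probe that shows exactly what a returned name contains. The following is only a sketch: `describe_name` is a hypothetical helper (not part of File::Find), and the sample string is made up. It reports whether Perl's internal UTF-8 flag is set on the string and prints each character's code point in hex.

```perl
use strict;
use warnings;

# Hypothetical helper: report whether Perl's internal UTF-8 flag is set
# on a string and list each character's code point in hex.
sub describe_name {
    my ($name) = @_;
    my $flag  = utf8::is_utf8($name) ? 'utf8-flagged' : 'bytes';
    my $codes = join ' ', map { sprintf '%02X', ord } split //, $name;
    return "$flag: $codes";
}

# Inside a File::Find callback this could be called as:
#   find(sub { print describe_name($File::Find::name), "\n" }, $dir);
# Here it is run on a made-up plain ASCII name instead.
print describe_name("ad.doc"), "\n";
```

Running the probe after each removal shows whether the names coming back have changed from plain bytes to flagged strings, which narrows down the statement responsible.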

    We could possibly help more if you showed us the code.

    -Mark

      Thank you for your message. This script: http://cvs.sourceforge.net/viewcvs.py/dgpctk/pctk2/pl/dgbuild.pl invokes this module via the method Build(): http://cvs.sourceforge.net/viewcvs.py/dgpctk/pctk2/pm/cfgmgmt/Build.pm
Re: File::Find returning utf-8 characters
by flyingmoose (Priest) on Mar 27, 2004 at 02:43 UTC
    At least I think it's UTF-8.

    It appears to be. UTF-8 is identical to ASCII for the 128 code points from 0 to 127, and the ASCII characters in your filename still line up.
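To illustrate the point about ASCII lining up, here is a minimal sketch using the core Encode module. The sample name is invented, and cp1252 is only an assumption about which "ANSI" code page is in use; `hex_bytes` is a hypothetical helper for display. The ASCII characters come out as the same bytes in both encodings, while the one non-ASCII character becomes two bytes in UTF-8.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Made-up sample name: one non-ASCII character (U+00E1) followed by ASCII text.
my $name = "\x{E1}d.doc";

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

my $ansi = encode('cp1252', $name);  # single-byte "ANSI" code page (assumed)
my $utf8 = encode('UTF-8',  $name);

print "cp1252: ", hex_bytes($ansi), "\n";  # E1 64 2E 64 6F 63
print "UTF-8:  ", hex_bytes($utf8), "\n";  # C3 A1 64 2E 64 6F 63
```

A name that shows two odd-looking characters wherever one accented character was expected, with the plain ASCII parts intact, is consistent with UTF-8 bytes being displayed as a single-byte code page.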

    Aside: the niceness of UTF-8 is why Linux C apps don't have to worry about Unicode much. UTF-8 is just another byte stream to them, and they don't know what is Unicode and what isn't. Dealing with other encodings is more painful, and Microsoft C/C++ likes to work in two-byte wide characters (WCHAR). I experienced this fun recently when dealing with a Unicode-enabled shared library we were writing. UTF-8 is much better than the alternatives.