How to fix wrongly encoded filenames?

mcc001 has asked for the wisdom of the Perl Monks concerning the following question:

Hi, Problem: I am reading the contents of a directory on a Microsoft Windows Server 2008 SP2 machine. The contents of the directory were created by the command-line version of 7zip while unpacking a TAR archive that was created on a UNIX system. The filenames contain characters outside ASCII.

Symptoms and the reason for them: The filenames were encoded in latin1 and were put as byte strings on the filesystem of the Windows server, then tagged as being in the encoding scheme of the Windows server without being altered on the byte level. The command 'dir' on that directory shows the correct number of characters, but the non-ASCII characters are shown as 'grey blocks'.
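To illustrate the symptom, here is a minimal sketch of what happens when latin1 bytes are read under a different code page. It assumes the server's code page is cp437 (that is a guess for illustration, not something established above), and the filename is made up.

```perl
use strict;
use warnings;
use Encode qw(decode);

# "Bär.txt" as latin1 bytes: the a-umlaut is the single byte 0xE4.
my $latin1_bytes = "B\xE4r.txt";

# Assumption: the bytes get interpreted as cp437. In cp437 the byte 0xE4
# is GREEK CAPITAL LETTER SIGMA (U+03A3), so the name comes out wrong.
my $garbled  = decode('cp437',      $latin1_bytes);
my $intended = decode('iso-8859-1', $latin1_bytes);

binmode STDOUT, ':encoding(UTF-8)';
printf "garbled:  %s (second char U+%04X)\n", $garbled,  ord substr($garbled,  1, 1);
printf "intended: %s (second char U+%04X)\n", $intended, ord substr($intended, 1, 1);
```

Both strings have the same number of characters, which matches what 'dir' shows: the count is right, only the characters are wrong.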

The same directory viewed in Windows Explorer shows the names in the DOS 8.3 scheme with an attached ~<number>. I need to read the directory with Perl the way command.com sees it, and to fix the problem WITHOUT changing the bytes of the filenames as such.

I tried many versions of encode/decode/use bytes/pack-unpack. Too many to post them here.

How can I fix that with Perl only?
Thank you very much in advance!
Best regards,
mcc


Replies are listed 'Best First'.
Re: How to fix wrongly encoded filenames? by Anonymous Monk on Mar 17, 2014 at 10:49 UTC
```perl
#!/usr/bin/perl --
use Encode qw/ encode decode /;

my $string = decode("iso-8859-1", $octets);

{
    ## probably works but if it doesn't
    use Win32::Unicode::File();
    Win32::Unicode::File::moveW( $rawness, $string );
}

{
    ## should work
    open my $in, '<:raw', $rawness or die $!;
    {
        use Win32::Unicode::Native;
        open my $out, '>:raw', $string or die $!;
        print $out readline( $in );
        close $out;
    }
    close $in;
}
```
Re^2: How to fix wrongly encoded filenames? by mcc001 (Initiate) on Mar 17, 2014 at 17:53 UTC
Hi, thanks for the code! :)

One problem is reading the corrupted filenames off the filesystem without getting the shortened 8.3 form. What happens is: 7zip, which is used to unpack the TAR archive, doesn't know the encoding scheme of the filenames and tags them as cp437 (while they are latin1). Windows sees the cp437 flag and converts the latin1 filename from cp437 to the underlying UTF-16 (I think this is used internally). The result is a latin1 byte string converted from cp437 to UTF-16, which is encoding nonsense.

The logic I want to implement in Perl (and currently don't know how) is: 7zip is used to unpack the TAR archives. From the output of 7zip I get a list of latin1-encoded filenames while 7zip is extracting them. Take filename by filename off the list, and if one is not found, it is a filename whose encoding is garbled. For those filenames, do the following: Read the encoding nonsense (and NOT the 8.3 form of the names, which Windows is not able to display correctly) off the filesystem. Decode the names from cp437 and encode them to latin1. Check whether they can be found now. If so, rename the garbled filename to the corresponding filename from the list (the output of 7zip).

The first goal is to read the full (and garbled) filename from the filesystem. The second goal is to change the "encoding scheme flag" of the byte string of the filename without changing the bytes themselves.

I cannot identify the part of the code above that reads the filenames off the filesystem, which definitely is a result of my being a novice and no monk... ;) How can I implement the algorithm described above?

Thank you very much in advance! Best regards, mcc
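The decode/encode step of the algorithm above could be sketched roughly like this. Note that in Perl terms the direction is: encode the garbled characters to cp437 to get the original bytes back, then decode those bytes as latin1. The helper name `repair_name` and the sample names are made up for illustration, and the sketch only works on strings; it does not touch the filesystem.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: undo "latin1 bytes misread as cp437".
# Takes the garbled name as a character string, returns the intended name.
sub repair_name {
    my ($garbled) = @_;
    # Recover the original bytes. FB_CROAK makes encode() die on characters
    # cp437 cannot represent, i.e. names that were not garbled this way.
    my $bytes = encode('cp437', $garbled,
                       Encode::FB_CROAK() | Encode::LEAVE_SRC());
    # Interpret those bytes as what they really are.
    return decode('iso-8859-1', $bytes);
}

# Names reported by 7zip, already decoded from latin1 (illustrative values).
my %expected = map { $_ => 1 } ("B\x{E4}r.txt", "plain.txt");

# Names as read from the filesystem (illustrative values).
my @on_disk = ("B\x{3A3}r.txt", "plain.txt");

my @renames;
for my $name (@on_disk) {
    next if $expected{$name};                  # found as-is, nothing to do
    my $fixed = eval { repair_name($name) };   # undef if not repairable
    next unless defined $fixed && $expected{$fixed};
    push @renames, [ $name, $fixed ];
}

binmode STDOUT, ':encoding(UTF-8)';
print "would rename '$_->[0]' to '$_->[1]'\n" for @renames;
```

The actual rename on Windows would still need a Unicode-aware call such as the `Win32::Unicode::File::moveW` used in the reply above; Perl's built-in `rename` is not assumed to handle these names.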
Re^3: How to fix wrongly encoded filenames? by Anonymous Monk on Mar 18, 2014 at 07:16 UTC
Well, that's the tricky part. You can get the non-8.3 names using Win32::GetANSIPathName()/Win32::GetLongPathName() or using Win32::Unicode::Dir. Now, what you'll get will not be raw bytes, so changing them from whatever they are into whatever you want them to be will be tricky. Getting the real names from the source tarball is the easiest option. Tutorials: perlunitut: Unicode in Perl, perluniintro/perlunitut... Re: Can Perl convert ISO-? | WIN-? | MAC-? to UTF-8? Good luck
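A rough sketch of the Win32::Unicode::Dir route, assuming the Win32::Unicode distribution is installed. The helper name `long_names` is made up, and the directory is taken from the command line. The module is loaded at run time so the file still compiles on non-Windows systems.

```perl
use strict;
use warnings;

# Hypothetical helper: return the long (non-8.3) entry names of $dir
# as Perl character strings. Only meaningful on Windows.
sub long_names {
    my ($dir) = @_;
    require Win32::Unicode::Dir;    # assumption: Win32::Unicode is installed
    my $wdir = Win32::Unicode::Dir->new;
    $wdir->open($dir) or die "cannot open $dir: " . $wdir->error;
    my @names = grep { $_ ne '.' && $_ ne '..' } $wdir->fetch;
    $wdir->close;
    return @names;
}

if ($^O eq 'MSWin32' && @ARGV) {
    binmode STDOUT, ':encoding(UTF-8)';
    print "$_\n" for long_names($ARGV[0]);
}
```

The names come back as character strings, not raw bytes, so any repair has to work at the character level (for example by re-encoding to the code page that was wrongly assumed).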
Re: How to fix wrongly encoded filenames? by andal (Hermit) on Mar 17, 2014 at 11:27 UTC
Well, I don't know about Windows Server, but on Linux file names don't have any encoding information attached to them. There's a global setting for all file names. So, to fix the problem, one should change the "bytestring" of the file name to match the system-wide encoding of file names. I have a strong suspicion that Windows Server is no different from Linux in this respect.
Re: How to fix wrongly encoded filenames? by graff (Chancellor) on Mar 18, 2014 at 03:37 UTC
Do you still have the original tar file that came from the unix system? If so, you should be able to open that with Archive::Tar and get the raw byte strings of the file names. If they really are encoded as iso-8859-1, then it's trivial to decode those strings to utf8 (and if necessary, re-encode them to whatever works on your windows server). If that's possible, then maybe you want to just delete the first attempt from the windows server and try again using Archive::Tar (instead of 7zip, whatever that is); you can iterate through the tar file, decode the non-ASCII names into perl-internal utf8 (and re-encode for windows if necessary), then create directories and files on the server filesystem as needed to unpack the tar contents. Who knows, maybe you'll want to decode/recode the file contents while you're at it.
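Something along these lines would list the decoded names. This is only a sketch: it builds a tiny demo archive in a temp file to stand in for the real tarball, and it assumes the names really are iso-8859-1. `Archive::Tar->iter` walks the archive one entry at a time instead of loading everything at once.

```perl
use strict;
use warnings;
use Archive::Tar;
use Encode qw(decode);
use File::Temp qw(tempfile);

# Demo archive with a latin1-encoded member name (stand-in for the
# real tarball from the unix system).
my (undef, $tarball) = tempfile(SUFFIX => '.tar', UNLINK => 1);
my $demo = Archive::Tar->new;
$demo->add_data("B\xE4r.txt", "contents\n");
$demo->write($tarball) or die $demo->error;

# Walk the archive entry by entry and decode each stored name.
my @names;
my $next = Archive::Tar->iter($tarball);
while (my $entry = $next->()) {
    push @names, decode('iso-8859-1', $entry->full_path);
}

binmode STDOUT, ':encoding(UTF-8)';
print "$_\n" for @names;
```

Whether this is fast enough for archives of several gigabytes would have to be measured; the sketch only shows how to get at the names.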
Re^2: How to fix wrongly encoded filenames? by Anonymous Monk on Mar 18, 2014 at 06:17 UTC
That is what was tried before. It fails for two reasons. First: performance, because you have to handle the files separately, and there may be up to 5000 files. Second: size, because the TAR-balls to be handled are several gigabytes in size.
Re: How to fix wrongly encoded filenames? by wollmers (Scribe) on Mar 18, 2014 at 10:10 UTC
Most Linux/Unix filesystems store names as bytes and can use any encoding (whatever the locale of the user/shell is). Modern Windows (NTFS, VFAT with long file names) and Mac OS X HFS+ use UCS-2/UTF-16. In some situations non-ASCII characters in filenames on OS X are escaped. OS X is also case-preserving but case-insensitive, which means that e.g. Foo.txt and foo.txt in the same directory point to the same file. Maybe Windows behaves the same way (a Win-guy told me so yesterday). Thus filenames are not portable between (the most popular) operating systems unless you restrict filenames to ASCII a-zA-Z0-9_+.- and avoid case-duplicates. In your case you maybe should unpack on a Linux system, URL-encode the filenames, and then transfer them to Win. But keep in mind that on Linux filenames can be 255 bytes long, whereas Win allows 255 UCS-2 characters.
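The URL-encoding idea could look like the following sketch. The helper names `escape_name` and `unescape_name` are made up; the safe set is the a-zA-Z0-9_+.- range mentioned above, and the input is assumed to be a byte string for a single path component (not a full path, since '/' would be escaped too).

```perl
use strict;
use warnings;

# Percent-encode every byte outside a-zA-Z0-9_+.-
# A literal '%' is encoded as well, so the mapping stays reversible.
sub escape_name {
    my ($bytes) = @_;
    (my $safe = $bytes) =~ s/([^A-Za-z0-9_+.\-])/sprintf '%%%02X', ord $1/ge;
    return $safe;
}

# Reverse of escape_name: turn %XX back into the original byte.
sub unescape_name {
    my ($safe) = @_;
    (my $bytes = $safe) =~ s/%([0-9A-Fa-f]{2})/chr hex $1/ge;
    return $bytes;
}

my $original = "B\xE4r 100%.txt";          # latin1 bytes, illustrative name
my $escaped  = escape_name($original);
print "$escaped\n";
```

Keep in mind that each escaped byte takes three characters, so long names can grow past the length limits mentioned above.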

