 
PerlMonks  

How to fix wrongly encoded filenames?

by mcc001 (Initiate)
on Mar 17, 2014 at 10:32 UTC (#1078587)

mcc001 has asked for the wisdom of the Perl Monks concerning the following question:

Hi, Problem: Reading the contents of a directory on a Microsoft Windows Server 2008 SP2 machine. The contents of the directory were created by the command-line version of 7zip while unpacking a TAR archive that was created on a UNIX system. The filenames contain characters outside ASCII.

Symptoms and reason for them: The filenames were encoded in latin1 and were put as bytestrings on the filesystem of the Windows server and then tagged as being in the encoding scheme of the Windows server, without being altered on the byte level. The command 'dir' on that directory shows the correct count of characters, but the non-ASCII characters are shown as 'grey blocks'.

Windows Explorer shows the same directory in the DOS 8.3 scheme with an appended ~<number>. I need to read the directory with Perl the way command.com sees it, and to fix the problem WITHOUT changing the bytes of the filenames as such.

I tried many versions of encode/decode/use bytes/pack-unpack. Too many to post them here.

How can I fix that with Perl only?
Thank you very much in advance!
Best regards,
mcc

Replies are listed 'Best First'.
Re: How to fix wrongly encoded filenames?
by Anonymous Monk on Mar 17, 2014 at 10:49 UTC
    #!/usr/bin/perl --
    use Encode qw/ encode decode /;
    my $string = decode("iso-8859-1", $octets);
    {
        ## probably works but if it doesn't
        use Win32::Unicode::File();
        Win32::Unicode::File::moveW( $rawness, $string );
    }
    {
        ## should work
        open my $in, '<:raw', $rawness or die $!;
        {
            use Win32::Unicode::Native;
            open my $out, '>:raw', $string or die $!;
            print $out readline( $in );
            close $out;
        }
        close $in;
    }
      Hi, thanks for the code! :) One problem is to read the corrupted filenames off the filesystem without getting the shortened 8.3 form.

      What happens is: 7zip, which is used to unpack the TAR archive, doesn't know the encoding scheme of the filenames and tagged them as cp437 (while they are latin1). Windows sees the cp437 flag and converts the latin1 filename from cp437 to the underlying UTF-16 (I think this is used internally). The result is a latin1 bytestring converted from cp437 to UTF-16, which results in encoding nonsense.

      The logic which I want to implement (and currently don't know how) in Perl is:
      7zip is used to unpack the TAR archives. From the output of 7zip I get a list of latin1-encoded filenames while 7zip is extracting them.
      Take filename by filename off the list; if one is not found, it is a filename whose encoding is garbled.
      For those filenames, do:
      Read the encoding nonsense (and *NOT* the 8.3 form of the files, which Windows is not able to display correctly) off the filesystem. Decode them from cp437 and encode them to latin1. Check whether they can be found now. If so, rename the garbled filename to the corresponding filename on the list (output from 7zip).

      First goal is to read the full (and garbled) filename from the filesystem.
      Second goal is to change the "encoding scheme flag" of the bytestring of the filename *without* changing the bytes themselves.

      I cannot identify the part of the code above which reads the filenames off the filesystem, which definitely is a result of my being a novice and no monk...;)

      How can I implement the algorithm described above?
      Thank you very much in advance!
      Best regards, mcc
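
The decode/encode step described in this post can be sketched with the core Encode module. This is only a minimal sketch, not code from the thread: it assumes the garbled names really are latin1 bytes that were interpreted as cp437, and `repair_name` is a hypothetical helper name. In Encode terms the direction is: encode the garbled characters back to cp437 bytes, then decode those bytes as latin1. It does not read anything from the filesystem; the garbled name is simulated.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: undo "latin1 bytes were read as cp437".
# Input is the garbled name as a character string; output is the
# intended name as a character string.
sub repair_name {
    my ($garbled) = @_;
    # Map the garbled characters back to the original bytes.
    # FB_CROAK makes encode() die if a character has no cp437 byte,
    # which would mean the cp437 assumption is wrong for this name.
    my $bytes = encode('cp437', $garbled,
                       Encode::FB_CROAK() | Encode::LEAVE_SRC());
    # Interpret those bytes as what they really were: latin1.
    return decode('iso-8859-1', $bytes);
}

# Simulate the accident instead of touching a real directory.
my $original = "M\xFCller.txt";              # latin1 bytes as stored in the tar
my $garbled  = decode('cp437', $original);   # the wrong interpretation
my $fixed    = repair_name($garbled);
print $fixed eq decode('iso-8859-1', $original) ? "repaired\n" : "mismatch\n";
```

Because cp437 assigns a character to every byte value, the round trip loses nothing as long as the assumption about the wrong code page holds.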

Re: How to fix wrongly encoded filenames?
by andal (Hermit) on Mar 17, 2014 at 11:27 UTC

    Well, I don't know about Windows Server, but on Linux file names don't have any encoding information attached to them. There's a global setting for all file names. So, to fix the problem, one should change the "bytestring" of the file name to match the system-wide encoding of file names. I have a strong suspicion that Windows Server is no different from Linux in this respect.

Re: How to fix wrongly encoded filenames?
by graff (Chancellor) on Mar 18, 2014 at 03:37 UTC
    Do you still have the original tar file that came from the unix system? If so, you should be able to open that with Archive::Tar, and get the raw byte strings of the file names. If they really are encoded as iso-8859-1, then it's trivial to decode those strings to utf8 (and if necessary, re-encode them to whatever works on your windows server).

    If that's possible, then maybe you want to just delete the first attempt from the windows server and try again using Archive::Tar (instead of 7zip, whatever that is); you can iterate through the tar file, decode the non-ASCII names into perl-internal utf8 (and re-encode for windows if necessary); then create directories and files on the server filesystem as needed to unpack the tar contents.

    Who knows, maybe you'll want to decode/recode the file contents while you're at it.
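
Reading the names from the tar file could look roughly like the sketch below. It assumes the member names are stored as iso-8859-1 bytes, as the question states; `tar_names_latin1` is a hypothetical helper name. It uses the `Archive::Tar->iter` class method, which walks the archive entry by entry rather than loading it all into memory, and it only collects the decoded names without extracting anything.

```perl
use strict;
use warnings;
use Archive::Tar;
use Encode qw(decode);

# Hypothetical helper: return the member names of a tar file as
# decoded character strings, assuming they were stored as latin1 bytes.
sub tar_names_latin1 {
    my ($tar_path) = @_;
    # iter() returns a code ref that yields one Archive::Tar::File
    # object per call and a false value when the archive is exhausted.
    my $next = Archive::Tar->iter($tar_path)
        or die "Cannot read $tar_path: " . Archive::Tar->error;
    my @names;
    while ( my $entry = $next->() ) {
        push @names, decode('iso-8859-1', $entry->full_path);
    }
    return @names;
}
```

The resulting list can then be compared with what is actually on disk, or used to create the files under their intended names.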

      That is what was tried before. It fails for two reasons: First: Performance, because you have to handle the files separately. The count of files may be up to 5000.

      Second: Size, the tarballs which have to be handled are some gigabytes in size.

Re: How to fix wrongly encoded filenames?
by wollmers (Scribe) on Mar 18, 2014 at 10:10 UTC

    Most Linux/Unix filesystems store the names as bytes, and can use any encoding (whatever the locale of the user/shell is).

    Modern Windows (NTFS, VFAT with long file names) and Mac OS X HFS+ use UCS-2/UTF-16. In some situations non-ASCII characters in filenames on OS X are escaped. OS X is also "case-preserving" but case-insensitive, which means that e.g. Foo.txt and foo.txt in the same directory point to the same file. Maybe Windows behaves the same way (a Win-guy told me so yesterday).

    Thus filenames are not portable between (the most popular) operating systems unless you restrict filenames to ASCII a-zA-Z0-9_+.- and avoid case-duplicates.

    In your case you should perhaps unpack on a Linux system, URL-encode the filenames, and then transfer them to Windows. But keep in mind that on Linux filenames can be 255 bytes long, whereas Windows allows 255 UCS-2 characters.
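
The URL-encoding idea can be sketched in plain Perl as follows. This is an illustration only: the helper names `url_encode_name` and `url_decode_name` are hypothetical, and the safe set is the a-zA-Z0-9_+.- list mentioned above. The input is treated as raw bytes, so the result is pure ASCII whatever the original encoding was.

```perl
use strict;
use warnings;

# Hypothetical helper: percent-encode every byte outside the portable
# set a-zA-Z0-9_+.- so the name is plain ASCII on any filesystem.
# '%' itself is not in the set, so the encoding stays reversible.
sub url_encode_name {
    my ($bytes) = @_;    # raw filename bytes, e.g. latin1 as stored on Linux
    (my $safe = $bytes) =~ s/([^A-Za-z0-9_+.\-])/sprintf('%%%02X', ord $1)/ge;
    return $safe;
}

# Inverse: turn %XX escapes back into the original bytes.
sub url_decode_name {
    my ($safe) = @_;
    (my $bytes = $safe) =~ s/%([0-9A-Fa-f]{2})/chr hex $1/ge;
    return $bytes;
}

print url_encode_name("M\xFCller 2014.txt"), "\n";   # prints M%FCller%202014.txt
```

Each escaped byte grows to three characters, so a name close to the 255 limit may no longer fit after encoding.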
