Re: Seeking help with Extracting files from zip
by jonadab (Parson) on Jan 14, 2015 at 14:16 UTC
|
Everything works perfectly as long as my zip file has files named with Latin characters, but things get worse when the names are Chinese or Japanese.
If you can answer a couple of questions, it may give us the information that would allow us to actually help you...
- When you say "things get worse", what does that mean, exactly? Do you get an error message? Does extractMemberWithoutPaths return an error code? Are the files written at all? Do their filenames get mangled? What do the resulting mangled filenames look like? The phrase "get worse" is kind of vague, so I'm not really sure what's going wrong, and without knowing what's going wrong, it's hard to know how to fix it.
- Can you, via some other means (say, using a file manager, or on the command line) create files in the location where you are trying to extract these, with CJK characters in their filenames? Not all filesystems support such things, and so without knowing what kind of filesystem your storage device is formatted with, we can't know for sure that it's even theoretically possible for such filenames to be created. Do you know what kind of filesystem it is?
ext3? NTFS? HFS+? FAT32? Something else? (If you don't know the answer to this, just telling us what operating system you're using and whether you're saving on your computer's main hard drive or to a USB flash drive or some other location could provide clues.) Update: I just noticed the "c:/somedir" in your code, which I suspect narrows things down a little. NTFS *ought* to be able to handle CJK filenames, I think, although depending on what version of Windows you have it might require that the relevant language options be installed, in the Language thingydoo in the control panel. If you're using a really old Windows (95/98/Me) or for some other reason are using FAT32, then I'm less sure.
Oh, one other thing: the following code works for me (Perl 5.10.1, debian oldstable amd64):
nathan@warthog:~/test2/extract$ ls
somefile.zip
nathan@warthog:~/test2/extract$ perl -e '
$filename = "somefile.zip";
$dest_dir = "/home/nathan/test2/extract";
use Archive::Zip qw(:ERROR_CODES);  # import AZ_OK and friends
my $zip = Archive::Zip->new();
local $Archive::Zip::UNICODE = 1;   # treat member names as UTF-8
unless ( $zip->read($filename) == AZ_OK ) {
    die "Error Reading Zip File !";
}
foreach my $m ($zip->members()) { print "Member $m:\n ";
    my $err = $zip->extractMemberWithoutPaths( $m, "$dest_dir/" . $m->fileName);
    print "Error: $err" if $err; print $/;
}'
Member Archive::Zip::ZipFileMember=HASH(0xdfdd30):
Member Archive::Zip::ZipFileMember=HASH(0xdfe2b8):
Member Archive::Zip::ZipFileMember=HASH(0xdfe5a0):
Member Archive::Zip::ZipFileMember=HASH(0xdfe888):
Member Archive::Zip::ZipFileMember=HASH(0xdfeb98):
Member Archive::Zip::ZipFileMember=HASH(0xdfee80):
nathan@warthog:~/test2/extract$ ls
한국어 somefile.zip ગુજરાતી ಕನ್ನಡ বাংলা 中文 日本語
nathan@warthog:~/test2/extract$
(Perlmonks seems unable or perhaps unwilling to handle most of those characters -- and if unwilling I can't blame them; this is by design an English-language venue -- but they display just fine on my terminal when I do the ls. Of course, I created my somefile.zip using the zip program that comes with Debian; yours may have been created using different software...)
|
Thanks for your reply.
By "worse" I mean the filename characters get mangled.
I tried extracting the same zip file using Windows tools like WinRAR, and it extracts the files with the proper names, as it should.
I am using Windows 7 and have the Japanese and Chinese language packs installed on the machine. Below is a link to an image that shows the difference in the name of the folder.
http://s4.postimg.org/bnphbww59/Japanese.png
|
Ok, so I assume the katakana filename there is what it's supposed to look like, and the gibberish filename with roughly twice as many characters, most of which look like they came from the miscellaneous-symbols-and-accented-characters section of an eight-bit character set, is the result of running your code?
This definitely looks like a charset translation issue. The Archive::Zip documentation indicates that setting UNICODE causes the filenames in the archive to be treated as UTF8. Perhaps they're not? Maybe they're UTF16 or UTF32 or some other Unicode encoding (or, heaven help you, some pre-Unicode Asian encoding like Shift-JIS or whatnot)? If you can figure out what fiddling needs to be done to preserve the encoding, you can pass the correct filename to extractMemberWithoutPaths and that should probably work, I think...
Unfortunately, I don't know that much about the details of the character sets involved, but maybe someone else will come along now and be able to recognize what's going on. (Even just being able to recognize which encoding is being erroneously treated as though it were some other encoding would go a long way toward figuring out the problem.) That image you provided should help.
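To make the "one encoding treated as though it were another" idea concrete, here is a minimal sketch using the core Encode module. The byte string is a made-up sample (the katakana word テスト written as CP932/Shift-JIS bytes), not the name from the screenshot, and CP932 is only one candidate encoding among several.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical sample: the katakana word "tesuto" as CP932 (Shift-JIS) bytes.
my $raw = "\x83\x65\x83\x58\x83\x67";

# Decoding with the encoding the name was written in gives 3 characters.
my $right = decode('cp932', $raw);

# Treating every byte as its own character (ISO-8859-1) gives mojibake
# with twice as many characters.
my $wrong = decode('iso-8859-1', $raw);

printf "correct: %d chars, mangled: %d chars\n", length $right, length $wrong;

# Print code points rather than the characters, so any terminal can show them.
print join(' ', map { sprintf 'U+%04X', ord } split //, $right), "\n";
```

If the mangled name in the screenshot has about twice as many characters as the real one, a double-byte encoding read one byte at a time is a plausible suspect, though the sketch does not prove which one.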
Re: Seeking help with Extracting files from zip
by pmqs (Friar) on Jan 14, 2015 at 16:02 UTC
|
Were the zip files created using Windows 7 Compressed Folders?
If they were, the filenames will not be stored in Unicode in the zip files (see this link for the gory details).
I *think* they will use whatever code page is active on your Windows setup.
To tell for sure, can you run the zipdetails script that comes with perl against the zip file you used for the screenshot and post the output?
|
Hahaha, I just noticed this ... 7zip handles/makes the names utf8, whereas windows does codepage nonsense
Re: Seeking help with Extracting files from zip ( Win32::Unicode )
by Anonymous Monk on Jan 14, 2015 at 23:36 UTC
|
If you want any kind of unicode filenames on windows, you need Win32::Unicode {
use Win32::Unicode qw/ -native /;
open my($fh), '>:raw', $unicodename;
...
}
|
Seems to work for me. Naturally it doesn't chmod/umask, it's not thoroughly tested, it assumes UTF-8 (I could see no easy flag that signals UTF-8), and it's a monkeypatch: unzipwin32unicode.pl
|
How I created the test directory, later zipped with 7zip, makekebabs.pl, its meat
kebabing the ћевап.txt
kebabing the ražnjić.txt
kebabing the ćevap.txt
kebabing the кебапче.txt
kebabing the kebab.txt
Re: Seeking help with Extracting files from zip
by Anonymous Monk on Jan 14, 2015 at 18:09 UTC
|
What do members look like? (when you print them out to a file)
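One way to get that printout, as a rough sketch: write each member name to a file together with a hex dump of its bytes, so the encoding can be inspected without relying on the terminal. The helper describe_name and the output file names.txt are made up for illustration. The Archive::Zip calls used are new, members and fileName, and $Archive::Zip::UNICODE is deliberately left unset here, on the assumption that the names then come back as stored.

```perl
use strict;
use warnings;

# Render a name as printable ASCII plus a hex dump of its raw bytes.
sub describe_name {
    my ($name) = @_;
    my $bytes = $name;
    utf8::encode($bytes) if utf8::is_utf8($bytes);  # character string -> UTF-8 bytes
    (my $ascii = $bytes) =~ s/[^\x20-\x7E]/?/g;     # mask anything non-ASCII
    my $hex = join ' ', map { sprintf '%02X', ord } split //, $bytes;
    return "$ascii [$hex]";
}

# Usage: perl dumpnames.pl archive.zip   (script and file names are hypothetical)
if (@ARGV) {
    require Archive::Zip;
    my $zip = Archive::Zip->new($ARGV[0]) or die "Cannot read $ARGV[0]\n";
    open my $out, '>', 'names.txt' or die "Cannot write names.txt: $!\n";
    print {$out} describe_name($_->fileName), "\n" for $zip->members;
    close $out or die "Cannot close names.txt: $!\n";
}
```

Posting the hex part is usually enough for someone to recognize the encoding.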
|
Here is an example of the output I get with a zip file that uses Unicode properly. The important thing to look for is the presence of the "Language Encoding" flag (bit 11 of the General Purpose Flag).
0000 LOCAL HEADER #1 04034B50
0004 Extract Zip Spec 0A '1.0'
0005 Extract OS 00 'MS-DOS'
0006 General Purpose Flag 0800
[Bit 11] 1 'Language Encoding'
0008 Compression Method 0000 'Stored'
000A Last Mod Time 3EB3B54C 'Thu May 19 22:42:24 2011'
000E CRC 00000000
0012 Compressed Length 00000000
0016 Uncompressed Length 00000000
001A Filename Length 0009
001C Extra Length 001C
001E Filename 'tmp/PĀé'
0027 Extra ID #0001 5455 'UT: Extended Timestamp'
0029 Length 0009
002B Flags '03 mod access'
002C Mod Time 4DD58EBF 'Thu May 19 22:42:23 2011'
0030 Access Time 4DD59079 'Thu May 19 22:49:45 2011'
0034 Extra ID #0002 7875 'ux: Unix Extra Type 3'
0036 Length 000B
0038 Version 01
0039 UID Size 04
003A UID 000003E8
003E GID Size 04
003F GID 000003E8
0043 CENTRAL HEADER #1 02014B50
0047 Created Zip Spec 1E '3.0'
0048 Created OS 03 'Unix'
0049 Extract Zip Spec 0A '1.0'
004A Extract OS 00 'MS-DOS'
004B General Purpose Flag 0800
[Bit 11] 1 'Language Encoding'
004D Compression Method 0000 'Stored'
004F Last Mod Time 3EB3B54C 'Thu May 19 22:42:24 2011'
0053 CRC 00000000
0057 Compressed Length 00000000
005B Uncompressed Length 00000000
005F Filename Length 0009
0061 Extra Length 0018
0063 Comment Length 0000
0065 Disk Start 0000
0067 Int File Attributes 0000
[Bit 0] 0 'Binary Data'
0069 Ext File Attributes 81A40000
006D Local Header Offset 00000000
0071 Filename 'tmp/PĀé'
007A Extra ID #0001 5455 'UT: Extended Timestamp'
007C Length 0005
007E Flags '03 mod access'
007F Mod Time 4DD58EBF 'Thu May 19 22:42:23 2011'
0083 Extra ID #0002 7875 'ux: Unix Extra Type 3'
0085 Length 000B
0087 Version 01
0088 UID Size 04
0089 UID 000003E8
008D GID Size 04
008E GID 000003E8
0092 END CENTRAL HEADER 06054B50
0096 Number of this disk 0000
0098 Central Dir Disk no 0000
009A Entries in this disk 0001
009C Total Entries 0001
009E Size of Central Dir 0000004F
00A2 Offset to Central Dir 00000043
00A6 Comment Length 0000
Done
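If zipdetails is not to hand, the same bit can be checked with a few lines of core Perl. This is only a sketch: parse_local_header is a made-up helper, it looks at the first local header only, and the sample bytes are LOCAL HEADER #1 from the dump above written out in on-disk (little-endian) order.

```perl
use strict;
use warnings;

# Return the general purpose flags and raw name bytes from a zip local file header.
sub parse_local_header {
    my ($buf) = @_;
    return if length($buf) < 30;                # fixed part of the header is 30 bytes
    my ($sig, $version, $flags, $method, $dostime,
        $crc, $csize, $usize, $name_len, $extra_len)
        = unpack 'V v v v V V V V v v', $buf;
    return unless $sig == 0x04034B50;           # LOCAL HEADER signature
    return {
        flags    => $flags,
        is_utf8  => ($flags & 0x0800) ? 1 : 0,  # bit 11, "Language Encoding"
        raw_name => substr($buf, 30, $name_len),
    };
}

# Fixed header plus the 9-byte filename, taken from the dump above.
my $sample = pack 'H*', join '', qw(
    504B0304 0A00 0008 0000 4CB5B33E 00000000 00000000 00000000
    0900 1C00 746D702F50C480C3A9
);
my $h = parse_local_header($sample);
printf "flags=%04X utf8=%d name_bytes=%d\n",
    $h->{flags}, $h->{is_utf8}, length $h->{raw_name};
```

For a real archive, read the first few hundred bytes of the file in binmode and pass them in; a zip whose is_utf8 comes back 0 gives no promise about the encoding of its names.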
|
Actually, if you run zipdetails in verbose mode, we can get a hex dump of what is actually stored in the zip file. The "-v" option enables verbose mode:
$ zipdetails -v abc.zip
0000 0004 50 4B 03 04 LOCAL HEADER #1 04034B50
0004 0001 0A Extract Zip Spec 0A '1.0'
0005 0001 00 Extract OS 00 'MS-DOS'
0006 0002 00 08 General Purpose Flag 0800
[Bit 11] 1 'Language Encoding'
0008 0002 00 00 Compression Method 0000 'Stored'
000A 0004 4C B5 B3 3E Last Mod Time 3EB3B54C 'Thu May 19 22:42:24 2011'
000E 0004 00 00 00 00 CRC 00000000
0012 0004 00 00 00 00 Compressed Length 00000000
0016 0004 00 00 00 00 Uncompressed Length 00000000
001A 0002 09 00 Filename Length 0009
001C 0002 1C 00 Extra Length 001C
001E 0009 74 6D 70 2F Filename 'tmp/PĀé'
50 C4 80 C3
A9
0027 0002 55 54 Extra ID #0001 5455 'UT: Extended Timestamp'
0029 0002 09 00 Length 0009
002B 0001 03 Flags '03 mod access'
002C 0004 BF 8E D5 4D Mod Time 4DD58EBF 'Thu May 19 22:42:23 2011'
0030 0004 79 90 D5 4D Access Time 4DD59079 'Thu May 19 22:49:45 2011'
0034 0002 75 78 Extra ID #0002 7875 'ux: Unix Extra Type 3'
0036 0002 0B 00 Length 000B
0038 0001 01 Version 01
0039 0001 04 UID Size 04
003A 0004 E8 03 00 00 UID 000003E8
003E 0001 04 GID Size 04
003F 0004 E8 03 00 00 GID 000003E8
0043 0004 50 4B 01 02 CENTRAL HEADER #1 02014B50
0047 0001 1E Created Zip Spec 1E '3.0'
0048 0001 03 Created OS 03 'Unix'
0049 0001 0A Extract Zip Spec 0A '1.0'
004A 0001 00 Extract OS 00 'MS-DOS'
004B 0002 00 08 General Purpose Flag 0800
[Bit 11] 1 'Language Encoding'
004D 0002 00 00 Compression Method 0000 'Stored'
004F 0004 4C B5 B3 3E Last Mod Time 3EB3B54C 'Thu May 19 22:42:24 2011'
0053 0004 00 00 00 00 CRC 00000000
0057 0004 00 00 00 00 Compressed Length 00000000
005B 0004 00 00 00 00 Uncompressed Length 00000000
005F 0002 09 00 Filename Length 0009
0061 0002 18 00 Extra Length 0018
0063 0002 00 00 Comment Length 0000
0065 0002 00 00 Disk Start 0000
0067 0002 00 00 Int File Attributes 0000
[Bit 0] 0 'Binary Data'
0069 0004 00 00 A4 81 Ext File Attributes 81A40000
006D 0004 00 00 00 00 Local Header Offset 00000000
0071 0009 74 6D 70 2F Filename 'tmp/PĀé'
50 C4 80 C3
A9
007A 0002 55 54 Extra ID #0001 5455 'UT: Extended Timestamp'
007C 0002 05 00 Length 0005
007E 0001 03 Flags '03 mod access'
007F 0004 BF 8E D5 4D Mod Time 4DD58EBF 'Thu May 19 22:42:23 2011'
0083 0002 75 78 Extra ID #0002 7875 'ux: Unix Extra Type 3'
0085 0002 0B 00 Length 000B
0087 0001 01 Version 01
0088 0001 04 UID Size 04
0089 0004 E8 03 00 00 UID 000003E8
008D 0001 04 GID Size 04
008E 0004 E8 03 00 00 GID 000003E8
0092 0004 50 4B 05 06 END CENTRAL HEADER 06054B50
0096 0002 00 00 Number of this disk 0000
0098 0002 00 00 Central Dir Disk no 0000
009A 0002 01 00 Entries in this disk 0001
009C 0002 01 00 Total Entries 0001
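With the raw bytes and the flag in hand, choosing how to decode a name could look something like the sketch below. decode_member_name is a made-up helper, and the fallback code page is an assumption the caller has to supply (CP932 here is just an example), since a zip without bit 11 does not say which code page was used.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Decode a raw member name: UTF-8 when bit 11 is set, otherwise a
# caller-supplied code page (a guess the caller has to get right).
sub decode_member_name {
    my ($raw, $flags, $fallback) = @_;
    my $encoding = ($flags & 0x0800) ? 'UTF-8' : $fallback;
    return decode($encoding, $raw);
}

# Filename bytes from the hex dump above: 74 6D 70 2F 50 C4 80 C3 A9
my $raw  = pack 'H*', '746D702F50C480C3A9';
my $name = decode_member_name($raw, 0x0800, 'cp932');
printf "%d bytes -> %d characters\n", length $raw, length $name;
```

The decoded name is a Perl character string; on Windows it would still need something like Win32::Unicode to reach the filesystem intact.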