PerlMonks  

Win32 encoding conversion mystery

by mithaldu (Monk)
on Aug 13, 2018 at 10:52 UTC (#1220286=perlquestion)

mithaldu has asked for the wisdom of the Perl Monks concerning the following question:

Edit: vr below has provided the correct answer: https://www.perlmonks.org/?node_id=1220300


I'm on windows and trying to rename some files from Shift_JIS to UTF8, and Perl is giving me the hardest time trying to do this because my system's codepage is Latin1.

I've tried googling various ways around this, but i've reached the point where i've completely lost track of what is going on. So i'm hoping that someone here can give me a more concrete hint.

Below, on the left, is an example of the string as i get it from the Win32::LongPath API, along with its code point values. On the right is the filename string as Windows' own `dir` provides it, also with code points.

It is notable that the visuals of the string on the left are roughly what you get if you view the string on the right with a DOS ASCII font. So it looks like the Win32::LongPath api took the original string, and "upconverted" "invisible" characters to some utf8 equivalent?

What would be the name of this process?
Is there a reverse mapping of it?

Image form of the names: https://i.imgur.com/1tUzUrn.png

Left (Win32::LongPath): {N┴q╠Ŀ⌐2
226 123 226 78 233 9524 196 113 233 9568 196 191 233 8976 233 189 50

Right (`dir`): ƒ{ƒN‚Žq‚Ž‚‚2
131 123 131 78 130 193 142 113 130 204 142 168 130 169 130 171 50

Replies are listed 'Best First'.
Re: Win32 encoding conversion mystery
by vr (Curate) on Aug 13, 2018 at 17:38 UTC
    What would be the name of this process?

    Eh-h... just, 'encoding'?

    Good news, decoding from CP932 (~ shiftjis) goes error-free, bad news, online translations (to English) seem to produce gibberish.

    use strict;
    use warnings;
    use feature 'say';
    use Encode qw/ encode decode /;

    my $a = join '', map chr, qw/
        226 123 226 78 233 9524 196 113 233
        9568 196 191 233 8976 233 189 50
    /;
    my $b = join '', map chr, qw/
        131 123 131 78 130 193 142 113 130
        204 142 168 130 169 130 171 50
    /;

    ############################

    use Test::More;

    is $a, decode( 'cp437', $b, Encode::FB_CROAK | Encode::LEAVE_SRC ), 'from b to a';
    is $b, encode( 'cp437', $a, Encode::FB_CROAK | Encode::LEAVE_SRC ), 'from a to b';
    done_testing;

    ############################

    use Win32;

    say Win32::GetACP;
    say Win32::GetOEMCP;
    say Win32::GetConsoleCP;

    ############################

    use Imager;

    my $text = decode 'cp932', $b, Encode::FB_CROAK | Encode::LEAVE_SRC;

    my $image = Imager-> new( xsize => 300, ysize => 80 );
    Imager::Font-> new( file => 'mona.ttf' )
        -> align(
            string => $text,
            size   => 30,
            color  => 'white',
            x      => $image-> getwidth/2,
            y      => $image-> getheight/2,
            halign => 'center',
            valign => 'center',
            image  => $image,
        );
    $image-> write( file => 'jp.png' );

    ############################

    use Win32::LongPath;

    mkdirL $text;
      Thank you!

      You provided the correct answer, which is that i needed to encode the UTF8 string i got to code page 437, and you provided it in massive detail. This was the exact chain i needed to get useful modern UTF8. :D

      my $safe = Encode::FB_CROAK | Encode::LEAVE_SRC;
      my $text = decode 'cp932', ( encode "cp".Win32::GetOEMCP(), $windows_wide_filename, $safe ), $safe;

        The actual code page you want is probably the one returned by "cp".Win32::GetOEMCP() or "cp".Win32::GetConsoleOutputCP(), not cp437. It's cp850 on my machine, for example. I think it's because I'm using "English (Canada)" instead of "English (US)".
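For readers following along, the two-step chain discussed above can be sketched as a small helper. This is only an illustration under stated assumptions: `recover_filename` is a hypothetical name, not something from the thread or from any module, and `cp437` is passed explicitly so the sketch runs on any platform. On Windows one would presumably pass `"cp" . Win32::GetOEMCP()` instead, as suggested above.

```perl
use strict;
use warnings;
use Encode qw/ encode decode /;

# Hypothetical helper (not from the thread): undo a wrong OEM-code-page
# decoding, then redo the decoding with the encoding the bytes were really in.
sub recover_filename {
    my ( $mangled, $oem_encoding, $real_encoding ) = @_;
    my $safe = Encode::FB_CROAK | Encode::LEAVE_SRC;

    # Step 1: map each character back to the byte it was "upconverted" from.
    my $bytes = encode( $oem_encoding, $mangled, $safe );

    # Step 2: interpret those bytes in their real encoding.
    return decode( $real_encoding, $bytes, $safe );
}

# Code points from the thread, as returned by the wide API.
my $mangled = join '', map chr, qw/
    226 123 226 78 233 9524 196 113 233
    9568 196 191 233 8976 233 189 50
/;

# Assumption: the OEM code page is 437 and the real encoding is cp932.
my $text = recover_filename( $mangled, 'cp437', 'cp932' );

# Print code points rather than the text itself, to stay console-safe.
printf "%d characters: %s\n", length($text),
    join ' ', map { sprintf( 'U+%04X', ord($_) ) } split //, $text;
```

Because both steps use `Encode::FB_CROAK`, a wrong guess for either encoding tends to die loudly instead of silently producing more mojibake.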

Re: Win32 encoding conversion mystery
by ikegami (Patriarch) on Aug 13, 2018 at 11:48 UTC

    The Windows API has two versions of each function that requires text strings: An "(A)NSI" version and a "(W)ide" version.

    The ANSI version expects strings encoded using the Active Code Page. The ACP can be obtained from Win32::GetACP(), and an encoding name suitable for Encode::encode and Encode::decode can be obtained from "cp".Win32::GetACP(). For Western machines, these are 1252 and cp1252 respectively. (Not latin1 as you said.) This is what Perl's native functions use.

    To manipulate any file, you need to use the Wide version. These are made accessible by Win32::Unicode and other modules from the same distro.
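As a rough sketch of the code-page-to-encoding-name step described above: `encoding_for_codepage` is a hypothetical helper, not part of Win32 or Encode. The `Win32` calls are guarded so the snippet also runs off Windows, where it falls back to the value 1252 quoted in this thread.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: turn a Windows code page number into an Encode name,
# returning undef if this Perl's Encode does not know that code page.
sub encoding_for_codepage {
    my ($cp_number) = @_;
    my $enc = Encode::find_encoding("cp$cp_number");
    return $enc ? $enc->name : undef;
}

if ( $^O eq 'MSWin32' ) {
    # Only load Win32 where it exists.
    require Win32;
    printf "ANSI (ACP): %s\n",
        encoding_for_codepage( Win32::GetACP() ) // 'unknown';
    printf "OEM:        %s\n",
        encoding_for_codepage( Win32::GetOEMCP() ) // 'unknown';
}
else {
    # Assumption: off Windows, use the ACP value reported in the thread.
    printf "ANSI (ACP): %s\n", encoding_for_codepage(1252) // 'unknown';
}
```

Checking the name through `Encode::find_encoding` first avoids a later failure inside `encode` or `decode` when a machine reports a code page that Encode does not ship.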

      Based on your recommendation i wrote this:
      use Win32::Unicode;

      my $wdir = Win32::Unicode::Dir->new;
      $wdir->open(@ARGV);
      my ( undef, undef, $file ) = $wdir->fetch;
      print join "\n",
          "length: " . length( $file ) . ", code points:",
          map ord, split //, $file;
      This gives the result below, which is identical to the weirdly "upconverted" unicode above. Without further processing it is useless, as it doesn't map to anything but mojibake on any codepage known to me. (And ACP gives me 1252, sorry for speaking inaccurately about that.)
      d:\>perl filename_check.pl RJ209072
      length: 17, code points:
      226 123 226 78 233 9524 196 113 233 9568 196 191 233 8976 233 189 50
        What is it supposed to give you?
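When it is unclear which code page a string was mangled through, one option is to try a few candidates and keep those that survive a strict round trip. The sketch below assumes the real encoding is already known (cp932 here). `plausible_codepages` and the candidate list are illustrative choices, not anything given in the thread.

```perl
use strict;
use warnings;
use Encode qw/ encode decode /;

# Hypothetical helper: return the candidate code pages through which the
# mangled string can be turned back into bytes that are valid in $target.
sub plausible_codepages {
    my ( $mangled, $target, @candidates ) = @_;
    my $safe = Encode::FB_CROAK | Encode::LEAVE_SRC;
    my @hits;
    for my $cp (@candidates) {
        # Both steps croak on unmappable input, so eval acts as the filter.
        my $ok = eval {
            my $bytes = encode( $cp, $mangled, $safe );
            decode( $target, $bytes, $safe );
            1;
        };
        push @hits, $cp if $ok;
    }
    return @hits;
}

# Code points printed by filename_check.pl in the thread.
my $mangled = join '', map chr, qw/
    226 123 226 78 233 9524 196 113 233
    9568 196 191 233 8976 233 189 50
/;

my @hits = plausible_codepages( $mangled, 'cp932', qw/ cp1252 cp437 cp850 / );
print "candidates: @hits\n";
```

A surviving candidate is only a hint: some byte sequences decode cleanly in more than one encoding, so the result still needs to be checked by eye.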
Re: Win32 encoding conversion mystery
by marto (Cardinal) on Aug 13, 2018 at 10:58 UTC
      On some platforms imgur will redirect to a full page on the first load even if you link directly to the image.
