in reply to UCS2 Internationalization file parsing

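Take a look at the Unicode::String module on CPAN, specifically at the byteswap2 function. While reading the file, if the BOM shows that the data is not in network (big-endian) order, byteswap each line, then decode it into a Perl character string. Reverse the process as you write. Incidentally, UCS-2 is essentially UTF-16 without surrogate pairs, so for characters in the Basic Multilingual Plane the two are byte-for-byte identical.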
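To make the swap-then-decode idea concrete, here is a minimal in-memory sketch of the round trip. It is only an illustration: it uses core pack/unpack as a stand-in for Unicode::String's byteswap2, and the helper names (swap_pairs, ucs2_to_text, text_to_ucs2le) are made up for this example.

```perl
use strict;
use warnings;
use Encode qw/encode decode/;

# Swap each pair of bytes: read 16-bit little-endian, write 16-bit big-endian.
# (Stand-in for Unicode::String's byteswap2, using core pack/unpack.)
sub swap_pairs { return pack 'n*', unpack 'v*', $_[0] }

# Hypothetical helper: bytes with a leading BOM in, Perl character string out.
sub ucs2_to_text {
    my ($bytes) = @_;
    my $bom = substr $bytes, 0, 2, '';    # remove the BOM and keep it
    if    ( $bom eq "\xFF\xFE" ) { $bytes = swap_pairs($bytes) }    # little-endian
    elsif ( $bom ne "\xFE\xFF" ) { die "No BOM found.\n" }
    return decode( 'UTF-16BE', $bytes );
}

# Hypothetical helper for the reverse trip: writes little-endian with a BOM.
sub text_to_ucs2le {
    my ($text) = @_;
    return "\xFF\xFE" . swap_pairs( encode( 'UTF-16BE', $text ) );
}

my $text  = "caf\x{E9} \x{263A}";
my $bytes = text_to_ucs2le($text);
print "round trip ok\n" if ucs2_to_text($bytes) eq $text;
```

Decoding with 'UTF-16BE' rather than 'UTF-16' matters here: once the BOM has been stripped, the byte order has to be named explicitly.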

Here's a small demo script.

use warnings;
use strict;
use Unicode::String qw/byteswap2/;
use Encode qw/decode/;

load('Language.msg');

sub load {
    my $file = shift;

    # Read raw bytes; an encoding layer here would decode the data twice.
    open my $FH, '<:raw', $file or die "$file: $!\n";

    # The first two bytes are the BOM, which gives the byte order.
    my $bom = '';
    read( $FH, $bom, 2 );
    my $swapbyteorder;
    if    ( $bom eq "\xFF\xFE" ) { $swapbyteorder = 1; }    # little-endian
    elsif ( $bom eq "\xFE\xFF" ) { $swapbyteorder = 0; }    # big-endian
    else                         { die "No BOM found.\n"; }

    # A newline is two bytes, in the same byte order as the rest of the file.
    local $/ = $swapbyteorder ? "\x0A\x00" : "\x00\x0A";
    while ( my $line = <$FH> ) {
        chomp $line;
        byteswap2($line) if $swapbyteorder;    # now big-endian
        $line = decode( 'UTF-16BE', $line );
        # do whatever with $line, now a Perl character string
    }
    close $FH;
}
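If installing Unicode::String turns out to be a problem, a different technique is to let Perl's own UTF-16 encoding layer handle the BOM and the byte order. This is a sketch of that alternative, not the approach of the demo above. It writes a small sample file first so it can run on its own; the sample contents are invented for the example.

```perl
use strict;
use warnings;
use File::Temp qw/tempfile/;

# Build a small little-endian sample file so the sketch is self-contained.
my ( $out, $file ) = tempfile( UNLINK => 1 );
binmode $out, ':raw';
print {$out} "\xFF\xFE", "H\x00i\x00\x0A\x00", "\xE9\x00\x0A\x00";
close $out or die "$file: $!\n";

# The UTF-16 layer reads the BOM, picks the byte order, and discards the BOM.
# ':raw' first keeps any platform newline translation away from the raw bytes.
open my $in, '<:raw:encoding(UTF-16)', $file or die "$file: $!\n";
my @lines;
while ( my $line = <$in> ) {
    $line =~ s/\r?\n\z//;    # strip LF or CRLF line endings
    push @lines, $line;
}
close $in;

printf "%d lines, first is '%s'\n", scalar @lines, $lines[0];
```

Because the layer splits lines after decoding, it also avoids a corner case of splitting on a two-byte separator in raw data, where the separator bytes can straddle two adjacent characters.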

Re^2: UCS2 Internationalization file parsing
by sfinster (Acolyte) on Mar 27, 2006 at 17:05 UTC
    Uncle! I downloaded the Unicode::String tar/zip, but I can't get it working.

    I believed I had ActivePerl installed on Windows, but "ppm" gives me a 'not found' error.

    I tried the perl -MCPAN -e method, but I'm not sure what to feed it. Unicode::String? Some paths?

    Thanks.

      If you are using ActivePerl and ppm isn't working, you probably have a broken install. I would first suggest reinstalling ActivePerl. (You may also have some other distribution of Perl earlier in your PATH. Try running

      perl -v
      at the command line and see if it mentions ActiveState.)

      You could look at doing a CPAN install, but you will need a C compiler on your system to do so. Unicode::String isn't a pure-Perl module, so you can't (easily) just do a manual install and expect it to work. Your best bet is to get ppm working.
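For reference, the usual form of the CPAN one-liner is perl -MCPAN -e "install Unicode::String", with the compiler caveat above still applying. Whichever route you take, a quick check like the following sketch can confirm which perl is actually being run and whether the module loads. The helper name module_available is made up for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the named module can be loaded by this perl.
sub module_available {
    my ($module) = @_;
    return eval("require $module; 1") ? 1 : 0;
}

# $^X is the path of the perl binary running this script, which helps spot
# a second Perl installation sitting earlier in the PATH.
print "Running perl $] from $^X\n";
print "Unicode::String is ",
    ( module_available('Unicode::String') ? '' : 'not ' ), "available\n";
```

If the path printed is not the ActivePerl directory, that points to the PATH problem rather than a broken ppm.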