Re: UCS2 Internationalization file parsing

Take a look at the Unicode::String module on CPAN, specifically at its byteswap2 function. While reading the file, if the BOM is not in network (big-endian) order, byteswap each line, then decode it from UTF-16 into a Perl character string. Reverse the process as you write. Incidentally, UCS-2 is essentially UTF-16 without surrogate pairs, so a UTF-16 decoder will read it fine.

Here's a small demo script.

use warnings;
use strict;
use Unicode::String qw/byteswap2/;
use Encode qw/decode/;

load('Language.msg');

sub load {
    my $file = shift;

    # Read raw bytes; the byte order is handled by hand below.
    open my $FH, '<:raw', $file or die "$file: $!\n";

    # The first two bytes are the BOM. Use read (not sysread) so the
    # buffered line reads below continue from the right position.
    my $char = '';
    read( $FH, $char, 2 );
    my $swapbyteorder;
    if ( $char eq "\xFF\xFE" ) {       # little-endian
        $swapbyteorder = 1;
    }
    elsif ( $char eq "\xFE\xFF" ) {    # big-endian (network order)
        $swapbyteorder = 0;
    }
    else {
        die "No BOM found.\n";
    }

    {
        # Record separator U+0006 as raw bytes, in the file's byte order.
        local $/ = $swapbyteorder ? "\x06\x00" : "\x00\x06";

        while ( my $line = <$FH> ) {
            chomp $line;
            byteswap2($line) if $swapbyteorder;    # now big-endian
            $line = decode( 'UTF-16BE', $line );
            # do whatever with $line, now a decoded Perl character string
        }
    }
    close $FH;
}
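
For the write side ("reverse the process"), here is a rough sketch rather than tested production code. It assumes the same U+0006 record terminator as the reader above, and records_to_utf16 is a made-up helper name. To keep it core-only, a pack/unpack pair stands in for byteswap2; with Unicode::String installed, byteswap2 should do the same job.

```perl
use strict;
use warnings;
use Encode qw/encode/;

# Hypothetical helper: build BOM-prefixed UTF-16 bytes from decoded
# strings, terminating each record with U+0006 (an assumption carried
# over from the reader).
sub records_to_utf16 {
    my ( $little_endian, @records ) = @_;
    my $text = join '', map { "$_\x{0006}" } @records;

    # Encode BOM plus data in network (big-endian) order first.
    my $bytes = encode( 'UTF-16BE', "\x{FEFF}" . $text );

    # Swap each byte pair for little-endian output; this is the
    # core-Perl stand-in for byteswap2.
    $bytes = pack( 'v*', unpack( 'n*', $bytes ) ) if $little_endian;
    return $bytes;
}

# Show the resulting bytes in hex.
my $bytes = records_to_utf16( 1, 'Hi' );
print join( ' ', map { sprintf '%02X', ord } split //, $bytes ), "\n";
# prints: FF FE 48 00 69 00 06 00
```

To write a file, open it with '>:raw' and print the returned bytes unchanged.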

Re^2: UCS2 Internationalization file parsing by sfinster (Acolyte) on Mar 27, 2006 at 17:05 UTC
Uncle! I downloaded the Unicode::String tar/zip, but I can't get it working. I believed I had ActivePerl installed on Windows, but "ppm" gives me a 'not found' error. I tried the perl -MCPAN -e method, but I'm not sure what to feed it. Unicode::String? Some paths? Thanks.
Re^3: UCS2 Internationalization file parsing by thundergnat (Deacon) on Mar 28, 2006 at 14:11 UTC
If you are using ActivePerl and ppm isn't working, you probably have a broken install. I would first suggest reinstalling ActivePerl. (You may have some other distribution of Perl earlier in your path. Try running `perl -v` at the command line and see if it mentions ActiveState.) You could look at doing a CPAN install, but you are going to need a C compiler on your system to do so. Unicode::String isn't a pure-Perl module, so you can't (easily) just do a manual install and expect it to work. Your best bet is to get ppm working.
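
If ppm can't be revived, one possible fallback is to skip Unicode::String entirely and lean on the core Encode module, which picks the byte order from the BOM when asked for plain 'UTF-16'. This is only a sketch: utf16_to_records is a made-up helper name, and the U+0006 record separator is an assumption about the file format.

```perl
use strict;
use warnings;
use Encode qw/decode/;

# Hypothetical helper: decode BOM-prefixed UTF-16 bytes and split the
# result into records on U+0006 (assumed separator).
sub utf16_to_records {
    my ($bytes) = @_;

    # 'UTF-16' with no BE/LE suffix uses the BOM to choose the byte
    # order and consumes it; no manual byteswap is needed.
    my $text = decode( 'UTF-16', $bytes );
    return split /\x{0006}/, $text;
}

# Little-endian sample data: BOM, "Hi", U+0006, "Ok", U+0006.
my @records = utf16_to_records(
    "\xFF\xFE\x48\x00\x69\x00\x06\x00\x4F\x00\x6B\x00\x06\x00");
print "$_\n" for @records;
```

For a real file, open it with '<:raw', slurp the contents into a scalar, and pass that to the helper.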