in reply to UCS2 Internationalization file parsing
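Take a look at the Unicode::String module on CPAN, specifically at its byteswap routines (byteswap2 handles 16-bit data). While reading the file, check the BOM first. If it shows the data is not in network (big-endian) order, byteswap each line, then decode it into a Perl character string (stored internally as UTF-8). Reverse the process as you write. Incidentally, UCS-2 is essentially UTF-16 without surrogate pairs, so the two are identical for any character in the Basic Multilingual Plane.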
Here's a small demo script.
    use warnings;
    use strict;
    use Unicode::String qw/byteswap2/;
    use Encode qw/decode/;

    load('Language.msg');

    sub load {
        my $file = shift;
        open my $FH, '<:raw', $file or die "$file: $!\n";

        # The first two bytes are the BOM; they give the byte order.
        my $bom = '';
        read( $FH, $bom, 2 );
        my $swapbyteorder;
        if    ( $bom eq "\xFF\xFE" ) { $swapbyteorder = 1; }    # little-endian
        elsif ( $bom eq "\xFE\xFF" ) { $swapbyteorder = 0; }    # network order
        else                         { die "No BOM found.\n"; }

        # Split on a UTF-16 line feed in the file's own byte order.
        # (A raw byte match could in rare cases straddle two characters.)
        local $/ = $swapbyteorder ? "\x0A\x00" : "\x00\x0A";
        while ( my $line = <$FH> ) {
            byteswap2($line) if $swapbyteorder;     # swapped in place
            $line = decode( 'UTF-16BE', $line );    # no BOM left, so name the order
            $line =~ s/\r?\n\z//;
            # do whatever with $line, now a Perl character string
        }
        close $FH;
    }
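The demo only covers reading, so here is a minimal sketch of the reverse direction mentioned above: encode, swap if needed, and put the BOM back. The helper name utf16_bytes is hypothetical, its $little_endian flag plays the role of $swapbyteorder in the reader, and the swap is done with core pack/unpack as a stand-in for byteswap2 so the snippet runs without Unicode::String.

```perl
use strict;
use warnings;
use Encode qw/encode decode/;

# Hypothetical helper: turn a list of lines into UTF-16 bytes with a BOM.
# $little_endian mirrors $swapbyteorder in the reader.
sub utf16_bytes {
    my ( $little_endian, @lines ) = @_;

    # Start with the BOM character (U+FEFF), then append each line.
    my $text = "\x{FEFF}";
    $text .= "$_\n" for @lines;

    # Encode everything in network (big-endian) order first.
    my $bytes = encode( 'UTF-16BE', $text );

    # Swap every byte pair if the target is little-endian.
    # 'n*' unpacks big-endian 16-bit values, 'v*' packs them little-endian.
    $bytes = pack 'v*', unpack 'n*', $bytes if $little_endian;
    return $bytes;
}

my $bytes = utf16_bytes( 1, 'Hello', "caf\x{E9}" );

# To write a file, print $bytes to a handle opened with '>:raw'.
# Round-trip check: Encode's 'UTF-16' decoder reads the BOM and strips it.
my $text = decode( 'UTF-16', $bytes );
print $text eq "Hello\ncaf\x{E9}\n" ? "round trip ok\n" : "mismatch\n";
```

Swapping after encoding keeps the BOM and the text in the same byte order, which is what a reader like the one above relies on.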
Re^2: UCS2 Internationalization file parsing
by sfinster (Acolyte) on Mar 27, 2006 at 17:05 UTC
by thundergnat (Deacon) on Mar 28, 2006 at 14:11 UTC