UCS2 Internationalization file parsing

sfinster has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

My company is working on internationalization. All our C++ programs have been converted. The Perl scripts remain. They run on multiple platforms (Solaris, Linux, Windows). Each platform has at least Perl 5.8.0. (Windows has 5.8.3.)

I need to be able to read a variable-length record file of messages in ucs2. The first line contains a BOM and some header information. Because the messages themselves can contain newlines, the file is delimited with an ASCII 6 ('ack').

My goal is to open the file and parse the messages into a hash. I know to use a local $/ to redefine my record separator.

I'm having a number of problems. First, the BOM in my file is 'FFFE', while Windows is expecting 'FEFF'. I know that Windows can handle this. I can open the message file in Wordpad and it's legible. However, this doesn't appear to be automatic in Perl. Am I going to need to swap the bytes on every character I read in?

I keep reading that "all you have to do" is use an encoding on a file when you open it. I can't seem to examine what I read in, much less parse it.

Has anyone done this? Is there a document or web site that lays out what I need to do? All I can find (perluniintro, perlunicode, web) are examples of how to jam a special character into a string using \x{xxxx} format. I really don't want to look up every string in the hash by building large "\x{xxxx}\x{xxxx}\x{xxxx}etc." strings that are illegible to the human eye.

A message line looks like:

The message may contain values that will be filled in at runtime. These appear in the text like this: <<< 1 >>>. If it does, then the formats for those arguments will look something like this: {{{ "%s", 2, %d, 1 }}}

That is, a set of format, placement pairs. I need to parse out the pairs, then replace the <<< x >>> in the string with its corresponding format.
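A minimal sketch of that substitution step, offered as an illustration rather than a tested solution for the real file. It assumes the format spec is a list of format, placement pairs inside {{{ }}}, that the quotes around a format are optional, and that placeholders look like <<< N >>>. The names parse_formats and insert_formats are made up for this example.

```perl
use strict;
use warnings;

# Parse a spec such as {{{ "%s", 2, %d, 1 }}} into a hash that maps
# placement number => printf-style format. Quotes are optional.
sub parse_formats {
    my ($spec) = @_;
    my ($inner) = $spec =~ /\{\{\{\s*(.*?)\s*\}\}\}/s
        or return +{};
    my %format_for;
    while ( $inner =~ /"?(%[^",\s]+)"?\s*,\s*(\d+)/g ) {
        $format_for{$2} = $1;
    }
    return \%format_for;
}

# Replace each <<< N >>> with the format registered for placement N.
# Placeholders without a known format are left untouched.
sub insert_formats {
    my ( $text, $format_for ) = @_;
    $text =~ s{(<<<\s*(\d+)\s*>>>)}{ exists $format_for->{$2} ? $format_for->{$2} : $1 }ge;
    return $text;
}

# Hypothetical message text, for illustration only.
my $formats  = parse_formats('{{{ "%s", 2, %d, 1 }}}');
my $template = insert_formats( 'Copied <<< 1 >>> files to <<< 2 >>>.', $formats );
my $message  = sprintf $template, 3, 'backup';
```

Passing the result to sprintf this way only works while the placeholders appear in the same order as the arguments. If they can be reordered, Perl's explicit-index form (for example %2$s) may be a better fit.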

I'll attach my pitiful attempts at getting this going.

#!/usr/local/bin/perl
#
# Module: LocMsgTable
#
# Purpose: Build a table of messages from a Collage internationalized string file.
#

use strict;
use locale;
use POSIX 'locale_h';
use Encode qw/encode decode/; 

package LocMsgTable;

use constant ARG_PREFIX_BE => "\x{6C00}\x{6D00}\x{4100}\x{7200}\x{6700}"; # 'lmArg';
use constant DELIMITER_BE  => "\x{2200}";                 # double quote

use constant ARG_PREFIX_LE => "\x{006C}\x{006D}\x{0041}\x{0072}\x{0067}"; # 'lmArg';
use constant DELIMITER_LE  => '\x{0022}';                 # double quote

use constant WINDOWS_BOM   => '\x{FEFF}';


sub argPrefix
{
   my ($this) = @_;

   if ($this->{'bigendian'} == 1)
   {
      return ARG_PREFIX_BE;
   }
   else
   {
      return ARG_PREFIX_LE;
   }
}


sub delimiter
{
   my ($this) = @_;

   if ($this->{'bigendian'} == 1)
   {
      return DELIMITER_BE;
   }
   else
   {
      return DELIMITER_LE;
   }
}


sub new
{
   my ($class, $bigendian, $language, %table) = @_;

   $bigendian = 1;
   $language = undef;
   %table = ('Invalid', '--- message not found, or invalid ---');

   my $obj = bless {
      'bigendian' => $bigendian,
      'language' => $language,
      'table' => \%table,
   }, $class;
}

sub LoadTable
{
   my ($this, $language) = @_;
   my $line;
   my $FH;
   my $prelim, my $id, my $spacer, my $remainder;

   $this->{'language'} = $language;

   binmode (STDOUT, ":encoding(ucs2)");
   binmode (STDIN, ":encoding(ucs2)");
   binmode (STDERR, ":encoding(ucs2)");

   local $/ = "\x{0600}";
   open( $FH, '<:encoding(ucs2)', 'Language.msg');

   $line = <$FH>;
   chomp $line;
   if (substr($line, 0, 1) eq WINDOWS_BOM)
   {
      $this->{'bigendian'} = 0;
   }
   my $delim = $this->delimiter();

   # Parse each line
   while ($line = <$FH>)
   {
      chomp $line;
      ($prelim, $id, $spacer, $remainder) = split(/$delim/, $line, 4); # not completely parsed for now
      $this->{'table'}{$id} = $remainder;
   }

}

#
#  Get a "Message lmArg1" style entry from the hash, substitute args, and return the string
#
sub CreateString
{
   my ($this, $strId, @args) = @_;
   my $arg;
   my $argCount = 1;
   my $argTag;

   my $retVal = $this->{'table'}{$strId};

   foreach $arg (@args)
   {
      $argTag = "<<< $argCount >>>";
      $retVal =~ s/$argTag/$args[$argCount-1]/g;
      $argCount++;
   }

   return $retVal;
}

1;

Replies are listed 'Best First'.
Re: UCS2 Internationalization file parsing by thundergnat (Deacon) on Mar 23, 2006 at 18:41 UTC
Take a look at the Unicode::String module on CPAN, specifically at the byteswap2 function. While reading the file, if your BOM is not in network order, byteswap each line, then decode it. Reverse the process as you write. Incidentally, ucs2 is essentially utf-16 restricted to the Basic Multilingual Plane (no surrogate pairs).

Here's a small demo script.

use warnings;
use strict;
use Unicode::String qw/byteswap2/;
use Encode qw/encode decode/;

load('Language.msg');

sub load {
    my $file = shift;

    # Read the first two bytes to find out which byte order the file uses.
    open my $FH, '<:bytes', $file or die "$file: $!\n";
    my $char;
    sysread( $FH, $char, 2, 0 );
    my $swapbyteorder;
    if ( $char eq "\x{FF}\x{FE}" ) {
        $swapbyteorder = 1;    # little-endian BOM
    }
    elsif ( $char eq "\x{FE}\x{FF}" ) {
        $swapbyteorder = 0;    # big-endian (network order) BOM
    }
    else {
        die "No BOM found.\n";
    }
    close $FH;

    {
        # The handle delivers raw bytes, so the ACK record separator is
        # given as the two bytes it occupies in the file's byte order.
        local $/ = $swapbyteorder ? "\x06\x00" : "\x00\x06";
        open $FH, '<:raw', $file or die "$file: $!\n";
        while ( my $line = <$FH> ) {
            chomp $line;
            byteswap2($line) if $swapbyteorder;
            $line = decode( 'UTF-16BE', $line );
            # do whatever with line, now decoded
            # (the first record still starts with the BOM character)
        }
    }
}
Re^2: UCS2 Internationalization file parsing by sfinster (Acolyte) on Mar 27, 2006 at 17:05 UTC
Uncle! I downloaded the Unicode::String tar/zip, but I can't get it working. I believed I had ActivePerl installed on Windows, but "ppm" gives me a 'not found' error. I tried the perl -MCPAN -e method, but I'm not sure what to feed it. Unicode::String? Some paths? Thanks.
Re^3: UCS2 Internationalization file parsing by thundergnat (Deacon) on Mar 28, 2006 at 14:11 UTC
If you are using ActivePerl and ppm isn't working, you probably have a broken install. I would first suggest reinstalling ActivePerl. (You may possibly have some other distribution of Perl earlier in your path. Try running `perl -v` at the command line and see if it mentions ActiveState.) You could look at doing a CPAN install, but you are going to need a C compiler on your system to do so. Unicode::String isn't a pure-Perl module, so you can't (easily) just do a manual install and expect it to work. Your best bet is to get ppm working.
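If installing Unicode::String turns out not to be an option, the same byte swap can probably be done with the core pack and unpack functions. A minimal sketch, assuming the data is a string of raw bytes made of whole 16-bit units; swap_16bit_units is a name made up for this example, and the sample bytes are invented.

```perl
use strict;
use warnings;
use Encode qw/decode/;

# Swap the two bytes of every 16-bit unit using only core functions:
# unpack as little-endian shorts ('v'), repack as big-endian shorts ('n').
sub swap_16bit_units {
    my ($bytes) = @_;
    return pack 'n*', unpack 'v*', $bytes;
}

# Invented sample: little-endian BOM followed by "Hi" in UTF-16LE.
my $little_endian = "\xFF\xFE" . "H\x00" . "i\x00";
my $big_endian    = swap_16bit_units($little_endian);

# Skip the two BOM bytes, then decode the rest as big-endian UTF-16.
my $payload = substr( $big_endian, 2 );
my $text    = decode( 'UTF-16BE', $payload );
```

This avoids the need for a C compiler, since pack, unpack and Encode all ship with Perl 5.8.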
Re: UCS2 Internationalization file parsing by creamygoodness (Curate) on Mar 24, 2006 at 04:39 UTC
Take a look at the core Encode and Encode::Supported modules, specifically the `encode` and `decode` functions. The PerlIO::encoding module calls those functions internally.

"Am I going to need to swap the bytes on every character I read in?"

No way. Encode is set up to handle both Big-Endian and Little-Endian. You're having a problem with the Byte Order Mark because you specify an encoding of `ucs2`, which is an alias for the big-endian format `UCS-2BE`, but your BOM indicates that the data from the stream is in little-endian format. Perl doesn't like that, and it complains.

Here's a script which illustrates that what you want to do is possible:

#!/usr/bin/perl
use strict;
use warnings;

binmode( STDOUT, ':utf8' );

# Build a small big-endian UCS-2 byte stream in memory.
my $big_endian_BOM = pack( 'n', 0xFEFF );
my $smiley         = pack( 'n', 0x263A );
my $ack            = pack( 'n', 0x0006 );
my $newline        = pack( 'n', 0x000A );

my $data = $big_endian_BOM;
$data .= "$smiley$newline$ack$smiley$newline$ack";

open( my $fh, '<:encoding(ucs2)', \$data ) or die $!;

# The separator is matched against the decoded characters,
# so it is given as a character rather than as UCS-2 bytes.
local $/ = "\x{0006}";
while (<$fh>) {
    chomp;
    print;
}

It prints two smileys, each on its own line. No errors or warnings.

-- Marvin Humphrey
Rectangular Research ― http://www.rectangular.com
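Applied to the original question, the same idea might look like the sketch below. It assumes Encode's plain UTF-16 encoding (no BE/LE suffix), which reads the BOM to pick the byte order, and it uses an invented "ID<TAB>text" record layout on in-memory sample data, because the real layout of Language.msg is not shown in this thread.

```perl
use strict;
use warnings;
use Encode qw/encode/;

# Hypothetical sample data: a header record, then "ID<TAB>text" records,
# each terminated by ACK (U+0006). The real file layout may differ.
my $ack   = "\x{0006}";
my $chars = "header$ack" . "MSG1\tHello\nworld$ack" . "MSG2\t\x{263A}$ack";
my $bytes = "\xFF\xFE" . encode( 'UTF-16LE', $chars );    # little-endian BOM + data

# 'UTF-16' without BE/LE uses the BOM to choose the byte order.
open( my $fh, '<:encoding(UTF-16)', \$bytes ) or die "open: $!";

my %table;
{
    local $/ = $ack;        # separator is given in decoded characters
    my $header = <$fh>;     # the first record holds header information
    while ( my $record = <$fh> ) {
        chomp $record;      # strips the trailing ACK
        my ( $id, $text ) = split /\t/, $record, 2;
        $table{$id} = $text;
    }
}
close $fh;
```

For a real file, the in-memory reference \$bytes would be replaced by the file name. Because the messages are stored as ordinary character strings, the hash keys can be written as plain text such as 'MSG1' instead of \x{xxxx} sequences.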
Re^2: UCS2 Internationalization file parsing by sfinster (Acolyte) on Mar 29, 2006 at 16:46 UTC
If I run your smiley script on my Windows 2000 machine and get garbage (looks like it might be extended ASCII), what do I need to change? I have "Central Europe" and "Western Europe and United States" enabled in Control Panel->Regional Options. Locale is English (United States). In my script if I try to encode what I've read in (back to original question) I get a "Wide character in input" error. Print doesn't show me the correct data even though I've used binmode to set STDOUT. Are there other environment settings I need to be making?