Dear Monks,

My company is working on internationalization. All our C++ programs have been converted. The Perl scripts remain. They run on multiple platforms (Solaris, Linux, Windows). Each platform has at least Perl 5.8.0. (Windows has 5.8.3.)

I need to be able to read a variable-length record file of messages in ucs2. The first line contains a BOM and some header information. Because the messages themselves can contain newlines, the file is delimited with an ASCII 6 ('ack').

My goal is to open the file and parse the messages into a hash. I know to use a local $/ to redefine my record delimiter.
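For reference, a minimal sketch of that reading loop, assuming the file is UTF-16 with a BOM and that a reasonably current Encode is installed (very early 5.8 releases may behave differently). The sample data, the temp file, and the read_records name are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempfile);

# Build a small sample file (hypothetical content): BOM + header, then
# ACK-delimited records, written little-endian like the file described.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "\xFF\xFE",
    encode('UTF-16LE', "header\x{06}first\nmessage\x{06}second\x{06}");
close $tmp or die "close failed: $!";

sub read_records {
    my ($file) = @_;
    # 'UTF-16' without an LE/BE suffix lets Encode consume the BOM
    # and pick the byte order from it.
    open my $fh, '<:raw:encoding(UTF-16)', $file
        or die "Cannot open $file: $!";
    local $/ = "\x{06}";    # ACK as a decoded character, not raw bytes
    my @records;
    while (defined(my $rec = <$fh>)) {
        chomp $rec;         # chomp strips the current $/, i.e. the ACK
        push @records, $rec;
    }
    close $fh;
    return @records;
}

my @records = read_records($path);
print scalar(@records), " records read\n";
```

Because the layer decodes before readline sees the data, $/ is given as a character here rather than as a two-byte sequence.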

I'm having a number of problems. First, the BOM in my file is 'FFFE', while Windows is expecting 'FEFF'. I know that Windows can handle this. I can open the message file in Wordpad and it's legible. However, this doesn't appear to be automatic in Perl. Am I going to need to swap the bytes on every character I read in?
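As far as I understand it, the bytes FF FE at the start of a file are the little-endian serialization of the character U+FEFF, so no manual swapping should be needed if the decoder is allowed to read the BOM. A small sketch of that idea, assuming Encode's 'UTF-16' encoding; the byte_order helper and the sample bytes are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The same two characters, 'H' and 'i', serialized both ways with a BOM.
my $le_bytes = "\xFF\xFE" . "H\x00i\x00";   # FF FE => little-endian
my $be_bytes = "\xFE\xFF" . "\x00H\x00i";   # FE FF => big-endian

# 'UTF-16' reads the BOM, picks the byte order, and drops the BOM.
my $from_le = decode('UTF-16', $le_bytes);
my $from_be = decode('UTF-16', $be_bytes);

# Hypothetical helper: report byte order from the first two raw bytes.
sub byte_order {
    my ($bytes) = @_;
    my $bom = substr($bytes, 0, 2);
    return $bom eq "\xFF\xFE" ? 'little-endian'
         : $bom eq "\xFE\xFF" ? 'big-endian'
         :                      'unknown';
}

print byte_order($le_bytes), ": $from_le\n";
print byte_order($be_bytes), ": $from_be\n";
```

Both inputs decode to the same two-character string; only the raw bytes differ.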

I keep reading that "all you have to do" is use an encoding on a file when you open it. I can't seem to examine what I read in, much less parse it.
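One way to examine what was read, sketched under the assumption that the string has already been decoded: print each character's code point. The codepoints helper is a made-up name.

```perl
use strict;
use warnings;

# Hypothetical debugging helper: show each character as U+XXXX.
sub codepoints {
    my ($str) = @_;
    return join ' ', map { sprintf 'U+%04X', ord } split //, $str;
}

print codepoints("Hi\x{263A}"), "\n";   # prints: U+0048 U+0069 U+263A
```

If the byte order were wrong, an 'H' would show up as U+4800 rather than U+0048, which makes a swapped read easy to spot.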

Has anyone done this? Is there a document or web site that lays out what I need to do? All I can find (perluniintro, perlunicode, web) are examples of how to jam a special character into a string using \x{xxxx} format. I really don't want to look up every string in the hash by building large "\x{xxxx}\x{xxxx}\x{xxxx}etc." strings that are illegible to the human eye.
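A sketch of why the long "\x{xxxx}" strings may not be necessary: once the bytes are decoded, the string holds characters, so an ordinary literal should compare equal and work as a hash key. The sample key and table contents are invented for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Simulate the bytes as they would sit in a UTF-16LE file (sample data).
my $raw = encode('UTF-16LE', 'lmArg1');

# After decoding, the string holds characters, not byte pairs.
my $text = decode('UTF-16LE', $raw);

my %table = ($text => 'some message');   # decoded string as a hash key

print "match\n"     if $text eq 'lmArg1';        # plain literal compares equal
print "lookup ok\n" if exists $table{'lmArg1'};  # plain literal finds the key
```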

A message line looks like:

The message may contain values that will be filled in at runtime. These appear in the text like this: <<< 1 >>>. If it does, then the formats for those arguments will look something like this: {{{ "%s", 2, %d, 1 }}}

That is, a set of (format, placement) pairs. I need to parse out the pairs, then replace each <<< x >>> in the string with its corresponding format.
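The parsing step just described could be sketched roughly as follows, assuming the spec always has the {{{ format, placement, ... }}} shape shown above and that formats contain no spaces or commas. The function names and the sample message are hypothetical.

```perl
use strict;
use warnings;

# Parse a spec such as {{{ "%s", 2, %d, 1 }}} into { placement => format }.
sub parse_formats {
    my ($spec) = @_;
    my %format_for;
    my ($inner) = $spec =~ /\{\{\{\s*(.*?)\s*\}\}\}/s;
    return \%format_for unless defined $inner;
    # Each pair: an optionally quoted format, a comma, a placement number.
    while ($inner =~ /"?(%[^",\s]+)"?\s*,\s*(\d+)/g) {
        $format_for{$2} = $1;
    }
    return \%format_for;
}

# Replace each <<< n >>> with its format; unknown placeholders stay as-is.
sub apply_formats {
    my ($message, $format_for) = @_;
    $message =~ s{(<<<\s*(\d+)\s*>>>)}{ exists $format_for->{$2} ? $format_for->{$2} : $1 }ge;
    return $message;
}

my $formats = parse_formats('{{{ "%s", 2, %d, 1 }}}');
my $result  = apply_formats('Copied <<< 1 >>> files to <<< 2 >>>', $formats);
print "$result\n";   # prints: Copied %d files to %s
```

Keying the formats by placement number keeps the substitution independent of the order in which the pairs appear in the spec.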

I'll attach my pitiful attempts at getting this going.

#!/usr/local/bin/perl
#
# Module: LocMsgTable
#
# Purpose: Build a table of messages from a Collage internationalized string file.
#

use strict;
use locale;
use POSIX 'locale_h';
use Encode qw/encode decode/; 

package LocMsgTable;

use constant ARG_PREFIX_BE => "\x{6C00}\x{6D00}\x{4100}\x{7200}\x{6700}"; # 'lmArg'
use constant DELIMITER_BE  => "\x{2200}";                 # double quote

use constant ARG_PREFIX_LE => "\x{006C}\x{006D}\x{0041}\x{0072}\x{0067}"; # 'lmArg'
use constant DELIMITER_LE  => "\x{0022}";                 # double quote

# Double quotes so that \x{...} is interpolated into a single character
use constant WINDOWS_BOM   => "\x{FEFF}";


sub argPrefix
{
   my ($this) = @_;

   if ($this->{'bigendian'} == 1)
   {
      return ARG_PREFIX_BE;
   }
   else
   {
      return ARG_PREFIX_LE;
   }
}


sub delimiter
{
   my ($this) = @_;

   if ($this->{'bigendian'} == 1)
   {
      return DELIMITER_BE;
   }
   else
   {
      return DELIMITER_LE;
   }
}


sub new
{
   my ($class, $bigendian, $language, %table) = @_;

   $bigendian = 1;
   $language = undef;
   %table = ('Invalid', '--- message not found, or invalid ---');

   my $obj = bless {
      'bigendian' => $bigendian,
      'language' => $language,
      'table' => \%table,
   }, $class;
}

sub LoadTable
{
   my ($this, $language) = @_;
   my $line;
   my $FH;
   my $prelim, my $id, my $spacer, my $remainder;

   $this->{'language'} = $language;

   binmode (STDOUT, ":encoding(ucs2)");
   binmode (STDIN, ":encoding(ucs2)");
   binmode (STDERR, ":encoding(ucs2)");

   local $/ = "\x{0600}";
   open( $FH, '<:encoding(ucs2)', 'Language.msg')
      or die "Cannot open Language.msg: $!";

   $line = <$FH>;
   chomp $line;
   if (substr($line, 0, 1) eq WINDOWS_BOM)
   {
      $this->{'bigendian'} = 0;
   }
   my $delim = $this->delimiter();

   # Parse each line
   while ($line = <$FH>)
   {
      chomp $line;
      ($prelim, $id, $spacer, $remainder) = split(/$delim/, $line, 4); # not completely parsed for now
      $this->{'table'}{$id} = $remainder;
   }

}

#
#  Get a "Message lmArg1" style entry from the hash, substitute args, and return the string
#
sub CreateString
{
   my ($this, $strId, @args) = @_;
   my $arg;
   my $argCount = 1;
   my $argTag;

   my $retVal = $this->{'table'}{$strId};

   foreach $arg (@args)
   {
      $argTag = "<<< $argCount >>>";
      $retVal =~ s/$argTag/$args[$argCount-1]/g;
      $argCount++;
   }

   return $retVal;
}

1;

In reply to UCS2 Internationalization file parsing by sfinster
