Dear Monks,

My company is working on internationalization. All our C++ programs have been converted. The Perl scripts remain. They run on multiple platforms (Solaris, Linux, Windows). Each platform has at least Perl 5.8.0. (Windows has 5.8.3.)

I need to be able to read a variable-length record file of messages in ucs2. The first line contains a BOM and some header information. Because the messages themselves can contain newlines, the file is delimited with an ASCII 6 ('ack').

My goal is to open the file and parse the messages into a hash. I know to use a local $/ to redefine my record delimiter.
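
For reference, here is a minimal sketch of the reading step, assuming the file is UTF-16 with a BOM and that a reasonably recent Encode is installed. The sub name read_records is made up for illustration. The :raw layer is there so that no CRLF translation touches the 16-bit code units on Windows; the UTF-16 layer then uses the BOM to pick the byte order.

```perl
use strict;
use warnings;

# Read an ACK-delimited UTF-16 file and return its records as decoded
# character strings. read_records is a hypothetical helper name.
sub read_records {
    my ($path) = @_;

    # :raw first, then the decoding layer, which consumes the BOM.
    open(my $fh, '<:raw:encoding(UTF-16)', $path)
        or die "Cannot open $path: $!";

    # After decoding, the delimiter is the single character U+0006,
    # whatever the byte order was on disk.
    local $/ = "\x{06}";

    my @records;
    while (my $record = <$fh>) {
        chomp $record;            # strips the trailing ACK
        push @records, $record;
    }
    close $fh;
    return @records;
}
```

Called as read_records('Language.msg'), this should return the header record first and then one string per message, with embedded newlines left intact.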

I'm having a number of problems. First, the BOM in my file is 'FFFE', while Windows is expecting 'FEFF'. I know that Windows can handle this. I can open the message file in Wordpad and it's legible. However, this doesn't appear to be automatic in Perl. Am I going to need to swap the bytes on every character I read in?
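
A note on the BOM, as a sketch rather than a definitive answer: the bytes FF FE at the start of a file are the character U+FEFF stored little-endian, which is the usual Windows byte order. As far as I know, the name "ucs2" in Encode is an alias for the big-endian UCS-2BE, which would make every character look byte-swapped. The helper names below (sniff_utf16, decode_utf16) are invented for illustration, and the big-endian fallback for BOM-less data is an assumption.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Decide the byte order from the first two raw bytes.
# Returns the Encode name to use and the data with the BOM removed.
sub sniff_utf16 {
    my ($bytes) = @_;
    my $bom = substr($bytes, 0, 2);
    return ('UTF-16LE', substr($bytes, 2)) if $bom eq "\xFF\xFE";
    return ('UTF-16BE', substr($bytes, 2)) if $bom eq "\xFE\xFF";
    return ('UTF-16BE', $bytes);    # assumption: no BOM means big-endian
}

# Decode raw file contents into a Perl character string.
sub decode_utf16 {
    my ($bytes) = @_;
    my ($encoding, $payload) = sniff_utf16($bytes);
    return decode($encoding, $payload);
}
```

With this, no per-character byte swapping is needed; the decoder does it once, based on the BOM.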

I keep reading that "all you have to do" is use an encoding on a file when you open it. I can't seem to examine what I read in, much less parse it.

Has anyone done this? Is there a document or web site that lays out what I need to do? All I can find (perluniintro, perlunicode, web) are examples of how to jam a special character into a string using \x{xxxx} format. I really don't want to look up every string in the hash by building large "\x{xxxx}\x{xxxx}\x{xxxx}etc." strings that are illegible to the human eye.
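
On the readability worry: once the data has been decoded into Perl characters, ordinary ASCII literals should compare and match directly, so the \x{xxxx} spelling is not needed for strings like 'lmArg'. A small sketch, where arg_number is a hypothetical helper and the input is assumed to be raw UTF-16LE bytes without a BOM:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Return the argument number from an 'lmArgN' tag, or undef if the
# text is something else. $bytes is raw UTF-16LE data (no BOM).
sub arg_number {
    my ($bytes) = @_;
    my $text = decode('UTF-16LE', $bytes);    # bytes -> characters

    # A plain ASCII pattern matches the decoded text directly.
    return $text =~ /^lmArg(\d+)$/ ? $1 : undef;
}
```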

A message line looks like:

<Integer ID> <integer flag> <integer flag> <Text ID in double quotes> <message text in double quotes> <argument formats> <comment in double quotes>
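
One way this layout could be pulled apart with a single regex is sketched below. It assumes the record has already been decoded, that quoted fields never contain an embedded double quote, and that the argument-format block is optional while the comment is always present. The sub name parse_message and the hash keys are made up for illustration.

```perl
use strict;
use warnings;

# Split one decoded message record into its fields.
# Returns a hash reference, or nothing if the record does not match.
sub parse_message {
    my ($record) = @_;
    return unless $record =~ m/
        ^\s*
        (\d+) \s+                    # integer ID
        (\d+) \s+                    # first integer flag
        (\d+) \s+                    # second integer flag
        "([^"]*)" \s+                # text ID
        "([^"]*)" \s*                # message text, may span lines
        (\{\{\{.*?\}\}\})? \s*       # optional argument formats
        "([^"]*)"                    # comment
        \s*$
    /xs;
    return {
        id      => $1,
        flag1   => $2,
        flag2   => $3,
        text_id => $4,
        text    => $5,
        formats => $6,               # undef when the message has no arguments
        comment => $7,
    };
}
```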

The message may contain values that will be filled in at runtime. These appear in the text like this: <<< 1 >>>. If it does, then the formats for those arguments will look something like this: {{{ "%s", 2, %d, 1 }}}

That is, a set of format, placement pairs. I need to parse out the pairs, then replace the <<< x >>> in the string with its corresponding format.
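
The pair parsing and substitution could look roughly like this. It is a sketch under the assumption that a format never contains a comma, a double quote, or whitespace; parse_arg_formats and apply_formats are hypothetical names.

```perl
use strict;
use warnings;

# Turn '{{{ "%s", 2, %d, 1 }}}' into a hash ref: placement => format.
sub parse_arg_formats {
    my ($spec) = @_;
    my %format_for;
    return \%format_for unless $spec =~ /\{\{\{(.*?)\}\}\}/s;
    my $inner = $1;

    # Walk the list two items at a time: a format (optionally quoted),
    # then the placement number it belongs to.
    while ($inner =~ /\G\s*"?([^",\s]+)"?\s*,\s*(\d+)\s*,?/g) {
        $format_for{$2} = $1;
    }
    return \%format_for;
}

# Replace each '<<< N >>>' with the format registered for placement N.
# Placeholders without a matching pair are left untouched.
sub apply_formats {
    my ($text, $format_for) = @_;
    $text =~ s{(<<<\s*(\d+)\s*>>>)}{
        exists $format_for->{$2} ? $format_for->{$2} : $1
    }ge;
    return $text;
}
```

For the example above, 'Copied <<< 1 >>> files to <<< 2 >>>' would become 'Copied %d files to %s'.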

I'll attach my pitiful attempts at getting this going.
#!/usr/local/bin/perl
#
# Module: LocMsgTable
#
# Purpose: Build a table of messages from a Collage internationalized string file.
#
use strict;
use locale;
use POSIX 'locale_h';
use Encode qw/encode decode/;

package LocMsgTable;

use constant ARG_PREFIX_BE => "\x{6C00}\x{6D00}\x{4100}\x{7200}\x{6700}"; # 'lmArg';
use constant DELIMITER_BE  => "\x{2200}"; # double quote
use constant ARG_PREFIX_LE => "\x{006C}\x{006D}\x{0041}\x{0072}\x{0067}"; # 'lmArg';
use constant DELIMITER_LE  => '\x{0022}'; # double quote
use constant WINDOWS_BOM   => '\x{FEFF}';

sub argPrefix {
    my ($this) = @_;
    if ($this->{'bigendian'} == 1) {
        return ARG_PREFIX_BE;
    } else {
        return ARG_PREFIX_LE;
    }
}

sub delimiter {
    my ($this) = @_;
    if ($this->{'bigendian'} == 1) {
        return DELIMITER_BE;
    } else {
        return DELIMITER_LE;
    }
}

sub new {
    my ($class, $bigendian, $language, %table) = @_;
    $bigendian = 1;
    $language = undef;
    %table = ('Invalid', '--- message not found, or invalid ---');
    my $obj = bless {
        'bigendian' => $bigendian,
        'language'  => $language,
        'table'     => \%table,
    }, $class;
}

sub LoadTable {
    my ($this, $language) = @_;
    my $line;
    my $FH;
    my $prelim, my $id, my $spacer, my $remainder;

    $this->{'language'} = $language;

    binmode (STDOUT, ":encoding(ucs2)");
    binmode (STDIN, ":encoding(ucs2)");
    binmode (STDERR, ":encoding(ucs2)");

    local $/ = "\x{0600}";
    open( $FH, '<:encoding(ucs2)', 'Language.msg');

    $line = <$FH>;
    chomp $line;
    if (substr($line, 0, 1) eq WINDOWS_BOM) {
        $this->{'bigendian'} = 0;
    }
    my $delim = $this->delimiter();

    # Parse each line
    while ($line = <$FH>) {
        chomp $line;
        ($prelim, $id, $spacer, $remainder) = split(/$delim/, $line, 4); # not completely parsed for now
        $this->{'table'}{$id} = $remainder;
    }
}

#
# Get a "Message lmArg1" style entry from the hash, substitute args, and return the string
#
sub CreateString {
    my ($this, $strId, @args) = @_;
    my $arg;
    my $argCount = 1;
    my $argTag;
    my $retVal = $this->{'table'}{$strId};

    foreach $arg (@args) {
        $argTag = "<<< $argCount >>>";
        $retVal =~ s/$argTag/$args[$argCount-1]/g;
        $argCount++;
    }
    return $retVal;
}

1;

In reply to UCS2 Internationalization file parsing by sfinster
