My company is working on internationalization. All our C++ programs have been converted. The Perl scripts remain. They run on multiple platforms (Solaris, Linux, Windows). Each platform has at least Perl 5.8.0. (Windows has 5.8.3.)
I need to be able to read a variable-length record file of messages in ucs2. The first line contains a BOM and some header information. Because the messages themselves can contain newlines, the file is delimited with an ASCII 6 ('ack').
My goal is to open the file and parse the messages into a hash. I know to use a local $/ to redefine my record delimiter.
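A minimal sketch of the local $/ idea, assuming plain single-byte data for the moment (encoding is a separate problem). The sample data and the helper name read_records are hypothetical; an in-memory filehandle stands in for the real message file.

```perl
use strict;
use warnings;

# Hypothetical sample data: three records separated by ASCII 6 (ACK).
# The second record contains an embedded newline, as real messages may.
my $data = "header\x06first\nmessage\x06second message\x06";

# Read ACK-delimited records from a string via an in-memory filehandle.
sub read_records {
    my ($bytes) = @_;
    open(my $fh, '<', \$bytes) or die "Cannot open in-memory file: $!";
    local $/ = "\x06";      # record separator is ACK instead of newline
    my @records;
    while (my $rec = <$fh>) {
        chomp $rec;         # chomp strips whatever $/ currently holds
        push @records, $rec;
    }
    close $fh;
    return @records;
}

my @records = read_records($data);
print scalar(@records), " records read\n";
```

Because $/ is localized inside the sub, the rest of the program keeps reading newline-terminated lines as usual.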
I'm having a number of problems. First, the BOM in my file is 'FFFE', while Windows is expecting 'FEFF'. I know that Windows can handle this. I can open the message file in Wordpad and it's legible. However, this doesn't appear to be automatic in Perl. Am I going to need to swap the bytes on every character I read in?
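One thing that may be worth testing here: the Encode module's 'UTF-16' decoder reads the BOM and picks the byte order itself, so no manual swapping should be needed. This is only a sketch, assuming the file really starts with a BOM; the sample byte strings are made up.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical sample: the text "Hi" stored little-endian.
# The BOM character U+FEFF shows up on disk as the bytes FF FE.
my $le_bytes = "\xFF\xFE" . "H\x00i\x00";

# The same text stored big-endian: the BOM bytes are FE FF.
my $be_bytes = "\xFE\xFF" . "\x00H\x00i";

# 'UTF-16' (no LE/BE suffix) requires a BOM, uses it to choose the
# byte order, and strips it from the decoded result.
my $from_le = decode('UTF-16', $le_bytes);
my $from_be = decode('UTF-16', $be_bytes);

print "little-endian decoded to: $from_le\n";
print "big-endian decoded to:    $from_be\n";
```

If that holds for the real file, the same name can presumably be used as an I/O layer, as in open(my $fh, '<:encoding(UTF-16)', $path), instead of decoding by hand.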
I keep reading that "all you have to do" is use an encoding on a file when you open it. I can't seem to examine what I read in, much less parse it.
Has anyone done this? Is there a document or web site that lays out what I need to do? All I can find (perluniintro, perlunicode, web) are examples of how to jam a special character into a string using \x{xxxx} format. I really don't want to look up every string in the hash by building large "\x{xxxx}\x{xxxx}\x{xxxx}etc." strings that are illegible to the human eye.
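As far as I understand it, once the bytes have been decoded into Perl character strings, ordinary string literals should match them, so the \x{xxxx} spelling is not needed for ASCII-range text. A small sketch under that assumption; the text ID 'lmArg1' and the table contents are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical sample: the text ID 'lmArg1' as it would sit in a
# little-endian file (two bytes per character, no BOM here).
my $raw = encode('UTF-16LE', 'lmArg1');

# Decode the raw bytes back into a Perl character string.
my $text_id = decode('UTF-16LE', $raw);

# The decoded string works as a plain hash key ...
my %table = ($text_id => 'some message');
my $found = exists $table{'lmArg1'} ? 1 : 0;

# ... and matches ordinary literals in a regex.
my $is_prefix = ($text_id =~ /^lmArg/) ? 1 : 0;

print "found=$found is_prefix=$is_prefix\n";
```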
Each record in the file has this layout:
<Integer ID> <integer flag> <integer flag> <Text ID in double quotes> <message text in double quotes> <argument formats> <comment in double quotes>
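A rough sketch of splitting one already-decoded record into its fields with a single regex. The sample record is invented, and the pattern assumes that quoted fields never contain an embedded double quote and that the argument-format block is optional.

```perl
use strict;
use warnings;

# Hypothetical record, already decoded to a Perl character string.
my $record = '1001 0 1 "lmArg1" "Hello <<< 1 >>>" {{{ "%s", 1 }}} "greeting"';

my ($id, $flag1, $flag2, $text_id, $message, $formats, $comment) =
    $record =~ m/
        ^\s* (\d+) \s+ (\d+) \s+ (\d+)      # integer ID and two integer flags
        \s+ "([^"]*)"                       # text ID
        \s+ "([^"]*)"                       # message text (may span lines)
        \s* (?: \{\{\{ (.*?) \}\}\} )?      # optional argument formats
        \s* "([^"]*)"                       # comment
        \s* \z
    /xs;

die "record did not match the assumed layout\n" unless defined $id;

print "id=$id text_id=$text_id comment=$comment\n";
```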
The message may contain values that will be filled in at runtime. These appear in the text like this: <<< 1 >>>. If it does, then the formats for those arguments will look something like this: {{{ "%s", 2, %d, 1 }}}
That is, a set of format, placement pairs. I need to parse out the pairs, then replace the <<< x >>> in the string with its corresponding format.
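The pair parsing and substitution could look roughly like this. It is a sketch only: the sample message is invented, the helper names parse_format_pairs and apply_formats are hypothetical, and it assumes that a format never contains a comma and that the quotes around a format are optional.

```perl
use strict;
use warnings;

# Hypothetical message and format list in the shapes described above.
my $message = 'Copied <<< 1 >>> files to <<< 2 >>>';
my $formats = '{{{ "%s", 2, %d, 1 }}}';

# Turn '{{{ fmt, pos, fmt, pos }}}' into a hash of position => format.
sub parse_format_pairs {
    my ($spec) = @_;
    my %format_for;
    my ($inner) = $spec =~ /\{\{\{(.*?)\}\}\}/s;
    return %format_for unless defined $inner;

    # Split on commas, then trim whitespace and optional double quotes.
    my @items = map { s/^\s*"?//; s/"?\s*$//; $_ } split /,/, $inner;

    # Consume the list two items at a time: (format, placement).
    while (@items >= 2) {
        my ($format, $position) = splice(@items, 0, 2);
        $format_for{$position} = $format;
    }
    return %format_for;
}

# Replace each <<< n >>> with its format; unknown placeholders stay as-is.
sub apply_formats {
    my ($text, %format_for) = @_;
    $text =~ s/(<<<\s*(\d+)\s*>>>)/exists $format_for{$2} ? $format_for{$2} : $1/ge;
    return $text;
}

my %format_for = parse_format_pairs($formats);
my $result     = apply_formats($message, %format_for);
print "$result\n";
```

The result is a printf-style template, so the runtime arguments can then be supplied with sprintf in placeholder order.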
#!/usr/local/bin/perl
#
# Module: LocMsgTable
#
# Purpose: Build a table of messages from a Collage internationalized string file.
#
use strict;
use locale;
use POSIX 'locale_h';
use Encode qw/encode decode/;

package LocMsgTable;

use constant ARG_PREFIX_BE => "\x{6C00}\x{6D00}\x{4100}\x{7200}\x{6700}"; # 'lmArg';
use constant DELIMITER_BE  => "\x{2200}";  # double quote
use constant ARG_PREFIX_LE => "\x{006C}\x{006D}\x{0041}\x{0072}\x{0067}"; # 'lmArg';
use constant DELIMITER_LE  => '\x{0022}';  # double quote
use constant WINDOWS_BOM   => '\x{FEFF}';

sub argPrefix {
    my ($this) = @_;
    if ($this->{'bigendian'} == 1) {
        return ARG_PREFIX_BE;
    }
    else {
        return ARG_PREFIX_LE;
    }
}

sub delimiter {
    my ($this) = @_;
    if ($this->{'bigendian'} == 1) {
        return DELIMITER_BE;
    }
    else {
        return DELIMITER_LE;
    }
}

sub new {
    my ($class, $bigendian, $language, %table) = @_;
    $bigendian = 1;
    $language = undef;
    %table = ('Invalid', '--- message not found, or invalid ---');
    my $obj = bless {
        'bigendian' => $bigendian,
        'language'  => $language,
        'table'     => \%table,
    }, $class;
}

sub LoadTable {
    my ($this, $language) = @_;
    my $line;
    my $FH;
    my $prelim, my $id, my $spacer, my $remainder;
    $this->{'language'} = $language;
    binmode (STDOUT, ":encoding(ucs2)");
    binmode (STDIN, ":encoding(ucs2)");
    binmode (STDERR, ":encoding(ucs2)");
    local $/ = "\x{0600}";
    open( $FH, '<:encoding(ucs2)', 'Language.msg');
    $line = <$FH>;
    chomp $line;
    if (substr($line, 0, 1) eq WINDOWS_BOM) {
        $this->{'bigendian'} = 0;
    }
    my $delim = $this->delimiter();
    # Parse each line
    while ($line = <$FH>) {
        chomp $line;
        ($prelim, $id, $spacer, $remainder) = split(/$delim/, $line, 4); # not completely parsed for now
        $this->{'table'}{$id} = $remainder;
    }
}

#
# Get a "Message lmArg1" style entry from the hash, substitute args, and return the string
#
sub CreateString {
    my ($this, $strId, @args) = @_;
    my $arg;
    my $argCount = 1;
    my $argTag;
    my $retVal = $this->{'table'}{$strId};
    foreach $arg (@args) {
        $argTag = "<<< $argCount >>>";
        $retVal =~ s/$argTag/$args[$argCount-1]/g;
        $argCount++;
    }
    return $retVal;
}

1;
In reply to UCS2 Internationalization file parsing by sfinster