sfinster has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

My company is working on internationalization. All our C++ programs have been converted. The Perl scripts remain. They run on multiple platforms (Solaris, Linux, Windows). Each platform has at least Perl 5.8.0. (Windows has 5.8.3.)

I need to be able to read a variable-length record file of messages in ucs2. The first line contains a BOM and some header information. Because the messages themselves can contain newlines, the file is delimited with an ASCII 6 ('ack').

My goal is to open the file and parse the messages into a hash. I know to use a local $/ to redefine my line delimiter.

I'm having a number of problems. First, the BOM in my file is 'FFFE', while Windows is expecting 'FEFF'. I know that Windows can handle this. I can open the message file in Wordpad and it's legible. However, this doesn't appear to be automatic in Perl. Am I going to need to swap the bytes on every character I read in?

I keep reading that "all you have to do" is use an encoding on a file when you open it. I can't seem to examine what I read in, much less parse it.

Has anyone done this? Is there a document or web site that lays out what I need to do? All I can find (perluniintro, perlunicode, web) are examples of how to jam a special character into a string using \x{xxxx} format. I really don't want to look up every string in the hash by building large "\x{xxxx}\x{xxxx}\x{xxxx}etc." strings that are illegible to the human eye.

A message line looks like:

<Integer ID> <integer flag> <integer flag> <Text ID in double quotes> <message text in double quotes> <argument formats> <comment in double quotes>

The message may contain values that will be filled in at runtime. These appear in the text like this: <<< 1 >>>. If it does, then the formats for those arguments will look something like this: {{{ "%s", 2, %d, 1 }}}

That is, a set of format, placement pairs. I need to parse out the pairs, then replace the <<< x >>> in the string with its corresponding format.
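A rough sketch of that substitution step follows. The helper names parse_arg_formats and expand_placeholders are made up for illustration. It assumes the record has already been decoded into an ordinary Perl character string, and that no format contains a comma.

```perl
use strict;
use warnings;

# Hypothetical helper: turn '{{{ "%s", 2, %d, 1 }}}' into { 2 => '%s', 1 => '%d' }.
sub parse_arg_formats {
    my ($spec) = @_;
    my ($inner) = $spec =~ /\{\{\{\s*(.*?)\s*\}\}\}/s or return {};

    # Split on commas, then trim whitespace and optional double quotes.
    my @fields = map { s/^\s*"?//; s/"?\s*$//; $_ } split /,/, $inner;

    my %format_for;
    while ( @fields >= 2 ) {
        my ( $format, $position ) = splice @fields, 0, 2;
        $format_for{$position} = $format;
    }
    return \%format_for;
}

# Hypothetical helper: replace each <<< N >>> with the format registered for N.
# Placeholders without a matching format are left untouched.
sub expand_placeholders {
    my ( $text, $format_for ) = @_;
    $text =~ s{(<<<\s*(\d+)\s*>>>)}
              { exists $format_for->{$2} ? $format_for->{$2} : $1 }gex;
    return $text;
}

my $formats = parse_arg_formats('{{{ "%s", 2, %d, 1 }}}');
my $message = expand_placeholders( 'Copied <<< 1 >>> files to <<< 2 >>>', $formats );
print "$message\n";
```

Keying the hash by placement number means the pairs can appear in any order in the argument list, as they do in the example above.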

I'll attach my pitiful attempts at getting this going.
#!/usr/local/bin/perl
#
# Module: LocMsgTable
#
# Purpose: Build a table of messages from a Collage internationalized string file.
#
use strict;
use locale;
use POSIX 'locale_h';
use Encode qw/encode decode/;

package LocMsgTable;

use constant ARG_PREFIX_BE => "\x{6C00}\x{6D00}\x{4100}\x{7200}\x{6700}"; # 'lmArg';
use constant DELIMITER_BE  => "\x{2200}";                                 # double quote
use constant ARG_PREFIX_LE => "\x{006C}\x{006D}\x{0041}\x{0072}\x{0067}"; # 'lmArg';
use constant DELIMITER_LE  => '\x{0022}';                                 # double quote
use constant WINDOWS_BOM   => '\x{FEFF}';

sub argPrefix {
    my ($this) = @_;
    if ($this->{'bigendian'} == 1) {
        return ARG_PREFIX_BE;
    }
    else {
        return ARG_PREFIX_LE;
    }
}

sub delimiter {
    my ($this) = @_;
    if ($this->{'bigendian'} == 1) {
        return DELIMITER_BE;
    }
    else {
        return DELIMITER_LE;
    }
}

sub new {
    my ($class, $bigendian, $language, %table) = @_;
    $bigendian = 1;
    $language = undef;
    %table = ('Invalid', '--- message not found, or invalid ---');
    my $obj = bless {
        'bigendian' => $bigendian,
        'language'  => $language,
        'table'     => \%table,
    }, $class;
}

sub LoadTable {
    my ($this, $language) = @_;
    my $line;
    my $FH;
    my $prelim, my $id, my $spacer, my $remainder;
    $this->{'language'} = $language;
    binmode (STDOUT, ":encoding(ucs2)");
    binmode (STDIN, ":encoding(ucs2)");
    binmode (STDERR, ":encoding(ucs2)");
    local $/ = "\x{0600}";
    open( $FH, '<:encoding(ucs2)', 'Language.msg');
    $line = <$FH>;
    chomp $line;
    if (substr($line, 0, 1) eq WINDOWS_BOM) {
        $this->{'bigendian'} = 0;
    }
    my $delim = $this->delimiter();

    # Parse each line
    while ($line = <$FH>) {
        chomp $line;
        ($prelim, $id, $spacer, $remainder) = split(/$delim/, $line, 4); # not completely parsed for now
        $this->{'table'}{$id} = $remainder;
    }
}

#
# Get a "Message lmArg1" style entry from the hash, substitute args, and return the string
#
sub CreateString {
    my ($this, $strId, @args) = @_;
    my $arg;
    my $argCount = 1;
    my $argTag;
    my $retVal = $this->{'table'}{$strId};
    foreach $arg (@args) {
        $argTag = "<<< $argCount >>>"
        $retVal =~ s/$argTag/$args[$argCount-1]/g;
        $argCount++;
    }
    return $retVal;
}

1;

Replies are listed 'Best First'.
Re: UCS2 Internationalization file parsing
by thundergnat (Deacon) on Mar 23, 2006 at 18:41 UTC

    Take a look at the Unicode::String module on CPAN, specifically at the byteswap2 function. While reading the file, if your BOM is not in network (big-endian) order, byteswap each line, then decode it to utf-8. Reverse the process as you write. Incidentally, ucs2 is essentially utf-16 without surrogate pairs, so for most text the two are interchangeable.

    Here's a small demo script.

    use warnings;
    use strict;
    use Unicode::String qw/byteswap2/;
    use Encode qw/encode decode/;

    load('Language.msg');

    sub load {
        my $file = shift;
        open my $FH, '<:raw', $file or die "$file: $!\n";

        # Read the first two bytes to find the byte order mark.
        my $char;
        sysread( $FH, $char, 2, 0 );
        my $swapbyteorder;
        if ( $char eq "\x{FF}\x{FE}" ) {
            $swapbyteorder = 1;    # little-endian: swap to network order
        }
        elsif ( $char eq "\x{FE}\x{FF}" ) {
            $swapbyteorder = 0;    # already big-endian
        }
        else {
            die "No BOM found.\n";
        }
        close $FH;

        {
            # The file is read as raw bytes, so the record separator is the
            # two-byte on-disk form of ACK (U+0006) in the file's byte order.
            local $/ = $swapbyteorder ? "\x06\x00" : "\x00\x06";
            open $FH, '<:raw', $file or die "$file: $!\n";
            read( $FH, my $bom, 2 );    # skip the BOM
            while ( my $line = <$FH> ) {
                chomp $line;
                byteswap2($line) if $swapbyteorder;
                $line = decode( 'UTF-16BE', $line );
                # do whatever with line, now decoded to Perl's internal utf-8 form
            }
        }
    }
      Uncle! I downloaded the Unicode::String tar/zip, but I can't get it working.

      I believed I had ActivePerl installed on Windows, but "ppm" gives me a 'not found' error.

      I tried the perl -MCPAN -e method, but I'm not sure what to feed it. Unicode::String? Some paths?

      Thanks.

        If you are using ActivePerl and ppm isn't working, you probably have a broken install. I would first suggest reinstalling ActivePerl. (You may possibly have some other distribution of Perl earlier in your path. Try running

        perl -v
        at the command line and see if it mentions ActiveState.)

        You could look at doing a CPAN install, but you are going to need a C compiler on your system to do so. Unicode::String isn't a pure-Perl module, so you can't (easily) just do a manual install and expect it to work. Your best bet is to get ppm working.

Re: UCS2 Internationalization file parsing
by creamygoodness (Curate) on Mar 24, 2006 at 04:39 UTC

    Take a look at the core Encode and Encode::Supported modules, specifically the encode and decode functions. The PerlIO::encoding module calls those functions internally.

    Am I going to need to swap the bytes on every character I read in?

    No way. Encode is set up to handle both Big-Endian and Little-Endian. You're having a problem with the Byte Order Mark because you specify an encoding of ucs2, which is an alias for the big-endian format UCS-2BE, but your BOM indicates that the data from the stream is in little-endian format. Perl doesn't like that, and it complains.

    Here's a script which illustrates that what you want to do is possible:

    #!/usr/bin/perl
    use strict;
    use warnings;

    binmode( STDOUT, ':utf8' );

    my $big_endian_BOM = pack( 'n', 0xFEFF );
    my $smiley         = pack( 'n', 0x263A );
    my $ack            = pack( 'n', 0x0006 );
    my $newline        = pack( 'n', 0x000A );

    my $data = $big_endian_BOM;
    $data .= "$smiley$newline$ack$smiley$newline$ack";

    open( my $fh, '<:encoding(ucs2)', \$data ) or die $!;

    # The layer decodes to characters, so the separator is the character
    # U+0006, not its two-byte encoded form.
    local $/ = "\x{0006}";
    while (<$fh>) {
        chomp;
        print;
    }

    It prints two smileys, each on its own line. No errors or warnings.
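For a little-endian file like the one in the original question, a variant along the same lines might look like the sketch below. It assumes the BOM-aware UTF-16 encoding (no BE/LE suffix) is acceptable in place of ucs2. The two sample records and the IDs ID_ONE and ID_TWO are invented for illustration, the header record is left out, and the regex does not handle double quotes embedded in the message text.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw/encode/;

binmode( STDOUT, ':utf8' );

my $ack = "\x{0006}";

# Build little-endian sample data: BOM (bytes FF FE), then two
# ACK-terminated records. The first message contains a newline.
my $data = pack( 'v', 0xFEFF )
    . encode( 'UTF-16LE', qq(1 0 0 "ID_ONE" "first\nmessage") . $ack )
    . encode( 'UTF-16LE', qq(2 0 0 "ID_TWO" "second message") . $ack );

# UTF-16 without a BE/LE suffix reads the BOM and picks the byte order from it.
open( my $fh, '<:encoding(UTF-16)', \$data ) or die $!;

my %table;
{
    # The layer hands back characters, so the separator is a character too.
    local $/ = $ack;
    while ( my $record = <$fh> ) {
        chomp $record;

        # Capture the integer ID and the second quoted field (the message text).
        my ( $id, $text ) = $record =~ /^\s*(\d+)\s.*?"[^"]*"\s+"([^"]*)"/s
            or next;
        $table{$id} = $text;
    }
}

print "$_ => $table{$_}\n" for sort keys %table;
```

Once the layer has decoded the data, the keys and values in %table are ordinary Perl strings, so they can be compared against plain literals without spelling out \x{xxxx} escapes.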

    --
    Marvin Humphrey
    Rectangular Research ― http://www.rectangular.com
      If I run your smiley script on my Windows 2000 machine and get garbage (looks like it might be extended ASCII), what do I need to change?

      I have "Central Europe" and "Western Europe and United States" enabled in Control Panel->Regional Options. Locale is English (United States).

      In my script if I try to encode what I've read in (back to original question) I get a "Wide character in input" error. Print doesn't show me the correct data even though I've used binmode to set STDOUT.

      Are there other environment settings I need to be making?