PerlMonks  

unpacking mixed ascii & utf16 null termed strings

by patcat88 (Deacon)
on Oct 04, 2011 at 22:06 UTC ( [id://929681]=perlquestion )

patcat88 has asked for the wisdom of the Perl Monks concerning the following question:

I have an interesting file format to decode. It's based on C structs, but with lots of null-terminated strings: UTF-16LE strings are mixed with ASCII strings, all null terminated. The string lengths are variable, but the strings and longs always appear in the same number, position, and pattern (struct-ish). A sample of what a record looks like is below, along with my attempts so far to decode it.
use Encode qw(encode decode);
use Data::Dumper;

$s = encode('UTF-16LE', "first name string\0")
   . encode('UTF-16LE', "middle name string\0")
   . encode('UTF-16LE', "last name string\0")
   . pack('V', 3654877182)
   . pack('V', 1)
   . encode('UTF-16LE', "address string\0")
   . "zip code ascii\0"
   . pack('V', 1);
$decodedStr = decode('UTF-16LE', $s, Encode::FB_CROAK);
print Dumper([unpack('Z*Z*Z*VVZ*Z*V', $decodedStr)]);
UTF-16LE:Unicode character fffe is illegal at C:/perl512/lib/Encode.pm line 174.
Removing FB_CROAK gave me
$VAR1 = [
  "first name string",
  "middle name string",
  "last name string",
  "1627454973",
  "1701995620",
  "ss string",
  "\x{697a}\x{2070}\x{6f63}\x{6564}\x{6120}\x{6373}\x{6969}\x{100}"
];
and the warning "Character(s) in 'V' format wrapped in unpack".
Is there any way to use pack to decode this, or some other template-style way of decoding it, without writing a byte-level parser using substr, index, and a current-position integer?
As far as I understand, pack doesn't understand what a null-terminated UTF-16 string is, right?

Replies are listed 'Best First'.
Re: unpacking mixed ascii & utf16 null termed strings
by ikegami (Patriarch) on Oct 05, 2011 at 00:23 UTC

    The reverse of encode→pack is unpack→decode. You have to remove "layers" in the opposite order that they've been applied, so you're doing it in the wrong order.
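
The layering point can be illustrated with a minimal sketch, assuming a made-up record of one null-terminated UTF-16LE string followed by one 32-bit little-endian integer. The raw bytes are sliced first, and only then is the text part decoded and the numeric part unpacked. The name `$text_len` is hypothetical, and the length is known here only because the sketch builds the record itself.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Build a tiny record: encode the text layer, pack the integer layer.
my $bytes = encode('UTF-16LE', "name\0") . pack('V', 3654877182);

# Undo the layers in reverse: split the raw bytes first, then decode/unpack.
my $text_len = 2 * length("name\0");          # 10 bytes (assumption: length known in advance)
my $raw_text = substr($bytes, 0, $text_len);  # still bytes, not characters
my $name     = decode('UTF-16LE', $raw_text);
$name =~ s/\0\z//;                            # drop the decoded terminator
my ($num) = unpack('V', substr($bytes, $text_len, 4));

print "$name $num\n";
```

Decoding the whole buffer first, as in the question, turns the packed integer bytes into UTF-16 code units, which is why unpack later sees characters instead of bytes.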

    I don't see any trivial way of extracting the string before it's decoded, unfortunately.

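    sub extract_text {
        # Step two bytes at a time so the \0\0 terminator stays aligned to a
        # UTF-16 code unit; non-greedy so only the first string is consumed.
        $_[0] =~ s/^((?:..)*?)\0\0//s or die;
        return decode('UTF-16le', "$1");
    }

    my $first_name  = extract_text($bytes);
    my $middle_name = extract_text($bytes);
    my $last_name   = extract_text($bytes);
    my ($x, $y)     = unpack('VV', substr($bytes, 0, 8, ''));
    my $address     = extract_text($bytes);
    # Treated as UTF-16LE here; the question's sample has this field as ASCII.
    my $zip_code    = extract_text($bytes);
    my ($z)         = unpack('V', substr($bytes, 0, 4, ''));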
    or maybe
    # Non-greedy groups stop at the first aligned \0\0 terminator.
    my @fields = $bytes =~ /
        ^
        ((?:..)*?)\0\0
        ((?:..)*?)\0\0
        ((?:..)*?)\0\0
        (.{4})
        (.{4})
        ((?:..)*?)\0\0
        ((?:..)*?)\0\0
        (.{4})
        \z
    /sx or die;
    $_ = decode('UTF-16le', $_) for @fields[0,1,2,5,6];
    $_ = unpack('V', $_) for @fields[3,4,7];
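
A variant of the same idea that treats the zip code as a single-byte ASCII string, as in the question's sample record, might look like the sketch below. The helper names `take_utf16z`, `take_asciiz`, and `take_long` are hypothetical, invented for this illustration; each one consumes its field from the front of the buffer passed in.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Consume one null-terminated UTF-16LE string from the front of the buffer.
sub take_utf16z {
    $_[0] =~ s/\A((?:..)*?)\0\0//s or die "unterminated UTF-16LE string";
    my $raw = $1;
    return decode('UTF-16LE', $raw);
}

# Consume one null-terminated single-byte (ASCII) string.
sub take_asciiz {
    $_[0] =~ s/\A([^\0]*)\0// or die "unterminated ASCII string";
    return $1;
}

# Consume one 32-bit little-endian unsigned integer.
sub take_long {
    length($_[0]) >= 4 or die "truncated integer";
    return unpack('V', substr($_[0], 0, 4, ''));
}

# Sample record laid out like the one in the question.
my $bytes = join '',
    encode('UTF-16LE', "first name string\0"),
    encode('UTF-16LE', "middle name string\0"),
    encode('UTF-16LE', "last name string\0"),
    pack('V', 3654877182),
    pack('V', 1),
    encode('UTF-16LE', "address string\0"),
    "zip code ascii\0",
    pack('V', 1);

my $first   = take_utf16z($bytes);
my $middle  = take_utf16z($bytes);
my $last    = take_utf16z($bytes);
my $x       = take_long($bytes);
my $y       = take_long($bytes);
my $address = take_utf16z($bytes);
my $zip     = take_asciiz($bytes);
my $z       = take_long($bytes);

print join('|', $first, $middle, $last, $x, $y, $address, $zip, $z), "\n";
```

Each helper removes what it reads, so the order of the calls is the record layout, and an empty buffer at the end is a cheap sanity check.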
      Your regexp seems to work. It also correctly handles the alignment problem: given a hypothetical character such as "p\0" ("\x70\0") followed by the UTF-16 null ("\0\0"), a naive match could pair the second byte of the "p" (a \0) with the first byte of the terminator (a \0) and mistake them for the terminator, rather than matching bytes 1 and 2 of the real UTF-16 null. I didn't think of using a two-byte group with a multiplier to keep the regexp aligned on the UTF-16 strings. Thanks.
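
The alignment issue described above can be reproduced in a few lines. This is a small sketch using the "p\0" example from the post; the variable names `$naive` and `$aligned` are made up for the illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# "p" followed by the UTF-16 terminator: bytes 70 00 00 00.
my $bytes = encode('UTF-16LE', "p\0");

# A plain byte search pairs the high byte of "p" with the first terminator byte.
my $naive = index($bytes, "\0\0");

# Stepping two bytes at a time only accepts \0\0 on a code-unit boundary.
my $aligned = $bytes =~ /\A((?:..)*?)\0\0/s ? length($1) : -1;

print "naive=$naive aligned=$aligned\n";   # prints "naive=1 aligned=2"
```

The byte search reports offset 1, which would cut the "p" in half, while the paired match reports offset 2, the real start of the terminator.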
Re: unpacking mixed ascii & utf16 null termed strings
by Anonymous Monk on Oct 05, 2011 at 00:29 UTC

      C struct types have a fixed size (except possibly for their last field). The OP's record format is variable-width.

      At a glance, one might think this module useless, but the module goes beyond the ability to handle C structs. It provides the ability to do length-prefixed fields, and it provides the ability to use custom packing and unpacking routines on a per-field basis. The former doesn't help, but the latter (called "hooks") could maybe handle the text fields.

      Update: I tried, and I don't think it's possible using this module.

Node Type: perlquestion [id://929681]
Approved by BrowserUk
Front-paged by BrowserUk