
Unpack/format

by Reverend Phil (Pilgrim)
on Feb 13, 2002 at 15:58 UTC

Reverend Phil has asked for the wisdom of the Perl Monks concerning the following question:

Greetings, fellow citizens of planet earth.

While I do not work in a shipping department, I do spend a great deal of time packing and unpacking. I've got a project which deals with fixed-width data records in various sizes based on the kind of record we're getting. On rare occasions (because we all know how rare it is that our customers provide us with erroneous data) I get something a bit ugly from the customer. This can cause me heartache, hassle, and various other forms of gump. The problem occurs if the records are shorter than I expect. Actually, it's more absurd than that. Let me simplify with an example:

example 1:
    $a = "test test test";
    $fmt = "A4 x1 A4 x1 A4 x1 A4";
    @a = unpack($fmt, $a);
    print join ("*",@a);
    ^D
    x outside of string at - line 3.

example 2:
    $a = "test test test ";
    $fmt = "A4 x1 A4 x1 A4 x1 A4";
    @a = unpack($fmt, $a);
    print join ("*",@a);
    ^D
    test*test*test*

For those unfamiliar with format strings ($fmt), allow me to cut and paste a drop of data from my infinitely handy Perl CD Bookshelf:
A - An ASCII string, will be space padded
x - A null byte

Strangely, if the string being unpacked is too short, but I'm looking for an ASCII char beyond the bounds of the string, things work fine as frog's hair (even though there are other 'x' chunks in the format beyond). If I'm looking for a null byte (i.e. skipping a character) at the position just past the end of the string, the code dies with an 'x outside of string'.
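
The behavior described above can be reproduced, and the fatal error trapped, with a block eval. This is only a minimal sketch: try_unpack is a hypothetical helper name, not something from the original script, and it treats an empty return list as "unpack died".

```perl
use strict;
use warnings;

# Hypothetical helper: returns the unpacked fields, or an empty list
# if unpack dies (for example with "x outside of string").
sub try_unpack {
    my ($fmt, $record) = @_;
    my @fields = eval { unpack($fmt, $record) };
    return $@ ? () : @fields;
}

my $fmt = "A4 x1 A4 x1 A4 x1 A4";

# The 14-byte record ends exactly where the third x1 wants to skip;
# the 15-byte record has one more byte for it to skip over.
for my $record ("test test test", "test test test ") {
    my @fields = try_unpack($fmt, $record);
    printf "%2d bytes: %s\n", length($record),
        @fields ? join("*", @fields) : "unpack died";
}
```

Trapping the error this way keeps the script alive, but it does not say how short the record was; that still needs a length check.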

I'm wondering if other people have come across this as an issue, and how they've handled it. My thoughts are to check the length of each string vs. the expected length of the format string, but with a great deal of strings coming through, that would be tedious. Besides, rather than hard-coding the fmt string lengths, I'd rather have something more flexible to changes, which would leave me checking the length of each string vs. the calculated expected length of the format string. Just sounds ugly. Either way, I'm looking to avoid this 'x outside of string' hubbub.

Thoughts? Feelings? Rhyme? Reason?
-=rev=-

Replies are listed 'Best First'.
Re: Unpack/format
by jlongino (Parson) on Feb 13, 2002 at 16:56 UTC
    This may be an ugly hack, but it works:
        $pad = ' ' x 80; ## make the length the same as your max expected

        $c = "test test test";
        $fmt = "A4 x1 A4 x1 A4 x1 A4";
        @c = unpack($fmt, "$c$pad");
        print join ("*",@c), "\n";

        $d = "test test test ";
        $fmt = "A4 x1 A4 x1 A4 x1 A4";
        @d = unpack($fmt, "$d$pad");
        print join ("*",@d), "\n";

    --Jim

    Update: BTW, I would avoid using $a as a variable since it (and $b) has an evil association with sort. Check the link for details.

      Thanks Jim. That'd definitely kill the 'x' issue, and I'll probably put that there. However, if I'm going to handle this, I'd rather trap short entries and yell at the customer. Currently, I yell at them for a variety of reasons via email within the script. I'd like to junk any file that has short records like this, and notify the client. I guess I'm looking for the prettiest way that someone can think of, and wondering if there's something out there that quickly calculates the expected length given a pack/unpack style format string.

      As for the $a/$b - thanks for kicking me =) I know this, and generally do not use such nefarious variables. They show up only when I'm hitting the interpreter directly on my stinky Win32 box in order to test some snippet (like this one). =)
Re: Unpack/format
by trs80 (Priest) on Feb 13, 2002 at 17:36 UTC
    I took jlongino's suggestion and reworked it a little bit. I moved the padding into a subroutine and made it so it only creates enough extra whitespace to satisfy the format condition.
        $c = "test test test";
        $c = correct_length($c);
        $fmt = "A4 x1 A4 x1 A4 x1 A4";
        @c = unpack($fmt, "$c");
        print join ("*",@c), "\n";

        $d = "test test test ";
        $d = correct_length($d);
        $fmt = "A4 x1 A4 x1 A4 x1 A4";
        @d = unpack($fmt, "$d");
        print join ("*",@d), "\n";

        sub correct_length {
            my ($string) = shift;
            # pad with spaces up to the 15 bytes the format needs
            $string .= ' ' x (15 - length($string)) if length($string) < 15;
            return($string);
        }

    This is assuming all your widths are the same, otherwise you would want to either pass the required width as an argument to the sub (not a good idea if you use the size in multiple places) or create a hash with values and their corresponding lengths and pass the item_name. I think this would give it a little more durability. Something like this:
        my %formats = (
            "item_name"  => 15, # where 15 is the length
            "item_name2" => 20, # etc.
        );
        # code removed for demonstration
        $c = correct_length($c,'item_name');

        sub correct_length {
            my ($string,$item) = @_;
            my $want = $formats{$item};
            # pad with spaces up to the length registered for this item
            $string .= ' ' x ($want - length($string)) if length($string) < $want;
            return($string);
        }
    Then the subroutine above could be passed the item_name and the string so the only place you would have to modify length would be in the formats hash, but I don't know if your data is that complex/varied.
      Thanks for the ideas =)
      Perhaps I should be a little more detailed though as to why I would rather not hard-code the format lengths.
      I'm taking a slew of files in from an FTP site, and their file names are the only indication of their formatting.
      Let's assume the file "2001-123456789-W2.txt"
          unless($file =~ /^(\d{4})-\d{9}-([^\.]+)/){
              log_data("$file does not conform to file naming standards");
              next;
          }
          $lower_format = lc("$1$2");
      This allows me to throw the file contents into a hash, keyed to "2001w2". Later on, I'm going to take the array of records matched to each hash key, and break them up based on their format. I'm disgustingly doing this via an eval, in a function which was passed the hash-key as $format:
          eval "\$current_format = \$fmt_$format";
          @line = unpack($current_format, $data);
      Earlier in the script, I've defined $fmt_2001w2 as a format string. There are numerous format strings of various shapes and sizes here.
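
One way to avoid the string eval, sketched under assumptions: keep the templates in a hash keyed by the same "2001w2"-style string instead of in separate $fmt_... variables. The names %fmt_for and unpack_record are hypothetical, and the template shown is the test format from the top of the thread, not a real W2 layout.

```perl
use strict;
use warnings;

# Hypothetical table: one entry per file type, keyed the same way as
# the $fmt_... variables (year plus lowercased form name).
my %fmt_for = (
    '2001w2' => "A4 x1 A4 x1 A4 x1 A4",   # placeholder template only
);

# Returns the unpacked fields for a record of the given format,
# or dies if no template is registered for that key.
sub unpack_record {
    my ($format, $data) = @_;
    my $template = $fmt_for{$format}
        or die "no format defined for '$format'\n";
    return unpack($template, $data);
}

my @line = unpack_record('2001w2', "test test test ");
print join("*", @line), "\n";
```

With a hash, an unknown file type becomes an explicit error rather than an undefined $current_format, and the list of supported formats lives in one place.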

      I could take the expected string sizes from our specs, and hardcode the values in here, checking the format and then the string lengths as I go along. But when the customer decides that we need to add this and move that, I have to (a) adjust the $fmt_2001w2 variable, and (b) change this hard-coded length value.

      I am lazy, and wish to do this in one place, not two. I don't want to miss one ;)

      Yes, padding the end of the line will work. Yes, hard-coding the lengths of the formats will work. Thanks both of you for helping =) Now though, it's no longer about making this work (or more specifically, not letting this die), and it's about how I can accomplish the objective goal of making sure that such a thing doesn't happen in a general sense - and if it involves calculating the length from the format string, what might be the quickest or keenest way to do so =)

      -=rev=-
        If your format string doesn't use any terribly fancy template characters (e.g. "w"), and yours, which are all "a" and "x", fit the bill, you can do
        $expected_length = length(pack($fmt))
        This returns 14 for the format string of "a4 x1 a4 x1 a4".
        Hope this helps...

        -JAS
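
The length(pack($fmt)) idea can be turned into a check that sorts records into good and short ones, so the expected length only ever comes from the format string. This is a rough sketch assuming templates built solely from fixed-width codes such as A, a and x; expected_length and partition_records are hypothetical names.

```perl
use strict;
use warnings;

# Expected record length for a template made only of fixed-width
# codes such as A, a and x. Packing with no arguments fills each
# field with padding, so the result has the template's full width.
sub expected_length {
    my ($fmt) = @_;
    return length(pack($fmt));
}

# Hypothetical helper: split records into those long enough for the
# template and those that are short. Returns two array references.
sub partition_records {
    my ($fmt, @records) = @_;
    my $want = expected_length($fmt);
    my (@good, @short);
    for my $record (@records) {
        if (length($record) >= $want) { push @good,  $record }
        else                          { push @short, $record }
    }
    return (\@good, \@short);
}

my $fmt = "A4 x1 A4 x1 A4 x1 A4";
my ($good, $short) = partition_records($fmt, "test test test test", "test test test");
printf "expected %d bytes; %d good, %d short\n",
    expected_length($fmt), scalar @$good, scalar @$short;
```

A file with anything in the short list could then be junked and reported to the client before unpack is ever called on it.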
