maida has asked for the wisdom of the Perl Monks concerning the following question:

Hello all, I am parsing several large text files and using unpack to separate some fixed-length records. One of our fellow Perl Monks suggested that I look for a more elegant use of unpack. So here I am, and this is what I have.
__DATA__
AP040003EZ9891783 61125 N BX 108.0 0000 03196 + 00000 D Y B BP041303DD554 009J0 N BX 8.7 5000 03168 +62 00000 Y W

___PART OF THE CODE________
elsif (/^\s+?\S{13}\s+?\S+?\s+?\S/){
    $_ =~ s/^\s*//;
    my @fields = unpack "a21 a9 a9 a2 a13 a8 a9 a9 a4 a5 a6", $_;
    print " PIIN \= $fields[0]\n";
    print " FSCM \= $fields[1]\n";
    print " N/A \= $fields[2]\n";
    print " U/I \= $fields[3]\n";
    print " UNIT PRICE \= $fields[4]\n";
    print " AWD DT \= $fields[5]\n";
    print " QTY \= $fields[6]\n";
    print " OPT DT \= $fields[7]\n";
    print " FOB \= $fields[8]\n";
    print " REP \= $fields[9]\n";
    print " TYPE \= $fields[10]\n";
    print "\n";
}
Thanks in advance. -Shawn

Replies are listed 'Best First'.
Re: A more elegant use of unpack
by wfsp (Abbot) on Sep 03, 2004 at 12:20 UTC
    After looking at the docs again the 'a's should be 'A's.
        a  A string with arbitrary binary data, will be null padded.
        A  A text (ASCII) string, will be space padded.
    See also: perlpacktut in the docs.
    Update: Added reference to tutorial.
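
To illustrate the difference wfsp points at, here is a minimal sketch. The record is made up (a 6-character field followed by a 5-character field) and is not taken from the original data. When unpacking, 'a' returns each field verbatim, while 'A' strips trailing spaces and nulls.

```perl
use strict;
use warnings;

# Made-up 11-character record: a 6-char field, then a 5-char field.
my $record = "ABC   12   ";

# 'a' keeps the padding, 'A' strips trailing spaces (and nulls).
my @raw     = unpack 'a6 a5', $record;
my @trimmed = unpack 'A6 A5', $record;

printf "a: [%s] [%s]\n", @raw;
printf "A: [%s] [%s]\n", @trimmed;
```

With 'A' there is no need to trim each field by hand before printing or comparing it.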
Re: A more elegant use of unpack
by clscott (Friar) on Sep 03, 2004 at 16:42 UTC
    Just a personal preference but I would do:
    my @field_names = (
        'PIIN', 'FSCM', 'N/A', 'U/I', 'UNIT PRICE', 'AWD DT',
        'QTY', 'OPT DT', 'FOB', 'REP', 'TYPE',
    );  # quoted list: qw would split multi-word names such as 'UNIT PRICE'
    my $pack_defn = 'A21 A9 A9 A2 A14 A8 A9 A9 A4 A5 A6';
    my %fields;
    @fields{@field_names} = unpack($pack_defn,$_);
    foreach (@field_names){
        print "\t$_\t\= " , $fields{$_},"\n";
    }

    My changes keep the field names and the unpack definition closer together, put the values into a hash keyed by field name, and remove the repeated printing code (use formats if you want better column alignment).

    It may be important to note that this is not as efficient as the way you are currently doing it.

    As wfsp noted, your 'a's should be 'A's, and you are one character off in the 5th field (counting from one). Additionally, your regexp in the elsif line does not match any of your sample data lines.
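
One way to push the "names next to widths" idea a step further is to keep both in a single table and derive the template from it. This is only a sketch: the @layout table and parse_record are hypothetical names, the widths are copied from the A-template in this reply, and the sample record is synthetic (built with pack, with column placement chosen purely for illustration).

```perl
use strict;
use warnings;

# Hypothetical layout table: one [ name, width ] pair per field.
my @layout = (
    [ 'PIIN'       => 21 ],
    [ 'FSCM'       =>  9 ],
    [ 'N/A'        =>  9 ],
    [ 'U/I'        =>  2 ],
    [ 'UNIT PRICE' => 14 ],
    [ 'AWD DT'     =>  8 ],
    [ 'QTY'        =>  9 ],
    [ 'OPT DT'     =>  9 ],
    [ 'FOB'        =>  4 ],
    [ 'REP'        =>  5 ],
    [ 'TYPE'       =>  6 ],
);

# Derive the name list and the unpack template from the same table.
my @names    = map { $_->[0] } @layout;
my $template = join ' ', map { 'A' . $_->[1] } @layout;

# Unpack one fixed-width line into a hash keyed by field name.
sub parse_record {
    my ($line) = @_;
    my %rec;
    @rec{@names} = unpack $template, $line;
    return \%rec;
}

# Synthetic record: pack pads each 'A' field with spaces to its width.
my $line = pack $template,
    'BP041303DD554', '009J0', 'N', 'BX', '8.7',
    '5000', '03168', '+62', '00000', 'Y', 'W';

my $rec = parse_record($line);
printf "%-10s = %s\n", $_, $rec->{$_} for @names;
```

Changing a width or adding a field then means editing one row, and the names and the template cannot drift out of step.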

    --
    Clayton
Re: A more elegant use of unpack
by Aristotle (Chancellor) on Sep 03, 2004 at 16:29 UTC

    First, since you check the string with an initial match, there's no need to s/// it separately to trim the whitespace: just capture the part you're interested in and use it directly. Also, all those lazy quantifiers should be greedy: that which follows your plus quantifiers can never be matched by that which is quantified (ie \s can never match \S and vice versa), so greedy vs lazy does not change the match semantics. And greedy is both more efficient and makes for less clutter. I'd add an /x for good measure.
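
The claim about lazy versus greedy quantifiers can be checked with a small sketch. The test lines below are made up and their spacing is illustrative only. Both patterns capture from the first non-blank token onward, and they should agree on every line, whether it matches or not.

```perl
use strict;
use warnings;

# Same pattern twice: once with lazy quantifiers, once with greedy ones.
my $lazy   = qr/^\s+?(\S{13}\s+?\S+?\s+?\S.*)/;
my $greedy = qr/^\s+(\S{13}\s+\S+\s+\S.*)/;

# Made-up lines, not the poster's real data.
my @lines = (
    "   BP041303DD554   009J0  N BX 8.7",  # leading blanks, 13-char first token
    "BP041303DD554 009J0 N",               # no leading blanks
    "  AP040003EZ9891783 61125 N",         # first token longer than 13 chars
);

my @results;
for my $line (@lines) {
    my ($l) = $line =~ $lazy;     # undef when the pattern does not match
    my ($g) = $line =~ $greedy;
    my $shown_l = defined $l ? $l : '<no match>';
    my $shown_g = defined $g ? $g : '<no match>';
    push @results, [ $shown_l, $shown_g ];
    printf "%-5s lazy=[%s] greedy=[%s]\n",
        ( $shown_l eq $shown_g ? 'same' : 'DIFF' ), $shown_l, $shown_g;
}
```

Only the first line matches, because the pattern requires leading whitespace and a first token of exactly 13 non-blank characters.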

    elsif ( /^ \s+ ( \S{13} \s+ \S+ \s+ \S.* )/x ) {
        my @fields = unpack "a21 a9 a9 a2 a13 a8 a9 a9 a4 a5 a6", $1;
        # ...
    }

    What follows in your case has a lot of repetition: the print, the formatting whitespace, and the reference to @fields is duplicated over and over. You can do better than that:

    elsif ( /^ \s+ ( \S{13} \s+ \S+ \s+ \S.* )/x ) {
        # quoted list rather than qw, so multi-word names stay intact
        my @field = (
            'PIIN', 'FSCM', 'N/A', 'U/I', 'UNIT PRICE', 'AWD DT',
            'QTY', 'OPT DT', 'FOB', 'REP', 'TYPE',
        );
        my %value;
        @value{ @field } = unpack "a21 a9 a9 a2 a13 a8 a9 a9 a4 a5 a6", $1;
        printf " %-10s = %s\n", $_, $value{ $_ } for @field;
        print "\n";
    }

    Makeshifts last the longest.