
The problem with that arises if you do not know what the keys are, or what order they appear in: how do you pick out the nth field from the hash?


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."
Re^3: Extracting specific data from fixed-width columns
by psini (Deacon) on Jul 04, 2008 at 05:09 UTC

    From the word choice in the original node, I understood that the OP knows the keys ("variables") and not necessarily the positions.

    If the positions are fixed and known, this could work:

    my @vars=$line=~/.{25}(.{15})/g;

    or this, which keeps the keys interleaved with the values:

    my @vars=$line=~/(.{25})(.{15})/g;

    Not tested (I don't have a perl available right now), but both of them should work.
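A minimal sketch of the two regexes above, run against a made-up record. The sample data (`alpha`, `beta`, `gamma`) and the 25/15 column widths are assumptions for illustration only; the real layout is whatever the OP's file uses.

```perl
use strict;
use warnings;

# Hypothetical sample record: three (25-char key, 15-char value) pairs.
my $line = join '', map { sprintf '%-25s%15d', @$_ }
    [ alpha => 1 ], [ beta => 22 ], [ gamma => 333 ];

# Values only: skip 25 characters, capture the next 15, repeat.
my @values = $line =~ /.{25}(.{15})/g;

# Keys and values interleaved: key0, value0, key1, value1, ...
my @pairs = $line =~ /(.{25})(.{15})/g;

# The captures keep their padding, so trim before using them.
s/^\s+|\s+$//g for @values, @pairs;

print "values: @values\n";
print "pairs:  @pairs\n";
```

Note that `.` does not match a newline, so a line read from a file should be chomped first, and a record shorter than a whole number of 40-character pairs silently yields fewer matches.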

    Rule One: "Do not act incautiously when confronting a little bald wrinkly smiling man."

      There are several other reasons for preferring unpack to a regex for fixed-width data.

      1. Strings are typically left-justified and space padded. When used as keys in a hash, 'xxx' won't match 'xxx '.

        The A template will strip the right-padding on the fly.

        Numbers are typically right-justified. Perl will ignore the leading spaces whenever you use them in a numeric context.

      2. unpack is usually much faster than using a regex for this.

        For this application, picking out 6 pairs (12 fields) from 125 pairs (250 fields), it is roughly 10 times faster:

        #! perl -slw
        use strict;
        use Math::Random::MT qw[ rand ];
        use Benchmark qw[ cmpthese ];

        our $data = join '', map {
            sprintf '%-25s%15d', 'X' x int( rand 25 ) , int( rand 2**32 )
        } 1 .. 125;

        ## extract 6 pairs at positions 3rd, 33rd, 50th, 75th, 100th 123rd
        my $pair = 'A25 A15';
        our $tmpl = "x[($pair)2] $pair x[($pair)29] $pair x[($pair)16] $pair"
                  . "x[($pair)24] $pair x[($pair)24] $pair x[($pair)22] $pair";

        cmpthese -3, {
            regex => q[
                our $data;
                my @sixPair = ( $data =~ m[(.{25})(.{15})]g )[
                    5, 6, 65, 66, 99, 100, 149, 150, 199, 200, 245,246
                ];
            ],
            unpack=> q[
                our( $data, $tmpl );
                my @sixPair = unpack $tmpl, $data
            ],
        };

        cmpthese 1, {
            regex => q[
                our $data;
                my @sixPair = ( $data =~ m[(.{25})(.{15})]g )[
                    4,5, 64,65, 98,99, 148,149, 198,199, 244,245
                ];
                print 'regex ', join '|', @sixPair;
            ],
            unpack=> q[
                our( $data, $tmpl );
                my @sixPair = unpack $tmpl, $data;
                print 'unpack ', join '|', @sixPair;
            ],
        };

        __END__
        C:\test>junk4
                  Rate  regex unpack
        regex   3311/s     --   -91%
        unpack 38783/s  1071%     --

        regex XXXXXXXXXXXXXXXXXXXXXX | 189677339| XXXXXXX | 966124187|
              XXXXXXXXXXX | -1269554066| XXXXX | -1916129141|
              XXXXXXXXXXXXXXX | -479254076| XXXXXXXXXXXXXXXXX | 335028423

        unpack XXXXXXXXXXXXXXXXXXXXXX| 189677339| XXXXXXX| 966124187|
               XXXXXXXXXXX| -1269554066| XXXXX| -1916129141|
               XXXXXXXXXXXXXXX| -479254076| XXXXXXXXXXXXXXXXX| 335028423

        Results wrapped manually for posting.
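The two points above (padding stripped by `A`, and skipping unwanted fields with `x[...]`) can be sketched on a much smaller record. The field names and values here are invented for illustration; only the `A25 A15` pair layout comes from the benchmark.

```perl
use strict;
use warnings;

# Hypothetical record: five (25-char name, 15-char number) pairs.
my @names = qw( one two three four five );
my $line  = join '', map {
    sprintf '%-25s%15d', $names[$_], ( $_ + 1 ) * 10
} 0 .. 4;

my $pair = 'A25 A15';

# Skip one pair, take the 2nd, skip two more, take the 5th.
# x[...] skips as many bytes as the bracketed template would occupy.
my $tmpl   = "x[($pair)1] $pair x[($pair)2] $pair";
my @picked = unpack $tmpl, $line;

# 'A' has already stripped the trailing spaces from the names,
# so they can be used directly as hash keys.
my %value_of = @picked;

# The numbers still carry leading spaces; numeric context ignores them.
print "two  => ", $value_of{two}  + 0, "\n";
print "five => ", $value_of{five} + 0, "\n";
```

Because unpack returns the fields in template order, the nth field is simply `$picked[n]`, whether or not the keys are known in advance.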

