Re^3: Extracting specific data from fixed-width columns

From the word choice in the original node, I understood that the OP knows the keys ("variables") and not necessarily the positions.

If the positions are fixed and known, this could work:

my @vars=$line=~/.{25}(.{15})/g;

or this, which keeps the keys interleaved with the values:

my @vars=$line=~/(.{25})(.{15})/g;

Not tested (I don't have a perl available right now), but both of them should work.
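
A quick way to check both patterns is to run them against a made-up record. The sketch below assumes 25-character keys followed by 15-character values, as in the patterns above; the sample keys and values are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical record: three key/value pairs, keys padded to 25 chars,
# values right-justified in 15 chars.
my $line = join '', map { sprintf '%-25s%15d', @$_ }
    ( [ alpha => 1 ], [ beta => 22 ], [ gamma => 333 ] );

# Values only: the 25-char key is matched but not captured.
my @values = $line =~ /.{25}(.{15})/g;

# Keys interleaved with values.
my @pairs = $line =~ /(.{25})(.{15})/g;

# The captures keep their padding, so trim before use.
s/^\s+|\s+$//g for @values, @pairs;

print "values: @values\n";
print "pairs:  @pairs\n";
```

Note that the regex captures carry the column padding with them, hence the explicit trim before printing.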

Rule One: "Do not act incautiously when confronting a little bald wrinkly smiling man."


Replies are listed 'Best First'.

Re^4: Extracting specific data from fixed-width columns
by BrowserUk (Patriarch) on Jul 04, 2008 at 08:15 UTC

There are several other reasons for preferring unpack to a regex for fixed-width data.

Strings are typically left-justified and space padded. When used as keys in a hash, 'xxx' won't match 'xxx '.
The A template will strip the right-padding on the fly.
Numbers are typically right-justified. Perl will ignore the leading spaces the first time you use such a field in a numeric context.
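
The first two points can be seen on a single made-up field pair. This is only a sketch with invented data: the key 'xxx' and the hash %price are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical record: a left-justified key and a right-justified number.
my $record = sprintf '%-25s%15d', 'xxx', 42;

my %price = ( xxx => 9.99 );    # lookup table keyed by the unpadded name

# A regex capture keeps the padding, so the hash lookup misses.
my( $rawKey, $rawNum ) = $record =~ /(.{25})(.{15})/;
print 'regex key found:  ', ( exists $price{ $rawKey } ? 'yes' : 'no' ), "\n";

# The A template strips trailing spaces, so the lookup succeeds.
my( $key, $num ) = unpack 'A25 A15', $record;
print 'unpack key found: ', ( exists $price{ $key } ? 'yes' : 'no' ), "\n";

# Leading spaces on the number are ignored in numeric context.
print 'number + 1 = ', $num + 1, "\n";
```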

unpack is usually much faster than using a regex for this.

For this application of picking out 6 pairs (12 fields) from 125 pairs (250 fields), it is roughly 10 times faster:

#! perl -slw
use strict;
use Math::Random::MT qw[ rand ];
use Benchmark qw[ cmpthese ];

our $data = join '', map {
    sprintf '%-25s%15d', 'X' x int( rand 25 ) , int( rand 2**32 )
} 1 .. 125;

## extract 6 pairs at positions 3rd, 33rd, 50th, 75th, 100th, 123rd
my $pair = 'A25 A15';
our $tmpl = "x[($pair)2] $pair x[($pair)29] $pair x[($pair)16] $pair"
          . "x[($pair)24] $pair x[($pair)24] $pair x[($pair)22] $pair";

cmpthese -3, {
    regex => q[
        our $data;
        my @sixPair = ( 
            $data =~ m[(.{25})(.{15})]g 
        )[ 4,5, 64,65, 98,99, 148,149, 198,199, 244,245 ];
    ],
    unpack=> q[
        our( $data, $tmpl );
        my @sixPair = unpack $tmpl, $data
    ],
};

cmpthese 1, {
    regex => q[
        our $data;
        my @sixPair = ( 
           $data =~ m[(.{25})(.{15})]g 
        )[ 4,5, 64,65, 98,99, 148,149, 198,199, 244,245 ];
        print 'regex  ', join '|', @sixPair;
    ],
    unpack=> q[
        our( $data, $tmpl );
        my @sixPair = unpack $tmpl, $data;
        print 'unpack ', join '|', @sixPair;
    ],
};

__END__
C:\test>junk4
          Rate  regex unpack
regex   3311/s     --   -91%
unpack 38783/s  1071%     --

regex  
XXXXXXXXXXXXXXXXXXXXXX   |      189677339|
XXXXXXX                  |      966124187|
XXXXXXXXXXX              |    -1269554066|
XXXXX                    |    -1916129141|
XXXXXXXXXXXXXXX          |     -479254076|
XXXXXXXXXXXXXXXXX        |      335028423

unpack 
XXXXXXXXXXXXXXXXXXXXXX|      189677339|
XXXXXXX|      966124187|
XXXXXXXXXXX|    -1269554066|
XXXXX|    -1916129141|
XXXXXXXXXXXXXXX|     -479254076|
XXXXXXXXXXXXXXXXX|      335028423

Results wrapped manually for posting.
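
The template in the benchmark uses x[...] with a bracketed sub-template as its count, which makes unpack skip as many bytes as that sub-template would occupy. A smaller sketch of the same idea, assuming invented 5- and 3-character fields rather than the 25/15 layout above:

```perl
use strict;
use warnings;

# Hypothetical data: six pairs, a 5-char name then a 3-char number.
my $data = join '', map { sprintf '%-5s%3d', "n$_", $_ * 10 } 1 .. 6;

my $pair = 'A5 A3';

# Skip two pairs, take the 3rd, skip two more, take the 6th.
my $tmpl = "x[($pair)2] $pair x[($pair)2] $pair";

my @picked = unpack $tmpl, $data;
print join( '|', @picked ), "\n";
```

Only the wanted fields are ever extracted, which is one plausible reason unpack comes out ahead of a regex that captures every field and then slices the list.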

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."
