Seeking best approach to column parsing

vivapl has asked for the wisdom of the Perl Monks concerning the following question:

Oh wise ones,

I'm searching for some wisdom with regard to string parsing.

1     1234    gg123456789  000-12345-1234-111
12    1234    gg123456789  000-12345-1234-111
123   1234    gg123456789  000-12345-1234-111

The lines above are samples from a file I'm trying to parse. My question concerns the number of spaces between the first and second columns. I tried to use split, but the position of the second column changes when the amount of whitespace differs. Any ideas on how to approach this?

Thanks in advance

update (broquaint): tidied up formatting

20031125 Edit by BazB: Changed title from 'Seeking best approach'


Replies are listed 'Best First'.
Re: Seeking best approach to column parsing by hardburn (Abbot) on Nov 25, 2003 at 17:42 UTC
Since your data appears to be fixed-width, you could use unpack instead of split, which should be much faster.

----
I wanted to explore how Perl's closures can be manipulated, and ended up creating an object system by accident.
-- Schemer

`: () { :\|:& };:`

Note: All code is untested, unless otherwise stated
Re: Re: Seeking best approach by davido (Cardinal) on Nov 25, 2003 at 18:27 UTC
That's a fact! unpack may be intimidating at first, but if you're looking at fixed-width columns (as your examples seem to indicate), unpack is going to be faster than split. Consider split when you don't have fixed-width data, and unpack when you do.

Another advantage of unpack for fixed-width column data is that if one of the fields is filled completely (leaving no whitespace at all before the next field), splitting on whitespace will fail, while unpack will still work fine.

Dave

"If I had my life to live over again, I'd be a plumber." -- Albert Einstein
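For anyone who has not used unpack before, here is a minimal sketch of the fixed-width approach suggested above. The template `A6 A8 A13 A*` is an assumption, measured from the three sample lines in the question (columns of 6, 8 and 13 characters, then the rest of the line); the real file may be laid out differently. The `parse_line` helper and the fourth input line, whose first field fills its whole column, are hypothetical and only there to illustrate the "filled completely" case.

```perl
use strict;
use warnings;

# Assumed layout, measured from the sample lines in the question:
# 6 + 8 + 13 characters, then whatever remains. 'A' strips trailing spaces.
my $template = 'A6 A8 A13 A*';

# Hypothetical helper: split one fixed-width line into its four fields.
sub parse_line {
    my ($line) = @_;
    chomp $line;
    return unpack $template, $line;
}

my @lines = (
    "1     1234    gg123456789  000-12345-1234-111\n",
    "12    1234    gg123456789  000-12345-1234-111\n",
    "123   1234    gg123456789  000-12345-1234-111\n",
    # Hypothetical line: the first field uses all 6 characters,
    # so no whitespace separates it from the second field.
    "1234561234    gg123456789  000-12345-1234-111\n",
);

for my $line (@lines) {
    my @fixed = parse_line($line);
    my @words = split ' ', $line;
    printf "unpack: %-45s split found %d fields\n",
        join('|', @fixed), scalar @words;
}
```

On the last line, splitting on whitespace sees only three fields because the first two columns run together, while unpack still returns four. The trade-off is that the template has to match the file's real column widths exactly.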
Re: Seeking best approach to column parsing by TStanley (Canon) on Nov 25, 2003 at 17:32 UTC
split will take a regular expression as the delimiter:

```perl
#!/opt/perl5/bin/perl -w
use strict;

while(<DATA>){
    my ($c,$d,$e,$f)=split /\s+/,$_;
    print"$c\t$d\t$e\t$f\n";
}
__DATA__
1     1234    gg123456789  000-12345-1234-111
12    1234    gg123456789  000-12345-1234-111
123   1234    gg123456789  000-12345-1234-111
```

TStanley
--------
The only thing necessary for the triumph of evil is for good men to do nothing -- Edmund Burke
Re: Seeking best approach to column parsing by b10m (Vicar) on Nov 25, 2003 at 17:31 UTC
Use something like: `my @entries = split(/\s+/, $_);`

-- B10m
Re: Seeking best approach to column parsing by Art_XIV (Hermit) on Nov 25, 2003 at 18:21 UTC
You don't even have to use the regex inside of a split:

```perl
use strict;

while (<DATA>) {
    my @elements = split;
    print ">", join(':', @elements), "<\n";
}

1;
__DATA__
1     1234    gg123456789  000-12345-1234-111
12    1234    gg123456789  000-12345-1234-111
123   1234    gg123456789  000-12345-1234-111
```

`split`'s behavior with no args will do what you want, with the bonus of ignoring leading/trailing whitespace. See `perlfunc`.

Hanlon's Razor - "Never attribute to malice that which can be adequately explained by stupidity"
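To make the leading-whitespace point concrete, here is a small sketch comparing the two forms of split on an indented line. The indented input is hypothetical (the sample data in the question has no leading spaces), and the variable names are only for illustration. A bare `split` behaves like `split ' ', $_`.

```perl
use strict;
use warnings;

# Hypothetical input: the second sample line, indented by three spaces.
my $line = "   12    1234    gg123456789  000-12345-1234-111\n";

# A regex delimiter keeps an empty leading field for the indent.
my @by_regex = split /\s+/, $line;

# The special ' ' pattern (what a bare split uses) skips leading whitespace.
my @by_default = split ' ', $line;

printf "regex:   %d fields, first is '%s'\n", scalar @by_regex,   $by_regex[0];
printf "default: %d fields, first is '%s'\n", scalar @by_default, $by_default[0];
```

With the regex form the second column ends up at index 2 instead of index 1 whenever a line is indented, which is the kind of shifting position the original question describes.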
Re: Seeking best approach to column parsing by Roger (Parson) on Nov 26, 2003 at 01:14 UTC
My favorite method - build a two dimensional array in one go.

```perl
use strict;
use Data::Dumper;

my @elements = map { /^\s*$/ ? () : [split /\s+/] } <DATA>;
print Dumper(\@elements);

__DATA__
1     1234    gg123456789  000-12345-1234-111
12    1234    gg123456789  000-12345-1234-111
123   1234    gg123456789  000-12345-1234-111
```

And the output -

```
$VAR1 = [
          [
            '1',
            '1234',
            'gg123456789',
            '000-12345-1234-111'
          ],
          [
            '12',
            '1234',
            'gg123456789',
            '000-12345-1234-111'
          ],
          [
            '123',
            '1234',
            'gg123456789',
            '000-12345-1234-111'
          ]
        ];
```
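Once the rows are in a two dimensional array, a given column always sits at the same index regardless of how many spaces were in the input. The sketch below assumes the same three sample lines (held in an array here instead of `<DATA>` so that it is self-contained) plus one blank line to show the filter; `@rows` and `@second_column` are illustrative names, not part of the code above.

```perl
use strict;
use warnings;

my @lines = (
    "1     1234    gg123456789  000-12345-1234-111\n",
    "\n",    # blank line, skipped by the filter below
    "12    1234    gg123456789  000-12345-1234-111\n",
    "123   1234    gg123456789  000-12345-1234-111\n",
);

# One array reference per non-blank line; split ' ' operates on $_.
my @rows = map { /^\s*$/ ? () : [ split ' ' ] } @lines;

# The second column is always index 1 of each row.
my @second_column = map { $_->[1] } @rows;

print "rows: ", scalar @rows, "\n";
print "second column: ", join(',', @second_column), "\n";
print "row 3, column 1: $rows[2][0]\n";
```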
Re: Seeking best approach to column parsing by vivapl (Acolyte) on Nov 25, 2003 at 17:36 UTC
Thanks guys, that did the trick. I guess I tend to forget these things. Thanks again