vivapl has asked for the wisdom of the Perl Monks concerning the following question:

Oh wise ones,

I'm searching for some wisdom in regards to string parsing.

1   1234 gg123456789 000-12345-1234-111
12  1234 gg123456789 000-12345-1234-111
123 1234 gg123456789 000-12345-1234-111
The above are example lines from a file I'm trying to parse. My question concerns the number of spaces between the first and second columns. I tried to use split, but the position of the second column changes when the whitespace count differs. Any ideas on how to approach this?

Thanks in advance

update (broquaint): tidied up formatting

20031125 Edit by BazB: Changed title from 'Seeking best approach'

Re: Seeking best approach to column parsing
by hardburn (Abbot) on Nov 25, 2003 at 17:42 UTC

    Since your data appears to be fixed-width, you could use unpack instead of split, which should be much faster.

    ----
    I wanted to explore how Perl's closures can be manipulated, and ended up creating an object system by accident.
    -- Schemer

    : () { :|:& };:

    Note: All code is untested, unless otherwise stated

      That's a fact! unpack may be intimidating at first, but if you're looking at fixed-width columns (as your examples seem to indicate), unpack is going to be faster than split. Consider split when you don't have fixed-width data, and unpack when you do.

      Another advantage to unpack for fixed-width column data is that if one of the fields should be filled completely (leaving no whitespace at all), splitting on whitespace will fail, while unpack will still work fine.
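
      A minimal sketch of the unpack approach, assuming (hypothetically) that the columns are 3, 4, 11 and 18 characters wide with a single space between them. The thread does not state the real widths, so the template and the helper name parse_fixed are illustrative only and would need adjusting to the actual file.

      ```perl
      use strict;
      use warnings;

      # ASSUMED layout (not given in the thread): columns of 3, 4, 11 and 18
      # characters, each separated by one space, which "x" skips over.
      my $template = 'A3 x A4 x A11 x A18';

      # Hypothetical helper: break one fixed-width record into its fields.
      # The "A" format strips trailing padding spaces from each field.
      sub parse_fixed {
          my ($line) = @_;
          return unpack $template, $line;
      }

      # Build sample records padded to the assumed widths.
      my @records = map { sprintf '%-3s %-4s %-11s %-18s', @$_ } (
          [ 1,   1234, 'gg123456789', '000-12345-1234-111' ],
          [ 12,  1234, 'gg123456789', '000-12345-1234-111' ],
          [ 123, 1234, 'gg123456789', '000-12345-1234-111' ],
      );

      for my $record (@records) {
          print join('|', parse_fixed($record)), "\n";
      }
      ```

      One thing to watch with a template like this: "x" raises an error if the record is shorter than the template expects, so short or truncated lines may need a length check first.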


      Dave


      "If I had my life to live over again, I'd be a plumber." -- Albert Einstein
Re: Seeking best approach to column parsing
by TStanley (Canon) on Nov 25, 2003 at 17:32 UTC
    split will take a regular expression as the delimiter:
    #!/opt/perl5/bin/perl -w
    use strict;

    while (<DATA>) {
        my ($c, $d, $e, $f) = split /\s+/, $_;
        print "$c\t$d\t$e\t$f\n";
    }
    __DATA__
    1   1234 gg123456789 000-12345-1234-111
    12  1234 gg123456789 000-12345-1234-111
    123 1234 gg123456789 000-12345-1234-111


    TStanley
    --------
    The only thing necessary for the triumph of evil is for good men to do nothing -- Edmund Burke
Re: Seeking best approach to column parsing
by b10m (Vicar) on Nov 25, 2003 at 17:31 UTC
    Use something like:
    my @entries = split(/\s+/, $_);
    --
    B10m
Re: Seeking best approach to column parsing
by Art_XIV (Hermit) on Nov 25, 2003 at 18:21 UTC

    You don't even have to use the regex inside of a split:

    use strict;

    while (<DATA>) {
        my @elements = split;
        print ">", join(':', @elements), "<\n";
    }
    1;
    __DATA__
    1   1234 gg123456789 000-12345-1234-111
    12  1234 gg123456789 000-12345-1234-111
    123 1234 gg123456789 000-12345-1234-111

    split's behavior with no args will do what you want, with the bonus of ignoring leading/trailing whitespace. See perlfunc.
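
    To make the leading-whitespace difference concrete, here is a small sketch comparing an explicit /\s+/ pattern with the special ' ' pattern that a bare split uses. The sample line is made up for illustration; it is not from the original file.

    ```perl
    use strict;
    use warnings;

    # Made-up sample line with leading whitespace and a trailing newline.
    my $line = "  12   1234 gg123456789 000-12345-1234-111\n";

    # An explicit /\s+/ pattern yields an empty first field when the
    # line starts with whitespace.
    my @with_regex = split /\s+/, $line;

    # The special pattern ' ' (what a bare "split" uses) skips leading
    # whitespace, so no empty field appears.
    my @awk_style = split ' ', $line;

    print scalar(@with_regex), " fields with /\\s+/\n";
    print scalar(@awk_style), " fields with ' '\n";
    ```

    Run as written, this should report 5 fields for /\s+/ (the first one empty) and 4 for ' '. Trailing empty fields are dropped in both cases, so the newline causes no trouble.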

    Hanlon's Razor - "Never attribute to malice that which can be adequately explained by stupidity"
Re: Seeking best approach to column parsing
by Roger (Parson) on Nov 26, 2003 at 01:14 UTC
    My favorite method - build a two dimensional array in one go.
    use strict;
    use Data::Dumper;

    my @elements = map { /^\s*$/ ? () : [split /\s+/] } <DATA>;
    print Dumper(\@elements);
    __DATA__
    1   1234 gg123456789 000-12345-1234-111
    12  1234 gg123456789 000-12345-1234-111
    123 1234 gg123456789 000-12345-1234-111
    And the output -
    $VAR1 = [
              [
                '1',
                '1234',
                'gg123456789',
                '000-12345-1234-111'
              ],
              [
                '12',
                '1234',
                'gg123456789',
                '000-12345-1234-111'
              ],
              [
                '123',
                '1234',
                'gg123456789',
                '000-12345-1234-111'
              ]
            ];
Re: Seeking best approach to column parsing
by vivapl (Acolyte) on Nov 25, 2003 at 17:36 UTC
    Thanks guys, that did the trick. I guess I tend to forget these things.
    Thanks again