All,
This really isn't cute, but is more of a pattern I use quite frequently. It is simple and obvious, and yet I see some people stumble with it, so I thought I would share.

The general problem is a file where each line consists of identifiable fields but a "record" spans multiple lines. There is some key field whose value is repeated on each line of the record. The process typically starts with externally sorting the file so that all the lines of a record are adjacent.
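
For example (the comma-delimited layout and file names here are invented purely for illustration), the input might look like the lines below, with the key in the first field, and an external sort(1) on that field groups the records before the Perl script ever sees them:

    ord1001,widget,3
    ord1002,gadget,1
    ord1001,gizmo,7

    sort -t, -k1,1 unsorted.txt > sorted.txt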

Then it is simply a matter of pushing all the matching lines into an array and processing the array as soon as all the lines for that record have been read.

#!/usr/bin/perl
use strict;
use warnings;

my $file = $ARGV[0] or die "Usage: $0 <input_file>";
open(my $fh, '<', $file) or die "Unable to open '$file' for reading: $!";

my ($curr_key, @rec) = ('', ());
while (<$fh>) {
    chomp;
    my $entry = parse_line($_);
    if ($entry->{key} ne $curr_key) {
        # key changed, so the previous record is complete
        process_rec($curr_key, \@rec);
        ($curr_key, @rec) = ($entry->{key}, $entry);
    }
    else {
        push @rec, $entry;
    }
}
# flush the final record
process_rec($curr_key, \@rec);

sub parse_line {
    my ($line) = @_;
    my %entry;
    # ...
    return \%entry;
}

sub process_rec {
    my ($key, $rec) = @_;
    return if ! @$rec;    # nothing to process on the very first line
    # ...
}
Of course, parse_line() is usually overkill because the line is delimited in such a way that split is sufficient (a sketch of that follows the list below). Here are some things that may not be so obvious:

- process_rec() has to be called one final time after the loop, otherwise the last record in the file is silently dropped.
- The first line of the file never matches the initial $curr_key of '', so process_rec() guards against being handed an empty record (return if ! @$rec).
- The loop only compares adjacent lines, so the whole approach depends on the file having been sorted (or at least grouped) by the key field first.
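
As a rough sketch only (the three-field comma-delimited layout and the tallying in process_rec() are made up for this example, not part of the code above), parse_line() might reduce to a single split and process_rec() might just summarize each record:

    sub parse_line {
        my ($line) = @_;
        # assumes lines like KEY,NAME,VALUE - adjust the delimiter
        # and field names to match the real file
        my %entry;
        @entry{qw(key name value)} = split /,/, $line, 3;
        return \%entry;
    }

    sub process_rec {
        my ($key, $rec) = @_;
        return if ! @$rec;
        # hypothetical processing: report how many lines made up the record
        printf "%s: %d line(s)\n", $key, scalar @$rec;
    }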

Cheers - L~R

Re: Process Records Spread Across Multiple Lines
by ig (Vicar) on Feb 16, 2011 at 09:09 UTC

    What you wrote is easy to read but, for me, the following is easier...

    my @rec;
    while (<$fh>) {
        my $entry = parse_line($_);
        if (@rec and $entry->{key} ne $rec[0]->{key}) {
            process_rec(\@rec);
            @rec = ();
        }
        push @rec, $entry;
    }
    process_rec(\@rec);

    sub parse_line {
        my ($line) = @_;
        chomp($line);
        my %entry;
        # ...
        return \%entry;
    }

    sub process_rec {
        my ($rec) = @_;
        return if ! @$rec;
        # ...
    }