in reply to amalgamate similar lines

Anonymous Monk,
You likely want to be using a CSV parsing module like Text::CSV_XS or Text::xSV, but for this example I will use split, because I have made several assumptions about your problem.

Assumptions:

#!/usr/bin/perl
use strict;
use warnings;

my $input = $ARGV[0] || 'sample.txt';
open(my $fh, '<', $input) or die "Unable to open $input for reading: $!";

my %data;
while ( <$fh> ) {
    chomp;
    my @field = split /\|/, $_, 3;
    my $key = join '|', @field[0,1];
    $data{$key}{line} = $. if ! exists $data{$key};
    push @{ $data{$key}{records} }, $field[2];
}

for ( sort { $data{$a}{line} <=> $data{$b}{line} } keys %data ) {
    if ( @{ $data{$_}{records} } > 1 ) {
        my $field3 = join ',', @{ $data{$_}{records} };
        print join '|', $_, $field3;
    }
    else {
        print join '|', $_, $data{$_}{records}[0];
    }
    print "\n";
}

Please forgive me for the rather tedious solution. I wanted to point out the importance of clearly and concisely stating the problem and assumptions.

Cheers - L~R

Update: Simplified code and clarified assumptions

Re^2: amalgamate similar lines
by ysth (Canon) on Jan 09, 2006 at 14:05 UTC
    • The original 3rd field will not contain commas
    I don't see where you assume that; AFAICT your solution will work whether or not that's true. Perhaps you are just pointing out that the operation will not be reversible if there are existing commas?
    • The joined record will appear at the first occurrence
    Implicitly, you are also assuming that records should be merged regardless of their position in the file; it's possible that only adjacent records should be candidates for merging.
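
To make the adjacent-only reading concrete, here is a minimal sketch of how it might look. It is not the code from the parent post: merge_adjacent is a hypothetical helper name, the sample lines are made up, and it assumes every line has at least three pipe-separated fields.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: merge only runs of consecutive lines whose
# first two fields match. Assumes each line has at least 3 fields.
sub merge_adjacent {
    my @lines = @_;
    my @out;    # each element: [ key, [ third fields in this run ] ]
    for my $line (@lines) {
        chomp $line;
        my @field = split /\|/, $line, 3;
        my $key = join '|', @field[0,1];
        if ( @out && $out[-1][0] eq $key ) {
            # Same key as the previous line: extend the current run
            push @{ $out[-1][1] }, $field[2];
        }
        else {
            # Different key: start a new run, even if the key was seen earlier
            push @out, [ $key, [ $field[2] ] ];
        }
    }
    return map { join '|', $_->[0], join( ',', @{ $_->[1] } ) } @out;
}

# Made-up sample data: the last a|b line is not adjacent to the first two
print "$_\n" for merge_adjacent( "a|b|1\n", "a|b|2\n", "c|d|3\n", "a|b|4\n" );
```

With this reading the final a|b record stays on its own line, whereas a hash keyed on the first two fields would fold it into the first a|b line.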
      ysth,
      With regard to the first assumption you called into question, that should have read:

      Concatenated records in the output will be identified by commas in the 3rd field. This assumes no commas appear in the 3rd field prior to merging. (updated)
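
If that assumption is in doubt, a pre-check along these lines could flag offending input before any merging happens. This is only a sketch: lines_with_commas is a hypothetical helper name and the sample lines are invented for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the 1-based positions of lines whose
# 3rd field already contains a comma before any merging is done.
sub lines_with_commas {
    my @lines = @_;
    my @bad;
    for my $i ( 0 .. $#lines ) {
        my @field = split /\|/, $lines[$i], 3;
        push @bad, $i + 1 if defined $field[2] && $field[2] =~ /,/;
    }
    return @bad;
}

# Invented sample data: only the second line violates the assumption
print "comma already present on line $_\n"
    for lines_with_commas( "a|b|1", "a|b|2,3", "c|d|4" );
```

Any line it reports would make a merged record indistinguishable from an unmerged one in the output.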

      With regard to the second assumption you mentioned, you are correct: since the AM only stated that the first two fields were the same, I assumed that matching records could appear anywhere in the file. That is the point of my post - to clearly state what is desired.

      Cheers - L~R