Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi, as a newbie I'm struggling to write a bit of code that will handle the following situation. Any help appreciated
I've inherited a system that produces files in the format field|field|field e.g.
aaa|bbb|ccc ddd|eee|fff ddd|eee|xxxxx hhh|iiii|jjjjjj
I need to be able to amalgamate the records where field1 and field2 are the same into one record like this :-
aaa|bbb|ccc ddd|eee|fff,xxxxx hhh|iiii|jjjjjj
So the 3rd field becomes a comma delimited amalgamation of the records where field1 and field2 are the same
Thanks in advance

Replies are listed 'Best First'.
Re: amalgamate similar lines
by Limbic~Region (Chancellor) on Jan 09, 2006 at 13:55 UTC
    Anonymous Monk,
    You likely want to be using a CSV parsing module like Text::CSV_XS or Text::x_SV, but for this example I will be using split. I have chosen to use split because I have made several assumptions about your problem.

    Assumptions:

    • Each record is contained on a single line
    • Each record is pipe delimited and no field contains any imbedded delimiters
    • Each record is comprised of 3 fields
    • Preservation of record ordering is important
    • 2 or more records with the first 2 fields in common are desired to be joined
    • These records may be anywhere in the file and are not necessarily adjacent
    • Joining records means concatenating the 3rd fields with commas in the order the records appeared in the file
    • Concattenated records in the output will be identified by commas in the 3rd field. This assumes no commas appear in the 3rd field prior to merging.
    • The joined record will appear at the first occurence in the output
    • The machine running the program will have sufficient memory to hold required information in memory

    #!/usr/bin/perl use strict; use warnings; my $input = $ARGV[0] || 'sample.txt'; open(my $fh, '<', $input) or die "Unable to open $input for reading: $ +!"; my %data; while ( <$fh> ) { chomp; my @field = split /\|/, $_, 3; my $key = join '|', @field[0,1]; $data{$key}{line} = $. if ! exists $data{$key}; push @{ $data{$key}{records} }, $field[2]; } for ( sort { $data{$a}{line} <=> $data{$b}{line} } keys %data ) { if ( @{ $data{$_}{records} } > 1 ) { my $field3 = join ',', @{ $data{$_}{records} }; print join '|', $_, $field3; } else { print join '|', $_, $data{$_}{records}[0]; } print "\n"; }
    Please forgive me for the rather tedious solution. I wanted to point out the importance of clearly and concisely stating the problem and assumptions.

    Cheers - L~R

    Update: Simplified code and clarified assumptions
      • The original 3rd field will not contain commas
      I don't see where you assume that; AFAICT your solution will work whether or not that's true. Perhaps you are just pointing out that the operation will not be reversable if there are existing commas?
      • The joined record will appear at the first occurence
      Implicitly, you are also assuming that records should be merged regardless of their position in the file; it's possible that only adjacent records should be candidates for merging.
        ysth,
        With regards to the first assumption you called into question, that should have read:

        Concattenated records in the output will be identified by commas in the 3rd field. This assumes no commas appear in the 3rd field prior to merging. updated

        With regards to second assumption you mentioned. You are correct that since the AM only stated where the first two fields were the same that I assumed that meant they could appear anywhere in the file. That is the point of my post - to clearly state what is desired.

        Cheers - L~R

Re: amalgamate similar lines
by g0n (Priest) on Jan 09, 2006 at 13:58 UTC
    Fore!

    while (<DATA>) { my ($onetwo,$three) = $_=~/(.*)\|(.*)/; push @{$hash{$onetwo}},$three; } for (sort keys %hash) { print $_."|",join ",",@{$hash{$_}}; print "\n"; } __DATA__ aaa|bbb|ccc ddd|eee|fff ddd|eee|xxxxx hhh|iiii|jjjjjj

    --------------------------------------------------------------

    "If there is such a phenomenon as absolute evil, it consists in treating another human being as a thing."

    John Brunner, "The Shockwave Rider".

Re: amalgamate similar lines
by smokemachine (Hermit) on Jan 09, 2006 at 13:36 UTC
    perl -ne 'chomp; $hash{$1}.=$2."," if /^([^\|]+\|[^\|]+)\|([^\|]+)$/; +END{chop$hash{$_} foreach keys %hash; print "$_|$hash{$_}\n" foreach +keys %hash}' file_name
Re: amalgamate similar lines
by wfsp (Abbot) on Jan 09, 2006 at 13:46 UTC
    Here's my go:
    #!/bin/perl5 use strict; use warnings; use Data::Dumper; my %data; while (my $record = <DATA>){ chomp $record; my ($fld1, $fld2, $fld3) = split /\|/, $record; push @{$data{$fld1}{$fld2}}, $fld3; } for my $fld1 (sort keys %data){ for my $fld2 (keys %{$data{$fld1}}){ print "$fld1|$fld2|", join(',', @{$data{$fld1}{$fld2}}), "\n"; } } __DATA__ aaa|bbb|ccc ddd|eee|fff ddd|eee|xxxxx hhh|iiii|jjjjjj
    output:
    ---------- Capture Output ---------- > "C:\Perl\bin\perl.exe" _new.pl aaa|bbb|ccc ddd|eee|fff,xxxxx hhh|iiii|jjjjjj > Terminated with exit code 0.
Re: amalgamate similar lines
by McDarren (Abbot) on Jan 09, 2006 at 13:52 UTC
    I'd use the first two fields as a key to a hash, something like the following:
    #!/usr/bin/perl -w use strict; use Data::Dumper::Simple; my %first_two; while (<DATA>) { chomp; my ($key, $remainder) = $_ =~ /([a-z]+\|[a-z]+)\|([a-z]+)/; if (exists $first_two{$key}) { $first_two{$key} .= ",$remainder"; } else { $first_two{$key} = $_; } } print Dumper(%first_two); __DATA__ aaa|bbb|ccc ddd|eee|fff ddd|eee|xxxxx hhh|iiii|jjjjjj

    Which gives:

    %first_two = ( 'aaa|bbb' => 'aaa|bbb|ccc', 'hhh|iiii' => 'hhh|iiii|jjjjjj', 'ddd|eee' => 'ddd|eee|fff,xxxxx' );

    There are probably many more elegant ways to do it, but that's just the first thing that occurred to me.

    Cheers,
    Darren :)

Re: amalgamate similar lines
by Aristotle (Chancellor) on Jan 09, 2006 at 19:09 UTC
    my @line; my %third_col; while( <> ) { chomp; my @col = split /\|/, $_, -1; my $key = join '|', @col[ 0, 1 ]; push @line, $key if not exists %third_col{ $key }; push @{ $third_col{ $key } }, $col[ 2 ]; } for ( @line ) { print $_ . '|' . join ',', @{ $third_col{ $_ } }; }

    Update: fixed a sigil, thanks to Roy Johnson.

    Makeshifts last the longest.

Re: amalgamate similar lines
by Perl Mouse (Chaplain) on Jan 09, 2006 at 13:39 UTC
    Untested:
    my %info; while (<>) { my ($key, $value) = /([^|]*\|[^|]*)\|(.*)/; push @{$info{$key}}, $value; } while (my ($key, $value) = each %info) { local $" = ","; print "$key|@$value\n"; } __END__
    Perl --((8:>*
        Yes - I don't think the OP stated any requirements about the order. A required ordering is easily added without changing the gist of the solution.

        And I'm not going to speculate what the OP wants or not.

        Perl --((8:>*
Re: amalgamate similar lines
by DungeonKeeper (Novice) on Jan 09, 2006 at 14:22 UTC
    my %h = (); while(<>) { chop; my @fld = split( /\|/ ); my $key = shift @fld . "|" . shift @fld; $h{ $key } &&= $h{ key } . ','; $h{ $key } .= shift @fld; shift @fld and die "Analysis error: too many fields $_\n"; } for my $k ( keys %h ) { print $k . '|' . $h { $k } . "\n"; }

    Everything but the troll

Re: amalgamate similar lines
by Roy Johnson (Monsignor) on Jan 09, 2006 at 19:50 UTC
    my @lines; my %seen; while (<DATA>) { chomp; my ($onetwo,$three) = $_=~/(.*)\|(.*)/; if ($seen{$onetwo}) { $lines[$seen{$onetwo} - 1] .= ",$three"; } else { $seen{$onetwo} = push @lines, $_; } } print "$_\n" for @lines; __DATA__ aaa|bbb|ccc ddd|eee|fff ddd|eee|xxxxx hhh|iiii|jjjjjj

    Caution: Contents may have been coded under pressure.