amalgamate similar lines

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: amalgamate similar lines by Limbic~Region (Chancellor) on Jan 09, 2006 at 13:55 UTC
Anonymous Monk,

You likely want to be using a CSV parsing module like Text::CSV_XS or Text::x_SV, but for this example I will be using split. I have chosen split because I have made several assumptions about your problem.

Assumptions:

- Each record is contained on a single line
- Each record is pipe delimited and no field contains any embedded delimiters
- Each record consists of 3 fields
- Preservation of record ordering is important
- 2 or more records with the first 2 fields in common are desired to be joined
- These records may be anywhere in the file and are not necessarily adjacent
- Joining records means concatenating the 3rd fields with commas in the order the records appeared in the file
- Concatenated records in the output will be identified by commas in the 3rd field. This assumes no commas appear in the 3rd field prior to merging.
- The joined record will appear at the first occurrence in the output
- The machine running the program will have sufficient memory to hold the required information

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $input = $ARGV[0] || 'sample.txt';
open(my $fh, '<', $input) or die "Unable to open $input for reading: $!";

my %data;
while ( <$fh> ) {
    chomp;
    my @field = split /\|/, $_, 3;
    my $key = join '|', @field[0,1];
    # Remember the line number where each key was first seen
    $data{$key}{line} = $. if ! exists $data{$key};
    push @{ $data{$key}{records} }, $field[2];
}

# Output in order of first occurrence
for ( sort { $data{$a}{line} <=> $data{$b}{line} } keys %data ) {
    if ( @{ $data{$_}{records} } > 1 ) {
        my $field3 = join ',', @{ $data{$_}{records} };
        print join '|', $_, $field3;
    }
    else {
        print join '|', $_, $data{$_}{records}[0];
    }
    print "\n";
}
```

Please forgive me for the rather tedious solution. I wanted to point out the importance of clearly and concisely stating the problem and assumptions.

Cheers - L~R

Update: Simplified code and clarified assumptions
Re^2: amalgamate similar lines by ysth (Canon) on Jan 09, 2006 at 14:05 UTC
"The original 3rd field will not contain commas"

I don't see where you assume that; AFAICT your solution will work whether or not that's true. Perhaps you are just pointing out that the operation will not be reversible if there are existing commas?

"The joined record will appear at the first occurrence"

Implicitly, you are also assuming that records should be merged regardless of their position in the file; it's possible that only adjacent records should be candidates for merging.
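For what it's worth, the "adjacent only" reading can be sketched roughly as follows. This assumes chomped, pipe-delimited lines with exactly 3 fields; the sub name merge_adjacent and the fourth sample record are made up for illustration, and nothing in the thread says this is what the OP wants.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: merge only *adjacent* records whose first two
# pipe-delimited fields match. Assumes exactly 3 fields per line.
sub merge_adjacent {
    my @lines = @_;
    my @out;
    my $prev_key;
    for my $line (@lines) {
        my @field = split /\|/, $line, 3;
        my $key   = join '|', @field[0,1];
        if ( defined $prev_key and $key eq $prev_key ) {
            $out[-1] .= ",$field[2]";    # same key as previous line: append
        }
        else {
            push @out, $line;            # different key: start a new record
        }
        $prev_key = $key;
    }
    return @out;
}

# The last record repeats an earlier key but is not adjacent to it,
# so it is left as a separate line.
print "$_\n" for merge_adjacent(
    'aaa|bbb|ccc', 'ddd|eee|fff', 'ddd|eee|xxxxx', 'aaa|bbb|zzz',
);
```

Only the previous key has to be remembered here, so unlike the hash-based solutions this does not need to hold the whole file in memory.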
Re^3: amalgamate similar lines by Limbic~Region (Chancellor) on Jan 09, 2006 at 14:11 UTC
ysth,

With regard to the first assumption you called into question, that should have read:

"Concatenated records in the output will be identified by commas in the 3rd field. This assumes no commas appear in the 3rd field prior to merging."

Updated.

With regard to the second assumption you mentioned: you are correct. Since the AM only stated that the first two fields were the same, I assumed the matching records could appear anywhere in the file. That is the point of my post - to clearly state what is desired.

Cheers - L~R
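If the "no commas in the 3rd field" assumption matters, it can be checked before merging. A minimal sketch, assuming the lines are already chomped; lines_with_commas is a hypothetical name and the record containing a comma is invented test data:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical pre-check: return the 1-based numbers of the lines whose
# 3rd field already contains a comma, i.e. where a comma in the merged
# output would no longer reliably mark a join.
sub lines_with_commas {
    my @lines = @_;
    my @bad;
    for my $i ( 0 .. $#lines ) {
        my @field = split /\|/, $lines[$i], 3;
        push @bad, $i + 1 if defined $field[2] and $field[2] =~ /,/;
    }
    return @bad;
}

my @bad = lines_with_commas( 'aaa|bbb|ccc', 'ddd|eee|f,ff', 'hhh|iiii|jjjjjj' );
warn "Field 3 already contains a comma on line(s): @bad\n" if @bad;
```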
Re: amalgamate similar lines by g0n (Priest) on Jan 09, 2006 at 13:58 UTC
Fore!

```perl
while (<DATA>) {
    my ($onetwo,$three) = $_=~/(.*)\|(.*)/;
    push @{$hash{$onetwo}},$three;
}
for (sort keys %hash) {
    print $_."|",join ",",@{$hash{$_}};
    print "\n";
}
__DATA__
aaa|bbb|ccc
ddd|eee|fff
ddd|eee|xxxxx
hhh|iiii|jjjjjj
```

--------------------------------------------------------------
"If there is such a phenomenon as absolute evil, it consists in treating another human being as a thing." John Brunner, "The Shockwave Rider".
Re: amalgamate similar lines by smokemachine (Hermit) on Jan 09, 2006 at 13:36 UTC
`perl -ne 'chomp; $hash{$1}.=$2."," if /^([^\|]+\|[^\|]+)\|([^\|]+)$/; END{chop$hash{$_} foreach keys %hash; print "$_|$hash{$_}\n" foreach keys %hash}' file_name`
Re: amalgamate similar lines by wfsp (Abbot) on Jan 09, 2006 at 13:46 UTC
Here's my go:

```perl
#!/bin/perl5
use strict;
use warnings;
use Data::Dumper;

my %data;
while (my $record = <DATA>){
    chomp $record;
    my ($fld1, $fld2, $fld3) = split /\|/, $record;
    push @{$data{$fld1}{$fld2}}, $fld3;
}

for my $fld1 (sort keys %data){
    for my $fld2 (keys %{$data{$fld1}}){
        print "$fld1|$fld2|", join(',', @{$data{$fld1}{$fld2}}), "\n";
    }
}
__DATA__
aaa|bbb|ccc
ddd|eee|fff
ddd|eee|xxxxx
hhh|iiii|jjjjjj
```

output:

```
---------- Capture Output ----------
> "C:\Perl\bin\perl.exe" _new.pl
aaa|bbb|ccc
ddd|eee|fff,xxxxx
hhh|iiii|jjjjjj
> Terminated with exit code 0.
```
Re: amalgamate similar lines by McDarren (Abbot) on Jan 09, 2006 at 13:52 UTC
I'd use the first two fields as a key to a hash, something like the following:

```perl
#!/usr/bin/perl -w
use strict;
use Data::Dumper::Simple;

my %first_two;
while (<DATA>) {
    chomp;
    my ($key, $remainder) = $_ =~ /([a-z]+\|[a-z]+)\|([a-z]+)/;
    if (exists $first_two{$key}) {
        $first_two{$key} .= ",$remainder";
    }
    else {
        $first_two{$key} = $_;
    }
}
print Dumper(%first_two);
__DATA__
aaa|bbb|ccc
ddd|eee|fff
ddd|eee|xxxxx
hhh|iiii|jjjjjj
```

Which gives:

```
%first_two = (
    'aaa|bbb' => 'aaa|bbb|ccc',
    'hhh|iiii' => 'hhh|iiii|jjjjjj',
    'ddd|eee' => 'ddd|eee|fff,xxxxx'
);
```

There are probably many more elegant ways to do it, but that's just the first thing that occurred to me.

Cheers,
Darren :)
Re: amalgamate similar lines by Aristotle (Chancellor) on Jan 09, 2006 at 19:09 UTC
```perl
my @line;
my %third_col;

while( <> ) {
    chomp;
    my @col = split /\|/, $_, -1;
    my $key = join '|', @col[ 0, 1 ];
    # Record each key once, in order of first appearance
    push @line, $key if not exists $third_col{ $key };
    push @{ $third_col{ $key } }, $col[ 2 ];
}

for ( @line ) {
    # chomp removed the newline, so add it back on output
    print $_ . '|' . join( ',', @{ $third_col{ $_ } } ), "\n";
}
```

Update: fixed a sigil, thanks to Roy Johnson.

Makeshifts last the longest.
Re: amalgamate similar lines by Perl Mouse (Chaplain) on Jan 09, 2006 at 13:39 UTC
Untested:

```perl
my %info;
while (<>) {
    my ($key, $value) = /([^|]*\|[^|]*)\|(.*)/;
    push @{$info{$key}}, $value;
}
while (my ($key, $value) = each %info) {
    local $" = ",";
    print "$key|@$value\n";
}
__END__
```

`Perl --((8:>*`
Re^2: amalgamate similar lines by ysth (Canon) on Jan 09, 2006 at 13:57 UTC
This solution (and smokemachine's and DungeonKeeper's) outputs the lines in semi-random order.
Re^3: amalgamate similar lines by Perl Mouse (Chaplain) on Jan 09, 2006 at 14:47 UTC
Yes - I don't think the OP stated any requirements about the order. A required ordering is easily added without changing the gist of the solution. And I'm not going to speculate what the OP wants or not.

`Perl --((8:>*`
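As a rough illustration of how little it takes to add an ordering, here is the same hash-of-arrays idea with one extra array that records keys in first-seen order. This is a sketch only: merge_in_order is a hypothetical name, the input is assumed to be chomped, and "first-seen order" is just one possible requirement.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: group 3rd fields by the first two fields and
# return the merged lines in the order the keys first appeared.
sub merge_in_order {
    my @lines = @_;
    my ( %info, @order );
    for my $line (@lines) {
        my ( $key, $value ) = $line =~ /^([^|]*\|[^|]*)\|(.*)$/
            or next;                     # skip lines without 3 fields
        push @order, $key unless exists $info{$key};
        push @{ $info{$key} }, $value;
    }
    my @out;
    for my $key (@order) {
        push @out, join '|', $key, join( ',', @{ $info{$key} } );
    }
    return @out;
}

print "$_\n" for merge_in_order(
    'hhh|iiii|jjjjjj', 'ddd|eee|fff', 'aaa|bbb|ccc', 'ddd|eee|xxxxx',
);
```

Swapping `@order` for `sort keys %info` would give alphabetical output instead, which is the other obvious candidate.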
Re: amalgamate similar lines by DungeonKeeper (Novice) on Jan 09, 2006 at 14:22 UTC
```perl
my %h = ();
while(<>) {
    chomp;
    my @fld = split( /\|/ );
    # Parentheses are needed: shift binds looser than the . operator
    my $key = shift( @fld ) . "|" . shift( @fld );
    # If the key was seen before, add a separator before appending
    $h{ $key } &&= $h{ $key } . ',';
    $h{ $key } .= shift @fld;
    shift @fld and die "Analysis error: too many fields $_\n";
}
for my $k ( keys %h ) {
    print $k . '|' . $h{ $k } . "\n";
}
```

Everything but the troll
Re: amalgamate similar lines by Roy Johnson (Monsignor) on Jan 09, 2006 at 19:50 UTC
```perl
my @lines;
my %seen;
while (<DATA>) {
    chomp;
    my ($onetwo,$three) = $_=~/(.*)\|(.*)/;
    if ($seen{$onetwo}) {
        # %seen holds the 1-based position of the key's line in @lines
        $lines[$seen{$onetwo} - 1] .= ",$three";
    }
    else {
        $seen{$onetwo} = push @lines, $_;
    }
}
print "$_\n" for @lines;
__DATA__
aaa|bbb|ccc
ddd|eee|fff
ddd|eee|xxxxx
hhh|iiii|jjjjjj
```

Caution: Contents may have been coded under pressure.