Find duplicate fields and merging data in a text file

donkost has asked for the wisdom of the Perl Monks concerning the following question:

I have been struggling with this for the past couple of days. I'm hoping someone can provide some code to do the following: I receive comma delimited files that have 3 fields; part number , quantity and designator. The part number field may or may not be duplicated. If it is duplicated I need to merge the remaining 2 fields together adding the quantities together and sorting and adding the designator field so it is in increasing order. I then need to write this back out to a comma delimited file.
Here is a sample on the input file:

032-00751-0000,1,R383
032-00794-0000,6,"RP1,RP2,RP3,RP22,RP24,RP26"
032-00795-0000,8,"RP10,RP11,RP12,RP13,RP14,RP15,RP16,RP17"
032-00804-0000,7,"R7,R14,R21,R23,R41,R42,R49"
032-00807-0000,2,"RP18,RP19"
032-00807-0000,4,"RP8,RP9,RP200,RP201"
032-00808-0000,3,"RP21,RP23,RP25"
032-00820-0000,5,"R966,R970,R971,R1041,R1076"
032-00820-0000,1,R3000
032-00893-0000,1,R1164
[download]

Note that there are 2 duplicates 032-00807-0000 and 032-00820-0000. Also note that if the designator field contains only one item there are no "". Here is what the output should look like:

032-00751-0000,1,R383
032-00794-0000,6,"RP1,RP2,RP3,RP22,RP24,RP26"
032-00795-0000,8,"RP10,RP11,RP12,RP13,RP14,RP15,RP16,RP17"
032-00804-0000,7,"R7,R14,R21,R23,R41,R42,R49"
032-00807-0000,6,"RP8,RP9,RP18,RP19,RP200,RP201"
032-00808-0000,3,"RP21,RP23,RP25"
032-00820-0000,6,"R966,R970,R971,R1041,R1076,R3000"
0032-00893-0000,1,R1164
[download]

Also the designator field can get pretty big, quantities around 2000+ are not unheard of.
Any insight in how to tackle this would be appreciated.
Thanks!
Don...

Comment on Find duplicate fields and merging data in a text file Select or Download Code

Replies are listed 'Best First'.
Re: Find duplicate fields and merging data in a text file by toolic (Bishop) on Apr 29, 2010 at 20:02 UTC
Any insight in how to tackle this would be appreciated. I recommend using Text::CSV_XS to parse the input file. If duplicates are always on consecutive lines, then you can just remember the previous line, then combine the lines once they are parsed. If you are having trouble with any specific portion of your code, post it here.	[reply]
Re: Find duplicate fields and merging data in a text file by moritz (Cardinal) on Apr 29, 2010 at 19:53 UTC
What have you tried? Lots of questions here talk about finding and removing duplicates, I'd guess Super Search should find at least one helpful candidate.	[reply]
Re: Find duplicate fields and merging data in a text file by jwkrahn (Abbot) on Apr 29, 2010 at 20:45 UTC
`while ( <> ) { my ( $part_num, undef, $designator ) = split /,/, $_, 3 or next; push @{ $data{ $part_num } }, $designator =~ /[^",\s]+/g; } for my $part_num ( sort keys %data ) { my $quantity = @{ $data{ $part_num } }; print "$part_num,$quantity,", $quantity == 1 ? '' : '"', join( ',', sort { my ( $aL, $aR ) = $a =~ /(\D+)(\d+)/; my ( $bL, $bR ) = $b =~ /(\D+)(\d+)/; $aL cmp $bL \|\| $aR <=> $bR } @{ $data{ $part_num } } ), $quantity == 1 ? '' : '"', "\n"; }` [download]	[reply] [d/l]
Re: Find duplicate fields and merging data in a text file by Marshall (Canon) on Apr 30, 2010 at 01:18 UTC
I liked the hash based solutions including jwkrahn's. I don't now how huge this file could get, but if it is horrifically huge, then you might have to process it line by line due to memory constraints. Below, I show one way of doing that assuming that lines are sorted which is often not a bad assumption as the command line sort utilities can sort humongous files very efficiently. Code would have to be a bit more complex if more than 2 duplicate lines were there and needed to be combined on the first pass although running the program again would pick up the "3rd" one on the second pass. Note I did not "save" the number of descriptions as this is easily produced by Perl by evaluation of @var in a scalar context. #!/usr/bin/perl -w use strict; my $prev_line=(); while (<DATA>) { if (!$prev_line){$prev_line = $_; next} my ($prev_num, $prev_desc_txt) = (split(/,/,$prev_line,3))[0,2]; my ($num, $desc_text) = (split(/,/,$_,3))[0,2]; if ($prev_num eq $num) #combine prev and current descriptions { my $new_desc = "$prev_desc_txt $desc_text"; my @new_desc = ($new_desc =~m/(\w+)/g); @new_desc = sort { #thanks to jwkrahn for sort my ( $aL, $aR ) = $a =~ /(\D+)(\d+)/; my ( $bL, $bR ) = $b =~ /(\D+)(\d+)/; $aL cmp $bL or $aR <=> $bR } @new_desc; print "$num,".@new_desc.",\"", join(',',@new_desc),"\"\n"; #note .@new_desc forces scalar context (num elements) $prev_line =(); } else #prev_line is a "singleton" { print $prev_line; $prev_line = $_; } } print $prev_line if ($prev_line); #maybe a "hanger on" =prints: 032-00751-0000,1,R383 032-00794-0000,6,"RP1,RP2,RP3,RP22,RP24,RP26" 032-00795-0000,8,"RP10,RP11,RP12,RP13,RP14,RP15,RP16,RP17" 032-00804-0000,7,"R7,R14,R21,R23,R41,R42,R49" 032-00807-0000,6,"RP8,RP9,RP18,RP19,RP200,RP201" 032-00808-0000,3,"RP21,RP23,RP25" 032-00820-0000,6,"R966,R970,R971,R1041,R1076,R3000" 032-00893-0000,1,R1164 =cut __DATA__ 032-00751-0000,1,R383 032-00794-0000,6,"RP1,RP2,RP3,RP22,RP24,RP26" 032-00795-0000,8,"RP10,RP11,RP12,RP13,RP14,RP15,RP16,RP17" 032-00804-0000,7,"R7,R14,R21,R23,R41,R42,R49" 032-00807-0000,2,"RP18,RP19" 032-00807-0000,4,"RP8,RP9,RP200,RP201" 032-00808-0000,3,"RP21,RP23,RP25" 032-00820-0000,5,"R966,R970,R971,R1041,R1076" 032-00820-0000,1,R3000 032-00893-0000,1,R1164 [download]	[reply] [d/l]
Re^2: Find duplicate fields and merging data in a text file by donkost (Initiate) on Apr 30, 2010 at 14:27 UTC
Thanks to everyone for your ideas. I will working on this today once I get the validation code running right. You guys are amazing! don...	[reply]
Re: Find duplicate fields and merging data in a text file by choroba (Cardinal) on Apr 29, 2010 at 20:20 UTC
This code should work. It creates a hash, the part number being the key. The value is another hash, containing two keys, num and des, whose values are the quantity and the list of designators. #!/usr/bin/perl use strict; use warnings; my %parts; while(my $line = <DATA>){ chomp $line; $line =~ s/"//g; my ($part,$num,@des) = split /,/,$line; $parts{$part}{num} += $num; $parts{$part}{des} = {} unless exists $parts{$part}{des}; my %des = %{ $parts{$part}{des} }; @des{@des} = (); %{ $parts{$part}{des} } = %des; } foreach my $part_id (sort keys %parts){ my $part = $parts{$part_id}; my $des_string = join ',', sort keys %{ $part->{des} }; $des_string = '"'.$des_string.'"' if $des_string =~ /,/; print join ',',$part_id,$part->{num},$des_string; print "\n"; } __DATA__ 032-00751-0000,1,R383 032-00794-0000,6,"RP1,RP2,RP3,RP22,RP24,RP26" 032-00795-0000,8,"RP10,RP11,RP12,RP13,RP14,RP15,RP16,RP17" 032-00804-0000,7,"R7,R14,R21,R23,R41,R42,R49" 032-00807-0000,2,"RP18,RP19" 032-00807-0000,4,"RP8,RP9,RP200,RP201" 032-00808-0000,3,"RP21,RP23,RP25" 032-00820-0000,5,"R966,R970,R971,R1041,R1076" 032-00820-0000,1,R3000 032-00893-0000,1,R1164 [download] Update: designators sorted.	[reply] [d/l]
Re^2: Find duplicate fields and merging data in a text file by jwkrahn (Abbot) on Apr 30, 2010 at 02:31 UTC
`$parts{$part}{des} = {} unless exists $parts{$part}{des}; my %des = %{ $parts{$part}{des} }; @des{@des} = (); %{ $parts{$part}{des} } = %des;` [download] Or simply: `@{ $parts{ $part }{ des } }{ @des } = ();` [download]	[reply] [d/l] [select]