
I liked the hash-based solutions, including jwkrahn's. I don't know how huge this file could get, but if it is horrifically huge, you might have to process it line by line due to memory constraints.

Below, I show one way of doing that, assuming the lines are sorted. That is often not a bad assumption, as the command-line sort utilities can sort humongous files very efficiently. The code would have to be a bit more complex if more than two duplicate lines were present and needed to be combined on the first pass, although running the program again on its own output would pick up the "3rd" one on the second pass. Note that I did not "save" the number of descriptions, as Perl produces it easily by evaluating @var in scalar context.
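A quick aside on that last point, before the program: this is a minimal sketch of how an array yields its element count in scalar context. The array name @parts and its values are made up purely for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @parts = qw(RP18 RP19 RP8);          # hypothetical sample values

my $count       = @parts;               # scalar assignment gives the element count
my $count_again = scalar(@parts);       # the same thing, spelled out explicitly
my $line        = "count=" . @parts;    # the concatenation operator also imposes scalar context

print "$count $count_again $line\n";    # prints "3 3 count=3"
```

The concatenation form is the one the program relies on when it builds its output line.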

#!/usr/bin/perl -w
use strict;

my $prev_line;   # undef until a line is pending

while (<DATA>)
{
   if (!$prev_line){$prev_line = $_; next}
   
   my ($prev_num, $prev_desc_txt) = (split(/,/,$prev_line,3))[0,2];
   my ($num, $desc_text)          = (split(/,/,$_,3))[0,2];
   
   if ($prev_num eq $num)  #combine prev and current descriptions
   {
       my $new_desc = "$prev_desc_txt $desc_text";
       my @new_desc = ($new_desc =~m/(\w+)/g);
       
       @new_desc = sort {  #thanks to jwkrahn for sort
                           my ( $aL, $aR ) = $a =~ /(\D+)(\d+)/;
                           my ( $bL, $bR ) = $b =~ /(\D+)(\d+)/;
                                     $aL cmp $bL 
                                          or
                                     $aR <=> $bR
                         } @new_desc;
                  
       print "$num,".@new_desc.",\"", join(',',@new_desc),"\"\n";
              #note .@new_desc forces scalar context (num elements)
       $prev_line = undef;   # pair consumed; nothing pending
   }
   else                    #prev_line is a "singleton"
   {
       print $prev_line;
       $prev_line = $_;
   }
}
print $prev_line if ($prev_line); #maybe a "hanger on"

=prints:
032-00751-0000,1,R383
032-00794-0000,6,"RP1,RP2,RP3,RP22,RP24,RP26"
032-00795-0000,8,"RP10,RP11,RP12,RP13,RP14,RP15,RP16,RP17"
032-00804-0000,7,"R7,R14,R21,R23,R41,R42,R49"
032-00807-0000,6,"RP8,RP9,RP18,RP19,RP200,RP201"
032-00808-0000,3,"RP21,RP23,RP25"
032-00820-0000,6,"R966,R970,R971,R1041,R1076,R3000"
032-00893-0000,1,R1164
=cut

__DATA__
032-00751-0000,1,R383
032-00794-0000,6,"RP1,RP2,RP3,RP22,RP24,RP26"
032-00795-0000,8,"RP10,RP11,RP12,RP13,RP14,RP15,RP16,RP17"
032-00804-0000,7,"R7,R14,R21,R23,R41,R42,R49"
032-00807-0000,2,"RP18,RP19"
032-00807-0000,4,"RP8,RP9,RP200,RP201"
032-00808-0000,3,"RP21,RP23,RP25"
032-00820-0000,5,"R966,R970,R971,R1041,R1076"
032-00820-0000,1,R3000
032-00893-0000,1,R1164
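If runs of three or more duplicate lines have to be combined in a single pass, one possible variation is to accumulate descriptions until the part number changes and only then print the record. The following is a rough sketch of that idea, not a tested replacement for the program above: the sub names (by_designator, format_record, merge_sorted) are my own invention, the third 032-00807-0000 line in the sample is made up to show a triple, and it still assumes the input is sorted on the first field.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sort designators such as R7 or RP22 by letter prefix, then by number
# (same comparison as jwkrahn's sort block).
sub by_designator {
    my ( $aL, $aR ) = $a =~ /(\D+)(\d+)/;
    my ( $bL, $bR ) = $b =~ /(\D+)(\d+)/;
    return $aL cmp $bL || $aR <=> $bR;
}

# Print one record: part number, count, and the sorted designators.
sub format_record {
    my ( $out, $num, @desc ) = @_;
    @desc = sort by_designator @desc;
    if ( @desc == 1 ) {                      # singleton: no quotes needed
        print {$out} "$num,1,$desc[0]\n";
        return;
    }
    print {$out} "$num," . @desc . ',"' . join( ',', @desc ) . "\"\n";
    return;
}

# Merge runs of consecutive lines that share the same first field.
# Only one run of descriptions is held in memory at a time.
sub merge_sorted {
    my ( $in, $out ) = @_;
    my ( $cur_num, @cur_desc );
    while ( my $line = <$in> ) {
        chomp $line;
        my ( $num, $desc ) = ( split /,/, $line, 3 )[ 0, 2 ];
        next unless defined $desc;           # skip blank or malformed lines
        if ( defined $cur_num && $num ne $cur_num ) {
            format_record( $out, $cur_num, @cur_desc );   # run ended: flush it
            @cur_desc = ();
        }
        $cur_num = $num;
        push @cur_desc, $desc =~ /(\w+)/g;
    }
    format_record( $out, $cur_num, @cur_desc ) if defined $cur_num;
    return;
}

# Hypothetical sample: the third 032-00807-0000 line is invented for this sketch.
my $sample = <<'END';
032-00807-0000,2,"RP18,RP19"
032-00807-0000,4,"RP8,RP9,RP200,RP201"
032-00807-0000,1,RP7
032-00893-0000,1,R1164
END

open my $in, '<', \$sample or die "Cannot open in-memory sample: $!";
merge_sorted( $in, \*STDOUT );
close $in;
```

With that sample, this should print 032-00807-0000,7,"RP7,RP8,RP9,RP18,RP19,RP200,RP201" followed by 032-00893-0000,1,R1164. For a real file you would pass a handle opened on the sorted file instead of the in-memory string.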

In reply to Re: Find duplicate fields and merging data in a text file by Marshall
in thread Find duplicate fields and merging data in a text file by donkost
