in reply to csv parsing

Like the other responders, I'm not entirely sure what you're looking for. Going by the general drift of the thread, though, here is another strategy for handling what I think you're after.

#!/usr/bin/perl
use strict;
use warnings;

############# Set up Input File #####################
#
# Output will be to the terminal, in OP's code the
# results would be output to the spreadsheet or to
# a .csv file that a spreadsheet could read
#
my $dir        = 'c:/Documents and Settings/xxxxxx/';
my $inFileName = 'MeshInputData.txt';
my $inFile     = $dir . $inFileName;
open(IN,"<",$inFile)||die "Can't open input file $inFileName: $!\n";
# open(OUT,">",$outFile)||die "Can't open output file $outFileName: $!\n";

#####################################################
#
# Read the file and process each line from input file
# one at a time per OP's preference.
#
my %codes    = ();
my @diseases = ();
my @meshes   = ();
my $i        = 0;    # declared here because of "use strict"
while (my $line = <IN>){
    $i++;
    chomp($line);
    my($code,$codeIndex,$disease,$mesh) =
        $line =~ /^(.+),([0-9]+),(.+),(MESH:(?:[a-zA-Z0-9]+))$/;
    next unless defined $code;    # skip lines that don't match the pattern
    if(exists $codes{$code}){
        # Copy out the existing lists, extend them, and store fresh copies
        @diseases = @{$codes{$code}[0]};
        @meshes   = @{$codes{$code}[1]};
        push(@diseases,$disease);
        push(@meshes,$mesh);
        $codes{$code} = [[@diseases],[@meshes]];
    }
    else {
        @diseases = ($disease);
        @meshes   = ($mesh);
        $codes{$code} = [[@diseases],[@meshes]];
    }
}
close(IN);

##############################################
#
# Display the results to the terminal
# (hash keys come back in no particular order)
#
$i = 0;
foreach my $code (keys %codes){
    $i++;
    my $diseases_ref = $codes{$code}[0];
    my $meshes_ref   = $codes{$code}[1];
    my @diseases = @{$diseases_ref};
    my @meshes   = @{$meshes_ref};
    my $diseases = join(",",@diseases);
    my $meshes   = join(",",@meshes);
    print "$i: ($code),($diseases),($meshes)\n";
}
exit(0);

The input file that I used to test the above code looks like this:

ARL6IP2,298757,Hyperalgesia,MESH:D006930
ARL6IP2,298757,Liver Diseases,MESH:D008107
ARL6IP2,298757,"Liver Failure, Acute",MESH:D017114
ARL6IP2,298757,Liver Neoplasms,MESH:D008113
CCL22,6367,Esophageal Neoplasms,MESH:D004938
CCL22,6367,Fatty Liver,MESH:D005234
CCL22,6367,Fetal Growth Retardation,MESH:D005317
CCL22,6367,Fever,MESH:D005334

With that input and the above code, the output to the terminal looks like the following:

1: (CCL22),(Esophageal Neoplasms,Fatty Liver,Fetal Growth Retardation,Fever),(MESH:D004938,MESH:D005234,MESH:D005317,MESH:D005334)
2: (ARL6IP2),(Hyperalgesia,Liver Diseases,"Liver Failure, Acute",Liver Neoplasms),(MESH:D006930,MESH:D008107,MESH:D017114,MESH:D008113)

As suggested by the other responders, this code uses a hash as the mechanism for storing the various codes (e.g., "ARL6IP2" or "CCL22"). Each entry in the hash stores a reference to an array which itself contains references to two other arrays. One of those two arrays holds what I designate as the @diseases (e.g., "Hyperalgesia" or "Liver Failure, Acute" associated with each $code) and the other holds what I designate as the @meshes (i.e., the various "MESH:D006930" type of stuff associated with each $code).
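
To make that layout a bit more concrete, here is a minimal sketch of how one entry can be walked. The %codes entry below is filled in by hand from the sample data purely for illustration (the script above builds it from the input file), and $second_mesh is just a name I made up for the example.

```perl
use strict;
use warnings;

# Hand-built sample entry mirroring the layout described above:
# $codes{$code}[0] holds the diseases, $codes{$code}[1] holds the MESH ids.
my %codes = (
    CCL22 => [
        [ 'Esophageal Neoplasms', 'Fatty Liver' ],
        [ 'MESH:D004938',         'MESH:D005234' ],
    ],
);

# Pull the two inner array references out of the outer array.
my ($diseases_ref, $meshes_ref) = @{ $codes{CCL22} };

# The two inner arrays are parallel: index $n in one matches index $n in the other.
for my $n (0 .. $#{$diseases_ref}) {
    print "$diseases_ref->[$n] => $meshes_ref->[$n]\n";
}

# A single element can also be reached directly through the nested references.
my $second_mesh = $codes{CCL22}[1][1];
print "Second MESH id: $second_mesh\n";
```

Since the diseases and MESH ids are pushed in the same order, position $n in one list always lines up with position $n in the other.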

The one perhaps odd looking construct in the script is:

$codes{$code} = [[@diseases],[@meshes]];

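The square brackets in the interior ensure that the entries in the hash don't all point to the exact same array. Writing [@diseases] creates a new anonymous array holding a copy of the current contents of @diseases, so each hash entry gets its own data. Had I stored \@diseases instead, every entry would refer to the one @diseases variable, which gets overwritten on each pass through the loop. Using square brackets in this way is the usual recommended way to ensure that one doesn't continually point to the same data structure.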
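
A small standalone sketch of the difference, in case it helps. The %shared and %copied hashes and the keys A and B are made up for this illustration and are not part of the script above.

```perl
use strict;
use warnings;

my @diseases = ('Fever');
my (%shared, %copied);

$shared{A} = \@diseases;      # reference to the one and only @diseases
$copied{A} = [@diseases];     # new anonymous array holding a copy

@diseases = ('Fatty Liver');  # reuse the same variable for the next code
$shared{B} = \@diseases;
$copied{B} = [@diseases];

print "shared A: @{ $shared{A} }\n";   # prints "shared A: Fatty Liver"
print "copied A: @{ $copied{A} }\n";   # prints "copied A: Fever"
```

With \@diseases, entries A and B end up as the same array, so A silently picks up B's data. With [@diseases], each entry keeps what it had at the moment it was stored.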

I hope this helps show another way to do it.

ack Albuquerque, NM