Extracting records with unique ID

mmittiga17 has asked for the wisdom of the Perl Monks concerning the following question:

Hi All,
I have a file that has records for the same unique ID on multiple lines.

Sample:

A02545142   CA0 0120000612A02545142   C  00000000  G
A02545142   CA3 0120080804 121303 USEDT
A02545142   CD0 01GENERALI HOLDING VIENNA AG     ORD
A02545142   CD1 01OUTFC   000000 29GRHVF        000000000000     YNN   AT   NATN
A02545142   CD2 010000000000000000000000 00000000000000   F
A02545142   CD3 01  00000000000000   00000000
A02545142   CE0 01               0000000000000000000   00000000000000
A02545142   CE1 0100000000          00000000      00000000          00000000
A02545142   CI0 0100000000000000000000000000000000      000000000000000000000000
A02545142   CR1 01    00000000            00000000            00000000
A02545142   CT2 01 9920000607AGRHVF
A02545142   CX0 01A02545142
A02545142   CX3 01  00000000
A02545142   CX4 01GRHVF
A03987103   CA0 0120030305A03987103   C  00000000  G
A03987103   CA3 0120080710 180603 USEDT
A03987103   CD0 01AAP IMPLANTATE AG BERLIN       AKT
A03987103   CD1 01OUTFC   384100 29APIPF        000000000000     YNN  YAT   NATN
A03987103   CD2 010000000000000000000000 00000000000000   F
A03987103   CD3 01  00000000000000   00000000
A03987103   CD8 01339112                                    3841
A03987103   CE0 01               0000000000000000000   00000000000000
A03987103   CE1 0100000000          00000000      00000000          00000000
A03987103   CI0 0100000000000000000000000000000000      000000000000000000000000
A03987103   CR1 01    00000000            00000000            00000000
A03987103   CT2 01 9920030304AAPIPF
A03987103   CX0 01A03987103         009763775              B3BGFP7DE0005066609
A03987103   CX3 010220080710F5678220AB28DW53
A03987103   CX4 01APIPF

The first field of each line denotes the unique ID.
I need to parse each line for data at fixed-width positions and print it on a single line.

Here is what I have so far and it is not working.
while (@ARGV) {
    $file = shift(@ARGV);
    open(DATA, "$file") || die "unable to open tmp file";
    while (defined($Rec = <DATA>)) {
        push(@Lines, $Rec);
        foreach $line (@Lines) {
            if ($line =~ /CA0/) {
                $USERNUM = substr($line, 0, 12);
            }
            if (($line =~ /CA0/) && ($line =~ /^$USERNUM/)) {
                $RECTYPE = substr($line, 13, 2);
                $ASSET = substr($line, 51, 1);
            }
            if (($line =~ /CX4/) && ($line =~ /^$USERNUM/)) {
                $EXTK = substr($line, 18, 33);
            }
            push(@Recs, "$USERNUM|$RECTYPE|$ASSET|$EXTK");
        }
    }
}
foreach $X (@Recs) {
    ($USERNUM, $RECTYPE, $ASSET, $EXTK) = split /\|/, $X;
    print "$USERNUM $RECTYPE $ASSET $EXTK\n";
}

Any suggestions or help is deeply appreciated.

Replies are listed 'Best First'.
Re: Extracting records with unique ID by moritz (Cardinal) on Sep 23, 2008 at 17:40 UTC
Since you don't tell us what the output should be, I can only give you some general advice.

First, always do this: `use strict; use warnings;` And declare your variables with my. It helps to catch common errors.

Secondly, you can replace your two outer loops with something as simple as

    while (<>) {
        # the current line is in $_,
        # the current filename in $ARGV
    }

Thirdly, if you want to group data by the ID, store that data in a hash keyed by the ID.

And finally there's the unpack function for extracting fixed-width data, and perlpacktut contains a nice tutorial-style introduction on how to use it.

And really-finally: If your data format has a name, go to CPAN and search for it - maybe there's already a module that does most of your work.
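As a rough sketch of the unpack suggestion: the template below assumes, going only by the sample lines in the question and not by any format specification, that each line holds a 9-character ID, three spaces, a 3-character record type, one space, a 2-character sequence number, and then the payload. The sub name parse_line and the field names are made up for illustration.

```perl
use strict;
use warnings;

# Assumed layout, inferred from the sample lines in the question:
#   A9 ID, x3 skip three columns, A3 record type, x skip one column,
#   A2 sequence number, A* rest of the line (trailing spaces stripped).
my $TEMPLATE = 'A9 x3 A3 x A2 A*';

# Hypothetical helper: split one fixed-width line into named fields.
sub parse_line {
    my ($line) = @_;
    chomp $line;
    # unpack dies if an 'x' skips past the end of the string,
    # so reject lines shorter than the fixed 18-column prefix.
    return if length($line) < 18;
    my ($id, $type, $seq, $payload) = unpack $TEMPLATE, $line;
    return { id => $id, type => $type, seq => $seq, payload => $payload };
}

my $rec = parse_line("A02545142   CX4 01GRHVF          \n");
print join('|', @{$rec}{qw(id type seq payload)}), "\n";
# prints: A02545142|CX4|01|GRHVF
```

The 'A' format strips trailing spaces, which is usually what you want for space-padded fields; the payload would still need a second, record-type-specific unpack to pull out individual values.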
Re: Extracting records with unique ID by Andrew Coolman (Hermit) on Sep 23, 2008 at 17:34 UTC
How about using a hash with $USERNUM as the key? Instead of this:

    push(@Recs, "$USERNUM|$RECTYPE|$ASSET|$EXTK");

Try this:

    push(@{$Recs{$USERNUM}}, "$RECTYPE|$ASSET|$EXTK");

where %Recs is declared somewhere above with my %Recs. Then just go through the hash and take all the records:

    for my $USERNUM (keys %Recs) {
        for my $rec (@{$Recs{$USERNUM}}) {
            my ($RECTYPE, $ASSET, $EXTK) = split /\|/, $rec;
            print "$USERNUM $RECTYPE $ASSET $EXTK\n";
        }
    }

Regards,
Re: Extracting records with unique ID by psini (Deacon) on Sep 23, 2008 at 17:36 UTC
I really can't follow the logic of your program: it seems you are reading the file and, for each line read, parsing all the lines read so far to check for ID equality. It is certainly not the fastest way to do it.

If I understand the problem, you want to group data from lines with the same ID and then print it in some format. What I'd do is:

1. Define an empty hash.
2. Read the file line by line.
3. Parse each line as it is read and extract the ID and the individual data fields.
4. Add the data to the hash, using the ID field as the key. The data should be further structured in a nested hash with keys like RECTYPE, ASSET, and EXTK.
5. When the whole file has been read, print the contents of the hash, formatted as you need.

Rule One: "Do not act incautiously when confronting a little bald wrinkly smiling man."
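The steps above could be sketched roughly as follows. This is only an illustration: the column offsets are copied from the substr calls in the question, while the 12-column ID, the record type at columns 12-14, and the sub names group_by_id and format_rows are assumptions. The sample lines are built with sprintf so the columns line up.

```perl
use strict;
use warnings;

# Sketch only: offsets 13/51/18 come from the question's substr calls;
# the ID width and record-type position are read off the sample data.
sub group_by_id {
    my (@lines) = @_;
    my (%data, @order);                      # empty hash, plus first-seen order
    for my $line (@lines) {                  # one line at a time
        chomp $line;
        next if length($line) < 15;
        (my $id = substr($line, 0, 12)) =~ s/\s+\z//;
        my $type = substr($line, 12, 3);
        push @order, $id unless exists $data{$id};
        $data{$id} ||= {};                   # nested hash per ID
        if ($type eq 'CA0' && length($line) > 51) {
            $data{$id}{RECTYPE} = substr($line, 13, 2);
            $data{$id}{ASSET}   = substr($line, 51, 1);
        }
        elsif ($type eq 'CX4' && length($line) > 18) {
            ($data{$id}{EXTK} = substr($line, 18, 33)) =~ s/\s+\z//;
        }
    }
    return (\%data, \@order);
}

# One output row per ID, in order of first appearance.
sub format_rows {
    my ($data, $order) = @_;
    my @rows;
    for my $id (@$order) {
        my $rec = $data->{$id};
        push @rows, join '|', $id,
            map { defined $rec->{$_} ? $rec->{$_} : '' } qw(RECTYPE ASSET EXTK);
    }
    return @rows;
}

my @sample = (
    sprintf('%-12sCA0 01%-33sG', 'A02545142', '20000612A02545142   C  00000000'),
    sprintf('%-12sCX4 01%s',     'A02545142', 'GRHVF'),
);
print "$_\n" for format_rows(group_by_id(@sample));
# prints: A02545142|A0|G|GRHVF
```

Because each line is parsed exactly once and filed under its ID, there is no need to keep or rescan the lines already read.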
Re^2: Extracting records with unique ID by mmittiga17 (Scribe) on Sep 23, 2008 at 18:32 UTC
Thanks all for your responses. My biggest question is how to structure a hash when the key appears on multiple lines. Can someone point me in the right direction on that?
Re^3: Extracting records with unique ID by massa (Hermit) on Sep 25, 2008 at 01:17 UTC
Instead of using a hash where keys and values are strings, use a hash where the keys are strings and the values are references to arrays (so you can put more than one value for a key):

    my %h;
    while (<>) {
        my ($k, $v) = process($_);   # process() is a placeholder for your parsing code
        push @{$h{$k}}, $v;
    }
    for my $k (keys %h) {            # traverse all keys
        for my $v (@{$h{$k}}) {      # traverse all values for that key
            do_your_stuff($k, $v);   # placeholder for your own code
        }
    }

and read perllol, perlreftut, perlref, perldsc... I would prefer you read them in that order, but do as you please!!!

[]s, HTH, Massa (κς,πμ,πλ)
Re: Extracting records with unique ID by apl (Monsignor) on Sep 23, 2008 at 17:56 UTC
I'd suggest you unpack all fields in each record. Then, depending on the type of record (CX4, for example), you populate (for example) `$hash{$USERNUM}{EXTK} = $EXTK;` When you've finished processing the file, you can loop through all keys of the hash and display the hash values however you wish.

I wouldn't rely on a simple match of the record type, because you could conceivably have CX4 as a value in a record of type CT2.
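One way to act on that last point is to read the record type from its fixed columns and look the fields up in a table. This is a sketch under assumptions: the %LAYOUT table, the add_record name, the 12-column ID, and the type at columns 12-14 are all hypothetical; only the CA0 and CX4 offsets are taken from the question.

```perl
use strict;
use warnings;

# Hypothetical layout table: record type => { field name => [offset, length] }.
# The offsets come from the substr calls in the question; the rest is assumed.
my %LAYOUT = (
    CA0 => { ASSET => [51, 1]  },
    CX4 => { EXTK  => [18, 33] },
);

sub add_record {
    my ($hash, $line) = @_;
    chomp $line;
    return if length($line) < 15;
    (my $usernum = substr($line, 0, 12)) =~ s/\s+\z//;
    # Read the type from its fixed columns instead of matching /CX4/
    # anywhere in the line.
    my $fields = $LAYOUT{ substr($line, 12, 3) } or return;
    for my $name (keys %$fields) {
        my ($offset, $length) = @{ $fields->{$name} };
        next if length($line) <= $offset;
        (my $value = substr($line, $offset, $length)) =~ s/\s+\z//;
        $hash->{$usernum}{$name} = $value;
    }
    return;
}

my %hash;
add_record(\%hash, 'A02545142   CT2 01 99CX4');   # made-up line: "CX4" only in the payload
print scalar(keys %hash), "\n";                    # prints: 0
add_record(\%hash, 'A02545142   CX4 01GRHVF');
print "$hash{A02545142}{EXTK}\n";                  # prints: GRHVF
```

A plain /CX4/ match would have treated the first line as a CX4 record; the positional check ignores it.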