PerlMonks  

Extracting records with unique ID

by mmittiga17 (Scribe)
on Sep 23, 2008 at 17:21 UTC ( [id://713256]=perlquestion )

mmittiga17 has asked for the wisdom of the Perl Monks concerning the following question:

Hi All, I have a file that has records for the same unique ID on multiple lines. Sample:

    A02545142 CA0 0120000612A02545142 C 00000000 G
    A02545142 CA3 0120080804 121303 USEDT
    A02545142 CD0 01GENERALI HOLDING VIENNA AG ORD
    A02545142 CD1 01OUTFC 000000 29GRHVF 000000000000 YNN AT NATN
    A02545142 CD2 010000000000000000000000 00000000000000 F
    A02545142 CD3 01 00000000000000 00000000
    A02545142 CE0 01 0000000000000000000 00000000000000
    A02545142 CE1 0100000000 00000000 00000000 00000000
    A02545142 CI0 0100000000000000000000000000000000 000000000000000000000000
    A02545142 CR1 01 00000000 00000000 00000000
    A02545142 CT2 01 9920000607AGRHVF
    A02545142 CX0 01A02545142
    A02545142 CX3 01 00000000
    A02545142 CX4 01GRHVF
    A03987103 CA0 0120030305A03987103 C 00000000 G
    A03987103 CA3 0120080710 180603 USEDT
    A03987103 CD0 01AAP IMPLANTATE AG BERLIN AKT
    A03987103 CD1 01OUTFC 384100 29APIPF 000000000000 YNNYAT NATN
    A03987103 CD2 010000000000000000000000 00000000000000 F
    A03987103 CD3 01 00000000000000 00000000
    A03987103 CD8 01339112 3841
    A03987103 CE0 01 0000000000000000000 00000000000000
    A03987103 CE1 0100000000 00000000 00000000 00000000
    A03987103 CI0 0100000000000000000000000000000000 000000000000000000000000
    A03987103 CR1 01 00000000 00000000 00000000
    A03987103 CT2 01 9920030304AAPIPF
    A03987103 CX0 01A03987103 009763775 B3BGFP7DE0005066609
    A03987103 CX3 010220080710F5678220AB28DW53
    A03987103 CX4 01APIPF

The first field of each line denotes the unique ID. I need to parse each line for data set at a fixed width and print it to a single line. Here is what I have so far, and it is not working.

    while (@ARGV) {
        $file = shift(@ARGV);
        open(DATA, "$file") || die "unable to open tmp file";
        while (defined($Rec = <DATA>)) {
            push(@Lines, $Rec);
            foreach $line (@Lines) {
                if ($line =~ /CA0/) {
                    $USERNUM = substr($line, 0, 12);
                }
                if (($line =~ /CA0/) && ($line =~ /^$USERNUM/)) {
                    $RECTYPE = substr($line, 13, 2);
                    $ASSET   = substr($line, 51, 1);
                }
                if (($line =~ /CX4/) && ($line =~ /^$USERNUM/)) {
                    $EXTK = substr($line, 18, 33);
                }
                push(@Recs, "$USERNUM|$RECTYPE|$ASSET|$EXTK");
            }
        }
    }
    foreach $X (@Recs) {
        ($USERNUM, $RECTYPE, $ASSET, $EXTK) = split /\|/, $X;
        print "$USERNUM $RECTYPE $ASSET $EXTK\n";
    }

Any suggestions or help is deeply appreciated.

Replies are listed 'Best First'.
Re: Extracting records with unique ID
by Andrew Coolman (Hermit) on Sep 23, 2008 at 17:34 UTC
    How about using a hash with $USERNUM as the key?
    Instead of this:
        push(@Recs, "$USERNUM|$RECTYPE|$ASSET|$EXTK");
    try this:
        push(@{$Recs{$USERNUM}}, "$RECTYPE|$ASSET|$EXTK");
    where %Recs is declared somewhere above as my %Recs.
    Then just go through the hash and take all the records:
        for my $USERNUM (keys %Recs) {
            for my $rec (@{$Recs{$USERNUM}}) {
                my ($RECTYPE, $ASSET, $EXTK) = split /\|/, $rec;
                print "$USERNUM $RECTYPE $ASSET $EXTK\n";
            }
        }

    Regards,
Re: Extracting records with unique ID
by psini (Deacon) on Sep 23, 2008 at 17:36 UTC

    I really can't follow the logic of your program: it seems you are reading the file and, for each line read, parsing all the lines read so far to check for ID equality. That is certainly not the fastest way to do it.

    If I understand the problem, you want to group data from lines with the same ID and then print it in some format. What I'd do is:

    • Define an empty hash
    • Read the file line by line
    • Parse each line as it is read and extract the ID and the individual data fields
    • Add the data read to the hash, using the ID field as the key. The data could be further structured in a nested hash with keys like RECTYPE, ASSET, and EXTK
    • When the whole file has been read, print the contents of the hash, formatted as you need
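
The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the column offsets, the shortened sample lines, and the helper parse_line are hypothetical, since the real record layout is not given in the thread.

```perl
use strict;
use warnings;

# Assumed layout, for illustration only (not taken from the real file):
#   columns 0-8    ID
#   columns 10-12  record type
#   column 14 on   record data
sub parse_line {
    my ($line) = @_;
    chomp $line;
    my $id   = substr($line, 0, 9);
    my $type = substr($line, 10, 3);
    my $data = length($line) > 14 ? substr($line, 14) : '';
    return ($id, $type, $data);
}

# Stand-in for the lines of the input file.
my @lines = (
    "A02545142 CA0 0120000612",
    "A02545142 CX4 01GRHVF",
    "A03987103 CA0 0120030305",
    "A03987103 CX4 01APIPF",
);

my %records;                                        # step 1: empty hash
for my $line (@lines) {                             # step 2: line by line
    my ($id, $type, $data) = parse_line($line);     # step 3: extract ID and fields
    # step 4: store the fields under the ID, in a nested hash
    $records{$id}{RECTYPE} = substr($data, 0, 2) if $type eq 'CA0';
    $records{$id}{EXTK}    = substr($data, 2)    if $type eq 'CX4';
}

# step 5: one output line per ID
my @out;
for my $id (sort keys %records) {
    my $r       = $records{$id};
    my $rectype = defined $r->{RECTYPE} ? $r->{RECTYPE} : '';
    my $extk    = defined $r->{EXTK}    ? $r->{EXTK}    : '';
    push @out, "$id|$rectype|$extk";
}
print "$_\n" for @out;
```

With a real file, the for loop over @lines would become a while loop over a filehandle; nothing else needs to change.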

    Rule One: "Do not act incautiously when confronting a little bald wrinkly smiling man."

      Thanks all for your responses. My biggest problem is how to structure a hash when the key is on multiple lines. Can someone point me in the right direction on that?
        Instead of using a hash where keys and values are strings, use a hash where the keys are strings and the values are references to arrays (so you can put more than one value for a key):
            my %h;
            while (<>) {
                # process() is a placeholder for your own parsing sub
                my ($k, $v) = process($_);
                push @{$h{$k}}, $v;
            }
            for my $k (keys %h) {            # traverse all keys
                for my $v (@{$h{$k}}) {      # traverse all values for that key
                    do_your_stuff($k, $v);   # placeholder for your own output sub
                }
            }
        And read perllol, perlreftut, perlref, perldsc... I would prefer you read them in that order, but do as you please!!!
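
A self-contained variant of the hash-of-arrays idea, for readers who want something to run. The placeholder parsing step is replaced here by a split on the first run of whitespace, which is an assumption about the data, and the sample lines are shortened stand-ins for the real file.

```perl
use strict;
use warnings;

# Stand-in for the lines of the input file.
my @lines = (
    "A02545142 CA0 0120000612",
    "A02545142 CX4 01GRHVF",
    "A03987103 CX4 01APIPF",
);

my %h;
for my $line (@lines) {
    # Assumption: the ID is everything up to the first run of whitespace.
    my ($id, $rest) = split ' ', $line, 2;
    push @{ $h{$id} }, $rest;    # the array reference is created on first use
}

my @summary;
for my $id (sort keys %h) {
    my $count = @{ $h{$id} };    # array in scalar context gives its length
    push @summary, "$id has $count line(s): " . join(' / ', @{ $h{$id} });
}
print "$_\n" for @summary;
```

Each ID keeps its lines in the order they were read, so per-ID processing can rely on file order.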
        []s, HTH, Massa (κς,πμ,πλ)
Re: Extracting records with unique ID
by moritz (Cardinal) on Sep 23, 2008 at 17:40 UTC
    Since you don't tell us what the output should be, I can only give you some general advice.

    First, always do this:

    use strict; use warnings;

    And declare your variables with my. It helps to catch common errors.

    Secondly, you can replace your two outer loops with something as simple as

    while (<>) {
        # the current line is in $_,
        # the current filename in $ARGV
    }

    Thirdly, if you want to group data by the ID, store that data in a hash keyed by the ID.

    And finally there's the unpack function for extracting fixed width data, and perlpacktut contains a nice tutorial-style introduction on how to use it.
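
As a small illustration of unpack on fixed-width data, here is a sketch. The template and the field widths are assumptions chosen for the example; the real widths would have to come from the file's specification.

```perl
use strict;
use warnings;

# Assumed fixed-width layout, for illustration only:
#   A12  ID, space-padded to 12 columns (trailing spaces are stripped by "A")
#   A3   record type
#   x1   one filler column, skipped
#   A2   sequence number
#   A*   remainder of the line
my $template = 'A12 A3 x1 A2 A*';

# Build a sample line that follows the assumed layout.
my $line = sprintf '%-12s%-3s %s', 'A02545142', 'CX4', '01GRHVF';

my ($id, $type, $seq, $rest) = unpack $template, $line;

print "id=$id type=$type seq=$seq rest=$rest\n";
```

One template replaces a series of substr calls, and the widths are documented in a single place.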

    And really-finally: If your data format has a name, go to CPAN and search for it - maybe there's already a module that does most of your work.

Re: Extracting records with unique ID
by apl (Monsignor) on Sep 23, 2008 at 17:56 UTC
    I'd suggest you unpack all fields in each record. Then, depending on the type of record (CX4, for example), you populate the matching entry, for example $hash{$USERNUM}{EXTK} = $EXTK;

    When you've finished processing the file, you can loop through all keys of the hash, and display the hash values however you wish.

    I wouldn't rely on a simple match of the record type, because you could conceivably have CX4 as a value in a record of type CT2.
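
To make the difference concrete, here is a sketch that compares a loose regex match with a check on the record-type field itself. The layout in the unpack template is an assumption, and the first sample line is made up so that its data happens to contain "CX4".

```perl
use strict;
use warnings;

# Assumed layout, for illustration only: ID in columns 0-8, type in columns 10-12.
my @lines = (
    "A02545142 CT2 01 9920000607ACX4VF",    # made-up CT2 record whose data contains "CX4"
    "A02545142 CX4 01GRHVF",
);

my (%hash, @by_regex, @by_field);
for my $line (@lines) {
    my ($id, $type, $data) = unpack 'A9 x1 A3 x1 A*', $line;

    push @by_regex, $type if $line =~ /CX4/;    # matches anywhere in the line
    push @by_field, $type if $type eq 'CX4';    # looks only at the type field

    # Populate the per-ID hash from the type field, not from a loose match.
    $hash{$id}{EXTK} = substr($data, 2) if $type eq 'CX4';
}

print "regex matched: @by_regex\n";
print "field matched: @by_field\n";
print "EXTK for $_: $hash{$_}{EXTK}\n" for sort keys %hash;
```

The regex also picks up the CT2 line, while the field comparison selects only the real CX4 record.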
