mupud has asked for the wisdom of the Perl Monks concerning the following question:

Okay, I'm a total newb and feel really frustrated. I need to write a simple script to do the following.

Open a file that contains information in the following format:

385#19126!NM_167210@[1103;1104] 2
386#19127!NM_167211@[1103;1104] 2
387#19128!NM_167212@[1103;1104] 2
438#1781!NM_135492@[1337] 1
442#1794!NM_001042886@[1349] 1

What I need is for each entry between ! and @ (e.g., NM_167210) to be associated with the number at the very end of its line. So in a hash, one key would be NM_167210 with the value 2, another key would be NM_167211 with the value 2, and so on.

I then have another file with each of the NM entries having a corresponding entry called CG*****:

CG32694-RD NM_167211
CG32694-RC NM_167210
CG32694-RA NM_167209
CG32694-RB NM_167212
CG33557-RA NM_001014730

So what I want is to process the first file as stated above, and then convert each NM field into its CG field while keeping the value that belonged to the NM field. For example:

key NM_167211 with value 2 needs to be converted to its corresponding CG32694-RD, still with value 2.

How do I go about doing this? I have no idea how to even start. A hash of arrays? An array of hashes? Just a regular hash? An array instead?

My problem is that there are so many ways to approach this. Do I use m// or split or grep or all three of them? The advantage of a hash is that it would associate each NM with its value, whereas in an array, I would lose the association, no?

Heeeeeeeeeeeeeeelp! I h*** PERL!


Replies are listed 'Best First'.
Re: ISOLATE 2 ASSOCIATED FIELDS IN A TEXT FILE, then CONVERT the first into another based on a table of definitions
by GrandFather (Saint) on Oct 19, 2009 at 04:24 UTC

    It doesn't matter how you start. If making a decision really bothers you, you could write each option into a suitably sized box drawn on a piece of paper, then throw a dart at it and choose a technique that way. However, once you start coding you'll generally pretty quickly find what doesn't work and why. Then you can make a somewhat better informed decision (perhaps by eliminating some options from the decision tool?).

    When you have some code to show (any code, but be sensible), then it is time to come back so we can help refine it.

    As a general rule though, use a hash when you need to look stuff up, and an array when you don't need to look it up but do need to keep it together and possibly retain its order. For more complicated stuff just apply the rule iteratively: 'I need to look something up to get a list of values = hash of arrays'.
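
    A minimal sketch of that rule, using a few of the sample records from the question. The variable names (%value_for, @order, %nms_for_gene) and the idea of grouping NM ids by the CG name with its -Rx suffix stripped are assumptions made purely for illustration; they are not taken from the original post.

```perl
use strict;
use warnings;

# Hash: look a value up by name (NM id => trailing number).
my %value_for = ( NM_167210 => 2, NM_167211 => 2, NM_135492 => 1 );

# Array: keep items together and in order; no lookup needed.
my @order = qw( NM_167210 NM_167211 NM_135492 );

# Hash of arrays: look something up and get a list back.
# Illustrative assumption: group NM ids under the CG name minus its "-Rx" suffix.
my %nms_for_gene;
my @pairs = (
    [ 'CG32694-RD', 'NM_167211' ],
    [ 'CG32694-RC', 'NM_167210' ],
    [ 'CG33557-RA', 'NM_001014730' ],
);
for my $pair (@pairs) {
    my ( $cg, $nm ) = @$pair;
    ( my $gene = $cg ) =~ s/-R\w+$//;          # CG32694-RD -> CG32694
    push @{ $nms_for_gene{$gene} }, $nm;       # autovivifies the inner array
}

print "$_ => $value_for{$_}\n" for @order;
print "$_ => @{ $nms_for_gene{$_} }\n" for sort keys %nms_for_gene;
```

    The plain hash answers "what is the value for this NM id?", the array preserves input order, and the hash of arrays answers "which NM ids belong to this gene?".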


    True laziness is hard work
Re: ISOLATE 2 ASSOCIATED FIELDS IN A TEXT FILE, then CONVERT the first into another based on a table of definitions
by bichonfrise74 (Vicar) on Oct 19, 2009 at 05:08 UTC
    As I understand your question, this should help you get started. Note that there are many ways of doing this.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Data::Dumper;

    # First file: map each NM id (the text between '!' and '@')
    # to the number at the end of its line.
    my %temp_record;
    my $file_1 = <<'FILE_1';
    385#19126!NM_167210@[1103;1104] 2
    386#19127!NM_167211@[1103;1104] 2
    387#19128!NM_167212@[1103;1104] 2
    438#1781!NM_135492@[1337] 1
    442#1794!NM_001042886@[1349] 1
    FILE_1

    open( my $fh, '<', \$file_1 ) or die "Cannot open in-memory file: $!";
    while (<$fh>) {
        # (\d+) rather than (\d), so multi-digit values are captured whole
        my ($key, $value) = /!(\w+)\@\S.*\s+(\d+)\s*$/ or next;
        $temp_record{$key} = $value;
    }
    close($fh);

    # Second file: keep each CG name whose NM id was seen in the first file.
    my %record;
    while (<DATA>) {
        my ($cg, $nm) = /(\S+)\s+(\S+)/ or next;
        $record{$cg} = $temp_record{$nm} if exists $temp_record{$nm};
    }

    print Dumper \%record;

    __DATA__
    CG32694-RD NM_167211
    CG32694-RC NM_167210
    CG32694-RA NM_167209
    CG32694-RB NM_167212
    CG33557-RA NM_001014730

    The output is:

    $VAR1 = {
              'CG32694-RD' => '2',
              'CG32694-RB' => '2',
              'CG32694-RC' => '2'
            };

Re: ISOLATE 2 ASSOCIATED FIELDS IN A TEXT FILE, then CONVERT the first into another based on a table of definitions
by Marshall (Canon) on Oct 19, 2009 at 07:33 UTC
    I think you will find it of enormous help if you consider what kind of report output(s) you want or what kind of queries will need to be run on this merged version of your two files. Meaning write the user spec of what this will do from a "black box" perspective. A clear understanding of where you want to end up will drive the techniques and structures leading to this result. In other words, the problem is easier if we know where we want to wind up.

    Update: Already we have a couple of proposed solutions from bichonfrise74 and Bloodnok. The one from bichonfrise74 is organized to make accessing the CGxxx stuff easy. The one from Bloodnok is organized to make accessing the NM_xxx stuff easy. The real question is: what best suits your need? It could well be that "none of the above" is the right answer.

    Another update: One important thing in a user spec is what to do when "things don't work". You have probably noticed that the critical data linking your two files doesn't always match. So what do you want to do about that? bichonfrise74 ignores those records, while Bloodnok generates a data structure that still contains an entry for each unmatched NM value, with its CG name left undefined. These are the types of things that need to be planned in advance.
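
    One possible way to make the "things don't work" case explicit is sketched below. The helper name merge_records and the two input hashes are hypothetical and assume the two files have already been parsed; it is an illustration of reporting mismatches in both directions, not a recommendation over either posted solution.

```perl
use strict;
use warnings;

# Hypothetical helper: join the two tables and report what did not match.
#   $value_for : hashref, NM id => number  (from the first file)
#   $cg_for    : hashref, NM id => CG name (from the lookup file)
sub merge_records {
    my ( $value_for, $cg_for ) = @_;
    my ( %merged, @no_cg );
    for my $nm ( sort keys %$value_for ) {
        if ( exists $cg_for->{$nm} ) {
            $merged{ $cg_for->{$nm} } = $value_for->{$nm};
        }
        else {
            push @no_cg, $nm;    # has a value, but no CG name is known
        }
    }
    # NM ids listed in the lookup file that never appeared in the first file
    my @no_value = sort grep { !exists $value_for->{$_} } keys %$cg_for;
    return ( \%merged, \@no_cg, \@no_value );
}

my %value_for = (
    NM_167210 => 2, NM_167211 => 2, NM_167212    => 2,
    NM_135492 => 1, NM_001042886 => 1,
);
my %cg_for = (
    NM_167211 => 'CG32694-RD', NM_167210    => 'CG32694-RC',
    NM_167209 => 'CG32694-RA', NM_167212    => 'CG32694-RB',
    NM_001014730 => 'CG33557-RA',
);

my ( $merged, $no_cg, $no_value ) = merge_records( \%value_for, \%cg_for );
print "$_ => $merged->{$_}\n" for sort keys %$merged;
warn "No CG name for: @$no_cg\n"        if @$no_cg;
warn "No value for NM id: @$no_value\n" if @$no_value;
```

    Whether unmatched records should be skipped, reported like this, or treated as fatal is exactly the kind of decision the spec needs to make.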

Re: ISOLATE 2 ASSOCIATED FIELDS IN A TEXT FILE, then CONVERT the first into another based on a table of definitions
by Bloodnok (Vicar) on Oct 19, 2009 at 09:24 UTC
    Hmmm,

    Is this something like what you wanted?

    use warnings;
    use strict;
    use autodie;
    use Data::Dumper;

    my %result;
    my $lookup =<<'_LOOKUP';
    CG32694-RD NM_167211
    CG32694-RC NM_167210
    CG32694-RA NM_167209
    CG32694-RB NM_167212
    CG33557-RA NM_001014730
    _LOOKUP

    open LOOKUP, '<', \$lookup;
    my %lookup = map { local @_ = split; $_[1] => $_[0]; } <LOOKUP>;
    close LOOKUP;

    while (<DATA>) {
        local @_ = split /[][#!@\s]+/;
        @{ $result{$_[2]}}{ qw/cg val/} = ($lookup{$_[2]}, $_[$#_]);
    }

    warn Dumper \%result;

    __DATA__
    385#19126!NM_167210@[1103;1104] 2
    386#19127!NM_167211@[1103;1104] 2
    387#19128!NM_167212@[1103;1104] 2
    438#1781!NM_135492@[1337] 1
    442#1794!NM_001042886@[1349] 1

    Giving:

    $ perl tst.pl
    $VAR1 = {
              'NM_167212' => {
                               'cg' => 'CG32694-RB',
                               'val' => '2'
                             },
              'NM_167210' => {
                               'cg' => 'CG32694-RC',
                               'val' => '2'
                             },
              'NM_167211' => {
                               'cg' => 'CG32694-RD',
                               'val' => '2'
                             },
              'NM_001042886' => {
                                  'cg' => undef,
                                  'val' => '1'
                                },
              'NM_135492' => {
                               'cg' => undef,
                               'val' => '1'
                             }
            };

    A user level that continues to overstate my experience :-))