More efficient way to lookup with 2 AoA's.

BioGeek has asked for the wisdom of the Perl Monks concerning the following question:

Hey All, I have 2 ArrayOfArrays, one with a gene name and its score, like this:

@gene_score = (
         [ "gene_name_0", "score_0" ],
         [ "gene_name_1", "score_1" ],
            ...
         [ "gene_name_400", "score_400" ]
);

and one with a gene name and its start and stop positions on a chromosome:

@gene_start_stop_chr = (
         [ "gene_name_0", "start_0", "stop_0", "chr_0" ],
         [ "gene_name_1", "start_1", "stop_1", "chr_1" ],
            ...
         [ "gene_name_30000", "start_30000", "stop_30000", "chr_30000" ]
);

And of course I want to match the scores with the positions, using the gene names, so that I end with an array:

@results = (
         [ "gene_name_0", "score_0", "start_0", "stop_0", "chr_0" ],
         [ "gene_name_1", "score_1", "start_1", "stop_1", "chr_1" ],
            ...
         [ "gene_name_400", "score_400", "start_400", "stop_400", "chr_400" ],
);

The code I've written so far is (@gene_start_stop_chr abbreviated to @gssc):

for (my $a = 0; $a < scalar @gene_score; $a++) {
        for (my $b = 0; $b < scalar @gssc; $b++) {
                if ("$gene_score[$a][0]" eq "$gssc[$b][0]") {
                        print "$gene_score[$a][0]\t$gene_score[$a][1]\t$gssc[$b][1]\t$gssc[$b][2]\t$gssc[$b][3]\n";
                }
        }
}

This works, but it is very slow, as I am comparing each of the 400 gene names in my first array with each of the 30000 gene names in the second array (up to 12 million string comparisons). So I was wondering if there are changes I could make to speed things up.
Thanks in advance.


Replies are listed 'Best First'.
Re: More efficient way to lookup with 2 AoA's. by Zaxo (Archbishop) on Jul 27, 2004 at 20:50 UTC
Use a hash with the gene names as keys. You can then put all the data in one structure,

my %gene = (
    gene_name_1 => { start => 'start_1', stop => 'stop_1', chr => 'chr_1' },
    # ...
);

You could add the score data there, too, unless it's more dynamic than that and is generated elsewhere.

After Compline,
Zaxo

Re: More efficient way to lookup with 2 AoA's. by BrowserUk (Patriarch) on Jul 27, 2004 at 21:09 UTC
Like everyone says--whenever you need to do a lookup in Perl: Think hashes.

#! perl -slw
use strict;
use Data::Dumper;

my @gene_score = (
    [ "gene_name_0", "score_0" ],
    [ "gene_name_1", "score_1" ],
    # ...
    [ "gene_name_400", "score_400" ]
);

my @gene_start_stop_chr = (
    [ "gene_name_0", "start_0", "stop_0", "chr_0" ],
    [ "gene_name_1", "start_1", "stop_1", "chr_1" ],
    # ...
    [ "gene_name_400", "start_400", "stop_400", "chr_400" ],
    [ "gene_name_30000", "start_30000", "stop_30000", "chr_30000" ]
);

## Build a hash from the lookup array
my %gene_start_stop_chr = map{
    $_->[ 0 ] => [ @{ $_ }[ 1 .. 3 ] ]
} @gene_start_stop_chr;

## Use it to map the inputs to results
my @results = map{
    [ $_->[ 0 ], $_->[ 1 ], @{ $gene_start_stop_chr{ $_->[ 0 ] } } ]
} @gene_score;

print Dumper \@results;

__END__
P:\test>377857
$VAR1 = [
          [
            'gene_name_0',
            'score_0',
            'start_0',
            'stop_0',
            'chr_0'
          ],
          [
            'gene_name_1',
            'score_1',
            'start_1',
            'stop_1',
            'chr_1'
          ],
          [
            'gene_name_400',
            'score_400',
            'start_400',
            'stop_400',
            'chr_400'
          ]
        ];

Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon

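One caveat with the map-based lookup above: if a gene in the score list has no entry in the lookup hash, dereferencing the missing entry is a run-time error under `use strict`. The following is a minimal sketch of the same hash lookup with an explicit `exists` check, assuming that unmatched genes should be collected and reported rather than aborting the run. The sample data, the `%position` hash and the `@unmatched` list are illustrative names, not part of the original post.

```perl
use strict;
use warnings;

# Illustrative data; gene_name_2 deliberately has no position record.
my @gene_score = (
    [ 'gene_name_0', 'score_0' ],
    [ 'gene_name_1', 'score_1' ],
    [ 'gene_name_2', 'score_2' ],
);
my @gssc = (
    [ 'gene_name_0', 'start_0', 'stop_0', 'chr_0' ],
    [ 'gene_name_1', 'start_1', 'stop_1', 'chr_1' ],
);

# Build the lookup once: gene name => [ start, stop, chr ].
my %position = map { $_->[0] => [ @{$_}[ 1 .. 3 ] ] } @gssc;

my ( @results, @unmatched );
for my $gene (@gene_score) {
    my ( $name, $score ) = @$gene;
    if ( exists $position{$name} ) {
        # One hash lookup replaces the inner loop over all positions.
        push @results, [ $name, $score, @{ $position{$name} } ];
    }
    else {
        push @unmatched, $name;    # no position known for this gene
    }
}

print join( "\t", @$_ ), "\n" for @results;
warn "No position for: @unmatched\n" if @unmatched;
```

Building the hash touches each of the position records once, and each score then costs a single lookup, so the work grows with the sum of the two list sizes rather than their product.
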
Re: More efficient way to lookup with 2 AoA's. by bgreenlee (Friar) on Jul 27, 2004 at 21:12 UTC
I wrote something up, but Zaxo (and now rir) beat me to the punch, so instead here's some code to convert your arrays into a single hash:

my %gene = ();
foreach (@gene_score) {
    $gene{$_->[0]}->{score} = $_->[1];
}
foreach (@gssc) {
    $gene{$_->[0]}->{start} = $_->[1];
    $gene{$_->[0]}->{stop}  = $_->[2];
    $gene{$_->[0]}->{chr}   = $_->[3];
}

Now `%gene` looks like:

%gene = (
    gene_name_0 => { score => 'score_0', start => 'start_0', stop => 'stop_0', chr => 'chr_0' },
    gene_name_1 => { score => 'score_1', ...
);

Brad

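Because the merged hash is filled from both arrays, it also holds the genes from the larger list that never received a score. A minimal sketch of pulling the wanted rows back out, assuming a hash in the shape shown above; the sample entries (including the unscored `gene_name_9`) and the `@results` array are illustrative.

```perl
use strict;
use warnings;

# Illustrative merged hash; gene_name_9 has position data but no score,
# as most of the 30000 genes would.
my %gene = (
    gene_name_0 => { score => 'score_0', start => 'start_0', stop => 'stop_0', chr => 'chr_0' },
    gene_name_1 => { score => 'score_1', start => 'start_1', stop => 'stop_1', chr => 'chr_1' },
    gene_name_9 => { start => 'start_9', stop => 'stop_9', chr => 'chr_9' },
);

my @results;
for my $name ( sort keys %gene ) {
    my $rec = $gene{$name};

    # Keep only genes that have both a score and a position.
    next unless exists $rec->{score} && exists $rec->{start};

    # A hash slice pulls the fields out in a fixed order.
    push @results, [ $name, @{$rec}{qw(score start stop chr)} ];
}

print join( "\t", @$_ ), "\n" for @results;
```

Sorting the keys is only there to make the output order predictable; it can be dropped if order does not matter.
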
Re: More efficient way to lookup with 2 AoA's. by rir (Vicar) on Jul 27, 2004 at 21:05 UTC
Use a hash for your smaller array. Something like:

$, = " ";

# just playing with the =>'s
my @gn_score = (
    [ name_0 => score_0 => ],
    [ name_1 => score_1 => ],
    [ name_2 => score_2 => ],
);

my @gn_start_stop_chr = (
    [ name_0 => b_0  => e_0 => ],
    [ name_1 => b_1  => e_1 => ],
    [ name_2 => b_2  => e_2 => ],
    [ name_0 => b_30 => e_3 => ],
    [ name_2 => b_42 => e_4 => ],
    [ name_1 => b_51 => e_5 => ],
);

my %score;
$score{$_->[0]} = $_->[1] for (@gn_score );

for ( @gn_start_stop_chr) {
    my ( $name => $begin => $end => ) = @$_;
    die unless exists $score{$name};
    print $name,    # or stash your data somewhere
        $score{$name}, $begin, $end, $/;
};

Re: More efficient way to lookup with 2 AoA's. by CountZero (Bishop) on Jul 27, 2004 at 21:08 UTC
Dump both AoA's into a database (each in its own table) and do a SELECT on both tables, joined on the key field gene_name. You will have to persist the AoA's somehow, or are you going to input them by hand each time (or perhaps read them from a flat file)? What better way, then, than to put them in a database from the start?

CountZero
"If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law
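
The database suggestion could look roughly like the sketch below. It assumes the DBI module with the DBD::SQLite driver (a CPAN module, not part of core Perl) and uses an in-memory database so that it stays self-contained. The table names, column names and sample rows are hypothetical, chosen only for illustration.

```perl
use strict;
use warnings;
use DBI;

# Assumption: DBD::SQLite is installed. ':memory:' avoids touching disk.
my $dbh = DBI->connect( 'dbi:SQLite:dbname=:memory:', '', '',
    { RaiseError => 1, AutoCommit => 1 } );

# Hypothetical schema: one table per AoA, keyed on gene_name.
$dbh->do('CREATE TABLE gene_score (gene_name TEXT PRIMARY KEY, score TEXT)');
$dbh->do('CREATE TABLE gene_position (gene_name TEXT PRIMARY KEY, start_pos TEXT, stop_pos TEXT, chrom TEXT)');

my @gene_score = ( [ 'gene_name_0', 'score_0' ], [ 'gene_name_1', 'score_1' ] );
my @gssc = (
    [ 'gene_name_0', 'start_0', 'stop_0', 'chr_0' ],
    [ 'gene_name_1', 'start_1', 'stop_1', 'chr_1' ],
    [ 'gene_name_2', 'start_2', 'stop_2', 'chr_2' ],
);

# Load both arrays using placeholders.
my $ins_score = $dbh->prepare('INSERT INTO gene_score VALUES (?, ?)');
$ins_score->execute(@$_) for @gene_score;
my $ins_pos = $dbh->prepare('INSERT INTO gene_position VALUES (?, ?, ?, ?)');
$ins_pos->execute(@$_) for @gssc;

# An inner join keeps only genes present in both tables.
my $results = $dbh->selectall_arrayref(<<'SQL');
SELECT s.gene_name, s.score, p.start_pos, p.stop_pos, p.chrom
FROM gene_score s
JOIN gene_position p ON p.gene_name = s.gene_name
ORDER BY s.gene_name
SQL

print join( "\t", @$_ ), "\n" for @$results;
$dbh->disconnect;
```

For a one-off run on data of this size an in-process hash is probably simpler; a database earns its keep mainly when the data has to be stored and queried repeatedly.
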
