http://qs1969.pair.com?node_id=377857

BioGeek has asked for the wisdom of the Perl Monks concerning the following question:

Hey All, I have 2 ArrayOfArrays, one with a gene name and its score, like this:
@gene_score = (
    [ "gene_name_0",   "score_0" ],
    [ "gene_name_1",   "score_1" ],
    ...
    [ "gene_name_400", "score_400" ]
);
and one with a gene name and its start and stop positions on a chromosome:
@gene_start_stop_chr = (
    [ "gene_name_0",     "start_0",     "stop_0",     "chr_0" ],
    [ "gene_name_1",     "start_1",     "stop_1",     "chr_1" ],
    ...
    [ "gene_name_30000", "start_30000", "stop_30000", "chr_30000" ]
);
And of course I want to match the scores with the positions, using the gene names, so that I end with an array:
@results = (
    [ "gene_name_0",   "score_0",   "start_0",   "stop_0",   "chr_0" ],
    [ "gene_name_1",   "score_1",   "start_1",   "stop_1",   "chr_1" ],
    ...
    [ "gene_name_400", "score_400", "start_400", "stop_400", "chr_400" ],
);
The code I've written so far is (@gene_start_stop_chr abbreviated to @gssc):
for (my $a = 0; $a < scalar @gene_score; $a++) {
    for (my $b = 0; $b < scalar @gssc; $b++) {
        if ("$gene_score[$a][0]" eq "$gssc[$b][0]") {
            print "$gene_score[$a][0]\t$gene_score[$a][1]\t$gssc[$b][1]\t$gssc[$b][2]\t$gssc[$b][3]\n";
        }
    }
}
This works, but it is very slow, because I am comparing each of the 400 gene names in my first array with each of the 30000 gene names in the second array (up to 400 x 30000 = 12,000,000 string comparisons). So I was wondering if there are changes I could make to speed things up.
Thanks in advance.

Replies are listed 'Best First'.
Re: More efficient way to lookup with 2 AoA's.
by Zaxo (Archbishop) on Jul 27, 2004 at 20:50 UTC

    Use a hash with the gene names as keys. You can then put all the data in one structure,

    my %gene = (
        gene_name_1 => { start => 'start_1', stop => 'stop_1', chr => 'chr_1' },
        # ...
    );
    You could add the score data there, too, unless it's more dynamic than that and is generated elsewhere.

    After Compline,
    Zaxo

Re: More efficient way to lookup with 2 AoA's.
by BrowserUk (Patriarch) on Jul 27, 2004 at 21:09 UTC

    Like everyone says--whenever you need to do a lookup in Perl: Think hashes,

    #! perl -slw
    use strict;
    use Data::Dumper;

    my @gene_score = (
        [ "gene_name_0", "score_0" ],
        [ "gene_name_1", "score_1" ],
        # ...
        [ "gene_name_400", "score_400" ]
    );

    my @gene_start_stop_chr = (
        [ "gene_name_0", "start_0", "stop_0", "chr_0" ],
        [ "gene_name_1", "start_1", "stop_1", "chr_1" ],
        # ...
        [ "gene_name_400", "start_400", "stop_400", "chr_400" ],
        [ "gene_name_30000", "start_30000", "stop_30000", "chr_30000" ]
    );

    ## Build a hash from the lookup array
    my %gene_start_stop_chr =
        map{ $_->[ 0 ] => [ @{ $_ }[ 1 .. 3 ] ] } @gene_start_stop_chr;

    ## Use it to map the inputs to results
    my @results = map{
        [ $_->[ 0 ], $_->[ 1 ], @{ $gene_start_stop_chr{ $_->[ 0 ] } } ]
    } @gene_score;

    print Dumper \@results;

    __END__
    P:\test>377857
    $VAR1 = [
              [
                'gene_name_0',
                'score_0',
                'start_0',
                'stop_0',
                'chr_0'
              ],
              [
                'gene_name_1',
                'score_1',
                'start_1',
                'stop_1',
                'chr_1'
              ],
              [
                'gene_name_400',
                'score_400',
                'start_400',
                'stop_400',
                'chr_400'
              ]
            ];
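One case the map above does not cover is a scored gene whose name is missing from the position list: under `use strict`, dereferencing the missing hash entry is a fatal error. The following is only a sketch of one way to guard against that, with made-up sample data; the names `%position_for` and `@unmatched` are illustrative and not from the original post.

```perl
use strict;
use warnings;

# Hypothetical sample data: gene_name_2 has a score but no position entry.
my @gene_score = (
    [ 'gene_name_0', 'score_0' ],
    [ 'gene_name_2', 'score_2' ],
);
my @gene_start_stop_chr = (
    [ 'gene_name_0', 'start_0', 'stop_0', 'chr_0' ],
    [ 'gene_name_1', 'start_1', 'stop_1', 'chr_1' ],
);

# Lookup hash: gene name => [ start, stop, chr ]
my %position_for;
$position_for{ $_->[0] } = [ @{$_}[ 1 .. 3 ] ] for @gene_start_stop_chr;

# One hash lookup per scored gene; unmatched names are collected, not dereferenced.
my ( @results, @unmatched );
for my $gene (@gene_score) {
    my ( $name, $score ) = @{$gene};
    if ( exists $position_for{$name} ) {
        push @results, [ $name, $score, @{ $position_for{$name} } ];
    }
    else {
        push @unmatched, $name;
    }
}

print join( "\t", @{$_} ), "\n" for @results;
print "no position for: @unmatched\n" if @unmatched;
```

Whether unmatched genes should be skipped, reported, or treated as an error depends on the data, so the `else` branch is the part to adapt.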

    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    "Memory, processor, disk in that order on the hardware side. Algorithm, algoritm, algorithm on the code side." - tachyon
Re: More efficient way to lookup with 2 AoA's.
by bgreenlee (Friar) on Jul 27, 2004 at 21:12 UTC

    I wrote something up, but Zaxo (and now rir) beat me to the punch, so instead here's some code to convert your arrays into a single hash:

    my %gene = ();
    foreach (@gene_score) {
        $gene{$_->[0]}->{score} = $_->[1];
    }
    foreach (@gssc) {
        $gene{$_->[0]}->{start} = $_->[1];
        $gene{$_->[0]}->{stop}  = $_->[2];
        $gene{$_->[0]}->{chr}   = $_->[3];
    }

    Now %gene looks like:

    %gene = (
        gene_name_0 => {
            score => 'score_0',
            start => 'start_0',
            stop  => 'stop_0',
            chr   => 'chr_0'
        },
        gene_name_1 => {
            score => 'score_1',
            ...
    );
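Because %gene is filled from both arrays, it ends up holding every gene in the position list, including those that never received a score. As a rough sketch (the sample hash below is invented for illustration), the rows the original question asked for could be pulled back out by keeping only the entries that have a `score` key:

```perl
use strict;
use warnings;

# Hypothetical sample in the shape shown above; gene_name_1 has no score.
my %gene = (
    gene_name_0 => { score => 'score_0', start => 'start_0', stop => 'stop_0', chr => 'chr_0' },
    gene_name_1 => { start => 'start_1', stop => 'stop_1', chr => 'chr_1' },
);

my @results;
for my $name ( sort keys %gene ) {
    my $record = $gene{$name};
    next unless exists $record->{score};    # position only, no score: skip
    # Hash slice on the record pulls the fields out in a fixed order.
    push @results, [ $name, @{$record}{qw(score start stop chr)} ];
}

print join( "\t", @{$_} ), "\n" for @results;
```

The `sort` is only there to make the output order predictable; hash keys come back in no particular order otherwise.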

    Brad

Re: More efficient way to lookup with 2 AoA's.
by rir (Vicar) on Jul 27, 2004 at 21:05 UTC
    Use a hash for your smaller array. Something like:
    $, = " ";

    # just playing with the =>'s
    my @gn_score = (
        [ name_0 => score_0 => ],
        [ name_1 => score_1 => ],
        [ name_2 => score_2 => ],
    );

    my @gn_start_stop_chr = (
        [ name_0 => b_0 => e_0 => ],
        [ name_1 => b_1 => e_1 => ],
        [ name_2 => b_2 => e_2 => ],
        [ name_0 => b_30 => e_3 => ],
        [ name_2 => b_42 => e_4 => ],
        [ name_1 => b_51 => e_5 => ],
    );

    my %score;
    $score{$_->[0]} = $_->[1] for (@gn_score);

    for (@gn_start_stop_chr) {
        my ( $name => $begin => $end => ) = @$_;
        die unless exists $score{$name};
        print $name,    # or stash your data somewhere
              $score{$name},
              $begin,
              $end,
              $/;
    };
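With the data from the question, most of the 30000 position rows have no score, so the `die unless exists` line would stop at the first unscored gene. A minimal variation, assuming unscored genes should simply be skipped, hashes the small score list and makes a single pass over the large list (the sample rows here are invented):

```perl
use strict;
use warnings;

# Hypothetical sample data; name_9 has a position but no score.
my @gn_score = ( [ 'name_0', 'score_0' ], [ 'name_1', 'score_1' ] );
my @gn_start_stop_chr = (
    [ 'name_0', 'b_0', 'e_0' ],
    [ 'name_9', 'b_9', 'e_9' ],
    [ 'name_1', 'b_1', 'e_1' ],
);

# Hash only the smaller list: gene name => score
my %score;
$score{ $_->[0] } = $_->[1] for @gn_score;

# Single pass over the larger list, one hash lookup per row.
my @results;
for my $row (@gn_start_stop_chr) {
    my ( $name, $begin, $end ) = @{$row};
    next unless exists $score{$name};    # unscored gene: skip instead of die
    push @results, [ $name, $score{$name}, $begin, $end ];
}

print "@{$_}\n" for @results;
```

Only the 400-entry hash has to stay in memory this way, which matters if the larger list is read row by row rather than held in an array.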
Re: More efficient way to lookup with 2 AoA's.
by CountZero (Bishop) on Jul 27, 2004 at 21:08 UTC
    Dump both AoA's into a database (each in its own table) and do a SELECT on both tables, joined on the gene_name key field. You will have to persist the AoA's somehow anyway, or are you going to enter them by hand each time (or perhaps read them from a flat file)? What better way, then, than to put them in a database from the start?
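As a rough sketch of the database approach, assuming the DBI and DBD::SQLite modules are installed (neither ships with core Perl) and that each gene name appears only once per table. The table and column names are made up for illustration, and an in-memory database stands in for a real file:

```perl
use strict;
use warnings;
use DBI;    # assumption: DBI and DBD::SQLite are available

# Hypothetical sample data in the shape used in the question.
my @gene_score = ( [ 'gene_name_0', 'score_0' ], [ 'gene_name_1', 'score_1' ] );
my @gene_start_stop_chr = (
    [ 'gene_name_0', 'start_0', 'stop_0', 'chr_0' ],
    [ 'gene_name_1', 'start_1', 'stop_1', 'chr_1' ],
    [ 'gene_name_2', 'start_2', 'stop_2', 'chr_2' ],
);

my $dbh = DBI->connect( 'dbi:SQLite:dbname=:memory:', '', '',
    { RaiseError => 1, AutoCommit => 1 } );

# One table per AoA, keyed on the gene name.
$dbh->do('CREATE TABLE gene_score (gene_name TEXT PRIMARY KEY, score TEXT)');
$dbh->do('CREATE TABLE gene_position (gene_name TEXT PRIMARY KEY, start_pos TEXT, stop_pos TEXT, chrom TEXT)');

my $ins_score = $dbh->prepare('INSERT INTO gene_score VALUES (?, ?)');
$ins_score->execute( @{$_} ) for @gene_score;

my $ins_pos = $dbh->prepare('INSERT INTO gene_position VALUES (?, ?, ?, ?)');
$ins_pos->execute( @{$_} ) for @gene_start_stop_chr;

# The inner join keeps only genes present in both tables.
my $results = $dbh->selectall_arrayref(<<'SQL');
SELECT s.gene_name, s.score, p.start_pos, p.stop_pos, p.chrom
FROM gene_score AS s
JOIN gene_position AS p ON p.gene_name = s.gene_name
ORDER BY s.gene_name
SQL

print join( "\t", @{$_} ), "\n" for @{$results};
$dbh->disconnect;
```

For a one-off script the hash approach is probably simpler; a database starts to pay off once the data has to be stored and queried repeatedly.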

    CountZero

    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law