in reply to Matching hashes

The hash approach will work, but it will be memory bound if the files grow enormous. Anyway, if I were to take the hash approach, here's one way I might do it:

use strict;
use warnings;

my %indices;

# Build the index from the first field of each line in the first file.
open my $primary, '<', 'filename.txt' or die $!;
while ( my $line = <$primary> ) {
    my $key = ( split /,/, $line )[0];
    # The following line is wrong.
    # $indices{$line} = 0;
    # Here's the correct line...
    $indices{$key} = 0;
}
close $primary;

# Count how often each indexed key appears in the second file.
open my $secondary, '<', 'filename2.txt' or die $!;
while ( my $line = <$secondary> ) {
    my $key = ( split /,/, $line )[0];
    if ( exists $indices{$key} ) {
        $indices{$key}++;
    }
}
close $secondary;

# Report only the keys that were found at least once.
foreach ( keys %indices ) {
    if ( $indices{$_} > 0 ) {
        print "$_ from the first file was found ", $indices{$_}, " times in the second file.\n";
    }
}

That's one way to do it. If your files are going to grow big enough for memory to become a concern, you would need an approach that doesn't attempt to hold the whole index in memory at once. A lightweight database like SQLite could be helpful in that regard.
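As a rough sketch of keeping the index out of memory, here is the same counting logic with the hash tied to an on-disk file. Note that this swaps in the core SDBM_File module instead of SQLite, purely so the example runs without installing anything. The sub name count_matches_on_disk, the temporary database path, and the sample lines are all made up for illustration.

```perl
use strict;
use warnings;
use Fcntl;                      # O_RDWR, O_CREAT
use SDBM_File;                  # ships with core Perl
use File::Temp qw(tempdir);

# Hypothetical helper: same logic as the in-memory version, but %indices
# is tied to an SDBM file on disk instead of living on the heap.
sub count_matches_on_disk {
    my ( $primary_fh, $secondary_fh, $db_path ) = @_;

    tie my %indices, 'SDBM_File', $db_path, O_RDWR | O_CREAT, 0666
        or die "Cannot tie $db_path: $!";

    # Index the first comma-separated field of each primary line.
    while ( my $line = <$primary_fh> ) {
        chomp $line;
        my $key = ( split /,/, $line )[0];
        next unless defined $key && length $key;
        $indices{$key} = 0;
    }

    # Count occurrences of indexed keys in the secondary input.
    while ( my $line = <$secondary_fh> ) {
        chomp $line;
        my $key = ( split /,/, $line )[0];
        next unless defined $key && length $key;
        $indices{$key}++ if exists $indices{$key};
    }

    # Collect only the matches. For truly huge result sets you would
    # print inside this loop rather than build another hash.
    my %found;
    while ( my ( $key, $count ) = each %indices ) {
        $found{$key} = $count if $count > 0;
    }
    untie %indices;
    return \%found;
}

# Made-up sample data, read through in-memory filehandles.
my $primary_data   = "102l,0,GLU\n103l,1,ALA\n";
my $secondary_data = "102l,5,GLU\n102l,6,GLU\n999x,0,ALA\n";
open my $primary,   '<', \$primary_data   or die $!;
open my $secondary, '<', \$secondary_data or die $!;

my $dir   = tempdir( CLEANUP => 1 );
my $found = count_matches_on_disk( $primary, $secondary, "$dir/indices" );

print "$_ from the first file was found $found->{$_} times in the second file.\n"
    for sort keys %$found;
```

SDBM limits the combined size of a key and its value to roughly a kilobyte, which is fine for short keys like these. For anything richer, a real database such as SQLite is probably the better fit.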


Dave

Replies are listed 'Best First'.
Re^2: Matching hashes
by ada (Novice) on Dec 05, 2007 at 19:51 UTC
    Ok I will try this post haste and see if I get any closer to my goal! Ada x
Re^2: Matching hashes
by ada (Novice) on Dec 05, 2007 at 20:06 UTC
    Hmm, there is nothing being returned when I execute the code, inclusive of the correction. Do you think there is another method of doing this, or perhaps it is my data and your code is perfect ;) Anyway, thanks. Ada x
Re^2: Matching hashes
by ada (Novice) on Dec 05, 2007 at 18:12 UTC
    Oh great! Thanks for the code you just laid down. I will go through this and see if it works; it looks quite promising. The files are plain text files with about 5000 entries in each, so I'm not too sure how memory sapping this would be. But because I am only looking for the first element in the column, i.e. 1021 from :(102l,0,GLU,,11,S,PSIBLAST,206l), do I need a regex to distinguish it, e.g. /^\d+\w{3}/ ? Ada

      5000 hash entries of a four digit key won't be a problem for any computer built in this century.

      You don't need a regex. Sure, you can use one, but my split accomplishes the same thing more clearly, in my opinion. I just split on ',' and then select the first element of the split. That isolates your index number.
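To make the comparison concrete, here is a small sketch that pulls the key out of the sample line from the question both ways. It assumes the line appears in the file without the surrounding parentheses; the variable names are just for illustration.

```perl
use strict;
use warnings;

# Sample line from the question, assumed to have no surrounding parentheses.
my $line = "102l,0,GLU,,11,S,PSIBLAST,206l\n";

# split on ',' and keep only the first field.
my $key_from_split = ( split /,/, $line )[0];

# Equivalent regex: capture everything up to the first comma.
my ($key_from_regex) = $line =~ /^([^,]*)/;

print "split: $key_from_split\n";
print "regex: $key_from_regex\n";
```

Both lines print 102l. If the real data did carry a leading parenthesis or stray whitespace, the key would include it, and lookups against the other file could silently miss.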

      My solution ought to work as long as your data set matches your description, and as long as I didn't inadvertently introduce some typos into my code. ;)


      Dave
