ada has asked for the wisdom of the Perl Monks concerning the following question:

Hello everybody! I'm relatively new to Perl, hence the request for support. I am parsing two data files, each a list of about 5000 entries like this: 102l,0,GLU,,11,S,PSIBLAST,206l. I am interested in seeing how many values of the first column, e.g. 102l, match values in the other file. I guess I would need to parse everything into a hash and match the keys against the keys from another hash, but my inexperience is making it hard to extract the information. Do I store everything into an array first and then split and use a regex for a match? Thanks in advance. Ada.

Replies are listed 'Best First'.
Re: Matching hashes
by davido (Cardinal) on Dec 05, 2007 at 17:54 UTC

    The hash approach will work, but it will be memory bound if the files grow enormous. Anyway, if I were to take the hash approach, here's one way I might do it:

    use strict;
    use warnings;

    my %indices;

    open my $primary, '<', 'filename.txt' or die $!;
    while ( my $line = <$primary> ) {
        my $key = ( split /,/, $line )[0];
        # The following line is wrong.
        # $indices{$line} = 0;
        # Here's the correct line...
        $indices{$key} = 0;
    }
    close $primary;

    open my $secondary, '<', 'filename2.txt' or die $!;
    while ( my $line = <$secondary> ) {
        my $key = ( split /,/, $line )[0];
        if ( exists $indices{$key} ) {
            $indices{$key}++;
        }
    }
    close $secondary;

    foreach ( keys %indices ) {
        if ( $indices{$_} > 0 ) {
            print "$_ from the first file was found ",
                  $indices{$_}, " times in the second file.\n";
        }
    }

    That's one way to do it. If your files are going to grow big enough for memory to become a concern, you would need an approach that doesn't attempt to hold the whole index in memory at once. A lightweight database like SQLite could be helpful in that regard.


    Dave

      OK, I will try this posthaste and see if I get any closer to my goal! Ada x
      Hmm, nothing is returned when I execute the code, inclusive of the correction. Do you think there is another method of doing this, or perhaps it is my data and your code is perfect ;) Anyway, thanks. Ada x
      Oh great! Thanks for the code you just laid down. I will go through it and see if it works; it looks quite promising. The files are plain text with about 5000 entries in each, so I'm not too sure how memory-sapping this would be. But because I am only looking for the first element in each row, i.e. 102l from (102l,0,GLU,,11,S,PSIBLAST,206l), do I need a regex to distinguish it, e.g. /^\d+\w{3}/ ? Ada

        5000 hash entries of a four digit key won't be a problem for any computer built in this century.

        You don't need a regex. Sure, you can use one, but my split accomplishes the same thing more clearly, in my opinion. I just split on ',' and then select the first element of the split. That isolates your index number.
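        For example, with the sample record from the original question, the split and a regex capture both isolate the same first field (this snippet is just an illustration, not part of the solution above):

```perl
use strict;
use warnings;

# the sample record from the question
my $line = '102l,0,GLU,,11,S,PSIBLAST,206l';

# split on commas and take the first field -- no regex required
my $key = ( split /,/, $line )[0];

# a regex capture gets the same thing, just with more to read
my ($key2) = $line =~ /^([^,]+)/;

print "$key $key2\n";    # both are 102l
```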

        My solution ought to work as long as your data set matches your description, and as long as I didn't inadvertently introduce some typos into my code. ;)


        Dave

Re: Matching hashes
by moritz (Cardinal) on Dec 05, 2007 at 17:51 UTC
    When you say "match the keys", do you mean that you look for occurrences of identical keys? Or are you looking for substrings?

    If it is the former, you should read one line at a time, split it, and store it in a hash. (No need to store the whole file in memory.)

    And then you read the second file line by line, consulting your hash for each line.
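    That two-pass idea can be sketched like this, with in-memory sample lines standing in for the two files (the records here are made up, patterned on the one in the question):

```perl
use strict;
use warnings;

# stand-ins for the two files' lines (hypothetical sample records)
my @file1 = ( "102l,0,GLU,,11,S,PSIBLAST,206l\n",
              "103m,0,ALA,,12,H,PSIBLAST,207m\n" );
my @file2 = ( "102l,1,GLY,,13,S,PSIBLAST,208n\n",
              "999x,0,VAL,,14,E,PSIBLAST,209p\n" );

# pass 1: remember every key (first field) from the first file
my %seen;
$seen{ ( split /,/, $_ )[0] } = 1 for @file1;

# pass 2: consult the hash for each line of the second file
my @matches = grep { $seen{ ( split /,/, $_ )[0] } } @file2;

print scalar(@matches), " matching key(s)\n";    # 1 here: 102l
```

    With real files you would replace each array with a while loop over a filehandle, exactly as in the code above, so neither file is ever held in memory whole.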

    An entirely different solution is to view them as database tables, and use DBI together with DBD::CSV and do a join on the tables.

      My word, that was quick! Thanks. Yes, I am looking for occurrences of identical keys. So the first method you propose would be to store everything in an array first, then split it according to the values I am after with a regex, then store it into a hash? The second method looks daunting; I am rather a novice! Ada