ada has asked for the wisdom of the Perl Monks concerning the following question:

Hello everybody! I'm relatively new to Perl, hence the request for support. I am parsing two data files, each a list of about 5000 entries like this: 102l,0,GLU,,11,S,PSIBLAST,206l. I am interested in seeing how many values of the first column, e.g. 102l, match values in the other file. I guess I would need to parse everything into a hash and match the keys against the keys from another hash, but my inexperience is making it hard to extract the information. Do I store everything into an array first and then split and use a regex for a match? Thanks in advance. Ada.

Replies are listed 'Best First'.
Re: Matching hashes
by davido (Cardinal) on Dec 05, 2007 at 17:54 UTC

    The hash approach will work, but it will be memory bound if the files grow enormous. Anyway, if I were to take the hash approach, here's one way I might do it:

    use strict;
    use warnings;

    my %indices;

    open my $primary, '<', 'filename.txt' or die $!;
    while ( my $line = <$primary> ) {
        my $key = ( split /,/, $line )[0];
        # The following line is wrong.
        # $indices{$line} = 0;
        # Here's the correct line...
        $indices{$key} = 0;
    }
    close $primary;

    open my $secondary, '<', 'filename2.txt' or die $!;
    while ( my $line = <$secondary> ) {
        my $key = ( split /,/, $line )[0];
        if ( exists $indices{$key} ) {
            $indices{$key}++;
        }
    }
    close $secondary;

    foreach ( keys %indices ) {
        if ( $indices{$_} > 0 ) {
            print "$_ from the first file was found ",
                  $indices{$_}, " times in the second file.\n";
        }
    }

    That's one way to do it. If your files are going to grow big enough for memory to become a concern, you would need an approach that doesn't attempt to hold the whole index in memory at once. A lightweight database like SQLite could be helpful in that regard.


    Dave

      OK, I will try this posthaste and see if I get any closer to my goal! Ada x
      Hmm, nothing is returned when I execute the code, inclusive of the correction. Do you think there is another method of doing this, or perhaps it is my data and your code is perfect ;) Anyway, thanks. Ada x
      Oh great! Thanks for the code you just laid down. I will go through it and see if it works; it looks quite promising. The files are plain text with about 5000 entries in each, so I'm not too sure how memory-sapping this would be. But because I am only looking for the first element in each row, i.e. 102l from (102l,0,GLU,,11,S,PSIBLAST,206l), do I need a regex to distinguish it, e.g. /^\d+\w{3}/ ? Ada

        5000 hash entries of a four digit key won't be a problem for any computer built in this century.

        You don't need a regex. Sure, you can use one, but my split accomplishes the same thing more clearly, in my opinion. I just split on ',' and then select the first element of the split. That isolates your index number.
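        For example, with the sample record from the original question, the split and a regex capture both isolate the same first field (this snippet is just an illustration, not part of the solution above):

```perl
use strict;
use warnings;

# the sample record from the question
my $line = '102l,0,GLU,,11,S,PSIBLAST,206l';

# split on commas and take the first field -- no regex required
my $key = ( split /,/, $line )[0];

# a regex capture gets the same thing, just with more to read
my ($key2) = $line =~ /^([^,]+)/;

print "$key $key2\n";    # both are 102l
```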

        My solution ought to work as long as your data set matches your description, and as long as I didn't inadvertently introduce some typos into my code. ;)


        Dave

Re: Matching hashes
by moritz (Cardinal) on Dec 05, 2007 at 17:51 UTC
    When you say "match the keys", do you mean that you look for occurrences of identical keys? Or are you looking for substrings?

    If it is the former, you should read one line at a time, split it, and store it in a hash. (No need to store the whole file in memory.)

    And then you read the second file line by line, consulting your hash for each line.
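    That two-pass idea can be sketched like this, with in-memory sample lines standing in for the two files (the records here are made up, patterned on the one in the question):

```perl
use strict;
use warnings;

# stand-ins for the two files' lines (hypothetical sample records)
my @file1 = ( "102l,0,GLU,,11,S,PSIBLAST,206l\n",
              "103m,0,ALA,,12,H,PSIBLAST,207m\n" );
my @file2 = ( "102l,1,GLY,,13,S,PSIBLAST,208n\n",
              "999x,0,VAL,,14,E,PSIBLAST,209p\n" );

# pass 1: remember every key (first field) from the first file
my %seen;
$seen{ ( split /,/, $_ )[0] } = 1 for @file1;

# pass 2: consult the hash for each line of the second file
my @matches = grep { $seen{ ( split /,/, $_ )[0] } } @file2;

print scalar(@matches), " matching key(s)\n";    # 1 here: 102l
```

    With real files you would replace each array with a while loop over a filehandle, exactly as in the code above, so neither file is ever held in memory whole.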

    An entirely different solution is to view them as database tables, and use DBI together with DBD::CSV and do a join on the tables.

      My word, that was quick! Thanks. Yes, I am looking for occurrences of identical keys. So the first method you propose would be to store everything in an array first, then split it according to the values I am after with a regex, then store it into a hash? The second method looks daunting; I am rather a novice! Ada