snape has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks

I am having a file which as follows:

2_SecVal 4 1 1 0 0 0 1 0 0 1 0 1_SecVal 4 1 0 1 0 1 0 0 0 1 0 2_FirVal 4 1 0 1 0 1 1 0 0 0 0 3_FirVal 4 0 0 1 0 1 1 0 0 1 0 3_SecVal 4 0 0 1 0 1 1 0 0 1 0 2_FirVal 5 1 0 1 0 0 0 1 1 0 1 4_SecVal 5 1 0 1 0 1 1 0 0 1 0 5_SecVal 5 1 0 1 0 1 1 0 0 1 0 7_FirVal 5 1 0 1 0 1 1 0 0 1 0 4_FirVal 5 1 0 1 0 1 1 0 0 1 0 5_FirVal 5 1 1 0 0 0 0 1 1 0 1 6_FirVal 5 1 0 1 0 1 1 0 0 1 0 6_SecVal 5 1 0 1 0 0 0 1 1 0 1 7_SecVal 5 1 1 0 0 0 0 1 1 0 1

I am reading the file into hash of scalar references with first column as first key and second column as the second key. I am trying to match the values which are common in the entire hash and I am counting it to find the number of common value in the hash. I am using "undef" to remove the pattern which is matching (while iterating -- which I think is not good way of programing, so any other tips for the same) and therefore would like to know the total number of them. I am little skeptical and therefore would like to know whether it is working or not. My code is as follows:

while(<$IN>){ chomp; my ($id, $sum, $haploStr) = split ('\t',$_); push @{$data{$id}{$sum}}, \$haploStr; } my @keys = keys %data; my $j = 1; for my $i ( 0 .. $#keys ) { my $key1 = $keys[$i]; my $count = 1; for my $sum1 ( %{ $data{$key1} } ) { for my $str1 (@{$data{$key1}{$sum1}}){ for my $key2 ( @keys[ $i + 1 .. $#keys ] ) { for my $sum2 ( %{ $data{$key2} } ) { for my $str2(@{$data{$key2}{$sum2}}){ if ($$str1 eq $$str2){ #print "MATCH: $$str1 \t $$str2\n "; $count++; undef @{$data{$key2}{$sum2}}; } } } } } } print "H$j\tMatching Pattern as $key1\t $count\n"; $j++; }

Also, it is taking more time than expected therefore would like to known your comments. Thanks.

Update: I have the change from array to scalar references. I am sorry for that

Replies are listed 'Best First'.
Re: Extract the common values in hash of array with double keys
by Cristoforo (Curate) on Feb 04, 2010 at 21:14 UTC
    This way avoids all the loops you have in your solution.
    #!/usr/bin/perl use strict; use warnings; use Data::Dumper; my %data; while(<DATA>){ chomp; my ($id, $sum, $haploStr) = split '\t'; $data{ $haploStr }{ count }++; push @{ $data{$haploStr}{id} }, $id; } for my $haploStr ( sort by_count keys %data) { print "$haploStr\tCount: $data{$haploStr}{count}\n"; print "\t$_\n" for @{ $data{$haploStr}{id} }; } sub by_count { $data{$b}{count} <=> $data{$a}{count} } __DATA__ 2_SecVal 4 1 1 0 0 0 1 0 0 1 0 1_SecVal 4 1 0 1 0 1 0 0 0 1 0 2_FirVal 4 1 0 1 0 1 1 0 0 0 0 3_FirVal 4 0 0 1 0 1 1 0 0 1 0 3_SecVal 4 0 0 1 0 1 1 0 0 1 0 2_FirVal 5 1 0 1 0 0 0 1 1 0 1 4_SecVal 5 1 0 1 0 1 1 0 0 1 0 5_SecVal 5 1 0 1 0 1 1 0 0 1 0 7_FirVal 5 1 0 1 0 1 1 0 0 1 0 4_FirVal 5 1 0 1 0 1 1 0 0 1 0 5_FirVal 5 1 1 0 0 0 0 1 1 0 1 6_FirVal 5 1 0 1 0 1 1 0 0 1 0 6_SecVal 5 1 0 1 0 0 0 1 1 0 1 7_SecVal 5 1 1 0 0 0 0 1 1 0 1
    Output
    1 0 1 0 1 1 0 0 1 0 Count: 5 4_SecVal 5_SecVal 7_FirVal 4_FirVal 6_FirVal 1 0 1 0 0 0 1 1 0 1 Count: 2 2_FirVal 6_SecVal 1 1 0 0 0 0 1 1 0 1 Count: 2 5_FirVal 7_SecVal 0 0 1 0 1 1 0 0 1 0 Count: 2 3_FirVal 3_SecVal 1 0 1 0 1 0 0 0 1 0 Count: 1 1_SecVal 1 0 1 0 1 1 0 0 0 0 Count: 1 2_FirVal 1 1 0 0 0 1 0 0 1 0 Count: 1 2_SecVal
      Thanks a lot for helping me out.
Re: Extract the common values in hash of array with double keys
by toolic (Bishop) on Feb 04, 2010 at 19:38 UTC
    would like to know whether it is working or not.
    Only you know if your code is working the way you want it to work. I don't follow your description of your requirements. That being said, the following looks suspicious:
    my ($id, $sum, $haploStr) = split ('\t',$_); push @{$data{$id}{$sum}}, \$haploStr;
    Your data structure is a hash of hashes of ... scalar references. Did you really want a hash of hashes of arrays?
    my ($id, $sum, @haploStr) = split ('\t',$_); push @{$data{$id}{$sum}}, \@haploStr;
    Use Data::Dumper after your while loop to get a better idea of what I'm taking about:
    print Dumper(\%data);
    I am using "undef" to remove the pattern which is matching (while iterating -- which I think is not good way of programing,
    I agree that using undef seems unconventional here. Perhaps you want to delete?

      delete won't work as it will give me a run time error that the free space is being encountered after deleting the element during iteration

        You have to change the line to:
        delete $data{$key2}{$sum2};

        UPDATE: Oops, I think I misunderstood your problem. You might include a check for existence before you access an array value in the loop

        if (exists $data{$key1}{$sum1}) { ... # and later if (exists $data{$key2}{$sum2}) { ... # or next if (not exists $data{$key2}{$sum2});

        You might even avoid one of the exists-lines if you don't delete the second item (i.e. $data{$key2}{$sum2}) but the original item (i.e. $data{$key1}{$sum1}) and exit the inner loops immediately. If a third identical item is there it will be detected when the second item later becomes an original item. Since you never delete an item that $key1 and $sum1 might encounter later on, you don't need to test $data{$key1}{$sum1} for existence.

        Whether your code is correct I can't easily judge (without investing a lot more time). Maybe you should extract your algorithm to a subroutine and the comparision of the two arrays to another subroutine, that would make your program much more readable and testable. Then write some test cases and compare the expected result with what you really get