vikas.bansal has asked for the wisdom of the Perl Monks concerning the following question:

I have tried a lot but could not achieve the result I need. I have two big files (tab-delimited). First file ->
Col1      Col2    Col3    Col4    Col5    Col6    Col7    Col8
101_#2    1       H       F0      263     278     2       1.5
102_#1    1       6       F1      766     781     1       1.0
103_#1    2       15      V1      526     581     1       0.0
103_#1    2       9       V2      124     134     1       1.3
104_#1    1       12      V3      137     172     1       1.0
105_#1    1       17      F2      766     771     1       1.0
Second file ->
Col1      Col2    Col3    Col4
97486     H       262     279
67486     9       118     119
87486     9       183     185
248233    9       124     134
If the Col3 value of file1 and the Col2 value of file2 are the same, then compare Col5 and Col6 of file1 (a range) with Col3 and Col4 of file2. If the range from file1 falls within the range in file2, return that row from file1 and append Col1 of file2 as an extra column in the output. Expected output ->
Col1      Col2    Col3    Col4    Col5    Col6    Col7    Col8    Col9
101_#2    1       H       F0      263     278     2       1.5     97486
103_#1    2       9       V2      124     134     1       1.3     248233
So far I have tried something with hashes ->
@ARGV or die "No input file specified";
open my $first, '<', $ARGV[0] or die "Unable to open input file: $!";
open my $second, '<', $ARGV[1] or die "Unable to open input file: $!";
print scalar (<$first>);
while (<$second>) {
    chomp;
    @line = split /\s+/;
    $hash{$line[2]} = $line[3];
}
while (<$first>) {
    @cols = split /\s+/;
    $p1 = $cols[4];
    $p2 = $cols[5];
    foreach $key (sort keys %hash) {
        if ($p1 >= "$key") {
            if ($p2 <= $hash{$key}) {
                print join("\t", @cols), "\n";
            }
        }
        else { next; }
    }
}
However, the above code never compares the Col3 value of file1 with the Col2 value of file2, and it is also taking a lot of time and memory. Can anybody suggest how I can make it fast using hashes or hashes of hashes? Thanks a lot.
Hello everyone,

Thanks a lot for your help. I have figured out an efficient way to solve my own question ->
@ARGV or die "No input file specified";
open $first, '<', $ARGV[0] or die "Unable to open input file: $!";
open $second, '<', $ARGV[1] or die "Unable to open input file: $!";
print scalar (<$first>);
while (<$second>) {
    chomp;
    @line = split /\s+/;
    $hash{$line[1]}{$line[2]}{$line[3]} = $line[0];
}
while (<$first>) {
    @cols = split /\s+/;
    foreach $key1 (sort keys %hash) {
        foreach $key2 (sort keys %{$hash{$key1}}) {
            foreach $key3 (sort keys %{$hash{$key1}{$key2}}) {
                if (($cols[2] eq $key1) && ($cols[4] >= $key2) && ($cols[5] <= $key3)) {
                    print join("\t", @cols), "\t", $hash{$key1}{$key2}{$key3}, "\n";
                }
                last;
            }
        }
    }
}

Replies are listed 'Best First'.
Re: Using perl hash for comparing columns of 2 files
by jethro (Monsignor) on Jan 31, 2012 at 11:39 UTC

    Your script has a few problems:

    You don't store Col2 anywhere in your hash, so you obviously can't compare it with Col3 of the other file. Also, you use a hash, but then the only thing you do with it is loop through it!? A hash is for looking up values instantly; if all you do is loop, an array would be better suited.

    So (for the lookup hash built from file2) you need a hash with Col2 as the key and the two comparison values (Col3 and Col4, which get checked against Col5 and Col6 of file1) as values. You could use a Hash of Arrays for this:

    push @{$hash{$line[1]}}, $line[2], $line[3];

    Now, instead of the foreach loop you just need to check if there is the right value in the hash:

    if (exists $hash{$cols[2]}) {
        if ($p1 >= $hash{$cols[2]}[0] and $p2 <= $hash{$cols[2]}[1]) {
            ...
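
    Putting the two fragments together, here is a minimal complete sketch of that approach (assuming both files arrive on the command line and each starts with a header row). Note that one key can occur on several rows of file2 (9 appears three times), and Col1 of file2 is needed in the output, so each hash value below is a list of [Col1, low, high] triples rather than a single flat pair:

    use strict;
    use warnings;

    @ARGV == 2 or die "Usage: $0 file1 file2\n";
    my ( $f1, $f2 ) = @ARGV;
    open my $first,  '<', $f1 or die "Unable to open $f1: $!";
    open my $second, '<', $f2 or die "Unable to open $f2: $!";

    # Lookup built from file2: Col2 is the key; because one key can
    # occur on several rows, each value is a list of [Col1, low, high].
    my %ranges;
    <$second>;                                      # skip the header line
    while (<$second>) {
        my @f = split;
        push @{ $ranges{ $f[1] } }, [ $f[0], $f[2], $f[3] ];
    }

    print scalar <$first>;                          # echo file1's header
    while (<$first>) {
        chomp;
        my @cols = split;
        my $entries = $ranges{ $cols[2] } or next;  # instant lookup, no loop over keys
        for my $e (@$entries) {
            if ( $cols[4] >= $e->[1] and $cols[5] <= $e->[2] ) {
                print join( "\t", @cols, $e->[0] ), "\n";
            }
        }
    }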
Re: Using perl hash for comparing columns of 2 files
by choroba (Cardinal) on Jan 31, 2012 at 11:36 UTC
    Crossposted at StackOverflow. It is considered polite to mention crossposting so that people do not invest their time in problems already solved at the other site.
Re: Using perl hash for comparing columns of 2 files
by sundialsvc4 (Abbot) on Jan 31, 2012 at 14:14 UTC

    An almost-generic approach to any problem of this sort is, first, to use a package like Text::CSV (or do a search at http://search.cpan.org for the (currently) 463 packages that match the search-term “CSV”), then write a solution that uses lists and/or hashes.
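
    For example, a minimal sketch of reading one of the tab-delimited files with Text::CSV (the filename is made up, and everything past the getline loop is left open):

    use strict;
    use warnings;
    use Text::CSV;

    # sep_char defaults to a comma; switch it to a tab for these files.
    my $csv = Text::CSV->new( { sep_char => "\t", binary => 1, auto_diag => 1 } );

    open my $fh, '<', 'file1.tsv' or die "Unable to open file1.tsv: $!";
    my $header = $csv->getline($fh);              # the Col1 .. Col8 header row
    while ( my $row = $csv->getline($fh) ) {
        my ( $link, $low, $high ) = @{$row}[ 2, 4, 5 ];
        # ... build the lookup hash / compare the ranges here ...
    }
    close $fh;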

    Or, consider a purely non-Perl solution, such as what is offered by the (public domain!) SQLite database (http://www.sqlite.org), which is a server-less, “all in one file” SQL engine that is certainly in your cell phone and maybe also in your microwave oven. Now all your data is in “one file” that you can query, and you did not have to “write a program” either to get the tab-delimited data into it or to get your answers out. And when (your manager | your customer | the sales department) says to you, “this is wonderful, but now can you tell me <this?>, umm, right now?” ... you can probably do it on the spot. And that, to me, says, “Win!”
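
    That said, if the query does get scripted after all, the join itself is short. A sketch via DBI and DBD::SQLite (table and column names here are invented, and loading the rows from the two files is elided):

    use strict;
    use warnings;
    use DBI;

    # An in-memory database for the example; use a filename to keep it on disk.
    my $dbh = DBI->connect( 'dbi:SQLite:dbname=:memory:', '', '',
                            { RaiseError => 1 } );

    $dbh->do('CREATE TABLE f1 (c1, c2, c3, c4, c5 INTEGER, c6 INTEGER, c7, c8)');
    $dbh->do('CREATE TABLE f2 (c1, c2, c3 INTEGER, c4 INTEGER)');

    # ... INSERT the rows of both files here ...

    # One query does the whole job: match on the link column, then keep
    # the f1 rows whose range lies inside the f2 range, tacking on f2.c1.
    my $sth = $dbh->prepare(
        'SELECT f1.*, f2.c1
           FROM f1 JOIN f2 ON f1.c3 = f2.c2
          WHERE f1.c5 >= f2.c3 AND f1.c6 <= f2.c4'
    );
    $sth->execute;
    while ( my @row = $sth->fetchrow_array ) {
        print join( "\t", @row ), "\n";
    }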

Re: Using perl hash for comparing columns of 2 files
by Marshall (Canon) on Jan 31, 2012 at 21:01 UTC
    Making file2 into a hash of arrays (HoA) keyed on the main link between the two files (the column holding H, 9, ...) narrows down the number of rows to search for a match. Your HoHoH is unnecessary complexity, and there is no need for sorting.
#!/usr/bin/perl -w
use strict;
use Data::Dumper;
use autodie;

my $file2text = <<END;
97486 H 262 279
67486 9 118 119
87486 9 183 185
248233 9 124 134
END

open my $fh2, '<', \$file2text;
my %file2;    # this is a HashofArray (HoA)
while (<$fh2>) {
    my @file2_cols = split;
    push @{ $file2{ $file2_cols[1] } }, [@file2_cols];
}
#print Dumper \%file2;

while (<DATA>) {
    chomp;
    my ( $link, $low_limit, $high_limit ) = (split)[ 2, 4, 5 ];
    next unless exists $file2{$link};
    foreach my $row_ref ( @{ $file2{$link} } ) {
        if (    ( $row_ref->[2] <= $low_limit )
            and ( $row_ref->[3] >= $high_limit ) )
        {
            print "$_ $row_ref->[0]\n";
        }
    }
}

=prints
101_#2 1 H F0 263 278 2 1.5 97486
103_#1 2 9 V2 124 134 1 1.3 248233
=cut

__DATA__
101_#2 1 H F0 263 278 2 1.5
102_#1 1 6 F1 766 781 1 1.0
103_#1 2 15 V1 526 581 1 0.0
103_#1 2 9 V2 124 134 1 1.3
104_#1 1 12 V3 137 172 1 1.0
105_#1 1 17 F2 766 771 1 1.0
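
    To run this against the real files instead of the inline test data, only the two input sources change; a sketch, assuming file1 and file2 are given on the command line in that order:

    my ( $file1, $file2 ) = @ARGV;
    open my $fh2, '<', $file2;   # autodie covers the error check
    open my $fh1, '<', $file1;   # then read <$fh1> in the second loop instead of <DATA>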