greeknlatin has asked for the wisdom of the Perl Monks concerning the following question:

I want to compare two files so that the code prints, for each cluster in file2, all the matches for its individual factors found in file1.

e.g. ABC is one cluster, and A, B, and C are its individual factors

file1

A seq1 20
B seq2 25
B seq2 80
B seq1 40
C seq1 25
D seq2 30
E seq2 45

file2

A B C
B D E

Output

A seq1 20 B seq1 40 C seq1 25
B seq2 25 D seq2 30 E seq2 45
B seq2 80 D seq2 30 E seq2 45

So far I have tried the following code, but it takes a long time because my input files are huge.

#file opening
open(AB,"try_fimo.txt")||die("cannot open");
open(BC,"try_fimo2.txt")||die("cannot open");

#storing each file in an array
@data=<AB>;
chomp(@data);
@data2=<BC>;
chomp(@data2);

#reading file line by line
foreach $line(@data) {
    foreach $line2(@data2) {
        if($line2=~/(.*?)\s+(.*?)\s+(.*)/) {
            $t1=$1; #eg. in first row from file2, i.e. ABC, it will first take A followed by B & C
            $t2=$2;
            $t3=$3;
        }
        if($line=~/(.*?)\s+(.*?)\s+(.*)/) {
            if($1 eq $t1) {
                #storing each column in a separate array based on match
                push(@tf1,$1);
                push(@seq1,$2);
                push(@dis1,$3);
                # print $1,"\t",$2,"\t",$3,"\t";
            }
            if($1 eq $t2) {
                push(@tf2,$1);
                push(@seq2,$2);
                push(@dis2,$3);
            }
            if($1 eq $t3) {
                push(@tf3,$1);
                push(@seq3,$2);
                push(@dis3,$3);
            }
        }
    }
}

#comparison using loops
for($i=0;$i<@tf1;$i++) {
    for($j=0;$j<@tf2;$j++) {
        for($k=0;$k<@tf3;$k++) {
            if(($seq1[$i] eq $seq2[$j]) && ($seq1[$i] eq $seq3[$k])) {
                if(($tf1[$i] ne $tf2[$j]) && ($tf1[$i] ne $tf3[$k])) {
                    print $tf1[$i],"\t",$seq1[$i],"\t",$dis1[$i],"\t",
                          $tf2[$j],"\t",$seq2[$j],"\t",$dis2[$j],"\t",
                          $tf3[$k],"\t",$seq3[$k],"\t",$dis3[$k],"\n";
                }
            }
        }
    }
}

Can anyone please suggest a faster solution?

Thanks

Replies are listed 'Best First'.
Re: how to speed up comparison between two files
by BrowserUk (Patriarch) on Dec 10, 2014 at 08:02 UTC

    Are there always 3 factors per line of file 2? Or a larger number? Or a variable number?


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      Yes, there are always 3 factors per line of file 2.

        Update: I generated a file1 of 1000 lines (factors A..Z, Seq 0..9, Pos 0..1000) and a file2 of 1000 triples of A..Z. Processing with your code ran for over 1.5 hours before I abandoned it; with my code it took less than a second. HTH.

        Try this. On cursory inspection (and using only the tiny amount of data provided), I think it should be quite a bit quicker than your brute force iteration/comparison method:

#! perl -slw
use strict;
use Inline::Files;

my( %file1, %facBySeq, %seqByFac );

while( <FILE_1> ) {
    my( $fac, $seq, $pos ) = split ' ';
    push @{ $file1{ $fac }{ $seq } }, $pos;
    $facBySeq{ $seq }{ $fac } = 1;
    $seqByFac{ $fac }{ $seq } = 1;
}

while( <FILE_2> ) {
    my @facs = split ' ';
    for my $seq ( keys %{ $seqByFac{ $facs[ 0 ] } } ) {
        if( exists $facBySeq{ $seq }{ $facs[ 1 ] }
        and exists $facBySeq{ $seq }{ $facs[ 2 ] } ) {
            for my $pos1 ( @{ $file1{ $facs[ 0 ] }{ $seq } } ) {
                for my $pos2 ( @{ $file1{ $facs[ 1 ] }{ $seq } } ) {
                    for my $pos3 ( @{ $file1{ $facs[ 2 ] }{ $seq } } ) {
                        print join ' ', $facs[ 0 ], $seq, $pos1,
                            $facs[ 1 ], $seq, $pos2,
                            $facs[ 2 ], $seq, $pos3;
                    }
                }
            }
        }
    }
}
__FILE_1__
A seq1 20
B seq2 25
B seq2 80
B seq1 40
C seq1 25
D seq2 30
E seq2 45
__FILE_2__
A B C
B D E

        Outputs:

[ 8:24:45.48] C:\test>1109868.pl
A seq1 20 B seq1 40 C seq1 25
B seq2 25 D seq2 30 E seq2 45
B seq2 80 D seq2 30 E seq2 45

        Basically, it just constructs a couple of ancillary indexes using hashes to avoid much of the iteration. (Note: Inline::Files just allowed me to put all the test data and code in a single file.)
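        To make the indexing idea concrete, here is a minimal, self-contained sketch (sample records hard-coded; variable names follow the code above) showing the two ancillary hash structures and how a lookup then needs only hash probes instead of a scan of file1:

```perl
use strict;
use warnings;

# Build the positions table and the two ancillary indexes
# from a few hard-coded file1 records.
my ( %file1, %facBySeq, %seqByFac );
my @records = ( [qw(A seq1 20)], [qw(B seq2 25)], [qw(B seq1 40)] );

for my $rec ( @records ) {
    my ( $fac, $seq, $pos ) = @$rec;
    push @{ $file1{ $fac }{ $seq } }, $pos;  # positions keyed by factor, then sequence
    $facBySeq{ $seq }{ $fac } = 1;           # which factors occur in each sequence
    $seqByFac{ $fac }{ $seq } = 1;           # which sequences contain each factor
}

# Membership tests are now O(1) hash probes, not scans of file1:
for my $seq ( sort keys %{ $seqByFac{ 'B' } } ) {
    print "B occurs in $seq at @{ $file1{ 'B' }{ $seq } }\n"
        if exists $facBySeq{ $seq }{ 'A' };  # does A share this sequence?
}
# prints: B occurs in seq1 at 40
```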


Re: how to speed up comparison between two files
by Anonymous Monk on Dec 10, 2014 at 07:53 UTC

    Can anyone please suggest a faster solution?

    How fast is that original solution?
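    That's worth measuring before optimizing. A quick timing harness (a sketch using the core Benchmark module; the workload below is a made-up stand-in, so swap in your real comparison code) would put a number on it:

```perl
use strict;
use warnings;
use Benchmark qw(timethis timestr);

# Hypothetical stand-in workload: scan 1000 fake file1 lines for factor 'A'.
my @data = map { "A seq" . ( $_ % 10 ) . " $_" } 1 .. 1000;

# Run the scan 100 times; style 'none' suppresses Benchmark's own printing.
my $t = timethis( 100, sub {
    my $hits = grep { /^A\s/ } @data;
    return $hits;
}, 'scan', 'none' );

print "100 scans of 1000 lines: ", timestr( $t ), "\n";
```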

Re: how to speed up comparison between two files
by Anonymous Monk on Dec 10, 2014 at 08:03 UTC
    This looks exactly like something that databases are supposed to do...
      So I was thinking how I would do that and came up with this (MySQL):
      drop table if exists clusters;
      create table clusters (
          factor char(1) not null,
          sequence varchar(10) not null,
          num integer unsigned not null,
          index (sequence)
      );
      insert into clusters values
          ('C', 'seq1', 25),
          ('A', 'seq1', 20),
          ('E', 'seq2', 45),
          ('D', 'seq2', 30),
          ('B', 'seq2', 25),
          ('B', 'seq1', 40),
          ('B', 'seq2', 80)
      ;
      select
          sequence,
          group_concat(factor order by factor) as factors,
          group_concat(num order by factor) as nums
      from clusters
      group by sequence
      having factors like '%A%B%C%'
          or factors like '%B%D%E%'
      Output:
      +----------+---------+-------------+
      | sequence | factors | nums        |
      +----------+---------+-------------+
      | seq1     | A,B,C   | 20,40,25    |
      | seq2     | B,B,D,E | 80,25,30,45 |
      +----------+---------+-------------+
      I guess there is a better way, but I prefer not to mess with SQL too much.

      As a bonus, here are some grouping and printing routines. They feature a flexible search pattern (not limited to 3 factors) and other nice things.

use strict;
use warnings;
use Carp;
use List::Util 'max';

my @factors = ('B,B,D,D,E,E' x 10) x 10_000;
my @numbers = ('80,25,30,1000,45,0.5' x 10) x 10_000;
my $search  = 'BDE';

for my $i ( 0 .. $#factors ) {
    print_groups( group( $factors[$i], $numbers[$i], $search ) );
}

sub group {
    my ( $factors, $numbers, $search ) = @_;
    my @factors = split ',', $factors;
    my @numbers = split ',', $numbers;
    croak "Fatal error!" unless @factors == @numbers;

    my @groups;
    for my $s (split '', $search) {
        my @temp = ($s);
        for my $i ( 0 .. $#factors ) {
            push @temp, $numbers[$i] if $factors[$i] eq $s;
        }
        push @groups, \@temp;
    }
    return \@groups;
}

sub print_groups {
    my $groups  = shift;
    my $current = shift || 0;    # or defined-or
    my $prev    = shift;

    if ( not defined $groups->[$current] ) {
        print $prev, "\n";
        return;
    }

    my ( $factor, @numbers ) = @{ $groups->[$current] };
    my $max = max map length, @numbers;
    for my $number (@numbers) {
        print_groups(
            $groups,
            $current + 1,
            ( $prev ? "$prev " : '' )
                . ( sprintf '%s = %-*s', $factor, $max, $number )
        );
    }
}
      Output:
      B = 80 D = 30   E = 45
      B = 80 D = 30   E = 0.5
      B = 80 D = 1000 E = 45
      B = 80 D = 1000 E = 0.5
      B = 25 D = 30   E = 45
      B = 25 D = 30   E = 0.5
      B = 25 D = 1000 E = 45
      B = 25 D = 1000 E = 0.5
      ...
      (24_200_000 rows)
      That takes about a minute on my pretty underpowered laptop, and I would expect the database to be reasonably fast too.
      Or maybe exactly what databases are often doing rather poorly... Consider that a hash is often about 3 orders of magnitude faster than a DB.
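      A rough illustration of that gap (a sketch with made-up data sizes, comparing an in-memory linear scan against a hash probe for the same membership test, not a rigorous hash-vs-database benchmark):

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Build a 10,000-element list and an equivalent hash-based set.
my @list = map { "fac$_" } 1 .. 10_000;
my %set  = map { $_ => 1 } @list;
my $want = 'fac9999';    # near the end, so the scan pays full price

# Run each sub for at least 1 CPU second and print a comparison table;
# the hash probe wins by several orders of magnitude.
cmpthese( -1, {
    linear_scan => sub { scalar grep { $_ eq $want } @list },
    hash_lookup => sub { exists $set{ $want } },
} );
```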
Re: how to speed up comparison between two files
by wee (Scribe) on Dec 10, 2014 at 22:34 UTC

    I started to come up with a solution much like BrowserUK's, though not quite as elegant, and I decided to move on since his works well enough.

    In order to see what was going on, I wanted to run your code. In doing so I fixed all the scoping errors (the 'strict' pragma should have been made the default behavior in like 1996), corrected the whitespace issues, removed unnecessary comments, and closed your filehandles after they were done being used. In case anyone else wants to copy/paste a working version of the original code (which uses Inline::Files), here it is:

use warnings;
use strict;
use Inline::Files;

#open(AB, "try_fimo.txt") || die("cannot open");
#my @data = <AB>;
#close(AB);
#chomp(@data);
#
#open(BC, "try_fimo2.txt") || die("cannot open");
#my @data2 = <BC>;
#close(BC);
#chomp(@data2);

my @data  = <AB>;
my @data2 = <BC>;

my ($t1, $t2, $t3);
my (@tf1, @seq1, @dis1);
my (@tf2, @seq2, @dis2);
my (@tf3, @seq3, @dis3);

foreach my $line (@data) {
    foreach my $line2 (@data2) {
        if ($line2 =~ /(.*?)\s+(.*?)\s+(.*)/) {
            $t1 = $1; # eg. in first row from file2, i.e. ABC, it will first take A followed by B & C
            $t2 = $2;
            $t3 = $3;
        }
        if ($line =~ /(.*?)\s+(.*?)\s+(.*)/) {
            if ($1 eq $t1) {
                push(@tf1, $1);
                push(@seq1, $2);
                push(@dis1, $3);
                # print $1,"\t",$2,"\t",$3,"\t";
            }
            elsif ($1 eq $t2) {
                push(@tf2, $1);
                push(@seq2, $2);
                push(@dis2, $3);
            }
            elsif ($1 eq $t3) {
                push(@tf3, $1);
                push(@seq3, $2);
                push(@dis3, $3);
            }
        }
    }
}

for (my $i = 0; $i < @tf1; $i++) {
    for (my $j = 0; $j < @tf2; $j++) {
        for (my $k = 0; $k < @tf3; $k++) {
            if (($seq1[$i] eq $seq2[$j]) && ($seq1[$i] eq $seq3[$k])) {
                if (($tf1[$i] ne $tf2[$j]) && ($tf1[$i] ne $tf3[$k])) {
                    print $tf1[$i], "\t", $seq1[$i], "\t", $dis1[$i], "\t",
                          $tf2[$j], "\t", $seq2[$j], "\t", $dis2[$j], "\t",
                          $tf3[$k], "\t", $seq3[$k], "\t", $dis3[$k], "\n";
                }
            }
        }
    }
}
__AB__
A seq1 20
B seq2 25
B seq2 80
B seq1 40
C seq1 25
D seq2 30
E seq2 45
__BC__
A B C