greeknlatin has asked for the wisdom of the Perl Monks concerning the following question:

I want to compare two files so that the code prints, for each cluster in file2, all the matches for its individual factors found in file1.

e.g. ABC is one cluster, and A, B, and C are its individual factors

file1

A seq1 20
B seq2 25
B seq2 80
B seq1 40
C seq1 25
D seq2 30
E seq2 45

file2

A B C
B D E

Output

A seq1 20 B seq1 40 C seq1 25
B seq2 25 D seq2 30 E seq2 45
B seq2 80 D seq2 30 E seq2 45

So far I have tried the following code, but it takes a long time because my input files are huge.

#file opening
open(AB,"try_fimo.txt")||die("cannot open");
open(BC,"try_fimo2.txt")||die("cannot open");

#storing each file in an array
@data=<AB>;
chomp(@data);
@data2=<BC>;
chomp(@data2);

#reading file line by line
foreach $line(@data) {
    foreach $line2(@data2) {
        if($line2=~/(.*?)\s+(.*?)\s+(.*)/) {
            $t1=$1; #eg. in first row from file2, i.e. ABC, it will first take A followed by B & C
            $t2=$2;
            $t3=$3;
        }
        if($line=~/(.*?)\s+(.*?)\s+(.*)/) {
            if($1 eq $t1) {
                #storing each column in a separate array based on match
                push(@tf1,$1);
                push(@seq1,$2);
                push(@dis1,$3);
                # print $1,"\t",$2,"\t",$3,"\t";
            }
            if($1 eq $t2) {
                push(@tf2,$1);
                push(@seq2,$2);
                push(@dis2,$3);
            }
            if($1 eq $t3) {
                push(@tf3,$1);
                push(@seq3,$2);
                push(@dis3,$3);
            }
        }
    }
}

#comparison using loops
for($i=0;$i<@tf1;$i++) {
    for($j=0;$j<@tf2;$j++) {
        for($k=0;$k<@tf3;$k++) {
            if(($seq1[$i] eq $seq2[$j]) && ($seq1[$i] eq $seq3[$k])) {
                if(($tf1[$i] ne $tf2[$j]) && ($tf1[$i] ne $tf3[$k])) {
                    print $tf1[$i],"\t",$seq1[$i],"\t",$dis1[$i],"\t",
                          $tf2[$j],"\t",$seq2[$j],"\t",$dis2[$j],"\t",
                          $tf3[$k],"\t",$seq3[$k],"\t",$dis3[$k],"\n";
                }
            }
        }
    }
}

Can anyone please suggest a faster solution?

Thanks

Replies are listed 'Best First'.
Re: how to speed up comparison between two files
by BrowserUk (Patriarch) on Dec 10, 2014 at 08:02 UTC

    Are there always 3 factors per line of file 2? Or a larger number? Or a variable number?


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      Yes, there are always 3 factors per line of file 2.

        Update: I generated a file1 of 1000 lines (factors A..Z, Seq 0..9, Pos 0..1000) and a file2 of 1000 triples of A..Z. Processing with your code ran for over 1.5 hours before I abandoned it; with my code it took less than a second. HTH.

        Try this. On cursory inspection (and using only the tiny amount of data provided), I think it should be quite a bit quicker than your brute force iteration/comparison method:

#! perl -slw
use strict;
use Inline::Files;

my( %file1, %facBySeq, %seqByFac );

while( <FILE_1> ) {
    my( $fac, $seq, $pos ) = split ' ';
    push @{ $file1{ $fac }{ $seq } }, $pos;
    $facBySeq{ $seq }{ $fac } = 1;
    $seqByFac{ $fac }{ $seq } = 1;
}

while( <FILE_2> ) {
    my @facs = split ' ';
    for my $seq ( keys %{ $seqByFac{ $facs[ 0 ] } } ) {
        if( exists $facBySeq{ $seq }{ $facs[ 1 ] }
        and exists $facBySeq{ $seq }{ $facs[ 2 ] } ) {
            for my $pos1 ( @{ $file1{ $facs[ 0 ] }{ $seq } } ) {
                for my $pos2 ( @{ $file1{ $facs[ 1 ] }{ $seq } } ) {
                    for my $pos3 ( @{ $file1{ $facs[ 2 ] }{ $seq } } ) {
                        print join ' ', $facs[ 0 ], $seq, $pos1,
                            $facs[ 1 ], $seq, $pos2,
                            $facs[ 2 ], $seq, $pos3;
                    }
                }
            }
        }
    }
}
__FILE_1__
A seq1 20
B seq2 25
B seq2 80
B seq1 40
C seq1 25
D seq2 30
E seq2 45
__FILE_2__
A B C
B D E

        Outputs:

[ 8:24:45.48] C:\test>1109868.pl
A seq1 20 B seq1 40 C seq1 25
B seq2 25 D seq2 30 E seq2 45
B seq2 80 D seq2 30 E seq2 45

        Basically, it just constructs a couple of ancillary indexes using hashes to avoid much of the iteration. (Note: Inline::Files just allowed me to put all the test data and code in a single file.)
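        To make the indexing idea concrete, here is a minimal, self-contained sketch (sample records hard-coded; variable names follow the code above) showing the two ancillary hash structures and how a lookup then needs only hash probes instead of a scan of file1:

```perl
use strict;
use warnings;

# Build the positions table and the two ancillary indexes
# from a few hard-coded file1 records.
my ( %file1, %facBySeq, %seqByFac );
my @records = ( [qw(A seq1 20)], [qw(B seq2 25)], [qw(B seq1 40)] );

for my $rec ( @records ) {
    my ( $fac, $seq, $pos ) = @$rec;
    push @{ $file1{ $fac }{ $seq } }, $pos;  # positions keyed by factor, then sequence
    $facBySeq{ $seq }{ $fac } = 1;           # which factors occur in each sequence
    $seqByFac{ $fac }{ $seq } = 1;           # which sequences contain each factor
}

# Membership tests are now O(1) hash probes, not scans of file1:
for my $seq ( sort keys %{ $seqByFac{ 'B' } } ) {
    print "B occurs in $seq at @{ $file1{ 'B' }{ $seq } }\n"
        if exists $facBySeq{ $seq }{ 'A' };  # does A share this sequence?
}
# prints: B occurs in seq1 at 40
```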


Re: how to speed up comparison between two files
by Anonymous Monk on Dec 10, 2014 at 07:53 UTC

    Can anyone please suggest a faster solution?

    How fast is that original solution?
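    That's worth measuring before optimizing. A quick timing harness (a sketch using the core Benchmark module; the workload below is a made-up stand-in, so swap in your real comparison code) would put a number on it:

```perl
use strict;
use warnings;
use Benchmark qw(timethis timestr);

# Hypothetical stand-in workload: scan 1000 fake file1 lines for factor 'A'.
my @data = map { "A seq" . ( $_ % 10 ) . " $_" } 1 .. 1000;

# Run the scan 100 times; style 'none' suppresses Benchmark's own printing.
my $t = timethis( 100, sub {
    my $hits = grep { /^A\s/ } @data;
    return $hits;
}, 'scan', 'none' );

print "100 scans of 1000 lines: ", timestr( $t ), "\n";
```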

Re: how to speed up comparison between two files
by Anonymous Monk on Dec 10, 2014 at 08:03 UTC
    This looks exactly like something that databases are supposed to do...
      So I was thinking how I would do that and came up with this (MySQL):
      drop table if exists clusters;
      create table clusters (
          factor char(1) not null,
          sequence varchar(10) not null,
          num integer unsigned not null,
          index (sequence)
      );
      insert into clusters values
          ('C', 'seq1', 25),
          ('A', 'seq1', 20),
          ('E', 'seq2', 45),
          ('D', 'seq2', 30),
          ('B', 'seq2', 25),
          ('B', 'seq1', 40),
          ('B', 'seq2', 80)
      ;
      select
          sequence,
          group_concat(factor order by factor) as factors,
          group_concat(num order by factor) as nums
      from clusters
      group by sequence
      having factors like '%A%B%C%'
          or factors like '%B%D%E%'
      Output:
      +----------+---------+-------------+
      | sequence | factors | nums        |
      +----------+---------+-------------+
      | seq1     | A,B,C   | 20,40,25    |
      | seq2     | B,B,D,E | 80,25,30,45 |
      +----------+---------+-------------+
      I guess there is a better way, but I prefer not to mess with SQL too much.

      As a bonus, here are some grouping and printing routines. They feature a flexible search pattern (not limited to 3 factors) and other nice things.

use strict;
use warnings;
use Carp;
use List::Util 'max';

my @factors = ('B,B,D,D,E,E' x 10) x 10_000;
my @numbers = ('80,25,30,1000,45,0.5' x 10) x 10_000;
my $search  = 'BDE';

for my $i ( 0 .. $#factors ) {
    print_groups( group( $factors[$i], $numbers[$i], $search ) );
}

sub group {
    my ( $factors, $numbers, $search ) = @_;
    my @factors = split ',', $factors;
    my @numbers = split ',', $numbers;
    croak "Fatal error!" unless @factors == @numbers;

    my @groups;
    for my $s (split '', $search) {
        my @temp = ($s);
        for my $i ( 0 .. $#factors ) {
            push @temp, $numbers[$i] if $factors[$i] eq $s;
        }
        push @groups, \@temp;
    }
    return \@groups;
}

sub print_groups {
    my $groups  = shift;
    my $current = shift || 0;    # or defined-or
    my $prev    = shift;

    if ( not defined $groups->[$current] ) {
        print $prev, "\n";
        return;
    }

    my ( $factor, @numbers ) = @{ $groups->[$current] };
    my $max = max map length, @numbers;
    for my $number (@numbers) {
        print_groups(
            $groups,
            $current + 1,
            ( $prev ? "$prev " : '' )
                . ( sprintf '%s = %-*s', $factor, $max, $number )
        );
    }
}
      Output:
      B = 80 D = 30   E = 45
      B = 80 D = 30   E = 0.5
      B = 80 D = 1000 E = 45
      B = 80 D = 1000 E = 0.5
      B = 25 D = 30   E = 45
      B = 25 D = 30   E = 0.5
      B = 25 D = 1000 E = 45
      B = 25 D = 1000 E = 0.5
      ...
      (24_200_000 rows)
      That takes about a minute on my pretty underpowered laptop, and I would expect the database to be reasonably fast too.
      Or maybe exactly what databases are often doing rather poorly... Consider that a hash is often about 3 orders of magnitude faster than a DB.
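      A rough illustration of that gap (a sketch with made-up data sizes, comparing an in-memory linear scan against a hash probe for the same membership test, not a rigorous hash-vs-database benchmark):

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Build a 10,000-element list and an equivalent hash-based set.
my @list = map { "fac$_" } 1 .. 10_000;
my %set  = map { $_ => 1 } @list;
my $want = 'fac9999';    # near the end, so the scan pays full price

# Run each sub for at least 1 CPU second and print a comparison table;
# the hash probe wins by several orders of magnitude.
cmpthese( -1, {
    linear_scan => sub { scalar grep { $_ eq $want } @list },
    hash_lookup => sub { exists $set{ $want } },
} );
```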
Re: how to speed up comparison between two files
by wee (Scribe) on Dec 10, 2014 at 22:34 UTC

    I started to come up with a solution much like BrowserUK's, though not quite as elegant, and I decided to move on since his works well enough.

    In order to see what was going on, I wanted to run your code. In doing so I fixed all the scoping errors (the 'strict' pragma should have been made the default behavior in like 1996), corrected the whitespace issues, removed unnecessary comments, and closed your filehandles after they were done being used. In case anyone else wants to copy/paste a working version of the original code (which uses Inline::Files), here it is:

use warnings;
use strict;
use Inline::Files;

#open(AB, "try_fimo.txt") || die("cannot open");
#my @data = <AB>;
#close(AB);
#chomp(@data);
#
#open(BC, "try_fimo2.txt") || die("cannot open");
#my @data2 = <BC>;
#close(BC);
#chomp(@data2);

my @data  = <AB>;
my @data2 = <BC>;

my ($t1, $t2, $t3);
my (@tf1, @seq1, @dis1);
my (@tf2, @seq2, @dis2);
my (@tf3, @seq3, @dis3);

foreach my $line (@data) {
    foreach my $line2 (@data2) {
        if ($line2 =~ /(.*?)\s+(.*?)\s+(.*)/) {
            $t1 = $1; # eg. in first row from file2, i.e. ABC, it will first take A followed by B & C
            $t2 = $2;
            $t3 = $3;
        }
        if ($line =~ /(.*?)\s+(.*?)\s+(.*)/) {
            if ($1 eq $t1) {
                push(@tf1, $1);
                push(@seq1, $2);
                push(@dis1, $3);
                # print $1,"\t",$2,"\t",$3,"\t";
            }
            elsif ($1 eq $t2) {
                push(@tf2, $1);
                push(@seq2, $2);
                push(@dis2, $3);
            }
            elsif ($1 eq $t3) {
                push(@tf3, $1);
                push(@seq3, $2);
                push(@dis3, $3);
            }
        }
    }
}

for (my $i = 0; $i < @tf1; $i++) {
    for (my $j = 0; $j < @tf2; $j++) {
        for (my $k = 0; $k < @tf3; $k++) {
            if (($seq1[$i] eq $seq2[$j]) && ($seq1[$i] eq $seq3[$k])) {
                if (($tf1[$i] ne $tf2[$j]) && ($tf1[$i] ne $tf3[$k])) {
                    print $tf1[$i], "\t", $seq1[$i], "\t", $dis1[$i], "\t",
                          $tf2[$j], "\t", $seq2[$j], "\t", $dis2[$j], "\t",
                          $tf3[$k], "\t", $seq3[$k], "\t", $dis3[$k], "\n";
                }
            }
        }
    }
}
__AB__
A seq1 20
B seq2 25
B seq2 80
B seq1 40
C seq1 25
D seq2 30
E seq2 45
__BC__
A B C