in reply to how to speed up comparison between two files

Are there always 3 factors per line of file 2? Or a larger number? Or a variable number?


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Re^2: how to speed up comparison between two files
by greeknlatin (Initiate) on Dec 10, 2014 at 08:11 UTC

    Yes, there are always 3 factors per line of file 2.

      Update: I generated a file1 of 1000 lines (factors A..Z, Seq 0..9, Pos 0..1000) and a file2 of 1000 triples of A..Z. Processing with your code ran for over 1.5 hours before I abandoned it; with my code it took less than a second. HTH.
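For anyone wanting to repeat that timing test, here is a rough sketch of how comparable test data could be generated. The exact layout of the generated files is not shown in the post, so the `seq0`..`seq9` naming, the independent random picks for each triple, and the helper names `gen_file1` and `gen_file2` are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical generators; the real test files' exact format is assumed.
my @FACTORS = ( 'A' .. 'Z' );

# Return $n lines of "factor seq pos":
# factor in A..Z, seq in seq0..seq9, pos in 0..1000.
sub gen_file1 {
    my( $n ) = @_;
    my @lines;
    for ( 1 .. $n ) {
        my $fac = $FACTORS[ int( rand @FACTORS ) ];
        my $seq = 'seq' . int( rand 10 );
        my $pos = int( rand 1001 );
        push @lines, "$fac $seq $pos";
    }
    return @lines;
}

# Return $n lines of three factors each, picked independently (assumption).
sub gen_file2 {
    my( $n ) = @_;
    my @lines;
    for ( 1 .. $n ) {
        my @triple = map { $FACTORS[ int( rand @FACTORS ) ] } 1 .. 3;
        push @lines, join ' ', @triple;
    }
    return @lines;
}

# Print a small sample of each; use 1000 for a test of the size described.
print "$_\n" for gen_file1( 3 );
print "$_\n" for gen_file2( 3 );
```

The data is random, so the output differs on every run; redirect it to two files to feed a timing comparison.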

      Try this. On cursory inspection (and using only the tiny amount of data provided), I think it should be quite a bit quicker than your brute force iteration/comparison method:

      #! perl -slw
      use strict;
      use Inline::Files;

      my( %file1, %facBySeq, %seqByFac );

      # Index file 1: positions by factor and sequence, plus two lookup tables.
      while( <FILE_1> ) {
          my( $fac, $seq, $pos ) = split ' ';
          push @{ $file1{ $fac }{ $seq } }, $pos;
          $facBySeq{ $seq }{ $fac } = 1;
          $seqByFac{ $fac }{ $seq } = 1;
      }

      while( <FILE_2> ) {
          my @facs = split ' ';
          # Only visit sequences that contain the first factor.
          for my $seq ( keys %{ $seqByFac{ $facs[ 0 ] } } ) {
              # Skip the sequence unless the other two factors occur in it too.
              if( exists $facBySeq{ $seq }{ $facs[ 1 ] } and exists $facBySeq{ $seq }{ $facs[ 2 ] } ) {
                  for my $pos1 ( @{ $file1{ $facs[ 0 ] }{ $seq } } ) {
                      for my $pos2 ( @{ $file1{ $facs[ 1 ] }{ $seq } } ) {
                          for my $pos3 ( @{ $file1{ $facs[ 2 ] }{ $seq } } ) {
                              print join ' ', $facs[0], $seq, $pos1, $facs[ 1 ], $seq, $pos2, $facs[ 2 ], $seq, $pos3;
                          }
                      }
                  }
              }
          }
      }
      __FILE_1__
      A seq1 20
      B seq2 25
      B seq2 80
      B seq1 40
      C seq1 25
      D seq2 30
      E seq2 45
      __FILE_2__
      A B C
      B D E

      Outputs:

      [ 8:24:45.48] C:\test>1109868.pl
      A seq1 20 B seq1 40 C seq1 25
      B seq2 25 D seq2 30 E seq2 45
      B seq2 80 D seq2 30 E seq2 45

      Basically, it just constructs a couple of ancillary indexes using hashes to avoid much of the iteration. (Note: Inline::Files just allowed me to put all the test data and code in a single file.)
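For readers without Inline::Files installed, the same index idea can be sketched with plain subs that take the lines as ordinary lists. This is only an illustration of the approach described above, not the posted code: the names `build_indexes` and `match_triple` are hypothetical, the sequence keys are sorted so the output order is repeatable, and a guard is added for factors that never appear in file 1.

```perl
use strict;
use warnings;

# Hypothetical helper: build three lookup tables from "factor seq pos" lines.
sub build_indexes {
    my( %posIdx, %facBySeq, %seqByFac );
    for my $line ( @_ ) {
        my( $fac, $seq, $pos ) = split ' ', $line;
        next unless defined $pos;                  # skip blank or short lines
        push @{ $posIdx{ $fac }{ $seq } }, $pos;   # positions per factor/sequence
        $facBySeq{ $seq }{ $fac } = 1;             # which factors a sequence has
        $seqByFac{ $fac }{ $seq } = 1;             # which sequences a factor is in
    }
    return +{ pos => \%posIdx, facBySeq => \%facBySeq, seqByFac => \%seqByFac };
}

# Hypothetical helper: return every position combination for one triple.
sub match_triple {
    my( $idx, $f1, $f2, $f3 ) = @_;
    my @out;
    # Only sequences containing the first factor are candidates.
    for my $seq ( sort keys %{ $idx->{seqByFac}{ $f1 } || {} } ) {
        next unless exists $idx->{facBySeq}{ $seq }{ $f2 }
                and exists $idx->{facBySeq}{ $seq }{ $f3 };
        for my $p1 ( @{ $idx->{pos}{ $f1 }{ $seq } } ) {
            for my $p2 ( @{ $idx->{pos}{ $f2 }{ $seq } } ) {
                for my $p3 ( @{ $idx->{pos}{ $f3 }{ $seq } } ) {
                    push @out, join ' ', $f1, $seq, $p1, $f2, $seq, $p2, $f3, $seq, $p3;
                }
            }
        }
    }
    return @out;
}

# Usage with the sample data from the post.
my $idx = build_indexes(
    'A seq1 20', 'B seq2 25', 'B seq2 80', 'B seq1 40',
    'C seq1 25', 'D seq2 30', 'E seq2 45',
);
print "$_\n" for match_triple( $idx, qw( B D E ) );
# prints:
#   B seq2 25 D seq2 30 E seq2 45
#   B seq2 80 D seq2 30 E seq2 45
```

The saving comes from the two small tables: a triple is only expanded for sequences already known to contain all three factors, so most of file 1 is never rescanned.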



        Thank you very much for your reply (and the code). It really does take much less time to complete.