Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Input file1:

    col1  col2   col3  col4
    ZGLP1 ICAM4  13.27 0.2425
    ICAM4 ZGLP1  13.27 0.2425
    RRP1B CDH24  20.8  1
    ZGLP1 OOEP   18.79 0.3060
    ZGLP1 RRP1B  39.62 0.2972
    ZGLP1 CDH24  51.21 0.2560
    BBCDI DND1   19.44 0.2833
    BBCDI SOHLH2 36.61 0.2909
    DND1  SOHLH2 18    0.8
Input file2:

    chr8 18640000 18960000 ZGLP1 RRP1B CDH24    # gene number here is not fixed, can be 4, 5 or more
    chr8 19000000 19080000 BBCDI DND1 SOHLH2    # gene number here is not fixed, can be 4, 5 or more

I have written code that compares col1 and col2 of file1 against each line of file2: if both genes of a pair appear anywhere in a line of file2, the program should print the chromosome, pos1, pos2, and the matching line of file1 with its values.

Output file:

    chr8 18640000 18960000 ZGLP1 RRP1B 39.62 0.2972
    chr8 18640000 18960000 ZGLP1 CDH24 51.21 0.2560
    chr8 18640000 18960000 RRP1B CDH24 20.8 1
    chr8 19000000 19080000 BBCDI DND1 19.44 0.2833
    chr8 19000000 19080000 BBCDI SOHLH2 36.61 0.2909
    chr8 19000000 19080000 DND1 SOHLH2 18 0.8

So far I have tried the code below, but it is taking a very long time because my input files are huge (2 GB).

My Perl code:

    open( AB, "file1" ) || die("cannot open");
    open( BC, "file2" ) || die("cannot open");
    open( OUT, ">output.txt" );
    @file = <AB>;
    chomp(@file);
    @data = <BC>;
    chomp(@data);
    foreach $fl (@file) {
        if ( $fl =~ /(.*?)\s+(.*?)\s+(.*?)\s+(.*)/ ) {
            $one = $1;
            $two = $2;
            $thr = $3;
            $for = $4;
        }
        foreach $line (@data) {
            if ( $line =~ /(.*?)\s+(.*?)\s+(.*?)\s+(.*)+/ ) {
                $chr  = $1;
                $pos1 = $2;
                $pos2 = $3;
            }
            if ( $line =~ /$one/ ) {
                if ( $line =~ /$two/ ) {
                    print OUT $chr, "\t", $pos1, "\t", $pos2, "\t", $fl, "\n";
                }
            }
        }
    }

Replies are listed 'Best First'.
Re: how to speed up pattern match between two files
by RichardK (Parson) on Sep 16, 2014 at 17:19 UTC

    Well, you're doing a lot of work, searching through a large array for each entry in another large array.

    But there are a few things you can try.

    You don't need to extract the positions unless it's a line you're interested in, so swap the order of those tests :-

        foreach $line (@data) {
            if ( $line =~ /$one/ && $line =~ /$two/ ) {
                if ( $line =~ /(.*?)\s+(.*?)\s+(.*?)\s+(.*)+/ ) {
                    $chr  = $1;
                    $pos1 = $2;
                    $pos2 = $3;
                }
                ...
            }
        }

    Try using index instead of a regex to match $one & $two; it's usually quicker.

    if (index($line,$one) != -1 && index($line,$two) != -1) { ... }

    Profile your code to find out where the time is going, and measure the difference each change makes.

      In a quick test, index doesn't seem to be quicker than a simple regex. Example:

      $ perl -MTime::HiRes -MBenchmark=timethese -le'
          open F, "</usr/share/dict/words"; @words = <F>; chomp for @words;
          $re = q(^bonjour);
          timethese(-1, {
              index   => sub { for (@words) { return 1 if index($_, "bonjour") != -1 } },
              re      => sub { for (@words) { return 1 if /\bbonjour\b/ } },
              q(re^)  => sub { for (@words) { return 1 if /^bonjour/ } },
              re_comp => sub { $re = qr/^bonjour/o; for (@words) { return 1 if /$re/o } },
              grep    => sub { return 1 if grep /bonjour/, @words },
          })'
      Benchmark: running grep, index, re, re^, re_comp for at least 1 CPU seconds...
            grep:  2 wallclock secs ( 1.08 usr +  0.00 sys =  1.08 CPU) @  32.41/s (n=35)
           index:  1 wallclock secs ( 1.12 usr +  0.02 sys =  1.14 CPU) @ 267.54/s (n=305)
              re:  1 wallclock secs ( 1.10 usr +  0.02 sys =  1.12 CPU) @ 272.32/s (n=305)
             re^:  1 wallclock secs ( 1.14 usr +  0.00 sys =  1.14 CPU) @ 327.19/s (n=373)
         re_comp:  1 wallclock secs ( 1.04 usr +  0.00 sys =  1.04 CPU) @ 268.27/s (n=279)
      In this test, anchoring the regex with '^' gives a small boost (+20%), while index and compiling the regex don't help. As you said, profiling the code can help here.

        The line lengths in the words list are short, so any differences in performance will be lost in the system noise. Your test therefore can't tell us anything useful and isn't a great match for the OP's problem.

      Thank you for your reply. I'll do that. Unlike the previous reply, it's really informative and helpful.

        PM is about helping you learn; not about spoon-feeding you, nor about do-it-for-you. So read carefully even replies that (in your limited knowledge, maybe) seem not "really informative and helpful."

        In this case, the reply in question seems to me (duh!) both informative and helpful (unless the level of your knowledge is such that you need not have asked the original question). So even if you choose not to profile your code to find out where it's spending its time, you would be well served to consider the comment about the loops (nested loops are INEFFICIENT in your problem case) and to follow up on the observation about your regexen. If you take just a little time to study perldoc perlretut and/or comparable documents, you'll learn how to write a regex that doesn't rely on the death star, .*, even when it's restricted by the minimal-match modifier, e.g., .*?.
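        For instance, since the OP's columns are whitespace-separated, field extraction needs no wildcard regex at all; a minimal sketch using split (sample data taken from file1 above):

```perl
use strict;
use warnings;

# Sketch: extract fields with split instead of /(.*?)\s+(.*?)\s+.../ .
# Assumes whitespace-separated columns as in the OP's file1.
my $fl = "ZGLP1 RRP1B 39.62 0.2972";
my ( $one, $two, $thr, $for ) = split ' ', $fl;

# split ' ' (a literal space) is the special form: it splits on runs of
# whitespace and discards leading whitespace, unlike split /\s+/.
print "$one $two\n";    # prints "ZGLP1 RRP1B"
```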


        ++$anecdote ne $data


Re: how to speed up pattern match between two files
by ww (Archbishop) on Sep 16, 2014 at 15:45 UTC
    1. Super Search with "profile" as your search term
    2. If following up on point_1 doesn't clarify your problem, you may wish to quantify "so much time" for us. I suspect, from a cursory reading of your code, that the expectations that led you to complain about "so much time" may have been excessively optimistic: looping through two 2 GB files simultaneously is unlikely to be fast with a read per line per file to execute and a regex that relies so heavily on wildcards.

    ++$anecdote ne $data


Re: how to speed up pattern match between two files
by Lennotoecom (Pilgrim) on Sep 16, 2014 at 17:36 UTC
    upload your entire 2 GB file into a local MySQL db
    and do: select * from table where col1 = 'a' and col2 = 'b';
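    A minimal sketch of that approach via Perl's DBI, using DBD::SQLite as a lighter stand-in for MySQL (the table and column names here are made up for illustration):

```perl
use strict;
use warnings;
use DBI;

# Hypothetical schema: one row per pair from file1.
my $dbh = DBI->connect( "dbi:SQLite:dbname=pairs.db", "", "",
    { RaiseError => 1 } );
$dbh->do( "CREATE TABLE IF NOT EXISTS pairs "
        . "(col1 TEXT, col2 TEXT, col3 REAL, col4 REAL)" );

# After bulk-loading (ideally inside one transaction, with an index
# on (col1, col2)), each lookup is a single indexed query.
my $sth = $dbh->prepare("SELECT * FROM pairs WHERE col1 = ? AND col2 = ?");
$sth->execute( 'ZGLP1', 'RRP1B' );
while ( my @row = $sth->fetchrow_array ) {
    print join( "\t", @row ), "\n";
}
```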
Re: how to speed up pattern match between two files (hash)
by tye (Sage) on Sep 17, 2014 at 23:33 UTC

    1. I'd check if you are exhausting your available RAM
    2. I'd use split
    3. I'd build a hash keyed by sorted pairs and skip regex/index entirely
    4. If not enough RAM, I'd process the first file $N lines at a time
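    A sketch of points 2 and 3, assuming the file layouts shown above (variable names are mine): precompute a hash keyed by the sorted gene pair for every pair that co-occurs on a file2 line, then do a single hash lookup per file1 line instead of a nested scan.

```perl
use strict;
use warnings;

# Pass 1: for each region line of file2, record every sorted gene pair.
# With 4-5 genes per line that's at most ~10 pairs, so this stays cheap.
my %region_of;    # "geneA\tgeneB" (sorted) => "chr\tpos1\tpos2"
open my $bc, '<', 'file2' or die "cannot open file2: $!";
while (<$bc>) {
    my ( $chr, $pos1, $pos2, @genes ) = split;
    for my $i ( 0 .. $#genes - 1 ) {
        for my $j ( $i + 1 .. $#genes ) {
            my $key = join "\t", sort $genes[$i], $genes[$j];
            # Note: if a pair occurs in more than one region,
            # the last region seen wins.
            $region_of{$key} = "$chr\t$pos1\t$pos2";
        }
    }
}
close $bc;

# Pass 2: stream file1 line by line -- one hash lookup each,
# no regex, no nested loop, no slurping file1 into RAM.
open my $ab,  '<', 'file1'      or die "cannot open file1: $!";
open my $out, '>', 'output.txt' or die "cannot open output.txt: $!";
while (<$ab>) {
    chomp;
    my ( $one, $two ) = split;
    my $key = join "\t", sort $one, $two;
    print $out "$region_of{$key}\t$_\n" if exists $region_of{$key};
}
close $ab;
close $out;
```

    Only %region_of needs to fit in memory; if even that is too big, point 4 (chunking) applies to it as well.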

    - tye