Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Input file1:

    col1  col2   col3  col4
    ZGLP1 ICAM4  13.27 0.2425
    ICAM4 ZGLP1  13.27 0.2425
    RRP1B CDH24  20.8  1
    ZGLP1 OOEP   18.79 0.3060
    ZGLP1 RRP1B  39.62 0.2972
    ZGLP1 CDH24  51.21 0.2560
    BBCDI DND1   19.44 0.2833
    BBCDI SOHLH2 36.61 0.2909
    DND1  SOHLH2 18    0.8
Input file2:

    chr8 18640000 18960000 ZGLP1 RRP1B CDH24    # gene number here is not fixed, can be 4, 5 or more
    chr8 19000000 19080000 BBCDI DND1 SOHLH2    # gene number here is not fixed, can be 4, 5 or more

I have written code that compares col1 and col2 of file1 against each line of file2: if both genes of a pair appear anywhere in a line of file2, the program should print the chromosome, pos1, pos2, and the matching line of file1 with its values.

Output file:

    chr8 18640000 18960000 ZGLP1 RRP1B 39.62 0.2972
    chr8 18640000 18960000 ZGLP1 CDH24 51.21 0.2560
    chr8 18640000 18960000 RRP1B CDH24 20.8 1
    chr8 19000000 19080000 BBCDI DND1 19.44 0.2833
    chr8 19000000 19080000 BBCDI SOHLH2 36.61 0.2909
    chr8 19000000 19080000 DND1 SOHLH2 18 0.8

So far I have tried the code below, but it is taking a very long time because my input files are huge (2 GB).

My Perl code:

    open( AB, "file1" ) || die("cannot open");
    open( BC, "file2" ) || die("cannot open");
    open( OUT, ">output.txt" );
    @file = <AB>;
    chomp(@file);
    @data = <BC>;
    chomp(@data);
    foreach $fl (@file) {
        if ( $fl =~ /(.*?)\s+(.*?)\s+(.*?)\s+(.*)/ ) {
            $one = $1;
            $two = $2;
            $thr = $3;
            $for = $4;
        }
        foreach $line (@data) {
            if ( $line =~ /(.*?)\s+(.*?)\s+(.*?)\s+(.*)+/ ) {
                $chr  = $1;
                $pos1 = $2;
                $pos2 = $3;
            }
            if ( $line =~ /$one/ ) {
                if ( $line =~ /$two/ ) {
                    print OUT $chr, "\t", $pos1, "\t", $pos2, "\t", $fl, "\n";
                }
            }
        }
    }

Replies are listed 'Best First'.
Re: how to speed up pattern match between two files
by RichardK (Parson) on Sep 16, 2014 at 17:19 UTC

    Well, you're doing a lot of work, searching through a large array for each entry in another large array.

    But there are a few things you can try.

    You don't need to extract the positions unless it's a line you're interested in, so swap the order of those tests :-

        foreach $line (@data) {
            if ( $line =~ /$one/ && $line =~ /$two/ ) {
                if ( $line =~ /(.*?)\s+(.*?)\s+(.*?)\s+(.*)+/ ) {
                    $chr  = $1;
                    $pos1 = $2;
                    $pos2 = $3;
                }
                ...
            }
        }

    Try using index instead of a regex to match $one & $two; it's usually quicker.

    if (index($line,$one) != -1 && index($line,$two) != -1) { ... }

    Profile your code to find out where the time is going, and measure the difference each change makes.

      In a quick test, index doesn't seem to be quicker than a simple regex. Example:

      $ perl -MTime::HiRes -MBenchmark=timethese -le'
          open F, "</usr/share/dict/words"; @words = <F>; chomp for @words;
          $re = q(^bonjour);
          timethese(-1, {
              index   => sub { for (@words) { return 1 if index($_, "bonjour") != -1 } },
              re      => sub { for (@words) { return 1 if /\bbonjour\b/ } },
              q(re^)  => sub { for (@words) { return 1 if /^bonjour/ } },
              re_comp => sub { $re = qr/^bonjour/o; for (@words) { return 1 if /$re/o } },
              grep    => sub { return 1 if grep /bonjour/, @words },
          })'
      Benchmark: running grep, index, re, re^, re_comp for at least 1 CPU seconds...
            grep:  2 wallclock secs ( 1.08 usr +  0.00 sys =  1.08 CPU) @  32.41/s (n=35)
           index:  1 wallclock secs ( 1.12 usr +  0.02 sys =  1.14 CPU) @ 267.54/s (n=305)
              re:  1 wallclock secs ( 1.10 usr +  0.02 sys =  1.12 CPU) @ 272.32/s (n=305)
             re^:  1 wallclock secs ( 1.14 usr +  0.00 sys =  1.14 CPU) @ 327.19/s (n=373)
         re_comp:  1 wallclock secs ( 1.04 usr +  0.00 sys =  1.04 CPU) @ 268.27/s (n=279)
      In this test, anchoring the regex with '^' gives a small boost (+20%), while index and compiling the regex don't help. As you said, profiling the code can help here.

        The line lengths in the words list are short, so any differences in performance will be lost in the system noise. Your test therefore can't tell us anything useful and isn't a great match for the OP's problem.

      Thank you for your reply. I'll do that. Unlike the previous reply, it's really informative and helpful.

        PM is about helping you learn; not about spoon-feeding you, nor about do-it-for-you. So read carefully even replies that (in your limited knowledge, maybe) seem not "really informative and helpful."

        In this case, the reply in question seems to me (duh!) both informative and helpful (unless the level of your knowledge is such that you need not have asked the original question). So even if you choose not to profile your code to find out where it's spending its time, you would be well served to consider the comment about the loops (nested loops are INEFFICIENT in your problem case) and to follow up on the observation about your regexen. If you take just a little time to study perldoc perlretut and/or comparable documents, you'll learn how to write a regex that doesn't rely on the death star, .*, even when it's restricted by the minimal-match modifier, e.g., .*?.
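        For instance, since the OP's columns are whitespace-separated, field extraction needs no wildcard regex at all; a minimal sketch using split (sample data taken from file1 above):

```perl
use strict;
use warnings;

# Sketch: extract fields with split instead of /(.*?)\s+(.*?)\s+.../ .
# Assumes whitespace-separated columns as in the OP's file1.
my $fl = "ZGLP1 RRP1B 39.62 0.2972";
my ( $one, $two, $thr, $for ) = split ' ', $fl;

# split ' ' (a literal space) is the special form: it splits on runs of
# whitespace and discards leading whitespace, unlike split /\s+/.
print "$one $two\n";    # prints "ZGLP1 RRP1B"
```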


        ++$anecdote ne $data


Re: how to speed up pattern match between two files
by ww (Archbishop) on Sep 16, 2014 at 15:45 UTC
    1. Super Search with "profile" as your search term
    2. If following up on point_1 doesn't clarify your problem, you may wish to quantify "so much time" for us. I suspect, from a cursory reading of your code, that the expectations that led you to complain about "so much time" may have been excessively optimistic: looping through two 2 GB files simultaneously is unlikely to be fast with a read per line per file to execute and a regex that relies so heavily on wildcards.

    ++$anecdote ne $data


Re: how to speed up pattern match between two files
by Lennotoecom (Pilgrim) on Sep 16, 2014 at 17:36 UTC
    upload your entire 2 GB file into a local MySQL db
    and do: select * from table where col1 = 'a' and col2 = 'b';
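    A minimal sketch of that approach via Perl's DBI, using DBD::SQLite as a lighter stand-in for MySQL (the table and column names here are made up for illustration):

```perl
use strict;
use warnings;
use DBI;

# Hypothetical schema: one row per pair from file1.
my $dbh = DBI->connect( "dbi:SQLite:dbname=pairs.db", "", "",
    { RaiseError => 1 } );
$dbh->do( "CREATE TABLE IF NOT EXISTS pairs "
        . "(col1 TEXT, col2 TEXT, col3 REAL, col4 REAL)" );

# After bulk-loading (ideally inside one transaction, with an index
# on (col1, col2)), each lookup is a single indexed query.
my $sth = $dbh->prepare("SELECT * FROM pairs WHERE col1 = ? AND col2 = ?");
$sth->execute( 'ZGLP1', 'RRP1B' );
while ( my @row = $sth->fetchrow_array ) {
    print join( "\t", @row ), "\n";
}
```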
Re: how to speed up pattern match between two files (hash)
by tye (Sage) on Sep 17, 2014 at 23:33 UTC

    1. I'd check if you are exhausting your available RAM
    2. I'd use split
    3. I'd build a hash keyed by sorted pairs and skip regex/index entirely
    4. If not enough RAM, I'd process the first file $N lines at a time
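    A sketch of points 2 and 3, assuming the file layouts shown above (variable names are mine): precompute a hash keyed by the sorted gene pair for every pair that co-occurs on a file2 line, then do a single hash lookup per file1 line instead of a nested scan.

```perl
use strict;
use warnings;

# Pass 1: for each region line of file2, record every sorted gene pair.
# With 4-5 genes per line that's at most ~10 pairs, so this stays cheap.
my %region_of;    # "geneA\tgeneB" (sorted) => "chr\tpos1\tpos2"
open my $bc, '<', 'file2' or die "cannot open file2: $!";
while (<$bc>) {
    my ( $chr, $pos1, $pos2, @genes ) = split;
    for my $i ( 0 .. $#genes - 1 ) {
        for my $j ( $i + 1 .. $#genes ) {
            my $key = join "\t", sort $genes[$i], $genes[$j];
            # Note: if a pair occurs in more than one region,
            # the last region seen wins.
            $region_of{$key} = "$chr\t$pos1\t$pos2";
        }
    }
}
close $bc;

# Pass 2: stream file1 line by line -- one hash lookup each,
# no regex, no nested loop, no slurping file1 into RAM.
open my $ab,  '<', 'file1'      or die "cannot open file1: $!";
open my $out, '>', 'output.txt' or die "cannot open output.txt: $!";
while (<$ab>) {
    chomp;
    my ( $one, $two ) = split;
    my $key = join "\t", sort $one, $two;
    print $out "$region_of{$key}\t$_\n" if exists $region_of{$key};
}
close $ab;
close $out;
```

    Only %region_of needs to fit in memory; if even that is too big, point 4 (chunking) applies to it as well.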

    - tye