in reply to how to speed up pattern match between two files

Well, you're doing a lot of work: searching through a large array for each entry in another large array.

But there are a few things you can try.

You don't need to extract the positions unless it's a line you're interested in, so swap the order of those tests :-

    foreach $line (@data) {
        if ( $line =~ /$one/ && $line =~ /$two/ ) {
            if ( $line =~ /(.*?)\s+(.*?)\s+(.*?)\s+(.*)/ ) {
                $chr  = $1;
                $pos1 = $2;
                $pos2 = $3;
            }
            ...
        }
    }

Try using index instead of a regex to match $one and $two; it's usually quicker.

    if (index($line, $one) != -1 && index($line, $two) != -1) { ... }

Profile your code to find out where the time is going, and measure the difference each change makes.

Re^2: how to speed up pattern match between two files
by gnujsa (Acolyte) on Sep 17, 2014 at 07:38 UTC

    In a quick test, index doesn't seem to be quicker than a simple regex. Example:

    $ perl -MTime::HiRes -MBenchmark=timethese -le'
        open F, "</usr/share/dict/words"; @words = <F>; chomp for @words;
        $re = q(^bonjour);
        timethese(-1, {
            index   => sub { for (@words) { return 1 if index($_, "bonjour") != -1 } },
            re      => sub { for (@words) { return 1 if /\bbonjour\b/ } },
            q(re^)  => sub { for (@words) { return 1 if /^bonjour/ } },
            re_comp => sub { $re = qr/^bonjour/o; for (@words) { return 1 if /$re/o } },
            grep    => sub { return 1 if grep /bonjour/, @words },
        })'
    Benchmark: running grep, index, re, re^, re_comp for at least 1 CPU seconds...
          grep:  2 wallclock secs ( 1.08 usr + 0.00 sys = 1.08 CPU) @  32.41/s (n=35)
         index:  1 wallclock secs ( 1.12 usr + 0.02 sys = 1.14 CPU) @ 267.54/s (n=305)
            re:  1 wallclock secs ( 1.10 usr + 0.02 sys = 1.12 CPU) @ 272.32/s (n=305)
           re^:  1 wallclock secs ( 1.14 usr + 0.00 sys = 1.14 CPU) @ 327.19/s (n=373)
       re_comp:  1 wallclock secs ( 1.04 usr + 0.00 sys = 1.04 CPU) @ 268.27/s (n=279)
    In this test, anchoring the regex with '^' gives a small boost (about +20%), while index and compiling the regex don't help. As you said, profiling the code can help here.

      The lines in the words list are short, so any difference in performance is lost in the system noise. Your test can't tell us anything useful and isn't a great match for the OP's problem.

        I've added 3 columns drawn from 3 dictionaries, and put some random chars at the end to make the line lengths the same as in his files. Again, a short regex (he uses short regexes in this part of the code) was as fast as index (and even slightly faster). I don't know why, but the best he can do is try with his own data and his own perl (I used perl 5.20.0).

Re^2: how to speed up pattern match between two files
by Anonymous Monk on Sep 16, 2014 at 19:25 UTC
    Thank you for your reply. I'll do that. Unlike the previous reply, it's really informative and helpful.

      PM is about helping you learn, not about spoon-feeding you or doing it for you. So read carefully even replies that (in your limited knowledge, maybe) seem not "really informative and helpful."

      In this case, the reply in question seems to me (duh!) both informative and helpful (unless the level of your knowledge is such that you need not have asked the original question). So even if you choose not to profile your code to find out where it's spending its time, you would be well served to consider the comment about the loops (nested loops are INEFFICIENT in your problem case) and to follow up on the observation about your regexen. If you take just a little time to study perldoc perlretut and/or comparable documents, you'll learn how to write a regex that doesn't rely on the death star, .*, even when restrained by the minimal modifier, i.e., .*?.
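
      To make the point about nested loops concrete: a common restructuring is to build a hash keyed on the join fields from the smaller file, then scan the larger file once, turning an O(n*m) double loop into O(n+m). This is a sketch with made-up field names and synthetic data, not the OP's actual code:

```shell
perl <<'EOF'
use strict;
use warnings;

# Hypothetical stand-ins for the two files' contents (chr/position keys)
my @small = ("chr1\t100", "chr2\t200");
my @large = ("chr1\t100\tgeneA", "chr3\t300\tgeneB", "chr2\t200\tgeneC");

# Build a lookup hash from the smaller list: one pass, O(n)
my %want;
$want{$_} = 1 for @small;

# Single pass over the larger list, O(1) hash lookup per line
for my $line (@large) {
    my ($chr, $pos) = split /\t/, $line;
    print "$line\n" if $want{"$chr\t$pos"};    # prints the geneA and geneC lines
}
EOF
```

      The hash lookup replaces the inner "search the whole other array" loop, which is usually where most of the time goes in this kind of matching job.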


      ++$anecdote ne $data