in reply to Question about speeding a regexp count

It seems a regex is not the method of choice here...

Provided my code has no bugs, see for yourself:

#!/usr/bin/perl
use strict;
use warnings;
use Benchmark;
use Data::Dumper;

my $DNAlength= 600000;
my $sequencelength= 3;
my $testfilename= 'DNA.tst';
my $bufsize= 1024;

# create testdata
open(my $testfile, ">$testfilename") or die "$testfilename: $!";
my $buffer = '';
for (my $i=$DNAlength; $i; --$i) {
    $buffer.= qw(A C G T)[rand 4];
    if (length($buffer) > $bufsize) {
        print $testfile $buffer;
        $buffer = '';
    }
}
print $testfile $buffer;
close $testfile;

a1: { # Start collecting data
    my %DNA_sequence_count;
    my $a1_0=new Benchmark;
    open($testfile, $testfilename) or die "$testfilename: $!";
    my $kept=0;
    while (1) {
        my $read_bytes = read($testfile, $buffer, $bufsize-$kept);
        last unless $read_bytes;
        my $end_buffer = $read_bytes-$sequencelength;
        for (my $i=0; $i <= $end_buffer; ++$i) {
            ++$DNA_sequence_count{substr($buffer, $i, $sequencelength)};
        }
        $kept= $sequencelength-1;
        $buffer= substr($buffer, $end_buffer+1, $kept);
    }
    close $testfile;
    my $a1_n=new Benchmark;
    print "the file code took:",timestr(timediff($a1_n, $a1_0)),"\n";
}

a2: { # Start collecting data
    my %DNA_sequence_count;
    my $a2_0=new Benchmark;
    open($testfile, $testfilename) or die "$testfilename: $!";
    my $allbuffered= <$testfile>;
    close $testfile;
    my $end_buffer= length($allbuffered)-$sequencelength+1;
    for (my $i=0; $i <= $end_buffer; ++$i) {
        ++$DNA_sequence_count{substr($allbuffered, $i, $sequencelength)};
    }
    my $a2_n=new Benchmark;
    print "the scalar code took:",timestr(timediff($a2_n, $a2_0)),"\n";
}

a3: { # Start collecting data
    my %DNA_sequence_count;
    my $a3_0=new Benchmark;
    open($testfile, $testfilename) or die "$testfilename: $!";
    my $allbuffered= <$testfile>;
    close $testfile;
    ++$DNA_sequence_count{$1.$2} while $allbuffered=~ /(.)(?=(..))/g;
    my $a3_n=new Benchmark;
    print "the regex code took:",timestr(timediff($a3_n, $a3_0)),"\n";
}
Output:
the file code took: 2 wallclock secs ( 0.96 usr + 0.04 sys = 1.00 CPU)
the scalar code took: 3 wallclock secs ( 0.91 usr + 0.03 sys = 0.94 CPU)
the regex code took: 6 wallclock secs ( 2.54 usr + 0.06 sys = 2.60 CPU)

Update: ikegami told me my code is incomplete because: "...since it only lists groups of $sequencelength, not of $sequencelength or less."

That's true, but did it occur to you (and correct me if I'm wrong) that you can get the lower counts without counting each of the shorter sequences?
Simply loop once for each smaller length over all the results you already have, and then count the few sequences at the very end of the string which that loop doesn't hit.

my (@keys)= keys %DNA_sequence_count;
foreach my $key (@keys) {
    for (my $i=$sequencelength-1; $i; --$i) {
        $DNA_sequence_count{substr($key, 0, $i)} += $DNA_sequence_count{$key};
    }
}
# correction again thanks to ikegami!
for (my $i=$sequencelength-1; $i; --$i) { # second correction thanks to sauoq
    for (my $j=$i; $j<$sequencelength; ++$j) {
        ++$DNA_sequence_count{substr($allbuffered, -$j, $i)};
    }
}
Small example:
CATCAT gives

    CAT 2
    ATC 1
    TCA 1

Applying that algorithm you get

    CA 2 (directly from CAT)
    AT 2 (from ATC + the AT at the end)
    TC 1 (directly from TCA)
    C  2 (directly from CAT)
    A  2 (from ATC + the A in the AT at the end)
    T  2 (from TCA + the last T)
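For anyone who wants to check that arithmetic, here is a small self-contained sketch of the correction applied to the CATCAT example (a toy script that recounts from scratch instead of plugging into the benchmark above; $dna stands in for the DNA string, the other names mirror the earlier code):

#!/usr/bin/perl
use strict;
use warnings;

# Toy check of the correction algorithm on the CATCAT example.
my $dna            = 'CATCAT';
my $sequencelength = 3;

# count the full-length subsequences first
my %DNA_sequence_count;
++$DNA_sequence_count{substr($dna, $_, $sequencelength)}
    for 0 .. length($dna)-$sequencelength;

# derive the shorter counts from the prefixes of the longer keys ...
my (@keys)= keys %DNA_sequence_count;
foreach my $key (@keys) {
    for (my $i=$sequencelength-1; $i; --$i) {
        $DNA_sequence_count{substr($key, 0, $i)} += $DNA_sequence_count{$key};
    }
}

# ... then add the few subsequences at the very end that the prefixes miss
for (my $i=$sequencelength-1; $i; --$i) {
    for (my $j=$i; $j<$sequencelength; ++$j) {
        ++$DNA_sequence_count{substr($dna, -$j, $i)};
    }
}

print "$_ $DNA_sequence_count{$_}\n"
    for sort { length($a) <=> length($b) || $a cmp $b } keys %DNA_sequence_count;

Its output matches the counts listed above.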

Update: sauoq noticed there must be a bug in that correction algorithm, and in fact he's right. It's fixed now.


s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e

Re^2: Question about speeding a regexp count
by sauoq (Abbot) on Oct 13, 2005 at 23:32 UTC
    Provided my code has no bugs, see for yourself:

    I've noticed persistent differences in the results of your code and mine. (Well, the versions of your code as included in ikegami's benchmarks.) They take the form of 2 to 3 subsequences that differ by a count of 1 or 2. I suspect that this is an easy bug to fix... but since it's your code, I figure you can debug and fix it. ;-)

    That all said, I'd like to point out that 1) including the file reads in your benchmark obscures the issue. As long as the machine has plenty of memory (and maybe yours doesn't) you are contaminating your results with two different methods of reading the data. And 2) you could modify my algorithm and most of the others' to work with chunks as well if RAM really was an issue.

    If you work your bugs out, I am interested in your method of iterating over the shorter strings for the smaller length matches though. I haven't looked at it closely yet, but I'd like to see how that scales.

    -sauoq
    "My two cents aren't worth a dime.";
    

      Well spotted, sauoq! There are indeed bugs:

      1. There was a bug in the implementation of skeeve2
      2. There was a bug in my correction algorithm
      Fixing them does not significantly increase the calculation time. For those who are interested, here is the code to fix ikegami's benchmark:

      I'll fix my algorithm outlined in my other post.

      Regarding your other issues:

      1. including the file reads in your benchmark obscures the issue. As long as the machine has plenty of memory (and maybe yours doesn't) you are contaminating your results with two different methods of reading the data.

        I don't agree. It was intended that way.

        Reading chunks of data is important if you are low on memory. And I'm 100% sure the OP has to read his data from a file. So reading data is something that is essential for the algorithm.

      2. you could modify my algorithm and most of the others' to work with chunks as well if RAM really was an issue.

        Here I do agree. Of course it can be done. But you didn't ;-)

        I challenge you to do so and then let's compare times.

        But OTOH: My algorithm taking low memory into account is the same as a2, the one ikegami implemented as "skeeve2". So the only difference should be the time needed for reading the data. And if we all read the data the same way, with the same chunk size, there shouldn't be a difference in the ranking.

        Of course: All of you could rewrite and use my correction algorithm. This would make some of your algorithms significantly faster, I think.

      s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e
        Reading chunks of data is important if you are low on memory. And I'm 100% sure the OP has to read his data from a file. So reading data is something that is essential for the algorithm.

        If you are so low on memory that you can't read 6 megs of data into RAM, you might want to invest a couple dollars in an upgrade. ;-)

        And, though you are right that he probably has to read the data in from the file, your benchmark assumes that everyone else would pick a poor way of reading that data while you pick a relatively good way. For instance, instead of the lame my $allbuffered= <$testfile>; you chose for us, I would likely choose to use sysread with 4k chunks which, on my machine, would beat read with 1k chunks with a performance gain of close to 100% on average.
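        Roughly what I have in mind, as an untested sketch rather than a drop-in replacement for the benchmark (it reuses the $sequencelength, $testfilename and %DNA_sequence_count names from the code above, and it glues the last $sequencelength-1 characters of each chunk onto the next one so nothing spanning a chunk boundary is lost):

        #!/usr/bin/perl
        use strict;
        use warnings;

        my $sequencelength = 3;
        my $testfilename   = 'DNA.tst';
        my $chunksize      = 4096;         # 4k chunks instead of 1k

        my %DNA_sequence_count;
        open(my $testfile, '<', $testfilename) or die "$testfilename: $!";

        my $carry = '';
        my $chunk;
        while (sysread($testfile, $chunk, $chunksize)) {
            my $buffer = $carry . $chunk;  # prepend the tail of the previous chunk
            ++$DNA_sequence_count{substr($buffer, $_, $sequencelength)}
                for 0 .. length($buffer)-$sequencelength;
            my $keep = $sequencelength-1;  # keep the last k-1 chars for the next round
            $carry = length($buffer) > $keep ? substr($buffer, -$keep) : $buffer;
        }
        close $testfile;

        The counting loop is the same substr() approach as before; the only point here is the read strategy.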

        You can't make assumptions like that about somebody else's code and then claim yours is better. All you show is that your assumptions are poor.

        Here I do agree. Of course it can be done. But you didn't ;-)

        I remain unconvinced that it is even necessary. As I already pointed out, the size of the data is only 6 megs. This isn't the '80s, and that's nothing. Even if he had 60 megs of data, reading it all in would likely be a fine strategy. Now, if he had 600 megs, I'd look at reading chunks... at least on my desktop.

        If you like though, do a comparison after swapping out the buffered IO you assumed for us with sysread like I suggested. Leave yours as is and report the difference.

        Of course: All of you could rewrite and use my correction algorithm. This would make some of your algorithms significantly faster, I think.

        I'm not so sure of that. I don't see anything intrinsic about it which would suggest an increase. It is an interesting twist though. I'd rather see corrected benchmarks first (i.e. equivalent reads or no reads at all) before I started making that comparison.

        -sauoq
        "My two cents aren't worth a dime.";