in reply to Re^2: Question about speeding a regexp count
in thread Question about speeding a regexp count

Well spotted, sauoq! There are indeed bugs:

  1. There was a bug in the implementation of skeeve2
  2. There was a bug in my correction algorithm
Fixing them does not significantly increase the calculation time. For those who are interested, here is the code to fix ikegami's benchmark:

sub skeeve2 {
    my %count;
    for my $i (0..length($seq)-GROUP_LENGTH) {
        $count{substr($seq, $i, GROUP_LENGTH)}++;
    }
    my @keys = keys %count;
    foreach my $key (@keys) {
        for my $i (1..GROUP_LENGTH-1) {
            $count{substr($key, 0, $i)} += $count{$key};
        }
    }
    for my $i (1..GROUP_LENGTH-1) {
        for my $j ($i..GROUP_LENGTH-1) {
            ++$count{substr($seq, -$j, $i)};
        }
    }
    1;
}

sub skeeve3 {
    my %count;
    $count{"$1$2"}++ while $seq =~ /(.)(?=(..))/g;
    my @keys = keys %count;
    foreach my $key (@keys) {
        for my $i (1..GROUP_LENGTH-1) {
            $count{substr($key, 0, $i)} += $count{$key};
        }
    }
    for my $i (1..GROUP_LENGTH-1) {
        for my $j ($i..GROUP_LENGTH-1) {
            ++$count{substr($seq, -$j, $i)};
        }
    }
    1;
}

sub skeeve3_i {
    my %count;
    $count{$1}++ while $seq =~ /(?=(.{3}))/g;
    my @keys = keys %count;
    foreach my $key (@keys) {
        for my $i (1..GROUP_LENGTH-1) {
            $count{substr($key, 0, $i)} += $count{$key};
        }
    }
    for my $i (1..GROUP_LENGTH-1) {
        for my $j ($i..GROUP_LENGTH-1) {
            ++$count{substr($seq, -$j, $i)};
        }
    }
    1;
}
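
For completeness, a minimal driver sketch of how I run these subs; the input file name is a placeholder and ikegami's actual harness may set things up differently. GROUP_LENGTH has to be 3 here anyway, because skeeve3 and skeeve3_i hard-code three-character groups in their regexes. After any of the three subs has run, %count holds the number of occurrences of every substring of length 1 up to GROUP_LENGTH.

use strict;
use warnings;
use Benchmark qw(cmpthese);

use constant GROUP_LENGTH => 3;     # skeeve3/skeeve3_i hard-code 3-char groups

# Slurp the test data; 'sequence.dat' is a placeholder for the real input.
our $seq = do {
    open my $fh, '<', 'sequence.dat' or die "sequence.dat: $!";
    local $/;
    <$fh>;
};

# The three subs from above go here (they use $seq and GROUP_LENGTH directly).

cmpthese(-5, {
    skeeve2   => \&skeeve2,
    skeeve3   => \&skeeve3,
    skeeve3_i => \&skeeve3_i,
});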

I'll fix my algorithm outlined in my other post.

Regarding your other issues:

  1. including the file reads in your benchmark obscures the issue. As long as the machine has plenty of memory (and maybe yours doesn't) you are contaminating your results with two different methods of reading the data.

    I don't agree. It was intended that way.

    Reading chunks of data is important if you are low on memory. And I'm 100% sure the OP has to read his data from a file. So reading data is something that is essential for the algorithm.

  2. you could modify my algorithm and most of the others' to work with chunks as well if RAM really was an issue.

    Here I do agree. Of course it can be done. But you didn't ;-)

    I challenge you to do so and then let's compare times.

    But OTOH: My algorithm taking low memory into account is the same as a2, the one ikegami implemented as "skeeve2" (a rough sketch of what I mean follows this list). So the only difference should be the time needed for reading the data. And if we all do so, and all read the same chunk size, there shouldn't be a difference in the ranking.

    Of course, all of you could rewrite your code to use my correction algorithm. This would make some of your algorithms significantly faster, I think.
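
As mentioned above, here is a rough sketch of the chunk-wise first pass I have in mind; the chunk size and file name are only placeholders, and the correction loops from the code above would then run unchanged, using the final carry in place of the end of $seq:

use strict;
use warnings;

use constant GROUP_LENGTH => 3;
use constant CHUNK_SIZE   => 4096;   # placeholder chunk size

my %count;
my $carry = '';    # last GROUP_LENGTH-1 chars of the previous chunk

open my $fh, '<', 'sequence.dat' or die "sequence.dat: $!";
while (read $fh, my $chunk, CHUNK_SIZE) {
    my $buf = $carry . $chunk;
    # Count only full-length groups; shorter prefixes are derived later.
    for my $i (0 .. length($buf) - GROUP_LENGTH) {
        $count{substr($buf, $i, GROUP_LENGTH)}++;
    }
    # Keep the tail so groups straddling a chunk boundary are not lost.
    $carry = substr($buf, -(GROUP_LENGTH - 1));
}
close $fh;

# $carry now ends exactly like $seq does, so the prefix-spreading and
# end-of-string correction loops from skeeve2 can use $carry instead of $seq.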

s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
+.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e

Re^4: Question about speeding a regexp count
by sauoq (Abbot) on Oct 14, 2005 at 16:45 UTC
    Reading chunks of data is important if you are low on memory. And I'm 100% sure the OP has to read his data from a file. So reading data is something that is essential for the algorithm.

    If you are so low on memory that you can't read 6 megs of data into RAM, you might want to invest a couple dollars in an upgrade. ;-)

    And, though you are right that he probably has to read the data in from the file, your benchmark assumes that everyone else would pick a poor way of reading that data while you pick a relatively good way. For instance, instead of the lame my $allbuffered= <$testfile>; you chose for us, I would likely choose to use sysread with 4k chunks which, on my machine, would beat read with 1k chunks with a performance gain of close to 100% on average.
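
    Roughly what I mean, with a placeholder file name (under strict/warnings, appending each 4k chunk straight onto the buffer):

        use strict;
        use warnings;

        my $data = '';
        open my $testfile, '<', 'test.dat' or die "Can't open test.dat: $!";
        binmode $testfile;
        # sysread bypasses the stdio buffering layer; append 4k at a time onto $data.
        1 while sysread $testfile, $data, 4096, length $data;
        close $testfile;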

    You can't make assumptions like that about somebody else's code and then claim yours is better. All you show is that your assumptions are poor.

    Here I do agree. Of course it can be done. But you didn't ;-)

    I remain unconvinced that it is even necessary. As I already pointed out, the size of the data is only 6 megs. This isn't the '80s and that's nothing. Even if he had 60 megs of data, reading it all in would likely be a fine strategy. Now, if he had 600 megs, I'd look at reading chunks... at least on my desktop.

    If you like though, do a comparison after swapping out the buffered IO you assumed for us with sysread like I suggested. Leave yours as is and report the difference.

    Of course: All of you could rewrite and use my correction algorithm. This would make some of your algorithms significantly faster, I think.

    I'm not so sure of that. I don't see anything intrinsic about it that would suggest a speed increase. It is an interesting twist, though. I'd rather see corrected benchmarks first (i.e. equivalent reads or no reads at all) before I started making that comparison.

    -sauoq
    "My two cents aren't worth a dime.";