in reply to Re: Question about speeding a regexp count
in thread Question about speeding a regexp count

Provided my code has no bugs, see for yourself:

I've noticed persistent differences in the results of your code and mine. (Well, the versions of your code as included in ikegami's benchmarks.) They take the form of 2 to 3 subsequences that differ by a count of 1 or 2. I suspect that this is an easy bug to fix... but since it's your code, I figure you can debug and fix it. ;-)
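
Roughly, the kind of check I mean (the hashes %mine and %yours are hypothetical stand-ins for the per-subsequence counts each implementation produces):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Hypothetical per-subsequence counts from the two implementations.
    my %mine  = ( AAC => 12, AAG => 7, AAT => 9 );
    my %yours = ( AAC => 12, AAG => 8, AAT => 9 );

    # Print every subsequence whose counts disagree.
    my %all = ( %mine, %yours );
    for my $seq ( sort keys %all ) {
        my $c1 = exists $mine{$seq}  ? $mine{$seq}  : 0;
        my $c2 = exists $yours{$seq} ? $yours{$seq} : 0;
        printf "%-10s %d vs %d\n", $seq, $c1, $c2 if $c1 != $c2;
    }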

That all said, I'd like to point out that 1) including the file reads in your benchmark obscures the issue. As long as the machine has plenty of memory (and maybe yours doesn't) you are contaminating your results with two different methods of reading the data. And 2) you could modify my algorithm and most of the others' to work with chunks as well if RAM really was an issue.
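
To be clear about what I mean by keeping the reads out of the timing, here is a rough sketch; the counting subs are placeholders for the real routines, not anybody's actual code:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    # Read the file once, outside the timed region, so the benchmark
    # measures only the counting code and not the I/O.
    my $file = shift or die "usage: $0 file\n";
    my $data = do {
        open my $fh, '<', $file or die "Can't open $file: $!";
        local $/;    # slurp mode
        <$fh>;
    };

    # Placeholder counters; substitute the real routines being compared.
    sub count_skeeve2 { my $n = () = $_[0] =~ /GATTACA/g; $n }
    sub count_sauoq   { my $n = () = $_[0] =~ /GATTACA/g; $n }

    cmpthese( -5, {
        skeeve2 => sub { count_skeeve2($data) },
        sauoq   => sub { count_sauoq($data)   },
    });

Same data string for every candidate, so whatever the I/O strategy is, it is not what gets measured.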

Once you work your bugs out, though, I am interested in your method of iterating over the shorter strings for the smaller-length matches. I haven't looked at it closely yet, but I'd like to see how it scales.

-sauoq
"My two cents aren't worth a dime.";

Re^3: Question about speeding a regexp count
by Skeeve (Parson) on Oct 14, 2005 at 10:51 UTC

    Well spotted, sauoq! There are indeed bugs:

    1. There was a bug in the implementation of skeeve2
    2. There was a bug in my correction algorithm
    Fixing them does not significantly increase the calculation time. For those who are interested, here is the code to fix ikegami's benchmark:

    I'll fix my algorithm outlined in my other post.

    Regarding your other issues:

    1. including the file reads in your benchmark obscures the issue. As long as the machine has plenty of memory (and maybe yours doesn't) you are contaminating your results with two different methods of reading the data.

      I don't agree. It was intended that way.

      Reading chunks of data is important if you are low on memory. And I'm 100% sure the OP has to read his data from a file. So reading data is something that is essential for the algorithm.

    2. you could modify my algorithm and most of the others' to work with chunks as well if RAM really was an issue.

      Here I do agree. Of course it can be done. But you didn't ;-)

      I challenge you to do so and then let's compare times.

      But OTOH: my algorithm with low memory taken into account is the same as a2, the one ikegami implemented as "skeeve2". So the only difference should be the time needed for reading the data. And if we all read in chunks, all with the same chunk size, there shouldn't be a difference in the ranking. (A rough sketch of reading in chunks follows this list.)

      Of course: All of you could rewrite and use my correction algorithm. This would make some of your algorithms significantly faster, I think.
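
      For illustration only, a rough sketch of the kind of chunked reading I mean, with an overlap carried between reads so a match straddling a chunk boundary isn't lost. The chunk size, the pattern, and the maximum match length are made-up values, not the benchmark code itself:

        #!/usr/bin/perl
        use strict;
        use warnings;

        my $file      = shift or die "usage: $0 file\n";
        my $chunksize = 4096;           # arbitrary chunk size for the example
        my $pattern   = qr/GATTACA/;    # made-up pattern
        my $maxlen    = 7;              # length of the longest match sought
        my $keep      = $maxlen - 1;    # overlap carried between chunks

        open my $fh, '<', $file or die "Can't open $file: $!";

        my ($buf, $count) = ('', 0);
        while ( read $fh, $buf, $chunksize, length $buf ) {
            while ( $buf =~ /$pattern/g ) {
                # Matches starting inside the overlap are deferred to the
                # next chunk so they are neither lost nor counted twice.
                last if $-[0] >= length($buf) - $keep;
                $count++;
            }
            $buf = length($buf) > $keep ? substr($buf, -$keep) : $buf;
        }
        # Whatever is left in the overlap may still hold deferred matches.
        $count++ while $buf =~ /$pattern/g;

        close $fh;
        print "$count matches\n";

      Keeping the last $maxlen - 1 characters is the smallest carry that still lets a match split across two reads be seen, and deferring matches that start inside that overlap keeps them from being counted twice.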

    s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
    +.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e
      Reading chunks of data is important if you are low on memory. And I'm 100% sure the OP has to read his data from a file. So reading data is something that is essential for the algorithm.

      If you are so low on memory that you can't read 6 megs of data into RAM, you might want to invest a couple dollars in an upgrade. ;-)

      And, though you are right that he probably has to read the data in from the file, your benchmark assumes that everyone else would pick a poor way of reading that data while you pick a relatively good way. For instance, instead of the lame my $allbuffered= <$testfile>; you chose for us, I would likely choose sysread with 4k chunks, which, on my machine, beats read with 1k chunks by close to 100% on average.
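
      In case it isn't obvious what I'd compare, something along these lines (only a sketch; timings will of course vary by machine and Perl build):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Benchmark qw(cmpthese);

        my $file = shift or die "usage: $0 file\n";

        cmpthese( -5, {
            # Buffered read() in 1k chunks.
            read_1k => sub {
                open my $fh, '<', $file or die "Can't open $file: $!";
                my $buf;
                1 while read $fh, $buf, 1024;
                close $fh;
            },
            # sysread() in 4k chunks, skipping the extra buffering layer.
            sysread_4k => sub {
                open my $fh, '<', $file or die "Can't open $file: $!";
                my $buf;
                1 while sysread $fh, $buf, 4096;
                close $fh;
            },
        });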

      You can't make assumptions like that about somebody else's code and then claim yours is better. All you show is that your assumptions are poor.

      Here I do agree. Of course it can be done. But you didn't ;-)

      I remain unconvinced that it is even necessary. As I already pointed out, the size of the data is only 6 megs. This isn't the '80s, and that's nothing. Even if he had 60 megs of data, reading it all in would likely be a fine strategy. Now, if he had 600 megs, I'd look at reading chunks... at least on my desktop.

      If you like, though, do a comparison after swapping out the buffered I/O you assumed for us for sysread, as I suggested. Leave yours as is and report the difference.

      Of course: All of you could rewrite and use my correction algorithm. This would make some of your algorithms significantly faster, I think.

      I'm not so sure of that. I don't see anything intrinsic about it that would suggest a speed-up. It is an interesting twist, though. I'd rather see corrected benchmarks first (i.e., equivalent reads or no reads at all) before making that comparison.

      -sauoq
      "My two cents aren't worth a dime.";