in reply to The indisputable speed of tr///

Just one more question,

In my experience, I/O bottlenecks or pathological algorithms are usually the culprits when apps become too slow.

That said, tr/// is certainly the right way to do it. Teardown time alone on that hash is probably longer than tr/// takes - not to mention all the memory management and copying involved in the split.

The operation is only done once, though, and 300 times nothing can still be pretty small.

After Compline,
Zaxo

Replies are listed 'Best First'.
Re^2: The indisputable speed of tr///
by radiantmatrix (Parson) on Jun 26, 2006 at 20:46 UTC

    Did it help? Quite a bit, actually. This was one component in an otherwise quite decent system, written by a coder who was essentially a temp. The file-IO stuff was handled as efficiently as reasonable in Perl and with appropriate error-checking. It's entirely a File-IO type of system.

    The operation is only done once, though, and 300 times nothing can still be pretty small.

    See, that's just it. We're talking about large chunks of text, handled a line at a time for memory's sake -- so this operation is repeated tens of thousands of times per run, and many thousands of runs per day (many of which are in parallel -- this doesn't slam the processor by any stretch, since we wait mostly on I/O).

    Now, the actual performance difference in the app at large was no where near the 30_000% in the demonstration but it was significant, and certainly worth doing. This change alone made roughly a 350% difference, and that was after improving the I/O routine to read and process up to 500 lines of text at a time, instead of one.

    <radiant.matrix>
    A collection of thoughts and links from the minds of geeks
    The Code that can be seen is not the true Code
    I haven't found a problem yet that can't be solved by a well-placed trebuchet

      I wonder ... if all you were doing was some transliteration why would you bother with the lines at all? Read and process say 64 KB at once. As I said, not sure this is doable in your case.

      Second ... why would you spend time "figuring out reasonable character classes? If you have the mapping hash, just build the tr/// out of it. I just tried to benchmark the difference between tr/a-z/A-Z/ and tr/abcdef...xyz/ABCDEF...ZYX/ and the later actually came out a tiny little bit faster. So you could either have the tr/// generated from the hash and copy the result into the script or use string eval to define a subroutine to do the transliteration.  my $sub = eval 'sub { my $x = shift(); $x=~ tr/' . join('', keys %h) . '/' . join('', values %h) . '/; return $x}'; or something. If the loop is simple enough you could eval the whole loop so that you do not waste time by the subroutine call for each line. Or am I missing something?

        I wonder ... if all you were doing was some transliteration why would you bother with the lines at all?

        Good point. I now have the code altered in test to use sysread to read in 128KB at a time, and there is definitely a boost there. Probably roll that into prod with the next release. Thanks!

        why would you spend time "figuring out reasonable character classes? If you have the mapping hash, just build the tr/// out of it.

        I started there, actually, with code that looks something like this:

        my %rot = ( 'A' => 'N','B' => 'O','C' => 'P','D' => 'Q','E' => 'R','F' => 'S', 'G' => 'T','H' => 'U','I' => 'V','J' => 'W','K' => 'X','L' => 'Y', 'M' => 'Z','N' => 'A','O' => 'B','P' => 'C','Q' => 'D','R' => 'E', 'S' => 'F','T' => 'G','U' => 'H','V' => 'I','W' => 'J','X' => 'K', 'Y' => 'L','Z' => 'M','a' => 'n','b' => 'o','c' => 'p','d' => 'q', 'e' => 'r','f' => 's','g' => 't','h' => 'u','i' => 'v','j' => 'w', 'k' => 'x','l' => 'y','m' => 'z','n' => 'a','o' => 'b','p' => 'c', 'q' => 'd','r' => 'e','s' => 'f','t' => 'g','u' => 'h','v' => 'i', 'w' => 'j','x' => 'k','y' => 'l','z' => 'm', ); my ($tr_a, $tr_b); for (keys %rot) { $tr_a.=$_; $tr_b.=$rot{$_} } print "tr/$tr_a/$tr_b/;\n";

        However, I wasn't kidding when I said that ROT-13 is a simplified case of the actual problem! I spent some time finding the char classes, not for runtime efficiency, but for readability/maintainability. There's a reasonable chance of the rules changing, and a tr/// with 300-or-so-char clauses is mighty unreadable.

        The string eval is an interesting solution, and one I will have to play around with -- unfortunately, it won't be allowed in prod without an exception filing (string evals are against our coding standards), so there will have to be significant benefit to get it approved. Still, something I hadn't considered, so I thank you for that...

        <radiant.matrix>
        A collection of thoughts and links from the minds of geeks
        The Code that can be seen is not the true Code
        I haven't found a problem yet that can't be solved by a well-placed trebuchet