BrowserUk has asked for the wisdom of the Perl Monks concerning the following question:

I was running a piece of code that ran far slower than I expected and thrashed my CPU in the process. It turned out that I was accessing substrings of a substring via a substr reference, as in the third loop below. On my system, under both 5.8.6 and 5.10.0, this proves to be 2000 times slower than doing the same thing directly on the string or via a normal reference to that same string:

#! perl -slw
use strict;
use Time::HiRes qw[ time ];

my $string = 'X' x 1e6;
my $ref = \$string;
my $subRef = \substr $string, 0;

my $start;
my $x;

$start = time;
$x = substr $string, $_*50, 50 for 0 .. 2e4;
print time() - $start;

$start = time;
$x = substr $$ref, $_*50, 50 for 0 .. 2e4;
print time() - $start;

$start = time;
$x = substr $$subRef, $_*50, 50 for 0 .. 2e4;
print time() - $start;

__END__
c:\test>junk1
0.0107600688934326
0.0100328922271729
22.2601997852325

c:\test>\Perl510\bin\perl5.10.0.exe junk1.pl
0.00995421409606934
0.0109601020812988
21.9631989002228

  1. Is this a platform thing?
  2. Is there a workaround other than copying the substr ref to a normal string before looping over its substrings?
  3. Is there an explanation of why it is so slow?

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."

Replies are listed 'Best First'.
Re: Access via substr refs 2000 times slower
by ikegami (Patriarch) on Dec 28, 2008 at 11:11 UTC

    You're triggering substr's lvalue return, which involves magic.

    use Devel::Peek;
    {
        my $subRef = \substr $string, 0;
        Dump($$subRef);
    }
    {
        my $subRef = \(''.substr $string, 0);
        Dump($$subRef);
    }

    SV = PVLV(0x1834fdc) at 0x1831820
      REFCNT = 2
      FLAGS = (PADMY,GMG,SMG,pPOK)
      IV = 0
      NV = 0
      PV = 0x18208ec ""\0
      CUR = 0
      LEN = 4
      MAGIC = 0x182443c
        MG_VIRTUAL = &PL_vtbl_substr
        MG_TYPE = PERL_MAGIC_substr(x)
      TYPE = x
      TARGOFF = 0
      TARGLEN = 0
      TARG = 0x236dc8
      SV = PV(0x238e44) at 0x236dc8
        REFCNT = 2
        FLAGS = (POK,pPOK)
        PV = 0x182eca4 ""\0
        CUR = 0
        LEN = 4
    SV = PV(0x238e80) at 0x1831808
      REFCNT = 2
      FLAGS = (PADBUSY,PADMY,POK,pPOK)
      PV = 0x183646c ""\0
      CUR = 0
      LEN = 4

    By changing to

    my $subRef = \(''.substr $string, 0);

    I get

    0.0120670795440674
    0.0105710029602051
    0.010854959487915

    Update: scalar also works, and doesn't have the (albeit minute) overhead of calling concat.

    my $subRef = \scalar substr $string, 0;

      Trouble is, either of those causes the referenced substring to be copied, effectively just giving you a reference to an anonymous scalar that is a copy of the substring. You might just as well do:

      my $substr = substr $bigstring, $start, $length;
      func( \$substr );

      The purpose of taking a substr ref was to avoid copying large chunks of a large string, and to allow the large string to be modified in place via that reference.
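The in-place modification described above can be sketched as follows. This is a minimal illustration on a short string; the variable names (`$window`, `$copy`) and the sample data are made up for the example and are not from the original code.

```perl
use strict;
use warnings;

my $string = 'AAAACCCCGGGGTTTT';

# Lvalue substr ref: a window onto characters 4..7 of $string.
my $window = \substr $string, 4, 4;

# Assigning through the ref rewrites that part of the original string.
$$window = 'nnnn';
print "$string\n";    # AAAAnnnnGGGGTTTT

# An ordinary copy is detached: changing it leaves $string alone.
my $copy = substr $string, 4, 4;
$copy = 'xxxx';
print "$string\n";    # AAAAnnnnGGGGTTTT
```

The second half shows why a reference to a copy does not serve the same purpose: writes never reach the original string.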

      That said, it seems that taking an (lvalue) substr ref also triggers copying these days, albeit with attached magic that means changes made to the copy also get applied to the original substring. Which is a bit cockeyed.

      It never used to, but obviously has for some time--at least since 5.8.6. I'm surprised that I've never noticed it before now. It rather devalues the purpose of taking a reference to a substring. Methinks whoever made the change did not really get lvalue refs.

      I feel the need to write some XS to (again) give me the ability to pass a reference to a substring around without causing that substring to be copied.
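Short of XS, one pure-Perl alternative is to pass a plain reference to the whole string together with an offset and a length, so that each read is an ordinary rvalue `substr` on the original. This is only a sketch of that idea, not what the poster did: `make_view` and `view_substr` are hypothetical helpers invented for illustration, and callers have to be written to accept such a view rather than a bare scalar ref.

```perl
use strict;
use warnings;

# Hypothetical "view" record: a plain ref to the big string plus a window.
sub make_view {
    my ($str_ref, $offset, $length) = @_;
    return { str => $str_ref, off => $offset, len => $length };
}

# Read up to $n characters at position $pos inside the view,
# clipped so that the read never runs past the end of the window.
sub view_substr {
    my ($view, $pos, $n) = @_;
    my $avail = $view->{len} - $pos;
    return '' if $avail <= 0;
    $n = $avail if $n > $avail;
    return substr ${ $view->{str} }, $view->{off} + $pos, $n;
}

my $big  = 'ACGT' x 10;                  # stand-in for a very large string
my $view = make_view(\$big, 2, 10);      # window over characters 2..11
print view_substr($view, 0, 4), "\n";    # GTAC
print view_substr($view, 8, 4), "\n";    # GT (clipped to the window)
```

Only the requested characters are copied on each read, and no magic is attached to anything being dereferenced.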



        It never used to, but obviously has for some time--at least since 5.8.6

        At least since 5.6.0.

        From what I can tell, 'get' magic works by storing the value from the magic handler into the SV, allowing the code that follows to ignore magic. (mg_get: Do magic after a value is retrieved from the SV.) That would explain the copy.

        Similarly, 'set' magic works by passing the value in the SV to the magic handler, allowing the code that precedes it to ignore magic. (mg_set: Do magic after a value is assigned to the SV.) That would also create a copy.

        If my understanding is correct, that means the problem isn't related to lvalue substr but with magic in general.
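As a rough analogy for the point about magic in general, a tied scalar (which Perl also implements with magic) makes the per-access cost visible: every read goes through a handler that hands back the whole value, however small the piece being requested. This is an illustrative sketch only. `CountingScalar` is a made-up class, and it does not reproduce the internals of substr magic.

```perl
use strict;
use warnings;

package CountingScalar;

my $fetches = 0;

sub TIESCALAR { my ($class, $value) = @_; return bless \$value, $class }

# FETCH returns the complete stored value on every read.
sub FETCH     { my ($self) = @_; $fetches++; return $$self }
sub STORE     { my ($self, $new) = @_; $$self = $new }
sub fetches   { return $fetches }

package main;

tie my $magical, 'CountingScalar', 'X' x 1000;

# Ten small substr reads, each of which has to fetch the full value first.
my $x;
$x = substr $magical, $_ * 50, 50 for 0 .. 9;

printf "10 reads triggered %d FETCH calls\n", CountingScalar::fetches();
```

If each access to a magical scalar re-materialises the full value, then 20,000 small reads from a 1 MB substr ref would copy on the order of 20 GB, which is in line with the slowdown reported at the top of the thread.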

        I wonder if this change is tied to the work that was done to allow copy-on-write strings. Modifying a large string in place via a reference doesn't look like it would play well with the idea of having multiple variables use copy-on-write so that they can share one actual copy of a large string.
Re: Access via substr refs 2000 times slower
by talexb (Chancellor) on Dec 29, 2008 at 02:41 UTC

    Wow. It's this kind of discussion that keeps running Perl across a chef's steel, sharpening and sharpening that edge. Frankly, this level of analysis is more involved than I care to worry about, but I'm thrilled that there are eager souls such as yourselves who continue to dig for further improvements.

    Thank you.

    Alex / talexb / Toronto

    "Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds

      I didn't set out to find this. I just happened to pass a substr ref into an existing subroutine that takes a scalar ref argument and uses substr to iterate over the data pointed at by that reference in chunks.

      It was only when that subroutine, which would normally take a couple of seconds to run, was still running 40 minutes later that I went looking for the reason. The effect was so dramatic that I thought it worth looking at.

      The subroutine in question is one that takes large genome sequences and prints them in FASTA format (wrapped every 50 chars). The input scalars can be 100s of megabytes in length, hence the reason for passing a reference rather than a copy. Breaking such large scalars into a list using split or unpack would dramatically increase memory usage, hence the reason for iterating over the scalar using substr.
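A routine of the kind described might look roughly like the sketch below. The actual subroutine is not shown in the thread, so `print_fasta`, its argument order, and the sample record are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: write one FASTA record, wrapping the sequence at
# $width characters per line. $seq_ref is a plain scalar reference.
sub print_fasta {
    my ($fh, $header, $seq_ref, $width) = @_;
    $width ||= 50;
    print {$fh} ">$header\n";
    my $len = length $$seq_ref;
    for (my $pos = 0; $pos < $len; $pos += $width) {
        # rvalue substr copies only one line's worth of data at a time
        print {$fh} substr($$seq_ref, $pos, $width), "\n";
    }
    return;
}

my $seq = 'ACGT' x 30;    # 120 characters, stand-in for a large sequence
print_fasta(\*STDOUT, 'test_record', \$seq);
```

Handing this routine a plain `\$seq` keeps each iteration cheap. Handing it an lvalue substr ref instead is the situation that produced the slowdown discussed in this thread.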

      The subroutine has worked well for several years, but on this occasion I was just generating (simulating) a very large FASTA file for testing some ideas. So, instead of generating a new large scalar of random ACGTs for each record, I generated a single random sequence of the largest size possible, and was then selecting a substring of that for output as each record. It was when passing those substring refs to the FASTA output routine that I encountered the slowness.

      I was trying to generate a 3.5GB FASTA test file, something that under normal circumstances might take a couple of hours. With this combination of factors and the 2000x slowdown, it would have taken almost 6 months!

