hrr has asked for the wisdom of the Perl Monks concerning the following question:

I would like to process a lengthy (binary) string of length N as a stream, chopping off the front part as the parsing goes along.
my $n = unpack("v", $str); $str = substr($str, 1);
Now I have noted that my application considerably slows down on longer strings, and it seems that is due to the following effect, indicating that this approach scales as N^2 (instead of the naively expected N).
$ time perl -e '$a = "a" x 20_000; $a = substr($a, 1) while length($a) +;' user 0m0.350s $ time perl -e '$a = "a" x 40_000; $a = substr($a, 1) while length($a) +;' user 0m0.791s $ time perl -e '$a = "a" x 80_000; $a = substr($a, 1) while length($a) +;' user 0m2.683s
A similar issue was discussed in Access via substr refs 2000 times slower: I think the reason for this slowdown is that when constructing support for the lvalue property of substr, a copy is made. Is there a good workaround to this scaling issue? Instead of passing an ever-shrinking stream around different functions, I could just pass the whole string and an offset,
my $n = unpack("v", substr($str, $off, 2)); $off += 2;
but this seems rather clumsy to me... Are there more elegant ways of doing this?

Update: Thanks for all the help and for the insightful explanations! As suggested, the 4-argument version of substr (setting REPLACEMENT to "") instead of the (2|3)-argument substr solves this in a very neat manner,
my $n = unpack("v", substr($str, $off, 2, ""));

Replies are listed 'Best First'.
Re: Using substr to split a string scales as N^2 -- alternatives?
by ysth (Canon) on Jan 05, 2009 at 00:22 UTC
    Use 4 arg substr:
    $ time perl -e '$a = "a" x 50_000; substr($a, 0, 1, "") while length($ +a);' real 0m0.016s user 0m0.012s sys 0m0.008s ysth@cat:~$ time perl -e '$a = "a" x 100_000; substr($a, 0, 1, "") whi +le length($a);' real 0m0.029s user 0m0.024s sys 0m0.004s ysth@cat:~$ time perl -e '$a = "a" x 200_000; substr($a, 0, 1, "") whi +le length($a);' real 0m0.041s user 0m0.036s sys 0m0.004s
    Or even s/.//s; Either will trigger an optimization that just adjusts the effective start of the string buffer without moving the data at all (when removing characters from the start of the string).
Re: Using substr to split a string scales as N^2 -- alternatives?
by ikegami (Patriarch) on Jan 05, 2009 at 00:25 UTC

    I think the reason for this slowdown is that when constructing support for the lvalue property of substr,

    No, it's only done when necessary, and it's not here.

    >perl -MO=Concise -e"\substr($_,1)" 2>&1 | find "substr" 5 <@> substr[t4] sKRM/2 ->6 ^ | "M"odifiable => magic created. >perl -MO=Concise -e"$a = substr($_,1)" 2>&1 | find "substr" 5 <@> substr[t3] sK/2 ->6 ^ | Not modifiable => no magic created.

    The time is spent creating new strings and assigning them.

    Pass 1: Creates a 79,999 char string & copies it into $a's buffer Pass 2: Creates a 79,998 char string & copies it into $a's buffer Pass 3: Creates a 79,997 char string & copies it into $a's buffer ... Pass 79,999: Creates a 1 char string & copies it into $a's buffer Pass 80,000: Creates a 0 char string & copies it into $a's buffer

    It's copying (((N-1)+1)(N-1)/2)*2 = (N^2 - N) = 6,399,920,000 bytes!

    However, it's possible to do what you want without copying a singly byte! Through a feature called "OOK", Perl can efficiently remove the leading chars of strings when it does so in-place.

    $a =~ s/.//s; # Remove leading char with no copying. substr($a, 0, 1, ''); # Remove leading char with no copying.

    The latter even returns the character removed! (Well, so does the former if you add a capture.)

    Benchmarks:

    -------------------- 20,000 Rate substr3 s/// substr4 substr3 1.61/s -- -93% -96% s/// 23.1/s 1336% -- -44% substr4 41.2/s 2460% 78% -- -------------------- 40,000 Rate substr3 s/// substr4 substr3 0.884/s -- -96% -98% s/// 23.1/s 2509% -- -45% substr4 41.6/s 4606% 80% -- -------------------- 80,000 Rate substr3 s/// substr4 substr3 0.445/s -- -98% -99% (warning: unreliable count) s/// 23.4/s 5162% -- -42% substr4 40.1/s 8897% 71% --

    The rate is the number of times 80,000 chars can be processed per second. Notice how the 4-arg substr and s/// aren't affected by the length of the input?

    Benchmark code

      I know this is supposed to be a Perl focussed forum, so other languages are generally off topic, and this is going off on a pretty tenuous tangent, but orang-utan coders would prefer to implement OOK as follows:
      Ook? Ook. Ook. Ook! Ook! Ook!
      The consequential benchmark results are left as an exercise to the (hairy orange) reader.

      --
      use JAPH;
      print JAPH::asString();

Re: Using substr to split a string scales as N^2 -- alternatives?
by BrowserUk (Patriarch) on Jan 05, 2009 at 00:34 UTC

    The best workaround I've found is to modify the string in-place by using the 4-arg substr. This effectively just adjusts a pointer and a couple of offsets, and so runs 1000x faster:

    c:\test>TimeIt \perl\bin\perl.exe -e"$a = 'a' x 800_000; substr($a, 0, 1, '') while length($a);" Kernel: 0.00000 User: 0.26563 Total: 0.26563

    Unfortunately, it isn't a good substitute for the lvalue ref problem.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.