knexus has asked for the wisdom of the Perl Monks concerning the following question:

My Goal: Use a sub to manipulate multiple large blocks of text without making copies of the original text.

What I think I know:

1. Stick with references, as scalars are passed by reference and can be accessed via $_[0], $_[1], etc.
2. Creating a scalar variable within the sub makes a copy of the scalar.

My problem: (Besides being new to perl)

When multiple scalars are passed, the code gets harder to read and understand.
So, I'd like to use named variables.

A solution: Use more references in the sub.

    sub processText($$) {
        my $subParm0ref = \$_[0];
        my $subParm1ref = \$_[1];
        # Do something ....
        print $$subParm0ref;
        print $$subParm1ref;
    }

My Questions:

Is this a good way to handle the problem?

Is there a better way or am I missing something here?

Re: Memory Use and/or efficiency passing scalars to subs
by Limbic~Region (Chancellor) on Aug 30, 2003 at 15:12 UTC
    knexus,
    It would be a good way to handle the problem, but I don't think you are doing it quite right using conventional syntax.
        my $large_scalar = 'a huge string like 64K or something';
        modify_largescalar(\$large_scalar);

        sub modify_largescalar {
            my $scalar_ref = shift;
            # Code to modify $$scalar_ref;
        }
    Your use of prototypes isn't helping (but maybe you need that for some other reason). The thing is you are not creating refs until you are inside the sub - which means you are already copying the large scalar. You are using the fact that the elements of @_ are aliases to the variables and avoiding copying by assigning new variable names to references to the aliases. I do not know if this will avoid copying, but it certainly isn't conventional syntax. I also wouldn't worry about optimization of these large scalars unless they are truly large and/or you are calling these subs many times in a tight loop.

    Hope this helps - L~R

    Update: after CombatSquirrel pointed something out, I took another look at perldoc perlsub and conferred with belg4mit in the CB.

      Limbic~Region, are you sure of that? I thought this at first too and wrote a piece of code to demonstrate it:
          sub direct(\$\$) {
              print $_[0] . " " . $_[1] . "\n";
          }
          sub indirect($$) {
              my ($one, $two) = \(@_);
              print $one . " " . $two . "\n";
          }
          my ($first, $second) = qw(Hi there);
          print \$first . " " . \$second . "\n";
          direct $first, $second;
          indirect $first, $second;
          __END__
          SCALAR(0x1823c78) SCALAR(0x224f88)
          SCALAR(0x1823c78) SCALAR(0x224f88)
          SCALAR(0x1823c78) SCALAR(0x224f88)
      But the scalar refs point to the same memory address, which is natural for the main program and the direct sub, but completely unexpected (at least for me) for the indirect sub.
      Maybe there is a different explanation for this behaviour, though.
      Cheers,
      CombatSquirrel.
      Entropy is the tendency of everything going to hell.
        CombatSquirrel,
        My understanding of aliasing was wrong, so my comments that have been struck out are probably incorrect, as you pointed out - though my syntax is more conventional. Read on if you are interested or take a look at perldoc perlsub.

        The @_ array in a sub is like what happens in a foreach loop where modifying $_ during an iteration is modifying the actual element itself. This is accomplished without copying the array (in current versions of Perl), but with some internal stuff that creates aliases.

        Something similar happens in a sub. Each element in the @_ array is an alias to the actual variable - so you can modify it and change the variable it aliases. According to belg4mit in the CB, copying doesn't actually happen until you make an assignment such as my $variable = $_[0];. The thing is knexus didn't do that - they made an assignment to a reference to the alias. I do not know if that makes a copy or not (Benchmark would be one way to find out for sure).
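One way to check the open question above without a benchmark is to compare addresses directly. This is only a minimal sketch, not code from the thread: inspect_arg and its second parameter (a reference handed in by the caller purely for comparison) are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: $_[0] is the text, $_[1] is a reference to the same
# variable, passed in by the caller only so the addresses can be compared.
sub inspect_arg {
    my $caller_ref = $_[1];
    my $alias_ref  = \$_[0];    # reference to the alias: the string is not copied
    my $copy       = $_[0];     # plain assignment: the string is copied
    return ( $alias_ref == $caller_ref, \$copy == $caller_ref );
}

my $text = 'x' x 65_536;
my ( $alias_same, $copy_same ) = inspect_arg( $text, \$text );

print "ref to alias points at caller's scalar: ", ( $alias_same ? "yes" : "no" ), "\n";
print "my-copy is the caller's scalar:         ", ( $copy_same  ? "yes" : "no" ), "\n";
```

This agrees with CombatSquirrel's output earlier in the thread: a reference taken to $_[0] carries the same address as the caller's variable, whereas a my-assignment creates a separate scalar.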

        In any case I stand by my original post for a more conventional way of doing it even if my rationale was flawed.

        Cheers - L~R

Re: Memory Use and/or efficiency passing scalars to subs
by perrin (Chancellor) on Aug 30, 2003 at 15:16 UTC
    I'm not sure I understand the problem, but why are you using prototypes and taking a reference to the parameters? I think the following is more obvious:

        my $big_text = get_big_text();
        process_text(\$big_text, \$other_big_text);

        sub process_text {
            my ($big_text_ref, $other_big_text_ref) = @_;
            # do something
            print ${$big_text_ref}, ${$other_big_text_ref};
        }

Re: Memory Use and/or efficiency passing scalars to subs
by BrowserUk (Patriarch) on Aug 31, 2003 at 05:24 UTC

    The optimum way of passing arguments to subs, especially if you are going to modify the arguments, is always to use the by-reference aliases that perl provides by default, rather than copying the arguments to named locals or by passing references to the arguments.

    The significance of the difference however, depends very much on what you are doing within the sub.

    If the sub is doing anything substantial, then using named references, whether these are done explicitly by the caller, implicitly by perl (through prototypes) or explicitly within the called sub all run a very close second to using the aliases perl gives you, with the minor differences between the methods of generating the references being completely insignificant and within the bounds of 'experimental error', sometimes switching places between runs of the same benchmark.

    Obviously, copying the arguments, once on the way in and once on the way out, is always the least optimal. (Is that pessimal?)

    Large scalars

                  Rate copied  proto caller direct called  NAMED
        copied   105/s     --   -52%   -62%   -62%   -62%   -62%
        proto    219/s   109%     --   -20%   -21%   -21%   -21%
        caller   274/s   162%    25%     --    -1%    -1%    -2%
        direct   277/s   164%    26%     1%     --    -0%    -1%
        called   277/s   164%    26%     1%     0%     --    -1%
        NAMED    279/s   166%    27%     2%     1%     1%     --

    In the above results on large scalars, you can see that copying (sub x{ my( $a1, $a2 ) = @_; ... }) is much slower, as you would expect.

    Using references or aliases is all much of a muchness, except when they are generated using a prototype, which for some reason is significantly slower. The only explanation I can think of for this is that the interpreter has to look up the prototype and that becomes significant--but I'll admit that it doesn't make much sense and is a guess anyway.

    My conclusion

    If your scalars are large, and by implication (though it wouldn't always be true) the amount of processing within the sub is substantial, then using references or aliases makes little difference.

    Small scalars

    However, if the sub is a convenience sub, used to clarify the calling code by naming a fairly simple operation that is performed in many places, or a method used to maintain OO-integrity of abstraction by indirecting access to the underlying data structures (think getters and setters), then the overhead of taking and naming references can become significant. In this case, using the aliases perl provides rather than taking and naming your own references is more optimal to a point that it can become worthwhile.

                  Rate copied called caller  proto  NAMED direct
        copied 15741/s     --   -54%   -57%   -66%   -74%   -74%
        called 34560/s   120%     --    -5%   -25%   -43%   -44%
        caller 36557/s   132%     6%     --   -21%   -39%   -41%
        proto  46058/s   193%    33%    26%     --   -24%   -25%
        NAMED  60268/s   283%    74%    65%    31%     --    -2%
        direct 61563/s   291%    78%    68%    34%     2%     --

    In these results, even though the scalars are small (40/10 chars, 1st/2nd arg), avoiding the copying is still beneficial if you are calling the sub many times. However, avoiding taking references by using the aliases that perl provides, can substantially increase that benefit, as can be seen by the last two results versus the second and third.

    Quite why using a prototype on small scalars would come out to be so beneficial relative to using a prototype on large scalars above, I am at a loss to explain. This probably indicates a flaw in the benchmark, but I've spent an inordinate amount of time trying to track it down and can't. So, I've washed my face in preparation for the egg I'm going to be wearing:)

    My conclusion

    Using the aliases is worthwhile for low-impact/high-use subs and methods.

    Arguments against

    As far as I am aware, the only arguments against using the aliases are:

    • Aesthetics.
    • Readability.
    • Maintainability.

    IMO, these are effectively the same argument.

    One of my criteria for judging source code to be 'aesthetic' is the ease of reading it. By this, I mean slightly more than just perceiving the symbols; it's more to do with being able to quickly grasp the intent of the code. If this is true, then the code is readable and maintainable, and therefore aesthetically pleasing.

    To this end, I've included NAMED_args() in the benchmark. It probably should have been called NAMED_direct() or NAMED_alias() but that threw the presentation of the benchmark out.

    Basically, this is using the aliases in $_[0], $_[1] etc., but using constants to give them meaningful names.

        use constant { STRING => 0, NUMS => 1, };

        sub foo {
            $_[STRING] =~ tr[...][...] if $_[NUMS] == '123';
        }

    Having your cake and eating it

    I contended a while ago (though few agreed with me:), that this is useful to get readability and maintainability whilst retaining the performance of using the aliases. In effect, this is akin to, and achieves some of, what Perl 6 achieves with the binding operator (:=), i.e. the naming of the aliases. It gives the benefit of working with named entities rather than numerically referenced, anonymous ones, while retaining the (performance) benefits of aliasing.

    The aliasing happens anyway, all this does is give you a way of making best use of it without descending into the nightmare of unmaintainable code.
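For anyone who wants to try the constant-index idea, here is a small runnable sketch. It is not BrowserUk's benchmark code: the names TEXT, KEY and rot13_if_key are invented for illustration, and the rot13 transliteration stands in for the elided tr[...][...] above.

```perl
use strict;
use warnings;

# Assumed names for the argument positions (not from the original post).
use constant { TEXT => 0, KEY => 1 };

# Rot13-encodes the first argument in place when the second equals '123'.
# $_[TEXT] is an alias to the caller's variable, so nothing is copied.
sub rot13_if_key {
    $_[TEXT] =~ tr/A-Za-z/N-ZA-Mn-za-m/ if $_[KEY] eq '123';
    return;
}

my $message = 'Hello, Monks';
rot13_if_key( $message, '123' );
print "$message\n";    # prints "Uryyb, Zbaxf"
```

Because the constants appear inside an array subscript, they are evaluated as expressions, so $_[TEXT] is simply $_[0] under a readable name.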

    Of course, it doesn't address the issue of positionality, but Perl 6 is coming and we'll have to wait for the icing:).

    Full benchmark

    Notes

    • tr// and rot13.

      I used the rot13 thing because its reversibility allowed me to re-use the same test arguments for all cases, and because its runtime cost is almost entirely proportional to the length of the arguments.

      This allows the non-overhead costs of each sub to be identical and almost completely linear with the size of the arguments.

    • Prototypes.

      My comments regarding the apparent variation in the cost of using a prototype to dereference the arguments are probably worthless. I do not have an explanation for the apparent non-linearity of this. The comments are left in to see what (if any) alternative explanations they might prompt.

    • Strings and numbers (as a string).

      I appreciate that using and differentiating between the two parameters on the basis that one consists of alpha characters and one is numeric is completely spurious. They serve only to provide the benchmark with multiple arguments, as in the original question, and nothing more.

    • Titles of testcases.

      Anyone noting that the titles shown in the benchmark results are different from those in the benchmark itself should know that the change took place in my editor as I prepared this post, because PM's code wrap 'feature' munged the results to the point where they were unreadable. The numbers are as they were generated on my PC.

      Of all the (mostly minor) irritations of using PM, the overzealous and arbitrary wrapping of code blocks is the one I most love to hate. Maybe that would make a good subject for a poll!
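The full benchmark code is not reproduced in this copy of the thread. The following is only a rough sketch of how a comparison along the lines described in the notes could be set up with the core Benchmark module; it is not BrowserUk's code. The case names are borrowed from the result tables, and reading 'called' as "reference taken inside the called sub" is an assumption, as are the test string and the iteration count.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Assumed test data: roughly 50K of text.
my $text = 'The quick brown fox ' x 2_500;

# Copies the argument on the way in and returns the modified copy.
sub copied {
    my ($s) = @_;
    $s =~ tr/A-Za-z/N-ZA-Mn-za-m/;
    return $s;
}

# Works directly on the alias in $_[0]; no copy is made.
sub direct {
    $_[0] =~ tr/A-Za-z/N-ZA-Mn-za-m/;
    return;
}

# Takes a named reference to the alias inside the sub.
sub called {
    my $ref = \$_[0];
    $$ref =~ tr/A-Za-z/N-ZA-Mn-za-m/;
    return;
}

# Run each case a fixed number of times and print a comparison table.
cmpthese( 5_000, {
    copied => sub { $text = copied($text) },
    direct => sub { direct($text) },
    called => sub { called($text) },
} );
```

Rot13 keeps the string length constant, so every case does the same amount of real work per call and the differences come from argument handling alone. Absolute numbers will vary by machine and Perl version.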


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
    If I understand your problem, I can solve it! Of course, the same can be said for you.

      Wow, thanks for the detailed response. Being new to perl it will take me a while to fully digest it.

      However, if I understand it correctly using $_[0] certainly can't hurt. I think I like the approach of using a constant to help with naming.

      I suppose I will eventually have some style in perl. Thanks

Re: Memory Use and/or efficiency passing scalars to subs
by liz (Monsignor) on Aug 30, 2003 at 19:57 UTC
    Not related to performance, but rather at ease of use during programming. When I have a subroutine that does something to the values it is passed, I have it run different code paths depending on how the subroutine is called. For example:
        sub foo {
            if (defined wantarray) {   # we're in scalar or list context
                my @param = @_;        # make a copy of the parameters
                foo( @param );         # recursive call for ease of maintenance
                return @param;         # return the result
            }

            # we only get here if in void context
            # do what you need to do on @_ directly, only thing to maintain
        } #foo
    This allows you to use the subroutine in two ways. One directly modifying the parameters passed:
    foo( @data );
    and one returning the changed parameters, keeping the old:
    @newdata = foo( @olddata );
    This gives me a lot of flexibility: on the one hand the direct changing which is CPU and memory efficient (no need to copy anything) and the copying way, which also comes in handy if you're passing constant (non left-value) parameters, like so:
    @data = foo( qw(foo bar baz) );
    If you're really concerned about performance, you could remove the recursive call, but that would leave you with two identical code paths to maintain, which is always a bad thing.
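A concrete instance of the pattern above might look like the sketch below. The sub name upcase_all and the upper-casing operation are placeholders chosen for illustration, not part of Liz's post.

```perl
use strict;
use warnings;

# Hypothetical example: upper-cases its arguments.
sub upcase_all {
    if ( defined wantarray ) {    # scalar or list context: the caller wants a result
        my @copy = @_;            # work on a copy, leave the originals alone
        upcase_all(@copy);        # void-context call modifies @copy in place
        return @copy;
    }

    # Void context: @_ aliases the caller's variables, so change them directly.
    $_ = uc for @_;
    return;
}

my @data = qw(foo bar);
upcase_all(@data);                      # modifies @data in place

my @new = upcase_all( qw(baz qux) );    # constants are fine here: they are copied first

print "@data @new\n";    # prints "FOO BAR BAZ QUX"
```

The recursive call sits in the middle of the if block as a plain statement, so it runs in void context and takes the in-place path.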

    Liz

      Thanks for the tips, I am sure I can make use of them. I need to learn more about "context" in perl. Time to do some reading I suppose.

      Although I am new to Perl, I have coded in ASM, C/C++ for too many years when combined. So, I sometimes get hung up on things when working in a new language trying to relate things to previous experiences, which can be a good or bad thing.

      I appreciate the "ease of use" info because I am moving from writing fairly simple scripts to more involved ones and I want them to be easy to maintain and understand.

      Thanks

Re: Memory Use and/or efficiency passing scalars to subs
by TomDLux (Vicar) on Aug 30, 2003 at 16:29 UTC
    • How large are the large blocks of text?
    • How many is multiple?
    • How does that compare to the memory quota for a single process?

    If you're talking about half a dozen or a dozen 64K chunks, that counts as small stuff, nowadays. A friend got a new laptop with the memory upgrade, so he has 1GB of RAM... ten years ago, that was the hard drive for the four of us multi-tasking on a SparcStation.

    My first question when reading your message is to wonder why you have such large scalars?

    --
    TTTATCGGTCGTTATATAGATGTTTGCA

      I am processing html and xml retrieved via LWP::Simple's get(), which returns the link contents to me in a scalar. I then process the $html/$xml, checking for changes and/or making changes as required.

      So, the scalars are not HUGE by today's standards. They are in the 40-64K range on average, with some up to several hundred KB. The problem is more about the fact that I am working through 100s to 1000s of them.

      Anyhow, one process I ran took over 13 hours to complete, which is hard to live with. So, I am looking to speed things up.

      One of the things I went looking for (among others) was to see if I was making unnecessary copies of data. Being new to perl I was not sure how arguments were passed to subroutines i.e. by value or by reference (aka ptrs).

      I found a statement in a book on Perl that says "When you pass scalars to subroutines they are passed by reference,... which acts like the address of the scalar.". The book also says that arrays etc. are copied into @_.

      Hummm, I thought, I need to look into what's going on here. Which is in part what prompted my questions. Thanks in advance for any insights.

        So, I am looking to speed things up.

        You are almost certainly barking up the wrong tree.

        You seem to be assuming that passing/copying large scalars makes much difference to runtime. Memory use sure. Runtime - not really, only if you get into swap.

        I will almost guarantee you that 99% of your runtime is spent in LWP - getting (waiting for) the data.

        I would suggest Benchmarking before you try to optimise an area that probably has nothing at all to do with your speed issue.

        Assuming that I am right, the easiest practical solution is to split your code into GET and MUNGE units - this also makes the Benchmark a breeze. Anyway, you will typically want to run 10-100 parallel LWP agents to pull down data as fast as your bandwidth/the target servers will deliver it. LWP::Parallel::UserAgent will probably be a lot more useful than LWP::Simple. Note: don't accidentally implement a DoS attack on your target servers. First, it is not nice. Second, some firewall implementations will lock your Agent/IP out if you hit it too hard.

        cheers

        tachyon

        s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

Re: Memory Use and/or efficiency passing scalars to subs
by Elian (Parson) on Sep 01, 2003 at 16:56 UTC
    If you really, really, really want named aliases without making copies or playing games with modules, then do so. Make your sub look like:
        sub foo {
            foreach my $arg1 ($_[0]) {
                foreach my $arg2 ($_[1]) {
                    foreach my $arg3 ($_[2]) {
                        # insert your code here which uses arg1, arg2, and arg3
                    }
                }
            }
        }

    Not, mind, that I'd recommend, nor even suggest, that, but...
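To see that the foreach trick really does give named aliases rather than copies, here is a small sketch with two arguments. The sub name append_suffix and the sample strings are invented for illustration.

```perl
use strict;
use warnings;

# Appends the second argument to the first, in place, via named aliases.
sub append_suffix {
    for my $text ( $_[0] ) {          # $text is an alias to the caller's variable
        for my $suffix ( $_[1] ) {    # $suffix is an alias to the second argument
            $text .= $suffix;
        }
    }
    return;
}

my $page = '<html>';
append_suffix( $page, '</html>' );
print "$page\n";    # prints "<html></html>"
```

Each loop body runs exactly once, so the nesting costs readability rather than work, which is presumably why it comes with a warning label.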
Re: Memory Use and/or efficiency passing scalars to subs
by dragonchild (Archbishop) on Sep 02, 2003 at 15:55 UTC
    Another option could be:
        # The following sub uses the following aliases:
        # I: [0] - $self
        # I: [1] - value
        # C: [2] - some temp space I use to calculate
        sub {
        }

    When did comments become passé? Self-commenting code is nice, but if you're doing black magic for performance reasons, you will need to comment outside the code.

    ------
    We are the carpenters and bricklayers of the Information Age.

    The idea is a little like C++ templates, except not quite so brain-meltingly complicated. -- TheDamian, Exegesis 6

    Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.