references best practice

bende has asked for the wisdom of the Perl Monks concerning the following question:

Good Day Everyone
To be honest, I'm very new to Perl (very new to computer science at uni as well!), but I have been using Perl and absolutely love it! So please forgive the potential naivety of this question; we've all had to learn at some point.

I'm currently working my head around references, which I have the basic concept of, but I wanted to ask for your opinion just to make sure I'm on the right track. I have shown 3 pieces of code. It is very simple: I have a subroutine to generate data (stored in a library, but that's beyond the scope of my question), and then I call this subroutine when I want to output that data somewhere (CGI, text file, etc.). The subroutine populates an array of hashes, which is looped through to output. As you can see, I am handling the passing of this array in 3 different ways. In 1 I just create the data array in the subroutine and return it. In 2 I create the array first, then pass its reference to the subroutine that populates it. In 3 I create the array in the subroutine and return a reference to it. It is my guess that 2 is the most efficient, as it does not require copying the array as in 1, and will guarantee that the subroutine (and any variables it used) is removed from memory as soon as the sub has finished executing.

my example code below

1: Simplest form, returns the @data;

my @data = prepData();

for my $item (@data) {
    #format and output
}

sub prepData {
    my @data;
    #create array of hashes
    return @data;
}

2: pass the sub the reference to the array

my @data;
prepData(\@data);
for my $item (@data) {
    #format and output
}
sub prepData {
   my $refData = shift;
   #do data generation;
}

3: return the reference to the array

my $refData = prepData();
for my $item (@$refData) {
   #format and output
}
sub prepData {
    my @data;
    #create array of hashes
    return \@data;
}

Many thanks for your help
Ben


Re: references best practice by kyle (Abbot) on Apr 25, 2008 at 12:23 UTC
I'd write it the first (simple) way or the third way, not the second.

First way (`my @stuff = stuff_generator()`): This is the way I'd write it if I'm sure that I'll never need to pass out any "out of band" data. If it's always going to be just the array, guaranteed, this is straightforward and easy to understand.

Second way (`stuff_inserter(\@stuff)`): I generally want a sub not to modify the things I pass into it, so I'll avoid that kind of solution if there's another one handy. It does sometimes make sense to do this, but I don't think it does in this situation.

Third way (`my $stuff_ref = stuff_generator()`): I might do this if the resulting array is going to be passed around later as a reference anyway. Rather than write `\@stuff` some places and `$stuff_ref` other places, I can just keep it as a reference all the time. I've heard some folks say that they like to use references all the time because the sigils are confusing, or because it offers some kind of consistency. I'm not in that camp, but it's something else to think about.

Bear in mind that if your array is huge, there could be a performance difference between copying it and just passing a reference around. I suggest you write it the way that's clearest until a profiler tells you that it's a problem.
Re^2: references best practice by amarquis (Curate) on Apr 25, 2008 at 12:44 UTC
I agree. I think the second option is also poorly suited to new programmers, as it can cause hard-to-track-down action-at-a-distance problems. I remember many a trip to the good ol' debugger when I was learning to program to find out why my variables were changing state and where.
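To make the action-at-a-distance point concrete, here is a minimal sketch. The sub names `fill_in_place` and `make_data` and the `@orders` array are hypothetical, invented only for this illustration. It shows how a sub that writes through a passed-in reference can change the caller's array without any assignment appearing at the call site.

```perl
use strict;
use warnings;

# Style 2 (hypothetical name): writes into whatever array the caller hands over.
sub fill_in_place {
    my ($ref) = @_;
    @$ref = map { +{ id => $_ } } 1 .. 3;    # replaces the caller's contents
    return;
}

# Style 3 (hypothetical name): builds and returns a fresh array, touches nothing else.
sub make_data {
    my @data = map { +{ id => $_ } } 1 .. 3;
    return \@data;
}

my @orders = ( { id => 'keep-me' } );

my $fresh = make_data();                     # @orders is untouched here
print scalar(@orders), " order(s) before\n";

fill_in_place( \@orders );                   # @orders changes, yet nothing here assigns to it
print scalar(@orders), " order(s) after\n";
```

The original `keep-me` entry is gone after the second call, and the only hint at the call site is the backslash in front of `@orders`.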
Re: references best practice by hipowls (Curate) on Apr 25, 2008 at 11:33 UTC
The first form returns a list, each item of which is copied. The other two forms copy a single scalar, a reference to an array. I would expect these two to have similar performance. Suspicion, however, is no substitute for benchmarking, so I ran the following script.

use warnings;
use strict;
use Benchmark qw(cmpthese);

sub prepData1 {
    my @data = ( map { log $_ } 1 .. 100 );
    return @data;
}

sub runPrep1 {
    my @result = prepData1();
}

sub prepData2 {
    my $refData = shift;
    $refData = [ map { log $_ } 1 .. 100 ];
}

sub runPrep2 {
    my @result;
    prepData2( \@result );
}

sub prepData3 {
    my @data = ( map { log $_ } 1 .. 100 );
    return \@data;
}

sub runPrep3 {
    my $result = prepData3();
}

sub prepData4 {
    my $data = [ map { log $_ } 1 .. 100 ];
    return $data;
}

sub runPrep4 {
    my $result = prepData4();
}

cmpthese(
    -5,    # Run each function for at least 5 seconds
    {
        array_out      => \&runPrep1,
        array_ref_in   => \&runPrep2,
        array_ref_out1 => \&runPrep3,
        array_ref_out2 => \&runPrep4,
    }
);

__END__
                  Rate array_out array_ref_out1 array_ref_in array_ref_out2
array_out      11270/s        --           -30%         -43%           -44%
array_ref_out1 16137/s       43%             --         -19%           -20%
array_ref_in   19873/s       76%            23%           --            -1%
array_ref_out2 20101/s       78%            25%           1%             --

I added a fourth style since that is the form I prefer, where the data is put directly in an array reference. Once again benchmarking proves me wrong ;-) That's why I always do it rather than guess.

Update 1: I prefer my idiom because I create a single container of related values, but I can't pass that back as such, since the function returns a list of values. To retain the relationship between them I use an array reference, so that the caller of the function gets back a group of related values. If, for some reason, I need to extend the function to return a second group of data, I can pass back two references, thereby maintaining the logical grouping of data.
Update 2: As Haarg kindly pointed out, I got the second case wrong: I was creating a new anonymous array reference and assigning it to the array ref that was passed in. Updating the code and benchmarks, I get

sub prepData2 {
    my $refData = shift;
    @$refData = ( map { log $_ } 1 .. 100 );
}

__END__
                  Rate array_out array_ref_out1 array_ref_in array_ref_out2
array_out      11801/s        --           -28%         -28%           -39%
array_ref_out1 16406/s       39%             --          -1%           -15%
array_ref_in   16496/s       40%             1%           --           -14%
array_ref_out2 19238/s       63%            17%          17%             --

Looks like my original guess was correct, which just shows that if benchmarking gives an unexpected result you should check both your assumptions and your benchmarking.
Re^2: references best practice by amarquis (Curate) on Apr 25, 2008 at 18:47 UTC
Benchmarking doesn't prove you wrong; it proves that something might be slower in a given case. If I had to choose between a marginal speed increase and an idiom I am comfortable with, I'd say the latter is correct. (Until I have proof that speed is a problem and profiling tells me this is the bottleneck.)
Re^2: references best practice by Haarg (Priest) on Apr 25, 2008 at 22:51 UTC
Your second example isn't doing what you intend. The array outside of the sub isn't modified. The first line of prepData2 sets $refData to the incoming reference. The second replaces that reference with a different one, leaving the contents of the first reference unchanged. You'd want something more like:

sub prepData2 {
    my $refData = shift;
    @$refData = map { log $_ } 1 .. 100;
}

With that change, the benchmark changes somewhat, making it almost identical to array_ref_out1 in my tests.
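A small sketch of the difference described here, in case it helps to see it run. The sub names `rebind` and `assign_through` are hypothetical and not from the benchmark script. One sub overwrites its local copy of the reference, the other assigns through it.

```perl
use strict;
use warnings;

# Rebinds the sub's local copy of the reference; the caller's array is never touched.
sub rebind {
    my $ref = shift;
    $ref = [ 1, 2, 3 ];
    return;
}

# Assigns through the reference; the caller's array receives the values.
sub assign_through {
    my $ref = shift;
    @$ref = ( 1, 2, 3 );
    return;
}

my ( @a, @b );
rebind( \@a );
assign_through( \@b );

printf "rebind: %d element(s), assign_through: %d element(s)\n",
    scalar @a, scalar @b;
```

Only `@b` ends up populated, so a benchmark built on the `rebind` pattern is timing a sub that never copies anything into the caller's array.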
Re: references best practice by jethro (Monsignor) on Apr 25, 2008 at 15:04 UTC
Just as a reminder: One of the biggest sins of programming is optimizing for speed too early. One should first optimize for clarity. If and only if the performance is then lacking should one look at hot spots and other ways to speed up the program. Getting the data from the database is probably at least a hundred times slower (totally unsubstantiated guess by me) than the returning of the array. If you double the speed of the sub return, that is a speedup of only 0.5% of the sub's runtime. If that sub takes up 20% of the total runtime of the program, we are already in tenth-of-a-percent territory. Famous quote: "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil." C.A.R. Hoare
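The arithmetic above can be checked with a few lines of Perl. This is only a sketch of the estimate, not a measurement: the 1% share is an assumption standing in for the "hundred times slower" guess, the 20% figure comes from the post, and all variable names are made up for the example.

```perl
use strict;
use warnings;

my $return_share_of_sub  = 0.01;    # assumption: returning the data is ~1% of the sub's time
my $sub_share_of_program = 0.20;    # from the post: the sub is 20% of total runtime
my $speedup              = 2;       # "double the speed" of the return

# Time saved is the optimized share multiplied by the fraction of it that disappears.
my $saved_in_sub     = $return_share_of_sub * ( 1 - 1 / $speedup );
my $saved_in_program = $saved_in_sub * $sub_share_of_program;

printf "saved in sub: %.2f%%, saved overall: %.2f%%\n",
    100 * $saved_in_sub, 100 * $saved_in_program;
```

Under those assumptions the saving is 0.5% of the sub and 0.1% of the whole program, which matches the figures in the post.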
Re^2: references best practice by hsmyers (Canon) on Apr 25, 2008 at 15:50 UTC
Likewise remember that 97% of coders use this as an excuse to write sloppy code 50% of the time. --hsm "Never try to teach a pig to sing...it wastes your time and it annoys the pig."
Re^3: references best practice by roboticus (Chancellor) on Apr 25, 2008 at 18:13 UTC
...and don't forget that 72% of all statistics are just random numbers some wiseacre pulled out of a hat. ...roboticus
Re: references best practice (wantarray) by lodin (Hermit) on Apr 25, 2008 at 14:35 UTC
Regarding the first and last alternatives (returning a plain list or an array reference): there's a middle road here that you sometimes encounter. The routine can return a list in list context and an array reference in scalar context. That is, it usually looks like this:

sub foobar {
    ...
    return wantarray ? @foobar : \@foobar;
}

Let's say `foo` always returns a plain array, `bar` always returns the array reference, and `foobar` is the middle road. Then you'll have

      | foo       | bar      | foobar      | "best"
------+-----------+----------+-------------+-------------
List  | foo()     | @{bar()} | foobar()    | foo / foobar
Ref   | [ foo() ] | bar()    | foobar()    | bar / foobar
Count | foo()     | @{bar()} | @{foobar()} | foo

and in terms of memory efficiency this translates to

      | foo       | bar      | foobar
------+-----------+----------+--------
List  | copy      | copy     | copy
Ref   | copy      | no-copy  | no-copy
Count | no-copy   | no-copy  | no-copy

This sums up to `foobar` having a more convenient syntax than `bar` when memory isn't an issue, yet having the same memory efficiency. However, when dealing with large lists where you only want the count/length, `foobar` can't use `wantarray` to make the sometimes very useful optimization (both speed- and memory-wise) of not fully processing all elements but just returning the count. (For instance, your list may contain objects that then needn't be created.)

My conclusion is, as so often, that it depends, and it's almost always just a convenience decision. In the worst case you just create `foo_ref` or `bar_count`.

lodin
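Here is a runnable sketch of the `foobar` middle road, for anyone who wants to try the three call styles from the table. The body of `foobar` (doubling the numbers 1 to 5) is a placeholder assumption; only the `wantarray` line follows the pattern described above.

```perl
use strict;
use warnings;

# Middle-road sub: a list in list context, an array reference in scalar context.
# The data it builds is a stand-in for whatever the real routine would generate.
sub foobar {
    my @foobar = map { $_ * 2 } 1 .. 5;
    return wantarray ? @foobar : \@foobar;
}

my @list  = foobar();         # List:  list context, elements are copied out
my $ref   = foobar();         # Ref:   scalar context, a single reference comes back
my $count = @{ foobar() };    # Count: dereference the returned reference in scalar context

printf "list has %d items, ref is a %s, count is %d\n",
    scalar @list, ref $ref, $count;
```

One thing to watch for with this style is that the context is decided by the caller, so `my ($first) = foobar();` gets the first element, while `my $first = foobar();` gets an array reference.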
Re: references best practice by sundialsvc4 (Abbot) on Apr 28, 2008 at 21:27 UTC
I'll second the motion that you should always optimize for clarity. Having said that, I think that the best way to handle a problem like this is through references. Create a nice structure (i.e. a hash) that contains the information you want. Then, store and pass around references to it. Each object or value (of any kind) in Perl has a built-in “reference count” which is used by the built-in “garbage collector.” So you can have references to references to references and, as long as you do not create circular references, memory allocation and deallocation will always be reliably and correctly managed. This is also favorable to the interests of the operating system's virtual-memory manager, because you won't be unnecessarily copying things around... an especially important consideration when they are large “things.”
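The caveat about circular references can be observed directly. The following is only a sketch: the `Node` class and the `$destroyed` counter are hypothetical, added purely to watch when Perl frees things, and `weaken` from the core `Scalar::Util` module is one common way to break a cycle rather than something the post above prescribes.

```perl
use strict;
use warnings;
use Scalar::Util qw(weaken);

my $destroyed = 0;    # counts how many Node objects have been freed

{
    package Node;     # hypothetical class, exists only so DESTROY can report cleanup
    sub new     { return bless {}, shift }
    sub DESTROY { $destroyed++ }
}

# No cycle: the object is freed as soon as its last reference goes out of scope.
{
    my $node = Node->new;
}
my $after_plain = $destroyed;

# Cycle: each node holds a reference to the other, so neither count drops to zero.
{
    my ( $x, $y ) = ( Node->new, Node->new );
    $x->{peer} = $y;
    $y->{peer} = $x;
}
my $after_cycle = $destroyed;

# Same shape, but one link is weakened so it no longer keeps its target alive.
{
    my ( $x, $y ) = ( Node->new, Node->new );
    $x->{peer} = $y;
    $y->{peer} = $x;
    weaken( $y->{peer} );
}
my $after_weak = $destroyed;

print "freed after plain scope: $after_plain, after cycle: $after_cycle, ",
    "after weakened cycle: $after_weak\n";
```

The count does not move across the cyclic block, which is the leak the post warns about; the two nodes in that block are only cleaned up when the program exits.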

