Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Venerable monks

I was hoping you might be able to explain the following to me. It is taken from a website, and it shows two ways of doing the same thing; one way is supposed to be 'better' than the other, and I don't understand the difference.

# get_all_Genes returns a list reference;
# each item in the list is a gene reference.

# Iterate through all of the genes on a clone
foreach my $gene ( @{ $first_clone->get_all_Genes() } ) {
    print $gene->stable_id(), "\n";
}

# More memory efficient way of doing the same thing
my $genes = $first_clone->get_all_Genes();
while ( my $gene = shift @{$genes} ) {
    print $gene->stable_id(), "\n";
}
i don't understand why the while/shift is more efficient on memory. Here is the website's explanation
Some of the data that makes up the objects returned from the Ensembl API is lazy loaded. By using lazy loading, we are able to minimize the number of database queries and only "fill in" the data in the object that the program actually asked for. This makes the code faster and its memory footprint smaller, but it also means that the more data that the program requests from an object, the larger it becomes. The consequence of this is that looping over a large number of these objects in some cases might grow the memory footprint of the program considerably.

By using a while-shift loop rather than a foreach loop, the growth of the memory footprint due to lazy loading of data is more likely to stay small. This is why the comment on the last loop above says that it is a "more memory efficient way", and this is also why we use this convention for most similar loop constructs in the remainder of this API tutorial.

NB: This strategy obviously won't work if the contents of the list being iterated over is needed at some later point after the end of the loop.
I thought that both the foreach loop and the while/shift loop would get a gene reference from the gene list, and that the data in the gene would be populated as needed by the get methods on the gene object. So how is the while/shift loop better on memory than the foreach, when both are just getting gene references off the list and not requesting any data from the gene objects?

thanks a lot

Replies are listed 'Best First'.
Re: lazy loading and memory usage
by BrowserUk (Patriarch) on Dec 19, 2010 at 00:29 UTC

    That suggests that when get_all_Genes() is called, the values in the returned array reference are handles to as-yet-unpopulated objects. That is, the anonymous array returned is filled with handles to objects that are, at the point of return, empty. They do not get populated until you make the first method call upon them.

    Therefore, if you iterate that array in a for loop, each one gets populated when you call its stable_id() method. So by the end of the for loop, all the genes will have been populated, and as their object handles are still held in the array, all the memory required by all of them will still be in use.

    Conversely, in the while loop, each gene is again populated by the call to the stable_id() method, but because the object handle was shifted off the array, when the loop iterates, that object handle will go out of scope, thereby allowing it and all the memory required to hold its contents to be released.

    With the while-shift method, only one gene from the array is ever populated at any given time, so the total memory usage is reduced.
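    The difference can be demonstrated with a toy class that counts how many of its objects are alive. This is a minimal sketch; the Gene class here is a stand-in of my own invention, not the Ensembl API:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Toy stand-in for a lazily populated gene object (not the real Ensembl API).
package Gene;
my $live = 0;                           # how many Gene objects currently exist
sub new       { $live++; return bless { id => $_[1] }, $_[0] }
sub stable_id { return $_[0]->{id} }
sub DESTROY   { $live-- }
sub live      { return $live }

package main;

# foreach: the array keeps a reference to every gene, so by the end of
# the loop all of them are still alive.
my $genes = [ map { Gene->new("ENSG$_") } 1 .. 5 ];
foreach my $gene (@$genes) { $gene->stable_id() }
my $alive_after_foreach = Gene::live();    # all 5 still referenced by @$genes

# while/shift: each gene's last reference disappears when it is shifted
# off the array and $gene is reassigned, so it is freed during the loop.
$genes = [ map { Gene->new("ENSG$_") } 1 .. 5 ];
while ( my $gene = shift @$genes ) { $gene->stable_id() }
my $alive_after_shift = Gene::live();      # none left

print "alive after foreach:     $alive_after_foreach\n";
print "alive after while/shift: $alive_after_shift\n";
```

    Here the objects themselves are tiny; in the Ensembl case the point is that each lazily loaded payload is released as soon as its gene is shifted off, instead of accumulating until the loop ends.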

    You could achieve the same thing--arguably more clearly--by using undef in conjunction with the for loop:

    # Iterate through all of the genes on a clone
    foreach my $gene ( @{ $first_clone->get_all_Genes() } ) {
        print $gene->stable_id(), "\n";
        undef $gene;   ## Free the gene object and the memory it uses.
    }

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Nice Post. Would the following achieve the same thing without needing "undef"?
      # Iterate through all of the genes on a clone
      foreach my $gene ( @{ $first_clone->get_all_Genes() } ) {
          my $temp_obj_handle = $gene;
          print $temp_obj_handle->stable_id(), "\n";
      }
      Update: Again kudos to BrowserUk.

      Although $temp_obj_handle is recycled and does go out of scope on each iteration, it points to the same object as $gene, and the array element that $gene aliases never goes out of scope. The object's reference count is therefore never decremented. End result: the above code does not save memory within the foreach() loop, although the code from BrowserUk does.
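      A short sketch makes the distinction concrete. The Gene class below is a toy instance counter of my own, not the Ensembl API; it shows that copying the alias frees nothing, while undef'ing the alias clears the reference held inside the array itself:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Toy object that counts live instances (a stand-in, not the Ensembl API).
package Gene;
my $live = 0;
sub new     { $live++; return bless {}, $_[0] }
sub DESTROY { $live-- }
sub live    { return $live }

package main;

# Copying the loop alias into a lexical frees nothing: the array elements
# still reference every object when the loop is done.
my $genes = [ map { Gene->new() } 1 .. 3 ];
foreach my $gene (@$genes) {
    my $temp_obj_handle = $gene;    # just a second reference, dropped each pass
}
my $after_copy = Gene::live();      # all 3 still alive

# undef $gene works because $gene is an *alias* for the array element,
# so undef'ing it clears the reference stored in the array itself.
$genes = [ map { Gene->new() } 1 .. 3 ];
foreach my $gene (@$genes) {
    undef $gene;
}
my $after_undef = Gene::live();     # all freed

print "after temp copy: $after_copy\n";
print "after undef:     $after_undef\n";
```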

      At the end of the day, does this memory optimization within a for() loop matter at all? I think that it usually does not.

      Perl is excellent at recycling memory that it has used before. A typical Perl program reaches a maximum memory usage and then just stays there (provided of course that you don't have memory leaks :-)). There are no "garbage collection" calls like in Java or C#. In short, this is fine:

      foreach my $gene ( @{ $first_clone->get_all_Genes() } ) {
          print $gene->stable_id(), "\n";
      }
      I would not worry until there are thousands of new objects being created. A few hundred? => no.

      Update again:

      Well, as with many things in programming, judgment is required. 5 objects might very well consume 1,483 MB of memory. I have no idea how much memory a particular $gene->stable_id() will consume... it might be a lot. On the other hand, it might not be much. This is very application specific. I think this thread has pointed out how the memory allocation works, and the OP can decide what to do in a particular situation. I personally would use the simplest loop unless there is a reason not to. In other words, make things more complicated only when it is necessary to do so; "necessary" is application specific.

        No. Because $temp_obj_handle and $gene both point to the same thing. Once you've populated the object one points at ...

        And as one reference remains in the anonymous array, once the object is populated, it remains populated until

        1. all references to it go out of scope.

          Which is never, as the reference to the anon. array is at the package level.

        2. Or, it is explicitly freed. Eg. undef'd.

        At the end of the day, does this memory optimization within a for() loop matter at all? I think that it usually does not.

        ...

        I would not worry until there are thousands of new objects being created. A few hundred? => no.

        Discussion of the number of genes, and whether hundreds or thousands constitutes a number worth worrying about, is premature and assumptive until you know how big each gene is!

        Given that:

        1. individual genes can be millions of characters in length--and that's when stored in raw string form without any structuring or associated meta data.
        2. And that the warning the OP is asking about comes from the authors of the module in question, who presumably know far more about its internals than we do.

        In general, worrying about such an optimisation might be unnecessary, but this warning is not a general warning, but rather a very specific warning about a very particular library, from the people that wrote that library and who therefore are best placed to know.

        I think it would be best to heed such warnings.


Re: lazy loading and memory usage
by Marshall (Canon) on Dec 19, 2010 at 00:21 UTC
    This claim doesn't sound right to me.

    foreach my $gene ( @{ $first_clone->get_all_Genes() } ) {
        print $gene->stable_id(), "\n";
    }
    $first_clone->get_all_Genes() returns a reference to an array. This is de-referenced to make a list of all elements (this takes some memory), then the foreach iterates over that list of elements.

    This is the same as:

    my $gene_ref = $first_clone->get_all_Genes();
    foreach my $gene (@$gene_ref) {  # or (@{$gene_ref}), the same;
                                     # the extra {} is only needed when a subscript is used
        print $gene->stable_id(), "\n";
    }
    The second code:
    my $genes = $first_clone->get_all_Genes();
    while ( my $gene = shift @{$genes} ) {
        print $gene->stable_id(), "\n";
    }
    This code is going to create the list @$genes just like the first code, but uses a different iteration construct than foreach. I would expect this formulation to use the same amount of memory as the first code (i.e. that a complete dereferenced list of @$genes is built as a preliminary step). The claim appears to be that somehow this does not create a list of @$genes. I would expect the memory usage of the second version to be the same, and it probably runs a bit slower than the foreach() iterator would. It is certainly more obtuse as a coding style.
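    One observable difference between the two loops, independent of any lazy loading, is worth noting: shift consumes the array as it goes (the tutorial's "NB" warns about exactly this), whereas foreach over a dereferenced array aliases its elements without disturbing them. A small sketch with plain strings:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# shift empties the array; foreach leaves it intact.
my $genes = [qw( ENSG01 ENSG02 ENSG03 )];

foreach my $gene (@$genes) { }            # aliases elements; array untouched
my $left_after_foreach = scalar @$genes;  # still 3 elements

while ( my $gene = shift @$genes ) { }    # consumes the array element by element
my $left_after_shift = scalar @$genes;    # nothing left

print "after foreach:     $left_after_foreach\n";
print "after while/shift: $left_after_shift\n";
```

    A minor caveat of the while/shift formulation: the loop also stops early if any element is false (e.g. 0 or the empty string), though that cannot happen with a list of object references.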

    If this is not the case, then I would also like to hear about it. But on the surface, this claim appears to be incorrect.

    Update: I didn't consider that some memory-consuming operation would happen with $gene->stable_id(). When I saw the print, I was just thinking that it printed some existing value. Kudos to BrowserUk.