Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello great monks!

I have a design problem, which ought to be simple but is confounding me. I have a package containing a variety of functions to help me deal with my data, which is in large arrays. I am having trouble abstracting a change to my code. Currently I have:

sub processA {
    # common stuff
    # special stuff
}

sub processB {
    for (1..1000) {
        # common stuff
        # special stuff
    }
}

I want to make this into something more modularized for a variety of reasons, so I was considering something like this:

sub processA {
    common_stuff($_);
    # special stuff
}

sub processB {
    for (1..1000) {
        common_stuff($_);
        # special stuff
    }
}

sub common_stuff {
    # common stuff
}

Pretty straightforward, I guess. The problem is that as it stands now, the repeated code section (#common stuff) constructs a large array, then manipulates it. With my proposed modification, I would be creating this array for every iteration of the loop in processB. That isn't acceptable from a performance standpoint. So, I want something that may not be possible (or that may be, and I just don't know the term for it!):

Basically I want some way to create a large array in a subroutine, keep it around as long as I need, and then to get rid of it. And preferably I'd still want to keep this all in a single package.

Am I dreaming, or misguided, or naive, or lost? :)

Replies are listed 'Best First'.
Re: Design Question
by sgifford (Prior) on Aug 06, 2003 at 21:38 UTC
    Basically you want to create a temporary data structure and cache it. You should be able to do that by just creating a package global variable containing the large array. Something like:
    use vars qw($big_array);

    sub get_big_array {
        if (!$big_array) {
            $big_array = [ ... construct big array... ];
        }
        return $big_array;
    }

    sub clear_big_array {
        undef $big_array;
    }

    sub processA {
        common_stuff($_);
        # special stuff
    }

    sub processB {
        for (1..1000) {
            common_stuff($_);
            # special stuff
        }
    }

    sub common_stuff {
        my $ba = get_big_array();
        ...
    }
    Knowing when to call clear_big_array is harder.
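
    The caching idea above can be sketched as a small runnable script. This is only an illustration under assumptions: the array contents, the $build_count counter, and the choice to clear the cache at the end of processB are made up for the demo, and a file-scoped lexical is used in place of use vars.

```perl
use strict;
use warnings;

my $big_array;          # file-scoped cache; undef until first use
my $build_count = 0;    # hypothetical counter, only to show how often we build

sub get_big_array {
    # Build on the first call only; later calls reuse the cached reference.
    unless ($big_array) {
        $build_count++;
        $big_array = [ map { $_ * 2 } 1 .. 10_000 ];   # stand-in for the real construction
    }
    return $big_array;
}

sub clear_big_array {
    undef $big_array;   # drop the reference so Perl can reuse the memory
}

sub common_stuff {
    my ($i) = @_;
    my $ba = get_big_array();
    return $ba->[$i];
}

sub processB {
    my $sum = 0;
    $sum += common_stuff($_) for 0 .. 999;
    clear_big_array();  # assumed policy: the caller that owns the loop clears the cache
    return $sum;
}

my $total = processB();
print "total=$total, builds=$build_count\n";   # total=1001000, builds=1
```

    The array is built once for the whole loop, and the one sub that owns the loop decides when to throw it away.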

      Hi, thanks for your help!

      Knowing when to call clear_big_array is harder.


      So if I take this approach, I will need to have a "convention" that all subroutines in this package call clear_big_array when they are done with this array, or before they start using it. I can see that being a bit harder to maintain -- are there any other disadvantages to that?

        I can't think of any other disadvantages. Even if you never call clear_big_array, it might not matter; the memory will just stay allocated. Generally the program won't return memory to the OS until it exits, and if the OS needs the memory for something else it will just page it out.

        If you give more details about exactly what you're doing, you might be able to get more specific advice.

Re: Design Question
by TomDLux (Vicar) on Aug 06, 2003 at 23:57 UTC

    You could create the array in processA() / processB() and pass a reference to it to commonStuff() as a parameter:

    sub processAorB {    # stands for processA() or processB()
        my @commonArray;
        commonStuff( \@commonArray, \%other, \$variables );
        ...
    }

    sub commonStuff {
        my ( $commonArrayRef, $hash, $var ) = @_;
        if ( $commonArrayRef->[0] > 37 ) {
            ...
        }
    }
    As you can see, accessing the $commonArrayRef contents is just like handling the array itself, except for sticking an arrow in the middle. There's only one array, and only a reference to it is passed around rather than a copy, so function calls are faster, and any changes made to the array within commonStuff() are visible where it is called.
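
    A minimal runnable sketch of the reference-passing approach just described. The sub bodies and the sizes are hypothetical stand-ins; the point is only that the callee modifies the caller's array in place and that the array disappears when the caller returns.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the "common stuff": fills and modifies the
# caller's array through the reference, without copying it.
sub common_stuff {
    my ($array_ref, $n) = @_;
    @$array_ref = (1 .. $n) unless @$array_ref;   # build only if still empty
    $array_ref->[0]++;                            # change is visible to the caller
    return scalar @$array_ref;
}

sub processB {
    my @common_array;                             # lives only as long as this call
    common_stuff(\@common_array, 5) for 1 .. 3;   # built once, reused twice
    return @common_array;                         # the lexical is released after return
}

my @result = processB();
print "@result\n";   # 4 2 3 4 5
```

    Because the array is a lexical inside processB(), there is no global to clean up afterwards.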

    If you don't need the array outside commonStuff(), you can use a closure. The one trick would be detecting when you go from processA() to processB() ... I'm assuming the array should be re-initialized at that point. Hopefully, there is some way of detecting within commonStuff() when the time has come to re-initialize. Alternately, you could provide a second closure method for re-initializing the closure, and invoke that at appropriate times from processA() / processB().

    sub processA {
        ...
        commonInit();
        commonStuff( $various, $vars );
    }

    sub processB {
        ...
        commonInit();
        commonStuff( $various, $other, $vars );
    }

    {
        my @commonArray;

        sub commonInit {
            @commonArray = ();
        }

        sub commonStuff {
            # do stuff to @commonArray;
        }
    }

    This way, you can preserve your data without polluting the entire environment. You might also split the common stuff into several routines.
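
    For what it's worth, here is a runnable sketch of the closure version. The squares stored in the array, the build counter, and the build_count() accessor are assumptions added purely so the behaviour can be observed.

```perl
use strict;
use warnings;

{
    my @common_array;      # visible only to the subs inside this block
    my $builds = 0;        # hypothetical counter for demonstration

    sub common_init {
        @common_array = ();            # forget the old contents
    }

    sub common_stuff {
        my ($i) = @_;
        unless (@common_array) {       # rebuild only after an init
            $builds++;
            @common_array = map { $_ ** 2 } 0 .. 99;
        }
        return $common_array[$i];
    }

    sub build_count { return $builds }
}

sub processB {
    common_init();                     # start with a fresh array for this run
    my $sum = 0;
    $sum += common_stuff($_) for 0 .. 9;
    return $sum;
}

my $first  = processB();
my $second = processB();
print "$first $second ", build_count(), "\n";   # 285 285 2
```

    Each call to processB() rebuilds the array exactly once, no matter how many times the loop calls common_stuff().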

    --
    TTTATCGGTCGTTATATAGATGTTTGCA

      This was actually pretty quick to implement, so I've tested it out and it works great. I do need the array outside of common_stuff so I didn't use a closure. This approach seems to get exactly what I want without the necessity to clean up a global. Are there any negatives with doing things this way that I should be aware of? The main problem I saw is that it required a change to the argument lists for some functions.

        Beginner/Intermediate programmers are sometimes intimidated by an array reference or hash reference. The way I see it, they think (or someone tells them) that references are complicated, so they get frightened and do things wrong.

        Since you know that references are actually quite easy, you should be fine. It takes a slight bit of effort to keep track of when a variable is actually a reference to a hash, so you can use it properly, especially when dealing with something several layers down a data structure.

        But testing as you write code (Test::Simple and Test::More), and use of the Perl debugger and Data::Dumper when something doesn't seem right, will get you lots of working code in no time.

        --
        TTTATCGGTCGTTATATAGATGTTTGCA

Re: Design Question
by cfreak (Chaplain) on Aug 06, 2003 at 21:39 UTC

    Well you could create the array first as a global variable (to the package) then just manipulate it. Something like this:

    my @array = ();

    sub processA {
        push(@array, $some_data);
        common_stuff();
        # clear the array?
    }

    sub processB {
        for (1 .. 1000) {
            push(@array, $some_data);
        }
        common_stuff();
        # clear the array?
    }

    sub common_stuff {
        foreach (@array) {
            # do stuff ...
        }
    }

    With that you should always have just one array that gets no bigger than the largest amount of stuff you put in it. Of course it also depends on what you're doing with the data and whether or not you clear your array.

    Lobster Aliens Are attacking the world!

      I'm partially through chapter 2 of PP, so I thought that I'd add something to this:

      Since your common array is going to have 1000 values in it, you may want to define your 1000th value before you start adding stuff to the array, in order to speed up execution time.

      my @arr = ();
      $arr[999] = '';
      # or perhaps: $#arr = 999;
      # ... insert code suggested by other perlmonks here

      I'm not sure if 1000 entries on an array is enough to get a speed boost from predefining its length, but it likely would...
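
      One caveat worth sketching: pre-extending changes the array's length, so the push-based code suggested above would append after the pre-extended slots rather than fill them. The snippet below only illustrates that; the variable names are made up, and it says nothing about whether the speed-up is measurable.

```perl
use strict;
use warnings;

my @arr;
$#arr = 999;                      # pre-extend to 1000 elements, all undef
my $len_before = scalar @arr;     # 1000

push @arr, 'data';                # appends AFTER the pre-extended slots
my $len_after = scalar @arr;      # 1001

# Filling by index instead keeps the length at 1000.
my @filled;
$#filled = 999;
$filled[$_] = $_ * 2 for 0 .. 999;

print "$len_before $len_after ", scalar(@filled), "\n";   # 1000 1001 1000
```

      So if you pre-extend, assign by index rather than push.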

      Also, consider compiling to bytecode or doing a memory dump and saving it as an executable file (if it doesn't have to travel across operating systems).

      Take care,
      Dave.

      "For fate which has ordained that there shall be no friendship among the evil has also ordained that there shall ever be friendship among the good." - Plato / Socrates

Re: Design Question
by demerphq (Chancellor) on Aug 06, 2003 at 21:59 UTC

    Basically I want some way to create a large array in a subroutine, keep it around as long as I need, and then to get rid of it. And preferably I'd still want to keep this all in a single package.

    Sounds to me like you would benefit from OO. Depending on the details of what you are doing I would probably be thinking of either converting the whole lot into a single class, or perhaps into two classes, one for the "common_stuff" and one for the processStuff (which would contain a common_stuff object). Have a read of perltoot and perlboot and maybe perltootc too.


    ---
    demerphq

    <Elian> And I do take a kind of perverse pleasure in having an OO assembly language...

      I don't feel at all comfortable with Perl OO, to be honest: it feels... weird to me for some reason. But the more "rational" reason I've avoided it here is that my objects really have no attached methods and very little need for privacy. It's just passing (large, large) quantities of data around between subroutines in an orderly fashion. I suppose I can't avoid Perl OO forever, though.

        I don't feel at all comfortable with Perl OO, to be honest: it feels... weird to me for some reason.

        I think a lot of people give "OO" a lot more fear and awe than it deserves. I think this is partly because of all the multisyllable terms that the jargon freaks just love to insist are the cool parts of OO (polymorphism, inheritance, method dispatch, overloading, etc.). Perl's OO is really close to the raw idea of OO: it tightly associates data with the methods/subs that operate on that data. (Perl also does pretty much the full repertoire of OO, with the exception of data hiding, but that's not an essential part of OO in my mind.)

        So let's consider this subroutine structure...

        sub make_complex_array {
            my @args = @_;
            # ...
            return $complex_array_ref;
        }

        sub do_funky_stuff_with_complex_array {
            my $array = shift;
            # ....
            return $value;
        }

        To convert this to a class/object all we have to do is add one line and alter two! (Well as a sop to convention and aesthetics we'll rename the subs)

        package Complex::Array;

        sub new {
            my ($class, @args) = @_;
            # ...
            return bless $complex_array_ref, $class;
        }

        sub do_funky_stuff {
            my $array = shift;
            # ....
            return $value;
        }

        And what's so hard about this? Now the $array knows that it isn't just an array with indexes; it knows it's a complex data structure with a set of extended behaviour. If we need to cache things about the array, we have a nice convenient handle with which to do so.

        The point of all this is that you want to maintain state through a set of package-level variables. This isn't a good idea as a general rule (singletons aside). If you need to maintain state across multiple subroutine invocations you should use OO. Especially as in Perl it's so damn easy. :-)
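
        To make that concrete, here is a rough sketch of the original poster's problem with the state held in an object. Everything named here (Big::Data, process_b, the doubled numbers in the array) is hypothetical; it just shows the array being built once in new() and released when the object goes out of scope.

```perl
use strict;
use warnings;

package Big::Data;    # hypothetical class name

sub new {
    my ($class, $size) = @_;
    my $self = {
        array => [ map { $_ * 2 } 1 .. $size ],   # built once per object
    };
    return bless $self, $class;
}

sub common_stuff {
    my ($self, $i) = @_;
    return $self->{array}[$i];        # reuses the array held by the object
}

sub process_b {
    my ($self) = @_;
    my $sum = 0;
    $sum += $self->common_stuff($_) for 0 .. 999;
    return $sum;
}

package main;

my $total;
{
    my $data = Big::Data->new(10_000);
    $total = $data->process_b;
}   # $data goes out of scope here, so the array is released

print "$total\n";   # 1001000
```

        No clear_big_array convention is needed: the array lives exactly as long as the object does.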

        my objects really have no attached methods and very little need for privacy.

        There is no privacy in Perl, so that's no excuse. As for the methods, from what you've described all of your subroutines are method candidates to me. If you like, post a more beefy code example and I'll refactor it into OO for you.

        HTH


        ---
        demerphq

        <Elian> And I do take a kind of perverse pleasure in having an OO assembly language...