dneedles has asked for the wisdom of the Perl Monks concerning the following question:

This is a follow on from a previous post Out of memory., which resulted in not finding a way to handle a problem within memory. The program aborts with either a core dump or an "Out of Memory Error!"

The PERL was ran on Linux (Redhad), Solaris, Windows XP, with 4-16 gigs of memory using various builds of PERL 5.8 and 5.10 with the same result.

Strategically, if a problem is too big to model in RAM is using DBI:DBD the fastest and simplest way to store data outside of RAM? I am assuming from a high level there are only a few methods are available. Also, is there in Windows a way to detect how much heap a perl program has? (I don't think this is a global value.)

The goal is to convert "if-then-else" statements in a 4th GL scripting language into equations. For example "if(a) {}elsif (b) {} else {}" would becomes (a+b+c).

Once accomplished all paths through the 4th GL script can be found by "multiplying" the equations through. For example, two simple if/else statements could be represented as (a+b)(c+d.) The independent paths throught the code is: ac+ad+bc+bd.

The code provided here does this equation transformation.

The difficulty is that the equation is 64K in characters and will result in a couple million independent elements as a result. I'd like to be able to run the program on windows as well as UNIX, but need a way in windows to detect how much heap is available for the program. (I think a general how much memory is on the system will not work since the "Out of Memory!" problem will still occur. I could be wrong though.) My thought is that a database is the best way to externalize and keep the speed up.

The data is available here: Out of memory.

From the last post (the latest in-memory attempt was:
$|=1; sub multiply { my( $ref1, $ref2 ) = @_; my @temp; for my $r1 ( @{ $ref1 } ) { for my $r2 ( @{ $ref2 } ) { push @temp, $r2 . $r1; } } return \@temp; } $/ = ','; my @stack; open POSTFIX, '<', 'postfix.txt' or die $!; while( my $op = <POSTFIX> ) { chomp $op; chop $op if $op =~ tr[\n][\n]; ## remove any trailing newline print "MULTIPLY\n"; if( $op eq 'x' ) { push @stack, multiply( pop( @stack ), pop( @stack ) ); } print "ADD\n"; elsif( $op eq '+' ) { push @stack, [ @{ pop @stack }, @{ pop @stack } ]; } elsif( $op =~ m[^\d+$] ) { push @stack, [ pack 'v', $op ]; } else { die "Bad '$op' at position;" . tell( POSTFIX ); } } print "stacksize: " . @stack; for my $item ( @stack ) { printf "%s", join ',', unpack 'v*', @{ $item }[ 0 ]; for my $group ( @{ $item }[ 1 .. $#$item ] ) { printf "+%s", join ',', unpack 'v*', $group; } print "\n"; }
However, it ran out of memory as well. Any insights in regards to approach is appreciated.

Replies are listed 'Best First'.
Re: Out of Memory 2.
by graff (Chancellor) on Oct 25, 2008 at 18:43 UTC
    As Illuminatus points out, there are better ways to post your question (or perhaps, better questions you should be asking).

    "Out of memory" failures should be pretty rare, in general, and only show up when there's some bug in the code, and/or some misunderstanding about the nature and quantity of your input data, so a few words to describe the data and what the code is supposed to do with it would be relevant.

    A common cause of "Out of memory" failures is the use of a variable as an array index, when the value of the variable happens to be something unexpectedly (unintentionally) large -- here's the quickest/easiest way to get "Out of memory":

    $ perl -le '$x=2**30; print $x; $a[$x]=0' 1073741824 Out of memory during array extend at -e line 1.
    (Though conceivably some perl installs on some machines might succeed with that particular example.)

    Another common cause is making multiple (unnecessary) copies of large files in memory -- e.g. doing stuff like:

    @lines = <FILE>; # copy 1 $all = join("",@lines); # copy 2 for (@lines) { ($key,$val) = split " ", $_, 2; $hash{$key} = $val; # copy 3 }
    If you know in advance that your script needs to access to more data than can fit in memory at a given time, then SQL access to a database is certainly a good way to work around that. DBM files are another option, if the data are structured as "key:value" tuples.

    Whatever you are trying to accomplish, I would not recommend trying to game it so that your process tries to consume as much memory as possible without taking "too much" -- that approach only worked back in the old days of single-job systems (the original MS-DOS, DEC's old RT-11 are the only ones I recall), and nobody uses those anymore.

    Assuming you know enough about your data and your task, you should be looking for a solution that runs within reasonable memory limits (or you should be putting more RAM on your machine until it's not a problem anymore).

      Thanks for the info. I updated the initial post to provide a bit more tactical info and to consolodate the answers to the many questions.
        Could you expand a little on what you're trying to do? That is to say, Matrix manipulation, Fourier Transforms, 3-D animation, etc.?

        There may already be a CPAN module optimized to the problem you're trying to solve.

Re: Out of Memory 2.
by Illuminatus (Curate) on Oct 25, 2008 at 17:19 UTC
    Please, please read How do I post a question effectively?
    1. define 'loaded'.
    2. what OS?
    3. This is perl we are talking about, right?
    4. What have you tried that has not worked?
    5. If you are using Linux, you can use the 'free' command to roughly figure out how much free memory remains
      Good questions. I updated the initial post with the following information:

      1. I have a large equation that I need to manipulate. In a previous post folks tried using a variety of hashes, arrays, strings, etc to handle the manipulation. All attempts ended with Out of Memory. Hence I need to externalize parts of the equation into files or some other non RAM memory store.

      2. Tried on Linux (RedHat), Solaris, Windows XP on 5 machines ranging from 4 core to 8 core with 4-16 gigs of memory.

      3. Yes 5.8 and 5.10

      4. I've tried a variety of PERL based constructs to store and manipulate the symbolic equation - hash, arrays, strings. I've tried a couple different versions of PERL (due to different versions being on different systems)

      5. Thanks! Is there an equivallent for windows?

      Overall I am looking for an approach which I thought would be independent of these tactical factors. That is, if you have a program that runs out of RAM and given PERL is relatively OS independent, is there a common preferred way to manage this situation. I was thinking databases were the way to go but I am not sure.
      I updated the initial post to address these concerns. My thought was that an approach to running out of RAM would be common across all OSs, etc. But perhaps that is not the case?
Re: Out of Memory 2.
by ig (Vicar) on Oct 25, 2008 at 22:40 UTC

    Two of the print statements in your sample code appeared to be out of place, with one causing a compile failure. I modified it to the following:

    Then I put the sample expression you provided "(a+b)(c+d.)" in a file named postfix.txt as follows:

    (a+b)(c+d.)
    and ran the script.

    It produced the following output:

    Bad '(a+b)(c+d.)' at position;12 at ./test.pl line 38, <POSTFIX> chunk + 1.

    This isn't surprising as there is nothing in your code to handle the parentheses. Nor is there anything to handle the alphabetic characters, if it got past the initial parenthesis.

    It also appears that your sample expression is infix but your code is written to work with postfix expressions.

    As GrandFather said, it would be very helpful if you provided a small working sample of input to your program.

    It appears that you are trying to produce a large number of combinations. With input of approximately 64K corresponding to nothing other than a series of simple if/else blocks, with something on the order of 4 characters per block, then you would have approximately 16K blocks. If I understand what you are doing correctly, you would then produce all combinations of either branch of each if/else statement. This corresponds to approximately 2**16384, a very large number indeed. You may find that producing and storing all possible combinations exceeds the capacity of your database.

    In summary, you need to explore and understand the magnitude of the data set your are trying to produce before deciding how to store it.

    Depending on what you want to do with the data once it is produced, you may find it is better to use and discard it as you produce it, rather than producing and storing it in one pass then reading and using it in another.

    Update: I missed the link to Out of memory. and the sample data there. Now I see that the example expression in your post is not an example of intended input to your program but how about posting a small but complete sample input?

      Then I put the sample expression you provided "(a+b)(c+d.)" in a file named postfix.txt as follows: ... This isn't surprising as there is nothing in your code to handle the parentheses.

      It is a postfix expression evaluator. That example would have to be entered as a,b,+,c,d,+,x


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
      It appear for what ever reason, I am having a horrible time at communicating and causing folks feathers to ruffle. Further, the question I asked about approach really isn't being addressed. At this point, I'll bow out and go the database route. Thanks for everyone's help but it seems there is more harm that good being accomplished.
Re: Out of Memory 2.
by BrowserUk (Patriarch) on Oct 25, 2008 at 23:40 UTC

    You are running out of memory because your program requires more virtual memory to complete, than your computer is capable of supplying!

    At byte offset 113864 of your equation (the original, fully expanded one with numeric symbols):

    ...,15618,+,15619,+,15620,15621,15622,+,x,+,x,+,x,+,x,15623,15624,1562 +5,... 113864..............................................^

    Where (using my best yet compression technique described below), your equation calls for the multiplication of two additive (+) terms containing 22665 & 186 subterms respectively resulting in a subterm containing 3.8 million terms.

    This subterm is next used at offset 113952:

    ...,x,15630,x,15631,x,15632,x,15633,x,+,x,+,x 113952..................................^

    where it is multiplied by another with 6 terms to produce a resultant containing 22,846,320 (+)terms. But, at the point where this caclulation starts, even with compression and token cache substitutions, the process is already consuming 1.7GB!

    I estimate that it would require at least 6GB to complete that expansion. And there are two more to go! Obviously, with 32-bit machines being limited to (at most) 3GB of virtual memory per process, this is not going to work.

    My DB/SQL skills are more dormant than real at this stage, but I do not see any easy (or hard) way of using a DB for this. Maybe if you re-described the operation of your postfix expression evaluator, someone with greater DB foo than me could see a way to approach this. I do not expect it would be very fast though.

    Other solutions

    One possibility is to move your application to a well specified 64-bit machine with say 8GB of ram.

    Another it to start caching the intermediate terms to disk. Besides being very slow, the more I think about this, the harder it gets. Essentially you would have to load the smaller of the next two terms from a disk based stack(??How to implement??), into memory, break it to its subterms, and then read the larger in, subterm by subterm, and output the result back to disk on the fly. Distinctly non-trivial.

    The final possibility that I see, and have been pursuing as time permitted, is to find a more aggressive compression strategy than I have been using till now.

    Compresion strategy: token caching

    Any subterm of of a subequation like 1,2,3,4,5,6+1,2,3,4,5,6,7,8,9+7,8,9 can be reduced (in size) by caching the common subsequences of multiplicative terms and subtituting a symbol. So the above caches 1,2,3,4,5,6 and assigns symbol A; caches 7,8,9 as B; and the result becomes A+A,B+B. And the multiplicative terms A,B can in turn be cached and C substituted.

    Once the calculation is complete, these substitutions can be (recursively) expanded on-the-fly as the result is written to disk.

    The problem is, even with this symbol caching, the equation as the point outlined above consisting of purely additive terms. Eg. PQRS+HIJK+ABCD...(22,million) and it is still breaking the memory bank. And you cannot do symbol substitution on sequences of additive terms because multiplication needs access to the individual elements.

    However, the largest cache token I've seen in use is only four characters (eg.'PQRS'), which means there must (at the point I've managed to get to so far), be less than 456,976 (26^4) unique terms in the cache. And as the largest subequation (so far) contains 3.8 million terms, there must be many duplicates. Eg.

    ABCD+IJKL+ABCD+PQRS+ABCD+IJKL...

    Which could be reduced to:

    (ABCD*3)+(IJKL*2)+(PQRS*1)...

    and those subterms can safely be used in subsequent calculations for a further considerable reduction in memory usage. They can then be expanded on output as above.

    I was still trying to a) convince myself this was correct; b) work out how to code it; when I saw your latest post.

    I also wonder quite what you are going to do with the results of this? I don't see that the human brain is going to derive much information from a several gigabytes of symbolic mush.

    I think I recall you mentioning Excel? I seriously doubt Excel's capability to even load this much data on a 32-bit machine, never mind do anything useful with it?


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Thanks for the info but I have to ask, is there a particular reason this forum has become some incredibly hostile and combative since I last visited it a couple years back?

      Normally I wouldn't mind but my question about approach hasn't been answered due to up front assumptions, backed by hostility which has dragged the discussion from a strategic realm into a tactical one which from the previous post was already shown a dead path.

      In regards to the final retorical questions - the resulting equations is used by a C program to determine errors in a 4th GL language. This program works on smaller examples and the results are compressed and placed in Excel.

      Anyway, I'm going the database route. I've had enough negativity to last the rest of the year.

        I think the main problem was that you posted 64kb of non-wrapped, unitelligable data, without code tags so PM couldn't wrap it for you. And then compounded that, by doing exactly the same thing a second time.

        Taking a refresher course in the PM approved markup and applying it might help. (And basically don't post huge chunks of raw data. Email is far better for that.)


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Out of Memory 2.
by GrandFather (Saint) on Oct 25, 2008 at 21:27 UTC

    It may help if the sample code you provide was real code and you provided just enough sample input data with it that we can run the code.

    As it stands there seem to be two misplaced print statements and at least one missing }.

    It would definitely help if you were to break your solid text block into a number of paragraphs, especially where you are answering a list of questions or suggestions.

    In some of your replies you speak in general terms of '"multiplying" the equations through' and go on to give simple examples, but not examples that would be parsed by your code.

    Many of your replies are simply a regurgitation of material you have already given. You would save yourself and us a lot of time by simply referring to the previously posted material by linking to it with the PerlMonks [id://nodenumber] link mark up. For example your first Out of Memory post can be linked as [id://718318]: Out of memory..


    Perl reduces RSI - it saves typing
      Great ideas.

      I updated the initial post with the correct code.

      I updated the initial post to point to the data.

      There is also cleaned data in Re^6: Out of memory. and continued into Re^7: Out of memory. but it is the same data and also will require minor code changes.

      Part of the reason for the repeat is the question I originally asked isn't answered, but it might be my assumption that there is a single method for "paging" memory (i.e. database, file, etc) that is more common in PERL might be inaccurate.

        Are you being deliberately thick? If you don't read what we write and can't be bothered to answer the questions or respond to the suggestions we propose, how do you expect to get any answer that is of benefit to you?

        Let me reiterate:

        1. The code you posted can not run - it is syntactically incorrect.
        2. We don't want 64K of data. We want about 20 characters of data to exercise the code.
        3. The input data you describe is not the input data that the code you supplied parses (even if it could run).
        4. Learn to use paragraph tags to break your text up into chunks that can be parsed as a unit and that express a single idea or point of reply.

        Update it is conventional here to note substantive updates to your nodes with an update section (like this one). For those who came in late, the node to which this is a reply has been updated and the original node in this thread has been updated.

        Due to the excessive data you have provided in Out of memory. FireFox does not render it in a usable fashion for me. If there is any sample data of sensible size provided in that node I can not see it.


        Perl reduces RSI - it saves typing
Re: Out of Memory 2.
by JavaFan (Canon) on Oct 25, 2008 at 23:37 UTC
    Also, is there in Windows a way to detect how much heap a perl program has? (I don't think this is a global value.)
    Perl will use as much memory as it needs, up to the point the OS is no longer willing to give Perl anymore memory. The usualy cause of the OS no longer willing to give Perl memory is that all other memory has been claimed by other processes. Which is an ever fluctuating value. If a short-lived process uses all memory of the computer, even for a brief moment, and Perl asks for a small slice of memory just at the same time, Perl will die with an out of memory error. Because it couldn't get the memory from the OS at the moment it wanted.

    So, there's no way to find out how much memory Perl can use. Because it's a value that can change on every clock cycle.

Re: Out of Memory 2.
by graff (Chancellor) on Oct 27, 2008 at 01:12 UTC
    I'm sorry the responses here have come across as hostile, but to be honest, your presentation of the problem has been, um, problematic. (I'm truly grateful that someone -- you? -- fixed your original Out of memory. post; that entire thread is much easier to read now.)

    I made some slight changes your program (as currently posted in the present thread), and it ran on a slightly modified version of the data provided in the earlier thread, with no "out of memory" problem.

    I don't think my code changes had anything to do with memory management -- I just fixed the syntax errors, and changed to using STDERR for the status messages. In order to get it to work on the previously posted "postfix.txt" file (about 64KB), I had to delete the "." characters at the end of the data, in order to get past the "Bad $op ..." die condition.

    With my version, you can redirect STDOUT to a file, and just see the status messages on your terminal. With a bash-like shell, you can also redirect STDERR to a separate file, which is what I did.

    When the run finished successfully, the STDERR log file had 5170 lines containing "MULTIPLY" and 3966 lines containing "ADD", which happens to match the number of "x" and "+" characters in the data file, respectively. There were 30 elements in the "@stack" array, and 30 lines of data in the STDOUT file, the first two lines being:

    1 2,5,6+2,4,6+2,3,6
    The third line was over 8 KB long, the fourth line was:
    40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55
    The fifth line was about 150 KB long, the 7th was 658 chars, the 28th was 228 chars, and all the rest had a single 4-digit number. The number on the 30th line was "9166", which is happens to be the number of digit strings in the postfix.txt input file.

    I have no idea what this sort of output is supposed to mean. Perhaps I've missed something, but I think your statements about the intent of your program, and the nature of your approach, consist so far of the following:

    I wrote a program to sybolically process multiplications on an equation in postfix notation. (taken from the first post)

    The goal is to convert "if-then-else" statements in a 4th GL scripting language into equations. For example "if(a) {}elsif (b) {} else {}" would becomes (a+b+c).

    Once accomplished all paths through the 4th GL script can be found by "multiplying" the equations through. For example, two simple if/else statements could be represented as (a+b)(c+d.) The independent paths throught the code is: ac+ad+bc+bd.

    The code provided here does this equation transformation. The difficulty is that the equation is 64K in characters and will result in a couple million independent elements as a result.

    Sorry -- not only do I fail to get a clear idea of the goal, but I really can't see any relation between that description and the output that I got from your script. For that matter, I'm clueless about what the output data provides, let alone what the input represents. Still, I hope the information I provided was helpful in some way.

      Thanks for the response. It is hard to take "Are you being deliberately thick?" as not hostile.

      The problem is large so I provided only a subset of what I am trying to solve, which appears confusing to folks. What I posted is an attempt to translate an algebraic solution into an expanded solution.

      When that failed in the previous post, for this post I was attempting to see what folks did in the way of "paging" when programs got too big for memory. It appears this question is to abstract for this forum.

      I've completed the first step which was to manually create array of arrays so I can page part of the data out to a database. With this I am up to 16 million array elements - only 8 million more to go. ;~)
      $|=1; $/ = ','; my @letters; my @words; $words[0]=0; my $idxop; open POSTFIX, '<', 'postfix.txt' or die $!; while( my $op = <POSTFIX> ) { $idxop++; chomp $op; chop $op if $op =~ tr[\n][\n]; ## remove any trailing newline print "$idxop ($op) WORDS: $#words; LETTERS: $#letters\n"; if( $op eq 'x' ) { print "MULTIPY\n"; ## GRAB LAST TWO ELEMENTS FROM WORDS BY RANGES x1->x2-1;x2->x3 +-1 my $x3=pop(@words); my $x2=pop(@words); my $x1=pop(@words); my $elements=$x3-$x1; ## APPEND NEW VALUES ON TOP OF STACK $idx=$x3; for (my $i=$x2; $i<$x3; $i++) { for (my $j=$x1; $j<$x2; $j++) { $letters[$idx++]= $letters[$j] . $letters[$i]; } print "AT: $i toward $x3 with $#letters\n"; } ## DELETE OLD ELEMENTS splice @letters,$x1,$x3-$x1; ## PUSH POINTERS TO ENDS OF NEW ALGOMATED ELEMENT push(@words,$x1); $elements=$idx-$elements; push(@words,$elements); } elsif( $op eq '+' ) { print " ADD\n"; ## GRAB LAST TWO ELEMENTS FROM WORDS BY RANGES x1->x2-1;x2->x3 +-1 my $x3=pop(@words); my $x2=pop(@words); my $x1=pop(@words); ## PUSH POINTERS TO ENDS OF NEW ALGOMATED ELEMENT push(@words,$x1); push(@words,$x3); } elsif( $op =~ m/^\d+$/ ) { print " STORE NUMBER $op\n"; push @letters, pack 'v', $op ; #push @letters, $op ; $words[$#words+1]=$#letters+1; } else { die "Bad '$op' at position;" . tell( POSTFIX ); } }