perlbeginner10 has asked for the wisdom of the Perl Monks concerning the following question:

Hi guys,
I have a program that reads input from a file, creates two separate hashes for each word in the input file, performs a complex operation whose complexity is O(n^n), and prints the output to a text file. The program works fine with small input, but my input file is about 48MB. When I run the program on it, there is no output, or the output is blank.
Can anyone suggest anything? The input file can't be made smaller.
Thanks.

Replies are listed 'Best First'.
Re: out of memory problem
by GrandFather (Saint) on Mar 15, 2006 at 20:51 UTC

    If it is really O(n^n) and n is anything greater than about 10, then you are stuffed.
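    To put a number on "stuffed": 10^10 is already ten billion steps, and 20^20 is roughly 1.05e26. A throwaway sketch for printing a few values (the helper name n_to_the_n is made up for illustration):

```perl
use strict;
use warnings;

# Hypothetical helper: the number of steps an O(n^n) algorithm would need for a given n.
sub n_to_the_n {
    my ($n) = @_;
    return $n ** $n;
}

# Print the step count for a few small n to show how quickly it explodes.
printf "%2d^%-2d = %.3g\n", $_, $_, n_to_the_n($_) for (5, 10, 15, 20);
```

    With tens of thousands of distinct words, no amount of RAM or patience would help if the cost really grew this way.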

    Describe what you are trying to achieve and show us the guts of the code with a very small sample set. We may be able to advise on a better technique than you are currently using. Consider this thread as an example of how a different approach can help.


    DWIM is Perl's answer to Gödel
Re: out of memory problem
by Tanktalus (Canon) on Mar 15, 2006 at 20:38 UTC

    In order of easiest/cheapest (on developer costs) to most difficult ...

    • Check your ulimit. If you have a limit on memory from that, then you'll get an out of memory error when you reach it. I had this problem - 64MB wasn't enough, so I just told the sysadmins that I needed unlimited memory, and that problem went away.
    • Buy more RAM/increase swap space. If you still don't have enough memory after the above, then maybe it's because there's no more memory to have. My current home computer has 4GB of RAM and 8GB of swap specifically because of this. Of course, RAM is way faster than swap.
    • Try upgrading to 64-bit. That means a 64-bit processor as well as a 64-bit OS and a 64-bit perl, with sufficient memory. If approximately 3.5GB isn't enough memory to access, then 64-bit will allow you to access anything you can throw at it.
    • Try using a file-system-backed tied data structure. Since I don't use this, nor do I know what data structures you're using, I can't really tell you which one. But the basic idea is to throw all your intermediate results to disk, and read them back in as you need them. A tied structure, such as DBM or something, can radically simplify this. This allows you to use your hard disk as if it were RAM, without actually hitting the limits of your ulimit or CPU or physical RAM/swap.
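    As a rough illustration of the last point, here is a minimal sketch that ties a hash to an SDBM file, assuming the core SDBM_File module is acceptable. The scratch directory, the file name "valueof" and the sample names are placeholders, not taken from the original program.

```perl
use strict;
use warnings;
use Fcntl qw(O_RDWR O_CREAT);
use SDBM_File;
use File::Temp qw(tempdir);

# Placeholder scratch directory; a real run would pick a location with enough free disk.
my $dir = tempdir(CLEANUP => 1);

# After the tie, %valueof is stored in files under $dir rather than in RAM.
tie my %valueof, 'SDBM_File', "$dir/valueof", O_RDWR | O_CREAT, 0666
    or die "Cannot tie hash: $!";

# Use the tied hash exactly like an ordinary one.
my $next_id = 1;
for my $name ('cancer', 'breast cancer', 'lung cancer', 'cancer') {
    $valueof{$name} = $next_id++ unless exists $valueof{$name};
}

my $count  = scalar keys %valueof;
my $breast = $valueof{'breast cancer'};
print "$count distinct names stored on disk\n";

untie %valueof;
```

    SDBM only stores flat strings and limits the size of each record, so a nested structure such as a two-dimensional array would have to be flattened into string keys first, or kept in a different DBM back end.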
    Hope that helps.

Re: out of memory problem
by izut (Chaplain) on Mar 15, 2006 at 20:31 UTC

    Could you post the code you wrote?

    Igor 'izut' Sutton
    your code, your rules.

Re: out of memory problem
by ikegami (Patriarch) on Mar 15, 2006 at 20:32 UTC
    We don't know anything about your algorithm or your input. How can we help?
      Sorry about that. Here is my code
      my %fnameof;
      my %valueof;
      my @relation;
      my @second;
      my $mainfile;
      my $subfiles;
      {
          open ($testdataset, "datasetnew.txt") or die "Cannot open file";
          @testdataset = <$testdataset>;
          close ($testdataset);
          open (STDOUT, ">>result.txt");
          $fcount = 1;
          $secondcount = 0;
          @testdataset = grep { $_ ne '' } @testdataset;
          @testdataset = grep /\S/, @testdataset;
          foreach $dataline (@testdataset) {
              ($mainfile, $subfiles) = GetFileName($dataline);
              for ($mainfile) {
                  $mainfile =~ s/^\s+//;
                  $mainfile =~ s/\s+$//;
              }
              addtoHash($mainfile);
              @subfiles = @keywords = split(/;/, $subfiles);
              @subfiles = grep { $_ ne '' } @subfiles;
              @subfiles = grep /\S/, @subfiles;
              foreach $subfile (@subfiles) {
                  $subfile =~ s/^\s+//;
                  $subfile =~ s/\s+$//;
                  addtoHash($subfile) unless ($_ ne '');
              }
              #defining the relation of mainfile with subfiles. Each mainfile has relation weight = 1 with subfile.
              foreach $subfile (@subfiles) {
                  $relation[$valueof{$mainfile}][$valueof{$subfile}] = 1;
                  $second[$secondcount] = "$valueof{$mainfile};$valueof{$subfile}";
                  $secondcount++;
              }
          }
          #creating transitive relationship. ie: if A->B and B->C, then A->C
          foreach $seconditem (@second) {
              @test = split(/;/, $seconditem);
              $b = $test[0];
              $c = $test[1];
              for ($k = 1; $k<=$secondcount; $k++) {
                  if ($relation[$c][$k] gt 0) {
                      $relation[$b][$k] = $relation[$b][$k]+1;
                  }
              }
          }
          PrintArray();
      }
      #get mainfile and subfiles
      sub GetFileName{
          my $item = $_[0];
          @datasplit = split(/\t/, $item);
          $mainfile = @datasplit[0];
          $subfiles = @datasplit[1];
          return ($mainfile, $subfiles);
      }
      sub addtoHash{
          my $file = $_[0];
          $exist = 0;
          for ($i = 0; $i < $fcount; $i++) {
              if ($fnameof{$i} eq $file) {
                  $exist = $i;
              }
          }
          if ($exist == 0) {
              $fnameof{$fcount}= $file;
              $valueof{$file} = $fcount;
              $fcount++;
          }
      }
      sub PrintArray(){
          for($i=1;$i<$fcount; $i++) {
              for($j=1;$j<$fcount;$j++){
                  if (defined ($relation[$i][$j])) {
                      print $fnameof{$i}."-".$relation[$i][$j]."->".$fnameof{$j}."\n";
                  }
              }
          }
          print "\n";
      }
      And here is a sample dataset:
      cancer	breast cancer; lung cancer; heart cancer; stomach cancer;
      breast cancer	foot cancer;
      foot cancer	some cancer;
      lung cancer	blood cancer; foot cancer;
      heart cancer	foot cancer;
      stomach cancer	foot cancer;
      blood cancer	some cancer;
      But the real dataset is huge: about 48MB. I have 1GB of memory in my computer. I ran this program on Windows and Fedora Core, but the result is the same: blank (with the 48MB dataset). PS: If there are any other points that could improve my code, please let me know.

        First glance -

        • add use strict; use warnings to your code then clean up the errors and warnings.
        • don't use $a or $b as variable names - they are reserved for use by sort
        • use the three parameter open
        • where does $fcount in addtoHash get a value? Make it explicit by passing the value into the sub rather than relying on a global.
        • Don't prototype PrintArray - especially after its first use!
        • you probably want chomp @testdataset; before @testdataset = grep { $_ ne '' } @testdataset;
        • @testdataset = grep { $_ ne '' } @testdataset; is redundant when followed by @testdataset = grep /\S/, @testdataset;
        • what does for ($mainfile) { achieve?
        • You test if ($exist == 0), but $i can == 0 and therefore $exist can == 0 (in addtoHash)
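        To illustrate the points about $fcount and $exist, here is one possible rewrite of addtoHash. It is only a sketch: the name add_to_hash and the $state hash reference are invented for the example. It asks the name-to-id hash directly with exists instead of scanning every id, which avoids the ambiguous 0 and makes each lookup roughly constant time instead of a loop over everything seen so far.

```perl
use strict;
use warnings;

# Register a name and return its numeric id; a repeated name gets its existing id.
# All state is passed in explicitly instead of living in globals.
sub add_to_hash {
    my ($state, $file) = @_;
    return $state->{valueof}{$file} if exists $state->{valueof}{$file};

    my $id = $state->{fcount}++;
    $state->{fnameof}{$id}   = $file;   # id   => name
    $state->{valueof}{$file} = $id;     # name => id
    return $id;
}

my $state = { fnameof => {}, valueof => {}, fcount => 1 };
my @ids = map { add_to_hash($state, $_) } 'cancer', 'lung cancer', 'cancer';
print "@ids\n";   # prints "1 2 1"
```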

        You could describe the output you expect. Sometimes knowing what is expected of a piece of code helps understand it - sometimes it helps misunderstand it :)

        Update: more items added


        DWIM is Perl's answer to Gödel

        I don't have time to look at it personally, at least not now, but the following will help you greatly. Change

        open ($testdataset, "datasetnew.txt") or die "Cannot open file";
        @testdataset = <$testdataset>;
        close ($testdataset);
        @testdataset = grep { $_ ne '' } @testdataset;
        @testdataset = grep /\S/, @testdataset;
        foreach $dataline (@testdataset) {

        to

        open (my $testdataset, '<', "datasetnew.txt")
            or die "Cannot open input file: $!\n";
        while (my $dataline = <$testdataset>) {
            next if $dataline =~ /^\s*$/;

        You'll have (2 or 3) fewer copies of your file in memory.

Re: out of memory problem
by GrandFather (Saint) on Mar 16, 2006 at 01:52 UTC

    It's not entirely clear what you are trying to achieve. But on the guess that it has something to do with finding transitive relationships in the data, the following may be of use:

    use strict;
    use warnings;

    my %mappings;

    while (<DATA>) {
        chomp;
        next if ! /\S/;
        s/^\s+//;
        s/\s+$//;
        my ($mainfile, $subfiles) = split /\s*,\s*/;
        my @subfiles = split /\s*;\s*/, $subfiles;
        $mappings{$mainfile} = [grep /\S/, @subfiles];
    }

    # Print transitive relationships. ie: if A->B and B->C, then A->C
    for my $A (sort keys %mappings) {
        for my $B (@{$mappings{$A}}) {
            print " $A - $B -> @{$mappings{$B}}\n" if exists $mappings{$B};
        }
    }

    __DATA__
    cancer,breast cancer; lung cancer; heart cancer; stomach cancer;
    breast cancer,foot cancer;
    foot cancer,some cancer;
    lung cancer,blood cancer; foot cancer;
    heart cancer,foot cancer;
    stomach cancer,foot cancer;
    blood cancer,some cancer;

    Prints:

     breast cancer - foot cancer -> some cancer
     cancer - breast cancer -> foot cancer
     cancer - lung cancer -> blood cancer foot cancer
     cancer - heart cancer -> foot cancer
     cancer - stomach cancer -> foot cancer
     heart cancer - foot cancer -> some cancer
     lung cancer - blood cancer -> some cancer
     lung cancer - foot cancer -> some cancer
     stomach cancer - foot cancer -> some cancer
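
    The loop above follows exactly one intermediate step (A->B->C). If longer chains are wanted, one option is a breadth-first walk over the same kind of hash. The following is only a sketch under that assumption: reachable_from is a made-up name, and %mappings here is an abbreviated hand-typed copy of the sample data, not the real input.

```perl
use strict;
use warnings;

# Return every name reachable from $start by following one or more links.
# %seen guards against revisiting names, so cycles in the data cannot loop forever.
sub reachable_from {
    my ($mappings, $start) = @_;
    my %seen;
    my @queue = @{ $mappings->{$start} || [] };
    while (@queue) {
        my $name = shift @queue;
        next if $seen{$name}++;
        push @queue, @{ $mappings->{$name} || [] };
    }
    my @names = sort keys %seen;
    return @names;
}

# Abbreviated sample data (assumption: same shape as the hash built from DATA above).
my %mappings = (
    'cancer'        => ['breast cancer', 'lung cancer'],
    'breast cancer' => ['foot cancer'],
    'lung cancer'   => ['blood cancer', 'foot cancer'],
    'foot cancer'   => ['some cancer'],
    'blood cancer'  => ['some cancer'],
);

my @all = reachable_from(\%mappings, 'cancer');
print join('; ', @all), "\n";
```

    Because each name is expanded at most once, the work per starting name is bounded by the number of names plus the number of links, and memory stays proportional to the hash itself rather than to an n-by-n matrix.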

    DWIM is Perl's answer to Gödel