in reply to Re: Memory utilization and hashes
in thread Memory utilization and hashes

Sorry for the typos in the code; fixing them.

My actual data files range from several hundred MB to several hundred GB, so that data set is just a sample of the sort of thing I am processing.

Two queries followed by two answers in a row is what my real-world data contains. Specifically, there can be anywhere from 1 to n answers for each query, and the queries and answers can occur in any order. The only guarantee is that an answer will follow (sometime later) the query it goes with.

Maximum lines per file to process = 31291204; average lines per file = 8707186.

Re^3: Memory utilization and hashes
by Laurent_R (Canon) on Jan 17, 2018 at 21:49 UTC
    Just to keep track:
    my $l;       # all these three variables should probably better declared within the
    my @vals;    # while loop. Only %pairs probably need to be declared before the while
    my $json;
    while (<>) {
        $l = $_;
        chomp $l;
        @vals = split /;/, $l;
        if ($vals[0] =~ /Query/) {
            $pairs{$vals[1]}{$vals[2]} = $vals[3];    # %pairs isn't declared anywhere
        } elsif {$vals[0] =~ /Answer/) {              # syntax error: elsif { should be elsif (
            $pairs{$vals[1}{$vals[2]} = $vals[3];
            $json = encode_json $pairs{$vals[1]};     # what do you think is the content of $pairs{$vals[1]}?
                                                      # Probably not what you want to encode.
            print $json."\n";
            delete $pairs{$vals[1]};
        }
    }
    This will still not compile.

    Do yourself a favor. Use the following pragmas:

    use strict;
    use warnings;
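
    For reference, here is a minimal sketch of how the loop above might look once it compiles cleanly under both pragmas. It keeps the posted logic as is, including the delete after the first answer, so it only addresses the syntax and declaration problems. The sub name pair_records, the use of JSON::PP, and the record layout "Type;id;key;value" are assumptions for illustration; the original post does not say which JSON module it loads.

```perl
use strict;
use warnings;
use JSON::PP;    # assumption: stands in for whichever module provided encode_json

# Hypothetical helper: reads "Type;id;key;value" records from a filehandle
# and returns one JSON string per Answer line, mirroring the posted loop.
sub pair_records {
    my ($fh) = @_;
    my %pairs;                                # declared before the loop
    my @out;
    my $json = JSON::PP->new->canonical;      # sorted keys give stable output
    while (my $line = <$fh>) {
        chomp $line;
        my @vals = split /;/, $line;          # loop-local, as suggested above
        next unless @vals >= 4;               # skip blank or short lines
        if ($vals[0] =~ /Query/) {
            $pairs{$vals[1]}{$vals[2]} = $vals[3];
        }
        elsif ($vals[0] =~ /Answer/) {
            $pairs{$vals[1]}{$vals[2]} = $vals[3];
            push @out, $json->encode($pairs{$vals[1]});
            delete $pairs{$vals[1]};          # same behaviour as the posted code
        }
    }
    return @out;
}
```

    Something like print "$_\n" for pair_records(\*STDIN); would then drive it from standard input.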
Re^3: Memory utilization and hashes
by Laurent_R (Canon) on Jan 17, 2018 at 22:11 UTC
    specifically there can be anywhere from 1 to n answers for each query
    Then you can't delete your hash entries as you go, because when a second answer comes in for a given query, you no longer have the information from the query available.

      See my last post at the end. I check for a repeat and when I detect it, I delete it and start over.

      Update:

      Hmmm, thinking about what you said and what I just said ...

      Maybe deleting the entry and starting over (though needed to get rid of the last answers) does not really solve my memory issue at all. You might actually be onto something there but I do not know how to fix it, given that it is the case ...

        In fact, all you need to store in your hash is the queries; there is no reason to store any answer in the hash, since you can process and print each answer immediately. This may help, but I'm not sure that it is sufficient to reduce your memory footprint to an acceptable level (it depends on your data profile).
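
        A rough sketch of that idea, under stated assumptions: only Query lines are stored, and each Answer line is merged with a copy of its query's fields and emitted right away, so answers never accumulate in the hash and nothing has to be deleted to handle a second answer. The sub name emit_answers, the callback, JSON::PP, and the "Type;id;key;value" layout are hypothetical; how multiple answers should appear in the output is also a guess.

```perl
use strict;
use warnings;
use JSON::PP;    # assumption: any JSON encoder would serve here

# Hypothetical helper: stores queries only and hands one JSON string per
# Answer line to the supplied callback as soon as that line is read.
sub emit_answers {
    my ($fh, $emit) = @_;
    my %query;                                # id => { key => value }, queries only
    my $json = JSON::PP->new->canonical;      # sorted keys give stable output
    while (my $line = <$fh>) {
        chomp $line;
        my ($type, $id, $key, $value) = split /;/, $line;
        next unless defined $value;           # skip blank or short lines
        if ($type =~ /Query/) {
            $query{$id}{$key} = $value;
        }
        elsif ($type =~ /Answer/) {
            # Copy the query fields and add this answer; %query is not
            # modified, so a later answer for the same id still finds it.
            my %record = (%{ $query{$id} // {} }, $key => $value);
            $emit->($json->encode(\%record));
        }
    }
    return;
}
```

        Note that the hash still grows with the number of distinct queries, so whether this is enough depends on how many queries a file contains.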
Re^3: Memory utilization and hashes
by Laurent_R (Canon) on Jan 17, 2018 at 21:39 UTC
    Even with the fixes that you made in the original post, you still have several syntax errors.