
Definitely go with a DBM approach as described above, to move the hash structure to disk. Apart from that, I'm wondering why you use two different hashes with identical keys (%total and %connects), and why you test a condition that would obviously never be false (if $total{foobar}{from} exists, there's no point testing whether $total{foobar}{to} doesn't exist, since "from" and "to" are both assigned at the same time).

I think the following would be equivalent to the OP code in terms of what it does, but might take less memory and might run a bit faster:

while (<LOG>) {
    my ($source, $sport, $to, $dport, $proto, $packs, $bytes) = split;
    my $key = "$source$dport";
    if ( exists( $total{$key}{from} )) {
        $total{$key}{connects}++;
        $total{$key}{bytes} += $bytes;
    }
    else {
        $total{$key} = { from => $source,
                         to => "$to:$dport",
                         bytes => $bytes,
                       };  # maybe should set 'connects => 1' as well?
    }
    $total += $bytes;
}
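
For the DBM suggestion, here is a minimal sketch of the same loop with the hash tied to disk. It assumes SDBM_File (which ships with Perl, but limits each key/value pair to roughly 1 KB) and a temporary directory as a stand-in for a real file location. DBM values are flat strings, so the sub-hash is packed into one tab-separated record; the sample lines, the record layout, the name $grand_total, and starting 'connects' at 1 are all illustrative choices, not something from the OP code.

```perl
use strict;
use warnings;
use Fcntl;                      # O_RDWR, O_CREAT
use SDBM_File;                  # core module; small per-record size limit
use File::Temp qw(tempdir);

# Hypothetical location for the on-disk hash (removed at exit).
my $dir = tempdir( CLEANUP => 1 );

tie my %total, 'SDBM_File', "$dir/totals", O_RDWR | O_CREAT, 0666
    or die "Cannot tie SDBM file: $!";

my $grand_total = 0;

# Made-up sample lines, in the field order the loop above splits on.
my @lines = (
    "10.0.0.1 1025 10.0.0.9 80 tcp 3 120",
    "10.0.0.1 1026 10.0.0.9 80 tcp 5 300",
    "10.0.0.2 1027 10.0.0.9 22 tcp 1 40",
);

for (@lines) {
    my ($source, $sport, $to, $dport, $proto, $packs, $bytes) = split;
    my $key = "$source$dport";
    if ( exists $total{$key} ) {
        # Unpack the flat record, update the counters, store it back.
        my ($from, $dest, $connects, $sum) = split /\t/, $total{$key};
        $total{$key} = join "\t", $from, $dest, $connects + 1, $sum + $bytes;
    }
    else {
        # Assumption: count the first sighting as one connect.
        $total{$key} = join "\t", $source, "$to:$dport", 1, $bytes;
    }
    $grand_total += $bytes;
}

# Print each record: key, from, to, connects, bytes.
for my $key ( sort keys %total ) {
    print join( "\t", $key, split /\t/, $total{$key} ), "\n";
}
```

Call untie %total when you are done with the data. If the records need to stay as real nested hashes, a module such as MLDBM from CPAN can serialize them for you, at some cost in speed.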

Here are a few (potentially meaningless) benchmarks about the trade-off between more top-level (simple, flat) hashes vs. a single top-level hash with more sub-hash keys (I put a "sleep" in there so I could study the memory/time consumption once the hashes were filled):

perl -e '$k="aaaaa";
 for $i (1..1_000_000)
 { $h1{$k}={foo=>"bar",bar=>"foo",iter=>$i}; $h1{$k}{total}++; $k++}
 sleep 20'
## consumes 344 MB in ~14.4 sec

perl -e '$k="aaaaa";
 for $i (1..1_000_000)
 {$h1{$k}={foo=>"bar",bar=>"foo",iter=>$i}; $h2{$k}++; $k++}
 sleep 20'
## consumes 352 MB in ~15.0 sec

perl -e '$k="aaaaa";
 for $i (1..1_000_000)
 {$h1{$k}={foo=>"bar",bar=>"foo"}; $h2{$k}++; $h3{$k}=$i; $k++}
 sleep 20'
## consumes 360 MB in ~16.5 sec

So given that you are using one HoH already, there's a slight advantage in not creating a second (or third) hash with the same set of primary keys -- better to add another key to the sub-hash instead.
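
If you want to repeat the timing side of that comparison without the sleep-and-watch approach, something like the following might do. It is only a sketch using the core Benchmark module: it measures time, not memory, the timings include freeing the hashes when each sub returns, and the sub names and the smaller iteration count are my own choices to keep the run short.

```perl
use strict;
use warnings;
use Benchmark qw(timethese);

my $n = 100_000;    # assumption: smaller than the 1_000_000 used above

# One hash of hashes; the counter lives in the sub-hash.
sub one_hoh {
    my %h1;
    my $k = "aaaaa";
    for my $i ( 1 .. $n ) {
        $h1{$k} = { foo => "bar", bar => "foo", iter => $i };
        $h1{$k}{total}++;
        $k++;           # magic string increment: aaaaa, aaaab, ...
    }
    return scalar keys %h1;
}

# Same data, but the counter lives in a second top-level hash.
sub two_hashes {
    my ( %h1, %h2 );
    my $k = "aaaaa";
    for my $i ( 1 .. $n ) {
        $h1{$k} = { foo => "bar", bar => "foo", iter => $i };
        $h2{$k}++;
        $k++;
    }
    return scalar keys %h1;
}

# Run each variant 5 times and report the elapsed times.
timethese( 5, { one_hoh => \&one_hoh, two_hashes => \&two_hashes } );
```

Expect the numbers to vary with Perl version and platform; the differences in the figures above are small enough that they could easily be noise.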

In reply to Re: How to save memory, parsing a big file. by graff
in thread How to save memory, parsing a big file. by idle
