a_salway has asked for the wisdom of the Perl Monks concerning the following question:

I'm working with a hash containing approx 16 million key-value pairs; keys are short strings, values are integers.

When the hash is being generated I see memory usage rise to about 3.1GB. I have tried several ways to (sort) and print the hash, but all of them seem to increase memory usage significantly. If I could avoid this increase, I could obviously work with a much bigger hash. Thanks in advance for any suggestions or explanations.

Here's what I've tried...

print %hash;

This added 1.6GB to memory usage and took about 6 minutes to execute, but the output format is unhelpful (keys and values run together with no separators or newlines).

my @keys = sort { $hash{$b} <=> $hash{$a} } keys %hash;
foreach my $key (@keys) {
    print "$key\t$hash{$key}\n";
}

This added 2GB to memory usage and took about 7 minutes, giving perfect sorted output for me.

print map { "$_ $summedNgrams{$_}\n" } keys %summedNgrams;

This added 3GB to memory usage and took about 8 minutes to give readable but unsorted output.

Running Perl on Cygwin64 under Windows 7. Timings are approximate; memory usage was observed in Windows Task Manager.

Replies are listed 'Best First'.
Re: What is the most memory efficient way to (sort) and print a hash?
by choroba (Cardinal) on May 29, 2014 at 16:59 UTC
    Use each to print the hash:
    while (my ($k, $v) = each %hash) { print "$k\t$v\n"; }

    Then pipe to shell's sort -k2n.
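
    Spelled out end to end, the two steps might look like this (a sketch: print_hash.pl is a hypothetical name for a script containing the loop above, and -k2rn gives the descending numeric order the OP's Perl sort produced; keep -k2n for ascending):

    # print the hash unsorted from Perl, then let the external sort order it
    perl print_hash.pl | LC_ALL=C sort -k2rn > sorted.tsv

    Since sort(1) spills to temporary files on disk for large inputs, the Perl process never needs more memory than the hash itself.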

    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

      Many thanks, choroba. That prints the hash with no further memory usage, which helps a lot. If you, or others, can suggest a better way to get sorted output then that would be a bonus.

        Define 'better' in relation to the suggested answer. There are trade-offs whether you use this solution or another, e.g., using a database as suggested elsewhere in this thread.
        "That prints the hash ... a ... way to get sorted output ... would be a bonus."

        So, what happens if you then actually (system) sort the output per choroba's original suggestion?

Re: What is the most memory efficient way to (sort) and print a hash?
by kennethk (Abbot) on May 29, 2014 at 17:55 UTC

    Another possibility beyond choroba's suggestion would be using a database. For large data structures, a database will often win out. Assuming you are committed to doing it all in memory, you could use DBD::SQLite in an in-memory context to avoid disk access issues.
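
    A minimal sketch of that approach (untested; the table and column names are mine, and %hash stands in for the OP's existing hash):

    use strict;
    use warnings;
    use DBI;

    # Connect to an in-memory SQLite database; nothing touches disk.
    my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
        { RaiseError => 1, AutoCommit => 0 });

    $dbh->do('CREATE TABLE ngrams (ngram TEXT PRIMARY KEY, cnt INTEGER)');

    my $ins = $dbh->prepare('INSERT INTO ngrams (ngram, cnt) VALUES (?, ?)');
    while (my ($k, $v) = each %hash) {    # %hash: the OP's 16M pairs
        $ins->execute($k, $v);
    }
    $dbh->commit;

    # Let SQLite do the sorting, streaming rows out one at a time.
    my $sth = $dbh->prepare('SELECT ngram, cnt FROM ngrams ORDER BY cnt DESC');
    $sth->execute;
    while (my ($k, $v) = $sth->fetchrow_array) {
        print "$k\t$v\n";
    }

    Swapping dbname=:memory: for a filename trades speed for a much smaller RAM footprint.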


    #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

Re: What is the most memory efficient way to (sort) and print a hash?
by marioroy (Prior) on Jul 29, 2024 at 04:41 UTC

    Greetings, a_salway! This is 10 years late! Recently, chuma posted a similar request, the Long List is Long challenge. For testing, I modified eyepopslikeamosquito's file generator to emit 16 million unique key-value pairs in parallel. Key names are 8 to 19 characters long, sufficient for uniqueness.

    use v5.30;
    use MCE;

    {
        # Generate a random word, inspired by eyepopslikeamosquito's
        # gen-llil.pl generator https://perlmonks.org/?node_id=11148681
        my $ordmin = ord('A');
        my $ordmax = ord('z') + 1;

        sub gen_random_word {
            my $nchar = shift;   # the number of random chars to append
            my $word  = "";
            for (1 .. $nchar) {
                my $ord = $ordmin + int( rand($ordmax - $ordmin) );
                # try again if 91 [, 92 \, 93 ], 94 ^, 95 _, or 96 `
                while ($ord > 90 && $ord < 97) {
                    $ord = $ordmin + int( rand($ordmax - $ordmin) );
                }
                $word .= chr($ord);
            }
            return $word;
        }
    }

    srand(42);

    MCE->new(
        max_workers => MCE::Util::get_ncpu(),
        sequence    => [ 1, 16e6 ],
        bounds_only => 1,
        chunk_size  => 8e3,
        init_relay  => 1,
        posix_exit  => 1,
        user_func   => sub {
            my ($mce, $chunk_ref, $chunk_id) = @_;
            my $seq_beg = $chunk_ref->[0];
            my $seq_end = $chunk_ref->[1];
            my $out = "";

            # Worker seeds generator using internal seed and wid value.
            # Compute similarly using the chunk_id value.
            my $seed = abs(MCE->seed - ($chunk_id * 1e5)) % 1073741780;
            srand($seed);

            for ($seq_beg .. $seq_end) {
                $out .= gen_random_word(int(rand(12) + 8));
                $out .= "\t" . int(rand(1e4) + 1) . "\n";
            }

            # Write output directly and orderly via relay.
            MCE::relay { print $out };
        }
    )->run;

    I tried three programs for sorting the output as in the Long List is Long challenge, i.e. count descending, then key names ascending. parsort is a program included with GNU parallel; another variant is mcesort, which has a mini-MCE integrated into the code.

    Testing was captured on an AMD Ryzen Threadripper 3970X machine, 3600 MHz DDR4. The output file md5sum is c6da8fd098664155c190782f1dc0ca7f.

    $ rm -f out; time perl demo.pl | LC_ALL=C sort -k2rn > out
    real    0m34.779s
    user    1m39.925s
    sys     0m00.555s

    $ rm -f out; time perl demo.pl | LC_ALL=C parsort -k2rn > out
    real    0m04.484s
    user    1m17.538s
    sys     0m00.575s

    $ rm -f out; time perl demo.pl | LC_ALL=C mcesort -k2rn > out
    real    0m04.287s
    user    1m54.859s
    sys     0m01.748s
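
    A side note on the flags: -k2rn alone keys on the numeric count, and the keys-ascending tie-break falls out of GNU sort's last-resort whole-line comparison (bytewise under LC_ALL=C). If you prefer both keys spelled out explicitly, something like this should be equivalent (my assumption, untested):

    perl demo.pl | LC_ALL=C sort -k2,2rn -k1,1 > out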

    Next, a C++ demonstration running llil4map_buf2.cc. Set SSO_LIMIT to 20 (around line 131) for best performance. Key names that, including the null terminator, exceed SSO_LIMIT are dynamically allocated from blocks of memory.

    $ time perl demo.pl > out.in
    real    0m01.396s
    user    1m16.659s
    sys     0m00.287s

    $ rm -f out; ./llil4map2 out.in > out
    llil4map start
      use OpenMP
      use boost sort
                            SSO_LIMIT 12    SSO_LIMIT 20
    get properties            0.166 secs      0.159 secs
    map to vector             0.055 secs      0.077 secs
    vector stable sort        0.201 secs      0.113 secs
    write stdout              0.097 secs      0.068 secs
    total time                0.521 secs      0.419 secs
        count lines  16000000
        count unique 16000000
Re: What is the most memory efficient way to (sort) and print a hash?
by zentara (Cardinal) on May 29, 2014 at 21:09 UTC
    Maybe a merge-sort?
    #!/usr/bin/perl -w
    # by david
    #
    # with 20 million rows, you probably don't want to store
    # everything in memory and then sort them. what you have to do is sort the
    # data file segment by segment and then merge them back. merging is the real
    # tricky business. the following script (which i did for someone a while ago)
    # will do that for you. what it does is break the file into multiple chunks
    # of 100000 lines, sort the chunks in a disk tmp file and then merge all the
    # chunks back together. when i sort the file, i keep the smallest boundary
    # of each chunk and use this number to sort the file so you don't have to
    # compare all the tmp files.
    #
    # there is also a merge sort in the PPT Perl Power Tools on cpan

    use strict;

    my @buffer  = ();
    my @tmps    = ();
    my %bounds  = ();
    my $counter = 0;

    # Split the input into sorted chunks of 100000 lines each.
    open( FILE, "file.txt" ) || die $!;
    while (<FILE>) {
        push( @buffer, $_ );
        if ( @buffer > 100000 ) {
            my $tmp = "tmp" . $counter++ . ".txt";
            push( @tmps, $tmp );
            sort_it( \@buffer, $tmp );
            @buffer = ();
        }
    }
    close(FILE);

    # Flush any remaining lines into a final sorted chunk
    # (the original post omitted this step).
    if (@buffer) {
        my $tmp = "tmp" . $counter++ . ".txt";
        push( @tmps, $tmp );
        sort_it( \@buffer, $tmp );
    }

    merge_it( \%bounds );
    unlink(@tmps);

    #-- DONE --#

    # Sort one chunk on the third whitespace-separated field and
    # remember the smallest value of each tmp file in %bounds.
    sub sort_it {
        my $ref   = shift;
        my $tmp   = shift;
        my $first = 1;
        open( TMP, ">$tmp" ) || die $!;
        for (
            sort {
                my @fields1 = split( /\s/, $a );
                my @fields2 = split( /\s/, $b );
                $fields1[2] <=> $fields2[2];
            } @{$ref}
          )
        {
            if ($first) {
                $bounds{$tmp} = ( split(/\s/) )[2];
                $first = 0;
            }
            print TMP $_;
        }
        close(TMP);
    }

    # Merge the sorted tmp files pairwise, lowest boundary first.
    sub merge_it {
        my $ref   = shift;
        my @files = sort { $ref->{$a} <=> $ref->{$b} } keys %{$ref};
        my $merged_to = $files[0];
        for ( my $i = 1 ; $i < @files ; $i++ ) {
            open( FIRST,  $merged_to ) || die $!;
            open( SECOND, $files[$i] ) || die $!;
            my $merged_tmp = "merged_tmp$i.txt";
            open( MERGED, ">$merged_tmp" ) || die $!;
            my $line1 = <FIRST>;
            my $line2 = <SECOND>;
            while (1) {
                if ( !defined($line1) && defined($line2) ) {
                    print MERGED $line2;
                    print MERGED while (<SECOND>);
                    last;
                }
                if ( !defined($line2) && defined($line1) ) {
                    print MERGED $line1;
                    print MERGED while (<FIRST>);
                    last;
                }
                last if ( !defined($line1) && !defined($line2) );
                my $value1 = ( split( /\s/, $line1 ) )[2];
                my $value2 = ( split( /\s/, $line2 ) )[2];
                if ( $value1 == $value2 ) {
                    print MERGED $line1;
                    print MERGED $line2;
                    $line1 = <FIRST>;
                    $line2 = <SECOND>;
                }
                elsif ( $value1 > $value2 ) {
                    while ( $value1 > $value2 ) {
                        print MERGED $line2;
                        $line2 = <SECOND>;
                        last unless ( defined $line2 );
                        $value2 = ( split( /\s/, $line2 ) )[2];
                    }
                }
                else {
                    while ( $value1 < $value2 ) {
                        print MERGED $line1;
                        $line1 = <FIRST>;
                        last unless ( defined $line1 );
                        $value1 = ( split( /\s/, $line1 ) )[2];
                    }
                }
            }
            close(FIRST);
            close(SECOND);
            close(MERGED);
            $merged_to = $merged_tmp;
        }
    }
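
    Note that the script as posted keys on the third whitespace-separated field, (split /\s/)[2]; to adapt it to the OP's two-column key/count lines, the field index in each split would change, e.g. (my assumption about the OP's tab-separated format):

    my $value1 = ( split( /\t/, $line1 ) )[1];    # count is the 2nd field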

    I'm not really a human, but I play one on earth.
    Old Perl Programmer Haiku ................... flash japh
Re: What is the most memory efficient way to (sort) and print a hash? (Judy::SL)
by tye (Sage) on May 29, 2014 at 18:16 UTC
Re: What is the most memory efficient way to (sort) and print a hash?
by a_salway (Initiate) on May 30, 2014 at 10:35 UTC

    Thanks to all, I've learnt a lot.

    In my rush to try the code suggested by choroba, I failed to notice the second step, i.e. the system sort. Duh!

    I've tried it now and it seems good for my purposes. I liked that it appeared to use all cores (it showed 99% CPU usage, versus 12% for perl.exe, which I assume means one of my 8 cores). I'm off now to learn more about the sort command; I need it to recognise tab separation.

      For sort, columns are whitespace-separated by default. If you want tabs only, use something like
      sort -t $'\t'
      ($'\t' being the bash way to write a literal tab).
      لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ