What is the most memory efficient way to (sort) and print a hash?

a_salway has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: What is the most memory efficient way to (sort) and print a hash? by choroba (Cardinal) on May 29, 2014 at 16:59 UTC
Use each to print the hash: `while (my ($k, $v) = each %hash) { print "$k\t$v\n"; }` [download] Then pipe to shell's `sort -k2n`. لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ	[reply] [d/l] [select]
Re^2: What is the most memory efficient way to (sort) and print a hash? by a_salway (Initiate) on May 29, 2014 at 17:33 UTC
Many thanks choroba. That prints the hash with no further memory usage which helps a lot. If you, or others, can suggest a better way to get sorted output then that would be a bonus.	[reply]
Re^3: What is the most memory efficient way to (sort) and print a hash? by runrig (Abbot) on May 29, 2014 at 18:25 UTC
Define 'better' in relation to the suggested answer. There are trade-offs whether you use this solution or another, e.g., using a database as suggested elsewhere in this thread.	[reply]
Re^3: What is the most memory efficient way to (sort) and print a hash? by AnomalousMonk (Archbishop) on May 29, 2014 at 17:51 UTC
That prints the hash ... a ... way to get sorted output ... would be a bonus. So, what happens if you then actually (system) sort the output per choroba's original suggestion?	[reply]
Re^3: What is the most memory efficient way to (sort) and print a hash? by no_slogan (Deacon) on May 29, 2014 at 18:14 UTC
File::Sort rather than `system "sort"`?	[reply]
Re: What is the most memory efficient way to (sort) and print a hash? by kennethk (Abbot) on May 29, 2014 at 17:55 UTC
Another possibility beyond choroba's suggestion would be using a database. For large data structures, a database will often win out. Assuming you are committed to doing it all in memory, you could use DBD::SQLite in an in-memory context to avoid disk access issues. #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.	[reply]
Re: What is the most memory efficient way to (sort) and print a hash? by marioroy (Prior) on Jul 29, 2024 at 04:41 UTC
Greetings, a_salway! This is 10 years late! Recently, chuma requested similarly the Long List is Long challenge. For testing, I modified eyepopslikeamosquito's file generator to emit 16 million unique key-value pairs, in parallel. The key-names length are 8 chars minimum ~ 19 chars maximum, sufficient for uniqueness. use v5.30; use MCE; { # Generate a random word, inspired by eyepopslikeamosquito's # gen-llil.pl generator https://perlmonks.org/?node_id=11148681 my $ordmin = ord('A'); my $ordmax = ord('z') + 1; sub gen_random_word { my $nchar = shift; # the number of random chars to append my $word = ""; for (1 .. $nchar) { my $ord = $ordmin + int( rand($ordmax - $ordmin) ); # try again if 91 [, 92 \, 93 ], 94 ^, 95 _, or 96 ` while ($ord > 90 && $ord < 97) { $ord = $ordmin + int( rand($ordmax - $ordmin) ); } $word .= chr($ord); } return $word; } } srand(42); MCE->new( max_workers => MCE::Util::get_ncpu(), sequence => [ 1, 16e6 ], bounds_only => 1, chunk_size => 8e3, init_relay => 1, posix_exit => 1, user_func => sub { my ($mce, $chunk_ref, $chunk_id) = @_; my $seq_beg = $chunk_ref->[0]; my $seq_end = $chunk_ref->[1]; my $out = ""; # Worker seeds generator using internal seed and wid value. # Compute similarly using the chunk_id value. my $seed = abs(MCE->seed - ($chunk_id * 1e5)) % 1073741780; srand($seed); for ($seq_beg .. $seq_end) { $out .= gen_random_word(int(rand(12) + 8)); $out .= "\t" . int(rand(1e4) + 1) . "\n"; } # Write output directly and orderly via relay. MCE::relay { print $out }; } )->run; [download] I tried three programs for sorting the output similarly to the Long List is Long challenge. That is count descending and key names ascending. `Parsort` is a program included with GNU parallel. Another variant is mcesort using mini-MCE integrated into the code. Testing was captured on an AMD Ryzen Threadripper 3970X machine, 3600 MHz DDR4. The output file `md5sum` is `c6da8fd098664155c190782f1dc0ca7f`. `$ rm -f out; time perl demo.pl \| LC_ALL=C sort -k2rn > out real 0m34.779s user 1m39.925s sys 0m00.555s $ rm -f out; time perl demo.pl \| LC_ALL=C parsort -k2rn > out real 0m04.484s user 1m17.538s sys 0m00.575s $ rm -f out; time perl demo.pl \| LC_ALL=C mcesort -k2rn > out real 0m04.287s user 1m54.859s sys 0m01.748s` [download] Next, a C++ demonstration running `llil4map_buf2.cc`. Set `SSO_LIMIT` to `20` for best performance, ~ line 131. Key names plus null character greater than `SSO_LIMIT` are dynamically allocated using blocks of memory. `$ time perl demo.pl > out.in real 0m01.396s user 1m16.659s sys 0m00.287s $ rm -f out; ./llil4map2 out.in > out llil4map start use OpenMP SSO_LIMIT 12 SSO_LIMIT 20 use boost sort get properties 0.166 secs 0.159 secs map to vector 0.055 secs 0.077 secs vector stable sort 0.201 secs 0.113 secs write stdout 0.097 secs 0.068 secs total time 0.521 secs 0.419 secs count lines 16000000 count unique 16000000` [download]	[reply] [d/l] [select]
Re: What is the most memory efficient way to (sort) and print a hash? by zentara (Cardinal) on May 29, 2014 at 21:09 UTC
Maybe a merge-sort? #!/usr/bin/perl -w #by david #with 20million rows, you probably don't want to store #everything in memory and then sort them. what you have to do is sort +the #data file segment by segment and then merge them back. merging is the + real #tricky business. the following script(which i did for someone a while + ago) #will do that for you. what it does is break the file into multiple ch +unks #of 100000 lines, sort the chunks in a disk tmp file and then merge al +l the #chunks back together. when i sort the file, i keep the smallest bound +ary #ofeach chunk and use this number to sort the file so you don't have t +o #compare all the tmp files. #there is also a merge sort in the PPT Perl Power Tools on cpan use strict; my @buffer = (); my @tmps = (); my %bounds = (); my $counter = 0; open( FILE, "file.txt" ) \|\| die $!; while (<FILE>) { push ( @buffer, $_ ); if ( @buffer > 100000 ) { my $tmp = "tmp" . $counter++ . ".txt"; push ( @tmps, $tmp ); sort_it( \@buffer, $tmp ); @buffer = (); } } close(FILE); merge_it( \%bounds ); unlink(@tmps); #-- DONE --# sub sort_it { my $ref = shift; my $tmp = shift; my $first = 1; open( TMP, ">$tmp" ) \|\| die $!; for ( sort { my @fields1 = split ( /\s/, $a ); my @fields2 = split ( /\s/, $b ); $fields1[2] <=> $fields2[2] } @{$ref} ) { if ($first) { $bounds{$tmp} = ( split (/\s/) )[2]; $first = 0; } print TMP $_; } close(TMP); } sub merge_it { my $ref = shift; my @files = sort { $ref->{$a} <=> $ref->{$b} } keys %{$ref}; my $merged_to = $files[0]; for ( my $i = 1 ; $i < @files ; $i++ ) { open( FIRST, $merged_to ) \|\| dir $!; open( SECOND, $files[$i] ) \|\| dir $!; my $merged_tmp = "merged_tmp$i.txt"; open( MERGED, ">$merged_tmp" ) \|\| die $!; my $line1 = <FIRST>; my $line2 = <SECOND>; while (1) { if ( !defined($line1) && defined($line2) ) { print MERGED $line2; print MERGED while (<SECOND>); last; } if ( !defined($line2) && defined($line1) ) { print MERGED $line1; print MERGED while (<FIRST>); last; } last if ( !defined($line1) && !defined($line2) ); my $value1 = ( split ( /\s/, $line1 ) )[2]; my $value2 = ( split ( /\s/, $line2 ) )[2]; if ( $value1 == $value2 ) { print MERGED $line1; print MERGED $line2; $line1 = <FIRST>; $line2 = <SECOND>; } elsif ( $value1 > $value2 ) { while ( $value1 > $value2 ) { print MERGED $line2; $line2 = <SECOND>; last unless ( defined $line2 ); $value2 = ( split ( /\s/, $line2 ) )[2]; } } else { while ( $value1 < $value2 ) { print MERGED $line1; $line1 = <FIRST>; last unless ( defined $line1 ); $value1 = ( split ( /\s/, $line1 ) )[2]; } } } close(FIRST); close(SECOND); close(MERGED); $merged_to = $merged_tmp; } } [download] I'm not really a human, but I play one on earth. Old Perl Programmer Haiku ................... flash japh	[reply] [d/l]
Re: What is the most memory efficient way to (sort) and print a hash? (Judy::SL) by tye (Sage) on May 29, 2014 at 18:16 UTC
Judy::SL. - tye	[reply]
Re: What is the most memory efficient way to (sort) and print a hash? by a_salway (Initiate) on May 30, 2014 at 10:35 UTC
Thanks to all, I've learnt a lot. In my rush to try the code suggested by choroba, I failed to notice the second step, i.e. system sort, duh!. I've tried it now and it seems good for my purposes - I liked that it seemed to be able to use all cores (showed 99% CPU usage, versus 12% for perl.exe which I assume is just one of my 8 cores). I'm off now to learn more about the sort command - I need it to recognise tab separation.	[reply]
Re^2: What is the most memory efficient way to (sort) and print a hash? by choroba (Cardinal) on May 30, 2014 at 10:37 UTC
For sort, columns are whitespace separated. If you want only tabs, use something like `sort -t $'\t'` [download] (the bash way). لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ	[reply] [d/l]