
Maybe it is not worth anything, but this just exploits a loophole: the keys are a ridiculous fixed length of 6 bytes from a..z, and every count fits in a single byte. $r is there only to quickly convert between decimal and base-27, nothing more. Sure, it all could be optimized. Excluding 0 from the base-27 representation (i.e. using base 27 instead of 26) avoids padding with leading zeroes. It follows that the $buf_in length is a wasteful 387,420,489 instead of 308,915,776, but, what the heck, I can be royally wasteful with this implementation. Counting down from 255 instead of just incrementing from 0 is a further silly optimization that I see this late in the day, but I have surely overlooked better improvements. Of course, $MAX is 27**6 (or would be 26**6).
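To illustrate the zero-free base-27 mapping without Math::GMPz, here is a minimal core-Perl sketch. The helper names word_to_index and index_to_word are made up for this example and are not part of the script below; it assumes keys of 1 to 6 lowercase a..z letters.

```perl
use strict;
use warnings;

# Map an a..z word to an integer using digits 1..26 (no zero digit),
# so 'a' and 'aa' get different indexes without any padding.
sub word_to_index {
    my ($word) = @_;
    my $n = 0;
    $n = $n * 27 + ( ord($_) - ord('a') + 1 ) for split //, $word;
    return $n;
}

# Inverse mapping; only valid for indexes produced by word_to_index
# (an index containing a zero digit has no corresponding word).
sub index_to_word {
    my ($n) = @_;
    my $word = '';
    while ( $n > 0 ) {
        $word = chr( $n % 27 + ord('a') - 1 ) . $word;
        $n    = int( $n / 27 );
    }
    return $word;
}

printf "%s -> %d\n", $_, word_to_index($_) for qw/a aa zzzzzz/;
```

The largest key, 'zzzzzz', maps to 27**6 - 1, so a buffer with one byte per possible index needs 27**6 bytes.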

Can't confirm what eyepopslikeamosquito says about memory: with the original llil.pl I see the Working Set go to approx. 2.9 GB. Mine doesn't exceed ~530 MB.

llil start
get_properties : 15 secs
sort + output  : 103 secs
total          : 118 secs

my_test start
get_properties : 31 secs
sort + output: 10 secs
total: 41 secs

...

use strict;
use warnings;
use feature 'say';

use Math::GMPz ':mpz';
use Sort::Packed 'sort_packed';

@ARGV or die "usage: $0 file...\n";
my @llil_files = @ARGV;

warn "my_test start\n";
my $tstart1 = time;

my $r = Rmpz_init;

Rmpz_set_str( $r, 'zzzzzz' =~ tr/a-z1-9/1-9a-z/r, 27 );
# 'zzzzzz' maps to 27**6 - 1, so add 1 to give it its own slot in the buffer
my $MAX = Rmpz_get_ui( $r ) + 1;

my ( $buf_in, $buf_out ) = ( "\xFF" x $MAX, '' );

for my $fname ( @llil_files ) {
    open( my $fh, '<', $fname ) or die "error: open '$fname': $!";
    while ( <$fh> ) {
        chomp;
        my ( $word, $count ) = split /\t/;
        
        $word =~ tr/a-z1-9/1-9a-z/;
        Rmpz_set_str( $r, $word, 27 );
        vec( $buf_in, Rmpz_get_ui( $r ), 8 ) -= $count;
    }
    close( $fh ) or die "error: close '$fname': $!";
}

while ( $buf_in =~ /[^\xFF]/g ) {
    Rmpz_set_ui( $r, @- );
    $buf_out .= pack 'aa6', $&, Rmpz_get_str( $r, 27 ) =~ tr/1-9a-z/a-z/r;
}

my $tend1 = time;
warn "get_properties : ", $tend1 - $tstart1, " secs\n";

my $tstart2 = time;
sort_packed C7 => $buf_out;
while ( $buf_out ) {
    my ( $count, $word ) = unpack 'Ca6', substr $buf_out, 0, 7, '';
    printf "%s\t%d\n", $word, 255 - $count
}
    
my $tend2 = time;
warn "sort + output: ", $tend2 - $tstart2, " secs\n";
warn "total: ", $tend2 - $tstart1, " secs\n";
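
The count-down trick can be seen in isolation with a small core-Perl sketch. It uses made-up sample data and the built-in sort instead of Sort::Packed, so it only illustrates the ordering, not the speed. As in the script above, it assumes no count exceeds 255.

```perl
use strict;
use warnings;

# Made-up sample counts, for illustration only.
my %seen = ( foobar => 3, barbaz => 3, aaaaaa => 7 );

# One 7-byte record per word: inverted count byte first, then the 6-byte key.
my @records = map { pack 'Ca6', 255 - $seen{$_}, $_ } keys %seen;

# A plain bytewise ascending sort now yields count descending, word ascending.
my @lines;
for my $rec ( sort @records ) {
    my ( $inv, $word ) = unpack 'Ca6', $rec;
    push @lines, sprintf "%s\t%d", $word, 255 - $inv;
}
print "$_\n" for @lines;
```

Because the inverted count is the first byte of each record, a single ascending sort gives both orderings at once, with no custom comparison needed.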

What follows is fiction, not implemented in code, and can be ignored. I said 'ridiculous' above, but in fact I do remember the original LLIL thread; I'm not sure now, but back then I thought keys were expected to be significantly longer than qw/foo bar aaaaaa/, etc. (genetic sequences?). That would mean a multi-GB total of input files consisting mostly of keys, so just keeping the keys in RAM is out of the question, not to mention building hashes, keeping them, and working with them.

I thought about high-quality hashing (xxHash?) of keys and sparsely storing (Judy?) values indexed by the resulting integer, where each value is e.g. a 64-bit packed integer comprising:

file id
offset of start of line containing unique key first seen (tell)
count (updated as files are read in)

After all files are consumed, the positions of values within the array (i.e. the indexes) are no longer important. If the densely packed array data (i.e. with zero values discarded) fits in RAM, then the problem is solved: sort the packed data and produce the output, which, yes, would require randomly reading lines (i.e. the real keys) from the input files again, based on the stored file id and line position.
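
A rough sketch of what such a packed value might look like, still fiction. The field widths (16-bit count, 8-bit file id, 40-bit offset) and the helper names pack_value and unpack_value are assumptions made up for this example, and it needs a perl built with 64-bit integers.

```perl
use strict;
use warnings;

# Hypothetical layout, not from any real implementation:
# [ count: 16 bits | file id: 8 bits | line offset: 40 bits ]
use constant { OFFSET_BITS => 40, FILE_BITS => 8 };

sub pack_value {
    my ( $file_id, $offset, $count ) = @_;
    return ( $count << ( FILE_BITS + OFFSET_BITS ) )
         | ( $file_id << OFFSET_BITS )
         | $offset;
}

sub unpack_value {
    my ($v) = @_;
    my $offset  = $v & ( ( 1 << OFFSET_BITS ) - 1 );
    my $file_id = ( $v >> OFFSET_BITS ) & ( ( 1 << FILE_BITS ) - 1 );
    my $count   = $v >> ( FILE_BITS + OFFSET_BITS );
    return ( $file_id, $offset, $count );
}

# Key first seen in file 2 at byte offset 123_456_789 (as returned by tell).
my $v = pack_value( 2, 123_456_789, 1 );

# Same key seen again: bump only the count field.
$v += 1 << ( FILE_BITS + OFFSET_BITS );

my ( $file_id, $offset, $count ) = unpack_value($v);
print "file=$file_id offset=$offset count=$count\n";
```

Putting the count in the high bits is a choice made here so that a plain numeric sort of the values orders them by count; ties would still need the real keys, read back with seek on the stored file id and offset.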

In reply to Re: Rosetta Code: Long List is Long by Anonymous Monk
in thread Rosetta Code: Long List is Long by eyepopslikeamosquito
