For reference, the results of llil2d.pl (11148585) on my hardware are:
llil2d start
get_properties : 10 secs
sort + output : 21 secs
total : 31 secs
2317524 Kbytes of RAM were used
(the resident RAM figure was produced with the same few lines of code as in the script below)
In fact, the "words" are so short (and few) that I realized later, after I wrote 11148660 with results there, that I don't need any indirection and can simply use Judy::HS: the "words" are kept in RAM all the time anyway, both in my in-memory flat "DB" and, opaquely, somewhere inside the Judy code.
And then there's a solution in Perl which is faster, requires much less memory, and generates identical output:
my_test start
get_properties : 13 secs
sort + output : 7 secs
total : 20 secs
349124 Kbytes of RAM were used
A short (16-bit) integer for the count is not required in this example; it only demonstrates that the count can be any width, not limited to one byte as in my previous "solution" in this thread. The same goes for the 10 bytes the "words" are padded to: it can be anything that fits the longest word, and words can differ in length.
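The complemented count is the trick that lets an ascending byte-wise sort of the packed records produce descending counts. A minimal core-Perl sketch (plain sort stands in for Sort::Packed, and the words are made up):

```perl
use strict;
use warnings;

my $COUNT_MAX = 2 ** 16 - 1;    # 0xFFFF

# store each count complemented, then the word NUL-padded to 10 bytes
my @pairs = ( [ flour => 3 ], [ salt => 11 ], [ yeast => 7 ] );
my @recs  = map { pack 'nZ10', $COUNT_MAX - $_->[1], $_->[0] } @pairs;

# an ascending sort of fixed-size records (what sort_packed does in C,
# in place) orders complemented counts ascending, i.e. real counts
# descending; big-endian 'n' makes byte order match numeric order
for my $rec ( sort @recs ) {
    my ( $c, $w ) = unpack 'nZ10', $rec;
    printf "%s\t%d\n", $w, $COUNT_MAX - $c;    # salt 11, yeast 7, flour 3
}
```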
use strict;
use warnings;
use Judy::HS qw/ Set Get Free /;
use Sort::Packed 'sort_packed';
my $DATA_TEMPLATE = 'nZ10';   # 16-bit big-endian count, word NUL-padded to 10 bytes
my $DATA_SIZE = 12;           # bytes per packed record
my $COUNT_SIZE_BYTES = 2;
my $COUNT_SIZE_BITS = 16;
my $COUNT_MAX = int( 2 ** $COUNT_SIZE_BITS - 1 );   # counts are stored as $COUNT_MAX - count
@ARGV or die "usage: $0 file...\n";
my @llil_files = @ARGV;
warn "my_test start\n";
my $tstart1 = time;
my ( $data, $current ) = ( '', 0 );
my $judy;
for my $fname ( @llil_files ) {
    open( my $fh, '<', $fname ) or die $!;
    while ( <$fh> ) {
        chomp;
        my ( $word, $count ) = split /\t/;
        ( undef, my $val ) = Get( $judy, $word );
        if ( defined $val ) {
            vec( $data, $val * $DATA_SIZE / $COUNT_SIZE_BYTES,
                $COUNT_SIZE_BITS ) -= $count
        }
        else {
            $data .= pack $DATA_TEMPLATE, $COUNT_MAX - $count, $word;
            Set( $judy, $word, $current );
            $current ++
        }
    }
}
Free( $judy );
my $tend1 = time;
warn "get_properties : ", $tend1 - $tstart1, " secs\n";
my $tstart2 = time;
sort_packed "C$DATA_SIZE", $data;
while ( $data ) {
    my ( $count, $word )
        = unpack $DATA_TEMPLATE, substr $data, 0, $DATA_SIZE, '';
    printf "%s\t%d\n", $word, $COUNT_MAX - $count
}
my $tend2 = time;
warn "sort + output : ", $tend2 - $tstart2, " secs\n";
warn "total : ", $tend2 - $tstart1, " secs\n";
use Memory::Usage;
my $m = Memory::Usage-> new;
$m-> record;
warn $m-> state-> [0][3], " Kbytes of RAM were used\n";
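The in-place counter update works because vec() addresses the packed string as a flat array of 16-bit big-endian cells: record $val starts at byte $val * $DATA_SIZE, which is cell $val * $DATA_SIZE / $COUNT_SIZE_BYTES, and its first cell is the complemented count. A core-Perl sketch of that arithmetic with two toy records:

```perl
use strict;
use warnings;

my ( $DATA_SIZE, $COUNT_SIZE_BYTES, $COUNT_SIZE_BITS ) = ( 12, 2, 16 );
my $COUNT_MAX = 2 ** $COUNT_SIZE_BITS - 1;

# two 12-byte records, counts stored complemented
my $data = pack( 'nZ10', $COUNT_MAX - 5, 'alpha' )
         . pack( 'nZ10', $COUNT_MAX - 2, 'beta' );

# bump record 1 ('beta') by 4: subtracting from the complement
# is the same as adding to the real count
my $val = 1;
vec( $data, $val * $DATA_SIZE / $COUNT_SIZE_BYTES, $COUNT_SIZE_BITS ) -= 4;

my ( $c, $w ) = unpack 'nZ10', substr $data, $val * $DATA_SIZE, $DATA_SIZE;
printf "%s now has count %d\n", $w, $COUNT_MAX - $c;   # beta now has count 6
```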
What if the "words" are significantly longer? With approx. 10e6 unique words in this test, if each were hundreds of bytes long, then several GB of RAM would be used just to keep them. Perhaps impractical.
So I'm returning to my idea of keeping only hashes and offsets within the files. The results I posted in 11148660 are of course still valid; there I was using 64-bit hashes as Judy::L integer keys.
In fact, 32-bit hashes generated with xxHash have a few collisions for 10e6 6-character words, while 64-bit hashes have none. I'm not qualified to predict for what set size it is safe to expect no collisions; maybe the 128-bit hashes below are overkill.
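The standard birthday-problem estimate gives a rough feel for this: with n keys and b-bit hashes, the expected number of colliding pairs is about n**2 / 2**(b+1). Core-Perl arithmetic, assuming n = 1e7 unique words (the exact counts here are estimates, not measurements):

```perl
use strict;
use warnings;

my $n = 1e7;    # assumed number of unique words
my %expected;
for my $bits ( 32, 64, 128 ) {
    # birthday bound: expected colliding pairs ~= n**2 / 2**(bits+1)
    $expected{ $bits } = $n ** 2 / 2 ** ( $bits + 1 );
    printf "%3d-bit hash: ~%.3g expected colliding pairs\n",
        $bits, $expected{ $bits };
}
# on the order of 1e4 pairs at 32 bits, ~3e-6 at 64, ~1e-25 at 128:
# collisions are certain at 32 bits, negligible at 64, 128 is overkill
```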
Just for entertainment I decided to write an "indirect" solution with 128-bit hashes. Therefore I'm not using Judy::L, but the same Judy::HS, with keys being 32-character hex-encoded hashes. Otherwise it is almost the same plan and code. The data template layout was chosen arbitrarily and can be adjusted.
Of course, the words are not alphabetically sorted in the output, but OTOH that was not an original LLIL requirement. Results:
my_test start
get_properties : 21 secs
sort + output : 23 secs
total : 44 secs
841880 Kbytes of RAM were used
I think the same amount of RAM would be used if the words were not 6 but 600 or 6000 bytes long. The relatively fast 2nd phase (considering the huge number of random reads) is due to NVMe storage here.
(Off topic: I managed to install Crypt::xxHash under Windows/Strawberry, but Judy was too much of a challenge for me. I wrote a solution very close to the code below, using not Judy but a very thin Inline::C wrapper around an AVL library (google for jsw_avltree). It uses approx. the same RAM and ~2x the time for the 1st phase, but also ~2.5x the time for the 2nd phase, which is exactly the same code on the same hardware (dual boot). I don't know if Windows I/O is really that much slower.)
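The 2nd phase rereads each word straight from the input files by (file index, offset, length) instead of keeping it in RAM. A self-contained core-Perl sketch of that retrieval, using a throwaway temp file with made-up data in place of the real inputs:

```perl
use strict;
use warnings;
use File::Temp 'tempfile';

# build a small tab-separated file, remembering each word's offset/length
my ( $out, $fname ) = tempfile( UNLINK => 1 );
my %where;
for my $pair ( [ tango => 3 ], [ victor => 9 ] ) {
    $where{ $pair->[0] } = [ tell $out, length $pair->[0] ];
    print $out "$pair->[0]\t$pair->[1]\n";
}
close $out;

# reread one word by raw offset + length, as the output phase does
open my $in, '<', $fname or die $!;
my ( $pos, $len ) = @{ $where{ victor } };
sysseek $in, $pos, 0;           # 0 == SEEK_SET, absolute position
sysread $in, my( $word ), $len; # read exactly the word's bytes
print "$word\n";                # victor
```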
use strict;
use warnings;
use Judy::HS qw/ Set Get Free /;
use Crypt::xxHash 'xxhash3_128bits_hex';
use Sort::Packed 'sort_packed';
my $DATA_TEMPLATE = 'nnNn';   # word count, file index, word position, word length
my $DATA_SIZE = 10;           # bytes per packed record
my $COUNT_SIZE_BYTES = 2;
my $COUNT_SIZE_BITS = 16;
my $COUNT_MAX = int( 2 ** $COUNT_SIZE_BITS - 1 );
@ARGV or die "usage: $0 file...\n";
my @llil_files = @ARGV;
warn "my_test start\n";
my $tstart1 = time;
my ( $data, $current ) = ( '', 0 );
my $judy;
for my $idx ( 0 .. $#llil_files ) {
    open( my $fh, '<', $llil_files[ $idx ]) or die $!;
    until ( eof $fh ) {
        my $pos = tell $fh;
        $_ = <$fh>;
        chomp;
        my ( $word, $count ) = split /\t/;
        my $xx = xxhash3_128bits_hex( $word, 0 );
        ( undef, my $val ) = Get( $judy, $xx );
        if ( defined $val ) {
            vec( $data, $val * $DATA_SIZE / $COUNT_SIZE_BYTES,
                $COUNT_SIZE_BITS ) -= $count
        }
        else {
            $data .= pack $DATA_TEMPLATE,
                $COUNT_MAX - $count,
                $idx,
                $pos,
                length $word;
            Set( $judy, $xx, $current );
            $current ++
        }
    }
}
Free( $judy );
my $tend1 = time;
warn "get_properties : ", $tend1 - $tstart1, " secs\n";
my $tstart2 = time;
sort_packed "C$DATA_SIZE", $data;
my @fh;
open( $fh[ $_ ], '<', $llil_files[ $_ ] ) or die $! for 0 .. $#llil_files;
while ( $data ) {
    my ( $count, $idx, $pos, $len )
        = unpack $DATA_TEMPLATE, substr $data, 0, $DATA_SIZE, '';
    sysseek $fh[ $idx ], $pos, 0;
    sysread $fh[ $idx ], my( $word ), $len;
    printf "%s\t%d\n", $word, $COUNT_MAX - $count
}
my $tend2 = time;
warn "sort + output : ", $tend2 - $tstart2, " secs\n";
warn "total : ", $tend2 - $tstart1, " secs\n";
use Memory::Usage;
my $m = Memory::Usage-> new;
$m-> record;
warn $m-> state-> [0][3], " Kbytes of RAM were used\n";