Re: Solving the Long List is Long challenge

There is one missing in the mix :)

Kyoto Cabinet is the successor to Tokyo Cabinet. This new variant also creates 32 hash databases (by default) in a temp directory. Crypt::xxHash determines which database each key is inserted into or updated in. Sorting is handled by Sort::Packed. Like the DB_File variant, this is best run on a Unix OS.
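To illustrate how a key is routed to one of the databases, here is a minimal sketch. It assumes Digest::MD5 (a core module) as a stand-in for the xxhash64 call used in the script, so that it runs without extra installs; the shard_index helper and the sample keys are hypothetical.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5);

my $NUM_MAPS = 32;   # same default as the script

# Stand-in for xxhash64($key, 0): read the first 4 bytes of the MD5
# digest as an unsigned 32-bit integer, then reduce modulo the map count.
sub shard_index {
    my ($key) = @_;
    return unpack('N', md5($key)) % $NUM_MAPS;
}

# The same key always lands in the same map, so every count for that
# key ends up in a single database.
for my $key (qw(apple banana cherry apple)) {
    printf "%-8s -> map %2d\n", $key, shard_index($key);
}
```

Any hash function with a reasonably even spread works here; the only requirement is that every worker computes the same index for the same key.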

Usage: KEYSIZE=N NUM_THREADS=N NUM_MAPS=N perl llilkch.pl file...
       perl llilkch.pl --keysize=N --threads=N --maps=N file...
       perl llilkch.pl --keysize=N --threads=max --maps=max file...

Running:

$ perl llilkch.pl big{1,2,3}.txt | cksum
Kyoto Cabinet hash database - start
fixed string length=12, threads=8, maps=32
get properties   :     3.348 secs
pack properties  :     1.187 secs
sort packed data :     0.960 secs
write stdout     :     0.775 secs
total time       :     6.273 secs
    count lines  : 10545600
    count unique : 10367603
2956888413 93308427

$ perl llilkch.pl --threads=16 --maps=64 big{1,2,3}.txt | cksum
Kyoto Cabinet hash database - start
fixed string length=12, threads=16, maps=64
get properties   :     1.925 secs
pack properties  :     0.723 secs
sort packed data :     0.965 secs
write stdout     :     0.387 secs
total time       :     4.005 secs
    count lines  : 10545600
    count unique : 10367603
2956888413 93308427

$ perl llilkch.pl --threads=24 --maps=64 big{1,2,3}.txt | cksum
Kyoto Cabinet hash database - start
fixed string length=12, threads=24, maps=64
get properties   :     1.420 secs
pack properties  :     0.538 secs
sort packed data :     0.975 secs
write stdout     :     0.286 secs
total time       :     3.225 secs
    count lines  : 10545600
    count unique : 10367603
2956888413 93308427

$ perl llilkch.pl --threads=48 --maps=max big{1,2,3}.txt | cksum
Kyoto Cabinet hash database - start
fixed string length=12, threads=48, maps=128
get properties   :     0.908 secs
pack properties  :     0.372 secs
sort packed data :     0.969 secs
write stdout     :     0.205 secs
total time       :     2.462 secs
    count lines  : 10545600
    count unique : 10367603
2956888413 93308427

llilkch.pl

##
# This demonstration requires Kyoto Cabinet.
#   homepage      http://fallabs.com/kyotocabinet/
#   documentation http://fallabs.com/kyotocabinet/perldoc/
#
# Installation:
#   wget http://fallabs.com/kyotocabinet/pkg/kyotocabinet-1.2.80.tar.gz
#   wget http://fallabs.com/kyotocabinet/perlpkg/kyotocabinet-perl-1.20.tar.gz
#
#   macos: please refer to https://perlmonks.org/?node_id=1198574 for tips
#
#   tar xzf kyotocabinet-1.2.80.tar.gz && cd kyotocabinet-1.2.80
#   ./configure --disable-lzo --disable-lzma  # enabling requires lzo/lzma dev pkgs
#   make
#   make install  # Note: you may need to use "sudo"
#   cd ..
#
#   tar xzf kyotocabinet-perl-1.20.tar.gz && cd kyotocabinet-perl-1.20
#   perl Makefile.PL
#   make
#   make install  # Note: you may need to use "sudo"
#   cd ..
##

use strict;
use warnings;
no warnings 'uninitialized';

use KyotoCabinet;
use Crypt::xxHash qw(xxhash64);
use Sort::Packed qw(sort_packed);
use Time::HiRes qw(time);
use MCE::Signal qw($tmp_dir -use_dev_shm);
use MCE;

sub usage {
  die "Usage: [KEYSIZE=N] [NUM_THREADS=N] [NUM_MAPS=N] perl $0 file...\n".
      "       perl $0 [--keysize=N] [--threads=N] [--maps=N] file...\n".
      "       perl $0 [--keysize=N] [--threads=max] [--maps=max] file...\n";
}

@ARGV or usage();

my $NUM_CPUS = MCE::Util->get_ncpu();
my $KEY_SIZE = $ENV{KEYSIZE}     || 12;
my $NUM_THDS = $ENV{NUM_THREADS} || 8;
my $NUM_MAPS = $ENV{NUM_MAPS}    || 32;

while ($ARGV[0] =~ /^--?/) {
    my $arg = shift;
    $KEY_SIZE = $1,        next if $arg =~ /-keysize=(\d+)$/;
    $NUM_THDS = $1,        next if $arg =~ /-threads=(\d+)$/;
    $NUM_THDS = $NUM_CPUS, next if $arg =~ /-threads=max$/;
    $NUM_MAPS = $1,        next if $arg =~ /-maps=(\d+)$/;
    $NUM_MAPS = 128,       next if $arg =~ /-maps=max$/;
    usage();
}   

$NUM_THDS = $NUM_CPUS if $NUM_THDS > $NUM_CPUS;
$NUM_MAPS = 128 if ($NUM_MAPS > 128);

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Setup.
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

print {*STDERR} "Kyoto Cabinet hash database - start\n";
print {*STDERR} "fixed string length=${KEY_SIZE}, threads=${NUM_THDS}, maps=${NUM_MAPS}\n";

our @MM;
# Let's have Kyoto Cabinet handle locking because we're not doing FETCH/STORE.
# Instead, we're calling "increment" to increment the value (single call).

# if ($^O =~ /cygwin|MSWin32/) {
#     # On Cygwin, use Channel instead for better performance.
#     $MM[$_] = MCE::Mutex->new(impl => "Channel")
#         for (0 .. $NUM_MAPS - 1);
# } else {
#     $MM[$_] = MCE::Mutex->new(impl => "Flock", path => "$tmp_dir/$_.sem")
#         for (0 .. $NUM_MAPS - 1);
# }

# Open DB function.
# Each child must open the DB file separately.

sub open_db {
    my ($idx, $omode) = @_;
    # hash (*.kch db) is faster than tree (*.kct db) for this demonstration

    my $db   = KyotoCabinet::DB->new();
    my $path = "$tmp_dir/$idx.kch#bnum=500000";  # Hash database
   #my $path = "$tmp_dir/$idx.kct#pccap=128m";   # B+ Tree database

    $db->open($path, $omode) or die "Open error '$path': " . $db->error;

    return $db;
}

# Create the databases.
{
    open_db($_, KyotoCabinet::DB::OWRITER | KyotoCabinet::DB::OCREATE)
        for (0 .. $NUM_MAPS - 1);
}

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~
# Get properties.
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~

my ($start1, $process_count, $num_lines) = (time, 0, 0);
our ($NUM_LINES, %TEMP) = (0); # child vars

my $mce = MCE->new(
    max_workers => $NUM_THDS,
    chunk_size  => 65536,
    gather      => sub { $num_lines += $_[0] },
    posix_exit  => 1,
    use_slurpio => 1,

    user_func   => sub {
        my ($mce, $slurp_ref, $chunk_id) = @_;
        open my $input_fh, '<', $slurp_ref;
        while (<$input_fh>) {
            my ($key, $count) = split /\t/;
            my $idx = xxhash64($key, 0) % $NUM_MAPS;
            $TEMP{$idx}{$key} += $count;
            $NUM_LINES++;
        }
        close $input_fh;
    },

    user_end    => sub {
        my $omode = KyotoCabinet::DB::OREADER | KyotoCabinet::DB::OWRITER;
        for my $idx (keys %TEMP) {

            # Acquire the lock before opening the DB file. Must also close.
            # $MM[$idx]->lock_exclusive;
            # my $db = open_db($idx, $omode);
            # while (my ($key, $count) = each %{ $TEMP{$idx} }) {
            #     my $val = $db->get($key);
            #     $db->set($key, $val + $count);
            # }
            # $db->close;
            # $MM[$idx]->unlock;

            my $db = open_db($idx, $omode);
            while (my ($key, $count) = each %{ $TEMP{$idx} }) {
                $db->increment($key, $count);
            }
            $db->close;
        }
        MCE->gather($NUM_LINES);
        $NUM_LINES = 0, %TEMP = ();
    },

);

for my $fname (@ARGV) {
    warn("'$fname': Is a directory, skipping\n"), next if (-d $fname);
    warn("'$fname': No such file, skipping\n"), next unless (-f $fname);
    warn("'$fname': Permission denied, skipping\n"), next unless (-r $fname);

    ++$process_count, $mce->process($fname) if (-s $fname);
}

$mce->shutdown; # reap workers

printf {*STDERR} "get properties   : %9.3f secs\n", time - $start1;
exit unless $process_count;

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Pack data for sorting.
# Each worker handles a unique DB.
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

my $VAL_SIZE   = length pack('l', 0);
my $STR_SIZE   = $KEY_SIZE + 1; # null-terminated
my $PACK_SIZE  = $STR_SIZE + $VAL_SIZE;
my $FETCH_SIZE = $PACK_SIZE * 12000;

sub pack_task {
    my ($mce, $seq_id, $chunk_id) = @_;
    my $db = open_db($seq_id, KyotoCabinet::DB::OREADER);
    my ($num_rows, $kv_pairs) = (0, '');

    # The values were written with "increment" above. Kyoto Cabinet
    # serializes such a value as an 8-byte integer in big-endian order,
    # so decode it with unpack 'q>' after retrieval.

    my $cur = $db->cursor;  $cur->jump;
    my ($key, $val);
    while (($key, $val) = $cur->get(1)) {
        $num_rows += 1;
       #$kv_pairs .= pack("lZ${STR_SIZE}", -($val), $key);
        $kv_pairs .= pack("lZ${STR_SIZE}", -(unpack 'q>', $val), $key);
    }
    }
    $cur->disable;

    $mce->gather($num_rows, $kv_pairs);
}

my ($start2, $unique, $data) = (time, 0, '');

# Spin up MCE workers to handle packing and output.
$mce = MCE->new(
    max_workers => $NUM_THDS,
    chunk_size  => 1,
    init_relay  => 1,
    posix_exit  => 1,
    user_func   => sub {
        my $task = MCE->user_args->[0];
        no strict 'refs';
        $task->(@_);
    },
);

# Pack data for sorting.
$mce->process({
    user_args  => [ 'pack_task' ],
    sequence   => [ 0, $NUM_MAPS - 1 ],
    chunk_size => 1,
    gather     => sub {
        $unique += $_[0];
        $data   .= $_[1];
    },
});

printf {*STDERR} "pack properties  : %9.3f secs\n", time - $start2;

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Output data by value descending, word ascending.
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

# Get the next down value for integer division.
sub divide_down {
    my ($dividend, $divisor) = @_;
    return int($dividend / $divisor) if $dividend % $divisor;
    return int($dividend / $divisor) - 1;
}

# Return a chunk of $data: manager responding to worker request.
sub fetch_chunk {
    my ($seq_id) = @_;
    return substr($data, $seq_id * $FETCH_SIZE, $FETCH_SIZE);
}

# Worker task: unpack chunk and write directly to standard output.
sub disp_task {
    my ($mce, $seq_id, $chunk_id) = @_;
    my ($output, $chunk) = ('', $mce->do('fetch_chunk', $seq_id));
    while (length $chunk) {
        my ($val, $key) = unpack(
            "lZ$STR_SIZE",
            substr($chunk, 0, $PACK_SIZE, '')
        );
        $output .= $key. "\t". -($val). "\n";
    }
    MCE::relay { print $output; };
}

if (length $data) {
    my $start3 = time;
    sort_packed "C$PACK_SIZE", $data;
    printf {*STDERR} "sort packed data : %9.3f secs\n", time - $start3;

    my $start4 = time;
    $mce->process({
        user_args  => [ 'disp_task' ],
        sequence   => [ 0, divide_down(length($data), $FETCH_SIZE) ],
        chunk_size => 1,
    });
    printf {*STDERR} "write stdout     : %9.3f secs\n", time - $start4;
}

$mce->shutdown; # reap workers
@MM = ();

printf {*STDERR} "total time       : %9.3f secs\n", time - $start1;
printf {*STDERR} "    count lines  : %lu\n", $num_lines;
printf {*STDERR} "    count unique : %lu\n", $unique;
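The "value descending, word ascending" ordering can be reproduced with fixed-width packed records and a plain byte-wise sort. The sketch below is not the script's method: it swaps in Perl's built-in sort for Sort::Packed and stores the count as a complemented big-endian 32-bit integer rather than a negated native 'l'. The sample data is hypothetical, and counts are assumed to fit in 32 bits.

```perl
use strict;
use warnings;

my $KEY_SIZE = 12;

# Hypothetical sample data: key => count.
my %count = (apple => 3, banana => 10, cherry => 3, date => 300);

# One fixed-width record per key. The count is stored complemented
# (0xFFFFFFFF - count) as a big-endian 32-bit integer, so byte-wise
# comparison puts larger counts first. The NUL-padded key follows,
# so equal counts fall back to ascending key order.
my @records = map { pack("N a$KEY_SIZE", 0xFFFFFFFF - $count{$_}, $_) }
              keys %count;

# The default sort compares strings byte by byte (no locale in effect).
my @sorted = sort @records;

my @lines;
for my $rec (@sorted) {
    my ($inv, $key) = unpack("N Z$KEY_SIZE", $rec);
    push @lines, join("\t", $key, 0xFFFFFFFF - $inv);
}
print "$_\n" for @lines;
```

Packing everything into byte strings avoids a custom comparison callback, which is the same idea that lets a packed sorter work on one large buffer.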

Re^2: Solving the Long List is Long challenge - Kyoto Cabinet by hippo (Archbishop) on Jul 14, 2023 at 08:52 UTC
> Kyoto Cabinet is the successor to Tokyo Cabinet.

Thanks for reminding me of the existence of Kyoto Cabinet. I looked at it for a particular project some years back and was impressed with the speed and ease of use. Unfortunately the project was canned before it could be used in production.

However, I see from the linked page that Kyoto Cabinet itself now has a successor which is Tkrzw. It does require C++17 but might be worth a look. Unfortunately there do not seem to be any modules on CPAN using it yet, AFAICS.

🦛
Re^3: Solving the Long List is Long challenge - Learning Tkrzw by marioroy (Prior) on Jul 15, 2023 at 07:21 UTC
> I see from the linked page that Kyoto Cabinet itself now has a successor which is Tkrzw.

I found time to try the Tkrzw C++ library. Tkrzw provides sharding capabilities. Spoiler alert... It's awesome :)

C++ bits:

```
#include <tkrzw_dbm_hash.h>
#include <tkrzw_dbm_shard.h>
...
// tkrzw::HashDBM dbm;
// dbm.Open("/dev/shm/casket.tkh", true).OrDie();
tkrzw::ShardDBM dbm;
const std::map<std::string, std::string> params = {
    {"num_shards", "8"}, {"dbm", "HashDBM"} };
dbm.OpenAdvanced("/dev/shm/casket.tkh", true, tkrzw::File::OPEN_DEFAULT, params);

for (int i = 0; i < nfiles; ++i)
    get_properties(fname[i], nthds, dbm);

dbm.Close();
```

ls -1 /dev/shm

```
casket.tkh-00000-of-00008
casket.tkh-00001-of-00008
casket.tkh-00002-of-00008
casket.tkh-00003-of-00008
casket.tkh-00004-of-00008
casket.tkh-00005-of-00008
casket.tkh-00006-of-00008
casket.tkh-00007-of-00008
```

I will come back after completing a new llil4shard C++ variant. The Tkrzw library is amazing. In the meantime, the left column is the number of shards processing 26 big files (48 CPU threads). Total: 91,395,200 lines, 79,120,065 unique keys. For reference, Perl lliltch.pl "get properties" takes 9.533 seconds with 128 maps (or shards).

```
  8 shards : 11.958 secs   7.643 mil QPS
 16 shards :  6.680 secs  13.682 mil QPS
 32 shards :  4.424 secs  20.659 mil QPS
 64 shards :  3.419 secs  26.732 mil QPS
 96 shards :  3.052 secs  29.946 mil QPS
128 shards :  2.903 secs  31.483 mil QPS
```

Yay :) Tkrzw provides the increment method. Like Kyoto Cabinet, the value is stored as an 8-byte big-endian integer.

```
#include <byteswap.h>
...
// Process max Nthreads chunks concurrently.
while (first < last) {
    char* beg_ptr{first};
    first = find_char(first, last, '\n');
    char* end_ptr{first};
    ++first;

    if ((found = find_char(beg_ptr, end_ptr, '\t')) == end_ptr)
        continue;

    count = fast_atoll64(found + 1);
    klen  = std::min(MAX_STR_LEN_L, (size_t)(found - beg_ptr));
    std::basic_string_view<char> key{ reinterpret_cast<const char*>(beg_ptr), klen };
    dbm_ret.IncrementSimple(key, count);

    // std::string value = dbm_ret.GetSimple(key);
    // int64_t *bigendian_num = reinterpret_cast<int64_t*>(value.data());
    // std::cout << key << ": " << bswap_64(*bigendian_num) << "\n";
}
```

So much learning from the Long List is Long series :) Notice above that there is no locking among threads for incrementing the count, and no local hash either. The "IncrementSimple" method is a single operation. I tested retrieval and conversion, which will be done later in the code.

Update: Iteration is slow

```
// Store the properties into a vector
vec_str_int_type propvec;
propvec.reserve(num_keys);

std::string key, value;
int64_t *bigendian_num = reinterpret_cast<int64_t*>(value.data());
std::unique_ptr<tkrzw::DBM::Iterator> iter = dbm.MakeIterator();
iter->First();
while (iter->Get(&key, &value) == tkrzw::Status::SUCCESS) {
    propvec.emplace_back(key, bswap_64(*bigendian_num));
    iter->Next();
}

dbm.Close();
```

Notice "shardDBM to vector" below. I will try again later and iterate the individual maps (or shards) in parallel.

```
$ NUM_THREADS=8 NUM_MAPS=4 ./llil4tkh big{1,2,3}.txt | cksum
llil4tkh (fixed string length=12) start
use OpenMP
use boost sort
get properties      2.978 secs
shardDBM to vector  5.848 secs
vector stable sort  0.157 secs
write stdout        0.213 secs
total time          9.197 secs
2956888413 93308427
```

Compared to Perl using Tokyo Cabinet :)

```
$ perl lliltch.pl --threads=8 --maps=4 big{1,2,3}.txt | cksum
Tokyo Cabinet hash database - start
fixed string length=12, threads=8, maps=4
get properties   :     5.487 secs
pack properties  :     3.545 secs
sort packed data :     0.969 secs
write stdout     :     0.764 secs
total time       :    10.769 secs
    count lines  : 10545600
    count unique : 10367603
2956888413 93308427
```

Update: Parallel iteration

This works :) and makes iteration faster. Iterate all the maps (or shards) in parallel, then append each local result to the property vector serially.

```
// Store the properties into a vector
vec_str_int_type propvec;
propvec.reserve(num_keys);

#pragma omp parallel for schedule(static, 1)
for (int i = 0; i < nmaps; ++i) {
    // casket.tkh-00000-of-00004
    // casket.tkh-00001-of-00004
    // casket.tkh-00002-of-00004
    // casket.tkh-00003-of-00004
    char path[255];
    std::sprintf(path, "/dev/shm/casket.tkh-%05d-of-%05d", i, nmaps);

    tkrzw::HashDBM dbm;
    dbm.Open(path, false).OrDie();

    int64_t num_keys = dbm.CountSimple();
    if (num_keys > 0) {
        vec_str_int_type locvec;
        locvec.reserve(num_keys);

        std::string key, value;
        int64_t *bigendian_num = reinterpret_cast<int64_t*>(value.data());
        std::unique_ptr<tkrzw::DBM::Iterator> iter = dbm.MakeIterator();
        iter->First();
        while (iter->Get(&key, &value) == tkrzw::Status::SUCCESS) {
            locvec.emplace_back(key, bswap_64(*bigendian_num));
            iter->Next();
        }

        #pragma omp critical
        propvec.insert(
            // Append local vector to propvec
            propvec.end(),
            std::make_move_iterator(locvec.begin()),
            std::make_move_iterator(locvec.end())
        );
    }

    dbm.Close();
}
```

Results:

```
$ NUM_THREADS=8 NUM_MAPS=4 ./llil4tkh big{1,2,3}.txt | cksum
llil4tkh (fixed string length=12) start
use OpenMP
use boost sort
get properties      2.985 secs
shardDBM to vector  1.381 secs
vector stable sort  0.157 secs
write stdout        0.214 secs
total time          4.739 secs
2956888413 93308427

$ NUM_THREADS=8 NUM_MAPS=8 ./llil4tkh big{1,2,3}.txt | cksum
llil4tkh (fixed string length=12) start
use OpenMP
use boost sort
get properties      2.106 secs
shardDBM to vector  0.683 secs
vector stable sort  0.159 secs
write stdout        0.208 secs
total time          3.157 secs
2956888413 93308427

$ NUM_THREADS=8 NUM_MAPS=32 ./llil4tkh big{1,2,3}.txt | cksum
llil4tkh (fixed string length=12) start
use OpenMP
use boost sort
get properties      1.364 secs
shardDBM to vector  0.639 secs
vector stable sort  0.159 secs
write stdout        0.207 secs
total time          2.372 secs
2956888413 93308427
```

Let's try processing 26 big files :) Get properties is about 3 times faster than Perl. The QPS figures are count_lines (get properties) and count_unique (pack properties, shardDBM to vector) divided by the elapsed time, in millions.

```
$ perl lliltch.pl --threads=48 --maps=max in/biga* | cksum
Tokyo Cabinet hash database - start
fixed string length=12, threads=48, maps=128
get properties   :     9.533 secs  9.587 mil QPS
pack properties  :     3.276 secs 24.151 mil QPS
sort packed data :     6.826 secs
write stdout     :     1.631 secs
total time       :    21.284 secs
    count lines  : 91395200
    count unique : 79120065
2005669956 712080585

$ NUM_THREADS=48 NUM_MAPS=128 ./llil4tkh in/biga* | cksum
llil4tkh (fixed string length=12) start
sharding managed by the tkrzw::ShardDBM library
use OpenMP
use boost sort
get properties      2.872 secs 31.823 mil QPS
shardDBM to vector  1.546 secs 51.177 mil QPS
vector stable sort  1.399 secs
write stdout        1.561 secs
total time          7.380 secs
2005669956 712080585
```

Thank you, hippo, for mentioning the Tkrzw C++ library. I'm one step away from posting the new llil variant. Currently, the db path is hard-coded to "/dev/shm/casket.tkh".

Update: app-level sharding

For better performance, I tried constructing an array of "tkrzw::HashDBM" objects instead of a single "tkrzw::ShardDBM" object. This requires the application to compute the hash value, which is not a problem. Below, see timings for app-level sharding.

```
$ NUM_THREADS=48 NUM_MAPS=128 ./llil4tkh2 in/biga* | cksum
llil4tkh2 (fixed string length=12) start
sharding managed by the application
use OpenMP
use boost sort
get properties      2.337 secs 39.108 mil QPS
hashDBMs to vector  1.607 secs 49.235 mil QPS
vector stable sort  1.379 secs
write stdout        1.576 secs
total time          6.900 secs
2005669956 712080585
```

Update: MAX_STR_LEN_L optimization

Notice "vector stable sort" completing in half the time. The code is final; I will post two Tkrzw variants this evening (one sharded by the C++ library, the other sharded at the application level).

```
$ NUM_THREADS=48 NUM_MAPS=128 ./llil4tkh2 in/biga* | cksum
llil4tkh2 (fixed string length=12) start
sharding managed by the application
use OpenMP
use boost sort
get properties      2.331 secs 39.209 mil QPS
hashDBMs to vector  1.420 secs 55.718 mil QPS
vector stable sort  0.663 secs
write stdout        1.541 secs
total time          5.957 secs
2005669956 712080585
```
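The 8-byte big-endian counter format mentioned in this reply is the same one the Perl script decodes with unpack 'q>'. As a minimal sketch of that layout, assuming a Perl built with 64-bit integer support, the following uses a plain hash as a hypothetical stand-in for the database; encode_count, decode_count and increment are illustrative helpers, not library calls.

```perl
use strict;
use warnings;

# An incremented value is an 8-byte, big-endian, signed integer.
sub encode_count { return pack('q>', $_[0]) }
sub decode_count { return unpack('q>', $_[0]) }

my $stored = encode_count(258);
printf "bytes : %s\n",
    join(' ', map { sprintf '%02x', $_ } unpack('C*', $stored));
printf "value : %d\n", decode_count($stored);

# Hypothetical in-memory stand-in for a database increment:
# decode the current value, add the delta, store it re-encoded.
sub increment {
    my ($store, $key, $delta) = @_;
    my $current = exists $store->{$key} ? decode_count($store->{$key}) : 0;
    $store->{$key} = encode_count($current + $delta);
    return $current + $delta;
}

my %store;
increment(\%store, 'apple', 2);
increment(\%store, 'apple', 5);
printf "apple : %d\n", decode_count($store{apple});
```

Reading such a value back as a plain decimal string would give garbage, which is why both the Perl and the C++ code convert the byte order after retrieval.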
Re^4: Solving the Long List is Long challenge - Tkrzw llil4tkh2 by marioroy (Prior) on Jul 17, 2023 at 05:33 UTC
I created another Tkrzw demonstration. This one constructs many HashDBMs. Basically, sharding is managed by the application.

Update 1: The HashDBMs are now interchangeable/compatible with ShardDBMs, since both use the same hash function.

Update 2: Replaced bswap_64 with the library's tkrzw::StrToIntBigEndian function.

```
#include <tkrzw_dbm_common_impl.h>
idx = tkrzw::SecondaryHash(key, nmaps);
dbm_ret[idx].IncrementSimple(key, count);
```

```
$ NUM_THREADS=24 NUM_MAPS=96 ./llil4tkh2 big{1,2,3}.txt | cksum
llil4tkh2 (fixed string length=12) start
sharding managed by the application
use OpenMP
use boost sort
get properties      0.446 secs 23.645 mil QPS
hashDBMs to vector  0.354 secs
vector stable sort  0.081 secs
write stdout        0.210 secs
total time          1.092 secs
    count lines     10545600
    count unique    10367603
2956888413 93308427

# Results for 26 big files:

$ NUM_THREADS=24 NUM_MAPS=96 ./llil4tkh2 in/biga* | cksum
llil4tkh2 (fixed string length=12) start
sharding managed by the application
use OpenMP
use boost sort
get properties      3.507 secs 26.051 mil QPS
hashDBMs to vector  1.777 secs
vector stable sort  0.665 secs
write stdout        1.532 secs
total time          7.483 secs
    count lines     91395200
    count unique    79120065
2005669956 712080585

$ NUM_THREADS=48 NUM_MAPS=128 ./llil4tkh2 in/biga* | cksum
llil4tkh2 (fixed string length=12) start
sharding managed by the application
use OpenMP
use boost sort
get properties      2.335 secs 39.141 mil QPS
hashDBMs to vector  1.410 secs
vector stable sort  0.677 secs
write stdout        1.555 secs
total time          5.979 secs
    count lines     91395200
    count unique    79120065
2005669956 712080585

# One billion+ lines (312 big files)

$ NUM_THREADS=48 NUM_MAPS=128 ./llil4tkh2 \
    in/biga* in/biga* in/biga* in/biga* in/biga* in/biga* \
    in/biga* in/biga* in/biga* in/biga* in/biga* in/biga* \
    | cksum
llil4tkh2 (fixed string length=12) start
sharding managed by the application
use OpenMP
use boost sort
get properties     24.295 secs 45.143 mil QPS
hashDBMs to vector  1.410 secs
vector stable sort  0.644 secs
write stdout        1.439 secs
total time         27.790 secs
    count lines     1096742400
    count unique    79120065
3625599930 791200650
```

llil4tkh2.cc (20 kB; source not reproduced here)
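The "mil QPS" column in these listings appears to be a count divided by elapsed seconds, expressed in millions, as described earlier in the thread. A small sketch of that arithmetic, using figures copied from the 26-file runs above (the mil_qps helper name is hypothetical):

```perl
use strict;
use warnings;

# Millions of operations per second, rounded to three decimals.
sub mil_qps {
    my ($count, $secs) = @_;
    return sprintf '%.3f', $count / $secs / 1e6;
}

# Line counts are used for "get properties", unique-key counts for
# the pack / to-vector phases.
printf "Perl get properties      : %s mil QPS\n", mil_qps(91_395_200, 9.533);
printf "Perl pack properties     : %s mil QPS\n", mil_qps(79_120_065, 3.276);
printf "llil4tkh2 get properties : %s mil QPS\n", mil_qps(91_395_200, 2.335);
```

Recomputing the column this way is a quick sanity check when comparing runs with different thread and shard counts.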
Re^4: Solving the Long List is Long challenge - Tkrzw llil4tkh by marioroy (Prior) on Jul 17, 2023 at 05:27 UTC
I finished the tkrzw::ShardDBM demonstration. Sharding is managed by the C++ library.

Update: Replaced bswap_64 with the library's tkrzw::StrToIntBigEndian function.

```
$ NUM_THREADS=24 NUM_MAPS=96 ./llil4tkh big{1,2,3}.txt | cksum
llil4tkh (fixed string length=12) start
sharding managed by the tkrzw::ShardDBM library
use OpenMP
use boost sort
get properties      0.564 secs 18.698 mil QPS
shardDBM to vector  0.352 secs
vector stable sort  0.078 secs
write stdout        0.206 secs
total time          1.202 secs
    count lines     10545600
    count unique    10367603
2956888413 93308427

# Results for 26 big files:

$ NUM_THREADS=24 NUM_MAPS=96 ./llil4tkh in/biga* | cksum
llil4tkh (fixed string length=12) start
sharding managed by the tkrzw::ShardDBM library
use OpenMP
use boost sort
get properties      4.355 secs 20.986 mil QPS
shardDBM to vector  1.789 secs
vector stable sort  0.667 secs
write stdout        1.577 secs
total time          8.389 secs
    count lines     91395200
    count unique    79120065
2005669956 712080585

$ NUM_THREADS=48 NUM_MAPS=128 ./llil4tkh in/biga* | cksum
llil4tkh (fixed string length=12) start
sharding managed by the tkrzw::ShardDBM library
use OpenMP
use boost sort
get properties      2.858 secs 31.979 mil QPS
shardDBM to vector  1.412 secs
vector stable sort  0.663 secs
write stdout        1.553 secs
total time          6.488 secs
    count lines     91395200
    count unique    79120065
2005669956 712080585

# One billion+ lines (312 big files)

$ NUM_THREADS=48 NUM_MAPS=128 ./llil4tkh \
    in/biga* in/biga* in/biga* in/biga* in/biga* in/biga* \
    in/biga* in/biga* in/biga* in/biga* in/biga* in/biga* \
    | cksum
llil4tkh (fixed string length=12) start
sharding managed by the tkrzw::ShardDBM library
use OpenMP
use boost sort
get properties     28.506 secs 38.474 mil QPS
shardDBM to vector  1.456 secs
vector stable sort  0.645 secs
write stdout        1.453 secs
total time         32.062 secs
    count lines     1096742400
    count unique    79120065
3625599930 791200650
```

llil4tkh.cc (20 kB; source not reproduced here)
Re^3: Solving the Long List is Long challenge - Kyoto Cabinet by marioroy (Prior) on Jul 14, 2023 at 14:47 UTC
> However, I see from the linked page that Kyoto Cabinet itself now has a successor which is Tkrzw. It does require C++17 but might be worth a look. Unfortunately there do not seem to be any modules on CPAN using it yet, AFAICS.

That looks interesting. Then, maybe a Python or C++ demonstration :) I added it to my TODO list.