Chuma posted the interesting Long list is long challenge in October 2022. eyepopslikeamosquito created a separate thread one month later. Many monks contributed solutions.
My goal was to keep memory consumption low whether running a single thread or 20+ threads; ideally, running more threads should also run faster. It turns out both are possible. Likewise, there is zero merge overhead, because each key lives in exactly one sub-map: just move the elements from all the sub-maps into a vector for sorting and output.
In a nutshell, here is the strategy used by the hash-map solutions in the June 2023 refresh.
1. Create many sub-maps and mutexes.
2. Parallelize a single file via chunking, versus a parallel list of files.
3. Create a hash value for the key and store the value with the key.
4. Determine the sub-map to use by hash-value MOD number-of-maps.
5. There are 963 sub-maps in total, to minimize locking contention.
6. Randomness kicks in, allowing many threads to run.
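The sub-map strategy above can be sketched in a few lines of C++. This is illustrative only, not the author's actual code: `NUM_MAPS` and `std::hash` are stand-ins for the real implementation's choices.

```cpp
#include <array>
#include <cstddef>
#include <functional>
#include <mutex>
#include <string>
#include <unordered_map>

// Illustrative sketch: many sub-maps, each guarded by its own mutex.
constexpr std::size_t NUM_MAPS = 963;  // many shards minimize lock contention

struct ShardedMap {
    std::array<std::unordered_map<std::string, long long>, NUM_MAPS> maps;
    std::array<std::mutex, NUM_MAPS> locks;

    void add(const std::string& key, long long count) {
        // hash the key once, then route by hash-value MOD number-of-maps
        const std::size_t idx = std::hash<std::string>{}(key) % NUM_MAPS;
        std::lock_guard<std::mutex> guard(locks[idx]);  // per-shard lock only
        maps[idx][key] += count;
    }
};
```

Because a given key always hashes to the same sub-map, merging the shards at the end is mere concatenation; no cross-shard duplicate resolution is needed.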
Thank you, Gregory Popovitch, who identified the last off-by-one error in my C++ chunking logic and shared a couple of suggestions (see issue 198). Thank you, eyepopslikeamosquito, for introducing me to C++. And thank you, anonymous monk, who mentioned the word parallel. So we tried running parallel in C++, and eventually chunking too. :)
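The chunking idea behind step 2 looks roughly like the following hypothetical helper (not the actual llil code): split the buffer into byte ranges, then extend each range to the next newline so no line straddles two workers. Boundary logic like this is exactly where an off-by-one can hide.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: split a buffer into up to nchunks byte ranges
// aligned on newline boundaries. Each range ends just after a newline
// (or at EOF), so every line belongs to exactly one worker.
std::vector<std::pair<std::size_t, std::size_t>>
chunk_ranges(const std::string& buf, std::size_t nchunks) {
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    const std::size_t size = buf.size();
    std::size_t begin = 0;
    for (std::size_t i = 1; i <= nchunks && begin < size; ++i) {
        std::size_t end = (i == nchunks) ? size : size * i / nchunks;
        if (end < begin) end = begin;
        while (end < size && buf[end] != '\n') ++end;  // finish the line
        if (end < size) ++end;                         // keep the newline
        ranges.emplace_back(begin, end);
        begin = end;
    }
    return ranges;
}
```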
Re: Solving the Long List is Long challenge - SQLite DB
by marioroy (Prior) on Jul 13, 2023 at 09:53 UTC
Aloha. I circled back to Perl and tried something using Tie::Hash::DBD. The demonstration creates 32 SQLite databases (by default) in a temp directory. Crypt::xxHash is used to determine which database to insert/update. Sorting is handled by Sort::Packed. Like the DB_File variant, this is best run on a Unix OS.

Usage:
  KEYSIZE=N NUM_THREADS=N NUM_MAPS=N perl llilsql.pl file...
  perl llilsql.pl --keysize=N --threads=N --maps=N file...
  perl llilsql.pl --keysize=N --threads=max --maps=max file...
Re: Solving the Long List is Long challenge - IPC::MMA
by marioroy (Prior) on Jul 13, 2023 at 10:52 UTC
Hi, again... :) Something else I tried is IPC::MMA, complementing the other two demonstrations here (using Tie::Hash::DBD) and here (using DB_File). This variant creates 256 hashes (by default) in shared memory. Crypt::xxHash is used to determine which hash to insert/update. Sorting is handled by Sort::Packed. The author states, "IPC::MMA 'hashes' do not hash keys. They maintain their entries in sorted order on their keys, but they are not BTrees." So, to keep up with SQLite and DB_File, I configured 8x the number of maps. Be sure ulimit -n is plentiful.

Usage:
  KEYSIZE=N NUM_THREADS=N NUM_MAPS=N perl llilmma.pl file...
  perl llilmma.pl --keysize=N --threads=N --maps=N file...
  perl llilmma.pl --keysize=N --threads=max --maps=max file...
Re: Solving the Long List is Long challenge - C++ vs Perl
by marioroy (Prior) on Jul 13, 2023 at 12:16 UTC
How does Perl stack up against C++? I tried the std::unordered_map container, which ships with C++, plus llil4emh and llil4hmap for comparison. The C++ demonstrations involve a parallel sort. Running C++ (input: 26 big files); the right column shows results from the April 2024 release.
Running Perl (input: 26 big files). Notice how Perl's Sort::Packed module runs faster, sorting-wise, on one CPU core. That is interesting.

  llilsql.pl - Tie::Hash::DBD SQLite database
  llildbf.pl - DB_File B-tree database
  llilmma.pl - IPC::MMA shared memory hash
  lliltch.pl - Tokyo Cabinet hash database
  llilkch.pl - Kyoto Cabinet hash database
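For readers unfamiliar with the C++ side, here is a minimal single-threaded sketch of the std::unordered_map approach (an illustration, not llil4umap itself): accumulate counts in the hash map, then copy the pairs into a vector for sorting and output. The sort order here is illustrative.

```cpp
#include <algorithm>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Sketch: count keys in the hash map, then sort the pairs in a vector.
std::vector<std::pair<std::string, long long>>
count_and_sort(const std::vector<std::string>& words) {
    std::unordered_map<std::string, long long> counts;
    for (const auto& w : words) ++counts[w];

    std::vector<std::pair<std::string, long long>> vec(counts.begin(),
                                                       counts.end());
    // highest count first, ties broken by key (illustrative output order)
    std::sort(vec.begin(), vec.end(), [](const auto& a, const auto& b) {
        return a.second != b.second ? a.second > b.second : a.first < b.first;
    });
    return vec;
}
```

The parallel variants replace the final std::sort with a multi-threaded sort.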
by marioroy (Prior) on Jul 14, 2023 at 14:26 UTC
I ran a test with 1 billion+ lines (312 big files). This was done to compare the various demonstrations "mainly" processing duplicate keys. I find it impressive to see Tie::Hash::DBD (SQLite) keep up with DB_File. Even more impressive are Tokyo and Kyoto Cabinet, which provide the "addinc" and "increment" APIs, respectively. Perl solutions: Note: all tasks run in parallel except for "sort packed data".
C++ for comparison: QPS is queries per second for get properties. The right column shows results from the April 2024 release.
Re: Solving the Long List is Long challenge - DB_File
by marioroy (Prior) on Jul 13, 2023 at 10:07 UTC
Greetings. After trying C++, I thought to give Perl another try using DB_File. The demonstration creates 32 databases (by default) in a temp directory. Crypt::xxHash is used to determine which database to insert/update. Sorting is handled by Sort::Packed. Like the SQLite variant, this is best run on a Unix OS.

Usage:
  KEYSIZE=N NUM_THREADS=N NUM_MAPS=N perl llildbf.pl file...
  perl llildbf.pl --keysize=N --threads=N --maps=N file...
  perl llildbf.pl --keysize=N --threads=max --maps=max file...
Re: Solving the Long List is Long challenge - Kyoto Cabinet
by marioroy (Prior) on Jul 14, 2023 at 07:53 UTC
There is one missing in the mix :) Kyoto Cabinet is the successor to Tokyo Cabinet. This new variant also creates 32 hash databases (by default) in a temp directory. Crypt::xxHash is used to determine which database to insert/update. Sorting is handled by Sort::Packed. Like the DB_File variant, this is best run on a Unix OS.

Usage:
  KEYSIZE=N NUM_THREADS=N NUM_MAPS=N perl llilkch.pl file...
  perl llilkch.pl --keysize=N --threads=N --maps=N file...
  perl llilkch.pl --keysize=N --threads=max --maps=max file...
by hippo (Archbishop) on Jul 14, 2023 at 08:52 UTC
> Kyoto Cabinet is the successor to Tokyo Cabinet.
Thanks for reminding me of the existence of Kyoto Cabinet. I looked at it for a particular project some years back and was impressed with the speed and ease of use. Unfortunately, that project was canned before it could be used in production. However, I see from the linked page that Kyoto Cabinet itself now has a successor, Tkrzw. It does require C++17 but might be worth a look. Unfortunately, there do not seem to be any modules on CPAN using it yet, AFAICS. 🦛
by marioroy (Prior) on Jul 15, 2023 at 07:21 UTC
> I see from the linked page that Kyoto Cabinet itself now has a successor which is Tkrzw. I found time to try the Tkrzw C++ library. Tkrzw provides sharding capabilities. Spoiler alert... It's awesome :) C++ bits:
ls -1 /dev/shm
I will come back after completing a new llil4shard C++ variant. The Tkrzw library is amazing. In the meantime, the left column is the number of shards processing 26 big files (48 CPU threads). Total: 91,395,200 lines, 79,120,065 unique keys. For reference, Perl lliltch.pl "get properties" takes 9.533 seconds with 128 maps (or shards).
Yay :) Tkrzw provides the increment method. Like Kyoto Cabinet, the value is stored as an 8-byte big-endian integer.
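That storage format can be illustrated in portable C++ (a sketch of the byte layout only, not Tkrzw's internals): the counter travels as an 8-byte big-endian string value, which the library's increment operation updates in place.

```cpp
#include <cstdint>
#include <string>

// Pack/unpack a counter as an 8-byte big-endian string value, the format
// Kyoto Cabinet and Tkrzw use for their increment operations.
std::string pack_be64(std::uint64_t v) {
    std::string s(8, '\0');
    for (int i = 7; i >= 0; --i) { s[i] = char(v & 0xFF); v >>= 8; }
    return s;
}

std::uint64_t unpack_be64(const std::string& s) {
    std::uint64_t v = 0;
    for (int i = 0; i < 8; ++i) v = (v << 8) | std::uint8_t(s[i]);
    return v;
}
```

With Tkrzw, the database performs this read-modify-write internally; decoding like unpack_be64 is only needed when retrieving the stored counts afterward.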
So much learning from the Long List is Long series :) Notice above: no locking among threads for incrementing the count, and no local hash either. The "IncrementSimple" method is a single operation. I tested retrieval and conversion, which will be done later in the code. Update: Iteration is slow
Notice "tkrzw to vector". I will try again later and iterate the individual maps (or shards) in parallel.
Compared to Perl using Tokyo Cabinet :)
Update: parallel iteration. This works :) To make iteration faster, iterate all the maps (or shards) in parallel and append to the property vector serially.
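A sketch of that pattern in standard C++, using std::unordered_map shards as stand-ins for the Tkrzw DBM objects (illustrative, not the llil4shard code):

```cpp
#include <mutex>
#include <string>
#include <thread>
#include <unordered_map>
#include <utility>
#include <vector>

// One thread per shard copies its map into a local buffer (parallel part),
// then appends to the shared property vector under a mutex (serial part).
std::vector<std::pair<std::string, long long>>
gather(std::vector<std::unordered_map<std::string, long long>>& shards) {
    std::vector<std::pair<std::string, long long>> props;
    std::mutex append_lock;
    std::vector<std::thread> workers;
    for (auto& shard : shards) {
        workers.emplace_back([&shard, &props, &append_lock] {
            std::vector<std::pair<std::string, long long>> local(
                shard.begin(), shard.end());             // iterate in parallel
            std::lock_guard<std::mutex> guard(append_lock);
            props.insert(props.end(), local.begin(), local.end());
        });
    }
    for (auto& t : workers) t.join();
    return props;
}
```

Only the short append is serialized, so the walk over each shard scales with the number of threads.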
Results:
Let's try processing 26 big files :) Get properties is 3 times faster than Perl. The QPS is count_lines and count_unique, respectively, divided by time, in millions.
Thank you, hippo, for mentioning the Tkrzw C++ library. I'm one step away from posting the new llil variant; currently, the db path is hard-coded to "/dev/shm/casket.tkh". Update: app-level sharding. For better performance, I tried constructing an array of "tkrzw::HashDBM" objects versus a single "tkrzw::ShardDBM" object. This requires the application to compute the hash value, which is not a problem. Below are timings for app-level sharding.
Update: MAX_STR_LEN_L optimization. Notice "vector stable sort" completing in half the time. The code is final; I will post two Tkrzw variants this evening (one sharded by the C++ library, the other with application-level sharding).
by marioroy (Prior) on Jul 17, 2023 at 05:27 UTC
by marioroy (Prior) on Jul 17, 2023 at 05:33 UTC
by marioroy (Prior) on Jul 14, 2023 at 14:47 UTC
> However, I see from the linked page that Kyoto Cabinet itself now has a successor which is Tkrzw. It does require C++17 but might be worth a look. Unfortunately there do not seem to be any modules on CPAN using it yet, AFAICS.
That looks interesting. Maybe a Python or C++ demonstration, then :) I added it to my TODO list.
Re: Solving the Long List is Long challenge - Tokyo Cabinet
by marioroy (Prior) on Jul 14, 2023 at 00:17 UTC
Hi, I'm adding another to the mix: I also tried Tokyo Cabinet with Perl. The demonstration creates 32 hash databases (by default) in a temp directory. Crypt::xxHash is used to determine which database to insert/update. Sorting is handled by Sort::Packed. Like the DB_File variant, this is best run on a Unix OS.

Usage:
  KEYSIZE=N NUM_THREADS=N NUM_MAPS=N perl lliltch.pl file...
  perl lliltch.pl --keysize=N --threads=N --maps=N file...
  perl lliltch.pl --keysize=N --threads=max --maps=max file...