Chuma posted the interesting Long list is long challenge in October 2022. eyepopslikeamosquito created a separate thread one month later. Many monks contributed solutions.
My goal was to keep memory consumption low whether running a single thread or 20+ threads; ideally, running more threads should also run faster. It turns out both are possible. Likewise, there is zero merge overhead, because each key lives in exactly one sub-map: just move the elements from all the sub-maps into a vector for sorting and output.
In a nutshell, here is the strategy used by the hash-map solutions in the June 2023 refresh.
1. Create many sub-maps and mutexes.
2. Parallelize a single file via chunking, versus a parallel list of files.
3. Create a hash value for the key and store the value with the key.
4. Determine the sub-map to use by hash-value MOD number-of-maps.
5. There are 963 sub-maps in total, to minimize locking contention.
6. Randomness kicks in, allowing many threads to run.
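The sub-map strategy above can be sketched in a few lines of C++. This is illustrative only, not the author's actual code: `NUM_MAPS` and `std::hash` are stand-ins for the real implementation's choices.

```cpp
#include <array>
#include <cstddef>
#include <functional>
#include <mutex>
#include <string>
#include <unordered_map>

// Illustrative sketch: many sub-maps, each guarded by its own mutex.
constexpr std::size_t NUM_MAPS = 963;  // many shards minimize lock contention

struct ShardedMap {
    std::array<std::unordered_map<std::string, long long>, NUM_MAPS> maps;
    std::array<std::mutex, NUM_MAPS> locks;

    void add(const std::string& key, long long count) {
        // hash the key once, then route by hash-value MOD number-of-maps
        const std::size_t idx = std::hash<std::string>{}(key) % NUM_MAPS;
        std::lock_guard<std::mutex> guard(locks[idx]);  // per-shard lock only
        maps[idx][key] += count;
    }
};
```

Because a given key always hashes to the same sub-map, merging the shards at the end is mere concatenation; no cross-shard duplicate resolution is needed.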
Thank you, Gregory Popovitch, who identified the last off-by-one error in my C++ chunking logic and shared a couple of suggestions (see issue 198). Thank you, eyepopslikeamosquito, for introducing me to C++. And thank you, anonymous monk, who mentioned the word parallel. So we tried running parallel in C++, and eventually chunking too. :)
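The chunking idea behind step 2 looks roughly like the following hypothetical helper (not the actual llil code): split the buffer into byte ranges, then extend each range to the next newline so no line straddles two workers. Boundary logic like this is exactly where an off-by-one can hide.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: split a buffer into up to nchunks byte ranges
// aligned on newline boundaries. Each range ends just after a newline
// (or at EOF), so every line belongs to exactly one worker.
std::vector<std::pair<std::size_t, std::size_t>>
chunk_ranges(const std::string& buf, std::size_t nchunks) {
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    const std::size_t size = buf.size();
    std::size_t begin = 0;
    for (std::size_t i = 1; i <= nchunks && begin < size; ++i) {
        std::size_t end = (i == nchunks) ? size : size * i / nchunks;
        if (end < begin) end = begin;
        while (end < size && buf[end] != '\n') ++end;  // finish the line
        if (end < size) ++end;                         // keep the newline
        ranges.emplace_back(begin, end);
        begin = end;
    }
    return ranges;
}
```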
Re: Solving the Long List is Long challenge - SQLite DB
by marioroy (Prior) on Jul 13, 2023 at 09:53 UTC
Aloha. I circled back to Perl and tried something using Tie::Hash::DBD. The demonstration creates 32 SQLite databases (by default) in a temp directory. Crypt::xxHash is used to determine which database to insert/update. Sorting is handled by Sort::Packed. Like the DB_File variant, this is best run on a Unix OS.

Usage:
  KEYSIZE=N NUM_THREADS=N NUM_MAPS=N perl llilsql.pl file...
  perl llilsql.pl --keysize=N --threads=N --maps=N file...
  perl llilsql.pl --keysize=N --threads=max --maps=max file...
Re: Solving the Long List is Long challenge - IPC::MMA
by marioroy (Prior) on Jul 13, 2023 at 10:52 UTC
Hi, again... :) Something else I tried is IPC::MMA, complementing the other two demonstrations here (using Tie::Hash::DBD) and here (using DB_File). This variant creates 256 hashes (by default) in shared memory. Crypt::xxHash is used to determine which hash to insert/update. Sorting is handled by Sort::Packed. The author states, "IPC::MMA 'hashes' do not hash keys. They maintain their entries in sorted order on their keys, but they are not BTrees." So, to keep up with SQLite and DB_File, I configured 8x the number of maps. Be sure ulimit -n is plentiful.

Usage:
  KEYSIZE=N NUM_THREADS=N NUM_MAPS=N perl llilmma.pl file...
  perl llilmma.pl --keysize=N --threads=N --maps=N file...
  perl llilmma.pl --keysize=N --threads=max --maps=max file...
Re: Solving the Long List is Long challenge - C++ vs Perl
by marioroy (Prior) on Jul 13, 2023 at 12:16 UTC
How does Perl stack up against C++? I tried the std::unordered_map container, which ships with C++, plus llil4emh and llil4hmap for comparison. The C++ demonstrations involve a parallel sort. Running C++ (input: 26 big files); the right column shows results from the April 2024 release.
Running Perl (input: 26 big files). Notice how Perl's Sort::Packed module runs faster, sorting-wise, on one CPU core. That is interesting.

  llilsql.pl - Tie::Hash::DBD SQLite database
  llildbf.pl - DB_File B-tree database
  llilmma.pl - IPC::MMA shared memory hash
  lliltch.pl - Tokyo Cabinet hash database
  llilkch.pl - Kyoto Cabinet hash database
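For readers unfamiliar with the C++ side, here is a minimal single-threaded sketch of the std::unordered_map approach (an illustration, not llil4umap itself): accumulate counts in the hash map, then copy the pairs into a vector for sorting and output. The sort order here is illustrative.

```cpp
#include <algorithm>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Sketch: count keys in the hash map, then sort the pairs in a vector.
std::vector<std::pair<std::string, long long>>
count_and_sort(const std::vector<std::string>& words) {
    std::unordered_map<std::string, long long> counts;
    for (const auto& w : words) ++counts[w];

    std::vector<std::pair<std::string, long long>> vec(counts.begin(),
                                                       counts.end());
    // highest count first, ties broken by key (illustrative output order)
    std::sort(vec.begin(), vec.end(), [](const auto& a, const auto& b) {
        return a.second != b.second ? a.second > b.second : a.first < b.first;
    });
    return vec;
}
```

The parallel variants replace the final std::sort with a multi-threaded sort.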
by marioroy (Prior) on Jul 14, 2023 at 14:26 UTC
I ran a test with 1 billion+ lines (312 big files). This was done to compare the various demonstrations "mainly" processing duplicate keys. I find it impressive to see Tie::Hash::DBD (SQLite) keep up with DB_File. Even more impressive are Tokyo and Kyoto Cabinet, which provide the "addinc" and "increment" APIs, respectively. Perl solutions: Note: all tasks run in parallel except for "sort packed data".
C++ for comparison: QPS is queries per second for get properties. The right column shows results from the April 2024 release.
Re: Solving the Long List is Long challenge - DB_File
by marioroy (Prior) on Jul 13, 2023 at 10:07 UTC
Greetings. After trying C++, I thought to give Perl another try using DB_File. The demonstration creates 32 databases (by default) in a temp directory. Crypt::xxHash is used to determine which database to insert/update. Sorting is handled by Sort::Packed. Like the SQLite variant, this is best run on a Unix OS.

Usage:
  KEYSIZE=N NUM_THREADS=N NUM_MAPS=N perl llildbf.pl file...
  perl llildbf.pl --keysize=N --threads=N --maps=N file...
  perl llildbf.pl --keysize=N --threads=max --maps=max file...
Re: Solving the Long List is Long challenge - Kyoto Cabinet
by marioroy (Prior) on Jul 14, 2023 at 07:53 UTC
There is one missing in the mix :) Kyoto Cabinet is the successor to Tokyo Cabinet. This new variant also creates 32 hash databases (by default) in a temp directory. Crypt::xxHash is used to determine which database to insert/update. Sorting is handled by Sort::Packed. Like the DB_File variant, this is best run on a Unix OS.

Usage:
  KEYSIZE=N NUM_THREADS=N NUM_MAPS=N perl llilkch.pl file...
  perl llilkch.pl --keysize=N --threads=N --maps=N file...
  perl llilkch.pl --keysize=N --threads=max --maps=max file...
by hippo (Archbishop) on Jul 14, 2023 at 08:52 UTC
> Kyoto Cabinet is the successor to Tokyo Cabinet.
Thanks for reminding me of the existence of Kyoto Cabinet. I looked at it for a particular project some years back and was impressed with the speed and ease of use. Unfortunately, that project was canned before it could be used in production. However, I see from the linked page that Kyoto Cabinet itself now has a successor, Tkrzw. It does require C++17 but might be worth a look. Unfortunately, there do not seem to be any modules on CPAN using it yet, AFAICS. 🦛
by marioroy (Prior) on Jul 15, 2023 at 07:21 UTC
> I see from the linked page that Kyoto Cabinet itself now has a successor which is Tkrzw. I found time to try the Tkrzw C++ library. Tkrzw provides sharding capabilities. Spoiler alert... It's awesome :) C++ bits:
ls -1 /dev/shm
I will come back after completing a new llil4shard C++ variant. The Tkrzw library is amazing. In the meantime, the left column is the number of shards processing 26 big files (48 CPU threads). Total: 91,395,200 lines, 79,120,065 unique keys. For reference, Perl lliltch.pl "get properties" takes 9.533 seconds with 128 maps (or shards).
Yay :) Tkrzw provides the increment method. Like Kyoto Cabinet, the value is stored as an 8-byte big-endian integer.
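That storage format can be illustrated in portable C++ (a sketch of the byte layout only, not Tkrzw's internals): the counter travels as an 8-byte big-endian string value, which the library's increment operation updates in place.

```cpp
#include <cstdint>
#include <string>

// Pack/unpack a counter as an 8-byte big-endian string value, the format
// Kyoto Cabinet and Tkrzw use for their increment operations.
std::string pack_be64(std::uint64_t v) {
    std::string s(8, '\0');
    for (int i = 7; i >= 0; --i) { s[i] = char(v & 0xFF); v >>= 8; }
    return s;
}

std::uint64_t unpack_be64(const std::string& s) {
    std::uint64_t v = 0;
    for (int i = 0; i < 8; ++i) v = (v << 8) | std::uint8_t(s[i]);
    return v;
}
```

With Tkrzw, the database performs this read-modify-write internally; decoding like unpack_be64 is only needed when retrieving the stored counts afterward.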
So much learning from the Long List is Long series :) Notice above: no locking among threads for incrementing the count, and no local hash either. The "IncrementSimple" method is a single operation. I tested retrieval and conversion, which will be done later in the code. Update: Iteration is slow
Notice "tkrzw to vector". I will try again later and iterate the individual maps (or shards) in parallel.
Compared to Perl using Tokyo Cabinet :)
Update: parallel iteration. This works :) To make iteration faster, iterate all the maps (or shards) in parallel and append to the property vector serially.
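A sketch of that pattern in standard C++, using std::unordered_map shards as stand-ins for the Tkrzw DBM objects (illustrative, not the llil4shard code):

```cpp
#include <mutex>
#include <string>
#include <thread>
#include <unordered_map>
#include <utility>
#include <vector>

// One thread per shard copies its map into a local buffer (parallel part),
// then appends to the shared property vector under a mutex (serial part).
std::vector<std::pair<std::string, long long>>
gather(std::vector<std::unordered_map<std::string, long long>>& shards) {
    std::vector<std::pair<std::string, long long>> props;
    std::mutex append_lock;
    std::vector<std::thread> workers;
    for (auto& shard : shards) {
        workers.emplace_back([&shard, &props, &append_lock] {
            std::vector<std::pair<std::string, long long>> local(
                shard.begin(), shard.end());             // iterate in parallel
            std::lock_guard<std::mutex> guard(append_lock);
            props.insert(props.end(), local.begin(), local.end());
        });
    }
    for (auto& t : workers) t.join();
    return props;
}
```

Only the short append is serialized, so the walk over each shard scales with the number of threads.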
Results:
Let's try processing 26 big files :) Get properties is 3 times faster than Perl. The QPS is count_lines and count_unique, respectively, divided by time, in millions.
Thank you, hippo, for mentioning the Tkrzw C++ library. I'm one step away from posting the new llil variant; currently, the db path is hard-coded to "/dev/shm/casket.tkh". Update: app-level sharding. For better performance, I tried constructing an array of "tkrzw::HashDBM" objects versus a single "tkrzw::ShardDBM" object. This requires the application to compute the hash value, which is not a problem. Below are timings for app-level sharding.
Update: MAX_STR_LEN_L optimization. Notice "vector stable sort" completing in half the time. The code is final; I will post two Tkrzw variants this evening (one sharded by the C++ library, the other with application-level sharding).
by marioroy (Prior) on Jul 17, 2023 at 05:27 UTC
by marioroy (Prior) on Jul 17, 2023 at 05:33 UTC
by marioroy (Prior) on Jul 14, 2023 at 14:47 UTC
> However, I see from the linked page that Kyoto Cabinet itself now has a successor which is Tkrzw. It does require C++17 but might be worth a look. Unfortunately there do not seem to be any modules on CPAN using it yet, AFAICS.
That looks interesting. Maybe a Python or C++ demonstration, then :) I added it to my TODO list.
Re: Solving the Long List is Long challenge - Tokyo Cabinet
by marioroy (Prior) on Jul 14, 2023 at 00:17 UTC
Hi, I'm adding another to the mix: I also tried Tokyo Cabinet with Perl. The demonstration creates 32 hash databases (by default) in a temp directory. Crypt::xxHash is used to determine which database to insert/update. Sorting is handled by Sort::Packed. Like the DB_File variant, this is best run on a Unix OS.

Usage:
  KEYSIZE=N NUM_THREADS=N NUM_MAPS=N perl lliltch.pl file...
  perl lliltch.pl --keysize=N --threads=N --maps=N file...
  perl lliltch.pl --keysize=N --threads=max --maps=max file...