in reply to Re: Solving the Long List is Long challenge - Kyoto Cabinet
in thread Solving the Long List is Long challenge, finally?
Kyoto Cabinet is the successor to Tokyo Cabinet.
Thanks for reminding me of the existence of Kyoto Cabinet. I looked at it for a particular project some years back and was impressed with the speed and ease of use. Unfortunately the project was canned before it could be used in production.
However, I see from the linked page that Kyoto Cabinet itself now has a successor which is Tkrzw. It does require C++17 but might be worth a look. Unfortunately there do not seem to be any modules on CPAN using it yet, AFAICS.
🦛
Re^3: Solving the Long List is Long challenge - Learning Tkrzw
by marioroy (Prior) on Jul 15, 2023 at 07:21 UTC
> I see from the linked page that Kyoto Cabinet itself now has a successor which is Tkrzw.
I found time to try the Tkrzw C++ library. Tkrzw provides sharding capabilities. Spoiler alert... It's awesome :)
C++ bits:
ls -1 /dev/shm
I will come back after completing a new llil4shard C++ variant. The Tkrzw library is amazing. In the meantime, the left column is the number of shards used to process 26 big files (48 CPU threads). Total: 91,395,200 lines, 79,120,065 unique keys. For reference, Perl lliltch.pl "get properties" takes 9.533 seconds with 128 maps (or shards).
Yay :) Tkrzw provides the increment method. Like Kyoto Cabinet, the value is stored as an 8-byte big-endian integer.
So much learning from the Long List is Long series :) Notice above, no locking among threads for incrementing the count. No local hash either. The "IncrementSimple" method is a single operation. I tested retrieval and conversion, which will be done later in the code.
Update: Iteration is slow
Notice "tkrzw to vector". I will try again later and iterate the individual maps (or shards) in parallel.
Compared to Perl using Tokyo Cabinet :)
Update: Parallel iteration
This works :) To make iteration faster, iterate all the maps (or shards) in parallel, then append to the property vector serially.
Results:
Let's try processing 26 big files :) Get properties is 3 times faster than Perl. The QPS figures are count_lines and count_unique, respectively, divided by the elapsed time, and are reported in millions.
Thank you, hippo, for mentioning the Tkrzw C++ library. I'm one step away from posting the new llil variant. Currently, the db path is hard-coded to "/dev/shm/casket.tkh".
Update: app-level sharding
For better performance, I tried constructing an array of "tkrzw::HashDBM" objects versus a single "tkrzw::ShardDBM" object. This requires the application to compute the hash value, which is not a problem. Below, see timings for app-level sharding.
Update: MAX_STR_LEN_L optimization
Notice "vector stable sort" completing in half the time. The code is final, and I will post two Tkrzw variants this evening (one with sharding by the C++ library, another with application-level sharding).
by marioroy (Prior) on Jul 17, 2023 at 05:27 UTC
I finished the tkrzw::ShardDBM demonstration. Sharding is managed by the C++ library.
Update: Replaced bswap_64 with the library's tkrzw::StrToIntBigEndian function.
llil4tkh.cc
by marioroy (Prior) on Jul 17, 2023 at 05:33 UTC
I created another Tkrzw demonstration. This one constructs many HashDBMs. Basically, sharding is managed by the application.
Update 1: The HashDBMs are now interchangeable/compatible with ShardDBMs, since they use the same hash function.
llil4tkh2.cc
Re^3: Solving the Long List is Long challenge - Kyoto Cabinet
by marioroy (Prior) on Jul 14, 2023 at 14:47 UTC
> However, I see from the linked page that Kyoto Cabinet itself now has a successor which is Tkrzw. It does require C++17 but might be worth a look. Unfortunately there do not seem to be any modules on CPAN using it yet, AFAICS.
That looks interesting. Then, maybe a Python or C++ demonstration :) I added it to my TODO list.