in reply to Re^2: Size of Judy::HS array: where is MemUsed()?
in thread Size of Judy::HS array: where is MemUsed()?

G'day kcott,

During the "Rosetta Code: Long List is Long" experiment, I called the JudySL/HS free functions in C to obtain the amount of memory used. The Judy::HS Perl module also returns bytes. Judy::HS wowed me regarding memory utilization versus native hash. I did not call MemUsed at the time.

my $bytes = Free( $Judy );
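
For reference, a minimal sketch of how that measurement could look; Free() returning the byte count is as described above, while the Set() call and its signature are assumptions based on the Judy distribution's documentation rather than verified against any particular version:

    use Judy::HS qw( Set Free );

    my $judy;                          # opaque Judy::HS handle
    my $count = 0;
    while ( my $word = <STDIN> ) {
        chomp $word;
        Set( $judy, $word, $count++ ); # string key => integer value
    }

    # Free() tears the array down and reports how much memory it held.
    my $bytes = Free( $judy );
    print "Judy::HS held $count keys in $bytes bytes\n";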

Re^4: Size of Judy::HS array: where is MemUsed()?
by kcott (Archbishop) on Apr 10, 2023 at 23:42 UTC

    G'day Mario,

    Thanks for the feedback. As mentioned earlier, this was put on hold for a family Easter event; I expect to be working on it again this week.

    My main concern with MemUsed() was the bug(s) reported by hv: if I were to present Judy::HS at $work as a buggy module that needed patching and appeared to be abandonware, it probably wouldn't be received too well. Using Memory::Usage instead of MemUsed() would circumvent this problem; other parts of Judy::HS seem solid (from what I've read).
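
    As a rough sketch of the Memory::Usage approach (the word list and file path are illustrative, not the actual benchmark code):

        use Memory::Usage;

        my $mu = Memory::Usage->new();
        $mu->record('baseline');

        open my $fh, '<', '/usr/share/dict/words' or die $!;
        chomp( my @words = <$fh> );
        $mu->record('word list loaded');

        my %hash;
        $hash{$_} = 1 for @words;
        $mu->record('after %hash');

        # The Judy::HS array would be populated and recorded here too.

        $mu->dump();    # prints one line of memory figures per record() call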

    Early results do show that Judy::HS uses a lot less memory than a %hash.

    I initially used /usr/share/dict/australian-english to populate the hash keys. I chose this because it was the largest of several files I have in /usr/share/dict/ (the fact that I'm an Aussie was only a secondary consideration); however, I found that this file has entries with characters outside the 7-bit ASCII range (e.g. Ångström). This required some encoding manipulation for Judy::HS; creating this data structure was slower than for a %hash.
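
    One plausible shape for that encoding step (a sketch only, not the code actually used; it assumes UTF-8 octets are an acceptable key representation and that Judy::HS exports Set as documented):

        use Encode qw( encode_utf8 );
        use Judy::HS qw( Set );

        my $judy;
        open my $fh, '<:encoding(UTF-8)', '/usr/share/dict/australian-english'
            or die $!;
        while ( my $word = <$fh> ) {
            chomp $word;
            # Judy::HS keys are byte strings, so wide characters such as
            # those in "Ångström" are encoded back to UTF-8 octets first.
            Set( $judy, encode_utf8($word), 1 );
        }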

    /usr/share/dict/linux.words is the smallest in that directory and, as far as I can tell, only uses 7-bit ASCII. I'll be giving that a try to see how Judy::HS fares against %hash when there's no encoding consideration.
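
    A build-speed comparison for that pure-ASCII case could be as simple as the Benchmark sketch below (not the actual test harness; the iteration count is illustrative and Set/Free are assumed to be exported as documented):

        use Benchmark qw( cmpthese );
        use Judy::HS qw( Set Free );

        open my $fh, '<', '/usr/share/dict/linux.words' or die $!;
        chomp( my @words = <$fh> );

        cmpthese( 10, {
            'native %hash' => sub {
                my %h;
                $h{$_} = 1 for @words;
            },
            'Judy::HS' => sub {
                my $judy;
                Set( $judy, $_, 1 ) for @words;
                Free( $judy );
            },
        });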

    There are other areas I intend to address, which will likely include: reading the data structures with and without encoding; non-integer values; and complex structures (e.g. HoH).

    All very interesting; there should be a Meditation somewhere down the track with results of this investigation.

    — Ken

        This raises many good questions; however, at this stage, answering most would require crystal ball gazing.

        In terms of memory vs. speed, the latter is, by far, the more important.

        Choosing the largest file (/usr/share/dict/australian-english) for testing, and then discovering its encoding requirements (details earlier), was probably fortuitous in that it alerted me to this issue. However, strings consisting only of A, C, G & T contain nothing but 7-bit ASCII characters and would not require encoding. Testing with /usr/share/dict/linux.words may have interesting results.

        Although I did see potential $work applications, this really just started out of interest and was an academic exercise. I'll probably still continue investigating the aspects mentioned earlier, even if unsuitable for $work.

        — Ken