Unicode::UCD=charprop and the speed of various keys

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

The core module Unicode::UCD has a function called "charprop" that looks up the value of a given property for a Unicode code point. (All of a code point's properties can also be fetched as a hash by calling charprops_all.) Looking up the "Age" of a code point is extremely fast, but "Name" is very slow. Does anyone know what the problem is and how to speed up the "Name" lookup?

The first example looks up "Age" for each character in the "Alchemical Symbols" block, and the second example looks up the "Name" of the symbols:


time perl -MUnicode::UCD=charprop -le 'print "$_ ".charprop($_,$ARGV[0]) for "128768".."128895"' Age

real	0m0.920s
user	0m0.888s
sys	0m0.017s


time perl -MUnicode::UCD=charprop -le 'print "$_ ".charprop($_,$ARGV[0]) for "128768".."128895"' Name

real	0m38.150s
user	0m37.429s
sys	0m0.454s

(Grabbing the hash from charprops_all is similarly slow...)


Replies are listed 'Best First'.
Re: Unicode::UCD=charprop and the speed of various keys by davido (Cardinal) on Aug 26, 2018 at 17:50 UTC
Devel::NYTProf can be used to profile your example code without any real modification other than adding `-d:NYTProf` to the command line. Most of the time is spent in a subroutine called `prop_invmap`. These are the most expensive lines in that subroutine (statement count, time spent on the line, then the source):

     3970944   2.22s   my ($hex_code_point, $name) = split "\t", $line;

                       # Weeds out all comments, blank lines, and named sequences
     3970944   5.69s   next if $hex_code_point =~ /[^[:xdigit:]]/a;
                       #   spent 828ms making 3970944 calls to Unicode::UCD::CORE:match, avg 208ns/call

     3914368   648ms   my $code_point = hex $hex_code_point;

                       # The name of all controls is the default: the empty string.
                       # The set of controls is immutable
     3914368   5.18s   next if chr($code_point) =~ /[[:cntrl:]]/u;
                       #   spent 475ms making 3914368 calls to Unicode::UCD::CORE:match, avg 121ns/call

                       # If this is a name_alias, it isn't a name
     3894016   1.85s   next if grep { $_ eq $name } @{$aliases{$code_point}};

                       # If we are beyond where one of the special lines needs to
                       # be inserted ...
     3854464   1.10s   while ($i < @$algorithm_names
                           && $code_point > $algorithm_names->[$i]->{'low'})
                       {

It might be worthwhile looking at mitigation options. If you are willing to throw memory at the problem, subclass or monkeypatch Unicode::UCD, and in your subclass use Memoize to memoize `prop_invmap`.
The results are astounding:

real	0m0.275s
user	0m0.263s
sys	0m0.012s

Here's an inelegant example of monkeypatching `prop_invmap` in a module that otherwise simply exposes Unicode::UCD:

    package MyUnicodeUCD;

    use strict;
    use warnings;

    # Everything Unicode::UCD can export, re-exported from this package.
    use constant EXPORT_OK => [
        qw(
            charinfo charblock charscript charblocks charscripts charinrange
            charprop charprops_all general_categories bidi_types compexcl
            casefold all_casefolds casespec namedseq num prop_aliases
            prop_value_aliases prop_values prop_invlist prop_invmap
            search_invlist MAX_CP
        ),
    ];

    use Unicode::UCD @{EXPORT_OK()};
    use Exporter;
    our @ISA       = qw(Exporter);
    our @EXPORT_OK = @{EXPORT_OK()};

    use Memoize;

    # Memoize our imported copy, then install the cached version back
    # into Unicode::UCD so that its internal callers use it too.
    memoize 'prop_invmap';
    {
        no warnings 'redefine';    # the replacement is deliberate
        *Unicode::UCD::prop_invmap = \&prop_invmap;
    }

    1;

Dave
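The final glob assignment in that module is what makes the cache visible to code inside Unicode::UCD. The sketch below illustrates the point with a made-up package: `Slow`, `Slow::expensive`, `Slow::public` and the helper subs are all hypothetical stand-ins (for Unicode::UCD, `prop_invmap` and `charprop` respectively), so treat it as an illustration of the mechanism rather than a description of Unicode::UCD's internals.

```perl
use strict;
use warnings;
use Memoize;

{
    package Slow;    # hypothetical stand-in for Unicode::UCD
    our $calls = 0;

    # Plays the role of prop_invmap: the expensive internal helper.
    sub expensive { my ($key) = @_; $calls++; return "value for $key" }

    # Plays the role of charprop: a public sub that calls the helper by name.
    sub public { my ($key) = @_; return expensive($key) }
}

# How many times Slow::expensive really runs during two identical lookups.
sub real_calls_for_two_lookups {
    my ($key) = @_;
    my $before = $Slow::calls;
    my $first  = Slow::public($key);
    my $second = Slow::public($key);
    return $Slow::calls - $before;
}

sub demo {
    my %seen;

    $seen{unpatched} = real_calls_for_two_lookups('a');

    # A memoized wrapper held only in our own variable: Slow::public
    # never sees it, because it still calls Slow::expensive by name.
    my $cached = memoize(\&Slow::expensive);
    $seen{private_wrapper} = real_calls_for_two_lookups('b');

    # Installing the wrapper into Slow's symbol table is what makes the
    # internal call go through the cache.
    {
        no warnings 'redefine';
        *Slow::expensive = $cached;
    }
    $seen{installed} = real_calls_for_two_lookups('c');

    return \%seen;
}
```

Calling `demo()` once should show that only the last step reduces the number of real calls, which is why MyUnicodeUCD has to write into `*Unicode::UCD::prop_invmap` rather than just memoizing its own imported copy.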
Re^2: Unicode::UCD=charprop and the speed of various keys by Anonymous Monk on Aug 27, 2018 at 04:07 UTC
Thank you for your time, Dave. I didn't know you could Memoize a sub from another module! I'm sure that trick will come in handy. I went back to the docs and realized there is another function in Unicode::UCD called charinfo() that returns code point names very rapidly:

time perl -MUnicode::UCD=charinfo -le 'for ("128768".."128895"){ $c=charinfo($_); print "$_ ".$c->{name} }'

real	0m0.230s
user	0m0.207s
sys	0m0.017s

Thanks again for the valuable lesson!
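For use in a script rather than a one-liner, the same charinfo() lookup could be wrapped in a small helper that builds a code point to name table. This is only a sketch: `names_for_range` is a hypothetical name, not part of Unicode::UCD, and it assumes charinfo() returns undef for unassigned code points, as its documentation describes.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charinfo);

# Build a { code point => name } table for an inclusive range.
# names_for_range is a hypothetical helper, not a Unicode::UCD function.
sub names_for_range {
    my ($first, $last) = @_;
    my %name_of;
    for my $cp ($first .. $last) {
        # Skip code points that charinfo() reports as unassigned (undef).
        my $info = charinfo($cp) or next;
        $name_of{$cp} = $info->{name};
    }
    return \%name_of;
}

# Usage: the Alchemical Symbols block, i.e. 128768..128895 in decimal.
my $names = names_for_range(0x1F700, 0x1F77F);
printf "U+%04X %s\n", $_, $names->{$_} for sort { $a <=> $b } keys %$names;
```

Skipping unassigned code points keeps the table free of undefined entries, since which code points in the block are assigned depends on the Unicode version that ships with your perl.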
Re^3: Unicode::UCD=charprop and the speed of various keys by davido (Cardinal) on Aug 27, 2018 at 19:16 UTC
I encourage you, if you are going to go with memoization, to create a more robust wrapper for Unicode::UCD instead of using the global monkeypatching technique. That technique is fragile because it modifies the behavior of a subroutine that might be shared by other consumers of Unicode::UCD.

For example, you might be writing module Foo, which uses MyUnicodeUCD, which monkeypatches Unicode::UCD. But you might also be using module Bar from CPAN (names are made up to protect the innocent). Maybe module Bar also uses Unicode::UCD. Your monkeypatching would propagate back to alter Unicode::UCD for all callers, including Bar, which isn't expecting modified behavior.

Memoization is probably pretty innocuous in this case: you're unlikely to fill all of available memory by memoizing those calls, even if there is some other consumer of the function elsewhere in your code base. But it's not generally a great practice.

Creating a package that exposes functions that are thin wrappers around Unicode::UCD could be a better solution, because each wrapper that ends up invoking the expensive subroutine can make the typeglob assignment with `local`. You could do something like this, for example:

    use Memoize;
    use Unicode::UCD ();    # load the module without importing anything yet

    BEGIN {
        Unicode::UCD->import('prop_invmap');
        memoize 'prop_invmap';
    }

    # Called as a class method: MyUnicodeUCD->charinfo($code_point)
    sub charinfo {
        my ($self, $arg) = @_;
        # Swap in the memoized sub for the duration of this call only.
        local *Unicode::UCD::prop_invmap = \&prop_invmap;
        return Unicode::UCD::charinfo($arg);
    }

With this strategy the memoized sub is injected into Unicode::UCD only for the duration of the call to `charinfo`, and Unicode::UCD reverts to its original behavior when `charinfo`'s scope ends. You would possibly want to do this for each sub from Unicode::UCD that uses prop_invmap. There are several. For the rest of MyUnicodeUCD you would just import the original subroutines into MyUnicodeUCD's namespace, where they should work without a wrapper.
All of this is still a little fragile, as it depends on nothing really changing in the interface for Unicode::UCD, nor in the implementation of functions that call prop_invmap. But for a specific use case, it could be just fine.

Dave
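The scoped-override idea can be tried out without touching Unicode::UCD at all. The following is a minimal sketch under stated assumptions: `Slow` and `CachedSlow` are hypothetical packages invented for the illustration (standing in for Unicode::UCD and a wrapper module), and the call counter exists only to make the effect observable.

```perl
use strict;
use warnings;
use Memoize;

{
    package Slow;    # hypothetical stand-in for Unicode::UCD
    our $calls = 0;
    sub expensive { my ($key) = @_; $calls++; return "value for $key" }
    sub public    { my ($key) = @_; return expensive($key) }
}

{
    package CachedSlow;    # hypothetical wrapper package

    # Memoized wrapper kept private to this package. Passing a code
    # reference means Memoize does not install it anywhere by itself.
    my $cached = Memoize::memoize(\&Slow::expensive);

    sub public {
        my ($key) = @_;
        # Swap in the cached helper for the duration of this call only.
        # Perl restores the original glob when this scope is left.
        local *Slow::expensive = $cached;
        return Slow::public($key);
    }
}
```

Repeated calls to `CachedSlow::public` with the same key should reach the real `Slow::expensive` only once, while direct callers of `Slow::public` keep the original, uncached behavior. Note that `local` on a typeglob temporarily replaces every slot of that name (scalar, array, hash and code), which is harmless here but worth knowing before applying the pattern elsewhere.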