Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

The core module Unicode::UCD has a function called "charprop" that looks up the values of various properties of Unicode characters. (These properties can also be retrieved as a hash by calling charprops_all.) Looking up the "Age" of a code point is extremely fast, but "Name" is very slow. Does anyone know what the problem is and how to speed up the "Name" lookup?

The first example looks up "Age" for each character in the "Alchemical Symbols" block, and the second example looks up the "Name" of the symbols:

time perl -MUnicode::UCD=charprop -le 'print "$_ ".charprop($_,$ARGV[0]) for "128768".."128895"' Age
real	0m0.920s
user	0m0.888s
sys	0m0.017s
time perl -MUnicode::UCD=charprop -le 'print "$_ ".charprop($_,$ARGV[0]) for "128768".."128895"' Name
real	0m38.150s
user	0m37.429s
sys	0m0.454s
(Grabbing the hash from charprops_all is similarly slow...)

Re: Unicode::UCD=charprop and the speed of various keys
by davido (Cardinal) on Aug 26, 2018 at 17:50 UTC

    Devel::NYTProf can be used to profile your example code without any real modification other than to add -d:NYTProf to the command line.

    Most of the time is spent in a subroutine called prop_invmap. These are the most expensive lines in that subroutine:

    3353	3970944	2.22s			                my ($hex_code_point, $name) = split "\t", $line;
    3354					
    3355					                # Weeds out all comments, blank lines, and named sequences
    3356	3970944	5.69s	3970944	828ms	                next if $hex_code_point =~ /[^[:xdigit:]]/a;
                    # spent   828ms making 3970944 calls to Unicode::UCD::CORE:match, avg 208ns/call
    3357					
    3358	3914368	648ms			                my $code_point = hex $hex_code_point;
    3359					
    3360					                # The name of all controls is the default: the empty string.
    3361					                # The set of controls is immutable
    3362	3914368	5.18s	3914368	475ms	                next if chr($code_point) =~ /[[:cntrl:]]/u;
                    # spent   475ms making 3914368 calls to Unicode::UCD::CORE:match, avg 121ns/call
    3363					
    3364					                # If this is a name_alias, it isn't a name
    3365	3894016	1.85s			                next if grep { $_ eq $name } @{$aliases{$code_point}};
    3366					
    3367					                # If we are beyond where one of the special lines needs to
    3368					                # be inserted ...
    3369	3854464	1.10s			                while ($i < @$algorithm_names
    3370					                    && $code_point > $algorithm_names->[$i]->{'low'})
    3371					                {
    

    It might be worthwhile looking at mitigation options. If you are willing to throw memory at the problem, subclass or monkeypatch Unicode::UCD, and in your subclass use Memoize to memoize prop_invmap. The results are astounding:

    real	0m0.275s
    user	0m0.263s
    sys	0m0.012s
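
    For readers who have not used Memoize, here is a minimal sketch of the effect on a hypothetical stand-in function. The names slow_lookup and %calls are made up for illustration and are not part of Unicode::UCD. Memoize caches return values keyed on the argument list, so repeated calls with the same arguments in the same context skip the expensive body:

```perl
use strict;
use warnings;
use Memoize;

# Hypothetical stand-in for an expensive function such as prop_invmap.
# %calls counts how many times the body actually runs per argument.
my %calls;

sub slow_lookup {
    my ($property) = @_;
    $calls{$property}++;
    # Pretend this parses a large table; here it only builds a small list.
    return map { "$property:$_" } 1 .. 3;
}

# Replace main::slow_lookup with a caching wrapper.
memoize('slow_lookup');

# Three calls in list context with the same argument.
my @first  = slow_lookup('Name');
my @second = slow_lookup('Name');
my @third  = slow_lookup('Name');

print "body ran $calls{Name} time(s) for 'Name'\n";
print "results identical: ", ("@first" eq "@third" ? "yes" : "no"), "\n";
```

    Memoize keeps separate caches for list and scalar context, so the saving applies only when the function is called the same way each time.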

    Here's an inelegant example of monkeypatching prop_invmap in a module that otherwise simply exposes Unicode::UCD:

    package MyUnicodeUCD;
    use strict;
    use warnings;

    # The functions Unicode::UCD can export, re-exported from this package.
    use constant EXPORT_OK => [
        qw(
            charinfo charblock charscript charblocks charscripts
            charinrange charprop charprops_all general_categories
            bidi_types compexcl casefold all_casefolds casespec namedseq
            num prop_aliases prop_value_aliases prop_values prop_invlist
            prop_invmap search_invlist MAX_CP
        ),
    ];

    use Unicode::UCD @{EXPORT_OK()};
    use Exporter;
    our @ISA       = qw(Exporter);
    our @EXPORT_OK = @{EXPORT_OK()};

    use Memoize;
    memoize 'prop_invmap';

    # Global monkeypatch: every caller of Unicode::UCD now gets the memoized sub.
    *Unicode::UCD::prop_invmap = \&prop_invmap;

    1;

    Dave

      Thank you for your time, Dave. I didn't know you could Memoize a sub from another module! I'm sure that trick will come in handy. I went back to the docs and realized there is another function in Unicode::UCD called charinfo() that returns code point names very rapidly:
      time perl -MUnicode::UCD=charinfo -le 'for ("128768".."128895"){ $c=charinfo($_); print "$_ ".$c->{name} }'
      real	0m0.230s
      user	0m0.207s
      sys	0m0.017s
      Thanks again for the valuable lesson!
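
      The one-liner above can also be written as a short script. This is only a sketch: the variable names and the hex range (0x1F700..0x1F77F, the same code points as decimal 128768..128895) are spelled out here for illustration, and the guard skips any code point for which charinfo returns nothing usable:

```perl
use strict;
use warnings;
use Unicode::UCD qw(charinfo);

# Alchemical Symbols block: U+1F700..U+1F77F (decimal 128768..128895).
my ($first, $last) = (0x1F700, 0x1F77F);

my %name_of;
for my $cp ($first .. $last) {
    # charinfo returns a hash reference describing the code point.
    my $info = charinfo($cp);
    # Skip code points with no usable entry (e.g. unassigned ones).
    next unless $info && defined $info->{name};
    $name_of{$cp} = $info->{name};
}

printf "U+%05X %s\n", $_, $name_of{$_} for sort { $a <=> $b } keys %name_of;
```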

        I encourage you, if you are going to go with memoization, to create a more robust wrapper for Unicode::UCD instead of using the global monkeypatching technique. This technique is fragile because it modifies a subroutine's behavior that might be shared / used by other consumers of Unicode::UCD. For example, you might be writing module Foo, which uses MyUnicodeUCD, which monkeypatches Unicode::UCD. But you might also be using module Bar from CPAN (names are made up to protect the innocent). Maybe module Bar also uses Unicode::UCD. Your monkeypatching would propagate back to alter Unicode::UCD for all callers, including Bar which isn't expecting modified behavior.

        Memoization is probably pretty innocuous in this case -- you're unlikely to fill all of available memory by memoizing those calls even if there is some other consumer of the function elsewhere in your code base. But it's not generally a great practice to do that. Creating a package that exposes functions that are thin wrappers around Unicode::UCD could be a better solution, because each wrapper that invokes the expensive subroutine can localize the typeglob assignment so that it lasts only for the duration of that call. You could do something like this, for example:

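        # Assumed to be loaded near the top of the wrapper package.
        use Unicode::UCD ();
        use Memoize;

        BEGIN {
            Unicode::UCD->import('prop_invmap');
            memoize 'prop_invmap';
        }

        # Written to be called as a class method, e.g. MyUnicodeUCD->charinfo($cp)
        sub charinfo {
            my ($self, $arg) = @_;
            # The memoized sub replaces the original only until this sub returns.
            local *Unicode::UCD::prop_invmap = \&prop_invmap;
            return Unicode::UCD::charinfo($arg);
        }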

        With this strategy the memoized sub is injected into Unicode::UCD only for the duration of the call to charinfo, and then Unicode::UCD reverts to original behavior when charinfo's scope ends. You would possibly want to do this for each sub from Unicode::UCD that uses prop_invmap. There are several. For the rest of your MyUnicodeUCD you would just import the original subroutine into MyUnicodeUCD's namespace where it should be able to work without writing a wrapper.
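
        To see the scoping behaviour in isolation, here is a self-contained sketch. SlowLib and FastWrapper are hypothetical packages standing in for Unicode::UCD and the wrapper module, and a plain hash cache is used in place of Memoize to keep the example small. The cached sub is visible to the library only while the wrapper's call is running:

```perl
use strict;
use warnings;

# Hypothetical stand-in for a library such as Unicode::UCD.
package SlowLib;
our $body_runs = 0;
sub expensive { $body_runs++; return "value for $_[0]" }
sub lookup    { return expensive($_[0]) }    # internal caller of expensive()

# Hypothetical wrapper package that caches only within its own calls.
package FastWrapper;
my %cache;
my $original = \&SlowLib::expensive;         # keep a handle on the real sub

sub cached_expensive {
    my ($key) = @_;
    $cache{$key} = $original->($key) unless exists $cache{$key};
    return $cache{$key};
}

sub lookup {
    my ($key) = @_;
    # Swap in the cached version only while this call is running.
    local *SlowLib::expensive = \&cached_expensive;
    return SlowLib::lookup($key);
}

package main;

FastWrapper::lookup('Name') for 1 .. 3;      # real body runs once, then cache hits
my $runs_via_wrapper = $SlowLib::body_runs;

SlowLib::lookup('Name') for 1 .. 2;          # direct callers still get the original
my $runs_total = $SlowLib::body_runs;

print "runs via wrapper: $runs_via_wrapper, total: $runs_total\n";
```

        The two direct calls at the end each run the original body again, which shows that the replacement did not leak out of the wrapper's scope.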

        All of this is still a little fragile, as it depends on nothing really changing in the interface for Unicode::UCD, nor in the implementation of functions that call prop_invmap. But for a specific use case, it could be just fine.


        Dave