NERDVANA has asked for the wisdom of the Perl Monks concerning the following question:
Perl knows about a lot of POSIX character classes because you can match against them with the regex engine. But what if you want to reverse that? I would like to iterate over the characters that belong to a named character class, and take advantage of Perl's knowledge rather than building massive lists of my own. I would also like to avoid brute-force solutions like iterating over every character and testing each for membership in the set.
Does anyone know if there is a good way to do this? Ideally quick enough to look up random members of the set in O(log n) time or better.
Example of a Not-Good way to do this:
my @alpha= grep /[[:alpha:]]/, map chr, 0..0xEFFFF;
return $alpha[rand scalar @alpha];
Re: Inspect the members of a Posix Character Class
by haukex (Archbishop) on Jul 29, 2021 at 13:35 UTC
"POSIX Character Classes" in perlrecharclass gives the names of Unicode properties that are equivalent to the POSIX character classes, e.g. [[:ascii:]] is \p{ASCII} and so on, and then you can use Unicode::UCD to inspect those properties. So if I understand your question correctly:
use warnings;
use strict;
use Unicode::UCD 'prop_invlist';
my @invlist = prop_invlist("PosixAlpha");
for (my $i = 0; $i < @invlist; $i += 2) {
    my $upper = ($i + 1) < @invlist ? $invlist[$i+1] - 1
                                    : $Unicode::UCD::MAX_CP;
    my @chars = map { chr $_ } $invlist[$i] .. $upper;
    print join(", ", @chars), "\n";
}
__END__
A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z
a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z
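For the "random member in O(log n)" part of the question, here is a minimal sketch built on the same inversion list. It assumes the list fits comfortably in memory (it is tiny compared to the expanded character list); build_picker and the closure it returns are hypothetical names for this example, not part of Unicode::UCD. It precomputes cumulative member counts per range and binary-searches them.

```perl
use strict;
use warnings;
use Unicode::UCD 'prop_invlist';

# build_picker() is a hypothetical helper for this sketch. It returns a
# closure that maps a 0-based index (random if omitted) to a member of the
# property.
sub build_picker {
    my ($prop) = @_;
    my @inv = prop_invlist($prop);
    die "unknown or empty property: $prop" unless @inv;
    # An odd-length list means the last range runs to the maximum code point.
    push @inv, $Unicode::UCD::MAX_CP + 1 if @inv % 2;
    my (@starts, @cum);
    my $total = 0;
    for (my $i = 0; $i < @inv; $i += 2) {
        push @starts, $inv[$i];
        $total += $inv[$i + 1] - $inv[$i];   # size of the half-open range
        push @cum, $total;                   # members in ranges 0 .. this one
    }
    return sub {
        my $n = @_ ? shift : int rand $total;
        # Binary search for the first range whose cumulative count exceeds $n.
        my ($lo, $hi) = (0, $#cum);
        while ($lo < $hi) {
            my $mid = int(($lo + $hi) / 2);
            if ($cum[$mid] > $n) { $hi = $mid } else { $lo = $mid + 1 }
        }
        my $before = $lo ? $cum[$lo - 1] : 0;
        return chr($starts[$lo] + $n - $before);
    };
}

my $pick_alpha = build_picker('PosixAlpha');
print $pick_alpha->(), "\n";    # one random member of the class
```

The setup cost is proportional to the number of ranges, not the number of characters, and each lookup is a binary search over the ranges.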
Perfect! Yes this is exactly the sort of thing I was looking for.
Re: Inspect the members of a Posix Character Class
by choroba (Cardinal) on Jul 29, 2021 at 13:11 UTC
The POSIX classes must be used inside a character class:
grep /[[:alpha:]]/, map chr, 0 .. 0xEFFFF;
If running this is too slow, save the result (in a variable, or in a file for persistence).
Update: s/char/chr/, thanks pryrt.
map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
(You caught me in between edits. The code is corrected.)
The execution time is undesirable, but also it is an array of 118 thousand elements, which perl does not store efficiently. Even if I store them as one string, then Perl can't index into it efficiently.
I could cache it as a series of codepoint spans, like [ [ 0x41 => 0x5A ], ... ] and build these ahead of time, but I feel like there should be some way to query this right out of Perl's internal data structures. Someone must have wanted to do that before, and probably wrote a module for it?
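For what it's worth, the span idea does not need a hand-built cache if the spans come from Unicode::UCD's prop_invlist (the function used elsewhere in this thread). A minimal sketch of a membership test by binary search on such an inversion list follows; in_invlist is a hypothetical helper name, not a library function.

```perl
use strict;
use warnings;
use Unicode::UCD 'prop_invlist';

# in_invlist() is a hypothetical helper: returns 1 if code point $cp is in
# the set described by the inversion list @$inv, else 0.
sub in_invlist {
    my ($inv, $cp) = @_;
    # Binary search for the number of entries that are <= $cp.
    my ($lo, $hi) = (0, scalar @$inv);
    while ($lo < $hi) {
        my $mid = int(($lo + $hi) / 2);
        if ($inv->[$mid] <= $cp) { $lo = $mid + 1 } else { $hi = $mid }
    }
    # Even-indexed entries open a range and odd-indexed entries close it,
    # so an odd count means $cp sits inside a range.
    return $lo % 2;
}

my @alpha = prop_invlist('PosixAlpha');
printf "%s: %s\n", $_, in_invlist(\@alpha, ord($_)) ? 'alpha' : 'not alpha'
    for qw(A z 5 [);
```

The list stays as small as the number of ranges in the class, and each test costs O(log n) in that number.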
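Generating them isn't a huge cost, and it's just a one-time cost. And if you use a string rather than an array to store it, it's much more efficient to build, both in time and in memory. The trade-off, as the numbers below show, is that random access with substr is slower than indexing into an array.
Benchmark for [[:alpha:]] on Windows: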
__END__
C:\Users\peter.jones\Downloads\TempData\perl>posix_bench.pl
mem before generate arrays: 11,584 K
mem after generate arrays: 25,688 K
mem after delete arrays: 17,600 K
mem before generate strings: 17,600 K
mem after generate strings: 18,264 K
mem after delete strings: 18,264 K
COMPARE GENERATING:
         Rate genArr genStr
genArr 1.27/s     --   -92%
genStr 16.5/s  1198%     --
COMPARE ACCESSING:
            Rate getStr10k getArr10k
getStr10k 7.40/s        --      -98%
getArr10k  417/s     5539%        --
And practical code for using multiple posix character sets, and generating random strings from those sets:
#!perl
use 5.012; # strict, //
use warnings;
use open ':std', ':encoding(UTF-8)';
my (%posix, %lengths);
for my $class (qw/alpha digit punct/)
{
    $posix{$class} .= $_ for grep { /[[:${class}:]]/ } map {chr} 0 .. 0xEFFFF;
    $lengths{$class} = length($posix{$class});
}
sub get_random_char_from_posix
{
    my $class = shift;
    die "no such class" unless exists $posix{$class} and exists $lengths{$class};
    substr $posix{$class}, rand($lengths{$class}), 1;
}
use Data::Dump;
my $alpha_str; $alpha_str .= get_random_char_from_posix('alpha') for 1 .. 10;
my $digit_str; $digit_str .= get_random_char_from_posix('digit') for 1 .. 10;
my $punct_str; $punct_str .= get_random_char_from_posix('punct') for 1 .. 10;
dd $_ for $alpha_str, $digit_str, $punct_str;
__END__
C:\Users\peter.jones\Downloads\TempData\perl>posix_use.pl
"\x{288EC}\x{1309F}\x{88F8}\x{29B5D}\x{85A3}\x{209E2}\x{9EE1}\x{23015}\x{168AB}\x{2A691}"
"\x{B6D}\x{17E5}\x{1D7F6}\x{A627}\x{F21}\x{6F1}\x{1043}\x{A8D7}\x{118E4}\x{116C2}"
"\x{A673}\x{FE16}\x{110BC}\x{2CFF}\x{1BFE}\x{1804}\x{11642}\x{1AA0}\x{2051}\x{1183B}"
Re: Inspect the members of a Posix Character Class
by davido (Cardinal) on Jul 29, 2021 at 14:26 UTC
perl -MUnicode::UCD=prop_invlist -MString::Range::Expand=expand_expr -MList::Util=pairmap -E 'say join(",", sort map {chr} expand_expr(join(",", pairmap{"[$a-".($b-1)."]"} prop_invlist("PosixAlpha"))))'
A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,W,X,Y,Z,a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z
This isn't better than other solutions here, nor well golfed. It's just for fun.
I did want to show what code led to this one liner, though, as it shows some of the tricks we can derive from CPAN:
use strict;
use warnings;
use Unicode::UCD qw(prop_invlist);
use String::Range::Expand qw(expand_expr);
use List::Util qw(pairmap);
my @invlist = prop_invlist('PosixAlpha');
my $to_expand = join(',', pairmap {"[$a-" . ($b-1) . ']'} @invlist);
my @chars = sort map {chr $_} expand_expr($to_expand);
print join(',', @chars), "\n";
This produces the same output as above.
The only tricky part is that prop_invlist returns a list in the notation represented by [$a..$z), or in other words the start to one past the end. Consequently we need to pass $b-1 to expand_expr.
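The half-open convention has one more wrinkle worth sketching: an inversion list with an odd number of entries has no closing value for its last range, which then runs to the highest code point (the documented $Unicode::UCD::MAX_CP). A minimal sketch that turns any such list into inclusive [first, last] pairs follows; invlist_to_ranges is a hypothetical helper name, not something Unicode::UCD provides.

```perl
use strict;
use warnings;
use Unicode::UCD 'prop_invlist';

# invlist_to_ranges() is a hypothetical helper that converts the half-open
# [start, end) pairs of an inversion list into inclusive [first, last] pairs.
sub invlist_to_ranges {
    my @inv = @_;
    my @ranges;
    while (@inv) {
        my $first = shift @inv;
        # No closing entry means the range extends to the highest code point.
        my $last = @inv ? shift(@inv) - 1 : $Unicode::UCD::MAX_CP;
        push @ranges, [$first, $last];
    }
    return @ranges;
}

printf "U+%04X..U+%04X\n", @$_
    for invlist_to_ranges(prop_invlist('PosixAlpha'));
```

For PosixAlpha the list has an even length, so the special case never triggers there, but other properties may end with an open range.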
Re: Inspect the members of a Posix Character Class
by LanX (Saint) on Jul 29, 2021 at 12:52 UTC
> But what if you want to reverse that?
What does that mean? Could you give an example?
Ideally a Short, Self-Contained, Correct Example ...
Edit:
Did you check perlrecharclass, and do you also expect to get a list of characters from the intersection operations shown there?
Added an example for you (which actually runs a lot faster than I expected, but is still very inefficient). And I don't really need intersections. perlrecharclass doesn't seem to mention any API for querying Perl's internal tables.