Re: Inspect the members of a Posix Character Class

Re^2: Inspect the members of a Posix Character Class by NERDVANA (Deacon) on Jul 29, 2021 at 13:23 UTC
(You caught me in between edits. The code is corrected.) The execution time is undesirable, but it is also an array of 118 thousand elements, which Perl does not store efficiently. Even if I store them as one string, Perl can't index into it efficiently. I could cache it as a series of codepoint spans, like `[ [ 0x65 => 0x90 ], ... ]`, and build these ahead of time, but I feel like there should be some way to query this right out of Perl's internal data structures. And someone must have wanted to do that before, and probably wrote a module for it?
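One candidate for "query this right out of Perl's internal data structures" may be the core module Unicode::UCD, whose `prop_invlist` function returns an inversion list for a Unicode property. The following is only a minimal sketch. It assumes that `[[:alpha:]]` applied to Unicode strings corresponds to the `Alphabetic` property (which is how perlrecharclass describes `XPosixAlpha`); other POSIX classes may need a different property name, or may not map to a single property at all.

```perl
use strict;
use warnings;
use Unicode::UCD qw(prop_invlist);

# Assumption: "Alphabetic" stands in for [[:alpha:]] on Unicode strings.
my @invlist = prop_invlist("Alphabetic");
die "property not known to Unicode::UCD\n" unless @invlist;

# An inversion list alternates range starts (inclusive) and range ends
# (exclusive). An odd element count means the last range runs to the
# end of the code space, so 0x110000 is used as its end.
my $count = 0;
for (my $i = 0; $i < @invlist; $i += 2) {
    my $end = $i + 1 < @invlist ? $invlist[$i + 1] : 0x110000;
    $count += $end - $invlist[$i];
}

printf "%d ranges covering %d code points\n",
    int((@invlist + 1) / 2), $count;
printf "first range: U+%04X .. U+%04X\n", $invlist[0], $invlist[1] - 1;
```

The exact counts depend on the Unicode version bundled with the running perl, so they will not necessarily match the 118 thousand figure quoted above.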
Re^3: Inspect the members of a Posix Character Class by pryrt (Abbot) on Jul 29, 2021 at 15:00 UTC
Generating them isn't a huge cost, and it's just a one-time cost. And if you use a string rather than an array to store it, it's much more efficient, both in time and size. Benchmark for `[[:alpha:]]` on Windows:

```
__END__
C:\Users\peter.jones\Downloads\TempData\perl>posix_bench.pl
mem before generate arrays: 11,584 K
mem after generate arrays: 25,688 K
mem after delete arrays: 17,600 K
mem before generate strings: 17,600 K
mem after generate strings: 18,264 K
mem after delete strings: 18,264 K
COMPARE GENERATING:
         Rate genArr genStr
genArr 1.27/s     --   -92%
genStr 16.5/s  1198%     --
COMPARE ACCESSING:
            Rate getStr10k getArr10k
getStr10k 7.40/s        --      -98%
getArr10k  417/s     5539%        --
```

And practical code for using multiple POSIX character sets, and generating random strings from those sets:

```
#!perl
use 5.012; # strict, //
use warnings;
use open ':std', ':encoding(UTF-8)';

# One string of member characters per class, plus its length
my (%posix, %lengths);
for my $class (qw/alpha digit punct/) {
    $posix{$class} .= $_ for grep { /[[:${class}:]]/ } map {chr} 0 .. 0xEFFFF;
    $lengths{$class} = length($posix{$class});
}

# Pick one character at a random offset in the class string
sub get_random_char_from_posix {
    my $class = shift;
    die "no such class" unless exists $posix{$class} and exists $lengths{$class};
    substr $posix{$class}, rand($lengths{$class}), 1;
}

use Data::Dump;
my $alpha_str; $alpha_str .= get_random_char_from_posix('alpha') for 1 .. 10;
my $digit_str; $digit_str .= get_random_char_from_posix('digit') for 1 .. 10;
my $punct_str; $punct_str .= get_random_char_from_posix('punct') for 1 .. 10;
dd $_ for $alpha_str, $digit_str, $punct_str;
__END__
C:\Users\peter.jones\Downloads\TempData\perl>posix_use.pl
"\x{288EC}\x{1309F}\x{88F8}\x{29B5D}\x{85A3}\x{209E2}\x{9EE1}\x{23015}\x{168AB}\x{2A691}"
"\x{B6D}\x{17E5}\x{1D7F6}\x{A627}\x{F21}\x{6F1}\x{1043}\x{A8D7}\x{118E4}\x{116C2}"
"\x{A673}\x{FE16}\x{110BC}\x{2CFF}\x{1BFE}\x{1804}\x{11642}\x{1AA0}\x{2051}\x{1183B}"
```
Re^4: Inspect the members of a Posix Character Class by NERDVANA (Deacon) on Jul 29, 2021 at 17:40 UTC
Your benchmark seems to show that accessing an array is 55x faster than accessing a position in the string. This is what I would expect, because Unicode strings (UTF-8) don't have fixed-width characters and Perl has to go searching for character boundaries. For comparison purposes, could you add construction of an inversion list to your benchmark?

```
# $class and $max_codepoint are expected to be set in the enclosing scope
sub build_inversion_list_and_index {
    my @invlist;
    my $match;
    for (0..$max_codepoint) {
        # record a boundary whenever membership flips
        next unless $match xor (chr($_) =~ /[[:$class:]]/);
        push @invlist, $_;
        $match= !$match;
    }
    # close a range that is still open when the scan ends
    push @invlist, $max_codepoint+1 if $match;
    # $index[$k] = total number of members in ranges 0..$k
    my @index= ( 0 );
    for (my $i= 0; $i < @invlist; $i+= 2) {
        push @index, $index[-1] + $invlist[$i+1] - $invlist[$i];
    }
    shift @index;
    return \@invlist, \@index;
}
```

and random selection from the inversion list:

```
# needs: use feature 'signatures';
sub get_nth_char($i, $invlist, $index) {
    return undef if $i >= $index->[-1];
    my ($min, $max, $mid)= (0, $#$index);
    while (1) {
        $mid= ($min+$max) >> 1;
        # >= because $index->[$mid] is the first position past range $mid
        if ($i >= $index->[$mid]) { $min= $mid+1 }
        elsif ($mid > 0 && $i < $index->[$mid-1]) { $max= $mid-1 }
        else { return $invlist->[$mid*2] + ($i - ($mid > 0? $index->[$mid-1] : 0)) }
    }
}
```
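To make the inversion-list idea concrete, here is a small self-contained sketch restricted to ASCII, where the expected result is easy to check by hand: `[[:alpha:]]` covers `A`-`Z` and `a`-`z`, so the inversion list is `(65, 91, 97, 123)` and the cumulative index is `(26, 52)`. The helper names `invlist_for`, `cumulative_index` and `nth_member` are hypothetical and chosen for this illustration only; the lookup is a plain "first cumulative count greater than n" binary search, which is one of several ways to write it.

```perl
use strict;
use warnings;

# Hypothetical helper: inversion list of a POSIX class over 0 .. $max.
sub invlist_for {
    my ($class, $max) = @_;
    my $re = qr/[[:${class}:]]/;
    my @invlist;
    my $in = 0;
    for my $cp (0 .. $max) {
        my $hit = chr($cp) =~ $re ? 1 : 0;
        next if $hit == $in;          # membership unchanged, no boundary
        push @invlist, $cp;
        $in = $hit;
    }
    push @invlist, $max + 1 if $in;   # close a range still open at the end
    return \@invlist;
}

# Hypothetical helper: $index->[$k] = members contained in ranges 0 .. $k.
sub cumulative_index {
    my ($invlist) = @_;
    my @index;
    my $total = 0;
    for (my $i = 0; $i < @$invlist; $i += 2) {
        $total += $invlist->[$i + 1] - $invlist->[$i];
        push @index, $total;
    }
    return \@index;
}

# Hypothetical helper: code point of the $n-th member (0-based), or undef.
sub nth_member {
    my ($n, $invlist, $index) = @_;
    return undef unless @$index;
    return undef if $n < 0 || $n >= $index->[-1];
    my ($lo, $hi) = (0, $#$index);
    while ($lo < $hi) {               # find first range whose count exceeds $n
        my $mid = ($lo + $hi) >> 1;
        if ($n >= $index->[$mid]) { $lo = $mid + 1 } else { $hi = $mid }
    }
    my $before = $lo > 0 ? $index->[$lo - 1] : 0;
    return $invlist->[2 * $lo] + ($n - $before);
}

my $inv   = invlist_for('alpha', 127);
my $index = cumulative_index($inv);
print "invlist: @$inv\n";             # invlist: 65 91 97 123
print "index:   @$index\n";           # index:   26 52

# Random member, chosen uniformly over all members of the class
my $pick = nth_member(int(rand($index->[-1])), $inv, $index);
print "random member: ", chr($pick), "\n";
```

The boundary case is position 26: it equals the first cumulative count, so it has to fall into the second range and yield `a`, not the code point just past `Z`.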
Re^5: Inspect the members of a Posix Character Class by pryrt (Abbot) on Jul 29, 2021 at 17:49 UTC
Re^3: Inspect the members of a Posix Character Class by choroba (Cardinal) on Jul 29, 2021 at 13:27 UTC
> someone must have wanted to do that before?

What's your use case? Why do you need it?
Re^4: Inspect the members of a Posix Character Class by NERDVANA (Deacon) on Jul 29, 2021 at 13:59 UTC
I have a module that is basically an improved version of Data::Faker, and I have a few character classes that were useful to me. Before I publish this on CPAN, I wanted to make the module a bit more generic so that other people are more likely to find it useful. To that end, I want people to be able to say "give me characters from this charset X". I am looking for the most efficient way to do that, so that I'm not tossing out some big, bloated, slow memory hog of a module.
Re^4: Inspect the members of a Posix Character Class by NERDVANA (Deacon) on Aug 27, 2021 at 10:53 UTC
And now it's done: RFC: Mock::Data::Regex
Re^3: Inspect the members of a Posix Character Class by LanX (Saint) on Jul 29, 2021 at 13:29 UTC
Again, why? Could you give an example of the need for a condensed list of those 118000 elements? FWIW, 0xEFFFF is only 983039, so looping over all of them and filtering 11% is reasonably fast. Cheers, Rolf
Re^4: Inspect the members of a Posix Character Class by NERDVANA (Deacon) on Aug 27, 2021 at 10:52 UTC
Ta-Da! RFC: Mock::Data::Regex


	PerlMonks