how can I get a list of all unicode characters that have a certain attribute?

danmcb has asked for the wisdom of the Perl Monks concerning the following question:

Say I want to print out the Cyrillic alphabet ...

That is: all characters that match m/\p{Cyrillic}/

How can I get all of these in an array, in an efficient way?

(Update:) Thanks, rhesa and cbu. I had considered a looping solution but wanted to avoid one. The UCD-based one looks very clean; I'll test it when I get to my dev machine.

Re: how can I get a list of all unicode characters that have a certain attribute? by rhesa (Vicar) on Mar 05, 2007 at 14:53 UTC
Have a look at Unicode::UCD. You can get the info from the `charblock` and `charscript` functions.

Update: Added example script.

    use Unicode::UCD qw/charblock/;
    binmode STDOUT, ":utf8";
    my $range = charblock(q{Cyrillic});
    for ( $range->[0][0] .. $range->[0][1] ) {
        print "$_: ", chr($_), "\n";
    }
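
A block and a script are not the same thing: the Cyrillic block is one contiguous range, while the characters of the Cyrillic script are spread over several ranges. As a minimal sketch, assuming `charscript` accepts a script name and returns a list of `[first, last, ...]` ranges as the Unicode::UCD documentation describes, the following collects the script's characters into an array. The helper name `chars_in_script` is made up for illustration.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charscript);

# Hypothetical helper: return every character that the UCD assigns
# to the named script, as a flat list of one-character strings.
sub chars_in_script {
    my ($script) = @_;

    # Given a script name, charscript() returns a reference to a list
    # of ranges; nothing is returned for an unknown name.
    my $ranges = charscript($script)
        or die "Unknown script: $script\n";

    my @chars;
    for my $range (@$ranges) {
        my ($first, $last) = @$range;    # any further elements are ignored
        push @chars, map { chr($_) } $first .. $last;
    }
    return @chars;
}

my @cyrillic = chars_in_script('Cyrillic');

binmode STDOUT, ':encoding(UTF-8)';
printf "%d characters in the Cyrillic script\n", scalar @cyrillic;
```

The count depends on the Unicode version bundled with the perl in use, so it will vary between installations.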
Re: how can I get a list of all unicode characters that have a certain attribute? by Tux (Canon) on Mar 05, 2007 at 14:55 UTC
If you want to do it for all codepoints supported/known by the current perl, this is quite efficient:

    #!/usr/bin/perl

    use strict;
    use warnings;

    binmode STDOUT, ":utf8";

    # unicore/Name.pl is internal to perl; this code assumes the
    # tab-separated "first, last, name" layout, which may differ
    # between perl versions.
    sub Names ()
    {
        do "unicore/Name.pl";
    } # Names

    my (%name, %cp, $n);
    for (split m/\n/ => Names ()) {
        s/\s+$//;
        my ($cp, $cp2, $name) = split m/\t/, $_, 3;
        $name =~ m/[a-z]/ and next;     # Non-character
        ($cp, $cp2) = map { hex "0$_" } ($cp, $cp2);
        $name{$cp} = $name;
        $cp{$name} //= $cp;             # //= needs perl 5.10 or later
    }

    my @codepoints = sort { $a <=> $b } keys %name;
    my $ncp = @codepoints;
    print "Testing $ncp codepoints\n";
    foreach my $cp (@codepoints) {
        my $chr = chr $cp;              # int-2-char conversion
        $chr =~ m/\p{Cyrillic}/ or next;
        printf "U+%04X %s\t%s\n", $cp, $chr, $name{$cp};
    }

Enjoy, Have FUN! H.Merijn
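
For comparison, the plain looping solution mentioned in the question can be sketched without reading any of perl's internal tables. This is only an illustration, not a replacement for the script above: it tests every code point up to U+10FFFF (skipping the surrogate range) against the same `\p{Cyrillic}` property, so it does more work per run, but it depends on nothing except the regex engine.

```perl
use strict;
use warnings;

# Brute force: test each code point against the property and keep the matches.
my @cyrillic;
for my $cp (0 .. 0x10FFFF) {
    next if $cp >= 0xD800 && $cp <= 0xDFFF;    # skip surrogates
    my $chr = chr $cp;
    push @cyrillic, $chr if $chr =~ /\p{Cyrillic}/;
}

binmode STDOUT, ':encoding(UTF-8)';
printf "%d code points match \\p{Cyrillic}\n", scalar @cyrillic;
```

Unlike the script above, this also counts unnamed code points if they happen to match, and the exact meaning of the short form `\p{Cyrillic}` has varied between perl releases, so the result may differ slightly from other approaches.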
Re:how can I get a list of all unicode characters... by graff (Chancellor) on Mar 06, 2007 at 01:44 UTC
Along the same lines suggested by cbu, I found a very nice little snippet posted to the perl-unicode mailing list, and included it in this reply to an older SoPW node. I think you can see a nice way to adapt that little command-line tool into a module that uses perl's `grep` function to give you the sort of flexibility you want. In effect, a one-line idiom like this: `@names = grep /cyrillic/i, split /^/, do 'unicore/Name.pl';` is sort of what you're after, I think. Just put a variable into the regex for the grep, and you can fetch subsets based on all sorts of patterns in Unicode character names. (But if you're going to do this sort of grep repeatedly in one script, you probably want to run unicore/Name.pl just once and keep all its output in an array.)
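
The "run it once and keep the output" advice could look roughly like the sketch below. It assumes, as the one-liner does, that `do 'unicore/Name.pl'` returns the name table as a single string; that file is internal to perl, so its layout is not guaranteed. The helper name `names_matching` is hypothetical.

```perl
use strict;
use warnings;

# Load the name table once; every later query reuses this array.
my @name_lines = split /^/, do 'unicore/Name.pl';

# Hypothetical helper: return the table lines that match a compiled pattern.
# Pass the flags inside the qr// object (e.g. qr/.../i), because a qr//
# pattern keeps its own flags when it is used in a match.
sub names_matching {
    my ($pattern) = @_;
    return grep { $_ =~ $pattern } @name_lines;
}

my @cyrillic = names_matching(qr/cyrillic/i);
my @greek    = names_matching(qr/greek/i);

printf "%d lines mention Cyrillic, %d mention Greek\n",
    scalar @cyrillic, scalar @greek;
```

Each returned element is a raw line from the table, trailing newline included, so it may need further parsing depending on the layout used by the installed perl.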