
Larry Wall recently posted this nifty little script on the perl-unicode mailing list. Here it is, pretty much verbatim (I added the "S" to the shebang line, so that STDIN/STDOUT/STDERR are treated as utf8):

#!/usr/bin/perl -CS

$pat = shift;
if (ord $pat > 256) {
    $pat = sprintf("%04x", ord $pat);
}
elsif (ord $pat > 128) {        # arg in sneaky UTF-8
    $pat = sprintf("%04x", unpack("U0U",$pat));
}
 
@names = split /^/, do 'unicore/Name.pl';
for (@names) {
    if (/$pat/io) {
        $hex = hex($_);
        print chr($hex),"\t",$_;
    }
}

The idea is to list the Unicode characters (if any) whose names match whatever regular expression you pass in $ARGV[0]. Here's a relevant command-line usage example (Larry had this script in a file named "uni"):


uni "latin (?:small|capital) letter A with"

(update: if you try this, you'll want to be running in a terminal window that handles utf8 characters!)

So, all you need for what you want is the part that assigns the output of "unicore/Name.pl" to an array (this gives you the names from the Unicode character database), plus a grep through that array to get the set of vowels you want. Then convert the first token of each matching element (the hex code point that starts the line) into the character itself, and put those characters into a character-class expression. Something like:

my @names = split/^/, do 'unicore/Name.pl';

#...

my @vowelsets;
for my $v ( qw/A E I O U/ ) {
    push( @vowelsets, 
          join( '', map { chr hex( substr $_, 0, 4 ) }
                grep /LATIN (?:SMALL|CAPITAL) LETTER $v/, @names ));
}

# now you can use each element of @vowelsets as a character class
# (similarly for consonants...)

(updated this snippet: changed the map block from a regex to substr; updated a second time to use "chr hex()" in the map block -- each element of @names begins with a four-digit hex code-point value, which needs to be converted to a character.)
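To make the "use each element as a character class" step concrete, here is a minimal sketch. It runs on a small hand-written sample in "hex code, tab, name" form rather than on the real unicore/Name.pl, whose exact layout may differ between Perl versions. The names @sample and letter_set are made up for this illustration. Unlike the substr version above, it captures the whole leading hex field and requires a word boundary after the letter, so that "LETTER AE" does not land in the A set.

```perl
use strict;
use warnings;

# Hypothetical sample standing in for the lines of unicore/Name.pl.
my @sample = (
    "0041\tLATIN CAPITAL LETTER A\n",
    "0061\tLATIN SMALL LETTER A\n",
    "00C0\tLATIN CAPITAL LETTER A WITH GRAVE\n",
    "00E0\tLATIN SMALL LETTER A WITH GRAVE\n",
    "00E1\tLATIN SMALL LETTER A WITH ACUTE\n",
    "00E6\tLATIN SMALL LETTER AE\n",
    "00E9\tLATIN SMALL LETTER E WITH ACUTE\n",
);

# Return a string holding every character whose name is
# "LATIN SMALL/CAPITAL LETTER <letter>", plain or with modifiers.
sub letter_set {
    my ($letter, @names) = @_;
    return join '', map {
        /^([0-9A-Fa-f]+)\tLATIN (?:SMALL|CAPITAL) LETTER \Q$letter\E\b/
            ? chr(hex($1))
            : ()
    } @names;
}

my $a_set   = letter_set('A', @sample);
my $a_class = qr/[$a_set]/;    # the set contains only letters, so no escaping is needed

printf "%d characters in the A set\n", length $a_set;    # 5 with this sample
print "matched\n" if "voil\x{E0}" =~ $a_class;           # U+00E0 is in the set
```

Swapping @sample for the real @names array should give the full sets, assuming the lines really do have the "hex, tab, name" shape described above.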

Still a bit cumbersome, I suppose, but quite manageable and not that bulky.
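If reading unicore/Name.pl directly feels too fragile, a different route (not what Larry's script does) is to ask the core charnames module for the name of each code point in a range. The sketch below assumes that scanning U+0000 through U+024F (up to the end of Latin Extended-B) is enough for your data. The function name letter_set_via_charnames and that default range are choices made for this example only.

```perl
use strict;
use warnings;
use charnames ();    # core module; charnames::viacode() maps a code point to its name

# Collect the characters in a code point range whose Unicode name is
# "LATIN SMALL/CAPITAL LETTER <letter>", plain or with modifiers.
# The default range is an assumption made for this sketch.
sub letter_set_via_charnames {
    my ($letter, $first, $last) = @_;
    $first = 0x00  unless defined $first;
    $last  = 0x24F unless defined $last;

    my $set = '';
    for my $cp ($first .. $last) {
        my $name = charnames::viacode($cp);
        next unless defined $name;    # skip code points without a name
        $set .= chr $cp
            if $name =~ /^LATIN (?:SMALL|CAPITAL) LETTER \Q$letter\E\b/;
    }
    return $set;
}

my $e_set   = letter_set_via_charnames('E');
my $e_class = qr/[$e_set]/;

print "matched\n" if "caf\x{E9}" =~ $e_class;    # U+00E9 is LATIN SMALL LETTER E WITH ACUTE
```

This is slower than a single grep over the name list, but it relies only on a documented interface instead of an internal data file.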

In reply to Re: Regular expressions and accents by graff
in thread Regular expressions and accents by Anonymous Monk
