I am having trouble matching Unicode characters with regular expressions in Perl v5.8.8. I need four uncommon characters to work in my patterns, and only one of them (MIDDLE DOT, 00B7) does. The three that still fail are COMBINING ACUTE ACCENT (0301), MODIFIER LETTER GLOTTAL STOP (02C0), and LATIN SMALL LETTER TURNED V (028C). Below are the approaches I have tried so far. Any suggestions?

Code that works, but not with my special characters:

Trial 1: regex using the full character name:

A. This does not work:

#!/usr/bin/perl
use utf8;
use charnames ':full';

while ($line=<>){
    @array = split(/ /, $line);
    foreach $x (@array){
        if ($x=~ /\N{MODIFIER LETTER GLOTTAL STOP}/){ #full character name for a glottal stop
            print "$x\n";
        }
    }
}

B. This works:

#!/usr/bin/perl

use utf8;
use charnames ':full';

while ($line=<>){
        @array = split(/ /, $line);
        foreach $x (@array){
                if ($x=~ /\N{LATIN SMALL LETTER K}/){ #full character name for a 'k'
                        print "$x\n";
                }
        }
}

Trial 2: regex using character codes:

A. This does not work:

#!/usr/bin/perl

use utf8;
use charnames ':full';

while ($line=<>){
    @array = split(/ /, $line);
    foreach $x (@array){
        if ($x=~ /\x{02c0}/){ #code for glottal stop 
            print "$x\n";
        }
    }
}

B. This works ('k' instead of the glottal stop):

#!/usr/bin/perl
use utf8;
use charnames ':full';

while ($line=<>){
        @array = split(/ /, $line);
        foreach $x (@array){
                if ($x=~ /\x{006b}/){ #code for lower case 'k'
                        print "$x\n";
                }
        }
}

Interesting: for whatever reason, the MIDDLE DOT works, no matter how I enter it (literally, by code (00B7), or by full character name):

#!/usr/bin/perl

use utf8;
use charnames ':full';

while ($line=<>){
        @array = split(/ /, $line);
        foreach $x (@array){
                if ($x=~ /\N{MIDDLE DOT}/){
                        print "$x\n";
                }
        }
}

Correct Output:

perl oneidaregex.pl oneidafull.txt
Né·
niwakkaló·tʌ.
niyawʌ́·u.
Né·
kwí·

Thanks for the help!

Update: tried it on v5.10.0 and it still doesn't work.
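
One possible explanation, assuming the input file is UTF-8 encoded (the post does not say which encoding it uses): `use utf8` only tells Perl that the script's own source is UTF-8. Lines read with `<>` are still raw bytes unless they are decoded. Every "character" in such a line is then a byte in the range 00-FF, so a pattern for 02C0, 028C, or 0301 can never match. MIDDLE DOT would match only by coincidence, because its UTF-8 encoding (C2 B7) happens to contain the byte B7. The sketch below simulates this. The helper `words_containing` and the sample line are made up for illustration and are not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use charnames ':full';
use Encode qw(encode decode);

# Hypothetical helper: return the space-separated words of $line
# that contain the single character $char.
sub words_containing {
    my ($line, $char) = @_;
    return grep { /\Q$char\E/ } split / /, $line;
}

my $dot     = "\N{MIDDLE DOT}";
my $glottal = "\N{MODIFIER LETTER GLOTTAL STOP}";

# Simulate a line as <> returns it without decoding: raw UTF-8 bytes.
my $bytes = encode('UTF-8',
    "Ne\N{COMBINING ACUTE ACCENT}$dot t\N{LATIN SMALL LETTER TURNED V} ${glottal}a");

# Decode the bytes into a character string before matching.
my $chars = decode('UTF-8', $bytes);

my @dot_in_bytes     = words_containing($bytes, $dot);
my @glottal_in_bytes = words_containing($bytes, $glottal);
my @glottal_in_chars = words_containing($chars, $glottal);

print "bytes, middle dot:   ", scalar(@dot_in_bytes),     "\n";  # 1 (byte B7 matches by accident)
print "bytes, glottal stop: ", scalar(@glottal_in_bytes), "\n";  # 0 (no byte can equal 02C0)
print "chars, glottal stop: ", scalar(@glottal_in_chars), "\n";  # 1 (matches after decoding)
```

If this is the cause, decoding each line right after reading it (`$line = decode('UTF-8', $line);`) and setting an output layer such as `binmode(STDOUT, ':encoding(UTF-8)');` before printing should let all four characters match in the scripts above.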

In reply to Regular Expressions on Unicode by larimar123
