larimar123 has asked for the wisdom of the Perl Monks concerning the following question:

I am having trouble using regular expressions on Unicode text. My Perl is v5.8.8. I need four uncommon characters to work in regular expressions, and only one of them (MIDDLE DOT) does. The three that still don't work are COMBINING ACUTE ACCENT (0301), MODIFIER LETTER GLOTTAL STOP (02C0), and LATIN SMALL LETTER TURNED V (028C). Below are the various ways I've tried, without success, to get the regex to match. Any suggestions?


Code that works, but not with my special characters:

Trial 1 RegEx using full character name:

A. This does not work:
#!/usr/bin/perl
use utf8;
use charnames ':full';
while ($line=<>){
    @array = split(/ /, $line);
    foreach $x (@array){
        if ($x=~ /\N{MODIFIER LETTER GLOTTAL STOP}/){ #full character name for a glottal stop
            print "$x\n";
        }
    }
}
B. This works:
#!/usr/bin/perl
use utf8;
use charnames ':full';
while ($line=<>){
    @array = split(/ /, $line);
    foreach $x (@array){
        if ($x=~ /\N{LATIN SMALL LETTER K}/){ #full character name for a 'k'
            print "$x\n";
        }
    }
}
Trial 2 RegEx using character codes:

A. This does not work:
#!/usr/bin/perl
use utf8;
use charnames ':full';
while ($line=<>){
    @array = split(/ /, $line);
    foreach $x (@array){
        if ($x=~ /\x{02c0}/){ #code for glottal stop
            print "$x\n";
        }
    }
}
B. This works ('k' instead of glottal stop):
#!/usr/bin/perl
use utf8;
use charnames ':full';
while ($line=<>){
    @array = split(/ /, $line);
    foreach $x (@array){
        if ($x=~ /\x{006b}/){ #code for lower case 'k'
            print "$x\n";
        }
    }
}
Interesting: for whatever reason, MIDDLE DOT works. It works however I enter it (directly, by code (00B7), or by full character name):
#!/usr/bin/perl
use utf8;
use charnames ':full';
while ($line=<>){
    @array = split(/ /, $line);
    foreach $x (@array){
        if ($x=~ /\N{MIDDLE DOT}/){
            print "$x\n";
        }
    }
}
Correct Output:
perl oneidaregex.pl oneidafull.txt
Né·
niwakkaló·tʌ.
niyawʌ́·u.
Né·
kwí·
Thanks for the help!


Update: tried it on v5.10.0 and it still doesn't work.

Re: Regular Expressions on Unicode
by moritz (Cardinal) on Dec 14, 2009 at 09:48 UTC
    You never decode the incoming data, so Perl defaults to Latin-1. Since MODIFIER LETTER GLOTTAL STOP is not a character from the Latin-1 range, it will never match. Try adding these two lines at the top of your script:
    binmode STDIN, ":encoding(UTF-8)";
    binmode STDOUT, ":encoding(UTF-8)";
    Perl 6 - links to (nearly) everything that is Perl 6.
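
The Latin-1 point above also suggests why MIDDLE DOT matched in the original post while the glottal stop did not. Here is a minimal sketch, assuming the input file is UTF-8 and is read without decoding. The byte string is a made-up sample, not the poster's data, and the variable names are only illustrative.

```perl
use strict;
use warnings;

# Assumed sample: the UTF-8 bytes for "a", MIDDLE DOT (U+00B7) and
# MODIFIER LETTER GLOTTAL STOP (U+02C0), as they arrive from an undecoded read.
my $bytes = "a\xC2\xB7\xCB\x80";

# Undecoded: every byte counts as one character in the range 0x00-0xFF.
# The second byte of MIDDLE DOT happens to be 0xB7, so it matches by accident.
my $dot_raw     = $bytes =~ /\x{00B7}/ ? 1 : 0;
# No single byte can have the value 0x2C0, so this can never match.
my $glottal_raw = $bytes =~ /\x{02C0}/ ? 1 : 0;

# Decoded copy: three characters (a, U+00B7, U+02C0).
my $chars = $bytes;
utf8::decode($chars);
my $dot_dec     = $chars =~ /\x{00B7}/ ? 1 : 0;
my $glottal_dec = $chars =~ /\x{02C0}/ ? 1 : 0;

print "undecoded: dot=$dot_raw glottal=$glottal_raw\n";
print "decoded:   dot=$dot_dec glottal=$glottal_dec\n";
```

On the undecoded string only the middle dot matches; after decoding both do.
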
      Took your advice and put those two lines at the top of the code I provided, but frustratingly it didn't work. I just began learning Perl about a month ago, so I apologize for sounding dense. To clarify, does it look like I did as you suggested? I am also looking through the website you linked, hopefully something more will turn up there. My code now looks like:
      #!/usr/bin/perl
      #I want to take a file of text as input, split it into an array of words
      #then search through the array for a word that matches the regular
      #expression, printing all matches.
      binmode STDIN, ":decoding(UTF-8)";
      binmode STDOUT, ":decoding (UTF-8)";
      use utf8;
      use charnames ':full';
      while ($line=<>){
          @array = split(/ /, $line);
          foreach $x (@array){
              if ($x=~ /\x{02c0}/){#glottal stop
                  print "$x\n";
              }
          }
      }
      Thanks so much for your help!
        That's roughly how I would have done it, except that :decoding(UTF-8) is wrong. The layer is still called :encoding(UTF-8), for input as well as output.

        Here is a working example of how to search for that character:

        use strict;
        use warnings;
        use charnames qw(:full);
        binmode STDOUT, ':encoding(UTF-8)';
        my $filename = 'test.txt';
        if (@ARGV) {
            open my $handle, '>:encoding(UTF-8)', $filename
                or die "Can't write to file '$filename': $!";
            print $handle <<"OUT";
        The next line
        contains a\N{MODIFIER LETTER GLOTTAL STOP}
        Really!
        OUT
            close $handle or warn $!;
        } else {
            open my $handle, '<:encoding(UTF-8)', $filename
                or die "Can't open file '$filename' for reading: $!";
            for (<$handle>) {
                print if /\N{MODIFIER LETTER GLOTTAL STOP}/;
            }
            close $handle;
        }

        When you call it with command-line arguments it writes a test file; when you call it without any, it reads that test file back:

        $ perl sample.pl gen
        $ perl sample.pl 
        contains aˀ
        

        I hope this helps. You can gradually morph it into the program you want; if you change something and it breaks, you know which change caused it.

        Why did you change it to decoding?
        use open IO => ":encoding(UTF-8)";
      He's not reading from STDIN; he's reading from ARGV, which only sometimes reads from STDIN. Setting binmode on STDIN will therefore work only in some circumstances, if at all.
Re: Regular Expressions on Unicode
by ikegami (Patriarch) on Dec 14, 2009 at 18:42 UTC

    A. This does not work:

    $line doesn't contain MODIFIER LETTER GLOTTAL STOP; it contains the bytes of some encoding of it. You need to decode the input or tell Perl to do it for you.

    You can tell Perl to handle the decoding using the :encoding PerlIO layer. It can be added to handles using binmode or use open.

    There's a catch.

    <> is short for <ARGV>. ARGV is a special handle, and unfortunately, adding PerlIO layers to it doesn't work well.

    It might be simplest to handle the decoding yourself. If the input is encoded using UTF-8, all you need is

    while (my $line = <>) {
        utf8::decode( $line );
        ...
    }
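
Filled in for the original task, the loop above might look like the sketch below. To keep it self-contained it reads from an in-memory filehandle holding made-up UTF-8 bytes instead of <>. The sample text, $raw, and @hits are assumptions for illustration only.

```perl
use strict;
use warnings;

binmode STDOUT, ':encoding(UTF-8)';   # encode characters on the way out

# Hypothetical stand-in for the input file: raw UTF-8 bytes held in memory.
# "\xCB\x80" is the UTF-8 encoding of U+02C0.
my $raw = "ka\xCB\x80 tak\nno match here\n";
open my $in, '<', \$raw or die "Can't open in-memory file: $!";

my @hits;
while (my $line = <$in>) {
    utf8::decode($line);              # bytes -> characters, in place
    chomp $line;
    for my $word (split ' ', $line) {
        push @hits, $word if $word =~ /\x{02C0}/;
    }
}
close $in;

print "$_\n" for @hits;
```

Only the first word of the sample contains the glottal stop, so it is the only one collected. In a real script the inner loop would sit inside while (my $line = <>) unchanged.
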

    For other encodings, use Encode's decode.
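
As a sketch of the Encode route, assume the input were UTF-16BE instead of UTF-8. The four-byte sample below is invented for the example; decode is the function exported by the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical sample: "a" followed by U+02C0, encoded as UTF-16BE bytes.
my $bytes = "\x00\x61\x02\xC0";

# Before decoding, the string is four separate bytes; none of them is U+02C0.
my $matched_raw = $bytes =~ /\x{02C0}/ ? 1 : 0;

# After decoding, the string is two characters: "a" and U+02C0.
my $text = decode('UTF-16BE', $bytes);
my $matched_decoded = $text =~ /\x{02C0}/ ? 1 : 0;

printf "raw:     %d chars, match=%d\n", length($bytes), $matched_raw;
printf "decoded: %d chars, match=%d\n", length($text),  $matched_decoded;
```

The pattern is the same in both tests; only the decoding step changes whether it can match.
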

Re: Regular Expressions on Unicode
by graff (Chancellor) on Dec 14, 2009 at 22:08 UTC
    When I write while (<>){...} with the intention of using the script on utf8 data that comes from either named files in @ARGV or redirected/piped input via STDIN, I normally include these lines near the top:
    use open IN => ':utf8';
    binmode STDIN, ':utf8';
    The first line takes care of making sure that all files in @ARGV get opened with the intended encoding layer, and the second line covers STDIN. (I also typically include , OUT => ':utf8' on the first line, and add a third line for STDOUT.)

    The difference between ":encoding(utf8)" and just plain ":utf8" is, I think, simply a matter of how much you want to trust your input. If there are encoding errors (sequences of non-ASCII bytes that do not form valid utf8 characters), the simpler form will just cause the program to die with an error message, whereas ":encoding(utf8)" will give a detailed warning message, supply a replacement string that makes the problem easy to spot, and keep running.

    (updated code snippet to normalize quotes)