in reply to Regular Expressions on Unicode

You never decode the incoming data, so Perl treats every input byte as a separate Latin-1 character. Since MODIFIER LETTER GLOTTAL STOP (U+02C0) is outside the Latin-1 range, the regex can never match. Try adding these two lines at the top of your script:
binmode STDIN, ":encoding(UTF-8)";
binmode STDOUT, ":encoding(UTF-8)";
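
To make the byte-versus-character point concrete, here is a minimal sketch. It uses Encode's decode as a stand-in for what an :encoding(UTF-8) layer does on input; the helper name has_glottal_stop and the sample string are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# U+02C0 MODIFIER LETTER GLOTTAL STOP is the two bytes CB 80 in UTF-8.
my $bytes = "a\xCB\x80";              # what an undecoded read hands you
my $chars = decode('UTF-8', $bytes);  # what a decoding layer would hand you

# Hypothetical helper: 1 if the string contains U+02C0, else 0.
sub has_glottal_stop { return $_[0] =~ /\x{02C0}/ ? 1 : 0 }

printf "raw bytes: length %d, match %d\n", length $bytes, has_glottal_stop($bytes);
printf "decoded:   length %d, match %d\n", length $chars, has_glottal_stop($chars);
```

The undecoded string is three one-byte characters, none of which is U+02C0, so only the decoded string matches.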
Perl 6 - links to (nearly) everything that is Perl 6.

Re^2: Regular Expressions on Unicode
by larimar123 (Initiate) on Dec 14, 2009 at 10:47 UTC
    Took your advice and put those two lines at the top of the code I provided, but frustratingly it didn't work. I just began learning Perl about a month ago, so I apologize for sounding dense. To clarify, does it look like I did as you suggested? I am also looking through the website you linked, hopefully something more will turn up there. My code now looks like:
    #!/usr/bin/perl
    #I want to take a file of text as input, split it into an array of words
    #then search through the array for a word that matches the regular
    #expression, printing all matches.
    binmode STDIN, ":decoding(UTF-8)";
    binmode STDOUT, ":decoding (UTF-8)";
    use utf8;
    use charnames ':full';
    while ($line=<>){
        @array = split(/ /, $line);
        foreach $x (@array){
            if ($x=~ /\x{02c0}/){#glottal stop
                print "$x\n";
            }
        }
    }
    Thanks so much for your help!
      That's roughly how I would have done it, except that :decoding(UTF-8) is wrong; the layer is still called :encoding(UTF-8), for input as well as output.
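
A misspelled layer is easy to miss because binmode does not die on it; it only returns false (and warns, if warnings are enabled). A minimal sketch of checking the return value follows. The helper layer_ok is hypothetical, and it uses an in-memory handle so the real STDIN and STDOUT are left alone.

```perl
use strict;
use warnings;

# Hypothetical helper: reports whether binmode accepts a layer string.
# An in-memory handle is used so the real STDIN/STDOUT are not touched.
sub layer_ok {
    my ($layer) = @_;
    my $buffer = '';
    open my $fh, '<', \$buffer or die "Can't open in-memory handle: $!";
    no warnings 'layer';    # silence "Unknown PerlIO layer" for the demo
    return binmode($fh, $layer) ? 1 : 0;
}

for my $layer (':encoding(UTF-8)', ':decoding(UTF-8)') {
    printf "%-18s %s\n", $layer, layer_ok($layer) ? 'accepted' : 'rejected';
}
```

In a real script, writing binmode(STDIN, ':encoding(UTF-8)') or die "binmode failed: $!"; should surface the same mistake right away.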

      Here is a working example of how to search for that character:

      use strict;
      use warnings;
      use charnames qw(:full);
      binmode STDOUT, ':encoding(UTF-8)';

      my $filename = 'test.txt';
      if (@ARGV) {
          open my $handle, '>:encoding(UTF-8)', $filename
              or die "Can't write to file '$filename': $!";
          print $handle <<"OUT";
      The next line
      contains a\N{MODIFIER LETTER GLOTTAL STOP}
      Really!
      OUT
          close $handle or warn $!;
      } else {
          open my $handle, '<:encoding(UTF-8)', $filename
              or die "Can't open file '$filename' for reading: $!";
          for (<$handle>) {
              print if /\N{MODIFIER LETTER GLOTTAL STOP}/;
          }
          close $handle;
      }

      When you call it with command-line arguments, it writes a test file; when called without any, it reads that test file back:

      $ perl sample.pl gen
      $ perl sample.pl 
      contains aˀ
      

      I hope this helps. You can gradually morph it into the program you want; if a change breaks it, you know which change caused the problem.

      Why did you change it to decoding?
      use open IO => ":encoding(UTF-8)";
        IO doesn't include STDOUT/STDIN
        perl -Mopen=IO,:encoding(UTF-8) -le"print join q! !, $$_[0], PerlIO::get_layers($$_[0], output => $$_[1]) for [*STDOUT,1], [*STDIN,0], [*STDERR,1] "
        *main::STDOUT unix crlf
        *main::STDIN unix crlf
        *main::STDERR unix crlf
        You want use open qw! :std :encoding(UTF-8) !; ex:
        perl -Mopen=:std,:encoding(UTF-8) -le"print join q! !, $$_[0], PerlIO::get_layers($$_[0], output => $$_[1]) for [*STDOUT,1], [*STDIN,0], [*STDERR,1] "
        *main::STDOUT unix crlf encoding(utf-8-strict) utf8
        *main::STDIN unix crlf encoding(utf-8-strict) utf8
        *main::STDERR unix crlf encoding(utf-8-strict) utf8
Re^2: Regular Expressions on Unicode
by ikegami (Patriarch) on Dec 14, 2009 at 18:25 UTC
    He's not reading from STDIN, he's reading from ARGV, which might read from STDIN. That will only work in some circumstances, if at all.
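
One way to sidestep the STDIN/ARGV question is to open each file named on the command line yourself, with the layer given explicitly. This is only a sketch of that idea, not the original poster's program; the helper name glottal_words is made up, and it keeps the split on single spaces from the code above.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the space-separated words in $path that
# contain U+02C0, decoding the file as UTF-8 while reading it.
sub glottal_words {
    my ($path) = @_;
    open my $in, '<:encoding(UTF-8)', $path
        or die "Can't open '$path' for reading: $!";
    my @hits;
    while (my $line = <$in>) {
        chomp $line;
        push @hits, grep { /\x{02C0}/ } split / /, $line;
    }
    close $in;
    return @hits;
}

# Encode output so the matched words print as UTF-8 rather than raw characters.
binmode STDOUT, ':encoding(UTF-8)';
print "$_\n" for map { glottal_words($_) } @ARGV;
```

Called as perl script.pl file1 file2, it decodes every named file the same way, regardless of which layers STDIN happens to have.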