Re: Regular Expressions on Unicode

Re^2: Regular Expressions on Unicode by larimar123 (Initiate) on Dec 14, 2009 at 10:47 UTC
Took your advice and put those two lines at the top of the code I provided, but frustratingly it didn't work. I just began learning Perl about a month ago, so I apologize for sounding dense. To clarify, does it look like I did as you suggested? I am also looking through the website you linked; hopefully something more will turn up there. My code now looks like:

    #!/usr/bin/perl
    #I want to take a file of text as input, split it into an array of words
    #then search through the array for a word that matches the regular
    #expression, printing all matches.

    binmode STDIN, ":decoding(UTF-8)";
    binmode STDOUT, ":decoding (UTF-8)";
    use utf8;
    use charnames ':full';

    while ($line=<>){
        @array = split(/ /, $line);
        foreach $x (@array){
            if ($x=~ /\x{02c0}/){#glottal stop
                print "$x\n";
            }
        }
    }

Thanks so much for your help!
Re^3: Regular Expressions on Unicode by moritz (Cardinal) on Dec 14, 2009 at 19:19 UTC
That's roughly how I would have done it, except that `:decoding(UTF-8)` is wrong; it's still `:encoding(UTF-8)`. Here is a working example of how to search for that character:

    use strict;
    use warnings;
    use charnames qw(:full);
    binmode STDOUT, ':encoding(UTF-8)';

    my $filename = 'test.txt';

    if (@ARGV) {
        open my $handle, '>:encoding(UTF-8)', $filename
            or die "Can't write to file '$filename': $!";
        print $handle <<"OUT";
    The next line
    contains a\N{MODIFIER LETTER GLOTTAL STOP}
    Really!
    OUT
        close $handle or warn $!;
    } else {
        open my $handle, '<:encoding(UTF-8)', $filename
            or die "Can't open file '$filename' for reading: $!";
        for (<$handle>) {
            print if /\N{MODIFIER LETTER GLOTTAL STOP}/;
        }
        close $handle;
    }

When you call it with command-line arguments it writes a test file; when called without any, that test file is read back:

    $ perl sample.pl gen
    $ perl sample.pl
    contains aˀ

I hope this helps. You can gradually morph it into the program you want; when you change something and it breaks, you know what's wrong.
Re^3: Regular Expressions on Unicode by Anonymous Monk on Dec 14, 2009 at 15:09 UTC
Why did you change it to decoding? `use open IO => ":encoding(UTF-8)";`
Re^4: Regular Expressions on Unicode by Anonymous Monk on Dec 14, 2009 at 15:20 UTC
IO doesn't include STDOUT/STDIN:

    perl -Mopen=IO,:encoding(UTF-8) -le"print join q! !, $$_[0], PerlIO::get_layers($$_[0], output => $$_[1]) for [STDOUT,1], [STDIN,0], [STDERR,1] "
    main::STDOUT unix crlf
    main::STDIN unix crlf
    main::STDERR unix crlf

You want `use open qw! :std :encoding(UTF-8) !;`. Example:

    perl -Mopen=:std,:encoding(UTF-8) -le"print join q! !, $$_[0], PerlIO::get_layers($$_[0], output => $$_[1]) for [STDOUT,1], [STDIN,0], [*STDERR,1] "
    main::STDOUT unix crlf encoding(utf-8-strict) utf8
    main::STDIN unix crlf encoding(utf-8-strict) utf8
    *main::STDERR unix crlf encoding(utf-8-strict) utf8
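
For anyone who finds the one-liners hard to read, the same check can be written as a small script. This is only a sketch: the helper name `layers_of` is made up for illustration, and the exact layer names printed will depend on your platform and Perl build (for example, `crlf` normally shows up only on Windows).

```perl
use strict;
use warnings;

# :std applies the default layers to STDIN, STDOUT and STDERR as well.
use open qw(:std :encoding(UTF-8));

# Hypothetical helper: return the PerlIO layers on a handle.
# $is_output selects the output side, which matters for handles
# that can have different input and output layers.
sub layers_of {
    my ($fh, $is_output) = @_;
    return PerlIO::get_layers($fh, output => $is_output);
}

for my $entry ( [ STDOUT => \*STDOUT, 1 ],
                [ STDIN  => \*STDIN,  0 ],
                [ STDERR => \*STDERR, 1 ] ) {
    my ($name, $fh, $is_output) = @$entry;
    print join(' ', $name, layers_of($fh, $is_output)), "\n";
}
```

If the pragma took effect, each line of output should include an `encoding(...)` layer.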
Re^2: Regular Expressions on Unicode by ikegami (Patriarch) on Dec 14, 2009 at 18:25 UTC
He's not reading from STDIN, he's reading from ARGV, which might read from STDIN. That will only work in some circumstances, if at all.
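
One way to sidestep the ARGV question is to stop relying on `<>` and open each file named on the command line yourself, so the decoding layer sits on the handle that is actually read. The following is a minimal sketch under that assumption, not a drop-in replacement for the original script; the sub name `glottal_words` is invented for the example, and it splits on any whitespace rather than on a single space.

```perl
use strict;
use warnings;

binmode STDOUT, ':encoding(UTF-8)';

# Return the words of an already-decoded line that contain
# U+02C0 (MODIFIER LETTER GLOTTAL STOP).
sub glottal_words {
    my ($line) = @_;
    return grep { /\x{02C0}/ } split ' ', $line;
}

# Open every file explicitly with a decoding layer instead of using <>.
for my $file (@ARGV) {
    open my $in, '<:encoding(UTF-8)', $file
        or die "Can't open '$file' for reading: $!";
    while (my $line = <$in>) {
        print "$_\n" for glottal_words($line);
    }
    close $in;
}
```

Called as `perl script.pl input.txt`, it prints each matching word on its own line. With no file arguments it does nothing, which keeps the sketch free of any STDIN handling.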