psychomachine has asked for the wisdom of the Perl Monks concerning the following question:
I am trying to clean up a corpus of UTF-8 texts which contain mainly Russian Cyrillic by removing all the Latin text that's in them. Of course, I need to keep the punctuation, spaces etc.
Unicode support itself seems to be working, because when I use: print s/[\P{InCyrillic}]//g; I really get only Cyrillic strings piled up against each other, with no punctuation, no spaces and no Latin letters. So far, so good.
But when I try to create my own character definition in a subroutine, things stop working as expected.
I only get digits without spaces, which makes no sense to me.

```perl
#! usr/local/perl
use utf8;

sub InRussian {
    return <<'END';
+utf8::Cyrillic
+utf8::Mark
+utf8::Number
+utf8::Punctuation
END
}

print s/[\P{InRussian}]//g;
```
Even when my subroutine contains only +utf8::Cyrillic, the result is still digits, which is not the same as what I get from print s/[\P{InCyrillic}]//g;
I'm calling the Perl script from bash with the -CS and -p switches. I'm using Perl 5.8.6 on Mac OS X. Any help would be greatly appreciated.
Mike
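
For readers following along, here is a minimal sketch of a user-defined property that keeps Cyrillic, marks, numbers, punctuation and whitespace. It is an illustration that assumes a reasonably recent perl, not a diagnosis of the 5.8.6 behaviour described above. The names InRussianKeep and strip_non_russian are hypothetical, the Is-prefixed spellings follow the style of the examples in perlunicode, and +utf8::IsSpace is an added assumption, since the Punctuation category does not cover spaces or newlines.

```perl
use strict;
use warnings;
use utf8;    # the source file itself contains UTF-8 string literals

# Hypothetical property name. Perl requires the names of user-defined
# property subs to begin with "In" or "Is".
sub InRussianKeep {
    return <<'END';
+utf8::IsCyrillic
+utf8::IsMark
+utf8::IsNumber
+utf8::IsPunct
+utf8::IsSpace
END
}

# Hypothetical helper: delete every run of characters that falls
# outside the property and return the cleaned copy.
sub strip_non_russian {
    my ($text) = @_;
    $text =~ s/\P{InRussianKeep}+//g;
    return $text;
}

binmode STDOUT, ':encoding(UTF-8)';
print strip_non_russian("Привет, world! 123\n");
```

One thing to keep in mind with -p: print s/...//g; prints the return value of s/// (the number of substitutions made) in addition to the line that -p prints on its own, so a bare substitution statement is usually what is wanted there.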
Re: problem with user-defined unicode character properties
by BrowserUk (Patriarch) on Jun 11, 2007 at 14:19 UTC
by psychomachine (Initiate) on Jun 11, 2007 at 15:00 UTC
by BrowserUk (Patriarch) on Jun 11, 2007 at 15:41 UTC

Re: problem with user-defined unicode character properties
by PreferredUserName (Pilgrim) on Jun 11, 2007 at 18:56 UTC