problem with user-defined unicode character properties

psychomachine has asked for the wisdom of the Perl Monks concerning the following question:

Hello everybody, a linguist in pain over here.

I am trying to clean up a corpus of utf8 texts which contain mainly Russian Cyrillic by removing all the Latin text that's in them. Of course, I need to keep the punctuation, spaces etc.

Unicode seems to be doing fine, because when I use: print s/[\P{InCyrillic}]//g; I really get only Cyrillic strings piled up against each other, no punctuation, no spaces and no Latin letters. So far, so good.

But when I try to create my own character definition in a subroutine, things stop working as expected.

#! usr/local/perl 
use utf8;

sub InRussian{
return <<'END';
    +utf8::Cyrillic
    +utf8::Mark
    +utf8::Number
    +utf8::Punctuation
END
}

print s/[\P{InRussian}]//g;

I only get digits without spaces, which makes no sense to me.

Even when my subroutine contains only +utf8::Cyrillic-- the result is digits and not the same as print s/[\P{InCyrillic}]//g;

I'm calling the perl script from bash with -CS -p. I'm using perl 5.8.6 on Mac OS X. Any help would be greatly appreciated.

Mike


Replies are listed 'Best First'.
Re: problem with user-defined unicode character properties by BrowserUk (Patriarch) on Jun 11, 2007 at 14:19 UTC
Try removing the leading whitespace (untested; no Cyrillic text to hand):

sub InRussian{
return <<'END';
+utf8::Cyrillic
+utf8::Mark
+utf8::Number
+utf8::Punctuation
END
}

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."
Re^2: problem with user-defined unicode character properties by psychomachine (Initiate) on Jun 11, 2007 at 15:00 UTC
Unfortunately, that doesn't solve the problem, although it may bring us a step closer to the solution. Without the trailing spaces, I've discovered that the last character property listed in the subroutine does take effect. For instance:

#! usr/local/perl
use utf8;

sub InRussian{
return <<'END';
+utf8::Cyrillic
+utf8::Punctuation
END
}

print s/[\P{InRussian}]//g;

gets me only numbers and punctuation, whereas putting `+utf8::Cyrillic` after `+utf8::Punctuation` in the subroutine produces the same output as applying the InCyrillic pattern directly: `print s/[\P{InCyrillic}]//g;` Does this make any sense?
Re^3: problem with user-defined unicode character properties by BrowserUk (Patriarch) on Jun 11, 2007 at 15:41 UTC
Um...two guesses.

First, you are using the negated form `\P{}`, and `NOT(+A +B)` may not mean what you intend, e.g. 'not in A and not in B'. Maybe you need (something like):

sub NotInRussian{
return <<'END';
!utf8::Cyrillic
!utf8::Punctuation
END
}
...
s/\p{NotInRussian}//g

Second, it might have something to do with this from the POD: "A final note on the user-defined property tests and mappings: they will be used only if the scalar has been marked as having Unicode characters. Old byte-style strings will not be affected." Does your editor produce Unicode source files? Will Perl promote ASCII source to Unicode?
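For reference, a minimal sketch of how the set operators in a user-defined property combine, going by the description in perlunicode rather than anything tested in this thread. It assumes a reasonably recent Perl. The property name `InNotRussian` and the sample string are made up for illustration (perlunicode requires user-defined property names to begin with `In` or `Is`). A leading `!` line adds the complement of a property and a later `-` line removes a property, so the second sub builds "neither Cyrillic nor punctuation", which should match the same characters as `\P{InRussian}`:

```perl
use strict;
use warnings;
use utf8;                                  # the source contains literal Cyrillic
binmode(STDOUT, ':encoding(UTF-8)');

# Union of the two properties, as in the thread.
sub InRussian {
    return <<'END';
+utf8::Cyrillic
+utf8::Punctuation
END
}

# Hypothetical complement property: start from "everything except Cyrillic",
# then remove punctuation from that set as well.
sub InNotRussian {
    return <<'END';
!utf8::Cyrillic
-utf8::Punctuation
END
}

my $sample = "Привет, world!";             # made-up sample text

# Delete everything outside the union, written two ways.
(my $via_negated_escape = $sample) =~ s/\P{InRussian}//g;
(my $via_complement     = $sample) =~ s/\p{InNotRussian}//g;

print "$via_negated_escape\n";
print "$via_complement\n";
```

Listing two `!` lines instead would, on this reading of perlunicode, take the union of the two complements, which covers nearly every character.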
Re: problem with user-defined unicode character properties by PreferredUserName (Pilgrim) on Jun 11, 2007 at 18:56 UTC
The "-p" switch already implies printing the line, so you don't need to do it explicitly. Your print() is printing the return value of s///, which is the number of substitutions made, not the modified string. Those are the numbers you're seeing. Change the last line to:

s/[\P{InRussian}]//g;

and it should work.
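A small sketch of the difference between what `s///g` returns and what it leaves in the variable, assuming a reasonably recent Perl and a made-up sample line. It uses the property from the original post and an explicit variable instead of `$_` under `-p`:

```perl
use strict;
use warnings;
use utf8;                                  # the source contains literal Cyrillic
binmode(STDOUT, ':encoding(UTF-8)');

# Property from the original post, with no leading whitespace in the heredoc.
sub InRussian {
    return <<'END';
+utf8::Cyrillic
+utf8::Mark
+utf8::Number
+utf8::Punctuation
END
}

my $line = "Привет, world! 42";            # made-up sample text

# s///g modifies $line in place and returns the number of substitutions.
my $count = ($line =~ s/\P{InRussian}//g);

print "substitutions: $count\n";           # this is what print s///g was showing
print "cleaned line:  $line\n";            # this is what -p prints on its own
```

Under `-p` the script body would be just the `s///` statement, since `-p` prints `$_` after each line. Note that the property as posted does not include whitespace, so spaces are deleted too; adding a line such as `+utf8::Space` would presumably keep them (an assumption, not tested in this thread).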