Re: split text into words -- Unicode problem (I guess)

&#1111; is not a valid character escape in Perl; you need \x{...} with the code point in hexadecimal (decimal 1111 is \x{457}). Also, look at the Unicode character properties in perlunicode; you should find the \p{L} class useful in regexes:

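$_ = "a\x{2625}\x{10000}";
# print the code point (in hex) of every character that is a letter
print map { sprintf "%x\n", ord } m/(\p{L})/g;

Output (U+2625 is a symbol, not a letter, so it is skipped):
61
10000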
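
Applied to the original question (splitting text into words), the same property class can be used to pull out runs of letters instead of splitting on delimiters. This is only a minimal sketch: `words_of` and the sample text are made up for illustration, and adding \p{M} (combining marks) is an assumption about what should count as part of a word. The non-ASCII letter is written as an \x{...} escape so the example does not depend on how the script file is encoded.

```perl
use strict;
use warnings;

# Hypothetical helper: return every run of letters (plus combining marks)
# found in $text, so punctuation and whitespace never end up in a word.
sub words_of {
    my ($text) = @_;
    return $text =~ /([\p{L}\p{M}]+)/g;
}

# Encode wide characters as UTF-8 on output.
binmode STDOUT, ':encoding(UTF-8)';

# \x{21B} is "t with comma below".
my $text_block = "The monarchy. Japan. Which company? Mul\x{21B}umesc!";
print "$_\n" for words_of($text_block);
```

Matching the words directly avoids having to list every punctuation character as a delimiter, which is where "monarchy." and "company?" come from when splitting on spaces alone.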

Replies are listed 'Best First'.
Re^2: split text into words -- Unicode problem (I guess) by bogdan77 (Initiate) on Mar 29, 2007 at 14:49 UTC
@andye Um... no. Simply using space as delimiter would give me "monarchy.", "Japan.", "company?", and so on, not just the words themselves.
@dk The text in the variable textBlock doesn't contain "ї" constructs -- it contains the real characters (for example, ț is "T with comma below"). The html encoding changed those characters into "ї" constructs when I submitted them.
Re^3: split text into words -- Unicode problem (I guess) by bogdan77 (Initiate) on Mar 29, 2007 at 14:55 UTC
Argh... let me try again... @dk The text in the variable textBlock doesn't contain &#(number); constructs -- it contains the real characters (for example, &#539; is "T with comma below"). The html encoding changed those characters into &#(number); constructs when I submitted them.
Re^4: split text into words -- Unicode problem (I guess) by Anonymous Monk on Mar 31, 2007 at 10:02 UTC
It works with "use encoding utf8;". Thanks a lot, guys :^)
Re^4: split text into words -- Unicode problem (I guess) by dk (Chaplain) on Mar 29, 2007 at 15:10 UTC
If your script is written in UTF-8, `use utf8` is needed to tell Perl about it. See the utf8 documentation for more.
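
To illustrate what goes wrong without it, here is a rough sketch. It assumes the script is saved as UTF-8 and spells out the two bytes that "t with comma below" (U+021B) occupies in such a file, since without `use utf8` Perl treats each byte of a literal as a separate character. Decoding those bytes with Encode gives the single character that `use utf8` would have produced for the literal. The variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The UTF-8 bytes for U+021B, as they would sit in a UTF-8 encoded script.
my $bytes = "\xC8\x9B";

# Without `use utf8`, the literal is two separate characters.
print length($bytes), "\n";            # 2

# Decoded, it is one character: the letter itself.
my $chars = decode('UTF-8', $bytes);
print length($chars), "\n";            # 1
printf "%x\n", ord $chars;             # 21b
```

A regex working on the undecoded form sees two unrelated characters instead of one letter, which may be why words appear to break in odd places.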
Re^4: split text into words -- Unicode problem (I guess) by bogdan77 (Initiate) on Mar 29, 2007 at 15:25 UTC
I used "use utf8;", and still no joy... :-( It seems there's some kind of overlapping, even if the delimiters are in the proper order...