Polyglot has asked for the wisdom of the Perl Monks concerning the following question:
I am writing an interface for a search engine that will pull its results from a database. I have, more or less, mastered the entire code in English, and am now trying to remaster it in CJK-compatible fonts. I would prefer to use utf8, and ideally the solution should be compatible with standard Latin characters as well. I am using Perl 5.8.7.
At the core of the problem is that I am unable to get a pattern match for words in a unicode-containing string. Even if I place spaces between the words, the regex will not recognize the space. I have tried \s \p{IsSpace} \P{IsWord} \b \X \p{IsZ} \p{IsZc} and others, with or without "use Encode;" or "use utf8;" in various forms, all to no avail. Here is an example of one line of my code:

    1 while $line =~ s/(\p{IsWord})(?<!\p{IsSpace}AND|\p{IsSpace}XOR|\p{IsSpace}NOT|\p{IsWord}\p{IsSpace}OR)(\p{IsSpace}|\s|\p{IsMc}|\p{IsZs}|\p{IsZ})(?!AND\p{IsSpace}|XOR\p{IsSpace}|NOT\p{IsSpace}|\&\&|\&\p{IsSpace}|\+|\|\||\||OR\p{IsSpace}|\^|\!)(\p{IsWord})/$1 AND $3/gi;

Now, that line will function perfectly on an English input string, but as soon as I enter Chinese, it won't match anything. All that line needs to do is to insert an " AND " between any two spaced words, after verifying that those words are not search operator terms themselves. So, XXX YYY OR ZZZ should become XXX AND YYY OR ZZZ.
Any ideas for why this is not performing correctly in utf8? (I'm open to a total rewrite of the line, as it's obvious I haven't found the best solution yet.)
Thank you!
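For illustration only, here is a minimal sketch of the behaviour described above (XXX YYY OR ZZZ becoming XXX AND YYY OR ZZZ), written as a token loop rather than a single substitution. It rests on an assumption the post does not confirm: that the input reaches the regex as raw UTF-8 bytes and has to be decoded into characters first. The helper name `add_implicit_and`, the `%is_operator` table and the sample byte string are all hypothetical, and only the word operators (AND, OR, XOR, NOT) are handled, not the symbolic ones (&&, ||, +, ^, !) from the original lookahead.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Word operators that must never get an extra AND next to them.
my %is_operator = map { $_ => 1 } qw(AND OR XOR NOT);

# Hypothetical helper (not from the original post): join adjacent
# search terms with AND unless one of them is already an operator.
sub add_implicit_and {
    my ($query) = @_;

    # On a decoded (character) string containing CJK text, \s follows
    # Unicode rules, so it also matches the ideographic space U+3000.
    my @terms = grep { length } split /\s+/, $query;

    my @out;
    for my $term (@terms) {
        # Insert AND only between two plain terms.
        if (@out && !$is_operator{ uc $out[-1] } && !$is_operator{ uc $term }) {
            push @out, 'AND';
        }
        push @out, $term;
    }
    return join ' ', @out;
}

# Assumed input: raw UTF-8 bytes, as they might arrive from a form
# or a database handle (two two-character CJK words and one space).
my $bytes = "\xE4\xB8\xAD\xE6\x96\x87 \xE6\x90\x9C\xE7\xB4\xA2";

# Decode bytes into characters before any regex work.
my $query = decode('UTF-8', $bytes);

binmode STDOUT, ':encoding(UTF-8)';
print add_implicit_and($query), "\n";
print add_implicit_and('XXX YYY OR ZZZ'), "\n";
```

Splitting into tokens avoids the lookbehind and lookahead bookkeeping, and the same code path serves Latin and CJK input. Note that CJK text is often written without any spaces between words, in which case no whitespace-based approach will find word boundaries at all.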
Replies are listed 'Best First'.
Re: Unicode substitution regex conundrum
by Juerd (Abbot) on Oct 16, 2007 at 10:55 UTC
by Polyglot (Chaplain) on Oct 16, 2007 at 13:55 UTC
by moritz (Cardinal) on Oct 16, 2007 at 14:11 UTC
by Juerd (Abbot) on Oct 16, 2007 at 19:11 UTC
by moritz (Cardinal) on Oct 16, 2007 at 19:21 UTC
by Polyglot (Chaplain) on Oct 17, 2007 at 03:35 UTC
by Lu. (Hermit) on Dec 16, 2007 at 22:39 UTC
by Polyglot (Chaplain) on Mar 04, 2008 at 06:14 UTC
by Juerd (Abbot) on Mar 14, 2008 at 01:08 UTC
Re: Unicode substitution regex conundrum
by moritz (Cardinal) on Oct 16, 2007 at 10:24 UTC