I am writing an interface for a search engine that will pull its results from a database. I have the code more or less working for English, and am now trying to rework it to handle CJK text. I would prefer to use UTF-8, and ideally the solution should be compatible with standard Latin characters as well. I am using Perl 5.8.7.
At the core of the problem is that I am unable to get a pattern match for words in a Unicode-containing string. Even if I place spaces between the words, the regex will not recognize the space. I have tried \s, \p{IsSpace}, \P{IsWord}, \b, \X, \p{IsZ}, \p{IsZc} and others, with or without "use Encode;" or "use utf8;" in various forms, all to no avail. Here is an example of one line of my code:
1 while $line =~ s/(\p{IsWord})(?<!\p{IsSpace}AND|\p{IsSpace}XOR|\p{IsSpace}NOT|\p{IsWord}\p{IsSpace}OR)(\p{IsSpace}|\s|\p{IsMc}|\p{IsZs}|\p{IsZ})(?!AND\p{IsSpace}|XOR\p{IsSpace}|NOT\p{IsSpace}|\&\&|\&\p{IsSpace}|\+|\|\||\||OR\p{IsSpace}|\^|\!)(\p{IsWord})/$1 AND $3/gi;
Now, that line will function perfectly on an English input string, but as soon as I enter Chinese, it won't match anything. All that line needs to do is to insert an " AND " between any two spaced words, after verifying that those words are not search operator terms themselves. So, XXX YYY OR ZZZ should become XXX AND YYY OR ZZZ.
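For illustration, the intended transformation can also be sketched by splitting the query into tokens instead of using a single substitution. This is only a sketch under stated assumptions, not a diagnosis of the line above: it assumes the query arrives as UTF-8 encoded bytes (a common reason Unicode properties such as \p{IsSpace} fail to match is that the string was never decoded into characters), and the helper name insert_and and the operator list are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical operator list, taken from the terms named in the regex above.
my %is_operator = map { $_ => 1 } qw(AND OR XOR NOT && || & | + ^ !);

# Hypothetical helper: insert "AND" between two adjacent terms
# when neither of them is already a search operator.
sub insert_and {
    my ($bytes) = @_;

    # Assumption: input is UTF-8 bytes. Decoding turns it into a character
    # string, so \p{IsSpace} is applied to characters rather than raw bytes.
    my $text = decode('UTF-8', $bytes);

    # Split on any run of whitespace, including U+3000 IDEOGRAPHIC SPACE.
    my @tokens = grep { length } split /\p{IsSpace}+/, $text;

    my @out;
    for my $tok (@tokens) {
        if (@out && !$is_operator{uc $out[-1]} && !$is_operator{uc $tok}) {
            push @out, 'AND';
        }
        push @out, $tok;
    }

    # Encode back to UTF-8 bytes for output or for the database layer.
    return encode('UTF-8', join ' ', @out);
}

print insert_and("XXX YYY OR ZZZ"), "\n";   # prints "XXX AND YYY OR ZZZ"
```

Note that this sketch collapses runs of whitespace into single spaces and does not handle operators attached directly to a term (such as +word); those cases would need extra handling.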
Any ideas why this is not working correctly with UTF-8? (I'm open to a total rewrite of the line, as it's obvious I haven't found the best solution yet.)
Thank you!
In reply to Unicode substitution regex conundrum by Polyglot