in reply to Unicode substitution regex conundrum
"use utf8;" is needed only if your source code itself is saved in UTF-8 encoding. It does not affect regex matches.
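A minimal sketch of that distinction, assuming the script file itself is saved as UTF-8. The variable names are made up for illustration, and the `\x` escapes merely stand in for raw bytes that would normally come from a file or socket:

```perl
use strict;
use warnings;
use utf8;                      # this file is assumed to be saved as UTF-8
use Encode qw(decode);

my $literal = "é";             # decoded by the parser because of "use utf8"
my $input   = "\xC3\xA9";      # hypothetical raw UTF-8 bytes from outside the program

my $literal_len = length $literal;                   # counted in characters
my $input_len   = length $input;                     # still counted in bytes
my $decoded_len = length decode('UTF-8', $input);    # characters, after decoding

print "$literal_len $input_len $decoded_len\n";
```

The pragma only changes how the literal in the source is read. The incoming data has to be decoded explicitly before a regex sees characters rather than bytes.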
Perl does not have UTF-8 semantics; it has Unicode semantics. The difference is that you work on *encodingless* strings in Perl, and that you use the *normal* operators instead of separate ones. The encoding details are handled internally.
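As a rough illustration of "encodingless strings with normal operators": decode once on the way in, work with ordinary built-ins, and encode once on the way out. The sample data and variable names below are assumptions chosen for the sketch.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $bytes = "na\xC3\xAFve";              # UTF-8 bytes, as they would arrive from outside
my $text  = decode('UTF-8', $bytes);     # now a string of characters, no encoding attached

my $upper = uc $text;                    # the ordinary uc, working on characters
my $chars = length $text;                # counts characters, not bytes

my $out = encode('UTF-8', $upper);       # choose an encoding again only for output
print $out, "\n";
```

There is no separate "Unicode uc" or "Unicode length" to call; the same operators are used once the string has been decoded.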
Please read the Perl Unicode Tutorial and the Perl Unicode FAQ.
The following ought to suffice:
    # untested!
    use Encode qw(decode);
    use Unicode::Semantics qw(up);
    up($line = decode 'UTF-8', $line);
    my $word = qr/\b(?!(?:AND|OR|XOR|NOT)\b)\w+/i;
    $line =~ s/($word)\s*($word)/$1 AND $2/g for 1..2;

Unicode::Semantics works around a bug that causes the second half of latin1 to be ignored under certain circumstances.
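The latin1 bug mentioned here can be sketched with core Perl alone. This example uses the built-in `utf8::upgrade` as a stand-in for `up`; whether `up` does exactly the same thing is an assumption, and the behaviour shown depends on the perl version and on the `unicode_strings` feature not being enabled. The sample string is invented.

```perl
use strict;
use warnings;

# A Latin-1 string that perl still stores internally as bytes.
my $line = "caf\xE9 ol\xE9";

my $word = qr/\b(?!(?:AND|OR|XOR|NOT)\b)\w+/i;

# Before upgrading, \w may not treat \xE9 as a word character,
# so the first "word" can stop short at "caf".
my ($before) = $line =~ /($word)/;

# utf8::upgrade changes only the internal representation,
# not the characters the string contains.
utf8::upgrade($line);
my ($after) = $line =~ /($word)/;

print length($before), " vs ", length($after), "\n";
```

After the upgrade the same regex sees the accented letter as part of the word, which is the effect the substitution above relies on.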