Re^4: Unicode substitution regex conundrum

Well, I'm stumped. The program matches the spaces properly when doing a split// but not when doing an s///. I have now set the accept-charset attribute on the HTML form to UTF-8. I have inserted the code posted earlier, and still nothing. So, to demonstrate the exact conundrum I am up against, I have reduced my code to the barest essentials for testing this UTF-8 regex.

Please feel free to try this script on your own server to see if you can get it to work properly on Chinese text. I have included a sample Chinese phrase in the script, which you should be able to copy and paste into the form for testing purposes. Compare it with an English search, and you'll see why I'm frustrated!

#!/usr/bin/perl -wT -CE
use Encode qw(encode decode _utf8_on);

##### PARSE THE FORM INPUT
    if ($ENV{CONTENT_LENGTH}) {
        read(STDIN, $buffer, $ENV{CONTENT_LENGTH});
        @pairs = split(/&/,$buffer);
    } else {
        $buffer = $ENV{QUERY_STRING};
        @pairs = split(/&/,$buffer);    # query-string pairs are separated by "&", not "+"
    }
    foreach $pair (@pairs) {
        ($name, $value) = split(/=/,$pair);
        $value =~ tr/+/ /;
        $value =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C",hex($1))/eg;
        $input{$name} = $value;
    }
  $terms=$input{terms}; 

##### START TESTING PHASE
print "Content-type: text/html\n\n";
print "TERMS: $terms";

##### TRY A SPLIT
($a, $b, $c, $d, $e, $f) = split/\p{IsSpace}/, $terms;
print "<p>A:$a:<p>B:$b:<p>C:$c:<p>D:$d:<p>E:$e:<p>F:$f:\n";

##### NOW TRY A SUBSTITUTION
$word = qr/\b(?!(?:AND|OR|XOR|NOT)\b)\w+/i;
$terms =~ s/($word)\p{IsSpace}*($word)/$1 AND $2/g for 1..2;
print "<p>Terms:$terms\n";

##### PRINT THE WEBPAGE
print <<HTML;
<html lang=utf8>
<head>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf8">
<title>SEARCH</title>
</head><body>
<h1 align="center">Search</h1>
<form name="ff" method="POST" accept-encoding="UTF-8" 
                accept-charset="utf-8" action="$0">
Search terms: 
<input type="text" size="40" name="terms" value="$terms"></input>
<p>An example Chinese phrase: &#32102;&#32842; &#22825;&#23460;&#25152; &#26377;&#25104;&#21729;
<input type="submit" name="submit" value="Submit"></input>
</form></body></html>
HTML

Replies are listed 'Best First'.
Re^5: Unicode substitution regex conundrum by Lu. (Hermit) on Dec 16, 2007 at 22:39 UTC
Hi, I may be too late, seeing that your message has been here for nearly two months, but this could still be of use to you or someone else.

I may be wrong, but it seems to me that the problem lies not with the whitespace but with the definition of a word in Perl: \w+ does not match the Chinese characters. On my system (with a Unicode locale and Chinese readable in the console):

    $ perl -le 'print "ok" if ("我走" =~ m/\w+/)'
    $ perl -le 'print "ok" if ("hi" =~ m/\w+/)'
    ok

(The Chinese characters were jumbled when posting; I didn't put the character codes in the one-liner.)

Furthermore, I played a bit with your code, and when I replaced

    $terms =~ s/($word)\p{IsSpace}*($word)/$1 AND $2/g for 1..2;

with

    $terms =~ s/\p{IsSpace}/ AND /g;

it did the job I expected of it. The quickest workaround I see at the moment would be to declare $word using CJK character ranges instead of \w.

Hope I could be of help.

Lu.
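To make the suggestion above concrete, here is a minimal sketch rather than a confirmed fix for the original script. It assumes the form value arrives as raw UTF-8 bytes after percent-decoding and that those bytes are decoded into a character string before any regex runs; the variable names $phrase, $bytes, $decoded and @han_words are made up for the illustration. On a decoded string \w can match Han characters, and \p{Han} is one way to spell out a CJK-only word definition in place of hand-written ranges.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

binmode(STDOUT, ':encoding(UTF-8)');

# The sample phrase from the form, written as code points.
my $phrase = "\x{7D66}\x{804A} \x{5929}\x{5BA4}\x{6240} \x{6709}\x{6210}\x{54E1}";

# Assumption: simulate what the form parser yields after percent-decoding,
# i.e. a string of raw UTF-8 bytes rather than characters.
my $bytes = encode('UTF-8', $phrase);

# Decode the bytes into a character string before any regex work.
my $decoded = decode('UTF-8', $bytes);

# Same word definition and substitution as in the original script,
# applied to the decoded string.
my $terms = $decoded;
my $word  = qr/\b(?!(?:AND|OR|XOR|NOT)\b)\w+/i;
$terms =~ s/($word)\p{IsSpace}*($word)/$1 AND $2/g for 1..2;
print "Terms: $terms\n";

# Alternative word definition restricted to the Han script.
my @han_words = $decoded =~ /(\p{Han}+)/g;
print "Han words found: ", scalar(@han_words), "\n";
```

In a CGI script the equivalent step would be to call decode on each $value right after the percent-decoding substitution. Whether that alone resolves the behaviour on a given server is untested here.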
