in reply to Re^3: Unicode substitution regex conundrum
in thread Unicode substitution regex conundrum
Please feel free to try this script on your own server to see whether you can get it to work properly with Chinese text. I have included a sample Chinese phrase in the script, which you should be able to copy and paste into the form for testing. Compare the result with an English search, and you'll see why I'm frustrated!
#!/usr/bin/perl -wT -CE
use Encode;
use Encode qw(_utf8_on);
use Encode qw(encode decode);

##### PARSE THE FORM INPUT
if ($ENV{CONTENT_LENGTH}) {
    read(STDIN, $buffer, $ENV{CONTENT_LENGTH});
    @pairs = split(/&/,$buffer);
} else {
    $buffer = $ENV{QUERY_STRING};
    @pairs = split(/\+/,$buffer);
}
foreach $pair (@pairs) {
    ($name, $value) = split(/=/,$pair);
    $value =~ tr/+/ /;
    $value =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C",hex($1))/eg;
    $input{$name} = $value;
}
$terms=$input{terms};

##### START TESTING PHASE
print "Content-type: text/html\n\n";
print "TERMS: $terms";

##### TRY A SPLIT
($a, $b, $c, $d, $e, $f) = split/\p{IsSpace}/, $terms;
print "<p>A:$a:<p>B:$b:<p>C:$c:<p>D:$d:<p>E:$e:<p>F:$f:\n";

##### NOW TRY A SUBSTITUTION
$word = qr/\b(?!(?:AND|OR|XOR|NOT)\b)\w+/i;
$terms =~ s/($word)\p{IsSpace}*($word)/$1 AND $2/g for 1..2;
print "<p>Terms:$terms\n";

##### PRINT THE WEBPAGE
print <<HTML;
<html lang=utf8>
<head>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf8">
<title>SEARCH</title>
</head><body>
<h1 align="center">Search</h1>
<form name="ff" method="POST" accept-encoding="UTF-8" accept-charset="utf-8" action="$0">
Search terms: <input type="text" size="40" name="terms" value="$terms"></input>
<p>An example Chinese phrase: 給聊 天室所 +; 有成員
<input type="submit" name="submit" value="Submit"></input>
</form></body></html>
HTML
Re^5: Unicode substitution regex conundrum
by Lu. (Hermit) on Dec 16, 2007 at 22:39 UTC