Apache+PerlCGI: accent problems

pablofaria has asked for the wisdom of the Perl Monks concerning the following question:

Hi, all.

First of all, I searched a lot around the web and here too, but found nothing helpful for my case. So I hope you can shed some light on this.

I have a web application (running on Apache) that consists of an HTML query form for searching through text files. The form calls a Perl script that prepares some other details of the search and calls a Java program (using backticks). The Java program does the search and returns the output as HTML to the script, which does some final processing and sends it back to the browser.

The problem is that when the query entered by the user includes accented characters (the text files are in Portuguese), the Java program receives them garbled, so the search returns nothing. Stripping the accents from the query doesn't solve the problem, because the accented characters themselves must be found. I created a version of the script that can be run from a shell, and it works fine with accents. So I thought it could be something with Apache+Perl, but I have no idea what or where. Just to mention, I tried all kinds of I/O charset conversions and none of them worked.

A simple version of the script:

#!/usr/bin/perl
print "Content-type: text/html;charset=utf-8", "\n\n";
$out1=(`java -classpath /usr/local/lib/CS.jar csearch/CorpusSearch 'HTMLQ((naő Exists))' c_006_pos.txt.cs`);
print $output;

Piece of the output to the browser:
--------
search domain: $ROOT
query: (na�� Exists)
---------

As it shows, the original "naő" was processed by the Java program as "na��". But if I have the script print the query right before passing it to the Java program (before the "$out1=(`..."), it displays correctly ("naő"). All the other accented characters in the output are fine too, because I set the charset to UTF-8. As I said before, the same script works fine when I call it directly from the shell. The form is sending UTF-8 too. Is there anything between Apache and shell commands run via Perl that I am missing?

Well, that's it. Any ideas?
Thanks,

Re: Apache+PerlCGI: accent problems by Joost (Canon) on Feb 07, 2008 at 20:24 UTC
Did you mean "print $out1"? Anyway, a couple of things may be going wrong. One is that the backticks may not interpret the command's output as UTF-8. Try this:

#!/usr/bin/perl
use warnings;
use strict;
print "Content-type: text/html;charset=utf-8", "\n\n";
# open command and explicitly request utf-8
open (COMMAND,"-|:encoding(UTF-8)","java -classpath /usr/local/lib/CS.jar csearch/CorpusSearch 'HTMLQ((naő Exists))'") or die $!;
my $out1 = join('',<COMMAND>); # put all lines in one string
binmode(STDOUT,":utf8"); # this marks the output as accepting utf8
print $out1;

You may also need to be careful about how you're reading the input data. See also http://perldoc.perl.org/functions/open.html

"What should it profit a man, if he should win a flame war, yet lose his cool?"
Re: Apache+PerlCGI: accent problems by pc88mxer (Vicar) on Feb 07, 2008 at 20:30 UTC
I'm pretty sure this is a character encoding problem. The Java program is probably expecting UTF-8 (or UTF-16) encoded characters. The solution is to `use Encode;` and it has several parts:

- When reading user input, make sure you decode the strings correctly.
- When passing the strings to the Java program, make sure you encode them to what the Java program expects.
- When getting the results back from the Java program, make sure they are decoded correctly.
- Finally, when you emit the results back to the user, make sure your characters are again encoded correctly.

Note that some of these steps may be omitted if the incoming and outgoing character encodings are the same. As a first step, try to figure out what encoding the Java program expects as input and what it produces as output. For instance, try something like:

use Encode;
my $word = "naő Exists";
my $encoded_word = Encode::encode('utf8', $word);
$out1=(`java -classpath /usr/local/lib/CS.jar csearch/CorpusSearch 'HTMLQ(($encoded_word))' c_006_pos.txt.cs`);
print $out1;

and see what `$out1` looks like. You can use Firefox to do this: just use View -> Character Encoding -> More Encodings -> Unicode to try some different encodings out. If utf8 doesn't work, try 'utf16', which is another popular encoding to use with Java. After you've figured out the Java part, decide on an output encoding (either latin1 or utf8), add a `charset` parameter to your `Content-type` header, and use `Encode::encode` to encode the output, e.g.:

print "Content-type: text/html; charset=utf-8\n\n";
# ... set $out1 from java program ...
print Encode::encode('utf8', $out1);
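For readers who want to see the four decode/encode steps in one place, here is a minimal sketch. It assumes UTF-8 at every boundary, uses a hypothetical sample word ("café") rather than the query from the thread, and only pretends to call an external program, so it runs without Java installed.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical sample input: the UTF-8 bytes for "café" as a form might send them.
my $raw_bytes = "caf\xC3\xA9";

# 1. Decode the incoming bytes into Perl characters.
my $query = decode('UTF-8', $raw_bytes);

# 2. Encode to the encoding the external program is assumed to expect.
my $for_program = encode('UTF-8', $query);

# 3. Decode what the program returns (here we pretend it echoed the query).
my $result = decode('UTF-8', $for_program);

# 4. Encode once more when writing to the browser.
binmode(STDOUT, ':raw');
print encode('UTF-8', "query: $result\n");

# One accented character is a single character but two UTF-8 bytes.
printf "characters: %d, bytes: %d\n", length($query), length($for_program);
```

Comparing the character count with the byte count at each step is a quick way to tell whether a string is currently decoded text or encoded bytes.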
Re: Apache+PerlCGI: accent problems by pablofaria (Initiate) on Feb 08, 2008 at 18:15 UTC
Thank you, guys. Your suggestions didn't solve the problem, but they helped me get closer to where it really was. I ran a lot more tests and found that the problem was in the interface between Java and Apache: doing the search with 'grep' works just fine. Then it occurred to me that Java itself might be messing things up. So I put "LANG=pt_BR.UTF-8" before the java command, and the problem was finally solved! Oh, I'm tired... Do you think it deserves a Meditation?
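A likely explanation, offered as an assumption rather than something confirmed in the thread, is that Apache runs CGI scripts with a minimal environment that has no LANG set, whereas an interactive shell usually has one. The same fix can be applied from inside the Perl script by setting the variable only for the child process. The sketch below uses a hypothetical helper, `run_with_lang`, and a stand-in child command (Perl printing the LANG it inherited) in place of the Java program.

```perl
use strict;
use warnings;

# Hypothetical helper: run a command with LANG set only for this call.
sub run_with_lang {
    my ($lang, @cmd) = @_;
    local $ENV{LANG} = $lang;    # previous value is restored on return
    open(my $fh, '-|', @cmd) or die "cannot run $cmd[0]: $!";
    local $/;                    # slurp mode: read all output at once
    my $out = <$fh>;
    close($fh);
    return $out;
}

# Stand-in for the Java program: a child perl that reports its LANG.
my $seen = run_with_lang('pt_BR.UTF-8', $^X, '-e', 'print $ENV{LANG}');
print "child saw LANG=$seen\n";
```

Passing the command as a list to `open` also bypasses the shell, so the query text needs no shell quoting. Setting the variable in the Apache configuration (for example with mod_env's `SetEnv`) would be another way to get the same effect.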