CGI.pm: "Malformed UTF-8 character" in apache's error.log

isync has asked for the wisdom of the Perl Monks concerning the following question:

As a spin-off from this thread, I've got a problem with CGI.pm and file uploads from UTF-8 HTML pages. On larger uploads (valid binary files such as MP3s) I get "CGI.pm: Server closed socket during multipart read (client aborted?)." on the client/browser side, and apache2's error.log shows multiple lines with "Malformed UTF-8 character (unexpected continuation byte ..."

To replicate the error I use the script from http://www.perlfect.com/articles/upload.shtml

<FORM ENCTYPE="multipart/form-data" ACTION="/cgi-bin/upload.pl" METHOD="POST">
Please choose directory to upload to:<br>
<SELECT NAME="dir">
<OPTION VALUE="images">images</OPTION>
<OPTION VALUE="sounds">sounds</OPTION>
</SELECT>
<p>
Please select a file to upload: <BR>
<INPUT TYPE="FILE" NAME="file">
<p>
<INPUT TYPE="submit">
</FORM>

which is served as "Content-Encoding: utf8;"

Then I upload a file of approx. 800K (smaller files, around 300K, do get uploaded despite the errors) to this script:

#!/usr/bin/perl -CS
use CGI::Carp qw(fatalsToBrowser);
use CGI;
my $cgi = new CGI;

my $file = $cgi->param('file');

$file=~m/^.*(\\|\/)(.*)/; # strip the remote path and keep the filename

my $name = $2;

open(LOCAL, ">/var/www/mypath/$file") or die "$!: path: /var/www/mypath/$name file: $file";
while(<$file>) {
        print LOCAL $_;
}

print $cgi->header();
print "$file has been successfully uploaded... thank you.\n";

Now, the important part is the -CS switch in #!/usr/bin/perl -CS! It tells perl to operate in utf8 mode on STDIN, STDOUT and STDERR. I got the same effect with different approaches, i.e. with binmode on STDIN, STDOUT etc. set to utf8.

I need this functionality because I use CGI::Application to serve pages in utf8. But it also seems to treat file uploads as utf8, which leads to the error I am describing.
For example, setting the switch to #!/usr/bin/perl -COE (= utf8 for STDOUT and STDERR only) does not yield the error.

1. Now, is this behaviour an error in CGI.pm (not differentiating between form (text) data and file uploads), or is it intended behaviour?

2. And what is the correct application design? Should I use the switch -COE, so that STDOUT and STDERR are utf8 while input stays :raw, and then decode selected form fields via my $param_f = decode("utf8", $q->param("f") )?
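The design asked about in question 2 (leave input as raw bytes, decode only the text fields) can be sketched without a web server. This is a minimal illustration, not CGI.pm itself: the hash `%raw_params` is a hypothetical stand-in for what `$q->param()` would return when STDIN has no decoding layer, and the field names are made up for the demo.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical raw parameter values (bytes, not characters), standing in
# for what $q->param() returns when STDIN carries no decoding layer.
my %raw_params = (
    f    => "\xC3\xA4\xC3\xB6\xC3\xBC",   # UTF-8 bytes for "äöü"
    file => "song.mp3",                   # upload filename, left as bytes
);

# Decode only the text field. Binary upload data never goes through decode().
my $param_f = decode('UTF-8', $raw_params{f});

printf "bytes: %d, characters: %d\n", length($raw_params{f}), length($param_f);
```

The byte string has six bytes but decodes to three characters, which is the distinction the -COE plus manual decode approach relies on.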


Replies are listed 'Best First'.
Re: CGI.pm: "Malformed UTF-8 character" in apache's error.log by Joost (Canon) on Feb 26, 2008 at 17:58 UTC
This is expected behaviour, since POST data is read from STDIN. You will have to switch STDIN to binary before reading the binary upload data. If everything in CGI.pm works as I expect, you can probably do a `binmode STDIN` and possibly `binmode $file` right before reading from $file.

If that doesn't work, you may want to use `$cgi->upload('file')` to get the file handle. By the way, I've always distrusted the use of param() as providing both the file name and the handle.

"What should it profit a man, if he should win a flame war, yet lose his cool?"
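To see what `binmode` does to a handle, here is a small sketch that inspects the layer stack with `PerlIO::get_layers`. It uses a scratch file from File::Temp instead of STDIN (an assumption made so the demo runs outside a CGI environment), and it only shows the mechanics of adding and removing the utf8 flag, not CGI.pm's own reading of the request.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Scratch file standing in for the upload stream (an assumption for the demo).
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
binmode $tmp_fh;
print {$tmp_fh} "\x00\xFF\x80binary";
close $tmp_fh or die "close failed: $!";

open(my $fh, '<', $path) or die "open failed: $!";

# Roughly what -CS does to STDIN: mark the handle as UTF-8.
binmode $fh, ':utf8';
my $with_utf8 = grep { $_ eq 'utf8' } PerlIO::get_layers($fh);

# Plain binmode switches the handle back to raw bytes.
binmode $fh;
my $after_binmode = grep { $_ eq 'utf8' } PerlIO::get_layers($fh);

print "utf8 flag after binmode \$fh, ':utf8': $with_utf8\n";
print "utf8 flag after plain binmode:        $after_binmode\n";
close $fh;
```

Note that this only changes how later reads on that handle are interpreted; data that was already read through a utf8 layer is not repaired by calling `binmode` afterwards.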
Re^2: CGI.pm: "Malformed UTF-8 character" in apache's error.log by isync (Hermit) on Feb 26, 2008 at 18:54 UTC
The following does not work (same error):

#!/usr/bin/perl -CS
use CGI;
use CGI::Carp qw(fatalsToBrowser);
my $cgi = new CGI;

my $filename = $cgi->param('file');
my $fh = $cgi->upload('file');
binmode $fh;

open(OUTPUT, ">/var/www/mypath/$filename") or die $!;
while(<$fh>) {
    print OUTPUT $_;
}

print $cgi->header();
print "$file has been successfully uploaded... thank you.\n";

So switching the filehandle/STDIN (I tried both) just before storing the file does not seem to be possible after giving perl the -CS switch.

update: changing `binmode $fh;` to `binmode $fh, ":utf8";` does the trick! (Is this what you meant, and does that mean I am still on the right track, or is it a hint of a problem?)
Re^3: CGI.pm: "Malformed UTF-8 character" in apache's error.log by ikegami (Patriarch) on Feb 26, 2008 at 19:17 UTC
Don't use `:utf8` on untrusted data. Use `:encoding(UTF-8)`. For that matter, don't use `:utf8`.

You decode bytes to chars, but you never encode the chars back to bytes when writing them. You'll get wide character warnings for non-ASCII chars, and you could get a mix of iso-latin-1 and UTF-8 characters in more complex programs.

What awful inconsistency in your file handle names: `$fh` and `OUTPUT`? And using a global variable too?

Fixed:

...
my $fh_in = $cgi->upload('file');
binmode $fh_in, ':encoding(UTF-8)';
open(my $fh_out, '>:encoding(UTF-8)', "/var/www/mypath/$filename") or die $!;
while (<$fh_in>) {
    print $fh_out $_;
}

But there's absolutely no reason to convert to chars in the above code, so you'd be better off as

...
my $fh_in = $cgi->upload('file');
open(my $fh_out, '>', "/var/www/mypath/$filename") or die $!;
while (<$fh_in>) {
    print $fh_out $_;
}

You also mentioned binary uploads (like MP3s). For that, you'd use

...
my $fh_in = $cgi->upload('file');
open(my $fh_out, '>', "/var/www/mypath/$filename") or die $!;
binmode $fh_in;
binmode $fh_out;
local $/ = \4096; # Don't wait to find "\n"
while (<$fh_in>) {
    print $fh_out $_;
}

The code for binary files also works with text files.
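The binary copy loop can be tried outside a CGI script. The sketch below is an illustration under assumptions: two File::Temp scratch files stand in for the upload handle and the destination, and the payload is synthetic. The part taken from the post is `binmode` on both handles and `$/ = \4096` for fixed-size reads.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Synthetic "upload": every byte value 0..255, repeated (16384 bytes).
my $payload = join('', map { chr } 0 .. 255) x 64;

my ($src_fh, $src_path) = tempfile(UNLINK => 1);
binmode $src_fh;
print {$src_fh} $payload;
close $src_fh or die "close failed: $!";

my ($fh_out, $dst_path) = tempfile(UNLINK => 1);
open(my $fh_in, '<', $src_path) or die "open failed: $!";
binmode $fh_in;    # no decoding layer on input
binmode $fh_out;   # no encoding layer on output

my $chunks = 0;
{
    local $/ = \4096;   # read fixed-size 4096-byte records, not lines
    while (my $chunk = <$fh_in>) {
        print {$fh_out} $chunk;
        $chunks++;
    }
}
close $fh_in;
close $fh_out or die "close failed: $!";

my $copied = -s $dst_path;
print "copied $copied bytes in $chunks chunks\n";
```

Setting `$/` to a reference to an integer makes the readline operator return records of at most that many bytes, so memory use stays bounded even when the data contains no newline at all.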
Re: CGI.pm: "Malformed UTF-8 character" in apache's error.log by Juerd (Abbot) on Feb 26, 2008 at 20:58 UTC
Because it doesn't look like it will be repaired any time soon... Let's at least warn people.

The -C flag is implemented with the unsafe ":utf8" layer instead of the safe ":encoding(utf8)" layer. Therefore, -CI, -CS, -Ci, -CD, and their numeric equivalents, are potential security risks. Likewise, -CA is implemented by setting the SvUTF8 flag (like _utf8_on) and should also be avoided.

Instead of -CI, use: `binmode STDIN, ":encoding(utf8)";`
Instead of -Ci, use: `use open ":encoding(utf8)";`
Instead of -CA, use: `utf8::decode($_) for @ARGV;`
Instead of -CS, use -COE and: `binmode STDIN, ":encoding(utf8)";`
Instead of -CD, use -Co and: `use open ":encoding(utf8)";`

(Using the ":utf8" layer is safe for output streams.)
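What "validating" means here can be sketched with the Encode module rather than I/O layers. The helper `decode_strict` is a hypothetical name introduced for this illustration; it uses `decode` with `FB_CROAK`, which dies on malformed input. The second half shows the `utf8::decode` call suggested as the -CA replacement, with a made-up list standing in for @ARGV.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical helper: returns the decoded string, or undef when the bytes
# are not valid UTF-8. The copy matters because decode() with a CHECK
# argument may modify the buffer it is given.
sub decode_strict {
    my ($bytes) = @_;
    my $copy = $bytes;
    return eval { decode('UTF-8', $copy, FB_CROAK) };
}

my $good = decode_strict("\xC3\xA4bc");   # UTF-8 bytes for "äbc"
my $bad  = decode_strict("\xE4bc");       # Latin-1 bytes for "äbc", not UTF-8

print "valid input:   ", (defined $good ? length($good) . " characters" : "rejected"), "\n";
print "invalid input: ", (defined $bad  ? "accepted" : "rejected"), "\n";

# utf8::decode() works in place and returns false when the bytes are not
# well-formed UTF-8. @args is a stand-in for @ARGV.
my @args = ("\xC3\xA4", "\xE4bc");
my @ok   = map { utf8::decode($_) ? 1 : 0 } @args;
print "utf8::decode results: @ok\n";
```

The point of the strict path is that malformed input is detected and reported instead of being silently flagged as UTF-8.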
Re: CGI.pm: "Malformed UTF-8 character" in apache's error.log by ikegami (Patriarch) on Feb 26, 2008 at 18:14 UTC
"Now, is this behaviour an error of CGI.pm"

No. You corrupt the CGI communication before CGI.pm even sees it. STDIN is not yours to play with when using CGI.pm.

"And what is the correct application design?"

Tell CGI the charset you want to use for the parameters, and use `binmode` with `':encoding(UTF-8)'` on the input file handle of uploaded files when appropriate.

In this case, you don't need to use `:encoding`, since all you do is dump the file into another file. If you do use `:encoding` here, you'd need it on the output file handle as well (to convert the chars back into bytes).
Re: CGI.pm: "Malformed UTF-8 character" in apache's error.log by isync (Hermit) on Feb 26, 2008 at 19:26 UTC
From your input (~~not yet woven in ikegami's last post, working on it~~) I constructed this variation, which should replicate, in a simplified form, what the routine in my more complex CGI::Application script will be like:

#!/usr/bin/perl -COE
use CGI;
use CGI::Carp qw(fatalsToBrowser);
use utf8;
use Encode;
my $cgi = new CGI;

my $formtext = decode("utf8", $cgi->param("formtext") ); # this may be abc or contain chinese

my $filename = $cgi->param('file');
my $fh_in = $cgi->upload('file');
open(my $fh_out, '>', "/var/www/clipland/www.clipland.com/$filename") or die $!;
binmode $fh_in;
binmode $fh_out;
local $/ = \4096; # Don't wait to find "\n"
while (<$fh_in>) {
    print $fh_out $_;
}

$cgi->header(-charset => 'utf-8');
print $cgi->header();
print "$filename has been successfully uploaded... Now some umlauts: äöü";

1. It only uses the -COE switch for STDOUT and STDERR (using -CS will give the error, despite ikegami's fixes), and as Juerd pointed out, it's safer.
2. I do not switch utf8 on for STDIN, thus I need to manually decode form data from text fields, right?

Is this OK? Anything I've overlooked?

BTW: no kidding? I have to differentiate between text and binary data?? A great relief to hear that utf8 text files are also binary data... ;-)