UTF-8 woes with file upload

Farenji has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I have a web application in Perl which uses UTF-8 everywhere. At the top of my scripts are "use open ':utf8';" and "use open ':std';", which make everything work correctly. Without these lines, 'special' characters that come, for example, from the database turn into question marks in the browser.

Recently I added a file upload option to the application, but when I try to upload, I get tons of errors in my log complaining about bad UTF-8:

[Wed Aug 15 20:55:42 2007] [error] [client 127.0.0.1] [Wed Aug 15 20:55:42 2007] script.pl: Malformed UTF-8 character (unexpected continuation byte 0x99, with no preceding start byte) in index at (eval 44) line 15., referer: http://localhost/formtest.html
[Wed Aug 15 20:55:42 2007] [error] [client 127.0.0.1] [Wed Aug 15 20:55:42 2007] script.pl: Malformed UTF-8 character (byte 0xff) in index at (eval 44) line 15., referer: http://localhost/formtest.html
[Wed Aug 15 20:55:42 2007] [error] [client 127.0.0.1] [Wed Aug 15 20:55:42 2007] script.pl: Malformed UTF-8 character (unexpected non-continuation byte 0x0d, immediately after start byte 0xd9) in index at (eval 44) line 15., referer: http://localhost/formtest.html
[Wed Aug 15 20:55:42 2007] [error] [client 127.0.0.1] [Wed Aug 15 20:55:42 2007] script.pl: Wide character in print at (eval 38) line 85., referer: http://localhost/formtest.html

The uploaded file is totally corrupt when saved (I use CGI.pm, by the way). It seems that the file is not sent as UTF-8 for some reason; I do specify the UTF-8 charset in my forms and HTML, and the HTTP headers confirm that it's supposed to be UTF-8.

When I remove the two "use open" lines from my Perl script, the uploads go fine, but then the special characters are broken in my app. It's a multilingual site, so correct display of special characters like c-cedilla and umlauts is essential; hence the choice of UTF-8.

I'm caught between a rock and a hard place. How do I get out?


Replies are listed 'Best First'.
Re: UTF-8 woes with file upload by Joost (Canon) on Aug 15, 2007 at 23:09 UTC
Very likely the input you receive from the browser isn't UTF-8 encoded. The big issue is that, AFAIK, there is no standard that describes the encoding that should be used when uploading text files. Probably the file is sent in whatever encoding it was in originally. Even more annoying is that you'll probably get no indication of what that encoding is.

Since file uploads and POST data are read from STDIN, and STDIN is normally not used for anything else, you may be able to explicitly set the IO layers for STDIN to raw prior to processing the input. After you've received the uploaded data from CGI, you can then try to decode the data as utf-8 or latin-1 / latin-15 (which, assuming you're in the USA or Europe, are probably the most common encodings).
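
The "try to decode" step could look roughly like the sketch below. It is only an illustration, not a tested recipe: decode_upload is a made-up helper name, and the choice of ISO-8859-1 as the only fallback is an assumption. It uses the core Encode module, asks for strict UTF-8 first, and falls back when the bytes are malformed. If the upload is not text at all (an image, say), skip decoding and keep the raw bytes.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not part of CGI.pm): takes the raw bytes of an upload
# and returns the decoded text plus the name of the encoding that was used.
sub decode_upload {
    my ($bytes) = @_;

    # decode() with a CHECK value may modify its source argument,
    # so work on a copy and keep the original bytes for the fallback.
    my $copy = $bytes;

    # FB_CROAK makes decode() die on malformed UTF-8 instead of
    # silently substituting replacement characters.
    my $text = eval { decode('UTF-8', $copy, Encode::FB_CROAK) };
    return ($text, 'UTF-8') if defined $text;

    # Assumed fallback: ISO-8859-1 maps every byte to a character,
    # so this always succeeds (whether or not the guess is right).
    return (decode('ISO-8859-1', $bytes), 'ISO-8859-1');
}
```

Because the fallback can never fail, it is a guess rather than a detection; a file in some other 8-bit encoding will decode without error but show the wrong characters.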
Re^2: UTF-8 woes with file upload by daxim (Curate) on Aug 17, 2007 at 19:46 UTC
The short name for ISO 8859-15 is Latin 9, not Latin 15.
Re: UTF-8 woes with file upload by graff (Chancellor) on Aug 15, 2007 at 22:52 UTC
Well, it ultimately depends on whether the file being uploaded is valid utf8, which apparently it is not. If you can isolate the portion of your code at which uploaded file data is being read from the client, you should be able to put `binmode STDIN, ":raw";` at the start of that portion and `binmode STDIN, ":utf8";` at the end of it (though setting it back to utf8 may not be necessary if the end of the file upload is also the end of the CGI transaction). For that matter, you might be better off having STDIN be "raw" at all times and decoding incoming data as utf8 only when appropriate. (I'm guessing that the actual file upload is taking place over STDIN. If I'm wrong about that, forgive me; file uploads are not an area where I have much experience.)
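
A minimal sketch of the raw-handle idea, with the assumptions stated up front: copy_raw is a made-up helper, and the form field name 'file' in the comments is a placeholder. The helper copies bytes from one filehandle to another with the `:raw` layer on both, so no `:utf8` layer set by "use open" gets a chance to reinterpret them.

```perl
use strict;
use warnings;

# Hypothetical helper: copy an upload byte-for-byte from $in to $out
# and return the number of bytes copied.
sub copy_raw {
    my ($in, $out) = @_;

    # :raw drops any :utf8 / :encoding layer, so bytes pass through untouched.
    binmode $in,  ':raw';
    binmode $out, ':raw';

    my $total = 0;
    my $buf;
    while (my $n = read($in, $buf, 8192)) {
        print {$out} $buf or die "write failed: $!";
        $total += $n;
    }
    return $total;
}

# Assumed use in a CGI.pm script ('file' is a placeholder field name):
#   binmode STDIN, ':raw';            # before CGI.pm parses the request
#   my $q = CGI->new;
#   copy_raw($q->upload('file'), $out_fh);
```

Setting the layer on each handle explicitly keeps the "use open" lines in place for the rest of the application, which still wants UTF-8 for database and page output.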