OK, I've been trying to track down some problems with the transfer of some UTF-8 encoded XML between two Perl based systems for a week, and have finally had some success. I'm documenting a few facts here for the benefit of others, rather than to present a question (though there is a question or two, later); it's been very painful trying to sort through the problems, and I hope this will be of help to someone else.
1. One system (the sender) generates a set of name/value pairs to be transferred to another system (the receiver). The sender is using LWP to generate and send the data, the receiver is using CGI.pm in a CGI script running under Apache to receive the data.
2. The sender is required (by spec) to send the name/value pairs as POST data (via an HTTP POST method, of course) to the receiver, and is required to encode the name/value pairs as an application/x-www-form-urlencoded form (which encodes various reserved characters, such as #, with a hex-based encoding; you can look up the details elsewhere).
3. Unfortunately we immediately have an inconsistency here, as the application/x-www-form-urlencoded encoding is strictly defined only for US-ASCII data, and I want to send UTF-8 encoded XML. (More on this later).
However, LWP *can* URL-encode a string containing Unicode characters if we first convert it to its UTF-8 encoded representation using, say, Encode::encode_utf8 (which returns what is, from LWP's point of view, a sequence of bytes), and then allow LWP to URL-encode each byte individually.
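To make the "encode to bytes first, then URL-encode each byte" step concrete, here is a minimal sketch using only core modules. The url_encode_bytes helper is hypothetical (it is not part of LWP) and does plain percent-encoding, so a space becomes %20 rather than the + that form encoding normally uses; it only illustrates what happens to each byte.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not an LWP function): percent-encode every byte
# that is not an unreserved URL character.
sub url_encode_bytes {
    my ($bytes) = @_;
    $bytes =~ s/([^A-Za-z0-9\-_.~])/sprintf('%%%02X', ord($1))/ge;
    return $bytes;
}

my $text    = "caf\x{e9} \x{263a}";         # character string: e-acute, space, U+263A
my $bytes   = Encode::encode_utf8($text);   # the same text as UTF-8 bytes
my $encoded = url_encode_bytes($bytes);

print "$encoded\n";   # prints caf%C3%A9%20%E2%98%BA
```

Note that the two non-ASCII characters become two and three escaped bytes respectively, which is why the receiver has to reassemble the bytes and decode them as UTF-8 afterwards.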
4. At the receiving end, things become more difficult. As far as I can tell, CGI.pm cannot decode the URL-encoded data generated by LWP: with a form containing multiple URL-encoded UTF-8 form elements, I've only ever been able to reconstruct *one* of these correctly at the receiver, the others being mangled in various ways. Furthermore, I've never managed to reconstruct the XML data at all. I can't meaningfully comment on the problem here, as I haven't had enough time to dig into it any further (suggestions or solutions will be gladly accepted).
I've been using CGI.pm's Vars() method to retrieve the params; AFAICS the results would be no different using param(), though I haven't tested this.
5. The code I've been using to reconstruct the UTF-8 data in the CGI script looks roughly like:

    foreach (keys %params) {
        <untaint $params{$_} - see below>
        $params{$_} = Encode::decode_utf8($params{$_});
    }

where %params contains all of the form params. Note the untaint pseudocode above. This seems to be vital, as Encode::decode_utf8 fails to generate a string marked as UTF-8 without it. I assume that this is a bug in Encode::decode_utf8, as it's not documented.
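For anyone who wants something they can run, the loop in point 5 might be sketched as follows. Two assumptions to flag loudly: the %params hash here is a hand-built stand-in for what CGI.pm's Vars() would return, and the untaint step is a pass-through regex capture, which is only one common way to untaint and not necessarily what the original script did. A real script should match against a stricter pattern.

```perl
use strict;
use warnings;
use Encode ();

# Stand-in for the hash returned by CGI.pm's Vars(); the values are raw
# UTF-8 byte strings, as they would arrive off the wire.
my %params = (
    name => "caf\xC3\xA9",
    xml  => "<note>\xE2\x98\xBA</note>",
);

foreach my $key (keys %params) {
    # Assumed untaint step: capturing through a regex yields an untainted
    # copy. This pattern accepts anything, so it validates nothing.
    my ($clean) = $params{$key} =~ /\A(.*)\z/s;

    # Turn the UTF-8 bytes into a Perl character string.
    $params{$key} = Encode::decode_utf8($clean);
}

print length($params{name}), "\n";   # prints 4 (characters, not bytes)
```

The length check is a quick way to confirm that decoding happened: the five bytes of the name value have become four characters.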
6. So, sending UTF-8 data via an application/x-www-form-urlencoded form is not successful. Do we have any alternatives? Luckily, the answer is yes: there are two.
Firstly, we need not encode the data in any particular format, but merely send the name/value pairs as raw POST data, each separated by newlines perhaps. This would guarantee that the data is received intact (as the TCP connection from the sender to the receiving Apache is 8-bit clean, of course), but it would mean we have to retrieve the raw binary data and parse it ourselves. This may be tricky in some implementations (Java? .Net? How easy is it to get access to raw POST data in those worlds?), so let's forget that.
Secondly, we can use the multipart/form-data encoding to treat the name/value pairs as form elements. This particular form encoding is less common, but luckily is supported by both HTTP::Request::Common::POST and CGI.pm. More importantly, by design, it is 8-bit clean; each form element is bundled up in its own little section in the encoding, with each section potentially specifying its own Content-type header (like application/xml) and encoding (like UTF-8). The one obvious disadvantage of sending multipart/form-data is, again, implementation support: in Perl land we're lucky, but Java and .Net aficionados may not be.
It's also easy to convince HTTP::Request::Common::POST to send its parameters as multipart/form-data; one merely adds a Content_Type => 'form-data' argument to the call.
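As a rough sketch of what that call looks like on the sender side (the receiver URL and the parameter names id and xml are made up for illustration, and HTTP::Request::Common needs to be installed from CPAN):

```perl
use strict;
use warnings;
use Encode ();
use HTTP::Request::Common qw(POST);

my $xml = "<note>caf\x{e9} \x{263a}</note>";   # character string

# Build (but do not send) a multipart/form-data POST request.
# The URL below is a placeholder, not a real endpoint.
my $request = POST(
    'http://receiver.example/cgi-bin/receive.cgi',
    Content_Type => 'form-data',
    Content      => [
        id  => '42',
        xml => Encode::encode_utf8($xml),   # hand over bytes, not characters
    ],
);

# Shows multipart/form-data plus the generated boundary parameter.
print $request->header('Content-Type'), "\n";

# To actually send it, something like:
#   require LWP::UserAgent;
#   my $response = LWP::UserAgent->new->request($request);
```

The values are still passed through Encode::encode_utf8 first, so that what goes into the request body is a byte string; each part then carries those bytes verbatim instead of a percent-encoded form of them.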
And here the story ends, more or less. By sending the parameters as multipart/form-data, I've managed to transfer UTF-8 encoded XML, and other parameters, successfully between the sender and receiver, sidestepping a week of pain and grief in tracking down problems with the application/x-www-form-urlencoded approach. (Aside: I haven't done anything special to set the charset attribute of the Content-type header of the various components of the multipart/form-data form; this seems not to bother CGI.pm, and maybe doesn't matter because I manually convert the appropriate parameters into Unicode strings anyway, as shown above.)
In summary:
a) send UTF-8 encoded data as multipart/form-data (or perhaps as raw POST data); eschew the path of the application/x-www-form-urlencoded encoding; it's painful and will make you unhappy. multipart/form-data is handled fine by both LWP and CGI.pm.
b) when you retrieve a parameter from CGI.pm, ensure that it is untainted before generating a Unicode string using Encode::decode_utf8; this looks like a bug to my eyes, but then I'm not familiar with Perl's internals.
c) If you want to get UTF-8 encoded data into and out of a Perl program, you will almost certainly have to understand the use of Encode functions, and expose yourself a little too much to Perl's internal UTF-8 representation. Despite well-meaning suggestions to the contrary, my experience is that you won't get too far without becoming familiar with this stuff when you run into problems (particularly if you need to interpret data grabbed by, say, Ethereal, and compare it to the data you were expecting to send).
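On point c), a small sketch of the kind of check that helps when comparing against a packet capture: it contrasts the character string with its UTF-8 byte string and prints the bytes in hex. The sample character is arbitrary.

```perl
use strict;
use warnings;
use Encode ();

my $chars = "\x{263a}";                    # one character (U+263A)
my $bytes = Encode::encode_utf8($chars);   # its UTF-8 encoding, as bytes

# Hex dump of the bytes, in the style a packet sniffer would show them.
my $hex = join ' ', map { sprintf '%02x', ord } split //, $bytes;

printf "%d char(s), %d byte(s): %s\n", length($chars), length($bytes), $hex;
# prints: 1 char(s), 3 byte(s): e2 98 ba
```

If the hex you see on the wire doesn't match what this prints for the same text, the data was probably encoded twice or not at all somewhere along the way.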
OK, that's enough. I hope someone finds it of use. Corrections or comments are welcome of course. I don't understand this stuff as well as I'd like to, so if any experts want to expand on details, please do so.
Steve Collyer
In reply to UTF-8, CGI.pm and LWP: some observations by scollyer