in reply to HTTP::Request::Common::POST and UTF-8

I'm not sure if this will fit your needs, but here's one possible solution. I added one new line to your code and modified another, both marked with ### <--.
    #!/usr/bin/perl
    use strict;
    use warnings;

    use LWP;
    use HTTP::Request::Common;
    use Encode;
    use charnames qw(greek);
    use URI::Escape qw(uri_escape_utf8);    ### <-- new line

    binmode(STDOUT, ":utf8");

    my $utf8_data = "<\N{alpha}\N{beta}\N{gamma}\N{delta}>";
    print $utf8_data, "\n\n";

    print Encode::is_utf8($utf8_data)
        ? "\$utf8_data marked as UTF-8\n\n"
        : "\$utf8_data not marked as UTF-8\n\n";

    my $request = POST("http://localhost/test",
        Content => [
            data      => uri_escape_utf8($utf8_data),    ### <-- modified line
            more_data => "some more data",
        ]
    );

    my $req_string = $request->as_string();

    print Encode::is_utf8($req_string)
        ? "\$req_string marked as UTF-8\n\n"
        : "\$req_string not marked as UTF-8\n\n";

    print $req_string, "\n";

Re^2: HTTP::Request::Common::POST and UTF-8
by scollyer (Sexton) on Sep 28, 2005 at 15:32 UTC
    > I'm not sure if this will fit your needs, but here's one
    > possible solution.

    Thanks very much. That looks a lot better. (I needed to upgrade to a new version of URI::Escape though). I guess it might be nice if you could override the default escape routine inside POST, rather than doing it manually.

    It's suitable for the real code, too.

    Steve Collyer

Re^2: HTTP::Request::Common::POST and UTF-8
by ikegami (Patriarch) on Sep 28, 2005 at 16:57 UTC

    Won't that escape data twice? Without actually running it, it looks like
    "\x{1234}"
    would be transformed by uri_escape_utf8 into
    "%E1%88%B4"
    which would be transformed by POST into
    "%25E1%2588%25B4"
    while the right answer would be
    "%E1%88%B4"

    What he actually needs is

        my $request = POST(
            "http://localhost/test",
            Content => [
                data      => encode("UTF-8", $utf8_data),
                more_data => "some more data",
            ]
        );

    The core problem is that the URL-encoded format didn't anticipate data using character sets other than US-ASCII. There is a de facto standard, which consists of encoding the string as UTF-8 and then percent-escaping the resulting bytes as if they were US-ASCII data. The code above converts the string to UTF-8 bytes, which are subsequently escaped by POST's guts.
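
To make the difference between escaping once and escaping twice concrete, here is a minimal sketch that uses only the core Encode module. The percent_escape helper is hypothetical: it is a hand-written stand-in for the escaping that POST applies to form values, not the routine POST actually uses, and it assumes that everything outside the RFC 3986 unreserved set gets escaped.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical stand-in for the escaping POST applies to form values:
# percent-escape every byte that is not an RFC 3986 unreserved character.
sub percent_escape {
    my ($bytes) = @_;
    $bytes =~ s/([^A-Za-z0-9\-._~])/sprintf("%%%02X", ord($1))/ge;
    return $bytes;
}

# The same data as in the original script: <alpha beta gamma delta>.
my $utf8_data = "<\x{3b1}\x{3b2}\x{3b3}\x{3b4}>";

# Encode to UTF-8 bytes first, then escape the bytes exactly once.
my $once = percent_escape(encode("UTF-8", $utf8_data));

# Escaping an already escaped value also escapes every "%" sign.
my $twice = percent_escape($once);

print "escaped once:  $once\n";
print "escaped twice: $twice\n";
```

In the twice-escaped value every "%" has turned into "%25", which is what the receiving side would see if the value were escaped by hand before being passed to POST.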

      >Won't that escape data twice?

      Yup, just discovered that. Your solution appears to work correctly, with the corresponding unescaping being:

      decode("UTF-8", uri_unescape($req_string))

      Thanks for this.
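
For the receiving side, a rough sketch of that unescape-then-decode order, again using only core modules. The percent_unescape helper is a hypothetical stand-in for URI::Escape's uri_unescape, and the $wire value is an assumed example of a single escaped form value rather than a whole request string.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical stand-in for URI::Escape's uri_unescape:
# turn each %XX sequence back into the byte it represents.
sub percent_unescape {
    my ($str) = @_;
    $str =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $str;
}

# Assumed example: one form value as it might appear on the wire.
my $wire = "%3C%CE%B1%CE%B2%CE%B3%CE%B4%3E";

# Step 1: undo the percent-escaping, which yields raw UTF-8 bytes.
my $bytes = percent_unescape($wire);

# Step 2: decode those bytes into a Perl character string.
my $chars = decode("UTF-8", $bytes);

binmode(STDOUT, ":encoding(UTF-8)");
print length($bytes), " bytes, ", length($chars), " characters\n";
print $chars, "\n";
```

The order matters: unescaping has to come first, because decode expects the raw UTF-8 bytes, not their %XX representation.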

      I think I'll go and hit myself with a stick now. It'll be less pain than doing UTF-8 in Perl ...

      Steve Collyer