Re^7: Determining content-length for an HTTP Post

by WizardOfUz (Friar)
on Nov 26, 2009 at 10:42 UTC


in reply to Re^6: Determining content-length for an HTTP Post
in thread Determining content-length for an HTTP Post

Hello again, ikegami. I think we are approaching the OP's problem from different angles, which might be a bit confusing for others. So I would like to clarify a few things; maybe you will join me.

First of all, given the limited amount of information provided by the OP, I was looking at the HTTP level only, ignoring the actual message content.

That said, I really think that the best way to determine the Content-Length of an HTTP message, if its content cannot be reliably encoded as bytes (we do not know what the OP's $xmldata actually contains), is to use length() with the bytes pragma in effect. This assumes, of course, that the message content is not encoded afterwards and that the content string's UTF-8 flag has not been fiddled with.

Some code to play with:

#!/usr/bin/perl

use strict;
use warnings;

use FindBin qw( $Bin );
use File::Spec::Functions qw( catfile );

my $file = catfile( $Bin, 'bytes_pragma.data' );

my $string_1 = "\x{C9}";   # LATIN CAPITAL LETTER E WITH ACUTE; 2 bytes in UTF-8
my $string_2 = "\x{20AC}"; # EURO SIGN; 3 bytes in UTF-8

{
    my $string = $string_1;
    print '[1] is_utf8() returns: ' . ( utf8::is_utf8( $string ) ? 'true' : 'false' ) . "\n";
    print '[1] length() returns: ' . length( $string ) . " (no bytes)\n";
    print '[1] length() returns: ' . do { use bytes; length( $string ) } . " (use bytes)\n";
    open FH, "> $file" or die;
    print FH $string;
    close FH;
    print '[1] actual size is: ' . ( -s $file ) . "\n";
}

{
    my $string = $string_1;
    utf8::encode( $string );
    print '[2] is_utf8() returns: ' . ( utf8::is_utf8( $string ) ? 'true' : 'false' ) . "\n";
    print '[2] length() returns: ' . length( $string ) . " (no bytes)\n";
    print '[2] length() returns: ' . do { use bytes; length( $string ) } . " (use bytes)\n";
    open FH, "> $file" or die;
    print FH $string;
    close FH;
    print '[2] actual size is: ' . ( -s $file ) . "\n";
}

{
    my $string = $string_2;
    print '[3] is_utf8() returns: ' . ( utf8::is_utf8( $string ) ? 'true' : 'false' ) . "\n";
    print '[3] length() returns: ' . length( $string ) . " (no bytes)\n";
    print '[3] length() returns: ' . do { use bytes; length( $string ) } . " (use bytes)\n";
    open FH, "> $file" or die;
    print FH $string;
    close FH;
    print '[3] actual size is: ' . ( -s $file ) . "\n";
}

{
    my $string = $string_2;
    utf8::encode( $string );
    print '[4] is_utf8() returns: ' . ( utf8::is_utf8( $string ) ? 'true' : 'false' ) . "\n";
    print '[4] length() returns: ' . length( $string ) . " (no bytes)\n";
    print '[4] length() returns: ' . do { use bytes; length( $string ) } . " (use bytes)\n";
    open FH, "> $file" or die;
    print FH $string;
    close FH;
    print '[4] actual size is: ' . ( -s $file ) . "\n";
}

{
    my $string = $string_1;
    utf8::upgrade( $string );
    print '[5] is_utf8() returns: ' . ( utf8::is_utf8( $string ) ? 'true' : 'false' ) . "\n";
    print '[5] length() returns: ' . length( $string ) . " (no bytes)\n";
    print '[5] length() returns: ' . do { use bytes; length( $string ) } . " (use bytes)\n";
    open FH, "> $file" or die;
    print FH $string;
    close FH;
    print '[5] actual size is: ' . ( -s $file ) . "\n";
}

# but ...

{
    use bytes;
    my $string = $string_1;
    utf8::upgrade( $string );
    print '[6] is_utf8() returns: ' . ( utf8::is_utf8( $string ) ? 'true' : 'false' ) . "\n";
    print '[6] length() returns: ' . length( $string ) . " (use bytes)\n";
    open FH, "> $file" or die;
    print FH $string;
    close FH;
    print '[6] actual size is: ' . ( -s $file ) . "\n";
}

Output:

joerg@Marvin:~> '/home/joerg/bytes_pragma.pl'
[1] is_utf8() returns: false
[1] length() returns: 1 (no bytes)
[1] length() returns: 1 (use bytes)
[1] actual size is: 1
[2] is_utf8() returns: false
[2] length() returns: 2 (no bytes)
[2] length() returns: 2 (use bytes)
[2] actual size is: 2
[3] is_utf8() returns: true
[3] length() returns: 1 (no bytes)
[3] length() returns: 3 (use bytes)
Wide character in print at /home/joerg/bytes_pragma.pl line 42.
[3] actual size is: 3
[4] is_utf8() returns: false
[4] length() returns: 3 (no bytes)
[4] length() returns: 3 (use bytes)
[4] actual size is: 3
[5] is_utf8() returns: true
[5] length() returns: 1 (no bytes)
[5] length() returns: 2 (use bytes)
[5] actual size is: 1
[6] is_utf8() returns: true
[6] length() returns: 2 (use bytes)
[6] actual size is: 2

Update: Added a clarification.

Replies are listed 'Best First'.
Re^8: Determining content-length for an HTTP Post
by ikegami (Patriarch) on Nov 26, 2009 at 15:56 UTC

    "if its content cannot be reliably encoded as bytes"

    HTTP can *only* send bytes, so your premise is flawed and your argument is moot.

      Well, maybe it's the language barrier. My point is that it is simply not possible to encode $xmldata without knowing from what / to what. The OP told us nothing about the content of $xmldata or the desired encoding. Therefore, the only (quick) advice I could give him was to make sure that length() treats $xmldata as a series of bytes. And that is exactly what the bytes pragma is for. When the bytes pragma is in effect, length() returns the number of bytes taken by Perl's internal string representation, which is exactly what we need to know for the Content-Length header (assuming that no PerlIO layer has been specified for the output stream):

      "A user of Perl does not normally need to know nor care how Perl happens to encode its internal strings, but it becomes relevant when outputting Unicode strings to a stream without a PerlIO layer -- one with the "default" encoding. In such a case, the raw bytes used internally (the native character set or UTF-8, as appropriate for each string) will be used, and a "Wide character" warning will be issued if those strings contain a character beyond 0x00FF."
      (From the perluniintro, emphasis mine)
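To make the quoted passage concrete, here is a minimal sketch (not part of the original posts) that writes the same one-character string to an in-memory filehandle twice: once with no PerlIO layer and once with an :encoding(UTF-8) layer. The helper name bytes_written is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: write $string to an in-memory filehandle opened
# with $mode and return the number of bytes that actually arrived.
sub bytes_written {
    my ( $mode, $string ) = @_;
    my $buf = '';
    open my $fh, $mode, \$buf or die "Cannot open in-memory handle: $!";
    print {$fh} $string;
    close $fh or die "Cannot close in-memory handle: $!";
    return length $buf;    # $buf holds bytes, so this is a byte count
}

my $e_acute = "\x{C9}";    # one character, stored natively as one byte

print 'no layer:         ', bytes_written( '>', $e_acute ), "\n";
print ':encoding(UTF-8): ', bytes_written( '>:encoding(UTF-8)', $e_acute ), "\n";
```

With no layer the character goes out as its single native byte; with the layer it goes out as two bytes. The Content-Length therefore has to describe whichever form is really written to the socket.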

      The examples in your previous post were certainly interesting, but missed the point, especially the third one, because the only thing we really need to know for the Content-Length header is how many bytes are going to be sent. See above.

      Furthermore, it is simply not true that the bytes pragma is as unreliable as you depicted it. It only fails (in this context) if you try really hard. See my examples above.

      And yes, I'm aware that if my advice had solved the wrong Content-Length problem, the follow-up question would probably have been: "Help! My message content is garbled!". That would have been your opportunity to shine ...

      Peace.

        "Well, maybe it's the language barrier. My point is that it is simply not possible to encode $xmldata without knowing from what / to what."

        Correct, just like you can't use use bytes; to encode strings.

        If you revisit what I said, you'll notice I said he needed to encode as per the encoding specified in the <?xml?> directive. (UTF-8 is the default, btw.)
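A minimal sketch of the encode-first approach described here, assuming $xmldata holds decoded characters and that the XML declaration names UTF-8. The helper encode_body and the sample document are hypothetical; only Encode::encode and length() are real.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Hypothetical helper: turn a character string into the bytes that will be
# sent, and return those bytes together with their Content-Length.
sub encode_body {
    my ( $xml_chars, $encoding ) = @_;
    my $body = encode( $encoding, $xml_chars );    # characters -> bytes
    return ( $body, length $body );                # length of bytes == byte count
}

# Sample document (an assumption; the OP never showed the real $xmldata).
my $xmldata = qq{<?xml version="1.0" encoding="UTF-8"?><price>\x{20AC}5</price>};

my ( $body, $content_length ) = encode_body( $xmldata, 'UTF-8' );

print "Content-Type: application/xml; charset=UTF-8\r\n";
print "Content-Length: $content_length\r\n";
```

Because $body is already a string of bytes, plain length() is enough and no pragma is involved. The same $body is what should then be written to the connection.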

        "Therefore, the only (quick) advice I could give him was to make sure that length() treats $xmldata as a series of bytes."

        use bytes; does no such thing.

        "When the bytes pragma is in effect, length() returns the number of bytes taken by Perl's internal string representation."

        Yes, but we don't want or need that. We want the number of bytes in the string.

        "the only thing we really need to know for the Content-Length header is how many bytes are going to be sent. See above."

        Even if I can't convince you that use bytes; is bad in general, I can clearly show that it doesn't give us the information you just said we needed.

        $ perl -E'
            my $buf = "";
            {
                open my $fh, ">", \$buf;
                utf8::upgrade( my $all_255_bytes = join "", map chr, 0..255 );
                say length $all_255_bytes;
                say do { use bytes; length $all_255_bytes };
                print $fh $all_255_bytes;
            }
            say length($buf);
        '
        256    length without use bytes
        384    length with use bytes
        256    actual content length

        Given a string of bytes,
        length without use bytes; always gives the number of bytes.
        length with use bytes; doesn't always give the number of bytes.

        Given a string of chars,
        length without use bytes; always gives the number of chars.
        length with use bytes; doesn't always give the number of chars.
        length with use bytes; doesn't always give the bytes of the UTF-8 encoding of the chars either.
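The summary above can be checked with a short sketch that stores the same 256 characters in both of Perl's internal formats. The helper name internal_length is made up for illustration; it just wraps length() in a use bytes scope.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Hypothetical helper: length() with the bytes pragma in effect, i.e. the
# size of Perl's internal buffer for this particular scalar.
sub internal_length {
    my ($string) = @_;
    use bytes;
    return length $string;
}

# The same 256 characters, stored in two different internal formats.
my $downgraded = join '', map { chr } 0 .. 255;
my $upgraded   = $downgraded;
utf8::upgrade($upgraded);

print 'strings compare equal:    ', ( $downgraded eq $upgraded ? 'yes' : 'no' ), "\n";
print 'length, downgraded:       ', length($downgraded), "\n";
print 'length, upgraded:         ', length($upgraded), "\n";
print 'use bytes, downgraded:    ', internal_length($downgraded), "\n";
print 'use bytes, upgraded:      ', internal_length($upgraded), "\n";
print 'length of UTF-8 encoding: ', length( encode( 'UTF-8', $downgraded ) ), "\n";
```

The two strings compare equal and plain length() reports 256 for both, yet under use bytes one reports 256 and the other 384. The pragma's answer depends on the storage format, not on the string's value.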

        "Furthermore, it is simply not true that the bytes pragma is as unreliable as you depicted it. It only fails (in this context) if you try really hard. See my examples above."

        Compared to not using use bytes; which always returns the right value? Yes, it is.

        "I'm aware that if my advice had solved the wrong Content-Length problem, the follow-up question would probably have been: 'Help! My message content is garbled!'. That would have been your opportunity to shine ..."

        Something can be wrong and still work. Bad code sometimes works.

Re^8: Determining content-length for an HTTP Post
by ikegami (Patriarch) on Nov 26, 2009 at 22:57 UTC

    Looking at your results, the only case where use bytes; helped is the one where encode was needed and Perl told you encode was needed. Are you saying that use bytes; is equivalent to encode?
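One way to see the "Perl told you encode was needed" signal is to capture the warning. This is a sketch under stated assumptions: the helper name warns_wide_character is hypothetical, and it prints to an in-memory filehandle with no encoding layer rather than to a real socket.

```perl
use strict;
use warnings;

# Hypothetical helper: print $string to an in-memory handle that has no
# encoding layer and report whether Perl warned about a wide character.
sub warns_wide_character {
    my ($string) = @_;
    my $warned = 0;
    local $SIG{__WARN__} = sub { $warned = 1 if $_[0] =~ /^Wide character/ };
    my $buf = '';
    open my $fh, '>', \$buf or die "Cannot open in-memory handle: $!";
    print {$fh} $string;
    close $fh;
    return $warned;
}

print 'E WITH ACUTE warns: ', ( warns_wide_character("\x{C9}")   ? 'yes' : 'no' ), "\n";
print 'EURO SIGN warns:    ', ( warns_wide_character("\x{20AC}") ? 'yes' : 'no' ), "\n";
```

Only the string containing a character above 0xFF triggers the warning, which corresponds to case [3] in the output being discussed.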

      "Are you saying that use bytes; is equivalent to encode?"

      No. Please see my other reply.

Re^8: Determining content-length for an HTTP Post
by Anonymous Monk on Nov 26, 2009 at 11:05 UTC

      Of course, the use bytes / length() solution is not perfect (just see my examples [5] and [6] above). But it is a reasonable approach considering that we do not know what the OP's $xmldata is and that "normal" users are usually not fiddling with the UTF-8 flag.
