Re^6: Determining content-length for an HTTP Post

This should give us the correct value for the Content-Length header

No.

If the XML is valid (that is, it has already been encoded to bytes), `length` gives the right answer without `use bytes;`:

$ perl -le'
    $_ = "<?xml version=\"1.0\"?><root>\x{C9}ric</root>";
    utf8::encode($_);
    utf8::downgrade($_);
    print length;
    print do { use bytes; length };
'
39
39

You can get the wrong answer if you add `use bytes;`:

$ perl -le'
    $_ = "<?xml version=\"1.0\"?><root>\x{C9}ric</root>";
    utf8::encode($_);
    utf8::upgrade($_);
    print length;
    print do { use bytes; length };
'
39
41   XXX Should be 39

If the XML hasn't been encoded, `use bytes;` can give you the right result if the desired encoding is UTF-8, but it's unreliable because the result depends on which internal format Perl happens to be using for the string:

$ perl -le'
    $_ = "<?xml version=\"1.0\"?><root>\x{C9}ric</root>";
    print do { use bytes; length };
'
38   XXX Should be 39

In no case is `use bytes;` the appropriate answer.

Perl has two different formats for storing strings. `use bytes;` causes opcodes to look directly at the internal buffer of the string no matter which format was used. Since Perl is free to change how it internally stores the string at will, it's quite useless to use `use bytes;` without checking which format Perl used for that string.
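
A minimal sketch of the alternative, assuming the desired encoding is UTF-8 (the OP never said which encoding is wanted): encode the string explicitly, then take a plain `length` of the resulting bytes. The helper name `encoded_body_and_length` is made up for this illustration; `encode` comes from the core `Encode` module.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Hypothetical helper (not from the thread): encode the character string
# explicitly, then count the resulting bytes with plain length().
sub encoded_body_and_length {
    my ($chars, $encoding) = @_;
    my $bytes = encode($encoding, $chars);    # character string -> byte string
    return ($bytes, length($bytes));
}

my $xml = "<?xml version=\"1.0\"?><root>\x{C9}ric</root>";

# The internal storage format is changed on purpose; the result must not care.
for my $variant (qw( downgraded upgraded )) {
    my $copy = $xml;
    if ($variant eq 'upgraded') { utf8::upgrade($copy) }
    else                        { utf8::downgrade($copy) }

    my ($body, $len) = encoded_body_and_length($copy, 'UTF-8');
    print "$variant: Content-Length: $len\n";    # 39 in both cases
}
```

The byte string returned by the helper is what would be sent as the message body, so the header value and the body cannot disagree, whichever internal format Perl picked.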

Update: Rephrased for clarity.

Re^7: Determining content-length for an HTTP Post by WizardOfUz (Friar) on Nov 26, 2009 at 10:42 UTC
Hello again, ikegami. I think we are approaching the OP's problem from different angles, which might be a bit confusing for others. So I would like to clarify a few things; maybe you will join me.

First of all, given the limited amount of information provided by the OP, I was looking at the HTTP level only, ignoring the actual message content. With that being said, I really think that the best way to determine the `Content-Length` of an HTTP message if its content cannot be reliably encoded as bytes (we do not know what the OP's `$xmldata` actually contains) is to use `length()` with the `bytes` pragma in effect. This assumes of course that the message content is not being encoded afterwards and that the content string's UTF-8 flag has not been fiddled with.

Some code to play with:

#!/usr/bin/perl
use strict;
use warnings;

use FindBin qw( $Bin );
use File::Spec::Functions qw( catfile );

my $file = catfile( $Bin, 'bytes_pragma.data' );

my $string_1 = "\x{C9}";   # LATIN CAPITAL LETTER E WITH ACUTE; 2 bytes in UTF-8
my $string_2 = "\x{20AC}"; # EURO SIGN; 3 bytes in UTF-8

{
    my $string = $string_1;
    print '[1] is_utf8() returns: ' . ( utf8::is_utf8( $string ) ? 'true' : 'false' ) . "\n";
    print '[1] length() returns: ' . length( $string ) . " (no bytes)\n";
    print '[1] length() returns: ' . do { use bytes; length( $string ) } . " (use bytes)\n";
    open FH, "> $file" or die;
    print FH $string;
    close FH;
    print '[1] actual size is: ' . ( -s $file ) . "\n";
}

{
    my $string = $string_1;
    utf8::encode( $string );
    print '[2] is_utf8() returns: ' . ( utf8::is_utf8( $string ) ? 'true' : 'false' ) . "\n";
    print '[2] length() returns: ' . length( $string ) . " (no bytes)\n";
    print '[2] length() returns: ' . do { use bytes; length( $string ) } . " (use bytes)\n";
    open FH, "> $file" or die;
    print FH $string;
    close FH;
    print '[2] actual size is: ' . ( -s $file ) . "\n";
}

{
    my $string = $string_2;
    print '[3] is_utf8() returns: ' . ( utf8::is_utf8( $string ) ? 'true' : 'false' ) . "\n";
    print '[3] length() returns: ' . length( $string ) . " (no bytes)\n";
    print '[3] length() returns: ' . do { use bytes; length( $string ) } . " (use bytes)\n";
    open FH, "> $file" or die;
    print FH $string;
    close FH;
    print '[3] actual size is: ' . ( -s $file ) . "\n";
}

{
    my $string = $string_2;
    utf8::encode( $string );
    print '[4] is_utf8() returns: ' . ( utf8::is_utf8( $string ) ? 'true' : 'false' ) . "\n";
    print '[4] length() returns: ' . length( $string ) . " (no bytes)\n";
    print '[4] length() returns: ' . do { use bytes; length( $string ) } . " (use bytes)\n";
    open FH, "> $file" or die;
    print FH $string;
    close FH;
    print '[4] actual size is: ' . ( -s $file ) . "\n";
}

{
    my $string = $string_1;
    utf8::upgrade( $string );
    print '[5] is_utf8() returns: ' . ( utf8::is_utf8( $string ) ? 'true' : 'false' ) . "\n";
    print '[5] length() returns: ' . length( $string ) . " (no bytes)\n";
    print '[5] length() returns: ' . do { use bytes; length( $string ) } . " (use bytes)\n";
    open FH, "> $file" or die;
    print FH $string;
    close FH;
    print '[5] actual size is: ' . ( -s $file ) . "\n";
}

# but ...

{
    use bytes;
    my $string = $string_1;
    utf8::upgrade( $string );
    print '[6] is_utf8() returns: ' . ( utf8::is_utf8( $string ) ? 'true' : 'false' ) . "\n";
    print '[6] length() returns: ' . length( $string ) . " (use bytes)\n";
    open FH, "> $file" or die;
    print FH $string;
    close FH;
    print '[6] actual size is: ' . ( -s $file ) . "\n";
}

Output:

joerg@Marvin:~> '/home/joerg/bytes_pragma.pl'
[1] is_utf8() returns: false
[1] length() returns: 1 (no bytes)
[1] length() returns: 1 (use bytes)
[1] actual size is: 1
[2] is_utf8() returns: false
[2] length() returns: 2 (no bytes)
[2] length() returns: 2 (use bytes)
[2] actual size is: 2
[3] is_utf8() returns: true
[3] length() returns: 1 (no bytes)
[3] length() returns: 3 (use bytes)
Wide character in print at /home/joerg/bytes_pragma.pl line 42.
[3] actual size is: 3
[4] is_utf8() returns: false
[4] length() returns: 3 (no bytes)
[4] length() returns: 3 (use bytes)
[4] actual size is: 3
[5] is_utf8() returns: true
[5] length() returns: 1 (no bytes)
[5] length() returns: 2 (use bytes)
[5] actual size is: 1
[6] is_utf8() returns: true
[6] length() returns: 2 (use bytes)
[6] actual size is: 2

Update: Added a clarification.
Re^8: Determining content-length for an HTTP Post by ikegami (Patriarch) on Nov 26, 2009 at 15:56 UTC
if its content cannot be reliably encoded as bytes

HTTP can only send bytes, so your premise is flawed and your argument is moot.
Re^9: Determining content-length for an HTTP Post by WizardOfUz (Friar) on Nov 26, 2009 at 20:46 UTC
Well, maybe it's the language barrier. My point is that it is simply not possible to encode `$xmldata` without knowing from what / to what. The OP told us nothing about the content of `$xmldata` or the desired encoding. Therefore, the only (quick) advice I could give him was to make sure that `length()` treats `$xmldata` as a series of bytes. And that is exactly what the `bytes` pragma is for.

When the `bytes` pragma is in effect, `length()` returns the number of bytes taken by Perl's internal string representation. Which is exactly what we need to know for the `Content-Length` header (assuming that no PerlIO layer has been specified for the output stream):

"A user of Perl does not normally need to know nor care how Perl happens to encode its internal strings, but it becomes relevant when outputting Unicode strings to a stream without a PerlIO layer -- one with the "default" encoding. In such a case, the raw bytes used internally (the native character set or UTF-8, as appropriate for each string) will be used, and a "Wide character" warning will be issued if those strings contain a character beyond 0x00FF."

(From perluniintro, emphasis mine)

The examples in your previous post were certainly interesting, but missed the point, especially the third one, because the only thing we really need to know for the `Content-Length` header is how many bytes are going to be sent. See above.

Furthermore, it is simply not true that the `bytes` pragma is as unreliable as you depicted it. It only fails (in this context) if you try really hard. See my examples above.

And yes, I'm aware that if my advice had solved the wrong `Content-Length` problem, the follow-up question would probably have been: "Help! My message content is garbled!". That would have been your opportunity to shine ...

Peace.
Re^10: Determining content-length for an HTTP Post by ikegami (Patriarch) on Nov 26, 2009 at 22:43 UTC
Re^11: Determining content-length for an HTTP Post by WizardOfUz (Friar) on Nov 27, 2009 at 12:01 UTC
Some notes below your chosen depth have not been shown here
Re^8: Determining content-length for an HTTP Post by ikegami (Patriarch) on Nov 26, 2009 at 22:57 UTC
Looking at your results, the only case where `use bytes;` helped is the one where `encode` was needed and Perl told you `encode` was needed. Are you saying that `use bytes;` is equivalent to `encode`?
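
A small sketch of why the two are not interchangeable, under assumptions that are not from the thread: the helper name `byte_counts` is made up, and UTF-16LE is picked only as an arbitrary second target encoding. For a character above 0xFF, the size of the internal buffer happens to coincide with the UTF-8 encoded size, but it says nothing about any other encoding.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Hypothetical helper (not from the thread): report the size of Perl's
# internal buffer next to the size of the string in each target encoding.
sub byte_counts {
    my ($chars, @encodings) = @_;
    my %count = ( internal => do { use bytes; length($chars) } );
    for my $encoding (@encodings) {
        $count{$encoding} = length( encode($encoding, $chars) );
    }
    return %count;
}

my %count = byte_counts("\x{20AC}", 'UTF-8', 'UTF-16LE');    # EURO SIGN
print "$_: $count{$_}\n" for sort keys %count;
```

For the euro sign this reports 3 bytes for the internal buffer, 3 for UTF-8 and 2 for UTF-16LE.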
Re^9: Determining content-length for an HTTP Post by WizardOfUz (Friar) on Nov 27, 2009 at 12:07 UTC
Are you saying that `use bytes;` is equivalent to `encode`?

No. Please see my other reply.
Re^8: Determining content-length for an HTTP Post by Anonymous Monk on Nov 26, 2009 at 11:05 UTC
#34772: HTTP::Request::Common::PUT should set content length
Re^9: Determining content-length for an HTTP Post by WizardOfUz (Friar) on Nov 26, 2009 at 11:43 UTC
Of course, the `use bytes` / `length()` solution is not perfect (just see my examples `[5]` and `[6]` above). But it is a reasonable approach considering that we do not know what the OP's `$xmldata` is and that "normal" users are usually not fiddling with the UTF-8 flag.
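
The `[5]` case can be reproduced without touching the disk. This is only a sketch: the helper name `bytes_written` is made up, and it assumes no default I/O layers have been set (for example through `-C` or the `open` pragma). It prints a string to an in-memory handle and compares the number of bytes that actually come out with what `length` reports under `use bytes;`.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): print a string to an in-memory
# handle that has no encoding layer and return the bytes actually written.
sub bytes_written {
    my ($string) = @_;
    my $buffer = '';
    open my $fh, '>', \$buffer or die "Cannot open in-memory handle: $!";
    print {$fh} $string;
    close $fh;
    return $buffer;
}

my $downgraded = "\x{C9}";
my $upgraded   = "\x{C9}";
utf8::downgrade($downgraded);    # one byte per character internally
utf8::upgrade($upgraded);        # UTF-8 internally, same character string

for my $case ( [ downgraded => $downgraded ], [ upgraded => $upgraded ] ) {
    my ($name, $string) = @$case;
    my $internal = do { use bytes; length($string) };
    my $written  = length( bytes_written($string) );
    print "$name: internal bytes = $internal, bytes written = $written\n";
}
```

Both strings produce a single byte on output, while the internal sizes are 1 and 2, which matches the `[5]` result above.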