Huh? The bytes pragma simply forces $xmldata to be treated as a series of bytes. This should give us the correct value for the Content-Length header whether $xmldata is a character string or a UTF-8 encoded byte string. Or am I missing something?

Re^6: Determining content-length for an HTTP Post
by ikegami (Patriarch) on Nov 25, 2009 at 20:38 UTC

    This should give us the correct value for the Content-Length header

    No.

    If the XML is valid, length gives the right answer without use bytes:

    $ perl -le'
        $_ = "<?xml version=\"1.0\"?><root>\x{C9}ric</root>";
        utf8::encode($_);
        utf8::downgrade($_);
        print length;
        print do { use bytes; length };
    '
    39
    39

    You can get the wrong answer if you use use bytes;:

    $ perl -le'
        $_ = "<?xml version=\"1.0\"?><root>\x{C9}ric</root>";
        utf8::encode($_);
        utf8::upgrade($_);
        print length;
        print do { use bytes; length };
    '
    39
    41 XXX Should be 39

    If the XML hasn't been encoded, use bytes can give you the right result if the desired encoding is UTF-8, but it's unreliable:

    $ perl -le'
        $_ = "<?xml version=\"1.0\"?><root>\x{C9}ric</root>";
        print do { use bytes; length };
    '
    38 XXX Should be 39

    In no case is use bytes; the appropriate answer.

    Perl has two different formats for storing strings. use bytes; causes opcodes to look directly at the internal buffer of the string, no matter which format was used. Since Perl is free to change how it stores a string internally at any time, use bytes; is quite useless unless you also check which format Perl used for that string.
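To make the point about the two storage formats concrete, here is a minimal sketch (the variable names are illustrative only). It stores the same four-character string once in each internal format and compares what length reports with and without the bytes pragma.

```perl
use strict;
use warnings;

# The same logical string, stored in each of Perl's two internal formats.
my $downgraded = "\x{C9}ric";
my $upgraded   = "\x{C9}ric";
utf8::downgrade($downgraded);   # one byte per character internally
utf8::upgrade($upgraded);       # UTF-8 internally

# As far as Perl code is concerned, the two strings are identical.
print $downgraded eq $upgraded ? "equal\n" : "different\n";
print length($downgraded), " ", length($upgraded), "\n";

# Under "use bytes", length reports the size of the internal buffer,
# which depends on the storage format, not on the string's value.
my ( $raw_down, $raw_up ) = do {
    use bytes;
    ( length($downgraded), length($upgraded) );
};
print "$raw_down $raw_up\n";
```

The two strings compare equal and both have length 4, yet under use bytes; the upgraded copy reports 5, because the internal buffer holds \x{C9} as two bytes.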

    Update: Rephrased for clarity.

      Hello again, ikegami. I think we are approaching the OP's problem from different angles, which might be a bit confusing for others. So I would like to clarify a few things; maybe you will join me.

      First of all, given the limited amount of information provided by the OP, I was looking at the HTTP level only, ignoring the actual message content.

      That said, I really think that the best way to determine the Content-Length of an HTTP message, if its content cannot be reliably encoded as bytes (we do not know what the OP's $xmldata actually contains), is to use length() with the bytes pragma in effect. This assumes, of course, that the message content is not encoded afterwards and that the content string's UTF-8 flag has not been fiddled with.

      Some code to play with:

      #!/usr/bin/perl

      use strict;
      use warnings;
      use FindBin qw( $Bin );
      use File::Spec::Functions qw( catfile );

      my $file = catfile( $Bin, 'bytes_pragma.data' );

      my $string_1 = "\x{C9}";   # LATIN CAPITAL LETTER E WITH ACUTE; 2 bytes in UTF-8
      my $string_2 = "\x{20AC}"; # EURO SIGN; 3 bytes in UTF-8

      {
          my $string = $string_1;
          print '[1] is_utf8() returns: ' . ( utf8::is_utf8( $string ) ? 'true' : 'false' ) . "\n";
          print '[1] length() returns:  ' . length( $string ) . " (no bytes)\n";
          print '[1] length() returns:  ' . do { use bytes; length( $string ) } . " (use bytes)\n";
          open FH, "> $file" or die;
          print FH $string;
          close FH;
          print '[1] actual size is:    ' . ( -s $file ) . "\n";
      }

      {
          my $string = $string_1;
          utf8::encode( $string );
          print '[2] is_utf8() returns: ' . ( utf8::is_utf8( $string ) ? 'true' : 'false' ) . "\n";
          print '[2] length() returns:  ' . length( $string ) . " (no bytes)\n";
          print '[2] length() returns:  ' . do { use bytes; length( $string ) } . " (use bytes)\n";
          open FH, "> $file" or die;
          print FH $string;
          close FH;
          print '[2] actual size is:    ' . ( -s $file ) . "\n";
      }

      {
          my $string = $string_2;
          print '[3] is_utf8() returns: ' . ( utf8::is_utf8( $string ) ? 'true' : 'false' ) . "\n";
          print '[3] length() returns:  ' . length( $string ) . " (no bytes)\n";
          print '[3] length() returns:  ' . do { use bytes; length( $string ) } . " (use bytes)\n";
          open FH, "> $file" or die;
          print FH $string;
          close FH;
          print '[3] actual size is:    ' . ( -s $file ) . "\n";
      }

      {
          my $string = $string_2;
          utf8::encode( $string );
          print '[4] is_utf8() returns: ' . ( utf8::is_utf8( $string ) ? 'true' : 'false' ) . "\n";
          print '[4] length() returns:  ' . length( $string ) . " (no bytes)\n";
          print '[4] length() returns:  ' . do { use bytes; length( $string ) } . " (use bytes)\n";
          open FH, "> $file" or die;
          print FH $string;
          close FH;
          print '[4] actual size is:    ' . ( -s $file ) . "\n";
      }

      {
          my $string = $string_1;
          utf8::upgrade( $string );
          print '[5] is_utf8() returns: ' . ( utf8::is_utf8( $string ) ? 'true' : 'false' ) . "\n";
          print '[5] length() returns:  ' . length( $string ) . " (no bytes)\n";
          print '[5] length() returns:  ' . do { use bytes; length( $string ) } . " (use bytes)\n";
          open FH, "> $file" or die;
          print FH $string;
          close FH;
          print '[5] actual size is:    ' . ( -s $file ) . "\n";
      }

      # but ...

      {
          use bytes;
          my $string = $string_1;
          utf8::upgrade( $string );
          print '[6] is_utf8() returns: ' . ( utf8::is_utf8( $string ) ? 'true' : 'false' ) . "\n";
          print '[6] length() returns:  ' . length( $string ) . " (use bytes)\n";
          open FH, "> $file" or die;
          print FH $string;
          close FH;
          print '[6] actual size is:    ' . ( -s $file ) . "\n";
      }

      Output:

      joerg@Marvin:~> '/home/joerg/bytes_pragma.pl'
      [1] is_utf8() returns: false
      [1] length() returns:  1 (no bytes)
      [1] length() returns:  1 (use bytes)
      [1] actual size is:    1
      [2] is_utf8() returns: false
      [2] length() returns:  2 (no bytes)
      [2] length() returns:  2 (use bytes)
      [2] actual size is:    2
      [3] is_utf8() returns: true
      [3] length() returns:  1 (no bytes)
      [3] length() returns:  3 (use bytes)
      Wide character in print at /home/joerg/bytes_pragma.pl line 42.
      [3] actual size is:    3
      [4] is_utf8() returns: false
      [4] length() returns:  3 (no bytes)
      [4] length() returns:  3 (use bytes)
      [4] actual size is:    3
      [5] is_utf8() returns: true
      [5] length() returns:  1 (no bytes)
      [5] length() returns:  2 (use bytes)
      [5] actual size is:    1
      [6] is_utf8() returns: true
      [6] length() returns:  2 (use bytes)
      [6] actual size is:    2

      Update: Added a clarification.

        if its content cannot be reliably encoded as bytes

        HTTP can *only* send bytes, so your premise is flawed and your argument is moot.
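Since the body has to be bytes before it goes over the wire anyway, one way to get the header value is to encode explicitly and then take a plain length of the result. The following is only a sketch: it assumes the payload is a character string that should be sent as UTF-8, and the sample $xmldata content is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Hypothetical payload: a character string with one non-ASCII character.
my $xmldata = "<?xml version=\"1.0\"?><root>\x{C9}ric</root>";

# Encode explicitly to the bytes that will actually be sent.
my $body = encode( 'UTF-8', $xmldata );

# $body is a byte string, so plain length() is the byte count.
my $content_length = length($body);
print "Content-Length: $content_length\n";

# Changing the internal storage format of the source string
# does not change the result.
utf8::upgrade($xmldata);
my $same = length( encode( 'UTF-8', $xmldata ) );
print "Content-Length: $same\n";
```

Both lines report 39 for this payload, whichever internal format the source string happens to be in. No use bytes; is involved.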

        Looking at your results, the only case where use bytes; helped is the one where encode was needed and Perl told you encode was needed. Are you saying that use bytes; is equivalent to encode?