in reply to Re^10: Determining content-length for an HTTP Post
in thread Determining content-length for an HTTP Post

Given a string of bytes,
...
length with use bytes; doesn't always give the number of bytes.

You are spreading (dangerous) FUD, no offence meant (really!). The bytes pragma works as advertised. The problem with your example is that you are fiddling with the internal UTF-8 flag. The perlunicode document clearly calls utf8::upgrade() a "low-level" function. The right way to UTF-8-encode a string is utf8::encode() or, maybe even better, the Encode module. See:

joerg@Marvin:~> perl -E'
    my $buf = "";
    {
        open my $fh, ">", \$buf;
        utf8::encode( my $all_256_bytes = join "", map chr, 0..255 );
        say length $all_256_bytes;
        say do { use bytes; length $all_256_bytes };
        print $fh $all_256_bytes;
    }
    say length($buf);
'
384
384
384
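For comparison, here is a minimal sketch of the Encode route mentioned above. It assumes the body is a character string that should go out as UTF-8. The variable names ($chars, $octets, $content_length) are made up for illustration and are not taken from the OP's code.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Character string holding code points 0..255 (illustrative data only).
my $chars = join '', map { chr } 0 .. 255;

# Encode explicitly; $octets now holds exactly the bytes to be sent.
my $octets = encode('UTF-8', $chars);

# The length of an encoded octet string is its byte count, no pragma needed.
my $content_length = length $octets;

# Write the octets to an in-memory handle to check what print emits.
my $buf = '';
open my $fh, '>', \$buf or die "open failed: $!";
binmode $fh;
print {$fh} $octets;
close $fh;

printf "Content-Length: %d, bytes written: %d\n", $content_length, length($buf);
```

Since the string is encoded before it is measured, the header value and the printed output come from the same octets, whatever Perl's internal storage happens to be.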

Btw, I have already demonstrated this in my previous post. You might want to reread it; example [6] is especially interesting.

Therefore, the only (quick) advice I could give him was to make sure that length() treats $xmldata as a series of bytes.
use bytes; does no such thing.

This is taken right from the "bytes" documentation:

"Perl normally assumes character semantics in the presence of character data (i.e. data that has come from a source that has been marked as being of a particular character encoding). When use bytes is in effect, the encoding is temporarily ignored, and each string is treated as a series of bytes."

Maybe you should rethink your statement.

We want the number of bytes in the string.

No, we (only) need to know how many bytes are going to be sent. Please reread my previous post.

Something can be wrong and still work. Bad code sometimes works.

Agreed. But, given the limited amount of information we have, my suggestion is still the best first step to take in solving the OP's problem.

Peace.

Re^12: Determining content-length for an HTTP Post
by ikegami (Patriarch) on Nov 27, 2009 at 16:22 UTC

    You are spreading (dangerous) FUD, no offence meant (really!). The bytes pragma works as advertised.

    I agree, and it's not the right thing. It doesn't give the number of bytes print will print. Everything else is moot.

    The problem with your example is that you are fiddling with the internal UTF-8 flag.

    Yes. Perl is free to do that whenever it wants. You can't count on it being in either state. If your solution requires it being off, you've got a bug.

    The call to utf8::upgrade represents Perl changing the internal format of its string for whatever reason. I wouldn't actually use utf8::upgrade in that fashion, so your talk of it being a low-level function is moot. I could write the example without using utf8::upgrade, but I wanted to keep things simple.

    Note that using use bytes; requires the use of those low-level functions, so it's low-level too.

    The right way to UTF-8-encode a string is utf8::encode()

    The program was supposed to output a sequence of 256 characters. You had to break it to make use bytes; work.

    You might want to reread it; example [6] is especially interesting.

    Indeed. It's the one where Perl told you that you had a bug, and you chose to fix the symptoms instead of the bug. It's also the only case where use bytes; helped. Are you saying that use bytes; only helps when Perl tells you you've made an error?

    But, given the limited amount of information we have, my suggestion is still the best first step to take in solving the OP's problem.

    Do you turn off use strict; and use warnings; when they start issuing messages? Solving the symptom (if not making things worse) is not the right first step.

      Do you turn off use strict; and use warnings; when they start issuing messages?

      Actually, I was hoping for a "Wide character in print ..." warning (and a correct Content-Length header). That would have solved the whole mystery ...

      But enough of that. The funny thing is, I'm currently writing a web framework and one of the design decisions I have not made yet is whether to allow a character string as the response body or not, because of this whole bytes / print issue. So I would really like to see a real world example where Perl gets the length in bytes of a string wrong when compared to the number of bytes actually printed, assuming no PerlIO layer and no direct manipulation of the string's UTF-8 flag.

        This is only true if you output utf8 and your Perl is built to use utf8 as its internal Unicode representation. If either of the two changes, { use bytes; length } will not give the correct results anymore.
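A minimal sketch of such a mismatch, with no PerlIO layer and no explicit call that touches the UTF-8 flag. It assumes a Perl that stores upgraded strings as UTF-8 internally. The variable names are hypothetical, and the concatenate-then-chop step only stands in for any ordinary operation that leaves a string in upgraded storage.

```perl
use strict;
use warnings;

my $body = "\xE9";               # one character, code point 233
my $wide = $body . "\x{263A}";   # joining with a wide character upgrades the storage
chop $wide;                      # wide character removed; storage stays upgraded

# Same one-character string as $body, but stored differently.
my $chars    = length $wide;
my $internal = do { use bytes; length $wide };

# Print to an in-memory handle with no encoding layer and count the result.
my $buf = '';
open my $fh, '>', \$buf or die "open failed: $!";
print {$fh} $wide;
close $fh;
my $printed = length $buf;

print "chars=$chars internal=$internal printed=$printed\n";
```

In this sketch the string is one character long and prints as one byte, while { use bytes; length } reports two, because it measures the internal storage and not what print emits.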