Re^3: Size and anatomy of an HTTP response

More precisely, length() always returns what it believes is the number of characters in the string. That belief depends on the value of the utf8 flag. The reply you linked to referred to a "unicode string", i.e. one with its utf8 flag set.

If the utf8 flag is set, Perl treats the string's bytes as UTF-8 octets and groups them into single characters as needed, so the byte count may or may not equal the character count. If the utf8 flag is NOT set, it counts plain bytes on the assumption that there is a one-to-one relationship between bytes and characters, so the byte count and the character count are the same. If your utf8 octets are all in the ASCII range you will never notice the difference, because the byte count equals the character count either way. But if for some reason you have a string full of utf8 octets and the utf8 flag is off (perhaps you opened a stream in raw mode but the file was filled with non-ASCII utf8 octets?), length will return the number of bytes, NOT the number of characters.
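
The raw-stream scenario can be sketched as follows. This is only an illustration: the sample string and the variable names are made up, and Encode's encode() stands in for "octets read from a raw file handle".

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical sample: three characters, two of them outside ASCII.
my $chars = "\x{160}a\x{20AC}";          # S-caron, 'a', euro sign

# What a raw (undecoded) stream would hand you: UTF-8 octets, utf8 flag off.
my $octets = encode('UTF-8', $chars);

# What you get once those octets are decoded into characters.
my $decoded = decode('UTF-8', $octets);

my $octet_len   = length($octets);       # counts bytes:      2 + 1 + 3
my $decoded_len = length($decoded);      # counts characters: 3

printf "raw octets: length=%d\n", $octet_len;    # raw octets: length=6
printf "decoded:    length=%d\n", $decoded_len;  # decoded:    length=3
```

The same six octets give two different answers depending on whether they were decoded first.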

Here is a quick example of the difference the flag makes. Nothing has changed in the content of $s. Only the utf8 bit has been changed, and presto, the length goes from 1 to 2.

use Encode;

my $s=chr(0x0160);
printf "chr=<%s> utf8-flag=%s length=%d\n"
  , $s, Encode::is_utf8($s)?'yes':'no', length($s);
#outputs: chr=<?> utf8-flag=yes length=1

Encode::_utf8_off($s);
printf "chr=<%s> utf8-flag=%s length=%d\n"
  , $s, Encode::is_utf8($s)?'yes':'no', length($s);
#outputs: chr=<?> utf8-flag=no length=2

Re^4: Size and anatomy of an HTTP response by Discipulus (Canon) on Dec 16, 2010 at 09:47 UTC
OK, I have learned a lot but, maybe I have a brick instead of a brain, I'm still not sure about the answer to my question 1). To count the bytes received, which of these is right?

A) I have to check the utf8 flag (with is_utf8) on the content of the body returned by LWP::ua -> HTTP::Request -> HTTP::Response (etc.) and, if it is not utf8, use bytes::length($response->content), or

B) I can assume that Perl, on receiving a string, assumes Latin-1 (and so may print some garbage on my monitor), that the char/byte correspondence is assured, and so I can simply use length($response->content)?

I'm not sure, but I think some HTML pages in the world are not utf8, right? Headers cannot be encoded, I hope, right?

Thanks to all posters.

Lor*
there are no rules, there are no thumbs..
Re^5: Size and anatomy of an HTTP response by Anonymous Monk on Dec 16, 2010 at 10:13 UTC
->content only deals in bytes (octets, 8 bits): it only takes bytes and only gives bytes, so length will always return the byte count. You can use bytes::length for peace of mind.
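
A minimal sketch of that claim, assuming HTTP::Response (from the HTTP-Message distribution) is installed. The response is built locally rather than fetched, and the body text and variable names are invented for the illustration. decoded_content(), another HTTP::Message method, is called only for contrast.

```perl
use strict;
use warnings;
use bytes ();                 # load bytes::length without enabling byte semantics
use Encode qw(encode);
use HTTP::Response;

# Hypothetical body: S-caron followed by "ok", sent as UTF-8 (4 octets, 3 characters).
my $body_octets = encode('UTF-8', "\x{160}ok");

# Build a response by hand so the sketch needs no network access.
my $res = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/plain; charset=UTF-8' ],
    $body_octets,
);

my $byte_count  = length($res->content);          # octets as received
my $bytes_check = bytes::length($res->content);   # same number, "for peace of mind"
my $char_count  = length($res->decoded_content);  # characters after decoding

printf "content: %d bytes (bytes::length says %d), decoded: %d characters\n",
    $byte_count, $bytes_check, $char_count;
# content: 4 bytes (bytes::length says 4), decoded: 3 characters
```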
Re^5: Size and anatomy of an HTTP response by afoken (Chancellor) on Dec 16, 2010 at 14:06 UTC
One link: bytes

Or, explained a little: there is a pragma / module named bytes that allows you to force Perl to use byte semantics for everything. It can be used in two ways:

1. `use bytes;` in a certain scope will force byte semantics for that scope. Similarly, `no bytes;` will disable byte semantics for a scope.

2. Use the functions implemented in `bytes` instead of the CORE functions, i.e. bytes::length() instead of length. Make sure not to accidentally enable byte semantics for your file by NOT importing anything from `bytes` (i.e. write `use bytes ();` or `require bytes;` instead of `use bytes;`). The functions in `bytes` are actually the CORE functions, called in wrapper functions with enforced byte semantics.

Note that Perl 5.12 warns not to use `bytes` except for debugging:

"This pragma reflects early attempts to incorporate Unicode into perl and has since been superseded. It breaks encapsulation (i.e. it exposes the innards of how the perl executable currently happens to store a string), and use of this module for anything other than debugging purposes is strongly discouraged. If you feel that the functions here within might be useful for your application, this possibly indicates a mismatch between your mental model of Perl Unicode and the current reality. In that case, you may wish to read some of the perl Unicode documentation: perluniintro, perlunitut, perlunifaq and perlunicode."

I think you have exactly that mismatch problem here.

All data you receive from outside your script comes as a stream of bytes. As long as you do not decode those bytes (either manually, or inside a library, or by using a PerlIO layer), but instead just stuff them unmodified into a string, perl will not treat those bytes any differently than it did before Unicode. Perl treats each byte as a single character, and `length()` will return the number of characters, which is equal to the number of bytes.

When you decode those bytes, e.g. from UTF-8 or UTF-16, into Perl's internal character representation, `length()` will still return the number of characters. But due to the decoding, it may be different from the number of bytes that were used to store the encoded string outside Perl.

Behind the scenes, Perl has two different ways to store strings. The ill-named UTF8 flag switches between the two ways. In "classic mode", the UTF8 flag is off and each byte represents a single character, like in ancient perls. In "Unicode mode", the UTF8 flag is on and a character may spread over several bytes. As far as I know, the string is currently stored in some kind of "relaxed" or "extended" UTF-8 encoding, hence the name of the flag. But it does not and should not matter. You should not be interested in the way perl stores characters in memory. The next release could start storing characters encoded as UTF-32 or a hypothetical UTF-64 and you should see absolutely no difference from inside perl. Unless, of course, you start flipping the UTF8 bit without changing the actual in-memory encoding. See Encode.

If you want to know how many bytes a string occupies in a certain encoding, you should use the Encode module to convert that string into a byte stream with that encoding, and get its length.

For the special case of HTTP::Request / HTTP::Response, both inherit from HTTP::Message, which treats the content as a string of bytes. So `length($msg->content())` will always (*) return the number of bytes. `HTTP::Message` also has a `decoded_content()` method that returns a string of characters, which may or may not have the UTF8 flag set. `length($msg->decoded_content(...))` will always return the number of characters, given a decodable content. To test if the content is decodable, call the `decodable()` method.

(*) "always" is not quite correct: You can replace the content with its decoded version by calling `$msg->decode()`; after that, `length($msg->content())` returns the number of characters. You can also undo that, with `$msg->encode($encoding)`.

Alexander
--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
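
The advice about using Encode to measure a string in a given encoding might look like the sketch below. The sample word and the choice of encodings are arbitrary, picked only to show that the answer depends on the encoding, not on how perl stores the string.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical sample: 5 characters, the first one (S-caron) outside ASCII.
my $text = "\x{160}koda";

# Encode into each target encoding and count the resulting octets.
my %bytes_in;
for my $encoding ('UTF-8', 'UTF-16LE', 'ISO-8859-2') {
    $bytes_in{$encoding} = length(encode($encoding, $text));
}

printf "characters: %d\n", length($text);                 # characters: 5
printf "%-10s %d bytes\n", $_, $bytes_in{$_}
    for sort keys %bytes_in;
# ISO-8859-2 5 bytes
# UTF-16LE   10 bytes
# UTF-8      6 bytes
```

None of this touches the UTF8 flag or the `bytes` pragma: the byte count is always that of an explicitly encoded copy.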
Re^5: Size and anatomy of an HTTP response by ELISHEVA (Prior) on Dec 16, 2010 at 10:34 UTC
Regardless of the conventions of your tools, you can be assured of always getting the byte count, and only the byte count, of the string as Perl currently stores it by turning the utf8 flag off. So if you are uncertain:

# copy so we don't muck with the utf8 flag on the original string
my $sTmp = $someData;
Encode::_utf8_off($sTmp);
my $iLength = length($sTmp);

Or, for future use, you could just wrap this up in a sub:

sub countBytes {
  my $s = $_[0];   # makes a copy
  Encode::_utf8_off($s);
  return length($s);
}

# Or, to save memory by avoiding a copy for loooong strings
# (an alternative definition: use one version or the other).
# BUT note: may not be a good idea if the string is shared by multiple
# threads, since this is not atomic and another thread could grab
# control while the utf8 bit is temporarily off.
# The copy approach is more stable and thread friendly.
sub countBytes {
  my $bUtf = Encode::is_utf8($_[0]);
  Encode::_utf8_off($_[0]);
  my $i = length($_[0]);
  Encode::_utf8_on($_[0]) if $bUtf;
  return $i;
}

# calc bytes before printf to show the flag is indeed preserved
my $s = chr(0x160);
my $iBytes = countBytes($s);
printf "chr=<%s> utf8-flag=%s length=%d bytes=%s\n"
  , $s, Encode::is_utf8($s)?'yes':'no', length($s), $iBytes;
#outputs: chr=<?> utf8-flag=yes length=1 bytes=2

Best of luck with your project.

Update: added memory friendly, thread unfriendly version of countBytes()
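
One thing worth keeping in mind with the flag-flipping approach: it reports the size of perl's internal storage, which can change without the string's characters changing. The sketch below is only an illustration. count_internal_bytes is a hypothetical name for the copy-based counter from the post above, and utf8::upgrade is used to force the internal UTF-8 form.

```perl
use strict;
use warnings;
use Encode ();

# Copy-based counter: switch the utf8 flag off on a copy, then take length.
sub count_internal_bytes {
    my $copy = $_[0];
    Encode::_utf8_off($copy);
    return length($copy);
}

my $s = chr(0xE9);                       # e-acute: one character, stored as one byte

my $chars_before = length($s);                 # 1
my $bytes_before = count_internal_bytes($s);   # 1

utf8::upgrade($s);                       # same character, now stored as two UTF-8 octets

my $chars_after = length($s);                  # still 1
my $bytes_after = count_internal_bytes($s);    # now 2

printf "before: chars=%d bytes=%d\n", $chars_before, $bytes_before;  # before: chars=1 bytes=1
printf "after:  chars=%d bytes=%d\n", $chars_after,  $bytes_after;   # after:  chars=1 bytes=2
```

For undecoded data such as an HTTP body the two counts coincide, which is the case the thread is about.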
Re^6: Size and anatomy of an HTTP response by Discipulus (Canon) on Dec 16, 2010 at 12:49 UTC
OK, many many thanks for your patience with me. For future readers, I warmly encourage reading The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

EDIT: I also found http://perlgeek.de/en/article/encodings-and-unicode

Lor*

EDIT2: also read a new amazing topic about this at Simplest Possible Way To Disable Unicode

EDIT3: also read another topic about encoding: Comparing Unicode Greek Characters/Code Points

there are no rules, there are no thumbs..