in reply to Size and anatomy of an HTTP response

Some of the confusion may be due to history. Prior to Perl 5.8, strings were simply bytes, so length could only return a byte count. Support for character encodings was introduced in 5.8 (so says the Encode documentation - I'm not at all an encoding guru, but your question got me curious).

From what I understand of that document, if the string is marked as utf8 (a bit set in the C guts of Perl), its length will be counted in characters, because Perl knows to check whether each byte is a complete or partial character. Otherwise its length is counted in bytes. You can see the flag value using is_utf8. The flag is normally set automatically according to your input stream's encoding layer when you read in characters, but if you aren't sure about the history of the string you can use that function to check its status. For more information, see the section on messing with Perl's internals in Encode.
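For instance, here is a minimal sketch of that flag check, using only the core Encode module (the sample octets are just the UTF-8 encoding of U+0160):

    use Encode qw(decode is_utf8);

    my $bytes = "\xC5\xA0";              # two raw octets: the UTF-8 encoding of U+0160
    print is_utf8($bytes) ? "utf8\n" : "raw\n";   # raw - flag off, length($bytes) is 2

    my $chars = decode('UTF-8', $bytes); # decoding the octets sets the utf8 flag
    print is_utf8($chars) ? "utf8\n" : "raw\n";   # utf8 - flag on, length($chars) is 1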

There are also functions for explicitly selecting whether your string is treated as raw bytes or as utf8 octets, and for choosing the rules for converting back and forth between raw bytes and utf8 - see the same document for encode, decode and from_to.
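A short sketch of those three calls (the encoding names and sample strings are only examples):

    use Encode qw(encode decode from_to);

    # decode: raw octets -> Perl characters (utf8 flag on)
    my $chars  = decode('UTF-8', "\xC5\xA0");        # one character, U+0160

    # encode: Perl characters -> octets in the chosen encoding
    my $latin1 = encode('ISO-8859-1', "caf\x{e9}");  # four bytes, the e-acute becomes 0xE9

    # from_to: convert a byte string in place between two encodings
    my $buf = "\xE9";                                # e-acute in Latin-1
    from_to($buf, 'ISO-8859-1', 'UTF-8');            # $buf is now "\xC3\xA9"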

Update: added more information about controlling the utf8 status.

Re^2: Size and anatomy of an HTTP response
by afoken (Chancellor) on Dec 15, 2010 at 13:52 UTC

      More precisely, length() always returns what it thinks is the number of characters in the string. This "thinking" relies on the value of the utf8 flag. The reply you linked to referred to a "unicode string", i.e. one with its utf8 flag set.

      If the utf8 flag is set, length() treats the bytes as utf8 octets and glues multi-byte sequences together into single characters as needed, so the byte count and the character count may or may not be equal. If the utf8 flag is NOT set, then it counts pure bytes on the assumption that there is a one-to-one relationship between bytes and characters, so there is no difference between the byte count and the character count. If your utf8 octets are all in the ASCII range you will never notice the difference, and the byte count will equal the character count; but if for some reason you have a string full of utf8 octets and the utf8 flag gets switched off (perhaps you opened a stream in raw mode but the file was filled with non-ASCII utf8 octets?), length will return the number of bytes, NOT the number of characters.

      Here is a quick example of the difference a flag makes. Nothing has changed in the content of $s. Only the utf8 bit has been changed, and presto the length goes from 1 to 2.

      use Encode;
      my $s = chr(0x0160);
      printf "chr=<%s> utf8-flag=%s length=%d\n",
          $s, Encode::is_utf8($s) ? 'yes' : 'no', length($s);
      # outputs: chr=<?> utf8-flag=yes length=1
      Encode::_utf8_off($s);
      printf "chr=<%s> utf8-flag=%s length=%d\n",
          $s, Encode::is_utf8($s) ? 'yes' : 'no', length($s);
      # outputs: chr=<?> utf8-flag=no length=2

        Ok, I have learned a lot, but (maybe I have a brick instead of a brain) I'm still not sure about the answer to my question 1). To count the bytes received:
          A) do I have to check the utf8 flag (with is_utf8) against the content of the body returned by LWP::UserAgent -> HTTP::Request -> HTTP::Response (etc.), and if it is not utf8 use bytes::length($response->content), or
          B) can I assume that Perl, receiving a string, assumes Latin-1 (and so maybe will print some garbage on my monitor), so that the one-to-one correspondence between chars and bytes is assured and I can normally use length($response->content)?? (See the sketch below.)
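          For reference, a minimal sketch of option A, assuming an HTTP::Response object in $response (the variable name is just an example); it counts the octets of the body whether or not the utf8 flag happens to be set:

              use Encode qw(is_utf8 encode_utf8);

              my $body = $response->content;   # body as returned by LWP

              my $byte_count;
              if (is_utf8($body)) {
                  # flagged as utf8: re-encode to count the underlying octets
                  $byte_count = length(encode_utf8($body));
              }
              else {
                  # flag off: length() already counts bytes
                  $byte_count = length($body);
              }

          In practice $response->content should already hold raw octets (it is decoded_content that may carry decoded characters), so the flag check is mostly defensive.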


        I'm not sure, but I think some HTML pages in the world are not utf8, right? Headers cannot be encoded, I hope, right??

        Thanks to all posters

        Lor*
        there are no rules, there are no thumbs..