Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I need to extract one or more bytes from a string which may contain text or arbitrary binary data. How can I go about doing it? substr() looked good at first, but the perldoc says it works on characters, and I want bytes.

Any help is appreciated.

  • Comment on How do I safely, portably extract one or more bytes from a string?

Replies are listed 'Best First'.
Re: How do I safely, portably extract one or more bytes from a string?
by Aristotle (Chancellor) on Nov 29, 2003 at 00:31 UTC
    unpack would be an alternative.
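A minimal sketch of the unpack approach, assuming the data is already a byte string (no character above 255). The helper name extract_bytes is made up for illustration; it is not from the original reply.

```perl
use strict;
use warnings;

# Hypothetical helper: return $len bytes starting at byte $offset.
# Assumes $data is a byte string (every character value <= 255).
sub extract_bytes {
    my ($data, $offset, $len) = @_;
    # "x$offset" skips that many bytes; "a$len" returns the next $len bytes verbatim.
    return unpack("x$offset a$len", $data);
}

my $chunk = extract_bytes("\x00\x01\x02\x03\x04", 1, 3);
printf "%vd\n", $chunk;   # 1.2.3
```

The "a" template is used rather than "A" because "A" strips trailing spaces and NULs, which would corrupt binary data.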

    Makeshifts last the longest.

Re: How do I safely, portably extract one or more bytes from a string?
by jweed (Chaplain) on Nov 29, 2003 at 00:27 UTC
    Check out the bytes pragma.
    Update:

    With the previous solution you can then use substr normally, but it would return by byte instead of by character.

    Or, perhaps more clearly, you could use the vec function with a third argument of 8. This has the advantage of being an lvalue (the bytes::substr() function isn't, unfortunately).
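A short sketch of the vec suggestion, assuming a plain byte string. With a third argument of 8, each element is one byte, and the result is the byte's integer value rather than a one-character string.

```perl
use strict;
use warnings;

my $data = "ABC";                 # assumed to be a plain byte string

# Read: with 8 bits per element, element 1 is the byte at offset 1.
my $byte = vec($data, 1, 8);      # integer value, 66 for "B"

# Write: vec is an lvalue, so a byte can be replaced in place.
vec($data, 1, 8) = ord("Z");

print "$byte $data\n";            # 66 AZC
```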



    Who is Kayser Söze?

      Just to demo "use bytes" as jweed mentioned.

      use strict;
      my $a = chr(4000);
      {
          use bytes;
          print substr($a, 0, 1), "\n";
      }
      {
          print substr($a, 0, 1), "\n";
      }

      You will see a warning saying "Wide character in print". Don't worry about that; it is what is supposed to happen.

Re: How do I safely, portably extract one or more bytes from a string?
by Beechbone (Friar) on Nov 29, 2003 at 03:38 UTC
    Ok, you've got answers on how to extract bytes from a (UTF-8) string. Good.

    But I think the real question should be: Why don't you know what is in that string? If it can contain binary data and UTF-8, then you should know how it comes into your program. And that is the point you have to work on.

    For example: If you read the string from a file, you should know if it is a binary file (open it :bytes then) or a UTF-8 file (open it :utf8) or an EBCDIC file or Big5...

    Or is it binary with embedded UTF-8 strings? Then open it :bytes and convert the extracted text pieces.

    Note: Or aren't you working with UTF-8 at all? Then characters are bytes for you. No need to worry.
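A sketch of the "open it as binary" advice, assuming the data lives in a file. The helper read_bytes and the temp-file demo data are invented for illustration, and the ':raw' layer is used here to get plain bytes.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: read $len bytes at byte $offset from a file.
sub read_bytes {
    my ($path, $offset, $len) = @_;
    open(my $fh, '<:raw', $path) or die "open $path: $!";
    seek($fh, $offset, 0)        or die "seek: $!";
    my $got = read($fh, my $buf, $len);
    die "read: $!" unless defined $got;
    close $fh;
    return $buf;
}

# Demo data: a temp file holding the UTF-8 bytes for "à" (0xC3 0xA0) plus "xyz".
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "\xC3\xA0xyz";
close $out;

my $bytes = read_bytes($path, 1, 2);
printf "%vX\n", $bytes;           # A0.78
```

Because the handle is raw, offset 1 lands in the middle of the two-byte "à" sequence, which is exactly what a byte-oriented reader wants.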


    Search, Ask, Know

      Well, that's the thing, the string is filled from some external input source and I have no idea what type of data is in it. All I know is how many bytes it contains and how many I need to extract from it. Most likely it will contain arbitrary binary data like a JPEG or tarball, however, it may also contain 8 bit ISO Latin-1 and even Unicode, which means substr() won't work right and will end up extracting entire multi-byte characters instead of just bytes.

        ...I have no idea what type of data is in it...

        Then how do you think Perl can know (conclusively) what encoding it is in? You need to tell Perl when opening a file (either directly or implicitly with the open pragma) that the contents of that file have a special encoding. From perldoc -f open:

        You may use the three-argument form of open to specify IO "layers" (sometimes also referred to as "disciplines") to be applied to the handle that affect how the input and output are processed (see open and PerlIO for more details). For example

            open(FH, "<:utf8", "file")

        will open the UTF-8 encoded file containing Unicode characters, see perluniintro. (Note that if layers are specified in the three-arg form then default layers set by the "open" pragma are ignored.)

        Only then will Perl do conversions to turn it into UTF-8 internally and mark the data internally as UTF-8 (which will cause character semantics to be applied to that data, rather than byte semantics). If you don't tell Perl which encoding the file should be read with, Perl will assume bytes and substr() will work as expected.
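To illustrate the difference a layer makes, here is a small sketch that reads the same two bytes with and without decoding. The helper slurp_with_layer and the temp file are assumptions made for the demo, and ':encoding(UTF-8)' is used as the decoding layer.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write the two bytes 0xC3 0xA0 (UTF-8 for "à") to a temp file.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "\xC3\xA0";
close $out;

# Hypothetical helper: slurp a file through the given layer.
sub slurp_with_layer {
    my ($file, $layer) = @_;
    open(my $fh, "<$layer", $file) or die "open: $!";
    local $/;                      # slurp mode
    my $content = <$fh>;
    close $fh;
    return $content;
}

my $raw     = slurp_with_layer($path, ':raw');
my $decoded = slurp_with_layer($path, ':encoding(UTF-8)');

print length($raw), " ", length($decoded), "\n";                        # 2 1
print ord(substr($raw, 0, 1)), " ", ord(substr($decoded, 0, 1)), "\n";  # 195 224
```

Read raw, substr() hands back individual bytes; read through the decoding layer, the same call hands back whole characters.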

        Hope this helps.

        Liz

Re: How do I safely, portably extract one or more bytes from a string?
by davido (Cardinal) on Nov 29, 2003 at 02:40 UTC
    This may not work as portably as you want, but if it does, it's definitely the obscure way:

    use strict;
    use warnings;
    require 5.8.0;

    my $string = "......bytes....";
    local $/ = \1;
    open FH, "<", \$string or die $!;
    while (my $byte = <FH>) {
        # do your stuff with $byte
    }
    close FH;

    It relies on the fact that setting the $/ input separator to a numeric value reads in that number of bytes. It also relies on the Perl 5.8.0 or later "In-memory file" open, where you can essentially open a scalar instead of a file. You then read the scalar in byte by byte.

    I wouldn't recommend it for much, but it's an interesting exercise.

    Update: Thanks Anonymous Monk for catching the glitch. I knew I was forgetting something. My original code read: local $/=1;. I've now corrected my snippet.

    Update 2: After some testing and re-reading the appropriate documentation, it appears that this method will work, as long as you're using it on Perl 5.8.0 or later.
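The same idea works for chunks larger than one byte. A rough sketch, assuming the input is already a byte string; the helper name byte_records is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: split a byte string into records of at most $size bytes.
sub byte_records {
    my ($data, $size) = @_;
    local $/ = \$size;             # reference to an integer => fixed-size records
    open(my $fh, '<', \$data) or die "open: $!";
    my @records = <$fh>;           # list context: read every record
    close $fh;
    return @records;
}

my @parts = byte_records("\x01\x02\x03\x04\x05", 2);
print scalar(@parts), "\n";        # 3
```

The last record is simply shorter when the length is not a multiple of the record size.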


    Dave


    "If I had my life to live over again, I'd be a plumber." -- Albert Einstein
      It relies on the fact that setting the $/ input separator to a numeric value reads in that number of bytes.

      You need to set $/ to a reference to a number: $/=\1;. The example you gave sets the record separator to "1", which isn't quite the same thing :-)

      That won't work. If the string contains byte sequences that look like unicode characters, then reading 1 character will return multiple bytes, just as it would if you were reading from a unicode file.

      I tried almost exactly the same code as AnonyMonk, but got different results... leastwise I did last night! Today, I'm getting different results? I guess I just saw what I was expecting to see :(


      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "Think for yourself!" - Abigail

        I'm pretty sure that using a reference to an integer as the record separator is strictly a byte oriented operation. At least the following still reads one byte at a time (though length reports 1 character as expected):

        my $string = chr(400);
        print length($string), "\n";
        local $/ = \1;
        open FH, "<", \$string or die $!;
        while (my $byte = <FH>) {
            print "<$byte>\n";
        }
        close FH;

        I suspected that might be the case, but couldn't find the relevant documentation. perlvar states that setting $/ to a reference to an integer will cause file reads to read in no more than that number of bytes per iteration.

        I've re-scanned over: perlopentut, perllocale, perlport, perlunicode, and perluniintro. I know you're probably right, and that it's probably in there somewhere.

        So I guess what I'm saying is, which POD have I missed that discusses the effects of locales on the behavior of local $/ = \$integer; ?


        Dave


        "If I had my life to live over again, I'd be a plumber." -- Albert Einstein
Re: How do I safely, portably extract one or more bytes from a string?
by thospel (Hermit) on Nov 29, 2003 at 04:57 UTC
    All characters can be viewed as binary data too, so there isn't really a difference between these two. A possible problem is that with recent perls there may be a difference between characters and bytes. That's not because bytes are somehow more binary than characters, but because the number of representable characters has been extended beyond what can be encoded in a byte.

    Several people suggested use bytes. However, this is usually a bad idea, since it makes the result of your operations depend on whether the string is coded internally with utf8 or not, a representation detail you should most of the time not care about. Observe:

    #!/usr/bin/perl -lw
    $a="à";   # A high latin1 character, doesn't even need unicode
    print '$a Normal substr: ', ord(substr($a,0,1));
    {
        use bytes;
        print '$a Bytes substr: ', ord(substr($a,0,1));
    }
    $b = $a . chr(256);
    chop $b;
    print '$a equals $b, but $b is internally in UTF8' if $a eq $b;
    print '$b Normal substr: ', ord(substr($b,0,1));
    {
        use bytes;
        print '$b Bytes substr: ', ord(substr($b,0,1));
    }

    Giving:

    $a Normal substr: 224
    $a Bytes substr: 224
    $a equals $b, but $b is internally in UTF8
    $b Normal substr: 224
    $b Bytes substr: 195

    Be suspicious of code that uses use bytes.

    The real question is "where does your string come from". That will determine the proper answer.

    If you were talking about a string coming from most old-fashioned sources, e.g. read from a file (opened in binmode, or a plain open if there isn't a utf8 default), there is simply no difference between bytes and characters, and you can simply use substr(), which will work independent of the internal representation of the string.

    If you indeed were talking about a string for example read from a file opened for utf8 and you want the byte at a certain offset in the sequence of raw bytes representing that string, maybe you should in fact have opened the file binary instead...

    If the string is something else that could be unicode (e.g. coming from some unicode aware subroutine), and you want the byte at a certain offset in the utf8 representation of the string, the cleanest way is probably to use encode to get the octet string:

    use Encode; $octets = encode("utf8", $string);
    Instead you could also first "upgrade" the string using utf8::upgrade to make sure the internal representation is UTF8, after which "use bytes" will give predictable results. But I think that's rather hacky.
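A sketch of the encode route, showing that the result does not depend on how Perl happens to store the string. The helper name utf8_byte_at is invented for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: byte at $offset in the UTF-8 encoding of $string.
sub utf8_byte_at {
    my ($string, $offset) = @_;
    my $octets = encode('UTF-8', $string);   # always a byte string
    return ord(substr($octets, $offset, 1));
}

my $plain    = "\xE0";                # "à", stored internally as one byte
my $upgraded = $plain . chr(256);     # forces the internal UTF-8 form
chop $upgraded;                       # same logical string as $plain

print utf8_byte_at($plain, 0), " ", utf8_byte_at($upgraded, 0), "\n";   # 195 195
```

Both strings give the same answer because encode works on the logical characters, not on the internal buffer.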

    And finally there is the possibility that the string is indeed one with unicode characters but that you should stop thinking in terms of bytes and just get the n-th character which corresponds to a certain codepoint whose value can now simply exceed 255. In which case plain substr() is what you want again.

    You can always view a string as a sequence of integers, where these integers can represent certain characters. Recent perls just allow some of the integers to be 256 or greater. And UTF8 is just a way (and NOT the only way) to encode this sequence of integers using a classic sequence of bytes. The byte sequence however isn't necessarily the same as the integer sequence (though it can be if all integers are small enough), which is in the end why the proper answer to your question depends on in which of these two sequences you want the element at a given offset.
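The two sequences can be printed side by side. A minimal sketch, using UTF-8 as the example encoding:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $string = "a" . chr(256);          # two characters: U+0061 and U+0100

# Sequence 1: the characters as integers (code points).
my @code_points = map { ord } split //, $string;

# Sequence 2: the bytes of one possible encoding, here UTF-8.
my @utf8_bytes = unpack 'C*', encode('UTF-8', $string);

print "@code_points\n";               # 97 256
print "@utf8_bytes\n";                # 97 196 128
```

Offset 1 means the integer 256 in the first sequence and the byte 196 in the second, which is why the question "which byte?" needs the sequence named first.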

      The problem is that you mixed up byte context with character context. Just by adding one line, your demo code can easily be fixed to show the right result:

      #!/usr/bin/perl -lw
      $a="à";   # A high latin1 character, doesn't even need unicode
      print '$a Normal substr: ', ord(substr($a,0,1));
      {
          use bytes;
          print '$a Bytes substr: ', ord(substr($a,0,1));
      }
      {
          use bytes;   # I added this
          $b = $a . chr(256);
      }
      chop $b;
      print '$a equals $b, but $b is internally in UTF8' if $a eq $b;
      print '$b Normal substr: ', ord(substr($b,0,1));
      {
          use bytes;
          print '$b Bytes substr: ', ord(substr($b,0,1));
      }

      This gives:

      $a Normal substr: 224
      $a Bytes substr: 224
      $a equals $b, but $b is internally in UTF8
      $b Normal substr: 224
      $b Bytes substr: 224

      Update after reading thospel's reply:

      thospel, my point is not to argue with you about the encoding or representation. The point is that you tried to use your demo to disprove "use bytes", but it actually did the opposite and proved that "use bytes" is all right. In your case, the first bytes of $a and $b are different, and Perl did print different ord values, so it proved that "use bytes" is just fine.

      All the OP asked is how to safely get the first byte, and "use bytes" is one of the correct ways to do it. I just don't get how your big lesson on encoding is related to the original question. Reading the original post, the author does not sound to me like someone who has no idea about all the encoding stuff; my feeling is that he knows quite a lot, otherwise he would not even have asked the right question.

      Your demo on "use bytes" simply cannot be used to disprove "use bytes", and is misleading in general.

        No, I didn't mess it up, I demonstrated exactly what I wanted to demonstrate: that the same character has two different possible internal representations in perl (notice they compare eq for perl). And that these two representations give a different result for substr() under use bytes.

        You can for example use Dump() from Devel::Peek to see that internally they are different of course. But code shouldn't depend on how the string happens to be encoded internally if it can be avoided.
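Dump() writes its report to STDERR. As a lighter-weight sketch of the same check, the following uses utf8::is_utf8 and bytes::length instead (a substitution made here for illustration, not what the post itself used) to show that two equal strings can differ internally.

```perl
use strict;
use warnings;
use bytes ();                         # load bytes::length without enabling byte semantics

my $plain    = "\xE0";                # one character, stored as one byte
my $upgraded = $plain . chr(256);     # appending a wide character upgrades the string
chop $upgraded;                       # logically identical to $plain again

my $same          = ($plain eq $upgraded)    ? 1 : 0;
my $flag_plain    = utf8::is_utf8($plain)    ? 1 : 0;
my $flag_upgraded = utf8::is_utf8($upgraded) ? 1 : 0;

printf "same=%d flags=%d,%d internal lengths=%d,%d\n",
    $same, $flag_plain, $flag_upgraded,
    bytes::length($plain), bytes::length($upgraded);
# same=1 flags=0,1 internal lengths=1,2
```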

        Your example however leaves both $a and $b with the same internal representation (non-utf8), so of course they print the same. It also isn't related to my point anymore. Notice that the "but $b is internally in UTF8" isn't actually true for your code.

        update after reading pg's reply

        I wasn't trying to "disprove" use bytes; it obviously does what it is supposed to do. I was, however, trying to show that it makes the result depend on the internal representation. Our disagreement is first about whether that's a good idea and second about whether that's what the OP wanted. You obviously think he wants the n-th byte of the internal representation of the string, while I assumed that if he's talking about unicode (which I'm still not sure of), he'd want the n-th byte of the UTF8 representation of the logical string.