stevieb has asked for the wisdom of the Perl Monks concerning the following question:

Hey all,

I ran into a situation recently where I was reading from an I2C device, and the byte order was the opposite of what I was expecting. Although I fixed the problem by reversing the bytes before returning them, I went off to research exactly what endianness is. I've spent numerous hours testing, reading and testing some more, but I still can't grasp it fully. I'm hoping just one or two more examples will make it 'click'.

So, I'll start off with a couple of examples here to see if I have the basics down. Please feel free to elaborate with other examples or comments etc.

Set up our number, and two byte scalars (full version copy/pastable at the bottom of the post):

    use warnings;
    use strict;
    use feature 'say';

    my $num = 1023; # 0x03ff
    my ($b1, $b2);

Now, if I do the following bit shifting, the printf() is printing the bytes in big endian format, correct?

    $b1 = ($num & 0xff00) >> 8;
    $b2 = $num & 0xff;
    printf("%x, %x\n", $b1, $b2); # 3, ff

Likewise, if I reverse the operations/bytes, this one will print in little endian format, right?

    $b1 = $num & 0xff;
    $b2 = ($num & 0xff00) >> 8;
    printf("%x, %x\n", $b1, $b2); # ff, 3

Full code:

    use warnings;
    use strict;
    use feature 'say';

    my $num = 1023;
    my ($b1, $b2);

    $b1 = ($num & 0xff00) >> 8;
    $b2 = $num & 0xff;
    printf("%x, %x\n", $b1, $b2);

    $b1 = $num & 0xff;
    $b2 = ($num & 0xff00) >> 8;
    printf("%x, %x\n", $b1, $b2);

Re: Understanding endianness of a number
by haukex (Archbishop) on Jul 23, 2017 at 19:25 UTC

    In many cases, endianness is transparent to the programmer, except when explicitly converting between numbers and bytes or when you need to look at the memory directly. In that way, it's kind of like a character encoding: in your program, you work mostly with strings and characters, usually not caring how they're stored internally, and only when converting to and from streams of bytes does it become important how those more abstract notions of characters are represented as bytes (e.g. UTF-8 vs. UTF-16 vs. many more). The same way, in a Perl program, you can say my $x = 48879;, and you don't have to care how that number gets represented in memory, until you have to think about how to read or write it to a binary file or send/receive it over a data link as a series of bytes. In both cases, there are two levels of thinking here, the "more abstract" notion of numbers/characters, versus the machine-level bytes, and explicit conversion is needed between the two. The conceptual issues arise because this conversion is many times implicit instead of explicit, and so programmers don't often have to think about it.
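    To make the "looking at memory directly" case concrete, here is a minimal sketch that peeks at how the machine running Perl lays out a 16-bit value. The helper name host_byte_order is made up for this illustration; pack, unpack and the Config module are standard Perl.

```perl
use strict;
use warnings;
use Config;

# Hypothetical helper (not from the thread): report the byte order of the
# machine running this script. pack "S" uses the native 16-bit layout, so
# the first byte of the packed value 1 is 0x01 on a little-endian host
# and 0x00 on a big-endian one.
sub host_byte_order {
    my $first_byte = unpack "C", pack("S", 1);
    return $first_byte == 1 ? "little" : "big";
}

printf "native order: %s endian (Config byteorder: %s)\n",
    host_byte_order(), $Config{byteorder};
```

    The result depends on the machine, which is exactly why the explicit "<" and ">" modifiers (or the "n" and "v" templates) are usually the safer choice when bytes leave the program.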

    (For the purpose of this explanation, assume byte addressable memory everywhere, and let's ignore that modern machines of course work with words of multiple bytes. Your question comes from I2C anyway, which works on the byte level.)

    So assuming you want to store my $x = 48879;, or my $x = 0xBEEF; as a 16-bit unsigned integer in two bytes, there are two ways to do that: with the most significant byte 0xBE at the lower memory address, or at the higher one (or in the case of a protocol, 0xBE being transmitted first, or second).

        48879 = 0xBEEF
                  ^ ^
         0xBE = MSB LSB = 0xEF

        Memory Address:   0  |  1   |  2   |  3
                             |      |      |
        Little Endian:   ... | LSB  | MSB  | ...
                             | 0xEF | 0xBE |
                             |      |      |
        Big Endian:      ... | MSB  | LSB  | ...
                             | 0xBE | 0xEF |

        # little endian
        $ perl -MData::Dump -e 'dd pack "S<", 0xBEEF'
        "\xEF\xBE"
        # big endian
        $ perl -MData::Dump -e 'dd pack "S>", 0xBEEF'
        "\xBE\xEF"

    What can sometimes be confusing is that some diagrams of memory addresses or transmission protocols place the least significant bit on the right side of the diagram (because bytes are typically written with their most significant bit first, as in 170 == 0b10101010), but at the same time put the least significant byte / lowest memory address on the left, and there are often other variations of this. In fact, if I recall correctly, sorting out this initial left-to-right/right-to-left confusion was probably one of the most important things to help make endianness "click" for me. Another thing to keep in mind is that when you write 0xBEEF in your source code, that's still a single 16-bit value, and not yet two bytes; you don't yet know how it'll be represented in memory.

    To answer your two questions: Yes, you're correct. In 0x03FF, the MSB is 0x03 (the "big end") and the LSB is 0xFF (the "little end"), so the first is big endian order since you print the big end first, and the second is little endian order because you print the little end first. But just to be clear, on the other hand, $b1 and $b2 are two unconnected variables - so what you've really got there is two separate bytes, not a two-byte value stored in a certain order. (Update: I wouldn't have picked this nit if you had stored them in an array instead, since the array indices take the place of the memory addresses :-) )
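    The array variant mentioned in the update could look roughly like this. It is only a sketch: the helper names bytes_big_endian and bytes_little_endian are invented here, and index 0 is assumed to stand in for the lower memory address.

```perl
use strict;
use warnings;

# Hypothetical helpers: split a 16-bit value into an array of two bytes.
# Index 0 plays the role of the lower memory address.
sub bytes_big_endian {
    my ($num) = @_;
    return ( ($num >> 8) & 0xFF, $num & 0xFF );    # MSB at index 0
}

sub bytes_little_endian {
    my ($num) = @_;
    return ( $num & 0xFF, ($num >> 8) & 0xFF );    # LSB at index 0
}

my @be = bytes_big_endian(0x03FF);
my @le = bytes_little_endian(0x03FF);
printf "big endian:    [%s]\n", join ", ", map { sprintf "0x%02X", $_ } @be;
printf "little endian: [%s]\n", join ", ", map { sprintf "0x%02X", $_ } @le;
```

    With the bytes in one array, "which byte comes first" has a definite meaning, which two separate scalars do not give you.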

    I've for now ignored additional topics like bigger numbers stored as four or more bytes, where at least in theory there are more than two possible orderings, but I hope that if the principle and the 16-bit version make sense, understanding the documentation for wider values will be easier.
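    For the two common orderings of a four-byte value, a small sketch with pack may help. The helper hex_bytes and the sample value 0xDEADBEEF are assumptions made for this illustration; "N" and "V" are the standard pack templates for 32-bit big-endian and little-endian unsigned integers.

```perl
use strict;
use warnings;

# Hypothetical helper: pack a 32-bit unsigned value with the given
# template and return its bytes as hex, lowest address first.
sub hex_bytes {
    my ($template, $num) = @_;
    my @bytes = unpack "C*", pack($template, $num);
    return join " ", map { sprintf "%02X", $_ } @bytes;
}

my $value = 0xDEADBEEF;
print "big endian    (N): ", hex_bytes("N", $value), "\n";    # DE AD BE EF
print "little endian (V): ", hex_bytes("V", $value), "\n";    # EF BE AD DE
```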

    (By the way, I like to shift first and then mask, i.e. $b1 = ($num >> 8) & 0xFF;, because I've been burned on a small microprocessor where the C compiler implemented the bit shift with a rotate instruction instead. I forget which processor and compiler it was though... plus I don't think Perl would run on such a uC, so it's really just a preference I've developed as a result.)
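    As a sketch of that shift-then-mask habit, generalized to any byte position (the helper name byte_at is made up for this example):

```perl
use strict;
use warnings;

# Hypothetical helper: return byte number $index of $num, where index 0
# is the least significant byte. Shifting first and masking afterwards
# means the mask is always 0xFF, whichever byte is being extracted.
sub byte_at {
    my ($num, $index) = @_;
    return ($num >> (8 * $index)) & 0xFF;
}

my $num = 0x03FF;
printf "%x, %x\n", byte_at($num, 1), byte_at($num, 0);    # 3, ff
```

    One practical upside is that there is no per-byte mask constant such as 0xff00 to get wrong.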

    (As the AM post points out, endianness could also refer to the order in which bits of a byte are transmitted, but in my experience, I've pretty much always seen the term endianness referring only to byte order; most protocol descriptions I've read will instead explicitly state "the least/most significant bit is transmitted first/last".)
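    To illustrate what "most/least significant bit first" means for a single byte, here is a rough sketch. The helper names and the sample byte are assumptions for illustration only and do not describe any particular protocol.

```perl
use strict;
use warnings;

# Hypothetical helpers: list the bits of one byte in the order they would
# go out on a serial line, most significant bit first or least
# significant bit first.
sub bits_msb_first {
    my ($byte) = @_;
    return map { ($byte >> $_) & 1 } reverse 0 .. 7;
}

sub bits_lsb_first {
    my ($byte) = @_;
    return map { ($byte >> $_) & 1 } 0 .. 7;
}

my $byte = 0b10110000;    # 176
print "MSB first: ", join("", bits_msb_first($byte)), "\n";    # 10110000
print "LSB first: ", join("", bits_lsb_first($byte)), "\n";    # 00001101
```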

    Minor updates for clarity.

      Thanks a boatload, haukex!

      After I wrote my question here, I continued doing research. I was comparing endianness with the year-long work I've been doing with hardware registers, so I could not quite grasp things until it did click (then you re-affirmed) that endianness (mostly) refers to byte storage, not (typically) bit storage.

      By the time I wrote my post here, I thought I had it, but just wanted clarification. You covered it perfectly, even down to sometimes bits, and how some 32 or 64 bit ints may even have the middle bytes reversed. I think I'll leave that for another day until I run into it, if ever. :)

      I also agree with your shift first then perform and/or-ing. I ran into that while playing around during this whole lesson. Not only can it catch edge case problems, I feel it's quicker to digest/comprehend when glancing through code. Thanks for that tidbit too.

      The endianness issue cropped up because I was getting weird (read: backwards) results when trying to read two bytes as a single 16-bit int from an Arduino over I2C. This case is present in this post that I made, within the Arduino sketch portion of it, within the __read_analog() function.

      Thanks again,

      -stevieb

        Glad to help!

        endianness (mostly) refers to byte storage, not (typically) bit storage

        Yeah, I think there's some ambiguity there, e.g. whether LSB and MSB mean bytes or bits - Wikipedia makes a good point about context:

        LSB can also stand for least significant byte. ... If the abbreviation's meaning least significant byte isn't obvious from context, it should be stated explicitly to avoid confusion with least significant bit.

        Although the entry on Endianness does differentiate a bit more clearly:

        The order of bits within a byte or word can also have endianness (as discussed later); however, a byte is typically handled as a numerical value or character symbol and so bit sequence order is obviated.

        And then goes on to talk about "Bit endianness" in its own section.

        how some 32 or 64 bit ints may even have the middle bytes reversed
        I don't know of a thing like that. How the data is stored in memory has a simple algorithm. See this link which also gives a list of "endianness" for various file formats.
Re: Understanding endianness of a number
by Anonymous Monk on Jul 23, 2017 at 15:22 UTC
    Endianness is just what order the bytes of a multi-byte number are stored in. E.g. int 4660 (0x1234) can be stored as 0x12 at the lower memory address and 0x34 at the higher one, or vice versa. It can also refer to the order of transmission of bits in a byte over a serial link, but typically it's the bytes of a word. An easy mnemonic is little endian = "little end first", i.e. LSB first (at the lowest address). Big endian = "big end first", i.e. MSB first (at the lowest address). See Endianness
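    The same two bytes read with each convention, as a minimal sketch. The helper names read_u16_be and read_u16_le are invented for this example; "n" and "v" are the standard unpack templates for 16-bit big-endian and little-endian unsigned integers.

```perl
use strict;
use warnings;

# Hypothetical helpers: interpret a two-byte string as a 16-bit unsigned
# integer, big end first ("n") or little end first ("v").
sub read_u16_be { my ($bytes) = @_; return unpack "n", $bytes; }
sub read_u16_le { my ($bytes) = @_; return unpack "v", $bytes; }

my $bytes = "\x12\x34";    # 0x12 at the lower address, 0x34 at the higher
printf "big endian:    %d (0x%04X)\n", (read_u16_be($bytes)) x 2;    # 4660 (0x1234)
printf "little endian: %d (0x%04X)\n", (read_u16_le($bytes)) x 2;    # 13330 (0x3412)
```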

      Thanks AnonyMonk,

      In this case, it definitely is related to data transmission over the wire. Arduino's Wire library's Wire.write() method sends the data back big endian. This confused me when I first ran into it, as my 2-byte word was backwards when I re-assembled the bytes.
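      Roughly what the re-assembly looks like on the receiving side, as a sketch only: the helper name word_from_wire is hypothetical, and it assumes the sender puts the high byte on the wire first.

```perl
use strict;
use warnings;

# Hypothetical helper: rebuild a 16-bit value from two bytes, assuming
# the high byte arrived first.
sub word_from_wire {
    my ($first, $second) = @_;
    return (($first & 0xFF) << 8) | ($second & 0xFF);
}

my @received = (0x03, 0xFF);    # 1023 sent high byte first
printf "right order: %d\n", word_from_wire(@received);            # 1023
printf "wrong order: %d\n", word_from_wire(reverse @received);    # 65283
```

      The second line shows the kind of "backwards" value that appears when the two sides disagree about which byte comes first.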

      What caused me further confusion, was that I kept relating endianness with what I've been learning for the past year dealing with hardware registers. For some reason, I wasn't correlating that we're dealing (usually) with *bytes* with endianness, not specifically bits as dealt with in hardware registers regarding LSB and MSB etc.

      Little end first, big end first :)
