in reply to Re^2: Associative array
in thread Associative array

Indeed. I put this somewhere in the same category as "lists in scalar context". It fits the orthography and good luck trying to get people to see differently. However, if one views a hash as an associative array one is likely to end up in all sorts of confusion, especially if one is first familiar with Lisp.

Two of the most important sources of confusion are: (a) confusion of implementation with concept and (b) confusion over ordering.

The name "associative array" confuses an implementation of a hash with the concept of a hash. Conceptually, a Perl hash is a data structure that will allow one to retrieve a value by name rather than position. There are many ways to implement this. For example, in Lisp there are two different ways to get hash-like effects: alists and plists.

In many other languages, Perl included, hashes (sometimes referred to as dictionaries) are implemented using data structures that are much more complicated than either a-lists or p-lists. Perl in fact has two entirely different C structures for hashes (HV) and arrays (AV). Although both have ARRAY, FILL, and MAX to describe their memory usage, the similarity stops there. The HV structure is carefully organized to maximize the speed with which values can be retrieved by property name and to conserve storage (in some cases keys are stored as pointers to a shared set of keys rather than having their own private string representation). See PerlGuts Illustrated.

Now, I suppose you could argue that even Perl stores key-value pairs in its ARRAY member and that makes it an associative array. However, that is a misleading oversimplification. And it brings us to the next issue: ordering.

Another problem with viewing hashes as associative arrays is that an array has an explicit ordering to its elements, whether those elements are simple scalars or key-value pairs. Hashes, by contrast, have an undefined unordering. Failing to grasp the distinction can cause serious problems, including hard to track intermittent bugs. One who views a hash as an associative array is likely to assume that ordering of hash members is consistent between runs of programs and may perhaps base their program on that assumption. However, this is a dangerous assumption. Perl hashes have an unpredictable order by intent. One of the techniques to safeguard software from algorithmic complexity attacks is to randomize the order of hash keys, thus making it harder to find a predictable spot to insert dangerous code. See keys and perlsec for further discussion.

Best, beth

Replies are listed 'Best First'.
Re^4: Associative array
by tilly (Archbishop) on Jul 13, 2009 at 09:57 UTC
    I would argue that you have completely reversed the proper comparison; calling the data structure a hash confuses implementation and concept, while calling it an associative array focuses on the concept.

    As Wikipedia says, an associative array is an abstract data type that maps a set of keys onto values. The analogy to an array is that arrays associate indexes with values, while associative arrays associate more general keys with values. So it is like an array, but with a more flexible choice of association. (Depending on the language and/or library, keys may even be arbitrary objects.) There are many ways to implement them. In fact the two Lisp examples you gave are valid (albeit inefficient) implementations!

    The term "associative array" is used in many languages. You can see how ubiquitous the term is across languages by googling for associative array. When I did that, the top links were to the Wikipedia page I gave, then articles on PHP, JavaScript, Oracle PL/SQL, PHP again, an entry in a data structures dictionary, an article on C++ compilers, a Perl article, etc. Clearly the term is very well established in the broader programming world.

    By contrast when we say hash we're referring to the implementation. Perl hashes are internally implemented using hash tables. (Common Lisp also has hash tables built in.) However there is no guarantee that Perl's hashes will always be so implemented. Perl has changed hashing algorithms in the past, and theoretically could switch to using a BTree at some point in the future. (Extremely unlikely though.)

      It is easier to say hash than AA without spilling beer :)

      When I first responded to the OP I suspect I was misreading the term "associative array" for "association list" (the particular way the OP was using hash and array reinforced that for me). But I persisted in my point because the more I think about it, I really do think that the term "associative array" is one of the not-so-smart-terms that has developed in our industry and is fundamentally misleading, despite a history that goes back at least as far as AWK.

      The term 'associative array' isn't implementation free to everyone, Wikipedia and even some recent text books aside. At the bottom of the Wikipedia discussion page for "associative array" there is a brief back and forth about potential confusion between the two terms "associative list" and "associative array" (the decision was to discourage implementation specific meanings). Consider also, for example, this definition from Mastering Algorithms with C, (Kyle Louden, O'Reilly, 1999) which defines an associative array as:

      Associative arrays consist of data arranged so that the nth element of one array corresponds to the nth element of another. p. 142

      I agree with you that the term "hash" is also laden with implementation associations, though less so, in my opinion. Although the term hash implies a hash function (as opposed to a balanced tree which uses structure to facilitate retrieval), there are many different hash function implementations, so ultimately even a hash function is more of a concept than a specific implementation. More importantly, it is hard to make assumptions about the physical arrangement of a data structure when the "implementation" is a category of functions.

      I don't think the same thing can be said of the word "array". An array is a physical structure that supports both sequenced and random access. Because its indices are integers, it has implications for ordering that are directly contradictory to the intent of the abstract meaning of the term "associative array" used in the AWK documentation or the Wikipedia definition. Both those definitions attempt to generalize the term "array" by focusing on random access (via non-numerical keys) at the expense of sequenced access.

      There are, of course, better conceptual terms: "map", "dictionary", or even "association container". If we were talking Java, I would use the term Map (Java's abstract container for this purpose). However, in Perl we are stuck with "hash" and it is the Perl use of the word we need to explain.

      Best, beth

        I'm not arguing that associative array is a particularly good term. Just that it is a very widely used and accepted term. I think you accept that point. And I accept yours that "associative array" can cause some people to associate confusing things about arrays.

        If we're looking for alternatives I don't like "map" because it has too many other meanings. "Dictionary" strikes me as better. "Associative container" is too wordy ("associative array" has the same problem), but is used for this exact purpose in the C++ STL library. If we're going to make up a term, "association" is not bad.

        However multiple language communities already call them "associative arrays" (and then usually call them something shorter in daily use), and we're not going to change that.

        And a random note. If I tie a Perl hash to a DB_File DB_BTREE file, in what sense do I now have a hash? The tie interface really exposes the fact that we've got an abstract datatype, and the implementation it is tied to there has absolutely nothing to do with hash functions. That's what I mean when I say that the common term specifies an implementation too closely.

Re^4: Associative array
by BrowserUk (Patriarch) on Jul 13, 2009 at 14:34 UTC

    Hm. Sounds like you're trying to start YAPS (Yet Another Perl Shibboleth). You'll excuse me if I do not wish you well with that.

    1. The alists & plists stuff is a read [intentially sic] herring:

      because, (to the best of my very limited Lisp knowledge), neither can perform the fundamental operation of both associative arrays and hashtables: that of random access of values by key.

      To the best of my knowledge, access to both is strictly sequential from the head.

    2. The implementation details are irrelevant:

      because associative array is an abstract datatype defined in terms of its operations. It can be implemented many different ways.

    3. The implicit ordering is a misnomer;

      You are conflating the order in which the data is accessed when iterated, with the ordering of the data itself.

      • Nobody except the rawest newbie expects AA/hashtables to be ordered.

        And when they do, its for the wrong reasons. As LW is reputed to have said: "iterating over the keys of a hash is like clubbing someone to death with a loaded Uzi".

      • Saying that ordinary arrays have implicit ordering is equally a misnomer.

        If you iterate an array in ascending index order; shuffle the contents and again iterate by ascending index, the data (values) will be differently ordered.

      • The pseudo-randomising of the hash seed for security reasons has as much to do with the iteration order of a hash, as the type of lock on a door has to do with the order people will come through it in the morning.

        The hash seed does not determine the order of iteration. That is (and remains) the sequentially ascending sequence of the bucket array (with diversions for non unitary buckets), just as for arrays.

        The hash seed simply determines which bucket the key/value pair is mapped to. Ie. It affects insertion not traversal.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      Your initial question gave me an opportunity to think through some issues that have been annoying me for some time and for that I am grateful. But I am puzzled by the response above.

      1. In Lisp values are randomly accessible by key via helper functions (assoc or find for alists, get for plists). Are you suggesting that the lack of a syntax to hide the need for helper functions disqualifies them?
      2. The implementation details are irrelevant ... please see my response to tilly: Re^5: Associative array.

      3. Viewing array indices as keys (as opposed to mere access order) is the theoretical basis for claiming that "associative arrays" are a generalization of normal arrays, i.e. they have a more general universe of possible keys. If anyone is doing the confusing, it is whoever first came up with the term "associative array".

        My objection to the term comes from the fact that generalizing the key set also changes the mathematical properties of that key set. And that in turn changes the range of valid operations on those keys and how we work with them on a theoretical and even practical level.

        What operations are different? Well, for one, in a normal array we can meaningfully talk about the distance between elements because integer keys can be subtracted one from another. We can also divide indices (or more precisely the distance between two indices) by some constant. Those two properties play an important role in certain searching and sorting algorithms (binary search and sort for example)

        And then there is ordering. A finite range of integers has a natural well-ordering based on '<='. An arbitrary set of keys only does if we have defined (and applied) a function that maps those keys to the set of positive integers (or some subset). There are many, many ways to do this (locale based, asciibetical). In fact it is mathematically guarenteed that we can find such a function for any finite set. However, this is an additional step that must be done in order to make an associative array behave like a normal array, assuming you even want to do that.

        But often we don't. We use normal arrays and "associative arrays" for very different kinds of algorithms. Associative arrays are best used when our algorithm relies on the semantic significance of each element. Normal arrays are used when our algorithm relies on the order of elements.

        Sorting and reordering only changes what value is associated with what key. It doesn't change the fundamental fact that the values are associated with integral keys and that integers have certain useful mathematical properties. In fact, one of the motivations for using normal arrays in sorting is their ability to map elements to a well ordered set of keys.

      4. Even if Perl is always using bucket order, isn't the net result a different key order? The documentation for keys says (I have bolded the key words):
        The keys are returned in an apparently random order. The actual random order is subject to change in future versions of perl, but it is guaranteed to be the same order as either the values or each function produces (given that the hash has not been modified). Since Perl 5.8.1 the ordering is different even between different runs of Perl for security reasons (see "Algorithmic Complexity Attacks" in perlsec).

        What am I misunderstanding there?

      Best, beth

        1. Are you suggesting that the lack of a syntax to hide the need for helper functions disqualifies them?

          No. I'm saying that (TTBOMK) alist & plists are lists and can only be sequentially accessed--which isn't random access, and therefore disqualifies them from being AAs. Ie. searching them is O(n) instead of O(1).

          As I also said, my Lisp knowledge is limited. Then best evidence I have for this are a couple of quotes from the Common Lisp Manual:hashtables: "Finding the value is very fast, even if there are many entries, because hashing is used; this is an important advantage of hash tables over property lists." & "adding a value to a hash table is a destructive operation; the hash table is modified. By contrast, association lists can be augmented non-destructively."

        2. I read the linked post.
        3. in a normal array we can meaningfully talk about the distance between elements because integer keys can be subtracted one from another. We can also divide indices (or more precisely the distance between two indices) by some constant. Those two properties play an important role in certain searching and sorting algorithms (binary search and sort for example)

          Hm. You can do math with the bucket indicies too. It won't allow you to write a binary search, but then you don't need to.

          'Arrays' & 'associative arrays' are different things. Like 'trees' and 'shoe trees'; 'salmon' and 'rock salmon'; .... See below.

        4. Even if Perl is always using bucket order, isn't the net result a different key order?

          Yes. But you would also get a different data ordering if:

          • they chose a better hashing algorithm;
          • or they adjusted the re-hash threshhold;
          • or if you insert or delete a key;

            The point is that the iteration ordering doesn't change because of the hash seed randomisation. The ordering of the keys (one of a pair of data items) changes. But the iteration order has nothing to do with the ordering of the values--for either hashes or arrays. I'll attempt to make that clearer :)

            The "implicit ordering" of arrays, has nothing to do with the ordering of the values held within it. If you take the same list of data and unshift them into an array; and push them onto an array; the iteration order remains the same; but the perceived ordering of the data will be different(reversed). And if you spliced them into the 'middle', you'd percieve a different ordering when you iterated the array sequentially.

            This is no different with hashes; the buckets are iterated sequentially+. Which perceived ordering values come out in, depends on where they were put during insertion, not the the iteration ordering. And the insertion ordering is dependant upon many things: the current size of the bucket array; the number of values added; the hashing algorithm (including the seed used). Even just adding some keys and then removing them again will caused the percieved ordering of the (same) keys to change:

            undef @h{ 1 .. 8 };; print keys %h;; 6 3 7 2 8 1 4 5 undef @h{ 9 .. 16 };; delete $h{ $_ } for 9 .. 16;; print keys %h;; 7 2 1 6 3 8 4 5
            What am I misunderstanding there?

            I believe you are conflating the iteration ordering and the (resultant) data ordering. The hash seed randomisation affects which buckets the keys are placed in, not the order those buckets are iterated.

            Sure, if all else remains unchanged, the net effect of hash key randomisation might be a difference in the perceived order of the keys as output. But if all else stays equal (including the seed), and any one of those other things changes, you will get the same perception. Just as you would with an array, if a tied array decided to use a hash as it's storage mechanism.

        P>Conclusion: Whilst I do see some potential for confusion between 'array' and 'associative array', it think it is minor. And I don't think most of your arguments in support of your assertion stand up to scrutiny. Programmers will quickly become familiar with the properties of associative arrays and not confuse them with arrays.

        And non-programmers won't know what either are, and might guess they are more similar than they really are. But that's really not a problem because they won't know the properties of arrays, so will have no basis for real confusion.

        Telling others not to use an industry standard term, because you've reached the conclusion that it's a bad one, is ...

        I don't object to your pointing out your objections to the term--I did that myself recently for undefined behaviour--and I might even agree with you. But it is one thing to raise and support one's objections to a industry standard term; it's all together a different thing to tell people off for using it.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.