BrowserUk has asked for the wisdom of the Perl Monks concerning the following question:

You employ a class that provides a hash-like interface designed for multi-threaded operation. A part of that interface is the provision of iterators. Whilst one thread is using an iterator, other threads may be modifying the data structure. Modifications might include any of:

  1. Modifying existing values;
  2. Adding new key/value pairs;

    Which in turn might cause a resize operation.

  3. Deleting existing key/value pairs;
  4. Explicitly clearing all the key/value pairs.

    As opposed to implicitly, by going out of scope.

It obviously becomes necessary to invalidate existing iterators when some of the modifications occur. But is it necessary for all of them?

For example, with case 1, if the only modification is the updating of a value, any existing iterator has either: a) already traversed that pair and retrieved the old value; or b) will traverse it sometime and get the new one. In other words, it will get a value which happens to be the current value at that moment in time; and provided that the threads are properly decoupled (ie. not timing dependent), correct algorithms should be fine in either case.

The other cases are less clear cut. One school of thought is that if a deletion or insertion occurs behind an existing iterator, then the affected key/value pair has already been processed, so it doesn't matter; and if it happens ahead of the existing iterator, it'll get processed once the iterator gets there.

The problem is that the nature of hash structures is such that an insertion or deletion, either ahead or behind, could affect the structure (eg. cause a rehash or resize) such that the iterator might now revisit pairs already processed and/or skip keys that haven't yet been. And what if the deletion deletes what would be the next pair to be returned? If the iterator returned undef -- what else could it do? -- then that would be confused with the normal iterator-completed return.

If another thread completely empties the hash, then it could be viewed that having existing iterators simply end is the right thing to do. But of course, for some algorithms that could result in false statistics, data or conclusions.

So, the questions rattling around my brain are:

I've been thinking about this for a while and I'm beginning to suspect that there is no one right answer to it. What I'm looking for is some kind of feel for what might constitute a 'most cases' right answer.

Thoughts?


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked

Replies are listed 'Best First'.
Re: Design thoughts: iterator invalidation (depends)
by tye (Sage) on Feb 20, 2015 at 16:09 UTC

    Depends how you implement the iterator. I would often implement the iterator by caching the list of keys and iterating that (could be a problem if the number of keys is huge). Then you don't have to invalidate the iterator, just check 'exists' before returning each key. Though, exactly how you do that can still lead to race conditions, of course.
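
    A minimal Perl sketch of that approach, assuming a threads::shared hash and taking a few liberties with locking granularity (all names here are illustrative):

        use threads;
        use threads::shared;

        # Snapshot the key list up front; check 'exists' before returning each
        # key, so anything deleted after the snapshot is silently skipped.
        sub make_iterator {
            my ( $href ) = @_;
            my @keys = do { lock( %$href ); keys %$href };
            return sub {
                while ( @keys ) {
                    my $k = shift @keys;
                    lock( %$href );
                    return ( $k, $href->{ $k } ) if exists $href->{ $k };
                }
                return;    # exhausted
            };
        }

        # my $next = make_iterator( \%shared );
        # while ( my ( $k, $v ) = $next->() ) { ... }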

    It also depends on the operation being done with the iterator. If you are doing an operation with an iterator where changes in the keys would cause a problem, then you might need to implement some form of locking. Your description makes me suspect that the operations you are doing are unlikely to be such that missing a new key is a big deal. If it is, then I might do optimistic locking where, after I've finished iterating, I take a lock, check that the list of keys hasn't been updated since I initialized the iterator, apply the update indicated by my calculations, then unlock.

    If you are just iterating the keys because you need to process every key eventually, then invalidating the iterator each time a change is made can lead to your loop starting over so frequently that it never gets to the end, and keys that don't get iterated early are simply never reached.

    There is no one right answer. So, yes, there may be value in having an option when creating the iterator.

    As for how to indicate that an iterator has become invalid? I'd throw an exception. Though, you could also just allow for checking for validity, thus providing the caller the option as to whether or not they care (such as only at the end). But if it is likely that there are cases where changes invalidate the result being accumulated, then I'd rather have the option of having the checking done inside the iterator rather than force the caller to remember to do the checking.

    How to check that an iterator has become invalid? I'd track a generation number for the state that is being iterated. An operation that makes an important change to the state of what is being iterated would increment the generation number. The iterator would copy the generation number when created and throw an exception when iterated if the generation number has changed.
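
    In Perl terms, a hedged sketch of that generation check might look like this (assuming writers bump a shared $$genref under the same lock they use to modify the hash; names are illustrative):

        # $genref is a reference to a shared counter incremented on every
        # change that should invalidate existing iterators.
        sub make_checked_iterator {
            my ( $href, $genref ) = @_;
            my ( @keys, $gen0 );
            {
                lock( %$href );
                @keys = keys %$href;
                $gen0 = $$genref;      # generation at creation time
            }
            return sub {
                lock( %$href );
                die "iterator invalidated\n" if $$genref != $gen0;
                while ( @keys ) {
                    my $k = shift @keys;
                    return ( $k, $href->{ $k } ) if exists $href->{ $k };
                }
                return;
            };
        }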

    The worst case here calls for using MVCC (where you track a generation number for transactions and can have multiple versions for each key and you purge old versions when there are no more "live" transactions whose life span intersected with the version's life span).

    - tye        

      I would often implement the iterator by caching the list of keys and iterating that (could be a problem if the number of keys is huge).

      Your parenthetical excludes this approach for my purposes. For uses where the hash is small enough for that not to be a problem, it is simpler if the user just takes a non-shared copy of the shared hash and iterates that.

      There is no one right answer. So, yes, there may be value in having an option when creating the iterator.

      By the same logic that precedes the above quote, I've reached the same conclusion. And I concur with your logic that follows it also.

      Here's my first pass description of what I think I'm going to implement ('scuse the formatting and rambling thought process.):

      When creating an iterator, the user can pass a set of flags that indicate their requirements for the iterator:

          INVALIDATE_NONE          0x0
          INVALIDATE_ONRESIZE      0xFFFF
          INVALIDATE_ONKEYADD      0xFFFF0000
          INVALIDATE_ONKEYDELETE   0xFFFF00000000
          INVALIDATE_ONVALUECHANGE 0xFFFF000000000000
          INVALIDATE_ALL           0xFFFFFFFFFFFFFFFF

      Each hash will contain a version number, a simple monotonic counter. The flags will control which actions update that counter.

      Each iterator will take a copy of the counter at the point of its creation; and compare that copy with the counter in the hash each time it is called. If they are different, it will free the iterator storage and raise an exception.

      Nice idea -- economical -- but it would mean that either:

          the flags would need to be applied to the hash, not the iterator, and therefore all iterators for a given hash would have the same attributes;

      or

          each of the monitored operations would have to update the versions in all current iterators. Even for modest numbers of iterators, the looping, indirections and locking involved would impose a substantial hit on the performance of the affected operations;

      or

          the hash would need separate counts for resizes, adds, deletes and value changes :(

          Could they be safely combined into a single count? (eg. union{ U64 version; struct { U16 updates, adds, deletes, resizes; } }) With the flags specifying a mask applied to the U64 version to detect the changes?

          What are the odds of an iterator being called when exactly 2**16 disallowed changes (* the number of disallowed types of change) have occurred? Minimal! The iterator would have to be called for the first time after creation when exactly 65536 (or some exact multiple thereof) changes of the disallowed type had been made, and not once before. And if more than one type is disallowed, the odds are 2^16 (*2^16) slimmer. And in the unlikely event that it occurred, the iterator would be invalidated the next time it was called, unless that also happened after an exact multiple of 65536 changes had been made. As close to impossible as I need!

      Conclusion: The version counter will not be a simple monotonic counter of all changes, but rather a 64-bit value consisting of 4 x 16-bit unsigned, rollover counters. The per-iteration check consists of masking the iterator copy and comparing it against a masked view of the hash version.

      Summary:

      Each hash contains 4 x 16-bit counters viewable as a 64-bit uint: union{ U64 version; struct { U16 updates, adds, deletes, resizes; } }. The fields are incremented by the corresponding operations.

      Each iterator takes a copy of this on instantiation, and then compares its copy with the hash's version on each iteration (masking both according to the instantiation flags). If a disallowed operation has occurred, the iterator frees itself and raises an exception.
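
      A rough Perl rendering of that per-iteration check, purely for illustration (the real thing would presumably be C with atomic increments; 64-bit integers and the constant values mooted above are assumed):

          use constant {
              INVALIDATE_NONE          => 0x0,
              INVALIDATE_ONRESIZE      => 0xFFFF,
              INVALIDATE_ONKEYADD      => 0xFFFF_0000,
              INVALIDATE_ONKEYDELETE   => 0xFFFF_0000_0000,
              INVALIDATE_ONVALUECHANGE => 0xFFFF_0000_0000_0000,
              INVALIDATE_ALL           => 0xFFFF_FFFF_FFFF_FFFF,
          };

          # Bump the 16-bit field selected by $mask, letting it roll over
          # without carrying into its neighbour.
          sub bump_version {
              my ( $h, $mask ) = @_;
              my $shift = 0;
              $shift += 16 until ( $mask >> $shift ) & 1;
              my $field = ( ( ( $h->{version} & $mask ) >> $shift ) + 1 ) & 0xFFFF;
              $h->{version} = ( $h->{version} & ~$mask ) | ( $field << $shift );
          }

          # The per-iteration check: compare only the fields this iterator
          # was created to care about.
          sub still_valid {
              my ( $iter, $h ) = @_;
              return ( ( $iter->{version0} ^ $h->{version} ) & $iter->{flags} ) == 0;
          }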

      Further thoughts, critiques?


Re: Design thoughts: iterator invalidation
by hdb (Monsignor) on Feb 20, 2015 at 15:53 UTC

    Isn't this the same issue as for reading data from a database concurrently? Would you want to introduce transactions and commits into your scheme?

    And some of your issues can already occur when you have a single iterator over a hash, like when using each. Or not?

      Isn't this the same issue as for reading data from a database concurrently? Would you want to introduce transactions and commits into your scheme?

      Yes. Same issue; but the whole start/end-transaction, commit/rollback mechanism is a very heavyweight solution for an in-memory problem.

      Often referred to as Software Transactional Memory, a seductive term that is actually just a whole lot of smoke and mirrors around what comes down to:

      1. take a copy of the data you want to modify;
      2. make a checksum of the original;
      3. modify the copy;
      4. recalculate the checksum of the original;
      5. if the checksum has changed, goto step 1.
      6. copy the modified copy over the original.

      Now think about the locking required when performing steps 1, 2, 4 and 6; and ask yourself: wouldn't it have been quicker to just lock the data, modify it, and unlock it?
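
      To make the comparison concrete, here's a toy Perl rendition of those six steps for a shared hash of flat values (a sorted-key digest stands in for the 'checksum'; this is only a sketch of the pattern, not a real STM):

          use threads::shared;
          use Digest::MD5 qw( md5_hex );

          sub _digest {                      # deterministic checksum of the contents
              my ( $href ) = @_;
              return md5_hex( join "\0", map { ( $_, $href->{ $_ } ) } sort keys %$href );
          }

          sub stm_style_update {
              my ( $href, $code ) = @_;
              while ( 1 ) {
                  my ( %copy, $sum );
                  {
                      lock( %$href );
                      %copy = %$href;              # 1. copy the data
                      $sum  = _digest( $href );    # 2. checksum the original
                  }
                  $code->( \%copy );               # 3. modify the copy, no lock held
                  lock( %$href );
                  next if _digest( $href ) ne $sum;    # 4 & 5. changed underneath us? retry
                  %$href = %copy;                      # 6. install the modified copy
                  return;
              }
          }

      Against that, 'lock the data; modify it; unlock it' is one lock acquisition and no copying or digesting at all.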

      some of your issues can already occur when you have a single iterator over a hash, like when using each. Or not?

      Indeed. All of them. In fact, I seem to recall that someone once meditated on the idea that each should be deprecated because of it. (I did try to find it and failed; though I did turn up the one where the guy wanted to deprecate if in OO code! Reckoned it could all be done with subclassing and overrides as I recall, though I didn't reread it. :)

      But the problem becomes exponentially worse when it's not just code/subroutines you call from your iterating loop that might (deterministically) undo you, but any other thread with visibility of the hash, and with non-deterministic timing. Hence I feel that it warrants some effort.


Re: Design thoughts: iterator invalidation
by SuicideJunkie (Vicar) on Feb 20, 2015 at 15:58 UTC

    Given the wide variety of ways to use a hash, I'm not sure if it is even possible to cover 'most cases' instead of just the most common case.

    My guess at a default case for inexperienced coders would be to iterate over the list of keys present at the start, and skip any that were prematurely deleted.

    Personally, I don't think I'd trust the iterator after changing the hash, but I'd probably hope that it would iterate until there are zero keys in the hash that it has not yet returned.

    I think a poll on the topic would be worthwhile.

      Personally, I don't think I'd trust the iterator after changing the hash, but I'd probably hope that it would iterate until there are zero keys in the hash that it has not yet returned.

      There are two problems with that statement:

      1. How would you know that the hash had changed if the change(s) occur in (a) different thread(s)?
      2. Remembering which keys have already been returned would impose a huge burden.

        Ditto: noticing that a new key had been added at a point preceding the current position of the iterator.



        I do agree that it would probably be terrible to actually implement, moreso with threads involved. It would just be convenient if it magically worked that way.

Re: Design thoughts: iterator invalidation
by RichardK (Parson) on Feb 20, 2015 at 16:38 UTC

    It entirely depends on the implementation details of the underlying data store, and I don't know enough about Perl's low-level hash implementation to comment. However, separating read locks from write locks might be useful. RCU does it that way, so reading up on how it works may give you some ideas. RCU is optimized for the read case and so writes can be very slow.

      RCU is optimized for the read case and so writes can be very slow.

      Writes can't be very slow. Garbage collection of prior copies can be significantly delayed.

      The closest thing to writes being slow is that you have to create an entire new unit so having huge units leads to having to copy the full unit in order to make an update. If that makes your writes slow, then you probably need to split that huge unit up.

      - tye        

        True, but as there can only be one active writer, if there are many threads trying to write to the list you have to wait your turn to get the write_lock. Once you've got the lock _then_ it's fast.

      I reached the (my) conclusion that RCU doesn't really help with my implementation of a concurrent hash.

      My design is based upon the ideas presented in A lock-free hash table. (Note: the premise is that the implementation does not require internal locks -- though there will be some internal spin-waits.)

      That doesn't mean that the user will not have the ability to lock individual key/value pairs, for those situations where the new value of a key is derived from the old value. I'm also reaching the conclusion that I will have to provide the ability to lock the entire hash.



        Nice talk. The blog he started just before the video was hosted at his employer who changed the URL to http://www.azulsystems.com/blog/cliff, which also links to his new blog, http://www.cliffc.org/blog/. I also found a link to one version of the code.

        The design actually comes quite close to RCU in many ways. He uses atomic Check-And-Set as a form of optimistic locking for updaters. (And there is significant work beyond just RCU in order to make that all work.) This means updaters don't have to wait for a lock but they may have to retry their updates (but the number of retries is bounded by the number of threads and, in practice, is almost always very small).

        Sounds like it never got folded in to java.util, though. I'm a bit curious why but haven't yet spent the time to try to find any discussions on that.

        The discussion about how the size of the hash is tracked indicates that having a single generation number could be a bottleneck when having a huge number of simultaneous updates. He tracks the number of entries in the hash by having a striped counter where each thread will only update one (partial) count and to get the count you have to add up all of the partial counts. So the multiple (cyclic) generation counts in a single 64-bit word that you mentioned would likely be a bottleneck for huge numbers of threads. There are probably tons of use cases where that ends up mattering very little, though.
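
        For the curious, the striped-counter idea in miniature, as a hedged Perl ithreads sketch (fixed stripe count, slot chosen by thread id; the real thing uses per-thread slots and CAS rather than locks):

            use threads;
            use threads::shared;

            use constant STRIPES => 16;
            my @count :shared = ( 0 ) x STRIPES;

            sub count_inc {                    # writers touch only their own stripe
                my $slot = threads->tid() % STRIPES;
                lock( $count[ $slot ] );       # contend only on that one element
                ++$count[ $slot ];
            }

            sub count_get {                    # readers sum the stripes; the total may be slightly stale
                my $total = 0;
                $total += $_ for @count;
                return $total;
            }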

        - tye        

      RCU does it that way, so reading up on how it works may give you some ideas.

      Ignore this. I started to respond negatively to this suggestion then realised that I didn't yet fully understand the implications of the thing and how it might be adapted to my situation.


Re: Design thoughts: iterator invalidation
by Anonymous Monk on Feb 20, 2015 at 16:11 UTC
    I'm beginning to suspect that there is no one right answer to it

    That would be my thought too :-) I think which operations can be supported without invalidating the iterator depends very much on how the underlying data structure is implemented. Also the solution depends on your requirements: when do you want the iterators to be invalidated - do you really need every thread to always see the latest version of the data, or are some changes (you mention value updates) acceptable? Plus, there might be performance considerations (i.e. lock the whole hash vs. selective locking for selected operations)?

    Anyway, I just thought I'd mention that I've found Java's java.util.concurrent classes fairly well thought out, just for example ConcurrentHashMap or CopyOnWriteArrayList, perhaps that could serve as a bit of inspiration...

      I just thought I'd mention that I've found Java's java.util.concurrent classes fairly well thought out, just for example ConcurrentHashMap or CopyOnWriteArrayList, perhaps that could serve as a bit of inspiration...

      Trouble is, when it comes to iterators, the designers of ConcurrentHashMap seem to have abdicated their responsibility. The docs say:

      The view's iterator is a "weakly consistent" iterator that will never throw ConcurrentModificationException, and guarantees to traverse elements as they existed upon construction of the iterator, and may (but is not guaranteed to) reflect any modifications subsequent to construction.

      And I can find no mention at all of what happens if another thread resizes the thing when the iterator is half way through. They even proudly announce: "and there is not any support for locking the entire table in a way that prevents all access."

      I translate that as: this is a hard problem, so we didn't make any attempt to handle it in a clean way; nor to detect possible inconsistencies; nor even provide a way for you to detect them; nor even a way for you to prevent them.

      But oh, they did think to provide a method to look up keys from their values: containsValue(Object value) -- "Returns true if this map maps one or more keys to the specified value. Note: This method requires a full internal traversal of the hash table, and so is much slower than method containsKey."

      A bit like a gun maker providing a tool for looking down the barrel to see if it is loaded; and attaching a note: Be careful!



        It's been quite a while since I last worked with it so this is mostly IIRC, but if I am understanding the documentation correctly, methods like keySet essentially return a snapshot of the key set at the time the iterator was created, so I would think that if the hash is resized by another thread the Iterator would iterate over the "old" key set, and only "may" reflect subsequent changes. And while not quite explicitly stated, because of this snapshot idea, I don't think the Iterators would ever return a key twice or omit a key that has not been modified since the creation of the Iterator.

        As for locking the entire structure, in Java that's pretty easy to accomplish by a synchronized block over some object (such as the hash table itself), and/or by using synchronizedMap from Collections, although of course locking the entire table is not particularly efficient if done often.

        I think that's some of the reasoning behind the java.util.concurrent classes: provide a data structure that can safely be used by multiple threads, while still providing decent performance - probably the docs for the package say it best:

        The "Concurrent" prefix used with some classes in this package is a shorthand indicating several differences from similar "synchronized" classes. For example java.util.Hashtable and Collections.synchronizedMap(new HashMap()) are synchronized. But ConcurrentHashMap is "concurrent". A concurrent collection is thread-safe, but not governed by a single exclusion lock. In the particular case of ConcurrentHashMap, it safely permits any number of concurrent reads as well as a tunable number of concurrent writes. "Synchronized" classes can be useful when you need to prevent all access to a collection via a single lock, at the expense of poorer scalability. In other cases in which multiple threads are expected to access a common collection, "concurrent" versions are normally preferable. And unsynchronized collections are preferable when either collections are unshared, or are accessible only when holding other locks.

        Most concurrent Collection implementations (including most Queues) also differ from the usual java.util conventions in that their Iterators provide weakly consistent rather than fast-fail traversal. A weakly consistent iterator is thread-safe, but does not necessarily freeze the collection while iterating, so it may (or may not) reflect any updates since the iterator was created.

        So I think it's a question of the requirements: If changes to the data set need to be seen immediately by all threads, some central lock is probably necessary, but if not, then an interface like this one might be a solution.

Re: Design thoughts: iterator invalidation
by SimonPratt (Friar) on Feb 23, 2015 at 12:03 UTC

    Wow, this is ambitious / tough :-)

    My immediate thought is to have a look at the way MS has implemented their in-memory OLTP database technology. Essentially the idea is that every change is saved and linked through the use of timestamps. The current value of something is maintained and updated only upon completion of transactions. It's really memory intensive, however it is ACID and blisteringly fast.

    Back to your requirements: keeping a data set that maintains a record of transactions in time means that you can create an iterator that will return the contents of the hash as it existed at the time of the iterator's creation, which I think would prevent a whole lot of these issues. Making your object time-aware then gives you options that allow users of the object to select the behaviour (eg. let them choose whether they want to iterate completely through the hash without anything updating, or iterate until the live hash was updated more than x milliseconds ago, or whatever other logic you want such as keys deleted/added, or even iterate only over live data). When the iterator is invalidated is then controlled by the logic of the choice the user makes.
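
    Not MS's engine, obviously, but a single-process Perl toy of the 'as it existed at creation time' idea (no deletes, no threading; every name here is made up for illustration):

        my %store;     # key => [ [ $version, $value ], ... ], oldest first
        my $clock = 0;

        sub put {
            my ( $k, $v ) = @_;
            push @{ $store{ $k } }, [ ++$clock, $v ];
        }

        # Iterate the hash as it stood when the iterator was created; later
        # additions and updates are simply not seen.
        sub snapshot_iterator {
            my $asof = $clock;
            my @keys = keys %store;
            return sub {
                while ( defined( my $k = shift @keys ) ) {
                    my ( $newest ) = grep { $_->[0] <= $asof } reverse @{ $store{ $k } };
                    return ( $k, $newest->[1] ) if $newest;   # skip keys born after $asof
                }
                return;
            };
        }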

Re: Design thoughts: iterator invalidation
by locked_user sundialsvc4 (Abbot) on Feb 23, 2015 at 01:37 UTC

    I suppose that my thoughts are [completely ...] caught off-guard by the notion of “iterating through” a hash, which I consider to be a strictly unordered structure.   I would therefore consider “the keys” to be strictly a set, not a sequence.   Iterators, therefore, would eventually return all of the keys, but would not be guaranteed to do so in any particular order.   There could be no concept of a “repeatable read.”

    Edit:   You describe this as a hash-like structure.   If it is not, in fact, entirely based upon Perl5 “hashes,” but is instead some kind of truly awesome contrivance for storing (key, value) pairs ... which in your case would not surprise me at all ... then these comments might not apply in the slightest.   I know this.   As do you.

    In the “simplistic” case of “real Perl5 hashes,” I would suggest merely letting the iterators run as they now do.   Don’t try to do anything to remedy the situation because, I think, you really can’t.   If an iterator is running over a volatile data structure, it might encounter some keys more than once and/or encounter keys not-at-all.   If it is traversing a section of the data structure that has vanished beneath its feet, it has nothing that it can do but to return undef, and the threads must live with what they’ve got.   (If it would be meaningful for them to know the difference, I would handle this by throwing an exception instead of returning undef.)

    If what you actually need is for the threads to, say, start-over when an insertion or a deletion occurs, maybe the most efficient mechanism would be to send the threads a signal, which would cause each of them to set some kind of a flag that causes them to break out of their current iterator loop, reset the flag, acquire a new iterator, and start-over using it.   (Knowing that any of them could have already encountered a problem.)

    Basically, to me, it comes down to this question (the answer to which, I cannot know):   how tolerant are the threads to the possibility that they might be served-up the same key twice, and/or that they might not be served up all of the keys that existed?   How tolerant can they be of the possibility that a particular use of the iterator might be drastically shortened?   Or, to put it another way but perhaps a much simpler way:  how tolerant can they be, of the possibility that the iterator’s outcome might be (meaningfully) wrong, when weighed against the probability that such an event might occur [undetected ...] during any particular thread’s use of an iterator?   (And, how much wall-time does any particular thread’s complete use of an iterator actually take?)

    M-a-y-b-e you could try this approach:   maintain a counter that is incremented every time an insert/delete event occurs, and have each thread capture that value before and then after they use an iterator.   If, at the completion of an iterator loop (and regardless of when or how it ended ...), the thread found that this value had changed, it knows that the results are suspect:   perhaps the thread throws-away its results and tries again.   Otherwise, it knows that the results it just obtained are valid, and it records them.   (Variations on this theme, in which the threads check “periodically” at some agreed-upon interval to reduce lock-contention time, are obvious.)
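
    In code, that before-and-after capture might look something like this hedged Perl sketch (a shared $changes counter bumped by every insert/delete; $make_iter returns a fresh iterator each time; all names are illustrative):

        use threads::shared;

        my $changes :shared = 0;    # writers increment this on every insert/delete

        sub iterate_validated {
            my ( $make_iter, $process ) = @_;
            while ( 1 ) {
                my $before = $changes;
                my $next   = $make_iter->();
                my @results;
                while ( my ( $k, $v ) = $next->() ) {
                    push @results, $process->( $k, $v );
                }
                return @results if $changes == $before;   # nothing changed underneath us
                # otherwise the results are suspect: discard them and go round again
            }
        }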

    Here, you are specifically “checking for what-it-is that you are afraid of,” optimistically hoping that you will find that it has not occurred, instead of pessimistically assuming that it did.   “It’s cheaper/easier/faster to get forgiveness than permission.”   Instead of designing the iterator-pool to compensate for threads who are obliged to know nothing ... and to “compensate” in a way that just might wind up racing with all of them ... you instead take the approach of always hoping that the threads will run the gauntlet successfully.

    In a high-performance situation, I rather like this alternative . . .

      You sad ^&^%&*&*%^. You simply cannot help yourself can you.

      1. 'Iteration' does not imply 'order'.

        Even an inherently ordered array can be iterated out-of-order: print $a[ $_ ] for shuffle 0 .. $#a; (shuffle from List::Util)

        You've obviously never even used a Perl hash. Why am I not surprised. I seriously doubt if you have ever written a single, complete, working program in Perl.

        (I'm seriously beginning to doubt if you've ever done so in any language.)

        Which just reinforces the question as to why you feel the need to continue hanging around this place; let alone sticking your confused and muddled oar in all the time.

      2. The rest is just regurgitated-out-of-context, meaningless garbage.

      Please stop posting. (In my threads at the very least!)

