halfbaked has asked for the wisdom of the Perl Monks concerning the following question:

I'm in the design stage for a new application and I wanted to bounce something off the Perl Monks.

I need to have multiple people and applications access the same data set and I need to manage data writes effectively so that one user doesn't write over someone else's changes.

The core of the application would be a Perl web service. I'm assuming that I'll write it with mod_perl because that's what I've done in the past and it worked really well.

But with this application, some clients such as an Android device will have a local copy of a subset of the data. So in theory multiple users could have a cached copy of the data and they could both save changes to the data at roughly the same time and the last one to write would win and overwrite the other's changes.

I'm not going to have hundreds of users accessing the same data, more like 10 at the most, but I don't want to confuse the user. I want some of the clients to have their own local copies of the data so that if they go offline they can still use the application, and when they come back online they can resync their changes with other users' changes.

My question is: is there a Perl module, or something Perl can do, that can help me deal with this data race/synchronization problem? And if not a Perl solution, perhaps someone could point me to a document online with a design pattern that will work for me.

It's not that I don't have some idea of how to solve this, but I'm never disappointed with the answers I get from Perl Monks.


Replies are listed 'Best First'.
Re: How to deal with data race issues with Perl?
by JavaFan (Canon) on Dec 28, 2010 at 16:00 UTC
    You first have to decide how you want to resolve conflicts, before looking for existing solutions.

    Suppose you have two users, one changes "foo" into "bar", while at the same time another changes "foo" into "quux".

    What do you want the final result to be?

    For cases like one user changes chapter 1, and another user changes chapter 2, you could consider using git. But that will throw conflicts back to the user.

    Unless you know what you want to do in case of conflicts, there's nothing we can offer you.

      I'm leaning towards either asking the user what to do, or to just use the last write as the winner and not worry about it.

      The data is not super critical, it's important, but if someone had to do something more than once, it wouldn't be that big of a deal.

        I'm leaning towards either asking the user what to do, or to just use the last write as the winner and not worry about it.
        If you're going to ask the user, you'll have to make a UI where the user can edit the conflict. You may want to have a look at various wikis, which have to deal with this as well.

        If the last write is going to be a winner, just use git, with something like git merge -s recursive -X ours. (Or -X theirs, depending on which way you're merging).

Re: How to deal with data race issues with Perl? (send pre and post)
by tye (Sage) on Dec 28, 2010 at 21:16 UTC

    For each field to be updated, send what the user changed it to and what the user started from (likely the cached value). Only send fields that were presented to the user to be updated (and identifying information, of course). But you may want to send fields shown to the user that the user chose not to make any changes to.

    Then the service can decide what makes sense to do with a change request. I find two key aspects to making such decisions: Merging changes to different fields and merging changes to the same field.

    In a lot of cases, me changing one field ("author", for example) when some other user changed some other field ("title", for example) should cause no conflict. Just apply both changes.

    In rarer cases, you might want to enforce fields being updated as a group. You might not want to allow me to change the price at the same time as somebody else changes the "qualifies for free shipping" flag, for example. But I tend to think these cases are rarer than a lot of software designers assume.

    I find it annoying, for example, that bugzilla refuses to merge my changes with somebody else's when I mark the bug as "resolved" at the same time as my PM bulk updates some project parameters like target milestone or release date. I'd rather have my changes merged and just be notified so I can resolve any merge conflicts with another update when needed.

    If you have a rare case of fields whose updates should be grouped, then it might become important to send the fields shown to the user that the user chose not to update. In fact, this is one general, conservative method for determining when field updates should be grouped. If I update the price from a form that doesn't even show the "qualifies for free shipping" flag, then surely the value of that flag shouldn't matter as far as my update is concerned, even if the value of that flag changed while I was making my update.

    Second, for each field, apply the differences to that field. For large text fields this problem boils down to the classic text merge that is done by revision control systems such as git. If I make a change that only touches the 2nd paragraph while your change only touches the 4th paragraph, both changes can be applied w/o conflict rather easily. Though, web-based text entries tend to not be "line-based" and so traditional "diff" and "merge" approaches usually take some work to adapt while avoiding potentially large CPU resource consumption.

    For other types of fields, other "merge" methods might be appropriate. For example, at PerlMonks, most of the numeric fields that get updated by user activity are counts so my updating a node's reputation from 4 to 5 (because I upvoted it) at the same time as you update it from 4 to 5 (by upvoting also) should result in the reputation being set to 6. The updates are viewed as increment / decrement operations and can be silently combined.
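
    A minimal DBI sketch of that increment-style update (the table and column names are made up for illustration); sending the delta means two simultaneous upvotes combine to +2 instead of clobbering each other:

        use DBI;

        my $dbh = DBI->connect('dbi:SQLite:dbname=app.db', '', '', { RaiseError => 1 });
        my $node_id = 42;    # hypothetical node being upvoted

        # Send the change as a delta, not as "SET reputation = 5".
        $dbh->do(
            'UPDATE node SET reputation = reputation + ? WHERE node_id = ?',
            undef, +1, $node_id,
        );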

    But the most common case is to just apply the update to the field, overwriting the previous update. The first user usually can't tell that the second update didn't happen right after their update (by a user who saw the first update) rather than "at the same time" (by a user who didn't see the first update).

    Implicit in this advice is that you don't do the "worst case" type of update (like PerlMonk's original node cache and Ruby's Active Record chose to) and send as the update "the entire record/object after the user's changes" and then "save" that record/object whole. That easily leads to me adding something trivial and blowing away any number of important changes (even my own changes because I later made a update of something trivial from a stale web page).

    But I also think you should almost always avoid the other end of the spectrum where you note that something trivial changed while I was doing my editing and therefore refuse to accept my changes. Making it trivial for me to, after reviewing the changes I had missed, just reapply my original changes, makes this end of the spectrum less horrid, but it still is too cautious, IMHO. You'll get people losing significant contributions because they didn't notice the conflict notification quickly enough and getting annoyed when forced to jump through this extra hoop over trivia.

    The most conservative I would usually go would be to apply my updates that might be in conflict and give me a notice of the potential conflict and let me apply more updates in the rare case when such are warranted.

    In summary, the common case is pretty simple. If the current value matches the user's "starting value" (before their edit), then just update the field with their edited value. If starting values don't match, you probably update anyway (but only if the starting and ending values are different) but might instead do a "merge" update or notify the user that their update overwrote UserX's change of "$field" from "$old" to "$new".
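
    Something like the following Perl sketch captures that summary; the request format (old/new value per field) and the notification callback are assumptions for illustration, not anything the OP has specified:

        use strict;
        use warnings;

        # $current: what the server holds now; $request: per field, what the user
        # started from and what they changed it to; $notify: how to report overwrites.
        sub apply_update {
            my ($current, $request, $notify) = @_;
            my %result = %$current;

            for my $field (keys %$request) {
                my ($was, $now) = @{ $request->{$field} }{qw(old new)};
                next if $was eq $now;              # the user didn't touch this field

                if ($current->{$field} eq $was) {
                    $result{$field} = $now;        # common case: no race, just update
                }
                else {
                    $result{$field} = $now;        # apply anyway, but flag the overwrite
                    $notify->("$field changed from '$was' to '$current->{$field}' "
                            . "while you were editing; your value '$now' won");
                }
            }
            return \%result;
        }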

    Logs of recent changes can be immensely helpful in such systems.

    - tye        

Re: How to deal with data race issues with Perl?
by Corion (Patriarch) on Dec 28, 2010 at 16:01 UTC

    The lo-tek solution I've used is to have a "time of last change" sent out and submitted back with each dataset. If the time of "last change" in the database and the time of "last change" (re)submitted are different, the change is rejected because there is a conflict.
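
    In SQL terms, that check can go into the UPDATE itself; a rough DBI sketch (table, columns, and variables are invented for illustration):

        use DBI;

        my $dbh = DBI->connect('dbi:SQLite:dbname=app.db', '', '', { RaiseError => 1 });
        my ($widget_id, $new_description, $stamp_from_form) = @ARGV;   # values submitted with the form

        my $rows = $dbh->do(
            'UPDATE widget SET description = ?, last_changed = ?
               WHERE widget_id = ? AND last_changed = ?',
            undef, $new_description, time(), $widget_id, $stamp_from_form,
        );
        if ($rows == 0) {
            # Someone changed the record after this client loaded it:
            # reject the change and report the conflict back to the user.
        }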

    The high-tech solution would be something like version control, but for that, you really need to be far more specific as for what kind of data you have and what operations are to be performed on it. The two "easy" solutions I can envision would be to use something like git to handle the merging of two datasets, or alternatively, recording all changes to the dataset (instead of the result), and then replaying all changes on the server when a client synchronizes. Of course, conflicts will happen. Conflict resolution is likely something that needs a social/organizational solution.
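
    A toy illustration of the "record all changes and replay them" variant (the data and record structure are invented):

        my %server = ( title => 'foo', author => 'me' );   # the server's current copy
        my @conflicts;

        # Each client sends back the operations it performed, not the final snapshot.
        my @change_log = (
            { field => 'title', old => 'foo', new => 'bar',  when => 1293550000, who => 'alice' },
            { field => 'title', old => 'foo', new => 'quux', when => 1293550100, who => 'bob'   },
        );

        # Replay in timestamp order; a conflict is any change whose starting value
        # no longer matches what the server holds.
        for my $change (sort { $a->{when} <=> $b->{when} } @change_log) {
            if ($server{ $change->{field} } eq $change->{old}) {
                $server{ $change->{field} } = $change->{new};
            }
            else {
                push @conflicts, $change;   # left for a person to resolve
            }
        }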

      I actually think conflicts will be fairly infrequent, so just asking the user what they want to do is probably the easiest and least confusing option for them.

      Or just make the last write the one that wins and just do my best to educate users as to why the data they saved isn't there anymore.

      The data will be tiny chunks that are not earth-shattering if they get lost.

Re: How to deal with data race issues with Perl?
by Your Mother (Archbishop) on Dec 28, 2010 at 18:35 UTC

    Sounds like an HTML5 app on the client side? You can use SQLite in the client for offline and also on the backend for your main app if so.

    Something I have done for edit collisions is to send a digest of the record with the form. It becomes a "known last edit token."

    In DBIx::Class/Catalyst, e.g., something like-

        # Token is sent with form.
        my $token = md5_hex(join("", sort $row->get_columns));
        if ( $token eq $c->request->body_params->{token} ) {
            # Record hasn't been updated by anyone since the form was loaded.
            # Continue with edit/update...
        }
        else {
            # Warn about edit collision, pretty format diffs/choices
            # with Algorithm::Diff/Algorithm::Diff::Apply.
        }

    As far as mod_perl goes, your deployment options will be limited. Unless you have an actual need for deep hooks in the Apache request cycle, there are "better" options today; nginx or lighttpd with Starman, for example. I'd encourage you, strongly, to check out Plack and the various web app frameworks like Catalyst, Dancer, and several other new ones.

    Update: fixed syntax error in dummy code.
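
    For a taste of what Plack expects, a minimal PSGI app is just a code reference; save something like this as app.psgi and serve it with plackup or starman:

        # app.psgi -- the smallest possible PSGI application
        my $app = sub {
            my $env = shift;
            return [ 200, [ 'Content-Type' => 'text/plain' ], [ "Hello from PSGI\n" ] ];
        };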

Re: How to deal with data race issues with Perl?
by jethro (Monsignor) on Dec 28, 2010 at 16:30 UTC

    If you can, just store both alternatives and let the users (any user who thinks he knows best, not just one of the two having the conflict) resolve that with another edit. Software that uses the data should always take the last of the alternatives (if they are ordered by date).

    But there is a worst case where you might have to store more than two alternatives; you can manage this by allowing only a maximum number, and after that drop the oldest alternative to make room for the newest.
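
    A rough sketch of that "keep the alternatives, cap how many" idea in Perl (the record layout is made up):

        use constant MAX_ALTERNATIVES => 5;

        sub add_alternative {
            my ($record, $field, $value, $who) = @_;
            my $alts = $record->{alternatives}{$field} ||= [];
            push @$alts, { value => $value, who => $who, when => time() };
            shift @$alts while @$alts > MAX_ALTERNATIVES;   # drop the oldest
            return $alts->[-1]{value};                      # readers take the latest
        }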

    About the race condition issue. I don't know if this might be a problem with mod_perl, but if it is, you can use either file locking or a database engine. Both methods will make sure that you can't corrupt the data if two users try to write at the same time.
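
    For the file-locking route, Perl's built-in flock is enough to serialize writers (the filename is just an example):

        use Fcntl qw(:flock);

        open my $fh, '+<', '/var/data/shared.dat' or die "open: $!";
        flock $fh, LOCK_EX or die "flock: $!";   # blocks until we hold the exclusive lock
        # ... read, modify and rewrite the data while holding the lock ...
        flock $fh, LOCK_UN;
        close $fh or die "close: $!";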

Re: How to deal with data race issues with Perl?
by ELISHEVA (Prior) on Dec 29, 2010 at 13:45 UTC

    When there are conflicts on a particular field (or set of cooperating fields), date isn't the only issue. Sending a conflict back to the user is only helpful if the user knows which value is right or can easily find out. In that case, who made the conflicting update might matter, and possibly their authority as well. You might also need to do some workflow analysis.

    For example, if "final contract price" is being updated, and two sales team members of equal standing enter different prices, resolving the conflict might require a telephone call to the other committer. If there is more than one person in contact with the client, it might not be obvious who that is unless the conflict report tells me. On the other hand, if a person lower down discovers that their local copy has a value that is out of sync with one set by the team leader, the change might be rejected automatically, but the team leader might be sent a report just in case he or she had delegated the final decision to that team member and did want to accept the change.

      You paint a scary picture, but also one I've never seen implemented (nor do I expect to ever see such). A system where I'm allowed to update a field but not if somebody of higher standing than me has updated it? ("at the same time"? or just "recently"? or "ever"?) That sounds like a nightmare design and I can't fathom what purpose it would serve. It makes no sense to distill what changes to apply when a race occurs based on each author's standing.

      - tye        

        I've seen the requirement in sales support systems and also a campaign finance management system.

        In sales systems, I think the requirement is pretty rare. Data can be localized with the potential for update issues when a company uses a travelling sales team with on-site visits, but in that case many sales teams try to have a sole contact person handling financial commitments. Everyone else just says - call the account exec. When multiple people are interacting with a client and have some authority to make financial arrangements (i.e. a call center), the data is not normally distributed, so the possibility of the main copy and a local copy both changing at the same time isn't a concern. But that is just speculation. I haven't developed a call center system. Maybe some of the larger ones do have multiple physical centers and distributed data.

        Campaign finance is another matter. There is a per-person limit on campaign donations. I recall about 20+ years ago designing a system for congressional campaigns. Transactions could be entered locally by an on-site team, but the donation wouldn't get posted to the final central accounts unless the donor's total was below the limit. Instead a notification was sent to the campaign finance officer. Officially, authority level allowed one to resolve the problem by choosing one donation and rejecting another, by returning a check, or by sending a letter explaining why the pledged amount had to be reduced. What actually happened was another matter. (We called it the sleaze module.)

        But today, this situation likely would not occur (sleaze aside) because it is nearly trivial to just have a central web application and set up the fund raising event with an internet connection. Back in the 1980s, data needed to be distributed because transferring data at 300 baud is SLOW. It wasn't practical to set up dedicated T1 lines at each and every travelling campaign event. Off-line distributed systems and ugly merge processes were the result.

        Timing issues aside, authority-based updating rules are not uncommon in financial systems. Another system I was involved in had a rule whereby expense account transactions would be rejected (and sent on to the boss) based on the person's authority level, the amount of the expense, and the total amount of expenses posted in the last week or month. The failed posting could be overridden by the boss.

        Looking over these examples, I think timing+authority based requirements are more likely to occur when adding/removing records to a set of child records rather than when values are changed in a non-derived field on the parent record. Multiple child records are much more likely to come from a variety of distributed sources than normalized data on the parent. If there is one "thing" it is possible to assign one person to manage that one thing ("ask the account exec") and give them sole editing rights.

Re: How to deal with data race issues with Perl?
by sundialsvc4 (Abbot) on Dec 30, 2010 at 14:18 UTC

    One very useful concept that ought to be tossed-in to the design of such a system is that of generations.   Each time a change to a record is accepted, a new “generation” of that record is created, and if you ask for “the record” you are given the latest generation of it, but all of the generations are kept.

    You also aggressively use the notion of “unique identifiers,” or as I like to call them, monikers.   Monikers are associated with the identity of the base-document, and other monikers are also associated with every generation.   So, when a particular user wants to synchronize, there is never any ambiguity to figure-out.   The client software can hand you an unambiguous list of exactly what he’s holding, and can likewise refer unambiguously both to “the thing being updated” and to “the update itself.”   (If you use a well-known algorithm such as UUID, notice that clients can generate monikers too, and those monikers will never “crash” against anybody else’s.   A provably-strong message digest algorithm, such as SHA1, provides a useful field-value that can be used to detect differences between the various stored items.)

    I believe that these are two most-important elements of any data synchronization design:   retaining all copies of the information, and removing all sources of ambiguity.   Message-digests and UUID-based monikers are completely unambiguous, and they’re short-n-sweet.
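
    A minimal sketch of such a generation record, assuming Data::UUID for the monikers and Digest::SHA for the digest (the field layout is invented for illustration):

        use Data::UUID;
        use Digest::SHA qw(sha1_hex);

        my $ug = Data::UUID->new;

        sub new_generation {
            my ($document_moniker, $content) = @_;
            return {
                document   => $document_moniker,   # identifies the base document
                generation => $ug->create_str,     # identifies this particular revision
                digest     => sha1_hex($content),  # detects whether two copies differ
                content    => $content,
                created    => time(),
            };
        }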

      How does this help solve the problem the OP is describing: two users, each having a local copy, modify the data, and now want to resync?

        I think his suggestion was offered in the spirit of raising additional design issues for consideration. Does the client need only the merged view or both a history of changes and the merged view? It affects how the resync process is carried out and what data is stored after it is done.

        Full audit trails (who wanted to make what change when) are a legal and liability necessity in certain accounting and legal/corporate document editing systems. The OP has not stated the nature of the data nor the business process.

        Secondly, he is proposing an architecture that allows separation of concerns - (a) the making and tracking of changes (b) the merging of changes into a cohesive view. If each separate change request is recorded, the algorithm for merging them can be changed over time based on developing user requirements.

        Separation of concerns can make this an attractive base architecture when the user's requirements are in flux due to an internal learning curve for the business process in question or when needs are liable to change due to a fast changing business environment. On the other hand, it costs more to implement and increases data storage demands, so a base requirement for an audit trail in addition to the merged view is probably in order before pursuing such a design.