MD5(Timestamp+message)
That's an idea, thanks. If both timestamp and contents of two messages coincide, then the messages are, indeed, same. (Though I think that SHA1 is better in terms of collisions, and I won't employ SHA2 there because cryptographic integrity will be protected by SSL.)
Also, I still need a way to distinguish the messages as they are delivered to different users (I need some time to prepare an answer for you, BrowserUK).
Then you sync by comparing lists of the hash, which is conveniently fixed length and easily ordered.
So, part of the question is comparing two sorted arrays of 20-byte numbers. Thank you, now I understand better how tools like rsync work. The open part of the question is: which shortcuts I can apply to avoid exchanging full lists of messages? (for me, the worst case would be (1e5 * 20) = 2M of data per sync attempt, which is okay if performed rarely, but quite a lot - and may require resuming - on a EGDE connection)
| [reply] |
MD5(Timestamp+message)
I strongly suggest that you keep the timestamps outside of (separate from) whichever digest function you use. Digests are inherently unorderable with respect to time.
The open part of the question is: which shortcuts I can apply to avoid exchanging full lists of messages?
In the following assume: messageNo is a monotonically increase number from oldest to newest; messageId is timestamp+digest[+deviceId+deviceId].
- Swap counts; same and you're done.
- Otherwise; swap messageId of the message at messageNo (smaller count/2) from the beginning.
- Same; add (smaller count/4) to previous messageNo (smaller count/2) and exchange messageIds. goto step 3.
- Different; subtract (smaller count/4) from previous messageNo and exchange messageIds. goto step 3.
Basically; do a binary search (oldest->newest) to locate the first difference; and then move forward from that point; exchanging first messageIds; then full messages if needed.
| [reply] |
I strongly suggest that you keep the timestamps outside of (separate from) whichever digest function you use. Digests are inherently unorderable with respect to time.
I've been a little busy this week so didn't bother looking back at this thread, but my (unstated) reasoning behind putting the timestamps inside the hash was for easy sortability on a single field, at the expense of easy time ordering. It also guarantees that short messages that may be have dupes in the past or future get unique digests. It might be reasonable to include the timestamp both inside the digest and outside as a separate column, depending on the application.
| [reply] |