in reply to Re^4: Date Array Convolution
in thread Date Array Convolution

The tally method lent itself to your application because you identified your dates as being DDHHMM.

This meant that an entire month could be covered, at a granularity of one minute, by an array of 60*24*31 = 44,640 elements.
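For concreteness, a minimal sketch of such a tally (not your actual code; the subroutine name and the sample ranges are assumed for illustration):

```perl
use strict;
use warnings;

# Map a DDHHMM stamp to a zero-based minute offset within the month.
sub ddhhmm_to_min {
    my ($stamp) = @_;
    my ( $d, $h, $m ) = $stamp =~ /^(\d\d)(\d\d)(\d\d)$/
        or die "bad stamp: $stamp";
    return ( ( $d - 1 ) * 24 + $h ) * 60 + $m;    # 0 .. 44_639
}

my @tally = (0) x ( 31 * 24 * 60 );    # one slot per minute of a 31-day month

# Bump the counter for every minute covered by each range.
my @ranges = ( [ '010005', '010022' ], [ '010010', '010030' ] );
for my $r (@ranges) {
    my ( $s, $e ) = map ddhhmm_to_min($_), @$r;
    ++$tally[$_] for $s .. $e;
}

# Any slot holding 2 or more is covered by overlapping ranges.
print "minute $_ overlaps\n" for grep $tally[$_] > 1, 0 .. $#tally;
```

A single O(N) pass over the ranges fills the tally, and a single pass over the tally finds every overlap.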

If you needed to deal with a year at a time, then a tally stick of 366*24*60 = 527,040 elements is still feasible -- especially if you move to using vec to store the values as an array of 8 or 16-bit ints. But the problem is that the roll-over from one month to the next is then complicated by the fact that months have differing numbers of days.
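A sketch of the vec variant, assuming 8-bit counters suffice (i.e. no minute is covered by more than 255 ranges):

```perl
use strict;
use warnings;

# The tally held in a packed string via Perl's built-in vec(),
# 8 bits per minute: a full leap year at minute granularity costs
# ~515KB instead of half a million Perl scalars.
my $minutes = 366 * 24 * 60;          # 527,040 slots
my $tally   = "\0" x $minutes;        # one byte per minute

sub bump {                            # add 1 to every minute in [$s, $e]
    my ( $s, $e ) = @_;
    vec( $tally, $_, 8 )++ for $s .. $e;    # vec() slots are lvalues
}

bump( 100, 200 );                     # illustrative minute offsets
bump( 150, 250 );

print "overlap from minute 150 to 200\n"
    if vec( $tally, 150, 8 ) > 1 and vec( $tally, 200, 8 ) > 1;
```

The logic is identical to the array version; only the storage changes.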

But trying to use the epoch (by which I assume you mean the *nix epoch of 1/1/1970) means that the tally would have to be over 22 million elements long, which is just silly. And it would be sillier still if you were using the Windows epoch of 1/1/1601!

If you really cannot break your data sets up into one-month subsets, then you are going to have to go back to the O(N^2) process of comparing every range with every other range to detect your overlaps, rather than the O(N) method of the tally stick. This is because once the set of ranges can span month and year boundaries, you will need to handle all the messy details of date arithmetic: variable-length months, leap years and so on.
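A sketch of that fallback, using the core Time::Local module for the date math (the field order and the sample ranges here are assumptions):

```perl
use strict;
use warnings;
use Time::Local qw( timegm );

# Convert full dates to epoch seconds so ranges spanning month and
# year boundaries compare as plain integers.
sub to_epoch {
    my ( $y, $mon, $d, $h, $min ) = @_;
    return timegm( 0, $min, $h, $d, $mon - 1, $y );    # months are 0-based
}

# [ start, end ] pairs; the first deliberately crosses a month boundary.
my @ranges = (
    [ to_epoch( 2011, 10, 30, 23, 50 ), to_epoch( 2011, 11, 1, 0, 10 ) ],
    [ to_epoch( 2011, 11, 1, 0, 0 ),    to_epoch( 2011, 11, 1, 6, 0 ) ],
    [ to_epoch( 2011, 11, 2, 0, 0 ),    to_epoch( 2011, 11, 2, 1, 0 ) ],
);

# O(N^2): compare every pair of ranges.
my @overlaps;
for my $i ( 0 .. $#ranges - 1 ) {
    for my $j ( $i + 1 .. $#ranges ) {
        # Two ranges overlap iff each starts before the other ends.
        push @overlaps, [ $i, $j ]
            if $ranges[$i][0] <= $ranges[$j][1]
           and $ranges[$j][0] <= $ranges[$i][1];
    }
}
print "ranges $$_[0] and $$_[1] overlap\n" for @overlaps;
```

Once everything is epoch seconds, the leap-year and month-length mess is the date module's problem, not yours.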

Essentially, whether you should stick with the tally method or move over to proper date math depends entirely upon the frequency (and therefore the performance constraint) of your processing versus the difficulty of pre-filtering your ranges into by-month subsets.

If performance is important and the subsetting easy, then stick with the tally method and start-of-month offsets. If the subsetting is hard and the performance requirement low, then encode your dates using a proper date package and use full date arithmetic to detect your overlaps.
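If you do go the subsetting route, the pre-filter might look something like this sketch (the helper names are mine; it splits any range that crosses a month boundary so each bucket can be fed to the per-month tally):

```perl
use strict;
use warnings;
use Time::Local qw( timegm );

# 'YYYYMM' key for the month containing an epoch-seconds timestamp.
sub month_key {
    my @t = gmtime $_[0];
    return sprintf '%04d%02d', $t[5] + 1900, $t[4] + 1;
}

# Epoch seconds of the first instant of the following month.
sub next_month_start {
    my @t = gmtime $_[0];
    my ( $y, $m ) = ( $t[5] + 1900, $t[4] + 1 );
    ( $y, $m ) = $m == 12 ? ( $y + 1, 1 ) : ( $y, $m + 1 );
    return timegm( 0, 0, 0, 1, $m - 1, $y );
}

# Bucket [ start, end ] epoch-seconds ranges by month, cutting any
# range that spans a month boundary into per-month pieces.
sub bucket_by_month {
    my %bucket;
    for my $r (@_) {
        my ( $s, $e ) = @$r;
        while (1) {
            my $cut = next_month_start($s);
            last if $cut > $e;
            push @{ $bucket{ month_key($s) } }, [ $s, $cut - 1 ];
            $s = $cut;
        }
        push @{ $bucket{ month_key($s) } }, [ $s, $e ];
    }
    return %bucket;
}

# A range spanning Oct 30 23:50 .. Nov 1 00:10 lands in two buckets.
my %by_month = bucket_by_month(
    [ timegm( 0, 50, 23, 30, 9, 2011 ), timegm( 0, 10, 0, 1, 10, 2011 ) ],
);
```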


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Re^6: Date Array Convolution
by choroba (Cardinal) on Nov 07, 2011 at 15:10 UTC
    This meant that an entire month period can be covered, using a granularity of minutes, by an array of 60*24*31=44,640 elements.
    In my solution, I only need the array of the size of 2 × the number of input ranges, because there can be no change in value if there is no start/end of a range.
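    For anyone following along, the boundary-events idea can be sketched like this (my own minimal rendering, not choroba's actual code):

```perl
use strict;
use warnings;

# Record +1 at each range start and -1 just after each range end,
# sort the 2 x N events, and sweep: the coverage depth can only
# change at a boundary, so only 2 x N points need storing.
my @ranges = ( [ 5, 22 ], [ 10, 30 ] );      # already minute offsets

my @events;
for my $r (@ranges) {
    push @events, [ $r->[0],     +1 ];       # range opens
    push @events, [ $r->[1] + 1, -1 ];       # range closes after its end minute
}
@events = sort { $a->[0] <=> $b->[0] } @events;

my ( $depth, $prev, @out ) = ( 0, 0 );
for my $e (@events) {
    push @out, [ $prev, $e->[0] - 1, $depth ] if $e->[0] > $prev;
    $depth += $e->[1];
    $prev = $e->[0];
}
# @out holds [ start, end, depth ] segments; depth > 1 marks an overlap.
```

    Sorting the events costs O(N log N), but the work no longer scales with the length of the period covered.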

      Cool! But you're comparing apples with oranges.

      Besides which, your solution contains bugs.

      Update: Dataset removed as it seems that PM cannot handle it, despite it appearing fine in preview.

      If you want the data set that I observed produces this incorrect output (note that the second-to-last range is invalid), I can email it to you:

      C:\test>junk88 935755.dat
      [
        ["010005", "010022", 41], ["010023", "012359", 0], ["020000", "022359", 0],
        ["030000", "032359", 0], ["040000", "042359", 0], ["050000", "052359", 0],
        ["060000", "062359", 0], ["070000", "072359", 0], ["080000", "082359", 0],
        ["090000", "092359", 0], [100000, 102359, 0], [110000, 112359, 0],
        [120000, 122359, 0], [130000, 132359, 0], [140000, 142359, 0],
        [150000, 152359, 0], [160000, 162359, 0], [170000, 172359, 0],
        [180000, 182359, 0], [190000, 192359, 0], [200000, 202359, 0],
        [210000, 212359, 0], [220000, 222359, 0], [230000, 232359, 0],
        [240000, 242359, 0], [250000, 252359, 0], [260000, 262359, 0],
        [270000, 272359, 0], [280000, 282359, 0], [290000, 292359, 0],
        [300000, 302359, 0], [310000, 312356, 0], [312357, 312356, 52],
        [312357, 312359, 27],
      ]

      Instead of this output:

      C:\test>935755 935755.dat
      [
        ["010005", "010022", 41], ["010023", "012359", 0], ["020000", "022359", 0],
        ["030000", "032359", 0], ["040000", "042359", 0], ["050000", "052359", 0],
        ["060000", "062359", 0], ["070000", "072359", 0], ["080000", "082359", 0],
        ["090000", "092359", 0], [100000, 102359, 0], [110000, 112359, 0],
        [120000, 121942, 0], [121943, 122359, 0], [130000, 132359, 0],
        [140000, 142359, 0], [150000, 152359, 0], [160000, 162359, 0],
        [170000, 172359, 0], [180000, 182359, 0], [190000, 192359, 0],
        [200000, 202359, 0], [210000, 211659, 0], [211700, 211741, 0],
        [211742, 212359, 0], [220000, 222359, 0], [230000, 232359, 0],
        [240000, 242359, 0], [250000, 252359, 0], [260000, 262359, 0],
        [270000, 272359, 0], [280000, 282355, 0], [282356, 282359, 0],
        [290000, 292359, 0], [300000, 300936, 0], [300937, 302359, 0],
        [310000, 311315, 0], [311316, 312042, 0], [312043, 312159, 0],
        [312200, 312356, 0], [312357, 312359, 27],
      ]
        Thanks for the dataset. I have updated my code. The only difference between our solutions now is that mine merges neighbouring intervals, e.g.
        [010010, 010020, 2], [010021, 010030, 2] becomes [010010, 010030, 2]
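        That merge step might look something like this sketch (the helper name is mine, and I use minute offsets rather than raw DDHHMM stamps so that "+1" arithmetic is safe across hour boundaries):

```perl
use strict;
use warnings;

# Collapse adjacent [ start, end, value ] triples when the second
# starts one minute after the first ends and both carry the same value.
sub merge_adjacent {
    my @merged;
    for my $iv (@_) {
        if (    @merged
            and $merged[-1][2] == $iv->[2]
            and $merged[-1][1] + 1 == $iv->[0] ) {
            $merged[-1][1] = $iv->[1];    # extend the previous interval
        }
        else {
            push @merged, [@$iv];         # copy, don't alias the input
        }
    }
    return @merged;
}

# [10,20,2] and [21,30,2] collapse into [10,30,2]; [31,40,0] stays.
my @out = merge_adjacent( [ 10, 20, 2 ], [ 21, 30, 2 ], [ 31, 40, 0 ] );
```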