
figure out how it works!

Okay. If you look at the code, you'll see it essentially consists of 3 consecutive loops:

  1. my @expd = map{ ... } @inputs

    This takes an input like ["010000", "010110", 6] and adds two columns to it to make ["010000", "010110", 6, 0, 70].

    I.e. it takes each of the day/hour/minute values, converts them to an integer (dhm2int()), and appends them. This gets all the messy string handling out of the way up front and simplifies all comparisons.
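    A minimal sketch of that conversion, assuming the helpers work like the d2i()/i2d() pair posted further down this thread (the names dhm2int/int2dhm come from the description above; the bodies are an assumption, not necessarily the original code):

```perl
use strict;
use warnings;

# Convert a 'DDHHMM' string to minutes since the start of the month
# (day 01 at 00:00 maps to 0).
sub dhm2int {
    my( $d, $h, $m ) = unpack '(A2)*', $_[0];
    return ( ( $d - 1 ) * 24 + $h ) * 60 + $m;
}

# Inverse: minutes since the start of the month back to 'DDHHMM'.
sub int2dhm {
    my $n = shift;
    return sprintf '%02d%02d%02d',
        int( $n / 1440 ) + 1, int( $n / 60 ) % 24, $n % 60;
}

# Append the two integer columns to one input record.
my @range = ( '010000', '010110', 6 );
push @range, dhm2int( $range[0] ), dhm2int( $range[1] );
print "@range\n";    # 010000 010110 6 0 70
```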

    Also, within that map there is a while loop. Its job is to check to see if the range being integerised spans a day boundary. If it does, it breaks the range into two (or more) ranges as required.
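    The day-boundary split can be sketched on its own like this. split_by_day() is a hypothetical name used only for illustration, and the ranges are inclusive minute offsets as produced by the integer conversion:

```perl
use strict;
use warnings;

# Split an inclusive [start, end] range of minute offsets so that
# no piece crosses midnight.
sub split_by_day {
    my( $s, $e ) = @_;
    my @pieces;
    while( int( $s / 1440 ) != int( $e / 1440 ) ) {
        # 23:59 of the day that contains $s
        my $day_end = ( int( $s / 1440 ) + 1 ) * 1440 - 1;
        push @pieces, [ $s, $day_end ];
        $s = $day_end + 1;    # 00:00 of the next day
    }
    return ( @pieces, [ $s, $e ] );
}

# Day 01 23:50 (1430) through day 03 00:10 (2890) becomes three pieces.
print join( ' ', map { sprintf '[%d,%d]', @$_ } split_by_day( 1430, 2890 ) ), "\n";
# [1430,1439] [1440,2879] [2880,2890]
```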

  2. The second block consists of two nested for loops that construct two parallel arrays @tally, @id from @expd.
    1. The outer loop runs over the list of ranges.
    2. The inner loop runs from start to the end of the range.
    3. At each position, if the current value on the "tally stick" (@tally) is undefined, or greater than the value of the current range, replace it with the value from the current range and record the id (index) of the current range in the parallel array @id.

    At the end of the (double) loop @tally looks like this:

    6 6 6 6 6 1 1 1 1 1 1 1 1 1 1 1 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 +6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 + 6 5 5 5 5 5 5 5 4 4 + 4 4 4 3 3 3 3 3 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3 3 3 3

    And @id looks like this:

    0 0 0 0 0 4 4 4 4 4 4 4 4 4 4 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 +0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 + 0 1 1 1 1 1 1 1 5 5 + 5 5 5 2 2 2 2 2 6 6 6 6 6 7 7 7 7 3 3 3 3 3 3 3 3 3

    As you can see, we now have a list of the lowest value available for every minute of the period, and the index of the associated range that contributed it. Though the latter information is actually now redundant. Any minute that is not covered by any range has both value and ID as undef and shows up blank above.
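    To see the tally step in miniature, here is a sketch on three made-up ranges (the data is invented for illustration; the loop follows the description in step 2). Uncovered minutes are printed as '.' so the gaps stay visible:

```perl
use strict;
use warnings;

# Each range: [ start_minute, end_minute, value ]; invented data.
my @ranges = ( [ 0, 9, 6 ], [ 3, 5, 1 ], [ 12, 14, 3 ] );

my( @tally, @id );
for my $r ( 0 .. $#ranges ) {
    my( $start, $end, $value ) = @{ $ranges[$r] };
    for my $i ( $start .. $end ) {
        # Keep the lowest value seen so far for this minute.
        if( !defined $tally[$i] or $tally[$i] > $value ) {
            $tally[$i] = $value;
            $id[$i]    = $r;
        }
    }
}

print join( ' ', map { $_ // '.' } @tally ), "\n";   # 6 6 6 1 1 1 6 6 6 6 . . 3 3 3
print join( ' ', map { $_ // '.' } @id    ), "\n";   # 0 0 0 1 1 1 0 0 0 0 . . 2 2 2
```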

    All we need to do is walk our way down the tally stick and convert the start and end index minute of each contiguous range of values back to the corresponding day/hour/minute (int2dhm()) and we can construct the required list of output ranges, in sorted order, without any complex comparison logic.

  3. And that's what the final while loop does.
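A sketch of that final walk, on hand-built @tally and @id arrays (invented data). It groups by source id, as the code posted later in this thread does, and adds explicit bounds checks so it cannot run off the end of the array; in the real program the start/end offsets would be passed through int2dhm():

```perl
use strict;
use warnings;

# Hand-built tally and id arrays; undef marks an uncovered minute.
my @tally = ( 6, 6, 6, 1, 1, 1, 6, 6, undef, undef, 3, 3 );
my @id    = ( 0, 0, 0, 1, 1, 1, 0, 0, undef, undef, 2, 2 );

my @runs;
my $i = 0;
while( $i <= $#id ) {
    # Skip uncovered minutes.
    ++$i while $i <= $#id and !defined $id[$i];
    last if $i > $#id;

    my( $start, $id ) = ( $i, $id[$i] );

    # Advance to the end of the run that came from the same source range.
    ++$i while $i <= $#id and defined $id[$i] and $id[$i] == $id;

    push @runs, [ $start, $i - 1, $tally[$start] ];
}

print "@$_\n" for @runs;
# 0 2 6
# 3 5 1
# 6 7 6
# 10 11 3
```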

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Re^4: Date Array Convolution
by alanonymous (Sexton) on Nov 06, 2011 at 20:48 UTC
    BrowserUK, get ready for the longest post ever.

    First of all, the explanation has allowed me to actually understand what's going on in the code, thank you.

    I have a couple last little questions for you, but first off, here is my current-ish code:

    use strict;
    use warnings;

    #convert DDHHMM into int of m
    sub d2i {
        my( $d, $h, $m ) = unpack '(A2)*', $_[0];
        return ( ( $d - 1 ) * 24 + $h ) * 60 + $m;
    }

    #convert int of m into DDHHMM
    sub i2d {
        sprintf "%02d%02d%02d", int($_[0]/1440)+1, int($_[0]/60)%24, $_[0]%60;
    }

    #find and open files ... only ever 1 of each file type
    my ($d,$t);
    foreach (<*>) {
        $d = $_ if (/\.dat$/);
        $t = $_ if (/\.mrg$/);
    }
    open(D, $d) or die "Unable to open DAT file. Exiting.\n";
    my @dl = <D>;
    close(D);
    open(T, $t) or die "Unable to open MRG file. Exiting.\n";
    my @tl = <T>;
    close(T);

    #create big array of data points with added d2i
    my @big;
    foreach (@dl,@tl) {
        chomp($_);
        if (/^\/\d\d\d\d\d\d\//) {
            my @n = split(/\//);
            push @big,[$n[1],$n[2],$n[5],d2i($n[1]),d2i($n[2])];
        }
    }

    #break by day if needed
    my @ex = map {
        my $s = $_->[3];
        my $e = $_->[4];
        my @out;
        while (int($s/1440) != int($e/1440)) {
            my $newe = ( int($s/1440) + 1) * 1440 - 1;
            push @out, [i2d($s),i2d($newe),$_->[2],$s,$newe];
            $s = $newe + 1;
        }
        (@out, [i2d($s),$_->[1],$_->[2],$s,$e]);
    } @big;

    #build parallel arrays of minute and values ... total minimized for overlaps
    my (@tally, @id);
    for my $e (0 .. $#ex) {
        my $r = $ex[$e];
        for my $i ($r->[3] .. $r->[4] ) {
            if( !defined($tally[$i]) or $tally[$i] > $r->[2] ) {
                $tally[$i] = $r->[2];
                $id[$i] = $e;
            }
        }
    }

    #recreate [DDHHMM,DDHHMM,V] with overarching tally and ids
    my @res;
    my $i = 0;
    while ($i < $#id) {
        ++$i until defined $id[$i];
        my $id = $id[$i];
        my $start = $i;
        ++$i while defined ($id[$i]) and $id[$i] == $id;
        my $end = $i - 1;
        push @res, [i2d($start),i2d($end),$tally[$start]];
    }

    #output organized final to file
    open (C,'>CDA.txt');
    my @last = ("","","");
    print C "To do later: ... this still needs specific formatting work(easy)\n";
    foreach (@res) {
        if (substr($last[0],0,2) ne substr(@$_[0],0,2)) {
            print C "\n";
        }
        print C "@$_[0] @$_[1] @$_[2]\n";
        $last[0] = @$_[0];
    }
    close(C);

    #[temporary] report comparison for manual check
    @big = sort {$a->[0] <=> $b->[0]} (@big);
    foreach (@big) {
        print "@$_[0] @$_[1] @$_[2]\n";
    }
    print "\n";
    foreach (@res) {
        print "@$_[0] @$_[1] @$_[2]\n";
    }

    This all works EXACTLY as intended.

    A couple easy questions:
    1) Does the map function assume an output format of @array, [1,2,3,4,5]? Why isn't it push @array, [1,2,3,4,5] ?
    2) Fundamentally, what is the difference between $_[0] and $_->[0]? When I start replacing $e and some $s in the map with the array counterpart (either $_[x] or $_->[x]) things break. I understand that $s may or may not be modified and needs to be variablarized (I bet that's not a word), but why does it break for just pointing to $e's array location ($_->[4])?
    3) The tally method was quite clever :)

    So now that the easy questions are done, I have some new questions ... and a warning: you might throw your arms up in exasperation.

    So I've just realized that in my effort to simplify the task enough to ask for help, I made a very unfortunate assumption that has bad consequences for the code. I think I've fixed most of it but I have a couple little problems.

    Essentially, the two-digit date value in DDHHMMV represents the day of the month and has the potential to roll over halfway through a file. Also, toward the top of each input file is a line that identifies the period of time applicable to that file, for instance '/XXX/XXX/110000ZNOV/140000ZNOV/XXX/XXX', where the value is 'DDHHMMZMON'. The input files will only ever cover a few days at a time (3-10ish), so there is never an issue of which month the data is applicable to, as the date tag at the beginning identifies that. The potential problem I see is when the month rolls over (e.g. '/XXX/XXX/300000ZSEP/030000ZOCT/XXX/XXX'): the day-breaking map effectively increments forever.

    Here's an example of an input file (only a couple important lines, the rest is nonsense I filled in to test parsing):
    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa f sd fa sdfasdf
    /XXX/XXX/300000ZSEP/020000ZOCT/XXX/XXX
    a sdfasdfasdf asdf df sa fas dfASDFASDF AS DF ASDF SDFASDF /// //sdfasdf/123 \\ /asdfasdf adsf sdf// /555/234/
    /301222/301232/234/4234/011.0/
    /fasdfasd
    /301240/301250/asdf/fdsa/011.3/
    /302340/302355/9j9js9j9j9jf9sjfd/9j9sfj9df9323/010.0/
    /302359/010002/kfjakdjfakdfasdf/salkdjfaklsdjflkasjd/008.1/
    /011200/011400/kfjakdjfakdfasdf/salkdjfaklsdjflkasjd/008.1/
    asd /1/2/3/4/5 // /

    As you can also see, there is no 'year' date tag in input files, so there is still an assumption being made about DEC->JAN rollovers too.

    Here's my new code that almost fixes the problem:
    use strict;
    use warnings;
    use Time::Local;

    #convert MMddhhmm into minutes since epoch (w/ leap year accounted for)
    sub d2i {
        my( $mo, $d, $h, $m ) = unpack '(A2)*', $_[0];
        my @temp = localtime(time);
        return int(timelocal(0,$m,$h,$d,$mo,$temp[4]) / 60);
    }

    #convert minutes since epoch into MMddhhmm (w/ leap year accounted for)
    sub i2d {
        my @times = localtime($_[0]*60);
        sprintf "%02d%02d%02d%02d", $times[4], $times[3], $times[2], $times[1];
    }

    #return month
    sub fmon {
        if ($_[0] =~ /JAN$/) {return "00"; }
        if ($_[0] =~ /FEB$/) {return "01"; }
        if ($_[0] =~ /MAR$/) {return "02"; }
        if ($_[0] =~ /APR$/) {return "03"; }
        if ($_[0] =~ /MAY$/) {return "04"; }
        if ($_[0] =~ /JUN$/) {return "05"; }
        if ($_[0] =~ /JUL$/) {return "06"; }
        if ($_[0] =~ /AUG$/) {return "07"; }
        if ($_[0] =~ /SEP$/) {return "08"; }
        if ($_[0] =~ /OCT$/) {return "09"; }
        if ($_[0] =~ /NOV$/) {return "10"; }
        if ($_[0] =~ /DEC$/) {return "11"; }
    }

    #find and open files ... only ever 1 of each file type
    my ($d,$t);
    foreach (<*>) {
        $d = $_ if (/\.dat$/);
        $t = $_ if (/\.mrg$/);
    }
    open(D, $d) or die "Unable to open DAT file. Exiting.\n";
    my @dl = <D>;
    close(D);
    open(T, $t) or die "Unable to open MRG file. Exiting.\n";
    my @tl = <T>;
    close(T);

    #create big array of data points with added d2i
    #@big format: [MMddhhmm,MMddhhmm,V,I,I] where I is minutes from epoch
    my (@big, $startmon, $stopmon);
    foreach (@dl,@tl) {
        chomp($_);
        if (/^\/\w{3}\/\w{3}\/\d{6}/) {    #find month values
            my @n = split(/\//);
            $startmon = fmon($n[3]);       #cheating because start/stop always preceeds values
            $stopmon = fmon($n[4]);
        }
        if (/^\/\d{6}\//) {                #read in actual data
            my @n = split(/\//);
            #if ($startmon ne $stopmon && substr($n_[0],0,2) < 15) {
            #    push @big,[$startmon.$n[1],$stopmon.$n[2],$n[5],d2i($startmon.$n[1]),d2i($stopmon.$n[2])];
            #}
            push @big,[$startmon.$n[1],$stopmon.$n[2],$n[5],d2i($startmon.$n[1]),d2i($stopmon.$n[2])];
        }
    }

    #break by day if needed
    my @ex = map {
        my $s = $_->[3];
        my $e = $_->[4];
        my @out;
        while (int($s/1440) != int($e/1440)) {
            my $newe = ( int($s/1440) + 1) * 1440 - 1;
            push @out, [i2d($s),i2d($newe),$_->[2],$s,$newe];
            $s = $newe + 1;
        }
        (@out, [i2d($s),$_->[1],$_->[2],$s,$e]);
    } @big;

    #build parallel arrays of minute and values ... total minimized for overlaps
    my (@tally, @id);
    for my $e (0 .. $#ex) {
        my $r = $ex[$e];
        for my $i ($r->[3] .. $r->[4] ) {
            if( !defined($tally[$i]) or $tally[$i] > $r->[2] ) {
                $tally[$i] = $r->[2];
                $id[$i] = $e;
            }
        }
    }

    #recreate [MMddhhmm,MMddhhmm,V] with overarching tally and ids
    my @res;
    my $i = 0;
    while ($i < $#id) {
        ++$i until defined $id[$i];
        my $id = $id[$i];
        my $start = $i;
        ++$i while defined ($id[$i]) and $id[$i] == $id;
        my $end = $i - 1;
        push @res, [i2d($start),i2d($end),$tally[$start]];
    }

    #output organized final to file
    open (C,'>CDA.txt');
    my @last = ("","","");
    print C "To do later: ... a lot of specific formatting work(easy)\n";
    foreach (@res) {
        if (substr($last[0],0,2) ne substr(@$_[0],0,2)) {
            print C "\n";
        }
        print C substr(@$_[0],2,6), " ", substr(@$_[1],2,6), " ", @$_[2], "\n";
        $last[0] = @$_[0];
    }
    close(C);

    #[temporary] report comparison for manual check
    @big = sort {$a->[0] <=> $b->[0]} (@big);
    foreach (@big) {
        print "@$_[0] @$_[1] @$_[2]\n";
    }
    print "\n";
    foreach (@res) {
        print "@$_[0] @$_[1] @$_[2]\n";
    }

    The problems I am having are:
    1) Making the assumption of 'current year', as when this rolls over it would mess up.
    2) Breaking days apart doesn't seem to be working and I don't see why ... I'm just using minutes since epoch instead of minutes since start of month.
    3) Wastes a lot of resources (maybe in the #recreate loops, start counting at 100000000 instead of 0 or something?)
    4) The way the input files are implemented real world is typically from today-ish to 3-10 days in the future, so the assumption of current year is valid except for the last couple days of the year when the file spans after the rollover.

    Sorry to be such a pain in the butt, and thank you again for your help. On the bright side, I think I figured a lot of it out!

    -Alan

      The tally method lent itself to your application because you identified your dates as being DDHHMM.

      This meant that an entire month period can be covered, using a granularity of minutes, by an array of 60*24*31=44,640 elements.

      If you needed to deal with a year at a time, then using a tally stick of 527,040 elements (60*24*366) is still feasible -- especially if you move to using vec to store an array of 8 or 16-bit ints to hold your values. But the problem is that then the roll-over from one month to the next is complicated by the fact that months have differing numbers of days.
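      A rough sketch of a vec-based tally stick for one 31-day month. It assumes (my assumption, not something stated in the thread) that the values are integers that fit in 1..255, so that 0 can be reserved to mean "not covered"; tally_min() is a hypothetical helper name:

```perl
use strict;
use warnings;

# One 8-bit slot per minute of a 31-day month, packed into a single string.
# Assumption: values fit in 1..255; 0 is reserved for "not covered".
my $minutes = 60 * 24 * 31;
my $tally   = "\0" x $minutes;

# Record $value over the inclusive range, keeping the lowest value per minute.
sub tally_min {
    my( $start, $end, $value ) = @_;
    for my $i ( $start .. $end ) {
        my $cur = vec( $tally, $i, 8 );
        vec( $tally, $i, 8 ) = $value if $cur == 0 or $cur > $value;
    }
}

tally_min( 0, 9, 6 );
tally_min( 3, 5, 1 );

print join( ' ', map { vec( $tally, $_, 8 ) } 0 .. 11 ), "\n";   # 6 6 6 1 1 1 6 6 6 6 0 0
print length( $tally ), " bytes\n";                                # 44640 bytes
```

      The whole month then costs one byte per minute instead of a full Perl scalar per element.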

      But trying to use the epoch (by which I assume you mean the *nix epoch of 1/1/1970) means that the tally would have to be roughly 22 million elements long (the number of minutes from 1970 to late 2011), which is just silly. And it will be even sillier if you are using the Windows epoch of 1/1/1601!

      If you really cannot break your data sets up into 1 month subsets, then you are going to have to go back to using the O(N^2) process of comparing every range with every other range in order to detect your overlaps, rather than using the O(N) method of the tally stick. This is because once the set of ranges can span month and year boundaries you will need to handle all the messy details of date arithmetic: variable length months, leap years etc.

      Essentially, whether you should move forward with the tally method or go over to using proper date math should depend entirely upon the frequency (therefore performance constraint) of having to perform your processing versus the difficulty in having to pre-filter your ranges into by-month subsets.

      If performance is important and the subsetting easy, then stick with the tally method and start of month offsets. If the subsetting is hard and the performance requirement is low, then encode your dates using a proper date package and use full date arithmetic to detect your overlaps.


        This meant that an entire month period can be covered, using a granularity of minutes, by an array of 60*24*31=44,640 elements.
        In my solution, I only need an array of size 2 × the number of input ranges, because there can be no change in value if there is no start/end of a range.
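        A sketch of how such a breakpoint-based approach might look. This is an illustration under my own assumptions on invented data, not the code of the solution mentioned above: it collects the start and end+1 of every range as breakpoints, then takes the minimum value of the ranges covering each stretch between consecutive breakpoints.

```perl
use strict;
use warnings;
use List::Util qw( min );

# Ranges as [ start_minute, end_minute, value ], inclusive; invented data.
my @ranges = ( [ 0, 9, 6 ], [ 3, 5, 1 ], [ 12, 14, 3 ] );

# The minimum can only change where a range starts or just after one ends,
# so at most 2 x N breakpoints need to be examined.
my %seen;
my @points = sort { $a <=> $b }
             grep { !$seen{$_}++ }
             map  { ( $_->[0], $_->[1] + 1 ) } @ranges;

my @out;
for my $k ( 0 .. $#points - 1 ) {
    my( $from, $to ) = ( $points[$k], $points[ $k + 1 ] - 1 );
    my @covering = grep { $_->[0] <= $from and $_->[1] >= $to } @ranges;
    next unless @covering;                       # gap: nothing covers this stretch
    my $value = min map { $_->[2] } @covering;
    if( @out and $out[-1][1] == $from - 1 and $out[-1][2] == $value ) {
        $out[-1][1] = $to;                       # extend the previous run
    }
    else {
        push @out, [ $from, $to, $value ];
    }
}

print "@$_\n" for @out;
# 0 2 6
# 3 5 1
# 6 9 6
# 12 14 3
```

        The inner grep makes this sketch quadratic in the number of ranges; its memory use, though, depends only on the number of ranges and not on the length of the period, so it does not care whether the offsets are minutes of a month or minutes since an epoch.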