in reply to Re: search through hash for date in a range
in thread search through hash for date in a range

That is a good question.

The reason is that the logs are uploaded to a central server for processing, contained in a .tgz file. Once unpacked, there is no guarantee that each hour is processed in sequential order: the log files are processed in parallel, with work handed out from a queue to child processing scripts. So the DB is the only point of reference. Also, storing a byte count would not work: when the files are rotated (supposedly once a day) the file starts over at 0, and on some systems logrotate does not always happen, so those files just keep growing.

It could be that hours 3, 5 and 6 are processed and THEN hour 4 gets processed, so again the byte count from hour 6 will not be applicable to hour 4. But since hour 4's data was included in the hour 5 and hour 6 uploads, there is no need to actually pull out and load that data; without checking the dates for inclusion in the range, though, there is no way to guarantee this.

Given all of this, dates are the only reliable means of knowing data has been processed before.

But even with being able to process only dates after a byte skip, I would still need to check the dates against the ranges. And there are a lot of them each hour on some of the busier systems.

Hence the original question: how to make the lookup faster/more efficient?
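One possible direction (a sketch only, with made-up epoch values rather than your real timestamps): since the ranges overlap heavily, merge them once into a sorted, non-overlapping list, then binary-search each date against that list. That turns each lookup from O(ranges) into O(log ranges). Converting the "YYYY-MM-DD HH:MM:SS" strings to epoch seconds (e.g. with the core Time::Piece module) is assumed here.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical example ranges as [start, end] epoch-second pairs.
my @ranges = ( [100, 200], [150, 300], [500, 600] );

# Merge overlapping ranges once (O(n log n)).
my @merged;
for my $r ( sort { $a->[0] <=> $b->[0] } @ranges ) {
    if ( @merged && $r->[0] <= $merged[-1][1] ) {
        # Overlaps the previous merged range: extend its end if needed.
        $merged[-1][1] = $r->[1] if $r->[1] > $merged[-1][1];
    }
    else {
        push @merged, [@$r];
    }
}

# Binary search: is timestamp $t inside any merged range? O(log n) per call.
sub in_range {
    my ($t) = @_;
    my ( $lo, $hi ) = ( 0, $#merged );
    while ( $lo <= $hi ) {
        my $mid = int( ( $lo + $hi ) / 2 );
        if    ( $t < $merged[$mid][0] ) { $hi = $mid - 1 }
        elsif ( $t > $merged[$mid][1] ) { $lo = $mid + 1 }
        else                            { return 1 }
    }
    return 0;
}

print in_range(250) ? "in\n"  : "out\n";   # 250 falls inside merged [100,300]
print in_range(400) ? "in\n"  : "out\n";   # 400 lies in the gap
```

The merge cost is paid once per batch; each of the many per-hour date checks is then just a handful of comparisons.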


Replies are listed 'Best First'.
Re^3: search through hash for date in a range
by poj (Abbot) on Mar 06, 2018 at 18:56 UTC

    Can the ranges overlap each other? In the 3 shown as examples they don't.

    poj

      Yes, sorry; the date ranges can overlap each other. The search sub just returns when it finds the first match of a date in any range.

      My example should have shown this.

      The ranges are more like this:

      2018-03-05 06:00:00 -> 2018-03-06 01:00:00
      2018-03-05 06:00:00 -> 2018-03-06 02:00:00
      2018-03-05 06:00:00 -> 2018-03-06 03:00:00
      2018-03-05 06:00:00 -> 2018-03-06 04:00:00
      2018-03-05 06:00:00 -> 2018-03-06 05:00:00
      2018-03-05 06:00:00 -> 2018-03-06 06:00:00
      2018-03-06 06:00:00 -> 2018-03-06 07:00:00
      2018-03-06 06:00:00 -> 2018-03-06 08:00:00
      2018-03-06 06:00:00 -> 2018-03-06 10:00:00
      2018-03-06 06:00:00 -> 2018-03-06 11:00:00
      2018-03-06 06:00:00 -> 2018-03-06 12:00:00
      2018-03-06 06:00:00 -> 2018-03-06 13:00:00

      Given the above set of ranges, the operational assumption (notice 06:00 -> 09:00 is missing, which happens often) is that we are "normally" processing hour 13 (06:00 -> 14:00), but that is no guarantee ... (Around 6:30 am the log is rotated, so the range starts over ...)

      If my next load is from 08:00 -> 09:00 then I can skip loading the dates that are already loaded in the 06:00 -> 10:00 data set.

      I know, complicated ...
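      To illustrate that "skip what's already loaded" step (a sketch with hypothetical helper and integer stand-ins for epoch times, not your actual loader code): subtract the already-loaded windows from the new load window, and only load whatever gaps remain.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: given a new load window [$s, $e] and a list of
# already-loaded [start, end] windows, return the sub-windows of [$s, $e]
# that are NOT yet covered and therefore still need loading.
sub uncovered {
    my ( $s, $e, @loaded ) = @_;
    my @gaps;
    for my $r ( sort { $a->[0] <=> $b->[0] } @loaded ) {
        last if $r->[0] > $e;            # window starts after our end
        next if $r->[1] < $s;            # window ends before our start
        push @gaps, [ $s, $r->[0] ] if $s < $r->[0];   # gap before this window
        $s = $r->[1] if $r->[1] > $s;    # advance past the covered part
    }
    push @gaps, [ $s, $e ] if $s < $e;   # tail gap after the last window
    return @gaps;
}

# New load 08:00 -> 09:00 while 06:00 -> 10:00 is already loaded
# (hours as plain integers for illustration): nothing left to do.
my @todo = uncovered( 8, 9, [ 6, 10 ] );
print @todo ? "load needed\n" : "already covered\n";   # prints "already covered"
```

      With that shape, the per-date membership test disappears entirely for fully-covered loads, and partially-covered loads only scan the uncovered gaps.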

        Ok, I was thinking about using an 86,400-element array, one for each second of the day (ranges crossing days would need some sort of sliding window and offset, I guess). Anyway, something to investigate perhaps.

        #!/usr/bin/perl
        use warnings;
        use strict;

        my @range = ();
        while (<DATA>){
          chomp;
          if (/^R:(.*)/){
            my ($i,undef,$s,undef,$e) = split /[, ]/,$1;
            $s = sec($s);
            $e = sec($e);
            for ($s..$e){
              $range[$_] = $i;
            }
          } else {
            my (undef,$t) = split /-/,$_;
            my $rangeid = $range[ sec($t) ];
            if ( defined $rangeid ) {
              print "Found range: $rangeid for $t\n";
            } else {
              print "No range found for $t\n";
            }
          }
        }

        sub sec {
          my ($h,$m,$s) = split /:/,shift;
          return $h*60*60 + $m*60 + $s;
        }

        __DATA__
        R:1,2018-03-06 14:20:00,2018-03-06 14:30:00
        R:2,2018-03-06 13:00:00,2018-03-06 13:40:00
        R:3,2018-03-06 13:45:00,2018-03-06 13:50:00
        D:03/06/2018-14:29:41
        D:03/06/2018-13:33:38
        D:03/06/2018-13:54:47
        D:03/06/2018-12:53:34
        D:03/06/2018-13:29:19
        D:03/06/2018-12:52:47
        D:03/06/2018-14:21:51
        D:03/06/2018-13:49:20
        D:03/06/2018-13:36:18
        D:03/06/2018-13:44:25
        poj