in reply to search through hash for date in a range

Alternatively, why not just store the line number of the last line processed between runs? It will be much faster to just discard the first N lines on the next run than to do date-based processing on them.
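For instance, a minimal sketch of that idea in Perl (the log and state file names here are just placeholders; the saved line number could equally well live in a DB):

  use strict;
  use warnings;

  # Hypothetical file names; adjust to the real log and state locations.
  my $log_file   = 'app.log';
  my $state_file = 'app.log.lastline';

  # Line number reached on the previous run (0 if this is the first run).
  my $last_line = 0;
  if (-e $state_file) {
      open my $st, '<', $state_file or die "Cannot read $state_file: $!";
      chomp($last_line = <$st> // 0);
      close $st;
  }

  open my $log, '<', $log_file or die "Cannot open $log_file: $!";
  my $line_no = 0;
  while (my $line = <$log>) {
      $line_no++;
      next if $line_no <= $last_line;   # skip lines already handled last run
      # ... process $line here ...
  }
  close $log;

  # Remember how far we got for the next run.
  open my $st, '>', $state_file or die "Cannot write $state_file: $!";
  print {$st} "$line_no\n";
  close $st;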


Re^2: search through hash for date in a range
by bfdi533 (Friar) on Mar 06, 2018 at 18:19 UTC

    That is a good question.

    The reason is that the logs are uploaded to a central server for processing and are contained in a .tgz file. Once extracted, there is no guarantee that each hour being processed by the loader script is in sequential order, and the log files are processed in parallel, with work handed out from a queue to child processing scripts. So the DB is the only point of reference. Also, storing a byte count would not work when the files are rotated (supposedly once a day) and the file starts over at 0 (though on some systems the logrotate does not always happen, so those files just keep growing).

    It could be that hours 3, 5, and 6 are processed and THEN hour 4 gets processed, so again the byte count from hour 6 will not be applicable to hour 4. But with hour 4 having been included in the hour 5 and hour 6 uploads, there is no need to actually pull out and load the data; without checking the dates for inclusion in the range, though, there is no way to guarantee this.

    Given all of this, dates are the only reliable means of knowing data has been processed before.

    But even if I could skip ahead by a byte count and only process the dates after that point, I would still need to check the dates against the ranges. And there are a lot of them each hour on some of the busier systems.

    Hence the original question: how to make the lookup faster/more efficient?

      Can the ranges overlap each other? In the three shown as examples they don't.

      poj

        Yes, sorry; the date ranges can overlap each other. The search sub just returns when it finds the first match of a date in any range.

        My example should have shown this.

        The ranges are more like this:

        2018-03-05 06:00:00 -> 2018-03-06 01:00:00
        2018-03-05 06:00:00 -> 2018-03-06 02:00:00
        2018-03-05 06:00:00 -> 2018-03-06 03:00:00
        2018-03-05 06:00:00 -> 2018-03-06 04:00:00
        2018-03-05 06:00:00 -> 2018-03-06 05:00:00
        2018-03-05 06:00:00 -> 2018-03-06 06:00:00
        2018-03-06 06:00:00 -> 2018-03-06 07:00:00
        2018-03-06 06:00:00 -> 2018-03-06 08:00:00
        2018-03-06 06:00:00 -> 2018-03-06 10:00:00
        2018-03-06 06:00:00 -> 2018-03-06 11:00:00
        2018-03-06 06:00:00 -> 2018-03-06 12:00:00
        2018-03-06 06:00:00 -> 2018-03-06 13:00:00

        It is an operational assumption, given the above set of ranges (notice 06:00 -> 09:00 is missing, which happens often), that we are "normally" processing hour 13 (06:00 -> 14:00), but that is no guarantee ... (Around 6:30 am the log is rotated, so the range starts over ...)

        If my next load is from 08:00 -> 09:00 then I can skip loading the dates that are already loaded in the 06:00 -> 10:00 data set.

        I know, complicated ...
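
        One possible way to make that skip check cheap (just a sketch; the range subset and sub names below are illustrative, not the actual loader code) is to merge the overlapping ranges into a short disjoint list once per run, and to compare the "YYYY-MM-DD HH:MM:SS" strings directly, since that format sorts chronologically and needs no per-line date parsing:

        use strict;
        use warnings;

        # Illustrative subset of the already-loaded ranges shown above.
        my @ranges = (
            ['2018-03-05 06:00:00', '2018-03-06 01:00:00'],
            ['2018-03-05 06:00:00', '2018-03-06 06:00:00'],
            ['2018-03-06 06:00:00', '2018-03-06 10:00:00'],
            ['2018-03-06 06:00:00', '2018-03-06 13:00:00'],
        );

        # Merge overlapping/touching ranges into a sorted, disjoint list.
        sub merge_ranges {
            my @sorted = sort { $a->[0] cmp $b->[0] } @_;
            my @merged = (shift @sorted);
            for my $r (@sorted) {
                if ($r->[0] le $merged[-1][1]) {
                    # Overlaps the previous range: extend its end if needed.
                    $merged[-1][1] = $r->[1] if $r->[1] gt $merged[-1][1];
                }
                else {
                    push @merged, $r;
                }
            }
            return @merged;
        }

        my @loaded = merge_ranges(@ranges);   # collapses to one range here

        # "YYYY-MM-DD HH:MM:SS" timestamps compare correctly as plain
        # strings, so each log line needs only a couple of string tests.
        sub already_loaded {
            my ($ts) = @_;
            for my $r (@loaded) {
                return 1 if $ts ge $r->[0] && $ts le $r->[1];
            }
            return 0;
        }

        print already_loaded('2018-03-06 08:30:00') ? "skip\n" : "load\n";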