in reply to Re^3: search through hash for date in a range
in thread search through hash for date in a range

Yes, sorry: the date ranges can overlap each other. The search sub simply returns as soon as it finds the first range that contains the date.

My example should have shown this.

The ranges are more like this:

2018-03-05 06:00:00 -> 2018-03-06 01:00:00
2018-03-05 06:00:00 -> 2018-03-06 02:00:00
2018-03-05 06:00:00 -> 2018-03-06 03:00:00
2018-03-05 06:00:00 -> 2018-03-06 04:00:00
2018-03-05 06:00:00 -> 2018-03-06 05:00:00
2018-03-05 06:00:00 -> 2018-03-06 06:00:00
2018-03-06 06:00:00 -> 2018-03-06 07:00:00
2018-03-06 06:00:00 -> 2018-03-06 08:00:00
2018-03-06 06:00:00 -> 2018-03-06 10:00:00
2018-03-06 06:00:00 -> 2018-03-06 11:00:00
2018-03-06 06:00:00 -> 2018-03-06 12:00:00
2018-03-06 06:00:00 -> 2018-03-06 13:00:00

The operational assumption, given the above set of ranges (notice that 06:00 -> 09:00 is missing, which happens often), is that we are "normally" processing hour 13 (06:00 -> 14:00), but that is no guarantee ... (Around 6:30 am the log is rotated, so the range starts over ...)

If my next load is from 08:00 -> 09:00 then I can skip loading the dates that are already loaded in the 06:00 -> 10:00 data set.
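
A minimal sketch of that skip check, assuming the loaded ranges are held as [start, end] pairs of timestamp strings. The name covering_range and the @loaded list are made up for illustration and are not from my real code. Zero-padded "YYYY-MM-DD HH:MM:SS" strings sort chronologically, so plain string comparison is enough and nothing has to be parsed:

```perl
use strict;
use warnings;

# Already-loaded ranges as [start, end] pairs (two rows from the list above).
my @loaded = (
    [ '2018-03-06 06:00:00', '2018-03-06 08:00:00' ],
    [ '2018-03-06 06:00:00', '2018-03-06 10:00:00' ],
);

# Return the first loaded range that fully contains [$start, $end],
# or undef if no single range covers it.
sub covering_range {
    my ($start, $end, $ranges) = @_;
    for my $r (@$ranges) {
        return $r if $r->[0] le $start && $end le $r->[1];
    }
    return;
}

my $hit = covering_range('2018-03-06 08:00:00', '2018-03-06 09:00:00', \@loaded);
print $hit ? "covered by $hit->[0] -> $hit->[1]\n" : "needs loading\n";
```

This only checks whether one range covers the whole window; a window covered piecewise by several overlapping ranges would still be reported as needing a load.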

I know, complicated ...

Re^5: search through hash for date in a range
by poj (Abbot) on Mar 06, 2018 at 19:15 UTC

    Ok, I was thinking about using an 86,400-element array, one element for each second of the day (ranges crossing days would need some sort of sliding window and offset, I guess). Anyway, something to investigate perhaps.

    #!/usr/bin/perl
    use warnings;
    use strict;

    my @range = ();    # one slot per second of the day, holding a range id
    while (<DATA>) {
        chomp;
        if (/^R:(.*)/) {
            # Range row "id,start,end": keep only the time-of-day parts.
            my ($i, undef, $s, undef, $e) = split /[, ]/, $1;
            $s = sec($s);
            $e = sec($e);
            for ($s .. $e) {
                $range[$_] = $i;
            }
        }
        else {
            # Data row "D:mm/dd/yyyy-hh:mm:ss": look up its time of day.
            my (undef, $t) = split /-/, $_;
            my $rangeid = $range[ sec($t) ];
            if (defined $rangeid) {
                print "Found range: $rangeid for $t\n";
            }
            else {
                print "No range found for $t\n";
            }
        }
    }

    # Convert hh:mm:ss to seconds since midnight.
    sub sec {
        my ($h, $m, $s) = split /:/, shift;
        return $h*60*60 + $m*60 + $s;
    }

    __DATA__
    R:1,2018-03-06 14:20:00,2018-03-06 14:30:00
    R:2,2018-03-06 13:00:00,2018-03-06 13:40:00
    R:3,2018-03-06 13:45:00,2018-03-06 13:50:00
    D:03/06/2018-14:29:41
    D:03/06/2018-13:33:38
    D:03/06/2018-13:54:47
    D:03/06/2018-12:53:34
    D:03/06/2018-13:29:19
    D:03/06/2018-12:52:47
    D:03/06/2018-14:21:51
    D:03/06/2018-13:49:20
    D:03/06/2018-13:36:18
    D:03/06/2018-13:44:25
    poj
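
    For the day-crossing case, one possible reading of the "offset" idea is a simple wrap-around. This is only a sketch under the assumption that a range crosses midnight at most once and that all data falls within a single 24-hour window; mark_range is a made-up helper name, and sec() is the same conversion as in the script above:

    ```perl
    use strict;
    use warnings;

    use constant DAY => 86_400;    # seconds in a day

    # Convert hh:mm:ss to seconds since midnight.
    sub sec {
        my ($h, $m, $s) = split /:/, shift;
        return $h * 3600 + $m * 60 + $s;
    }

    # Mark every second of [$from, $to] with $id. If $to is earlier than
    # $from, the range is assumed to cross midnight once and is split in two.
    sub mark_range {
        my ($range, $id, $from, $to) = @_;
        my @spans = $from <= $to ? ([$from, $to]) : ([$from, DAY - 1], [0, $to]);
        for my $span (@spans) {
            $range->[$_] = $id for $span->[0] .. $span->[1];
        }
        return;
    }

    my @range;
    mark_range(\@range, 'A', sec('23:50:00'), sec('00:10:00'));
    for my $t ('23:55:00', '00:05:00', '12:00:00') {
        my $id = $range[ sec($t) ];
        print defined $id ? "Found range: $id for $t\n" : "No range found for $t\n";
    }
    ```

    Because the date part is thrown away, this cannot tell 00:05 today from 00:05 tomorrow, so it only helps while the data stays inside one rolling day.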

      I took a hybrid approach and am using a bigger array but with the same idea. No need to worry about the sliding window here.

      I query the DB first for the min/max dates and store them as epoch values. In this sample, that is represented by the "X" row. (The values on the "X" line are actual dates from my DB.)

      Then I calculate the epoch for each date in the "R" range rows and subtract the minimum from it to get the array index.

      Great suggestion. Seems pretty workable.

      #!/usr/bin/env perl
      use Date::Manip::Date;
      use Time::Piece;
      use warnings;
      use strict;

      $|++;

      # Not used in this sample.
      my @vkeys;
      my $dmd    = new Date::Manip::Date;
      my $cmp_dt = new Date::Manip::Date;
      my %ranges_dt;

      my @range;    # index = epoch - minimum epoch, value = range id
      my $tn;       # minimum epoch (from the "X" row)
      my $tx;       # maximum epoch (from the "X" row)

      while (<DATA>) {
          chomp;
          if (s/^(\w+)://) {
              my $cat = $1;
              if ($cat eq "X") {
                  # Min/max dates as returned by the DB.
                  my ($n, $x) = split ',';
                  $tn = Time::Piece->strptime($n,"%Y-%m-%d %H:%M:%S")->epoch;
                  $tx = Time::Piece->strptime($x,"%Y-%m-%d %H:%M:%S")->epoch;
              }
              elsif ($cat eq "R") {
                  # Range row: fill every second of the range with its id.
                  my ($i, $s, $e) = split ',';
                  my $ts = Time::Piece->strptime($s,"%Y-%m-%d %H:%M:%S")->epoch - $tn;
                  my $te = Time::Piece->strptime($e,"%Y-%m-%d %H:%M:%S")->epoch - $tn;
                  for ($ts..$te) {
                      $range[$_] = $i;
                  }
              }
              else {
                  # Data row: a single array lookup replaces the range search.
                  my $cd = Time::Piece->strptime($_,"%m/%d/%Y-%H:%M:%S")->epoch - $tn;
                  my $rangeid = $range[$cd];
                  if (!defined $rangeid) {
                      print "No range found for $_\n";
                  }
                  else {
                      print "Found range: $rangeid for $_\n";
                  }
              }
          }
      }

      __DATA__
      X:2018-02-15 22:49:41,2018-12-13 15:59:59
      R:1,2018-03-06 14:20:00,2018-03-06 14:30:00
      R:2,2018-03-06 13:00:00,2018-03-06 13:40:00
      R:3,2018-03-06 13:45:00,2018-03-06 13:50:00
      D:03/06/2018-14:29:41
      D:03/06/2018-13:33:38
      D:03/06/2018-13:54:47
      D:03/06/2018-12:53:34
      D:03/06/2018-13:29:19
      D:03/06/2018-12:52:47
      D:03/06/2018-14:21:51
      D:03/06/2018-13:49:20
      D:03/06/2018-13:36:18
      D:03/06/2018-13:44:25
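
      One thing to watch with the epoch-minus-minimum index: a date earlier than the minimum gives a negative offset, and Perl treats a negative subscript as counting from the end of the array, so the lookup could silently return the wrong range. A small guard, sketched here with a made-up lookup_range helper and the same "X" values as above, avoids that:

      ```perl
      use strict;
      use warnings;
      use Time::Piece;

      my $fmt = '%Y-%m-%d %H:%M:%S';
      my $tn  = Time::Piece->strptime('2018-02-15 22:49:41', $fmt)->epoch;    # min
      my $tx  = Time::Piece->strptime('2018-12-13 15:59:59', $fmt)->epoch;    # max

      # Return the range id for $epoch, or undef when $epoch lies outside
      # [$tn, $tx]. Without the check, a negative offset would index from
      # the end of the array.
      sub lookup_range {
          my ($range, $epoch) = @_;
          return if $epoch < $tn || $epoch > $tx;
          return $range->[ $epoch - $tn ];
      }

      # Fill one range (id 2) the same way the script above does.
      my @range;
      my $s = Time::Piece->strptime('2018-03-06 13:00:00', $fmt)->epoch;
      my $e = Time::Piece->strptime('2018-03-06 13:40:00', $fmt)->epoch;
      $range[ $_ - $tn ] = 2 for $s .. $e;

      for my $d ('2018-03-06 13:33:38', '2018-01-01 00:00:00') {
          my $id = lookup_range(\@range, Time::Piece->strptime($d, $fmt)->epoch);
          print defined $id ? "Found range: $id for $d\n" : "No range found for $d\n";
      }
      ```

      The upper-bound test is not strictly needed for correctness, since reading past the end of an array just yields undef, but it keeps the check symmetric with the min/max pulled from the DB.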

        Trying this out on my actual code shows a HUGE time improvement:

        Results from range lookup with Time::Piece and subroutine

        Line: 37000 : 119 seconds : tps: 8.40336134453782
        Line: 38000 : 115 seconds : tps: 8.69565217391304
        Line: 39000 : 121 seconds : tps: 8.26446280991735
        Line: 40000 : 120 seconds : tps: 8.33333333333333
        Line: 41000 : 114 seconds : tps: 8.7719298245614
        Line: 42000 : 139 seconds : tps: 7.19424460431655
        Line: 43000 : 126 seconds : tps: 7.93650793650794
        Line: 44000 : 122 seconds : tps: 8.19672131147541
        Line: 45000 : 177 seconds : tps: 5.64971751412429
        Line: 46000 : 161 seconds : tps: 6.2111801242236

        Results with array (seconds) lookup

        Line: 37000 : 6 seconds : tps: 166.666666666667
        Line: 38000 : 6 seconds : tps: 166.666666666667
        Line: 39000 : 7 seconds : tps: 142.857142857143
        Line: 40000 : 6 seconds : tps: 166.666666666667
        Line: 41000 : 5 seconds : tps: 200
        Line: 42000 : 7 seconds : tps: 142.857142857143
        Line: 43000 : 7 seconds : tps: 142.857142857143
        Line: 44000 : 6 seconds : tps: 166.666666666667
        Line: 45000 : 7 seconds : tps: 142.857142857143
        Line: 46000 : 7 seconds : tps: 142.857142857143

      That is a great idea! I will play around with that to see what I can do with it ...