I process log files every hour. Each log grows over the course of the day and typically reaches millions of lines by the end of the day. Until now I have used cmp against a saved copy of the previous file to extract only the new lines. I am still running into duplicates, though, and need to prevent them. Since I already keep the start and end dates of each loaded log in a DB, I am trying to switch to checking the dates in the log against what has already been loaded.

To do this, I am building a hash from a DB query that returns the ID, start date, and end date of each range. The dates are stored in the hash as seconds since 1970 to make comparisons faster. However, I think the sub that checks for a date, though straightforward, is far too slow to run against millions of log lines. The sub in question is check_date_in_range. I thought that computing the array of sorted keys only once would speed things up, so I am "caching" it, but that does not appear to be where the slowdown is ...

Obviously, the DB code is left out, the building of the ranges_dt hash is somewhat separate, and the real logic to process the log file is rather complicated. But this is a working code sample that uses the sub, so you can see what I am trying to do.

Here is the complete, working sample code.

    #!/usr/bin/env perl

    use Date::Manip::Date;
    use warnings;
    use strict;

    $|++;

    my @vkeys;
    my $dmd = new Date::Manip::Date;
    my $cmp_dt = new Date::Manip::Date;
    my %ranges_dt;

    # ==================================================

    sub check_date_in_range {
        my ($value, %h) = @_;

        $value =~ s/-/ /;
        $cmp_dt->parse($value);
        my $cvalue = $cmp_dt->secs_since_1970_GMT;

        if (!@vkeys) {
            @vkeys = sort keys %h;
        }

        foreach my $k (@vkeys) {
            if (($cvalue >= $h{$k}{start}) && ($cvalue <= $h{$k}{end})) {
                return $k;
            }
        }

        return;
    }

    while (<DATA>) {
        chomp;
        if (s/^(\w+)://) {
            my $cat = $1;
            if ($cat eq "R") {
                my ($i, $s, $e) = split ',';
                $dmd->parse($s);
                $ranges_dt{$i}{start} = $dmd->secs_since_1970_GMT;
                $dmd->parse($e);
                $ranges_dt{$i}{end} = $dmd->secs_since_1970_GMT;
            } else {
                my $rangeid = check_date_in_range($_, %ranges_dt);
                if (!defined $rangeid) {
                    print "No range found for $_\n";
                } else {
                    print "Found range: $rangeid\n";
                }
            }
        }
    }

    __DATA__
    R:1,2018-03-06 14:20:00,2018-03-06 14:30:00
    R:2,2018-03-06 13:00:00,2018-03-06 13:40:00
    R:3,2018-03-06 13:45:00,2018-03-06 13:50:00
    D:03/06/2018-14:29:41
    D:03/06/2018-13:33:38
    D:03/06/2018-13:54:47
    D:03/06/2018-12:53:34
    D:03/06/2018-13:29:19
    D:03/06/2018-12:52:47
    D:03/06/2018-14:21:51
    D:03/06/2018-13:49:20
    D:03/06/2018-13:36:18
    D:03/06/2018-13:44:25

Is there a better way to do the range check?

Something more efficient that will be very fast?

Update: I should note that the @vkeys array has just under 500 items, all of which may need to be checked for every date in question. Hence the need to speed this up.
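
One direction that might help is sketched here. It assumes the ranges never overlap, which the post does not confirm. The idea is to sort the ranges by start time once and binary-search them, so each lookup costs roughly 9 comparisons for 500 ranges instead of up to 500. The helper names build_index and find_range are hypothetical, and the epoch values are made up for illustration. The hash is also passed by reference, since passing %ranges_dt itself rebuilds a copy of the top-level hash on every call.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: flatten the id => {start, end} hash into an array
# of [start, end, id] triples sorted by start time. Run once, not per line.
sub build_index {
    my ($ranges) = @_;    # hash reference, so the hash is not copied
    return [
        sort { $a->[0] <=> $b->[0] }
        map  { [ $ranges->{$_}{start}, $ranges->{$_}{end}, $_ ] }
        keys %$ranges
    ];
}

# Hypothetical helper: binary search for the last range whose start is
# <= $t, then check that $t is also <= that range's end.
# Assumption: ranges do not overlap.
sub find_range {
    my ($index, $t) = @_;
    my ($lo, $hi) = (0, $#$index);
    my $hit;
    while ($lo <= $hi) {
        my $mid = int(($lo + $hi) / 2);
        if ($index->[$mid][0] <= $t) {
            $hit = $mid;          # candidate; look for a later start
            $lo  = $mid + 1;
        }
        else {
            $hi = $mid - 1;
        }
    }
    return unless defined $hit && $t <= $index->[$hit][1];
    return $index->[$hit][2];
}

# Made-up epoch values, purely for illustration.
my %ranges_dt = (
    1 => { start => 1000, end => 1600 },
    2 => { start => 100,  end => 400  },
    3 => { start => 500,  end => 800  },
);
my $index = build_index(\%ranges_dt);

for my $t (50, 100, 450, 800, 1200, 2000) {
    my $id = find_range($index, $t);
    print defined $id ? "$t: range $id\n" : "$t: no range\n";
}
```

If the ranges can overlap, this sketch would only ever report the range with the latest start at or before the timestamp, so it would need to be adapted. It also leaves the date parsing untouched; whether parsing or the range scan dominates the run time is something only profiling the real script would show.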


In reply to search through hash for date in a range by bfdi533
