bfdi533 has asked for the wisdom of the Perl Monks concerning the following question:
I have log files that I process every hour. They grow from the beginning of the day to the end of the day and typically reach millions of lines per log by the end of the day. Until now I have been running a cmp against a saved copy of the previous file to get just the new lines. Still, I am running into duplicates and need to prevent that. As I am already keeping the start and end dates from each log in a DB, I am trying to switch to checking the dates in the log against what has already been loaded.
To do this, I am building a hash from a DB query that stores the ID, start and end dates. The dates are stored in the hash as seconds since 1970 to make comparisons faster. However, I think the sub that checks for a date, though straightforward, is far too slow to run against millions of log lines. The sub in question is check_date_in_range. I thought that calculating the array of sorted keys only once would speed things up, so I am "caching" it, but that does not appear to be the slowdown ...
Obviously, the DB code is left out, the building of the ranges_dt hash is somewhat separate, and the logic to process the log file is rather complicated. But this is a working code set that uses the sub, so you can see what I am trying to do.
Here is the complete, working sample code.
    #!/usr/bin/env perl
    use Date::Manip::Date;
    use warnings;
    use strict;

    $|++;

    my @vkeys;
    my $dmd = new Date::Manip::Date;
    my $cmp_dt = new Date::Manip::Date;
    my %ranges_dt;

    # ==================================================
    sub check_date_in_range {
        my ($value, %h) = @_;

        $value =~ s/-/ /;
        $cmp_dt->parse($value);
        my $cvalue = $cmp_dt->secs_since_1970_GMT;

        if (!@vkeys) {
            @vkeys = sort keys %h;
        }
        foreach my $k (@vkeys) {
            if (($cvalue >= $h{$k}{start}) && ($cvalue <= $h{$k}{end})) {
                return $k;
            }
        }
        return;
    }

    while (<DATA>) {
        chomp;
        if (s/^(\w+)://) {
            my $cat = $1;
            if ($cat eq "R") {
                my ($i, $s, $e) = split ',';
                $dmd->parse($s);
                $ranges_dt{$i}{start} = $dmd->secs_since_1970_GMT;
                $dmd->parse($e);
                $ranges_dt{$i}{end} = $dmd->secs_since_1970_GMT;
            }
            else {
                my $rangeid = check_date_in_range($_, %ranges_dt);
                if (!defined $rangeid) {
                    print "No range found for $_\n";
                }
                else {
                    print "Found range: $rangeid\n";
                }
            }
        }
    }
    __DATA__
    R:1,2018-03-06 14:20:00,2018-03-06 14:30:00
    R:2,2018-03-06 13:00:00,2018-03-06 13:40:00
    R:3,2018-03-06 13:45:00,2018-03-06 13:50:00
    D:03/06/2018-14:29:41
    D:03/06/2018-13:33:38
    D:03/06/2018-13:54:47
    D:03/06/2018-12:53:34
    D:03/06/2018-13:29:19
    D:03/06/2018-12:52:47
    D:03/06/2018-14:21:51
    D:03/06/2018-13:49:20
    D:03/06/2018-13:36:18
    D:03/06/2018-13:44:25
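One cost in the listing above that is separate from the range scan is the per-line date parse. As a rough sketch only, and assuming both timestamp formats are fixed-width exactly as in the sample data, the conversion to epoch seconds could be done with a regex and the core Time::Local module instead of a general-purpose parser. The helper names log_stamp_to_epoch and db_stamp_to_epoch and the %epoch_cache hash are hypothetical, and the sketch treats every timestamp as UTC, which is an assumption: it only matters that the ranges and the log lines are converted the same way.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Time::Local qw(timegm);

# Hypothetical cache: many log lines share the same second, so each
# distinct timestamp string is converted only once.
my %epoch_cache;

# Convert 'MM/DD/YYYY-HH:MM:SS' (the D: line format) to epoch seconds.
# Assumption: the timestamp is treated as UTC.
# Returns undef if the string does not match the expected layout;
# timegm itself dies on impossible dates such as month 13.
sub log_stamp_to_epoch {
    my ($stamp) = @_;
    return $epoch_cache{$stamp} if exists $epoch_cache{$stamp};
    my ($mon, $day, $year, $h, $m, $s) =
        $stamp =~ m{^(\d{2})/(\d{2})/(\d{4})-(\d{2}):(\d{2}):(\d{2})$}
        or return;
    return $epoch_cache{$stamp} = timegm($s, $m, $h, $day, $mon - 1, $year);
}

# Convert 'YYYY-MM-DD HH:MM:SS' (the R: line format) the same way.
sub db_stamp_to_epoch {
    my ($stamp) = @_;
    my ($year, $mon, $day, $h, $m, $s) =
        $stamp =~ m{^(\d{4})-(\d{2})-(\d{2}) (\d{2}):(\d{2}):(\d{2})$}
        or return;
    return timegm($s, $m, $h, $day, $mon - 1, $year);
}

# Small demonstration using values from the sample data.
for my $stamp ('03/06/2018-14:29:41', '03/06/2018-13:33:38') {
    my $epoch = log_stamp_to_epoch($stamp);
    print "$stamp => $epoch\n";
}
```

Whether this is actually faster than Date::Manip on the real logs would need to be measured, for example with the core Benchmark module.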
Is there a better way to do the range check?
Something more efficient that will be very fast?
Update: I should note that the @vkeys array holds just under 500 items, all of which may need to be checked for every date in question. Hence the need to speed this up.
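With roughly 500 ranges, a linear scan can cost up to 500 comparisons per log line. One possible alternative, sketched here under the assumption that the stored ranges never overlap (the post does not say so), is to sort the ranges by start time once and binary-search them, which needs about 9 comparisons per lookup. The sub names build_index and find_range are hypothetical, and the small integers in the demo are seconds since midnight standing in for real epoch values. The sketch also passes the data by reference, since the call check_date_in_range($_, %ranges_dt) copies the whole hash into %h on every call.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Build once: an array of [start, end, id] sorted by start.
# Assumption: ranges do not overlap. With overlapping ranges this
# lookup could miss a match and a different structure would be needed.
sub build_index {
    my ($ranges) = @_;    # hash ref: id => { start => epoch, end => epoch }
    return [
        sort { $a->[0] <=> $b->[0] }
        map  { [ $ranges->{$_}{start}, $ranges->{$_}{end}, $_ ] }
        keys %$ranges
    ];
}

# Binary search for the last range whose start is <= $epoch, then
# check that $epoch does not fall past that range's end.
# Returns the range id, or nothing if no range contains $epoch.
sub find_range {
    my ($index, $epoch) = @_;
    my ($lo, $hi) = (0, $#$index);
    my $found;
    while ($lo <= $hi) {
        my $mid = int(($lo + $hi) / 2);
        if ($index->[$mid][0] <= $epoch) {
            $found = $mid;        # candidate; look for a later start
            $lo    = $mid + 1;
        }
        else {
            $hi = $mid - 1;
        }
    }
    return unless defined $found && $epoch <= $index->[$found][1];
    return $index->[$found][2];
}

# Demo: the three sample ranges, expressed as seconds since midnight.
my %ranges = (
    1 => { start => 51600, end => 52200 },    # 14:20:00 to 14:30:00
    2 => { start => 46800, end => 49200 },    # 13:00:00 to 13:40:00
    3 => { start => 49500, end => 49800 },    # 13:45:00 to 13:50:00
);
my $index = build_index(\%ranges);

for my $t (52181, 48818, 50087) {             # 14:29:41, 13:33:38, 13:54:47
    my $id = find_range($index, $t);
    print defined $id ? "Found range: $id\n" : "No range found for $t\n";
}
```

The index has to be rebuilt whenever the set of ranges changes, so it fits best when the ranges are loaded from the DB once per run.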
Replies are listed 'Best First'.
Re: search through hash for date in a range
by choroba (Cardinal) on Mar 06, 2018 at 17:09 UTC
by bfdi533 (Friar) on Mar 06, 2018 at 18:46 UTC
by bfdi533 (Friar) on Mar 06, 2018 at 18:21 UTC

Re: search through hash for date in a range
by hippo (Archbishop) on Mar 06, 2018 at 17:19 UTC
by bfdi533 (Friar) on Mar 06, 2018 at 18:19 UTC
by poj (Abbot) on Mar 06, 2018 at 18:56 UTC
by bfdi533 (Friar) on Mar 06, 2018 at 19:07 UTC
by poj (Abbot) on Mar 06, 2018 at 19:15 UTC

Re: search through hash for date in a range
by Laurent_R (Canon) on Mar 07, 2018 at 07:32 UTC

Re: search through hash for date in a range
by Anonymous Monk on Mar 07, 2018 at 12:54 UTC
by QM (Parson) on Mar 07, 2018 at 13:27 UTC
by bfdi533 (Friar) on Mar 07, 2018 at 15:04 UTC

Re: search through hash for date in a range
by pwagyi (Monk) on Mar 23, 2018 at 03:24 UTC