Working with multiple records with different dates

rruser has asked for the wisdom of the Perl Monks concerning the following question:

Currently we receive 3 files monthly and I have a script which outputs the date in yyyy,mm,dd format. My need is to create a file containing the record with the longest time spread, as you can see there can be multiple dates or only 1 date. my output files are in the following format:

file1:
COMPANY      ABCD 764200 E  2013,12,13   2013,12,19
COMPANY      BCDX 156167 L  2013,11,29   2013,12,03
COMPANY      BCYX 165230 L  2013,12,13   2013,12,19

file2:
COMPANY      ABCD 764200 E  2013,12,13   2013,12,19
COMPANY      BCDX 156167 L  2013,12,28   2013,12,31

file3:
COMPANY      ABCD 764200 E  2013,12,13   2013,12,17
COMPANY      BCDX 156167 L  2013,11,30   2013,12,03
COMPANY      BCYX 165230 L  2013,12,13   2013,12,17
COMPANY      BCYX 156095 L  2013,11,30   2013,12,08
[download]

What I have so far:

Have extraction script append all 3 files to 1 file

Check for identical entries with script checking for duplicates

Script in progress that gets the difference between the 2 dates and skips any difference less that 5 (4 days and under are free days)

I just don't know how to code for extracting the record with the longest time spread discarding any others.

Here is the script I am working on (your input and suggestions are much appreciated...thanks)

#!/usr/bin/perl
# 
use strict;
use warnings;
use Date::Calc qw( Delta_Days );

my @entries=();   ## hold my entries

open (my $file, '<', (@ARGV)) or die $!;
while (<$file>) {
    my @flds = split '\s+', $_;
    s/\s+$//;
    
    ## load my entries
    my $entry;
        
        $entry->{COMPANY} =    $flds[0];
        $entry->{CAR_PART_1} = $flds[1];
        $entry->{CAR_PART_2} = $flds[2];
        $entry->{LE} =         $flds[3];
        $entry->{BEG_DATE} =   $flds[4];
        $entry->{END_DATE} =   $flds[5];

    ##  Push $entry onto @entries
        push (@entries, $entry);
}        
close $file;
               
        foreach my $ent (sort @entries) {
            my @ymd1 = split ',',$ent->{BEG_DATE};
            my @ymd2 = split ',',$ent->{END_DATE};
            my $diff = Delta_Days(@ymd1, @ymd2);
            if ($diff < 5) { next; }
            else {
                ## this is where I need logic for grabbing only the re
+cord with the most days
            printf "$ent->{CAR_PART_1} $ent->{CAR_PART_2} $ent->{LE} $
+ent->{BEG_DATE} $ent->{END_DATE} $diff\n";   ## testing output
        }
}
exit;
[download]

Comment on Working with multiple records with different dates Select or Download Code

Replies are listed 'Best First'.
Re: Working with multiple records with different dates by LloydRice (Beadle) on Jan 18, 2014 at 00:49 UTC
A somewhat elaborate but thorough way to do this is to convert all of the dates to Unix/Perl timecode values. This will allow you to distinguish values to 1 second in an easily-sortable format (numbers). The code to handle time of day is a bit tricky. Maybe you do not need that part. Here is the code I use. # Converting date/time info to Unix time code use Time::Local; # timelocal() returns a Unix time code number # Function timelocal() takes a 6-element list, the same as the first # six arguments of the localtime function. If you have # $timecode = timelocal @list; # then mday = list[3] = 1..31 # mon = list[4] = 0..11 # yr = list[5] = year-1900 # Assume you have "hr:min:sec am/pm" as variables # $arghr, $argmin, $argsec, $ampm If the time-of-day # does not matter, I usually set them as "12:1:1 am", # (just after noon) to avoid being near a day changeover. $arghr = $1; $argmin = $2; $argsec = $3; $ampm = $4; # Assume you have "day,mon,yr" as variables # $argday, $monname, $argyr # "day" = 1 .. 31 # "mon" = Jan .. Dec # and 2-digit (21st century) year $argday = $1; $monname = ucfirst $2; $argyr = $3; # Given the above seven arguments, the following code # will produce the corresponding Unix (Perl) timecode value %monhash = ( "Jan" =>1, "Feb" =>2, "Mar" =>3, "Apr" =>4, "May" =>5, "Jun" =>6, "Jul" =>7, "Aug" =>8, "Sep" =>9, "Oct" =>10, "Nov" =>11, "Dec" =>12 +); $hrmin = 0; $hrmax = 23; if ( $ampm ne "" ) { $hrmin = 1; $hrmax = 12; $addhrs = 0; if ( $ampm =~ /^p/ ) { $addhrs = 12; } } if ( $arghr < $hrmin or $arghr > $hrmax ) { die "Hours portion of date argument '$arghr' out of range.\n"; } if ( $ampm ne "" ) { if ( $arghr == 12 ) { $arghr = 0; } $arghr += $addhrs; } if ( $argmin < 0 or $argmin > 59 ) { die "Minutes portion of date argument '$argmin' out of range.\n"; } if ( $argsec < 0 or $argsec > 59 ) { die "Seconds portion of date argument '$argsec' out of range.\n"; } print "The time arg is $arghr:$argmin:$argsec\n"; if ( not defined $monhash{ $monname } ) { die "Could not parse month portion '$monname' of date argument.\n"; } $argmon = $monhash{ $monname }; $argyear = 1900 + $argyr; $argtime = timelocal( $argsec, $argmin, $arghr, $argday, $argmon, $argyear - 1900 ); [download]	[reply] [d/l]
Re: Working with multiple records with different dates by kcott (Archbishop) on Jan 18, 2014 at 13:21 UTC
G'day rruser, Here's a solution using the builtin modules Time::Piece and Time::Seconds. It only requires a single script (instead of your three) and doesn't create an additional file concatenating all the data from the original three files. With the data you posted, "`COMPANY BCYX 156095 L 2013,11,30 2013,12,08`" has the greatest time spread (8 days). For my testing, I duplicated that record in file2 and file3 and added a unique record to file1 which also had a spread of 8 days. Here's the data I worked with: $ cat pm_1071030_file1.txt COMPANY ABCD 764200 E 2013,12,13 2013,12,19 COMPANY BCDX 156167 L 2013,11,29 2013,12,03 COMPANY BCYX 165230 L 2013,12,13 2013,12,19 COMPANY BCYX 156095 L 2013,11,29 2013,12,07 $ cat pm_1071030_file2.txt COMPANY ABCD 764200 E 2013,12,13 2013,12,19 COMPANY BCDX 156167 L 2013,12,28 2013,12,31 COMPANY BCYX 156095 L 2013,11,30 2013,12,08 $ cat pm_1071030_file3.txt COMPANY ABCD 764200 E 2013,12,13 2013,12,17 COMPANY BCDX 156167 L 2013,11,30 2013,12,03 COMPANY BCYX 165230 L 2013,12,13 2013,12,17 COMPANY BCYX 156095 L 2013,11,30 2013,12,08 COMPANY BCYX 156095 L 2013,11,30 2013,12,08 [download] Here's the script: #!/usr/bin/env perl use strict; use warnings; use autodie; use Time::Piece; use Time::Seconds; my $format = '%Y,%m,%d'; my $free_days = 4; my @wanted_record_data = (0, []); for my $file (map { "pm_1071030_file${_}.txt" } 1 .. 3) { open my $fh, '<', $file; while (<$fh>) { my ($start, $end) = (split)[4, 5]; my $t0 = Time::Piece->strptime($start, $format); my $t1 = Time::Piece->strptime($end, $format); my $diff = ($t1 - $t0)->days; next if $diff <= $free_days; push @{$wanted_record_data[1]}, $_ if $diff == $wanted_record_ +data[0]; @wanted_record_data = ($diff, [$_]) if $diff > $wanted_record_ +data[0]; } } print "Greatest time spread = $wanted_record_data[0] days.\n"; print "Records with this time spread:\n"; my %seen; print for grep { ! $seen{$_}++ } @{$wanted_record_data[1]}; [download] Here's the output: `Greatest time spread = 8 days. Records with this time spread: COMPANY BCYX 156095 L 2013,11,29 2013,12,07 COMPANY BCYX 156095 L 2013,11,30 2013,12,08` [download] -- Ken	[reply] [d/l] [select]
Re^2: Working with multiple records with different dates by rruser (Acolyte) on Jan 21, 2014 at 23:01 UTC
Thanks Ken that gets me closer, I didn't take into account the longest time spread may be different for different records. `example COMPANY ABCD 764200 E 2013,12,13 2013,12,19 COMPANY BCYX 156095 L 2013,11,30 2013,12,08` [download] The first record has 6 days and the second 8 days. I need both records with any duplicates or shorter time span dates removed. I'm not sure if I am explaining this well. I need each unique record so if it exists only in 1 file then I use that date. If records exists in multiple files then the one with the longest time frame. Any suggestions, I sure appreciate your time and expertise	[reply] [d/l]
Re^3: Working with multiple records with different dates by kcott (Archbishop) on Jan 22, 2014 at 11:59 UTC
"I'm not sure if I am explaining this well." I don't think you are. You say "The first record has 6 days and the second 8 days.". None of the three files you show have records like that. I wondered if you meant "file" instead of "record", but that leaves no mention of the the third file (which is the one with an 8 day time spread). My best guess is that you should process the files separately (perhaps storing interim data something like '`$data_for_file{$file} = [ @wanted_record_data ]`') and then extract the records you want after all files have been read. -- Ken	[reply] [d/l]