rruser has asked for the wisdom of the Perl Monks concerning the following question:

Currently we receive 3 files monthly and I have a script which outputs the date in yyyy,mm,dd format. My need is to create a file containing the record with the longest time spread, as you can see there can be multiple dates or only 1 date. my output files are in the following format:

file1: COMPANY ABCD 764200 E 2013,12,13 2013,12,19 COMPANY BCDX 156167 L 2013,11,29 2013,12,03 COMPANY BCYX 165230 L 2013,12,13 2013,12,19 file2: COMPANY ABCD 764200 E 2013,12,13 2013,12,19 COMPANY BCDX 156167 L 2013,12,28 2013,12,31 file3: COMPANY ABCD 764200 E 2013,12,13 2013,12,17 COMPANY BCDX 156167 L 2013,11,30 2013,12,03 COMPANY BCYX 165230 L 2013,12,13 2013,12,17 COMPANY BCYX 156095 L 2013,11,30 2013,12,08

What I have so far:

Have extraction script append all 3 files to 1 file

Check for identical entries with script checking for duplicates

Script in progress that gets the difference between the 2 dates and skips any difference less that 5 (4 days and under are free days)

I just don't know how to code for extracting the record with the longest time spread discarding any others.

Here is the script I am working on (your input and suggestions are much appreciated...thanks)

#!/usr/bin/perl # use strict; use warnings; use Date::Calc qw( Delta_Days ); my @entries=(); ## hold my entries open (my $file, '<', (@ARGV)) or die $!; while (<$file>) { my @flds = split '\s+', $_; s/\s+$//; ## load my entries my $entry; $entry->{COMPANY} = $flds[0]; $entry->{CAR_PART_1} = $flds[1]; $entry->{CAR_PART_2} = $flds[2]; $entry->{LE} = $flds[3]; $entry->{BEG_DATE} = $flds[4]; $entry->{END_DATE} = $flds[5]; ## Push $entry onto @entries push (@entries, $entry); } close $file; foreach my $ent (sort @entries) { my @ymd1 = split ',',$ent->{BEG_DATE}; my @ymd2 = split ',',$ent->{END_DATE}; my $diff = Delta_Days(@ymd1, @ymd2); if ($diff < 5) { next; } else { ## this is where I need logic for grabbing only the re +cord with the most days printf "$ent->{CAR_PART_1} $ent->{CAR_PART_2} $ent->{LE} $ +ent->{BEG_DATE} $ent->{END_DATE} $diff\n"; ## testing output } } exit;

Replies are listed 'Best First'.
Re: Working with multiple records with different dates
by LloydRice (Beadle) on Jan 18, 2014 at 00:49 UTC

    A somewhat elaborate but thorough way to do this is to convert all of the dates to Unix/Perl timecode values. This will allow you to distinguish values to 1 second in an easily-sortable format (numbers). The code to handle time of day is a bit tricky. Maybe you do not need that part. Here is the code I use.

    # Converting date/time info to Unix time code use Time::Local; # timelocal() returns a Unix time code number # Function timelocal() takes a 6-element list, the same as the first # six arguments of the localtime function. If you have # $timecode = timelocal @list; # then mday = list[3] = 1..31 # mon = list[4] = 0..11 # yr = list[5] = year-1900 # Assume you have "hr:min:sec am/pm" as variables # $arghr, $argmin, $argsec, $ampm If the time-of-day # does not matter, I usually set them as "12:1:1 am", # (just after noon) to avoid being near a day changeover. $arghr = $1; $argmin = $2; $argsec = $3; $ampm = $4; # Assume you have "day,mon,yr" as variables # $argday, $monname, $argyr # "day" = 1 .. 31 # "mon" = Jan .. Dec # and 2-digit (21st century) year $argday = $1; $monname = ucfirst $2; $argyr = $3; # Given the above seven arguments, the following code # will produce the corresponding Unix (Perl) timecode value %monhash = ( "Jan" =>1, "Feb" =>2, "Mar" =>3, "Apr" =>4, "May" =>5, "Jun" =>6, "Jul" =>7, "Aug" =>8, "Sep" =>9, "Oct" =>10, "Nov" =>11, "Dec" =>12 +); $hrmin = 0; $hrmax = 23; if ( $ampm ne "" ) { $hrmin = 1; $hrmax = 12; $addhrs = 0; if ( $ampm =~ /^p/ ) { $addhrs = 12; } } if ( $arghr < $hrmin or $arghr > $hrmax ) { die "Hours portion of date argument '$arghr' out of range.\n"; } if ( $ampm ne "" ) { if ( $arghr == 12 ) { $arghr = 0; } $arghr += $addhrs; } if ( $argmin < 0 or $argmin > 59 ) { die "Minutes portion of date argument '$argmin' out of range.\n"; } if ( $argsec < 0 or $argsec > 59 ) { die "Seconds portion of date argument '$argsec' out of range.\n"; } print "The time arg is $arghr:$argmin:$argsec\n"; if ( not defined $monhash{ $monname } ) { die "Could not parse month portion '$monname' of date argument.\n"; } $argmon = $monhash{ $monname }; $argyear = 1900 + $argyr; $argtime = timelocal( $argsec, $argmin, $arghr, $argday, $argmon, $argyear - 1900 );
Re: Working with multiple records with different dates
by kcott (Archbishop) on Jan 18, 2014 at 13:21 UTC

    G'day rruser,

    Here's a solution using the builtin modules Time::Piece and Time::Seconds. It only requires a single script (instead of your three) and doesn't create an additional file concatenating all the data from the original three files.

    With the data you posted, "COMPANY      BCYX 156095 L  2013,11,30   2013,12,08" has the greatest time spread (8 days). For my testing, I duplicated that record in file2 and file3 and added a unique record to file1 which also had a spread of 8 days. Here's the data I worked with:

    $ cat pm_1071030_file1.txt COMPANY ABCD 764200 E 2013,12,13 2013,12,19 COMPANY BCDX 156167 L 2013,11,29 2013,12,03 COMPANY BCYX 165230 L 2013,12,13 2013,12,19 COMPANY BCYX 156095 L 2013,11,29 2013,12,07 $ cat pm_1071030_file2.txt COMPANY ABCD 764200 E 2013,12,13 2013,12,19 COMPANY BCDX 156167 L 2013,12,28 2013,12,31 COMPANY BCYX 156095 L 2013,11,30 2013,12,08 $ cat pm_1071030_file3.txt COMPANY ABCD 764200 E 2013,12,13 2013,12,17 COMPANY BCDX 156167 L 2013,11,30 2013,12,03 COMPANY BCYX 165230 L 2013,12,13 2013,12,17 COMPANY BCYX 156095 L 2013,11,30 2013,12,08 COMPANY BCYX 156095 L 2013,11,30 2013,12,08

    Here's the script:

    #!/usr/bin/env perl use strict; use warnings; use autodie; use Time::Piece; use Time::Seconds; my $format = '%Y,%m,%d'; my $free_days = 4; my @wanted_record_data = (0, []); for my $file (map { "pm_1071030_file${_}.txt" } 1 .. 3) { open my $fh, '<', $file; while (<$fh>) { my ($start, $end) = (split)[4, 5]; my $t0 = Time::Piece->strptime($start, $format); my $t1 = Time::Piece->strptime($end, $format); my $diff = ($t1 - $t0)->days; next if $diff <= $free_days; push @{$wanted_record_data[1]}, $_ if $diff == $wanted_record_ +data[0]; @wanted_record_data = ($diff, [$_]) if $diff > $wanted_record_ +data[0]; } } print "Greatest time spread = $wanted_record_data[0] days.\n"; print "Records with this time spread:\n"; my %seen; print for grep { ! $seen{$_}++ } @{$wanted_record_data[1]};

    Here's the output:

    Greatest time spread = 8 days. Records with this time spread: COMPANY BCYX 156095 L 2013,11,29 2013,12,07 COMPANY BCYX 156095 L 2013,11,30 2013,12,08

    -- Ken

      Thanks Ken that gets me closer, I didn't take into account the longest time spread may be different for different records.

      example COMPANY ABCD 764200 E 2013,12,13 2013,12,19 COMPANY BCYX 156095 L 2013,11,30 2013,12,08

      The first record has 6 days and the second 8 days. I need both records with any duplicates or shorter time span dates removed. I'm not sure if I am explaining this well. I need each unique record so if it exists only in 1 file then I use that date. If records exists in multiple files then the one with the longest time frame.

      Any suggestions, I sure appreciate your time and expertise

        "I'm not sure if I am explaining this well."

        I don't think you are.

        You say "The first record has 6 days and the second 8 days.". None of the three files you show have records like that. I wondered if you meant "file" instead of "record", but that leaves no mention of the the third file (which is the one with an 8 day time spread).

        My best guess is that you should process the files separately (perhaps storing interim data something like '$data_for_file{$file} = [ @wanted_record_data ]') and then extract the records you want after all files have been read.

        -- Ken