perl to Remove duplicates lines from a csv file based on timestamp most recent

john.tm has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have a large CSV file with 5 columns. I have managed to remove duplicates based on 2 columns, keeping the row with the most recent timestamp, but how do I print the updated list with all 5 columns?

I am getting these errors: Global symbol "$col3" requires explicit package name, and Global symbol "$col4" requires explicit package name.

#!/usr/bin/perl
use strict;
use warnings;
use POSIX 'strftime';
my @now = localtime();
my $todaysday = strftime("%d", localtime());
my $mth = strftime("%m" , localtime());
my $secs = strftime("%S" , localtime());
my $mins = strftime("%M" , localtime());
my $hr = strftime("%H" , localtime());
my $year = strftime("%Y" , localtime());
my $dtime = "$year-$mth-$todaysday $hr:$mins";
my %most_recent;
my $header = <DATA>;
while ( my $line = <DATA> ) {
    chomp $line;
    my ($col1,$date_and_time,$col2,$col3,$col4) = split( /,/, $line );
    $date_and_time =~ s/^\s+|\s+$//g;    # trim leading/trailing whitespace
    my $dtime = $date_and_time;
    if ( not defined $most_recent{$col1}{$col2}
        or $most_recent{$col1}{$col2} lt $dtime )
    {
        $most_recent{$col1}{$col2} = $dtime;
    }
}
print "Most recent:\n";
foreach my $col1  ( keys %most_recent ) {
    foreach my $col2 ( keys %{$most_recent{$col1}} ) {
             print "$col1, $col2, $most_recent{$col1}{$col2}, \n";
      #print "$col2,$col1,$col3,$col4,\n";
    }
}

__DATA__

LONDO,2015-01-02 11:35,GE04_TDP,ted,fu
LONDO,2015-01-02 13:15,GE03_TDP,ted,fu
LONDO,2015-01-02 15:42,GE03_TDP,ted,fu
LONDO,2015-01-02 15:22,GE04_TDP,ome,ful
LONDO,2015-01-02 17:15,GE03_TDP,omp,ful
LONDO,2015-01-02 17:32,GE04_TDP,omp,ful
LONDO,2015-01-02 20:44,CW02,et,ful
LONDO,2015-01-02 19:26,CW03,et,ful
LONDO,2015-01-02 20:25,CW01,let,pped
LONDO,2015-01-02 19:57,CW04,let,pped
LONDO,2015-01-02 19:24,EXCHP,let,ucc
LONDO,2015-01-02 19:25,EXCHP,let,ucc
LONDO,2015-01-02 19:43,GE03,let,ucc
LONDO,2015-01-02 20:41,GE04,Co,ucc
LONDO,2015-01-02 21:33,GE03_TDP,Co,ucc
LONDO,2015-01-02 21:17,EXCHP,Co,ucc
LONDO,2015-01-02 23:24,EXCHDP,Co,ucc
LONDO,2015-01-02 23:27,EXCHDP,Co,ucc
LONDO,2015-01-03 01:20,EXCHDP,il,02
LONDO,2015-01-03 01:11,EXCHDP,ro

Replies are listed 'Best First'.
Re: perl to Remove duplicates lines from a csv file based on timestamp most recent by soonix (Chancellor) on Jan 05, 2015 at 10:13 UTC
Just to add to pme's solution: you declare `my ($col1,$date_and_time,$col2,$col3,$col4)` inside the while loop, so those variables go out of scope at the end of each iteration. Even if you moved the declaration outside the loop, each variable would still hold only a single value, so you would end up printing the values from the very last input line.
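The second point can be illustrated with a minimal sketch. The sample lines and the sub name `last_seen` are made up for illustration and are not part of the original script:

```perl
use strict;
use warnings;

# Hypothetical helper: splits every line into the same three variables,
# declared once outside the loop, and returns whatever they hold afterwards.
sub last_seen {
    my @lines = @_;
    my ($name, $num, $tag);
    for my $line (@lines) {
        ($name, $num, $tag) = split /,/, $line;    # overwritten on every pass
    }
    return "$name,$num,$tag";
}

print last_seen('A,1,x', 'B,2,y', 'C,3,z'), "\n";   # prints C,3,z
```

Only the last line survives, which is why the extra columns have to be stored alongside the timestamp inside the hash itself.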
Re: perl to Remove duplicates lines from a csv file based on timestamp most recent by pme (Monsignor) on Jan 05, 2015 at 07:47 UTC
Hi john.tm,

You can store a hashref in %most_recent instead of a scalar ($dtime).

```
...
    if ( not defined $most_recent{$col1}{$col2}
        or $most_recent{$col1}{$col2}->{dtime} lt $dtime )
    {
        $most_recent{$col1}{$col2}->{dtime} = $dtime;
        $most_recent{$col1}{$col2}->{col3}  = $col3;
        $most_recent{$col1}{$col2}->{col4}  = $col4;
    }
}
print "Most recent:\n";
foreach my $col1 ( keys %most_recent ) {
    foreach my $col2 ( keys %{$most_recent{$col1}} ) {
        print "$col1, $col2, $most_recent{$col1}{$col2}->{dtime}, $most_recent{$col1}{$col2}->{col3}, $most_recent{$col1}{$col2}->{col4}\n";
    }
}
```
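For reference, a self-contained sketch of the same idea might look like the following. It is a variation rather than pme's exact code: instead of copying each column into its own hash key, it stores the whole split row as an array ref. It assumes the column order from the original post (col1, timestamp, col2, col3, col4), timestamps in `YYYY-MM-DD HH:MM` form so that `lt` orders them chronologically, and fields without embedded commas. The sub name `most_recent_rows` is made up for illustration, and the sample rows are taken from the posted data.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns a hash ref of the form {col1}{col2} => [fields],
# holding all fields of the newest row seen for each (col1, col2) pair.
sub most_recent_rows {
    my @lines = @_;
    my %most_recent;
    for my $line (@lines) {
        chomp $line;
        next unless $line =~ /\S/;            # skip blank lines
        my @fields = split /,/, $line, -1;    # -1 keeps trailing empty fields
        s/^\s+|\s+$//g for @fields;           # trim whitespace around each field
        my ( $col1, $dtime, $col2 ) = @fields;
        next unless defined $col2;            # skip rows too short to have a key
        my $seen = $most_recent{$col1}{$col2};
        if ( !defined $seen or $seen->[1] lt $dtime ) {
            $most_recent{$col1}{$col2} = \@fields;    # keep the entire row
        }
    }
    return \%most_recent;
}

my @sample = (
    'LONDO,2015-01-02 11:35,GE04_TDP,ted,fu',
    'LONDO,2015-01-02 15:22,GE04_TDP,ome,ful',
    'LONDO,2015-01-02 17:32,GE04_TDP,omp,ful',
    'LONDO,2015-01-02 19:24,EXCHP,let,ucc',
    'LONDO,2015-01-02 21:17,EXCHP,Co,ucc',
);

my $recent = most_recent_rows(@sample);
for my $col1 ( sort keys %$recent ) {
    for my $col2 ( sort keys %{ $recent->{$col1} } ) {
        # Print the row in its original column order.
        print join( ',', @{ $recent->{$col1}{$col2} } ), "\n";
    }
}
```

Keeping the whole row means a short line such as the last one in the posted data is printed with however many fields it has, rather than triggering "uninitialized value" warnings for missing columns. For real-world CSV with quoted fields, a parser such as Text::CSV would be a safer choice than `split`.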