Removing duplicate lines from a file

green_lakers has asked for the wisdom of the Perl Monks concerning the following question:

Hi Guys, I have this file...

2009-01-08 09:29:19    ABCDEF    943973    MS08-011    Security Update for Microsoft Works Suite 2005 (KB943973)
2009-01-08 09:29:19    ABCDEF    943973    MS08-011    Security Update for Microsoft Works Suite 2005 (KB943973)
2009-01-08 09:29:19    ABCDEF    951944    MS08-055    Security Update for the 2007 Microsoft Office System (KB951944)
2009-01-08 09:29:19    ABCDEF    953432        Update for Microsoft Office Outlook 2003 (KB953432)
2009-01-08 09:29:19    ABCDEF    954038    MS08-051    Security Update for 2007 Microsoft Office System (KB954038)
2009-01-08 09:29:19    ABCDEF    954326    MS08-052    Security Update for the 2007 Microsoft Office System (KB954326)
2009-01-08 09:29:19    ABCDEF    956391        Cumulative Security Update for ActiveX Killbits for Windows 2000 (KB956391)
2009-01-08 09:29:20    ABCDEF    956828    MS08-072    Security Update for the 2007 Microsoft Office System (KB956828)
2009-01-08 09:29:20    ABCDEF    956828    MS08-072    Security Update for the 2007 Microsoft Office System (KB956828)
2009-01-08 09:29:20    ABCDEF    957832        Update for Microsoft Office Outlook 2003 Junk Email Filter (KB957832)
2009-01-08 09:29:22    ABCDEF    958439    MS08-074    Security Update for the 2007 Microsoft Office System (KB958439)
2009-01-08 09:29:22    ABCDEF    958439    MS08-074    Security Update for the 2007 Microsoft Office System (KB958439)

I want to remove the duplicate lines so the output looks like this:

        953432  Update for Microsoft Office Outlook 2003 (KB953432)     ABCDEF  2009-01-08
        956391  Cumulative Security Update for ActiveX Killbits for Windows 2000 (KB956391)     ABCDEF  2009-01-08
        957832  Update for Microsoft Office Outlook 2003 Junk Email Filter (KB957832)   ABCDEF  2009-01-08
MS08-011        943973  Security Update for Microsoft Works Suite 2005 (KB943973)       ABCDEF  2009-01-08
MS08-051        954038  Security Update for 2007 Microsoft Office System (KB954038)     ABCDEF  2009-01-08
MS08-052        954326  Security Update for the 2007 Microsoft Office System (KB954326) ABCDEF  2009-01-08
MS08-055        951944  Security Update for the 2007 Microsoft Office System (KB951944) ABCDEF  2009-01-08
MS08-072        956828  Security Update for the 2007 Microsoft Office System (KB956828) ABCDEF  2009-01-08
MS08-074        958439  Security Update for the 2007 Microsoft Office System (KB958439) ABCDEF  2009-01-08

But I am not getting the last line "MS08-074 958439 Security Update for the 2007 Microsoft Office System (KB958439) ABCDEF 2009-01-08" in my output. This is the output I am getting:

        953432  Update for Microsoft Office Outlook 2003 (KB953432)     ABCDEF  2009-01-08
        956391  Cumulative Security Update for ActiveX Killbits for Windows 2000 (KB956391)     ABCDEF  2009-01-08
        957832  Update for Microsoft Office Outlook 2003 Junk Email Filter (KB957832)   ABCDEF  2009-01-08
MS08-011        943973  Security Update for Microsoft Works Suite 2005 (KB943973)       ABCDEF  2009-01-08
MS08-051        954038  Security Update for 2007 Microsoft Office System (KB954038)     ABCDEF  2009-01-08
MS08-052        954326  Security Update for the 2007 Microsoft Office System (KB954326) ABCDEF  2009-01-08
MS08-055        951944  Security Update for the 2007 Microsoft Office System (KB951944) ABCDEF  2009-01-08
MS08-072        956828  Security Update for the 2007 Microsoft Office System (KB956828) ABCDEF  2009-01-08

This is the code I am using:

#!/usr/local/bin/perl

open (MYFILE, 'file.txt');
@file = <MYFILE>;
close (MYFILE);    
print (" - Found (" . scalar ( @file) . ")\n");


foreach $line (@file)
                {
                chomp ($line);
                @split=split(/\t/, $line);
                @date=split(/\s+/, $split[0]);
                push (@sort ,"@split[1]\t@split[2]\t@split[3]\t@split[4]\t@date[0]");
                }


@sorted = sort (@sort);


                       foreach $Endpoint (@sorted)
                                                {       $Endpoint =~ s/\s*$//;
                                print "FIRST - $Endpoint\n";
                                
                            }


undef (@sort);
print ("Found (" . scalar (@sorted) . ")\n");    
print ("Remove duplicate lines\n");
$prev="";
$index=0;
foreach $line (@sorted) 
                {
        $index++;
        if ("$prev" eq ""){
            $prev = $line;
        }else {
            if ($prev eq $line) {
            } else {
                push (@filtered,$prev);                
            }            
        }
        
        if ($index == scalar(@softed)){ 
            push (@filtered,$line);        
        }
        
        $prev = $line;
    }
    
    @sorted = sort (@filtered);
        undef (@filtered);
    print ("Found (" . scalar (@sorted) . ")\n");    
    print ("format each line so that its formated as BulletinID,KBID,Title,Endpointname,Date\n");
        
    foreach $line(@sorted){
        @split=split(/\t/, $line);
                print "LINEPRINT - $line\n";
        push (@sort ,"@split[2]\t@split[1]\t@split[3]\t@split[0]\t@split[4]");
    }
    
    @sorted = sort (@sort);
    undef (@sort);
    print ("Found (" . scalar (@sorted) . ")\n");    


                       foreach $Endpoint (@sorted)
                                                {       $Endpoint =~ s/\s*$//;
                                print "$Endpoint\n";
                                
                            }

Replies are listed 'Best First'.
Re: Removing duplicate lines from a file by zwon (Abbot) on Jan 19, 2009 at 21:24 UTC
Much better, but you should also learn about perltidy :). You should start all your scripts with `use warnings; use strict;`. This will help you catch errors like this one: `if ($index == scalar(@softed)){` (note `@softed` here, where `@sorted` was intended). Also, you can write `unless ($prev eq $line){ push (@filtered,$prev); }` instead of `if ($prev eq $line) { } else { push (@filtered,$prev); }`.
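To see what strict does with the `@softed` typo, here is a minimal sketch. The string eval is only a device for capturing the compile error in a runnable example; in a real script the error would simply stop compilation. The variable names `$ok` and `$error` are illustrative.

```perl
use strict;
use warnings;

# Compile a tiny snippet that contains the misspelled array name.
# Under strict, an undeclared @softed is a compile-time error.
my $ok = eval q{
    use strict;
    my @sorted = ('a', 'b');
    my $count  = scalar(@softed);    # typo: @sorted was intended
    1;
};

# $@ holds the compile error when the eval fails.
my $error = $@;
print "strict rejected the snippet: $error" unless $ok;
```

Without strict, `@softed` is silently treated as an empty package array, so `scalar(@softed)` is 0 and the comparison against `$index` never succeeds.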
Re^2: Removing duplicate lines from a file by johngg (Canon) on Jan 19, 2009 at 23:46 UTC
Quoting zwon: "... Also you can write `unless ($prev eq $line){ push (@filtered,$prev); }`"

I would go a little further by using a statement modifier, and I'd also omit the brackets around the push arguments. That saves typing two pairs of brackets and a pair of braces and, to my eye, looks clearer; others may disagree. `push @filtered, $prev unless $prev eq $line;` Cheers, JohnGG
Re^2: Removing duplicate lines from a file by jethro (Monsignor) on Jan 19, 2009 at 22:53 UTC
Or, equally simple: `if (not $prev eq $line) {` or `if ($prev ne $line) {`
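Putting the suggestions in this sub-thread together, the adjacent-duplicate loop could be sketched as below. The helper name `dedup_sorted` is hypothetical, and the sketch assumes its input is already sorted, as in the original script. It keeps the first line of each run rather than the last, so no special case is needed for the final element.

```perl
use strict;
use warnings;

# Hypothetical helper: drop adjacent duplicates from an already-sorted list.
sub dedup_sorted {
    my @sorted = @_;
    my @filtered;
    my $prev;
    for my $line (@sorted) {
        # Keep a line only when it differs from the one before it.
        push @filtered, $line if !defined $prev or $prev ne $line;
        $prev = $line;
    }
    return @filtered;
}

my @input  = sort qw(b a b c a c c);
my @unique = dedup_sorted(@input);
print "@unique\n";
```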
Re: Removing duplicate lines from a file by gwadej (Chaplain) on Jan 19, 2009 at 21:18 UTC
The List::MoreUtils module has a `uniq` function that does what it sounds like you want. G. Wade
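A minimal sketch of the `uniq` approach follows. It imports `uniq` from the core List::Util module (available there from version 1.45) so that it runs without a CPAN install; List::MoreUtils exports a function of the same name that can be used the same way. The sample records are shortened, made-up stand-ins for the lines in the question.

```perl
use strict;
use warnings;
use List::Util qw(uniq);    # assumes List::Util 1.45 or newer

# Shortened stand-ins for the records in the question.
my @lines = (
    "943973\tMS08-011",
    "943973\tMS08-011",
    "951944\tMS08-055",
    "958439\tMS08-074",
    "958439\tMS08-074",
);

# uniq keeps the first occurrence of each value and preserves input order,
# so the list does not have to be sorted first.
my @unique = uniq @lines;
print "$_\n" for @unique;
```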
Re: Removing duplicate lines from a file by GrandFather (Saint) on Jan 19, 2009 at 22:19 UTC
First, always use strictures (`use strict; use warnings;`). They will give you an early heads-up about many silly errors and typos.

Use the three-parameter version of open and check the result. It's more secure, the intent is clearer, and checking the result saves a heap of time debugging silly errors.

Avoid slurping files (`@file = <MYFILE>;`). It doesn't scale well, it doesn't generally improve performance, and it doesn't generally help code clarity.

`push (@sort ,"@split[2]\t@split[1]\t@split[3]\t@split[0]\t@split[4]");` is better written `push (@sort ,"$split[2]\t$split[1]\t$split[3]\t$split[0]\t$split[4]");` but is much clearer using an array slice: `push @sort, join "\t", @split[2, 1, 3, 0, 4];`

However, in Perl when you think 'unique', you should generally then think 'hash'. Consider:

    use strict;
    use warnings;

    my $data = <<DATA;
    2009-01-08 09:29:19    ABCDEF    943973    MS08-011    Security Update for Microsoft Works Suite 2005 (KB943973)
    2009-01-08 09:29:19    ABCDEF    943973    MS08-011    Security Update for Microsoft Works Suite 2005 (KB943973)
    2009-01-08 09:29:19    ABCDEF    951944    MS08-055    Security Update for the 2007 Microsoft Office System (KB951944)
    2009-01-08 09:29:19    ABCDEF    953432        Update for Microsoft Office Outlook 2003 (KB953432)
    2009-01-08 09:29:19    ABCDEF    954038    MS08-051    Security Update for 2007 Microsoft Office System (KB954038)
    2009-01-08 09:29:19    ABCDEF    954326    MS08-052    Security Update for the 2007 Microsoft Office System (KB954326)
    2009-01-08 09:29:19    ABCDEF    956391        Cumulative Security Update for ActiveX Killbits for Windows 2000 (KB956391)
    2009-01-08 09:29:20    ABCDEF    956828    MS08-072    Security Update for the 2007 Microsoft Office System (KB956828)
    2009-01-08 09:29:20    ABCDEF    956828    MS08-072    Security Update for the 2007 Microsoft Office System (KB956828)
    2009-01-08 09:29:20    ABCDEF    957832        Update for Microsoft Office Outlook 2003 Junk Email Filter (KB957832)
    2009-01-08 09:29:22    ABCDEF    958439    MS08-074    Security Update for the 2007 Microsoft Office System (KB958439)
    2009-01-08 09:29:22    ABCDEF    958439    MS08-074    Security Update for the 2007 Microsoft Office System (KB958439)
    DATA

    open my $inFile, '<', \$data;
    my %entries;

    while (<$inFile>) {
        my ($date, $time, $endpoint, $kbid, $id, $title) =
            /(\S+)\s+ (\S+)\s+ (\S+)\s+ (\S+)\s+ (\w+-\d+\s+)? (.*)/x;

        $id ||= '';
        $entries{$kbid} = {
            date     => $date,
            time     => $time,
            endpoint => $endpoint,
            id       => $id,
            kbid     => $kbid,
            title    => $title,
        };
    }

    close $inFile;

    print join ("\t", @{$_}{qw(id kbid title endpoint date)}), "\n"
        for sort {$a->{id} cmp $b->{id} or $a->{kbid} <=> $b->{kbid}}
        values %entries;

Prints:

            953432  Update for Microsoft Office Outlook 2003 (KB953432)     ABCDEF  2009-01-08
            956391  Cumulative Security Update for ActiveX Killbits for Windows 2000 (KB956391)     ABCDEF  2009-01-08
            957832  Update for Microsoft Office Outlook 2003 Junk Email Filter (KB957832)   ABCDEF  2009-01-08
    MS08-011        943973  Security Update for Microsoft Works Suite 2005 (KB943973)       ABCDEF  2009-01-08
    MS08-051        954038  Security Update for 2007 Microsoft Office System (KB954038)     ABCDEF  2009-01-08
    MS08-052        954326  Security Update for the 2007 Microsoft Office System (KB954326) ABCDEF  2009-01-08
    MS08-055        951944  Security Update for the 2007 Microsoft Office System (KB951944) ABCDEF  2009-01-08
    MS08-072        956828  Security Update for the 2007 Microsoft Office System (KB956828) ABCDEF  2009-01-08
    MS08-074        958439  Security Update for the 2007 Microsoft Office System (KB958439) ABCDEF  2009-01-08

Perl's payment curve coincides with its learning curve.
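The slice advice in this reply can be checked with a short sketch. The sample record is made up, and the index order 2, 1, 3, 0, 4 is taken from the second `push` in the original script (BulletinID, KBID, Title, Endpointname, Date); both are assumptions for illustration.

```perl
use strict;
use warnings;

# Fields in the order the original script holds them before reformatting:
# endpoint, KB id, bulletin id, title, date (made-up sample record).
my @split = ('ABCDEF', '943973', 'MS08-011', 'Security Update', '2009-01-08');

# Scalar element interpolation, one field at a time.
my $interpolated = "$split[2]\t$split[1]\t$split[3]\t$split[0]\t$split[4]";

# The same reordering expressed as a single array slice.
my $sliced = join "\t", @split[2, 1, 3, 0, 4];

print "both forms agree\n" if $interpolated eq $sliced;
print "$sliced\n";
```

The slice states the new field order in one place, which makes it harder to transpose two indices by accident.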
Re: Removing duplicate lines from a file by jwkrahn (Abbot) on Jan 19, 2009 at 22:01 UTC
It looks like you need something more like this:

    #!/usr/local/bin/perl
    use warnings;
    use strict;

    open MYFILE, '<', 'file.txt' or die "Cannot open file.txt: $!";

    my %unique;
    while ( <MYFILE> ) {
        chomp;
        next unless s/^(\d{4}-\d\d-\d\d)\s+\d\d:\d\d:\d\d\s+//;
        $unique{ "$_\t$1" }++;    # updated!
    }
    print " - Found ($.)\n";
    close MYFILE;

    print "Remove duplicate lines\n";
    print "Found (", scalar keys %unique, ")\n";

    my @sorted = sort keys %unique;

    print "format each line so that its formated as BulletinID,KBID,Title,Endpointname,Date\n";

    my @sort;
    for my $line ( @sorted ) {
        print "LINEPRINT - $line\n";
        push @sort, join "\t", ( split /\t/, $line )[ 2, 1, 3, 0, 4 ];
    }
Re^2: Removing duplicate lines from a file by green_lakers (Novice) on Jan 19, 2009 at 22:13 UTC
Thanks, it works. I was wondering how I can get output like this from the @sort array:

    TOTAL   PATCH       2009-01-08  ANYDATE
    1       MS08-011    1           0
    1       MS08-051    1           0
    1       MS08-052    1           0
    1       MS08-055    1           0
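One possible reading of this request is a count of records per bulletin, with one column per date plus a total. The sketch below works under that assumption; it also assumes `@sort` holds tab-separated lines in BulletinID, KBID, Title, Endpointname, Date order and that updates without a bulletin ID are left out. The sample records, including the second endpoint `GHIJKL` and the date 2009-01-09, are made up to show more than one date column.

```perl
use strict;
use warnings;

# Made-up sample in BulletinID, KBID, Title, Endpointname, Date order.
my @sort = (
    "MS08-011\t943973\tSecurity Update (KB943973)\tABCDEF\t2009-01-08",
    "MS08-051\t954038\tSecurity Update (KB954038)\tABCDEF\t2009-01-08",
    "MS08-051\t954038\tSecurity Update (KB954038)\tGHIJKL\t2009-01-09",
    "\t953432\tUpdate for Outlook 2003 (KB953432)\tABCDEF\t2009-01-08",
);

my (%count, %dates);
for my $line (@sort) {
    my ($id, $kbid, $title, $endpoint, $date) = split /\t/, $line;
    next unless length $id;    # skip updates that have no bulletin ID
    $count{$id}{$date}++;      # records seen for this bulletin on this date
    $dates{$date} = 1;         # remember every date for the column headers
}

my @dates  = sort keys %dates;
my @report = join "\t", 'TOTAL', 'PATCH', @dates;
for my $id (sort keys %count) {
    # A bulletin with no records on a given date gets a 0 in that column.
    my @per_date = map { $count{$id}{$_} // 0 } @dates;
    my $total = 0;
    $total += $_ for @per_date;
    push @report, join "\t", $total, $id, @per_date;
}

print "$_\n" for @report;
```

A hash of hashes keyed by bulletin and then by date keeps the counting to a single pass over the array; the date columns are only fixed once every record has been read.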
Re: Removing duplicate lines from a file by apl (Monsignor) on Jan 20, 2009 at 01:28 UTC
If the data file exists before your program runs, I'd use the *nix `sort -u` command first. Then your program could process the unique records present in the file.
Re: Removing duplicate lines from a file by bradcathey (Prior) on Jan 20, 2009 at 02:30 UTC
And there are more than just "Guys" lurking around these hallowed halls... - Brad "The important work of moving the world forward does not wait to be done by perfect men." George Eliot