green_lakers has asked for the wisdom of the Perl Monks concerning the following question:

Hi Guys, I have this file...
2009-01-08 09:29:19 ABCDEF 943973 MS08-011 Security Update for Microsoft Works Suite 2005 (KB943973)
2009-01-08 09:29:19 ABCDEF 943973 MS08-011 Security Update for Microsoft Works Suite 2005 (KB943973)
2009-01-08 09:29:19 ABCDEF 951944 MS08-055 Security Update for the 2007 Microsoft Office System (KB951944)
2009-01-08 09:29:19 ABCDEF 953432 Update for Microsoft Office Outlook 2003 (KB953432)
2009-01-08 09:29:19 ABCDEF 954038 MS08-051 Security Update for 2007 Microsoft Office System (KB954038)
2009-01-08 09:29:19 ABCDEF 954326 MS08-052 Security Update for the 2007 Microsoft Office System (KB954326)
2009-01-08 09:29:19 ABCDEF 956391 Cumulative Security Update for ActiveX Killbits for Windows 2000 (KB956391)
2009-01-08 09:29:20 ABCDEF 956828 MS08-072 Security Update for the 2007 Microsoft Office System (KB956828)
2009-01-08 09:29:20 ABCDEF 956828 MS08-072 Security Update for the 2007 Microsoft Office System (KB956828)
2009-01-08 09:29:20 ABCDEF 957832 Update for Microsoft Office Outlook 2003 Junk Email Filter (KB957832)
2009-01-08 09:29:22 ABCDEF 958439 MS08-074 Security Update for the 2007 Microsoft Office System (KB958439)
2009-01-08 09:29:22 ABCDEF 958439 MS08-074 Security Update for the 2007 Microsoft Office System (KB958439)
I want to remove the duplicate lines so the output looks like this:--
953432 Update for Microsoft Office Outlook 2003 (KB953432) ABCDEF 2009-01-08
956391 Cumulative Security Update for ActiveX Killbits for Windows 2000 (KB956391) ABCDEF 2009-01-08
957832 Update for Microsoft Office Outlook 2003 Junk Email Filter (KB957832) ABCDEF 2009-01-08
MS08-011 943973 Security Update for Microsoft Works Suite 2005 (KB943973) ABCDEF 2009-01-08
MS08-051 954038 Security Update for 2007 Microsoft Office System (KB954038) ABCDEF 2009-01-08
MS08-052 954326 Security Update for the 2007 Microsoft Office System (KB954326) ABCDEF 2009-01-08
MS08-055 951944 Security Update for the 2007 Microsoft Office System (KB951944) ABCDEF 2009-01-08
MS08-072 956828 Security Update for the 2007 Microsoft Office System (KB956828) ABCDEF 2009-01-08
MS08-074 958439 Security Update for the 2007 Microsoft Office System (KB958439) ABCDEF 2009-01-08

But I am not getting the last line "MS08-074 958439 Security Update for the 2007 Microsoft Office System (KB958439) ABCDEF 2009-01-08" in my output. This is the output I am getting:--

953432 Update for Microsoft Office Outlook 2003 (KB953432) ABCDEF 2009-01-08
956391 Cumulative Security Update for ActiveX Killbits for Windows 2000 (KB956391) ABCDEF 2009-01-08
957832 Update for Microsoft Office Outlook 2003 Junk Email Filter (KB957832) ABCDEF 2009-01-08
MS08-011 943973 Security Update for Microsoft Works Suite 2005 (KB943973) ABCDEF 2009-01-08
MS08-051 954038 Security Update for 2007 Microsoft Office System (KB954038) ABCDEF 2009-01-08
MS08-052 954326 Security Update for the 2007 Microsoft Office System (KB954326) ABCDEF 2009-01-08
MS08-055 951944 Security Update for the 2007 Microsoft Office System (KB951944) ABCDEF 2009-01-08
MS08-072 956828 Security Update for the 2007 Microsoft Office System (KB956828) ABCDEF 2009-01-08

This is the code I am using:--

#!/usr/local/bin/perl

open (MYFILE, 'file.txt');
@file = <MYFILE>;
close (MYFILE);
print (" - Found (" . scalar ( @file) . ")\n");

foreach $line (@file) {
    chomp ($line);
    @split=split(/\t/, $line);
    @date=split(/\s+/, $split[0]);
    push (@sort ,"@split[1]\t@split[2]\t@split[3]\t@split[4]\t@date[0]");
}
@sorted = sort (@sort);
foreach $Endpoint (@sorted) {
    $Endpoint =~ s/\s*$//;
    print "FIRST - $Endpoint\n";
}
undef (@sort);
print ("Found (" . scalar (@sorted) . ")\n");
print ("Remove duplicate lines\n");

$prev="";
$index=0;
foreach $line (@sorted) {
    $index++;
    if ("$prev" eq ""){
        $prev = $line;
    }else {
        if ($prev eq $line) {
        } else {
            push (@filtered,$prev);
        }
    }
    if ($index == scalar(@softed)){
        push (@filtered,$line);
    }
    $prev = $line;
}
@sorted = sort (@filtered);
undef (@filtered);
print ("Found (" . scalar (@sorted) . ")\n");
print ("format each line so that its formated as BulletinID,KBID,Title,Endpointname,Date\n");

foreach $line(@sorted){
    @split=split(/\t/, $line);
    print "LINEPRINT - $line\n";
    push (@sort ,"@split[2]\t@split[1]\t@split[3]\t@split[0]\t@split[4]");
}
@sorted = sort (@sort);
undef (@sort);
print ("Found (" . scalar (@sorted) . ")\n");
foreach $Endpoint (@sorted) {
    $Endpoint =~ s/\s*$//;
    print "$Endpoint\n";
}

Re: Removing duplicate lines from a file
by zwon (Abbot) on Jan 19, 2009 at 21:24 UTC

    Much better, but you should also learn about perltidy :). You should start all your scripts with:

    use warnings; use strict;
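    This will help you to catch errors like this:
    if ($index == scalar(@softed)){
    Note softed here; it should be sorted. Because @softed is never filled, scalar(@softed) is 0, the test never succeeds, and the last line is never pushed onto @filtered. Also you can write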
    unless ($prev eq $line){ push (@filtered,$prev); }
    instead of
    if ($prev eq $line) { } else { push (@filtered,$prev); }
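
    A minimal sketch of the same adjacent-comparison idea under strict and warnings, assuming the list is already sorted. The helper name drop_adjacent_duplicates is hypothetical, not from the original script. Pushing each line the first time it is seen removes the need for the end-of-list special case where the typo was hiding, and with strict enabled a misspelled @softed would be reported at compile time.

```perl
use strict;
use warnings;

# Keep one copy of each line from an already-sorted list by comparing
# each element with the one before it.
# (drop_adjacent_duplicates is a hypothetical helper name.)
sub drop_adjacent_duplicates {
    my @sorted = @_;
    my @filtered;
    my $prev;
    for my $line (@sorted) {
        # Push a line the first time it appears; skip exact repeats.
        push @filtered, $line unless defined $prev and $prev eq $line;
        $prev = $line;
    }
    return @filtered;
}

my @demo = drop_adjacent_duplicates(qw(a a b c c));
print "@demo\n";    # prints "a b c"
```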
      ... Also you can write
      unless ($prev eq $line){ push (@filtered,$prev); }

      I would go a little further by using a statement modifier and I'd also omit the brackets around the push arguments. That saves typing two pairs of brackets and a pair of braces and, to my eye, looks clearer; others may disagree.

      push @filtered, $prev unless $prev eq $line;

      Cheers,

      JohnGG

      or equally simple
      if (not $prev eq $line) {
      or if ($prev ne $line) {
Re: Removing duplicate lines from a file
by gwadej (Chaplain) on Jan 19, 2009 at 21:18 UTC

    The List::MoreUtils module has a uniq function that does what it sounds like you want.

    G. Wade
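
    A short sketch of what calling uniq looks like. As an assumption for portability it imports uniq from List::Util (version 1.45 or later), which provides a function of the same name; with List::MoreUtils installed, only the use line would change. The sample records are shortened stand-ins for the real lines.

```perl
use strict;
use warnings;
use List::Util 1.45 qw(uniq);    # assumption: List::Util 1.45+ is available

# Shortened, hypothetical records standing in for the real file lines.
my @lines = (
    "ABCDEF\t943973\tMS08-011",
    "ABCDEF\t943973\tMS08-011",
    "ABCDEF\t951944\tMS08-055",
);

# uniq keeps the first occurrence of each value and preserves order.
my @unique = uniq @lines;
print scalar(@unique), " unique of ", scalar(@lines), "\n";    # prints "2 unique of 3"
```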
Re: Removing duplicate lines from a file
by GrandFather (Saint) on Jan 19, 2009 at 22:19 UTC

    First, always use strictures (use strict; use warnings;). They will give you an early heads up about many silly errors and typos.

    Use the three parameter version of open and check the result. It's more secure, the intent is clearer and checking the result saves a heap of time debugging silly errors.

    Avoid slurping files (@file = <MYFILE>;). It doesn't scale well. It doesn't generally improve performance and it doesn't generally help code clarity.
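
    The two points above might look like this in practice. This is only a sketch: the sample text is made up, and it is read through an in-memory filehandle so that it runs without file.txt on disk.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for file.txt.
my $text = "first\nsecond\nsecond\n";

# Three-argument open with the result checked. A reference to a scalar
# opens an in-memory file; a real script would pass a file name instead.
open my $in, '<', \$text or die "Cannot open input: $!";

my $count = 0;
while (my $line = <$in>) {    # one line at a time, no slurping
    chomp $line;
    $count++;
}
close $in;

print "Read $count lines\n";    # prints "Read 3 lines"
```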

    push (@sort ,"@split[2]\t@split[1]\t@split[3]\t@split[0]\t@split[4]");

    is better written:

    push (@sort ,"$split[2]\t$split[1]\t$split[3]\t$split[0]\t$split[4]");

    but is much clearer using an array slice:

    push @sort, join "\t", @split[2, 1, 3, 0, 4];

    However, in Perl when you think 'unique', you should generally then think 'hash'. Consider:

    use strict;
    use warnings;

    my $data = <<DATA;
    2009-01-08 09:29:19 ABCDEF 943973 MS08-011 Security Update for Microsoft Works Suite 2005 (KB943973)
    2009-01-08 09:29:19 ABCDEF 943973 MS08-011 Security Update for Microsoft Works Suite 2005 (KB943973)
    2009-01-08 09:29:19 ABCDEF 951944 MS08-055 Security Update for the 2007 Microsoft Office System (KB951944)
    2009-01-08 09:29:19 ABCDEF 953432 Update for Microsoft Office Outlook 2003 (KB953432)
    2009-01-08 09:29:19 ABCDEF 954038 MS08-051 Security Update for 2007 Microsoft Office System (KB954038)
    2009-01-08 09:29:19 ABCDEF 954326 MS08-052 Security Update for the 2007 Microsoft Office System (KB954326)
    2009-01-08 09:29:19 ABCDEF 956391 Cumulative Security Update for ActiveX Killbits for Windows 2000 (KB956391)
    2009-01-08 09:29:20 ABCDEF 956828 MS08-072 Security Update for the 2007 Microsoft Office System (KB956828)
    2009-01-08 09:29:20 ABCDEF 956828 MS08-072 Security Update for the 2007 Microsoft Office System (KB956828)
    2009-01-08 09:29:20 ABCDEF 957832 Update for Microsoft Office Outlook 2003 Junk Email Filter (KB957832)
    2009-01-08 09:29:22 ABCDEF 958439 MS08-074 Security Update for the 2007 Microsoft Office System (KB958439)
    2009-01-08 09:29:22 ABCDEF 958439 MS08-074 Security Update for the 2007 Microsoft Office System (KB958439)
    DATA

    open my $inFile, '<', \$data;
    my %entries;

    while (<$inFile>) {
        my ($date, $time, $endpoint, $kbid, $id, $title) =
            /(\S+)\s+ (\S+)\s+ (\S+)\s+ (\S+)\s+ (\w+-\d+\s+)? (.*)/x;

        $id ||= '';
        $entries{$kbid} = {
            date     => $date,
            time     => $time,
            endpoint => $endpoint,
            id       => $id,
            kbid     => $kbid,
            title    => $title,
        };
    }

    close $inFile;

    print join ("\t", @{$_}{qw(id kbid title endpoint date)}), "\n"
        for sort {$a->{id} cmp $b->{id} or $a->{kbid} <=> $b->{kbid}} values %entries;

    Prints:

    953432 Update for Microsoft Office Outlook 2003 (KB953432) ABCDEF 2009-01-08
    956391 Cumulative Security Update for ActiveX Killbits for Windows 2000 (KB956391) ABCDEF 2009-01-08
    957832 Update for Microsoft Office Outlook 2003 Junk Email Filter (KB957832) ABCDEF 2009-01-08
    MS08-011 943973 Security Update for Microsoft Works Suite 2005 (KB943973) ABCDEF 2009-01-08
    MS08-051 954038 Security Update for 2007 Microsoft Office System (KB954038) ABCDEF 2009-01-08
    MS08-052 954326 Security Update for the 2007 Microsoft Office System (KB954326) ABCDEF 2009-01-08
    MS08-055 951944 Security Update for the 2007 Microsoft Office System (KB951944) ABCDEF 2009-01-08
    MS08-072 956828 Security Update for the 2007 Microsoft Office System (KB956828) ABCDEF 2009-01-08
    MS08-074 958439 Security Update for the 2007 Microsoft Office System (KB958439) ABCDEF 2009-01-08
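
    When whole values need to be unique and no per-record structure is required, the 'think hash' advice reduces to the common %seen idiom. A minimal sketch; the helper name unique_in_order is hypothetical, and the KB numbers are taken from the sample data.

```perl
use strict;
use warnings;

# Return the input list with later repeats removed, keeping first-seen order.
# (unique_in_order is a hypothetical helper name.)
sub unique_in_order {
    my %seen;
    # $seen{$_}++ is false only the first time a value is encountered.
    return grep { !$seen{$_}++ } @_;
}

my @kbids = unique_in_order(qw(943973 943973 951944 956828 956828));
print "@kbids\n";    # prints "943973 951944 956828"
```

    Unlike comparing neighbours in a sorted list, this does not need the input to be sorted first.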

    Perl's payment curve coincides with its learning curve.
Re: Removing duplicate lines from a file
by jwkrahn (Abbot) on Jan 19, 2009 at 22:01 UTC

    It looks like you need something more like this:

    #!/usr/local/bin/perl
    use warnings;
    use strict;

    open MYFILE, '<', 'file.txt' or die "Cannot open file.txt: $!";

    my %unique;
    while ( <MYFILE> ) {
        chomp;
        next unless s/^(\d{4}-\d\d-\d\d)\s+\d\d:\d\d:\d\d\s+//;
        $unique{ "$_\t$1" }++;    # updated!
    }
    print " - Found ($.)\n";
    close MYFILE;

    print "Remove duplicate lines\n";
    print "Found (", scalar keys %unique, ")\n";

    my @sorted = sort keys %unique;

    print "format each line so that its formated as BulletinID,KBID,Title,Endpointname,Date\n";

    my @sort;
    for my $line ( @sorted ) {
        print "LINEPRINT - $line\n";
        push @sort, join "\t", ( split /\t/, $line )[ 2, 1, 3, 0, 4 ];
    }
      Thanks, it works. I was wondering how I can get output like this from the @sort array:
      TOTAL PATCH 2009-01-08 ANYDATE
      1 MS08-011 1 0
      1 MS08-051 1 0
      1 MS08-052 1 0
      1 MS08-055 1 0
Re: Removing duplicate lines from a file
by apl (Monsignor) on Jan 20, 2009 at 01:28 UTC
    If the data file exists before your program runs, I'd use the *nix sort -u command first. Then your program could process the unique records present in the file.
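
    One way to combine the two, sketched here as an assumption rather than a recommendation: have the script read from sort -u through a pipe. The helper name unique_lines_via_sort and the throwaway demo file are hypothetical, and this needs a sort command on the PATH. Note that sort -u only drops lines that are identical in full, time stamp included.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Read the unique lines of a file through the external "sort -u" command.
# (unique_lines_via_sort is a hypothetical helper name.)
sub unique_lines_via_sort {
    my ($path) = @_;
    # List-form pipe open: no shell is involved, so the path needs no quoting.
    open my $pipe, '-|', 'sort', '-u', $path
        or die "Cannot run sort: $!";
    chomp(my @lines = <$pipe>);
    close $pipe or die "sort exited with status $?";
    return @lines;
}

# Demonstration on a throwaway file.
my ($tmp, $tmp_name) = tempfile(UNLINK => 1);
print {$tmp} "b\na\nb\n";
close $tmp or die "Cannot close $tmp_name: $!";

my @lines = unique_lines_via_sort($tmp_name);
print scalar(@lines), " unique lines\n";    # prints "2 unique lines"
```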
Re: Removing duplicate lines from a file
by bradcathey (Prior) on Jan 20, 2009 at 02:30 UTC

    And there are more than just "Guys" lurking around these hallowed halls...

    —Brad
    "The important work of moving the world forward does not wait to be done by perfect men." George Eliot