Thank you, Laurent_R! The one-liner is not printing all the lines. If I have three duplicates, it prints only the last two (or, for a pair, only one of them), not all of them.

    1 twenty
    2 thirty
    1 forty
    1 fifty

output

    1 twenty
    1 forty
    1 fifty

Is there a way to write it as a script instead of a one-liner? Thank you, guys.

Re^4: how to change this code into perl
by Laurent_R (Canon) on Aug 30, 2015 at 17:57 UTC
    OK, here is a real script that should detect all lines having duplicate keys (a quick script, untested, as I have no time now, but it is based on something I do quite often, so hopefully I have it right).
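    my ($previous_key, $previous_line);
    open my $IN, "<", $infile or die "cannot open $infile $!";
    while (<$IN>) {
        # The key is the leading word of the line (undef if there is none).
        my ($key) = /^(\w+)/;
        if (defined $key and defined $previous_key and $key eq $previous_key) {
            # Same key as the line before: print the held-back line once,
            # then the current line.
            print $previous_line if defined $previous_line;
            print $_;
            undef $previous_line;
        } else {
            $previous_line = $_;
        }
        $previous_key = $key;
    }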
Re^4: how to change this code into perl
by Laurent_R (Canon) on Aug 30, 2015 at 17:42 UTC
    Sure, where there are two entries with the same key, it only prints the second one (the duplicate, not the original one); when there are three, it will print only the second one and the third one. And of course, it will work only if the lines are properly sorted.

    If you need to print all the lines that are duplicates, then it is slightly more complicated, because you need to keep track of recent history. And then, yes, it is probably better to write a real script.

    Another way is to use a hash to keep track of everything in memory.
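
The hash-based idea mentioned above can be sketched roughly as follows. This is only an illustration under stated assumptions, not code posted in the thread: the subroutine name `duplicate_key_lines` is hypothetical, and the key is assumed to be the first whitespace-delimited field of each line.

```perl
use strict;
use warnings;

# Group lines by their first whitespace-delimited field and return the
# lines of every group that has more than one member. Groups come out in
# order of first appearance, so the input does not need to be sorted.
# (Hypothetical helper name; the key definition is an assumption.)
sub duplicate_key_lines {
    my @lines = @_;
    my (%group, @order);

    for my $line (@lines) {
        my ($key) = $line =~ /^(\S+)/;
        next unless defined $key;    # skip blank lines or lines starting with whitespace
        push @order, $key unless exists $group{$key};
        push @{ $group{$key} }, $line;
    }
    return map { @{ $group{$_} } } grep { scalar @{ $group{$_} } > 1 } @order;
}

# The sample from the question, deliberately left unsorted.
my @sample = ("1 twenty\n", "2 thirty\n", "1 forty\n", "1 fifty\n");
print duplicate_key_lines(@sample);
```

With the sample from the question, this prints the three lines whose key is 1. The trade-off is memory: every line is held in the hash until the whole input has been read, which should be acceptable for files of a few tens of thousands of rows.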

Re^4: how to change this code into perl
by perlnewbie012215 (Novice) on Aug 30, 2015 at 19:07 UTC

    Hi poj, thank you for the quick response. I tried the script and could not get the duplicate rows; the output file came up with zero rows. Below is the script I tried:

    open IN,'<','/home/scripts/imageoutcome.txt' or die "Could not open $infile : $!";
    my %count = ();
    my @lines = ();
    while (<IN>){
      push @lines,$_;
      # print $_;
      if (/^(\S+)/){
        ++$count{$1};
      }
    }
    close IN;

    open OUT,'>','/home/scripts/outcome.txt' or die "Could not open $outfile : $!";
    #print @lines;
    for (@lines){
      if (/^(\S+)/){
        print $count{$1};
        print OUT $_ if $count{$1} > 0;
      }
    }
    close OUT;

      Did you try it with the sample you provided?

        1 twenty
        2 thirty
        1 forty
        1 fifty

      Update: Does your file have spaces at the beginning of the lines?

      poj

        It seems there were some special characters and spaces in it. I deleted those and it's working perfectly now.
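
Stray spaces and invisible characters of the kind described here can make keys that look identical compare as different. A minimal sketch of how one might reveal and strip them before extracting the key is shown below. The helper names `show_hidden` and `clean_line` are hypothetical, and the set of characters handled (leading and trailing whitespace, carriage returns, a byte-order mark) is an assumption about what "special characters" meant in this case.

```perl
use strict;
use warnings;

# Make every character outside printable ASCII visible as \x{NN},
# so a tab shows up as \x{09} and a carriage return as \x{0D}.
# (Hypothetical helper, for diagnosis only.)
sub show_hidden {
    my ($line) = @_;
    (my $shown = $line) =~ s/([^\x20-\x7E])/sprintf('\\x{%02X}', ord $1)/ge;
    return $shown;
}

# Strip a leading byte-order mark plus leading and trailing whitespace
# (which includes the newline and any Windows carriage return).
# (Hypothetical helper; which characters to strip is an assumption.)
sub clean_line {
    my ($line) = @_;
    $line =~ s/^\x{FEFF}//;        # BOM when the input was decoded as UTF-8
    $line =~ s/^\xEF\xBB\xBF//;    # BOM when the input is raw bytes
    $line =~ s/^\s+//;
    $line =~ s/\s+$//;
    return $line;
}

my $raw = " \t1 twenty\r\n";
print show_hidden($raw), "\n";    # shows the tab, CR and LF as escapes
print clean_line($raw), "\n";     # the line with the stray characters removed
```

Running each input line through something like `clean_line` before the key is matched would let a pattern such as `/^(\S+)/` see the real first column.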

Re^4: how to change this code into perl
by perlnewbie012215 (Novice) on Aug 30, 2015 at 17:29 UTC

    The file will be around 20,000 rows, and the first column will always be text.

      #!perl
      use strict;
      use warnings;

      my $infile  = $ARGV[0];
      my $outfile = $ARGV[1];

      open IN,'<',$infile or die "Could not open $infile : $!";
      my %count = ();
      my @lines = ();
      while (<IN>){
        push @lines,$_;
        if (/^(\S+)/){
          ++$count{$1};
        }
      }
      close IN;

      open OUT,'>',$outfile or die "Could not open $outfile : $!";
      for (@lines){
        if (/^(\S+)/){
          print OUT $_ if $count{$1} > 1;
        }
      }
      close OUT;
      poj
Re^4: how to change this code into perl
by perlnewbie012215 (Novice) on Aug 30, 2015 at 19:14 UTC

    Thank you very much, Laurent_R. I tried the script and it's printing all the rows instead of only the duplicates. This code looks very interesting; can you please explain it?

    #!/usr/bin/perl
    my ($previous_key, $previous_line);
    open my $IN, "<", '/home/scripts/imageoutcome.txt' or die "cannot open $infile $!";
    while (<$IN>) {
        my $key = $1 if /^(\w+)/;
        if ($key eq $previous_key) {
            print $previous_line if defined $previous_line;
            print $_;
            undef $previous_line;
        } else {
            $previous_line = $_;
        }
        $previous_key = $key;
    }
      "I tried the script and it's printing all the rows"

      Then you have to show me your input data. I've just tried that script with the following input data:

        aa blah
        bb blah
        bb blahblah
        bb foo
        cc dlqskjf
        cc cfkqs
        dd dkls
        ee dsjkqjs
        ff blah
        gg klsqdj
        gg sqkl

      and it prints only the lines where the first column is a duplicate, as shown in this output:

        bb blah
        bb blahblah
        bb foo
        cc dlqskjf
        cc cfkqs
        gg klsqdj
        gg sqkl
      This seems to work perfectly.

      Otherwise, the way it works is that it reads the file one line at a time and stores each line ($previous_line), together with its comparison key ($previous_key), until the next line is read. If the two lines have the same key, I print the previous line (if defined) and the current one; in that case, I undef the previous line to prevent it from being printed twice when there are three or more lines with the same key.

      If it does not work properly for you, please show your input and/or test data.

Re^4: how to change this code into perl
by poj (Abbot) on Aug 30, 2015 at 17:23 UTC

    How big are the files, and is the first column always numeric?

    poj
Re^4: how to change this code into perl
by perlnewbie012215 (Novice) on Aug 30, 2015 at 19:40 UTC

    Hi poj, you are correct: I forgot chomp. It's working now. Thank you so much for helping me.

Re^4: how to change this code into perl
by perlnewbie012215 (Novice) on Sep 01, 2015 at 22:39 UTC

    Hi Laurent_R, that was my bad: I had hidden characters in the file, and that's why it did not work. Your script is working. Thank you so much for helping me and explaining it.