marto9 has asked for the wisdom of the Perl Monks concerning the following question:

Hello. I wanted to create a Perl script that removes all the duplicate lines from a file, so I searched on Google for some scripts that do that, but they don't work properly. I need a script that removes all duplicates, as well as the blank lines, spaces and tabs. Here's the script I've got now:
use strict;
use warnings;

my $passwdfile = "d.txt";
my %seen = ();
{
    local @ARGV = ($passwdfile);
    local $^I = '.bac';
    while (<>) {
        $_ =~ s/^\s+//;
        $seen{$_}++;
        next if $seen{$_} > 1;
        print;
    }
}
print "finished";
Input:
5 5 5 5 5 6 55 6 66 5 5 5 5
Output:
5 5 6 55 66
I know it can be done with regex, but I'm a real beginner in perl and regex still looks very hard to me. Thx in advance

Replies are listed 'Best First'.
Re: Duplicate lines with spaces, tabs...
by moritz (Cardinal) on Jul 17, 2008 at 11:50 UTC
    I know it can be done with regex, but I'm a real beginner in perl and regex still looks very hard to me. Thx in advance

    A regex isn't always the best way; what you have so far seems just fine to me, both in terms of correctness and complexity. Sure, you can assemble all lines into a regex instead of a hash, but it will only bring you closer to insanity.

    Update: Uhm, I think I misread the question. The concern seems to be the duplicate 5 in the output, which seems to be caused by trailing whitespace. You can remove that with a second regex:

    s/[\t ]+$//;

    Second Update: Contrary to what others have written, s/\s+$// isn't enough, because it also strips the trailing newline and thus deletes all newline characters from your file (unless you add them back in the print statement).
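The difference between the two substitutions can be checked on a single string. This is only a small sketch: the sample line and the variable names are made up for illustration and are not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical sample line: a value followed by a space, a tab and a newline.
my $line = "5 \t\n";

# \s also matches "\n", so this removes the line terminator as well.
(my $all_ws = $line) =~ s/\s+$//;

# [\t ] matches only tabs and spaces; "$" can still match just before
# the final newline, so the newline itself is left in place.
(my $blanks_only = $line) =~ s/[\t ]+$//;

print $all_ws      eq "5"   ? "\\s+\$ also removed the newline\n" : "unexpected result\n";
print $blanks_only eq "5\n" ? "[\\t ]+\$ kept the newline\n"      : "unexpected result\n";
```

Either form makes "5" and "5 " compare equal in %seen; the choice only decides whether the print statement has to add the newline back.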

Re: Duplicate lines with spaces, tabs...
by pjotrik (Friar) on Jul 17, 2008 at 11:52 UTC

    A simple check, next unless $_;, as the second line of the while loop will do. All whitespace at the beginning of the line was removed by $_ =~ s/^\s+//;, so $_ is empty at that point if the line contained only whitespace (or nothing).

    UPDATE: And regarding the duplicate "5", you'll have to trim the trailing whitespace characters as well. s/\s+$//; will help with that. For the sake of your education :-), that means: in $_, if there is a sequence of 1 or more (that's the +) whitespace characters (\s) occurring just before the end of the string ($), replace it (s/.../.../) with nothing (the nothing between the last two slashes).

    UPDATE2: As pointed out by moritz, you'll have to add a newline to the print statement if all trailing whitespace, including the newline, was trimmed. The body of the while loop will look like

    s/^\s+//;
    s/\s+$//;
    next unless $_;
    $seen{$_}++;
    next if $seen{$_} > 1;
    print "$_\n";
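For experimenting outside the in-place edit, the same steps can be wrapped in a function. This is a sketch under assumptions: dedupe_lines and the sample data are hypothetical names invented here, not part of the thread. It also swaps next unless $_ for next unless length $line, because once the newline has been trimmed a line consisting of just "0" is false in Perl and would otherwise be dropped as if it were blank.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns the unique, trimmed,
# non-blank lines in first-seen order.
sub dedupe_lines {
    my @lines = @_;                 # work on a copy of the caller's data
    my (%seen, @kept);
    for my $line (@lines) {
        $line =~ s/^\s+//;          # drop leading whitespace
        $line =~ s/\s+$//;          # drop trailing whitespace, newline included
        next unless length $line;   # skip blank lines, but keep a line that is just "0"
        next if $seen{$line}++;     # skip anything that was already kept
        push @kept, $line;
    }
    return @kept;
}

# Made-up sample input with trailing blanks, a blank line and a "0" line.
my @sample = ("5\n", "5 \n", "\n", " \t6\n", "0\n", "55\n", "6\n");
print "$_\n" for dedupe_lines(@sample);
```

Inside the original while (<>) loop the equivalent change would be to write next unless length; in place of next unless $_;.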
Re: Duplicate lines with spaces, tabs...
by psini (Deacon) on Jul 17, 2008 at 11:56 UTC

    Your regex $_ =~ s/^\s+//; removes the leading spaces but not the trailing ones, so "5" is different from "5 " to your script.

    Try adding $_ =~ s/\s+$//;


Re: Duplicate lines with spaces, tabs...
by apl (Monsignor) on Jul 17, 2008 at 12:10 UTC
    The *nix sort command has a -u option that results in only unique lines being displayed.
Re: Duplicate lines with spaces, tabs...
by marto9 (Beadle) on Jul 17, 2008 at 13:00 UTC
    Thx, it worked! :D But I've got two more questions: 1. What does "next unless $_;" do? 2. When the file is cleaned I still get a blank line at the end of the file. How do I remove that?
      1. What does "next unless $_;" do?

      It will skip to the next iteration of the loop (provided the loop condition is still true) if $_ is empty, that is, unless $_ holds a true value. Could also be written as:
      if (not $_) { next }
      An empty string, zero, or undef, are all false in Perl.
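A quick way to see which values next unless $_ would skip is to test a few of them directly. The values and labels below are made up for illustration.

```perl
use strict;
use warnings;

# Illustrative values only; the labels are invented for this sketch.
my @cases = (
    [ 'empty string' => ''    ],
    [ 'string "0"'   => '0'   ],
    [ 'number 0'     => 0     ],
    [ 'undef'        => undef ],
    [ 'string "00"'  => '00'  ],
    [ 'single space' => ' '   ],
);

my %verdict;
for my $case (@cases) {
    my ($label, $value) = @$case;
    # "next unless $value" would skip exactly the values reported as false here.
    $verdict{$label} = $value ? 'true' : 'false';
    print "$label is $verdict{$label}\n";
}
```

Note that the string "0" counts as false while "00" and a single space count as true, which matters when the lines of a file are plain numbers.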
      When I run your code with pjotrik's changes, I see no blank line at the end of the result file.

        Maybe you didn't copy the entire example, including the blank lines. This is my result.
        5 6 55 66
        Apparently perlmonks.org doesn't show blank lines. But after '66' I get a new blank line.