Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,
What is the best way to open a text file, check each incoming line for duplication, and, if a line is duplicated, delete the duplicate from the file? I have tried to find some information on this, but nothing specific.
Thanks for helping!

Replies are listed 'Best First'.
Re: Checking Lines in Text File
by GrandFather (Saint) on Jun 01, 2006 at 19:02 UTC

    For small files read the file a line at a time and add the lines to a hash.

    When you have read a line, first check that it's not in the hash already. If it's not, print it to the output file and insert it into the hash; otherwise skip to the next line.
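
    A minimal sketch of that approach (the file names here are hypothetical):

        use strict;
        use warnings;

        open my $in,  '<', 'input.txt'  or die "Can't open input.txt: $!";
        open my $out, '>', 'output.txt' or die "Can't open output.txt: $!";

        my %seen;
        while (my $line = <$in>) {
            next if $seen{$line}++;   # already printed this line; skip it
            print $out $line;
        }
        close $in;
        close $out;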

    If you can't figure that out, write some code that is your best guess at how to do it, ask again and include your code.


    DWIM is Perl's answer to Gödel
      GrandFather, I have a further question related to your suggestion. If I have hundreds of lines of data like this:

      tag1:xxxxxxx tag2:xxxxxx tag3:xxxxxxx tag4:yyyyy

      How can I remove all lines that have the same tags 1 through 3 and replace them with a single line that has a new tag4? Currently I am able to remove all the excess lines with the same tags 1 through 3 using your method, but I am unable to change tag4, because the hash method works by not writing subsequent values. Hence, once I find out I have a duplicate, it is too late to change it, as the first occurrence has already been written. Any suggestions? Thanks!

        Provide half a dozen lines of sample data, the test code you are currently using, and a sample of the output you expect to see.

        For the test code it is easiest to use a __DATA__ section for the test data rather than an external file and simply print the result rather than generating an external file.
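
        For instance, a minimal sketch in that style, deferring all output until the end so a later duplicate can still change an earlier line (dummy data; the rule for computing the new tag4 isn't given, so keeping the last tag4 seen for each tag1-tag3 combination stands in for the real rule):

            use strict;
            use warnings;

            my %tag4;     # last tag4 seen for each tag1-tag3 combination
            my @order;    # combinations in order of first appearance

            while (my $line = <DATA>) {
                chomp $line;
                my ($t1, $t2, $t3, $t4) = split ' ', $line;
                my $key = "$t1 $t2 $t3";
                push @order, $key unless exists $tag4{$key};
                $tag4{$key} = $t4;    # stand-in rule: last duplicate wins
            }

            # Nothing is printed until here, so tag4 could still change above
            print "$_ $tag4{$_}\n" for @order;

            __DATA__
            tag1:aaa tag2:bbb tag3:ccc tag4:one
            tag1:aaa tag2:bbb tag3:ccc tag4:two
            tag1:ddd tag2:eee tag3:fff tag4:three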


        DWIM is Perl's answer to Gödel
Re: Checking Lines in Text File
by davidrw (Prior) on Jun 01, 2006 at 19:05 UTC
    You can also do this on the commandline:
    sort -u infile > outfile
    # or, if you want some of uniq's extra options:
    sort infile | uniq > outfile
Re: Checking Lines in Text File
by Anonymous Monk on Jun 01, 2006 at 19:44 UTC
    It would help me to know what you are trying to do. The solutions offered thus far -- "sort -u" and using a hash -- both assume that you want to eliminate a line if it has a duplicate anywhere in the file. If all you want to do is eliminate successive repeated lines, something like this might be better:
    my $last = $_ = <>;     # read and print the first line
    print;
    while (<>) {
        print if $_ ne $last;   # print only when the line changes
        $last = $_;
    }
      Successive repeated lines can also be eliminated at the (unixy) command line with uniq infile > outfile, so long as you don't run the data through sort first.

      Also note that, of the solutions provided thus far, the hash-based option is the only one which will both eliminate all duplicates (printing only the first appearance of each line) and also preserve the original order of the (remaining) lines, which may or may not be significant to you.