in reply to Remove Duplicate Lines

If you want to remove duplicates from a data file that has a header:

#!/usr/bin/perl
$ifile=$ARGV[0];
$ofile=$ARGV[1];
$header=`sed -n '1p' $ifile`;
$data=`sed '1d' $ifile | sort -u`;
open(my $fh, '>', $ofile) or die "Could not open file '$ofile' $!";
print $fh $header;
print $fh $data;
close $fh;
exit 0

Re^2: Remove Duplicate Lines
by afoken (Chancellor) on Aug 01, 2019 at 19:47 UTC

    Let's see:

    • use strict missing
    • use warnings missing
    • Missing my for $ifile, $ofile, $header, $data.
    • no check that the program is called with the correct number of arguments
    • Forking a shell (1) via qx (``) begs for trouble - see Improve pipe open?
    • ... to run sed, just to read the first line of a file
    • ... while making sed read the entire file
    • ... and ignoring all quoting issues by simply not quoting at all - see The problem of "the" default shell
    • ... and ignoring the fact that sed is not available by default on Windows and other operating systems
    • Forking another shell via qx to pipe sed output to sort -u input
    • ... again without any quoting
    • ... again assuming sed is available everywhere
    • ... assuming a POSIX sort is available everywhere. DOS/Windows sort does not understand -u and can't sort and filter out dupes
    • ... reading the entire output of sort -u into memory
    • ... just to write it out again three lines later
    • And finally, exit 0 is redundant

    This is highly inefficient and has several issues with "interesting" filenames.

    In Re: Remove Duplicate Lines, BrowserUk explains how to use perl properly.

    Another option - if running on a POSIX-compatible system - is to use sort properly. Without headers, it is trivial:

    sort -u < inputfile > outputfile

    With headers, this will do:

    head -n 1 inputfile > outputfile
    sed '1d' inputfile | sort -u >> outputfile

    This way, head can stop processing the input file after the first line, unlike sed -n '1p'. Directly writing to the outputfile avoids all further overhead of your script.
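
    As a minimal sketch, the two commands above could be wrapped in a small POSIX shell function that also checks its argument count and quotes the filenames. The name dedupe_keep_header is made up for illustration, and the sketch assumes that the input and output are two different files:

```shell
#!/bin/sh
# dedupe_keep_header INPUT OUTPUT
# (hypothetical helper name, not part of the original thread)
# Copies the first line of INPUT unchanged, then appends the remaining
# lines sorted with duplicates removed (sort -u).
# Assumption: INPUT and OUTPUT are different files.
dedupe_keep_header() {
    if [ "$#" -ne 2 ]; then
        echo "usage: dedupe_keep_header INPUT OUTPUT" >&2
        return 2
    fi
    # Every expansion is quoted, so names containing spaces or shell
    # metacharacters stay a single word. Feeding the input through a
    # redirection (< "$1") rather than as an operand also keeps a name
    # starting with '-' from being parsed as an option.
    head -n 1 < "$1" > "$2" || return 1
    sed '1d' < "$1" | sort -u >> "$2"
}
```

    It would be called as dedupe_keep_header "my data.csv" "my data.dedup.csv". As with the plain commands, the data lines come out in sorted order, not in their original order.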

    Alexander


    (1) Yes, given a sane filename, perl may start the first sed without the help of the default shell. Change the filename to something interesting and perl will start sed via the default shell.

    --
    Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)