Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,
What is the best way to open a text file, check each incoming line for duplication, and, if a line is duplicated, delete the duplicate from the file? I have tried to find some information on this, but nothing specific.
Thanks for helping!

Replies are listed 'Best First'.
Re: Checking Lines in Text File
by GrandFather (Saint) on Jun 01, 2006 at 19:02 UTC

    For small files read the file a line at a time and add the lines to a hash.

    When you have read a line, first check that it's not in the hash already. If it's not, print it to the output file and insert it into the hash; otherwise skip to the next line.
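
    A minimal sketch of that approach (the file names here are hypothetical):

        use strict;
        use warnings;

        open my $in,  '<', 'input.txt'  or die "Can't open input.txt: $!";
        open my $out, '>', 'output.txt' or die "Can't open output.txt: $!";

        my %seen;
        while (my $line = <$in>) {
            next if $seen{$line}++;   # already printed this line; skip it
            print $out $line;
        }
        close $in;
        close $out;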

    If you can't figure that out, write some code that is your best guess at how to do it, ask again and include your code.


    DWIM is Perl's answer to Gödel
      GrandFather, I have a further question related to your suggestion. If I have hundreds of lines of data like this:

      tag1:xxxxxxx tag2:xxxxxx tag3:xxxxxxx tag4:yyyyy

      How can I remove all lines that have the same tags 1 through 3 and replace them with a single line that has a new tag4? Currently I am able to remove all the excess lines with the same tags 1 through 3 using your method, but I am unable to change tag4, because the hash method works by not writing subsequent values. Hence, once I find out I have a duplicate, it is too late to change it, as the first occurrence has already been written. Any suggestions? Thanks!

        Provide half a dozen lines of sample data, the test code you are currently using, and a sample of the output you expect to see.

        For the test code it is easiest to use a __DATA__ section for the test data rather than an external file and simply print the result rather than generating an external file.
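
        For instance, a minimal sketch in that style, deferring all output until the end so a later duplicate can still change an earlier line (dummy data; the rule for computing the new tag4 isn't given, so keeping the last tag4 seen for each tag1-tag3 combination stands in for the real rule):

            use strict;
            use warnings;

            my %tag4;     # last tag4 seen for each tag1-tag3 combination
            my @order;    # combinations in order of first appearance

            while (my $line = <DATA>) {
                chomp $line;
                my ($t1, $t2, $t3, $t4) = split ' ', $line;
                my $key = "$t1 $t2 $t3";
                push @order, $key unless exists $tag4{$key};
                $tag4{$key} = $t4;    # stand-in rule: last duplicate wins
            }

            # Nothing is printed until here, so tag4 could still change above
            print "$_ $tag4{$_}\n" for @order;

            __DATA__
            tag1:aaa tag2:bbb tag3:ccc tag4:one
            tag1:aaa tag2:bbb tag3:ccc tag4:two
            tag1:ddd tag2:eee tag3:fff tag4:three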


        DWIM is Perl's answer to Gödel
Re: Checking Lines in Text File
by davidrw (Prior) on Jun 01, 2006 at 19:05 UTC
    You can also do this on the commandline:
    sort -u infile > outfile
    # or, if you want some of uniq's extra options:
    sort infile | uniq > outfile
Re: Checking Lines in Text File
by Anonymous Monk on Jun 01, 2006 at 19:44 UTC
    It would help me to know what you are trying to do. The solutions offered thus far -- "sort -u" and using a hash -- both assume that you want to eliminate a line if it has a duplicate anywhere in the file. If all you want to do is eliminate successive repeated lines, something like this might be better:
    my $last = $_ = <>;     # read and print the first line
    print;
    while (<>) {
        print if $_ ne $last;   # print only when the line changes
        $last = $_;
    }
      Successive repeated lines can also be eliminated at the (unixy) command line with uniq infile > outfile, so long as you don't run the data through sort first.

      Also note that, of the solutions provided thus far, the hash-based option is the only one which will both eliminate all duplicates (printing only the first appearance of each line) and also preserve the original order of the (remaining) lines, which may or may not be significant to you.