Jeffro Tull has asked for the wisdom of the Perl Monks concerning the following question:

I have a database of users that is pipe delimited. From this database, I am creating a mailing list. I think the following code is on track, but I am not sure how to test against all the lines in the new file. Any help or suggestions would be really appreciated.

open (regfile, ">>$datadir/projects\.dat") || "Cannot open File1.";
my @regfile = <regfile>;
close (regfile);

open (NEWFILE,">>$datadir/subscribe\.dat") || "Cannot open File2.";

foreach $i (@regfile) {
    ($b, $b, $b, $b, $base_email, $b, $b, $b) = split (/\|/, $i);
    If ($base_email\|$name\|$date\|$format ne ANY LINE IN NEWFILE) {
        print NEWFILE "$base_email\|$name\|$date\|$format\n";
    }
}
close (NEWFILE);


Thanks,
Jeff

Re: How to delete lines with similar data
by mirod (Canon) on Feb 05, 2002 at 11:15 UTC

    A couple of (uncalled for ;--) comments on your script:

    • if a system function (open) fails you have to die; just using || "error" will not help,
    • $! holds the error description; it is usually a good idea to display it when a system function fails,
    • testing for existence is usually done with a hash.

    A couple of style points:

    • if you want to open a file for reading (regfile), why not say so in the open: open( FH, "<file") instead of using >>? BTW, filehandles are usually in CAPS,
    • you can use undef as a placeholder in split,
    • there is no need to escape . and | in strings,
    • instead of reading the whole file into an array and then processing each line, why not read each line, process it, and then go to the next one? This way you don't have to slurp the whole file into memory.

    So your code modified:

    #!/bin/perl -w
    use strict;

    my $datadir = "/path/to/data";   # set this to your actual data directory

    open (REGFILE, "<$datadir/projects.dat")   || die "Cannot open File1: $!";
    open (NEWFILE, ">>$datadir/subscribe.dat") || die "Cannot open File2: $!";

    my %seen_email;   # hash email => 1

    while( <REGFILE>) {
        my( $email, $name, $format);
        ($email, $name, $format, undef) = split /\|/;
        next if( $seen_email{$email});
        print NEWFILE "$email|$name|$format\n";
        $seen_email{$email} = 1;   # either just the email or the whole line
    }

    close (NEWFILE);
    close REGFILE;

    Another option, if you are on any kind of unix (or cygwin), is to preprocess the initial file using sort -u:

    #!/bin/perl -w
    use strict;

    my $datadir = "/path/to/data";   # set this to your actual data directory

    # the die is actually useless here
    open (REGFILE, "sort -u $datadir/projects.dat |") || die "Cannot open File1: $!";
    open (NEWFILE, ">>$datadir/subscribe.dat")        || die "Cannot open File2: $!";

    while( <REGFILE>) {
        my( $email, $name, $format);
        ($email, $name, $format, undef) = split /\|/;
        print NEWFILE "$email|$name|$format\n";
    }

    close (NEWFILE);
    # in this case the error happens when you close the file
    close( REGFILE) || die "could not sort $datadir/projects.dat: $!";
Re: How to delete lines with similar data
by gmax (Abbot) on Feb 05, 2002 at 11:06 UTC
    If you want to get unique records, the idiomatic way in Perl is by using a hash.

    I don't know what you want to accomplish with your ($b, $b, $b, $b, $base_email, $b, $b, $b) = split, but I assume that you are using split to get the fields into the scalars you want to process.
    Here is a simple test case that uses DATA as input and STDOUT as output.
    There is a duplicate in my sample DATA, and it is skipped by the hash mechanism.
    #!/usr/bin/perl -w
    use strict;

    my @regfile = <DATA>;
    my %uniques = ();

    foreach my $i (@regfile) {
        my ($email, $name, $date, $format) = split (/\|/, $i);
        $uniques{$email.$name.$date.$format}++;
        print "$email|$name|$date|$format"
            unless $uniques{$email.$name.$date.$format} > 1;
    }

    __DATA__
    dummy@fake.com|John|10-10-2000|XZXXXXXXXXX
    dummy2@fake.com|Joe|11-10-2000|XXXXXXXXXXX
    alone@dunno.com|Jim|10-09-2000|YYYYYYYYYYYYY
    dummy2@fake.com|Joe|11-10-2000|XXXXXXXXXXX
    Output:
    dummy@fake.com|John|10-10-2000|XZXXXXXXXXX
    dummy2@fake.com|Joe|11-10-2000|XXXXXXXXXXX
    alone@dunno.com|Jim|10-09-2000|YYYYYYYYYYYYY
    The key to the hash should be the concatenation of all pieces of information that will make your record unique.
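    One detail worth adding (my note, not part of gmax's post): concatenating the fields directly can make two different records collide, since ("ab", "c") and ("a", "bc") both concatenate to "abc". Joining with a separator that cannot appear in the data avoids that. This drop-in variant of the loop body assumes "\0" never occurs in the records, and prints $i (the original line, which is equivalent here):

    # build the key with an explicit separator instead of plain concatenation
    my $key = join "\0", $email, $name, $date, $format;
    print $i unless $uniques{$key}++;   # ++ returns 0 on first sight, so dups are skipped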

    Before I forget: did you remember to use strict and -w?

    Hope it was helpful.
Re: How to delete lines with similar data
by gellyfish (Monsignor) on Feb 05, 2002 at 10:53 UTC

    Almost every time you start thinking about "unique" in a Perl program, you are going to be using a hash, IMO.

    You should read every row of NEWFILE and use the lines as the keys of a hash; then, in your foreach loop, use exists with your built-up row as the key to test.
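    For what it's worth, here is a minimal sketch of that approach using the file names from the original question; the $datadir value and the exact row being built are assumptions on my part:

    #!/usr/bin/perl -w
    use strict;

    my $datadir = "/path/to/data";   # assumed; point this at your real directory

    # Pass 1: read every row already in subscribe.dat into a hash
    my %seen;
    open NEWFILE, "<$datadir/subscribe.dat" or die "Cannot open subscribe.dat: $!";
    while (my $line = <NEWFILE>) {
        $seen{$line} = 1;
    }
    close NEWFILE;

    # Pass 2: append only rows that are not already present
    open REGFILE, "<$datadir/projects.dat"   or die "Cannot open projects.dat: $!";
    open NEWFILE, ">>$datadir/subscribe.dat" or die "Cannot open subscribe.dat: $!";

    while (my $line = <REGFILE>) {
        my (undef, undef, undef, undef, $base_email) = split /\|/, $line;
        my $row = "$base_email\n";     # build whatever row you want to write
        next if exists $seen{$row};    # already in NEWFILE - skip it
        print NEWFILE $row;
        $seen{$row} = 1;               # also catch duplicates within this run
    }

    close NEWFILE;
    close REGFILE;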

    /J\

Re: How to delete lines with similar data
by jmcnamara (Monsignor) on Feb 05, 2002 at 11:53 UTC

    There is a lot of good advice in the previous posts and I think that you should take it into account.

    However, apart from gellyfish, they also seem to have missed the point. ;-) To avoid duplicates in "NEWFILE" you could do something like the following:

    #!/usr/local/bin/perl -w
    use strict;

    open REGFILE, "projects.dat"     or die "Error message here: $!\n";
    open NEWFILE, "+>>subscribe.dat" or die "Error message here: $!\n";

    # Rewind the file for reading
    seek(NEWFILE, 0, 0);

    # Store all lines in NEWFILE in %seen using a hash slice
    my %seen;
    @seen{<NEWFILE>} = undef;

    while (<REGFILE>) {
        if (not exists $seen{$_}) {
            $seen{$_} = undef;
            # Do your split etc here
            print NEWFILE $_;
        }
    }

    close (NEWFILE);
    close (REGFILE);

    --
    John.

      This is to everyone who responded-

      I appreciate all the great feedback. I am very new at Perl and therefore very green. I am learning as I go, and your responses have helped me a lot.

      I did in the end go with a modified version of John's code. It was very close to what I wanted in the end and was easy enough for me to modify to do exactly what I wanted.

      Thanks again for your help.

      Jeff
Re: How to delete lines with similar data
by metadoktor (Hermit) on Feb 05, 2002 at 10:14 UTC
    As you process each line, simply push each record onto an array (if the database is small) and compare the record you are currently processing against the list of records you have already processed. You could make the compare operation a sub. Then simply don't include the records that are duplicates.
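    A rough sketch of that idea, assuming one record per line on DATA; same_record() is a hypothetical comparison sub you would adapt:

    #!/usr/bin/perl -w
    use strict;

    my @processed;   # every record kept so far

    # hypothetical comparison sub; whole-line compare here,
    # but it could compare individual fields instead
    sub same_record {
        my ($new, $old) = @_;
        return $new eq $old;
    }

    while (my $record = <DATA>) {
        # linear scan over everything processed so far
        next if grep { same_record($record, $_) } @processed;
        push @processed, $record;
        print $record;   # keep only the first occurrence
    }

    __DATA__
    dummy@fake.com|John|10-10-2000|X
    dummy2@fake.com|Joe|11-10-2000|X
    dummy@fake.com|John|10-10-2000|X

    Note that the linear scan makes this quadratic in the number of records, which is why the other replies suggest a hash once the file is more than trivially small.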

    metadoktor

    "The doktor is in."