Jeffro Tull has asked for the wisdom of the Perl Monks concerning the following question:

I have a database of users that is pipe delimited. From this database, I am creating a mailing list. I think the following code is on track, but I am not sure how to test against all the lines in the new file. Any help or suggestions would be really appreciated.

open (regfile, ">>$datadir/projects\.dat") || "Cannot open File1.";
my @regfile = <regfile>;
close (regfile);

open (NEWFILE,">>$datadir/subscribe\.dat") || "Cannot open File2.";

foreach $i (@regfile) {
    ($b, $b, $b, $b, $base_email, $b, $b, $b) = split (/\|/, $i);
    If ($base_email\|$name\|$date\|$format ne ANY LINE IN NEWFILE) {
        print NEWFILE "$base_email\|$name\|$date\|$format\n";
    }
}
close (NEWFILE);


Thanks,
Jeff

Re: How to delete lines with similar data
by mirod (Canon) on Feb 05, 2002 at 11:15 UTC

    A couple of (uncalled for ;--) comments on your script:

    • if a system function (open) fails you have to die; just using || "error" will not help,
    • $! holds the error description; it is usually a good idea to display it when a system function fails,
    • testing for existence is usually done with a hash.

    A couple of style points:

    • if you want to open a file for reading (regfile), why not say so in the open: open( FH, "<file") instead of using >>? BTW, filehandles are usually in CAPS,
    • you can use undef as a placeholder in split,
    • there is no need to escape . and | in strings,
    • instead of reading the whole file into an array and then processing each line, why not read each line, process it, and then go to the next one? This way you don't have to slurp the whole file into memory.

    So your code modified:

    #!/bin/perl -w
    use strict;

    my $datadir = "/path/to/data";   # set this to your actual data directory

    open (REGFILE, "<$datadir/projects.dat")   || die "Cannot open File1: $!";
    open (NEWFILE, ">>$datadir/subscribe.dat") || die "Cannot open File2: $!";

    my %seen_email;   # hash email => 1

    while( <REGFILE>) {
        my( $email, $name, $format);
        ($email, $name, $format, undef) = split /\|/;
        next if( $seen_email{$email});
        print NEWFILE "$email|$name|$format\n";
        $seen_email{$email} = 1;   # either just the email or the whole line
    }

    close (NEWFILE);
    close REGFILE;

    Another option, if you are on any kind of unix (or cygwin), is to preprocess the initial file using sort -u:

    #!/bin/perl -w
    use strict;

    my $datadir = "/path/to/data";   # set this to your actual data directory

    # the die is actually useless here
    open (REGFILE, "sort -u $datadir/projects.dat |") || die "Cannot open File1: $!";
    open (NEWFILE, ">>$datadir/subscribe.dat")        || die "Cannot open File2: $!";

    while( <REGFILE>) {
        my( $email, $name, $format);
        ($email, $name, $format, undef) = split /\|/;
        print NEWFILE "$email|$name|$format\n";
    }

    close (NEWFILE);
    # in this case the error happens when you close the file
    close( REGFILE) || die "could not sort $datadir/projects.dat: $!";
Re: How to delete lines with similar data
by gmax (Abbot) on Feb 05, 2002 at 11:06 UTC
    If you want to get unique records, the idiomatic way in Perl is by using a hash.

    I don't know what you want to accomplish with your ($b, $b, $b, $b, $base_email, $b, $b, $b) = split, but I assume that you are using split to get the fields into the scalars you want to process.
    Here is a simple test case that uses DATA as input and STDOUT as output.
    There is a duplicate in my sample DATA, and it is skipped by the hash mechanism.
    #!/usr/bin/perl -w
    use strict;

    my @regfile = <DATA>;
    my %uniques = ();

    foreach my $i (@regfile) {
        my ($email, $name, $date, $format) = split (/\|/, $i);
        $uniques{$email.$name.$date.$format}++;
        print "$email|$name|$date|$format"
            unless $uniques{$email.$name.$date.$format} > 1;
    }

    __DATA__
    dummy@fake.com|John|10-10-2000|XZXXXXXXXXX
    dummy2@fake.com|Joe|11-10-2000|XXXXXXXXXXX
    alone@dunno.com|Jim|10-09-2000|YYYYYYYYYYYYY
    dummy2@fake.com|Joe|11-10-2000|XXXXXXXXXXX
    Output:
    dummy@fake.com|John|10-10-2000|XZXXXXXXXXX
    dummy2@fake.com|Joe|11-10-2000|XXXXXXXXXXX
    alone@dunno.com|Jim|10-09-2000|YYYYYYYYYYYYY
    The key to the hash should be the concatenation of all pieces of information that will make your record unique.
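    One detail worth adding (my note, not part of gmax's post): concatenating the fields directly can make two different records collide, since ("ab", "c") and ("a", "bc") both concatenate to "abc". Joining with a separator that cannot appear in the data avoids that. This drop-in variant of the loop body assumes "\0" never occurs in the records, and prints $i (the original line, which is equivalent here):

    # build the key with an explicit separator instead of plain concatenation
    my $key = join "\0", $email, $name, $date, $format;
    print $i unless $uniques{$key}++;   # ++ returns 0 on first sight, so dups are skipped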

    Before I forget: did you remember to use strict and -w?

    Hope it was helpful.
Re: How to delete lines with similar data
by gellyfish (Monsignor) on Feb 05, 2002 at 10:53 UTC

    Almost every time you start thinking about "unique" in a Perl program, you are going to be using a hash, IMO.

    You should read every row of NEWFILE and use the lines as the keys of a hash; then, in your foreach loop, use exists with your built-up row as the key to test.
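    For what it's worth, here is a minimal sketch of that approach using the file names from the original question; the $datadir value and the exact row being built are assumptions on my part:

    #!/usr/bin/perl -w
    use strict;

    my $datadir = "/path/to/data";   # assumed; point this at your real directory

    # Pass 1: read every row already in subscribe.dat into a hash
    my %seen;
    open NEWFILE, "<$datadir/subscribe.dat" or die "Cannot open subscribe.dat: $!";
    while (my $line = <NEWFILE>) {
        $seen{$line} = 1;
    }
    close NEWFILE;

    # Pass 2: append only rows that are not already present
    open REGFILE, "<$datadir/projects.dat"   or die "Cannot open projects.dat: $!";
    open NEWFILE, ">>$datadir/subscribe.dat" or die "Cannot open subscribe.dat: $!";

    while (my $line = <REGFILE>) {
        my (undef, undef, undef, undef, $base_email) = split /\|/, $line;
        my $row = "$base_email\n";     # build whatever row you want to write
        next if exists $seen{$row};    # already in NEWFILE - skip it
        print NEWFILE $row;
        $seen{$row} = 1;               # also catch duplicates within this run
    }

    close NEWFILE;
    close REGFILE;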

    /J\

Re: How to delete lines with similar data
by jmcnamara (Monsignor) on Feb 05, 2002 at 11:53 UTC

    There is a lot of good advice in the previous posts and I think that you should take it into account.

    However, apart from gellyfish, they also seem to have missed the point. ;-) To avoid duplicates in "NEWFILE" you could do something like the following:

    #!/usr/local/bin/perl -w
    use strict;

    open REGFILE, "projects.dat"     or die "Error message here: $!\n";
    open NEWFILE, "+>>subscribe.dat" or die "Error message here: $!\n";

    # Rewind the file for reading
    seek(NEWFILE, 0, 0);

    # Store all lines in NEWFILE in %seen using a hash slice
    my %seen;
    @seen{<NEWFILE>} = undef;

    while (<REGFILE>) {
        if (not exists $seen{$_}) {
            $seen{$_} = undef;
            # Do your split etc here
            print NEWFILE $_;
        }
    }

    close (NEWFILE);
    close (REGFILE);

    --
    John.

      This is to everyone who responded-

      I appreciate all the great feedback. I am very new at Perl and therefore very green. I am learning as I go, and your responses have helped me a lot.

      I did in the end go with a modified version of John's code. It was very close to what I wanted in the end and was easy enough for me to modify to do exactly what I wanted.

      Thanks again for your help.

      Jeff
Re: How to delete lines with similar data
by metadoktor (Hermit) on Feb 05, 2002 at 10:14 UTC
    As you process each line, simply push each record onto an array (if the database is small) and compare the record you are currently processing against the list of records you have already processed. You could make the compare operation a sub. Then simply don't include the records that are duplicates.
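    A rough sketch of that idea, assuming one record per line on DATA; same_record() is a hypothetical comparison sub you would adapt:

    #!/usr/bin/perl -w
    use strict;

    my @processed;   # every record kept so far

    # hypothetical comparison sub; whole-line compare here,
    # but it could compare individual fields instead
    sub same_record {
        my ($new, $old) = @_;
        return $new eq $old;
    }

    while (my $record = <DATA>) {
        # linear scan over everything processed so far
        next if grep { same_record($record, $_) } @processed;
        push @processed, $record;
        print $record;   # keep only the first occurrence
    }

    __DATA__
    dummy@fake.com|John|10-10-2000|X
    dummy2@fake.com|Joe|11-10-2000|X
    dummy@fake.com|John|10-10-2000|X

    Note that the linear scan makes this quadratic in the number of records, which is why the other replies suggest a hash once the file is more than trivially small.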

    metadoktor

    "The doktor is in."