Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hey all,

I have created an email list program in Perl which creates a text file of email addresses. While doing some modifying of the script I managed to make a mistake with a foreach routine and managed to triple each list :) (oops)

What I want to do now is create a little script which will go through each text file and remove any duplicate emails in each file, so if a file has 5 of the same emails only one is left..

...The theory is fairly simple but I am a little stuck when it comes to writing the script and I'm not sure how to approach it, e.g. the only way I can think of is doing a foreach and some sort of count, but that way would remove every line after the first :(

Any help would be greatly appreciated to get out of this mess,

Thanks

Edited 2001-05-23 by Ovid


Replies are listed 'Best First'.
(Ovid - hash to control printing) Re: Files
by Ovid (Cardinal) on May 23, 2001 at 19:15 UTC

    We'll need more information to (ahem!) address this properly. What size are the files you need to manipulate and what is the format of the files? If you have small files, much of the important data can simply be read into memory. If not, you'll have to explore alternatives.

    For the sake of argument, we'll assume each line of the file(s) has one email address and nothing else. Use a hash to track the addresses (untested code follows):

    #!/usr/bin/perl -w
    use strict;

    my $in_file  = 'email.log';
    my $out_file = 'new_email.log';  # Note that this is written to a *different* file
                                     # so we can go back if we screw up.
                                     # If that's not good, back up the email log.

    open IN,  "< $in_file"  or die "Can't open $in_file for reading: $!";
    open OUT, "> $out_file" or die "Can't open $out_file for writing: $!";

    my %address;
    while (<IN>) {
        print OUT $_ if ! $address{ $_ }++;
    }

    close OUT or die "Can't close $out_file: $!";
    close IN  or die "Can't close $in_file: $!";

    You also might want to check out the Perl Cookbook. There's not a line of original code above. All of it was shamelessly stolen from many hours of enjoying this tome.

    Cheers,
    Ovid

    Join the Perlmonks Setiathome Group or just click on the link and check out our stats.

Re: Files
by petdance (Parson) on May 23, 2001 at 19:08 UTC
    When you're talking about "deduping" or "seeing if I've already seen one of these", you're talking about hashes.

    Take the following sample code:

    my %seen;
    foreach my $addr ( @bigaddrlist ) {
        ++$seen{ $addr };
    } # foreach

    my @deduped = keys %seen;
    If preserving order is important, do something like this:
    my %seen;
    my @deduped;
    foreach my $addr ( @bigaddrlist ) {
        push( @deduped, $addr ) unless $seen{$addr}++;
    }
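    To plug that into the original problem, a minimal sketch (untested; one address per line assumed, and the file names are just placeholders):

    my @bigaddrlist;
    open IN, "< emails.txt" or die "Can't open emails.txt: $!";
    @bigaddrlist = <IN>;
    close IN;

    my %seen;
    my @deduped;
    foreach my $addr ( @bigaddrlist ) {
        push( @deduped, $addr ) unless $seen{$addr}++;   # keep only the first occurrence, in order
    }

    open OUT, "> emails.dedup.txt" or die "Can't open emails.dedup.txt: $!";
    print OUT @deduped;   # each element kept its original newline
    close OUT;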

    xoxo,
    Andy

    %_=split/;/,".;;n;u;e;ot;t;her;c; ".   #   Andy Lester
    'Perl ;@; a;a;j;m;er;y;t;p;n;d;s;o;'.  #   http://petdance.com
    "hack";print map delete$_{$_},split//,q<   andy@petdance.com   >
    
Re: Files
by arturo (Vicar) on May 23, 2001 at 19:09 UTC

    Here's the most straightforward answer, assuming you have whole duplicate lines. Whether this is a good idea also depends on the size of the file and your system memory, so keep that in mind.

    Basic technique: load the lines into an array, then remove the duplicate members of that array. Easiest way to do that is by mapping the members of the array into the keys of a hash (hash keys cannot be duplicated).

    open FILE, $filename or die "Can't open $filename: $!\n";
    my @lines = <FILE>;
    close FILE;

    # map the lines into the keys of a hash; duplicate keys collapse automatically
    my %uniq = map { $_ => 1 } @lines;
    my @new_list = keys %uniq;

    Then just print out @new_list to the new version of the file.
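    A minimal sketch of that last step (untested; the ".new" name is just an example):

    open NEW, "> $filename.new" or die "Can't open $filename.new: $!\n";
    print NEW @new_list;   # each line kept its original newline, so a plain print is enough
    close NEW;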

    Update: if you're on a *nix-type system, sort filename | uniq > filename.new will do the same thing, albeit by sorting the lines first. The Perl technique won't necessarily preserve the original order either, but it has more general application.
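    For what it's worth, the same hash trick works as a one-liner from the shell, and it does keep the original order (a sketch; emails.txt and emails.new are just placeholder names, one address per line assumed):

    perl -ne 'print unless $seen{$_}++' emails.txt > emails.new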

    HTH.

    perl -e 'print "How sweet does a rose smell? "; chomp ($n = <STDIN>); $rose = "smells sweet to degree $n"; *other_name = *rose; print "$other_name\n"'
Re: Files
by bwana147 (Pilgrim) on May 23, 2001 at 19:21 UTC

    We all know that TIMTOWTDI, but do all the ways have to have something to do with Perl?

    If the order of the addresses is irrelevant, and assuming that the file consists of one address per line, I'd use sort(1).

    sort -u -o address_file.txt address_file.txt

    When the only tool one has at hand is a hammer,
    one tends to believe every problem is a nail.

    And I confess that this saying applies to me all too often ;-)

    --bwana147


    Update: I hadn't seen perlplexer's contribution. It might be interesting to note that uniq works only if duplicate lines follow each other. It won't work if duplicates are scattered across the file. In such a case, you'll have to sort the file first, and pipe the sorted output into uniq, which is exactly what sort -u does.

    My € 0.02

Re: Files
by perlplexer (Hermit) on May 23, 2001 at 19:18 UTC
    The simplest way is to use uniq (see 'man uniq').
    
    If you really want to do this in Perl, or if this is really your homework *grin*, then:
    
    # assuming that your file has only one email address per line
    my %emails = ();

    open FH, "<emails.dat" or die "error: $!";
    $emails{$_}++ while (<FH>);
    close FH;

    open FH, ">new_emails.dat" or die "error: $!";
    print FH $_ while ($_ = each(%emails));
    close FH;
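    One side note: a hash has no particular order, so the new file will come out scrambled relative to the original. If that matters, one small (untested) tweak is to sort the keys on the way out, replacing the print line above with:

    print FH $_ for sort keys %emails;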
    --perlplexer
Removing duplicate lines from files (was 'Files')
by da (Friar) on May 23, 2001 at 19:50 UTC