vxp has asked for the wisdom of the Perl Monks concerning the following question:

Hi. Suppose I have a huge list of email addresses in a file. I want to make sure that no email address appears in that list twice, so that, for example, user@aol.com and USER@AOL.com and UsEr@AoL.CoM are all considered to be the _same_ address. The idea is to take the "unclean" list, feed it to a case-insensitive hash, and print out the "clean" list. :) Any ideas on how to do this, or on whether this is a good way to do it?

Replies are listed 'Best First'.
Re: Not case sensitive hash?
by runrig (Abbot) on Oct 22, 2002 at 17:25 UTC
    Use lc or uc. You could use a tied hash, but that might be overkill (see Hash::Case modules).
      The code for a tied hash could look like this:

      File TieHashCI.pm:

      package TieHashCI;

      use Tie::Hash;
      use vars qw (@ISA);
      @ISA = qw(Tie::StdHash);

      #my $exists = 0;

      sub STORE {
          my ($self, $key, $value) = @_;
          return( $self->{ lc $key } = $value );
      }

      sub FETCH {
          my ($self, $key) = @_;
          # print ("Fetch $key\n");
          return( $self->{ lc $key } );
      }

      sub EXISTS {
          my ($self, $key) = @_;
          # $exists++;
          # print STDOUT (".");
          return( exists $self->{ lc $key } );
      }

      sub DEFINED {
          my ($self, $key) = @_;
          # print ("Defined $key\n");
          return( defined $self->{ lc $key } );
      }

      sub DELETE {
          my ($self, $key) = @_;
          # print ("Delete $key\n");
          return (delete $self->{ lc $key } );
      }

      # sub CLEAR {}
      # sub FIRSTKEY{};
      # sub NEXTKEY{};

      END {
          # print ("Exists: $exists\n");
      }

      1;

      and use it in your main program:

      use TieHashCI;
      my %data = ();
      tie (%data, 'TieHashCI');

      and then just work with it...
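
      As a rough sketch of what "working with it" might look like, the following inlines a trimmed-down copy of the package so that it runs as a single file. The sample addresses are made up, and this is only one way to use the tie, not necessarily what the poster had in mind.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Inline, trimmed-down copy of the TieHashCI idea so this sketch runs
# without a separate .pm file. Only the methods that take a key are
# overridden; everything else is inherited from Tie::StdHash.
package TieHashCI;
require Tie::Hash;                 # Tie::StdHash lives in Tie/Hash.pm
our @ISA = ('Tie::StdHash');

sub STORE  { my ($self, $key, $value) = @_; $self->{ lc $key } = $value }
sub FETCH  { my ($self, $key) = @_; $self->{ lc $key } }
sub EXISTS { my ($self, $key) = @_; exists $self->{ lc $key } }
sub DELETE { my ($self, $key) = @_; delete $self->{ lc $key } }

package main;

tie my %data, 'TieHashCI';

# Made-up sample addresses, standing in for lines read from the list file.
for my $addr (qw(user@aol.com USER@AOL.com UsEr@AoL.CoM other@example.org)) {
    $data{$addr}++;                # FETCH and STORE both lowercase the key
}

# keys() returns the lowercased form that STORE actually used.
print "$_ => $data{$_}\n" for sort keys %data;
# prints:
#   other@example.org => 1
#   user@aol.com => 3
```

      Every access goes through a method call, which is where the overhead of a tied hash comes from.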

      If TieHashCI.pm is not in your @INC, try something like:

      BEGIN {
          use FindBin qw($Bin);
          use lib "$Bin/lib"; # if TieHashCI is in ./lib
      } # BEGIN
      use TieHashCI;

      But a tied hash wastes lots of resources...

      Best regards,
      perl -e "s>>*F>e=>y)\*martinF)stronat)=>print,print v8.8.8.32.11.32"

Re: (nrd) Not case sensitive hash?
by newrisedesigns (Curate) on Oct 22, 2002 at 17:28 UTC

    You aren't washing this list, are you?

    Anyhoo, there's always mass tr/A-Z/a-z/ then force the list into a hash, then dump keys %hash into a file.
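
    A minimal sketch of those steps, using lc in place of tr/A-Z/a-z/ and keeping the first occurrence of each address in its original order. The helper name unique_addresses and the sample data are hypothetical, chosen only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns the lowercased addresses, first occurrence
# only, in the order they were first seen.
sub unique_addresses {
    my %seen;
    my @clean;
    for my $addr (@_) {
        chomp(my $key = lc $addr);
        next if $key eq '';                      # skip blank lines
        push @clean, $key unless $seen{$key}++;  # keep first sighting only
    }
    return @clean;
}

# Made-up sample data; in practice the lines would come from the list
# file, e.g. unique_addresses(<$fh>).
my @clean = unique_addresses("user\@aol.com\n", "USER\@AOL.com\n",
                             "someone\@example.org\n", "UsEr\@AoL.CoM\n");
print "$_\n" for @clean;
# prints:
#   user@aol.com
#   someone@example.org
```

    Dumping keys %hash directly works too, but the keys then come out in no particular order.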

    Where's chip when you need him? :) (see Massive Sorting)

    John J Reiser
    newrisedesigns.com

Re: Not case sensitive hash?
by Rodney_Hampton (Initiate) on Oct 22, 2002 at 18:36 UTC
    cat email_names_file| perl -e 'while(<STDIN>){print lc($_);}'|sort -u

    You can then pipe this output anywhere, including into a perl script for mailing.

    Rodney A. Hampton

      Avoiding perl you could do
      sort -u -f input_file > output_file

      Antonio

      The stupider the astronaut, the easier it is to win the trip to Vega - A. Tucket
Eliminating addresses in bulk
by rir (Vicar) on Oct 23, 2002 at 01:25 UTC
    Update: The below code has two bugs in it. Don't use it. Twas a joke.

    If I were tempted to deal with a mailing list like you describe I'd use this program to clean it up real quick. Then I could go on to other things.

    This is faster than the other suggestions you've gotten so far, well, for almost every input file. It's especially efficient on larger files; note the not very subtle tricks used to increase speed. Certainly you don't have to go to the shell to handle this.

    This is probably not quite as good as what Chip might come up with though.

    #!/usr/bin/perl -w
    use strict;

    my $file = "rawlistfile";
    my @eadds;
    open INPUT, "+>$file" or die "Can't open $file for reading!";
    my @eadds = <INPUT>;
    foreach ( @eadds) { $_ = lc }
    my $prev;
    @eadds = sort @eadds;
    my $cur = $eadds[0];
    my $i = 0;
    while ( $cur le $eadds[-1] ) {
        print "$prev\n" if $prev ne $cur and $prev;
        $prev = $curr;
        $curr = $eadds[++$i];
    }

      Why all the effort?

      perl -e'@uniq{map lc, <>}=(); print keys %uniq' listfile

      Makeshifts last the longest.

        Mine is very much faster on large inputs than yours. They are very different.

        Hint, wink, wink, wink. I don't think anyone got the joke. I'll know better next time.