Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hey all,

I have created an email list program in Perl which creates a text file of email addresses. While doing some modifying of the script I managed to make a mistake with a foreach routine and managed to triple each list :) (oops)

What I want to do now is create a little script which will go through each text file and remove any duplicate emails in each file, so if a file has 5 of the same emails only one is left..

...The theory is fairly simple but I am a little stuck when it comes to writing the script and I'm not sure how to approach it, e.g. the only way I can think of is doing a foreach and some sort of count, but that way would remove every line after the first :(

Any help would be greatly appreciated to get out of this mess,

Thanks

Edited 2001-05-23 by Ovid


Replies are listed 'Best First'.
(Ovid - hash to control printing) Re: Files
by Ovid (Cardinal) on May 23, 2001 at 19:15 UTC

    We'll need more information to (ahem!) address this properly. What size are the files you need to manipulate and what is the format of the files? If you have small files, much of the important data can simply be read into memory. If not, you'll have to explore alternatives.

    For the sake of argument, we'll assume each line of the file(s) has one email address and nothing else. Use a hash to track the addresses (untested code follows):

    #!/usr/bin/perl -w
    use strict;

    my $in_file  = 'email.log';
    my $out_file = 'new_email.log';  # Note that this is written to a *different* file
                                     # so we can go back if we screw up.
                                     # If that's not good, back up the email log.

    open IN,  "< $in_file"  or die "Can't open $in_file for reading: $!";
    open OUT, "> $out_file" or die "Can't open $out_file for writing: $!";

    my %address;
    while (<IN>) {
        print OUT $_ if ! $address{ $_ }++;
    }

    close OUT or die "Can't close $out_file: $!";
    close IN  or die "Can't close $in_file: $!";

    You also might want to check out the Perl Cookbook. There's not a line of original code above. All of it was shamelessly stolen from many hours of enjoying this tome.

    Cheers,
    Ovid

    Join the Perlmonks Setiathome Group or just click on the link and check out our stats.

Re: Files
by petdance (Parson) on May 23, 2001 at 19:08 UTC
    When you're talking about "deduping" or "seeing if I've already seen one of these", you're talking about hashes.

    Take the following sample code:

    my %seen;
    foreach my $addr ( @bigaddrlist ) {
        ++$seen{ $addr };
    } # foreach

    my @deduped = keys %seen;
    If preserving order is important, do something like this:
    my %seen;
    my @deduped;
    foreach my $addr ( @bigaddrlist ) {
        push( @deduped, $addr ) unless $seen{$addr}++;
    }
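    To plug that into the original problem, a minimal sketch (untested; one address per line assumed, and the file names are just placeholders):

    my @bigaddrlist;
    open IN, "< emails.txt" or die "Can't open emails.txt: $!";
    @bigaddrlist = <IN>;
    close IN;

    my %seen;
    my @deduped;
    foreach my $addr ( @bigaddrlist ) {
        push( @deduped, $addr ) unless $seen{$addr}++;   # keep only the first occurrence, in order
    }

    open OUT, "> emails.dedup.txt" or die "Can't open emails.dedup.txt: $!";
    print OUT @deduped;   # each element kept its original newline
    close OUT;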

    xoxo,
    Andy

    %_=split/;/,".;;n;u;e;ot;t;her;c; ".   #   Andy Lester
    'Perl ;@; a;a;j;m;er;y;t;p;n;d;s;o;'.  #   http://petdance.com
    "hack";print map delete$_{$_},split//,q<   andy@petdance.com   >
    
Re: Files
by arturo (Vicar) on May 23, 2001 at 19:09 UTC

    Here's the most straightforward answer, assuming you have whole duplicate lines. Whether this is a good idea also depends on the size of the file and your system memory, so keep that in mind.

    Basic technique: load the lines into an array, then remove the duplicate members of that array. Easiest way to do that is by mapping the members of the array into the keys of a hash (hash keys cannot be duplicated).

    open FILE, $filename or die "Can't open $filename: $!\n";
    my @lines = <FILE>;
    close FILE;

    # map the lines into the keys of a hash; duplicate keys collapse automatically
    my %uniq = map { $_ => 1 } @lines;
    my @new_list = keys %uniq;

    Then just print out @new_list to the new version of the file.
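    A minimal sketch of that last step (untested; the ".new" name is just an example):

    open NEW, "> $filename.new" or die "Can't open $filename.new: $!\n";
    print NEW @new_list;   # each line kept its original newline, so a plain print is enough
    close NEW;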

    Update: if you're on a *nix-type system, sort filename | uniq > filename.new will do the same thing, albeit by sorting the lines first. The Perl technique won't necessarily preserve the original order either, but it has more general application.
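    For what it's worth, the same hash trick works as a one-liner from the shell, and it does keep the original order (a sketch; emails.txt and emails.new are just placeholder names, one address per line assumed):

    perl -ne 'print unless $seen{$_}++' emails.txt > emails.new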

    HTH.

    perl -e 'print "How sweet does a rose smell? "; chomp ($n = <STDIN>); $rose = "smells sweet to degree $n"; *other_name = *rose; print "$other_name\n"'
Re: Files
by bwana147 (Pilgrim) on May 23, 2001 at 19:21 UTC

    We all know that TIMTOWTDI, but do all the ways have to have something to do with Perl?

    If the order of the addresses is irrelevant, and assuming that the file consists of one address per line, I'd use sort(1).

    sort -u -o address_file.txt address_file.txt

    When the only tool one has at hand is a hammer,
    one tends to believe every problem is a nail.

    And I confess that this saying applies to me all too often ;-)

    --bwana147


    Update: I hadn't seen perlplexer's contribution. It might be interesting to note that uniq works only if duplicate lines follow each other. It won't work if duplicates are scattered across the file. In such a case, you'll have to sort the file first, and pipe the sorted output into uniq, which is exactly what sort -u does.

    My € 0.02

Re: Files
by perlplexer (Hermit) on May 23, 2001 at 19:18 UTC
    The simplest way is to use uniq (see 'man uniq').
    
    If you really want to do this in Perl, or if this is really your homework *grin*, then:
    
    # assuming that your file has only one email address per line
    my %emails = ();

    open FH, "<emails.dat" or die "error: $!";
    $emails{$_}++ while (<FH>);
    close FH;

    open FH, ">new_emails.dat" or die "error: $!";
    print FH $_ while ($_ = each(%emails));
    close FH;
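    One side note: a hash has no particular order, so the new file will come out scrambled relative to the original. If that matters, one small (untested) tweak is to sort the keys on the way out, replacing the print line above with:

    print FH $_ for sort keys %emails;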
    --perlplexer
Removing duplicate lines from files (was 'Files')
by da (Friar) on May 23, 2001 at 19:50 UTC