Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi.

I have a plain text database where each line holds one email address.

Is there a way I could write a Perl script that will check whether there are any repeated email addresses within the same file?

Ralph :)

Replies are listed 'Best First'.
Re: File Checking
by I0 (Priest) on Jan 29, 2001 at 00:44 UTC
Re: File Checking
by OeufMayo (Curate) on Jan 29, 2001 at 00:42 UTC

    To get past the flaw mentioned, replacing the appropriate line in the above code with the one below should make it work:

    while (<WORDLIST>){push (@words => lc($_))}
    --
    PerlMonger::Paris(http => 'paris.pm.org');
Re: File Checking
by the_slycer (Chaplain) on Jan 29, 2001 at 00:10 UTC
    The way I've always done this (and there are probably better ways)..
    #!/usr/bin/perl -w
    use strict;
    my @words;
    my %list;
    unless ($ARGV[1]) {
        print "Usage is oneword infile outfile\n";
        exit;
    }
    open (WORDLIST, "$ARGV[0]") || die "Could not open file $!";
    open (OUTFILE, ">$ARGV[1]") || die "Could not open file $!";
    while (<WORDLIST>) { push (@words => $_) }
    foreach (@words) { $list{$_} = $list{$_} }   # this works because if the hash key
                                                 # exists already it is replaced!
    foreach (keys %list) { print OUTFILE; }

      Hey slycer. You're looping around your data far too often.

      while ( <WORDLIST> ) { print OUTFILE unless $list{lc($_)}++; }

      will be up to three times faster, as it only needs to go through the data once.

      (Whoops! This is essentially what I0 said.)

        Yah, I wrote that probably about 6 months ago as kind of a learning exercise.. I knew as soon as I posted it that it could be better, but oh well. Thanks for pointing out the errors :-)
      A quick test revealed a flaw here:
      what if we had:
      someone@somewhere.com and someOne@SOMEWHERE.COM

      and this is the main reason for checking for duplicates in the first place, I guess..


      Chady | http://chady.net/
Re: File Checking
by Anonymous Monk on Jan 29, 2001 at 01:18 UTC
    Hi.

    I ended up solving the problem in quite a weird way.. The following is the code I used.

    #!/usr/bin/perl

    open(ONE, 'one.db') or die "$!";
    @one = <ONE>;
    close(ONE);

    open(TWO, 'two.db') or die "$!";
    @two = <TWO>;
    close(TWO);

    open(LOGS, '>logs.db') or die "$!";

    foreach $i (@one) {
        chomp($i);

        @res = grep(/$i/, @two);

        if ($#res == 0) { print "Success for $i\n"; }

        elsif ($#res > 0) {
            print "Repetition found for $i\n";
            print LOGS "\nRepetition found for $i";
        }
    }

    close(LOGS);

    print "\n\nProcess Terminated!";

    It stores the file in two lists, and if it finds more than one match in the second list for any of the emails in the first list, it writes it to a log file, and then I'll be able to remove it manually.

    It works super fast and really well! :) And it's not case sensitive! (which is good) :)

    Ralph :)

    www.argenteen.com
      That's certainly One Way To Do It, but it's not the fastest.

      The process goes something like this:

      • read a line from the file
      • stick it in the array
      • repeat the previous steps for the second file
      • loop through each line of the first array
      • check against *every* line of the second array, with a case insensitive match
      As either array grows, the number of necessary checks grows. With 2 lines in each file, you'll do four checks. With 10 lines in each file, you'll do 100 checks. (At least, if my math unit is working today.)

      With the hash solution, you only loop through each file once. You don't have to check each element in one file against each element of the other file -- if it already exists in the hash, no problem. Besides that, you only have to run lc() on each line once, instead of having to build a case-insensitive version of each element in the second file for each line of the first file.

      If you have 2 lines in each file, you have 4 hashings, 4 lc calls, and 4 hash assignments. No big win there. If you have 10 lines in each file, you have 20 hash assignments. You do the math for 100 elements in each file.

      Besides that, it's less work. I'd say the hash is the clear winner in this case, and hopefully more people will understand why. (Apologies to the literati here for boring them. :)
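      For reference, here is a minimal sketch of that hash approach applied to the same two files (the names one.db, two.db and logs.db come from the post above; note it compares whole, lowercased lines rather than running a pattern match the way grep does):

      #!/usr/bin/perl -w
      use strict;

      # Count how often each address appears in the second file,
      # lowercasing each line exactly once.
      my %count;
      open(TWO, 'two.db') or die "two.db: $!";
      while (<TWO>) {
          chomp;
          $count{lc $_}++;
      }
      close(TWO);

      # One pass over the first file: a hash lookup replaces the grep over @two.
      open(ONE,  'one.db')   or die "one.db: $!";
      open(LOGS, '>logs.db') or die "logs.db: $!";
      while (<ONE>) {
          chomp;
          my $n = $count{lc $_} || 0;
          if    ($n == 1) { print "Success for $_\n"; }
          elsif ($n >  1) {
              print "Repetition found for $_\n";
              print LOGS "\nRepetition found for $_";
          }
      }
      close(ONE);
      close(LOGS);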

Re: File Checking
by Anonymous Monk on Jan 29, 2001 at 01:21 UTC
    Oops, I forgot to add the 'i' in the grep line. It should be:

    @res = grep(/$i/i,@two);

    Now it's case insensitive :)

    Ralph :)

    www.argenteen.com
Re: File Checking
by sierrathedog04 (Hermit) on Jan 29, 2001 at 04:22 UTC
    I am surprised that no one solved this problem as I would have:
    • first reduce all email addresses to a canonical form by removing trailing blanks, reducing characters to lowercase, etc.
    • then sort using the default lexical sort
    • finally loop through the resulting list once, seeing whether any two addresses in a row are the same. If they are then delete the first and move on.
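    A rough sketch of that approach, assuming the addresses come from a file named on the command line and the de-duplicated list goes to STDOUT:

    #!/usr/bin/perl -w
    use strict;

    # Canonical form: strip trailing blanks and lowercase each address.
    my @addresses;
    while (<>) {
        chomp;
        s/\s+$//;
        push @addresses, lc $_;
    }

    # Default lexical sort puts identical addresses next to each other.
    @addresses = sort @addresses;

    # One pass: if an address equals the one that follows it, skip it
    # ("delete the first and move on").
    my @unique;
    for my $i (0 .. $#addresses) {
        next if $i < $#addresses && $addresses[$i] eq $addresses[$i + 1];
        push @unique, $addresses[$i];
    }

    print "$_\n" for @unique;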
    "There's more than one way to do it"
    -- Larry Wall, the genius who invented and guides perl to this day. He is to us what Linus is to home UNIXers.
Re: File Checking
by dsb (Chaplain) on Jan 29, 2001 at 20:46 UTC
    You would need some code to print out the array of duplicates but this will build that array.
    open( FH, "filename" ) || die "no way man\n";
    while ( <FH> ) {
        if ( exists $done{$_} ) {
            # seen before: it's a duplicate
            $dup++;
            push( @dups, $_ );
        } else {
            # first time we see this address
            $done{$_} = 1;
        }
    }
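    To actually report them, a one-liner along these lines would do (assuming the push above lands in the duplicate branch, as shown):

    print "Duplicate: $_" for @dups;   # each line still carries its newline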
    - kel -