in reply to File Checking

Hi.

I ended up solving the problem in quite a weird way. The following is the code I used.

#!/usr/bin/perl
use strict;
use warnings;

# Read both address lists into memory.
open(my $one_fh, '<', 'one.db') or die "one.db: $!";
my @one = <$one_fh>;
close($one_fh);

open(my $two_fh, '<', 'two.db') or die "two.db: $!";
my @two = <$two_fh>;
close($two_fh);

open(my $logs, '>', 'logs.db') or die "logs.db: $!";

foreach my $i (@one) {
    chomp($i);

    # \Q...\E escapes regex metacharacters (the '.' and '@' in an email
    # address), and /i makes the match case-insensitive.
    my @res = grep(/\Q$i\E/i, @two);

    if (@res == 1) { print "Success for $i\n"; }

    elsif (@res > 1) {
        print "Repetition found for $i\n";
        print $logs "Repetition found for $i\n";
    }
}

close($logs);

print "\n\nProcess Terminated!\n";

It reads the two files into two lists, and if it finds more than one match in the second list for any of the emails in the first list, it writes that email to a log file, and then I'll be able to remove it manually.

It works super fast and really well! :) And it's not case-sensitive! (which is good) :)

Ralph :)

www.argenteen.com

Re: Re: File Checking
by chromatic (Archbishop) on Jan 29, 2001 at 01:48 UTC
    That's certainly One Way To Do It, but it's not the fastest.

    The process goes something like this:

    • read a line from the file
    • stick it in the array
    • repeat the previous steps for the second file
    • loop through each line of the first array
    • check against *every* line of the second array, with a case-insensitive match
    As either array grows, the number of necessary checks grows with the product of the two sizes. With 2 lines in each file, you'll do four checks. With 10 lines in each file, you'll do 100 checks. (At least, if my math unit is working today.)

    With the hash solution, you only loop through each file once. You don't have to check each element in one file against each element of the other file -- if it already exists in the hash, no problem. Besides that, you only have to run lc() on each line once, instead of having to build a case-insensitive version of each element in the second file for each line of the first file.

    If you have 2 lines in each file, you have 4 lc() calls and 4 hash operations. No big win there. If you have 10 lines in each file, you have 20 hash operations, versus 100 checks the other way. You do the math for 100 elements in each file.
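
    A minimal sketch of that hash approach might look like the code below. It reuses the file names from the node above for illustration, and it assumes each line of two.db holds exactly one address: a hash lookup is an exact whole-line match, unlike the grep, which also matches substrings.

        use strict;
        use warnings;

        # Count how many times each address appears in two.db,
        # lowercased once per line so lookups are case-insensitive.
        my %count;
        open(my $two, '<', 'two.db') or die "two.db: $!";
        while (my $line = <$two>) {
            chomp $line;
            $count{lc $line}++;
        }
        close($two);

        # One pass over one.db; each check is a single hash lookup.
        open(my $one, '<', 'one.db') or die "one.db: $!";
        open(my $logs, '>', 'logs.db') or die "logs.db: $!";
        while (my $addr = <$one>) {
            chomp $addr;
            my $n = $count{lc $addr} || 0;
            if ($n == 1) { print "Success for $addr\n"; }
            elsif ($n > 1) {
                print "Repetition found for $addr\n";
                print $logs "Repetition found for $addr\n";
            }
        }
        close($one);
        close($logs);

    That makes two passes in total, so the cost grows with the size of the files rather than with their product.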

    On top of that, it's less work. I'd say the hash is the clear winner in this case, and hopefully more people will understand why. (Apologies to the literati here for boring them. :)