Re: Binary Search

If you're on Unix, there's no need for Perl. (Windows has both the Unix utilities and Cygwin.)

sort file1.csv > file1.sort.csv
sort file2.csv > file2.sort.csv
comm file1.sort.csv file2.sort.csv
[download]

Plus, 61k lines isn't very large at all ... I'm curious as to how you were comparing them originally that you felt was too slow. Maybe your initial algorithm was borked. Could it have been something like

open FIRST, "first.csv";
while (my $line = <FIRST>) {
    open SECOND, "second.csv";
    while (my $otherline = <SECOND>) {
        if ( $line eq $otherline ) {
            print "Found a match!\n":
        }
    }
    close SECOND;
}
close FIRST;
[download]

Because that's really slow!

My wife's blog

Comment on Re: Binary Search Select or Download Code

Replies are listed 'Best First'.
Re^2: Binary Search by dave_pl (Sexton) on Apr 04, 2005 at 14:01 UTC
dragonchild I am currently doing this in win... i know... i have no choice though. Yes that is about exactly how i was doing it, i am testing borisz way now and that looks really fast! i am still learning perl as you undoubtedly have noticed. We all need to start somewhere :). Thanks	[reply]
Re^3: Binary Search by dragonchild (Archbishop) on Apr 04, 2005 at 14:16 UTC
I am currently doing this in win... i know... i have no choice though. Ok. You can install any of the following: Cygwin at http://www.cygwin.com a bunch of natively ported utilies from http://unxutils.sourceforge.net/ Microsoft's own Services for Unix at http://www.microsoft.com/windows/sfu/default.asp All three have comm and sort and work just fine. Yes that is about exactly how i was doing it, i am testing borisz way now and that looks really fast! There are two problems with the algorithm I outlined. The second problem is bigger than the first You are doing N iterations over the second array You are reading the second file N times The reason problem #2 is bigger is because I/O is extremely expensive. The following is an intermediate solution between borisz's excellent answer and your first algorithm. Even though I'm still iterating over the inner loop N times, I'm doing it in memory, which is much faster. `open my $fh1, 'first.csv'; my @first = <$fh1>; close $fh1; open my $fh2, 'second.csv'; my @second = <$fh1>; close $fh2; FIRST: foreach my $line (@first) { foreach my $otherline (@second) { if ( $line eq $otherline ) { print "Found a duplicate!\n"; next FIRST; } } }` [download] Note that I've also added short-circuiting to stop comparing the inner-loop when I've found a dupe. This will reduce the runtime quite a bit. The hash solution is still faster, and will take less memory. My wife's blog	[reply] [d/l]