in reply to Binary Search

If you're on Unix, there's no need for Perl. (Windows has both the Unix utilities and Cygwin.)
sort file1.csv > file1.sort.csv sort file2.csv > file2.sort.csv comm file1.sort.csv file2.sort.csv

Plus, 61k lines isn't very large at all ... I'm curious as to how you were comparing them originally that you felt was too slow. Maybe your initial algorithm was borked. Could it have been something like

open FIRST, "first.csv"; while (my $line = <FIRST>) { open SECOND, "second.csv"; while (my $otherline = <SECOND>) { if ( $line eq $otherline ) { print "Found a match!\n": } } close SECOND; } close FIRST;
Because that's really slow!

Replies are listed 'Best First'.
Re^2: Binary Search
by dave_pl (Sexton) on Apr 04, 2005 at 14:01 UTC
  • dragonchild

    I am currently doing this in win... i know... i have
    no choice though.

    Yes that is about exactly how i was doing it,
    i am testing borisz way now and that looks really fast!

    i am still learning perl as you undoubtedly have noticed.
    We all need to start somewhere :).
    Thanks
      I am currently doing this in win... i know... i have no choice though.

      Ok. You can install any of the following:

      All three have comm and sort and work just fine.

      Yes that is about exactly how i was doing it, i am testing borisz way now and that looks really fast!

      There are two problems with the algorithm I outlined. The second problem is bigger than the first

      1. You are doing N iterations over the second array
      2. You are reading the second file N times

      The reason problem #2 is bigger is because I/O is extremely expensive. The following is an intermediate solution between borisz's excellent answer and your first algorithm. Even though I'm still iterating over the inner loop N times, I'm doing it in memory, which is much faster.

      open my $fh1, 'first.csv'; my @first = <$fh1>; close $fh1; open my $fh2, 'second.csv'; my @second = <$fh1>; close $fh2; FIRST: foreach my $line (@first) { foreach my $otherline (@second) { if ( $line eq $otherline ) { print "Found a duplicate!\n"; next FIRST; } } }

      Note that I've also added short-circuiting to stop comparing the inner-loop when I've found a dupe. This will reduce the runtime quite a bit. The hash solution is still faster, and will take less memory.