dave_pl has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, I need some help with a search I am doing on two flat text files: I am looking for duplicates between the two files. The files average 60908 lines each, and a linear search is just too slow. I tried the binary search from 'Mastering Algorithms with Perl', but it doesn't seem to work. Here is the code I am using:
---------------Start--------------------
#!usr/bin/perl.exe
$a = 0;
open(DUP, ">Duplicates found.csv") || die $!;
open(NOR, ">Not in main.csv") || die $!;
open(FILEM, "main.csv");
open(FILET, "tmp1.csv");
@file1 = <FILEM>;
@file2 = <FILET>;
foreach $item (@file1) {
    $item =~ /(\d[0-9]{6,15}\d)/;
    $val = &bin_search( \@file2, $1 );
    # Here is an example of the first line in @file2:
    #   East Rd,GouisS Lees & PARTNERS, 12345678/1,JACOBS
    # and $1 is, for example, 12345678 -- but it doesn't match
    if (defined $val) {
        print DUP "$item";
    }
}
close(FILEM);
close(FILET);
close(NOR);
close(DUP);

sub bin_search {
    my ($file2, $word) = @_;
    my ($low, $high) = ( 0, @$file2 - 1 );
    while ( $low <= $high ) {                       # While the window is open
        my $try = int( ($low + $high) / 2 );        # Try the middle element
        $low  = $try + 1, next if $array->[$try] lt $word;  # Raise bottom
        $high = $try - 1, next if $array->[$try] gt $word;  # Lower top
        return $try;                                # We've found the word!
    }
    return;
}
--------------END---------------
ANY help with this would be great, thanks a lot.

20050404 Edit by ysth: code tags; remove unneeded brs from text paragraph

Replies are listed 'Best First'.
Re: Binary Search
by borisz (Canon) on Apr 04, 2005 at 13:06 UTC
    Here is a way to do it, if one of the files fits into your memory. (untested)
    my %h;
    open my $fh, "<file1" or die $!;
    while ( defined( $_ = <$fh> ) ) {
        $h{$_} = $.;
    }
    open my $fh2, "<file2" or die $!;
    while ( defined( $_ = <$fh2> ) ) {
        if ( exists $h{$_} ) {
            print "line $. is a dup of line $h{$_}\n";
        }
    }
    Boris
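    If the lines themselves never match exactly (the OP's two files share only the account number, not the surrounding fields), the same hash trick can key on the extracted number instead. A minimal sketch with made-up sample lines, assuming the number always sits just before a '/' as in the example line above:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample data standing in for main.csv / tmp1.csv (format assumed
# from the OP's example line -- not the real files).
my @main = ("East Rd,GouisS Lees & PARTNERS, 12345678/1,JACOBS\n");
my @tmp  = ("Other St,SOMEONE ELSE, 12345678/1,SMITH\n",
            "No Rd,NOBODY, 99999999/9,DOE\n");

# Key the hash on the extracted number, not the whole line.
my %seen;
for my $line (@main) {
    $seen{$1} = 1 if $line =~ m{(\d{6,15})/};
}

my @dups = grep { m{(\d{6,15})/} and $seen{$1} } @tmp;
print @dups;    # only the SMITH line shares number 12345678
```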
Re: Binary Search
by dragonchild (Archbishop) on Apr 04, 2005 at 13:12 UTC
    If you're on Unix, there's no need for Perl. (Windows has both the Unix utilities and Cygwin.)
    sort file1.csv > file1.sort.csv
    sort file2.csv > file2.sort.csv
    comm file1.sort.csv file2.sort.csv

    Plus, 61k lines isn't very large at all ... I'm curious as to how you were comparing them originally that you felt was too slow. Maybe your initial algorithm was borked. Could it have been something like

    open FIRST, "first.csv";
    while (my $line = <FIRST>) {
        open SECOND, "second.csv";
        while (my $otherline = <SECOND>) {
            if ( $line eq $otherline ) {
                print "Found a match!\n";
            }
        }
        close SECOND;
    }
    close FIRST;
    Because that's really slow!
    • dragonchild

      I am currently doing this in Windows... I know... I have no choice though.

      Yes, that is about exactly how I was doing it. I am testing borisz's way now, and that looks really fast!

      I am still learning Perl, as you have undoubtedly noticed.
      We all need to start somewhere :).
      Thanks
        I am currently doing this in win... i know... i have no choice though.

        Ok. You can install any of the following:

        All three have comm and sort and work just fine.

        Yes that is about exactly how i was doing it, i am testing borisz way now and that looks really fast!

        There are two problems with the algorithm I outlined. The second problem is bigger than the first:

        1. You are doing N iterations over the second array
        2. You are reading the second file N times

        The reason problem #2 is bigger is because I/O is extremely expensive. The following is an intermediate solution between borisz's excellent answer and your first algorithm. Even though I'm still iterating over the inner loop N times, I'm doing it in memory, which is much faster.

        open my $fh1, 'first.csv';
        my @first = <$fh1>;
        close $fh1;

        open my $fh2, 'second.csv';
        my @second = <$fh2>;
        close $fh2;

        FIRST: foreach my $line (@first) {
            foreach my $otherline (@second) {
                if ( $line eq $otherline ) {
                    print "Found a duplicate!\n";
                    next FIRST;
                }
            }
        }

        Note that I've also added short-circuiting to stop comparing the inner-loop when I've found a dupe. This will reduce the runtime quite a bit. The hash solution is still faster, and will take less memory.
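        The gap is easy to measure with the core Benchmark module. A rough sketch on made-up data (the sizes and overlap are illustrative, not the OP's files):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(timethese);    # core module

my @first  = map { "record $_\n" } 1 .. 2000;
my @second = map { "record $_\n" } 1501 .. 3500;   # 500 lines overlap

# O(N*M): scan all of @second for every line of @first.
sub dups_nested {
    my $count = 0;
    LINE: for my $line (@first) {
        for my $other (@second) {
            $count++, next LINE if $line eq $other;
        }
    }
    return $count;
}

# O(N+M): one pass to build the hash, one pass to probe it.
sub dups_hash {
    my %seen = map { $_ => 1 } @second;
    return scalar grep { $seen{$_} } @first;
}

timethese(1, { nested => \&dups_nested, hash => \&dups_hash });
```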

Re: Binary Search
by Jaap (Curate) on Apr 04, 2005 at 13:10 UTC
    • Welcome to the monastery!
    • Please use <code></code> around your code
    • Please use strict; and use warnings; when you get stuck. They give useful error messages.
    • What is $array? Where is it initialised?
    • Are you sure you want a comma between $low = $try+1 and next if $array...?
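    On the last point: the comma is actually legal, if unusual. `A, B` evaluates both expressions, so the statement modifier's `next` fires after the assignment. A tiny self-contained demonstration (made-up values):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# "$low = $try + 1, next if COND;" behaves like
# "if (COND) { $low = $try + 1; next; }" -- the comma operator
# evaluates both expressions before next transfers control.
my @log;
for my $n (1 .. 5) {
    push(@log, "skip $n"), next if $n % 2;   # comma form
    push @log, "keep $n";
}
print "@log\n";
```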
      Hi all, damn, thanks for the fast replies!!!
      Happy to be a part of Perl Monks!
    • Hi Jaap, sorry about the code tags, I will remember them next time :)
      Yes, sorry about $array: in the posted sub, the parameter $file2 should have been named $array.
      A reference with the same name as the array gives a lot of problems,
      so every reference made to @file2 goes through $array.
      I apologize for this, but after testing so many times I forgot to change it. o_O
      ----------bin_search should be written like this (SORRY)-----------
      sub bin_search {
          my ($array, $word) = @_;
          my ($low, $high) = ( 0, @$array - 1 );
          while ( $low <= $high ) {                       # While the window is open
              my $try = int( ($low + $high) / 2 );        # Try the middle element
              $low  = $try + 1, next if $array->[$try] lt $word;  # Raise bottom
              $high = $try - 1, next if $array->[$try] gt $word;  # Lower top
              return $try;                                # We've found the word!
          }
          return;
      }
      THANKS AGAIN!
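      For what it's worth, even this corrected sub won't find anything as posted: @file2 still holds whole CSV lines while $word is a bare number, and a binary search only works on sorted input. A sketch (sample lines made up to match the format described above) that extracts the number field and sorts it first. Note the caveat that a plain string sort only orders numbers of equal width correctly; keys of varying length would need sort { $a <=> $b } and numeric comparisons in the search:

```perl
#!/usr/bin/perl
use strict;
use warnings;

sub bin_search {
    my ($array, $word) = @_;
    my ($low, $high) = (0, $#$array);
    while ($low <= $high) {                    # While the window is open
        my $try = int(($low + $high) / 2);     # Try the middle element
        $low  = $try + 1, next if $array->[$try] lt $word;  # Raise bottom
        $high = $try - 1, next if $array->[$try] gt $word;  # Lower top
        return $try;                           # Found the word
    }
    return;
}

# Made-up lines in the format the OP describes.
my @file2 = (
    "East Rd,GouisS Lees & PARTNERS, 12345678/1,JACOBS\n",
    "West Ave,ACME, 00554433/2,SMITH\n",
);

# Pull out just the number field, then sort, before searching.
my @numbers = sort map { m{(\d{6,15})/} ? $1 : () } @file2;
my $idx = bin_search(\@numbers, "12345678");
print defined $idx ? "found at index $idx\n" : "not found\n";
```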