dave_pl has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, I need some help with a search I am doing on two flat text files: I am looking for duplicates between the two files. The files average 60908 lines each, and a linear search is just too slow. I tried the binary search from 'Mastering Algorithms with Perl', but it doesn't seem to work. Here is the code I am using:
---------------Start--------------------
#!usr/bin/perl.exe
$a = 0;
open(DUP, ">Duplicates found.csv") || die $!;
open(NOR, ">Not in main.csv") || die $!;
open(FILEM, "main.csv");
open(FILET, "tmp1.csv");
@file1 = <FILEM>;
@file2 = <FILET>;
foreach $item (@file1) {
    $item =~ /(\d[0-9]{6,15}\d)/;
    $val = &bin_search( \@file2, $1 );
    # Here is an example of the first line in @file2:
    #   East Rd,GouisS Lees & PARTNERS, 12345678/1,JACOBS
    # and $1 is, for example, 12345678 -- but it doesn't match
    if (defined $val) {
        print DUP "$item";
    }
}
close(FILEM);
close(FILET);
close(NOR);
close(DUP);

sub bin_search {
    my ($file2, $word) = @_;
    my ($low, $high) = ( 0, @$file2 - 1 );
    while ( $low <= $high ) {                       # While the window is open
        my $try = int( ($low + $high) / 2 );        # Try the middle element
        $low  = $try + 1, next if $array->[$try] lt $word;  # Raise bottom
        $high = $try - 1, next if $array->[$try] gt $word;  # Lower top
        return $try;                                # We've found the word!
    }
    return;
}
--------------END---------------
ANY help with this would be great, thanks a lot.

20050404 Edit by ysth: code tags; remove unneeded brs from text paragraph

Replies are listed 'Best First'.
Re: Binary Search
by borisz (Canon) on Apr 04, 2005 at 13:06 UTC
    Here is a way to do it, if one of the files fits into your memory. (untested)
    my %h;
    open my $fh, "<file1" or die $!;
    while ( defined( $_ = <$fh> ) ) {
        $h{$_} = $.;
    }
    open my $fh2, "<file2" or die $!;
    while ( defined( $_ = <$fh2> ) ) {
        if ( exists $h{$_} ) {
            print "line $. is a dup of line $h{$_}\n";
        }
    }
    Boris
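    If the lines themselves never match exactly (the OP's two files share only the account number, not the surrounding fields), the same hash trick can key on the extracted number instead. A minimal sketch with made-up sample lines, assuming the number always sits just before a '/' as in the example line above:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample data standing in for main.csv / tmp1.csv (format assumed
# from the OP's example line -- not the real files).
my @main = ("East Rd,GouisS Lees & PARTNERS, 12345678/1,JACOBS\n");
my @tmp  = ("Other St,SOMEONE ELSE, 12345678/1,SMITH\n",
            "No Rd,NOBODY, 99999999/9,DOE\n");

# Key the hash on the extracted number, not the whole line.
my %seen;
for my $line (@main) {
    $seen{$1} = 1 if $line =~ m{(\d{6,15})/};
}

my @dups = grep { m{(\d{6,15})/} and $seen{$1} } @tmp;
print @dups;    # only the SMITH line shares number 12345678
```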
Re: Binary Search
by dragonchild (Archbishop) on Apr 04, 2005 at 13:12 UTC
    If you're on Unix, there's no need for Perl. (Windows has both the Unix utilities and Cygwin.)
    sort file1.csv > file1.sort.csv
    sort file2.csv > file2.sort.csv
    comm file1.sort.csv file2.sort.csv

    Plus, 61k lines isn't very large at all ... I'm curious as to how you were comparing them originally that you felt was too slow. Maybe your initial algorithm was borked. Could it have been something like

    open FIRST, "first.csv";
    while (my $line = <FIRST>) {
        open SECOND, "second.csv";
        while (my $otherline = <SECOND>) {
            if ( $line eq $otherline ) {
                print "Found a match!\n";
            }
        }
        close SECOND;
    }
    close FIRST;
    Because that's really slow!
    • dragonchild

      I am currently doing this in Windows... I know... I have no choice though.

      Yes, that is about exactly how I was doing it. I am testing borisz's way now, and that looks really fast!

      I am still learning Perl, as you have undoubtedly noticed.
      We all need to start somewhere :).
      Thanks
        I am currently doing this in win... i know... i have no choice though.

        Ok. You can install any of the following:

        All three have comm and sort and work just fine.

        Yes that is about exactly how i was doing it, i am testing borisz way now and that looks really fast!

        There are two problems with the algorithm I outlined. The second problem is bigger than the first:

        1. You are doing N iterations over the second array
        2. You are reading the second file N times

        The reason problem #2 is bigger is because I/O is extremely expensive. The following is an intermediate solution between borisz's excellent answer and your first algorithm. Even though I'm still iterating over the inner loop N times, I'm doing it in memory, which is much faster.

        open my $fh1, 'first.csv';
        my @first = <$fh1>;
        close $fh1;

        open my $fh2, 'second.csv';
        my @second = <$fh2>;
        close $fh2;

        FIRST: foreach my $line (@first) {
            foreach my $otherline (@second) {
                if ( $line eq $otherline ) {
                    print "Found a duplicate!\n";
                    next FIRST;
                }
            }
        }

        Note that I've also added short-circuiting to stop comparing the inner-loop when I've found a dupe. This will reduce the runtime quite a bit. The hash solution is still faster, and will take less memory.
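        The gap is easy to measure with the core Benchmark module. A rough sketch on made-up data (the sizes and overlap are illustrative, not the OP's files):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(timethese);    # core module

my @first  = map { "record $_\n" } 1 .. 2000;
my @second = map { "record $_\n" } 1501 .. 3500;   # 500 lines overlap

# O(N*M): scan all of @second for every line of @first.
sub dups_nested {
    my $count = 0;
    LINE: for my $line (@first) {
        for my $other (@second) {
            $count++, next LINE if $line eq $other;
        }
    }
    return $count;
}

# O(N+M): one pass to build the hash, one pass to probe it.
sub dups_hash {
    my %seen = map { $_ => 1 } @second;
    return scalar grep { $seen{$_} } @first;
}

timethese(1, { nested => \&dups_nested, hash => \&dups_hash });
```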

Re: Binary Search
by Jaap (Curate) on Apr 04, 2005 at 13:10 UTC
    • Welcome to the monastery!
    • Please use <code></code> around your code
    • Please use strict; and use warnings; when you get stuck. They give useful error messages.
    • What is $array? Where is it initialised?
    • Are you sure you want a comma between $low = $try+1 and next if $array...?
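    On the last point: the comma is actually legal, if unusual. `A, B` evaluates both expressions, so the statement modifier's `next` fires after the assignment. A tiny self-contained demonstration (made-up values):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# "$low = $try + 1, next if COND;" behaves like
# "if (COND) { $low = $try + 1; next; }" -- the comma operator
# evaluates both expressions before next transfers control.
my @log;
for my $n (1 .. 5) {
    push(@log, "skip $n"), next if $n % 2;   # comma form
    push @log, "keep $n";
}
print "@log\n";
```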
      Hi all, damn, thanks for the fast replies!!!
      Happy to be a part of Perl Monks!
    • Hi Jaap, sorry about the code tags, I will remember them next time :)
      Yes, sorry about $array: in the posted sub, the parameter $file2 should have been named $array.
      A reference with the same name as the array gives a lot of problems,
      so every reference made to @file2 goes through $array.
      I apologize for this, but after testing so many times I forgot to change it. o_O
      ----------bin_search should be written like this (SORRY)-----------
      sub bin_search {
          my ($array, $word) = @_;
          my ($low, $high) = ( 0, @$array - 1 );
          while ( $low <= $high ) {                       # While the window is open
              my $try = int( ($low + $high) / 2 );        # Try the middle element
              $low  = $try + 1, next if $array->[$try] lt $word;  # Raise bottom
              $high = $try - 1, next if $array->[$try] gt $word;  # Lower top
              return $try;                                # We've found the word!
          }
          return;
      }
      THANKS AGAIN!
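      For what it's worth, even this corrected sub won't find anything as posted: @file2 still holds whole CSV lines while $word is a bare number, and a binary search only works on sorted input. A sketch (sample lines made up to match the format described above) that extracts the number field and sorts it first. Note the caveat that a plain string sort only orders numbers of equal width correctly; keys of varying length would need sort { $a <=> $b } and numeric comparisons in the search:

```perl
#!/usr/bin/perl
use strict;
use warnings;

sub bin_search {
    my ($array, $word) = @_;
    my ($low, $high) = (0, $#$array);
    while ($low <= $high) {                    # While the window is open
        my $try = int(($low + $high) / 2);     # Try the middle element
        $low  = $try + 1, next if $array->[$try] lt $word;  # Raise bottom
        $high = $try - 1, next if $array->[$try] gt $word;  # Lower top
        return $try;                           # Found the word
    }
    return;
}

# Made-up lines in the format the OP describes.
my @file2 = (
    "East Rd,GouisS Lees & PARTNERS, 12345678/1,JACOBS\n",
    "West Ave,ACME, 00554433/2,SMITH\n",
);

# Pull out just the number field, then sort, before searching.
my @numbers = sort map { m{(\d{6,15})/} ? $1 : () } @file2;
my $idx = bin_search(\@numbers, "12345678");
print defined $idx ? "found at index $idx\n" : "not found\n";
```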