dr_jgbn has asked for the wisdom of the Perl Monks concerning the following question:

To the Perl gurus: what is the fastest way to take each line of one file and search for it in the entire contents of another file?
I have found that using a regex is quite slow.
In addition, when there is a match, I need to print the line that matched as well as the previous line, and I need to count the number of matches for that given line.
The purpose is to take a file that is approx. 100000 lines long (multiple times, with various modifications) and search each of its lines against another, quite large, file.


Thank-you in advance for all suggestions!

Dr.J

Replies are listed 'Best First'.
Re: "Pattern Matching", not using regex
by Abigail-II (Bishop) on Jul 22, 2002 at 16:27 UTC
    Well, it's not quite clear what you want, but I think you want the following: given two large files, F1 and F2, find all lines in F2 that are also in F1.

    You could do something like this (the following code is untested, just shown here to guide you, not for cut-and-paste purposes):

    use AnyDBM_File;
    use Fcntl;

    my $pattern_file = "whatever 1";
    my $target_file  = "whatever 2";
    my $dbm          = "/tmp/matchdbm";

    # Keep the lookup table on disk rather than in memory.
    tie my %db => AnyDBM_File => $dbm, O_CREAT | O_RDWR, 0666
        or die "Failed to tie to $dbm: $!";

    open my $pf => $pattern_file or die "Failed to open $pattern_file: $!";
    open my $tf => $target_file  or die "Failed to open $target_file: $!";

    # Every line of the pattern file becomes a key.
    $db {$_} ++ while <$pf>;

    # Sliding window: previous line, candidate line, next line.
    my @buff = ("", "", "");
    while (<$tf>) {
        @buff = (@buff [1, 2], $_);
        print @buff if $db {$buff [1]};
    }

    # Shift once more so the last line of the target file gets checked.
    @buff = (@buff [1, 2], "");
    print @buff if $db {$buff [1]};

    Abigail

Re: "Pattern Matching", not using regex
by broquaint (Abbot) on Jul 22, 2002 at 16:10 UTC
    I have come to discover that using regex is quite slow
    Well if you're doing complex pattern matching then regexes are your only choice. However if you're merely searching for strings within strings then I'd recommend using either the index() or rindex() functions.
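
To illustrate the index() suggestion, here is a minimal sketch. The helper name lines_containing and the sample data are made up for this example and are not from the thread. It treats the search string as a literal substring, so regex metacharacters have no special meaning.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return the lines in @$lines
# that contain $needle as a literal substring.
sub lines_containing {
    my ($needle, $lines) = @_;
    # index() returns -1 when the substring is absent and a zero-based
    # offset (possibly 0) when it is present, so test against 0, not truth.
    return grep { index($_, $needle) >= 0 } @$lines;
}

# Sample data, invented for the example.
my @lines = ("alpha beta\n", "a.c\n", "abc\n");
my @hits  = lines_containing("a.c", \@lines);

# The "." is literal here, so "abc" is not a hit.
print @hits;
```

Note the comparison against 0: a match at the very start of the string returns offset 0, which is false in a boolean test.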
    HTH

    _________
    broquaint

Re: "Pattern Matching", not using regex
by VSarkiss (Monsignor) on Jul 22, 2002 at 16:03 UTC

    It's not clear from your question, but do you even need to use pattern matching at all? If what you're trying to match (the lines from the first file) are fixed strings, a simple eq will do the job.

    It may be obvious and not apply to what you're doing, but I wanted to point it out, because it's very common that people over-use pattern matching when a simple string compare will suffice.
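
If the lines really are fixed strings compared in full, one way to avoid comparing every pair with eq is to load the lines of the first file as hash keys and look up each line of the second file. The sketch below assumes exact whole-line matches and that the first file fits in memory; the helper name match_lines and the sample data are hypothetical. It also records the previous line and a per-line match count, as the original question asks.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread). Both arguments are open,
# readable filehandles. Returns two references:
#   \%count  - pattern line => number of times it occurs in the target
#   \@report - [previous line, matched line] for every match, in order
sub match_lines {
    my ($pattern_fh, $target_fh) = @_;

    # Load every pattern line as a hash key, starting each count at zero.
    my %count;
    while (my $line = <$pattern_fh>) {
        chomp $line;
        $count{$line} = 0;
    }

    # One pass over the target; each lookup is a single hash probe.
    my @report;
    my $prev = '';    # the first line of the target has no predecessor
    while (my $line = <$target_fh>) {
        chomp $line;
        if (exists $count{$line}) {
            $count{$line}++;
            push @report, [$prev, $line];
        }
        $prev = $line;
    }
    return (\%count, \@report);
}

# Demo using in-memory filehandles so the sketch needs no files on disk.
my $patterns = "foo\nbar\n";
my $target   = "x\nfoo\ny\nfoo\nz\n";
open my $pf, '<', \$patterns or die "Open failed: $!";
open my $tf, '<', \$target   or die "Open failed: $!";

my ($count, $report) = match_lines($pf, $tf);
print "$_->[0]\n$_->[1]\n" for @$report;
print "$_: $count->{$_}\n" for sort keys %$count;
```

This reads each file once instead of rescanning the second file for every line of the first. If the first file is too large for memory, the same lookup can go through a tied DBM hash instead.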

Re: "Pattern Matching", not using regex
by krisahoch (Deacon) on Jul 22, 2002 at 17:03 UTC

    Dr. J

    I started writing this up earlier today, but I had to leave it to do some work. When I came back to post, I saw that Abigail had already posted a solution. This solution may not be as fast as hers, but it is cut-and-pasteable. If you want to use it, feel free.

    I also tried to make it as simple as possible.

    Thanks,
    Kristofer

    #!/bin/perl -w
    use diagnostics;
    use strict;
    use warnings;

    ############################################################
    sub getAllLinesFromAFileandReturnItAsAnArray($) {
        my ($FILE) = $_[0];

        #Redundant check. This code should be unreachable just by
        #the way I defined this sub '($)'. I do manual checking
        #of each variable as an anal principle. Just my style - Kristofer
        if (!defined($FILE)) {
            die ("I did not receive a filename");
        }

        open my $FileDescripter => $FILE or die "Could not open '$FILE'\n";
        my @ListOfStringsToReturn = <$FileDescripter>;
        close $FileDescripter;
        chomp @ListOfStringsToReturn;

        # my @ListOfStringsToReturn = ();
        # open (FileDescripter, "$FILE");
        # while (<FileDescripter>)
        # {
        #     if (defined($_))
        #     {
        #         chop;
        #         push @ListOfStringsToReturn, $_;
        #     }
        # }
        # close(FileDescripter);

        return @ListOfStringsToReturn;
    }
    #===========================================================

    my $FirstFile  = "FileOne.txt";
    my $SecondFile = "FileThree.txt";
    my @FileOne = getAllLinesFromAFileandReturnItAsAnArray($FirstFile);
    my @FileTwo = getAllLinesFromAFileandReturnItAsAnArray($SecondFile);

    for (my $i = 0; $i < @FileOne; $i++) {
        print "Comparing Line $i in $FirstFile ... ";
        my $test = undef;
        for (my $j = 0; $j < @FileTwo; $j++) {
            if ($FileOne[$i] eq $FileTwo[$j]) {
                print " matched to line $j in $SecondFile\n";
                print " FileOne: '$FileOne[$i]'\n";
                print " FileTwo: '$FileTwo[$j]'\n";
                #Setting j to @FileTwo will kill this loop
                $j = @FileTwo;
                $test = 0;
            }
        }
        if (!defined($test)) {
            #scalar(@FileTwo) is the entry count; $#FileTwo is the last index
            print " not matched. Compared to " . scalar(@FileTwo) . " entries\n";
        }
    }

    Update based on Abigail's suggestion. Thanks Abigail for the lesson in Perl.
      Please, please, please. Reading in all lines of a file into an array is standard idiom. Do not use such a silly, inefficient, routine. And NEVER EVER forget to check the return value of open.

      Here's how you read the file into an array:

      open my $fh => $file or die "Open failed: $!";
      my @array = <$fh>;
      close $fh;

      If you want to chomp off the newlines, add the following line:
      chomp @array;
      Don't reinvent the wheel. And don't code perl primitives in Perl.

      Abigail

Re: "Pattern Matching", not using regex
by Cine (Friar) on Jul 22, 2002 at 17:48 UTC
    grep -B 1 'yourinput' yourfile

    It's not Perl, but it is a lot more effective.

    T I M T O W T D I