neversaint has asked for the wisdom of the Perl Monks concerning the following question:

Dear Masters,
I have a script which basically does the following:
  1. Accept an input file text
  2. Check whether the exact text appears in one of the files in the repository.
    We have hundreds of files in store.
  3. Return the name of the matching file in the repository if we have it, otherwise
  4. Return 'no matching file'
For example we would then have:
prompt> perl check_filename.pl input_file.txt
File in our repos: file1.txt
I was wondering if there is a more efficient way to do this sort of thing on the fly. What I have now simply stores the file contents in a hash, but it is very, very slow.
#!/usr/bin/perl -w
use strict;
use File::Slurp;
use File::Basename;

my $input = $ARGV[0];
my @repos = glob("repos/*.txt");

# Store repository files in a hash: content as key, base name as value
my %hash;
foreach my $repos_file (@repos) {
    my $text = read_file($repos_file);
    my $base = basename($repos_file, ".txt");
    $hash{$text} = $base;
}

# Check the input file against the repository
my $input_text = read_file($input);
if ($hash{$input_text}) {
    print "File in our repos: $hash{$input_text}.txt\n";
}
else {
    print "no matching file\n";
}


---
neversaint and everlastingly indebted.......

Re: Efficient Text Checking
by moritz (Cardinal) on Aug 27, 2007 at 13:26 UTC
    I guess that you are loading large amounts of file contents into memory, which takes some time.

    But since you have only one lookup, you don't have to store the values - you could compare them on the fly:

    my $input_text = read_file($input);
    for my $filename (@repos) {
        if ($input_text eq read_file($filename)) {
            print "File $filename matches\n";
            exit;
        }
    }
    print "No match\n";

    If the files are rather large, you can optimize the process further by reading just the first $N bytes (let's say 512) and comparing those, and only reading all of the file when those first bytes are identical.
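    A minimal sketch of that pre-check, where the 512-byte prefix length and the read_prefix() helper are illustrative choices rather than anything from the original post:

    use strict;
    use warnings;
    use File::Slurp qw(read_file);

    my $input = $ARGV[0];
    my @repos = glob("repos/*.txt");
    my $N     = 512;

    # Read only the first $len bytes of a file
    sub read_prefix {
        my ($file, $len) = @_;
        open my $fh, '<', $file or die "Can't open $file: $!";
        read $fh, my $buf, $len;
        close $fh;
        return $buf;
    }

    my $input_text   = read_file($input);
    my $input_prefix = substr($input_text, 0, $N);
    for my $filename (@repos) {
        # Cheap check first: skip files whose leading bytes already differ
        next unless read_prefix($filename, $N) eq $input_prefix;
        # Prefixes match, so compare the full contents
        if ($input_text eq read_file($filename)) {
            print "File $filename matches\n";
            exit;
        }
    }
    print "No match\n";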

    Or you could just dump the files into a database and let the database build an index; then all you have to do is query the database. If your repo doesn't change too frequently, that's probably the fastest solution.
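    One way to realize the database idea, assuming SQLite via DBD::SQLite and indexing by an MD5 digest of each file's content rather than the raw text (the database file, table name, and schema here are invented for illustration):

    use strict;
    use warnings;
    use DBI;
    use Digest::MD5 qw(md5_hex);
    use File::Slurp qw(read_file);

    my $dbh = DBI->connect("dbi:SQLite:dbname=repos.db", "", "",
                           { RaiseError => 1 });

    # One-time indexing pass: store a digest of every repository file
    $dbh->do("CREATE TABLE IF NOT EXISTS files (digest TEXT PRIMARY KEY, name TEXT)");
    my $ins = $dbh->prepare("INSERT OR REPLACE INTO files (digest, name) VALUES (?, ?)");
    for my $file (glob "repos/*.txt") {
        $ins->execute(md5_hex(read_file($file)), $file);
    }

    # Every later lookup is a single indexed query
    my ($name) = $dbh->selectrow_array(
        "SELECT name FROM files WHERE digest = ?",
        undef, md5_hex(read_file($ARGV[0])),
    );
    print $name ? "File in our repos: $name\n" : "no matching file\n";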

Re: Efficient Text Checking
by liverpole (Monsignor) on Aug 27, 2007 at 13:35 UTC
    Hi neversaint,

        "1. Accept an input file text"

    It seems you mean "Accept the name of an input file", right?  And, if I'm following correctly, you're trying to find all files which exactly match the text from that input file?

    If you have large files, your hash keys will be large and slow to work with.  Why not do a first pass to compare file sizes (using -s $filename)?  That way, only files which exactly match the size of the input file need be considered further.  Any files left over have a much higher chance of matching!

    I would also recommend you don't store the text of the file as a hash key, but rather use something like File::Compare to compare each file against the input file (after, of course, performing the filter-by-size pass).
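    A minimal sketch of that two-pass approach, assuming the repository layout from the original post (variable names are illustrative):

    use strict;
    use warnings;
    use File::Compare;

    my $input      = $ARGV[0];
    my $input_size = -s $input;

    # First pass: only files of identical size can possibly match
    my @candidates = grep { -s $_ == $input_size } glob("repos/*.txt");

    # Second pass: byte-by-byte comparison of the survivors
    for my $file (@candidates) {
        if (compare($input, $file) == 0) {    # compare() returns 0 when equal
            print "File in our repos: $file\n";
            exit;
        }
    }
    print "no matching file\n";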


    s''(q.S:$/9=(T1';s;(..)(..);$..=substr+crypt($1,$2),2,3;eg;print$..$/
Re: Efficient Text Checking
by archfool (Monk) on Aug 27, 2007 at 13:50 UTC
    Having the text itself as a key is probably very memory-inefficient.  Have you tried taking an MD5 digest of the text (e.g. with Digest::MD5) to make smaller keys?  This will take extra time, but use less memory.

    Also, you could store your hash on disk if you run this repeatedly and the files don't change very much.  You'd then just need to read in the md5sums and filenames rather than recompute them again and again and again...
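    A sketch of the digest-key idea with a simple on-disk cache, assuming Storable for persistence (the cache file name is an invented example):

    use strict;
    use warnings;
    use Digest::MD5 qw(md5_hex);
    use File::Slurp qw(read_file);
    use Storable qw(store retrieve);

    my $cache = "repos_md5.cache";

    # Build a digest-to-name map for the repository, or reload a saved one
    my %by_digest;
    if (-e $cache) {
        %by_digest = %{ retrieve($cache) };
    }
    else {
        for my $file (glob "repos/*.txt") {
            $by_digest{ md5_hex(read_file($file)) } = $file;
        }
        store \%by_digest, $cache;    # reuse on later runs
    }

    # Look up the input file's digest; a hit could be confirmed with a
    # full comparison if MD5 collisions are a concern
    my $digest = md5_hex(read_file($ARGV[0]));
    print $by_digest{$digest}
        ? "File in our repos: $by_digest{$digest}\n"
        : "no matching file\n";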