<assumptions>
2 files x 65_000 lines x 60 fields x 8 chars = ~60MB data
|-delimited ASCII
search key is the first field(s); the scripts below use the first two
data contains no 'escaped' |'s (e.g. \| or "xx|xx")
</assumptions>
Given the relatively 'small' (meaning under 128MB)
database size, the questions to ask are 'how often do you
need to search it?' and 'how often does the data change?'
As merlyn pointed out, you could index this file
on whatever field(s) you search by; in this example,
the ECL.
N.B. also that it is simplest if (as one might
assume) the data is only ever appended to (and never
'changed'), so that only the new data need be scanned
into the DB index. This example assumes that you get
'new files,' e.g. by rotating files out before indexing.
If that is not the case (if the files are appended to
in place), it is more effective to record the length of
the file at the time it was indexed and then seek() past
that point before indexing; a sketch follows. (To 'reset'
this index, just remove the db file.)
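A minimal sketch of that bookkeeping, assuming a companion
"$filename.pos" file (the name is made up) holds the byte
offset reached by the previous run:
[seek_sketch]
#!/usr/bin/perl
# Sketch only: resume indexing where the previous run stopped.
use strict;
use warnings;
my $filename = shift or die "usage: $0 file\n";
my $posfile  = "$filename.pos";    # bookkeeping file (name is made up)
# Offset recorded by the previous run; 0 if there wasn't one.
my $offset = 0;
if (open my $pos, '<', $posfile) {
    $offset = 0 + (<$pos> || 0);
    close $pos;
}
open my $in, '<', $filename or die "Can't open $filename: $!";
seek $in, $offset, 0;              # 0 == SEEK_SET: skip indexed data
while (<$in>) {
    # ... feed each new line to the indexing loop in mk_ecl_index below ...
}
# Remember how far we got for next time.
open my $out, '>', $posfile or die "Can't write $posfile: $!";
print $out tell($in), "\n";
close $out;
close $in;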
[mk_ecl_index]
#!/usr/bin/perl
use strict;
use warnings;
use DB_File;
for my $filename (@ARGV)
{
    my %ecl;
    tie %ecl, 'DB_File', "$filename.db"
        or die "Can't tie $filename.db: $!";
    open ASCII, '<', $filename
        or die "Can't open $filename: $!";
    while (<ASCII>)
    {
        chomp;
        next unless m{^ ([^\|]*
                         \|
                         [^\|]*)
                      \|}x;    # key is the first two fields
        # can't store refs in basic DB_File,
        # but the data is guaranteed not to contain \n, so... :-/
        $ecl{$1} = ''
            unless defined $ecl{$1};
        $ecl{$1} .= $_ . "\n";
    }
    close ASCII;
    untie %ecl;
}
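Run it once per data file (file names invented for
illustration):
    ./mk_ecl_index current.dat archive.dat
which leaves a current.dat.db and an archive.dat.db
next to the data files.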
[grep_ecl]
#!/usr/bin/perl
# n.b. args are the opposite of Unix grep: filename first, then query, q2, ...
use strict;
use warnings;
use Fcntl;
use DB_File;
my $filename = shift
    or die "usage: $0 file query [query ...]\n";
my %index;
# open read-only so a mistyped filename doesn't create an empty .db
tie %index, 'DB_File', "$filename.db", O_RDONLY, 0644
    or die "Can't tie to $filename.db: $!";
for my $query (@ARGV)
{
    if ( exists $index{$query} )
    {
        print $index{$query};    # newlines already provided
    } else {
        print STDERR "$0: $filename: $query not found\n";
    }
}
untie %index;
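Note that lookups are exact matches against the stored
key, i.e. the first two fields joined by '|' (the values
below are made up):
    ./grep_ecl current.dat 'ECL123|PARTNO'
prints every line of current.dat whose first two fields
are ECL123 and PARTNO.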
File locking is left as an exercise for the reader;
if you index in a cron job or logrotate script, you'll
likely need it.
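For the curious, one minimal way to do it: flock() on a
separate lock file (the .lock name is my invention),
taken by both the indexer and the searcher:
[lock_sketch]
#!/usr/bin/perl
# Sketch only: serialize indexers and searchers with flock().
use strict;
use warnings;
use Fcntl qw(:flock);
my $filename = shift or die "usage: $0 file\n";
open my $lock, '>>', "$filename.lock"
    or die "Can't open $filename.lock: $!";
flock $lock, LOCK_EX               # LOCK_SH would do for readers
    or die "Can't lock $filename.lock: $!";
# ... tie, index, or search as above ...
close $lock;                       # releases the lock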
Hope that helps ;-)