<assumptions>
 2 files x 65_000 lines x 60 fields x 8 chars = ~60MB data
 |-delimited ASCII
 search query is field 1
 data contains no 'escaped' |'s (e.g. \| or "xx|xx")
</assumptions>

Given the relatively 'small' (meaning under 128MB) database size, the questions would be 'how often do you need to search it?' and 'how often does the data change?'

As merlyn pointed out, you could index this file based upon whatever field(s) you are searching by; in this example, the ECL.

N.B. also that if (as one might assume) the data is only ever appended to and never changed, it is simplest to scan only the new data into the DB index. This example assumes that you get new files, e.g. by rotating files out before indexing. If that is not the case (if the files are appended to in place), it would be more effective to record the length of the file at the time it was indexed and then use seek() to begin indexing from that offset on the next run. (To 'reset' the index, just remove the db file.)
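For the append-in-place case, here is a rough, untested-in-production sketch of the seek() idea. It is a variant of the indexer, not a replacement for it. The sub name index_new_lines and the reserved key '__indexed_up_to__' are my own inventions for illustration; the offset is stored in the index itself on the assumption that real keys always contain a '|' and so can never collide with it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical reserved key holding the byte offset already indexed.
# Real keys always contain a '|', so this name cannot collide with one.
my $OFFSET_KEY = '__indexed_up_to__';

# Index only the lines appended to $filename since the previous run.
# $index is a hash ref (tied to DB_File, or a plain hash for testing).
# Returns the number of lines added.
sub index_new_lines
{
 my ($filename, $index) = @_;
 open my $fh, '<', $filename
  or die "Can't open $filename: $!";
 my $pos = $index->{$OFFSET_KEY} || 0;
 seek $fh, $pos, 0
  or die "Can't seek in $filename: $!";
 my $added = 0;
 while (my $line = <$fh>)
 {
  # an unterminated last line is still being written; leave it for next run
  last unless $line =~ /\n\z/;
  $pos = tell $fh;
  chomp $line;
  next unless $line =~ m{^([^|]*\|[^|]*)\|}; # first two fields
  my $key = $1;
  $index->{$key} = '' unless defined $index->{$key};
  $index->{$key} .= "$line\n";
  $added++;
 }
 close $fh;
 $index->{$OFFSET_KEY} = $pos; # remember where to resume
 return $added;
}

for my $filename (@ARGV)
{
 require DB_File; # loaded here so the sub above also works on a plain hash
 tie my %ecl, 'DB_File', "$filename.db"
  or die "Can't tie $filename.db: $!";
 my $added = index_new_lines($filename, \%ecl);
 untie %ecl;
 print STDERR "$0: $filename: indexed $added new line(s)\n";
}
```

This does not notice a file that has been truncated or replaced; in that case remove the db file and reindex from scratch, as above.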

[mk_ecl_index]
#!/usr/bin/perl
use strict;
use warnings;
use DB_File;

for my $filename (@ARGV)
{
 my %ecl;
 tie %ecl, 'DB_File', "$filename.db"
  or die "Can't tie $filename.db: $!";
 open my $ascii, '<', $filename
  or die "Can't open $filename: $!";
 while (<$ascii>)
 {
  chomp;
  # the index key is the first two fields, delimiter included: 'f1|f2'
  next unless m{^ ([^\|]*
                     \|
                   [^\|]*)
                     \|}x; # first two fields
  # can't store refs in basic DB_File
  # but data guaranteed not to contain \n, so... :-/
  $ecl{$1} = ''
   unless defined $ecl{$1};
  $ecl{$1} .= $_ . "\n";
 }
 close $ascii;
 untie %ecl;
}

[grep_ecl]
#!/usr/bin/perl
# n.b. args opposite of Unix grep; filename, query, q2...
# each query must match an index key exactly (first two fields, 'f1|f2')
use strict;
use warnings;
use DB_File;

my $filename = shift
 or die "usage: $0 filename query [query ...]\n";
my %index;
tie %index, 'DB_File', "$filename.db"
 or die "Can't tie to $filename.db: $!";
for my $query (@ARGV)
{
  if ( exists $index{$query} )
  {
    print $index{$query}; # newlines already provided
  } else {
    print STDERR "$0: $filename: $query not found\n";
  }
}
untie %index;

File locking is left as an exercise for the reader; if you index in a cron job or logrotate script, you'll likely need it.
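For anyone who would rather not do the exercise, one possible approach is to flock() a separate lock file rather than the db file itself (the DB_File documentation warns against locking the database's own file descriptor). This is only a sketch: the helper name with_db_lock and the "$db.lock" sidecar file are assumptions of mine, and flock is advisory, so it only helps if both the indexer and the searcher use it.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Fcntl qw(:flock);

# Run $code while holding a lock on the hypothetical sidecar file "$db.lock".
# $mode is LOCK_EX for the indexer and LOCK_SH for readers.
# Returns whatever $code returns (in scalar context).
sub with_db_lock
{
 my ($db, $mode, $code) = @_;
 open my $lock, '>>', "$db.lock"
  or die "Can't open $db.lock: $!";
 flock $lock, $mode
  or die "Can't lock $db.lock: $!";
 my $result = $code->(); # tie, work and untie all happen in here
 close $lock;            # closing the handle releases the lock
 return $result;
}

# Indexer side, roughly:
#  with_db_lock("$filename.db", LOCK_EX, sub { ...tie, index, untie... });
# Search side, roughly:
#  with_db_lock("$filename.db", LOCK_SH, sub { ...tie, look up, untie... });
```

The tie and untie both go inside the callback so that nothing is read from or flushed to the db file outside the lock.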

Hope that helps ;-)

In reply to Re: Re: Grep Speeds by baku
in thread Grep Speeds by ImpalaSS
