Re: Grep Speeds

Replies are listed 'Best First'.

Re: Re: Grep Speeds
by baku (Scribe) on Feb 06, 2001 at 21:22 UTC

<assumptions>
 2 files x 65_000 lines x 60 fields x 8 chars = ~60MB data
 |-delimited ASCII
 search query is field 1
 data contains no 'escaped' |'s (e.g. \| or "xx|xx")
</assumptions>

Given the relatively 'small' (meaning under 128MB) database size, the questions would be 'how often do you need to search it?' and 'how often does the data change?'

As merlyn pointed out, you could index this file based upon whatever field(s) you are searching by; in this example, the ECL.

N.B. it is simplest if (as one might assume) the data is only ever appended to and never changed, because then only the new data needs to be scanned into the DB index. This example assumes that you get new files, e.g. by rotating out files before indexing. If that is not the case (i.e. the files are appended to in place), it would be more effective to record the length of the file at the time it was indexed and then use seek() to begin indexing from that offset. (To 'reset' this index, just remove the db file.)
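
For the append-in-place case, the seek() idea might look roughly like the sketch below. The helper name read_new_lines and the "$path.offset" sidecar file are assumptions made for illustration; they are not part of the scripts in this post.

```perl
use strict;
use warnings;
use Fcntl qw(:seek);

# Hypothetical helper: return only the lines appended to $path since the
# previous call. The byte offset reached is remembered in "$path.offset"
# (an assumed naming convention).
sub read_new_lines {
    my ($path) = @_;
    my $offset_file = "$path.offset";

    # Load the saved offset; start from the beginning if there is none.
    my $offset = 0;
    if (open my $off_in, '<', $offset_file) {
        my $saved = <$off_in>;
        close $off_in;
        $offset = $1 if defined $saved && $saved =~ /^(\d+)/;
    }

    open my $in, '<', $path or die "Can't open $path: $!";

    # If the file is now shorter than the saved offset, it was truncated
    # or replaced, so start over from the top.
    my $size = (stat $in)[7];
    $offset = 0 if $offset > $size;

    seek $in, $offset, SEEK_SET or die "Can't seek in $path: $!";

    my @lines;
    while (my $line = <$in>) {
        # Stop at an incomplete final line; a writer may be mid-append.
        last unless $line =~ /\n\z/;
        chomp $line;
        push @lines, $line;
        $offset = tell $in;    # offset just past the last complete line
    }
    close $in;

    # Remember how far we got for the next run.
    open my $off_out, '>', $offset_file
        or die "Can't write $offset_file: $!";
    print {$off_out} "$offset\n";
    close $off_out or die "Can't close $offset_file: $!";

    return @lines;
}
```

Each returned line could then go through the same key extraction and append step as the indexing loop. Removing the .offset file along with the db file would reset both.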

[mk_ecl_index]
#!/usr/bin/perl
use strict;
use warnings;
use DB_File;

for my $filename (@ARGV)
{
 # tie the hash to an on-disk db file (created if it doesn't exist yet)
 my %ecl;
 tie %ecl, 'DB_File', "$filename.db"
  or die "Can't tie $filename.db: $!";
 open my $ascii, '<', $filename
  or die "Can't open $filename: $!";
 while (<$ascii>)
 {
  chomp;
  # key is the first two fields, including the | between them
  next unless m{^ ([^\|]*
                     \|
                   [^\|]*)
                     \|}x;
  # can't store refs in basic DB_File
  # but data guaranteed not to contain \n, so... :-/
  $ecl{$1} = ''
   unless defined $ecl{$1};
  $ecl{$1} .= $_ . "\n";
 }
 close $ascii;
 untie %ecl;
}

[grep_ecl]
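#!/usr/bin/perl
# n.b. args opposite of Unix grep; filename, query, q2...
# each query must equal an index key exactly (the first two fields, e.g. 'f1|f2')
use strict;
use warnings;
use Fcntl;   # for O_RDONLY
use DB_File;

my $filename = shift
 or die "usage: $0 filename query [query...]\n";
my %index;
# open read-only so a lookup never creates or modifies the db file
tie %index, 'DB_File', "$filename.db", O_RDONLY, 0644, $DB_HASH
 or die "Can't tie to $filename.db: $!";
for my $query (@ARGV)
{
  if ( exists $index{$query} )
  {
    print $index{$query}; # newlines already provided
  } else {
    print STDERR "$0: $filename: $query not found\n";
  }
}
untie %index;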
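
Since a basic DB_File value is a plain string, both scripts treat each value as every matching line concatenated, with each line terminated by "\n". Here is a small sketch of that storage scheme, in case you want the records back as a list rather than printed in one go. The helper names key_of, add_record and records_for are made up for illustration; they take a hash reference, so they should work on a tied hash as well as a plain one.

```perl
use strict;
use warnings;

# Hypothetical helper: extract the index key (the first two
# |-delimited fields) from a line, or undef if the line has no
# third field after them.
sub key_of {
    my ($line) = @_;
    return $line =~ m{^([^|]*\|[^|]*)\|} ? $1 : undef;
}

# Hypothetical helper: append one line to the newline-joined value
# stored under its key.
sub add_record {
    my ($index, $line) = @_;
    my $key = key_of($line);
    return unless defined $key;
    $index->{$key} = '' unless defined $index->{$key};
    $index->{$key} .= "$line\n";
    return $key;
}

# Hypothetical helper: split a stored value back into its lines.
sub records_for {
    my ($index, $key) = @_;
    return () unless exists $index->{$key};
    return split /\n/, $index->{$key};
}
```

This only works because of the assumption that the data never contains an embedded newline.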

File locking is left as an exercise for the reader; if you index in a cron job or logrotate script, you'll likely need it.
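
As a starting point for that exercise, here is one possible sketch using flock() on a separate lock file next to the db, rather than on the db file itself. The helper name with_lock and the "$db_path.lock" naming are assumptions made for illustration, and flock is advisory only, so both the indexer and the lookup script would have to use it.

```perl
use strict;
use warnings;
use Fcntl qw(:flock);

# Hypothetical helper: run $code while holding a lock on "$db_path.lock".
# $mode is LOCK_EX for the indexer (writer) or LOCK_SH for lookups.
sub with_lock {
    my ($db_path, $mode, $code) = @_;
    my $lock_path = "$db_path.lock";

    # Open in append mode so the lock file is created if missing
    # and never truncated.
    open my $lock, '>>', $lock_path
        or die "Can't open $lock_path: $!";
    flock $lock, $mode
        or die "Can't lock $lock_path: $!";

    # Run the caller's code, remembering any exception so the lock
    # is released before it is re-thrown.
    my $ok  = eval { $code->(); 1 };
    my $err = $@;

    close $lock;    # closing the handle releases the lock
    die $err unless $ok;
    return;
}
```

The indexer would wrap its tie/scan/untie in with_lock("$filename.db", LOCK_EX, sub { ... }), and the lookup script would do the same with LOCK_SH, so several lookups can run at once but never during a rebuild.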

Hope that helps ;-)
