Re: I need speed
by grinder (Bishop) on Jun 21, 2002 at 18:07 UTC
You don't say what platform you are running on. The following hold for Unix platforms.
It may seem counterintuitive, but I think you'll get a fair bang for your buck by opening a read pipe from the locate(1) program. It's designed to be efficient at answering exactly these kinds of queries.
On the other hand, if you search for a given string "abc", locate will print a file even when "abc" appears only in the directory part of the path, not in the filename itself. But you can filter out those false hits by taking the basename of each returned path and checking it against your target string as you read it in.
# remember to untaint $str
use File::Basename;

open LOC, "/usr/bin/locate $str |" or die "Cannot open input pipe: $!\n";
while( <LOC> ) {
    chomp;
    my $file = basename $_;
    next unless index($file, $str) > -1;
    # output this filename
}
close LOC;
I'm not convinced a DBM hash will speed things up: at the end of the day you're still doing a linear scan of 200K records (assuming that, since you're matching with index, partial substrings of names are allowed).
Another idea would be to simplify the index file's contents to reduce its size. Just store the name of each file. If and when a person hits it in a search, stat the file at that point in time (the results will be fresher to boot). This lets you store more files per disk block, and thus read fewer disk blocks to scan all of them.
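A minimal sketch of that idea, assuming a hypothetical slimmed-down index of one full path per line, with $str as the search string:
use File::Basename;

open IDX, "index.txt" or die "Cannot open index: $!\n";
while (<IDX>) {
    chomp;
    next unless index(basename($_), $str) > -1;
    my ($size, $mtime) = (stat)[7, 9];   # fetch metadata only for actual hits
    print "$_\t$size\t", scalar localtime($mtime), "\n";
}
close IDX;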
Depending on the ratio of CPU to disk I/O performance (which only testing can demonstrate), it may be faster to compress the file and decompress it on the fly as you read it, on the assumption that the extra CPU cost is offset by reading fewer disk blocks.
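Something along these lines, assuming a gzip(1) binary on the path and a hypothetical index.txt.gz:
# decompress on the fly: gzip burns some CPU, the disk reads fewer blocks
open IDX, "gzip -dc index.txt.gz |" or die "Cannot open pipe: $!\n";
while (<IDX>) {
    # ... same matching logic as before ...
}
close IDX;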
print@_{sort keys %_},$/if%_=split//,'= & *a?b:e\f/h^h!j+n,o@o;r$s-t%t#u'
Re: I need speed
by Aristotle (Chancellor) on Jun 21, 2002 at 19:34 UTC
Here are some tips if you don't want to use external tools. The key is to be lazy and defer as much work as possible. One simple thing you can do is:
my ($filename, $rest) = split /,/, $_, 2;
if (index($filename, $to_find) > -1) {
    my ($cms, $path, $size, $day, $time) = split /,/, $rest;
    # ... process the match ...
}
Going quite a lot further:
open(F, "+< $infile");
while (sysread F, $_, 32768) {
    $_ .= <F>;
    next unless /\Q$to_find\E/; # quickskip
    for (grep /^[^,]*?\Q$to_find\E/, split /\n/, $_) {
        ($filename, $cms, $path, $size, $day, $time) = split /,/;
        $href = "file:\\\\netd\\data" . $path . "\\$filename";
        $href =~ s/\s/%20/g;
        $table .= "<TR><TD><A HREF=\"$href\">$path\\$filename</A><TD>$size<TD>$day $time</TR>";
    }
}
close(F);
This greatly reduces the number of IO operations and restricts the heavy splittage to known matches.
I believe the following is an extra win, but I haven't benchmarked it.
open(F, "+< $infile");
while (sysread F, $_, 32768) {
    $_ .= <F>;
    next unless /\Q$to_find\E/; # quickskip
    while (/^([^\n,]*?\Q$to_find\E[^\n]*)/gm) {
        ($filename, $cms, $path, $size, $day, $time) = split /,/, $1;
        $href = "file:\\\\netd\\data" . $path . "\\$filename";
        $href =~ s/\s/%20/g;
        $table .= "<TR><TD><A HREF=\"$href\">$path\\$filename</A><TD>$size<TD>$day $time</TR>";
    }
}
close(F);
Beyond these tricks, you quickly get to the point of pretty much hand-rolling your own database engine.
Makeshifts last the longest.
Thanks for your reply - I had thought about limiting the split, but I didn't believe it would enhance performance much. It does appear to help though. The simple reduced split you noted seems to improve performance by about 20%.
However, the other snippets you provided don't work. I'm trying to figure out why. They return no matches for files I know exist.
Re: I need speed
by Abigail-II (Bishop) on Jun 21, 2002 at 18:04 UTC
DBM files are good for exact matches, but you aren't doing an exact match; you are doing a substring match. To do that efficiently (that is, significantly faster than matching all files) you either need a complicated data structure (which you will have to make persistent somehow), or lots and lots of (disk) memory, or some combination of both. It would also mean that updating the information is going to take longer.
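For illustration only (this is not Abigail's design), one such structure is an inverted index mapping every three-character window of a filename to the files containing it. Lookups for queries of three or more characters become fast, and the large size and slower updates follow directly:
my %idx;    # trigram => { filename => 1 }
for my $name (@filenames) {
    my $lc = lc $name;
    $idx{ substr($lc, $_, 3) }{$name} = 1 for 0 .. length($lc) - 3;
}

# fetch the candidates for the query's first trigram, then verify properly
my @hits = grep { index(lc $_, lc $query) > -1 }
           keys %{ $idx{ lc substr($query, 0, 3) } || {} };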
Abigail
Re: I need speed
by twerq (Deacon) on Jun 21, 2002 at 17:51 UTC
Using DB_File would definitely be faster for index lookups, provided you use the btree interface.
By using the filename as the key and the path as the value, you can quickly search for the location of your file.
One problem you may run into, though, is that many files (under different paths) share identical names. See the section "Handling Duplicate Keys" in man DB_File.
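A minimal sketch of that approach, with hypothetical file and key names; R_DUP and get_dup come from the DB_File documentation:
use Fcntl;
use DB_File;

# allow duplicate keys, so identical filenames under different paths coexist
$DB_BTREE->{'flags'} = R_DUP;

tie my %loc, 'DB_File', 'files.db', O_RDWR|O_CREAT, 0644, $DB_BTREE
    or die "Cannot tie files.db: $!\n";

$loc{'report.doc'} = '\\netd\data\2002\june';    # filename => path

# retrieve every path stored under a duplicated filename
my @paths = tied(%loc)->get_dup('report.doc');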
--twerq
Re: I need speed
by perrin (Chancellor) on Jun 21, 2002 at 18:47 UTC
Have you ever tried the command "locate"? It does this very, very well, and you can easily make your CGI script run it. I believe it is even available on Windows as part of the Cygwin package.
Re: I need speed
by RMGir (Prior) on Jun 21, 2002 at 17:58 UTC
If you already have a database server set up somewhere in your organisation, and it's not too heavily loaded, a database table and a bit of DBI code would probably help a lot.
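A hedged sketch of what the lookup could look like, assuming a hypothetical files table and connection details:
use DBI;

my $dbh = DBI->connect("dbi:mysql:fileindex", $db_user, $db_pass,
                       { RaiseError => 1 });

my $sth = $dbh->prepare(
    "SELECT filename, path, size FROM files WHERE filename LIKE ?");
$sth->execute("%$to_find%");    # the database does the substring matching

while (my ($filename, $path, $size) = $sth->fetchrow_array) {
    print "$path\\$filename ($size bytes)\n";
}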
--
Mike
Re: I need speed
by Galen (Beadle) on Jun 21, 2002 at 18:49 UTC
To answer some questions: this is being done on a *cough* Win2k platform (no locate). I am a lowly knuckle-dragging grunt, and access to a database server is not an option. This organization regulates that sort of thing tightly.
The bottleneck here seems to be processing speed, as opposed to memory or disk space. I suppose that could change if the index file grows to over 100M. It might increase efficiency somewhat to search a file containing only filenames, and then to retrieve the path info from a DBM hash using the filename as a key (with a little workaround for duplicate filenames in different directories). However, I think my initial idea of breaking the index file into separate, smaller indexes and choosing which to look through based on the search string may produce a greater increase in speed. If I didn't want to search for substrings in the filename this would be easy. The problem is that people don't always know the exact filename they are looking for, and it is useful to be able to display multiple matches (all files of a given extension, for example).
I am a lowly knuckle-dragging grunt and access to a database server is not an option. This organization regulates that sort of thing tightly.
Would anyone object to you installing DBI with DBD::SQLite? It gives you a database engine, but a really tiny, embedded one, and a database is conveniently stored as a single file. In short, it means no administrative hassle and probably no need to ask anyone about it. SQLite implements the LIKE operator and wildcards in queries, so it should help you with your problem.
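A minimal sketch, with a hypothetical schema, just to show the shape of it:
use DBI;

# the whole database is one ordinary file; no server process involved
my $dbh = DBI->connect("dbi:SQLite:dbname=fileindex.db", "", "",
                       { RaiseError => 1 });

# one-time setup
$dbh->do("CREATE TABLE files (filename TEXT, path TEXT, size INTEGER)");

# LIKE with wildcards gives the substring search
my $sth = $dbh->prepare(
    "SELECT path, filename FROM files WHERE filename LIKE ?");
$sth->execute('%' . $to_find . '%');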
Re: I need speed
by robobunny (Friar) on Jun 21, 2002 at 17:51 UTC
a DBM hash wouldn't do you any good, because you need to process the entire file anyway. the benefit of a DBM is that it speeds up accessing specific elements of the file. you are sequentially reading the whole thing.
Re: I need speed
by Galen (Beadle) on Jun 21, 2002 at 19:20 UTC
Cygwin is great! Thank you so much. It's almost as good as having a unix box in my cube again... which this place won't allow.
Re: I need speed
by Galen (Beadle) on Jun 21, 2002 at 19:46 UTC
I don't know a lot about relational databases, but I'm a quick study. I installed MySQL once and played with it for a while - long enough to know I would love to develop some perl-SQL skills. I just didn't quite know where to go with the daemon and eventually uninstalled it because it was wasting resources. However, if I were to get started with something like this file search utility, I can foresee many other potential applications. Do you recommend SQLite over MySQL?
Re: I need speed
by Galen (Beadle) on Jun 21, 2002 at 18:56 UTC
Cygwin, huh? I shall try it. Thanks!
Re: I need speed
by Galen (Beadle) on Jun 24, 2002 at 15:42 UTC
OK - the regex in your code works fine, Aristotle, but there is a mistake elsewhere. Here is how I changed it to get it working:
sysopen(F, $infile, 0);    # 0 is O_RDONLY
while (sysread F, $_, 32768) {
    next unless /\Q$to_find\E/;
    # note: the $_ .= <F> line is gone; mixing buffered <F> reads
    # with sysread was apparently what broke the matching
    while (/^([^\007,]*?\Q$to_find\E[^\007]*)/gm) {
        ($filename, $cms, $path, $size, $day, $time) = split /,/, $1;
Re: I need speed
by Galen (Beadle) on Jun 21, 2002 at 20:34 UTC
Accidentally double posted.