Re: I need speed
by grinder (Bishop) on Jun 21, 2002 at 18:07 UTC
You don't say what platform you are running on. The following hold for Unix platforms.
It may seem counterintuitive, but I think you'll get a fair bang for your buck by opening a read pipe from the locate(1) program. It's designed to be efficient at answering exactly these kinds of queries.
On the other hand, if you search for a given string "abc", locate will print a file even when "abc" appears only in the directory part of the path, not in the filename itself. But you can filter out those false hits by taking the basename of each returned path and checking it against your target string as you read it in.
# remember to untaint $str
use File::Basename;

open LOC, "/usr/bin/locate $str |" or die "Cannot open input pipe: $!\n";
while( <LOC> ) {
    chomp;
    my $file = basename $_;
    next unless index($file, $str) > -1;
    # output this filename
}
close LOC;
I'm not convinced a DBM hash will speed things up: at the end of the day you're still doing a linear scan of 200K records (assuming that, since you're matching with index, partial substrings of names are allowed).
Another idea would be to simplify the index file's contents to reduce its size. Just store the name of each file. If and when a person hits it in a search, stat the file at that point in time (the results will be fresher to boot). This lets you store more files per disk block, and thus read fewer disk blocks to scan all of them.
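A minimal sketch of that idea, assuming a hypothetical slimmed-down index of one full path per line, with $str as the search string:
use File::Basename;

open IDX, "index.txt" or die "Cannot open index: $!\n";
while (<IDX>) {
    chomp;
    next unless index(basename($_), $str) > -1;
    my ($size, $mtime) = (stat)[7, 9];   # fetch metadata only for actual hits
    print "$_\t$size\t", scalar localtime($mtime), "\n";
}
close IDX;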
Depending on the ratio of CPU to disk I/O performance (which only testing can demonstrate), it may be faster to compress the file and decompress it on the fly as you read it, on the assumption that the extra CPU cost is offset by reading fewer disk blocks.
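Something along these lines, assuming a gzip(1) binary on the path and a hypothetical index.txt.gz:
# decompress on the fly: gzip burns some CPU, the disk reads fewer blocks
open IDX, "gzip -dc index.txt.gz |" or die "Cannot open pipe: $!\n";
while (<IDX>) {
    # ... same matching logic as before ...
}
close IDX;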
print@_{sort keys %_},$/if%_=split//,'= & *a?b:e\f/h^h!j+n,o@o;r$s-t%t#u'
Re: I need speed
by Aristotle (Chancellor) on Jun 21, 2002 at 19:34 UTC
Here are some tips if you don't want to use external tools. The key is to be lazy and defer as much work as possible. One simple thing you can do is:
my ($filename, $rest) = split /,/, $_, 2;
if (index($filename, $to_find) > -1) {
    my ($cms, $path, $size, $day, $time) = split /,/, $rest;
    # ... process the match ...
}
Going quite a lot further:
open(F, "+< $infile");
while (sysread F, $_, 32768) {
    $_ .= <F>;
    next unless /\Q$to_find\E/; # quickskip
    for (grep /^[^,]*?\Q$to_find\E/, split /\n/, $_) {
        ($filename, $cms, $path, $size, $day, $time) = split /,/;
        $href = "file:\\\\netd\\data" . $path . "\\$filename";
        $href =~ s/\s/%20/g;
        $table .= "<TR><TD><A HREF=\"$href\">$path\\$filename</A><TD>$size<TD>$day $time</TR>";
    }
}
close(F);
This greatly reduces the number of IO operations and restricts the heavy splittage to known matches.
I believe the following is an extra win, but I haven't benchmarked it.
open(F, "+< $infile");
while (sysread F, $_, 32768) {
    $_ .= <F>;
    next unless /\Q$to_find\E/; # quickskip
    while (/^([^\n,]*?\Q$to_find\E[^\n]*)/gm) {
        ($filename, $cms, $path, $size, $day, $time) = split /,/, $1;
        $href = "file:\\\\netd\\data" . $path . "\\$filename";
        $href =~ s/\s/%20/g;
        $table .= "<TR><TD><A HREF=\"$href\">$path\\$filename</A><TD>$size<TD>$day $time</TR>";
    }
}
close(F);
Beyond these tricks, you quickly get to the point of pretty much hand-rolling your own database engine.
Makeshifts last the longest.
Thanks for your reply - I had thought about limiting the split, but I didn't believe it would enhance performance much. It does appear to help though. The simple reduced split you noted seems to improve performance by about 20%.
However, the other snippets you provided don't work. I'm trying to figure out why. They return no matches for files I know exist.
Re: I need speed
by Abigail-II (Bishop) on Jun 21, 2002 at 18:04 UTC
DBM files are good for exact matches, but you aren't doing an exact match; you are doing a substring match. To do that efficiently (that is, significantly faster than matching all files) you either need a complicated data structure (which you will have to make persistent somehow), or lots and lots of (disk) memory, or some combination of both. It would also mean that updating the information is going to take longer.
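For illustration only (this is not Abigail's design), one such structure is an inverted index mapping every three-character window of a filename to the files containing it. Lookups for queries of three or more characters become fast, and the large size and slower updates follow directly:
my %idx;    # trigram => { filename => 1 }
for my $name (@filenames) {
    my $lc = lc $name;
    $idx{ substr($lc, $_, 3) }{$name} = 1 for 0 .. length($lc) - 3;
}

# fetch the candidates for the query's first trigram, then verify properly
my @hits = grep { index(lc $_, lc $query) > -1 }
           keys %{ $idx{ lc substr($query, 0, 3) } || {} };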
Abigail
Re: I need speed
by twerq (Deacon) on Jun 21, 2002 at 17:51 UTC
Using DB_File would definitely be faster for index lookups, provided you use the btree interface.
By using the filename as the key and the path as the value, you can quickly search for the location of your file.
One problem you may run into, though, is that many files (under different paths) share identical names. See the section "Handling Duplicate Keys" in man DB_File.
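A minimal sketch of that approach, with hypothetical file and key names; R_DUP and get_dup come from the DB_File documentation:
use Fcntl;
use DB_File;

# allow duplicate keys, so identical filenames under different paths coexist
$DB_BTREE->{'flags'} = R_DUP;

tie my %loc, 'DB_File', 'files.db', O_RDWR|O_CREAT, 0644, $DB_BTREE
    or die "Cannot tie files.db: $!\n";

$loc{'report.doc'} = '\\netd\data\2002\june';    # filename => path

# retrieve every path stored under a duplicated filename
my @paths = tied(%loc)->get_dup('report.doc');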
--twerq
Re: I need speed
by perrin (Chancellor) on Jun 21, 2002 at 18:47 UTC
Have you ever tried the command "locate"? It does this very, very well, and you can easily make your CGI script run it. I believe it is even available on Windows as part of the Cygwin package.
Re: I need speed
by RMGir (Prior) on Jun 21, 2002 at 17:58 UTC
If you already have a database server set up somewhere in your organisation, and it's not too heavily loaded, a database table and a bit of DBI code would probably help a lot.
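A hedged sketch of what the lookup could look like, assuming a hypothetical files table and connection details:
use DBI;

my $dbh = DBI->connect("dbi:mysql:fileindex", $db_user, $db_pass,
                       { RaiseError => 1 });

my $sth = $dbh->prepare(
    "SELECT filename, path, size FROM files WHERE filename LIKE ?");
$sth->execute("%$to_find%");    # the database does the substring matching

while (my ($filename, $path, $size) = $sth->fetchrow_array) {
    print "$path\\$filename ($size bytes)\n";
}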
--
Mike
Re: I need speed
by Galen (Beadle) on Jun 21, 2002 at 18:49 UTC
To answer some questions: this is being done on a *cough* Win2k platform (no locate). I am a lowly knuckle-dragging grunt, and access to a database server is not an option. This organization regulates that sort of thing tightly.
The bottleneck here seems to be processing speed, as opposed to memory or disk space. I suppose that could change if the index file grows to over 100M. It might increase efficiency somewhat to search a file containing only filenames, and then to retrieve the path info from a DBM hash using the filename as a key (with a little workaround for duplicate filenames in different directories). However, I think my initial idea of breaking the index file into separate, smaller indexes and choosing which to look through based on the search string may produce a greater increase in speed. If I didn't want to search for substrings in the filename this would be easy. The problem is that people don't always know the exact filename they are looking for, and it is useful to be able to display multiple matches (all files of a given extension, for example).
I am a lowly knuckle-dragging grunt and access to a database server is not an option. This organization regulates that sort of thing tightly.
Would anyone object to you installing DBI with DBD::SQLite? It gives you a database engine, but a really tiny, embedded one, and a database is conveniently stored as a single file. In short, it means no administrative hassle and probably no need to ask anyone about it. SQLite implements the LIKE operator and wildcards in queries, so it should help you with your problem.
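A minimal sketch, with a hypothetical schema, just to show the shape of it:
use DBI;

# the whole database is one ordinary file; no server process involved
my $dbh = DBI->connect("dbi:SQLite:dbname=fileindex.db", "", "",
                       { RaiseError => 1 });

# one-time setup
$dbh->do("CREATE TABLE files (filename TEXT, path TEXT, size INTEGER)");

# LIKE with wildcards gives the substring search
my $sth = $dbh->prepare(
    "SELECT path, filename FROM files WHERE filename LIKE ?");
$sth->execute('%' . $to_find . '%');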
Re: I need speed
by robobunny (Friar) on Jun 21, 2002 at 17:51 UTC
a DBM hash wouldn't do you any good, because you need to process the entire file anyway. the benefit of a DBM is that it speeds up accessing specific elements of the file. you are sequentially reading the whole thing.
Re: I need speed
by Galen (Beadle) on Jun 21, 2002 at 19:20 UTC
Cygwin is great! Thank you so much. It's almost as good as having a unix box in my cube again... which this place won't allow.
Re: I need speed
by Galen (Beadle) on Jun 21, 2002 at 19:46 UTC
I don't know a lot about relational databases, but I'm a quick study. I installed MySQL once and played with it for a while - long enough to know I would love to develop some perl-SQL skills. I just didn't quite know where to go with the daemon and eventually uninstalled it because it was wasting resources. However, if I were to get started with something like this file search utility, I can foresee many other potential applications. Do you recommend SQLite over MySQL?
Re: I need speed
by Galen (Beadle) on Jun 21, 2002 at 18:56 UTC
Cygwin, huh? I shall try it. Thanks!
Re: I need speed
by Galen (Beadle) on Jun 24, 2002 at 15:42 UTC
OK - the regex in your code works fine, Aristotle, but there is a mistake elsewhere. Here is how I changed it to get it working:
sysopen(F, $infile, 0);    # 0 is O_RDONLY
while (sysread F, $_, 32768) {
    next unless /\Q$to_find\E/;
    # note: the $_ .= <F> line is gone; mixing buffered <F> reads
    # with sysread was apparently what broke the matching
    while (/^([^\007,]*?\Q$to_find\E[^\007]*)/gm) {
        ($filename, $cms, $path, $size, $day, $time) = split /,/, $1;
Re: I need speed
by Galen (Beadle) on Jun 21, 2002 at 20:34 UTC
Accidentally double posted.