the_slycer has asked for the wisdom of the Perl Monks concerning the following question:

We have an "application" (which I have no control over at all) that essentially dumps up to 150 word docs a day into an FTP server.

Part of the doc contains a "status" field which can be a combination of a couple of numbers.

Our support desk has always changed the numbers in this field to move a doc into a new status. They did this with an FTP client, renaming the doc in place; or, if there were multiple directories to search, they would typically hit the "root" share, search the directories for the doc, and make the change there. However, the application guys have asked that the changes be logged, or they will take this capability away. No problem: they came to me and asked me to write a web-based form to make the changes to the documents, which I have done.

The problem lies in the searching. Via an FTP client (pointed at a specific folder) it was very quick. Via the share method (multiple dirs) it was slower, but still reasonably fast. Now they are stuck with a pre-existing web-based search that is extremely slow.

Feeling sorry for these guys (I was in that area), I took it upon myself to write a faster search than the one they were using (the original was not a perlish solution). The problem is, I cannot get this thing any faster: 44 seconds on average to search the entire tree, obviously less for smaller sets of folders.

So, after all that rambling, my question to you is HOW? I've thought about starting an index via fork and Storable as soon as the client loads the search form, but
a) I'm not sure whether fork plays nice with web pages, and
b) the full index still takes about 45 seconds to build, so I can't get the Storable data written out before the client hits submit.

I've also thought about running a separate process on the web server that reindexes once every five minutes or so (a rough sketch of such an indexer follows below), but
a) this will put unnecessary load on the server, and
b) it will not be realtime.
Five minutes is not much of a gap in practice, and from experience I think it would work out OK (the end user normally doesn't ask for a status change until several minutes after the doc is created), but there is still the outside possibility that this will not suffice.
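For concreteness, here is a minimal sketch of what I mean by that indexer, using File::Find and Storable (the paths are only placeholders for wherever the docs and index actually live):

    use strict;
    use File::Find;
    use Storable qw(nstore);

    my $root  = 'D:/docs';           # placeholder document root
    my $index = 'D:/doc-index.sto';  # placeholder index file

    # Walk the whole tree once and record every file's full path.
    my %files;
    find( sub { $files{$File::Find::name} = 1 if -f }, $root );

    # Freeze the hash to disk; the search CGI then just does
    # retrieve($index) and greps the keys instead of walking the tree.
    nstore( \%files, $index );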

Any more ideas?
If it matters, the server that my script is running on is IIS 5.0 on NT 4.

Re: Faster searching
by tachyon (Chancellor) on Jul 09, 2001 at 20:35 UTC

    It seems to me your answer is almost already to hand. You say the purpose of your script is to log the changes; in other words, the only 'allowed' way to make these changes should be through it, since otherwise they cannot be logged.

    That being the case, why not create a script to index everything, and then use your logger to update as well as search that index? Eventually someone will make a change without going through your logform interface (Murphy's Law), but this should let you get away with a full rebuild as rarely as once a day, say via a cron job during the server's quietest hours. You could also include a checkbox option on your logform interface like:

    "I can't find the damn status I want, I think the index must be out of sync so do a full reindex now please - yes I know it will take a while!"

    As a result the index is updated in real time and regularly resynchronised.
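    The update half might look something like this, assuming the index is a Storable-frozen hash of path => 1 as in the indexing sketches elsewhere in this thread (the index location and sub name are only illustrative):

        use strict;
        use Storable qw(retrieve nstore);

        my $index = 'D:/doc-index.sto';   # placeholder index file

        # Called by the logging form right after it renames a doc,
        # so the frozen index stays in step with the change it logged.
        sub update_index {
            my ( $old, $new ) = @_;
            my $files = retrieve($index);
            delete $files->{$old};
            $files->{$new} = 1;
            nstore( $files, $index );
        }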

    cheers

    tachyon

    s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

Re: Faster searching
by lhoward (Vicar) on Jul 09, 2001 at 20:37 UTC
    How about building your index from the FTP server's log (assuming it has one)? Have a process that tails the FTP server's log and updates the index whenever something is uploaded. I did something similar with wuftpd a few years back and it worked brilliantly. You could even sidestep this whole problem by generating the "log of changes" for the application guys from the FTP server's log.
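    Something along these lines, using File::Tail from CPAN; the index layout matches the Storable approach above, and the log path and line format are assumptions you'd adjust for your FTP server:

        use strict;
        use File::Tail;
        use Storable qw(retrieve nstore);

        my $index = 'D:/doc-index.sto';                     # placeholder index
        my $log   = File::Tail->new( name => 'xferlog' );   # placeholder log

        # Block until the FTP server writes another log line, then
        # add any newly uploaded file to the frozen index.
        while ( defined( my $line = $log->read ) ) {
            my ($path) = $line =~ /STOR\s+(\S+)/    # assumed upload entry format
                or next;
            my $files = retrieve($index);
            $files->{$path} = 1;
            nstore( $files, $index );
        }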
Re: Faster searching
by earthboundmisfit (Chaplain) on Jul 09, 2001 at 20:36 UTC
    I assume you are searching for filenames and not the content of the files?

    Have you tried File::Find?

    use strict;
    use File::Find;

    our @filelist = ();
    our $fdir     = "/usr/blah";

    find( \&wanted, $fdir );
    die "cannot get file list: $!" unless @filelist > 0;

    # Assuming the directory exists, @filelist now holds all files
    # under $fdir, including subdirs.

    sub wanted {
        my $thisfile = $File::Find::name;
        push @filelist, $thisfile;
    }
    See the docs for how to search using more complex criteria.
Re: Faster searching
by traveler (Parson) on Jul 09, 2001 at 21:37 UTC
    This does not address the Perl issue of faster searching, but if the "word docs" are MS Word docs, why not turn on "Revision Tracking" on the PCs the support guys use? That should then show not only the changes but, IIRC, who made them.
Re: Faster searching
by Anonymous Monk on Jul 09, 2001 at 23:12 UTC
    The idea is great (assuming I understand correctly):
    You deliver the form not as static HTML but from a CGI. Before(!) delivering the form, the CGI forks and the child indexes the tree. Then the form is filled in (I assume that takes more than 45 seconds), and by submit time the index is ready. Instant gratification.
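    A minimal sketch of that flow with CGI.pm (the doc root and index path are placeholders; one caveat for the original poster: on Win32 ports of Perl, fork is typically emulated with threads, so it's worth testing under IIS before relying on it):

        use strict;
        use CGI;
        use File::Find;
        use Storable qw(nstore);

        my $q = CGI->new;

        defined( my $pid = fork ) or die "fork failed: $!";

        if ($pid) {
            # Parent: hand the search form back to the browser right away.
            print $q->header,
                  $q->start_html('Doc search'),
                  $q->start_form,
                  $q->textfield('query'),
                  $q->submit('Search'),
                  $q->end_form,
                  $q->end_html;
        }
        else {
            # Child: release the client connection, then build the index
            # so it is ready to retrieve() by the time the form comes back.
            close STDOUT;
            my %files;
            find( sub { $files{$File::Find::name} = 1 if -f }, 'D:/docs' );
            nstore( \%files, 'D:/doc-index.sto' );   # placeholder paths
            exit;
        }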