I've been pondering changing an application that I created a year or so ago to make it somewhat more robust.

The application is a tool that searches through a whole bunch (say 700 or so) of HTML files. It prints a listing of the matching files (with a link to each), sorted by how many times the keyword is matched.

So far so good, and this sounds easy, right? The problem lies in the fact that the documents are constantly updated: say 6 or 7 files get changed every day, by multiple users. The searches need to be as "real-time" as possible.

The way that I've solved this in the past was by building 2 applications. The first checks the files for updates (every 5 minutes), parses them, and stores a hash mapping keywords to filenames in a Storable file.
The second app is just a CGI interface that loads the stored file and finds the "answers" to the search blazingly fast.
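
To make that concrete, here is a minimal sketch of the idea (not the actual code from my app). The sub names `build_index` and `search_index`, the sample documents, and the regex-based tag stripping are all just assumptions for illustration; a real HTML parser such as HTML::Parser would be more robust than the regex.

```perl
use strict;
use warnings;
use Storable qw(nstore retrieve);
use File::Temp qw(tempdir);

# Build { keyword => { filename => match count } } from { filename => html }.
# Hypothetical helper: tags are stripped with a crude regex, sketch only.
sub build_index {
    my ($docs) = @_;
    my %index;
    for my $file (keys %$docs) {
        (my $text = $docs->{$file}) =~ s/<[^>]*>/ /g;
        # Count every word, lowercased so searches are case-insensitive.
        $index{ lc $1 }{$file}++ while $text =~ /(\w+)/g;
    }
    return \%index;
}

# Return the files containing $keyword, most matches first
# (ties broken alphabetically so the order is stable).
sub search_index {
    my ($index, $keyword) = @_;
    my $hits   = $index->{ lc $keyword } || {};
    my @ranked = sort { $hits->{$b} <=> $hits->{$a} || $a cmp $b } keys %$hits;
    return @ranked;
}

# "Indexer" side: build the hash and store it on disk.
my %docs = (
    'a.html' => '<p>perl perl cgi</p>',
    'b.html' => '<b>Perl</b> search',
);
my $cache = tempdir(CLEANUP => 1) . '/index.sto';
nstore(build_index(\%docs), $cache);

# "CGI" side: load the stored hash and answer a query.
my @ranked = search_index(retrieve($cache), 'perl');
print "$_\n" for @ranked;
```

The point of the split is that the expensive part (parsing 700 files) happens in the indexer, while a search is just one `retrieve` plus a hash lookup and a sort.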

There are a couple of reasons that I don't like this approach. The main one is that every once in a while, the application checking for updates dies. Sometimes we don't notice, and people end up retrieving out-of-date information.

The second reason that I don't like this is that the tool monitoring the files is run from a command prompt (yes, this is all on Windows), which requires someone to stay logged in on the server.

The third reason that I want to rewrite this is that I finished it when I was much newer to Perl than I am now. There is some really ugly code in it, and I may be moving to a new job (actually just losing this one), so I want to leave my successor readable code.

So, I'm polling for suggestions: given the above scenario, what would you suggest is the best way to accomplish my goals? Those goals, to clarify: pseudo-real-time search, very fast, and stable.

The ideas I've been kicking around:

When a search occurs, check the last update of the cached information; if it was too long ago, kick off a separate process to check for updates and rebuild the cache.
Offer an update button when creating documents.
Rebuild the updater as an app that can run as a Windows service and update the cached information.
Real-time search, possibly using a fork() to search multiple files at once.
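
For the first idea, the staleness check itself is cheap. Here's a rough sketch of what I have in mind; the sub names `cache_is_stale` and `changed_since`, the 300-second limit, and the file names are all made up for illustration, and how the rebuild actually gets launched is left out.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# True if the cache file is missing or older than $max_age seconds.
# $now is optional and defaults to the current time.
sub cache_is_stale {
    my ($cache_file, $max_age, $now) = @_;
    $now = time unless defined $now;
    my $mtime = (stat $cache_file)[9];    # undef if the file does not exist
    return 1 unless defined $mtime;
    return ($now - $mtime) > $max_age ? 1 : 0;
}

# Return the files modified after the cache was last written,
# so only those need to be re-parsed.
sub changed_since {
    my ($cache_file, @files) = @_;
    my $cache_mtime = (stat $cache_file)[9] || 0;
    my @changed = grep { ((stat $_)[9] || 0) > $cache_mtime } @files;
    return @changed;
}

# Demo with throwaway files: the cache is 10 minutes old, one doc is newer.
my $dir = tempdir(CLEANUP => 1);
my ($cache, $old_doc, $new_doc) = map { "$dir/$_" } qw(index.sto old.html new.html);
for my $file ($cache, $old_doc, $new_doc) {
    open my $fh, '>', $file or die "Cannot create $file: $!";
    close $fh;
}
my $now = time;
utime $now - 600,  $now - 600,  $cache;      # cache written 10 minutes ago
utime $now - 3600, $now - 3600, $old_doc;    # untouched since before the cache
utime $now - 60,   $now - 60,   $new_doc;    # edited after the cache was built

if (cache_is_stale($cache, 300, $now)) {
    print "re-index: $_\n" for changed_since($cache, $old_doc, $new_doc);
}
```

The CGI script would answer the current search from the existing cache and only trigger the rebuild in the background, so no single search pays for the re-index. The nice part compared to my current setup is that there is no long-running monitor left to die unnoticed.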

I can see positives and negatives in all of the above. What would you suggest?

In reply to Speed searching HTML docs by the_slycer
