in reply to Search Engine function
In principle, if you built your data structure around a fixed page size, say ten results per node in a tree structure, you could avoid walking the nodes that represent pages before or after the one you want. In practice, though, the user is often allowed to change the number of results per page, and the engine operator may reduce the number of results returned per click to lighten server load.
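For example, here is a rough sketch of the page arithmetic when the page size is not fixed; the result list, page number, and page size are just placeholders:

    use strict;
    use warnings;

    # Hypothetical result set, already in a guaranteed order.
    my @results = map { "result $_" } 1 .. 57;

    # Page size and page number would normally come from user input.
    my $per_page = 10;
    my $page     = 3;

    my $first = ($page - 1) * $per_page;      # index of the first result on this page
    my $last  = $first + $per_page - 1;       # index of the last result on this page
    $last = $#results if $last > $#results;   # clamp to the end of the list

    my @this_page = $first > $#results ? () : @results[$first .. $last];
    print "$_\n" for @this_page;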
Most of the processing power usually goes into building indices over the word database; the code that returns matches to a user just traverses those indices efficiently. Depending on how much data you have to search, you may need to do far more of the work offline, or even use a C or C++ backend. For smaller data stores, though, Perl's own regex engine is quite powerful even without building an index at all. One thing to remember: if you do paging, you must guarantee that the results come back in the same order every time, otherwise results can be lost or duplicated when you calculate the pages. Databases make no promises about row order, so you have to sort explicitly (in SQL, with ORDER BY).
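As a small illustration of searching without an index while keeping the order stable across repeated queries (the document hash and query here are made up):

    use strict;
    use warnings;

    # Hypothetical document store: id => text.
    my %docs = (
        1 => 'Perl makes text search easy',
        2 => 'MySQL indexes read-only data quickly',
        3 => 'regular expressions search text without an index',
    );

    my $query = 'search';
    my $re    = qr/\Q$query\E/i;   # quote metacharacters, match case-insensitively

    # Sort by id so repeated queries return hits in the same order,
    # which keeps paging consistent.
    my @hits = grep { $docs{$_} =~ $re } sort { $a <=> $b } keys %docs;
    print "doc $_: $docs{$_}\n" for @hits;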
One problem to consider if your data is of any significant size is memory and disk performance. Calling sort repeatedly can slow you down, and slurping huge amounts of data off disk without smart use of indices is deadly (less so if you use even a simple Perl DBM). The biggest problem may simply be collecting too many hits internally, before you have chosen which hits to display on the current page (assuming you have not precomputed searches the way Google does). I believe, though I could be wrong, that most medium-sized engines do compile indices but do not actually presearch and cache queries.
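Here is a minimal sketch of the "compile an index offline" idea using a DBM file that ships with Perl; the document hash and index filename are made up, and SDBM limits the size of each entry, so a real engine would likely use something sturdier like DB_File or a database:

    use strict;
    use warnings;
    use Fcntl;
    use SDBM_File;

    # Hypothetical documents to index, keyed by id.
    my %docs = (
        1 => 'Perl makes text search easy',
        2 => 'MySQL indexes read-only data quickly',
    );

    tie my %index, 'SDBM_File', 'word_index', O_RDWR | O_CREAT, 0666
        or die "Cannot open index: $!";

    # Offline step: map each word to a space-separated list of doc ids.
    while (my ($id, $text) = each %docs) {
        for my $word (map { lc } $text =~ /(\w+)/g) {
            my %seen = map { $_ => 1 } split ' ', ($index{$word} // '');
            $seen{$id} = 1;
            $index{$word} = join ' ', sort { $a <=> $b } keys %seen;
        }
    }

    # Query step: look up a word without scanning the documents at all.
    print "docs containing 'search': ", ($index{'search'} // 'none'), "\n";

    untie %index;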
An SQL query can easily fill up your available memory, so if you intend to fetch everything into an array at once you definitely need a LIMIT clause. One way to save memory (and perhaps speed up your search too) when you cannot easily limit the number of results is to fetch only the unique row ids, sort those, and then select just the range you need after sorting. You might be able to write a more involved query that does this (I'm thinking of PL/SQL), or your engine may automatically optimize it and do it in the most efficient way (it seems MySQL might). MySQL looks like one of the fastest ways to index a large amount of read-only data, and if you can watch out for filling up memory it might be a good solution.
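Something along these lines (using DBI, with made-up table, column, and connection details) fetches only the sorted ids first and then pulls full rows for just the current page:

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('DBI:mysql:database=search;host=localhost',
                           'user', 'password', { RaiseError => 1 });

    my $per_page = 10;
    my $page     = 3;

    # First pass: fetch only the matching ids, sorted, which keeps the
    # result set small even when many rows match.
    my $ids = $dbh->selectcol_arrayref(
        'SELECT id FROM documents WHERE body LIKE ? ORDER BY id',
        {}, '%perl%',
    );

    # Take just the slice for the current page.
    my $first = ($page - 1) * $per_page;
    my $last  = $first + $per_page - 1;
    $last = $#$ids if $last > $#$ids;
    my @page_ids = $first > $#$ids ? () : @{$ids}[$first .. $last];

    # Second pass: fetch the full rows only for this page.
    if (@page_ids) {
        my $placeholders = join ',', ('?') x @page_ids;
        my $rows = $dbh->selectall_arrayref(
            "SELECT id, title FROM documents WHERE id IN ($placeholders) ORDER BY id",
            {}, @page_ids,
        );
        print "$_->[0]: $_->[1]\n" for @$rows;
    }

    $dbh->disconnect;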