waswas-fng has asked for the wisdom of the Perl Monks concerning the following question:

I have a homebrew archive-to-DVD app that chunks data into directories and builds DVDs with that data. It also takes all of the fully qualified directory info and stores it in a text file along with the number of the archive DVD it landed on. End users can search this index with a search string such as "this really blows !stupid fun" and get back matches:
It will return "/data2/studio/projects/clientname/print/this really/funermaker/santa blows/thefile.tif" (which exists on Archive-dvd-000002122) but not "/data2/studio/projects/clientname/print/this really/funermaker/santa blows stupid/thefile.tif", because the latter has "stupid" in the fully qualified file name.

That is all fine and dandy; my issue is that I have about 420 million items so far (and it grows weekly), and my search is starting to take a long time to return all of the results. I have optimized the search to use qr// regexes for the words (tested in order of word length) and to short-circuit out of the loop as soon as one of the patterns fails, to save time.
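
To make that concrete, here is a stripped-down, untested sketch of the kind of loop I mean (the index file name is made up):

    use strict;
    use warnings;

    my $query = 'this really blows !stupid fun';

    # Split the query into include words and ! exclude words.
    my (@inc, @exc);
    for my $word (split ' ', $query) {
        if ($word =~ s/^!//) { push @exc, $word }
        else                 { push @inc, $word }
    }

    # Compile qr// patterns, longest include words first -- rarer
    # substrings tend to fail fastest.
    my @inc_re = map { qr/\Q$_\E/i } sort { length($b) <=> length($a) } @inc;
    my @exc_re = map { qr/\Q$_\E/i } @exc;

    open my $idx, '<', 'archive-index.txt' or die "open: $!";
    LINE: while (my $line = <$idx>) {
        for my $re (@inc_re) {
            next LINE unless $line =~ $re;   # short-circuit on first miss
        }
        for my $re (@exc_re) {
            next LINE if $line =~ $re;       # drop lines with a banned word
        }
        print $line;                         # a hit: path plus DVD number
    }
    close $idx;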

I am now jumping back on this to see if there is a "better way" to do it -- currently the search takes about 1 minute to return all of the matches, and my goal is to get it down to less than 10 seconds if possible. I think I could use one of the many btree searches out there, but I suspect the index size would be way too huge for this (it is already very large).

If you have any suggestions, please let me know -- I want to see how the rest of you would tackle this.

-Waswas

Replies are listed 'Best First'.
Re: Complicated searches in a very large text file.
by chromatic (Archbishop) on Jun 30, 2003 at 03:36 UTC

    I'd use a relational database. Recent versions of MySQL have FULLTEXT searching. Other databases have similar features. I personally (since you asked) would rather do the dishes than work up an inverted index, taking into account word stemming and boolean operations.
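
    Something along these lines, for example -- an untested sketch where the table and column names (archive, path, dvd) are made up:

        use strict;
        use warnings;
        use DBI;

        # Assumes a table `archive` with a FULLTEXT index on `path`.
        my $dbh = DBI->connect('dbi:mysql:database=archive', 'user', 'pass',
                               { RaiseError => 1 });

        # IN BOOLEAN MODE gives required (+) and excluded (-) words, so
        # "this really blows !stupid fun" maps onto the string below.
        my $sth = $dbh->prepare(q{
            SELECT path, dvd
            FROM   archive
            WHERE  MATCH(path) AGAINST (? IN BOOLEAN MODE)
        });
        $sth->execute('+this +really +blows +fun -stupid');

        while (my ($path, $dvd) = $sth->fetchrow_array) {
            print "$path exists on $dvd\n";
        }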

      Another option is swish-e. C-based and fast. As a bonus, the output stuff is in Perl, so edit TemplateDefault.pm and it looks totally custom.

      cheers

      tachyon

      s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

      Wow, thanks chromatic. Back when I started this project a long time ago, MySQL did not have anything like that -- this looks to be very cool. Yeah, inverted indexes make my head ache; that is why I did not use them when I first wrote this. For the first 12 months this was in production, search times were not a factor, but the additions to the index are now running 6 or 7 times as large as the original spec called for -- hence the issue.

      -Waswas
      Ack, after reading through the MySQL docs, it seems as though the full-text searching does not meet my requirements =(. Take the same search string as above, "this really blows !stupid fun": it should return "/data2/studio/projects/clientname/print/this really/funermaker/santa blows/thefile.tif" (on Archive-dvd-000002122) but not "/data2/studio/projects/clientname/print/this really/funermaker/santa blows stupid/thefile.tif", because that one has "stupid" in the fully qualified file name.
      You see, I need to be able to substring search quickly, and full-text searching looks to only support full-word matches. With the * operator in boolean full-text mode I can, for example, use fun* to match "funny" or "fungle", but I do not see a way to match "Senior Howfun" -- do you have any insight?

      I know I can fall back to using LIKE statements to do basically the same thing I am doing in the original post, but it seems like that would just add the overhead of a DB to do the same thing -- any ideas?
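
      To be concrete, by LIKE statements I mean something like this untested sketch (same made-up archive table and $dbh as in the FULLTEXT example above):

          my @inc = qw(this really blows fun);
          my @exc = qw(stupid);

          # One LIKE '%word%' clause per included word, one NOT LIKE per
          # excluded word, so the DB does the substring tests.
          my $sql = 'SELECT path, dvd FROM archive WHERE '
                  . join(' AND ', ('path LIKE ?') x @inc,
                                  ('path NOT LIKE ?') x @exc);
          my $sth = $dbh->prepare($sql);
          $sth->execute(map { "%$_%" } @inc, @exc);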

      -Waswas
        Tons of people already pointed out Swish-E: that's the way I'd go myself, considering it already does all the searches you are looking for (and more) and that you can access its index files directly from Perl, via its Perl interface.
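
        For instance, an untested sketch via the SWISH::API module (the index file name is made up, and the index must already be built):

            use SWISH::API;

            my $swish   = SWISH::API->new('archive.index');
            my $results = $swish->Query('this and really and blows and fun not stupid');

            while (my $result = $results->NextResult) {
                print $result->Property('swishdocpath'), "\n";
            }
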
Re: Complicated searches in a very large text file.
by tachyon (Chancellor) on Jun 30, 2003 at 10:53 UTC

    If you want plug and play with minimal work, http://swish-e.org provides the Swish-e search engine. It will index almost anything (web content, Word docs, PDFs), handles stemming, etc. You can write a plugin for it so it can index anything. It is currently used on most of the GNU sites for search. C, fast, and stable. As a bonus, all the templating/output stuff is Perl, so all you really need to do is edit TemplateDefault.pm and you can make it look like you wrote it specifically for your site.
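
    A minimal setup sketch (untested; the paths and file names are made up) -- a two-line swish.conf:

        IndexDir  /data/archive-lists
        IndexFile ./archive.index

    then build the index and query it from the shell:

        swish-e -c swish.conf
        swish-e -f archive.index -w 'this and really and blows not stupid'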

    cheers

    tachyon

    s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

Re: Complicated searches in a very large text file.
by hardburn (Abbot) on Jun 30, 2003 at 14:29 UTC

    An RDBMS is your best option, but barring that, let me pimp CGI::Search for a moment. Although it's in the CGI namespace, it doesn't have to be used in a CGI (it just does things in ways that are a bit more convenient for CGI programmers, but won't get in the way of more general-purpose programmers). The latest version supports compressed flat files, which my basic benchmarking shows gives a 12-fold increase in speed.

    Still, you are better off with an RDBMS. In fact, one of the goals for CGI::Search was to make upgrading to an RDBMS easier (once your programs are running on it, you only need to change the module internals to search an RDBMS instead).

    ----
    I wanted to explore how Perl's closures can be manipulated, and ended up creating an object system by accident.
    -- Schemer

    Note: All code is untested, unless otherwise stated