in reply to Complicated searches in a very large text file.

I'd use a relational database. Recent versions of MySQL have FULLTEXT searching. Other databases have similar features. I personally (since you asked) would rather do the dishes than work up an inverted index, taking into account word stemming and boolean operations.

Re: Re: Complicated searches in a very large text file.
by tachyon (Chancellor) on Jun 30, 2003 at 10:59 UTC

    Another option is swish-e. It is C-based and fast. As a bonus, the output code is in Perl, so you can edit TemplateDefault.pm and make it look totally custom.

    cheers

    tachyon

    s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

Re: Re: Complicated searches in a very large text file.
by waswas-fng (Curate) on Jun 30, 2003 at 03:45 UTC
    Wow, thanks chromatic. Back when I started this project a long time ago, MySQL did not have anything like that; this looks very cool. Yeah, inverted indexes make my head ache, which is why I did not use them when I first wrote this. For the first 12 months this was in production the search times were not a factor, but the additions to the index are now starting to be 6 or 7 times as large as the original spec called for, hence the issue.

    -Waswas
Re: Re: Complicated searches in a very large text file.
by waswas-fng (Curate) on Jun 30, 2003 at 16:31 UTC
    Ack, after reading through the MySQL docs it seems as though the full text searching does not meet my requirements =(. I need to be able to take a search string such as "this really blows !stupid fun" and get back matches like this:
    "/data2/studio/projects/clientname/print/this really/funermaker/santa blows/thefile.tif" exists on Archive-dvd-000002122 but not this: "/data2/studio/projects/clientname/print/this really/funermaker/santa blows stupid/thefile.tif" because it has stupid in the fully qualified file name.
    You see, I need to be able to substring search quickly, and it looks like it only supports full word matches. Using the * operator in boolean full-text search I can do, for example, fun* to match "funny" or "fungle", but I do not see a way to match "Senior Howfun". Do you have any insight?

    I know I can fall back to using LIKE statements to do basically the same thing I am doing in the original post, but it seems like that would just add the overhead of a DB to do the same thing. Any ideas?
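
    To pin down the matching rules described above, here is a minimal sketch in Python. It assumes terms are separated by whitespace, that a leading ! excludes a term, and that matching is case-insensitive; none of that is confirmed by the original post, and the function names are made up for illustration. It is a plain linear check per path, not an index.

```python
def parse_query(query):
    """Split a query into required and excluded substrings.

    Assumption: terms are whitespace-separated and a leading '!' negates.
    """
    required, excluded = [], []
    for term in query.lower().split():
        if term.startswith("!"):
            if len(term) > 1:          # ignore a bare "!"
                excluded.append(term[1:])
        else:
            required.append(term)
    return required, excluded


def matches(query, path):
    """True if every required term is a substring of path and no excluded term is."""
    required, excluded = parse_query(query)
    haystack = path.lower()            # assumption: case-insensitive matching
    return (all(term in haystack for term in required)
            and not any(term in haystack for term in excluded))
```

    With the query "this really blows !stupid fun", this accepts the first path above and rejects the second, and a query of "fun" matches "Senior Howfun" because the test is on substrings rather than whole words.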

    -Waswas
      Tons of people already pointed out Swish-E: that's the way I'd go myself, considering it already does all the searches you are looking for (and more) and that you can access its index files directly from Perl, via its Perl interface.
        From the Swish docs,
        The wildcard (*) is available, however it can only be used at the end of a word: otherwise it is considered a normal character (i.e. can be searched for if included in the WordCharacters directive).
            swish-e -w "librarian" -f myIndex
        this query only retrieves files which contain the given word. On the other hand:
            swish-e -w "librarian*" -f myIndex
        retrieves ``librarians'', ``librarianship'', etc. along with ``librarian''.
        So how would you go about having a search string such as "test and apple" match a document containing "an apple is good to eat when they come from the grabaltester region of madeupcountry."
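
        To make the gap concrete, here is a small Python sketch contrasting the two behaviours. prefix_match is only a rough model of a trailing-wildcard query such as test* (it is not how swish-e is implemented, and the word-splitting rule is an assumption), while substring_match is the behaviour being asked for.

```python
import re


def words(text):
    """Split text into lowercase words (assumption: words are runs of a-z and 0-9)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def prefix_match(term, text):
    """Rough model of a trailing wildcard (term*): some word must start with term."""
    term = term.lower()
    return any(word.startswith(term) for word in words(text))


def substring_match(term, text):
    """The behaviour being asked for: term may appear anywhere, even mid-word."""
    return term.lower() in text.lower()


doc = ("an apple is good to eat when they come from "
       "the grabaltester region of madeupcountry.")
```

        Here prefix_match("apple", doc) succeeds, but prefix_match("test", doc) fails because no word in the document starts with "test". substring_match("test", doc) succeeds because "test" sits inside "grabaltester".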

        -Waswas