in thread speeding up a file-based text search

Phrase matching is not the same as "and" matching. It's not enough for two words to both be in the same record; they have to be there next to each other in the correct order. A word list alone can't establish that, although it can be used to qualify records for further checking. I can do partial matching as part of that step, although it requires a full scan of the word list instead of a direct lookup. I'm going to try it.
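
A minimal sketch of that two-step idea, assuming the word list is a hash of word => record numbers. The sample records and the name phrase_search are made up for illustration; this is not the actual search script.

```perl
use strict;
use warnings;

# Hypothetical sample records; the real ones would come from the data file.
my @records = (
    'Come on in. The tea is on the stove.',
    'The cat sat in the box.',
    'Tea is in the pot on the stove.',
);

# Word list: lowercased word => { record number => 1 }.
my %word_list;
for my $i (0 .. $#records) {
    $word_list{$_}{$i} = 1 for lc($records[$i]) =~ /(\w+)/g;
}

# Return the numbers of the records that contain $phrase verbatim.
sub phrase_search {
    my ($phrase) = @_;
    $phrase = lc $phrase;
    my %seen;
    my @unique = grep { !$seen{$_}++ } $phrase =~ /(\w+)/g;
    return unless @unique;

    # Step 1: "and" matching. Keep the records that contain every word.
    my %hits;
    for my $word (@unique) {
        $hits{$_}++ for keys %{ $word_list{$word} || {} };
    }
    my @candidates = sort { $a <=> $b }
                     grep { $hits{$_} == @unique } keys %hits;

    # Step 2: scan only the candidates for the literal phrase.
    # index() does a substring test, so "in the" also matches "within the".
    return grep { index(lc $records[$_], $phrase) >= 0 } @candidates;
}

print join(', ', phrase_search('in the')), "\n";
```

Record 0 qualifies in step 1 because it contains both words, but step 2 rejects it: the text reads "in. The", not "in the".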

Incidentally, I'm using index() instead of m// for partial matching, which should be faster. Giving users regex search capability is not a goal.
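
Roughly what the partial match over the word list could look like, assuming the word list is a hash of word => array of record numbers. The data and the name partial_match are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical word list: word => record numbers that contain it.
my %word_list = (
    search    => [ 1, 4 ],
    searching => [ 2 ],
    research  => [ 3, 4 ],
    text      => [ 1, 2 ],
);

# Return the sorted record numbers for every word containing $fragment.
# The fragment is compared as literal text, never compiled as a pattern.
sub partial_match {
    my ($fragment) = @_;
    $fragment = lc $fragment;
    return if $fragment eq '';

    my %records;
    for my $word (keys %word_list) {            # full scan of the word list
        next if index($word, $fragment) < 0;    # index() returns -1 on a miss
        $records{$_} = 1 for @{ $word_list{$word} };
    }
    return sort { $a <=> $b } keys %records;
}

print join(', ', partial_match('search')), "\n";
```

Because index() treats its argument as a plain string, a query such as "sea.ch" matches nothing here, and no quoting of metacharacters is needed.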

Re^7: speeding up a file-based text search
by dragonchild (Archbishop) on May 07, 2003 at 22:23 UTC
    One possibility for phrase matching is to build a second layer of indexing. The first layer is "This word exists". The second layer is "This other word is right after me at some place in the document".

    Now, this will give the possibility of false matches, depending on how you index. For example, the phrase "in the" might end up matching "Come on in. The tea is on the stove."

    Another problem is 3+ word phrases. The system I'm proposing will tell you if pairs are in the right order. But, using the above snippet, "in the stove" would match that document because "in the" and "the stove" are both phrases that exist, even though "in the stove" isn't there.
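
    An untested-in-production sketch of the two layers, including both false matches described above. The records, the hash names and the helper names (pairs_of, pair_search) are made up for the example, and punctuation is simply ignored when splitting into words.

```perl
use strict;
use warnings;

my @records = (
    'Come on in. The tea is on the stove.',
    'Put the kettle in the stove.',
);

# Adjacent word pairs of a word list: (a b c) gives ("a b", "b c").
sub pairs_of {
    my @words = @_;
    return map { join ' ', @words[ $_, $_ + 1 ] } 0 .. $#words - 1;
}

# Layer 1: word => { record => 1 }.  Layer 2: "word next" => { record => 1 }.
my (%has_word, %has_pair);
for my $i (0 .. $#records) {
    my @words = lc($records[$i]) =~ /(\w+)/g;
    $has_word{$_}{$i} = 1 for @words;
    $has_pair{$_}{$i} = 1 for pairs_of(@words);
}

# Records in which every adjacent pair of the phrase occurs somewhere.
sub pair_search {
    my ($phrase) = @_;
    my @words = lc($phrase) =~ /(\w+)/g;
    return unless @words;

    # A one-word query uses layer 1, anything longer uses layer 2.
    my ($layer, @keys) = @words == 1
        ? (\%has_word, @words)
        : (\%has_pair, pairs_of(@words));
    my %seen;
    @keys = grep { !$seen{$_}++ } @keys;

    my %hits;
    for my $key (@keys) {
        $hits{$_}++ for keys %{ $layer->{$key} || {} };
    }
    return sort { $a <=> $b } grep { $hits{$_} == @keys } keys %hits;
}

print join(', ', pair_search('in the stove')), "\n";
```

    Record 0 comes back for "in the stove" even though the phrase is not in it, because "in the" and "the stove" each occur somewhere in that record.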

    But, it all depends on how perfect you want to be. "Good enough, I can give you now. Perfect will be along tomorrow."

    ------
    We are the carpenters and bricklayers of the Information Age.

    Don't go borrowing trouble. For programmers, this means Worry only about what you need to implement.

    Please remember that I'm crufty and crotchety. All opinions are purely mine and all code is untested, unless otherwise specified.

Re: Re: Re: Re: Re: Re: Re: speeding up a file-based text search
by BrowserUk (Patriarch) on May 07, 2003 at 22:33 UTC

    Phrase matching is not the same as "and" matching.

    Agreed. Sorry if I implied otherwise. But once you have a list of the records that contain all the terms, validating those records against the original phrase is considerably less costly than searching the whole 20MB.

    Good point about index() as well. Using it on the keys of the hash does make more sense than m//.


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
Re^7: speeding up a file-based text search (word list for phrase search)
by Aristotle (Chancellor) on May 09, 2003 at 19:04 UTC
    Depends on your word list. You could store the in-record location(s) of the word as well; then, when doing a phrase search, you can intersect the sets for each word by record and then check for consecutive locations in the correct order. This is, AFAIK and at least roughly, the way all of the big web search engines work.
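
    A small sketch of such a positional word list, assuming positions are counted in words from the start of each record. The data, the hash layout and the name phrase_search are illustrative only, and real engines are certainly more elaborate than this.

```perl
use strict;
use warnings;

my @records = (
    'Come on in. The tea is on the stove.',
    'Put the kettle in the stove.',
);

# word => { record number => [ word positions within that record ] }
# Punctuation is ignored, so "in. The" still counts as consecutive.
my %positions;
for my $i (0 .. $#records) {
    my $pos = 0;
    push @{ $positions{$_}{$i} }, $pos++ for lc($records[$i]) =~ /(\w+)/g;
}

# Records that contain the words of $phrase at consecutive positions.
sub phrase_search {
    my ($phrase) = @_;
    my @words = lc($phrase) =~ /(\w+)/g;
    return unless @words;
    return if grep { !exists $positions{$_} } @words;

    my @found;
    my $first = $positions{ $words[0] };
    RECORD: for my $rec (sort { $a <=> $b } keys %$first) {
        START: for my $start (@{ $first->{$rec} }) {
            for my $k (1 .. $#words) {
                # Word $k must sit exactly $k places after the first word.
                my $list = $positions{ $words[$k] }{$rec} or next RECORD;
                next START unless grep { $_ == $start + $k } @$list;
            }
            push @found, $rec;
            next RECORD;
        }
    }
    return @found;
}

print join(', ', phrase_search('in the stove')), "\n";
```

    Only record 1 is returned for "in the stove": record 0 has "in the" at positions 2 and 3, but "stove" sits at position 8 instead of 4. The linear grep over each position list keeps the sketch short; a hash of positions or a merge of sorted lists would be the obvious next step for long records.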

    Makeshifts last the longest.