http://qs1969.pair.com?node_id=91433

swiftone has asked for the wisdom of the Perl Monks concerning the following question:

I need to be able to search a block of text to see if a given question is in there, with broad flexibility for different ways to state the question.

My workplace has a problem with too many people asking FAQs by email. To try and free up staff time, here's my plan:

  1. John Doe comes to our website and clicks on the "send comments and questions" link.
  2. John Doe fills out a form with contact info and a text block for comments and questions.
  3. When "submit" is clicked, the input is checked against a list of FAQs.
    • If there are no matches, the form is emailed to a customer service rep.
    • If there is a match, the matching Q&A (or a link) is returned to the user with an appropriate blurb. The user can then either confirm that the request was not answered (which results in the form being emailed), or leave happily.
The problem is determining how best to do this. I could try to compare sentences to the FAQs using String::Approx, but that would likely produce strange matches, and it would be thrown off by how often our customers omit punctuation.

I could go with a keyword search, but that requires adding keywords to our existing FAQ list, and keyword matching alone isn't a great way to match FAQs.

In general, I would rather err on the side of too many false matches than too few. Any ideas?

Re: Matching a question in text
by voyager (Friar) on Jun 26, 2001 at 01:10 UTC
    Take the text of the email and throw away noise words (what, how, the, etc.). Then take the remaining words and see if they appear in the FAQs. Keep track of how many words in the email match each FAQ so you can rank the FAQs.

    So when the user clicks "SEND", you can politely say something like "Here is a list of FAQs that might answer your question". Each entry has a link to the FAQ and a blurb, so the user can tell without going to the FAQ whether it might help.

    You should have a button/link in a very obvious place that says, in effect, "None of the FAQs answer my questions, Submit the question Now".

    Over time you can tune your list of noise words and perhaps even recognize a list of words that should be given more relevance in sorting the FAQs for display.
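The noise-word filtering and ranking described above might be sketched roughly as follows. This is only an illustration in Python, not anyone's production code: the `NOISE_WORDS` set, the `FAQS` entries, and the function names are all made up for the example, and a real noise-word list would need tuning against actual user mail.

```python
import re
from collections import Counter

# Hypothetical noise-word list; a real one would be tuned over time.
NOISE_WORDS = {
    "what", "how", "the", "a", "an", "is", "do", "i", "my",
    "to", "can", "of", "in", "for", "you", "your",
}

# Hypothetical FAQ entries (id -> question text), for illustration only.
FAQS = {
    "reset-password": "How do I reset my password?",
    "shipping-time": "How long does shipping take?",
    "return-policy": "What is your return policy for damaged items?",
}


def significant_words(text):
    """Lowercase the text, split it into words, and drop noise words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {w for w in words if w not in NOISE_WORDS}


def rank_faqs(message, faqs=FAQS):
    """Return (faq_id, shared_word_count) pairs, best match first.

    FAQs that share no significant word with the message are omitted.
    """
    message_words = significant_words(message)
    scores = Counter()
    for faq_id, question in faqs.items():
        overlap = message_words & significant_words(question)
        if overlap:
            scores[faq_id] = len(overlap)
    return scores.most_common()


if __name__ == "__main__":
    print(rank_faqs("hi i forgot my password how can i reset it"))
```

An empty result would mean no FAQ shares a significant word with the message, in which case the form could go straight to a customer service rep.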

      Noise words, important words, etc. tend to be domain-specific. What I do for my current project is for every search, I log:
      • what was typed (e.g. "show me all the foo and bar")
      • what I searched on (e.g., "foo bar") # we use Lingua::Stem and other tricks
      • how many "hits"
      This is written to a log file and a cron job dumps results into mysql db for easy reporting.

      So to finally answer your question, you determine noise words by looking at what your users do. HTH

      Hmm. Interesting. Any ideas for a good source of "noise" words, or do I just fake it?
        Search engines do this.
        It was either htdig or swift-e that had a file containing such "noise words". Just use that. (I think swift-e had them in its source code.)

        You can find links to them here:
        http://www.searchtools.com/

        On another note, the source is available for a lot of the search engines on that page. Code examples for things like fuzzy search and context searching might be available.

Re: Matching a question in text
by arturo (Vicar) on Jun 26, 2001 at 01:19 UTC

    Here's a thought. Use a thesaurus. You don't just want to fuzzy match strings, you want to match *similar* words. Oracle's "Oracle Text" does this sort of thing, even comes with a built-in thesaurus (matches wider terms to narrower terms, so, e.g. "Perl" could come up on a search on "Programming language").

    That should give you a start on ideas. I'm not aware of any open source alternatives, but some search engines do similar sorts of things.
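To make the thesaurus idea concrete, here is a minimal sketch in Python. It is not Oracle Text's API or data: the `BROADER_TERMS` map and both function names are hypothetical, and a real thesaurus would be far larger and probably hierarchical.

```python
# Hypothetical narrower-term -> broader-terms map, for illustration only.
BROADER_TERMS = {
    "perl": {"programming language", "scripting language"},
    "python": {"programming language"},
    "mysql": {"database"},
}


def expand_terms(terms):
    """Return the lowercased terms plus every broader term listed for them."""
    expanded = set()
    for term in terms:
        key = term.lower()
        expanded.add(key)
        expanded |= BROADER_TERMS.get(key, set())
    return expanded


def matches(search_phrase, document_terms):
    """True if the phrase equals a document term or one of its broader terms."""
    return search_phrase.lower() in expand_terms(document_terms)


if __name__ == "__main__":
    # A document that mentions "Perl" comes up on a search for the wider term.
    print(matches("Programming language", ["Perl", "regex"]))
```

For FAQ matching, the same expansion could be applied to the words of each FAQ before comparing them with the user's message.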

    Hope that helps.

    perl -e 'print "How sweet does a rose smell? "; chomp ($n = <STDIN>); $rose = "smells sweet to degree $n"; *other_name = *rose; print "$other_name\n"'
Re: Matching a question in text
by cLive ;-) (Prior) on Jun 26, 2001 at 01:52 UTC
Re: Matching a question in text
by jorg (Friar) on Jun 26, 2001 at 01:02 UTC
    How about this :

    Next to the feedback form, provide a few links to the FAQs themselves, making sure you put enough "Please read the FAQs before submitting any questions!" hints around them.

    Educating the user is the key to successful user support.

    Jorg

    "Do or do not, there is no try" -- Yoda

      That's already being done :) I try and tell myself there are far more people who read the FAQs and are happy. I'm not sure I believe it, but I keep telling myself.
Re: Matching a question in text
by suaveant (Parson) on Jun 26, 2001 at 01:05 UTC
    You are asking for A.I....

    I would define a set of keywords and keyphrases for FAQ questions, then make some sort of threshold check against the question asked by the user... anything else is just that much more work for not many more results...

                    - Ant

      And you should have a bunch of old requests lying about for testing whether whatever you come up with works for the typical requests that you get.

              - tye (but my friends call me "Tye")
      I realize nothing will be perfect. I was trying to see if anyone had done this before, and what model is most efficient. How do you recommend I run a "threshold check"? By percentage of keywords matched?
        percentage of keywords matched... maybe even weight the keywords and keyphrases and take anything that gets 5 points, or 10, or 2... or more, of course... kind of a reverse search
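A weighted threshold check along those lines could look something like this sketch, written in Python for illustration. The keywords, weights, and the cut-off of 5 are invented for the example, and the matching is plain substring search, so "reset" would also fire on "presetting"; word-boundary matching would be a sensible refinement.

```python
# Hypothetical weights per FAQ: keyword or key phrase -> points.
FAQ_KEYWORDS = {
    "reset-password": {"password": 3, "reset": 2, "forgot": 2, "log in": 1},
    "shipping-time": {"shipping": 3, "delivery": 3, "how long": 2, "arrive": 2},
}

THRESHOLD = 5  # hypothetical cut-off; tune it against old requests


def score(message, keywords):
    """Sum the weights of every keyword or key phrase found in the message."""
    text = " ".join(message.lower().split())  # lowercase, collapse whitespace
    return sum(points for phrase, points in keywords.items() if phrase in text)


def matching_faqs(message, threshold=THRESHOLD):
    """Return (faq_id, points) for FAQs at or above the threshold, best first."""
    scored = [(faq_id, score(message, kws)) for faq_id, kws in FAQ_KEYWORDS.items()]
    hits = [(faq_id, pts) for faq_id, pts in scored if pts >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    print(matching_faqs("I forgot my password, how long until I can log in"))
```

Lowering the threshold trades more false matches for fewer missed ones, which fits the stated preference for erring towards extra matches.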

                        - Ant

Re: Matching a question in text
by clemburg (Curate) on Jun 26, 2001 at 19:07 UTC

    You could try reusing some functionality of Infobot by Kevin Lenzo.

    Christian Lemburg
    Brainbench MVP for Perl
    http://www.brainbench.com