arunmep has asked for the wisdom of the Perl Monks concerning the following question:

Hi guys, I am developing a search utility for searching keywords in documents. The search keywords will be given as a query such as (A or B And (B or C*)). Can anybody tell me how to proceed with this kind of search? Is there a built-in package in Perl for it? Please let me know. Thank you.

Re: how to parse a query
by grep (Monsignor) on Oct 16, 2006 at 06:09 UTC
    The bigger problem would be the full-text search engine. Just regexing or iterating through the documents in real time does not scale well.

    Look at DBIx::TextIndex. I have been using it for a little while now and it works well. It also gives you grouping, boolean operators ('AND', 'OR', and 'NOT'), phrase search, and wildcards.

    I had problems using DBIx::TextIndex with PostgreSQL v8 but it should work with PostgreSQL v7. It also works with MySQL or SQLite.



    grep
    One dead unjugged rabbit fish later
Re: how to parse a query
by graff (Chancellor) on Oct 16, 2006 at 07:39 UTC
    I don't know whether this is relevant, but you might look at KinoSearch -- at least for ideas. Basically, you need some sort of process that will read through your set of documents and build an index to identify all the locations of all possible keywords. Then you need a separate query process that knows how to read the index data, and how to use the information provided there to locate the specific documents that meet specific conditions on particular keywords.
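
To make the index/query split described above concrete, here is a minimal sketch of an in-memory inverted index using only core Perl. It is not how KinoSearch works internally; the sample documents and the helper names (`docs_with`, `docs_with_all`, `docs_with_any`) are made up for illustration, and a real engine would persist the index and handle stemming, positions, and ranking.

```perl
use strict;
use warnings;

# Hypothetical sample documents: id => text.
my %docs = (
    1 => 'apple banana',
    2 => 'banana cherry',
    3 => 'cherry apple date',
);

# Indexing pass: map each lowercased word to the set of documents containing it.
my %index;
for my $id (keys %docs) {
    $index{ lc $_ }{$id} = 1 for $docs{$id} =~ /\w+/g;
}

# Query pass: consult the index instead of rescanning the documents.
sub docs_with {
    my ($term) = @_;
    return sort { $a <=> $b } keys %{ $index{ lc $term } || {} };
}

# AND: documents that contain every term.
sub docs_with_all {
    my @terms = @_;
    my %count;
    $count{$_}++ for map { docs_with($_) } @terms;
    return sort { $a <=> $b } grep { $count{$_} == @terms } keys %count;
}

# OR: documents that contain at least one term.
sub docs_with_any {
    my %seen;
    $seen{$_} = 1 for map { docs_with($_) } @_;
    return sort { $a <=> $b } keys %seen;
}

print join(',', docs_with_all('apple', 'cherry')), "\n";   # prints: 3
print join(',', docs_with_any('banana', 'date')), "\n";    # prints: 1,2,3
```

The point of the split is that the expensive pass over the documents happens once, at indexing time; each query then only touches the hash entries for its own terms.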

    A database solution would probably work okay, but people have built "search engine" apps that are better optimized for this kind of task. KinoSearch (which I personally have not used) is one such engine, built with Perl and C.

      A huge package like KinoSearch can be very daunting at first, so if you want help with it: its author, creamygoodness, is a regular here. You can find him in the Chatterbox, typically several times a week.

        ... and it would be nice to be able to give back something for all the Chatterbox help bart has given me.

        The path for getting started with KinoSearch is to copy and paste the sample code in KinoSearch::Docs::Tutorial and adapt it for your needs.

        I think graff has correctly divined that you're in need of a search engine library rather than a standalone search query parser. Nevertheless, for the sake of completeness, another CPAN module specifically dedicated to that task is Search::QueryParser.

        --
        Marvin Humphrey
        Rectangular Research ― http://www.rectangular.com
Re: how to parse a query
by bart (Canon) on Oct 16, 2006 at 10:45 UTC
    There's an ancient, unmaintained module on CPAN to parse such queries, if I understood its purpose correctly: Text::Query.
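
For a sense of what parsing such a query involves, here is a hand-rolled recursive-descent sketch in core Perl. It is not Text::Query's API or implementation. The assumptions are mine: AND binds tighter than OR, operators are case-insensitive, a trailing * means prefix match, and the tree layout and sub names are hypothetical.

```perl
use strict;
use warnings;

# Split the query into parentheses and words (a word may end in '*').
sub tokenize { my ($q) = @_; return $q =~ /[()]|[^\s()]+/g }

# Grammar (assumed precedence: AND binds tighter than OR):
#   expr   := term   ( OR  term   )*
#   term   := factor ( AND factor )*
#   factor := '(' expr ')' | WORD | WORD*
sub parse_expr {
    my ($tok) = @_;
    my $node = parse_term($tok);
    while (@$tok && lc($tok->[0]) eq 'or') {
        shift @$tok;
        $node = [ 'OR', $node, parse_term($tok) ];
    }
    return $node;
}

sub parse_term {
    my ($tok) = @_;
    my $node = parse_factor($tok);
    while (@$tok && lc($tok->[0]) eq 'and') {
        shift @$tok;
        $node = [ 'AND', $node, parse_factor($tok) ];
    }
    return $node;
}

sub parse_factor {
    my ($tok) = @_;
    my $t = shift @$tok;
    die "unexpected end of query\n" unless defined $t;
    if ($t eq '(') {
        my $node  = parse_expr($tok);
        my $close = shift @$tok;
        die "missing ')'\n" unless defined $close && $close eq ')';
        return $node;
    }
    die "unexpected '$t'\n" if $t eq ')';
    return $t =~ /^(.+)\*$/ ? [ 'PREFIX', lc($1) ] : [ 'WORD', lc($t) ];
}

sub parse_query {
    my @tok  = tokenize(shift);
    my $tree = parse_expr(\@tok);
    die "unexpected '$tok[0]'\n" if @tok;
    return $tree;
}

# Evaluate a parsed tree against the text of one document.
sub matches {
    my ($node, $text) = @_;
    my ($op, @args) = @$node;
    return matches($args[0], $text) && matches($args[1], $text) if $op eq 'AND';
    return matches($args[0], $text) || matches($args[1], $text) if $op eq 'OR';
    return $text =~ /\b\Q$args[0]\E/i ? 1 : 0 if $op eq 'PREFIX';
    return $text =~ /\b\Q$args[0]\E\b/i ? 1 : 0;
}

my $tree = parse_query('(A or B And (B or C*))');
print matches($tree, 'b cat') ? "match\n" : "no match\n";   # prints: match
print matches($tree, 'cat')   ? "match\n" : "no match\n";   # prints: no match
```

Matching by regex against the raw text is only for demonstration; with a real index, the same tree would be evaluated by combining document sets per term.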
Re: how to parse a query
by derby (Abbot) on Oct 16, 2006 at 13:53 UTC

    You could use Lucene::QueryParser, but you would have to change the boolean operators to all caps:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Data::Dumper;
    use Lucene::QueryParser;

    my $struct = parse_query('(A OR B AND (B OR C*))');
    print Dumper($struct);

    produces:

    $VAR1 = bless( [
      bless( {
        'subquery' => bless( [
          bless( { 'query' => 'TERM', 'type' => 'NORMAL', 'term' => 'A' }, 'Lucene::QueryParser::Term' ),
          bless( { 'conj' => 'OR', 'query' => 'TERM', 'type' => 'NORMAL', 'term' => 'B' }, 'Lucene::QueryParser::Term' ),
          bless( {
            'conj' => 'AND',
            'subquery' => bless( [
              bless( { 'query' => 'TERM', 'type' => 'NORMAL', 'term' => 'B' }, 'Lucene::QueryParser::Term' ),
              bless( { 'conj' => 'OR', 'query' => 'PREFIX', 'type' => 'NORMAL', 'term' => 'C' }, 'Lucene::QueryParser::Prefix' )
            ], 'Lucene::QueryParser::TopLevel' ),
            'query' => 'SUBQUERY',
            'type' => 'NORMAL'
          }, 'Lucene::QueryParser::Subquery' )
        ], 'Lucene::QueryParser::TopLevel' ),
        'query' => 'SUBQUERY',
        'type' => 'NORMAL'
      }, 'Lucene::QueryParser::Subquery' )
    ], 'Lucene::QueryParser::TopLevel' );

    -derby
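
Since the original query mixes "or" and "And", the all-caps requirement could be met by normalizing user input before handing it to the parser. A minimal sketch in core Perl follows; the sub name `normalize_operators` is hypothetical, and it assumes that a bare and/or/not is always meant as an operator, never as a search term or part of a quoted phrase.

```perl
use strict;
use warnings;

# Uppercase stand-alone and/or/not (in any case) so that a parser which
# only recognises AND, OR and NOT treats them as operators.
# Assumption: these words are never intended as search terms.
sub normalize_operators {
    my ($query) = @_;
    $query =~ s/\b(and|or|not)\b/\U$1/gi;
    return $query;
}

print normalize_operators('(A or B And (B or C*))'), "\n";
# prints: (A OR B AND (B OR C*))
```

The \b anchors keep words such as "oregon" or "sand" untouched, because only whole-word matches are rewritten.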