http://qs1969.pair.com?node_id=55041

ZydecoSue has asked for the wisdom of the Perl Monks concerning the following question:

Can anyone recommend a good module for adding good search capabilities to a web site?

Yes, I know about grep, but I'm looking for something that can report individual hits as well as phrase hits. Ideally, this would provide soundex support as well.

For example, suppose your site catalogs albums and each page provides a track listing. The site is categorized by artist and musical style.

I'm looking for something that lets you search for "giogio morodor evolution" which would return pages listing

(Yes, it should be in a database, but let's keep it simple for the moment.)

I noticed that CPAN contains one called Search-InvertedIndex, but that seems really complicated for what I thought should be a simple task.

Any suggestions?

Replies are listed 'Best First'.
Re: Searching module
by eduardo (Curate) on Jan 30, 2001 at 00:38 UTC
    ZydecoSue said:

    I noticed that CPAN contains one called Search-InvertedIndex, but that seems really complicated for what I thought should be a simple task.

    And eduardo cringed... I have written search engines for pretty much my entire professional programming life. At every single employer I can think of, all I did was write indexers and search engines for different types of data: relational data, flat data, ISAM data, geographic data, archaic data, encrypted data... Please, do yourself a favor and realize that searching is one of the most time-honored and well-studied fields in computer science. If you point your browser to Sorting and Searching (http://www1.fatbrain.com/asp/bookinfo/bookinfo.asp?theisbn=0201896850&vm=) by the great Knuth, you will realize that if it took him half of a 780-page book, maybe there is more complexity to this entire "searching" thing than at first appears on the surface.

    The first and most important thing that you need to do is understand the data that you are searching through. Is it flat files? Is it DBMs? Are you looking at RDBMS tables, or an OORDBMS? What is the "nature" of the data, what is its "thingness"? What does it contain, what does it show you, how does it index?

    Most data that you will find falls into one of two categories:

    • That which has a key
    • That which does not have a key

    If you realize that your data can be keyed, then your problems become much easier. There are hundreds if not thousands of mechanisms for searching through keyed information. You have choices ranging from:

    • Create a database with primary keys
    • Create DBMs which you tie
    • Create keyed index files
    • Use some pre-built system (it's amazing what's out there)
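
    As a minimal sketch of the "DBMs which you tie" option: the snippet below uses SDBM_File (which ships with Perl) purely for illustration. The file name, the ids, and the 'artist|title' record layout are all assumptions, not anything taken from the site being discussed.

```perl
use strict;
use warnings;
use Fcntl;                      # O_RDWR, O_CREAT
use SDBM_File;                  # core module; DB_File would be tied the same way
use File::Temp qw(tempdir);

# Hypothetical location: a throwaway directory so the sketch cleans up after itself.
my $dir = tempdir(CLEANUP => 1);

# Tie a hash to an on-disk DBM file; reads and writes go straight to disk.
tie my %album_by_id, 'SDBM_File', "$dir/albums", O_RDWR | O_CREAT, 0666
    or die "cannot tie $dir/albums: $!";

# Store records under a primary key (made-up ids and record layout).
$album_by_id{1001} = 'Pearl Jam|Ten';
$album_by_id{1002} = 'Example Artist|Example Album';

# A keyed lookup is a single hash access, however many records there are.
my ($artist, $title) = split /\|/, $album_by_id{1001};
print "$artist: $title\n";

untie %album_by_id;
```

    The point is only that once the data has a key, the "search" is a lookup rather than a scan.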


    If, however, you are doing free-form searching on data that cannot be related as simply as key => value, then the problem is a bit more complicated. You are asking for something more "full-text" and open-form. This is very difficult to implement right, which is why search engines differ so much in quality. A search engine (like Google) does just this: it attempts to parse intelligently the free-form data that exists on the internet. There is *never* a good reason to reinvent the wheel (well, I lie, sometimes for didactic purposes)... if this is the type of data you have, then I suggest you find an indexing / full-text search system:

    • Glimpse is an amazing product for full text searching
    • ht://dig is also pretty good


    However, all that I can suggest is: do yourself a favor and accept that this is a more complex thing than just indexing and using grep. Understand your data, understand your structure, understand what it is that you are trying to accomplish, and remember, you can do what merlyn says in his WebTechniques column: use WWW::Search and rely on Altavista to do your searching for you :)
Re: Searching module
by eg (Friar) on Jan 29, 2001 at 23:31 UTC

    How about ht://dig or webglimpse? An indexing scheme will always be faster than a straight linear search through all of your data.
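
    To illustrate the difference between the two approaches, here is a small sketch with made-up page names and page text. It is not how ht://dig or webglimpse work internally, just the general idea of an inverted index (word => pages) next to a linear scan.

```perl
use strict;
use warnings;

# Hypothetical pages: file name => text of the page.
my %pages = (
    'page1.html' => 'Pearl Jam Ten Jeremy Black',
    'page2.html' => 'black coffee blues',
    'page3.html' => 'Jeremy sings the blues',
);

# Linear search: every query re-reads the text of every page.
sub linear_search {
    my ($word) = @_;
    my @hits = sort grep { $pages{$_} =~ /\b\Q$word\E\b/i } keys %pages;
    return @hits;
}

# Inverted index: built once, maps each lowercased word to the pages containing it.
my %index;
for my $page (keys %pages) {
    $index{ lc $_ }{$page} = 1 for $pages{$page} =~ /\w+/g;
}

# Indexed search: one hash lookup per query word, no page text is read.
sub indexed_search {
    my ($word) = @_;
    my @hits = sort keys %{ $index{ lc $word } || {} };
    return @hits;
}

print "linear:  @{[ linear_search('black') ]}\n";
print "indexed: @{[ indexed_search('black') ]}\n";
```

    Both calls return the same pages; the indexed version just pays the scanning cost once, at index time, instead of on every query.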

Re: Searching module
by lemming (Priest) on Jan 30, 2001 at 01:20 UTC
Re: Searching module
by baku (Scribe) on Jan 30, 2001 at 01:47 UTC

    All due deference to the learned ones, and keeping in mind that you don't want a database: use a "database."

    Actually, look into using a tie()'d hash to several (Berkeley DB or similar) files: e.g.

        use DB_File;

        tie %artist, 'DB_File', "$vardir/artists.db";
        tie %album,  'DB_File', "$vardir/title.db";
        tie %track,  'DB_File', "$vardir/track.db";

    This means your indexer can do something like:

        use strict;
        use warnings;
        use DB_File;

        # tie to new files to keep from accidentally re-using old values
        # and to not update the db while it may be being read by the
        # search client
        my $vardir = "/var/music";    # whatever
        my (%artist, %album, %track, %by_id, %keyword);
        tie %artist,  'DB_File', "$vardir/.#artist.db#"  or die "artist db: $!";
        tie %album,   'DB_File', "$vardir/.#album.db#"   or die "album db: $!";
        tie %track,   'DB_File', "$vardir/.#track.db#"   or die "track db: $!";
        tie %by_id,   'DB_File', "$vardir/.#by_id.db#"   or die "by_id db: $!";
        tie %keyword, 'DB_File', "$vardir/.#keyword.db#" or die "keyword db: $!";

        my $id = 0;
        open my $index, '<', "$vardir/my_ascii_index.csv"
            or die "can't index if I can't read the index: $!";
        while (my $line = <$index>) {
            chomp $line;
            # one album per line: artist, album title, then the tracks
            my ($this_artist, $this_album, @album_tracks) = split /,/, $line;
            $artist{$this_artist} .= $id . ',';
            $album{$this_album}   .= $id . ',';
            for my $this_track (@album_tracks) {
                $track{$this_track} .= $id . ',';
            }
            $by_id{$id} = join "\x00", $this_artist, $this_album, @album_tracks;
            # every whitespace-separated word of every field is a keyword
            for my $word (split ' ', join ' ', $this_artist, $this_album, @album_tracks) {
                $keyword{$word} .= $id . ',';
            }
            $id++;
        }
        close $index;

        untie %artist;
        untie %album;
        untie %track;
        untie %by_id;
        untie %keyword;

        # [the data file assumed above would read like:
        #  Pearl Jam,ten,Jeremy,Black,...
        #  and could be created in Gnumeric or Excel as a CSV file]

    That's really nasty, not to mention probably very inefficient, but could be easy to adapt to your particular inputs...

    Then, to do a search query, do something like:

        # use CGI and get your query words in whatever form;
        # load them into e.g. $artist_query, $album_query, &c.
        # (the hashes are tied as above; the returns assume this is inside a sub)
        my @result_ids = ();
        # each index value is a comma-separated id list, so split it into ids
        if ($artist{$artist_query}) { push @result_ids, split /,/, $artist{$artist_query}; }
        if ($track{$track_query})   { push @result_ids, split /,/, $track{$track_query}; }
        if ($album{$album_query})   { push @result_ids, split /,/, $album{$album_query}; }
        for my $word (split ' ', $keyword_query) {
            if ($keyword{$word}) { push @result_ids, split /,/, $keyword{$word}; }
        }
        my %seen;    # the same id can match in more than one field
        @result_ids = grep { !$seen{$_}++ } @result_ids;

        unless (@result_ids) {
            print "<h1> No results </h1>";
            return;
        }

        print "<h1> Found " . (scalar @result_ids) . ": </h1>\n<ol type=1>\n";
        for my $id (@result_ids) {
            my ($artist, $album, @tracks) = split /\x00/, $by_id{$id};
            print "<li> <big> <a href=\"http://somewhere/interesting/lookup_id.pl?$id\">$album</a> </big> by $artist <br> <small> <ol type=1>\n";
            for my $track (@tracks) {
                print " <li> $track </li>\n";
            }
            print "</ol></small></li>\n\n";
        }
        print "\n</ol>\n";
        return;

    Again, really nasty, but quick and simple. Does not allow any kind of search except by exact-match artist, track, or album, or by a keyword (which must be an exact match but can occur as any fragment of any field).
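
    One small step beyond that, sketched under assumptions: requiring that an album match all of the keywords rather than any of them. The %keyword hash here is a plain in-memory stand-in for the tied one, holding the same "id,id," strings the indexer writes; the function name ids_matching_all and the sample entries are made up for illustration.

```perl
use strict;
use warnings;

# Stand-in for the tied %keyword hash: word => "id,id," (trailing comma as written by the indexer).
my %keyword = (
    Pearl => '0,',
    Jam   => '0,2,',
    Black => '0,1,',
);

# Return the ids that appear in the id list of every query word (an AND search).
sub ids_matching_all {
    my ($index, @words) = @_;
    my %uniq;
    @words = grep { !$uniq{$_}++ } @words;    # ignore repeated query words

    my %count;                                # id => number of query words it matched
    for my $word (@words) {
        my %ids = map { $_ => 1 } grep { length } split /,/, ($index->{$word} // '');
        $count{$_}++ for keys %ids;
    }
    my @ids = sort { $a <=> $b } grep { $count{$_} == @words } keys %count;
    return @ids;
}

my @hits = ids_matching_all(\%keyword, qw(Jam Black));
print "ids matching every word: @hits\n";
```

    This is still exact-match on each word; it only changes how the per-word hits are combined.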

    As eduardo pointed out, for anything more complex, go ahead and use a 'real' search system. The only advantage to this structure is that it allows for an 'advanced search' or similar:

    Enter keyword(s): ________

    Advanced search:
    • Artist (exact name): ______
    • Album Title (exact name): ______
    • Track Title (exact name): ______

    Submit
Re: Searching module
by ZydecoSue (Scribe) on Jan 30, 2001 at 02:10 UTC
    Well, this has been an interesting day. Thank you for taking the time to discuss this with me, both in your posts and in the chatterbox.

    Yes, I'm aware of merlyn's articles. The first one didn't seem completely appropriate for it doesn't use indexes (apologies for daring to criticize, but) and the second presumes that your site is interesting to the search engines (and therefore visited). Mine isn't. Yes: I've submitted it, there's a robots.txt file, there are carefully chosen META tags and keywords. So far, they've come, but not listed me. So, that lets out the second approach.

    I'm not trying to reinvent the wheel. I'm trying to find a right-sized wheel that fits my needs.

    I looked at the suggested packages. One wants more money than I can afford (yes, there's a crippled version for free; however, their licensing seems a bit screwy).

    Another is open source, but it's written in C. I'd prefer to find a Perl solution if possible, so I can learn from it.

    The links provided by lemming are promising. I'll try to work something out of those.

    baku's sample is interesting, but is taking my album example a little too seriously. :)

    In reality, I'm looking to index a large number of free-form text documents and a companion program to search those indexes, preferably something that uses proper style. For example, something that uses warnings, strict, and taint mode.

    I'd really appreciate it if this companion also provided support for soundex, word proximity, and root words, e.g. knowing that "search" should hit "searching," "searches," and so on.
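
    For reference, the soundex part of that wish list can be sketched in a few lines. CPAN has a Text::Soundex module for this; the hand-rolled simple_soundex below is only an illustrative stand-in (the name is made up) following the usual American Soundex rules, so that the sketch has no dependencies.

```perl
use strict;
use warnings;

# Letter => digit table; vowels, y, h and w have no code.
my %code;
@code{qw(b f p v)}         = (1) x 4;
@code{qw(c g j k q s x z)} = (2) x 8;
@code{qw(d t)}             = (3) x 2;
$code{l}                   = 4;
@code{qw(m n)}             = (5) x 2;
$code{r}                   = 6;

# Reduce a word to its first letter plus three digits, so that
# similar-sounding spellings collapse to the same four-character key.
sub simple_soundex {
    my ($word) = @_;
    (my $letters = lc $word) =~ s/[^a-z]//g;
    return '' unless length $letters;

    my @chars  = split //, $letters;
    my $first  = shift @chars;
    my $result = uc $first;
    my $prev   = $code{$first} // 0;
    for my $ch (@chars) {
        next if $ch eq 'h' or $ch eq 'w';    # h and w do not separate equal codes
        my $digit = $code{$ch} // 0;         # vowels and y count as 0
        $result .= $digit if $digit and $digit != $prev;
        $prev = $digit;
    }
    return substr($result . '000', 0, 4);    # pad or truncate to four characters
}

# A misspelled query word and the correctly spelled name get the same key.
for my $name (qw(morodor Moroder)) {
    printf "%-8s => %s\n", $name, simple_soundex($name);
}
```

    An index keyed on these codes instead of on the raw words would let a misspelled query still find the page. Word proximity and root words need different machinery and are not covered by this sketch.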

    And, most important, I'm looking for something that you folks respect. I really don't want to have to try to rewrite stuff from Matt's Script Archive. Not only am I not that experienced, but I'm not sure I'd know where to start (other than the bits I already mentioned).

    Update: I just realized that you might think I'm asking you folks to write this. I'm not, but I am asking if such a thing has already been written.

    Again, thanks for your assistance.

Re: Searching module
by markwild (Sexton) on Jan 30, 2001 at 01:38 UTC
    Everyone's favorite Perl Hacker, Merlyn, writes a column for Web Techniques Magazine every month. Take a look at this script from April 97, or his improved version from December 1999. --Mark
Re: Searching module
by Maclir (Curate) on Jan 30, 2001 at 03:00 UTC
Re: Searching module
by dash2 (Hermit) on Jan 30, 2001 at 16:08 UTC
    On the other hand, if you just want web page searching capabilities, you could be totally lazy and go for an ASP style solution like Atomz. It was free last time I checked, and pretty good, but you have to include a button from there. Not very Perlish in spirit, either.

    Dave