Searching module

ZydecoSue has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.

Re: Searching module
by eduardo (Curate) on Jan 30, 2001 at 00:38 UTC

I noticed that CPAN contains one called Search-InvertedIndex, but that seems really complicated for what I thought should be a simple task.

eduardo

Sorting and Searching (http://www1.fatbrain.com/asp/bookinfo/bookinfo.asp?theisbn=0201896850&vm=)

That which has a key
That which does not have a key

Create a database with primary keys
Create DBM's which you tie
Create keyed index files
Use some pre-built system (it's amazing what's out there)

Google

Glimpse is an amazing product for full text searching
ht://dig is also pretty good

Re: Searching module
by eg (Friar) on Jan 29, 2001 at 23:31 UTC

How about ht://dig or webglimpse? An indexing scheme will always be faster than a straight linear search through all of your data.
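
To make the difference concrete, here is a minimal in-memory sketch, not taken from either tool: the sample documents, the ids, and the sub names linear_search and indexed_search are all made up for illustration. The linear version rescans every document on each query, while the indexed version builds a word-to-ids hash once and then answers with a single lookup.

```perl
use strict;
use warnings;

# Hypothetical sample documents, keyed by an id of our choosing.
my %docs = (
    1 => 'Pearl Jam released Ten',
    2 => 'Searching text with an index',
    3 => 'Ten ways to index text',
);

# Linear search: scan every document for the word on each query.
sub linear_search {
    my ($word) = @_;
    return sort { $a <=> $b }
        grep { $docs{$_} =~ /\b\Q$word\E\b/i } keys %docs;
}

# Inverted index: built once, maps each lowercased word to the ids
# of the documents that contain it.
my %index;
for my $id (keys %docs) {
    my %seen;
    for my $word (grep { length } split /\W+/, lc $docs{$id}) {
        push @{ $index{$word} }, $id unless $seen{$word}++;
    }
}

# Indexed search: a single hash lookup instead of a scan.
sub indexed_search {
    my ($word) = @_;
    return sort { $a <=> $b } @{ $index{ lc $word } || [] };
}

print join(',', indexed_search('ten')), "\n";
```

Both subs return the same ids for a whole-word query; the index only pays off once the collection is large enough that rescanning it per query hurts.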


Re: Searching module
by lemming (Priest) on Jan 30, 2001 at 01:20 UTC

Search Engine Theory

Search engine

Can dynamic sites be parsed by search engine robots?

Miniature search engine

eduardo


Re: Searching module
by baku (Scribe) on Jan 30, 2001 at 01:47 UTC

All due deference to the learned ones, and keeping in mind that you don't want a database: use a "database."

Actually, look into using a tie()'d hash to several (Berkeley DB or similar) files: e.g.

 use DB_File;

 tie %artist, 'DB_File', "$vardir/artist.db";
 tie %album, 'DB_File', "$vardir/album.db";
 tie %track, 'DB_File', "$vardir/track.db";

This means your indexer can do something like:

 # tie to new files to keep from accidentally
 # re-using old values and to not update the db
 # while it may be being read by the search client

 use DB_File;

 my $vardir = "/var/music"; # whatever

 tie %artist, 'DB_File', "$vardir/.#artist.db#";
 tie %album, 'DB_File', "$vardir/.#album.db#";
 tie %track, 'DB_File', "$vardir/.#track.db#";
 tie %by_id, 'DB_File', "$vardir/.#by_id.db#";
 tie %keyword, 'DB_File', "$vardir/.#keyword.db#";

 my $id = 0;

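 open INDEX, '<', "$vardir/my_ascii_index.csv" or
   die "can't index if I can't read the index: $!";

 # read one line at a time rather than slurping the whole file
 while (my $line = <INDEX>)
 {
   chomp $line;
   # the parentheses matter: without them only $this_artist is declared
   my ($this_artist, $this_album, @album_tracks) = split /,/, $line;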
   $artist{$this_artist} .= $id . ',';
   $album{$this_album} .= $id . ',';
   for my $this_track (@album_tracks)
   {
     $track{$this_track} .= $id . ',';
   }
   $by_id{$id} = join "\x00", $this_artist, $this_album,
     @album_tracks;
   # split ' ' skips runs of whitespace, so no empty keywords get indexed
   for my $word (split ' ', join (" ", $this_artist,
     $this_album, @album_tracks) )
   {
     $keyword{$word} .= $id . ',';
   }
   $id++;
 }

 close INDEX;
 untie %album;
 untie %artist;
 untie %by_id;
 untie %track;
 untie %keyword;

#[the data file assumed above would read like:
# Pearl Jam,ten,Jeremy,Black,... 
# and could be created in Gnumeric or Excel as a CSV file]

That's really nasty, not to mention probably very inefficient, but could be easy to adapt to your particular inputs...
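
The indexer's comment about tying to new files implies a last step it never spells out: moving the freshly built files over the live ones. A minimal sketch of that step follows, assuming the live files are named name.db next to the ".#name.db#" temp files; the sub name publish_indexes is made up here.

```perl
use strict;
use warnings;

# Move each freshly built ".#name.db#" file over the live "name.db".
# On POSIX systems rename() within one filesystem replaces the target
# in a single step, so a reader opens either the old file or the new one.
sub publish_indexes {
    my ($dir, @names) = @_;
    for my $name (@names) {
        my $tmp  = "$dir/.#$name.db#";
        my $live = "$dir/$name.db";
        rename $tmp, $live
            or die "can't rename $tmp to $live: $!";
    }
    return scalar @names;
}

# Example call, using the directory and names from the indexer:
# publish_indexes('/var/music', qw(artist album track by_id keyword));
```

Call it only after every hash has been untied, so that the files are flushed and closed before they are moved.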

Then, to do a search query, do something like:


# use CGI and get your query words in whatever form
# load them into e.g. $artist_query, $title_query, &c.
# (this fragment is assumed to live inside a sub, hence the returns)

my @result_ids = ();

# each index value is a comma-joined id string such as "0,3,",
# so split it into individual ids before collecting them
if ($artist{$artist_query}) { 
 push @result_ids, split /,/, $artist{$artist_query};
}
if ($track{$track_query}) { 
 push @result_ids, split /,/, $track{$track_query};
}
if ($album{$album_query}) {
 push @result_ids, split /,/, $album{$album_query};
}
for my $word (split ' ', $keyword_query)
{
 if ($keyword{$word}) {
  push @result_ids, split /,/, $keyword{$word};
 }
}
unless (@result_ids) { 
 print "<h1> No results </h1>"; return;
}
print "<h1> Found " . (scalar @result_ids) . ": </h1>
<ol type=1>
";
for my $id (@result_ids)
{
 my ($artist, $album, @tracks) = split /\x00/, $by_id{$id};
 print "<li> <big>
 <a 
href=\"http://somewhere/interesting/lookup_id.pl?$id\">$album</a>
 </big> by $artist <br>
 <small> <ol type=1>
";
 for my $track (@tracks)
 {
   print " <li> $track </li>\n";
 }
 print "</ol></li>\n\n";
}
print "\n</ol>\n";
return;

Again, really nasty, but quick and simple. It does not allow any kind of search except an exact match on artist, track, or album, or a keyword, which must match a whole word exactly but can occur in any field.
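
One wrinkle with this layout is that the same album can turn up under several indexes at once. Since each index value is a comma-joined id string such as "0,3,", a hedged sketch of merging the hits might look like the following; the sub name rank_ids and the "most hits first" ordering are assumptions for illustration, not part of the code above.

```perl
use strict;
use warnings;

# Take any number of id strings ("0,3,") and return each id once,
# ordered by how many strings mention it (most hits first, then by id).
sub rank_ids {
    my @postings = @_;
    my %hits;
    for my $posting (grep { defined($_) && length($_) } @postings) {
        $hits{$_}++ for grep { length } split /,/, $posting;
    }
    return sort { $hits{$b} <=> $hits{$a} || $a <=> $b } keys %hits;
}

my @ranked = rank_ids('0,3,', '3,', '5,3,0,');
print "@ranked\n";
```

Feeding it the artist, album, track, and keyword lookups together gives a de-duplicated result list in which records matching several query fields come first.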

As eduardo pointed out, anything more complex, go ahead and use a 'real' search system. The only advantage to this structure is that it allows for an 'advanced search' or similar:

Enter keyword(s): ________

<menu> Advanced search:

Artist (exact name): ______
Album Title (exact name): ______
Track Title (exact name): ______

</menu> Submit


Re: Searching module
by ZydecoSue (Scribe) on Jan 30, 2001 at 02:10 UTC

Yes, I'm aware of merlyn's articles. The first one didn't seem completely appropriate for it doesn't use indexes (apologies for daring to criticize, but) and the second presumes that your site is interesting to the search engines (and therefore visited). Mine isn't. Yes: I've submitted it, there's a robots.txt file, there are carefully chosen META tags and keywords. So far, they've come, but not listed me. So, that lets out the second approach.

I'm not trying to reinvent the wheel. I'm trying to find a right-sized wheel that fits my needs.

I looked at the suggested packages. One wants more money than I can afford (yes, there's a crippled version for free; however, their licensing seems a bit screwy).

Another is open source, but it's written in C. I'd prefer to find a Perl solution if possible, so I can learn from it.

The links provided by lemming are promising. I'll try to work something out of those.

baku's sample is interesting, but is taking my album example a little too seriously. :)

In reality, I'm looking to index a large number of free-form text documents and a companion program to search those indexes, preferably something that uses proper style. For example, something that uses warnings, strict, and taint mode.
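
To illustrate what "proper style" would mean for the search side, here is a minimal sketch of vetting a query string under strict and warnings. The sub name untaint_query and the set of allowed characters are assumptions chosen for the example; a real script would pick its own pattern.

```perl
use strict;
use warnings;

# Meant to be run with taint mode on, e.g. "perl -T search.pl";
# the -T switch has to be given at startup.

# Accept only word characters, spaces, apostrophes and hyphens, up to
# 100 characters. Returning the regex capture (not the input itself)
# is what clears the taint flag under -T.
sub untaint_query {
    my ($raw) = @_;
    return unless defined $raw;
    if ($raw =~ /\A([\w' -]{1,100})\z/) {
        return $1;
    }
    return;    # reject anything else
}

my $query = untaint_query("pearl jam");
print defined $query ? "ok: $query\n" : "rejected\n";
```

Anything that fails the pattern comes back undefined, so the caller can refuse the query instead of passing it on to the index lookup.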

I'd really appreciate it if this companion also provided support for soundex, word proximity, and root words, e.g. knowing that "search" should hit "searching," "searches," and so on.
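
For illustration only, here is roughly what root words and word proximity amount to on top of a keyword index. The suffix stripping below is deliberately naive (it is not a real stemming algorithm and will mangle some words), and the names naive_stem, stem_positions, and near are made up for this sketch.

```perl
use strict;
use warnings;

# Very naive root-word reduction: lowercase, then strip one common
# suffix from words longer than four letters.
sub naive_stem {
    my ($word) = @_;
    $word = lc $word;
    $word =~ s/(?:ing|es|ed|s)\z// if length($word) > 4;
    return $word;
}

# Record the word positions of each stem in a text.
sub stem_positions {
    my ($text) = @_;
    my %pos;
    my $i = 0;
    for my $word (grep { length } split /\W+/, $text) {
        push @{ $pos{ naive_stem($word) } }, $i++;
    }
    return \%pos;
}

# True if the stems of $w1 and $w2 occur within $max words of each other.
sub near {
    my ($pos, $w1, $w2, $max) = @_;
    for my $p (@{ $pos->{ naive_stem($w1) } || [] }) {
        for my $q (@{ $pos->{ naive_stem($w2) } || [] }) {
            return 1 if abs($p - $q) <= $max;
        }
    }
    return 0;
}

my $pos = stem_positions('Searching large text archives quickly');
print near($pos, 'searches', 'text', 2) ? "near\n" : "not near\n";
```

Because both the indexed words and the query words pass through the same naive_stem, "search", "searches", and "searching" all land on the same key.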

And, most important, I'm looking for something that you folks respect. I really don't want to have to try to rewrite stuff from Matt's Script Archive. Not only am I not that experienced, but I'm not sure I'd know where to start (other than the bits I already mentioned).

Update: I just realized that you might think I'm asking you folks to write this. I'm not, but I am asking if such a thing has already been written.

Again, thanks for your assistance.


Re: Searching module
by markwild (Sexton) on Jan 30, 2001 at 01:38 UTC

this script

improved version


Re: Searching module
by Maclir (Curate) on Jan 30, 2001 at 03:00 UTC

http://www.transport.nsw.gov.au/cgi-bin/swish-cgi.pl

Not too hard to get going.

Get the software from here:

http://sunsite.berkeley.edu/SWISH-E/Manual/quickstart.html


Re: Searching module
by dash2 (Hermit) on Jan 30, 2001 at 16:08 UTC

Atomz

Dave


