Maclir has asked for the wisdom of the Perl Monks concerning the following question:

Now, while this may not be strictly a Perl question, it relates to having web sites built with a variety of "template" tools, such as HTML::Mason, Template Toolkit, and EmbPerl::Object. My particular question concerns how you ensure the site's internal search engine still returns sensible results.

The site I manage has been using conventional (hand-crafted) web pages since its inception, and we now have over 300 pages. We have a site search function using the popular Swish-e tool. This is a C program that is kicked off by a cron job each night; it scans each file in the server document tree and builds the search indexes. When a person searches our site, they are given (hopefully) a list of pages, identified by document title - that is, the text between the <title> and </title> tags.
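For the curious, the nightly run is nothing fancy; it looks roughly like this (the paths are made up for the example, and the directive names are from memory of the swish-e config, so check them against the documentation):

    # crontab entry: reindex the document tree at 2am
    0 2 * * * /usr/local/bin/swish-e -c /usr/local/etc/swish-e/site.conf

    # site.conf (simplified)
    IndexDir  /var/www/htdocs
    IndexFile /var/www/swish/site.index
    IndexOnly .html .htm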

Now, since we are about to use EmbPerl::Object to make the site far easier to manage, each page file contains only the guts of the page as HTML, with shared Embperl files supplying the standard page headers and so on. Any browser (or spider) requesting pages through our server is delivered the complete HTML, with title, body and so on. No problem there. But swish-e, which runs the index generation outside of the web server, only sees the "raw" files. Hence, even though it indexes all the searchable text, there are no title tags in each content file.
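To illustrate, the base template does something like the following (heavily simplified; how the title actually gets set for each page is glossed over here), while each content file holds only the body markup:

    [# base template: wraps every content file, which has no <html> or <title> of its own #]
    <html>
    <head><title>[+ $title +]</title></head>
    <body>
    [- Execute('*') -]  [# pull in the requested content file here #]
    </body>
    </html>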

Have other people faced this problem? Is there a version of swish-e - or something similar - that can be scheduled on a regular basis, but indexes documents retrieved through the web server itself? I am sure this could be done with LWP, but I don't want to reinvent the wheel . . .
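The sort of thing I was imagining is sketched below (the URLs and paths are invented; this just renders each page through the server and drops the result where swish-e can index it):

    #!/usr/bin/perl -w
    use strict;
    use LWP::Simple;

    # Fetch each page through the web server so the templates fill in the
    # <title>, then save the rendered HTML for swish-e to index.
    my @urls = (
        'http://www.example.com/index.html',
        'http://www.example.com/about.html',
    );

    foreach my $url (@urls) {
        my $html = get($url);
        next unless defined $html;
        (my $name = $url) =~ s{^http://[^/]+/}{};
        $name =~ s{/}{_}g;
        open my $fh, '>', "/var/www/rendered/$name"
            or die "can't write $name: $!";
        print $fh $html;
        close $fh;
    }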

How do those sites with large content management systems provide this search capability?

Replies are listed 'Best First'.
Re: Templated Web Sites and Search Engines
by btrott (Parson) on May 01, 2001 at 08:58 UTC
    Swish-E supports retrieval of docs through HTTP rather than from the filesystem. Take a look at the Spidering section of the manual.

    In fact, the implementation is done through LWP. I haven't looked at it much beyond noticing that, though I do remember thinking at the time that it seemed a bit kludgy to have to invoke a Perl script to spider the site. Presumably there's quite a bit of interaction between the main C source, the system, and the Perl program that could be avoided by having an actual HTTP implementation built in.
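    From memory (so double-check against the Spidering section), the config ends up looking something like this, run with the http access method (swish-e -S http -c spider.conf):

        # spider.conf (directive names from memory; treat as illustrative)
        IndexDir   http://www.example.com/index.html
        IndexFile  /var/www/swish/site.index
        MaxDepth   5    # how many links deep to follow
        Delay      1    # pause between requests, to be kind to your own server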

      Doh!!! A big ++ to btrott. Of course, as I read your reply, it triggered a memory that there was an HTTP section in the swish-e config file.

      Thank you for reminding me to RTFM.

      So we meet again :)

      Anywho, what if I have a template-driven website where those templates can take on a few million or so different permutations? For example, a well-used message board.

      I have designed at least one website that had one search function for one type of content and another for the message board. Depending on the amount of content, this could take a few days to index.

      Are there any search engines that allow you to macro in how to find the data and how to write the URL for it?

        For this type of need you may have the best luck rolling your own engine, probably using something like Search::InvertedIndex. This gives you maximum flexibility: you can customize exactly *what* gets indexed (i.e. just the content and title of the message board posts) and, presumably, the URLs associated with each of the index entries.
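        If you do roll your own, the core idea is small enough to sketch in plain Perl; Search::InvertedIndex wraps the same idea with real storage and boolean queries. (The post fetching and URL scheme here are invented, just to show the shape of it.)

            use strict;

            # %index maps each word to the set of post IDs containing it.
            my %index;

            # Add one message board post to the inverted index.
            sub index_post {
                my ($post_id, $title, $body) = @_;
                for my $word (map { lc } "$title $body" =~ /(\w+)/g) {
                    $index{$word}{$post_id} = 1;
                }
            }

            # Return the post IDs that contain every query word.
            sub search_posts {
                my @words = map { lc } @_;
                my $first = shift @words;
                my @hits;
                for my $id (keys %{ $index{$first} || {} }) {
                    push @hits, $id unless grep { !$index{$_}{$id} } @words;
                }
                return @hits;
            }

            # The URL is then just a formatting rule you control:
            # print "http://example.com/board/post.html?id=$_\n" for search_posts('foo', 'bar');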
Re: Templated Web Sites and Search Engines
by asiufy (Monk) on May 02, 2001 at 07:18 UTC
    Where are you storing the actual content? If it's in a database, you can also try to conduct the search directly against the database, with the search returning links to dynamically generated pages that contain whatever was searched for.
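    Something along these lines (just a sketch; the table and column names are invented):

        use strict;
        use DBI;

        # Search the stored page content directly; the result links point at
        # the dynamically generated pages, built from the row IDs.
        my $dbh = DBI->connect('dbi:mysql:site', 'user', 'password',
                               { RaiseError => 1 });

        my $term = '%' . $ARGV[0] . '%';
        my $sth  = $dbh->prepare(
            'SELECT id, title FROM pages WHERE title LIKE ? OR body LIKE ?');
        $sth->execute($term, $term);

        while (my $row = $sth->fetchrow_hashref) {
            print qq{<a href="/page.html?id=$row->{id}">$row->{title}</a><br>\n};
        }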