While this may not strictly be a Perl question, it relates to having web sites built with a variety of "template" tools, such as HTML::Mason, Template Toolkit and EmbPerl::Object. My particular question concerns how you ensure that any site-internal search engine still returns sensible results.

The site I manage has been using conventional (hand-crafted) web pages since its inception, and we now have over 300 pages. We have a site search function, using the popular Swish-e tool. This is a C program that is kicked off by a cron job each night; it scans each file in the server document tree and builds the search indexes. When a person searches our site, they are given (hopefully) a list of pages, identified by document title, that is, the text between the <title> and </title> tags.

Now, since we are about to use EmbPerl::Object to make the site far easier to manage, each page file contains only the guts of the page as HTML, with standard Embperl files generating the common page headers and so on. Any browser (or spider) getting pages through our server is delivered the complete HTML, with titles, body and so on. No problem there. But Swish-e, which runs the index generation outside of the web server, only sees the "raw" files. Hence, even though it indexes all the searchable text, there are no title tags in any content file, so the search results can no longer be identified by title.

Have other people faced this problem? Is there a version of Swish-e, or something similar, that can be scheduled on a regular basis but indexes documents retrieved through the web server itself? I am sure this could be done with LWP, but I do not want to reinvent the wheel . . .
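
To make the idea concrete, here is a rough sketch of what "index through the web server" would mean. It is not Swish-e functionality and not the LWP version I had in mind; it uses Python's standard library purely for illustration. The names `TitleParser`, `extract_title` and `fetch_title` are made up for this example. The point is only that a page fetched over HTTP carries the `<title>` that the raw content file lacks.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TitleParser(HTMLParser):
    """Collects the text found between <title> and </title>."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def extract_title(html):
    """Return the page title with whitespace collapsed, or '' if there is none."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


def fetch_title(url):
    """Fetch a page through the web server (hypothetical URL) and return its title."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return extract_title(html)
```

Run against a raw content file, `extract_title` returns an empty string; run against the same page as served, it returns the title the templates added. A real indexer would also need to follow links and hand the fetched text to the search tool, which is the wheel I would rather not reinvent.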

How do those sites with large content management systems provide this search capability?


In reply to Templated Web Sites and Search Engines by Maclir
