PerlMonks  

Re: Templated Web Sites and Search Engines

by btrott (Parson)
on May 01, 2001 at 08:58 UTC ( [id://76873]=note )


in reply to Templated Web Sites and Search Engines

Swish-E supports retrieval of docs through HTTP rather than on the filesystem. Take a look at the Spidering section of the manual.

In fact, the implementation is done through LWP. I haven't looked at it beyond noticing that, although I do remember thinking at the time that it seemed a bit kludgy to have to invoke a Perl script to spider the site. Presumably there's quite a bit of interaction between the main C source, the system, and the Perl program that could be avoided by having an actual HTTP implementation inline.
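To make the idea of spidering over HTTP concrete, here is a rough sketch of a same-host crawler. It is not Swish-E's spider (which is a Perl script built on LWP); it is written in Python purely for illustration, and the names `LinkExtractor`, `spider`, and the injected `fetch` callable are all hypothetical. The fetcher is passed in so the crawl logic can be tried without touching the network.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def spider(start_url, fetch, max_pages=100):
    """Breadth-first crawl restricted to the host of start_url.

    fetch(url) is an assumed callable returning the page's HTML as a
    string, or None if the page could not be retrieved.
    Returns a dict mapping each fetched URL to its HTML.
    """
    host = urlparse(start_url).netloc
    queue = [start_url]
    seen = {start_url}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.pop(0)
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment.
            link = urljoin(url, href).split("#")[0]
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

In real use, `fetch` would wrap an HTTP client; each returned page would then be handed to the indexer instead of being read from the filesystem.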


Replies are listed 'Best First'.
Re: Re: Templated Web Sites and Search Engines
by Maclir (Curate) on May 01, 2001 at 09:50 UTC
    Doh!!! A big ++ to btrott. Of course, as I read your reply, it triggered a memory that there is an HTTP section in the swish-e config file.

    Thank you for reminding me to RTFM.

Re: Re: Templated Web Sites and Search Engines
by DrZaius (Monk) on May 01, 2001 at 18:58 UTC
    So we meet again :)

    Anywho, what if I have a template-driven website where those templates can take on a few million or so different permutations? For example, a well-used message board.

    I have designed at least one website that had one search function for one type of content and another for the message board. Depending on your indexer, a site like this could take a few days to index.

    Are there any search engines that allow you to macro in how to find the data and how to write the URL for it?

      For this type of need you may have the best luck rolling your own engine, probably using something like Search::InvertedIndex. That gives you the most flexibility: you can customize exactly *what* gets indexed (i.e., just the content and title of the message board posts) and, presumably, the URLs associated with each of the index entries.
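As a rough illustration of the roll-your-own approach, here is a minimal in-memory inverted index. It is a Python sketch, not the Search::InvertedIndex API; the class, its methods, and the `URL_TEMPLATE` pattern are all made up for the example. It indexes only the title and body of each post and builds the result URL from a template, so the index never needs to crawl the rendered pages.

```python
import re
from collections import defaultdict

# Hypothetical URL pattern for a message-board post; adjust to the real site.
URL_TEMPLATE = "/board/{board}/post/{post_id}"


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class InvertedIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of post ids
        self.urls = {}                    # post id -> URL for that post

    def add_post(self, post_id, board, title, body):
        """Index only the title and body; the URL comes from the template."""
        self.urls[post_id] = URL_TEMPLATE.format(board=board, post_id=post_id)
        for term in tokenize(title + " " + body):
            self.postings[term].add(post_id)

    def search(self, query):
        """Return URLs of posts containing every query term, ordered by post id."""
        terms = tokenize(query)
        if not terms:
            return []
        hits = set.intersection(*(self.postings.get(t, set()) for t in terms))
        return [self.urls[p] for p in sorted(hits)]
```

Feeding it straight from the database of posts (rather than from spidered HTML) sidesteps the millions of template permutations: each post is indexed once, and new posts can be added as they arrive instead of re-indexing the whole site.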
