in reply to RadicalMatterDotCom

Sure, I'd love to see the perl equivalent of google's linux search pool.

Are you sure that dynamic content isn't going to bite you? It gets pretty tricky w/o customizing the spider for each dynamic site... For instance, are you going to spider each of the 100,000 pages of the form:

http://www.perlmonks.org/index.pl?node_id=83485&lastnode_id=1
http://www.perlmonks.org/index.pl?node_id=83485&lastnode_id=1234
http://www.perlmonks.org/index.pl?node_id=83485&lastnode_id=12345678
etc...

Unless your bot is smart enough to know that perlmonks is "loosely unique" on node/node_id values, you could wind up in a nearly infinite loop. Just looking at the node_id/lastnode_id combo, there are something like 100,000 * 100,000 combinations.

Of course, if you have a custom spidering routine for each domain, you can put enough brains into your spider to avoid this problem... but that severely limits the number of domains you can index.
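A lighter-weight middle ground than a fully custom routine per domain is a per-site canonicalization rule. A minimal sketch in Perl (the rule shown for perlmonks.org — treating node_id alone as unique and dropping lastnode_id — is an assumption for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Per-site canonicalization rules: map a URL to the form the spider
# should treat as unique. For perlmonks.org we assume node_id alone
# identifies a page, so the navigational lastnode_id is stripped.
my %canon_rule = (
    'www.perlmonks.org' => sub {
        my ($url) = @_;
        $url =~ s/[&;]lastnode_id=\d+//g;   # drop the navigation param
        return $url;
    },
);

my %seen;   # canonical URLs already queued

sub should_spider {
    my ($url) = @_;
    my ($host) = $url =~ m{^https?://([^/]+)}i;
    my $rule  = $host ? $canon_rule{ lc $host } : undef;
    my $canon = $rule ? $rule->($url) : $url;
    return 0 if $seen{$canon}++;    # already visited this node
    return 1;
}
```

With a rule like this, the 100,000 * 100,000 node_id/lastnode_id combinations collapse back to one fetch per node_id — at the cost of maintaining one small rule per troublesome site rather than a whole custom spider.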

I've done a similar thing over at Making perlmonks search engine friendly if that helps.

-Blake

Re: Re: RadicalMatterDotCom
by little (Curate) on Aug 23, 2001 at 02:16 UTC
    A spider does not need to be a registered user, so there are only the nodes, since you can skip the "lastnode_id" part. The bigger problem is the duplication of information, because the display of a node also includes its replies. So if the spider recognizes single nodes and can sort them out to avoid multiple lookups and storing everything n times, where n is the depth of "Re:" replies to a node, it could work.
    Anyhow, for archiving purposes, to make perlmonks more easily searchable, and to allow for better categorization, perlmonks.org might need something like:
    http://www.perlmonks.org/index.pl?node_id=83485&view_mode=plain_txt
    so the spider does not get tangled up in all the dynamic content such as nodelets and menubars.
    And I bet that the Everything engine has such a feature, even if it's hidden somewhere deep and only used for debugging or so.
    But, sincere apologies: as long as this loads work onto vroom, better to write a really good spider.
    Yes, I like your idea a lot, because I believe all that's been posted until now would make up a PerlMonks bookshelf of tips, traps, tricks and so on. (Well, except meditations and discussions, but those contents you could offer to mindspring.com or philosophy.org for linking.) {grin}

    Have a nice day
    All decision is left to your taste
      Seems like you've proved my point though. Any spider that successfully indexes perlmonks will have to be specially customized for the site. Or, said another way, perlmonks is not friendly to the general search spider.

      Oh, and I think you might be looking for DisplayType Raw

      -Blake

        Well, at this point I agree. :-)
        BUT, if you register:
        http://www.perlmonks.org/index.html
        a document that does not exist yet, and create an alias for that URI in the Apache config, then you can dynamically generate a page containing only the keywords for each node, as you can see them in one of the nodelets. Make those keywords links to Search or Super Search, and you have ONE document listed at Google; from there the user can search perlmonks, once he has seen that what he is looking for is possibly here, without knowing that he will initiate a search when clicking one of those links. And apparently the spider will notice that the link targets a script and stop proceeding. Is more than that needed?
        Well, I don't think so :-)
        And OK, that would make it more user-friendly than it is currently. :-)
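        For what it's worth, the wiring could look something like this: an Apache line such as `ScriptAlias /index.html /path/to/keyword_page.pl` maps the registered static-looking URL to a CGI, and the CGI emits one plain page of keyword links. Everything below (the keyword list, the `node=` search fallback) is a sketch of the idea, not perlmonks' actual setup:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build one static-looking HTML page whose links all lead into the
# site's search, so a spider indexes a single document and stops there.
sub keyword_page {
    my (@keywords) = @_;
    my $html = "<html><head><title>PerlMonks keywords</title></head><body>\n";
    for my $kw (@keywords) {
        # index.pl?node=... falls back to a search when no node matches
        $html .= qq{<a href="http://www.perlmonks.org/index.pl?node=$kw">$kw</a><br>\n};
    }
    return $html . "</body></html>\n";
}

# CGI entry point; the keyword list here is a placeholder
print "Content-type: text/html\n\n";
print keyword_page(qw(regex CGI modules references sorting));
```

        A user clicking a keyword lands on a search result without realizing it, while the spider sees only one static document and never recurses into the dynamic pages.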

        Have a nice day
        All decision is left to your taste