PerlMonks  

Can dynamic sites be parsed by search engine robots?

by footpad (Abbot)
on Oct 08, 2000 at 07:33 UTC [id://35789]

footpad has asked for the wisdom of the Perl Monks concerning the following question:

The penitent asks:

Perhaps I am overly ambitious, but is there a way to verify that my Perl-based site has been visited by the search engines I've submitted it to?

I believe I have coverage not found elsewhere and, yet, my site does not appear. I use META tags. I use Titles. I have carefully chosen keywords that accurately reflect my coverage.

This is not necessarily a Perl question; however, I have used Perl to implement my site. The actual source pages are nothing more than Perl scripts containing variable definitions and content. The source files are parsed when requested.

The content delivered to the end-user's browser is a proper, correct, and complete web page.

Is it possible that the robots are not parsing my pages? If so, can someone advise me how to correct this?

I confess that I am a Windows programmer with little experience with *nix and admit that I may have overreached in my ambition.


Replies are listed 'Best First'.
Re: Can dynamic sites be parsed by search engine robots?
by rlk (Pilgrim) on Oct 08, 2000 at 08:03 UTC

    Warning: The following is second or thirdhand information. I make no claims as to whether it is accurate or complete, other than I got it from someone who deals with search engine related stuff for a living.

    One thing that search engines apparently avoid like the plague is pages with URLs like

    http://www.example.com/index.pl?foo=bar

    If the pages you've submitted look like this, that may be your problem. You may want to throw up a static page or two for purposes of being indexed.

    Also, make sure you don't have any restrictions in robots.txt that could block the pages from being indexed. If you don't have a robots.txt file, it couldn't hurt to throw one up that explicitly allows it.
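
    For what it's worth, a minimal explicitly permissive robots.txt looks like this (an empty Disallow means nothing is disallowed, i.e. everything may be indexed):

    ```
    User-agent: *
    Disallow:
    ```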

    Finally, I don't know how long you've been waiting, but if it's less than a week (or even two), be patient. Search engines take a while to incorporate submitted pages into their databases.

    --
    Ryan Koppenhaver, Aspiring Perl Hacker
    "I ask for so little. Just fear me, love me, do as I say and I will be your slave."

Re: Can dynamic sites be parsed by search engine robots?
by nop (Hermit) on Oct 08, 2000 at 21:07 UTC
    Philip Greenspun discusses this issue here:
    http://philip.greenspun.com/wtr/dead-trees/53005.htm
    An excerpt:

    I managed to hide a tremendous amount of my content from the search engines by stupidity in a different direction. I built a question and answer forum for http://photo.net/photo. Because I'm not completely stupid, all the postings were stored in a relational database. I used the AOLServer with its brilliant TCL API to get the data out of the database and onto the Web. The cleanest way to develop software in this API is to create files named "foobar.tcl" among one's document tree. The URLs end up looking like "http://photo.net/bboard/fetch-msg.tcl?msg_id=000037".

    So far so good.

    AltaVista comes along and says, "Look at that question mark. Look at the strange .tcl extension. This looks like a CGI script to me. I'm going to be nice and not follow this link even though there is no robots.txt file to discourage me."

    Then WebCrawler says the same thing.

    Then Lycos.

    I achieved oblivion.

    Then I had a notion that I developed into a concept and finally programmed into an idea: Write another AOLServer TCL program that presents all the messages from URLs that look like static files, e.g., "/fetch-msg-000037.html" and point the search engines to a huge page of links like that. The text of the Q&A forum postings will get indexed out of these pseudo-static files and yet I can retain the user pages with their *.tcl URLs. I could convert the user pages to *.html URLs but then it would be more tedious to make changes to the software (see my discussion of why the AOLserver *.tcl URLs are so good in the next chapter).
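
    Greenspun's workaround can be sketched in Perl (a hypothetical hand-rolled version, since footpad's site is Perl rather than AOLserver Tcl). The URL pattern and the routing are assumptions; the point is only that the script recognizes a static-looking URL and serves the dynamic content behind it:

    ```perl
    #!/usr/bin/perl
    # Hypothetical sketch of the pseudo-static-URL trick in Perl.
    # Assumes the web server routes requests for /fetch-msg-*.html to this
    # script (e.g. via an Apache AliasMatch or ErrorDocument handler).
    use strict;

    # Pull the message id out of a static-looking URL, or return undef.
    sub msg_id_from_uri {
        my ($uri) = @_;
        return $uri =~ m{/fetch-msg-(\d+)\.html$} ? $1 : undef;
    }

    my $id = msg_id_from_uri($ENV{REQUEST_URI} || '');
    if (defined $id) {
        # A real script would fetch posting $id from the database here.
        print "Content-type: text/html\n\n";
        print "<html><body>Message $id goes here</body></html>\n";
    } else {
        print "Status: 404 Not Found\nContent-type: text/plain\n\nnot found\n";
    }
    ```

    The robots see plain *.html links, while the same script keeps serving the dynamic pages.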
      Just goes to show you that it's not very smart to tie a URL's form to its function. All the HTML on my website is dynamically generated, yet I don't have .cgi on the end of any file (and especially not .tcl {grin}).

      From a security perspective, revealing that something lives in /cgi or ends in .cgi, .pl, or .tcl is dangerous, as it gives an attacker a hint of the implementation language, which can permit more rapid selection of automated attack tools.

      -- Randal L. Schwartz, Perl hacker

RE: Can dynamic sites be parsed by search engine robots?
by little (Curate) on Oct 08, 2000 at 09:58 UTC
    footpad,
    if you registered your site, a bot will try to open it, as it would with any available domain, e.g. "www.mypersonalweb.com/personal/info/about/me/".
    But note that if your provider uses Apache, the client browser (or in this case the bot) may get an "access denied" if the trailing slash is omitted when the URL does not point to the domain root.
    So first you should check with a browser that you actually get the document specified by the URL you registered at the search engine.
    Normally the search bots will look for robots meta tags, where you can specify with "follow" or "nofollow" whether a bot shall follow up the links in the page.
    It is important to note that this does not concern the directory structure of your site, because the bots are normal HTTP clients.
    So you might also check whether there is a "robots.txt" file in your site's root directory :-)
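
    For reference, a robots meta tag of the kind described above goes in the page's head; this is a generic sketch, not footpad's actual markup:

    ```
    <meta name="robots" content="index,follow">
    ```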
    To check whether a given client is requesting a document from your web server, have a look at one of merlyn's columns for WebTechniques, where he describes how to generate access logs. Reading through that code, you will easily figure out which four lines of code to include in the CGI script that responds to the HTTP requests for the URL you registered.
    Sorry for "talking around the corner", but my English is not as good as it should be.
    Have a nice day
    All decision is left to your taste
RE: Can dynamic sites be parsed by search engine robots? (webserver logs)
by ybiC (Prior) on Oct 08, 2000 at 07:55 UTC
    If they're hitting your site, I'd expect to see 'bots reported in your webserver agent logs (if it's configured for agents).

    Unless I'm mistaken, it could take weeks or even months for search bots to hit you after you submit your URL.
        cheers,
        ybiC
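
    As a sketch of what to look for in those logs (the log path and robot names here are assumptions; the big engines of the day identify themselves in the User-Agent field, e.g. Googlebot, AltaVista's Scooter, Inktomi's Slurp):

    ```perl
    #!/usr/bin/perl
    # Rough sketch: pick robot visits out of an Apache "combined" format
    # access log, where the last quoted field is the user agent.
    use strict;

    # True if a combined-format log line ends in a robot-looking agent.
    sub is_bot_line {
        my ($line) = @_;
        my ($agent) = $line =~ /"([^"]*)"\s*$/;   # last quoted field
        return defined($agent) && $agent =~ /googlebot|scooter|slurp|crawl|spider/i;
    }

    my $log = shift @ARGV || '/var/log/apache/access_log';
    if (open my $fh, '<', $log) {
        while (my $line = <$fh>) {
            print $line if is_bot_line($line);
        }
    }
    ```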

Re: Can dynamic sites be parsed by search engine robots?
by adamsj (Hermit) on Oct 08, 2000 at 07:56 UTC
    I'm not an expert on search engines, but I don't see why they wouldn't be looking at your page--and if you've put your HTML together right (addendum: even if you haven't), Windows vs. UNIX shouldn't make a difference.

    How long since you submitted your pages to the search engines is probably the relevant question. With that knowledge, someone more expert than myself can give you a more informed answer.

Re: Can dynamic sites be parsed by search engine robots?
by Trimbach (Curate) on Oct 08, 2000 at 21:20 UTC
    Here's a thought: to guarantee that the default output of your CGI gets cataloged by the search engines, why not create an index.html file that contains a Server-side include to your CGI? That way, whenever the robots hit that page, your web server "includes" the default output of your CGI (which is, like you say, just static HTML) and then spits the whole thing back to the robot. To the robot it looks like a regular web page, so it gets indexed and cataloged and All is Right in the World.

    Of course, this assumes you have access to Server-side includes. Check out the Apache docs on SSI at http://www.apache.org/docs-1.2/mod/mod_include.html for more information. Even if you're serving from a Windows web server, there should be similar procedures for enabling SSIs.
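
    A minimal sketch of that index page, assuming the CGI lives at /cgi-bin/index.pl and the server is configured to parse the file for includes:

    ```
    <!--#include virtual="/cgi-bin/index.pl" -->
    ```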

    Gary Blackburn
    Trained Killer

Re: Can dynamic sites be parsed by search engine robots?
by rdnzl (Sexton) on Oct 08, 2000 at 09:44 UTC
      this is not correct:
      robots.txt allows you to restrict or control how robots index your site; it is not necessary for making them work!

        Absolutes are so dangerous, usually =)

        Many robots act differently if there is an active robots.txt file in place. Many use a more liberal tree walker and will follow some dynamic content if you ask nicely.

        Put a /robots.txt file in place with a permissive catch-all rule.

        --
        $you = new YOU;
        honk() if $you->love(perl)

A reply falls below the community's threshold of quality.
