in reply to Can dynamic sites be parsed by search engine robots?

Philip Greenspun discusses this issue here:
http://philip.greenspun.com/wtr/dead-trees/53005.htm
An excerpt:

I managed to hide a tremendous amount of my content from the search engines by stupidity in a different direction. I built a question and answer forum for http://photo.net/photo. Because I'm not completely stupid, all the postings were stored in a relational database. I used the AOLserver with its brilliant TCL API to get the data out of the database and onto the Web. The cleanest way to develop software in this API is to create files named "foobar.tcl" among one's document tree. The URLs end up looking like "http://photo.net/bboard/fetch-msg.tcl?msg_id=000037".

So far so good.

AltaVista comes along and says, "Look at that question mark. Look at the strange .tcl extension. This looks like a CGI script to me. I'm going to be nice and not follow this link even though there is no robots.txt file to discourage me."

Then WebCrawler says the same thing.

Then Lycos.

I achieved oblivion.

Then I had a notion that I developed into a concept and finally programmed into an idea: Write another AOLserver TCL program that presents all the messages from URLs that look like static files, e.g., "/fetch-msg-000037.html" and point the search engines to a huge page of links like that. The text of the Q&A forum postings will get indexed out of these pseudo-static files and yet I can retain the user pages with their *.tcl URLs. I could convert the user pages to *.html URLs but then it would be more tedious to make changes to the software (see my discussion of why the AOLserver *.tcl URLs are so good in the next chapter).
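
(End of excerpt.) For anyone who wants to see the trick in miniature, here is a rough Python sketch of the idea, not Greenspun's actual AOLserver code. The function names, the regular expression, and the list of "script-looking" extensions are all my own assumptions for illustration; the only things taken from the excerpt are the two URL shapes. It shows a crawler-style check that would skip the .tcl URL, and the mapping between the pseudo-static path and the message ID.

```python
import re
from urllib.parse import urlsplit

# Assumed pattern for the pseudo-static form quoted in the excerpt,
# e.g. "/fetch-msg-000037.html". The exact format is a guess.
STATIC_RE = re.compile(r"/fetch-msg-(\d+)\.html")

# Assumed list of extensions a cautious 1990s crawler might treat as scripts.
SCRIPT_EXTENSIONS = (".tcl", ".cgi", ".pl")


def looks_dynamic(url):
    """Return True if a cautious crawler would likely skip this URL."""
    parts = urlsplit(url)
    return bool(parts.query) or parts.path.endswith(SCRIPT_EXTENSIONS)


def static_to_msg_id(path):
    """Extract the message ID from a pseudo-static path, or None."""
    match = STATIC_RE.fullmatch(path)
    return match.group(1) if match else None


def msg_id_to_static(msg_id):
    """Build the pseudo-static path for a message ID."""
    return f"/fetch-msg-{msg_id}.html"


def link_page(msg_ids):
    """Render the 'huge page of links' to hand to the search engines."""
    links = [f'<a href="{msg_id_to_static(m)}">{m}</a>' for m in msg_ids]
    return "<br>\n".join(links)


if __name__ == "__main__":
    print(looks_dynamic("http://photo.net/bboard/fetch-msg.tcl?msg_id=000037"))
    print(looks_dynamic("http://photo.net/fetch-msg-000037.html"))
    print(static_to_msg_id("/fetch-msg-000037.html"))
    print(link_page(["000037", "000038"]))
```

The server-side handler would call something like static_to_msg_id() on the incoming path and then run the same database query the .tcl page runs, so the user-facing pages stay untouched.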

RE: Re: Can dynamic sites be parsed by search engine robots?
by merlyn (Sage) on Oct 08, 2000 at 21:18 UTC
    Just goes to show you that it's not very smart to tie a URL's form to its function. All the HTML on my website is dynamically generated, yet I don't have .cgi on the end of any file (and especially not .tcl {grin}).

    From a security perspective, revealing that something is /cgi or .cgi or .pl or .tcl is dangerous, as it gives an attacker a hint of the implementation language, which can permit more rapid selection of automated tools.

    -- Randal L. Schwartz, Perl hacker