PerlMonks  

Can dynamic sites be parsed by search engine robots?

by footpad (Abbot)
on Oct 08, 2000 at 07:33 UTC [id://35789]

footpad has asked for the wisdom of the Perl Monks concerning the following question:

The penitent asks:

Perhaps I am overly ambitious, but is there a way to verify that my Perl-based site has been visited by the search engines I've submitted it to?

I believe I have coverage not found elsewhere and, yet, my site does not appear. I use META tags. I use Titles. I have carefully chosen keywords that accurately reflect my coverage.

This is not necessarily a Perl question; however, I have used Perl to implement my site. The actual source pages are nothing more than Perl scripts containing variable definitions and content. The source files are parsed when requested.

The content delivered to the end-user's browser is a proper, correct, and complete web page.

Is it possible that the robots are not parsing my pages? If so, can someone advise me how to correct this?

I confess that I am a Windows programmer with little experience with *nix and admit that I may have overreached in my ambition.


Replies are listed 'Best First'.
Re: Can dynamic sites be parsed by search engine robots?
by rlk (Pilgrim) on Oct 08, 2000 at 08:03 UTC

    Warning: The following is second or thirdhand information. I make no claims as to whether it is accurate or complete, other than I got it from someone who deals with search engine related stuff for a living.

    One thing that search engines apparently avoid like the plague is pages with URLs like

    http://www.example.com/index.pl?foo=bar

    If the pages you've submitted look like this, that may be your problem. You may want to throw up a static page or two for purposes of being indexed.

    Also, make sure you don't have any restrictions in robots.txt that could block the pages from being indexed. If you don't have a robots.txt file, it couldn't hurt to throw one up that explicitly allows it.
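
    For what it's worth, a minimal explicitly permissive robots.txt looks like this (an empty Disallow means nothing is disallowed, i.e. everything may be indexed):

    ```
    User-agent: *
    Disallow:
    ```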

    Finally, I don't know how long you've been waiting, but if it's less than a week (or even two), be patient. Search engines take a while to incorporate submitted pages into their databases.

    --
    Ryan Koppenhaver, Aspiring Perl Hacker
    "I ask for so little. Just fear me, love me, do as I say and I will be your slave."

Re: Can dynamic sites be parsed by search engine robots?
by nop (Hermit) on Oct 08, 2000 at 21:07 UTC
    Philip Greenspun discusses this issue here:
    http://philip.greenspun.com/wtr/dead-trees/53005.htm
    An excerpt:

    I managed to hide a tremendous amount of my content from the search engines by stupidity in a different direction. I built a question and answer forum for http://photo.net/photo. Because I'm not completely stupid, all the postings were stored in a relational database. I used the AOLServer with its brilliant TCL API to get the data out of the database and onto the Web. The cleanest way to develop software in this API is to create files named "foobar.tcl" among one's document tree. The URLs end up looking like "http://photo.net/bboard/fetch-msg.tcl?msg_id=000037".

    So far so good.

    AltaVista comes along and says, "Look at that question mark. Look at the strange .tcl extension. This looks like a CGI script to me. I'm going to be nice and not follow this link even though there is no robots.txt file to discourage me."

    Then WebCrawler says the same thing.

    Then Lycos.

    I achieved oblivion.

    Then I had a notion that I developed into a concept and finally programmed into an idea: Write another AOLServer TCL program that presents all the messages from URLs that look like static files, e.g., "/fetch-msg-000037.html" and point the search engines to a huge page of links like that. The text of the Q&A forum postings will get indexed out of these pseudo-static files and yet I can retain the user pages with their *.tcl URLs. I could convert the user pages to *.html URLs but then it would be more tedious to make changes to the software (see my discussion of why the AOLserver *.tcl URLs are so good in the next chapter).
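
    Greenspun's workaround can be sketched in Perl (a hypothetical hand-rolled version, since footpad's site is Perl rather than AOLserver Tcl). The URL pattern and the routing are assumptions; the point is only that the script recognizes a static-looking URL and serves the dynamic content behind it:

    ```perl
    #!/usr/bin/perl
    # Hypothetical sketch of the pseudo-static-URL trick in Perl.
    # Assumes the web server routes requests for /fetch-msg-*.html to this
    # script (e.g. via an Apache AliasMatch or ErrorDocument handler).
    use strict;

    # Pull the message id out of a static-looking URL, or return undef.
    sub msg_id_from_uri {
        my ($uri) = @_;
        return $uri =~ m{/fetch-msg-(\d+)\.html$} ? $1 : undef;
    }

    my $id = msg_id_from_uri($ENV{REQUEST_URI} || '');
    if (defined $id) {
        # A real script would fetch posting $id from the database here.
        print "Content-type: text/html\n\n";
        print "<html><body>Message $id goes here</body></html>\n";
    } else {
        print "Status: 404 Not Found\nContent-type: text/plain\n\nnot found\n";
    }
    ```

    The robots see plain *.html links, while the same script keeps serving the dynamic pages.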
      Just goes to show you that it's not very smart to tie a URL's form to its function. All the HTML on my website is dynamically generated, yet I don't have .cgi on the end of any file (and especially not .tcl {grin}).

      From a security perspective, revealing that something lives in /cgi or ends in .cgi, .pl, or .tcl is dangerous, as it gives an attacker a hint of the implementation language, which can permit more rapid selection of automated attack tools.

      -- Randal L. Schwartz, Perl hacker

RE: Can dynamic sites be parsed by search engine robots?
by little (Curate) on Oct 08, 2000 at 09:58 UTC
    footpad,
    if you registered your site, a bot will try to open it, as it would with any available domain, e.g. "www.mypersonalweb.com/personal/info/about/me/".
    But note that if your provider uses Apache, the client browser (or in this case the bot) may get an "access denied" if the trailing slash is omitted when the URL does not point to the domain root.
    So first you should check with a browser that you actually get the document specified by the URL you registered at the search engine.
    Normally the search bots will look for robots meta tags, where you can specify with "follow" or "nofollow" whether a bot shall follow up the links in the page.
    It is important to note that this does not concern the directory structure of your site, because the bots are normal HTTP clients.
    So you might also check whether there is a "robots.txt" file in your site's root directory :-)
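
    For reference, a robots meta tag of the kind described above goes in the page's head; this is a generic sketch, not footpad's actual markup:

    ```
    <meta name="robots" content="index,follow">
    ```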
    To check whether a given client is requesting a document from your web server, have a look at one of merlyn's columns for WebTechniques, where he describes how to generate access logs. Reading through that code, you will easily figure out which four lines of code to include in the CGI script that responds to the HTTP requests for the URL you registered.
    Sorry for "talking around the corner", but my English is not as good as it should be.
    Have a nice day
    All decision is left to your taste
RE: Can dynamic sites be parsed by search engine robots? (webserver logs)
by ybiC (Prior) on Oct 08, 2000 at 07:55 UTC
    If they're hitting your site, I'd expect to see 'bots reported in your webserver agent logs (if it's configured for agents).

    Unless I'm mistaken, it could take weeks or even months for search bots to hit you after you submit your URL.
        cheers,
        ybiC
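
    As a sketch of what to look for in those logs (the log path and robot names here are assumptions; the big engines of the day identify themselves in the User-Agent field, e.g. Googlebot, AltaVista's Scooter, Inktomi's Slurp):

    ```perl
    #!/usr/bin/perl
    # Rough sketch: pick robot visits out of an Apache "combined" format
    # access log, where the last quoted field is the user agent.
    use strict;

    # True if a combined-format log line ends in a robot-looking agent.
    sub is_bot_line {
        my ($line) = @_;
        my ($agent) = $line =~ /"([^"]*)"\s*$/;   # last quoted field
        return defined($agent) && $agent =~ /googlebot|scooter|slurp|crawl|spider/i;
    }

    my $log = shift @ARGV || '/var/log/apache/access_log';
    if (open my $fh, '<', $log) {
        while (my $line = <$fh>) {
            print $line if is_bot_line($line);
        }
    }
    ```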

Re: Can dynamic sites be parsed by search engine robots?
by adamsj (Hermit) on Oct 08, 2000 at 07:56 UTC
    I'm not an expert on search engines, but I don't see why they wouldn't be looking at your page--and if you've put your HTML together right (addendum: even if you haven't), Windows vs. UNIX shouldn't make a difference.

    How long since you submitted your pages to the search engines is probably the relevant question. With that knowledge, someone more expert than myself can give you a more informed answer.

Re: Can dynamic sites be parsed by search engine robots?
by Trimbach (Curate) on Oct 08, 2000 at 21:20 UTC
    Here's a thought: to guarantee that the default output of your CGI gets cataloged by the search engines, why not create an index.html file that contains a Server-side include to your CGI? That way, whenever the robots hit that page, your web server "includes" the default output of your CGI (which is, like you say, just static HTML) and then spits the whole thing back to the robot. To the robot it looks like a regular web page, so it gets indexed and cataloged and All is Right in the World.

    Of course, this assumes you have access to Server-side includes. Check out the Apache docs on SSI at http://www.apache.org/docs-1.2/mod/mod_include.html for more information. Even if you're serving from a Windows web server, there should be similar procedures for enabling SSIs.
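
    A minimal sketch of that index page, assuming the CGI lives at /cgi-bin/index.pl and the server is configured to parse the file for includes:

    ```
    <!--#include virtual="/cgi-bin/index.pl" -->
    ```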

    Gary Blackburn
    Trained Killer

Re: Can dynamic sites be parsed by search engine robots?
by rdnzl (Sexton) on Oct 08, 2000 at 09:44 UTC
      this is not correct:
      robots.txt allows you to restrict or control how robots index your site; it is not necessary for making them work!

        Absolutes are so dangerous, usually =)

        Many robots act differently if there is an active robots.txt file in place. Many use a more liberal tree walker and will follow some dynamic content if you ask nicely.

        Put a /robots.txt file in place with a permissive catch-all rule.

        --
        $you = new YOU;
        honk() if $you->love(perl)

A reply falls below the community's threshold of quality.
