in reply to Re^7: Super search use DuckDuckGo link broken
in thread Super search use DuckDuckGo link broken

According to the robots.txt specifications I found at Google, it's possible to exclude "orthogonal" pages like &displaytype=print or ;displaytype=edithistory with wildcards.

Any reason not to add rules like the following to the list? (Untested)
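Something along these lines, using the wildcard syntax from Google's documentation (the exact patterns are only a sketch, matching the parameters mentioned above):

    Disallow: /*?displaytype=
    Disallow: /*&displaytype=
    Disallow: /*;displaytype=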

Bing also suggests adding noindex meta tags to the pages.
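That would be a tag like the following in each such page's <head> (standard robots meta syntax, not specific to Bing):

    <meta name="robots" content="noindex">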

On a tangent

Ideally robots would be presented with a page without nodelets, but I'm not aware of an efficient solution, except checking the user-agent before building the page.
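A minimal sketch of such a check in Perl (hypothetical; render_nodelets() is a stand-in for whatever the page builder really does):

    # Crude bot detection via the User-Agent header; the pattern is a
    # guess and won't catch crawlers that pose as ordinary browsers.
    my $ua     = $ENV{HTTP_USER_AGENT} // '';
    my $is_bot = $ua =~ /bot|crawl|spider|slurp/i;

    # Serve the bare page to bots, the full page to everyone else.
    render_nodelets() unless $is_bot;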

Cheers Rolf
(addicted to the Perl Programming Language :)
see Wikisyntax for the Monastery

Re^9: Super search use DuckDuckGo link broken (excluding special pages and noise)
by Corion (Patriarch) on May 04, 2025 at 14:38 UTC

    These links are already tagged with rel="nofollow", so Google shouldn't be crawling these, nor should any other bots. Except a lot of them do, so I'm not sure if spending any effort on divining bot behaviours is time well spent.
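    For reference, such a tagged link looks like this in the page source (the URL here is illustrative):

        <a href="?node_id=3989;displaytype=print" rel="nofollow">print</a>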

      That's true, but nofollow is - according to Wikipedia - a misnomer.

      The main purpose is to keep such links from influencing page rankings, especially links posted by users.

      Bots have no obligation not to follow.

      I will later run tests with various search engines to see if "displaytype" is included in search results for the main domain www.perlmonks.org (for the other domains, engines may still show cached results).
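
      For example, a query like this (using Google's inurl: operator; other engines have similar syntax) should reveal whether such pages are indexed:

          site:www.perlmonks.org inurl:displaytype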

      If not, we are good.

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      see Wikisyntax for the Monastery

        Oh, thanks, I had the wrong impression of these attributes then!

        I've added the following lines to robots.txt; we'll see if bots honor these:

            Disallow: ?displaytype=
            Disallow: &displaytype=
            Disallow: ?node=Super+Search
            Disallow: &node=Super+Search
            Disallow: ?node_id=3989

        The main source of requests was crawlers that don't even identify themselves explicitly as bots, so I really doubt that the problematic bots will care about robots.txt.

      Please see the subthread at Re^8: Unable to connect, if you haven't already. (And I acknowledge that you might find no new info there.)