in reply to Re^8: Super search use DuckDuckGo link broken (excluding special pages and noise)
in thread Super search use DuckDuckGo link broken

These links are already tagged with rel="nofollow", so neither Google nor any other bot should be crawling them. A lot of them do anyway, so I'm not sure that trying to divine bot behaviour is time well spent.


Re^10: Super search use DuckDuckGo link broken (excluding special pages and noise)
by LanX (Saint) on May 04, 2025 at 14:59 UTC
    That's true, but nofollow is - according to Wikipedia - a misnomer.

    The main purpose is to influence page rankings, especially for links posted by users.

    Bots are under no obligation not to follow such links.

    I will run tests with various search engines later, to see whether "displaytype" shows up in search results for the main domain www.perlmonks.org (for the other domains, engines may still show cached results).

    If not, we are good.

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    see Wikisyntax for the Monastery

      Oh, thanks, I had the wrong impression of these attributes then!

      I've added the following lines to robots.txt, we'll see if bots honor these:

      Disallow: ?displaytype=
      Disallow: &displaytype=
      Disallow: ?node=Super+Search
      Disallow: &node=Super+Search
      Disallow: ?node_id=3989

      The main source of requests were crawlers that don't even identify themselves explicitly as bots, so I really doubt that the problematic bots will care about robots.txt.

        Thanks for updating robots.txt
        # Be kind. Wait between fetches longer than each fetch takes.
        User-agent: *
        Disallow: /bare/
        Disallow: /mobile/
        Disallow: /*?displaytype=
        Disallow: /*&displaytype=
        Disallow: /*?node=Super+Search
        Disallow: /*&node=Super+Search
        Disallow: /*?node_id=3989
        Crawl-Delay: 20

        But as I already said, you are forgetting the very common semicolon separator:

        Disallow: /*;displaytype=
        Disallow: /*;node_id=3989
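To illustrate why the semicolon matters: a query string may separate its parameters with either & or ;, so a check that looks at only one of them misses the other. A minimal Python sketch, assuming two made-up example URLs (the helper name `query_keys` is hypothetical and not part of any PerlMonks code):

```python
import re
from urllib.parse import urlsplit

def query_keys(url: str) -> set[str]:
    """Return the parameter names found in a URL's query string.

    Splits on both '&' and ';', since either may separate parameters.
    """
    query = urlsplit(url).query
    pairs = [p for p in re.split(r"[&;]", query) if p]
    return {p.split("=", 1)[0] for p in pairs}

# Hypothetical URLs, one per separator style.
urls = [
    "https://www.perlmonks.org/?node_id=3989&displaytype=print",
    "https://www.perlmonks.org/?node_id=3989;displaytype=print",
]
for url in urls:
    print(url, "displaytype" in query_keys(url))
```

Both URLs carry the same two parameters, which is why a rule set that lists only the ? and & forms leaves the ; form uncovered.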

        Personally I'd simplify and drop the leading ;, & and ? altogether:

        # Be kind. Wait between fetches longer than each fetch takes.
        User-agent: *
        Disallow: /bare/
        Disallow: /mobile/
        Disallow: /*displaytype=
        Disallow: /*node=Super+Search
        Disallow: /*node_id=3989
        Crawl-Delay: 20

        The only problem I see is IDs that merely start with 3989 (e.g. 39895), which these rules would block as well.

        So probably better:

        Disallow: /*node=Super+Search$
        Disallow: /*node_id=3989$
        Disallow: /*node=Super+Search;
        Disallow: /*node_id=3989;
        Disallow: /*node=Super+Search&
        Disallow: /*node_id=3989&
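The prefix problem with IDs like 39895 can be checked offline. The following is a rough Python sketch, assuming the wildcard semantics described in RFC 9309 and Google's documentation (* matches any run of characters, a trailing $ anchors the end of the URL, and rules match from the start of the path). Crawlers are free to interpret rules differently, and the function names and test paths here are made up for illustration.

```python
import re

def rule_to_regex(rule: str) -> "re.Pattern[str]":
    """Translate a robots.txt path rule into a regex.

    '*' becomes '.*', a trailing '$' anchors the end, and everything
    else is treated literally.
    """
    anchored = rule.endswith("$")
    body = rule[:-1] if anchored else rule
    pattern = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile(pattern + ("$" if anchored else ""))

def is_disallowed(path: str, rules: list[str]) -> bool:
    # re.match anchors at the start, mirroring robots.txt prefix matching.
    return any(rule_to_regex(rule).match(path) for rule in rules)

# The simplified rules versus the variants anchored with $, ; and &.
LOOSE = ["/*displaytype=", "/*node_id=3989"]
STRICT = ["/*displaytype=", "/*node_id=3989$", "/*node_id=3989;", "/*node_id=3989&"]

# Hypothetical request paths.
for path in ["/?node_id=3989", "/?node_id=39895", "/?node_id=11;displaytype=print"]:
    print(path, is_disallowed(path, LOOSE), is_disallowed(path, STRICT))
```

Under these assumptions the loose rule also blocks node 39895, while the anchored variants block only node 3989 itself. The single /*displaytype= rule catches the parameter regardless of which separator precedes it.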

        Cheers Rolf

        As I understand it, one still needs a wildcard /* up front (but I'm not sure).

        Anyway, you forgot ; as a separator in query strings.

        I'd suggest just /*displaytype= instead. It should cover all cases and match only query strings, since = is a metacharacter.

        I thought I read that Google has a page for testing rules online, but I can't find it right now. *

        FWIW: I read at Google that adbots are not covered by * and must be listed explicitly.

        But I suppose those are not our biggest headache.

        On a side note: adding rel='nofollow' to links posted by users should help frustrate spammers.

        Edit

        *) See https://support.google.com/webmasters/answer/6062598?hl=en

        I couldn't test it, since it requires a login.

        Update

        Forgot to mention, I tried finding displaytype nodes with Google and DDG and the nofollow rules seem to be effective on site:www.perlmonks.org.

        Other domains seem to show cached results.

        Cheers Rolf

Re^10: Super search use DuckDuckGo link broken (excluding special pages and noise)
by jdporter (Paladin) on May 05, 2025 at 12:36 UTC

    Please see subthread at Re^8: Unable to connect, if you haven't already. (And I acknowledge that you might find no new info there.)