in reply to Re^10: Super search use DuckDuckGo link broken (excluding special pages and noise)
in thread Super search use DuckDuckGo link broken

Oh, thanks, I had the wrong impression of these attributes then!

I've added the following lines to robots.txt; we'll see if bots honor them:

Disallow: ?displaytype=
Disallow: &displaytype=
Disallow: ?node=Super+Search
Disallow: &node=Super+Search
Disallow: ?node_id=3989

The main source of requests was crawlers that don't even identify themselves as bots, so I really doubt that the problematic ones will care about robots.txt.

Re^12: Super search use DuckDuckGo link broken (excluding special pages and noise)
by LanX (Saint) on May 07, 2025 at 21:32 UTC
    Thanks for updating robots.txt
    # Be kind. Wait between fetches longer than each fetch takes.
    User-agent: *
    Disallow: /bare/
    Disallow: /mobile/
    Disallow: /*?displaytype=
    Disallow: /*&displaytype=
    Disallow: /*?node=Super+Search
    Disallow: /*&node=Super+Search
    Disallow: /*?node_id=3989
    Crawl-Delay: 20

    But as I already said, you are forgetting the very common semicolon separator:

    Disallow: /*;displaytype=
    Disallow: /*;node_id=3989

    Personally, I'd simplify and drop the leading ; & ? separators altogether:

    # Be kind. Wait between fetches longer than each fetch takes.
    User-agent: *
    Disallow: /bare/
    Disallow: /mobile/
    Disallow: /*displaytype=
    Disallow: /*node=Super+Search
    Disallow: /*node_id=3989
    Crawl-Delay: 20

    The only problem I see is that IDs which merely start with 3989 (e.g. 39895) would be matched as well.

    So probably better:

    Disallow: /*node=Super+Search$
    Disallow: /*node_id=3989$
    Disallow: /*node=Super+Search;
    Disallow: /*node_id=3989;
    Disallow: /*node=Super+Search&
    Disallow: /*node_id=3989&
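
The difference between the loose and the terminated node_id rules can be checked offline with a small Python sketch. It assumes the wildcard semantics documented by Google and RFC 9309 (matching starts at the beginning of the path, * matches any run of characters, a trailing $ anchors the end); real crawlers may behave differently. The names rule_matches and blocked and the request paths are made up for illustration.

```python
import re

def rule_matches(pattern: str, path: str) -> bool:
    """True if a robots.txt path pattern matches path (URL path plus query string)."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Treat everything literally except '*', which matches any run of characters.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += r"\Z"
    # re.match anchors at the start of the path (assumed robots.txt behaviour).
    return re.match(regex, path) is not None

def blocked(rules, path):
    """True if any rule in the list matches the path."""
    return any(rule_matches(rule, path) for rule in rules)

LOOSE = ["/*node_id=3989"]
STRICT = ["/*node_id=3989$", "/*node_id=3989;", "/*node_id=3989&"]

# Hypothetical request paths; 39895 stands in for an unrelated node.
for path in ["/?node_id=3989",
             "/?node_id=3989;displaytype=print",
             "/index.pl?node_id=3989&x=1",
             "/?node_id=39895"]:
    print(path, "loose:", blocked(LOOSE, path), "strict:", blocked(STRICT, path))
```

Under these assumptions the loose rule also blocks node 39895, while the three terminated rules block only node 3989 itself.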

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    see Wikisyntax for the Monastery

Re^12: Super search use DuckDuckGo link broken (excluding special pages and noise)
by LanX (Saint) on May 04, 2025 at 17:30 UTC
    From my understanding, one still needs a wildcard /* up front (but I'm not sure).

    Anyway, you forgot ; as a separator in query strings.

    I'd suggest just /*displaytype=, which should cover all cases and match only query strings, since = is a metacharacter there.
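
As a rough check of the separator argument, here is a minimal Python sketch. It assumes Google-style / RFC 9309 matching (anchored at the start of the path, * as wildcard, optional trailing $); whether a given crawler treats a rule without a leading slash this way is exactly the uncertain part. The helper name matches and the sample paths are hypothetical.

```python
import re

def matches(pattern: str, path: str) -> bool:
    """True if a robots.txt path pattern matches path (URL path plus query string)."""
    end_anchor = pattern.endswith("$")
    body = pattern[:-1] if end_anchor else pattern
    # Literal text is escaped; each '*' becomes '.*'.
    regex = ".*".join(re.escape(piece) for piece in body.split("*"))
    if end_anchor:
        regex += r"\Z"
    # Matching is assumed to begin at the first character of the path.
    return re.match(regex, path) is not None

# Hypothetical paths, one per query-string separator style.
PATHS = [
    "/?displaytype=print;node_id=1",   # parameter right after '?'
    "/?node_id=1&displaytype=print",   # '&' separator
    "/?node_id=1;displaytype=print",   # ';' separator
]

BARE = ["?displaytype=", "&displaytype="]           # no leading /*
PREFIXED = ["/*?displaytype=", "/*&displaytype="]   # wildcard, but no ';'
SIMPLE = ["/*displaytype="]                         # separator-agnostic

for path in PATHS:
    row = [any(matches(rule, path) for rule in rules)
           for rules in (BARE, PREFIXED, SIMPLE)]
    print(path, "bare/prefixed/simple:", row)
```

In this model the bare rules never match because every path starts with /, the prefixed rules miss the ; case, and the single /*displaytype= rule catches all three.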

    I thought I read that Google has a page for testing rules online, but I can't find it right now. *

    FWIW: I read at Google that their ad bots are not covered by * and must be listed explicitly.

    But I suppose those are not our biggest headache.

    On a side note: adding rel='nofollow' to links posted by users should help frustrate spammers.

    Edit

    *) See https://support.google.com/webmasters/answer/6062598?hl=en

    I couldn't test it, since it requires a login.

    Update

    Forgot to mention, I tried finding displaytype nodes with Google and DDG and the nofollow rules seem to be effective on site:www.perlmonks.org.

    Other domains seem to show cached results.

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    see Wikisyntax for the Monastery