PerlMonks  

Check if your site's been banned with Google

by Alien (Monk)
on Jan 08, 2007 at 13:15 UTC
Category: Miscellaneous
Author/Contact Info: Alien
Description: Script to check whether your site has been banned from Google. In an ideal world it would check for backlinks too ... but who knows :) ... maybe in the next version!

use strict;
use warnings;
use WWW::Mechanize;

# Site to look up, taken from the command line (e.g. example.com)
my $site=shift || die "Site\n";
my @results=();

# autocheck makes a failed GET die; get() returns a response object,
# which is always true, so "|| die" on it would never fire
my $mech=WWW::Mechanize->new(autocheck=>1);
$mech->get("http://www.google.com/search?q=site%3A$site");
my $text=$mech->content;

# Collect the href of every result link (marked up with class=l)
while($text=~m@<a class=l href="(.*?)">@gi)
{
    push(@results,$1);
}

if(!@results)
{
    print "Site is banned with Google, or was not submitted to it!\n";
}
else
{
    print "$site does NOT appear to be banned ... in fact here are some Google results for it:\n";
    for my $qq (@results)
    {
        print "$qq\n";
    }
}
Re: Check if your site's been banned with Google
by merlyn (Sage) on Jan 08, 2007 at 14:45 UTC

        Oddly enough, the Google AJAX API FAQ lists 15 questions, but only contains 7 answers.

        From a previous reading of the rules of use, you specifically were NOT to use it on anything other than a website, and you were not allowed to do anything other than present the information directly as returned by Google. ... unfortunately, answers #9 and #11 aren't listed right now.

        (of course ... would it then be ethical to scrape that website that you created?)

        Update: 9 and 11, not 8 and 11.

      Is it truly legal for a site to tell you how to use content they provide on the internet? If he was planning to redistribute the info on his own website I could understand, but for a personal command line tool? Isn't that kind of like saying you have to READ the whole page of HTML we send you, you can't just skim it to see if your site worked?

      I'm not saying it is ethical, I'm just curious how far Google's reach extends over the content it provides. If it were a site I had to register with and agree to its terms of use, I could understand that, but this is a case of limiting the use of information that Google makes public on purpose. What if I made a GreaseMonkey script that does the same thing and displays it in my browser? Where does the line get drawn? Am I required to view their entire page of HTML based on terms of use that I might not know exist, let alone agree to? Could their terms of use then ban me from using information from a search in any other context? Could they state that I must fully read at least one ad before looking to see if my site was among the other sites listed?

      Like I said, I can understand limits on uses of information that you have to register to see or that you plan on reusing for your own profit, but this doesn't seem to fit either of those cases, so I'm curious. Just some food for thought, and maybe there is an obvious answer out there that I'm not aware of.


      ___________
      Eric Hodges
Re: Check if your site's been banned with Google
by davidrw (Prior) on Jan 08, 2007 at 14:01 UTC
    The while loop can be written simply as:
    my @results = $text =~ m@<a class=l href="(.*?)">@gi;
    Also, you may want to look at WWW::Mechanize's find_all_links method (returns WWW::Mechanize::Link objects) so that you're not parsing HTML yourself:
    my @results = map { $_->url } grep { ( $_->attrs->{class} || '' ) eq 'l' } $mech->find_all_links( tag => 'a' );
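    For anyone who wants to try the list-context match without hitting Google, here is a minimal offline sketch. The HTML in $text is made-up sample markup standing in for $mech->content (it is an assumption, not real Google output), and the match is bound to $text explicitly so it does not depend on $_.

```perl
use strict;
use warnings;

# Hypothetical sample markup standing in for $mech->content;
# this is NOT real Google output, just enough to exercise the regex.
my $text = <<'HTML';
<a class=l href="http://example.com/">Example</a>
<a class=other href="http://example.com/skip">Skip</a>
<A CLASS=l HREF="http://example.com/about">About</A>
HTML

# In list context, m//g returns every captured group at once,
# so the explicit while/push loop is not needed.
my @results = $text =~ m@<a class=l href="(.*?)">@gi;

print "$_\n" for @results;
```

    With the sample above, only the two class=l links are captured (the /i flag lets the upper-case tag match as well), and the class=other link is skipped.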
