in reply to Re: searching via www::search on alltheweb
in thread searching via www::search on alltheweb

This node falls below the community's minimum standard of quality and will not be displayed.

Re^3: searching via www::search on alltheweb
by Fletch (Bishop) on Jan 18, 2006 at 21:34 UTC
    I did not respond because I fail to see how they could come to such a conclusion.

    So you post several different times asking for help submitting comments to different sites (cf. Why code is not posting, using www::mechanize to submit to a forum, unable to post to forum, Submitting a form, and perl and a javascript form field; the code in the first four being all but indistinguishable from one another save the URL to which the message was to be posted) . . .

    Then you're asking for help mining search engine results . . .

    And yet you fail to see how people could not come to the conclusion that you were trying to implement some sort of comment spamming scheme (given the recent line of questioning most likely as an attempt to increase page rank with said search engines; in fact Why code is not posting looks to be the start of just such a program) . . .

    /boggle

Re^3: searching via www::search on alltheweb
by Anonymous Monk on Jan 18, 2006 at 17:20 UTC
    "for all *we* know"...

    There are a number of things about your posts which would lead one to assume such things:

    • Posting a comment to a forum (using www::mechanize to submit to a forum, unable to post to forum) is usually a one-off and not something that requires automation, certainly not when it takes days of coding to achieve.
    • When posting "asking for help" to a "read, ask, wait and then read answers" type forum (Submitting a form), automated submission doesn't go hand in hand with reading manually and thanking responders for their time and answers.
    • Having to be dishonest about your user agent:
      $agent->agent_alias( 'Windows IE 6' );
      in Why code is not posting and Submitting a form, which would appear to be either to cover your tracks or to get around a site's requirement that a normal web browser be used. Presumably such a requirement would be imposed by a site owner who didn't *want* people making automated submissions to their site, and any attempts (however circumventable) they have made at restricting access to their site should be respected by considerate web users.
    • Doing a Google search and submitting a post to each of the results???: Why code is not posting and you "fail to see how"...
      I did not respond because I fail to see how they could come to such a conclusion. I still don't see how my question can suddenly lead you to assume I am running a spamming script.
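
    For context, agent_alias comes from WWW::Mechanize; it swaps the module's default User-Agent string for a canned browser string. A minimal sketch of what that one line does:

    ```perl
    use WWW::Mechanize;

    # agent_alias() replaces the default "WWW-Mechanize/..." User-Agent
    # header with a canned string imitating a real browser, here MSIE 6
    # on Windows, so the target site cannot tell the requests apart from
    # normal browser traffic by User-Agent alone.
    my $mech = WWW::Mechanize->new;
    $mech->agent_alias('Windows IE 6');
    print $mech->agent, "\n";
    ```
    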

    It's not hard to see why people would assume this. It's not about this one question, (your questions don't sit on this site in isolation and *we* don't always fail to see *your other posts*) it's about the focus of your posts to date and who we are putting our time into helping, and what the end result is that we are helping them to achieve (or that we are perhaps unwittingly unleashing upon the web).

Re^3: searching via www::search on alltheweb
by Anonymous Monk on Jan 18, 2006 at 21:13 UTC
    Try this: http://www.viewz.org/opensoft/

    ----------------------------------------------------

    Viewz Web Search: a meta search engine that sends queries to multiple search engines simultaneously, combines the results, and ranks them based on some criteria. The results are highly relevant. It also does a comparison among the search engines participating in the search. You need to have LWP::Parallel installed on your box.
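
    The LWP::Parallel dependency is what lets it hit several engines at once. A rough sketch of that pattern (the engine URLs below are placeholders for illustration, not Viewz's actual code):

    ```perl
    use LWP::Parallel::UserAgent;
    use HTTP::Request;

    # Register one GET request per search engine; wait() blocks until
    # all responses arrive (or the timeout expires) and returns a
    # hashref of entries, each carrying its HTTP::Response.
    my $pua = LWP::Parallel::UserAgent->new;
    $pua->register(HTTP::Request->new(GET => $_))
        for ('http://www.google.com/search?q=perl',
             'http://www.alltheweb.com/web?q=perl');
    my $entries = $pua->wait(15);    # timeout in seconds
    for my $entry (values %$entries) {
        my $res = $entry->response;
        printf "%s: %d bytes\n", $res->request->uri, length($res->content);
    }
    ```
    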

    ---------------------------------------------------

      I haven't been able to get VROOM working. Although I have manually copied and pasted the .pm files to c:\perl\site\lib\vroom\search\google.pm and c:\perl\site\lib\vroom\vroom.pm, it still says vroom\search\google not found and lists the above paths as where it should be found. The package I am using is this:
      package VROOM::Search::Google;

      use strict;
      use VROOM::Search qw(escape_query unescape_sequence);
      use Time::HiRes qw(gettimeofday);

      @VROOM::Search::Google::ISA = qw(VROOM::Search);

      sub prepare_request {
          my $self   = shift;
          my $query  = escape_query(shift);
          my $params = shift;
          my $uri    = 'http://www.google.com';
          $params->{baseurl} = $uri unless defined $params->{baseurl};
          $params->{hl}      = 'en' unless defined $params->{hl};
          $self->{baseurl} = $uri = $params->{baseurl};
          $uri .= '/search?q=' . $query;
          while (my ($name, $value) = each %$params) {
              next if $name =~ /baseurl/;
              $uri .= '&' . $name . '=' . $value;
          }
          $self->{initime} = $self->{endtime} = [gettimeofday];
          $self->{request} = new HTTP::Request(GET => $uri);
      }

      sub store_results {
          my $self = shift;
          my $res  = shift;
          $self->{endtime} = [gettimeofday];
          if ($res->code != 200) {
              $self->{request} = undef;
              return undef;
          }

          #
          # Google doesn't return Content-Length,
          # so ($res->headers)->content_length will be zero. We're forced to
          # use Perl function - length.
          #
          $self->{fetch}++;
          $self->{pgsize} += length($res->content);

          #
          # If we reach here, HTTP response is OK. Proceed to parse the html
          # document for search results
          #
          my ($HIT, $ENTRY, $NEXT) = (0, 1, 2);
          my $rank   = $self->count;
          my $hits   = 0;
          my $wish   = $HIT;
          my $result = undef;

          foreach (split(/(<p>|\n|<\/div>)/i, $res->content)) {
              next if /^$/;              # short circuit for blank lines
              last if $wish == $NEXT;
              if ($self->count == $self->maximum) {
                  $self->{request} = undef;
                  return $hits;
              }
              #print "#################################################\n";
              #print $_, "\n";

              #
              # Ah, found some results. Get approximate results and wish to
              # see the title/url of the first result.
              #
              if ($wish == $HIT && /Results.*?of.*?([0-9,]+).*?\./i) {
                  my $count = $1;
                  $self->approximate($count);
                  $wish = $ENTRY;
              }
              #
              # Extract the url/title and wish to have abstract text
              #
              elsif ($wish == $ENTRY && /^<a href=(.*?)>(.*?)<\/a><br><font.*?>(.*?)$/i) {
                  my $url      = $1;
                  my $title    = $2;
                  my $abstract = $3;
                  $url   =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg;
                  $url   =~ s/(^http:\/\/|\/(index.htm|index.html)*$)//g;
                  $title =~ s/<.*?>//g;
                  $result = new VROOM::Search::Result;
                  $result->url($url);
                  $result->title(unescape_sequence($title));
                  $result->text(unescape_sequence($abstract));
                  $result->rank(++$rank);
                  $result->engine('Google');
                  $self->add_result($result);
                  $self->{pool}->insert($result) if $self->{pool};
                  $hits++;
              }
              #
              # Extract the url for the next page
              #
              elsif ($wish == $ENTRY && /<td nowrap><a href=(.*?)>.*?<span.*?>Next<\/span><\/a>/i) {
                  $self->{request}->uri($self->{baseurl} . $1);
                  $wish = $NEXT;
              }
          }

          #
          # This is important. It signals the search agent not to fetch more pages.
          #
          $self->{request} = undef if $wish != $NEXT;
          return $hits;
      }

      1;
      __END__
      my perl code is this:
      #! Perl\bin\perl -w
      use VROOM::Search::Google;

      open FILE1, "> sample1.txt" or die "$!";
      my $oSearch = new VROOM::Search::Google( );
      my $sQuery  = VROOM::Search::Google::escape_query('"telefonos" "mundial"');
      $oSearch->native_query($sQuery);
      while ( my $oResult = $oSearch->next_result() ) {
          print "Adding: ", $oResult->url, "\n";
          print FILE1 $oResult->url, "\n";
      }
      print ref($oSearch);
      I have still not been able to figure out how to change the language parameter (hl = es in this case instead of en). Any help will be appreciated.
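
      Judging from prepare_request() in the package above, hl only defaults to 'en' when no hl entry is present in the parameter hashref, and every other key/value pair gets appended to the query string. So, assuming your search call forwards an options hashref through to prepare_request() as its second argument (the method name below mirrors your script and is not confirmed against VROOM's documentation), something like this should do it:

      ```perl
      # Sketch: passing hl => 'es' makes prepare_request() skip its
      # 'en' default and append '&hl=es' to the Google query URL.
      my $oSearch = VROOM::Search::Google->new;
      my $sQuery  = VROOM::Search::Google::escape_query('"telefonos" "mundial"');
      $oSearch->native_query($sQuery, { hl => 'es' });
      ```
      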