madsoeni has asked for the wisdom of the Perl Monks concerning the following question:

Dear all, I am a real newbie in Perl. From a professor at another university I obtained Perl code that counts the number of Google News articles mentioning a given keyword, by day and country. To be clearer, assume that I want to count the number of articles mentioning "hello" in the UK, day by day. I am using WWW::Mechanize, and the code I wrote is the following:

#!/usr/bin/perl -w
use WWW::Mechanize; #activate scraper package

# waiting time between observations
$sleep_per_obs = 5;

# www:mechanize agent
my $agent = new WWW::Mechanize(onerror => undef);

# Safari browser
$agent->agent_alias( 'Mac Safari' );

# target file
my ($target) = 'data_uk.txt';
print "Data will save to $target \n";
open ($target, '>', $target) or die ("Sorry, couldn't open $target for writing. \n");

# term to search
$term = "hello";
print "Search term is ".$term . ".\n";

for($year=2012;$year<=2014;$year++){
    for($month=01;$month<=12;$month++){
        for($day=01;$day<=31;$day++){
            $url = "https://www.google.com/search?q=$term&hl=en&gl=uk&authuser=0&sa=X&ei=xXJuUp6tMoLcyQGxp4GgCw&source=lnt&cr=countryUK&tbs=cdr%3A1%2Ccd_min%3A$month%2F$day%2F$year%2Ccd_max%3A$month%2F$day%2F$year&tbm=nws";
            # print "URL is ".$url."\n";
            $agent->get($url);
            $content = $agent->content();
            $content =~ /(\d+),*(\d*) results/;
            #assigns results (thousands and hundreds = $1 and $2) to variables
            my ($results1, $results2) = ($1, $2);
            if ($results2 eq "") {
                $combo = $results1;
            } else {
                $combo = ($results1*1000+$results2);
            }
            if ($combo eq ""){
                $combo=0;
            }
            print "Number of results $day-$month-$year : $combo \n";
            print $target "$day-$month-$year: $combo \n";
            sleep 5;
        }
    }
}
close $target;

However, after a while I get stopped and the script returns 0 results, even though I change the waiting time for each request. Does anyone know where I am wrong? Thank you in advance.
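
A side note on the "0 results": the script above never checks whether the pattern actually matched, so as far as I can tell a page that contains no result count at all is logged the same way as a day with genuinely zero articles. The following is only a minimal sketch of a stricter parser, using nothing beyond core Perl. The helper name parse_result_count and the sample strings are made up for illustration and are not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): pull the hit count out
# of text such as "About 1,234,567 results".
# Returns undef when no count is found, so the caller can distinguish
# "the page had no count" from a real zero.
sub parse_result_count {
    my ($content) = @_;
    return unless defined $content;

    # Either a comma-grouped number (8,050) or a plain run of digits (12).
    if ( $content =~ /\b(\d{1,3}(?:,\d{3})+|\d+)\s+results?\b/ ) {
        ( my $count = $1 ) =~ tr/,//d;    # "1,234,567" -> "1234567"
        return $count + 0;
    }
    return;
}

# Made-up sample inputs, including one without any count.
for my $sample ( 'About 8,050 results', '1,234,567 results', 'a page without a count' ) {
    my $count = parse_result_count($sample);
    printf "%-25s => %s\n", $sample, defined $count ? $count : 'no count found';
}
```

With a helper like this, the main loop could log something other than 0 (or stop altogether) whenever the count is missing, which would make it visible when the responses change.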

Replies are listed 'Best First'.
Re: Count of articles Google News
by ww (Archbishop) on Feb 28, 2015 at 16:50 UTC

    If you attempt to reach the address on Line 31 by pasting it into your browser's address bar, you may see the underlying problem; the timing problem is that your "after a while" (a term of very low precision) may reflect the time burned in the loops at Lines 27-29.

    Additionally:

    • You mention "google news" but there's no Google News address anywhere in your script (as I understand it, for any English language results, searching for "google news" always requires a variant on "http://news.google.com/").
    • Have you investigated (and ensured) that your effort satisfies any Terms of Use established by Big G?
    • Are you confident that the individual who provided this code (NB: Less "use strict;" but with some odd formatting -- the indentation of the closing brackets at Lines 56-58 leaps off the page) wishes you well?

    updated: Reordered phrases in bullet #1, for clarity.
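
    On the loops mentioned above: they run the day counter from 1 to 31 for every month, so they also issue requests for dates that do not exist (31 February, for instance). Over 2012 to 2014 that is 1116 requests for 1096 real calendar days. A minimal sketch of iterating only over real dates with the core Time::Piece module follows; the helper name days_between is hypothetical, and the MM/DD/YYYY output is only an assumption based on the order of the cd_min/cd_max fields in the posted URL.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::Piece;
use Time::Seconds;    # exports ONE_DAY

# Hypothetical helper: every calendar day from $first to $last inclusive.
# Input dates are YYYY-MM-DD strings; output strings are MM/DD/YYYY.
sub days_between {
    my ( $first, $last ) = @_;
    my $day = Time::Piece->strptime( $first, '%Y-%m-%d' );
    my $end = Time::Piece->strptime( $last,  '%Y-%m-%d' );

    my @days;
    while ( $day->epoch <= $end->epoch ) {
        push @days, $day->strftime('%m/%d/%Y');
        $day = $day + ONE_DAY;    # step forward by exactly one day
    }
    return @days;
}

# A short range that crosses the end of February in a leap year.
print "$_\n" for days_between( '2012-02-27', '2012-03-01' );

# The full range used in the original script.
my @all = days_between( '2012-01-01', '2014-12-31' );
print scalar(@all), " days in total\n";
```

    Each returned string could then be URL-escaped and placed into the date fields of the query instead of the separate $month, $day and $year counters.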



      Dear ww, thanks for your reply. I couldn't see the problem by pasting the address; could you please be clearer about this? I can't see why it is not working... I used that address because with tbm=nws it returns the search results for the news. Finally, what do you mean by the "odd formatting"? Thank you!

        NB: This applies to your next reply, as well as to the above... and, I hope, provides some insight on how better to obtain the assistance you seek.

        Please provide exact output, not vague descriptions, when you say you've tried to understand a reply from a Monk. "I couldn't see..." does NOT tell us what you DID see. Since my understanding of your problem may be defective, you need to tell us specifics.

        Gimmé ("give me...") questions like "what do you mean" are less than welcome: The Monastery's mission is to help you (and others with questions) learn how to solve problems. I meant that the indenting obscures the structure of your code, but this is an issue about which you might have garnered a clue by reading some other code rather than merely asking someone to spoon-feed you an answer.
        Similarly, from your next reply, "...I basically need to change the address from which I am doing the requests in the script, isn't it? Would you be so kind to tell me where I could find some examples of this?" reflects no effort on your part and represents a pair of "gimmés."

        Please read On asking for help and How do I post a question effectively?; visit the Tutorials section; and -- not least of all -- try some self-help, even if it results only in better-framed questions.

        And since you describe yourself as "real newbie in Perl," you may wish to start by coding solutions to the exercises in something like Learning Perl (commonly also available from other vendors, including used-book emporia) or working thru one of the (many) on-line courses that are made freely available from university-level CS programs.

        Update (forgot to mention this earlier): Among the practices to adopt early: use strict; -- always, until you know a truly good reason to omit the stricture. Doing so will let Perl itself point out many mistakes, such as those that exist in your code. Correcting those that pop up initially (failure to declare variables with my... and company) will lead to your discovery of several more problems whose resolution will also move you forward to the point where your code will actually start to produce the results you seek:

        Data will save to data_uk.txt
        Search term is hello.
        Number of results 1-1-2012 : 8050
        Can't use string ("data_uk.txt") as a symbol ref while "strict refs" in use at 1118172.pl line 67.

        At the point in debugging where the code produces the above result, there's still "fixing" to be done ... but by the time you get there, you should be able to complete the repairs.
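
        For readers who reach that stage: the "symbol ref" message appears to arise because the same scalar, $target, holds the file name and is also used as the filehandle. One common way to avoid this under strict is a separate lexical filehandle. The sketch below is an illustration only, not the poster's final code; it writes to a temporary directory, and the names $out and $in are arbitrary.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# File name taken from the original post, placed in a throwaway directory
# here so that the sketch does not touch any real data.
my $dir    = tempdir( CLEANUP => 1 );
my $target = "$dir/data_uk.txt";

# $out is the filehandle; $target stays a plain string holding the name.
open( my $out, '>', $target )
    or die "Couldn't open $target for writing: $!\n";
print {$out} "1-1-2012: 8050\n";
close($out) or die "Couldn't close $target: $!\n";

# Read the file back to confirm what was written.
open( my $in, '<', $target )
    or die "Couldn't open $target for reading: $!\n";
my @lines = <$in>;
close($in);

print "Wrote ", scalar(@lines), " line(s): ", @lines;
```

        Including $! in the die message also reports why an open failed, which the original message does not.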


        ++$anecdote ne $data


        Questions containing the words "doesn't work" (or their moral equivalent) will usually get a downvote from me unless accompanied by:
        1. code
        2. verbatim error and/or warning messages
        3. a coherent explanation of what "doesn't work" actually means.
Re: Count of articles Google News
by CoVAX (Beadle) on Feb 28, 2015 at 23:55 UTC

    For day-by-day results, consider Google News RSS

    I suggest you abandon Google and instead use Bing's Search API.

    Searched for donut and crumpit. Found donate and stumbit instead.
      > consider Google News RSS

      heh! =D

      Type 1 diabetes PERL clinical trial
      Kidney disease is one of the leading complications of diabetes... To learn more about Preventing Early Renal Loss, visit perl-study.org ...

      Homework problems? Go to http://perl-study.org, check your kidneys! :)

      Cheers Rolf
      (addicted to the Perl Programming Language and ☆☆☆☆ :)

      PS: Je suis Charlie!

      Dear CoVAX, thank you for your quick reply. The one about including the Bing's search API is a good suggestion. Hence I basically need to change the address from which I am doing the requests in the script, isn't it? Would you be so kind to tell me where I could find some examples of this?