Again, where is the disconnect? What have you tried to do to combine the two? Is the interpreter outputting errors? Is the issue just that you are not familiar with Perl syntax? Post what you've done (even if it is just pseudocode) and we can work through it together. If you expect a monk to take two scripts you found on the internet and merge them for you, then the Your Work section of How (Not) To Ask A Question is very relevant.
Re^4: Combining Excel Parser with Google Scholar Scraper
by ochez (Initiate) on Apr 14, 2009 at 16:07 UTC
    #!/usr/bin/perl
    use WWW::Mechanize;
    #!/usr/bin/perl -w
    use strict;
    use Win32::OLE qw(in with);
    use Win32::OLE::Const 'Microsoft Excel';

    $Win32::OLE::Warn = 3; # die on errors...

    # get already active Excel application or open new
    my $Excel = Win32::OLE->GetActiveObject('Excel.Application')
        || Win32::OLE->new('Excel.Application', 'Quit');

    # open Excel file
    my $Book = $Excel->Workbooks->Open("C:/Documents and Settings/rto5u/My Documents/CV.xls");

    # select worksheet number 1 (you can also select a worksheet by name)
    my $Sheet = $Book->Worksheets(1);

    foreach my $row (2..4) {
        foreach my $col (1..1) {
            # skip empty cells
            next unless defined $Sheet->Cells($row,$col)->{'Value'};

            my $URL = 'http://scholar.google.com/advanced_scholar_search';
            my $FORM_NAME = 'f';
            #print "Author Name: ";
            #chomp ($AUTHOR = <>);
            my $AUTHOR = "MD Li";
            #print "Paper Title: ";
            #chomp ($TITLE = <>);
            my $TITLE = $Sheet->Cells($row,$col)->{'Value'};
            print "$TITLE";
            #my $TITLE = "Region-specific transcriptional response to chronic nicotine in rat brain";

            my $mech = WWW::Mechanize->new(stack_depth=>10);
            $mech->get($URL) || die ("Could not connect to $URL.\n");
            my $res = $mech->submit_form(
                form_name => $FORM_NAME,
                fields => {
                    'num' => 100,
                    'as_epq' => $TITLE,
                    'as_occt' => 'title',
                    'as_sauthors' => $AUTHOR,
                    'as_allsubj' => 'all',
                },
            );

            while ($res && $res->is_success()){
                my $content = $res->content;
                #print $content;
                while ($content =~ /<p class=g>(.*?)<\/font>\s\s\s/gs){
                    my $section = $1;
                    my $title = "";
                    my $citedby = 0;
                    # get title
                    $title = getTitle($section);
                    $title =~ s/<.*?>//g;
                    $title =~ s/&hellip;/\.\.\./g;
                    # get citedby #
                    $citedby = getCitedBy($section);
                    if ($citedby){
                        print "\"$title\"\nCited by: $citedby\n\n";
                    }
                }
                $res = $mech->follow_link( text_regex => qr/Next/i);
            }
        }
    }
    $Book->Close;

    #############################################################################
    sub getTitle($){
        my ($section) = @_;
        my $title;
        if ($section =~ /<span class="w">.*?<a href.*?>(.*?)<\/a><\/span>/s){ # papers with a link
            $title = $1;
        }elsif ($section =~ /&nbsp;(.*?)<font size=-1>/s){ # papers w/o a link
            $title = $1;
        }else{
            $title = $1;
        }
        return $title;
    }
    #----------------------------------------------------------------------------
    sub getCitedBy($){
        my ($section) = @_;
        my $citedby;
        if ($section =~ />Cited by (\d+)</s){
            $citedby = $1;
        }
        return $citedby;
    }
    #----------------------------------------------------------------------------

    The two programs work separately. I tried to put the fetch.pl program inside the for loop that goes through the paper titles in the Excel spreadsheet. I tried troubleshooting as best I could, but the problem consistently turns out to be, "Can't call method "url" on an undefined value at C:/strawberry/perl/site/lib/WWW/Mechanize.pm line 707"
    Again, I apologize for my lack of familiarity with the customs of this forum. I'm not necessarily new to programming, but I am extremely new to Perl syntax.

      Don't worry about a present lack of experience - only Larry Wall was born with knowledge of Perl. The rest of us are acolytes.

      The error being reported means that some code in the Mechanize module is attempting to call a method named url on a variable whose value is undefined. Since it's more likely that this code has a bug than that WWW::Mechanize does, it implies you are either passing it bad values or calling it wrong. My best guess is that the Excel file is misformatted, but replicating a parsing issue without the file in question is difficult. Try running the following code and see if the output gives you any indication of which lines in the file may be problematic.

      #!/usr/bin/perl
      use strict;
      use WWW::Mechanize;
      use Win32::OLE qw(in with);
      use Win32::OLE::Const 'Microsoft Excel';

      $Win32::OLE::Warn = 3; # die on errors...

      # get already active Excel application or open new
      my $Excel = Win32::OLE->GetActiveObject('Excel.Application')
          || Win32::OLE->new('Excel.Application', 'Quit');

      # open Excel file
      my $Book = $Excel->Workbooks->Open("C:/Documents and Settings/rto5u/My Documents/CV.xls");

      # select worksheet number 1 (you can also select a worksheet by name)
      my $Sheet = $Book->Worksheets(1);

      foreach my $row (2..4) {
          foreach my $col (1..1) {
              # skip empty cells
              next unless defined $Sheet->Cells($row,$col)->{'Value'};

              my $URL = 'http://scholar.google.com/advanced_scholar_search';
              my $FORM_NAME = 'f';
              #print "Author Name: ";
              #chomp ($AUTHOR = <>);
              my $AUTHOR = "MD Li";
              print "Author Name: $AUTHOR\n";
              #print "Paper Title: ";
              #chomp ($TITLE = <>);
              my $TITLE = $Sheet->Cells($row,$col)->{'Value'};
              print "Paper Title: $TITLE\n";
              #print "$TITLE";
              #my $TITLE = "Region-specific transcriptional response to chronic nicotine in rat brain";

              my $mech = WWW::Mechanize->new(stack_depth=>10);
              $mech->get($URL) || die ("Could not connect to $URL.\n");
              my $res = $mech->submit_form(
                  form_name => $FORM_NAME,
                  fields => {
                      'num' => 100,
                      'as_epq' => $TITLE,
                      'as_occt' => 'title',
                      'as_sauthors' => $AUTHOR,
                      'as_allsubj' => 'all',
                  },
              );

              while ($res && $res->is_success()){
                  my $content = $res->content;
                  #print $content;
                  while ($content =~ /<p class=g>(.*?)<\/font>\s\s\s/gs){
                      my $section = $1;
                      my $title = "";
                      my $citedby = 0;
                      # get title
                      $title = getTitle($section);
                      $title =~ s/<.*?>//g;
                      $title =~ s/&hellip;/\.\.\./g;
                      # get citedby #
                      $citedby = getCitedBy($section);
                      if ($citedby){
                          print "\"$title\"\nCited by: $citedby\n\n";
                      }
                  }
                  $res = $mech->follow_link( text_regex => qr/Next/i);
              }
          }
      }
      $Book->Close;

      #############################################################################
      sub getTitle($){
          my ($section) = @_;
          my $title;
          if ($section =~ /<span class="w">.*?<a href.*?>(.*?)<\/a><\/span>/s){ # papers with a link
              $title = $1;
          }elsif ($section =~ /&nbsp;(.*?)<font size=-1>/s){ # papers w/o a link
              $title = $1;
          }else{
              $title = $1;
          }
          return $title;
      }
      #----------------------------------------------------------------------------
      sub getCitedBy($){
          my ($section) = @_;
          my $citedby;
          if ($section =~ />Cited by (\d+)</s){
              $citedby = $1;
          }
          return $citedby;
      }
      #----------------------------------------------------------------------------

      A couple of notes on the code:

      1. The lines starting with #! tell Unix-like systems which interpreter should run the file. They are only meaningful on the first line of a file, so the second one in your script is treated as an ordinary comment. The -w switch turns on warnings, much like the warnings pragma, although -w applies globally while the pragma is lexically scoped.
      2. Your subroutines are declared with prototypes, i.e. the ($). A prototype tells the Perl compiler what the argument list looks like. Prototypes are generally avoided (see subroutine prototypes still bad?). If you are going to use them, the subroutines must be declared before they are called, e.g. at the top of the file; otherwise the prototype is ignored for those calls. For you this just involves a copy-paste.
      3. The foreach indices on $row and $col may not correspond to the areas of the file you intend to loop over.
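To make the prototype note concrete, here is a minimal sketch of what a ($) prototype actually changes. The helper name first_arg and the @titles list are made up for illustration and are not part of either script. The point is only that a prototype seen before the call forces scalar context on the argument, which is rarely what people expect.

```perl
use strict;
use warnings;

# Hypothetical helper, declared BEFORE its first call so that the
# ($) prototype is known when the calls below are compiled.
sub first_arg($) {
    my ($value) = @_;
    return $value;
}

my @titles = ('Paper A', 'Paper B', 'Paper C');

# The ($) prototype imposes scalar context on the argument,
# so the array is passed as its element count.
my $with_prototype = first_arg(@titles);

# Calling with a leading & bypasses the prototype, so the array
# is flattened into the argument list as usual.
my $without_prototype = &first_arg(@titles);

print "with prototype:    $with_prototype\n";
print "without prototype: $without_prototype\n";
```

With the prototype in effect the sub receives 3 (the number of elements); with the prototype bypassed it receives 'Paper A'. In the scripts above only a single scalar is ever passed, so dropping the ($) would not change their behavior.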

      If the above does not elucidate your issue, I'll need to see the Excel file in order to debug further.
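For reference, the class of error in the report can be reproduced without Mechanize or Excel. The following is a minimal sketch in core Perl only. Local::Link, find_next_link and next_url are hypothetical stand-ins invented for this example, not WWW::Mechanize APIs. It shows how calling a method on an undefined value produces that message, and how a definedness check avoids it.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a link object; NOT part of WWW::Mechanize.
package Local::Link;
sub new { my ($class, $url) = @_; return bless { url => $url }, $class }
sub url { my ($self) = @_; return $self->{url} }

package main;

# Hypothetical lookup that returns undef when nothing matches,
# e.g. a results page that has no "Next" link.
sub find_next_link {
    my ($has_next) = @_;
    return $has_next ? Local::Link->new('http://example.com/page2') : undef;
}

# Unguarded call: dies with the same kind of message as in the report.
my $link  = find_next_link(0);
my $error = '';
eval { $link->url; 1 } or $error = $@;
print "Unguarded: $error";

# Guarded call: check for definedness before calling a method.
sub next_url {
    my ($candidate) = @_;
    return defined $candidate ? $candidate->url : undef;
}

my $guarded = next_url($link);
print "Guarded: ", (defined $guarded ? $guarded : 'no next page'), "\n";
```

Whether the undefined value in your script comes from a missing link, a failed form submission or a bad cell value is something only the debugging output can tell; the sketch just illustrates the pattern to look for.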