in reply to Re^2: Problem in using Web::Scraper, coming from HTML::TreeBuilder::XPath
in thread Problem in using Web::Scraper, coming from HTML::TreeBuilder::XPath
#!/usr/bin/perl --
use strict;
use warnings;
use WWW::Mechanize 1.73;
use Web::Scraper 0.37;
use Data::Dump;

my $out = scraper {
    process ".gs_rt", "title[]" => scraper {
        process ".gs_a", "info" => 'TEXT';
        process q{gs_a}, "info" => 'TEXT';
    };
};

my $mech = WWW::Mechanize->new(qw/ autocheck 1 /);
$mech->show_progress(1);
$mech->get(
    "http://scholar.google.it/scholar?hl=en&q=Handbuch+der+biologischen+Arbeitsmethoden"
);

if( $mech->follow_link( url_regex => qr/cites/i, n => 1 ) ){
    my $result = $mech->content;
    my $indi = $mech->uri();
    my $res = $out->scrape( $result, $indi );
    #~ dd( $result, $res );
    dd( $res );
}
__END__
$ perl web-scraper-google-pm1057095.pl
** GET http://scholar.google.it/scholar?hl=en&q=Handbuch+der+biologischen+Arbeitsmethoden ==> 200 OK (1s)
** GET http://scholar.google.it/scholar?cites=3692889479872081319&as_sdt=2005&sciodt=0,5&hl=en&oe=ASCII ==> 200 OK
{ title => [{}, {}, {}, {}, {}, {}, {}, {}, {}, {}] }
If you want to fix up your 'css paths', run htmltreexpather.pl / xpather.pl on the page and compare the hierarchy of the two elements:
HTML::Element=HASH(0xcac644) 0.1.0.8.0.1.2.0.0
The fire of life. An introduction to animal energetics.
/html/body/div/div[5]/div/div[2]/div[2]/div/h3
//div[@id='gs_ccl']/div[2]/div/h3
//div[@id='gs_ccl']/div[@style='z-index:400' and @class='gs_r']/div[@class='gs_ri']/h3[@class='gs_rt']
------------------------------------------------------------------
HTML::Element=HASH(0xcac534) 0.1.0.8.0.1.2.0.1
M Kleiber - The fire of life. An introduction to animal energetics., 1961 - cabdirect.org
/html/body/div/div[5]/div/div[2]/div[2]/div/div
//div[@id='gs_ccl']/div[2]/div/div
//div[@id='gs_ccl']/div[@style='z-index:400' and @class='gs_r']/div[@class='gs_ri']/div[@class='gs_a']
------------------------------------------------------------------
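If xpather.pl isn't at hand, a similar comparison can be sketched with HTML::TreeBuilder::XPath directly. This is only a rough illustration: the `$html` snippet is a hypothetical, trimmed-down stand-in for the real results page, keeping just the class names from the output above.

```perl
#!/usr/bin/perl --
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Assumption: hand-written sample markup, not the real Google Scholar page
my $html = <<'HTML';
<html><body>
<div id="gs_ccl">
  <div class="gs_r"><div class="gs_ri">
    <h3 class="gs_rt">The fire of life. An introduction to animal energetics.</h3>
    <div class="gs_a">M Kleiber - The fire of life. An introduction to animal energetics., 1961 - cabdirect.org</div>
  </div></div>
</div>
</body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

my ($title_node) = $tree->findnodes(q{//h3[@class='gs_rt']});
my ($info_node)  = $tree->findnodes(q{//div[@class='gs_a']});

# Print each node's address in the tree and the class of its parent
for my $node ( $title_node, $info_node ) {
    printf "%-6s address=%s parent class=%s\n",
        $node->attr('class'), $node->address, $node->parent->attr('class');
}

# Look for .gs_a *inside* .gs_rt, which is what the nested scraper expects
my @nested = $tree->findnodes(q{//h3[@class='gs_rt']//div[@class='gs_a']});
print "gs_a nodes inside gs_rt: ", scalar(@nested), "\n";
```

The addresses differ only in their last component when the two nodes share a parent, which is the same thing the `0.1.0.8.0.1.2.0.0` / `0.1.0.8.0.1.2.0.1` pair in the xpather output shows.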
gs_a is not a child of gs_rt; they're siblings, and both are children of gs_ri:
//div[@class='gs_r']/div[@class='gs_ri']/h3[@class='gs_rt']
//div[@class='gs_r']/div[@class='gs_ri']/div[@class='gs_a']
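Given that hierarchy, one way to get non-empty results is to anchor the outer `process` on the common parent (`.gs_ri`) and pull both siblings out in the nested scraper. The following is a minimal sketch, not tested against the live site: the `$html` sample is hypothetical markup standing in for the page, and the `results`, `title` and `info` key names are just my choice.

```perl
#!/usr/bin/perl --
use strict;
use warnings;
use Web::Scraper;
use Data::Dump;

# Assumption: hand-written sample markup, not the real Google Scholar page
my $html = <<'HTML';
<html><body>
<div id="gs_ccl">
  <div class="gs_r"><div class="gs_ri">
    <h3 class="gs_rt">The fire of life. An introduction to animal energetics.</h3>
    <div class="gs_a">M Kleiber - The fire of life. An introduction to animal energetics., 1961 - cabdirect.org</div>
  </div></div>
</div>
</body></html>
HTML

my $out = scraper {
    # Select the shared parent, one hash per search result
    process ".gs_ri", "results[]" => scraper {
        process ".gs_rt", "title" => 'TEXT';   # the h3 heading
        process ".gs_a",  "info"  => 'TEXT';   # the author/source line
    };
};

my $res = $out->scrape($html);
dd( $res );
```

In the original program the same `$out` would be passed the `$mech->content` and `$mech->uri()` as before; only the selectors change.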