in reply to Parsing HTML
Below is a solution. It uses HTML::TreeBuilder::XPath, which (like Corion) I find easier to use than "bare" HTML::TreeBuilder. I also added an option so that, while working on the code, you don't have to keep hitting the live page: using a cache is more polite, and much faster for you.
Also, the problems you had with weird characters can be solved by telling the code that you want to output UTF-8, using binmode( STDOUT, ':utf8');.
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
use Perl6::Slurp; # to load the page from the cache
use HTML::TreeBuilder::XPath; # easier to use than bare HTML::TreeBuilder

# during development we don't want to hit the real page,
# so we'll have a -c switch to use a cache
use Getopt::Std;
my %opt;
getopts( 'c', \%opt); # if called with -c then $opt{c} is true

my $base='http://www.costacrociere.it';
my $url='/it/lista_crociere/capitali_nord_europa-201206.html';
my $cache= 'capitali_nord_europa-201206.html';

# this will get rid of the bad characters you were seeing in the output
binmode( STDOUT, ':utf8');

if( ! $opt{c}) { getstore( $base.$url, $cache); } # only get the live page without -c

my $page= slurp '<:utf8', $cache;
my $p = HTML::TreeBuilder::XPath->new_from_content( $page );
my @trips= $p->findnodes( '//p[@class="itinerari-info"]');
foreach my $trip (@trips){
    # you may want to do something more complex here, but for now it will do
    print "crociera: ", $trip->as_text, "\n";
}
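To see what the binmode line changes, here is a minimal sketch using only core Perl. The sample string and the in-memory filehandles are my own assumptions for illustration; they are not part of the script above. It prints the same character string through a handle without an encoding layer and through one with :utf8, then compares the number of bytes written.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample: a character string containing U+00E0 ("a" with grave),
# as in "citt\x{e0}" from the cruise titles.
my $text = "citt\x{e0}";

# Write the same string to two in-memory filehandles:
# one without an encoding layer, one with :utf8.
my ( $raw, $utf8 ) = ( '', '' );

open( my $fh_raw, '>', \$raw ) or die "cannot open in-memory handle: $!";
print {$fh_raw} $text;
close $fh_raw;

open( my $fh_utf8, '>', \$utf8 ) or die "cannot open in-memory handle: $!";
binmode( $fh_utf8, ':utf8' );    # same layer the script applies to STDOUT
print {$fh_utf8} $text;
close $fh_utf8;

# Without the layer the accented letter goes out as a single Latin-1 byte;
# with :utf8 it goes out as the two-byte UTF-8 sequence.
printf "without layer: %d bytes\n", length $raw;
printf "with :utf8:    %d bytes\n", length $utf8;
```

A UTF-8 terminal shows the single Latin-1 byte as a broken character, which is the kind of "weird character" the binmode call avoids.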
Re^2: Parsing HTML
by marcoss (Novice) on Jun 12, 2012 at 11:05 UTC
Hi mirod, before I go ahead... THANK YOU!! XPath opened a brand new world of possibilities for me. I took a look at Zvon's page and also this page, which is a little bit more for beginners. The thing is, I was able to use your code and also add a few things for the other pieces of information that I needed to extract. Right now it's working just fine, but there's a detail that I haven't been able to modify (basically because the last part of the code you wrote is almost hieroglyphs to me... xD). Anyway, this is the code:
#!/usr/bin/perl -w
use LWP::Simple;
use HTML::TreeBuilder::XPath;
use Data::Dumper;
use strict;
my $debug=1;
my $base='http://www.costacrociere.it';
my $url='/it/lista_crociere/capitali_nord_europa-201207-2.html';
# get() returns undef on failure; $! is not reliably set, so report the URL instead
my $page = get($base.$url) or die "Could not fetch $base$url\n";
my $p = HTML::TreeBuilder::XPath->new_from_content( $page );
binmode( STDOUT, ':utf8');
my @trips= $p->findnodes( '//div[@class="info-cruise"]');
foreach my $trip (@trips){
    my $title = $trip->findvalue( './/div[@class="sx"]/h3');
    print "Trip name: $title\n";
    my $price = $trip->findvalue( './/span[@class="new-price"]');
    print "price: $price\n";
    my $includes = $trip->findvalue('.//p[@class="info-price"]/span[6]'); #I added this line
    print "Includes: $includes\n";
    foreach my $info ( $trip->findnodes( './/p[@class="itinerari-info"]//span[@class != "note" and @class != "strike"]')){
        my $info_title= $info->findnodes( './b')->[0];
        print $info_title->as_text();
        $info_title->detach;
        my $info_value= $info->as_text;
        print ":", $info_value, "\n";
    }
    my $pic = $trip->findvalue('.//img[@class="image_map"]/@src'); # I added this line.
    print "Picture: $base$pic\n";
    print "\n";
}
And this is the output. Well, just one of the results; there's no need to show all of it:
Trip name: Fiordi norvegesi e grandi città del Baltico
price: € 2.615,00
Includes: Crociera + Volo
Itinerario : Danimarca, Estonia, Russia, Finlandia, Svezia, Norvegia
Data partenza: 7 luglio 2012
Nave : Costa Luminosa
N.ro giorni crociera : 14
Porto di partenza : Copenhagen
Documenti di viaggio : Passaporto
Picture: http://www.costacrociere.it/B2C/Images/ItineraryV4/CPH11040__it-IT.gif#CPH11040
Yes, I know what you're thinking... "That's my code... this guy didn't do anything", and you're quite right, I just added those 2 lines. But the good thing is I'm learning!! Using only TreeBuilder was giving me a lot of headaches. OK, so the detail I was telling you about: as you can see in the output, certain pieces of information have an extra space at the beginning. I've been trying chomp and different combinations of print and \n, but nothing does the trick. Where should I look? Right now I'm doing some research to understand what every line of the second foreach loop does. If you can give me some directions on this I will greatly appreciate it (again)!!
Cheers!!
marcos
It's a bit of a pain to figure out where to look, but the as_text method comes from HTML::Element. If you look at the docs, you'll see that in addition to as_text there is also an as_trimmed_text method. It looks like you could use it.
The second foreach loop comes from looking at the HTML source for the page. The data you want is in the p with a class of itinerari-info, in consecutive span elements. Some of the spans can be discarded: the ones with classes of note and strike. The XPath expression returns the remaining ones. Each span includes a b element with the title, which I get in $info_title, display, then detach to get it out of the way. The rest of the span is the information itself.
Does this help?
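As a rough illustration of the two points above, here is a small sketch that runs the same label/value steps on a made-up fragment. The HTML is my assumption, loosely modelled on the p.itinerari-info markup; the real page may well differ. It reads the b element, detaches it, then reads what is left of the span, using as_trimmed_text in both places.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Made-up fragment (an assumption, not copied from the real page).
my $html = '<p class="itinerari-info"><span><b> Nave </b> Costa Luminosa </span></p>';

my $tree = HTML::TreeBuilder::XPath->new_from_content( $html );
my ($span) = $tree->findnodes( '//p[@class="itinerari-info"]/span' );

# The label lives in the <b> child of the span.
my $bold  = $span->findnodes( './b' )->[0];
my $label = $bold->as_trimmed_text;    # leading/trailing whitespace removed

# Detach the <b> so that only the value is left inside the span.
$bold->detach;
my $value = $span->as_trimmed_text;

print "$label: $value\n";

$tree->delete;    # free the tree once we are done with it
```

If the detach line is left out, the label is still part of the span when the value is read, so it shows up twice in the output.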
OK, this clarifies a lot. The as_trimmed_text method worked just fine. I tried commenting out the detach line and, like you said, it prints the title twice. But then, it seems like you have seen something I completely overlooked. The strike class is only for dates that have been removed, which is why I didn't see it before... but still, when I execute the script, the date shows up. Is it a matter of using an if statement? Because it looks to me that the foreach my $info ( $trip->findnodes( './/p[@class="itinerari-info"]//span[@class != "note" and @class != "strike"]')) should take care of it. Hmm, I'm thinking of unless, but those are only assumptions... I'll let you know if I fix this, even though eventually I'll probably be crying out for help xD. Anyway, thank you very much for your time and your patience.
cheers!
marcos
Re^2: Parsing HTML
by marcoss (Novice) on Jun 07, 2012 at 11:59 UTC
Hi mirod, thank you so much for the solution provided!! I had to remove some lines because (from what I understand) you're using Perl 6 and my version is v5.10.1. I'm not familiar with HTML::TreeBuilder::XPath and the findnodes method, so I've been doing some research. I want to see if, by using your script, I can obtain not only all of the trips with all their details, but all of the trips with the details separated. For example, this is the output I need for each trip:
Trip Name: Nordic seas
Price: 500
Itinerary: Denmark, Oslo, Helsinki
Departure date: 12/04/2012
Ship Name: Costa Magica
Includes: Cruise
Departure port: Copenhagen
Duration: 7 days
In this way I can later take all those individual pieces of information to a database. Like I said, I'm new to Perl, and all I do is trial & error, so until I have more time to study during the summer I will appreciate all the help you guys at PerlMonks can provide me. Thanks again for all the great work!!!
Perl6::Slurp is a regular Perl 5 module; it just emulates Perl 6's slurp builtin. Learning a bit of XPath is always useful; look at Zvon's tutorial, for example.
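If installing Perl6::Slurp is not an option, reading the cached file can be approximated with core Perl alone. The helper below is a sketch of my own; the name slurp_utf8 is hypothetical and is not part of Perl6::Slurp. The temporary file only stands in for the cached page so that the example runs on its own.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: read a whole file as UTF-8 text in one go.
sub slurp_utf8 {
    my ($file) = @_;
    open( my $fh, '<:encoding(UTF-8)', $file )
        or die "Cannot open $file: $!";
    local $/;    # no input record separator: <$fh> returns the whole file
    my $content = <$fh>;
    close $fh;
    return $content;
}

# Stand-in for the cached page: a temp file holding UTF-8 encoded bytes.
my ( $out, $filename ) = tempfile( UNLINK => 1 );
binmode( $out );
print {$out} "citt\xc3\xa0\n";
close $out;

my $page = slurp_utf8( $filename );
printf "read %d characters\n", length $page;
```

In the script it would be called as my $page= slurp_utf8( $cache ); in place of the slurp line.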
As for the rest, you need to look at the source of the page, see what information you need and what XPath queries will get it for you.
For example, the cruise info is not all in the p.itinerari-info: the title is in the div.sx element. From the enclosing div.info-cruise you can get the title and price, then go down some more and get the various other fields.
Here is an example, which does not output the 'Includes' field; you'll have to do this one yourself:
#!/usr/bin/perl -w
use strict;
use warnings;
use LWP::Simple;
use Perl6::Slurp; # to load the page from the cache
use HTML::TreeBuilder::XPath; # easier to use than bare HTML::TreeBuilder

# during development we don't want to hit the real page,
# so we'll have a -c switch to use a cache
use Getopt::Std;
my %opt;
getopts( 'c', \%opt); # if called with -c then $opt{c} is true

my $base='http://www.costacrociere.it';
my $url='/it/lista_crociere/capitali_nord_europa-201206.html';
my $cache= 'capitali_nord_europa-201206.html';

# this will get rid of the bad characters you were seeing in the output
binmode( STDOUT, ':utf8');

if( ! $opt{c}) { getstore( $base.$url, $cache); } # only get the live page without -c

my $page= slurp '<:utf8', $cache;
my $p = HTML::TreeBuilder::XPath->new_from_content( $page );
my @trips= $p->findnodes( '//div[@class="info-cruise"]');
foreach my $trip (@trips){
    my $title = $trip->findvalue( './/div[@class="sx"]/h3');
    print "$title\n";
    my $price = $trip->findvalue( './/span[@class="new-price"]');
    print "price: $price\n";
    # this is very brittle, but it gives you a base on which you can build
    foreach my $info ( $trip->findnodes( './/p[@class="itinerari-info"]//span[@class != "note" and @class != "strike"]'))
    {
        my $info_title= $info->findnodes( './b')->[0];
        print $info_title->as_text();
        $info_title->detach;
        my $info_value= $info->as_text;
        print ": ", $info_value, "\n";
    }
    print "\n";
}
:D I might approach that like this (look ma, no slurping)
$ lwp-download http://www.costacrociere.it/it/lista_crociere/capitali_nord_europa-201206.html
Saving to 'capitali_nord_europa-201206.html'...
134 KB received in 1 seconds (134 KB/sec)
$ perl htmltreexpather.pl capitali_nord_europa-201206.html _tag p | ack Copenhagen -C3 | head
//div[@id='ctl00_cph_PageContent_ucCLR_upCLC']/div[@class='info-cruise']/div[@class='sx']/p[@class='note']
------------------------------------------------------------------
HTML::Element=HASH(0xb91ba4) 0.1.0.8.1.0.1.1.1.0.0
Itinerario Danimarca, fiordi norvegesi, Germania Data partenza 17ágiugnoá2012 Nave Costa Fortuna N.ro giorni crociera á
7 Porto di partenza Copenhagen Documenti di viaggio PassaportoáoáCarta d'identità valida per l'espatrio Possono essere
disponibili le seguenti tariffe
/html/body/form/div/div[2]/div/div[2]/div/div[2]/div/p
//div[@id='ctl00_cph_PageContent_ucCLR_rpL_ctl00_BoxDescItinaryDx_pnlInfoCruise']/p
//div[@id='ctl00_cph_PageContent_ucCLR_rpL_ctl00_BoxDescItinaryDx_pnlInfoCruise']/p[@class='itinerari-info']
--
//div[@id='ctl00_cph_PageContent_ucCLR_upCLC']/div[@class='info-cruise']/div[@class='sx']/p[@class='note']
------------------------------------------------------------------
Then plug stuff into Web::Scraper; it's like XML::Rules.
#!/usr/bin/perl --
use strict; use warnings;
use Data::Dump;
use URI;
use Web::Scraper;
my $soy = scraper {
    ## only get leafs/twigs with this @class
    ## store the results into { info => \@info }
    process '.info-cruise' => 'info[]' => scraper {
        process './/div[@class="sx"]/h3' => 'title' => 'TEXT';
        process '.new-price' => 'price' => 'TEXT';
        process '.itinerari-info' => 'span[]' => scraper {
            #~ process '//span' => 'span[]' => 'RAW'; ## this
            process '//span/b | //span/child::text()' => 'span[]' => sub {
                my $ishtml = $_[0]->isa('HTML::Element');
                my $keyOrVal = $ishtml ? 'key' : 'val';
                my %foo = ( $keyOrVal => $_[0]->getValue );
                $foo{raw} = $_[0]->as_XML if $ishtml;
                return \%foo;
            };
        };
    };
};
## NOTE Web::Scraper wants URI objects
my $url = URI->new('file:capitali_nord_europa-201206.html');
my $base='http://www.costacrociere.it';
my $ret = $soy->scrape( $url , $base );
#~ dd $ret;
dd $ret->{info}->[0];
__END__
{
  price => "\x{20AC} 510,00",
  span => [
    {
      span => [
        { key => " Itinerario ", raw => "<b> Itinerario </b>\n" },
        { val => " Danimarca, fiordi norvegesi, Germania" },
        { val => " " },
        { key => "Data partenza", raw => "<b>Data partenza</b>\n" },
        { val => " 17\xA0giugno\xA02012 " },
        { key => " Nave ", raw => "<b> Nave </b>\n" },
        { val => " Costa Fortuna" },
        {
          key => " N.ro giorni crociera \xA0 ",
          raw => "<b> N.ro giorni crociera \xA0 </b>\n",
        },
        { val => " 7" },
        { key => " Porto di partenza ", raw => "<b> Porto di partenza </b>\n" },
        { val => " Copenhagen" },
        {
          key => " Documenti di viaggio ",
          raw => "<b> <a href=\"http://www.costacrociere.it/B2C/I/Before_you_go/documentation/travel.htm\" target=\"_blank\">Documenti di viaggio</a> </b>\n",
        },
        {
          val => " Passaporto\xA0o\xA0Carta d'identit\xE0 valida per l'espatrio",
        },
        { val => " Possono essere disponibili le seguenti tariffe " },
      ],
    },
  ],
  title => "Le terre dei vichinghi",
}
I wouldn't be surprised if tobyink stops by with a Web::Magic example :)
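Since the goal stated earlier in the thread is one record per trip for a database, the flat key/val list in the dump could be folded into a hash. The sketch below uses only core Perl; the helper names clean and fold_pairs are hypothetical, and the sample entries are a few copied from the dump with the raw fields left out. It assumes the first non-blank val after a key is that key's value, which may not hold for every entry on the real page.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A few entries copied from the dump (raw fields omitted for brevity).
my @spans = (
    { key => " Itinerario " },
    { val => " Danimarca, fiordi norvegesi, Germania" },
    { val => " " },
    { key => "Data partenza" },
    { val => " 17\xA0giugno\xA02012 " },
    { key => " N.ro giorni crociera \xA0 " },
    { val => " 7" },
    { val => " Possono essere disponibili le seguenti tariffe " },
);

# Hypothetical helper: turn non-breaking spaces into plain spaces,
# trim both ends and collapse runs of whitespace.
sub clean {
    my ($s) = @_;
    $s =~ s/\xA0/ /g;
    $s =~ s/^\s+|\s+$//g;
    $s =~ s/\s+/ /g;
    return $s;
}

# Hypothetical helper: pair each key with the first non-blank val after it.
sub fold_pairs {
    my @items = @_;
    my ( %field, $label );
    for my $item (@items) {
        if ( defined $item->{key} ) {
            $label = clean( $item->{key} );
            next;
        }
        next unless defined $label && defined $item->{val};
        my $value = clean( $item->{val} );
        next if $value eq '';       # skip whitespace-only text nodes
        $field{$label} = $value;
        undef $label;               # later vals without a new key are ignored
    }
    return \%field;
}

my $field = fold_pairs(@spans);
print "$_: $field->{$_}\n" for sort keys %$field;
```

With the real scraper result, the list to pass in would be @{ $ret->{info}[0]{span}[0]{span} }, following the structure shown in the dump.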