Re^2: Parsing HTML

Hi mirod, before I go ahead... THANK YOU!! XPath opened a brand new world of possibilities for me. I took a look at Zvon's page and also this page, which is a little bit more for beginners. I was able to use your code and add a few things for the other pieces of information that I needed to extract. Right now it's working just fine, but there's a detail that I haven't been able to change (basically because the last part of the code you wrote is almost hieroglyphs to me... xD). Anyway, this is the code:

#!/usr/bin/perl -w
use LWP::Simple;
use HTML::TreeBuilder::XPath;
use Data::Dumper;
use strict;
my $debug=1;
my $base='http://www.costacrociere.it';
my $url='/it/lista_crociere/capitali_nord_europa-201207-2.html';
my $page = get($base.$url) or die "Could not fetch $base$url\n"; # LWP::Simple's get() returns undef on failure and does not set $!
my $p = HTML::TreeBuilder::XPath->new_from_content( $page );
binmode( STDOUT, ':utf8');

my @trips= $p->findnodes( '//div[@class="info-cruise"]');
foreach my $trip (@trips){
        my $title = $trip->findvalue( './/div[@class="sx"]/h3');
        print "Trip name: $title\n";
        my $price = $trip->findvalue( './/span[@class="new-price"]');
        print "price: $price\n";
        my $includes = $trip->findvalue('.//p[@class="info-price"]/span[6]'); # I added this line
        print "Includes: $includes\n";

        foreach my $info ( $trip->findnodes( './/p[@class="itinerari-info"]//span[@class != "note" and @class != "strike"]')){
                my $info_title= $info->findnodes( './b')->[0];
                print $info_title->as_text();
                $info_title->detach;
                my $info_value= $info->as_text;
                print ":", $info_value, "\n";
        }
        my $pic = $trip->findvalue('.//img[@class="image_map"]/@src'); # I added this line.
        print "Picture: $base$pic\n";
        print "\n";
}

And this is the output (well, just one of the results; there's no need to show all of it):

Trip name: Fiordi norvegesi e grandi città del Baltico
price: € 2.615,00
Includes: Crociera + Volo
 Itinerario : Danimarca, Estonia, Russia, Finlandia, Svezia, Norvegia
Data partenza:  7 luglio 2012 
 Nave : Costa Luminosa
 N.ro giorni crociera   : 14
 Porto di partenza : Copenhagen
 Documenti di viaggio : Passaporto
Picture: http://www.costacrociere.it/B2C/Images/ItineraryV4/CPH11040__it-IT.gif#CPH11040

Yes, I know what you're thinking... "That's my code... this guy didn't do anything", and you're quite right, I just added those 2 lines. But the good thing is I'm learning!! Using only TreeBuilder was giving me a lot of headaches. Ok, so about the detail I mentioned: as you can see in the output, certain pieces of information have an extra space at the beginning. I've been trying chomp and different combinations of print and \n, but nothing does the trick. Where should I look? Right now I'm doing some research to understand what every line of the second foreach loop does. If you can give me some directions on this I will greatly appreciate it (again)!!

Cheers!!

marcos

Re^3: Parsing HTML by mirod (Canon) on Jun 12, 2012 at 11:56 UTC
It's a bit of a pain to figure out where to look, but the `as_text` method comes from HTML::Element. If you look at the docs, you'll see that in addition to `as_text` there is also an `as_trimmed_text` method. It looks like you could use it. The second `foreach` loop comes from looking at the HTML source for the page. The data you want is in the `p` with a `class` of `itinerari-info`, in consecutive `span`s. Some of the spans can be discarded: the ones with classes of `note` and `strike`. The XPath expression returns the remaining ones. Each span includes a `b` element with the title, which I get in `$info_title`, display, then `detach` to get it out of the way. The rest of the span is the information itself. Does this help?
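To make the `as_trimmed_text` and `detach` steps concrete, here is a minimal sketch. The markup is made up for illustration (the real page's HTML isn't shown in this thread), and the class name `info` is an assumption; only `findnodes`, `as_trimmed_text`, `detach` and `delete` are actual HTML::TreeBuilder::XPath / HTML::Element methods.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical markup, loosely modelled on the spans described above.
my $html = <<'HTML';
<p class="itinerari-info">
  <span class="info"><b> Nave </b> Costa Luminosa</span>
</p>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# In list context findnodes returns the matching elements.
my ($span)  = $tree->findnodes('//p[@class="itinerari-info"]//span');
my ($label) = $span->findnodes('./b');

# as_trimmed_text is as_text with leading/trailing whitespace removed.
my $title = $label->as_trimmed_text;

# Remove the <b> from the span so its text is not repeated in the value.
$label->detach;
my $value = $span->as_trimmed_text;

print "$title: $value\n";

# Free the parse tree once we are done with it.
$tree->delete;
```

If you comment out the `detach` line, `$value` will contain the title text as well, since `as_trimmed_text` on the span collects the text of everything still inside it.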
Re^4: Parsing HTML by marcoss (Novice) on Jun 13, 2012 at 08:22 UTC
Ok, this clarifies a lot. `as_trimmed_text` worked just fine. I tried commenting out the `detach` line and, like you said, it prints the title twice. But then, it seems like you have seen something I completely overlooked. The `strike` class is only for dates that have been removed, that's why I didn't see it before... but when I run the script, the date still shows up. Is it a matter of using an `if` statement? Because it looks to me like the `foreach my $info ( $trip->findnodes( './/p[@class="itinerari-info"]//span[@class != "note" and @class != "strike"]'))` should take care of it. Hmm, I'm thinking of `unless`, but those are only assumptions... I'll let you know if I fix this, even though probably, eventually, I'll be crying out for help xD. Anyway, thanks very much for your time and your patience. cheers! marcos
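One possible explanation, and this is only a guess since the page source isn't shown here: the XPath predicate decides which spans are returned, not what they contain. If the struck-out date sits inside another span that does match, `as_text` on the outer span still includes it. The sketch below uses invented markup (the class name `date` and both dates' placement are assumptions) to show that case and one way around it.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical markup: the struck-out date is nested inside a matching span.
my $html = <<'HTML';
<p class="itinerari-info">
  <span class="date"><b>Data partenza:</b> <span class="strike">30 giugno 2012</span> 7 luglio 2012</span>
</p>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# Same predicate as in the script: the inner "strike" span is not returned,
# but the outer span is, and it still contains the struck text.
my @spans = $tree->findnodes(
    '//p[@class="itinerari-info"]//span[@class != "note" and @class != "strike"]'
);
my ($outer) = @spans;
my $before  = $outer->as_trimmed_text;

# Detach any nested "strike" spans before reading the text.
$_->detach for $outer->findnodes('.//span[@class="strike"]');
my $after = $outer->as_trimmed_text;

print "before: $before\n";
print "after:  $after\n";

$tree->delete;
```

If the real page is structured differently (for example, if the span carries several class names at once, so that `@class` is never exactly `strike`), the fix would differ, so it is worth checking the HTML source around the date first.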
Re^3: Parsing HTML by Anonymous Monk on Jun 12, 2012 at 11:34 UTC
Where should I look? :\| Guess. :) PerlMonks. site:perlmonks.org remove extra space at beginning -> How do I remove whitespace at the beginning or end of my string? Tutorials -> Perl documentation documentation, Searching Perl Documentation, How to Read Perldocs, all part of Tutorials sections Understanding and Using PerlMonks, Getting Started with Perl, perlintro
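For reference, the FAQ answer pointed to above boils down to a pair of substitutions. A minimal sketch in plain Perl follows; the helper name `trim` is made up for this example. It also shows why `chomp` didn't help: `chomp` only removes a trailing newline (the input record separator), never leading spaces.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns a copy of the string without
# leading or trailing whitespace.
sub trim {
    my ($text) = @_;
    $text =~ s/^\s+//;   # strip leading whitespace
    $text =~ s/\s+$//;   # strip trailing whitespace (newline included)
    return $text;
}

my $raw = "  7 luglio 2012 \n";

# chomp removes only the trailing newline; the spaces stay.
my $chomped = $raw;
chomp $chomped;

my $clean = trim($raw);

print "chomped: [$chomped]\n";
print "trimmed: [$clean]\n";
```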