My experience with HTML::TreeBuilder led me to HTML::TreeBuilder::XPath, which I would recommend you use as well.
(HTML::TreeBuilder is nice, but HTML::TreeBuilder::XPath is abstract enough to be elegant.)
Let's see how your code would look with that module:
    #!/usr/local/bin/perl
    use strict;
    use warnings;
    use LWP::Simple;
    use HTML::TreeBuilder::XPath;
    use Data::Dumper;
    use feature 'say';
    my $url="http://www.ebi.ac.uk/thornton-srv/databases/cgi-bin/pdbsum/GetPage.pl?pdbcode=1r9t&template=main.html";

    my $p = HTML::TreeBuilder::XPath->new_from_content(get($url));  # fetch and parse the page

    my @chain_tags = $p->findnodes("//td//a[contains(\@href,'chain=')]");  # a tags inside a td
    my @chains = map { $_->attr('href') =~ /chain=(\w)/ } @chain_tags;     # chain letter from each href
    say " Number of chains : " . scalar @chains;
    say @chains;
EDIT: small adjustments
OUTPUT:
     Number of chains : 11
    AABCEFHIJKL
First of all, we have cut the code from 33 lines to 15. Second, we have kept the meaning of the code, which was:
"Give me the a tags that sit somewhere inside a td tag and whose href attribute contains chain=. Then apply the regex chain=(\w) to each href and take the word character right after the chain= substring."
The selection part is exactly what this XPath query says => //td//a[contains(@href,'chain=')] (in the Perl source the @ is written as \@ because the query sits in a double-quoted string). The map with the regex then does the extraction.
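To see what the query selects without fetching the real page, here is a minimal sketch that runs the same XPath against a small inline document. The HTML snippet is made up for illustration and is only assumed to resemble the structure of the PDBsum page; the variable names are mine as well.

```perl
use strict;
use warnings;
use feature 'say';
use HTML::TreeBuilder::XPath;

# Hypothetical sample markup standing in for the real page.
my $html = <<'HTML';
<table><tr>
  <td><a href="GetPage.pl?pdbcode=1r9t&amp;chain=A">A</a></td>
  <td><span><a href="GetPage.pl?pdbcode=1r9t&amp;chain=B">B</a></span></td>
  <td><a href="other.html">no chain here</a></td>
</tr></table>
<p><a href="GetPage.pl?chain=Z">not inside a td</a></p>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# Single-quoted q{} string, so the @ needs no backslash here.
# //td//a matches a tags at any depth below a td, not only direct children.
my @links = $tree->findnodes(q{//td//a[contains(@href,'chain=')]});

# In list context the match returns the captured chain letter for each href.
my @chains = map { $_->attr('href') =~ /chain=(\w)/ } @links;

say "Number of chains : " . scalar @chains;
say join ',', @chains;

$tree->delete;    # free the tree once we are completely done with it
```

The link nested in a span should still be matched, because // means "any descendant", while the link in the p tag and the link without chain= should be skipped.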
I have also used this Firefox addon to check that my XPath was right.
If you're interested in reading more about XPath read here and here.
You also have a mistake in your code: you delete the HTML::TreeBuilder object after the first iteration of the for loop.
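A rough sketch of the fix, assuming a loop shaped roughly like yours (the inline HTML and the variable names are invented for the example): walk over all the links first and call delete only once, after the loop has finished.

```perl
use strict;
use warnings;
use feature 'say';
use HTML::TreeBuilder;

# Hypothetical markup; in the real script this would be the fetched page.
my $tree = HTML::TreeBuilder->new_from_content(
    '<table><tr>'
  . '<td><a href="?chain=A">A</a></td>'
  . '<td><a href="?chain=B">B</a></td>'
  . '</tr></table>'
);

my @chains;
for my $link ( $tree->look_down( _tag => 'a' ) ) {
    my $href = $link->attr('href');
    next unless defined $href;
    push @chains, $1 if $href =~ /chain=(\w)/;
    # No $tree->delete in here: the remaining elements are still needed.
}

$tree->delete;    # safe now, every link has been processed

say "@chains";
```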
In reply to Re: problem parsing html
by spx2
in thread problem parsing html
by paola82