I've found (being fairly close to a beginner myself at parsing HTML) that it's best to attack such a problem in little pieces. Use print/printf along the way to show what your data looks like at each step, and use Data::Dumper when you need to go over a data structure with a fine-tooth comb.
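As a rough sketch of that debugging habit, here is how printf and Data::Dumper might be used together. The %link hash and the index.html path are made up for illustration; substitute whatever data you have at that point in your own program.

```perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical sample data standing in for whatever you have parsed so far
my %link = (
    text => 'Home',
    href => 'http://www.somepage.com/index.html',
    tags => [ 'a', 'b' ],
);

# Sortkeys keeps the key order stable between runs; Indent = 1 keeps the dump compact
$Data::Dumper::Sortkeys = 1;
$Data::Dumper::Indent   = 1;

# A quick look at one scalar, then a deep look at the whole structure
printf "TFD> text is '%s'\n", $link{text};
print  "TFD> link is ", Dumper(\%link);
```

Passing a reference (\%link) rather than the hash itself makes Dumper show the structure as one nested unit instead of a flat list of keys and values.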
I don't see in your program where you're trying to construct the HTML tree, so I took your program and extended it a bit. Here's what I have:
# Strict
use strict;
use warnings;

# Libraries
use Data::Dumper;
use LWP::UserAgent;
use HTML::TreeBuilder;

my $url = 'http://www.somepage.com';

my $browser = LWP::UserAgent->new;
# $browser->cookie_jar({});    #### use if the site requires cookies

my @ns_headers = (
    'User-Agent'      => 'Mozilla/4.76 [en] (Win98; U)',
    'Accept'          => 'image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*',
    'Accept-Charset'  => 'iso-8859-1,*,utf-8',
    'Accept-Language' => 'en-US',
);

my $response = $browser->get($url, @ns_headers);
die "Can't get $url -- ", $response->status_line
    unless $response->is_success;
die "Hey, I was expecting HTML, not ", $response->content_type
    unless $response->content_type eq 'text/html';

# Now get the content, and display it
my $content = $response->content;
print "TFD> content $content\n";

# Now build the HTML tree
my $tree = HTML::TreeBuilder->new_from_content($content);

# Now find the first occurrence of the desired tag
my $tag   = 'a';
my $match = $tree->find($tag);
if ($match) {
    # Found it!
    print "Found tag '$tag' ...\n";
    print "    As text: ", $match->as_text, "\n";
    print "    As HTML: ", $match->as_HTML, "\n";
} else {
    print "Unable to find tag '$tag'\n";
}
Note that I'm building an HTML tree from the $content which is returned after a successful get from the LWP::UserAgent object.
The program then prints out the contents in the line:
print "TFD> content $content\n";
as a debugging step (you can remove that once you're sure you're getting what you expect back from the LWP fetch).
Then you construct the HTML tree with:
my $tree = HTML::TreeBuilder->new_from_content($content);
Finally, you use find to locate an occurrence of the desired tag. In the program above, I searched for the first occurrence of an anchor 'a' with:
my $tag   = 'a';
my $match = $tree->find($tag);
which is then rendered both as text and HTML with:
print "    As text: ", $match->as_text, "\n";
print "    As HTML: ", $match->as_HTML, "\n";
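If you want every occurrence rather than just the first, one possible approach is to call find in list context, where it returns all matching elements. The sketch below parses a made-up HTML string instead of fetching a live page, and extract_links is a hypothetical helper name of my own, not something from your program.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical HTML snippet, used here instead of a live fetch
my $content = <<'HTML';
<html><body>
<p><a href="/one">First link</a> and <a href="/two">Second link</a></p>
</body></html>
HTML

# Return a list of [text, href] pairs, one per <a> element in the given HTML
sub extract_links {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);

    # In list context, find() returns every matching element, not just the first
    my @links = map { [ $_->as_text, $_->attr('href') ] } $tree->find('a');

    $tree->delete;    # free the tree explicitly once we have what we need
    return @links;
}

for my $link (extract_links($content)) {
    my ($text, $href) = @$link;
    print "Found link '$text' -> ", (defined $href ? $href : '(no href)'), "\n";
}
```

The helper copies the text and href out of each element before deleting the tree, so the caller never holds on to elements from a tree that has already been freed.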
Does that help you get further along?
In reply to Re: HTML parsing OR capturing text from a string within tags
by liverpole
in thread HTML parsing OR capturing text from a string within tags
by kevyt