in reply to HTML parsing OR capturing text from a string within tags

Hi kevyt,

I've found (being fairly close to a beginner myself at parsing HTML) that it's best to attack such a problem in little pieces.  Use print/printf along the way to show what your data looks like at each step, and use Data::Dumper to inspect your data with a fine-tooth comb.
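As a rough illustration of the Data::Dumper suggestion, here is a minimal sketch. The %page hash and its contents are made up for the example; substitute whatever structure your own program builds.

```perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical sample data standing in for whatever your program produced.
my %page = (
    title => 'Example page',
    links => [ 'http://www.somepage.com/a', 'http://www.somepage.com/b' ],
);

# Sort hash keys so the dump looks the same from run to run.
$Data::Dumper::Sortkeys = 1;

# Dumper takes a reference and returns a printable description of the structure.
my $dump = Dumper(\%page);
print $dump;
```

Dumping a reference (\%page) rather than the hash itself keeps the whole structure together in a single $VAR1.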

I don't see in your program where you're trying to construct the HTML tree, so I took your program and extended it a bit.  Here's what I have:

# Strict
use strict;
use warnings;

# Libraries
use Data::Dumper;
use LWP::UserAgent;
use HTML::TreeBuilder;

my $url = 'http://www.somepage.com';

my $browser = LWP::UserAgent->new;
# $browser->cookie_jar({});   #### use if the site requires cookies

my @ns_headers = (
    'User-Agent' => 'Mozilla/4.76 [en] (Win98; U)',
    'Accept' => 'image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*',
    'Accept-Charset' => 'iso-8859-1,*,utf-8',
    'Accept-Language' => 'en-US',
);

my $response = $browser->get($url, @ns_headers);
die "Can't get $url -- ", $response->status_line
    unless $response->is_success;
die "Hey, I was expecting HTML, not ", $response->content_type
    unless $response->content_type eq 'text/html';

# Now get the content, and display it
my $content = $response->content;
print "TFD> content $content\n";

# Now build the HTML tree
my $tree = HTML::TreeBuilder->new_from_content($content);

# Now find the first occurrence of the desired tag
my $tag = 'a';
my $match = $tree->find($tag);
if ($match) {
    # Found it!
    print "Found tag '$tag' ...\n";
    print " As text: ", $match->as_text, "\n";
    print " As HTML: ", $match->as_HTML, "\n";
}
else {
    print "Unable to find tag '$tag'\n";
}

Note that I'm building an HTML tree from the $content which is returned after a successful get from the LWP::UserAgent object.

The program then prints out the contents in the line:

print "TFD> content $content\n";

as a debugging step (you can remove that once you're sure you're getting what you expect back from the LWP fetch).

Then you construct the HTML tree with:

my $tree = HTML::TreeBuilder->new_from_content($content);

Finally, you use find to locate an occurrence of the desired tag.  In the program above, I searched for the first occurrence of an anchor 'a' with:

my $tag = 'a';
my $match = $tree->find($tag);

which is then rendered both as text and HTML with:

print " As text: ", $match->as_text, "\n";
print " As HTML: ", $match->as_HTML, "\n";
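The steps above stop at the first match. If you want every occurrence, find (an alias for find_by_tag_name in HTML::Element) returns all matching elements when called in list context. A minimal sketch, using a small inline HTML string instead of a fetched page so that it runs without network access; the markup and the @report array are made up for the example:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical markup standing in for $response->content.
my $html = <<'HTML';
<html><head><title>Sample</title></head>
<body>
<a href="/one">First link</a>
<a href="/two">Second link</a>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# List context: find returns every <a> element, not just the first.
my @links = $tree->find('a');

my @report;
for my $link (@links) {
    my $href = $link->attr('href');
    push @report, $link->as_text . ' -> ' . (defined $href ? $href : '(no href)');
}
print "$_\n" for @report;

# Free the tree once you are done with it.
$tree->delete;
```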

Does that help you get further along?


s''(q.S:$/9=(T1';s;(..)(..);$..=substr+crypt($1,$2),2,3;eg;print$..$/

Re^2: HTML parsing OR capturing text from a string within tags
by kevyt (Scribe) on Dec 24, 2006 at 03:34 UTC
    Thanks, I will try this in the morning. I could not get this to print anything worthwhile.
    ### $response->content has the webpage stored in it
    $a = HTML::Element->new('a', $response->content);
    $addr = $a->find('tag', 'title');
    print $addr;
Re^2: HTML parsing OR capturing text from a string within tags
by kevyt (Scribe) on Dec 24, 2006 at 06:13 UTC
    Thanks Liverpole,

    That explains a lot. I was not able to get it to work with my example because I guess that long string of goop is not a tag. So I changed the tag to 'title' and that worked wonderfully!!!

    I noticed this line

    s''(q.S:$/9=(T1';s;(..)(..);$..=substr+crypt($1,$2),2,3;eg;print$..$/

    at the end of your posting, but I am not sure what it is for, unless it is a very complex regular expression to parse the data out.

    Thanks for all of your time and help. I might be able to make something work from what you wrote.

    Kevin
      Hi kevyt,

      I'm glad you were able to get further with your problem.  Always consider printing out intermediate results, so you know what your data looks like at each step of the way.

      The line at the end of my post:

      s''(q.S:$/9=(T1';s;(..)(..);$..=substr+crypt($1,$2),2,3;eg;print$..$/

      is just my "signature".  If you run it as a separate Perl script, it prints liverpole.  You can create your own signature by editing your Signature Settings page.


Re^2: HTML parsing OR capturing text from a string within tags
by kevyt (Scribe) on Dec 24, 2006 at 06:25 UTC
    Liverpole, I tried this
    my $tag = 'div class=\\042mytitle maximumtitle\\042 id=\\042idtitle042';
    my $match = $tree->find($tag);
    if ($match) {
        # Found it!
        print "Found tag '$tag' ...\n";
        print " As text: ", $match->as_text, "\n";
        print " As HTML: ", $match->as_HTML, "\n";
    }
    else {
        print "Unable to find tag '$tag'\n";
    }
    I like how all of this is supposed to work! I think I read in one of the docs that there is a list of tags in the PM. Maybe I can add this tag to the list of HTML tags in the PM? I was hoping that it would treat anything between < > as a tag, but I guess it does not do that. Thanks, Kevin
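One possible way forward: in the attempt above the tag name is really just 'div', while class and id are attributes of that tag, so find (which matches on tag name only) has nothing to match. HTML::Element also provides look_down, which accepts a tag name plus attribute criteria. A minimal sketch, assuming markup along the lines described; the inline HTML and the attribute values ('idtitle' in particular) are guesses for illustration, not taken from the real page:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical markup modelled on the div described above.
my $html = <<'HTML';
<html><body>
<div class="mytitle maximumtitle" id="idtitle">The text I want</div>
<div class="other">Something else</div>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# _tag selects the element name; the other pairs must match the
# attribute values exactly. Scalar context returns the first match.
my $match = $tree->look_down(
    _tag  => 'div',
    class => 'mytitle maximumtitle',
    id    => 'idtitle',
);

my $text = $match ? $match->as_text : undef;
print defined $text ? "Found: $text\n" : "No matching div\n";

$tree->delete;
```

If the id alone is unique on the page, passing just _tag and id should be enough, which avoids depending on the exact class string.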