With HTML::TreeBuilder, as Your Mother already mentioned, you can do so, but please keep in mind that html may change. I have several monitors running that parse HTML constantly, and I have to change the code on a very regular basis because the people that generate or maintain the HTML keep changing it. So on true advice: be very very defensive in your parsing strategy and don't hardcode the sequence of events: the generator might add a div tag in between or swap the sequence of text and image.
use HTML::TreeBuilder; my $tree = HTML::TreeBuilder->new; $tree->parse_content ($html); foreach my $img ($tree->look_down (_tag => "img")) { my $p = $img->parent; $p->tag eq "div" or next; # <img> not inside a <div> my $txt = $p->as_text; }
As you can see, this module offers you all rope you need to hang yourself or do what you need. It also offers a nice way to generate nicely formatted HTML from parsed trees:
print $tree->as_HTML (undef, " ", {});
In reply to Re: Parsing HTML files
by Tux
in thread Parsing HTML files
by ajju
| For: | Use: | ||
| & | & | ||
| < | < | ||
| > | > | ||
| [ | [ | ||
| ] | ] |