in reply to Parsing HTML files
With HTML::TreeBuilder, as Your Mother already mentioned, you can do so, but please keep in mind that the HTML may change. I have several monitors running that parse HTML constantly, and I have to change the code on a very regular basis because the people who generate or maintain the HTML keep changing it. So one piece of advice: be very, very defensive in your parsing strategy and don't hardcode the sequence of events, because the generator might add a div tag in between or swap the order of text and image.
    use HTML::TreeBuilder;

    my $tree = HTML::TreeBuilder->new;
    $tree->parse_content ($html);
    foreach my $img ($tree->look_down (_tag => "img")) {
        my $p = $img->parent;
        $p->tag eq "div" or next; # <img> not inside a <div>
        my $txt = $p->as_text;
    }
As you can see, this module offers you all the rope you need to hang yourself or to do what you need. It also offers a way to generate nicely formatted HTML from parsed trees:
print $tree->as_HTML (undef, " ", {});
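To make the defensive advice above a bit more concrete, here is a minimal sketch of one possible approach: rather than requiring the `<div>` to be the direct parent of the `<img>`, it uses `look_up` to find the nearest enclosing `<div>` at any depth. The sample markup, the `@found` list and its `src`/`text` keys are made up for illustration only; real pages will need their own criteria.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample input: the generator has wrapped the first <img>
# in an extra <span>, and the second <img> is not inside a <div> at all.
my $html = <<'HTML';
<html><body>
<div class="item"><span><img src="a.png" alt="A"></span> First caption</div>
<p><img src="b.png" alt="B"> Not inside a div</p>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new;
$tree->parse_content ($html);

my @found;
foreach my $img ($tree->look_down (_tag => "img")) {
    # look_up checks the element itself and then its ancestors, so an
    # extra wrapper between the <div> and the <img> does not matter.
    my $div = $img->look_up (_tag => "div") or next;

    # as_text collects all text below the <div>, whether it comes
    # before or after the image.
    my $txt = $div->as_text;
    $txt =~ s/^\s+|\s+$//g;

    push @found, { src => $img->attr ("src"), text => $txt };
}

print "$_->{src}: $_->{text}\n" for @found;
$tree->delete;
```

This still breaks if the `<div>` itself disappears, so it reduces how often the monitor needs fixing rather than removing the need. Matching on something more stable, such as a class or id attribute if the page has one, is usually worth considering as well.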