Re: Parsing HTML files

With HTML::TreeBuilder, as Your Mother already mentioned, you can do so, but please keep in mind that the HTML may change. I have several monitors running that parse HTML constantly, and I have to change the code on a very regular basis because the people who generate or maintain the HTML keep changing it. So one piece of advice: be very, very defensive in your parsing strategy and don't hardcode the sequence of events: the generator might add a div tag in between, or swap the order of text and image.

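use strict;
use warnings;
use HTML::TreeBuilder;

# $html must hold the HTML source itself, not a file name
my $tree = HTML::TreeBuilder->new;
$tree->parse_content ($html);

foreach my $img ($tree->look_down (_tag => "img")) {
    my $p = $img->parent;
    $p->tag eq "div" or next; # skip an <img> whose parent is not a <div>
    my $txt = $p->as_text;    # all text inside that <div>
    print "$txt\n";
    }
$tree->delete;                # free the tree when done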
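
To make the "be defensive" advice a bit more concrete, here is a minimal sketch that does not assume the <img> is a direct child of the <div>. It uses look_up to find the nearest enclosing <div>, so an extra wrapper element in between does not break it. The sub name div_texts_for_images and the sample markup are made up for illustration only.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: for every <img>, return the text of the nearest
# enclosing <div>, however many wrapper elements sit in between.
sub div_texts_for_images {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new;
    $tree->parse_content ($html);

    my @texts;
    foreach my $img ($tree->look_down (_tag => "img")) {
        # look_up starts at the element itself and walks up its ancestors
        my $div = $img->look_up (_tag => "div") or next;
        push @texts, $div->as_text;
        }
    $tree->delete;
    return @texts;
    }

# Made-up sample: a <span> was inserted between the <div> and the <img>,
# and a second <img> has no enclosing <div> at all.
my $sample = '<div><span><img src="a.png"></span> Caption A</div>'
           . '<p><img src="b.png"></p>';
print "$_\n" for div_texts_for_images ($sample);
```

Whether "nearest enclosing <div>" is the right rule depends on the page you are monitoring; the point is only that the lookup tolerates small structural changes.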

As you can see, this module offers you all the rope you need to hang yourself, or to do what you need. It also offers a nice way to generate nicely formatted HTML from parsed trees:

print $tree->as_HTML (undef, "  ", {});

Enjoy, Have FUN! H.Merijn

Re^2: Parsing HTML files by aquarium (Curate) on Nov 18, 2010 at 22:31 UTC
Totally agree that scraping HTML is quite bad and unstable. My rough guide for scraping, from most to least desirable:

1. Don't scrape. Find out whether there is a RESTful way to get the results instead, via some API or an alternate URL (e.g. one that returns XML).
2. If the HTML is well formed (i.e. XHTML), it is almost guaranteed to be well structured, with proper closing tags etc., so use one of the XML-based parsers. Once it is parsed, you can easily get to specific elements via a well-defined hierarchy.
3. This stuff gets ugly when you start slurping the whole HTML into a scalar and progressively hunt for markers from which to suck bits into variables for inspection. Even here you can code a bit defensively by picking sane markers, e.g. "id" or "class" attributes. Never anchor to text containing inline HTML styles or other bits that are likely to change frequently, like inline JavaScript.

Finally, if you end up producing HTML in the CGI, don't mix actual output with styling. Write a stylesheet instead. Producing a well-formed XHTML document in the CGI, without inline styles, leaves the door open to using the output of the CGI from another CGI or whatever. It's also much easier to change the look of the output via a stylesheet than by digging in Perl code.

There are frameworks for doing even fancier scraping, where you end up running a browser engine server side to pretend that your program is a browser. This is necessary when a website produces most of its output dynamically with JavaScript: because JavaScript is browser/client-side code, you won't see its results unless you run it. This is pretty horrid stuff. Although you can do automatic login and traverse a website and its results, it typically breaks as soon as absolutely anything changes on the website. A good, helpful website, even if rendered dynamically with fancy JavaScript, should provide a RESTful API to get data out. But some companies still insist on not being very helpful.
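
As a rough illustration of the "pick sane markers" point, here is a small sketch that anchors on "id" and "class" attributes with HTML::TreeBuilder's look_down instead of on position or inline styles. The sub names text_by_id and texts_by_class, and the sample markup, are invented for this example.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: text of the first element with the given id, or undef.
sub text_by_id {
    my ($html, $id) = @_;
    my $tree = HTML::TreeBuilder->new;
    $tree->parse_content ($html);
    my $el   = $tree->look_down (id => $id);
    my $text = $el ? $el->as_text : undef;
    $tree->delete;
    return $text;
    }

# Hypothetical helper: text of every element whose class attribute
# contains the given class name as a whole, space-separated word.
sub texts_by_class {
    my ($html, $class) = @_;
    my $tree = HTML::TreeBuilder->new;
    $tree->parse_content ($html);
    my @texts = map { $_->as_text }
        $tree->look_down (class => qr/(?:^|\s)\Q$class\E(?:\s|$)/);
    $tree->delete;
    return @texts;
    }

# Made-up sample: the inline style and the nesting can change freely,
# the id and class are what the code relies on.
my $sample = '<div style="color:red"><p>'
           . '<span id="price" class="big num">42.50</span>'
           . '</p></div>';
my $price = text_by_id ($sample, "price");
print defined $price ? "$price\n" : "no price found\n";
```

This still breaks if the site renames its ids or classes, but those tend to change less often than layout or inline styling.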
the hardest line to type correctly is: stty erase ^H
Re^2: Parsing HTML files by ajju (Initiate) on Nov 18, 2010 at 19:57 UTC
Hi Tux, I added my $html="htmlfilepath"; to your code. Running it gives the error below: Use of uninitialized value in subroutine entry at C:/Perl/site/lib/HTML/TreeBuilder.pm line 121.
Re^3: Parsing HTML files by planetscape (Chancellor) on Nov 19, 2010 at 05:41 UTC
I found help for this error by typing "uninitialized value in subroutine entry" "TreeBuilder.pm" into Google. Basic debugging checklist may also be helpful. HTH, planetscape
Re^3: Parsing HTML files by Tux (Canon) on Nov 19, 2010 at 07:33 UTC
If you had taken the time to read the documentation, e.g. using "`perldoc HTML::TreeBuilder`", you would have seen (if the method name parse_content wasn't obvious enough already) that to parse a file, you should use the parse_file method. Enjoy, Have FUN! H.Merijn
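
For completeness, a minimal sketch of the difference: parse_content expects the HTML text itself, while parse_file expects a file name (or an open filehandle). The sub name tree_from_file is made up here; the file to parse is taken from the command line so that no path is assumed.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: build a tree from a file on disk.
# parse_file returns undef when the file cannot be opened.
sub tree_from_file {
    my ($path) = @_;
    my $tree = HTML::TreeBuilder->new;
    $tree->parse_file ($path)
        or die "Cannot parse '$path': $!\n";
    return $tree;
    }

# Only runs when a file name is given on the command line
if (@ARGV) {
    my $tree = tree_from_file ($ARGV[0]);
    print $tree->as_text, "\n";
    $tree->delete;
    }
```

Passing a file name to parse_content would make the parser treat the name itself as the HTML source, which is why the two methods should not be mixed up.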