HTML::TreeBuilder question

2ge has asked for the wisdom of the Perl Monks concerning the following question:

Hello all,

I use this superb module for parsing HTML files; it is really great. But I have run out of ideas for how to parse a fairly simple part of a complex HTML file. It looks like this:

#...here are some divs, ids and so on...
<h4>START</h4>
#...here is some information
<hr />
#...here are some additional data

Now I don't know how to tell HTML::TreeBuilder to get everything from <h4>START</h4> up to the closing <hr />.
I could use a simple regexp for that, but I need an HTML::TreeBuilder object, so how can I do it?
I thought about capturing that part into a variable via regexp and then calling $tree2->parse($temp);

Is it OK to do it this way?


Re: HTML::TreeBuilder question by ww (Archbishop) on Apr 27, 2005 at 16:22 UTC
If I understand your question and data correctly, you can make that work well, so long as you're very sure there won't be an <hr /> inside something you wish to capture and, given that precondition, that you're very careful about the greediness (or non-greediness, in this case) of your regex.
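
The greediness point can be seen in a minimal sketch. The sample string below is hypothetical (it is not the poster's real data) and is kept on one line so that only the quantifier differs between the two matches:

```perl
use strict;
use warnings;

# Hypothetical one-line sample with two <hr /> tags after the heading.
my $html = '<h4>START</h4><p>wanted</p><hr /><p>unwanted</p><hr /><p>tail</p>';

# Non-greedy: stops at the first "<hr" after "<h4>".
my ($lazy) = $html =~ /<h4>(.*?)<hr/;

# Greedy: runs on to the last "<hr" in the string.
my ($greedy) = $html =~ /<h4>(.*)<hr/;

print "lazy:   $lazy\n";
print "greedy: $greedy\n";
```

With the non-greedy form the capture ends before the first <hr />; with the greedy form it also swallows the first <hr /> and the paragraph after it.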
Re^2: HTML::TreeBuilder question by 2ge (Scribe) on Apr 27, 2005 at 21:46 UTC
ww, thanks for the reply. I will try that tomorrow, but I think (and hope!) there will be no problem at all. My regexp is simple: `/<h4>(.*?)<hr/gm;`
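
One thing to watch with that regexp: /m only changes where ^ and $ match, so "." still refuses to cross a newline. If the heading and the <hr /> sit on different lines, as in the snippet in the question, the /s modifier is probably what is needed. The following is a rough sketch only, assuming a hypothetical multi-line sample and that the captured fragment is then handed to HTML::TreeBuilder's new_from_content constructor; the variable names are made up for illustration:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical multi-line sample modelled on the snippet in the question.
my $html = <<'HTML';
<div id="top">intro</div>
<h4>START</h4>
<p>some information</p>
<hr />
<p>additional data</p>
HTML

# With /m alone, "." cannot match "\n", so nothing is captured here.
my ($m_only) = $html =~ /(<h4>.*?)<hr/m;

# With /s, "." matches newlines too; the capture keeps the <h4> tag itself.
my ($fragment) = $html =~ /(<h4>.*?)<hr/s;

print defined($m_only) ? "/m matched\n" : "/m alone: no match\n";

# Parse just the captured fragment into its own tree.
my $tree    = HTML::TreeBuilder->new_from_content($fragment);
my $heading = $tree->look_down(_tag => 'h4')->as_text;
my $para    = $tree->look_down(_tag => 'p')->as_text;

print "heading: $heading\n";
print "para:    $para\n";

$tree->delete;    # free the tree (needed on older HTML::Tree releases)
```

Note that the parser wraps the fragment in html and body elements, so the resulting tree is a complete document rather than a bare list of nodes.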