isync has asked for the wisdom of the Perl Monks concerning the following question:

Hi!

Is there a module out there that splits an HTML page into chunks of HTML/text based on paragraph breaks?

I looked into HTML::TreeBuilder and HTML::Summary, but both do too much. All I need is a regex like @paras = split /<\/p>|<p>/i, $html; But as always with regexes: "never write yet another parser!" and "you will always forget about a number of cases you didn't anticipate!".

Any hints, or at least some tips for improving the regex above?

Re: Best practice: How to split HTML into paragraphs?
by GrandFather (Saint) on Jun 03, 2007 at 23:05 UTC

    Use HTML::TreeBuilder. :-D

    The two pieces of advice you quote are particularly apposite when parsing markup such as HTML. If you were really worried about "does too much", would you be using Perl?

    The time that you spend figuring out how to use TreeBuilder to do the job will be much less than the time you would spend trying to rewrite the parts of HTML::Parser, HTML::Element and HTML::TreeBuilder that are involved in doing the work you need done.


    DWIM is Perl's answer to Gödel
Re: Best practice: How to split HTML into paragraphs?
by Util (Priest) on Jun 03, 2007 at 23:13 UTC

    I don't know of a module that does what you are asking. Neither do I know of any "best practice" for this problem, as I have never run across the problem before.

    If you want to avoid regexes, here is the best solution that comes to mind. I hope other monks have even better ideas.

Re: Best practice: How to split HTML into paragraphs?
by graff (Chancellor) on Jun 04, 2007 at 13:09 UTC
    Did you happen to read HTML::Tree::Scanning? It's not really a module -- just an extra manual page with a lot of useful information about doing the kind of task you are trying to do.

    It explains the alternatives, which include using regexes -- suitable only if you are confident that your HTML data is relatively simple and consistent (e.g. you will only be processing pages produced by code that you've written). In that case, a range of solutions is possible, and you'll probably pick one based on the shape of the expected data (and it will fail when the data don't match expectations).

    Update: if I were in that situation, I'd probably start with @chunks = split /(<\/?p>\s*)+/, $html; where @chunks would include <p> and </p> as well as all data between these tags -- but only a chunk that immediately follows a <p> will be a paragraph.
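    As a rough sketch of how that split could be put to use (the helper name paragraphs_from_split is made up for illustration, and this assumes bare, lowercase <p> tags with no attributes): because the capture group sits under a +, each delimiter chunk holds only the last tag of a run such as "</p>\n<p>", so a chunk is a paragraph exactly when the delimiter before it starts with <p>.

```perl
use strict;
use warnings;

# Hypothetical helper, not from any module: returns the chunks that
# immediately follow an opening <p> delimiter.
sub paragraphs_from_split {
    my ($html) = @_;

    # With one capture group, split returns text and delimiters alternately:
    # even indices are text, odd indices are the captured delimiter.
    # Under the "+", the capture holds only the last tag of a run.
    my @chunks = split /(<\/?p>\s*)+/, $html;

    my @paras;
    for (my $i = 1; $i < $#chunks; $i += 2) {
        # keep the text chunk only if the delimiter before it is an opening <p>
        push @paras, $chunks[$i + 1] if $chunks[$i] =~ /\A<p>/;
    }
    return @paras;
}

my $sample = "<h1>Title</h1>\n<p>First para.</p>\n"
           . "<p>Second <b>bold</b> para.</p>\n<div>footer</div>";
print "$_\n" for paragraphs_from_split($sample);
```

    Note that this pattern will not match <P> or <p class="intro">, which is the kind of unanticipated case the regex warnings are about.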

    Another alternative is a straight parser module (e.g. HTML::Parser or HTML::TokeParser as suggested above).
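    A minimal sketch of the token-parser route, assuming HTML::TokeParser is installed and that plain text (inline tags dropped) is what you want from each paragraph. The sub name paragraph_texts is only illustrative:

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: returns the text of each <p> element as a string.
sub paragraph_texts {
    my ($html) = @_;

    # Passing a scalar reference makes the parser read from the string.
    my $parser = HTML::TokeParser->new(\$html)
        or die "could not set up the parser";

    my @paras;
    # get_tag('p') skips ahead to the next <p> start tag (undef at end of input)
    while ($parser->get_tag('p')) {
        # collect the text up to the closing </p>; inline tags such as <b>
        # are skipped, and whitespace is trimmed and collapsed
        push @paras, $parser->get_trimmed_text('/p');
    }
    return @paras;
}

my $sample = "<h1>Title</h1>\n<p>First para.</p>\n"
           . "<p>Second <b>bold</b> para.</p>\n<div>footer</div>";
print "$_\n" for paragraph_texts($sample);
```

    This stops only at </p>, so paragraphs whose closing tag is omitted would run together; that is the sort of case TreeBuilder handles for you.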

    There is also a sample subroutine in that man page using TreeBuilder and its "look_down" method, which is what you would want in order to pull paragraphs out of a web page. Depending on how variable or complicated your data may be, you might need to check the parameter settings that TreeBuilder uses during its parsing (like "p_strict", affecting where it should infer a </p> tag).
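    This is not the subroutine from the man page, just a minimal sketch of the same idea, assuming HTML::TreeBuilder is installed. The sub name paragraphs_via_tree is made up, and turning p_strict on is an assumption that may or may not suit your data:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: returns the text content of every <p> element.
sub paragraphs_via_tree {
    my ($html) = @_;

    my $tree = HTML::TreeBuilder->new;
    # Parsing options have to be set before any input is parsed.
    # p_strict(1) is an assumption here: it makes the parser close an open
    # <p> when a non-phrasal element is about to be inserted under it.
    $tree->p_strict(1);
    $tree->parse($html);
    $tree->eof;

    # look_down in list context returns every element matching the criteria
    my @paras = map { $_->as_text } $tree->look_down(_tag => 'p');

    $tree->delete;    # release the tree explicitly
    return @paras;
}

my $sample = "<h1>Title</h1>\n<p>First para.</p>\n"
           . "<p>Second <b>bold</b> para.</p>\n<div>footer</div>";
print "$_\n" for paragraphs_via_tree($sample);
```

    If you need the markup inside each paragraph rather than just the text, as_HTML on each element is the call to look at instead of as_text.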

    Try something out with TreeBuilder, and if it gives you trouble, post what you've tried.

Re: Best practice: How to split HTML into paragraphs?
by isync (Hermit) on Jun 04, 2007 at 20:55 UTC
    Thanks for all your posts!

    It's not such an urgent problem - I will post if issues arise once I begin tackling it.