Did you happen to read HTML::Tree::Scanning? It's not really a module -- just an extra manual page with a lot of useful information about doing the kind of task you are trying to do.

It explains the alternatives, which include using regexes -- suitable only if you are confident that your HTML data is relatively simple and consistent (e.g. you will only be processing pages produced by code that you've written). In that case, a range of solutions is possible, and you'll probably pick one based on the shape of the expected data (and it will fail whenever the data don't match those expectations).

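Update: if I were in that situation, I'd probably start with @chunks = split /(<\/?p>\s*)+/, $html; where @chunks would include the <p> and </p> tags (kept because of the capturing parentheses) as well as all data between these tags -- but only a chunk that immediately follows a <p> will be a paragraph. Note that this pattern matches only bare <p> tags, i.e. ones without attributes.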
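Here is a minimal sketch of that split idea, just to show which chunks end up being paragraphs. The sample $html string and the @paragraphs array are made up for illustration, and it assumes bare <p> tags that are properly closed:

```perl
use strict;
use warnings;

# Hypothetical sample input: bare <p> tags only, no attributes.
my $html = "<p>First paragraph.</p>\n<p>Second paragraph.</p>\n";

# The capturing group puts the last tag of each run of tags into the list,
# so the tags show up in @chunks alongside the text between them.
my @chunks = split /(<\/?p>\s*)+/, $html;

my @paragraphs;
for my $i ( 1 .. $#chunks ) {
    # A chunk counts as paragraph text only if the chunk before it is an opening <p>.
    push @paragraphs, $chunks[$i] if $chunks[ $i - 1 ] =~ /^<p>/;
}

print "$_\n" for @paragraphs;
```

Since the string starts with a tag, the first element of @chunks is an empty string, which is why the loop starts at index 1 and looks back at the previous chunk.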

Another alternative is a straight parser module (e.g. HTML::Parser or HTML::TokeParser as suggested above).
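For the parser route, something along these lines with HTML::TokeParser might be a starting point. It is only a sketch: the sample $html and the @paragraphs array are invented, and it assumes you want the plain text of each paragraph with inline markup dropped:

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample input.
my $html = "<p>First paragraph.</p>\n<p>Second <b>paragraph</b>.</p>\n";

# Passing a scalar reference makes the parser read from the string itself.
my $parser = HTML::TokeParser->new( \$html );

my @paragraphs;

# get_tag('p') skips ahead to the next <p> start tag, or returns undef at the end.
while ( $parser->get_tag('p') ) {
    # Collect text up to the closing </p>, or up to the next <p> if the
    # paragraph was never closed.
    push @paragraphs, $parser->get_trimmed_text( 'p', '/p' );
}

print "$_\n" for @paragraphs;
```

Stopping at either 'p' or '/p' is a choice made here to cope with unclosed paragraphs; adjust the list of stop tags to suit your data.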

There is also a sample subroutine in that man page using TreeBuilder and its "look_down" method, which is what you would want in order to pull paragraphs out of a web page. Depending on how variable or complicated your data may be, you might need to check the parameter settings that TreeBuilder uses during its parsing (like "p_strict", which affects where it infers a </p> tag).
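This is not the subroutine from the man page, just a rough sketch of the same look_down approach. The sample $html (with an unclosed first paragraph) and the @paragraphs array are made up, and whether you want p_strict turned on depends on your data:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample input; the first <p> is deliberately left unclosed.
my $html = '<html><body><p>First paragraph.<p>Second <b>paragraph</b>.</p></body></html>';

my $tree = HTML::TreeBuilder->new;
$tree->p_strict(1);    # optional; must be set before parsing to have any effect
$tree->parse($html);
$tree->eof;            # tell the parser that the input is complete

# In list context, look_down returns every element matching the criteria.
my @paragraphs = map { $_->as_text } $tree->look_down( _tag => 'p' );

print "$_\n" for @paragraphs;

$tree->delete;         # free the tree explicitly when done
```

as_text drops the inline markup, so use as_HTML instead if you need each paragraph with its tags intact.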

Try something out with TreeBuilder, and if it gives you trouble, post what you've tried.


In reply to Re: Best practice: How to split HTML into paragraphs? by graff
in thread Best practice: How to split HTML into paragraphs? by isync
