It explains the alternatives, which include using regexes. Those are suitable only if you are confident that your HTML data is relatively simple and consistent (e.g. you will only be processing pages produced by code that you've written). In that case, a range of solutions is possible, and you'll probably pick one based on the shape of the expected data (and it will fail when the data doesn't match expectations).
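Update: if I were in that situation, I'd probably start with @chunks = split /(<\/?p>\s*)+/,$html; where @chunks would include the <p> and </p> tags as well as all data between these tags, but only a chunk that immediately follows a <p> will be a paragraph. (Because the capturing group sits inside the repetition, a run of adjacent tags such as </p><p> comes back as just its last tag. Also note that this pattern only matches bare <p> tags, not ones with attributes.)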
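To make the split idea concrete, here is a minimal sketch using only core Perl. The sub name paragraphs_from_split and the sample string are made up for illustration, and it assumes bare <p> tags with no attributes and no nesting, so treat it as a starting point rather than a general solution:

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text chunks that directly follow a <p> tag.
sub paragraphs_from_split {
    my ($html) = @_;

    # The capturing group keeps the last tag of each run of <p>/</p> tags
    # in the result list, in between the text chunks.
    my @chunks = split /(<\/?p>\s*)+/, $html;

    my @paragraphs;
    for my $i (1 .. $#chunks) {
        # Only a chunk whose predecessor is an opening <p> counts as a paragraph.
        push @paragraphs, $chunks[$i] if $chunks[$i - 1] =~ /^<p>/;
    }
    return @paragraphs;
}

my $sample = "<p>First paragraph.</p>\n<p>Second paragraph.</p>\nTrailing text";
print "$_\n" for paragraphs_from_split($sample);
# prints:
# First paragraph.
# Second paragraph.
```

"Trailing text" is dropped because the chunk before it is a </p>, not a <p>. Anything like <p class="x"> would slip through unmatched, which is exactly the kind of input where a real parser becomes the better choice.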
Another alternative is a straight parser module (e.g. HTML::Parser or HTML::TokeParser as suggested above).
There is also a sample subroutine in that man page using TreeBuilder and its "look_down" method, which is what you would want in order to pull paragraphs out of a web page. Depending on how variable or complicated your data may be, you might need to check the parameter settings that TreeBuilder uses during its parsing (like "p_strict", affecting where it should infer a </p> tag).
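For comparison, a TreeBuilder version might look roughly like the following. This is my own sketch, not the subroutine from the man page; the sub name paragraphs_from_tree and the sample input are invented, and whether you want p_strict turned on depends on your data:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: returns the text content of every <p> element.
sub paragraphs_from_tree {
    my ($html) = @_;

    my $tree = HTML::TreeBuilder->new;
    $tree->p_strict(1);              # assumption: be strict about where an open <p> gets closed
    $tree->parse_content($html);     # parse the whole string and finish the tree

    # In list context, look_down returns every element matching the criteria.
    my @paragraphs = map { $_->as_text } $tree->look_down(_tag => 'p');

    $tree->delete;                   # release the tree explicitly when done
    return @paragraphs;
}

my @paras = paragraphs_from_tree('<html><body><p>One</p><p>Two</p></body></html>');
print "$_\n" for @paras;
```

Unlike the split approach, this copes with attributes on the <p> tags and with markup nested inside a paragraph, since as_text flattens whatever the element contains.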
Try something out with TreeBuilder, and if it gives you trouble, post what you've tried.
In reply to Re: Best practice: How to split HTML into paragraphs?
by graff
in thread Best practice: How to split HTML into paragraphs?
by isync