
I don't know of a module that does what you are asking, nor do I know of any "best practice" for this problem; I have never run across it before.

If you want to avoid regexes, here is the best solution that comes to mind. It walks the token stream from HTML::TokeParser, starting a new block before each <p> start tag and after each </p> end tag. I hope other monks have even better ideas.

#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

my $p = HTML::TokeParser->new( *DATA ) or die "Can't create parser";

my @html_blocks = ( '' );
while ( my $token = $p->get_token ) {
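    my @t = @{ $token };
    my $type     =  $t[0];          # S, E, T, C, D, or PI
    my $type_tag = "$t[0] $t[1]";   # e.g. 'S p'; only meaningful for S and E tokens

    # Each token type keeps its original source text at a different index.
    my $text_pos = ($type eq 'S' ) ? 4
                 : ($type eq 'E' ) ? 2
                 : ($type eq 'T' ) ? 1
                 : ($type eq 'C' ) ? 1
                 : ($type eq 'D' ) ? 1
                 : ($type eq 'PI') ? 2
                 :                   die "Unexpected token type '$type'"
                 ;
    my $text = $t[$text_pos];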

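    # Start a new block before <p> and after </p>, but never leave an
    # empty block behind. length() is used so that a block holding only
    # "0" still counts as non-empty.
    push @html_blocks, '' if length $html_blocks[-1] and $type_tag eq 'S p';
    $html_blocks[-1] .= $text;
    push @html_blocks, '' if length $html_blocks[-1] and $type_tag eq 'E p';

}
# Drop trailing empty blocks; the @html_blocks check avoids a warning
# if the array ends up empty.
pop @html_blocks while @html_blocks and $html_blocks[-1] eq '';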

use Data::Dumper; $Data::Dumper::Useqq = 1;
print Dumper \@html_blocks;

__END__
<html>
<head>
<title>HTML::TokeParser</title>
</head>

<body>
<p><a name="__index__"></a></p>
<!-- INDEX BEGIN -->

<ul>
    <li>NAME</li>
    <li>SYNOPSIS</li>
</ul>
<!-- INDEX END -->

<hr />
<p>
</p>
<h1>NAME</h1>
<p>HTML::TokeParser - Alternative HTML::Parser interface</p>
<p>
</p>
<hr />
<h1><a name="synopsis">SYNOPSIS</a></h1>
<pre>
 use HTML::TokeParser;
 --snip--
</pre>
<p>
</p>
<hr />
<h1><a name="description">DESCRIPTION</a></h1>
<p>The HTML::TokeParser is an --snip--
</body>

</html>

Output:

$VAR1 = [
  "<html>\n<head>\n<title>HTML::TokeParser</title>\n</head>\n\n<body>\n",
  "<p><a name=\"__index__\"></a></p>",
  "\n<!-- INDEX BEGIN -->\n\n<ul>\n\t<li>NAME</li>\n\t<li>SYNOPSIS</li>\n</ul>\n<!-- INDEX END -->\n\n<hr />\n",
  "<p>\n</p>",
  "\n<h1>NAME</h1>\n",
  "<p>HTML::TokeParser - Alternative HTML::Parser interface</p>",
  "\n",
  "<p>\n</p>",
  "\n<hr />\n<h1><a name=\"synopsis\">SYNOPSIS</a></h1>\n<pre>\n use HTML::TokeParser;\n --snip--\n</pre>\n",
  "<p>\n</p>",
  "\n<hr />\n<h1><a name=\"description\">DESCRIPTION</a></h1>\n",
  "<p>The HTML::TokeParser is an --snip--\n</body>\n\n</html>\n"
];
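
If you want to reuse this on HTML held in a string, the same approach could be wrapped in a subroutine. What follows is only a sketch: the name split_html_blocks and the %TEXT_POS lookup table are my own inventions for illustration, not part of HTML::TokeParser. It relies on the documented behaviour that HTML::TokeParser->new treats a reference to a plain scalar as the literal document to parse.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Index of the original source text within each token type,
# following the token layouts documented for HTML::TokeParser.
my %TEXT_POS = ( S => 4, E => 2, T => 1, C => 1, D => 1, PI => 2 );

# Hypothetical helper (not part of any module): split an HTML string
# into blocks, with each <p>...</p> element in a block of its own.
sub split_html_blocks {
    my ($html) = @_;
    my $p = HTML::TokeParser->new( \$html )
        or die "Can't create parser";

    my @blocks = ('');
    while ( my $token = $p->get_token ) {
        my ( $type, $tag ) = @{$token};
        my $pos = $TEXT_POS{$type};
        die "Unexpected token type '$type'" unless defined $pos;

        # Only start and end tokens carry a tag name in slot 1.
        my $is_p = ( $type eq 'S' || $type eq 'E' ) && $tag eq 'p';

        push @blocks, '' if $is_p and $type eq 'S' and length $blocks[-1];
        $blocks[-1] .= $token->[$pos];
        push @blocks, '' if $is_p and $type eq 'E';
    }
    pop @blocks while @blocks and $blocks[-1] eq '';
    return @blocks;
}

my @blocks = split_html_blocks("<h1>Title</h1>\n<p>One</p>\n<p>Two</p>\n");
print "[$_]\n" for @blocks;
```

Because every token's original text is appended unchanged, joining the returned blocks should give back the input string. Note that, like the script above, this does not handle an unclosed <p>: everything after it ends up in the same block.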

In reply to Re: Best practice: How to split HTML into paragraphs? by Util
in thread Best practice: How to split HTML into paragraphs? by isync
