I often have a multi-line string which I need to partition into its "paragraphs" based on a pattern which matches either the beginning or end of a paragraph. After seeking some wise comments from fellow Perl Monks (How to split into paragraphs?), I penned the following musings on the topic, which others may find useful.

If you have a pattern which matches the beginning of a "paragraph", you can use the following code to partition a string into "paragraphs". Notes: This will produce a first element which does not match the pattern if the first match occurs after the beginning of the string. The pattern should not match the empty string.

@list = split /(?=PATTERN)/;
For example, to split AIX stanza files (e.g. /etc/security/passwd):
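# Stricter alternative: header alone on its line, allowing leading/trailing whitespace.
# my $pat = qr/^[ \t]*[^\s:]+:[ \t]*$/m;
my $pat = qr/^[^\s:]+:/m;    # simpler form: header starts at the beginning of a line
$_ = slurp_file;             # helper not shown; returns the whole file as one string
my @stanzas = split /(?=$pat)/;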
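As a rough illustration of the lookahead split, here is a minimal, self-contained sketch. The sample text is invented for the demonstration (it is not a real /etc/security/passwd), and it starts with a comment line so that the "first element which does not match the pattern" case shows up.

```perl
use strict;
use warnings;

# Hypothetical stanza-style sample; the comment line precedes the first header.
my $text = "* comment line\nroot:\n\tpassword = abc\ndaemon:\n\tpassword = *\n";

my $pat = qr/^[^\s:]+:/m;                 # a stanza header begins a paragraph
my @stanzas = split /(?=$pat)/, $text;    # zero-width split: nothing is discarded

print scalar(@stanzas), " pieces\n";
print "[$_]" for @stanzas;
```

Because the split point is zero-width, joining the pieces back together should reproduce the original string.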
If you have a pattern which matches the end of a "paragraph", you can use the following code to partition a string into "paragraphs". Notes: This code properly handles a missing delimiter at the end of the string. The pattern should not match the empty string.
@list = /( .*? PATTERN | .+ )/gsx;
For example, to split paragraphs based on one or more blank lines at the end of a paragraph, use the following. Note the added complication of handling a final whitespace-only line which lacks a terminating newline.
my $pat = qr/(?:^[ \t]*\n)+(?:[ \t]+\z)?/m;
$_ = slurp_file;
my @list = /( .*? $pat | .+ )/gsx;
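The end-of-paragraph form can be tried on a small string. This is only a sketch with made-up sample text; the last paragraph deliberately has no trailing delimiter, to exercise the `.+` fallback branch.

```perl
use strict;
use warnings;

# Hypothetical sample: two paragraphs ended by blank lines, one unterminated.
my $text = "one\ntwo\n\n\nthree\n  \nfour";

my $pat  = qr/(?:^[ \t]*\n)+(?:[ \t]+\z)?/m;    # one or more blank lines
my @list = $text =~ /( .*? $pat | .+ )/gsx;     # paragraph plus its delimiter, or the remainder

print scalar(@list), " paragraphs\n";
```

Each element keeps the blank lines that ended it, so the elements concatenate back to the input.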
If you don't care about capturing the blank lines between paragraphs, you can use the following code. Notes: This will properly handle a non-newline-terminated blank line at the end of the string. The first list element will be empty if the string starts with a blank line. The second line of code will remove such a list element.
my @list = split /^\s*(?:\n|\z)/m;
shift @list if @list && $list[0] eq "";    # remove empty first element
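A small sketch of the blank-line-discarding split, using an invented sample that starts with a blank line so the empty first element appears and is then removed:

```perl
use strict;
use warnings;

# Hypothetical sample: leading blank line, then two paragraphs separated by blank lines.
my $text = "\nalpha\nbeta\n\n \t\ngamma\n";

my @list = split /^\s*(?:\n|\z)/m, $text;
my $had_empty_first = (@list && $list[0] eq "") ? 1 : 0;
shift @list if $had_empty_first;    # remove empty first element

print scalar(@list), " paragraphs\n";
```

Here the blank lines themselves are consumed by the split, so the remaining elements hold only paragraph text.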
Here is a pattern which can be used to split a string based on a delimiter followed by zero or more blank lines. It properly handles a non-newline-terminated blank line at the end of the string.
my $delim = qr/^[ \t]*SOMETHING[ \t]*$/m;
my $pat = qr/$delim(?:\n[ \t]*)*(?:\n|\z)/;
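One possible way to use such a pattern is directly with split. In the sketch below the delimiter `----` is a stand-in chosen for illustration (the original leaves it as SOMETHING), and the sample text is invented; the final delimiter has no trailing newline.

```perl
use strict;
use warnings;

# "----" is a hypothetical delimiter standing in for SOMETHING.
my $delim = qr/^[ \t]*----[ \t]*$/m;
my $pat   = qr/$delim(?:\n[ \t]*)*(?:\n|\z)/;

# Hypothetical sample: delimiter followed by blank lines, then a final bare delimiter.
my $text  = "first\n----\n\n  \nsecond\n----";
my @parts = split /$pat/, $text;

print scalar(@parts), " parts\n";
```

With split, the delimiter and the blank lines after it are dropped; to keep them, the same `$pat` could instead be used as the end-of-paragraph pattern.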
Here are two subroutines which can be used to partition a string into paragraphs.
# Partition a string into paragraphs based on a
# pattern which matches the beginning of a paragraph.
sub partition_para_beg {
    my ($pat, $str) = @_;
    $str = $_ unless defined $str;
    if ("" =~ /$pat/) {
        require Carp;
        Carp::croak("invalid pattern matches empty string: \"$pat\"\n");
    }
    return split /(?=$pat)/, $str;    # split $str explicitly, not the implicit $_
}

# Partition a string into paragraphs based on a
# pattern which matches the end of a paragraph.
sub partition_para_end {
    my ($pat, $str) = @_;
    $str = $_ unless defined $str;
    if ("" =~ /$pat/) {
        require Carp;
        Carp::croak("invalid pattern matches empty string: \"$pat\"\n");
    }
    return $str =~ /(.*?(?:$pat)|.+)/gs;
}
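Both subroutines refuse a pattern which matches the empty string. A minimal sketch of why that matters for the lookahead split, using a deliberately bad pattern invented for the demonstration:

```perl
use strict;
use warnings;

my $bad = qr/x*/;    # hypothetical pattern; "x*" can match zero characters

# The same test the subroutines use to reject such a pattern.
my $matches_empty = ("" =~ /$bad/) ? 1 : 0;

# A lookahead that succeeds at every position makes split cut between every character.
my @pieces = split /(?=$bad)/, "abc";

print "matches empty: $matches_empty\n";
print "pieces: @pieces\n";
```

The result is one element per character rather than one per paragraph, which is presumably why the guard croaks instead of returning a misleading list.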

In reply to Parsing a string into "paragraphs" by jrw
