I often have a multi-line string which I need to partition into its "paragraphs" based on a pattern which matches either the beginning or end of a paragraph. After seeking some wise comments from fellow Perl Monks (How to split into paragraphs?), I penned the following musings on the topic, which others may find useful.

If you have a pattern which matches the beginning of a "paragraph", you can use the following code to partition a string into "paragraphs". Notes: This will produce a first element which does not match the pattern if the first match occurs after the beginning of the string. The pattern should not match the empty string.

@list = split /(?=PATTERN)/;
For example, to split AIX stanza files (e.g. /etc/security/passwd):
my $pat = qr/^[^\s:]+:/m;
# alternative which allows leading/trailing ws:
# my $pat = qr/^[ \t]*[^\s:]+:[ \t]*$/m;
$_ = slurp_file;
my @stanzas = split /(?=$pat)/o;
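As a minimal, self-contained sketch of the beginning-of-paragraph split, the following runs the same idea against an in-memory string instead of a slurped file. The sample text (including the leading comment line) is hypothetical and only there to show the non-matching first element mentioned in the notes.

```perl
use strict;
use warnings;

# Hypothetical sample data: a leading comment line, then two stanzas.
my $text = "* sample comment\n"
         . "root:\n\tpassword = abc\n\n"
         . "guest:\n\tpassword = *\n";

# A stanza begins with a name immediately followed by a colon.
my $pat = qr/^[^\s:]+:/m;

# The lookahead consumes nothing, so each stanza keeps its header line.
my @stanzas = split /(?=$pat)/, $text;

printf "%d elements\n", scalar @stanzas;
print "<<$_>>\n" for @stanzas;
```

Because the first match occurs after the start of the string, the first element here is the comment line rather than a stanza, and joining the elements gives back the original string.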
If you have a pattern which matches the end of a "paragraph", you can use the following code to partition a string into "paragraphs". Notes: This code properly handles a missing delimiter at the end of the string. The pattern should not match the empty string.
@list = /( .*? PATTERN | .+ )/gsx;
For example, to split paragraphs based on one or more blank lines at the end of a paragraph, use the following. Note the added complication of handling a non-newline-terminated line at the end of the string.
my $pat = qr/(?:^[ \t]*\n)+(?:[ \t]+\z)?/m;
$_ = slurp_file;
my @list = /( .*? $pat | .+ )/ogsx;
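A small sketch of the end-of-paragraph match, assuming a hypothetical in-memory string whose last paragraph has no trailing blank line. It is meant to show that the `.+` alternative picks up the unterminated remainder and that each paragraph keeps its trailing blank lines.

```perl
use strict;
use warnings;

# Hypothetical sample data; the final paragraph has no delimiter after it.
my $text = "one\ntwo\n\nthree\n\n\nfour";

# One or more blank lines, optionally followed by trailing whitespace at EOF.
my $pat = qr/(?:^[ \t]*\n)+(?:[ \t]+\z)?/m;

# Either "text up to and including the delimiter" or "whatever is left".
my @list = $text =~ /( .*? $pat | .+ )/gsx;

printf "%d paragraphs\n", scalar @list;
print "<<$_>>\n" for @list;
```

The delimiters stay attached to the paragraph they terminate, so concatenating the list reproduces the input.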
If you don't care about capturing the blank lines between paragraphs, you can use the following code. Notes: This will properly handle a non-newline-terminated blank line at the end of the string. The first list element will be empty if the string starts with a blank line. The second line of code will remove such a list element.
my @list = split /^\s*(?:\n|\z)/m;
shift @list if @list && $list[0] eq ""; # remove empty first element
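To illustrate both notes at once, here is a sketch using a hypothetical string that starts with a blank line and ends with a whitespace-only line that has no newline. The variable `$before` is introduced only to show the element count prior to the `shift`.

```perl
use strict;
use warnings;

# Hypothetical sample data: leading blank line, a run of blank lines in the
# middle, and a whitespace-only last line without a newline.
my $text = "\nalpha\nbeta\n\n\ngamma\n   ";

my @list = split /^\s*(?:\n|\z)/m, $text;
my $before = @list;                       # count including the empty element

shift @list if @list && $list[0] eq "";   # remove empty first element

print "before shift: $before, after shift: ", scalar(@list), "\n";
print "<<$_>>\n" for @list;
```

Each remaining element keeps the newline that ends its last line; only the blank lines between paragraphs are discarded.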
Here is a pattern which can be used to split a string based on a delimiter followed by zero or more blank lines. It properly handles a non-newline-terminated blank line at the end of the string.
my $delim = qr/^[ \t]*SOMETHING[ \t]*$/m;
my $pat = qr/$delim(?:\n[ \t]*)*(?:\n|\z)/o;
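One way this pattern might be used is as the end-of-paragraph pattern in the match shown earlier. The sketch below assumes a hypothetical delimiter line `END` in place of SOMETHING, and a sample string whose last line is whitespace without a newline.

```perl
use strict;
use warnings;

# Hypothetical sample data: two records, each closed by an END line.
my $text = "first\nEND\n\nsecond\nEND\n  ";

# Assumption for this sketch: the delimiter is a line containing only END.
my $delim = qr/^[ \t]*END[ \t]*$/m;
my $pat   = qr/$delim(?:\n[ \t]*)*(?:\n|\z)/;

# Use $pat as an end-of-paragraph pattern.
my @list = $text =~ /( .*? $pat | .+ )/gsx;

printf "%d records\n", scalar @list;
print "<<$_>>\n" for @list;
```

The blank lines after each delimiter, including the unterminated one at the end of the string, are absorbed into the record they follow.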
Here are two subroutines which can be used to partition a string into paragraphs.
# Partition a string into paragraphs based on a
# pattern which matches the beginning of a paragraph.
sub partition_para_beg {
    my ($pat, $str) = @_;
    $str = $_ unless defined $str;
    if ("" =~ /$pat/) {
        require Carp;
        Carp::croak("invalid pattern matches empty string: \"$pat\"\n");
    }
    # split $str explicitly; a bare split would operate on $_ instead
    return split /(?=$pat)/, $str;
}

# Partition a string into paragraphs based on a
# pattern which matches the end of a paragraph.
sub partition_para_end {
    my ($pat, $str) = @_;
    $str = $_ unless defined $str;
    if ("" =~ /$pat/) {
        require Carp;
        Carp::croak("invalid pattern matches empty string: \"$pat\"\n");
    }
    return $str =~ /(.*?(?:$pat)|.+)/gs;
}
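A short usage sketch follows. To keep it self-contained it restates partition_para_end in condensed form (same logic, with Carp loaded up front), and the sample string and the "." terminator line are hypothetical. It also shows the guard rejecting a pattern that can match the empty string.

```perl
use strict;
use warnings;
use Carp ();

# Condensed restatement of partition_para_end for this sketch.
sub partition_para_end {
    my ($pat, $str) = @_;
    $str = $_ unless defined $str;
    Carp::croak("invalid pattern matches empty string: \"$pat\"")
        if "" =~ /$pat/;
    return $str =~ /(.*?(?:$pat)|.+)/gs;
}

# Hypothetical data: paragraphs terminated by a line containing only ".".
my @paras = partition_para_end(qr/^\.\n/m, "a\nb\n.\nc\n.\nd\n");
print "<<$_>>\n" for @paras;

# A pattern that can match the empty string is refused.
my $ok  = eval { partition_para_end(qr/x*/, "abc"); 1 };
my $err = $ok ? '' : $@;
print "rejected: $err" unless $ok;
```

Calling the subroutine in list context is what makes the `/g` match return every captured paragraph.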

Re: Parsing a string into "paragraphs"
by radiantmatrix (Parson) on Nov 22, 2006 at 18:19 UTC

    And why would you slurp-and-split instead of processing the file sequentially? Something along the lines of:

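    my $c_para = '';
    while (<$INPUT_FH>) {
        $c_para .= $_;
        if ( /$end_of_para_pattern/ ) {
            process_paragraph($c_para);
            $c_para = ''; # we'll start a new paragraph next pass
        }
    }
    # handle a final paragraph that lacks a terminating delimiter
    process_paragraph($c_para) if length $c_para;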

    has always worked for me. If you really needed an array of paragraphs, you could always push them onto an array instead of calling process_paragraph...

    <radiant.matrix>
    Ramblings and references
    The Code that can be seen is not the true Code
    I haven't found a problem yet that can't be solved by a well-placed trebuchet
      Slurp and split is substantially more readable, IMHO.