2ge has asked for the wisdom of the Perl Monks concerning the following question:

Hi to all good monks! I have a small regexp question. My text file looks like this:
--
     Paragraph1:
         text
     Paragraph2:
         text1
         text2
         text3
     Paragraph3:
         text
--

OK, now I want to extract everything between Paragraph2 and Paragraph3. But I have a small problem: I can't use '/Paragraph2:\s+(.*)\s+Paragraph3:/is' because I don't know the name of the next paragraph (Paragraph3); it could be Paragraph1, 2, 3 and so on... Hint: the name of a paragraph has ':' at the end of the line and starts five (5) spaces from the start of the line; the extracted text always starts nine (9) spaces from the start. Any hints on this? A nice regexp would be ideal, if it is possible... (Sorry for my poor English; this is my first post here :)

Thanks for any help! (Is DelimMatch the solution?)

P.S. My first question had some errors, so I will correct them: Paragraph[123] stands for names such as 'Name', 'Address', 'City' and so on (there are many possibilities, and sometimes not all of them are present!). I can't use the scalar range operator or (.*) because I don't know the name of the end paragraph. Maybe the best solution is not to find a "hard regexp", but to collect the real names of the paragraphs first and then build the regexp from them.

Janitored by davido: paragraph tags added to reflect the layout of the text as posted.

Re: Regexp help, multiple lines
by Zaxo (Archbishop) on Aug 15, 2004 at 22:01 UTC

    The scalar range ("flipflop") operator would be handy for this:

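    while (<FILE>) {
        # Three dots, not two: with '..' the end pattern is tested on the same
        # line that matched the start pattern. The Paragraph2 header matches
        # the end pattern too, so the range would close immediately.
        print if /^ {5}Paragraph2:$/ ... /^ {5}Paragraph\d+:$/;
    }
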
    The second regex shows how to compensate for numbers you don't know. See perlop for more on the flipflop.

    After Compline,
    Zaxo

Re: Regexp help, multiple lines
by davidj (Priest) on Aug 15, 2004 at 22:02 UTC
Re: Regexp help, multiple lines
by CombatSquirrel (Hermit) on Aug 15, 2004 at 22:02 UTC
    In this case you might want to use \d in your RegEx to match any digit. You probably also want to make that star non-greedy (add a ? to it).
    A common idiom for a problem like this would be the following:
    #!perl
    use strict;
    use warnings;

    for (<DATA>) {
        # '...' defers the end test to the line after the start match
        print if /Paragraph2/ ... /Paragraph/;
    }

    __DATA__
         Paragraph1:
             text
         Paragraph2:
             text1
             text2
             text3
         Paragraph3:
             text

    Also, if you have fixed-width fields, consider using substr or unpack.
    Hope this helped.
    CombatSquirrel.

    Entropy is the tendency of everything going to hell.
      Hi! Thanks for the really fast answers. My question was missing some information, so I updated it. Maybe I should first collect all the 'Paragraph' names into an array and then build the regexp from them?
      I did know the .. and ... operators, but they are nearly the same as (.*), so they don't help here. My regexp should look like
       /^ {5}\S[^:]+:$\s+([^ {5}\S])/ism
      but of course '[^ {5}\S]' doesn't work, we know why... :(
        You might want to use something like this
        #!perl
        use strict;
        use warnings;

        my $paragraph = 0;
        for (<DATA>) {
            # A non-space character at offset 5 marks a header line.
            if (substr($_, 5, 1) ne ' ') {
                ++$paragraph;
                last if $paragraph > 2;   # past the wanted section: stop
                next;                     # do not print the header itself
            }
            print if $paragraph == 2;
        }

        __DATA__
             Paragraph1:
                 text
             Paragraph2:
                 text1
                 text2
                 text3
             Paragraph3:
                 text

        Hope this helped.
        CombatSquirrel.

        Entropy is the tendency of everything going to hell.
Re: Regexp help, multiple lines
by ysth (Canon) on Aug 16, 2004 at 00:30 UTC
    One key to creating a regex is to look at the data you want to match and describe it in a very elementary way, then translate that into a regex. If your description is "match one or more consecutive lines that each start with 9 spaces", your regex is /(^ {9}.*\n)+/m.
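    As a rough sketch of how that description could be combined with a known header name: the helper extract_section and the sample data below are made up for illustration, and the layout (header at exactly 5 spaces, body lines at 9 spaces, every line ending in a newline) is assumed from the question.

```perl
#!perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return the 9-space-indented
# lines that directly follow the header "$name:", or undef if not found.
sub extract_section {
    my ($text, $name) = @_;
    # Header: exactly 5 spaces, the literal name, a colon, end of line.
    # Body:   one or more consecutive lines that each start with 9 spaces.
    return $text =~ /^ {5}\Q$name\E:\n((?:^ {9}.*\n)+)/m ? $1 : undef;
}

# Sample data built line by line so the indentation is explicit.
my $data = join '',
    "     Name:\n",
    "         text\n",
    "     Address:\n",
    "         text1\n",
    "         text2\n",
    "         text3\n",
    "     City:\n",
    "         text\n";

print extract_section($data, 'Address');
```

    The match stops by itself at the next header, because that line has only 5 leading spaces, so the name of the following section never needs to be known.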
Re: Regexp help, multiple lines
by spoulson (Beadle) on Aug 16, 2004 at 20:08 UTC
    You are probably looking for a way to methodically parse out the headings and data for each section, rather than look for only a specific one. For that, you would want to use a zero-width lookahead assertion, (?=regexp). One assumption: the headings are usually, but not always, indented with 5 spaces. Line items are always indented with 9 spaces. Here's what I came up with; it worked for me on ActivePerl 5.6.
    use strict;

    my $data = join("", <DATA>);
    print "data:\n$data";

    # loop for each heading
    while ($data =~ /\s*(.*?):\s*\n((.|\n)*?\n)(?=(\s*\w+:|\Z))/gc) {
        print "heading: $1\n";
        my $text = $2;
        while ($text =~ /\s{9}(.*?)\n/gc) {
            print "line item: $1\n";
        }
    }
    exit;

    __DATA__
         Paragraph1:
             text
         Paragraph2:
             text1
             text2
             text3
         Paragraph3:
             text