Regexp help, multiple lines

2ge has asked for the wisdom of the Perl Monks concerning the following question:

Hi to all good monks! I have a small regexp question. My text file looks like this:

--

     Paragraph1: 
         text
     Paragraph2: 
         text1
         text2
         text3
     Paragraph3:
         text
--

OK, now I want to extract everything between Paragraph2 and Paragraph3. But I have a small problem: I can't use '/Paragraph2:\s+(.*)\s+Paragraph3:/is' because I don't know the name of the next paragraph (Paragraph3); it could be Paragraph1, 2, 3 and so on. Hint: a paragraph name has ':' at the end of the line and starts five (5) spaces from the start of the line; the extracted text always starts nine (9) spaces from the start. Any hints on this? A nice regexp would be ideal, if one is possible. (Sorry for my poor English; this is my first post here :)

Thanks for any help! (Is DelimMatch the solution?)

P.S. My first question had some errors, so I will correct them: Paragraph[123] should be, for example, 'Name', 'Address', 'City' and so on (there are many possibilities, and sometimes not all of them are shown!). I can't use the 'scalar range operator' or (.*) because I don't know the name of the end paragraph. Maybe the best solution is not to find a "hard regexp", but to find the real names of the paragraphs first and then create the regexp.

Janitored by davido: paragraph tags added to reflect the layout of the text as posted.


Replies are listed 'Best First'.
Re: Regexp help, multiple lines by Zaxo (Archbishop) on Aug 15, 2004 at 22:01 UTC
The scalar range ("flipflop") operator would be handy for this: `while (<FILE>) { print if /^ {5}Paragraph2:\s*$/ ... /^ {5}Paragraph\d+:\s*$/; }` The second regex shows how to compensate for numbers you don't know. (The three-dot form is needed here: with two dots the end pattern is tested against the Paragraph2 line itself, and the range closes immediately.) See perlop for more on the flipflop. After Compline, Zaxo
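A minimal sketch of the flip-flop approach when the headings are names rather than numbers, assuming (as the question states) that a heading is exactly five spaces, a name, and a colon. The sample lines and the `@body` array are illustrative only. The end pattern matches any heading, so the name of the next section does not need to be known.

```perl
use strict;
use warnings;

# Hypothetical sample data, modelled on the layout in the question.
my @sample = (
    "     Name: \n",
    "         text\n",
    "     Address: \n",
    "         text1\n",
    "         text2\n",
    "         text3\n",
    "     City:\n",
    "         text\n",
);

my @body;
for (@sample) {
    # '...' defers the end test to the line after the start match,
    # so the Address heading cannot close its own range.
    if (/^ {5}Address:\s*$/ ... /^ {5}\S[^:]*:\s*$/) {
        push @body, $_ unless /^ {5}\S/;   # skip the two heading lines
    }
}

print @body;
```

The range itself still includes both heading lines; the `unless` filter is what drops them.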
Re: Regexp help, multiple lines by davidj (Priest) on Aug 15, 2004 at 22:02 UTC
Use the range operator. See The Scalar Range Operator by pbeckingham for an excellent tutorial. It has everything you need to know. davidj
Re: Regexp help, multiple lines by CombatSquirrel (Hermit) on Aug 15, 2004 at 22:02 UTC
In this case you might want to use `\d` in your RegEx to match any digit. You probably also want to make that star non-greedy (add a ? to it). A common idiom for a problem like this would be the following:
`#!perl
use strict;
use warnings;

while (<DATA>) {
    # '...' prints from the Paragraph2 heading through the next heading, inclusive
    print if /Paragraph2/ ... /Paragraph/;
}
__DATA__
     Paragraph1:
         text
     Paragraph2:
         text1
         text2
         text3
     Paragraph3:
         text`
Also, if you have fixed-width fields, consider using `substr` or `unpack`. Hope this helped. CombatSquirrel. Entropy is the tendency of everything going to hell.
Re^2: Regexp help, multiple lines by 2ge (Scribe) on Aug 15, 2004 at 22:42 UTC
Hi! Thanks for the really fast answers. My question was missing some information, so I updated it. Maybe I should first collect all the 'Paragraph' names into an array and then build the regexp from them? I did know about the .. and ... operators, but they are nearly the same as (.*), so they don't help here. My regexp should look something like `/^ {5}\S[^:]+:$\s+([^ {5}\S])/ism`, but of course `'[^ {5}\S]'` doesn't work, we know why... :(
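What the character class `[^ {5}\S]` is reaching for, "a line that does not start a new heading", can be expressed with a negative lookahead instead. The following is only a sketch under the layout stated in the question (headings at five spaces, ending in a colon); the sample text and the `@sections` array are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical sample text, modelled on the layout in the question.
my $text = join '',
    "     Name: \n",
    "         text\n",
    "     Address: \n",
    "         text1\n",
    "         text2\n",
    "         text3\n",
    "     City:\n",
    "         text\n";

my @sections;   # list of [heading, body] pairs
while ($text =~ /^ {5}(\S[^:\n]*):[ \t]*\n((?:(?! {5}\S).*\n)*)/mg) {
    # (?! {5}\S) rejects any line that begins a new heading,
    # so the body stops just before the next one.
    push @sections, [$1, $2];
}

print "$_->[0]:\n$_->[1]" for @sections;
```

No heading name appears in the pattern, so the same regex works for 'Name', 'Address', 'City', or any other heading.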
Re^3: Regexp help, multiple lines by CombatSquirrel (Hermit) on Aug 16, 2004 at 05:34 UTC
You might want to use something like this:
`#!perl
use strict;
use warnings;

my $paragraph = 0;
while (<DATA>) {
    # A heading has a non-space character at offset 5; body lines have a space there.
    if (substr($_, 5, 1) ne ' ') {
        last if ++$paragraph > 2;   # stop once the third heading starts
        next;                       # never print the heading itself
    }
    print if $paragraph == 2;
}
__DATA__
     Paragraph1:
         text
     Paragraph2:
         text1
         text2
         text3
     Paragraph3:
         text`
Hope this helped. CombatSquirrel. Entropy is the tendency of everything going to hell.
Re^4: Regexp help, multiple lines by 2ge (Scribe) on Aug 16, 2004 at 08:16 UTC
Re: Regexp help, multiple lines by ysth (Canon) on Aug 16, 2004 at 00:30 UTC
One key to creating a regex is to look at the data you want to match and describe it in a very elementary way, then translate that into a regex. If your description is "match one or more consecutive lines that each start with 9 spaces", your regex is `/(^ {9}.*\n)+/m`.
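As a sketch of how that description might be combined with a known heading, the pattern below reads "the heading line, then one or more consecutive lines that each start with 9 spaces". The sample text and the `$name` variable are assumptions for illustration, not part of the original post.

```perl
use strict;
use warnings;

# Hypothetical sample text, modelled on the layout in the question.
my $text = join '',
    "     Name: \n",
    "         text\n",
    "     Address: \n",
    "         text1\n",
    "         text2\n",
    "         text3\n",
    "     City:\n",
    "         text\n";

my $name = 'Address';   # hypothetical heading to look up

# Heading line (5 spaces, name, colon), then capture every following
# line that starts with 9 spaces. \Q...\E quotes the name literally.
my ($body) = $text =~ /^ {5}\Q$name\E:[ \t]*\n((?:^ {9}.*\n)+)/m;

print $body if defined $body;
```

The inner group is non-capturing so that the outer parentheses return the whole run of lines rather than only the last repetition.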
Re: Regexp help, multiple lines by spoulson (Beadle) on Aug 16, 2004 at 20:08 UTC
You are probably looking for a way to methodically parse out the headings and data for each section, rather than look for only a specific one. For that, you would want to use a zero-width lookahead assertion, (?=regexp). One assumption: the headings are usually, but not always, indented with 5 spaces. Line items are always indented with 9 spaces. Here's what I came up with; it worked for me on ActivePerl 5.6.
`use strict;

my $data = join("", <DATA>);
print "data:\n$data";

# loop for each heading
while ($data =~ /\s*(.*?):\s*\n((.|\n)*?\n)(?=(\s*\w+:|\Z))/gc) {
    print "heading: $1\n";
    my $text = $2;
    # loop for each line item under this heading
    while ($text =~ /\s{9}(.*?)\n/gc) {
        print "line item: $1\n";
    }
}
exit;
__DATA__
     Paragraph1:
         text
     Paragraph2:
         text1
         text2
         text3
     Paragraph3:
         text`
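For comparison, the same "parse every section" idea can be sketched line by line, without a lookahead, by collecting the line items into a hash keyed by heading. This is only one possible approach; it assumes headings are indented by exactly five spaces (stricter than the reply above allows), and the `%section` hash and sample lines are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical sample data, modelled on the layout in the question.
my @lines = (
    "     Name: \n",
    "         text\n",
    "     Address: \n",
    "         text1\n",
    "         text2\n",
    "         text3\n",
    "     City:\n",
    "         text\n",
);

my (%section, $current);
for my $line (@lines) {
    if ($line =~ /^ {5}(\S[^:]*):\s*$/) {
        # Heading line: start a new, empty list of items.
        $current = $1;
        $section{$current} = [];
    }
    elsif (defined $current && $line =~ /^ {9}(.*)$/) {
        # Line item: file it under the most recent heading.
        push @{ $section{$current} }, $1;
    }
}

print "$_: @{ $section{$_} }\n" for sort keys %section;
```

Once the hash is built, any section can be looked up by name, for example `@{ $section{Address} }`.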