Regex get Text between two strings with colon

MurciaNew has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I have a regex problem. I want to get the text between two words that each end with a colon (like Example: or Epikrise:).

I have files whose line ending is "\r\n".

Here are just two example files:
<<< start file 1>>>
Test:\r\n
blablabla1\r\n
blablablabla2\r\n
blablablablabla3\r\n
blablablab4\r\n
blablablaba5\r\n
Test1:\r\n
lalala1\r\n
lalala2\r\n
Hello3:\r\n
mymymymy\r\n
<<< end file 1>>>

<<< start file 2>>>
Test:\r\n
blablabla1\r\n
blablablabla2\r\n
blablablablabla3\r\n
blablablab4\r\n
blablablaba5\r\n
blablablaba6\r\n
Test3:\r\n
lalala1\r\n
lalala2\r\n
lalala3\r\n
City:\r\n
Gigi\r\n
lulu\r\n
Kuku\r\n
<<< end file 2>>>

With a regexp I want to get all the text between "Test:" and the next line that consists of a word followed by ":" (a colon). In the examples that next line is "Test1:" for file 1 and "Test3:" for file 2.

The result should be Test="blablabla1\r\n blablablabla2\r\n blablablablabla3\r\n blablablab4\r\n blablablaba5\r\n"

or Test="blablabla1\r\n blablablabla2\r\n blablablablabla3\r\n blablablab4\r\n blablablaba5\r\n blablablaba6\r\n"

I tried (.+?(\r\n.+){1,}?(?!\w+?:)) but it does not work

Please help me..... Thanks

MurciaNew (Guido)


Re: Regex get Text between two strings with colon by haukex (Archbishop) on Sep 16, 2017 at 18:44 UTC
The following works, but requires you to slurp the entire file into memory. If the file is large, that might not be the best approach. I talked about a different, line-by-line approach in the recent thread about a similar topic, Multi Line Regex Matches... Basically, you'd need to keep the lines between the "start" and "end" markers in a buffer, probably an array. Regarding the following code, see the documentation of the anchors `^` and `$`, as well as the modifiers `/s`, `/m`, and `/x`, in perlretut and perlre. `my $data = do { open my $fh, '<', $fn or die $!; local $/; <$fh> }; my ($test) = $data=~m{ ^\w+:\n (.+?) (?: ^\w+:$ | \z ) }msx;`
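The line-by-line approach haukex describes (buffering the lines between the "start" and "end" markers in an array) could be sketched as follows. This is only a minimal illustration, not haukex's code: the helper name `lines_after_label`, the in-memory sample data, and the choice to strip both LF and CRLF line endings are assumptions made for the example.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return the lines that follow
# the "$start:" label, up to but not including the next "word:" line.
sub lines_after_label {
    my ($fh, $start) = @_;
    my @buffer;
    my $inside = 0;
    while (my $line = <$fh>) {
        $line =~ s/\r?\n\z//;            # strip an LF or CRLF line ending
        if ($inside) {
            last if $line =~ /^\w+:$/;   # the next label ends the block
            push @buffer, $line;         # otherwise keep the line
        }
        elsif ($line eq "$start:") {
            $inside = 1;                 # start marker seen, begin buffering
        }
    }
    return @buffer;
}

# Sample input with CRLF line endings, read through an in-memory filehandle.
my $sample = join '', map { $_ . "\r\n" }
    'Test:', 'blablabla1', 'blablablabla2', 'Test1:', 'lalala1', 'Hello3:', 'mymymymy';
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my @lines = lines_after_label($fh, 'Test');
close $fh;

print "$_\n" for @lines;    # the two lines between Test: and Test1:
```

Only one line is held in memory at a time besides the buffer, so this avoids slurping the whole file.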
Re: Regex get Text between two strings with colon by talexb (Chancellor) on Sep 16, 2017 at 19:05 UTC
Brother haukex has already replied, but here's my answer, which does more or less the same thing: `#!perl use strict; use warnings; { # Disable the line ending magic, and slurp the entire string into # a scalar. undef $/; my $data = <DATA>; # If we see some text between 'Test:' and 'Test2:' while looking # at a multi-line string, display the resulting capture. if ( $data =~ /Test:(.+)Test2:/s ) { print "Found |$1| between titles.\n" } } __DATA__ Test: Blah blah blah 1 Blah blah blah 2 Blah blah blah 5 Blah blah blah 9 Test2: What is this for? How is this happning? Why am I here? Hello3: What`

The `$/` variable is the input record separator: it tells Perl what marks the 'end of line' when reading. When I `undef` that variable, the whole file gets treated as a single line from an input point of view, so one read puts all of the text into `$data`. The `/s` modifier lets `.` match newline characters as well, so the capture can span several lines.

When this script is run, I get `Found | Blah blah blah 1 Blah blah blah 2 Blah blah blah 5 Blah blah blah 9 | between titles.`

That's one way of parsing the file. Another way would be to read lines until you see the start of the capture (Test:, in this case), then start capturing. For each following line, check whether it is the end of the capture (Test2:, in this case): if not, capture the line and repeat; otherwise, stop.

You could also just capture each block of text into an array and store that array as a hash value, using the label (like Test) as the hash key. Later, just go and get the entry for whichever key you want.

Alex / talexb / Toronto

Thanks PJ. We owe you so much. Groklaw -- RIP -- 2003 to 2013.
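talexb's last suggestion (each block of text in an array, stored in a hash under its label) might look roughly like the sketch below. The helper name `read_blocks` and the sample data are made up for illustration, and the code assumes that every label line consists of word characters followed by a colon, as in the question's examples.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): collect the lines following
# each "Label:" line into an array, stored in a hash under that label.
sub read_blocks {
    my ($fh) = @_;
    my (%block, $label);
    while (my $line = <$fh>) {
        $line =~ s/\r?\n\z//;                  # strip an LF or CRLF line ending
        if ($line =~ /^(\w+):$/) {             # a label line starts a new block
            $label = $1;
            $block{$label} = [];
        }
        elsif (defined $label) {               # skip anything before the first label
            push @{ $block{$label} }, $line;
        }
    }
    return \%block;
}

# Sample input with CRLF line endings, read through an in-memory filehandle.
my $sample = join '', map { $_ . "\r\n" }
    'Test:', 'blablabla1', 'blablablabla2', 'Test3:', 'lalala1', 'City:', 'Gigi', 'lulu';
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my $block = read_blocks($fh);
close $fh;

print "$_\n" for @{ $block->{Test} };          # the lines filed under "Test"
```

Once the hash is built, any block can be looked up by its label, for example `$block->{City}`.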
Re^2: Regex get Text between two strings with colon by MurciaNew (Novice) on Sep 16, 2017 at 19:12 UTC
Thanks for the hints. Is there a solution without modifiers and slurps? Just "pure" regex?
Re^3: Regex get Text between two strings with colon by AnomalousMonk (Archbishop) on Sep 16, 2017 at 23:24 UTC
As the Anonymous One has said, modifiers are an entirely kosher part of regexes, and haukex has already linked to a discussion of non-slurping, line-by-line matching.

Give a man a fish: `<%-{-{-{-<`
Re^3: Regex get Text between two strings with colon by Anonymous Monk on Sep 16, 2017 at 19:49 UTC
Modifiers are "pure regex". And if you don't want to slurp, then how do you expect a regex to match over data you haven't read?
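To illustrate that modifiers are part of the regex itself, here is a small sketch in which they are written inside the pattern as `(?msx)` instead of after the closing delimiter. It is an adaptation for illustration, not code from any reply: the pattern follows the shape of haukex's, but the hard-coded `Test:` label, the `\r?` added for the "\r\n" line endings, and the sample data are all assumptions.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Sample text with CRLF line endings, standing in for a slurped file.
my $data = join '', map { $_ . "\r\n" }
    'Test:', 'blablabla1', 'blablablabla2', 'Test1:', 'lalala1', 'Hello3:', 'mymymymy';

# (?msx) turns on /m, /s and /x from inside the pattern:
#   m: ^ and $ match at line boundaries
#   s: . also matches newlines
#   x: whitespace in the pattern is ignored
my $re = qr{(?msx) ^Test:\r?\n (.+?) (?: ^\w+:\r?$ | \z ) };

my ($test) = $data =~ $re;
print defined $test ? $test : "no match\n";
```

The capture keeps the original "\r\n" line endings, which matches the result the question asks for.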
Re: Regex get Text between two strings with colon by kcott (Archbishop) on Sep 17, 2017 at 05:55 UTC
G'day MurciaNew,

Here's a technique that will read your data in blocks, by localising a temporary value for the input record separator (`$/`). `#!/usr/bin/env perl use strict; use warnings; my %data; { local $/ = ":\n"; my $block_re = qr{\A(.*)^(.*?):?\Z}ms; my $key; while (<DATA>) { my ($data, $label) = /$block_re/; $data{$key} = $data if defined $key; $data{$key} .= $label if eof DATA; $key = $label; } } use Data::Dump; dd \%data; __DATA__ AAA: 123 456 789 BBB: qwe rty uio CCC: asd fgh jkl`

Notes:

There's nothing particularly special about the regex (`$block_re`); see perlre if you need to.
The first block read only has a label. That's dealt with by the '`if defined $key`' condition.
The last block read only has data. That's dealt with by the '`if eof DATA`' condition.
You'll need to replace `DATA` with `$fh` (or similar). See open for more on that.
I've used Data::Dump just to show the results of the data extraction: that's not part of the technique.
The code assumes your example data is representative. If it's not, the general technique should still be sound, but you may need to make some modifications.

Here's the output from that code: `{ AAA => "123\n456\n789\n", BBB => "qwe\nrty\nuio\n", CCC => "asd\nfgh\njkl\n", }`

-- Ken