MurciaNew has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I have a regex problem. I want to get the text between two words that each end with a colon (like Example: or Epikrise:).

I have files (newLine is "\r\n").

Example of just two files:

    <<< start file 1>>>
    Test:\r\n
    blablabla1\r\n
    blablablabla2\r\n
    blablablablabla3\r\n
    blablablab4\r\n
    blablablaba5\r\n
    Test1:\r\n
    lalala1\r\n
    lalala2\r\n
    Hello3:\r\n
    mymymymy\r\n
    <<< end file 1>>>

    <<< start file 2>>>
    Test:\r\n
    blablabla1\r\n
    blablablabla2\r\n
    blablablablabla3\r\n
    blablablab4\r\n
    blablablaba5\r\n
    blablablaba6\r\n
    Test3:\r\n
    lalala1\r\n
    lalala2\r\n
    lalala3\r\n
    City:\r\n
    Gigi\r\n
    lulu\r\n
    Kuku\r\n
    <<< end file 2>>>

With a regexp I want to get all the text between "Test:" and the next line that contains a string ending in ":" (colon). In the examples that is "Test1:" for file 1 and "Test3:" for file 2.

As a result I want Test="blablabla1\r\n blablablabla2\r\n blablablablabla3\r\n blablablab4\r\n blablablaba5\r\n" for file 1

or Test="blablabla1\r\n blablablabla2\r\n blablablablabla3\r\n blablablab4\r\n blablablaba5\r\n blablablaba6\r\n" for file 2.

I tried (.+?(\r\n.+){1,}?(?!\w+?:)) but it does not work

Please help me..... Thanks

MurciaNew (Guido)

Re: Regex get Text between two strings with colon
by haukex (Archbishop) on Sep 16, 2017 at 18:44 UTC

    The following works, but requires you to slurp the entire file into memory. If the file is large, that might not be the best approach. I talked about a different, line-by-line approach in the recent thread about a similar topic, Multi Line Regex Matches... - basically, you'd need to keep the lines between the "start" and "end" markers in a buffer, probably an array. Regarding the following code, see the documentation of the anchors ^ and $, as well as the modifiers /s, /m, and /x, in perlretut and perlre.

    my $data = do { open my $fh, '<', $fn or die $!; local $/; <$fh> };
    my ($test) = $data=~m{ ^\w+:\n (.+?) (?: ^\w+:$ | \z ) }msx;
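
The question says the files use "\r\n" line endings, which the pattern above does not account for unless the file is read through something that translates them. Here is a minimal sketch of one way to adapt it, assuming the data is already in a string: allow an optional \r before each line end and anchor on the literal label Test:. The sample string and the variable names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample data with CRLF line endings, modelled on "file 1".
my $data = join("\r\n",
    "Test:", "blablabla1", "blablablabla2",
    "Test1:", "lalala1",
    "Hello3:", "mymymymy") . "\r\n";

# /m: ^ and $ match at line boundaries; /s: . also matches newlines;
# /x: whitespace and comments inside the pattern are ignored.
my ($test) = $data =~ m{
    ^Test:\r?\n           # the starting label on its own line
    (.*?)                 # everything after it, as little as possible
    (?= ^\w+:\r?$ | \z )  # stop before the next label line, or at the end
}msx;

print defined $test ? $test : "no match\n";
```

The lookahead leaves the next label unconsumed, so the same pattern could be reused to pick up the following block.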
Re: Regex get Text between two strings with colon
by talexb (Chancellor) on Sep 16, 2017 at 19:05 UTC

    Brother haukex has already replied, but here's my answer, which more or less does the same thing:

    #!perl

    use strict;
    use warnings;

    {
        # Disable the line ending magic, and slurp the entire string into
        # a scalar.
        undef $/;
        my $data = <DATA>;

        # If we see some text between 'Test:' and 'Test2:' while looking
        # at a multi-line string, display the resulting capture.
        if ( $data =~ /Test:(.+)Test2:/s ) {
            print "Found |$1| between titles.\n"
        }
    }

    __DATA__
    Test:
    Blah blah blah 1
    Blah blah blah 2
    Blah blah blah 5
    Blah blah blah 9
    Test2:
    What is this for?
    How is this happning?
    Why am I here?
    Hello3:
    What

    The $/ variable is the input record separator: it tells Perl what ends a 'line' when reading. When I undef that variable, the whole file gets treated as a single line from an input point of view, so one read puts all of the text into $data.

    The 's' option makes the dot (.) match newline characters as well, so (.+) can capture text that spans several lines. When this script is run, I get

    Found |
    Blah blah blah 1
    Blah blah blah 2
    Blah blah blah 5
    Blah blah blah 9
    | between titles.
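
For comparison, here is a small sketch of what /s changes, shown next to /m, which the other reply relies on. The sample text and the variable names are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample text using plain "\n" line endings.
my $text = "Test:\nline1\nline2\nTest2:\nline3\n";

# Without /s, "." stops at a newline, so the capture cannot cross lines.
my $plain  = $text =~ /Test:(.+)Test2:/  ? "match" : "no match";

# With /s, "." matches newlines too, so the capture spans the whole block.
my $dotall = $text =~ /Test:(.+)Test2:/s ? "match" : "no match";

# /m is the modifier that makes ^ and $ match at every line boundary.
my @labels = $text =~ /^(\w+):$/mg;

print "without /s: $plain\n";
print "with /s:    $dotall\n";
print "labels:     @labels\n";
```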

    That's one way of parsing the file -- another way would be to

    • Read lines until you see the start of the capture (Test:, in this case), then start capturing;
    • Did we see the end of the capture (Test2:, in this case)? If not, capture the line and repeat; otherwise, stop.
    You could also just capture each block of text into an array, store that array as a hash value, using the label (like Test) as the hash key. Later, just go and get the entry for whichever key you want.
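
The line-by-line idea, combined with the hash of arrays, could look roughly like the sketch below. It assumes that every label is a single word followed by a colon on a line of its own, and that lines end in either "\n" or "\r\n". The sample data, the in-memory filehandle and the names %section and $label are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical sample data with CRLF line endings; an in-memory filehandle
# stands in for a real file here.
my $input = join("\r\n",
    "Test:", "blablabla1", "blablablabla2",
    "Test1:", "lalala1", "lalala2",
    "Hello3:", "mymymymy") . "\r\n";
open my $fh, '<', \$input or die "Cannot open in-memory handle: $!";

my %section;   # label => array ref of the lines that follow it
my $label;     # label of the block currently being read

while (my $line = <$fh>) {
    $line =~ s/\r?\n\z//;               # strip LF or CRLF
    if ($line =~ /^(\w+):$/) {          # a label line starts a new block
        $label = $1;
        $section{$label} = [];
    }
    elsif (defined $label) {            # any other line belongs to the block
        push @{ $section{$label} }, $line;
    }
}
close $fh;

print "$_\n" for @{ $section{Test} };
```

Only one line is held in memory at a time besides the collected blocks, and no regex modifiers are needed because each match looks at a single line.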

    Alex / talexb / Toronto

    Thanks PJ. We owe you so much. Groklaw -- RIP -- 2003 to 2013.

      Thanks for the hints. Is there a solution without modifiers and slurps? Just "pure" regex?

        As the Anonymous One has said, modifiers are an entirely kosher part of regexes, and haukex has already linked to a discussion of non-slurping, line-by-line matching.


        Give a man a fish:  <%-{-{-{-<

        Modifiers are "pure regex".

        And if you don't want to slurp, then how do you expect a regex to match over data you haven't read?

Re: Regex get Text between two strings with colon
by kcott (Archbishop) on Sep 17, 2017 at 05:55 UTC

    G'day MurciaNew,

    Here's a technique that will read your data in blocks, by localising a temporary value for the input record separator ($/).

    #!/usr/bin/env perl

    use strict;
    use warnings;

    my %data;

    {
        local $/ = ":\n";
        my $block_re = qr{\A(.*)^(.*?):?\Z}ms;
        my $key;

        while (<DATA>) {
            my ($data, $label) = /$block_re/;
            $data{$key} = $data if defined $key;
            $data{$key} .= $label if eof DATA;
            $key = $label;
        }
    }

    use Data::Dump;
    dd \%data;

    __DATA__
    AAA:
    123
    456
    789
    BBB:
    qwe
    rty
    uio
    CCC:
    asd
    fgh
    jkl

    Notes:

    • There's nothing particularly special about the regex ($block_re); see perlre if you need to.
    • The first block read only has a label. That's dealt with by the 'if defined $key' condition.
    • The last block read only has data. That's dealt with by the 'if eof DATA' condition.
    • You'll need to replace DATA with $fh (or similar). See open for more on that.
    • I've used Data::Dump just to show the results of the data extraction: that's not part of the technique.
    • The code assumes your example data is representative. If it's not, the general technique should still be sound, but you may need to make some modifications.

    Here's the output from that code:

    {
      AAA => "123\n456\n789\n",
      BBB => "qwe\nrty\nuio\n",
      CCC => "asd\nfgh\njkl\n",
    }
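
To see what the localised input record separator does on its own, here is a small sketch that only reads the records and prints them with their newlines made visible. It reads from a lexical filehandle rather than DATA. The sample string and the in-memory filehandle are stand-ins invented for the example.

```perl
use strict;
use warnings;

# Hypothetical sample data; an in-memory filehandle stands in for a file.
my $input = "AAA:\n123\n456\nBBB:\nqwe\nrty\n";
open my $fh, '<', \$input or die "Cannot open in-memory handle: $!";

my @records;
{
    local $/ = ":\n";        # a record now ends at a colon followed by newline
    @records = <$fh>;        # list context reads all remaining records
}                            # the previous separator is restored here
close $fh;

# Show each record with its newlines made visible.
for my $i (0 .. $#records) {
    (my $shown = $records[$i]) =~ s/\n/\\n/g;
    print "record $i: $shown\n";
}
```

Every record except the last ends with the label of the next block, which is why the regex in the code above splits each record into data and a trailing label.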

    — Ken