Multiline regex

adrya407 has asked for the wisdom of the Perl Monks concerning the following question:

Re: Multiline regex by hippo (Archbishop) on Jun 22, 2016 at 12:27 UTC
This looks to work for me on your sample data:
`#!/usr/bin/env perl
use strict;
use warnings;
my $data = 'unwanted_line1=blabla
unwanted_line2=blabla
my_variable=important_content_section1
important_content_section2
important_content_section3
unwanted_line3=blabla
';
my ($getVariable) = $data =~ /(my_variable=.*)^\w+=/sm;
print $getVariable;`
Update: See Eily's first reply below. If unwanted_line3 is not the last unwanted line in the set after all, then you'll need a non-greedy grab on the capture group like so:
`#!/usr/bin/env perl
use strict;
use warnings;
my $data = 'unwanted_line1=blabla
unwanted_line2=blabla
my_variable=important_content_section1
important_content_section2
important_content_section3
unwanted_line3=blabla
nonsense
unwanted_line4=blabla
whocansay
';
my ($getVariable) = $data =~ /(my_variable=.*?)^\w+=/sm;
print $getVariable;`
Re^2: Multiline regex by Eily (Monsignor) on Jun 22, 2016 at 12:31 UTC
This works in this case because the next '=' after the required variable also happens to be the last. `.*?` instead of `.*` will make sure perl finds the first '=' after the match on "my_variable". It's simpler than my solution with a look-ahead assertion though, guess I overdid it :)
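To illustrate the difference Eily describes, here is a minimal sketch. The data is shortened sample data modelled on the thread (an assumption, not the OP's real file), and the variable names `$greedy` and `$lazy` are made up for the comparison.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Assumed sample data: TWO key=value lines follow the wanted block.
my $data = join '',
    "unwanted_line1=blabla\n",
    "my_variable=important_content_section1\n",
    "important_content_section2\n",
    "unwanted_line3=blabla\n",
    "unwanted_line4=blabla\n";

# Greedy: .* first runs to the end of the string (/s lets . match "\n"),
# then backtracks to the LAST line that starts with \w+=
my ($greedy) = $data =~ /(my_variable=.*)^\w+=/sm;

# Non-greedy: .*? grows one character at a time and stops before the
# FIRST line that starts with \w+=
my ($lazy) = $data =~ /(my_variable=.*?)^\w+=/sm;

print "greedy:\n$greedy";
print "lazy:\n$lazy";
```

The greedy capture still contains `unwanted_line3=blabla`, which matches the symptom adrya407 reports; the non-greedy one stops after `important_content_section2`.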
Re^2: Multiline regex by adrya407 (Novice) on Jun 22, 2016 at 12:39 UTC
Due to confidentiality reasons I can't post the data I'm working on. It works almost perfectly, except it still gets the `unwanted_line3=text` line, but not other lines after that, so it just needs a slight adjustment. Thank you kindly, sir!
Re^3: Multiline regex by hippo (Archbishop) on Jun 22, 2016 at 12:47 UTC
Did you add in the question mark as Eily suggested? If you can't provide an equivalent sample dataset which shows the problem it's going to be very difficult to assist you further, unfortunately.
Re^4: Multiline regex by adrya407 (Novice) on Jun 22, 2016 at 13:08 UTC
Re^2: Multiline regex by Anonymous Monk on Jun 22, 2016 at 12:45 UTC
Fails for:
`my $data = 'unwanted_line1=blabla
unwanted_line2=blabla
my_variable=important_content_section1
important_content_section2
important_content_section3
unwanted_line3=blabla
unwanted_line4=blabla
';`
Gets extra unwanted lines.
Re^2: Multiline regex by Anonymous Monk on Jun 22, 2016 at 13:26 UTC
Fails if there are no lines after the wanted data, like so:
`my $data = <<END;
unwanted_line1=blabla
unwanted_line2=blabla
my_variable=important_content_section1
important_content_section2
important_content_section3
END`
Fails to match.
Re: Multiline regex by Eily (Monsignor) on Jun 22, 2016 at 12:29 UTC
First, for multiline matching you can use the /s modifier to make . match "\n". The second tool you may want to use is a look-ahead assertion, which allows you to check that a match is followed by something, without including that something in the match. The logic then becomes "match the shortest string possible, but it has to be directly followed by the next variable". Something like: `/my_variable=(.*?)(?=unwanted_variable|\z)/s`, where the ? after the * means "as little as possible".
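A minimal sketch of that look-ahead idea, under some assumptions: the helper name `extract_my_variable` is hypothetical, the sample strings are invented, and the look-ahead is generalised from a literal `unwanted_variable` to "any line starting with `\w+=`" (which additionally needs /m so that ^ matches at line starts).

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns the text that follows
# "my_variable=", up to the next key=value line or the end of the string.
# Returns undef if my_variable is not present.
sub extract_my_variable {
    my ($text) = @_;
    # /s: . matches "\n"    /m: ^ matches at the start of every line
    # (?=...) checks what follows without making it part of the match
    my ($value) = $text =~ /^my_variable=(.*?)(?=^\w+=|\z)/sm;
    return $value;
}

# Assumed sample inputs: one with a key=value line after the block, one without.
my $with_trailer    = "unwanted_line1=blabla\nmy_variable=section1\nsection2\nunwanted_line3=blabla\n";
my $without_trailer = "unwanted_line1=blabla\nmy_variable=section1\nsection2\n";

print extract_my_variable($with_trailer);
print extract_my_variable($without_trailer);
```

The `\z` alternative is what lets the second input match even though nothing follows the wanted block, the case an Anonymous Monk reports as failing for the simpler pattern.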
Re: Multiline regex (Updated!) by haukex (Archbishop) on Jun 22, 2016 at 22:26 UTC
Hi adrya407, Personally I like to implement this kind of thing using a state machine type approach. Although it certainly takes more lines of code than a single regex, it doesn't require you to read the entire file into memory, and personally I find the conditions (especially complex ones) are more easily expressed in Perl conditionals than in regexes, and because of that I think it's more easily extensible - it looks like you've got some variant of INI file there, so I hope it's not too wild a thought that you may need to get more than just "my_variable" from the file in the future. Or maybe you later find you need to add support for skipping comment lines, etc. Anyway, this is just One Way To Do It. In this example I'm using the definedness of `$myvar` to keep state, in a more complex situation I'd use a separate state variable. ~~The repeated code (printing `$myvar`) could be refactored into an (anonymous) `sub`.~~ Update: The previous version of the code didn't do anything when it encountered a "[...]" line, so "my_variable" would continue to accumulate afterwards. I've updated the code to now cause "[...]" to end a "my_variable" definition and also refactored the code that handles a completed `$myvar` into an anonymous sub. 
`use warnings;
use strict;
my $myvar;
my $take = sub {
    return unless defined $myvar;
    chomp($myvar);
    print "<<$myvar>>\n";
    undef $myvar;
};
while (<DATA>) {
    if (my ($k,$v) = /^(\w+)=(.*)$/s) {
        $take->();
        $myvar = $v if $k eq 'my_variable';
    }
    elsif (/^\[.+\]$/) { $take->(); }
    else { $myvar .= $_ if defined $myvar; }
}
$take->();
__DATA__
unwanted_line1=blabla
unwanted_line2=blabla
my_variable=important_content_section1
important_content_section2
important_content_section3
unwanted_line3=blabla
unwanted_line4=blabla
unwanted_line5=blabla
my_variable=important_content_section4
important_content_section5
important_content_section6
[stepxyz#xxxx]
unwanted_content1
unwanted_line6=blabla
my_variable=important_content_section7
unwanted_line7=blabla
my_variable=important_content_section8
my_variable=important_content_section9
unwanted_line8=blabla`
Output:
`<<important_content_section1
important_content_section2
important_content_section3>>
<<important_content_section4
important_content_section5
important_content_section6>>
<<important_content_section7>>
<<important_content_section8>>
<<important_content_section9>>`
Hope this helps,
-- Hauke D
Re: Multiline regex by Anonymous Monk on Jun 22, 2016 at 12:42 UTC
`#!/usr/bin/perl
# http://perlmonks.org/?node_id=1166246
use strict;
use warnings;
my $data = <<END;
unwanted_line1=blabla
unwanted_line2=blabla
my_variable=important_content_section1
important_content_section2
important_content_section3
unwanted_line3=blabla
unwanted_line4=blabla
END
my ($getVariable) = $data =~ /(my_variable=.*\n(?:[^=\n]*\n)*)/;
print $getVariable;`
Re^2: Multiline regex by adrya407 (Novice) on Jun 22, 2016 at 12:46 UTC
This works, you saved me, thank you kindly!!!
Re^3: Multiline regex by adrya407 (Novice) on Jun 22, 2016 at 14:18 UTC
How can I add another skipping option? If I don't have an `unwanted_line=blaba` I might have `[stepxyz#...]`. How can I put an "or" in that regex to check for `[step`?
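One possible way to do this, offered only as a sketch: keep the regex from the parent reply and put a negative look-ahead in front of each continuation line, so the match also stops at a line beginning with `[step`. The sample data and the variable name `$data` are assumptions, and it presumes that wanted continuation lines never start with `[step` themselves.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Assumed sample data: the wanted block is followed by a [step...] line
# instead of a key=value line.
my $data = <<'END';
unwanted_line1=blabla
my_variable=important_content_section1
important_content_section2
[stepxyz#xxxx]
unwanted_content1
END

# A continuation line must not contain '=' (as in the parent reply) and,
# in addition, must not begin with "[step": (?!\[step) rejects such lines.
my ($getVariable) = $data =~ /(my_variable=.*\n(?:(?!\[step)[^=\n]*\n)*)/;
print $getVariable;
```

Without the `(?!\[step)` part, both `[stepxyz#xxxx]` and `unwanted_content1` would be swallowed, since neither contains an `=`.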
Re^4: Multiline regex by Anonymous Monk on Jun 22, 2016 at 15:28 UTC
Re: Multiline regex by kcott (Archbishop) on Jun 26, 2016 at 02:32 UTC
G'day adrya407,
Whenever you find yourself in this position, needing to access multi-line blocks of data delimited by known or calculable boundaries, consider using one of the Range Operators. Here's an example which successfully handles all three of your posted formats:
`#!/usr/bin/env perl
use strict;
use warnings;
my $start_re = qr{(?x: ^ my_variable = )};
my $end_re = qr{(?x:
    (?<! ^ my_variable ) =
    |
    ^ \[ [^\]]* \] $
    |
    ^ [*]{3} \s+ Block \s+ \d+
)};
while (<DATA>) {
    print if /$start_re/ .. /$end_re/ && next;
}`
`__DATA__`, and the same test data posted in the OP, are in the spoiler.
Output:
`my_variable=important_content_section1
important_content_section2
important_content_section3
my_variable=important_content_section1
important_content_section2
important_content_section3
my_variable=important_content_section1
important_content_section2
important_content_section3`
I think that also covers most of the points from your update. See also "perlre: Extended Patterns", which has details of the regex constructs I've used, such as `(?x: pattern )` and `(?<! pattern )`, and many more like them.
-- Ken