adrya407 has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to create a regex to extract the variable "my_variable" from a text format like this stored in $data:
    unwanted_line1=blabla
    unwanted_line2=blabla
    my_variable=important_content_section1
    important_content_section2
    important_content_section3
    unwanted_line3=blabla
or
    unwanted_line1=blabla
    unwanted_line2=blabla
    my_variable=important_content_section1
    important_content_section2
    important_content_section3
    [stepxyz#xxxx]
I tried the following regex:
my ($getVariable) = $data =~ /(my_variable=.*\n(.+[^=]\n?)*)/;
important_content_section1 is never empty, which is why I used my_variable=.*\n, but it gets what I want only if the data looks like this:
    unwanted_line1=blabla
    unwanted_line2=blabla
    my_variable=important_content_section1
    important_content_section2
    important_content_section3
    unwanted_line3=
What I want to get in $getVariable is:
    my_variable=important_content_section1
    important_content_section2
    important_content_section3
Update: Based on Anonymous Monk's response, I got a solution for the first data format: my ($getVariable) = $data =~ /(my_variable=.*\n(?:[^=\n]*\n)*)/; How can I update the regex so that it also stops at lines that match [step?

Replies are listed 'Best First'.
Re: Multiline regex
by hippo (Archbishop) on Jun 22, 2016 at 12:27 UTC

    This looks to work for me on your sample data:

    #!/usr/bin/env perl
    use strict;
    use warnings;

    my $data = 'unwanted_line1=blabla
    unwanted_line2=blabla
    my_variable=important_content_section1
    important_content_section2
    important_content_section3
    unwanted_line3=blabla
    ';

    my ($getVariable) = $data =~ /(my_variable=.*)^\w+=/sm;
    print $getVariable;

    Update: See Eily's first reply below. If unwanted_line3 is not the last unwanted line in the set after all, then you'll need a non-greedy grab on the capture group like so:

    #!/usr/bin/env perl
    use strict;
    use warnings;

    my $data = 'unwanted_line1=blabla
    unwanted_line2=blabla
    my_variable=important_content_section1
    important_content_section2
    important_content_section3
    unwanted_line3=blabla
    nonsense
    unwanted_line4=blabla
    whocansay
    ';

    my ($getVariable) = $data =~ /(my_variable=.*?)^\w+=/sm;
    print $getVariable;

      This works in this case because the next '=' after the required variable also happens to be the last. Using .*? instead of .* will make sure Perl stops at the first '=' after the match on "my_variable". It's simpler than my solution with a lookahead assertion though, guess I overdid it :)

      For confidentiality reasons I can't post the data I'm working on. It works almost perfectly, except that it still captures the unwanted_line3=text line (but not the lines after it), so it just needs a slight adjustment. Thank you kindly, sir!

        Did you add in the question mark as Eily suggested? If you can't provide an equivalent sample dataset which shows the problem it's going to be very difficult to assist you further, unfortunately.

      Fails for:

      my $data = 'unwanted_line1=blabla
      unwanted_line2=blabla
      my_variable=important_content_section1
      important_content_section2
      important_content_section3
      unwanted_line3=blabla
      unwanted_line4=blabla
      ';

      Gets extra unwanted lines.

      Fails if no lines after wanted data, like so:

      my $data = <<END;
      unwanted_line1=blabla
      unwanted_line2=blabla
      my_variable=important_content_section1
      important_content_section2
      important_content_section3
      END

      Fails to match.

Re: Multiline regex
by Eily (Monsignor) on Jun 22, 2016 at 12:29 UTC

    First, for multiline matching you can use the /s modifier to make . match "\n". The second tool you may want to use is a lookahead assertion, which allows you to check that a match is followed by something, without including that something in the match. The logic then becomes "match the shortest string possible, but it has to be directly followed by the next variable". Something like: /my_variable=(.*?)(?=unwanted_variable|\z)/s, where the ? after the * means "as little as possible".
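The lookahead idea can be sketched against the sample data from the question. This is only an illustration under assumptions: extract_block is a hypothetical helper name, and the alternatives in the lookahead (a key=value line, a [step line, or end of data) are guesses at the boundaries the OP described, not a pattern taken from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the my_variable block, or undef if absent.
sub extract_block {
    my ($data) = @_;
    my ($block) = $data =~ /
        ^(my_variable=.*?)   # shortest run that starts at my_variable=
        (?= ^\w+=            # followed by the next key=value line,
          | ^\[step          # or by a [step line,
          | \z )             # or by the end of the data
    /xsm;
    return $block;
}

my $data = <<'END';
unwanted_line1=blabla
unwanted_line2=blabla
my_variable=important_content_section1
important_content_section2
important_content_section3
[stepxyz#xxxx]
END

print extract_block($data);
```

Here /s lets . cross newlines, /m lets ^ match at the start of every line, and /x only allows the pattern to be laid out with comments. Because the boundary sits inside a lookahead, it is checked but never included in the capture.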

Re: Multiline regex (Updated!)
by haukex (Archbishop) on Jun 22, 2016 at 22:26 UTC

    Hi adrya407,

    Personally I like to implement this kind of thing using a state machine type approach. Although it certainly takes more lines of code than a single regex, it doesn't require you to read the entire file into memory, and personally I find the conditions (especially complex ones) are more easily expressed in Perl conditionals than in regexes, and because of that I think it's more easily extensible - it looks like you've got some variant of INI file there, so I hope it's not too wild a thought that you may need to get more than just "my_variable" from the file in the future. Or maybe you later find you need to add support for skipping comment lines, etc. Anyway, this is just One Way To Do It. In this example I'm using the definedness of $myvar to keep state, in a more complex situation I'd use a separate state variable. The repeated code (printing $myvar) could be refactored into an (anonymous) sub.

    Update: The previous version of the code didn't do anything when it encountered a "[...]" line, so "my_variable" would continue to accumulate afterwards. I've updated the code to now cause "[...]" to end a "my_variable" definition and also refactored the code that handles a completed $myvar into an anonymous sub.

    use warnings;
    use strict;

    my $myvar;
    my $take = sub {
        return unless defined $myvar;
        chomp($myvar);
        print "<<$myvar>>\n";
        undef $myvar;
    };

    while (<DATA>) {
        if (my ($k,$v) = /^(\w+)=(.*)$/s) {
            $take->();
            $myvar = $v if $k eq 'my_variable';
        }
        elsif (/^\[.+\]$/) {
            $take->();
        }
        else {
            $myvar .= $_ if defined $myvar;
        }
    }
    $take->();

    __DATA__
    unwanted_line1=blabla
    unwanted_line2=blabla
    my_variable=important_content_section1
    important_content_section2
    important_content_section3
    unwanted_line3=blabla
    unwanted_line4=blabla
    unwanted_line5=blabla
    my_variable=important_content_section4
    important_content_section5
    important_content_section6
    [stepxyz#xxxx]
    unwanted_content1
    unwanted_line6=blabla
    my_variable=important_content_section7
    unwanted_line7=blabla
    my_variable=important_content_section8
    my_variable=important_content_section9
    unwanted_line8=blabla

    Output:

    <<important_content_section1
    important_content_section2
    important_content_section3>>
    <<important_content_section4
    important_content_section5
    important_content_section6>>
    <<important_content_section7>>
    <<important_content_section8>>
    <<important_content_section9>>

    Hope this helps,
    -- Hauke D
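The remark above about needing more than just "my_variable" later could look roughly like the following. It is a sketch only: parse_blocks, %values and $current are hypothetical names, and it assumes the same line shapes as the code above (key=value lines, "[...]" lines, and continuation lines).

```perl
use strict;
use warnings;

# Hypothetical helper: collect every key's values as { key => [value, ...] }.
sub parse_blocks {
    my (@lines) = @_;
    my %values;     # key => list of accumulated values
    my $current;    # key whose latest value is still being built, or undef
    for my $line (@lines) {
        if (my ($k, $v) = $line =~ /^(\w+)=(.*)$/s) {
            push @{ $values{$k} }, $v;       # a new definition starts
            $current = $k;
        }
        elsif ($line =~ /^\[.+\]$/) {
            undef $current;                  # "[...]" ends the definition
        }
        elsif (defined $current) {
            $values{$current}[-1] .= $line;  # continuation line
        }
    }
    chomp @$_ for values %values;            # drop each trailing newline
    return \%values;
}

my $data = <<'END';
unwanted_line1=blabla
my_variable=important_content_section1
important_content_section2
[stepxyz#xxxx]
unwanted_content1
my_variable=important_content_section7
END

my $parsed = parse_blocks(split /^/, $data);
print "<<$_>>\n" for @{ $parsed->{my_variable} || [] };
```

The separate $current variable plays the role of the state variable mentioned above: it records which key, if any, is currently accumulating continuation lines.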

Re: Multiline regex
by Anonymous Monk on Jun 22, 2016 at 12:42 UTC
    #!/usr/bin/perl
    # http://perlmonks.org/?node_id=1166246
    use strict;
    use warnings;

    my $data = <<END;
    unwanted_line1=blabla
    unwanted_line2=blabla
    my_variable=important_content_section1
    important_content_section2
    important_content_section3
    unwanted_line3=blabla
    unwanted_line4=blabla
    END

    my ($getVariable) = $data =~ /(my_variable=.*\n(?:[^=\n]*\n)*)/;
    print $getVariable;
      This works, you saved me, thank you kindly!!!
        How can I add another skipping option? If I don't have an unwanted_line=blabla I might have [stepxyz#...]. How can I put an "or" in that regex to check for [step?
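One possible way to add that check, sketched here as an assumption rather than as a reply from the thread, is a negative lookahead in front of each continuation line, so that a line beginning with "[step" can never be consumed. The sample data is modeled on the formats in the question.

```perl
use strict;
use warnings;

my $data = <<'END';
unwanted_line1=blabla
unwanted_line2=blabla
my_variable=important_content_section1
important_content_section2
important_content_section3
[stepxyz#xxxx]
unwanted_line3=blabla
END

# Same shape as the regex above; (?!\[step) rejects any continuation
# line that begins with "[step", so the match ends just before it.
my ($getVariable) = $data =~ /(my_variable=.*\n(?:(?!\[step)[^=\n]*\n)*)/;
print $getVariable;
```

A line containing "=" still ends the match exactly as before, so both data formats from the question stop at the right place.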
Re: Multiline regex
by kcott (Archbishop) on Jun 26, 2016 at 02:32 UTC

    G'day adrya407,

    Whenever you find yourself in this position — needing to access multi-line blocks of data delimited by known or calculable boundaries — consider using one of the Range Operators. Here's an example which successfully handles all three of your posted formats:

    #!/usr/bin/env perl
    use strict;
    use warnings;

    my $start_re = qr{(?x: ^ my_variable = )};
    my $end_re = qr{(?x:
        (?<! ^ my_variable ) =
        |
        ^ \[ [^\]]* \] $
        |
        ^ [*]{3} \s+ Block \s+ \d+
    )};

    while (<DATA>) {
        print if /$start_re/ .. /$end_re/ && next;
    }

    __DATA__, and the same test data posted in the OP, are in the spoiler:

    Output:

    my_variable=important_content_section1
    important_content_section2
    important_content_section3
    my_variable=important_content_section1
    important_content_section2
    important_content_section3
    my_variable=important_content_section1
    important_content_section2
    important_content_section3

    I think that also covers most of the points from your update.

    See also "perlre: Extended Patterns", which has details of the regex constructs I've used — such as (?x: pattern ) and (?<! pattern ) — and many more like them.

    — Ken
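A variant of the range-operator approach, offered only as a sketch and not as Ken's code: in scalar context ".." returns a sequence number while the range is open, with "E0" appended on the line that closes it, so that return value can be used to drop the closing line. The patterns and sample data below are assumptions modeled on the data posted in this thread.

```perl
use strict;
use warnings;

my $start_re = qr/^my_variable=/;
# Assumed end condition: any other key=value line, or a "[...]" line.
my $end_re   = qr/^(?!my_variable=)\w+=|^\[/;

my $data = <<'END';
unwanted_line1=blabla
my_variable=important_content_section1
important_content_section2
important_content_section3
[stepxyz#xxxx]
unwanted_content1
unwanted_line6=blabla
my_variable=important_content_section7
unwanted_line7=blabla
END

my @kept;
for my $line (split /^/, $data) {
    # Flip-flop: empty string outside the range, 1, 2, 3... inside it,
    # and a value ending in "E0" on the line that matches $end_re.
    my $seq = ($line =~ $start_re) .. ($line =~ $end_re);
    push @kept, $line if $seq && $seq !~ /E0$/;
}
print @kept;
```

The lookahead in $end_re keeps a my_variable= line from closing the range it has just opened, because ".." tests the right-hand operand on the same line that made the left-hand operand true.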