how to write multi-line regex

herman4016 has asked for the wisdom of the Perl Monks concerning the following question:

Re: how to write multi-line regex by kcott (Archbishop) on Mar 24, 2014 at 05:27 UTC
G'day herman4016,

You're reading the input line by line; accordingly, your regex is only trying to match against the single line in `$_` for any given iteration of the `while` loop. What you need to do is read the input in records. You can do this by locally setting `$/`, the input record separator (see "perlvar: Variables related to filehandles").

You didn't post your data in `<code>...</code>` tags (in fact, you appear to have "`<p>...</p>`" tags with embedded "`<br />`" tags and what looks like extraneous whitespace), so I'm having to guess what it really looks like. In the script below, I've assumed paragraph mode (records are separated by one or more blank lines) but maybe "`ENDEL;`" might be a better end-of-record indicator.

You have an additional problem in your posted regex with greedy matches (e.g. "`.*`"). If you don't understand that, see "Quantifiers" under "perlre: Regular Expressions".

The following script shows the technique you'll need. You may need to adjust this based on my various comments above.

    #!/usr/bin/env perl -l

    use strict;
    use warnings;

    my $re = qr{TEXT;\s+LAYER 13[1-7];\s+TEXTTYPE 0;.*?STRING ([^;]+)}s;

    {
        local $/ = "";

        while (<DATA>) {
            print $1 if /$re/;
        }
    }

    __DATA__
    ...
    ...

    TEXT;
    LAYER 133;
    TEXTTYPE 0;
    PRESENTATION 0,2,0;
    STRANS 0,0,0;
    XY 1;
    X: 91410; Y: 50020;
    STRING AVDD12;
    ENDEL;

    BOUNDARY;
    LAYER 108;
    DATATYPE 0;
    XY 5;
    X: 0; Y: 0;
    X: 0; Y: 53530;
    X: 91410; Y: 53530;
    X: 91410; Y: 0;
    X: 0; Y: 0;
    ENDEL;

    ...
    ...

Output:

    AVDD12

-- Ken
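To make the greedy-match remark above concrete, here is a minimal sketch (not taken from any post in this thread). The two-element sample string and the helper name `first_string` are made up for illustration; the point is only that a greedy `.*` runs to the last `STRING` in the text being searched, while the non-greedy `.*?` stops at the first one.

```perl
use strict;
use warnings;

# Hypothetical sample: two TEXT elements that ended up in the same record.
my $text = "TEXT;\nLAYER 133;\nSTRING AVDD12;\nENDEL;\n"
         . "TEXT;\nLAYER 134;\nSTRING VSS;\nENDEL;\n";

# Return the captured STRING value, using either a greedy or a
# non-greedy quantifier between the LAYER and STRING parts.
sub first_string {
    my ($input, $greedy) = @_;
    my $re = $greedy
        ? qr/LAYER 13[1-7];.*STRING ([^;]+)/s     # greedy: backtracks from the end
        : qr/LAYER 13[1-7];.*?STRING ([^;]+)/s;   # non-greedy: expands from the start
    return $input =~ $re ? $1 : undef;
}

print "greedy:     ", first_string($text, 1), "\n";   # VSS
print "non-greedy: ", first_string($text, 0), "\n";   # AVDD12
```

Reading one element per record, as described above, makes the difference matter less, but the non-greedy form stays correct even when records are larger than expected.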
Re^2: how to write multi-line regex by herman4016 (Acolyte) on Mar 24, 2014 at 09:08 UTC
Thanks, Ken and everyone, for your kind replies. The following is based on Ken's code:

    #!/usr/bin/perl
    use strict;
    use diagnostics;
    use 5.010;

    # Three-argument open; report the actual reason if it fails
    open my $fh, '<', $ARGV[0] or die "Cannot open '$ARGV[0]': $!\n";

    my $re = qr{TEXT;\s+LAYER 13[1-7];\s+TEXTTYPE 0;.*?STRING ([^;:]+)}s;

    {
        local $/ = "";    # paragraph mode

        while (<$fh>) {
            say $1 if /$re/;
        }
    }
Re: how to write multi-line regex by BrowserUk (Patriarch) on Mar 24, 2014 at 04:50 UTC
Try this:

    #! perl -slw
    use strict;

    $/ = ''; # Paragraph mode

    while (<DATA>) {
        if( /TEXT;\s+?LAYER 13[1-7];.+?STRING\s+(\S+)/s ) {
            print $1;
        }
        else{
            print "not found!\n";
        }
    }
    __DATA__
    TEXT;
    LAYER 133;
    TEXTTYPE 0;
    PRESENTATION 0,2,0;
    STRANS 0,0,0;
    XY 1;
    X: 91410; Y: 50020;
    STRING AVDD12;
    ENDEL;

    BOUNDARY;
    LAYER 108;
    DATATYPE 0;
    XY 5;
    X: 0; Y: 0;
    X: 0; Y: 53530;
    X: 91410; Y: 53530;
    X: 91410; Y: 0;
    X: 0; Y: 0;

Output:

    C:\test>junk.pl
    AVDD12;
    not found!
Re: how to write multi-line regex by shmem (Chancellor) on Mar 24, 2014 at 07:48 UTC
Your input looks like a perfect candidate for "paragraph mode" using the switches `-n` and `-00` (see perlrun):

    #!/usr/bin/perl -n00
    /^TEXT;\nLAYER 13[1-7];\nTEXTTYPE 0;.+STRING (.+?);/s and print $1,"\n";

The `/s` modifier is necessary, since without it the "." character would not match a newline.
Re^2: how to write multi-line regex by kcott (Archbishop) on Mar 25, 2014 at 04:15 UTC
As well as the '`s`' modifier which, as you correctly state, is needed for '`.`' to match a newline, you'll also need the '`m`' modifier for '`^`' to match the start of each line in a multi-line string (which paragraph mode [`-00`] will give you). Without this modifier, '`^`' will match only once at the start of the string. ("perlre: Modifiers" and "perlre: Metacharacters" have details of both of those.)

The following two points are really more a comment on the way the OP posted the sample data than on your solution.

The start of the regex (`/^TEXT ...`) assumes `TEXT` starts at the beginning of a line. While I agree that is likely to be the case for the real data, the HTML source for the posted data suggests otherwise: `<p> ... <br /> ... <br /> TEXT; <br /> LAYER 133;`

The two '`\n`'s in the regex suffer from a similar problem. While it's likely that the real data has lines that are only separated by a single newline, the posted data has additional whitespace before and after various lines.

-- Ken
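The effect of the '`m`' modifier on '`^`' described above can be checked with a small sketch. The record content and the helper name `anchored_matches` are assumptions made up for this illustration: here `TEXT;` deliberately sits on the second line of the record, so it is not at the start of the string.

```perl
use strict;
use warnings;

# Hypothetical multi-line record in which TEXT; is not on the first line.
my $record = "BOUNDARY;\nTEXT;\nLAYER 133;\n";

# Return two flags: whether /^TEXT;/ matches without /m, and with /m.
sub anchored_matches {
    my ($text) = @_;
    my $without_m = $text =~ /^TEXT;/  ? 1 : 0;   # ^ anchors at string start only
    my $with_m    = $text =~ /^TEXT;/m ? 1 : 0;   # ^ also anchors after each newline
    return ($without_m, $with_m);
}

my ($plain, $multi) = anchored_matches($record);
print "without /m: $plain, with /m: $multi\n";   # without /m: 0, with /m: 1
```

If `TEXT;` always begins its record, both forms match and the modifier makes no difference; it only matters when other lines precede it in the same record.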
Re: how to write multi-line regex by NetWallah (Canon) on Mar 24, 2014 at 04:44 UTC
If you are sure blocks of logical "records" won't get mixed up, you can try this regex: `/TEXT;\p{Space}*LAYER 13[1-7];\s+TEXTTYPE\p{Space}0;.+STRING\s(\S+\w+).+?ENDEL;/s`

To get proper logical blocks, you probably need something like: `local $/="ENDEL;\n";`

But if you use this, you will need to drop the trailing `ENDEL;` in the regex.
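A minimal sketch of the `$/ = "ENDEL;\n"` idea follows. It is an illustration under stated assumptions, not NetWallah's code: the sample data is invented along the lines of the thread's example, the helper name `strings_by_endel` is hypothetical, the regex is adapted from the ones posted earlier in the thread, and an in-memory filehandle stands in for the real file. It also assumes every `ENDEL;` is followed directly by a newline.

```perl
use strict;
use warnings;

# Invented sample data: three elements, each terminated by "ENDEL;\n".
my $sample = join "\n",
    'TEXT;', 'LAYER 133;', 'TEXTTYPE 0;', 'XY 1;', 'STRING AVDD12;', 'ENDEL;',
    'BOUNDARY;', 'LAYER 108;', 'DATATYPE 0;', 'ENDEL;',
    'TEXT;', 'LAYER 135;', 'TEXTTYPE 0;', 'STRING VSS;', 'ENDEL;', '';

# Read the data one ENDEL-terminated record at a time and collect the
# STRING value of every TEXT element on layers 131-137 with TEXTTYPE 0.
sub strings_by_endel {
    my ($data) = @_;
    open my $fh, '<', \$data or die "Cannot open in-memory handle: $!";
    local $/ = "ENDEL;\n";    # each read now returns one whole element
    my @found;
    while (my $record = <$fh>) {
        push @found, $1
            if $record =~ /TEXT;\s+LAYER 13[1-7];\s+TEXTTYPE 0;.*?STRING\s+([^;]+);/s;
    }
    close $fh;
    return @found;
}

print "$_\n" for strings_by_endel($sample);   # prints AVDD12, then VSS
```

Because each record holds exactly one element, the regex cannot run from one `TEXT;` block into the next, which is the mixing-up risk mentioned above.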
Re: how to write multi-line regex by locked_user sundialsvc4 (Abbot) on Mar 24, 2014 at 16:26 UTC
Even if you could do it this way, I frankly wouldn’t. Instead, I would process this file a line at a time, “`awk`-style,” using logic that gathers information from each line as presented (or ignores the line, as the case may be), then does something with the accumulated information when an appropriate sentinel line (e.g. `ENDEL;`, an empty line, or end-of-file) is encountered.

The difficulty with a “clever multi-line regex” approach is not so much getting it to at least appear to work in a handful of test cases, but rather that it is likely to be well-nigh impossible to prove that the algorithm actually works for every well-formed file that is presented to it, let alone that it will correctly reject any file that is not well-formed. The next near-impossibility will be to maintain the thing over time, continually adapting it to meet evolving conditions and/or to deal with bugs in the (third-party supplied) data feed that the aforesaid third party just won’t ever get around to fixing. It happens. A lot.

The line-by-line approach, on the other hand, works well. Some line will mark the beginning of a potentially useful set of information, while another line (and/or end-of-file) will mark the end. In between these two lines are: (a) lines that contain more useful things; (b) lines that you recognize but choose to ignore; and (c) lines that you do not recognize, meaning either that your program is now insufficient or that the data vendor has once again screwed up. Robust, `awk`-style logic can be built in this way, and, if built well, it will last for years. Therefore, it’s my strong opinion that this is the approach that you ought to take here.
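The line-by-line approach described above might look roughly like the following sketch. Everything specific in it is an assumption made for illustration: the helper name `extract_strings`, the choice of `TEXT;` as the start sentinel and `ENDEL;` as the end sentinel, one field per line, and the selection criteria (layers 131 to 137, `TEXTTYPE 0`) borrowed from the regexes earlier in the thread.

```perl
use strict;
use warnings;

# Walk the input one line at a time, gather fields between the TEXT; and
# ENDEL; sentinels, and keep the STRING of each element that qualifies.
sub extract_strings {
    my ($fh) = @_;
    my @found;
    my %rec;            # fields gathered for the current TEXT element
    my $in_text = 0;    # true while inside a TEXT element

    while (my $line = <$fh>) {
        $line =~ s/^\s+|\s+$//g;    # tolerate stray leading/trailing whitespace
        next if $line eq '';

        if ($line eq 'TEXT;') {             # start sentinel
            %rec     = ();
            $in_text = 1;
        }
        elsif ($line eq 'ENDEL;') {         # end sentinel: act on what was gathered
            if (   $in_text
                && defined $rec{STRING}
                && ($rec{LAYER}    // '') =~ /^13[1-7]$/
                && ($rec{TEXTTYPE} // '') eq '0')
            {
                push @found, $rec{STRING};
            }
            $in_text = 0;
        }
        elsif ($in_text && $line =~ /^(LAYER|TEXTTYPE|STRING)\s+([^;]+);$/) {
            $rec{$1} = $2;                  # a line carrying useful information
        }
        # Anything else is ignored here; a stricter version could warn
        # about lines it does not recognize.
    }
    return @found;
}

# Usage with an in-memory filehandle standing in for the real file.
my $sample = "TEXT;\n LAYER 133;\n TEXTTYPE 0;\n XY 1;\n STRING AVDD12;\nENDEL;\n"
           . "BOUNDARY;\nLAYER 108;\nDATATYPE 0;\nENDEL;\n";
open my $fh, '<', \$sample or die "Cannot open in-memory handle: $!";
print "$_\n" for extract_strings($fh);   # prints AVDD12
close $fh;
```

Each condition is a separate, testable statement, which is the maintainability argument made above; the cost is more code than a single regex.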