Regex Question

set_uk has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Regex Question by tos (Deacon) on Jun 03, 2003 at 11:26 UTC
Hi, could this be what you want? `# cat re1 use warnings; use strict; undef $/; my $tout = <DATA>; my $z; while ($tout =~ /((?:DES|TN).+?(DATE|ZONE)[^\n]+\n\s*\n)/sg) { print "\n\n", ++$z, ":\n", $1; } __DATA__ DES MAIL TN 001 0 02 00 TYPE SL1 CDEN DD CUST 0 KLS 1 FDN TGAR 0 LDN NO NCOS 0 09 DATE 9 MAR 2000 TN 001 0 02 01 05 RLS 06 TRN 07 AO3 08 09 ZONE 002 TN 001 0 02 01 05 RLS ZONE 001 07 AO3 08 09 DATE 9 MAR 2000` Output: `# perl -w ./re1 1: DES MAIL TN 001 0 02 00 TYPE SL1 CDEN DD CUST 0 KLS 1 FDN TGAR 0 LDN NO NCOS 0 09 DATE 9 MAR 2000 2: TN 001 0 02 01 05 RLS 06 TRN 07 AO3 08 09 ZONE 002 3: TN 001 0 02 01 05 RLS ZONE 001 07 AO3 08 09 DATE 9 MAR 2000`
Re: Re: Regex Question by set_uk (Pilgrim) on Jun 03, 2003 at 12:14 UTC
That is exactly what I wanted. Thanks. I'm not sure I understand why it works, though. Why does `[^\n]` match all the data after DATE and ZONE to the end of the line? Doesn't it just mean "match a beginning of line and a newline"?
Re: Re: Re: Regex Question by tos (Deacon) on Jun 03, 2003 at 12:42 UTC
Hi, glad that I could help you. The `[^\n]` is a negated character class. Here it represents all characters that aren't newlines. The newline characters are still in your string, even though `$/` was undefined. The regex `[^\n]+\n` matches one or more non-newlines followed by a newline. After this there can be zero or more whitespace characters (`\s*`) followed by another newline. This is your blank line. Greetings, tos
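The explanation above can be checked with a small sketch. The two-record sample string and the variable names (`$text`, `@tails`) are made up for illustration and are not the poster's real data; only the tail of tos's pattern is exercised here.

```perl
use strict;
use warnings;

# Hypothetical sample: two records, each followed by a blank line.
my $text = "TN 001\nZONE 002\n\nTN 003\nDATE 9 MAR 2000\n\n";

# [^\n]+ : one or more characters that are not newlines (rest of the line)
# \n     : the newline that ends that line
# \s*\n  : optional whitespace, then the newline of the blank line
my @tails = $text =~ /((?:DATE|ZONE)[^\n]+\n\s*\n)/g;

# Show each match with its newlines made visible.
for my $tail (@tails) {
    ( my $visible = $tail ) =~ s/\n/\\n/g;
    print "matched: $visible\n";
}
```

Inside the brackets the `^` negates the class; it only means "start of line" when it appears outside a character class.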
Re: Regex Question by BrowserUk (Patriarch) on Jun 03, 2003 at 09:24 UTC
The first thing I noticed was that you are undefining $/. Apart from the fact that just localising it is enough to set it to undef, that is usually done when you are slurping the entire file. The second thing I noticed was that you are using /smx. If I am interpreting your snippet correctly, you are slurping the whole file, asking that . match \n (/s), asking that ^ and $ match either side of \n (/m), and then hoping that `/(^DES|^TN).?/` will match just a single line, with the correct first 2 or 3 letters. It won't. It will match starting at the first newline followed by DES or TN, but will go on to match the rest of the entire file, as there is nothing to stop .? matching. By the time you get to your end criteria, there is nothing left for it to match against. A lot of assumptions based on little evidence, but it does fit what I see :)
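A minimal sketch of what the /s and /m modifiers change, assuming a made-up three-line sample (`$text`) rather than the poster's file. The patterns are deliberately simple and are not the regex from the original question.

```perl
use strict;
use warnings;

# Hypothetical sample: one record spread over three lines.
my $text = "DES MAIL\nTN 001\nDATE 9 MAR 2000\n";

# Without /s, "." stops at a newline; with /s it also matches "\n",
# so a greedy ".*" runs on to the end of the string.
my ($no_s)   = $text =~ /(DES.*)/;
my ($with_s) = $text =~ /(DES.*)/s;

# Without /m, "^" anchors only at the start of the string;
# with /m it also matches just after each newline.
my $tn_no_m   = () = $text =~ /^TN/g;
my $tn_with_m = () = $text =~ /^TN/mg;

printf "without /s: %d chars, with /s: %d chars\n",
    length $no_s, length $with_s;
print "^TN without /m: $tn_no_m, with /m: $tn_with_m\n";
```

On slurped input, an unanchored `.` under /s is what lets a match spill across record boundaries, which is the effect described above.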
Re: Re: Regex Question by set_uk (Pilgrim) on Jun 03, 2003 at 10:45 UTC
I now have: `((?:^DES|^TN).?(?:(?:DATE|ZONE)[ A-Z0-9])(?=$^$)?)` But given the data I posted earlier, it still matches ZONE even when it is not followed by a blank line. I am trying to use (?=$^$)? to say "only match the previous pattern if it is followed by the first blank line". But given the data `TN 001 0 02 01 05 RLS ZONE 001 07 AO3 08 09 DATE 9 MAR 2000` it only matches up to ZONE and not DATE.
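One possible way to express "only if followed by a blank line" is a lookahead that is not made optional. This is only a sketch on a made-up sample (`$text`, `@ends` are illustrative names), not a fix for the exact regex above. A lookahead followed by `?` is allowed to match zero times, so it cannot reject anything.

```perl
use strict;
use warnings;

# Hypothetical sample: in the first record ZONE is NOT followed by a
# blank line; DATE and the second ZONE are.
my $text = "TN 001\nZONE 001\nAO3\nDATE 9 MAR 2000\n\nTN 002\nZONE 002\n\n";

# (?=\n[^\S\n]*\n) : the line must be followed by a newline, optional
#                    spaces or tabs, and another newline (a blank line).
# [^\S\n] means "whitespace other than a newline".
my @ends = $text =~ /^((?:DATE|ZONE)[^\n]*)(?=\n[^\S\n]*\n)/mg;

print "record ends with: $_\n" for @ends;
```

Because the lookahead is mandatory here, the `ZONE 001` line in the middle of the first record is skipped.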
Re: Re: Re: Regex Question by BrowserUk (Patriarch) on Jun 03, 2003 at 11:11 UTC
Given the format of your data, you would be much better off using "paragraph mode", i.e. setting $/ to '' rather than undef. Then each read will give you exactly what (I think) you are trying to achieve with your regex. Try this on your data to see what I mean, then see perlvar $INPUT_RECORD_SEPARATOR for the details. `#! perl -slw use strict; local $/ = ''; while( <DATA> ) { print "'$_'"; }` Examine what is said, not who speaks. "Efficiency is intelligent laziness." -David Dunham "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
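A self-contained sketch of paragraph mode, assuming abbreviated sample records held in a string (`$data` is made up for illustration). It reads from an in-memory file handle so that it runs without a `__DATA__` section.

```perl
use strict;
use warnings;

# Hypothetical sample: three records separated by blank lines.
my $data = "DES MAIL\nTN 001 0 02 00\nDATE 9 MAR 2000\n\n"
         . "TN 001 0 02 01\nZONE 002\n\n"
         . "TN 001 0 02 01\nZONE 001\nDATE 9 MAR 2000\n";

# Open the string as a file handle.
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my @records;
{
    local $/ = '';    # paragraph mode; the old value returns at block exit
    while ( my $record = <$fh> ) {
        $record =~ s/\n+\z//;    # drop the trailing newline(s)
        push @records, $record;
    }
}
close $fh;

my $n = 0;
print ++$n, ":\n$_\n\n" for @records;
```

Each read returns one blank-line-delimited record, so no multi-line regex is needed to find the record boundaries.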
Re: Re: Re: Re: Regex Question by set_uk (Pilgrim) on Jun 03, 2003 at 11:51 UTC
Re: Re: Re: Re: Re: Regex Question by BrowserUk (Patriarch) on Jun 03, 2003 at 12:53 UTC
Re: Regex Question by dws (Chancellor) on Jun 03, 2003 at 09:04 UTC
"Can someone tell me how to start searching for either of 2 words and stop searching when I discover either of the other 2 words followed by a blank line." You're on the right track, though I suspect that the +? might not be doing what you want it to do. It would help to show us a few representative records. Would something like `while ( /^((?:DES|TN).?(?:DATE|ZONE)[ A-Z0-9])$^$/sm ) { # $1 is the matched record, without the blank line }` work?
Re: Re: Regex Question by set_uk (Pilgrim) on Jun 03, 2003 at 09:33 UTC
The data looks as follows: `DES MAIL TN 001 0 02 00 TYPE SL1 CDEN DD CUST 0 KLS 1 FDN TGAR 0 LDN NO NCOS 0 09 DATE 9 MAR 2000 TN 001 0 02 01 05 RLS 06 TRN 07 AO3 08 09 ZONE 002 TN 001 0 02 01 05 RLS ZONE 001 07 AO3 08 09 DATE 9 MAR 2000`