set_uk has asked for the wisdom of the Perl Monks concerning the following question:

I have thunk myself to a standstill trying to work this regex out.

I have a file full of records split across multiple lines. The records start with wordA or wordB and end with either wordC or wordD followed by a blank line.

Can someone tell me how to start searching for either of 2 words and stop searching when I discover either of the other 2 words followed by a blank line?

I should be able to use the range operator, but it's not playing.

Trying

local $/ = undef;
$StartTN = qr/(^DES|^TN).*?/;
$EndTN = qr/((DATE[ A-Z0-9]*)+?(?=^$)|ZONE [ A-Z0-9]*(?=^$))/;
if ( /($StartTN)/smx .. /($EndTN)/smx) { }

Re: Regex Question
by tos (Deacon) on Jun 03, 2003 at 11:26 UTC
    Hi,

    Could this be what you want?

    # cat re1
    use warnings;
    use strict;

    undef $/;
    my $tout = <DATA>;
    my $z;
    while ($tout =~ /((?:DES|TN).+?(DATE|ZONE)[^\n]+\n\s*\n)/sg) {
        print "\n\n", ++$z, ":\n", $1;
    }
    __DATA__
    DES MAIL
    TN 001 0 02 00
    TYPE SL1
    CDEN DD
    CUST 0
    KLS 1
    FDN
    TGAR 0
    LDN NO
    NCOS 0
    09
    DATE 9 MAR 2000

    TN 001 0 02 01
    05 RLS
    06 TRN
    07 AO3
    08
    09
    ZONE 002

    TN 001 0 02 01
    05 RLS
    ZONE 001
    07 AO3
    08
    09
    DATE 9 MAR 2000

    output
    # perl -w ./re1

    1:
    DES MAIL
    TN 001 0 02 00
    TYPE SL1
    CDEN DD
    CUST 0
    KLS 1
    FDN
    TGAR 0
    LDN NO
    NCOS 0
    09
    DATE 9 MAR 2000

    2:
    TN 001 0 02 01
    05 RLS
    06 TRN
    07 AO3
    08
    09
    ZONE 002

    3:
    TN 001 0 02 01
    05 RLS
    ZONE 001
    07 AO3
    08
    09
    DATE 9 MAR 2000
      That is exactly what I wanted. Thanks. I'm not sure I understand why it works, though. Why does [^\n] match all the data after DATE and ZONE to the end of the line? Doesn't it just mean match a beginning of line and a newline?
        Hi,

        Glad that I could help you.

        The [^\n] is a negated character class. Here it represents all characters which aren't newlines. Undefining $/ does not remove the newline characters; they are still in your string.

        The regex [^\n]+\n matches one or more non-newline characters followed by a newline.

        After this there can be zero or more whitespace characters (\s*) followed by another newline. This is your blank line.
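
To see the pieces working together, here is a minimal sketch that runs the same pattern over a small in-memory string. The sample text and the @records array are made up for illustration (the field layout is an assumption), and the second group is written as non-capturing since its capture is not used.

```perl
use strict;
use warnings;

# Hypothetical two-record sample. In the first record ZONE is NOT
# followed by a blank line, so the match has to run on to DATE.
my $text = "TN 001\nZONE 001\n07 AO3\nDATE 9 MAR 2000\n\n"
         . "TN 002\nZONE 002\n\n";

my @records;
while ($text =~ /((?:DES|TN).+?(?:DATE|ZONE)[^\n]+\n\s*\n)/sg) {
    # [^\n]+\n  : the rest of the DATE/ZONE line and its newline
    # \s*\n     : optional whitespace, then the newline of the blank line
    push @records, $1;
}

printf "%d records found\n", scalar @records;
```

The ZONE line in the first record is rejected as an end marker because the character after its newline is "0", not another newline, so the lazy .+? keeps growing until it reaches the DATE line.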

        greetings, tos

Re: Regex Question
by BrowserUk (Patriarch) on Jun 03, 2003 at 09:24 UTC

    The first thing I noticed was that you are undefing $/. Apart from the fact that just localising it is enough to set it to undef, that is usually done when you are slurping the entire file.

    The second thing I noticed was that you are using /smx.

    If I am interpreting your snippet correctly, you are slurping the whole file, asking that . match \n (/s), asking that ^ and $ match either side of \n (/m), and then hoping that

    /(^DES|^TN).*?/ will match just a single line, with the correct first 2 or 3 letters. It won't.

    It will match starting at the first newline followed by DES or TN, but will go on to match the rest of the entire file as there is nothing to stop .*? matching.

    By the time you get to your end criteria, there is nothing left for it to match against.

    A lot of assumptions based on little evidence, but it does fit what I see:)
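
As a small illustration of what the /s and /m modifiers mentioned above change, here is a sketch against a made-up two-line string. The variable names are hypothetical and the sample is not the poster's data.

```perl
use strict;
use warnings;

my $text = "TN 001\nZONE 002\n";

# Without /s, "." will not cross a newline; with /s it will.
my $dot_plain = $text =~ /TN.*ZONE/  ? 1 : 0;
my $dot_s     = $text =~ /TN.*ZONE/s ? 1 : 0;

# Without /m, "^" matches only at the start of the string;
# with /m it also matches just after each embedded newline.
my $caret_plain = $text =~ /^ZONE/  ? 1 : 0;
my $caret_m     = $text =~ /^ZONE/m ? 1 : 0;

print "dot: $dot_plain $dot_s, caret: $caret_plain $caret_m\n";
# prints "dot: 0 1, caret: 0 1"
```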

      I now have:-
      ((?:^DES|^TN).*?(?:(?:DATE|ZONE)[ A-Z0-9]*)(?=$^$)?)
      But given the data I posted earlier it still matches ZONE even when not followed by a blank line.

      Trying to use (?=$^$)? to say only match the previous pattern if followed by the first blank line.

      But given data :-

      TN 001 0 02 01
      05 RLS
      ZONE 001
      07 AO3
      08
      09
      DATE 9 MAR 2000

      Only matches up to ZONE and not DATE.

        Given the format of your data, you would be much better off using "paragraph mode", i.e. setting $/ to '' rather than undef. Then each read will give you exactly what (I think) you are trying to achieve with your regex.

        Try this on your data to see what I mean, then see perlvar, $INPUT_RECORD_SEPARATOR, for the details.

        #! perl -slw
        use strict;

        local $/ = '';
        while( <DATA> ) {
            print "'$_'";
        }
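
A slightly fuller sketch of the same idea, reading from an in-memory filehandle so it can be run as is. The sample data, the @records and @valid names, and the final check on the first and last lines are assumptions added for illustration, not part of the original suggestion.

```perl
use strict;
use warnings;

# Hypothetical sample: four paragraphs, the last one lacking a DATE/ZONE line.
my $data = "DES MAIL\nTN 001\nDATE 9 MAR 2000\n\n"
         . "TN 002\nZONE 002\n\n\n"
         . "TN 003\nZONE 001\nDATE 9 MAR 2000\n\n"
         . "TN 004\nTYPE SL1\n";

open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my @records;
{
    local $/ = '';              # paragraph mode: blank lines separate records
    while (my $record = <$fh>) {
        chomp $record;          # in paragraph mode, chomp strips all trailing newlines
        push @records, $record;
    }
}
close $fh;

# Keep only records that start with DES or TN and whose last line
# starts with DATE or ZONE.
my @valid = grep { /\A(?:DES|TN)\b/ && /^(?:DATE|ZONE)\b[^\n]*\z/m } @records;

printf "%d paragraphs read, %d look like complete records\n",
    scalar @records, scalar @valid;
```

Because each read already returns one blank-line-delimited chunk, the regex only has to check the first and last lines of a record instead of finding the record boundaries itself.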

        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller


Re: Regex Question
by dws (Chancellor) on Jun 03, 2003 at 09:04 UTC
    Can someone tell me how to start searching for either of 2 words and stop searching when I discover either of the other 2 words followed by a blank line.

    You're on the right track, though I suspect that the +? might not be doing what you want it to do. It would help to show us a few representative records.

    Would something like

    while ( /^((?:DES|TN).*?(?:DATE|ZONE)[ A-Z0-9]*)$^$/smg ) {
        # $1 is the matched record, without the blank line
        # (/g added so the loop advances past each match)
    }

    work?
      Data looks as follows:-
      DES MAIL
      TN 001 0 02 00
      TYPE SL1
      CDEN DD
      CUST 0
      KLS 1
      FDN
      TGAR 0
      LDN NO
      NCOS 0
      09
      DATE 9 MAR 2000

      TN 001 0 02 01
      05 RLS
      06 TRN
      07 AO3
      08
      09
      ZONE 002

      TN 001 0 02 01
      05 RLS
      ZONE 001
      07 AO3
      08
      09
      DATE 9 MAR 2000