eh3civic has asked for the wisdom of the Perl Monks concerning the following question:

I have some large data sets with oddly formatted lines, most of which I don't need. The general rule is that I need to keep every instance of a line that starts with "john#", is followed by 2 to 5 other lines ({2,5}), and then by a line starting with "jacob - \d\.0".

    bhgfsggdsgsg
    --
    john1
    weruwearnwrnweuarar
    jjafdaiuweifweofiuwe
    jacob - 1.0
    --
    nfaslf23523525
    john2
    asfsjldf43tgre
    john3
    asbdfhskafbv3333v
    sdfahh34ttg
    sadfhk34t3wtg
    sdfhk3gfwghhw3
    jacob - 2.0

The output that I need would look like this..

    john1 > jacob - 1.0
    john3 > jacob - 2.0

Obviously my data is a little different from this. I have regexes that pull exactly what I need, just not in the way I want it. I can't figure out how to take a john# only when it is followed by a jacob; I don't want to keep a john# otherwise. For instance, the code below is meant to handle the case where there are 3 lines between them. I know it isn't right, but the multi-match, multi-line thing has me confused.

    if ($line1 =~ /(john)^.{1,100}$^.{1,100}$^.{1,100}$^(jacob \- \d\.0)/s) {
        ($JOHN, $JACOB) = ($1, $2);
        print MYOUTPUTFILE1 "$JOHN - $JACOB";
    }

Replies are listed 'Best First'.
Re: Matching consecutive "different" regex patterns across multiple lines
by Anonymous Monk on Apr 23, 2011 at 11:53 UTC
    buffer it
    my @buffer;
    while ...
        if( start condition ){
            if( @buffer ){
                INFO("start condition without end condition, discarding buffer");
            }
            @buffer = $line;
        } elsif( end condition ){
            print OUTFILE @buffer;
            undef @buffer;
        } else {
            push @buffer, $line;
        }

      More explicitly, Anonymonk is suggesting processing the input line-by-line using something like the following rather than attempting to match a whole chunk at once. Here I am just keeping the john line. If you actually need the lines between, push them to a @buffer as Anonymonk suggests.

      my $john;
      while (defined(local $_ = <DATA>)) {
          if (/^(john\d+)$/) {
              # Remember the most recent john line; a later john replaces it.
              $john = $1;
          }
          elsif (/^(jacob \- \d\.0)$/) {
              if ($john) {
                  print "$john - $1\n";
              }
              else {
                  die "Jacob is not preceded by John!";
              }
              # Forget the john so it cannot pair with a second jacob.
              undef $john;
          }
      }
      __DATA__
      bhgfsggdsgsg
      --
      john1
      weruwearnwrnweuarar
      jjafdaiuweifweofiuwe
      jacob - 1.0
      --
      nfaslf23523525
      john2
      asfsjldf43tgre
      john3
      asbdfhskafbv3333v
      sdfahh34ttg
      sadfhk34t3wtg
      sdfhk3gfwghhw3
      jacob - 2.0
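
If the lines between john and jacob are needed as well, the buffering idea might look roughly like the sketch below. This is only an illustration: `collect_blocks` and `@sample` are hypothetical names that do not appear in the thread, and the patterns assume that a john or jacob line contains nothing else, as in the sample data.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns one array ref per
# john ... jacob block, holding the john line, the lines in between,
# and the closing jacob line.
sub collect_blocks {
    my @lines = @_;
    my (@blocks, @buffer);
    for my $line (@lines) {
        if ($line =~ /^john\d+$/) {
            # Start condition: a new john discards any unfinished buffer.
            @buffer = ($line);
        }
        elsif ($line =~ /^jacob - \d\.0$/) {
            # End condition: keep the block only if a john opened it.
            push @blocks, [@buffer, $line] if @buffer;
            @buffer = ();
        }
        elsif (@buffer) {
            # A line between an open john and its jacob.
            push @buffer, $line;
        }
    }
    return @blocks;
}

# Sample data from the question, one entry per line.
my @sample = split /\n/, <<'END';
bhgfsggdsgsg
--
john1
weruwearnwrnweuarar
jjafdaiuweifweofiuwe
jacob - 1.0
--
nfaslf23523525
john2
asfsjldf43tgre
john3
asbdfhskafbv3333v
sdfahh34ttg
sadfhk34t3wtg
sdfhk3gfwghhw3
jacob - 2.0
END

print join(' | ', @$_), "\n" for collect_blocks(@sample);
```

Unlike the snippet above, this sketch silently skips a jacob with no preceding john instead of dying; whether that is the right choice depends on the real data.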

      Good Day,
          Dean

Re: Matching consecutive "different" regex patterns across multiple lines
by wind (Priest) on Apr 23, 2011 at 15:57 UTC

    Here is a regex solution just for fun:

    use strict;
    use warnings;

    my $data = do {local $/; <DATA>};

    while ($data =~ /^(john.*)\n(?:(?!john).*\n)*?(jacob.*)/mg) {
        print "$1 > $2\n";
    }

    __DATA__
    bhgfsggdsgsg
    --
    john1
    weruwearnwrnweuarar
    jjafdaiuweifweofiuwe
    jacob - 1.0
    --
    nfaslf23523525
    john2
    asfsjldf43tgre
    john3
    asbdfhskafbv3333v
    sdfahh34ttg
    sadfhk34t3wtg
    sdfhk3gfwghhw3
    jacob - 2.0

    However, I'd also advise line by line processing for this type of problem.

    my $buffer;
    while (<DATA>) {
        $buffer = $1 if /^(john.*)/;
        if ($buffer && /^jacob/) {
            print "$buffer > $_";
            $buffer = '';
        }
    }
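
Neither version above checks how many lines sit between john and jacob. If the "2 to 5 lines" rule from the question has to be enforced, one possible line-by-line sketch is to count the lines since the last john. The name `pair_with_gap` and its `$min`/`$max` parameters are hypothetical, and reading {2,5} as "2 to 5 lines strictly between the two" is an assumption about what the question means.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): pairs each john line with the
# next jacob line, but only when $min..$max other lines sit between them.
sub pair_with_gap {
    my ($min, $max, @lines) = @_;
    my (@pairs, $john, $gap);
    for my $line (@lines) {
        if ($line =~ /^(john\d+)/) {
            # A new john restarts the count and replaces any earlier john.
            ($john, $gap) = ($1, 0);
        }
        elsif (defined $john && $line =~ /^(jacob - \d\.0)/) {
            # Keep the pair only if the gap is within the allowed range.
            push @pairs, "$john > $1" if $gap >= $min && $gap <= $max;
            undef $john;
        }
        elsif (defined $john) {
            # Any other line after an open john counts toward the gap.
            $gap++;
        }
    }
    return @pairs;
}

# Sample data from the question, one entry per line.
my @sample = split /\n/, <<'END';
bhgfsggdsgsg
--
john1
weruwearnwrnweuarar
jjafdaiuweifweofiuwe
jacob - 1.0
--
nfaslf23523525
john2
asfsjldf43tgre
john3
asbdfhskafbv3333v
sdfahh34ttg
sadfhk34t3wtg
sdfhk3gfwghhw3
jacob - 2.0
END

print "$_\n" for pair_with_gap(2, 5, @sample);
```

With the sample data the gaps are 2 lines (john1) and 4 lines (john3), so both pairs pass; a john followed by a jacob after only one line would be dropped.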