Matching consecutive "different" regex patterns across multiple lines

eh3civic has asked for the wisdom of the Perl Monks concerning the following question:

I have some large data sets with oddly formatted lines, and I don't need much of the data. The general rule is that I need to keep every line that starts with "john#" when it is followed by 2 to 5 other lines ({2,5}) and then by a line starting with "jacob - \d\.0".

bhgfsggdsgsg
--
john1
weruwearnwrnweuarar
jjafdaiuweifweofiuwe
jacob - 1.0
--
nfaslf23523525
john2
asfsjldf43tgre
john3
asbdfhskafbv3333v
sdfahh34ttg
sadfhk34t3wtg
sdfhk3gfwghhw3
jacob - 2.0

The output that I need would look like this:

john1 > jacob - 1.0
john3 > jacob - 2.0

Obviously my data is a little different from this. I have every regex pulling exactly what I need, just not in the way I want it. I can't figure out how to tell it to take a john# only when it is followed by a jacob; I don't want to keep a john# unless a jacob follows it. For instance, the code below is meant to handle the case where there are 3 lines between them. I know it isn't right, but the multi-match, multi-line thing has me confused.

if($line1 =~ /(john)^.{1,100}$^.{1,100}$^.{1,100}$^(jacob \- \d\.0)/s)
{
($JOHN,$JACOB) = ($1,$2);
print MYOUTPUTFILE1 "$JOHN - $JACOB";
}


Replies are listed 'Best First'.
Re: Matching consecutive "different" regex patterns across multiple lines by Anonymous Monk on Apr 23, 2011 at 11:53 UTC
buffer it

    my @buffer;
    while ( ... ) {
        if ( start condition ) {
            if ( @buffer ) {
                INFO("start condition without end condition, discarding buffer");
            }
            @buffer = $line;
        }
        elsif ( end condition ) {
            print OUTFILE @buffer;
            undef @buffer;
        }
        else {
            push @buffer, $line;
        }
    }
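A runnable sketch of the buffering idea above may help. It assumes the start condition is a line beginning with "john" plus a digit and the end condition is a line beginning with "jacob - N.0". The sub name `collect_blocks`, both regexes, and the sample lines are illustrative assumptions, not part of the original reply. Unlike the pseudocode, it only buffers lines once a john line has been seen, and it keeps the jacob line with the block.

```perl
use strict;
use warnings;

# Hypothetical helper: returns one array ref per john...jacob block,
# holding the john line, the lines in between, and the jacob line.
sub collect_blocks {
    my @lines = @_;
    my (@blocks, @buffer);
    for my $line (@lines) {
        if ($line =~ /^john\d/) {              # assumed start condition
            warn "start without end, discarding buffer\n" if @buffer;
            @buffer = ($line);
        }
        elsif ($line =~ /^jacob - \d\.0/) {    # assumed end condition
            push @blocks, [@buffer, $line] if @buffer;
            @buffer = ();
        }
        elsif (@buffer) {                      # only buffer inside a block
            push @buffer, $line;
        }
    }
    return @blocks;
}

my @sample = ('--', 'john1', 'aaa', 'bbb', 'jacob - 1.0',
              'john2', 'ccc', 'john3', 'ddd', 'jacob - 2.0');
print join(' | ', @$_), "\n" for collect_blocks(@sample);
```

On the sample list this prints the john1 block and the john3 block; the john2 block is discarded with a warning because john3 starts before any jacob line arrives.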
Re^2: Matching consecutive "different" regex patterns across multiple lines by duelafn (Parson) on Apr 23, 2011 at 13:04 UTC
More explicitly, Anonymonk is suggesting processing the input line by line, using something like the following, rather than attempting to match a whole chunk at once. Here I am just keeping the john line. If you actually need the lines between, push them to a `@buffer` as Anonymonk suggests.

    my $john;
    while (defined(local $_ = <DATA>)) {
        if (/^(john\d+)$/) {
            $john = $1;    # remember the most recent john line
        }
        elsif (/^(jacob \- \d\.0)$/) {
            if ($john) {
                print "$john - $1\n";
            }
            else {
                die "Jacob is not preceded by John!";
            }
            undef $john;   # this john has been used up
        }
    }
    __DATA__
    bhgfsggdsgsg
    --
    john1
    weruwearnwrnweuarar
    jjafdaiuweifweofiuwe
    jacob - 1.0
    --
    nfaslf23523525
    john2
    asfsjldf43tgre
    john3
    asbdfhskafbv3333v
    sdfahh34ttg
    sadfhk34t3wtg
    sdfhk3gfwghhw3
    jacob - 2.0

Good Day, Dean
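The line-by-line approach can also enforce the question's rule that 2 to 5 lines sit between the john line and the jacob line, by counting as it goes. This is a minimal sketch under that reading of the rule; the sub name `pair_with_gap` and its `$min`/`$max` parameters are hypothetical and do not come from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: pair each john line with the jacob line that follows
# it, but only when $min to $max other lines sit between the two.
sub pair_with_gap {
    my ($min, $max, @lines) = @_;
    my ($john, $gap, @pairs);
    for my $line (@lines) {
        if ($line =~ /^(john\d+)$/) {
            ($john, $gap) = ($1, 0);    # a newer john replaces the old one
        }
        elsif ($line =~ /^(jacob - \d\.0)$/) {
            push @pairs, "$john > $1"
                if defined $john && $gap >= $min && $gap <= $max;
            undef $john;
        }
        elsif (defined $john) {
            $gap++;                     # count lines between john and jacob
        }
    }
    return @pairs;
}

print "$_\n" for pair_with_gap(2, 5,
    'john1', 'a', 'b', 'jacob - 1.0',
    'john2', 'c', 'jacob - 2.0');
```

Here only "john1 > jacob - 1.0" is printed, because john2 has a single line before its jacob line and so falls outside the 2 to 5 range.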
Re: Matching consecutive "different" regex patterns across multiple lines by wind (Priest) on Apr 23, 2011 at 15:57 UTC
Here is a regex solution just for fun:

    use strict;
    use warnings;

    # Slurp the whole input into one string
    my $data = do {local $/; <DATA>};

    # A john line, then as few non-john lines as possible, then a jacob line
    while ($data =~ /^(john.*)\n(?:(?!john).*\n)*?(jacob.*)/mg) {
        print "$1 > $2\n";
    }
    __DATA__
    bhgfsggdsgsg
    --
    john1
    weruwearnwrnweuarar
    jjafdaiuweifweofiuwe
    jacob - 1.0
    --
    nfaslf23523525
    john2
    asfsjldf43tgre
    john3
    asbdfhskafbv3333v
    sdfahh34ttg
    sadfhk34t3wtg
    sdfhk3gfwghhw3
    jacob - 2.0

However, I'd also advise line by line processing for this type of problem.

    my $buffer;
    while (<DATA>) {
        $buffer = $1 if /^(john.*)/;
        if ($buffer && /^jacob/) {
            print "$buffer > $_";
            $buffer = '';
        }
    }
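If the 2 to 5 line limit from the question matters, a bounded quantifier can stand in for an open-ended one in a whole-string match. The following is a sketch only: the sub name `pairs_by_regex` is hypothetical, and the pattern assumes the john and jacob lines look like those in the sample data.

```perl
use strict;
use warnings;

# Hypothetical helper: find john/jacob pairs with one regex over the whole text.
# Between the two lines, {2,5} allows 2 to 5 lines that start with neither
# "john" nor "jacob".
sub pairs_by_regex {
    my ($text) = @_;
    my @pairs;
    while ($text =~ /^(john\d+)\n(?:(?!john|jacob).*\n){2,5}(jacob - \d\.0)$/mg) {
        push @pairs, "$1 > $2";
    }
    return @pairs;
}

my $text = join "\n",
    'john1', 'a', 'b', 'jacob - 1.0',
    'john2', 'c', 'jacob - 2.0', '';
print "$_\n" for pairs_by_regex($text);
```

On this input only the john1 pair is printed, since john2 is followed by one line rather than at least two. The negative lookahead is checked at the start of each intervening line, which is what keeps a later john from being swallowed into an earlier match.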