I have some large data sets with oddly formatted lines, and I don't need most of the data. The general rule is that I need to keep every line that starts with "john#" when it is followed by {2,5} other lines and then a line starting with "jacob - \d\.0". Sample data:
bhgfsggdsgsg
--
john1
weruwearnwrnweuarar
jjafdaiuweifweofiuwe
jacob - 1.0
--
nfaslf23523525
john2
asfsjldf43tgre
john3
asbdfhskafbv3333v
sdfahh34ttg
sadfhk34t3wtg
sdfhk3gfwghhw3
jacob - 2.0
The output that I need would look like this:
john1 > jacob - 1.0
john3 > jacob - 2.0
Obviously my real data is a little different from this. My regexes pull out exactly what I need, just not in the way I want. I can't figure out how to tell it to take a john# only when it is followed by a jacob; I don't want to keep a john# otherwise. For instance, the code below is meant to handle the case where there are 3 lines between them. I know it isn't right, but the multi-match, multi-line thing has me confused.
if($line1 =~ /(john)^.{1,100}$^.{1,100}$^.{1,100}$^(jacob \- \d\.0)/s) {
    ($JOHN,$JACOB) = ($1,$2);
    print MYOUTPUTFILE1 "$JOHN - $JACOB";
}
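For what it's worth, one way the rule above could be approached without a single multi-line regex is to walk the data line by line, remember the most recent john#, and count the lines seen since then. The following is only a sketch under some assumptions that are not stated in the question: the helper name `john_jacob_pairs` is made up, "john#" is taken to mean `john` followed by digits, and a newer john# line is assumed to replace an older one and restart the count.

```perl
use strict;
use warnings;

# Hypothetical helper: takes a list of lines and returns one
# "john# > jacob - N.0" string for each john# line that is followed
# by 2 to 5 other lines and then a jacob line.
sub john_jacob_pairs {
    my @lines = @_;
    my ($john, $gap);    # most recent john# seen, and lines seen since it
    my @pairs;
    for my $line (@lines) {
        chomp $line;
        if ($line =~ /^(john\d+)/) {
            # Assumption: a newer john# replaces the older one.
            ($john, $gap) = ($1, 0);
        }
        elsif ($line =~ /^(jacob - \d\.0)/) {
            push @pairs, "$john > $1"
                if defined $john && $gap >= 2 && $gap <= 5;
            undef $john;    # each john# is paired at most once
        }
        elsif (defined $john) {
            $gap++;         # an ordinary line between john# and jacob
        }
    }
    return @pairs;
}

# The sample data from the question, one element per line.
my @sample = (
    "bhgfsggdsgsg", "--", "john1", "weruwearnwrnweuarar",
    "jjafdaiuweifweofiuwe", "jacob - 1.0", "--", "nfaslf23523525",
    "john2", "asfsjldf43tgre", "john3", "asbdfhskafbv3333v",
    "sdfahh34ttg", "sadfhk34t3wtg", "sdfhk3gfwghhw3", "jacob - 2.0",
);
print "$_\n" for john_jacob_pairs(@sample);
# prints:
# john1 > jacob - 1.0
# john3 > jacob - 2.0
```

For a large file the same loop could run directly over a filehandle instead of an in-memory list, so only one line is held at a time.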