Re: Text Extraction

There is probably a clever way to do it with grep and Range Operators, but here is a way using flag variables to track state. You should rename the flags to something more meaningful for your application:

use strict;
use warnings;

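my $flag1 = 0;    # true while inside a SUBSCRIBER .. NATIONAL block
my $flag2 = 0;    # true once A1 has been seen inside the current block
my @lines;        # lines collected for the current block
while (<DATA>) {
    $flag2 = 1 if $flag1 and /A1/;
    $flag1 = 1 if /SUBSCRIBER/;
    push @lines, $_ if $flag1;
    if (/NATIONAL/) {
        # End of block: print it only if it contained A1, then reset.
        print @lines if $flag2;
        $flag1 = 0;
        $flag2 = 0;
        @lines = ();
    }
}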

__DATA__
foo
bar
SUBSCRIBER
goo
hoo
nada
NATIONAL

SUBSCRIBER
goo
A1
NATIONAL
junk
junk
junk

Prints:

SUBSCRIBER
goo
A1
NATIONAL

Update: Ok, here's my solution with Range Operators:

use strict;
use warnings;

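my $flag = 0;     # true once A1 has been seen inside the current block
my @lines;        # lines collected for the current block
while (<DATA>) {
    # The range operator is true from the SUBSCRIBER line through the NATIONAL line.
    if (/SUBSCRIBER/ .. /NATIONAL/) {
        push @lines, $_;
        $flag = 1 if /A1/;
        if (/NATIONAL/) {
            # End of block: print it only if it contained A1, then reset.
            print @lines if $flag;
            $flag = 0;
            @lines = ();
        }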
    }
}
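
For anyone unfamiliar with the range operator in scalar context: it acts as a flip-flop, staying false until the left pattern matches and then staying true up to and including the line where the right pattern matches. Here is a minimal sketch of just that behaviour, separate from the A1 test; the @input list and the @inside array are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical input, one "line" per element.
my @input = qw(foo SUBSCRIBER goo NATIONAL junk);

my @inside;
for (@input) {
    # True from the element matching SUBSCRIBER through the one matching NATIONAL.
    push @inside, $_ if /SUBSCRIBER/ .. /NATIONAL/;
}

print "@inside\n";    # prints "SUBSCRIBER goo NATIONAL"
```

Both end lines are included, which is why the solution above can test for /NATIONAL/ inside the range block to detect the end of a record.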

Re^2: Text Extraction by JonDepp (Novice) on Feb 08, 2010 at 19:16 UTC
Here is an example of my input file. The records follow the same structure over and over:

SUBSCRIBER DEMOGRAPHIC INFORMATION BIRTHDATE GENDER MEMBER IDENTIFICATION NUMBER NAME XXXXXXXXX TRACE NUMBER: XXXXXXXX CLAIM CLAIM PAYORS CLAIM NUMBER: XXXXXXXXXXXXXXXXXXXX PERIOD BEG PERIOD END MEDICAL RECORD NUMBER: 01/14/2010 01/14/2010 BILLING TYPE: EFFECTIVE ADJUDICATION PAYMENT CHARGE PAYMENT CHECK STATUS DATE PAYMENT DATE METHOD AMOUNT AMOUNT CHECK DATE NUMBER 02/01/2010 XX.XX 0.00 CLAIM LEVEL STATUS CATEGORY: A1 STATUS: 19 MODIFIER: PR PAGE: 11 CLINIC # XXXXXX (C980 ) XXXXXX REPORT NO: CPR601.01 SOMEINSURANCE HEALTH CARE CLAIM STATUS NOTIFICATION ISA CONTROL NO: XXXXXXXXXX ISA PROCESS DATE: 10/02/02 ISA PROCESS TIME: 04:52 GROUP CONTROL NO: XXXXXX ST CONTROL NO: XXXXXXXXX BHT REFERENCE ID: XXXXXXXXX BHT DATE: 02/02/2010 PAYOR NAME: SOMEINSURANCE ID: XXXXX PROVIDER NAME: XXXXXXXXXXX XXXXXXXXX XXXXX XX NATIONAL PROVIDER ID: XXXXXXXXXX

I need everything between SUBSCRIBER DEMOGRAPHIC and NATIONAL PROVIDER ID, but only if the CLAIM STATUS CATEGORY code is something other than A1 (A3, A4, F2... there's a bunch). Here is the code I have so far:

use strict;
use warnings;

open TEST, "tests.txt" or die $!;
open OUTPUT, "> output1.txt" or die $!;

my @data;
my $data;
while (<TEST>) {
    if (/SUBSCRIBER DEMOGRAPHIC/../CLAIM LEVEL STATUS CATEGORY/) {
        @data = $_;
        next;
        foreach ( $data, @data) {
            if ($data =~ /A1/) {
                print OUTPUT @data;
            }
        }
    }
}
close TEST;
close OUTPUT;

This code gives me no syntax errors when I run it, but I get a 0 KB output file. Please help!
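
One way the range-operator solution from the parent post might be adapted to this requirement is sketched below. It is only a sketch under assumptions: the line layout of the report is guessed (the sample above arrived flattened), the @report data is a trimmed, made-up stand-in for the real file, and the names @record, $category and @kept are hypothetical. It collects each record through the NATIONAL PROVIDER ID line, captures the code after STATUS CATEGORY:, and keeps the record only when that code is not A1:

```perl
use strict;
use warnings;

# Hypothetical stand-in for the report: two records, trimmed to the
# lines that matter. The real line layout is an assumption.
my @report = (
    "PAGE: 10\n",
    "SUBSCRIBER DEMOGRAPHIC INFORMATION\n",
    "CLAIM LEVEL STATUS CATEGORY: A1 STATUS: 19\n",
    "NATIONAL PROVIDER ID: XXXXXXXXXX\n",
    "SUBSCRIBER DEMOGRAPHIC INFORMATION\n",
    "CLAIM LEVEL STATUS CATEGORY: A3 STATUS: 19\n",
    "NATIONAL PROVIDER ID: XXXXXXXXXX\n",
);

my (@record, $category, @kept);
for (@report) {
    if (/SUBSCRIBER DEMOGRAPHIC/ .. /NATIONAL PROVIDER ID/) {
        push @record, $_;
        # Remember the category code for the current record.
        $category = $1 if /STATUS CATEGORY:\s*(\w+)/;
        if (/NATIONAL PROVIDER ID/) {
            # End of record: keep it unless the category is A1.
            push @kept, @record if defined $category and $category ne 'A1';
            @record = ();
            undef $category;
        }
    }
}

print @kept;    # prints only the three lines of the A3 record
```

With a real file, the for loop would become while (<TEST>) and print @kept would become print OUTPUT @record at the end of each kept record. As the follow-up below notes, the patterns may need tightening if the same words occur elsewhere in the report.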
Re^2: Text Extraction by JonDepp (Novice) on Feb 05, 2010 at 16:44 UTC
That worked a lot better. I just realized that the text file I'm parsing has those patterns occurring all over the place, so I have to refine the regular expressions in my code. This is a great start and I'm sure I'll be back once I refine those expressions. Thanks for all the help!!
Re^3: Text Extraction by planetscape (Chancellor) on Feb 06, 2010 at 15:04 UTC
You might benefit from something in My Favourite Regex Tools. HTH, planetscape