micwood has asked for the wisdom of the Perl Monks concerning the following question:

Howdy:

I have a hopefully easy problem. I am trying to isolate the text between two patterns so that I can both capture it (here, the case name) and count the instances of a particular word in it (here, the word Education, but I would also like to count the number of words in general). I am able to isolate the beginning pattern and the end pattern across multiple lines, but I can't figure out how to capture and count ALL of the text between the patterns. When I run my script, I only capture the last matched line and only count the last instance. I am not able to manipulate the entire string of text across multiple lines between the established patterns.

This is my script:

    my $casename = '';
    my $count = 0;
    $/ = "";    # paragraph input mode

    while (<IN>) {
        if (/^\d+\sof\s\d+\sDOCUMENTS$/mx ... /^No.\s\w+$/mx) {
            if (/^(.*)/) {
                $casename = $1;
            }
            if (/EDUCATION/g) {
                $count++;
            }
            next;
        }
    }
    print "Case Name: $casename\n";
    print "Education: $count\n";

Which does isolate the correct bounded text in a much larger file, as such:

    97 of 141 DOCUMENTS
    HOKE COUNTY BOARD OF EDUCATION; HALIFAX COUNTY BOARD OF EDUCATION; ROBESON COUNTY BOARD OF EDUCATION; CUMBERLAND COUNTY BOARD OF EDUCA\TION; VANCE COUNTY BOARD OF EDUCATION; RANDY L. HASTY, individually and as guardian ad litem of Randell B. Hasty; STEVEN R. SUNKEL, individually and as guardian ad litem of Andrew J. Sunkel; LIONEL WHIDBEE, individually and as guardian ad litem of Jeremy L. Whidbee; TYRONE T. WILLIAMS, individually and as guardian ad litem of Trevelyn L. Williams; D.E. LOCKLEAR, JR., individually and as guardian ad litem of Jason E. Locklear; ANGUS B. THOMPSON II, individually and as guardian ad litem of Vandaliah J. Thompson; MARY ELIZABETH LOWERY, individually and as guardian ad litem of Lannie Rae Lowery; JENNIE G. PEARSON, individually and as guardian ad litem of Sharese D. Pearson; BENITA B. TIPTON, individually and as guardian ad litem of Whitney B. Tipton; DANA HOLTON JENKINS, individually and as guardian ad litem of Rachel M. Jenkins; LEON R. ROBINSON, individually and as guardian ad litem of Justin A. Robinson, Plaintiffs and ASHEVILLE CITY BOARD OF EDUCATION; BUNCOMBE COUNTY BOARD OF EDUCATION; CHARLOTTE-MECKLENBURG BOARD OF EDUCATION; DURHAM PUBLIC SCHOOLS BOARD OF EDUCATION; WAKE COUNTY BOARD OF EDUCATION; WINSTON-SALEM/FORSYTH COUNTY BOARD OF EDUCATION; CASSANDRA INGRAM, individually and as guardian ad litem of Darris Ingram; CAROL PENLAND, individually and as guardian ad litem of Jeremy Penland; DARLENE HARRIS, individually and as guardian ad litem of Shamek Harris; NETTIE THOMPSON, individually and as guardian ad litem of Annette Renee Thompson; OPHELIA AIKEN, individually and as guardian ad litem of Brandon Bell, Plaintiff-Intervenors v. STATE OF NORTH CAROLINA and the STATE BOARD OF EDUCATION, Defendants
    No. 530PA02

But, the returned information is as follows:

    Case Name: No. 530PA02
    Education: 2

Any idea what I am doing wrong? Thanks so much for your help! Can't tell you how much I appreciate it. Best, Michael

Replies are listed 'Best First'.
Re: Capture/Counting text between patterns across multiple lines
by johngg (Canon) on Nov 02, 2007 at 22:04 UTC
    As long as your input files are not too large, it might be easier to slurp the entire file into memory as a string and capture each document in turn using a global regex match. Here, I capture the document in $1 and, within it, the case name in $2. I have put the data (including a second document) after the __END__ token in the script so I can read it with the <DATA> filehandle, which Perl automatically opens for us.

        use strict;
        use warnings;

        my $allDocs = do {
            local $/;
            <DATA>;
        };

        my $rxExtractDoc = qr
            {(?xms)
                (
                    \d+\sof\s\d+\sDOCUMENTS
                    .*?
                    No\.\s(\w+)
                    \n
                )
            };

        while ( $allDocs =~ m{$rxExtractDoc}g ) {
            my $document = $1;
            my $caseName = $2;
            my $edCount  = () = $document =~ m{EDUCATION}g;
            print
                qq{Case Name: $caseName\n},
                qq{Education: $edCount\n},
                q {-} x 25, qq{\n};
        }

        __END__
        Blurfl Blurfl
        No. 32323IY883
        97 of 141 DOCUMENTS
        HOKE COUNTY BOARD OF EDUCATION; HALIFAX COUNTY BOARD OF EDUCATION; ROBESON COUNTY BOARD OF EDUCATION; CUMBERLAND COUNTY BOARD OF EDUCA\TION; VANCE COUNTY BOARD OF EDUCATION; RANDY L. HASTY, individually and as guardian ad litem of Randell B. Hasty; STEVEN R. SUNKEL, individually and as guardian ad litem of Andrew J. Sunkel; LIONEL WHIDBEE, individually and as guardian ad litem of Jeremy L. Whidbee; TYRONE T. WILLIAMS, individually and as guardian ad litem of Trevelyn L. Williams; D.E. LOCKLEAR, JR., individually and as guardian ad litem of Jason E. Locklear; ANGUS B. THOMPSON II, individually and as guardian ad litem of Vandaliah J. Thompson; MARY ELIZABETH LOWERY, individually and as guardian ad litem of Lannie Rae Lowery; JENNIE G. PEARSON, individually and as guardian ad litem of Sharese D. Pearson; BENITA B. TIPTON, individually and as guardian ad litem of Whitney B. Tipton; DANA HOLTON JENKINS, individually and as guardian ad litem of Rachel M. Jenkins; LEON R. ROBINSON, individually and as guardian ad litem of Justin A.
        Robinson, Plaintiffs and ASHEVILLE CITY BOARD OF EDUCATION; BUNCOMBE COUNTY BOARD OF EDUCATION; CHARLOTTE-MECKLENBURG BOARD OF EDUCATION; DURHAM PUBLIC SCHOOLS BOARD OF EDUCATION; WAKE COUNTY BOARD OF EDUCATION; WINSTON-SALEM/FORSYTH COUNTY BOARD OF EDUCATION; CASSANDRA INGRAM, individually and as guardian ad litem of Darris Ingram; CAROL PENLAND, individually and as guardian ad litem of Jeremy Penland; DARLENE HARRIS, individually and as guardian ad litem of Shamek Harris; NETTIE THOMPSON, individually and as guardian ad litem of Annette Renee Thompson; OPHELIA AIKEN, individually and as guardian ad litem of Brandon Bell, Plaintiff-Intervenors v. STATE OF NORTH CAROLINA and the STATE BOARD OF EDUCATION, Defendants
        No. 530PA02
        98 of 141 DOCUMENTS
        HOKE COUNTY BOARD OF EDUCATION; HALIFAX COUNTY BOARD OF EDUCATION; CUMBERLAND COUNTY BOARD OF EDUCATION; VANCE COUNTY BOARD OF EDUCATION; RANDY L. HASTY, individually and as guardian ad litem of Randell B. Hasty; STEVEN R. SUNKEL, individually and as guardian ad litem of Andrew J. Sunkel; LIONEL WHIDBEE, individually and as guardian ad litem of Jeremy L. Whidbee; TYRONE T. WILLIAMS, individually and as guardian ad litem of Trevelyn L. Williams; D.E. LOCKLEAR, JR., individually and as guardian ad litem of Jason E. Locklear; ANGUS B. THOMPSON II, individually and as guardian ad litem of Vandaliah J. Thompson; MARY ELIZABETH LOWERY, individually and as guardian ad litem of Lannie Rae Lowery; JENNIE G. PEARSON, individually and as guardian ad litem of Sharese D. Pearson; BENITA B. TIPTON, individually and as guardian ad litem of Whitney B. Tipton; DANA HOLTON JENKINS, individually and as guardian ad litem of Rachel M. Jenkins; LEON R. ROBINSON, individually and as guardian ad litem of Justin A.
        Robinson, Plaintiffs and ASHEVILLE CITY BOARD OF EDUCATION; BUNCOMBE COUNTY BOARD OF EDUCATION; WINSTON-SALEM/FORSYTH COUNTY BOARD OF EDUCATION; CASSANDRA INGRAM, individually and as guardian ad litem of Darris Ingram; CAROL PENLAND, individually and as guardian ad litem of Jeremy Penland; DARLENE HARRIS, individually and as guardian ad litem of Shamek Harris; NETTIE THOMPSON, individually and as guardian ad litem of Annette Renee Thompson; OPHELIA AIKEN, individually and as guardian ad litem of Brandon Bell, Plaintiff-Intervenors v. STATE OF NORTH CAROLINA and the STATE BOARD OF EDUCATION, Defendants
        No. 534PA08
        End of document set

    Here's the output.

        Case Name: 530PA02
        Education: 11
        -------------------------
        Case Name: 534PA08
        Education: 8
        -------------------------

    I hope this is of use.

    Cheers,

    JohnGG

      Wow, these are all great, and I think I can definitely adapt them for my needs. I apologize for being unclear: I should have noted that the "case name" I was trying to grab was all of the text between the two patterns, that is, the looooooong string:

      HOKE COUNTY BOARD OF EDUCATION; HALIFAX COUNTY BOARD OF EDUCATION; ROBESON COUNTY BOARD OF EDUCATION; CUMBERLAND COUNTY BOARD OF EDUCA\TION; VANCE COUNTY BOARD OF EDUCATION; RANDY L. HASTY, individually and as guardian ad litem of Randell B. Hasty; STEVEN R. SUNKEL, individually and as guardian ad litem of Andrew J. Sunkel; LIONEL WHIDBEE, individually and as guardian ad litem of Jeremy L. Whidbee; TYRONE T. WILLIAMS, individually and as guardian ad litem of Trevelyn L. Williams; D.E. LOCKLEAR, JR., individually and as guardian ad litem of Jason E. Locklear; ANGUS B. THOMPSON II, individually and as guardian ad litem of Vandaliah J. Thompson; MARY ELIZABETH LOWERY, individually and as guardian ad litem of Lannie Rae Lowery; JENNIE G. PEARSON, individually and as guardian ad litem of Sharese D. Pearson; BENITA B. TIPTON, individually and as guardian ad litem of Whitney B. Tipton; DANA HOLTON JENKINS, individually and as guardian ad litem of Rachel M. Jenkins; LEON R. ROBINSON, individually and as guardian ad litem of Justin A. Robinson, Plaintiffs and ASHEVILLE CITY BOARD OF EDUCATION; BUNCOMBE COUNTY BOARD OF EDUCATION; CHARLOTTE-MECKLENBURG BOARD OF EDUCATION; DURHAM PUBLIC SCHOOLS BOARD OF EDUCATION; WAKE COUNTY BOARD OF EDUCATION; WINSTON-SALEM/FORSYTH COUNTY BOARD OF EDUCATION; CASSANDRA INGRAM, individually and as guardian ad litem of Darris Ingram; CAROL PENLAND, individually and as guardian ad litem of Jeremy Penland; DARLENE HARRIS, individually and as guardian ad litem of Shamek Harris; NETTIE THOMPSON, individually and as guardian ad litem of Annette Renee Thompson; OPHELIA AIKEN, individually and as guardian ad litem of Brandon Bell, Plaintiff-Intervenors v. STATE OF NORTH CAROLINA and the STATE BOARD OF EDUCATION, Defendants

      ...not the case number, which is what my script was incorrectly capturing. But I think I can adapt the solutions that you have provided to get that information into a scalar.

      best, Michael
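For readers following along, here is a minimal sketch of the adaptation described above: it keeps the slurp-and-match approach but moves the capture group so that the text between the two marker lines lands in a scalar. The sample data, the variable names ($rxCase, @cases) and the decision to fold line breaks into single spaces are all assumptions made for illustration, not part of any posted solution.

```perl
use strict;
use warnings;

# Hypothetical, shortened sample standing in for one record of the real file.
my $allDocs = <<'END_DOCS';
1 of 2 DOCUMENTS

ALPHA COUNTY BOARD OF EDUCATION;
BETA COUNTY BOARD OF EDUCATION, Plaintiffs
v. STATE BOARD OF EDUCATION, Defendants

No. 123AB45
END_DOCS

# First capture: everything between the marker lines.
# Second capture: the case number.
my $rxCase = qr{
    ^ \d+ \s of \s \d+ \s DOCUMENTS \s* $
    \s* (.*?) \s*
    ^ No\. \s (\w+) \s* $
}xms;

my @cases;
while ( $allDocs =~ m{$rxCase}g ) {
    my ( $caseName, $caseNumber ) = ( $1, $2 );
    $caseName =~ s/\s+/ /g;    # fold line breaks into single spaces

    # List-context matches, counted via the empty-list assignment idiom.
    my $edCount   = () = $caseName =~ /EDUCATION/g;
    my $wordCount = () = $caseName =~ /\w+/g;

    push @cases, {
        name      => $caseName,
        number    => $caseNumber,
        education => $edCount,
        words     => $wordCount,
    };
    print "Case Name: $caseName\n",
          "Case No.:  $caseNumber\n",
          "Education: $edCount\n",
          "Words:     $wordCount\n";
}
```

The non-greedy (.*?) stops at the first "No." line that follows, so each document yields its own case name even when several documents sit in the same string.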

Re: Capture/Counting text between patterns across multiple lines
by GrandFather (Saint) on Nov 02, 2007 at 21:22 UTC

    You can use (?{...}) (see perlreref's EXTENDED CONSTRUCTS) to execute a code fragment inside a regex, so:

        while (<IN>) {
            if (m/^\d+\sof\s\d+\sDOCUMENTS$/mx ... m/^No.\s\w+$/mx) {
                $casename = $1 if /^(.*)/;
                /EDUCATION(?{$count++})/g;
            }
        }

    would seem to do what you want.


    Perl is environmentally friendly - it saves trees
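A small standalone sketch of the (?{...}) idea may help here. It is not the code above: it assumes a package variable ($count declared with our) and runs the match in list context, by assigning to an empty list, so that /g walks every occurrence in the string. The sample text is made up.

```perl
use strict;
use warnings;

# Hypothetical sample paragraph.
my $text = "HOKE COUNTY BOARD OF EDUCATION; HALIFAX COUNTY BOARD OF EDUCATION; "
         . "STATE BOARD OF EDUCATION";

# Package variable, so the embedded code block sees it on any Perl version.
our $count = 0;

# The code block runs each time the literal EDUCATION has just matched.
# The list assignment forces list context, so /g finds all matches in one go.
() = $text =~ /EDUCATION(?{ $count++ })/g;

print "Education: $count\n";
```

Because the code block sits after a plain literal, it fires once per successful match here; with patterns that backtrack, it can fire more often than the number of final matches, which is worth keeping in mind before relying on it for counting.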
Re: Capture/Counting text between patterns across multiple lines
by igelkott (Priest) on Nov 02, 2007 at 21:04 UTC
    The search for EDUCATION just reports the number of lines (paragraphs in this case) which have at least one match. I believe that this is because the search is in a scalar context. The simplest change would be to replace the search with a substitution like: $count += (s/EDUCATION/EDUCATION/g); but that's probably not so efficient. Another approach would be to give the original search a list context like: $count += () = (/EDUCATION/g); but that's not quite as readable. To put this into context, you could rewrite the while block to:
        while (<IN>) {
            if (/^\d+\sof\s\d+\sDOCUMENTS$/mx .. /^No.\s(\w+)$/mx) {
                $casename = $1;
                # $count += (s/EDUCATION/EDUCATION/g);
                $count += () = (/EDUCATION/g);
            }
        }
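To make the context difference concrete, here is a short sketch on a made-up paragraph. The variable names are illustrative only. A match used as a condition answers "is there at least one?", while the same match in list context returns every hit, and assigning that list to an empty list in scalar context yields the number of hits.

```perl
use strict;
use warnings;

# Hypothetical sample paragraph with three occurrences of the word.
my $paragraph = "HOKE COUNTY BOARD OF EDUCATION; HALIFAX COUNTY BOARD OF EDUCATION; "
              . "STATE BOARD OF EDUCATION";

# Boolean test: increments once per paragraph, however many hits it contains.
my $paragraphsWithMatch = 0;
$paragraphsWithMatch++ if $paragraph =~ /EDUCATION/;

# List context: /g returns all matches; "= () =" turns that list into a count.
my $occurrences = () = $paragraph =~ /EDUCATION/g;

print "Paragraphs with a match: $paragraphsWithMatch\n";
print "Occurrences:             $occurrences\n";
```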
Re: Capture/Counting text between patterns across multiple lines
by mwah (Hermit) on Nov 02, 2007 at 21:35 UTC
    I am trying to isolate information between two patterns to both capture the information between (here, the case name) and count the instances a particular word (here, the word Education but I would also like to count the number of words in general).

    From your description, I didn't really understand what you are trying to accomplish (this may be a language problem on my side). As I understand it, you want to count different things in different paragraphs. I'm not sure what to use here, maybe something like a parser, or just simple counting. I tried to put what I understood into a small example below:

        ...
        my ($casename, $count, $wordcount) = ('', 0, 0);
        $/ = '';    # paragraph mode

        open my $fh, '<', 'data.txt' or die "can't open, $!";
        while( <$fh> ) {
            REPEAT: {    # Play parser
                /\G^\d+\s+of\s+\d+\s+DOCUMENTS$/gc && do { redo; };    # enter action here
                /\G^No\.\s+(\w+)$/gc && do { $casename = $1; redo; };
            }
            # simply count stuff
            $count += () = /EDUCATION/g;
            $wordcount += () = /\w+/g
        }
        close $fh;

        print "Case Name: $casename\n";
        print "Education: $count\n";
        print "Words: $wordcount\n";
        ...

    The /gc regex parser technique is explained, to some extent, in perlfaq6.

    Regards

    mwa
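As a rough illustration of the /gc technique mentioned above, here is a tiny tokenizer sketch. It is a simplified stand-in, not the parser from the post: the token names, the LEXER label and the sample line are assumptions. \G anchors each attempt at the position where the previous match stopped, and /c keeps that position when an attempt fails, so the alternatives can be tried one after another.

```perl
use strict;
use warnings;

# Hypothetical input line.
my $line = "97 of 141 DOCUMENTS";
my @tokens;

for ($line) {    # alias the line to the default variable
    LEXER: {
        # Each branch resumes where the last successful match ended.
        if (/\G(\d+)/gc)       { push @tokens, "NUMBER($1)"; redo LEXER }
        if (/\G([A-Za-z]+)/gc) { push @tokens, "WORD($1)";   redo LEXER }
        if (/\G\s+/gc)         { redo LEXER }    # skip whitespace
        if (/\G./gcs)          { redo LEXER }    # skip anything unrecognised
        # Nothing matched: end of input, fall out of the block.
    }
}

print join( ' ', @tokens ), "\n";
```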