Extracting blocks of text

walker has asked for the wisdom of the Perl Monks concerning the following question:

Re: Extracting blocks of text by Rhose (Priest) on Jan 30, 2004 at 14:38 UTC
You could also use the range (flip-flop) operator. The sample below prints every line from the one which starts with "head" (^ anchors the match to the start of the line) through the one which ends with "tail" (\s$ requires one whitespace character after tail, which the line's trailing newline normally supplies). Both matches are case-insensitive because of the /i flag.
`#!/usr/bin/perl use strict; use warnings; while(<DATA>) { print if /^head/i../tail\s$/i; } __DATA__ HEAD gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla tail gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus head bla bla gugus gugus tail bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus head bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus tail gugus gugus`
Output: `HEAD gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla tail gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus head bla bla gugus gugus tail`
Update: If you have the Camel book, you can find a discussion of this starting on page 90 (2nd Edition).
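If you want each block as a separate unit rather than one printed stream, the flip-flop's return value can help. In scalar context the operator returns a sequence number for every line inside the range and appends "E0" to the number on the closing line. The sketch below relies on that; the sub name `extract_blocks` is made up for illustration, and the tail pattern is relaxed to `\s*$` so that trailing whitespace is optional (an assumption, not what the post above uses).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): groups the lines selected by
# the flip-flop into one array reference per head..tail block.
# Note: the flip-flop keeps its state between calls, so this sketch
# assumes every block in the input is closed by a tail line.
sub extract_blocks {
    my @lines = @_;
    my ( @blocks, @current );
    for my $line (@lines) {
        # Sequence number while a block is open, "E0" appended on the
        # line that closes it, empty string outside any block.
        my $seq = ( $line =~ /^head/i ) .. ( $line =~ /tail\s*$/i );
        next unless $seq;
        push @current, $line;
        if ( $seq =~ /E0$/ ) {
            push @blocks, [@current];
            @current = ();
        }
    }
    return @blocks;
}

my @blocks = extract_blocks(
    "skip me\n", "HEAD one\n", "middle\n", "end tail\n",
    "skip me too\n", "head and tail\n", "trailing\n",
);
printf "%d blocks found\n", scalar @blocks;
print "--- block ---\n", @$_ for @blocks;
```

Because `..` tests the right-hand pattern on the same line that opened the range, a line holding both markers forms a one-line block. With `...` (three dots) the tail test would wait until the following line.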
Re: Extracting blocks of text by BrowserUk (Patriarch) on Jan 30, 2004 at 15:01 UTC
You can use $/ (see perlvar) and set it to a string to control what the diamond operator sees as a line ending. By setting this to 'head' and then 'tail' alternately, you can move through your large file in chunks, discarding the 1st, 3rd, 5th and printing the 2nd, 4th, 6th, etc.
`#! perl -slw use strict; open IN, '<', $ARGV[ 0 ] or die $!; $/ = 'head'; while( <IN> ) { local $/ = 'tail'; print scalar <IN>; } close IN; __END__ P:\test>type junk.txt The quick brown fox jumps over the lazy dog 0001 head The quick brown fox jumps over the lazy dog 0002 The quick brown fox jumps over the lazy dog 0003 The quick brown fox jumps over the lazy dog 0004 The quick brown fox jumps over the lazy dog 0005 tail The quick brown fox jumps over the lazy dog 0006 The quick brown fox jumps over the lazy dog 0007 The quick brown fox jumps over the lazy dog 0008 headThe quick brown fox jumps over the lazy dog 0009 The quick brown fox jumps over the lazy dog 0010 tail The quick brown fox jumps over the lazy dog 0011 The quick brown fox jumps over the lazy dog 0012 P:\test>235232 junk.txt The quick brown fox jumps over the lazy dog 0002 The quick brown fox jumps over the lazy dog 0003 The quick brown fox jumps over the lazy dog 0004 The quick brown fox jumps over the lazy dog 0005 tail The quick brown fox jumps over the lazy dog 0009 The quick brown fox jumps over the lazy dog 0010 tail`
The caveat is that if the chunks you are discarding (between a 'tail' and the next 'head' marker) are very large, they will consume large amounts of memory, because each discarded chunk is still read in as a single "line".
As implemented above, the 'head' marker is discarded, but the 'tail' marker is printed. Add or delete as necessary.
This also assumes that by "including the lines the key words are on.", you do not mean that you want any text preceding the 'head' marker, if the head marker is in the middle of a line, nor anything after the 'tail' marker if it can appear in the middle of a line.
Re^2: Extracting blocks of text by adenardo (Initiate) on Jun 28, 2006 at 20:23 UTC
This has been an educational discussion... how about a twist? I am looking to parse a large file and extract blocks of text that begin with the word term. I can't always anticipate how the block will end, other than by stating that whenever the word term appears, a new block begins. Is there a way to create an array where each element is a text block that begins with the word term and ends immediately before the next occurrence of the word term?
`example file: term { yada yada 12345 () ... } term only occurs here { could be 30 lines here but never that word again until another block starts yadada } term, etc. _END_`
So this file would hopefully result in an array with 3 elements. Another challenge is that the last text block will not have the word term at the end of it. Thanks in advance :-) ad3
Re^3: Extracting blocks of text by BrowserUk (Patriarch) on Jun 28, 2006 at 21:37 UTC
Assuming the file is small enough to slurp, then split does the job nicely:
`#! perl -slw use strict; my @array = split 'term', do{ local $/; <DATA> }; shift @array; ## Discard leading null print '---', "\n", $_, "\n" for @array; __DATA__ term { yada yada 12345 () ... } term only occurs here { could be 30 lines here but never that word again until another block starts yadada } term, etc.`
That discards the term itself. If you want to retain the term in each element, then perhaps the simplest way is to just put it back after the split. Just substitute this line into the above:
`my @array = map{ "term$_" } split 'term', do{ local $/; <DATA> };`
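Another way to keep the term, sketched here as an alternative rather than as what the post above does, is to split on a zero-width lookahead. The pattern `(?=\bterm\b)` matches the position just before each whole word "term" without consuming it, so every element starts with the term. The sub name `split_on_term` and the word-boundary anchors are assumptions added for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: splits slurped text into blocks that each begin
# with the whole word "term". Text before the first "term" is dropped.
sub split_on_term {
    my ($text) = @_;
    # Zero-width lookahead: split *before* each "term", keeping the word.
    my @blocks = split /(?=\bterm\b)/, $text;
    # If the text did not start with "term", the first element is preamble.
    shift @blocks if @blocks && $blocks[0] !~ /\Aterm\b/;
    return @blocks;
}

my $text = do { local $/; <DATA> };    # slurp the sample data
my @array = split_on_term($text);
print '---', "\n", $_, "\n" for @array;

__DATA__
term { yada yada 12345 () ... }
term only occurs here { could be 30 lines here
but never that word again until another block starts yadada }
term, etc.
```

A zero-width match at the very start of the string does not produce an empty leading field, so no `shift` is needed when the data begins with the term. The `\b` anchors keep words such as "terminal" from starting a new block.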
Re: Extracting blocks of text by pelagic (Priest) on Jan 30, 2004 at 14:17 UTC
Here is an easy solution:
`#!/usr/bin/perl use strict; my $inputfile = shift; my $withinBlock = 0; open (IN, "<$inputfile") || die "could not open $inputfile\n"; while (<IN>) { if (/head/) { $withinBlock = 1; print $_; if (/tail/) { $withinBlock = 0; print "\n"; } } if ($withinBlock) { print $_; if (/tail/) { $withinBlock = 0; print "\n"; } } } close (IN);`
I ran it with this file:
bla head gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla tail gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus head bla bla gugus gugus tail bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus head bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus tail gugus gugus
and it printed:
bla head gugus gugus bla bla gugus gugus bla head gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla tail gugus bla bla gugus gugus bla bla gugus head bla bla gugus gugus tail bla bla gugus gugus bla bla gugus gugus head bla bla gugus gugus bla bla gugus gugus head bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus tail gugus gugus
It does not work properly if there is a head after a tail on the same line ...
pelagic
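The same-line case mentioned above (a head following a tail on one line) is awkward for any line-by-line test. One way around it, assuming the file is small enough to read into memory, is to match across the whole text with a non-greedy pattern. This is a different technique from the loop above and it returns only the text from marker to marker, not the complete lines the markers sit on. The sub name `blocks_in_text` is made up for this sketch.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns every head...tail span in the text.
# /s lets "." match newlines; ".*?" stops at the first tail after a head.
sub blocks_in_text {
    my ($text) = @_;
    my @blocks = $text =~ /(head.*?tail)/gs;
    return @blocks;
}

# A tail and the next head share the first line here.
my $text = "bla head one tail bla head two\nstill two\ntail bla\n";
print "[$_]\n" for blocks_in_text($text);
```

As with the line-based versions, the unanchored words will also match inside longer words; adding `\b` around each marker is one way to tighten that if the data calls for it.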
Re: Re: Extracting blocks of text by walker (Initiate) on Feb 01, 2004 at 03:40 UTC
This one worked GREAT !!! I need to print 5 lines after the "tail" key word... and I don't understand why there are 2 tests for tail and 2 print commands ?
Re: Re: Re: Extracting blocks of text by graff (Chancellor) on Feb 02, 2004 at 04:45 UTC
I need to print 5 lines after the "tail" key word...
Why didn't you say so in the first place? That would change how people answer the question.
and I don't understand why there are 2 tests for tail and 2 print commands ?
Well, actually, there's no need for the duplication. The following would work just as well -- and would cover your little "amendment" to the original spec:
`#!/usr/bin/perl use strict; my $inputfile = shift; my $withinBlock = 0; open (IN, "<$inputfile") || die "could not open $inputfile\n"; while (<IN>) { if (/head/) { $withinBlock = 6; } if ($withinBlock) { print $_; $withinBlock-- unless $withinBlock == 6; } if (/tail/) { $withinBlock = 5; } } close (IN);`
Note that if there is a new "head" line within the five lines that follow a "tail", the $withinBlock state variable gets reset to 6, and will stay there till the next "tail". If there is no "head" within the next five lines, it will decrement to 0, turning off the output.
Another "feature" of this version is that if there is a "tail" line without a previous "head", the five lines following "tail" will still get printed.
One more thing: since the head and tail regexes are not anchored, the logic will fire whenever these words happen to show up in the data -- e.g.:
blah blah head This is a bunch of text in a target block. It includes excerpts from a book on animals, which have tails. So this line will cause the output to be turned off after the next five lines, i.e. here. So you won't get to see this line or this one. tail But you'll see this one and these lines too. Now the output is off again, but since we're talking about animals, which all have heads, the output is now on again, and you see the previous and current lines, as well as this and the next two...
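To avoid the two surprises described above, the markers can be anchored and the trailer armed only when a block is actually open. The sketch below assumes the markers sit at the start of their lines as whole words (`^head\b`, `^tail\b`); that is an assumption about the data, and the sub name `extract_with_trailer` is invented for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns the lines of each head..tail block plus
# $after lines following the tail line.
sub extract_with_trailer {
    my ( $lines, $after ) = @_;
    my $remaining = 0;    # -1: inside a block; N > 0: trailer lines left
    my @out;
    for my $line (@$lines) {
        $remaining = -1 if $line =~ /^head\b/;    # a block starts
        if ($remaining) {
            push @out, $line;
            $remaining-- if $remaining > 0;       # count down the trailer
        }
        # Only a tail that closes an open block arms the trailer, so a
        # stray tail without a preceding head prints nothing.
        $remaining = $after if $remaining < 0 && $line =~ /^tail\b/;
    }
    return @out;
}

my @lines = (
    "intro\n", "head start\n", "animals have tails\n", "tail end\n",
    "after 1\n", "after 2\n", "after 3\n",
);
print extract_with_trailer( \@lines, 2 );
```

Here "animals have tails" does not end the block, because the word is neither at the start of the line nor a whole word.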
Re: Re: Re: Re: Extracting blocks of text by walker (Initiate) on Feb 02, 2004 at 14:17 UTC
Re: Extracting blocks of text by mr_mischief (Monsignor) on Jan 30, 2004 at 14:41 UTC
This is a classic case for use of a flag variable.
`# init variable to show we're not in the block my $in_block = 0; while ( <> ) # process line by line { $in_block = 1 if /^head/; # test for start of block and # set flag true if needed print if $in_block; # print if we're in the block $in_block = 0 if /tail$/; # test for end of block and # set flag false if needed }`
Sorry if I misunderstood your question, but according to the way I read it I think this is close. Given this file:
`fvewvwef vfewejmnvwev evfjerwvnrevjwe wervkjvwe wevrjvrenwvr head vfjlevnerojvnerve head refejrverjvnerjovnerojvn ercjncer rljnelrkvnervervekjnve tail fknvbekjev nweclkneclknerclkernclenelrknclencekn cwlknelcnlcwnejnrjnrjcnjcncjncnccjn tail vjenvlejnvlejnrvlejnvejnvejnvejnvejvnejv head efcjonecjnercjnerjcnerjnc crjencerjncejlrcn tail`
I get this output:
`head refejrverjvnerjovnerojvn ercjncer rljnelrkvnervervekjnve tail fknvbekjev nweclkneclknerclkernclenelrknclencekn cwlknelcnlcwnejnrjnrjcnjcncjncnccjn tail head efcjonecjnercjnerjcnerjnc crjencerjncejlrcn tail`
Sometimes a simple procedural style works really well, even if you have bells and whistles available. This could be written the same in almost any language. Perl just makes it easier.
Christopher E. Stith