in reply to Extracting blocks of text

You can set $/ (see perlvar) to a string to control what the diamond operator sees as a line ending. By setting it to 'head' and then 'tail' alternately, you can move through your large file in chunks, discarding the 1st, 3rd, 5th, etc. and printing the 2nd, 4th, 6th, etc.

#! perl -slw
use strict;

open IN, '<', $ARGV[ 0 ] or die $!;

$/ = 'head';
while( <IN> ) {
    local $/ = 'tail';
    print scalar <IN>;
}
close IN;

__END__
P:\test>type junk.txt
The quick brown fox jumps over the lazy dog 0001
head
The quick brown fox jumps over the lazy dog 0002
The quick brown fox jumps over the lazy dog 0003
The quick brown fox jumps over the lazy dog 0004
The quick brown fox jumps over the lazy dog 0005
tail
The quick brown fox jumps over the lazy dog 0006
The quick brown fox jumps over the lazy dog 0007
The quick brown fox jumps over the lazy dog 0008
headThe quick brown fox jumps over the lazy dog 0009
The quick brown fox jumps over the lazy dog 0010
tail
The quick brown fox jumps over the lazy dog 0011
The quick brown fox jumps over the lazy dog 0012

P:\test>235232 junk.txt

The quick brown fox jumps over the lazy dog 0002
The quick brown fox jumps over the lazy dog 0003
The quick brown fox jumps over the lazy dog 0004
The quick brown fox jumps over the lazy dog 0005
tail
The quick brown fox jumps over the lazy dog 0009
The quick brown fox jumps over the lazy dog 0010
tail

The caveat is that if the chunks you are discarding (between a 'tail' marker and the next 'head' marker) are very large, they will consume large amounts of memory, because each discarded chunk is read into $_ in one piece before it is thrown away.
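If the discarded stretches really are huge, one way around the memory problem is to go back to reading ordinary lines and track whether you are inside a block. The following is only a sketch, not a drop-in replacement: the sub name copy_blocks and its arguments are made up for illustration, and it assumes that no marker is ever split across a line break. Like the code above, it drops the 'head' marker and keeps the 'tail' marker.

```perl
use strict;
use warnings;

# Copy the text between $head and $tail from $in to $out, one line at a
# time, so that discarded text never accumulates in memory.
# Assumption: a marker never spans a line break.
sub copy_blocks {
    my ( $in, $out, $head, $tail ) = @_;
    my $inside = 0;
    while ( defined( my $line = <$in> ) ) {
        while ( length($line) ) {
            if ( !$inside ) {
                my $pos = index( $line, $head );
                last if $pos < 0;              # no head here: discard the rest
                $line   = substr( $line, $pos + length($head) );
                $inside = 1;
            }
            else {
                my $pos = index( $line, $tail );
                if ( $pos < 0 ) {              # block continues on the next line
                    print {$out} $line;
                    last;
                }
                my $end = $pos + length($tail);
                print {$out} substr( $line, 0, $end ), "\n";
                $line   = substr( $line, $end );
                $inside = 0;
            }
        }
    }
    return;
}
```

Called as copy_blocks( \*STDIN, \*STDOUT, 'head', 'tail' ), it only ever holds a single line, at the cost of more code than the $/ trick.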

As implemented above, the 'head' marker is discarded, but the 'tail' marker is printed. Add or delete as necessary.
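For what it's worth, here is one way the adding and deleting might look. chomp removes a trailing copy of whatever $/ currently holds, so it will strip the 'tail' marker for you, and the 'head' marker can simply be prepended. The sub name read_blocks and the two keep flags are invented for this sketch.

```perl
use strict;
use warnings;

# Return the blocks found between $head and $tail on filehandle $fh.
# $keep_head and $keep_tail (hypothetical flags) decide whether each
# marker is included in the returned block.
sub read_blocks {
    my ( $fh, $head, $tail, $keep_head, $keep_tail ) = @_;
    my @blocks;
    local $/ = $head;
    while ( defined( my $skipped = <$fh> ) ) {    # text up to the next head
        local $/ = $tail;
        my $block = <$fh>;                         # text up to the next tail
        last unless defined $block;                # ran out of input: no more blocks
        chomp $block unless $keep_tail;            # chomp strips a trailing $/
        $block = $head . $block if $keep_head;
        push @blocks, $block;
    }
    return @blocks;
}
```

The defined check also avoids printing an undefined value when the file ends with text that follows the last 'tail'.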

This also assumes that by "including the lines the key words are on" you do not mean that you want any text preceding the 'head' marker when it appears in the middle of a line, nor any text following the 'tail' marker when that appears in the middle of a line.
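If you did mean whole lines, marker lines included, then the range (flip-flop) operator is probably the more natural tool. A rough sketch follows; the sub name marked_lines is made up, and the markers are treated as literal strings that may appear anywhere in a line.

```perl
use strict;
use warnings;

# Return every line from one containing $head through the next one
# containing $tail, inclusive, for each such range in the input.
# Note: the range operator keeps its state between calls, so an input
# that ends inside an unterminated range would carry over to the next call.
sub marked_lines {
    my ( $fh, $head, $tail ) = @_;
    my @kept;
    while ( defined( my $line = <$fh> ) ) {
        push @kept, $line
            if $line =~ /\Q$head\E/ .. $line =~ /\Q$tail\E/;
    }
    return @kept;
}
```

This reads one line at a time, so it does not have the memory caveat either, but it keeps everything on the marker lines, including text before 'head' and after 'tail'.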


Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
Timing (and a little luck) are everything!

Re^2: Extracting blocks of text
by adenardo (Initiate) on Jun 28, 2006 at 20:23 UTC
    This has been an educational discussion... how about a twist? I am looking to parse a large file and extract blocks of text that begin with the word term. I can't always anticipate how a block will end, other than by stating that whenever the word term appears, a new block begins. Is there a way to create an array where each element is a text block that begins with the word term and ends immediately before the next occurrence of the word term?
    example file:
    term { yada yada 12345 () ... }
    term only occurs here { could be 30 lines here but never that word again until another block starts yadada }
    term, etc.
    _END_
    So, this file would hopefully result in an array with 3 elements. Another challenge is that the last text block will not have the word term at the end of it. Thanks in advance :-) ad3

      Assuming the file is small enough to slurp, then split does the job nicely:

      #! perl -slw
      use strict;

      my @array = split 'term', do{ local $/; <DATA> };
      shift @array; ## Discard leading null

      print '---', "\n", $_, "\n" for @array;

      __DATA__
      term { yada yada 12345 () ... }
      term only occurs here { could be 30 lines here but never that word again until another block starts yadada }
      term, etc.

      That discards the term itself. If you want to retain the term in each element, then perhaps the simplest way is to just put it back after the split. Just substitute this line for the split line in the above.

      my @array = map{ "term$_" } split 'term', do{ local $/; <DATA> };
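Two possible variations, offered as sketches rather than the one true way. The first splits on a zero-width lookahead, so the word is never removed and does not have to be put back. The second does not slurp at all; it uses the word as the input record separator, which may suit a large file better. The sub names are invented for illustration, and both treat the word as a plain substring, just as split 'term' does, so it would also match inside a longer word.

```perl
use strict;
use warnings;

# Split a slurped string into blocks that each start with $word.
# The lookahead splits *before* each occurrence, so the word stays put.
sub blocks_from_string {
    my ( $text, $word ) = @_;
    my @blocks = split /(?=\Q$word\E)/, $text;
    # Drop any leading text that precedes the first occurrence of the word.
    shift @blocks if @blocks && index( $blocks[0], $word ) != 0;
    return @blocks;
}

# Same idea without slurping: read one record at a time, using the word
# as the input record separator.
sub blocks_from_handle {
    my ( $fh, $word ) = @_;
    local $/ = $word;
    my @blocks;
    my $leading = <$fh>;    # anything before the first $word; discarded
    while ( defined( my $record = <$fh> ) ) {
        chomp $record;      # strip the trailing separator, if present
        push @blocks, $word . $record;
    }
    return @blocks;
}
```

The last block needs no special handling in either version: split returns whatever follows the final match, and the final record read simply ends at end of file.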

      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.