wrkrbeee has asked for the wisdom of the Perl Monks concerning the following question:

Hi Perl Monks, I'm working with large text files, and would like to extract a block of text (I think I can eventually manage to ID the beginning and end of the block). Slurping the files is not an option due to memory constraints. Line-by-line processing works for extracting a single line, but not multiple lines. Anyone have any recommendations for a strategy that allows extraction of the block while conserving memory? Thank you!

Replies are listed 'Best First'.
Re: Avoid memory error while extracting text block
by BrowserUk (Patriarch) on Apr 14, 2016 at 00:50 UTC

    If the tokens you use to identify the start and end of the block can be guaranteed not to cross line boundaries, line-by-line discovery is relatively simple:

    open INFILE, ...;
    my @block;
    1 while ( $_ = <INFILE> ) !~ /START TOKEN/;
    push @block, $_;
    push @block, $_ while ($_ = <INFILE>) !~ /END TOKEN/;
    close INFILE;
    ## Use @block perhaps trimming first and last lines first.

    Of course, that gets a little more complicated if the start token can appear more than once without a matching end token.
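
The repeated-start-token case mentioned above can be handled by restarting the block whenever a new start token turns up. What follows is only a minimal sketch under assumptions: `extract_block` is a hypothetical helper, the `START TOKEN` / `END TOKEN` patterns are placeholders, and the in-memory sample stands in for the real file. It also stops cleanly at end of file instead of looping when a token is missing.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the lines of the first complete block,
# or an empty list if no START..END pair is found before end of file.
sub extract_block {
    my ( $fh, $start_re, $end_re ) = @_;
    my @block;
    my $in_block = 0;
    while ( my $line = <$fh> ) {
        if ( $line =~ $start_re ) {
            @block    = ();    # a repeated start token restarts the block
            $in_block = 1;
        }
        next unless $in_block;
        push @block, $line;
        return @block if $line =~ $end_re;
    }
    return;                    # reached EOF without a complete block
}

# In-memory sample (made up) so the sketch runs on its own.
my $sample = join '', map { "$_\n" }
    'noise', 'START TOKEN a', 'stray', 'START TOKEN b', 'body', 'END TOKEN', 'tail';

open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my @block = extract_block( $fh, qr/START TOKEN/, qr/END TOKEN/ );
close $fh;

print @block;
```

Only the lines of the current candidate block are held in memory at any time, so the size of the file itself does not matter.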

    If the tokens can span lines, then a running buffer should work for most purposes. I.e., append each new line to the end of the buffer, discard lines from the beginning once the buffer has reached the minimum length of the token, and search the buffer after each new line is added, until the token is found.
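
A rough sketch of that running-buffer idea, under stated assumptions: `find_spanning_token` is a hypothetical helper, the token is given as a regex, and `$max_lines` is an assumed upper bound on how many lines the token can span (the original post does not say what that bound is).

```perl
use strict;
use warnings;

# Hypothetical helper: returns the line number at which a match of
# $token_re is first detected, or undef if the token never appears.
sub find_spanning_token {
    my ( $fh, $token_re, $max_lines ) = @_;
    my @window;
    while ( my $line = <$fh> ) {
        push @window, $line;                          # append the new line
        shift @window while @window > $max_lines;     # discard the oldest lines
        return $. if join( '', @window ) =~ $token_re;
    }
    return;
}

# Made-up sample in which the token "BEGIN BLOCK" is split across two lines.
my $sample = "alpha\nbeta BEGIN\nBLOCK gamma\ndelta\n";

open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my $line_no = find_spanning_token( $fh, qr/BEGIN\s+BLOCK/, 2 );
close $fh;

print defined $line_no ? "token detected at line $line_no\n" : "token not found\n";
```

The window never grows beyond `$max_lines` lines, which is what keeps the memory use flat.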

    Basically, you need to tell us more about the nature of the start & end tokens before we could help you further.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
    In the absence of evidence, opinion is indistinguishable from prejudice.

      Hi BrowserUK, included a sample of the text below, with some flexibility for the starting and ending points. Does this help? Thanks so much!

      <DOCUMENT>                    [could start as early as here]
      <TYPE>EX-21
      <SEQUENCE>7
      <FILENAME>v144610_ex21.htm    [start no later than here]
      <TEXT>
      <html>
      Whole buncha stuff in the middle
      </html>                       [end here or
      </TEXT>                        end here or
      </DOCUMENT>                    end here at the latest]
      <DOCUMENT>
      <TYPE>EX-23.1
      <SEQUENCE>8
      <FILENAME>v144610_ex23-1.htm

        If the intent is to split this compound file into separate files, I think I'd do something like this:

        #! perl -sw
        use strict;

        # Assuming the name of the compound file is supplied on the command line.
        # (-l dropped: the lines are never chomped here, so it would double every newline.)

        until( eof() ) {    ## eof() rather than eof( ARGV ): ARGV is not open before the first read
            my @buffer = scalar( <> );                  ## Put the <DOCUMENT> tag into a clean buffer (scalar: one line only)
            push @buffer, scalar( <> ), scalar( <> );   ## Ditto <TYPE> & <SEQUENCE>
            my( $filename ) = ( my $line = <> ) =~ m[<FILENAME>(\S+)];
            push @buffer, $line;
            open OUT, '>', $filename or die $!;
            print OUT for @buffer;
            print OUT until ( $_ = <> ) =~ m[</DOCUMENT>];
            close OUT;
        }

        Note: That is untested code and will probably need tweaks. E.g., I'm not convinced that it will print the final tag to the output files; but then maybe you'd want to strip those anyway.


        If the "end of block" identifier cannot occur inside the document, you could read the file "one document at a time":
        open my $fh, "<", "File/name" or die "Could not open file:$!";
        $/="</DOCUMENT>";
        while (my $doc = <$fh>){
            # Process, or search through contents of $doc....
        }
        close $fh;
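
To make the "one document at a time" idea concrete, here is a small sketch under assumptions: the sample data is modelled on the excerpt posted earlier in the thread, with the second document's `<TEXT>` section invented for illustration, and the regexes assume `<FILENAME>` and `<TEXT>` each appear once per document. `local` keeps the change to `$/` confined to the block.

```perl
use strict;
use warnings;

# Sample modelled on the posted excerpt; the second <TEXT> body is made up.
my $sample = <<'END_SAMPLE';
<DOCUMENT>
<TYPE>EX-21
<SEQUENCE>7
<FILENAME>v144610_ex21.htm
<TEXT>
<html>
Whole buncha stuff in the middle
</html>
</TEXT>
</DOCUMENT>
<DOCUMENT>
<TYPE>EX-23.1
<SEQUENCE>8
<FILENAME>v144610_ex23-1.htm
<TEXT>
second body
</TEXT>
</DOCUMENT>
END_SAMPLE

open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";

my @seen;    # filenames found, in file order
{
    local $/ = "</DOCUMENT>";    # one record per document; restored on scope exit
    while ( my $doc = <$fh> ) {
        # Skip records without a filename (e.g. trailing whitespace after the last tag).
        my ($name) = $doc =~ m{<FILENAME>(\S+)}          or next;
        my ($body) = $doc =~ m{<TEXT>\n?(.*?)</TEXT>}s   or next;
        push @seen, $name;
        print "$name: ", length($body), " characters of text\n";
    }
}
close $fh;
```

Only one document is held in memory at a time, so this works as long as no single document is too large to slurp.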

                This is not an optical illusion, it just looks like one.

        Question: at the beginning of your code, you have a WHILE statement like so: while ( $_ = <$FH_IN> ) !~ /START TOKEN/; I'm having difficulty interpreting this statement. So, here's my take, which I know is incorrect: while the Perl special operator has the same value as the current file handle/name, then I see the complement to the binding operator (i.e., does not bind) to my start token. I can't get there. What is the statement telling me?
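
One way to read that statement: `( $_ = <$FH_IN> )` reads the next line into `$_`, the assignment yields `$_` itself, `!~ /START TOKEN/` is true while that line does not match, and `1 while ...` repeats the do-nothing expression `1` for as long as the condition holds. In other words, it skips lines until one contains the start token. The sketch below unrolls it into a plain loop; the sample data is made up, and the end-of-file guard is an addition that the compact form does not have.

```perl
use strict;
use warnings;

# Made-up sample input so the sketch runs on its own.
my $sample = "one\ntwo\nSTART TOKEN here\nfour\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";

# Compact form from the thread:
#     1 while ( $_ = <$fh> ) !~ /START TOKEN/;
# Unrolled equivalent, plus an EOF guard:
my $line;
while (1) {
    $line = <$fh>;                     # read the next line (undef at end of file)
    last unless defined $line;         # guard added here; not in the compact form
    last if $line =~ /START TOKEN/;    # stop once the start token is seen
}
close $fh;

print defined $line ? "stopped at: $line" : "no start token found\n";
```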