in reply to Avoid memory error while extracting text block

If the tokens you use to identify the start and end of the block are guaranteed not to cross line boundaries, line-by-line discovery is relatively simple:

    open INFILE, ...;
    my @block;
    1 while ( $_ = <INFILE> ) !~ /START TOKEN/;
    push @block, $_;
    push @block, $_ while ( $_ = <INFILE> ) !~ /END TOKEN/;
    close INFILE;
    ## Use @block perhaps trimming first and last lines first.

Of course, that gets a little more complicated if the start token can appear more than once without a matching end token.
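
One way to cope with a repeated start token is sketched below. It assumes that a fresh start token should simply discard whatever has been collected so far, and that the start and end tokens never share a line. The helper name extract_block and the sample data are illustrative only, not part of the original code.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the lines of the first complete block,
# restarting collection whenever the start token shows up again.
# Assumes start and end tokens never appear on the same line.
sub extract_block {
    my ( $fh, $start_re, $end_re ) = @_;
    my @block;
    my $in_block = 0;
    while ( my $line = <$fh> ) {
        if ( $line =~ $start_re ) {
            @block    = ($line);    # a new start discards any partial block
            $in_block = 1;
        }
        elsif ($in_block) {
            push @block, $line;
            return @block if $line =~ $end_re;
        }
    }
    return;    # reached end of file without a complete block
}

# Demo on an in-memory file with an unmatched first start token.
my $text = "junk\nSTART TOKEN a\nx\nSTART TOKEN b\ny\nEND TOKEN\ntail\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
my @block = extract_block( $fh, qr/START TOKEN/, qr/END TOKEN/ );
close $fh;
print @block;
```

Only one candidate block is held in memory at a time, and the loop stops cleanly at end of file if no end token ever turns up.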

If the tokens can span lines, then a running buffer should work for most purposes. I.e., append each new line to the end of the buffer and search the buffer after each addition until the token is found; once the buffer is at least as long as the token, discard data from the beginning so that it never grows much beyond that length.
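
A minimal sketch of that running buffer follows. It is a character-based variant: rather than dropping whole lines, it keeps only the tail of the buffer that could still begin a match. The helper name find_spanning_token, the pattern, and the length bound are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: scan $fh until $token_re matches, even if the match
# straddles a line boundary. $max_len is an assumed upper bound (>= 2) on
# the length of the matched text, including any embedded newlines.
sub find_spanning_token {
    my ( $fh, $token_re, $max_len ) = @_;
    my $buffer = '';
    while ( my $line = <$fh> ) {
        $buffer .= $line;
        return $. if $buffer =~ $token_re;    # line on which the match completes
        # Keep only the tail that could still be the start of a match.
        $buffer = substr( $buffer, -( $max_len - 1 ) )
            if length($buffer) >= $max_len;
    }
    return;    # token not found
}

# Demo: the token "START TOKEN" is split across lines 2 and 3.
my $text = "alpha\nbeta START\nTOKEN gamma\ndelta\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
my $found_at = find_spanning_token( $fh, qr/START\s+TOKEN/, 32 );
close $fh;
print defined $found_at
    ? "found, match completes on line $found_at\n"
    : "not found\n";
```

Between reads the buffer holds fewer than $max_len characters, so memory use depends on the token length and the longest single line, not on the size of the file.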

Basically, you need to tell us more about the nature of the start & end tokens before we can help you further.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
In the absence of evidence, opinion is indistinguishable from prejudice.

Re^2: Avoid memory error while extracting text block
by wrkrbeee (Scribe) on Apr 14, 2016 at 01:01 UTC

    Hi BrowserUK, included a sample of the text below, with some flexibility for the starting and ending points. Does this help? Thanks so much!

    <DOCUMENT>   [could start as early as here]
    <TYPE>EX-21
    <SEQUENCE>7
    <FILENAME>v144610_ex21.htm   [start no later than here]
    <TEXT>
    <html>
    Whole buncha stuff in the middle
    </html>   [end here or
    </TEXT>   end here or
    </DOCUMENT>   end here at the latest]
    <DOCUMENT>
    <TYPE>EX-23.1
    <SEQUENCE>8
    <FILENAME>v144610_ex23-1.htm

      If the intent is to split this compound file into separate files, I think I'd do something like this:

      #! perl -sw
      use strict;

      # assuming the name of the compound file is supplied on the command line
      until( eof() ) {                                ## eof() also works before the first read
          my @buffer = scalar( <> );                  ## Put the <DOCUMENT> tag into a clean buffer
          push @buffer, scalar( <> ), scalar( <> );   ## Ditto <TYPE> & <SEQUENCE>
          my( $filename ) = ( my $line = <> ) =~ m[<FILENAME>(\S+)];
          push @buffer, $line;
          open OUT, '>', $filename or die $!;
          print OUT for @buffer;                      ## lines still carry their own newlines
          print OUT until ( $_ = <> ) =~ m[</DOCUMENT>];
          close OUT;
      }

      Note: That is untested code and will probably need tweaks. E.g., I'm not convinced that it will print the final tag to the output files; but then maybe you'd want to strip those anyway.


        Thank you!!
      If the "end of block" identifier cannot occur inside the document, you could read the file "one document at a time":
      open my $fh, "<", "File/name" or die "Could not open file:$!";
      $/="</DOCUMENT>";
      while (my $doc = <$fh>){
          # Process, or search through contents of $doc....
      }
      close $fh;

              This is not an optical illusion, it just looks like one.

      Question: at the beginning of your code, you have a WHILE statement like so: while ( $_ = <$FH_IN> ) !~ /START TOKEN/; I'm having difficulty interpreting this statement. So, here's my take, which I know is incorrect: while the Perl special operator has the same value as the current file handle/name, then I see the complement to the binding operator (i.e., does not bind) to my start token. I can't get there. What is the statement telling me?

        Common usage has conditioned us to believe that the test condition of the while loop is what's inside the parentheses. In this case, the negative match operator is part of it as well. The body of the while loop is the 1.

        1 while ($_ = <$FH_IN>) !~ /START TOKEN/;

        is the equivalent of:

        while ( ($_ = <$FH_IN>) !~ /START TOKEN/ ) { 1 }

        You can get Perl to show you how it interprets a statement like that using the B::Deparse module.

        $: perl -MO=Deparse -e '1 while ($_ = <>) !~ /START TOKEN/'
        '???' until ($_ = <ARGV>) =~ /START TOKEN/;
        -e syntax OK

        Perl turns the while with a negative match operator into an until with the positive operator.

        But God demonstrates His own love toward us, in that while we were yet sinners, Christ died for us. Romans 5:8 (NASB)