Avoid memory error while extracting text block

wrkrbeee has asked for the wisdom of the Perl Monks concerning the following question:

Re: Avoid memory error while extracting text block by BrowserUk (Patriarch) on Apr 14, 2016 at 00:50 UTC
If the tokens you use to identify the start and end of the block can be guaranteed not to cross line boundaries, line-by-line discovery is relatively simple:

```
open INFILE, ...;
my @block;
## Skip lines until the start token is seen. NB: this spins forever if the
## token never appears, because <INFILE> keeps returning undef at end of file.
1 while ( $_ = <INFILE> ) !~ /START TOKEN/;
push @block, $_;
## Collect lines until the end token is seen (same end-of-file caveat).
push @block, $_ while ( $_ = <INFILE> ) !~ /END TOKEN/;
close INFILE;
## Use @block perhaps trimming first and last lines first.
```

Of course, that gets a little more complicated if the start token can appear more than once without a matching end token.

If the tokens can span lines, then a running buffer should work for most purposes. I.e., append new lines to the end and discard lines from the beginning once the buffer has reached the minimum length of the token, then search the buffer after each new line is added, until the token is found.

Basically, you need to tell us more about the nature of the start & end tokens before we could help you further.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
In the absence of evidence, opinion is indistinguishable from prejudice.
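The running-buffer idea for tokens that span lines could be sketched roughly as follows. This is only an illustration under stated assumptions: the token is a literal string (not a regex), and the sub name `skip_past_token`, the sample text and the token itself are all made up for the example. The buffer is kept as one string and trimmed to the last `length(token) - 1` characters after each failed search, so it never holds much more than one line.

```perl
use strict;
use warnings;

# Hypothetical helper: read from $fh until the literal $token has been seen,
# even if it straddles a line boundary. Returns the text that follows the
# token on the line where it ended, or undef if end of file comes first.
sub skip_past_token {
    my ( $fh, $token ) = @_;
    my $keep   = length($token) - 1;    # longest tail that could begin a match
    my $buffer = '';
    while ( defined( my $line = <$fh> ) ) {
        $buffer .= $line;               # append the new line to the end
        my $pos = index( $buffer, $token );
        return substr( $buffer, $pos + length($token) ) if $pos >= 0;
        # Discard everything that can no longer be part of a match.
        if ( length($buffer) > $keep ) {
            $buffer = $keep > 0 ? substr( $buffer, -$keep ) : '';
        }
    }
    return;                             # token never found
}

# Made-up sample: the token "START\nTOKEN" is split across two lines.
my $text = "junk\nmore junk START\nTOKEN payload\nrest\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";

my $tail = skip_past_token( $fh, "START\nTOKEN" );
print defined $tail ? "found; remainder of line: $tail" : "not found\n";
```

After the call the filehandle is positioned just after the line on which the token ended, so the caller can carry on reading the block from there. A regex token would need a different trimming rule, since its minimum match length is not simply `length($token)`.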
Re^2: Avoid memory error while extracting text block by wrkrbeee (Scribe) on Apr 14, 2016 at 01:01 UTC
Hi BrowserUK, included a sample of the text below, with some flexibility for the starting and ending points. Does this help? Thanks so much!

```
<DOCUMENT>                      [could start as early as here]
<TYPE>EX-21
<SEQUENCE>7
<FILENAME>v144610_ex21.htm      [start no later than here]
<TEXT>
<html>
Whole buncha stuff in the middle
</html>                         [end here or
</TEXT>                          end here or
</DOCUMENT>                      end here at the latest]
<DOCUMENT>
<TYPE>EX-23.1
<SEQUENCE>8
<FILENAME>v144610_ex23-1.htm
```
Re^3: Avoid memory error while extracting text block by BrowserUk (Patriarch) on Apr 14, 2016 at 04:53 UTC
If the intent is to split this compound file into separate files, I think I'd do something like this:

```
#! perl -w
use strict;

# assuming the name of the compound file is supplied on the command line
until( eof() ) {
    my @buffer = scalar( <> );                 ## Put the <DOCUMENT> tag into a clean buffer
    push @buffer, scalar( <> ), scalar( <> );  ## Ditto <TYPE> & <SEQUENCE>
    my( $filename ) = ( my $line = <> ) =~ m[<FILENAME>(\S+)];
    push @buffer, $line;
    open OUT, '>', $filename or die $!;
    print OUT for @buffer;
    ## stop at end of file as well as at the closing tag
    print OUT while defined( $_ = <> ) and $_ !~ m[</DOCUMENT>];
    close OUT;
}
```

Note: That is untested code and will probably need tweaks. Eg. I'm not convinced that it will print the final tag to the output files; but then maybe you'd want to strip those anyway.
Re^4: Avoid memory error while extracting text block by wrkrbeee (Scribe) on Apr 14, 2016 at 12:59 UTC
Re^3: Avoid memory error while extracting text block by NetWallah (Canon) on Apr 14, 2016 at 03:48 UTC
If the "end of block" identifier cannot occur inside the document, you could read the file "one document at a time":

```
open my $fh, "<", "File/name" or die "Could not open file:$!";
$/="</DOCUMENT>";
while (my $doc = <$fh>){
    # Process, or search through contents of $doc....
}
close $fh;
```

This is not an optical illusion, it just looks like one.
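As a rough illustration of the "one document at a time" approach, the sketch below sets `$/` with `local` inside a block so the change does not leak into the rest of the program, and pulls the `<FILENAME>` and `<TYPE>` values out of each record. The sample data is an assumption modelled on the layout posted earlier in the thread (the body of the second document is invented), and the `%type_of` hash is a hypothetical name used only here.

```perl
use strict;
use warnings;

# Assumed sample data, modelled on the layout shown earlier in the thread.
my $sample = join "\n",
    '<DOCUMENT>', '<TYPE>EX-21', '<SEQUENCE>7', '<FILENAME>v144610_ex21.htm',
    '<TEXT>', '<html>', 'Whole buncha stuff in the middle', '</html>',
    '</TEXT>', '</DOCUMENT>',
    '<DOCUMENT>', '<TYPE>EX-23.1', '<SEQUENCE>8', '<FILENAME>v144610_ex23-1.htm',
    '<TEXT>', 'second body', '</TEXT>', '</DOCUMENT>', '';

open my $fh, '<', \$sample or die "Could not open sample: $!";

my %type_of;    # hypothetical result: filename => document type
{
    local $/ = '</DOCUMENT>';    # previous value is restored at the end of the block
    while ( my $doc = <$fh> ) {
        # The trailing newline after the last </DOCUMENT> arrives as a
        # final, tag-less record; skip anything without a filename.
        next unless $doc =~ m[<FILENAME>(\S+)];
        my $filename = $1;
        my ($type) = $doc =~ m[<TYPE>(\S+)];
        $type_of{$filename} = $type;
    }
}
close $fh;

print "$_ => $type_of{$_}\n" for sort keys %type_of;
```

Only one record is held in memory at a time, so peak usage is bounded by the largest single document rather than by the whole file.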
Re^3: Avoid memory error while extracting text block by wrkrbeee (Scribe) on Apr 14, 2016 at 01:55 UTC
Question: at the beginning of your code, you have a WHILE statement like so: `while ( $_ = <$FH_IN> ) !~ /START TOKEN/;` I'm having difficulty interpreting this statement. So, here's my take, which I know is incorrect: while the Perl special operator has the same value as the current file handle/name, then I see the complement to the binding operator (i.e., does not bind) to my start token. I can't get there. What is the statement telling me?
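One way to read the statement quoted above: `( $_ = <$FH_IN> )` reads the next line from the filehandle and assigns it to `$_`, and `!~ /START TOKEN/` is true as long as that line does not match the pattern. With `1 while ...` in front, the loop body does nothing, so the statement just keeps reading lines until one matches. A sketch of the same logic unrolled into an explicit loop, using made-up sample data and an extra end-of-file guard that the one-liner does not have:

```perl
use strict;
use warnings;

# Made-up input; the third line contains the start token.
my $data = "noise\nmore noise\nSTART TOKEN here\nbody\n";
open my $FH_IN, '<', \$data or die "Cannot open in-memory file: $!";

my $skipped = 0;    # counter added only to make the behaviour visible
while (1) {
    $_ = <$FH_IN>;                  # 1. read the next line and store it in $_
    last unless defined $_;         #    extra guard: stop at end of file
    last if $_ =~ /START TOKEN/;    # 2. stop once the line matches the pattern
    $skipped++;                     # 3. otherwise discard the line and repeat
}

print "skipped $skipped lines; stopped at: ", $_ // "EOF\n";
```

When the loop ends, `$_` holds the line that contained the start token, which is why the original code can push `$_` onto `@block` straight afterwards.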
Re^4: Avoid memory error while extracting text block by GotToBTru (Prior) on Apr 14, 2016 at 03:41 UTC