in reply to Re: Avoid memory error while extracting text block
in thread Avoid memory error while extracting text block

Hi BrowserUK, included a sample of the text below, with some flexibility for the starting and ending points. Does this help? Thanks so much!

<DOCUMENT>                      [could start as early as here]
<TYPE>EX-21
<SEQUENCE>7
<FILENAME>v144610_ex21.htm      [start no later than here]
<TEXT>
<html>
Whole buncha stuff in the middle
</html>                         [end here, or]
</TEXT>                         [end here, or]
</DOCUMENT>                     [end here at the latest]
<DOCUMENT>
<TYPE>EX-23.1
<SEQUENCE>8
<FILENAME>v144610_ex23-1.htm

Re^3: Avoid memory error while extracting text block
by BrowserUk (Patriarch) on Apr 14, 2016 at 04:53 UTC

    If the intent is to split this compound file into separate files, I think I'd do something like this:

    #! perl -sw
    use strict;

    # assuming the name of the compound file is supplied on the command line
    # (no -l switch: the lines read below keep their own newlines)

    until( eof() ) {    ## eof() with empty parens opens the next ARGV file if needed
        my @buffer = scalar( <> );                  ## Put the <DOCUMENT> tag into a clean buffer
        push @buffer, scalar( <> ), scalar( <> );   ## Ditto <TYPE> & <SEQUENCE>
        my( $filename ) = ( my $line = <> ) =~ m[<FILENAME>(\S+)];
        push @buffer, $line;
        open OUT, '>', $filename or die $!;
        print OUT for @buffer;                      ## header lines collected so far
        print OUT until ( $_ = <> ) =~ m[</DOCUMENT>];   ## copy body up to the closing tag
        close OUT;
    }

    Note: That is untested code and will probably need tweaks. E.g. I'm not convinced that it will print the final tag to the output files; but then maybe you'd want to strip those anyway.
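For anyone who does want the closing tag kept, here is a minimal sketch of the same line-at-a-time idea. It is not BrowserUk's code: the helper name split_compound, the output directory argument, and the in-memory sample are all made up for illustration, and it assumes every tag sits on its own line and that the <FILENAME> values are safe to use as file names.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical helper: copy each <DOCUMENT> block from $in into its own
# file under $dir, named after the block's <FILENAME> line.
# Only one line (plus a few header lines) is held in memory at a time.
sub split_compound {
    my ($in, $dir) = @_;
    my ($out, @header, @written);
    while ( my $line = <$in> ) {
        if ( !$out ) {
            # Ignore anything before <DOCUMENT>, then hold the header
            # lines until the <FILENAME> line names the output file.
            next unless @header or $line =~ m[<DOCUMENT>];
            push @header, $line;
            next unless $line =~ m[<FILENAME>(\S+)];
            my $path = File::Spec->catfile( $dir, $1 );
            open $out, '>', $path or die "Cannot write $path: $!";
            print {$out} @header;
            push @written, $path;
            @header = ();
            next;
        }
        print {$out} $line;            # body lines, closing tag included
        if ( $line =~ m[</DOCUMENT>] ) {
            close $out or die "Cannot close output: $!";
            undef $out;
        }
    }
    close $out if $out;                # input ended in the middle of a block
    return @written;
}

# Demo on an in-memory "file" so the sketch runs without any real input.
my $sample = join "\n",
    '<DOCUMENT>', '<TYPE>EX-21', '<SEQUENCE>7', '<FILENAME>v144610_ex21.htm',
    '<TEXT>', '<html>stuff</html>', '</TEXT>', '</DOCUMENT>',
    '<DOCUMENT>', '<TYPE>EX-23.1', '<SEQUENCE>8', '<FILENAME>v144610_ex23-1.htm',
    '<TEXT>', '<html>more</html>', '</TEXT>', '</DOCUMENT>', '';
open my $in, '<', \$sample or die "Could not open in-memory file: $!";
my $dir = tempdir( CLEANUP => 1 );     # scratch directory, removed at exit
print "$_\n" for split_compound( $in, $dir );
close $in;
```

Because the loop stops on end of input rather than on a tag match, a truncated final block just produces a short file instead of a hang.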


      Thank you!!
Re^3: Avoid memory error while extracting text block
by NetWallah (Canon) on Apr 14, 2016 at 03:48 UTC
    If the "end of block" identifier cannot occur inside the document, you could read the file "one document at a time":
    open my $fh, "<", "File/name" or die "Could not open file:$!";
    $/ = "</DOCUMENT>";
    while (my $doc = <$fh>) {
        # Process, or search through contents of $doc....
    }
    close $fh;
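To make the "process" step concrete, here is a small sketch built on the same $/ trick. The helper name list_documents and the in-memory sample are assumptions for illustration only; the tag names come from the sample posted at the top of the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: read one <DOCUMENT>...</DOCUMENT> record at a time
# from an already-open filehandle and return [type, filename] pairs.
sub list_documents {
    my ($fh) = @_;
    local $/ = "</DOCUMENT>";    # record separator, restored when the sub returns
    my @found;
    while ( my $doc = <$fh> ) {
        # The last "record" is whatever trails the final </DOCUMENT>; skip it.
        next unless $doc =~ m[<DOCUMENT>];
        my ($type) = $doc =~ m[<TYPE>(\S+)];
        my ($name) = $doc =~ m[<FILENAME>(\S+)];
        push @found, [ $type, $name ];
    }
    return @found;
}

# Demo on an in-memory "file" so the sketch runs without any real input.
my $sample = join "\n",
    '<DOCUMENT>', '<TYPE>EX-21', '<SEQUENCE>7', '<FILENAME>v144610_ex21.htm',
    '<TEXT>', '<html>stuff</html>', '</TEXT>', '</DOCUMENT>',
    '<DOCUMENT>', '<TYPE>EX-23.1', '<SEQUENCE>8', '<FILENAME>v144610_ex23-1.htm',
    '<TEXT>', '<html>more</html>', '</TEXT>', '</DOCUMENT>', '';
open my $fh, '<', \$sample or die "Could not open in-memory file: $!";
printf "%s -> %s\n", @$_ for list_documents($fh);
close $fh;
```

Using local $/ inside a sub keeps the changed record separator from leaking into other reads in the same program. Memory use is bounded by the largest single document rather than by the whole file.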


Re^3: Avoid memory error while extracting text block
by wrkrbeee (Scribe) on Apr 14, 2016 at 01:55 UTC
    Question: at the beginning of your code, you have a WHILE statement like so: while ( $_ = <$FH_IN> ) !~ /START TOKEN/; I'm having difficulty interpreting this statement. So, here's my take, which I know is incorrect: while the Perl special operator has the same value as the current file handle/name, then I see the complement to the binding operator (i.e., does not bind) to my start token. I can't get there. What is the statement telling me?

      Common usage has conditioned us to believe that the test condition of the while loop is what's inside the parentheses. In this case, the negative match operator is part of it as well. The body of the while loop is the 1.

      1 while ($_ = <$FH_IN>) !~ /START TOKEN/;

      is the equivalent of:

      while (($_ = <$FH_IN>) !~ /START TOKEN/) { 1 }

      You can get Perl to show you how it interprets a statement like that using the B::Deparse module.

      $: perl -MO=Deparse -e '1 while ($_ = <>) !~ /START TOKEN/'
      '???' until ($_ = <ARGV>) =~ /START TOKEN/;
      -e syntax OK

      Perl turns the while with a negative match operator into an until with the positive operator.
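If it helps to watch the idiom run, here is a small self-contained sketch. The helper name skip_to_token and the sample text are made up for illustration. It also adds a defined() guard that the one-liner does not have: as written, the one-liner never sees a false condition if the start token is missing, because an undefined line simply fails to match.

```perl
use strict;
use warnings;

# Hypothetical helper: discard lines until one matches $start_re,
# then return that line (or undef if the input runs out first).
sub skip_to_token {
    my ($fh, $start_re) = @_;
    my $line;
    # Same shape as the one-liner: the loop body is the do-nothing "1",
    # and all the work (read, assign, test) happens in the condition.
    1 while defined( $line = <$fh> ) && $line !~ $start_re;
    return $line;
}

my $text = "junk\nmore junk\nSTART TOKEN here\npayload\n";
open my $fh, '<', \$text or die "Could not open in-memory file: $!";
my $hit  = skip_to_token( $fh, qr/START TOKEN/ );
my $next = <$fh>;              # the filehandle now sits just past the token line
print "matched: $hit";
print "next:    $next";
close $fh;
```

After the loop, the matching line is in hand and the filehandle is positioned on the line that follows it, which is why the idiom is handy for skipping a preamble before the real processing starts.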
