wrkrbeee has asked for the wisdom of the Perl Monks concerning the following question:

Hello PerlMonks, I inherited a program that fails with an "Out of Memory" error. I modified the code to read one line at a time rather than store the entire file in a variable (or so I thought). The program runs, but it still hits the "Out of Memory" error (obviously I failed). I suspect that the problem lies with the foreach statement at line 59. I think reading all the file names into an array is unnecessary, so lines 51-59 could be dropped. However, my dilemma now is how to assign the filename to $file in lines 63 and 86. I apologize for asking experienced users such a trivial question. The program is below. I am grateful for any insight you may have. Thanks!!

#!/usr/bin/perl -w
#use strict;

# This program extracts data from an SEC filing, including chunks of text
use File::stat;

#This program is going to obtain and extract the entire audit opinion but you can
#extract whatever text you are interested in by changing the regular expressions for the start
#and end strings below.

#This program was written by Andy Leone, May 15 2007 and updated July 25, 2008.
#You are free to use this program for your own use. My only request is
#that you make an acknowledgement in any research manuscripts that
#benefit from the program.

my $startstring='((^\s*?)((We\s*(have|were)\s*(audited\s*the\s*(Statement\s*of\s*Financial\s*Condition|consolidated|accompanying|combined|balance\s*sheets)|(completed\s*|engaged\s*to\s*perform)\s*an\s*integrated\s*audit))|In\s*our\s*opinion,\s*the\s*(consolidated|accompanying)))';
my $startstringhtm='(((We\s*(have|were)\s*(audited\s*the\s*(Statement\s*of\s*Financial\s*Condition|consolidated|accompanying|combined|balance\s*sheets)|(completed\s*|engaged\s*to\s*perform)\s*an\s*integrated\s*audit))|In\s*our\s*opinion,\s*the\s*(consolidated|accompanying)))';

#Specify the end of the text you are looking for.
my $endstring='((^\s*)/s/|^\s*(Date:\s*)?(\d{1,2}\s*)?((January|February|March|April|May|June|July|August|September|October|November|December))\s*(\d{1,2},)?\s*\d{4}(\s*$|,\s{0,3}except|\s*/s|\s*s/|\d{1,2}))';
my $endstringhtm='((>|^\s*)(/s/)?\s*(Date:\s*)?(\d{1,2}\s*)?(January|February|March|April|May|June|July|August|September|October|November|December))\s*(&\w+?;\s*)?(\d{1,2},)?\s*\d{4}(\s*$|\s*[,\(]\s{0,3}except|\s*with\s*respect\s*to\s*our\s*opinion|<\/P>|<BR>|\s{0,1}\<\/FONT\>|\d{1,2})';

#Specify the directory containing the files that you want to read
my $direct="E:\\Research\\SEC filings 10K and 10Q\\Data\\Filing Docs\\2008test";
my $outfile="E:\\Research\\SEC filings 10K and 10Q\\Data\\Header Data\\Data2008test.txt";

#If Windows "\\", if Mac "/";
my $slash='\\';

$outfiler=">$outfile";
open(OUTPUT, "$outfiler") || die "file for 2006 1: $!";

#The following two steps open the directory containing the files you plan to read
#and then stores the name of each file in an array called @New1.
opendir(DIR1,"$direct")||die "Can't open directory";
my @New1=readdir(DIR1);

#We will now loop through each file. The file names
#have been stored in the array called @New1;
foreach $file(@New1)
{
#This prevents me from reading the first two entries in a directory . and ..;
if ($file=~/^\./){next;}

#Initialize the variable names.
my $cik=-99;
my $form_type="";
my $report_date=-99;
my $file_date=-99;
my $name="";
my $sic=-99;
my $HTML=0;
my $Audit_Opinion="Not Found";
my $Going_Concern=0;
my $ao="Not Found";
my $tree="Empty";
my $data="";

#Open the file and put the file in variable called $data
#$data will contain the entire filing
{
# this step removes the default end of line character (\n)
# so that the entire file can be read in at once.
local $/;

#read the contents into data
open (INPUT, "$direct$slash"."$file");
while ($data=<INPUT> ) {

#The following steps obtain basic data from the filings
if ($data=~m/<HTML>/i){$HTML=1;}
if($data=~m/^\s*CENTRAL\s*INDEX\s*KEY:\s*(\d*)/m){$cik=$1;}
if($data=~m/^\s*FORM\s*TYPE:\s*(.*$)/m){$form_type=$1;}
if($data=~m/^\s*CONFORMED\s*PERIOD\s*OF\s*REPORT:\s*(\d*)/m){$report_date=$1;}
if($data=~m/^\s*FILED\s*AS\s*OF\s*DATE:\s*(\d*)/m){$file_date=$1;}
if($data=~m/^\s*COMPANY\s*CONFORMED\s*NAME:\s*(.*$)/m){$name=$1;}
if($data=~m/^\s*STANDARD\s*INDUSTRIAL\s*CLASSIFICATION:.*?\[(\d{4})/m){$sic=$1;}

my $filesize = -s $direct . '/' . $file;
my $sb = stat($direct . '/' . $file)->size;
print OUTPUT "$cik,$form_type,$report_date,$file_date,$name,$sic,$filesize,$sb\n";
}
close INPUT or die "cannot close $file: $!";
}
}

Replies are listed 'Best First'.
Re: READ one line at a time
by BrowserUk (Patriarch) on Dec 27, 2014 at 16:19 UTC

    You need to comment out or remove this line:

    local $/;

    It means that you are still slurping the files whole.
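To illustrate the point, here is a minimal sketch of how `$/` (the input record separator) changes what a single `<FH>` read returns. It reads from an in-memory string rather than a real filing; the sub name `count_reads` and the sample text are made up for this demonstration.

```perl
use strict;
use warnings;

# Hypothetical helper: count how many times the while loop body runs
# for the same text, with and without slurp mode.
sub count_reads {
    my ($text, $slurp) = @_;
    local $/ = $slurp ? undef : "\n";   # undef means "read everything at once"
    open(my $fh, '<', \$text) or die "cannot open in-memory file: $!";
    my $reads = 0;
    while (my $chunk = <$fh>) {
        $reads++;
    }
    close $fh;
    return $reads;
}

# Invented sample text standing in for a filing.
my $sample = "FORM TYPE: 10-K\nFILED AS OF DATE: 20080331\nlast line\n";
print "slurp mode:   ", count_reads($sample, 1), " read(s)\n";
print "line by line: ", count_reads($sample, 0), " read(s)\n";
```

With `$/` undefined the loop body runs once and `$chunk` holds the whole text; with the default separator it runs once per line, so memory use is bounded by the longest line rather than by the file size.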


      Thank you! It is working beautifully now! Happy New Year to you!!!
Re: READ one line at a time
by Anonymous Monk on Dec 27, 2014 at 16:03 UTC
    Thanks, this is better now. This program still slurps the whole file:
    { local $/; open (INPUT, "$direct$slash"."$file"); while ($data=<INPUT> ) { ...
    The entire file is slurped into $data, due to local $/ (there is even a comment describing that). No wonder your machine runs out of memory... Delete the line with local $/ and test what that does.

    (you should know that this code is VERY badly written...)

    Also I don't see what those monstrous regexes ($startstring, $endstringhtm) are doing there.
      Deleted local $/ ... program run, but is voluminous, duplicated entries, expanded the file size from about 3 MB to 26 MB. Clueless here. I am grateful for your help. And I am not surprised that the code is poorly written.

        Hello wrkrbeee,

        program run, but is voluminous, duplicated entries, expanded the file size from about 3 MB to 26 MB

        With the whole file slurped into $data in one go, the line:

        print OUTPUT "$cik,$form_type,$report_date,$file_date,$name,$sic,$filesize,$sb\n";

        runs once per file. With local $/; commented out, each file is read line-by-line, and the output code is called once for each line. No surprise, then, that the output file increases dramatically in size!

        If you’re going to read each file line-by-line (as you should), you’re going to have to change the logic of the code accordingly. Exactly what the new logic should be depends on what the output file is supposed to look like. Note that, at present, each of the variables $cik, $form_type, etc., is initialised once per file. When the file is read line-by-line, they are not re-initialised between lines, so each one keeps its value until a later line matches and overwrites it.
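One possible shape for that new logic, offered only as a sketch: collect the fields while looping over the lines, then emit a single row once the loop has finished. The helper name `parse_header`, the reduced field list, and the sample text are all invented here, and the regexes are lightly adapted from the original (the `/m` flag is dropped because each match now sees one line, and trailing whitespace is trimmed).

```perl
use strict;
use warnings;

# Hypothetical helper: scan one filing line by line and collect a few
# header fields. The defaults mirror the ones in the original program.
sub parse_header {
    my ($fh) = @_;
    my %rec = (cik => -99, form_type => '', file_date => -99, html => 0);
    while (my $line = <$fh>) {
        $rec{html}      = 1  if $line =~ /<HTML>/i;
        $rec{cik}       = $1 if $line =~ /^\s*CENTRAL\s*INDEX\s*KEY:\s*(\d+)/;
        $rec{form_type} = $1 if $line =~ /^\s*FORM\s*TYPE:\s*(\S.*?)\s*$/;
        $rec{file_date} = $1 if $line =~ /^\s*FILED\s*AS\s*OF\s*DATE:\s*(\d+)/;
    }
    return \%rec;
}

# Invented sample text standing in for a real filing.
my $sample = <<'END';
CENTRAL INDEX KEY:  0000012345
FORM TYPE:  10-K
FILED AS OF DATE:  20080331
<HTML><BODY>We have audited the accompanying ...</BODY></HTML>
END

open(my $fh, '<', \$sample) or die "cannot open in-memory file: $!";
my $rec = parse_header($fh);
close $fh;

# One output row per file, printed after the loop has finished.
print join(',', @{$rec}{qw(cik form_type file_date html)}), "\n";
```

Because the record is built up across lines and printed afterwards, the number of output rows depends on the number of files, not on the number of lines in each file.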

        It will help a lot if you can provide a small amount of sample input, together with the corresponding output you wish to obtain.

        Hope that helps,

        Athanasius

        Move these 3 lines down outside the while
        my $filesize = -s $direct . '/' . $file;
        my $sb = stat($direct . '/' . $file)->size;
        print OUTPUT "$cik,$form_type,$report_date,$file_date,$name,$sic,$filesize,$sb\n";
        poj
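Putting the suggestions together, the loop could be restructured roughly as follows. This is a sketch under assumptions, not the poster's final program: `summarize_dir` is a made-up name, only two of the header fields are extracted, and the demonstration runs against a temporary directory with invented files instead of the `E:\` paths. It also shows one way to get `$file` without first storing every name in an array, by calling `readdir` in a `while` loop.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical helper: write one CSV row per file in $dir to $out.
sub summarize_dir {
    my ($dir, $out) = @_;
    opendir(my $dh, $dir) or die "Can't open directory $dir: $!";
    my $rows = 0;
    # readdir in scalar context returns one name per call, so no array of
    # file names is built up front.
    while (defined(my $file = readdir $dh)) {
        next if $file =~ /^\./;                      # skip . and .. (and dotfiles)
        my $path = File::Spec->catfile($dir, $file);
        next unless -f $path;
        my ($cik, $form_type) = (-99, '');
        open(my $in, '<', $path) or die "cannot open $path: $!";
        while (my $line = <$in>) {                   # one line at a time
            $cik       = $1 if $line =~ /^\s*CENTRAL\s*INDEX\s*KEY:\s*(\d+)/;
            $form_type = $1 if $line =~ /^\s*FORM\s*TYPE:\s*(\S.*?)\s*$/;
        }
        close $in or die "cannot close $path: $!";
        my $filesize = -s $path;                     # moved outside the while
        print {$out} "$cik,$form_type,$filesize\n";  # one row per file
        $rows++;
    }
    closedir $dh;
    return $rows;
}

# Demonstration on a throwaway directory with two invented filings.
my $dir = tempdir(CLEANUP => 1);
for my $n (1, 2) {
    my $path = File::Spec->catfile($dir, "filing$n.txt");
    open(my $fh, '>', $path) or die "cannot write $path: $!";
    print {$fh} "CENTRAL INDEX KEY: 000000000$n\nFORM TYPE: 10-K\nbody text\n";
    close $fh;
}
my $rows = summarize_dir($dir, \*STDOUT);
print "rows written: $rows\n";
```

The order in which `readdir` returns names is not guaranteed, so sort the output afterwards if a stable row order matters.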
Re: READ one line at a time
by Anonymous Monk on Dec 27, 2014 at 15:31 UTC
    Be so kind and put this thing in <code></code> tags. Like this:
    <code> your program here </code>
      I apologize! Tried to incorporate the tags. Did I do it correctly? :-(
        Well now remove line numbers. And it would be good to remove most comments too.
Re: READ one line at a time
by Anonymous Monk on Dec 29, 2014 at 18:56 UTC
    Whew! Can you please now include (in code-tags, of course) a snapshot of the final program as-modified that eventually worked?
Re: READ one line at a time
by LanX (Saint) on Feb 09, 2015 at 00:49 UTC