READ one line at a time

wrkrbeee has asked for the wisdom of the Perl Monks concerning the following question:

Hello PerlMonks, I inherited a program that fails with an "Out of Memory" error. I modified the code to read one line at a time rather than store the entire file in a variable (or so I thought). The program runs, but it still hits the "Out of Memory" error, so obviously I failed. I suspect that the problem lies with a foreach statement at line 59. I'm thinking that reading all the file names into an array is unnecessary, so lines 51-59 can be dropped. However, my dilemma now is how to assign the filename to $file in lines 63 and 86. I apologize for asking such a trivial question of experienced users. The program is below. I am grateful for any insight you may have. Thanks!!

#!/usr/bin/perl -w
#use strict;
# This program extracts data from an SEC filing, including chunks of text

use File::stat;

#This program is going to obtain and extract the entire audit opinion but you can
#extract whatever text you are interested in by changing the regular expressions for the start
#and end strings below.

#This program was written by Andy Leone, May 15 2007 and updated July 25, 2008.
#You are free to use this program for your own use.  My only request is
#that you make an acknowledgement in any research manuscripts that
#benefit from the program.


my $startstring='((^\s*?)((We\s*(have|were)\s*(audited\s*the\s*(Statement\s*of\s*Financial\s*Condition|consolidated|accompanying|combined|balance\s*sheets)|(completed\s*|engaged\s*to\s*perform)\s*an\s*integrated\s*audit))|In\s*our\s*opinion,\s*the\s*(consolidated|accompanying)))';
my $startstringhtm='(((We\s*(have|were)\s*(audited\s*the\s*(Statement\s*of\s*Financial\s*Condition|consolidated|accompanying|combined|balance\s*sheets)|(completed\s*|engaged\s*to\s*perform)\s*an\s*integrated\s*audit))|In\s*our\s*opinion,\s*the\s*(consolidated|accompanying)))';
#Specify the end of the text you are looking for.
my $endstring='((^\s*)/s/|^\s*(Date:\s*)?(\d{1,2}\s*)?((January|February|March|April|May|June|July|August|September|October|November|December))\s*(\d{1,2},)?\s*\d{4}(\s*$|,\s{0,3}except|\s*/s|\s*s/|\d{1,2}))';
my $endstringhtm='((>|^\s*)(/s/)?\s*(Date:\s*)?(\d{1,2}\s*)?(January|February|March|April|May|June|July|August|September|October|November|December))\s*(&\w+?;\s*)?(\d{1,2},)?\s*\d{4}(\s*$|\s*[,\(]\s{0,3}except|\s*with\s*respect\s*to\s*our\s*opinion|<\/P>|<BR>|\s{0,1}\<\/FONT\>|\d{1,2})';


#Specify the directory containing the files that you want to read
my $direct="E:\\Research\\SEC filings 10K and 10Q\\Data\\Filing Docs\\2008test";
my $outfile="E:\\Research\\SEC filings 10K and 10Q\\Data\\Header Data\\Data2008test.txt";

#If Windows "\\", if Mac "/";
my $slash='\\';


$outfiler=">$outfile";

open(OUTPUT, "$outfiler") || die "file for 2006 1: $!";



#The following two steps open the directory containing the files you plan to read
#and then store the name of each file in an array called @New1.
opendir(DIR1,"$direct")||die "Can't open directory";
my @New1=readdir(DIR1);

#We will now loop through each file.  The file names
#have been stored in the array called @New1;

foreach $file(@New1)
{
#This prevents me from reading the first two entries in a directory . and ..;

if ($file=~/^\./){next;}
#Initialize the variable names.
my $cik=-99;
my $form_type="";
my $report_date=-99;
my $file_date=-99;
my $name="";
my $sic=-99;
my $HTML=0;
my $Audit_Opinion="Not Found";
my $Going_Concern=0;
my $ao="Not Found";
my $tree="Empty";
my $data="";
#Open the file and put the file in variable called $data
#$data will contain the entire filing
{
# this step removes the default end of line character (\n)
# so that the entire file can be read in at once.
local $/;
 
#read the contents into data

open (INPUT, "$direct$slash"."$file");

while ($data=<INPUT> ) {

#The following steps obtain basic data from the filings
  if ($data=~m/<HTML>/i){$HTML=1;}
  if($data=~m/^\s*CENTRAL\s*INDEX\s*KEY:\s*(\d*)/m){$cik=$1;}
  if($data=~m/^\s*FORM\s*TYPE:\s*(.*$)/m){$form_type=$1;}
  if($data=~m/^\s*CONFORMED\s*PERIOD\s*OF\s*REPORT:\s*(\d*)/m){$report_date=$1;}
  if($data=~m/^\s*FILED\s*AS\s*OF\s*DATE:\s*(\d*)/m){$file_date=$1;}
  if($data=~m/^\s*COMPANY\s*CONFORMED\s*NAME:\s*(.*$)/m){$name=$1;}
  if($data=~m/^\s*STANDARD\s*INDUSTRIAL\s*CLASSIFICATION:.*?\[(\d{4})/m){$sic=$1;}
my $filesize = -s $direct . '/' . $file;
my $sb = stat($direct . '/' . $file)->size;
print OUTPUT "$cik,$form_type,$report_date,$file_date,$name,$sic,$filesize,$sb\n";
}
close INPUT or die "cannot close $file: $!";
}
}

Re: READ one line at a time
by BrowserUk (Patriarch) on Dec 27, 2014 at 16:19 UTC

You need to comment out or remove this line:

        local $/;

It means that you are still slurping the files whole.
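
To see the difference for yourself, here is a minimal sketch (not from the original program) that counts how many times a read loop runs over the same text with $/ undefined versus set to a newline. The sub name count_reads and the in-memory sample text are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count how many times the read loop runs over the same text,
# with and without the input record separator ($/) undefined.
sub count_reads {
    my ($text, $slurp) = @_;

    # Open an in-memory filehandle so the sketch needs no real file.
    open my $fh, '<', \$text or die "cannot open in-memory file: $!";

    # undef means slurp mode; "\n" is the usual line-at-a-time mode.
    local $/ = $slurp ? undef : "\n";

    my $reads = 0;
    while (defined(my $chunk = <$fh>)) {
        $reads++;
    }
    close $fh;
    return $reads;
}

my $sample = "line one\nline two\nline three\n";
printf "slurp mode: %d read(s)\n", count_reads($sample, 1);
printf "line mode:  %d read(s)\n", count_reads($sample, 0);
```

With $/ undefined the loop body runs once and the single chunk holds the whole file, which is why memory use grows with file size.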


Re^2: READ one line at a time

by wrkrbeee (Scribe) on Dec 27, 2014 at 17:19 UTC

Thank you! It is working beautifully now! Happy New Year to you!!!


Re: READ one line at a time
by Anonymous Monk on Dec 27, 2014 at 16:03 UTC

{
  local $/;
 
  open (INPUT, "$direct$slash"."$file");

  while ($data=<INPUT> ) {
  ...

$data

local $/

(you should know that this code is VERY badly written...)

$startstring, $endstringhtm


Re^2: READ one line at a time

by wrkrbeee (Scribe) on Dec 27, 2014 at 16:10 UTC

Deleted local $/ ... the program runs, but the output is voluminous, with duplicated entries, and the file size expanded from about 3 MB to 26 MB. I'm clueless here. I am grateful for your help. And I am not surprised that the code is poorly written.


Re^3: Read one line at a time

by Athanasius (Archbishop) on Dec 27, 2014 at 16:36 UTC

Hello wrkrbeee,

> the program runs, but the output is voluminous, with duplicated entries, and the file size expanded from about 3 MB to 26 MB

With the whole file slurped into $data in one go, the line:

print OUTPUT "$cik,$form_type,$report_date,$file_date,$name,$sic,$filesize,$sb\n";

runs once per file. With local $/; commented out, each file is read line-by-line, and the output code is called once for each line. No surprise, then, that the output file increases dramatically in size!

If you’re going to read each file line-by-line (as you should), you’re going to have to change the logic of the code accordingly. Exactly what the new logic should be depends on what the output file is supposed to look like. Note that, at present, each of the variables $cik, $form_type, etc., is initialised once per file, but if the file is read line-by-line, these variables are all reset (without being re-initialised) each time a line is read.
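
One way the reworked logic could look, as a rough sketch under the assumption that the output should remain one CSV row per filing: read each line, remember the first match for every header field, and print only after the loop ends. The sub name extract_header, the %pattern table, and the sample header text are illustrative and not part of the original program; the regular expressions are the ones from the posted code with the /m modifier dropped.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Header patterns copied from the posted program. Each one is tested
# against a single line, so the /m modifier is not needed here.
my %pattern = (
    cik         => qr/^\s*CENTRAL\s*INDEX\s*KEY:\s*(\d*)/,
    form_type   => qr/^\s*FORM\s*TYPE:\s*(.*$)/,
    report_date => qr/^\s*CONFORMED\s*PERIOD\s*OF\s*REPORT:\s*(\d*)/,
    file_date   => qr/^\s*FILED\s*AS\s*OF\s*DATE:\s*(\d*)/,
    name        => qr/^\s*COMPANY\s*CONFORMED\s*NAME:\s*(.*$)/,
    sic         => qr/^\s*STANDARD\s*INDUSTRIAL\s*CLASSIFICATION:.*?\[(\d{4})/,
);

# Hypothetical helper: read a filehandle line by line and return a hash
# reference holding the first value found for each header field.
sub extract_header {
    my ($fh) = @_;

    # Same defaults as the posted program, set once per file.
    my %field = (
        cik => -99, form_type => '', report_date => -99,
        file_date => -99, name => '', sic => -99,
    );
    my %seen;

    while (my $line = <$fh>) {
        $line =~ s/\r?\n\z//;       # drop the line ending (Unix or Windows)
        for my $key (keys %pattern) {
            next if $seen{$key};    # keep the first match only
            if ($line =~ $pattern{$key}) {
                $field{$key} = $1;
                $seen{$key}  = 1;
            }
        }
        # Stop early once every field has been found.
        last if scalar(keys %seen) == scalar(keys %pattern);
    }
    return \%field;
}

# Made-up header text standing in for a real filing.
my $sample = <<'END';
FORM TYPE: 10-K
CONFORMED PERIOD OF REPORT: 20071231
FILED AS OF DATE: 20080229
COMPANY CONFORMED NAME: EXAMPLE CORP
CENTRAL INDEX KEY: 0000000000
STANDARD INDUSTRIAL CLASSIFICATION: SERVICES [7370]
Some body text that follows the header.
END

open my $in, '<', \$sample or die "cannot open in-memory file: $!";
my $header = extract_header($in);
close $in;

# One output row per file, printed after the read loop has finished.
print join(',', @{$header}{qw(cik form_type report_date file_date name sic)}), "\n";
```

In the real program the filehandle would come from opening each directory entry in turn, and the file size from -s could be appended to the row after the loop, so that it too is computed once per file.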

It will help a lot if you can provide a small amount of sample input, together with the corresponding output you wish to obtain.

Hope that helps,

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,


Re^4: READ one line at a time

by wrkrbeee (Scribe) on Dec 27, 2014 at 16:40 UTC

Re^5: Read one line at a time

by Athanasius (Archbishop) on Dec 27, 2014 at 16:49 UTC


Re^3: READ one line at a time

by poj (Abbot) on Dec 27, 2014 at 16:25 UTC

while

my $filesize = -s $direct . '/' . $file;
my $sb = stat($direct . '/' . $file)->size;
print OUTPUT "$cik,$form_type,$report_date,$file_date,$name,$sic,$filesize,$sb\n";


Re^4: READ one line at a time

by wrkrbeee (Scribe) on Dec 27, 2014 at 17:08 UTC

Re^4: READ one line at a time

by wrkrbeee (Scribe) on Dec 27, 2014 at 16:29 UTC

Re: READ one line at a time
by Anonymous Monk on Dec 27, 2014 at 15:31 UTC

<code></code>

<code>
your program here
</code>


Re^2: READ one line at a time

by wrkrbeee (Scribe) on Dec 27, 2014 at 15:38 UTC

I apologize! Tried to incorporate the tags. Did I do it correctly? :-(


Re^3: READ one line at a time

by Anonymous Monk on Dec 27, 2014 at 15:44 UTC

Well, now remove the line numbers. And it would be good to remove most of the comments too.


Re^4: READ one line at a time

by wrkrbeee (Scribe) on Dec 27, 2014 at 15:54 UTC

Re^4: READ one line at a time

by wrkrbeee (Scribe) on Dec 27, 2014 at 17:11 UTC

Re: READ one line at a time
by Anonymous Monk on Dec 29, 2014 at 18:56 UTC

Whew!


Re: READ one line at a time
by LanX (Saint) on Feb 09, 2015 at 00:49 UTC

> Inherited a program that creates the "Out of Memory" error

Did you inherit it, or did you find it in "Komunitas Perl Indonesia"?

https://groups.yahoo.com/neo/groups/id-perl/conversations/messages/2122

Look, programming is not a cut-and-paste game where you can play trial and error with online communities.

Cheers Rolf

PS: Je suis Charlie!
