micwood has asked for the wisdom of the Perl Monks concerning the following question:

Greetings:

I have what is probably an easy question, but I/O stuff always confuses me. Basically, I am parsing a document with many repeated records. However, I have to clean the document before I can parse out the data. As is, my program cleans the original input, writes it out to a file, and then reads that file back in to begin the parse. I really don't mind this workaround (and it works fine as is), but the file that I am going to have to parse will be huge (several gigabytes, and I don't know if creating an intermediate file will be a problem memory-wise). Is there a way to take out the intermediate step so that the output from the "cleaning" can be accessed without printing to another file?

The beginning of the program, which creates the first output file and then reads it back in (creating a second output), is as follows:

/Users/micwood">
open(OUT, ">/Users/micwood/Desktop/output.txt");
while (<>) {
    s/\r//g;
    s/\t//g;
    s/(<h4>Award\s\#\d+<\/h4>)/\nEND-OF-DOCUMENT\n$1/g;
    s/(<!-- \/noindex --><\/font>)/\nEND-OF-DOCUMENT\n$1/g;
    print OUT "$_";
}
close OUT;

open(IN, "/Users/micwood/Desktop/output.txt");
open(OUT2, ">/Users/micwood/Desktop/output2.txt");

my $allDocs = do { local $/; <IN>; };

my $rxExtractDoc = qr{(?xms)
    (<h4>Award\s\#(\d+)<\/h4>(.*?)END-OF-DOCUMENT)
};

while ($allDocs =~ m{$rxExtractDoc}g)
...etc., etc. The rest of the program just pulls the data out of the records. Any advice would be much appreciated. Best, Michael

Replies are listed 'Best First'.
Re: Using output again without printing
by BrowserUk (Patriarch) on Jul 31, 2008 at 22:21 UTC
Re: Using output again without printing
by moritz (Cardinal) on Jul 31, 2008 at 22:28 UTC
    One easy way to do it is to split it into two scripts, and connect them with a pipe. In the first script you produce output to STDOUT, in the second you read from STDIN.

    But in your case that won't help you at all, because you're slurping the whole input into a variable at once, which means chunk-by-chunk processing can't happen.

    What you can do is to set local $/ = 'END-OF-DOCUMENT' and thus read the file block by block (assuming you actually have multiple such blocks in your file).
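    A minimal sketch of that record-by-record read, assuming the cleaned file already carries the END-OF-DOCUMENT markers; the path and the per-record extraction are placeholders:

    open(my $in, '<', '/Users/micwood/Desktop/output.txt') or die "open: $!";
    {
        local $/ = 'END-OF-DOCUMENT';    # read up to each marker instead of slurping
        while (my $doc = <$in>) {
            chomp $doc;                  # drops the trailing END-OF-DOCUMENT
            next unless $doc =~ m{<h4>Award\s\#(\d+)<\/h4>};    # skip the preamble chunk
            my $awardno = $1;
            # ... extract the other fields from $doc here ...
        }
    }
    close $in;

    Because $/ is localized to the block, each <$in> returns one chunk ending at the marker, so only a single record is in memory at any time.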

      Moritz (and others): Thanks for your advice. As for reading the new "clean" file in by blocks (which is what I think you are suggesting), that would be great. The "cleaning" of the first file puts the "END-OF-DOCUMENT" record divider at the end of each record, whereas "<h4>Award\s\#\d+<\/h4>" is the start of the record, so it should work. But I am a bit confused (my Perl skills are not up to par). Do I need to modify other parts of the script? By just adding local $/ = 'END-OF-DOCUMENT', the program no longer parses the data from the records. Should it still be reading the new clean file with <IN>, but now only a record at a time, as such:
/Users/micwood">
      open(OUT, ">/Users/micwood/Desktop/output.txt");
      while (<>) {
          s/\r//g;
          s/\t//g;
          s/(<h4>Award\s\#\d+<\/h4>)/\nEND-OF-DOCUMENT\n$1/g;
          s/(<!-- \/noindex --><\/font>)/\nEND-OF-DOCUMENT\n$1/g;
          print OUT "$_";
      }
      close OUT;

      my $novalue      = '.';    # temp value
      my $temp         = '.';    # temp value
      my $awardhashref = ();

      open(IN, "/Users/micwood/Desktop/output.txt");
      open(OUT2, ">/Users/micwood/Desktop/output2.csv");

      my $allDocs = do { local $/ = 'END-OF-DOCUMENT'; <IN>; };

      my $rxExtractDoc = qr{(?xms)
          (<h4>Award\s\#(\d+)<\/h4>(.*?)END-OF-DOCUMENT)
      };

      while ($allDocs =~ m{$rxExtractDoc}g) {
          my %award = ();    # award hash
          $award{'entireaward'} = $1;
          $award{'A_awardno'}   = $2;
          $award{'entireaward'} =~ s/\n//g;
          if ($award{'entireaward'} =~ m{Dollars Obligated<\/td><td align=right>\$([^<]+?)<\/font>}gi) {
              $award{'B_dollob'} = $1;
          }
      etc., etc. Which is fine, as long as it doesn't read the entire new "clean" file at once, since I don't think memory could handle that. But if all I need to do is add local $/ = 'END-OF-DOCUMENT', any clue why it no longer works? Thanks again, and I hope my questions aren't too simple (I'm just not very good at this).
        Success!!! I played around with it a bit and found another record separator, so I didn't have to rely on the one I created in the "cleaning" (i.e., the "END-OF-DOCUMENT"), and then moved the other "cleaning" commands from the first part of the script into the loop that processes each block read into memory. And just in case you are curious, it now looks like this:
/Users/micwood">
        open(OUT, ">/Users/micwood/Desktop/output.csv");

        my $novalue      = '.';    # temp value
        my $temp         = '.';    # temp value
        my $awardhashref = ();

        my $allDocs = do { local $/ = '<\/table>\n<hr>\n<br>'; <>; };

        my $rxExtractDoc = qr{(?xms)
            (<h4>Award\s\#(\d+)<\/h4>(.*?)<\/table>\n<hr>\n<br>)
        };

        while ($allDocs =~ m{$rxExtractDoc}g) {
            my %award = ();    # award hash
            $award{'entireaward'} = $1;
            $award{'A_awardno'}   = $2;
            $award{'entireaward'} =~ s/\n//g;
            $award{'entireaward'} =~ s/\t//g;
            $award{'entireaward'} =~ s/\r//g;
            if ($award{'entireaward'} =~ m{Dollars Obligated<\/td><td align=right>\$([^<]+?)<\/font>}gi) {
                $award{'B_dollob'} = $1;
            }
        etc., etc. And it works! Thanks again. Best, Michael
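        One caveat: $/ is a plain string, not a regex, and inside single quotes '\n' and '\/' stay literal, so the separator above most likely never matches and the do { ... <>; } still reads the whole input at once (a scalar <> returns only the first record in any case). A rough sketch of a streaming variant, assuming each award really ends with "</table>", "<hr>", "<br>" on their own lines; the field extraction is abbreviated:

        open(my $out, '>', '/Users/micwood/Desktop/output.csv') or die "open: $!";

        $/ = "</table>\n<hr>\n<br>";    # plain string with real newlines
        while (my $doc = <>) {
            next unless $doc =~ m{<h4>Award\s\#(\d+)<\/h4>};    # skip anything before the first award
            my %award;
            $award{'A_awardno'}   = $1;
            $award{'entireaward'} = $doc;
            $award{'entireaward'} =~ s/[\n\t\r]//g;
            if ($award{'entireaward'} =~ m{Dollars Obligated<\/td><td align=right>\$([^<]+?)<\/font>}i) {
                $award{'B_dollob'} = $1;
            }
            # ... extract the remaining fields and print one CSV line to $out ...
        }
        close $out;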
Re: Using output again without printing
by pjotrik (Friar) on Jul 31, 2008 at 22:23 UTC
    You can always divide the code into two scripts and connect them with a pipe. In this case, I would recommend replacing the first part with sed.

    But you read the entire file into memory in the second phase. That's not a good idea if the file is really big. Try to find an appropriate record separator, or read the file in blocks and check whether the part you've read contains a complete record.
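    A rough sketch of that block-reading idea, assuming the input already carries the END-OF-DOCUMENT markers from the cleaning pass; the chunk size and the field extraction are placeholders:

    my $buffer = '';
    open(my $in, '<', '/Users/micwood/Desktop/output.txt') or die "open: $!";
    while (read($in, my $chunk, 1_048_576)) {    # 1 MB at a time
        $buffer .= $chunk;
        # Carve every complete record out of the buffer; keep the unfinished tail.
        while ($buffer =~ s{\A.*?(<h4>Award\s\#(\d+)<\/h4>.*?END-OF-DOCUMENT)}{}s) {
            my ($record, $awardno) = ($1, $2);
            # ... extract fields from $record here ...
        }
    }
    close $in;
    # Anything still in $buffer afterwards is an incomplete trailing record.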