micwood has asked for the wisdom of the Perl Monks concerning the following question:

Greetings:

I have what is probably an easy question, but I/O stuff always confuses me. Basically, I am parsing a document with many repeated records. However, I have to clean the document before I can parse out the data. As is, my program cleans the original input, writes it out to a file, and then reads that file back in to begin the parse. I really don't mind this workaround (and it works fine as is), but the file that I am going to have to parse will be huge (several gigabytes, and I don't know if creating an intermediate file will be a problem memory-wise). Is there a way to take out the intermediate step so that the output from the "cleaning" can be accessed without printing to another file?

The beginning of the program, which creates the first output file and then reads it back in (creating a second output), is as follows:

/Users/micwood">
open(OUT, ">/Users/micwood/Desktop/output.txt");
while (<>) {
    s/\r//g;
    s/\t//g;
    s/(<h4>Award\s\#\d+<\/h4>)/\nEND-OF-DOCUMENT\n$1/g;
    s/(<!-- \/noindex --><\/font>)/\nEND-OF-DOCUMENT\n$1/g;
    print OUT "$_";
}
close OUT;

open(IN, "/Users/micwood/Desktop/output.txt");
open(OUT2, ">/Users/micwood/Desktop/output2.txt");

my $allDocs = do { local $/; <IN>; };

my $rxExtractDoc = qr{(?xms)
    (<h4>Award\s\#(\d+)<\/h4>(.*?)END-OF-DOCUMENT)
};

while ($allDocs =~ m{$rxExtractDoc}g)
...etc., etc. The rest of the program just pulls the data out of the records. Any advice would be much appreciated. Best, Michael

Replies are listed 'Best First'.
Re: Using output again without printing
by BrowserUk (Patriarch) on Jul 31, 2008 at 22:21 UTC
Re: Using output again without printing
by moritz (Cardinal) on Jul 31, 2008 at 22:28 UTC
    One easy way to do it is to split it into two scripts, and connect them with a pipe. In the first script you produce output to STDOUT, in the second you read from STDIN.

    But in your case that won't help you at all, because you're slurping the whole input into a variable at once, which means chunk-by-chunk processing can't happen.

    What you can do is to set local $/ = 'END-OF-DOCUMENT' and thus read the file block by block (assuming you actually have multiple such blocks in your file).
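    A minimal sketch of that record-by-record read, assuming the cleaned file already carries the END-OF-DOCUMENT markers; the path and the per-record extraction are placeholders:

    open(my $in, '<', '/Users/micwood/Desktop/output.txt') or die "open: $!";
    {
        local $/ = 'END-OF-DOCUMENT';    # read up to each marker instead of slurping
        while (my $doc = <$in>) {
            chomp $doc;                  # drops the trailing END-OF-DOCUMENT
            next unless $doc =~ m{<h4>Award\s\#(\d+)<\/h4>};    # skip the preamble chunk
            my $awardno = $1;
            # ... extract the other fields from $doc here ...
        }
    }
    close $in;

    Because $/ is localized to the block, each <$in> returns one chunk ending at the marker, so only a single record is in memory at any time.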

      Moritz (and others): Thanks for your advice. As for reading the new "clean" file in by blocks (which is what I think you are suggesting), that would be great. The "cleaning" of the first file puts the "END-OF-DOCUMENT" record divider at the end of each record, whereas "<h4>Award\s\#\d+<\/h4>" is the start of the record, so it should work. But I am a bit confused (my Perl skills are not up to par). Do I need to modify other parts of the script? By just adding local $/ = 'END-OF-DOCUMENT', the program no longer parses the data from the records. Should it still be reading the new clean file with <IN>, but now only a record at a time, as such:
/Users/micwood">
      open(OUT, ">/Users/micwood/Desktop/output.txt");
      while (<>) {
          s/\r//g;
          s/\t//g;
          s/(<h4>Award\s\#\d+<\/h4>)/\nEND-OF-DOCUMENT\n$1/g;
          s/(<!-- \/noindex --><\/font>)/\nEND-OF-DOCUMENT\n$1/g;
          print OUT "$_";
      }
      close OUT;

      my $novalue      = '.';    # temp value
      my $temp         = '.';    # temp value
      my $awardhashref = ();

      open(IN, "/Users/micwood/Desktop/output.txt");
      open(OUT2, ">/Users/micwood/Desktop/output2.csv");

      my $allDocs = do { local $/ = 'END-OF-DOCUMENT'; <IN>; };

      my $rxExtractDoc = qr{(?xms)
          (<h4>Award\s\#(\d+)<\/h4>(.*?)END-OF-DOCUMENT)
      };

      while ($allDocs =~ m{$rxExtractDoc}g) {
          my %award = ();    # award hash
          $award{'entireaward'} = $1;
          $award{'A_awardno'}   = $2;
          $award{'entireaward'} =~ s/\n//g;
          if ($award{'entireaward'} =~ m{Dollars Obligated<\/td><td align=right>\$([^<]+?)<\/font>}gi) {
              $award{'B_dollob'} = $1;
          }
      etc., etc. Which is fine, as long as it doesn't read the entire new "clean" file at once, since I don't think memory could handle that. But if all I need to do is add local $/ = 'END-OF-DOCUMENT', any clue why it no longer works? Thanks again, and I hope my questions aren't too simple (I'm just not very good at this).
        Success!!! I played around with it a bit and found another record separator, so I didn't have to rely on the one I created in the "cleaning" (i.e., the "END-OF-DOCUMENT"), and then moved the other "cleaning" commands from the first part of the script into the loop that processes each block read into memory. And just in case you are curious, it now looks like this:
/Users/micwood">
        open(OUT, ">/Users/micwood/Desktop/output.csv");

        my $novalue      = '.';    # temp value
        my $temp         = '.';    # temp value
        my $awardhashref = ();

        my $allDocs = do { local $/ = '<\/table>\n<hr>\n<br>'; <>; };

        my $rxExtractDoc = qr{(?xms)
            (<h4>Award\s\#(\d+)<\/h4>(.*?)<\/table>\n<hr>\n<br>)
        };

        while ($allDocs =~ m{$rxExtractDoc}g) {
            my %award = ();    # award hash
            $award{'entireaward'} = $1;
            $award{'A_awardno'}   = $2;
            $award{'entireaward'} =~ s/\n//g;
            $award{'entireaward'} =~ s/\t//g;
            $award{'entireaward'} =~ s/\r//g;
            if ($award{'entireaward'} =~ m{Dollars Obligated<\/td><td align=right>\$([^<]+?)<\/font>}gi) {
                $award{'B_dollob'} = $1;
            }
        etc., etc. And it works! Thanks again. Best, Michael
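        One caveat: $/ is a plain string, not a regex, and inside single quotes '\n' and '\/' stay literal, so the separator above most likely never matches and the do { ... <>; } still reads the whole input at once (a scalar <> returns only the first record in any case). A rough sketch of a streaming variant, assuming each award really ends with "</table>", "<hr>", "<br>" on their own lines; the field extraction is abbreviated:

        open(my $out, '>', '/Users/micwood/Desktop/output.csv') or die "open: $!";

        $/ = "</table>\n<hr>\n<br>";    # plain string with real newlines
        while (my $doc = <>) {
            next unless $doc =~ m{<h4>Award\s\#(\d+)<\/h4>};    # skip anything before the first award
            my %award;
            $award{'A_awardno'}   = $1;
            $award{'entireaward'} = $doc;
            $award{'entireaward'} =~ s/[\n\t\r]//g;
            if ($award{'entireaward'} =~ m{Dollars Obligated<\/td><td align=right>\$([^<]+?)<\/font>}i) {
                $award{'B_dollob'} = $1;
            }
            # ... extract the remaining fields and print one CSV line to $out ...
        }
        close $out;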
Re: Using output again without printing
by pjotrik (Friar) on Jul 31, 2008 at 22:23 UTC
    You can always divide the code into two scripts and connect them with a pipe. In this case, I would recommend replacing the first part with sed.

    But you read the entire file into memory in the second phase. That's not a good idea if the file is really big. Try to find an appropriate record separator, or read the file in blocks and check whether the part you've read contains a complete record.
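    A rough sketch of that block-reading idea, assuming the input already carries the END-OF-DOCUMENT markers from the cleaning pass; the chunk size and the field extraction are placeholders:

    my $buffer = '';
    open(my $in, '<', '/Users/micwood/Desktop/output.txt') or die "open: $!";
    while (read($in, my $chunk, 1_048_576)) {    # 1 MB at a time
        $buffer .= $chunk;
        # Carve every complete record out of the buffer; keep the unfinished tail.
        while ($buffer =~ s{\A.*?(<h4>Award\s\#(\d+)<\/h4>.*?END-OF-DOCUMENT)}{}s) {
            my ($record, $awardno) = ($1, $2);
            # ... extract fields from $record here ...
        }
    }
    close $in;
    # Anything still in $buffer afterwards is an incomplete trailing record.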