in reply to Re: Using output again without printing
in thread Using output again without printing

Moritz (and others): Thanks for your advice. Reading the new "clean" file in by blocks (which is what I think you are suggesting) would be great. The "cleaning" of the first file puts the "END-OF-DOCUMENT" record divider at the end of each record, whereas "h4>Award\s\#\d+<\/h4" is the start of the record, so it should work. But I am a bit confused (my Perl skills are not up to par). Do I need to modify other parts of the script? After just adding local $/ = 'END-OF-DOCUMENT', the program no longer parses the data from the records. Should it still read the new clean file through <IN>, but now only one record at a time, like this:
    open(OUT, ">/Users/micwood/Desktop/output.txt");
    while (<>) {
        s/\r//g;
        s/\t//g;
        s/(<h4>Award\s\#\d+<\/h4>)/\nEND-OF-DOCUMENT\n$1/g;
        s/(<!-- \/noindex --><\/font>)/\nEND-OF-DOCUMENT\n$1/g;
        print OUT "$_";
    }
    close OUT;

    my $novalue = '.';      # temp value
    my $temp = '.';         # temp value
    my $awardhashref = ();

    open(IN, "/Users/micwood/Desktop/output.txt");
    open(OUT2, ">/Users/micwood/Desktop/output2.csv");

    my $allDocs = do {
        local $/ = 'END-OF-DOCUMENT';
        <IN>;
    };

    my $rxExtractDoc = qr{(?xms)
        (<h4>Award\s\#(\d+)<\/h4>(.*?)END-OF-DOCUMENT)
    };

    while ($allDocs =~ m{$rxExtractDoc}g) {
        my %award = ();     # award hash
        $award{'entireaward'} = $1;
        $award{'A_awardno'} = $2;
        $award{'entireaward'} =~ s/\n//g;
        if ($award{'entireaward'} =~ m{Dollars Obligated<\/td><td align=right>\$([^<]+?)<\/font>}gi) {
            $award{'B_dollob'} = $1;
        }
etc, etc. Which is fine, as long as it doesn't read the entire new "clean" file at once, since I don't think memory could handle that. But if all I need to do is add local $/ = 'END-OF-DOCUMENT', any clue why it no longer works? Thanks again, and I hope my questions aren't too simple (I'm just not very good at this).
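
For what it's worth, one thing that may explain this: assigning <IN> to a scalar reads a single record, so once $/ is set, the do block returns only the text up to the first END-OF-DOCUMENT rather than the whole file. Below is a minimal sketch of reading one record per loop iteration instead. It assumes END-OF-DOCUMENT closes each record; the in-memory sample data and the @awards array are hypothetical, made up purely for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the cleaned file.
my $sample = join '',
    "<h4>Award #101</h4> first record body\nEND-OF-DOCUMENT\n",
    "<h4>Award #202</h4> second record body\nEND-OF-DOCUMENT\n";

# Open a read handle on the string so the sketch runs without a real file.
open(my $in, '<', \$sample) or die "Cannot open sample: $!";

my @awards;    # hypothetical container for the parsed records
{
    # Each read from $in now returns text up to and including the separator.
    local $/ = 'END-OF-DOCUMENT';
    while (my $record = <$in>) {
        chomp $record;    # with $/ set, chomp strips the separator itself
        # Only one record is held in memory at a time.
        if ($record =~ m{<h4>Award\s\#(\d+)</h4>(.*)}s) {
            push @awards, { A_awardno => $1, entireaward => $2 };
        }
    }
}
close $in;

print "$_->{A_awardno}\n" for @awards;
```

Because the loop holds a single record at a time, memory use depends on the largest record rather than on the size of the file.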

Re^3: Using output again without printing
by micwood (Acolyte) on Aug 01, 2008 at 06:00 UTC
    Success!!! I played around with it a bit and found another record separator, so I no longer have to rely on the one I created during the "cleaning" (i.e., "END-OF-DOCUMENT"). I then moved the other "cleaning" commands from the first part of the script so that they operate on the block read into memory. In case you are curious, it now looks like this:
    open(OUT, ">/Users/micwood/Desktop/output.csv");

    my $novalue = '.';      # temp value
    my $temp = '.';         # temp value
    my $awardhashref = ();

    my $allDocs = do {
        local $/ = '<\/table>\n<hr>\n<br>';
        <>;
    };

    my $rxExtractDoc = qr{(?xms)
        (<h4>Award\s\#(\d+)<\/h4>(.*?)<\/table>\n<hr>\n<br>)
    };

    while ($allDocs =~ m{$rxExtractDoc}g) {
        my %award = ();     # award hash
        $award{'entireaward'} = $1;
        $award{'A_awardno'} = $2;
        $award{'entireaward'} =~ s/\n//g;
        $award{'entireaward'} =~ s/\t//g;
        $award{'entireaward'} =~ s/\r//g;
        if ($award{'entireaward'} =~ m{Dollars Obligated<\/td><td align=right>\$([^<]+?)<\/font>}gi) {
            $award{'B_dollob'} = $1;
        }
    etc, etc. And it works! Thanks again. Best, Michael
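
    One caveat on the separator: inside single quotes Perl does not interpret \n (or \/) as an escape, so '<\/table>\n<hr>\n<br>' contains literal backslash characters rather than newlines. If that exact text never occurs in the input, <> finds no separator and returns the whole file in one read. Here is a small sketch of the difference; the $sample string and the count_records helper are hypothetical, made up for illustration.

```perl
use strict;
use warnings;

# Single quotes keep the backslash; double quotes turn \n into a newline.
my $single = '</table>\n<hr>\n<br>';
my $double = "</table>\n<hr>\n<br>";
print length($single), " vs ", length($double), "\n";

# Hypothetical two-record sample using real newlines between the tags.
my $sample = "<h4>Award #1</h4>a</table>\n<hr>\n<br>"
           . "<h4>Award #2</h4>b</table>\n<hr>\n<br>";

# Count how many records a given separator yields from the sample.
sub count_records {
    my ($separator) = @_;
    open(my $in, '<', \$sample) or die "Cannot open sample: $!";
    local $/ = $separator;
    my @records = <$in>;    # list context: read every record
    close $in;
    return scalar @records;
}

my $with_single = count_records($single);   # separator never matches
my $with_double = count_records($double);   # splits after each <br>
print "single-quoted: $with_single, double-quoted: $with_double\n";
```

    With the single-quoted separator the sample comes back as one record; with the double-quoted one it comes back as two. If memory is a concern on large files, that difference is worth checking against the real input.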