Moritz: (and others) Thanks for your advice.
As for reading the new "clean" file in by blocks (which is what I think you are suggesting), that would be great. The "cleaning" of the first file puts the "END-OF-DOCUMENT" record divider at the end of each record, whereas "h4>Award\s\#\d+<\/h4" marks the start of a record, so it should work. But I am a bit confused (my Perl skills are not up to par). Do I need to modify other parts of the script? After just adding local $/ = 'END-OF-DOCUMENT', the program no longer parses the data from the records. Should it still read the new clean file through <IN>, only now one record at a time, like this:
open(OUT, ">/Users/micwood/Desktop/output.txt");
while (<>) {
s/\r//g;
s/\t//g;
s/(<h4>Award\s\#\d+<\/h4>)/\nEND-OF-DOCUMENT\n$1/g;
s/(<!-- \/noindex --><\/font>)/\nEND-OF-DOCUMENT\n$1/g;
print OUT "$_";}
close OUT;
my $novalue = '.'; # temp value
my $temp = '.'; # temp value
my $awardhashref= ();
open (IN, "/Users/micwood/Desktop/output.txt");
open(OUT2, ">/Users/micwood/Desktop/output2.csv");
my $allDocs = do
{
local $/ = 'END-OF-DOCUMENT';
<IN>;
};
my $rxExtractDoc = qr
{(?xms)
(<h4>Award\s\#(\d+)<\/h4>(.*?)END-OF-DOCUMENT)
};
while ($allDocs =~ m{$rxExtractDoc}g )
{
my %award = (); # award hash
$award{'entireaward'}= $1;
$award{'A_awardno'}= $2;
$award{'entireaward'}=~ s/\n//g;
if ($award{'entireaward'} =~ m{Dollars Obligated<\/td><td align=right>\$([^<]+?)<\/font>}gi) {
$award{'B_dollob'} = $1;
}
etc, etc
Which is fine, as long as it doesn't read the entire new "clean" file at once, since I don't think memory could handle that. But if all I need to do is add local $/ = 'END-OF-DOCUMENT', any clue why it no longer works?
Thanks again, and I hope my questions aren't too simple (I'm just not very good at this).
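A note on the likely cause, offered as a sketch rather than a confirmed diagnosis: with $/ set, a single <IN> in scalar context returns only one record, so $allDocs would hold just the first chunk of the file (probably the page header before the first award), and the extraction regex finds nothing in it. Reading inside a while loop returns one record per iteration and keeps only that record in memory. The sample data and the names $sample and @award_numbers below are made up for illustration; the real script would open the cleaned file instead.

```perl
use strict;
use warnings;

# Hypothetical sample standing in for the cleaned file; the real input
# would be opened from disk instead.
my $sample = join '',
    "page header\nEND-OF-DOCUMENT\n",
    "<h4>Award #101</h4>first body\nEND-OF-DOCUMENT\n",
    "<h4>Award #202</h4>second body\nEND-OF-DOCUMENT\n";

open my $in, '<', \$sample or die "Cannot open in-memory sample: $!";

my @award_numbers;
{
    # $/ is a plain string, not a regex; each <$in> now returns one record.
    local $/ = "END-OF-DOCUMENT\n";
    while ( my $doc = <$in> ) {
        # Skip chunks (such as the page header) that hold no award.
        next unless $doc =~ m{<h4>Award\s\#(\d+)</h4>(.*?)END-OF-DOCUMENT}s;
        my %award = ( A_awardno => $1, entireaward => $2 );
        push @award_numbers, $award{A_awardno};
    }
}
close $in;

print join( ',', @award_numbers ), "\n";
```

The per-field matches (Dollars Obligated and so on) would go inside the loop body, where %award is filled in.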
Success!!!
I played around with it a bit and found another record separator, so I no longer have to rely on the one I created in the "cleaning" step (i.e., the "END-OF-DOCUMENT"). I then moved the other "cleaning" commands from that first part of the script so that they run on the block read into memory. In case you are curious, it now looks like this:
open(OUT, ">/Users/micwood/Desktop/output.csv");
my $novalue = '.'; # temp value
my $temp = '.'; # temp value
my $awardhashref= ();
my $allDocs = do
{
local $/ = '<\/table>\n<hr>\n<br>';
<>;
};
my $rxExtractDoc = qr
{(?xms)
(<h4>Award\s\#(\d+)<\/h4>(.*?)<\/table>\n<hr>\n<br>)
};
while ($allDocs =~ m{$rxExtractDoc}g )
{
my %award = (); # award hash
$award{'entireaward'}= $1;
$award{'A_awardno'}= $2;
$award{'entireaward'}=~ s/\n//g;
$award{'entireaward'}=~ s/\t//g;
$award{'entireaward'}=~ s/\r//g;
if ($award{'entireaward'} =~ m{Dollars Obligated<\/td><td align=right>\$([^<]+?)<\/font>}gi) {
$award{'B_dollob'} = $1;
}
etc, etc
And it works!
Thanks, again. Best, Michael
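One caveat worth checking, hedged because it depends on the real input: $/ is matched as a literal string, not a regex, and inside single quotes \n and \/ stay as backslash sequences rather than becoming a newline and a slash. A separator written as '<\/table>\n<hr>\n<br>' would therefore probably never occur in the file, in which case <> returns the whole input in one go and the regex loop does all the splitting. That would explain why the script works, but it also means the entire file is held in memory. The sketch below compares the two quoting styles on made-up data; $sample and count_records are hypothetical names used only here.

```perl
use strict;
use warnings;

my $single = '<\/table>\n<hr>\n<br>';   # as written in the post: literal backslashes
my $double = "</table>\n<hr>\n<br>";    # real newlines, no backslashes

# Hypothetical two-record sample in the shape the post describes.
my $sample = "<h4>Award #1</h4>a</table>\n<hr>\n<br>"
           . "<h4>Award #2</h4>b</table>\n<hr>\n<br>";

# Count how many records readline returns for a given separator.
sub count_records {
    my ($separator) = @_;
    open my $fh, '<', \$sample or die "Cannot open in-memory sample: $!";
    local $/ = $separator;
    my @records = <$fh>;
    close $fh;
    return scalar @records;
}

my $with_single = count_records($single);
my $with_double = count_records($double);
print "single-quoted: $with_single, double-quoted: $with_double\n";
```

If memory does become a problem, the double-quoted separator combined with a while loop over <> would process one award at a time.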