in reply to Large file data extraction

I think there is a problem in first do loop [...]. It appears to be reading in the entire file

I don't see any do *loop* in the code you presented, and the do is only reading until the contents of the input record separator ($/) are found, not the entire file.

But then again, I doubt your IRS exists in the file. '<hr>\r' should be "<hr>\r" if you want the \r to match a carriage return; single quotes leave \r as a literal backslash followed by the letter r.
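To make the quoting difference concrete, here is a minimal sketch. The variable names and the sample line are made up for illustration and are not from the original script.

```perl
use strict;
use warnings;

# Single quotes: no escape processing, so \r stays as backslash + 'r'.
my $single = '<hr>\r';
# Double quotes: \r becomes one carriage-return character (0x0D).
my $double = "<hr>\r";

printf "single: %d chars, double: %d chars\n",
    length($single), length($double);

# Hypothetical DOS-style line: the tag followed by CR LF.
my $line = "<hr>\r\n";
my $single_found = index($line, $single) >= 0 ? 'yes' : 'no';
my $double_found = index($line, $double) >= 0 ? 'yes' : 'no';
print "single found: $single_found, double found: $double_found\n";
```

If the single-quoted string is used as $/, it never occurs in such a file, so the first read runs to end of file.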

Re^2: Large file data extraction
by micwood (Acolyte) on Aug 12, 2008 at 03:13 UTC
    Okay, I am slowly getting there. You are correct: the delimiter should have been "<hr>\r", and with '<hr>\r' it was slurping the entire file in. However, as it stands now, the following change to the program reads in only the first record and then quits without looping to the next record:
        my $allDocs = do { local $/ = "<hr>\r"; <>; };
        my $rxExtractDoc = qr {(?xms) (<h4>Award\s\#(\d+)(.*?)<hr>) };
        while ($allDocs =~ m{$rxExtractDoc}g ) {
            my %award = (); # award hash
            $award{'record'}= $1;
            $award{'A_awardno'}= $2;
            $award{'entireaward'}= $3;
            # $award{'entireaward'}=~ s/\n//g;
            $award{'entireaward'}=~ s/\t//g;
            $award{'entireaward'}=~ s/\r//g;
            if ($award{'entireaward'} =~ m{Dollars Obligated(.*?)\$([^<]+?)< +}gi){
                $award{'B_dollob'} = $2
            };
    {the rest of the code continues.} But it doesn't loop to the next record once it finishes extracting the first record? And I am not sure why. Thoughts? Thank you so much again. This is very helpful (and I am at my wits' end as it is) :-)
    Also, someone mentioned that there is no input....I have the file on the command line such that <> refers to it. Is that not correct?

      But it doesn't loop to the next record once it finishes extracting the first record

      Your regexp is wrong, or your data isn't what you think it is. What's in $allDocs?
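One way to answer that question is to dump the variable with control characters made visible. This is only a sketch: the sample string below is hypothetical and stands in for whatever <> actually returned in the real script.

```perl
use strict;
use warnings;
use Data::Dumper;

# Useqq makes Dumper print control characters as escapes such as \r and \n.
local $Data::Dumper::Useqq = 1;
# Terse prints the value only, without the leading "$VAR1 = ".
local $Data::Dumper::Terse = 1;

# Hypothetical sample; in the real script this would be the data read by <>.
my $allDocs = "<h4>Award #123</h4>\tDollars Obligated: \$500<br>\r\n<hr>\r";

my $dump = Dumper($allDocs);
print $dump;
```

Seeing where the \r and \n characters really fall makes it easier to tell whether the regexp or the record separator is at fault.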

      someone mentioned that there is no input....I have the file on the command line such that <> refers to it.

      It seems that eosbuddy missed or didn't understand <>.

        Thanks everyone, especially Cristoforo. I incorporated suggestions and made a few tweaks and it works fine now (I think), reading only record by record. Thanks again. Can't tell you how much it helped. Best, Michael
        Thanks everybody (especially Cristoforo). I implemented the suggestions and made some tweaks, and now it works great. Thanks for the help!!!

      $allDocs won't contain all records, only the first. <> in scalar context will return one 'record'. Use something like:

      local $/ = "<hr>\r";
      my $rxExtractDoc = qr {(?xms) (<h4>Award\s\#(\d+)(.*?)<hr>) };
      while (<>) {    # read one record at a time into $_
          if ($_ =~ m{$rxExtractDoc}) {    # match against $_; $allDocs no longer exists
              my %award = (); # award hash
              $award{'record'}= $1;
              $award{'A_awardno'}= $2;
              $award{'entireaward'}= $3;
              # $award{'entireaward'}=~ s/\n//g;
              $award{'entireaward'}=~ s/\t//g;
              $award{'entireaward'}=~ s/\r//g;
              # ... rest of code
          }
      }

      And yes, <> will read from STDIN or filenames from command line.
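For a version that can be run without the original data file, here is a rough sketch of the same record-at-a-time idea. It reads from an in-memory filehandle, and the two-record sample string, the simplified regexp, and the variable names are my own assumptions, not the real data or code.

```perl
use strict;
use warnings;

# Hypothetical two-record sample standing in for the real file.
my $data = "<h4>Award #1</h4>first<hr>\r\n<h4>Award #2</h4>second<hr>\r\n";

# Open the string as if it were a file.
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my @award_numbers;
{
    local $/ = "<hr>\r";            # each read now stops after "<hr>\r"
    while (my $record = <$fh>) {    # scalar context: one record per read
        if ($record =~ m{<h4>Award\s\#(\d+)(.*?)<hr>}ms) {
            push @award_numbers, $1;
        }
    }
}
close $fh;

print "Found awards: @award_numbers\n";
```

Because each pass through the loop holds only one record, memory use stays small even for a large file.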

      I hope this helps.

      Peter Stuifzand