in reply to Re: Large file data extraction
in thread Large file data extraction

Okay, I am slowly getting there. You are correct, the delimiter should have been in double quotes, "<hr>\r". With the single-quoted '<hr>\r' the \r was never interpreted as a carriage return, so the separator never matched and it was slurping the entire file in. However, as it stands now, the program with the following changes only reads in the first record and then quits without looping to the next record:
my $allDocs = do { local $/ = "<hr>\r"; <>; };
my $rxExtractDoc = qr {(?xms) (<h4>Award\s\#(\d+)(.*?)<hr>) };
while ($allDocs =~ m{$rxExtractDoc}g ) {
    my %award = (); # award hash
    $award{'record'}= $1;
    $award{'A_awardno'}= $2;
    $award{'entireaward'}= $3;
    #
    $award{'entireaward'}=~ s/\n//g;
    $award{'entireaward'}=~ s/\t//g;
    $award{'entireaward'}=~ s/\r//g;
    if ($award{'entireaward'} =~ m{Dollars Obligated(.*?)\$([^<]+?)< +}gi){
        $award{'B_dollob'} = $2};
{the rest of the code continues.}

But it doesn't loop to the next record once it finishes extracting the first record, and I am not sure why. Thoughts? Thank you so much again. This is very helpful (and I am at my wits' end as it is) :-)

Also, someone mentioned that there is no input....I have the file on the command line such that <> refers to it. Is that not correct?

Re^3: Large file data extraction
by ikegami (Patriarch) on Aug 12, 2008 at 03:51 UTC

    But it doesn't loop to the next record once it finishes extracting the first record

    Your regexp is wrong, or your data isn't what you think it is. What's in $allDocs?
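
One way to answer "what's in $allDocs?" is to dump the variable with control characters made visible. The following is a minimal sketch, not the poster's actual program: the sample data and the in-memory filehandle `$fh` are assumptions standing in for the real file read through <>.

```perl
use strict;
use warnings;
use Data::Dumper;

# Show control characters as escapes (\r, \n, \t) instead of raw bytes.
$Data::Dumper::Useqq = 1;

# Hypothetical sample input: two records, each terminated by "<hr>\r".
my $sample = "<h4>Award #1</h4>first\n<hr>\r<h4>Award #2</h4>second\n<hr>\r";

# An in-memory filehandle stands in for the file given on the command line.
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";

# Same kind of read as in the question: scalar context, custom separator.
my $allDocs = do { local $/ = "<hr>\r"; <$fh> };
close $fh;

# Inspect what was actually read, and how much of it.
print Dumper($allDocs);
printf "length: %d of %d characters\n", length $allDocs, length $sample;
```

With this sample, the dump contains only the first record, which is shorter than the whole input. That is a quick way to tell whether the problem lies in the read or in the regexp.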

    someone mentioned that there is no input....I have the file on the command line such that <> refers to it.

    It seems that eosbuddy missed or didn't understand <>.

      Thanks everyone, especially Cristoforo. I incorporated suggestions and made a few tweaks and it works fine now (I think), reading only record by record. Thanks again. Can't tell you how much it helped. Best, Michael
Re^3: Large file data extraction
by peter (Sexton) on Aug 12, 2008 at 15:04 UTC

    $allDocs won't contain all records, only the first. <> in scalar context will return one 'record'. Use something like:

    local $/ = "<hr>\r";
    my $rxExtractDoc = qr {(?xms) (<h4>Award\s\#(\d+)(.*?)<hr>) };
    while (<>) { # read one record at a time into $_
        if ($_ =~ m{$rxExtractDoc}) { # match against the current record
            my %award = (); # award hash
            $award{'record'}= $1;
            $award{'A_awardno'}= $2;
            $award{'entireaward'}= $3;
            #
            $award{'entireaward'}=~ s/\n//g;
            $award{'entireaward'}=~ s/\t//g;
            $award{'entireaward'}=~ s/\r//g;
            # ... rest of code
        }
    }
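
The difference between a single scalar-context read and a read loop can be checked on a small in-memory input. This is only a sketch under stated assumptions: the three sample records, the filehandle `$fh`, and the names `$first`, `$count` and `@award_numbers` are made up for illustration; only the separator and the regexp come from the thread.

```perl
use strict;
use warnings;

# Hypothetical sample input: three records, each terminated by "<hr>\r".
my $sample = join '', map { "<h4>Award #$_</h4>body $_<hr>\r" } 1 .. 3;

# Regexp from the thread: $1 = whole record, $2 = award number, $3 = rest.
my $rxExtractDoc = qr{(?xms) (<h4>Award\s\#(\d+)(.*?)<hr>) };

# Scalar context: one read returns one record, not the whole file.
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my $first = do { local $/ = "<hr>\r"; <$fh> };
close $fh;
my $count = () = $first =~ /<h4>/g;   # number of records in $first

# Loop: every iteration reads the next record until end of input.
my @award_numbers;
open $fh, '<', \$sample or die "Cannot open in-memory file: $!";
{
    local $/ = "<hr>\r";
    while (my $record = <$fh>) {
        push @award_numbers, $2 if $record =~ $rxExtractDoc;
    }
}
close $fh;

print "records in single read: $count\n";
print "award numbers from loop: @award_numbers\n";
```

Replacing `<$fh>` with `<>` gives the same behaviour for files named on the command line. Reading record by record also keeps memory use proportional to one record rather than to the whole file.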

    And yes, <> will read from the files named on the command line, or from STDIN if none are given.

    I hope this helps.

    Peter Stuifzand