
I put together a partial solution. It's difficult to give an accurate answer without seeing the rest of your parsing code and some sample data.

I've not used any of the HTML parsers, so I can't say how they might work. Like you, I've rolled my own parser, but as I say, it's difficult without seeing the data.

I'm assuming you are reading, on a Unix machine, a file that was created on Windows. That would explain why you use \r at different places in your code. Perhaps this will give you a start.

#!/usr/bin/perl
use strict;
use warnings;

my $awardhashref; # Why is this needed? The keys are already printed inside the loop.

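# Use the s modifier so '.' matches newlines.
# No need to end the regex with <hr>: each record is read up to <hr> and
# chomp strips it. The last capture must be greedy, though; a trailing
# non-greedy (.*?) with nothing after it always matches the empty string.
my $rxExtractDoc = qr{(<h4>Award\s#(\d+)(.*))}s;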

my $out = "/Users/micwood/Desktop/output.csv";
open OUT, '>', $out or die "Unable to open $out for writing: $!";

{
    local $/ = "<hr>";
    while (<>) {
        chomp;
        if (/$rxExtractDoc/) {
            my %award;
            $award{record}= $1;
            $award{A_awardno}= $2; 
            $award{entireaward}= $3;
            
            # Do you really want to replace each tab
            # with the empty string (nothing)?
            $award{entireaward}=~ s/\t//g;
            
            # Eliminate Windows's \r
            $award{entireaward}=~ s/\r//g;
            
            if ($award{entireaward} =~ m{Dollars Obligated.*?\$([^<]+)<}is){
                $award{B_dollob} = $1;
            };
            
            if ($award{entireaward} =~ m{Current Contract Value.*?\$([^<]+)<}is){
                $award{C_currentconvalue} = $1;
            };
      
            #... further parsing
            
            print # print to terminal
                qq{Award Number: $award{A_awardno}\n},
                qq{Dollars Obligated: $award{B_dollob}\n},
                qq{Current Contract Value: $award{C_currentconvalue}\n},
                qq{Ultimate Contract Value: $award{D_ultconvalue}\n},
                qq{Contracting Agency: $award{E_conagency}\n},
                q{-} x 25,
                qq{\n};

            delete $award{entireaward};
            delete $award{record};
            
            # print to file
            print OUT join(',', map {"$award{$_}"} sort keys %award), "\n";
            
            # $awardhashref= \%award;    ?    
        }
    }
}
close OUT or die "Unable to close $out: $!";

Update: Added chomp and changed inner while loop to an if. Also, set $/ to <hr>. Thanks ikegami.
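
To show what setting $/ to <hr> and calling chomp actually does, here is a minimal sketch that reads from an in-memory string instead of your file. The sub name read_records and the sample data are made up for illustration; I haven't seen your real input.

```perl
use strict;
use warnings;

# Hypothetical helper: split a string into records terminated by "<hr>".
sub read_records {
    my ($text) = @_;
    my @records;
    # Open a filehandle on the string so the normal readline logic applies.
    open my $fh, '<', \$text or die "Unable to open in-memory file: $!";
    {
        local $/ = "<hr>";          # each read returns text up to and including "<hr>"
        while (my $rec = <$fh>) {
            chomp $rec;             # chomp removes the current $/, i.e. the trailing "<hr>"
            push @records, $rec;
        }
    }
    close $fh;
    return @records;
}

# Made-up sample data, not taken from your file.
my @recs = read_records("<h4>Award #1</h4>one<hr><h4>Award #2</h4>two<hr>");
print "$_\n" for @recs;
```

Because chomp strips the separator, the regex that picks a record apart has nothing left to anchor on at the end, which is why the final capture has to run to the end of the string.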

Update2: Changed the print to output file. I was printing the keys instead of the values.
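
The keys-versus-values mix-up is easy to see with a hash slice. This is only a sketch: csv_row is a hypothetical helper and the values are made up.

```perl
use strict;
use warnings;

# Hypothetical helper: build one CSV line from a hash,
# with the values ordered by their sorted keys.
sub csv_row {
    my (%h) = @_;
    my @keys = sort keys %h;
    return join(',', @h{@keys});    # hash slice yields the values, not the keys
}

my %award = (A_awardno => '12345', B_dollob => '5000.00');  # made-up values
print csv_row(%award), "\n";
```

One caveat: a plain join does no quoting, so a dollar amount that itself contains a comma would be split across columns. If your data has those, the Text::CSV module from CPAN is probably the safer way to write the file.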

In reply to Re: Large file data extraction by Cristoforo
in thread Large file data extraction by micwood
