I put together a partial solution - it's difficult to provide an accurate answer without seeing the rest of your parsing code and some sample data.

I've not used any of the HTML parsers, so I can't say how they might work. Like you, I've rolled my own parser, but as I say, it's difficult without seeing the data.

I'm assuming you are reading, on a Unix machine, a file that was created on Windows. That would explain why you use \r in several places in your code. Perhaps this will give you a start.

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $awardhashref; # Why is this needed? The keys are already printed in the loop.

    # Use the s modifier so '.' matches newlines.
    # No need to end the regex with <hr> - your record already terminates with it.
    # The last capture is greedy: a trailing non-greedy (.*?) would match nothing.
    my $rxExtractDoc = qr{(<h4>Award\s#(\d+)(.*))}s;

    my $out = "/Users/micwood/Desktop/output.csv";
    open OUT, '>', $out or die "Unable to open $out for writing: $!";

    {
        local $/ = "<hr>";
        while (<>) {
            chomp;
            if (/$rxExtractDoc/) {
                my %award;
                $award{record}      = $1;
                $award{A_awardno}   = $2;
                $award{entireaward} = $3;

                # Do you really want to replace each tab
                # with the empty string (nothing)?
                $award{entireaward} =~ s/\t//g;

                # Eliminate Windows's \r
                $award{entireaward} =~ s/\r//g;

                if ($award{entireaward} =~ m{Dollars Obligated.*?\$([^<]+)<}is) {
                    $award{B_dollob} = $1;
                }
                if ($award{entireaward} =~ m{Current Contract Value.*?\$([^<]+)<}is) {
                    $award{C_currentconvalue} = $1;
                }
                #... further parsing

                print # print to terminal
                    qq{Award Number: $award{A_awardno}\n},
                    qq{Dollars Obligated: $award{B_dollob}\n},
                    qq{Current Contract Value: $award{C_currentconvalue}\n},
                    qq{Ultimate Contract Value: $award{D_ultconvalue}\n},
                    qq{Contracting Agency: $award{E_conagency}\n},
                    q{-} x 25, qq{\n};

                delete $award{entireaward};
                delete $award{record};

                # print to file
                print OUT join(',', map {"$award{$_}"} sort keys %award), "\n";
                # $awardhashref = \%award; ?
            }
        }
    }
    close OUT or die "Unable to close $out: $!";

Update: Added chomp and changed inner while loop to an if. Also, set $/ to <hr>. Thanks ikegami.

Update2: Changed the print to output file. I was printing the keys instead of the values.
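
To show what setting $/ and calling chomp do together, here is a minimal sketch that reads <hr>-terminated records from an in-memory string. The sample data is made up (I haven't seen your real file, so its layout is an assumption), and the names $html, $fh and @records exist only for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up sample data; the layout of the real file is an assumption.
my $html = "<h4>Award #101</h4>\r\nDollars Obligated: <b>\$1,500.00</b><hr>"
         . "<h4>Award #102</h4>\r\nDollars Obligated: <b>\$275.50</b><hr>";

# Open the string as an in-memory file so the example is self-contained.
open my $fh, '<', \$html or die "Unable to open in-memory file: $!";

my @records;
{
    local $/ = "<hr>";       # read one <hr>-terminated record at a time
    while (my $rec = <$fh>) {
        chomp $rec;          # chomp removes the current $/, i.e. "<hr>"
        $rec =~ s/\r//g;     # drop Windows carriage returns
        push @records, $rec;
    }
}
close $fh;

for my $rec (@records) {
    my ($awardno) = $rec =~ /Award\s#(\d+)/;
    my ($dollars) = $rec =~ m{Dollars Obligated.*?\$([^<]+)<}is;
    print "$awardno: $dollars\n";
}
```

Because $/ is localized inside the block, the default line-at-a-time behaviour comes back as soon as the block ends, so code elsewhere in the script is unaffected.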


In reply to Re: Large file data extraction by Cristoforo
in thread Large file data extraction by micwood
