I've not used any of the HTML parsers, so I can't say how they might work. Like you, I've rolled my own parser, but as I say, it's difficult without seeing the data.
I'm assuming you are reading, on a Unix machine, a file that was created on Windows. That would explain why you are using \r at different places in your code. Perhaps this will give you a start.
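A minimal sketch of that line-ending issue, assuming the input really does have Windows-style CRLF endings: on Unix, chomp removes only the trailing "\n", so the "\r" stays behind unless it is stripped explicitly. The sample string and the helper name strip_cr are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: remove one trailing CRLF pair (or a bare LF).
sub strip_cr {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;
    return $line;
}

# Made-up line, as it would arrive from a file written on Windows.
my $raw = "Award #123\r\n";

my $chomped = $raw;
chomp $chomped;    # with the default $/ this removes "\n" only; "\r" remains

my $clean = strip_cr($raw);

printf "after chomp:    %d chars\n", length $chomped;
printf "after strip_cr: %d chars\n", length $clean;
```

The leftover "\r" is invisible in most terminals, which is why it tends to show up only as odd-looking output or regexes that unexpectedly fail to match at the end of a line.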
Update: Added chomp and changed the inner while loop to an if. Also set $/ to <hr>. Thanks, ikegami.

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $awardhashref;  # Why is this needed? Already printing in the loop.

    # Use the s modifier so '.' matches newlines.
    # No need to end the regex with <hr> - your record already terminates with it.
    # The last group is greedy: a trailing (.*?) would always match the empty string.
    my $rxExtractDoc = qr{(<h4>Award\s#(\d+)(.*))}s;

    my $out = "/Users/micwood/Desktop/output.csv";
    open OUT, '>', $out or die "Unable to open $out for writing: $!";

    {
        local $/ = "<hr>";
        while (<>) {
            chomp;
            if (/$rxExtractDoc/) {
                my %award;
                $award{record}      = $1;
                $award{A_awardno}   = $2;
                $award{entireaward} = $3;

                # Do you really want to replace each tab
                # with the 'empty string' (nothing)?
                $award{entireaward} =~ s/\t//g;

                # Eliminate Windows's \r
                $award{entireaward} =~ s/\r//g;

                if ($award{entireaward} =~ m{Dollars Obligated.*?\$([^<]+)<}is) {
                    $award{B_dollob} = $1;
                }
                if ($award{entireaward} =~ m{Current Contract Value.*?\$([^<]+)<}is) {
                    $award{C_currentconvalue} = $1;
                }
                #... further parsing

                print    # print to terminal
                    qq{Award Number: $award{A_awardno}\n},
                    qq{Dollars Obligated: $award{B_dollob}\n},
                    qq{Current Contract Value: $award{C_currentconvalue}\n},
                    qq{Ultimate Contract Value: $award{D_ultconvalue}\n},
                    qq{Contracting Agency: $award{E_conagency}\n},
                    q{-} x 25, qq{\n};

                delete $award{entireaward};
                delete $award{record};

                # print to file
                print OUT join(',', map {"$award{$_}"} sort keys %award), "\n";
                # $awardhashref = \%award; ?
            }
        }
    }
    close OUT or die "Unable to close $out: $!";
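To show the record-separator idea in isolation, here is a small sketch that reads from an in-memory string instead of a file. The sample HTML is invented, since the real data was not posted; only the technique (setting $/ to "<hr>" and chomping it off each record) is taken from the script.

```perl
use strict;
use warnings;

# Invented sample data: two records, each terminated by <hr>.
my $html = join '',
    "<h4>Award #101</h4><p>Dollars Obligated: \$1,500.00</p><hr>",
    "<h4>Award #102</h4><p>Dollars Obligated: \$275.50</p><hr>";

# Open a read handle on the string so the sketch needs no input file.
open my $in, '<', \$html or die "Unable to open in-memory handle: $!";

my @awards;
{
    local $/ = "<hr>";    # each read now returns one whole record
    while ( my $record = <$in> ) {
        chomp $record;    # chomp removes the current $/, i.e. "<hr>"
        next unless $record =~ m{<h4>Award\s#(\d+)(.*)}s;

        # Copy the captures before running another match overwrites them.
        my ( $number, $body ) = ( $1, $2 );
        my ($dollars) = $body =~ m{Dollars Obligated.*?\$([^<]+)<}is;
        push @awards, { number => $number, dollars => $dollars };
    }
}
close $in;

print "$_->{number}: $_->{dollars}\n" for @awards;
```

Because $/ is set with local inside a bare block, the default line-by-line behaviour comes back as soon as the block ends.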
Update2: Fixed the print to the output file. I was printing the keys instead of the values.
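The difference between the two versions can be sketched as follows. The record below is invented, and the csv_field helper is a hypothetical addition, not part of the script; it is there only because dollar amounts such as 1,500.00 contain a comma, which a plain join(',') would split into two columns.

```perl
use strict;
use warnings;

# Invented record, using the same key naming as the script.
my %award = (
    A_awardno => '101',
    B_dollob  => '1,500.00',
);

# Hypothetical helper: quote a field if it contains a comma, quote, or newline.
sub csv_field {
    my ($value) = @_;
    $value = '' unless defined $value;
    if ( $value =~ /[",\r\n]/ ) {
        $value =~ s/"/""/g;         # double any embedded quotes
        $value = qq{"$value"};
    }
    return $value;
}

my @columns = sort keys %award;     # the A_, B_ prefixes fix the column order
my $header  = join ',', @columns;   # the keys
my $row     = join ',', map { csv_field( $award{$_} ) } @columns;    # the values

print "$header\n$row\n";
```

Sorting the keys is what keeps the columns aligned from one record to the next, provided every record ends up with the same set of keys.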
In reply to "Re: Large file data extraction" by Cristoforo, in thread "Large file data extraction" by micwood