I really appreciate the help on this issue; it has been wonderful. I have been intrigued by Perl and what it can do. It is definitely quite different from anything I have used before. I did have to dig into the solutions that were provided a little, because they were not working right out of the gate for me. It came down to leading spaces in the input, in front of "Interpolated Hydrograph", which for some reason I left out of the original post.
I modified that, added a header row, and removed the \r on the MCE->gather so there are no blank lines. The time difference is night and day:
Batch (~275k lines) = ~3 hours
Perl <1 second
The final code I am using, which works like a charm:
use strict;
use warnings;
use MCE::Loop;
use MCE::Candy;
my $input_file = shift || 'input.txt';
my $output_file = shift || 'output.txt';
my $match_string = " INTERPOLATED HYDROGRAPH AT ";
open my $ofh, ">", $output_file
    or die "cannot open '$output_file' for writing: $!\n";
print $ofh "HEC1_ID,Q100,V100\n";
MCE::Loop::init {
    use_slurpio => 1, chunk_size => 1, max_workers => 4,
    gather => MCE::Candy::out_iter_fh($ofh),
    RS => "\n${match_string}",
};
## Below, each worker receives one record at a time
## Output order is preserved via MCE::Candy::out_iter_fh
## line 1 CAC40 # INTERPOLATED HYDROGRAPH AT CAC40
## line 2 # blank line here
## line 3 # PEAK FLOW TIME MAXIMUM AVERAGE FLOW
## line 4 # 6-HR 24-HR 72-HR 166.58-HR
## line 5 # + (CFS) (HR)
## line 6 # (CFS)
## line 7 1223. # + 1223. 12.67 890. 588. 245. 106.
## line 8 # (INCHES) .154 .408 .509 .509
## line 9 1456. # (AC-FT) 441. 1166. 1456. 1456.
## line 10 # CUMULATIVE AREA = 53.67 SQ MI
mce_loop_f {
    my ( $mce, $chunk_ref, $chunk_id ) = @_;
    ## Skip initial record containing header lines including *** ***
    if ( $chunk_id == 1 && $$chunk_ref !~ /^${match_string}/ ) {
        ## Gathering here is necessary when preserving output order,
        ## to let the manager process know chunk_id 1 has completed.
        MCE->gather( $chunk_id, "" );
        MCE->next;
    }
    ## Each record begins with INTERPOLATED HYDROGRAPH.
    my ( $k1, $k2, $k3 ) = ( "", "", "" );
    ## Read the record line by line through an in-memory filehandle.
    open my $ifh, "<", $chunk_ref
        or die "cannot open record $chunk_id for reading: $!\n";
    while ( <$ifh> ) {
        $k1 = $1 and next if $. == 1 && /(\S+)\s*$/;
        $k2 = $1 and next if $. == 7 && /^\S+\s+(\S+)/;
        $k3 = $1 and last if $. == 9 && /(\S+)\s*$/;
    }
    close $ifh;
    ## Gather values.
    MCE->gather( $chunk_id, "$k1,$k2,$k3\n" );
} $input_file;
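For anyone who wants to check the line-number logic on a single record without MCE, here is a minimal sketch of the same extraction in plain Perl. The helper name extract_fields and the sample record are assumptions of mine for illustration: the sample is modeled on the commented layout above, and the column spacing in a real HEC-1 output file may well differ.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the script above): pull the station ID,
# peak flow, and volume out of one record held in a string.
sub extract_fields {
    my ($record) = @_;
    my ( $id, $peak, $volume ) = ( "", "", "" );

    # Read the string line by line through an in-memory filehandle,
    # so that $. gives the line number within this record.
    open my $fh, "<", \$record
        or die "cannot open in-memory record: $!\n";
    while ( my $line = <$fh> ) {
        if    ( $. == 1 && $line =~ /(\S+)\s*$/ )    { $id     = $1 }        # last field on line 1
        elsif ( $. == 7 && $line =~ /^\S+\s+(\S+)/ ) { $peak   = $1 }        # second field on line 7
        elsif ( $. == 9 && $line =~ /(\S+)\s*$/ )    { $volume = $1; last }  # last field on line 9
    }
    close $fh;
    return ( $id, $peak, $volume );
}

# Assumed sample record; spacing is illustrative only.
my $sample = join "\n",
    " INTERPOLATED HYDROGRAPH AT CAC40",
    "",
    "     PEAK FLOW   TIME   MAXIMUM AVERAGE FLOW",
    "                        6-HR 24-HR 72-HR 166.58-HR",
    "+      (CFS)     (HR)",
    "                        (CFS)",
    "+      1223.    12.67   890. 588. 245. 106.",
    "                (INCHES) .154 .408 .509 .509",
    "                (AC-FT) 441. 1166. 1456. 1456.",
    "      CUMULATIVE AREA = 53.67 SQ MI",
    "";

print join( ",", extract_fields($sample) ), "\n";    # prints CAC40,1223.,1456.
```

Because the values are picked by line number within the record, any extra or missing line ahead of line 7 or line 9 would shift the result, so it is worth trying this on a few real records first.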
Thanks again. I hope to be learning more of this in the future.