in reply to Re: Perl solution for current batch file to extract specific column text
in thread Perl solution for current batch file to extract specific column text

Quite interesting. Although I vaguely knew about the MCE modules, I have never tried them, because I assumed they would bring little benefit when reading a single very large file (my input files are often several gigabytes, sometimes tens of gigabytes). Your example suggests that I was probably dead wrong.

I should probably give it a try. One of my problems, though, is that I am currently stuck with a very old version of Perl (5.8) because of the AIX and VMS versions I work on, so dependencies might be fairly difficult to resolve.

That said, we should move fairly soon (hopefully within a few months) to new hardware (blades) running much more recent versions of Linux, and therefore of Perl. If not now, then at least at that point I might be able to take advantage of the MCE modules.

Thank you for the information.

Re^3: Perl solution for current batch file to extract specific column text
by oryan (Initiate) on Aug 06, 2015 at 22:45 UTC

    I do appreciate the help on this issue; it has been wonderful. I have been intrigued by Perl and what it can do. It is definitely quite different from anything that I have used before. I did have to dig into the provided solutions a little, because they did not work right out of the gate for me. The cause turned out to be leading spaces in front of "Interpolated Hydrograph" in the input lines, which for some reason I left out of the original post.

    I modified that, added a header row, and removed the \r on the MCE->gather so there are no blank lines. The time difference is night and day:

    Batch (~275k lines) = ~3 hours

    Perl <1 second

    The final code I am using, which works like a charm:

    use strict;
    use warnings;

    use MCE::Loop;
    use MCE::Candy;

    my $input_file  = shift || 'input.txt';
    my $output_file = shift || 'output.txt';

    my $match_string = " INTERPOLATED HYDROGRAPH AT ";

    open my $ofh, ">", $output_file
       or die "cannot open '$output_file' for writing: $!\n";

    print $ofh "HEC1_ID,Q100,V100\n";

    MCE::Loop::init {
       use_slurpio => 1, chunk_size => 1, max_workers => 4,
       gather => MCE::Candy::out_iter_fh($ofh),
       RS => "\n${match_string}",
    };

    ## Below, each worker receives one record at a time
    ## Output order is preserved via MCE::Candy::out_iter_fh

    ## line  1   CAC40  #  INTERPOLATED HYDROGRAPH AT    CAC40
    ## line  2          #  blank line here
    ## line  3          #    PEAK FLOW   TIME      MAXIMUM AVERAGE FLOW
    ## line  4          #                          6-HR   24-HR  72-HR  166.58-HR
    ## line  5          # +    (CFS)     (HR)
    ## line  6          #                 (CFS)
    ## line  7   1223.  # +   1223.    12.67       890.   588.   245.   106.
    ## line  8          #               (INCHES)   .154   .408   .509   .509
    ## line  9   1456.  #               (AC-FT)    441.  1166.  1456.  1456.
    ## line 10          #       CUMULATIVE AREA =  53.67 SQ MI

    mce_loop_f {
       my ( $mce, $chunk_ref, $chunk_id ) = @_;

       ## Skip initial record containing header lines including *** ***
       if ( $chunk_id == 1 && $$chunk_ref !~ /^${match_string}/ ) {
          ## Gathering here is necessary when preserving output order,
          ## to let the manager process know chunk_id 1 has completed.
          MCE->gather( $chunk_id, "" );
          MCE->next;
       }

       ## Each record begins with INTERPOLATED HYDROGRAPH.
       my ( $k1, $k2, $k3 ) = ( "", "", "" );

       open my $ifh, "<", $chunk_ref;

       while ( <$ifh> ) {
          $k1 = $1 and next if $. == 1 && /(\S+)\s*$/;
          $k2 = $1 and next if $. == 7 && /^\S+\s+(\S+)/;
          $k3 = $1 and last if $. == 9 && /(\S+)\s*$/;
       }

       close $ifh;

       ## Gather values.
       MCE->gather( $chunk_id, "$k1,$k2,$k3\n" );

    } $input_file;
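
    For readers stuck on an older Perl without MCE, the same per-record extraction can be sketched serially with core Perl only. This is a minimal sketch, not the code from the thread: the helper name extract_hydrographs and the sample data are assumptions made for illustration, and it assumes every record follows the ten-line layout described in the comments of the code above. The \s* in the header pattern is there to tolerate leading spaces in front of "INTERPOLATED HYDROGRAPH".

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): reads lines from a filehandle
# and returns one "ID,peak_flow,volume" string per hydrograph record.
sub extract_hydrographs {
    my ($fh) = @_;
    my @rows;
    my ( $id, $q, $n ) = ( undef, "", 0 );

    while ( my $line = <$fh> ) {
        $line =~ s/\r?\n\z//;    # tolerate both Unix and DOS line endings

        # A record starts here; \s* absorbs any leading spaces.
        if ( $line =~ /^\s*INTERPOLATED HYDROGRAPH AT\s+(\S+)/ ) {
            ( $id, $q, $n ) = ( $1, "", 1 );
            next;
        }
        next unless defined $id;    # still in the file header
        $n++;                       # line number within the current record

        if ( $n == 7 && $line =~ /^\s*\S+\s+(\S+)/ ) {
            $q = $1;                # line 7: peak flow is the second field
        }
        elsif ( $n == 9 && $line =~ /(\S+)\s*$/ ) {
            push @rows, "$id,$q,$1";    # line 9: last AC-FT value
            undef $id;                  # wait for the next record header
        }
    }
    return @rows;
}

# Demo on an in-memory sample (assumed data, modelled on the layout above).
my $demo = join "\n",
    "*** header ***",
    "   INTERPOLATED HYDROGRAPH AT    CAC40",
    "",
    "    PEAK FLOW   TIME      MAXIMUM AVERAGE FLOW",
    "                          6-HR   24-HR  72-HR  166.58-HR",
    "+    (CFS)     (HR)",
    "                 (CFS)",
    "+   1223.    12.67       890.   588.   245.   106.",
    "               (INCHES)   .154   .408   .509   .509",
    "               (AC-FT)    441.  1166.  1456.  1456.",
    "       CUMULATIVE AREA =  53.67 SQ MI",
    "";

open my $demo_fh, "<", \$demo or die "cannot open in-memory sample: $!\n";
print "$_\n" for extract_hydrographs($demo_fh);    # prints CAC40,1223.,1456.
close $demo_fh;
```

    Because it reads one line at a time, this sketch keeps memory use flat even on multi-gigabyte inputs, but it runs on a single core, so it will not match the MCE version on large files.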

    Thanks again. I hope to be learning more of this in the future.

      Thank you, oryan, for sharing the before-and-after results. That is really amazing. Solutions based on the initial post are sometimes not spot on, but we tried nonetheless.

      Kind regards, Mario