in reply to Perl solution for current batch file to extract specific column text

Update: Increased chunk size to 400.

Below, a parallel version with chunking enabled for the solution provided by monk Laurent_R. I ran against an input file containing 500k records.

Serial: 2.574 seconds. Parallel: 0.895 seconds, which includes the time to fork and reap children under a Unix environment. Afterwards, the output contains 500k lines.

The test machine is a 2.6 GHz Haswell Core i7 with RAM at 1600 MHz.

Optionally, the script can receive the input_file and output_file as arguments.

use strict;
use warnings;

use MCE::Loop;
use MCE::Candy;

my $input_file  = shift || 'input.txt';
my $output_file = shift || 'output.txt';

open my $ofh, ">", $output_file
    or die "cannot open '$output_file' for writing: $!\n";

MCE::Loop::init {
    use_slurpio => 1, chunk_size => 400, max_workers => 4,
    gather => MCE::Candy::out_iter_fh($ofh),
    RS => "\nINTERPOLATED HYDROGRAPH",
};

## Each worker receives many records determined by chunk_size.
## Output order is preserved via MCE::Candy::out_iter_fh

mce_loop_f {
    my ( $mce, $chunk_ref, $chunk_id ) = @_;

    open my $ifh, "<", $chunk_ref;
    my $output = "";

    while ( my $line = <$ifh> ) {
        chomp $line;   # remove newline character from end of line
        if ( $line =~ /INTERPOLATED HYDROGRAPH AT (\w+)$/ ) {
            $output .= $1;
            $line = <$ifh> for 1..6;            # skip 5 lines
            my $val2 = (split / /, $line)[1];   # get the second column
            $output .= " $val2";
            $line = <$ifh> for 1..2;            # skip one line
            chomp $line;
            my $val3 = (split / /, $line)[-1];  # get the last column
            $output .= " $val3\r\n";
        }
    }

    close $ifh;

    MCE->gather( $chunk_id, $output );

} $input_file;

close $ofh;
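
To make the record-based chunking easier to picture, here is a minimal sketch in plain Perl, without MCE. It splits a small, made-up input on the same "INTERPOLATED HYDROGRAPH" marker so that each record starts with the marker text. The sample text and the use of split with a lookahead are assumptions for illustration only; this is not how MCE reads the file internally.

```perl
use strict;
use warnings;

# Hypothetical sample: one header line followed by two records.
my $text = join "\n",
    "*** header ***",
    "INTERPOLATED HYDROGRAPH AT CAC40",
    "data for CAC40",
    "INTERPOLATED HYDROGRAPH AT CAC41",
    "data for CAC41",
    "";

# Split at the newline before each marker. The lookahead leaves the
# marker at the start of the record that follows it.
my @records = split /\n(?=INTERPOLATED HYDROGRAPH)/, $text;

# Print the record number and the first line of each record.
printf "%d: %s\n", $_ + 1, ( split /\n/, $records[$_] )[0]
    for 0 .. $#records;
```

The first element holds whatever precedes the first marker (here the header line), so a worker has to tolerate a chunk that contains no hydrograph block.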

Kind regards, Mario.

Re^2: Perl solution for current batch file to extract specific column text
by Laurent_R (Canon) on Aug 04, 2015 at 20:23 UTC
    Quite interesting. Although I vaguely knew about the MCE modules before, I have never tried to use them, because I had the feeling that they would not bring much benefit when reading only one very large file (my input files are often gigabytes or even tens of gigabytes in size). It appears from your example that I was probably dead wrong.

    I should probably give it a try. One of my problems, though, is that I am currently stuck with very old versions of Perl (5.8) because of the AIX and VMS versions I am working on, so dependencies might be fairly difficult to resolve.

    But, having said that, we should move relatively soon (hopefully just a few months) to new hardware (blades) with much more recent versions of Linux, thus enabling much more recent versions of Perl. If not now, at least then, I might be able to take advantage of the MCE module(s).

    Thank you for the information.

      I do appreciate the help on this issue; it has been wonderful. I have been intrigued by Perl and what it can do. It is definitely quite different from anything that I have used before. I did have to dig into the provided solutions a little, because they were not working right out of the gate for me. It came down to leading spaces in front of "Interpolated Hydrograph" on that line of the input. For some reason I left those off in the original post.

      I modified that, added a header row, and removed the \r on the MCE->gather so there are no blank lines. The time difference is night and day:

      Batch (~275k lines) = ~3 hours

      Perl <1 second

      The final code I am using, which works like a charm:

      use strict;
      use warnings;

      use MCE::Loop;
      use MCE::Candy;

      my $input_file   = shift || 'input.txt';
      my $output_file  = shift || 'output.txt';
      my $match_string = " INTERPOLATED HYDROGRAPH AT ";

      open my $ofh, ">", $output_file
          or die "cannot open '$output_file' for writing: $!\n";

      print $ofh "HEC1_ID,Q100,V100\n";

      MCE::Loop::init {
          use_slurpio => 1, chunk_size => 1, max_workers => 4,
          gather => MCE::Candy::out_iter_fh($ofh),
          RS => "\n${match_string}",
      };

      ## Below, each worker receives one record at a time
      ## Output order is preserved via MCE::Candy::out_iter_fh

      ## line  1  CAC40  # INTERPOLATED HYDROGRAPH AT CAC40
      ## line  2         # blank line here
      ## line  3         # PEAK FLOW TIME MAXIMUM AVERAGE FLOW
      ## line  4         # 6-HR 24-HR 72-HR 166.58-HR
      ## line  5         # + (CFS) (HR)
      ## line  6         # (CFS)
      ## line  7  1223.  # + 1223. 12.67 890. 588. 245. 106.
      ## line  8         # (INCHES) .154 .408 .509 .509
      ## line  9  1456.  # (AC-FT) 441. 1166. 1456. 1456.
      ## line 10         # CUMULATIVE AREA = 53.67 SQ MI

      mce_loop_f {
          my ( $mce, $chunk_ref, $chunk_id ) = @_;

          ## Skip initial record containing header lines including *** ***

          if ( $chunk_id == 1 && $$chunk_ref !~ /^${match_string}/ ) {
              ## Gathering here is necessary when preserving output order,
              ## to let the manager process know chunk_id 1 has completed.
              MCE->gather( $chunk_id, "" );
              MCE->next;
          }

          ## Each record begins with INTERPOLATED HYDROGRAPH.

          my ( $k1, $k2, $k3 ) = ( "", "", "" );

          open my $ifh, "<", $chunk_ref;

          while ( <$ifh> ) {
              $k1 = $1 and next if $. == 1 && /(\S+)\s*$/;
              $k2 = $1 and next if $. == 7 && /^\S+\s+(\S+)/;
              $k3 = $1 and last if $. == 9 && /(\S+)\s*$/;
          }

          close $ifh;

          ## Gather values.

          MCE->gather( $chunk_id, "$k1,$k2,$k3\n" );

      } $input_file;
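
The per-record parsing in the worker can also be tried on its own, without MCE. The sketch below feeds one sample record (the CAC40 record from the comments in the code above) through an in-memory filehandle and picks fields by line number with `$.`. The column spacing in the sample is an assumption, since the original layout was not preserved, and the if/elsif form is used only for readability.

```perl
use strict;
use warnings;

# Sample record based on the comments in the code above. The exact
# column spacing is assumed; the regexes only rely on whitespace.
my $record = join "\n",
    " INTERPOLATED HYDROGRAPH AT CAC40",
    "",
    " PEAK FLOW TIME MAXIMUM AVERAGE FLOW",
    " 6-HR 24-HR 72-HR 166.58-HR",
    "+ (CFS) (HR)",
    " (CFS)",
    "+ 1223. 12.67 890. 588. 245. 106.",
    " (INCHES) .154 .408 .509 .509",
    " (AC-FT) 441. 1166. 1456. 1456.",
    " CUMULATIVE AREA = 53.67 SQ MI",
    "";

my ( $k1, $k2, $k3 ) = ( "", "", "" );

# Open a read handle on the string itself; $. then counts lines
# within this record, starting at 1.
open my $ifh, "<", \$record
    or die "cannot open in-memory handle: $!\n";

while ( my $line = <$ifh> ) {
    if    ( $. == 1 && $line =~ /(\S+)\s*$/ )    { $k1 = $1 }        # last token: station id
    elsif ( $. == 7 && $line =~ /^\S+\s+(\S+)/ ) { $k2 = $1 }        # second token: peak flow
    elsif ( $. == 9 && $line =~ /(\S+)\s*$/ )    { $k3 = $1; last }  # last token: volume
}
close $ifh;

my $row = "$k1,$k2,$k3\n";
print $row;
```

For this sample the script prints `CAC40,1223.,1456.`, which matches the CSV layout announced by the `HEC1_ID,Q100,V100` header row.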

      Thanks again. I hope to be learning more of this in the future.

        Thank you oryan for sharing the before and after results. That is really amazing. Sometimes, providing solutions based on the initial post may not be spot on. But, we tried nonetheless.

        Kind regards, Mario