Extract string to file

oryan has asked for the wisdom of the Perl Monks concerning the following question:

So, my first post and experience here with Perl has prompted me to find better solutions to what I do using Perl. First post: http://perlmonks.org/index.pl?node_id=1137275.

I am trying to modify this solution for another project of mine. I can get all the individual parts to work, but they won’t come together. The idea is I have a large file (again) and am trying to extract text (again) from a single line that starts with a certain delimiter (the previous solution pulled text from different lines). I can get each column to extract to my csv individually, but I can’t get them to do it all at the same time. Here is what I have, modified from the previous solution

use strict;
use warnings;

use MCE::Loop;
use MCE::Candy;

## Input and output files defined here as well as what we are searchin
+g for in the input file
my $input_file  = shift || 'InputFile.OH1';
my $output_file = shift || 'OutputFile.csv';
my $match_string = "+                          ";

open my $ofh, ">", $output_file
   or die "cannot open '$output_file' for writing: $!\n";

## This writes a header row that GIS sees as the column heading
print $ofh "HEC1_ID,Q100_Base,TTP,Area\n";
   
MCE::Loop::init {
   use_slurpio => 1, chunk_size => 1, max_workers => 4,
   gather => MCE::Candy::out_iter_fh($ofh),
   RS => "\n${match_string}",
};

## Below, each worker receives one record at a time
## Output order is preserved via MCE::Candy::out_iter_fh

## EXAMPLE INPUT FILE
Line  1##+                           BPI30      1319.   13.50         
+477.        147.         49.       4.64
Line  2##
Line  3##          ROUTED TO
Line  4##+                           RPI30      1220.   13.75         
+475.        147.         49.       4.64
Line  5##
Line  6##          HYDROGRAPH AT
Line  7##+                           BPI31       765.   12.42         
+102.         26.          9.        .73
Line  8##
Line  9##          2 COMBINED AT
Line 10##+                           CPI31      1242.   13.75         
+571.        172.         58.       5.37

mce_loop_f {
   my ( $mce, $chunk_ref, $chunk_id ) = @_;

   ## Skip initial record containing header lines including *** ***
   if ( $chunk_id == 1 && $$chunk_ref !~ /^${match_string}/ ) {
      ## Gathering here is necessary when preserving output order,
      ## to let the manager process know chunk_id 1 has completed.
      MCE->gather( $chunk_id, "" );
      MCE->next;
   }

   ## Each record begins with "+                          "
   my ( $k1, $k2, $k3, $k4 ) = ( "", "", "", "" );
   open my $ifh, "<", $chunk_ref;
   while ( <$ifh> ) {
      $k1 = $1 and next if $. == 1 && /^\S\s+(\S+)/;
      $k2 = $1 and next if $. == 1 && /^\S\s+\S+\s+(\S+)/;
      $k3 = $1 and next if $. == 1 && /^\S\s+\S+\s+\S+\s+(\S+)/;
      $k4 = $1 and last if $. == 1 && /(\S+)\s*$/;
   }
   close $ifh;

## Gather values.
## This outputs everything to the output file in the format below.
   MCE->gather( $chunk_id, "$k1,$k2,$k3,$k4\n" );
} $input_file;
[download]

As an example from Line 1 I am trying to extract BPI30, 1319, 13.50, 4.64, and then RPI30, 1220, 13.75, 4.64 from Line 4, etc. The lines always begin with “+ (spaces)“. Each one of the $k1-4 will extract the correct data into the right column, but it won’t put it together. Any help is appreciated, and I hope it is a silly oversight on my part. Thanks.

Comment on Extract string to file Download Code

Replies are listed 'Best First'.
Re: Extract string to file by aitap (Curate) on Aug 14, 2015 at 09:14 UTC
I've read the rationale for using MCE for this task in the comments for the previous post, but I still don't understand why parallelism helps here: surely gathering four parts from a 101-character string should not be a computationally expensive task? Each one of the $k1-4 will extract the correct data into the right column, but it won’t put it together. First of all, there is a warning, ^+ matches null string many times in regex; marked by <-- HERE in m/^+ <-- HERE / at 1138497.pl line 31, <$_IN_FILE> line 1. which means that you should have escaped (quotemeta or `\Q...\E`) your `$match_string` before interpolating it into a regex. Next, in your `while ( <$ifh> )` loop you try to process next string after getting the first column of the first string, while all your data are in one single string. You should remove next from your `while` loop. Furthermore, it may make sense to rewrite the string processing part to make the string match only once and speed the process up. Try this: use warnings; use strict; print "HEC1_ID,Q100_Base,TTP,Area\n"; while (<DATA>) { s{ ^ # at the beginning of the line \+ # followed by literal plus (?:\s+(\S+)) # column one (?:\s+(\S+)) # column two (?:\s+(\S+)) # column three (?:\s+\S+){3} # skip three columns \s+(\S+)$ # catch the last one, too }{$1,$2,$3,$4}x and print; } __DATA__ + BPI30 1319. 13.50 477. + 147. 49. 4.64 ROUTED TO + RPI30 1220. 13.75 475. + 147. 49. 4.64 HYDROGRAPH AT + BPI31 765. 12.42 102. + 26. 9. .73 2 COMBINED AT + CPI31 1242. 13.75 571. + 172. 58. 5.37 [download] `HEC1_ID,Q100_Base,TTP,Area BPI30,1319.,13.50,4.64 RPI30,1220.,13.75,4.64 BPI31,765.,12.42,.73 CPI31,1242.,13.75,5.37` [download]	[reply] [d/l] [select]
Re^2: Extract string to file by oryan (Initiate) on Aug 28, 2015 at 15:47 UTC
Thank you for the reply. I forgot to add my final solution here. I tried modifying my previous post question solution with the quotemeta as you had suggested but it did not seem to be working for me. While it has the warning, it does still provide the needed output. On this post your suggestions worked great. I simply added a slight modification so that I could set an input file and output to a file instead of onscreen. Here is what I ended up with. use warnings; use strict; ## Select input and output files my $input_file = 'InputFile.OH1'; my $output_file = 'OutputFile.csv'; open my $ifh, "<", $input_file or die "cannot open '$input_file' for reading: $!\n"; open my $ofh, ">", $output_file or die "cannot open '$output_file' for writing: $!\n"; ## Creates a header at the beginning of the file my $header = "HEC1_ID,Q100_Base,TTP,Area\n"; print $ofh $header; ## Extracts data from input and sends through STDOUT to output file select $ofh; while (<$ifh>) { s{ ^ # at the beginning of the line \+ # followed by literal plus (?:\s+(\S+)) # column one (?:\s+(\S+)) # column two (?:\s+(\S+)) # column three (?:\s+\S+){3} # skip three columns \s+(\S+)$ # catch the last one, too }{$1,$2,$3,$4}x and print; } select STDOUT; [download] Once again, thanks for the help	[reply] [d/l]
Re: Extract string to file by marioroy (Prior) on Sep 01, 2015 at 11:37 UTC
The serial solution by aitap seems fast enough. I was curious and wanted to see if running faster is possible against a 200 MB input file. The serial version completes in 4.566 seconds and parallel in 1.440 seconds, which is beyond 3x faster. Perl regular expressions involves CPU time and the reason for running faster. IO is sequential in MCE, not parallel or random, to minimize unnecessary delays; e.g. seek time. Perl is fast at these things :) use warnings; use strict; use MCE::Loop; use MCE::Candy; ## Select input and output files my $input_file = 'InputFile.OH1'; my $output_file = 'OutputFile.csv'; open my $ofh, ">", $output_file or die "cannot open '$output_file' for writing: $!\n"; ## Creates a header at the beginning of the file my $header = "HEC1_ID,Q100_Base,TTP,Area\n"; print $ofh $header; ## Extracts data from input in parallel. MCE::Loop::init { use_slurpio => 1, chunk_size => '20k', max_workers => 4, gather => MCE::Candy::out_iter_fh($ofh), }; mce_loop_f { my ( $output, $mce, $chunk_ref, $chunk_id ) = ( '', @_ ); open my $ifh, "<", $chunk_ref; while (<$ifh>) { s{ ^ # at the beginning of the line \+ # followed by literal plus (?:\s+(\S+)) # column one (?:\s+(\S+)) # column two (?:\s+(\S+)) # column three (?:\s+\S+){3} # skip three columns \s+(\S+)$ # catch the last one, too }{$1,$2,$3,$4}x and do { $output .= $_ }; } close $ifh; MCE->gather( $chunk_id, $output ); } $input_file; close $ofh; [download] Kind regards, Mario.	[reply] [d/l]