in reply to script optmization

Loading 200 MB into available memory is likely possible on today's hardware. If so, the following runs roughly 7 times faster. The idea is to iterate over @seq just once against the entire file. For larger data files, one can read, say, 300 MB at a time, remembering to read through to the end of the line so each chunk is complete, then process each chunk the same way (a sketch of that follows the code below).

use strict;
use warnings;
use autodie;

open Newfile, ">", "./Newfile.txt";   # autodie handles the error check

my ( $f1, $f2, @seq ) = ( 'seq.txt', 'mytext.txt' );

# read the search strings, trimming surrounding whitespace
open my $fh, "<", $f1;
while (<$fh>) {
    chomp;
    s/^\s+|\s+$//g;
    push @seq, $_;
}
close $fh;

# sort @seq by length, longest first, so longer strings match
# before any shorter strings they contain
@seq = sort bylen @seq;

# slurp the data file in one go
my $data;
{
    open $fh, "<", $f2;
    local $/;
    $data = <$fh>;
}

# one pass over @seq against the whole file
foreach my $r (@seq) {
    my $t = $r;
    $t =~ s/\h+/bbb/g;
    $data =~ s/$r/$t/g;
}

print Newfile $data;
close Newfile;

exit 0;

sub bylen { length($b) <=> length($a) }
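A minimal sketch of that chunked approach, for files too large to slurp, might look like the following. The 300 MB figure and file names come from above; the loop itself is an assumption, not code from the original post:

use strict;
use warnings;
use autodie;

open my $in,  "<", "mytext.txt";
open my $out, ">", "./Newfile.txt";

# read ~300 MB at a time, then extend the chunk to the end of the
# current line so no match is split across two chunks
while ( read( $in, my $chunk, 300 * 1024 * 1024 ) ) {
    if ( defined( my $rest = <$in> ) ) {
        $chunk .= $rest;
    }
    # ... process $chunk the same way $data is processed above ...
    print $out $chunk;
}

close $out;
close $in;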

Re^2: script optmization
by Anonymous Monk on May 14, 2017 at 22:04 UTC

    For larger data files, if one does not want to deal with chunking manually, there is the parallel MCE module. Here is what one might construct using MCE::Flow. We are running 4 workers, so chunking at 24 MB is plenty. Perl and CPAN are amazing for allowing this.

    use strict;
    use warnings;
    use autodie;
    use MCE::Flow;

    open Newfile, ">", "./Newfile.txt";   # autodie handles the error check
    Newfile->autoflush(1);                # important, enable autoflush

    my ( $f1, $f2, @seq ) = ( 'seq.txt', 'mytext.txt' );

    # read the search strings, trimming surrounding whitespace
    open my $fh, "<", $f1;
    while (<$fh>) {
        chomp;
        s/^\s+|\s+$//g;
        push @seq, $_;
    }
    close $fh;

    # sort @seq by length, longest first
    @seq = sort bylen @seq;

    MCE::Flow::init {
        max_workers => 4,
        chunk_size  => '24m',
        init_relay  => 1,
        use_slurpio => 1,
    };

    # For best performance, provide MCE the path, e.g. $f2,
    # versus a file handle. Workers communicate among themselves
    # the next offset without involving the manager process.
    mce_flow_f sub {
        my ( $mce, $slurp_ref, $chunk_id ) = @_;

        foreach my $r (@seq) {
            my $t = $r;
            $t =~ s/\h+/bbb/g;
            $$slurp_ref =~ s/$r/$t/g;
        }

        # Relay capability is useful for running something orderly.
        # For this use case, we've enabled autoflush on the file above.
        # Only one worker is allowed to run when entering the block.
        MCE::relay sub { print Newfile $$slurp_ref };

    }, $f2;

    MCE::Flow::finish();

    close Newfile;

    exit 0;

    sub bylen { length($b) <=> length($a) }

      Time to run against a 200 MB file, built by taking the OP's input file and appending it 884,873 times (a sketch for building such a file follows the timings below).

      serial:   12.557 seconds

      slurped:   1.644 seconds  ( 7.6x)

      parallel:  0.531 seconds  (23.6x)
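      For reference, such a test file might be built with something like this; the repeat count is from above, while the source file name and the script itself are assumptions:

      use strict;
      use warnings;
      use autodie;

      # slurp the OP's original input once (hypothetical file name)
      open my $in, "<", "input_orig.txt";
      my $text = do { local $/; <$in> };
      close $in;

      # append it 884,873 times to produce the ~200 MB test file
      open my $out, ">", "mytext.txt";
      print $out $text for 1 .. 884_873;
      close $out;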

        For memory-constrained systems, one may run with a smaller chunk_size. In this case, 4 MB does just fine. For this demonstration, the smaller chunk size also decreases the time workers wait before reading again: it completes in 0.476 seconds, a 26.4x performance increase.

        chunk_size => '4m',

        The slurped example is likely fast enough. Parallel is nice if you want it. But serial doesn't take that long either; all variants completed in less than 20 seconds.