in reply to script optmization

Loading 200 MB into available memory is likely possible on today's hardware. If so, the following runs roughly 7 times faster. The idea is to iterate over @seq just once against the entire file. For larger data files, one can read, say, 300 MB at a time, remembering to read through to the end of the line so each chunk is complete, then process each chunk the same way (a sketch of that follows the code below).

use strict;
use warnings;
use autodie;

open Newfile, ">", "./Newfile.txt";   # autodie handles the error check

my ( $f1, $f2, @seq ) = ( 'seq.txt', 'mytext.txt' );

# read the search strings, trimming surrounding whitespace
open my $fh, "<", $f1;
while (<$fh>) {
    chomp;
    s/^\s+|\s+$//g;
    push @seq, $_;
}
close $fh;

# sort @seq by length, longest first, so longer strings match
# before any shorter strings they contain
@seq = sort bylen @seq;

# slurp the data file in one go
my $data;
{
    open $fh, "<", $f2;
    local $/;
    $data = <$fh>;
}

# one pass over @seq against the whole file
foreach my $r (@seq) {
    my $t = $r;
    $t =~ s/\h+/bbb/g;
    $data =~ s/$r/$t/g;
}

print Newfile $data;
close Newfile;

exit 0;

sub bylen { length($b) <=> length($a) }
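A minimal sketch of that chunked approach, for files too large to slurp, might look like the following. The 300 MB figure and file names come from above; the loop itself is an assumption, not code from the original post:

use strict;
use warnings;
use autodie;

open my $in,  "<", "mytext.txt";
open my $out, ">", "./Newfile.txt";

# read ~300 MB at a time, then extend the chunk to the end of the
# current line so no match is split across two chunks
while ( read( $in, my $chunk, 300 * 1024 * 1024 ) ) {
    if ( defined( my $rest = <$in> ) ) {
        $chunk .= $rest;
    }
    # ... process $chunk the same way $data is processed above ...
    print $out $chunk;
}

close $out;
close $in;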

Re^2: script optmization
by Anonymous Monk on May 14, 2017 at 22:04 UTC

    For larger data files, if one does not want to deal with chunking manually, there is the parallel MCE module. Here is what one might construct using MCE::Flow. We are running 4 workers, so chunking at 24 MB is plenty. Perl and CPAN are amazing for allowing this.

    use strict;
    use warnings;
    use autodie;
    use MCE::Flow;

    open Newfile, ">", "./Newfile.txt";   # autodie handles the error check
    Newfile->autoflush(1);                # important, enable autoflush

    my ( $f1, $f2, @seq ) = ( 'seq.txt', 'mytext.txt' );

    # read the search strings, trimming surrounding whitespace
    open my $fh, "<", $f1;
    while (<$fh>) {
        chomp;
        s/^\s+|\s+$//g;
        push @seq, $_;
    }
    close $fh;

    # sort @seq by length, longest first
    @seq = sort bylen @seq;

    MCE::Flow::init {
        max_workers => 4,
        chunk_size  => '24m',
        init_relay  => 1,
        use_slurpio => 1,
    };

    # For best performance, provide MCE the path, e.g. $f2,
    # versus a file handle. Workers communicate among themselves
    # the next offset without involving the manager process.
    mce_flow_f sub {
        my ( $mce, $slurp_ref, $chunk_id ) = @_;

        foreach my $r (@seq) {
            my $t = $r;
            $t =~ s/\h+/bbb/g;
            $$slurp_ref =~ s/$r/$t/g;
        }

        # Relay capability is useful for running something orderly.
        # For this use case, we've enabled autoflush on the file above.
        # Only one worker is allowed to run when entering the block.
        MCE::relay sub { print Newfile $$slurp_ref };

    }, $f2;

    MCE::Flow::finish();

    close Newfile;

    exit 0;

    sub bylen { length($b) <=> length($a) }

      Time to run against a 200 MB file, built by taking the OP's input file and appending it 884,873 times (a sketch for building such a file follows the timings below).

      serial:   12.557 seconds

      slurped:   1.644 seconds  ( 7.6x)

      parallel:  0.531 seconds  (23.6x)
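      For reference, such a test file might be built with something like this; the repeat count is from above, while the source file name and the script itself are assumptions:

      use strict;
      use warnings;
      use autodie;

      # slurp the OP's original input once (hypothetical file name)
      open my $in, "<", "input_orig.txt";
      my $text = do { local $/; <$in> };
      close $in;

      # append it 884,873 times to produce the ~200 MB test file
      open my $out, ">", "mytext.txt";
      print $out $text for 1 .. 884_873;
      close $out;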

        For memory-constrained systems, one may run with a smaller chunk_size. In this case, 4 MB does just fine. For this demonstration, the smaller chunk size also decreases the time workers wait before reading again: it completes in 0.476 seconds, a 26.4x performance increase.

        chunk_size => '4m',

        The slurped example is likely fast enough. Parallel is nice if you want it. But serial doesn't take that long either; all variants completed in less than 20 seconds.