

in reply to Script exponentially slower as number of files to process increases

I think your problem is that fork is an expensive operation. The system has to copy the parent process with all its memory (*), duplicate its file descriptors etc. And you do that for each file in the queue, then throw away the process and fork a new one.

What you want is a worker pool with long-lived worker processes that divide the workload more or less evenly. The threads example earlier in this thread does exactly that. At the moment I can't find an equivalent, ready-made module for forks (Parallel::ForkManager came to mind, but it doesn't work for this), but you can make do by preparing the divided workload yourself and then explicitly assigning the parts to the forked workers:

#!/usr/bin/perl
use strict;
use warnings;
use 5.34.0;
use Env;
use utf8;
use POSIX "sys_wait_h";   # for waitpid FLAGS
use Time::HiRes qw(gettimeofday tv_interval);
use open ':std', ':encoding(UTF-8)';

my $benchmark = 1;        # print timings for loops
my $TMP = '/tmp';
my $HOME = $ENV{HOME};
my $IN;
my $OUT;

my @data = glob("data-* ??/data-*");
my $filecount = scalar(@data);
die if $filecount < 0;
say "Parsing $filecount files";

my $wordfile = "data.dat";
truncate $wordfile, 0;
#$|=1;

# substitute whole words
my %whole = qw{
    going go getting get goes go knew know trying try tried try
    told tell coming come saying say men man women woman took take
    lying lie dying die
};

# substitute on prefix
my %prefix = qw{ need need talk talk tak take used use using use };

# substitute on substring
my %substring = qw{
    mean mean work work read read allow allow gave give bought buy
    want want hear hear came come destr destroy paid pay selve self
    cities city fight fight creat create makin make includ include
};

my $re1 = qr{\b(@{[ join '|', reverse sort keys %whole ]})\b}i;
my $re2 = qr{\b(@{[ join '|', reverse sort keys %prefix ]})\w*}i;
my $re3 = qr{\b\w*?(@{[ join '|', reverse sort keys %substring ]})\w*}i;

truncate $wordfile, 0;

my $maxforks = 64;
print "maxforks: $maxforks\n";
my $forkcount = 0;
my $infile;
my $subdir = 0;
my $subdircount = 255;
my $tempdir = "temp";
mkdir "$tempdir";
mkdir "$tempdir/$subdir" while ($subdir++ <= $subdircount);
$subdir = 0;
my $i = 0;
my $t0 = [gettimeofday];
my $elapsed;

my $batch_size = int(@data / $maxforks) + 1;
my @batched_data;
push @batched_data, [splice @data, 0, $batch_size] while @data;

for my $worker_id (0..$maxforks-1) {
    if (my $pid = fork) {
        ++$forkcount;
    } else {
        for my $i (0..$#{$batched_data[$worker_id]}) {
            my $infile = $batched_data[$worker_id][$i];
            my $subdir = $worker_id + 1;
            open my $IN, '<', $infile or exit(0);
            open my $OUT, '>', "$tempdir/$subdir/text-$i" or exit(0);
            while (<$IN>) {
                tr/-!"#%&()*',.\/:;?@\[\\\]”_“{’}><^)(|/ /;   # no punct "
                s/^/ /;
                s/\n/ \n/;
                s/[[:digit:]]{1,12}//g;
                s/w(as|ere)/be/gi;
                s{$re2}{ $prefix{lc $1} }g;      # prefix
                s{$re3}{ $substring{lc $1} }g;   # part
                s{$re1}{ $whole{lc $1} }g;       # whole
                print $OUT "$_";
            }
            close $OUT;
            close $IN;
        }
        defined $pid and exit(0);   # $pid==0 --> child, must exit itself
    }
}

### now wait for all children to finish, no matter who they are
1 while wait != -1;   # avoid zombies, this is a blocking operation

local @ARGV = glob("$tempdir/*/*");
my @text = <>;
unlink glob "$tempdir/*/*";
open $OUT, '>', $wordfile or die "Error opening $wordfile";
print $OUT @text;
close $OUT;

$elapsed = tv_interval($t0);
print "regex: $elapsed\n" if $benchmark;

Note that this version does not process the files in the same order as yours, but that doesn't appear to matter?

(*) Yes, the manual does say that "Under Linux, fork() is implemented using copy-on-write pages, so the only penalty that it incurs is the time and memory required to duplicate the parent's page tables, and to create a unique task structure for the child", but even on a fast, modern system it is going to take at least a few milliseconds per fork.
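To get a rough feel for that per-fork cost on a given box, here is a minimal timing sketch (not part of the script above; the 1000-iteration count is arbitrary) that forks, exits and reaps do-nothing children in a loop:

#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

my $n  = 1000;                # arbitrary number of throw-away children
my $t0 = [gettimeofday];

for (1 .. $n) {
    my $pid = fork;
    die "fork failed: $!" unless defined $pid;
    exit 0 unless $pid;       # child does nothing and exits immediately
    waitpid $pid, 0;          # parent reaps it before forking the next one
}

my $elapsed = tv_interval($t0);
printf "%d fork/exit/wait cycles in %.3fs => %.3f ms each\n",
    $n, $elapsed, 1000 * $elapsed / $n;

Multiply whatever it prints by a few hundred thousand files and the overhead adds up quickly.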

Re^2: Script exponentially slower as number of files to process increases
by xnous (Sexton) on Jan 28, 2023 at 01:08 UTC
    You nailed it! So, if I understand correctly, the crucial difference is that my original script forked a new process for every single file in the queue, while yours forks only $maxforks workers, right?

    The documentation on fork() says it "does a fork(2) system call to create a new process running the same program at the same point" and "great care has gone into making it extremely efficient (for example, using copy-on-write technology on data pages)", so I assumed an array with a few hundred thousand elements wouldn't pose a problem. Obviously, I was wrong.

    I benchmarked the 3 proposed updates from choroba, marioroy (who had also been of great help in my previous question) and kikuchiyo. The tests ran in a loop utilizing 8, 16, 32 and so on up to 4096 threads over 3457, 34570 and 345700 files, and the results are very interesting and demonstrate the strong and weak points of each implementation. The test box was a mobile 4th-gen, 4c/8t i7 with 32GB of RAM. All tests ran on tmpfs.

    I created some graphs for easier interpretation of the results but it appears I can't link to imgur; is there an "allowed" image posting site I can link to?

    All said, I'd like to thank every one of you, your insight and wisdom have fully answered my questions.

      the results are very interesting...

      The kikuchiyo.pl script, not involving IPC, should come in first. It's a great solution. If memory is plentiful, why not consume a little memory for input data before spawning workers.

      The tests ran in a loop utilizing 8, 16, 32, etc up to 4096 threads...

      It's interesting to see someone attempting 4,000+ workers for a use-case that is somewhat CPU-bound. On another note, it is quite something to watch the operating system cope with this. For example, the threads and MCE::Child solutions involve IPC; e.g. workers entering a critical section -- who's next to read input from the queue or channel.
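      For illustration, a minimal sketch of that queue-driven pattern (not the actual threads script from earlier in the thread; process_file() is a made-up placeholder), where dequeue is the critical section workers take turns entering:

      use strict;
      use warnings;
      use threads;
      use Thread::Queue;

      my $maxworkers = 8;
      my $queue      = Thread::Queue->new;

      sub process_file {              # made-up placeholder for the real work
          my ($file) = @_;
          # ... regex transformations would go here ...
      }

      my @workers = map {
          threads->create(sub {
              # dequeue() is the critical section: workers take turns
              # pulling the next file name off the shared queue
              while (defined(my $file = $queue->dequeue)) {
                  process_file($file);
              }
          });
      } 1 .. $maxworkers;

      $queue->enqueue($_) for glob("data-* ??/data-*");
      $queue->end;                    # queued items drain, then dequeue returns undef
      $_->join for @workers;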

      On my system, the kikuchiyo.pl script exits early when running 512 workers. I modified the script to figure out why that is.

      --- kikuchiyo1.pl 2023-01-27 23:31:34.592261566 -0600
      +++ kikuchiyo2.pl 2023-01-27 23:31:12.488762580 -0600
      @@ -92,12 +92,16 @@
       for my $worker_id (0..$maxforks-1) {
           if (my $pid = fork) {
               ++$forkcount;
      +    } elsif (!defined $pid) {
      +        warn "fork failed for worker_id $worker_id\n";
           } else {
               for my $i (0..$#{$batched_data[$worker_id]}) {
                   my $infile = $batched_data[$worker_id][$i];
                   my $subdir = $worker_id + 1;
      -            open my $IN, '<', $infile or exit(0);
      -            open my $OUT, '>', "$tempdir/$subdir/text-$i" or exit(0);
      +            open my $IN, '<', $infile
      +                or die "[$worker_id] open error: infile";
      +            open my $OUT, '>', "$tempdir/$subdir/text-$i"
      +                or die "[$worker_id] open error: outfile\n";
                   while (<$IN>) {
                       tr/-!"#%&()*',.\/:;?@\[\\\]”_“{’}><^)(|/ /;   # no punct "
                       s/^/ /;

      Possibly a ulimit -n issue. My open-files ulimit is 1024. Edit: It has nothing to do with ulimit, as the open-files ulimit is per process. Workers 256+ exit early -- most likely because the script only creates 256 temp subdirectories ($subdircount is 255), so workers with IDs of 256 and above cannot open their output files.

      [256] open error: outfile
      [257] open error: outfile
      [258] open error: outfile
      ...
      [511] open error: outfile

      The threads and MCE::Child solutions pass for 512 workers. Again, regex is used here -- kinda CPU bound. Is running more workers than the number of logical CPU cores improving performance? There are golden CPU samples out there.
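      As an aside, one hedged way to keep the worker count tied to the hardware is MCE::Util::get_ncpu from the MCE distribution (the cap of 64 below is an arbitrary choice for this sketch):

      use List::Util qw(min);
      use MCE::Util ();

      # Logical CPU count as seen by the OS; fall back to 1 if detection fails.
      my $ncpu     = MCE::Util::get_ncpu() || 1;
      my $maxforks = min($ncpu, 64);     # 64 is an arbitrary upper cap
      print "maxforks: $maxforks\n";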

        First of all, thank you for your explanations and the work you put in suggesting an alternative.

        Is running more workers than the number of logical CPU cores improving performance? There are golden CPU samples out there.

        I don't know what to say, other than to try to benignly bypass the perlmonks' filter and somehow show the graphs (if a janitor would be kind enough to URLify these, thanks). It does seem that in certain scenarios the first part of your statement holds true.

      • First test, https://i.imgur.com/CpclI9L.png - 3457 files
      • Second test, https://i.imgur.com/cDi41fC.png - 34570 files
      • Third test, https://i.imgur.com/yNZokCx.png - 345700 files, due to long run times, I omitted the clearly slower Threads::Queue solution.
      • Final test, https://i.imgur.com/2NVovHx.png - same load as in #3, but only for fork() which proved to be the fastest, across the 512-4096 range in 128-step increments, trying to find the sweet spot.
      • You need to copy/paste the links by hand, but it's worth the trouble. The gain of fork() from 256 to 512 processes is almost unbelievable, while the performance of the other implementations is practically linear.

        EDIT: But of course it is, it's due to workers exiting early.

        I also tested your updated script but it showed no tangible improvement on my setup.

      This topic presented an opportunity for me to brush up on MCE, particularly the input iterator. Notice the parallel routine to process files, the input iterator run by the manager process, and MCE::Loop to launch parallel into action. The $chunk_id is used to compute $i, so the data stays in order. Moreover, MCE->last is used to leave the loop entirely (skipping any further input data) on an open-file error.

      Update: I had meant for the input iterator to obtain data after the workers have spawned. Now, workers consume little memory. Hence the reason for making this demonstration.
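      Roughly, the shape looks like the sketch below. This is not the actual mario-mceiter2.pl: parallel(), input_iter() and the chunk size are illustrative, and it assumes mce_loop accepts a code-reference iterator the way MCE's input_data option does.

      use strict;
      use warnings;
      use MCE::Loop;

      my $chunk_size = 50;

      # Iterator runs in the manager process and hands out file names in
      # chunks, so the data is read out after the workers have spawned.
      sub input_iter {
          my @files = glob("data-* ??/data-*");
          return sub {
              return unless @files;
              return splice @files, 0, $chunk_size;
          };
      }

      # Illustrative per-chunk routine run by each worker.
      sub parallel {
          my ($mce, $chunk_ref, $chunk_id) = @_;
          my $i = ($chunk_id - 1) * $chunk_size;   # position in overall input order
          for my $infile (@{ $chunk_ref }) {
              open my $IN, '<', $infile
                  or MCE->last;                    # abandon remaining input on error
              # ... transform lines, write "text-$i" to a temp dir ...
              close $IN;
              $i++;
          }
      }

      MCE::Loop->init(max_workers => 64, chunk_size => $chunk_size);
      mce_loop { parallel(@_) } input_iter();
      MCE::Loop->finish;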

      Sample run from a 32-core, 64-threads box:

      I captured the "real" time via the UNIX time command. We're dealing with big data, and the time command is helpful for capturing the overall time: loading modules, running, and reaping workers -- and, not to forget, global cleanup in Perl.

      The total time includes post-processing, reading temp-files, writing to data.dat, and unlinking.

        A chunking engine allows one to specify a smaller chunk size for more possibilities. What we can do is have workers append to a local string variable. Upon completing the chunk, workers write the buffer to the target file. This works out quite well, further reducing the overall time. It gets rid of the time previously spent before and after on temp dir/file creation, reading, and cleanup.

        Why smaller chunk size? Workers append to a local string variable for this demonstration. A greater chunk size also means greater memory consumption, potentially.

        Making this possible is MCE::Relay (orderly) or MCE::Mutex (unordered). Not seeing all CPU cores at 100% utilization? That's because workers now cooperate, orderly or serially, when appending to the target file. There is no need to spin up 4,000+ workers or go beyond the physical limits of the box; that's really unnecessary.
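        For the MCE::Mutex (unordered) flavour, a minimal sketch of the buffer-then-append idea might look like this (again, not the actual mario-mceiter2.pl; the per-line substitutions are elided):

        use strict;
        use warnings;
        use MCE::Loop;
        use MCE::Mutex;

        my $wordfile = "data.dat";
        unlink $wordfile;                      # start with an empty target file
        my $mutex = MCE::Mutex->new;           # created before workers spawn

        my @files = glob("data-* ??/data-*");
        MCE::Loop->init(max_workers => 64, chunk_size => 50);

        mce_loop {
            my ($mce, $chunk_ref, $chunk_id) = @_;
            my $buffer = '';

            for my $infile (@{ $chunk_ref }) {
                open my $IN, '<', $infile or next;
                while (<$IN>) {
                    # ... the regex substitutions from the original script ...
                    $buffer .= $_;
                }
                close $IN;
            }

            # Critical section: only the append to the shared output file is
            # serialized, so workers spend most of their time outside the lock.
            $mutex->lock;
            open my $OUT, '>>', $wordfile or die "Error opening $wordfile: $!";
            print $OUT $buffer;
            close $OUT;
            $mutex->unlock;
        } \@files;

        MCE::Loop->finish;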

        Look at the time saved. Merging results is now handled by workers versus post-processing, previously taking 31.510 seconds.

        # MCE::Relay
        $ time ../mario-mceiter2.pl
        Parsing 345701 files
        maxforks: 64
        chunksize: 50
        regex: 13.628399

        real    0m13.670s
        user    11m11.484s
        sys     0m 4.739s

        # MCE::Mutex
        $ time ../mario-mceiter2.pl
        Parsing 345701 files
        maxforks: 64
        chunksize: 50
        regex: 12.646211

        real    0m12.689s
        user    11m28.535s
        sys     0m 4.827s

        mario-mceiter2.pl