PerlMonks  

Script exponentially slower as number of files to process increases

by xnous (Sexton)
on Jan 25, 2023 at 20:04 UTC ( [id://11149878]=perlquestion )

xnous has asked for the wisdom of the Perl Monks concerning the following question:

Hello, monks. With your help I've written a script that processes a large number of text files efficiently. I run this script inside directories containing 1K to 10K files, usually fewer than 5K.

However, I've noticed that when attempting to process a larger number of files, i.e. several directories at once, the script gets exponentially slower. For example, a run on 3.5K files takes around 4.5 seconds, but 35K files take 90 seconds instead of the expected 45, and 350K files run for hours.

This has baffled me, as I'm using subdirectories to organize the data and filesystem operations shouldn't be hurting performance; additionally, the data filenames are glob()bed into an array that is looped over one file at a time, not slurped in at once and processed in bulk (although in my tests I also tried that approach, and it exhibited the same behavior).

What's very interesting is that when I put a counter to stop processing at 1000 files, I got increasingly longer processing times with each subdirectory added to the list, despite only processing 1000 files from it. Also, I always copy my data to /tmp which is mounted as tmpfs to reduce SSD wear and achieve maximum read/write performance. Testing:

wget http://www.astro.sunysb.edu/fwalter/AST389/TEXTS/Nightfall.htm
html2text-cpp Nightfall.htm >nightfall.txt
mkdir 00; for i in `seq -w 0 3456`; do head -$((RANDOM/128)) nightfall.txt >00/data-$i; done
This will create a directory ("00") with 3,457 randomly sized files inside. Perl script:
#!/usr/bin/perl
use strict;
use warnings;
use 5.36.0;
use Env;
use utf8;
use POSIX "sys_wait_h"; # for waitpid FLAGS
use Time::HiRes qw(gettimeofday tv_interval);
use open ':std', ':encoding(UTF-8)';

my $benchmark = 1; # print timings for loops
my $TMP = '/tmp';
my $HOME = $ENV{HOME};
my $IN;
my $OUT;
my @data = glob("data-* ??/data-*");
my $filecount = scalar(@data);
die if $filecount < 0;
say "Parsing $filecount files";
my $wordfile = "data.dat";
truncate $wordfile, 0;
#$|=1;

# substitute whole words
my %whole = qw{
    going go getting get goes go knew know trying try tried try
    told tell coming come saying say men man women woman took take
    lying lie dying die
};
# substitute on prefix
my %prefix = qw{
    need need talk talk tak take used use using use
};
# substitute on substring
my %substring = qw{
    mean mean work work read read allow allow gave give bought buy
    want want hear hear came come destr destroy paid pay selve self
    cities city fight fight creat create makin make includ include
};

my $re1 = qr{\b(@{[ join '|', reverse sort keys %whole ]})\b}i;
my $re2 = qr{\b(@{[ join '|', reverse sort keys %prefix ]})\w*}i;
my $re3 = qr{\b\w*?(@{[ join '|', reverse sort keys %substring ]})\w*}i;

truncate $wordfile, 0;
my $maxforks = 64;
print "maxforks: $maxforks\n";
my $forkcount = 0;
my $infile;
my $subdir = 0;
my $subdircount = 255;
my $tempdir = "temp";
mkdir "$tempdir";
mkdir "$tempdir/$subdir" while ($subdir++ <= $subdircount);
$subdir = 0;
my $i = 0;
my $t0 = [gettimeofday];
my $elapsed;

foreach $infile (@data) {
    $forkcount -= waitpid(-1, WNOHANG) > 0 while $forkcount >= $maxforks;
#   do { $elapsed = tv_interval($t0); print "elapsed: $elapsed\n"; die; } if $i++ > 1000; # 1000 files test
    $i++;                             # comment out if you uncomment the above line
    $subdir = 1 if $subdir++ > $subdircount;
    if (my $pid = fork) {             # $pid defined and != 0 --> parent
        ++$forkcount;
    } else {                          # $pid == 0 --> child
        open my $IN,  '<', $infile                    or exit(0);
        open my $OUT, '>', "$tempdir/$subdir/text-$i" or exit(0);
        while (<$IN>) {
            tr/-!"#%&()*',.\/:;?@\[\\\]”_“{’}><^)(|/ /;   # no punct "
            s/^/ /;
            s/\n/ \n/;
            s/[[:digit:]]{1,12}//g;
            s/w(as|ere)/be/gi;
            s{$re2}{ $prefix{lc $1} }g;      # prefix
            s{$re3}{ $substring{lc $1} }g;   # part
            s{$re1}{ $whole{lc $1} }g;       # whole
            print $OUT "$_";
        }
        close $OUT;
        close $IN;
        defined $pid and exit(0);     # $pid == 0 --> child, must exit itself
    }
}

### now wait for all children to finish, no matter who they are
1 while wait != -1;   # avoid zombies; this is a blocking operation

local @ARGV = glob("$tempdir/*/*");
my @text = <>;
unlink glob "$tempdir/*/*";
open $OUT, '>', $wordfile or die "Error opening $wordfile";
print $OUT @text;
close $OUT;
$elapsed = tv_interval($t0);
print "regex: $elapsed\n" if $benchmark;
Add more directories to process:

for dir in $(seq -w 01 10); do cp -a 00 $dir; done

Any help and insight will be greatly appreciated.

Replies are listed 'Best First'.
Re: Script exponentially slower as number of files to process increases
by choroba (Cardinal) on Jan 25, 2023 at 23:28 UTC
    Devel::NYTProf can handle forking code (use nytprofmerge). I don't have html2text-cpp available, so I can't replicate your experiment. But are you sure your machine has enough resources to run 64 forks all reading the same file and each writing to a different one? Also, waitpid might be non-blocking, but it still takes some time to process, and you're calling it millions of times.
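    For reference, profiling a forking script with Devel::NYTProf might look roughly like this (script.pl is a placeholder name; addpid=1 gives each process its own output file):

    NYTPROF=addpid=1 perl -d:NYTProf script.pl    # one nytprof.out.<pid> per process
    nytprofmerge -o nytprof.merged nytprof.out.*
    nytprofhtml --file nytprof.merged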

    Update: I rewrote your script to use Thread::Queue with 8 threads on my 4-CPU machine. Processing just 00 takes

    Parsing 3457 files
    regex: 4.126449

    real    0m4.194s
    user    0m13.465s
    sys     0m0.414s

    while processing 00 .. 04

    Parsing 17285 files
    regex: 20.588749

    real    0m20.681s
    user    1m7.373s
    sys     0m1.873s

    Nothing exponential here, right?

    Note that I didn't try to understand what the code does, I just replaced fork with threads.

    I also changed the final output to work line by line instead of reading the whole contents into memory, but it didn't help when using fork.
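    For reference, a minimal sketch of the threads + Thread::Queue worker-pool pattern described above (this is not choroba's actual rewrite; it assumes the tr///, s{}{}g substitutions and the temp/ directory layout from the original script, and borrows the $queue, $threads and process_file names that appear in the MCE diff quoted below):

    use strict;
    use warnings;
    use threads;
    use Thread::Queue;

    my $tempdir     = "temp";   # assumed: temp/1 .. temp/256 created as in the original
    my $subdircount = 255;
    my $threads     = 8;
    my $queue       = 'Thread::Queue'->new;

    sub process_file {
        while ( defined( my $job = $queue->dequeue ) ) {   # blocks until work or end()
            my ( $infile, $subdir, $i ) = @$job;
            open my $IN,  '<', $infile                    or next;
            open my $OUT, '>', "$tempdir/$subdir/text-$i" or next;
            while (<$IN>) {
                # ... same tr/// and s{}{}g substitutions as the original loop ...
                print $OUT $_;
            }
            close $OUT;
            close $IN;
        }
    }

    # spawn the pool first, then feed it work items
    my @workers = map threads->create( \&process_file ), 1 .. $threads;

    my ( $subdir, $i ) = ( 0, 0 );
    foreach my $infile ( glob("data-* ??/data-*") ) {
        $subdir = 1 if $subdir++ > $subdircount;
        $queue->enqueue( [ $infile, $subdir, $i++ ] );
    }

    $queue->end;                # workers see undef from dequeue and finish
    $_->join for @workers;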


      Does your Perl lack threads support? Fortunately, there are MCE::Child and MCE::Channel, which run similarly to threads. The following are the changes to choroba's script. Basically, I replaced threads with MCE::Child and Thread::Queue with MCE::Channel. That's it, no other changes.

      9,10c9,10
      < use threads;
      < use Thread::Queue;
      ---
      > use MCE::Child;
      > use MCE::Channel;
      88c88
      < my $queue = 'Thread::Queue'->new;
      ---
      > my $queue = 'MCE::Channel'->new;
      110c110
      < my @workers = map threads->create(\&process_file), 1 .. $threads;
      ---
      > my @workers = map MCE::Child->create(\&process_file), 1 .. $threads;

      Let's see how they perform in a directory containing 35,841 files. I'm on a Linux box and running from /tmp/. The scripts are configured to spin 8 threads or processes.

      # threads, Thread::Queue
      Parsing 35841 files
      regex: 12.427632

      real    0m12.486s
      user    1m21.869s
      sys     0m1.009s

      # MCE::Child, MCE::Channel
      Parsing 35841 files
      regex: 8.971663

      real    0m9.035s
      user    0m56.504s
      sys     0m1.097s

      Another monk, kikuchiyo, posted a parallel demonstration. I'm running it here for any monk who may like to know how it performs.

      Parsing 35841 files
      maxforks: 8
      regex: 8.622583

      real    0m8.953s
      user    0m52.559s
      sys     0m1.006s

      Seeing many cores near 100% simultaneously is magical. There is { threads, Thread::Queue }; { MCE::Child, MCE::Channel }; or roll your own. All three demonstrations work well.

      Let's imagine for a moment being the CPU or the OS facing a directory with 350K files in it. Actually, imagine being Perl itself. May I suggest a slight improvement: try to populate the @data array after spawning the threads or processes. This matters especially on the Windows platform. Unix OSes typically benefit from copy-on-write, but that did not help for this use case. See below for before-and-after results.

      It's quite natural to want to create the data array first, before spinning up workers. The problem is that Perl threads make a copy of it, as does the emulated fork on the Windows platform. That's not likely a problem for a few thousand items, but with 350K items that's an unnecessary copy per thread.

      # threads (the same applies to MCE::Child or the parallel module of your choice)
      my @workers = map threads->create(\&process_file), 1 .. $threads;

      my @data = glob("data-* ??/data-*");
      my $filecount = scalar(@data);
      if ($filecount <= 0) {
          $queue->end;
          $_->join for @workers;
          die "there are no files to process";
      }
      say "Parsing $filecount files";

      foreach $infile (@data) {
          $subdir = 1 if $subdir++ > $subdircount;
          $queue->enqueue([$infile, $subdir, $i++]);
      }

      $queue->end;
      $_->join for @workers;

      I created a directory containing 135,842 files. Before the update, the threads consumed 178 MB; after the update, 98 MB. Interestingly, for MCE::Child, each worker process consumed ~30 MB before and ~10 MB after the update.

      Next, I tested before and after with a directory containing 350K files, spawning 32 workers. The threads consumed 1,122 MB before and 240 MB after the update. Likewise, each MCE::Child process consumed ~63 MB before and ~10 MB after.

      Indeed, your threads rewrite scales linearly and I will adopt it, thank you. However, I'm still intrigued as to why the fork version behaves the way it does.

      My "production" system is a 10-year old, 8-thread i7 but with plenty of RAM (32GB) to handle the task at hand. Initially, the script didn't even have a fork limit neither waitpid() or wait(), except the final one after the regex loop, as I just used the solution presented to me by Marshall on my previous question and it didn't make a difference in performance anyway, as I experimented with various limits ranging from 8 to 512, thinking at first that unconstrained forking was the cause.

      I noticed your version only pumps "user" time on my system monitor, while fork shows a significant amount of "system" time, at least 10% of total CPU, whether $maxforks is 8 or unlimited.

      All that said, I'd still like to find out why the initial script becomes so slow as the number of files to process multiplies, even when I limit its scope to the first 1000 files. It's almost unreasonable.

Re: Script exponentially slower as number of files to process increases
by tybalt89 (Monsignor) on Jan 26, 2023 at 03:00 UTC
    $forkcount -= waitpid(-1, WNOHANG) > 0 while $forkcount >= $maxforks;

    This line hard loops and is probably what is chewing your "system" time. Replace with

    wait, $forkcount-- while $forkcount >= $maxforks;
      No, it doesn't. It actually makes no difference whether that wait() is there or not. I'm getting the same results with my version, yours or no $forkcount at all. It's very weird.

        Is there a fork limit on your system? You are not checking for fork failures...

        What's your ulimit -u for max user processes?

Re: Script exponentially slower as number of files to process increases
by kikuchiyo (Hermit) on Jan 26, 2023 at 23:38 UTC

    I think your problem is that fork is an expensive operation. The system has to copy the parent process with all its memory (*), duplicate its file descriptors etc. And you do that for each file in the queue, then throw away the process and fork a new one.

    What you want is a worker pool with long lived worker processes that divide the workload more or less evenly. The threads example earlier in this thread does exactly that. At the moment I can't find an equivalent, ready-made module with forks (Parallel::ForkManager came to mind, but that one doesn't work for this), but you can make do by preparing the divided workload yourself then explicitly assigning the parts to the forked workers:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use 5.34.0;
    use Env;
    use utf8;
    use POSIX "sys_wait_h"; # for waitpid FLAGS
    use Time::HiRes qw(gettimeofday tv_interval);
    use open ':std', ':encoding(UTF-8)';

    my $benchmark = 1; # print timings for loops
    my $TMP = '/tmp';
    my $HOME = $ENV{HOME};
    my $IN;
    my $OUT;
    my @data = glob("data-* ??/data-*");
    my $filecount = scalar(@data);
    die if $filecount < 0;
    say "Parsing $filecount files";
    my $wordfile = "data.dat";
    truncate $wordfile, 0;
    #$|=1;

    # substitute whole words
    my %whole = qw{
        going go getting get goes go knew know trying try tried try
        told tell coming come saying say men man women woman took take
        lying lie dying die
    };
    # substitute on prefix
    my %prefix = qw{
        need need talk talk tak take used use using use
    };
    # substitute on substring
    my %substring = qw{
        mean mean work work read read allow allow gave give bought buy
        want want hear hear came come destr destroy paid pay selve self
        cities city fight fight creat create makin make includ include
    };

    my $re1 = qr{\b(@{[ join '|', reverse sort keys %whole ]})\b}i;
    my $re2 = qr{\b(@{[ join '|', reverse sort keys %prefix ]})\w*}i;
    my $re3 = qr{\b\w*?(@{[ join '|', reverse sort keys %substring ]})\w*}i;

    truncate $wordfile, 0;
    my $maxforks = 64;
    print "maxforks: $maxforks\n";
    my $forkcount = 0;
    my $infile;
    my $subdir = 0;
    my $subdircount = 255;
    my $tempdir = "temp";
    mkdir "$tempdir";
    mkdir "$tempdir/$subdir" while ($subdir++ <= $subdircount);
    $subdir = 0;
    my $i = 0;
    my $t0 = [gettimeofday];
    my $elapsed;

    my $batch_size = int(@data / $maxforks) + 1;
    my @batched_data;
    push @batched_data, [splice @data, 0, $batch_size] while @data;

    for my $worker_id (0..$maxforks-1) {
        if (my $pid = fork) {
            ++$forkcount;
        } else {
            for my $i (0..$#{$batched_data[$worker_id]}) {
                my $infile = $batched_data[$worker_id][$i];
                my $subdir = $worker_id + 1;
                open my $IN,  '<', $infile                    or exit(0);
                open my $OUT, '>', "$tempdir/$subdir/text-$i" or exit(0);
                while (<$IN>) {
                    tr/-!"#%&()*',.\/:;?@\[\\\]”_“{’}><^)(|/ /;   # no punct "
                    s/^/ /;
                    s/\n/ \n/;
                    s/[[:digit:]]{1,12}//g;
                    s/w(as|ere)/be/gi;
                    s{$re2}{ $prefix{lc $1} }g;      # prefix
                    s{$re3}{ $substring{lc $1} }g;   # part
                    s{$re1}{ $whole{lc $1} }g;       # whole
                    print $OUT "$_";
                }
                close $OUT;
                close $IN;
            }
            defined $pid and exit(0);   # $pid == 0 --> child, must exit itself
        }
    }

    ### now wait for all children to finish, no matter who they are
    1 while wait != -1;   # avoid zombies; this is a blocking operation

    local @ARGV = glob("$tempdir/*/*");
    my @text = <>;
    unlink glob "$tempdir/*/*";
    open $OUT, '>', $wordfile or die "Error opening $wordfile";
    print $OUT @text;
    close $OUT;
    $elapsed = tv_interval($t0);
    print "regex: $elapsed\n" if $benchmark;

    Note that this version does not process the files in the same order as yours, but that doesn't appear to matter?

    (*) yes, it says that in the manual that "Under Linux, fork() is implemented using copy-on-write pages, so the only penalty that it incurs is the time and memory required to duplicate the parent's page tables, and to create a unique task structure for the child", but even on a fast, modern system it is going to take at least a few milliseconds per fork.
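    As a rough way to put a number on that per-fork cost on a given box, here is a quick micro-benchmark sketch (not from the thread, just an illustration):

    #!/usr/bin/perl
    # Time N fork/exit/wait cycles where the child does no work at all,
    # so the elapsed time is essentially pure fork + reap overhead.
    use strict;
    use warnings;
    use Time::HiRes qw(gettimeofday tv_interval);

    my $n  = 1000;
    my $t0 = [gettimeofday];
    for ( 1 .. $n ) {
        my $pid = fork;
        die "fork failed: $!" unless defined $pid;
        exit 0 unless $pid;        # child exits immediately
        waitpid $pid, 0;           # parent reaps before forking again
    }
    my $elapsed = tv_interval($t0);
    printf "%d forks in %.3fs (%.3f ms per fork)\n", $n, $elapsed, 1000 * $elapsed / $n;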

      You nailed it! So, if I understand correctly, the crucial difference is that my original script forked a process for every single file in the queue, while yours forks only $maxforks processes in total, right?

      The documentation on fork() says it "does a fork(2) system call to create a new process running the same program at the same point" and "great care has gone into making it extremely efficient (for example, using copy-on-write technology on data pages)", so I assumed an array with a few hundred thousand elements wouldn't pose a problem. Obviously, I was wrong.

      I benchmarked the 3 proposed updates from choroba, marioroy (who had also been of great help in my previous question) and kikuchiyo. The tests ran in a loop utilizing 8, 16, 32 and so on up to 4096 threads over 3457, 34570 and 345700 files, and the results are very interesting, demonstrating the strong and weak points of each implementation. The test box was a mobile 4th-gen, 4c/8t i7 with 32GB of RAM. All tests ran on tmpfs.

      I created some graphs for easier interpretation of the results but it appears I can't link to imgur; is there an "allowed" image posting site I can link to?

      All said, I'd like to thank every one of you, your insight and wisdom have fully answered my questions.

        the results are very interesting...

        The kikuchiyo.pl script, which involves no IPC, should come first. It's a great solution. If memory is plentiful, why not consume a little memory for the input data before spawning.

        The tests ran in a loop utilizing 8, 16, 32, etc up to 4096 threads...

        It's interesting to see someone attempt 4,000+ workers for a use case that is somewhat CPU-bound. On another note, how cool to watch the operating system cope with this. For example, the threads and MCE::Child solutions involve IPC; e.g. workers enter a critical section to decide who's next to read input from the queue or channel.

        On my system, the kikuchiyo.pl script exits early when running 512 workers. I modified the script to figure out why that is.

        --- kikuchiyo1.pl       2023-01-27 23:31:34.592261566 -0600
        +++ kikuchiyo2.pl       2023-01-27 23:31:12.488762580 -0600
        @@ -92,12 +92,16 @@
         for my $worker_id (0..$maxforks-1) {
             if (my $pid = fork) {
                 ++$forkcount;
        +    } elsif (!defined $pid) {
        +        warn "fork failed for worker_id $worker_id\n";
             } else {
                 for my $i (0..$#{$batched_data[$worker_id]}) {
                     my $infile = $batched_data[$worker_id][$i];
                     my $subdir = $worker_id + 1;
        -            open my $IN, '<', $infile or exit(0);
        -            open my $OUT, '>', "$tempdir/$subdir/text-$i" or exit(0);
        +            open my $IN, '<', $infile
        +                or die "[$worker_id] open error: infile";
        +            open my $OUT, '>', "$tempdir/$subdir/text-$i"
        +                or die "[$worker_id] open error: outfile\n";
                     while (<$IN>) {
                         tr/-!"#%&()*',.\/:;?@\[\\\]”_“{’}><^)(|/ /; # no punct "
                         s/^/ /;

        Possibly a ulimit -n issue. My open-files ulimit is 1024. Edit: It has nothing to do with ulimit, as the open-files ulimit is per process. Workers 256+ exit early.

        [256] open error: outfile
        [257] open error: outfile
        [258] open error: outfile
        ...
        [511] open error: outfile

        The threads and MCE::Child solutions pass for 512 workers. Again, regexes are used here, so the work is kinda CPU-bound. Does running more workers than the number of logical CPU cores improve performance? There are golden CPU samples out there.

        This topic presented an opportunity for me to brush up on MCE, particularly the input iterator. Notice the parallel routine to process files, the input iterator run by the manager process, and MCE::Loop to launch everything in parallel. The $chunk_id is used to compute $i, keeping the data in order. Moreover, MCE->last leaves the loop entirely (skipping further input data) on an open-file error.

        Update: I meant for the input iterator to obtain the data after the workers have spawned. Now the workers consume little memory. Hence the reason for making this demonstration.
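        The full demonstration isn't reproduced here, but a minimal sketch of the shape described above might look as follows (the substitution regexes and the temp/ directory layout are assumed from the original script; the modulo mapping of $chunk_id to a subdirectory is a simplification for illustration):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use MCE::Loop;

        my $tempdir     = "temp";    # assumed: temp/1 .. temp/256 already created
        my $subdircount = 255;

        MCE::Loop->init( max_workers => 8, chunk_size => 1 );

        sub input_iterator {         # run by the manager process, after workers spawn
            my @data = glob("data-* ??/data-*");
            my $pos  = 0;
            return sub { $pos <= $#data ? $data[ $pos++ ] : () };
        }

        mce_loop {
            my ( $mce, $chunk_ref, $chunk_id ) = @_;
            my $infile = $chunk_ref->[0];
            my $i      = $chunk_id;                          # keeps output names orderly
            my $subdir = ( $chunk_id % $subdircount ) + 1;   # spread across temp/1..255
            open my $IN, '<', $infile
                or do { warn "[$chunk_id] open error: infile\n"; MCE->last; return };
            open my $OUT, '>', "$tempdir/$subdir/text-$i"
                or do { warn "[$chunk_id] open error: outfile\n"; MCE->last; return };
            while (<$IN>) {
                # ... same tr/// and s{}{}g substitutions as the original loop ...
                print $OUT $_;
            }
            close $OUT;
            close $IN;
        } input_iterator();

        MCE::Loop->finish;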

        Sample run from a 32-core, 64-thread box:

        I captured the "real" via the UNIX time command. We're dealing with big data. The UNIX time command is helpful to capture the overall time. This includes loading modules, running, and reaping workers. Not to forget, global cleanup in Perl.

        The total time includes post-processing, reading temp-files, writing to data.dat, and unlinking.

Re: Script exponentially slower as number of files to process increases
by tybalt89 (Monsignor) on Jan 27, 2023 at 15:29 UTC

    New theory: the fork is failing with EAGAIN or ERESTARTNOINTR.
    To prove fork is not failing, add the following line immediately before the open my $IN, '<', $infile or exit(0); line.

    defined $pid or die "fork failed $!";

Re: Script exponentially slower as number of files to process increases
by tybalt89 (Monsignor) on Jan 27, 2023 at 01:25 UTC

    BTW, @text is an awfully big array. Think it might cause swapping?

      In my answer, I show how to process the output line by line instead. It didn't change the exponential behaviour, though.

