learnedbyerror has asked for the wisdom of the Perl Monks concerning the following question:
Oh Monks,
The village idiot returneth and seeks your wisdom yet again, especially that of Monk marioroy.
I have a reasonably large dataset of HTML files (+/- 950K files, average size 23K, total size 21 GB) which I need to parse, manipulate, and save as processed output to a simple text file. Given the volume, I decided to take a parallelized approach. For the past several years, I have used MCE for tasks such as this, and I was able to quickly get a working solution up and running that averages 380 seconds of clock time start to finish. This is plenty fast enough to meet my needs. The generated output file is 7.0 GB in size with approximately 25MM rows.
The following is the extracted/condensed code of the meat of the process. In addition to the modules shown below, I also have the Sereal Encoder/Decoder installed. Installing Sereal knocked 40 secs off the initial run time.
```perl
use File::Map qw(map_file);
use MCE;
use MCE::Candy;
use Path::Iterator::Rule;

my $rule     = Path::Iterator::Rule->new->file->name(qr/[.](html)$/);
my $iterator = $rule->iter_fast("topdir");

open my $fh_out, ">:utf8", $self->fn_out;

my $mce = MCE->new(
    gather      => MCE::Candy::out_iter_fh($fh_out),
    max_workers => 'auto',
    user_func   => \&parse_file,
)->spawn;

$mce->process($iterator);

sub parse_file {
    my ( $mce, $chunk_ref, $chunk_id ) = @_;

    map_file my $text, $chunk_ref->[0], '<';

    my (@posts) = $text =~ m/
        \<\!--XXXX:\ (\d+)--\>\<\!--YYYY:\ (\d+).+?
        (?:<\!--AAAAA--\>(.*?)\<\!--\/AAAAA--\>|)
        \<\!--BBBB--\>(.*?)\<\!--\/BBBB--\>.+?
        \<\!--CCCC--\>(.+?)\<\!--\/CCCC--\>
    /msgx;

    # ... do some stuff with @posts and place results in multiline string $output ...

    $mce->gather( $chunk_id, $output );
}
```
But given a little bit of boredom, I decided to investigate just how efficient this is. My first look was at htop while the program was running. As expected, it kicked off 8 forked processes on my Mac with an i7 processor. However, I noticed that none of these was running at capacity; usually they were hitting 60-70% utilization per logical core. This led me to conclude that I probably have an I/O bottleneck somewhere.
I ran the following tests:
Doing the math (380 - 160 - 5 - 15 = 200) leaves me believing that the gap of 200 seconds is the time required to move the data from the child processes back to the parent. This seems large to me. I base this on:
I am in general familiar with the woes of IPC speed. I have other parallelized programs where I had to use something like a finely tuned BerkeleyDB implementation for IPC. Most of those projects were much larger and had a much more complex analytical pattern, oftentimes requiring multiple programs to communicate with each other. But I still wonder whether there is an issue with the IPC between the MCE children and the parent process.
So my question(s) are:
Thanks in advance for the help!
lbe
UPDATE: May 2, 2018 11:12 GMT-6
Now I really feel like the village idiot. I have been running this code and testing against various perl versions, and none of them had finished in less than about 380 seconds ... until this morning. When I read anonymous monk's question regarding threads, I reran the test and it ran in 288 seconds (about 8 seconds above my theoretical number). I thought, wow, I may be onto something. When I went back and tested without threads, meaning using fork, it now ran in 278 seconds. Both of these tests were on perl 5.26.2 with threads compiled. I reran the same test on 5.26.2 without threads compiled and got about the same numbers. Last night, my last run was on 5.27.11 (a dev release). I re-ran both with and without threads and got essentially the same times.
I have checked and compared my current program file with my last commit yesterday, and other than the addition of use threads for the threaded test, there is no difference. I have also validated that nothing has changed in the source and that the generated results match my previous results. There is some bad juju, or maybe I should say good juju given the performance improvement, going on somewhere. I'll continue to run more tests and will update the thread if I learn anything new.
Thank You! to all of you that responded!
lbe
UPDATE2: May 3, 2018 15:21 GMT-6
I'm adding this update, really a reply, here since it is actually in response to several of the threads below, and I thought it would read better here instead of getting lost at too low a level. Thanks to all of you for your additional input. I'm devising some tests to get additional data to help profile duration and throughput, and I will share my findings when complete. It will likely take me a couple of days given my immediate workload.
In the meantime, I have bad news. When I went through the faster tests line by line, I found an error in my code. The actual run times are what I had first posted: all of my benchmarks on perl 5.26.2 run for 358-380 secs. There is still a delay of approximately 100 seconds that I can currently only explain with fear, uncertainty, and doubt. marioroy, I will follow up on your suggestions now that I know this.
One clarification: the read processes are independent of each other. In the execution of the parse_file function, one file is read, analytics are computed, and one multiline gather/print is used to persist the results. This program will never read the same file more than once. File selection order is pseudo-random in that a file is processed in the order returned from Path::Iterator::Rule. The read from the file is, at a high level, a single read, meaning either a file slurp into a string variable or execution of a regexp match against the map_file mapping. I need to investigate block sizes on the SSD and how Apple APFS handles reads, and calculate statistics on the file size distribution, to estimate how efficient or inefficient the SSD may be in what it has to read vs. what it transfers.
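For clarity, here is a minimal sketch of the two single-read styles just described; the helper names are illustrative, not from the production code:

```perl
use File::Map qw(map_file);

# Style 1: slurp the entire file into a scalar with one logical read.
sub read_by_slurp {
    my ($fn) = @_;
    open my $fh, '<', $fn or die "open $fn: $!";
    local $/;                  # disable the input record separator
    my $text = <$fh>;
    close $fh;
    return $text;
}

# Style 2: memory-map the file; pages are faulted in lazily as the
# regexp walks the mapped scalar, so no explicit read() call is made.
sub read_by_map {
    my ($fn) = @_;
    map_file my $text, $fn, '<';
    return \$text;             # copying the scalar would defeat the mapping
}
```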
I previously ran Devel::NYTProf on the single-threaded version and made some modifications to reduce computation time. I also changed from open to map_file, which resulted in a modest time reduction for reading. I do want to test the assertion that things could possibly be slower with map_file when running in parallel.
Unfortunately, I know of no package that can profile this while running at, or nearly at, full speed. As such, I am going to have to instrument the code to record timings and volumes read/written in a way that minimizes impact on the execution profile - yes I know, Heisenberg Uncertainty Principle.
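A per-worker tally along these lines, using Time::HiRes and reported once per worker at shutdown, is the kind of lightweight instrumentation I mean; the %stats plumbing is hypothetical, not the actual code:

```perl
use Time::HiRes qw(gettimeofday tv_interval);

my %stats;   # per-worker counters; each forked worker accumulates its own copy

sub timed_parse_file {
    my ( $mce, $chunk_ref, $chunk_id ) = @_;

    my $t0 = [gettimeofday];
    map_file my $text, $chunk_ref->[0], '<';
    # The mapping is lazy: the real read cost is paid while the regexp
    # walks $text, so the map and the match are timed together.
    # ... regexp match as in parse_file above ...
    $stats{read_and_match_secs} += tv_interval($t0);
    $stats{bytes} += length $text;
    $stats{files}++;

    # ... analytics and gather as before; each worker can report its
    # %stats once at the end, e.g. from MCE's user_end callback ...
}
```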
Lastly, I conducted some read benchmarking using fgrep, which I believe most people will agree is pretty fast at reading data across multiple files in a single process. My methodology is:
NOTE: The machine on which this is running has 16 GB of RAM. The total bytes read in each run is 21 GB, so the file cache is overrun within a single run, eliminating any cache assistance from run to run. I made runs from 8 processes down to 1. I ran the 8-process run twice and threw out the first one to make sure that there was no cache assistance.
The command to accomplish this is:
```
find directory_name -depth -mindepth 3 -maxdepth 3 -print | xargs -L1 -P 8 fgrep -r '.' | pv > /dev/null
```

Where 8 is the number of processes run.
The output of this command looks like:

```
18.6GiB 0:01:43 [ 184MiB/s]
```

The results are:
| CPU Count | Total Throughput Rate (MB/sec) | Per-Processor Throughput Rate (MB/sec) | Per-Processor Efficiency Relative to 1 CPU |
|---|---|---|---|
| 1 | 31 | 31 | 100% |
| 2 | 61 | 31 | 100% |
| 3 | 85 | 28 | 92% |
| 4 | 106 | 27 | 86% |
| 5 | 117 | 23 | 76% |
| 6 | 135 | 23 | 76% |
| 7 | 149 | 21 | 69% |
| 8 | 174 | 22 | 71% |
My initial interpretation of these results is that my code is not I/O bound. With 8 fgrep processes running, the total time to read the files is 103 secs, whereas my best read time is 160 secs when I use 8 processes. I will experiment to see if I can get this reduced any further.
I will update this post in another 2 or 3 days once I have additional information.
Thanks!
lbe
UPDATE3: May 6, 2018 00:30 GMT-6
I have instrumented the code and, surprisingly, don't see any measurable impact on overall run time. Maybe Heisenberg doesn't apply here :). The processor throughputs posted in UPDATE2 are consistent with my new measurements. On this i7, my overall runtime decreases until I reach 8 worker threads. I increased from 9 to 12 workers and saw approximately the same run times as with 8. This is consistent with my expectations, since the i7 has 8 logical cores.
The overall run times are:
* map_file is 15% faster than using a single-line slurp-style read
The breakdown of the run time for the 382 secs above is:
I ran the above tests with everything closed on the Mac and with it disconnected from the network, to minimize any competition for CPU or I/O cycles. All processes ran in memory without swapping to disk. There is 16 GB of RAM in this machine; only 9 GB were in use while the program ran.
The file statistics are:
My interpretation of the data that I have gathered is:
Thanks to all of you who asked questions and provided input. Most special thanks to marioroy for his response and for MCE!
lbe
UPDATE4: May 7, 2018 08:00 GMT-6
Hello marioroy,
I took your advice and created a chunking iterator, and in short have seen a significant improvement. I decided to deviate from PIR for now, cheat, and create an iterator based on the Mac's native find command. The iterator code is:
```perl
use File::Which qw(which);

sub get_dir_file_list_iter {
    my $dir  = shift;
    my $FIND = which 'find';
    my $CMD  = qq/$FIND -L "$dir" -type f -print/;

    # Read file names from a pipe opened to the find command.
    open my $FH, '-|', $CMD or die "Cannot run '$CMD': $!";

    return (
        sub {
            my $chunk_size = shift // 1;
            my @ary;
            while ( my $fn = <$FH> ) {
                chomp $fn;
                push( @ary, $fn );
                last if @ary == $chunk_size;
            }
            return (@ary);
        }
    );
}
```
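Standalone, the iterator can be exercised like this; when MCE drives it, MCE passes its chunk_size as the argument (the directory name here is illustrative):

```perl
my $iter = get_dir_file_list_iter('topdir');

while ( my @files = $iter->(250) ) {    # request up to 250 paths per call
    printf "got %d paths, first is %s\n", scalar(@files), $files[0];
}
```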
Let me try to cut off some of the flames about calling an external program to do something that could be done in pure perl. At this point, I am trying to optimize speed. In the vast majority of my perl development, I color inside the lines; however, when performance is my main concern, I cheat and leverage executables outside of perl that are optimized for a specific role, and find is one of those. I recognize that there are potential problems with unanticipated side effects such as zombie processes, race conditions, etc. In this case, I have decided to accept these risks, as this approach reduces iteration clock time in this app from ~60 seconds to ~20 seconds based upon instrumented timing. In general, I advocate using perlish tools like PIR, File::Find, etc.
My MCE code now looks like:

```perl
use MCE;
use MCE::Candy;

my $iterator = get_dir_file_list_iter($dir);

open my $fh_out, '>', $fn_out;

my $mce = MCE->new(
    gather      => MCE::Candy::out_iter_fh($fh_out),
    chunk_size  => $iter_file_ct,
    max_workers => $max_workers,
    user_func   => \&parse_files,
)->spawn;

$mce->process($iterator);
$mce->shutdown();
```
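One note on the chunked setup: with chunk_size greater than 1, user_func receives an array ref of file names rather than a single path, so parse_files loops over the chunk and gathers one combined string per chunk. A minimal sketch, with the analytics elided as before:

```perl
sub parse_files {
    my ( $mce, $chunk_ref, $chunk_id ) = @_;
    my $output = '';

    for my $fn ( @{$chunk_ref} ) {
        map_file my $text, $fn, '<';
        # ... same regexp match and analytics as parse_file above,
        #     appending this file's rows to $output ...
    }

    # One gather per chunk instead of one per file cuts the number of
    # IPC round trips by roughly the chunk size.
    $mce->gather( $chunk_id, $output );
}
```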
With respect to overall run time, with the find-based iterator and a chunk_size of 250, I am down to ~300 secs from my original ~380 secs. I have not done sufficient testing to validate what contributed to the specific reduction. I have created a shell script to run benchmarks based upon a number of different variations and will update once it completes.
marioroy, I had already been thinking about using MCE::Hobo and MCE::Queue to do something similar to your suggestions in Re^3: MCE: Slow IPC between child and gather process in parent. I will try this variation once the above testing completes
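For the record, the shape I have in mind for that variant is roughly the following. It is a sketch under my assumptions (8 workers, a hypothetical parse_one_file helper, MCE::Shared queues, which I know work across Hobo processes), not a tuned implementation, and it gives up the ordered output that MCE::Candy::out_iter_fh provides:

```perl
use MCE::Hobo;
use MCE::Shared;

my $work_q   = MCE::Shared->queue();   # file names in
my $result_q = MCE::Shared->queue();   # output strings out

# Parser workers: pull file names, push result strings.
my @workers = map {
    MCE::Hobo->create( sub {
        while ( defined( my $fn = $work_q->dequeue ) ) {
            $result_q->enqueue( parse_one_file($fn) );   # hypothetical helper
        }
    } );
} 1 .. 8;

# A single writer drains results so output never piles up in memory.
my $writer = MCE::Hobo->create( sub {
    open my $fh_out, '>', 'out.txt' or die "open: $!";
    while ( defined( my $out = $result_q->dequeue ) ) {
        print {$fh_out} $out;
    }
} );

# Feed the work queue from the iterator, then shut down in order.
my $iter = get_dir_file_list_iter('topdir');
while ( my @files = $iter->(250) ) { $work_q->enqueue(@files) }
$work_q->end();                        # workers see undef once drained
$_->join for @workers;
$result_q->end();                      # writer sees undef once drained
$writer->join;
```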
Thanks for your guidance and willingness to help!
lbe
UPDATE5: May 7, 2018 23:00 GMT-6
OK, I ran some benchmarks today. My observations based upon them are:
I think this may be about as good as I am going to get using this system unless I can find a way to read the data from the disk faster. The read throughput is less than half of what I was able to achieve with 8 processes running recursive fgrep.
At this point, I am going to close my testing insofar as updating this thread. I don't think I will get much more speed out of using MCE::Hobo and MCE::Queue, though I will give it a try. I'll also perform some additional benchmarking on reading to see if other options like File::Slurper or sysopen/sysread help.
Thanks to all for your comments and advice, and a special thanks to marioroy for piping in with guidance on MCE.
lbe