comment on

Oh Monks,

The village idiot returneth and seeks your wisdom yet again especially that of Monk Marioroy

I have a reasonably large dataset of html files ( +/- 950K files, average size 23K, total size 21G) which I need to parse, manipulate and save the processed output to a simple text file. Given the volume, I decided to take a parallelized approach to this. For the past several years, I have used MCE for tasks such as this. I was able to quickly get a working solution up and running that averages 380 seconds of clock time start to finish. This is plenty fast enough to meet my needs. The generated output file is 7.0 GB in size with approximately 25MM rows

The following is the extracted/condensed code of the meat of the process. In addition to the modules show below, I also have Sereal Encoder/Decoder installed. Installation of Sereal knocked 40 secs of the initial run time.

use File::Map qq(map_file);
use MCE;
use MCE::Candy;
use Path::Iterator::Rule;

my $rule = Path::Iterator::Rule->new->file->name(qr/[.](html)$/);
my $iterator = $rule->iter_fast("topdir");

open my $fh_out, ">:utf8", $self->fn_out;

my $mce = MCE->new(
    gather      => MCE::Candy::out_iter_fh($fh_out),
    max_workers => 'auto',
    user_func   => \&parse_file,
)->spawn;
$mce->process($iterator);

sub parse_file {
    my ( $mce, $chunk_ref, $chunk_id ) = @_;
    map_file my $text, $chunk_ref->[0], '<';
    my (@posts) = $text =~ m/
        \<\!--XXXX:\ (\d+)--\>\<\!--YYYY:\ (\d+).+?
        (?:<\!--AAAAA--\>(.*?)\<\!--\/AAAAA--\>|)
        \<\!--BBBB--\>(.*?)\<\!--\/BBBB--\>.+?
        \<\!--CCCC--\>(.+?)\<\!--\/CCCC--\>
    /msgx;

    …  Do some stuff with posts and place 
       results in multiline string $output

$mce->gather( $chunk_id, $output );
}
[download]

But given a little bit of boredom, I decided to investigate to see just how efficient this is. My first look was at htop when the program is running. As expected, it kicked off 8 forked processes on my Mac with an i7 proc. However, I noticed that none of these were running at capacity. Usually they were hitting at 60 – 70% utilization per logical core. This led me to conclude that I probably have an I/O bottleneck somewhere

I ran the following tests

Immediately following the map_file call, I added: my $t = $text; return; to allow me to see how quickly I could read all of the data. This finished consistently in 160 secs.
Similarly, I placed a return; just before the gather call to let me see the read and processing total time. Surprisingly, this only added 5 seconds for a total of 165 secs. I gained even more respect for the perl Regexp engine and its efficiency of working with a memory mapped file.
Wrote a simple program to generate 7GB of data with similar characteristics of the file referenced above and wrote this to disk. This ran consistently at 15 seconds

Doing the math ( 380 – 160 – 5 – 15 = 200 ) leaves me believing that the gap of 200 seconds is the time required to move the data from the child process back to the parent. This seems large to me. I based this on:

I can read 21GB from disk, an SSD, in 160 secs.
I can write 7GB to disk, same SSD, in 15 secs.
Shouldn’t an in-memory transfer be closers to 15 seconds that 160?

I am in general familiar with the woes of IPC speed. I have other parallelized programs that I have written where I had to use something like a finely tuned BerkeleyDB implementation for the purpose of IPC. Most of those projects were much larger and had a much more complex analytical pattern often times requiring multiple programs to communicate with each other. But, I still wonder whether there is an issue with the IPC between the MCE child and parent processes.

So my question(s) are:

Is this as good as it gets for MCE?
Should I create an additional process with MCE::Hobo and use MCE::Queue to move the data from the child to the file writer?
Am I missing something?

Thanks in advance for the help!

lbe

UPDATE: May 2, 2018 11:12 GMT-6

Now I really feel like the village idiot. I have been running this code and testing against various perl versions and none of them have finished in less than about 380 seconds ... until this morning. When I read anonymous monk's question regarding threads, I reran the test and it ran in 288 seconds (about 8 seconds above my theoretical number above. I thought wow, I may be onto something. When I went back and tested without threads, meaning using fork, it now ran in 278 seconds. Both of these test were on perl 5.26.2 with threads compiled. I reran the same test on 5.26.2 without threads compiled and got about the same numbers. Last night, my last run was on 2.27.11 (a dev release). I re-ran both with and without threads and got essentially the same time.

I have checked and compared my current program file with the a my last commit yesterday and other than the addition of use threads for the the threaded test, there is no difference. I have also validated that nothing has changed in the source and that the generated results match my previous results. There is some bad juju, or maybe I should say good juju with the performance improvement, going on somewhere. I'll continue to run more tests and will update the thread if I learn anything new

Thank You! to all of you that responded!

lbe

UPDATE2: May 3, 2018 15:21 GMT-6

I'm adding this update, really reply, here since it actually is in response to several of the threads below and thought it would read better instead of getting lost at too low of a level. Thanks to all of you for your additional input. I'm devising some tests to get some additional data to help profile duration and throughput. I will share my findings when I complete. It will likely take me a couple of days to complete given immediate workload.

In the mean time, I have bad news. When I went through the faster tests line by line, I found an error in my code. They actual run times are what I had first posted. All of my benchmarks on perl 5.26.2 run for 380 - 358 secs. There is still a delay of approximately 100 seconds that I can currently only explain with fear, uncertainty and doubt. marioroy I will follow up on your suggestions now that I know this

One clarification, the read processes are independent of each other. In the execution of the process_file function, one file is read, analytics computed and one multiline gather/print is use to persist the results. This program will never read the same file more than once. File selection order is pseudo-random in that a file is processed in the order returned from Path::Iterator::Rule. The read from the file is at the high level a single read - meaning either a file slurp into a string variable or execution of a regexp match against the map_file. I need to investigate block sizes on the SSD and how Apple APFS handles reads and calculate statistics on file size distribution to estimate how efficient or inefficient the SSD may be in what it has to read vs. what it transfers.

I have previously run Devel::NYTProf on the single threaded version and made some modifications to reduce the computation time. I also changed from open to map_file which resulted in a modest time reduction for reading. I do want to test the assertion that things could possibly be slower with map_file when running in parallel.

Unfortunately, I know of no package that can profile this while running at or nearly at speed. As such, I am going to have to instrument the code to record timings and volumes read/written in a way that minimizes impact on the execution profile - yes I know, Heisenberg Uncertainty Principle.

Lastly, I conducted some read benchmarking using fgrep which I believe most people will agree is pretty fast at reading data across multiple files in a single process. My methodology is:

Create a list of directories 3 levels below the top of the directory structure of interest. This list contains 1,107 entries.
Pipe this list into xargs using -P to control the number of forked processes running at a time. This will form more process than the MCE approach, but this should be negligible within the overall runtime
call fgrep -r '.' level_3_directory. fgrep will recurse down through the tree and cat the file contents to STDOUT
Pipe the output from all of the threads to pv to record the time and throughput

NOTE: The machine on which this running is have 16GB of RAM. The total bytes read in each run is 21.GB. So the file cache will be overrun in a single run eliminating any cache assistance from run to run. I ran made runs from 8 processors down to 1. I ran the 8 processor run twice and through the first one to make sure that there was no cache assistance

The command to accomplish this is:

find directory_name-depth -mindepth 3 -maxdepth 3 -print | xargs -L1 -P 8 fgrep -r '.' | pv >/dev/null

Where 8 is the number of processes run.

The output of this command looks like

18.6GiB 0:01:43 [ 184MiB/s]

The results are:

CPU Count	Total Throughput Rate MB/sec	Per Processor Throughput Rate (MB/sec)	Per Processor efficiency relative to 1 CPU
1	31	31	0%
2	61	31	100%
3	85	28	92%
4	106	27	86%
5	117	23	76%
6	135	23	76%
7	149	21	69%
8	174	22	71%

My initial interpretation of these results is that my code is not IO bound. With 8 fgrep processes running, the total time to read the files is 103 secs. Whereas my best read time is 160 secs. when I use 8 processes. I will experiment to see if I can get this reduced any further.

I will update this post in another 2 or 3 days once I have additional information

Thanks!

lbe

UPDATE3: May 6, 2018 00:30 GMT-6

I have instrumented the code and surprisingly don’t see any measurable impact on overall run time. Maybe, Heisenberg doesn’t apply here :). The processor throughputs posted in Update2 are consistent with my new measurements. On this i7, my overall runtime decreases until I reach 8 worker threads. I increased from 9 – 12 workers and saw approximately the same run times as with 8 threads. This is consistent with my expectations since the i7 has 8 logical cores.

The overall run times are:

382 secs – MCE, 8 workers, read with map_file
1,822 secs – single process, read with map_file

* map_file is 15% faster than using a single line slurp-eaze read

The breakdown of the run time for the 382 secs above is:

48% - Read
13% - Calculations
4% - Write
20% - Path::Iterator::Rule
15% - Overhead – MCE, context switches …

I ran the above tests with everything closed on the Mac and with it disconnected from the network to minimize any competition for CPU or I/O cycles. All processes ran in memory without swapping to disk. There is 16 GB of RAM in this machine, only 9 GB were in use while the program ran.

The files statistics are:

File Count = 949,670
Min File Size = 189 B
Average File Size = 22,178 B
Max File Size = 985,373 B
Std Dev of Files Size = 23,262 B
80% of the files are < 30,000 B
90% of the files are < 50,000 B

My interpretation of the data that I have gathered is:

When running a single process, the limiting factor is the alternating read/write I/O
When using MCE, overall run time reduces and aggregate I/O increases until the number of worker processes equals the number of virtual cores
The impact of time required for IPC is less than 10% (38 secs) of the overall run time. This is much more than offset by the 79% (1,440) second reduction in overall run time.
map_file is approximately 15 % faster than PerlIO for this directory structure and file size distribution
Another approximate 12% reduction in run time can be saved by using the unix find command to prefetch the names of the files of interest instead of using Path::Iterator::Rule
Overall, I am satisfied that this set of code is reasonably optimized as it exists.
Further improvement in run time would require moving to something like BerkelyDB and possibly using Sereal with compression enabled to reduce disk I/O and eliminate nearly one million file opens and closes.

Thanks to all of you who asked questions and provided input. Most special thanks to marioroy for his response and for MCE!

lbe

UPDATE4: May 7, 2018 08:00 GMT-6

Hello marioroy,

I took you advice and created a chunking iterator and in short have significant improvement. I decided to deviate from PIR for now and cheat and create an iterator based on the Mac’s native find command. The iterator code is:

Use File::Which qw(which);

sub get_dir_file_list_iter {
    my $dir = shift;

    my $FIND = which 'find';
    my $CMD = qq/$FIND -L "$dir" -type f -print/;

    return (
        sub {
            my $chunk_size = shift // 1;
            my @ary;
            while ( my $fn = <$FH> ) {
                chomp $fn;
                push( @ary, $fn );
                last if @ary == $chunk_size;
            }
            return (@ary);
        }
    );
}
[download]

Let me try to cutoff some of the flames on calling an external to do something that could be done with pure perl. At this point, I am trying to optimize speed. In the vast majority of my perl development, I color inside of the lines; however, at times when performance is my main concern, I cheat and leverage executables outside of perl that are optimized for a specific role. find is one of those. I recognize that there are potential problems with unanticipated side effects such as zombie processes, race conditions … In this case, I have decided to accept these risks as this approach reduces iteration clock time in this app from ~60 seconds to ~20 seconds based upon instrumented timing. In general, I advocate using perlish tools like PIR, File::Find …

My MCE code now looks like

use MCE;
use MCE::Candy;

my $iterator = get_dir_file_list_iter($dir);
open my $fh_out, ">, $ fn_out;
my $mce = MCE->new(
    gather      => MCE::Candy::out_iter_fh($fh_out),
    chunk_size  => $iter_file_ct,
    max_workers => $max_workers,
    user_func   => \&parse_files,
)->spawn;
$mce->process($iterator);
$mce->shutdown()
);
[download]

With respect to overall run time, with the find base iterator and a chunk_size of 250, I am down to ~300 secs. from my original ~380 secs. I have not done sufficient testing to validate what contributed to the specific reduction. I have created a shell script to run benchmark based upon a number of different variations and will update once it completes

marioroy, I had already been thinking about using MCE::Hobo and MCE::Queue to do something similar to your suggestions in Re^3: MCE: Slow IPC between child and gather process in parent. I will try this variation once the above testing completes

Thanks for you guidance and willingness to help!

lbe

UPDATE5: May 7, 2018 23:00 GMT-6

OK, ran some benchmarks today. My observation based upon them are:

380 sec – My base time before engaging in this optimization exercise
360 sec – 20 sec improvement when I change the read process from open to File::Map's map_file
325 sec – 35 sec improvement when I moved from using PIR to using my custom iterator based upon the Mac’s find binary
290 sec – 35 sec improvement when implementing chunking. For this machine, code and data, the sweet spot is around 500 file names per chunk. This results in each gather call transferring ~ 4MB of data as opposed to the 1 file chunk of ~ 8KB
270 sec - the average clock time spent in each process processing files. The overhead for MCE IPC is down to 20 secs (290-270)
210 sec - the average time each process spends reading the file into memory and applying the first regexp. This is about 40 secs slower than my map file speed test decribed above.
74 MB/sec - the average read spead (total file size / elapsed time)

I think this may be about as good as I am going to get using this system unless I can find a way to read the data from the disk faster. The read throughput is less than half of what I was able to achieve with 8 processes runing recursive fgrep.

At this point, I am going to close my testing insofar as updating in this thread. I don't think I will get much more speed out of using the MCE::Hobo and MCE::Queue, though I will give it a try. I'll also perform some additional benchmarking on reading to see if other options like File::Slurper, sysopen/sysread ...

Thanks to all for your comments and advice and a special thanks to marioroy for piping in with guidance on MCE

lbe

In reply to MCE: Slow IPC between child and gather process in parent by learnedbyerror

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.