gravid has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I have this code:

my %hash;
foreach my $dir (@dirs) {
    if (-e "$dir/syn.log") {
        process_1_file($dir, "$dir/syn.log");
    }
}

sub process_1_file {
    my ($dir, $file) = @_;
    open(my $fh, "<", $file) or die "cannot open $file $!";
    while (<$fh>) {
        chomp;
        my $line = $_;
        if ($line =~ /^area=(.*)/) {
            $hash{"$dir"}{"area"} = $1;
        }
    }
}

I want to parallelise the file processing.

Something like having a thread for each $dir, while all the threads update %hash.

How can I do that?

Thx

Guy

Re: hashes & threads
by Eily (Monsignor) on Jul 28, 2016 at 14:04 UTC

    Using threads together with threads::shared (to share the hash) might help, or it might make things worse.

    If your problem is that your script runs too slowly because you process a lot of files, or big files, first check whether that time is spent mostly reading or mostly processing. Just remove the content of the while loop, write while(<$fh>) {}, and see if the script runs significantly faster. If it does not, the bottleneck is reading the files, and chances are multiple threads won't be able to do anything about it: if all those folders are on the same device, all your threads will have to wait for that device to be available, so in the best case they will just run in sequence (and in the worst case they will force the device to jump back and forth between files).
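    For concreteness, here is a minimal sketch of that read-only test (the directory list is a placeholder, and the timing uses the core Time::HiRes module):

        use strict;
        use warnings;
        use Time::HiRes qw(gettimeofday tv_interval);

        my @dirs = qw( a b c d e f );   # placeholder list, as in the original post
        my $t0   = [gettimeofday];

        for my $dir (@dirs) {
            open my $fh, '<', "$dir/syn.log" or next;
            while (<$fh>) { }           # read only, no processing
            close $fh;
        }

        printf "read-only pass: %.3fs\n", tv_interval($t0);

    If this wall time is close to the full script's runtime, the job is IO-bound and threads are unlikely to help.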

    You may be able to save a little time by not reading line by line, but going directly to the first occurrence of "area=" by setting the input record separator:

    {                       # block to limit the effect of local
        local $/ = "\narea=";
        <$fh>;              # read until just past the first "\narea="
    }
    # here we are back to reading until the end of the line
    if (not eof $fh) {      # the end of the file wasn't reached while trying to find "area="
        chomp($hash{$dir}{area} = <$fh>);   # read the rest of the line (chomp strips the trailing newline)
    }

    Edit: it seems there can only be one "area=" in each file, so I have removed the outer loop. If you keep your current code, you can stop reading the file as soon as you have found a result with last (see the sketch below). Also, note that my example would miss an "area=" line if it is the very first line of the file.
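    For instance, keeping the original line-by-line loop, the early exit would look like this (a sketch based on the code in the question):

        while (<$fh>) {
            if (/^area=(.*)/) {
                $hash{$dir}{area} = $1;
                last;    # only one "area=" per file, so stop reading here
            }
        }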

      Hi,

      I did the check you suggested, and I see a fast runtime with the empty while {} loop, so threads will not help here, unless I submit a grid job from each thread.

      I wonder if a job can be just a subroutine, or if it has to be a program.

      Guy

        If by running the empty while loop (but still reading the files) your script runs significantly faster, threads can make a difference (because it's not just time spent waiting for the file to be read). If it does not run faster, parallelisation of any sort is not going to be any more efficient, because all your parallel threads or processes are going to wait their turn to access the device.

        If, unlike what I expected, processing the file takes more time than reading it, you should try to find out why, because the regular expression should be quite efficient: the ^ anchor makes it fail after checking only a few characters on a non-matching line. Maybe returning from the function as soon as you have found the "area=" line would make a significant difference...

        As a general rule, you should try to find exactly which part of the program is slowing it down (benchmark it) and why before you try to come up with solutions. Overall optimization often is a bad idea, because most of the program runs so ridiculously fast compared to the slowest part that not keeping the simpler version is a waste of development time.
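        As a starting point, here is a minimal sketch with the core Benchmark module, comparing a read-only pass against a pass that also runs the match (the file path is hypothetical):

            use strict;
            use warnings;
            use Benchmark qw(cmpthese);

            my $file = 'a/syn.log';    # hypothetical test file

            cmpthese( -3, {            # run each candidate for at least 3 CPU seconds
                read_only => sub {
                    open my $fh, '<', $file or die "cannot open $file: $!";
                    while (<$fh>) { }
                },
                read_and_match => sub {
                    open my $fh, '<', $file or die "cannot open $file: $!";
                    while (<$fh>) { last if /^area=(.*)/ }
                },
            } );

        If the two rates come out nearly identical, the regular expression costs almost nothing and the time is going to IO.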

Re: hashes & threads
by NetWallah (Canon) on Jul 28, 2016 at 14:02 UTC
    You could adapt the "kid" subroutine in the code from this node to do your bidding. (Uses Thread::Queue).
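    The node itself is not reproduced here, but the general Thread::Queue worker ("kid") pattern looks roughly like this (a sketch; the directory list and the processing body are placeholders):

        use strict;
        use warnings;
        use threads;
        use Thread::Queue;

        my @dirs = qw( a b c d e f );    # placeholder, as in the original post
        my $q    = Thread::Queue->new;

        sub kid {
            # pull directory names until the queue is ended and drained
            while (defined( my $dir = $q->dequeue )) {
                # ... process "$dir/syn.log" here ...
            }
        }

        my @workers = map { threads->create(\&kid) } 1 .. 4;
        $q->enqueue(@dirs);
        $q->end;       # no more work: pending dequeue calls return undef once the queue drains
        $_->join for @workers;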

            "Software interprets lawyers as damage, and routes around them" - Larry Wall

Re: hashes & threads
by Marshall (Canon) on Jul 28, 2016 at 20:45 UTC
    This thread might have some relevance: Perl's poor disk IO performance. Also, wild as it may seem, launching a fast grepper might speed things up by feeding Perl only the lines matching "^area". There is a lot of overhead in that approach, but with huge files it might be worth it? I don't know; benchmarking is needed. If the problem is slow Perl line-by-line I/O, threads won't help.
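    A sketch of that idea, opening a grep pipe per directory (this assumes a Unix-like system with grep on the PATH):

        use strict;
        use warnings;

        my %hash;
        my @dirs = qw( a b c d e f );    # placeholder, as in the original post

        for my $dir (@dirs) {
            next unless -e "$dir/syn.log";
            # grep scans the file in C; Perl sees only the matching lines
            open my $fh, '-|', 'grep', '^area=', "$dir/syn.log"
                or die "cannot start grep: $!";
            while (<$fh>) {
                chomp;
                $hash{$dir}{area} = $1 if /^area=(.*)/;
            }
            close $fh;
        }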
Re: hashes & threads
by marioroy (Prior) on Aug 06, 2016 at 05:06 UTC

    gravid, welcome to the monastery.

    The following provides four demonstrations. The MCE modules also support Perl builds without threads support. The fourth demonstration ensures graceful IO from shared storage or from a network-based file system; in that case, MCE does sequential IO among workers. Parallel IO is possible by adding the MCE option parallel_io => 1, which lets workers read simultaneously (random IO).

    threads and threads::shared

    use strict;
    use warnings;

    use threads;
    use threads::shared;

    my @dirs = qw( a b c d e f );
    my %hash = map { $_ => shared_clone({}) } @dirs;
    my @thrs;

    foreach my $dir (@dirs) {
        if (-e "$dir/syn.log") {
            push @thrs, threads->create(\&process_1_file, $dir, "syn.log");
        }
    }

    $_->join for @thrs;

    print $hash{"a"}{"area"}, "\n";

    sub process_1_file {
        my ($dir, $file) = @_;
        open(my $fh, "<", "$dir/$file") or die "cannot open $dir/$file $!";
        while (<$fh>) {
            chomp;
            if ($_ =~ /^area=(.*)/) {
                $hash{"$dir"}{"area"} = $1;
            }
        }
    }

    MCE::Hobo and MCE::Shared

    use strict;
    use warnings;

    use MCE::Hobo;
    use MCE::Shared;

    my @dirs = qw( a b c d e f );
    my %hash = map { $_ => MCE::Shared->hash } @dirs;

    foreach my $dir (@dirs) {
        if (-e "$dir/syn.log") {
            MCE::Hobo->create(\&process_1_file, $dir, "syn.log");
        }
    }

    MCE::Hobo->waitall;

    print $hash{"a"}{"area"}, "\n";

    sub process_1_file {
        my ($dir, $file) = @_;
        open(my $fh, "<", "$dir/$file") or die "cannot open $dir/$file $!";
        while (<$fh>) {
            chomp;
            if ($_ =~ /^area=(.*)/) {
                $hash{"$dir"}{"area"} = $1;
            }
        }
    }

    MCE::Loop and MCE::Shared

    use strict;
    use warnings;

    use MCE::Loop;
    use MCE::Shared;

    my @dirs = qw( a b c d e f );
    my %hash = map { $_ => MCE::Shared->hash } @dirs;

    MCE::Loop->init( max_workers => $#dirs, chunk_size => 1 );

    mce_loop {
        my $dir = $_;
        process_1_file($dir, "syn.log") if (-e "$dir/syn.log");
    } @dirs;

    print $hash{"a"}{"area"}, "\n";

    sub process_1_file {
        my ($dir, $file) = @_;
        open(my $fh, "<", "$dir/$file") or die "cannot open $dir/$file $!";
        while (<$fh>) {
            chomp;
            if ($_ =~ /^area=(.*)/) {
                $hash{"$dir"}{"area"} = $1;
            }
        }
    }

    MCE and MCE::Shared

    use strict;
    use warnings;

    use MCE;
    use MCE::Shared;

    my @dirs = qw( a b c d e f );
    my %hash = map { $_ => MCE::Shared->hash } @dirs;

    my $mce = MCE->new(
        max_workers => 4,
        chunk_size  => '12m',
        use_slurpio => 1,
        user_func   => sub {
            my ($mce, $slurp_ref, $chunk_id) = @_;
            my $dir = $mce->user_args()->[0];
            open my $MEM_FH, "<", $slurp_ref;
            while (<$MEM_FH>) {
                chomp;
                if ($_ =~ /^area=(.*)/) {
                    $hash{"$dir"}{"area"} = $1;
                }
            }
            close $MEM_FH;
        }
    )->spawn;

    foreach my $dir (@dirs) {
        if (-e "$dir/syn.log") {
            $mce->process({ user_args => [ $dir ] }, "$dir/syn.log");
        }
    }

    $mce->shutdown;

    print $hash{"a"}{"area"}, "\n";
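    Per the note above, this fourth demonstration switches from sequential to parallel reads by adding the parallel_io option to the same constructor (everything else stays as shown):

        my $mce = MCE->new(
            max_workers => 4,
            chunk_size  => '12m',
            use_slurpio => 1,
            parallel_io => 1,             # workers read simultaneously (random IO)
            user_func   => sub { ... },   # same reader as above
        )->spawn;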

    Warm regards. Mario

Re: hashes & threads
by perlfan (Parson) on Jul 28, 2016 at 16:15 UTC
    I hate recommending non-Perl solutions, but Perl really isn't good for this. For true threading in an environment that is familiar to Perl programmers, I'd like to recommend looking at Qore.