http://qs1969.pair.com?node_id=1131653


in reply to Re^2: How to split file for threading?
in thread How to split file for threading?

File is remote, but I think everything I do is remote since I have to log in to the server

The question is whether the file is local or remote to the processor running the program, not to the human who starts it.

You could try this, but I doubt that multiple threads will be any faster than one unless the file is on a fast, local SSD:

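    #! perl -slw
    use strict;
    use threads;

    sub worker {
        my( $filename, $target, $start, $end ) = @_;
        open my $fh, '<', $filename or die $!;
        if( $start > 0 ) {
            ## Back up one byte and discard through the next newline. This skips the
            ## partial line that belongs to the previous worker, but keeps a line
            ## that begins exactly at $start.
            seek $fh, $start - 1, 0;
            <$fh>;
        }
        my $count = 0;
        ## Count lines containing $target until we have read past the end of our chunk.
        1+index( <$fh>, $target ) and ++$count while tell( $fh ) < $end;
        return $count;
    }

    our $T //= 4;   ## number of threads; override with -T=n

    my( $filename, $target ) = @ARGV;

    my $fsize = -s $filename;
    my $chunksize = int( $fsize / $T );

    ## Start offsets of each chunk, plus the file size as the final end offset.
    my @chunks = map{ $_ * $chunksize } 0 .. $T-1;
    push @chunks, $fsize;

    my @threads = map{
        threads->new( \&worker, $filename, $target, $chunks[ $_ ], $chunks[ $_+1 ] )
    } 0 .. $T-1;

    ## join returns each worker's count.
    my $total = 0;
    $total += $_->join for @threads;

    print "Found $total '$target' lines";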

Usage:

thisScript.pl -T=n theFile.txt "the string"

Note: The count is printed to stdout. Redirect it if you need it in a file.
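The byte-range splitting can be checked without threads at all. The following is a minimal single-process sketch, assuming a plain text file with "\n" line endings; the helper names (chunk_offsets, count_in_range, count_by_ranges) are made up for illustration and are not part of the script above. It seeks to one byte before each range start so that a line beginning exactly on a boundary is counted once, by the range it starts in.

```perl
use strict;
use warnings;

# Hypothetical helper: byte offsets that split $fsize into $n ranges.
# Returns $n start offsets followed by $fsize as the final end offset.
sub chunk_offsets {
    my( $fsize, $n ) = @_;
    my $chunksize = int( $fsize / $n );
    return ( ( map { $_ * $chunksize } 0 .. $n - 1 ), $fsize );
}

# Hypothetical helper: count the lines containing $target among the
# lines that start at an offset in [$start, $end).
sub count_in_range {
    my( $filename, $target, $start, $end ) = @_;
    open my $fh, '<', $filename or die "$filename: $!";
    if( $start > 0 ) {
        # Read from byte $start-1 to the next newline and throw it away.
        # If that byte is a newline, only the newline itself is discarded.
        seek $fh, $start - 1, 0;
        my $discard = <$fh>;
    }
    my $count = 0;
    while( tell( $fh ) < $end ) {
        my $line = <$fh>;
        last unless defined $line;
        ++$count if index( $line, $target ) >= 0;
    }
    close $fh;
    return $count;
}

# Hypothetical helper: process the ranges one after another and add up
# the per-range counts, as the threaded script does with join.
sub count_by_ranges {
    my( $filename, $target, $n ) = @_;
    my $fsize   = -s $filename;
    my @offsets = chunk_offsets( $fsize, $n );
    my $total   = 0;
    $total += count_in_range( $filename, $target, $offsets[ $_ ], $offsets[ $_ + 1 ] )
        for 0 .. $n - 1;
    return $total;
}
```

Calling count_by_ranges( 'theFile.txt', 'the string', $n ) for several values of $n should give the same total as a plain line-by-line scan, which is a quick way to confirm that no line is dropped or counted twice at a boundary.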


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
I'm with torvalds on this Agile (and TDD) debunked I told'em LLVM was the way to go. But did they listen!

Re^4: How to split file for threading?
by diyaz (Beadle) on Jun 23, 2015 at 16:48 UTC
    Thank you very much. I like to learn by example, and this is great. I'll be honest: I'm still trying to understand how the code works, as I am very much a novice at Perl and at programming in general. What do these lines mean?
    1+index( <$fh>, $target ) and ++$count while tell( $fh ) < $end;
    Is this reading the file line by line, returning the line number where $target is found, and increasing $count every time $target is found? Does the tell function return the index position just to check for end of file? Why couldn't you use while (<$fh>)?
    our $T //= 4;
    I couldn't really google two forward slashes, so I don't know what this means.
    my @threads = map{ threads->new( \&worker, $filename, $target, $chunks[ $_ ], $chunks[ $_+1 ] ) } 0 .. $T-1;
    OK, map will transform arrays, but here it seems like it is assigning 3 threads to @threads? Does threads->new() create a new thread while passing it the worker sub and 4 variables? Or is it creating a worker sub and passing those variables to worker?
    my $total = 0; $total += $_->join for @threads;
    Does this concatenate the thread results? I don't see where or when the lines with $target were passed back and stored. Thanks again.
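The idioms asked about can be tried in isolation. This is a small sketch, not taken from the thread: count_matching_lines is a hypothetical helper that works on a list of strings instead of a filehandle, so it only illustrates the `//=` operator and the `1+index(...) and ++$count` trick.

```perl
use strict;
use warnings;

# //= is "defined-or assignment" (Perl 5.10 and later):
# assign only when the left-hand side is undef.
my $workers;        # undef, like $T when no -T=n switch is given
$workers //= 4;     # now 4
my $given = 0;
$given //= 4;       # still 0, because 0 is a defined value

# index() returns -1 when the substring is absent, so 1+index(...) is
# 0 (false) for a miss and at least 1 (true) for a hit, including a
# hit at position 0. It yields a position in the string, not a line number.
sub count_matching_lines {      # hypothetical helper for illustration
    my( $target, @lines ) = @_;
    my $count = 0;
    1 + index( $_, $target ) and ++$count for @lines;
    return $count;
}

print count_matching_lines( 'cat', "cat\n", "dog\n", "concat\n" ), "\n";   # prints 2
```

In the threaded script the same expression is driven by `while tell( $fh ) < $end` rather than `while (<$fh>)`, because each worker has to stop at the end of its own chunk instead of reading to end of file. Each worker returns only its count, and `join` hands that return value back to the main thread, so the sums are numeric additions and no matching lines are ever stored.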

      Sorry, but I don't think you are ready to do multitasking yet.


        Speaking of judgement ... when do you retire?
Re^4: How to split file for threading?
by marioroy (Prior) on Jun 26, 2015 at 17:22 UTC

    Update: Added chunk_size option to the example.

    Update: Added note on one worker reading at any given time.

    The example by BrowserUk is awesome. Perhaps you have to work with a Perl that was not compiled with threads, e.g. when running Perl on Solaris. A pattern counter can be written with MCE using the following code. The usage is the same for the two scripts.

    This is safe from random I/O thrashing. Reads are sequential rather than random across workers, because only one worker reads at any given time.

    #! perl -slw
    use strict;
    use warnings;
    use MCE::Loop;

    our $T //= 'auto';

    my ( $file_name, $target ) = @ARGV;

    MCE::Loop::init {
        use_slurpio => 1,
        max_workers => $T,
        chunk_size  => 1024 * 1024 * 16,
    };

    my @result = mce_loop_f {
        my ( $mce, $slurped_chunk, $chunk_id ) = @_;
        my $count = 0;
        ## $target is used as a regex here and every match is counted,
        ## so a line containing two matches adds 2.
        $count++ while ( $$slurped_chunk =~ /$target/g );
        MCE->gather($count);
    } $file_name;

    my $total = 0;
    $total += shift @result while @result;

    print "Found $total '$target' lines";

    Usage:

    mce_script.pl -T=n theFile.txt "the string"
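To make the sequential, chunk-at-a-time read pattern concrete without installing MCE, here is a single-process sketch in core Perl. It is not MCE's implementation, and for_each_chunk and count_matches are hypothetical names; it only shows the idea of reading a fixed number of bytes, extending the chunk to the end of the current line, and counting matches in the slurped chunk.

```perl
use strict;
use warnings;

# Hypothetical helper: call $callback with a reference to each chunk of
# roughly $chunk_size bytes, extended so that it ends on a line boundary.
sub for_each_chunk {
    my( $file_name, $chunk_size, $callback ) = @_;
    open my $fh, '<', $file_name or die "$file_name: $!";
    binmode $fh;
    while( read( $fh, my $chunk, $chunk_size ) ) {
        # Finish the line the read stopped in, unless it ended on a newline.
        if( substr( $chunk, -1 ) ne "\n" ) {
            my $rest = <$fh>;
            $chunk .= $rest if defined $rest;
        }
        $callback->( \$chunk );
    }
    close $fh;
}

# Hypothetical helper: count regex matches across all chunks of the file.
sub count_matches {
    my( $file_name, $target, $chunk_size ) = @_;
    my $total = 0;
    for_each_chunk( $file_name, $chunk_size, sub {
        my( $chunk_ref ) = @_;
        $total++ while $$chunk_ref =~ /$target/g;
    } );
    return $total;
}
```

Because every chunk ends at a line boundary, a pattern that matches within a single line is never split across two chunks, so the total should not depend on the chunk size chosen.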